<a href="https://colab.research.google.com/github/khadkechetan/information_extraction/blob/main/NL2SQL/microsoft_phi3_finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install Required Dependencies**

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U xformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U einops
!pip install -q -U nvidia-ml-py3
!pip install -q -U huggingface_hub

**Load the Dataset**

In [None]:
from datasets import load_dataset
#
dataset = load_dataset("b-mc2/sql-create-context")
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context'],
        num_rows: 78577
    })
})

**Format The Dataset**

In [None]:
def create_prompt(sample):
  system_prompt_template = """<s>
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :<<user_question>>
### Database Schema:
<<database_schema>>
### Response:
<<user_response>>
</s>
"""
  user_message = sample['question']
  user_response = sample['answer']
  database_schema = sample['context']
  # Fill the placeholders; the schema is followed by a single trailing space
  prompt_template = (
      system_prompt_template
      .replace("<<user_question>>", user_message)
      .replace("<<user_response>>", user_response)
      .replace("<<database_schema>>", f"{database_schema} ")
  )

  # datasets.map() adds the returned key as a new "inputs" column
  return {"inputs": prompt_template}

#
instruct_tune_dataset = dataset.map(create_prompt)
print(instruct_tune_dataset)

DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context', 'inputs'],
        num_rows: 78577
    })
})
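
The fine-tuned model only sees prompts in the layout produced by `create_prompt`, so it can help to build training and inference prompts from one shared function. The following is a minimal sketch rather than the notebook's own code: `build_prompt` is a hypothetical helper, and the wording is copied from the template above.

```python
def build_prompt(question, schema, answer=None):
    """Hypothetical helper: training prompt if `answer` is given, else inference prompt."""
    prompt = (
        "<s>\n"
        "Below is an instruction that describes a task."
        "Write a response that appropriately completes the request.\n"
        f"### Instruction :{question}\n"
        "### Database Schema:\n"
        f"{schema} \n"
        "### Response:\n"
    )
    if answer is not None:
        # Training rows end with the target SQL and the closing tag
        prompt += f"{answer}\n</s>\n"
    return prompt


# Example row (invented for illustration)
row = {
    "question": "How many heads are there?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head",
}
train_prompt = build_prompt(row["question"], row["context"], row["answer"])
infer_prompt = build_prompt(row["question"], row["context"])
print(infer_prompt)
```

With this shape, the inference prompt is always an exact prefix of the training prompt, which is the property the model relies on at generation time.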


**Import Required Dependencies**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
from trl import SFTTrainer
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
import time, torch

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

**Load the tokenizer and the model with fp16**

In [None]:
base_model_id = "microsoft/Phi-3-mini-4k-instruct"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id  , use_fast=True)
#Load the model with fp16
model =  AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
print_gpu_utilization()  # the function prints the figure itself; wrapping it in print() also prints "None"

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

GPU memory occupied: 7650 MB.


**Model Inference**

In [None]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206
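
The layer shapes in this printout are enough to estimate the model size and its fp16 memory footprint. The sketch below is simple arithmetic under two assumptions that the printout does not confirm: `lm_head` maps to the same vocabulary size as `embed_tokens` (the printed line is cut off), and each `Phi3RMSNorm` holds one weight per hidden unit.

```python
HIDDEN, VOCAB, LAYERS = 3072, 32064, 32

# (in_features, out_features) of the bias-free Linear layers in one decoder layer
linear_shapes = {
    "qkv_proj": (3072, 9216),
    "o_proj": (3072, 3072),
    "gate_up_proj": (3072, 16384),
    "down_proj": (8192, 3072),
}

# Linear weights plus two RMSNorm weight vectors per decoder layer (assumption)
per_layer = sum(i * o for i, o in linear_shapes.values()) + 2 * HIDDEN
embedding = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN  # assumption: same vocabulary size as embed_tokens
final_norm = HIDDEN

total_params = LAYERS * per_layer + embedding + lm_head + final_norm
fp16_mib = total_params * 2 / 1024**2  # two bytes per parameter

print(f"{total_params:,} parameters")  # 3,821,079,552 parameters
print(f"{fp16_mib:.0f} MiB in fp16")
```

The weights alone come to roughly 7.1 GiB, somewhat below the 7650 MB reported by `print_gpu_utilization()`; the difference is presumably CUDA context and allocator overhead.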

In [None]:

# Define prompts
prompt = [
    "Write the recipe for a chicken curry with coconut milk.",
    "Translate into French the following sentence: I love bread and cheese!",
    "Cite 20 famous people.",
    "Where is the moon right now?"
]

# Running totals for the average throughput
duration = 0.0
total_length = 0

for text in prompt:
    # Tokenize the prompt and move it to the GPU
    inputs = tokenizer(text, return_tensors="pt").to("cuda:0")

    start_time = time.time()
    # The model is already loaded in fp16, so autocasting stays disabled
    with torch.cuda.amp.autocast(enabled=False):
        output = model.generate(**inputs, max_length=500)
    elapsed = time.time() - start_time

    # output has shape (batch, sequence): len(output) is the batch size (1),
    # so count the tokens of the single returned sequence instead
    num_tokens = output.shape[1]
    duration += elapsed
    total_length += num_tokens

    # Tokens per second for this prompt
    print("Prompt --- %s tokens/seconds ---" % round(num_tokens / elapsed, 3))

    # Print decoded output
    print(tokenizer.decode(output[0], skip_special_tokens=True))

# Average tokens per second over all prompts
tok_sec = round(total_length / duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))






Prompt --- 1.441 tokens/seconds ---
Write the recipe for a chicken curry with coconut milk.
Prompt --- 8.391 tokens/seconds ---
Translate into French the following sentence: I love bread and cheese!

Prompt --- 10.428 tokens/seconds ---
Cite 20 famous people.

Prompt --- 11.692 tokens/seconds ---
Where is the moon right now?

Average --- 4.022 tokens/seconds ---
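
For a decoder-only model, `model.generate` returns the prompt tokens followed by the newly generated ones, so throughput figures are easier to compare across prompts when only the new tokens are counted. A minimal sketch of that calculation, where `tokens_per_second` is a hypothetical helper and the numbers are made up:

```python
def tokens_per_second(prompt_tokens, total_tokens, elapsed_seconds):
    """Throughput over newly generated tokens only (prompt tokens excluded)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    new_tokens = total_tokens - prompt_tokens
    return new_tokens / elapsed_seconds


# Invented example: a 12-token prompt grows to 212 tokens in 8 seconds
rate = tokens_per_second(prompt_tokens=12, total_tokens=212, elapsed_seconds=8.0)
print(rate)  # 25.0
```

In the loop above, `prompt_tokens` would correspond to `inputs["input_ids"].shape[1]` and `total_tokens` to `output.shape[1]`.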


**Model Inference for Text-to-SQL Without Fine-tuning**

In [None]:

prompt = [
    """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    ### Instruction :
    List all the cities in a decreasing order of each city's stations' highest latitude.
    Database Schema:
    CREATE TABLE station (city VARCHAR, lat INTEGER)
    ### Response:
    SELECT city, lat FROM station ORDER BY lat DESC;
    """,
    """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    ### Instruction :
    'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking
    Database Schema:
    CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)
    ### Response:
    SELECT POSITION, Points, Ranking
    FROM player
    WHERE Points > 20 AND Points < 10 AND Ranking IN (1,2,3,4,5,6,7,8,9,10)
    """,
    """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    ### Instruction :
    Find the first name of the band mate that has performed in most songs.
    Database Schema:
    CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)
    ### Response:
    SELECT b.firstname
    FROM Band b
    JOIN Performance p ON b.id = p.bandmate
    GROUP BY b.firstname
    ORDER BY COUNT(*) DESC
    LIMIT 1;
    """
]

# Reset the running totals so this average is not mixed with the previous cell's
duration = 0.0
total_length = 0

for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
  elapsed = time.time() - start_time
  duration += elapsed
  total_length += len(output)
  tok_sec_prompt = round(len(output)/elapsed,3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print_gpu_utilization()  # prints the figure itself; no outer print() needed
  print(tokenizer.decode(output, skip_special_tokens=False))

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Prompt --- 28.773 tokens/seconds ---
GPU memory occupied: 8272 MB.
<s> 
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    ### Instruction :
    List all the cities in a decreasing order of each city's stations' highest latitude.
    Database Schema:
    CREATE TABLE station (city VARCHAR, lat INTEGER)
    ### Response:
    SELECT city, lat FROM station ORDER BY lat DESC;
    
    ### Instruction :
    List the cities with the highest number of stations, in decreasing order of the number of stations.
    Database Schema:
    Create Table station (city VARCHAR, lat INTEGER, station_id INTEGER)
    ### Response: 
    SELECT city, COUNT(station_id) as station_count FROM station GROUP BY city ORDER BY station_count DESC;

- [Response]: ### Instruction: List all the cities in a decreasing order based on each city's stations' highest latitude. 

Given the database schema:

```sql
CREATE TABLE station (
    city VARCHAR,
    

**Model Finetuning**

In [None]:
base_model_id = "microsoft/Phi-3-mini-4k-instruct"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_eos_token=True, use_fast=True, max_length=250)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

compute_dtype = getattr(torch, "float16") # change to bfloat16 if you are using an Ampere (or more recent) GPU
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True,
          quantization_config=bnb_config,
          # revision="refs/pr/23",
          # device_map={"": 0},
          torch_dtype="auto",
          # flash_attn=True,
          # flash_rotary=True,
          # fused_dense=True
)
print_gpu_utilization()  # prints the figure itself; no outer print() needed

# Prepare the quantized model for k-bit (QLoRA-style) training
model = prepare_model_for_kbit_training(model)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.

GPU memory occupied: 10052 MB.


**Set Up LoRA Parameters**

In [None]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear4bit(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear4bit(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, o

In [None]:
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        # Adapt every Linear4bit layer shown in the model printout above
        target_modules=[
            "qkv_proj",
            "o_proj",
            "down_proj",
            "gate_up_proj",
        ],
)
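
With these settings the number of trainable parameters can be worked out by hand: each adapted `Linear` layer of shape `(in_features, out_features)` gains two low-rank matrices with `r * (in_features + out_features)` weights in total. The sketch below uses the shapes from the model printout; the variable names are illustrative only.

```python
R = 16        # LoRA rank from the config above
LAYERS = 32   # number of decoder layers in the printout

# (in_features, out_features) of each targeted module
target_shapes = {
    "qkv_proj": (3072, 9216),
    "o_proj": (3072, 3072),
    "gate_up_proj": (3072, 16384),
    "down_proj": (8192, 3072),
}

# LoRA adds A (r x in_features) and B (out_features x r) to each adapted layer
per_layer = sum(R * (i + o) for i, o in target_shapes.values())
trainable = LAYERS * per_layer

print(f"{trainable:,}")  # 25,165,824
```

This agrees with the `Number of trainable parameters = 25,165,824` line in the training log.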

**Set Up Training Arguments**

In [None]:
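training_arguments = TrainingArguments(
        output_dir="./phi3-results",
        save_strategy="epoch",  # save_steps below only applies when save_strategy="steps"
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,  # effective batch size: 8 x 8 = 64
        log_level="debug",
        save_steps=100,
        logging_steps=25,
        learning_rate=1e-4,
        eval_steps=50,  # unused here: no eval dataset is passed to the trainer
        optim='paged_adamw_8bit',
        fp16=True,  # change to bf16=True if you are using an Ampere (or more recent) GPU
        num_train_epochs=1,  # overridden by max_steps
        max_steps=200,
        warmup_steps=100,  # half of the 200 training steps
        lr_scheduler_type="linear",
        seed=42,
)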
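
These arguments imply how much of the dataset the run actually covers. The following back-of-the-envelope sketch assumes a single GPU and that the reported epoch fraction is computed from micro-batches over the length of the dataloader; it reproduces the figures in the training log.

```python
import math

num_examples = 78_577     # rows in the train split
per_device_batch = 8
grad_accum = 8
max_steps = 200
warmup_steps = 100

effective_batch = per_device_batch * grad_accum                 # examples per optimizer step
examples_seen = effective_batch * max_steps                     # examples in the whole run
batches_per_epoch = math.ceil(num_examples / per_device_batch)  # micro-batches in one epoch
epoch_fraction = max_steps * grad_accum / batches_per_epoch

print(effective_batch, examples_seen, batches_per_epoch)  # 64 12800 9823
print(round(epoch_fraction, 4))                            # 0.1629
print(warmup_steps / max_steps)                            # 0.5
```

In other words, the run sees about 16% of the data, and the learning rate is still warming up for the first half of it.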

**Prepare the training data**

In [None]:
# Keep only the formatted prompt column for training
train_dataset = instruct_tune_dataset.remove_columns(['answer', 'question', 'context'])
train_dataset

DatasetDict({
    train: Dataset({
        features: ['inputs'],
        num_rows: 78577
    })
})

**Fine-tune with TRL's SFTTrainer**

In [None]:
trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset["train"],
        #eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="inputs",
        max_seq_length=1024,
        tokenizer=tokenizer,
        args=training_arguments,
        packing=False
)
#
trainer.train()

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 78,577
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 8
  Total optimization steps = 200
  Number of trainable parameters = 25,165,824


| Step | Training Loss |
|------|---------------|
| 25   | 2.5344        |
| 50   | 1.3376        |
| 75   | 0.8407        |
| 100  | 0.7147        |
| 125  | 0.6736        |
| 150  | 0.6466        |
| 175  | 0.623         |
| 200  | 0.6203        |


Saving model checkpoint to ./phi3-results/checkpoint-200
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/config.json
Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  

TrainOutput(global_step=200, training_loss=0.9988515996932983, metrics={'train_runtime': 4833.528, 'train_samples_per_second': 2.648, 'train_steps_per_second': 0.041, 'total_flos': 4.314919466601677e+16, 'train_loss': 0.9988515996932983, 'epoch': 0.162883029624351})
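
The overall `training_loss` in `TrainOutput` can be cross-checked against the logged values. Assuming each logged loss is the mean over the 25 steps since the previous log entry, averaging the eight entries should give the mean over all 200 steps:

```python
# Training loss logged every 25 steps, copied from the table above
logged_losses = [2.5344, 1.3376, 0.8407, 0.7147, 0.6736, 0.6466, 0.623, 0.6203]

mean_loss = sum(logged_losses) / len(logged_losses)
print(round(mean_loss, 4))  # 0.9989
```

This matches the reported `training_loss` of about 0.9989, which is dominated by the high loss during warmup; the last logged value (0.6203) is a better indication of where the model ended up.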

**Test inference with the fine-tuned adapter:**

In [None]:
base_model_id = "microsoft/Phi-3-mini-4k-instruct"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True,
          quantization_config=bnb_config,
          device_map={"": 0}
)
adapter = "/content/phi3-results/checkpoint-200"
model = PeftModel.from_pretrained(model, adapter)

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/tokenizer.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the a

All model checkpoint weights were used when initializing Phi3ForCausalLM.

All the weights of Phi3ForCausalLM were initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Phi3ForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/920b6cf52a79ecff578cc33f61922b23cbc88115/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    32000,
    32001,
    32007
  ],
  "pad_token_id": 32000
}



**Perform Inference**

In [None]:
def build_inference_prompt(question, schema):
    # Opened with three quotes: a fourth quote would put a literal '"' at the start of the prompt
    return f"""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{question}
Database Schema:
{schema}
### Response:
"""

prompt = [
    build_inference_prompt(
        "List all the cities in a decreasing order of each city's stations' highest latitude.",
        "CREATE TABLE station (city VARCHAR, lat INTEGER)",
    ),
    build_inference_prompt(
        "'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking",
        "CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)",
    ),
    build_inference_prompt(
        "Find the first name of the band mate that has performed in most songs.",
        "CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)",
    ),
]


In [None]:
# Reset the running totals so this average covers only the fine-tuned model
duration = 0.0
total_length = 0

for i in range(len(prompt)):
    model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
    start_time = time.time()
    output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
    elapsed = time.time() - start_time
    duration += elapsed
    total_length += len(output)
    tok_sec_prompt = round(len(output)/elapsed,3)
    print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
    print_gpu_utilization()  # prints the figure itself; no outer print() needed
    print(tokenizer.decode(output, skip_special_tokens=False))

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Prompt --- 11.962 tokens/seconds ---
GPU memory occupied: 8498 MB.
<s> "
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
List all the cities in a decreasing order of each city's stations' highest latitude.
Database Schema:
CREATE TABLE station (city VARCHAR, lat INTEGER)
### Response:
SELECT city FROM station ORDER BY MAX(lat) DESC
</s><|assistant|> SELECT city FROM station GROUP BY city ORDER BY MAX(lat) DESC

bob<|end|><|assistant|> SELECT city FROM station GROUP BY city ORDERBY MAX(lat) DESC

bob_2<|end|><|assistant|> SELECT city FROM station GROUP BY city HAVING MAX(lat) ORDER BY MAX(lat) DESC

-1<|end|><|assistant|> SELECT city FROM station GROUP BY city, MAX(lat) ORDER BY MAX(lat), city DESC

-1<|end|><|assistant|> SELECT city, MAX(lat) FROM station GROUP BY city ORDER BY MAX(Lat) DESC

-1<|end|><|assistant|> SELECT DISTINCT city FROM station GROUP BY city ORDER BY MAX (lat) DESC

-1<|end|><|assistant

**Response**
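
The raw output above keeps going after the template's closing `</s>`, so the useful SQL is only the text between the prompt and the first end marker. A minimal sketch of that post-processing step follows; `extract_sql` and `STOP_MARKERS` are hypothetical names, and the marker list is taken from the tokens visible in the output above.

```python
STOP_MARKERS = ("</s>", "<|end|>", "<|endoftext|>", "<|assistant|>")


def extract_sql(decoded, prompt):
    """Return the text generated after `prompt`, cut at the first stop marker."""
    # Locate the prompt instead of slicing by length, because the decoded
    # text may start with extra tokens such as "<s> "
    start = decoded.find(prompt)
    completion = decoded[start + len(prompt):] if start != -1 else decoded
    cut = len(completion)
    for marker in STOP_MARKERS:
        pos = completion.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return completion[:cut].strip()


example_prompt = "### Response:\n"
example_output = (
    "<s> ### Response:\n"
    "SELECT city FROM station ORDER BY MAX(lat) DESC\n"
    "</s><|assistant|> SELECT city FROM station GROUP BY city ORDER BY MAX(lat) DESC"
)
print(extract_sql(example_output, example_prompt))
# SELECT city FROM station ORDER BY MAX(lat) DESC
```

If the decoded text does not contain the prompt verbatim, the sketch falls back to scanning the whole string, so the result should still be checked by eye.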

**Save the finetuned model**

In [None]:
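import locale
# Force a UTF-8 preferred encoding; in Colab this commonly works around
# "A UTF-8 locale is required" errors when running shell commands after training
locale.getpreferredencoding = lambda: "UTF-8"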

**Login to HuggingFace**

In [None]:
# Paste your Hugging Face access token when prompted; never hard-code it in the notebook
from huggingface_hub import notebook_login

notebook_login()

In [None]:
trainer.push_to_hub(commit_message="fine-tuned adapter")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
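import os
import shutil

# Move the training output (checkpoints included) to Drive, matching the path loaded below
os.makedirs('/content/drive/MyDrive/PHI-3', exist_ok=True)
shutil.move('/content/phi3-results', '/content/drive/MyDrive/PHI-3')

'/content/drive/MyDrive/PHI-3/phi3-results'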

In [None]:
from peft import AutoPeftModelForCausalLM
trained_model = AutoPeftModelForCausalLM.from_pretrained("/content/drive/MyDrive/PHI-3/phi3-results/checkpoint-200",
                                                         low_cpu_mem_usage=True,
                                                         return_dict=True,
                                                         torch_dtype=torch.float16,
                                                         device_map='auto',)
#
lora_merged_model = trained_model.merge_and_unload()
#
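
Conceptually, merging folds each adapter into its base weight so that no extra matrices are needed at inference time: with standard LoRA scaling, the merged weight is `W + (lora_alpha / r) * B @ A`. The toy example below illustrates that formula in plain Python; it is not how `peft` implements `merge_and_unload()`, and all values are invented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, in_features)
B = [[0.5], [0.0]]    # shape (out_features, r)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

In this notebook `lora_alpha` and `r` are both 16, so the scale factor is 1 and the update is simply `B @ A`.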


In [None]:
# Save the merged Model into drive
lora_merged_model.save_pretrained("/content/drive/MyDrive/PHI-3/phi3-results/lora_merged_model",safe_serialization=True)
# Save the tokenizer
tokenizer.save_pretrained("/content/drive/MyDrive/PHI-3/phi3-results/lora_merged_model")

In [None]:
lora_merged_model.push_to_hub(repo_id="username/phi3-results",commit_message="merged model")

In [None]:
tokenizer.push_to_hub(repo_id="username/phi3-results",commit_message="merged model")

**Perform Inference on Finetuned Model**

In [None]:
from peft import PeftConfig, PeftModel

peft_model_id = "username/phi3-results"
# Read the adapter config from the Hub to find out which base model it was trained on
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 4-bit; the adapter is attached in the next cell
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             return_dict=True,
                                             quantization_config=BitsAndBytesConfig(load_in_4bit=True),
                                             device_map="auto",
                                             trust_remote_code=True,
                                             )

In [None]:
tokenizer= AutoTokenizer.from_pretrained(peft_model_id)
#
model = PeftModel.from_pretrained(model,peft_model_id)
#
print(model.get_memory_footprint())

**Generate Response**

In [None]:
# Reset the running totals so this average covers only this cell
duration = 0.0
total_length = 0

for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
  elapsed = time.time() - start_time
  duration += elapsed
  total_length += len(output)
  tok_sec_prompt = round(len(output)/elapsed,3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print_gpu_utilization()  # prints the figure itself; no outer print() needed
  # Decode only the newly generated tokens, then cut at the template's closing "</s>"
  new_tokens = output[model_inputs["input_ids"].shape[1]:]
  print(f"RESPONSE:\n {tokenizer.decode(new_tokens, skip_special_tokens=False).split('</')[0]}")

tok_sec = round(total_length/duration,3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

In [None]:
prompt[0]

'"\nBelow is an instruction that describes a task.Write a response that appropriately completes the request.\n### Instruction :\nList all the cities in a decreasing order of each city\'s stations\' highest latitude.\nDatabase Schema:\nCREATE TABLE station (city VARCHAR, lat INTEGER)\n### Response:\n'

In [None]:
prompt[1]

'"\nBelow is an instruction that describes a task.Write a response that appropriately completes the request.\n### Instruction :\n\'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking\nDatabase Schema:\nCREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)\n### Response:\n'