In [1]:
!pip install transformers trl accelerate torch bitsandbytes peft sentencepiece wandb datasets -qU 
!pip install huggingface-hub -qU

In [1]:
from huggingface_hub import notebook_login
import wandb



In [2]:
import os
from huggingface_hub import login

# Read the access token from the environment instead of hardcoding it in the notebook.
login(token=os.environ["HF_TOKEN"])

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/gcpuser/.cache/huggingface/token
Login successful


In [3]:
import os

# Read the API key from the environment instead of hardcoding it in the notebook.
wandb.login(key=os.environ["WANDB_API_KEY"])
wandb.init(project='mistral-embedded-c-v0.3')

[34m[1mwandb[0m: Currently logged in as: [33mgouben10[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/gcpuser/.netrc


#### Load HF Dataset

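First things first, we need to load our dataset, `gouthamsk/embedded_dataset_mixed_small`. Each row has an `instruction`, an optional `input`, and an `output`, which we fold into a single `text` column using the `[INST]` prompt format.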

In [4]:
from datasets import load_dataset, Dataset

def create_text_row(data):
    # Rows without extra context have no `input`; leave it out of the prompt.
    if data["input"] is None:
        text_row = f"""<s>[INST]{data['instruction']}[/INST]\n{data['output']}</s>"""
    else:
        text_row = f"""<s>[INST]{data['instruction']} with {data['input']} [/INST]\n {data['output']}</s>"""
    return text_row

def prepare_train_data(data_id):
    data = load_dataset(data_id, split="train")
    # Build the `text` column row by row via pandas, then convert back to a Dataset.
    data_df = data.to_pandas()
    data_df["text"] = data_df.apply(create_text_row, axis=1)
    data = Dataset.from_pandas(data_df)
    return data
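
To see what the template produces, here is a minimal, dependency-free sketch that applies the same `[INST]` format to two made-up rows, one with an `input` and one without. The helper name `format_row` and the sample rows are hypothetical and only for illustration; the whitespace mirrors the f-strings above.

```python
def format_row(row):
    """Render one dataset row in the [INST] prompt format used above."""
    if row["input"] is None:
        # No extra context: instruction only.
        return f"<s>[INST]{row['instruction']}[/INST]\n{row['output']}</s>"
    # Extra context is appended to the instruction with the word "with".
    return f"<s>[INST]{row['instruction']} with {row['input']} [/INST]\n {row['output']}</s>"

# Hypothetical rows, not taken from the dataset.
no_input = {"instruction": "Toggle an LED.", "input": None, "output": "gpio_toggle();"}
with_input = {"instruction": "Fix the bug", "input": "int x = ;", "output": "int x = 0;"}

print(format_row(no_input))
print(format_row(with_input))
```

Printing a couple of rendered rows like this before training is a cheap way to catch template mistakes, such as a literal `None` ending up in the prompt.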

In [5]:

instruct_tune_dataset = prepare_train_data("gouthamsk/embedded_dataset_mixed_small")
instruct_tune_dataset = instruct_tune_dataset.shuffle(seed=1234)

Let's take a peek at our dataset.

The `create_text_row` helper has already merged the `instruction`, `input`, and `output` columns into a single formatted prompt, stored in the `text` column, for instruct-tuning.

In [6]:
instruct_tune_dataset

Dataset({
    features: ['instruction', 'output', 'input', 'text'],
    num_rows: 452
})

In [7]:
instruct_tune_dataset[280]

{'instruction': 'Write a C ++ program to optimize a given matrix with specific values.',
 'output': '#include <iostream>\n#include <vector> \n\nusing namespace std;\n\nvoid optimizeMatrix (vector<vector<int>> &mat, int val) { \n  // Iterate over the matrix\n  for (int i = 0; i < mat.size(); i++) { \n    for (int j = 0; j < mat[i].size(); j++) { \n      // Substitute all elements to be \n      // greater than or equal to value\n      if (mat[i][j] < val) \n        mat[i][j] = val; \n    } \n  } \n} \n\nint main() \n{ \n  // Create a matrix \n  vector<vector<int>> mat = { {4, 5, 7}, \n                        {9, 1, 8}, \n                        {1, 2, 4} }; \n  int val = 3; \n  // print matrix\n  optimizeMatrix(mat, val); \n  // Print after optimization\n  for (int i = 0; i < mat.size(); i++) { \n    for (int j = 0; j < mat[i].size(); j++) \n      cout << mat[i][j] << " "; \n    cout << endl; \n  } \n  return 0; \n}',
 'input': None,
 'text': '<s>[INST]Write a C ++ program to optimize a 

We're going to train on a small dataset (452 rows). If you were considering an epoch-based approach, this keeps the time spent training short!

### Loading the Base Model

We're going to load our model in `4bit`, with double quantization, with `bfloat16` as our compute dtype.

You'll notice we're loading an instruct-tuned checkpoint, `gouthamsk/mistral-embedded-c-v0.4`. This is because it's already adept at following instructions, so we're just teaching it a new task!
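
As a rough sanity check on what 4-bit loading saves, the sketch below estimates weight memory at different precisions. The parameter count (about 7.24 billion) is an assumption inferred from the roughly 14.5 GB of fp16 shards this notebook uploads later. The overhead figures for quantization constants (block size 64, second-level block size 256) follow the description in the QLoRA paper and are also assumptions about the defaults in use. The estimate ignores layers that stay unquantized, activations, and the CUDA context, so the real footprint will be higher.

```python
# Back-of-the-envelope weight memory for a ~7B model (all inputs are assumptions).
N_PARAMS = 7.24e9  # assumed parameter count

def weight_gb(bits_per_param, n_params=N_PARAMS):
    """Memory in GB (1e9 bytes) for n_params weights at the given bit width."""
    return n_params * bits_per_param / 8 / 1e9

# Overhead in bits per weight of one fp32 absmax constant per block of 64 weights.
overhead_plain = 32 / 64
# Double quantization: 8-bit constants, plus one fp32 constant per 256 of them.
overhead_double = 8 / 64 + 32 / (64 * 256)

print(f"fp16:               {weight_gb(16):.2f} GB")
print(f"nf4:                {weight_gb(4 + overhead_plain):.2f} GB")
print(f"nf4 + double quant: {weight_gb(4 + overhead_double):.2f} GB")
```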

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "gouthamsk/mistral-embedded-c-v0.4",
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False
)

tokenizer = AutoTokenizer.from_pretrained("gouthamsk/mistral-embedded-c-v0.4")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Let's check how much GPU memory the quantized model takes up:

In [10]:
!nvidia-smi

Tue Apr  9 09:45:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              50W / 400W |   5099MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Now, we're going to prepare our model for 4bit LoRA training!

We can use these handy helper functions to achieve this goal thanks to `huggingface` and the `peft` library!

In [11]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

# peft_config = LoraConfig(
#     lora_alpha=32,
#     lora_dropout=0.05,
#     r=64,
#     bias="none",
#     task_type="CAUSAL_LM"
# )


peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.05,
    r=64,
    # Apply LoRA to every attention and MLP projection.
    target_modules=[
        "v_proj",
        "up_proj",
        "gate_proj",
        "k_proj",
        "q_proj",
        "down_proj",
        "o_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM"
)

In [12]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
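
To get a feel for how many weights LoRA actually trains with this configuration, the sketch below counts adapter parameters by hand. The layer shapes (hidden size 4096, MLP size 14336, key/value size 1024, 32 layers) are those of Mistral-7B and are an assumption about this checkpoint; with a real model, `model.print_trainable_parameters()` from `peft` reports the exact figure.

```python
R = 64  # LoRA rank from peft_config

# (in_features, out_features) per target module; assumed Mistral-7B shapes.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
N_LAYERS = 32  # assumed number of transformer blocks

def lora_params(in_features, out_features, r=R):
    """A is (r x in), B is (out x r): r * (in + out) trainable weights."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in SHAPES.values())
total = per_layer * N_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the adapters hold a little under 170 million weights, a small fraction of the base model, which is why only they need gradients and optimizer state.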

All that's left to do is set up a number of hyperparameters.

In [13]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir="mistral-embedded-c-instruct-v0.3.1",
  # num_train_epochs=10,
  max_steps=200,  # comment out this line if you want to train in epochs
  per_device_train_batch_size=5,
  # 0.03 is a fraction of the total steps, so it belongs in warmup_ratio, not warmup_steps.
  # Note that the 'constant' scheduler below does not warm up; 'constant_with_warmup' does.
  warmup_ratio=0.03,
  logging_steps=10,
  save_strategy="epoch",
  optim="adamw_8bit",
  # evaluation_strategy="epoch",
  # evaluation_strategy="steps",
  # eval_steps=20,  # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  bf16=True,
  lr_scheduler_type='constant',
  report_to="wandb",
)
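
A value of `0.03` is a warmup ratio, that is, a fraction of the total number of steps. As a sketch of what such a ratio means for a 200-step run, the function below computes a linear-warmup-then-constant learning rate. It is an illustration written for this walkthrough, not the scheduler code from `transformers`; note also that the plain `'constant'` schedule selected here applies no warmup at all, whereas `'constant_with_warmup'` does.

```python
import math

MAX_STEPS = 200
BASE_LR = 2e-4
WARMUP_RATIO = 0.03

# Number of warmup steps implied by the ratio.
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)

def lr_at(step, warmup=warmup_steps, base_lr=BASE_LR):
    """Linear ramp from 0 to base_lr over `warmup` steps, then constant."""
    if warmup > 0 and step < warmup:
        return base_lr * step / warmup
    return base_lr

print(warmup_steps)
print([lr_at(s) for s in (0, 3, 6, 100)])
```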

In [14]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  # formatting_func=create_prompt,
  args=args,
  dataset_text_field="text",
  train_dataset=instruct_tune_dataset,
)

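# DataLoaderConfiguration lives in accelerate.utils; this object is created here but not passed to the trainer.
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)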
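
With `packing=True`, short examples are concatenated and cut into fixed-length blocks, so the number of training sequences no longer equals the number of dataset rows. The toy sketch below shows the idea on lists of integers standing in for token ids. It is a simplified illustration under that assumption, not TRL's implementation; `pack_sequences` and the toy ids are hypothetical.

```python
def pack_sequences(examples, seq_length, eos_id):
    """Concatenate token lists (separated by eos_id) and split into full blocks."""
    stream = []
    for tokens in examples:
        stream.extend(tokens)
        stream.append(eos_id)  # mark the boundary between examples
    # Keep only complete blocks; the leftover tail is dropped in this sketch.
    n_blocks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]

# Three toy "tokenized" examples of different lengths.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(examples, seq_length=4, eos_id=0)
print(blocks)
```

Note how the second block mixes the end of one example with the start of the next; that is the trade-off packing makes in exchange for having no padding tokens.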


In [15]:
instruct_tune_dataset["text"]

['<s>[INST]Create an Angular application with a table that displays a provided dataset. with [\n  {name: "Alice", total: 20},\n  {name: "Bob", total: 10},\n  {name: "Carol", total: 30},\n  {name: "Dave", total: 40},\n] [/INST]\n import {Component} from "@angular/core";\n\n@Component({\n selector:"datatable",\n template:`\n  <div>\n   <table>\n    <thead>\n     <th>Name</th>\n     <th>Total</th>\n    </thead>\n    <tbody *ngFor="let data of dataset">\n     <tr>\n      <td>{{data.name}}</td>\n      <td>{{data.total}}</td>\n     </tr>\n    </tbody>\n   </table>\n  </div>\n `\n})\nexport class DatatableComponent{\n  dataset = [\n  {name: "Alice", total: 20},\n  {name: "Bob", total: 10},\n  {name: "Carol", total: 30},\n  {name: "Dave", total: 40},\n  ];\n}</s>',
 '<s>[INST]Edit the following code to print the integers from 0 to 9 inclusive. with for i in range(10):\n print(i) [/INST]\n for i in range(10):\n print(i + 1)</s>',
 '<s>[INST]Create a Vue.js component with a button that calls an 

In [16]:
trainer.train()



Step,Training Loss
10,0.7993
20,0.5678
30,0.4174
40,0.3678
50,0.2272
60,0.2319
70,0.1206
80,0.1373
90,0.0815
100,0.0827




TrainOutput(global_step=200, training_loss=0.1684679690003395, metrics={'train_runtime': 928.7504, 'train_samples_per_second': 1.077, 'train_steps_per_second': 0.215, 'total_flos': 8.585988053925888e+16, 'train_loss': 0.1684679690003395, 'epoch': 10.0})
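
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below uses only numbers reported above (200 steps, batch size 5, 10 epochs, about 928.75 s of runtime) and assumes a single GPU with no gradient accumulation, which matches the arguments shown. It suggests that packing turned the 452 rows into roughly 100 training sequences per epoch.

```python
# Values copied from the training arguments and the TrainOutput above.
global_step = 200
batch_size = 5        # per_device_train_batch_size, single device assumed
epochs = 10.0
runtime_s = 928.7504

sequences_seen = global_step * batch_size
steps_per_epoch = global_step / epochs
sequences_per_epoch = steps_per_epoch * batch_size

print(sequences_seen)                         # total sequences processed
print(sequences_per_epoch)                    # packed sequences per epoch
print(round(sequences_seen / runtime_s, 3))   # compare with train_samples_per_second
print(round(global_step / runtime_s, 3))      # compare with train_steps_per_second
```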

In [17]:
new_model="mistral-embedded-c-instruct-v0.4"
trainer.model.save_pretrained(new_model)

In [18]:
from peft import LoraConfig, PeftModel

model_id = "gouthamsk/mistral-embedded-c-v0.4"

# Reload the base model in fp16 (not 4-bit) so the adapter is merged into unquantized weights.
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
# Attach the saved adapter, then fold it into the base weights.
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
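
Merging folds each low-rank update into its base weight, W' = W + (alpha / r) * B * A, so the merged model gives the same outputs as base plus adapter without the extra matrices at inference time. The toy sketch below checks that identity with plain Python lists. The tiny matrices are made up, and `alpha / r` uses the values from `peft_config` (32 / 64); it illustrates the arithmetic only and is not how `peft` implements `merge_and_unload`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]

SCALE = 32 / 64  # lora_alpha / r

W = [[1.0, 2.0], [3.0, 4.0]]  # toy base weight (2 x 2)
B = [[2.0], [4.0]]            # toy LoRA B (2 x 1), rank 1
A = [[1.0, -1.0]]             # toy LoRA A (1 x 2)
x = [[1.0], [2.0]]            # toy input column vector

# Merged path: fold the update into W once, then apply it.
merged = add_scaled(W, matmul(B, A), SCALE)
merged_out = matmul(merged, x)

# Adapter path: W x + scale * B (A x), computed without merging.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), SCALE)

print(merged, merged_out, adapter_out)
```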


In [19]:
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/gouthamsk/mistral-embedded-c-instruct-v0.4/commit/b3546a08c7fc31d7ae372c108034779af7658f90', commit_message='Upload tokenizer', commit_description='', oid='b3546a08c7fc31d7ae372c108034779af7658f90', pr_url=None, pr_revision=None, pr_num=None)

In [34]:
trainer.save_model("mistral-embedded-c-instruct-v0.3")

### Save Model and Push to Hub

4-bit save and push is coming soon!

The PR adding it is in progress. Check it out [here](https://github.com/TimDettmers/bitsandbytes/pull/753)!

For now, we'll save our adapters!

In [35]:
# Note: the first positional argument of Trainer.push_to_hub is the commit message, not a repo id.
trainer.push_to_hub("gouthamsk/mistral_embedded_c_v0.2")

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

events.out.tfevents.1709374432.sky-01ae-biboxdev-1cea-head-c28bxh13-compute:   0%|          | 0.00/6.80k [00:0…

adapter_model.safetensors:   0%|          | 0.00/109M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.92k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/gouthamsk/mistral_embedded_c/commit/d25e4b4ae4e5893de854e4090289f87b745c7366', commit_message='gouthamsk/mistral_embedded_c', commit_description='', oid='d25e4b4ae4e5893de854e4090289f87b745c7366', pr_url=None, pr_revision=None, pr_num=None)

In [36]:
merged_model = model.merge_and_unload()



In [37]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]
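
`generate_response` returns the prompt together with the completion, including special tokens. If only the model's answer is wanted, one option is to cut everything up to the last `[/INST]` marker. The helper below is a hypothetical sketch that works on the decoded string alone; the sample string is made up.

```python
def extract_answer(decoded, end_of_prompt="[/INST]", eos="</s>"):
    """Return the text generated after the last [/INST] marker, without the EOS token."""
    # rpartition splits on the LAST occurrence; if the marker is missing,
    # the whole string is treated as the answer.
    _, _, answer = decoded.rpartition(end_of_prompt)
    if answer.endswith(eos):
        answer = answer[: -len(eos)]
    return answer.strip()

sample = "<s> [INST]Is this code correct?[/INST] Yes, it compiles.</s>"
print(extract_answer(sample))
```

Splitting on the last marker matters when the prompt itself quotes `[/INST]`, although a prompt that does so would be unusual.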

In [43]:
generate_response("""[INST]Is the bellow answer correct answer to "the program to blink LED in GPIO PIN 2 with 1 second delay(1 second on and 1 second off)"
//user input answer start
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include <driver/gpio.h>
#define LED_PIN 2
void ledON(){
    gpio_set_level(1, 0);
}
void ledOFF(){
    gpio_set_level(2, 0);
}
void app_main(){
    gpio_config_t io_conf = {
        .pin_bit_mask = (1ULL<<3),
        .mode = GPIO_MODE_OUTPUT,
    };
    gpio_config(&io_conf);
    while (1) {
        ledON();
        vTaskDelay(1000 / portTICK_PERIOD_MS);
        ledOFF();
        vTaskDelay(1000 / portTICK_PERIOD_MS);
    }
}
//user input answer end
[/INST]""",merged_model)

'<s> [INST]Is the bellow answer correct answer to "the program to blink LED in GPIO PIN 2 with 1 second delay(1 second on and 1 second off)"\n//user input answer start\n#include <stdio.h>\n#include "freertos/FreeRTOS.h"\n#include "freertos/task.h"\n#include <driver/gpio.h>\n#define LED_PIN 2\nvoid ledON(){\n    gpio_set_level(1, 0);\n}\nvoid ledOFF(){\n    gpio_set_level(2, 0);\n}\nvoid app_main(){\n    gpio_config_t io_conf = {\n        .pin_bit_mask = (1ULL<<3),\n        .mode = GPIO_MODE_OUTPUT,\n    };\n    gpio_config(&io_conf);\n    while (1) {\n        ledON();\n        vTaskDelay(1000 / portTICK_PERIOD_MS);\n        ledOFF();\n        vTaskDelay(1000 / portTICK_PERIOD_MS);\n    }\n}\n//user input answer end\n[/INST] Yes, the provided answer code appears to correctly implement a program to blink an LED connected to GPIO Pin 2 with a 1 second delay (on for 1 second and off for 1 second). The code uses the FreeRTOS library and the driver/gpio library to configure the GPIO pin as a