In [1]:
!pip install transformers trl accelerate torch bitsandbytes peft sentencepiece wandb datasets -qU 
!pip install huggingface-hub -qU

In [2]:
from huggingface_hub import notebook_login
import wandb
notebook_login()



In [None]:
wandb.login()

#### Load HF Dataset

First things first, we need to load our dataset. This notebook uses `gouthamsk/embedded_dataset_mixed_small`, a small collection of embedded C programming tasks with `instruction`, `input`, and `output` columns.

In [24]:
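from datasets import load_dataset, Dataset
import pandas as pd

def create_text_row(data):
    # Check the row's own "input" field (not the Python builtin `input`);
    # rows without an extra input get the instruction-only template.
    if pd.isna(data["input"]):
        text_row = f"""<s>[INST]{data['instruction']}[/INST]\n{data['output']}</s>"""
    else:
        text_row = f"""<s>[INST]{data['instruction']} with {data['input']} [/INST] \n {data['output']}</s>"""
    return text_row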

def prepare_train_data(data_id):
    data = load_dataset(data_id, split="train")
    data_df = data.to_pandas() 
    data_df["text"] =data_df.apply(create_text_row, axis =1) 
    data = Dataset.from_pandas(data_df)
    return data 

In [25]:

instruct_tune_dataset = prepare_train_data("gouthamsk/embedded_dataset_mixed_small")
instruct_tune_dataset = instruct_tune_dataset.shuffle(seed=1234)

Let's take a peek at our dataset.

`create_text_row` has already merged the `instruction`, `input`, and `output` columns into a single formatted `text` column for instruct-tuning.
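To make the target format concrete, here is a minimal, dependency-free sketch of the same formatting idea. The helper name `format_example` and the two sample rows are hypothetical and only for illustration; the template mirrors the Mistral-style `[INST] ... [/INST]` layout used in this notebook.

```python
from typing import Optional

def format_example(instruction: str, output: str, input_text: Optional[str] = None) -> str:
    """Build one training string from a dataset row (illustrative helper)."""
    if input_text:  # None or "" means the row has no extra input
        prompt = f"{instruction} with {input_text}"
    else:
        prompt = instruction
    return f"<s>[INST]{prompt}[/INST]\n{output}</s>"

# Hypothetical rows, shaped like the dataset's columns
rows = [
    {"instruction": "Blink an LED on P1.0", "input": None, "output": "/* code */"},
    {"instruction": "Toggle a pin", "input": "pin = P2.7", "output": "/* code */"},
]
formatted = [format_example(r["instruction"], r["output"], r["input"]) for r in rows]
for text in formatted:
    print(repr(text))
```

Rows with a missing input never get the literal string `None` in the prompt, which is the main thing to check when inspecting the `text` column.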

In [26]:
instruct_tune_dataset

Dataset({
    features: ['input', 'output', 'instruction', 'text'],
    num_rows: 281
})

In [27]:
instruct_tune_dataset[280]

{'input': None,
 'output': '#include <reg51.h>\nsbit inbit= P1^0;\nsbit outbit= P2^7;\nbit membit;\nvoid main(void)\n{\n    while(1) { //repeat forever\n      membit= inbit;\n      outbit= membit\n    }\n}',
 'instruction': 'Write an 8051 C program to get the status of bit P1.0, save it, and send it to P2.7 continuously.',
 'text': '<s>[INST]Write an 8051 C program to get the status of bit P1.0, save it, and send it to P2.7 continuously. here are the inputs None [/INST] \n #include <reg51.h>\nsbit inbit= P1^0;\nsbit outbit= P2^7;\nbit membit;\nvoid main(void)\n{\n    while(1) { //repeat forever\n      membit= inbit;\n      outbit= membit\n    }\n}</s>'}

The dataset is small (281 rows), so even an epoch-based training run finishes quickly.

### Loading the Base Model

We're going to load our model in `4bit`, with double quantization, with `bfloat16` as our compute dtype.

You'll notice we're loading the instruct-tuned model - this is because it's already adept at following tasks - we're just teaching it a new one!
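As a rough back-of-the-envelope check on what 4-bit loading with double quantization buys, the sketch below estimates weight memory. The parameter count (about 7.24 billion) and the block sizes (64 for weights, 256 for the second-level constants) are assumptions taken from the QLoRA defaults, not from this notebook, and the estimate ignores layers that stay in higher precision (embeddings, norms, output head).

```python
# Rough weight-memory estimate for NF4, with and without double quantization.
N_PARAMS = 7.24e9   # assumption: approximate parameter count of Mistral-7B
BLOCK = 64          # assumption: NF4 block size for weights
BLOCK2 = 256        # assumption: block size for quantizing the constants

def bits_per_param(double_quant: bool) -> float:
    """Average storage cost per weight, including quantization constants."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit constant per block, plus a 32-bit constant per 256 blocks
        const_bits = 8 / BLOCK + 32 / (BLOCK * BLOCK2)
    else:
        # one 32-bit constant per block
        const_bits = 32 / BLOCK
    return weight_bits + const_bits

def weight_gib(bits: float) -> float:
    """Convert a per-parameter bit cost into total GiB for the whole model."""
    return N_PARAMS * bits / 8 / 2**30

for dq in (False, True):
    b = bits_per_param(dq)
    print(f"double_quant={dq}: {b:.3f} bits/param, ~{weight_gib(b):.2f} GiB")
```

The actual GPU footprint is higher than this figure because of activations, the CUDA context, and the unquantized layers.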

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Let's examine how well the model does at this task currently:

In [19]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

In [23]:
generate_response("[INST]Develop a C program for a ESP32 microcontroller to blink led in GPIO Pin with specific second delay[/INST]",model)

'<s> [INST]Develop a C program for a ESP32 microcontroller to blink led in GPIO Pin with specific second delay[/INST] Here is a simple C program for an ESP32 microcontroller to blink an LED connected to a specific GPIO pin with a specified second delay. This program uses the FreeRTOS real-time operating system, which is often used with the ESP32. If you are not using FreeRTOS, you may need to modify the code to fit your specific setup.\n\n1. First, install the necessary components for the ESP32, including the Espressif IoT Development Framework and the FreeRTOS build environment.\n\n2. Create a new C file named `main.c` with the following code:\n\n```c\n#include "esp_system.h"\n#include "esp_log.h"\n#include " freertos.h"\n#include " task.h"\n\n#define LED_GPIO 5 // Change this to the GPIO pin connected to your LED\n\nvoid led_task(void *arg);\n\nvoid app_main()\n{\n    ESP_LOGI("Main", "Starting application...");\n\n    // Initialize the LED GPIO as output\n    gpio_set_level(LED_GPIO
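Note that the decoded string contains the prompt as well as the completion. A small helper can strip everything up to the last `[/INST]` marker; the name `extract_answer` is hypothetical, and the sketch assumes the Mistral-style markers shown above.

```python
def extract_answer(decoded: str) -> str:
    """Return only the model's reply from a decoded '[INST] ... [/INST] reply' string."""
    marker = "[/INST]"
    _, found, answer = decoded.rpartition(marker)
    if not found:
        # No marker present: return the text unchanged (minus surrounding whitespace)
        return decoded.strip()
    # Drop the end-of-sequence token if the model emitted one
    return answer.replace("</s>", "").strip()

print(extract_answer("<s> [INST]Say hi[/INST] Hello!</s>"))
```

With the function from this notebook, usage would look like `extract_answer(generate_response(prompt, model))`.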

In [28]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Sat Mar  2 10:13:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              54W / 400W |  10135MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                         

Now, we're going to prepare our model for 4bit LoRA training!

We can use these handy helper functions to achieve this goal thanks to `huggingface` and the `peft` library!

In [29]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.05,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)

In [30]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
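To get a feel for how small the trainable part is, the sketch below counts LoRA parameters for `r=64`. It rests on two assumptions that are not stated in this notebook: that PEFT falls back to adapting `q_proj` and `v_proj` when `target_modules` is not set for Mistral, and that Mistral-7B has 32 layers with a 4096-wide hidden state and a 1024-wide value projection.

```python
R = 64
LORA_ALPHA = 32
N_LAYERS = 32  # assumption: number of transformer layers in Mistral-7B

# assumption: (in_features, out_features) of the adapted projections
TARGETS = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGETS.values())
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"adapter size in float32: ~{total * 4 / 1e6:.0f} MB")
print(f"LoRA scaling (alpha / r): {scaling}")
```

On a loaded PEFT model, `model.print_trainable_parameters()` reports the real count and is the number to trust.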

All that's left to do is set up a number of hyperparameters.

In [31]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "mistral_embedded_c_v0.2",
  #num_train_epochs=10,
  max_steps = 400, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 5,
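  warmup_ratio = 0.03, # fraction of total steps (warmup_steps expects an integer); the 'constant' scheduler below does not apply warmup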
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  # evaluation_strategy="steps",
  # eval_steps=20, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  bf16=True,
  lr_scheduler_type='constant',
  report_to="wandb",
)
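The relationship between steps, epochs, and warmup is simple arithmetic, sketched below. The helper names are hypothetical, and the epoch estimate assumes one training sequence per dataset row, which no longer holds once packing is enabled.

```python
import math

def warmup_steps_from_ratio(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a ratio of the total step count."""
    return math.ceil(total_steps * warmup_ratio)

def steps_per_epoch(num_sequences: int, batch_size: int, grad_accum: int = 1) -> int:
    """Optimizer steps needed to see every training sequence once."""
    return math.ceil(num_sequences / (batch_size * grad_accum))

max_steps = 400
per_epoch = steps_per_epoch(num_sequences=281, batch_size=5)

print("warmup steps:", warmup_steps_from_ratio(max_steps, 0.03))
print("steps per epoch (unpacked):", per_epoch)
print("approx. epochs covered by max_steps:", round(max_steps / per_epoch, 1))
```

Setting `max_steps` overrides `num_train_epochs`, so only one of the two controls the length of a run.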

In [32]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  # formatting_func=create_prompt,
  args=args,
  dataset_text_field="text",
  train_dataset=instruct_tune_dataset,
)
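With `packing=True`, short examples are concatenated and cut into fixed-length blocks, so the number of training sequences can be much smaller than the number of dataset rows. The following is a simplified sketch of that idea using plain token-id lists; it is not TRL's actual implementation, and `pack_sequences` is a hypothetical name.

```python
from typing import Iterable, List

def pack_sequences(tokenized: Iterable[List[int]], eos_id: int, seq_len: int) -> List[List[int]]:
    """Concatenate examples (separated by EOS) and split into blocks of seq_len tokens."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the boundary between examples
    n_full = len(buffer) // seq_len  # an incomplete trailing block is dropped here
    return [buffer[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Toy token ids; 0 plays the role of the EOS token
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, eos_id=0, seq_len=4)
print(packed)
```

A block can start or end in the middle of an example, which is the trade-off packing makes for fewer padding tokens.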


In [33]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 10 | 1.2905 |
| 20 | 0.7962 |
| 30 | 0.6999 |
| 40 | 0.6407 |
| 50 | 0.5388 |
| 60 | 0.4753 |
| 70 | 0.4209 |




TrainOutput(global_step=70, training_loss=0.6946241174425397, metrics={'train_runtime': 213.6705, 'train_samples_per_second': 1.31, 'train_steps_per_second': 0.328, 'total_flos': 2.455902363844608e+16, 'train_loss': 0.6946241174425397, 'epoch': 10.0})
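The logged training loss is a mean token-level cross-entropy, so exponentiating it gives a perplexity that can be easier to interpret. A quick sketch using the first and last logged values from the run above (this is training perplexity, not a held-out metric):

```python
import math

# First and last logged training losses from the run above
losses = {10: 1.2905, 70: 0.4209}

# Perplexity is exp(mean cross-entropy in nats)
perplexity = {step: math.exp(loss) for step, loss in losses.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: loss {losses[step]:.4f} -> perplexity {ppl:.2f}")
```

Because no evaluation split is used here, a falling training perplexity over ten epochs may partly reflect memorization of the small dataset.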

In [None]:
new_model="gemma_embedded_c_7b"
trainer.model.save_pretrained(new_model)

In [None]:
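from peft import PeftModel

# Reload the base model in half precision so the LoRA weights can be merged into it.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Set padding before saving so the saved tokenizer carries these settings
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model and tokenizer
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")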

In [None]:
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

In [34]:
trainer.save_model("mistral_embedded_c_v0.2")

### Save Model and Push to Hub

4bit save and push coming soon!

The PR is literally in the process of being added! Check it out [here](https://github.com/TimDettmers/bitsandbytes/pull/753)!

For now, we'll save our adapters!

In [35]:
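# Note: the first positional argument of Trainer.push_to_hub is the commit message, not the repo id.
trainer.push_to_hub("gouthamsk/mistral_embedded_c_v0.2")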


CommitInfo(commit_url='https://huggingface.co/gouthamsk/mistral_embedded_c/commit/d25e4b4ae4e5893de854e4090289f87b745c7366', commit_message='gouthamsk/mistral_embedded_c', commit_description='', oid='d25e4b4ae4e5893de854e4090289f87b745c7366', pr_url=None, pr_revision=None, pr_num=None)
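Once the adapters are saved, they can be loaded back for inference without repeating the training setup. The sketch below wraps this in a function so nothing is downloaded until it is called. The name `load_finetuned` is hypothetical, the default path assumes the adapter directory written by `trainer.save_model` above, and loading in `float16` rather than 4-bit is a choice made here for illustration.

```python
def load_finetuned(
    adapter_path: str = "mistral_embedded_c_v0.2",           # assumption: local adapter directory
    base_id: str = "mistralai/Mistral-7B-Instruct-v0.2",     # base model used in this notebook
):
    """Load the base model with the saved LoRA adapter attached (illustrative sketch)."""
    # Imports live inside the function so defining it has no heavy dependencies
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Reads the adapter config, loads the referenced base model, then attaches the adapter
    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_path,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
```

Calling `model, tokenizer = load_finetuned()` requires a GPU with enough memory for the half-precision base model.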

In [36]:
merged_model = model.merge_and_unload()




In [43]:
generate_response("""[INST]Is the bellow answer correct answer to "the program to blink LED in GPIO PIN 2 with 1 second delay(1 second on and 1 second off)"
//user input answer start
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include <driver/gpio.h>
#define LED_PIN 2
void ledON(){
    gpio_set_level(1, 0);
}
void ledOFF(){
    gpio_set_level(2, 0);
}
void app_main(){
    gpio_config_t io_conf = {
        .pin_bit_mask = (1ULL<<3),
        .mode = GPIO_MODE_OUTPUT,
    };
    gpio_config(&io_conf);
    while (1) {
        ledON();
        vTaskDelay(1000 / portTICK_PERIOD_MS);
        ledOFF();
        vTaskDelay(1000 / portTICK_PERIOD_MS);
    }
}
//user input answer end
[/INST]""",merged_model)

'<s> [INST]Is the bellow answer correct answer to "the program to blink LED in GPIO PIN 2 with 1 second delay(1 second on and 1 second off)"\n//user input answer start\n#include <stdio.h>\n#include "freertos/FreeRTOS.h"\n#include "freertos/task.h"\n#include <driver/gpio.h>\n#define LED_PIN 2\nvoid ledON(){\n    gpio_set_level(1, 0);\n}\nvoid ledOFF(){\n    gpio_set_level(2, 0);\n}\nvoid app_main(){\n    gpio_config_t io_conf = {\n        .pin_bit_mask = (1ULL<<3),\n        .mode = GPIO_MODE_OUTPUT,\n    };\n    gpio_config(&io_conf);\n    while (1) {\n        ledON();\n        vTaskDelay(1000 / portTICK_PERIOD_MS);\n        ledOFF();\n        vTaskDelay(1000 / portTICK_PERIOD_MS);\n    }\n}\n//user input answer end\n[/INST] Yes, the provided answer code appears to correctly implement a program to blink an LED connected to GPIO Pin 2 with a 1 second delay (on for 1 second and off for 1 second). The code uses the FreeRTOS library and the driver/gpio library to configure the GPIO pin as a