## Fine-tuning the 4-bit llama-2-7b model on the [Translation](https://huggingface.co/datasets/satpalsr/chatml-translation-filter) dataset with a modified Unsloth training notebook, then running inference on the validation set

## Installing libraries

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

## Loading 4-bit model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth



We now add LoRA adapters so that only a small fraction of the parameters is updated (39,976,960 trainable parameters in this run, as the training log further down reports).

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
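As a sanity check on the adapter size, the trainable-parameter count can be reproduced by hand. The sketch below assumes the usual Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers), which are not stated in this notebook, and no bias terms (`bias = "none"`). Each adapted weight of shape `out x in` gets two LoRA matrices with `r * (in + out)` parameters in total.

```python
# Assumed Llama-2-7B dimensions (not taken from this notebook).
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
r = 16  # LoRA rank used in get_peft_model above

# (in_features, out_features) of every adapted projection in one decoder layer.
projections = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, hidden_size),
    "v_proj": (hidden_size, hidden_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in projections.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 39,976,960, the same number the trainer reports as trainable parameters.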


<a name="Data"></a>
### Data Prep
We now use the `ChatML` format for conversation style finetunes. We use the [Translation](https://huggingface.co/datasets/satpalsr/chatml-translation-filter) dataset, whose conversations are stored in ShareGPT style. ChatML renders multi turn conversations like below:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.
```

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and our own optimized `unsloth` template.

Normally one has to train `<|im_start|>` and `<|im_end|>`. We instead map `<|im_end|>` to the EOS token and leave `<|im_start|>` as is, so no new tokens have to be trained.

Note ShareGPT uses `{"from": "human", "value" : "Hi"}` and not `{"role": "user", "content" : "Hi"}`, so we use `mapping` to map it.
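
To make the role mapping concrete, here is a minimal pure-Python sketch of how ShareGPT-style turns could be rendered as ChatML text. `render_chatml` and `ROLE_MAP` are hypothetical names used only for illustration. This is not Unsloth's actual template, which is a Jinja template applied by the tokenizer and may differ in whitespace and EOS handling.

```python
# ShareGPT speaker names mapped to ChatML role names (illustrative only).
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def render_chatml(conversation, add_generation_prompt=False):
    """Render a list of {"from": ..., "value": ...} turns as ChatML text."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What's the capital of France?"},
]
print(render_chatml(example, add_generation_prompt=True))
```

During training the assistant turn is part of the text, so no generation prompt is appended. At inference time the trailing `<|im_start|>assistant` line is what asks the model to answer.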

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("satpalsr/chatml-translation-filter", split = 'train')
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Will map <|im_end|> to EOS = </s>.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Generating train split: 1875 examples
Generating validation split: 324 examples

Let's see how the `ChatML` format works by printing the element at index 5 (the 6th example)

In [None]:
dataset[5]["conversations"]

[{'from': 'system',
  'value': 'For a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuances that may affect the accuracy of the answer. When translation is deemed unsuitable, output {"translate": False}. Otherwise, output {"translate": True}.'},
 {'from': 'human',
  'value': 'Generate a paragraph based on the given information. In the year 2025, the United Nations is working hard to reduce its carbon emissions by 50%.'},
 {'from': 'gpt', 'value': '{"translate": True}'}]

In [None]:
print(dataset[5]["text"])

<|im_start|>system
For a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuances that may affect the accuracy of the answer. When translation is deemed unsuitable, output {"translate": False}. Otherwise, output {"translate": True}.<|im_end|>
<|im_start|>user
Generate a paragraph based on the given information. In the year 2025, the United Nations is working hard to reduce its carbon emissions by 50%.<|im_end|>
<|im_start|>assistant
{"translate": True}<|im_end|>



In [None]:
dataset

Dataset({
    features: ['conversations', 'text'],
    num_rows: 1875
})

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 200 steps here to keep the run short (the accuracy section also reports runs with 60, 100, and 150 steps). For a full run, set `num_train_epochs=1` and remove `max_steps`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 200,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
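
The settings above determine how much of the training split the run actually sees. The following arithmetic is a small sketch of that relationship. It assumes a single GPU and the 1,875-example train split loaded earlier.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1          # assumption: one GPU (a single Colab T4)
max_steps = 200
num_examples = 1875   # size of the train split

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Examples consumed over the whole run, and the fraction of one epoch that covers.
examples_seen = effective_batch * max_steps
epochs = examples_seen / num_examples

print(f"effective batch size = {effective_batch}")
print(f"examples seen        = {examples_seen}")
print(f"fraction of an epoch = {epochs:.3f}")
```

With these values, 200 steps cover 1,600 examples, a little under one pass over the training data. The shorter runs compared later (60, 100, and 150 steps) see proportionally less.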


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
6.654 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,875 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 200
 "-____-"     Number of trainable parameters = 39,976,960


| Step | Training Loss |
|------|---------------|
| 1 | 0.1674 |
| 2 | 0.1506 |
| 3 | 0.1659 |
| 4 | 0.1166 |
| 5 | 0.1595 |
| 6 | 0.1333 |
| 7 | 0.1247 |
| 8 | 0.1196 |
| 9 | 0.1523 |
| 10 | 0.119 |


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

755.267 seconds used for training.
12.59 minutes used for training.
Peak reserved memory = 6.656 GB.
Peak reserved memory for training = 0.002 GB.
Peak reserved memory % of max memory = 45.132 %.
Peak reserved memory for training % of max memory = 0.014 %.


<a name="Inference"></a>
### Inference
Let's run the model! Since we're using `ChatML`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [None]:
valset = load_dataset("satpalsr/chatml-translation-filter", split = 'validation')
valset['conversations'][0]

[{'from': 'system',
  'value': 'For a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuances that may affect the accuracy of the answer. When translation is deemed unsuitable, output {"translate": False}. Otherwise, output {"translate": True}.'},
 {'from': 'human',
  'value': 'Can you suggest 10 types of plants that are commonly used in traditional medicine?'},
 {'from': 'gpt', 'value': '{"translate": True}'}]

In [None]:
# Obtaining only the system and user message from sample
valset['conversations'][0][:2]

[{'from': 'system',
  'value': 'For a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuances that may affect the accuracy of the answer. When translation is deemed unsuitable, output {"translate": False}. Otherwise, output {"translate": True}.'},
 {'from': 'human',
  'value': 'Can you suggest 10 types of plants that are commonly used in traditional medicine?'}]

In [None]:
# Creating a list of validation samples as dictionaries
validation = []
for convo in valset['conversations']:
    a = {}
    a['input'] = convo[:2]
    a['true_value'] = convo[2]['value']
    a['output'] = ""
    validation.append(a)

In [None]:
# Viewing a sample
validation[0]

{'input': [{'from': 'system',
   'value': 'For a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuances that may affect the accuracy of the answer. When translation is deemed unsuitable, output {"translate": False}. Otherwise, output {"translate": True}.'},
  {'from': 'human',
   'value': 'Can you suggest 10 types of plants that are commonly used in traditional medicine?'}],
 'true_value': '{"translate": True}',
 'output': ''}

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference


In [None]:
# Getting inference from the model
for message in validation:
  messages = message['input']
  inputs = tokenizer.apply_chat_template(
      messages,
      tokenize = True,
      add_generation_prompt = True, # Must add for generation
      return_tensors = "pt",
  ).to("cuda")

  outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
  ans = tokenizer.batch_decode(outputs)
  message['output'] = ans

In [None]:
# Saving the data as csv file
import pandas as pd
df = pd.DataFrame(validation)
df.to_csv('Indic_outputs_200.csv')

In [None]:
validation[0]

{'input': [{'from': 'system',
   'value': 'For a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuances that may affect the accuracy of the answer. When translation is deemed unsuitable, output {"translate": False}. Otherwise, output {"translate": True}.'},
  {'from': 'human',
   'value': 'Can you suggest 10 types of plants that are commonly used in traditional medicine?'}],
 'true_value': '{"translate": True}',
 'output': ['<|im_start|>system\nFor a given question assess whether translating the potential answer to another language might yield an inaccurate response. Avoid translation in tasks related to coding problems, alliteration, idioms, paraphrasing text, word count, spelling correction, and other linguistic constructs or contextual nuan

## Calculating Accuracy

In [None]:
# Accuracy when model is trained for 200 steps
correct = 0
for i in range(0,len(validation)):
  if(validation[i]['true_value'] in validation[i]['output'][0].split('<|im_start|>')[-1]):
    correct +=1
print((correct/len(validation))*100)

69.1358024691358


In [None]:
# Accuracy when model is trained for 150 steps
correct = 0
for i in range(0,len(validation)):
  if(validation[i]['true_value'] in validation[i]['output'][0].split('<|im_start|>')[-1]):
    correct +=1
print((correct/len(validation))*100)

72.22222222222221


In [None]:
# Accuracy when model is trained for 100 steps
correct = 0
for i in range(0,len(validation)):
  if(validation[i]['true_value'] in validation[i]['output'][0].split('<|im_start|>')[-1]):
    correct +=1
print((correct/len(validation))*100)

66.9753086419753


In [None]:
# Accuracy when model is trained for 60 steps
correct = 0
for i in range(0,len(validation)):
  if(validation[i]['true_value'] in validation[i]['output'][0].split('<|im_start|>')[-1]):
    correct +=1
print((correct/len(validation))*100)

63.888888888888886
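
The accuracy cells above all use the same substring test on the text after the last `<|im_start|>`. A stricter variant is sketched below: it isolates the assistant reply and requires an exact match with the reference label. `extract_reply` and `accuracy` are hypothetical helpers, and the sketch assumes the generated turn ends with `<|im_end|>` or `</s>`. Because it is stricter, it may give different numbers from those reported above.

```python
def extract_reply(decoded, start_token="<|im_start|>", end_tokens=("<|im_end|>", "</s>")):
    """Return the text of the last turn in a decoded ChatML string."""
    last_turn = decoded.split(start_token)[-1]
    # Drop the role header ("assistant") that opens the turn, if present.
    if last_turn.startswith("assistant"):
        last_turn = last_turn[len("assistant"):]
    # Cut the reply at the first end-of-turn marker that appears.
    for token in end_tokens:
        idx = last_turn.find(token)
        if idx != -1:
            last_turn = last_turn[:idx]
    return last_turn.strip()

def accuracy(samples):
    """Percentage of samples whose extracted reply equals the reference label."""
    correct = sum(
        extract_reply(s["output"][0]) == s["true_value"].strip() for s in samples
    )
    return 100 * correct / len(samples)

# Tiny hand-made example in the same shape as the `validation` list.
demo = [
    {"true_value": '{"translate": True}',
     "output": ['<|im_start|>user\nQ<|im_end|>\n<|im_start|>assistant\n{"translate": True}</s>']},
    {"true_value": '{"translate": True}',
     "output": ['<|im_start|>user\nQ<|im_end|>\n<|im_start|>assistant\n{"translate": False}<|im_end|>']},
]
print(accuracy(demo))
```

Calling `accuracy(validation)` after the inference loop would replace the repeated counting code, so each run only needs one line.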


In [None]:
# Share of validation samples whose reference label is True (class balance)
positive = 0
for i in range(0,len(validation)):
  if( "True" in validation[i]['true_value']):
    positive +=1
print((positive/len(validation))*100)

47.22222222222222
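
The class balance gives a reference point for the accuracy figures. The sketch below converts the reported percentages back into counts and compares each run with a majority-class baseline, that is, always predicting the more frequent label. It assumes the validation split has 324 examples, as shown in the dataset loading log, and copies the percentages from the cells above.

```python
n_val = 324                          # validation split size (from the loading log)
positive_share = 47.22222222222222   # percent of labels that are True

positives = round(n_val * positive_share / 100)
negatives = n_val - positives
# Accuracy of always predicting the more frequent label.
majority_baseline = 100 * max(positives, negatives) / n_val

# Reported accuracy (percent) keyed by number of training steps.
reported = {
    60: 63.888888888888886,
    100: 66.9753086419753,
    150: 72.22222222222221,
    200: 69.1358024691358,
}
correct_counts = {steps: round(n_val * acc / 100) for steps, acc in reported.items()}
gains = {steps: acc - majority_baseline for steps, acc in reported.items()}

print(f"majority baseline: {majority_baseline:.2f}%")
for steps in reported:
    print(f"{steps} steps: {correct_counts[steps]} correct, {gains[steps]:+.2f} points vs baseline")
```

The baseline comes out at about 52.8%, so the best run (150 steps) is roughly 19 points above always predicting `False`.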


## Suggestions

*   Accuracy rose with the number of training steps only up to a point: 63.9% at 60 steps, 67.0% at 100, 72.2% at 150, then 69.1% at 200. The number of steps could be tuned to possibly get better results.
*   A larger training set could improve results.
*   One-shot prompting may follow the output format better than zero-shot prompting.
*   Structured-output frameworks could be used to get the output in the desired format.
*   Hyperparameter tuning may help.
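
Two of these suggestions can be sketched in a few lines. `build_one_shot_input` and `parse_label` are hypothetical helpers, not part of Unsloth or TRL. The first places one worked example before the real question, and the second checks that a reply has the expected `{"translate": True}` or `{"translate": False}` shape. Whether one-shot prompting actually helps a model fine-tuned on zero-shot data was not tested here.

```python
import ast

def build_one_shot_input(system_turn, example_question, example_label, question):
    """Build ShareGPT-style messages with one worked example before the query."""
    return [
        system_turn,
        {"from": "human", "value": example_question},
        {"from": "gpt", "value": example_label},
        {"from": "human", "value": question},
    ]

def parse_label(reply):
    """Return True/False if the reply is a well-formed label, otherwise None."""
    try:
        # The labels use Python literals (True/False), so JSON parsing would fail.
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None
    if isinstance(value, dict) and isinstance(value.get("translate"), bool):
        return value["translate"]
    return None

messages = build_one_shot_input(
    {"from": "system", "value": "(system prompt from the dataset)"},
    "Generate a paragraph based on the given information.",
    '{"translate": True}',
    "Can you suggest 10 types of plants that are commonly used in traditional medicine?",
)
print(len(messages), parse_label('{"translate": True}'), parse_label("not a label"))
```

The message list can be passed to `tokenizer.apply_chat_template` in place of `message['input']`. Replies for which `parse_label` returns `None` can be counted separately as format failures.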

