#### **Problem Statement**

- Fine-tuning a language model means continuing to train a pre-trained model, adjusting the weights of some or all of its layers, for a specific task or domain.

- In **Supervised fine-tuning**, a model is trained on a task-specific labeled dataset where each datapoint has a label or right answer. The model learns to adjust its parameters to ensure the labels are predicted accurately.

- Types of supervised fine-tuning: basic hyperparameter tuning, task-specific fine-tuning, transfer learning, few-shot learning, multi-task learning.

- **Reinforcement Learning from Human Feedback (RLHF)** trains the language model using human feedback on its outputs. By incorporating this feedback into the learning process, RLHF improves model performance and leads to more accurate responses.

- Types of RLHF: reward model training, proximal policy optimization, comparative ranking, preference learning.

- **Parameter-efficient fine-tuning (PEFT)** trains only a small number of parameters, either a subset of the pre-trained weights or a small set of added ones, and keeps the rest frozen.

- Types of PEFT: LoRA, QLoRA.

- Hu, Zhiqiang, et al. "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models." arXiv preprint arXiv:2304.01933 (2023).

#### **References**

[1] https://doi.org/10.48550/arXiv.2312.12148

[2] https://huggingface.co/datasets/Falah/sentiments-dataset-381-classes

[3] https://towardsdatascience.com/fine-tuning-large-language-models-llms-23473d763b91

[4] https://www.turing.com/resources/finetuning-large-language-models

[5] https://arxiv.org/pdf/1909.08593.pdf

[6] https://arxiv.org/abs/2305.18290

[7] https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac

[8] https://medium.com/@joaolages/direct-preference-optimization-dpo-622fc1f18707

#### **Direct Preference Optimization (DPO)**

- Reference [6] notes that *RLHF is complex and sometimes unstable: it requires training a reward model that reflects human preferences and then fine-tuning the large unsupervised language model using reinforcement learning.*

- DPO, introduced in reference [6] and summarized in reference [8], is an RLHF-like alternative that fine-tunes a language model to align with human preferences.

- DPO is trained on preference data: triplets of a prompt, a chosen answer, and a rejected answer [8].

- At the start of fine-tuning, an exact copy of the language model is created and all of its parameters are frozen. This copy is the frozen (reference) model.

- For each datapoint, the chosen and rejected responses are scored by both the model being trained and the reference model. The score is the product of the probabilities the model assigns to each token of the response at each step.

- For each response, the ratio of the trained model's score to the reference model's score is calculated. These ratios enter the loss function, whose gradient is used to update the trained model's weights.
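
The scoring and loss described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single preference triplet, not the `DPOTrainer` implementation: the helper names `sequence_logprob` and `dpo_loss` and the example probabilities are assumptions made for this sketch, and `beta=0.1` mirrors the value passed to the trainer later in this notebook. Scores are kept in log space, so the product of token probabilities becomes a sum and the ratio of scores becomes a difference.

```python
import math


def sequence_logprob(token_probs):
    """Log of the product of per-token probabilities for one response."""
    return sum(math.log(p) for p in token_probs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Each argument is a list of per-token probabilities (hypothetical inputs).
    """
    # Log of the trained-model / reference-model score ratio for each response
    chosen_logratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_logratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Made-up token probabilities, for illustration only
ref_chosen, ref_rejected = [0.5, 0.5], [0.2, 0.4]

# Trained model identical to the reference model: loss equals log(2)
loss_at_start = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# Trained model assigns higher probability to the chosen response: loss decreases
loss_after_update = dpo_loss([0.6, 0.6], ref_rejected, ref_chosen, ref_rejected)

print(loss_at_start, loss_after_update)
```

When the trained model and the reference model agree, the loss is log 2; it falls as the trained model raises the probability of the chosen response relative to the rejected one.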

#### **Installing Dependencies**

In [None]:
# Install the libraries used in this notebook
!pip install -q datasets trl wandb peft bitsandbytes sentencepiece transformers

In [None]:
import os  # operating-system utilities
import gc  # garbage collector interface, used to free memory after training
import transformers
import torch  # PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset  # loads datasets from the Hugging Face Hub
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

In [None]:
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb

#### **API Token**

In [None]:
# Defining API Tokens in Google Colab

hf_token = userdata.get('HF_TOKEN')
wb_token = userdata.get('WANDB_API_KEY')

In [None]:
# Language Model for fine-tuning
model_name = "teknium/OpenHermes-2.5-Mistral-7B"

In [None]:
# Name (and output directory) of the model fine-tuned with DPO
output_dir = "NeuralHermes-2.5-Mistral-7B-RL"

#### **ChatML Format**

In [None]:
def chatml_format(example):
  # Format the optional system message
  if len(example['system']) > 0:
    message = {"role": "system", "content": example['system']}
    system = tokenizer.apply_chat_template([message], tokenize=False)
  else:
    system = ""

  # Format the user instruction and append the assistant generation prompt
  message = {"role": "user", "content": example['question']}
  prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

  # Format the chosen answer, closing the assistant turn
  chosen = example['chosen'] + "<|im_end|>\n"

  # Format the rejected answer, closing the assistant turn
  rejected = example['rejected'] + "<|im_end|>\n"

  return {
      "prompt": system + prompt,
      "chosen": chosen,
      "rejected": rejected,
  }
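
To make the target format concrete, here is a small standalone sketch of the ChatML layout that `apply_chat_template` is expected to produce for this tokenizer. It assumes the tokenizer's chat template is plain ChatML, as the prompts printed later in this notebook suggest; `to_chatml` is a hypothetical helper written for this illustration and is not part of `transformers`.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    text = ""
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "What is Artificial Intelligence?"},
]
prompt_text = to_chatml(messages, add_generation_prompt=True)
print(prompt_text)
```

The chosen and rejected answers end with `<|im_end|>` because each of them completes the assistant turn that the prompt leaves open.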

In [None]:
#Loading the dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

In [None]:
original_columns = dataset.column_names

In [None]:
#Tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [None]:
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

In [None]:
dataset['chosen'][6]

'To determine the sentiment of the tweet, we need to analyze it thoroughly.\n\nTweet: @nikkigreen I told you\n\nStep 1: Identify the words or phrases that carry emotional weight.\nIn this tweet, there is only one phrase worth examining: "I told you."\n\nStep 2: Determine the sentiment of the identified words or phrases.\n"I told you" can carry a variety of sentiments, depending on the context. It could be positive, negative, or neutral.\n\nStep 3: Consider the overall context of the tweet.\nUnfortunately, without more context, it is impossible to determine the exact sentiment of the tweet.\n\nAs a result, we cannot confidently choose an answer from the provided options, positive or negative, without more contextual information.<|im_end|>\n'

In [None]:
dataset

Dataset({
    features: ['chosen', 'rejected', 'prompt'],
    num_rows: 12859
})

In [None]:
dataset['prompt'][6]

'<|im_start|>system\nYou are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.<|im_end|>\n<|im_start|>user\nMulti-choice question: What is the sentiment of the following tweet?\nTweet: @nikkigreen I told you \nChoose your answer from:\n + negative;\n + positive;<|im_end|>\n<|im_start|>assistant\n'

In [None]:
dataset['rejected'][6]



#### **Direct Preference Optimization**

In [None]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
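
To give a sense of why LoRA is parameter-efficient: for a weight matrix with `d_in` inputs and `d_out` outputs, a rank-`r` adapter trains two small matrices with `r * (d_in + d_out)` entries in total instead of the full `d_in * d_out`. The sketch below computes that count. The value `r = 16` comes from the config above; the 4096 x 4096 shape is an illustrative assumption and is not read from the model, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, r):
    """Number of trainable entries in the two rank-r LoRA matrices for one layer."""
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * d_in + d_out * r


d_in, d_out, r = 4096, 4096, 16  # shape is an illustrative assumption

full_params = d_in * d_out
lora_params = lora_trainable_params(d_in, d_out, r)
fraction = lora_params / full_params

print(full_params, lora_params, fraction)
```

For this shape the adapter holds under 1% of the entries of the matrix it adapts, which is why the quantized base weights can stay frozen during training.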

#### **Fine-tuning the model**

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    # Request 4-bit loading through BitsAndBytesConfig; the bare `load_in_4bit` argument is deprecated
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [None]:
model.config.use_cache = False

In [None]:
# reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    # Request 4-bit loading through BitsAndBytesConfig; the bare `load_in_4bit` argument is deprecated
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=500,
    save_strategy="steps",
    logging_steps=100,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)
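
The arithmetic behind these settings can be checked with a short sketch. It assumes training on a single GPU (`num_devices = 1` is an assumption) and uses the 12859 rows reported for the dataset above. The `cosine_lr` function is a hypothetical helper that sketches linear warmup followed by cosine decay to zero, which is what the `cosine` scheduler type with `warmup_steps` is assumed to do here; it is not the library's own implementation.

```python
import math

# Values copied from the TrainingArguments cell and the dataset summary
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
warmup_steps = 100
learning_rate = 5e-5
num_rows = 12859
num_devices = 1  # assumption: a single GPU

# Examples consumed per optimizer step and over the whole run
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch_size * max_steps
epochs = examples_seen / num_rows


def cosine_lr(step):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size, examples_seen, round(epochs, 2))
print([cosine_lr(step) for step in (0, 50, 100, 300, 500)])
```

With these values each optimizer step covers 16 examples, so 500 steps cover 8000 examples, about 0.62 of one pass over the dataset.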

In [None]:
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    force_use_ref_model=True
)



dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
dpo_trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 0.136         |
| 200  | 0.0012        |
| 300  | 0.0001        |
| 400  | 0.0           |
| 500  | 0.0           |


TrainOutput(global_step=500, training_loss=0.027472170781809836, metrics={'train_runtime': 7703.5253, 'train_samples_per_second': 1.038, 'train_steps_per_second': 0.065, 'total_flos': 0.0, 'train_loss': 0.027472170781809836, 'epoch': 0.62})

In [None]:
# save model
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")

('final_checkpoint/tokenizer_config.json',
 'final_checkpoint/special_tokens_map.json',
 'final_checkpoint/tokenizer.model',
 'final_checkpoint/added_tokens.json',
 'final_checkpoint/tokenizer.json')

In [None]:
del dpo_trainer, model, ref_model
gc.collect()
torch.cuda.empty_cache()

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype = torch.float16,
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [None]:
# Load the trained LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model, "final_checkpoint"
)

In [None]:
# Merge the adapter weights into the base model and drop the adapter wrappers
model = model.merge_and_unload()
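
As a rough picture of what merging a LoRA adapter means, the sketch below folds a low-rank update into a base weight matrix, `W_merged = W + (lora_alpha / r) * B @ A`, and checks that the merged matrix gives the same output as running the base and adapter paths separately. It uses tiny made-up matrices and plain Python lists; the helper names are hypothetical, and this is an illustration of the idea rather than the PEFT implementation. With the `lora_alpha=16` and `r=16` used in this notebook, the scaling factor would be 1.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    return add(w, scale(matmul(lora_b, lora_a), lora_alpha / r))


# Made-up example: d_in = 3, d_out = 2, r = 1
w = [[1, 0, 2], [0, 1, 0]]   # base weight, shape (d_out, d_in)
lora_a = [[1, 2, 3]]         # shape (r, d_in)
lora_b = [[0.5], [-1.0]]     # shape (d_out, r)
lora_alpha, r = 2, 1
x = [[1], [1], [1]]          # input column vector

merged = merge_lora(w, lora_a, lora_b, lora_alpha, r)

# Output of the merged matrix vs. base path plus scaled adapter path
y_merged = matmul(merged, x)
y_separate = add(matmul(w, x), scale(matmul(lora_b, matmul(lora_a, x)), lora_alpha / r))

print(merged, y_merged, y_separate)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like an ordinary checkpoint.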

In [None]:
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('NeuralHermes-2.5-Mistral-7B-RL/tokenizer_config.json',
 'NeuralHermes-2.5-Mistral-7B-RL/special_tokens_map.json',
 'NeuralHermes-2.5-Mistral-7B-RL/tokenizer.model',
 'NeuralHermes-2.5-Mistral-7B-RL/added_tokens.json',
 'NeuralHermes-2.5-Mistral-7B-RL/tokenizer.json')

In [None]:
message = [
    {
        "role": "system", "content": "You are a helpful AI assistant"
    },
    {
        "role": "user", "content" : "What is Artificial Intelligence?"
    }
]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(output_dir)

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
prompt = tokenizer.apply_chat_template(message,
                                       add_generation_prompt = True,
                                       tokenize = False)

In [None]:
prompt

'<|im_start|>system\nYou are a helpful AI assistant<|im_end|>\n<|im_start|>user\nWhat is Artificial Intelligence?<|im_end|>\n<|im_start|>assistant\n'

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model = output_dir,
    tokenizer = tokenizer
)

In [None]:
response = pipeline(
    prompt,
    do_sample = True,
    temperature = 0.3,
    top_p = 0.9,
    num_return_sequences = 1,
    max_length = 255,

)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


In [None]:
response

[{'generated_text': '<|im_start|>system\nYou are a helpful AI assistant<|im_end|>\n<|im_start|>user\nWhat is Artificial Intelligence?<|im_end|>\n<|im_start|>assistant\nArtificial Intelligence (AI) refers to the development of computer systems and machines that can perform tasks that typically require human intelligence, such as learning, problem-solving, decision-making, perception, and natural language understanding. AI technologies aim to simulate and enhance human cognitive abilities, enabling machines to work and adapt in human-like ways.'}]