# Fine-tune a Mistral model with reinforcement learning (DPO)

- model: `teknium/OpenHermes-2.5-Mistral-7B`
- dataset: `Intel/orca_dpo_pairs`

### 1. Set up the Google Colab environment

In [17]:
import sys
import gc
from google.colab import drive, userdata

In [8]:
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [9]:
!ls /content/gdrive/MyDrive/LLMs/RL/RL-LAB/

data_handler.py  __pycache__  requirements.txt	rl_dpo_finetune.ipynb  rl_dpo_trainer.py


In [3]:
!pip install -r /content/gdrive/MyDrive/LLMs/RL/RL-LAB/requirements.txt

Collecting torch==2.1.2 (from -r /content/gdrive/MyDrive/LLMs/RL/RL-LAB/requirements.txt (line 1))
  Downloading torch-2.1.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Collecting transformers==4.37.2 (from transformers[sentencepiece]==4.37.2->-r /content/gdrive/MyDrive/LLMs/RL/RL-LAB/requirements.txt (line 6))
  Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.4/129.4 kB 11.1 MB/s eta 0:00:00
Collecting datasets==2.16.1 (from -r /content/gdrive/MyDrive/LLMs/RL/RL-LAB/requirements.txt (line 7))
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting accelerate==0.26.1 (from -r /content/gdrive/MyDrive/LLMs/RL/RL-LAB/requirements.txt (line 8))
  Downloading accelerate-0.26.1-py3-none-any.whl.metadata (18 kB)
Collecting evaluate==0.4.1 (from -r /content/gdrive/MyDrive/LLMs/RL/RL-LAB/requirements.txt (line 9))
  Downloading evaluate-0.4.1-py3-none-any.

In [10]:
sys.path.append('/content/gdrive/MyDrive/LLMs/RL/RL-LAB/')

### 2. Dataset preparation

#### 2.1 Preliminaries

In [11]:
from data_handler import DataHandler
from rl_dpo_trainer import RL_DPOTrainer

In [12]:
# Configuration
model_name = "teknium/OpenHermes-2.5-Mistral-7B"
dataset_name = "Intel/orca_dpo_pairs"
new_model = "/content/gdrive/MyDrive/LLMs/RL/RL-LAB/trained_model"
wb_token = userdata.get('wandb')

In [13]:
# Initialize the RL_DPOTrainer for reinforcement learning fine-tuning
rldpo_trainer = RL_DPOTrainer(
    model_name,    # The name or path of the pre-trained model
    dataset_name,  # The name of the dataset used for training
    new_model,     # The directory where the fine-tuned model will be saved
    wb_token       # The authentication token for logging to Weights & Biases (WandB)
)

Logging to WandB


wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: nbassiou (nbassiou-org) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Loading the models
Loading models: teknium/OpenHermes-2.5-Mistral-7B


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/51.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/101 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/120 [00:00<?, ?B/s]

#### 2.2 Load and Process the data

In [14]:
# Load and process dataset
data_handler = DataHandler(dataset_name, rldpo_trainer.tokenizer)
print(len(data_handler.dataset))
data_handler.map_dataset()


Loading dataset: Intel/orca_dpo_pairs


Downloading readme:   0%|          | 0.00/196 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/36.3M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

12859
Formating the dataset


Map:   0%|          | 0/12859 [00:00<?, ? examples/s]
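
The formatting step presumably turns each record into the `prompt` / `chosen` / `rejected` triple that DPO training consumes. The sketch below is an assumption about what `map_dataset()` does, not the contents of `data_handler.py`. It builds the ChatML prompt by hand, following the `<|im_start|>` / `<|im_end|>` layout visible in the generations later in this notebook, from the `system`, `question`, `chosen` and `rejected` columns of `Intel/orca_dpo_pairs`. `to_dpo_example` is a hypothetical helper name.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def to_dpo_example(row: dict) -> dict:
    """Convert one orca_dpo_pairs-style row to prompt/chosen/rejected (hypothetical helper)."""
    parts = []
    # The system message is optional; skip it when empty.
    if row.get("system"):
        parts.append(f"{IM_START}system\n{row['system']}{IM_END}\n")
    parts.append(f"{IM_START}user\n{row['question']}{IM_END}\n")
    # Leave the assistant turn open so the answer can be appended to it.
    parts.append(f"{IM_START}assistant\n")
    return {
        "prompt": "".join(parts),
        "chosen": row["chosen"] + IM_END + "\n",
        "rejected": row["rejected"] + IM_END + "\n",
    }


row = {
    "system": "You are a teacher.",
    "question": "Classify the review.",
    "chosen": "negative",
    "rejected": "positive",
}
example = to_dpo_example(row)
print(example["prompt"])
```

In practice the tokenizer's `apply_chat_template` (used later in this notebook) is the safer way to build the prompt, since it follows the template shipped with the model.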

For fast development, or for demo and test purposes, you can train on a subset of the dataset. Run the following to randomly sample a portion of the dataset, then pass `sampled_dataset` to the trainer:

`sampled_dataset = data_handler.sample_data()`
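
How `sample_data()` picks its subset is not shown in this notebook. A minimal sketch of one common approach, assuming a seeded random draw of row indices, is given below; `sample_indices` and its parameters are hypothetical.

```python
import random


def sample_indices(n_total: int, n_sample: int, seed: int = 42) -> list[int]:
    """Return n_sample distinct row indices drawn reproducibly from range(n_total)."""
    if not 0 <= n_sample <= n_total:
        raise ValueError("n_sample must be between 0 and n_total")
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    return sorted(rng.sample(range(n_total), n_sample))


# 12859 rows in the full dataset; 10 rows for a quick run.
indices = sample_indices(12859, 10)
print(indices)
```

If `data_handler.dataset` is a Hugging Face `Dataset` (an assumption), the index list can be passed to its `select` method to obtain the subset.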

### 2. Train the model with DPO

In [16]:
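# Train the model. `dataset` is not defined in the cells shown; the sampled subset
# is assumed here (the "0/10" progress bar below suggests a small sample).
dataset = data_handler.sample_data()
rldpo_trainer.train(dataset)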



Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
1,0.5199
2,0.1733
3,0.088
4,0.0002
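
For orientation, the training loss reported above is the DPO objective: the negative log-sigmoid of `beta` times the difference between the policy-versus-reference log-ratio of the chosen answer and that of the rejected answer. The sketch below computes it for a single preference pair from summed log-probabilities. The value `beta=0.1` and the example log-probabilities are assumptions for illustration, not values taken from `rl_dpo_trainer.py`.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss equals log(2).
untrained = dpo_loss(-50.0, -60.0, -50.0, -60.0)
# Policy prefers the chosen answer more strongly than the reference does.
trained = dpo_loss(-40.0, -90.0, -50.0, -60.0)
print(untrained, trained)
```

The loss starts at log 2 (about 0.693) when the policy and the reference agree, and approaches 0 as the policy separates chosen from rejected answers.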


In [18]:
# Save the model
rldpo_trainer.save_model()

Saving the models to /content/gdrive/MyDrive/LLMs/RL/RL-LAB/trained_model




Models saved


In [19]:
!ls /content/gdrive/MyDrive/LLMs/RL/RL-LAB/trained_model

adapter_config.json	   added_tokens.json  special_tokens_map.json  tokenizer.json
adapter_model.safetensors  README.md	      tokenizer_config.json    tokenizer.model


#### Optional: Merge the LoRA adapter into the original model

In [20]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

In [None]:
# Flush memory
del rldpo_trainer.model, rldpo_trainer.tokenizer
gc.collect()
torch.cuda.empty_cache()

In [22]:
# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
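
Conceptually, merging folds each low-rank update into the corresponding base weight, `W_merged = W + (alpha / r) * B @ A`, so the adapter is no longer needed at inference time. The toy pure-Python illustration below shows only that arithmetic. The matrices, rank and `alpha` are made-up values, and the code does not reproduce PEFT's internals.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, 2 x 2
A = [[1.0, 2.0]]              # LoRA A with rank 1, 1 x 2
B = [[0.5], [0.0]]            # LoRA B, 2 x 1
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```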


In [25]:
# Save merged model and tokenizer
merged_model_path = "/content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model"
model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)

('/content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model/tokenizer_config.json',
 '/content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model/special_tokens_map.json',
 '/content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model/tokenizer.model',
 '/content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model/added_tokens.json',
 '/content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model/tokenizer.json')

In [26]:
!ls /content/gdrive/MyDrive/LLMs/RL/RL-LAB/merged_model/

added_tokens.json		  model-00002-of-00003.safetensors  tokenizer_config.json
config.json			  model-00003-of-00003.safetensors  tokenizer.json
generation_config.json		  model.safetensors.index.json	    tokenizer.model
model-00001-of-00003.safetensors  special_tokens_map.json


### 3. Evaluate the model

Manually check whether the responses from the aligned model are more helpful, more accurate, and better aligned with human preferences.

#### 3.1 Load the aligned model and generate responses on the sample data

In [30]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer


In [31]:
# A sample prompt taken from the original dataset
message = [
    {"role": "system", "content":"You are a teacher. Given a task, you explain in simple steps what the task is asking, any guidelines it provides and how to use those guidelines to find the answer."},
    {"role": "user", "content": "Given the task definition and input, reply with output. In this task, you are given reviews written about the books in Bengali. You are expected to classify the sentiment of the reviews into two classes: positive or negative. পড়তে পড়তে বিরক্ত লাগছিলো।বিশেষ কিছুই নেই"}
]

In [32]:
def generate_text(model_name: str, message: list[dict], temperature: float = 0.7, top_p: float = 0.9, num_return_sequences: int = 1, max_length: int = 2000) -> list[dict]:
    """
    Generates text using a pre-trained model and a given message.

    Args:
        model_name (str): The name or local path of the Hugging Face model to use.
        message (list[dict]): The chat messages (role/content dicts) passed to the chat template.
        temperature (float): Sampling temperature for generation.
        top_p (float): Nucleus sampling probability.
        num_return_sequences (int): Number of generated sequences.
        max_length (int): Maximum total length (prompt plus generated tokens).

    Returns:
        list[dict]: One dict per generated sequence; the text is under the 'generated_text' key.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Render the chat messages into a single prompt string that ends with the assistant tag
    original_prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

    text_pipeline = transformers.pipeline(
        "text-generation",
        model=model_name,
        tokenizer=tokenizer
    )

    sequences = text_pipeline(
        original_prompt,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        num_return_sequences=num_return_sequences,
        max_length=max_length,
    )

    # Callers index the result, e.g. sequences[0]['generated_text']
    return sequences

In [33]:
dpo_sequences = generate_text(merged_model_path, message)
print(dpo_sequences[0]['generated_text'])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


<|im_start|>system
You are a teacher. Given a task, you explain in simple steps what the task is asking, any guidelines it provides and how to use those guidelines to find the answer.<|im_end|>
<|im_start|>user
Given the task definition and input, reply with output. In this task, you are given reviews written about the books in Bengali. You are expected to classify the sentiment of the reviews into two classes: positive or negative. পড়তে পড়তে বিরক্ত লাগছিলো।বিশেষ কিছুই নেই<|im_end|>
<|im_start|>assistant
Task: The task is to analyze a given book review in Bengali and classify its sentiment as either positive or negative.

Guidelines:
1. Understand the meaning of the Bengali text provided.
2. Identify the overall sentiment expressed in the review.
3. Classify the sentiment as positive or negative based on the review's content.

Steps to find the answer:
1. Translate the Bengali text into English: "Gradually, I got annoyed. There is nothing special."
2. Analyze the sentiment of the trans
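
Because the pipeline returns the prompt together with the completion, it can help to isolate the assistant's reply before comparing models. A small sketch follows, assuming the ChatML markers shown in the output above; `extract_reply` is a hypothetical helper.

```python
ASSISTANT_TAG = "<|im_start|>assistant\n"
END_TAG = "<|im_end|>"


def extract_reply(generated_text: str) -> str:
    """Return the text after the last assistant tag, without a trailing end marker."""
    _, found, reply = generated_text.rpartition(ASSISTANT_TAG)
    if not found:
        # No assistant tag present: return the text unchanged apart from whitespace.
        return generated_text.strip()
    return reply.split(END_TAG, 1)[0].strip()


sample = (
    "<|im_start|>user\nClassify the review.<|im_end|>\n"
    "<|im_start|>assistant\nnegative<|im_end|>"
)
print(extract_reply(sample))  # negative
```

For the cell above, this would be called as `extract_reply(dpo_sequences[0]['generated_text'])`.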

#### 3.2 Load the original model (before alignment) and generate some responses for sample data

In [35]:
sequences = generate_text(model_name, message)
print(sequences[0]['generated_text'])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


<|im_start|>system
You are a teacher. Given a task, you explain in simple steps what the task is asking, any guidelines it provides and how to use those guidelines to find the answer.<|im_end|>
<|im_start|>user
Given the task definition and input, reply with output. In this task, you are given reviews written about the books in Bengali. You are expected to classify the sentiment of the reviews into two classes: positive or negative. পড়তে পড়তে বিরক্ত লাগছিলো।বিশেষ কিছুই নেই<|im_end|>
<|im_start|>assistant
This task is asking you to analyze a Bengali book review and determine whether the sentiment expressed in the review is positive or negative. Here are the simple steps to follow:

1. Understand the task: The main goal is to classify a book review into either positive or negative sentiment.

2. Read the given input: In this case, the input is a Bengali sentence - "পড়তে পড়তে বিরক্ত লাগছিলো। বিশেষ কিছুই নেই"

3. Translate the input (if necessary): If you are not familiar with the Bengali 

Potential next steps: evaluate the two models further by using an LLM as a judge to rate their responses against each other.
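
One way to set this up, sketched under assumptions: build a pairwise comparison prompt, randomise which model appears as answer A to reduce position bias, and map the judge's verdict back to the model names. The prompt wording and the helper names are hypothetical, and the call to the judge model itself is left out.

```python
import random

JUDGE_TEMPLATE = (
    "You are comparing two answers to the same question.\n"
    "Question:\n{question}\n\n"
    "Answer A:\n{answer_a}\n\n"
    "Answer B:\n{answer_b}\n\n"
    "Reply with a single letter: A if answer A is better, "
    "B if answer B is better, C if they are equally good."
)


def build_judge_prompt(question: str, base_answer: str, dpo_answer: str, rng: random.Random):
    """Return the judge prompt and a mapping from letter to model name."""
    swap = rng.random() < 0.5  # randomise the order of the two answers
    first, second = (dpo_answer, base_answer) if swap else (base_answer, dpo_answer)
    labels = {"A": "dpo" if swap else "base", "B": "base" if swap else "dpo"}
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=first, answer_b=second)
    return prompt, labels


def parse_verdict(judge_output: str, labels: dict) -> str:
    """Map the judge's first letter to 'base' or 'dpo'; anything else counts as a tie."""
    verdict = judge_output.strip().upper()[:1]
    return labels.get(verdict, "tie")


rng = random.Random(0)
prompt, labels = build_judge_prompt("Classify the review.", "positive", "negative", rng)
print(prompt)
print(parse_verdict("A", labels))
```

Running the comparison over many prompts and counting the wins per model gives a rough preference rate for the aligned model over the original one.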