# **Fine-tune an LLM on a custom dataset**

Article: [part 1](https://wandb.ai/capecape/alpaca_ft/reports/How-to-implement-fine-tuning-of-an-LLM-Part-1-Dataset-for-Instruction-Tuning--Vmlldzo1NTcxNzE2) [part 2](https://wandb.ai/capecape/alpaca_ft/reports/How-to-fine-tune-an-LLM-Part-2-Instruction-tuning-Llama-2--Vmlldzo1NjY0MjE1) \
Cleaned dataset in use: [alpaca cleaned](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json)

## **Dependencies**

In [31]:
import pandas as pd

### **Load dataset**

In [5]:
import json

from pprint import pprint

In [2]:
with open("/kaggle/input/alpaca-cleaned/alpaca_data_cleaned.json", "r") as file:
    alpaca = json.load(file)

In [3]:
len(alpaca)

51760

In [6]:
pprint(alpaca[123])

{'input': '',
 'instruction': "Find the synonyms of the following word: 'Tenacious'.",
 'output': "Here are some synonyms for the word 'Tenacious':\n"
           '\n'
           '1. Persistent\n'
           '2. Determined \n'
           '3. Resolute \n'
           '4. Steadfast \n'
           '5. Obstinate\n'
           '6. Persevering\n'
           '7. Unyielding\n'
           '8. Unwavering\n'
           '9. Strong-willed\n'
           '10. Dogged.'}


### **Preprocess the data**
Some instructions come with an additional input and some do not, so each case needs its own prompt template.

In [11]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(row)

In [12]:
def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)

In [15]:
print(prompt_no_input(alpaca[123]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the synonyms of the following word: 'Tenacious'.

### Response:



### **We can merge both paths into one**

In [16]:
def create_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)

In [17]:
prompts = [create_prompt(row) for row in alpaca]
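
To see both branches side by side, here is a minimal sketch that applies the same rule to two toy rows. The rows and the name `build_prompt` are illustrative assumptions, not part of the Alpaca dataset or the notebook above; the templates are copied from `prompt_no_input` and `prompt_input`.

```python
# Templates copied from the two prompt functions above.
NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
INPUT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def build_prompt(row):
    # Same branching rule as create_prompt: an empty "input" selects the shorter template.
    template = NO_INPUT_TEMPLATE if row["input"] == "" else INPUT_TEMPLATE
    return template.format_map(row)

# Toy rows in the same shape as the Alpaca records (made-up data).
row_without_input = {"instruction": "Name a primary color.", "input": "", "output": "Red."}
row_with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}

prompt_a = build_prompt(row_without_input)
prompt_b = build_prompt(row_with_input)
print(prompt_b)
```

Both prompts end with `### Response:\n`, so the model's answer always starts at the same position in the template.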

### **End-of-Sequence token (EOS)**
This token is essential because it tells the model when to stop generating text. \
For Llama 2 models, and for the Mistral tokenizer loaded later in this notebook, it is `EOS_TOKEN = "</s>"`

In [18]:
# Append EOS after each response
EOS_TOKEN = "</s>"
outputs = [row["output"] + EOS_TOKEN for row in alpaca]

In [20]:
print(outputs[0])

1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>
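
If this cell is re-run on outputs that already carry the token, the EOS string would be appended twice. A small guard avoids that. This is only a sketch: `append_eos` is a hypothetical helper, not something the notebook defines, and once a tokenizer is loaded its `eos_token` attribute could be passed instead of the hard-coded string.

```python
def append_eos(text, eos_token="</s>"):
    """Append eos_token unless the text already ends with it (hypothetical helper)."""
    return text if text.endswith(eos_token) else text + eos_token

# Toy strings, one of which already ends with the EOS token.
sample_outputs = ["Paris.", "Done.</s>"]
with_eos = [append_eos(t) for t in sample_outputs]
print(with_eos)  # ['Paris.</s>', 'Done.</s>']
```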


**Concatenate instructions and outputs to form dataset**

In [21]:
dataset = [{
    "prompt": s,
    "output": t,
    "example": s + t
} for s, t in zip(prompts, outputs)]

### **Time to tokenize**
We need to convert the dataset into tokens.

In [22]:
from transformers import AutoTokenizer

In [24]:
model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

**Sample tokens**

In [25]:
tokenizer.encode(
    "This sentence is sentenced for tokenization!",
    padding = "max_length",
    max_length = 10,
    return_tensors = "pt"
)

tensor([[    1,   851, 12271,   349,  2662,  4697,   354,  6029,  1837, 28808]])

### **Creating a train-eval split**

In [26]:
import random

# shuffle in-place
random.shuffle(dataset)

In [32]:
train_dataset = dataset[:-1000]
eval_dataset = dataset[-1000:]

train_table = pd.DataFrame(train_dataset)
eval_table = pd.DataFrame(eval_dataset)
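
Because `random.shuffle` is unseeded here, every run produces a different split. If a reproducible split is wanted, one option is to shuffle a copy with a dedicated seeded generator. The sketch below assumes a hypothetical helper `split_train_eval` and an arbitrary seed of 42; neither comes from the original notebook.

```python
import random

def split_train_eval(examples, eval_size, seed=42):
    """Shuffle a copy with a fixed seed, then hold out the last eval_size items."""
    if not 0 < eval_size < len(examples):
        raise ValueError("eval_size must be between 1 and len(examples) - 1")
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private generator, global state unaffected
    return shuffled[:-eval_size], shuffled[-eval_size:]

# Toy data in the same shape as the dataset entries above.
toy = [{"example": f"sample {i}"} for i in range(10)]
train_part, eval_part = split_train_eval(toy, eval_size=3)
print(len(train_part), len(eval_part))  # 7 3
```

With the real data, the equivalent call would be `split_train_eval(dataset, eval_size=1000)`.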

### **Packing: Combine multiple samples into a longer sequence**
> To make training more efficient and take advantage of the long context windows of these LLMs, we use a technique called **"packing"**. \
Instead of feeding examples one at a time, we combine multiple examples until they fill the model's context length.

The main idea here is that the instruction/output samples are short, so let's concatenate a bunch of them together, separated by the EOS token.

In [33]:
max_seq_len = 1024

In [34]:
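def pack(dataset, max_seq_len = 1024):
    """Concatenate all tokenized examples and cut them into max_seq_len-token sequences."""
    tkds_ids = tokenizer([s["example"] for s in dataset])["input_ids"]
    all_token_ids = []
    packed_ds = []

    # Build one long token stream, with an EOS token id after every example
    for tokenized_input in tkds_ids:
        all_token_ids.extend(tokenized_input + [tokenizer.eos_token_id])

    # Take chunks of max_seq_len + 1 tokens so inputs and labels can be offset by one
    for i in range(0, len(all_token_ids), max_seq_len+1):
        input_ids = all_token_ids[i : i + max_seq_len+1]

        # Keep full chunks only; the shorter final chunk is dropped
        if len(input_ids) == (max_seq_len+1):
            packed_ds.append({ "input_ids": input_ids[:-1], "labels": input_ids[1:] })

    return packed_ds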

In [35]:
train_ds_packed = pack(train_dataset, max_seq_len)
eval_ds_packed = pack(eval_dataset, max_seq_len)
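
To make the chunking and the one-token shift easier to follow, here is a toy version of the same logic that works on plain integer lists, so no tokenizer is needed. The name `pack_token_ids`, the token ids, and `eos_id=0` are all made up for illustration.

```python
def pack_token_ids(sequences, eos_id, max_seq_len):
    """Concatenate sequences with eos_id after each one, then cut fixed-size blocks."""
    stream = []
    for seq in sequences:
        stream.extend(seq + [eos_id])

    block = max_seq_len + 1  # one extra token so inputs and labels can be shifted by one
    packed = []
    for start in range(0, len(stream), block):
        chunk = stream[start:start + block]
        if len(chunk) == block:  # the incomplete tail is dropped
            packed.append({"input_ids": chunk[:-1], "labels": chunk[1:]})
    return packed

# Three toy "tokenized examples"; 0 plays the role of the EOS id.
toy_sequences = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
packed = pack_token_ids(toy_sequences, eos_id=0, max_seq_len=4)
for item in packed:
    print(item)
# {'input_ids': [11, 12, 13, 0], 'labels': [12, 13, 0, 21]}
# {'input_ids': [22, 0, 31, 32], 'labels': [0, 31, 32, 33]}
```

The stream has 12 tokens and each block takes 5, so two full blocks are produced and the last 2 tokens are discarded. Note also that a packed sequence can start or end in the middle of an example, as the second block does.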

## **Storing our preprocessed datasets**

In [36]:
def save_jsonl(data, filename):
    with open(filename, "w") as file:
        for entry in data:
            json.dump(entry, file)
            file.write("\n")

In [37]:
save_jsonl(train_ds_packed, "train_packed_alpaca.jsonl")
save_jsonl(eval_ds_packed, "eval_packed_alpaca.jsonl")
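
When the files are needed again, for example in a separate training notebook, they can be read back line by line. The sketch below pairs `save_jsonl` with a hypothetical `load_jsonl` counterpart and checks the round trip on a toy record in a temporary directory; the file name `toy_packed.jsonl` is an assumption made for the example.

```python
import json
import os
import tempfile

def save_jsonl(data, filename):
    # Same writer as above: one JSON object per line.
    with open(filename, "w") as file:
        for entry in data:
            json.dump(entry, file)
            file.write("\n")

def load_jsonl(filename):
    """Read one JSON object per non-empty line (hypothetical counterpart to save_jsonl)."""
    with open(filename, "r") as file:
        return [json.loads(line) for line in file if line.strip()]

# Toy record in the same shape as a packed sample.
records = [{"input_ids": [1, 2, 3], "labels": [2, 3, 4]}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_packed.jsonl")
    save_jsonl(records, path)
    restored = load_jsonl(path)

print(restored == records)  # True
```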