# **Fine-tune LLM on custom dataset**

Article: [part 1](https://wandb.ai/capecape/alpaca_ft/reports/How-to-implement-fine-tuning-of-an-LLM-Part-1-Dataset-for-Instruction-Tuning--Vmlldzo1NTcxNzE2) [part 2](https://wandb.ai/capecape/alpaca_ft/reports/How-to-fine-tune-an-LLM-Part-2-Instruction-tuning-Llama-2--Vmlldzo1NjY0MjE1) \
Cleaned dataset in use: [alpaca cleaned](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json)

## **Dependencies**

In [72]:
import pandas as pd

### **Load dataset**

In [73]:
import json

from pprint import pprint

In [74]:
with open("/kaggle/input/alpaca-cleaned/alpaca_data_cleaned.json", "r") as file:
    alpaca = json.load(file)

In [75]:
len(alpaca)

51760

In [76]:
pprint(alpaca[123])

{'input': '',
 'instruction': "Find the synonyms of the following word: 'Tenacious'.",
 'output': "Here are some synonyms for the word 'Tenacious':\n"
           '\n'
           '1. Persistent\n'
           '2. Determined \n'
           '3. Resolute \n'
           '4. Steadfast \n'
           '5. Obstinate\n'
           '6. Persevering\n'
           '7. Unyielding\n'
           '8. Unwavering\n'
           '9. Strong-willed\n'
           '10. Dogged.'}


### **Preprocess the data**
Some examples come with an input and some without, so each case needs its own prompt template.

In [77]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(row)

In [78]:
def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)

In [79]:
print(prompt_no_input(alpaca[123]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the synonyms of the following word: 'Tenacious'.

### Response:



### **We can merge both paths into one**

In [80]:
def create_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)

In [81]:
prompts = [create_prompt(row) for row in alpaca]
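
Only the no-input path is printed above, so here is a minimal, self-contained sketch of the other branch. The row is hypothetical (it is not taken from the dataset), and `build_prompt`, `WITH_INPUT` and `NO_INPUT` are illustrative names that mirror `create_prompt` and the two templates.

```python
# Hypothetical row, NOT from the Alpaca dataset; it only exercises the "input" branch.
row = {
    "instruction": "Translate the sentence into French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}

WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(row):
    # Pick the template by checking whether the row carries an input.
    template = NO_INPUT if row["input"] == "" else WITH_INPUT
    return template.format_map(row)

prompt_text = build_prompt(row)
print(prompt_text)
```

In both cases the prompt ends right after `### Response:`, which is where the model is expected to continue.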

### **End-of-Sequence token (EOS)**
This token is essential because it tells the model when to stop producing text. \
For LLaMA-family tokenizers, including the Mistral tokenizer loaded later, it is `EOS_TOKEN = "</s>"`.

In [82]:
# Append EOS after each response
EOS_TOKEN = "</s>"
outputs = [row["output"] + EOS_TOKEN for row in alpaca]

In [83]:
print(outputs[0])

1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


**Concatenate instructions and outputs to form dataset**

In [84]:
dataset = [{
    "prompt": s,
    "output": t,
    "example": s + t
} for s, t in zip(prompts, outputs)]

### **Time to tokenize**
We need to convert the dataset into tokens.

In [85]:
from transformers import AutoTokenizer

In [86]:
model_id = "mistralai/Mistral-7B-v0.1"
# model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token


**Sample tokens**

In [87]:
tokenizer.encode(
    "This sentence is sentenced for tokenization!",
    padding = "max_length",
    max_length = 10,
    return_tensors = "pt"
)

tensor([[    1,   851, 12271,   349,  2662,  4697,   354,  6029,  1837, 28808]])

### **Creating a train-eval split**

In [88]:
import random

# shuffle in-place
random.shuffle(dataset)

In [89]:
train_dataset = dataset[:-1000]
eval_dataset = dataset[-1000:]

train_table = pd.DataFrame(train_dataset)
eval_table = pd.DataFrame(eval_dataset)
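
`random.shuffle` without a seed gives a different split on every run. If a reproducible split is wanted, one option is a seeded local generator. This is a sketch on a toy list; `split_dataset` and `toy_dataset` are hypothetical names, not part of the notebook.

```python
import random

# Toy stand-in for the real `dataset` list of dicts.
toy_dataset = [{"example": f"sample {i}"} for i in range(10)]

def split_dataset(data, n_eval, seed=42):
    """Return (train, eval) after shuffling a copy of `data` with a fixed seed."""
    rng = random.Random(seed)   # local generator, leaves the global random state alone
    shuffled = list(data)       # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:-n_eval], shuffled[-n_eval:]

train_a, eval_a = split_dataset(toy_dataset, n_eval=3)
train_b, eval_b = split_dataset(toy_dataset, n_eval=3)

print(len(train_a), len(eval_a))  # 7 3
print(eval_a == eval_b)           # True: same seed, same split
```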

### **Packing: Combine multiple samples into a longer sequence**
> To make training more efficient and use the longer context of these LLMs, we'll do something called **"packing"**. \
Instead of feeding examples individually, we combine multiple examples to fill the model's context window.

The main idea is that the instruction/output samples are short, so we concatenate a bunch of them, separated by the EOS token, and cut the resulting token stream into chunks of `max_seq_len + 1` tokens. Each chunk yields `max_seq_len` inputs plus the same tokens shifted by one position as labels; a trailing chunk that is too short is dropped.

In [90]:
max_seq_len = 1024

In [91]:
def pack(dataset, max_seq_len = 1024):
    tkds_ids = tokenizer([s["example"] for s in dataset])["input_ids"]
    all_token_ids = []
    packed_ds = []
    
    for tokenized_input in tkds_ids:
        all_token_ids.extend(tokenized_input + [tokenizer.eos_token_id])

    for i in range(0, len(all_token_ids), max_seq_len+1):
        input_ids = all_token_ids[i : i + max_seq_len+1]
        
        if len(input_ids) == (max_seq_len+1):
            packed_ds.append({ "input_ids": input_ids[:-1], "labels": input_ids[1:] })

    return packed_ds
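
To see what packing does without loading a tokenizer, here is a toy version of the same logic on hand-written token ids. The ids, `EOS_ID = 2` and the name `pack_ids` are assumptions made for illustration only.

```python
EOS_ID = 2  # assumed EOS id for this toy example

def pack_ids(tokenized_examples, max_seq_len, eos_id=EOS_ID):
    # 1) Concatenate every example into one long stream, separated by EOS.
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_id])

    # 2) Cut the stream into chunks of max_seq_len + 1 tokens.
    packed = []
    for i in range(0, len(stream), max_seq_len + 1):
        chunk = stream[i : i + max_seq_len + 1]
        # Keep only full chunks; the short tail is dropped.
        if len(chunk) == max_seq_len + 1:
            # Labels are the inputs shifted left by one token.
            packed.append({"input_ids": chunk[:-1], "labels": chunk[1:]})
    return packed

toy = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = pack_ids(toy, max_seq_len=4)

for sample in packed:
    print(sample)
# {'input_ids': [10, 11, 12, 2], 'labels': [11, 12, 2, 20]}
# {'input_ids': [21, 2, 30, 31], 'labels': [2, 30, 31, 32]}
```

The last two tokens of the stream (`[33, 2]`) do not fill a chunk and are discarded, which is also why a few tokens of the real dataset never reach training.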

In [92]:
train_ds_packed = pack(train_dataset)
eval_ds_packed = pack(eval_dataset)

## **Storing our preprocessed datasets**

In [93]:
def save_jsonl(data, filename):
    with open(filename, "w") as file:
        for entry in data:
            json.dump(entry, file)
            file.write("\n")

In [94]:
save_jsonl(train_ds_packed, "train_packed_alpaca.jsonl")
save_jsonl(eval_ds_packed, "eval_packed_alpaca.jsonl")

# **Loading the preprocessed dataset**

In [95]:
import json

In [96]:
def load_jsonl(filename):
    data = []
    
    with open(filename, "r") as file:
        for line in file:
            data.append(json.loads(line))
    
    return data

In [97]:
train_ds_packed = load_jsonl("/kaggle/working/train_packed_alpaca.jsonl")
eval_ds_packed = load_jsonl("/kaggle/working/eval_packed_alpaca.jsonl")
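
A quick way to check that saving and loading are inverse operations is a round trip through a temporary file. The records below are made up, and the file name `toy_packed.jsonl` is hypothetical.

```python
import json
import os
import tempfile

# Two made-up packed samples.
records = [
    {"input_ids": [1, 2, 3], "labels": [2, 3, 4]},
    {"input_ids": [5, 6, 7], "labels": [6, 7, 8]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_packed.jsonl")

    # Write: one JSON object per line.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read: parse each line back into a dict.
    with open(path, "r") as f:
        restored = [json.loads(line) for line in f]

print(restored == records)  # True
```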

## **Data Loader**

A standard PyTorch `DataLoader`. Every packed example has the same length, so the default collator is enough.

In [98]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

In [99]:
batch_size = 8

In [100]:
train_dataloader = DataLoader(
    train_ds_packed,
    batch_size = batch_size,
    collate_fn = default_data_collator
)

eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size = batch_size,
    collate_fn = default_data_collator,
    shuffle = False
)

In [101]:
b = next(iter(train_dataloader))
b.keys(), b["input_ids"][0][:25], b["labels"][0][:25]

(dict_keys(['input_ids', 'labels']),
 tensor([    1, 20811,   349,   396, 13126,   369, 13966,   264,  3638, 28725,
          5881,  1360,   395,   396,  2787,   369,  5312,  3629,  2758, 28723,
         12018,   264,  2899,   369,  6582]),
 tensor([20811,   349,   396, 13126,   369, 13966,   264,  3638, 28725,  5881,
          1360,   395,   396,  2787,   369,  5312,  3629,  2758, 28723, 12018,
           264,  2899,   369,  6582,  1999]))

# **Training Loop**
We'll train the model naively, as plain next-token prediction: it simply learns to complete the sequence.

We'll also be using `SimpleNamespace` to access attributes with a dot `.` like `config.batch_size` and not `config["batch_size"]`
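
A tiny illustration of that access pattern, using made-up values (`toy_config` is a hypothetical name):

```python
from types import SimpleNamespace

toy_config = SimpleNamespace(batch_size=8, lr=2e-4)
toy_config.epochs = 3            # attributes can be added after creation

print(toy_config.batch_size)     # 8
print(vars(toy_config))          # plain dict view of all attributes
```

`vars(...)` is handy when the whole configuration has to be logged or serialized as a dictionary.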

In [102]:
from types import SimpleNamespace

In [103]:
config = SimpleNamespace(
    model_id = model_id,
    dataset_name = "alpaca-cleaned",
    precision = "bf16",
    n_freeze = 24,        # number of layers we don't train; the 7B model has 32
    lr = 2e-4,
    n_eval_sample = 10,   # number of samples to generate on validation
    max_seq_len = 1024,   # Length of the sequences to pack
    epochs = 3,
    gradient_accumulation_steps = 32 // batch_size,  # micro-batches accumulated per optimizer step
    batch_size = batch_size,
    log_model = False,
    mom = 0.9,         # momentum
    gradient_checkpointing = True,
    freeze_embed = True
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps
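
The arithmetic behind these two settings, worked through with a made-up dataloader length (`batches_per_epoch = 1000` is hypothetical; the real value is `len(train_dataloader)`):

```python
batch_size = 8
gradient_accumulation_steps = 32 // batch_size                   # 4 micro-batches per update
effective_batch_size = batch_size * gradient_accumulation_steps  # 32 sequences per update

epochs = 3
batches_per_epoch = 1000   # hypothetical stand-in for len(train_dataloader)

# One optimizer step happens every `gradient_accumulation_steps` batches.
total_train_steps = epochs * batches_per_epoch // gradient_accumulation_steps

print(gradient_accumulation_steps, effective_batch_size, total_train_steps)  # 4 32 750
```

So the optimizer sees an effective batch of 32 packed sequences even though only 8 fit on the GPU at once.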

**Get a pre-trained model with some config parameters**

In [104]:
import torch

from transformers import AutoModelForCausalLM

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map = 0,
    trust_remote_code = True,
    low_cpu_mem_usage = True,
    torch_dtype = torch.bfloat16,   # matches config.precision = "bf16"
    use_cache = False
)


## **Freezing the Model to Save Memory**
> Training the full model is expensive.

Instead, we will train a subset of the model parameters. This technique was pioneered by [Jeremy Howard and Seb Ruder](<https://arxiv.org/abs/1801.06146>).

**Transformer-based models like Llama are a stack of identical layers on top of each other with a classification layer at the end.** LLaMa 2 (7B) has 32 transformer layers, as does the Mistral 7B model loaded here; we will train only the last 8 (the number of frozen layers is worth experimenting with). You always want to train the classification head, the last layer, which makes the predictions.

![](https://storage.googleapis.com/wandb-production.appspot.com/capecape/images/projects/38233410/0a3d44f0.png)

Before trying any fancy parameter-efficient methods, let's freeze most of the model's layers right after loading it. This saves a lot of memory, because no gradients are computed or stored for the frozen layers.

In [None]:
n_freeze = 24

**Freeze gradients (disable gradients)**

In [None]:
# freeze everything, then re-enable the LM head and the last (32 - n_freeze) layers
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters(): param.requires_grad = True

**We can gain a little more memory by freezing the embeddings too.**

In [None]:
if config.freeze_embed:
    model.model.embed_tokens.requires_grad_(False)
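
The intended end state (head and last 8 layers trainable, everything else frozen) can be sanity-checked on a toy structure. The sketch below uses plain-Python stand-ins rather than a real model: the `Param` class and all sizes are hypothetical and only track a `requires_grad` flag.

```python
class Param:
    """Hypothetical stand-in for a tensor parameter; only tracks `requires_grad`."""
    def __init__(self, size):
        self.size = size
        self.requires_grad = True

n_layers, n_freeze = 32, 24
embed = [Param(50)]
layers = [[Param(100)] for _ in range(n_layers)]   # one toy parameter per layer
lm_head = [Param(50)]
all_params = embed + [p for layer in layers for p in layer] + lm_head

# 1) Freeze everything.
for p in all_params:
    p.requires_grad = False
# 2) Re-enable the classification head.
for p in lm_head:
    p.requires_grad = True
# 3) Re-enable the layers after the first `n_freeze`.
for layer in layers[n_freeze:]:
    for p in layer:
        p.requires_grad = True

trainable = sum(p.size for p in all_params if p.requires_grad)
total = sum(p.size for p in all_params)
print(trainable, total)  # 850 3300
```

On the real model, the same count can be made by summing `p.numel()` over `model.parameters()` and filtering on `p.requires_grad`.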

**We can also use gradient checkpointing to save even more memory, at the cost of slower training.**

In [None]:
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs = {
        "use_reentrant": False
    })

## **Optimizer and Scheduler**
`Adam` and a cosine schedule with warmup are safe starting points. The training loop runs under `bfloat16` autocast to make good use of the tensor cores available on modern NVIDIA GPUs, and cross-entropy is the loss function.

In [None]:
from transformers import get_cosine_schedule_with_warmup

In [None]:
optim = torch.optim.Adam(
    model.parameters(),
    lr = config.lr,
    betas = (0.9, 0.99),
    eps = 1e-5
)

In [None]:
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps = config.total_train_steps,
    num_warmup_steps = config.total_train_steps // 10
)
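
The shape of this schedule is easy to reproduce by hand: a linear ramp during warmup, then half a cosine wave down to zero. The function below is a plain-Python sketch meant to mirror the default behaviour of `get_cosine_schedule_with_warmup`; the name `cosine_with_warmup` and the toy step counts are assumptions.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps):
    """LR multiplier in [0, 1]: linear warmup, then half-cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

total_steps = 100                  # toy value
warmup_steps = total_steps // 10   # same 10% warmup as above
base_lr = 2e-4

for step in (0, 5, 10, 55, 100):
    print(step, base_lr * cosine_with_warmup(step, warmup_steps, total_steps))
```

The multiplier starts at 0, reaches 1 at the end of warmup, passes through roughly 0.5 halfway through the decay and ends at 0.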

In [None]:
def loss_fn(x, y):
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))
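
For intuition, the quantity `loss_fn` averages over all positions is the negative log-probability that the model assigns to the correct next token. A pure-Python sketch on made-up logits with a 3-token vocabulary (all names and numbers here are illustrative):

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Two positions, vocabulary of 3 tokens (made-up numbers).
position_logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
labels = [0, 2]

losses = [cross_entropy(l, y) for l, y in zip(position_logits, labels)]
mean_loss = sum(losses) / len(losses)

# A model that knows nothing (uniform logits) scores log(vocab_size).
uniform = cross_entropy([0.0, 0.0, 0.0], 1)

print(mean_loss, uniform)
```

A useful reference point follows from the last line: an untrained model over a vocabulary of size V would sit near `log(V)`, and training should push the loss well below that.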

### **Sampling from the model**
A simple function to sample from the model now and then lets us see what it is actually generating.

In [None]:
from transformers import GenerationConfig

In [None]:
gen_config = GenerationConfig.from_pretrained(config.model_id)

In [None]:
def generate(prompt, max_new_tokens = 100, gen_config = gen_config):
    with torch.inference_mode():
        tokenized_prompt = tokenizer(prompt, return_tensors = "pt")["input_ids"].cuda()
        
        output = model.generate(
            tokenized_prompt,
            max_new_tokens = max_new_tokens,
            generation_config = gen_config
        )
    
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens = True)

In [None]:
prompt = eval_dataset[14]["prompt"]
print(prompt + generate(prompt, 128))

In [None]:
class Accuracy:
    def __init__(self):
        self.count = 0
        self.tp = 0.
    
    def update(self, logits, labels):
        logits, labels = logits.argmax(dim = -1).view(-1).cpu(), labels.view(-1).cpu()
        tp = (logits == labels).sum()
        self.count += len(logits)
        self.tp += tp
        
        return tp / len(logits)
    
    def compute(self):
        return self.tp / self.count
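
The metric tracked by `Accuracy` is token-level: the fraction of positions where the highest-scoring token equals the label. The same bookkeeping in plain Python, on made-up logits (the class name `RunningAccuracy` is hypothetical):

```python
class RunningAccuracy:
    """Plain-Python sketch of a running token-level accuracy."""
    def __init__(self):
        self.count = 0
        self.correct = 0

    def update(self, logits, labels):
        # argmax over the vocabulary axis for every position
        preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
        hits = sum(int(p == y) for p, y in zip(preds, labels))
        self.count += len(labels)
        self.correct += hits
        return hits / len(labels)   # accuracy of this batch only

    def compute(self):
        return self.correct / self.count   # accuracy over everything seen so far

toy_acc = RunningAccuracy()
batch_1 = toy_acc.update([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0])  # 2 of 3 correct
batch_2 = toy_acc.update([[0.6, 0.4]], [0])                                # 1 of 1 correct

print(batch_1, batch_2, toy_acc.compute())
```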

In [None]:
def to_gpu(tensor_dict):
    return { k: v.to('cuda') for k, v in tensor_dict.items() }

### **Save the model checkpoints**

In [None]:
from pathlib import Path

In [None]:
def save_model(model, model_name, models_folder = "/kaggle/working"):
    file_name = Path(models_folder) / model_name
    file_name.parent.mkdir(parents = True, exist_ok = True)
    model.save_pretrained(file_name, safe_serialization = True)
    
    # save the tokenizer next to the weights for easy inference
    tokenizer = AutoTokenizer.from_pretrained(model.name_or_path)
    tokenizer.save_pretrained(file_name)

#### **Training**

In [None]:
from tqdm.auto import tqdm

In [None]:
acc = Accuracy()
model.train()
train_step = 0
pbar = tqdm(total = config.total_train_steps)

In [None]:
for epoch in range(config.epochs):   # total_train_steps and the scheduler assume config.epochs
    for step, batch in enumerate(train_dataloader):
        batch = to_gpu(batch)
        
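        with torch.amp.autocast("cuda", dtype = torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps
        
        # backward runs outside the autocast context, as PyTorch recommends
        loss.backward()
        
        # update once every `gradient_accumulation_steps` micro-batches
        if (step + 1) % config.gradient_accumulation_steps == 0: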
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none = True)
            train_step += 1
            pbar.update(1)
            
pbar.close()

In [None]:
save_model(model, model_name = config.model_id.replace("/", "_"))