# Fine-tuning LLMs

Originally based on https://github.com/ShawhinT/YouTube-Blog/tree/main/LLMs/fine-tuning  
It could not be forked on its own, since it is just one notebook in a large collection of projects in a single git repository.

The following is a short tutorial on fine-tuning an LLM, mainly using Hugging Face (HF) helper functions and PyTorch.

The **datasets** import is from HF https://pypi.org/project/datasets/  
This gives access to HF public datasets and your own uploaded datasets.  
Create and load datasets: https://huggingface.co/docs/datasets/upload_dataset  
Find more datasets: https://huggingface.co/datasets  

The **transformers** import is also from HF https://pypi.org/project/transformers/  
Similar to the datasets package, this gives you access to HF-hosted models, and just as with datasets you can upload and host your own.  
Upload and share models: https://huggingface.co/docs/hub/en/models-uploading  
Find more models: https://huggingface.co/models  


In [None]:
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig, 
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np

### Dataset

In [None]:
# # how dataset was generated

# # load imdb data
# imdb_dataset = load_dataset("imdb")

# # define subsample size
# N = 1000 
# # generate indexes for random subsample
# rand_idx = np.random.randint(24999, size=N) 

# # extract train and test data
# x_train = imdb_dataset['train'][rand_idx]['text']
# y_train = imdb_dataset['train'][rand_idx]['label']

# x_test = imdb_dataset['test'][rand_idx]['text']
# y_test = imdb_dataset['test'][rand_idx]['label']

# # create new dataset
# dataset = DatasetDict({'train':Dataset.from_dict({'label':y_train,'text':x_train}),
#                              'validation':Dataset.from_dict({'label':y_test,'text':x_test})})
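
Note that `np.random.randint` samples with replacement, so the 1000 indices in the commented code above can contain duplicates. A minimal sketch of drawing unique indices instead, assuming NumPy's `Generator.choice`. The seed and the `POOL_SIZE` name are illustrative and not part of the original notebook:

```python
import numpy as np

N = 1000            # subsample size, as in the commented code above
POOL_SIZE = 25000   # number of rows in each IMDB split (train and test)

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility
# replace=False guarantees that every index appears at most once
rand_idx = rng.choice(POOL_SIZE, size=N, replace=False)

print(len(rand_idx), len(set(rand_idx.tolist())))
```

Whether duplicates matter for a quick demo like this one is a judgment call; the hosted `shawhin/imdb-truncated` dataset is used as-is in the rest of the notebook.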

In [None]:
# load dataset
dataset = load_dataset('shawhin/imdb-truncated')
dataset

### Model

**distilbert-base-uncased**  
66M parameters  
"This model is a distilled version of the BERT base model"  
"The model was trained on 8 16 GB V100 for 90 hours"  
https://huggingface.co/distilbert/distilbert-base-uncased  

**AutoModelForSequenceClassification.from_pretrained()**  
In this case, this downloads the model from the HF model repo and caches it locally.

If you want to clone models and run from local instead:  
go to: https://huggingface.co/bert-base-uncased?clone=true  (For this example)  
Clone the model, then do similar but with local path:  

_model_checkpoint = "/path/to/your/local/model"  
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint ..._  



In [None]:
model_checkpoint = 'distilbert-base-uncased'

# define label maps
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative":0, "Positive":1}

# generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)

In [None]:
# display architecture
model

### Preprocess data

In [None]:
# create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

In [None]:
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs
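
Setting `truncation_side = "left"` means that over-long reviews lose tokens from the start and keep the end. A rough pure-Python sketch of the difference, using a hypothetical `truncate` helper on a plain list of token IDs (the real tokenizer also accounts for special tokens, which this sketch ignores):

```python
def truncate(token_ids, max_length, side="right"):
    """Hypothetical helper: keep at most max_length tokens from one side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        # drop tokens from the beginning, keep the tail
        return list(token_ids[len(token_ids) - max_length:])
    # drop tokens from the end, keep the head
    return list(token_ids[:max_length])


ids = list(range(10))
left = truncate(ids, 4, side="left")
right = truncate(ids, 4, side="right")
print(left)
print(right)
```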

In [None]:
# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

In [None]:
# create data collator
# this will dynamically pad examples in each batch to be equal length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
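
To illustrate what "dynamically pad" means here, the following is a minimal pure-Python sketch, not the actual `DataCollatorWithPadding` implementation. The `pad_batch` helper, the token IDs and the pad ID of 0 are assumptions made for the example:

```python
def pad_batch(sequences, pad_id=0):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2009, 102], [101, 2009, 2001, 2204, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is decided per batch rather than once for the whole dataset, short batches stay short, which saves compute compared to padding everything to `max_length`.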

### Evaluation

**evaluate** is another package from HF  
https://pypi.org/project/evaluate/  


In [None]:
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

In [None]:
# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

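    # accuracy.compute() already returns a dict of the form {"accuracy": value},
    # so return it directly instead of nesting it in another dict
    return accuracy.compute(predictions=predictions, references=labels)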
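
As a sanity check of what the metric function computes, here is a small NumPy-only sketch with made-up logits and labels. It reproduces the argmax-then-compare logic by hand and does not call the `evaluate` package:

```python
import numpy as np

# made-up logits for 4 examples and 2 classes (0 = Negative, 1 = Positive)
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

# index of the largest logit in each row is the predicted class
predictions = np.argmax(logits, axis=1)
# accuracy is the fraction of predictions that match the labels
accuracy_value = float((predictions == labels).mean())

print(predictions, accuracy_value)
```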

### Apply untrained model to text

In [None]:
# define list of examples
text_list = [
    "It was good.", 
    "Not a fan, don't recommend.", 
    "Better than the first one.", 
    "This is not worth watching even once.", 
    "This one is a pass."
]

print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text; return_tensors="pt" returns PyTorch tensors, see
    # https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.return_tensors
    inputs = tokenizer.encode(text, return_tensors="pt")
    
    # compute logits
    logits = model(inputs).logits
    
    # convert logits to label
    predictions = torch.argmax(logits)

    print(text + " - " + id2label[predictions.tolist()])

### Train model

#### PEFT - Parameter-Efficient Fine-Tuning
https://huggingface.co/docs/peft/en/index  
Another useful HF package!

The package contains LoRA and similar algorithms used for fine-tuning models on more modest hardware.  
https://huggingface.co/docs/peft/main/en/package_reference/lora  

Conceptual explanations for some of them can be found here:  
https://huggingface.co/docs/peft/en/conceptual_guides/adapter  

All methods are listed in the left side panel under **_API REFERENCES / ADAPTERS_**

See papers related to algos in PEFT:  
https://huggingface.co/collections/PEFT/peft-papers-6573a1a95da75f987fb873ad







In [None]:
peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=4,
                        lora_alpha=32,
                        lora_dropout=0.01,
                        target_modules = ['q_lin'])

**r** - **Rank of the trainable LoRA matrices. The update is the product of a d×r matrix B and an r×k matrix A, so B·A has the same d×k shape as the original weight W.**  
**lora_alpha** - The alpha parameter for LoRA scaling. **The update is added to the original W with a scaling factor: W + (lora_alpha / r) · (B·A)**  
**lora_dropout** - The dropout probability for LoRA layers.  
**target_modules** - The names of the modules to apply LoRA to.  
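
The arithmetic behind these parameters can be sketched with plain NumPy. This is an illustration under stated assumptions, not how PEFT is implemented internally: the low-rank factors are written as B (d×r) and A (r×k), the 768×768 weight shape is assumed for a single projection layer, and the initialisation values are made up for the example.

```python
import numpy as np

d, k = 768, 768          # assumed shape of one projection weight
r, lora_alpha = 4, 32    # values from the LoraConfig above

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weight (random stand-in)
A = rng.normal(size=(r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init so the update starts at 0

scaling = lora_alpha / r             # 32 / 4
W_effective = W + scaling * (B @ A)  # low-rank update has the same shape as W

lora_params = A.size + B.size        # parameters trained for this one layer
print(W_effective.shape, scaling, lora_params, W.size)
```

With r=4 the two factors hold 6,144 values, against 589,824 in the full 768×768 matrix, which is why only a small fraction of the network needs gradients.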

In [None]:
peft_config

In [None]:
# Create a trainable model from the frozen model + PEFT config
model = get_peft_model(model, peft_config)

# Print the number of trainable parameters vs. all parameters
model.print_trainable_parameters()

# Print trainable layers
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)

In [None]:
# hyperparameters
lr = 1e-3
batch_size = 4
num_epochs = 10

In [None]:
# define training arguments
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [None]:
# create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics,
)

# On my personal PC with 2x RTX 3090 this takes less than 5 min.

# On my laptop with an i7-1355U (10 cores) the estimate was 1h30m,
# but after interrupting at 52 batches, the printed predictions were still 5/5 correct!

# So retraining about 1% of the network's parameters for 52 batches (2% of the intended run) was enough to show that it works!

# train model
trainer.train()
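
The "2% of intended" figure in the comments above can be checked with a little arithmetic. This sketch assumes 1000 training examples (N = 1000 in the commented dataset-generation code) and a single device; with several GPUs the effective batch size grows and the step count shrinks accordingly:

```python
import math

n_train = 1000     # assumed number of training examples
batch_size = 4     # per-device batch size from the hyperparameters cell
num_epochs = 10
n_devices = 1      # assumed; e.g. a CPU-only laptop

steps_per_epoch = math.ceil(n_train / (batch_size * n_devices))
total_steps = steps_per_epoch * num_epochs
fraction_done = 52 / total_steps   # share of the run completed after 52 batches

print(steps_per_epoch, total_steps, round(fraction_done * 100, 1))
```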

### Generate prediction

In [None]:
model.to('cpu')

print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt")

    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])

### Option 1: push model to hub

In [None]:
# option 1: notebook login
from huggingface_hub import notebook_login
notebook_login() # ensure token gives write access

# # option 2: key login
# from huggingface_hub import login
# write_key = 'hf_' # paste token here
# login(write_key)

hf_name = 'TODO' # your hf username or org name
model_id = hf_name + "/" + model_checkpoint + "-lora-text-classification" # you can name the model whatever you want

model.push_to_hub(model_id) # save model
trainer.push_to_hub(model_id) # save trainer; note that the first argument of Trainer.push_to_hub is used as the commit message

### Option 2: save locally

In [None]:
# Specify the local directory where you want to save the model
local_model_path = "/path/to/save/model"

# Save the model locally
model.save_pretrained(local_model_path)