<a href="https://colab.research.google.com/github/lehai-ml/fine-tune-llm/blob/main/fine-tune-llm-unsloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning LLMs

LLMs currently have billions of parameters. I'm interested in understanding how an LLM can be fine-tuned for a specific task. Fully fine-tuning an LLM (updating every weight) is resource-intensive and, if not done carefully, can cause the model to lose its base language understanding.

Parameter-efficient fine-tuning (PEFT) is a class of fine-tuning methods that train only a small number of parameters. Low-Rank Adaptation (LoRA) is a PEFT method that can adapt a model with minimal computational resources. PEFT methods can be divided into three subtypes [[1]](https://medium.com/@mujahidabdullahi1992/an-introduction-to-lora-unpacking-the-theory-and-practical-implementation-e665c5d78295):
1. Selective - only a subset of the original weights is fine-tuned.
2. Reparametrisation - a low-dimensional representation of a specific module in the original LLM is created and trained instead.
3. Additive - new modules are added to the pre-trained LLM and trained to incorporate knowledge of the new domain.

## Low-Rank Adaptation (LoRA) method

Put simply, previous research found that a trained LLM contains many redundant parameters and can perform just as well with far fewer effective weights (a property referred to as a low "intrinsic rank").

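If the weight matrix update is $\Delta W \in \mathbb{R}^{d \times k}$, then it can be represented as the product of two lower-rank matrices, $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$, such that $A$ maps into the lower-dimensional space and $B$ is the linear transformation back to the higher dimension, i.e.
$\Delta W = AB$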
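
To make the factorisation concrete, here is a minimal pure-Python sketch. The sizes and matrix entries are toy values I made up for illustration, and `matmul` / `add` are small helper functions, not part of any library. It shows that applying the merged weight $W_0 + AB$ gives the same output as keeping the frozen weight and the two small factors separate, so $\Delta W$ never has to be stored at full size.

```python
# Toy sizes, chosen only for illustration (not taken from any real model).
d, k, r = 4, 3, 2

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W0 = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0]]  # frozen pre-trained weight, d x k
A = [[1, 0], [0, 1], [1, 1], [2, 0]]               # trainable factor, d x r
B = [[1, 2, 0], [0, 1, 1]]                         # trainable factor, r x k

delta_W = matmul(A, B)  # d x k update whose rank is at most r

x = [[1, 2, 3, 4]]  # a single input row vector of length d

# Option 1: merge the update into the weight, then apply it.
merged = matmul(x, add(W0, delta_W))
# Option 2: keep W0 frozen and push x through the two small factors.
factored = add(matmul(x, W0), matmul(matmul(x, A), B))

print(merged == factored)  # True
```

In a real model the same idea would be written with tensors rather than lists; the integer entries here just keep the equality check exact.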

As a concrete example, a 500 x 400 matrix has 200,000 parameters, but if we decompose its update into two lower-rank matrices of 500 x 4 and 4 x 400, we only have to fine-tune 3,600 parameters (500 x 4 + 4 x 400).
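
The arithmetic can be checked with a couple of lines. This is only a counting sketch: the function names are my own, and it assumes the update for a $d \times k$ matrix is factored into a $d \times r$ and an $r \times k$ matrix, so a 500 x 400 update with rank 4 uses factors of shape 500 x 4 and 4 x 400.

```python
def full_param_count(d: int, k: int) -> int:
    """Parameters in a dense d x k weight update."""
    return d * k

def lora_param_count(d: int, k: int, r: int) -> int:
    """Parameters in the two factors: (d x r) plus (r x k)."""
    return d * r + r * k

d, k, r = 500, 400, 4
full = full_param_count(d, k)
low_rank = lora_param_count(d, k, r)

print(full, low_rank, f"{low_rank / full:.1%}")  # 200000 3600 1.8%
```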

In terms of which weights to tune (and which to freeze) and what rank to set, the LoRA authors found that applying LoRA to several attention weight matrices with a low rank can perform better than applying it to a single attention weight matrix with a high rank [[2]](https://arxiv.org/pdf/2106.09685).
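
As I read the paper, this comparison is made under a fixed budget of trainable parameters, so spreading the budget over more matrices forces a lower rank on each. The sketch below only does the bookkeeping; `d_model = 4096` is a hypothetical hidden size and `lora_params` is a helper I made up, assuming square attention projections.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix of shape d_in x d_out."""
    return r * (d_in + d_out)

d_model = 4096  # hypothetical hidden size, for illustration only

# One attention matrix adapted with a higher rank ...
one_matrix_high_rank = 1 * lora_params(d_model, d_model, r=8)
# ... versus four attention matrices (e.g. query, key, value, output) with a lower rank.
four_matrices_low_rank = 4 * lora_params(d_model, d_model, r=2)

print(one_matrix_high_rank, four_matrices_low_rank)  # 65536 65536
```

Both settings train the same number of parameters per layer, so the choice is about where to spend the budget, not how large it is.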

LoRA can be used in several contexts:
1. Text classification: adapting a model to classify text into predefined categories, such as sentiment analysis or topic classification.
2. Question answering: fine-tuning the model to provide accurate answers to questions based on a given context or dataset.
3. Named entity recognition: customising the model to identify and classify entities (like names, dates, locations) in text.
4. Translation: improving the model's translation capabilities.
5. Dialogue systems: tailoring the responses of chatbots.

# Question Answering

Let's start with the BERT question answering model. The point of this model is that, given a passage of text, the user can ask a question and expect a highly accurate answer extracted from that text.
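
This kind of model is extractive: the answer is a span copied out of the passage, not newly written text. The sketch below does not run a model. It hard-codes the `start`, `end`, `answer`, and `score` values reported for the pipeline example in this notebook and shows that the answer is just the context sliced at those character offsets.

```python
# Context from the pipeline example in this notebook.
# "\u0130" is the dotted capital I, written as an escape to keep it a single character.
context = "My name is Merve and I live in \u0130stanbul."

# Result dictionary as reported for that example (hard-coded here, no model is run).
result = {"answer": "\u0130stanbul", "start": 31, "end": 39, "score": 0.953}

# The answer is the slice of the context between the two character offsets.
span = context[result["start"]:result["end"]]
print(span == result["answer"])  # True
```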





In [None]:
from transformers import pipeline

qa_model = pipeline("question-answering")
question = "Where do I live?"
context = "My name is Merve and I live in İstanbul."
qa_model(question = question, context = context)
## {'answer': 'İstanbul', 'end': 39, 'score': 0.953, 'start': 31}


In [None]:
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import warnings
warnings.simplefilter("ignore")

weight_path = "kaporter/bert-base-uncased-finetuned-squad"
# loading tokenizer
tokenizer = BertTokenizer.from_pretrained(weight_path)
#loading the model
model = BertForQuestionAnswering.from_pretrained(weight_path)


In [None]:
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType, get_peft_model
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}


model= AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", id2label=id2label, label2id=label2id)
dataset = load_dataset("rotten_tomatoes")
dataset

In [None]:
!pip install trl

In [None]:
!pip install unsloth

In [None]:
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
from datasets import Dataset
from unsloth import is_bfloat16_supported

# Saving model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

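# Warnings
import warnings
warnings.filterwarnings("ignore")

# Plotting (plt and sns are used by the histogram cells)
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline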


In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("Amod/mental_health_counseling_conversations")


In [None]:
data = dataset['train']

In [None]:
data = dataset['train'].to_pandas()

In [None]:
data['Context_length'] = data['Context'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(data['Context_length'], bins=50, kde=True)
plt.title('Distribution of Context Lengths')
plt.xlabel('Length of Context')
plt.ylabel('Frequency')
plt.show()



In [None]:
filtered_data = data[data['Context_length'] <= 500]

ln_Context = filtered_data['Context'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Context, bins=50, kde=True)
plt.title('Distribution of Context Lengths')
plt.xlabel('Length of Context')
plt.ylabel('Frequency')
plt.show()


In [None]:
ln_Response = filtered_data['Response'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Response, bins=50, kde=True, color='teal')
plt.title('Distribution of Response Lengths')
plt.xlabel('Length of Response')
plt.ylabel('Frequency')
plt.show()


In [None]:
filtered_data = filtered_data[ln_Response <= 1000]

ln_Response = filtered_data['Response'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Response, bins=50, kde=True, color='teal')
plt.title('Distribution of Response Lengths')
plt.xlabel('Length of Response')
plt.ylabel('Frequency')
plt.show()


In [None]:
# Fix the random state so the same 100 rows are sampled on every run
filtered_data_sampled = filtered_data.sample(100, random_state=0)

In [None]:
max_seq_length = 5020
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state = 32,
    loftq_config = None,
)
# print_trainable_parameters() prints the summary itself and returns None
model.print_trainable_parameters()


In [None]:
data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def formatting_prompt(examples):
    inputs       = examples["Context"]
    outputs      = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        text = data_prompt.format(input_, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }


In [None]:
training_data = Dataset.from_pandas(filtered_data_sampled)
training_data = training_data.map(formatting_prompt, batched=True)


In [None]:
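# `!export` only affects a throwaway subshell; %env sets it for the kernel (ideally run this before the model is loaded)
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True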

In [None]:
torch.cuda.empty_cache()

In [None]:
training_data

In [None]:
trainer=SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=training_data,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=10,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()


In [None]:
text="I'm going through some things with my feelings and myself. \
I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. \
I've never tried or contemplated suicide. \
I've always wanted to fix my issues, but I never get around to it. \
How can I change my feeling of being worthless to everyone?"



In [None]:
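# Switch Unsloth to inference mode (called for its side effect on the model)
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    data_prompt.format(
        # input: the text to analyze
        text,
        # response: left empty so the model generates it
        "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
# Drop special tokens such as the EOS marker from the decoded text
answer = tokenizer.batch_decode(outputs, skip_special_tokens = True)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)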
