# Fine-Tuning 1B LLaMA 3.2: A Comprehensive Step-by-Step Guide with Code

Source: 

https://huggingface.co/blog/ImranzamanML/fine-tuning-1b-llama-32-a-comprehensive-article

Resources used:

- **Unsloth** makes fine-tuning large language models (LLMs) more efficient, especially LLaMA and Mistral.

- With Unsloth, we can load models in reduced precision, such as 4-bit quantized or 16-bit weights, to reduce memory usage and speed up both training and inference.

- Unsloth's broad compatibility and customization options let us tailor the quantization process to the specific needs of a product.

- This flexibility comes together with its ability to cut VRAM usage by up to 60%.

https://docs.unsloth.ai/
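
The memory effect of lower precision can be estimated with simple arithmetic. The sketch below is only a rough, weight-only estimate: it assumes a round one billion parameters (not the exact count of the model) and ignores activations, optimizer state and the small overhead of quantization constants.

```python
# Rough weight-only memory estimate; ignores activations, optimizer state
# and the overhead of quantization constants.
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 1_000_000_000  # assumption: a round 1B, not the exact parameter count

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights divides the weight storage by four, which is why a 4-bit checkpoint leaves more room on a small GPU for activations and optimizer state.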


In [None]:
!pip install numpy scipy pandas matplotlib seaborn scikit-learn jupyter notebook ipykernel 
!pip install torch torchvision

In [None]:
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

## Building a Mental Health Chatbot by fine-tuning Llama 3.2

Mental health is a critical aspect of overall well-being, spanning emotional, psychological and social dimensions.

We are going to fine-tune the LLM **Llama 3.2** on a mental health dataset from Hugging Face.

Source:

https://www.llama.com/


### Data Handling and Visualization

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

### LLM model training

In [None]:
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel
from datasets import Dataset
from unsloth import is_bfloat16_supported

# Saving model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Warnings
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

### Calling the dataset

NOTE: REPLACE DATASET BELOW WITH DATASET ON PARIS, TEXAS!!!

In [None]:
data = pd.read_json("hf://datasets/Amod/mental_health_counseling_conversations/combined_dataset.json", lines=True)


### Exploratory data analysis

In [None]:
# length of each context in characters (len() on a string counts characters, not words):
data['Context_length'] = data['Context'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(data['Context_length'], bins=50, kde=True)
plt.title('Distribution of Context Lengths')
plt.xlabel('Length of Context')
plt.ylabel('Frequency')
plt.show()

In [None]:
# keep only contexts of at most 1500 characters:
filtered_data = data[data['Context_length'] <= 1500]

ln_Context = filtered_data['Context'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Context, bins=50, kde=True)
plt.title('Distribution of Context Lengths')
plt.xlabel('Length of Context')
plt.ylabel('Frequency')
plt.show()

In [None]:
# length of each response in characters:
ln_Response = filtered_data['Response'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Response, bins=50, kde=True, color='teal')
plt.title('Distribution of Response Lengths')
plt.xlabel('Length of Response')
plt.ylabel('Frequency')
plt.show()

In [None]:
# keep only responses of at most 4000 characters:
filtered_data = filtered_data[ln_Response <= 4000]

ln_Response = filtered_data['Response'].apply(len)
plt.figure(figsize=(10, 3))
sns.histplot(ln_Response, bins=50, kde=True, color='teal')
plt.title('Distribution of Response Lengths')
plt.xlabel('Length of Response')
plt.ylabel('Frequency')
plt.show()
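
The cells above measure length with `len`, which counts characters rather than words, so the 1500 and 4000 thresholds are character limits. The following sketch shows the difference on toy records; the records are hypothetical stand-ins for dataset rows, and plain Python lists are used instead of pandas to keep it self-contained.

```python
# Toy records standing in for rows of the dataset (hypothetical text).
records = [
    {"Context": "I can't sleep.", "Response": "Let's talk about your routine."},
    {"Context": "x" * 2000, "Response": "This context is too long to keep."},
]

MAX_CONTEXT_CHARS = 1500   # same threshold as in the notebook
MAX_RESPONSE_CHARS = 4000  # same threshold as in the notebook

def char_and_word_length(text):
    """Return (number of characters, number of whitespace-separated words)."""
    return len(text), len(text.split())

# Keep a record only if both fields are within their character limits.
kept = [
    r for r in records
    if len(r["Context"]) <= MAX_CONTEXT_CHARS
    and len(r["Response"]) <= MAX_RESPONSE_CHARS
]

print(char_and_word_length(records[0]["Context"]))
print(len(kept), "of", len(records), "records kept")
```

Neither count is the same as the token count that `max_seq_length` limits later on; that one depends on the tokenizer.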

### Model training

#### Loading the model

We are going to use Llama 3.2 with only 1 billion parameters.

(You can use the 3, 11 or 90 billion version as well.)

- Max Sequence Length:
    We used max_seq_length 5020.

- Loading Llama 3.2 Model:

    - The model and tokenizer are loaded using `FastLanguageModel.from_pretrained` with a specific pre-trained model, "unsloth/Llama-3.2-1B-bnb-4bit".
    - This is optimized for 4-bit precision, which reduces memory usage and increases training speed without significantly compromising performance.  
    - load_in_4bit=True 

- Applying PEFT (Parameter-Efficient Fine-Tuning):

    - We then configure the model with `get_peft_model`, which applies LoRA (Low-Rank Adaptation).
    - LoRA keeps the base weights frozen and trains small low-rank matrices added to selected layers, rather than updating the entire network.
    - This drastically reduces the computational resources needed.

- Parameters:

    - r=16 (rank of the LoRA update matrices)
    - lora_alpha=16 (scaling factor for the LoRA updates)
    - target_modules: the attention projections (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj)
    - use_rslora=True (activates Rank-Stabilized LoRA)
    - use_gradient_checkpointing="unsloth" (reduces memory usage during training)

- Verifying Trainable Parameters:
    We used `model.print_trainable_parameters()`.
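
To get a feel for how small the LoRA update is, the trainable parameter count can be worked out by hand. The sketch below assumes the Llama 3.2 1B layer dimensions from the public model configuration (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of size 64); these numbers are not stated in this notebook, so check them against `model.config` before relying on the result. It also compares the standard LoRA scaling with the rank-stabilized one.

```python
import math

# Assumed Llama 3.2 1B dimensions (from the public model config, not from
# this notebook): hidden 2048, MLP 8192, 16 layers, 8 KV heads of size 64.
HIDDEN, INTERMEDIATE, LAYERS, KV_DIM = 2048, 8192, 16, 8 * 64

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) to every targeted layer."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in TARGETS.values())
    return per_layer * LAYERS

r, lora_alpha = 16, 16
print("trainable LoRA parameters:", lora_params(r))
print("standard LoRA scaling (alpha / r):", lora_alpha / r)
print("rank-stabilized scaling (alpha / sqrt(r)):", lora_alpha / math.sqrt(r))
```

Under these assumptions the adapters add roughly 11 million trainable parameters, about 1% of the base model, which can be compared with the summary printed by `model.print_trainable_parameters()`.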


In [None]:
max_seq_length = 5020
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state = 32,
    loftq_config = None,
)
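# print_trainable_parameters() prints the summary itself and returns None,
# so it should not be wrapped in print().
model.print_trainable_parameters()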

#### Prepare data for model feed

Main points to remember:

- Data Prompt Structure:
The data_prompt is a formatted string template designed to guide the model in analyzing the provided text. It includes placeholders for the input text (the context) and the model's response. This template specifically prompts the model to identify mental health indicators, making it easier to fine-tune the model for mental health-related tasks.

- End-of-Sequence Token:
The EOS_TOKEN is retrieved from the tokenizer and appended to each training example. It marks where a response ends, so the fine-tuned model learns to stop generating instead of continuing indefinitely.

- Formatting Function:
The formatting_prompt function takes a batch of examples and formats them according to the data_prompt. It iterates over the input and output pairs, inserts them into the template and appends the EOS token at the end. The function then returns a dictionary containing the formatted text, ready for model training or evaluation.

- Function Output:
The function outputs a dictionary where the key is "text" and the value is a list of formatted strings. Each string represents a fully prepared prompt for the model, combining the context, response and the structured prompt template.

In [None]:
data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def formatting_prompt(examples):
    inputs       = examples["Context"]
    outputs      = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        text = data_prompt.format(input_, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
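
To see what the formatting step produces without loading a tokenizer, here is a minimal sketch on a toy batch. The shortened template, the `<eos>` string and the example texts are stand-ins for illustration only; the notebook itself uses `data_prompt` and `tokenizer.eos_token`.

```python
# Shortened stand-in for data_prompt; the two {} slots are context and response.
toy_prompt = """### Input:
{}

### Response:
{}"""

TOY_EOS = "<eos>"  # hypothetical; the notebook reads tokenizer.eos_token

def format_batch(examples: dict) -> dict:
    """Mirror formatting_prompt on a dict of equal-length lists."""
    texts = [
        toy_prompt.format(context, response) + TOY_EOS
        for context, response in zip(examples["Context"], examples["Response"])
    ]
    return {"text": texts}

# With batched=True, Dataset.map passes columns as lists, like this dict.
batch = {
    "Context": ["I feel anxious before exams."],
    "Response": ["That is common; let's look at what helps you prepare."],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

Each output string holds the full prompt, the reference response and the end marker, which is the single text field the trainer reads.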


#### Format the data for training

In [None]:
training_data = Dataset.from_pandas(filtered_data)
training_data = training_data.map(formatting_prompt, batched=True)

#### Model training with custom parameters and data

Shell commands (only if necessary):

#sudo apt-get update

#sudo apt-get install build-essential

#### Training setup to start fine tuning

- Trainer Initialization:
We are going to initialize SFTTrainer with the model and tokenizer, as well as the training dataset. 

- Training Arguments:
The TrainingArguments class is used to define key hyperparameters for the training process:

    - learning_rate=3e-4: Sets the learning rate for the optimizer.
    - per_device_train_batch_size=16 with gradient_accumulation_steps=8: Defines the batch size per device and the number of batches accumulated before each optimizer step, giving an effective batch size of 128 per device.
    - num_train_epochs=40: Specifies the number of training epochs.
    - fp16=not is_bfloat16_supported() and bf16=is_bfloat16_supported(): Enable mixed precision training to reduce memory usage, depending on hardware support.
    - optim="adamw_8bit": Uses the 8-bit AdamW optimizer for efficient memory usage.
    - weight_decay=0.01: Applies weight decay as regularization to reduce overfitting.
    - output_dir="output": Specifies the directory where checkpoints and logs will be saved.

- Training Process:

    - Finally, we call the `trainer.train()` method to start the training process.
    - It fine-tunes the model with the parameters defined above, adjusting the weights as it learns from the provided dataset.
    - The trainer also handles data packing and gradient accumulation, optimizing the training pipeline for better performance.

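Settings if necessary (for example, after CUDA out-of-memory errors):

In the shell, before starting the notebook: export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

In Python, to release cached GPU memory: torch.cuda.empty_cache()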
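
If exporting the variable in the shell is inconvenient, the same allocator setting can be applied from inside the notebook. This is a minimal sketch, assuming it runs in the first cell, before PyTorch makes its first CUDA allocation; set later, it may have no effect.

```python
import os

# Must run before the first CUDA allocation; placing it at the top of the
# notebook, before importing torch, is the safe choice.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```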


In [None]:
trainer=SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=training_data,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=16,
        gradient_accumulation_steps=8,
        num_train_epochs=40,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()
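
The trainer call above combines a per-device batch size of 16 with 8 gradient accumulation steps, 10 warmup steps and a linear scheduler. The sketch below works through what that implies on a single GPU. The number of packed training sequences is a hypothetical value, since the notebook does not report it, and the step count is approximate because the trainer's own rounding can differ slightly.

```python
import math

def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values taken from the trainer call above (single GPU assumed).
per_device_batch, grad_accum, epochs = 16, 8, 40
base_lr, warmup_steps = 3e-4, 10

num_sequences = 1_000  # hypothetical: the packed-sequence count is not given

effective_batch = per_device_batch * grad_accum          # sequences per optimizer step
steps_per_epoch = math.ceil(num_sequences / effective_batch)
total_steps = steps_per_epoch * epochs

print("effective batch size:", effective_batch)
print("approximate optimizer steps:", total_steps)
for step in (0, 5, 10, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step, base_lr, warmup_steps, total_steps):.2e}")
```

With `logging_steps=1`, the learning rate logged during training should follow the same shape: a short ramp up to 3e-4, then a steady decline to zero at the final step.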

### Inference

In [None]:
text="I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"

In [None]:
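# Switch the model to Unsloth's inference mode; it is updated in place,
# so the return value does not need to be reassigned.
FastLanguageModel.for_inference(model)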
inputs = tokenizer(
[
    data_prompt.format(
        #instructions
        text,
        #answer
        "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer=tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)
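
The decoded text still contains the prompt and, as the expected response in this notebook shows, the `<|end_of_text|>` marker. Below is a minimal sketch of a clean-up helper; `extract_answer` is a hypothetical name, and the EOS string is assumed from that expected response (in the notebook, `tokenizer.eos_token` could be passed instead).

```python
def extract_answer(decoded: str, marker: str = "### Response:",
                   eos_token: str = "<|end_of_text|>") -> str:
    """Return the text after the last marker, without the EOS token."""
    answer = decoded.split(marker)[-1]
    return answer.replace(eos_token, "").strip()

# Hypothetical decoded output, shaped like the notebook's prompt template.
decoded = "### Input:\nI feel stuck.\n\n### Response:\nThat sounds hard.<|end_of_text|>"
print(extract_answer(decoded))
```

An alternative is to pass `skip_special_tokens=True` to `tokenizer.batch_decode`, which drops special tokens during decoding.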

### Expected response:

"I'm sorry to hear that you are feeling so overwhelmed. It sounds like you are trying to figure out what is going on with you. I would suggest that you see a therapist who specializes in working with people who are struggling with depression. Depression is a common issue that people struggle with. It is important to address the issue of depression in order to improve your quality of life. Depression can lead to other issues such as anxiety, hopelessness, and loss of pleasure in activities. Depression can also lead to thoughts of suicide. If you are thinking of suicide, please call 911 or go to the nearest hospital emergency department. If you are not thinking of suicide, but you are feeling overwhelmed, please call 800-273-8255. This number is free and confidential and you can talk to someone about anything. You can also go to www.suicidepreventionlifeline.org to find a local suicide prevention hotline.<|end_of_text|>"

### Push a fine-tuned model and its tokenizer to the Hugging Face Hub

In [None]:
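# Replace the placeholder with your Hugging Face access token (created in your HF account settings).
os.environ["HF_TOKEN"] = "hugging face token key, you can create from your HF account."
# `token` replaces the deprecated `use_auth_token` argument.
model.push_to_hub("ImranzamanML/1B_finetuned_llama3.2", token=os.getenv("HF_TOKEN"))
tokenizer.push_to_hub("ImranzamanML/1B_finetuned_llama3.2", token=os.getenv("HF_TOKEN"))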
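
Rather than pasting the token into the notebook, it can be read from an environment variable that was set outside of it. The helper below is a hypothetical sketch (`get_hf_token` is not part of any library) that fails early with a clear message when the variable is missing.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hub token from the environment and fail early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} before pushing to the Hub.")
    return token
```

The returned value can then be passed as the token argument of the `push_to_hub` calls, which keeps the secret out of the saved notebook.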

### Save fine-tuned model and its tokenizer locally on the machine.

In [None]:
model.save_pretrained("model/1B_finetuned_llama3.2")
tokenizer.save_pretrained("model/1B_finetuned_llama3.2")

### Conclusions

Loading of saved model for usage:

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="model/1B_finetuned_llama3.2",
    max_seq_length=5020,
    dtype=None,
    load_in_4bit=True,
)

### Outlook

Example:

Tutorial: How to Finetune Llama-3 and Use In Ollama

https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama