Based on the following:

https://www.datacamp.com/tutorial/fine-tuning-llama-3-1

5. Install the necessary Python packages in the Kaggle notebook using the following command:

In [8]:
%pip install -U transformers accelerate

Note: you may need to restart the kernel to use updated packages.


6. Load the model and tokenizer using the Transformers library from the local directory.

7. Create the text generation pipeline with the model, tokenizer, torch type, and device map.

In [16]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import os
import torch
from huggingface_hub import login

# Log in to Hugging Face.
# Read the access token from an environment variable (set HF_TOKEN beforehand)
# instead of hard-coding it in the notebook, where it can leak.
# add_to_git_credential=True also stores the token in the git credential helper,
# so you don't have to log in each time.
login(token=os.environ["HF_TOKEN"], add_to_git_credential=True)

# base_model = "meta-llama/Meta-Llama-3.1-8B"
base_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model)

model = AutoModelForCausalLM.from_pretrained(
        base_model,
        return_dict=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16, # original setting
        # torch_dtype=torch.bfloat16,  # alternative suggested by ChatGPT: same memory as float16, wider dynamic range (needs GPU support)
        device_map="auto",
        trust_remote_code=True,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (manager).
Your token has been saved to C:\Users\USER\.cache\huggingface\token
Login successful


Downloading shards: 100%|██████████| 4/4 [11:27<00:00, 171.91s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.17s/it]
Some parameters are on the meta device because they were offloaded to the cpu.


The part below was suggested by ChatGPT to test the model and the pipeline. It took 45 seconds to run!

Testing the Pipeline: If the model is set up correctly, try generating some text to ensure everything is working:

In [11]:
text = pipe("Once upon a time", max_length=50, num_return_sequences=1)
print(text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'Once upon a time, there was a small country that was surrounded by a big country. The small country was not allowed to have an army, but the big country had a very large army. The big country’s army had many weapons, including'}]


8. Write the message and convert it into the proper prompt using the chat template.

9. Run the pipeline using the prompt and print out the generated output.

In [17]:
messages = [{"role": "user", "content": "What is the tallest building in the world?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt, max_new_tokens=120, do_sample=True)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the tallest building in the world?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The tallest building in the world is the Burj Khalifa, located in Dubai, United Arab Emirates. It stands at a height of 828 meters (2,722 feet) and has 163 floors. The building was completed in 2010 and was designed by the American architectural firm Skidmore, Owings & Merrill.
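
The printed output above contains the whole prompt, special tokens included, followed by the answer. If only the answer is needed, it can be cut out after the last assistant header. The following is a minimal sketch, not part of the original tutorial; the helper name extract_reply is hypothetical, and the header string is copied from the output shown above.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_reply(generated_text: str) -> str:
    """Return only the text that follows the last assistant header."""
    # rpartition returns ("", "", text) when the header is missing,
    # so text without a header is returned unchanged (apart from stripping).
    _, _, reply = generated_text.rpartition(ASSISTANT_HEADER)
    # Drop a trailing end-of-turn marker if the model emitted one.
    return reply.replace("<|eot_id|>", "").strip()


sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is the tallest building in the world?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "The tallest building in the world is the Burj Khalifa."
)
print(extract_reply(sample))
```

In the notebook, this would be called as extract_reply(outputs[0]["generated_text"]).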


Fine-tuning Llama 3.1 on Mental Health Disorder Classification 
Now, we must load the dataset, process it, and fine-tune the Llama 3.1 model. We will also compare the model's performance before and after fine-tuning.

If you are new to LLMs, I recommend you take the Master Large Language Models (LLMs) Concepts course before diving into the fine-tuning part of the tutorial.

1. Setting up
First, we start a new Kaggle notebook and add the Llama 3.1 model, just as we did in the previous section.

We will then install the necessary Python packages as outlined below:

%%capture
%pip install -U bitsandbytes
%pip install -U transformers
%pip install -U accelerate
%pip install -U peft
%pip install -U trl

We can then initiate the Weights and Biases project by using the API key.

In [None]:
import wandb

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

wb_token = user_secrets.get_secret("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune llama-3.1-8b-it on Sentiment Analysis Dataset', 
    job_type="training", 
    anonymous="allow"
)

Next, we need to import all the necessary Python packages and functions. 

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split

2. Loading and processing the dataset
Now it’s time for us to load the dataset, perform data cleaning, and drop three ambiguous categories. 

To simplify things, we drop the "Suicidal" category because Llama 3.1 has safety mechanisms that react to certain triggering words. "Stress" is not considered a mental disorder, and "Personality Disorder" overlaps heavily with "Bipolar Disorder."

As a result, we will be left with only four categories: "Normal," "Depression," "Anxiety," and "Bipolar."

In [None]:
df = pd.read_csv("/kaggle/input/sentiment-analysis-for-mental-health/Combined Data.csv",index_col = "Unnamed: 0")
df.loc[:,'status'] = df.loc[:,'status'].str.replace('Bi-Polar','Bipolar')
df = df[(df.status != "Personality disorder") & (df.status != "Stress") & (df.status != "Suicidal")]
df.head()

To save training time, we will fine-tune the model on only 3000 samples. For that, we will shuffle the dataset and select 3000 rows. 

We will then split the dataset into the train, eval, and test sets for model training and testing. 

We also want to create the “text” column in train and eval sets using the generate_prompt function, which combines the data from the “statement” and “status” columns. 

Finally, we’ll create the “text” column in the test set using the generate_test_prompt function and the y_true using the “status” column. We will use it to generate the model evaluation report, as shown below.

In [None]:
# Shuffle the DataFrame and select only 3000 rows
df = df.sample(frac=1, random_state=85).reset_index(drop=True).head(3000)

# Split the DataFrame
train_size = 0.8
eval_size = 0.1

# Calculate sizes
train_end = int(train_size * len(df))
eval_end = train_end + int(eval_size * len(df))

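# Split the data (copy each slice so that adding the "text" column later
# does not trigger pandas' SettingWithCopyWarning)
X_train = df[:train_end].copy()
X_eval = df[train_end:eval_end].copy()
X_test = df[eval_end:].copy()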

# Define the prompt generation functions
def generate_prompt(data_point):
    return f"""
            Classify the text into Normal, Depression, Anxiety, Bipolar, and return the answer as the corresponding mental health disorder label.
text: {data_point["statement"]}
label: {data_point["status"]}""".strip()

def generate_test_prompt(data_point):
    return f"""
            Classify the text into Normal, Depression, Anxiety, Bipolar, and return the answer as the corresponding mental health disorder label.
text: {data_point["statement"]}
label: """.strip()

# Generate prompts for training and evaluation data
X_train.loc[:,'text'] = X_train.apply(generate_prompt, axis=1)
X_eval.loc[:,'text'] = X_eval.apply(generate_prompt, axis=1)

# Generate test prompts and extract true labels
y_true = X_test.loc[:,'status']
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])
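
As a quick check on the split arithmetic in the cell above, the sizes of the three sets can be worked out by hand. This is a small sketch that assumes the filtered dataset still has at least 3000 rows, so that head(3000) really returns 3000 rows.

```python
n_rows = 3000          # rows kept after shuffling
train_size = 0.8
eval_size = 0.1

# Same index arithmetic as in the split cell
train_end = int(train_size * n_rows)
eval_end = train_end + int(eval_size * n_rows)

n_train = train_end              # rows [0, train_end)
n_eval = eval_end - train_end    # rows [train_end, eval_end)
n_test = n_rows - eval_end       # the remaining rows

print(n_train, n_eval, n_test)
```

The test set simply receives whatever is left after the train and eval slices, so its size follows from the other two.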

At this point, we want to check the distribution of categories in the train set. 

In [None]:
X_train.status.value_counts()

Next, we convert the train and eval sets into Hugging Face datasets.

In [None]:
# Convert to datasets
train_data = Dataset.from_pandas(X_train[["text"]])
eval_data = Dataset.from_pandas(X_eval[["text"]])

Then, we display the 4th sample from the “text” column.

train_data['text'][3]
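
To see what such a sample looks like without loading the dataset, the prompt function can be applied to a plain dictionary. This is a sketch with a made-up row; the statement text is hypothetical and not taken from the dataset.

```python
def generate_prompt(data_point):
    # Same template as in the tutorial; strip() removes the leading
    # newline and indentation, so the prompt starts at "Classify".
    return f"""
            Classify the text into Normal, Depression, Anxiety, Bipolar, and return the answer as the corresponding mental health disorder label.
text: {data_point['statement']}
label: {data_point['status']}""".strip()


# Hypothetical row, for illustration only
row = {"statement": "I keep worrying about everything.", "status": "Anxiety"}
prompt = generate_prompt(row)
print(prompt)
```

The result has three lines: the instruction, the text, and the label. The test prompt is identical except that the label line is left empty for the model to complete.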

3. Loading the model and tokenizer
Next, we load the Llama-3.1-8b-instruct model with 4-bit quantization to save GPU memory.

We will then load the tokenizer and set the pad token id. 

In [None]:
base_model_name = "/kaggle/input/llama-3.1/transformers/8b-instruct/1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id
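
A rough back-of-the-envelope estimate shows why 4-bit loading matters. The sketch below assumes a round figure of 8 billion parameters and counts weights only; it ignores activations, the LoRA adapter, optimizer state, and the fact that some layers (for example the embeddings) stay in 16 bit, so real usage will be higher.

```python
n_params = 8e9           # assumption: round parameter count for the 8B model

bytes_per_param_fp16 = 2     # 16 bits
bytes_per_param_4bit = 0.5   # 4 bits

gb_fp16 = n_params * bytes_per_param_fp16 / 1e9
gb_4bit = n_params * bytes_per_param_4bit / 1e9

print(f"float16 weights: {gb_fp16:.1f} GB")  # float16 weights: 16.0 GB
print(f"4-bit weights:   {gb_4bit:.1f} GB")  # 4-bit weights:   4.0 GB
```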

4. Model evaluation before fine-tuning
Here, we create the predict function, which will use the text generation pipeline to predict labels from the “text” column. Running the function will return a list of mental disorder categories based on various samples in the testing set. 

In [None]:
def predict(test, model, tokenizer):
    y_pred = []
    categories = ["Normal", "Depression", "Anxiety", "Bipolar"]

    # Build the pipeline once; re-creating it for every sample is wasted work
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens=2, 
                    temperature=0.1)

    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        result = pipe(prompt)
        answer = result[0]['generated_text'].split("label:")[-1].strip()
        
        # Determine the predicted category
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")
    
    return y_pred

y_pred = predict(X_test, model, tokenizer)
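
The label-matching step inside predict can be tried in isolation. This is a small sketch of the same substring check; match_category is a hypothetical helper name, and the sample answers are made up.

```python
CATEGORIES = ["Normal", "Depression", "Anxiety", "Bipolar"]


def match_category(answer: str) -> str:
    """Return the first category whose name occurs in the answer, else 'none'."""
    for category in CATEGORIES:
        if category.lower() in answer.lower():
            return category
    # No category name was found in the generated text
    return "none"


for answer in [" Anxiety\n", "bipolar disorder", "I cannot", "depress"]:
    print(repr(answer), "->", match_category(answer))
```

Because the check is a case-insensitive substring match, a truncated answer such as "depress" falls through to "none". With max_new_tokens=2, this can happen when a label needs more than two tokens.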

Next, we create the evaluate function, which uses the predicted and true labels to calculate the overall accuracy of the model and the accuracy per category, generate a classification report, and print out a confusion matrix. Running the function gives us a detailed model evaluation summary.

In [None]:
def evaluate(y_true, y_pred):
    labels = ["Normal", "Depression", "Anxiety", "Bipolar"]
    mapping = {label: idx for idx, label in enumerate(labels)}
    
    def map_func(x):
        return mapping.get(x, -1)  # Unparsed predictions ("none") map to -1 and count as errors
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true_mapped)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))))
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=list(range(len(labels))))
    print('\nConfusion Matrix:')
    print(conf_matrix)

evaluate(y_true, y_pred)
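
The per-category accuracy printed by evaluate is computed only over the samples whose true label is that category, which makes it the same quantity as per-class recall. A minimal pure-Python sketch with made-up labels (not results from the model) illustrates this, including how a "none" prediction counts as an error.

```python
from collections import Counter

# Hypothetical labels, for illustration only
y_true_demo = ["Normal", "Normal", "Depression", "Anxiety", "Bipolar", "Depression"]
y_pred_demo = ["Normal", "none", "Depression", "Depression", "Bipolar", "Normal"]


def per_label_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


overall = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print("overall:", overall)
print(per_label_accuracy(y_true_demo, y_pred_demo))
```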

5. Building the model
When building the model, we start by extracting the linear module names from the model using the bitsandbytes library. 

We then configure LoRA using the target modules, task type, and other arguments before setting up training arguments. These training arguments are optimized for the Kaggle notebook. You might need to change them if you run the notebook locally. 

We will then create the model trainer using training arguments, a model, a tokenizer, a LoRA configuration, and a dataset. 

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
modules = find_all_linear_names(model)
modules

In [None]:
['down_proj', 'gate_proj', 'o_proj', 'v_proj', 'up_proj', 'q_proj', 'k_proj']

In [None]:
output_dir="llama-3.1-fine-tuned-model"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=1,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    logging_steps=1,                         
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="wandb",                        # report metrics to w&b
    eval_strategy="steps",                    # run evaluation every eval_steps
    eval_steps=0.2,                           # a float below 1 is read as a fraction of the total training steps
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=512,
    packing=False,
    dataset_kwargs={
    "add_special_tokens": False,
    "append_concat_token": False,
    }
)
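
To get a feel for how many weights this LoRA configuration trains, we can count the adapter parameters by hand. For a linear layer with input size d_in and output size d_out, LoRA adds two matrices with r * (d_in + d_out) parameters in total. The layer sizes below (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers) are assumptions based on the published Llama 3.1 8B configuration; check model.config before relying on them.

```python
# Assumed Llama 3.1 8B dimensions (verify against model.config)
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 4096, 14336, 1024, 32

# LoRA settings from the configuration above
R, ALPHA = 64, 16

# (d_in, d_out) of each targeted linear module in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter holds A (r x d_in) and B (d_out x r)
per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")  # trainable LoRA parameters: 167,772,160
print(f"LoRA scaling (alpha / r): {scaling}")   # LoRA scaling (alpha / r): 0.25
```

Under these assumptions, the adapter has roughly 168 million trainable parameters, about 2% of the 8 billion frozen base weights.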

6. Model training
It’s now time to initiate the model training:

In [None]:
trainer.train()
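
Before waiting for the run, it helps to estimate how long the schedule is. The sketch below assumes 2400 training rows (80% of 3000) and a single training process; the Trainer's own rounding may differ by a step, so treat the numbers as approximate.

```python
n_train_rows = 2400          # assumption: 80% of the 3000 sampled rows
per_device_batch_size = 1
grad_accum_steps = 8
num_epochs = 1
eval_fraction = 0.2          # eval_steps given as a fraction of all steps
warmup_ratio = 0.03

# Samples contributing to each optimizer update
effective_batch = per_device_batch_size * grad_accum_steps

# Optimizer updates over the whole run
batches_per_epoch = n_train_rows // per_device_batch_size
optimizer_steps = (batches_per_epoch // grad_accum_steps) * num_epochs

eval_every = round(optimizer_steps * eval_fraction)
warmup_steps = round(optimizer_steps * warmup_ratio)

print(effective_batch, optimizer_steps, eval_every, warmup_steps)
```

So one epoch amounts to about 300 optimizer updates, with an evaluation pass roughly every 60 updates and a warmup of about 9 updates.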

Next, we finish the Weights and Biases run and re-enable the model cache.

In [None]:
wandb.finish()
model.config.use_cache = True

To view a detailed summary, go to your Weights and Biases account and view the run in your browser. It comes with interactive visualizations.

We can then save both the model adapter and tokenizer locally. In the next section, we will use them to merge the adapter with the base model.

In [None]:
# Save trained model and tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)