## Fine-tuning your own LLM

In this tutorial 📚 we fine-tune a model from the Hugging Face Hub to answer questions about the Malawi Technical Guidelines for Integrated Disease Surveillance and Response (TGs for IDSR). Unlike Retrieval-Augmented Generation (RAG), which looks up relevant passages at query time, this approach directly fine-tunes a pre-trained large language model on question-answer pairs. See the first notebook if you want to learn how to implement RAG. 📝


## Requirements 🛠️

Just a basic setup: please use a GPU-enabled environment for training, so you don't spend the whole day on it. I'm using Kaggle since it offers free GPUs; any other free platform should work too. 😬 If you don't have access to a GPU, you can also run this on a CPU, just expect it to be much slower. 🥶

### Setting Up the Environment

We will start by setting up the necessary environment, including installing libraries, importing modules, and downloading the pre-trained LLM from the Hugging Face Hub. 🌱

Note: replace `YOUR_HF_TOKEN` 🔑 with your Hugging Face token. Because the trained model is pushed to the Hub later, the token needs write access. ⚠️

In [3]:
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token ('YOUR_HF_TOKEN')"

In [7]:
!pip install lamini -qq

In [8]:
pip install accelerate -Uqq

Note: you may need to restart the kernel to use updated packages.


In [None]:
!pip install datasets -Uqq

In [2]:
import datasets
import tempfile
import logging
import random
import config
import os
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines

#from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from llama import BasicModelRunner


logger = logging.getLogger(__name__)
global_config = None

###### Looking at the data - Be sure to correct the Paths

Be sure to correct the paths of the train and test data. 🧐


In [3]:
train = pd.read_csv("/kaggle/input/malawi-public-health/Train.csv")

In [4]:
test = pd.read_csv("/kaggle/input/malawi-public-health/Test.csv")
test.head()

| Unnamed: 0 | ID | Question Text |
| --- | --- | --- |
| 0 | Q4 | What is the definition of "unusual event" |
| 1 | Q5 | What is Community Based Surveillance (CBS)? |
| 2 | Q9 | What kind of training should members of VHC re... |
| 3 | Q10 | What is indicator based surveillance (IBS)? |
| 4 | Q13 | What is Case based surveillance? |


In [5]:
train.head()

| Unnamed: 0 | ID | Question Text | Question Answer | Reference Document | Paragraph(s) Number | Keywords |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Q829 | Compare the laboratory confirmation methods fo... | Chikungunya is confirmed using serological tes... | TG Booklet 6 | 154, 166 | Laboratory Confirmation For Chikungunya Vs. Di... |
| 1 | Q721 | When should specimens be collected for Anthrax... | Specimens should be collected during the vesic... | TG Booklet 6 | 140 | Anthrax Specimen Collection: Timing, Preparati... |
| 2 | Q464 | Which key information should be recorded durin... | During a register review, key information abou... | TG Booklet 3 | 439-440 | Register Review, Key Information, Suspected Ca... |
| 3 | Q449 | Why is the District log of suspected outbreaks... | The log includes information about response ac... | TG Booklet 3 | 412 | District Log, Response Activities, Steps Taken... |
| 4 | Q6 | What do Community based surveillance strategie... | Community-based surveillance strategies focus ... | TG Booklet 1 | 86 | Community-based Surveillance Strategies, Ident... |


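###### Checking out the maximum length of the text

Let's determine the maximum length of the questions and answers. This helps inform the `max_length` parameter, with one caveat: `str.len()` counts characters, not tokens, so these numbers are only a rough guide. 📏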


In [6]:
train["Question Length"] = train["Question Text"].str.len()
train["Answer Length"] = train["Question Answer"].str.len()

In [7]:
print("The maximum length of the questions is", train["Question Length"].max())
print("The maximum length of the answers is", train["Answer Length"].max())

The maximum length of the questions is 279
The maximum length of the answers is 843
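
Since these figures are character counts, one way to sanity-check them against the 2048-token `max_length` used later in this notebook is to bound the token count from above. The sketch below assumes the tokenizer is a byte-level BPE (which I believe is the case for Pythia, but treat it as an assumption): every token then covers at least one UTF-8 byte, so a text can never have more tokens than bytes. `max_token_upper_bound` is a hypothetical helper, not part of any library.

```python
def max_token_upper_bound(question: str, answer: str) -> int:
    """Upper bound on the token count of question + answer.

    Assumes a byte-level BPE tokenizer, where each token spans >= 1 UTF-8 byte.
    """
    return len((question + answer).encode("utf-8"))


# Worst case from the output above: a 279-character question and an
# 843-character answer (which may not even belong to the same row).
worst_case = max_token_upper_bound("q" * 279, "a" * 843)
print(worst_case)          # 1122 bytes for pure-ASCII text
print(worst_case <= 2048)  # True
```

For ASCII text this suggests that no training example should hit the 2048-token limit. Non-ASCII characters take more than one byte each, so the bound loosens somewhat for them.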


In [1]:
dataset_name = "Train.csv"
dataset_path = f"/kaggle/input/malawi-public-health/{dataset_name}"  # path to Train.csv
use_hf = False

###### Choosing a model from the hub to fine-tune

The model used here is `EleutherAI/pythia-410m`. Feel free to explore other models on the Hub. 🚀


![image.png](attachment:image.png)

In [9]:
model_name = "EleutherAI/pythia-410m"

In [10]:
training_config = {
    "model": {
        "pretrained_name": model_name,
        "max_length" : 2048
    },
    "datasets": {
        "use_hf": use_hf,
        "path": dataset_path
    },
    "verbose": True
}

###### Tokenizing your dataset

This function tokenizes the input examples. It concatenates the "Question Text" and "Question Answer" fields (or "input" and "output", if you decide to rename them), truncates from the left to a maximum of 2048 tokens, and returns the result as NumPy arrays. It reads only the first item of each batch, so it must be mapped with `batch_size=1`. The end-of-sequence token doubles as the padding token, although a single text never actually needs padding. 🤖


In [11]:
def tokenize_function(examples):
    # Called with batch_size=1, so each field holds a one-element list
    if "Question Text" in examples and "Question Answer" in examples:
        text = examples["Question Text"][0] + examples["Question Answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["Question Text"][0]

    # First pass: tokenize without truncation to measure the sequence length
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Cap the length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )

    # Second pass: truncate from the left, keeping the end of the text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs
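
To make the effect of `truncation_side = "left"` concrete, here is a minimal sketch on a plain list of token IDs. `truncate_left` is a hypothetical helper written for illustration; it mimics the idea, not the tokenizer's actual implementation.

```python
def truncate_left(token_ids: list[int], max_length: int) -> list[int]:
    """Drop tokens from the start so that at most max_length remain."""
    if max_length <= 0:
        return []
    return token_ids[-max_length:]


ids = [10, 11, 12, 13, 14, 15]
print(truncate_left(ids, 4))   # [12, 13, 14, 15]: the end of the text survives
print(truncate_left(ids, 10))  # unchanged, already short enough
```

With question and answer concatenated in that order, left truncation means an over-long example loses the start of the question rather than the end of the answer.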

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


###### Creating and Splitting the Dataset

Let's create and split the dataset. 📊


In [13]:
!cp "/kaggle/input/malawi-public-health/Train.csv" -r "/kaggle/working/"

In [14]:
!cp "/kaggle/input/malawi-public-health/Test.csv" -r "/kaggle/working/"

In [15]:
finetuning_dataset_loaded = datasets.load_dataset("csv", data_files='Train.csv', split="train")

tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)


Dataset({
    features: ['ID', 'Question Text', 'Question Answer', 'Reference Document', 'Paragraph(s) Number', 'Keywords', 'input_ids', 'attention_mask'],
    num_rows: 748
})


In [16]:
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
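
Copying `input_ids` into `labels` is the usual setup for causal language modelling: as far as I know, Hugging Face causal LM models shift the labels internally, so the prediction at position i is scored against the token at position i + 1. The sketch below illustrates that pairing on a toy sequence; `next_token_pairs` is a hypothetical helper, not a library function.

```python
def next_token_pairs(input_ids: list[int]) -> list[tuple[int, int]]:
    """Pair each token with the token the model should predict after it."""
    labels = list(input_ids)  # labels start as a plain copy of input_ids
    # Shift by one: position i is scored against labels[i + 1]
    return list(zip(input_ids[:-1], labels[1:]))


print(next_token_pairs([5, 8, 2, 9]))  # [(5, 8), (8, 2), (2, 9)]
```

Because the labels cover the whole sequence, the loss is computed on the question tokens as well as the answer tokens.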

In [17]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)


DatasetDict({
    train: Dataset({
        features: ['ID', 'Question Text', 'Question Answer', 'Reference Document', 'Paragraph(s) Number', 'Keywords', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 673
    })
    test: Dataset({
        features: ['ID', 'Question Text', 'Question Answer', 'Reference Document', 'Paragraph(s) Number', 'Keywords', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 75
    })
})
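
The 673/75 split follows from the row count. Here is a quick check, assuming the test share is rounded up in the same way as scikit-learn's `train_test_split` (my understanding of what `datasets` does, so treat it as an assumption). `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


print(split_sizes(748, 0.1))  # (673, 75)
```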


In [18]:
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

print(train_dataset)
print(test_dataset)

Dataset({
    features: ['ID', 'Question Text', 'Question Answer', 'Reference Document', 'Paragraph(s) Number', 'Keywords', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 673
})
Dataset({
    features: ['ID', 'Question Text', 'Question Answer', 'Reference Document', 'Paragraph(s) Number', 'Keywords', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 75
})


###### Loading in your pretrained LLM

Now, let's load in your pretrained LLM. 🤖


In [19]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)


In [20]:
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")

else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")

###### A peek into the LLM Architecture

Let's take a peek into the LLM architecture. 🔍🏛️


In [21]:
base_model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
  

###### Zero Shot Inference - Testing the Zero Shot Performance

Here, we test the model on the task before any fine-tuning to get a baseline. Zero-shot inference means giving the model a question without task-specific training or worked examples in the prompt, and checking how accurate its answer is. 🧠💡

In [22]:
def inference(text, model, tokenizer, max_input_tokens=500, max_output_tokens=1000):
    # Tokenize the prompt, truncating it to max_input_tokens
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate; note that max_length caps prompt + generated tokens combined
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt, assuming the decoded text starts with it verbatim
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer
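
The prompt stripping at the end relies on the decoded text beginning with the prompt exactly as typed. A slightly more defensive variant, sketched here with a hypothetical `strip_prompt` helper, checks that assumption first and otherwise returns the text untouched:

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove the prompt from the front of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # Decoding did not reproduce the prompt verbatim; return everything
    return generated


print(repr(strip_prompt("Are all cases recorded? Yes.", "Are all cases recorded?")))  # ' Yes.'
```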

In [23]:
test_dataset

Dataset({
    features: ['ID', 'Question Text', 'Question Answer', 'Reference Document', 'Paragraph(s) Number', 'Keywords', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 75
})

In [24]:
test_text = test_dataset[0]['Question Text']
print("Question input (test):", test_text)
print(f"Correct answer from dataset: {test_dataset[0]['Question Answer']}")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test): Are all cases recorded?
Correct answer from dataset: Yes, all cases ( suspected, probably or confirmed) should always be recorded in a recongnised facility register or logbook, and the IDSR reporting forms.
Model's answer: 


Yes, all cases are recorded.

What is the maximum number of cases that can be recorded?

The maximum number of cases that can be recorded is 10.

What is the maximum number of cases that can be recorded per day?

The maximum number of cases that can be recorded per day is 10.

What is the maximum number of cases that can be recorded per week?

The maximum number of cases that can be recorded per week is 10.

What is the maximum number of cases that can be recorded per month?

The maximum number of cases that can be recorded per month is 10.

What is the maximum number of cases that can be recorded per year?

The maximum number of cases that can be recorded per year is 10.

What is the maximum number of cases that can be recorded per year per

The model answers the question in one short sentence, then drifts into inventing and answering its own follow-up questions until it hits the length limit. Hopefully, fine-tuning can improve on this. 🤞🔧

### Training the Model

Now, let's proceed with training the model. 🚀🔧


In [25]:
max_steps = 1000

###### Create your model on the hub first

Before proceeding, make sure to create your model repository on the Hub. Give it the same name as the output directory you use in this notebook. 🏗️


![image-2.png](attachment:image-2.png)

In [26]:
trained_model_name = "Malawi-Public-Health-Systems"
output_dir = trained_model_name

In [27]:
training_args = TrainingArguments(

  # Learning rate
  learning_rate=1.0e-5,

  # Number of training epochs
  num_train_epochs=1,

  # Max optimizer steps to train for (each step accumulates gradient_accumulation_steps batches)
  # Overrides num_train_epochs when positive
  max_steps=max_steps,

  # Batch size for training
  per_device_train_batch_size=1,

  # Directory to save model checkpoints
  output_dir=output_dir,

  # Other arguments
  overwrite_output_dir=False, # Overwrite the content of the output directory
  disable_tqdm=False, # Disable progress bars
  eval_steps=120, # Number of update steps between two evaluations
  save_steps=120, # After # steps model is saved
  warmup_steps=1, # Number of warmup steps for learning rate scheduler
  per_device_eval_batch_size=1, # Batch size for evaluation
  evaluation_strategy="steps",
  logging_strategy="steps",
  logging_steps=1,
  optim="adafactor",
  gradient_accumulation_steps = 400,
  gradient_checkpointing=False,

  # Keep the checkpoint with the lowest eval loss and reload it at the end
  load_best_model_at_end=True,
  save_total_limit=1,
  metric_for_best_model="eval_loss",
  greater_is_better=False,
  push_to_hub=True,
)
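
It is worth working out what these settings imply. The sketch below assumes a single GPU and that every optimizer step consumes a full accumulated batch; the Trainer's exact epoch accounting may differ, so read the result as an order of magnitude. `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, n_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * n_devices


batch = effective_batch_size(per_device_batch=1, grad_accum_steps=400)
examples_seen = batch * 1000         # max_steps optimizer updates
approx_epochs = examples_seen / 673  # 673 rows in the training split

print(batch)                 # 400
print(examples_seen)         # 400000
print(round(approx_epochs))  # 594
```

In other words, 1000 steps at this accumulation setting corresponds to several hundred passes over the training split. If that is more training than you intended, lowering `max_steps` or `gradient_accumulation_steps` are the obvious knobs.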



In [28]:
model_flops = (
  base_model.floating_point_ops(
    {
       "input_ids": torch.zeros(
           (1, training_config["model"]["max_length"])
      )
    }
  )
  * training_args.gradient_accumulation_steps
)

print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
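
As far as I know, `floating_point_ops` uses the common approximation of 6 FLOPs per parameter per token (forward plus backward pass), excluding embedding parameters by default. The sketch below reproduces that arithmetic by hand; the parameter count is a rough, assumed round figure for illustration, not a value measured in this notebook, and `estimate_training_flops` is a hypothetical helper.

```python
def estimate_training_flops(n_params: int, n_tokens: int, grad_accum_steps: int = 1) -> int:
    """Approximate FLOPs per optimizer step: 6 * parameters * tokens * accumulated batches."""
    return 6 * n_params * n_tokens * grad_accum_steps


# Assumed values: ~300M non-embedding parameters (illustrative round figure),
# 2048 tokens per sequence, 400 accumulated batches per optimizer step.
flops = estimate_training_flops(300_000_000, 2048, 400)
print(flops / 1e9, "GFLOPs")
```

Since the real examples are far shorter than 2048 tokens, an estimate based on `max_length` is best read as an upper bound.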
  

In [29]:
from transformers import Trainer



In [30]:
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

In [31]:
!wandb disabled



W&B disabled.


In [None]:
training_output = trainer.train()

Step,Training Loss,Validation Loss


### When training is complete, proceed to the inference notebook

P.S.: You can also run the inference notebook even if training hasn't completed as long as checkpoints are already available. 📒