<a href="https://colab.research.google.com/github/menouarazib/llm/blob/main/Fine_Tune_Falcon7B_Linear_Equations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Enhancing Falcon7B’s Performance on Linear Equations**



The objective of this Notebook is to fine-tune the [**Falcon7B**](https://huggingface.co/tiiuae/falcon-7b) model on a simple dataset of linear equations. The aim is to test the model’s ability to solve mathematical linear equations after fine-tuning. The dataset, created specifically for this purpose, is hosted on Hugging Face and can be accessed [here](https://huggingface.co/datasets/Menouar/LinearEquations).

In this Notebook, I will:

1.   Set up the development environment.
2.   Load and prepare the dataset.
3.   Fine-tune Falcon7B using SFTTrainer and QLoRA.
4.   Test and evaluate the model.

## **1. Setup development environment**

The ***first step*** is to install Hugging Face Libraries, including trl, transformers, accelerate, peft, and datasets.

In [None]:
# Install Hugging Face libraries
!pip install -q datasets accelerate evaluate bitsandbytes

# Install peft, transformers, and trl from Github
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q git+https://github.com/huggingface/peft
!pip install -q git+https://github.com/huggingface/trl


The ***second step*** is to install Flash Attention to reduce the memory and runtime cost of the attention layer, and improve the performance of the model training. Learn more at [FlashAttention 2](https://github.com/Dao-AILab/flash-attention/tree/main).


In [None]:
import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'

# Install flash-attn
!pip -q install ninja packaging
!MAX_JOBS=4 pip install -q flash-attn --no-build-isolation




The ***third step*** is to install **huggingface_hub** so that the Hugging Face Hub can serve as a remote model versioning service. With this in place, our model, logs, and training information are automatically pushed to the Hub during training. To access my HF repository, use this [link](https://huggingface.co/Menouar/falcon7b-linear-equations).

In [None]:
# Install huggingface_hub
!pip install -q huggingface_hub

In [None]:
# Log in to our HF account using our token
from huggingface_hub import login
from google.colab import userdata

login(
  token=userdata.get('HF_TOKEN'), # Retrieve my HF_TOKEN stored in Google Colab Secrets
  add_to_git_credential=True
)

# The id of my HF Repo
hf_repo_id = "falcon7b-linear-equations"

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## **2. Load and prepare the dataset**

I have created a dataset hosted on Hugging Face that contains two columns. The first column is the Problem description, and the second column is the step-by-step solution to the problem. The problems are all about solving linear equations with integer constants. The dataset can be accessed [here](https://huggingface.co/datasets/Menouar/LinearEquations).

In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("Menouar/LinearEquations", split="train")

# Define the response template for the solutions
response_template = "### Solution:"

def create_text_field(sample):
  # Check if the problem and solution fields exist in the sample
  if "Problem" in sample and "Solution" in sample:
    return {
      "text": f"{sample['Problem']}\n {response_template} {sample['Solution']}"
    }
  else:
    # Raise an error if the problem or solution field is missing
    raise ValueError(f"Missing 'Problem' or 'Solution' field in sample: {sample}")

dataset = dataset.map(create_text_field, remove_columns=dataset.features, batched=False)

# Select only 20000 samples from the dataset
dataset = dataset.select(range(20000))

print(dataset[0])
print(len(dataset))

Map:   0%|          | 0/2000000 [00:00<?, ? examples/s]

{'text': 'Solve for y: 10 - 4 = -2 - 1y + 9y .\n ### Solution: The equation is in the form of ay + b = dy + c where:\na = 0\nb = 10 - 4 = 6\nd = -1 + 9 = 8\nc = -2\nThe solution is y = (c - b)/(a - d) if a ≠ d\n-2 - 6 = -8\n0 - 8 = -8\ny = -8 / -8\nThe fraction -8 / -8 = 1.\nThe solution is y = 1.\n'}
20000
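The sample printed above follows a fixed recipe: rewrite the equation as `ay + b = dy + c`, then apply `y = (c - b)/(a - d)`. As a minimal sketch of that arithmetic (the helper name `solve_linear` is hypothetical and is not part of the dataset tooling), the same steps can be checked in plain Python:

```python
from fractions import Fraction


def solve_linear(a: int, b: int, d: int, c: int) -> Fraction:
    """Solve a*y + b = d*y + c for y using y = (c - b) / (a - d)."""
    if a == d:
        # The formula in the dataset only applies when a != d
        raise ValueError("No unique solution when a == d")
    return Fraction(c - b, a - d)


# Coefficients read off the sample: 10 - 4 = -2 - 1y + 9y
y = solve_linear(a=0, b=10 - 4, d=-1 + 9, c=-2)
print(y)  # 1
```

Using `Fraction` keeps non-integer answers exact, which mirrors the "The fraction ..." step in the dataset's solutions.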


Recent releases of **TRL** support popular instruction and conversation dataset formats, so we only need to convert our dataset to one of the supported formats and **TRL** takes care of the rest. In our case, I use the [**DataCollatorForCompletionOnlyLM**](https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completions-only) so that the loss is computed only on the completions (the solutions that follow `### Solution:`) and not on the problem statements.
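To make the idea of completion-only training concrete, here is a simplified sketch of label masking. It is an illustration under assumptions, not TRL's actual implementation: the function `mask_prompt_labels` and the toy token ids are hypothetical. Every token up to and including the response template gets the label `-100`, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, template_ids):
    """Return labels where everything up to and including the template is ignored."""
    labels = list(input_ids)
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: this sketch chooses to ignore the whole sample
    return [IGNORE_INDEX] * len(input_ids)


# Toy token ids: problem tokens, then template tokens, then solution tokens
input_ids = [11, 12, 13, 90, 91, 21, 22, 23]
template_ids = [90, 91]
print(mask_prompt_labels(input_ids, template_ids))
# [-100, -100, -100, -100, -100, 21, 22, 23]
```

Only the last three positions (the solution) contribute to the loss in this toy example.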

## **3. Fine-tune Falcon7B using SFTTrainer and QLoRA**

We are now ready to fine-tune the Falcon-7B model. We will use the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from the TRL library, which makes supervised fine-tuning of open large language models (LLMs) straightforward. The SFTTrainer is a subclass of the Trainer from the Transformers library: it supports all the same features, including logging, evaluation, and checkpointing, and adds features of its own.

In [None]:
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, FalconForCausalLM

# Hugging Face Falcon-7B model ID
model_id = "tiiuae/falcon-7b"

# BitsAndBytesConfig for 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model and tokenizer
# (trust_remote_code is omitted: it only applies to Auto classes and is ignored here)
model = FalconForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# By default, Falcon7B does not have a padding token (pad_token). To avoid errors,
# we should add it to the tokenizer. We have two options for this:
# - Use tokenizer.add_special_tokens({'pad_token': '[PAD]'}). In this case,
#   we should resize the embeddings using model.resize_token_embeddings(len(tokenizer)).
# - Use tokenizer.add_special_tokens({"pad_token": ">>SUFFIX<<"}). In this case,
#   we reuse an already existing special token such as ">>SUFFIX<<". This solution is
#   not elegant, but it's acceptable for our simple case, and the embeddings
#   do not need to be resized.
tokenizer.add_special_tokens({"pad_token": ">>SUFFIX<<"})
tokenizer.padding_side = 'right' # pad on the right to prevent warnings



As the PEFT method we will use [**QLoRA**](https://arxiv.org/abs/2305.14314), a technique that uses quantization to reduce the memory footprint of large language models during fine-tuning without sacrificing performance.

In [None]:
from peft import LoraConfig

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)
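To get a feel for what `r` and `lora_alpha` imply, the following sketch works through the standard LoRA arithmetic: the adapter update is scaled by `lora_alpha / r`, and a linear layer of shape `d_out x d_in` gains `r * (d_in + d_out)` trainable parameters. The helper names and the 4096 x 4096 layer size are illustrative assumptions, not values read from Falcon-7B.

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


scaling = lora_scaling(lora_alpha=128, r=256)
extra = lora_extra_params(d_in=4096, d_out=4096, r=256)  # hypothetical layer size
full = 4096 * 4096  # parameters in the frozen weight matrix of that layer

print(scaling)      # 0.5
print(extra, full)  # 2097152 16777216
```

With these numbers the adapter for such a layer is one eighth the size of the frozen weight matrix, which is comparatively large for LoRA and reflects the high rank chosen here.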

Before we can start our training we need to define the hyperparameters (**TrainingArguments**) we want to use.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=hf_repo_id,                  # repository id
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=42,         # batch size per device during training
    gradient_accumulation_steps=2,          # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper (has no effect with the "constant" scheduler)
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
)
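As a rough sanity check on these hyperparameters, the sketch below computes the effective batch size and an approximate number of optimizer updates per epoch. It assumes a single GPU (`num_devices = 1` is an assumption, not something stated above), and the Trainer's own rounding may differ by a step.

```python
import math

per_device_train_batch_size = 42
gradient_accumulation_steps = 2
num_devices = 1        # assumption: a single GPU
num_samples = 20_000   # size of the selected training subset

# Samples consumed per optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate optimizer updates per epoch
approx_updates_per_epoch = math.ceil(num_samples / effective_batch_size)

print(effective_batch_size)      # 84
print(approx_updates_per_epoch)  # 239
```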

We now have every building block to create our **SFTTrainer** and start training our model.

In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False, # No need to add additional separator token
    }
)

model.config.use_cache = False

# Start training
trainer.train()

# Save model
# Since we are using PEFT, this saves only the LoRA adapter weights and not
# the full model. To produce a standalone fine-tuned model, we could use
# `merge_and_unload` to merge the LoRA adapter into the original model.
# However, this step is optional.
#
# `trainer.save_model()` saves the trained LoRA adapter and, because
# `push_to_hub=True`, pushes it to the Hugging Face repository together with
# a reference to the original model. Therefore, for testing the new model,
# we just need `AutoPeftModelForCausalLM`, which loads the original model
# and attaches the adapter to it.
trainer.save_model()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss |
|------|---------------|
| 10   | 0.3631        |
| 20   | 0.0511        |
| 30   | 0.0214        |
| 40   | 0.0103        |
| 50   | 0.0095        |
| 60   | 0.0056        |
| 70   | 0.0049        |
| 80   | 0.0041        |
| 90   | 0.0036        |
| 100  | 0.0036        |





In [None]:
# Free the GPU memory
del model
del trainer
torch.cuda.empty_cache()

# Disconnect and release the Colab runtime
from google.colab import runtime
runtime.unassign()

## **4. Test and evaluate the model**
This task is performed in another notebook, which can be found [here](https://colab.research.google.com/drive/13OqOLiIpWJylJpPJ0ln2pgyr2WkWdYoB).
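For orientation, here is a hedged sketch of how the pushed adapter could be loaded and queried. It is not necessarily what the linked notebook does: the helper names `build_prompt` and `generate_solution` are hypothetical, the generation settings are assumptions, and the adapter repository id is taken from the Hub link given in section 1. The prompt is formatted exactly like the training text, stopping right after the response template.

```python
RESPONSE_TEMPLATE = "### Solution:"


def build_prompt(problem: str) -> str:
    """Format a problem like the training text, ending at the response template."""
    return f"{problem}\n {RESPONSE_TEMPLATE}"


def generate_solution(
    problem: str,
    adapter_repo: str = "Menouar/falcon7b-linear-equations",
    max_new_tokens: int = 256,  # assumption: enough room for a step-by-step solution
) -> str:
    # Heavy imports are local so build_prompt can be used without a GPU
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
    # Loads the base model referenced by the adapter and attaches the LoRA weights
    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_repo, device_map="auto", torch_dtype=torch.bfloat16
    )
    inputs = tokenizer(
        build_prompt(problem), return_tensors="pt", return_token_type_ids=False
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Keep only the newly generated tokens (the solution)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


print(build_prompt("Solve for y: 10 - 4 = -2 - 1y + 9y ."))
```

Calling `generate_solution(...)` requires a GPU runtime and downloads the base model, so it is left uncalled here.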