<a href="https://colab.research.google.com/github/menouarazib/llm/blob/main/Phi_multi_step_reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fine-tuning Φ-2 to answer basic mathematical problems that require multi-step reasoning**






The objective of this notebook is to fine-tune the [**Phi-2**](https://huggingface.co/microsoft/phi-2) model to answer basic mathematical problems that require multi-step reasoning. The dataset is GSM8K, which is hosted on Hugging Face and can be accessed [here](https://huggingface.co/datasets/gsm8k).

In this Notebook, I will:

1.   Set up the development environment.
2.   Load and prepare the dataset.
3.   Fine-tune **Φ-2** using SFTTrainer and QLoRA.
4.   Test and evaluate the model.

## **1. Setup development environment**

The ***first step*** is to install the Hugging Face libraries (trl, transformers, accelerate, peft, datasets, and evaluate), along with bitsandbytes for 4-bit quantization.

In [None]:
# Install Hugging Face libraries
!pip install -q datasets accelerate evaluate bitsandbytes

# Install peft, transformers, and trl from Github
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q git+https://github.com/huggingface/peft
!pip install -q git+https://github.com/huggingface/trl


The ***second step*** is to install Flash Attention, which reduces the memory and runtime cost of the attention layer and speeds up training. It requires a GPU with compute capability 8.0 or higher, which the cell below checks before installing. Learn more at [FlashAttention 2](https://github.com/Dao-AILab/flash-attention/tree/main).
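As an alternative to a hard failure on older GPUs, one could fall back to the default attention implementation. The sketch below is only an illustration: `pick_attn_implementation` is a hypothetical helper, and it assumes that `"eager"` is an acceptable fallback value for the `attn_implementation` argument of `from_pretrained`.

```python
def pick_attn_implementation(cc_major: int) -> str:
    """Return an attention backend name for a CUDA compute capability major version."""
    # FlashAttention 2 targets GPUs with compute capability 8.x or newer.
    return "flash_attention_2" if cc_major >= 8 else "eager"


# An A100 reports major version 8, a T4 reports 7.
print(pick_attn_implementation(8))
print(pick_attn_implementation(7))
```

In this notebook the argument would come from `torch.cuda.get_device_capability()[0]`, and the result would be passed as `attn_implementation` when loading the model.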


In [None]:
import torch

# Flash Attention 2 requires a GPU with compute capability >= 8.0
assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'

# Install flash-attn
!pip -q install ninja packaging
!MAX_JOBS=4 pip install -q flash-attn --no-build-isolation




The ***third step*** is to install **huggingface_hub** so that the Hugging Face Hub can serve as a remote model versioning service. With this in place, the model, logs, and related information are automatically pushed to the Hub during training. To access my HF repository, use this [link](https://huggingface.co/Menouar/falcon7b-linear-equations).

In [None]:
# Install huggingface_hub
!pip install -q huggingface_hub

In [None]:
# Log in to our HF account using our token
from huggingface_hub import login
from google.colab import userdata

login(
  token=userdata.get('HF_TOKEN'), # Retrieve my HF_TOKEN stored in Google Colab Secrets
  add_to_git_credential=True
)

# The id of my HF Repo
hf_repo_id = "phi-2-basic-maths"

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## **2. Load and prepare the dataset**

In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("gsm8k", 'main')
dataset = dataset['train']


print(dataset)


Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})


In [None]:
# Define the response template for the solutions
response_template = "### Solution:"

def create_text_field(sample):
  return {
      "text": f"{sample['question']}\n{response_template} {sample['answer']}"
    }

dataset = dataset.map(create_text_field, remove_columns=dataset.features, batched=False)

print(dataset[0])
print(len(dataset))


{'text': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\n### Solution: Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}
7473
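The sample above shows two conventions of the GSM8K answers: calculator annotations wrapped in `<<...>>` and a final answer after `####`. A minimal sketch for handling both is given below; the helper names are hypothetical, and the regular expressions are assumptions based only on the sample printed above.

```python
import re

# Matches calculator annotations such as <<48/2=24>>
CALC_ANNOTATION = re.compile(r"<<[^>]*>>")
# Matches the final answer line, e.g. "#### 72" or "#### 1,000"
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def strip_calculator_annotations(answer: str) -> str:
    """Remove the <<...>> annotations while keeping the surrounding text."""
    return CALC_ANNOTATION.sub("", answer)


def extract_final_answer(answer: str):
    """Return the number after '####' as a string without commas, or None."""
    match = FINAL_ANSWER.search(answer)
    return match.group(1).replace(",", "") if match else None


sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
print(strip_calculator_annotations(sample))
print(extract_final_answer(sample))
```

A helper like `extract_final_answer` can be useful later for comparing a generated solution against the reference answer.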


Recent releases of **TRL** support popular instruction and conversation dataset formats, so we only need to convert our dataset to one of the supported formats and **TRL** takes care of the rest. In our case, I use the [**DataCollatorForCompletionOnlyLM**](https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completions-only) so that the loss is computed only on the completions (the solutions that follow the response template) and not on the questions.
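To make the idea of completion-only training concrete, here is a simplified sketch of the label masking involved. It is not TRL's implementation: `mask_prompt_labels` is a hypothetical helper, the token ids are made up, and the only assumption carried over from the library is that labels set to `-100` are ignored by the loss.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_prompt_labels(input_ids, template_ids):
    """Copy input_ids into labels, masking everything up to and including the template."""
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: mask the whole sequence so it contributes no loss.
    return [IGNORE_INDEX] * len(input_ids)


# Made-up ids: question tokens, then the template (100, 101), then the solution.
input_ids = [5, 6, 7, 100, 101, 8, 9, 2]
print(mask_prompt_labels(input_ids, [100, 101]))
```

Only the positions after the template keep their token ids as labels, so the gradient comes from the solution tokens alone.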

## **3. Fine-tune using SFTTrainer and QLoRA**

We are now ready to fine-tune the **Φ-2** model. We will use the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from the TRL library, which makes supervised fine-tuning of open large language models (LLMs) straightforward. The SFTTrainer is a subclass of the Trainer from the Transformers library and supports all the same features, including logging, evaluation, and checkpointing, while adding features of its own.

In [None]:
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, PhiForCausalLM

# Hugging Face phi-2 model ID
model_id = "microsoft/phi-2"

# BitsAndBytesConfig for 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model and tokenizer
model = PhiForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right' # to prevent warnings


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


As our PEFT method we will use [**QLoRA**](https://arxiv.org/abs/2305.14314), a technique that uses quantization to reduce the memory footprint of large language models during fine-tuning without sacrificing performance.
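A rough back-of-the-envelope estimate shows why quantizing the base weights matters. The sketch below assumes about 2.7 billion parameters for Phi-2 and counts only the weights themselves; it ignores quantization constants, activations, optimizer state, and the LoRA adapter, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 2.7e9  # assumed parameter count for Phi-2

print(f"bf16 : {weight_memory_gb(N_PARAMS, 16):.2f} GB")
print(f"4-bit: {weight_memory_gb(N_PARAMS, 4):.2f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the memory of the bf16 weights.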

In [None]:
from peft import LoraConfig

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)
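To give a feel for what `r=256` and `lora_alpha=128` mean, the sketch below computes the LoRA scaling factor (`lora_alpha / r`) and the number of adapter parameters added to a single linear layer. The 2560 x 2560 layer shape is an assumption used for illustration, and the helper names are hypothetical.

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r) for one linear layer."""
    return r * (d_in + d_out)


d_in = d_out = 2560  # assumed layer shape, for illustration only
adapter = lora_param_count(d_in, d_out, r=256)
full = d_in * d_out

print("scaling:", lora_scaling(128, 256))
print("adapter params:", adapter)
print("fraction of full weight:", adapter / full)
```

With a rank this high the adapter is a sizeable fraction of the layer it wraps, which is worth keeping in mind when weighing memory against capacity.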

Before we can start our training we need to define the hyperparameters (**TrainingArguments**) we want to use.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=hf_repo_id,                  # repository id
    num_train_epochs=30,                    # number of training epochs
    per_device_train_batch_size=42,         # batch size per device during training
    gradient_accumulation_steps=2,          # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper (has no effect with the "constant" scheduler)
    lr_scheduler_type="constant",           # constant learning rate; "constant_with_warmup" would apply the warmup
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
)
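These settings determine the effective batch size and the approximate number of optimizer updates. The sketch below is a rough calculation that assumes a single GPU and the 7473 training examples loaded earlier; the exact count reported by the Trainer may differ slightly between library versions.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, accum_steps: int, epochs: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // accum_steps, 1)
    return updates_per_epoch * epochs


effective_batch_size = 42 * 2  # per_device_train_batch_size * gradient_accumulation_steps
total_steps = optimizer_steps(7473, batch_size=42, accum_steps=2, epochs=30)

print("effective batch size:", effective_batch_size)
print("approximate optimizer steps:", total_steps)
```

With `logging_steps=10`, the training loss is reported every ten of these updates.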

We now have every building block to create our **SFTTrainer** and start training our model.

In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM


collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False, # No need to add additional separator token
    }
)

model.config.use_cache = False

# Start training
trainer.train()

# Save model
# Because we train with PEFT, only the LoRA adapter weights are saved, not the
# full model. To produce a standalone fine-tuned model, we could optionally merge
# the adapter into the base model with `merge_and_unload`.
#
# `trainer.save_model()` saves the trained LoRA adapter and, because
# `push_to_hub=True`, pushes it to the Hugging Face repository together with a
# reference to the base model. To test the model, `AutoPeftModelForCausalLM`
# loads the base model and applies the adapter on top of it.
trainer.save_model()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss |
|------|---------------|
| 10   | 0.8218        |
| 20   | 0.686         |
| 30   | 0.662         |
| 40   | 0.6515        |
| 50   | 0.6696        |
| 60   | 0.6535        |
| 70   | 0.6586        |
| 80   | 0.6645        |
| 90   | 0.6551        |
| 100  | 0.6486        |




In [None]:
# Free the memory
del model
del trainer
torch.cuda.empty_cache()

from google.colab import runtime
runtime.unassign()

## **4. Test and evaluate the model**
This task is performed in a separate notebook, which can be found [here](https://colab.research.google.com/drive/1xsdxOm-CgZmLAPFgp8iU9lLFEIIHGiUK).