BODMAS applied on a math expression
## Fine tuning an LLM

If you are doing language modelling, perhaps to adapt a model to a specific domain or language, you need only a `text` key in the dataset. The problem in this notebook instead uses `prompt` and `completion` keys.
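
To make the difference between the two dataset layouts concrete, here is a minimal sketch that collapses a prompt/completion record into a single `text` record. The helper name `to_text_record` and the newline separator are assumptions chosen for illustration; they are not part of the dataset or of TRL.

```python
def to_text_record(example, separator="\n"):
    """Collapse a prompt/completion record into a single 'text' record."""
    return {"text": example["prompt"] + separator + example["completion"]}


# A record in the prompt/completion layout used in this notebook
record = {
    "prompt": "Solve using BODMAS: 20 / (2 + 3)",
    "completion": "Step 1: Brackets → (2 + 3) = 5\nStep 2: Division → 20 / 5 = 4\nFinal Answer: 4",
}

# The same content in the single-key layout used for plain language modelling
text_record = to_text_record(record)
print(text_record["text"])
```

With the `text` layout the loss would typically cover every token; with the `prompt`/`completion` layout the trainer can tell the two parts apart.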


If you are using LITE_MODE=True, then please run this on a free T4 box.

If you are using LITE_MODE=False, then please use a paid A100 with high memory.

In [1]:
!pip install -q --upgrade bitsandbytes==0.48.2 trl==0.25.1


In [2]:
import bitsandbytes as bnb
print(bnb.__version__)

0.48.2


In [3]:
import torch, bitsandbytes as bnb
print("Torch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("BitsAndBytes:", bnb.__version__)
print("GPU capability:", torch.cuda.get_device_capability())


Torch: 2.9.0+cu126
CUDA: 12.6
BitsAndBytes: 0.48.2
GPU capability: (7, 5)


In [4]:
import os
import re
import math
from tqdm import tqdm
from google.colab import userdata
from huggingface_hub import login
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, set_seed, BitsAndBytesConfig
from datasets import load_dataset, Dataset, DatasetDict
import wandb
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datetime import datetime
import matplotlib.pyplot as plt

In [5]:
# Constants

BASE_MODEL = "meta-llama/Llama-3.2-3B"
PROJECT_NAME = "bodmas-math"
HF_USER = "prizmaweb" # your HF name here!

LITE_MODE = True

DATA_USER = "prizmaweb"
DATASET_NAME = f"{DATA_USER}/bodmas"
RUN_NAME =  f"{datetime.now():%Y-%m-%d_%H.%M.%S}"
if LITE_MODE:
  RUN_NAME += "-lite"
PROJECT_RUN_NAME = f"{PROJECT_NAME}-{RUN_NAME}"
HUB_MODEL_NAME = f"{HF_USER}/{PROJECT_RUN_NAME}"

# Hyper-parameters - overall

EPOCHS = 1 if LITE_MODE else 3
BATCH_SIZE = 8 if LITE_MODE else 256
MAX_SEQUENCE_LENGTH = 256 #from 128
GRADIENT_ACCUMULATION_STEPS = 1

# Hyper-parameters - QLoRA

QUANT_4_BIT = True
LORA_R = 16 if LITE_MODE else 256
LORA_ALPHA = LORA_R * 2
ATTENTION_LAYERS = ["q_proj", "v_proj", "k_proj", "o_proj"]
MLP_LAYERS = ["gate_proj", "up_proj", "down_proj"]
TARGET_MODULES = ATTENTION_LAYERS if LITE_MODE else ATTENTION_LAYERS + MLP_LAYERS
LORA_DROPOUT = 0.1

# Hyper-parameters - training

LEARNING_RATE = 1e-4
WARMUP_RATIO = 0.01
LR_SCHEDULER_TYPE = 'cosine'
WEIGHT_DECAY = 0.001
OPTIMIZER = "paged_adamw_32bit"

capability = torch.cuda.get_device_capability()
use_bf16 = capability[0] >= 8

# Tracking

VAL_SIZE = 500 if LITE_MODE else 1000
LOG_STEPS = 5 if LITE_MODE else 10
SAVE_STEPS = 100 if LITE_MODE else 200
LOG_TO_WANDB = True
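
To make `WARMUP_RATIO` and `LR_SCHEDULER_TYPE = 'cosine'` concrete, the sketch below computes a learning rate with linear warmup followed by a half-cosine decay to zero. It is written from scratch for illustration: `lr_at_step` and the 1000-step run are hypothetical, and the trainer's own rounding of the warmup length may differ.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.01):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 10, 500, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```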

In [6]:
capability = torch.cuda.get_device_capability()
print(capability)

(7, 5)


In [7]:
# A100 GPU supports this; T4 does not natively

use_bf16

False
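
The check behind `use_bf16` can be read as a small rule on the CUDA compute capability: a major version of 8 or higher enables bfloat16, anything lower falls back to float16. A minimal sketch, with `pick_precision` as a hypothetical helper name:

```python
def pick_precision(capability):
    """Return 'bf16' for compute capability 8.x and above, else 'fp16'."""
    major, _minor = capability
    return "bf16" if major >= 8 else "fp16"


print(pick_precision((7, 5)))  # the T4 capability reported above
print(pick_precision((8, 0)))  # an A100 reports compute capability 8.0
```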

### Log in to HuggingFace and Weights & Biases

If you don't already have a HuggingFace account, visit https://huggingface.co to sign up and create a token.

Then select the Secrets for this Notebook by clicking on the key icon in the left, and add a new secret called `HF_TOKEN` with the value as your token.

Repeat this for Weights & Biases at https://wandb.ai and add a secret called `WANDB_API_KEY`.

In [8]:
# Log in to HuggingFace

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [9]:
# Log in to Weights & Biases
wandb_api_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_api_key
wandb.login()

# Configure Weights & Biases to record against our project
os.environ["WANDB_PROJECT"] = PROJECT_NAME
os.environ["WANDB_LOG_MODEL"] = "false"
os.environ["WANDB_WATCH"] = "false"

wandb: Currently logged in as: prizmaweb (prizmaweb-akamai-technologies) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


In [10]:
#dataset = load_dataset(DATASET_NAME, "en")
dataset = load_dataset(DATASET_NAME)
sft_dataset = dataset['train']
sft_val_dataset = dataset['validation']

README.md:   0%|          | 0.00/417 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/4.36k [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/2.73k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/15 [00:00<?, ? examples/s]

In [11]:
# Uncomment for large datasets

#sft_dataset = sft_dataset.select(range(4000))

In [12]:
print(sft_dataset[0])


{'prompt': 'Solve using BODMAS: (8 - 3) + 4 * 2', 'completion': 'Step 1: Brackets → (8 - 3) = 5\nStep 2: Multiply → 4 * 2 = 8\nStep 3: Addition → 5 + 8 = 13\nFinal Answer: 13'}


In [13]:
#sanity check a few examples before training
for i in range(3):
    sample = sft_dataset[i]
    print(sample['prompt'])
    print(sample['completion'])
    print("---")


Solve using BODMAS: (8 - 3) + 4 * 2
Step 1: Brackets → (8 - 3) = 5
Step 2: Multiply → 4 * 2 = 8
Step 3: Addition → 5 + 8 = 13
Final Answer: 13
---
Solve using BODMAS: 20 / (2 + 3)
Step 1: Brackets → (2 + 3) = 5
Step 2: Division → 20 / 5 = 4
Final Answer: 4
---
Solve using BODMAS: (6 + 2) * (5 - 3)
Step 1: Brackets → (6 + 2) = 8, (5 - 3) = 2
Step 2: Multiply → 8 * 2 = 16
Final Answer: 16
---
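
Beyond eyeballing the samples, each record can be checked mechanically by evaluating the expression in the prompt and comparing it with the stated final answer. The sketch below is one way to do that, assuming every prompt has the form `Solve using BODMAS: <expression>` and every completion ends with `Final Answer: <number>`, as in the three samples above. The helper names are hypothetical.

```python
import ast
import operator
import re

# Only the four arithmetic operators seen in the samples are allowed
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression):
    """Safely evaluate an arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"Unsupported expression: {expression!r}")
    return walk(ast.parse(expression.strip(), mode="eval").body)


def final_answer(completion):
    """Extract the number after 'Final Answer:', or None if it is missing."""
    match = re.search(r"Final Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return float(match.group(1)) if match else None


def is_consistent(sample):
    """True when the stated final answer equals the evaluated expression."""
    expression = sample["prompt"].split(":", 1)[1]
    return final_answer(sample["completion"]) == evaluate(expression)


samples = [
    {"prompt": "Solve using BODMAS: (8 - 3) + 4 * 2", "completion": "Final Answer: 13"},
    {"prompt": "Solve using BODMAS: 20 / (2 + 3)", "completion": "Final Answer: 4"},
    {"prompt": "Solve using BODMAS: (6 + 2) * (5 - 3)", "completion": "Final Answer: 16"},
]
print([is_consistent(s) for s in samples])
```

The exact equality comparison is fine for the integer answers shown here; answers with repeating decimals would need a tolerance.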


In [14]:
print(sft_dataset[0].keys())


dict_keys(['prompt', 'completion'])


In [15]:
if LOG_TO_WANDB:
  wandb.init(project=PROJECT_NAME, name=RUN_NAME)

## Now load the Tokenizer and Model

The model is "quantized" - we are reducing the precision to 4 bits.
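
To give a feel for what reducing precision to 4 bits means, here is a toy sketch that stores a block of weights as small integer codes plus one shared scale. It uses uniform absmax quantization for simplicity, which is not the `nf4` scheme configured in this notebook (NF4 uses a non-uniform set of levels); the function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to signed integer codes in [-7, 7], which fit in 4 bits."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes and the scale."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.33]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", [round(w, 3) for w in restored])
```

Each restored value differs from the original by at most half a quantization step, which is the precision given up in exchange for the smaller memory footprint.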

In [16]:
# pick the right quantization

if QUANT_4_BIT:
  quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16 if use_bf16 else torch.float16,
    bnb_4bit_quant_type="nf4"
  )
else:
  quant_config = BitsAndBytesConfig(
    load_in_8bit=True,  # BitsAndBytesConfig defines a compute dtype only for 4-bit loading
  )

In [17]:
# Load the Tokenizer and the Model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
    dtype=torch.float16,
)
base_model.generation_config.pad_token_id = tokenizer.pad_token_id

print(f"Memory footprint: {base_model.get_memory_footprint() / 1e6:.1f} MB")

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/844 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Memory footprint: 2197.6 MB


# AND NOW

## We set up the configuration for Training

We need to create 2 objects:

A LoraConfig object with our hyperparameters for LoRA

An SFTConfig with our overall Training parameters
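
As a rough guide to what the LoRA hyperparameters control, the sketch below counts the parameters a rank-`r` adapter adds to one linear layer and shows the `alpha / r` factor that scales the adapter's update in standard LoRA. The layer width of 3072 is a hypothetical value chosen only for illustration, and `lora_param_count` is not a PEFT function.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


LORA_R = 16                # lite-mode rank from the constants cell
LORA_ALPHA = LORA_R * 2
d_in = d_out = 3072        # hypothetical layer width, for illustration only

added = lora_param_count(d_in, d_out, LORA_R)
full = d_in * d_out
print(f"LoRA parameters: {added:,} vs frozen weights: {full:,} ({added / full:.2%})")
print("Scaling applied to the adapter update:", LORA_ALPHA / LORA_R)
```

Because `LORA_ALPHA` is defined as twice `LORA_R`, the scaling factor stays at 2 in both lite and full mode.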

In [18]:
# LoRA Parameters

lora_parameters = LoraConfig(
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    r=LORA_R,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
)

In [19]:
# Training parameters

train_parameters = SFTConfig(
    output_dir=PROJECT_RUN_NAME,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    optim=OPTIMIZER,
    save_steps=SAVE_STEPS,
    save_total_limit=10,
    logging_steps=LOG_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    fp16=not use_bf16,
    bf16=use_bf16,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=WARMUP_RATIO,
    group_by_length=True,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    report_to="wandb" if LOG_TO_WANDB else "none",
    run_name=RUN_NAME,
    max_length=MAX_SEQUENCE_LENGTH,
    save_strategy="steps",
    hub_strategy="every_save",
    push_to_hub=True,
    hub_model_id=HUB_MODEL_NAME,
    hub_private_repo=True,
    eval_strategy="steps",
    eval_steps=5,
    completion_only_loss=True,
)
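
`completion_only_loss=True` asks the trainer to compute the loss on completion tokens only. Conceptually this amounts to masking the prompt positions in the labels. The sketch below shows the idea with made-up token ids; `mask_prompt_labels` is a hypothetical helper and not TRL's implementation, while -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_prompt_labels(prompt_ids, completion_ids):
    """Build input ids and labels so that only completion tokens are scored."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Made-up token ids: four prompt tokens followed by three completion tokens
input_ids, labels = mask_prompt_labels([101, 7, 8, 9], [42, 43, 2])
print(input_ids)
print(labels)
```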

# AND NOW - create the trainer

In [20]:
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=sft_dataset,
    eval_dataset=sft_val_dataset,
    peft_config=lora_parameters,
    args=train_parameters,
)

Adding EOS to train dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/15 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/15 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/15 [00:00<?, ? examples/s]

## In the next cell, we kick off fine-tuning!

This will run for some time, uploading to the hub every SAVE_STEPS steps.
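
How often a `SAVE_STEPS` boundary is reached depends on the number of optimizer steps, which follows from the dataset size and batch size. A back-of-the-envelope sketch, assuming a single GPU and that the last partial batch is kept (`training_steps` is a hypothetical helper):

```python
import math


def training_steps(num_examples, batch_size, grad_accum=1, epochs=1):
    """Estimate optimizer steps for a run on one device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Lite-mode values from this notebook: 50 training examples, batch size 8
steps = training_steps(num_examples=50, batch_size=8)
SAVE_STEPS = 100

print("optimizer steps:", steps)
print("SAVE_STEPS boundaries reached:", steps // SAVE_STEPS)
```

With the 50-example dataset loaded above, the lite run finishes well before the first `SAVE_STEPS` boundary, so periodic uploads only matter for larger datasets.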

After some time, Google might stop your colab. For people on free plans, it can happen whenever Google is low on resources. For anyone on paid plans, they can give you up to 24 hours, but there's no guarantee.

If your server is stopped, you can follow my colab here to resume from your last save:

https://colab.research.google.com/drive/1qGTDVIas_Vwoby4UVi2vwsU0tHXy8OMO#scrollTo=R_O04fKxMMT-

I've saved this colab with my final run in the output so you can see the example. The trick is that I needed to set `is_trainable=True` when loading the fine_tuned model.
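
For reference, the `is_trainable=True` setting mentioned above is an argument of PEFT's `PeftModel.from_pretrained`. A minimal sketch of how it might be used when resuming, where `load_adapter_for_resume` and the `adapter_repo` argument are hypothetical names and the rest of the resume flow in the linked colab is not reproduced here:

```python
def load_adapter_for_resume(base_model, adapter_repo):
    """Attach a saved LoRA adapter to the base model with its weights trainable."""
    # Imported lazily so this sketch can be defined without peft installed
    from peft import PeftModel

    # Without is_trainable=True the adapter is loaded frozen, for inference only
    return PeftModel.from_pretrained(base_model, adapter_repo, is_trainable=True)
```

A call would look like `load_adapter_for_resume(base_model, HUB_MODEL_NAME)`, assuming the adapter was pushed under that name.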

### Anyway, with that in mind, let's kick this off!

In [21]:
# Fine-tune!
fine_tuning.train()

# Push our fine-tuned model to Hugging Face
fine_tuning.model.push_to_hub(PROJECT_RUN_NAME, private=True)
print(f"Saved to the hub: {PROJECT_RUN_NAME}")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': None}.


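| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 5 | 1.1457 | 0.963745 | 1.549812 | 3181.0 | 0.82767 |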
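
One way to read the validation loss is to convert it to perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training.

```python
import math

validation_loss = 0.963745  # validation loss reported at step 5

# Perplexity is the exponential of the mean per-token cross-entropy
perplexity = math.exp(validation_loss)
print(f"perplexity ≈ {perplexity:.2f}")
```

Lower is better; a perplexity of 1 would mean the model assigns probability 1 to every scored token.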


README.md:   0%|          | 0.00/1.76k [00:00<?, ?B/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:  91%|#########1| 33.5MB / 36.7MB            

No files have been modified since last commit. Skipping to prevent empty commit.


Saved to the hub: bodmas-math-2025-12-22_10.53.32-lite


In [22]:
if LOG_TO_WANDB:
  wandb.finish()

Run history:
eval/entropy,▁
eval/loss,▁
eval/mean_token_accuracy,▁
eval/num_tokens,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/entropy,█▁
train/epoch,▁▁█
train/global_step,▁▁█

Run summary:
eval/entropy,1.54981
eval/loss,0.96374
eval/mean_token_accuracy,0.82767
eval/num_tokens,3181
eval/runtime,2.1463
eval/samples_per_second,6.989
eval/steps_per_second,6.989
total_flos,80969075073024.0
train/entropy,1.51582
train/epoch,1
