Training & Fine-Tuning Chosen Model
===================================

Note: this notebook is intended to run in [Google Colab](https://colab.research.google.com/). On a T4 runtime it requires purchased compute units and could take about 1 week to complete.

**Recommended:** an A100 GPU, which requires Colab Pro and uses 11.77 compute units per hour. Training on an A100 takes 26+ hours.

The four steps in QLoRA training:

1. **Forward pass**: predict the next token in the training data (i.e. run inference)
2. **Loss calculation**: measure how far the prediction was from the true next token (cross-entropy loss)
3. **Backward pass**: work out how much each parameter should change to do better next time (the "gradients"); also called backpropagation or backprop
4. **Optimization**: update the parameters by a tiny step to do better next time
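
The four steps above can be sketched on a toy problem. This is a minimal illustration in plain Python, not the notebook's training loop: it assumes a hypothetical 4-token vocabulary and treats the logits themselves as the only trainable parameters, with plain gradient descent standing in for the real optimizer.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(logits, target, lr):
    # 1. Forward pass: predicted probability of each candidate next token
    probs = softmax(logits)
    # 2. Loss calculation: cross-entropy against the true next token
    loss = -math.log(probs[target])
    # 3. Backward pass: for softmax + cross-entropy, d(loss)/d(logit_i) = p_i - 1[i == target]
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # 4. Optimization: move each parameter a small step against its gradient
    new_logits = [x - lr * g for x, g in zip(logits, grads)]
    return loss, new_logits

logits = [0.0, 0.0, 0.0, 0.0]  # toy "parameters": one score per token (hypothetical vocabulary)
target = 2                     # index of the true next token (hypothetical)
losses = []
for _ in range(5):
    loss, logits = training_step(logits, target, lr=0.5)
    losses.append(loss)

print([round(l, 3) for l in losses])  # the loss should shrink from step to step
```

In real QLoRA training the gradients flow back through every layer, and only the LoRA adapter weights are updated.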

## Hyperparameters

Important hyperparameters for training:

1. **Epochs**
2. **Batch Size**
3. **Learning Rates**
4. **Gradient Accumulation**
5. **Optimizer**

Hyperparameters for QLoRA

- r: 32 dimensions
- alpha: 64 (double r)
- Target Modules for Llama architecture: q_proj, v_proj, k_proj, o_proj
- Dropout: 0.1 (during training, a random 10% of the activations entering the LoRA layers are zeroed at each step, which helps prevent overfitting)
- Quantization: 4-bit
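
To make the roles of `r` and `alpha` concrete, here is a minimal sketch of a LoRA forward pass in plain Python. It assumes the standard LoRA formulation, where the frozen weight `W` is augmented by a low-rank update `B @ A` scaled by `alpha / r`. The tiny matrices are hypothetical, and dropout and quantization are left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical toy sizes: 3 inputs, 2 outputs, rank r = 2, alpha = 2 * r (as in this notebook)
r, alpha = 2, 4
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen base weight, shape (out, in)
A = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]    # trainable, shape (r, in)
B_zero = [[0.0, 0.0], [0.0, 0.0]]         # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0, 0.0], [0.0, 1.0]]      # pretend values after some training, shape (out, r)
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha, r))     # same as the base model's output
print(lora_forward(W, A, B_trained, x, alpha, r))  # base output shifted by the adapter
```

With `alpha` set to double `r`, the adapter's contribution is multiplied by 2 before being added to the base output.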

Hyperparameters for Training

- 1 epoch
- Batch size = 4 on a T4; an A100 box can go up to 16
- Gradient accumulation steps = 1 (if you run into memory issues, lower the batch size and raise this to 2 or 4 to keep a similar effective batch size)
- Learning Rate (LR) = 1e-4 = 1 × 10^-4 = 0.0001
- LR scheduler type = 'cosine' (the LR decreases slowly at first, then quickly, then tails off at the end)
- Warmup ratio = 0.03 (training is unstable at the very beginning because the model has a lot to learn, so a large LR is risky; instead the LR starts low, warms up to the peak over the first 3% of steps, then follows the cosine decay)
- Optimizer = paged_adamw_32bit (Adam with Weight Decay converges well, so it does a good job of finding the optimal spot, but it consumes a lot of memory because it keeps rolling averages of previous gradients)
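
The warmup and cosine behaviour described above can be sketched as a small function. This is an approximation of the usual "linear warmup, then cosine decay to zero" shape, written from scratch rather than taken from the trainer. The 20,000-row dataset size is an illustrative assumption, and `warmup_steps_for` and `lr_at_step` are hypothetical helper names.

```python
import math

def warmup_steps_for(total_steps, warmup_ratio=0.03):
    """Number of optimizer steps spent warming up."""
    return math.ceil(total_steps * warmup_ratio)

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    warmup = warmup_steps_for(total_steps, warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

rows = 20_000                        # assumption: illustrative dataset size
batch_size, grad_accum, epochs = 4, 1, 1
effective_batch = batch_size * grad_accum
total_steps = math.ceil(rows / effective_batch) * epochs
warmup = warmup_steps_for(total_steps)

print(f"effective batch size: {effective_batch}, optimizer steps: {total_steps}")
for step in (0, warmup // 2, warmup, (warmup + total_steps) // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at_step(step, total_steps):.2e}")
```

Raising the gradient accumulation steps increases the effective batch size and reduces the number of optimizer steps by the same factor.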

Other administrative configurations:

- Number of batch steps between progress logs to wandb = 50
- Number of steps between uploads of the model to the hub = 2000
- `LOG_TO_WANDB` toggles whether to log to Weights & Biases

# Dependencies

Ok to ignore:
ERROR: pip's dependency resolver...

In [None]:
# pip installs

!pip install -q datasets requests torch peft bitsandbytes transformers trl accelerate sentencepiece wandb matplotlib

In [None]:
# imports

import os
import re
import math
from tqdm import tqdm
from google.colab import userdata
from huggingface_hub import login
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, set_seed, BitsAndBytesConfig
from datasets import load_dataset, Dataset, DatasetDict
import wandb
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datetime import datetime
import matplotlib.pyplot as plt

# Setup

You can use a single model repository and upload each run to it as a new version.

Alternatively, as done here, you can give each run its own model repository. A run may itself contain several versions (for example, different save points or epochs), and keeping runs separate makes it easy to note which hyperparameters each one was trained with.

RUN_NAME: **2025-04-30_01.18.39**

PROJECT_RUN_NAME: **pricer-2025-04-30_01.18.39**

HUB_MODEL_NAME: **clanredhead/pricer-2025-04-30_01.18.39**

In [None]:
# Constants

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
PROJECT_NAME = "pricer"
HF_USER = "clanredhead" # your HF name here!

# Data

DATASET_NAME = f"{HF_USER}/pricer-data"
MAX_SEQUENCE_LENGTH = 182

# Run name for saving the model in the hub

RUN_NAME =  f"{datetime.now():%Y-%m-%d_%H.%M.%S}"
PROJECT_RUN_NAME = f"{PROJECT_NAME}-{RUN_NAME}"
HUB_MODEL_NAME = f"{HF_USER}/{PROJECT_RUN_NAME}"

# Hyperparameters for QLoRA

LORA_R = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj"]
LORA_DROPOUT = 0.1
QUANT_4_BIT = True

# Hyperparameters for Training

EPOCHS = 1 # you can do more epochs if you wish, but only 1 is needed - more is probably overkill
BATCH_SIZE = 4 # on an A100 box this can go up to 16
GRADIENT_ACCUMULATION_STEPS = 1
LEARNING_RATE = 1e-4
LR_SCHEDULER_TYPE = 'cosine'
WARMUP_RATIO = 0.03
OPTIMIZER = "paged_adamw_32bit"

# Admin config - note that SAVE_STEPS is how often it will upload to the hub
# I've changed this from 5000 to 2000 so that you get more frequent saves

STEPS = 50
SAVE_STEPS = 2000
LOG_TO_WANDB = True

%matplotlib inline

In [None]:
HUB_MODEL_NAME

## HuggingFace and Weights & Biases Token

**IMPORTANT** HuggingFace token requires read and write permissions.

Add `HF_TOKEN` and `WANDB_API_KEY` to the Colab secrets, paste in their values, and toggle on notebook access for each.
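
If the notebook is ever run outside Colab, the secrets store is not available. One possible workaround, sketched below, is a small helper that tries Colab first and then falls back to an environment variable. `get_secret` is a hypothetical helper, not part of this notebook or of any library.

```python
import os

def get_secret(name):
    """Read a secret from Colab's secrets store, else from an environment variable."""
    value = None
    try:
        # Only importable inside Colab; also fails if the secret is missing or access is off
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        value = None
    value = value or os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} not found in Colab secrets or the environment")
    return value

# Hypothetical usage in place of userdata.get(...):
# hf_token = get_secret("HF_TOKEN")
# wandb_api_key = get_secret("WANDB_API_KEY")
```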

Logs:
> Tracking run with wandb version 0.19.10<br>
> Run data is saved locally in /content/wandb/run-20250430_015044-t63ngy9d<br>
> Syncing run 2025-04-30_01.18.39 to Weights & Biases (docs)<br>
> View project at https://wandb.ai/john-stoops-personal-use/pricer<br>
> View run at https://wandb.ai/john-stoops-personal-use/pricer/runs/t63ngy9d

In [None]:
# Log in to HuggingFace

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
# Log in to Weights & Biases
wandb_api_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_api_key
wandb.login()

# Configure Weights & Biases to record against our project
os.environ["WANDB_PROJECT"] = PROJECT_NAME
os.environ["WANDB_LOG_MODEL"] = "checkpoint" if LOG_TO_WANDB else "end"
os.environ["WANDB_WATCH"] = "gradients"

In [None]:
dataset = load_dataset(DATASET_NAME)
train = dataset['train']
test = dataset['test']

In [None]:
# if you wish to reduce the training dataset to 20,000 points instead, then uncomment this line:
# train = train.select(range(20000))

In [None]:
if LOG_TO_WANDB:
  wandb.init(project=PROJECT_NAME, name=RUN_NAME)

# Load Tokenizer and the Model

Note: The model is "quantized" by reducing the precision to 4 bits.

Boilerplate settings:

- Tell the trainer to pad every data point so that it fills up the maximum sequence length, padding on the right with end-of-sequence tokens (eos_token):

        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"

- Set pad token ID to stop it printing an unnecessary warning that this is not set:

        base_model.generation_config.pad_token_id = tokenizer.pad_token_id

Memory footprint: **5591.5 MB**
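
The reported footprint can be reproduced with some arithmetic on the layer shapes shown in the architecture printout under "Execute Fine-Tuning". The sketch below assumes that every `Linear4bit` weight costs half a byte per parameter and that everything else (embeddings, `lm_head`, RMSNorm weights) is held at 2 bytes per parameter. Those byte sizes are assumptions, and small extras such as quantization constants are ignored.

```python
HIDDEN, KV_DIM, MLP_DIM, VOCAB, LAYERS = 4096, 1024, 14336, 128256, 32

# Weights stored as Linear4bit in each decoder layer: (in_features, out_features)
quantized_shapes = [
    (HIDDEN, HIDDEN),   # q_proj
    (HIDDEN, KV_DIM),   # k_proj
    (HIDDEN, KV_DIM),   # v_proj
    (HIDDEN, HIDDEN),   # o_proj
    (HIDDEN, MLP_DIM),  # gate_proj
    (HIDDEN, MLP_DIM),  # up_proj
    (MLP_DIM, HIDDEN),  # down_proj
]
quantized_params = LAYERS * sum(i * o for i, o in quantized_shapes)

# Not quantized: embed_tokens, lm_head, two RMSNorm weights per layer plus the final norm
unquantized_params = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

BYTES_PER_4BIT_PARAM = 0.5   # assumption: 4 bits per quantized weight
BYTES_PER_16BIT_PARAM = 2    # assumption: remaining weights kept in a 16-bit dtype

footprint_bytes = (quantized_params * BYTES_PER_4BIT_PARAM
                   + unquantized_params * BYTES_PER_16BIT_PARAM)
print(f"Quantized parameters:   {quantized_params:,}")
print(f"Unquantized parameters: {unquantized_params:,}")
print(f"Estimated footprint:    {footprint_bytes / 1e6:.1f} MB")  # 5591.5 MB
```

Under these assumptions the estimate matches the figure above, and it shows that the unquantized embeddings and output head account for roughly 2.1 GB of the total.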

In [None]:
# pick the right quantization

if QUANT_4_BIT:
  quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
  )
else:
  quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.bfloat16
  )

In [None]:
# Load the Tokenizer and the Model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)
base_model.generation_config.pad_token_id = tokenizer.pad_token_id

print(f"Memory footprint: {base_model.get_memory_footprint() / 1e6:.1f} MB")

# Data Collator

**Issue**: we don't need the model to learn how to predict every token of the prompt up to the dollar sign. We want it to learn how to predict the token that comes right after it. Time spent getting better at writing product descriptions is wasted; the model should focus on the price.

**Solution**: set up a mask that tells the trainer not to learn from the prompt. The prompt is still taken into account as context, but the model only learns to predict the tokens after the dollar sign.

Setting up masks by hand is complicated, but HuggingFace provides a helper class that does it with some simple code.

Use the `DataCollatorForCompletionOnlyLM` utility from HuggingFace's TRL library to do this:

- Come up with a response template, i.e. the chunk of text that indicates the model should predict whatever comes next, e.g. "Price is $"
- Then create an instance using the response template and the tokenizer
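
Conceptually, the mask works by replacing the label of every prompt token with an "ignore" value, so that only the tokens after the response template contribute to the loss. The sketch below is a simplified picture of that idea in plain Python, not the collator's actual implementation. The token IDs are made up, and `-100` is the ignore index conventionally used by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # labels with this value are skipped when computing the loss

def find_sublist(sequence, sub):
    """Return the start index of the first occurrence of sub in sequence, or None."""
    for start in range(len(sequence) - len(sub) + 1):
        if sequence[start:start + len(sub)] == sub:
            return start
    return None

def mask_prompt(input_ids, template_ids):
    """Build labels that ignore everything up to and including the response template."""
    start = find_sublist(input_ids, template_ids)
    if start is None:
        # Template not found: nothing to learn from in this example
        return [IGNORE_INDEX] * len(input_ids)
    end = start + len(template_ids)
    return [IGNORE_INDEX] * end + list(input_ids[end:])

# Hypothetical token IDs: a product description, then "Price is $", then the price and EOS
description = [11, 12, 13]
template = [7, 8, 9]        # stands in for the tokens of "Price is $"
price_and_eos = [42, 2]

input_ids = description + template + price_and_eos
print(mask_prompt(input_ids, template))
```

Only the last two positions keep their real labels, so the loss measures how well the model predicts the price and nothing else.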

In [None]:
from trl import DataCollatorForCompletionOnlyLM
response_template = "Price is $"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# Perform Training

Pass in LoRA and Supervised Fine-Tuning (SFT) configurations, pulled from constants in setup.

Hardcoded configurations are settings that rarely need tuning and can stay the same between runs.

Tell it to push to the hub every time it does a save:

        hub_strategy="every_save",
        push_to_hub=True,

Make it a private repo (make it public once satisfied with the results):

        hub_private_repo=True

Other settings:

- eval_strategy: it is best practice to evaluate during training by passing in validation data as well as training data. Do this! (This notebook sets it to "no".)
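
One possible way to follow that advice is to hold out a small validation slice of the training data. The sketch below only computes the row indices in plain Python. The dataset size, validation size and `holdout_indices` helper are illustrative assumptions, and the commented lines show how the result might be wired into the cells below.

```python
import random

def holdout_indices(n_rows, n_validation, seed=42):
    """Shuffle row indices reproducibly and split them into train and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    validation = sorted(indices[:n_validation])
    training = sorted(indices[n_validation:])
    return training, validation

# Assumption: 20,000 training rows with 500 held out for validation
train_idx, val_idx = holdout_indices(n_rows=20_000, n_validation=500)
print(len(train_idx), len(val_idx))

# Hypothetical wiring (select the validation rows before overwriting `train`):
# validation = train.select(val_idx)
# train = train.select(train_idx)
# then set eval_strategy="steps" and eval_steps=SAVE_STEPS in SFTConfig,
# and pass eval_dataset=validation to SFTTrainer
```

Holding rows out of the training split keeps the `test` split untouched for the final evaluation.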

Set up SFTTrainer by passing in:

- The base model to train, e.g. Llama 3.1 8B
- Training data to use
- LoRA parameters
- Training parameters
- The collator to tell it to focus on predicting whatever comes after "Price is $"

Save to the hub after training the model:

        fine_tuning.model.push_to_hub(PROJECT_RUN_NAME, private=True)

Resource usage: **14.7GB of 15GB used**

<img src="./../images/QLoRA-Training-4bit-Model-Resource-Usage.jpg" alt="Resource usage for QLoRA training of 4bit model" />

In [None]:
# First, specify the configuration parameters for LoRA

lora_parameters = LoraConfig(
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    r=LORA_R,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
)

# Next, specify the general configuration parameters for training

train_parameters = SFTConfig(
    output_dir=PROJECT_RUN_NAME,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=1,
    eval_strategy="no",
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    optim=OPTIMIZER,
    save_steps=SAVE_STEPS,
    save_total_limit=10,
    logging_steps=STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=0.001,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=WARMUP_RATIO,
    group_by_length=True,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    report_to="wandb" if LOG_TO_WANDB else "none",  # "none" turns reporting off; None may fall back to the library default
    run_name=RUN_NAME,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    dataset_text_field="text",
    save_strategy="steps",
    hub_strategy="every_save",
    push_to_hub=True,
    hub_model_id=HUB_MODEL_NAME,
    hub_private_repo=True
)

# And now, the Supervised Fine Tuning Trainer will carry out the fine-tuning
# Given these 2 sets of configuration parameters
# The latest version of trl is showing a warning about labels - please ignore this warning
# But let me know if you don't see good training results (loss coming down).

fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=train,
    peft_config=lora_parameters,
    args=train_parameters,
    data_collator=collator
  )

# Execute Fine-Tuning

Model used: **clanredhead/pricer-2025-04-30_01.18.39**

Memory footprint: **5700.6 MB**

| Step | Training Loss |
|------| -------------|
| 50 | 2.874500 |
| 100 | 2.518400 |

Fine-tuned model architecture:

    PeftModelForCausalLM(
      (base_model): LoraModel(
        (model): LlamaForCausalLM(
          (model): LlamaModel(
            (embed_tokens): Embedding(128256, 4096)
            (layers): ModuleList(
              (0-31): 32 x LlamaDecoderLayer(
                (self_attn): LlamaAttention(
                  (q_proj): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=4096, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=4096, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (k_proj): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=4096, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=1024, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (v_proj): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=4096, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=1024, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                  (o_proj): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=4096, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=4096, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                    (lora_magnitude_vector): ModuleDict()
                  )
                )
                (mlp): LlamaMLP(
                  (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
                  (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
                  (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
                  (act_fn): SiLU()
                )
                (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
                (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
              )
            )
            (norm): LlamaRMSNorm((4096,), eps=1e-05)
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
        )
      )
    )
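
The printout above is enough to count the trainable LoRA parameters and to account for the increase in memory footprint from 5591.5 MB to 5700.6 MB. Each adapted projection adds a `lora_A` matrix (`in_features × r`) and a `lora_B` matrix (`r × out_features`). The sketch below assumes the adapter weights are stored as 32-bit floats (4 bytes each). That byte size is an assumption.

```python
LORA_R = 32
LAYERS = 32

# (in_features, out_features) of each targeted projection, read from the printout above
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

# lora_A has in_features * r weights, lora_B has r * out_features weights
params_per_layer = sum(LORA_R * (i + o) for i, o in target_shapes.values())
lora_params = LAYERS * params_per_layer
print(f"Trainable LoRA parameters: {lora_params:,}")  # 27,262,976

BYTES_PER_ADAPTER_PARAM = 4   # assumption: adapters held in float32
adapter_mb = lora_params * BYTES_PER_ADAPTER_PARAM / 1e6
base_mb = 5591.5              # footprint of the quantized base model reported earlier
print(f"Adapter size: {adapter_mb:.1f} MB")
print(f"Base + adapters: {base_mb + adapter_mb:.1f} MB")  # 5700.6 MB
```

Under that assumption the adapters add about 109 MB, which matches the difference between the two reported footprints. The roughly 27 million trainable weights are a small fraction of the 8B-parameter base model.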

In [None]:
# Fine-tune!
fine_tuning.train()

# Push our fine-tuned model to Hugging Face
fine_tuning.model.push_to_hub(PROJECT_RUN_NAME, private=True)
print(f"Saved to the hub: {PROJECT_RUN_NAME}")

In [None]:
if LOG_TO_WANDB:
  wandb.finish()