<a href="https://colab.research.google.com/github/khanhvy31/Unsloth-fine-tuning-SQUAD/blob/main/SQUAD_finetuning_LLAMA_using_Unsloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

## Unsloth

### Load pre-trained LLM

In this tutorial, we will be working with the Stanford Question Answering Dataset (SQuAD), a benchmark dataset designed for machine reading comprehension and question answering tasks.

Let's load a pre-trained language model using the Unsloth library's `FastLanguageModel` class.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Pre-quantized 4-bit Llama-3 8B base model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]
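As a rough sanity check on why `load_in_4bit = True` matters, the weight footprint can be estimated from the layer shapes. The sketch below assumes the published Llama-3-8B dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024, vocabulary 128256). It also assumes that only the decoder's linear layers are stored in 4 bits while the embeddings, `lm_head`, and norms stay in 16 bits. Neither assumption is stated in this notebook.

```python
# Assumed Llama-3-8B shapes (not taken from this notebook).
HIDDEN, MLP, LAYERS, KV_DIM, VOCAB = 4096, 14336, 32, 1024, 128256

# Linear weights in one decoder block: q, k, v, o, gate, up, down.
block_linear = (
    HIDDEN * HIDDEN      # q_proj
    + HIDDEN * KV_DIM    # k_proj
    + HIDDEN * KV_DIM    # v_proj
    + HIDDEN * HIDDEN    # o_proj
    + 3 * HIDDEN * MLP   # gate_proj, up_proj, down_proj
)
linear_params = LAYERS * block_linear
# Everything else: token embeddings, lm_head, two norms per block, final norm.
other_params = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN
total_params = linear_params + other_params

GB = 1e9
full_16bit_gb = total_params * 2 / GB                          # 2 bytes per weight
mixed_4bit_gb = (linear_params * 0.5 + other_params * 2) / GB  # 0.5 byte per 4-bit weight

print(f"total parameters          : {total_params:,}")
print(f"all weights in 16-bit     : {full_16bit_gb:.2f} GB")
print(f"4-bit linear + 16-bit rest: {mixed_4bit_gb:.2f} GB")
```

Under these assumptions the 4-bit estimate is about 5.6 GB, against roughly 16 GB for 16-bit weights. That is close to the 5.70G `model.safetensors` download shown above; the small gap is plausibly quantization metadata.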

### LoRA adapters

LoRA adapters (Low-Rank Adaptation) are a method used to fine-tune large pre-trained models in a more efficient and resource-friendly manner. Instead of updating all the parameters in a model, LoRA introduces small, trainable adapter modules into the model's architecture.

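We now add LoRA adapters so that only a fraction of the parameters needs updating, typically 1 to 10%. With the settings below the share is higher (the training banner reports 17.33%), because `r = 128` is a large rank and `embed_tokens` and `lm_head` are trained as well.

We add `embed_tokens` and `lm_head` so that the model can learn out-of-distribution data.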
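To see where the trainable-parameter count comes from, here is a minimal sketch for the settings used in the next cell (`r = 128`, seven adapted projections per layer, plus `embed_tokens` and `lm_head`). The layer shapes are the published Llama-3-8B dimensions and are an assumption, since the notebook does not list them. The helper `lora_params` is a hypothetical name used for illustration.

```python
# Assumed Llama-3-8B shapes (not taken from this notebook).
HIDDEN, MLP, LAYERS, KV_DIM, VOCAB = 4096, 14336, 32, 1024, 128256
R = 128  # LoRA rank used in this notebook

# (in_features, out_features) of each adapted projection in one decoder block.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so the pair adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

adapter_params = LAYERS * sum(lora_params(i, o, R) for i, o in projections.values())
# embed_tokens and lm_head are trained in full rather than through low-rank factors.
embedding_params = 2 * VOCAB * HIDDEN
trainable = adapter_params + embedding_params

print(f"LoRA adapters : {adapter_params:,}")
print(f"embeddings    : {embedding_params:,}")
print(f"trainable     : {trainable:,}")
```

The total, 1,386,217,472, matches the "Trainable parameters" figure in the training banner further down. About three quarters of it comes from the two embedding matrices, not from the low-rank adapters.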

In [None]:
model = FastLanguageModel.get_peft_model(
    model,  # The base pre-trained model that you want to fine-tune using PEFT (Parameter-Efficient Fine-Tuning) techniques.

    # r: The LoRA rank, which determines the number of trainable parameters in the low-rank adapters.
    # A higher value (e.g., 128) gives the adapter more capacity to learn task-specific nuances,
    # while lower values (e.g., 8, 16, 32, 64) might be sufficient for simpler tasks.
    r = 128,  # Choose any number > 0. Suggested values are 8, 16, 32, 64, 128.

    # target_modules: A list of module names in the model where the LoRA adapters should be applied.
    # For transformer-based models, these typically include:
    # - "q_proj", "k_proj", "v_proj", "o_proj": The projection layers in the multi-head attention mechanism.
    # - "gate_proj", "up_proj", "down_proj": The projection layers of the feed-forward (MLP) block.
    # - "embed_tokens", "lm_head": Typically included for continual pretraining or when modifying the embedding and output layers.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head"],  # Add these for continual pretraining if needed.

    # lora_alpha: A scaling factor applied to the LoRA weights.
    # This factor adjusts the influence of the LoRA adapters relative to the original model weights.
    lora_alpha = 32,

    # lora_dropout: Dropout rate for the LoRA adapters.
    # A dropout of 0 means no dropout is applied, which is the optimized setting in this configuration.
    lora_dropout = 0,  # Supports any value, but 0 is optimized.

    # bias: Configures the use of bias parameters in the adapters.
    # Setting this to "none" means no additional bias terms are introduced, which simplifies the model.
    bias = "none",  # Supports any value, but "none" is optimized.

    # use_gradient_checkpointing: A technique to reduce memory usage by trading compute for memory.
    # The special "unsloth" setting here is an optimized mode that reportedly reduces VRAM usage by 30%
    # and allows for twice as large batch sizes, which is particularly useful for very long context lengths.
    use_gradient_checkpointing = "unsloth",  # Use True or "unsloth" for very long context scenarios.

    # random_state: Sets a seed for the random number generator to ensure reproducibility in training.
    random_state = 3407,

    # use_rslora: Activates Rank-Stabilized LoRA (rsLoRA), which scales the adapter update by
    # lora_alpha / sqrt(r) instead of lora_alpha / r to keep training stable at high ranks.
    use_rslora = True,  # Enables the use of rank-stabilized LoRA.

    # loftq_config: Configuration for LoftQ, a quantization-aware initialization for the LoRA weights.
    # Here it is set to None, meaning that LoftQ is not applied in this configuration.
    loftq_config = None,  # LoftQ is not used in this case.
)


Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
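To make the interaction between `r`, `lora_alpha`, and `use_rslora` concrete: standard LoRA multiplies the low-rank update by `lora_alpha / r`, while rank-stabilized LoRA uses `lora_alpha / sqrt(r)`. The helper below is a hypothetical illustration of that arithmetic, not part of Unsloth or PEFT.

```python
import math

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update B @ A before it is added to the frozen weight."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Compare both rules for lora_alpha = 32 at a few ranks.
for rank in (8, 32, 128):
    standard = lora_scaling(32, rank)
    stabilized = lora_scaling(32, rank, use_rslora=True)
    print(f"r={rank:<4} standard={standard:.3f}  rsLoRA={stabilized:.3f}")
```

With `r = 128` and `lora_alpha = 32`, the multiplier is 0.25 under the standard rule and about 2.83 with rsLoRA, so the adapters in this notebook are scaled considerably more strongly than `lora_alpha / r` alone would suggest.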


<a name="Data"></a>
### Data Prep
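We now use the SQuAD dataset from Hugging Face. To speed up training we take only the first 2,000 rows and split them 80/20 into training and test sets. Each training example must end with a stop marker (`<|eot|>` in the formatting function below); otherwise the model never learns where an answer ends and its generation goes on indefinitely.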


In [None]:
from datasets import load_dataset
squad = load_dataset("squad", split="train[:2000]")
squad = squad.train_test_split(test_size=0.2)

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Before preprocessing

In [None]:
text = squad['train'][0]
text

{'id': '573016fb947a6a140053d0b5',
 'title': 'Antibiotics',
 'context': 'Antibiotics revolutionized medicine in the 20th century, and have together with vaccination led to the near eradication of diseases such as tuberculosis in the developed world. Their effectiveness and easy access led to overuse, especially in livestock raising, prompting bacteria to develop resistance. This has led to widespread problems with antimicrobial and antibiotic resistance, so much as to prompt the World Health Organization to classify antimicrobial resistance as a "serious threat [that] is no longer a prediction for the future, it is happening right now in every region of the world and has the potential to affect anyone, of any age, in any country".',
 'question': 'What is one issue that can arise from overuse of antibiotics?',
 'answers': {'text': ['overuse, especially in livestock raising, prompting bacteria to develop resistance'],
  'answer_start': [220]}}
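The `answers` field stores parallel lists: each entry in `text` is an answer string, and the matching entry in `answer_start` is its character offset within `context`. A small check on the record above (context shortened to its first two sentences) shows how the two fit together; the preprocessing later keeps only `text[0]`.

```python
record = {
    "context": (
        "Antibiotics revolutionized medicine in the 20th century, and have together with "
        "vaccination led to the near eradication of diseases such as tuberculosis in the "
        "developed world. Their effectiveness and easy access led to overuse, especially in "
        "livestock raising, prompting bacteria to develop resistance."
    ),
    "answers": {
        "text": ["overuse, especially in livestock raising, prompting bacteria to develop resistance"],
        "answer_start": [220],
    },
}

answer = record["answers"]["text"][0]
start = record["answers"]["answer_start"][0]

# Slice the context at the recorded offset and compare with the answer string.
span = record["context"][start:start + len(answer)]
print(span == answer)
```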

In [None]:
from datasets import load_dataset
from unsloth import UnslothTrainer, UnslothTrainingArguments, FastLanguageModel
from transformers import DataCollatorForLanguageModeling

# Preprocess
def preprocess_function(examples):
    contexts = examples["context"]
    questions = examples["question"]
    answers = [ans["text"][0] for ans in examples["answers"]]
    inputs = [f"Context: {context} Question: {question}" for context, question in zip(contexts, questions)]
    return {"input_text": inputs, "target_text": answers}

squad_processed = squad.map(preprocess_function, batched=True, num_proc=8, remove_columns=["id", "title", "context", "question", "answers"])

# Formatting with stop token
def formatting_func(example):
    return f"{example['input_text']} Answer: {example['target_text']}<|eot|>"


Map (num_proc=8):   0%|          | 0/1600 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/400 [00:00<?, ? examples/s]
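With `batched=True`, `map` passes the function a dictionary of columns, where each value is a list with one entry per row, and expects lists of the same length back. The sketch below runs the same `preprocess_function` on a hand-written two-row batch (toy data, not taken from SQuAD) so the layout is visible without downloading anything.

```python
def preprocess_function(examples):
    # With batched=True each key maps to a list holding one entry per row.
    answers = [ans["text"][0] for ans in examples["answers"]]
    inputs = [
        f"Context: {c} Question: {q}"
        for c, q in zip(examples["context"], examples["question"])
    ]
    return {"input_text": inputs, "target_text": answers}

# A hand-written batch of two rows in column-oriented form (toy data).
toy_batch = {
    "context": [
        "Paris is the capital of France.",
        "Water boils at 100 degrees Celsius at sea level.",
    ],
    "question": [
        "What is the capital of France?",
        "At what temperature does water boil at sea level?",
    ],
    "answers": [
        {"text": ["Paris"], "answer_start": [0]},
        {"text": ["100 degrees Celsius"], "answer_start": [15]},
    ],
}

processed = preprocess_function(toy_batch)
for prompt, target in zip(processed["input_text"], processed["target_text"]):
    print(prompt, "->", target)
```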

In [None]:
squad_processed["train"][0]

{'input_text': 'Context: Antibiotics revolutionized medicine in the 20th century, and have together with vaccination led to the near eradication of diseases such as tuberculosis in the developed world. Their effectiveness and easy access led to overuse, especially in livestock raising, prompting bacteria to develop resistance. This has led to widespread problems with antimicrobial and antibiotic resistance, so much as to prompt the World Health Organization to classify antimicrobial resistance as a "serious threat [that] is no longer a prediction for the future, it is happening right now in every region of the world and has the potential to affect anyone, of any age, in any country". Question: What is one issue that can arise from overuse of antibiotics?',
 'target_text': 'overuse, especially in livestock raising, prompting bacteria to develop resistance'}
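One caveat about the stop marker appended by `formatting_func`: a literal string such as `<|eot|>` only ends generation if the tokenizer maps it to its end-of-sequence token. That may not hold for Llama-3, whose documented special tokens include `<|end_of_text|>` and `<|eot_id|>`. A minimal sketch of a safer variant, assuming `tokenizer.eos_token` is set, passes the EOS string in explicitly. `make_formatting_func` is a hypothetical helper name.

```python
def make_formatting_func(eos_token):
    """Build a formatter that ends every training example with the given EOS string."""
    def formatting_func(example):
        return f"{example['input_text']} Answer: {example['target_text']}{eos_token}"
    return formatting_func

# In the notebook this would be make_formatting_func(tokenizer.eos_token);
# "<|end_of_text|>" is only a stand-in so the sketch runs without a model.
formatting_func = make_formatting_func("<|end_of_text|>")

example = {"input_text": "Context: ... Question: ...", "target_text": "44"}
print(formatting_func(example))
```

Reading the marker from the tokenizer keeps the training text and the model's stopping criterion in agreement if the base model is swapped.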

In [None]:
formatting_func(squad_processed["train"][1])

"Context: The Review of Politics was founded in 1939 by Gurian, modeled after German Catholic journals. It quickly emerged as part of an international Catholic intellectual revival, offering an alternative vision to positivist philosophy. For 44 years, the Review was edited by Gurian, Matthew Fitzsimons, Frederick Crosson, and Thomas Stritch. Intellectual leaders included Gurian, Jacques Maritain, Frank O'Malley, Leo Richard Ward, F. A. Hermens, and John U. Nef. It became a major forum for political ideas and modern political concerns, especially from a Catholic and scholastic tradition. Question: Over how many years did Gurian edit the Review of Politics at Notre Dame? Answer: 44<|eot|>"

<a name="Train"></a>
### Continued Pretraining
Now let's train with Unsloth's `UnslothTrainer`.

Because `embed_tokens` and `lm_head` are being trained, also set `embedding_learning_rate` to a value 2x to 10x smaller than `learning_rate` (here 5e-6 versus 5e-5).

In [None]:
from transformers import DataCollatorForLanguageModeling

# Causal-LM collator: mlm=False turns off masked-language-model masking,
# so the labels are a copy of the input ids.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [None]:
from unsloth import is_bfloat16_supported

# Trainer
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=squad_processed["train"],
    dataset_text_field="input_text",
    formatting_func=formatting_func,
    max_seq_length=512,
    data_collator=data_collator,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=3,
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,
        fp16 = not is_bfloat16_supported(), # Fall back to float16 when bfloat16 is unavailable.
        bf16 = is_bfloat16_supported(),     # Use bfloat16 when the hardware supports it.
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.00,
        lr_scheduler_type="cosine", # Cosine decay of the learning rate after warmup.
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

# Train
trainer.train()

Unsloth: Tokenizing ["input_text"] (num_proc=12):   0%|          | 0/1600 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,600 | Num Epochs = 3 | Total steps = 300
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 1,386,217,472/8,000,000,000 (17.33% trained)


Step,Training Loss
1,0.0783
2,0.0765
3,0.0854
4,0.0843
5,0.0809
6,0.0866
7,0.0709
8,0.1058
9,0.0945
10,0.095


TrainOutput(global_step=300, training_loss=0.07189546902974446, metrics={'train_runtime': 1031.6962, 'train_samples_per_second': 4.653, 'train_steps_per_second': 0.291, 'total_flos': 5.97475125435433e+16, 'train_loss': 0.07189546902974446})
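The step count in the banner follows directly from the batch settings, and the learning-rate curve can be sketched in a few lines. The code below reproduces the 300 total steps and then evaluates the common linear-warmup plus cosine-decay formula. It is an illustration under that assumption and is not guaranteed to match the trainer's internal scheduler step for step.

```python
import math

num_examples = 1600
per_device_batch, grad_accum, num_gpus = 2, 8, 1
epochs, warmup_ratio, base_lr = 3, 0.1, 5e-5

effective_batch = per_device_batch * grad_accum * num_gpus   # 2 x 8 x 1
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch={effective_batch}, total steps={total_steps}, warmup steps={warmup_steps}")
for step in (0, 15, 30, 165, 300):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

The learning rate climbs for the first 30 steps, peaks at 5e-5, and then decays toward zero by step 300.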

<a name="Inference"></a>
### Inference
Let's run the model!

First, we check whether the model follows the "Context - Question - Answer" format used during training.


In [None]:
from threading import Thread
import textwrap

from transformers import TextIteratorStreamer

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

# Switch the model to Unsloth's inference mode before generating.
FastLanguageModel.for_inference(model)

inputs = tokenizer(
[
    "<|begin_of_text|>Context: Albert Einstein was a theoretical physicist known for his work on relativity. "
    "His contributions revolutionized the understanding of space, time, and gravity. "
    "Question: What is the theory that made Albert Einstein famous? "
    "Answer:"

], return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 25,
    use_cache = True,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# Accumulate the streamed text.
generated_text = ""
for new_text in text_streamer:
    generated_text += new_text

# Post-process the generated text to extract only the answer.
# The answer starts after "Answer:".
if "Answer:" in generated_text:
    answer_section = generated_text.split("Answer:", 1)[1]
    # If a new question appears, stop there.
    if "Question:" in answer_section:
        answer = answer_section.split("Question:")[0].strip()
    else:
        answer = answer_section.strip()
else:
    answer = generated_text.strip()

# Optionally, wrap the text for display
wrapped_answer = "\n".join(textwrap.wrap(answer, width=max_print_width))
print(wrapped_answer)

Relativity.
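The prompt construction and the post-processing above can be wrapped in two small helpers so they are easy to reuse and test without a GPU. This is a sketch with hypothetical names (`build_prompt`, `extract_answer`). Unlike the cell above, which cuts only at the next "Question:", it also stops at "Context:" and at the `<|eot|>` marker.

```python
def build_prompt(context, question):
    """Reproduce the training layout so inference prompts match what the model saw."""
    return f"Context: {context} Question: {question} Answer:"

def extract_answer(generated_text, stop_markers=("Question:", "Context:", "<|eot|>")):
    """Return the text after the first 'Answer:' up to the earliest stop marker."""
    _, found, tail = generated_text.partition("Answer:")
    answer = tail if found else generated_text
    # Cut at whichever stop marker appears first; keep everything if none appears.
    cut = min((answer.find(m) for m in stop_markers if m in answer), default=len(answer))
    return answer[:cut].strip()

prompt = build_prompt(
    "Albert Einstein was a theoretical physicist known for his work on relativity.",
    "What is the theory that made Albert Einstein famous?",
)
# Hypothetical raw generation: the prompt followed by an answer and some run-on text.
raw = prompt + " Relativity.<|eot|> Question: Who was"
print(extract_answer(raw))
```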
