# Question-Answering with Gemma

In this notebook, we will fine-tune the Gemma-2B instruction-tuned model to perform question answering tasks using the Stanford Question Answering Dataset (SQuAD). Our goal is to develop a specialized model that can extract precise answers from provided context passages.

We'll use Unsloth's optimization techniques for efficient training, implementing Low-Rank Adaptation (LoRA) to minimize required computational resources. This approach allows us to fine-tune a large language model on modest hardware while maintaining high performance.

By the end of this process, we'll have a context-aware question-answering system that can extract information from passages and generate concise, accurate answers based on the provided context.

## Dependencies Installation

We will add necessary dependencies for our LLM fine-tuning project by installing key libraries while suppressing verbose output. This includes optimization tools, dataset handling, and efficient training modules all required for our workflow.

In [1]:
# Set up required libraries
!pip install --quiet unsloth datasets transformers accelerate bitsandbytes peft

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.4/46.4 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.2/193.2 kB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m39.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.1/76.1 MB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 kB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.1/162.1 kB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━

## Library Setup

We will import the essential components for our fine-tuning pipeline. This includes the dataset loader functionality, our specialized Supervised Fine-Tuning trainer from Unsloth, configuration tools for training parameters, and PyTorch as our deep learning framework backbone.

In [2]:
# Load necessary modules
from datasets import load_dataset
from unsloth.trainer import SFTTrainer
from transformers import TrainingArguments
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch SmolVLMForConditionalGeneration forward function.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Model Initialization

We will load a pre-optimized version of the Gemma-2B instruction-tuned model using Unsloth's accelerated implementation. The model is configured with a context window of 2048 tokens, using half-precision floating point and 4-bit quantization to reduce memory requirements while maintaining performance.

In [3]:
# Import the optimized model loading functionality
from unsloth import FastLanguageModel

# Initialize the foundation model with optimization settings
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2b-it",
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
)

==((====))==  Unsloth 2025.4.1: Fast Gemma patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.07G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/154 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/40.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

## Dataset Preparation

We will prepare the Stanford Question Answering Dataset (SQuAD) for our model training. First, we import the necessary language processing tools and acquire the dataset. Then we configure a BERT tokenizer to convert our text into numerical representations. Our processing function combines context passages with their corresponding questions, handling text length constraints appropriately. Finally, we apply this transformation across the entire dataset in efficient batches.

In [9]:
# Import required NLP components
from transformers import AutoTokenizer
from datasets import load_dataset

# Retrieve question-answering corpus
dataset = load_dataset("squad")

# Initialize text processor
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

# Create input encoding processor
def process_text_pairs(examples):
    return tokenizer(examples["context"], examples["question"], truncation=True, padding="max_length", max_length=2048)

# Transform raw text into model-ready format
dataset = dataset.map(process_text_pairs, batched=True)

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

## Data Formatting

We will create a preprocessing function that standardizes our QA dataset entries into a consistent template format. The function handles various possible data structures, extracting context passages, questions, and answers while implementing error handling for missing or malformed entries. Our formatted output follows the context-question-answer pattern required for instruction fine-tuning.

In [10]:
# Create standardized data processor
def prepare_qa_format(data_item):
    # Extract base information with defaults
    background = data_item.get("context", "")
    query = data_item.get("question", "")

    # Flexible response extraction with safety mechanisms
    try:
        # Handle different data structures
        if type(data_item["answers"]) is dict:
            possible_responses = data_item["answers"].get("text", [])
        elif type(data_item["answers"]) is list and data_item["answers"]:
            possible_responses = data_item["answers"][0].get("text", [])
        else:
            possible_responses = []

        # Select first answer or use fallback
        response = possible_responses[0] if possible_responses else "No answer"
    except:
        # Default for any extraction errors
        response = "No answer"

    # Generate formatted instruction template
    return [f"""Context: {background}
Question: {query}
Answer: {response}"""]

## Efficiency Enhancement

We will apply Low-Rank Adaptation (LoRA) to our base model, enabling efficient fine-tuning by focusing on key projection matrices. This configuration uses rank-16 matrices with appropriate scaling and regularization, targeting all critical transformer components while implementing memory-saving gradient checkpointing. This approach dramatically reduces training parameters while maintaining adaptation capability.

In [11]:
# Configure fine-tuning architecture with parameter-efficient approach
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

Unsloth: Already have LoRA adapters! We shall skip this step.


## Training Configuration

We will set up our supervised fine-tuning environment with detailed hyperparameters. This configuration applies our data preparation function, establishes computational efficiency settings (like processor allocation and sequence handling), and defines the learning process parameters including small batch sizes with gradient accumulation, a moderate learning rate with decay, mixed precision computation, and appropriate checkpointing frequency.

In [14]:
# Initialize the supervised training framework
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    formatting_func=prepare_qa_format,
    dataset_text_field=None,
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_steps=200,
        output_dir="qa_outputs",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        report_to="none",
    ),
)

In [15]:
# Execute the training process
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 87,599 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 19,611,648/2,000,000,000 (0.98% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,12.8204
20,12.1843
30,10.3868
40,9.4588
50,8.9612
60,9.2305
70,9.3462
80,9.1312
90,8.7931
100,8.679


TrainOutput(global_step=100, training_loss=9.899154205322265, metrics={'train_runtime': 1580.5536, 'train_samples_per_second': 0.506, 'train_steps_per_second': 0.063, 'total_flos': 1.96755069075456e+16, 'train_loss': 9.899154205322265})

## Model Preservation

We will store both our specialized question-answering model and its corresponding tokenizer to a designated directory. This ensures our fine-tuned solution remains accessible for future inference tasks without requiring retraining, creating a complete deployment-ready package.

In [None]:
# Saving the final model
model.save_pretrained("fine-tuned-gemma-qa")
tokenizer.save_pretrained("fine-tuned-gemma-qa")