# Llama3 7B Fine-tuning on MAWPS Dataset with LoRA

This notebook demonstrates how to fine-tune Llama3 7B on the MAWPS (Math Word Problems) dataset using LoRA (Low-Rank Adaptation) for efficient parameter training.

## Overview

- **Model**: Llama3 7B (or a compatible model; the cells below run `Qwen/Qwen3-0.6B`)
- **Dataset**: MAWPS - Math word problems dataset
- **Method**: LoRA fine-tuning for efficient training
- **Goal**: Improve model's ability to solve math word problems

## Features

- Efficient training with LoRA (Low-Rank Adaptation)
- 4-bit quantization for memory efficiency
- Custom evaluation metrics for math problems
- Modular code structure for easy experimentation

## 1. Install Required Dependencies

First, let's install all the necessary libraries for fine-tuning Llama3 with LoRA.

In [None]:
# Install required packages
# Run this cell if packages are not already installed

# Quote each requirement so the shell does not treat ">" as an output redirect
# !pip install "torch>=2.0.0"
# !pip install "transformers>=4.35.0"
# !pip install "datasets>=2.14.0"
# !pip install "peft>=0.6.0"
# !pip install "accelerate>=0.24.0"
# !pip install "bitsandbytes>=0.41.0"
# !pip install "trl>=0.7.0"
# !pip install "wandb>=0.15.0"
# !pip install "scikit-learn>=1.3.0"

print("Dependencies installation commands are ready!")
print("Uncomment and run the pip install commands if needed.")
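
Before installing anything, it can help to see which of these packages are already present and whether they meet the minimum versions listed above. The following is a minimal sketch using only the standard library; `MINIMUMS`, `parse_version`, and `meets_minimum` are hypothetical helper names, and the version parsing is deliberately simple (it ignores pre-release tags and local suffixes such as `+cu126`).

```python
import re
from importlib import metadata

# Minimum versions copied from the install commands above
MINIMUMS = {
    "torch": "2.0.0",
    "transformers": "4.35.0",
    "datasets": "2.14.0",
    "peft": "0.6.0",
    "accelerate": "0.24.0",
    "bitsandbytes": "0.41.0",
}

def parse_version(version):
    """Turn '2.7.1+cu126' into (2, 7, 1); anything after the numeric parts is ignored."""
    core = version.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

for name, minimum in MINIMUMS.items():
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    print(f"{name}: {installed} (need >= {minimum}) -> {status}")
```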


In [3]:
# Import necessary libraries
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset, Dataset
from peft import (
    get_peft_model,
    LoraConfig,
    prepare_model_for_kbit_training,
    PeftModel
)
import pandas as pd
import numpy as np
import re
import json
import logging
from sklearn.metrics import accuracy_score
import warnings

# Setup
warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")



PyTorch version: 2.7.1+cu126
Transformers version: 4.54.1
CUDA available: True
GPU: NVIDIA RTX 500 Ada Generation Laptop GPU
GPU Memory: 4.1 GB


## 2. Load and Explore MAWPS Dataset

The MAWPS (Math Word Problems) dataset contains grade school math word problems. Let's load and explore the dataset structure.

In [5]:
# Load MAWPS dataset
print("Loading MAWPS dataset...")
dataset = load_dataset("mwpt5/MAWPS")

print("Dataset structure:")
print(dataset)

# Explore the dataset
for split_name, split_data in dataset.items():
    print(f"\n{split_name} split:")
    print(f"  Number of examples: {len(split_data)}")
    print(f"  Features: {list(split_data.features.keys())}")

# Look at sample examples
print("\nSample examples from training set:")
for i in range(3):
    example = dataset['train'][i]
    print(f"\nExample {i+1}:")
    for key, value in example.items():
        print(f"  {key}: {value}")

Loading MAWPS dataset...


Generating train split: 100%|██████████| 1772/1772 [00:00<00:00, 31752.77 examples/s]

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['Question', 'Equation', 'Answer', 'Numbers'],
        num_rows: 1772
    })
})

train split:
  Number of examples: 1772
  Features: ['Question', 'Equation', 'Answer', 'Numbers']

Sample examples from training set:

Example 1:
  Question: Mary is baking a cake . The recipe wants N_00 cups of flour . She already put in N_01 cups . How many cups does she need to add ?
  Equation: N_00 - N_01
  Answer: 6.0
  Numbers: 8.0 2.0

Example 2:
  Question: There are N_00 erasers and N_01 scissors in the drawer . Jason placed N_02 erasers in the drawer . How many erasers are now there in total ?
  Equation: N_00 + N_02
  Answer: 270.0
  Numbers: 139.0 118.0 131.0

Example 3:
  Question: One pencil weighs N_00 grams . How much do N_01 pencils weigh ?
  Equation: N_00 * N_01
  Answer: 141.5
  Numbers: 28.3 5.0





## 3. Load Llama3 7B Model with LoRA Configuration

Now let's load the Llama3 7B model (or a compatible model) and configure LoRA for efficient fine-tuning.

In [None]:
# Model configuration
# The default is a small open-access model that needs no authentication
MODEL_NAME = "Qwen/Qwen3-0.6B"
# MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # Requires HF authentication and approval
# MODEL_NAME = "huggyllama/llama-7b"               # Another option if you have access
# MODEL_NAME = "microsoft/DialoGPT-medium"         # GPT-2 style model; needs different LoRA target modules

MAX_LENGTH = 512

# LoRA configuration - adjusted for Qwen3
LORA_CONFIG = {
    "r": 16,                    # Rank
    "lora_alpha": 32,          # Alpha parameter for LoRA scaling
    "target_modules": [        # Target modules for LoRA (adjusted for Qwen3)
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    "lora_dropout": 0.1,       # Dropout probability for LoRA layers
    "bias": "none",            # Bias type
    "task_type": "CAUSAL_LM"   # Task type
}


print(f"Model: {MODEL_NAME}")
print(f"Max length: {MAX_LENGTH}")
print(f"LoRA configuration: {LORA_CONFIG}")
print("\nNote: Using Qwen3-0.6B as an open-access default.")
print("To use Llama models:")
print("1. Get approval at: https://huggingface.co/meta-llama/Llama-2-7b-hf")
print("2. Login with: huggingface-cli login")
print("3. Then change MODEL_NAME to 'meta-llama/Llama-2-7b-hf'")

Model: Qwen/Qwen3-0.6B
Max length: 512
LoRA configuration: {'r': 16, 'lora_alpha': 32, 'target_modules': ['attn.c_attn', 'attn.c_proj', 'mlp.c_fc1', 'mlp.c_fc2'], 'lora_dropout': 0.1, 'bias': 'none', 'task_type': 'CAUSAL_LM'}

Note: Using DialoGPT as an open-access alternative.
To use Llama models:
1. Get approval at: https://huggingface.co/meta-llama/Llama-2-7b-hf
2. Login with: huggingface-cli login
3. Then change MODEL_NAME back to 'meta-llama/Llama-2-7b-hf'


In [None]:
# Optional: Login to Hugging Face (only needed for gated models like Llama)
# Uncomment and run this cell if you want to use Llama models

# from huggingface_hub import login
# login()

# Or alternatively, set your token directly:
# import os
# os.environ["HF_TOKEN"] = "your_token_here"  # HF_TOKEN is the variable huggingface_hub reads

print("Authentication cell ready (currently commented out)")
print("Uncomment lines above if you need to authenticate for gated models")

In [13]:
# Setup quantization configuration for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer.padding_side = "right"  # Fix for fp16

print(f"Tokenizer loaded. Vocab size: {len(tokenizer)}")

# Load base model
print("Loading base model...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

print("Base model loaded successfully!")

Loading tokenizer...
Tokenizer loaded. Vocab size: 151669
Loading base model...


INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Base model loaded successfully!
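
To see why 4-bit quantization and a small model matter on the 4.1 GB GPU reported earlier, a back-of-the-envelope estimate of weight memory is useful. The sketch below is only a rough approximation: the parameter counts are nominal values read off the model names (an assumption), and it ignores activations, optimizer state, the LoRA weights, quantization constants, and any layers that stay unquantized. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Nominal parameter counts (assumptions, rounded from the model names)
NOMINAL_SIZES = {"0.6B model": 0.6e9, "7B model": 7e9}

for name, num_params in NOMINAL_SIZES.items():
    for bits in (16, 4):
        gb = weight_memory_gb(num_params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB")
```

Under these assumptions a 7B model needs about 3.5 GB for 4-bit weights alone, which leaves very little headroom on a 4 GB card, while a 0.6B model needs well under 1 GB.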


In [14]:
# Setup LoRA
print("Setting up LoRA...")
lora_config = LoraConfig(**LORA_CONFIG)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"Trainable params: {trainable_params:,} || "
        f"All params: {all_param:,} || "
        f"Trainable%: {100 * trainable_params / all_param:.2f}%"
    )

print_trainable_parameters(model)
print("LoRA setup complete!")

Setting up LoRA...


ValueError: Target modules {'mlp.c_fc2', 'mlp.c_fc1', 'attn.c_proj', 'attn.c_attn'} not found in the base model. Please check the target modules and try again.
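
A `ValueError` like the one above means the names in `target_modules` do not occur in the loaded model. One way to find valid names is to list the leaf names of the model's linear layers before calling `get_peft_model`. This is a minimal sketch: `linear_leaf_names` is a hypothetical helper, and it identifies linear layers by class name (so it also matches quantized variants such as `Linear4bit`), which is a heuristic rather than an official PEFT mechanism.

```python
def linear_leaf_names(model):
    """Collect the distinct last-path-component names of Linear-like submodules."""
    names = set()
    for full_name, module in model.named_modules():
        # Heuristic: match Linear, Linear4bit, and similar class names
        if "Linear" in type(module).__name__:
            names.add(full_name.split(".")[-1])
    return sorted(names)

# Usage (assumes `model` is the base model loaded earlier, before LoRA is applied):
# print(linear_leaf_names(model))
```

The output head (often called `lm_head`) may appear in the list as well; it is typically left out of `target_modules`.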

In [10]:
model

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear4bit(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (po
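
The layer shapes in the printout above are enough to estimate how many trainable parameters LoRA adds. For each adapted linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices of shape `r x in_features` and `out_features x r`. The sketch below does that arithmetic for rank 16 on the seven projection layers targeted in `LORA_CONFIG`; `SHAPES` and `lora_params` are hypothetical names, and the shapes are copied by hand from the printout.

```python
R = 16            # LoRA rank, as in LORA_CONFIG
NUM_LAYERS = 28   # decoder layers shown in the model printout

# (in_features, out_features) copied from the model printout
SHAPES = {
    "q_proj": (1024, 2048),
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "o_proj": (2048, 1024),
    "gate_proj": (1024, 3072),
    "up_proj": (1024, 3072),
    "down_proj": (3072, 1024),
}

def lora_params(in_features, out_features, r):
    """Parameters in the two LoRA matrices: A is r x in, B is out x r."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in total:          {total:,}")
```

With these shapes the adapters come to roughly 10 million parameters, which is the quantity `print_trainable_parameters` should report as trainable when `bias` is `"none"`.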

## 4. Prepare Dataset for Training

Now let's preprocess the MAWPS dataset by formatting it into instruction-response pairs suitable for fine-tuning.

In [None]:
# Define instruction template
INSTRUCTION_TEMPLATE = """Below is a math word problem. Solve it step by step.

### Problem:
{problem}

### Solution:
{solution}"""

def format_instruction(example):
    """Format a single example using the instruction template."""
    # Handle different possible field names in MAWPS dataset
    problem = example.get('Question', example.get('question', example.get('Problem', '')))
    solution = example.get('Answer', example.get('answer', example.get('Solution', '')))
    
    # Ensure solution is a string
    if isinstance(solution, (int, float)):
        solution = str(solution)
    
    formatted_text = INSTRUCTION_TEMPLATE.format(problem=problem, solution=solution)
    return {"text": formatted_text}

# Test formatting with a sample
sample_example = dataset['train'][0]
formatted_sample = format_instruction(sample_example)
print("Sample formatted example:")
print(formatted_sample["text"])
print("\n" + "="*50)

In [None]:
# Tokenization function
def tokenize_function(examples):
    """Tokenize examples for training."""
    # Format all examples
    formatted_texts = [format_instruction(example)["text"] for example in examples]
    
    # Tokenize
    tokenized = tokenizer(
        formatted_texts,
        truncation=True,
        padding=True,
        max_length=MAX_LENGTH,
        return_tensors=None
    )
    
    # For causal LM, labels are the same as input_ids
    tokenized["labels"] = tokenized["input_ids"].copy()
    
    return tokenized

# Prepare datasets
print("Preprocessing datasets...")

# For demonstration, we'll use a subset of the data
# In practice, you might want to use the full dataset
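# This copy of MAWPS ships only a 'train' split, so hold out its last 200 examples for evaluation
num_examples = len(dataset['train'])
train_dataset = dataset['train'].select(range(1000))                             # Use first 1000 examples
eval_dataset = dataset['train'].select(range(num_examples - 200, num_examples))  # Use last 200 examples for eval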

# Apply formatting and tokenization
def preprocess_batch(examples):
    """Preprocess a batch of examples."""
    formatted_examples = []
    for i in range(len(examples[list(examples.keys())[0]])):
        example = {key: examples[key][i] for key in examples.keys()}
        formatted_examples.append(example)
    
    return tokenize_function(formatted_examples)

train_dataset = train_dataset.map(
    preprocess_batch,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing train dataset"
)

eval_dataset = eval_dataset.map(
    preprocess_batch,
    batched=True,
    remove_columns=eval_dataset.column_names,
    desc="Tokenizing eval dataset"
)

print(f"Train dataset size: {len(train_dataset)}")
print(f"Eval dataset size: {len(eval_dataset)}")
print(f"Sample tokenized length: {len(train_dataset[0]['input_ids'])}")

## 5. Configure Training Arguments

Set up the training configuration with appropriate hyperparameters for LoRA fine-tuning.

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results_llama_mawps",
    num_train_epochs=2,                    # Number of training epochs
    per_device_train_batch_size=4,         # Batch size per device during training
    per_device_eval_batch_size=4,          # Batch size for evaluation
    gradient_accumulation_steps=4,         # Steps to accumulate gradients
    learning_rate=2e-4,                    # Learning rate
    max_grad_norm=1.0,                     # Max gradient norm for clipping
    weight_decay=0.01,                     # Weight decay
    warmup_ratio=0.1,                      # Warmup ratio
    lr_scheduler_type="cosine",            # Learning rate scheduler
    logging_steps=10,                      # Log every N steps
    eval_strategy="steps",                 # Evaluation strategy (named evaluation_strategy before transformers 4.41)
    eval_steps=100,                        # Evaluate every N steps
    save_strategy="steps",                 # Save strategy
    save_steps=200,                        # Save every N steps
    save_total_limit=2,                    # Maximum number of checkpoints to keep
    load_best_model_at_end=True,          # Load best model at the end
    metric_for_best_model="eval_loss",     # Metric to use for best model
    greater_is_better=False,               # Whether metric should be maximized
    
    # Performance optimizations
    fp16=True,                             # Use mixed precision training
    dataloader_pin_memory=False,           # Don't pin memory (can cause issues with some setups)
    group_by_length=True,                  # Group sequences by length for efficiency
    optim="paged_adamw_32bit",            # Optimizer
    
    # Reporting
    report_to=[],                          # Don't report to wandb for this demo
    logging_dir="./logs",
)

print("Training arguments configured:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Output directory: {training_args.output_dir}")
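
It is worth checking how many optimizer steps these settings produce, because `eval_steps`, `save_steps`, and the warmup all depend on that number. The sketch below works through the arithmetic for 1000 training examples. It assumes a single GPU, and it reports a low and a high estimate because the handling of the last partial accumulation group has differed between Trainer versions.

```python
import math

num_examples = 1000     # size of the training subset
per_device_batch = 4    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
epochs = 2              # num_train_epochs
num_devices = 1         # assumption: a single GPU

effective_batch = per_device_batch * grad_accum * num_devices
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))

# Optimizer updates per epoch, without and with the final partial group
updates_low = batches_per_epoch // grad_accum
updates_high = math.ceil(batches_per_epoch / grad_accum)

print(f"Effective batch size: {effective_batch}")
print(f"Batches per epoch:    {batches_per_epoch}")
print(f"Total optimizer steps: {updates_low * epochs} to {updates_high * epochs}")
```

With about 125 total steps, `eval_steps=100` yields a single mid-training evaluation and `save_steps=200` is never reached, so smaller values may be more useful for a run of this size.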

## 6. Initialize Trainer and Start Fine-tuning

Create the trainer and start the fine-tuning process.

In [None]:
# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're doing causal language modeling, not masked language modeling
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

print("Trainer initialized successfully!")
print("Starting training...")

# Start training
train_result = trainer.train()

print("Training completed!")
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training steps: {train_result.global_step}")

## 7. Evaluate Model Performance

Evaluate the fine-tuned model and compute metrics.

In [None]:
# Evaluate the model
print("Evaluating model...")
eval_result = trainer.evaluate()

print("Evaluation Results:")
for key, value in eval_result.items():
    print(f"  {key}: {value:.4f}")

# Custom evaluation function for math problems
def extract_number(text):
    """Extract numerical answer from generated text."""
    # Look for numbers at the end of the text
    numbers = re.findall(r'[-+]?\d*\.?\d+', text)
    if numbers:
        try:
            return float(numbers[-1])
        except ValueError:
            return None
    return None

def evaluate_math_problems(model, tokenizer, eval_examples, num_samples=10):
    """Evaluate model on math problems."""
    correct = 0
    total = 0
    
    for i in range(min(num_samples, len(eval_examples))):
        example = eval_examples[i]
        
        # Get the original problem
        problem = example.get('Question', example.get('question', example.get('Problem', '')))
        expected_answer = example.get('Answer', example.get('answer', example.get('Solution', '')))
        
        # Generate solution
        prompt = f"Below is a math word problem. Solve it step by step.\n\n### Problem:\n{problem}\n\n### Solution:\n"
        
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,                  # Pass attention_mask together with input_ids
                max_new_tokens=128,
                do_sample=False,           # Greedy decoding keeps the evaluation deterministic
                pad_token_id=tokenizer.eos_token_id
            )
        
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract solution
        if "### Solution:" in generated_text:
            solution = generated_text.split("### Solution:")[-1].strip()
        else:
            solution = generated_text.strip()
        
        # Extract numerical answers
        predicted_num = extract_number(solution)
        expected_num = extract_number(str(expected_answer))
        
        # Score every problem; a missing or unparseable prediction counts as wrong
        is_correct = (
            predicted_num is not None
            and expected_num is not None
            and abs(predicted_num - expected_num) < 1e-6
        )
        correct += int(is_correct)
        total += 1

        print(f"\nExample {i+1}:")
        print(f"Problem: {problem}")
        print(f"Expected: {expected_answer}")
        print(f"Predicted: {solution}")
        print(f"Correct: {is_correct}")
    
    accuracy = correct / total if total > 0 else 0
    print(f"\nAccuracy on {total} problems: {accuracy:.2%}")
    return accuracy

# Run custom evaluation
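# The dataset has no 'test' split; use the last 10 raw examples, which lie outside the first 1000 used for training
n_total = len(dataset['train'])
original_eval_data = dataset['train'].select(range(n_total - 10, n_total))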
accuracy = evaluate_math_problems(model, tokenizer, original_eval_data)
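
The regular expression in `extract_number` takes the last number in the generated text, which misreads values written with thousands separators: for `"1,250"` it returns `250.0`. The following sketch shows one possible variant that removes such separators first. `extract_number_robust` is a hypothetical name, and the separator rule (a comma between a digit and a group of exactly three digits) is an assumption about how the model formats numbers.

```python
import re

def extract_number_robust(text):
    """Return the last number in `text` as a float, or None if there is none."""
    # Drop thousands separators such as the comma in "1,250"
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    matches = re.findall(r"[-+]?\d*\.?\d+", cleaned)
    return float(matches[-1]) if matches else None

for sample_text in ["8 - 2 = 6", "The total is 1,250 grams", "no digits here"]:
    print(repr(sample_text), "->", extract_number_robust(sample_text))
```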

## 8. Save and Load Fine-tuned Model

Save the LoRA adapter weights and demonstrate how to load them later.

In [None]:
# Save the LoRA adapter
model_save_path = "./llama_mawps_lora"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"Model saved to: {model_save_path}")

# Demonstrate how to load the model later
print("\nDemonstrating model loading...")

# For loading, you would:
# 1. Load the base model
# 2. Load the LoRA adapter

# Note: In a new session, you would do this:
"""
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter
finetuned_model = PeftModel.from_pretrained(base_model, model_save_path)

# Load tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained(model_save_path)
"""

print("Model loading instructions saved in comments above.")

## 9. Test Model with Custom Examples

Test the fine-tuned model with custom math word problems to see the improvements.

In [None]:
# Test with custom math problems
custom_problems = [
    "Sarah has 15 apples. She gives 7 apples to her friend Tom and 3 apples to her sister. How many apples does Sarah have left?",
    "A school has 450 students. If 180 students are boys, how many students are girls?",
    "Mike bought 3 packages of pencils. Each package contains 12 pencils. How many pencils did Mike buy in total?",
    "A rectangle has a length of 8 meters and a width of 5 meters. What is the area of the rectangle?"
]

def test_model(model, tokenizer, problem):
    """Test the model with a single problem."""
    prompt = f"Below is a math word problem. Solve it step by step.\n\n### Problem:\n{problem}\n\n### Solution:\n"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,                  # Pass attention_mask together with input_ids
            max_new_tokens=200,
            temperature=0.1,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract solution
    if "### Solution:" in generated_text:
        solution = generated_text.split("### Solution:")[-1].strip()
    else:
        solution = generated_text.strip()
    
    return solution

print("Testing fine-tuned model with custom problems:")
print("=" * 60)

for i, problem in enumerate(custom_problems, 1):
    print(f"\nTest {i}:")
    print(f"Problem: {problem}")
    
    solution = test_model(model, tokenizer, problem)
    print(f"Solution: {solution}")
    print("-" * 40)

print("\nTesting completed!")

## Conclusion and Next Steps

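Congratulations! You have walked through LoRA fine-tuning of a causal language model on the MAWPS dataset. The cells above run `Qwen/Qwen3-0.6B`; Llama models use the same workflow once you have access to them.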

### What we accomplished:
- ✅ Loaded and explored the MAWPS dataset
- ✅ Set up the base model with 4-bit quantization
- ✅ Configured LoRA for efficient fine-tuning
- ✅ Preprocessed the dataset into instruction-response format
- ✅ Fine-tuned the model while training only the LoRA adapter weights
- ✅ Evaluated the model performance
- ✅ Saved the LoRA adapter for future use

### Next Steps:
1. **Experiment with hyperparameters**: Try different LoRA ranks, learning rates, or batch sizes
2. **Use the full dataset**: We used a subset for demonstration - use the full dataset for better results
3. **Add better evaluation metrics**: Implement more sophisticated math problem evaluation
4. **Deploy the model**: Create an API or web interface for the fine-tuned model
5. **Compare with base model**: Evaluate the base model on the same problems to see improvement

### Modifying the Project:
- **Change the model**: Replace `MODEL_NAME` with any compatible model (Llama2, CodeLlama, etc.)
- **Adjust LoRA config**: Modify `LORA_CONFIG` to experiment with different adapter settings
- **Change the dataset**: Replace MAWPS with any other instruction-following dataset
- **Modify the prompt template**: Customize `INSTRUCTION_TEMPLATE` for different tasks

The project structure is modular and easily extensible for your specific needs!