# Llama 3.2 — MetaMathQA fine-tuning (Colab-ready)
This notebook prepares MetaMathQA data and demonstrates a PEFT/LoRA fine-tuning workflow suitable for Llama-family causal models. Run in Google Colab with a GPU runtime for best results.

## 1. Overview
This notebook contains: (1) environment and Colab quickstart, (2) data preparation for MetaMathQA, (3) example training using Hugging Face Transformers + PEFT (LoRA), and (4) evaluation examples.

Intended usage: open in Colab (Runtime → Change runtime type → GPU), run the setup cell, prepare data, then run the training cells.

## 2. Environment & Colab quickstart
If you run this notebook locally without a CUDA GPU, training will fail or be extremely slow — prefer Colab or other GPU hosts.

Open in Colab: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/karukan/llamaFinetuning/blob/main/llama3.2-finetune-metamathqa.ipynb)

Quick steps: set Runtime→Change runtime type→GPU, run the setup cell (mount Drive if you want checkpoints persisted).
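
As a rough sketch of the "persist checkpoints on Drive" option, the helper below picks an output directory depending on whether a mounted Drive folder exists. The function name `pick_checkpoint_dir` and the subfolder `llama_finetune_ckpts` are hypothetical. `/content/drive/MyDrive` is assumed to be the location of your Drive after mounting at `/content/drive`.

```python
from pathlib import Path

def pick_checkpoint_dir(drive_root="/content/drive/MyDrive",
                        subdir="llama_finetune_ckpts",
                        fallback="./hf_peft_output"):
    """Return a Drive-backed path when Drive is mounted, else a local fallback."""
    root = Path(drive_root)
    if root.is_dir():
        # Drive is mounted: checkpoints written here survive a runtime reset.
        return root / subdir
    # No Drive: use a local directory (lost when the Colab runtime is recycled).
    return Path(fallback)

print(pick_checkpoint_dir())
```

The returned path could then be assigned to `output_dir` before training, if you want checkpoints to outlive the session.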

In [None]:
import torch
print('PyTorch version:', torch.__version__)
print('CUDA available?', torch.cuda.is_available())
print('CUDA devices:', torch.cuda.device_count())
if torch.cuda.is_available():
    print('Current device name:', torch.cuda.get_device_name(0))

## 4. Data Preparation
We load `meta-math/MetaMathQA` via the `datasets` library, clean the text, and format prompt-completion pairs. The target format used below is a JSONL where each line is {"prompt":..., "completion":...} suitable for many LLM fine-tuning tools.
Why MetaMathQA: it contains math reasoning question-answer examples that can help the model specialize in formal mathematical problem phrasing and solution generation, which makes it useful for benchmarking math reasoning improvements after fine-tuning.
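
As a minimal sketch of that JSONL target format, the snippet below writes two made-up records to an in-memory buffer and reads them back. The sample questions are illustrative only and are not rows from MetaMathQA.

```python
import io
import json

# Illustrative records only; these are not taken from MetaMathQA.
records = [
    {"prompt": "Question: What is 2 + 3?\nAnswer:", "completion": " 5"},
    {"prompt": "Question: What is 7 * 6?\nAnswer:", "completion": " 42"},
]

# JSONL: one JSON object per line. json.dumps escapes the newline inside
# each prompt, so every record stays on a single physical line.
buffer = io.StringIO()
for rec in records:
    buffer.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Reading back: parse each non-empty line independently.
loaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line.strip()]
print(len(loaded), repr(loaded[0]["completion"]))
```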

In [None]:
from datasets import load_dataset
import json, os

# Load dataset from the hub. If you have it locally, adapt the path.
dataset_name = 'meta-math/MetaMathQA'
print('Loading dataset:', dataset_name)
try:
    ds = load_dataset(dataset_name)
except Exception as e:
    print('Failed to load directly. Check network/access or replace with local path. Error:', e)
    ds = None

# Inspect if loaded
if ds is not None:
    print(ds)
    # Print the size of each split (usually just 'train')
    for k in ds.keys():
        print('Split', k, '->', ds[k].num_rows)
    # Show one raw row from the first split to check the column names
    split = list(ds.keys())[0]
    print(f'Example row from split {split!r}:')
    print(ds[split][0])

### 4b. Data cleaning and formatting helpers

In [None]:
import re

def clean_text(s):
    if s is None:
        return ''
    # Basic cleanup: split() treats tabs, carriage returns and newlines as
    # whitespace, so this collapses every whitespace run into a single space.
    s = ' '.join(s.split())
    return s

def format_prompt_completion(example):
    # At the time of writing, MetaMathQA rows use 'query' and 'response' (check the
    # example row printed above). The other names are fallbacks for other schemas.
    q = example.get('query') or example.get('question') or example.get('problem') or example.get('prompt') or ''
    a = example.get('response') or example.get('answer') or example.get('solution') or example.get('target') or ''
    q = clean_text(q)
    a = clean_text(a)
    # Compose the prompt and completion. No end-of-sequence token is added here.
    prompt = f'Question: {q}\nAnswer:'
    completion = ' ' + a + ' '  # leading space separates the answer from 'Answer:'
    return {'prompt': prompt, 'completion': completion}

In [None]:
# 4c. Create train/validation split and save JSONL files
import random
from pathlib import Path

out_dir = Path('./data')
out_dir.mkdir(parents=True, exist_ok=True)

def prepare_and_save(dset, split_name='train', val_frac=0.05, seed=42, max_items=None):
    # Format each example and keep only those with a non-empty prompt and completion
    items = []
    for i, ex in enumerate(dset):
        if max_items is not None and i >= max_items:
            break
        formatted = format_prompt_completion(ex)
        if formatted['prompt'].strip() and formatted['completion'].strip():
            items.append(formatted)
    print(f'Prepared {len(items)} cleaned examples from {split_name}')
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - val_frac))
    train_items = items[:cut]
    val_items = items[cut:]
    # Save as JSONL
    train_path = out_dir / f'{split_name}_train.jsonl'
    val_path = out_dir / f'{split_name}_val.jsonl'
    with open(train_path, 'w', encoding='utf-8') as f1, open(val_path, 'w', encoding='utf-8') as f2:
        for it in train_items:
            f1.write(json.dumps(it, ensure_ascii=False) + '\n')
        for it in val_items:
            f2.write(json.dumps(it, ensure_ascii=False) + '\n')
    print('Saved', train_path, 'and', val_path)
    return train_path, val_path

# Run preparation if dataset loaded
if ds is not None:
    # Use first available split (often 'train') and limit items for quick tests
    first_split = list(ds.keys())[0]
    train_file, val_file = prepare_and_save(ds[first_split], split_name=first_split, val_frac=0.05, max_items=5000)
else:
    print('Dataset not loaded; please load dataset manually or provide local files.')

### 5B. Hugging Face Transformers + PEFT (LoRA) — runnable training pipeline
This is a concrete training implementation that uses PEFT LoRA; it's widely supported and works well for parameter-efficient fine-tuning. It also demonstrates hyperparameter setup, checkpointing, and early stopping.
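
To make "parameter-efficient" concrete, here is a small arithmetic sketch of how many trainable weights one LoRA adapter adds to a single linear layer. The 2048 x 2048 projection size is a hypothetical value chosen for illustration, not a figure read from the model. The `lora_alpha / r` scaling assumes the standard LoRA formulation.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Hypothetical projection size, chosen only for illustration.
d_in, d_out = 2048, 2048
r, lora_alpha = 8, 32  # same rank and alpha as the LoraConfig in this notebook

frozen = d_in * d_out                         # weights in the frozen dense layer
trainable = lora_param_count(d_in, d_out, r)  # adapter weights that get gradients
scaling = lora_alpha / r                      # factor applied to the low-rank update

print(f"frozen={frozen} trainable={trainable} "
      f"fraction={trainable / frozen:.4f} scaling={scaling}")
```

Under these assumptions the adapter trains well under 1% of the weights of the layer it wraps, which is why LoRA fits on a single Colab GPU more easily than full fine-tuning.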

In [None]:
# Install any missing dependencies in notebook if desired (uncomment to run).
# !pip install -q peft accelerate bitsandbytes evaluate transformers datasets

from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
from datasets import load_dataset, Dataset, DatasetDict
import evaluate

# Hyperparameters (5a)
model_name_or_path = '<LLAMA_3_2_1B_HF_ID_OR_LOCAL_PATH>'  # replace with the real repo id or path
output_dir = './hf_peft_output'
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
learning_rate = 2e-5
num_train_epochs = 3
save_strategy = 'epoch'
evaluation_strategy = 'epoch'  # newer Transformers releases name the TrainingArguments parameter `eval_strategy`
logging_strategy = 'steps'
logging_steps = 100
fp16 = True
gradient_accumulation_steps = 1
weight_decay = 0.01
warmup_steps = 50
max_length = 512

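# Prepare tokenizer and model (8-bit loading via bitsandbytes reduces memory use)
print('Model loading is commented out; set model_name_or_path, then uncomment the lines below.')
# tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
# if tokenizer.pad_token is None:
#     tokenizer.pad_token = tokenizer.eos_token  # padding needs a pad token; reuse EOS if none is defined
# Recent Transformers releases prefer quantization_config=BitsAndBytesConfig(load_in_8bit=True) over load_in_8bit=True.
# model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map='auto', load_in_8bit=True, torch_dtype=torch.float16)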

# PEFT/LoRA config (small ranks for 1B model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['q_proj','v_proj'],  # adapt depending on model architecture
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

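# Apply the adapters once the tokenizer and model lines above are uncommented:
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
print('PEFT/LoRA config defined; it is applied only after the lines above are uncommented.')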

In [None]:
# Tokenization and dataset creation (uncomment & adapt when model set)
# from pathlib import Path
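# prepare_and_save names its files '<split>_train.jsonl' and '<split>_val.jsonl';
# the paths below assume the split was 'train'.
# train_jsonl = Path('./data/train_train.jsonl')
# val_jsonl = Path('./data/train_val.jsonl')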
#
# def load_jsonl_to_dataset(path):
#     import json
#     items = []
#     with open(path, 'r', encoding='utf-8') as f:
#         for line in f:
#             items.append(json.loads(line))
#     return Dataset.from_list(items)
#
# train_ds = load_jsonl_to_dataset(train_jsonl)
# val_ds = load_jsonl_to_dataset(val_jsonl)
#
# def tokenize_fn(batch):
#     # With batched=True, `batch` is a dict of column lists, so zip the two columns.
#     texts = [p + c for p, c in zip(batch['prompt'], batch['completion'])]
#     out = tokenizer(texts, truncation=True, max_length=max_length, padding='max_length')
#     # labels = input_ids trains on the whole sequence (prompt and completion).
#     # The collator below rebuilds labels from input_ids and masks padding with -100.
#     # To train on completion tokens only, mask the prompt positions with -100
#     # and use a collator that keeps the labels you provide.
#     out['labels'] = [ids.copy() for ids in out['input_ids']]
#     return out
#
# tokenized_train = train_ds.map(tokenize_fn, batched=True, remove_columns=train_ds.column_names)
# tokenized_val = val_ds.map(tokenize_fn, batched=True, remove_columns=val_ds.column_names)
#
# Data collator for causal LM (no MLM masking)
# data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

print('Tokenization cell prepared. Uncomment and run when your model and tokenizer are set.')
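
The commented cell above mentions masking prompt tokens so that loss is computed on the completion only. A minimal sketch of that idea with plain Python lists follows. The helper name `mask_prompt_labels` and the toy token ids are hypothetical. `-100` is the label value that PyTorch cross-entropy ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_prompt_labels(prompt_ids, completion_ids, max_length):
    """Build input_ids and labels so only completion tokens contribute to the loss."""
    input_ids = (prompt_ids + completion_ids)[:max_length]
    # Prompt positions get IGNORE_INDEX; completion positions keep their token ids.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}

# Toy ids standing in for tokenizer output.
example = mask_prompt_labels(prompt_ids=[11, 12, 13], completion_ids=[21, 22], max_length=8)
print(example)
# {'input_ids': [11, 12, 13, 21, 22], 'labels': [-100, -100, -100, 21, 22]}
```

Note that `DataCollatorForLanguageModeling(mlm=False)` rebuilds labels from `input_ids`, so a masking scheme like this would need a collator that preserves the provided labels.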

#### Training arguments, early stopping and checkpointing
We'll configure Trainer/TrainingArguments and add an EarlyStoppingCallback to stop when validation loss plateaus.
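
The patience rule behind early stopping can be sketched in plain Python. This is a simplified model of the idea, not the Transformers implementation. The function name `stop_epoch` and the loss values are made up for illustration.

```python
def stop_epoch(eval_losses, patience=2, threshold=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    bad_evals = 0  # consecutive evaluations without improvement
    for epoch, loss in enumerate(eval_losses, start=1):
        if best - loss > threshold:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2.
losses = [1.20, 0.95, 0.97, 0.96, 0.90]
print(stop_epoch(losses, patience=2))  # 4
```

With `patience=2`, training stops after two consecutive evaluations that fail to beat the best loss, so the later improvement at epoch 5 is never reached. A larger patience trades extra compute for robustness to noisy validation curves.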

In [None]:
# Example TrainingArguments and EarlyStoppingCallback usage (uncomment to run)
# from transformers import EarlyStoppingCallback
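# training_args = TrainingArguments(
#     output_dir=output_dir,
#     per_device_train_batch_size=per_device_train_batch_size,
#     per_device_eval_batch_size=per_device_eval_batch_size,
#     evaluation_strategy=evaluation_strategy,  # named eval_strategy in newer Transformers releases
#     save_strategy=save_strategy,  # must match the evaluation strategy for load_best_model_at_end
#     num_train_epochs=num_train_epochs,
#     learning_rate=learning_rate,
#     weight_decay=weight_decay,
#     warmup_steps=warmup_steps,
#     logging_strategy=logging_strategy,
#     logging_steps=logging_steps,
#     fp16=fp16,
#     gradient_accumulation_steps=gradient_accumulation_steps,
#     save_total_limit=3,
#     load_best_model_at_end=True,  # required by EarlyStoppingCallback
# )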
#
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train,
#     eval_dataset=tokenized_val,
#     data_collator=data_collator,
# )
#
# trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=2))
# trainer.train()
print('Training arguments and trainer skeleton provided. Run when datasets and model are loaded.')

## 6. Evaluation and Analysis
We implement: (a) a simple exact-match style metric (normalized whitespace and case-insensitive), (b) generation examples before/after fine-tuning, and (c) a short analysis template.
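
A minimal sketch of metric (a), assuming predictions and references are plain strings of equal count. The function names and the sample strings are illustrative.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs into single spaces."""
    return " ".join(text.lower().split())

def exact_match(predictions, references) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Illustrative strings; the second pair differs, the other two match.
preds = ["  The answer   is 42 ", "x = 3", "7"]
refs = ["the answer is 42", "x = 4", "7"]
print(f"exact match: {exact_match(preds, refs):.3f}")
```

Exact match is strict: a correct answer phrased differently from the reference counts as a miss, so extracting the final answer before comparing may be worth considering for math problems.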

In [None]:
# Colab setup: mount Drive (optional) and install dependencies
# Run this cell before the data preparation and training cells; outside Colab it skips the mount and installs.
try:
    import google.colab
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import drive
    print('Mounting Google Drive...')
    drive.mount('/content/drive')
    print('Upgrading pip and installing dependencies (this may take a few minutes)')
    # Core dependencies used by this notebook; adjust as needed
    !pip install -q --upgrade pip
    !pip install -q git+https://github.com/unslothai/unsloth.git
    !pip install -q transformers datasets accelerate peft bitsandbytes evaluate sentencepiece safetensors
    # Optional: install huggingface hub to access gated weights if needed
    !pip install -q huggingface_hub
    import torch
    print('Install finished. PyTorch:', torch.__version__, 'CUDA available:', torch.cuda.is_available())
else:
    print('Not running in Colab. To use GPU, open this notebook in Google Colab (Runtime -> Change runtime type -> GPU).')