# Fine-tuning Gemma-2B on Apple Silicon (MPS)

This notebook provides a complete workflow for fine-tuning the `google/gemma-2b` model on a local MacBook Pro with Apple Silicon (M1/M2/M3). It leverages PyTorch's Metal Performance Shaders (MPS) for GPU acceleration and uses memory-efficient techniques like 4-bit quantization and QLoRA to make the process feasible on consumer hardware.

## Step 1: Setup and Environment

First, we install the necessary libraries. We use `bitsandbytes` for quantization, `peft` for LoRA, `trl` for the SFTTrainer, and `transformers` for the model and tokenizer.

In [None]:
!pip install -q -U torch torchvision torchaudio
!pip install -q -U accelerate bitsandbytes peft transformers trl datasets huggingface_hub

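Log in to the Hugging Face Hub to download the model and push your fine-tuned adapter. You will need an access token with 'write' permission, generated in your Hugging Face account settings. `google/gemma-2b` is a gated model, so you must also accept its license terms on the model page before the download will succeed.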

In [None]:
from huggingface_hub import login

login()

## Step 2: Load and Prepare Data

We will load two datasets from the Hugging Face Hub:
1. `michsethowusu/english-tooro_sentence-pairs_mt560`: A translation dataset.
2. `cle-13/rutooro_multitask`: A multi-task instruction dataset.

We'll then merge them and format them into a consistent instruction-response format.

In [None]:
from datasets import load_dataset, concatenate_datasets

# Load the datasets
translation_dataset = load_dataset("michsethowusu/english-tooro_sentence-pairs_mt560", split='train')
multitask_dataset = load_dataset("cle-13/rutooro_multitask", split='train')

### Data Formatting

To fine-tune the model, we need to format the examples into a single text field. The format should clearly separate the instruction from the response. We will use the following structure:

```
### Instruction:\n[Instruction Text]\n\n### Response:\n[Response Text]
```
We create a function that handles the two different structures of our source datasets.

In [None]:
def format_instruction(sample):
    # Handle the multitask dataset which has 'instruction', 'input', 'output' columns
    if 'instruction' in sample:
        instruction = sample['instruction']
        input_text = sample.get('input', '')
        response = sample['output']
        if input_text:
            return f"### Instruction:\n{instruction}\n{input_text}\n\n### Response:\n{response}"
        else:
            return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
    # Handle the translation dataset which has 'en' and 'tt' columns
    elif 'en' in sample and 'tt' in sample:
        instruction = f"Translate this to Rutooro: {sample['en']}"
        response = sample['tt']
        return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
    else:
        # Handle cases where columns might be missing or have different names
        raise ValueError(f"Unexpected sample structure: {sample.keys()}")
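
At inference time the prompt has to stop right after the response header so that the model continues from there. The following is a minimal sketch of a helper that mirrors the template above; `build_prompt` is a hypothetical name introduced here for illustration and is not part of the original notebook.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Return the training template up to and including the response header."""
    # Hypothetical helper: mirrors the layout produced by format_instruction.
    body = f"{instruction}\n{input_text}" if input_text else instruction
    return f"### Instruction:\n{body}\n\n### Response:\n"


# A full training example is this prompt followed by the response text.
prompt = build_prompt("Translate this to Rutooro: Good morning.")
print(repr(prompt))
```

Building training text and inference prompts from the same template avoids subtle mismatches, such as a missing blank line, that can degrade generation quality.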

Now we apply this formatting function to both datasets and merge them. We also split the final dataset into a 90% training set and a 10% testing set.

In [None]:
# Process and format each dataset
formatted_translation_dataset = translation_dataset.map(lambda x: {'text': format_instruction(x)})
formatted_multitask_dataset = multitask_dataset.map(lambda x: {'text': format_instruction(x)})

# Select only the 'text' column
formatted_translation_dataset = formatted_translation_dataset.select_columns(['text'])
formatted_multitask_dataset = formatted_multitask_dataset.select_columns(['text'])

# Merge the datasets
merged_dataset = concatenate_datasets([formatted_translation_dataset, formatted_multitask_dataset])

# Shuffle the dataset for better training distribution
merged_dataset = merged_dataset.shuffle(seed=42)

# Split into training (90%) and testing (10%) sets; the seed makes the split reproducible
dataset_splits = merged_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_splits['train']
test_dataset = dataset_splits['test']

print(f"Training set size: {len(train_dataset)}")
print(f"Testing set size: {len(test_dataset)}")
print("\nExample entry:\n", train_dataset[0]['text'])

## Step 3: Load the Model and Tokenizer

Now we load the `google/gemma-2b` model. We'll use 4-bit quantization via `bitsandbytes` to reduce the memory footprint, making it possible to run on a local machine.

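**Important**: `bitsandbytes` has historically targeted CUDA GPUs, so 4-bit quantization on MPS may require a specific build with Apple Silicon support and may not work with the default wheel. This notebook assumes a compatible version is installed. We also check explicitly which device is available, falling back to CUDA or the CPU when MPS is not.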

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Check for MPS availability
if not torch.backends.mps.is_available():
    if not torch.cuda.is_available():
        print("MPS not available. Running on CPU may be slow.")
        device = "cpu"
    else:
        print("MPS not available, but CUDA is. Using CUDA.")
        device = "cuda"
else:
    print("MPS is available. Using MPS.")
    device = "mps"

model_id = "google/gemma-2b"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NF4 is the 4-bit data type QLoRA uses (the default is "fp4")
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device,  # Place the whole model on the selected device (MPS, CUDA, or CPU)
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # Set padding side for proper batching
# Set pad token if it does not exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
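
To get a feel for why 4-bit loading matters, here is a back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. The parameter count of roughly 2.5 billion is an assumption, and `weight_memory_gib` is a hypothetical helper, not a library function.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 2.5 billion parameters for Gemma-2B.
NUM_PARAMS = 2.5e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Treat these numbers as a lower bound: quantization constants, any layers left unquantized, activations, and optimizer state all add to the real footprint.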

## Step 4: Configure QLoRA

We use QLoRA (Quantized Low-Rank Adaptation) to fine-tune the model efficiently. QLoRA freezes the full model weights and injects small, trainable "adapter" layers. This drastically reduces the number of trainable parameters, saving memory.

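We need to specify which layers of the model to apply LoRA to. For Gemma, we target the linear projection layers in both the attention blocks (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the MLP blocks (`gate_proj`, `up_proj`, `down_proj`).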

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
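
To make "drastically reduces the number of trainable parameters" concrete, the sketch below counts the parameters LoRA adds for the configuration above. The layer dimensions are assumptions about the Gemma-2B architecture (verify them against `model.config`), and `lora_params` is a hypothetical helper.

```python
# Assumed Gemma-2B dimensions; check model.config in your own environment.
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 16384   # intermediate_size of the MLP
Q_DIM = 2048           # num_attention_heads * head_dim (8 * 256)
KV_DIM = 256           # num_key_value_heads * head_dim (1 * 256)
NUM_LAYERS = 18
RANK = 8               # matches r=8 in lora_config

# (in_features, out_features) of each targeted linear layer.
shapes = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapter holds about 9.8 million parameters, well under 1% of the base model. Once the trainer has wrapped the model, `trainer.model.print_trainable_parameters()` should report the actual figure.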

## Step 5: Define Training Arguments

The `TrainingArguments` class from the `transformers` library holds all the hyperparameters for the training run. This includes settings like the learning rate, number of epochs, batch size, and where to save the model checkpoints.

Modern versions of `transformers` and `accelerate` handle device placement automatically. Since we loaded the model with `device_map=device`, the trainer runs on that device, which is the Apple Silicon GPU when MPS is available. We set `bf16=True` for mixed-precision training, which can speed up training and reduce memory usage if both the hardware and your PyTorch build support it. If they do not, switch to the commented-out `fp16=True` option instead.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gemma-2b-rutooro-finetuned", # Directory to save the model
    num_train_epochs=3,
    per_device_train_batch_size=2, # Start with a small batch size
    gradient_accumulation_steps=4, # Effective batch size will be 8
    learning_rate=2e-4,
    logging_steps=25,
    bf16=True, # Use bfloat16 for mixed-precision training
    # fp16=True, # Use fp16 if bf16 is not supported
    push_to_hub=True, # Push the final adapter to the Hub
    report_to="tensorboard", # Optional: for logging metrics
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    save_strategy="epoch",
    eval_strategy="epoch", # Called `evaluation_strategy` in older transformers releases
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)
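
The batch size, accumulation steps, epochs, and warmup ratio together determine how many optimizer updates the run performs. Here is a rough sketch of that arithmetic for a single device; the dataset size is a hypothetical value, `training_schedule` is an illustrative helper, and the trainer's exact step count may differ slightly.

```python
import math


def training_schedule(num_sequences, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio):
    """Rough estimate of optimizer updates for a single-device run."""
    effective_batch = per_device_batch_size * grad_accum_steps
    batches_per_epoch = math.ceil(num_sequences / per_device_batch_size)
    # Approximation: one optimizer update per full group of accumulated batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_updates = updates_per_epoch * num_epochs
    warmup_updates = math.ceil(total_updates * warmup_ratio)
    return effective_batch, total_updates, warmup_updates


# Hypothetical dataset of 10,000 training sequences with the settings above.
effective_batch, total_updates, warmup_updates = training_schedule(10_000, 2, 4, 3, 0.1)
print(effective_batch, total_updates, warmup_updates)
```

Note that with packing enabled, the number of training sequences is the number of packed chunks, not the number of original examples.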

## Step 6: Fine-tuning with SFTTrainer

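Now we bring everything together. The `SFTTrainer` from the `trl` library is a high-level wrapper that simplifies the training process. We pass it the model, tokenizer, datasets, LoRA configuration, and training arguments. The `trl` API has changed across releases: depending on your installed version, options such as `dataset_text_field`, `max_seq_length`, and `packing` may need to be set on an `SFTConfig` object rather than passed to `SFTTrainer` directly.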

Calling `trainer.train()` will start the fine-tuning process. The trainer will handle the training loop, gradient updates, evaluation, and saving checkpoints.

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=lora_config,
    dataset_text_field="text",
    args=training_args,
    max_seq_length=1024, # Adjust based on available memory (unified memory on Apple Silicon)
    packing=True, # Pack multiple short examples into one sequence for efficiency
)

# Start training
trainer.train()

# Save the final adapter
trainer.save_model(f"{training_args.output_dir}/final_adapter")
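
The `packing=True` option is easier to picture with a toy example. The sketch below is a simplified illustration of the idea, not `trl`'s actual implementation: tokenized examples are concatenated with an end-of-sequence token between them and cut into fixed-length chunks. `pack_sequences` and the token ids are hypothetical.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (separated by EOS) and cut fixed-length chunks."""
    stream = []
    for token_ids in tokenized_examples:
        stream.extend(token_ids)
        stream.append(eos_token_id)  # mark the boundary between examples
    # Keep only full chunks; the incomplete tail is dropped in this sketch.
    num_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(num_chunks)]


# Toy token ids, with 0 standing in for the EOS token.
packed = pack_sequences([[1, 2, 3], [4, 5], [6]], seq_length=4, eos_token_id=0)
print(packed)
```

Because short examples share a sequence, little compute is wasted on padding, at the cost of examples occasionally being split across chunk boundaries.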

## Step 7: Inference and Evaluation

After fine-tuning, the model is ready for inference. Because we set `load_best_model_at_end=True`, the trainer reloads the checkpoint with the lowest evaluation loss when training ends, so `trainer.model` holds the best version of the adapter. We can use it to generate responses to new instructions.

We will format a sample instruction, tokenize it, and pass it to the model's `generate` function to get a response.

In [None]:
# With load_best_model_at_end=True, trainer.model carries the best adapter weights.
model = trainer.model
model.eval() # Set the model to evaluation mode

# Create a sample instruction
prompt_text = "Translate this to Rutooro: I am going to the market."

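# Use the same template as in training: a blank line before the response
# header and a newline after it.
formatted_prompt = f"### Instruction:\n{prompt_text}\n\n### Response:\n"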

# Tokenize the input
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

# Generate a response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95)

# Decode and print the response
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response_text)
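
The decoded text contains the prompt as well as the generated continuation, and the model may keep going and start a new instruction block. A minimal sketch for keeping only the answer, assuming the template used in this notebook; `extract_response` is a hypothetical helper and the example string uses a placeholder instead of real model output.

```python
def extract_response(generated_text: str) -> str:
    """Return only the text between the response header and any following turn."""
    _, found, tail = generated_text.partition("### Response:")
    if not found:
        # No header present: fall back to the whole text.
        return generated_text.strip()
    # Cut off any further instruction block the model may have started.
    answer, _, _ = tail.partition("### Instruction:")
    return answer.strip()


example = (
    "### Instruction:\nTranslate this to Rutooro: Hello.\n\n"
    "### Response:\n<model output>\n### Instruction:\n<extra text>"
)
print(extract_response(example))
```

In the notebook you would call it as `extract_response(response_text)`.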

### Loading the Adapter for a New Session

If you were starting a new notebook session, you would load the quantized base model first and then apply the fine-tuned LoRA adapter on top of it. The code below shows how you would do this.

In [None]:
from peft import PeftModel

# Path to your saved adapter
adapter_path = f"{training_args.output_dir}/final_adapter"

# --- Code to load model from scratch (for a new session) ---
# 1. Load the quantized base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device
)

# 2. Attach the LoRA adapter to the base model (the weights are wrapped, not merged)
inference_model = PeftModel.from_pretrained(base_model, adapter_path)

# You can now use `inference_model` for generation just like we did above
print("Inference model loaded successfully.")

## Step 8: Save and Push the Final Model

The training process automatically saved the adapter in the output directory (`gemma-2b-rutooro-finetuned`). Because we set `push_to_hub=True` in the `TrainingArguments`, the trainer also attempted to push the model to the Hugging Face Hub after training.

You can also run the command manually to push the final adapter to the Hub. This creates the repository under your user account if it does not exist yet.

In [None]:
# Push the final adapter to the Hugging Face Hub
trainer.push_to_hub()

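# hub_model_id is None unless set explicitly; the repository name then defaults to the output directory name.
repo_name = training_args.hub_model_id or training_args.output_dir
print(f"Adapter pushed to the Hugging Face Hub repository: {repo_name}")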

## Conclusion

You have successfully fine-tuned the Gemma-2B model on a custom dataset using QLoRA on your Apple Silicon Mac. You can now share your model adapter, use it in other applications, or continue to refine it.