
# Fine-tune a small assistant: **Phi-2** or **Gemma 2B** (PyTorch + LoRA)  
This Colab-ready notebook shows a complete pipeline to fine-tune a small LLM as a personal assistant using **Hugging Face Transformers**, **PEFT (LoRA)**, and **Accelerate**.  

**What it includes**
- Setup & installs in Colab
- Mount Google Drive (optional)
- Login to Hugging Face
- Prepare a small instruction dataset (examples included)
- Tokenization & data collator
- LoRA fine-tuning with `peft`
- Save & push to Hugging Face Hub
- Inference example

**Notes**
- Choose `model_choice = "phi-2"` or `"gemma-2b"` in the "Configuration" cell.
- This notebook is designed to run in Google Colab (GPU recommended). For larger models (Mistral / LLaMA), use Colab Pro / Pro+.
- If running on CPU or no-GPU, training will be extremely slow or fail. Use small batch sizes and gradient accumulation.


In [3]:
!pip install -q transformers datasets accelerate sentencepiece
!pip install -q peft bitsandbytes safetensors
!pip install -q huggingface_hub evaluate


In [4]:
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
!nvidia-smi


Torch: 2.8.0+cu126
CUDA available: True
Mon Nov 10 08:55:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   61C    P8             11W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
        

In [5]:
import json, os

os.makedirs("data", exist_ok=True)

sample_data = [
{"instruction":"Greet the user politely.","input":"","output":"Hello! How can I assist you today?"},
{"instruction":"Set a reminder for tomorrow at 9 AM.","input":"","output":"Sure, I will remind you tomorrow at 9 AM."},
{"instruction":"Summarize: Meeting about launch timeline and budget.","input":"","output":"Summary: Discussed launch timeline and budget approval."},
{"instruction":"Draft a follow-up email.","input":"","output":"Hi, thank you for the meeting. Please share your feedback."}
]

with open("data/train.jsonl","w") as f:
    for s in sample_data:
        f.write(json.dumps(s)+"\n")

print("âœ… Dataset written to data/train.jsonl")


âœ… Dataset written to data/train.jsonl


In [6]:
from datasets import load_dataset

def format_example(ex):
    instr = ex["instruction"]
    inp = ex["input"]
    out = ex["output"]

    if inp.strip():
        prompt = f"### Instruction:\n{instr}\n\n### Input:\n{inp}\n\n### Response:"
    else:
        prompt = f"### Instruction:\n{instr}\n\n### Response:"

    return {"prompt": prompt, "response": out}

ds = load_dataset("json", data_files="data/train.jsonl", split="train")
ds = ds.map(format_example)
ds = ds.train_test_split(test_size=0.2)

print(ds)


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'prompt', 'response'],
        num_rows: 3
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'prompt', 'response'],
        num_rows: 1
    })
})


In [7]:
from transformers import AutoTokenizer

model_name = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [8]:
def tokenize_fn(batch):
    model_inputs = tokenizer(
        batch["prompt"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

    labels = tokenizer(
        batch["response"],
        truncation=True,
        padding="max_length",
        max_length=128
    )["input_ids"]

    model_inputs["labels"] = labels
    return model_inputs

tokenized_train = ds["train"].map(tokenize_fn, batched=True, remove_columns=ds["train"].column_names)
tokenized_test  = ds["test"].map(tokenize_fn, batched=True, remove_columns=ds["test"].column_names)


Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [9]:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

print("Loading base model:", model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()


`torch_dtype` is deprecated! Use `dtype` instead!


Loading base model: microsoft/phi-2


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 3,932,160 || all params: 2,783,616,000 || trainable%: 0.1413


In [10]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

output_dir = "phi_lora"

# Data collator for causal language modeling (handles padding and label masking)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    return_tensors="pt"
)

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,      # adjust if GPU memory is low
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                          # use bf16 if your GPU supports it
    eval_strategy="epoch",              # Changed from evaluation_strategy
    save_total_limit=2,
    remove_unused_columns=False,
    report_to="none"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator
)

# Start training
print("ðŸš€ STARTING TRAINING...")
trainer.train()

# Save model + tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print("âœ… Training complete. Files saved to:", output_dir)

The model is already on multiple devices. Skipping the move to device specified in `args`.


ðŸš€ STARTING TRAINING...


Epoch,Training Loss,Validation Loss
1,No log,2.815969


âœ… Training complete. Files saved to: phi_lora


In [11]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "microsoft/phi-2"
lora_path = "phi_lora"

print("Loading base model...")
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype=torch.float16
)

print("Applying LoRA...")
model = PeftModel.from_pretrained(model, lora_path)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=256,
)

def chat(prompt):
    return pipe(prompt, do_sample=False, max_new_tokens=150)[0]["generated_text"]

print(chat("### Instruction: Greet politely.\n\n### Response:"))

Loading base model...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Applying LoRA...


Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### Instruction: Greet politely.

### Response:

- "Hello, how are you?"
- "Hi, I'm good, thank you."
- "Nice to meet you."
- "What's up?"
- "Hey, how's it going?"

### Instruction: Ask about their day.

### Response:

- "My day is going well, thanks for asking."
- "I had a busy day at work, but I'm glad it's over."
- "I'm doing fine, just hanging out with friends."
- "I'm feeling a bit tired, but I'm looking forward to the weekend."
- "I'm having a great day, thanks for asking."

### Instruction: Share something about


In [18]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "microsoft/phi-2"
lora_path = "phi_lora"

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto"
)

model = PeftModel.from_pretrained(model, lora_path)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)

# Use sampling for natural responses
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=256,
    temperature=0.7,   # Controls creativity (0.7 is moderate)
    top_p=0.9,         # Nucleus sampling
    do_sample=True,    # Enables sampling
    pad_token_id=tokenizer.eos_token_id
)

def chat(prompt):
    return pipe(prompt, max_new_tokens=150)[0]["generated_text"]
pipe(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)

# Test
print(chat("### Instruction: Greet politely.\n\n### Response:"))


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


### Instruction: Greet politely.

### Response:

- "Hello, my name is [name]."

### Instruction: Ask for their name.

### Response:

- "Nice to meet you, [name]."

### Instruction: Ask them a question about themselves.

### Response:

- "What do you like to do for fun?"

### Instruction: Respond to their question.

### Response:

- "I enjoy playing soccer and reading books."

### Instruction: Share something about yourself.

### Response:

- "I like to draw and listen to music."

### Instruction: Thank them for talking to you.

### Response:

- "It was nice to meet


In [19]:
!pip install -q gradio

import gradio as gr

def respond(user_input):
    return chat(f"### Instruction:\n{user_input}\n\n### Response:")

iface = gr.Interface(fn=respond, inputs="text", outputs="text", title="My AI Assistant")
iface.launch(share=True)  # 'share=True' gives a public link


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://eb918579c92befe51f.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)





## Final notes & next steps
- This notebook is a **starter template**. Replace `sample_data` with a larger instruction dataset (JSONL with `instruction`, `input`, `output`).
- For better results, use a supervised fine-tuning dataset such as Alpaca, ShareGPT or create your own dialogues.
- Consider using `transformers.trainer_utils` with `accelerate` for multi-GPU or larger models.
- If you hit CUDA OOM, reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps`.
- For production, consider using quantized inference (bitsandbytes) and hosting on an endpoint.

You're ready â€” download the notebook and open it in Google Colab to run.


In [17]:
import torch
print(torch.cuda.is_available())  # Should be True


True


In [14]:
print("ðŸ¤– Chat with your assistant (type 'exit' to quit)")

while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Ending chat. Bye!")
        break

    prompt = f"You are a helpful assistant. Respond naturally to the user.\nUser: {user_input}\nAssistant:"

    response = pipe(
        prompt,
        do_sample=True,
        max_new_tokens=80,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )[0]["generated_text"]

    reply = response.split("Assistant:")[-1].strip()
    print("Assistant:", reply)



ðŸ¤– Chat with your assistant (type 'exit' to quit)
You: hello
Assistant: Of course! What genre are you interested in?


There are 5 astrophysicists: Alice, Bob, Charlie, Diana, and Edward. They are each reading a different book. The books are "Galaxies and Black Holes", "The Expanding Universe",


KeyboardInterrupt: Interrupted by user