<a href="https://colab.research.google.com/github/muhcuk/streamlit_literacy_chatbot/blob/main/phi_2_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# Install required packages
print("📦 Installing packages...")
!pip install -q accelerate peft bitsandbytes transformers trl

print("\n✅ Installation complete!")

# Check GPU
import torch

print("\n" + "="*50)
print("🖥️  GPU INFORMATION")
print("="*50)

if torch.cuda.is_available():
    print(f"✅ GPU Available: YES")
    print(f"📛 GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"💾 Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"🔢 CUDA Version: {torch.version.cuda}")
else:
    print("❌ No GPU detected!")
    print("⚠️  Please enable GPU: Runtime → Change runtime type → T4 GPU")

print("="*50)

📦 Installing packages...

✅ Installation complete!

🖥️  GPU INFORMATION
✅ GPU Available: YES
📛 GPU Name: Tesla T4
💾 Total Memory: 15.83 GB
🔢 CUDA Version: 12.6
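
The Tesla T4 reported above is a Turing GPU (compute capability 7.5), while native bfloat16 arithmetic arrived with Ampere (compute capability 8.0). Later cells request bfloat16, which on a T4 may fall back to slower paths. The sketch below shows one way to choose a mixed-precision mode from the compute capability. `pick_mixed_precision` is a hypothetical helper, not part of this notebook; in Colab the tuple would come from `torch.cuda.get_device_capability(0)`.

```python
def pick_mixed_precision(capability):
    """Return 'bf16' for Ampere-or-newer GPUs, otherwise 'fp16'.

    `capability` is a (major, minor) compute-capability tuple.
    """
    major, _minor = capability
    return "bf16" if major >= 8 else "fp16"


# Hard-coded capabilities for illustration: T4 is (7, 5), A100 is (8, 0).
for name, capability in [("Tesla T4", (7, 5)), ("A100", (8, 0))]:
    print(f"{name}: {pick_mixed_precision(capability)}")
```

The result could then drive the `fp16`/`bf16` flags in `TrainingArguments` and `bnb_4bit_compute_dtype` instead of hard-coding bfloat16.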


In [4]:
# Mount Google Drive
from google.colab import drive
print("📁 Mounting Google Drive...")
drive.mount('/content/drive')

# Load dataset
from datasets import load_dataset
import os

# Update this path to match where your file is stored
file_path = '/content/drive/MyDrive/phi_finetune/finlit_llama_chat_1000_v2.jsonl'

# Check if file exists
if not os.path.exists(file_path):
    print(f"\n❌ File not found at: {file_path}")
    print("\n📂 Listing files in MyDrive:")
    !ls "/content/drive/MyDrive/"
    print("\n⚠️  Please update the file_path variable above")
else:
    print(f"✅ File found: {file_path}")
    print(f"📊 File size: {os.path.getsize(file_path) / 1024:.2f} KB")

    # Load dataset
    print("\n📥 Loading dataset...")
    dataset = load_dataset('json', data_files=file_path)

    print(f"\n✅ Dataset loaded!")
    print(f"📝 Total samples: {len(dataset['train'])}")
    print(f"\n🔍 Sample entry:")
    print(dataset['train'][0])

📁 Mounting Google Drive...
Mounted at /content/drive
✅ File found: /content/drive/MyDrive/phi_finetune/finlit_llama_chat_1000_v2.jsonl
📊 File size: 516.14 KB

📥 Loading dataset...



✅ Dataset loaded!
📝 Total samples: 1000

🔍 Sample entry:
{'messages': [{'role': 'user', 'content': 'Why is saving important for Malaysian students?'}, {'role': 'assistant', 'content': 'Saving matters because it directly influences whether you feel in control or stressed about money. Saving means keeping part of your income aside instead of spending it now, so you can use it later for goals or emergencies. When you pay attention to it early, you avoid many common mistakes that young adults often regret later.'}]}
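
Before formatting, it can help to confirm that every JSONL line has the shape shown in the sample entry: a `messages` list whose items carry a `role` of `user` or `assistant` and non-empty `content`. The following is a minimal sketch under that assumption. `validate_chat_record` is a hypothetical helper and the two sample lines are made up for illustration.

```python
import json


def validate_chat_record(line):
    """Return a list of problems found in one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ("user", "assistant"):
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


# Illustrative lines only; the real file is read from Google Drive.
good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "system", "content": ""}]}'
print(validate_chat_record(good))
print(validate_chat_record(bad))
```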


In [5]:
# Format dataset for Phi-2
def format_phi2_prompt(example):
    """Flatten one chat example into plain "User: ... / Assistant: ..." text."""
    messages = example['messages']

    formatted_text = ""
    for msg in messages:
        role = msg['role']
        content = msg['content']

        if role == 'user':
            formatted_text += f"User: {content}\n\n"
        elif role == 'assistant':
            formatted_text += f"Assistant: {content}\n\n"

    return {'text': formatted_text}

print("Formatting dataset for Phi-2...")
dataset = dataset.map(format_phi2_prompt)

# Split into train/validation (90/10)
dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)

print(f"\nTrain samples: {len(dataset['train'])}")
print(f"Validation samples: {len(dataset['test'])}")

# Show formatted example
print(f"\nFormatted example:")
print("="*80)
print(dataset['train'][0]['text'][:500])
print("...")
print("="*80)

Formatting dataset for Phi-2...



Train samples: 900
Validation samples: 100

Formatted example:
User: What are the main points I should remember about compound interest?

Assistant: In summary, compound interest is about three things: understanding your situation, making intentional choices and learning from experience. Compound interest is interest calculated on both the original amount and the interest that has already been added before. If you keep coming back to these ideas, you will slowly build stronger financial habits.


...
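
The formatted text only helps at inference time if the generation prompt uses the same template. The sketch below restates the formatting logic as a standalone function and checks that a prompt of the form `User: ...\n\nAssistant:` (the form used in the testing cell of this notebook) is an exact prefix of the corresponding training text. `format_chat` and `inference_prompt` are illustrative names, not functions from the notebook.

```python
def format_chat(messages):
    """Join chat turns into the "User: ... / Assistant: ..." training text."""
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"User: {msg['content']}\n\n")
        elif msg["role"] == "assistant":
            parts.append(f"Assistant: {msg['content']}\n\n")
    return "".join(parts)


def inference_prompt(question):
    """Build the prompt that leaves the assistant turn open for generation."""
    return f"User: {question}\n\nAssistant:"


question = "What is compound interest?"
training_text = format_chat([
    {"role": "user", "content": question},
    {"role": "assistant", "content": "Interest earned on interest."},
])

# The model sees the same prefix during training and at inference.
print(training_text.startswith(inference_prompt(question)))
```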


In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "microsoft/phi-2"
print(f"Model: {model_name}\n")

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
print("Loading Phi-2 model (this takes 2-3 minutes)...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Model: microsoft/phi-2

Loading Phi-2 model (this takes 2-3 minutes)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


Loading tokenizer...
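
As a rough check on why 4-bit loading matters on a 16 GB GPU, the weight footprint can be estimated from the parameter count. The base count below is derived from the `print_trainable_parameters()` output in this notebook (all parameters minus the LoRA parameters). The estimate is a lower bound only: it ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
# Base model parameters = all params - trainable LoRA params (from the notebook output).
BASE_PARAMS = 2_792_791_040 - 13_107_200


def weights_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


print(f"16-bit weights: {weights_gb(BASE_PARAMS, 16):.2f} GB")
print(f" 4-bit weights: {weights_gb(BASE_PARAMS, 4):.2f} GB")
```

The 16-bit figure is close to the combined size of the two downloaded safetensors shards (5.00 GB and 564 MB).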



In [7]:
# Prepare for training
print("Preparing model for training...")
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
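    # NOTE: "Wqkv" is the fused attention projection name from the older
    # remote-code Phi-2 implementation. The native transformers implementation
    # names these layers q_proj, k_proj, v_proj and dense, so "Wqkv" likely
    # matches nothing here: the 13,107,200 trainable parameters reported below
    # equal LoRA on fc1 and fc2 alone.
    target_modules=["Wqkv", "fc1", "fc2"],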
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

print("\nModel ready!")
model.print_trainable_parameters()

Preparing model for training...

Model ready!
trainable params: 13,107,200 || all params: 2,792,791,040 || trainable%: 0.4693
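
The trainable-parameter count can be reproduced by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The sketch below assumes Phi-2's published dimensions (hidden size 2560, MLP width 10240, 32 layers) and compares two cases: adapters on `fc1`/`fc2` only, and adapters that also cover four attention projections of size 2560 x 2560. The second case is a hypothetical configuration, not what this run used.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


HIDDEN, INTERMEDIATE, LAYERS, RANK = 2560, 10240, 32, 16

fc1 = lora_params(HIDDEN, INTERMEDIATE, RANK)
fc2 = lora_params(INTERMEDIATE, HIDDEN, RANK)
mlp_only = LAYERS * (fc1 + fc2)

# Hypothetical: also adapt four hidden-to-hidden attention projections per layer.
attention = 4 * lora_params(HIDDEN, HIDDEN, RANK)
with_attention = LAYERS * (fc1 + fc2 + attention)

print(f"fc1 + fc2 only:       {mlp_only:,}")
print(f"with attention added: {with_attention:,}")
print(f"trainable share:      {100 * mlp_only / 2_792_791_040:.4f}%")
```

The `fc1`/`fc2`-only total equals the 13,107,200 reported above, which suggests the attention layers received no adapters in this run.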


In [8]:
import shutil
import os

print("🧹 Cleaning up storage...")

# Clear HuggingFace cache
cache_dir = "/root/.cache/huggingface"
if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"✅ Cleared: {cache_dir}")

# Clear tmp files
os.system("rm -rf /tmp/*")
print("✅ Cleared /tmp/")

# Check available space
disk_usage = shutil.disk_usage("/")
print(f"\n💾 Available space: {disk_usage.free / 1e9:.2f} GB")

🧹 Cleaning up storage...
✅ Cleared: /root/.cache/huggingface
✅ Cleared /tmp/

💾 Available space: 79.65 GB


In [9]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

print("⚙️  Setting up training configuration...")

# Tokenize the dataset
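def tokenize_function(examples):
    # Every example is padded to 512 tokens here, so all sequences end up the
    # same length and group_by_length=True below has nothing to group. Dropping
    # padding="max_length" would let the data collator pad each batch instead.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )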

print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/llama_finetune/phi2_finlit_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    learning_rate=2e-4,
    weight_decay=0.01,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="none",
    save_total_limit=3,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    data_collator=data_collator,
)

print("✅ Trainer configured!")
print("\n📊 Training Configuration:")
print(f"   - Epochs: {training_args.num_train_epochs}")
print(f"   - Batch size: {training_args.per_device_train_batch_size}")
print(f"   - Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"   - Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   - Learning rate: {training_args.learning_rate}")
print(f"   - Max sequence length: 512")

⚙️  Setting up training configuration...
Tokenizing dataset...


✅ Trainer configured!

📊 Training Configuration:
   - Epochs: 3
   - Batch size: 4
   - Gradient accumulation: 4
   - Effective batch size: 16
   - Learning rate: 0.0002
   - Max sequence length: 512
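
From these settings the number of optimizer steps can be estimated, which in turn shows how often `eval_steps=100` and `save_steps=100` will trigger. This is a rough sketch: the rounding used here (ceiling at both stages) is an assumption, and the exact count depends on the installed `Trainer` version.

```python
import math


def estimate_optimizer_steps(num_samples, per_device_batch, grad_accum, epochs):
    """Estimate total optimizer steps for single-GPU training."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


total_steps = estimate_optimizer_steps(
    num_samples=900, per_device_batch=4, grad_accum=4, epochs=3
)
eval_runs = total_steps // 100                  # evaluations at steps 100, 200, ...
warmup_steps = math.ceil(total_steps * 0.03)    # from warmup_ratio=0.03

print(f"Estimated optimizer steps: {total_steps}")
print(f"Evaluations during training: {eval_runs}")
print(f"Warmup steps: {warmup_steps}")
```

With fewer than 200 steps in total, only one intermediate evaluation and one checkpoint are expected during training; a smaller `eval_steps` would give a more detailed loss curve.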


In [10]:
import time

print("="*80)
print("🚀 STARTING TRAINING")
print("="*80)
print("\n⏰ This will take approximately 30-60 minutes depending on your GPU")
print("📊 Training progress will be shown below\n")

start_time = time.time()

# Train the model
trainer.train()

end_time = time.time()
training_duration = (end_time - start_time) / 60

print("\n" + "="*80)
print("✅ TRAINING COMPLETE!")
print("="*80)
print(f"⏱️  Total training time: {training_duration:.2f} minutes")

# Save the final model
final_model_path = "/content/drive/MyDrive/llama_finetune/phi2_finlit_final"
print(f"\n💾 Saving final model to: {final_model_path}")

trainer.model.save_pretrained(final_model_path)
tokenizer.save_pretrained(final_model_path)

print("✅ Model saved successfully!")

🚀 STARTING TRAINING

⏰ This will take approximately 30-60 minutes depending on your GPU
📊 Training progress will be shown below



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss
100,0.0996,0.102445



✅ TRAINING COMPLETE!
⏱️  Total training time: 168.53 minutes

💾 Saving final model to: /content/drive/MyDrive/llama_finetune/phi2_finlit_final
✅ Model saved successfully!


In [11]:
print("="*80)
print("📊 EVALUATING MODEL")
print("="*80)

# Run evaluation
eval_results = trainer.evaluate()

print("\n📈 Evaluation Results:")
print("-" * 40)
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

print("\n" + "="*80)

📊 EVALUATING MODEL



📈 Evaluation Results:
----------------------------------------
eval_loss: 0.0837
eval_runtime: 119.0933
eval_samples_per_second: 0.8400
eval_steps_per_second: 0.2100
epoch: 3.0000
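
The evaluation loss is a mean cross-entropy per token, so it can be converted to perplexity with `exp(loss)`. The sketch below does that for the value reported above and also cross-checks the throughput against the 100 validation samples. A perplexity this close to 1 may reflect how templated the dataset is rather than general language ability, so it is best read alongside the sample generations.

```python
import math

eval_loss = 0.0837          # from the evaluation results above
eval_runtime_s = 119.0933   # from the evaluation results above
num_eval_samples = 100      # size of the validation split

perplexity = math.exp(eval_loss)
samples_per_second = num_eval_samples / eval_runtime_s

print(f"Perplexity: {perplexity:.4f}")
print(f"Samples per second: {samples_per_second:.2f}")
```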



In [12]:
print("="*80)
print("🧪 TESTING FINE-TUNED MODEL")
print("="*80)

def generate_response(prompt, max_new_tokens=256):
    """Generate a response from the fine-tuned model.

    max_new_tokens caps the generated tokens only; the prompt is not counted.
    """
    formatted_prompt = f"User: {prompt}\n\nAssistant:"

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

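    # Keep only the first assistant turn and drop any follow-up "User:" turn
    # the model may go on to generate.
    if "Assistant:" in response:
        response = response.split("Assistant:", 1)[1]
        response = response.split("User:", 1)[0].strip()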

    return response

# Test with sample questions
test_prompts = [
    "What is compound interest?",
    "How should I start investing as a beginner?",
    "Explain the difference between stocks and bonds.",
    "What is an emergency fund and why do I need one?",
]

print("\n🔍 Testing with sample questions:\n")

for i, prompt in enumerate(test_prompts, 1):
    print(f"\n{'='*80}")
    print(f"Question {i}: {prompt}")
    print("-" * 80)

    response = generate_response(prompt)
    print(f"Response: {response}")

print("\n" + "="*80)
print("✅ Testing complete!")
print("\n💡 You can now use the generate_response() function to test custom prompts")
print("Example: generate_response('Your question here')")


🧪 TESTING FINE-TUNED MODEL

🔍 Testing with sample questions:


Question 1: What is compound interest?
--------------------------------------------------------------------------------
Response: Compound interest is interest calculated on both the original amount and the interest that has already been added before. In simple terms, it gives you a clearer picture of your money so you can make calmer, smarter decisions.


1. First, be honest about your current situation by writing down your income and main expenses related to compound interest.
2. Next, set one small, realistic target linked to compound interest, such as an amount, a habit or a deadline.
3. Then, break the target into weekly or monthly actions that you can actually follow.
4. Monitor your progress regularly, for example once a week, and adjust if the plan feels too hard or too easy.
5. Finally, reflect on what works for you and slowly increase the challenge as you become more confident.
You do not need to be perfect; the g