# Fine-tuning Gemma 3 1B - Optimized for Small Dataset (3,855 samples)

**Model**: Google Gemma 3 1B-IT  
**Dataset**: UNSIQ - 771 unique questions × 5 variations = 3,855 samples  
**Optimization**: Research-based configuration for LIMITED DOMAIN-SPECIFIC DATA

## 🎯 Small Dataset Challenges:
- ⚠️ **Overfitting Risk**: Model memorizes instead of learning patterns
- ⚠️ **Limited Context**: Less exposure to diverse scenarios
- ⚠️ **Generalization Issues**: May not perform well on unseen data

## ✅ Our Solutions (Research-Based 2025):
1. **Lower LoRA Rank** (32 vs 64): Reduce capacity to prevent memorization
2. **Higher Dropout** (0.1 vs 0.05): Stronger regularization
3. **Higher Weight Decay** (0.05 vs 0.01): L2 regularization
4. **Fewer Epochs** (2 vs 3): Prevent overfitting from repeated exposure
5. **Lower Learning Rate** (5e-5 vs 2e-4): Conservative, stable training
6. **Early Stopping** (patience=5): Auto-stop when validation stops improving
7. **Frequent Eval** (every 25 steps): Close monitoring for overfitting
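
The notebook reads these settings from `qlora_config_gemma3_1b_SMALL_DATASET.json` in step 3. As a rough sketch, the overfitting-related part of that file might look like the dictionary below. The key names are the ones the notebook reads and the values are the ones listed above; the real file contains more fields, which are omitted here.

```python
import json

# Sketch of the overfitting-related settings, using the keys this notebook reads.
# Values mirror the list above; the actual JSON file contains more fields.
small_data_config = {
    "qlora_config": {"r": 32, "lora_alpha": 64, "lora_dropout": 0.1},
    "training_args": {
        "learning_rate": 5e-5,
        "num_train_epochs": 2,
        "weight_decay": 0.05,
        "eval_steps": 25,
        "early_stopping_patience": 5,
    },
}

print(json.dumps(small_data_config, indent=2))
```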

## 📚 Research Sources:
- [Fine-Tuning LLMs on Small Datasets](https://www.sapien.io/blog/strategies-for-fine-tuning-llms-on-small-datasets)
- [Unveiling the Secret Recipe: Small LLMs](https://arxiv.org/html/2412.13337v1)
- [LoRA Hyperparameters Guide](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide)
- [Practical LoRA Tips](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms)

## 1. Install Dependencies

In [None]:
# Quote the version specifiers so the shell does not treat ">=" as output redirection
!pip install -q -U "torch>=2.4.0" "transformers>=4.50.0" accelerate bitsandbytes peft datasets trl tensorboard sentencepiece

## 2. Import Libraries

In [None]:
import torch
import json
import os
from pathlib import Path
from datetime import datetime

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    EarlyStoppingCallback,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {__import__('transformers').__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

## 3. Load Small Dataset Configuration

In [None]:
config_path = "qlora_config_gemma3_1b_SMALL_DATASET.json"

with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)

print("\n" + "="*80)
print("SMALL DATASET OPTIMIZED CONFIGURATION")
print("="*80)

print(f"\n📊 Dataset Statistics:")
for key, value in config['dataset_stats'].items():
    print(f"  • {key}: {value}")

print(f"\n⚙️  LoRA Configuration (Overfitting Prevention):")
print(f"  • Rank (r): {config['qlora_config']['r']} ← Lower to prevent memorization")
print(f"  • Alpha: {config['qlora_config']['lora_alpha']}")
print(f"  • Dropout: {config['qlora_config']['lora_dropout']} ← Higher for regularization")

print(f"\n🎓 Training Configuration (Small Data Optimized):")
print(f"  • Learning Rate: {config['training_args']['learning_rate']} ← Conservative")
print(f"  • Epochs: {config['training_args']['num_train_epochs']} ← Limited to prevent overfitting")
print(f"  • Weight Decay: {config['training_args']['weight_decay']} ← Strong regularization")
print(f"  • Early Stopping Patience: {config['training_args']['early_stopping_patience']}")
print(f"  • Eval Steps: {config['training_args']['eval_steps']} ← Frequent monitoring")

print(f"\n🛡️  Overfitting Prevention Techniques Applied:")
for i, technique in enumerate(config['overfitting_prevention']['techniques_applied'], 1):
    print(f"  {i}. {technique}")

print(f"\n📚 Research-Based Configuration:")
for finding in config['research_based_on']['key_findings']:
    print(f"  ✓ {finding}")

print("\n" + "="*80)

## 4. Load Model & Tokenizer with QLoRA

In [None]:
model_name = config['model_config']['model_name']

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

print(f"Loading {model_name}...\n")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
    attn_implementation="eager",
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = 'right'
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"✓ Model loaded: {model.num_parameters():,} parameters")
print(f"✓ Tokenizer vocab: {len(tokenizer):,}")

## 5. Apply LoRA - Optimized for Small Data

In [None]:
print("Preparing model for QLoRA...")
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# LoRA config optimized for small dataset
lora_config = LoraConfig(
    r=config['qlora_config']['r'],  # 32 - balanced for small data
    lora_alpha=config['qlora_config']['lora_alpha'],  # 64
    lora_dropout=config['qlora_config']['lora_dropout'],  # 0.1 - higher dropout
    bias=config['qlora_config']['bias'],
    task_type=config['qlora_config']['task_type'],
    target_modules=config['qlora_config']['target_modules'],
    modules_to_save=config['qlora_config'].get('modules_to_save'),
)

model = get_peft_model(model, lora_config)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print(f"\n✓ LoRA Applied (Small Dataset Optimized):")
print(f"  Trainable: {trainable:,} ({100*trainable/total:.4f}%)")
print(f"  Total: {total:,}")
# Report the rank actually loaded from the config instead of a hard-coded value
print(f"\n  Lower rank ({config['qlora_config']['r']}) reduces overfitting risk on {config['dataset_stats']['total_samples']} samples!")
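
To make the link between rank and capacity concrete: a LoRA adapter on a weight matrix of shape `d_out x d_in` adds two matrices, `A` (`r x d_in`) and `B` (`d_out x r`), so its trainable parameter count scales linearly with `r`. The sketch below uses a hypothetical 1024 x 1024 projection for illustration; these are not Gemma's real dimensions, and `lora_trainable_params` is a helper invented for this example.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical 1024 x 1024 projection, not Gemma's real dimensions.
for rank in (32, 64):
    print(rank, lora_trainable_params(1024, 1024, rank))
```

Halving the rank from 64 to 32 halves the adapter parameters for every targeted module, which is the capacity reduction this configuration relies on.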

## 6. Load UNSIQ Dataset

In [None]:
print("Loading datasets...\n")

train_dataset = load_dataset('json', data_files=config['dataset_config']['train_file'], split='train')
eval_dataset = load_dataset('json', data_files=config['dataset_config']['eval_file'], split='train')

print(f"✓ Train: {len(train_dataset):,} samples")
print(f"✓ Eval: {len(eval_dataset):,} samples ({len(eval_dataset)/(len(train_dataset)+len(eval_dataset))*100:.1f}%)")
print(f"✓ Total: {len(train_dataset) + len(eval_dataset):,} samples")
print(f"\n💡 15% validation split is ideal for monitoring overfitting on small datasets")

## 7. Format with Chat Template

In [None]:
def format_chat_template(example):
    text = tokenizer.apply_chat_template(
        example['messages'],
        tokenize=False,
        add_generation_prompt=False
    )
    return {'text': text}

train_dataset = train_dataset.map(format_chat_template, desc="Formatting train")
eval_dataset = eval_dataset.map(format_chat_template, desc="Formatting eval")

print("\n✓ Datasets formatted with Gemma 3 chat template")
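
`format_chat_template` expects every record to carry a `messages` list of role/content dictionaries. The sketch below shows one plausible record shape and a small sanity check. It assumes an optional leading system turn followed by strictly alternating user/assistant turns; `validate_messages` is a hypothetical helper, and the assistant text is a placeholder, not real dataset content.

```python
def validate_messages(messages):
    """Check the assumed layout: optional system turn, then user/assistant alternation."""
    turns = messages[1:] if messages and messages[0]["role"] == "system" else messages
    if not turns:
        return False
    expected = ("user", "assistant")
    return all(m["role"] == expected[i % 2] for i, m in enumerate(turns))


# Illustrative record; the assistant content is a placeholder.
example = {"messages": [
    {"role": "user", "content": "Berapa total biaya kuliah S1 Akuntansi di UNSIQ?"},
    {"role": "assistant", "content": "(reference answer text)"},
]}
print(validate_messages(example["messages"]))
```

Running a check like this over both splits before calling `map` can surface malformed records early instead of failing inside the chat template.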

## 8. Training Arguments with Early Stopping

In [None]:
args = config['training_args']
os.makedirs(args['output_dir'], exist_ok=True)

training_args = TrainingArguments(
    # Output
    output_dir=args['output_dir'],
    overwrite_output_dir=args['overwrite_output_dir'],
    
    # Training - Optimized for small data
    num_train_epochs=args['num_train_epochs'],  # 2 epochs
    per_device_train_batch_size=args['per_device_train_batch_size'],
    per_device_eval_batch_size=args['per_device_eval_batch_size'],
    gradient_accumulation_steps=args['gradient_accumulation_steps'],
    gradient_checkpointing=args['gradient_checkpointing'],
    gradient_checkpointing_kwargs=args.get('gradient_checkpointing_kwargs', {}),
    
    # Optimization - Conservative for small data
    optim=args['optim'],
    learning_rate=args['learning_rate'],  # 5e-5 - lower
    weight_decay=args['weight_decay'],  # 0.05 - higher regularization
    max_grad_norm=args['max_grad_norm'],
    
    # Scheduler
    lr_scheduler_type=args['lr_scheduler_type'],
    warmup_ratio=args['warmup_ratio'],
    warmup_steps=args.get('warmup_steps', 0),
    
    # Evaluation - Frequent for monitoring
    eval_strategy=args['eval_strategy'],
    eval_steps=args['eval_steps'],  # Every 25 steps
    
    # Checkpointing
    save_strategy=args['save_strategy'],
    save_steps=args['save_steps'],
    save_total_limit=args['save_total_limit'],
    load_best_model_at_end=args['load_best_model_at_end'],
    metric_for_best_model=args['metric_for_best_model'],
    greater_is_better=args.get('greater_is_better', False),
    
    # Logging
    logging_strategy=args['logging_strategy'],
    logging_steps=args['logging_steps'],
    logging_first_step=args.get('logging_first_step', True),
    report_to=args['report_to'],
    
    # Precision
    bf16=args['bf16'],
    bf16_full_eval=args['bf16_full_eval'],
    
    # Data
    dataloader_num_workers=args['dataloader_num_workers'],
    dataloader_pin_memory=args.get('dataloader_pin_memory', True),
    group_by_length=args['group_by_length'],
    
    # Reproducibility
    seed=args.get('seed', 42),
    data_seed=args.get('data_seed', 42),
)

# Calculate stats
steps_per_epoch = len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
total_steps = steps_per_epoch * training_args.num_train_epochs

print("\n" + "="*80)
print("TRAINING PLAN - SMALL DATASET OPTIMIZED")
print("="*80)
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Evaluations: ~{total_steps // training_args.eval_steps}")
print(f"Early stopping: Will stop after {args['early_stopping_patience']} evals without improvement")
print(f"Expected training time: ~30-45 min on A100")
print("="*80)
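
The step arithmetic above can be checked by hand. The sketch below repeats the same integer division for a train split of 3,277 samples (roughly 85% of 3,855), 2 epochs, and evaluation every 25 steps. The batch size of 4 and the 4 gradient accumulation steps are assumptions made for illustration; the real values come from the config file.

```python
def plan_training(num_train, per_device_batch, grad_accum, epochs, eval_steps):
    """Mirror the notebook's arithmetic: optimizer steps per epoch, total steps, evaluations."""
    steps_per_epoch = num_train // (per_device_batch * grad_accum)
    total_steps = steps_per_epoch * epochs
    return steps_per_epoch, total_steps, total_steps // eval_steps


# 3,277 is about 85% of 3,855; batch size 4 with 4 accumulation steps is hypothetical.
steps_per_epoch, total_steps, num_evals = plan_training(3277, 4, 4, 2, 25)
print(steps_per_epoch, total_steps, num_evals)  # 204 408 16
```

Under these assumptions a patience of 5 evaluations corresponds to 125 optimizer steps without improvement, which is more than half an epoch.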

## 9. Initialize Trainer with Early Stopping

In [None]:
# Early stopping callback
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=config['training_args']['early_stopping_patience'],
    early_stopping_threshold=config['training_args'].get('early_stopping_threshold', 0.0)
)

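# NOTE: these keyword arguments follow older TRL releases. Newer releases take the
# tokenizer as `processing_class=` and expect `dataset_text_field`, the maximum
# sequence length, and `packing` on an `SFTConfig` passed as `args`, so pin `trl`
# or adapt this call if it raises a TypeError.
trainer = SFTTrainer(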
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=config['dataset_config']['max_length'],
    packing=config['dataset_config'].get('packing', False),
    callbacks=[early_stopping],  # Add early stopping
)

print("✓ Trainer initialized with Early Stopping")
print(f"  • Will monitor: {training_args.metric_for_best_model}")
print(f"  • Patience: {config['training_args']['early_stopping_patience']} evaluations")
print(f"  • Auto-stop if validation loss stops improving!")
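
The patience rule can be illustrated without any training. The function below is a simplified sketch of patience-based early stopping on a loss that should decrease. It is not the `EarlyStoppingCallback` implementation, which reads the best metric from the trainer state, and the loss values are made up for the example.

```python
def stop_index(eval_losses, patience=5, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or (best - loss) > threshold:
            # Improvement beyond the threshold: remember it and reset the counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Made-up eval losses: improvement for three evaluations, then a slow rise.
losses = [1.20, 1.05, 0.98, 0.99, 1.00, 1.01, 1.03, 1.04, 1.06]
print(stop_index(losses))  # 7
```

In this example the best loss occurs at index 2 and training stops five evaluations later, at index 7. Combined with `load_best_model_at_end`, the weights from the best evaluation are the ones restored.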

## 10. Start Training with Overfitting Monitoring

In [None]:
print("\n" + "="*80)
print(f"STARTING TRAINING - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*80)
print("\n🔍 MONITORING FOR OVERFITTING:")
print("  • Watch eval_loss vs train_loss")
print("  • If eval_loss increases while train_loss decreases = OVERFITTING")
print("  • Early stopping will auto-stop training if detected")
print("\n" + "="*80 + "\n")

train_result = trainer.train()

print("\n" + "="*80)
print(f"TRAINING COMPLETED - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*80)
print(f"\nFinal Training Loss: {train_result.training_loss:.4f}")
print(f"Runtime: {train_result.metrics['train_runtime']:.2f}s ({train_result.metrics['train_runtime']/60:.1f} min)")
print(f"Samples/sec: {train_result.metrics['train_samples_per_second']:.2f}")

# Check if early stopping was triggered
if 'epoch' in train_result.metrics:
    actual_epochs = train_result.metrics['epoch']
    planned_epochs = training_args.num_train_epochs
    if actual_epochs < planned_epochs:
        print(f"\n⚠️  EARLY STOPPING TRIGGERED at epoch {actual_epochs:.2f}/{planned_epochs}")
        print("    This means validation loss stopped improving - GOOD!")
    else:
        print(f"\n✓ Completed all {planned_epochs} epochs")

print("="*80)

## 11. Save Model & Metrics

In [None]:
final_dir = f"{training_args.output_dir}/final_adapter"
trainer.model.save_pretrained(final_dir)
tokenizer.save_pretrained(final_dir)

# Save config
with open(f"{training_args.output_dir}/config.json", 'w', encoding='utf-8') as f:
    json.dump(config, f, indent=2, ensure_ascii=False)

# Save metrics
with open(f"{training_args.output_dir}/metrics.json", 'w', encoding='utf-8') as f:
    json.dump(train_result.metrics, f, indent=2)

print(f"✓ Model saved: {final_dir}")
print(f"✓ Config saved: {training_args.output_dir}/config.json")
print(f"✓ Metrics saved: {training_args.output_dir}/metrics.json")

## 12. Final Evaluation

In [None]:
print("\nFinal evaluation...\n")
eval_results = trainer.evaluate()

print("="*80)
print("FINAL EVALUATION RESULTS")
print("="*80)
for k, v in eval_results.items():
    print(f"{k}: {v}")
print("="*80)

# Save eval results
with open(f"{training_args.output_dir}/eval_results.json", 'w') as f:
    json.dump(eval_results, f, indent=2)

print(f"\n✓ Evaluation saved: {training_args.output_dir}/eval_results.json")

## 13. Test Inference - Quality Check

In [None]:
def generate_response(question, max_new_tokens=512):
    messages = [
        {
            "role": "system",
            "content": "Anda adalah asisten informasi UNSIQ (Universitas Sains Al-Qur'an) yang membantu menjawab pertanyaan tentang biaya kuliah, program studi, dan informasi akademik."
        },
        {"role": "user", "content": question}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    gen_cfg = config.get('generation_config', {})
    model.eval()  # make sure LoRA dropout is disabled during generation
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=gen_cfg.get('max_new_tokens', max_new_tokens),
            temperature=gen_cfg.get('temperature', 0.7),
            top_p=gen_cfg.get('top_p', 0.9),
            top_k=gen_cfg.get('top_k', 50),
            repetition_penalty=gen_cfg.get('repetition_penalty', 1.1),
            do_sample=gen_cfg.get('do_sample', True),
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    generated = outputs[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

test_questions = [
    "Berapa total biaya kuliah S1 Akuntansi di UNSIQ?",
    "Apa itu KIP Kuliah dan syaratnya?",
    "Mengapa semester 1 lebih mahal?",
    "Program studi apa saja yang ada di UNSIQ?",
    "Bagaimana sistem pembayaran di UNSIQ?"
]

print("\n" + "="*80)
print("QUALITY CHECK - TESTING FINE-TUNED MODEL")
print("="*80)

for i, q in enumerate(test_questions, 1):
    print(f"\n{'='*80}")
    print(f"TEST {i}/{len(test_questions)}")
    print(f"{'='*80}")
    print(f"\n❓ Q: {q}")
    print(f"\n🤖 A: {generate_response(q)}")
    print(f"\n{'-'*80}")

## 14. Overfitting Analysis

In [None]:
# Analyze training history for overfitting (uses only json and os, imported above)

print("\n" + "="*80)
print("OVERFITTING ANALYSIS")
print("="*80)

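# Load training history. By default trainer_state.json is written only inside each
# checkpoint-* directory, so save a copy into output_dir before reading it.
trainer.save_state()
history_file = f"{training_args.output_dir}/trainer_state.json"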
if os.path.exists(history_file):
    with open(history_file, 'r') as f:
        history = json.load(f)
    
    log_history = history.get('log_history', [])
    
    # Extract train and eval losses
    train_losses = [(h['step'], h['loss']) for h in log_history if 'loss' in h and 'eval_loss' not in h]
    eval_losses = [(h['step'], h['eval_loss']) for h in log_history if 'eval_loss' in h]
    
    if train_losses and eval_losses:
        print(f"\n📊 Loss Progression:")
        print(f"\nTrain Loss:")
        for step, loss in train_losses[-5:]:
            print(f"  Step {step}: {loss:.4f}")
        
        print(f"\nEval Loss:")
        for step, loss in eval_losses[-5:]:
            print(f"  Step {step}: {loss:.4f}")
        
        # Check for overfitting
        if len(eval_losses) >= 2:
            last_eval = eval_losses[-1][1]
            best_eval = min(e[1] for e in eval_losses)
            
            print(f"\n🎯 Overfitting Check:")
            print(f"  Best eval loss: {best_eval:.4f}")
            print(f"  Final eval loss: {last_eval:.4f}")
            
            if last_eval > best_eval * 1.05:
                print(f"  ⚠️  WARNING: Eval loss increased by {((last_eval/best_eval-1)*100):.1f}% - possible overfitting")
            else:
                print(f"  ✓ GOOD: No significant overfitting detected")
else:
    print("\nTrainer history not found. Check TensorBoard for analysis.")

print("\n" + "="*80)
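
The same check can be packaged as a small function that works directly on a `log_history` list, which makes it easy to test on synthetic data. This is a sketch under the same 5% tolerance used above; `overfitting_report` is a name introduced here, and the log entries are invented for illustration and are not results from this run.

```python
def overfitting_report(log_history, tolerance=0.05):
    """Compare the final eval loss with the best one seen during training."""
    eval_losses = [h["eval_loss"] for h in log_history if "eval_loss" in h]
    if len(eval_losses) < 2:
        return None  # not enough evaluations to compare
    best, last = min(eval_losses), eval_losses[-1]
    return {"best": best, "last": last, "overfitting": last > best * (1 + tolerance)}


# Synthetic log entries for illustration only; not results from this run.
fake_history = [
    {"step": 25, "loss": 1.40},
    {"step": 25, "eval_loss": 1.30},
    {"step": 50, "loss": 1.00},
    {"step": 50, "eval_loss": 1.10},
    {"step": 75, "loss": 0.70},
    {"step": 75, "eval_loss": 1.25},
]
print(overfitting_report(fake_history))
```

Here the train loss keeps falling while the eval loss rebounds from 1.10 to 1.25, more than 5% above its best value, so the report flags possible overfitting.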

## 15. List All Checkpoints

In [None]:
checkpoints = sorted(
    [d for d in os.listdir(training_args.output_dir) if d.startswith('checkpoint-')],
    key=lambda x: int(x.split('-')[-1])
)

print("\n" + "="*80)
print(f"SAVED CHECKPOINTS ({len(checkpoints)} total)")
print("="*80)

for i, cp in enumerate(checkpoints, 1):
    cp_path = os.path.join(training_args.output_dir, cp)
    size = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(cp_path)
        for f in files
    )
    step = int(cp.split('-')[-1])
    print(f"{i:2d}. {cp:20s} | Step {step:4d} | {size/1024**2:6.1f} MB")

print("="*80)

## 16. TensorBoard Visualization

In [None]:
%load_ext tensorboard

print("\n📊 Opening TensorBoard...")
print("\nMonitor these metrics for overfitting:")
print("  1. train/loss vs eval/loss - Should decrease together")
print("  2. If eval/loss increases while train/loss decreases = OVERFITTING")
print("  3. Learning rate schedule")
print("  4. Gradient norms\n")

%tensorboard --logdir {training_args.output_dir}

## 17. Training Summary

In [None]:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

summary = f"""
{'='*80}
TRAINING SUMMARY - SMALL DATASET OPTIMIZED
{'='*80}

📊 Dataset:
  • Total samples: {config['dataset_stats']['total_samples']}
  • Unique questions: {config['dataset_stats']['unique_questions']}
  • Train: {len(train_dataset)} | Eval: {len(eval_dataset)}
  • Type: Small domain-specific dataset

🔧 Configuration (Overfitting Prevention):
  • LoRA Rank: {config['qlora_config']['r']} (reduced to prevent memorization)
  • Dropout: {config['qlora_config']['lora_dropout']} (higher for regularization)
  • Weight Decay: {config['training_args']['weight_decay']} (strong L2 reg)
  • Learning Rate: {config['training_args']['learning_rate']} (conservative)
  • Epochs: {config['training_args']['num_train_epochs']} (limited)
  • Early Stopping: Patience {config['training_args']['early_stopping_patience']}

📈 Results:
  • Final train loss: {train_result.training_loss:.4f}
  • Final eval loss: {eval_results.get('eval_loss', 'N/A')}
  • Training time: {train_result.metrics['train_runtime']/60:.1f} minutes
  • Trainable params: {trainable:,} ({100*trainable/total:.4f}%)

✅ Overfitting Prevention Applied:
{chr(10).join('  ' + str(i+1) + '. ' + t for i, t in enumerate(config['overfitting_prevention']['techniques_applied']))}

📁 Output: {training_args.output_dir}/
  • final_adapter/ - Best model based on eval_loss
  • checkpoint-*/ - All training checkpoints
  • metrics.json - Training metrics
  • eval_results.json - Evaluation results

🎓 Research-Based:
{chr(10).join('  • ' + p for p in config['research_based_on']['papers'])}

{'='*80}
✅ TRAINING COMPLETE - MODEL OPTIMIZED FOR SMALL DATASET
{'='*80}
"""

print(summary)

# Save summary
with open(f"{training_args.output_dir}/SUMMARY.txt", 'w', encoding='utf-8') as f:
    f.write(summary)

print(f"\n✓ Summary saved: {training_args.output_dir}/SUMMARY.txt")