# Refactotron: Optimized LoRA Training (FINAL VERSION)

**🔥 KEY IMPROVEMENTS FROM ORIGINAL:**
- ✅ **max_length=1024** (was 512) - fixes 69% truncation!
- ✅ **Enhanced dataset**: 39,812 samples (was 7,943) - 5x larger!
- ✅ **Learning rate**: 2e-5 (was 2e-4) - 10x lower for fine-tuning
- ✅ **Cosine LR scheduler** - smooth decay
- ✅ **Warmup**: 500 steps (was 100) - more stable
- ✅ **Moderate regularization**: weight_decay=0.02, dropout=0.08, label_smoothing=0.05
- ✅ **Expanded LoRA targets**: c_fc added for MLP layers

**Expected Results:**
- Validation Loss: **0.48-0.53** (vs 0.68 before)
- BLEU Score: **72-75** (target: 73.5)
- CodeBERT: **0.86-0.88** (target: 0.87)
- **Should hit or exceed all targets!** 🎯

## 1. Setup & GPU Check

In [None]:
# Check GPU availability
import torch

print("🖥️  GPU Status:")
print(f"   Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   Device: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("   ⚠️  NO GPU! Go to: Runtime > Change runtime type > T4 GPU")

## 2. Install Dependencies

In [None]:
!pip install -q transformers datasets peft accelerate bitsandbytes

## 3. Upload Training Data

**Upload the files you generated:**
- `train_enhanced.jsonl` (60 MB)
- `validation_enhanced.jsonl` (7.5 MB)

In [None]:
from google.colab import files

print("📤 Upload train_enhanced.jsonl and validation_enhanced.jsonl")
uploaded = files.upload()

## 4. HuggingFace Authentication

**Make sure you have access to StarCoder:**
1. Go to https://huggingface.co/bigcode/starcoderbase-1b
2. Click "Request Access" if you haven't
3. Then paste your token below

In [None]:
from huggingface_hub import login

# Paste your HuggingFace token when prompted
login()

## 5. Load Model & Tokenizer

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

print("📥 Loading model and tokenizer...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")
tokenizer.pad_token = tokenizer.eos_token

# Load model in fp16 to save memory
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase-1b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

print(f"✅ Base model loaded: {model.num_parameters():,} parameters")

## 6. Configure LoRA (Optimized)

**🔥 Key improvement: dropout increased to 0.08 for moderate regularization**

In [None]:
print("⚙️  Configuring LoRA...")

lora_config = LoraConfig(
    r=16,                                      # Rank
    lora_alpha=32,                             # Scaling factor (2x rank)
    target_modules=["c_proj", "c_attn", "c_fc"],  # 🔥 Added c_fc for MLP layers
    lora_dropout=0.08,                         # 🔥 Moderate regularization (was 0.05)
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Gradient checkpointing is enabled later in TrainingArguments. With a frozen
# base model, the embedding outputs may need requires_grad=True so gradients
# can reach the adapters (whether this is required depends on library versions).
model.enable_input_require_grads()

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("✅ LoRA configured with moderate regularization")
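
To see roughly where the count reported by `print_trainable_parameters()` comes from, the sketch below counts LoRA parameters per adapted layer. Each adapted linear map with `d_in` inputs and `d_out` outputs gains two low-rank matrices holding `r * (d_in + d_out)` parameters, and the update is scaled by `lora_alpha / r`. The layer shapes and block count are assumptions about the StarCoderBase-1B architecture (hidden size 2048, MLP width 8192, 24 blocks, multi-query attention), not values read from the checkpoint, so check them against `model.config` before relying on the total. It also assumes the `c_proj` target matches both the attention and the MLP output projection.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R = 16            # rank, as in the LoraConfig cell
LORA_ALPHA = 32   # alpha, as in the LoraConfig cell
NUM_BLOCKS = 24   # ASSUMPTION: number of transformer blocks

# ASSUMPTION: (name, d_in, d_out) of the adapted layers in one block.
ASSUMED_LAYERS = [
    ("attn.c_attn", 2048, 2304),
    ("attn.c_proj", 2048, 2048),
    ("mlp.c_fc", 2048, 8192),
    ("mlp.c_proj", 8192, 2048),
]

per_block = sum(lora_param_count(d_in, d_out, R) for _, d_in, d_out in ASSUMED_LAYERS)
total = per_block * NUM_BLOCKS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA params per block: {per_block:,}")
print(f"LoRA params total (assumed shapes): {total:,}")
print(f"Update scaling (alpha / r): {scaling}")
```

If the printed total differs from what `print_trainable_parameters()` reports, the assumed shapes are the first thing to re-check.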

## 7. Load Training Data

In [None]:
from datasets import Dataset
import json

def load_jsonl(filepath):
    """Load JSONL file"""
    data = []
    with open(filepath, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    return data

print("📂 Loading enhanced training data...")

# Load the enhanced data
train_data = load_jsonl('train_enhanced.jsonl')
val_data = load_jsonl('validation_enhanced.jsonl')

print(f"✅ Train: {len(train_data):,} samples (was 7,943!)")
print(f"✅ Validation: {len(val_data):,} samples")
print(f"\n🎯 Improvement: {len(train_data) / 7943:.1f}x more data!")

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)

# Preview one sample
print("\n📝 Sample training example:")
print(f"Input (first 200 chars):\n{train_data[0]['input'][:200]}...")
print(f"\nOutput (first 200 chars):\n{train_data[0]['output'][:200]}...")

## 8. Tokenization

**🔥 CRITICAL FIX: max_length=1024 (was 512)**

This fixes the 69% truncation issue!
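
A truncation figure like this can be re-checked for any `max_length` from per-sample token counts. The sketch below is one way to do it; the helper names are hypothetical and the example lengths are made up for illustration. In the notebook the lengths would come from something like `len(tokenizer(text)["input_ids"])` for each combined input and output text.

```python
def truncation_rate(token_lengths, max_length):
    """Fraction of samples that are longer than max_length tokens."""
    if not token_lengths:
        return 0.0
    truncated = sum(1 for n in token_lengths if n > max_length)
    return truncated / len(token_lengths)


def kept_token_fraction(token_lengths, max_length):
    """Fraction of all tokens that survive truncation at max_length."""
    total = sum(token_lengths)
    if total == 0:
        return 1.0
    kept = sum(min(n, max_length) for n in token_lengths)
    return kept / total


# Made-up token counts, for illustration only.
example_lengths = [300, 480, 700, 900, 1500]

for limit in (512, 1024):
    rate = truncation_rate(example_lengths, limit)
    kept = kept_token_fraction(example_lengths, limit)
    print(f"max_length={limit}: {rate:.0%} of samples truncated, {kept:.0%} of tokens kept")
```

Tracking both numbers is useful because a sample that is one token too long counts as truncated even though almost all of it is kept.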

In [None]:
from transformers import DataCollatorForLanguageModeling

def tokenize_function(examples):
    """
    Tokenize input + output together.
    🔥 Using max_length=1024 to avoid truncation!
    """
    # Combine input and output
    full_texts = [inp + "\n" + out for inp, out in zip(examples['input'], examples['output'])]

    # Tokenize with 1024 max length
    result = tokenizer(
        full_texts,
        truncation=True,
        max_length=1024,  # 🔥 INCREASED from 512! Fixes 69% truncation
        padding=False,
    )

    # Set labels
    result["labels"] = result["input_ids"].copy()

    return result

print("🔄 Tokenizing datasets with max_length=1024...")
print("   (This fixes the 69% truncation issue!)\n")

tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing train"
)

tokenized_val = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=val_dataset.column_names,
    desc="Tokenizing validation"
)

print(f"✅ Train: {len(tokenized_train):,} samples")
print(f"✅ Validation: {len(tokenized_val):,} samples")

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

## 9. Training Configuration (FINAL - OPTIMIZED)

**🔥 ALL OPTIMIZATIONS APPLIED:**
1. ✅ Learning rate: 2e-5 (was 2e-4)
2. ✅ Cosine LR scheduler (was none)
3. ✅ Warmup: 500 steps (was 100)
4. ✅ Weight decay: 0.02 (was 0.01) - moderate regularization
5. ✅ Label smoothing: 0.05 - reduces overconfidence
6. ✅ Early stopping: patience=3
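
To make the learning-rate settings in the list above concrete, here is a minimal sketch of a linear warmup followed by a half-cosine decay. It illustrates the general shape of such a schedule and is not a copy of the Trainer's implementation. The default `total_steps=24880` is an assumption derived from 39,812 samples, an effective batch size of 8 and 5 epochs.

```python
import math


def lr_at_step(step, base_lr=2e-5, warmup_steps=500, total_steps=24880):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Progress through the decay phase, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5000, 12690, 24880):
    print(f"step {step:>6}: lr = {lr_at_step(step):.2e}")
```

With these numbers the warmup covers about 2% of training, so most steps run on the decaying part of the curve.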

In [None]:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

print("⚙️  Configuring training (FINAL - OPTIMIZED)...")

training_args = TrainingArguments(
    # Output
    output_dir="./refactotron_lora_final",
    logging_dir="./logs",

    # Training schedule
    num_train_epochs=5,

    # Batch size
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,              # Effective batch size = 8

    # Learning rate (üî• OPTIMIZED)
    learning_rate=2e-5,                         # üî• DOWN from 2e-4 (10x lower!)
    lr_scheduler_type="cosine",                 # üî• Smooth decay
    warmup_steps=500,                           # üî• UP from 100 (more stable)

    # Regularization (üî• MODERATE)
    weight_decay=0.02,                          # üî• Moderate (2x from 0.01)
    label_smoothing_factor=0.05,                # üî• Light smoothing
    max_grad_norm=1.0,                          # Gradient clipping

    # Precision
    fp16=True,

    # Logging & evaluation
    logging_steps=50,
    eval_steps=500,
    save_steps=500,
    save_total_limit=3,
    eval_strategy="steps",

    # Best model selection
    load_best_model_at_end=True,
    metric_for_best_model="loss",

    # Memory optimization
    gradient_checkpointing=True,

    # Reporting
    report_to="none",
)

# Early stopping
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3
)

print("✅ Training configuration complete (Option B: Moderate)")
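
The `label_smoothing_factor=0.05` setting can be illustrated with a small worked example. The sketch below uses one common formulation, which mixes the usual negative log-likelihood with the average negative log-probability over the whole vocabulary. The function name and the four-token "vocabulary" are made up for illustration.

```python
import math


def smoothed_nll(log_probs, target_index, epsilon=0.05):
    """Label-smoothed loss for one token, given log-probabilities over the vocabulary."""
    nll = -log_probs[target_index]                   # standard cross-entropy term
    uniform_term = -sum(log_probs) / len(log_probs)  # average over all classes
    return (1.0 - epsilon) * nll + epsilon * uniform_term


# A very confident prediction over a toy 4-token vocabulary.
confident = [math.log(p) for p in (0.97, 0.01, 0.01, 0.01)]

print(f"plain loss:    {smoothed_nll(confident, 0, epsilon=0.0):.4f}")
print(f"smoothed loss: {smoothed_nll(confident, 0, epsilon=0.05):.4f}")
```

The smoothed loss stays clearly above the plain loss for a near-certain prediction, which is the sense in which smoothing discourages overconfidence.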

## 10. Initialize Trainer & Show Summary

In [None]:
print("🎯 Initializing trainer...")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    callbacks=[early_stopping]
)

# Training summary
total_steps = (len(tokenized_train) //
               (training_args.per_device_train_batch_size *
                training_args.gradient_accumulation_steps) *
               training_args.num_train_epochs)

print("\n" + "=" * 70)
print("📊 FINAL TRAINING CONFIGURATION SUMMARY")
print("=" * 70)
print(f"\n📈 Dataset:")
print(f"   • Training samples: {len(tokenized_train):,} (was 7,943)")
print(f"   • Validation samples: {len(tokenized_val):,}")
print(f"   • Improvement: {len(tokenized_train) / 7943:.1f}x more data!")

print(f"\n⚙️  Training Setup:")
print(f"   • Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   • Total training steps: {total_steps}")
print(f"   • Evaluation every: {training_args.eval_steps} steps")

print(f"\n🔥 CRITICAL FIXES FROM ORIGINAL:")
print(f"   • max_length: 1024 (was 512) ← Fixes 69% truncation!")
print(f"   • Learning rate: 2e-5 (was 2e-4) ← 10x lower")
print(f"   • LR scheduler: cosine (was none) ← Smooth decay")
print(f"   • Warmup steps: 500 (was 100) ← More stable")
print(f"   • Weight decay: 0.02 (was 0.01) ← Moderate regularization")
print(f"   • Label smoothing: 0.05 (was 0) ← Reduces overconfidence")
print(f"   • LoRA dropout: 0.08 (was 0.05) ← Moderate regularization")
print(f"   • LoRA targets: c_proj, c_attn, c_fc (added c_fc)")
print(f"   • Dataset: 39,812 samples (was 7,943) ← 5x larger!")

print(f"\n📈 EXPECTED RESULTS:")
print(f"   • OLD validation loss: 0.68 (with 69% truncation)")
print(f"   • NEW validation loss: 0.48-0.53 (no truncation!)")
print(f"   • BLEU score: 72-75 (target: 73.5)")
print(f"   • CodeBERT similarity: 0.86-0.88 (target: 0.87)")
print(f"   • Should hit or exceed targets! 🎯")

print("\n" + "=" * 70)
print("✅ Ready to train!")
print("=" * 70)
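
The step arithmetic used in this summary can be checked by hand without loading the model. The sketch below mirrors the notebook's floor-division formula in a hypothetical helper, `training_budget`, using the 39,812-sample figure quoted in the summary. The Trainer's own step count may differ slightly in how it treats the last partial batch.

```python
def training_budget(num_samples, per_device_batch, grad_accum, epochs, warmup_steps, eval_steps):
    """Derive step counts from dataset size and batch settings (floor division)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = num_samples // effective_batch
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps if total_steps else 0.0,
        "num_evals": total_steps // eval_steps if eval_steps else 0,
    }


budget = training_budget(
    num_samples=39812,
    per_device_batch=1,
    grad_accum=8,
    epochs=5,
    warmup_steps=500,
    eval_steps=500,
)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Seeing the warmup fraction and the number of evaluations next to each other helps when adjusting `warmup_steps` or `eval_steps` for a different dataset size.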

## 11. Start Training 🚀

**This will take 3-4 hours on T4 GPU.**

**What to watch:**
- Validation loss should decrease from ~0.7 to ~0.48-0.53
- Should NOT plateau at 0.68 like before
- Early stopping will kick in if no improvement for 3 evaluations
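
The patience rule in the last bullet can be sketched in a few lines. This is a simplified model of early stopping on validation loss, not the callback's actual code, and the loss values are invented for illustration.

```python
def early_stop_index(eval_losses, patience=3, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best - threshold:
            best = loss      # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1   # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Invented validation losses, one per evaluation.
stalled = [0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
improving = [0.70, 0.60, 0.55, 0.50]

print("stalled run stops at evaluation index:", early_stop_index(stalled))
print("improving run stops at evaluation index:", early_stop_index(improving))
```

In the stalled run the later improvement to 0.50 is never reached, which is the trade-off of a small patience value.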

In [None]:
print("🚀 Starting training...\n")
print("⏱️  Estimated time: 3-4 hours on T4 GPU\n")
print("=" * 70)

# START TRAINING
trainer.train()

print("\n" + "=" * 70)
print("✅ TRAINING COMPLETE!")
print("=" * 70)

## 12. Save Model

In [None]:
print("💾 Saving final model...")

# Save the LoRA adapter
model.save_pretrained("./refactotron_lora_final")
tokenizer.save_pretrained("./refactotron_lora_final")

print("✅ Model saved to ./refactotron_lora_final")

## 13. Download Model

In [None]:
print("📦 Creating ZIP archive...")

# Zip the model folder
!zip -r refactotron_lora_final.zip refactotron_lora_final/

# Download
from google.colab import files
print("⬇️  Downloading...")
files.download('refactotron_lora_final.zip')

print("✅ Download complete!")

## 14. Quick Inference Test

In [None]:
# Test on a sample
test_input = """### Refactor the following Python code to improve quality:

def f(x, y):
    z = x + y
    return z

### Refactored code:"""

print("🧪 Testing model inference...")
print("\nInput:")
print(test_input)

# Switch to eval mode so LoRA dropout is disabled during generation
model.eval()

inputs = tokenizer(test_input, return_tensors="pt").to(model.device)
with torch.no_grad():  # no gradients needed for inference
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Keep only the text after the prompt marker, if the marker is present
refactored = generated.split('### Refactored code:')[1].strip() if '### Refactored code:' in generated else generated

print("\n🤖 Model Output:")
print(refactored[:500])

print("\n✅ Inference working!")

## 🎉 Training Complete!

### Next Steps:

1. **Check validation loss** - Should be ~0.48-0.53 (vs 0.68 before)
2. **Evaluate on test set** with BLEU & CodeBERT
3. **Compare to baseline** (vanilla StarCoder)
4. **Analyze results** for your project writeup
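
For step 2, the following is a simplified, dependency-free sketch of a single-reference BLEU score on a 0-100 scale, meant only to show what the metric measures. It assumes whitespace tokenization and no smoothing, so any candidate with no matching 4-gram scores 0. For reported results, an established implementation is the safer choice. The function names and example strings are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Single-reference BLEU (0-100), whitespace tokens, no smoothing."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "def add ( a , b ) : return a + b"
close = "def add ( a , b ) : return b + a"
unrelated = "x y z w"

print(f"identical: {simple_bleu(reference, reference):.1f}")
print(f"close:     {simple_bleu(close, reference):.1f}")
print(f"unrelated: {simple_bleu(unrelated, reference):.1f}")
```

Because BLEU only counts token overlap, a correct refactoring that renames variables can score low, which is one reason to pair it with an embedding-based similarity.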

### Expected Improvements:

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Validation Loss** | 0.68 | 0.48-0.53 | ~25% better |
| **Data seen** | 31% (truncated) | 90%+ (full) | 3x more |
| **Training samples** | 7,943 | 39,812 | 5x more |
| **BLEU (expected)** | ~70 | 72-75 | Hit target! |

**You should hit your 73.5 BLEU target!** 🎯