# Refactotron: Optimized LoRA Training

**Improvements over previous version:**
- ✅ 100% syntactically valid training data (vs 93% broken)
- ✅ Learning rate: 2e-5 (down from 2e-4)
- ✅ Cosine LR scheduler with proper warmup
- ✅ Weight decay regularization
- ✅ Expanded LoRA target modules
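
The first improvement above, syntactic validity of the training data, can be spot-checked with a few lines of code. The sketch below is only an illustration: it assumes the samples are Python source stored under the `output` key of each JSONL record (the key used later in this notebook), and the helper names `is_valid_python` and `validity_rate` are hypothetical, not part of the original pipeline.

```python
import ast
import json


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python (assumption: samples are Python)."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def validity_rate(jsonl_lines, field="output"):
    """Fraction of records whose `field` parses; blank lines are ignored."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        return 0.0
    valid = sum(is_valid_python(record[field]) for record in records)
    return valid / len(records)
```

Calling `validity_rate(open("train.jsonl", encoding="utf-8"))` on the uploaded file should return 1.0 if every sample parses.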

**Expected Results:**
- Validation Loss: 0.55-0.60 (vs 0.68 before)
- BLEU Score: 70-73 (target: 73.5)
- CodeBERT: 0.85-0.87 (target: 0.87)

## 1. Setup & GPU Check

In [1]:
# Check GPU availability
import torch

print("🖥️  GPU Status:")
print(f"   Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   Device: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("   ⚠️  NO GPU! Go to: Runtime > Change runtime type > T4 GPU")

🖥️  GPU Status:
   Available: False
   ⚠️  NO GPU! Go to: Runtime > Change runtime type > T4 GPU


## 2. Install Dependencies

In [None]:
!pip install -q transformers datasets peft accelerate bitsandbytes

## 3. Upload Training Data

**Click the folder icon on the left sidebar and upload:**
- `train.jsonl` (7,943 samples)
- `validation.jsonl` (992 samples)

Or run the cell below to upload via file picker:

In [None]:
from google.colab import files

print("📤 Upload train.jsonl and validation.jsonl")
uploaded = files.upload()

## 4. HuggingFace Authentication

In [None]:
from huggingface_hub import login

# Paste your HuggingFace token when prompted
login()

## 5. Load Model & Tokenizer

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

print("📥 Loading model and tokenizer...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")
tokenizer.pad_token = tokenizer.eos_token

# Load model in fp16 to save memory
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase-1b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

print(f"✅ Base model loaded: {model.num_parameters():,} parameters")

## 6. Configure LoRA (Optimized)

**Key improvements:**
- Added `c_fc` to target modules (MLP layers)
- Light dropout (0.05) for regularization
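
To get a feel for what adding a target module costs, the arithmetic can be sketched as follows. For a linear layer mapping `d_in` to `d_out`, LoRA trains two low-rank matrices with `r * (d_in + d_out)` parameters in total, and the update is scaled by `lora_alpha / r` (the default PEFT scaling). The layer dimensions in the usage note are illustrative assumptions, not values read from the model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A (default LoRA scaling)."""
    return lora_alpha / r
```

With `r=16` and a hypothetical 2048-to-8192 MLP projection, `lora_param_count(2048, 8192, 16)` gives 163,840 trainable parameters against roughly 16.8M frozen weights in that layer, and `lora_scaling(32, 16)` gives 2.0.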

In [None]:
print("⚙️  Configuring LoRA...")

lora_config = LoraConfig(
    r=16,                                      # Rank
    lora_alpha=32,                             # Scaling factor (2x rank)
    target_modules=["c_proj", "c_attn", "c_fc"],  # 🔥 Added c_fc for MLP layers
    lora_dropout=0.05,                         # Light dropout
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

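# Apply LoRA
model = get_peft_model(model, lora_config)
# Commonly needed when gradient checkpointing (enabled in the training config)
# is combined with a frozen base model, so checkpointed blocks receive
# inputs that require grad
model.enable_input_require_grads()
model.print_trainable_parameters()

print("✅ LoRA configured")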

## 7. Load Training Data

In [None]:
from datasets import Dataset
import json

def load_jsonl(filepath):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))
    return data

print("📂 Loading training data...")

# Load the data
train_data = load_jsonl('train.jsonl')
val_data = load_jsonl('validation.jsonl')

print(f"✅ Train: {len(train_data)} samples")
print(f"✅ Validation: {len(val_data)} samples")

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)

# Preview one sample
print("\n📝 Sample training example:")
print(f"Input (first 200 chars):\n{train_data[0]['input'][:200]}...")
print(f"\nOutput (first 200 chars):\n{train_data[0]['output'][:200]}...")

## 8. Tokenization

In [None]:
from transformers import DataCollatorForLanguageModeling

def tokenize_function(examples):
    """
    Tokenize input + output together.
    The model will learn to predict the output given the input.
    """
    # Combine input and output
    full_texts = [inp + "\n" + out for inp, out in zip(examples['input'], examples['output'])]

    # Tokenize
    result = tokenizer(
        full_texts,
        truncation=True,
        max_length=512,
        padding=False,
    )

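    # Set labels (loss is computed over input and output tokens alike).
    # With mlm=False the data collator rebuilds labels from input_ids
    # and masks padding positions with -100.
    result["labels"] = result["input_ids"].copy()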

    return result

print("🔄 Tokenizing datasets...")

tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing train"
)

tokenized_val = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=val_dataset.column_names,
    desc="Tokenizing validation"
)

print(f"✅ Train: {len(tokenized_train)} samples")
print(f"✅ Validation: {len(tokenized_val)} samples")

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

## 9. Training Configuration (OPTIMIZED)

**🔥 Key Optimizations:**
1. **Learning rate: 2e-5** (down from 2e-4) - 10x lower for fine-tuning
2. **Cosine LR scheduler** - smooth decay instead of constant
3. **Warmup: 500 steps** (up from 100) - more stable training
4. **Weight decay: 0.01** - decoupled weight decay (AdamW), acting as L2-style regularization
5. **Early stopping: patience=3** - stop after 3 evaluations without improvement to prevent overfitting
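
To make items 1-3 concrete, the learning rate at any step can be sketched in plain Python. This is a minimal re-implementation of the usual linear-warmup-then-cosine-decay shape, not the Trainer's own code. The default `total_steps=4960` is an assumption derived from this notebook's numbers (7,943 samples, effective batch size 8, 5 epochs), and the exact step count the Trainer uses may differ slightly.

```python
import math


def lr_at_step(step, base_lr=2e-5, warmup_steps=500, total_steps=4960):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
```

Under these assumptions the rate is halfway to its peak at step 250, reaches 2e-5 at step 500, and decays smoothly to zero by the final step, so warmup covers roughly the first 10% of training.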

In [None]:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

print("⚙️  Configuring training (OPTIMIZED)...")

training_args = TrainingArguments(
    # Output
    output_dir="./refactotron_lora_optimized",
    logging_dir="./logs",

    # Training schedule
    num_train_epochs=5,

    # Batch size
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,              # Effective batch size = 8

    # Learning rate (🔥 OPTIMIZED)
    learning_rate=2e-5,                         # 🔥 DOWN from 2e-4 (10x lower!)
    lr_scheduler_type="cosine",                 # 🔥 ADDED cosine decay
    warmup_steps=500,                           # 🔥 UP from 100

    # Regularization (🔥 OPTIMIZED)
    weight_decay=0.01,                          # 🔥 ADDED weight decay
    max_grad_norm=1.0,

    # Precision
    fp16=True,

    # Logging & evaluation
    logging_steps=50,
    eval_steps=500,
    save_steps=500,
    save_total_limit=3,
    eval_strategy="steps",

    # Best model selection
    load_best_model_at_end=True,
    metric_for_best_model="loss",

    # Memory optimization
    gradient_checkpointing=True,

    # Reporting
    report_to="none",
)

# Early stopping
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3
)

print("✅ Training configuration complete")

## 10. Initialize Trainer

In [None]:
print("🎯 Initializing trainer...")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    callbacks=[early_stopping]
)

# Training summary
total_steps = (len(tokenized_train) //
               (training_args.per_device_train_batch_size *
                training_args.gradient_accumulation_steps) *
               training_args.num_train_epochs)

print("\n" + "=" * 70)
print("📊 TRAINING CONFIGURATION SUMMARY")
print("=" * 70)
print(f"Total training samples: {len(tokenized_train)}")
print(f"Validation samples: {len(tokenized_val)}")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"Total training steps: {total_steps}")
print(f"Evaluation every: {training_args.eval_steps} steps")
print("\n🔥 OPTIMIZATIONS APPLIED:")
print("  • Learning rate: 2e-5 (was 2e-4)")
print("  • LR scheduler: cosine (was none)")
print("  • Warmup steps: 500 (was 100)")
print("  • Weight decay: 0.01 (was 0)")
print("  • LoRA targets: c_proj, c_attn, c_fc (added c_fc)")
print("\n📈 EXPECTED RESULTS:")
print("  • Validation loss: 0.55-0.60 (vs 0.68 before)")
print("  • BLEU score: 70-73 (vs target 73.5)")
print("  • CodeBERT similarity: 0.85-0.87 (vs target 0.87)")
print("=" * 70)

## 11. Start Training 🚀

**This will take 2-3 hours depending on GPU.**

Monitor the validation loss - it should:
- Decrease steadily from ~0.7 to ~0.55-0.60
- Stop early if no improvement for 3 evaluations
- NOT plateau at 0.68 like before!
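
The stopping rule in the second bullet can be sketched as a simple patience counter. This is a simplified illustration of the idea, not the `EarlyStoppingCallback` source; the function name `early_stop_index` and the sample loss values in the usage note are hypothetical.

```python
def early_stop_index(eval_losses, patience=3, threshold=0.0):
    """Return the index of the evaluation at which training would stop.

    Training stops once `patience` consecutive evaluations fail to beat
    the best loss seen so far by more than `threshold`. Returns None if
    that never happens.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - threshold:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None
```

For example, `early_stop_index([0.70, 0.65, 0.66, 0.66, 0.67])` returns 4: the best loss is at index 1 and the three evaluations after it do not improve on it. With evaluation every 500 steps, patience 3 corresponds to 1,500 training steps without improvement.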

In [None]:
print("🚀 Starting training...\n")

# START TRAINING
trainer.train()

print("\n" + "=" * 70)
print("✅ TRAINING COMPLETE!")
print("=" * 70)

## 12. Save Model

In [None]:
print("💾 Saving final model...")

# Save the LoRA adapter
model.save_pretrained("./refactotron_lora_final")
tokenizer.save_pretrained("./refactotron_lora_final")

print("✅ Model saved to ./refactotron_lora_final")

## 13. Download Model (Optional)

Download the trained model to your local machine:

In [None]:
# Zip the model folder
!zip -r refactotron_lora_final.zip refactotron_lora_final/

# Download
from google.colab import files
files.download('refactotron_lora_final.zip')

## 14. Test Inference (Optional)

Test the model on a sample:

In [None]:
# Load test data if present, otherwise fall back to the validation set
import os
import random

test_data = load_jsonl('test.jsonl') if os.path.exists('test.jsonl') else val_data

# Pick a random sample
sample = random.choice(test_data)

print("📝 INPUT (Degraded Code):")
print("=" * 70)
input_text = sample['input']
print(input_text)

# Generate (eval mode disables LoRA dropout; no_grad skips gradient tracking)
model.eval()
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract just the refactored part
refactored = generated.split('### Refactored code:')[1].strip() if '### Refactored code:' in generated else generated

print("\n🤖 MODEL OUTPUT (Refactored):")
print("=" * 70)
print(refactored[:500])

print("\n✅ EXPECTED OUTPUT:")
print("=" * 70)
print(sample['output'][:500])

## 🎉 Next Steps

**For full evaluation:**
1. Load test.jsonl
2. Generate refactored code for all test samples
3. Calculate BLEU score
4. Calculate CodeBERT similarity
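
As a rough illustration of step 3, here is a simplified corpus-level BLEU in pure Python: clipped n-gram precision up to 4-grams, whitespace tokenization, a brevity penalty, and no smoothing. It is a stand-in for the idea only, not the scoring used to set the targets here. An established implementation with its own tokenization and smoothing will give different numbers, so scores from this sketch should not be compared against the 73.5 target.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified BLEU (0-100) over parallel lists of strings, one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Counter intersection clips each n-gram to its reference count
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Here `hypotheses` would hold the generated refactorings and `references` the `output` fields of the test samples, in the same order.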

**Expected Results:**
- BLEU: 70-73 (target: 73.5)
- CodeBERT: 0.85-0.87 (target: 0.87)

If you hit these targets, you've successfully completed the project! 🎯