# Python Code Generator - Training on Google Colab

This notebook trains a **CodeLlama-7B** model on your Python code dataset using a **free T4 GPU**.

## Setup Instructions:
1. **Enable GPU**: Runtime → Change runtime type → T4 GPU → Save
2. **Run all cells** in order
3. **Upload** `python-codes-25k.json` when prompted
4. **Wait** 2-4 hours for training
5. **Download** the trained model

---

## Step 1: Check GPU Availability

In [None]:
!nvidia-smi

## Step 2: Install Dependencies

This will take about 5 minutes.

In [None]:
# The -q flag keeps pip output short while still showing errors and the message below.
!pip install -q transformers datasets accelerate peft trl
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

print("✅ Dependencies installed!")

## Step 3: Upload Dataset

Click the "Choose Files" button and select `python-codes-25k.json`
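
Once the file is uploaded, it can help to confirm it has the shape the preparation step expects: a JSON array of objects with `instruction` and `output` keys. The sketch below is a minimal check under that assumption. `summarize_records` is a hypothetical helper, not part of the notebook, and the two sample records are made up for illustration.

```python
import json

def summarize_records(records):
    """Count records that carry both keys the preparation step relies on."""
    usable = [
        r for r in records
        if isinstance(r, dict) and "instruction" in r and "output" in r
    ]
    return {"total": len(records), "usable": len(usable)}

# Made-up stand-in for the contents of the uploaded file.
sample_json = json.dumps([
    {"instruction": "Reverse a string", "output": "print('abc'[::-1])"},
    {"input": "record without the expected keys"},
])

summary = summarize_records(json.loads(sample_json))
print(summary)
```

To run the same check on the real file, load it with `json.load` and pass the result to the helper. A large gap between `total` and `usable` suggests the file uses different key names.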

In [None]:
from google.colab import files

print("📂 Upload your python-codes-25k.json file:")
uploaded = files.upload()

print("\n✅ Dataset uploaded!")

## Step 4: Prepare Data

Process the dataset into training format.
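
To make the target format concrete, here is a minimal sketch of the same transformation applied to one made-up record. It mirrors the regex used in the cell below. `FENCE`, `to_training_text`, and the sample record are illustrative names and data, not part of the dataset or the notebook.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable

def to_training_text(instruction: str, output: str) -> str:
    """Pull the code out of a fenced answer and wrap it in the prompt template."""
    match = re.search(FENCE + r"python\n(.*?)\n" + FENCE, output, re.DOTALL)
    code = match.group(1) if match else output
    return (
        "### Instruction:\n"
        f"{instruction.strip()}\n\n"
        "### Response:\n"
        f"{FENCE}python\n{code.strip()}\n{FENCE}"
    )

# Made-up record: the answer has chatter around the fenced code.
raw_output = (
    f"Sure, here you go:\n{FENCE}python\n"
    "def add(a, b):\n    return a + b\n"
    f"{FENCE}\nHope that helps!"
)
text = to_training_text("  Add two numbers ", raw_output)
print(text)
```

The text around the fence is dropped and only the code is kept. When an answer has no fenced block, the whole answer is used as the code.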

In [None]:
import json
import pandas as pd
from datasets import Dataset, DatasetDict
import re

def extract_code_from_output(output: str) -> str:
    """Extract Python code from markdown code blocks"""
    code_match = re.search(r'```python\n(.*?)\n```', output, re.DOTALL)
    if code_match:
        return code_match.group(1)
    return output

def format_instruction(instruction: str, output: str) -> dict:
    """Format data in instruction-following format"""
    code = extract_code_from_output(output)
    
    prompt = f"""### Instruction:
{instruction.strip()}

### Response:
```python
{code.strip()}
```"""
    
    return {
        "text": prompt,
        "instruction": instruction.strip(),
        "output": code.strip()
    }

# Load and process data
print("Loading dataset...")
with open('python-codes-25k.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"Loaded {len(data)} examples")

# Format examples
formatted_data = []
for item in data:
    if 'instruction' in item and 'output' in item:
        formatted_data.append(format_instruction(item['instruction'], item['output']))

print(f"Formatted {len(formatted_data)} examples")

# Create dataset
df = pd.DataFrame(formatted_data)
dataset = Dataset.from_pandas(df)
split_dataset = dataset.train_test_split(test_size=0.1, seed=42)

dataset_dict = DatasetDict({
    'train': split_dataset['train'],
    'validation': split_dataset['test']
})

print("\n✅ Data prepared!")
print(f"   Train: {len(dataset_dict['train'])} examples")
print(f"   Val: {len(dataset_dict['validation'])} examples")

# Show sample
print("\n--- Sample Example ---")
print(dataset_dict['train'][0]['text'][:500] + "...")

## Step 5: Load Model and Configure LoRA

Load CodeLlama-7B with efficient 4-bit quantization.
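
For a rough sense of scale, the arithmetic below estimates how many parameters LoRA trains at rank 16 on the seven projection matrices targeted in the cell below, and how much memory the base weights need at 16-bit versus 4-bit precision. The dimensions (hidden size 4096, MLP size 11008, 32 layers) are assumptions based on the published Llama-2-7B architecture that CodeLlama-7B builds on, and 7 billion is a round figure, so treat the results as estimates.

```python
HIDDEN = 4096          # assumed hidden size
INTERMEDIATE = 11008   # assumed MLP (feed-forward) size
LAYERS = 32            # assumed number of transformer blocks
RANK = 16              # LoRA rank r used in this notebook

def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """A LoRA adapter adds two small matrices: (in x r) and (r x out)."""
    return rank * (in_features + out_features)

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, INTERMEDIATE)    # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN)        # down_proj
)
trainable = per_layer * LAYERS
print(f"Trainable LoRA parameters: {trainable:,}")

BASE_PARAMS = 7_000_000_000  # round figure for a "7B" model
print(f"fp16 weights:  ~{BASE_PARAMS * 2 / 1e9:.1f} GB")
print(f"4-bit weights: ~{BASE_PARAMS * 0.5 / 1e9:.1f} GB")
```

Under these assumptions the adapter is roughly 0.6% of the base model, and the 4-bit weights take about a quarter of the memory of the fp16 weights. You can compare the estimate with the trainable-parameter count printed after the model loads.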

In [None]:
from unsloth import FastLanguageModel
import torch

# Model configuration
MODEL_NAME = "codellama/CodeLlama-7b-hf"
MAX_SEQ_LENGTH = 2048

print("Loading CodeLlama-7B model...")
print("This will download ~13GB, may take 5-10 minutes.\n")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", 
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=42,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("\n✅ Model loaded!")
print(f"   Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

## Step 6: Train the Model

**This will take 2-4 hours on a T4 GPU.**

You can monitor progress below. The loss should decrease over time.
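
The training length follows from the batch settings in the cell below: each optimizer step consumes `per_device_train_batch_size × gradient_accumulation_steps` examples. The sketch below works through that arithmetic. The figure of 22,500 training examples is made up for illustration and is not the size of your dataset, and the formula is an approximation rather than the trainer's exact internal count.

```python
def estimate_steps(num_examples: int, per_device_batch: int,
                   grad_accum: int, epochs: int) -> dict:
    """Approximate optimizer steps for single-GPU training."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = num_examples // effective_batch
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

# 22,500 training examples is a hypothetical figure.
plan = estimate_steps(num_examples=22_500, per_device_batch=4, grad_accum=4, epochs=3)
print(plan)

# With eval_steps = save_steps = 500, estimate how often evaluation runs.
evaluations = plan["total_steps"] // 500
print("Evaluations during training:", evaluations)
```

Halving `gradient_accumulation_steps` doubles the number of optimizer steps for the same data, so the `warmup_steps`, `eval_steps`, and `save_steps` values cover a smaller share of training.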

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Training configuration
OUTPUT_DIR = "pythoncode-lora"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    evaluation_strategy="steps",  # newer transformers releases name this argument eval_strategy
    save_strategy="steps",
    load_best_model_at_end=True,
    fp16=True,
    optim="adamw_8bit",
    weight_decay=0.01,
    max_grad_norm=1.0,
    report_to="none",
    save_total_limit=3,
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_dict['train'],
    eval_dataset=dataset_dict['validation'],
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=training_args,
    packing=False,
)

print("\n" + "="*50)
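print("🚀 Starting Training...")
print("="*50)
# One optimizer step consumes batch size x gradient accumulation examples (4 x 4 = 16).
effective_batch = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
total_steps = len(dataset_dict['train']) // effective_batch * int(training_args.num_train_epochs)
print(f"Approximate total steps: {total_steps}")
print("This will take approximately 2-4 hours.\n")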

# Start training
trainer.train()

# Save final model
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("\n" + "="*50)
print("✅ Training Complete!")
print("="*50)

## Step 7: Test the Model

Let's test with a few examples before downloading.
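
Reading the samples by eye works for a handful of prompts. For a slightly more systematic check, one option is to strip the markdown fence from each response and confirm that the remainder parses as Python. The sketch below uses only the standard library. `extract_code` and `is_valid_python` are hypothetical helpers, and the two sample responses are made up rather than real model output.

```python
import ast
import re

FENCE = "`" * 3  # three backticks

def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the whole text if there is no fence.

    A fence that is never closed (a truncated generation) is read to the end of the text.
    """
    pattern = FENCE + r"(?:python)?\n(.*?)(?:\n" + FENCE + r"|\Z)"
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

def is_valid_python(code: str) -> bool:
    """True if the code parses. The code is not executed."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

good = f"{FENCE}python\ndef factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n{FENCE}"
bad = f"{FENCE}python\ndef factorial(n:\n    return n"

good_ok = is_valid_python(extract_code(good))
bad_ok = is_valid_python(extract_code(bad))
print(good_ok, bad_ok)
```

Parsing only catches syntax errors. A response can pass this check and still compute the wrong result, so it complements manual review rather than replacing it.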

In [None]:
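# Switch the model to inference mode. for_inference() patches the model in place,
# so the return value is not assigned (some Unsloth releases return None here).
FastLanguageModel.for_inference(model)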

def generate_code(instruction: str):
    prompt = f"""### Instruction:
{instruction}

### Response:
"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
    
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract only the response
    if "### Response:" in result:
        return result.split("### Response:")[1].strip()
    return result

# Test examples
test_prompts = [
    "Create a function to calculate factorial of a number",
    "Write a function to check if a string is a palindrome",
    "Implement binary search algorithm",
]

print("\n" + "="*50)
print("Testing Model")
print("="*50 + "\n")

for prompt in test_prompts:
    print(f"📝 Instruction: {prompt}")
    print("\n💻 Generated Code:")
    print(generate_code(prompt))
    print("\n" + "-"*50 + "\n")

## Step 8: Download the Trained Model

Download the model to use on your Mac with Ollama.
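
Before downloading a large archive, it can be worth confirming that it contains what you expect. The sketch below builds a tiny stand-in directory, zips it with the same `shutil.make_archive` call pattern as the cell below, and lists the contents. `adapter_config.json` is the file PEFT writes when it saves a LoRA adapter. The directory name `demo-lora` and the file contents are placeholders.

```python
import json
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder directory standing in for the real output folder.
    model_dir = Path(tmp) / "demo-lora"
    model_dir.mkdir()
    (model_dir / "adapter_config.json").write_text(json.dumps({"r": 16}))

    # Same pattern as the notebook: archive name, format, directory to pack.
    archive_path = shutil.make_archive(str(Path(tmp) / "demo-lora"), "zip", str(model_dir))

    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

has_adapter = "adapter_config.json" in names
print(names)
print("Adapter config present:", has_adapter)
```

Because the directory is passed as the archive root, its files sit at the top level of the ZIP. Extracting the archive therefore unpacks them into the current directory rather than into a folder named after the archive.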

In [None]:
import shutil
from google.colab import files

print("Creating ZIP file...")
shutil.make_archive('pythoncode-lora', 'zip', 'pythoncode-lora')

print("Downloading model... (this may take a few minutes)")
files.download('pythoncode-lora.zip')

print("\n✅ Download complete!")
print("\nNext steps:")
print("1. Extract the ZIP file on your Mac")
print("2. Run: python export_to_ollama.py")
print("3. Run: ollama create pythoncode -f Modelfile")
print("4. Run: ollama run pythoncode")

---

## 🎉 All Done!

Your model is trained and ready to use!

### What You Have:
- ✅ Fine-tuned CodeLlama-7B model
- ✅ Trained on 25K Python code examples
- ✅ Ready to deploy with Ollama

### Usage on Your Mac:
```bash
# Extract the downloaded ZIP
unzip pythoncode-lora.zip

# Export to Ollama format
python export_to_ollama.py

# Create Ollama model
ollama create pythoncode -f Modelfile

# Use it!
ollama run pythoncode "Create a function to reverse a string"
```