# Module 12.1: Build a Code Assistant

**Goal**: Build a complete code completion assistant with fine-tuning

**Time**: 120 minutes

**Concepts Covered**:
- Prepare coding dataset (The Stack)
- Fine-tune SmolLM with LoRA
- Code completion API
- VS Code extension integration
- Evaluation on HumanEval
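
Preparing the dataset usually means discarding files that are unlikely to help training before any tokenization happens. The sketch below is one possible filter, not the procedure this module prescribes. It assumes each sample is a dict whose source text sits under a `content` key (as in The Stack); the function name `keep_sample` and both thresholds are illustrative choices.

```python
def keep_sample(sample, max_line_len=1000, min_alnum_frac=0.25):
    """Return True if a source file looks like hand-written code worth training on.

    The thresholds are assumptions for illustration, not tuned values.
    """
    text = sample["content"]
    if not text.strip():
        return False  # empty or whitespace-only file

    # Very long lines usually indicate minified or generated files
    longest = max(len(line) for line in text.splitlines())
    if longest > max_line_len:
        return False

    # A low share of letters and digits suggests data blobs rather than code
    alnum_frac = sum(ch.isalnum() for ch in text) / len(text)
    return alnum_frac >= min_alnum_frac


samples = [
    {"content": "def add(a, b):\n    return a + b\n"},  # ordinary code
    {"content": "x = '" + "A" * 5000 + "'\n"},           # one enormous line
    {"content": "!!!! ???? ---- ====\n"},                # no alphanumerics
    {"content": "   \n"},                                # whitespace only
]
kept = [s for s in samples if keep_sample(s)]
print(f"kept {len(kept)} of {len(samples)} samples")
```

With a Hugging Face `Dataset`, the same predicate can be passed to `dataset.filter(keep_sample)`.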

## Setup

In [None]:
!pip install torch transformers accelerate datasets matplotlib seaborn numpy -q

In [None]:
# Code Assistant Project Setup
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the Python subset of The Stack (gated dataset: accept its terms on the Hub and log in first)
# dataset = load_dataset("bigcode/the-stack", data_dir="data/python", split="train[:1%]")

print("Code Assistant Components:")
print("1. Dataset: The Stack (code dataset)")
print("2. Model: SmolLM2-1.7B (base)")
print("3. Fine-tuning: LoRA on code")
print("4. API: FastAPI server")
print("5. Evaluation: HumanEval benchmark")

# Model setup
model_name = "HuggingFaceTB/SmolLM2-1.7B"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)

print(f"\nBase model: {model_name}")
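
LoRA makes fine-tuning a model of this size practical because it trains far fewer weights than the layer it adapts. For a frozen weight matrix of shape `d_out x d_in`, LoRA learns two small factors, `B` (`d_out x r`) and `A` (`r x d_in`), so the trainable count per adapted matrix is `r * (d_in + d_out)`. The arithmetic below is a rough illustration: the layer width of 2048 and the rank of 8 are assumed values, not read from the model's config, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix (d_out x d_in)."""
    return d_in * d_out


d_in = d_out = 2048  # assumed layer width, for illustration only
rank = 8             # assumed LoRA rank

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)

print(f"LoRA params per matrix: {lora}")      # 32768
print(f"Full params per matrix: {full}")      # 4194304
print(f"Trainable fraction: {lora / full}")   # 0.0078125
```

Under these assumptions each adapted matrix trains under 1% of the weights it would need with full fine-tuning.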

In [None]:
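# Code Completion Function
def complete_code(model, tokenizer, prompt, max_tokens=100, temperature=0.2):
    """Complete code from prompt; returns the prompt followed by the generated text."""
    # Put the input tensors on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,  # must be > 0 when do_sample=True
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    # outputs[0] holds the prompt tokens plus the newly generated tokens
    completed = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return completed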

# Example
code_prompt = '''def fibonacci(n):
    """Compute the nth Fibonacci number."""
    if n <= 1:
        return n
    return'''

print("Code Completion Example:")
print(f"Prompt: {code_prompt}")
print("\n(Would generate completion here)")
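
Because `complete_code` decodes the whole output sequence, the returned string contains the prompt followed by whatever the model generated, which can run past the end of the function being completed. A minimal sketch of post-processing follows. The helper name `extract_completion` and the stop sequences are assumptions for illustration; a real assistant would likely tune the stop list per language.

```python
def extract_completion(prompt, generated,
                       stop_sequences=("\ndef ", "\nclass ", "\nif __name__", "\nprint(")):
    """Return only the newly generated text, cut at the first stop sequence."""
    # Decoding may not reproduce the prompt byte for byte, so strip it only on an exact match
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated

    # Find the earliest position where any stop sequence begins
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


prompt = "def fibonacci(n):\n    if n <= 1:\n        return n\n    return"
# Stand-in for model output: the wanted expression followed by an unrelated function
generated = prompt + " fibonacci(n - 1) + fibonacci(n - 2)\n\ndef unrelated():\n    pass\n"

completion = extract_completion(prompt, generated)
print(repr(completion))
```

An editor integration would typically insert only this trimmed completion at the cursor, while an evaluation harness would append it to the original prompt.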

In [None]:
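# HumanEval Evaluation
import subprocess
import sys

def evaluate_humaneval(model, tokenizer, problems, timeout=10):
    """Evaluate on HumanEval: one sample per problem, scored by its unit tests."""
    results = []

    for problem in problems:
        # complete_code returns the prompt followed by the generated function body
        solution = complete_code(model, tokenizer, problem["prompt"])

        # HumanEval tests define check(candidate); call it on the entry point
        program = f"{solution}\n\n{problem['test']}\n\ncheck({problem['entry_point']})\n"

        # WARNING: this executes model-generated code; run it only in a sandbox
        try:
            proc = subprocess.run(
                [sys.executable, "-"],  # read the program from stdin
                input=program,
                text=True,
                capture_output=True,
                timeout=timeout,
            )
            passed = proc.returncode == 0
        except subprocess.TimeoutExpired:
            passed = False

        results.append({
            "problem_id": problem["task_id"],
            "solution": solution,
            "passed": passed,
        })

    # Fraction of problems whose tests pass; 0.0 if no problems were given
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0

    return {
        "pass_rate": pass_rate,
        "results": results
    }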

print("HumanEval Evaluation:")
print("- 164 programming problems")
print("- Tests code generation quality")
print("- Pass rate: percentage of problems solved")
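
When a single sample is drawn per problem, the pass rate above corresponds to pass@1. HumanEval results are often reported as pass@k instead: draw `n` samples per problem, count the `c` that pass, and estimate the chance that at least one of `k` samples would pass. The sketch below uses the unbiased estimator from the paper that introduced HumanEval, `1 - C(n - c, k) / C(n, k)`. The function name and the example counts are illustrative, not results from this module.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # With fewer than k failing samples, every size-k subset contains a passing one
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems, 10 samples each
counts = [(10, 3), (10, 0), (10, 10)]
per_problem = [pass_at_k(n, c, k=1) for n, c in counts]
benchmark_score = sum(per_problem) / len(per_problem)
print(per_problem, benchmark_score)
```

The benchmark score is the mean of the per-problem estimates, so for `k=1` it reduces to the average fraction of passing samples.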

## Key Takeaways

✅ **Module Complete**

## Next Steps

Continue to the next module in the course.