Excellent question 🚀 — evaluation in **code generation** (like fine-tuning **StarCoder**) is **very different** from image classification with ViTs.

Let’s go step by step 👇

---

## 🔹 1. Nature of Task

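* Dataset = **prompt → code** pairs. StarCoder is a decoder-only (causal) LM, so each pair is trained and generated as one left-to-right sequence rather than with a separate encoder (seq2seq).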
* Output = **source code string**.
* Unlike classification (accuracy), we need metrics that capture **semantic correctness** and **syntactic quality**.

---

## 🔹 2. Common Evaluation Metrics for Code Generation

### ✅ **Perplexity (PPL)**

* Measures how well the model predicts the next token.
* Lower perplexity = better language modeling.
* Good for **training evaluation**, but doesn’t guarantee correctness of generated code.
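
To make the definition concrete: perplexity is the exponential of the average negative log-likelihood per token. Here is a minimal sketch in plain Python, assuming you already have per-token natural-log probabilities from the model. The function name `perplexity` and the toy values are illustrative and do not come from any library.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among 4 options, so its perplexity is (approximately) 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```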

---

### ✅ **BLEU / ROUGE**

* Compare n-gram overlap between generated code and reference code.
* Example: BLEU-4 is common.
* Issue → Code can be semantically identical with **different variable names/formatting**, so BLEU is not always reliable.
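
The renaming problem is easy to reproduce. The sketch below computes clipped bigram precision, which is only one ingredient of BLEU and not the full metric, on whitespace-separated tokens. The helper `ngram_precision` and the two snippets are made up for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision on whitespace tokens (one ingredient of BLEU)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total

reference = "def add ( a , b ) : return a + b"
renamed = "def add ( x , y ) : return x + y"

print(ngram_precision(reference, reference))  # identical code
print(ngram_precision(renamed, reference))    # same behaviour, renamed variables
```

Only 4 of the 11 bigrams survive the rename, even though both functions behave identically.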

---

### ✅ **CodeBLEU**

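* Extension of BLEU designed specifically for code. It combines four components:

  * N-gram match (plain BLEU)
  * Weighted n-gram match (language keywords count more)
  * **Syntax match** (AST subtree similarity) and **data-flow match** (similarity of variable dependencies)
  * More meaningful than plain BLEU, but still a similarity score against a reference, not a correctness check.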
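
As defined in the CodeBLEU paper, the final score is a weighted sum of four component scores: n-gram match, weighted n-gram match, syntax (AST) match, and data-flow match. The sketch below shows only that final combination step. The function name `combine_codebleu`, the equal weights, and the component values are assumptions for illustration, and computing the components themselves needs a language-specific parser.

```python
def combine_codebleu(ngram, weighted_ngram, syntax, dataflow,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four CodeBLEU component scores (each in [0, 1])."""
    scores = (ngram, weighted_ngram, syntax, dataflow)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(w * s for w, s in zip(weights, scores))

# Low token overlap but matching structure still earns a moderate score.
print(combine_codebleu(ngram=0.2, weighted_ngram=0.3, syntax=0.9, dataflow=1.0))
```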

---

### ✅ **Exact Match (EM)**

* % of predictions where generated code matches the reference **exactly**.
* Too strict for real-world code (different formatting still works).
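
A small sketch of why strict matching under-counts, using only the standard library. `exact_match` is the strict metric. `ast_match` is a hypothetical, Python-only relaxation (not a standard metric) that treats two snippets as equal when they parse to the same syntax tree.

```python
import ast

def exact_match(pred, ref):
    """Strict string equality after trimming surrounding whitespace."""
    return pred.strip() == ref.strip()

def ast_match(pred, ref):
    """Python-only relaxation: equal if both parse to the same AST."""
    try:
        return ast.dump(ast.parse(pred)) == ast.dump(ast.parse(ref))
    except SyntaxError:
        return False

ref = "def add(a, b):\n    return a + b"
pred = "def add(a,b):\n  return a+b  # sum"

print(exact_match(pred, ref))  # False: spacing, indentation and comment differ
print(ast_match(pred, ref))    # True: same syntax tree
```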

---

### ✅ **Execution-based Metrics**

* Run the generated code and check correctness:

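  * **Pass@k** (from the OpenAI Codex paper):

    * Generate *k* samples per prompt (in practice, *n ≥ k* samples combined with an unbiased estimator).
    * A problem counts as solved if at least 1 of the *k* samples passes all test cases.
    * The Codex paper reports pass@1, pass@10, and pass@100; pass@1 and pass@10 are the most commonly quoted.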
  * **Unit test success rate**:

    * If dataset provides test cases, run the generated code and compute % passed.

This is the **gold standard** for evaluating code generation, since ultimately we care about **working code**.
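
For reference, the Codex paper estimates pass@k from *n ≥ k* samples per problem, of which *c* pass, as `1 - C(n - c, k) / C(n, k)`, averaged over problems. Below is a minimal sketch of that estimator. The function names and the example counts are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem, c: samples that passed all tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Three problems, 10 samples each, with 0, 1 and 5 passing samples.
results = [(10, 0), (10, 1), (10, 5)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=5))
```

With `k=1` the estimate reduces to the fraction of passing samples per problem, averaged over problems.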

---

## 🔹 3. Recommended Setup for StarCoder Fine-tuning

1. **Training-time metric**:

   * **Perplexity** (to monitor loss).

2. **Static text-based metrics**:

   * **CodeBLEU** (primary),
   * **Exact Match** (secondary).

3. **Execution-based metrics**:

   * **Pass\@k** (if you can run code safely with sandboxing).
   * **Test case accuracy** (if your dataset includes unit tests).
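
For the execution-based step, one simple pattern is to write the candidate and its tests to a temporary file and run them in a separate Python process with a timeout. The sketch below assumes the tests are plain `assert` statements. The helper name `passes_tests` is hypothetical. A subprocess is not a real sandbox, so untrusted model output should still run inside a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_tests(candidate_code, test_code, timeout=5.0):
    """Run candidate + tests in a separate Python process; True if exit code is 0.

    A subprocess with a timeout only isolates crashes and infinite loops. It is
    NOT a security sandbox: untrusted code still needs a container or VM.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0

tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

print(passes_tests(good, tests))
print(passes_tests(bad, tests))
```

The per-sample booleans from a runner like this are what a pass@k computation consumes.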

---

## 🔹 4. Example in Practice

* Suppose you fine-tuned StarCoder on a dataset of `prompt → code`.
* At evaluation time:

  * Compute **perplexity** on held-out set.
  * Compute **CodeBLEU** between generated vs reference code.
  * Optionally, run generated code with test cases → compute **Pass\@1, Pass\@5**.

---

✅ So summary:

* **During training** → Loss + Perplexity
* **After training** → CodeBLEU, Exact Match, Execution metrics (Pass\@k, test success rate)

---

👉 Do you want me to draft a **PyTorch/Hugging Face evaluation script** that shows how to compute **perplexity + CodeBLEU + Pass\@k** for a StarCoder fine-tuning run?


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import evaluate

# -----------------------
# 1. Load Model & Tokenizer
# -----------------------
model_name = "bigcode/starcoder"   # replace with your fine-tuned checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# -----------------------
# 2. Load Dataset (prompt -> code)
# -----------------------
dataset = load_dataset("your_dataset_name")  # replace with your dataset
eval_data = dataset["test"].select(range(100))  # small subset for demo

# -----------------------
# 3. Perplexity
# -----------------------
def compute_perplexity(eval_data):
    # Sliding-window perplexity over the concatenated reference code.
    encodings = tokenizer("\n\n".join(eval_data["code"]), return_tensors="pt")
    max_length = model.config.n_positions  # context window size
    stride = 512
    seq_len = encodings.input_ids.size(1)

    nlls = []
    for i in range(0, seq_len, stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, seq_len)
        trg_len = end_loc - i  # number of new tokens scored in this window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # context-only tokens are ignored by the loss

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            # outputs.loss is the mean negative log-likelihood of the scored
            # tokens, so multiplying by trg_len approximates the summed NLL.
            nlls.append(outputs.loss.float() * trg_len)

    # Perplexity = exp(total NLL / number of tokens)
    ppl = torch.exp(torch.stack(nlls).sum() / seq_len)
    return ppl.item()

ppl = compute_perplexity(eval_data)
print(f"Perplexity: {ppl:.2f}")

# -----------------------
# 4. CodeBLEU
# -----------------------
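# NOTE: "codebleu" is not one of the built-in `evaluate` metrics; this line
# assumes a community CodeBLEU implementation is available under that name.
codebleu = evaluate.load("codebleu")

preds = []
refs = []

for example in eval_data:
    # Assumes the model was fine-tuned with a "Prompt: ... Code:" template;
    # pass example["prompt"] unchanged if your training data had no template.
    text = f"Prompt: {example['prompt']}\nCode:\n"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,  # greedy decoding; temperature has no effect here
            pad_token_id=tokenizer.pad_token_id,
        )
    # Decode only the newly generated tokens, not the echoed prompt,
    # so the prediction is comparable with the reference code.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    gen_code = tokenizer.decode(new_tokens, skip_special_tokens=True)
    preds.append(gen_code)
    refs.append(example["code"])

results = codebleu.compute(predictions=preds, references=refs, lang="python")  # adjust language
print("CodeBLEU:", results)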

# -----------------------
# 5. Pass@k (Execution-based, Optional)
# -----------------------
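def compute_pass_at_k(preds, refs, k=1):
    # Placeholder only: this compares strings, so it reports exact match, not
    # true pass@k. Replace the comparison with sandboxed test execution
    # (and k samples per prompt) to get a real pass@k.
    correct = 0
    for p, r in zip(preds, refs):
        if p.strip() == r.strip():
            correct += 1
    return {"pass@{}".format(k): correct / len(refs)}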

passk = compute_pass_at_k(preds, refs, k=1)
print(passk)



Perfect ⚡ let’s put everything together and write a **training script to fine-tune StarCoder** on your dataset of

```
prompt  →  code
```

pairs.
We’ll use Hugging Face **`transformers` + `datasets`** since they make this much easier.

---

# 🔹 1. Install Dependencies

```bash
pip install transformers datasets accelerate bitsandbytes
```

*(use `bitsandbytes` if you want to train with 8-bit optimizers on a single GPU)*

---

# 🔹 2. Training Script (PyTorch + Hugging Face)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset

# -----------------------
# 1. Load Dataset
# -----------------------
# Your dataset must have "prompt" and "code" fields
dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "test": "test.jsonl"
})

# Example record in train.jsonl:
# {"prompt": "Write a Python function to add two numbers", "code": "def add(a, b):\n    return a+b"}

# -----------------------
# 2. Load Model & Tokenizer
# -----------------------
model_name = "bigcode/starcoder"   # or "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# -----------------------
# 3. Preprocess Function
# -----------------------
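def preprocess(example):
    # Concatenate prompt and code into a single sequence; the EOS token at the
    # end teaches the model where a completion stops.
    text = f"Prompt: {example['prompt']}\nCode:\n{example['code']}{tokenizer.eos_token}"
    tokenized = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=512  # adjust depending on GPU memory
    )
    # Padding positions get label -100 so the loss ignores them. Mask by
    # attention_mask, because pad_token may share its id with EOS.
    tokenized["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]
    return tokenized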

tokenized_ds = dataset.map(preprocess, batched=False, remove_columns=dataset["train"].column_names)

# -----------------------
# 4. Training Arguments
# -----------------------
training_args = TrainingArguments(
    output_dir="./starcoder-finetuned",
    evaluation_strategy="epoch",     # evaluate after each epoch (renamed `eval_strategy` in recent transformers releases)
    save_strategy="epoch",           # save checkpoints each epoch
    learning_rate=5e-5,
    per_device_train_batch_size=2,   # adjust for GPU memory
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=100,
    logging_dir="./logs",
    logging_steps=20,
    save_total_limit=2,
    report_to="none",  # disable wandb unless you want logging
    fp16=True,         # use mixed precision if on GPU
    push_to_hub=False,
    load_best_model_at_end=True,
)

# -----------------------
# 5. Trainer
# -----------------------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
)

# -----------------------
# 6. Train
# -----------------------
trainer.train()

# -----------------------
# 7. Save Final Model
# -----------------------
trainer.save_model("./starcoder-finetuned")
tokenizer.save_pretrained("./starcoder-finetuned")
```

---

# 🔹 3. Dataset Format

Your dataset should look like this (`train.jsonl`):

```json
{"prompt": "Write a Python function to compute factorial", "code": "def factorial(n):\n    return 1 if n==0 else n*factorial(n-1)"}
{"prompt": "Generate SQL query to select all users older than 30", "code": "SELECT * FROM users WHERE age > 30;"}
```
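
Before training, it can help to check that every line really has both fields, since a single malformed record can break the preprocessing step. Here is a minimal sketch using only the standard library. The helper name `validate_jsonl_lines` and the sample records are illustrative.

```python
import json

REQUIRED_FIELDS = ("prompt", "code")

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) for records that would break training."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((lineno, f"missing or empty field: {field}"))
    return problems

sample = [
    '{"prompt": "Add two numbers", "code": "def add(a, b):\\n    return a + b"}',
    '{"prompt": "No code here"}',
    'not json',
]
for lineno, problem in validate_jsonl_lines(sample):
    print(lineno, problem)
```

The function accepts any iterable of strings, so an open file handle for `train.jsonl` can be passed directly.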

---

# 🔹 4. What This Does

* Concatenates **prompt + code** into a single training example.
* Trains StarCoder in **causal LM mode** (predict next token).
* Loss = **CrossEntropyLoss** on next-token prediction.
* Optimizer = **AdamW** (default in Hugging Face `Trainer`).
* You can evaluate using **perplexity** + custom metrics (CodeBLEU, Pass\@k).

---

👉 Do you want me to also add a **LoRA/PEFT version** of this (parameter-efficient fine-tuning), which is much cheaper and faster if you don’t want to fine-tune the full StarCoder (\~15B parameters)?


Great question ⚡ — let’s walk through what happens **after fine-tuning StarCoder** on your `prompt → code` dataset.

---

## 🔹 1. What the Model Learned

During fine-tuning:

* The model saw examples like:

```
Prompt: Write a Python function to compute factorial
Code:
def factorial(n):
    return 1 if n==0 else n*factorial(n-1)
```

* So it learns:

  * Given a **prompt**, it should continue generating with the **code**.
  * Essentially, the model is trained as a **prompt → completion** generator.

---

## 🔹 2. Inference (After Training)

At inference time, when a user asks something like:

```python
prompt = "Write a Python function to reverse a string"
```

We tokenize and pass it to the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "./starcoder-finetuned"  # your saved checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer(f"Prompt: {prompt}\nCode:\n", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.2,   # lower for deterministic, higher for diversity
        top_p=0.95,
        do_sample=True
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 🔹 3. Example Output

For the above prompt, the model might generate something like:

```
Prompt: Write a Python function to reverse a string
Code:
def reverse_string(s):
    return s[::-1]
```
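
Because the decoded text echoes the prompt, downstream use (running tests, computing CodeBLEU) usually wants only the part after `Code:`. The sketch below assumes the `Prompt: ... Code:` template shown above. The helper name `extract_code` is hypothetical.

```python
def extract_code(generated_text, marker="Code:\n"):
    """Return the text after the first 'Code:' marker, or the whole text if absent."""
    _, found, completion = generated_text.partition(marker)
    # Strip only newlines so the indentation of the first code line is preserved.
    return completion.strip("\n") if found else generated_text.strip("\n")

generated = (
    "Prompt: Write a Python function to reverse a string\n"
    "Code:\n"
    "def reverse_string(s):\n"
    "    return s[::-1]\n"
)
print(extract_code(generated))
```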

---

## 🔹 4. Key Points

* The **output is just text** (string containing code).
* The quality depends heavily on:

  * Size/quality of your fine-tuning dataset.
  * Whether prompts were consistent (`Prompt: ... Code:` style).
* If you trained with **only the code (no structured “Prompt: … Code:” format)**, the model may just generate raw code completions.

---

✅ So after fine-tuning:

* **Input** = natural language prompt (e.g., “Write SQL query for …”)
* **Output** = model-generated code (Python, SQL, etc., depending on your dataset).

---

👉 Do you want me to show you how to wrap this into a **Flask/FastAPI inference server**, so users can hit an endpoint with a prompt and get back generated code?
