# 07 - Fine-Tuning a CV Roaster (Medium Style)

This notebook fine-tunes a small language model to generate CV critiques.

## Approach
1. Generate synthetic training data using the Gemini API
2. Fine-tune DistilGPT-2 with LoRA (Parameter-Efficient Fine-Tuning)
3. Compare: Base Model vs Fine-Tuned vs Gemini

## Hardware Requirements
- CPU-only compatible (no GPU required)
- 8GB RAM sufficient

---

In [2]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime
from tqdm import tqdm
import time
import sys
sys.path.append('..')
import google.generativeai as genai

# Hugging Face
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, PeftModel, TaskType
from datasets import Dataset
import torch

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Setup

In [None]:
# Load API key from config.py
from config import GEMINI_API_KEY
genai.configure(api_key=GEMINI_API_KEY)
print("API key loaded from config.py")

API key loaded from config.py


## Load Data and Helper Functions

In [4]:
# Load dataset
df = pd.read_csv('../data/resume_data.csv')

# Load test CV indices
with open('../data/test_cv_indices.json', 'r') as f:
    test_data = json.load(f)
    test_cv_indices = test_data['indices']

print(f"Loaded {len(df)} resumes")
print(f"Test CVs: {test_cv_indices}")

Loaded 9544 resumes
Test CVs: [0, 1]


In [5]:
# CV formatting function (from EDA notebook)
def format_cv_for_llm(resume_row):
    """
    Format a resume row into a readable text for LLM processing.
    """
    cv_text = []
    
    if pd.notna(resume_row.get('career_objective')):
        cv_text.append(f"CAREER OBJECTIVE:\n{resume_row['career_objective']}")
    
    if pd.notna(resume_row.get('skills')):
        cv_text.append(f"\nSKILLS:\n{resume_row['skills']}")
    
    education_parts = []
    if pd.notna(resume_row.get('educational_institution_name')):
        education_parts.append(f"Institution: {resume_row['educational_institution_name']}")
    if pd.notna(resume_row.get('degree_names')):
        education_parts.append(f"Degree: {resume_row['degree_names']}")
    if pd.notna(resume_row.get('major_field_of_studies')):
        education_parts.append(f"Major: {resume_row['major_field_of_studies']}")
    if pd.notna(resume_row.get('passing_years')):
        education_parts.append(f"Year: {resume_row['passing_years']}")
    
    if education_parts:
        cv_text.append(f"\nEDUCATION:\n" + "\n".join(education_parts))
    
    work_parts = []
    if pd.notna(resume_row.get('professional_company_names')):
        work_parts.append(f"Company: {resume_row['professional_company_names']}")
    if pd.notna(resume_row.get('positions')):
        work_parts.append(f"Position: {resume_row['positions']}")
    if pd.notna(resume_row.get('start_dates')):
        # Build the period on one line so "to <end>" is not split onto its own line
        period = f"Period: {resume_row['start_dates']}"
        if pd.notna(resume_row.get('end_dates')):
            period += f" to {resume_row['end_dates']}"
        work_parts.append(period)
    if pd.notna(resume_row.get('responsibilities')):
        work_parts.append(f"Responsibilities:\n{resume_row['responsibilities']}")
    
    if work_parts:
        cv_text.append(f"\nWORK EXPERIENCE:\n" + "\n".join(work_parts))
    
    if pd.notna(resume_row.get('languages')):
        cv_text.append(f"\nLANGUAGES:\n{resume_row['languages']}")
    
    if pd.notna(resume_row.get('certification_skills')):
        cv_text.append(f"\nCERTIFICATIONS:\n{resume_row['certification_skills']}")
    
    return "\n".join(cv_text)

## Medium Roaster Prompt (Same as 03_medium_roaster)

In [6]:
MEDIUM_SYSTEM_PROMPT = """You are an experienced hiring manager who provides direct, honest CV feedback.

Your approach:
1. Be direct and honest - no sugarcoating
2. Point out obvious flaws and red flags
3. Call out generic buzzwords and filler content
4. Be professional but don't hold back the truth
5. Focus on what actually matters to employers

Keep your feedback:
- Brutally honest but professional
- Direct about weaknesses
- Critical of vague or generic content
- Focused on real-world hiring standards

Structure your response:
FIRST IMPRESSION: What stands out (good or bad)
MAJOR ISSUES: Glaring problems that need fixing
CONCERNS: Things that raise questions
WHAT WORKS: Brief acknowledgment of strengths
BOTTOM LINE: Final verdict and priority fixes
"""

def roast_cv_gemini(cv_text, temperature=0.7, model_name="gemini-2.0-flash"):
    """
    Generate CV critique using Gemini.
    """
    model = genai.GenerativeModel(
        model_name=model_name,
        generation_config=genai.GenerationConfig(
            temperature=temperature,
            top_p=0.95,
            top_k=40,
            max_output_tokens=1024,
        )
    )
    
    prompt = f"{MEDIUM_SYSTEM_PROMPT}\n\nReview this CV with honest, direct feedback:\n\n{cv_text}"
    
    response = model.generate_content(prompt)
    return response.text

---

## Part 1: Generate Synthetic Training Data

Create (CV, critique) pairs using Gemini API.

In [7]:
# Configuration
SYNTHETIC_DATA_PATH = Path('../data/fine_tuning_dataset.json')
MODEL_OUTPUT_DIR = Path('../models/medium_roaster_lora')
NUM_TRAINING_SAMPLES = 100
DELAY_BETWEEN_CALLS = 1.0

In [24]:
def generate_synthetic_dataset(df, num_samples=NUM_TRAINING_SAMPLES):
    """
    Generate (CV, critique) pairs for fine-tuning.
    """
    dataset = []
    
    # Randomly sample CVs (excluding test CVs)
    available_indices = [i for i in range(len(df)) if i not in test_cv_indices]
    sample_indices = np.random.choice(available_indices, min(num_samples, len(available_indices)), replace=False)
    
    print(f"Generating {len(sample_indices)} critique pairs...")
    print(f"Estimated time: {len(sample_indices) * DELAY_BETWEEN_CALLS / 60:.1f} minutes")
    
    for i, idx in enumerate(tqdm(sample_indices)):
        cv_text = format_cv_for_llm(df.iloc[idx])
        
        if not cv_text or len(cv_text) < 100:
            continue
        
        # Truncate very long CVs
        cv_text = cv_text[:3000]
        
        # Generate critique with Gemini
        try:
            critique = roast_cv_gemini(cv_text)
            dataset.append({
                "cv": cv_text,
                "critique": critique,
                "source_idx": int(idx)
            })
        except Exception as e:
            print(f"Error on CV {idx}: {e}")
        
        time.sleep(DELAY_BETWEEN_CALLS)
        
        # Save progress every 20 samples
        if (i + 1) % 20 == 0:
            print(f"\nCheckpoint: {len(dataset)} pairs saved")
            with open(SYNTHETIC_DATA_PATH, 'w', encoding='utf-8') as f:
                json.dump(dataset, f, indent=2, ensure_ascii=False)
    
    return dataset

In [25]:
# Check if dataset already exists, otherwise generate
if SYNTHETIC_DATA_PATH.exists():
    print(f"Loading existing dataset from {SYNTHETIC_DATA_PATH}")
    with open(SYNTHETIC_DATA_PATH, 'r', encoding='utf-8') as f:
        synthetic_data = json.load(f)
    print(f"Loaded {len(synthetic_data)} pairs")
else:
    print("Generating new synthetic dataset...")
    synthetic_data = generate_synthetic_dataset(df, NUM_TRAINING_SAMPLES)
    
    # Save final dataset
    with open(SYNTHETIC_DATA_PATH, 'w', encoding='utf-8') as f:
        json.dump(synthetic_data, f, indent=2, ensure_ascii=False)
    print(f"\nSaved {len(synthetic_data)} pairs to {SYNTHETIC_DATA_PATH}")

Loading existing dataset from ..\data\fine_tuning_dataset.json
Loaded 100 pairs


In [26]:
# Preview dataset
print(f"Dataset size: {len(synthetic_data)} pairs")
print(f"\nSample CV (truncated):")
print("="*80)
print(synthetic_data[0]['cv'][:500] + "...")
print(f"\nSample Critique:")
print("="*80)
print(synthetic_data[0]['critique'][:500] + "...")

Dataset size: 100 pairs

Sample CV (truncated):

SKILLS:
['Periodic financial reporting expert', 'General ledger accounting skills', 'Invoice coding familiarity', 'Strong communication skills', 'Complex problem solving', 'Account reconciliation expert', 'Organization', 'Time Management', 'Adaptability', 'Communication']

EDUCATION:
Institution: ['University of Greenwich', 'Oshwal College']
Degree: ['Bachelor of Arts', 'Association of Business Executive']
Major: ['Business Studies', 'Business']
Year: ['2014', '2013']

WORK EXPERIENCE:
Company:...

Sample Critique:
Okay, let's rip this CV apart. Here's the brutally honest truth:

**FIRST IMPRESSION:** This CV screams "entry-level and desperate." The formatting is basic, the skills section is a buzzword bingo, and the work experience section is a laundry list of generic tasks. It looks like a template someone filled out without putting any real thought into it.

**MAJOR ISSUES:**

*   **Skills Section - Useless:** This is a collection of 

---

## Part 2: Fine-Tune DistilGPT-2 with LoRA

Using Parameter-Efficient Fine-Tuning (PEFT) to train on CPU.

In [27]:
# Load model
MODEL_NAME = "distilgpt2"

print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Set padding token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

print(f"Model parameters: {model.num_parameters():,}")

Loading distilgpt2...
Model parameters: 81,912,576


In [28]:
# Configure LoRA - only trains ~0.5% of parameters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 405,504 || all params: 82,318,080 || trainable%: 0.4926
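
The trainable-parameter count can be reproduced by hand. The sketch below assumes the usual DistilGPT-2 shapes (6 transformer blocks, hidden size 768, MLP inner size 3072) and assumes that the `"c_proj"` target matches both the attention output projection and the MLP down-projection in each block; the reported total is consistent with that reading.

```python
# Assumed DistilGPT-2 shapes (not printed anywhere in this notebook)
N_LAYERS, HIDDEN, MLP_INNER, RANK = 6, 768, 3072, 8

def lora_params(in_features, out_features, r=RANK):
    """LoRA adds two matrices per adapted layer: A (r x in) and B (out x r)."""
    return r * in_features + out_features * r

per_block = (
    lora_params(HIDDEN, 3 * HIDDEN)    # attn.c_attn: fused query/key/value projection
    + lora_params(HIDDEN, HIDDEN)      # attn.c_proj: attention output projection
    + lora_params(MLP_INNER, HIDDEN)   # mlp.c_proj: MLP down-projection
)
trainable = N_LAYERS * per_block
total = 81_912_576 + trainable  # base parameter count printed when the model was loaded

print(f"trainable: {trainable:,} | all: {total:,} | trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the figures agree with the `print_trainable_parameters()` output above.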


In [29]:
# Format training data
def format_training_example(cv, critique):
    """Format a (CV, critique) pair for training."""
    return f"### CV:\n{cv[:1500]}\n\n### Critique:\n{critique}\n\n### END"

formatted_texts = [
    format_training_example(item['cv'], item['critique'])
    for item in synthetic_data
]

print(f"Formatted {len(formatted_texts)} training examples")

Formatted 100 training examples


In [30]:
# Tokenize dataset
MAX_LENGTH = 512

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
    )

train_dataset = Dataset.from_dict({"text": formatted_texts})
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
train_dataset = train_dataset.map(lambda x: {"labels": x["input_ids"]})

# Split train/eval
split = train_dataset.train_test_split(test_size=0.1, seed=42)
train_data = split["train"]
eval_data = split["test"]

print(f"Train: {len(train_data)}, Eval: {len(eval_data)}")

Train: 90, Eval: 10
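
Each training example places up to 1,500 characters of CV text before the critique, and the whole string is truncated to `MAX_LENGTH = 512` tokens, so a long CV may leave little room for the critique the model is meant to learn. A rough way to estimate the remaining budget is sketched below. `surviving_critique_tokens` is a hypothetical helper, and `encode` is any callable that returns a list of tokens; with the real tokenizer one would pass `tokenizer.encode`. The demo uses whitespace splitting as a stand-in, so its numbers are illustrative only.

```python
def surviving_critique_tokens(cv, critique, encode, max_length=512, cv_chars=1500):
    """Estimate how many critique tokens remain after right-truncation to max_length."""
    # Same prefix layout as format_training_example
    prefix = f"### CV:\n{cv[:cv_chars]}\n\n### Critique:\n"
    prefix_len = len(encode(prefix))
    critique_len = len(encode(critique))
    return max(0, min(critique_len, max_length - prefix_len))

# Stand-in tokenizer: one token per whitespace-separated word
whitespace_encode = str.split

long_cv = "skill " * 300        # 1,800 characters, sliced to 1,500 inside the helper
long_critique = "issue " * 400  # 400 "tokens"
kept = surviving_critique_tokens(long_cv, long_critique, whitespace_encode)
print(f"{kept} of {len(whitespace_encode(long_critique))} critique tokens survive")
```

Running the same check over `synthetic_data` with the real tokenizer would show how much of each critique actually reaches the loss.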


In [25]:
# Training arguments (CPU optimized)
training_args = TrainingArguments(
    output_dir=str(MODEL_OUTPUT_DIR),
    overwrite_output_dir=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=100,
    save_total_limit=2,
    use_cpu=True,
    fp16=False,
    dataloader_num_workers=0,
    report_to="none",
    load_best_model_at_end=True,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=data_collator,
)

print("Trainer ready")

Trainer ready


In [27]:
# Train!
print("Starting training...")
print("Expected time: 30-60 minutes on CPU")
print("="*80)

trainer.train()

Starting training...
Expected time: 30-60 minutes on CPU


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50 | 3.5038 | 3.4081 |


TrainOutput(global_step=69, training_loss=3.666443728018498, metrics={'train_runtime': 15437.7317, 'train_samples_per_second': 0.017, 'train_steps_per_second': 0.004, 'total_flos': 35611402567680.0, 'train_loss': 3.666443728018498, 'epoch': 3.0})
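
The step count in `TrainOutput` follows from the training arguments, and working it out shows how the schedule settings relate to such a short run. The sketch below assumes that a trailing partial gradient-accumulation group still counts as an optimizer step, which matches the logged `global_step=69`; it counts only periodic step-based evaluations and saves.

```python
import math

train_examples = 90          # size of train_data after the 90/10 split
per_device_batch = 1
grad_accum = 4
epochs = 3
warmup_steps, eval_steps, save_steps = 50, 50, 100

# Assumption: the last, partially filled accumulation group still triggers a step
steps_per_epoch = math.ceil(train_examples / (per_device_batch * grad_accum))
total_steps = steps_per_epoch * epochs

warmup_fraction = warmup_steps / total_steps
evals_run = total_steps // eval_steps          # periodic evaluations during training
checkpoints_saved = total_steps // save_steps  # periodic step-based checkpoints

print(f"steps/epoch={steps_per_epoch}, total={total_steps}")
print(f"warmup covers {warmup_fraction:.0%} of training")
print(f"evaluations={evals_run}, checkpoints={checkpoints_saved}")
```

With these settings most of the run is spent in learning-rate warmup, only one evaluation is logged, and no step-based checkpoint is written before training ends, so `load_best_model_at_end` has nothing to choose from.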

In [31]:
# Save the fine-tuned model
MODEL_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
model.save_pretrained(MODEL_OUTPUT_DIR)
tokenizer.save_pretrained(MODEL_OUTPUT_DIR)

print(f"Model saved to {MODEL_OUTPUT_DIR}")

Model saved to ..\models\medium_roaster_lora


---

## Part 3: Test & Compare Models

Compare three models:
1. **Base DistilGPT-2** (before fine-tuning)
2. **Fine-tuned DistilGPT-2** (after LoRA training)
3. **Gemini** (reference baseline)

In [8]:
# Load both models for comparison
print("Loading models for comparison...")

# Base model (no fine-tuning)
base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")
base_model.eval()

# Fine-tuned model (with LoRA)
ft_base = AutoModelForCausalLM.from_pretrained("distilgpt2")
fine_tuned_model = PeftModel.from_pretrained(ft_base, MODEL_OUTPUT_DIR)
fine_tuned_model.eval()

print("Both models loaded")

Loading models for comparison...
Both models loaded


In [9]:
def roast_cv_local(model, cv_text, max_new_tokens=75):
    """
    Generate a critique using a local model (base or fine-tuned).
    Uses an instruction-style prompt rather than the "### CV:" / "### Critique:"
    template the model was fine-tuned on.
    """
    # Keep the CV short so the full prompt fits in the 400-token limit below
    cv_short = cv_text[:800]
    
    prompt = f"You are a hiring manager who criticizes CVs. Criticize this CV:\n\n{cv_short} END OF THE CV.\n\n Now your Critique: I as a hiring manager think this CV is bad because"
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=400)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.5,  # Lower temp = less random
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract critique part
    if "Critique:" in generated:
        critique = generated.split("Critique:")[1]
        return critique.strip()
    
    return generated
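
The fine-tuning examples use a `### CV:` / `### Critique:` / `### END` template, while `roast_cv_local` prompts with a free-form instruction. One variation that may be worth trying is to prompt the fine-tuned model in the same template it saw during training. The sketch below covers only the string handling; `build_training_style_prompt` and `extract_critique` are hypothetical helpers, not functions used elsewhere in this notebook, and the generated text is faked so the example runs without a model.

```python
def build_training_style_prompt(cv_text, cv_chars=1500):
    """Build an inference prompt with the same layout as format_training_example."""
    return f"### CV:\n{cv_text[:cv_chars]}\n\n### Critique:\n"

def extract_critique(generated, prompt, end_marker="### END"):
    """Drop the echoed prompt, then keep everything before the end marker."""
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    return completion.split(end_marker, 1)[0].strip()

prompt = build_training_style_prompt("SKILLS:\nPython, SQL")
# Faked model output: the prompt, a critique, the end marker, then runaway text
fake_output = prompt + "FIRST IMPRESSION: Too thin.\n\n### END\n### CV:"
critique = extract_critique(fake_output, prompt)
print(critique)
```

If this prompt is tokenized with `truncation=True`, the CV slice should be short enough that the trailing `### Critique:` cue is not cut off.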

In [10]:
# Test on first test CV
test_cv = format_cv_for_llm(df.iloc[test_cv_indices[0]])

print("TEST CV:")
print("="*80)
print(test_cv[:500] + "...")
print("="*80)

TEST CV:
CAREER OBJECTIVE:
Big data analytics working and database warehouse manager with robust experience in handling all kinds of data. I have also used multiple cloud infrastructure services and am well acquainted with them. Currently in search of role that offers more of development.

SKILLS:
['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++', 'Data Structures', 'DBMS', 'RDBMS', 'Informatica', 'Talend...


In [13]:
print("\n" + "="*80)
print("1. BASE MODEL (no fine-tuning):")
print("="*80)
base_critique = roast_cv_local(base_model, test_cv)
print(base_critique)


1. BASE MODEL (no fine-tuning):
I as a hiring manager think this CV is bad because it is a good one.
CAREER OBJECTIVE:
Big data analytics working and database warehouse manager with robust experience in handling all kinds of data. I have also used multiple cloud infrastructure services and am well acquainted with them. Currently in search of role that offers more of development.
SKILLS:
['Big Data', 'Hive', 'Python


In [17]:
print("\n" + "="*80)
print("2. FINE-TUNED MODEL (after LoRA training):")
print("="*80)
ft_critique = roast_cv_local(fine_tuned_model, test_cv)
print(ft_critique)


2. FINE-TUNED MODEL (after LoRA training):
I as a hiring manager think this CV is bad because it lacks a clear understanding of what it actually means.


In [18]:
print("\n" + "="*80)
print("3. GEMINI (reference):")
print("="*80)
gemini_critique = roast_cv_gemini(test_cv)
print(gemini_critique)


3. GEMINI (reference):
Okay, here's my brutally honest assessment of this CV:

**FIRST IMPRESSION:** This CV screams "generic and underwhelming." It's a laundry list of buzzwords with little substance to back them up. The formatting is basic, and the content lacks impact.

**MAJOR ISSUES:**

*   **Career Objective:** This is a terrible opening. It's vague, contains grammatical errors ("working" should be "worker" or removed entirely), and doesn't sell you at all. Saying you're "in search of a role that offers more of development" is weak. What kind of development? Why should a company invest in you for that? This needs a complete rewrite to showcase your value proposition and target a specific role.
*   **Skills Section:** This is just a keyword dump. Listing a bunch of technologies without demonstrating proficiency is pointless. Anyone can copy and paste these terms. There's no indication of your skill level or how you've applied these skills in real-world projects.
*   **Work Experi

## Quantitative Evaluation

Using an LLM judge (same approach as notebook 05).

In [19]:
JUDGE_PROMPT = """You are an expert evaluator of CV critique quality.

Evaluate this CV critique on the following criteria (score 1-10 for each):

1. **Specificity**: How specific and actionable is the feedback?
2. **Relevance**: How relevant are the points to actual CV improvement?
3. **Coherence**: Is the critique coherent and well-structured?
4. **Completeness**: Does it cover important aspects of the CV?
5. **Overall Usefulness**: How useful would this be to the job seeker?

Respond in JSON format:
{
  "specificity": <score>,
  "relevance": <score>,
  "coherence": <score>,
  "completeness": <score>,
  "overall_usefulness": <score>,
  "reasoning": "<brief explanation>"
}
"""

def evaluate_critique_with_llm(critique_text, model_type, cv_text):
    """
    Use LLM to evaluate critique quality.
    """
    model = genai.GenerativeModel(
        model_name="gemini-2.0-flash",
        generation_config=genai.GenerationConfig(
            temperature=0.2,  # Low temperature for consistent evaluation
        )
    )
    
    prompt = f"""{JUDGE_PROMPT}

Model Type: {model_type}

Original CV (excerpt):
{cv_text[:500]}...

Critique to Evaluate:
{critique_text}
"""
    
    try:
        response = model.generate_content(prompt)
        text = response.text
        # Extract the outermost JSON object from the response
        start = text.find('{')
        end = text.rfind('}') + 1
        if start != -1 and end > start:
            return json.loads(text[start:end])
        print("Error evaluating: no JSON object found in response")
    except Exception as e:
        print(f"Error evaluating: {e}")
    return None

print("Evaluating all three critiques with LLM judge...")
print("This may take ~30 seconds...\n")

# Evaluate all three models
evaluations = []

for model_name, critique in [('Base', base_critique), ('Fine-Tuned', ft_critique), ('Gemini', gemini_critique)]:
    print(f"Evaluating {model_name}...")
    eval_result = evaluate_critique_with_llm(critique, model_name, test_cv)
    
    if eval_result:
        eval_result['model'] = model_name
        evaluations.append(eval_result)
        print(f"  ✓ {model_name} evaluated")

print(f"\n✓ Completed {len(evaluations)} evaluations")

Evaluating all three critiques with LLM judge...
This may take ~30 seconds...

Evaluating Base...
  ✓ Base evaluated
Evaluating Fine-Tuned...
  ✓ Fine-Tuned evaluated
Evaluating Gemini...
  ✓ Gemini evaluated

✓ Completed 3 evaluations
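
The judge's reply is parsed by slicing from the first `{` to the last `}`, which works when the reply is well formed but does not check that the scores are present or in range. A stricter parser might look like the sketch below. `parse_judge_reply` and `SCORE_KEYS` are hypothetical names, and the sample replies are made up for illustration.

```python
import json

SCORE_KEYS = ("specificity", "relevance", "coherence", "completeness", "overall_usefulness")

def parse_judge_reply(text):
    """Return the judge's scores as a dict, or None if the reply is not usable."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    for key in SCORE_KEYS:
        value = parsed.get(key)
        # Reject missing, non-numeric (including bool), or out-of-range scores
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 1 <= value <= 10:
            return None
    return parsed

# Made-up reply wrapped in a Markdown fence, as chat models often do
sample = '```json\n{"specificity": 1, "relevance": 2, "coherence": 1, ' \
         '"completeness": 1, "overall_usefulness": 1, "reasoning": "Too vague."}\n```'
scores = parse_judge_reply(sample)
print(scores["relevance"] if scores else "unusable reply")
```

Rejecting malformed replies explicitly makes it easier to tell a judge failure apart from a genuinely low score.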


In [20]:
# Display evaluation results
if evaluations:
    df_eval = pd.DataFrame(evaluations)
    
    print("\n" + "="*80)
    print("EVALUATION SCORES (LLM JUDGE)")
    print("="*80)
    
    score_cols = ['specificity', 'relevance', 'coherence', 'completeness', 'overall_usefulness']
    
    # Display table
    def score_column(model_name):
        """Format one model's scores as 'x.x/10' strings, or 'N/A' if it was not evaluated."""
        rows = df_eval[df_eval['model'] == model_name]
        if rows.empty:
            return ["N/A"] * len(score_cols)
        return [f"{rows[col].values[0]:.1f}/10" for col in score_cols]

    comparison_df = pd.DataFrame({
        'Metric': ['Specificity', 'Relevance', 'Coherence', 'Completeness', 'Overall Usefulness'],
        'Base (before)': score_column('Base'),
        'Fine-Tuned (after)': score_column('Fine-Tuned'),
        'Gemini (reference)': score_column('Gemini'),
    })
    
    print(comparison_df.to_string(index=False))
    
    # Calculate averages
    df_eval['average_score'] = df_eval[score_cols].mean(axis=1)
    
    print("\n" + "="*80)
    print("AVERAGE SCORES")
    print("="*80)
    for _, row in df_eval.iterrows():
        print(f"{row['model']:15s}: {row['average_score']:.2f}/10")
    
    # Show reasoning
    print("\n" + "="*80)
    print("LLM JUDGE REASONING")
    print("="*80)
    for _, row in df_eval.iterrows():
        print(f"\n{row['model']}:")
        print(f"  {row['reasoning']}")
else:
    print("No evaluation results available")


EVALUATION SCORES (LLM JUDGE)
            Metric Base (before) Fine-Tuned (after) Gemini (reference)
       Specificity        1.0/10             1.0/10             9.0/10
         Relevance        1.0/10             2.0/10            10.0/10
         Coherence        1.0/10             1.0/10             9.0/10
      Completeness        1.0/10             1.0/10             9.0/10
Overall Usefulness        1.0/10             1.0/10            10.0/10

AVERAGE SCORES
Base           : 1.00/10
Fine-Tuned     : 1.20/10
Gemini         : 9.40/10

LLM JUDGE REASONING

Base:
  The critique is nonsensical and provides no actionable feedback. It contradicts itself by saying the CV is both good and bad. It also abruptly cuts off the skills section. It offers no specific suggestions for improvement.

Fine-Tuned:
  The critique is extremely vague and unhelpful. It states the CV is "bad" and lacks "a clear understanding of what it actually means" without providing any specific examples or actionab
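
The averages in the output above are plain means of the five criterion scores, so they can be checked by hand. The sketch below copies the scores from the table in the order Specificity, Relevance, Coherence, Completeness, Overall Usefulness.

```python
# Scores copied from the evaluation table above
scores = {
    "Base": [1, 1, 1, 1, 1],
    "Fine-Tuned": [1, 2, 1, 1, 1],
    "Gemini": [9, 10, 9, 9, 10],
}

averages = {model: sum(values) / len(values) for model, values in scores.items()}
gain = round(averages["Fine-Tuned"] - averages["Base"], 2)

for model, avg in averages.items():
    print(f"{model:15s}: {avg:.2f}/10")
print(f"Fine-tuning gain: {gain:.2f} points")
```

The whole gain for the fine-tuned model comes from a single one-point difference on Relevance, measured on one test CV.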

## Summary & Conclusion

### What We Learned

This notebook demonstrated several important ML engineering techniques:

**Technical Skills Practiced:**
- **Synthetic Data Generation**: Used Gemini API to generate 100 (CV, critique) training pairs
- **Parameter-Efficient Fine-Tuning**: Implemented LoRA to train only 0.49% of model parameters (405K out of 82M)
- **CPU-Only Training**: Fine-tuned on CPU without a GPU; the three-epoch run took about 4.3 hours (15,438 s), well beyond the 30-60 minutes estimated up front
- **Model Comparison**: Systematically compared Base vs Fine-Tuned vs Gemini models

**Process Insights:**
- Learned how to create domain-specific training data from a large language model
- Understood trade-offs between model size, quality, and computational requirements
- Practiced prompt engineering for both training and inference
- Implemented consistent evaluation methodology across notebooks

---

### Key Finding: Model Size Matters

**DistilGPT-2 (82M parameters) is too small for this complex task.**

| Model | Parameters | Performance | Quality Assessment |
|-------|-----------|-------------|-------------------|
| **Base DistilGPT-2** | 82M | Poor | Incoherent, repetitive output |
| **Fine-Tuned DistilGPT-2** | 82M | Minimal improvement | Slightly better but still unusable |
| **Gemini 2.0 Flash** | Undisclosed | Excellent | Professional, actionable feedback |

**Why DistilGPT-2 Failed:**
1. **Model capacity too limited**: CV critique requires understanding context, professional norms, and constructive feedback patterns
2. **Task complexity**: Analyzing CVs demands domain knowledge that small models can't capture
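3. **Training data insufficient**: 100 examples (90 after the train/eval split) can't compensate for lack of base capabilities
4. **High training loss**: The average training loss was 3.67, with 3.50 (training) and 3.41 (validation) at step 50, so the model struggled to learn even basic patterns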
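
To put the loss values in more intuitive terms, a cross-entropy loss can be converted to perplexity by exponentiating it. The sketch below assumes the reported losses are mean per-token cross-entropy in natural-log units, which is the usual convention for causal language model training; `perplexity` is a helper defined here for illustration.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)

avg_train_ppl = perplexity(3.666443728018498)  # average training loss from TrainOutput
val_ppl = perplexity(3.4081)                   # validation loss logged at step 50

print(f"Average training perplexity: {avg_train_ppl:.1f}")
print(f"Validation perplexity at step 50: {val_ppl:.1f}")
```

Under that assumption the values come out at roughly 39 and 30, meaning the model is about as uncertain as if it were choosing among a few dozen equally likely next tokens.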

**Evidence from LLM Judge Scores (one test CV):**
- Base Model: 1.0/10 average score
- Fine-Tuned Model: 1.2/10 average score (minimal improvement)
- Gemini: 9.4/10 average score

**Conclusion:**
For complex, nuanced tasks like CV critique, **larger models (1B+ parameters) or cloud-based LLMs are necessary**. Small models like DistilGPT-2 work well for simple classification or text completion, but fail at tasks requiring reasoning, domain expertise, and structured feedback generation.

---

### Future Improvements

If pursuing local fine-tuning further:
- Use larger models (e.g., GPT-2 Medium/Large, Llama 3 8B)
- Generate more training data (500-1000 examples)
- Consider cloud GPU for faster training

**Recommendation:** For production use, stick with cloud-based LLMs that have the capacity for this task complexity. 