# 🎨 Introduction to Model Customization

## 🛒 The Zava Scenario

Cora is working well with a base model, but Zava wants it to provide responses that are more aligned with their brand voice, home improvement terminology, and customer service guidelines. Generic models may not capture the specific nuances of home improvement retail.

**The Opportunity**: Instead of relying solely on prompt engineering, we can customize the model itself through techniques like fine-tuning and distillation to make Cora more specialized for Zava's home improvement retail needs.

## What You'll Learn

In this section, you'll understand:

1. **Few-shot prompting** - Teaching models through examples in the prompt
2. **Supervised Fine-Tuning (SFT)** - Training models on domain-specific data
3. **Distillation** - Transferring knowledge from larger to smaller models
4. **Data preparation for fine-tuning** - JSONL format and best practices
5. **When to use each customization technique** - Trade-offs and use cases

## Why This Matters

Model customization enables you to:
- **Improve response quality** for domain-specific tasks
- **Reduce costs** by using smaller, specialized models
- **Align outputs** with brand voice and guidelines
- **Handle specialized terminology** unique to your business

Let's explore how to customize models to make Cora more effective for Zava's home improvement retail business.

---

## Why Customize Models?

Base models (GPT-4o, GPT-4o-mini) are trained on general internet data. They're powerful but generic.

**Problems with base models:**

### 1. Inconsistent Tone/Style

**Customer:** "What paint do you have?"

**Base Model:** "We have various paint options including interior, exterior, latex, oil-based..."

**Custom Model (Zava style):** "Great question! We have several excellent paint options for your project. Let me help you find the perfect match..."

The custom model maintains Zava's friendly, helpful brand voice consistently.

### 2. Domain Knowledge Gaps

**Customer:** "I need paint for T1-11 siding"

**Base Model:** "I can help you find paint. What color are you looking for?"

**Custom Model:** "For T1-11 siding, you'll want a high-quality exterior acrylic latex paint with good penetration. Our Premium Exterior Paint (PFIP000002) is perfect for this application..."

The custom model understands hardware/construction terminology.

### 3. Long Prompts

**Without customization:** Need to include examples in every prompt
```
Prompt (500 tokens):
"You are Cora, a helpful Zava assistant. Examples:
Q: What paint? A: [example]
Q: What drill? A: [example]
...
Now answer: What paint do you have?"
```

**With customization:** Model "knows" the style already
```
Prompt (50 tokens):
"You are Cora. Customer asks: What paint do you have?"
```

**Result:** In this illustration the prompt shrinks from about 500 tokens to about 50, roughly a 90% reduction in prompt tokens, which lowers costs at scale.
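To make the saving concrete, the arithmetic can be sketched as follows. The 500- and 50-token prompt sizes come from the example above; the daily query volume is a placeholder assumption, and the helper names are hypothetical.

```python
def prompt_token_savings(long_prompt_tokens: int, short_prompt_tokens: int) -> float:
    """Return the fraction of prompt tokens saved per request."""
    return 1 - short_prompt_tokens / long_prompt_tokens


def tokens_saved_per_day(long_prompt_tokens: int, short_prompt_tokens: int,
                         queries_per_day: int) -> int:
    """Return the number of prompt tokens no longer sent each day."""
    return (long_prompt_tokens - short_prompt_tokens) * queries_per_day


# Prompt sizes taken from the illustration above
savings = prompt_token_savings(500, 50)

# Hypothetical volume, for illustration only
saved = tokens_saved_per_day(500, 50, queries_per_day=10_000)

print(f"Prompt tokens saved per request: {savings:.0%}")
print(f"Prompt tokens saved per day: {saved:,}")
```

Multiplying the daily figure by the current per-token price gives a rough estimate of the saving in currency terms.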

### 4. Response Consistency

**Base model responses vary:**
- Query 1: Formal and technical
- Query 2: Casual and brief  
- Query 3: Overly detailed

**Custom model:** Consistent tone, structure, and detail level across all responses

## Customization Approaches

There are three main ways to customize model behavior:

| Approach | What It Does | Best For | Cost | Effort |
|----------|--------------|----------|------|--------|
| **Few-Shot Prompting** | Include examples in prompt | Quick testing, dynamic examples | Low (pay per token) | Low |
| **Fine-Tuning** | Train model on your data | Consistent style, domain knowledge | Medium (one-time training) | Medium |
| **Distillation** | Transfer knowledge from larger model | Cost optimization, efficiency | Low-Medium | High |

We'll explore each in detail.

---

## Few-Shot Prompting

**Few-shot prompting** means including example query-response pairs in your prompt to guide the model's behavior.

### How It Works

```python
prompt = """
You are Cora, a helpful Zava Hardware assistant.

Example 1:
Customer: What paint do you have?
Cora: Great question! We have several excellent paint options. For interior 
projects, I recommend our Premium Interior Paint. For exterior, our Premium 
Exterior Paint is weather-resistant and durable. What's your project?

Example 2:
Customer: Is PFIP000002 in stock?
Cora: Yes! Premium Exterior Paint (SKU: PFIP000002) is currently in stock 
with 75 units available. Would you like me to help you with anything else?

Now answer:
Customer: {user_query}
Cora:
"""
```

**The model learns from examples** and mimics the style.
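With a chat-style API, the same idea is often expressed as prior conversation turns rather than one long string. The sketch below assumes that approach; `build_few_shot_messages` is a hypothetical helper, and the example texts are the ones from the prompt above.

```python
def build_few_shot_messages(system_prompt, examples, user_query):
    """Build a chat `messages` list with few-shot examples as earlier turns."""
    messages = [{"role": "system", "content": system_prompt}]
    for customer_text, cora_text in examples:
        messages.append({"role": "user", "content": customer_text})
        messages.append({"role": "assistant", "content": cora_text})
    # The real customer query always comes last
    messages.append({"role": "user", "content": user_query})
    return messages


examples = [
    ("What paint do you have?",
     "Great question! We have several excellent paint options. What's your project?"),
    ("Is PFIP000002 in stock?",
     "Yes! Premium Exterior Paint (SKU: PFIP000002) is currently in stock "
     "with 75 units available. Would you like me to help you with anything else?"),
]

messages = build_few_shot_messages(
    "You are Cora, a helpful Zava Hardware assistant.",
    examples,
    "Do you sell paint rollers?",
)
print(len(messages), "messages; last role:", messages[-1]["role"])
```

Because the examples live in a plain list, they can be swapped per scenario without touching the rest of the prompt.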

### Pros

✅ **Quick to implement** - No training required  
✅ **Flexible** - Change examples anytime  
✅ **No infrastructure** - Just modify prompts  
✅ **Dynamic** - Different examples for different scenarios

### Cons

❌ **Increases token costs** - Examples add 200-500 tokens per request  
❌ **Uses context window** - Less room for actual conversation  
❌ **Not as consistent** - Model still improvises  
❌ **Limited examples** - Can only fit 5-10 examples

### When to Use Few-Shot

- ✅ Prototyping and testing
- ✅ Need flexibility (examples change often)
- ✅ Low query volume (< 1000/day)
- ✅ Simple behavior changes
- ❌ High query volume (expensive)
- ❌ Need perfect consistency
- ❌ Complex domain knowledge required

### Example Calculation

**Scenario:** 10,000 queries/day with 5 examples (300 tokens)

```
Few-Shot Approach:
- Higher token usage per request (examples included)
- Pay-per-use model
- Costs scale linearly with volume

Fine-Tuning Approach:
- Lower token usage per request (no examples needed)
- One-time training cost
- Lower per-request inference cost

At high volumes, fine-tuning becomes more cost-effective.
```

**Conclusion:** Few-shot works for low volume; fine-tuning is better at scale.

For current pricing: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/
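The comparison above can be turned into a rough break-even estimate. In the sketch below, the 10,000 queries/day and 300 example tokens come from the scenario; the 50-token base prompt, the training cost, the per-1K price, and the optional fixed daily cost are all placeholder assumptions, not actual Azure rates. Output tokens are ignored for simplicity.

```python
def daily_cost(prompt_tokens, queries_per_day, price_per_1k, fixed_per_day=0.0):
    """Daily input-token cost plus any fixed daily charge (for example, hosting)."""
    return prompt_tokens * queries_per_day * price_per_1k / 1000 + fixed_per_day


def break_even_days(training_cost, few_shot_daily, fine_tuned_daily):
    """Days until the one-time training cost is recovered, or None if it never is."""
    saving = few_shot_daily - fine_tuned_daily
    return training_cost / saving if saving > 0 else None


QUERIES_PER_DAY = 10_000   # from the scenario
EXAMPLE_TOKENS = 300       # from the scenario
BASE_PROMPT_TOKENS = 50    # assumption
PRICE_PER_1K = 0.001       # placeholder, not a real rate
TRAINING_COST = 50.0       # placeholder, not a real rate

few_shot = daily_cost(BASE_PROMPT_TOKENS + EXAMPLE_TOKENS, QUERIES_PER_DAY, PRICE_PER_1K)
fine_tuned = daily_cost(BASE_PROMPT_TOKENS, QUERIES_PER_DAY, PRICE_PER_1K)
days = break_even_days(TRAINING_COST, few_shot, fine_tuned)

print(f"Few-shot: ${few_shot:.2f}/day, fine-tuned: ${fine_tuned:.2f}/day")
print(f"Break-even after about {days:.1f} days")
```

If the fine-tuned deployment carries a fixed hosting charge or a higher per-token price, pass those values in; the function returns `None` when fine-tuning never pays back.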

---

## Fine-Tuning

**Fine-tuning** means continuing to train a base model on your specific data, adjusting its internal parameters to learn your domain and style.

### How It Works

```
Base Model (GPT-4o-mini)
         ↓
    + Your Training Data
      (100-1000 examples)
         ↓
   Fine-Tuning Process
   (2-6 hours on Azure)
         ↓
   Custom Fine-Tuned Model
   (your-model-deployment)
```

**The model's weights are updated** to encode your patterns directly.

### What Gets Trained

**Training data format (JSONL):**

```jsonl
{"messages": [{"role": "system", "content": "You are Cora, a Zava assistant"}, {"role": "user", "content": "What paint do you have?"}, {"role": "assistant", "content": "Great question! We have several excellent paint options..."}]}
{"messages": [{"role": "system", "content": "You are Cora, a Zava assistant"}, {"role": "user", "content": "Is PFIP000002 in stock?"}, {"role": "assistant", "content": "Yes! Premium Exterior Paint is currently in stock..."}]}
```

Each line is a complete conversation example showing:
1. System instructions
2. User query
3. Expected assistant response

**The model learns:**
- Response style and tone
- Domain terminology
- Response structure and format
- Brand voice and personality
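A quick local sanity check can catch malformed lines before upload. The sketch below is one possible check, not the service's own validation; `validate_example` and the specific rules it applies are assumptions made for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not a JSON object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: empty or non-string content")

    # Each training example should end with the response the model is meant to learn
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should be the expected assistant response")
    return problems


sample = ('{"messages": [{"role": "system", "content": "You are Cora, a Zava assistant"}, '
          '{"role": "user", "content": "What paint do you have?"}, '
          '{"role": "assistant", "content": "Great question!"}]}')
print(validate_example(sample) or "OK")
```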

### Supervised Fine-Tuning (SFT)

**SFT** is the most common fine-tuning approach:

1. **Supervised** - You provide labeled examples (query → expected response)
2. **Learning objective** - Model learns to predict your responses given inputs
3. **Gradient descent** - Model weights adjusted to minimize error on your data

**Analogy:** Like tutoring a student with practice problems and answer keys.
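The "error" being minimized is typically a cross-entropy (negative log-likelihood) over the tokens of the expected response. The toy sketch below illustrates that idea only; the probabilities are made up, `sft_loss` is a hypothetical helper, and the source does not describe the exact objective used by the training service.

```python
import math


def sft_loss(target_token_probs):
    """Average negative log-likelihood of the expected response tokens.

    Each value is the probability the model assigned to the correct next token.
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


# Made-up probabilities for a three-token expected response
before_training = sft_loss([0.3, 0.2, 0.4])   # model is unsure about the expected tokens
after_training = sft_loss([0.9, 0.8, 0.95])   # model now favors the expected tokens

print(f"Loss before: {before_training:.3f}, after: {after_training:.3f}")
```

Gradient descent adjusts the weights in the direction that lowers this value across the training set.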

### Preparing Training Data

**Data requirements:**
- **Minimum:** 50 examples (more is better)
- **Recommended:** 100-1000 examples
- **Format:** JSONL (JSON Lines)
- **Quality:** High-quality examples only

**Data sources:**
1. **Real conversations** (if available)
2. **Synthetic generation** (using Azure AI Simulator)
3. **Manual curation** (expert-written examples)
4. **Hybrid** (real + augmented)

**Example generation:**

```python
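import json

# Shared system message (same text as in the JSONL examples above)
system_prompt = "You are Cora, a Zava assistant"

training_data = [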
    {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What paint is best for kitchens?"},
            {"role": "assistant", "content": "For kitchens, I recommend our Premium Interior Paint..."}
        ]
    },
    # ... 99 more examples
]

# Save as JSONL
with open("training.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
```

### Token Optimization

**Problem:** Training cost based on token count

**Formula:**
```
Training Cost = (Total Tokens in Dataset) × (Training Epochs) × (Price per Token)
```

**Optimization strategies:**

1. **Remove redundancy**
```python
# Before (verbose)
"Hello! Thank you for asking! I'm happy to help you with that question..."

# After (concise)
"I recommend our Premium Exterior Paint for outdoor wood projects..."
```

2. **Truncate long responses**
```python
# Keep responses under 150 tokens
if token_count(response) > 150:
    response = truncate_intelligently(response, 150)
```

3. **Validate before uploading**
```python
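# Check token counts before uploading.
# count_tokens is a placeholder helper; epochs and price_per_1k are example values.
epochs = 3
price_per_1k = 0.008  # example rate; check the pricing page for current values

total_tokens = sum(count_tokens(ex) for ex in training_data)
cost_estimate = (total_tokens * epochs * price_per_1k) / 1000

print(f"Estimated training cost: ${cost_estimate:.2f}")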
```
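The `token_count` and `truncate_intelligently` helpers used in the strategies above are not defined in this section. A minimal sketch of what they might look like is shown below, assuming the rough "about 4 characters per token" rule of thumb; a real tokenizer library would give exact counts. The function names here are hypothetical.

```python
import re


def approx_token_count(text: str) -> int:
    """Rough token estimate using about 4 characters per token."""
    return max(1, round(len(text) / 4)) if text else 0


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep whole sentences from the start until the token budget is used up.

    Returns an empty string if even the first sentence exceeds the budget.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        candidate = " ".join(kept + [sentence])
        if approx_token_count(candidate) > max_tokens:
            break
        kept.append(sentence)
    return " ".join(kept)


response = "First sentence here. Second sentence is a bit longer than the first. Third."
short = truncate_to_budget(response, max_tokens=10)
print(short)
```

Cutting at sentence boundaries avoids training the model on responses that stop mid-thought.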

### Fine-Tuning Process

**Steps:**

1. **Prepare data** (JSONL format)
2. **Upload to Azure OpenAI**
3. **Submit fine-tuning job**
4. **Monitor progress** (2-6 hours)
5. **Deploy fine-tuned model**
6. **Test and validate**

**Azure OpenAI fine-tuning job:**

```python
from openai import AzureOpenAI

client = AzureOpenAI(...)

# Upload training file
with open("training.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # Base model
    hyperparameters={
        "n_epochs": 3  # Number of training passes
    }
)

# Monitor
status = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {status.status}")
```
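Since training can take hours, a single status check is rarely enough. The sketch below polls until the job reaches a terminal state; `wait_for_job` is a hypothetical helper, the polling interval is an assumption, and `client` is expected to be an `AzureOpenAI` client like the one created above.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id, poll_seconds=60, max_polls=720, sleep=time.sleep):
    """Poll a fine-tuning job until it finishes and return the final job object."""
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Usage (requires a configured client and a submitted job):
# final_job = wait_for_job(client, job.id)
# print(final_job.status, final_job.fine_tuned_model)
```

The `sleep` parameter is injectable so the loop can be exercised in tests without waiting.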

### Pros

✅ **Shorter prompts** - No need for examples  
✅ **Consistent behavior** - Model "knows" your style  
✅ **Better domain knowledge** - Learns terminology  
✅ **Cost-effective at scale** - Lower per-query cost  
✅ **Improved quality** - Specialized for your use case

### Cons

❌ **Initial effort** - Requires creating training data  
❌ **Training time** - 2-6 hours per job  
❌ **Static knowledge** - Must retrain to update  
❌ **Versioning complexity** - Managing model versions

### When to Use Fine-Tuning

- ✅ High query volume (> 1000/day)
- ✅ Need consistent tone/style
- ✅ Domain-specific terminology
- ✅ Have quality training data
- ✅ Long-term deployment
- ❌ Data changes daily
- ❌ Need real-time updates
- ❌ Very low query volume

---

## Distillation

**Distillation** means training a smaller, faster model to mimic a larger, more capable model.

### How It Works

```
Large "Teacher" Model          Small "Student" Model
(Higher cost/better)    →      (Lower cost/faster)

Query: "What paint?"    →      Query: "What paint?"
Response: [detailed]    →      Response: [similar quality]

Cost: Higher per token         Cost: Lower per token
Latency: Slower                Latency: Faster
```

See pricing: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

**Goal:** Get close to GPT-4o quality at GPT-4o-mini cost and speed.

### Distillation Process

**Step 1: Generate Teacher Responses**

```python
# Use large model to generate high-quality responses
teacher_model = "gpt-4o"
student_training_data = []

for query in training_queries:
    teacher_response = call_model(teacher_model, query)
    student_training_data.append({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
            {"role": "assistant", "content": teacher_response}
        ]
    })
```
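The `call_model` helper in Step 1 is not defined in this section. One possible shape is sketched below, assuming a chat-completions client such as `AzureOpenAI`; note that this version takes the client and system prompt as explicit arguments, which differs from the two-argument call shown above. `build_training_example` is also a hypothetical helper.

```python
def call_model(client, deployment, system_prompt, query, temperature=0.2):
    """Send one query to a chat deployment and return the response text."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        temperature=temperature,
    )
    return response.choices[0].message.content


def build_training_example(system_prompt, query, teacher_response):
    """Wrap a teacher response in the chat format used for fine-tuning."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
            {"role": "assistant", "content": teacher_response},
        ]
    }


# Usage (requires a configured client):
# answer = call_model(client, "gpt-4o", system_prompt, "What paint is best for kitchens?")
# student_training_data.append(build_training_example(system_prompt, query, answer))
```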

**Step 2: Fine-Tune Student Model**

```python
# Fine-tune small model on teacher's responses
student_model = fine_tune(
    base_model="gpt-4o-mini",
    training_data=student_training_data
)
```

**Step 3: Evaluate**

```python
# Compare student vs teacher
for query in test_queries:
    teacher_response = call_model("gpt-4o", query)
    student_response = call_model(student_model, query)
    
    similarity = compute_similarity(teacher_response, student_response)
    print(f"Similarity: {similarity}")  # Goal: > 0.85
```
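`compute_similarity` is left undefined above. As a crude stand-in, the sketch below measures word-level overlap with Python's standard `difflib`; this is an assumption for illustration, and in practice an embedding-based or model-graded comparison is likely to track quality better than lexical overlap.

```python
from difflib import SequenceMatcher


def compute_similarity(teacher_response: str, student_response: str) -> float:
    """Word-level lexical similarity between two responses, from 0.0 to 1.0."""
    teacher_words = teacher_response.lower().split()
    student_words = student_response.lower().split()
    return SequenceMatcher(None, teacher_words, student_words).ratio()


teacher = "For kitchens, I recommend our Premium Interior Paint."
student = "For kitchens, I suggest our Premium Interior Paint."
print(f"Similarity: {compute_similarity(teacher, student):.2f}")
```

A lexical score penalizes valid paraphrases, so treat a threshold such as 0.85 as a starting point to calibrate rather than a fixed rule.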

### Knowledge Transfer

**What gets distilled:**
- Reasoning patterns
- Response structure
- Domain knowledge
- Task-specific behaviors

**What doesn't get distilled:**
- Raw intelligence (student has limits)
- Emergent capabilities (student is smaller)
- Perfect accuracy (some quality loss acceptable)

### Types of Distillation

**1. Basic Distillation**
- Student learns from teacher's outputs directly
- Simple, effective for most use cases

**2. Distillation with Custom Graders**
- Use custom evaluators to score teacher responses
- Only keep high-quality examples for student training
- Better quality control

**Example: Custom grader**

```python
def grade_response(query, response):
    """Custom evaluator for response quality"""
    score = 0
    
    # Check for key elements
    if contains_product_sku(response):
        score += 1
    if polite_tone(response):
        score += 1
    if factually_grounded(response):
        score += 1
    if under_token_limit(response, 150):
        score += 1
        
    return score >= 3  # Keep if passes quality threshold

# Filter training data (chat-format examples generated in Step 1)
high_quality_data = [
    ex for ex in student_training_data
    if grade_response(ex["messages"][1]["content"], ex["messages"][-1]["content"])
]
```
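The checks called by the grader above are placeholders. The sketch below shows one way some of them might be implemented; the SKU pattern is inferred from `PFIP000002`, the politeness markers are invented for illustration, and `factually_grounded` is omitted because it would need access to the product catalog. With three checks instead of four, the pass threshold is lowered to 2.

```python
import re

# Assumed SKU shape: four capital letters followed by six digits (as in PFIP000002)
SKU_PATTERN = re.compile(r"\b[A-Z]{4}\d{6}\b")

# Hypothetical phrases that signal the friendly Zava tone
POLITE_MARKERS = ("great question", "happy to help", "let me help", "would you like")


def contains_product_sku(response: str) -> bool:
    return bool(SKU_PATTERN.search(response))


def polite_tone(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in POLITE_MARKERS)


def under_token_limit(response: str, limit: int) -> bool:
    # Rough estimate: about 4 characters per token
    return len(response) / 4 <= limit


def grade_example(example: dict, min_score: int = 2) -> bool:
    """Grade the assistant turn of a chat-format training example."""
    response = example["messages"][-1]["content"]
    score = sum([
        contains_product_sku(response),
        polite_tone(response),
        under_token_limit(response, 150),
    ])
    return score >= min_score


good = {"messages": [
    {"role": "user", "content": "I need paint for T1-11 siding"},
    {"role": "assistant",
     "content": "Great question! Our Premium Exterior Paint (PFIP000002) is a good fit."},
]}
weak = {"messages": [
    {"role": "user", "content": "I need paint for T1-11 siding"},
    {"role": "assistant", "content": "Paint."},
]}
kept = [ex for ex in (good, weak) if grade_example(ex)]
print(f"Kept {len(kept)} of 2 examples")
```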

### Pros

✅ **Cost reduction** - Cheaper model with similar quality  
✅ **Speed improvement** - Faster inference  
✅ **Smaller deployment** - Lower resource requirements  
✅ **Quality preservation** - Maintains ~85-95% of teacher quality

### Cons

❌ **Two-step process** - Generate + fine-tune  
❌ **Quality ceiling** - Can't exceed student model's capacity  
❌ **Upfront cost** - Expensive to generate teacher responses  
❌ **Complex evaluation** - Need to validate quality preservation

### When to Use Distillation

- ✅ Using expensive large model in production
- ✅ Need to reduce cost/latency
- ✅ Have budget for teacher model generation
- ✅ Can accept 5-15% quality reduction
- ❌ Already using smallest model
- ❌ Need maximum quality (can't compromise)
- ❌ Low query volume (ROI too low)

### Example ROI Calculation

**Scenario:** 100K queries/month

```
Current (Large Model):
- Higher per-token cost
- Higher monthly operational cost

Distillation (Small Model fine-tuned):
- One-time generation cost (teacher responses)
- One-time training cost
- Lower per-token inference cost
- Significantly lower monthly operational cost

Result: Substantial cost savings at high volume
Typical payback period: Days to weeks
Annual savings: $15,000 (illustrative figure)
```

For pricing details: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/

**Conclusion:** Distillation is extremely valuable at high volume.
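The payback period mentioned above follows from simple arithmetic. In the sketch below every dollar amount is a placeholder assumption rather than an actual price, and `payback_days` is a hypothetical helper.

```python
def payback_days(generation_cost, training_cost,
                 monthly_cost_large, monthly_cost_distilled):
    """Days until one-time distillation costs are recovered.

    Returns None when the distilled model is not cheaper to run.
    """
    monthly_saving = monthly_cost_large - monthly_cost_distilled
    if monthly_saving <= 0:
        return None
    daily_saving = monthly_saving / 30  # approximate a month as 30 days
    return (generation_cost + training_cost) / daily_saving


# Placeholder figures, for illustration only
days = payback_days(
    generation_cost=60.0,        # teacher responses
    training_cost=40.0,          # student fine-tuning job
    monthly_cost_large=1500.0,
    monthly_cost_distilled=300.0,
)
print(f"Payback period: {days:.1f} days")
```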

---

## Choosing the Right Approach

### Decision Tree

```
Start: Need to customize model behavior?
  ↓
  Yes → High query volume (> 1000/day)?
         ↓
         Yes → Using expensive model?
                ↓
                Yes → Use DISTILLATION
                       (GPT-4o → GPT-4o-mini fine-tuned)
                ↓
                No → Use FINE-TUNING
                      (GPT-4o-mini base → fine-tuned)
         ↓
         No → Need flexibility?
               ↓
               Yes → Use FEW-SHOT PROMPTING
               ↓
               No → Use FINE-TUNING
                     (better long-term)
```
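The tree above maps directly onto a small function. This is a literal transcription of the diagram, using the same 1000 queries/day threshold; the function and parameter names are hypothetical.

```python
def choose_approach(queries_per_day: int,
                    uses_expensive_model: bool,
                    needs_flexibility: bool) -> str:
    """Return the customization approach suggested by the decision tree."""
    if queries_per_day > 1000:
        # High volume: per-request cost dominates
        return "distillation" if uses_expensive_model else "fine-tuning"
    # Low volume: flexibility matters more than per-request cost
    return "few-shot prompting" if needs_flexibility else "fine-tuning"


print(choose_approach(10_000, uses_expensive_model=True, needs_flexibility=False))
print(choose_approach(200, uses_expensive_model=False, needs_flexibility=True))
```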

### Comparison Matrix

| Factor | Few-Shot | Fine-Tuning | Distillation |
|--------|----------|-------------|--------------|
| **Setup Time** | Minutes | Hours | Days |
| **Query Volume Sweet Spot** | < 1K/day | > 1K/day | > 10K/day |
| **Consistency** | Medium | High | High |
| **Cost at 10K queries/day** | Higher | Medium | Lower |
| **Flexibility** | High | Low | Low |
| **Domain Knowledge** | Limited | Good | Good |
| **Quality** | Good | Better | Best (if done right) |

### Hybrid Approaches

**Combine multiple techniques:**

**1. Fine-Tuning + RAG**
- Fine-tune for tone/style (static)
- RAG for product knowledge (dynamic)

```python
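# Fine-tuned deployment that carries the Zava brand voice
deployment = "zava-custom-model"

# RAG for up-to-date product info (retrieve_from_search is a placeholder helper)
context = retrieve_from_search(query)

# Combine: `client` is an AzureOpenAI client like the one in the fine-tuning example
response = client.chat.completions.create(
    model=deployment,
    messages=[
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
    ],
)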
```

**2. Few-Shot + Fine-Tuning**
- Fine-tune for general behavior
- Few-shot for specific edge cases

```python
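# Use fine-tuned deployment as base
deployment = "zava-fine-tuned"

# Add few-shot examples only for special cases
# (is_complex_query and add_examples are placeholder helpers)
prompt = add_examples(query) if is_complex_query(query) else query

# `client` is an AzureOpenAI client like the one in the fine-tuning example
response = client.chat.completions.create(
    model=deployment,
    messages=[{"role": "user", "content": prompt}],
)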
```

**3. Distillation + RAG**
- Distill for cost/speed
- RAG for factual grounding

```python
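# Distilled GPT-4o-mini deployment
deployment = "zava-distilled-mini"

# Retrieve current data (retrieve_from_search is a placeholder helper)
context = retrieve_from_search(query)

# Query with context; `client` is an AzureOpenAI client as in the fine-tuning example
response = client.chat.completions.create(
    model=deployment,
    messages=[{"role": "user", "content": f"Context: {context}\n\nQ: {query}"}],
)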
```

---

## Training Data Best Practices

### 1. Quality Over Quantity

**Better:**
- 100 high-quality, diverse examples
- Carefully curated and validated
- Representative of real use cases

**Worse:**
- 1000 low-quality, repetitive examples
- Automatically generated without review
- Not representative of actual queries

### 2. Diversity in Training Data

Cover different:
- **Query types** (questions, requests, commands)
- **Complexity levels** (simple to multi-step)
- **Product categories** (paint, tools, hardware)
- **Customer intents** (search, compare, fact-check)

```jsonl
{"messages": [{"role": "user", "content": "What paint?"}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "Compare latex vs oil-based paint for outdoor furniture"}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "Is PFIP000002 available in blue?"}, {"role": "assistant", "content": "..."}]}
```

### 3. Validation Split

**Don't use all data for training:**

```
Total: 1000 examples
  ↓
Training: 800 (80%)
Validation: 200 (20%)
```

**Use validation set to:**
- Detect overfitting
- Tune hyperparameters
- Measure generalization
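An 80/20 split can be done with a seeded shuffle so it is reproducible between runs. The sketch below is a minimal version; `split_dataset` and the default seed are assumptions made for illustration.

```python
import random


def split_dataset(examples, validation_fraction=0.2, seed=42):
    """Shuffle examples reproducibly and return (training, validation) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Stand-in for 1000 training examples
all_examples = [{"id": i} for i in range(1000)]
training, validation = split_dataset(all_examples)
print(f"Training: {len(training)}, Validation: {len(validation)}")
```

Shuffling before the split keeps each subset representative when the source file is grouped by product category or query type.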

### 4. Consistent Formatting

```jsonl
{"messages": [{"role": "system", "content": "You are Cora..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are Cora..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

**Keep consistent:**
- System message (same across all)
- Response structure
- Terminology and naming
- Tone and style

### 5. Token Budget Awareness

```python
import json

# Check before training (count_tokens is a placeholder helper)
def validate_training_data(data_file):
    total_tokens = 0
    
    with open(data_file) as f:
        for line in f:
            example = json.loads(line)
            tokens = count_tokens(example)
            total_tokens += tokens
            
            if tokens > 4096:  # Example too long
                print(f"Warning: Example exceeds limit: {tokens} tokens")
    
    epochs = 3
    cost = (total_tokens * epochs * 0.008) / 1000  # Example rate
    
    print(f"Total tokens: {total_tokens}")
    print(f"Estimated cost: ${cost:.2f}")
    
    return total_tokens < 1_000_000  # Example limit
```

### 6. Iterative Improvement

**Process:**
```
1. Create initial training set (100 examples)
   ↓
2. Fine-tune model
   ↓
3. Test on validation set
   ↓
4. Identify failure patterns
   ↓
5. Add examples addressing failures
   ↓
6. Repeat steps 2-5
```

**Each iteration improves specific weaknesses.**

---

## Evaluating Custom Models

### Before vs After Comparison

**Test on validation set:**

```python
validation_queries = [...]  # Held-out test set

# Test base model
base_results = evaluate_model("gpt-4o-mini", validation_queries)

# Test fine-tuned model
custom_results = evaluate_model("zava-fine-tuned", validation_queries)

# Compare
comparison = {
    "Base Model": {
        "Accuracy": base_results.accuracy,
        "Tone Match": base_results.tone_score,
        "Avg Tokens": base_results.avg_tokens
    },
    "Fine-Tuned": {
        "Accuracy": custom_results.accuracy,
        "Tone Match": custom_results.tone_score,
        "Avg Tokens": custom_results.avg_tokens
    }
}
```

### Key Metrics

**1. Task Performance**
- Did accuracy improve?
- Are responses more relevant?
- Better product recommendations?

**2. Style Consistency**
- Matches brand voice?
- Consistent tone across queries?
- Appropriate formality level?

**3. Efficiency**
- Shorter prompts needed?
- Faster responses?
- Lower token usage?

**4. Error Reduction**
- Fewer hallucinations?
- Better handling of edge cases?
- More graceful failures?

### A/B Testing in Production

```python
import random

# Split traffic
def route_query(query):
    if random.random() < 0.5:
        return base_model.query(query)
    else:
        return fine_tuned_model.query(query)

# Track metrics
metrics = {
    "base_model": {"satisfaction": [], "latency": []},
    "fine_tuned": {"satisfaction": [], "latency": []}
}

# After 1000 queries, compare
analyze_ab_test(metrics)
```
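A purely random split can send the same customer to different models on consecutive turns. One alternative, sketched below as an assumption rather than the lab's method, is to hash a stable user identifier so each user always sees the same variant. `assign_variant` and `summarize` are hypothetical helpers.

```python
import hashlib
from statistics import mean


def assign_variant(user_id: str, fine_tuned_share: float = 0.5) -> str:
    """Deterministically map a user to 'fine_tuned' or 'base_model'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # value in [0, 1)
    return "fine_tuned" if bucket < fine_tuned_share else "base_model"


def summarize(metrics: dict) -> dict:
    """Average each recorded metric per variant, skipping empty series."""
    return {
        variant: {name: mean(values) for name, values in series.items() if values}
        for variant, series in metrics.items()
    }


metrics = {
    "base_model": {"satisfaction": [4, 5], "latency": [1.2, 0.8]},
    "fine_tuned": {"satisfaction": [5, 5], "latency": []},
}
summary = summarize(metrics)
print(assign_variant("customer-123"), summary)
```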

### Regression Testing

**Ensure fine-tuning didn't break existing capabilities:**

```python
# Test suite
regression_tests = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
    # ... general knowledge tests
]

# Both models should pass
assert all(test_model(base_model, regression_tests))
assert all(test_model(fine_tuned_model, regression_tests))
```
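The `test_model` helper above is not defined in this section. A minimal sketch of the same idea follows, under the assumption that a model is wrapped as a plain function from prompt text to response text; `run_regression_tests` and `fake_model` are hypothetical names.

```python
def run_regression_tests(model_fn, tests):
    """Return one boolean per test: True if the expected text appears in the answer."""
    results = []
    for case in tests:
        answer = model_fn(case["input"])
        results.append(case["expected"].lower() in answer.lower())
    return results


regression_tests = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real deployment, used here only to exercise the harness."""
    canned = {
        "What is 2+2?": "2+2 equals 4.",
        "What is the capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "I'm not sure.")


results = run_regression_tests(fake_model, regression_tests)
print(f"Passed {sum(results)} of {len(results)} regression tests")
```

Substring matching is deliberately lenient, since a fine-tuned model may phrase correct answers in the brand voice.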

---

## Terminology Quick Reference

| Term | Simple Definition |
|------|-------------------|
| **Fine-Tuning** | Training a model further on your specific data |
| **Distillation** | Training a small model to mimic a large model |
| **Few-Shot Prompting** | Including example responses in the prompt |
| **Supervised Fine-Tuning (SFT)** | Fine-tuning with labeled input-output pairs |
| **JSONL** | JSON Lines format - one JSON object per line |
| **Training Data** | Examples used to teach the model your patterns |
| **Validation Data** | Examples held out to test model performance |
| **Epoch** | One complete pass through training data |
| **Token** | Unit of text (~4 characters) used for billing |
| **Hyperparameters** | Settings that control training (e.g., learning rate) |
| **Overfitting** | Model memorizes training data, doesn't generalize |
| **Teacher Model** | Large model used as source in distillation |
| **Student Model** | Small model being trained in distillation |
| **Custom Grader** | Function that evaluates response quality |

---

## What's Next?

Now that you understand model customization concepts, you're ready to fine-tune and distill your own models!

### Hands-On Notebooks in This Section

- **`31-basic-finetuning.ipynb`** - Fine-tune a model on Zava product data
  - Prepare training data in JSONL format
  - Validate token counts and optimize data
  - Submit fine-tuning job to Azure OpenAI
  - Deploy and test fine-tuned model
  - Compare base vs. fine-tuned performance

- **`32-custom-grader.ipynb`** - Build custom evaluators for quality control
  - Create custom grading functions
  - Filter training data by quality
  - Improve training data quality
  - Validate responses meet standards

- **`33-distill-finetuning.ipynb`** - Distill GPT-4o knowledge to GPT-4o-mini
  - Generate teacher responses from GPT-4o
  - Create student training dataset
  - Fine-tune GPT-4o-mini on teacher outputs
  - Compare student vs. teacher quality
  - Calculate cost savings from distillation

### Recommended Learning Path

1. **Start here** → Understand concepts (this notebook)
2. **Next** → Basic fine-tuning (`31-basic-finetuning.ipynb`)
3. **Then** → Quality control (`32-custom-grader.ipynb`)
4. **Advanced** → Distillation (`33-distill-finetuning.ipynb`)
5. **After** → Move to evaluation labs (measure improvements)
6. **Finally** → Deploy custom models to production

---

## Further Reading

For deeper understanding:

- **[Fine-Tuning Guide](https://learn.microsoft.com/azure/ai-services/openai/how-to/fine-tuning)** - Azure OpenAI fine-tuning documentation
- **[Preparing Training Data](https://learn.microsoft.com/azure/ai-services/openai/how-to/fine-tuning?tabs=python#prepare-training-data)** - Data format and best practices
- **[Model Distillation](https://learn.microsoft.com/azure/ai-studio/concepts/model-distillation)** - Knowledge transfer concepts
- **[Token Optimization](https://learn.microsoft.com/azure/ai-services/openai/how-to/token-optimization)** - Reducing costs
- **[Evaluation Metrics](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in)** - Measuring model quality

---

Ready to fine-tune your first model? Open `31-basic-finetuning.ipynb` to get started! 🚀