# Lab: Vibe Coding Workshop — Build a Learning Assistant CLI

## Welcome to the Vibe Coding Workshop!

In this lab, you'll learn AI-assisted development by **building**, not by studying theory.

### What You'll Build

A **Learning Assistant CLI** with:
- 3-5 commands (track, status, recommend, quiz, export)
- JSON persistence
- Comprehensive test coverage
- Clean project structure

### The 5-Step Vibe Coding Loop

1. **Spec** → Write requirements, constraints, acceptance tests
2. **Scaffold** → Request minimal project structure
3. **Test** → Generate pytest tests from spec
4. **Patch** → Iterate: test failure → minimal fix → verify
5. **Review** → AI-assisted code review + refactors

### Prerequisites

- Python 3.8+
- Your favorite AI coding assistant (for example Cursor, Windsurf, or GitHub Copilot)
- A text editor or IDE

### Time Estimate

3-4 hours (with breaks)

---

**Let's begin!**

## Setup: Create Your Project Directory

Before starting, create a directory for your project:

```bash
mkdir -p ~/learning_assistant
cd ~/learning_assistant
```

**Checkpoint:**
- [ ] Created project directory
- [ ] Changed into the directory
- [ ] Ready to start Step 1

## Exercise 1: Write Your Spec

**Time: 10-15 minutes**

In this exercise, you'll write a specification for your Learning Assistant CLI. Your spec should define:
- 3-5 commands with arguments
- Requirements (data storage, validation, etc.)
- Non-goals (what you're NOT building)
- 8-10 acceptance tests

**Use the template below or create your own spec in a separate file.**

### Example Spec Template

```markdown
## Feature: Learning Assistant CLI

### Commands
1. `track <topic> <hours> [date]` - Log study session
2. `status` - Show progress as JSON
3. `recommend` - Suggest next topic
4. `quiz <topic>` - Generate practice questions
5. `export <filename>` - Export data to JSON

### Requirements
- Data stored in data.json
- status and export output valid JSON
- Input validation with helpful errors
- Each command has --help text

### Non-goals
- No web UI (CLI only)
- No external APIs (local data only)
- No fancy TUI (plain text output)

### Acceptance Tests
1. `status` on empty data → {"total_hours": 0, "topics": []}
2. `track python 2.5` → data.json updated
3. `track python -1` → Error: "Hours must be positive"
4. `recommend` after tracking → relevant suggestion
5. `export out.json` → valid JSON file created
...
```

**Your task:** Create a spec file or write it in your AI assistant, then move to Exercise 2.
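
To see how acceptance tests 1-3 in the template pin down behavior, here is a minimal sketch. The function names `track` and `status` and the in-memory layout `{"sessions": [...]}` are assumptions made for illustration; they are not part of the spec, and your own design may differ.

```python
import json


def track(data: dict, topic: str, hours: float) -> dict:
    """Record one study session; reject non-positive hours (acceptance test 3)."""
    if hours <= 0:
        raise ValueError("Hours must be positive")
    # Assumed layout: a flat list of session records under "sessions".
    data.setdefault("sessions", []).append({"topic": topic, "hours": hours})
    return data


def status(data: dict) -> str:
    """Summarize progress as a JSON string (acceptance tests 1 and 2)."""
    sessions = data.get("sessions", [])
    summary = {
        "total_hours": sum(s["hours"] for s in sessions),
        "topics": sorted({s["topic"] for s in sessions}),
    }
    return json.dumps(summary)


empty_report = json.loads(status({}))
tracked_report = json.loads(status(track({}, "python", 2.5)))
print(empty_report)
print(tracked_report)
```

Writing each acceptance test so that it maps to one small check like this is a quick way to find out whether the spec is specific enough.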

**Checkpoint 1: Spec Complete**

Before moving forward, verify your spec:
- [ ] Defines 3-5 commands with clear arguments
- [ ] Lists specific requirements (storage, validation, output format)
- [ ] States non-goals (what you're NOT building)
- [ ] Includes 8-10 acceptance tests
- [ ] Is specific enough that someone could implement it without asking questions

**If yes to all → proceed to Exercise 2 (Scaffold)**

## Exercise 2: Request Project Scaffold (10 min)

**Goal:** Get AI to create project structure with minimal stubs.

Use your AI assistant with this prompt template:

```
Using my spec, create a Python CLI project with this structure:

learning_assistant/
├── README.md (setup + usage)
├── requirements.txt (pytest, jsonschema)
├── data.json (empty template)
├── src/
│   ├── __init__.py
│   ├── cli.py (argparse setup)
│   ├── commands.py (command stubs)
│   └── storage.py (load/save stubs)
└── tests/
    ├── __init__.py
    └── test_commands.py (empty test stubs)

Requirements:
- Generate ONLY file structure with minimal stubs
- Do NOT implement full logic yet
- Add TODO comments where logic goes
- Keep functions under 10 lines (stubs only)
```

**Action:** Have your AI assistant generate the scaffold, then verify that every file exists and that the structure makes sense.
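
For reference, a stub-only `cli.py` in the spirit of the prompt above might look like the sketch below. The `build_parser` and `main` names and the exact parser layout are assumptions; your assistant will likely produce something different.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Wire up subcommands only; command logic stays as TODO stubs."""
    parser = argparse.ArgumentParser(prog="learning_assistant")
    subparsers = parser.add_subparsers(dest="command", required=True)

    track = subparsers.add_parser("track", help="Log a study session")
    track.add_argument("topic")
    track.add_argument("hours", type=float)
    track.add_argument("date", nargs="?", default=None)

    subparsers.add_parser("status", help="Show progress as JSON")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # TODO: dispatch args.command to the matching function in commands.py
    print(f"{args.command}: not implemented yet")
    return 0


args = build_parser().parse_args(["track", "python", "2.5"])
print(args.command, args.topic, args.hours, args.date)
```

A scaffold at this level parses arguments and nothing more, which keeps the first test run focused on structure rather than logic.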

In [None]:
class SystemPromptDesigner:
    """Design effective system prompts for different use cases."""
    
    def __init__(self):
        self.prompt_templates = {
            'technical_writer': self.create_technical_writer_prompt(),
            'customer_support': self.create_customer_support_prompt(),
            'code_reviewer': self.create_code_reviewer_prompt(),
            'data_analyst': self.create_data_analyst_prompt()
        }
    
    def create_technical_writer_prompt(self) -> str:
        """Create prompt for technical documentation."""
        return """
You are a senior technical writer with 10 years of experience in software documentation.
You specialize in creating clear, concise, and accurate technical documentation for developers.

Your characteristics:
- Write in a professional but approachable tone
- Use active voice and present tense
- Include practical examples and code snippets
- Structure information logically with clear headings

Your constraints:
- Avoid jargon unless necessary (define when used)
- Keep sentences under 25 words when possible
- Provide step-by-step instructions for complex tasks
- Include troubleshooting sections for common issues

Your output format:
- Use Markdown formatting
- Include a brief overview section
- Structure content with hierarchical headings
- End with a summary or next steps section
"""
    
    def create_customer_support_prompt(self) -> str:
        """Create prompt for customer support."""
        return """
You are a helpful customer support assistant for TechCorp.
You have access to order information, product details, and troubleshooting guides.

Guidelines:
- Be polite and professional
- Ask clarifying questions when needed
- Provide step-by-step solutions
- Escalate to human agent for complex issues
- Never make promises about refunds or compensation
"""
    
    def create_code_reviewer_prompt(self) -> str:
        """Create prompt for code review."""
        return """
You are an experienced software engineer specializing in code review and best practices.
You have expertise in Python, JavaScript, and software architecture.

Review criteria:
- Check for code readability and maintainability
- Identify potential bugs or security issues
- Suggest performance improvements
- Ensure proper error handling
- Verify adherence to coding standards

Provide constructive feedback with specific examples and suggestions.
"""
    
    def create_data_analyst_prompt(self) -> str:
        """Create prompt for data analysis."""
        return """
You are a senior data analyst with expertise in statistical analysis and data visualization.
You specialize in extracting insights from complex datasets.

Analysis approach:
- Start with data quality assessment
- Identify key patterns and trends
- Provide statistical significance where applicable
- Suggest actionable recommendations
- Include appropriate visualizations

Always explain your methodology and assumptions clearly.
"""

# Test system prompt effectiveness
print("Testing System Prompt Designer...")
designer = SystemPromptDesigner()

for role, prompt in designer.prompt_templates.items():
    print(f"\n--- {role.replace('_', ' ').title()} Prompt ---")
    print(f"Length: {len(prompt)} characters")
    print(f"First 200 chars: {prompt[:200]}...")

## Exercise 2: The CLEAR Framework Implementation

### What is CLEAR? A Systematic Approach to Prompt Engineering

**The CLEAR Framework:**

CLEAR is a **structured methodology** for creating effective prompts:

- **C**ontext: Set the scene and background
- **L**ength: Specify desired output length
- **E**xamples: Provide few-shot examples
- **A**udience: Define who will read this
- **R**equirements: List specific constraints and needs

**Why Use a Framework?**

Without a framework:
- Prompts are inconsistent
- Important details get forgotten
- Results vary wildly
- Hard to improve systematically

With CLEAR:
- ✅ **Consistent**: Same structure every time
- ✅ **Complete**: All important aspects covered
- ✅ **Comparable**: Easy to A/B test variations
- ✅ **Improvable**: Clear what to change

---

### Breaking Down Each Component

**1. Context (C)**
```python
context = "You are writing product descriptions for an e-commerce website."
```
**Purpose**: 
- Establishes the situation
- Provides background information
- Sets expectations

**Why It Matters**: 
- AI needs context to make good decisions
- Without context, responses are generic
- Context helps AI understand constraints

**2. Length (L)**
```python
length = "100-150 words"
```
**Purpose**:
- Specifies output size
- Prevents too-short or too-long responses
- Helps with consistency

**Why It Matters**:
- Users expect consistent length
- Too short = missing information
- Too long = wasted tokens, user fatigue

**3. Examples (E)**
```python
examples = """
Input: Wireless Bluetooth Headphones
Output: Experience premium sound quality...
"""
```
**Purpose**:
- Shows AI exactly what you want
- Demonstrates format and style
- Provides reference patterns

**Why It Matters**:
- **Few-shot learning**: The model infers the pattern from the examples in the prompt (no retraining involved)
- **Format consistency**: Examples show structure
- **Quality bar**: Examples set expectations

**4. Audience (A)**
```python
audience = "environmentally conscious consumers aged 25-45"
```
**Purpose**:
- Defines who will read the output
- Influences tone and complexity
- Guides content selection

**Why It Matters**:
- Technical audience → use jargon
- General audience → simple language
- Age group → appropriate references

**5. Requirements (R)**
```python
requirements = "Highlight these features: organic cotton, breathable fabric..."
```
**Purpose**:
- Lists must-have elements
- Specifies constraints
- Defines success criteria

**Why It Matters**:
- Ensures nothing is forgotten
- Provides clear checklist
- Makes evaluation easier

---

### The CLEAR Template Structure

**Standard Template:**
```
Context: {context}

Task: {task}

Requirements:
- Length: {length}
- Audience: {audience}
- {requirements}

Examples:
{examples}

Remember to follow all requirements and match the example format.
```

**Why This Order?**
1. **Context first**: Sets the stage
2. **Task second**: What to do
3. **Requirements third**: How to do it
4. **Examples last**: Reference for style

---

### Real-World Applications

**E-commerce Product Descriptions:**
- **Context**: E-commerce website
- **Length**: 100-150 words
- **Examples**: Show successful descriptions
- **Audience**: Target customer segment
- **Requirements**: Highlight key features

**Technical Documentation:**
- **Context**: Developer documentation
- **Length**: 300-500 words
- **Examples**: Code examples and explanations
- **Audience**: Software developers
- **Requirements**: Include code snippets, troubleshooting

**Marketing Copy:**
- **Context**: Marketing campaign
- **Length**: 50-100 words
- **Examples**: Compelling copy samples
- **Audience**: Target market
- **Requirements**: Include call-to-action, benefits

---

### Advanced CLEAR Techniques

**1. Multiple Examples**
- Show variety in examples
- Cover edge cases
- Demonstrate range of acceptable outputs

**2. Negative Examples**
- Show what NOT to do
- Highlight common mistakes
- Set boundaries

**3. Progressive Complexity**
- Start with simple examples
- Build to complex scenarios
- Guide AI learning curve

**4. Dynamic Requirements**
- Adjust based on input
- Context-aware constraints
- Adaptive specifications

---

### Measuring CLEAR Effectiveness

**Metrics to Track:**
- **Completeness**: All requirements met?
- **Consistency**: Similar inputs → similar outputs?
- **Quality**: Does output match examples?
- **Efficiency**: Right length, no waste?

**A/B Testing:**
- Test different CLEAR variations
- Measure user satisfaction
- Optimize based on results
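
The completeness and efficiency checks above can be partly automated. The sketch below is one possible approach under simple assumptions: `check_output` is a hypothetical helper, completeness is approximated by case-insensitive substring matching, and length is counted in words.

```python
import re
from typing import Dict, List


def check_output(text: str, min_words: int, max_words: int,
                 required_terms: List[str]) -> Dict[str, object]:
    """Check a generated text against the Length and Requirements parts of CLEAR."""
    words = re.findall(r"\b\w+\b", text)
    lowered = text.lower()
    # A required term counts as covered if it appears verbatim, ignoring case.
    missing = [term for term in required_terms if term.lower() not in lowered]
    return {
        "word_count": len(words),
        "length_ok": min_words <= len(words) <= max_words,
        "missing_terms": missing,
        "complete": not missing,
    }


report = check_output(
    "Soft organic cotton with a breathable fabric weave.",
    min_words=5,
    max_words=20,
    required_terms=["organic cotton", "breathable fabric", "sustainable production"],
)
print(report)
```

Substring matching misses paraphrases, so treat a failed check as a prompt for manual review rather than a verdict.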

Now let's implement it:

In [None]:
from typing import List


class ClearFramework:
    """Implement the CLEAR framework for prompt engineering."""
    
    @staticmethod
    def create_prompt(task: str, context: str, length: str, examples: str, 
                     audience: str, requirements: str) -> str:
        """Create a prompt using the CLEAR framework."""
        return f"""
Context: {context}

Task: {task}

Requirements:
- Length: {length}
- Audience: {audience}
- {requirements}

Examples:
{examples}

Remember to follow all requirements and match the example format.
"""
    
    @staticmethod
    def create_product_description_prompt(product_name: str, features: List[str], 
                                        target_audience: str) -> str:
        """Create a product description prompt using CLEAR framework."""
        context = "You are writing product descriptions for an e-commerce website."
        task = f"Write a compelling product description for {product_name}"
        length = "100-150 words"
        audience = target_audience
        requirements = f"Highlight these features: {', '.join(features)}. Use persuasive language."
        
        examples = """
Input: Wireless Bluetooth Headphones
Output: Experience premium sound quality with our wireless Bluetooth headphones. 
Featuring active noise cancellation, 30-hour battery life, and comfortable over-ear design. 
Perfect for music lovers and professionals who demand exceptional audio performance.

Input: Stainless Steel Water Bottle
Output: Stay hydrated with our durable stainless steel water bottle. Double-wall vacuum 
insulation keeps drinks cold for 24 hours or hot for 12 hours. Leak-proof lid and sweat-free 
design make it perfect for gym, office, or outdoor adventures.
"""
        
        return ClearFramework.create_prompt(task, context, length, examples, audience, requirements)

# Test CLEAR framework
print("Testing CLEAR Framework...")

# Create product description prompt
product_prompt = ClearFramework.create_product_description_prompt(
    product_name="Organic Cotton T-Shirt",
    features=["100% organic cotton", "breathable fabric", "sustainable production", "comfortable fit"],
    target_audience="environmentally conscious consumers aged 25-45"
)

print("Generated Product Description Prompt:")
print(product_prompt)

## Exercise 3: Parameter Tuning and Evaluation

### Why Parameters Matter: Controlling AI Behavior

**The Hidden Levers:**

AI models have **parameters** that control their behavior:
- **Temperature**: Controls randomness/creativity
- **Top-p**: Controls diversity of token selection
- **Max tokens**: Limits response length
- **Frequency penalty**: Reduces repetition
- **Presence penalty**: Encourages new topics

**Why This Matters:**

Same prompt + different parameters = **completely different outputs**:
- Low temperature: Consistent, predictable
- High temperature: Creative, varied
- Wrong parameters: Useless or harmful outputs

**Real-World Impact:**
- **10-50% improvement** in output quality with right parameters
- **Cost savings**: Shorter responses = lower costs
- **User satisfaction**: Better outputs = happier users

---

### Understanding Temperature: The Creativity Dial

**What Temperature Does:**

Temperature controls **how random** the model's token selection is:

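- **Temperature = 0.0**: Greedy decoding; the model picks the most likely token at each step, so output is nearly deterministic (hosted APIs can still show small run-to-run differences)
- **Temperature = 0.1-0.3**: Low creativity, high consistency
- **Temperature = 0.5-0.7**: Balanced (a common starting point; API defaults vary by provider)
- **Temperature = 0.8-1.0**: High creativity, high variation (1.0 samples from the model's unmodified distribution)
- **Temperature > 1.0**: Flattens the distribution; output becomes increasingly random and can turn incoherent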

**How It Works:**

```
Model calculates probability for each possible next token:
- "the" (40%)
- "a" (30%)
- "an" (20%)
- "some" (10%)

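Low temperature: Probability mass concentrates on "the", so it is picked almost every time
Temperature = 1.0: Samples in proportion to the probabilities shown (40/30/20/10)
High temperature: Flattens the distribution, so "an" and "some" are picked more often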
```
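
The effect of temperature on the four-token example can be reproduced with a few lines of arithmetic. This is a toy sketch: real models apply temperature to logits over the full vocabulary, and `apply_temperature` is a hypothetical helper that assumes all probabilities are positive.

```python
import math
from typing import Dict


def apply_temperature(probs: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Rescale a distribution: softmax(log(p) / T), i.e. p ** (1 / T) renormalized."""
    if temperature <= 0:
        # T -> 0 is the greedy limit: all mass on the most likely token.
        best = max(probs, key=probs.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in probs}
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}


probs = {"the": 0.4, "a": 0.3, "an": 0.2, "some": 0.1}
for t in (0.5, 1.0, 2.0):
    rescaled = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in rescaled.items()})
```

At 0.5 the share of "the" rises above 40%, at 1.0 the distribution is unchanged, and at 2.0 the four tokens move closer to equal weight.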

**When to Use Each:**

| Temperature | Use Case | Example |
|------------|----------|---------|
| 0.0-0.2 | Code generation, factual answers | "What is 2+2?" |
| 0.3-0.5 | Technical writing, analysis | "Explain how X works" |
| 0.6-0.8 | Creative writing, brainstorming | "Write a story about..." |
| 0.9-1.2 | Experimental, artistic | "Generate unique ideas" |

---

### Understanding Top-p (Nucleus Sampling)

**What Top-p Does:**

Top-p keeps the **smallest set of most-likely tokens** whose cumulative probability reaches p, then samples from that set:

- **Top-p = 0.1**: Keeps only the most likely tokens until they cover 10% of the probability mass (often a single token)
- **Top-p = 0.5**: Keeps tokens up to 50% cumulative probability
- **Top-p = 0.9**: Keeps most of the probability mass while dropping the unlikely tail (a common setting)
- **Top-p = 1.0**: Considers all tokens

**Example:**

```
Token probabilities:
- "the" (40%)
- "a" (30%)  ← Cumulative: 70%
- "an" (20%) ← Cumulative: 90%
- "some" (10%)

Top-p = 0.7: Only considers "the" and "a"
Top-p = 0.9: Considers "the", "a", and "an"
```
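
The selection rule in this example can be sketched as a short function. `nucleus` is a hypothetical name, and the sketch covers only the filtering step; a real sampler would then renormalize the kept probabilities and draw from them.

```python
from typing import Dict, List


def nucleus(probs: Dict[str, float], top_p: float) -> List[str]:
    """Return the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    kept: List[str] = []
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p - 1e-9:  # small tolerance for float rounding
            break
    return kept


probs = {"the": 0.4, "a": 0.3, "an": 0.2, "some": 0.1}
print(nucleus(probs, 0.7))
print(nucleus(probs, 0.9))
```

Because the cutoff depends on the shape of the distribution, the same `top_p` keeps one token when the model is confident and many when it is not.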

**Why Use Top-p?**

- **More control** than temperature alone
- **Adaptive**: Adjusts based on probability distribution
- **Quality**: Filters out low-probability nonsense tokens

---

### Parameter Profiles for Different Use Cases

**1. Accuracy-Focused (Temperature: 0.1, Top-p: 0.3)**
```python
{
    'temperature': 0.1,
    'top_p': 0.3,
    'description': 'High accuracy, low creativity'
}
```
**Use for:**
- Factual questions
- Code generation
- Data extraction
- Classification tasks

**2. Balanced (Temperature: 0.5, Top-p: 0.7)**
```python
{
    'temperature': 0.5,
    'top_p': 0.7,
    'description': 'Balanced creativity and accuracy'
}
```
**Use for:**
- General conversation
- Content writing
- Analysis tasks
- Most applications

**3. Creative (Temperature: 0.8, Top-p: 0.9)**
```python
{
    'temperature': 0.8,
    'top_p': 0.9,
    'description': 'High creativity, varied outputs'
}
```
**Use for:**
- Creative writing
- Brainstorming
- Idea generation
- Artistic content

**4. Deterministic (Temperature: 0.0, Top-p: 0.1)**
```python
{
    'temperature': 0.0,
    'top_p': 0.1,
    'description': 'Maximum consistency'
}
```
**Use for:**
- Testing and debugging
- Reproducible outputs
- When consistency is critical

---

### Measuring Parameter Impact

**Key Metrics:**

**1. Consistency Score**
- **What**: How similar are outputs for same input?
- **How**: Compare multiple runs with same parameters
- **Target**: High for accuracy, low for creativity

**2. Lexical Diversity**
- **What**: Ratio of unique words to total words
- **How**: Count unique words / total words
- **Target**: Higher = more varied vocabulary

**3. Length Variance**
- **What**: How much does response length vary?
- **How**: Variance (or standard deviation) of response lengths
- **Target**: Low for consistent tasks, high for creative tasks

**4. Quality Metrics**
- **What**: Does output meet requirements?
- **How**: Human evaluation or automated checks
- **Target**: Depends on use case
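
The first three metrics can be approximated with the standard library alone. The sketch below is a simplified stand-in: it measures consistency as the average Jaccard word overlap between pairs of responses, whereas the implementation later in this notebook uses TF-IDF cosine similarity. `response_metrics` and `tokenize` are hypothetical names.

```python
import re
import statistics
from itertools import combinations
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\b\w+\b", text.lower())


def response_metrics(responses: List[str]) -> Dict[str, float]:
    """Compute lexical diversity, length spread, and pairwise word overlap."""
    token_lists = [tokenize(r) for r in responses]
    all_tokens = [tok for tokens in token_lists for tok in tokens]
    lengths = [len(tokens) for tokens in token_lists]

    # Jaccard overlap between every pair of responses, averaged.
    overlaps = []
    for a, b in combinations(token_lists, 2):
        union = set(a) | set(b)
        overlaps.append(len(set(a) & set(b)) / len(union) if union else 1.0)

    return {
        "lexical_diversity": len(set(all_tokens)) / len(all_tokens) if all_tokens else 0.0,
        "length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
        "consistency": sum(overlaps) / len(overlaps) if overlaps else 1.0,
    }


metrics = response_metrics(["Paris is the capital", "The capital is Paris"])
print(metrics)
```

Word overlap ignores word order and meaning, so it is best used for comparing parameter settings against each other, not as an absolute quality score.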

---

### The Parameter Optimization Process

**Step 1: Define Success Metrics**
- What makes a "good" output?
- How will you measure it?
- What's acceptable vs unacceptable?

**Step 2: Test Parameter Grid**
```python
param_grid = [
    (0.1, 0.3),  # Accuracy-focused
    (0.5, 0.7),  # Balanced
    (0.8, 0.9),  # Creative
]
```

**Step 3: Run Tests**
- Same prompts across all parameter combinations
- Multiple runs per combination (for consistency)
- Record all metrics

**Step 4: Analyze Results**
- Which parameters give best results?
- Are there trade-offs?
- What's the optimal balance?

**Step 5: Validate**
- Test on new prompts
- Verify in production
- Monitor ongoing performance
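
Steps 2-4 can be sketched as a small loop. To keep the example runnable without an API key, the generator and the scoring function below are stand-ins; `run_grid`, `fake_generate`, and `brevity_score` are hypothetical names, and in practice `generate` would wrap a model call and `score` would implement your success metric from Step 1.

```python
from typing import Callable, Dict, List, Tuple


def run_grid(generate: Callable[[str, float, float], str],
             score: Callable[[str], float],
             prompts: List[str],
             param_grid: List[Tuple[float, float]],
             runs: int = 3) -> List[Dict[str, float]]:
    """Score every (temperature, top_p) pair over the same prompts, several runs each."""
    results = []
    for temperature, top_p in param_grid:
        scores = [score(generate(prompt, temperature, top_p))
                  for prompt in prompts for _ in range(runs)]
        results.append({
            "temperature": temperature,
            "top_p": top_p,
            "mean_score": sum(scores) / len(scores),
        })
    # Best-scoring combination first.
    return sorted(results, key=lambda r: r["mean_score"], reverse=True)


def fake_generate(prompt: str, temperature: float, top_p: float) -> str:
    # Stand-in: pretend higher temperature produces longer answers.
    return prompt + " word" * round(temperature * 10)


def brevity_score(text: str) -> float:
    # Stand-in metric: shorter answers score higher.
    return 1.0 / len(text.split())


ranking = run_grid(fake_generate, brevity_score,
                   ["Explain recursion"], [(0.1, 0.3), (0.5, 0.7), (0.8, 0.9)])
print(ranking[0])
```

Passing the generator and the metric in as functions keeps the loop independent of any particular provider or definition of quality.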

---

### Real-World Optimization Example

**Scenario: Customer Support Chatbot**

**Initial Parameters:**
- Temperature: 0.7 (default)
- Top-p: 0.9 (default)
- **Result**: Inconsistent, sometimes too creative

**Optimized Parameters:**
- Temperature: 0.2
- Top-p: 0.5
- **Result**: Consistent, accurate, professional

**Impact:**
- **30% improvement** in user satisfaction
- **50% reduction** in escalations to human agents
- **20% faster** response times (more concise)

---

### Best Practices

**1. Start with Defaults**
- Don't optimize prematurely
- Understand baseline first
- Then optimize based on data

**2. Test Systematically**
- Use parameter grids
- Test one variable at a time
- Document all results

**3. Consider Trade-offs**
- Accuracy vs creativity
- Consistency vs variety
- Speed vs quality

**4. Monitor Continuously**
- Parameters may need adjustment over time
- Use cases evolve
- Models get updated

Now let's implement parameter optimization:

In [None]:
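import re
from typing import Any, Dict, List, Tuple

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ParameterOptimizer: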
    """Optimize model parameters for different use cases."""
    
    def __init__(self):
        self.parameter_profiles = {
            'accuracy_focused': {'temperature': 0.1, 'top_p': 0.3, 'description': 'High accuracy, low creativity'},
            'balanced': {'temperature': 0.5, 'top_p': 0.7, 'description': 'Balanced creativity and accuracy'},
            'creative': {'temperature': 0.8, 'top_p': 0.9, 'description': 'High creativity, varied outputs'},
            'deterministic': {'temperature': 0.0, 'top_p': 0.1, 'description': 'Maximum consistency'}
        }
    
    def get_optimal_parameters(self, use_case: str) -> Dict[str, Any]:
        """Get optimal parameters for specific use case."""
        return self.parameter_profiles.get(use_case, self.parameter_profiles['balanced'])
    
    def test_parameter_sensitivity(self, client, test_prompts: List[str], 
                                 param_grid: List[Tuple[float, float]]) -> List[Dict[str, Any]]:
        """Test multiple parameter combinations."""
        results = []
        
        for temp, top_p in param_grid:
            combination_results = {
                'temperature': temp,
                'top_p': top_p,
                'outputs': [],
                'metrics': {}
            }
            
            for prompt in test_prompts:
                try:
                    response = client.chat.completions.create(
                        model="gpt-3.5-turbo",
                        messages=[{"role": "user", "content": prompt}],
                        temperature=temp,
                        top_p=top_p,
                        max_tokens=150
                    )
                    
                    combination_results['outputs'].append({
                        'prompt': prompt,
                        'response': response.choices[0].message.content
                    })
                except Exception as e:
                    print(f"Error with temp={temp}, top_p={top_p}: {e}")
            
            # Calculate metrics for this combination
            combination_results['metrics'] = self.calculate_metrics(combination_results['outputs'])
            results.append(combination_results)
        
        return results
    
    def calculate_metrics(self, outputs: List[Dict[str, Any]]) -> Dict[str, float]:
        """Calculate stability and quality metrics."""
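        responses = [output['response'] for output in outputs]
        if not responses:
            # Every API call failed; avoid NaN from np.mean on an empty list
            return {'avg_length': 0.0, 'length_variance': 0.0,
                    'lexical_diversity': 0.0, 'consistency_score': 0.0}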
        
        return {
            'avg_length': np.mean([len(r) for r in responses]),
            'length_variance': np.var([len(r) for r in responses]),
            'lexical_diversity': self.calculate_lexical_diversity(responses),
            'consistency_score': self.calculate_consistency(responses)
        }
    
    def calculate_lexical_diversity(self, responses: List[str]) -> float:
        """Calculate lexical diversity (unique words / total words)."""
        all_words = []
        for response in responses:
            words = re.findall(r'\b\w+\b', response.lower())
            all_words.extend(words)
        
        if not all_words:
            return 0.0
        
        unique_words = len(set(all_words))
        total_words = len(all_words)
        return unique_words / total_words
    
    def calculate_consistency(self, responses: List[str]) -> float:
        """Calculate consistency across responses."""
        if len(responses) < 2:
            return 1.0
        
        # Use TF-IDF to measure similarity
        vectorizer = TfidfVectorizer(stop_words='english')
        try:
            tfidf_matrix = vectorizer.fit_transform(responses)
            similarities = cosine_similarity(tfidf_matrix)
            
            # Calculate average similarity (excluding diagonal)
            n = len(responses)
            total_similarity = 0
            count = 0
            for i in range(n):
                for j in range(i+1, n):
                    total_similarity += similarities[i][j]
                    count += 1
            
            return total_similarity / count if count > 0 else 0.0
        except ValueError:
            # Raised when the vocabulary is empty (e.g. responses contain only stop words)
            return 0.0

# Test parameter optimization
print("Testing Parameter Optimization...")

optimizer = ParameterOptimizer()

# Test different parameter profiles
for profile_name, params in optimizer.parameter_profiles.items():
    print(f"\n--- {profile_name.replace('_', ' ').title()} ---")
    print(f"Temperature: {params['temperature']}")
    print(f"Top_p: {params['top_p']}")
    print(f"Description: {params['description']}")

# Test parameter sensitivity with sample prompts
test_prompts = [
    "Write a creative story about artificial intelligence.",
    "Explain quantum computing in simple terms.",
    "What are the benefits of renewable energy?"
]

## Exercise 4: LLM-as-Judge Evaluation Framework

### Why Automated Evaluation? The Scale Problem

**The Challenge:**

Evaluating AI outputs manually is:
- **Slow**: Humans take minutes per evaluation
- **Expensive**: Requires human evaluators
- **Inconsistent**: Different evaluators, different standards
- **Not Scalable**: Can't evaluate thousands of outputs

**The Solution: LLM-as-Judge**

Use **another AI model** to evaluate AI outputs:
- **Fast**: Seconds per evaluation
- **Cheap**: Fraction of human cost
- **Consistent**: Same criteria every time
- **Scalable**: Evaluate millions of outputs

**When It Works:**
- ✅ Objective criteria (accuracy, completeness)
- ✅ Structured outputs (JSON, code)
- ✅ Well-defined quality standards
- ✅ High-volume evaluation needs

**When It Doesn't:**
- ❌ Subjective quality (artistic merit)
- ❌ Domain expertise required
- ❌ Nuanced human judgment
- ❌ Critical decisions (use human review)

---

### Understanding LLM-as-Judge Architecture

**The Two-Model System:**

```
Model 1 (Generator): Creates outputs
    ↓
Output: "The capital of France is Paris..."
    ↓
Model 2 (Judge): Evaluates outputs
    ↓
Evaluation: "Relevance: 9/10, Accuracy: 10/10..."
```

**Why This Works:**

- **Separation**: The judge runs in its own context and sees only the response, the criteria, and an optional reference answer
- **Objectivity**: Judge applies same criteria to all outputs
- **Scalability**: Can evaluate at same speed as generation
- **Cost**: Judge model can be smaller/cheaper than generator

---

### Designing Effective Evaluation Prompts

**The Evaluation Prompt Structure:**

```
You are an expert evaluator. Assess the following response:

Criteria: {criteria_list}

{f'Reference answer: {reference}' if reference else ''}

Response to evaluate: {response}

Provide scores (1-10) for each criterion:
- Relevance: [score] - [justification]
- Accuracy: [score] - [justification]
- Clarity: [score] - [justification]
- Completeness: [score] - [justification]
- Style: [score] - [justification]

Overall score: [average]/10
```

**Key Components:**

**1. Role Definition**
```python
"You are an expert evaluator..."
```
**Why**: Gives judge authority and context

**2. Criteria Specification**
```python
"Criteria: Relevance, Accuracy, Clarity, Completeness, Style"
```
**Why**: Defines what to evaluate

**3. Reference Answer (Optional)**
```python
"Reference answer: The capital of France is Paris."
```
**Why**: Provides ground truth for comparison

**4. Structured Output Format**
```python
"Provide scores (1-10) for each criterion..."
```
**Why**: Makes parsing results easier

---

### Evaluation Criteria Design

**Common Criteria:**

**1. Relevance**
- **What**: Does response address the question?
- **Scale**: 1 (completely off-topic) to 10 (perfectly relevant)
- **Example**: Question about Python, response about Java → Low relevance

**2. Accuracy**
- **What**: Is the information correct?
- **Scale**: 1 (completely wrong) to 10 (completely accurate)
- **Example**: "2+2=5" → Low accuracy

**3. Clarity**
- **What**: Is the response easy to understand?
- **Scale**: 1 (confusing) to 10 (crystal clear)
- **Example**: Jargon-filled response → Low clarity

**4. Completeness**
- **What**: Does response cover all aspects?
- **Scale**: 1 (missing key points) to 10 (comprehensive)
- **Example**: Partial answer → Lower completeness

**5. Style**
- **What**: Does response match required style?
- **Scale**: 1 (wrong style) to 10 (perfect style)
- **Example**: Formal question, casual response → Lower style score

---

### Parsing Evaluation Results

**The Challenge:**

Judge outputs natural language:
```
"Relevance: 9 - The response directly addresses the question about France's capital.
Accuracy: 10 - The information is correct.
Overall score: 9.5/10"
```

**The Solution: Pattern Matching**

```python
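import re

# Sample judge output in the format requested by the evaluation prompt
evaluation_text = """Relevance: 9 - The response directly addresses the question.
Accuracy: 10 - The information is correct.
Overall score: 9.5/10"""

score_patterns = {
    'relevance': r'Relevance:\s*(\d+)',
    'accuracy': r'Accuracy:\s*(\d+)',
    'clarity': r'Clarity:\s*(\d+)',
    'completeness': r'Completeness:\s*(\d+)',
    'style': r'Style:\s*(\d+)'
}

# Criteria the judge did not mention are simply left out of the result
scores = {}
for criterion, pattern in score_patterns.items():
    match = re.search(pattern, evaluation_text)
    if match:
        scores[criterion] = int(match.group(1))

print(scores)  # {'relevance': 9, 'accuracy': 10}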

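**Why This Works:**
- **Reliable**: A fixed output format makes parsing straightforward
- **Limited tolerance**: `\s*` absorbs extra whitespace, but Markdown emphasis (`**Relevance:**`) or different capitalization will not match, so keep the requested format strict
- **Fast**: Regex is efficient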
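
If the judge drifts from the requested format, a slightly more forgiving parser can help. The sketch below is one option, not the notebook's parser: `parse_scores` is a hypothetical helper that ignores case, skips Markdown emphasis and separators between the label and the number, and discards values outside 1-10.

```python
import re
from typing import Dict, List


def parse_scores(evaluation_text: str, criteria: List[str]) -> Dict[str, int]:
    """Extract 1-10 scores, tolerating case, Markdown emphasis, and a '/10' suffix."""
    scores: Dict[str, int] = {}
    for criterion in criteria:
        # Allow whitespace, '*', '_', ':' and '-' between the label and the number.
        pattern = rf"{re.escape(criterion)}[\s*_:\-]*(\d+)"
        match = re.search(pattern, evaluation_text, re.IGNORECASE)
        if match and 1 <= int(match.group(1)) <= 10:
            scores[criterion.lower()] = int(match.group(1))
    return scores


sample = "**Relevance:** 9/10 - on topic\naccuracy: 10 - correct\nClarity - 8"
parsed = parse_scores(sample, ["Relevance", "Accuracy", "Clarity", "Style"])
print(parsed)
```

Criteria that are missing from the judge's reply are left out of the result, so callers can tell "not scored" apart from a low score.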

---

### Production Template Management

**Why Templates Matter:**

In production, you need:
- **Version Control**: Track template changes
- **A/B Testing**: Compare template variations
- **Reusability**: Use templates across projects
- **Maintainability**: Update templates easily

**Template Structure:**

```python
{
    'name': 'customer_support',
    'version': '1.0.0',
    'template': 'You are a customer support agent...',
    'parameters': [
        {'name': 'company_name', 'required': True},
        {'name': 'product_type', 'required': True},
        {'name': 'customer_issue', 'required': True}
    ],
    'metadata': {
        'description': 'Customer support response template',
        'category': 'support',
        'created_at': timestamp,
        'usage_count': 0
    }
}
```

**Benefits:**
- **Validation**: Check required parameters before rendering
- **Tracking**: Monitor template usage
- **Versioning**: Roll back if needed
- **Documentation**: Self-documenting structure
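
The validation and tracking benefits can be illustrated with a small rendering function. This is a sketch under assumptions: `render_template` is a hypothetical helper, the template body and its `{placeholder}` syntax (filled with `str.format`) are invented for the example, and the values are illustrative.

```python
from typing import Any, Dict


def render_template(template: Dict[str, Any], values: Dict[str, str]) -> str:
    """Fill a template after checking that every required parameter is supplied."""
    missing = [p["name"] for p in template["parameters"]
               if p.get("required") and p["name"] not in values]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    template["metadata"]["usage_count"] += 1  # simple usage tracking
    return template["template"].format(**values)


support_template = {
    "name": "customer_support",
    "version": "1.0.0",
    "template": "You are a customer support agent for {company_name}. "
                "Help with this {product_type} issue: {customer_issue}",
    "parameters": [
        {"name": "company_name", "required": True},
        {"name": "product_type", "required": True},
        {"name": "customer_issue", "required": True},
    ],
    "metadata": {"usage_count": 0},
}

rendered = render_template(support_template, {
    "company_name": "TechCorp",
    "product_type": "router",
    "customer_issue": "no Wi-Fi signal",
})
print(rendered)
```

Failing before rendering means a missing parameter surfaces as a clear error instead of a half-filled prompt reaching the model.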

---

### Real-World Applications

**1. Content Quality Control**
- Evaluate generated articles before publishing
- Ensure brand voice consistency
- Check factual accuracy

**2. Customer Support**
- Evaluate chatbot responses
- Ensure helpfulness and accuracy
- Monitor quality over time

**3. Code Generation**
- Evaluate generated code quality
- Check correctness and style
- Ensure best practices

**4. Data Extraction**
- Evaluate extraction accuracy
- Check completeness
- Validate structure

---

### Best Practices

**1. Use Reference Answers When Available**
- Provides ground truth
- Improves evaluation accuracy
- Enables comparison

**2. Define Clear Criteria**
- Vague criteria = inconsistent evaluations
- Specific criteria = reliable results
- Examples help clarify

**3. Calibrate with Human Evaluators**
- Compare LLM judge to human judges
- Adjust criteria based on differences
- Validate approach

**4. Monitor Judge Performance**
- Track evaluation consistency
- Check for bias
- Update judge prompts as needed

**5. Combine with Other Metrics**
- Don't rely solely on LLM-as-judge
- Use automated metrics (length, format)
- Include human spot-checks
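
Calibrating against human evaluators (practice 3) comes down to comparing two lists of scores for the same responses. The sketch below shows one simple way to summarize agreement; `judge_agreement` is a hypothetical helper and the scores are made-up illustration data.

```python
from typing import Dict, List


def judge_agreement(judge_scores: List[float], human_scores: List[float]) -> Dict[str, float]:
    """Summarize how closely LLM-judge scores track human scores on the same items."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("Need two non-empty score lists of equal length")
    signed = [j - h for j, h in zip(judge_scores, human_scores)]
    diffs = [abs(d) for d in signed]
    return {
        "mean_abs_error": sum(diffs) / len(diffs),
        "within_one_point": sum(d <= 1 for d in diffs) / len(diffs),
        # Positive bias means the judge scores higher than humans on average.
        "mean_bias": sum(signed) / len(signed),
    }


agreement = judge_agreement([9, 7, 8, 4], [8, 7, 5, 4])
print(agreement)
```

A consistent positive or negative bias usually points at the judge prompt's criteria, while large scattered errors suggest the criteria are too vague for either kind of evaluator.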

Now let's implement the evaluation framework:

In [None]:
import re
from typing import Any, Dict, List, Optional


class LLMJudge:
    """Implement LLM-as-a-judge evaluation system."""
    
    def __init__(self, judge_model_client):
        self.judge_model = judge_model_client
    
    def evaluate_response(self, response: str, criteria: List[str], reference: Optional[str] = None) -> Dict[str, Any]:
        """Evaluate a response using LLM-as-judge."""
        evaluation_prompt = f"""
You are an expert evaluator. Assess the following response based on these criteria:

Criteria: {', '.join(criteria)}

{f'Reference answer: {reference}' if reference else ''}

Response to evaluate: {response}

Provide scores (1-10) for each criterion and brief justifications:
- Relevance: [score] - [justification]
- Accuracy: [score] - [justification] 
- Clarity: [score] - [justification]
- Completeness: [score] - [justification]
- Style: [score] - [justification]

Overall score: [average]/10
"""
        
        try:
            evaluation = self.judge_model.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": evaluation_prompt}]
            )
            
            return self.parse_evaluation(evaluation.choices[0].message.content)
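        except Exception as e:
            # Return the same keys as parse_evaluation so callers can treat both cases uniformly
            return {'error': str(e), 'individual_scores': {}, 'overall_score': 0, 'full_evaluation': ''}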
    
    def parse_evaluation(self, evaluation_text: str) -> Dict[str, Any]:
        """Parse evaluation results from judge response."""
        scores = {}
        
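        # Extract individual scores. The [\s*\[]* run tolerates brackets or markdown
        # bold around the number, since the prompt shows scores as "[score]".
        score_patterns = {
            'relevance': r'Relevance:[\s*\[]*(\d+)',
            'accuracy': r'Accuracy:[\s*\[]*(\d+)',
            'clarity': r'Clarity:[\s*\[]*(\d+)',
            'completeness': r'Completeness:[\s*\[]*(\d+)',
            'style': r'Style:[\s*\[]*(\d+)'
        }
        
        for criterion, pattern in score_patterns.items():
            match = re.search(pattern, evaluation_text, re.IGNORECASE)
            if match:
                scores[criterion] = int(match.group(1))
        
        # Extract overall score; fall back to the mean of the individual scores if it is missing
        overall_match = re.search(r'Overall score:[\s*\[]*(\d+(?:\.\d+)?)', evaluation_text, re.IGNORECASE)
        if overall_match:
            overall_score = float(overall_match.group(1))
        elif scores:
            overall_score = sum(scores.values()) / len(scores)
        else:
            overall_score = 0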
        
        return {
            'individual_scores': scores,
            'overall_score': overall_score,
            'full_evaluation': evaluation_text
        }

class ProductionPromptTemplate:
    """Production-ready prompt template management."""
    
    def __init__(self, template_dir: str = "templates"):
        self.template_dir = template_dir
        self.templates = {}
        self.version_history = {}
    
    def create_template(self, name: str, template: str, parameters: List[Dict[str, Any]], 
                       metadata: Dict[str, Any] = None) -> str:
        """Create a new prompt template."""
        template_data = {
            'name': name,
            'version': '1.0.0',
            'template': template,
            'parameters': parameters,
            'metadata': metadata or {},
            'created_at': time.time(),
            'usage_count': 0
        }
        
        self.templates[name] = template_data
        self.version_history[f"{name}_v1.0.0"] = template_data.copy()
        
        return f"{name}_v1.0.0"
    
    def render_template(self, name: str, **kwargs) -> str:
        """Render template with provided parameters."""
        if name not in self.templates:
            raise ValueError(f"Template '{name}' not found")
        
        template_data = self.templates[name]
        template_str = template_data['template']
        
        # Validate required parameters
        required_params = [p['name'] for p in template_data['parameters'] if p.get('required', True)]
        missing_params = [p for p in required_params if p not in kwargs]
        
        if missing_params:
            raise ValueError(f"Missing required parameters: {missing_params}")
        
        # Render template
        try:
            rendered = template_str.format(**kwargs)
            self.templates[name]['usage_count'] += 1
            return rendered
        except KeyError as e:
            # Raised when the template uses a placeholder that is not declared in 'parameters'
            raise ValueError(f"Missing template parameter: {e}") from e

# Test the evaluation framework
print("Testing LLM-as-Judge Evaluation Framework...")

# Create sample responses for evaluation
sample_responses = [
    "The capital of France is Paris. It is known for the Eiffel Tower and Louvre Museum.",
    "Paris is the capital city of France, located in Western Europe. It has a population of over 2 million people.",
    "France's capital is Paris, which is famous for landmarks like the Eiffel Tower."
]

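# LLMJudge needs a configured OpenAI-compatible client, so it is not called here.
# With a client available, you could run:
#   LLMJudge(client).evaluate_response(sample_responses[0], ["relevance", "accuracy"])
print("Sample responses created for evaluation")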

# Test production template system
print("\nTesting Production Template System...")

template_manager = ProductionPromptTemplate()

# Create a customer support template
support_template = """
You are a customer support agent for {company_name}.
You specialize in helping customers with {product_type} products.

Guidelines:
- Be polite and professional
- Ask clarifying questions when needed
- Provide step-by-step solutions
- Escalate complex issues when necessary

Customer Issue: {customer_issue}
"""

template_params = [
    {'name': 'company_name', 'type': 'string', 'required': True, 'description': 'Company name'},
    {'name': 'product_type', 'type': 'string', 'required': True, 'description': 'Type of product'},
    {'name': 'customer_issue', 'type': 'string', 'required': True, 'description': 'Customer issue description'}
]

template_version = template_manager.create_template(
    name="customer_support",
    template=support_template,
    parameters=template_params,
    metadata={'description': 'Customer support response template', 'category': 'support'}
)

print(f"Created template: {template_version}")

# Test rendering the template
rendered_prompt = template_manager.render_template(
    name="customer_support",
    company_name="TechCorp",
    product_type="software",
    customer_issue="I can't log into my account"
)

print("\nRendered Template:")
print(rendered_prompt)

# Test with missing parameter (should raise error)
try:
    template_manager.render_template(
        name="customer_support",
        company_name="TechCorp",
        product_type="software"
        # Missing customer_issue parameter
    )
except ValueError as e:
    print(f"\nExpected error for missing parameter: {e}")

print("\n✅ Prompt Engineering Lab Completed!")
print("\nKey Skills Learned:")
print("- System prompt design for different roles")
print("- CLEAR framework implementation")
print("- Parameter optimization and tuning")
print("- LLM-as-judge evaluation framework")
print("- Production-ready template management")

print("\nNext Steps:")
print("1. Experiment with different parameter combinations")
print("2. Create custom evaluation criteria for your use case")
print("3. Build a comprehensive prompt template library")
print("4. Implement A/B testing for prompt variations")
print("5. Add monitoring and analytics for production use")

## Summary and Next Steps

Congratulations! You've completed the Prompt Engineering and Evaluation lab. Here's what you've learned:

### Key Skills Acquired:
- ✅ System prompt design for different professional roles
- ✅ CLEAR framework implementation for structured prompts
- ✅ Parameter optimization (temperature, top_p) for different use cases
- ✅ LLM-as-judge evaluation framework for automated assessment
- ✅ Production-ready prompt template management with versioning

### Best Practices:
- Always define clear roles and constraints in system prompts
- Use the CLEAR framework for complex prompt requirements
- Test parameter sensitivity for your specific use case
- Implement automated evaluation for quality assurance
- Version control your prompt templates for production use

### Next Steps:
1. Experiment with different parameter combinations for your specific use cases
2. Create custom evaluation criteria tailored to your domain
3. Build a comprehensive library of prompt templates
4. Implement A/B testing frameworks for prompt optimization
5. Add monitoring and analytics for production prompt performance
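
As a starting point for step 4, an A/B test needs two pieces: a stable way to assign each user to a prompt variant, and a summary of scores per variant. The sketch below is one possible shape; `assign_variant`, `summarize`, and the sample results are hypothetical and use made-up scores.

```python
import hashlib
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def assign_variant(user_id: str, variants: List[str]) -> str:
    """Pick a variant deterministically so a user always sees the same prompt."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def summarize(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average the evaluation score for each prompt variant."""
    by_variant = defaultdict(list)
    for result in results:
        by_variant[result["variant"]].append(result["score"])
    return {variant: mean(scores) for variant, scores in by_variant.items()}


variant = assign_variant("user-42", ["prompt_a", "prompt_b"])
print(variant)

# Made-up evaluation results for illustration only.
results = [
    {"variant": "prompt_a", "score": 8},
    {"variant": "prompt_a", "score": 6},
    {"variant": "prompt_b", "score": 9},
]
print(summarize(results))
```

Hashing the user ID rather than choosing at random keeps assignments stable across sessions without storing any state. With real traffic, compare the averages only after collecting enough samples per variant to rule out noise.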

Remember: Effective prompt engineering is an iterative process. Continuously test, evaluate, and refine your prompts based on real-world performance!