# Lab: Prompt Engineering and Evaluation

## Learning Objectives

- Implement effective prompt engineering techniques
- Build guardrails against AI hallucinations
- Create evaluation frameworks for AI outputs
- Optimize model parameters for different use cases
- Develop production-ready prompt templates

## Prerequisites

Install required packages:

In [None]:
# Install into the active kernel's environment
%pip install openai huggingface_hub transformers numpy scikit-learn python-dotenv

## Setup and Configuration

Let's start by setting up our environment and API credentials:

In [None]:
import os
import json
import time
import random
import re
import numpy as np
from typing import Dict, Any, List, Optional, Tuple
from dataclasses import dataclass
from dotenv import load_dotenv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load environment variables
load_dotenv()

# API keys are read from the environment (or a .env file); the placeholder defaults only mark a key as unset
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-key-here")
HF_TOKEN = os.getenv("HF_TOKEN", "your-huggingface-token-here")

print("Environment setup complete!")
print(f"OpenAI API Key configured: {'Yes' if OPENAI_API_KEY != 'your-openai-key-here' else 'No'}")
print(f"HuggingFace Token configured: {'Yes' if HF_TOKEN != 'your-huggingface-token-here' else 'No'}")
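
The placeholder defaults above make a missing key easy to overlook until the first API call fails. A minimal sketch of a fail-fast check follows. `require_env` and the `LAB_DEMO_KEY` variable are hypothetical names introduced for illustration, and the `your-` prefix is assumed to match the placeholder strings used in this lab.

```python
import os


def require_env(name: str, placeholder_prefix: str = "your-") -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for this lab: unset, empty, and placeholder values
    (those starting with `placeholder_prefix`) are all treated as missing.
    """
    value = os.getenv(name, "")
    if not value or value.startswith(placeholder_prefix):
        raise RuntimeError(f"{name} is not configured; set it in your environment or .env file")
    return value


# Demonstration with a throwaway variable name (assumed not to clash with real settings)
os.environ["LAB_DEMO_KEY"] = "abc123"
print(require_env("LAB_DEMO_KEY"))

os.environ["LAB_DEMO_KEY"] = "your-openai-key-here"
try:
    require_env("LAB_DEMO_KEY")
except RuntimeError as err:
    print(f"Caught: {err}")
```

Calling a check like this before creating an API client surfaces configuration problems at setup time rather than in the middle of an exercise.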

## Exercise 1: System Prompts and Role Design

Create effective system prompts that define model behavior and constraints:

In [None]:
class SystemPromptDesigner:
    """Design effective system prompts for different use cases."""
    
    def __init__(self):
        self.prompt_templates = {
            'technical_writer': self.create_technical_writer_prompt(),
            'customer_support': self.create_customer_support_prompt(),
            'code_reviewer': self.create_code_reviewer_prompt(),
            'data_analyst': self.create_data_analyst_prompt()
        }
    
    def create_technical_writer_prompt(self) -> str:
        """Create prompt for technical documentation."""
        return """
You are a senior technical writer with 10 years of experience in software documentation.
You specialize in creating clear, concise, and accurate technical documentation for developers.

Your characteristics:
- Write in a professional but approachable tone
- Use active voice and present tense
- Include practical examples and code snippets
- Structure information logically with clear headings

Your constraints:
- Avoid jargon unless necessary (define when used)
- Keep sentences under 25 words when possible
- Provide step-by-step instructions for complex tasks
- Include troubleshooting sections for common issues

Your output format:
- Use Markdown formatting
- Include a brief overview section
- Structure content with hierarchical headings
- End with a summary or next steps section
"""
    
    def create_customer_support_prompt(self) -> str:
        """Create prompt for customer support."""
        return """
You are a helpful customer support assistant for TechCorp.
You have access to order information, product details, and troubleshooting guides.

Guidelines:
- Be polite and professional
- Ask clarifying questions when needed
- Provide step-by-step solutions
- Escalate to human agent for complex issues
- Never make promises about refunds or compensation
"""
    
    def create_code_reviewer_prompt(self) -> str:
        """Create prompt for code review."""
        return """
You are an experienced software engineer specializing in code review and best practices.
You have expertise in Python, JavaScript, and software architecture.

Review criteria:
- Check for code readability and maintainability
- Identify potential bugs or security issues
- Suggest performance improvements
- Ensure proper error handling
- Verify adherence to coding standards

Provide constructive feedback with specific examples and suggestions.
"""
    
    def create_data_analyst_prompt(self) -> str:
        """Create prompt for data analysis."""
        return """
You are a senior data analyst with expertise in statistical analysis and data visualization.
You specialize in extracting insights from complex datasets.

Analysis approach:
- Start with data quality assessment
- Identify key patterns and trends
- Provide statistical significance where applicable
- Suggest actionable recommendations
- Include appropriate visualizations

Always explain your methodology and assumptions clearly.
"""

# Test system prompt effectiveness
print("Testing System Prompt Designer...")
designer = SystemPromptDesigner()

for role, prompt in designer.prompt_templates.items():
    print(f"\n--- {role.replace('_', ' ').title()} Prompt ---")
    print(f"Length: {len(prompt)} characters")
    print(f"First 200 chars: {prompt[:200]}...")
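
A system prompt only takes effect when it is sent as the `system` message ahead of the user's input. The sketch below shows that pairing in the chat message format passed to `client.chat.completions.create` later in this lab. `build_messages` and the shortened reviewer prompt are illustrative assumptions, not part of the lab's classes.

```python
from typing import Dict, List


def build_messages(system_prompt: str, user_input: str) -> List[Dict[str, str]]:
    """Pair a system prompt with a user message (hypothetical helper)."""
    return [
        # Strip the leading/trailing newlines left by triple-quoted strings
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": user_input},
    ]


# Shortened stand-in for one of the role prompts defined above
reviewer_prompt = """
You are an experienced software engineer specializing in code review.
Provide constructive feedback with specific examples.
"""

messages = build_messages(reviewer_prompt, "Review: def add(a, b): return a - b")
for message in messages:
    print(f"{message['role']}: {message['content'][:60]}")
```

In the notebook, any entry of `designer.prompt_templates` can be passed as `system_prompt` in the same way.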

## Exercise 2: The CLEAR Framework Implementation

Implement the CLEAR framework for creating effective prompts:

In [None]:
class ClearFramework:
    """Implement the CLEAR framework for prompt engineering."""
    
    @staticmethod
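    def create_prompt(task: str, context: str, length: str, examples: str, 
                     audience: str, requirements: str) -> str:
        """Create a prompt using the CLEAR framework.

        As used in this lab, CLEAR covers Context, Length, Examples,
        Audience, and Requirements, wrapped around the task itself.
        """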
        return f"""
Context: {context}

Task: {task}

Requirements:
- Length: {length}
- Audience: {audience}
- {requirements}

Examples:
{examples}

Remember to follow all requirements and match the example format.
"""
    
    @staticmethod
    def create_product_description_prompt(product_name: str, features: List[str], 
                                        target_audience: str) -> str:
        """Create a product description prompt using CLEAR framework."""
        context = "You are writing product descriptions for an e-commerce website."
        task = f"Write a compelling product description for {product_name}"
        length = "100-150 words"
        audience = target_audience
        requirements = f"Highlight these features: {', '.join(features)}. Use persuasive language."
        
        examples = """
Input: Wireless Bluetooth Headphones
Output: Experience premium sound quality with our wireless Bluetooth headphones. 
Featuring active noise cancellation, 30-hour battery life, and comfortable over-ear design. 
Perfect for music lovers and professionals who demand exceptional audio performance.

Input: Stainless Steel Water Bottle
Output: Stay hydrated with our durable stainless steel water bottle. Double-wall vacuum 
insulation keeps drinks cold for 24 hours or hot for 12 hours. Leak-proof lid and sweat-free 
design make it perfect for gym, office, or outdoor adventures.
"""
        
        return ClearFramework.create_prompt(task, context, length, examples, audience, requirements)

# Test CLEAR framework
print("Testing CLEAR Framework...")

# Create product description prompt
product_prompt = ClearFramework.create_product_description_prompt(
    product_name="Organic Cotton T-Shirt",
    features=["100% organic cotton", "breathable fabric", "sustainable production", "comfortable fit"],
    target_audience="environmentally conscious consumers aged 25-45"
)

print("Generated Product Description Prompt:")
print(product_prompt)
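
The few-shot examples above are hardcoded as one string, which makes them awkward to swap per product category. As a rough sketch, the same `Input:`/`Output:` layout can be generated from a list of pairs. `format_examples` is a hypothetical helper, and the output texts are shortened placeholders rather than full descriptions.

```python
from typing import List, Tuple


def format_examples(pairs: List[Tuple[str, str]]) -> str:
    """Render (input, output) pairs in the Input/Output layout used above."""
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in pairs]
    # A blank line between examples keeps each pair visually distinct
    return "\n\n".join(blocks)


examples = format_examples([
    ("Wireless Bluetooth Headphones", "Experience premium sound quality with our headphones."),
    ("Stainless Steel Water Bottle", "Stay hydrated with our durable stainless steel bottle."),
])
print(examples)
```

The resulting string can be passed as the `examples` argument of `ClearFramework.create_prompt`.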

## Exercise 3: Parameter Tuning and Evaluation

Implement parameter optimization and evaluation frameworks:

In [None]:
class ParameterOptimizer:
    """Optimize model parameters for different use cases."""
    
    def __init__(self):
        self.parameter_profiles = {
            'accuracy_focused': {'temperature': 0.1, 'top_p': 0.3, 'description': 'High accuracy, low creativity'},
            'balanced': {'temperature': 0.5, 'top_p': 0.7, 'description': 'Balanced creativity and accuracy'},
            'creative': {'temperature': 0.8, 'top_p': 0.9, 'description': 'High creativity, varied outputs'},
            'deterministic': {'temperature': 0.0, 'top_p': 0.1, 'description': 'Maximum consistency'}
        }
    
    def get_optimal_parameters(self, use_case: str) -> Dict[str, Any]:
        """Get the parameter profile for a use case, falling back to 'balanced'."""
        return self.parameter_profiles.get(use_case, self.parameter_profiles['balanced'])
    
    def test_parameter_sensitivity(self, client, test_prompts: List[str], 
                                 param_grid: List[Tuple[float, float]]) -> List[Dict[str, Any]]:
        """Test multiple parameter combinations."""
        results = []
        
        for temp, top_p in param_grid:
            combination_results = {
                'temperature': temp,
                'top_p': top_p,
                'outputs': [],
                'metrics': {}
            }
            
            for prompt in test_prompts:
                try:
                    response = client.chat.completions.create(
                        model="gpt-3.5-turbo",
                        messages=[{"role": "user", "content": prompt}],
                        temperature=temp,
                        top_p=top_p,
                        max_tokens=150
                    )
                    
                    combination_results['outputs'].append({
                        'prompt': prompt,
                        'response': response.choices[0].message.content
                    })
                except Exception as e:
                    print(f"Error with temp={temp}, top_p={top_p}: {e}")
            
            # Calculate metrics for this combination
            combination_results['metrics'] = self.calculate_metrics(combination_results['outputs'])
            results.append(combination_results)
        
        return results
    
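    def calculate_metrics(self, outputs: List[Dict[str, Any]]) -> Dict[str, float]:
        """Calculate stability and quality metrics."""
        responses = [output['response'] for output in outputs]

        # Every API call may have failed; avoid NaN from np.mean([])
        if not responses:
            return {'avg_length': 0.0, 'length_variance': 0.0,
                    'lexical_diversity': 0.0, 'consistency_score': 0.0}

        lengths = [len(r) for r in responses]  # lengths in characters
        return {
            'avg_length': float(np.mean(lengths)),
            'length_variance': float(np.var(lengths)),
            'lexical_diversity': self.calculate_lexical_diversity(responses),
            'consistency_score': self.calculate_consistency(responses)
        }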
    
    def calculate_lexical_diversity(self, responses: List[str]) -> float:
        """Calculate lexical diversity (unique words / total words)."""
        all_words = []
        for response in responses:
            words = re.findall(r'\b\w+\b', response.lower())
            all_words.extend(words)
        
        if not all_words:
            return 0.0
        
        unique_words = len(set(all_words))
        total_words = len(all_words)
        return unique_words / total_words
    
    def calculate_consistency(self, responses: List[str]) -> float:
        """Calculate consistency across responses."""
        if len(responses) < 2:
            return 1.0
        
        # Use TF-IDF to measure similarity
        vectorizer = TfidfVectorizer(stop_words='english')
        try:
            tfidf_matrix = vectorizer.fit_transform(responses)
            similarities = cosine_similarity(tfidf_matrix)
            
            # Calculate average similarity (excluding diagonal)
            n = len(responses)
            total_similarity = 0
            count = 0
            for i in range(n):
                for j in range(i+1, n):
                    total_similarity += similarities[i][j]
                    count += 1
            
            return total_similarity / count if count > 0 else 0.0
        except ValueError:  # raised when the responses contain no usable terms
            return 0.0

# Test parameter optimization
print("Testing Parameter Optimization...")

optimizer = ParameterOptimizer()

# Test different parameter profiles
for profile_name, params in optimizer.parameter_profiles.items():
    print(f"\n--- {profile_name.replace('_', ' ').title()} ---")
    print(f"Temperature: {params['temperature']}")
    print(f"Top_p: {params['top_p']}")
    print(f"Description: {params['description']}")

# Sample prompts for test_parameter_sensitivity (needs an API client, so it is not run here)
test_prompts = [
    "Write a creative story about artificial intelligence.",
    "Explain quantum computing in simple terms.",
    "What are the benefits of renewable energy?"
]
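
The sensitivity test needs a client, so it cannot run without an API key. One way to exercise the loop offline is a small test double that mimics the `client.chat.completions.create` call shape. The sketch below is an assumption-laden stand-in: `FakeChatClient` is hypothetical, it echoes the sampling settings instead of calling a model, and its outputs say nothing about real model behavior.

```python
from itertools import product
from types import SimpleNamespace


class FakeChatClient:
    """Offline stand-in with the same call shape as a chat completions client."""

    def __init__(self):
        # Expose client.chat.completions.create like the real client does
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, model, messages, temperature, top_p, max_tokens):
        text = f"[temp={temperature}, top_p={top_p}] {messages[-1]['content']}"
        # Character truncation is a crude stand-in for a token limit
        message = SimpleNamespace(content=text[:max_tokens])
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


# Every (temperature, top_p) pair from two small value lists
param_grid = list(product([0.0, 0.8], [0.3, 0.9]))

client = FakeChatClient()
for temp, top_p in param_grid:
    response = client.chat.completions.create(
        model="fake-model",
        messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
        temperature=temp,
        top_p=top_p,
        max_tokens=150,
    )
    print(response.choices[0].message.content)
```

In the notebook, `optimizer.test_parameter_sensitivity(FakeChatClient(), test_prompts, param_grid)` should then run the loop and the metrics without network access. Swap in a real client to measure actual sensitivity.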

## Exercise 4: LLM-as-Judge Evaluation Framework

Implement automated evaluation using LLMs as judges:

In [None]:
class LLMJudge:
    """Implement LLM-as-a-judge evaluation system."""
    
    def __init__(self, judge_model_client):
        self.judge_model = judge_model_client
    
    def evaluate_response(self, response: str, criteria: List[str], reference: Optional[str] = None) -> Dict[str, Any]:
        """Evaluate a response using LLM-as-judge."""
        # Build one score line per requested criterion so the prompt matches `criteria`
        score_lines = '\n'.join(f"- {c.title()}: [score] - [justification]" for c in criteria)
        reference_block = f"Reference answer: {reference}\n\n" if reference else ''
        evaluation_prompt = f"""
You are an expert evaluator. Assess the following response based on these criteria:

Criteria: {', '.join(criteria)}

{reference_block}Response to evaluate: {response}

Provide scores (1-10) for each criterion and brief justifications:
{score_lines}

Overall score: [average]/10
"""

        try:
            evaluation = self.judge_model.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": evaluation_prompt}],
                temperature=0  # keep judging as repeatable as possible
            )

            return self.parse_evaluation(evaluation.choices[0].message.content, criteria)
        except Exception as e:
            return {'error': str(e), 'overall_score': 0}

    def parse_evaluation(self, evaluation_text: str, criteria: Optional[List[str]] = None) -> Dict[str, Any]:
        """Parse evaluation results from judge response."""
        criteria = criteria or ['relevance', 'accuracy', 'clarity', 'completeness', 'style']
        scores = {}

        # Extract individual scores; tolerate case differences and "[8]"-style brackets
        for criterion in criteria:
            pattern = rf'{re.escape(criterion)}:\s*\[?(\d+)'
            match = re.search(pattern, evaluation_text, re.IGNORECASE)
            if match:
                scores[criterion.lower()] = int(match.group(1))

        # Extract overall score; fall back to the mean of the parsed scores
        overall_match = re.search(r'Overall score:\s*\[?(\d+(?:\.\d+)?)', evaluation_text, re.IGNORECASE)
        if overall_match:
            overall_score = float(overall_match.group(1))
        else:
            overall_score = float(np.mean(list(scores.values()))) if scores else 0.0

        return {
            'individual_scores': scores,
            'overall_score': overall_score,
            'full_evaluation': evaluation_text
        }

class ProductionPromptTemplate:
    """Production-ready prompt template management."""
    
    def __init__(self, template_dir: str = "templates"):
        self.template_dir = template_dir
        self.templates = {}
        self.version_history = {}
    
    def create_template(self, name: str, template: str, parameters: List[Dict[str, Any]], 
                       metadata: Optional[Dict[str, Any]] = None) -> str:
        """Create a new prompt template."""
        template_data = {
            'name': name,
            'version': '1.0.0',
            'template': template,
            'parameters': parameters,
            'metadata': metadata or {},
            'created_at': time.time(),
            'usage_count': 0
        }
        
        self.templates[name] = template_data
        self.version_history[f"{name}_v1.0.0"] = template_data.copy()
        
        return f"{name}_v1.0.0"
    
    def render_template(self, name: str, **kwargs) -> str:
        """Render template with provided parameters."""
        if name not in self.templates:
            raise ValueError(f"Template '{name}' not found")
        
        template_data = self.templates[name]
        template_str = template_data['template']
        
        # Validate required parameters
        required_params = [p['name'] for p in template_data['parameters'] if p.get('required', True)]
        missing_params = [p for p in required_params if p not in kwargs]
        
        if missing_params:
            raise ValueError(f"Missing required parameters: {missing_params}")
        
        # Render template
        try:
            rendered = template_str.format(**kwargs)
            self.templates[name]['usage_count'] += 1
            return rendered
        except KeyError as e:
            # The template references a placeholder that was not declared in `parameters`
            raise ValueError(f"Missing template parameter: {e}") from e

# Test the evaluation framework
print("Testing LLM-as-Judge Evaluation Framework...")

# Create sample responses; pass each to LLMJudge.evaluate_response once an API client is available
sample_responses = [
    "The capital of France is Paris. It is known for the Eiffel Tower and Louvre Museum.",
    "Paris is the capital city of France, located in Western Europe. It has a population of over 2 million people.",
    "France's capital is Paris, which is famous for landmarks like the Eiffel Tower."
]

print("Sample responses created for evaluation")

# Test production template system
print("\nTesting Production Template System...")

template_manager = ProductionPromptTemplate()

# Create a customer support template
support_template = """
You are a customer support agent for {company_name}.
You specialize in helping customers with {product_type} products.

Guidelines:
- Be polite and professional
- Ask clarifying questions when needed
- Provide step-by-step solutions
- Escalate complex issues when necessary

Customer Issue: {customer_issue}
"""

template_params = [
    {'name': 'company_name', 'type': 'string', 'required': True, 'description': 'Company name'},
    {'name': 'product_type', 'type': 'string', 'required': True, 'description': 'Type of product'},
    {'name': 'customer_issue', 'type': 'string', 'required': True, 'description': 'Customer issue description'}
]

template_version = template_manager.create_template(
    name="customer_support",
    template=support_template,
    parameters=template_params,
    metadata={'description': 'Customer support response template', 'category': 'support'}
)

print(f"Created template: {template_version}")

# Test rendering the template
rendered_prompt = template_manager.render_template(
    name="customer_support",
    company_name="TechCorp",
    product_type="software",
    customer_issue="I can't log into my account"
)

print("\nRendered Template:")
print(rendered_prompt)

# Test with missing parameter (should raise error)
try:
    template_manager.render_template(
        name="customer_support",
        company_name="TechCorp",
        product_type="software"
        # Missing customer_issue parameter
    )
except ValueError as e:
    print(f"\nExpected error for missing parameter: {e}")

print("\n✅ Prompt Engineering Lab Completed!")
print("\nKey Skills Learned:")
print("- System prompt design for different roles")
print("- CLEAR framework implementation")
print("- Parameter optimization and tuning")
print("- LLM-as-judge evaluation framework")
print("- Production-ready template management")

print("\nNext Steps:")
print("1. Experiment with different parameter combinations")
print("2. Create custom evaluation criteria for your use case")
print("3. Build a comprehensive prompt template library")
print("4. Implement A/B testing for prompt variations")
print("5. Add monitoring and analytics for production use")
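
The cell above creates sample responses but never calls a judge, since that needs an API client. To see the parsing step on its own, the sketch below applies the same regex idea to a canned judge reply. The reply text and its scores are invented for illustration, and the single-pattern approach is an assumption rather than the lab's own parser.

```python
import re
from statistics import mean

# Canned judge reply in the format the evaluation prompt requests (invented for illustration)
sample_judge_reply = """
- Relevance: 9 - Directly answers the question.
- Accuracy: 8 - Facts are correct.
- Clarity: 9 - Easy to follow.
- Completeness: 6 - Omits population and location.
- Style: 8 - Concise and neutral.

Overall score: 8.0/10
"""

# One pattern for every "- Name: score" line; the overall line has no leading dash
line_pattern = re.compile(r"^-\s*(\w+):\s*(\d+)", re.MULTILINE)
scores = {name.lower(): int(value) for name, value in line_pattern.findall(sample_judge_reply)}

overall = float(re.search(r"Overall score:\s*(\d+(?:\.\d+)?)", sample_judge_reply).group(1))
recomputed = mean(list(scores.values()))

print(scores)
print(f"Reported overall: {overall}, recomputed mean: {recomputed}")
```

Comparing the reported overall score with the recomputed mean is a cheap guard against arithmetic slips in the judge's reply.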

## Summary and Next Steps

Congratulations! You've completed the Prompt Engineering and Evaluation lab. Here's what you've learned:

### Key Skills Acquired:
- ✅ System prompt design for different professional roles
- ✅ CLEAR framework implementation for structured prompts
- ✅ Parameter optimization (temperature, top_p) for different use cases
- ✅ LLM-as-judge evaluation framework for automated assessment
- ✅ Production-ready prompt template management with versioning

### Best Practices:
- Always define clear roles and constraints in system prompts
- Use the CLEAR framework for complex prompt requirements
- Test parameter sensitivity for your specific use case
- Implement automated evaluation for quality assurance
- Version control your prompt templates for production use
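
The template manager in this lab only ever records version `1.0.0`. As a minimal sketch of the versioning practice, assuming `MAJOR.MINOR.PATCH` version strings and the `<name>_v<version>` key format used by `create_template`, a revised template could be stored under a bumped version. `bump_version` and the revised template text are hypothetical.

```python
def bump_version(version: str, part: str = "patch") -> str:
    """Increment one component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown version part: {part!r}")


# Keep every released template text under a "<name>_v<version>" key
history = {"customer_support_v1.0.0": "You are a support agent for {company_name}."}

new_version = bump_version("1.0.0", "minor")
history[f"customer_support_v{new_version}"] = (
    "You are a support agent for {company_name}. Always confirm the customer's issue first."
)

print(sorted(history))
```

Keeping old versions addressable makes it possible to roll back a prompt change or compare two versions on the same evaluation set.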

### Next Steps:
1. Experiment with different parameter combinations for your specific use cases
2. Create custom evaluation criteria tailored to your domain
3. Build a comprehensive library of prompt templates
4. Implement A/B testing frameworks for prompt optimization
5. Add monitoring and analytics for production prompt performance

Remember: Effective prompt engineering is an iterative process. Continuously test, evaluate, and refine your prompts based on real-world performance!