# LLM Lab: Prompt Engineering Evaluation
## Supply Chain Optimization Domain

This notebook evaluates how prompt phrasing, structure, and constraints influence LLM response quality, scored here on accuracy, completeness, observed failure behaviors, and peer-rated clarity.

**Team Roles:**
- Prompt Architect
- Evaluation Engineer
- Safety & Mitigation Analyst
- MLOps Integrator
- Technical Communicator

## Setup & Configuration

In [None]:
# Install dependencies if needed
# !pip install -r ../requirements.txt

In [None]:
import os
import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add src to path
sys.path.insert(0, '..')

from src.llm_clients import LLMClient, load_config
from src.prompts import PromptBuilder
from src.evaluator import ResponseEvaluator, create_evaluation_summary
from src.visualizations import generate_all_visualizations

# Set your API key
# Option 1: Set in environment
# os.environ['GOOGLE_API_KEY'] = 'your-api-key-here'

# Option 2: Export GOOGLE_API_KEY in the terminal before launching Jupyter

In [None]:
# Load configuration
config = load_config('../config/experiment_config.yaml')
print(f"Domain: {config['experiment']['domain']}")
print(f"Model: {config['models']['primary']['name']}")
print(f"Temperature: {config['models']['primary']['parameters']['temperature']}")

## Task 1: Design Prompt Variants

We test 5 distinct prompt designs:
1. **P1_direct** - Naive, no constraints
2. **P2_constrained** - Explicit format requirements
3. **P3_role_based** - Expert persona
4. **P4_reasoning_step** - Chain-of-Thought
5. **P5_context_first** - Context before instruction
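
To make the five designs concrete, here is a minimal sketch of what each variant could look like as a template. The wording, the `VARIANT_TEMPLATES` dictionary, the `render_variants` helper, and the sample query and context are all illustrative assumptions; the prompts actually used in this lab are defined in `experiment_config.yaml` and built by `PromptBuilder`.

```python
from typing import Dict

# Hypothetical query and context, used only to render the templates below.
QUERY = "How should a mid-sized retailer set reorder points?"
CONTEXT = "Supplier lead time is 10 days; daily demand averages 40 units."

# Illustrative wording only; the real prompt text lives in the experiment config.
VARIANT_TEMPLATES: Dict[str, str] = {
    "P1_direct": "{query}",
    "P2_constrained": "{query}\nAnswer in exactly 5 numbered steps, under 150 words.",
    "P3_role_based": "You are a senior supply chain analyst.\n{query}",
    "P4_reasoning_step": "{query}\nThink step by step, then state the final answer.",
    "P5_context_first": "Context: {context}\n\nUsing only the context above, answer: {query}",
}


def render_variants(query: str, context: str) -> Dict[str, str]:
    """Fill every template with the same query (and context where it is used)."""
    return {
        variant_id: template.format(query=query, context=context)
        for variant_id, template in VARIANT_TEMPLATES.items()
    }


rendered = render_variants(QUERY, CONTEXT)
print(rendered["P5_context_first"])
```

Holding the query constant across variants means any difference in the responses can be attributed to the prompt design rather than to the question.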

In [None]:
# Build all prompts
prompt_builder = PromptBuilder('../config/experiment_config.yaml')
prompts_table = prompt_builder.get_prompts_table()

# Display prompt variants table
df_prompts = pd.DataFrame(prompts_table)
df_prompts

In [None]:
# Preview a specific prompt (e.g., Chain-of-Thought)
print("=" * 60)
print("P4 Reasoning Step (CoT) Prompt:")
print("=" * 60)
print(prompt_builder.build_prompt('P4_reasoning_step'))

## Run Experiments

Execute each prompt variant and collect its response, token count, and latency.

In [None]:
# Initialize LLM client
client = LLMClient(config)
print(f"Using model: {client.get_model_info()}")

In [None]:
# Run all prompt variants
responses = {}
all_prompts = prompt_builder.build_all_prompts()

for variant_id, data in all_prompts.items():
    print(f"Running {variant_id}: {data['name']}...")
    try:
        result = client.generate(data['prompt'])
        responses[variant_id] = {
            'name': data['name'],
            'description': data['description'],
            'prompt': data['prompt'],
            'response': result['response'],
            'token_count': result['token_count'],
            'latency_ms': result['latency_ms']
        }
        print(f"  ✓ Completed ({result['token_count']} tokens, {result['latency_ms']}ms)")
    except Exception as e:
        print(f"  ✗ Error: {e}")
        responses[variant_id] = {'error': str(e)}

completed = sum('response' in r for r in responses.values())
print(f"\nCompleted {completed}/{len(all_prompts)} variants")

In [None]:
# View a sample response
sample_variant = 'P1_direct'
if sample_variant in responses and 'response' in responses[sample_variant]:
    print(f"Response from {sample_variant}:")
    print("=" * 60)
    print(responses[sample_variant]['response'])

## Task 2: Evaluate Limitations and Mitigations

Analyze responses for:
- Factual hallucinations
- Logical inconsistencies
- Overconfidence
- Missing key details
- Over-elaboration
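
Some of these failure modes can be pre-screened with simple heuristics before manual review. The sketch below is not the project's `ResponseEvaluator`; the `flag_failures` function, its word lists, and the 300-word threshold are assumptions chosen for illustration. It covers only overconfidence, missing key details, and over-elaboration, because factual hallucinations and logical inconsistencies generally need a human reviewer or a reference answer.

```python
import re
from typing import Dict, List

# Hypothetical word lists; tune them for your domain.
HEDGE_WORDS = {"may", "might", "could", "likely", "approximately", "typically", "depends"}
ABSOLUTE_WORDS = {"always", "never", "guaranteed", "certainly", "definitely"}


def flag_failures(response: str, required_terms: List[str], max_words: int = 300) -> Dict:
    """Return rough heuristic flags for three of the failure modes listed above."""
    lowered = response.lower()
    words = re.findall(r"[a-z']+", lowered)
    word_set = set(words)
    return {
        # Absolute claims with no hedging language at all.
        "overconfidence": bool(word_set & ABSOLUTE_WORDS) and not (word_set & HEDGE_WORDS),
        # Required terms that never appear in the response.
        "missing_key_details": [t for t in required_terms if t.lower() not in lowered],
        # Response longer than the chosen word budget.
        "over_elaboration": len(words) > max_words,
    }


sample = "Safety stock always prevents stockouts. This is guaranteed."
print(flag_failures(sample, ["safety stock", "lead time"]))
```

Keyword checks like these produce false positives and false negatives, so they are best treated as a triage step that decides which responses get read first.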

In [None]:
# Initialize evaluator
evaluator = ResponseEvaluator('../config/experiment_config.yaml')

# Evaluate all responses
evaluations = {}

for variant_id, data in responses.items():
    if 'response' in data:
        eval_result = evaluator.full_evaluation(
            response=data['response'],
            token_count=data['token_count']
        )
        evaluations[variant_id] = eval_result
        
        # Print summary
        summary = eval_result['summary']
        print(f"{variant_id}: Accuracy={summary['accuracy_score']}/2, "
              f"Completeness={summary['completeness_pct']}%, "
              f"Issues={summary['issue_count']}")

In [None]:
# Detailed failure analysis for one variant
analyze_variant = 'P1_direct'
if analyze_variant in evaluations:
    failures = evaluations[analyze_variant]['failure_behaviors']
    print(f"\nFailure Analysis for {analyze_variant}:")
    print("-" * 40)
    for issue in failures['issues']:
        print(f"• [{issue['severity']}] {issue['type']}")
        print(f"  {issue['description']}")

### Mitigation Strategies

Based on observed failures, we can apply:
1. **Chain-of-Thought (CoT)** - For reasoning improvement
2. **Source Checking Requests** - "Cite sources" instruction
3. **Confidence Calibration** - "Express uncertainty where appropriate"
4. **Output Validation** - "Verify your answer is complete"
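
One way to apply these mitigations is to append the corresponding instruction to a base prompt. The following is a minimal sketch: the `apply_mitigations` helper and the dictionary keys are hypothetical, and the instruction wording follows the quoted phrases in the list above, with the Chain-of-Thought line being an assumed phrasing.

```python
from typing import Dict, List

# Instruction text per mitigation; keys are hypothetical short names.
MITIGATION_INSTRUCTIONS: Dict[str, str] = {
    "cot": "Work through the problem step by step before giving the final answer.",
    "source_check": "Cite sources for factual claims.",
    "confidence": "Express uncertainty where appropriate.",
    "validation": "Before finishing, verify your answer is complete.",
}


def apply_mitigations(base_prompt: str, mitigations: List[str]) -> str:
    """Append one bullet per requested mitigation to the base prompt."""
    unknown = [m for m in mitigations if m not in MITIGATION_INSTRUCTIONS]
    if unknown:
        raise ValueError(f"Unknown mitigations: {unknown}")
    bullets = [f"- {MITIGATION_INSTRUCTIONS[m]}" for m in mitigations]
    return "\n".join([base_prompt.strip(), ""] + bullets)


print(apply_mitigations("How do I reduce stockouts?", ["confidence", "validation"]))
```

Keeping each mitigation as a separate, named instruction makes it possible to test them one at a time and see which one actually changes the failure counts.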

In [None]:
# Document mitigation observations (the rows below are starting hypotheses; revise them after running experiments)
mitigation_notes = """
## Observed Issues & Mitigations

| Issue | Mitigation | Variant that helps |
|-------|------------|-------------------|
| Overconfidence | Add uncertainty language request | P4_reasoning_step |
| Missing details | Explicit checklist in prompt | P2_constrained |
| Over-elaboration | Word count limits | P2_constrained |
| Hallucination | Request reasoning steps | P4_reasoning_step |

### Key Insight:
Add your team's observations here after running experiments.
"""
print(mitigation_notes)

## Task 3: Quantitative and Qualitative Evaluation

In [None]:
# Create summary table
summary_rows = create_evaluation_summary(evaluations)
df_summary = pd.DataFrame(summary_rows)
df_summary

In [None]:
# Add peer clarity ratings (manual input)
# Team members should rate each response 1-5 on clarity

# Placeholder values for illustration; replace them with your team's ratings after peer review
clarity_ratings = {
    'P1_direct': 3,
    'P2_constrained': 4,
    'P3_role_based': 4,
    'P4_reasoning_step': 5,
    'P5_context_first': 4
}

# Update evaluations with clarity scores
for variant_id, rating in clarity_ratings.items():
    if variant_id in evaluations:
        evaluations[variant_id]['clarity_score'] = rating
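
If several team members rate each response, their scores can be averaged into a single clarity rating per variant. This is a sketch under assumptions: the `aggregate_clarity` helper, the reviewer names, and the ratings are placeholders, not collected data.

```python
from statistics import mean
from typing import Dict, List

# Placeholder ratings from two hypothetical reviewers (1-5 scale).
peer_ratings = {
    "reviewer_a": {"P1_direct": 3, "P2_constrained": 4},
    "reviewer_b": {"P1_direct": 2, "P2_constrained": 5},
}


def aggregate_clarity(ratings_by_reviewer: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Average the 1-5 clarity ratings each variant received across reviewers."""
    by_variant: Dict[str, List[int]] = {}
    for reviewer, ratings in ratings_by_reviewer.items():
        for variant_id, rating in ratings.items():
            if not 1 <= rating <= 5:
                raise ValueError(f"{reviewer} gave {variant_id} an out-of-range rating: {rating}")
            by_variant.setdefault(variant_id, []).append(rating)
    return {variant_id: round(mean(scores), 2) for variant_id, scores in by_variant.items()}


print(aggregate_clarity(peer_ratings))
```

The resulting dictionary has the same shape as `clarity_ratings`, so it could be passed through the same update loop.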

In [None]:
# Generate all visualizations (os is already imported in the setup cell)
os.makedirs('../results', exist_ok=True)

figs = generate_all_visualizations(evaluations, '../results')
plt.show()

In [None]:
# Display accuracy comparison
from src.visualizations import plot_accuracy_comparison
plot_accuracy_comparison(evaluations)
plt.show()

In [None]:
# Display radar chart
from src.visualizations import plot_radar_chart
plot_radar_chart(evaluations)
plt.show()

## Connection to Few-Shot and RAG

**How these prompting techniques scale:**

1. **Few-Shot Prompting**: Add one or more worked examples to P2_constrained or P4_reasoning_step so the model can imitate the target structure
2. **Retrieval-Augmented Generation (RAG)**: P5_context_first extends naturally to RAG by
   replacing its static context with documents retrieved per query
3. **Production Systems**: Combine structured output (P2), reasoning (P4), and dynamic context (P5)
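
The step from a context-first prompt to RAG can be sketched as follows. Everything here is an illustrative assumption: the document snippets are made up, and `retrieve` uses naive word overlap as a stand-in for the embedding-based search a real RAG system would typically use.

```python
import re
from typing import List, Set

# Hypothetical knowledge base; a real system would query a document store.
DOCUMENTS = [
    "Safety stock buffers against demand variability during supplier lead time.",
    "ABC analysis ranks inventory items by annual consumption value.",
    "Carrier contracts are usually renegotiated once a year.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_context_first_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Place the retrieved context before the instruction, as in P5_context_first."""
    retrieved = retrieve(query, documents, k)
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved, start=1))
    return f"Context:\n{context}\n\nUsing only the context above, answer:\n{query}"


print(build_context_first_prompt("How much safety stock do I need for a long lead time?", DOCUMENTS))
```

Only the source of the context changes; the prompt layout stays the same, which is why the context-first variant is a convenient starting point for RAG experiments.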

In [None]:
# Example: Few-shot extension of P2_constrained
few_shot_prompt = """
Here is an example of a well-structured inventory optimization answer:

Example Query: How to manage inventory for a retail store?
Example Answer:
1. Implement ABC analysis to prioritize high-value items
2. Use the EOQ formula: EOQ = √(2DS/H), where D = annual demand, S = cost per order, H = annual holding cost per unit
3. Set reorder point: Lead time demand + Safety stock
4. Monitor with KPIs: turnover ratio, stockout rate
5. Review quarterly and adjust for seasonality

Now answer the following using the same structure:

{query}
"""

print("Few-shot prompt template created.")
print("This extends P2_constrained with a concrete example.")
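
To use a template like the one above, the `{query}` placeholder has to be filled before the prompt is sent. Below is a minimal sketch with a shortened template and a hypothetical `fill_template` helper. It uses `str.replace` rather than `str.format` so that any literal braces in the example text (for instance a JSON snippet) would not need escaping.

```python
# Shortened stand-in for the few-shot template, kept small for illustration.
TEMPLATE = """Here is an example of a well-structured inventory optimization answer:

Example Query: How to manage inventory for a retail store?
Example Answer:
1. Implement ABC analysis to prioritize high-value items
2. Review quarterly and adjust for seasonality

Now answer the following using the same structure:

{query}
"""


def fill_template(template: str, query: str) -> str:
    """Substitute the user's question for the {query} placeholder."""
    if "{query}" not in template:
        raise ValueError("Template has no {query} placeholder")
    return template.replace("{query}", query.strip())


filled = fill_template(TEMPLATE, "How should a warehouse set reorder points?")
print(filled)
```

The filled string can then be passed to the client in the same way as the other prompt variants.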

## Save Results

In [None]:
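# Save responses to JSON for reproducibility
import json

# Create the output directory in case the visualization cell was skipped
os.makedirs('../results', exist_ok=True)

with open('../results/responses.json', 'w', encoding='utf-8') as f:
    json.dump(responses, f, indent=2, ensure_ascii=False)

with open('../results/evaluations.json', 'w', encoding='utf-8') as f:
    # default=str stringifies any value the json module cannot encode natively
    # (assumption: evaluations may contain such values, e.g. NumPy scalars)
    json.dump(evaluations, f, indent=2, default=str)

print("Results saved to ../results/")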

## Summary & Next Steps

### Key Findings (Fill in after experiments):
1. Best overall prompt variant: _____
2. Most common failure mode: _____
3. Most effective mitigation: _____

### Recommendations:
- For accuracy: Use _____
- For efficiency: Use _____
- For production: Combine _____ + _____