# LAB 1.2: PROMPT DEBUGGING AND OPTIMIZATION

**Course:** Advanced Prompt Engineering Training  
**Session:** Session 1 - Prompt Engineering Fundamentals Review  
**Duration:** 50 minutes  
**Difficulty:** ⭐⭐⭐☆☆  
**Type:** Hands-on Debugging & A/B Testing

## LAB OVERVIEW

This lab focuses on **systematic prompt debugging and optimization**. You'll work with broken or suboptimal prompts used in BFSI (banking, financial services, and insurance) fraud detection scenarios and learn to:

- Identify root causes of prompt failures
- Apply systematic debugging techniques
- Optimize prompts for accuracy, consistency, and cost
- Conduct rigorous A/B testing
- Measure and compare prompt performance

**Scenario:** You're a prompt engineer at a financial institution. The fraud detection team has been using AI to analyze suspicious transactions, but they're getting inconsistent results, hallucinations, and high false positive rates. Your job is to debug and optimize their prompts.

## LEARNING OBJECTIVES

By the end of this lab, you will be able to:

✓ Diagnose common prompt failure patterns  
✓ Apply systematic debugging methodology  
✓ Optimize prompts for accuracy and consistency  
✓ Conduct quantitative A/B testing  
✓ Measure performance improvements objectively  
✓ Balance accuracy, cost, and latency trade-offs

### Step 1: Import Libraries

In [None]:
# Lab 1.2: Prompt Debugging and Optimization
# Advanced Prompt Engineering Training - Session 1

import os
import json
from openai import OpenAI
import pandas as pd
import numpy as np
from datetime import datetime
import time
from typing import Dict, List, Tuple

print("✓ Libraries imported")

### Step 2: Configure OpenAI Client

In [None]:
# Read the API key from the environment
api_key = os.environ.get("OPENAI_API_KEY")

# Configuration
MODEL = os.getenv("MODEL_NAME")
TEMPERATURE = 0  # Minimizes sampling randomness for repeatable debugging

if not api_key:
    raise ValueError("OPENAI_API_KEY not found. Export it, or load your .env file before running this cell.")

if not MODEL:
    raise ValueError("MODEL_NAME not found. Export it, or load your .env file before running this cell.")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)

print(f"✓ Model: {MODEL}")
print(f"✓ Temperature: {TEMPERATURE}")

### Step 3: Create Helper Functions

In [None]:
def call_gpt4(prompt, system_prompt="You are a helpful AI assistant.", temperature=TEMPERATURE):
    """
    Wrapper for chat completion calls (using MODEL) with token and latency tracking
    
    Args:
        prompt (str): User prompt
        system_prompt (str): System prompt
        temperature (float): Sampling temperature
    
    Returns:
        dict: Response with content, tokens, and latency. On failure, content
        holds the error message and the token counts and latency are 0.
    """
    start_time = time.time()
    
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature
        )
        
        latency = time.time() - start_time
        
        return {
            "content": response.choices[0].message.content,
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
            "latency": latency
        }
    except Exception as e:
        return {
            "content": f"Error: {str(e)}",
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "total_tokens": 0,
            "latency": 0
        }

def compare_prompts(prompt_a, prompt_b, test_cases, system_prompt="You are a helpful AI assistant."):
    """
    A/B test two prompts against multiple test cases
    
    Both prompts run under the same system prompt, so only the user prompt varies.
    
    Args:
        prompt_a (str): First prompt template (baseline)
        prompt_b (str): Second prompt template (optimized)
        test_cases (list): List of dicts that supply the template placeholders
        system_prompt (str): System prompt shared by both prompts
    
    Returns:
        pd.DataFrame: Comparison results
    """
    results = []
    
    for i, test_case in enumerate(test_cases):
        # Test Prompt A
        response_a = call_gpt4(prompt_a.format(**test_case), system_prompt)
        
        # Test Prompt B
        response_b = call_gpt4(prompt_b.format(**test_case), system_prompt)
        
        results.append({
            "test_case": i + 1,
            "input": str(test_case),
            "response_a": response_a["content"],
            "response_b": response_b["content"],
            "tokens_a": response_a["total_tokens"],
            "tokens_b": response_b["total_tokens"],
            "latency_a": response_a["latency"],
            "latency_b": response_b["latency"]
        })
    
    return pd.DataFrame(results)

print("✓ Helper functions created")

### Step 4: Test Connection

In [None]:
# Test connection
test = call_gpt4("Say 'Ready for debugging' if you receive this.")
print(f"Response: {test['content']}")
print(f"Tokens used: {test['total_tokens']}")
print(f"Latency: {test['latency']:.2f}s")
print("\n✓ Connection verified")

## DEBUGGING METHODOLOGY

### The 5-Step Debugging Process

When a prompt fails or underperforms, follow this systematic approach:

```
1. IDENTIFY THE PROBLEM
   - What is the actual vs. expected output?
   - Is it a consistency issue, accuracy issue, or format issue?
   - Does it fail on all inputs or specific cases?

2. ISOLATE THE ROOT CAUSE
   - Vague instructions?
   - Missing context?
   - Conflicting constraints?
   - Model limitations?

3. HYPOTHESIZE A FIX
   - What specific change should address the root cause?
   - Will this change introduce new problems?

4. TEST THE FIX
   - Run the improved prompt on test cases
   - Compare against baseline quantitatively

5. VALIDATE & ITERATE
   - Does it work consistently across edge cases?
   - Are there side effects?
   - Can it be further optimized?
```

## CHALLENGE 1: VAGUE INSTRUCTIONS

**Time:** 10 minutes  
**Problem Type:** Unclear output format and inconsistent results

### Background

The fraud team wrote a prompt to analyze transactions, but the results are wildly inconsistent: sometimes one sentence, sometimes several paragraphs, and sometimes missing key information.

### Broken Prompt

In [None]:
# BROKEN PROMPT - DO NOT USE AS-IS

broken_prompt_v1 = """
Analyze this transaction for fraud:

Transaction: {transaction_details}

Is it fraudulent?
"""

system_prompt_v1 = "You are a fraud detection AI."

### Test Data

In [None]:
# Suspicious transaction test cases
test_transactions = [
    {
        "transaction_details": "Card ending 4523, $2,450 purchase at 'Electronics Warehouse', location: Lagos Nigeria, cardholder location: New York, time: 3:47 AM"
    },
    {
        "transaction_details": "Card ending 7891, $12.50 purchase at 'Starbucks', location: Seattle WA, cardholder location: Seattle WA, time: 8:15 AM"
    },
    {
        "transaction_details": "Card ending 3344, $8,900 purchase at 'Luxury Watches International', location: Dubai UAE, cardholder location: London UK, time: 2:30 PM"
    }
]

### Problem Analysis

In [None]:
# Run the broken prompt
print("BROKEN PROMPT OUTPUT:")
print("=" * 80)

for i, test in enumerate(test_transactions):
    response = call_gpt4(broken_prompt_v1.format(**test), system_prompt_v1)
    print(f"\nTest Case {i+1}:")
    print(f"Input: {test['transaction_details']}")
    print(f"Output: {response['content']}")
    print(f"Tokens: {response['total_tokens']}")
    print("-" * 80)

**Problems with this prompt:**
1. ❌ No output format specified (yes/no? explanation? confidence score?)
2. ❌ No analysis structure (what factors to consider?)
3. ❌ Results vary wildly in length and detail
4. ❌ No guidance on edge cases (what if data is ambiguous?)
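
Problem 3 can be quantified rather than eyeballed by comparing word counts across responses. The sketch below is a minimal illustration under stated assumptions: `length_spread` is a hypothetical helper that is not part of the lab's code, and it expects a non-empty list of plain response strings such as `response['content']`.

```python
from statistics import mean, pstdev

def length_spread(responses):
    """Summarize how much response length varies across a non-empty list of strings."""
    counts = [len(text.split()) for text in responses]
    return {
        "min_words": min(counts),
        "max_words": max(counts),
        "mean_words": mean(counts),
        "stdev_words": pstdev(counts),  # population standard deviation
    }

# Illustrative strings only, not real model output
sample = ["Yes.", "This looks suspicious because of the location mismatch."]
print(length_spread(sample))
```

A large gap between `min_words` and `max_words` on similar inputs is one signal that the prompt leaves the output format open.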

### Student Exercise

Debug and optimize this prompt to produce consistent, structured output.

In [None]:
# TODO: Write an improved version of the prompt
# Requirements:
# - Consistent output format
# - Structured analysis
# - Specific fraud indicators to check
# - Clear decision (Fraudulent / Suspicious / Legitimate)

improved_prompt_v1 = """
[WRITE YOUR IMPROVED PROMPT HERE]
"""

improved_system_prompt_v1 = """
[WRITE YOUR IMPROVED SYSTEM PROMPT HERE]
"""

# TODO: Test your improved prompt
# for test in test_transactions:
#     response = call_gpt4(improved_prompt_v1.format(**test), improved_system_prompt_v1)
#     print(response['content'])

### Solution

In [None]:
# SOLUTION: Structured, Explicit Prompt

optimized_prompt_v1 = """
Analyze this transaction for fraud indicators using the structured format below.

TRANSACTION DATA:
{transaction_details}

ANALYSIS REQUIRED:

1. LOCATION RISK:
   - Is purchase location far from cardholder location?
   - Assessment: [Low/Medium/High]

2. AMOUNT RISK:
   - Is amount unusual for this merchant type?
   - Assessment: [Low/Medium/High]

3. TIMING RISK:
   - Is transaction time suspicious (late night, unusual hours)?
   - Assessment: [Low/Medium/High]

4. MERCHANT RISK:
   - Is merchant type high-risk (electronics, jewelry, international)?
   - Assessment: [Low/Medium/High]

FINAL DECISION:
Based on the above factors, classify as:
- FRAUDULENT (3+ high-risk factors)
- SUSPICIOUS (2 high-risk factors, recommend review)
- LEGITIMATE (0-1 high-risk factors)

Provide your analysis now in exactly this format.
"""

optimized_system_prompt_v1 = """You are a fraud detection analyst. 
You analyze transactions using specific risk factors.
You always provide structured analysis in the exact format requested.
You never deviate from the analysis template."""

# Test optimized prompt
print("OPTIMIZED PROMPT OUTPUT:")
print("=" * 80)

for i, test in enumerate(test_transactions):
    response = call_gpt4(optimized_prompt_v1.format(**test), optimized_system_prompt_v1)
    print(f"\nTest Case {i+1}:")
    print(f"Input: {test['transaction_details'][:80]}...")
    print(f"Output:\n{response['content']}")
    print(f"Tokens: {response['total_tokens']}")
    print("-" * 80)

### Improvement Metrics

In [None]:
# Compare broken vs optimized
# Note: compare_prompts applies one system prompt to both, so this isolates the user prompt
comparison = compare_prompts(
    broken_prompt_v1, 
    optimized_prompt_v1, 
    test_transactions,
    optimized_system_prompt_v1
)

avg_tokens_a = comparison['tokens_a'].mean()
avg_tokens_b = comparison['tokens_b'].mean()

print("\nPERFORMANCE COMPARISON:")
print("=" * 80)
print(f"Average tokens - Broken: {avg_tokens_a:.0f}")
print(f"Average tokens - Optimized: {avg_tokens_b:.0f}")
# Positive means the optimized prompt uses more tokens (the longer template usually does)
print(f"Token change: {((avg_tokens_b - avg_tokens_a) / avg_tokens_a * 100):+.1f}%")
print(f"\nAverage latency - Broken: {comparison['latency_a'].mean():.2f}s")
print(f"Average latency - Optimized: {comparison['latency_b'].mean():.2f}s")
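
Tokens and latency say nothing about whether the decisions are right. One possible way to add an accuracy figure is to pull the decision label out of each response and compare it with a hand-assigned expected label. This is a sketch under stated assumptions: `extract_decision` and `accuracy` are hypothetical helpers, the expected labels must come from a human reviewer, and the matching is naive (a phrase such as "not fraudulent" would be misread as FRAUDULENT).

```python
import re

def extract_decision(text):
    """Return the first decision label after the last 'DECISION' heading, or None."""
    upper = text.upper()
    # If the response has no "DECISION" heading, this falls back to the whole text
    tail = upper.rsplit("DECISION", 1)[-1]
    match = re.search(r"\b(FRAUDULENT|SUSPICIOUS|LEGITIMATE)\b", tail)
    return match.group(1) if match else None

def accuracy(responses, expected):
    """Fraction of responses whose extracted label equals the expected label."""
    hits = sum(extract_decision(r) == e for r, e in zip(responses, expected))
    return hits / len(expected)

# Illustrative strings only, not real model output
responses = ["FINAL DECISION: Legitimate", "Yes, this looks like fraud."]
expected = ["LEGITIMATE", "FRAUDULENT"]
print(accuracy(responses, expected))
```

With reviewer-supplied labels for each test case, the same call could be applied to `comparison['response_a']` and `comparison['response_b']` to sit alongside the token and latency figures. A free-form answer that never names a label counts as a miss, which is itself a useful measure of format compliance.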

### Key Takeaways

✓ **Explicit structure** - Template ensures consistency  
✓ **Defined criteria** - Clear factors to evaluate  
✓ **Decision framework** - Objective thresholds (3+ high = fraudulent)  
✓ **System prompt alignment** - Reinforces structured output
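
Because the thresholds are stated as counts, the final label can be recomputed from the four assessments and compared with what the model wrote. The following is a minimal sketch, assuming the model keeps the `Assessment:` line prefix from the template and does not echo the `[Low/Medium/High]` placeholder; `decision_from_assessments` is a hypothetical helper, not part of the lab's code.

```python
import re

def decision_from_assessments(response_text):
    """Apply the prompt's thresholds to the 'Assessment:' lines in a response."""
    levels = re.findall(
        r"Assessment:\**\s*\[?\s*(Low|Medium|High)\b",
        response_text,
        flags=re.IGNORECASE,
    )
    if not levels:
        return None  # no structured assessments found, nothing to check
    high_count = sum(level.lower() == "high" for level in levels)
    if high_count >= 3:
        return "FRAUDULENT"
    if high_count == 2:
        return "SUSPICIOUS"
    return "LEGITIMATE"

# Illustrative text only, not real model output
sample = "Assessment: High\nAssessment: High\nAssessment: Medium\nAssessment: Low"
print(decision_from_assessments(sample))
```

A mismatch between this recomputed label and the model's stated decision suggests the model is not following its own decision framework.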

## CHALLENGE 2: MISSING CONTEXT

**Time:** 10 minutes  
**Problem Type:** Insufficient information leading to poor decisions

### Background

The fraud team's prompt isn't considering the customer's transaction history, leading to false positives (legitimate unusual purchases flagged as fraud).

In [None]:
# BROKEN PROMPT - Lacks historical context

broken_prompt_v2 = """
Transaction: {current_transaction}

Is this fraudulent? Yes or No.
"""

system_prompt_v2 = "You are a fraud detector."

In [None]:
# Test cases with transaction history
test_cases_with_history = [
    {
        "current_transaction": "$3,500 purchase at 'Apple Store', Tokyo Japan",
        "cardholder_location": "San Francisco, CA",
        "recent_history": "Last 5 transactions: $45 Whole Foods SF, $12 Starbucks SF, $89 Gas Station SF, $156 Amazon, $2,100 Apple Store SF"
    },
    {
        "current_transaction": "$450 purchase at 'Designer Handbags Online', location unknown",
        "cardholder_location": "Miami, FL",
        "recent_history": "Last 5 transactions: $12 McDonald's Miami, $35 Grocery Miami, $8 Coffee Miami, $15 Parking Miami, $28 Pharmacy Miami"
    },
    {
        "current_transaction": "$8,000 wire transfer to 'International Consulting Services Ltd'",
        "cardholder_location": "Chicago, IL",
        "recent_history": "Last 5 transactions: $7,500 wire to 'Global Business Partners', $6,200 wire to 'Overseas Contractors Inc', $5,800 wire to 'International Trade Co', $9,100 wire to 'Worldwide Suppliers', $7,300 wire to 'Foreign Consulting Group'"
    }
]

In [None]:
# Test broken prompt (no history context)
print("BROKEN PROMPT (No History Context):")
print("=" * 80)

for i, test in enumerate(test_cases_with_history):
    # Only passing current transaction - ignoring history
    response = call_gpt4(
        broken_prompt_v2.format(current_transaction=test['current_transaction']),
        system_prompt_v2
    )
    print(f"\nTest Case {i+1}:")
    print(f"Current: {test['current_transaction']}")
    print(f"History (NOT PROVIDED TO MODEL): {test['recent_history']}")
    print(f"Decision: {response['content']}")
    print("-" * 80)

### Student Exercise

In [None]:
# TODO: Write a context-aware prompt
# Requirements:
# - Include transaction history in analysis
# - Distinguish pattern breaks from consistent behavior
# - Lower false positives while maintaining fraud detection

improved_prompt_v2 = """
[WRITE YOUR CONTEXT-AWARE PROMPT HERE]
"""

# TODO: Test your improved prompt

### Solution

In [None]:
# SOLUTION: Context-Aware Analysis

optimized_prompt_v2 = """
Analyze this transaction in the context of the customer's recent behavior.

CURRENT TRANSACTION:
{current_transaction}

CARDHOLDER LOCATION:
{cardholder_location}

RECENT TRANSACTION HISTORY:
{recent_history}

ANALYSIS FRAMEWORK:

1. PATTERN CONSISTENCY:
   - Is this transaction consistent with recent spending patterns?
   - Similar merchant types, amounts, or locations in history?
   - Assessment: [Consistent / Deviation / Major Deviation]

2. BEHAVIORAL CONTEXT:
   - Does history suggest business expenses, travel, or specific interests?
   - Is there an established pattern this fits into?
   - Context: [Describe pattern if exists]

3. ANOMALY EVALUATION:
   - If this IS unusual, is it explainable? (e.g., travel, gift, one-time purchase)
   - Is the deviation suspicious or just atypical?
   - Risk Level: [Low / Medium / High]

4. FRAUD INDICATORS:
   - Sudden pattern break with high-risk characteristics?
   - Transaction type known for fraud (wire transfers, gift cards, crypto)?
   - Multiple red flags?

DECISION:
- FRAUDULENT: Clear fraud indicators, major pattern break, high risk
- SUSPICIOUS: Unusual but explainable, recommend verification with customer
- LEGITIMATE: Consistent with patterns OR reasonable deviation

Provide your structured analysis.
"""

optimized_system_prompt_v2 = """You are a senior fraud analyst with expertise in behavioral analysis.
You always consider transaction history and spending patterns.
You distinguish between unusual-but-legitimate and genuinely fraudulent activity.
You minimize false positives while maintaining security."""

# Test optimized prompt
print("OPTIMIZED PROMPT (With Context):")
print("=" * 80)

for i, test in enumerate(test_cases_with_history):
    response = call_gpt4(optimized_prompt_v2.format(**test), optimized_system_prompt_v2)
    print(f"\nTest Case {i+1}:")
    print(f"Current: {test['current_transaction']}")
    print(f"\nAnalysis:\n{response['content']}")
    print("-" * 80)

### Key Takeaways

✓ **Context is king** - Transaction history helps separate unusual-but-legitimate activity from fraud  
✓ **Pattern recognition** - Distinguish breaks from consistency  
✓ **Behavioral analysis** - Understand customer profiles  
✓ **Explainability** - Show WHY a decision makes sense
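
Since this challenge targets false positives, it helps to measure them directly. The sketch below assumes each test case has a reviewer-assigned true label, and it treats any prediction other than LEGITIMATE as a flag; both choices are assumptions, and `false_positive_rate` is a hypothetical helper.

```python
def false_positive_rate(predicted, actual):
    """Share of truly legitimate cases that were flagged (not predicted LEGITIMATE)."""
    legit_predictions = [p for p, a in zip(predicted, actual) if a == "LEGITIMATE"]
    if not legit_predictions:
        return None  # undefined when there are no legitimate cases
    flagged = sum(p != "LEGITIMATE" for p in legit_predictions)
    return flagged / len(legit_predictions)

# Illustrative labels only, not results from the lab's test cases
predicted = ["FRAUDULENT", "LEGITIMATE", "FRAUDULENT"]
actual = ["LEGITIMATE", "LEGITIMATE", "FRAUDULENT"]
print(false_positive_rate(predicted, actual))
```

Comparing this rate for the broken and context-aware prompts on the same labeled cases shows whether adding history actually reduces false alarms.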

## CHALLENGE 3: HALLUCINATION ISSUES

**Time:** 10 minutes  
**Problem Type:** Model inventing facts not present in input

### Background

The fraud team noticed that the AI sometimes "invents" transaction details that weren't provided, which is extremely dangerous where financial compliance is concerned.

In [None]:
# BROKEN PROMPT - Encourages hallucination

broken_prompt_v3 = """
Analyze this transaction for fraud. Consider all relevant factors including 
the customer's age, income level, credit score, and previous fraud history.

Transaction: {transaction}

Provide a comprehensive fraud analysis.
"""

system_prompt_v3 = "You are a fraud detection expert."

In [None]:
# Transactions with LIMITED information
limited_info_transactions = [
    {
        "transaction": "Card 9876, $450 purchase, merchant: Online Gaming Site"
    },
    {
        "transaction": "Card 5544, $2,100 purchase, merchant: Cash4Gold, location: Unknown"
    },
    {
        "transaction": "Card 3322, $85 purchase, merchant: Gas Station, location: Highway 101"
    }
]

In [None]:
# Demonstrate hallucination
print("HALLUCINATION DEMONSTRATION:")
print("=" * 80)

for i, test in enumerate(limited_info_transactions):
    response = call_gpt4(broken_prompt_v3.format(**test), system_prompt_v3)
    print(f"\nTest Case {i+1}:")
    print(f"Input: {test['transaction']}")
    print(f"Output:\n{response['content']}")
    print("\n⚠ HALLUCINATION CHECK: Did the model mention age, income, credit score, or fraud history?")
    print("   (These were NOT in the input!)")
    print("-" * 80)

### Student Exercise

In [None]:
# TODO: Write a hallucination-resistant prompt
# Requirements:
# - Only analyze data actually provided
# - Explicitly acknowledge missing information
# - Do NOT invent or assume missing details

improved_prompt_v3 = """
[WRITE YOUR HALLUCINATION-RESISTANT PROMPT HERE]
"""

# TODO: Test your prompt

### Solution

In [None]:
# SOLUTION: Grounded, Hallucination-Resistant Prompt

optimized_prompt_v3 = """
Analyze this transaction for fraud using ONLY the information provided below.

CRITICAL INSTRUCTIONS:
1. Use ONLY the transaction data provided
2. Do NOT assume or invent information not given
3. If important information is missing, explicitly state "Information not available"
4. Base your analysis solely on what is present

TRANSACTION DATA:
{transaction}

ANALYSIS USING ONLY PROVIDED DATA:

1. MERCHANT RISK:
   - Merchant type: [from transaction]
   - Risk level based on merchant: [Low/Medium/High]

2. AMOUNT ANALYSIS:
   - Transaction amount: [from transaction]
   - General risk for this amount: [Low/Medium/High]

3. LOCATION RISK (if provided):
   - Location: [from transaction or "Not provided"]
   - Risk assessment: [Only if location given]

4. DATA LIMITATIONS:
   - What critical information is MISSING?
   - List: [customer history, location, time, etc.]

FRAUD ASSESSMENT:
Based ONLY on available data:
- Risk Level: [Low/Medium/High/INSUFFICIENT DATA]
- Recommendation: [If insufficient data, state "Require additional information"]

Provide analysis now using ONLY the data given.
"""

optimized_system_prompt_v3 = """You are a fraud analyst bound by strict data integrity rules.
You NEVER assume or invent information not explicitly provided.
You ALWAYS acknowledge data limitations.
You NEVER make up customer details, history, or context.
If data is insufficient, you say so clearly."""

# Test optimized prompt
print("HALLUCINATION-RESISTANT PROMPT:")
print("=" * 80)

for i, test in enumerate(limited_info_transactions):
    response = call_gpt4(optimized_prompt_v3.format(**test), optimized_system_prompt_v3)
    print(f"\nTest Case {i+1}:")
    print(f"Input: {test['transaction']}")
    print(f"\nAnalysis:\n{response['content']}")
    print("\n✓ VERIFICATION: Did model only use provided data?")
    print("-" * 80)

### Validation Test

In [None]:
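# Explicitly test for hallucination
import re

def check_for_hallucination(input_data, output_text):
    """
    Flag keywords that appear in the output but not in the input.

    This is a keyword heuristic, not proof of hallucination: a response that
    says "credit score: not available" is still flagged and needs manual review.
    """
    hallucination_keywords = [
        "age", "income", "credit score", "fraud history", 
        "previous", "customer profile", "demographics"
    ]
    
    def mentions(keyword, text):
        # Whole-word match so "age" does not match "percentage" or "usage"
        return re.search(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE) is not None
    
    return [
        keyword for keyword in hallucination_keywords
        if mentions(keyword, output_text) and not mentions(keyword, input_data)
    ]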

# Test both prompts
print("\nHALLUCINATION TEST:")
print("=" * 80)

test = limited_info_transactions[0]

# Broken prompt
response_broken = call_gpt4(broken_prompt_v3.format(**test), system_prompt_v3)
hallucinations_broken = check_for_hallucination(test['transaction'], response_broken['content'])

# Optimized prompt  
response_optimized = call_gpt4(optimized_prompt_v3.format(**test), optimized_system_prompt_v3)
hallucinations_optimized = check_for_hallucination(test['transaction'], response_optimized['content'])

print(f"Broken Prompt Hallucinations: {hallucinations_broken if hallucinations_broken else 'None detected'}")
print(f"Optimized Prompt Hallucinations: {hallucinations_optimized if hallucinations_optimized else 'None detected'}")
print("=" * 80)

### Key Takeaways

✓ **Explicit constraints** - "Use ONLY provided data"  
✓ **Acknowledge gaps** - State what's missing  
✓ **System prompt alignment** - Reinforce no hallucination  
✓ **Validation critical** - Test for invented information

💡 **Why This Matters in BFSI:**
- Regulatory compliance requires auditable facts
- Hallucinated data = legal liability
- Financial decisions must be based on truth
- Trust and accuracy are non-negotiable
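
Keyword checks catch invented attributes but not invented figures. As a complementary sketch, the function below flags dollar amounts that appear in a response but not in the transaction text. It is an illustration under stated assumptions: `ungrounded_amounts` is a hypothetical helper, it only recognizes amounts written with a leading `$`, and a flagged amount still needs human review (the model may have legitimately computed it).

```python
import re

AMOUNT_PATTERN = re.compile(r"\$\d[\d,]*(?:\.\d+)?")

def _to_number(amount_text):
    """Convert a matched string such as '$2,100.50' to a float."""
    return float(amount_text.lstrip("$").replace(",", ""))

def ungrounded_amounts(input_text, output_text):
    """Return dollar amounts found in the output that never appear in the input."""
    provided = {_to_number(m) for m in AMOUNT_PATTERN.findall(input_text)}
    mentioned = {_to_number(m) for m in AMOUNT_PATTERN.findall(output_text)}
    return sorted(mentioned - provided)

# Illustrative output text only, not real model output
transaction = "Card 9876, $450 purchase, merchant: Online Gaming Site"
output = "The $450.00 charge is well above the customer's usual $50 spend."
print(ungrounded_amounts(transaction, output))
```

Here the usual-spend figure is flagged because nothing in the transaction supports it, while `$450.00` is treated as the same value as the provided `$450`.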

## LAB SUMMARY

### Key Debugging Principles

1. **Always test systematically** - Use A/B comparisons
2. **Measure objectively** - Tokens, latency, accuracy, cost
3. **Iterate incrementally** - Change one thing at a time
4. **Validate thoroughly** - Test edge cases after changes
5. **Document improvements** - Track what works and why

### Production Readiness

Your prompts are production-ready when:

✅ Deterministic settings (temperature=0 for BFSI; this reduces but does not eliminate run-to-run variation)  
✅ Grounded (no hallucination)  
✅ Consistent (same input → same output)  
✅ Efficient (optimized token usage)  
✅ Accurate (validated against test suite)  
✅ Auditable (clear reasoning trail)
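
The consistency criterion can be checked empirically by sending the same prompt several times and counting distinct outputs. The sketch below is one possible way to do that; `consistency_report` is a hypothetical helper, and it takes any callable that maps a prompt string to a response string so that it runs without an API key.

```python
from collections import Counter

def consistency_report(generate, prompt, runs=5):
    """Call generate(prompt) several times and summarize how often outputs agree."""
    outputs = [generate(prompt).strip() for _ in range(runs)]
    counts = Counter(outputs)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "distinct_outputs": len(counts),
        "agreement": most_common_count / runs,  # share of runs giving the modal output
    }

# Stand-in generator so the sketch runs offline
def fake_generate(prompt):
    return "LEGITIMATE"

print(consistency_report(fake_generate, "example prompt"))
```

In this lab, `generate` could be a small wrapper such as `lambda p: call_gpt4(p, system_prompt)["content"]`. Exact string matching is strict, so comparing only the extracted decision label may be a more practical test for long structured responses.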