# Error Analysis for JO's Bike Shop Multi-Turn Chatbot

This notebook implements the error analysis methodology described in `error_analysis_plan.md` using open coding and axial coding techniques.

## Part 1: Define Dimensions & Generate Initial Queries

### 1.1 Identified Dimensions

**Dimension 1: Primary Intent** (conversation_mode)
- `shop_info`, `product_inquiry`, `book_appointment`, `maintenance_tips`, `policy_question`, `recall_booking`, `what_can_you_do`

**Dimension 2: Information Completeness** (booking_completeness)
- `complete_upfront`, `partial_info`, `minimal_info`, `no_info`

**Dimension 3: Conversation Complexity** (interaction_pattern)
- `single_turn`, `multi_turn_linear`, `multi_turn_complex`, `mixed_intent`

**Dimension 4: User Behavior Pattern** (user_behavior)
- `cooperative`, `conversational`, `corrective`, `distracted`, `canceling`
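
For a sense of scale, the four dimensions above span 7 × 4 × 4 × 5 = 560 possible tuples, so a set of 20 generated scenarios is a small sample of the space. A minimal sketch that enumerates the full space with `itertools.product` (the `DIMENSIONS` dict is a name introduced here for illustration; its values are copied from the lists above):

```python
from itertools import product

# Allowed values per dimension, copied from section 1.1.
# DIMENSIONS is an illustrative name, not part of the original notebook.
DIMENSIONS = {
    "primary_intent": [
        "shop_info", "product_inquiry", "book_appointment", "maintenance_tips",
        "policy_question", "recall_booking", "what_can_you_do",
    ],
    "completeness": ["complete_upfront", "partial_info", "minimal_info", "no_info"],
    "interaction_pattern": [
        "single_turn", "multi_turn_linear", "multi_turn_complex", "mixed_intent",
    ],
    "user_behavior": [
        "cooperative", "conversational", "corrective", "distracted", "canceling",
    ],
}

# Cartesian product of all dimension values
all_tuples = list(product(*DIMENSIONS.values()))
print(len(all_tuples))  # 560
```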

### 1.2 Generate Unique Combinations (Tuples)

We use an LLM to generate a diverse set of scenario tuples, each combining one value from every dimension above.

In [1]:
import os
from openai import OpenAI
import nest_asyncio
nest_asyncio.apply()

from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [2]:
# Prompt for generating test scenario combinations
combination_prompt = """You are helping create test scenarios for a bike shop chatbot that handles multi-turn conversations. 

Generate 20 unique test scenario combinations using these dimensions:

1. Primary Intent: shop_info, product_inquiry, book_appointment, maintenance_tips, policy_question, recall_booking, what_can_you_do
2. Information Completeness: complete_upfront, partial_info, minimal_info, no_info
3. Interaction Pattern: single_turn, multi_turn_linear, multi_turn_complex, mixed_intent
4. User Behavior: cooperative, conversational, corrective, distracted, canceling

For each combination, output as a tuple: (primary_intent, completeness, interaction_pattern, user_behavior)

Ensure diverse coverage:
- Include at least 5 book_appointment scenarios (this is the main multi-turn feature)
- Include at least 2 recall_booking scenarios
- Include at least 3 scenarios with corrective or distracted behavior
- Vary the interaction patterns to test different conversation flows

Output format: One tuple per line, e.g.:
(book_appointment, partial_info, multi_turn_linear, cooperative)

Return ONLY the tuples, one per line, no additional text or explanation."""

# Call OpenAI API to generate combinations
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that generates test scenarios for chatbot evaluation."},
        {"role": "user", "content": combination_prompt}
    ],
    temperature=0.8,  # Higher temperature for more diverse combinations
)

# Extract and display the generated combinations
generated_combinations = response.choices[0].message.content
print(generated_combinations)

(shop_info, complete_upfront, single_turn, cooperative)  
(product_inquiry, partial_info, multi_turn_linear, conversational)  
(book_appointment, minimal_info, multi_turn_complex, cooperative)  
(maintenance_tips, no_info, mixed_intent, corrective)  
(policy_question, complete_upfront, single_turn, distracted)  
(book_appointment, partial_info, mixed_intent, distracted)  
(recall_booking, complete_upfront, single_turn, cooperative)  
(what_can_you_do, no_info, single_turn, conversational)  
(book_appointment, complete_upfront, multi_turn_linear, cooperative)  
(maintenance_tips, partial_info, multi_turn_linear, cooperative)  
(product_inquiry, minimal_info, multi_turn_complex, corrective)  
(recall_booking, minimal_info, multi_turn_linear, cooperative)  
(book_appointment, no_info, multi_turn_complex, distracted)  
(shop_info, partial_info, multi_turn_complex, cooperative)  
(maintenance_tips, complete_upfront, multi_turn_linear, corrective)  
(book_appointment, partial_info, mixed_int

In [3]:
# Parse the combinations into a structured list
import re

# Extract tuples from the text
tuple_pattern = r'\([^)]+\)'
tuples_raw = re.findall(tuple_pattern, generated_combinations)

# Parse each tuple
test_scenarios = []
for tuple_str in tuples_raw:
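    # Remove parentheses, split on commas, and trim whitespace around each field
    # (tolerates "a,b" as well as "a, b" in the model output)
    parts = [p.strip() for p in tuple_str.strip('()').split(',')]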
    if len(parts) == 4:
        test_scenarios.append({
            'primary_intent': parts[0],
            'completeness': parts[1],
            'interaction_pattern': parts[2],
            'user_behavior': parts[3],
            'tuple': tuple_str
        })

print(f"Generated {len(test_scenarios)} test scenarios:\n")
for i, scenario in enumerate(test_scenarios, 1):
    print(f"{i}. {scenario['tuple']}")

Generated 20 test scenarios:

1. (shop_info, complete_upfront, single_turn, cooperative)
2. (product_inquiry, partial_info, multi_turn_linear, conversational)
3. (book_appointment, minimal_info, multi_turn_complex, cooperative)
4. (maintenance_tips, no_info, mixed_intent, corrective)
5. (policy_question, complete_upfront, single_turn, distracted)
6. (book_appointment, partial_info, mixed_intent, distracted)
7. (recall_booking, complete_upfront, single_turn, cooperative)
8. (what_can_you_do, no_info, single_turn, conversational)
9. (book_appointment, complete_upfront, multi_turn_linear, cooperative)
10. (maintenance_tips, partial_info, multi_turn_linear, cooperative)
11. (product_inquiry, minimal_info, multi_turn_complex, corrective)
12. (recall_booking, minimal_info, multi_turn_linear, cooperative)
13. (book_appointment, no_info, multi_turn_complex, distracted)
14. (shop_info, partial_info, multi_turn_complex, cooperative)
15. (maintenance_tips, complete_upfront, multi_turn_linear, cor
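
Because the tuples come from an LLM, a value can be misspelled or fall outside the dimension lists, and the `len(parts) == 4` check alone would not catch that. A minimal sketch of a stricter check, assuming the allowed values from section 1.1 (`ALLOWED` and `find_invalid_scenarios` are hypothetical helpers, not part of the original notebook):

```python
# Allowed values per field, copied from section 1.1 (illustrative constant).
ALLOWED = {
    "primary_intent": {
        "shop_info", "product_inquiry", "book_appointment", "maintenance_tips",
        "policy_question", "recall_booking", "what_can_you_do",
    },
    "completeness": {"complete_upfront", "partial_info", "minimal_info", "no_info"},
    "interaction_pattern": {
        "single_turn", "multi_turn_linear", "multi_turn_complex", "mixed_intent",
    },
    "user_behavior": {
        "cooperative", "conversational", "corrective", "distracted", "canceling",
    },
}


def find_invalid_scenarios(scenarios):
    """Return (index, field, value) for every value outside its allowed set."""
    problems = []
    for idx, scenario in enumerate(scenarios):
        for field, allowed in ALLOWED.items():
            value = scenario.get(field)
            if value not in allowed:
                problems.append((idx, field, value))
    return problems
```

Calling `find_invalid_scenarios(test_scenarios)` returns an empty list when every field holds a known value; anything else is worth inspecting before generating queries from it.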

In [4]:
# Save the generated combinations to a file for future reference
import json

output_file = 'test_scenario_combinations.json'

with open(output_file, 'w') as f:
    json.dump(test_scenarios, f, indent=2)

print(f"\n✓ Saved {len(test_scenarios)} test scenarios to {output_file}")


✓ Saved 20 test scenarios to test_scenario_combinations.json


In [5]:
# Analyze the distribution of scenarios
import pandas as pd

df = pd.DataFrame(test_scenarios)

print("Distribution by Primary Intent:")
print(df['primary_intent'].value_counts())
print("\nDistribution by Information Completeness:")
print(df['completeness'].value_counts())
print("\nDistribution by Interaction Pattern:")
print(df['interaction_pattern'].value_counts())
print("\nDistribution by User Behavior:")
print(df['user_behavior'].value_counts())

Distribution by Primary Intent:
primary_intent
book_appointment    7
maintenance_tips    3
shop_info           2
product_inquiry     2
policy_question     2
recall_booking      2
what_can_you_do     2
Name: count, dtype: int64

Distribution by Information Completeness:
completeness
complete_upfront    6
partial_info        6
minimal_info        4
no_info             4
Name: count, dtype: int64

Distribution by Interaction Pattern:
interaction_pattern
multi_turn_linear     6
single_turn           5
multi_turn_complex    5
mixed_intent          4
Name: count, dtype: int64

Distribution by User Behavior:
user_behavior
cooperative       10
conversational     3
corrective         3
distracted         3
canceling          1
Name: count, dtype: int64
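
The generation prompt asked for minimum coverage: at least 5 `book_appointment` scenarios, at least 2 `recall_booking` scenarios, and at least 3 with `corrective` or `distracted` behavior. The counts above satisfy these (7, 2, and 3 + 3 = 6), but the check can also be automated. A sketch, where `check_coverage` is a hypothetical helper rather than part of the original notebook:

```python
from collections import Counter


def check_coverage(scenarios):
    """Check the minimum-coverage rules stated in the generation prompt."""
    intents = Counter(s["primary_intent"] for s in scenarios)
    behaviors = Counter(s["user_behavior"] for s in scenarios)
    return {
        "book_appointment >= 5": intents["book_appointment"] >= 5,
        "recall_booking >= 2": intents["recall_booking"] >= 2,
        "corrective/distracted >= 3":
            behaviors["corrective"] + behaviors["distracted"] >= 3,
    }
```

`check_coverage(test_scenarios)` returns a dict of rule names mapped to booleans, so a failed rule is visible by name.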


In [6]:
df.drop(columns=['tuple'])

Unnamed: 0,primary_intent,completeness,interaction_pattern,user_behavior
0,shop_info,complete_upfront,single_turn,cooperative
1,product_inquiry,partial_info,multi_turn_linear,conversational
2,book_appointment,minimal_info,multi_turn_complex,cooperative
3,maintenance_tips,no_info,mixed_intent,corrective
4,policy_question,complete_upfront,single_turn,distracted
5,book_appointment,partial_info,mixed_intent,distracted
6,recall_booking,complete_upfront,single_turn,cooperative
7,what_can_you_do,no_info,single_turn,conversational
8,book_appointment,complete_upfront,multi_turn_linear,cooperative
9,maintenance_tips,partial_info,multi_turn_linear,cooperative


In [7]:
df[df['interaction_pattern'] == 'single_turn']

Unnamed: 0,primary_intent,completeness,interaction_pattern,user_behavior,tuple
0,shop_info,complete_upfront,single_turn,cooperative,"(shop_info, complete_upfront, single_turn, coo..."
4,policy_question,complete_upfront,single_turn,distracted,"(policy_question, complete_upfront, single_tur..."
6,recall_booking,complete_upfront,single_turn,cooperative,"(recall_booking, complete_upfront, single_turn..."
7,what_can_you_do,no_info,single_turn,conversational,"(what_can_you_do, no_info, single_turn, conver..."
16,policy_question,no_info,single_turn,canceling,"(policy_question, no_info, single_turn, cancel..."


In [8]:
df[df['interaction_pattern'] == 'multi_turn_linear']

Unnamed: 0,primary_intent,completeness,interaction_pattern,user_behavior,tuple
1,product_inquiry,partial_info,multi_turn_linear,conversational,"(product_inquiry, partial_info, multi_turn_lin..."
8,book_appointment,complete_upfront,multi_turn_linear,cooperative,"(book_appointment, complete_upfront, multi_tur..."
9,maintenance_tips,partial_info,multi_turn_linear,cooperative,"(maintenance_tips, partial_info, multi_turn_li..."
11,recall_booking,minimal_info,multi_turn_linear,cooperative,"(recall_booking, minimal_info, multi_turn_line..."
14,maintenance_tips,complete_upfront,multi_turn_linear,corrective,"(maintenance_tips, complete_upfront, multi_tur..."
17,book_appointment,minimal_info,multi_turn_linear,conversational,"(book_appointment, minimal_info, multi_turn_li..."


### 1.3 Configure Testing Phase

We'll generate synthetic examples in phases:
- **Phase A**: Single-turn conversations (simple query/response)
- **Phase B**: Multi-turn conversations (adaptive user with goal)

Set the phase configuration below:

In [9]:
# Configuration: Select which phase to generate examples for
# Phase A: single_turn only (simple, good for starting)
# Phase B: multi-turn patterns (more complex, adaptive user simulation)

TESTING_PHASE = "A"  # Change to "B" for Phase B later

if TESTING_PHASE == "A":
    print("📍 Phase A: Single-Turn Conversations")
    print("   - Focus: Simple query/response interactions")
    print("   - Scenarios: single_turn interaction patterns only")
    selected_scenarios = [s for s in test_scenarios if s['interaction_pattern'] == 'single_turn']
elif TESTING_PHASE == "B":
    print("📍 Phase B: Multi-Turn Conversations")
    print("   - Focus: Complex conversations with adaptive user simulation")
    print("   - Scenarios: multi_turn_linear, multi_turn_complex, mixed_intent patterns")
    selected_scenarios = [s for s in test_scenarios if s['interaction_pattern'] != 'single_turn']
else:
    raise ValueError(f"Invalid TESTING_PHASE: {TESTING_PHASE}. Must be 'A' or 'B'")

print(f"\n✓ Selected {len(selected_scenarios)} scenarios for Phase {TESTING_PHASE}")
print(f"  Remaining for future: {len(test_scenarios) - len(selected_scenarios)} scenarios")

📍 Phase A: Single-Turn Conversations
   - Focus: Simple query/response interactions
   - Scenarios: single_turn interaction patterns only

✓ Selected 5 scenarios for Phase A
  Remaining for future: 15 scenarios


### 1.4 Generate Synthetic Examples

Based on the selected phase, generate appropriate synthetic examples.

In [10]:
# Phase A: Generate simple single-turn queries
def generate_phase_a_examples(scenarios, examples_per_scenario=1):
    """
    Generate simple, realistic single-turn queries for Phase A scenarios.
    
    Args:
        scenarios: List of scenario dictionaries
        examples_per_scenario: Number of example queries to generate per scenario (default: 1)
    
    Returns:
        List of query dictionaries with metadata
    """
    examples = []
    
    for scenario in scenarios:
        for example_num in range(examples_per_scenario):
            # Create a prompt to generate a realistic single-turn query
            query_prompt = f"""Generate a single, realistic user query for a bike shop chatbot.

Scenario details:
- Primary Intent: {scenario['primary_intent']}
- Information Completeness: {scenario['completeness']}
- User Behavior: {scenario['user_behavior']}

Requirements:
- Generate ONE natural language query that a real customer might ask
- Match the primary intent (e.g., shop_info = asking about hours/location)
- Match the completeness level:
  * complete_upfront: Include all necessary details in the query
  * partial_info: Include some but not all details
  * minimal_info: Very basic query with little detail
  * no_info: General question with no specifics
- Use natural, friendly tone for conversational behavior

Return ONLY the user query, nothing else. Make it sound natural and realistic."""

            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "You are an expert at creating realistic user queries for chatbot testing."},
                    {"role": "user", "content": query_prompt}
                ],
                temperature=0.8
            )
            
            user_query = response.choices[0].message.content.strip()
            
            examples.append({
                'phase': 'A',
                'scenario': scenario,
                'user_query': user_query,
                'expected_mode': scenario['primary_intent']
            })
    
    return examples


# Phase B: Generate multi-turn conversation goals (for future use)
def generate_phase_b_examples(scenarios):
    """
    Generate conversation goals and initial queries for Phase B scenarios.
    These will be used with adaptive user simulation.
    
    Args:
        scenarios: List of scenario dictionaries
    
    Returns:
        List of conversation goal dictionaries with metadata
    """
    examples = []
    
    for scenario in scenarios:
        # Create goal based on intent
        intent = scenario['primary_intent']
        if intent == 'book_appointment':
            goal = "Successfully book a service appointment by providing service type, date, and time"
        elif intent == 'shop_info':
            goal = "Get specific information about shop hours, location, or services"
        elif intent == 'product_inquiry':
            goal = "Learn about specific bikes or accessories and their availability"
        elif intent == 'maintenance_tips':
            goal = "Get practical advice on bike maintenance or care"
        elif intent == 'policy_question':
            goal = "Understand store policies regarding returns, warranties, or delivery"
        elif intent == 'recall_booking':
            goal = "Successfully recall or reference a previously made booking"
        else:
            goal = f"Complete a {intent} interaction successfully"
        
        # Generate initial query
        query_prompt = f"""Generate an initial user query to start a multi-turn conversation with a bike shop chatbot.

Scenario:
- Primary Intent: {scenario['primary_intent']}
- Information Completeness: {scenario['completeness']}
- User Behavior: {scenario['user_behavior']}
- User Goal: {goal}

Generate a realistic opening message that matches the completeness level and behavior.
Return ONLY the initial query, nothing else."""

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are an expert at creating realistic conversation starters."},
                {"role": "user", "content": query_prompt}
            ],
            temperature=0.8
        )
        
        initial_query = response.choices[0].message.content.strip()
        
        examples.append({
            'phase': 'B',
            'scenario': scenario,
            'user_goal': goal,
            'initial_query': initial_query,
            'expected_mode': scenario['primary_intent']
        })
    
    return examples


print("✓ Phase A and Phase B generation functions defined")

✓ Phase A and Phase B generation functions defined


In [11]:
# Generate examples based on selected phase
print(f"Generating Phase {TESTING_PHASE} examples...")

if TESTING_PHASE == "A":
    synthetic_examples = generate_phase_a_examples(selected_scenarios, examples_per_scenario=2)
    print(f"✓ Generated {len(synthetic_examples)} Phase A single-turn queries")
elif TESTING_PHASE == "B":
    synthetic_examples = generate_phase_b_examples(selected_scenarios)
    print(f"✓ Generated {len(synthetic_examples)} Phase B conversation goals")

# Save to file
output_file = f'phase_{TESTING_PHASE.lower()}_synthetic_examples.json'
with open(output_file, 'w') as f:
    json.dump({
        'phase': TESTING_PHASE,
        'count': len(synthetic_examples),
        'examples': synthetic_examples
    }, f, indent=2)

print(f"✓ Saved to {output_file}")

Generating Phase A examples...
✓ Generated 10 Phase A single-turn queries
✓ Saved to phase_a_synthetic_examples.json


In [12]:
# Display sample examples
print(f"\n{'='*80}")
print(f"SAMPLE PHASE {TESTING_PHASE} EXAMPLES")
print(f"{'='*80}\n")

for i, example in enumerate(synthetic_examples[:3], 1):
    print(f"{i}. Scenario: {example['scenario']['tuple']}")
    if TESTING_PHASE == "A":
        print(f"   Query: \"{example['user_query']}\"")
    else:  # Phase B
        print(f"   Goal: {example['user_goal']}")
        print(f"   Initial Query: \"{example['initial_query']}\"")
    print()


SAMPLE PHASE A EXAMPLES

1. Scenario: (shop_info, complete_upfront, single_turn, cooperative)
   Query: ""Hi there! Could you please tell me your store hours and location? I'm planning to visit soon and want to make sure I have the details right. Thanks!""

2. Scenario: (shop_info, complete_upfront, single_turn, cooperative)
   Query: ""Hi there! Could you please tell me your store hours and location? I'm planning to visit sometime soon. Thanks!""

3. Scenario: (policy_question, complete_upfront, single_turn, distracted)
   Query: ""Hey, I was wondering if you could tell me about your return policy on bikes? I got a call, can you hold on a sec?""
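
Examples 1 and 2 above are nearly identical, and the doubled quotation marks suggest the model wrapped its answer in quotes. A sketch for cleaning and flagging such cases with `difflib` from the standard library; the helper names and the 0.85 threshold are assumptions chosen for illustration, not part of the original notebook:

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize_query(text):
    """Trim whitespace and one layer of wrapping double quotes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1].strip()
    return text


def find_near_duplicates(queries, threshold=0.85):
    """Return (i, j, ratio) for query pairs whose similarity meets the threshold."""
    cleaned = [normalize_query(q).lower() for q in queries]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(cleaned), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs
```

Running `find_near_duplicates([e['user_query'] for e in synthetic_examples])` would list candidate pairs to regenerate; the right threshold depends on how much variation the test set needs.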



## Part 2: Error Analysis - Running Tests

We'll use the `synthetic_examples` generated in Part 1 based on the `TESTING_PHASE` configuration.

In [13]:
# Verify we have synthetic examples to test
if 'synthetic_examples' not in locals() or not synthetic_examples:
    raise ValueError("No synthetic examples found. Please run Part 1 first to generate examples.")

print(f"✓ Using {len(synthetic_examples)} synthetic examples from Part 1")
print(f"  Phase: {TESTING_PHASE}")
print(f"  Based on {len(selected_scenarios)} selected scenarios")

✓ Using 10 synthetic examples from Part 1
  Phase: A
  Based on 5 selected scenarios


In [14]:
# Display sample synthetic examples that will be tested
print(f"Sample Synthetic Examples for Testing (Phase {TESTING_PHASE}):\n")
for i, example in enumerate(synthetic_examples[:5], 1):
    print(f"{i}. Scenario: {example['scenario']['tuple']}")
    if TESTING_PHASE == "A":
        print(f"   Query: {example['user_query']}")
    else:  # Phase B
        print(f"   Goal: {example['user_goal']}")
        print(f"   Initial Query: {example['initial_query']}")
    print()

Sample Synthetic Examples for Testing (Phase A):

1. Scenario: (shop_info, complete_upfront, single_turn, cooperative)
   Query: "Hi there! Could you please tell me your store hours and location? I'm planning to visit soon and want to make sure I have the details right. Thanks!"

2. Scenario: (shop_info, complete_upfront, single_turn, cooperative)
   Query: "Hi there! Could you please tell me your store hours and location? I'm planning to visit sometime soon. Thanks!"

3. Scenario: (policy_question, complete_upfront, single_turn, distracted)
   Query: "Hey, I was wondering if you could tell me about your return policy on bikes? I got a call, can you hold on a sec?"

4. Scenario: (policy_question, complete_upfront, single_turn, distracted)
   Query: "Hey, sorry I might have missed it but could you tell me if there's a return policy for bikes if they're not quite what I expected? Like within a certain timeframe or something?"

5. Scenario: (recall_booking, complete_upfront, single_turn, 

In [15]:
# Import Burr application
import sys
sys.path.append('..')

from application import application, TERMINAL_ACTIONS
import asyncio

print("✓ Burr application imported successfully")

✓ Burr application imported successfully


In [16]:
# Function to run a single-turn test (Phase A)
async def run_single_turn_test(example, test_id):
    """
    Run a single-turn test: submit query, get bot response.
    
    Args:
        example: Dictionary with scenario and user_query from synthetic_examples
        test_id: Unique identifier for this test
    
    Returns:
        Dictionary with test results
    """
    # Create fresh Burr application instance
    app = application(app_id=f"phase_{TESTING_PHASE.lower()}_{test_id}")
    
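    # Phase A examples store the text under 'user_query'; Phase B examples use
    # 'initial_query'. For Phase B this sends only the opening turn.
    user_query = example.get('user_query') or example['initial_query']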
    scenario = example['scenario']
    
    result = {
        'test_id': test_id,
        'scenario': scenario,
        'user_query': user_query,
        'bot_response': None,
        'action_taken': None,
        'state_snapshot': None,
        'success': False,
        'error': None
    }
    
    try:
        # Submit query to bot
        action, streaming_container = await app.astream_result(
            halt_after=TERMINAL_ACTIONS,
            inputs={"query": user_query}
        )
        
        # Collect streaming response
        response_parts = []
        async for item in streaming_container:
            response_parts.append(item.get('delta', ''))
        
        bot_response = ''.join(response_parts)
        
        # Get final state
        final_result, state = await streaming_container.get()
        
        # Update result
        result['bot_response'] = bot_response
        result['action_taken'] = action.name
        result['state_snapshot'] = {
            'mode': state.get('mode'),
            'in_appointment_flow': state.get('in_appointment_flow'),
            'appointment_data': state.get('appointment_data'),
            'safe': state.get('safe')
        }
        result['success'] = True
        
    except Exception as e:
        result['error'] = str(e)
    
    return result

print("✓ Function defined: run_single_turn_test()")

✓ Function defined: run_single_turn_test()


In [17]:
# Run all tests for current phase
async def run_all_tests(examples_list):
    """Run all tests for the synthetic examples."""
    results = []
    
    for idx, example in enumerate(examples_list, 1):
        test_id = f"{TESTING_PHASE}{idx:03d}"
        
        result = await run_single_turn_test(example, test_id)
        results.append(result)
        
    return results

# Run the tests
print(f"Starting Phase {TESTING_PHASE} tests...")
print(f"Running {len(synthetic_examples)} tests...")

test_results = await run_all_tests(synthetic_examples)
# Print all results
print("\n" + "="*80)
print("ALL TEST RESULTS")
print("="*80 + "\n")
for idx, result in enumerate(test_results, 1):
    if result['success']:
        print(f"[{idx}/{len(test_results)}] {result['test_id']} - {result['scenario']['primary_intent']}")
        print(f"    ✓ Response: {len(result['bot_response'])} chars | Action: {result['action_taken']}")
    else:
        print(f"[{idx}/{len(test_results)}] {result['test_id']} - {result['scenario']['primary_intent']}")
        print(f"    ✗ Error: {result['error']}")

print(f"\n✓ Completed {len(test_results)} tests")


Starting Phase A tests...
Running 10 tests...

ALL TEST RESULTS

[1/10] A001 - shop_info
    ✓ Response: 1045 chars | Action: shop_info
[2/10] A002 - shop_info
    ✓ Response: 674 chars | Action: shop_info
[3/10] A003 - policy_question
    ✓ Response: 443 chars | Action: policy_question
[4/10] A004 - policy_question
    ✓ Response: 212 chars | Action: policy_question
[5/10] A005 - recall_booking
    ✓ Response: 102 chars | Action: recall_booking
[6/10] A006 - recall_booking
    ✓ Response: 102 chars | Action: recall_booking
[7/10] A007 - what_can_you_do
    ✓ Response: 451 chars | Action: what_can_you_do
[8/10] A008 - what_can_you_do
    ✓ Response: 451 chars | Action: what_can_you_do
[9/10] A009 - policy_question
    ✓ Response: 207 chars | Action: policy_question
[10/10] A010 - policy_question
    ✓ Response: 367 chars | Action: policy_question

✓ Completed 10 tests
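
The loop in `run_all_tests` awaits each test in turn, which keeps traces easy to follow but scales linearly with the number of examples. If throughput becomes an issue with larger sets, one option is bounded concurrency with `asyncio.Semaphore`. The sketch below is an assumption about how that could look: `run_with_limit` is a hypothetical helper, and whether the Burr application and the API rate limits tolerate concurrent runs has not been verified here.

```python
import asyncio


async def run_with_limit(items, worker, max_concurrency=3):
    """Run worker(item, index) for every item, at most max_concurrency at a time.

    Indices start at 1. Results come back in input order because
    asyncio.gather preserves the order of the awaitables it is given.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(index, item):
        # Only max_concurrency workers can hold the semaphore at once
        async with semaphore:
            return await worker(item, index)

    return await asyncio.gather(
        *(guarded(i, item) for i, item in enumerate(items, 1))
    )
```

In the notebook this could be called as `await run_with_limit(synthetic_examples, lambda ex, i: run_single_turn_test(ex, f"{TESTING_PHASE}{i:03d}"))`, which keeps the same test IDs as the sequential loop.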


In [18]:
# Display sample results
print("="*80)
print(f"SAMPLE TEST RESULTS (Phase {TESTING_PHASE})")
print("="*80)

for i, result in enumerate(test_results[:4], 1):
    print(f"\nTest {result['test_id']}")
    print(f"Scenario: {result['scenario']['tuple']}")
    print(f"\nUser Query: \"{result['user_query']}\"")
    print(f"\nBot Response:\n{result['bot_response'][:300]}...")
    print(f"\nMode: {result['state_snapshot']['mode']}")
    print("-"*80)

SAMPLE TEST RESULTS (Phase A)

Test A001
Scenario: (shop_info, complete_upfront, single_turn, cooperative)

User Query: ""Hi there! Could you please tell me your store hours and location? I'm planning to visit soon and want to make sure I have the details right. Thanks!""

Bot Response:
Hey there! That's awesome to hear you're planning to visit JO's Bike Shop – we can't wait to have you roll in! 🚲🔧 Our shop is located at 456 Pedal Lane in beautiful Portland, OR 97201.

As for our store hours, we're open 7 days a week to serve all your biking needs:
- Monday to Wednesday: 9:00 AM - ...

Mode: shop_info
--------------------------------------------------------------------------------

Test A002
Scenario: (shop_info, complete_upfront, single_turn, cooperative)

User Query: ""Hi there! Could you please tell me your store hours and location? I'm planning to visit sometime soon. Thanks!""

Bot Response:
Absolutely, greetings fellow rider! 🚴‍♂️🌟 JO's Bike Shop is located at

In [19]:
# Save test results
output_data = {
    'metadata': {
        'phase': TESTING_PHASE,
        'description': f'Phase {TESTING_PHASE} conversation tests',
        'timestamp': pd.Timestamp.now().isoformat(),
        'total_tests': len(test_results),
        'successful_tests': sum(1 for r in test_results if r['success'])
    },
    'results': test_results
}

output_filename = f'phase_{TESTING_PHASE.lower()}_results.json'
with open(output_filename, 'w') as f:
    json.dump(output_data, f, indent=2)

print(f"✓ Saved detailed results to {output_filename}")
print(f"  Total tests: {len(test_results)}")
print(f"  Successful: {sum(1 for r in test_results if r['success'])}")
print(f"  Failed: {sum(1 for r in test_results if not r['success'])}")

✓ Saved detailed results to phase_a_results.json
  Total tests: 10
  Successful: 10
  Failed: 0


In [20]:
# Create analysis CSV for open coding
analysis_rows = []

for result in test_results:
    scenario = result['scenario']
    row = {
        'Trace_ID': result['test_id'],
        'Phase': TESTING_PHASE,
        'Scenario_Tuple': scenario['tuple'],
        'Primary_Intent': scenario['primary_intent'],
        'Completeness': scenario['completeness'],
        'Interaction_Pattern': scenario['interaction_pattern'],
        'User_Behavior': scenario['user_behavior'],
        'User_Query': result['user_query'],
        'Bot_Response_Preview': result['bot_response'][:200] if result['bot_response'] else '',
        'Action_Taken': result['action_taken'],
        'Mode_Detected': result['state_snapshot']['mode'] if result['state_snapshot'] else '',
        'Success': 1 if result['success'] else 0,
        'Error': result['error'] or '',
        'Open_Code_Notes': '',  # For manual analysis
        'Overall_Success': '',  # 0=failed, 1=partial, 2=success (to fill manually)
    }
    analysis_rows.append(row)

analysis_df = pd.DataFrame(analysis_rows)
csv_filename = f'phase_{TESTING_PHASE.lower()}_analysis.csv'
analysis_df.to_csv(csv_filename, index=False)

print(f"✓ Created {csv_filename} for open coding")
print(f"\nColumns available for analysis:")
for col in analysis_df.columns:
    print(f"  - {col}")

✓ Created phase_a_analysis.csv for open coding

Columns available for analysis:
  - Trace_ID
  - Phase
  - Scenario_Tuple
  - Primary_Intent
  - Completeness
  - Interaction_Pattern
  - User_Behavior
  - User_Query
  - Bot_Response_Preview
  - Action_Taken
  - Mode_Detected
  - Success
  - Error
  - Open_Code_Notes
  - Overall_Success


In [21]:
# Display the mode analysis table as a dataframe
# Ensure mode_analysis is defined before using it
if 'mode_analysis' not in locals():
    mode_analysis = pd.DataFrame([
        {
            'Test_ID': r['test_id'],
            'Expected_Intent': r['scenario']['primary_intent'],
            'Actual_Intent': r['state_snapshot']['mode'] if r['state_snapshot'] else None,
            'User_Query': r['user_query']
        }
        for r in test_results if r['success']
    ])

print("Mode Detection Analysis:")
print(f"Analyzing {len(mode_analysis)} successful tests...\n")

# Display the confusion matrix as a dataframe
confusion_matrix = pd.crosstab(mode_analysis['Expected_Intent'], mode_analysis['Actual_Intent'], margins=True)
print("Expected vs Actual Confusion Matrix:")
confusion_matrix

Mode Detection Analysis:
Analyzing 10 successful tests...

Expected vs Actual Confusion Matrix:


Expected_Intent \ Actual_Intent,policy_question,recall_booking,shop_info,what_can_you_do,All
policy_question,4,0,0,0,4
recall_booking,0,2,0,0,2
shop_info,0,0,2,0,2
what_can_you_do,0,0,0,2,2
All,4,2,2,2,10


In [22]:
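# Compare expected and detected modes instead of assuming a perfect match
correct = int((mode_analysis['Expected_Intent'] == mode_analysis['Actual_Intent']).sum())
total = len(mode_analysis)
if correct == total:
    print(f"\n✓ All modes detected correctly!")
    print(f"Perfect accuracy: {correct}/{total} tests")
else:
    print(f"\n✗ Mode detection accuracy: {correct}/{total} tests")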

# Display the detailed analysis dataframe
print(f"\nDetailed Mode Analysis:")
mode_analysis


✓ All modes detected correctly!
Perfect accuracy: 10/10 tests

Detailed Mode Analysis:


Unnamed: 0,Test_ID,Expected_Intent,Actual_Intent,User_Query
0,A001,shop_info,shop_info,"""Hi there! Could you please tell me your store..."
1,A002,shop_info,shop_info,"""Hi there! Could you please tell me your store..."
2,A003,policy_question,policy_question,"""Hey, I was wondering if you could tell me abo..."
3,A004,policy_question,policy_question,"""Hey, sorry I might have missed it but could y..."
4,A005,recall_booking,recall_booking,Hi there! I scheduled a tune-up for my mountai...
5,A006,recall_booking,recall_booking,Hi there! I have a bike tune-up appointment bo...
6,A007,what_can_you_do,what_can_you_do,Hey there! Could you tell me what you can help...
7,A008,what_can_you_do,what_can_you_do,Hey there! What kind of things can you help me...
8,A009,policy_question,policy_question,Can I cancel my bike order if I've changed my ...
9,A010,policy_question,policy_question,Can you tell me about your return policy?


### Testing Summary

**Completed:**
- ✅ Used synthetic examples from Part 1 based on TESTING_PHASE configuration
- ✅ Ran chatbot against all generated queries
- ✅ Captured bot responses and state information
- ✅ Saved results to `phase_{phase}_results.json` (detailed)
- ✅ Created `phase_{phase}_analysis.csv` for open coding

**Files Created:**
- `phase_{phase}_results.json` - Complete test results with full bot responses
- `phase_{phase}_analysis.csv` - Spreadsheet ready for open coding analysis

**Next Steps:**
1. Review results in Burr UI: `burr` (look for traces starting with `phase_{phase}_`)
2. Perform open coding: Fill in `Open_Code_Notes` column in the CSV
3. After current phase analysis is complete, change TESTING_PHASE in Part 1 and repeat
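
Once `Open_Code_Notes` and `Overall_Success` have been filled in by hand, a per-intent tally is a useful starting point for grouping failure modes during axial coding. A sketch using only the standard library; `summarize_manual_scores` is a hypothetical helper, and it assumes `Overall_Success` holds 0, 1, or 2 as described in the CSV cell, with unscored rows left blank:

```python
import csv
from collections import Counter, defaultdict


def summarize_manual_scores(csv_file):
    """Count Overall_Success scores per Primary_Intent, skipping unscored rows.

    csv_file is an open text file (or any iterable of CSV lines) with a header row.
    """
    summary = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        score = (row.get("Overall_Success") or "").strip()
        if score:
            summary[row["Primary_Intent"]][score] += 1
    return {intent: dict(counts) for intent, counts in summary.items()}
```

For Phase A this would be called as `with open('phase_a_analysis.csv', newline='') as f: summarize_manual_scores(f)`. Intents with many 0 or 1 scores are the natural place to start reading the notes.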

