In [1]:
from textwrap import dedent

# Homework 2: Recipe Bot Error Analysis

This notebook walks through the complete error analysis process for a Recipe Bot. We'll identify failure modes, generate test queries, and analyze bot responses to build a taxonomy of errors.

**Note:** This uses the pre-existing queries and bot responses in `results_20250518_215844.csv` as our data source.

For a recording of the homework walkthrough please see: https://youtu.be/h9oAAAYnGx4

In [2]:
import pandas as pd

In [3]:
from claudette import models, Client
model = models[1]
c = Client(model)
model

'claude-sonnet-4-20250514'

## Part 1: Define Dimensions & Generate Initial Queries

### Identify Key Dimensions


meal_type("breakfast", "lunch", "dinner", "snack")
user("matt", "kate", "guests")
protein_category("poultry", "seafood", "red_meat")
side_dish("salad", "roasted_veg", "rice_base")
cooking_method("oven", "grill", "stovetop", "air_fryer")
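
As a rough sketch of the size of this test space, the dimensions above can be written as a Python dict and expanded into every possible tuple. The `DIMENSIONS` and `all_combinations` names are illustrative and do not appear elsewhere in this notebook.

```python
from itertools import product

# Dimension names and values copied from the list above.
DIMENSIONS = {
    "meal_type": ["breakfast", "lunch", "dinner", "snack"],
    "user": ["matt", "kate", "guests"],
    "protein_category": ["poultry", "seafood", "red_meat"],
    "side_dish": ["salad", "roasted_veg", "rice_base"],
    "cooking_method": ["oven", "grill", "stovetop", "air_fryer"],
}

# Every possible tuple, in the same field order as the dict keys.
all_combinations = list(product(*DIMENSIONS.values()))
print(len(all_combinations))  # 4 * 3 * 3 * 3 * 4 = 432
```

With 432 possible tuples, a set of around 50 covers only a slice of the space, which is why the combinations are chosen deliberately rather than enumerated.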

### Generate Unique Combinations

In [4]:
from textwrap import dedent

# Updated prompt using the "leveled-up" categorical structure
prompt = dedent('''\
I am designing a Recipe Bot and want to test it with a diverse set of user scenarios. 
Please generate 50 unique combinations (tuples) using the following structured dimensions and their possible values:

meal_type("breakfast", "lunch", "dinner", "snack")
user("matt", "kate", "guests")
protein_category("poultry", "seafood", "red_meat")
side_dish("salad", "roasted_veg", "rice_base")
cooking_method("oven", "grill", "stovetop", "air_fryer")

Each combination should select one value from each of the 5 dimensions. 
Present the results as a list of tuples, where each tuple contains one value for each dimension in the following order: 
(meal_type, user, protein_category, side_dish, cooking_method). 

Ensure that the combinations are varied and realistic, keeping in mind logical pairings (e.g., breakfast combinations should make sense).''')


In [5]:
# c(prompt)

In [6]:
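import random
random.seed(42)
# `dimension_examples` was presumably meant to hold the tuples returned by `c(prompt)`.
# That call is commented out above, so the name is never defined and the line below
# raised a NameError. The hand-curated list in the next cell is used instead.
# dimension_samples_for_nlp = random.sample(dimension_examples, 40)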

In [None]:
dimension_samples_for_nlp = [
    ("breakfast", "guests", "seafood", "roasted_veg", "oven"),
    ("lunch", "kate", "poultry", "rice_base", "stovetop"),
    ("lunch", "kate", "red_meat", "rice_base", "oven"),
    ("dinner", "guests", "seafood", "rice_base", "stovetop"),
    ("lunch", "kate", "seafood", "salad", "stovetop"),
    # --- Group 1: Core Success Cases (15 Tuples) ---
    # These are realistic, high-quality combinations the bot MUST handle correctly.
    ("dinner", "matt", "poultry", "roasted_veg", "grill"),      # Ideal Matt dinner.
    ("dinner", "kate", "seafood", "rice_base", "oven"),         # Ideal Kate dinner (iron, low-fat, batch-prep friendly).
    ("dinner", "guests", "poultry", "roasted_veg", "oven"),     # Classic safe guest meal.
    ("lunch", "matt", "red_meat", "salad", "stovetop"),         # Leftover steak salad - valid.
    ("lunch", "kate", "poultry", "rice_base", "stovetop"),      # Perfect use of prepped grains for Kate.
    ("dinner", "matt", "seafood", "salad", "air_fryer"),        # Healthy and safe for Matt.
    ("dinner", "kate", "red_meat", "rice_base", "stovetop"),    # Valid for Kate's iron needs.
    ("dinner", "guests", "red_meat", "salad", "grill"),         # Classic BBQ scenario.
    ("lunch", "guests", "poultry", "salad", "grill"),           # Standard light lunch for guests.
    ("breakfast", "matt", "poultry", "roasted_veg", "stovetop"),# e.g., Turkey sausage and peppers.
    ("breakfast", "kate", "seafood", "salad", "stovetop"),      # e.g., Smoked salmon on greens.
    ("snack", "matt", "poultry", "salad", "air_fryer"),         # Quick, high-protein snack.
    ("snack", "kate", "seafood", "salad", "stovetop"),          # Light, iron-rich snack.
    ("dinner", "matt", "red_meat", "roasted_veg", "oven"),      # Valid oven-based meal for Matt.
    ("dinner", "kate", "poultry", "salad", "oven"),             # Baked chicken salad, valid for Kate.

    # --- Group 2: Critical Failure Cases (10 Tuples) ---
    # These tuples violate Matt's core dietary rule (no rice_base). The bot should NOT suggest these recipes as-is.
    # They are included to surface more failure-mode examples.
    ("dinner", "matt", "poultry", "rice_base", "stovetop"),     # FAILURE: Matt cannot have rice_base.
    ("lunch", "matt", "seafood", "rice_base", "oven"),          # FAILURE: Matt cannot have rice_base.
    ("breakfast", "matt", "red_meat", "rice_base", "stovetop"), # FAILURE: Matt cannot have rice_base.
    ("snack", "matt", "poultry", "rice_base", "air_fryer"),     # FAILURE: Matt cannot have rice_base.
    ("dinner", "matt", "red_meat", "rice_base", "grill"),       # FAILURE: Matt cannot have rice_base.
    ("lunch", "matt", "poultry", "rice_base", "stovetop"),      # FAILURE: Matt cannot have rice_base.
    ("breakfast", "matt", "seafood", "rice_base", "oven"),      # FAILURE: Matt cannot have rice_base.
    ("snack", "matt", "seafood", "rice_base", "air_fryer"),     # FAILURE: Matt cannot have rice_base.
    ("dinner", "matt", "seafood", "rice_base", "grill"),        # FAILURE: Matt cannot have rice_base.
    ("lunch", "matt", "red_meat", "rice_base", "stovetop"),     # FAILURE: Matt cannot have rice_base.

    # --- Group 3: Subtle & Edge Cases (15 Tuples) ---
    # These are logically possible but unlikely, testing the bot's nuance and common sense.
    ("breakfast", "guests", "red_meat", "roasted_veg", "grill"),# Grilling steak for breakfast is unusual.
    ("snack", "kate", "red_meat", "roasted_veg", "oven"),       # A heavy, oven-cooked snack is unlikely.
    ("dinner", "guests", "poultry", "rice_base", "air_fryer"),  # Air-frying a grain base is odd.
    ("snack", "guests", "red_meat", "salad", "oven"),           # Baking a steak salad for a snack is weird.
    ("breakfast", "matt", "seafood", "roasted_veg", "grill"),   # Grilling fish and vegetables for breakfast.
    ("lunch", "kate", "red_meat", "rice_base", "air_fryer"),    # Another odd air-fryer combination.
    ("dinner", "matt", "poultry", "salad", "stovetop"),         # Less common than grilling or baking.
    ("snack", "kate", "poultry", "roasted_veg", "grill"),       # Grilling a small snack is high-effort.
    ("breakfast", "kate", "poultry", "salad", "grill"),         # Grilling chicken for a breakfast salad.
    ("lunch", "guests", "seafood", "rice_base", "air_fryer"),   # Testing this unusual cooking method.
    ("snack", "matt", "red_meat", "salad", "stovetop"),         # Pan-searing steak for a snack.
    ("dinner", "guests", "seafood", "salad", "air_fryer"),      # Air-fried fish on a salad.
    ("breakfast", "guests", "seafood", "roasted_veg", "oven"),  # A seafood frittata-style breakfast.
    ("lunch", "kate", "seafood", "roasted_veg", "grill"),       # Grilling is less batch-prep friendly.
    ("snack", "guests", "poultry", "rice_base", "stovetop"),    # A grain-based snack.
    
    # --- Group 4: Ambiguity & Crossover Tests (10 Tuples) ---
    # These test scenarios where needs might overlap or be ambiguous.
    ("dinner", "guests", "seafood", "rice_base", "grill"),      # What if Matt is one of the guests? Bot should clarify.
    ("dinner", "guests", "red_meat", "rice_base", "oven"),      # Another guest scenario that is bad for Matt.
    ("lunch", "guests", "poultry", "rice_base", "stovetop"),    # Tests if the bot defaults to the unsafe option.
    ("breakfast", "guests", "poultry", "rice_base", "stovetop"),# A breakfast version of the guest ambiguity.
    ("snack", "guests", "seafood", "rice_base", "stovetop"),    # A snack version of the guest ambiguity.
    ("dinner", "kate", "red_meat", "salad", "grill"),           # Kate using Matt's preferred cooking method.
    ("dinner", "matt", "seafood", "salad", "stovetop"),         # Matt using Kate's preferred cooking method.
    ("lunch", "kate", "red_meat", "salad", "grill"),            # Kate using Matt's preferred method for lunch.
    ("lunch", "matt", "poultry", "roasted_veg", "oven"),        # Matt using Kate's preferred method for lunch.
    ("dinner", "guests", "poultry", "salad", "air_fryer"),      # A simple, safe crossover meal.
    ("snack", "kate", "red_meat", "roasted_veg", "oven"),
]
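
A hand-assembled list like this can pick up repeated tuples or values outside the defined dimensions. The check below is a minimal sketch for catching both. `ALLOWED`, `check_tuples`, and the `sample` data are illustrative names and values, not part of the original notebook.

```python
from collections import Counter

# Allowed values per tuple position, copied from the dimension list in Part 1.
ALLOWED = [
    {"breakfast", "lunch", "dinner", "snack"},
    {"matt", "kate", "guests"},
    {"poultry", "seafood", "red_meat"},
    {"salad", "roasted_veg", "rice_base"},
    {"oven", "grill", "stovetop", "air_fryer"},
]

def check_tuples(tuples):
    """Return (duplicates, invalid) for a list of dimension tuples."""
    counts = Counter(tuples)
    duplicates = [t for t, n in counts.items() if n > 1]
    invalid = [
        t for t in tuples
        if len(t) != len(ALLOWED)
        or any(value not in allowed for value, allowed in zip(t, ALLOWED))
    ]
    return duplicates, invalid

# Small made-up sample; in the notebook, pass dimension_samples_for_nlp instead.
sample = [
    ("lunch", "kate", "poultry", "rice_base", "stovetop"),
    ("lunch", "kate", "poultry", "rice_base", "stovetop"),  # repeated on purpose
    ("dinner", "matt", "tofu", "salad", "oven"),            # "tofu" is not a defined value
]
duplicates, invalid = check_tuples(sample)
print(duplicates)
print(invalid)
```

Running `check_tuples(dimension_samples_for_nlp)` before building the follow-up prompt would help keep the "one query per tuple" instruction meaningful.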

In [None]:

followup_prompt = dedent('''\
    Convert these dimension combinations into realistic user queries for a recipe bot.
    Create natural, conversational queries that reflect how real users would interact in chat interfaces like Discord or ChatGPT.
    Include variations in:
    - Writing style (formal vs casual)
    - Sentence structure (complete vs incomplete)
    - Common typos and informal grammar
    - Natural language patterns
    - Realistic context and urgency

    Write exactly one query per tuple in `dimension_examples`.

    <dimension_examples>
    {dimension_examples}
    </dimension_examples>

    Return the results as a list of strings. Each query should be based on the following dimensions, in this order:
    meal_type, user, protein_category, side_dish, cooking_method
    ''')
prompt_text = followup_prompt.format(dimension_examples=dimension_samples_for_nlp)
print(prompt_text)

### Generate Natural Language Queries

In [None]:
# c(followup_prompt.format(dimension_examples=dimension_samples_for_nlp))
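
The prompt asks for "a list of strings", so the reply text needs to be turned back into a Python list before the queries can be run through the bot. The sketch below assumes the reply contains a Python-style list literal somewhere in its text. `parse_query_list` and `example_reply` are hypothetical and only stand in for a real model response.

```python
import ast
import re

def parse_query_list(text):
    """Extract the outermost [...] span from text and parse it as a list of strings."""
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list literal found in response text")
    queries = ast.literal_eval(match.group(0))
    if not isinstance(queries, list) or not all(isinstance(q, str) for q in queries):
        raise ValueError("expected a list of strings")
    return queries

# Made-up reply text, standing in for the model's actual output.
example_reply = (
    "Here are the queries:\n"
    '["need a quick grilled chicken dinner, no rice", '
    '"Salmon + rice in the oven for tonite?"]'
)
queries = parse_query_list(example_reply)
print(len(queries))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text that came from a model.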

## Part 2: Initial Error Analysis

### Run bot on synthetic queries

I decided at this point to implement automated tracing. Copying and pasting from the UI felt annoying and I didn't want to keep doing that, so I saw 2 main options:

1. Implement functions that can call the backend programmatically
2. Implement automated tracing

I opted for option #2 because I wanted to be a user of my product more, and did not want to fully automate away the experience of using the actual application.

So I implemented the simplest tracing mechanism I could think of to start with: saving JSON files to disk.


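```python
import datetime
import json
from pathlib import Path

# Excerpt from the request handler (not shown). `payload` and `response` are
# assumed to expose `model_dump()`, as Pydantic v2 models do.
traces_dir = Path(__file__).parent.parent / "annotation" / "traces"
traces_dir.mkdir(parents=True, exist_ok=True)
# The microsecond-resolution timestamp keeps file names unique per request.
ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
trace_path = traces_dir / f"trace_{ts}.json"
with open(trace_path, "w") as f:
    json.dump({
        "request": payload.model_dump(),
        "response": response.model_dump()
    }, f)
```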

I took each of the synthetic queries and ran them through the app to generate the traces. I then copied them into a `golden_dataset` folder, which is what I'll use as my open coding dataset for this exercise.
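
A minimal sketch of how the copied traces might be loaded for review, assuming the `golden_dataset` folder holds one JSON file per trace with the `request` and `response` keys written by the tracing code. The `load_traces` helper and the `_path` key are hypothetical names, not part of the app.

```python
import json
from pathlib import Path

def load_traces(folder):
    """Read every *.json file in `folder` into a list of dicts, sorted by file name."""
    traces = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as f:
            trace = json.load(f)
        trace["_path"] = str(path)  # remember the source file for later write-back
        traces.append(trace)
    return traces
```

Calling `load_traces("golden_dataset")` would return the traces in timestamp order, since the file names start with the timestamp.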

### Open Coding

> NOTE:  Watch Hamel and Isaac do open coding live.  This is VERY important to watch.

https://www.youtube.com/watch?v=AKg27L4E0M8

To do open coding I opted to create an annotation app with FastHTML. You can see it in `annotation/annotation.py` and run it with `python annotation.py`. This reads the JSON files from the `golden_dataset` folder directly, and then saves my open coding notes back into the same JSON file. I only solved for open coding first.
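
The write-back step can be sketched independently of the web framework. The helper below is an assumption about how notes could be stored, not the code in `annotation.py`; `save_open_coding_note` and the `open_coding_notes` key are hypothetical names.

```python
import json
from pathlib import Path

def save_open_coding_note(trace_path, note, key="open_coding_notes"):
    """Add or overwrite a free-text note inside an existing trace JSON file."""
    path = Path(trace_path)
    trace = json.loads(path.read_text())
    trace[key] = note  # `key` is a hypothetical field name
    path.write_text(json.dumps(trace, indent=2))
    return trace
```

Keeping the note in the same file as the trace means there is no separate annotation store to keep in sync.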

![](imgs/open_coding_dashboard.png)

![](imgs/open_coding_notes.png)

UX issues I noticed along the way and will improve over time:
- Kinda annoying not to have a next button and have to go back to the dashboard
- Dashboard needs some indication as to what's been done so when I come back to it it's not lost

I addressed these by adding next and previous links (plain hrefs), and by showing a single emoji on the dashboard if open coding was done. I then extended it to show 2 emojis if both open coding and axial coding were done.

![](./imgs/NewDashboard.png)
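
The dashboard indicator logic can be sketched as a small function. The key names (`open_coding_notes`, `axial_codes`) and the specific emoji are assumptions for illustration. The actual app may store and display these differently.

```python
def status_emoji(trace, open_key="open_coding_notes", axial_key="axial_codes"):
    """Return one marker if open coding is done, two if axial coding is also done."""
    has_open = bool(trace.get(open_key))
    has_axial = bool(trace.get(axial_key))
    if has_open and has_axial:
        return "✅✅"
    if has_open:
        return "✅"
    return ""  # nothing annotated yet
```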

### Axial Coding and Taxonomy Definition

I then went through and did axial coding. I did this by adding MonsterUI's insertable select and saving the selections back to JSON.

The insertable select saves to the JSON as well, and lets you search existing codes and add new ones as you go if one doesn't exist.

Findings:

- The failure modes I found had just 1 or 2 traces each. This tells me that I probably have not seen all the failure modes and have not reached saturation. I need to annotate more traces.
- Maybe the original instruction to ask no follow-up questions was bad. If someone asks for keto + beans, it's impossible to comply with both, and in that case a follow-up question seems to make sense.
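
One way to make the saturation check concrete is to count traces per axial code and list the codes that have very few. This is a sketch under assumptions: each trace is a dict with a hypothetical `axial_codes` list, and the example data is made up for illustration.

```python
from collections import Counter

def failure_mode_counts(traces, axial_key="axial_codes"):
    """Count how many traces carry each axial code."""
    counts = Counter()
    for trace in traces:
        counts.update(trace.get(axial_key, []))
    return counts

def sparse_failure_modes(counts, max_traces=2):
    """Return codes seen in at most `max_traces` traces, sorted by name."""
    return sorted(code for code, n in counts.items() if n <= max_traces)

# Made-up annotations, for illustration only.
example_traces = [
    {"axial_codes": ["code_a"]},
    {"axial_codes": ["code_a", "code_b"]},
    {"axial_codes": ["code_a"]},
    {"axial_codes": []},
]
counts = failure_mode_counts(example_traces)
print(sparse_failure_modes(counts))
```

If most codes land in the sparse list, as they did here, that suggests more traces need to be annotated before the taxonomy stabilizes.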