# 🧡 Exploring the Results

In [None]:
import json
import pandas as pd
from evoproc_procedures.schemas import get_schema
from evoproc_procedures.ollama import query
from evoproc_procedures.runners import run_steps_stateful_minimal
from evoproc_procedures.helpers import pretty_print



In [None]:
FINAL_SCHEMA = get_schema("gsm")
QUERY_FN = query
MODEL = "gpt-oss:120b"
GPT_OSS_LOCAL_BUGGY_MODELS = {"gpt-oss:20b", "gpt-oss:120b"}

def query_fn_no_format_for_gptoss(prompt, model, fmt, seed):
    # kill structured-output / format only for gpt-oss models
    # Note, this will only be with the gpt-oss NON-CLOUD models (gpt-oss:120b & gpt-oss:20b)
    # Cloud models do not have this bug, so I want to bypass this if it is a cloud model
    if isinstance(model, str) and model in GPT_OSS_LOCAL_BUGGY_MODELS:
        fmt = None
    return QUERY_FN(prompt, model, fmt, seed)   # call your existing function

## 🟧 Import the two sets of data

In [4]:
PROC_RESULTS_FILEPATH = "runs/gsm8k_train_v6.jsonl"
BASELINE_RESULTS_FILEPATH = "runs/gsm8k_train_v6_baseline.jsonl"

In [5]:
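# Parse one JSON object per line, skipping blank lines and "//" comment lines.
# The `with` block closes the file, so no explicit f.close() is needed.
with open(PROC_RESULTS_FILEPATH, "r") as f:
    read_results = [json.loads(line) for line in f if line.strip() and not line.startswith("//")]
proc_results_df = pd.DataFrame(read_results)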

In [6]:
# Same parsing as above, applied to the baseline run
with open(BASELINE_RESULTS_FILEPATH, "r") as f:
    read_results = [json.loads(line) for line in f if line.strip() and not line.startswith("//")]
baseline_results_df = pd.DataFrame(read_results)

## 🟧 Let's see where the correct and incorrect answers intersect

In [None]:
both_correct = proc_results_df[(proc_results_df['correct']) & (baseline_results_df['correct'])]
both_incorrect = proc_results_df[(~proc_results_df['correct']) & (~baseline_results_df['correct'])]

proc_correct_baseline_incorrect = proc_results_df[(proc_results_df['correct']) & (~baseline_results_df['correct'])]
proc_incorrect_baseline_correct = proc_results_df[(~proc_results_df['correct']) & (baseline_results_df['correct'])]

print(f"Both Correct: {both_correct.shape[0]}")
print(f"Both Incorrect: {both_incorrect.shape[0]}")
print(f"Procedure Correct, Baseline Incorrect: {proc_correct_baseline_incorrect.shape[0]}")
print(f"Procedure Incorrect, Baseline Correct: {proc_incorrect_baseline_correct.shape[0]}")

Both Correct: 231
Both Incorrect: 12
Procedure Correct, Baseline Incorrect: 1
Procedure Incorrect, Baseline Correct: 30


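So: 
- All but ONE of the 232 questions answered correctly by procedures were also answered correctly by baseline.
- All but ONE of the 13 questions answered incorrectly by baseline were also answered incorrectly by procedures.
- Both exceptions are the same single question: the one that procedures got right and baseline got wrong.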
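The four counts above can also be tallied in a single pass. This is a minimal sketch, assuming the two `correct` columns are row-aligned and have been converted to plain lists of booleans (for example with `.tolist()`). `agreement_counts` is a hypothetical helper name, not part of `evoproc_procedures`, and the data below is a toy example rather than the real runs.

```python
from collections import Counter

def agreement_counts(proc_correct, baseline_correct):
    """Count (procedure_correct, baseline_correct) pairs for row-aligned results."""
    if len(proc_correct) != len(baseline_correct):
        raise ValueError("both result lists must cover the same questions")
    return Counter(zip(map(bool, proc_correct), map(bool, baseline_correct)))

# Toy data for illustration only (not the real runs)
counts = agreement_counts(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
print("Both correct:", counts[(True, True)])
print("Procedure correct, baseline incorrect:", counts[(True, False)])
print("Procedure incorrect, baseline correct:", counts[(False, True)])
print("Both incorrect:", counts[(False, False)])
```

The explicit length check surfaces a mismatch between the two result files, which index-aligned boolean masks can otherwise hide.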

## 🟧 Let's dive deeper into what the procedures got wrong

The procedures produced 42 incorrect answers. Let's dive into those to see *why* they went wrong.

In [None]:
proc_incorrect = proc_results_df[~proc_results_df['correct']]
proc_incorrect.shape

(42, 16)

First, the questions that procedures got incorrect but baseline got correct (30 instances)

In [9]:
proc_results_df.iloc[113]["gold_answer"]

'She loses 5 marbles because 25 x .2 = <<25*.2=5>>5\nShe has 20 marbles after this because 25 - 5 = <<25-5=20>>20\nHer friend gives her 40 marbles because 20 x 2 = <<20*2=40>>40\nAdding those marbles to the 20 she had before, she ends up with 40 marbles + 20 marbles = <<40+20=60>>60 marbles\n#### 60'

In [10]:
for incorrect_instance in proc_incorrect_baseline_correct.itertuples():
    print("Question:", incorrect_instance.question)
    print("Index: ", incorrect_instance.Index)
    print("Expected Answer:", incorrect_instance.gold_num)
    print("Predicted Answer:", incorrect_instance.pred_num)
    print("Did baseline get it correct?:", baseline_results_df.iloc[incorrect_instance.Index]['correct'])
    print("Procedure Output:")
    pretty_print(incorrect_instance.procedure)
    run_steps_stateful_minimal(incorrect_instance.procedure, incorrect_instance.question, FINAL_SCHEMA, MODEL, print_bool=True, query_fn=query_fn_no_format_for_gptoss)
    print("----- \n\n")

Question: Samantha’s last name has three fewer letters than Bobbie’s last name. If Bobbie took two letters off her last name, she would have a last name twice the length of Jamie’s. Jamie’s full name is Jamie Grey. How many letters are in Samantha’s last name?
Index:  22
Expected Answer: 7.0
Predicted Answer: 9.0
Did baseline get it correct?: True
Procedure Output:

--- Procedure: Compute Samantha's last name length from textual relationships ---
Steps:

Step 1: Extract all numeric facts from the problem text and store each as a separate variable
  **Inputs**:
    - problem_text: Full problem statement
  **Outputs**:
    - diff_sam_bob: Difference: Samantha's last name has three fewer letters than Bobbie's (value 3)
    - bob_removed: Number of letters Bobbie removes from her last name (value 2)
    - factor: Multiplier indicating Bobbie's shortened name is twice Jamie's (value 2)
    - jamie_last_len: Length of Jamie's last name "Grey" (value 4)

Step 2: Form the symbolic ex

### Condition 1: **Correct setup, wrong execution**

    - often from hallucinations in responses that lead to nonsensical or unrelated results

    - condition1 = [22, 48, 81, 136, 173]

### Condition 2: **Correct setup, no or partial execution**

    - often from defining an expression but never filling in and executing that expression
    - Open question: how many of these (along with condition 1) could be mitigated by running the procedure multiple times and choosing the self-consistent answer?

    - condition2 = [31, 52, 65, 183, 193, 196, 200, 210, 240, 243, 245]

### Condition 3: **Incorrect or incomplete setup**

    - often because numbers were not extracted correctly from the problem text, or because the procedure itself is incomplete or incorrect

    - condition3 = [95, 103, 113, 142, 181, 187, 190, 237, 242, 267, 274]

### Condition 4: **Correct, but for some reason shows as incorrect**? (e.g., procedure output is there but for some reason was logged as 0)

    - possibly from improper execution of previous step run?

    - condition4 = [144, 234]

### Condition 5: **Correct, but the gold answer is incorrect or the units differ**

    - condition5 = [270]

    - For index=270, the answer is technically correct. The predicted answer is 60,000 cents and the question does not specify the final unit. The "gold answer" is 600 dollars, so the two are equal and differ only in units. I would exclude this from the set or mark it as correct.
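Unit mismatches like this one could be flagged automatically before manual review. This is a minimal sketch, assuming predicted and gold values are plain floats and that the only conversions of interest are known up front. `matches_up_to_scale` and its default factor list are hypothetical choices for illustration, not part of the evaluation code used here.

```python
import math

def matches_up_to_scale(pred, gold, factors=(1.0, 100.0, 0.01), rel_tol=1e-9):
    """Return True if pred equals gold after multiplying gold by one of the known unit factors.

    The default factors cover an exact match plus dollars <-> cents (an assumption;
    other conversions would need their own factors).
    """
    return any(math.isclose(pred, gold * f, rel_tol=rel_tol) for f in factors)

# Index 270 from above: 60,000 cents predicted against a gold answer of 600 dollars
print(matches_up_to_scale(60000.0, 600.0))
# A genuine miss (index 22: predicted 9, expected 7) stays a miss
print(matches_up_to_scale(9.0, 7.0))
```

A hit from this check is only a hint that the row deserves a second look. It should not flip `correct` on its own, since a wrong answer can coincide with a scaled gold value.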
    

In [11]:
condition1 = [22, 48, 81, 136, 173]
condition2 = [31, 52, 65, 183, 193, 196, 200, 210, 240, 243, 245]
condition3 = [95, 103, 113, 142, 181, 187, 190, 237, 242, 267, 274]
condition4 = [144, 234]
condition5 = [270]

Next, the items that were incorrect for BOTH procedures and baseline (12 instances)

In [12]:
for incorrect_instance in proc_incorrect.itertuples():
    if incorrect_instance.Index in condition1 + condition2 + condition3 + condition4 + condition5:
        # Already reviewed, skip
        continue
    print("Question:", incorrect_instance.question)
    print("Index: ", incorrect_instance.Index)
    print("Expected Answer:", incorrect_instance.gold_num)
    print("Predicted Answer:", incorrect_instance.pred_num)
    print("Did baseline get it correct?:", baseline_results_df.iloc[incorrect_instance.Index]['correct'])
    print("Procedure Output:")
    pretty_print(incorrect_instance.procedure)
    run_steps_stateful_minimal(incorrect_instance.procedure, incorrect_instance.question, FINAL_SCHEMA, MODEL, print_bool=True, query_fn=query_fn_no_format_for_gptoss)
    print("----- \n\n")

Question: Jasper will serve charcuterie at his dinner party. He buys 2 pounds of cheddar cheese for $10, a pound of cream cheese that cost half the price of the cheddar cheese, and a pack of cold cuts that cost twice the price of the cheddar cheese. How much does he spend on the ingredients?
Index:  13
Expected Answer: 35.0
Predicted Answer: 22.5
Did baseline get it correct?: False
Procedure Output:

--- Procedure: Compute total spending on charcuterie ingredients using symbolic relationships ---
Steps:

Step 1: Extract numeric facts from the problem text and create variables for each
  **Inputs**:
    - problem_text: Original problem statement describing the purchase details
  **Outputs**:
    - cheddar_qty: Quantity of cheddar cheese in pounds (2 pounds)
    - cheddar_total_cost: Total cost of cheddar cheese in dollars ($10)
    - cream_cheese_qty: Quantity of cream cheese in pounds (1 pound)
    - cream_cheese_price_factor: Price factor for cream cheese relative to cheddar price per

[step 4] inputs: {'wednesday_needed': -1}
[step 4] outputs: {'final_answer': 'The required Wednesday situp count is -1.', 'final_answer_numerical': -1, 'confidence': 1.0}
----- 


Question: Henry took 9 pills a day for 14 days. Of these 9 pills, 4 pills cost $1.50 each, and the other pills each cost $5.50 more. How much did he spend in total on the pills?
Index:  108
Expected Answer: 41.0
Predicted Answer: 574.0
Did baseline get it correct?: False
Procedure Output:

--- Procedure: Compute the total amount Henry spent on pills over 14 days using the given costs and quantities. ---
Steps:

Step 1: Extract all numeric facts from the problem text and store each as a separate variable.
  **Inputs**:
    - problem_text: The full problem statement.
  **Outputs**:
    - pills_per_day: Number of pills Henry took each day (9).
    - num_days: Number of days Henry took the pills (14).
    - cheap_pills_per_day: Number of $1.50 pills taken each day (4).
    - cheap_pill_cost: Cost of each cheap pi

### Condition 1: **Correct setup, wrong execution**

    - often from hallucinations in responses that lead to nonsensical or unrelated results

    - condition1 = []

### Condition 2: **Correct setup, no or partial execution**

    - often from defining an expression but never filling in and executing that expression
    - Open question: how many of these (along with condition 1) could be mitigated by running the procedure multiple times and choosing the self-consistent answer?

    - condition2 = [59, 264]

### Condition 3: **Incorrect or incomplete setup**

    - often because numbers were extracted incorrectly from the problem text, or because the procedure itself is incomplete or incorrect

    - condition3 = [13, 63, 220, 230]

### Condition 4: **Correct, but for some reason shows as incorrect**? (e.g., procedure output is there but for some reason was logged as 0)

    - possibly from improper execution of previous step run? 

    - condition4 = []

### Condition 5: **Correct, but the gold answer is incorrect or the units differ**

    - condition5 = [108, 167, 218, 228, 229, 235]

    - 108, 167, 218, 228, 235: The gold answer appears to be incorrect and the predicted answer correct, so each of these was marked as incorrect when it is in fact correct. (Needs correction from data providers!)

    - 229: Correct, but the question does not specify the units of the final answer. The gold answer is 15 units remaining, while the predicted answer converts that to feet: 15 units remaining * 25 feet per unit = 375 (as in 375 feet of material left instead of 15 units left)

### Special Condition(s): 

    - 59: The final output contains a calculation that still needs to be performed (subtracting the final number from the total). Our design treats the final output strictly as output, not as the result of an execution, so I am assuming the last calculation (defined only in the final output) is never actually executed. To handle this, the procedure generation prompt should state that the final step must be explicitly marked as OUTPUT and not as an executable, and that the final output must be just the final number (even if that requires a "passthrough" step). Open question: why does the final output not execute? Possibly the final output schema we enforce makes the last step default to "final answer numerical: the final answer" and never "final answer numerical: execution that needs to be done for the answer". Not sure, need to look into this.
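Cases like 59 might be caught at run time by checking whether the final value is a resolved number or still an expression. This is a minimal sketch, assuming the final value is available as a number or a string. `is_resolved_number` is a hypothetical helper, and the example strings are made up for illustration rather than copied from the logged run.

```python
def is_resolved_number(value):
    """Return True if value is a number, or a string that parses as a single number."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, but True/False is not a numeric answer
        return False
    if isinstance(value, (int, float)):
        return True
    if isinstance(value, str):
        try:
            float(value.strip().replace(",", ""))
            return True
        except ValueError:
            return False
    return False

# Hypothetical final outputs (illustration only)
print(is_resolved_number(120.0))
print(is_resolved_number("1,200"))
print(is_resolved_number("total_coins - 15"))
```

A run whose final output fails this check could be re-run or routed through an extra step that evaluates the leftover expression.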

In [13]:
condition1_final = [22, 48, 81, 136, 173]
condition2_final = [31, 52, 59, 65, 183, 193, 196, 200, 210, 240, 243, 245, 264]
condition3_final = [13, 63, 95, 103, 113, 142, 181, 187, 190, 220, 230, 237, 242, 267, 274]
condition4_final = [144, 234]
condition5_final = [108, 167, 218, 228, 229, 235, 270]

In [14]:
len(condition3_final)

15

Conditions 1, 2, and 4 indicate issues in the step runs. These might be mitigated by a self-consistency approach, where each procedure is run multiple times and the most frequent answer is chosen.
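The self-consistency idea can be sketched as a majority vote over repeated runs. This is an assumption-laden illustration: `self_consistent_answer` is a hypothetical helper, and `run_once` stands in for any callable that runs one procedure with a given seed and returns its numeric answer (or `None` on failure). It does not reflect the actual signature of `run_steps_stateful_minimal`.

```python
from collections import Counter

def self_consistent_answer(run_once, n_runs=5):
    """Run a procedure n_runs times and return (most frequent answer, share of runs agreeing)."""
    answers = [run_once(seed) for seed in range(n_runs)]
    answers = [a for a in answers if a is not None]  # drop failed runs
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_runs

# Stand-in for a real procedure run: fixed outputs per seed (illustration only)
fake_outputs = {0: 7.0, 1: 9.0, 2: 7.0, 3: None, 4: 7.0}
answer, agreement = self_consistent_answer(lambda seed: fake_outputs[seed], n_runs=5)
print(answer, agreement)
```

The agreement share doubles as a rough confidence signal. A low share suggests the procedure itself is unstable, which points more toward condition 3 than toward a one-off execution slip.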

Condition 3 indicates issues in the procedure itself: the procedure as it exists now would not work, so a new procedure would need to be generated.

Condition 5 means the gold answer is actually incorrect, or the answers agree but are in different units (because no unit was specified in the question). The 7 instances in this set were therefore actually correct.

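Therefore, of the 42 instances marked incorrect, 7 were actually correct (condition 5) and 20 are step-run failures (conditions 1, 2, and 4), so really only the 15 condition 3 instances are solidly incorrect.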
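The breakdown can be double-checked directly from the index lists. This is a small sketch that reuses the final lists from above. The `conditions` dictionary and the names `counts` and `recoverable` are introduced here for illustration only.

```python
conditions = {
    "condition1": [22, 48, 81, 136, 173],
    "condition2": [31, 52, 59, 65, 183, 193, 196, 200, 210, 240, 243, 245, 264],
    "condition3": [13, 63, 95, 103, 113, 142, 181, 187, 190, 220, 230, 237, 242, 267, 274],
    "condition4": [144, 234],
    "condition5": [108, 167, 218, 228, 229, 235, 270],
}

# Each reviewed index should be assigned to exactly one condition
all_indices = [i for ids in conditions.values() for i in ids]
assert len(all_indices) == len(set(all_indices)), "an index appears in more than one condition"

counts = {name: len(ids) for name, ids in conditions.items()}
# Conditions 1, 2, and 4 are the step-run failures
recoverable = counts["condition1"] + counts["condition2"] + counts["condition4"]

print(counts)
print("Total reviewed:", sum(counts.values()))
print("Step-run failures (conditions 1, 2, 4):", recoverable)
```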

In [None]:
baseline_incorrect = baseline_results_df[~baseline_results_df['correct']]

In [16]:
baseline_incorrect.shape

(13, 12)

In [17]:
for incorrect_instance in baseline_incorrect.itertuples():
    if incorrect_instance.Index in condition5_final:
        # Already reviewed, skip
        continue
    print("Question:", incorrect_instance.question)
    print("Index: ", incorrect_instance.Index)
    print("Expected Answer:", incorrect_instance.gold_num)
    print("Predicted Answer:", incorrect_instance.pred_num)
    print("----- \n\n")

Question: Jasper will serve charcuterie at his dinner party. He buys 2 pounds of cheddar cheese for $10, a pound of cream cheese that cost half the price of the cheddar cheese, and a pack of cold cuts that cost twice the price of the cheddar cheese. How much does he spend on the ingredients?
Index:  13
Expected Answer: 35.0
Predicted Answer: 22.5
----- 


Question: Every hour Joanne has to collect the coins out of the fountain inside the mall. During the first hour, she collected 15 coins. For the next two hours, she collected 35 coins from the fountain. In the fourth hour, she collected 50 coins from the fountain but she gave 15 of them to her coworker so she could buy a soda. How many coins did she have after the fourth hour?
Index:  59
Expected Answer: 120.0
Predicted Answer: 85.0
----- 


Question: Shawna's workout goal is 30 situps. On Monday, Shawna was only able to do 12 situps, so she decided that she would make up for the rest on Tuesday. However, she was only able to do 19 si

TODO: 
To mitigate condition 1:
    Run the procedure multiple times
To mitigate condition 3:
    Increase population size/iterations

Check condition 3

In [18]:
condition3_final = [13, 63, 95, 103, 113, 142, 181, 187, 190, 220, 230, 237, 242, 267, 274]

In [19]:
proc_correct_baseline_incorrect

Unnamed: 0,id,row_index,question,gold_answer,gold_num,fitness,procedure,state,pred_answer,pred_num,correct,steps,status,error_type,error,timestamp_utc
118,4e723bec21d3369944d63856435ebccc,118,Jeanette is practicing her juggling. Each week...,First find the total number of additional obje...,13.0,1.25,{'NameDescription': 'Compute the number of obj...,{'problem_text': 'Jeanette is practicing her j...,The final answer is 13.,13.0,True,3.0,ok,,,
