# 🧡 Generating Generalization Attacks for **EvoProc** procedures for the **GSM8K Dataset**

Pseudo-code Plan

Note: The dataset is only the first 275 questions of the GSM8K training split (accessed with the HuggingFace `datasets` library)
1. For each question in the dataset
    
    a. Change the numbers in the question

    b. Using the new numbers, calculate the new "golden number" (aka the final answer)

    c. Ensure that the final answer is an INTEGER (common practice in the GSM8K dataset)

2. For each new mutated question (question with new numbers)

    a. Proc Answer: Use the corresponding procedure from the file `/home/student/Desktop/malia/evoproc_tests/runs/gsm8k_train_v6.jsonl` (access the procedure using `instance["procedure"]`). Run each step with the NEW problem text inserted.

    b. Baseline answer: Pass the NEW problem text directly to the model, with no procedure

3. Compare the % decrease in accuracy (original questions vs. mutated questions) for both the proc and baseline answers
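
The comparison in step 3 can be sketched as follows. This is only an illustration under stated assumptions: `accuracy` and `percent_decrease` are hypothetical helper names that do not appear elsewhere in this notebook, the decrease is taken to be relative to the original accuracy, and the boolean lists are placeholders rather than real results.

```python
from typing import Sequence

def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of examples answered correctly (0.0 for an empty list)."""
    return sum(correct_flags) / len(correct_flags) if correct_flags else 0.0

def percent_decrease(acc_original: float, acc_mutated: float) -> float:
    """Relative drop in accuracy, as a percentage of the original accuracy."""
    if acc_original == 0:
        raise ValueError("original accuracy is zero; relative decrease is undefined")
    return 100.0 * (acc_original - acc_mutated) / acc_original

# Placeholder flags for illustration only (not real results).
original_flags = [True, True, True, True]
mutated_flags = [True, True, False, False]

drop = percent_decrease(accuracy(original_flags), accuracy(mutated_flags))
print(f"{drop:.1f}% relative decrease")
```

Running the same two helpers once on the proc results and once on the baseline results gives the two numbers to compare.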

## 🟧 **Step 1**: Import packages and necessary application functions & variables

In [1]:
import os
import random
import re
import math
import json
import traceback
import pandas as pd
from hashlib import blake2b
from datetime import datetime, timezone
from datasets import load_dataset
from typing import Callable, Optional, Dict, Any, Tuple
from evoproc.ga_scaffold_structured import ProcedureGA, GAConfig
from evoproc.validators import validate_procedure_structured
from evoproc_procedures.schemas import get_schema
from evoproc_procedures.prompts import create_procedure_prompt
from evoproc_procedures.ollama import query, repair_fn_ollama
from evoproc_procedures.runners import run_steps_stateful_minimal
from evoproc_procedures.helpers import pretty_print



## 🟧 **Step 2**: Import the GSM8K Dataset

In [2]:
gsm8k_dataset = load_dataset("gsm8k", "main", split="train[:275]")
# Use the first 275 questions as requested
subsample = gsm8k_dataset


## 🟧 **Step 2b**: Load prior procedures and set up mutation/solver helpers


In [5]:
# Paths to existing results (procedures)
PROC_RESULTS_PATH = "runs/gsm8k_train_v6.jsonl"
BASELINE_RESULTS_PATH = "runs/gsm8k_train_v6_baseline.jsonl"

# New output files for the generalization attack
GENERALIZATION_PROC_OUT = "runs/generalization_attack_results.jsonl"
GENERALIZATION_BASELINE_OUT = "runs/generalization_attack_baseline_results.jsonl"

def _read_jsonl_allowing_comments(path):
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("//"):
                continue
            records.append(json.loads(line))
    return records

proc_records = _read_jsonl_allowing_comments(PROC_RESULTS_PATH)
procedures_by_row = {rec.get("row_index"): rec.get("procedure") for rec in proc_records if rec.get("procedure") is not None}

# Reasoning extraction: take everything before the final #### line
def extract_reasoning(answer_text: str) -> str:
    if not answer_text:
        return ""
    parts = answer_text.split("####")
    return parts[0].strip()

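# Match standalone non-negative integers only. Decimals ("3.5") and comma-grouped
# numbers ("1,000") are skipped so they are not split into pieces, and hyphens in
# ranges ("10-12") are left in place instead of being read as a minus sign.
_NUM_RE = re.compile(r"(?<![\d.,])\d+(?![.,]?\d)")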

# Mutate numbers in a question while keeping mapping (by value)
def mutate_numbers_in_text(text: str, rng: random.Random, min_val: int = 1, max_val: int = 99):
    mapping = {}
    def repl(m):
        s = m.group(0)
        if s not in mapping:
            orig = int(s)
            new = orig
            for _ in range(50):
                new = rng.randint(min_val, max_val)
                if new != orig:
                    break
            mapping[s] = new
        return str(mapping[s])
    new_text = _NUM_RE.sub(repl, text)
    return new_text, mapping

# Solver to create new gold answers using original reasoning as a template
def solve_gold_with_reasoning(question: str, reasoning: str, query_fn, model: str, seed: int = 1234, max_retries: int = 3):
    base_prompt = (
        "Solve the GSM8K problem with the new numbers.\n"
        "You are given the original reasoning path as a template.\n"
        "Recompute all arithmetic with the NEW numbers; do not copy old numeric results.\n"
        "Return ONLY JSON with keys: \"final_answer\" (string), \"final_answer_numerical\" (integer).\n"
        "No extra text.\n\n"
        f"PROBLEM:\n{question}\n\n"
        f"ORIGINAL_REASONING_TEMPLATE:\n{reasoning}\n"
    )
    last_raw = None
    for attempt in range(max_retries):
        prompt = base_prompt
        if attempt > 0:
            prompt += "\nYour previous answer was not a valid integer. Re-solve and return an integer."
        raw = query_fn(prompt, model, FINAL_SCHEMA, seed + attempt)
        last_raw = raw
        state, pred_ans, pred_num, raw_text = _parse_baseline_output(raw)
        # Coerce to int if it is an integer-valued float
        if pred_num is not None:
            try:
                if float(pred_num).is_integer():
                    return {
                        "gold_num": int(float(pred_num)),
                        "gold_answer": pred_ans,
                        "state": state,
                        "raw": raw_text,
                        "coerced": False,
                    }
            except Exception:
                pass
    # Final fallback: try to parse last numeric and coerce
    gold_num = None
    try:
        if last_raw is not None:
            _, _, pred_num, raw_text = _parse_baseline_output(last_raw)
            if pred_num is not None:
                gold_num = int(round(float(pred_num)))
                return {
                    "gold_num": gold_num,
                    "gold_answer": str(pred_num),
                    "state": None,
                    "raw": raw_text,
                    "coerced": True,
                }
    except Exception:
        pass
    return {"gold_num": None, "gold_answer": None, "state": None, "raw": None, "coerced": True}

# Build a mutated dataset with new questions and LLM-solved gold numbers
def build_mutated_examples(dataset, rng: random.Random, query_fn, model: str, seed: int = 1234):
    mutated = []
    for idx, ex in enumerate(dataset):
        orig_q = ex.get("question", "")
        orig_a = ex.get("answer", "")
        reasoning = extract_reasoning(orig_a)
        new_q, mapping = mutate_numbers_in_text(orig_q, rng)
        solved = solve_gold_with_reasoning(new_q, reasoning, query_fn, model, seed=seed + idx)
        gold_num = solved.get("gold_num")
        # store a minimal GSM8K-style answer string for parsing
        gold_answer = None if gold_num is None else f"#### {int(gold_num)}"
        mutated.append({
            "question": new_q,
            "answer": gold_answer,
            "gold_num": gold_num,
            "orig_question": orig_q,
            "orig_answer": orig_a,
            "orig_reasoning": reasoning,
            "number_mapping": mapping,
            "procedure": procedures_by_row.get(idx),
            "solver_raw": solved.get("raw"),
            "solver_state": solved.get("state"),
            "solver_coerced": solved.get("coerced"),
        })
    return mutated
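
To make the "mapping by value" idea concrete, here is a small standalone sketch. It is a simplified stand-in for `mutate_numbers_in_text`, not the function itself: `mutate_numbers` is a hypothetical name, it uses a plain `\d+` pattern, and the example question is invented. It shows that a repeated numeral always receives the same replacement and that a seeded `random.Random` makes the mutation reproducible.

```python
import random
import re

def mutate_numbers(text: str, rng: random.Random, lo: int = 1, hi: int = 99):
    """Replace each distinct integer with a random one, reusing the same
    replacement whenever the same numeral appears again."""
    mapping = {}

    def repl(match):
        token = match.group(0)
        if token not in mapping:
            new = rng.randint(lo, hi)
            while new == int(token):  # redraw until the value actually changes
                new = rng.randint(lo, hi)
            mapping[token] = new
        return str(mapping[token])

    return re.sub(r"\d+", repl, text), mapping

question = "Tom has 12 apples and buys 12 more, then eats 3."
a_text, a_map = mutate_numbers(question, random.Random(42))
b_text, b_map = mutate_numbers(question, random.Random(42))

print(a_text)
print(a_map)
print(a_text == b_text)  # same seed, same mutation
```

Because the replacement numbers are drawn independently, nothing guarantees that the mutated problem still has an integer answer, which is why the solver step retries and, as a last resort, rounds.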


## 🟧 **Step 3**: Set variable constants and instantiate necessary functions

These include:
- Defining regex search functions to grab the final numerical answer from the GSM8K `answer` parameter
- Defining the evaluation function to compare predicted and actual answers
- Defining functions for file and ID handling (for saving results)
- Grabbing the GSM final answer schema
- Setting the query function
- Setting the model
- Instantiating the GA (genetic algorithm) object
- Defining the run function which puts everything together into one easy-to-run function

In [6]:
_FINAL_AFTER_HASH_RE = re.compile(r"####\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)
_LAST_NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def extract_gold_number(gold_answer):
    m = _FINAL_AFTER_HASH_RE.search(gold_answer)
    if m:
        return float(m.group(1))
    nums = _LAST_NUMBER_RE.findall(gold_answer)
    return float(nums[-1]) if nums else None

def _numbers_equal(a, b, tol=1e-9):
    if a is None or b is None:
        return False
    try:
        return abs(float(a) - float(b)) < tol
    except Exception:
        return a == b

In [7]:
def eval_fn(state) -> float:
    """Return a fitness score in [0,1]."""
    # prefer model-extracted numeric if present, else try to parse its text
    pred_num = state.get("final_answer_numerical")
    if pred_num is None:
        try:
            pred_num = float(re.findall(r"-?\d+(?:\.\d+)?", state.get("final_answer",""))[-1])
        except Exception:
            return 0.0
    gold_num = state.get("_gold_num")  # we'll inject this per item
    if gold_num is None:
        return 0.0
    # exact match or close within small tolerance
    return 1.0 if math.isclose(pred_num, gold_num, rel_tol=0, abs_tol=1e-6) else 0.0
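
As a usage illustration, the sketch below mirrors the scoring logic on two hand-written states. `score` is a hypothetical stand-in with the same rules as `eval_fn`, and both states are invented for the example; the point is that the caller injects `_gold_num` into the state before scoring.

```python
import math
import re

def score(state: dict) -> float:
    """Return 1.0 when the predicted number matches the injected gold number."""
    pred = state.get("final_answer_numerical")
    if pred is None:
        # Fall back to the last number that appears in the answer text.
        nums = re.findall(r"-?\d+(?:\.\d+)?", state.get("final_answer", ""))
        if not nums:
            return 0.0
        pred = float(nums[-1])
    gold = state.get("_gold_num")
    if gold is None:
        return 0.0
    return 1.0 if math.isclose(pred, gold, rel_tol=0, abs_tol=1e-6) else 0.0

# Hypothetical final state; the numeric field is missing, so the text is parsed.
state = {"final_answer": "Betty needs $5 more.", "final_answer_numerical": None}
state["_gold_num"] = 5.0  # injected per item by the caller

right = score(state)
wrong = score({"final_answer": "30", "_gold_num": 5.0})
print(right, wrong)  # 1.0 0.0
```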

In [8]:
def _safe_extract_json(text: str) -> Optional[dict]:
    """Try to pull a JSON object from a string. Returns dict or None."""
    if not text:
        return None
    s = text.strip()

    # Strip ```json ... ``` fences
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", s, flags=re.DOTALL | re.IGNORECASE)
    if fence:
        s = fence.group(1).strip()

    # Try direct parse
    try:
        obj = json.loads(s)
        return obj if isinstance(obj, dict) else None
    except Exception:
        pass

    # Fallback: parse the span from the first "{" to the last "}"
    i, j = s.find("{"), s.rfind("}")
    if i != -1 and j != -1 and j > i:
        try:
            obj = json.loads(s[i : j + 1])
            return obj if isinstance(obj, dict) else None
        except Exception:
            return None
    return None


def _parse_baseline_output(raw: Any) -> Tuple[Dict[str, Any], Optional[str], Optional[float], str]:
    """
    Returns (state, pred_answer, pred_num, raw_text).
    - state: parsed dict if possible, else {"final_answer": <text>}
    - pred_answer: state["final_answer"] if available, else the raw text
    - pred_num: state["final_answer_numerical"] if available/coercible, else last number in pred_answer
    - raw_text: normalized string version of raw
    """
    # Normalize raw_text
    if isinstance(raw, dict):
        raw_text = json.dumps(raw, ensure_ascii=False)
        state = raw
    else:
        raw_text = "" if raw is None else str(raw)
        state = _safe_extract_json(raw_text)

    if isinstance(state, dict):
        pred_answer = state.get("final_answer")
        pred_num = state.get("final_answer_numerical")

        # Coerce numeric if possible
        try:
            pred_num = float(pred_num) if pred_num is not None else None
        except Exception:
            pred_num = None

        # If numeric missing, try parse from final_answer text
        if pred_num is None:
            txt = str(pred_answer or "")
            nums = _LAST_NUMBER_RE.findall(txt)
            pred_num = float(nums[-1]) if nums else None

        return state, (str(pred_answer) if pred_answer is not None else None), pred_num, raw_text

    # No JSON found: treat as plain text
    txt = raw_text.strip()
    nums = _LAST_NUMBER_RE.findall(txt)
    pred_num = float(nums[-1]) if nums else None
    return {"final_answer": txt}, (txt if txt else None), pred_num, raw_text
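
The brace-span fallback has one limitation worth seeing in isolation. The sketch below is a reduced, standalone version of that fallback only (`first_to_last_brace` is a hypothetical name and the sample strings are invented): it recovers a JSON object wrapped in chatty text, but returns `None` when the text holds two separate objects, because the span from the first `{` to the last `}` is then not valid JSON.

```python
import json
from typing import Optional

def first_to_last_brace(text: str) -> Optional[dict]:
    """Parse the substring from the first '{' to the last '}' as a JSON object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

chatty = 'Sure! {"final_answer": "5", "final_answer_numerical": 5} Hope that helps.'
two_objects = '{"a": 1} and {"b": 2}'

print(first_to_last_brace(chatty))       # {'final_answer': '5', 'final_answer_numerical': 5}
print(first_to_last_brace(two_objects))  # None
```

In the second case `_parse_baseline_output` would fall through to its plain-text branch and take the last number in the raw text.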


In [9]:
def _append_jsonl(path, items):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        for it in items:
            f.write(json.dumps(it, ensure_ascii=False) + "\n")
        f.flush()
        os.fsync(f.fileno())

def _load_existing_ids(path):
    ids = set()
    try:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                    if "id" in rec and rec["id"] is not None:
                        ids.add(rec["id"])
                except json.JSONDecodeError:
                    # tolerate a partially written last line
                    continue
    except FileNotFoundError:
        pass
    return ids

def _stable_example_id(ex, *, salt=""):
    """
    Make a deterministic ID when dataset lacks 'id'.
    Uses question+answer (+ optional salt) to avoid changing if order changes.
    """
    # If a real id exists, reuse it
    if ex.get("id") is not None:
        return str(ex["id"])

    q = (ex.get("question") or "").strip()
    a = (ex.get("answer") or "").strip()
    h = blake2b(digest_size=16)  # 128-bit
    h.update(salt.encode("utf-8", "ignore"))
    h.update(q.encode("utf-8", "ignore"))
    h.update(b"\x1e")  # delimiter
    h.update(a.encode("utf-8", "ignore"))
    return h.hexdigest()
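
The way content-based IDs support resuming can be sketched on its own. This is a simplified stand-in, not `_stable_example_id` itself: `content_id` is a hypothetical name, the examples and the salt value are invented, and the set of saved IDs is built by hand instead of being read from a JSONL file.

```python
from hashlib import blake2b

def content_id(question: str, answer: str, salt: str = "") -> str:
    """Hash salt, question, and answer into a deterministic 128-bit hex ID."""
    h = blake2b(digest_size=16)  # 16 bytes -> 32 hex characters
    h.update(salt.encode("utf-8"))
    h.update(question.strip().encode("utf-8"))
    h.update(b"\x1e")  # delimiter between question and answer
    h.update(answer.strip().encode("utf-8"))
    return h.hexdigest()

examples = [
    {"question": "What is 2 + 3?", "answer": "#### 5"},
    {"question": "What is 4 * 6?", "answer": "#### 24"},
]
salt = "demo-run"  # hypothetical salt value
ids = [content_id(ex["question"], ex["answer"], salt) for ex in examples]

# Pretend the first example was already saved by an earlier, interrupted run.
already_saved = {ids[0]}
todo = [ex for ex, ex_id in zip(examples, ids) if ex_id not in already_saved]

print(len(ids[0]))                       # 32
print([ex["question"] for ex in todo])   # ['What is 4 * 6?']
```

Since the ID depends only on the salt and the text, reordering the dataset does not change it, while a different salt (for example one per output file) gives a disjoint ID space.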

In [10]:
# MODEL = "llama4:latest"
# MODEL = "gemma3:latest"
# MODEL = "gpt-oss:120b-cloud"
GPT_OSS_LOCAL_BUGGY_MODELS = {"gpt-oss:20b", "gpt-oss:120b"}
MODEL = "gpt-oss:120b"
FINAL_SCHEMA = get_schema("gsm")
QUERY_FN = query

In [11]:
def query_fn_no_format_for_gptoss(prompt, model, fmt, seed):
    # kill structured-output / format only for gpt-oss models
    # Note, this will only be with the gpt-oss NON-CLOUD models (gpt-oss:120b & gpt-oss:20b)
    # Cloud models do not have this bug, so I want to bypass this if it is a cloud model
    if isinstance(model, str) and model in GPT_OSS_LOCAL_BUGGY_MODELS:
        fmt = None
    return QUERY_FN(prompt, model, fmt, seed)   # call your existing function
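
The effect of the wrapper can be checked without calling a model. In the sketch below, `stub_query` is a hypothetical stand-in for the real query function that simply echoes its arguments, and the schema is a placeholder; only the exact-match membership test is the same as in the cell above.

```python
from typing import Optional

LOCAL_BUGGY_MODELS = {"gpt-oss:20b", "gpt-oss:120b"}

def stub_query(prompt: str, model: str, fmt: Optional[dict], seed: int) -> dict:
    """Hypothetical stand-in for the real query function; echoes its arguments."""
    return {"model": model, "fmt": fmt, "seed": seed}

def query_without_format_for_local_gptoss(prompt, model, fmt, seed):
    # Exact-match membership test: only the listed local tags lose the schema.
    if isinstance(model, str) and model in LOCAL_BUGGY_MODELS:
        fmt = None
    return stub_query(prompt, model, fmt, seed)

schema = {"type": "object"}  # placeholder schema for illustration
local = query_without_format_for_local_gptoss("2+2?", "gpt-oss:120b", schema, 1234)
cloud = query_without_format_for_local_gptoss("2+2?", "gpt-oss:120b-cloud", schema, 1234)

print(local["fmt"])  # None
print(cloud["fmt"])  # {'type': 'object'}
```

Because the check is an exact match on the tag, `gpt-oss:120b-cloud` keeps its structured-output schema even though it shares a prefix with a local tag.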

In [12]:
# A runner is called as runner(question, gold_num, ex=ex, idx=idx)
RunnerFn = Callable[..., Dict[str, Any]]

def make_baseline_runner(query_fn, model: str, seed: int = 1234, print_bool: bool = False) -> RunnerFn:
    def _runner(question: str, gold_num: Optional[float], ex: Optional[dict] = None, idx: Optional[int] = None) -> Dict[str, Any]:
        prompt = (
            "Solve the following GSM8K problem.\n"
            'Return ONLY JSON with keys: "final_answer" (string), "final_answer_numerical" (number).\n'
            "No extra text.\n\n"
            f"PROBLEM:\n{question}"
        )
        if print_bool:
            print("Prompt to model:")
            print(prompt)
            print("-----")
        raw = query_fn(prompt, model, FINAL_SCHEMA, seed)  # your query_fn can ignore fmt for local gpt-oss
        if print_bool:
            print("Raw model output:")
            print(raw)
            print("-----")
        state, pred_ans, pred_num, raw_text = _parse_baseline_output(raw)

        correct = (
            pred_num is not None
            and gold_num is not None
            and math.isclose(float(pred_num), float(gold_num), rel_tol=0, abs_tol=1e-6)
        )
        return {
            "mode": "baseline",
            "state": state,
            "pred_answer": pred_ans,
            "pred_num": pred_num,
            "correct": bool(correct),
            "raw": raw_text,
        }
    return _runner

# GA-based procedural runner (original)
def make_procedural_runner(ga, query_fn, model: str, seed: int = 1234, print_bool: bool = False) -> RunnerFn:
    def _runner(question: str, gold_num: Optional[float], ex: Optional[dict] = None, idx: Optional[int] = None) -> Dict[str, Any]:
        best, history = ga.run(
            task_description=question,
            final_answer_schema=FINAL_SCHEMA,
            eval_fn=None,
            print_progress=print_bool,
        )
        if print_bool:
            print("Best procedure found:")
            pretty_print(best.proc)
            print("Running final procedure to get final answer...")
        final_state = run_steps_stateful_minimal(
            best.proc, question, FINAL_SCHEMA, ga.model, print_bool=print_bool, query_fn=query_fn
        )
        pred_ans = final_state.get("final_answer")
        pred_num = final_state.get("final_answer_numerical")
        correct = (
            pred_num is not None
            and gold_num is not None
            and math.isclose(float(pred_num), float(gold_num), rel_tol=0, abs_tol=1e-6)
        )
        return {
            "mode": "procedural",
            "procedure": best.proc,
            "fitness": best.fitness,
            "steps": len(best.proc.get("steps", [])),
            "state": final_state,
            "pred_answer": pred_ans,
            "pred_num": pred_num,
            "correct": bool(correct),
        }
    return _runner

# Fixed-procedure runner using existing procedures from prior run
def make_fixed_procedure_runner(query_fn, model: str, seed: int = 1234, print_bool: bool = False) -> RunnerFn:
    def _runner(question: str, gold_num: Optional[float], ex: Optional[dict] = None, idx: Optional[int] = None) -> Dict[str, Any]:
        if not ex or ex.get("procedure") is None:
            raise ValueError("Missing procedure in example for fixed-procedure runner")
        proc = ex.get("procedure")
        final_state = run_steps_stateful_minimal(
            proc, question, FINAL_SCHEMA, model, print_bool=print_bool, query_fn=query_fn
        )
        pred_ans = final_state.get("final_answer")
        pred_num = final_state.get("final_answer_numerical")
        correct = (
            pred_num is not None
            and gold_num is not None
            and math.isclose(float(pred_num), float(gold_num), rel_tol=0, abs_tol=1e-6)
        )
        return {
            "mode": "procedural_fixed",
            "procedure": proc,
            "steps": len(proc.get("steps", [])) if isinstance(proc, dict) else None,
            "state": final_state,
            "pred_answer": pred_ans,
            "pred_num": pred_num,
            "correct": bool(correct),
        }
    return _runner


The runners above handle the procedural queries and the baseline queries.

In [13]:
def run_gsm8k_batch(
    examples,
    runner: RunnerFn,       # <-- inject baseline/procedural behavior here
    out_path=None,          # e.g., "runs/gsm8k_results.jsonl"
    save_every=10,          # write every N examples
    resume=False,           # skip examples whose IDs are already in out_path
    id_salt="",             # optional: add dataset name/split/version here for extra uniqueness
    *,
    skip_errors: bool = False,          # NEW: continue after per-item failures
    save_error_records: bool = True,    # NEW: write an "error" record to JSONL
    include_traceback: bool = False,    # NEW: optionally store traceback (can be large)
    print_bool: bool = False,
):
    """
    examples: iterable of dicts like {"question": "...", "answer": "..."} (GSM8K format)
    Returns: list of per-item result dicts with procedure, state, and score.
    Persists results to JSONL every `save_every`. If `resume=True`, skips already-saved IDs.
    """
    pending = []
    results = []
    existing_ids = _load_existing_ids(out_path) if (resume and out_path) else set()

    def _flush():
        nonlocal pending
        if out_path and pending:
            _append_jsonl(out_path, pending)
            pending.clear()

    for idx, ex in enumerate(examples):
        qid = _stable_example_id(ex, salt=id_salt)
        if resume and out_path and (qid in existing_ids):
            continue

        question = ex.get("question")
        gold_text = ex.get("answer")
        gold_num = ex.get("gold_num")
        if gold_num is None:
            gold_num = extract_gold_number(gold_text) if gold_text else None

        best = None # so the except block is safe
        try:
            # every runner built above accepts ex/idx as keyword arguments
            extra = runner(question, gold_num, ex=ex, idx=idx)

            rec = {
                "id": qid,
                "row_index": idx,
                "question": question,
                "gold_answer": gold_text,
                "gold_num": gold_num,
                "status": "ok",
                **extra,
            }

            # include any extra fields from the example for traceability
            for k, v in ex.items():
                if k in {"question", "answer"}:
                    continue
                rec.setdefault(k, v)

        except Exception as e:
            this_err_proc = getattr(best, "proc", None) 

            if not skip_errors:
                # flush anything we have before raising
                _flush()
                raise

            # create an error record (and optionally mark it as "done" for resume)
            rec = {
                "id": qid,
                "row_index": idx,
                "question": question,
                "gold_answer": gold_text,
                "gold_num": gold_num,
                "procedure": this_err_proc,
                "status": "error",
                "error_type": type(e).__name__,
                "error": str(e),
                "timestamp_utc": datetime.now(timezone.utc).isoformat(),
            }
            if include_traceback:
                rec["traceback"] = traceback.format_exc()

            # If you *don't* want resume=True to skip errored items later,
            # set save_error_records=False OR change your _load_existing_ids
            # to only count status=="ok".
            if not save_error_records:
                # don't save it; still return it in-memory
                results.append(rec)
                continue

        results.append(rec)

        if out_path:
            pending.append(rec)
            if len(pending) >= save_every:
                _flush()

    _flush()
    return results


`run_gsm8k_batch` above runs an entire GSM8K batch and saves the results to a JSONL file as it goes.


## 🟧 **Step 4**: Run the batch function to get results

Running the first example to make sure everything goes smoothly before running larger batches

Uncomment the cells below if you want to test before running the larger set

In [13]:
# import ollama
# first = subsample.select(range(1))

# res = ollama.generate(
#         model=MODEL,
#         prompt=create_procedure_prompt(first[0]["question"]),
#         format=None,
#         options={"temperature": 1, "seed": 1234},
#     )
# print(res["response"])

In [15]:
# Change these variables to control current file
CURRENT_FILE_PATH = GENERALIZATION_PROC_OUT
CURRENT_ID_SALT = "gsm8k-train-gen-attack-proc"
# CHANGE TO TRUE ONLY IF YOU NEED TO RESUME
RESUME = False
# CHANGE ONLY IF YOU WANT TO PRINT LOGS AS IT RUNS
PRINT_BOOL = False


In [16]:
# Instantiate the runners
proc_runner = make_fixed_procedure_runner(query_fn_no_format_for_gptoss, MODEL, seed=1234, print_bool=PRINT_BOOL)
baseline_runner = make_baseline_runner(query_fn_no_format_for_gptoss, MODEL, seed=1234, print_bool=PRINT_BOOL)
# Set which runner you want to use
RUNNER = proc_runner


In [None]:
# Build mutated examples with LLM-solved gold answers (may take time)
rng = random.Random(42)
mutated_examples = build_mutated_examples(subsample, rng, query_fn_no_format_for_gptoss, MODEL, seed=1234)


In [None]:
# Save mutated examples to a JSON file for reference
with open('runs/mutated_examples.json', 'w') as f:
    json.dump(mutated_examples, f, indent=2)

In [None]:
mutated_examples

In [20]:
results = run_gsm8k_batch(
    mutated_examples,
    runner=RUNNER,
    out_path=CURRENT_FILE_PATH,
    save_every=2,
    resume=RESUME,
    id_salt=CURRENT_ID_SALT,   # optional but nice to set (dataset name/split/version)
    skip_errors=True,          # NEW: continue after per-item failures
    print_bool=PRINT_BOOL
)


In [None]:
# Run baseline on the same mutated examples
CURRENT_FILE_PATH = GENERALIZATION_BASELINE_OUT
CURRENT_ID_SALT = "gsm8k-train-gen-attack-baseline"
RUNNER = baseline_runner
results_baseline = run_gsm8k_batch(
    mutated_examples,
    runner=RUNNER,
    out_path=CURRENT_FILE_PATH,
    save_every=2,
    resume=RESUME,
    id_salt=CURRENT_ID_SALT,
    skip_errors=True,
    print_bool=PRINT_BOOL
)


In [21]:
# Choose which results file to analyze
CURRENT_FILE_PATH = GENERALIZATION_PROC_OUT  # or GENERALIZATION_BASELINE_OUT
with open(CURRENT_FILE_PATH, "r", encoding="utf-8") as f:
    # skip blank lines; the `with` block closes the file on exit
    read_results = [json.loads(line) for line in f if line.strip()]
results_df = pd.DataFrame(read_results)


In [22]:
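# NOTE: `subsample` holds the original (un-mutated) questions, and the results
# are appended to whatever file CURRENT_FILE_PATH currently points at.
results = run_gsm8k_batch(
    subsample,
    runner=RUNNER,
    out_path=CURRENT_FILE_PATH,
    save_every=2,
    resume=RESUME,
    id_salt=CURRENT_ID_SALT,   # optional but nice to set (dataset name/split/version)
    skip_errors=True,          # NEW: continue after per-item failures
    print_bool=PRINT_BOOL
)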

## 🟧 **Step 5**: Analyze results

Read your results from your file

In [18]:
with open(CURRENT_FILE_PATH, "r", encoding="utf-8") as f:
    # skip blank lines; the `with` block closes the file on exit
    read_results = [json.loads(line) for line in f if line.strip()]
results_df = pd.DataFrame(read_results)

In [19]:
print(results_df["correct"].value_counts())
results_df["correct"].value_counts().plot(kind='bar')

KeyError: 'correct'
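
One possible cause of this `KeyError`, given how `run_gsm8k_batch` builds its records: records with `status: "error"` carry no `correct` field, so if every item in the file failed, the column never exists. The sketch below shows one way to count accuracy that tolerates such records. It is an assumption-laden illustration using plain dicts instead of pandas, and the three records are invented placeholders, not real results.

```python
# Invented placeholder records in the shape written by run_gsm8k_batch.
records = [
    {"id": "a", "status": "ok", "correct": True},
    {"id": "b", "status": "ok", "correct": False},
    {"id": "c", "status": "error", "error_type": "ValueError"},
]

ok = [r for r in records if r.get("status") == "ok"]
errors = [r for r in records if r.get("status") == "error"]

# .get() avoids a KeyError on records that have no "correct" field.
n_correct = sum(bool(r.get("correct")) for r in ok)

accuracy_ok = n_correct / len(ok) if ok else None             # errors excluded
accuracy_all = n_correct / len(records) if records else None  # errors count as wrong

print(f"ok={len(ok)} errors={len(errors)}")
print(accuracy_ok, accuracy_all)
```

Reporting both figures keeps it visible whether a low score comes from wrong answers or from items that never ran.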

Printing to further investigate false answers

In [None]:
# Error records have no "correct" value (NaN), so select rows that are explicitly False
false_answers = results_df[results_df["correct"].eq(False)]
for index, row in false_answers.iterrows():
    print(f"Question: {row['question']}")
    print(f"Gold Answer: {row['gold_answer']}")
    print(f"Predicted Answer: {row['pred_answer']}")
    print("Procedure:")
    pretty_print(row['procedure'])
    print("\n---\n")

Question: Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
Gold Answer: In the beginning, Betty has only 100 / 2 = $<<100/2=50>>50.
Betty's grandparents gave her 15 * 2 = $<<15*2=30>>30.
This means, Betty needs 100 - 50 - 30 - 15 = $<<100-50-30-15=5>>5 more.
#### 5
Predicted Answer: Betty needs $30 more to buy the wallet.
Procedure:

--- Procedure: Calculate how much more money Betty needs. ---
Steps:

Step 1: Extract the problem text from the input.
  **Inputs**:
    - problem_text: The problem description.
  **Outputs**:
    - problem_text: The problem description.

Step 2: Calculate the amount Betty has.
  **Inputs**:
    - problem_text: The problem description.
  **Outputs**:
    - money_she_has: The amount of money Betty has.

Step 3: Calculate the amount Betty needs.
  **

Looking at an isolated false answer

In [None]:
this = false_answers.iloc[1]
run_steps_stateful_minimal(this['procedure'], this['question'], FINAL_SCHEMA, MODEL, print_bool=True, query_fn=QUERY_FN)

[step 1] inputs: {'problem_text': 'Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?'}
[step 1] outputs: {k: state[k] for k in ['total_pages'] if k in state}
[step 2] inputs: {'total_pages': 120}
[step 2] outputs: {k: state[k] for k in ['yesterdays_pages'] if k in state}
[step 3] inputs: {'total_pages': 120, 'yesterdays_pages': 0}
[step 3] outputs: {k: state[k] for k in ['todays_pages'] if k in state}
[step 4] inputs: {'todays_pages': 120, 'yesterdays_pages': 0}
[step 4] outputs: {k: state[k] for k in ['tomorrows_pages'] if k in state}
[step 5] inputs: {'tomorrows_pages': 120}
[step 5] outputs: {k: state[k] for k in ['final_answer', 'final_answer_numerical', 'confidence', 'units'] if k in state}


{'problem_text': 'Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?',
 'total_pages': 120,
 'yesterdays_pages': 0,
 'todays_pages': 120,
 'tomorrows_pages': 120,
 'final_answer': 'Julie will read 120 pages tomorrow.',
 'final_answer_numerical': 120,
 'confidence': 1,
 'units': 'pages'}