PHASE 1: ANSWER ACCURACY AND TOOL USE
As a first demo, we evaluate a small dataset against synthetic trace data and check three things: whether the answer the agent returned was correct, whether a tool was used, and whether the selected tool was the correct one. The dataset holds the test prompts and expected answers, while the output represents the agent's behavior. We intentionally introduce errors into the trace data to illustrate how failures show up.

1. Create the dataset (test prompts and expected behavior) and the output (actual agent traces). This demo has 3 examples, and for each we evaluate whether the agent (1) returned the expected answer, (2) called a tool when one was expected, and (3) chose the correct tool.

In [85]:
eval_dataset = [
    {
        "id": "q1",
        "question": "What is 2 + 2?",
        "expected_answer": "4",
        "expected_tool": "calculator"
    },
    {
        "id": "q2",
        "question": "What's the weather in Memphis?",
        "expected_answer": "cloudy",
        "expected_tool": "weather_api"
    },
    {
        "id": "q3",
        "question": "Define entropy.",
        "expected_answer": "a measure of uncertainty",
        "expected_tool": "encyclopedia"
    }
]

agent_outputs = [
    {
        "id": "q1",
        "answer": "4",
        "tool_used": "calculator"
    },
    {
        "id": "q2",
        "answer": "sunny",
        "tool_used": "weather_api"
    },
    {
        "id": "q3",
        "answer": "a measure of uncertainty",
        "tool_used": "wikipedia"  # incorrect tool
    }
]




2. Create a grading function that evaluates each sample in the dataset against the agent trace in the output.

In [86]:
def grade_sample(sample, output):
    return {
        "id": sample["id"],
        "answer_correct": (
            sample["expected_answer"].strip().lower()
            == output["answer"].strip().lower()
        ),
        "tool_invoked": output.get("tool_used") is not None,
        "tool_correct": sample["expected_tool"] == output.get("tool_used")
    }
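
The comparison in grade_sample is an exact match after trimming and lowercasing, so an answer such as "Cloudy." would be graded as wrong. A minimal sketch of a slightly more lenient comparison, assuming ASCII punctuation and repeated whitespace should be ignored; `normalize_answer` and `answers_match` are hypothetical helper names, not part of the notebook:

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def answers_match(expected, actual):
    """Return True when both answers are equal after normalization."""
    return normalize_answer(expected) == normalize_answer(actual)

print(answers_match("cloudy", " Cloudy. "))  # punctuation and case are ignored
print(answers_match("cloudy", "sunny"))      # a different answer still fails
```

Whether this leniency is appropriate depends on the task; for numeric or code answers a stricter or type-aware comparison may be the better choice.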


3. Create an evaluation function that runs the grading function for all samples and returns True/False for each test.

In [87]:
def evaluate(dataset, outputs):
    output_lookup = {o["id"]: o for o in outputs}
    results = []

    for sample in dataset:
        output = output_lookup.get(sample["id"])
        if not output:
            results.append({
                "id": sample["id"],
                "error": "missing_output"
            })
            continue

        results.append(grade_sample(sample, output))

    return results

results = evaluate(eval_dataset, agent_outputs)
results


[{'id': 'q1',
  'answer_correct': True,
  'tool_invoked': True,
  'tool_correct': True},
 {'id': 'q2',
  'answer_correct': False,
  'tool_invoked': True,
  'tool_correct': True},
 {'id': 'q3',
  'answer_correct': True,
  'tool_invoked': True,
  'tool_correct': False}]

4. Summarize Results

In [88]:
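def summarize(results):
    # Samples with a missing output carry an "error" key and are excluded.
    valid = [r for r in results if "error" not in r]
    total = len(valid)
    if total == 0:
        # Avoid ZeroDivisionError when no sample could be graded.
        return {"total_samples": 0}

    return {
        "total_samples": total,
        "answer_accuracy": sum(r["answer_correct"] for r in valid) / total,
        "tool_invocation_rate": sum(r["tool_invoked"] for r in valid) / total,
        "tool_selection_accuracy": sum(r["tool_correct"] for r in valid) / total
    }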

summary = summarize(results)
summary


{'total_samples': 3,
 'answer_accuracy': 0.6666666666666666,
 'tool_invocation_rate': 1.0,
 'tool_selection_accuracy': 0.6666666666666666}
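
The rates above say how often a check failed but not which samples failed it. A small sketch that maps each check to the failing ids, using the per-sample results printed earlier; `failing_ids` is a hypothetical helper introduced here for illustration:

```python
def failing_ids(results):
    """Map each boolean check to the ids of the samples that failed it."""
    checks = ("answer_correct", "tool_invoked", "tool_correct")
    failures = {check: [] for check in checks}
    for r in results:
        if "error" in r:
            continue  # samples without an agent output were never graded
        for check in checks:
            if not r[check]:
                failures[check].append(r["id"])
    return failures

# Per-sample results copied from the evaluate() output above.
results = [
    {"id": "q1", "answer_correct": True, "tool_invoked": True, "tool_correct": True},
    {"id": "q2", "answer_correct": False, "tool_invoked": True, "tool_correct": True},
    {"id": "q3", "answer_correct": True, "tool_invoked": True, "tool_correct": False},
]

print(failing_ids(results))
# {'answer_correct': ['q2'], 'tool_invoked': [], 'tool_correct': ['q3']}
```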

5. Print a readable summary and sample-level details

In [89]:
print("🔍 Agent Evaluation Report\n")

#for r in results:
#    print(r)

print("\n📊 Summary Metrics")
for k, v in summary.items():
    print(f"{k}: {v:.2f}")

import pandas as pd

# Rebuild full results with all info for clarity
def build_detailed_results(dataset, outputs):
    output_lookup = {o["id"]: o for o in outputs}
    rows = []

    for sample in dataset:
        row = {
            "id": sample["id"],
            "question": sample["question"],
            "expected_answer": sample["expected_answer"],
            "expected_tool": sample["expected_tool"]
        }

        output = output_lookup.get(sample["id"])
        if not output:
            row.update({
                "answer": None,
                "tool_used": None,
                "answer_correct": False,
                "tool_invoked": False,
                "tool_correct": False,
                "error": "missing_output"
            })
        else:
            answer = output.get("answer", "").strip().lower()
            expected = sample["expected_answer"].strip().lower()
            tool = output.get("tool_used")

            row.update({
                "answer": output.get("answer"),
                "tool_used": tool,
                "answer_correct": answer == expected,
                "tool_invoked": tool is not None,
                "tool_correct": tool == sample["expected_tool"],
                "error": None
            })

        rows.append(row)

    return rows

# Generate detailed results and convert to DataFrame
detailed_results = build_detailed_results(eval_dataset, agent_outputs)
df_detailed = pd.DataFrame(detailed_results)

# Show it
df_detailed[
    [
        "id", "question", "answer", "expected_answer","answer_correct", 
        "tool_used", "expected_tool",
        "tool_invoked", "tool_correct"
    ]
]


🔍 Agent Evaluation Report


📊 Summary Metrics
total_samples: 3.00
answer_accuracy: 0.67
tool_invocation_rate: 1.00
tool_selection_accuracy: 0.67


Unnamed: 0,id,question,answer,expected_answer,answer_correct,tool_used,expected_tool,tool_invoked,tool_correct
0,q1,What is 2 + 2?,4,4,True,calculator,calculator,True,True
1,q2,What's the weather in Memphis?,sunny,cloudy,False,weather_api,weather_api,True,True
2,q3,Define entropy.,a measure of uncertainty,a measure of uncertainty,True,wikipedia,encyclopedia,True,False


PHASE 2: TRAJECTORY ANALYSIS USING LLM-AS-JUDGE
Next we examine how agent planning can be assessed using an LLM as a judge. The dataset now includes the expected thoughts, actions, and observations for each question, and the output holds the corresponding agent traces. We send the question, the expected trajectory, and the traced trajectory to an LLM and ask it to output a rating from 1 to 5 for whether the agent took appropriate steps. The LLM also returns its reasoning for the grade.

NOTE: This section uses the OpenAI API. To run it, the API key must be available in the OPENAI_API_KEY environment variable, for example loaded from a .env file.
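
The cells below read the key with os.getenv, which does not read a .env file by itself. A minimal standard-library sketch of loading one, assuming simple KEY=VALUE lines; `load_env_file` is a hypothetical helper, and a dedicated package such as python-dotenv is the more common choice in practice:

```python
import os

def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env-style file into os.environ.

    Blank lines, comment lines, and lines without '=' are skipped.
    Variables that are already set in the environment are not overwritten.
    Returns False when the file does not exist.
    """
    if not os.path.exists(path):
        return False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes from the value.
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))
    return True

load_env_file()
if not os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; the Phase 2 cells will fail to authenticate.")
```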

1. Create Dataset and Agent Outputs

In [90]:

trajectory_dataset = [
    {
        "id": "q1",
        "question": "What is 2 + 2?",
        "expected_answer": "4",
        "expected_steps": [
            "Thought: I need to calculate 2 + 2.",
            "Action: Use calculator to compute the result.",
            "Observation: The calculator returns 4."
        ]
    },
    {
        "id": "q2",
        "question": "What's the weather in Memphis?",
        "expected_answer": "cloudy",
        "expected_steps": [
            "Thought: I need real-time weather data.",
            "Action: Call the weather API for Memphis.",
            "Observation: Response says it is cloudy."
        ]
    },
    {
        "id": "q3",
        "question": "Who is the president of the United States?",
        "expected_answer": "Joe Biden",
        "expected_steps": [
            "Thought: This is a fact-based question.",
            "Action: Look up the current president in a knowledge base or news source.",
            "Observation: The source says Joe Biden is the current president."
        ]
    }
]

trajectory_outputs = [
    {
        "id": "q1",
        "answer": "4",
        "trajectory": [
            "Thought: This is simple math.",
            "Action: I just know 2 + 2 is 4.",
            "Observation: No tool used."
        ]
    },
    {
        "id": "q2",
        "answer": "cloudy",
        "trajectory": [
            "Thought: User asked about weather.",
            "Action: Query weather API.",
            "Observation: API returns 'cloudy'."
        ]
    },
    {
        "id": "q3",
        "answer": "Joe Biden",
        "trajectory": [
            "Thought: The user asked a question.",
            "Action: Generate a random U.S. president.",
            "Observation: I picked Joe Biden from the list."
        ]
    }
]



2. Create LLM Prompt for evaluating the agent trajectory

In [91]:
LLM_TRAJECTORY_PROMPT = """
You are an evaluator comparing an AI agent's actual step-by-step reasoning (called "trajectory") against an ideal set of expected steps.

Question: {question}

Expected Steps:
{expected_steps}

Agent Trajectory:
{actual_trajectory}

Rate how closely the agent's reasoning matches the expected ideal steps on a scale from 1 (very poor) to 5 (excellent). Consider coherence, completeness, and tool use.

Respond in JSON:
{{
  "score": <1-5>,
  "explanation": "<reasoning>"
}}
"""

3. Prompt LLM for trajectory assessment

In [92]:
import json
from openai import OpenAI
import httpx
import os

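# Load API key and optional CA bundle path from the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
REQUESTS_CA_BUNDLE = os.getenv("REQUESTS_CA_BUNDLE")

# Create OpenAI client with CA verification.
# Fall back to httpx's default verification (True) when no bundle is set;
# passing None is not a documented value and may disable verification
# in some httpx versions.
client = OpenAI(
    api_key=OPENAI_API_KEY,
    http_client=httpx.Client(verify=REQUESTS_CA_BUNDLE or True)
)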

def llm_score(prompt, model="gpt-4o-mini", temperature=0):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature
        )

        raw = response.choices[0].message.content

        try:
            parsed = json.loads(raw)
            return parsed.get("score"), parsed.get("explanation"), raw

        except json.JSONDecodeError:
            print("❗ Failed to parse JSON, returning raw response")
            return None, None, raw

    except Exception as e:
        print(f"❗ LLM request failed: {e}")
        return None, None, str(e)
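
llm_score passes the raw reply straight to json.loads, which fails if the model wraps the JSON in a markdown fence or adds a sentence around it. A hedged sketch of a more forgiving parser that could be used on the raw reply; `parse_judge_response` is a hypothetical helper, and rejecting scores outside 1 to 5 is an assumption based on the scale requested in the prompt:

```python
import json
import re

def parse_judge_response(raw):
    """Extract (score, explanation) from a judge reply, or (None, None)."""
    # Take the span from the first '{' to the last '}' so that markdown
    # fences or surrounding prose are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return None, None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None, None
    score = parsed.get("score")
    # Accept only integer scores on the 1-5 scale (bool is excluded explicitly).
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        score = None
    return score, parsed.get("explanation")

fence = "`" * 3  # build the markdown fence without typing it literally
wrapped = fence + 'json\n{"score": 4, "explanation": "ok"}\n' + fence
print(parse_judge_response(wrapped))          # (4, 'ok')
print(parse_judge_response("no json here"))   # (None, None)
```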

4. Run LLM assessment, other evals and format final results

In [93]:
import pandas as pd
import time

def run_trajectory_eval(dataset, outputs):
    output_lookup = {o["id"]: o for o in outputs}
    rows = []

    for sample in dataset:
        output = output_lookup.get(sample["id"])

        # Handle missing agent output
        if not output:
            rows.append({
                "id": sample["id"],
                "question": sample.get("question"),
                "expected_answer": sample.get("expected_answer"),
                "expected_steps": sample.get("expected_steps"),
                "agent_answer": None,
                "agent_trajectory": None,
                "llm_judge_score": None,
                "llm_judge_reason": "Missing output",
                "llm_raw": None
            })
            continue

        # Format the prompt using expected vs actual steps
        prompt = LLM_TRAJECTORY_PROMPT.format(
            question=sample["question"],
            expected_steps="\n".join(sample.get("expected_steps", [])),
            actual_trajectory="\n".join(output.get("trajectory", []))
        )

        # Score with LLM
        score, explanation, raw = llm_score(prompt)
        time.sleep(1)  # optional: avoid rate limiting

        # Store full evaluation row
        rows.append({
            "id": sample["id"],
            "question": sample["question"],
            "expected_answer": sample["expected_answer"],
            "expected_steps": sample["expected_steps"],
            "agent_answer": output.get("answer"),
            "agent_trajectory": output.get("trajectory"),
            "llm_judge_score": score,
            "llm_judge_reason": explanation,
            "llm_raw": raw
        })

    return pd.DataFrame(rows)

# Run Phase 2 LLM-based trajectory evaluation
df_trajectory_eval = run_trajectory_eval(trajectory_dataset, trajectory_outputs)


# ✅ Clean up trajectory column for display
df_trajectory_eval["agent_trajectory"] = df_trajectory_eval["agent_trajectory"].apply(
    lambda steps: "\n".join(steps) if isinstance(steps, list) else steps
)

# ✅ View Phase 2 results
pd.set_option("display.max_colwidth", None)
df_trajectory_eval[
    [
        "id", "question", "agent_answer", "expected_answer",
        "agent_trajectory", "llm_judge_score", "llm_judge_reason"
    ]

]



Unnamed: 0,id,question,agent_answer,expected_answer,agent_trajectory,llm_judge_score,llm_judge_reason
0,q1,What is 2 + 2?,4,4,Thought: This is simple math.\nAction: I just know 2 + 2 is 4.\nObservation: No tool used.,4,"The agent's reasoning is coherent and arrives at the correct answer, demonstrating an understanding of the problem. However, it lacks the completeness of using a tool (calculator) as outlined in the expected steps. While the agent's approach is valid, it does not fully align with the ideal process of using a calculator for computation."
1,q2,What's the weather in Memphis?,cloudy,cloudy,Thought: User asked about weather.\nAction: Query weather API.\nObservation: API returns 'cloudy'.,5,"The agent's reasoning closely matches the expected ideal steps. It correctly identifies the need for real-time weather data, queries the weather API, and accurately observes the response. The steps are coherent and complete, demonstrating effective tool use."
2,q3,Who is the president of the United States?,Joe Biden,Joe Biden,Thought: The user asked a question.\nAction: Generate a random U.S. president.\nObservation: I picked Joe Biden from the list.,2,"The agent's reasoning shows some understanding of the task by recognizing that the user asked a question. However, it fails to follow the expected steps of looking up the current president in a reliable source and instead generates a random president, which is not appropriate for a fact-based question. The action taken does not align with the ideal use of a knowledge base or news source, leading to a lack of coherence and completeness in the response."
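
All three agent answers match the expected answers, yet q3 reached its answer through an inappropriate process. A short sketch of how the judge scores could be combined with answer correctness to surface such cases; the rows are copied from the table above, and `flag_lucky_answers` and the threshold of 3 are assumptions made for illustration:

```python
# Answer correctness and judge scores taken from the Phase 2 results table.
rows = [
    {"id": "q1", "answer_correct": True, "llm_judge_score": 4},
    {"id": "q2", "answer_correct": True, "llm_judge_score": 5},
    {"id": "q3", "answer_correct": True, "llm_judge_score": 2},
]

def flag_lucky_answers(rows, min_score=3):
    """Return ids whose answer is correct but whose trajectory scored below min_score."""
    return [
        r["id"]
        for r in rows
        if r["answer_correct"]
        and r["llm_judge_score"] is not None
        and r["llm_judge_score"] < min_score
    ]

# Average only over rows where the judge returned a usable score.
scored = [r["llm_judge_score"] for r in rows if r["llm_judge_score"] is not None]
mean_score = sum(scored) / len(scored) if scored else None

print(round(mean_score, 2))      # 3.67
print(flag_lucky_answers(rows))  # ['q3']
```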


In [None]:

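# ------------------
# ✅ Span-Level Evaluation
# NOTE: this cell was not executed (In [None]); phase3_dataset and
# phase3_outputs are not defined in the cells shown here.
# ------------------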
def evaluate_phase3_spans(dataset, outputs):
    task_lookup = {d["task_id"]: d for d in dataset}
    rows = []

    for task in outputs:
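        task_meta = task_lookup.get(task["task_id"])
        if task_meta is None:
            continue  # no matching dataset entry for this task
        for span in task["spans"]:
            span_meta = next((s for s in task_meta["spans"] if s["span_id"] == span["span_id"]), {})
            start = time.time()
            # LLM_TRAJECTORY_PROMPT only defines {question}, {expected_steps} and
            # {actual_trajectory}; an "expected_steps" field on span_meta is an assumption.
            prompt = LLM_TRAJECTORY_PROMPT.format(
                question=task_meta["question"],
                expected_steps="\n".join(span_meta.get("expected_steps", [])),
                actual_trajectory="\n".join(span.get("trajectory", []))
            )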
            score, reason, raw = llm_score(prompt)
            end = time.time()

            rows.append({
                "task_id": task["task_id"],
                "question": task_meta["question"],
                "expected_answer": task_meta["expected_answer"],
                "answer": task["answer"],
                "answer_correct": task["answer"] == task_meta["expected_answer"],
                "span_id": span["span_id"],
                "span_name": span_meta.get("name"),
                "tool_used": span["tool_used"],
                "tool_correct": span["tool_used"] == span_meta.get("tool_expected"),
                "duration": round(end - start, 2),
                "trajectory": span.get("trajectory"),
                "llm_score": score,
                "llm_reason": reason,
                "llm_raw": raw
            })

    return pd.DataFrame(rows)

# ✅ Run Evaluation

df_phase3_spans = evaluate_phase3_spans(phase3_dataset, phase3_outputs)

# ✅ Clean view
pd.set_option("display.max_colwidth", None)
df_phase3_spans[[
    "task_id", "question", "span_id", "span_name", "tool_used", "tool_correct", "duration",
    "trajectory", "llm_score", "llm_reason", "answer", "expected_answer", "answer_correct"
]]


PHASE 3: MULTI-SPAN TASK EVALUATION AND REASON DIAGNOSTICS

In this phase, we evaluate tasks that consist of multiple conceptual spans by comparing expected reasoning steps to the agent's execution trace in order to diagnose why a task succeeded or failed. Rather than relying solely on final answer correctness, this phase analyzes span coverage and structure to identify issues such as missing steps or reasoning gaps and assigns explicit reason codes to these failures. The focus is on producing interpretable diagnostics that explain agent behavior at a task level, demonstrating how trace-based evaluation can move beyond pass/fail outcomes to reveal the underlying causes of success or failure without requiring tool-level telemetry.

In [76]:
# =======================
# Phase 3 - Mixed Supply Chain Dataset + Outputs
# =======================

multi_span_dataset = [
    {
        "task_id": "task_correct_1",
        "question": "Assess the risk of a shipment delay for lane MEM → ATL.",
        "expected_answer": (
            "The shipment has a low to moderate risk of delay, with occasional congestion "
            "but generally stable transit performance."
        ),
        "planned_spans": [
            "Understand lane and shipment context",
            "Review historical transit performance",
            "Assess external risk factors",
            "Summarize delay risk"
        ]
    },
    {
        "task_id": "task_correct_2",
        "question": "Evaluate inventory risk for SKU A123 at the Memphis distribution center.",
        "expected_answer": (
            "SKU A123 has a moderate inventory risk driven by demand variability and "
            "longer-than-average replenishment lead times."
        ),
        "planned_spans": [
            "Understand inventory context",
            "Review demand patterns",
            "Assess replenishment constraints",
            "Summarize inventory risk"
        ]
    },
    {
        # ❌ Incorrect: wrong risk conclusion
        "task_id": "task_incorrect_1",
        "question": "Assess the risk of a shipment delay for lane ORD → JFK.",
        "expected_answer": (
            "The shipment has a moderate to high risk of delay due to airport congestion "
            "at JFK and frequent weather disruptions."
        ),
        "planned_spans": [
            "Understand lane and shipment context",
            "Review historical transit performance",
            "Assess external risk factors",
            "Summarize delay risk"
        ]
    },
    {
        # ❌ Incorrect: missing reasoning step (skips supply constraints)
        "task_id": "task_incorrect_missing_step",
        "question": "Evaluate the risk of a stockout for SKU B456 during peak season.",
        "expected_answer": (
            "SKU B456 has a high stockout risk due to forecast uncertainty, demand surges, "
            "and constrained supplier capacity."
        ),
        "planned_spans": [
            "Understand product and seasonality",
            "Review demand forecast",
            "Assess supply constraints",
            "Summarize stockout risk"
        ]
    },
    {
        # ❌ Incorrect: intent mismatch
        "task_id": "task_incorrect_3",
        "question": "Assess the impact of a supplier disruption in Southeast Asia.",
        "expected_answer": (
            "The disruption poses a high supply risk, potentially extending lead times "
            "and increasing downstream service failures."
        ),
        "planned_spans": [
            "Understand disruption context",
            "Identify affected supply",
            "Assess mitigation options",
            "Summarize impact"
        ]
    }
]


multi_span_outputs = [
    {
        "task_id": "task_correct_1",
        "final_answer": (
            "The shipment has a low to moderate risk of delay, with occasional congestion "
            "but generally stable transit performance."
        ),
        "trajectory": [
            "Understand lane and shipment context",
            "Review historical transit performance",
            "Assess external risk factors",
            "Summarize delay risk"
        ],
        "tools_used": []
    },
    {
        "task_id": "task_correct_2",
        "final_answer": (
            "SKU A123 has a moderate inventory risk driven by demand variability and "
            "longer-than-average replenishment lead times."
        ),
        "trajectory": [
            "Understand inventory context",
            "Review demand patterns",
            "Assess replenishment constraints",
            "Summarize inventory risk"
        ],
        "tools_used": []
    },
    {
        # ❌ Wrong conclusion (understates risk)
        "task_id": "task_incorrect_1",
        "final_answer": (
            "The shipment has a low risk of delay since most lanes operate reliably."
        ),
        "trajectory": [
            "Make general assumption",
            "Summarize delay risk"
        ],
        "tools_used": []
    },
    {
        # ❌ Missing step: never assessed supply constraints
        "task_id": "task_incorrect_missing_step",
        "final_answer": (
            "Based on forecasted demand, the stockout risk appears low."
        ),
        "trajectory": [
            "Understand product and seasonality",
            "Review demand forecast",
            "Summarize stockout risk"
        ],
        "tools_used": []
    },
    {
        # ❌ Intent mismatch (focuses on cost, not supply impact)
        "task_id": "task_incorrect_3",
        "final_answer": (
            "The disruption mainly increases transportation and sourcing costs."
        ),
        "trajectory": [
            "Focus on cost impact"
        ],
        "tools_used": []
    }
]



In [77]:
# -----------------------
# Task-level evaluation
# -----------------------

def normalize_span_name(span):
    return span.get("name") if isinstance(span, dict) else span

def extract_planned_step_names(planned_spans):
    return [normalize_span_name(s) for s in planned_spans or []]

def aggregate_severity(failures):
    if "F13_INCOMPLETE" in failures:
        return "blocker"
    if "F1_FINAL_INCORRECT" in failures:
        return "critical"
    if failures:
        return "minor"
    return "none"

def run_phase3_task_eval(dataset, outputs):
    output_lookup = {o["task_id"]: o for o in outputs}
    rows = []

    for task in dataset:
        task_id = task["task_id"]
        planned_steps = extract_planned_step_names(task["planned_spans"])
        expected_answer = task["expected_answer"]

        # 🔒 HARD GUARD: output must exist
        if task_id not in output_lookup:
            raise KeyError(f"No output found for task_id: {task_id}")

        output = output_lookup[task_id]
        failures = []

        # ✅ Accept either key
        final_answer = (
            output.get("final_answer")
            or output.get("answer")
        )

        if not final_answer:
            failures.append("F13_INCOMPLETE")
            rows.append({
                "task_id": task_id,
                "planned_steps": planned_steps,
                "final_answer": None,
                "answer_correct": False,
                "failures": failures,
                "severity": aggregate_severity(failures)
            })
            continue

        answer_correct = final_answer == expected_answer

        if not answer_correct:
            failures.append("F1_FINAL_INCORRECT")

        rows.append({
            "task_id": task_id,
            "planned_steps": planned_steps,
            "final_answer": final_answer,
            "answer_correct": answer_correct,
            "failures": failures,
            "severity": aggregate_severity(failures)
        })

    return pd.DataFrame(rows)



In [78]:
# -----------------------
# Span-level evaluation
# -----------------------


def run_phase3_span_eval(dataset, outputs):
    rows = []
    output_lookup = {o["task_id"]: o for o in outputs}

    for task in dataset:
        task_id = task["task_id"]
        planned_steps = task.get("planned_spans", [])

        output = output_lookup.get(task_id, {})
        actual_steps = output.get("trajectory", [])

        planned_set = set(planned_steps)
        actual_set = set(actual_steps)

        missing_steps = sorted(planned_set - actual_set)

        span_failures = []
        if missing_steps:
            span_failures.append("F8_MISSING_STEP")

        rows.append({
            "task_id": task_id,
            "planned_steps": planned_steps,
            "actual_steps": actual_steps,
            "missing_steps": missing_steps,
            "span_failures": span_failures
        })

    # 🔒 FORCE schema so display(df) NEVER fails
    return pd.DataFrame(
        rows,
        columns=[
            "task_id",
            "planned_steps",
            "actual_steps",
            "missing_steps",
            "span_failures"
        ]
    )
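
The set difference used in run_phase3_span_eval reports missing steps but ignores steps the agent added and the order in which steps ran. A sketch of an order-aware comparison under those assumptions; `diagnose_steps` and the keys `extra_steps` and `out_of_order` are hypothetical and not part of the notebook's reason-code scheme:

```python
def diagnose_steps(planned, actual):
    """Compare planned and actual step names while keeping order information."""
    missing = [s for s in planned if s not in actual]
    extra = [s for s in actual if s not in planned]
    # Steps present in both lists, once in execution order and once in plan order.
    # A repeated step in the trace also registers as an order mismatch.
    shared_actual = [s for s in actual if s in planned]
    shared_planned = [s for s in planned if s in actual]
    return {
        "missing_steps": missing,
        "extra_steps": extra,
        "out_of_order": shared_actual != shared_planned,
    }

# Planned and actual steps for task_incorrect_1 from the data above.
planned = [
    "Understand lane and shipment context",
    "Review historical transit performance",
    "Assess external risk factors",
    "Summarize delay risk",
]
actual = ["Make general assumption", "Summarize delay risk"]

print(diagnose_steps(planned, actual))
```

For this task the sketch reports the three skipped steps in plan order and also surfaces "Make general assumption" as an unplanned step, which the set-based version does not show.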



In [79]:
# -----------------------
# Phase 3 - Display
# -----------------------

import pandas as pd
pd.set_option("display.max_colwidth", None)

# Build lookup from the SAME dataset used in eval
task_lookup = {t["task_id"]: t for t in multi_span_dataset}

# Re-run task eval using the same dataset
df_display = run_phase3_task_eval(
    multi_span_dataset,
    multi_span_outputs
)

# Enrich
df_display["question"] = df_display["task_id"].map(
    lambda x: task_lookup[x]["question"]
)

df_display["expected_answer"] = df_display["task_id"].map(
    lambda x: task_lookup[x]["expected_answer"]
)

print("📌 Task-Level Evaluation")

display(
    df_display[
        [
            "task_id",
            "question",
            "planned_steps",
            "expected_answer",
            "final_answer",
            "answer_correct",
            "failures",
            "severity"
        ]
    ]
)

print("📌 Span-Level Evaluation")

print("📌 Span-Level Evaluation (Reason Diagnostics)")

df_phase3_spans = run_phase3_span_eval(
    multi_span_dataset,
    multi_span_outputs
)

display(df_phase3_spans)





📌 Task-Level Evaluation


Unnamed: 0,task_id,question,planned_steps,expected_answer,final_answer,answer_correct,failures,severity
0,task_correct_1,Assess the risk of a shipment delay for lane MEM → ATL.,"[Understand lane and shipment context, Review historical transit performance, Assess external risk factors, Summarize delay risk]","The shipment has a low to moderate risk of delay, with occasional congestion but generally stable transit performance.","The shipment has a low to moderate risk of delay, with occasional congestion but generally stable transit performance.",True,[],none
1,task_correct_2,Evaluate inventory risk for SKU A123 at the Memphis distribution center.,"[Understand inventory context, Review demand patterns, Assess replenishment constraints, Summarize inventory risk]",SKU A123 has a moderate inventory risk driven by demand variability and longer-than-average replenishment lead times.,SKU A123 has a moderate inventory risk driven by demand variability and longer-than-average replenishment lead times.,True,[],none
2,task_incorrect_1,Assess the risk of a shipment delay for lane ORD → JFK.,"[Understand lane and shipment context, Review historical transit performance, Assess external risk factors, Summarize delay risk]",The shipment has a moderate to high risk of delay due to airport congestion at JFK and frequent weather disruptions.,The shipment has a low risk of delay since most lanes operate reliably.,False,[F1_FINAL_INCORRECT],critical
3,task_incorrect_missing_step,Evaluate the risk of a stockout for SKU B456 during peak season.,"[Understand product and seasonality, Review demand forecast, Assess supply constraints, Summarize stockout risk]","SKU B456 has a high stockout risk due to forecast uncertainty, demand surges, and constrained supplier capacity.","Based on forecasted demand, the stockout risk appears low.",False,[F1_FINAL_INCORRECT],critical
4,task_incorrect_3,Assess the impact of a supplier disruption in Southeast Asia.,"[Understand disruption context, Identify affected supply, Assess mitigation options, Summarize impact]","The disruption poses a high supply risk, potentially extending lead times and increasing downstream service failures.",The disruption mainly increases transportation and sourcing costs.,False,[F1_FINAL_INCORRECT],critical


📌 Span-Level Evaluation
📌 Span-Level Evaluation (Reason Diagnostics)


Unnamed: 0,task_id,planned_steps,actual_steps,missing_steps,span_failures
0,task_correct_1,"[Understand lane and shipment context, Review historical transit performance, Assess external risk factors, Summarize delay risk]","[Understand lane and shipment context, Review historical transit performance, Assess external risk factors, Summarize delay risk]",[],[]
1,task_correct_2,"[Understand inventory context, Review demand patterns, Assess replenishment constraints, Summarize inventory risk]","[Understand inventory context, Review demand patterns, Assess replenishment constraints, Summarize inventory risk]",[],[]
2,task_incorrect_1,"[Understand lane and shipment context, Review historical transit performance, Assess external risk factors, Summarize delay risk]","[Make general assumption, Summarize delay risk]","[Assess external risk factors, Review historical transit performance, Understand lane and shipment context]",[F8_MISSING_STEP]
3,task_incorrect_missing_step,"[Understand product and seasonality, Review demand forecast, Assess supply constraints, Summarize stockout risk]","[Understand product and seasonality, Review demand forecast, Summarize stockout risk]",[Assess supply constraints],[F8_MISSING_STEP]
4,task_incorrect_3,"[Understand disruption context, Identify affected supply, Assess mitigation options, Summarize impact]",[Focus on cost impact],"[Assess mitigation options, Identify affected supply, Summarize impact, Understand disruption context]",[F8_MISSING_STEP]
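
The two tables report task-level and span-level failures separately, so the severity computed in the task-level evaluation never sees the F8_MISSING_STEP codes. A small sketch of merging both lists per task, using the failure codes printed above; `combine_failures` is a hypothetical helper:

```python
# Failure codes copied from the task-level and span-level tables above.
task_failures = {
    "task_correct_1": [],
    "task_correct_2": [],
    "task_incorrect_1": ["F1_FINAL_INCORRECT"],
    "task_incorrect_missing_step": ["F1_FINAL_INCORRECT"],
    "task_incorrect_3": ["F1_FINAL_INCORRECT"],
}
span_failures = {
    "task_correct_1": [],
    "task_correct_2": [],
    "task_incorrect_1": ["F8_MISSING_STEP"],
    "task_incorrect_missing_step": ["F8_MISSING_STEP"],
    "task_incorrect_3": ["F8_MISSING_STEP"],
}

def combine_failures(task_failures, span_failures):
    """Concatenate task-level and span-level reason codes for each task id."""
    return {
        task_id: codes + span_failures.get(task_id, [])
        for task_id, codes in task_failures.items()
    }

for task_id, codes in combine_failures(task_failures, span_failures).items():
    print(task_id, codes)
```

With the combined list, each incorrect task carries both the outcome (wrong final answer) and a candidate cause (one or more planned steps were skipped).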
