johnson00111/math-agent-eval

Math Agent Evaluation

Evaluate 4 different LangGraph-based agent designs on the same 30 math word problems, with consistent scoring and Langfuse observability.

This repo is designed to answer one practical question:

How much do agent architecture and tool policy affect accuracy, latency, and reasoning process quality on arithmetic-heavy tasks?

What This Project Includes

  • 4 agent architectures (CoT, ReAct, Constrained ReAct, Multi-Agent pipeline)
  • Shared calculator tool with safe AST-based evaluation (no eval)
  • Automatic provider switching between Ollama and OpenAI-compatible endpoints
  • End-to-end evaluation runner with per-case logs and summary table
  • Rule-based process scoring (tool usage, reasoning steps, answer clarity, efficiency)
  • Langfuse trace + score logging for side-by-side comparison
  • Results exported to JSON for offline analysis
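As an illustration of what "safe AST-based evaluation" can look like, here is a minimal sketch that parses an expression with Python's `ast` module and evaluates only whitelisted node types. It is not the code in tools/calculator.py: the names `safe_calculate` and `_eval_node` are hypothetical, and support for unary minus/plus is an assumption beyond the operators this README lists.

```python
import ast
import operator

# Binary operators the README lists for the calculator: +, -, *, /, **, %
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.Mod: operator.mod,
}
# Assumption: unary minus/plus are allowed so inputs like "-7 / 2" work.
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _eval_node(node):
    """Recursively evaluate a whitelisted AST node; reject everything else."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    # Names, calls, attributes, etc. never reach an evaluator.
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str):
    """Evaluate an arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)
```

Because only numeric constants and the listed operators are accepted, an input such as `__import__('os')` parses into an `ast.Call` node and is rejected with `ValueError` instead of being executed.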

Agent Variants

| Agent | Core Behavior | Tool Policy |
|---|---|---|
| CoT | Pure step-by-step reasoning in one pass | No tools |
| ReAct | Reason + optional calculator actions | Tool use allowed |
| Constrained ReAct | Same as ReAct but with strict validation loop | Must use calculator for arithmetic |
| Multi-Agent | Planner -> Calculator Agent -> Verifier pipeline | Planner plans, calculator computes, verifier checks |

Evaluation Dataset

  • File: eval/test_cases.json
  • Size: 30 problems
  • Mix of easy / medium / hard word problems
  • Ground-truth numeric answers included

Tech Stack

  • Python 3.11+
  • LangChain + LangGraph
  • Langfuse v4
  • Ollama (default local path) or OpenAI-compatible API

Setup

1. Create and activate environment

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Prepare environment variables

cp .env.example .env

Then edit .env with your Langfuse keys and model provider config.

The default .env.example includes the Langfuse and Ollama fields. To use OpenAI mode instead, add:

OPENAI_API_KEY=your_key
OPENAI_MODEL=gpt-4o-mini
# optional for OpenAI-compatible proxies
OPENAI_BASE_URL=https://api.openai.com/v1

Provider selection logic:

  • If OPENAI_API_KEY exists -> use OpenAI path
  • Otherwise -> use Ollama path
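The selection rule above can be sketched as a small function over the environment. This is an illustration only, not the contents of llm_factory.py: the function name `select_provider`, the `OLLAMA_MODEL` variable, the fallback model names, and the choice to treat an empty key as missing are all assumptions.

```python
import os
from typing import Mapping, Optional


def select_provider(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick the provider config from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    # Assumption: a blank OPENAI_API_KEY counts as "not set".
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if api_key:
        return {
            "provider": "openai",
            # Fallback model mirrors the README example; it is an assumption.
            "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
            # Optional; None means "use the client's default endpoint".
            "base_url": env.get("OPENAI_BASE_URL"),
        }

    return {
        "provider": "ollama",
        # OLLAMA_MODEL is a hypothetical variable name for this sketch.
        "model": env.get("OLLAMA_MODEL", "qwen3:8b"),
    }
```

Passing the environment in as a mapping keeps the rule easy to test: `select_provider({"OPENAI_API_KEY": "sk-test"})` selects the OpenAI path, and `select_provider({})` selects Ollama.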

3. (Optional) Pull local model for Ollama

ollama pull qwen3:8b

4. Langfuse

Use either:

Run Evaluations

Run all agents

python -m eval.run_eval

Run a single agent

python -m eval.run_eval --agent cot
python -m eval.run_eval --agent react
python -m eval.run_eval --agent constrained
python -m eval.run_eval --agent multi_agent

Dry run (quick smoke test)

python -m eval.run_eval --dry-run

Dry run skips Langfuse logging and runs 3 cases by default.

Limit number of test cases

python -m eval.run_eval --max-cases 10
python -m eval.run_eval --agent react --max-cases 5

What Gets Reported

For each agent:

  • Correct / total
  • Accuracy
  • Average latency
  • Error count
  • Per-test detail list (expected, predicted, time, status)

Terminal output also includes a final comparison table across agents.
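The per-agent figures listed above can be derived from the per-test records with a few lines of aggregation. The sketch below assumes a record shape (`CaseResult`) and status labels (`"correct"`, `"wrong"`, `"error"`) chosen for illustration; the actual runner in eval/run_eval.py may store these differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    """One evaluated test case (hypothetical record shape)."""
    expected: float
    predicted: Optional[float]
    seconds: float
    status: str  # assumed labels: "correct", "wrong", or "error"


def summarize(results: List[CaseResult]) -> dict:
    """Aggregate per-case results into the per-agent summary fields."""
    total = len(results)
    correct = sum(r.status == "correct" for r in results)
    errors = sum(r.status == "error" for r in results)
    return {
        "correct": correct,
        "total": total,
        # Guard against division by zero when no cases were run.
        "accuracy": correct / total if total else 0.0,
        "avg_latency": sum(r.seconds for r in results) / total if total else 0.0,
        "errors": errors,
    }
```

Keeping the summary as a plain dictionary makes it straightforward to print as a table row per agent and to serialize into a JSON results file.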

Scoring Details

Outcome-level scores:

  • correctness (1 or 0)
  • latency_seconds

Process-level scores:

  • tool_call_count
  • reasoning_steps
  • answer_clarity
  • efficiency

These are sent to Langfuse as trace scores for analysis in the UI.

Outputs

  • Local JSON summary written to results/eval-.json
  • Langfuse traces tagged by agent type:
    • cot
    • react
    • constrained
    • multi_agent

Project Structure

math-agent-eval/
├── agents/
│   ├── cot_agent.py
│   ├── react_agent.py
│   ├── constrained_agent.py
│   ├── multi_agent.py
│   └── state.py
├── tools/
│   └── calculator.py
├── eval/
│   ├── test_cases.json
│   ├── run_eval.py
│   ├── scoring.py
│   └── process_scoring.py
├── llm_factory.py
├── langfuse_config.py
├── .env.example
└── README.md

Notes

  • The calculator tool supports: +, -, *, /, **, %
  • Numeric extraction prefers ANSWER: , then falls back to the last number found
  • The accuracy check applies different tolerance rules for small and large expected values
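The extraction and tolerance notes above could be implemented roughly as follows. This is a sketch under stated assumptions, not the logic in eval/scoring.py: the regular expression, the function names, and every threshold (`abs_tol`, `rel_tol`, `large_threshold`) are illustrative values, and the README does not state the real ones.

```python
import re
from typing import Optional

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_answer(text: str) -> Optional[float]:
    """Prefer a number tagged with 'ANSWER:', else use the last number in the text."""
    tagged = re.findall(rf"ANSWER:\s*\$?({_NUMBER})", text)
    candidates = tagged or re.findall(_NUMBER, text)
    if not candidates:
        return None
    # Strip thousands separators before converting.
    return float(candidates[-1].replace(",", ""))


def is_correct(
    expected: float,
    predicted: Optional[float],
    abs_tol: float = 0.01,          # assumed tolerance for small values
    rel_tol: float = 0.001,         # assumed relative tolerance for large values
    large_threshold: float = 100.0, # assumed boundary between "small" and "large"
) -> bool:
    """Absolute tolerance for small expected values, relative for large ones."""
    if predicted is None:
        return False
    diff = abs(expected - predicted)
    if abs(expected) < large_threshold:
        return diff <= abs_tol
    return diff <= rel_tol * abs(expected)
```

Splitting the tolerance by magnitude avoids two failure modes: a fixed absolute tolerance is too strict for answers in the thousands, while a purely relative tolerance becomes meaninglessly tight for answers near zero.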

Quick Start (Minimal)

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
python -m eval.run_eval --dry-run

