Evaluate 4 different LangGraph-based agent designs on the same 30 math word problems, with consistent scoring and Langfuse observability.
This repo is designed to answer one practical question:
How much do agent architecture and tool policy affect accuracy, latency, and reasoning process quality on arithmetic-heavy tasks?
- 4 agent architectures (CoT, ReAct, Constrained ReAct, Multi-Agent pipeline)
- Shared calculator tool with safe AST-based evaluation (no `eval`)
- Automatic provider switching between Ollama and OpenAI-compatible endpoints
- End-to-end evaluation runner with per-case logs and summary table
- Rule-based process scoring (tool usage, reasoning steps, answer clarity, efficiency)
- Langfuse trace + score logging for side-by-side comparison
- Results exported to JSON for offline analysis
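The "safe AST-based evaluation" bullet above can be sketched as an AST walker that only permits arithmetic node types. This is an illustrative sketch, not the repo's actual `tools/calculator.py`, which may differ in detail:

```python
import ast
import operator

# Map permitted AST operator nodes to their Python implementations.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.Mod: operator.mod,
    ast.USub: operator.neg,
}

def safe_calc(expr: str) -> float:
    """Evaluate an arithmetic expression without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Anything else (names, calls, attributes) is rejected outright.
        raise ValueError(f"Unsupported expression node: {type(node).__name__}")
    return _eval(ast.parse(expr, mode="eval"))
```

Because only whitelisted node types are walked, inputs like `__import__('os')` fail with a `ValueError` instead of executing.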
| Agent | Core Behavior | Tool Policy |
|---|---|---|
| CoT | Pure step-by-step reasoning in one pass | No tools |
| ReAct | Reason + optional calculator actions | Tool use allowed |
| Constrained ReAct | Same as ReAct but with strict validation loop | Must use calculator for arithmetic |
| Multi-Agent | Planner -> Calculator Agent -> Verifier pipeline | Planner plans, calculator computes, verifier checks |
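Stripped of the LangGraph wiring, the Multi-Agent row above is a planner → calculator → verifier hand-off. The sketch below is framework-free and illustrative only; the function names are not the repo's actual API:

```python
def run_pipeline(problem, planner, calculator, verifier):
    """Hypothetical data flow of the Multi-Agent pipeline.

    The real agents/multi_agent.py wires these stages as graph nodes;
    here they are plain callables to show the hand-off order."""
    plan = planner(problem)           # Planner: decompose into steps
    result = calculator(plan)         # Calculator agent: compute each step
    return verifier(problem, result)  # Verifier: check and finalize the answer
```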
- File: eval/test_cases.json
- Size: 30 problems
- Mix of easy / medium / hard word problems
- Ground-truth numeric answers included
- Python 3.11+
- LangChain + LangGraph
- Langfuse v4
- Ollama (default local path) or OpenAI-compatible API
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
```

Then edit `.env` with your Langfuse keys and model provider config.
Default .env.example includes Langfuse + Ollama fields. If you want OpenAI mode, add:
```
OPENAI_API_KEY=your_key
OPENAI_MODEL=gpt-4o-mini
# optional for OpenAI-compatible proxies
OPENAI_BASE_URL=https://api.openai.com/v1
```

Provider selection logic:
- If OPENAI_API_KEY exists -> use OpenAI path
- Otherwise -> use Ollama path
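The selection rule above can be sketched as a small env-driven factory. This is illustrative only; the real logic lives in `llm_factory.py`, and the field names and defaults here are assumptions:

```python
import os

def select_provider(env=None):
    """Pick the model provider from environment config.

    Mirrors the README rule: if OPENAI_API_KEY is set, use the OpenAI
    path; otherwise fall back to Ollama. Defaults are assumed values."""
    env = env if env is not None else os.environ
    if env.get("OPENAI_API_KEY"):
        return {
            "provider": "openai",
            "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
            "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        }
    return {"provider": "ollama", "model": env.get("OLLAMA_MODEL", "qwen3:8b")}
```

The actual factory would construct the corresponding LangChain chat model from this config.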
```bash
ollama pull qwen3:8b
```

Use either:
- Langfuse Cloud: https://cloud.langfuse.com
- Self-hosted deployment docs: https://langfuse.com/docs/deployment/self-host
```bash
python -m eval.run_eval
```

Run a single agent:

```bash
python -m eval.run_eval --agent cot
python -m eval.run_eval --agent react
python -m eval.run_eval --agent constrained
python -m eval.run_eval --agent multi_agent
```

Dry run:

```bash
python -m eval.run_eval --dry-run
```

A dry run skips Langfuse logging and runs 3 cases by default.
```bash
python -m eval.run_eval --max-cases 10
python -m eval.run_eval --agent react --max-cases 5
```

For each agent, the summary reports:
- Correct / total
- Accuracy
- Average latency
- Error count
- Per-test detail list (expected, predicted, time, status)
Terminal output also includes a final comparison table across agents.
Outcome-level scores:
- correctness (1 or 0)
- latency_seconds
Process-level scores:
- tool_call_count
- reasoning_steps
- answer_clarity
- efficiency
These are sent to Langfuse as trace scores for analysis in the UI.
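The rule-based process scores above could be computed along these lines. This is a hedged sketch only; the actual rules and weights live in `eval/process_scoring.py`, and the trace field names here are assumptions:

```python
def process_scores(trace: dict) -> dict:
    """Illustrative rule-based process scoring over a recorded trace.

    Assumed trace fields: 'reasoning_text' (model output) and
    'tool_calls' (list of calculator invocations)."""
    text = trace.get("reasoning_text", "")
    steps = text.count("\n") + 1          # crude proxy: one step per line
    tool_calls = len(trace.get("tool_calls", []))
    return {
        "tool_call_count": tool_calls,
        "reasoning_steps": steps,
        # Clarity: did the agent emit an explicit final-answer line?
        "answer_clarity": 1.0 if "ANSWER:" in text else 0.0,
        # Efficiency: fewer steps and tool calls score higher.
        "efficiency": 1.0 / max(steps + tool_calls, 1),
    }
```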
- Local JSON summary written to `results/eval-.json`
- Langfuse traces tagged by agent type:
- cot
- react
- constrained
- multi_agent
```
math-agent-eval/
├── agents/
│   ├── cot_agent.py
│   ├── react_agent.py
│   ├── constrained_agent.py
│   ├── multi_agent.py
│   └── state.py
├── tools/
│   └── calculator.py
├── eval/
│   ├── test_cases.json
│   ├── run_eval.py
│   ├── scoring.py
│   └── process_scoring.py
├── llm_factory.py
├── langfuse_config.py
├── .env.example
└── README.md
```
- The calculator tool supports `+`, `-`, `*`, `/`, `**`, and `%`
- Numeric extraction prefers an explicit `ANSWER:` line, then falls back to the last number found in the output
- Accuracy check uses tolerance rules for small vs large expected values
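The extraction and tolerance notes above might look like the following. Both the regex and the thresholds are assumptions; the repo's actual rules live in `eval/scoring.py`:

```python
import re

def extract_answer(text: str):
    """Assumed rule: prefer 'ANSWER: <number>', else the last number found."""
    m = re.search(r"ANSWER:\s*(-?\d+(?:\.\d+)?)", text)
    if m:
        return float(m.group(1))
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(nums[-1]) if nums else None

def is_correct(expected: float, predicted) -> bool:
    """Hypothetical tolerance rule: absolute tolerance for small expected
    values, relative tolerance for large ones (thresholds are guesses)."""
    if predicted is None:
        return False
    if abs(expected) < 1.0:
        return abs(expected - predicted) <= 0.01
    return abs(expected - predicted) / abs(expected) <= 0.001
```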
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
python -m eval.run_eval --dry-run
```