Evaluate 4 different LangGraph-based agent designs on the same 30 math word problems, with consistent scoring and Langfuse observability.
This repo is designed to answer one practical question:
How much do agent architecture and tool policy affect accuracy, latency, and reasoning process quality on arithmetic-heavy tasks?
- 4 agent architectures (CoT, ReAct, Constrained ReAct, Multi-Agent pipeline)
- Shared calculator tool with safe AST-based evaluation (no `eval`)
- Automatic provider switching between Ollama and OpenAI-compatible endpoints
- End-to-end evaluation runner with per-case logs and summary table
- Rule-based process scoring (tool usage, reasoning steps, answer clarity, efficiency)
- Langfuse trace + score logging for side-by-side comparison
- Results exported to JSON for offline analysis
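The "safe AST-based evaluation" bullet above can be sketched as an AST walker that only permits arithmetic node types. This is an illustrative sketch, not the repo's actual `tools/calculator.py`, which may differ in detail:

```python
import ast
import operator

# Map permitted AST operator nodes to their Python implementations.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.Mod: operator.mod,
    ast.USub: operator.neg,
}

def safe_calc(expr: str) -> float:
    """Evaluate an arithmetic expression without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Anything else (names, calls, attributes) is rejected outright.
        raise ValueError(f"Unsupported expression node: {type(node).__name__}")
    return _eval(ast.parse(expr, mode="eval"))
```

Because only whitelisted node types are walked, inputs like `__import__('os')` fail with a `ValueError` instead of executing.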
| Agent | Core Behavior | Tool Policy |
|---|---|---|
| CoT | Pure step-by-step reasoning in one pass | No tools |
| ReAct | Reason + optional calculator actions | Tool use allowed |
| Constrained ReAct | Same as ReAct but with strict validation loop | Must use calculator for arithmetic |
| Multi-Agent | Planner -> Calculator Agent -> Verifier pipeline | Planner plans, calculator computes, verifier checks |
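Stripped of the LangGraph wiring, the Multi-Agent row above is a planner → calculator → verifier hand-off. The sketch below is framework-free and illustrative only; the function names are not the repo's actual API:

```python
def run_pipeline(problem, planner, calculator, verifier):
    """Hypothetical data flow of the Multi-Agent pipeline.

    The real agents/multi_agent.py wires these stages as graph nodes;
    here they are plain callables to show the hand-off order."""
    plan = planner(problem)           # Planner: decompose into steps
    result = calculator(plan)         # Calculator agent: compute each step
    return verifier(problem, result)  # Verifier: check and finalize the answer
```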
- File: eval/test_cases.json
- Size: 30 problems
- Mix of easy / medium / hard word problems
- Ground-truth numeric answers included
- Python 3.11+
- LangChain + LangGraph
- Langfuse v4
- Ollama (default local path) or OpenAI-compatible API
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
```

Then edit `.env` with your Langfuse keys and model provider config.
Default .env.example includes Langfuse + Ollama fields. If you want OpenAI mode, add:
```
OPENAI_API_KEY=your_key
OPENAI_MODEL=gpt-4o-mini
# optional for OpenAI-compatible proxies
OPENAI_BASE_URL=https://api.openai.com/v1
```

Provider selection logic:
- If OPENAI_API_KEY exists -> use OpenAI path
- Otherwise -> use Ollama path
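The selection rule above can be sketched as a small env-driven factory. This is illustrative only; the real logic lives in `llm_factory.py`, and the field names and defaults here are assumptions:

```python
import os

def select_provider(env=None):
    """Pick the model provider from environment config.

    Mirrors the README rule: if OPENAI_API_KEY is set, use the OpenAI
    path; otherwise fall back to Ollama. Defaults are assumed values."""
    env = env if env is not None else os.environ
    if env.get("OPENAI_API_KEY"):
        return {
            "provider": "openai",
            "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
            "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        }
    return {"provider": "ollama", "model": env.get("OLLAMA_MODEL", "qwen3:8b")}
```

The actual factory would construct the corresponding LangChain chat model from this config.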
```bash
ollama pull qwen3:8b
```

Use either:
- Langfuse Cloud: https://cloud.langfuse.com
- Self-hosted deployment docs: https://langfuse.com/docs/deployment/self-host
```bash
python -m eval.run_eval
```

Run a single agent:

```bash
python -m eval.run_eval --agent cot
python -m eval.run_eval --agent react
python -m eval.run_eval --agent constrained
python -m eval.run_eval --agent multi_agent
```

Dry run:

```bash
python -m eval.run_eval --dry-run
```

A dry run skips Langfuse logging and runs 3 cases by default.
```bash
python -m eval.run_eval --max-cases 10
python -m eval.run_eval --agent react --max-cases 5
```

For each agent, the summary reports:
- Correct / total
- Accuracy
- Average latency
- Error count
- Per-test detail list (expected, predicted, time, status)
Terminal output also includes a final comparison table across agents.
Outcome-level scores:
- correctness (1 or 0)
- latency_seconds
Process-level scores:
- tool_call_count
- reasoning_steps
- answer_clarity
- efficiency
These are sent to Langfuse as trace scores for analysis in the UI.
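The rule-based process scores above could be computed along these lines. This is a hedged sketch only; the actual rules and weights live in `eval/process_scoring.py`, and the trace field names here are assumptions:

```python
def process_scores(trace: dict) -> dict:
    """Illustrative rule-based process scoring over a recorded trace.

    Assumed trace fields: 'reasoning_text' (model output) and
    'tool_calls' (list of calculator invocations)."""
    text = trace.get("reasoning_text", "")
    steps = text.count("\n") + 1          # crude proxy: one step per line
    tool_calls = len(trace.get("tool_calls", []))
    return {
        "tool_call_count": tool_calls,
        "reasoning_steps": steps,
        # Clarity: did the agent emit an explicit final-answer line?
        "answer_clarity": 1.0 if "ANSWER:" in text else 0.0,
        # Efficiency: fewer steps and tool calls score higher.
        "efficiency": 1.0 / max(steps + tool_calls, 1),
    }
```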
- Local JSON summary written to `results/eval-.json`
- Langfuse traces tagged by agent type:
- cot
- react
- constrained
- multi_agent
```
math-agent-eval/
├── agents/
│   ├── cot_agent.py
│   ├── react_agent.py
│   ├── constrained_agent.py
│   ├── multi_agent.py
│   └── state.py
├── tools/
│   └── calculator.py
├── eval/
│   ├── test_cases.json
│   ├── run_eval.py
│   ├── scoring.py
│   └── process_scoring.py
├── llm_factory.py
├── langfuse_config.py
├── .env.example
└── README.md
```
- The calculator tool supports `+`, `-`, `*`, `/`, `**`, and `%`
- Numeric extraction prefers an explicit `ANSWER:` line, then falls back to the last number found in the output
- Accuracy check uses tolerance rules for small vs large expected values
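The extraction and tolerance notes above might look like the following. Both the regex and the thresholds are assumptions; the repo's actual rules live in `eval/scoring.py`:

```python
import re

def extract_answer(text: str):
    """Assumed rule: prefer 'ANSWER: <number>', else the last number found."""
    m = re.search(r"ANSWER:\s*(-?\d+(?:\.\d+)?)", text)
    if m:
        return float(m.group(1))
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(nums[-1]) if nums else None

def is_correct(expected: float, predicted) -> bool:
    """Hypothetical tolerance rule: absolute tolerance for small expected
    values, relative tolerance for large ones (thresholds are guesses)."""
    if predicted is None:
        return False
    if abs(expected) < 1.0:
        return abs(expected - predicted) <= 0.01
    return abs(expected - predicted) / abs(expected) <= 0.001
```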
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
python -m eval.run_eval --dry-run
```