Tools for evaluating and analyzing AI agent reasoning capabilities.
- Clone the repository:
  ```bash
  git clone https://github.com/narphorium/agent-reasoning.git
  cd agent-reasoning
  ```
- Install uv if you haven't already:
  ```bash
  # On macOS/Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # On Windows (PowerShell)
  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
- Create a virtual environment and install the package:
  ```bash
  # Create and activate a virtual environment
  uv venv
  source .venv/bin/activate  # On Unix
  # OR
  .venv\Scripts\activate     # On Windows

  # Install the package in editable mode
  uv pip install -e .
  ```
- Set up environment variables:
  Copy the example environment file and add your API keys:
  ```bash
  cp .env.example .env
  ```
  Then edit .env with your API keys. The following keys are supported (a sketch of loading them at runtime appears after these setup steps):
  ```
  # OpenAI API key for GPT models
  OPENAI_API_KEY=sk-...

  # Anthropic API key for Claude models
  ANTHROPIC_API_KEY=sk-ant-...

  # Google API key for Gemini models
  GOOGLE_API_KEY=...
  ```
- Set up Ollama (for local models):
  First, install Ollama following the official instructions.
  Then pull the DeepSeek model:
  ```bash
  # Pull the model
  ollama pull hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
  ```
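To sanity-check that your keys are visible to the scripts, the snippet below shows one way to load them. It assumes the python-dotenv package and is only a sketch, not necessarily how this repository reads its configuration.

```python
# Sketch: loading API keys from .env (assumes python-dotenv; the repo's scripts may load config differently).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    value = os.environ.get(key)
    print(f"{key}: {'set' if value else 'missing'}")
```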
Before running evaluations, you'll need to clone the repositories from the SWE-Bench dataset. Use the clone-repos script:
```bash
clone-repos <repos_path> --split test
```
Or using Python directly:
```bash
python scripts/clone_repos.py <repos_path> --split test
```
Arguments:
- `repos_path`: Directory where repositories should be cloned
- `--split`: Dataset split to use (choices: "train", "test", "validation", "dev"; default: "test")
Example:
```bash
clone-repos ~/Documents/SWE-Bench --split test
```
The script will:
- Create the target directory if it doesn't exist
- Get unique repositories from the specified dataset split
- Clone each repository (skipping any that already exist)
- Show progress with a progress bar
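For reference, here is a minimal sketch of that cloning loop. It assumes the script reads the repo column of the Hugging Face princeton-nlp/SWE-bench dataset and shells out to git; the actual implementation lives in scripts/clone_repos.py and may differ in its details.

```python
# Minimal sketch of the cloning loop (assumed behavior, not the repo's exact code).
import subprocess
from pathlib import Path

from datasets import load_dataset  # Hugging Face datasets
from tqdm import tqdm


def clone_repos(repos_path: str, split: str = "test") -> None:
    target = Path(repos_path).expanduser()
    target.mkdir(parents=True, exist_ok=True)  # create the target directory if it doesn't exist

    # Each SWE-bench instance records its source repository as "owner/name".
    dataset = load_dataset("princeton-nlp/SWE-bench", split=split)
    repos = sorted(set(dataset["repo"]))

    for repo in tqdm(repos, desc="Cloning repositories"):
        dest = target / repo.split("/")[-1]
        if dest.exists():
            continue  # skip repositories that were already cloned
        subprocess.run(
            ["git", "clone", f"https://github.com/{repo}.git", str(dest)],
            check=True,
        )
```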
The evaluation script measures how well different models can identify which files need to be modified for a given code change task.
You can run the evaluation script in two ways:
- Using the installed command:
  ```bash
  evaluate <model> <prompt> <repos_path> --split test
  ```
- Using Python directly:
  ```bash
  python scripts/evaluate.py <model> <prompt> <repos_path> --split test
  ```
Arguments:
- `model`: Name of the model to evaluate (e.g., "gpt-4", "claude-3", "o1-preview", "deepseek-8b")
- `prompt`: Name of the prompt strategy to use (e.g., "file-selection")
- `repos_path`: Path to the root directory containing all SWE-bench repositories
- `--split`: Dataset split to evaluate on (choices: "train", "test", "validation"; default: "test")
Example:
```bash
evaluate o1-preview-2024-09-12 file-selection ~/Documents/SWE-Bench --split test
```
The script creates a timestamped results directory under results/ with the following layout:
```
results/
  YYYY-MM-DD-HH-MM-SS-prompt-model/
    instance1.json
    instance2.json
    ...
    aggregate_metrics.json
```
Each instance result file contains:
- Instance evaluation metrics (precision and recall for code and test files)
- Detailed results showing true/false positives and negatives
- Complete conversation trajectory with the model
The aggregate_metrics.json file contains the averaged results across all evaluated instances.
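As a point of reference, file-selection precision and recall can be computed as a set comparison between the files the model proposes and the files actually changed in the gold patch. The helper below is illustrative only, not the exact code the evaluation script uses.

```python
# Illustrative precision/recall for file selection (hypothetical helper, not the repo's actual implementation).
def file_selection_metrics(predicted_files: set[str], gold_files: set[str]) -> dict:
    true_positives = predicted_files & gold_files    # files the model correctly selected
    false_positives = predicted_files - gold_files   # files selected but not actually modified
    false_negatives = gold_files - predicted_files   # modified files the model missed

    precision = len(true_positives) / len(predicted_files) if predicted_files else 0.0
    recall = len(true_positives) / len(gold_files) if gold_files else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "true_positives": sorted(true_positives),
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }


# Example: the model proposes two files, one of which was actually modified.
metrics = file_selection_metrics(
    {"src/core.py", "src/utils.py"},
    {"src/core.py", "tests/test_core.py"},
)
print(metrics["precision"], metrics["recall"])  # 0.5 0.5
```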