# debug-gym: A Text-Based Environment for Interactive Debugging

`debug-gym` is a text-based interactive debugging framework, designed for debugging Python programs.

[[Technical Report](https://arxiv.org/abs/2503.21557)] [[Project Page](https://aka.ms/debug-gym/)]


### Installation
Follow the [installation instructions](https://github.com/microsoft/debug-gym#installation) to set up the environment. It should be as simple as running:

```bash
# Clone the repository (tutorial branch)
git clone https://github.com/microsoft/debug-gym.git -b tutorial
cd debug-gym

# Install uv and create a virtual environment
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv
source .venv/bin/activate

# Install the package in editable mode
uv pip install -e .
```

In [None]:
# In a Codespace, you can install the package directly from the notebook.
!pip install -e ..

### Tutorial

`debug-gym` supports the following widely used coding benchmarks:

| Benchmark name | Link |
| :-: | :----- |
| `aider` | [https://github.com/Aider-AI/aider](https://github.com/Aider-AI/aider) |
| `swebench`| [https://github.com/princeton-nlp/SWE-bench](https://github.com/princeton-nlp/SWE-bench) |
| `swesmith`| [https://github.com/SWE-bench/SWE-smith](https://github.com/SWE-bench/SWE-smith) |
| `mini_nightmare` | A set of 10 hand-crafted, minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |


For this tutorial, we will use the `mini_nightmare` benchmark, in particular the `pandas_dataframe` task, a buggy code snippet that requires the agent to inspect the columns of a pandas dataframe obtained from the web.

In [1]:
from debug_gym.gym.envs import MiniNightmareEnv
from debug_gym.logger import DebugGymLogger

# For the sake of this tutorial, we disable the logger to avoid cluttering the output.
logger = DebugGymLogger("debug-gym", level="ERROR")
logger.disabled = True

# Initialize the MiniNightmare benchmark environment.
# This will download the dataset and set up the environment.
env = MiniNightmareEnv(logger=logger)

# Load the dataset and print the available tasks.
print(f"Available tasks: {sorted(env.dataset)}")

Available tasks: ['config', 'counter', 'grader', 'pandas_dataframe', 'patcher', 'purr', 'scientific_calculator', 'shopping_cart', 'sum_tree', 'tomorrow_date']


#### Starting the `pandas_dataframe` task

We will start the `pandas_dataframe` task using the `debug-gym` environment. This will initialize the environment and provide us with the first observation.

`MiniNightmareEnv` is an interactive environment that follows the [Gymnasium](https://github.com/Farama-Foundation/Gymnasium) paradigm. Once the environment `env` is instantiated, call `env.reset()` to start an episode and receive the initial information. Then interact with the environment through `env.step(action)`, where `action` specifies one of the available tools (see below); each call returns the subsequent information (e.g., error messages, debugger stdout).
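
The reset/step pattern described above can be sketched with a toy environment. This is only an illustration of the interaction loop: `ToyEnv`, `ToyInfo`, and `run_episode` are hypothetical names invented for this sketch and are not part of `debug-gym`, whose real info object carries more fields.

```python
from dataclasses import dataclass


@dataclass
class ToyInfo:
    """Hypothetical stand-in for the info returned by reset() and step()."""
    observation: str
    score: int
    max_score: int
    done: bool


class ToyEnv:
    """Hypothetical toy environment; NOT part of debug-gym."""

    def reset(self, options=None):
        # Start a new episode and return the initial information.
        self.fixed = False
        return ToyInfo("1 test failing", score=0, max_score=1, done=False)

    def step(self, action):
        # Apply one action and return the resulting information.
        if action == "rewrite":
            self.fixed = True
            return ToyInfo("code rewritten", 0, 1, False)
        if action == "eval" and self.fixed:
            return ToyInfo("1 test passing", 1, 1, True)
        return ToyInfo("1 test failing", 0, 1, False)


def run_episode(env, actions):
    """Reset the environment, then apply actions until the episode is done."""
    info = env.reset()
    for action in actions:
        info = env.step(action)
        if info.done:
            break
    return info


final = run_episode(ToyEnv(), ["eval", "rewrite", "eval"])
print(final.observation, final.score, final.done)
```

The loop stops as soon as `done` is set, which mirrors how an agent would stop once all tests pass.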

In [2]:
info = env.reset(options={"task_name": "pandas_dataframe"})
print(info)

                      DEBUG GYM ENVIRONMENT INFO                      
📊 Status: 🔄 (IN PROGRESS)	🎯 Score: 0/1	✏️ Rewrites: 0
👁️ Observation:
```
collected 1 item

test.py F

FAILED test.py::test_calculate_stats - KeyError: 'Price'
```

🛠️  Available Tools (0):
   

🔴 Breakpoints:
   None set

📁 Directory Structure:
   Listing files in the current working directory. (read-only) indicates read-only files. Max depth: 1.
   /tmp/RepoEnv-hk6i04hj/
   |-- pandas_dataframe_code.py
   |-- test.py (read-only)


#### Adding tools 🛠️

One of the core designs of `debug-gym` is the notion of tools. Users can dynamically import tools, or develop customized tools and use them in the environment. Tools are modules that augment an agent's action space or observation space, or provide additional functionality to the agent. Below is the set of tools we have implemented so far.

| Tool name | Description |
| :-: | :----- |
| `listdir` | It returns the directory tree at a given subdirectory. This is particularly useful when dealing with a repository with multiple files. |
| `view` | It is used to change an agent's focus to a particular source code file. This is particularly useful when dealing with a repository with multiple files. |
| `eval` | It runs the current code repository using the provided entrypoint (e.g., pytest), and returns the terminal's output (e.g., error message). |
| `pdb` | Interactive debugger wrapping the [Python pdb tool](https://docs.python.org/3/library/pdb.html). In addition, users can choose to maintain a set of persistent breakpoints (as in some programming IDEs), which are not reset after every eval. With this feature, a new pdb debugging session is activated automatically with all the breakpoints restored. Note that such breakpoints can be cleared by pdb commands such as `cl`. |
| `rewrite` | It can be used to rewrite a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
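
To make the `rewrite` inputs concrete, here is a minimal sketch of replacing a line range with new code. It assumes 1-indexed, inclusive line numbers, which is a guess based on the tool description and may differ from the real tool; `rewrite_lines` is a hypothetical helper, not a `debug-gym` function.

```python
def rewrite_lines(source: str, start: int, end: int, new_code: str) -> str:
    """Replace lines start..end (assumed 1-indexed, inclusive) with new_code."""
    lines = source.splitlines()
    if not (1 <= start <= end <= len(lines)):
        raise ValueError(f"invalid line range [{start}, {end}]")
    # Python slices are 0-indexed and end-exclusive, hence start - 1 and end.
    lines[start - 1:end] = new_code.splitlines()
    return "\n".join(lines) + "\n"


buggy = "def total(prices):\n    return prices[0]\n"
fixed = rewrite_lines(buggy, 2, 2, "    return sum(prices)")
print(fixed)
```

Because only the targeted range is replaced, the rest of the file stays untouched, which matches the idea of patching the bug rather than rewriting the whole file.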

In [3]:

from debug_gym.gym.tools.toolbox import Toolbox

for tool in ["view", "rewrite", "eval"]:
    env.add_tool(Toolbox.get_tool(tool))

info = env.reset(options={"task_name": "pandas_dataframe"})
print(info)

                      DEBUG GYM ENVIRONMENT INFO                      
📊 Status: 🔄 (IN PROGRESS)	🎯 Score: 0/1	✏️ Rewrites: 0
👁️ Observation:
```
collected 1 item

test.py F

FAILED test.py::test_calculate_stats - KeyError: 'Price'
```

🛠️  Available Tools (3):
   view(path:string, start:number, end:number, include_line_numbers_and_breakpoints:boolean): Specify a file path to view its content.
   rewrite(path:string, start:number, end:number, new_code:string): Rewrite the content of the specified file path, between lines [start, end], with the new code.
   eval(): Evaluate the current code against pre-defined test cases.

🔴 Breakpoints:
   None set

📁 Directory Structure:
   Listing files in the current working directory. (read-only) indicates read-only files. Max depth: 1.
   /tmp/RepoEnv-32lac0qw/
   |-- pandas_dataframe_code.py
   |-- test.py (read-only)


#### Setting up the LLM

To use `debug-gym`, you need to set up a language model (LLM) that acts as the agent interacting with the environment. You can use any LLM that is compatible with the `debug-gym` framework. Currently, we support OpenAI, Azure OpenAI, and Anthropic. We also support local LLMs served by vLLM through its OpenAI-compatible API.
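
Since the configuration below reads the API key from an environment variable, it can help to check for the variable first and fail with a readable message. The following is a small sketch of such a check; `require_env` and the `DEBUG_GYM_TUTORIAL_DEMO_KEY` variable are hypothetical names used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before continuing.")
    return value


# Demonstration with a placeholder variable; never hard-code a real key.
os.environ["DEBUG_GYM_TUTORIAL_DEMO_KEY"] = "placeholder-value"
print(require_env("DEBUG_GYM_TUTORIAL_DEMO_KEY"))
```

In the notebook, the same helper could be called with `"OPENAI_API_KEY"` in place of a direct `os.environ[...]` lookup.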


In [4]:
import os
from debug_gym.llms.base import LLM, LLMConfig
from debug_gym.llms import OpenAILLM

MODEL = "gpt-4.1"  # gpt-5 tool calling seems broken at the moment 🤷.
llm_config = LLMConfig(
    model=MODEL,
    context_limit=128,
    endpoint="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)

llm = OpenAILLM(MODEL, logger=logger, llm_config=llm_config)
llm.client.models.retrieve(MODEL)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Model(id='gpt-4.1', created=1744316542, object='model', owned_by='system')

#### Running agent loop

We provide the LLM-based agents listed below. They all have a minimal design and serve to demonstrate the `debug-gym` APIs.

| Agent name | Available Tools | Description |
| :-: | :-: | :----- |
| `debug_agent` | `pdb`, `rewrite`, `view`, `eval` | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
| `rewrite_agent` | `rewrite`, `view`, `eval`  | A `debug_agent` with the `pdb` tool disabled (the agent can only keep rewriting). |
| `debug_5_agent` | `pdb`, `rewrite`, `view`, `eval`  | A `debug_agent` in which the `pdb` tool is enabled only after a certain number of rewrites. |
| `solution_agent` | `pdb`, `eval`  | An oracle agent that applies a gold patch (only works with `swebench` and `swesmith` benchmarks for now). The agent checks that tests are failing before applying the patch, and passing after. It also checks that `pdb` tool can be used as expected. |
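
The idea behind `debug_5_agent`, enabling `pdb` only after some number of rewrites, can be sketched as a simple gate on the tool list. The threshold of 5 (inferred from the agent's name), the `>=` comparison, and the `tools_for_step` helper are all assumptions made for this sketch and not the actual agent implementation.

```python
def tools_for_step(rewrite_count: int, pdb_after: int = 5) -> list:
    """Return the tool names available given how many rewrites have happened."""
    tools = ["view", "rewrite", "eval"]
    # Assumption: pdb unlocks once the rewrite count reaches the threshold.
    if rewrite_count >= pdb_after:
        tools.append("pdb")
    return tools


print(tools_for_step(0))
print(tools_for_step(5))
```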

For the sake of this tutorial, we will copy a minimal version of our agents' running loop.
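
The loop keeps a history of observations and LLM responses and turns it into prompt messages. As a rough mental model, assuming `MEMORY_SIZE` caps how many recent steps are replayed into the prompt, a bounded history could look like the sketch below. `BoundedHistory` is a hypothetical stand-in and does not reproduce `HistoryTracker` or `build_history_prompt`.

```python
from collections import deque
from typing import Optional


class BoundedHistory:
    """Hypothetical stand-in for a history tracker; NOT debug-gym's HistoryTracker."""

    def __init__(self, max_steps: int):
        # A deque with maxlen silently discards the oldest entry when full.
        self._steps = deque(maxlen=max_steps)

    def step(self, observation: str, action: Optional[str]) -> None:
        self._steps.append({"observation": observation, "action": action})

    def to_messages(self) -> list:
        """Flatten the retained steps into chat messages, oldest first."""
        messages = []
        for entry in self._steps:
            if entry["action"] is not None:
                messages.append({"role": "assistant", "content": entry["action"]})
            messages.append({"role": "user", "content": entry["observation"]})
        return messages


demo_history = BoundedHistory(max_steps=2)
demo_history.step("1 test failing", None)  # the initial state has no action
demo_history.step("file contents", "view")
demo_history.step("1 test passing", "eval")
print(demo_history.to_messages())
```

With `max_steps=2`, the initial observation is dropped once the third step arrives, so the prompt size stays bounded as the episode grows.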

In [5]:
import json
from termcolor import colored

from debug_gym.agents.history_tracker import HistoryTracker, build_history_prompt

MAX_STEPS = 20  # Maximum number of steps to run the agent.
MEMORY_SIZE = 20  # Size of the history tracker.
TASK_NAME = "pandas_dataframe"
SYSTEM_PROMPT = (
    "Your goal is to debug a Python program to make sure it can pass a set of test functions."
    " You have access to a set of tools, you can use them to investigate the code and propose a rewriting patch to fix the bugs."
    " Avoid rewriting the entire code, focus on the bugs only. At every step, you have to use one of the tools via function calling."
    " You can only call one tool at a time. Do not repeat your previous action unless they can provide more information."
    " You can think step by step to help you make the decision at every step, but you must be concise and avoid overthinking."
    " Output both your thinking process (if any) and the tool call in the response."
)

def run():
    # We will use a history tracker to keep track of the agent's actions and observations.
    # This will be used to build the prompt for the LLM.
    history = HistoryTracker(MEMORY_SIZE)

    # Let's reset the environment and get the initial state.
    info = env.reset(options={"task_name": TASK_NAME})
    history.step(new_info=info, llm_responses=None)  # initial state does not have response

    # List the available tools that the LLM can use.
    print("Available tools:")
    for tool in llm.define_tools(info.tools):
        tool = tool.get("function", {})
        name = tool.get("name", "<unknown>")
        desc = tool.get("description", "").split(".", 1)[0].strip() + "."
        props = tool.get("parameters", {}).get("properties", {})
        args = ", ".join(props.keys()) if props else ""
        print(f"- {name}({args}): {desc}")

    print(f"\nSystem prompt:\n{colored(SYSTEM_PROMPT, 'yellow')}")

    highscore = info.score
    for step in range(1, MAX_STEPS+1):
        print(f"\n{'='*20} STEP {step} {'='*20}")
        highscore = max(highscore, info.score)
        print(f"Nb. of tests passed: {info.score:>4}/{info.max_score:<4} ({info.score/info.max_score:.1%}) [Best: {highscore}]\n")

        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        messages.extend(build_history_prompt(history, llm))

        # Print the last observation from the environment.
        print(colored(messages[-1]["content"], "magenta"))

        # Send the prompt to the LLM and get the response.
        llm_response = llm(messages, info.tools)

        # Print the LLM response.
        print(colored(f"{llm_response.response}", "cyan"))
        print(colored(f"Tool call: {llm_response.tool}", "cyan", attrs=["bold"]))

        # Send the response to the environment and get the next state.
        info = env.step(llm_response.tool, llm_response.response)
        history.step(info, llm_response)

        if info.done:
            break

    reason = "bug fixed" if info.done else "max steps reached"
    print(f"Step: {step} | Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%}) | Reason: {reason}")

    return history

history = run()

Available tools:
- view(path, start, end, include_line_numbers_and_breakpoints): Specify a file path to view its content.
- rewrite(path, start, end, new_code): Rewrite the content of the specified file path, between lines [start, end], with the new code.
- eval(): Evaluate the current code against pre-defined test cases.

System prompt:
Your goal is to debug a Python program to make sure it can pass a set of test functions. You have access to a set of tools, you can use them to investigate the code and propose a rewriting patch to fix the bugs. Avoid rewriting the entire code, focus on the bugs only. At every step, you have to use one of the tools via function calling. You can only call one tool at a time. Do not repeat your previous action unless they can provide more information. You can think step by step to help you make the decision at every step, but you must be concise and avoid overthinking. Output both your thinking process (if any) and the tool call in the response.

---
Let's add the pdb tool to the environment and run the agent loop again.

In [6]:
if not env.has_tool("pdb"):
    env.add_tool(Toolbox.get_tool("pdb"))
    SYSTEM_PROMPT += (
        " Do not use rewrite for adding print statement, use the pdb tool instead."
    )

history = run()

Available tools:
- view(path, start, end, include_line_numbers_and_breakpoints): Specify a file path to view its content.
- rewrite(path, start, end, new_code): Rewrite the content of the specified file path, between lines [start, end], with the new code.
- eval(): Evaluate the current code against pre-defined test cases.
- pdb(command): An interface to the Python debugger PDB.

System prompt:
Your goal is to debug a Python program to make sure it can pass a set of test functions. You have access to a set of tools, you can use them to investigate the code and propose a rewriting patch to fix the bugs. Avoid rewriting the entire code, focus on the bugs only. At every step, you have to use one of the tools via function calling. You can only call one tool at a time. Do not repeat your previous action unless they can provide more information. You can think step by step to help you make the decision at every step, but you must be concise and avoid overthinking. Output both your thinkin

#### Analysis and Visualization

We provide a script, `json_log_viewer.py`, to help analyze the log files (e.g., the `.jsonl` files) generated by the `debug-gym` agents.

First, let's save the agent's history to a `.jsonl` file. This file will contain all the interactions between the agent and the environment, including the actions taken by the agent, the observations received, and any errors encountered.

In [7]:
jsonl_output = {
    "problem": TASK_NAME,
    "config": {},
    "tools": llm.define_tools(env.tools),
    "uuid": "N/A",
    "success": env.done,
    "log": [history.json(i) for i in range(len(history))],
    "agent_type": "custom",
    "logger": "N/A",
}

with open("tutorial.jsonl", "w") as f:
    json.dump(jsonl_output, f, indent=4)
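
Before opening the viewer, the saved file can be checked quickly from Python. Note that the cell above writes a single indented JSON document despite the `.jsonl` extension, so the sketch below loads the whole file at once. `summarize_log` is a hypothetical helper, and the sample log uses placeholder entries rather than real agent output.

```python
import json
import os
import tempfile


def summarize_log(path: str) -> dict:
    """Load a saved log file and return a few headline fields."""
    with open(path) as f:
        data = json.load(f)
    return {
        "problem": data.get("problem"),
        "success": data.get("success"),
        "num_steps": len(data.get("log", [])),
    }


# Write a tiny sample log (placeholder content) and summarize it.
sample = {"problem": "pandas_dataframe", "success": False, "log": [{}, {}, {}]}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w") as f:
        json.dump(sample, f, indent=4)
    summary = summarize_log(sample_path)

print(summary)
```

In the notebook, `summarize_log("tutorial.jsonl")` would report the task name, the success flag, and the number of logged steps.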


In [8]:

# Run the viewer.
!python ../analysis/json_log_viewer/json_log_viewer.py

 * Serving Flask app 'json_log_viewer'
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.209.224.183:5000
Press CTRL+C to quit
^C
