# Testing an Ollama Model on a Single SWE-Bench Task

This notebook walks through **testing an Ollama model** on **one specific task** from SWE-Bench. It is written as a *procedure* with commands and checkpoints rather than a fully automated run, so you can adapt it to your environment.

**What you will do**
1. Pick a single SWE-Bench task ID.
2. Configure the task, Ollama model, config file, and output directory.
3. Run the task once with mini-swe-agent.
4. Inspect the generated patch in `preds.json`.
5. Evaluate the result with SWE-Bench's evaluation logic.
6. Record the experiment.

---

## Prerequisites
- You have this repo cloned and can run the provided scripts.
- You have Ollama installed and the model you want to test pulled (e.g., `ollama pull gemma3:4b`, the model used in this walkthrough).
- You can run Python in the project environment.

> If you haven't set up the environment, run the project's standard setup procedure first (see the repo README).
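
To confirm the Ollama prerequisite before running anything else, a check along the following lines may help. It is a minimal sketch that assumes the Ollama server listens on its default address (`http://localhost:11434`) and exposes the `/api/tags` endpoint. The helper names `model_is_pulled` and `fetch_tags` are hypothetical, and the sample payload is hand-written rather than real server output.

```python
import json
import urllib.request


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an Ollama /api/tags response."""
    names = {m.get("name", "") for m in tags_payload.get("models", [])}
    # An untagged pull is listed as "<name>:latest", so accept either form.
    return model in names or f"{model}:latest" in names


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline example with a hand-written payload (assumed shape, not real output).
sample = {"models": [{"name": "llama3.1:latest"}, {"name": "gemma3:4b"}]}
print(model_is_pulled(sample, "gemma3:4b"))
print(model_is_pulled(sample, "llama3.1"))
```

Against a live server, `model_is_pulled(fetch_tags(), "gemma3:4b")` would perform the same check. Note that the `ollama/` prefix used later in `MODEL_NAME` is the LiteLLM provider prefix and is not part of the name Ollama reports.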

In [1]:
%run _dev_setup.py

🔁 Autoreload is ON (IPython detected).
✅ Using llm_wc from: /home/iamsikun/research/llm-wc/src/llm_wc


In [2]:
import re
from pathlib import Path
from datasets import load_dataset
import yaml

In [3]:
from minisweagent.run.extra import swebench as swebench_run
from minisweagent.run.extra.utils.batch_progress import RunBatchProgressManager
from minisweagent.config import get_config_path
from swebench.harness import prepare_images as swebench_prepare

## Step 1: Choose a single SWE-Bench task
Pick **one** task ID from the SWE-Bench dataset.

**Example task ID** (replace with any `instance_id` from the dataset):
- `astropy__astropy-12907`

**Checkpoint:** You know which task ID you want to run; Step 2 assigns it to `TASK_ID`.


In [4]:
dataset_name = "princeton-nlp/SWE-Bench_Verified"
split = "test"

ds = load_dataset(dataset_name, split=split)


In [5]:
task_ids: list = [row["instance_id"] for row in ds]

print(f"Dataset: {dataset_name}")
print(f"Number of task IDs: {len(task_ids)}")
print(f"First 5 task IDs: {task_ids[:5]}")

Dataset: princeton-nlp/SWE-Bench_Verified
Number of task IDs: 500
First 5 task IDs: ['astropy__astropy-12907', 'astropy__astropy-13033', 'astropy__astropy-13236', 'astropy__astropy-13398', 'astropy__astropy-13453']
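
If you would rather pick a task from a particular repository than take the first entry, the IDs can be filtered by prefix. The sketch below assumes every ID follows the `<owner>__<repo>-<number>` pattern seen in the output above. `task_ids_for_repo` is a hypothetical helper, and the last ID in the sample list is made up for illustration.

```python
def task_ids_for_repo(task_ids: list[str], repo_slug: str) -> list[str]:
    """Return the IDs whose '<owner>__<repo>' prefix equals `repo_slug`."""
    # Splitting on the last '-' separates the trailing number from the slug,
    # which keeps hyphenated repository names intact.
    return [tid for tid in task_ids if tid.rsplit("-", 1)[0] == repo_slug]


sample_ids = [
    "astropy__astropy-12907",
    "astropy__astropy-13033",
    "example__other-repo-42",  # made-up ID, for illustration only
]
print(task_ids_for_repo(sample_ids, "astropy__astropy"))
```

With the real list, `task_ids_for_repo(task_ids, "astropy__astropy")[0]` would give a candidate for `TASK_ID`.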


## Step 2: Configure the run (pure Python)
This replaces the shell script in `scripts/run_swebench_verified.sh` with direct Python calls.
We will set up the task ID, model, config path, and output directory.

**Checkpoint:** You have `TASK_ID`, `MODEL_NAME`, `CONFIG_PATH`, and `OUTPUT_DIR` defined.


In [6]:
# Pick a task ID (override this if you want a specific instance)
TASK_ID = task_ids[0]

# Ollama model to test (must be available in Ollama)
MODEL_NAME = "ollama/gemma3:4b"

# Config used by mini-swe-agent (Ollama defaults in this repo)
CONFIG_PATH = Path("../config/swebench_ollama.yaml")

# Dataset to use
DATASET_NAME = dataset_name
SPLIT = split

# Match the output directory layout used by scripts/run_swebench_verified.sh
sanitized_model = MODEL_NAME.replace("/", "_").replace(":", "_")
OUTPUT_DIR = Path("runs/swebench_verified") / sanitized_model / "single_task"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"TASK_ID: {TASK_ID}")
print(f"MODEL_NAME: {MODEL_NAME}")
print(f"CONFIG_PATH: {CONFIG_PATH}")
print(f"OUTPUT_DIR: {OUTPUT_DIR}")
print(f"Dataset: {DATASET_NAME}")
print(f"Split: {SPLIT}")


TASK_ID: astropy__astropy-12907
MODEL_NAME: ollama/gemma3:4b
CONFIG_PATH: ../config/swebench_ollama.yaml
OUTPUT_DIR: runs/swebench_verified/ollama_gemma3_4b/single_task
Dataset: princeton-nlp/SWE-Bench_Verified
Split: test
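
The output directory naming can be factored into a small helper so that several models share one layout. This is a sketch of the same `replace` logic used in the cell above. The function names `sanitize_model_name` and `single_task_dir` are hypothetical and not part of the repo's scripts.

```python
from pathlib import Path


def sanitize_model_name(model_name: str) -> str:
    """Replace characters that are awkward in directory names."""
    return model_name.replace("/", "_").replace(":", "_")


def single_task_dir(model_name: str, root: Path = Path("runs/swebench_verified")) -> Path:
    """Build <root>/<sanitized model>/single_task without creating it."""
    return root / sanitize_model_name(model_name) / "single_task"


print(single_task_dir("ollama/gemma3:4b").as_posix())
# prints: runs/swebench_verified/ollama_gemma3_4b/single_task
```

Keeping directory creation out of the helper leaves the `mkdir` call in the notebook cell, where its side effect is visible.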


## Step 3: Run a single SWE-Bench instance (pure Python)
This calls the **same mini-swe-agent logic** as `mini-extra swebench`, but runs a single instance directly.

**Mac/ARM note:** SWE-Bench images on DockerHub are built for x86_64. On Apple Silicon,
build the instance image locally and override `image_name` so the runner uses your local image.

**Checkpoint:** `preds.json` appears in `OUTPUT_DIR`.
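
Whether the Mac/ARM note applies can be detected from the host rather than remembered. The sketch below uses the standard library's `platform.machine()`. `is_arm_host` is a hypothetical helper, and using its result as the default for `USE_LOCAL_IMAGES` is a suggestion, not what the cell below does.

```python
import platform
from typing import Optional


def is_arm_host(machine: Optional[str] = None) -> bool:
    """Return True when the host CPU reports an ARM64 architecture."""
    # macOS on Apple Silicon reports "arm64"; Linux on ARM reports "aarch64".
    value = (machine or platform.machine()).lower()
    return value in {"arm64", "aarch64"}


print(f"ARM host: {is_arm_host()}")
```

You could then write `USE_LOCAL_IMAGES = is_arm_host()` and still override it by hand when you want a local build on x86_64.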


In [7]:
# Load the instance by ID
instance = next(row for row in ds if row['instance_id'] == TASK_ID)

# For Apple Silicon/ARM: build the image locally and override image_name
USE_LOCAL_IMAGES = True
if USE_LOCAL_IMAGES:
    local_image = f"sweb.eval.x86_64.{TASK_ID.lower()}:latest"
    swebench_prepare.main(
        dataset_name=DATASET_NAME,
        split=SPLIT,
        instance_ids=[TASK_ID],
        max_workers=1,
        force_rebuild=False,
        open_file_limit=8192,
        namespace=None,
        tag="latest",
        env_image_tag="latest",
    )
    instance["image_name"] = local_image

# Load and override config (same as CLI logic)
config_path = get_config_path(CONFIG_PATH)
config = yaml.safe_load(config_path.read_text())
config.setdefault("model", {})["model_name"] = MODEL_NAME

progress = RunBatchProgressManager(1, OUTPUT_DIR / "exit_statuses.yaml")
swebench_run.process_instance(instance, OUTPUT_DIR, config, progress)


All images exist. Nothing left to build.


2026-01-23 21:21:45,750 - minisweagent.environment - DEBUG - Starting container with command: docker run -d --name minisweagent-2c5d28c5 -w /testbed --rm sweb.eval.x86_64.astropy__astropy-12907:latest sleep 2h


2026-01-23 21:21:46,137 - minisweagent.environment - INFO - Started container minisweagent-2c5d28c5 with ID e1a648ebc3ea334d76ec9deab54ff500d5fc245d4807e7990c3779dca0e8da70
2026-01-23 21:21:46,160 - LiteLLM - INFO - 
LiteLLM completion() model= gemma3:4b; provider = ollama
2026-01-23 21:21:57,209 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler
2026-01-23 21:21:58,430 - LiteLLM - INFO - 
LiteLLM completion() model= gemma3:4b; provider = ollama
2026-01-23 21:22:06,258 - LiteLLM - INFO - Wrapper: Completed Call, calling success_handler


Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

...

2026-01-23 22:07:06,299 - minisweagent - INFO - Saved trajectory to 'runs/swebench_verified/ollama_gemma3_4b/single_task/astropy__astropy-12907/astropy__astropy-12907.traj.json'
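
The run emits an INFO line for every LiteLLM call, which quickly buries the messages that matter. One way to reduce the noise is to raise the level of the loggers involved before calling `process_instance`. The sketch below assumes the logger names match those printed in the output above (`LiteLLM` and `minisweagent.environment`). `quiet_loggers` is a hypothetical helper.

```python
import logging


def quiet_loggers(names: list[str], level: int = logging.WARNING) -> None:
    """Raise the threshold of the named loggers so INFO/DEBUG lines are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)


# Logger names taken from the log output above.
quiet_loggers(["LiteLLM", "minisweagent.environment"])
```

Warnings and errors still get through, so failed completions remain visible. If a library attaches its own handlers or resets levels at import time, this may need to run after the imports in the earlier cells.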


## Step 4: Inspect the output (pure Python)
The batch runner writes a `preds.json` file with the model patch for the instance.

**Checkpoint:** You can see the patch for `TASK_ID`.


In [None]:
import json

preds_path = OUTPUT_DIR / "preds.json"
print(f"preds.json exists: {preds_path.exists()}")

if preds_path.exists():
    preds = json.loads(preds_path.read_text())
    entry = preds.get(TASK_ID)
    print(f"Keys in entry: {list(entry.keys()) if entry else None}")
    if entry:
        print("\n--- Patch preview ---\n")
        print(entry.get("model_patch", "")[:1000])
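
Beyond previewing the first 1000 characters, it is often useful to know which files the patch touches. The sketch below assumes `model_patch` is a git-style unified diff with `diff --git a/... b/...` headers. `files_in_patch` is a hypothetical helper and the sample patch is made up.

```python
import re


def files_in_patch(patch: str) -> list[str]:
    """List the paths named in the 'diff --git' headers of a unified diff."""
    return re.findall(r"^diff --git a/(\S+) b/\S+$", patch, flags=re.MULTILINE)


# Made-up patch text, for illustration only.
sample_patch = (
    "diff --git a/pkg/mod.py b/pkg/mod.py\n"
    "--- a/pkg/mod.py\n"
    "+++ b/pkg/mod.py\n"
    "@@ -1 +1 @@\n"
    "-x = 1\n"
    "+x = 2\n"
)
print(files_in_patch(sample_patch))
```

An empty list for a real entry suggests that the model produced no usable diff, which is worth noting before you spend time on evaluation.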


## Step 5: Evaluate the result (pure Python)
This calls the SWE-bench evaluation harness directly (same as `python -m swebench.harness.run_evaluation`).

**Mac/ARM note:** pass `namespace=None` to build images locally instead of pulling x86_64 images.

**Checkpoint:** You get a PASS/FAIL result for the single instance.


In [None]:
from swebench.harness import run_evaluation as swebench_eval

preds_path = OUTPUT_DIR / "preds.json"
# Strip ":" as well as "/" so the run ID is safe to use in file and container names
run_id = f"{MODEL_NAME.replace('/', '__').replace(':', '_')}_single"

swebench_eval.main(
    dataset_name=DATASET_NAME,
    split=SPLIT,
    instance_ids=[TASK_ID],
    predictions_path=str(preds_path),
    max_workers=1,
    force_rebuild=False,
    cache_level="all",
    clean=False,
    open_file_limit=4096,
    run_id=run_id,
    timeout=900,
    namespace=None,
    rewrite_reports=False,
    modal=False,
    instance_image_tag="latest",
    env_image_tag="latest",
    report_dir=".",
)
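
Once the harness has written its JSON report, the single-instance outcome can be read from it. The sketch below assumes the report groups instance IDs under keys such as `resolved_ids`, `unresolved_ids`, `error_ids`, and `empty_patch_ids`. Treat those key names as assumptions and check them against the file your installed SWE-bench version produces. `outcome_for` is a hypothetical helper and the sample report is hand-written.

```python
def outcome_for(report: dict, task_id: str) -> str:
    """Map a task ID to a coarse outcome using the report's ID lists."""
    # Key names are assumed; adjust them to match your report file.
    buckets = (
        ("PASS", "resolved_ids"),
        ("FAIL", "unresolved_ids"),
        ("ERROR", "error_ids"),
        ("EMPTY_PATCH", "empty_patch_ids"),
    )
    for label, key in buckets:
        if task_id in report.get(key, []):
            return label
    return "UNKNOWN"


# Hand-written report fragment, for illustration only.
sample_report = {"resolved_ids": [], "unresolved_ids": ["astropy__astropy-12907"]}
print(outcome_for(sample_report, "astropy__astropy-12907"))
```

Returning `UNKNOWN` instead of raising keeps the helper usable when the instance never reached evaluation.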


## Step 6: Record the experiment
Capture a short summary so you can compare runs later.

Suggested fields:
- **Task ID**: `TASK_ID`
- **Model**: `MODEL_NAME`
- **Config**: `CONFIG_PATH`
- **Output**: `OUTPUT_DIR`
- **Result**: PASS/FAIL
- **Notes**: Any errors, retries, or unusual behavior
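
The suggested fields map naturally onto a small record type appended to a JSON Lines file, one line per run. This is a sketch only: `RunRecord`, `append_record`, and the file name `experiments.jsonl` are hypothetical, the `result` value is a placeholder rather than an actual outcome, and the example writes to a temporary directory to avoid side effects.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunRecord:
    task_id: str
    model: str
    config: str
    output_dir: str
    result: str
    notes: str = ""


def append_record(record: RunRecord, log_path: Path) -> None:
    """Append one record as a JSON line, creating parent directories if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


# Example values only; "FAIL" here is a placeholder, not a measured result.
demo_log = Path(tempfile.mkdtemp()) / "experiments.jsonl"
append_record(
    RunRecord(
        task_id="astropy__astropy-12907",
        model="ollama/gemma3:4b",
        config="../config/swebench_ollama.yaml",
        output_dir="runs/swebench_verified/ollama_gemma3_4b/single_task",
        result="FAIL",
        notes="placeholder entry",
    ),
    demo_log,
)
print(demo_log.read_text(encoding="utf-8"))
```

An append-only file makes it easy to compare runs later by loading every line into a table.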


---
## Troubleshooting tips
- **Model not found**: confirm the model exists in Ollama and that the name matches `MODEL_NAME`.
- **Missing `datasets`**: install the `datasets` package in the notebook kernel environment.
- **Docker errors**: SWE-bench runs inside containers; ensure Docker is running.
- **Evaluation timeouts**: increase `timeout` in the evaluation cell if needed.
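
For the Docker item in particular, a quick programmatic check can save a failed run. The sketch below relies on `docker info` exiting with a non-zero status when the daemon cannot be reached. `docker_is_running` is a hypothetical helper.

```python
import shutil
import subprocess


def docker_is_running(timeout_s: int = 20) -> bool:
    """Return True if the docker CLI exists and can reach a running daemon."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"], capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


print(f"Docker available: {docker_is_running()}")
```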
