SWE-ContextBench is a benchmark for evaluating how well programming agents, such as AI coding systems, reuse past experience when solving new tasks. It is built on top of existing datasets including SWE-Bench Lite, SWE-Bench Multilingual, and SWE-Bench Verified. The dataset contains 1,100 base tasks along with 376 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests. Tasks are grouped by shared context, enabling the study of how effectively an agent transfers knowledge across similar tasks. The dataset spans 51 real-world GitHub repositories and covers 9 programming languages.
SWE-ContextBench is introduced in the research paper SWE Context Bench: A Benchmark for Context Learning in Coding.
The SWE-ContextBench dataset can be downloaded from Hugging Face.
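For example, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch follows; note that the hub path and split name below are illustrative assumptions, so check the dataset card for the actual identifiers:

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# NOTE: the hub path "SWE-ContextBench/SWE-ContextBench" and the split
# name "test" are assumptions for illustration; consult the dataset
# card on Hugging Face for the real identifiers.
from datasets import load_dataset

ds = load_dataset("SWE-ContextBench/SWE-ContextBench", split="test")
print(len(ds))       # number of task instances
print(ds[0].keys())  # inspect the available fields
```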
We provide pre-built Docker images for the related tasks.
Step 1: Put your predictions in the predictions folder, named {instance_id}_preds.json.
Each {instance_id}_preds.json file follows the same data format convention as the SWE-Bench series of datasets (a writer sketch follows the schema below):
{
  "instance_id": {
    "model_name_or_path": {model_name_or_path},
    "instance_id": {instance_id},
    "model_patch": {model_patch}
  }
}
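As a convenience, here is a minimal sketch for writing one prediction file in this layout. The model name, instance id, and patch string are placeholder values, not real dataset entries:

```python
# Sketch: write one prediction file matching the schema above.
# The values below are placeholders; substitute your own model name,
# instance id, and unified-diff patch string.
import json
from pathlib import Path

instance_id = "example__repo-1234"  # placeholder instance id
prediction = {
    instance_id: {
        "model_name_or_path": "my-model",  # placeholder model name
        "instance_id": instance_id,
        "model_patch": "diff --git a/f.py b/f.py\n...",  # placeholder patch
    }
}

out_dir = Path("predictions")
out_dir.mkdir(exist_ok=True)
(out_dir / f"{instance_id}_preds.json").write_text(json.dumps(prediction, indent=2))
```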
Step 2: Run the evaluation script:
# Run only on "lite" subset
./evaluation.sh {run_id} lite
# Run on full dataset
./evaluation.sh {run_id} full
Here {run_id} can be any string, e.g., my_run_id.
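Before launching a run, it can be worth sanity-checking the predictions folder. The following sketch is a convenience check, not part of the official tooling; it verifies that each *_preds.json file parses and carries the expected keys:

```python
# Sketch: sanity-check prediction files before running ./evaluation.sh.
# This is a convenience check, not part of the official tooling.
import json
from pathlib import Path

REQUIRED_KEYS = {"model_name_or_path", "instance_id", "model_patch"}

for path in Path("predictions").glob("*_preds.json"):
    data = json.loads(path.read_text())
    for instance_id, pred in data.items():
        missing = REQUIRED_KEYS - pred.keys()
        if missing:
            print(f"{path.name}: instance {instance_id} missing {missing}")
print("check complete")
```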
- 02-05-26: Code uploaded and dataset released 👩‍💻
