SWE-ContextBench is a benchmark for evaluating how well programming agents, such as AI coding systems, reuse past experience when solving new tasks. It is built on top of existing datasets including SWE-Bench Lite, SWE-Bench Multilingual, and SWE-Bench Verified. The dataset contains 1,100 base tasks along with 376 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests. Tasks are grouped by shared context, enabling the study of how effectively an agent transfers knowledge across similar tasks. The dataset spans 51 real-world GitHub repositories and covers 9 programming languages.
SWE-ContextBench is introduced in the research paper SWE Context Bench: A Benchmark for Context Learning in Coding.
The SWE-ContextBench dataset can be downloaded from Hugging Face.
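For example, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch follows; note that the hub path and split name below are illustrative assumptions, so check the dataset card for the actual identifiers:

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# NOTE: the hub path "SWE-ContextBench/SWE-ContextBench" and the split
# name "test" are assumptions for illustration; consult the dataset
# card on Hugging Face for the real identifiers.
from datasets import load_dataset

ds = load_dataset("SWE-ContextBench/SWE-ContextBench", split="test")
print(len(ds))       # number of task instances
print(ds[0].keys())  # inspect the available fields
```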
We provide pre-built Docker images for the related tasks.
Step 1: Put your predictions in the predictions folder, named {instance_id}_preds.json.
Each {instance_id}_preds.json file follows the same data format convention as the SWE-Bench series of datasets (a writer sketch follows the schema below):
{
  "instance_id": {
    "model_name_or_path": {model_name_or_path},
    "instance_id": {instance_id},
    "model_patch": {model_patch}
  }
}
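As a convenience, here is a minimal sketch for writing one prediction file in this layout. The model name, instance id, and patch string are placeholder values, not real dataset entries:

```python
# Sketch: write one prediction file matching the schema above.
# The values below are placeholders; substitute your own model name,
# instance id, and unified-diff patch string.
import json
from pathlib import Path

instance_id = "example__repo-1234"  # placeholder instance id
prediction = {
    instance_id: {
        "model_name_or_path": "my-model",  # placeholder model name
        "instance_id": instance_id,
        "model_patch": "diff --git a/f.py b/f.py\n...",  # placeholder patch
    }
}

out_dir = Path("predictions")
out_dir.mkdir(exist_ok=True)
(out_dir / f"{instance_id}_preds.json").write_text(json.dumps(prediction, indent=2))
```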
Step 2: Run the evaluation script:
# Run only on "lite" subset
./evaluation.sh {run_id} lite
# Run on full dataset
./evaluation.sh {run_id} full
Here {run_id} can be any string, e.g., my_run_id.
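Before launching a run, it can be worth sanity-checking the predictions folder. The following sketch is a convenience check, not part of the official tooling; it verifies that each *_preds.json file parses and carries the expected keys:

```python
# Sketch: sanity-check prediction files before running ./evaluation.sh.
# This is a convenience check, not part of the official tooling.
import json
from pathlib import Path

REQUIRED_KEYS = {"model_name_or_path", "instance_id", "model_patch"}

for path in Path("predictions").glob("*_preds.json"):
    data = json.loads(path.read_text())
    for instance_id, pred in data.items():
        missing = REQUIRED_KEYS - pred.keys()
        if missing:
            print(f"{path.name}: instance {instance_id} missing {missing}")
print("check complete")
```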
- 02-05-26: Code uploaded and dataset released 👩‍💻
