iSEngLab/TEBench

Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

Leaderboard: https://tebench-leadership.vercel.app/

TEBench is the first project-level benchmark for test evolution — the task of keeping a test suite synchronized with evolving production code. Given a project repository and a code-changing commit, TEBench requires systems to autonomously identify tests that need modification, determine where new tests are needed, and produce the corresponding test patch.

TEBench curates 314 task instances from 10 real-world Java projects, all drawn from the Defects4J ecosystem with developer-written ground truth. Each instance is annotated with one or more of three evolution types.

Evolution Types

| Type | Description |
| --- | --- |
| Test-Breaking | An existing test fails to compile or execute after the code change. The developer modifies it to restore correctness. |
| Test-Stale | An existing test still passes after the code change, but the developer updates it so it better reflects the revised semantics. |
| Test-Missing | The developer adds a new test method to cover behavior introduced or exposed by the change. |

Test-Breaking and Test-Stale together correspond to what prior work calls obsolete tests. Test-Missing extends the scope of the task beyond test repair to capture tests that need to be created from scratch in response to a commit.

Dataset Overview

| Project | Tasks | Src LOC | Test Files | Breaking | Stale | Missing |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| commons-cli | 18 | 9,716 | 51 | 8 | 12 | 9 |
| commons-codec | 19 | 25,102 | 84 | 12 | 11 | 12 |
| commons-collections | 23 | 80,241 | 300 | 10 | 14 | 15 |
| commons-compress | 86 | 92,057 | 260 | 34 | 58 | 53 |
| commons-csv | 31 | 6,295 | 43 | 22 | 18 | 16 |
| commons-lang | 69 | 101,573 | 275 | 28 | 46 | 40 |
| commons-math | 8 | 142,903 | 403 | 8 | 3 | 2 |
| gson | 1 | 22,329 | 139 | 1 | 0 | 1 |
| jfreechart | 3 | 211,097 | 361 | 1 | 3 | 1 |
| jsoup | 56 | 27,390 | 84 | 48 | 42 | 50 |
| **Total** | **314** | **718,703** | **2,000** | **172** | **207** | **199** |

Multi-label distribution. 219 tasks (69.7%) carry multiple labels, and 45 tasks (14.3%) exhibit all three types simultaneously. The most frequent combination is Stale + Missing (105 tasks, 33.4%); only 95 tasks (30.3%) involve a single evolution type, confirming that real-world test evolution is predominantly multi-faceted.

Task complexity. The median task touches 4 changed files, 34 lines of source changes, and 32 lines of test changes. The distribution exhibits a long tail: the most complex task spans 20 files with 732 lines of source changes. 114 tasks (36.4%) involve modifications to more than one test file; 63 (20.1%) span multiple test packages; 236 (75.2%) require changes to more than one test method.

Temporal distribution. Commits span 2016 – 2025, with 77.4% from 2020 or later (39.8% from 2024 – 2025), so the dataset reflects contemporary development practices and coding conventions.

Three-Version Structure

Each task instance is built around a three-version structure:

V-1   (parent commit — baseline before any changes)
  └──> V-0.5  (production-code changes only, test files unchanged — agent input)
         └──> V0    (full commit, including the developer's test updates — ground truth)
  • V-1 — the parent commit; serves as the baseline.
  • V-0.5 — only production-code changes applied; test files are left unchanged. This is the state the coding agent sees, simulating the real-world scenario in which a developer has committed code changes but has not yet updated the tests.
  • V0 — the full commit, including the developer's actual test modifications; serves as ground truth.
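
For illustration, the three versions of a task could be resolved programmatically with GitPython (already a project dependency). This is a sketch, not the repository's actual API; the filtered/<hash> branch-name convention is inferred from the dataset format shown later:

from git import Repo

def resolve_versions(repo_path: str, commit_sha: str):
    """Map a task's commit onto its three-version structure."""
    repo = Repo(repo_path)
    v0 = repo.commit(commit_sha)                   # full commit (ground truth)
    v_minus_1 = v0.parents[0]                      # parent commit (baseline)
    v_half_branch = f"filtered/{commit_sha[:8]}"   # source-only branch (agent input)
    return v_minus_1.hexsha, v_half_branch, v0.hexsha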

Benchmark Construction Pipeline

TEBench is constructed through a four-stage filtering pipeline over the 17 Defects4J projects:

  1. Project Source — Start from Defects4J; exclude 3 projects that do not use Maven, leaving 14 Maven-based projects (67,670 commits).
  2. Static Filtering — Date filter (post-2016, post-2019 for Java 8+ projects), co-modification of src/main/ and src/test/, method-body-level AST changes via javalang. Reduces to 6,169 commits from 14 projects.
  3. Execution-Based Validation — Build two isolated worktrees per commit (V-0.5 and V0), run the test suite, and collect line/branch coverage via JaCoCo. Exclude build failures, non-functional test changes, and commits whose test changes lack a verifiable causal relationship with the production change. Reduces to 561 commits from 12 projects.
  4. Quality Filtering — Exclude merge commits, constrain test-change size to 5 – 200 lines, and deduplicate by (project, ClassName.methodName). Final dataset: 314 task instances from 10 projects.
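
As a minimal sketch, the stage-2 co-modification check could be written with GitPython as follows; the repository's actual filter logic lives in modules/commit_filter.py and may differ:

from git import Repo

def co_modifies_src_and_test(repo: Repo, sha: str) -> bool:
    """True if the commit touches both production and test code."""
    commit = repo.commit(sha)
    if not commit.parents:          # skip root commits
        return False
    diffs = commit.parents[0].diff(commit)
    paths = [d.b_path or d.a_path for d in diffs]
    return (any("src/main/" in p for p in paths)
            and any("src/test/" in p for p in paths))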

Evaluation Framework

Identification Metrics

The identification stage measures whether the agent correctly locates the tests that require attention, compared against the developer ground truth:

  • Method-level granularity for modified / deleted test methods: a true positive requires the agent to modify or delete the same test method as in the GT.
  • File-level granularity for newly added test methods: a true positive requires the agent to add at least one new test method in the same file where the GT adds methods.

Results are reported as Precision, Recall, and F1.
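
As a minimal sketch of how the two matching rules combine into these scores (function and variable names are illustrative, not the repository's actual API):

def identification_scores(pred_methods, gt_methods, pred_add_files, gt_add_files):
    """Precision / Recall / F1 under TEBench's two-granularity matching."""
    # Method-level TPs: the agent modifies/deletes the exact GT test method.
    tp = len(set(pred_methods) & set(gt_methods))
    # File-level TPs: the agent adds a new test method in a file where GT adds one.
    tp += len(set(pred_add_files) & set(gt_add_files))
    n_pred = len(pred_methods) + len(pred_add_files)
    n_gt = len(gt_methods) + len(gt_add_files)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1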

Update Metrics

The update stage evaluates three dimensions, designed around the developer-written GT as a reference for evolution intent rather than as an absolute oracle.

Executability (s_exec):

0    — if compilation fails
0.5  — if compilation succeeds but tests fail
1    — if all tests pass

Coverage Overlap (s_line, s_branch): how well the agent's tests cover the same lines / branches as the GT tests, restricted to production methods modified by the commit:

s_line   = |C_line(agent)   ∩ C_line(gt)|   / |C_line(gt)|
s_branch = |C_branch(agent) ∩ C_branch(gt)| / |C_branch(gt)|

Modification Similarity (s_mod): token-level Jaccard similarity between agent and GT test modifications:

s_mod = |tokens(agent) ∩ tokens(gt)| / |tokens(agent) ∪ tokens(gt)|

Composite Score:

s_update = s_exec × (0.3·s_line + 0.3·s_branch + 0.4·s_mod)   if GT has coverage change
s_update = s_exec × s_mod                                       if GT has no coverage change
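
A minimal sketch of the composite computation under these definitions, taking Python sets of covered lines, covered branches, and edit tokens as inputs; treating a nonempty GT coverage set over the changed methods as "GT has coverage change" is an assumption of this sketch:

def update_score(compiled, tests_pass, agent_lines, gt_lines,
                 agent_branches, gt_branches, agent_tokens, gt_tokens):
    """Composite s_update per the formulas above (illustrative sketch)."""
    # Executability gate: 0 (compile failure) / 0.5 (tests fail) / 1 (all pass)
    s_exec = 1.0 if compiled and tests_pass else (0.5 if compiled else 0.0)
    # Token-level Jaccard similarity between agent and GT test modifications
    union = agent_tokens | gt_tokens
    s_mod = len(agent_tokens & gt_tokens) / len(union) if union else 0.0
    if gt_lines or gt_branches:  # assumption: GT has a coverage change
        s_line = len(agent_lines & gt_lines) / len(gt_lines) if gt_lines else 0.0
        s_branch = len(agent_branches & gt_branches) / len(gt_branches) if gt_branches else 0.0
        return s_exec * (0.3 * s_line + 0.3 * s_branch + 0.4 * s_mod)
    return s_exec * s_mod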

Evaluated Systems (paper Table 4)

We evaluate eight systems: a heuristic baseline and seven LLM-based configurations that vary along two axes, spanning three industrial agent frameworks and six base models.

| Agent Framework | Base Model | Version |
| --- | --- | --- |
| Heuristic Baseline | – | – |
| Claude Code | Claude Sonnet 4.6 | v2.1.45 |
| Codex CLI | ChatGPT 5.3 Codex | v0.114.0 |
| OpenCode | Claude Sonnet 4.6 | v1.2.16 |
| OpenCode | Qwen3.5 | v1.2.16 |
| OpenCode | GLM-5 | v1.2.16 |
| OpenCode | Kimi-K2.5 | v1.2.16 |
| OpenCode | DeepSeek-V3.2 | v1.2.16 |

OpenCode is the framework used to swap in the four open-source backbones (Qwen3.5, GLM-5, Kimi-K2.5, DeepSeek-V3.2) under an identical prompt and execution protocol. Use baseline/opencode/scripts/multi_model_runner.py to reproduce all OpenCode configurations in one invocation.

Results Highlights (from the paper)

Identification (Table 5). All seven LLM-based configurations cluster within a 3.7-point F1 band overall (45.7 – 49.4%), with backbone variation contributing ≈ 3.6 F1 points and framework variation only ≈ 1.2 points. The bottleneck lies in the inherent task difficulty rather than in any specific configuration.

| Configuration | Overall F1 | Breaking F1 | Stale F1 | Missing F1 |
| --- | ---: | ---: | ---: | ---: |
| Heuristic Baseline | 4.0 | 3.3 | 2.0 | 2.0 |
| Claude Code | 47.1 | 59.6 | 35.0 | 54.1 |
| Codex CLI | 49.4 | 69.4 | 37.4 | 54.5 |
| OpenCode (Sonnet) | 48.3 | 66.8 | 35.6 | 51.2 |
| OpenCode (Qwen) | 48.2 | 70.4 | 36.0 | 54.3 |
| OpenCode (GLM) | 49.3 | 71.0 | 37.1 | 53.9 |
| OpenCode (Kimi) | 49.4 | 60.7 | 35.8 | 51.3 |
| OpenCode (DeepSeek) | 45.7 | 58.6 | 33.4 | 50.0 |

Test-Stale is the most challenging type (avg F1 ≈ 35.8%): stale tests still pass on the updated code, so no execution-failure signal is available, and configurations must rely on proactive semantic reasoning. Configurations exhibit a systematic Recall-over-Precision imbalance (mean gap 13.7 points), indicating a shared inductive bias toward over-prediction.

Update (Table 6). Composite update scores cluster within an 8.7-point band (63.6 – 72.3%). Executability stays consistently high (87.7 – 99.2%) yet exceeds modification similarity by 33.7 – 48.9 percentage points, indicating that producing executable tests is far easier than producing tests that align with developer intent.

| Configuration | Overall OA | Breaking OA | Stale OA | Missing OA |
| --- | ---: | ---: | ---: | ---: |
| Claude Code | 70.5 | 73.2 | 68.5 | 63.8 |
| Codex CLI | 72.3 | 76.6 | 70.8 | 65.7 |
| OpenCode (Sonnet) | 68.9 | 73.8 | 65.4 | 62.6 |
| OpenCode (Qwen) | 67.0 | 73.3 | 62.8 | 59.1 |
| OpenCode (GLM) | 69.3 | 74.2 | 66.7 | 62.2 |
| OpenCode (Kimi) | 63.6 | 68.3 | 59.8 | 56.0 |
| OpenCode (DeepSeek) | 64.5 | 70.4 | 60.9 | 55.4 |

The type-wise ranking on the update task (Breaking > Stale > Missing by score) reverses the relative order of Stale and Missing observed at identification: once located, Breaking and Stale tests need only targeted assertion edits, whereas Missing requires generating entirely new code, which naturally yields lower similarity to the GT.

Repository Structure

TEBench/
├── config.py                           # Global configuration
├── main.py                             # Phase 1: dataset building / commit filtering
├── analysis.py                         # Phase 2: analysis tool entry point
├── generate_filtered_versions.py       # Phase 3: generate V-0.5 (filtered) branches
├── batch_worktree_builder.py           # Phase 4: bulk-build per-task worktrees
├── evaluate.py                         # Top-level evaluation entry point
├── evaluate_user_identification.py     # Identification-stage evaluation
├── extract_gt_changes.py               # GT extraction helper
├── compare_identification.py           # Per-config identification comparison
├── run_analysis.sh                     # Shell wrapper for batch project analysis
├── requirements.txt                    # Python dependencies
│
├── modules/                            # Core pipeline modules
│   ├── git_analyzer.py                 # Git operations
│   ├── code_analyzer.py                # Java AST parsing
│   ├── change_detector.py              # Method-level change detection
│   ├── maven_executor.py               # Maven build / test execution
│   ├── coverage_analyzer.py            # JaCoCo coverage analysis
│   ├── commit_filter.py                # Commit filtering logic
│   ├── dataset_generator.py            # Dataset export
│   ├── diff_filter.py                  # Source / test diff separation
│   ├── filtered_version_generator.py   # V-0.5 / T-0.5 branch generation
│   ├── isolated_executor.py            # Isolated worktree executor
│   └── commit_classifier.py            # Commit type classifier
│
├── analysis/                           # Analysis sub-package
│   ├── project_analyzer.py             # Per-project analysis orchestration
│   ├── commit_analyzer.py              # Per-commit analysis
│   ├── cache_manager.py                # Result caching
│   ├── report_generator.py             # JSON / Markdown report generation
│   ├── filter_commits.py               # Step 1 commit filtering
│   ├── filter_commits_step2.py         # Step 2 fine-grained filtering
│   ├── classify_changes.py             # Step 3 evolution-type classification
│   └── diagnose_projects.py            # Per-project elimination diagnostics
│
├── update_evaluation/                  # Update-quality evaluation
│   ├── evaluation_orchestrator.py      # Evaluation orchestration
│   ├── executability_evaluator.py      # Compile + test pass evaluation
│   ├── coverage_increment_analyzer.py  # Coverage overlap analysis
│   ├── modification_effort_calculator.py # Modification similarity (Jaccard)
│   ├── changed_method_extractor.py     # Changed method extraction
│   └── worktree_manager.py             # Worktree lifecycle management
│
├── identify_evaluation/                # Identification evaluation
│   ├── gt_extractor.py                 # Ground-truth test-change extractor
│   ├── example_predicted_format.json   # Example prediction format
│   └── README.md                       # Module documentation
│
├── baseline/                           # Coding-agent baselines (paper Table 4)
│   ├── shared_test_update_prompt.py    # Unified prompt across all agents
│   ├── claude-code/scripts/            # Claude Code runner
│   ├── codex/scripts/                  # Codex CLI runner
│   └── opencode/                       # OpenCode runners (5 backbones)
│       ├── README.md
│       └── scripts/
│           ├── batch_opencode_runner.py    # Single-backbone batch run
│           ├── multi_model_runner.py       # Multi-backbone runner (all 5 OpenCode configs)
│           ├── batch_evaluate_worktrees_from_csv.py
│           └── evaluate_opencode_results.py

Usage

Prerequisites

  • Python 3.8+
  • Java 8+ (JDK)
  • Maven 3.x
  • Git
  • Optional: an OpenCode install for the multi-backbone runner
Install the Python dependencies:

pip install -r requirements.txt

Step 1 — Configure

Edit config.py to set the repository path and filter parameters:

class Config:
    REPO_PATH = "/path/to/your/java-project"  # target Java repository to mine
    DATE_FILTER = "2016-01-01"                # ignore commits before this date
    COVERAGE_THRESHOLD = 0.5                  # coverage threshold applied during filtering

Step 2 — Initial Filtering (Static + Execution)

python main.py /path/to/java-project

Generates output/dataset.json containing all qualified commits.

Step 3 — Generate Filtered Versions (V-0.5)

python generate_filtered_versions.py output/dataset.json output/filtered_dataset.json

Creates filtered/* and test-only/* git branches for each qualified commit.

Step 4 — Analysis and Classification

# Analyze a single project
python analysis.py --project /path/to/commons-csv

# Analyze all projects in a directory
python analysis.py --projects-dir /path/to/defects4j-projects --workers 4

# Quick scan (file-level only)
python analysis.py --project /path/to/project --phase quick

# Resume a previous run
python analysis.py --project /path/to/project --resume

# Filter by date
python analysis.py --project /path/to/project --since 2020-01-01

Output is written to output/analysis/<project_name>/.

Step 5 — Run Coding Agents

The baseline/ directory provides ready-to-use runners for the seven LLM-based configurations evaluated in the paper.

# Reproduce all five OpenCode rows of Table 4 in one invocation
python baseline/opencode/scripts/multi_model_runner.py \
  --input  /path/to/worktree_records.xlsx \
  --output /path/to/multi_model_results \
  --models claude-sonnet-4-6 qwen-3.5 glm-5 kimi-k2.5 deepseek-v3.2 \
  --workers 2 --status ready

The Claude Code and Codex CLI configurations live under baseline/claude-code/ and baseline/codex/ respectively. All three frameworks share the unified prompt in baseline/shared_test_update_prompt.py.

Step 6 — Identification Evaluation

python identify_evaluation/gt_extractor.py \
  --input  /path/to/worktree_records.csv \
  --output identify_evaluation/gt_changes_all.json

See identify_evaluation/README.md for full details.

Step 7 — Update-Quality Evaluation

python baseline/opencode/scripts/evaluate_opencode_results.py \
  -r /path/to/multi_model_results/<model_name> \
  -w /path/to/worktree_records.xlsx \
  -p /path/to/defects4j-projects \
  -o evaluation_<model_name>.json --verbose

update_evaluation/evaluation_orchestrator.py is the underlying engine and can also be called directly. It computes s_exec, s_line, s_branch, s_mod, and the composite s_update score per task.

Dataset Format

filtered_dataset.json

{
  "metadata": {
    "source_dataset": "output/dataset.json",
    "total_processed": 130,
    "source_only": {
      "successful": 125,
      "failed": {"apply_patch": 0, "compilation": 5, "other": 0},
      "success_rate": "96.15%"
    },
    "test_only": {
      "successful": 123,
      "failed": {"apply_patch": 0, "compilation": 7, "other": 0},
      "success_rate": "94.62%"
    }
  },
  "commits": [
    {
      "original_commit": "d93c4940...",
      "parent_commit": "c36d6cde...",
      "author": "...",
      "date": "2025-03-15 04:29:53",
      "message": "...",
      "changed_files": {
        "test_files": ["..."],
        "source_files": ["..."],
        "other_files": []
      },
      "changed_methods": {
        "test_methods": [...],
        "source_methods": [...]
      },
      "coverage_analysis": {...},
      "filtered_version": {
        "success": true,
        "filtered_commit_hash": "ab0f7745...",
        "branch_name": "filtered/d93c4940"
      },
      "test_only_version": {
        "success": true,
        "test_only_commit_hash": "5e5d1c2a...",
        "branch_name": "test-only/d93c4940"
      }
    }
  ]
}
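
For example, the qualified commits can be iterated directly from this file (field names as in the schema above):

import json

with open("output/filtered_dataset.json") as f:
    dataset = json.load(f)

for entry in dataset["commits"]:
    fv = entry["filtered_version"]
    if fv["success"]:  # the V-0.5 branch was generated successfully
        print(entry["original_commit"][:8], "->", fv["branch_name"])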

Technical Implementation

Diff Filtering Algorithm

The diff filter separates source and test changes from a single commit diff using regex-based parsing of the unified diff format:

  1. Split diff by file (diff --git markers).
  2. Classify each file as test (src/test/) or source (src/main/).
  3. Reconstruct source-only and test-only patch files.
  4. Apply patches independently to create V-0.5 and T-0.5 versions.
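
A minimal sketch of steps 1 – 3 (the actual implementation is modules/diff_filter.py; this version classifies each chunk from its file header line only):

import re

def split_source_and_test_patches(unified_diff: str):
    """Split a unified diff at 'diff --git' markers and rebuild
    source-only and test-only patch files."""
    chunks = re.split(r"(?m)^(?=diff --git )", unified_diff)
    source_chunks, test_chunks = [], []
    for chunk in chunks:
        if not chunk.startswith("diff --git"):
            continue  # preamble before the first file header
        header = chunk.splitlines()[0]
        if "src/test/" in header:
            test_chunks.append(chunk)
        elif "src/main/" in header:
            source_chunks.append(chunk)
    return "".join(source_chunks), "".join(test_chunks)

Applying the source-only patch to V-1 then yields V-0.5, and the test-only patch yields T-0.5.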

Analysis Pipeline Phases

| Phase | Description |
| --- | --- |
| quick | File-level scan only: identify commits that co-modify test and source files |
| method | AST-level method-change analysis, without test execution |
| full | Complete pipeline including isolated build, test execution, and JaCoCo coverage collection |

Execution Environment

For each task instance we construct an isolated execution environment based on the V-0.5 version: the source-only branch is checked out into a dedicated Git worktree, giving the agent the updated production code alongside the original, not-yet-updated tests. Worktrees provide full filesystem isolation between tasks while sharing only the read-only Git object store with the main repository, which is significantly more lightweight than provisioning a separate Docker container per project. A Docker image bundling all project environments and evaluation scripts is provided in the replication package for full reproducibility.
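
A sketch of the per-task worktree lifecycle via GitPython's pass-through command interface (paths and branch names are illustrative; the repository's own logic is in update_evaluation/worktree_manager.py):

from git import Repo

def add_task_worktree(repo_path: str, branch: str, worktree_dir: str) -> None:
    """Attach an isolated worktree on the task's V-0.5 branch.
    The worktree shares the read-only object store with the main repo."""
    Repo(repo_path).git.worktree("add", worktree_dir, branch)

def remove_task_worktree(repo_path: str, worktree_dir: str) -> None:
    """Detach and delete the worktree after evaluation."""
    Repo(repo_path).git.worktree("remove", "--force", worktree_dir)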

Dependencies

GitPython==3.1.40
javalang==0.13.0
lxml==5.1.0

Notes

  1. Repository state: ensure the Git repository is clean (no uncommitted changes) before running.
  2. JDK version: ensure a compatible JDK is installed for the target project.
  3. Maven: must be available on PATH or configured via AnalysisConfig.MAVEN_EXECUTABLE.
  4. Disk space: generated branches (filtered/*, test-only/*) and per-model worktree copies consume additional disk space.
  5. Worktrees: analysis.py uses temporary git worktrees that are automatically cleaned up.

Cleaning Up Generated Branches

cd /path/to/your/java-project
git branch | grep "filtered/"  | xargs git branch -D
git branch | grep "test-only/" | xargs git branch -D

Research Applications

TEBench is designed to support research in:

  • Test evolution — how test suites co-evolve with production code changes.
  • Automated test repair — detecting and fixing breaking or stale tests.
  • Test generation — producing new tests for uncovered behavior introduced by commits.
  • Coverage analysis — measuring coverage impact of code changes.
  • Coding-agent evaluation — benchmarking LLM-based agents on project-level software engineering tasks.
