Add evaluation analysis scripts by djukicmilica · Pull Request #605 · microsoft/BC-Bench

djukicmilica · 2026-04-07T15:42:45Z

New script: Get-WorkflowSummary.ps1

PowerShell script that fetches evaluation workflow run summaries from GitHub Actions, downloads JSONL artifacts (including from nested zips), and optionally copies them into a stable output folder for analysis.
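For readers who want to see the nested-zip handling in isolation, here is a minimal Python sketch of the idea. It is not the PowerShell script's implementation: the function names `iter_jsonl_records` and `build_demo_zip` and the `outer.zip!inner.jsonl` path notation are assumptions made for illustration, and it works on an archive already held in memory rather than one fetched from GitHub Actions.

```python
import io
import json
import zipfile
from typing import Dict, Iterator, Tuple


def iter_jsonl_records(zip_bytes: bytes, prefix: str = "") -> Iterator[Tuple[str, Dict]]:
    """Yield (member_path, record) for each JSONL line, descending into nested zips."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            path = prefix + name
            lower = name.lower()
            if lower.endswith(".zip"):
                # Nested archive: read it into memory and recurse.
                yield from iter_jsonl_records(archive.read(name), prefix=path + "!")
            elif lower.endswith(".jsonl"):
                # utf-8-sig tolerates an optional byte-order mark.
                text = archive.read(name).decode("utf-8-sig")
                for line in text.splitlines():
                    if line.strip():
                        yield path, json.loads(line)


def build_demo_zip() -> bytes:
    """Build a small in-memory archive (hypothetical layout) for trying the function."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as zf:
        zf.writestr("results.jsonl", '{"id": "b"}\n{"id": "c"}\n')
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as zf:
        zf.writestr("top.jsonl", '{"id": "a"}\n')
        zf.writestr("inner.zip", inner.getvalue())
    return outer.getvalue()


if __name__ == "__main__":
    for member_path, record in iter_jsonl_records(build_demo_zip()):
        print(member_path, record)
```

Reading nested archives into memory keeps the sketch short; for large artifacts, extracting to a temporary folder would likely be the safer choice.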

New script: bcbench_analyze_artifacts.py

Python script for offline analysis of downloaded BC-Bench artifacts. Supports ZIP and pre-extracted input modes. Produces summary CSVs, top failures, grouped errors, and extracts generated test code/patches per test ID.
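As a rough illustration of the summary-CSV and top-failures outputs described above, the following sketch shows one way such a summary could be produced. The record fields `instance_id`, `resolved`, and `error` are assumptions chosen for the example, not the artifact schema used by `bcbench_analyze_artifacts.py`, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, List, Tuple

FIELDS = ["instance_id", "resolved", "error"]  # assumed column names


def summarize(records: Iterable[Dict]) -> List[Dict]:
    """Reduce raw result records to one flat row per test ID."""
    return [
        {
            "instance_id": rec.get("instance_id", ""),
            "resolved": bool(rec.get("resolved", False)),
            "error": (rec.get("error") or "").strip(),
        }
        for rec in records
    ]


def to_csv(rows: List[Dict]) -> str:
    """Render the summary rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def top_failures(rows: List[Dict], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent error messages among unresolved rows."""
    counts = Counter(r["error"] for r in rows if not r["resolved"] and r["error"])
    return counts.most_common(n)


if __name__ == "__main__":
    demo = [
        {"instance_id": "t1", "resolved": True},
        {"instance_id": "t2", "resolved": False, "error": "build failed"},
        {"instance_id": "t3", "resolved": False, "error": "build failed"},
        {"instance_id": "t4", "resolved": False, "error": "timeout"},
    ]
    rows = summarize(demo)
    print(to_csv(rows))
    print(top_failures(rows))
```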

New script: group_errors_from_summary.py

Python script that groups errors from a summary CSV into high-level categories (tests passed pre-patch, failed post-patch, build failures, etc.).
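The grouping step could look roughly like the sketch below. The category names follow the description above, but the keyword rules in `CATEGORY_RULES` and the function names are assumptions for illustration only; the actual matching logic in `group_errors_from_summary.py` may differ.

```python
from collections import Counter
from typing import Iterable

# Hypothetical keyword rules: the first category with a matching substring wins.
CATEGORY_RULES = [
    ("build failure", ("build failed", "compilation error")),
    ("tests passed pre-patch", ("passed pre-patch", "passed before patch")),
    ("tests failed post-patch", ("failed post-patch", "failed after patch")),
]


def categorize(message: str) -> str:
    """Map a raw error message to a high-level category, or 'other'."""
    text = message.lower()
    for category, needles in CATEGORY_RULES:
        if any(needle in text for needle in needles):
            return category
    return "other"


def group_errors(messages: Iterable[str]) -> Counter:
    """Count how many messages fall into each category."""
    return Counter(categorize(m) for m in messages)


if __name__ == "__main__":
    demo = [
        "Build failed: AL0118",
        "Tests passed pre-patch",
        "Tests failed post-patch",
        "Container did not start",
    ]
    for category, count in group_errors(demo).items():
        print(category, count)
```

An ordered rule list keeps the precedence explicit: a message that matches several rules is counted once, under the first matching category.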

AB#626728


ventselartur · 2026-04-08T22:03:39Z

You cannot check in the change to config.yaml. I would suggest separating the script changes from AlTest.agent.md. The latter should be run at least 5 times to see our score on BC-Bench, and it should score higher than the existing version of the AL test agent.

config.yaml change is ok for now, you'll need it to run things.

But do separate the changes for the scripts

ventselartur · 2026-04-08T22:04:55Z

We need to run this at least 5 times to see if this is going to perform better than the existing AL test agent. Let's not push that change to master yet.

haoranpb · 2026-04-09T07:08:16Z

Run 1 completed: https://github.com/microsoft/BC-Bench/actions/runs/24083046010

Run 2 in progress: https://github.com/microsoft/BC-Bench/actions/runs/24148965983

haoranpb

Can you please move those scripts into a tools/altest folder instead (you will have to create those folders)? And leave a README under the tools folder saying those scripts are designed for downloading and analyzing GitHub Actions artifacts.

The scripts folder is designed for PowerShell scripts used in environment setup, etc.

And also mention it in

BC-Bench/.github/copilot-instructions.md

Lines 5 to 10 in 143a74e

    
- **Dataset**: Benchmark entries following SWE-Bench schema with BC-specific adjustments
- **Python Package** (`src/bcbench/`): CLI tools, agent implementations, and validation utilities
- **PowerShell Scripts** (`scripts/`): Environment setup and dataset verification using AL-GO/BCContainerHelper
- **Agent Evaluations**: Focuses on GitHub Copilot CLI and Claude Code
- **Experiments**: MCP Servers, custom instructions, custom agents, skills, etc. and their performance on the benchmark
- **Notebooks** (`notebooks/`): Analysis and visualization of benchmark results

djukicmilica · 2026-04-14T21:12:40Z

> Can you please move those scripts into tools/altest folder (you will have to create those folders) instead? And leave a README under tools folder saying those scripts are designed for downloading and analyzing GitHub Action artifacts
>
> the scripts folder is designed for powershell scripts used in environment setup, etc.
>
> And also mention it in BC-Bench/.github/copilot-instructions.md, lines 5 to 10 in 143a74e


The modifications have been done, please check.

ventselartur and others added 30 commits December 19, 2025 19:00

update AL test agent and enable it in config.yaml

27c252f

shorten the dataset in my private branch with only items fail with AL test agent but pass with general agent

e61a576

agent without libraries and al mcp

6fab1ee

al test minimal

924de54

revert bcbench.jsonl to get full dataset

c902dd2

update ALTestMinimal

e6f8731

merge from main

1712667

improve instructions with LLM

ec41ee4

improve instructions with LLM

6da5dc0

Get-WorkFlowSummary.ps1 baseline

99ca30e

cherry pick Sasha additional logging

665594f

keep only first-party apps in dataset/bcbench.jsonl

8430964

merge from main

d9009b4

fix scripts which accept only dataset entries with tests

a542317

add NAV Prs with first-party apps

c9903b4

disable the agent on config.yaml

f3d16dd

remove failed PR from the dataset and add area for all new entries

a79e3ee

remove entries from dataset where app cannot compile with the fix

0ec3e3c

introduce one more version of the agent with special hints for handlers and table relation

313d0bb

merge from master

41ed3a0

update

8ff082f

remove two commits which do not work for BC bench

b1fa5ff

revert changes in bcbench.jsonl to get the original one

5cdc3f7

update instructions for AL test agent

ee3f7f9

merge from main

92829d4

merge from main

009badb

revert changes to Python scripts

3be7aff

remove unnecessary problem statements

806c343

revert changes to Python scripts

70425c4

revert change to scripts/AppUtils.psm1

cfe0324

MilicaDjukic added 3 commits April 7, 2026 17:31

delete unused script

f490eff

removed workflows

f14a39c

Merge branch 'main' into private/milicadjukic/BCBenchScript2

dfdf006

djukicmilica requested a review from haoranpb April 7, 2026 15:44

djukicmilica marked this pull request as ready for review April 7, 2026 15:45

scripts updated

cec9215

djukicmilica requested review from AleksanderGladkov and ventselartur April 8, 2026 10:18

ventselartur reviewed Apr 8, 2026


revert changes for al test agent

3d38770

djukicmilica changed the title ~~Enable ALTest agent and add evaluation analysis scripts~~ Add evaluation analysis scripts Apr 9, 2026

djukicmilica requested a review from ventselartur April 9, 2026 14:25

ventselartur previously approved these changes Apr 9, 2026


AleksanderGladkov previously approved these changes Apr 9, 2026


haoranpb requested changes Apr 10, 2026


MilicaDjukic added 2 commits April 14, 2026 22:58

Merge branch 'main' into private/milicadjukic/BCBenchScript2

25e58d1

PR review

d09b91b

djukicmilica dismissed stale reviews from ventselartur and AleksanderGladkov via d09b91b April 14, 2026 21:07

PR review 2

728ebbb

djukicmilica enabled auto-merge (squash) April 14, 2026 21:12

djukicmilica disabled auto-merge April 14, 2026 21:12

djukicmilica requested a review from haoranpb April 14, 2026 21:12

Merge branch 'main' into private/milicadjukic/BCBenchScript2

f5a0bbd

haoranpb approved these changes Apr 15, 2026


djukicmilica merged commit c362cd1 into main Apr 15, 2026
6 checks passed

djukicmilica deleted the private/milicadjukic/BCBenchScript2 branch April 15, 2026 08:30

djukicmilica merged 69 commits into main from private/milicadjukic/BCBenchScript2

4 participants
