Add evaluation analysis scripts #605

Merged
djukicmilica merged 69 commits into main from private/milicadjukic/BCBenchScript2
Apr 15, 2026

@djukicmilica djukicmilica commented Apr 7, 2026

New script: Get-WorkflowSummary.ps1

PowerShell script that fetches evaluation workflow run summaries from GitHub Actions, downloads JSONL artifacts (including from nested zips), and optionally copies them into a stable output folder for analysis.
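The nested-zip handling can be illustrated with a short Python sketch. This is not the PowerShell implementation from the PR; the function name `extract_jsonl`, the `prefix` naming scheme, and the flat output layout are all assumptions made for illustration.

```python
import io
import zipfile
from pathlib import Path


def extract_jsonl(zip_bytes: bytes, out_dir: Path, prefix: str = "") -> list[Path]:
    """Write every .jsonl member of a zip to out_dir, recursing into nested zips.

    Hypothetical helper: the real script's naming and layout may differ.
    """
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            name = Path(info.filename).name
            if name.lower().endswith(".zip"):
                # Nested archive: recurse and tag outputs with the inner zip's stem
                # so files from different archives do not overwrite each other.
                inner_prefix = f"{prefix}{Path(name).stem}__"
                written += extract_jsonl(data, out_dir, inner_prefix)
            elif name.lower().endswith(".jsonl"):
                target = out_dir / f"{prefix}{name}"
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(data)
                written.append(target)
    return written
```

Working on in-memory bytes avoids unpacking intermediate archives to disk, which is reasonable for small artifacts; very large artifacts would probably call for streaming to temporary files instead.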

New script: bcbench_analyze_artifacts.py

Python script for offline analysis of downloaded BC-Bench artifacts. Supports ZIP and pre-extracted input modes. Produces summary CSVs, top failures, grouped errors, and extracts generated test code/patches per test ID.
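As a rough sketch of the summary step, the snippet below reads JSONL records, builds summary rows, and counts failures per test ID. The field names `instance_id`, `resolved`, and `error` are assumptions and may not match the real artifact schema; `summarize` and `to_csv` are hypothetical helpers, not functions from the script.

```python
import csv
import io
import json
from collections import Counter

FIELDS = ["instance_id", "passed", "error"]


def summarize(jsonl_lines):
    """Return (rows, top_failures) from an iterable of JSONL strings."""
    rows = []
    failures = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumed field names; adjust to the actual artifact schema.
        test_id = record.get("instance_id", "")
        passed = bool(record.get("resolved", False))
        error = record.get("error") or ""
        rows.append({"instance_id": test_id, "passed": passed, "error": error})
        if not passed:
            failures[test_id] += 1
    # most_common() sorts test IDs by failure count, highest first.
    return rows, failures.most_common()


def to_csv(rows):
    """Render summary rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Feeding the lines of several runs into `summarize` at once makes the failure count per test ID a simple measure of which tests fail most often across runs.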

New script: group_errors_from_summary.py

Python script that groups errors from a summary CSV into high-level categories (tests passed pre-patch, failed post-patch, build failures, etc.).
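One way such grouping could work is an ordered list of pattern rules, where the first match decides the category. The category names below follow the description above, but the regular expressions, the `other` fallback, and the helper names are illustrative assumptions rather than the script's actual rules.

```python
import re
from collections import Counter

# Ordered rules: the first matching pattern wins.
# Patterns are illustrative guesses, not the real script's rules.
RULES = [
    ("tests passed pre-patch", re.compile(r"passed (before|pre)[- ]patch", re.I)),
    ("tests failed post-patch", re.compile(r"failed (after|post)[- ]patch", re.I)),
    ("build failure", re.compile(r"build|compil", re.I)),
]


def categorize(message: str) -> str:
    """Map one error message to a high-level category name."""
    for name, pattern in RULES:
        if pattern.search(message):
            return name
    return "other"


def group_errors(messages):
    """Count non-empty error messages per category."""
    return Counter(categorize(m) for m in messages if m)
```

The messages could come from the error column of the summary CSV, for example via `csv.DictReader`; the column name would depend on how the summary was written.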

AB#626728

ventselartur and others added 30 commits December 19, 2025 19:00
@djukicmilica djukicmilica requested a review from haoranpb April 7, 2026 15:44
@djukicmilica djukicmilica marked this pull request as ready for review April 7, 2026 15:45

@ventselartur ventselartur Apr 8, 2026

You cannot check in the change to the config.yaml file. I would suggest separating the changes to the scripts from AlTest.agent.md. The latter should be run at least 5 times to see our score on BC-Bench, which should be higher than that of the existing version of the AL test agent.

The config.yaml change is OK for now; you'll need it to run things.

But do separate the changes for the scripts.

@ventselartur ventselartur Apr 8, 2026

We need to run this at least 5 times to see if this is going to perform better than the existing AL test agent. Let's not push that change to master yet.

haoranpb commented Apr 9, 2026

Run 1 completed: https://github.com/microsoft/BC-Bench/actions/runs/24083046010

Run 2 in progress: https://github.com/microsoft/BC-Bench/actions/runs/24148965983

@djukicmilica djukicmilica changed the title Enable ALTest agent and add evaluation analysis scripts Add evaluation analysis scripts Apr 9, 2026
ventselartur previously approved these changes Apr 9, 2026

@haoranpb haoranpb left a comment

Can you please move those scripts into a tools/altest folder instead (you will have to create those folders)? And leave a README under the tools folder saying those scripts are designed for downloading and analyzing GitHub Actions artifacts.

The scripts folder is designed for PowerShell scripts used in environment setup, etc.

And also mention it in:

- **Dataset**: Benchmark entries following SWE-Bench schema with BC-specific adjustments
- **Python Package** (`src/bcbench/`): CLI tools, agent implementations, and validation utilities
- **PowerShell Scripts** (`scripts/`): Environment setup and dataset verification using AL-GO/BCContainerHelper
- **Agent Evaluations**: Focuses on GitHub Copilot CLI and Claude Code
- **Experiments**: MCP Servers, custom instructions, custom agents, skills, etc. and their performance on the benchmark
- **Notebooks** (`notebooks/`): Analysis and visualization of benchmark results

@djukicmilica djukicmilica enabled auto-merge (squash) April 14, 2026 21:12
@djukicmilica djukicmilica disabled auto-merge April 14, 2026 21:12
@djukicmilica

> Can you please move those scripts into a tools/altest folder instead (you will have to create those folders)? And leave a README under the tools folder saying those scripts are designed for downloading and analyzing GitHub Actions artifacts.
>
> The scripts folder is designed for PowerShell scripts used in environment setup, etc.
>
> And also mention it in:
>
> - **Dataset**: Benchmark entries following SWE-Bench schema with BC-specific adjustments
> - **Python Package** (`src/bcbench/`): CLI tools, agent implementations, and validation utilities
> - **PowerShell Scripts** (`scripts/`): Environment setup and dataset verification using AL-GO/BCContainerHelper
> - **Agent Evaluations**: Focuses on GitHub Copilot CLI and Claude Code
> - **Experiments**: MCP Servers, custom instructions, custom agents, skills, etc. and their performance on the benchmark
> - **Notebooks** (`notebooks/`): Analysis and visualization of benchmark results

The changes have been made; please check.

@djukicmilica djukicmilica requested a review from haoranpb April 14, 2026 21:12
@djukicmilica djukicmilica merged commit c362cd1 into main Apr 15, 2026
6 checks passed
@djukicmilica djukicmilica deleted the private/milicadjukic/BCBenchScript2 branch April 15, 2026 08:30

4 participants