nirinchev
Collaborator

Proposed changes

Fixes several false negatives in the accuracy tests. It also adds the ability to mark expected tool calls as optional, to handle cases where the LLM may or may not call a given tool.

Based on #621.

@nirinchev nirinchev requested a review from a team as a code owner October 14, 2025 11:35

@nirinchev nirinchev force-pushed the ni/accuracy-test-fixes branch from b0079de to 0e14c14 Compare October 14, 2025 11:44
Base automatically changed from ni/create-vector-index to main October 15, 2025 11:45
@Copilot Copilot AI review requested due to automatic review settings October 15, 2025 12:08
@nirinchev nirinchev force-pushed the ni/accuracy-test-fixes branch from 0e14c14 to 8ec9c59 Compare October 15, 2025 12:08
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes false negatives in the accuracy tests by introducing optional tool call support and updating test expectations. The main change is that certain tool calls (such as atlas-list-projects and list-databases) can be marked as optional, meaning an LLM may or may not invoke them depending on context.

Key changes:

  • Added optional field to ExpectedToolCall type to mark tool calls that may be skipped by LLMs
  • Updated scoring logic to handle optional tool calls correctly
  • Refactored test expectations to reduce duplication and mark exploration tools as optional
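To make the idea of an optional expected tool call concrete, here is a minimal sketch. It is not the code from tests/accuracy/sdk/accuracyScorer.ts: the `ActualToolCall` type, the `scoreToolCalls` function, and the `create-collection` tool name are hypothetical, and the sketch matches calls by tool name only. The real scorer also compares parameters and can evidently award partial scores (the results below include 75% responses), which this sketch does not model.

```typescript
// Hypothetical shape; the real ExpectedToolCall type may carry more fields.
interface ExpectedToolCall {
    toolName: string;
    parameters: Record<string, unknown>;
    // When true, the LLM may skip this call without being penalized.
    optional?: boolean;
}

// Hypothetical shape for a call the LLM actually made.
interface ActualToolCall {
    toolName: string;
    parameters: Record<string, unknown>;
}

// Simplified scoring: 1 if every required expected call was made, otherwise 0.
// Optional expected calls are never counted as missing.
function scoreToolCalls(expected: ExpectedToolCall[], actual: ActualToolCall[]): number {
    const calledNames = new Set(actual.map((call) => call.toolName));
    const missingRequired = expected.filter(
        (call) => !call.optional && !calledNames.has(call.toolName)
    );
    return missingRequired.length === 0 ? 1 : 0;
}

// Illustrative expectation: list-databases may or may not be called first.
const expectedCalls: ExpectedToolCall[] = [
    { toolName: "list-databases", parameters: {}, optional: true },
    { toolName: "create-collection", parameters: {} },
];
```

Under this sketch, a response that calls only `create-collection` and one that calls `list-databases` followed by `create-collection` both receive the full score, which is the kind of false negative the PR description refers to.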

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/accuracy/sdk/accuracyScorer.ts | Modified scoring logic to properly handle optional expected tool calls instead of returning 0 |
| tests/accuracy/sdk/accuracyResultStorage/resultStorage.ts | Added optional field to ExpectedToolCall type definition |
| tests/accuracy/getPerformanceAdvisor.test.ts | Extracted common tool calls into reusable array and marked them as optional to reduce duplication |
| tests/accuracy/find.test.ts | Marked exploration tool call as optional and relaxed filter matching to allow empty objects |
| tests/accuracy/dropCollection.test.ts | Added optional exploration tool calls that LLMs may invoke before dropping collection |
| tests/accuracy/createCollection.test.ts | Added optional list-databases tool call that LLMs may invoke for verification |
| scripts/accuracy/generateTestSummary.ts | Updated UI to wrap optional tool names in parentheses for visual distinction |
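The parenthesized display of optional tool names could look roughly like the following. This is a sketch under assumptions, not the code in scripts/accuracy/generateTestSummary.ts: the `formatExpectedToolNames` helper, the comma separator, and the `find` tool name are all hypothetical.

```typescript
// Hypothetical minimal shape; only the fields needed for display.
interface ExpectedToolCallSummary {
    toolName: string;
    optional?: boolean;
}

// Renders tool names for a report cell, wrapping optional ones in parentheses
// so readers can tell them apart from required calls.
function formatExpectedToolNames(calls: ExpectedToolCallSummary[]): string {
    return calls
        .map((call) => (call.optional ? `(${call.toolName})` : call.toolName))
        .join(", ");
}

// Illustrative input: an optional exploration call followed by a required call.
const summaryCell = formatExpectedToolNames([
    { toolName: "list-databases", optional: true },
    { toolName: "find" },
]);
// summaryCell is "(list-databases), find"
```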

Contributor

📊 Accuracy Test Results

📈 Summary

| Metric | Value |
| --- | --- |
| Commit SHA | 6fb033e01a3554bedffc9c40054c5fbf55c41412 |
| Run ID | f7eb78a6-32cc-4df9-95e8-e9270500391b |
| Status | done |
| Total Prompts Evaluated | 73 |
| Models Tested | 1 |
| Average Accuracy | 96.6% |
| Responses with 0% Accuracy | 2 |
| Responses with 75% Accuracy | 2 |
| Responses with 100% Accuracy | 69 |
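The reported average is consistent with the per-bucket counts, assuming it is an unweighted mean over all 73 responses (the report does not state how it is computed). A quick check, using a hypothetical `averageAccuracy` helper:

```typescript
// One entry per accuracy bucket from the summary above.
interface AccuracyBucket {
    accuracy: number; // score as a fraction, e.g. 0.75 for 75%
    count: number; // number of responses with that score
}

// Unweighted mean accuracy across all responses (an assumption about the report).
function averageAccuracy(buckets: AccuracyBucket[]): number {
    const totalResponses = buckets.reduce((sum, b) => sum + b.count, 0);
    const totalScore = buckets.reduce((sum, b) => sum + b.accuracy * b.count, 0);
    return totalScore / totalResponses;
}

const reportedBuckets: AccuracyBucket[] = [
    { accuracy: 0, count: 2 },
    { accuracy: 0.75, count: 2 },
    { accuracy: 1, count: 69 },
];

// (2 * 0 + 2 * 0.75 + 69 * 1) / 73 = 70.5 / 73, which rounds to 96.6%
const averagePercent = (averageAccuracy(reportedBuckets) * 100).toFixed(1);
```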

📊 Baseline Comparison

| Metric | Value |
| --- | --- |
| Baseline Commit | 18fe5495cea00cd3de484077d1e3711ca0a9389e |
| Baseline Run ID | ac33cde0-acca-492c-8a1b-6962a9795686 |
| Baseline Run Status | done |
| Responses Improved | 7 |
| Responses Regressed | 2 |

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 10/15/2025, 1:50:47 PM

@nirinchev nirinchev merged commit 1cf6f6d into main Oct 15, 2025
17 of 19 checks passed
@nirinchev nirinchev deleted the ni/accuracy-test-fixes branch October 15, 2025 14:19