Align AI analysis (Stage 3) with static analysis (Stage 2) in Query Insights#616
Conversation
… badge color handling Co-authored-by: Copilot <copilot@github.com>
Add comprehensive test coverage for prompt templates as a safety net before refactoring them into resource files: - Test createPriorityDeclaration() and createSecurityInstructions() - Test CRITICAL_JSON_REMINDER content - Test all 3 index advisor prompt templates (find/aggregate/count): - Required sections (PRIORITY DECLARATION, SECURITY INSTRUCTIONS, etc.) - JSON schema structural elements - Escape character and backtick fence integrity - Analysis and educational content template structure - Markdown compatibility rules - Azure vCore and low-selectivity guidance - Test getQueryTypeConfig() for all query types - Test structural integrity across all templates (no unresolved template literals, numbered instructions, verification requirements)
Extract the body of each index advisor prompt template (find, aggregate, count) into standalone resource files under resources/prompts/: - index-advisor-find.prompt.md - index-advisor-aggregate.prompt.md - index-advisor-count.prompt.md The dynamic header (PRIORITY DECLARATION + SECURITY INSTRUCTIONS) remains generated in TypeScript code. At runtime, the PromptTemplateService composes the full prompt by loading the body from the resource file and prepending the dynamic header. The inline template constants are kept as fallbacks for when the extension context is not available (e.g., early startup, tests). Adds: - loadPromptBody() function for reading resource files with caching - buildIndexAdvisorPrompt() function for composing header + body - Snapshot tests to detect accidental content changes - Tests for the new loading/fallback behavior
Add buildStaticAnalysisSummary() function that produces a compact text summary of the Stage 2 static analysis results for inclusion in the AI prompt context. The summary includes: - Collection context (total docs, returned, examined, keys, exec time) - Performance rating (excellent/good/fair/poor) - Summary indicators (selectivity, index used, fetch overhead, sort) - All diagnostic badges with type markers ([+]/[-]/[i]) and details - Concerns when present The summary is designed to be concise (~300-500 chars for typical queries) while giving the LLM enough context to understand what the user has already been told by the static analysis. Includes 16 tests covering all scenarios: good/poor queries, covered queries, collection scans, low-cardinality badges, missing data, etc.
Wire up the Stage 2 static analysis results as additional context in the AI prompt for Stage 3. The LLM now receives a summary of what the user has already been shown (performance rating, selectivity, fetch overhead, diagnostic badges, concerns). Changes: - Add staticAnalysisSummary optional field to QueryOptimizationContext - Update fillPromptTemplate() to append static analysis summary to context data when available - Update QueryInsightsAIService.getOptimizationRecommendations() to accept and pass through the summary - Add Stage 2 response caching on ClusterSession (setStage2Response/ getStage2Response) so Stage 3 can access the transformed results - Update Stage 3 router handler to build static analysis summary from cached Stage 2 response and pass it to the AI service - Update DATA PLACEHOLDERS in all 3 prompt resource files and inline template constants to document the new Static Analysis Results section
Add specific instructions to all 3 index advisor prompt templates (find, aggregate, count) that teach the LLM about: 1. Low-cardinality index awareness: When the static analysis has shown a 'Low-cardinality index' badge, do NOT recommend creating indexes on those fields (boolean, enum-like). If still recommending, mark as low priority with a confidence note. 2. High return ratio: When the query returns >20% of the collection, creating an index is unlikely to help. Suggest hiding unused indexes instead. These instructions appear in both the resource files and the inline template constants (fallback).
Add 'Static analysis alignment' instruction to all 3 index advisor prompt templates that governs how the LLM handles agreement and disagreement with the static analysis shown to the user: - If the AI agrees with the static analysis, briefly affirm it - If the AI disagrees, it MUST explain why in the Performance Summary section using the format: 'The initial analysis showed [X], but after deeper inspection [Y] because [Z]' - Must not silently contradict the static analysis - Must explain when it thinks the heuristic was too strict or missed something The instruction acknowledges that static analysis is heuristic-based while the AI has access to the full execution plan, so divergence is expected and acceptable when explained.
- Run npm run l10n to update localization bundle - Run npm run prettier-fix to format code - Fix lint error: replace inline import() type annotation with proper import in collectionViewRouter.ts - Run npx jest --no-coverage: all 1910 tests pass - Run npm run build: TypeScript compilation successful
…uctions The low-cardinality and high-return-ratio rules were in the 'Thinking tips' section of the prompt, which the LLM treats as optional guidance. This caused the AI to recommend creating high-priority indexes on boolean fields (e.g., isFamilyFriendly: true) even when the query returns 55% of the collection. Changes: - Promote both rules from tips to mandatory numbered instructions in all 3 prompt templates (find, aggregate, count) - Make the rules work from the query data itself (boolean filter values, return ratio) rather than depending on the static analysis badge, which only fires on index scans - Remove the now-redundant tip-section duplicates - The rules now mandate: no high-priority index on boolean/low- cardinality fields, and no high-priority single-field index when >20% of collection is returned
… source Stage 2 telemetry (no PII/OII — all categorical or numeric): - performanceScore: excellent/good/fair/poor distribution - executionStrategy, indexUsed, hadCollectionScan, hadInMemorySort, isCoveringQuery, fetchOverheadKind, isSharded (categorical flags) - executionTimeMs, documentsReturned, totalDocsExamined, totalKeysExamined, examinedToReturnedRatio (aggregatable metrics) - selectivityPercent, totalCollectionDocs (when available) - diagnosticBadgeIds (comma-separated IDs for segmentation) - positiveBadgeCount, neutralBadgeCount, negativeBadgeCount Stage 3 telemetry: - hasStaticAnalysisSummary: whether the summary was available - staticAnalysisSummaryLength: prompt size impact - staticAnalysisSummaryError: tracks build failures - hasCachedExecutionPlan: whether Stage 2 plan was reused Prompt source tracking: - promptSource: 'resource-file' | 'inline-fallback' | 'custom-file' Tracks whether resource files are being used successfully, enabling future removal of inline fallback constants. - hasStaticAnalysisSummary in optimizeQuery context
…sitions 3-4
The LLM was ignoring low-cardinality and high-return-ratio rules when
they were at positions 21-22 out of 22 instructions. LLMs weight early
instructions much more heavily, and the strong COLLSCAN→create-index
prior from training data was overriding the late rules.
Changes:
- Move both rules to positions 3-4 (right after 'do not hallucinate')
in all 6 prompt sources (3 resource files + 3 inline fallbacks)
- Add CRITICAL prefix to both rules for emphasis
- Add concrete example: { isFamilyFriendly: true } on 65K docs
returning 35K → do NOT suggest indexing this boolean field
- Remove old duplicate rules from bottom of each template
- Renumber all subsequent instructions (old 3→5, old 4→6, etc.)
Verified with live testing: the AI now correctly returns LOW PRIORITY
with a low-cardinality risk note instead of HIGH PRIORITY.
Update the Performance Summary and Performance Metrics sections in all prompt templates to be consistent with the static analysis: Performance Summary: - Replace generic '(excellent/good/poor)' with the actual scale used by our static analysis: 'Excellent, Good, Fair, or Poor' Performance Metrics (educational content): - Find: Replace generic 'Documents Examined vs Returned' / 'Keys Examined' / 'Inefficiencies Detected' with Selectivity, Fetch Overhead, In-Memory Sort, and Efficiency Ratio — matching the indicators the user already sees in the Stage 2 static analysis - Aggregate: Add Selectivity alongside Pipeline Efficiency - Count: Replace 'Documents Examined' with Selectivity and Efficiency Ratio This ensures the AI educational content uses the same vocabulary and concepts as the static analysis, reducing cognitive dissonance.
…ion 8 The instruction telling the LLM it can disagree with the static analysis (but must explain why) was at the very end of the instruction list. Moving it to position 8 (right after the analysis/educational content structure rules 6-7) ensures the LLM processes it early when deciding how to frame its Performance Summary. Applied to all 6 prompt sources (3 resource files + 3 inline fallbacks). Remaining instructions renumbered 9-22.
Prettier was mangling the prompt template files: - Collapsing triple backtick references (```json → `json) - Adding blank lines after markdown headings - Changing indentation on continuation lines These changes break the LLM prompt structure (the triple backtick instruction is critical for JSON output formatting). Added resources/prompts/ to .prettierignore to prevent this.
Apply 6 prompt improvements across all resource files and inline fallback templates: 1. Rename role from 'Index Advisor assistant' to 'Query Performance Analyst' — reduces bias toward always recommending indexes 2. Tighten compound index escape hatch — low-cardinality leading keys in compound indexes also get low priority 3. Remove metadata echo requirement — the metadata field was never parsed or used, saving ~40% of response tokens 4. Remove '<=6 sentences' constraint on analysis — conflicted with the 3-section structure template 5. Remove duplicate tips already covered by CRITICAL instructions (Low-selectivity fields, Prioritize restrictive range) 6. Fix trailing literal newline in Limited confidence rule 7. Replace 'Be brave to say no' with explicit 'It is OK to recommend nothing' — clearer signal for weaker models 8. Remove specific isFamilyFriendly example from CRITICAL rule — too specific, could introduce skew 9. Use DocumentDB API / MongoDB API terminology in role naming
Add structured documentation for the AI static analysis alignment work under docs/ai-and-plans/616-ai-static-analysis-alignment/. Includes architecture diagram, key implementation decisions, lessons learned about LLM instruction positioning, and file map.
There was a problem hiding this comment.
Pull request overview
This PR reduces contradictions between Query Insights Stage 2 (static analysis UI) and Stage 3 (AI recommendations) by caching Stage 2 results and injecting a compact Stage 2 summary into the AI prompt context, plus refactoring index-advisor prompts into resource files with added telemetry and tests.
Changes:
- Add a Stage 2 → Stage 3 context pipeline via
ClusterSessioncaching andbuildStaticAnalysisSummary()prompt injection. - Refactor index-advisor prompt templates to load prompt bodies from
resources/prompts/*with fallback behavior and prompt-source tracking. - Expand telemetry and add tests/snapshots for prompt templates and the static analysis summary formatter.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/webviews/documentdb/collectionView/components/queryInsightsTab/components/summaryCard/custom/PerformanceRatingCell.tsx | Updates diagnostics rendering (icons/colors) in performance rating cell. |
| src/webviews/documentdb/collectionView/collectionViewRouter.ts | Caches Stage 2 response, builds static analysis summary for Stage 3 prompt, adds Stage 2/3 telemetry. |
| src/services/promptTemplateService.ts | Builds built-in index-advisor prompts via resource-file composition and tracks prompt source for custom templates. |
| src/services/ai/QueryInsightsAIService.ts | Adds staticAnalysisSummary parameter plumbing into optimization calls. |
| src/documentdb/queryInsights/staticAnalysisSummary.ts | New helper to convert Stage 2 UI results into compact prompt-ready text summary. |
| src/documentdb/queryInsights/staticAnalysisSummary.test.ts | Unit tests for static analysis summary formatting across scenarios. |
| src/documentdb/ClusterSession.ts | Adds Stage 2 response cache storage/retrieval for Stage 3 usage. |
| src/commands/llmEnhancedCommands/promptTemplates.ts | Adds prompt-body resource loading/composition + prompt-source tracking; updates prompt instructions/schema text. |
| src/commands/llmEnhancedCommands/promptTemplates.test.ts | Adds structural + snapshot tests for prompt templates and load/fallback behavior. |
| src/commands/llmEnhancedCommands/snapshots/promptTemplates.test.ts.snap | Snapshots for prompt template content integrity. |
| src/commands/llmEnhancedCommands/indexAdvisorCommands.ts | Injects static analysis summary into context data and records prompt-source/static-summary telemetry. |
| resources/prompts/index-advisor-find.prompt.md | New resource prompt body for find index-advisor template. |
| resources/prompts/index-advisor-aggregate.prompt.md | New resource prompt body for aggregate index-advisor template. |
| resources/prompts/index-advisor-count.prompt.md | New resource prompt body for count index-advisor template. |
| l10n/bundle.l10n.json | Adds localized strings for Stage 3 static analysis summary build logging. |
| .prettierignore | Excludes prompt resource files from Prettier formatting. |
Comments suppressed due to low confidence (1)
src/commands/llmEnhancedCommands/promptTemplates.ts:174
INDEX_ADVISOR_ROLEis still set to "MongoDB Index Advisor assistant" and is used increatePriorityDeclaration(...), which becomes the highest-priority instruction in the prompt. This contradicts the new "Query Performance Analyst" positioning in the TASK section and also violates the repo terminology guidance to avoid "MongoDB" alone (see .github/copilot-instructions.md:197-211). Consider updating the role string to the new analyst wording (DocumentDB / MongoDB API).
const INDEX_ADVISOR_ROLE = 'MongoDB Index Advisor assistant';
const QUERY_GENERATOR_ROLE = 'MongoDB Query Generator assistant';
const INDEX_ADVISOR_TASK_FIND =
'analyze MongoDB queries and provide index optimization suggestions based on the data provided';
const INDEX_ADVISOR_TASK_AGGREGATE =
'analyze MongoDB aggregation pipelines and provide index optimization suggestions based on the data provided';
const INDEX_ADVISOR_TASK_COUNT =
'analyze MongoDB count queries and provide index optimization suggestions based on the data provided';
…PI Query Performance Analyst and Index Advisor' The PRIORITY DECLARATION — the first thing the LLM reads — still used the old 'MongoDB Index Advisor assistant' role, contradicting the body text that says 'Query Performance Analyst'. By the PR's own argument about positional sensitivity, this preserved the bias the rename was meant to remove. Updated the role string in all 3 promptConfigs entries (promptTemplateService.ts) and the INDEX_ADVISOR_ROLE constant (promptTemplates.ts) to use the same neutral framing. Fixes: HIGH-1 from review
The inline fallback templates had broken numbering: - FIND: 1,2,3,4,7,8,7,8,10,10,11-21 (5,6,9 missing; 7,8,10 duplicated) - AGGREGATE/COUNT: 10,10 (9 missing; 10 duplicated) Renumbered all three templates to sequential 1-21, matching the resource files. This ensures correct LLM instruction ordering when the resource file cannot be loaded. Fixes: HIGH-2 from review
…nown The Stage 2 response cache in ClusterSession stored `unknown` and consumers cast it back with `as QueryInsightsStage2Response` — a classic erase-then-cast anti-pattern. The producer already returns `QueryInsightsStage2Response` from `transformStage2Response()`, so the type information existed at both boundaries but was discarded. Now the cache field, setter, and getter all use the concrete type. Removed the unsafe cast at the consumer in collectionViewRouter.ts. Fixes: MEDIUM-1 from review
Added `staticAnalysisSummaryErrorKind` telemetry property (captures the error constructor name: TypeError, RangeError, etc.) to allow production diagnosis without PII. Promoted logging from trace to error level with the actual error message, since silent degradation here is hard to diagnose. Fixes: MEDIUM-3 + LOW-2 from review
|
MEDIUM-3 + LOW-2 fix: Added See e3a5b0a |
Per repo terminology rules, 'MongoDB' alone should not be used as a product name. Updated the index advisor prompt constants, message descriptions, task descriptions, DATA PLACEHOLDERS in inline fallback templates, and the three resource files (find, aggregate, count) to use 'MongoDB API' consistently. Scope limited to LLM-facing prompt strings only (not general codebase). Fixes: LOW-1 from review
Changed loadPromptBody return type from `string | undefined` to a discriminated union that carries the failure reason. Now `lastPromptSource` is set to `'inline-fallback-no-context'` (expected during tests/early init) vs `'inline-fallback-read-error'` (packaging bug), allowing production telemetry to distinguish the two cases. Updated the test to match the new return type. Fixes: LOW-3 from review
|
LOW-3 fix: See cd6bcdf |
For find queries routed through QueryInsightsAIService, only queryObject is populated — query is undefined. This caused the user message to contain 'N/A' even though the prompt tells the LLM the first user message contains the query. Now falls back to JSON.stringify(queryObject) when query is not available. Fixes: LOW-5 from review
|
LOW-6 fix: Regenerated 3 snapshot files after HIGH-1 (role rename), HIGH-2 (numbering fix), and LOW-1 (MongoDB → MongoDB API). All 66 prompt template tests pass. See eba1dba |
- Updated l10n bundle to reflect the error message change from
MEDIUM-3/LOW-2 (added {error} parameter)
- Fixed import/consistent-type-specifier-style lint error in
ClusterSession.ts (use inline type specifier)
Changed from 'MongoDB API Query Performance Analyst and Index Advisor' to 'MongoDB API Index Advisor assistant' per operator decision. Regenerated 3 snapshots. Amends: HIGH-1 from review
Co-authored-by: Copilot <copilot@github.com>
- Add MANDATORY bitmap single-field index hide rule (rule 19) to all prompts - Improve justification requirements (rule 12): 2-3 sentences per recommendation - Add runtime filter removal signal thinking tip for compound index suggestions - Update fallback templates in promptTemplates.ts to match .md prompts - Update snapshots
Split the 'Unhide recommendations' clause from rule 3 into its own dedicated rule 4 (CRITICAL — Unhiding low-cardinality / boolean indexes) with explicit requirements: low priority, justification with percentage, and risks note. Renumbered all subsequent rules (now 1-24). Applied consistently to find, aggregate, count prompt files and inline fallback templates in promptTemplates.ts.
Change PREFERRED_MODEL from gpt-4o to gpt-5-mini for better
instruction following on complex 24-rule prompts. Add gpt-4o as
fallback (replacing gpt-4o-mini).
Add modelOptions support to CopilotService to pass reasoning_effort
('medium') via the VS Code LanguageModelChatRequestOptions API.
Wire through both index advisor and query generation callers.
Revert PREFERRED_MODEL back to gpt-4o (from gpt-5-mini) and FALLBACK_MODELS back to gpt-4o-mini — needs more prompt review. Fix inaccurate copilotDurationMs: move timing into CopilotService.sendToModel to measure only the actual LLM request + streaming, excluding selectChatModels overhead and telemetry wrapper. Return durationMs in CopilotResponse so callers use the accurate measurement.
Co-authored-by: Copilot <copilot@github.com>
…fixupDocumentDbExplain
|
R-001 resolved — Priority-declaration role conflicts with resource-file body The prompt has been tested and changes to LLM-facing strings require regression testing. Created a tracked issue instead of a code change: → #619 — Query Insights: Align priority-declaration role with resource-file prompt framing (milestone: 0.9.0) The resource files already use the correct "Query Performance Analyst" framing; only the wrapping |
|
R-002 resolved — Planner-tree traversal helpers do not recurse through shard branches Created a tracked issue to address alongside the explain schema work: → #620 — Query Insights: Planner-tree traversal helpers miss shard branches (milestone: 0.9.0) Severity was reassessed from Medium to Low: only affects the narrow edge case of sharded clusters with |
Cross-collection template jumped 4→7→10 (missing 5,6,9). Single-collection template had the same gap pattern (4→7→8→10→11...). Renumbered both instruction lists to be sequential. Ref: PR #616 review (R-003, C-012)
|
R-003 resolved — Query-generation prompt instruction numbering is non-sequential Fixed in commit Related review comment: C-012. |
Cache now stores { template, source } together instead of just the template
string. On cache hits, the source is restored before returning so that
telemetry reports the correct prompt source for the current request.
Also extracted a PromptSource type for all valid source values, widening
setLastPromptSource() from 3 values to the full union.
Ref: PR #616 review (R-004, C-006)
|
R-004 resolved — Fixed in commit Also extracted a Related review comment: C-006. |
✅ Code Quality Checks
This comment is updated automatically on each push. |
📦 Build Size Report
Download artifact · updated automatically on each push. |
Summary
The AI-powered query analysis (Stage 3) had no knowledge of what the user was already shown in the static analysis (Stage 2). This caused contradictions — the static analysis would show "Good" performance with badges, then the AI would give a different assessment without explaining why. Worse, for low-cardinality boolean queries returning >50% of the collection, the AI would recommend creating high-priority indexes that would not help and would later trigger a "Low-cardinality index" warning when re-run.
This PR bridges the gap between Stage 2 and Stage 3, making the AI aware of the static analysis results and teaching it when NOT to recommend indexes.
Changes
Core: Static Analysis → AI Context Pipeline
buildStaticAnalysisSummary()— New function that produces a compact text summary of Stage 2 results (performance rating with scale, summary indicators, diagnostic badges, concerns) for inclusion in the AI promptClusterSession.setStage2Response()/getStage2Response()stores the transformed Stage 2 response so Stage 3 can access itfillPromptTemplate()Prompt Engineering
lowpriorityInfrastructure
resources/prompts/index-advisor-{find,aggregate,count}.prompt.mdwith runtime loading viabuildIndexAdvisorPrompt(). Inline constants kept as fallbacks.prettierignore— Prompt files excluded from prettier (it was mangling triple backticks and indentation)getLastPromptSource()/setLastPromptSource()tracks whether resource files or inline fallbacks are usedTests
Verified Behavior
Tested with
{ "additionalInfo.isFamilyFriendly": true }on a 65K-document collection returning 35K docs (55% selectivity):