
feat(encoder): improve semantic extraction quality from reference analysis (#166)

Merged

amondnet merged 3 commits into main from feat/encoder-reference-gap-improvements on Mar 5, 2026

Conversation

@amondnet (Contributor) commented Mar 5, 2026

Summary

Implements four quality improvements to the TypeScript encoder identified by comparing it against the Python reference implementation (vendor/RPG-ZeroRepo/zerorepo/rpg_encoder/).

  • Gap 1 (HIGH): The repository skeleton is now passed to buildBatchClassPrompt and buildBatchFunctionPrompt via a new skeleton?: string parameter, giving the LLM architectural context for more grounded feature descriptions. The skeleton is built once in encode() and threaded through via SemanticExtractor.setSkeleton().

  • Gap 2 (HIGH): On retry for missing entities, instead of re-sending the full prompt with all code, the encoder now sends a targeted conversational follow-up message listing only the specific class methods or function names that are missing. A Memory instance with contextWindow: 0 is carried across iterations so the LLM retains the original code context.

  • Gap 3 (MEDIUM): Test files are detected via isTestFile() and routed to new dedicated prompts (buildBatchTestClassPrompt, buildBatchTestFunctionPrompt) that instruct the LLM to describe what is being tested rather than the test mechanics.

  • Gap 4 (LOW-MEDIUM): File-level summaries are now batched. A first pass uses heuristic placeholders (skipLLM: true); after all files are processed, aggregateFileFeaturesInBatch() calls buildBatchFileSummaryPrompt() once for all deferred files, reducing LLM round-trips significantly for large repositories.

Test plan

  • Run existing encoder unit tests: bun run test packages/encoder/tests/
  • Verify semantic-batching.test.ts still passes (token-aware batching unchanged)
  • Run integration test on a sample repo and compare output quality before/after
  • Verify semantic cache still works correctly (cache key may differ due to new skeleton in prompt)
  • Check token usage improvement from conversational follow-up vs full re-prompt

Summary by cubic

Improves the encoder’s semantic extraction with richer repo context, test-aware prompts, smarter conversational retries, and token-aware batched file summaries for higher accuracy, fewer LLM calls, and safer batch outputs.

  • New Features

    • Pass repo skeleton into batch prompts for grounded descriptions.
    • Use Memory follow-ups that only ask for missing items, reference parsed names, and correct invalid keys; normalize "/" to " or ".
    • Detect test files and use prompts that describe the subject under test.
    • Batch file summaries with token-aware splitting, entity names/types, and follow-up retries; fall back to heuristic and surface warnings.
  • Bug Fixes

    • Validate LLM batch summary fields (ensure description is a string and keywords are strings) to guard against malformed output.

Written for commit cb8ea5d. Summary will update on new commits.

feat(encoder): improve semantic extraction quality from reference analysis

Implement four improvements identified by comparing the TypeScript encoder
against the Python reference implementation (vendor/RPG-ZeroRepo):

Gap 1 (HIGH): Add repository skeleton to batch prompts
- Thread skeleton from buildRepoSkeleton() into SemanticExtractor via
  setSkeleton() and pass it to buildBatchClassPrompt / buildBatchFunctionPrompt
- Gives the LLM structural context for architecturally-grounded features
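The threading described in Gap 1 could be sketched as follows. The names buildBatchClassPrompt, SemanticExtractor, and setSkeleton come from this PR, but the bodies (and the promptFor helper) are hypothetical stand-ins, not the actual implementation:

```typescript
// Hypothetical sketch of Gap 1: an optional `skeleton` parameter gives the
// LLM architectural context; the extractor caches it for every batch prompt.

interface ClassInfo {
  name: string;
  code: string;
}

function buildBatchClassPrompt(classes: ClassInfo[], skeleton?: string): string {
  const context = skeleton
    ? `Repository structure for architectural context:\n${skeleton}\n\n`
    : "";
  const body = classes.map((c) => `### ${c.name}\n${c.code}`).join("\n\n");
  return `${context}Describe the feature each class implements:\n\n${body}`;
}

class SemanticExtractor {
  private skeleton?: string;

  // Called once from encode() after the skeleton is built.
  setSkeleton(skeleton: string): void {
    this.skeleton = skeleton;
  }

  // Invented helper showing how the cached skeleton reaches the prompt.
  promptFor(classes: ClassInfo[]): string {
    return buildBatchClassPrompt(classes, this.skeleton);
  }
}
```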

Gap 2 (HIGH): Conversational follow-up for missing entities on retry
- Replace full re-prompt with targeted follow-up messages that list only
  the missing class methods or function names
- Uses Memory with contextWindow:0 so prior code context is preserved
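A minimal sketch of the conversational-retry loop in Gap 2: a stand-in Memory class and a caller-supplied callLLM function replace the real LLM plumbing, and the message wording is illustrative, not the PR's actual prompt text:

```typescript
// Illustrative sketch of Gap 2: the full prompt is sent once; later turns
// send only a short follow-up, relying on the carried memory for context.

type Message = { role: "user" | "assistant"; content: string };

class Memory {
  messages: Message[] = [];
  constructor(public options: { contextWindow: number }) {}
  add(msg: Message): void {
    this.messages.push(msg);
  }
}

function buildFunctionFollowUpMessage(missing: string[]): string {
  return (
    `Your previous answer is missing features for: ${missing.join(", ")}. ` +
    `Reply with features for only these functions.`
  );
}

function extractWithRetry(
  fullPrompt: string,
  expected: string[],
  callLLM: (history: Message[]) => Record<string, string>,
  maxRetries = 3,
): Record<string, string> {
  const memory = new Memory({ contextWindow: 0 });
  memory.add({ role: "user", content: fullPrompt });
  const results: Record<string, string> = {};
  for (let i = 0; i < maxRetries; i++) {
    const parsed = callLLM(memory.messages);
    memory.add({ role: "assistant", content: JSON.stringify(parsed) });
    Object.assign(results, parsed);
    const missing = expected.filter((name) => !(name in results));
    if (missing.length === 0) break;
    // Targeted follow-up instead of re-sending all the code.
    memory.add({ role: "user", content: buildFunctionFollowUpMessage(missing) });
  }
  return results;
}
```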

Gap 3 (MEDIUM): Dedicated test file prompts
- Add isTestFile() detection and buildBatchTestClassPrompt /
  buildBatchTestFunctionPrompt that instruct the LLM to describe what is
  being tested, not the test mechanics
- Routes test entities to dedicated prompts instead of generic ones
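isTestFile() itself is not shown in this excerpt; a plausible heuristic based on common test-file naming conventions (not necessarily the PR's exact rules) might look like:

```typescript
// Hypothetical detection heuristic for Gap 3: match common test-file
// suffixes and test directories across JS/TS, Python, and Go conventions.
function isTestFile(filePath: string): boolean {
  const normalized = filePath.replace(/\\/g, "/");
  const base = normalized.split("/").pop() ?? "";
  return (
    /\.(test|spec)\.[jt]sx?$/.test(base) ||        // foo.test.ts, foo.spec.jsx
    /^test_.*\.py$/.test(base) ||                  // test_foo.py
    /_test\.(py|go)$/.test(base) ||                // foo_test.py, foo_test.go
    /(^|\/)(tests?|__tests__)\//.test(normalized)  // tests/ or __tests__/ dirs
  );
}
```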

Gap 4 (LOW-MEDIUM): Batched file summary generation
- Two-pass extraction: first pass uses heuristic placeholder, second pass
  batches all deferred file summaries into a single LLM call via
  aggregateFileFeaturesInBatch() and buildBatchFileSummaryPrompt()
- Reduces per-file LLM round-trips for large repositories
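The two-pass flow of Gap 4 could be sketched as follows. aggregateFileFeaturesInBatch is named in the commit message, but its body here, the heuristicSummary helper, and the prompt format are illustrative stand-ins:

```typescript
// Sketch of Gap 4: pass 1 assigns cheap placeholders; pass 2 resolves all
// deferred summaries with a single batched LLM call.

interface FileResult {
  path: string;
  childFeatures: string[];
  summary: string;
  pendingFileSummary: boolean;
}

// Pass 1: heuristic placeholder, no LLM call (the skipLLM: true path).
function heuristicSummary(file: { path: string; childFeatures: string[] }): FileResult {
  return {
    ...file,
    summary: `Contains: ${file.childFeatures.join("; ")}`,
    pendingFileSummary: true,
  };
}

// Pass 2: one batched call covering every deferred file.
function aggregateFileFeaturesInBatch(
  files: FileResult[],
  batchLLM: (prompt: string) => Record<string, string>, // keyed by file path
): FileResult[] {
  const deferred = files.filter((f) => f.pendingFileSummary);
  if (deferred.length === 0) return files;
  const prompt = deferred
    .map((f) => `## ${f.path}\n${f.childFeatures.join("\n")}`)
    .join("\n\n");
  const summaries = batchLLM(prompt);
  return files.map((f) =>
    f.path in summaries
      ? { ...f, summary: summaries[f.path], pendingFileSummary: false }
      : f, // keep the heuristic placeholder as the fallback
  );
}
```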
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the quality of semantic feature extraction within the encoder by implementing several improvements identified through comparison with a Python reference. These changes provide the LLM with richer contextual information, optimize retry mechanisms for efficiency, tailor prompts for test code analysis, and streamline file-level summary generation, ultimately leading to more accurate and efficient code understanding.

Highlights

  • Repository Skeleton Integration: The repository's architectural context (skeleton) is now passed to LLM prompts for class and function extraction, providing more grounded feature descriptions.
  • Targeted Retry for Missing Entities: When the LLM fails to extract all entities, subsequent retries send targeted conversational follow-up messages listing only the missing items, rather than re-sending the entire code, reducing token usage.
  • Dedicated Prompts for Test Files: Test files are now identified and routed to specialized prompts that instruct the LLM to describe the purpose of the tests (what is being tested) instead of their implementation details.
  • Batched File-Level Summaries: File-level summaries are initially generated using heuristic placeholders and then batched for a single LLM call to aggregate them, significantly reducing LLM round-trips for large repositories.
Changelog
  • packages/encoder/src/encoder.ts
    • Added pendingFileSummary to ExtractionResult interface.
    • Integrated repository skeleton building and setting it on the semantic extractor.
    • Modified extractEntities to accept a deferFileSummary parameter.
    • Implemented batch processing logic for deferred file summaries.
    • Updated extractEntities to conditionally use heuristic placeholders for file summaries when deferring.
  • packages/encoder/src/reorganization/prompts.ts
    • Updated buildBatchClassPrompt and buildBatchFunctionPrompt to accept an optional skeleton parameter.
    • Introduced buildBatchTestClassPrompt for test class semantic extraction.
    • Introduced buildBatchTestFunctionPrompt for test function semantic extraction.
    • Added buildBatchFileSummaryPrompt for aggregating file-level summaries.
  • packages/encoder/src/semantic.ts
    • Imported Memory for conversational LLM interactions.
    • Added skeleton property to SemanticOptions interface.
    • Implemented isTestFile utility function to identify test files.
    • Added setSkeleton method to update the repository skeleton.
    • Modified processClassGroupBatches and processFunctionBatches to differentiate between regular and test entities and to manage conversational memory for retries.
    • Updated extractClassBatch and extractFunctionBatch to use Memory for conversational retries and to select appropriate prompts based on isTest flag.
    • Added buildClassFollowUpMessage and buildFunctionFollowUpMessage for targeted retry prompts.
    • Modified aggregateFileFeatures to include a skipLLM parameter for deferred processing.
    • Implemented aggregateFileFeaturesInBatch for efficient batch generation of file summaries.
Activity
  • The pull request was created by amondnet.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant improvements to the semantic extraction process, aligning it more closely with the Python reference implementation. The changes address four key areas: providing repository skeleton context to the LLM, implementing conversational retries for missing entities, adding special handling for test files with dedicated prompts, and batching file summary generation to reduce LLM round-trips. These changes are well-implemented and should improve both the quality of the semantic features and the efficiency of the encoding process.

I have one suggestion for a minor efficiency improvement in the batch file summary generation logic.

@cubic-dev-ai bot left a comment

3 issues found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/encoder/src/reorganization/prompts.ts">

<violation number="1" location="packages/encoder/src/reorganization/prompts.ts:457">
P2: Keying test-class results by bare class name risks collisions when duplicate class names exist across test files. Use composite identifiers (e.g., "filePath::ClassName") in the prompt/output contract so batch parsing can disambiguate entries.

(Based on your team's feedback about using composite identifiers for LLM batch prompts.) [FEEDBACK_USED]</violation>

<violation number="2" location="packages/encoder/src/reorganization/prompts.ts:512">
P2: Keying test-function results by bare function name risks collisions when duplicate names exist across test files. Use composite identifiers (e.g., "filePath::functionName") in the prompt/output contract so batch parsing can disambiguate entries.

(Based on your team's feedback about using composite identifiers for LLM batch prompts.) [FEEDBACK_USED]</violation>
</file>
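The composite-identifier contract the reviewer proposes in violations 1 and 2 could be as simple as this hypothetical sketch (these helpers are not part of the PR):

```typescript
// Hypothetical helpers for the "filePath::Name" contract: bare names can
// collide across files, so the file path disambiguates each batch entry.
function makeEntityKey(filePath: string, name: string): string {
  return `${filePath}::${name}`;
}

function parseEntityKey(key: string): { filePath: string; name: string } {
  const idx = key.lastIndexOf("::");
  if (idx < 0) throw new Error(`Invalid composite key: ${key}`);
  return { filePath: key.slice(0, idx), name: key.slice(idx + 2) };
}
```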

<file name="packages/encoder/src/semantic.ts">

<violation number="1" location="packages/encoder/src/semantic.ts:951">
P2: Recoverable batch file-summary failures should also be recorded in `this.warnings`; currently they are only logged, so callers lose visibility into partial degradation.

(Based on your team's feedback about pushing non-fatal warnings to a warnings array.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…4 batch improvements

Gap 2-A: Include already-parsed function names in follow-up messages
- buildFunctionFollowUpMessage now accepts alreadyParsedNames and prevInvalidKeys
- processFunctionBatches tracks successfully parsed function names across iterations
- Mirrors Python semantic_parsing.py:803-806 "So far, you've extracted features for: ..."

Gap 2-B: Detect and report invalid keys in function batch responses
- extractFunctionBatch detects keys in parsed response not matching valid function names
- Invalid keys are reported in the next follow-up message (mirrors Python :808-811)
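Gaps 2-A and 2-B together suggest a follow-up builder along these lines. The parameter names follow the commit message, but the message wording is a sketch, not the actual prompt text:

```typescript
// Sketch of the enriched follow-up: name what is missing, what was already
// extracted (Gap 2-A), and any invalid keys from the prior reply (Gap 2-B).
function buildFunctionFollowUpMessage(
  missing: string[],
  alreadyParsedNames: string[] = [],
  prevInvalidKeys: string[] = [],
): string {
  const parts: string[] = [];
  if (alreadyParsedNames.length > 0) {
    parts.push(`So far, you've extracted features for: ${alreadyParsedNames.join(", ")}.`);
  }
  if (prevInvalidKeys.length > 0) {
    parts.push(
      `Your previous reply used keys that are not valid function names: ` +
        `${prevInvalidKeys.join(", ")}. Do not use them again.`,
    );
  }
  parts.push(`Now provide features for only these functions: ${missing.join(", ")}.`);
  return parts.join(" ");
}
```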

Gap 2-C: Replace "/" with " or " in feature strings
- featureListToSemanticFeature normalizes slash-separated values (mirrors Python :687,795)
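The Gap 2-C normalization reduces to a one-line helper. featureSlashToOr is a hypothetical name; in the PR the logic lives inside featureListToSemanticFeature:

```typescript
// Replace slash-separated values with " or " so features read as alternatives.
function featureSlashToOr(feature: string): string {
  return feature.replace(/\s*\/\s*/g, " or ");
}
```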

Gap 4-A: Include entity names/types in batch file summary prompts
- aggregateFileFeaturesInBatch accepts optional childEntities with name/type/feature
- buildBatchFileSummaryPrompt renders entity-named sections when childEntities provided
- encoder.ts captures entity names alongside childFeatures in pendingFileSummary
- Mirrors Python summarize_file_batch feature_map format (class/function key hierarchy)

Gap 4-B: Token-aware batch splitting for file summaries
- createFileSummaryBatches splits files by token estimate (min=1000/max=8000 tokens)
- Processes batches concurrently via runConcurrent (mirrors Python ThreadPoolExecutor)
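Token-aware splitting as described could be sketched like this. The chars/4 estimator is a common rough approximation, and the real implementation also enforces the 1000-token minimum per batch, omitted here for brevity:

```typescript
// Sketch of Gap 4-B: greedily pack files into batches under a token budget.

interface FileEntry {
  path: string;
  content: string;
}

// Rough token estimate: about four characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function createFileSummaryBatches(files: FileEntry[], maxTokens = 8000): FileEntry[][] {
  const batches: FileEntry[][] = [];
  let current: FileEntry[] = [];
  let currentTokens = 0;
  for (const file of files) {
    const cost = estimateTokens(file.content);
    // Close the current batch when adding this file would exceed the budget.
    if (current.length > 0 && currentTokens + cost > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(file);
    currentTokens += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```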

Gap 4-C: Memory-based conversational retry for missing file summaries
- runFileSummaryBatch retries up to 3 times with follow-up for missing file paths
- Mirrors Python summarize_file_batch() conversational retry logic

Review fixes:
- encoder.ts: pre-filter allExtractionResults once to avoid redundant full-list iteration
- semantic.ts: push batch file summary LLM failures to this.warnings for caller visibility

@cubic-dev-ai bot left a comment

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/encoder/src/semantic.ts">

<violation number="1" location="packages/encoder/src/semantic.ts:1093">
P2: Parsed LLM response fields `description` and `keywords` are cast without type validation. Verify `description` is a string and `keywords` (if present) is an `Array<string>` before use to guard against malformed LLM output.

(Based on your team's feedback about validating parsed JSON mappings before use.) [FEEDBACK_USED]</violation>
</file>


…atch

Validate that `description` is a string and `keywords` (if present) is
an array of strings before use, guarding against malformed LLM output.

Previously, the fields were cast without runtime type checks. Now the
`description` field is explicitly verified via `typeof === 'string'` in
the if-condition, and `keywords` is filtered with a type predicate to
strip any non-string elements.
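The guard described above amounts to a small runtime check. validateFileSummary is a hypothetical name used to show the shape of the checks; the PR performs them inline:

```typescript
// Sketch of the fix: reject a non-string description and silently strip
// non-string keywords before trusting the parsed LLM output.

interface FileSummary {
  description: string;
  keywords: string[];
}

function validateFileSummary(raw: unknown): FileSummary | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const { description, keywords } = raw as { description?: unknown; keywords?: unknown };
  if (typeof description !== "string") return undefined; // malformed output
  const safeKeywords = Array.isArray(keywords)
    ? keywords.filter((k): k is string => typeof k === "string") // type predicate
    : [];
  return { description, keywords: safeKeywords };
}
```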
amondnet self-assigned this on Mar 5, 2026
amondnet merged commit aa3f980 into main on Mar 5, 2026 (6 checks passed)
amondnet deleted the feat/encoder-reference-gap-improvements branch on Mar 5, 2026 at 08:35

sonarqubecloud bot commented Mar 5, 2026

pleaeai-bot mentioned this pull request on Mar 5, 2026