
feat(encoder): improve semantic extraction quality from reference analysis (#166)

Merged

amondnet merged 3 commits into main from feat/encoder-reference-gap-improvements on Mar 5, 2026

Conversation

@amondnet (Contributor) commented Mar 5, 2026

Summary

Implements four quality improvements to the TypeScript encoder identified by comparing it against the Python reference implementation (vendor/RPG-ZeroRepo/zerorepo/rpg_encoder/).

  • Gap 1 (HIGH): The repository skeleton is now passed to buildBatchClassPrompt and buildBatchFunctionPrompt via a new skeleton?: string parameter, giving the LLM architectural context for more grounded feature descriptions. The skeleton is built once in encode() and threaded through via SemanticExtractor.setSkeleton().

  • Gap 2 (HIGH): On retry for missing entities, instead of re-sending the full prompt with all code, the encoder now sends a targeted conversational follow-up message listing only the specific class methods or function names that are missing. A Memory instance with contextWindow: 0 is carried across iterations so the LLM retains the original code context.

  • Gap 3 (MEDIUM): Test files are detected via isTestFile() and routed to new dedicated prompts (buildBatchTestClassPrompt, buildBatchTestFunctionPrompt) that instruct the LLM to describe what is being tested rather than the test mechanics.

  • Gap 4 (LOW-MEDIUM): File-level summaries are now batched. A first pass uses heuristic placeholders (skipLLM: true); after all files are processed, aggregateFileFeaturesInBatch() calls buildBatchFileSummaryPrompt() once for all deferred files, reducing LLM round-trips significantly for large repositories.

Test plan

  • Run existing encoder unit tests: bun run test packages/encoder/tests/
  • Verify semantic-batching.test.ts still passes (token-aware batching unchanged)
  • Run integration test on a sample repo and compare output quality before/after
  • Verify semantic cache still works correctly (cache key may differ due to new skeleton in prompt)
  • Check token usage improvement from conversational follow-up vs full re-prompt

Summary by cubic

Improves the encoder’s semantic extraction with richer repo context, test-aware prompts, smarter conversational retries, and token-aware batched file summaries for higher accuracy, fewer LLM calls, and safer batch outputs.

  • New Features

    • Pass repo skeleton into batch prompts for grounded descriptions.
    • Use Memory follow-ups that only ask for missing items, reference parsed names, and correct invalid keys; normalize "/" to " or ".
    • Detect test files and use prompts that describe the subject under test.
    • Batch file summaries with token-aware splitting, entity names/types, and follow-up retries; fall back to heuristic and surface warnings.
  • Bug Fixes

    • Validate LLM batch summary fields (ensure description is a string and keywords are strings) to guard against malformed output.

Written for commit cb8ea5d. Summary will update on new commits.

feat(encoder): improve semantic extraction quality from reference analysis

Implement four improvements identified by comparing the TypeScript encoder
against the Python reference implementation (vendor/RPG-ZeroRepo):

Gap 1 (HIGH): Add repository skeleton to batch prompts
- Thread skeleton from buildRepoSkeleton() into SemanticExtractor via
  setSkeleton() and pass it to buildBatchClassPrompt / buildBatchFunctionPrompt
- Gives the LLM structural context for architecturally-grounded features
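The threading described in Gap 1 could be sketched as follows. The names buildBatchClassPrompt, SemanticExtractor, and setSkeleton come from this PR, but the bodies (and the promptFor helper) are hypothetical stand-ins, not the actual implementation:

```typescript
// Hypothetical sketch of Gap 1: an optional `skeleton` parameter gives the
// LLM architectural context; the extractor caches it for every batch prompt.

interface ClassInfo {
  name: string;
  code: string;
}

function buildBatchClassPrompt(classes: ClassInfo[], skeleton?: string): string {
  const context = skeleton
    ? `Repository structure for architectural context:\n${skeleton}\n\n`
    : "";
  const body = classes.map((c) => `### ${c.name}\n${c.code}`).join("\n\n");
  return `${context}Describe the feature each class implements:\n\n${body}`;
}

class SemanticExtractor {
  private skeleton?: string;

  // Called once from encode() after the skeleton is built.
  setSkeleton(skeleton: string): void {
    this.skeleton = skeleton;
  }

  // Invented helper showing how the cached skeleton reaches the prompt.
  promptFor(classes: ClassInfo[]): string {
    return buildBatchClassPrompt(classes, this.skeleton);
  }
}
```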

Gap 2 (HIGH): Conversational follow-up for missing entities on retry
- Replace full re-prompt with targeted follow-up messages that list only
  the missing class methods or function names
- Uses Memory with contextWindow:0 so prior code context is preserved
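A minimal sketch of the conversational-retry loop in Gap 2: a stand-in Memory class and a caller-supplied callLLM function replace the real LLM plumbing, and the message wording is illustrative, not the PR's actual prompt text:

```typescript
// Illustrative sketch of Gap 2: the full prompt is sent once; later turns
// send only a short follow-up, relying on the carried memory for context.

type Message = { role: "user" | "assistant"; content: string };

class Memory {
  messages: Message[] = [];
  constructor(public options: { contextWindow: number }) {}
  add(msg: Message): void {
    this.messages.push(msg);
  }
}

function buildFunctionFollowUpMessage(missing: string[]): string {
  return (
    `Your previous answer is missing features for: ${missing.join(", ")}. ` +
    `Reply with features for only these functions.`
  );
}

function extractWithRetry(
  fullPrompt: string,
  expected: string[],
  callLLM: (history: Message[]) => Record<string, string>,
  maxRetries = 3,
): Record<string, string> {
  const memory = new Memory({ contextWindow: 0 });
  memory.add({ role: "user", content: fullPrompt });
  const results: Record<string, string> = {};
  for (let i = 0; i < maxRetries; i++) {
    const parsed = callLLM(memory.messages);
    memory.add({ role: "assistant", content: JSON.stringify(parsed) });
    Object.assign(results, parsed);
    const missing = expected.filter((name) => !(name in results));
    if (missing.length === 0) break;
    // Targeted follow-up instead of re-sending all the code.
    memory.add({ role: "user", content: buildFunctionFollowUpMessage(missing) });
  }
  return results;
}
```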

Gap 3 (MEDIUM): Dedicated test file prompts
- Add isTestFile() detection and buildBatchTestClassPrompt /
  buildBatchTestFunctionPrompt that instruct the LLM to describe what is
  being tested, not the test mechanics
- Routes test entities to dedicated prompts instead of generic ones
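isTestFile() itself is not shown in this excerpt; a plausible heuristic based on common test-file naming conventions (not necessarily the PR's exact rules) might look like:

```typescript
// Hypothetical detection heuristic for Gap 3: match common test-file
// suffixes and test directories across JS/TS, Python, and Go conventions.
function isTestFile(filePath: string): boolean {
  const normalized = filePath.replace(/\\/g, "/");
  const base = normalized.split("/").pop() ?? "";
  return (
    /\.(test|spec)\.[jt]sx?$/.test(base) ||        // foo.test.ts, foo.spec.jsx
    /^test_.*\.py$/.test(base) ||                  // test_foo.py
    /_test\.(py|go)$/.test(base) ||                // foo_test.py, foo_test.go
    /(^|\/)(tests?|__tests__)\//.test(normalized)  // tests/ or __tests__/ dirs
  );
}
```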

Gap 4 (LOW-MEDIUM): Batched file summary generation
- Two-pass extraction: first pass uses heuristic placeholder, second pass
  batches all deferred file summaries into a single LLM call via
  aggregateFileFeaturesInBatch() and buildBatchFileSummaryPrompt()
- Reduces per-file LLM round-trips for large repositories
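The two-pass flow of Gap 4 could be sketched as follows. aggregateFileFeaturesInBatch is named in the commit message, but its body here, the heuristicSummary helper, and the prompt format are illustrative stand-ins:

```typescript
// Sketch of Gap 4: pass 1 assigns cheap placeholders; pass 2 resolves all
// deferred summaries with a single batched LLM call.

interface FileResult {
  path: string;
  childFeatures: string[];
  summary: string;
  pendingFileSummary: boolean;
}

// Pass 1: heuristic placeholder, no LLM call (the skipLLM: true path).
function heuristicSummary(file: { path: string; childFeatures: string[] }): FileResult {
  return {
    ...file,
    summary: `Contains: ${file.childFeatures.join("; ")}`,
    pendingFileSummary: true,
  };
}

// Pass 2: one batched call covering every deferred file.
function aggregateFileFeaturesInBatch(
  files: FileResult[],
  batchLLM: (prompt: string) => Record<string, string>, // keyed by file path
): FileResult[] {
  const deferred = files.filter((f) => f.pendingFileSummary);
  if (deferred.length === 0) return files;
  const prompt = deferred
    .map((f) => `## ${f.path}\n${f.childFeatures.join("\n")}`)
    .join("\n\n");
  const summaries = batchLLM(prompt);
  return files.map((f) =>
    f.path in summaries
      ? { ...f, summary: summaries[f.path], pendingFileSummary: false }
      : f, // keep the heuristic placeholder as the fallback
  );
}
```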
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the quality of semantic feature extraction within the encoder by implementing several improvements identified through comparison with a Python reference. These changes provide the LLM with richer contextual information, optimize retry mechanisms for efficiency, tailor prompts for test code analysis, and streamline file-level summary generation, ultimately leading to more accurate and efficient code understanding.

Highlights

  • Repository Skeleton Integration: The repository's architectural context (skeleton) is now passed to LLM prompts for class and function extraction, providing more grounded feature descriptions.
  • Targeted Retry for Missing Entities: When the LLM fails to extract all entities, subsequent retries send targeted conversational follow-up messages listing only the missing items, rather than re-sending the entire code, reducing token usage.
  • Dedicated Prompts for Test Files: Test files are now identified and routed to specialized prompts that instruct the LLM to describe the purpose of the tests (what is being tested) instead of their implementation details.
  • Batched File-Level Summaries: File-level summaries are initially generated using heuristic placeholders and then batched for a single LLM call to aggregate them, significantly reducing LLM round-trips for large repositories.
Changelog
  • packages/encoder/src/encoder.ts
    • Added pendingFileSummary to ExtractionResult interface.
    • Integrated repository skeleton building and setting it on the semantic extractor.
    • Modified extractEntities to accept a deferFileSummary parameter.
    • Implemented batch processing logic for deferred file summaries.
    • Updated extractEntities to conditionally use heuristic placeholders for file summaries when deferring.
  • packages/encoder/src/reorganization/prompts.ts
    • Updated buildBatchClassPrompt and buildBatchFunctionPrompt to accept an optional skeleton parameter.
    • Introduced buildBatchTestClassPrompt for test class semantic extraction.
    • Introduced buildBatchTestFunctionPrompt for test function semantic extraction.
    • Added buildBatchFileSummaryPrompt for aggregating file-level summaries.
  • packages/encoder/src/semantic.ts
    • Imported Memory for conversational LLM interactions.
    • Added skeleton property to SemanticOptions interface.
    • Implemented isTestFile utility function to identify test files.
    • Added setSkeleton method to update the repository skeleton.
    • Modified processClassGroupBatches and processFunctionBatches to differentiate between regular and test entities and to manage conversational memory for retries.
    • Updated extractClassBatch and extractFunctionBatch to use Memory for conversational retries and to select appropriate prompts based on isTest flag.
    • Added buildClassFollowUpMessage and buildFunctionFollowUpMessage for targeted retry prompts.
    • Modified aggregateFileFeatures to include a skipLLM parameter for deferred processing.
    • Implemented aggregateFileFeaturesInBatch for efficient batch generation of file summaries.
Activity
  • The pull request was created by amondnet.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant improvements to the semantic extraction process, aligning it more closely with the Python reference implementation. The changes address four key areas: providing repository skeleton context to the LLM, implementing conversational retries for missing entities, adding special handling for test files with dedicated prompts, and batching file summary generation to reduce LLM round-trips. These changes are well-implemented and should improve both the quality of the semantic features and the efficiency of the encoding process.

I have one suggestion for a minor efficiency improvement in the batch file summary generation logic.

@cubic-dev-ai bot left a comment

3 issues found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/encoder/src/reorganization/prompts.ts">

<violation number="1" location="packages/encoder/src/reorganization/prompts.ts:457">
P2: Keying test-class results by bare class name risks collisions when duplicate class names exist across test files. Use composite identifiers (e.g., "filePath::ClassName") in the prompt/output contract so batch parsing can disambiguate entries.

(Based on your team's feedback about using composite identifiers for LLM batch prompts.) [FEEDBACK_USED]</violation>

<violation number="2" location="packages/encoder/src/reorganization/prompts.ts:512">
P2: Keying test-function results by bare function name risks collisions when duplicate names exist across test files. Use composite identifiers (e.g., "filePath::functionName") in the prompt/output contract so batch parsing can disambiguate entries.

(Based on your team's feedback about using composite identifiers for LLM batch prompts.) [FEEDBACK_USED]</violation>
</file>
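The composite-identifier contract the reviewer proposes in violations 1 and 2 could be as simple as this hypothetical sketch (these helpers are not part of the PR):

```typescript
// Hypothetical helpers for the "filePath::Name" contract: bare names can
// collide across files, so the file path disambiguates each batch entry.
function makeEntityKey(filePath: string, name: string): string {
  return `${filePath}::${name}`;
}

function parseEntityKey(key: string): { filePath: string; name: string } {
  const idx = key.lastIndexOf("::");
  if (idx < 0) throw new Error(`Invalid composite key: ${key}`);
  return { filePath: key.slice(0, idx), name: key.slice(idx + 2) };
}
```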

<file name="packages/encoder/src/semantic.ts">

<violation number="1" location="packages/encoder/src/semantic.ts:951">
P2: Recoverable batch file-summary failures should also be recorded in `this.warnings`; currently they are only logged, so callers lose visibility into partial degradation.

(Based on your team's feedback about pushing non-fatal warnings to a warnings array.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…4 batch improvements

Gap 2-A: Include already-parsed function names in follow-up messages
- buildFunctionFollowUpMessage now accepts alreadyParsedNames and prevInvalidKeys
- processFunctionBatches tracks successfully parsed function names across iterations
- Mirrors Python semantic_parsing.py:803-806 "So far, you've extracted features for: ..."

Gap 2-B: Detect and report invalid keys in function batch responses
- extractFunctionBatch detects keys in parsed response not matching valid function names
- Invalid keys are reported in the next follow-up message (mirrors Python :808-811)
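Gaps 2-A and 2-B together suggest a follow-up builder along these lines. The parameter names follow the commit message, but the message wording is a sketch, not the actual prompt text:

```typescript
// Sketch of the enriched follow-up: name what is missing, what was already
// extracted (Gap 2-A), and any invalid keys from the prior reply (Gap 2-B).
function buildFunctionFollowUpMessage(
  missing: string[],
  alreadyParsedNames: string[] = [],
  prevInvalidKeys: string[] = [],
): string {
  const parts: string[] = [];
  if (alreadyParsedNames.length > 0) {
    parts.push(`So far, you've extracted features for: ${alreadyParsedNames.join(", ")}.`);
  }
  if (prevInvalidKeys.length > 0) {
    parts.push(
      `Your previous reply used keys that are not valid function names: ` +
        `${prevInvalidKeys.join(", ")}. Do not use them again.`,
    );
  }
  parts.push(`Now provide features for only these functions: ${missing.join(", ")}.`);
  return parts.join(" ");
}
```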

Gap 2-C: Replace "/" with " or " in feature strings
- featureListToSemanticFeature normalizes slash-separated values (mirrors Python :687,795)
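The Gap 2-C normalization reduces to a one-line helper. featureSlashToOr is a hypothetical name; in the PR the logic lives inside featureListToSemanticFeature:

```typescript
// Replace slash-separated values with " or " so features read as alternatives.
function featureSlashToOr(feature: string): string {
  return feature.replace(/\s*\/\s*/g, " or ");
}
```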

Gap 4-A: Include entity names/types in batch file summary prompts
- aggregateFileFeaturesInBatch accepts optional childEntities with name/type/feature
- buildBatchFileSummaryPrompt renders entity-named sections when childEntities provided
- encoder.ts captures entity names alongside childFeatures in pendingFileSummary
- Mirrors Python summarize_file_batch feature_map format (class/function key hierarchy)

Gap 4-B: Token-aware batch splitting for file summaries
- createFileSummaryBatches splits files by token estimate (min=1000/max=8000 tokens)
- Processes batches concurrently via runConcurrent (mirrors Python ThreadPoolExecutor)
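Token-aware splitting as described could be sketched like this. The chars/4 estimator is a common rough approximation, and the real implementation also enforces the 1000-token minimum per batch, omitted here for brevity:

```typescript
// Sketch of Gap 4-B: greedily pack files into batches under a token budget.

interface FileEntry {
  path: string;
  content: string;
}

// Rough token estimate: about four characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function createFileSummaryBatches(files: FileEntry[], maxTokens = 8000): FileEntry[][] {
  const batches: FileEntry[][] = [];
  let current: FileEntry[] = [];
  let currentTokens = 0;
  for (const file of files) {
    const cost = estimateTokens(file.content);
    // Close the current batch when adding this file would exceed the budget.
    if (current.length > 0 && currentTokens + cost > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(file);
    currentTokens += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```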

Gap 4-C: Memory-based conversational retry for missing file summaries
- runFileSummaryBatch retries up to 3 times with follow-up for missing file paths
- Mirrors Python summarize_file_batch() conversational retry logic

Review fixes:
- encoder.ts: pre-filter allExtractionResults once to avoid redundant full-list iteration
- semantic.ts: push batch file summary LLM failures to this.warnings for caller visibility

@cubic-dev-ai bot left a comment

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/encoder/src/semantic.ts">

<violation number="1" location="packages/encoder/src/semantic.ts:1093">
P2: Parsed LLM response fields `description` and `keywords` are cast without type validation. Verify `description` is a string and `keywords` (if present) is an `Array<string>` before use to guard against malformed LLM output.

(Based on your team's feedback about validating parsed JSON mappings before use.) [FEEDBACK_USED]</violation>
</file>


…atch

Validate that `description` is a string and `keywords` (if present) is
an array of strings before use, guarding against malformed LLM output.

Previously, the fields were cast without runtime type checks. Now the
`description` field is explicitly verified via `typeof === 'string'` in
the if-condition, and `keywords` is filtered with a type predicate to
strip any non-string elements.
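The guard described above amounts to a small runtime check. validateFileSummary is a hypothetical name used to show the shape of the checks; the PR performs them inline:

```typescript
// Sketch of the fix: reject a non-string description and silently strip
// non-string keywords before trusting the parsed LLM output.

interface FileSummary {
  description: string;
  keywords: string[];
}

function validateFileSummary(raw: unknown): FileSummary | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const { description, keywords } = raw as { description?: unknown; keywords?: unknown };
  if (typeof description !== "string") return undefined; // malformed output
  const safeKeywords = Array.isArray(keywords)
    ? keywords.filter((k): k is string => typeof k === "string") // type predicate
    : [];
  return { description, keywords: safeKeywords };
}
```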
amondnet self-assigned this on Mar 5, 2026
amondnet merged commit aa3f980 into main on Mar 5, 2026 (6 checks passed)
amondnet deleted the feat/encoder-reference-gap-improvements branch on Mar 5, 2026 at 08:35

sonarqubecloud bot commented Mar 5, 2026

pleaeai-bot mentioned this pull request on Mar 5, 2026