Conversation
…lysis Implement four improvements identified by comparing the TypeScript encoder against the Python reference implementation (vendor/RPG-ZeroRepo):

Gap 1 (HIGH): Add repository skeleton to batch prompts
- Thread the skeleton from buildRepoSkeleton() into SemanticExtractor via setSkeleton() and pass it to buildBatchClassPrompt / buildBatchFunctionPrompt
- Gives the LLM structural context for architecturally grounded features

Gap 2 (HIGH): Conversational follow-up for missing entities on retry
- Replace the full re-prompt with targeted follow-up messages that list only the missing class methods or function names
- Uses Memory with contextWindow: 0 so prior code context is preserved

Gap 3 (MEDIUM): Dedicated test-file prompts
- Add isTestFile() detection and buildBatchTestClassPrompt / buildBatchTestFunctionPrompt, which instruct the LLM to describe what is being tested, not the test mechanics
- Routes test entities to the dedicated prompts instead of the generic ones

Gap 4 (LOW-MEDIUM): Batched file summary generation
- Two-pass extraction: the first pass uses a heuristic placeholder; the second pass batches all deferred file summaries into a single LLM call via aggregateFileFeaturesInBatch() and buildBatchFileSummaryPrompt()
- Reduces per-file LLM round-trips for large repositories
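The Gap 1 threading could look roughly like the following sketch. The names SemanticExtractor, setSkeleton, and buildBatchClassPrompt come from this PR, but the bodies here are illustrative assumptions, not the actual implementation:

```typescript
// Hypothetical sketch of Gap 1: threading a repo skeleton into batch prompts.
// The prompt wording and internal structure are assumptions for illustration.

function buildBatchClassPrompt(classNames: string[], skeleton?: string): string {
  // Prepend the repository skeleton (when available) so the LLM sees where
  // each class sits in the overall architecture.
  const context = skeleton ? `Repository structure:\n${skeleton}\n\n` : "";
  return `${context}Describe the role of each class:\n${classNames.join("\n")}`;
}

class SemanticExtractor {
  private skeleton?: string;

  // Called once from encode() after buildRepoSkeleton().
  setSkeleton(skeleton: string): void {
    this.skeleton = skeleton;
  }

  promptForClasses(classNames: string[]): string {
    // The stored skeleton is passed through to every batch prompt.
    return buildBatchClassPrompt(classNames, this.skeleton);
  }
}
```

The key point is that the skeleton is computed once and reused, rather than rebuilt per batch.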
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the quality of semantic feature extraction within the encoder by implementing several improvements identified through comparison with a Python reference. These changes provide the LLM with richer contextual information, optimize retry mechanisms for efficiency, tailor prompts for test code analysis, and streamline file-level summary generation, ultimately leading to more accurate and efficient code understanding.
Code Review
This pull request introduces significant improvements to the semantic extraction process, aligning it more closely with the Python reference implementation. The changes address four key areas: providing repository skeleton context to the LLM, implementing conversational retries for missing entities, adding special handling for test files with dedicated prompts, and batching file summary generation to reduce LLM round-trips. These changes are well-implemented and should improve both the quality of the semantic features and the efficiency of the encoding process.
I have one suggestion for a minor efficiency improvement in the batch file summary generation logic.
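The conversational retry for missing entities could be sketched as below. buildFunctionFollowUpMessage, alreadyParsedNames, and prevInvalidKeys are named in this PR's commits, but the exact message wording here is an illustrative assumption:

```typescript
// Hedged sketch of the Gap 2 follow-up: instead of re-sending all code on
// retry, the encoder sends a short message listing only what is missing,
// what was already extracted, and any invalid keys from the last response.

function buildFunctionFollowUpMessage(
  missingNames: string[],
  alreadyParsedNames: string[],
  prevInvalidKeys: string[] = [],
): string {
  const parts: string[] = [];
  if (alreadyParsedNames.length > 0) {
    // Mirrors the Python reference's "So far, you've extracted features for: ..."
    parts.push(`So far, you've extracted features for: ${alreadyParsedNames.join(", ")}.`);
  }
  if (prevInvalidKeys.length > 0) {
    parts.push(`These keys did not match any function and were ignored: ${prevInvalidKeys.join(", ")}.`);
  }
  parts.push(`Please provide features for the remaining functions: ${missingNames.join(", ")}.`);
  return parts.join("\n");
}
```

Because a Memory instance with contextWindow: 0 carries the conversation across iterations, the LLM still has the original code in context and only this short delta needs to be sent.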
3 issues found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/encoder/src/reorganization/prompts.ts">
<violation number="1" location="packages/encoder/src/reorganization/prompts.ts:457">
P2: Keying test-class results by bare class name risks collisions when duplicate class names exist across test files. Use composite identifiers (e.g., "filePath::ClassName") in the prompt/output contract so batch parsing can disambiguate entries.
(Based on your team's feedback about using composite identifiers for LLM batch prompts.) [FEEDBACK_USED]</violation>
<violation number="2" location="packages/encoder/src/reorganization/prompts.ts:512">
P2: Keying test-function results by bare function name risks collisions when duplicate names exist across test files. Use composite identifiers (e.g., "filePath::functionName") in the prompt/output contract so batch parsing can disambiguate entries.
(Based on your team's feedback about using composite identifiers for LLM batch prompts.) [FEEDBACK_USED]</violation>
</file>
<file name="packages/encoder/src/semantic.ts">
<violation number="1" location="packages/encoder/src/semantic.ts:951">
P2: Recoverable batch file-summary failures should also be recorded in `this.warnings`; currently they are only logged, so callers lose visibility into partial degradation.
(Based on your team's feedback about pushing non-fatal warnings to a warnings array.) [FEEDBACK_USED]</violation>
</file>
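The composite-identifier contract the reviewer suggests could look roughly like this. The types and parseBatchResponse helper here are hypothetical illustrations, not code from this PR:

```typescript
// Illustrative sketch: key batch results by "filePath::name" so duplicate
// class or function names across test files cannot collide during parsing.

interface TestEntity {
  filePath: string;
  name: string;
}

const compositeKey = (e: TestEntity): string => `${e.filePath}::${e.name}`;

function parseBatchResponse(
  raw: Record<string, string>,
  entities: TestEntity[],
): Map<TestEntity, string> {
  // Build a lookup from composite key back to the entity it identifies.
  const byKey = new Map(entities.map((e) => [compositeKey(e), e]));
  const out = new Map<TestEntity, string>();
  for (const [key, feature] of Object.entries(raw)) {
    const entity = byKey.get(key);
    if (entity) out.set(entity, feature); // unknown keys are simply ignored
  }
  return out;
}
```

With bare names, two `Helper` classes in different test files would overwrite each other; composite keys keep each entry unambiguous on both the prompt and parsing sides.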
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…4 batch improvements

Gap 2-A: Include already-parsed function names in follow-up messages
- buildFunctionFollowUpMessage now accepts alreadyParsedNames and prevInvalidKeys
- processFunctionBatches tracks successfully parsed function names across iterations
- Mirrors Python semantic_parsing.py:803-806 "So far, you've extracted features for: ..."

Gap 2-B: Detect and report invalid keys in function batch responses
- extractFunctionBatch detects keys in the parsed response that do not match valid function names
- Invalid keys are reported in the next follow-up message (mirrors Python :808-811)

Gap 2-C: Replace "/" with " or " in feature strings
- featureListToSemanticFeature normalizes slash-separated values (mirrors Python :687,795)

Gap 4-A: Include entity names/types in batch file summary prompts
- aggregateFileFeaturesInBatch accepts optional childEntities with name/type/feature
- buildBatchFileSummaryPrompt renders entity-named sections when childEntities is provided
- encoder.ts captures entity names alongside childFeatures in pendingFileSummary
- Mirrors the Python summarize_file_batch feature_map format (class/function key hierarchy)

Gap 4-B: Token-aware batch splitting for file summaries
- createFileSummaryBatches splits files by token estimate (min=1000/max=8000 tokens)
- Processes batches concurrently via runConcurrent (mirrors Python ThreadPoolExecutor)

Gap 4-C: Memory-based conversational retry for missing file summaries
- runFileSummaryBatch retries up to 3 times with a follow-up for missing file paths
- Mirrors the Python summarize_file_batch() conversational retry logic

Review fixes:
- encoder.ts: pre-filter allExtractionResults once to avoid redundant full-list iteration
- semantic.ts: push batch file summary LLM failures to this.warnings for caller visibility
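The Gap 4-B token-aware splitting could be sketched as follows. createFileSummaryBatches and the 8000-token ceiling come from the commit above; the 4-characters-per-token estimate and the greedy packing are illustrative assumptions:

```typescript
// Hedged sketch of Gap 4-B: greedily pack files into batches bounded by an
// estimated token budget, so one oversized batch never blows the context.

interface PendingFile {
  path: string;
  content: string;
}

// Rough heuristic: ~4 characters per token (an assumption, not the PR's exact estimator).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function createFileSummaryBatches(
  files: PendingFile[],
  maxTokens = 8000,
): PendingFile[][] {
  const batches: PendingFile[][] = [];
  let current: PendingFile[] = [];
  let currentTokens = 0;
  for (const file of files) {
    const tokens = estimateTokens(file.content);
    // Flush the current batch when adding this file would exceed the budget.
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(file);
    currentTokens += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each resulting batch can then be summarized concurrently (the PR uses a runConcurrent helper, mirroring Python's ThreadPoolExecutor).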
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/encoder/src/semantic.ts">
<violation number="1" location="packages/encoder/src/semantic.ts:1093">
P2: Parsed LLM response fields `description` and `keywords` are cast without type validation. Verify `description` is a string and `keywords` (if present) is an `Array<string>` before use to guard against malformed LLM output.
(Based on your team's feedback about validating parsed JSON mappings before use.) [FEEDBACK_USED]</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…atch Validate that `description` is a string and `keywords` (if present) is an array of strings before use, guarding against malformed LLM output. Previously, these fields were cast without runtime type checks. Now the `description` field is explicitly verified via `typeof === 'string'` in the if-condition, and `keywords` is filtered with a type predicate that strips any non-string elements.
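The validation described above could be sketched like this. The field names (`description`, `keywords`) come from the review; the guard function and its exact logic are illustrative assumptions:

```typescript
// Hedged sketch: validate a parsed LLM file-summary object at runtime
// instead of casting blindly, returning null when the shape is wrong.

interface FileSummary {
  description: string;
  keywords?: string[];
}

function validateSummary(parsed: unknown): FileSummary | null {
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as Record<string, unknown>;
  // description is required and must be a string.
  if (typeof record.description !== "string") return null;
  // keywords is optional; a type predicate strips any non-string elements.
  const keywords = Array.isArray(record.keywords)
    ? record.keywords.filter((k): k is string => typeof k === "string")
    : undefined;
  return { description: record.description, keywords };
}
```

Returning null (rather than throwing) lets the caller treat a malformed entry as a missing summary and fall back to the conversational retry path.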



Summary
Implements four quality improvements to the TypeScript encoder identified by comparing it against the Python reference implementation (vendor/RPG-ZeroRepo/zerorepo/rpg_encoder/).

Gap 1 (HIGH): The repository skeleton is now passed to buildBatchClassPrompt and buildBatchFunctionPrompt via a new `skeleton?: string` param, giving the LLM architectural context for more grounded feature descriptions. The skeleton is built once in encode() and threaded via SemanticExtractor.setSkeleton().

Gap 2 (HIGH): On retry for missing entities, instead of re-sending the full prompt with all code, the encoder now sends a targeted conversational follow-up message listing only the specific class methods or function names that are missing. A Memory instance with `contextWindow: 0` is carried across iterations so the LLM retains the original code context.

Gap 3 (MEDIUM): Test files are detected via isTestFile() and routed to new dedicated prompts (buildBatchTestClassPrompt, buildBatchTestFunctionPrompt) that instruct the LLM to describe what is being tested rather than the test mechanics.

Gap 4 (LOW-MEDIUM): File-level summaries are now batched. A first pass uses heuristic placeholders (`skipLLM: true`); after all files are processed, aggregateFileFeaturesInBatch() calls buildBatchFileSummaryPrompt() once for all deferred files, reducing LLM round-trips significantly for large repositories.

Test plan

- bun run test packages/encoder/tests/semantic-batching.test.ts still passes (token-aware batching unchanged)

Summary by cubic
Improves the encoder’s semantic extraction with richer repo context, test-aware prompts, smarter conversational retries, and token-aware batched file summaries for higher accuracy, fewer LLM calls, and safer batch outputs.
Written for commit cb8ea5d. Summary will update on new commits.