Skip to content

feat(native): port Groovy extractor to Rust#1101

Open
carlos-alm wants to merge 4 commits into
mainfrom
feat/1071-groovy-rust-extractor
Open

feat(native): port Groovy extractor to Rust#1101
carlos-alm wants to merge 4 commits into
mainfrom
feat/1071-groovy-rust-extractor

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Add tree-sitter-groovy (v0.1) dependency and a native Groovy extractor under crates/codegraph-core/src/extractors/groovy.rs, mirroring the WASM-side extractGroovySymbols behaviour.
  • Wire Groovy into LanguageKind, the parser registry, file collector extensions, AST type/string maps, and NATIVE_SUPPORTED_EXTENSIONS so the native engine handles .groovy / .gvy files end-to-end.
  • Update the native-drop classification test to reflect that Groovy is no longer in the unsupported set.

Part of #1071

Test plan

  • cargo build --release -p codegraph-core
  • cargo test -p codegraph-core --lib (191 passed)
  • npx vitest run tests/parsers/groovy.test.ts tests/parsers/native-drop-classification.test.ts (18 passed)
  • npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts -t groovy (5 passed)

Adds tree-sitter-groovy dependency and native extractor matching
the WASM-side behavior for Groovy symbol, import, and call extraction.

Part of #1071
@claude
Copy link
Copy Markdown

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Groovy Native Extractor Implementation

Reviewing this PR that ports Groovy extraction from WASM to native Rust implementation.

Todo List:

  • Read and analyze all changed files
  • Conduct root cause analysis
  • Check for configuration changes (high scrutiny)
  • Verify test integrity
  • Assess code quality and implementation
  • Evaluate backlog compliance
  • Provide final assessment and rating

Root Cause Analysis

Problem Being Solved: This PR implements native Rust support for Groovy files (.groovy, .gvy), eliminating a coverage gap where the native engine would drop Groovy files and fall back to WASM. This addresses issue #1071 and maintains engine parity.

Why This Approach: The dual-engine architecture requires both native (Rust) and WASM engines to produce identical results. Adding native Groovy support prevents engine parity warnings and allows the faster native engine to handle Groovy files end-to-end.

Risk Assessment: Low risk. This is a straightforward language addition following established patterns. The implementation mirrors existing extractors and includes comprehensive tests.

Configuration Changes (HIGH SCRUTINY)

No concerning configuration changes detected

  • Only addition is tree-sitter-groovy = "0.1" dependency in Cargo.toml - appropriate for adding language support
  • No relaxed linting rules, timeouts, or test thresholds
  • Registry and parser updates follow established patterns

Test Integrity

Test changes are appropriate and strengthen coverage

  • native-drop-classification.test.ts correctly removes .groovy/.gvy from unsupported set
  • Comprehensive unit tests in groovy.rs covering all major language features
  • No weakened assertions or reduced test strictness

Code Quality Assessment

Excellent Implementation Quality

Architecture Consistency: The Groovy extractor perfectly follows the established patterns:

  • Implements SymbolExtractor trait ✓
  • Uses helper functions from super::helpers::*
  • Handles complexity/CFG computation via shared infrastructure ✓
  • Mirrors WASM extractor behavior for engine parity ✓

Language Support Completeness:

  • Classes: ✓ Handles class_declaration/class_definition with inheritance (extends) and interfaces (implements)
  • Methods: ✓ Extracts methods, constructors, and top-level functions with parameter extraction
  • Imports: ✓ Java-style import semantics with wildcard support (import foo.bar.*)
  • Calls: ✓ Method invocations, object creation, and field access patterns
  • AST Analysis: ✓ Properly wires into GROOVY_AST_CONFIG for dataflow/complexity

Engine Parity: The implementation meticulously mirrors the WASM extractGroovySymbols behavior:

  • Handles grammar variations (both class_declaration and class_definition)
  • Java import semantics with java_import: true flag
  • Identical node kind matching for resilience across grammar versions

Error Handling: Robust with early returns and safe node access patterns.

Registry Integration

All required integration points are properly updated:

  • LanguageKind::Groovy added to enum ✓
  • from_extension() handles .groovy and .gvy
  • Tree-sitter language binding added ✓
  • WASM extensions registry includes Groovy ✓
  • AST type maps properly configured ✓

Test Coverage

Comprehensive unit tests cover:

  • Class/interface/enum extraction
  • Method parameters and class fields
  • Import statements (regular and wildcard)
  • Method calls and object creation
  • Inheritance relationships

Backlog Compliance

  • Zero-dep: ✓ — Only adds tree-sitter-groovy grammar dependency, builds on existing tree-sitter/SQLite/AST infrastructure
  • Foundation-aligned: ✓ — Fully compliant with all FOUNDATION.md principles. No cloud APIs, external dependencies, or agent behavior
  • Problem-fit: 4 — Directly improves agent accuracy by preventing engine parity drops and false "unsupported language" warnings that waste tokens during AI orientation
  • Breaking: No — Purely additive. Existing behavior unchanged, only adds native support for previously WASM-only language
  • Tier: 1 — Zero-dep ✓ + foundation-aligned ✓

Minor Observations

  • Line groovy.rs:283 - java_import = Some(true) is correctly set for import resolution
  • The extensive node kind variations handled (e.g., class_declaration vs class_definition) shows good defensive programming
  • Tests verify both extraction completeness and structural correctness

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Eliminating native engine coverage gap for Groovy files to maintain dual-engine parity
**Why This Approach**: Follows established language addition patterns, ensures both engines handle identical file sets
**Risk Assessment**: Low - straightforward implementation with comprehensive tests

## Backlog Compliance
- Zero-dep: ✓ — grammar dependency only
- Foundation-aligned: ✓ — no principle violations
- Problem-fit: 4 — directly improves agent accuracy
- Breaking: No
- Tier: 1

## Critical Concerns
None. This is a high-quality, well-tested implementation that follows all established patterns.

## Final Recommendation
- Rating: ⭐⭐⭐⭐☆ (4/5)
- Action: APPROVE
- Reasoning: Excellent implementation quality, comprehensive testing, perfect adherence to architectural patterns, and direct value for AI agent accuracy. Only missing the 5th star due to being an incremental language addition rather than transformative functionality.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the Groovy symbol extractor from JavaScript/WASM to Rust, wiring Groovy into the full native pipeline — LanguageKind, parser registry, file collector, AST type/string maps, and NATIVE_SUPPORTED_EXTENSIONS. The Rust implementation mirrors the JS source-of-truth in src/extractors/groovy.ts, with parity-enhancing additions like spread_parameter support and recursive type_list handling for interface collections.

  • groovy.rs: New 500-line extractor covering class, interface, enum, method, constructor, function, import, call, and object-creation nodes; includes inline unit tests for all major extraction paths.
  • Wiring changes: parser_registry.rs, file_collector.rs, mod.rs, parser.ts, and src/ast-analysis/rules/index.ts updated consistently to register .groovy/.gvy as native-supported.
  • Test update: native-drop-classification.test.ts corrects the unsupported-by-native count now that Groovy is handled natively.

Confidence Score: 5/5

Safe to merge — all previously identified divergences from the JS source-of-truth were addressed in follow-up commits, and the full test suite (191 Rust + 18 TS) passes.

The extractor faithfully mirrors the JS source-of-truth across all node kinds, the wiring changes are mechanical and consistent across Rust and TypeScript sides, and prior review findings have been resolved. No new logic gaps were found.

No files require special attention.

Important Files Changed

Filename Overview
crates/codegraph-core/src/extractors/groovy.rs New 500-line Rust extractor mirroring JS source-of-truth; all discussed divergences (identifier in enum body, juxt_function_call doc, gstring) addressed in follow-up commits. Inline unit tests cover all major paths.
crates/codegraph-core/src/extractors/helpers.rs Adds GROOVY_AST_CONFIG with string_literal + gstring (defensive), throw_statement, and object_creation_expression — matches the TS config.
crates/codegraph-core/src/parser_registry.rs Adds Groovy variant to LanguageKind, wires groovy/gvy extension mapping, tree_sitter_groovy::LANGUAGE, and the EXPECTED_LEN guard test.
crates/codegraph-core/src/extractors/mod.rs Registers the groovy module and dispatches LanguageKind::Groovy to GroovyExtractor::extract_with_opts — straightforward wiring.
src/ast-analysis/rules/index.ts Adds GROOVY_AST_TYPES and GROOVY_STRING_CONFIG (both quote chars) and registers them in the language maps; matches the Rust-side config.
src/domain/parser.ts Adds .groovy and .gvy to NATIVE_SUPPORTED_EXTENSIONS, completing the JS-side awareness that the native engine handles these files.
crates/codegraph-core/src/file_collector.rs Appends groovy and gvy to SUPPORTED_EXTENSIONS; matches the parser_registry extension-to-language mapping.
tests/parsers/native-drop-classification.test.ts Removes .groovy from the unsupported-by-native set and decrements expected count from 9 to 8 — correct and minimal update.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[".groovy / .gvy file"] --> B[file_collector.rs\nSUPPORTED_EXTENSIONS]
    B --> C[parser_registry.rs\nLanguageKind::Groovy]
    C --> D[tree_sitter_groovy::LANGUAGE\nparse → Tree]
    D --> E[extractors/mod.rs\nextract_symbols_with_opts]
    E --> F[groovy::GroovyExtractor]
    F --> G1[walk_tree → match_groovy_node]
    F --> G2[walk_ast_nodes_with_config\nGROOVY_AST_CONFIG]
    G1 --> H1[class / interface / enum\ndefinitions]
    G1 --> H2[method / constructor /\nfunction definitions]
    G1 --> H3[imports\njava_import=true]
    G1 --> H4[calls &\nobject_creation]
    G2 --> H5[AST nodes\nstring / throw / new]
    H1 & H2 & H3 & H4 & H5 --> I[FileSymbols]

    J["NATIVE_SUPPORTED_EXTENSIONS\n(parser.ts)"] -.->|JS engine awareness| A
    K["GROOVY_AST_TYPES + GROOVY_STRING_CONFIG\n(rules/index.ts)"] -.->|TS-side config parity| G2
Loading

Reviews (4): Last reviewed commit: "fix: resolve merge conflicts with main (..." | Re-trigger Greptile

if let Some(body) = body {
for i in 0..body.child_count() {
let Some(child) = body.child(i) else { continue };
if child.kind() == "enum_constant" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The Rust port diverges from the JS source-of-truth here. In handleGroovyEnumDecl (groovy.ts line 163) the JS extractor accepts both enum_constant and identifier children as enum members. The Rust port only matches enum_constant, so Groovy enums whose body produces bare identifier nodes (which can occur with some grammar variants or simple Groovy enum Color { RED, GREEN } forms) will have no extracted children in the Rust engine while the JS engine would capture them correctly.

Suggested change
if child.kind() == "enum_constant" {
if child.kind() == "enum_constant" || child.kind() == "identifier" {

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 74cd047. Added identifier to the enum body match — now matches the JS source-of-truth in handleGroovyEnumDecl (groovy.ts:163).

Comment on lines +54 to +56
"method_invocation" | "method_call" | "call_expression" | "function_call" => {
handle_call_expr(node, source, symbols)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The docstring says the extractor mirrors extractGroovySymbols and mentions juxt_function_call as a node kind present in the grammar, but it does not appear in match_groovy_node. The JS extractor also omits it, so this isn't a regression, but the discrepancy between the doc and the implementation means Groovy command-style calls like foo bar(x) are silently dropped from the call graph. At minimum the doc claim should be removed or an arm added.

Suggested change
"method_invocation" | "method_call" | "call_expression" | "function_call" => {
handle_call_expr(node, source, symbols)
}
"method_invocation" | "method_call" | "call_expression" | "function_call"
| "juxt_function_call" => {
handle_call_expr(node, source, symbols)
}

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the doc in 74cd047. The doc previously listed juxt_function_call alongside features the extractor handles, which was misleading. The JS source-of-truth also omits it, so I haven't added the arm here to preserve parity. Tracked separately in #1108 — fix needs to land in both engines to keep them in lockstep.

Comment on lines +363 to +371
pub const GROOVY_AST_CONFIG: LangAstConfig = LangAstConfig {
new_types: &["object_creation_expression"],
throw_types: &["throw_statement"],
await_types: &[],
string_types: &["string_literal"],
regex_types: &[],
quote_chars: &['\'', '"'],
string_prefixes: &[],
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Groovy's double-quoted strings are GStrings, which the tree-sitter-groovy grammar represents with a gstring node (not string_literal). string_literal covers only single-quoted strings. The same gap exists in GROOVY_AST_TYPES on the TypeScript side. Both configs are consistent with each other, but any AST-node analysis that looks for string literals in Groovy code will silently miss all GString interpolations.

Suggested change
pub const GROOVY_AST_CONFIG: LangAstConfig = LangAstConfig {
new_types: &["object_creation_expression"],
throw_types: &["throw_statement"],
await_types: &[],
string_types: &["string_literal"],
regex_types: &[],
quote_chars: &['\'', '"'],
string_prefixes: &[],
};
pub const GROOVY_AST_CONFIG: LangAstConfig = LangAstConfig {
new_types: &["object_creation_expression"],
throw_types: &["throw_statement"],
await_types: &[],
string_types: &["string_literal", "gstring"],
regex_types: &[],
quote_chars: &['\'', '"'],
string_prefixes: &[],
};

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added gstring defensively in 74cd047. Verified against node_modules/tree-sitter-groovy/grammar.js and src/node-types.json: the 0.1.x grammar uses string_literal for both single- and double-quoted strings (it extends tree-sitter-java) and does not emit a separate gstring node, so this is a no-op for the current grammar — but it costs nothing and guards against future grammar variants. Matched the TS config.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added gstring defensively in 74cd047. Verified against node_modules/tree-sitter-groovy/grammar.js and src/node-types.json: the 0.1.x grammar uses string_literal for both single- and double-quoted strings (it extends tree-sitter-java) and does not emit a separate gstring node, so this is a no-op for the current grammar — but it costs nothing and guards against future grammar variants. Matched the TS config.

Comment on lines +156 to +160
const GROOVY_AST_TYPES: Record<string, string> = {
object_creation_expression: 'new',
throw_statement: 'throw',
string_literal: 'string',
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Mirrors the gap in the Rust config: string_literal covers single-quoted strings only; gstring should also be listed for double-quoted / interpolated Groovy strings. The GROOVY_STRING_CONFIG already includes '" as quote chars, so only the type key is missing.

Suggested change
const GROOVY_AST_TYPES: Record<string, string> = {
object_creation_expression: 'new',
throw_statement: 'throw',
string_literal: 'string',
};
const GROOVY_AST_TYPES: Record<string, string> = {
object_creation_expression: 'new',
throw_statement: 'throw',
string_literal: 'string',
gstring: 'string',
};

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added gstring to GROOVY_AST_TYPES in 74cd047, matching the Rust config change. As noted in the Rust-side reply, the current tree-sitter-groovy 0.1.x grammar emits string_literal for both quote styles, so this is defensive against future grammar variants rather than a current bug fix.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 11, 2026

Codegraph Impact Analysis

30 functions changed13 callers affected across 2 files

  • GroovyExtractor.extract in crates/codegraph-core/src/extractors/groovy.rs:28 (0 transitive callers)
  • find_groovy_parent_class in crates/codegraph-core/src/extractors/groovy.rs:45 (3 transitive callers)
  • match_groovy_node in crates/codegraph-core/src/extractors/groovy.rs:49 (0 transitive callers)
  • handle_class_decl in crates/codegraph-core/src/extractors/groovy.rs:68 (1 transitive callers)
  • collect_interfaces in crates/codegraph-core/src/extractors/groovy.rs:121 (2 transitive callers)
  • handle_interface_decl in crates/codegraph-core/src/extractors/groovy.rs:154 (1 transitive callers)
  • handle_enum_decl in crates/codegraph-core/src/extractors/groovy.rs:168 (1 transitive callers)
  • handle_method_decl in crates/codegraph-core/src/extractors/groovy.rs:202 (1 transitive callers)
  • handle_constructor_decl in crates/codegraph-core/src/extractors/groovy.rs:223 (1 transitive callers)
  • handle_function_decl in crates/codegraph-core/src/extractors/groovy.rs:245 (1 transitive callers)
  • handle_import_decl in crates/codegraph-core/src/extractors/groovy.rs:262 (1 transitive callers)
  • handle_call_expr in crates/codegraph-core/src/extractors/groovy.rs:293 (1 transitive callers)
  • handle_object_creation in crates/codegraph-core/src/extractors/groovy.rs:342 (1 transitive callers)
  • extract_params in crates/codegraph-core/src/extractors/groovy.rs:361 (4 transitive callers)
  • extract_class_fields in crates/codegraph-core/src/extractors/groovy.rs:385 (2 transitive callers)
  • parse_groovy in crates/codegraph-core/src/extractors/groovy.rs:417 (7 transitive callers)
  • extracts_class_and_methods in crates/codegraph-core/src/extractors/groovy.rs:427 (0 transitive callers)
  • extracts_method_parameters in crates/codegraph-core/src/extractors/groovy.rs:437 (0 transitive callers)
  • extracts_class_fields in crates/codegraph-core/src/extractors/groovy.rs:448 (0 transitive callers)
  • extracts_imports in crates/codegraph-core/src/extractors/groovy.rs:459 (0 transitive callers)

…tries

Address Greptile review on PR #1101:

- Rust enum handler now accepts both `enum_constant` and `identifier` children,
  matching the JS source-of-truth in `handleGroovyEnumDecl` (groovy.ts:163).
  Without this, Groovy enums whose grammar emits bare identifier nodes had no
  extracted members in the native engine.
- Add `gstring` defensively to GROOVY_AST_CONFIG (Rust) and GROOVY_AST_TYPES (TS).
  tree-sitter-groovy 0.1.x emits `string_literal` for both quote styles, but
  this keeps both engines resilient to grammar variants.
- Clarify module doc: `juxt_function_call` was previously listed alongside
  features the extractor handles. It is intentionally unhandled (matches JS).
  Tracked in #1108 for adding support to both engines.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant