
feat(native): port Erlang extractor to Rust#1103

Open
carlos-alm wants to merge 4 commits into main from feat/1071-erlang-rust-extractor

Conversation

@carlos-alm
Contributor

Summary

  • Adds tree-sitter-erlang dependency and a native Erlang extractor in crates/codegraph-core/src/extractors/erlang.rs.
  • Registers .erl/.hrl with LanguageKind::Erlang and the Rust file_collector, removes Erlang from the WASM-only drop list, and wires ERLANG_AST_CONFIG (string literals) on both the native and JS sides.
  • Mirrors extractErlangSymbols: module declarations as module, function clauses (deduplicated per arity) as function, records as record, type aliases/opaque as type, macros as variable, plus -include/-include_lib/-import directives and module:function/local call extraction.

Closes #1071

Test plan

  • cargo build --release -p codegraph-core (clean build)
  • cargo test -p codegraph-core --lib — 191/191 (7 new erlang extractor tests)
  • npx tree-sitter build --wasm regenerates tree-sitter-erlang.wasm
  • npx vitest run tests/parsers/erlang.test.ts — 5/5
  • npx vitest run tests/parsers/native-drop-classification.test.ts — 13/13

Adds tree-sitter-erlang dependency and native extractor matching
the WASM-side behavior for Erlang symbol, import, and call extraction.

Part of #1071
@claude

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task


Root Cause Analysis

Problem Being Solved: This PR ports the Erlang extractor from WASM-only to native Rust implementation, achieving engine parity. Currently, Erlang files are only parsed by the WASM engine, causing performance overhead and engine inconsistencies.

Why This Approach: Following the established dual-engine architecture pattern used for all other supported languages. The native Rust implementation provides better performance and eliminates the need for WASM fallback on Erlang files.

Risk Assessment: Low risk - this is additive functionality following existing patterns. The implementation mirrors the existing WASM extractErlangSymbols function and includes comprehensive test coverage.

Backlog Compliance

  • Zero-dep: ✓ - Only adds tree-sitter-erlang = "0.16" as a parsing dependency, which follows the existing pattern for all other languages
  • Foundation-aligned: ✓ - Fully compliant with all principles. Adds local parsing capability without external dependencies or cloud services
  • Problem-fit: 4/5 - Directly improves native engine coverage and parsing performance for Erlang codebases
  • Breaking: No - Purely additive, no existing behavior changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Phase-by-Phase Analysis

Phase 2: Configuration Changes ✓

No concerning rule relaxations found. All changes are additive:

  • Cargo.toml: adds tree-sitter-erlang dependency
  • No timeout increases, test threshold reductions, or linting rule disables

Phase 3: Test Integrity ✓

Strong test coverage across both native and WASM engines:

  • Native Rust: 7 new extractor tests in erlang.rs:324-396 covering module, function, record, import, call, include, and multi-clause deduplication
  • WASM JS: 5 tests in tests/parsers/erlang.test.ts verifying equivalent extraction
  • Resolution benchmark: Hand-annotated expected edges in tests/benchmarks/resolution/fixtures/erlang/expected-edges.json
  • Author claims 191/191 cargo tests and 5/5 vitest tests passing

Phase 4: Code Quality Analysis ✓

Architecture Adherence: Perfect consistency with existing patterns

  • Follows identical structure to other language extractors (walk_tree + walk_ast_nodes_with_config)
  • Proper registration in parser_registry.rs with .erl/.hrl extensions
  • Native/WASM AST config sync: ERLANG_AST_CONFIG mirrors JS ERLANG_AST_TYPES

Symbol Extraction Coverage: Comprehensive Erlang language support

  • Modules: -module(name) → kind: "module"
  • Functions: Multi-clause deduplication prevents duplicate definitions (lines 135-142)
  • Records: Field extraction with children (lines 62-77)
  • Types: -type and -opaque declarations → kind: "type"
  • Macros: -define(NAME, value) → kind: "variable" (appropriate choice)
  • Imports: -import and -include/-include_lib directives
  • Calls: Local and remote calls (module:function syntax) with proper receiver handling

Implementation Quality:

  • Robust error handling with early returns on missing nodes
  • Memory safety through proper tree-sitter node handling
  • Legacy grammar compatibility (lines 276-287) handles different tree-sitter-erlang versions
  • Consistent use of helper functions (find_child, node_text, start_line)

Engine Parity: Excellent synchronization

  • Native ERLANG_AST_CONFIG exactly matches JS ERLANG_AST_TYPES (both: string literals only)
  • Extension registration sync: .erl/.hrl in both NATIVE_SUPPORTED_EXTENSIONS and LANGUAGE_REGISTRY
  • Test removes Erlang from WASM-only drop list (tests/parsers/native-drop-classification.test.ts:80-81)

Minor Observations:

  • Function deduplication logic (lines 135-142) is sound but could benefit from a comment explaining multi-clause Erlang functions
  • String literal handling uses double quotes only, appropriate for Erlang
  • Call extraction handles both legacy and modern tree-sitter-erlang grammar shapes

Phase 5: Final Assessment

No Critical Issues Identified

This is a well-executed language port that follows all established patterns and provides comprehensive Erlang support. The dual-engine architecture requires perfect parity between native and WASM implementations, and this PR achieves that through:

  1. Comprehensive symbol extraction mirroring the WASM implementation
  2. Proper AST configuration synchronization
  3. Thorough test coverage across both engines
  4. Clean integration with existing infrastructure

The implementation quality is high, following Rust best practices and the codebase's established patterns. Engine parity compliance is excellent.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation that properly extends native engine coverage to Erlang. Follows all established patterns, includes comprehensive tests, and maintains perfect engine parity. This is exactly the kind of incremental language support addition the codebase is designed for.

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the Erlang symbol extractor from WASM/JS to native Rust, adding tree-sitter-erlang as a dependency and wiring .erl/.hrl files through the native pipeline. It also backports several robustness fixes to the JS extractor so both engines remain in sync.

  • New erlang.rs extractor handles module declarations, function clauses (deduplicated per name/arity), records, type aliases/opaque, macros, include directives, import attributes, and qualified calls.
  • JS erlang.ts updated to mirror all Rust fixes: childForFieldName with findChild fallback, namedChild iteration for params, arity-aware dedup, and type_name-wrapped atom fallback.
  • Registration plumbing and AST-node config added consistently across both the Rust and TS layers.

Confidence Score: 5/5

Safe to merge — purely additive change that adds a new language extractor without touching existing extraction paths.

All three previously flagged correctness issues (arity dedup, complex-pattern parameter counting, module-attr field-name fragility) are resolved with dedicated tests. The Rust and JS extractors are kept in sync throughout.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/codegraph-core/src/extractors/erlang.rs | New 456-line Erlang extractor; previously flagged issues (arity dedup, complex-pattern params, module attr field name) are all addressed. |
| src/extractors/erlang.ts | JS extractor updated to mirror Rust fixes; minor duplicate findChild call in handleTypeAlias. |
| crates/codegraph-core/src/parser_registry.rs | Registers Erlang language kind, maps .erl/.hrl extensions, wires grammar, updates exhaustiveness count to 26. |
| tests/parsers/erlang.test.ts | Adds two new TS tests covering distinct-arity preservation and complex-pattern arity counting. |
| tests/parsers/native-drop-classification.test.ts | Removes .erl from unsupported-by-native list, updates expected count from 10 to 9. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["'.erl' / '.hrl' file"] --> B{Native path?}
    B -- yes --> C["LanguageKind::Erlang"]
    B -- no --> D["WASM JS path"]
    C --> E["tree-sitter-erlang parse"]
    E --> F["ErlangExtractor::extract"]
    F --> G["walk_tree -> match_erlang_node"]
    G --> G1["module_attribute -> Definition (module)"]
    G --> G2["fun_decl -> Definition (function, name/arity dedup)"]
    G --> G3["record_decl -> Definition (record) + fields"]
    G --> G4["type_alias / opaque -> Definition (type)"]
    G --> G5["pp_define -> Definition (variable/macro)"]
    G --> G6["pp_include -> Import"]
    G --> G7["import_attribute -> Import"]
    G --> G8["call -> Call (local or module:func)"]
    F --> H["walk_ast_nodes_with_config (string literals)"]
    G1 & G2 & G3 & G4 & G5 & G6 & G7 & G8 & H --> I["FileSymbols"]
    D --> J["ExtractorOutput"]
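The dispatch step in the flowchart can be sketched as a plain match on the node kind. This is only an illustration, not the code in erlang.rs: the node-kind strings are taken from the flowchart, while `SymbolKind` and `classify_node_kind` are hypothetical names introduced here.

```rust
/// Symbol categories named in the flowchart. Hypothetical enum; the real
/// extractor stores kinds as strings such as "module" or "function".
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SymbolKind {
    Module,
    Function,
    Record,
    Type,
    Variable,
    Import,
    Call,
}

/// Hypothetical helper: map a tree-sitter node kind to the category the
/// flowchart assigns to it. Unlisted kinds produce no symbol.
pub fn classify_node_kind(kind: &str) -> Option<SymbolKind> {
    match kind {
        "module_attribute" => Some(SymbolKind::Module),
        "fun_decl" => Some(SymbolKind::Function),
        "record_decl" => Some(SymbolKind::Record),
        "type_alias" | "opaque" => Some(SymbolKind::Type),
        // Macros are reported as variables, per the PR summary.
        "pp_define" => Some(SymbolKind::Variable),
        "pp_include" | "import_attribute" => Some(SymbolKind::Import),
        "call" => Some(SymbolKind::Call),
        _ => None,
    }
}
```

In this sketch a node kind outside the list simply yields `None`, which corresponds to the walker moving on without recording anything.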

Reviews (3): Last reviewed commit: "fix(extractors): align Erlang record_dec..."

Comment on lines +135 to +142
// Don't duplicate if we already have this function
if symbols
    .definitions
    .iter()
    .any(|d| d.name == name && d.kind == "function")
{
    return;
}
Contributor

P1 Deduplication ignores arity, silently drops overloaded functions

The guard checks only d.name == name, so when a module defines foo/1 and foo/2 as two separate fun_decl nodes, the second fun_decl is processed, hits the check, finds the already-registered "foo" entry, and returns early. Only the first-encountered arity survives in definitions. Erlang's overloading by arity is idiomatic and common, so this will silently omit real definitions in virtually every non-trivial module. The fix is to compute arity before the guard and include it in the comparison, for example by counting expr_args children upfront and matching against d.children.as_ref().map_or(0, |c| c.len()).

Contributor Author

Fixed in 4c3c13f — dedup now compares (name, arity) so all clauses (foo/1, foo/2, ...) survive. Added a test covering three arities for the same name.
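A minimal sketch of the (name, arity) guard described in this reply, assuming arity is read from the number of recorded parameter children as the review suggested. `Definition` and `push_function_clause` are hypothetical stand-ins for illustration, not the types or functions in erlang.rs.

```rust
/// Hypothetical, simplified stand-in for the extractor's definition record.
#[derive(Debug, Clone)]
pub struct Definition {
    pub name: String,
    pub kind: String,
    /// Parameter labels; the arity is the number of entries.
    pub children: Option<Vec<String>>,
}

impl Definition {
    fn arity(&self) -> usize {
        self.children.as_ref().map_or(0, |c| c.len())
    }
}

/// Records a function clause unless a function with the same name AND
/// arity is already present. Returns true when a definition was added.
pub fn push_function_clause(
    defs: &mut Vec<Definition>,
    name: &str,
    params: Vec<String>,
) -> bool {
    let arity = params.len();
    let already_seen = defs
        .iter()
        .any(|d| d.kind == "function" && d.name == name && d.arity() == arity);
    if already_seen {
        // Another clause of the same name/arity: keep the first one only.
        return false;
    }
    defs.push(Definition {
        name: name.to_string(),
        kind: "function".to_string(),
        children: Some(params),
    });
    true
}
```

With this guard, a second clause of foo/1 is skipped while foo/2 is still recorded, which is the behavior the reply describes.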

Comment on lines +170 to +182
for i in 0..args_node.child_count() {
    let child = match args_node.child(i) {
        Some(c) => c,
        None => continue,
    };
    if child.kind() == "var" || child.kind() == "atom" {
        params.push(child_def(
            node_text(&child, source).to_string(),
            "parameter",
            start_line(&child),
        ));
    }
}
Contributor

P2 Parameters only capture bare var and atom patterns; complex pattern-match arguments (tuples {ok, X}, lists [H|T], binaries <<Bin>>) are silently skipped. For Erlang functions that rely on pattern-matching to distinguish clauses, the extracted param list will be shorter than the actual arity, which will interact badly with any arity-based deduplication fix. Consider counting all non-punctuation children of expr_args to get a stable arity count independent of pattern complexity.

Suggested change
- for i in 0..args_node.child_count() {
-     let child = match args_node.child(i) {
-         Some(c) => c,
-         None => continue,
-     };
-     if child.kind() == "var" || child.kind() == "atom" {
-         params.push(child_def(
-             node_text(&child, source).to_string(),
-             "parameter",
-             start_line(&child),
-         ));
-     }
- }
+ for i in 0..args_node.child_count() {
+     let child = match args_node.child(i) {
+         Some(c) => c,
+         None => continue,
+     };
+     // Skip punctuation so every argument pattern counts as one parameter.
+     if matches!(child.kind(), "," | "(" | ")") {
+         continue;
+     }
+     let label = if child.kind() == "var" || child.kind() == "atom" {
+         node_text(&child, source).to_string()
+     } else {
+         format!("_{}", i) // placeholder for complex patterns
+     };
+     params.push(child_def(label, "parameter", start_line(&child)));
+ }

Contributor Author

Fixed in 4c3c13f — now iterates named children so every argument pattern (tuple, list, binary) counts as one parameter. Complex patterns get a positional placeholder label so arity is preserved. Added a test for {ok, X} / [H|T] arguments.
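The labeling rule described here can be sketched without tree-sitter by modeling each named argument child as a small enum. `ArgPattern` and `param_labels` are hypothetical, and the `_<index>` placeholder format is an assumption borrowed from the reviewer's suggestion; the merged code may label complex patterns differently.

```rust
/// Hypothetical model of one named child of the argument list.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ArgPattern {
    Var(String),
    Atom(String),
    /// Tuple, list, binary, or any other non-trivial pattern.
    Complex,
}

/// Produces exactly one label per argument so that the number of labels
/// always equals the clause arity, whatever the pattern shape.
pub fn param_labels(args: &[ArgPattern]) -> Vec<String> {
    args.iter()
        .enumerate()
        .map(|(i, arg)| match arg {
            ArgPattern::Var(name) | ArgPattern::Atom(name) => name.clone(),
            // Assumed placeholder format: underscore plus position.
            ArgPattern::Complex => format!("_{}", i),
        })
        .collect()
}
```

For a clause such as `handle({ok, X}, Acc, [H|T])`, this sketch yields three labels, so an arity-based dedup sees handle/3 rather than handle/1.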

Comment on lines +37 to +42
fn handle_module_attr(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
    // module_attribute: - module ( atom ) .
    let name_node = match find_child(node, "atom") {
        Some(n) => n,
        None => return,
    };
Contributor

P2 handle_module_attr uses find_child(node, "atom") which returns the first atom child. In tree-sitter-erlang v0.16 the -module(name). attribute produces a module_attribute whose first child is often the anonymous keyword token module, but if that keyword is represented as a named atom node (grammar-version dependent), the extracted name would be the literal string "module" rather than the actual module name. The existing test passes on the current grammar version, but it is fragile. Using a field name lookup — node.child_by_field_name("name") — is more robust if the grammar exposes it.

Suggested change
- fn handle_module_attr(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
-     // module_attribute: - module ( atom ) .
-     let name_node = match find_child(node, "atom") {
-         Some(n) => n,
-         None => return,
-     };
+ fn handle_module_attr(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
+     // module_attribute: - module ( atom ) .
+     // Prefer the named field to avoid accidentally picking up the `module`
+     // keyword itself if the grammar represents it as an atom node.
+     let name_node = node
+         .child_by_field_name("name")
+         .or_else(|| find_child(node, "atom"))
+         .filter(|n| node_text(n, source) != "module");
+     let name_node = match name_node {
+         Some(n) => n,
+         None => return,
+     };

Contributor Author

Fixed in 4c3c13f — handle_module_attr now prefers child_by_field_name("name") and only falls back to find_child(node, "atom") if the grammar doesn't expose the field. handle_function_clause and extract_params use the same approach for the function-clause name and args. The behavior matches the current grammar's tests and is robust to future grammar revisions.
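The field-first lookup with a kind-based fallback can be illustrated on a toy node model. `Child`, `child_by_field`, `find_child`, and `module_name` below are hypothetical stand-ins written for this sketch; they imitate the shape of the tree-sitter calls mentioned in the reply but do not use the tree-sitter crate, and the reviewer's extra `.filter(...)` on the literal "module" is left out because the reply does not mention it.

```rust
/// Hypothetical flattened view of a node's child.
#[derive(Debug, Clone)]
pub struct Child {
    pub kind: &'static str,
    /// Field name assigned by the grammar, if any.
    pub field: Option<&'static str>,
    pub text: &'static str,
}

/// First child carrying the given field name.
pub fn child_by_field<'a>(children: &'a [Child], field: &str) -> Option<&'a Child> {
    children
        .iter()
        .find(|c| c.field.map_or(false, |f| f == field))
}

/// First child of the given kind, regardless of field.
pub fn find_child<'a>(children: &'a [Child], kind: &str) -> Option<&'a Child> {
    children.iter().find(|c| c.kind == kind)
}

/// Prefer the `name` field; fall back to the first `atom` child only
/// when the grammar does not expose that field.
pub fn module_name<'a>(children: &'a [Child]) -> Option<&'a str> {
    child_by_field(children, "name")
        .or_else(|| find_child(children, "atom"))
        .map(|c| c.text)
}
```

If a grammar revision ever emitted the `module` keyword as a leading atom, the field lookup in this sketch would still return the real module name, while the fallback keeps older grammar shapes working.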

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Codegraph Impact Analysis

33 functions changed, 17 callers affected across 3 files

  • ErlangExtractor.extract in crates/codegraph-core/src/extractors/erlang.rs:9 (0 transitive callers)
  • match_erlang_node in crates/codegraph-core/src/extractors/erlang.rs:17 (0 transitive callers)
  • handle_module_attr in crates/codegraph-core/src/extractors/erlang.rs:37 (1 transitive callers)
  • handle_record_decl in crates/codegraph-core/src/extractors/erlang.rs:62 (1 transitive callers)
  • handle_type_alias in crates/codegraph-core/src/extractors/erlang.rs:103 (1 transitive callers)
  • handle_fun_decl in crates/codegraph-core/src/extractors/erlang.rs:129 (1 transitive callers)
  • handle_function_clause in crates/codegraph-core/src/extractors/erlang.rs:139 (2 transitive callers)
  • extract_params in crates/codegraph-core/src/extractors/erlang.rs:181 (3 transitive callers)
  • handle_define in crates/codegraph-core/src/extractors/erlang.rs:210 (1 transitive callers)
  • handle_include in crates/codegraph-core/src/extractors/erlang.rs:236 (1 transitive callers)
  • handle_import_attr in crates/codegraph-core/src/extractors/erlang.rs:251 (1 transitive callers)
  • handle_call in crates/codegraph-core/src/extractors/erlang.rs:282 (1 transitive callers)
  • parse_erlang in crates/codegraph-core/src/extractors/erlang.rs:339 (9 transitive callers)
  • extracts_module_declaration in crates/codegraph-core/src/extractors/erlang.rs:349 (0 transitive callers)
  • extracts_function_definition in crates/codegraph-core/src/extractors/erlang.rs:360 (0 transitive callers)
  • extracts_record_definition in crates/codegraph-core/src/extractors/erlang.rs:371 (0 transitive callers)
  • extracts_import_attribute in crates/codegraph-core/src/extractors/erlang.rs:386 (0 transitive callers)
  • extracts_function_calls in crates/codegraph-core/src/extractors/erlang.rs:396 (0 transitive callers)
  • extracts_include_directive in crates/codegraph-core/src/extractors/erlang.rs:402 (0 transitive callers)
  • deduplicates_multi_clause_function in crates/codegraph-core/src/extractors/erlang.rs:408 (0 transitive callers)

…1103)

- Dedupe Erlang function defs by (name, arity) so foo/1 and foo/2 are
  both kept
- Count every argument pattern (tuple, list, binary) as one parameter
  via named children, using placeholder labels for complex patterns
- Prefer the named 'name'/'args' fields for module attributes and clause
  args, falling back to the previous atom/expr_args lookups
- Add Rust and TS tests covering multi-arity overloads and complex
  pattern args
…-field fallback (#1103)

- Rust handle_record_decl now prefers child_by_field_name("name")
  before falling back to find_child(atom), matching the other Erlang
  handlers and avoiding accidental keyword pickup if the grammar
  exposes 'record' as a named atom.
- TypeScript handleTypeAlias now mirrors the Rust type_name->atom
  fallback so the two engines agree when the grammar wraps the alias
  name in a type_name node.

Development

Successfully merging this pull request may close these issues.

Rust engine parity: port the 11 remaining JS-only language extractors
