
feat: add Elixir, Lua, Dart, Zig, Haskell, OCaml language support#718

Merged
carlos-alm merged 12 commits into main from feat/batch2-languages
Mar 31, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Add Batch 2 language support from ROADMAP Phase 7: Elixir, Lua, Dart, Zig, Haskell, OCaml
  • Each language includes dual-engine extractors (WASM TypeScript + native Rust), AST configs, and parser tests
  • Brings total supported languages from 17 to 23

Language Details

| Language | Extensions | Key Constructs | Notes |
|---|---|---|---|
| Elixir | .ex, .exs | modules, functions, protocols, use/import/require | All constructs are generic call nodes |
| Lua | .lua | functions, methods, require() imports | require() detected as imports |
| Dart | .dart | classes, enums, mixins, extensions, imports | No call_expression; uses selector/argument_part |
| Zig | .zig | functions, structs, enums, unions, @import, tests | Structs/enums are anonymous, named by parent decl |
| Haskell | .hs | functions, data/newtype/type, typeclasses, imports | Grammar misspells type_synomym |
| OCaml | .ml, .mli | let bindings, modules, types, open, applications | Sub-grammar under grammars/ocaml |

Files Changed

  • 18 new files: 6 TS extractors, 6 Rust extractors, 6 test files
  • 10 modified files: types, registry, build config, Cargo.toml, package.json

Test plan

  • All 6 new parser test files pass (34 tests)
  • Full test suite passes (2257 tests, 0 failures)
  • TypeScript compiles cleanly (tsc --noEmit)
  • Lint passes (no new errors)
  • All 6 WASM grammars build successfully
  • CI passes
  • Rust native build compiles with new crate dependencies

- CHANGELOG: fix total language count from 14 to 17
- README: add 6 new languages to multi-language differentiator row
- ROADMAP: update Phase 7 overview to reflect Batch 1 completion
Add Batch 2 languages from the ROADMAP Phase 7 plan. Each language
includes dual-engine support (WASM + native Rust extractors), AST
configs, and parser tests.

Language details:
- Elixir (.ex, .exs): modules, functions, protocols, imports/use/require
- Lua (.lua): functions, methods, require() imports, table patterns
- Dart (.dart): classes, enums, mixins, extensions, imports, inheritance
- Zig (.zig): functions, structs, enums, unions, @import, test decls
- Haskell (.hs): functions, data/newtype/type, typeclasses, imports
- OCaml (.ml, .mli): let bindings, modules, types, open, applications

Notable grammar quirks handled:
- Elixir: all constructs are generic `call` nodes, distinguished by
  identifier text (defmodule, def, defp, use, import, etc.)
- Dart: no call_expression node — calls detected via selector/argument_part
- Zig: structs/enums are anonymous, named by parent variable_declaration
- Haskell: grammar misspells type_synomym (missing 'n')
- OCaml: sub-grammar under grammars/ocaml, Rust export LANGUAGE_OCAML
- Dart Rust crate uses old language() function instead of LANGUAGE constant
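
The Elixir quirk above (every construct is a generic `call` node, told apart only by identifier text) can be illustrated with a minimal sketch. This is not the PR's extractor: `SketchNode`, `ElixirConstruct`, and `classifyElixirCall` are hypothetical names, and the node shape is a simplified stand-in for a tree-sitter node.

```typescript
// Hypothetical minimal node shape; the real extractor works on tree-sitter nodes.
interface SketchNode {
  type: string;
  text: string;
  children: SketchNode[];
}

type ElixirConstruct = "module" | "function" | "import" | "plain-call";

// Classify a generic `call` node by the text of its first identifier child.
function classifyElixirCall(node: SketchNode): ElixirConstruct | null {
  if (node.type !== "call") return null;

  let head: string | undefined;
  for (const child of node.children) {
    if (child.type === "identifier") {
      head = child.text;
      break;
    }
  }

  switch (head) {
    case "defmodule":
      return "module";
    case "def":
    case "defp":
      return "function";
    case "use":
    case "import":
    case "require":
      return "import";
    default:
      // Any other call (including one with no identifier head) is an ordinary call.
      return "plain-call";
  }
}
```

The keyword list here covers only the identifiers named in the commit message; a real extractor would presumably handle more (for example `defprotocol` and `defimpl`).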
@claude

claude bot commented Mar 30, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Batch 2 Language Support

Reviewing this PR that adds 6 new language parsers (Elixir, Lua, Dart, Zig, Haskell, OCaml) following the comprehensive review framework.

Review Progress

  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Check
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Assessment
  • Final Assessment and Rating

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Expanding language support from ROADMAP Phase 7 to meet growing demand for multi-language codebases. This adds 6 carefully selected languages (Elixir, Lua, Dart, Zig, Haskell, OCaml) that complement existing coverage and address specific ecosystem needs.

Why This Approach: Follows established dual-engine pattern with both WASM TypeScript extractors and native Rust extractors. Maintains architecture consistency with existing language implementations.

Risk Assessment: Low risk - purely additive functionality following proven patterns. All new parsers are non-required (gracefully degrade if WASM unavailable). Comprehensive test coverage included.

Phase 2: Configuration Change Detection ✅

✅ No High-Risk Configuration Changes Detected

  • No Biome/linting rule relaxation; the any[] usage that does appear follows established patterns
  • No test timeout increases or coverage threshold changes
  • No build configuration or module resolution changes
  • No TypeScript strict mode flag modifications
  • Only additions to devDependencies (tree-sitter grammars) and Cargo.toml (new language crates)

Phase 3: Test Integrity Check ✅

✅ Strong Test Coverage for New Languages

  • 6 new test files added (one per language): elixir.test.ts, lua.test.ts, dart.test.ts, zig.test.ts, haskell.test.ts, ocaml.test.ts
  • 34 total tests across new languages covering core extraction patterns
  • Each test file follows established pattern: module/function/class extraction, imports, calls
  • No weakened assertions - all tests use proper expect().toContainEqual() assertions
  • Tests validate both symbol extraction and kind classification correctly

Phase 4: Code Quality Review ✅

✅ Excellent Implementation Quality

Architecture Consistency:

  • All 6 languages correctly added to LANGUAGE_REGISTRY with required: false (graceful degradation)
  • Dual extractors: TypeScript in src/extractors/ (169-304 LOC each) + Rust in crates/codegraph-core/src/extractors/
  • New Cargo dependencies: 6 tree-sitter crates added (elixir, lua, dart, zig, haskell, ocaml)
  • Follows existing patterns perfectly - no architectural divergence
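
As a rough illustration of what `required: false` with graceful degradation could mean, here is a minimal sketch. The entry shape, the `SKETCH_REGISTRY` contents, the `javascript` entry marked `required: true`, the `tree-sitter-lua.wasm` file name, and `resolveGrammars` are all assumptions made for the example, not the project's actual `LANGUAGE_REGISTRY`.

```typescript
// Hypothetical registry entry shape; field names are assumptions.
interface SketchRegistryEntry {
  id: string;
  extensions: string[];
  grammarFile: string;
  required: boolean;
}

const SKETCH_REGISTRY: SketchRegistryEntry[] = [
  // Assumed example of a required language.
  { id: "javascript", extensions: [".js"], grammarFile: "tree-sitter-javascript.wasm", required: true },
  // New Batch 2 style entries: optional, so a missing grammar is not fatal.
  { id: "lua", extensions: [".lua"], grammarFile: "tree-sitter-lua.wasm", required: false },
  { id: "ocaml", extensions: [".ml", ".mli"], grammarFile: "tree-sitter-ocaml.wasm", required: false },
];

// Return the ids of languages whose grammar file is available.
// A missing required grammar throws; a missing optional grammar is skipped.
function resolveGrammars(registry: SketchRegistryEntry[], availableFiles: string[]): string[] {
  const loaded: string[] = [];
  for (const entry of registry) {
    if (availableFiles.indexOf(entry.grammarFile) !== -1) {
      loaded.push(entry.id);
    } else if (entry.required) {
      throw new Error(`Missing required grammar: ${entry.grammarFile}`);
    }
    // Optional grammar missing: skip it and continue (graceful degradation).
  }
  return loaded;
}
```

With this shape, files of a language whose optional grammar is absent would simply go unparsed rather than failing the whole run.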

Code Quality:

  • No TODOs/FIXMEs/HACKs in new code - production-ready implementations
  • Consistent error handling and node traversal patterns
  • Appropriate symbol kind mappings (e.g., Elixir protocols → interface, Dart mixins → class)
  • Clean separation of concerns: parsing, symbol extraction, type mapping

Phase 5: Backlog Compliance Assessment ✅

Backlog Compliance

  • Zero-dep: ✅ — No new runtime dependencies. Only adds tree-sitter grammar devDependencies for build-time WASM generation
  • Foundation-aligned: ✅ — Fully aligned with all 8 core principles. Maintains single registry (P6), preserves dual-engine architecture (P2), zero-cost core (P4)
  • Problem-fit: 4 — Significantly improves agent accuracy by expanding language coverage from 17 to 23 languages. Directly reduces "unsupported language" gaps that force agents to work blind
  • Breaking: No — Purely additive. All new languages are required: false with graceful degradation
  • Tier: 1 — Zero-dep ✅ + Foundation-aligned ✅

Critical Concerns

None identified. This is a textbook implementation of the established language addition pattern.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation that follows all established patterns. Comprehensive test coverage, dual-engine consistency, graceful degradation, and zero architectural debt. The only reason this isn't 5/5 is that language additions, while valuable, don't break new architectural ground — they execute a proven pattern excellently.

This PR successfully delivers 6 new language parsers with production-ready quality and zero architectural compromise. Strongly recommend approval.

@greptile-apps
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR completes Phase 7 Batch 2 of the language support roadmap by adding Elixir, Lua, Dart, Zig, Haskell, and OCaml — bringing the project from 17 to 23 supported languages. Each language ships with a dual-engine extractor (WASM TypeScript + native Rust), AST configs, and parser tests, following the established pattern from Batch 1.

The implementation is thorough and well-structured. Key language-specific design decisions are handled correctly: Elixir's generic call AST nodes are dispatched by keyword text; Zig's anonymous structs are named from their enclosing variable_declaration; Haskell's grammar typo (type_synomym) is correctly matched in both engines; OCaml's let_binding double-emission risk is guarded; Dart's method de-duplication via isInsideDartClass works correctly.

One P1 issue found:

  • OCaml .mli interface files are parsed with the wrong grammar (LANGUAGE_OCAML / grammars/ocaml) in both the Rust native engine (parser_registry.rs) and the WASM engine (parser.ts + build-wasm.ts). OCaml interface files require LANGUAGE_OCAML_INTERFACE / grammars/ocaml_interface. Using the wrong grammar produces an error-filled AST, causing all definitions, calls, and imports to be silently missed for .mli files. There are no .mli tests, so this went undetected.

One P2 note:

  • handleLuaVariableDecl in src/extractors/lua.ts contains dead code: it looks for an assignment_statement child inside a variable_declaration, but these are sibling statement types in tree-sitter-lua (not parent-child). The require() detection path there never fires; it works correctly only because handleLuaFunctionCall intercepts the function_call node via the recursive walk.
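
A minimal sketch of why the recursive walk still catches `require()` even when the variable-declaration path never fires: every node is visited, so the `function_call` node is reached wherever it sits. The node shape and the function names (`luaRequireTarget`, `collectLuaImports`) are hypothetical, and the child layout (callee first, then an `arguments` node holding a `string`) is an assumption about the grammar.

```typescript
// Hypothetical minimal node shape standing in for a tree-sitter node.
interface SketchNode {
  type: string;
  text: string;
  children: SketchNode[];
}

// Return the required module name if `node` is a require("...") call, else null.
function luaRequireTarget(node: SketchNode): string | null {
  if (node.type !== "function_call") return null;
  const callee = node.children[0];
  if (!callee || callee.type !== "identifier" || callee.text !== "require") return null;
  for (const child of node.children) {
    if (child.type !== "arguments") continue;
    for (const arg of child.children) {
      // Strip one leading and one trailing quote from the string literal.
      if (arg.type === "string") return arg.text.replace(/^["']|["']$/g, "");
    }
  }
  return null;
}

// Visit every node; a require() nested under any statement type is still found.
function collectLuaImports(root: SketchNode, out: string[] = []): string[] {
  const target = luaRequireTarget(root);
  if (target !== null) out.push(target);
  for (const child of root.children) collectLuaImports(child, out);
  return out;
}
```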

Confidence Score: 4/5

Safe to merge after fixing the OCaml .mli grammar mismatch — all other extractors are correct and well-tested.

One P1 bug: OCaml .mli files are parsed with the wrong tree-sitter grammar in both engines, producing silently empty extraction results. This affects a real registered file type. All other five languages look correct. The P2 dead-code note in the Lua extractor does not affect behavior.

Files needing attention: crates/codegraph-core/src/parser_registry.rs, src/domain/parser.ts, and scripts/build-wasm.ts for the OCaml interface grammar issue.

Important Files Changed

| Filename | Overview |
|---|---|
| crates/codegraph-core/src/parser_registry.rs | Registers 6 new language variants and maps extensions; uses LANGUAGE_OCAML for both .ml and .mli, causing parse failures on OCaml interface files. |
| src/domain/parser.ts | Adds all 6 new languages to LANGUAGE_REGISTRY; maps .ml and .mli to the same tree-sitter-ocaml.wasm (built from grammars/ocaml); interface files need grammars/ocaml_interface. |
| scripts/build-wasm.ts | Adds 6 new WASM grammar entries; OCaml uses sub: 'grammars/ocaml', which is the implementation grammary; .mli interface files need a separate grammars/ocaml_interface WASM build. |
| src/extractors/lua.ts | Contains dead code in handleLuaVariableDecl (looks for assignment_statement inside variable_declaration, which can't exist); require() detection via handleLuaFunctionCall works correctly. |
| crates/codegraph-core/src/extractors/elixir.rs | Correctly handles Elixir's generic call AST structure via parent-traversal for module scoping; handles defmodule, def/defp, defprotocol, defimpl, and dot calls. |
| src/extractors/elixir.ts | TS mirror of the Rust Elixir extractor; uses parameter-passing for module context instead of parent traversal; correct and consistent. |
| crates/codegraph-core/src/extractors/dart.rs | Well-structured; class/mixin/extension/enum/import all handled; is_inside_class guard prevents duplicate method emission during recursive walk. |
| src/extractors/dart.ts | TS Dart extractor mirrors Rust implementation; inheritance extraction is correct; isInsideDartClass guard prevents duplicate method emission. |
| crates/codegraph-core/src/extractors/haskell.rs | Correctly spells type_synomym to match the grammar's intentional typo; handles function, bind, data_type, newtype, class, instance, import, and apply nodes. |
| crates/codegraph-core/src/extractors/zig.rs | Correct dual-loop approach for @import vs struct/enum/union in variable_declaration; find_zig_parent_struct correctly names anonymous containers from their enclosing const declaration. |
| src/extractors/zig.ts | TS Zig extractor; extractZigContainerMethods correctly emits struct methods as ParentName.methodName; isInsideZigContainer guard prevents double-emission. |
| crates/codegraph-core/src/extractors/ocaml.rs | Handles OCaml value/module/type/class definitions and open statements; correctly distinguishes functions (with params) from plain variables. |
| src/extractors/ocaml.ts | TS OCaml extractor with correct let_binding double-emission guard (parent?.type !== 'value_definition'); module, type, class, and open handling look solid. |
| crates/codegraph-core/src/extractors/helpers.rs | Adds 6 new LangAstConfig constants; Elixir uses call_types: &["call"], which matches all Elixir constructs (intentional per PR description). |
| src/types.ts | Extends LanguageId union type with all 6 new languages; clean and correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Source File] -->|extension lookup| B{LanguageKind}
    B -->|.ex/.exs| C[Elixir]
    B -->|.lua| D[Lua]
    B -->|.dart| E[Dart]
    B -->|.zig| F[Zig]
    B -->|.hs| G[Haskell]
    B -->|.ml/.mli| H[Ocaml ⚠️]

    C --> I{Engine}
    D --> I
    E --> I
    F --> I
    G --> I
    H --> I

    I -->|WASM| J[TS Extractor\nsrc/extractors/*.ts]
    I -->|Native| K[Rust Extractor\ncrates/codegraph-core/src/extractors/*.rs]

    J --> L[ExtractorOutput\ndefinitions / calls / imports / classes]
    K --> L

    H -.->|.mli uses LANGUAGE_OCAML\nshould use LANGUAGE_OCAML_INTERFACE| M[❌ Parse errors on .mli]

Comments Outside Diff (1)

  1. crates/codegraph-core/src/parser_registry.rs, line 1907 (link)

    P1 OCaml .mli interface files parsed with the wrong grammar

    Both .ml and .mli are mapped to LanguageKind::Ocaml, which always resolves to tree_sitter_ocaml::LANGUAGE_OCAML. OCaml interface files (.mli) have a distinct grammar (LANGUAGE_OCAML_INTERFACE) — using the implementation grammar on them will produce an ERROR-filled AST and silently return empty extraction results for all .mli files.

    The tree-sitter-ocaml crate exposes both constants. A separate LanguageKind::OcamlInterface variant (or a split extension mapping) is needed to select the right grammar:

    // Current — wrong grammar for .mli
    Self::Ocaml => tree_sitter_ocaml::LANGUAGE_OCAML.into(),
    
    // Needed (example approach with a new variant):
    Self::Ocaml => tree_sitter_ocaml::LANGUAGE_OCAML.into(),
    Self::OcamlInterface => tree_sitter_ocaml::LANGUAGE_OCAML_INTERFACE.into(),

    The same issue exists in the WASM extractor (src/domain/parser.ts) and the build-wasm.ts script where grammars/ocaml_interface would be the correct sub-path for .mli files.
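
    On the WASM side, the "split extension mapping" idea might look roughly like the sketch below. The type and constant names are hypothetical, and the `grammars/ocaml_interface` sub-path is taken from this comment; the author's later reply names `grammars/interface` instead, so the exact path should be checked against the grammar repository.

```typescript
// Hypothetical language ids; the real LanguageId union may differ.
type SketchOcamlLanguage = "ocaml" | "ocaml-interface";

// Split mapping: implementation and interface files resolve to different languages.
const OCAML_EXTENSION_MAP: Record<string, SketchOcamlLanguage> = {
  ".ml": "ocaml",
  ".mli": "ocaml-interface",
};

// Grammar sub-path per language. The interface path is an assumption
// copied from the review comment, not verified against the grammar repo.
const OCAML_GRAMMAR_SUBPATH: Record<SketchOcamlLanguage, string> = {
  ocaml: "grammars/ocaml",
  "ocaml-interface": "grammars/ocaml_interface",
};

// Resolve the grammar sub-path for a file, or null if it is not an OCaml file.
function ocamlGrammarFor(filePath: string): string | null {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return null;
  const lang = OCAML_EXTENSION_MAP[filePath.slice(dot)];
  return lang ? OCAML_GRAMMAR_SUBPATH[lang] : null;
}
```

    Keeping the two maps separate means the extension lookup and the grammar build configuration can each be tested on their own.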

Reviews (3): Last reviewed commit: "fix(elixir): remove dead visibility vari..."

Comment on lines +9 to +12
### Features

* add C, C++, Kotlin, Swift, Scala, Bash language support ([#708](https://github.com/optave/ops-codegraph-tool/pull/708))

Contributor

P2 v3.6.0 CHANGELOG entry describes the wrong language batch

The v3.6.0 entry reads: "This release adds first-class support for C, C++, Kotlin, Swift, Scala, and Bash — bringing the total supported languages to 17." That describes Batch 1 (PR #708), not the six languages being added in this PR. The entry should reference Elixir, Lua, Dart, Zig, Haskell, and OCaml bringing the total to 23.

Suggested change
### Features
* add C, C++, Kotlin, Swift, Scala, Bash language support ([#708](https://github.com/optave/ops-codegraph-tool/pull/708))
**Six new languages: Elixir, Lua, Dart, Zig, Haskell, OCaml.** This release adds first-class support for Elixir, Lua, Dart, Zig, Haskell, and OCaml — bringing the total supported languages to 23. Each language ships with dual-engine extractors (WASM TypeScript + native Rust), AST configs, and parser tests.

Contributor Author

Fixed in 441fcd1. The CHANGELOG v3.6.0 entry now correctly references Elixir, Lua, Dart, Zig, Haskell, and OCaml (Batch 2), with the total updated to 23.

…unction names (#718)

walkElixirNode never updated currentModule when descending into defmodule
bodies, so all functions got unqualified names (e.g. `greet` instead of
`Greeter.greet`), breaking parity with the native Rust engine. Strengthen
the test assertion to check qualified names.
…is_pub (#718)

WASM extractor: handleZigFunction now skips nodes inside struct/union
containers since extractZigContainerMethods already emits them as
qualified methods. Rust extractor: remove dead is_pub call from
handle_zig_function (Definition struct lacks a visibility field).
CHANGELOG v3.6.0 entry incorrectly described Batch 1 languages instead
of Batch 2. README comparison table, feature descriptions, and ROADMAP
phase summary all updated to reflect 23 supported languages.
@carlos-alm
Contributor Author

Addressing all Greptile review comments

P1 — Elixir WASM currentModule never propagated (f47be65)
Fixed in src/extractors/elixir.ts. walkElixirNode now detects defmodule call nodes and updates nextModule before recursing into children. Functions inside modules now get qualified names (e.g. Greeter.greet), matching the native Rust engine. Test assertion strengthened to verify qualified name.
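
A minimal sketch of the propagation pattern described here, assuming a simplified node shape: the walk computes a `nextModule` when it sees a `defmodule` call and passes it down to children, so nested `def`/`defp` names come out qualified. `SketchNode`, `callKeyword`, `callName`, and `walkSketch` are hypothetical names, not the code in src/extractors/elixir.ts, and the child layout (keyword first, name second) is an assumption.

```typescript
// Hypothetical minimal node shape standing in for a tree-sitter node.
interface SketchNode {
  type: string;
  text: string;
  children: SketchNode[];
}

// Leading identifier text of a `call` node (e.g. "defmodule"), or null.
function callKeyword(node: SketchNode): string | null {
  if (node.type !== "call") return null;
  const head = node.children[0];
  return head && head.type === "identifier" ? head.text : null;
}

// Text of the node that follows the keyword (assumed to carry the name).
function callName(node: SketchNode): string | null {
  const arg = node.children[1];
  return arg ? arg.text : null;
}

// Walk the tree, passing the enclosing module name down to children.
function walkSketch(node: SketchNode, currentModule: string | null, out: string[]): string[] {
  const keyword = callKeyword(node);
  let nextModule = currentModule;
  if (keyword === "defmodule") {
    // Children of this defmodule see the new module name.
    nextModule = callName(node) ?? currentModule;
  } else if (keyword === "def" || keyword === "defp") {
    const name = callName(node);
    if (name) out.push(currentModule ? `${currentModule}.${name}` : name);
  }
  for (const child of node.children) walkSketch(child, nextModule, out);
  return out;
}
```

If `nextModule` were never updated, every call to `walkSketch` would receive the initial `null` and all functions would be emitted unqualified, which is the bug this commit describes.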

P1 — Zig WASM duplicate struct method emission (0f7c637)
Fixed in src/extractors/zig.ts. Added isInsideZigContainer() guard to handleZigFunction — if the node is inside a struct_declaration or union_declaration, it returns early since extractZigContainerMethods already emits qualified method definitions.
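
The guard described here amounts to an ancestor check before emitting a function. A minimal sketch, assuming a node shape with a `parent` link: `SketchZigNode`, `isInsideContainer`, and `shouldEmitTopLevelFunction` are hypothetical names, and `function_declaration` is an assumed node type; only `struct_declaration` and `union_declaration` come from the comment above.

```typescript
// Hypothetical minimal node shape with a parent link.
interface SketchZigNode {
  type: string;
  parent?: SketchZigNode;
}

// Container node types named in the fix description.
const ZIG_CONTAINER_TYPES: string[] = ["struct_declaration", "union_declaration"];

// True if any ancestor of `node` is a struct or union container.
function isInsideContainer(node: SketchZigNode): boolean {
  for (let cur = node.parent; cur; cur = cur.parent) {
    if (ZIG_CONTAINER_TYPES.indexOf(cur.type) !== -1) return true;
  }
  return false;
}

// The top-level handler emits only functions that are not container methods,
// leaving methods to the container-specific pass that emits qualified names.
function shouldEmitTopLevelFunction(node: SketchZigNode): boolean {
  return node.type === "function_declaration" && !isInsideContainer(node);
}
```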

P2 — is_pub computed but never used in zig.rs (0f7c637)
Removed the dead is_pub call from handle_zig_function in the Rust extractor. The Definition struct lacks a visibility field, so the value couldn't be wired through. Added #[allow(dead_code)] + TODO on the helper function for when the struct gains visibility support.

P2 — README/ROADMAP/CHANGELOG language counts (441fcd1)

  • CHANGELOG v3.6.0 entry now references Batch 2 languages (Elixir, Lua, Dart, Zig, Haskell, OCaml) and total of 23
  • README comparison table: 17 → 23
  • README feature descriptions: all "17 languages" → "23 languages" (5 occurrences)
  • README multi-language list: added Elixir, Lua, Dart, Zig, Haskell, OCaml
  • ROADMAP Phase 7 summary: updated to reflect Batch 1+2 shipped, 11 remaining in 2 batches

CI failure (incremental-parity test): Resolved by merging main into the branch — the test relied on a CFG fix that landed on main after this branch diverged.

@carlos-alm
Contributor Author

@greptileai review

…dren in Rust extractor (#718)

- Remove unused `visibility` variable in `handle_def_function` (Definition
  struct has no visibility field yet; left a comment for future wiring)
- Add `collect_module_children` to `handle_defmodule` so the native engine
  populates module children, matching the TS/WASM extractor behaviour
- Include Cargo.lock update with new batch-2 tree-sitter dependencies
  (elixir, lua, dart, zig, haskell, ocaml) that was missing from prior commits
@carlos-alm
Contributor Author

Addressing remaining Greptile findings from summary review

1. elixir.rs — dead visibility variable (line 94)
Fixed in 67b8200. Removed the unused visibility binding. The Definition struct has no visibility field, so the value could never be stored. Left a comment documenting the intended wiring for when the struct gains that field.

2. elixir.rs — engine parity gap in handle_defmodule (children: None)
Fixed in 67b8200. Added collect_module_children() that walks the module's do_block, collects def/defp child functions, and populates children via opt_children() — matching the TS extractor's collectModuleMembers behavior.

3. parser_registry.rs: .mli files mapped to wrong OCaml grammar
Tracked as follow-up issue #720. This requires adding a new OcamlInterface variant to the LanguageKind enum, building a separate WASM grammar from grammars/interface, and potentially adapting the extractor for interface-specific node types. The scope is too large for this PR (which is purely additive language support), and .mli files are uncommon enough that the impact is low. The issue has full implementation details.

4. Cargo.lock — missing batch-2 tree-sitter deps
Also fixed in 67b8200. The Cargo.lock was not updated when the new tree-sitter crates (elixir, lua, dart, zig, haskell, ocaml) were added to Cargo.toml.

CI note: The incremental-parity test failure (preserves CFG blocks for changed file) is a pre-existing issue on main, not introduced by this PR. It is being addressed separately in PR #719.

@carlos-alm
Contributor Author

@greptileai review

@carlos-alm carlos-alm merged commit 13cede2 into main Mar 31, 2026
19 checks passed
@carlos-alm carlos-alm deleted the feat/batch2-languages branch March 31, 2026 05:40
@github-actions github-actions bot locked and limited conversation to collaborators Mar 31, 2026