feat(native): port Objective-C extractor to Rust #1106

Open
carlos-alm wants to merge 6 commits into main from feat/1071-objc-rust-extractor

@carlos-alm
Contributor

Summary

  • Adds tree-sitter-objc dependency and a native Objective-C extractor in crates/codegraph-core/src/extractors/objc.rs.
  • Registers .m with LanguageKind::ObjC and the Rust file_collector, adds Objective-C to NATIVE_SUPPORTED_EXTENSIONS on the JS side, and wires OBJC_AST_TYPES / OBJC_AST_CONFIG on both the native and JS sides so the two engines extract identical ast_nodes for Objective-C source.
  • Mirrors extractObjCSymbols: class_interface / class_implementation (with : Superclass) as class, @protocol as interface, instance and class method declarations / definitions with multi-part selectors assembled from leading identifiers and method_parameter children, C-level function declarations / definitions, #import / @import imports, and message-expression call extraction.
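The multi-part selector assembly mentioned in the last bullet can be sketched as follows. This is a minimal illustration that assumes the keyword parts have already been collected from the method node; `assemble_selector` is a hypothetical name, not the extractor's actual `build_selector` signature.

```rust
/// Join Objective-C selector keyword parts into a single selector string.
/// Hypothetical helper for illustration: the real extractor walks
/// tree-sitter nodes, while this sketch starts from collected parts.
fn assemble_selector(parts: &[&str], has_params: bool) -> Option<String> {
    if parts.is_empty() {
        // No identifier found: there is nothing to name the method with.
        return None;
    }
    let joined = parts.join(":");
    if has_params {
        // Methods that take arguments end with a trailing colon,
        // e.g. `setName:age:`.
        Some(format!("{}:", joined))
    } else {
        // Unary selectors such as `description` carry no colon.
        Some(joined)
    }
}
```

Under these assumptions, `-(void)setName:(NSString *)n age:(int)a` would yield `setName:age:`, while a parameterless `-(NSString *)description` would yield `description`.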

Closes #1071

Test plan

  • cargo build --release -p codegraph-core (clean build)
  • cargo test -p codegraph-core --lib — 193/193
  • npx tree-sitter build --wasm regenerates tree-sitter-objc.wasm
  • npx vitest run tests/parsers/objc.test.ts — 5/5
  • npx vitest run tests/parsers/native-drop-classification.test.ts — 13/13

Mirrors `src/extractors/objc.ts` in `crates/codegraph-core/src/extractors/objc.rs`.
Adds the `tree-sitter-objc` dependency, wires `LanguageKind::ObjC` (`.m`) in
the Rust `parser_registry` and `file_collector`, adds `.m` to
`NATIVE_SUPPORTED_EXTENSIONS` on the JS side, and registers
`OBJC_AST_TYPES` / `OBJC_AST_CONFIG` so the native and WASM engines extract
identical `ast_nodes` for Objective-C source.

Handles class interfaces / implementations (with `: Superclass`), protocols,
instance and class method declarations/definitions with multi-part selectors
assembled from leading identifiers and `method_parameter` children, C-level
function declarations/definitions, `#import`/`@import` imports, and message
expression call extraction.
@claude

claude Bot commented May 11, 2026

Claude finished @carlos-alm's task


PR Review: feat(native): port Objective-C extractor to Rust

Todo List:

  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Review
  • Phase 4: Code Quality Assessment
  • Phase 5: Backlog Compliance Assessment
  • Final Rating and Recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: Adding Objective-C language support to complete the dual-engine architecture. The native Rust engine lacks Objective-C extraction capabilities that exist in the WASM engine, creating a parser gap that violates the "both engines must produce identical results" principle.

Why This Approach: The PR follows the established pattern for adding language support - adds tree-sitter grammar dependency, implements native Rust extractor mirroring existing WASM extractor (extractObjCSymbols), registers language in both engines, and ensures AST node extraction parity.

Risk Assessment: Low risk - purely additive change following established patterns, comprehensive test coverage, no breaking changes to existing functionality.

Phase 2: Configuration Change Detection ✅

CLEAN - No high-risk configuration changes detected:

  • ✅ Single new dependency: tree-sitter-objc = "3" in Cargo.toml (legitimate grammar addition)
  • ✅ No linting rule modifications
  • ✅ No test timeout increases or coverage threshold lowering
  • ✅ No TypeScript strict mode changes
  • ✅ AST type mapping additions follow established pattern
  • ✅ Native extensions list properly updated with drift guard protection

Phase 3: Test Integrity ✅

HEALTHY - Test changes are appropriate:

  • ✅ New comprehensive test file tests/parsers/objc.test.ts with proper WASM extractor coverage
  • ✅ Updated drift guard test correctly reflects new .m extension in NATIVE_SUPPORTED_EXTENSIONS
  • ✅ Rust unit tests in objc.rs provide thorough native extractor verification
  • ✅ No test assertions weakened or removed
  • ✅ All test modifications are additive and language-specific

Phase 4: Code Quality Assessment ✅

EXCELLENT - High-quality implementation:

Architecture Consistency:

  • ✅ Perfect adherence to dual-engine architecture patterns
  • ✅ Properly registered in LanguageKind enum and dispatcher (mod.rs:130-132)
  • ✅ AST configuration correctly mirrors native LangAstConfig constants
  • ✅ Extension mapping follows established conventions

Implementation Quality:

  • Selector building logic (build_selector, build_message_selector) correctly handles multi-part Objective-C selectors
  • Category support properly formats Foo(Cat) syntax for both interfaces and implementations
  • Property extraction robustly navigates complex grammar nesting (struct_declarator > pointer_declarator > identifier)
  • Error handling throughout with graceful fallbacks and Option returns
  • Memory safety with proper lifetime management and UTF-8 validation

Code Structure:

  • ✅ Clean separation of concerns (classes/protocols, methods, imports, calls)
  • ✅ Comprehensive unit tests covering edge cases (categories, multi-keyword selectors, properties)
  • ✅ Consistent with existing extractor patterns (mirrors c.rs for C constructs)
  • ✅ Excellent documentation explaining grammar-specific challenges

No Issues Detected: no security, performance, or maintainability concerns.

Root Cause Analysis

Problem Being Solved: Engine parity violation - native Rust engine missing Objective-C support that exists in WASM engine
Why This Approach: Follows established dual-engine architecture pattern with comprehensive symbol extraction mirroring existing WASM implementation
Risk Assessment: Very low risk - purely additive, well-tested, follows proven patterns

Backlog Compliance

  • Zero-dep: ✅ — Only adds tree-sitter grammar dependency (standard pattern for language support)
  • Foundation-aligned: ✅ — Fully compliant with all principles (P1-P8), enhances local parsing capabilities
  • Problem-fit: 4/5 — Directly improves AI agent accuracy by providing complete language coverage for dual-engine architecture
  • Breaking: No — Purely additive, no existing behavior changes
  • Tier: 1 (zero-dep ✅ + foundation-aligned ✅)

Critical Concerns

None identified. This is a textbook example of proper language support addition.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation that perfectly follows established patterns, comprehensive test coverage, maintains engine parity, and adds genuine value for Objective-C codebases. Minor deduction only because language additions, while important, aren't transformative features.

Excellent work - this PR exemplifies how language support should be added to codegraph.

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Codegraph Impact Analysis

46 functions changed, 26 callers affected across 3 files

  • extract_symbols_with_opts in crates/codegraph-core/src/extractors/mod.rs:62 (1 transitive callers)
  • ObjCExtractor.extract in crates/codegraph-core/src/extractors/objc.rs:18 (0 transitive callers)
  • match_objc_node in crates/codegraph-core/src/extractors/objc.rs:26 (0 transitive callers)
  • handle_class_interface in crates/codegraph-core/src/extractors/objc.rs:51 (1 transitive callers)
  • handle_class_implementation in crates/codegraph-core/src/extractors/objc.rs:115 (1 transitive callers)
  • handle_protocol_decl in crates/codegraph-core/src/extractors/objc.rs:140 (1 transitive callers)
  • handle_method in crates/codegraph-core/src/extractors/objc.rs:159 (1 transitive callers)
  • handle_function_def in crates/codegraph-core/src/extractors/objc.rs:183 (1 transitive callers)
  • handle_import in crates/codegraph-core/src/extractors/objc.rs:203 (1 transitive callers)
  • handle_at_import in crates/codegraph-core/src/extractors/objc.rs:224 (1 transitive callers)
  • handle_struct_specifier in crates/codegraph-core/src/extractors/objc.rs:239 (1 transitive callers)
  • handle_enum_specifier in crates/codegraph-core/src/extractors/objc.rs:254 (1 transitive callers)
  • handle_typedef in crates/codegraph-core/src/extractors/objc.rs:269 (1 transitive callers)
  • handle_c_call_expr in crates/codegraph-core/src/extractors/objc.rs:300 (1 transitive callers)
  • handle_message_expr in crates/codegraph-core/src/extractors/objc.rs:345 (1 transitive callers)
  • build_selector in crates/codegraph-core/src/extractors/objc.rs:370 (4 transitive callers)
  • build_message_selector in crates/codegraph-core/src/extractors/objc.rs:405 (2 transitive callers)
  • find_objc_parent_class in crates/codegraph-core/src/extractors/objc.rs:430 (2 transitive callers)
  • find_objc_decl_name in crates/codegraph-core/src/extractors/objc.rs:455 (6 transitive callers)
  • collect_class_members in crates/codegraph-core/src/extractors/objc.rs:467 (2 transitive callers)

@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ports the Objective-C extractor from TypeScript to Rust by adding a new crates/codegraph-core/src/extractors/objc.rs, wiring the tree-sitter-objc crate, and updating both the native and JS sides to register .m files, AST configs, and the routing layer. Several previously-flagged divergences (category naming, @import field name, dead-code branch) were fixed in a follow-up commit.

  • Rust extractor (objc.rs, 768 lines): handles class_interface/class_implementation, protocols, keyword-selector assembly, #import/@import, C functions, structs/enums/typedefs, and call extraction; backed by 9 unit tests.
  • JS parity fixes (src/extractors/objc.ts): handleClassInterface, handleClassImplementation, and findObjCParentClass now include the (Cat) category suffix, matching the Rust output.
  • Routing/config wiring: NATIVE_SUPPORTED_EXTENSIONS, OBJC_AST_TYPES/OBJC_STRING_CONFIG (JS), and OBJC_AST_CONFIG (Rust) are all in sync; LanguageKind::ObjC and .m extension mapping are correctly registered.

Confidence Score: 4/5

Safe to merge with one known gap: the JS extractor silently drops every @import statement while the Rust extractor captures them correctly.

The Rust extractor is well-tested and the cross-engine alignment work (category naming, field names, routing) is solid. The one remaining gap is in src/extractors/objc.ts: the switch arm dispatches @import handling on 'import_declaration' but tree-sitter-objc v3.0.2 emits module_import for those nodes, so handleAtImport is never called on the JS path and all @import symbols are silently lost.

src/extractors/objc.ts — the 'import_declaration' case label on the @import dispatch arm needs to be 'module_import'.

Important Files Changed

Filename | Overview
crates/codegraph-core/src/extractors/objc.rs | New 768-line Rust ObjC extractor; handles class/protocol, methods, imports, C functions, calls. Selector assembly, import field names, and category naming are all correct per tests.
src/extractors/objc.ts | Category and findObjCParentClass fixes align JS with Rust correctly, but @import dispatch remains on 'import_declaration' instead of 'module_import', silently dropping all @import statements.
crates/codegraph-core/src/file_collector.rs | Adds .m to SUPPORTED_EXTENSIONS with a well-documented MATLAB/Octave collision note.
crates/codegraph-core/src/parser_registry.rs | Adds ObjC variant to LanguageKind, wires .m extension and tree_sitter_objc::LANGUAGE, updates all() array and EXPECTED_LEN count correctly.
tests/parsers/native-drop-classification.test.ts | Removes src/k.m from the unsupported-by-native bucket and updates the expected count from 8 to 7.

Sequence Diagram

sequenceDiagram
    participant Router as parser.ts / file_collector
    participant Native as ObjCExtractor (Rust)
    participant JS as extractObjCSymbols (TS)
    participant Grammar as tree-sitter-objc v3.0.2

    Router->>Router: sees .m extension
    alt Native path
        Router->>Native: extract(tree, source)
        Native->>Grammar: parse .m
        Grammar-->>Native: module_import / class_interface / method_definition
        Native-->>Router: "FileSymbols with @import captured"
    else JS path
        Router->>JS: extractObjCSymbols(tree, filePath)
        JS->>Grammar: parse .m
        Grammar-->>JS: module_import node
        JS-->>Router: ExtractorOutput
        Note over JS: case import_declaration never matches module_import
    end

Comments Outside Diff (1)

  1. src/extractors/objc.ts, line 58-60 (link)

    P1 @import silently dropped — wrong node type in JS dispatch

    The tree-sitter-objc v3.0.2 grammar (used by both the JS WASM build and the Rust crate) emits module_import for @import Foundation; statements, not import_declaration. Because the switch arm never matches, handleAtImport is never called and every @import statement is silently discarded by the JS extractor. The Rust extractor correctly dispatches on "module_import" and the test extracts_imports confirms this. No equivalent test exists on the JS side, which is why the mismatch was not caught. The fix is a one-line change to the case label ('import_declaration' becomes 'module_import'). This directly contradicts the PR's stated goal of identical output between engines.
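The failure mode described here, where a dispatch arm is keyed on a node kind the grammar never emits, can be reproduced in a few lines. The sketch below is an assumption-laden stand-in: `handler_for` is hypothetical, the `@import` label is passed in as a parameter so both spellings can be compared, and the kind-to-handler pairs are inferred from names that appear in this PR rather than copied from the extractor.

```rust
/// Return the name of the handler that would run for a node kind.
/// Hypothetical dispatch table: `at_import_label` is the kind string
/// that the `@import` arm is keyed on.
fn handler_for(kind: &str, at_import_label: &str) -> Option<&'static str> {
    if kind == at_import_label {
        return Some("handle_at_import");
    }
    match kind {
        "class_interface" => Some("handle_class_interface"),
        "class_implementation" => Some("handle_class_implementation"),
        "protocol_declaration" => Some("handle_protocol_decl"),
        // Any other kind falls through: no handler runs and no error
        // is raised, which is why a mislabeled arm goes unnoticed.
        _ => None,
    }
}
```

With the arm keyed on `import_declaration`, a `module_import` node falls through to `None`; keyed on `module_import`, it reaches `handle_at_import`. A test that feeds an `@import` statement through each engine would surface the difference.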


Reviews (6): Last reviewed commit: "fix: resolve merge conflicts with main (..."

Comment on lines +479 to +486
"implementation_definition" => {
    // Wraps a `method_definition` inside `class_implementation`.
    if let Some(method) = find_child(&child, "method_definition") {
        if let Some(sel) = build_selector(&method, source) {
            members.push(child_def(sel, "method", start_line(&method)));
        }
    }
}
Contributor

P2 Dead-code branch: implementation_definition inside collect_class_members

collect_class_members is only ever called from handle_class_interface, and class_interface nodes never contain implementation_definition children — those live exclusively inside class_implementation. This branch can never execute and the comment ("Wraps a method_definition inside class_implementation") even acknowledges the mismatch. If the intent was to also populate children for class implementations, collect_class_members would need to be called from handle_class_implementation as well.

Contributor Author

Fixed in de05222 — removed the unreachable implementation_definition branch from collect_class_members. Confirmed: collect_class_members is only invoked by handle_class_interface, so a class_implementation wrapper would never reach it. The Rust test suite still passes (9 ObjC tests).

"js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb",
"rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt",
"kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli",
"m",
Contributor

P2 .m extension collision with MATLAB/Octave

The .m extension is also the canonical extension for MATLAB and GNU Octave source files. Any MATLAB code in a repository will now be silently routed to the ObjC parser and produce garbled or empty symbol output without any error signal. If supporting MATLAB is a future goal, disambiguation will require heuristics (e.g., checking for ObjC-specific tokens like @interface vs MATLAB keywords). Worth at least documenting the known false-positive risk in a comment.

Contributor Author

Documented in de05222 — added a "Known extension collisions" note to SUPPORTED_EXTENSIONS in crates/codegraph-core/src/file_collector.rs calling out the .m overlap with MATLAB/Octave, the resulting silent garbled-symbol output for MATLAB files, and the disambiguation heuristics that would be needed to support MATLAB later. Disambiguation itself is out of scope for this PR — .m is unambiguously routed to the ObjC parser today.

Comment on lines +429 to +448
fn find_objc_parent_class(node: &Node, source: &[u8]) -> Option<String> {
    let mut current = node.parent();
    while let Some(parent) = current {
        match parent.kind() {
            "class_interface"
            | "class_implementation"
            | "protocol_declaration" => {
                let name_node = find_objc_decl_name(&parent)?;
                let base = node_text(&name_node, source).to_string();
                // Categories: include `(Cat)` so methods are grouped per category.
                if let Some(cat) = parent.child_by_field_name("category") {
                    return Some(format!("{}({})", base, node_text(&cat, source)));
                }
                return Some(base);
            }
            _ => {}
        }
        current = parent.parent();
    }
    None
Contributor

P2 Category method naming diverges from the JS extractor

find_objc_parent_class includes the category suffix in the parent name (e.g. Foo(Cat).catMethod), but the JS findObjCParentClass returns only the bare class name (Foo.catMethod). The Rust behaviour is technically more precise — the JS version is arguably a bug — but the two engines will produce different definitions entries for category methods, which breaks the goal of identical symbol output stated in the PR description. Worth a deliberate decision: fix the JS side to match Rust, or document the intentional divergence.

Contributor Author

Aligned by updating the JS extractor in de05222: handleClassInterface, handleClassImplementation, and findObjCParentClass now include the (Category) suffix when the grammar emits class_interface/class_implementation with a category field. Per the CLAUDE.md guidance, when engines diverge we fix the less-accurate side rather than document the gap; the qualified parent is the correct behaviour because two categories on the same class can declare same-named methods that need to stay distinguished in the symbol output.
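The naming rule agreed on in this thread can be illustrated with a small sketch. `qualified_parent` and `qualified_method` are hypothetical helpers that take plain strings; the real extractors read the class name and the `category` field from tree-sitter nodes.

```rust
/// Build the parent name for a method: `Foo` for a plain class,
/// `Foo(Cat)` when the declaration is a category. Hypothetical helper.
fn qualified_parent(class_name: &str, category: Option<&str>) -> String {
    match category {
        Some(cat) => format!("{}({})", class_name, cat),
        None => class_name.to_string(),
    }
}

/// Combine the parent name and the selector into one definition name.
fn qualified_method(class_name: &str, category: Option<&str>, selector: &str) -> String {
    format!("{}.{}", qualified_parent(class_name, category), selector)
}
```

With this scheme, a same-named method declared in two categories of one class produces two distinct definition names, where the bare-class form would collapse them into one.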

Comment on lines +224 to +225
let module_node = node.child_by_field_name("path")
    .or_else(|| find_child(node, "identifier"));
Contributor

P2 handle_at_import uses wrong field name "path"

The JS extractor looks up childForFieldName('module') for @import declarations, consistent with the grammar's module field on import_declaration / module_import. The Rust side tries child_by_field_name("path"), which is the field name used for #import (preprocessor includes). Since both fall back to the first identifier child, extraction still works — but the path fallback means the named-field fast-path will always miss, and the field name signals an incorrect assumption about the grammar structure.

Suggested change
- let module_node = node.child_by_field_name("path")
-     .or_else(|| find_child(node, "identifier"));
+ let module_node = node.child_by_field_name("module")
+     .or_else(|| find_child(node, "identifier"));

Contributor Author

Fixed in de05222 — switched handle_at_import to look up child_by_field_name("module") first, matching the JS extractor and the actual module_import grammar field. The identifier fallback is preserved so older grammar revisions still parse.
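The field-first lookup with an identifier fallback can be modelled without tree-sitter. In the sketch below, `FieldChild` and `module_name` are hypothetical stand-ins for a syntax node's children and for `handle_at_import`; it also shows why the earlier `"path"` lookup still produced the right name.

```rust
/// Toy stand-in for one child of a syntax node. Hypothetical type; the
/// real code works on `tree_sitter::Node`.
struct FieldChild {
    field: Option<&'static str>,
    kind: &'static str,
    text: &'static str,
}

/// Look up the child carrying the named field first, then fall back to
/// the first `identifier` child if no child has that field.
fn module_name<'a>(children: &'a [FieldChild], field: &str) -> Option<&'a str> {
    children
        .iter()
        .find(|c| c.field.map_or(false, |f| f == field))
        .or_else(|| children.iter().find(|c| c.kind == "identifier"))
        .map(|c| c.text)
}
```

For a node whose only child is an `identifier` tagged with the `module` field, looking up `"module"` hits the fast path and looking up `"path"` misses it, yet both return the same text through the fallback. That is why the wrong field name produced no failing test.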

- Use `module` field (not `path`) for `@import` in Rust to mirror the JS
  extractor and match the tree-sitter-objc `module_import` grammar field.
- Drop the unreachable `implementation_definition` branch from
  `collect_class_members` — it is only invoked from `handle_class_interface`
  and `class_interface` nodes do not contain `implementation_definition`
  children.
- Qualify category methods with `(Category)` in the JS extractor so its
  output matches Rust for `@interface Foo (Cat)` / `@implementation Foo (Cat)`
  when the grammar emits `class_interface`/`class_implementation` rather
  than dedicated `category_interface` nodes. Two categories can declare
  same-named methods, so the qualified parent disambiguates the symbols.
- Document the `.m` extension collision with MATLAB/Octave in the file
  collector since `.m` files are unconditionally routed to the ObjC parser.
@carlos-alm
Contributor Author

@greptileai

Comment on lines 27 to 28
]);
expect(totals['unsupported-by-native']).toBe(10);
Contributor

P1 Off-by-one assertion after removing 'src/k.m'

The diff removes 'src/k.m' from the input array (because .m is now natively supported and would no longer fall into the unsupported-by-native bucket), but the expected count was not updated. The array now contains 9 paths yet the assertion is toBe(10), so this test fails at runtime.

Suggested change
  ]);
- expect(totals['unsupported-by-native']).toBe(10);
+ expect(totals['unsupported-by-native']).toBe(9);

The native-drop-classification test asserts the count of WASM-only
languages. The Clojure PR (#1097) removed src/c.clj and decremented
the count from 11 to 10. This PR removes src/k.m (now natively
supported via ObjC). After the merge both entries are removed, so
the assertion needs to be 9, not 10.
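The count arithmetic in this commit message follows from how the bucket is computed: each path whose extension gains native support leaves the unsupported set. The sketch below is a simplified model with a hypothetical `count_unsupported` helper and placeholder extensions; the real test is written in TypeScript and uses its own fixture paths.

```rust
/// Count the paths whose extension is not in the native-supported list.
/// Hypothetical helper: takes the text after the last `.` as the
/// extension, so multi-part extensions such as `d.ts` are not modelled.
fn count_unsupported(paths: &[&str], native_exts: &[&str]) -> usize {
    paths
        .iter()
        .filter(|p| {
            let ext = p.rsplit('.').next().unwrap_or("");
            !native_exts.contains(&ext)
        })
        .count()
}
```

With three placeholder paths ending in `.aaa`, `.bbb`, and `.m`, the count is 3 while only `rs` is native and drops to 2 once `m` is added. The same reasoning gives 11 minus the Clojure fixture minus the ObjC fixture, which is 9.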
Comment on lines +391 to +412
    }

    if parts.is_empty() {
        return None;
    }
    if has_params {
        Some(format!("{}:", parts.join(":")))
    } else {
        Some(parts.join(":"))
    }
}

/// Build a message-expression selector by collecting all `identifier`
/// children annotated with the `method` field.
fn build_message_selector(message_node: &Node, source: &[u8]) -> String {
    let mut parts: Vec<String> = Vec::new();
    let mut has_colon = false;
    for i in 0..message_node.child_count() {
        if let Some(child) = message_node.child(i) {
            if let Some(field) = message_node.field_name_for_child(i as u32) {
                if field == "method" {
                    parts.push(node_text(&child, source).to_string());
Contributor

P1 Rust captures C-style calls that the JS extractor silently skips

The Rust comment on line 392 states that tree-sitter-objc lacks a function field on call_expression, so the named-field lookup always misses and the identifier-child fallback is always used. The JS handleCCallExpr in src/extractors/objc.ts also calls childForFieldName('function') but returns immediately when it is null — no fallback, no call recorded. In practice this means C-style calls like printf(...) or CGContextFillRect(...) appear in the native graph but are absent from the JS one. The stated goal of identical output between both engines is not met for this node type.
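The divergence described here comes down to whether a missed field lookup ends the extraction or triggers a fallback. The sketch below models both strategies on a toy node; `CallChild`, `callee_strict`, and `callee_with_fallback` are hypothetical names, and the `argument_list` kind is an assumption made for illustration.

```rust
/// Toy stand-in for one child of a `call_expression` node.
struct CallChild {
    field: Option<&'static str>,
    kind: &'static str,
    text: &'static str,
}

/// JS-style behaviour as described in the review: give up when no child
/// carries the `function` field.
fn callee_strict(children: &[CallChild]) -> Option<&str> {
    children
        .iter()
        .find(|c| c.field == Some("function"))
        .map(|c| c.text)
}

/// Rust-style behaviour as described in the review: fall back to the
/// first `identifier` child when the field lookup misses.
fn callee_with_fallback(children: &[CallChild]) -> Option<&str> {
    callee_strict(children).or_else(|| {
        children
            .iter()
            .find(|c| c.kind == "identifier")
            .map(|c| c.text)
    })
}
```

If the grammar never sets the `function` field, the strict variant records no call for `printf(...)` while the fallback variant records `printf`. Parity would require both engines to agree on one of the two behaviours.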

Development

Successfully merging this pull request may close these issues.

Rust engine parity: port the 11 remaining JS-only language extractors
