feat(native): port Objective-C extractor to Rust#1106
Conversation
Mirrors `src/extractors/objc.ts` in `crates/codegraph-core/src/extractors/objc.rs`. Adds the `tree-sitter-objc` dependency, wires `LanguageKind::ObjC` (`.m`) in the Rust `parser_registry` and `file_collector`, adds `.m` to `NATIVE_SUPPORTED_EXTENSIONS` on the JS side, and registers `OBJC_AST_TYPES` / `OBJC_AST_CONFIG` so the native and WASM engines extract identical `ast_nodes` for Objective-C source. Handles class interfaces / implementations (with `: Superclass`), protocols, instance and class method declarations/definitions with multi-part selectors assembled from leading identifiers and `method_parameter` children, C-level function declarations/definitions, `#import`/`@import` imports, and message expression call extraction.
|
Claude finished @carlos-alm's task —— View job PR Review: feat(native): port Objective-C extractor to RustTodo List:
Phase 1: Root Cause Analysis ✅ Problem Being Solved: Adding Objective-C language support to complete the dual-engine architecture. The native Rust engine lacks Objective-C extraction capabilities that exist in the WASM engine, creating a parser gap that violates the "both engines must produce identical results" principle. Why This Approach: The PR follows the established pattern for adding language support - adds tree-sitter grammar dependency, implements native Rust extractor mirroring existing WASM extractor ( Risk Assessment: Low risk - purely additive change following established patterns, comprehensive test coverage, no breaking changes to existing functionality. Phase 2: Configuration Change Detection ✅ CLEAN - No high-risk configuration changes detected:
Phase 3: Test Integrity ✅ HEALTHY - Test changes are appropriate:
Phase 4: Code Quality Assessment ✅ EXCELLENT - High-quality implementation: Architecture Consistency:
Implementation Quality:
Code Structure:
No Issues Detected: Security, performance, or maintainability concerns. Root Cause AnalysisProblem Being Solved: Engine parity violation - native Rust engine missing Objective-C support that exists in WASM engine Backlog Compliance
Critical ConcernsNone identified. This is a textbook example of proper language support addition. Final Recommendation
Excellent work - this PR exemplifies how language support should be added to codegraph. |
Codegraph Impact Analysis46 functions changed → 26 callers affected across 3 files
|
Greptile SummaryThis PR ports the Objective-C extractor from TypeScript to Rust by adding a new
Confidence Score: 4/5Safe to merge with one known gap: the JS extractor silently drops every @import statement while the Rust extractor captures them correctly. The Rust extractor is well-tested and the cross-engine alignment work (category naming, field names, routing) is solid. The one remaining gap is in src/extractors/objc.ts: the switch arm dispatches @import handling on 'import_declaration' but tree-sitter-objc v3.0.2 emits module_import for those nodes, so handleAtImport is never called on the JS path and all @import symbols are silently lost. src/extractors/objc.ts — the 'import_declaration' case label on the @import dispatch arm needs to be 'module_import'. Important Files Changed
Sequence DiagramsequenceDiagram
participant Router as parser.ts / file_collector
participant Native as ObjCExtractor (Rust)
participant JS as extractObjCSymbols (TS)
participant Grammar as tree-sitter-objc v3.0.2
Router->>Router: sees .m extension
alt Native path
Router->>Native: extract(tree, source)
Native->>Grammar: parse .m
Grammar-->>Native: module_import / class_interface / method_definition
Native-->>Router: "FileSymbols with @import captured"
else JS path
Router->>JS: extractObjCSymbols(tree, filePath)
JS->>Grammar: parse .m
Grammar-->>JS: module_import node
JS-->>Router: ExtractorOutput
Note over JS: case import_declaration never matches module_import
end
|
| "implementation_definition" => { | ||
| // Wraps a `method_definition` inside `class_implementation`. | ||
| if let Some(method) = find_child(&child, "method_definition") { | ||
| if let Some(sel) = build_selector(&method, source) { | ||
| members.push(child_def(sel, "method", start_line(&method))); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Dead-code branch:
implementation_definition inside collect_class_members
collect_class_members is only ever called from handle_class_interface, and class_interface nodes never contain implementation_definition children — those live exclusively inside class_implementation. This branch can never execute and the comment ("Wraps a method_definition inside class_implementation") even acknowledges the mismatch. If the intent was to also populate children for class implementations, collect_class_members would need to be called from handle_class_implementation as well.
There was a problem hiding this comment.
Fixed in de05222 — removed the unreachable implementation_definition branch from collect_class_members. Confirmed: collect_class_members is only invoked by handle_class_interface, so a class_implementation wrapper would never reach it. The Rust test suite still passes (9 ObjC tests).
| "js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb", | ||
| "rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt", | ||
| "kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli", | ||
| "m", |
There was a problem hiding this comment.
.m extension collision with MATLAB/Octave
The .m extension is also the canonical extension for MATLAB and GNU Octave source files. Any MATLAB code in a repository will now be silently routed to the ObjC parser and produce garbled or empty symbol output without any error signal. If supporting MATLAB is a future goal, disambiguation will require heuristics (e.g., checking for ObjC-specific tokens like @interface vs MATLAB keywords). Worth at least documenting the known false-positive risk in a comment.
There was a problem hiding this comment.
Documented in de05222 — added a "Known extension collisions" note to SUPPORTED_EXTENSIONS in crates/codegraph-core/src/file_collector.rs calling out the .m overlap with MATLAB/Octave, the resulting silent garbled-symbol output for MATLAB files, and the disambiguation heuristics that would be needed to support MATLAB later. Disambiguation itself is out of scope for this PR — .m is unambiguously routed to the ObjC parser today.
| fn find_objc_parent_class(node: &Node, source: &[u8]) -> Option<String> { | ||
| let mut current = node.parent(); | ||
| while let Some(parent) = current { | ||
| match parent.kind() { | ||
| "class_interface" | ||
| | "class_implementation" | ||
| | "protocol_declaration" => { | ||
| let name_node = find_objc_decl_name(&parent)?; | ||
| let base = node_text(&name_node, source).to_string(); | ||
| // Categories: include `(Cat)` so methods are grouped per category. | ||
| if let Some(cat) = parent.child_by_field_name("category") { | ||
| return Some(format!("{}({})", base, node_text(&cat, source))); | ||
| } | ||
| return Some(base); | ||
| } | ||
| _ => {} | ||
| } | ||
| current = parent.parent(); | ||
| } | ||
| None |
There was a problem hiding this comment.
Category method naming diverges from the JS extractor
find_objc_parent_class includes the category suffix in the parent name (e.g. Foo(Cat).catMethod), but the JS findObjCParentClass returns only the bare class name (Foo.catMethod). The Rust behaviour is technically more precise — the JS version is arguably a bug — but the two engines will produce different definitions entries for category methods, which breaks the goal of identical symbol output stated in the PR description. Worth a deliberate decision: fix the JS side to match Rust, or document the intentional divergence.
There was a problem hiding this comment.
Aligned by updating the JS extractor in de05222 — handleClassInterface, handleClassImplementation, and findObjCParentClass now include the (Category) suffix when the grammar emits class_interface/class_implementation with a category field. Per the CLAUDE.md guidance, when engines diverge we fix the less-accurate side rather than document the gap; the qualified parent is the correct behaviour because two categories on the same class can declare same-named methods that need to stay distinguished in the symbol output.
| let module_node = node.child_by_field_name("path") | ||
| .or_else(|| find_child(node, "identifier")); |
There was a problem hiding this comment.
handle_at_import uses wrong field name "path"
The JS extractor looks up childForFieldName('module') for @import declarations, consistent with the grammar's module field on import_declaration / module_import. The Rust side tries child_by_field_name("path"), which is the field name used for #import (preprocessor includes). Since both fall back to the first identifier child, extraction still works — but the path fallback means the named-field fast-path will always miss, and the field name signals an incorrect assumption about the grammar structure.
| let module_node = node.child_by_field_name("path") | |
| .or_else(|| find_child(node, "identifier")); | |
| let module_node = node.child_by_field_name("module") | |
| .or_else(|| find_child(node, "identifier")); |
There was a problem hiding this comment.
Fixed in de05222 — switched handle_at_import to look up child_by_field_name("module") first, matching the JS extractor and the actual module_import grammar field. The identifier fallback is preserved so older grammar revisions still parse.
- Use `module` field (not `path`) for `@import` in Rust to mirror the JS extractor and match the tree-sitter-objc `module_import` grammar field. - Drop the unreachable `implementation_definition` branch from `collect_class_members` — it is only invoked from `handle_class_interface` and `class_interface` nodes do not contain `implementation_definition` children. - Qualify category methods with `(Category)` in the JS extractor so its output matches Rust for `@interface Foo (Cat)` / `@implementation Foo (Cat)` when the grammar emits `class_interface`/`class_implementation` rather than dedicated `category_interface` nodes. Two categories can declare same-named methods, so the qualified parent disambiguates the symbols. - Document the `.m` extension collision with MATLAB/Octave in the file collector since `.m` files are unconditionally routed to the ObjC parser.
| ]); | ||
| expect(totals['unsupported-by-native']).toBe(10); |
There was a problem hiding this comment.
Off-by-one assertion after removing
'src/k.m'
The diff removes 'src/k.m' from the input array (because .m is now natively supported and would no longer fall into the unsupported-by-native bucket), but the expected count was not updated. The array now contains 9 paths yet the assertion is toBe(10), so this test fails at runtime.
| ]); | |
| expect(totals['unsupported-by-native']).toBe(10); | |
| expect(totals['unsupported-by-native']).toBe(9); |
The native-drop-classification test asserts the count of WASM-only languages. The Clojure PR (#1097) removed src/c.clj and decremented the count from 11 to 10. This PR removes src/k.m (now natively supported via ObjC). After the merge both entries are removed, so the assertion needs to be 9, not 10.
| } | ||
|
|
||
| if parts.is_empty() { | ||
| return None; | ||
| } | ||
| if has_params { | ||
| Some(format!("{}:", parts.join(":"))) | ||
| } else { | ||
| Some(parts.join(":")) | ||
| } | ||
| } | ||
|
|
||
| /// Build a message-expression selector by collecting all `identifier` | ||
| /// children annotated with the `method` field. | ||
| fn build_message_selector(message_node: &Node, source: &[u8]) -> String { | ||
| let mut parts: Vec<String> = Vec::new(); | ||
| let mut has_colon = false; | ||
| for i in 0..message_node.child_count() { | ||
| if let Some(child) = message_node.child(i) { | ||
| if let Some(field) = message_node.field_name_for_child(i as u32) { | ||
| if field == "method" { | ||
| parts.push(node_text(&child, source).to_string()); |
There was a problem hiding this comment.
Rust captures C-style calls that the JS extractor silently skips
The Rust comment on line 392 states that tree-sitter-objc lacks a function field on call_expression, so the named-field lookup always misses and the identifier-child fallback is always used. The JS handleCCallExpr in src/extractors/objc.ts also calls childForFieldName('function') but returns immediately when it is null — no fallback, no call recorded. In practice this means C-style calls like printf(...) or CGContextFillRect(...) appear in the native graph but are absent from the JS one. The stated goal of identical output between both engines is not met for this node type.
Summary
tree-sitter-objcdependency and a native Objective-C extractor incrates/codegraph-core/src/extractors/objc.rs..mwithLanguageKind::ObjCand the Rustfile_collector, adds Objective-C toNATIVE_SUPPORTED_EXTENSIONSon the JS side, and wiresOBJC_AST_TYPES/OBJC_AST_CONFIGon both the native and JS sides so the two engines extract identicalast_nodesfor Objective-C source.extractObjCSymbols:class_interface/class_implementation(with: Superclass) asclass,@protocolasinterface, instance and class method declarations / definitions with multi-part selectors assembled from leading identifiers andmethod_parameterchildren, C-level function declarations / definitions,#import/@importimports, and message-expression call extraction.Closes #1071
Test plan
cargo build --release -p codegraph-core(clean build)cargo test -p codegraph-core --lib— 193/193npx tree-sitter build --wasmregeneratestree-sitter-objc.wasmnpx vitest run tests/parsers/objc.test.ts— 5/5npx vitest run tests/parsers/native-drop-classification.test.ts— 13/13