Commit 8504702

docs: rewrite adding-a-language guide for LANGUAGE_REGISTRY architecture
The guide was outdated: it described manual parser routing, ternary chains in builder.js, and hand-edited EXTENSIONS in constants.js. Rewritten around the current LANGUAGE_REGISTRY pattern:
- One registry entry + one extractor function = full WASM support
- No more constants.js edits (EXTENSIONS derived automatically)
- No more builder.js edits (parseFilesAuto uses the registry)
- createParsers() now returns a Map, not an object with .xParser props
- Removed feat/rust-core branch references (native engine is on main)
- Updated test examples to use parsers.get() instead of parsers.xParser
- Clarified required vs optional parsers and graceful fallback
1 parent 2312c92 commit 8504702

1 file changed

Lines changed: 95 additions & 87 deletions

docs/adding-a-language.md

@@ -1,8 +1,7 @@
11
# Adding a New Language to Codegraph
22

33
This guide walks through every file you need to touch when adding support for a
4-
new programming language. It covers both the **WASM engine** (main branch) and
5-
the **native Rust engine** (`feat/rust-core` branch).
4+
new programming language.
65

76
---
87

@@ -20,6 +19,32 @@ queries are engine-agnostic. When adding a new language you implement the
2019
extraction logic **twice** — once in JavaScript (WASM) and once in Rust
2120
(native) — and a parity test guarantees they agree.
2221

22+
### The LANGUAGE_REGISTRY
23+
24+
`LANGUAGE_REGISTRY` in `src/parser.js` is the **single source of truth** for all
25+
supported languages. Each entry declares:
26+
27+
```js
28+
{
29+
id: 'go', // Language identifier
30+
extensions: ['.go'], // File extensions (auto-derives EXTENSIONS)
31+
grammarFile: 'tree-sitter-go.wasm', // WASM grammar filename
32+
extractor: extractGoSymbols, // Extraction function reference
33+
required: false, // true = crash if missing; false = skip gracefully
34+
}
35+
```
36+
37+
Adding a language to the WASM engine requires **one registry entry** plus an
38+
extractor function. Everything else — extension routing, parser loading, dispatch
39+
— is automatic.
40+
41+
- `SUPPORTED_EXTENSIONS` (re-exported as `EXTENSIONS` in `constants.js`) is
42+
**derived** from the registry. You never edit it manually.
43+
- `createParsers()` iterates the registry and builds a `Map<id, Parser>`.
44+
- `getParser()` uses an extension→registry lookup map (`_extToLang`).
45+
- `wasmExtractSymbols()` calls `entry.extractor(tree, filePath)` — no ternary chains.
46+
- `parseFilesAuto()` in `builder.js` handles all dispatch — no per-language routing needed.
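
The derivation described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `src/parser.js`: the two stub extractors, the trimmed-down `REGISTRY` array, and the helper names `lookupEntry` and `extractWithRegistry` are assumptions made for the example.

```js
// Stub extractors standing in for the real ones (assumption: the real
// extractors return a richer symbols object).
const extractGoSymbols = (tree, filePath) => ({ lang: 'go', filePath });
const extractRubySymbols = (tree, filePath) => ({ lang: 'ruby', filePath });

// Hypothetical, trimmed-down registry with the same entry shape as the guide.
const REGISTRY = [
  { id: 'go', extensions: ['.go'], grammarFile: 'tree-sitter-go.wasm', extractor: extractGoSymbols, required: false },
  { id: 'ruby', extensions: ['.rb'], grammarFile: 'tree-sitter-ruby.wasm', extractor: extractRubySymbols, required: false },
];

// The extension set is derived from the registry, never edited by hand.
const SUPPORTED = new Set(REGISTRY.flatMap((entry) => entry.extensions));

// Extension -> registry entry lookup, built once.
const extToLang = new Map();
for (const entry of REGISTRY) {
  for (const ext of entry.extensions) extToLang.set(ext, entry);
}

// Find the registry entry for a file path, or undefined if unsupported.
function lookupEntry(filePath) {
  const dot = filePath.lastIndexOf('.');
  if (dot === -1) return undefined;
  return extToLang.get(filePath.slice(dot));
}

// Dispatch to the entry's extractor; no per-language branching is needed.
function extractWithRegistry(tree, filePath) {
  const entry = lookupEntry(filePath);
  return entry ? entry.extractor(tree, filePath) : null;
}
```

With this shape, supporting another extension only means adding a registry entry; the set, the lookup map, and the dispatch pick it up without further edits.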
47+
2348
---
2449

2550
## Symbol Model
@@ -40,12 +65,16 @@ FileSymbols {
4065

4166
| Structure | Fields | Notes |
4267
|-----------|--------|-------|
43-
| `Definition` | `name`, `kind`, `line`, `endLine`, `decorators?` | `kind` ∈ `function`, `method`, `class`, `interface`, `type` |
68+
| `Definition` | `name`, `kind`, `line`, `endLine`, `decorators?` | `kind` ∈ `SYMBOL_KINDS` (see below) |
4469
| `Call` | `name`, `line`, `dynamic?` | |
4570
| `Import` | `source`, `names[]`, `line`, `<lang>Import?` | Set a language flag like `cInclude: true` |
4671
| `ClassRelation` | `name`, `extends?`, `implements?`, `line` | |
4772
| `ExportInfo` | `name`, `kind`, `line` | |
4873

74+
**Symbol kinds:** `function`, `method`, `class`, `interface`, `type`, `struct`,
75+
`enum`, `trait`, `record`, `module`. Use the language's native kind (e.g. Go
76+
structs → `struct`, Rust traits → `trait`, Ruby modules → `module`).
77+
4978
Methods inside a class use the `ClassName.methodName` naming convention.
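
As a rough illustration of the kinds and the naming convention, the sketch below builds two `Definition` records for a Go-style struct with one method. The helpers `makeDefinition` and `qualifyMethod` are hypothetical, and representing `SYMBOL_KINDS` as a plain array is an assumption; only the field names and kind strings come from the table and list above.

```js
// Assumption: the kinds are modeled here as a plain array of strings.
const SYMBOL_KINDS = [
  'function', 'method', 'class', 'interface', 'type',
  'struct', 'enum', 'trait', 'record', 'module',
];

// Hypothetical helper: build a Definition and reject unknown kinds early.
function makeDefinition(name, kind, line, endLine) {
  if (!SYMBOL_KINDS.includes(kind)) {
    throw new Error(`unknown symbol kind: ${kind}`);
  }
  return { name, kind, line, endLine };
}

// Hypothetical helper: apply the ClassName.methodName convention.
function qualifyMethod(ownerName, methodName) {
  return `${ownerName}.${methodName}`;
}

// A Go struct is recorded with its native kind, and its method is
// qualified with the owning type's name.
const definitions = [
  makeDefinition('Server', 'struct', 3, 6),
  makeDefinition(qualifyMethod('Server', 'Start'), 'method', 8, 12),
];
```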
5079

5180
---
@@ -85,50 +114,19 @@ Build the WASM binary:
85114
npm run build:wasm
86115
```
87116

88-
This generates `grammars/tree-sitter-<lang>.wasm`. Commit this file.
89-
90-
### 3. `src/constants.js` — register file extensions
91-
92-
```js
93-
export const EXTENSIONS = new Set([
94-
// ... existing ...
95-
'.<ext>', // e.g. '.c', '.h'
96-
]);
97-
```
98-
99-
### 4. `src/parser.js` — WASM extraction (3 changes)
100-
101-
#### 4a. Load the grammar in `createParsers()`
102-
103-
Follow the graceful-fallback pattern used by every optional language:
104-
105-
```js
106-
let <lang>Parser = null;
107-
try {
108-
const <Lang> = await Language.load(grammarPath('tree-sitter-<lang>.wasm'));
109-
<lang>Parser = new Parser();
110-
<lang>Parser.setLanguage(<Lang>);
111-
} catch (e) {
112-
warn(`<Lang> parser failed to initialize: ${e.message}. <Lang> files will be skipped.`);
113-
}
114-
```
115-
116-
Return it from the object: `return { ..., <lang>Parser };`
117+
This generates `grammars/tree-sitter-<lang>.wasm` (gitignored — built from
118+
devDeps on `npm install`).
117119

118-
#### 4b. Route extensions in `getParser()`
120+
### 3. `src/parser.js` — add extractor and registry entry
119121

120-
```js
121-
if ((filePath.endsWith('.<ext>')) && parsers.<lang>Parser)
122-
return parsers.<lang>Parser;
123-
```
122+
This is the only source file where you need to make changes on the JS side.
123+
Two things to do:
124124

125-
> Place this *before* the `return null;` at the end of `getParser()`.
125+
#### 3a. Create `extract<Lang>Symbols(tree, filePath)`
126126

127-
#### 4c. Create `extract<Lang>Symbols(tree, filePath)`
128-
129-
This is where the real work happens. Write a recursive AST walker that matches
130-
tree-sitter node types for your language. Copy the pattern from an existing
131-
extractor like `extractGoSymbols` or `extractRustSymbols`:
127+
Write a recursive AST walker that matches tree-sitter node types for your
128+
language. Copy the pattern from an existing extractor like `extractGoSymbols` or
129+
`extractRustSymbols`:
132130

133131
```js
134132
/**
@@ -197,53 +195,53 @@ export function extract<Lang>Symbols(tree, filePath) {
197195
to explore AST node types for your language. Paste sample code and inspect the
198196
tree to find the right `node.type` strings.
199197

200-
#### 4d. Add WASM dispatch in `wasmExtractSymbols()` (feat/rust-core only)
198+
#### 3b. Add an entry to `LANGUAGE_REGISTRY`
201199

202-
On the `feat/rust-core` branch, `parser.js` has a unified `wasmExtractSymbols`
203-
helper. Add your language before the final `return extractSymbols(...)`:
200+
Add your language to the `LANGUAGE_REGISTRY` array in `src/parser.js`:
204201

205202
```js
206-
if (filePath.endsWith('.<ext>')) return extract<Lang>Symbols(tree, filePath);
203+
{
204+
id: '<lang>',
205+
extensions: ['.<ext>'],
206+
grammarFile: 'tree-sitter-<lang>.wasm',
207+
extractor: extract<Lang>Symbols,
208+
required: false,
209+
},
207210
```
208211

209-
### 5. `src/builder.js` — route parsing (main branch only)
210-
211-
On `main`, the builder dispatches manually. On `feat/rust-core` this is
212-
replaced by `parseFilesAuto`, so **skip this step on the rust branch**.
212+
Set `required: false` so codegraph still works when the WASM grammar isn't
213+
available (e.g. in CI without `npm install`). Only JS/TS/TSX are `required: true`.
213214

214-
**main branch** — add your language to the import and ternary chain:
215+
That's it for the WASM engine. The registry automatically:
216+
- Adds `.<ext>` to `SUPPORTED_EXTENSIONS` (and `EXTENSIONS` in `constants.js`)
217+
- Registers the parser in `createParsers()`
218+
- Routes `getParser()` calls via the extension map
219+
- Dispatches to your extractor in `wasmExtractSymbols()`
220+
- Handles `builder.js` routing via `parseFilesAuto()`
215221

216-
```js
217-
// Import
218-
import { ..., extract<Lang>Symbols } from './parser.js';
222+
**You do not need to edit `constants.js` or `builder.js`.**
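
The effect of the `required` flag can be sketched as below. This is not the real `createParsers()`: the function name `buildParsers`, the injected `loadParser` callback (a stand-in for loading a WASM grammar), and the exact warning text are assumptions for illustration.

```js
// Build a Map<id, parser> from registry entries.
// loadParser(grammarFile) is a hypothetical async loader supplied by the caller.
async function buildParsers(registry, loadParser, warn = console.warn) {
  const parsers = new Map();
  for (const entry of registry) {
    try {
      parsers.set(entry.id, await loadParser(entry.grammarFile));
    } catch (e) {
      // A required language must be available: surface the failure.
      if (entry.required) throw e;
      // An optional language is skipped; its files are simply not parsed.
      warn(`${entry.id} parser failed to initialize: ${e.message}. ${entry.id} files will be skipped.`);
    }
  }
  return parsers;
}
```

Callers then use `parsers.get('<lang>')` and treat a missing entry as "language unavailable", which matches the guard used in the parser tests.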
219223

220-
// In the parsing loop, add before `extractSymbols(tree, filePath)`
221-
const is<Lang> = filePath.endsWith('.<ext>');
222-
// ... add to the ternary chain:
223-
: is<Lang> ? extract<Lang>Symbols(tree, filePath)
224-
```
225-
226-
### 6. `src/parser.js` — update `normalizeNativeSymbols` (feat/rust-core only)
224+
### 4. `src/parser.js` — update `normalizeNativeSymbols` (if needed)
227225

228226
If your language's imports use a language-specific flag (e.g. `c_include`), add
229-
the camelCase mapping:
227+
the camelCase mapping in `normalizeNativeSymbols`:
230228

231229
```js
232230
<lang>Import: i.<lang>Import ?? i.<lang>_import,
233231
```
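
For context, a minimal sketch of what such a mapping does for a single import record is shown below. The helper name `normalizeImport` is hypothetical and the real `normalizeNativeSymbols` may be structured differently; `cInclude` / `c_include` are the example flag names used in this guide.

```js
// Hypothetical per-import normalizer: accept either the camelCase field
// (WASM engine) or the snake_case field (native engine) and emit camelCase.
function normalizeImport(i) {
  return {
    source: i.source,
    names: i.names ?? [],
    line: i.line,
    // Prefer the camelCase flag, fall back to the snake_case one.
    cInclude: i.cInclude ?? i.c_include,
  };
}
```

Using `??` rather than `||` keeps an explicit `false` on the camelCase field from being overridden by the snake_case fallback.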
234232
235233
---
236234
237-
## Native Engine (feat/rust-core branch)
235+
## Native Engine (Rust)
238236
239-
### 7. `crates/codegraph-core/Cargo.toml` — add the Rust tree-sitter crate
237+
### 5. `crates/codegraph-core/Cargo.toml` — add the Rust tree-sitter crate
240238
241239
```toml
242240
[dependencies]
243241
tree-sitter-<lang> = "0.x"
244242
```
245243
246-
### 8. `crates/codegraph-core/src/parser_registry.rs` — register the language
244+
### 6. `crates/codegraph-core/src/parser_registry.rs` — register the language
247245
248246
Three changes in this file:
249247
@@ -274,7 +272,7 @@ impl LanguageKind {
274272
}
275273
```
276274
277-
### 9. `crates/codegraph-core/src/extractors/<lang>.rs` — implement the Rust extractor
275+
### 7. `crates/codegraph-core/src/extractors/<lang>.rs` — implement the Rust extractor
278276
279277
Create a new file following the pattern in `go.rs` or `rust_lang.rs`:
280278
@@ -332,7 +330,7 @@ fn walk_node(node: &Node, source: &[u8], symbols: &mut FileSymbols) {
332330
| `named_child_text(&node, "field", source)` | Shorthand for field text |
333331
| `start_line(&node)` / `end_line(&node)` | 1-based line numbers |
334332

335-
### 10. `crates/codegraph-core/src/extractors/mod.rs` — wire it up
333+
### 8. `crates/codegraph-core/src/extractors/mod.rs` — wire it up
336334

337335
```rust
338336
// 1. Declare module
@@ -347,7 +345,7 @@ pub fn extract_symbols(...) -> FileSymbols {
347345
}
348346
```
349347

350-
### 11. `crates/codegraph-core/src/types.rs` — add language flag (if needed)
348+
### 9. `crates/codegraph-core/src/types.rs` — add language flag (if needed)
351349

352350
If your imports need a language-specific flag, add it to the `Import` struct:
353351

@@ -361,7 +359,7 @@ And update `Import::new()` to default it to `None`.
361359

362360
## Tests
363361

364-
### 12. `tests/parsers/<lang>.test.js` — WASM parser tests
362+
### 10. `tests/parsers/<lang>.test.js` — WASM parser tests
365363

366364
Follow the pattern from `tests/parsers/go.test.js`:
367365

@@ -377,7 +375,7 @@ describe('<Lang> parser', () => {
377375
});
378376

379377
function parse<Lang>(code) {
380-
const parser = parsers.<lang>Parser;
378+
const parser = parsers.get('<lang>');
381379
if (!parser) throw new Error('<Lang> parser not available');
382380
const tree = parser.parse(code);
383381
return extract<Lang>Symbols(tree, 'test.<ext>');
@@ -394,6 +392,9 @@ describe('<Lang> parser', () => {
394392
});
395393
```
396394

395+
> **Note:** `parsers` is a `Map` — use `parsers.get('<lang>')`, not
396+
> `parsers.<lang>Parser`.
397+
397398
**Recommended test cases:**
398399
- Function definitions (regular, with parameters)
399400
- Class/struct/enum definitions
@@ -403,7 +404,7 @@ describe('<Lang> parser', () => {
403404
- Type definitions / aliases
404405
- Forward declarations (if applicable)
405406

406-
### 13. Parity tests (feat/rust-core only)
407+
### 11. Parity tests — native vs WASM
407408

408409
Add test snippets to `tests/engines/parity.test.js` to verify the native and
409410
WASM extractors produce identical output for your language.
@@ -422,27 +423,34 @@ npx vitest run tests/parsers/<lang>.test.js
422423
# 3. Run the full test suite
423424
npm test
424425

425-
# 4. (feat/rust-core) Build native and test parity
426+
# 4. Build native and test parity
426427
cd crates/codegraph-core && cargo build
427428
npx vitest run tests/engines/parity.test.js
429+
430+
# 5. Test on a real project
431+
codegraph build /path/to/a/<lang>/project
432+
codegraph map
433+
codegraph fn someFunction
428434
```
429435

430436
---
431437

432438
## File Checklist Summary
433439

434-
| # | File | Branch | Action |
440+
| # | File | Engine | Action |
435441
|---|------|--------|--------|
436-
| 1 | `package.json` | both | Add `tree-sitter-<lang>` devDependency |
437-
| 2 | `scripts/build-wasm.js` | both | Add grammar entry |
438-
| 3 | `grammars/tree-sitter-<lang>.wasm` | both | Generated by `npm run build:wasm` |
439-
| 4 | `src/constants.js` | both | Add file extensions |
440-
| 5 | `src/parser.js` | both | Load grammar, route parser, add `extract<Lang>Symbols()`, add WASM dispatch |
441-
| 6 | `src/builder.js` | main only | Import + ternary routing (not needed on rust branch) |
442-
| 7 | `tests/parsers/<lang>.test.js` | both | WASM parser tests |
443-
| 8 | `crates/codegraph-core/Cargo.toml` | rust | Add tree-sitter crate |
444-
| 9 | `crates/.../parser_registry.rs` | rust | Register enum + extension + grammar |
445-
| 10 | `crates/.../extractors/<lang>.rs` | rust | Implement `SymbolExtractor` trait |
446-
| 11 | `crates/.../extractors/mod.rs` | rust | Declare module + dispatch arm |
447-
| 12 | `crates/.../types.rs` | rust | Add language flag to `Import` (if needed) |
448-
| 13 | `tests/engines/parity.test.js` | rust | Cross-engine validation snippets |
442+
| 1 | `package.json` | WASM | Add `tree-sitter-<lang>` devDependency |
443+
| 2 | `scripts/build-wasm.js` | WASM | Add grammar entry to array |
444+
| 3 | `src/parser.js` | WASM | Create `extract<Lang>Symbols()` + add `LANGUAGE_REGISTRY` entry |
445+
| 4 | `src/parser.js` | WASM | Update `normalizeNativeSymbols` (if language flag needed) |
446+
| 5 | `crates/codegraph-core/Cargo.toml` | Native | Add tree-sitter crate |
447+
| 6 | `crates/.../parser_registry.rs` | Native | Register enum + extension + grammar |
448+
| 7 | `crates/.../extractors/<lang>.rs` | Native | Implement `SymbolExtractor` trait |
449+
| 8 | `crates/.../extractors/mod.rs` | Native | Declare module + dispatch arm |
450+
| 9 | `crates/.../types.rs` | Native | Add language flag to `Import` (if needed) |
451+
| 10 | `tests/parsers/<lang>.test.js` | WASM | Parser extraction tests |
452+
| 11 | `tests/engines/parity.test.js` | Both | Cross-engine validation snippets |
453+
454+
**Files you do NOT need to touch:**
455+
- `src/constants.js`: `EXTENSIONS` is derived from the registry automatically
456+
- `src/builder.js`: `parseFilesAuto()` uses the registry, no manual routing
