Priority: HIGH
Expected Impact: 8-12% lexer speedup (eliminates ~37 bounds checks per token)
Complexity: Moderate
Append a null terminator (\0) to the input string to eliminate bounds checks in tight loops. This is a common optimization in high-performance parsers.
Current implementation:
```ts
constructor(source: string, skip_comments: boolean = false) {
  this.source = source
  this.pos = 0
  // ...
}

// Every loop needs bounds check
while (this.pos < this.source.length) {
  let ch = this.source.charCodeAt(this.pos)
  // ...
}
```

Proposed implementation:
```ts
constructor(source: string, skip_comments: boolean = false) {
  this.source = source + '\0'        // Append sentinel
  this.source_length = source.length // Track original length
  this.pos = 0
  // ...
}

// Loops can rely on sentinel for termination (where safe)
while (true) {
  let ch = this.source.charCodeAt(this.pos)
  if (ch === 0) break
  // Character checks will fail on '\0' sentinel
  // ...
}
```

Why this improves performance:
- Eliminates ~37 bounds checks per token stream
- Character classification functions return false for \0, so loops terminate automatically (see the sketch below)
- Fewer comparison instructions in the hot loop, so fewer CPU pipeline stalls
- Particularly effective for long files (more iterations saved)
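As an illustration of the second point, classification predicates of roughly this shape already reject char code 0, so the safe loops need no extra sentinel comparison (the helper names here are assumptions, not necessarily those in src/tokenize.ts):

```ts
// Illustrative classification helpers; char code 0 satisfies neither predicate,
// so any loop gated on them stops at the appended '\0' automatically.
function is_whitespace(ch: number): boolean {
  return ch === 0x20 || ch === 0x09 || ch === 0x0a || ch === 0x0d || ch === 0x0c
}

function is_digit(ch: number): boolean {
  return ch >= 0x30 && ch <= 0x39 // '0'..'9'
}
```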
Safe for these loops (the character check fails on \0 and breaks the loop; see the sketch after this list):
- consume_whitespace() - whitespace check fails on sentinel
- consume_number() - digit check fails on sentinel
- consume_ident_or_function() - ident_char check fails on sentinel
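A minimal sketch of one such sentinel-safe loop, assuming an is_whitespace predicate like the one above (the real method body will differ, e.g. it also tracks line/column):

```ts
// Inside the Lexer class (sketch only). Relies on is_whitespace(0) being false,
// so the loop exits at the sentinel without a per-iteration length comparison.
private consume_whitespace(): void {
  while (is_whitespace(this.source.charCodeAt(this.pos))) {
    this.pos++
  }
}
```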
These loops still require bounds checks, because they search for specific sequences (see the sketch after this list):
- consume_comment() - must find */ terminator
- consume_string() - must find closing quote
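By contrast, a sequence-scanning loop still needs an explicit bound, because the sentinel is not the terminator it is searching for. A hedged sketch; the actual consume_comment body will differ:

```ts
// Inside the Lexer class (sketch only). An unterminated comment must stop at
// source_length rather than run past the sentinel looking for "*/".
private consume_comment(): void {
  this.pos += 2 // skip the opening "/*"
  while (this.pos < this.source_length) {
    if (this.source.charCodeAt(this.pos) === 0x2a /* '*' */ &&
        this.source.charCodeAt(this.pos + 1) === 0x2f /* '/' */) {
      this.pos += 2
      return
    }
    this.pos++
  }
  // Unterminated comment: pos is now at EOF (source_length).
}
```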
Implementation Steps
- Add source_length: number field to Lexer class (line 73)
- Update constructor (line 85):
  - Append \0 to source: this.source = source + '\0'
  - Store original length: this.source_length = source.length
- Update peek() method (line 532; see the sketch after this list):
  - Check against source_length instead of source.length
  - Return 0 for out-of-bounds access
- Update EOF detection (line 109):
  - Use this.pos >= this.source_length instead of >= this.source.length
- Update all loop conditions:
  - Replace this.source.length with this.source_length
  - Verify loops are safe with sentinel (character checks fail on \0)
- Update make_token() if it references source.length
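A minimal sketch of the peek() and EOF changes above, assuming the source_length field; the existing signatures in src/tokenize.ts may differ, and at_eof() is a hypothetical helper used only to illustrate the EOF condition:

```ts
// Inside the Lexer class (sketch only).
private peek(offset: number = 0): number {
  const i = this.pos + offset
  // Bound against the original length: the appended '\0' is not part of the
  // input, so any out-of-range read reports char code 0.
  return i < this.source_length ? this.source.charCodeAt(i) : 0
}

// Hypothetical helper showing the updated EOF condition.
private at_eof(): boolean {
  return this.pos >= this.source_length
}
```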
Files to Modify
- src/tokenize.ts (constructor, all loops, peek, EOF checks)
Specific tests to add (an example sketch follows the list):
- EOF handling with sentinel
- Position tracking accuracy
- Line/column calculation correctness
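A hedged example of a sentinel edge-case test, assuming a Jest-style runner and a Lexer API with next_token()/kind/start/end (these names are assumptions about the existing suite, not confirmed by this issue):

```ts
import { Lexer } from '../src/tokenize'

// Sketch only: adjust the token shape and constructor arguments to the real API.
test('EOF with sentinel: empty input yields EOF, not a stray NUL token', () => {
  const lexer = new Lexer('')
  const token = lexer.next_token()
  expect(token.kind).toBe('eof')
  // Positions must reflect the original input, not the appended '\0'.
  expect(token.start).toBe(0)
  expect(token.end).toBe(0)
})
```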
Acceptance Criteria
- All existing tests pass
- Position tracking is exact (no off-by-one errors)
- Line/column tracking is correct
- EOF detection works properly
- Benchmark shows 8-12% lexer speedup
- Memory overhead is one extra character per input string (acceptable)
- New tests added for sentinel edge cases
Trade-offs
- Memory: one extra character per input string, plus the one-time string concatenation in the constructor (negligible)
- Complexity: Requires careful position tracking (use source_length not source.length)
- Risk: Off-by-one errors in position tracking (mitigated by thorough testing)