Add sentinel-terminated input to eliminate bounds checks

Priority: HIGH
Expected Impact: 8-12% lexer speedup (eliminates ~37 bounds checks per token)
Complexity: Moderate

--- 

Append a null terminator (`\0`) to the input string to eliminate bounds checks in tight loops. This is a common optimization in high-performance parsers.

 Current implementation:

```ts
 constructor(source: string, skip_comments: boolean = false) {
   this.source = source
   this.pos = 0
   // ...
 }

 // Every loop needs bounds check
 while (this.pos < this.source.length) {
   let ch = this.source.charCodeAt(this.pos)
   // ...
 }
```

 Proposed implementation:

```ts
 constructor(source: string, skip_comments: boolean = false) {
   this.source = source + '\0' // Append sentinel
   this.source_length = source.length // Track original length
   this.pos = 0
   // ...
 }

 // Loops can rely on sentinel for termination (where safe)
 while (true) {
   let ch = this.source.charCodeAt(this.pos)
   if (ch === 0) break
   // Character checks will fail on '\0' sentinel
   // ...
 }
```

 Why this improves performance:

 - Eliminates ~37 bounds checks per token stream
 - Character classification functions return false for `\0` (automatic loop termination)
 - No CPU pipeline stalls on comparison instructions
 - Particularly effective for long files (more iterations saved)

 Safe for these loops (character checks breaks loop on `\0`):

 - consume_whitespace() - whitespace check fails on sentinel
 - consume_number() - digit check fails on sentinel
 - consume_ident_or_function() - ident_char check fails on sentinel

 Still requires bounds checks (searching for specific sequences):

 - consume_comment() - must find */ terminator
 - consume_string() - must find closing quote

## Implementation Steps

 1. Add source_length: number field to Lexer class (line 73)
 2. Update constructor (line 85):
   - Append \0 to source: this.source = source + '\0'
   - Store original length: this.source_length = source.length
 3. Update peek() method (line 532):
   - Check against source_length instead of source.length
   - Return 0 for out-of-bounds access
 4. Update EOF detection (line 109):
   - Use this.pos >= this.source_length instead of >= this.source.length
 5. Update all loop conditions:
   - Replace this.source.length with this.source_length
   - Verify loops are safe with sentinel (character checks fail on \0)
 6. Update make_token() if it references source.length

## Files to Modify

- src/tokenize.ts (constructor, all loops, peek, EOF checks)

## Specific tests to add:
 
- EOF handling with sentinel
- Position tracking accuracy
- Line/column calculation correctness

## Acceptance Criteria

 - All existing tests pass
 - Position tracking is exact (no off-by-one errors)
 - Line/column tracking is correct
 - EOF detection works properly
 - Benchmark shows 8-12% lexer speedup
 - Memory overhead is +1 byte per input (acceptable)
 - New tests added for sentinel edge cases

 Trade-offs

 - Memory: +1 byte per input string (negligible)
 - Complexity: Requires careful position tracking (use source_length not source.length)
 - Risk: Off-by-one errors in position tracking (mitigated by thorough testing)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add sentinel-terminated input to eliminate bounds checks #99

Implementation Steps

Files to Modify

Specific tests to add:

Acceptance Criteria

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add sentinel-terminated input to eliminate bounds checks #99

Description

Implementation Steps

Files to Modify

Specific tests to add:

Acceptance Criteria

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions