Skip to content

Add sentinel-terminated input to eliminate bounds checks #99

@bartveneman

Description

@bartveneman

Priority: HIGH
Expected Impact: 8-12% lexer speedup (eliminates ~37 bounds checks per token)
Complexity: Moderate


Append a null terminator (\0) to the input string to eliminate bounds checks in tight loops. This is a common optimization in high-performance parsers.

Current implementation:

 constructor(source: string, skip_comments: boolean = false) {
   this.source = source
   this.pos = 0
   // ...
 }

 // Every loop needs bounds check
 while (this.pos < this.source.length) {
   let ch = this.source.charCodeAt(this.pos)
   // ...
 }

Proposed implementation:

 constructor(source: string, skip_comments: boolean = false) {
   this.source = source + '\0' // Append sentinel
   this.source_length = source.length // Track original length
   this.pos = 0
   // ...
 }

 // Loops can rely on sentinel for termination (where safe)
 while (true) {
   let ch = this.source.charCodeAt(this.pos)
   if (ch === 0) break
   // Character checks will fail on '\0' sentinel
   // ...
 }

Why this improves performance:

  • Eliminates ~37 bounds checks per token stream
  • Character classification functions return false for \0 (automatic loop termination)
  • No CPU pipeline stalls on comparison instructions
  • Particularly effective for long files (more iterations saved)

Safe for these loops (character checks breaks loop on \0):

  • consume_whitespace() - whitespace check fails on sentinel
  • consume_number() - digit check fails on sentinel
  • consume_ident_or_function() - ident_char check fails on sentinel

Still requires bounds checks (searching for specific sequences):

  • consume_comment() - must find */ terminator
  • consume_string() - must find closing quote

Implementation Steps

  1. Add source_length: number field to Lexer class (line 73)
  2. Update constructor (line 85):
  • Append \0 to source: this.source = source + '\0'
  • Store original length: this.source_length = source.length
  1. Update peek() method (line 532):
  • Check against source_length instead of source.length
  • Return 0 for out-of-bounds access
  1. Update EOF detection (line 109):
  • Use this.pos >= this.source_length instead of >= this.source.length
  1. Update all loop conditions:
  • Replace this.source.length with this.source_length
  • Verify loops are safe with sentinel (character checks fail on \0)
  1. Update make_token() if it references source.length

Files to Modify

  • src/tokenize.ts (constructor, all loops, peek, EOF checks)

Specific tests to add:

  • EOF handling with sentinel
  • Position tracking accuracy
  • Line/column calculation correctness

Acceptance Criteria

  • All existing tests pass
  • Position tracking is exact (no off-by-one errors)
  • Line/column tracking is correct
  • EOF detection works properly
  • Benchmark shows 8-12% lexer speedup
  • Memory overhead is +1 byte per input (acceptable)
  • New tests added for sentinel edge cases

Trade-offs

  • Memory: +1 byte per input string (negligible)
  • Complexity: Requires careful position tracking (use source_length not source.length)
  • Risk: Off-by-one errors in position tracking (mitigated by thorough testing)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions