feat(lexer): add base Scanner struct with operators, identifiers, whitespace #3

Merged
ohah merged 2 commits into main from feature/lexer-scanner-base on Mar 18, 2026
ohah commented on Mar 18, 2026

Summary

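  • Scanner struct: the lexer's core struct
  • next(): the main scan function, called by the parser
  • Whitespace/newline handling: \n, \r\n, \r, U+2028, U+2029, NBSP, BOM
  • All operator/punctuation tokens (including 51 compound forms)
  • ASCII identifiers + keyword mapping
  • Hashbang, private identifier
  • line_offsets table + lazy getLineColumn() computation
  • Literals are placeholders (detailed implementation in later PRs)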

Design Decisions Applied

  • D015: start+end byte offset
  • D019: recognize the BOM and all line terminator characters
  • D035: UTF-8 by default
  • D036: the parser calls the lexer

Placeholder (later PRs)

  • Detailed numeric literal parsing (hex, octal, binary, bigint, separator)
  • String literal escape sequences
  • Template literal ${} interpolation
  • Comment handling (// /* */)
  • Unicode identifiers
  • SIMD optimization

Test plan

  • zig build test passes
  • zig fmt --check src/ passes
  • empty source → eof
  • BOM skip
  • single character tokens (9 kinds)
  • compound operators (12 kinds)
  • shift operators (6 kinds)
  • identifiers + keyword lookup
  • newline → has_newline_before
  • CRLF → single newline
  • line offset table
  • getLineColumn binary search
  • hashbang
  • private identifier
  • optional chaining vs ternary+number (?. vs ?.5)
  • string literal basic
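
The `?.` versus `?.5` case in the test plan follows the ECMAScript rule that `?.` is not an optional-chaining token when a decimal digit comes right after it (so `a?.5:1` stays a ternary). The sketch below illustrates that lookahead only; `QuestionToken`, `Scanned`, and `scanQuestion` are hypothetical names and are not taken from this PR's source.

```zig
const std = @import("std");

// Hypothetical token tags for everything that can start with '?'.
const QuestionToken = enum { question, question_dot, question_question, question_question_eq };

// Hypothetical result: which token was matched and how many bytes it spans.
const Scanned = struct { tag: QuestionToken, len: usize };

/// Scans the token that starts at the '?' located at `src[pos]`.
fn scanQuestion(src: []const u8, pos: usize) Scanned {
    std.debug.assert(pos < src.len and src[pos] == '?');
    const rest = src[pos + 1 ..];

    // Longest match first: `??=` before `??`.
    if (std.mem.startsWith(u8, rest, "?=")) return .{ .tag = .question_question_eq, .len = 3 };
    if (std.mem.startsWith(u8, rest, "?")) return .{ .tag = .question_question, .len = 2 };

    if (rest.len > 0 and rest[0] == '.') {
        // `?.` followed by a digit is `?` plus a number such as `.5`.
        const next_is_digit = rest.len > 1 and std.ascii.isDigit(rest[1]);
        if (!next_is_digit) return .{ .tag = .question_dot, .len = 2 };
    }
    return .{ .tag = .question, .len = 1 };
}
```

With this shape, `scanQuestion("a?.b", 1)` reports an optional-chaining token, while `scanQuestion("a?.5:1", 1)` reports a plain `?` and leaves `.5` for the numeric scanner.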

🤖 Generated with Claude Code

ohah and others added 2 commits March 18, 2026 20:18
…tifiers

Scanner core:
- init() with BOM skip, deinit() for cleanup
- next() main scan loop — tokenize one token per call
- peek(), peekAt(), advance(), isAtEnd() — basic read helpers
- tokenText() — current token source text
- line_offsets table + getLineColumn() for lazy line/column calc
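
As a rough illustration of the lazy line/column lookup, the sketch below does a binary search over a table of line-start offsets. It assumes `line_offsets[i]` holds the byte offset where line `i` begins (with entry 0 equal to 0), and that lines and columns are 0-based with byte columns; the `LineColumn` type and this free-function signature are assumptions, not the PR's actual API.

```zig
const std = @import("std");

// Hypothetical result type; the real one is not shown in this PR page.
const LineColumn = struct { line: u32, column: u32 };

/// Returns the 0-based line and byte column of `offset`.
/// Assumes `line_offsets` is sorted ascending and starts with 0.
fn getLineColumn(line_offsets: []const u32, offset: u32) LineColumn {
    std.debug.assert(line_offsets.len > 0 and line_offsets[0] == 0);

    // Invariant: line_offsets[lo] <= offset, and every index >= hi starts after offset.
    var lo: usize = 0;
    var hi: usize = line_offsets.len;
    while (hi - lo > 1) {
        const mid = lo + (hi - lo) / 2;
        if (line_offsets[mid] <= offset) {
            lo = mid;
        } else {
            hi = mid;
        }
    }

    const line: u32 = @intCast(lo);
    return .{ .line = line, .column = offset - line_offsets[lo] };
}
```

Because tokens only carry byte offsets, this lookup runs only when a diagnostic actually needs a position, which is the point of computing it lazily.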

Whitespace:
- Space, tab, VT, FF skipping
- Newline handling: \n, \r\n, \r, U+2028 (LS), U+2029 (PS)
- U+00A0 (NBSP), U+FEFF (BOM/ZWNBSP) as whitespace
- has_newline_before flag for ASI
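
The whitespace rules above can be sketched at the byte level as follows. In UTF-8, U+2028 and U+2029 encode as `E2 80 A8` and `E2 80 A9`, NBSP as `C2 A0`, and U+FEFF as `EF BB BF`. The helper names `lineTerminatorLen`, `skipWhitespace`, and `Skip` are illustrative assumptions and may not match the real Scanner methods.

```zig
const std = @import("std");

/// Byte length of the line terminator at the start of `bytes`, or 0 if none.
fn lineTerminatorLen(bytes: []const u8) usize {
    if (bytes.len == 0) return 0;
    if (bytes[0] == '\n') return 1;
    if (bytes[0] == '\r') {
        // CRLF counts as a single terminator; a lone CR is one too.
        if (bytes.len > 1 and bytes[1] == '\n') return 2;
        return 1;
    }
    // U+2028 (LS) and U+2029 (PS).
    if (std.mem.startsWith(u8, bytes, "\xE2\x80\xA8")) return 3;
    if (std.mem.startsWith(u8, bytes, "\xE2\x80\xA9")) return 3;
    return 0;
}

// Hypothetical result: where scanning resumes and whether a newline was crossed.
const Skip = struct { pos: usize, has_newline_before: bool };

/// Skips whitespace and line terminators starting at `start`.
fn skipWhitespace(src: []const u8, start: usize) Skip {
    var pos = start;
    var saw_newline = false;
    while (pos < src.len) {
        const nl = lineTerminatorLen(src[pos..]);
        if (nl > 0) {
            saw_newline = true;
            pos += nl;
            continue;
        }
        const c = src[pos];
        // Space, tab, vertical tab, form feed.
        if (c == ' ' or c == '\t' or c == 0x0B or c == 0x0C) {
            pos += 1;
        } else if (std.mem.startsWith(u8, src[pos..], "\xC2\xA0")) {
            pos += 2; // U+00A0 NBSP
        } else if (std.mem.startsWith(u8, src[pos..], "\xEF\xBB\xBF")) {
            pos += 3; // U+FEFF BOM/ZWNBSP
        } else {
            break;
        }
    }
    return .{ .pos = pos, .has_newline_before = saw_newline };
}
```

The returned flag is what a parser would consult for automatic semicolon insertion.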

Operators (all compound forms):
- Arithmetic: + - * / % ** ++ --
- Comparison: < > <= >= == != === !==
- Bitwise: & | ^ ~ << >> >>>
- Logical: && || !
- Assignment: = += -= *= /= %= **= &= |= ^= <<= >>= >>>= &&= ||= ??=
- Nullish/Optional: ?? ?.
- Arrow: =>
- Spread: ...

Identifiers & Keywords:
- ASCII identifier scan (unicode PR later)
- Keyword lookup via StaticStringMap
- Private identifier (#name)
- Hashbang (#!)
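
For reference, a keyword lookup through `std.StaticStringMap` might look like the sketch below. The `Tag` enum and the four keywords are a made-up subset for illustration; the PR's real token set is not shown here. The identifier scan covers ASCII letters, digits, `_`, and `$` only, matching the ASCII-only scope stated above.

```zig
const std = @import("std");

// Hypothetical subset of token tags.
const Tag = enum { identifier, kw_const, kw_function, kw_if, kw_return };

// Compile-time keyword table; lookups need no allocation.
const keywords = std.StaticStringMap(Tag).initComptime(.{
    .{ "const", .kw_const },
    .{ "function", .kw_function },
    .{ "if", .kw_if },
    .{ "return", .kw_return },
});

fn isIdentStart(c: u8) bool {
    return std.ascii.isAlphabetic(c) or c == '_' or c == '$';
}

fn isIdentPart(c: u8) bool {
    return isIdentStart(c) or std.ascii.isDigit(c);
}

/// Returns the end offset (exclusive) of the ASCII identifier starting at `start`.
fn scanIdentifier(src: []const u8, start: usize) usize {
    var end = start;
    while (end < src.len and isIdentPart(src[end])) end += 1;
    return end;
}

/// Maps identifier text to a keyword tag, falling back to `.identifier`.
fn classify(text: []const u8) Tag {
    return keywords.get(text) orelse .identifier;
}
```

Scanning the whole identifier first and consulting the map once keeps the hot loop free of per-keyword branches.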

Literals (placeholder — detailed parsing in future PRs):
- Numeric: basic digit scan
- String: basic quote matching
- Template: basic backtick matching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Code reuse:
- skipWhitespace 0xE2 branch delegates to handleNewline() directly (removes duplicate check)
- BOM checks use std.mem.startsWith for readability

Code quality:
- Add 4GB source limit assert in init() (D015 u32 offset constraint)
- line_offsets initial append uses @panic instead of catch {} (OOM = unusable state)
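
A minimal sketch of the two `init()` checks mentioned above (the 4GB source limit and the BOM skip via `std.mem.startsWith`) is shown here. The function name `initialOffset` is hypothetical, and the real `init()` presumably does more, such as setting up `line_offsets`.

```zig
const std = @import("std");

// UTF-8 encoding of U+FEFF.
const utf8_bom = "\xEF\xBB\xBF";

/// Returns the byte offset where scanning should begin:
/// 3 if the source starts with a UTF-8 BOM, otherwise 0.
fn initialOffset(source: []const u8) u32 {
    // Token offsets are u32 (D015), so the source must be addressable in 32 bits.
    std.debug.assert(source.len <= std.math.maxInt(u32));
    if (std.mem.startsWith(u8, source, utf8_bom)) return 3;
    return 0;
}
```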

Tests added:
- Empty string literals ('', "")
- /= operator
- \r alone as line terminator
- Whitespace-only source
- NBSP (U+00A0) whitespace skipping
- All 16 assignment operators

Backlog updated with 9 deferred optimization items from review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ohah merged commit 4ce02a0 into main on Mar 18, 2026
7 checks passed