fix: improve toml-test invalid compliance#2

Merged
dereuromark merged 4 commits into master from fix/toml-test-compliance-2 on Mar 25, 2026
dereuromark commented Mar 25, 2026

Summary

Improves toml-test invalid compliance from 90.3% to 98.5% (TOML 1.1).

Lexer Fixes

  • Add UTF-8 encoding validation upfront to reject invalid byte sequences
  • Add control character validation in basic and literal strings
  • Reject bare CR (without LF) in multiline strings
  • Enforce lowercase-only prefixes for non-decimal integers (0x, 0o, 0b)
  • Reject signed non-decimal integers (+0x, -0o, etc.)
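The string-related checks above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR; `validate_utf8` and `check_string_body` are hypothetical helper names, and the real lexer may structure these checks differently.

```python
def validate_utf8(raw: bytes) -> str:
    """Decode the whole document up front so invalid byte sequences fail early."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte {exc.start}") from exc


def check_string_body(body: str, multiline: bool) -> None:
    """Reject control characters and bare CR in a string body (hypothetical helper)."""
    for i, ch in enumerate(body):
        if ch == "\t":
            continue  # tab is always allowed
        if multiline and ch == "\n":
            continue  # newlines are only legal in multiline strings
        if multiline and ch == "\r":
            # CR is acceptable only as the first half of CRLF
            if body[i + 1 : i + 2] == "\n":
                continue
            raise ValueError(f"bare CR at offset {i}")
        code = ord(ch)
        if code < 0x20 or code == 0x7F:
            raise ValueError(f"control character U+{code:04X} at offset {i}")
```

Decoding strictly before tokenizing keeps the later checks simple, since they can then work on code points rather than raw bytes.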

Normalizer Fixes (Table Semantics)

  • Reject extending explicitly defined tables via dotted keys
    • [a.b.c] followed by [a] + b.c.t = ... is now rejected
  • Reject extending array tables via dotted keys
    • [[a.b]] followed by [a] + b.y = ... is now rejected
  • Reject explicitly defining tables created by dotted keys
    • [a] + b.c = 1 followed by [a.b] is now rejected

Compliance Summary

| Category | Fix |
| --- | --- |
| UTF-8 | Invalid byte sequences rejected upfront |
| Control chars | Characters < 0x20 (except tab) or 0x7F rejected in strings |
| Bare CR | TOML requires CRLF or LF; bare CR now rejected |
| Integer prefixes | Must be lowercase (0x, 0o, 0b) per spec |
| Integer signs | Only decimal integers can have +/- prefix |
| Table semantics | Dotted key vs explicit table conflicts rejected |

Follows up on PR #1, which established the baseline compliance testing.

- Add UTF-8 encoding validation upfront to reject invalid byte sequences
- Add control character validation in basic and literal strings
- Reject bare CR (without LF) in multiline strings
- Enforce lowercase-only prefixes for non-decimal integers (0x, 0o, 0b)
- Reject signed non-decimal integers (+0x, -0o, etc.)
- Update tests to verify new validation rules

These changes address several toml-test invalid test failures by
enforcing stricter TOML spec compliance:
- Invalid UTF-8 now rejected early
- Control characters (< 0x20 except tab, or 0x7F) rejected in strings
- Bare CR rejected (TOML requires CRLF or LF only)
- Integer prefixes must be lowercase per TOML spec
- Only decimal integers can have sign prefix
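The two integer rules (lowercase prefixes only, sign only on decimals) can be expressed as patterns. This is a minimal Python sketch based on the TOML grammar for integers, not the lexer code from this PR; `is_valid_toml_integer` is a hypothetical name.

```python
import re

# Decimal integers may carry an optional sign and must not have leading zeros.
_DEC = re.compile(r"[+-]?(0|[1-9](_?[0-9])*)")
# Prefixed forms: lowercase prefix only, no sign; hex digits may be either case.
_HEX = re.compile(r"0x[0-9A-Fa-f](_?[0-9A-Fa-f])*")
_OCT = re.compile(r"0o[0-7](_?[0-7])*")
_BIN = re.compile(r"0b[01](_?[01])*")


def is_valid_toml_integer(text: str) -> bool:
    """Return True if text matches one of the four integer forms."""
    return any(p.fullmatch(text) for p in (_DEC, _HEX, _OCT, _BIN))
```

Under these patterns `0xDEADBEEF` and `-17` are accepted, while `0Xff`, `+0x1`, and `-0o7` are not.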

Update compliance numbers after fixing:
- UTF-8 encoding validation
- Control character validation in strings
- Bare CR rejection in multiline strings
- Integer prefix validation (lowercase only)
- Signed non-decimal integer rejection

TOML 1.1: 96.8% invalid compliance (up from 90.3%)
TOML 1.0: 95.3% invalid compliance (up from 89.0%)

Add stricter validation for table semantics edge cases:

1. Cannot extend explicitly defined tables via dotted keys
   - [a.b.c] followed by [a] + b.c.t = ... is now rejected

2. Cannot extend array tables via dotted keys
   - [[a.b]] followed by [a] + b.y = ... is now rejected

3. Cannot explicitly define tables created by dotted keys
   - [a] + b.c = 1 followed by [a.b] is now rejected

The fix introduces a new 'dotted' kind for implicit tables created
via dotted key notation, distinguishing them from 'implicit' tables
created by super-table headers which CAN be explicitly defined later.
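The idea of tracking how each table came into existence can be sketched as a small registry. This is a simplified Python illustration under stated assumptions, not the normalizer code from this PR: `TableRegistry` and its method names are hypothetical, and per-element scoping of array tables is ignored.

```python
class TableRegistry:
    """Tracks how each table path was created (hypothetical sketch)."""

    def __init__(self) -> None:
        # path tuple -> "explicit" | "implicit" | "dotted" | "array"
        self.kinds: dict[tuple[str, ...], str] = {}

    def _ensure_parents(self, path: tuple[str, ...]) -> None:
        # Super-tables named only in a header are 'implicit'.
        for i in range(1, len(path)):
            self.kinds.setdefault(path[:i], "implicit")

    def define_table(self, path: tuple[str, ...]) -> None:
        """Handle a [path] header; only an 'implicit' table may be defined later."""
        self._ensure_parents(path)
        kind = self.kinds.get(path)
        if kind not in (None, "implicit"):
            raise ValueError(f"cannot define [{'.'.join(path)}]: already {kind}")
        self.kinds[path] = "explicit"

    def define_array_table(self, path: tuple[str, ...]) -> None:
        """Handle a [[path]] header."""
        self._ensure_parents(path)
        kind = self.kinds.get(path)
        if kind not in (None, "array"):
            raise ValueError(f"cannot define [[{'.'.join(path)}]]: already {kind}")
        self.kinds[path] = "array"

    def assign_dotted(self, header: tuple[str, ...], key: tuple[str, ...]) -> None:
        """Handle a dotted key under the current header; checks intermediate tables."""
        for i in range(1, len(key)):
            path = header + key[:i]
            if self.kinds.get(path) in ("explicit", "array"):
                raise ValueError(f"cannot extend {'.'.join(path)} via dotted key")
            self.kinds[path] = "dotted"
```

With this split, `[a.b.c]` followed by `[a]` still works because `a` was only 'implicit', whereas `[a]` with `b.c = 1` followed by `[a.b]` fails because `a.b` is 'dotted'.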

Add semantic test fixtures for all three patterns.

With the dotted key vs explicit table conflicts now properly rejected,
invalid test compliance improves significantly:

TOML 1.1: 98.5% invalid (up from 96.8%)
TOML 1.0: 97.0% invalid (up from 95.3%)

dereuromark merged commit 2ecd523 into master Mar 25, 2026
3 checks passed
dereuromark deleted the fix/toml-test-compliance-2 branch March 25, 2026 04:34