Decode: validate UTF-8 #629

pelletier · 2021-10-15T13:09:50Z

Follow-up to #620. I went with validating UTF-8 characters only, rather than decoding them. While doing so I realized the same pass can also check against the set of invalid TOML characters (TOML doesn't allow all UTF-8 characters everywhere). This departs from the two-pass version I initially had in mind, but it ended up being faster on my machine. There is still the option to move that validation to vectorized instructions, but that only helps for long strings and comments (see segmentio/asm#58), which seem quite rare in TOML documents.
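For readers who want a concrete picture of the single-pass idea, here is a minimal sketch. It is not the code from this PR: `firstInvalid`, `utf8SeqLen`, `invalidASCII`, and `isCont` are hypothetical names, and the forbidden set assumed here is the one the TOML spec gives for comments and literal strings (ASCII control characters other than tab, plus DEL). Each byte is visited once: ASCII bytes are checked against the forbidden set, and multi-byte sequences are checked for well-formedness without being decoded into a rune.

```go
package main

import "fmt"

// invalidASCII reports whether b is an ASCII control character assumed to be
// forbidden by TOML in comments and literal strings: anything below 0x20
// except tab, plus DEL (0x7F).
func invalidASCII(b byte) bool {
	return (b < 0x20 && b != '\t') || b == 0x7F
}

// isCont reports whether b is a UTF-8 continuation byte (10xxxxxx).
func isCont(b byte) bool {
	return b&0xC0 == 0x80
}

// utf8SeqLen returns the length of the well-formed UTF-8 sequence at the
// start of p, or 0 if the sequence is malformed (overlong encoding,
// surrogate, out of range, or truncated). No rune value is computed.
func utf8SeqLen(p []byte) int {
	if len(p) == 0 {
		return 0
	}
	b0 := p[0]
	switch {
	case b0 < 0x80:
		return 1
	case b0 >= 0xC2 && b0 <= 0xDF:
		if len(p) >= 2 && isCont(p[1]) {
			return 2
		}
	case b0 >= 0xE0 && b0 <= 0xEF:
		if len(p) < 3 || !isCont(p[2]) {
			return 0
		}
		lo, hi := byte(0x80), byte(0xBF)
		if b0 == 0xE0 {
			lo = 0xA0 // reject overlong three-byte encodings
		}
		if b0 == 0xED {
			hi = 0x9F // reject UTF-16 surrogates
		}
		if p[1] >= lo && p[1] <= hi {
			return 3
		}
	case b0 >= 0xF0 && b0 <= 0xF4:
		if len(p) < 4 || !isCont(p[2]) || !isCont(p[3]) {
			return 0
		}
		lo, hi := byte(0x80), byte(0xBF)
		if b0 == 0xF0 {
			lo = 0x90 // reject overlong four-byte encodings
		}
		if b0 == 0xF4 {
			hi = 0x8F // reject code points above U+10FFFF
		}
		if p[1] >= lo && p[1] <= hi {
			return 4
		}
	}
	return 0
}

// firstInvalid returns the index of the first byte that is either a
// forbidden ASCII character or the start of a malformed UTF-8 sequence,
// or -1 if the whole input is acceptable.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		if b[i] < 0x80 {
			if invalidASCII(b[i]) {
				return i
			}
			i++
			continue
		}
		n := utf8SeqLen(b[i:])
		if n == 0 {
			return i
		}
		i += n
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid([]byte("caf\u00e9\tok"))) // -1: valid UTF-8, tab allowed
	fmt.Println(firstInvalid([]byte("a\x00b")))        // 1: NUL is forbidden
	fmt.Println(firstInvalid([]byte{0xC0, 0x80}))      // 0: overlong encoding
}
```

A two-pass variant would call something like `utf8.Valid` first and then scan again for forbidden characters; the sketch above folds both checks into one loop, which is the trade-off described in the paragraph.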

Also added benchmarks for string and comment parsing. They are not very useful from a user's perspective, but they gave a much more stable signal while implementing those methods.
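As a rough illustration of what such a micro-benchmark can look like, here is a sketch that runs outside `go test` via `testing.Benchmark`. Everything in it is an assumption made for illustration: `makeComment` and `scanComment` are hypothetical helpers, `scanComment` is a stand-in built on `utf8.Valid` rather than the scanner in this PR, and the inputs only mimic the names used in the tables below. Calling `b.SetBytes` is what makes the result include a MB/s figure.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"testing"
	"unicode/utf8"
)

// makeComment builds a hypothetical comment line: '#', then filler repeated
// `repeat` times, then a newline.
func makeComment(repeat int, filler string) []byte {
	return []byte("#" + strings.Repeat(filler, repeat) + "\n")
}

// scanComment is a stand-in scanner (NOT go-toml's): it returns the offset of
// the end of the comment and whether the comment bytes are valid UTF-8.
func scanComment(b []byte) (int, bool) {
	end := bytes.IndexByte(b, '\n')
	if end < 0 {
		end = len(b)
	}
	return end, utf8.Valid(b[:end])
}

func main() {
	// Register testing flags; needed when benchmarking without "go test".
	testing.Init()

	cases := []struct {
		name  string
		input []byte
	}{
		{"10Valid", makeComment(10, "a")},
		{"1kValid", makeComment(1000, "a")},
		{"10ValidUtf8", makeComment(5, "\u00e9")},
	}

	for _, c := range cases {
		input := c.input
		res := testing.Benchmark(func(b *testing.B) {
			// Report throughput in addition to time per operation.
			b.SetBytes(int64(len(input)))
			for i := 0; i < b.N; i++ {
				if _, ok := scanComment(input); !ok {
					b.Fatal("unexpected invalid input")
				}
			}
		})
		fmt.Printf("%-12s %s\n", c.name, res)
	}
}
```

Benchmarking the scanner on fixed, synthetic inputs like these isolates it from the rest of the decoder, which is one plausible reason the signal is more stable than with whole-document benchmarks.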

Compared to #620: the map benchmark is slower in this run, but when run separately it was on par. Allocations are omitted because they did not change.

name                                   old time/op    new time/op    delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/10Valid-2                   39.0ns ± 0%    35.7ns ± 0%   -8.37%  (p=0.000 n=8+10)
ScanComments/1kValid-2                   3.34µs ± 0%    3.04µs ± 1%   -8.92%  (p=0.000 n=10+10)
ScanComments/1MValid-2                   3.41ms ± 0%    3.10ms ± 0%   -9.00%  (p=0.000 n=9+9)
ScanComments/10ValidUtf8-2               23.7ns ± 0%    20.9ns ± 1%  -11.68%  (p=0.000 n=9+10)
ScanComments/1kValidUtf8-2               2.66µs ± 0%    2.40µs ± 0%   -9.79%  (p=0.000 n=10+9)
ScanComments/1MValidUtf8-2               2.72ms ± 0%    2.46ms ± 0%   -9.87%  (p=0.000 n=9+9)
ParseLiteralStringValid/1MValidUtf8-2    2.74ms ± 0%    2.50ms ± 0%   -8.58%  (p=0.000 n=10+10)
ParseLiteralStringValid/10Valid-2        38.2ns ± 0%    35.1ns ± 4%   -8.03%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2        3.34µs ± 0%    3.03µs ± 0%   -9.09%  (p=0.000 n=10+8)
ParseLiteralStringValid/1MValid-2        3.41ms ± 0%    3.10ms ± 0%   -9.25%  (p=0.000 n=10+8)
ParseLiteralStringValid/10ValidUtf8-2    22.8ns ± 0%    20.7ns ± 0%   -8.99%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2    2.67µs ± 0%    2.45µs ± 0%   -8.24%  (p=0.000 n=10+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2                22.7ms ± 0%    21.3ms ± 0%   -6.23%  (p=0.000 n=10+9)
UnmarshalDataset/canada-2                79.3ms ± 1%    78.3ms ± 0%   -1.19%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2          24.8ms ± 1%    24.2ms ± 0%   -2.40%  (p=0.000 n=9+9)
UnmarshalDataset/twitter-2               9.34ms ± 1%    8.96ms ± 0%   -4.02%  (p=0.000 n=10+9)
UnmarshalDataset/code-2                  98.4ms ± 1%    96.9ms ± 0%   -1.50%  (p=0.000 n=10+8)
UnmarshalDataset/example-2                167µs ± 0%     157µs ± 0%   -6.50%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2         537ns ± 1%     534ns ± 0%   -0.54%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2            788ns ± 1%     809ns ± 1%   +2.67%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2         48.1µs ± 0%    47.5µs ± 0%   -1.35%  (p=0.000 n=8+10)
Unmarshal/ReferenceFile/map-2            75.0µs ± 0%    74.3µs ± 0%   -0.95%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2              14.3µs ± 0%    13.5µs ± 0%   -5.86%  (p=0.000 n=10+10)

name                                   old speed      new speed      delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/10Valid-2                  334MB/s ± 0%   364MB/s ± 0%   +9.17%  (p=0.000 n=8+9)
ScanComments/1kValid-2                  308MB/s ± 0%   338MB/s ± 1%   +9.80%  (p=0.000 n=10+10)
ScanComments/1MValid-2                  308MB/s ± 0%   338MB/s ± 0%   +9.89%  (p=0.000 n=9+9)
ScanComments/10ValidUtf8-2              549MB/s ± 0%   622MB/s ± 1%  +13.23%  (p=0.000 n=9+10)
ScanComments/1kValidUtf8-2              386MB/s ± 0%   428MB/s ± 0%  +10.85%  (p=0.000 n=10+9)
ScanComments/1MValidUtf8-2              385MB/s ± 0%   427MB/s ± 0%  +10.95%  (p=0.000 n=9+9)
ParseLiteralStringValid/1MValidUtf8-2   383MB/s ± 0%   419MB/s ± 0%   +9.38%  (p=0.000 n=10+10)
ParseLiteralStringValid/10Valid-2       314MB/s ± 0%   342MB/s ± 4%   +8.76%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2       308MB/s ± 0%   338MB/s ± 0%  +10.01%  (p=0.000 n=10+8)
ParseLiteralStringValid/1MValid-2       307MB/s ± 0%   339MB/s ± 0%  +10.20%  (p=0.000 n=10+8)
ParseLiteralStringValid/10ValidUtf8-2   527MB/s ± 0%   578MB/s ± 0%   +9.87%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2   384MB/s ± 0%   419MB/s ± 0%   +8.98%  (p=0.000 n=10+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2              46.1MB/s ± 0%  49.2MB/s ± 0%   +6.65%  (p=0.000 n=10+9)
UnmarshalDataset/canada-2              27.8MB/s ± 1%  28.1MB/s ± 0%   +1.19%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2        22.5MB/s ± 1%  23.1MB/s ± 0%   +2.47%  (p=0.000 n=9+9)
UnmarshalDataset/twitter-2             47.4MB/s ± 0%  49.3MB/s ± 0%   +4.12%  (p=0.000 n=9+9)
UnmarshalDataset/code-2                27.3MB/s ± 1%  27.7MB/s ± 0%   +1.53%  (p=0.000 n=10+8)
UnmarshalDataset/example-2             48.4MB/s ± 0%  51.7MB/s ± 0%   +6.95%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2      20.5MB/s ± 1%  20.6MB/s ± 0%   +0.55%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2         14.0MB/s ± 1%  13.6MB/s ± 1%   -2.61%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2        109MB/s ± 0%   110MB/s ± 0%   +1.37%  (p=0.000 n=8+10)
Unmarshal/ReferenceFile/map-2          69.9MB/s ± 0%  70.5MB/s ± 0%   +0.96%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2            38.1MB/s ± 0%  40.5MB/s ± 0%   +6.23%  (p=0.000 n=10+10)

Compared to base (cc0d1a9), it is much slower, which is to be expected given that the parser and scanner are doing a lot more work.

name                                   old time/op    new time/op    delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/1MValid-2                    310µs ± 0%    3100µs ± 0%  +901.54%  (p=0.000 n=10+9)
ScanComments/10ValidUtf8-2               4.43ns ± 0%   20.91ns ± 1%  +372.46%  (p=0.000 n=8+10)
ScanComments/1kValidUtf8-2                310ns ± 0%    2402ns ± 0%  +675.69%  (p=0.000 n=9+9)
ScanComments/1MValidUtf8-2                310µs ± 0%    2456µs ± 0%  +692.92%  (p=0.000 n=10+9)
ScanComments/10Valid-2                   4.43ns ± 0%   35.71ns ± 0%  +706.85%  (p=0.000 n=8+10)
ScanComments/1kValid-2                    309ns ± 0%    3040ns ± 1%  +882.52%  (p=0.000 n=8+10)
ParseLiteralStringValid/1MValidUtf8-2     619µs ± 0%    2502µs ± 0%  +304.44%  (p=0.000 n=9+10)
ParseLiteralStringValid/10Valid-2        10.3ns ± 0%    35.1ns ± 4%  +240.18%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2         615ns ± 0%    3033ns ± 0%  +393.17%  (p=0.000 n=9+8)
ParseLiteralStringValid/1MValid-2         619µs ± 0%    3096µs ± 0%  +400.35%  (p=0.000 n=9+8)
ParseLiteralStringValid/10ValidUtf8-2    10.3ns ± 0%    20.7ns ± 0%  +100.76%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2     615ns ± 0%    2450ns ± 0%  +298.41%  (p=0.000 n=9+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2                21.0ms ± 0%    21.3ms ± 0%    +1.39%  (p=0.000 n=9+9)
UnmarshalDataset/canada-2                79.2ms ± 1%    78.3ms ± 0%    -1.09%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2          24.4ms ± 1%    24.2ms ± 0%    -0.91%  (p=0.000 n=10+9)
UnmarshalDataset/twitter-2               8.92ms ± 1%    8.96ms ± 0%    +0.47%  (p=0.006 n=10+9)
UnmarshalDataset/code-2                  95.9ms ± 0%    96.9ms ± 0%    +1.05%  (p=0.000 n=10+8)
UnmarshalDataset/example-2                155µs ± 0%     157µs ± 0%    +1.30%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2         525ns ± 2%     534ns ± 0%    +1.66%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2            780ns ± 1%     809ns ± 1%    +3.74%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2         37.8µs ± 2%    47.5µs ± 0%   +25.52%  (p=0.000 n=10+10)
Unmarshal/ReferenceFile/map-2            64.3µs ± 0%    74.3µs ± 0%   +15.66%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2              13.1µs ± 0%    13.5µs ± 0%    +3.03%  (p=0.000 n=9+10)

name                                   old speed      new speed      delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/1MValid-2                 3.39GB/s ± 0%  0.34GB/s ± 0%   -90.02%  (p=0.000 n=10+9)
ScanComments/10ValidUtf8-2             2.94GB/s ± 0%  0.62GB/s ± 1%   -78.83%  (p=0.000 n=8+10)
ScanComments/1kValidUtf8-2             3.32GB/s ± 0%  0.43GB/s ± 0%   -87.11%  (p=0.000 n=9+9)
ScanComments/1MValidUtf8-2             3.39GB/s ± 0%  0.43GB/s ± 0%   -87.39%  (p=0.000 n=10+9)
ScanComments/10Valid-2                 2.94GB/s ± 0%  0.36GB/s ± 0%   -87.60%  (p=0.000 n=8+9)
ScanComments/1kValid-2                 3.32GB/s ± 0%  0.34GB/s ± 1%   -89.82%  (p=0.000 n=8+10)
ParseLiteralStringValid/1MValidUtf8-2  1.70GB/s ± 0%  0.42GB/s ± 0%   -75.27%  (p=0.000 n=9+10)
ParseLiteralStringValid/10Valid-2      1.16GB/s ± 0%  0.34GB/s ± 4%   -70.60%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2      1.67GB/s ± 0%  0.34GB/s ± 0%   -79.72%  (p=0.000 n=9+8)
ParseLiteralStringValid/1MValid-2      1.69GB/s ± 0%  0.34GB/s ± 0%   -80.01%  (p=0.000 n=9+8)
ParseLiteralStringValid/10ValidUtf8-2  1.16GB/s ± 0%  0.58GB/s ± 0%   -50.19%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2  1.67GB/s ± 0%  0.42GB/s ± 0%   -74.90%  (p=0.000 n=9+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2              49.9MB/s ± 0%  49.2MB/s ± 0%    -1.38%  (p=0.000 n=9+9)
UnmarshalDataset/canada-2              27.8MB/s ± 1%  28.1MB/s ± 0%    +1.10%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2        22.9MB/s ± 1%  23.1MB/s ± 0%    +0.92%  (p=0.000 n=10+9)
UnmarshalDataset/twitter-2             49.5MB/s ± 1%  49.3MB/s ± 0%    -0.47%  (p=0.006 n=10+9)
UnmarshalDataset/code-2                28.0MB/s ± 0%  27.7MB/s ± 0%    -1.04%  (p=0.000 n=10+8)
UnmarshalDataset/example-2             52.4MB/s ± 0%  51.7MB/s ± 0%    -1.28%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2      20.9MB/s ± 2%  20.6MB/s ± 0%    -1.64%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2         14.1MB/s ± 1%  13.6MB/s ± 1%    -3.62%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2        139MB/s ± 2%   110MB/s ± 0%   -20.34%  (p=0.000 n=10+10)
Unmarshal/ReferenceFile/map-2          81.6MB/s ± 0%  70.5MB/s ± 0%   -13.54%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2            41.7MB/s ± 0%  40.5MB/s ± 0%    -2.94%  (p=0.000 n=9+10)

Fixes some of #613:

pelletier added 9 commits October 14, 2021 12:26

parser: flag invalid unicode escape sequence

bbdfe2a

Fix basic strings characters validation

2592473

Fix invalid unicode in single line literal strings

5a562e9

Flag invalid unicode in multiline literal strings

dc406f6

Enable two more tests for basic string checks

b9df547

Fix multiline basic string tests

98ef4b4

Handle invalid characters in comments

a2a6334

Add benchmark file

4444f1e

Add tests and remove dead code

5c8cc59

pelletier merged commit cd54472 into v2 Oct 15, 2021

pelletier deleted the utf8 branch October 15, 2021 23:13

moorereason mentioned this pull request Oct 16, 2021

v2: disallowed UTF-8 sequence inside string not detected #631

Closed

pelletier added the bug label Oct 28, 2021

pelletier changed the title ~~Validate UTF-8~~ Decode: validate UTF-8 Oct 28, 2021

oschwald mentioned this pull request Feb 23, 2022

Decode: convert table key to correct type #741

Merged
