Decode: validate UTF-8 #629

Merged
merged 9 commits into from
Oct 15, 2021

Conversation

pelletier
Owner

Follow-up to #620. This validates UTF-8 characters rather than decoding them. The same pass also checks against the set of characters TOML forbids (TOML doesn't allow all UTF-8 characters everywhere). This goes against the two-pass version I was initially thinking of, but it ended up being faster on my machine. There is still the option to move that validation to vectorized instructions, but that's only useful for long strings / comments (see segmentio/asm#58), which seem quite rare in TOML docs.
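For readers who want a picture of the single-pass idea, here is a minimal sketch. It is not the code from this PR: the names `invalidASCII` and `firstInvalid` are hypothetical, and the multi-byte check uses the standard library's `utf8.DecodeRune` (which decodes) as a stand-in for a validate-only routine. The control-character set assumed here is the one TOML forbids in comments and literal strings (U+0000 to U+0008, U+000A to U+001F, U+007F).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidASCII reports whether c is a control character that TOML
// forbids in comments and literal strings. Tab (0x09) is allowed.
func invalidASCII(c byte) bool {
	return (c < 0x20 && c != '\t') || c == 0x7F
}

// firstInvalid walks b once and returns the byte offset of the first
// forbidden control character or malformed UTF-8 sequence, or -1 if
// none is found. The caller is assumed to have already cut b at the
// line ending, so a line feed inside b is reported as invalid.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		c := b[i]
		if c < utf8.RuneSelf {
			// ASCII fast path: only the TOML character check applies.
			if invalidASCII(c) {
				return i
			}
			i++
			continue
		}
		// Multi-byte path. DecodeRune yields (RuneError, 1) on a
		// malformed sequence; a real U+FFFD decodes with size 3.
		r, size := utf8.DecodeRune(b[i:])
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	inputs := []string{"plain text\twith tab", "nul\x00byte", "caf\u00e9", "bad\xff"}
	for _, s := range inputs {
		fmt.Printf("%q -> %d\n", s, firstInvalid([]byte(s)))
	}
}
```

Both checks share one loop, so each byte is visited once. A two-pass variant would instead run a UTF-8 validity check over the whole slice and then scan again for forbidden characters.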

Also added some string and comment parsing benchmarks. They're not very useful from a user perspective, but they gave a much more stable signal while implementing those methods.
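As an illustration of how size-parameterized micro-benchmarks like these can report throughput, here is a rough sketch. Everything in it is an assumption: `makeInput` and `benchValidate` are hypothetical helpers, the input shapes are not the PR's fixtures, and `utf8.Valid` stands in for the scanner and parser methods under test. It uses `testing.Benchmark` so that it runs as a plain program.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
	"unicode/utf8"
)

// makeInput builds a payload of n repeated units. The unit choice
// (ASCII "a" or two-byte "é") is an assumption for illustration.
func makeInput(n int, multibyte bool) []byte {
	unit := "a"
	if multibyte {
		unit = "\u00e9"
	}
	return bytes.Repeat([]byte(unit), n)
}

// benchValidate times a stand-in validator (utf8.Valid) over input.
func benchValidate(input []byte) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(input))) // makes the result report MB/s
		for i := 0; i < b.N; i++ {
			if !utf8.Valid(input) {
				b.Fatal("input unexpectedly invalid")
			}
		}
	})
}

func main() {
	for _, n := range []int{10, 1000} {
		for _, multibyte := range []bool{false, true} {
			res := benchValidate(makeInput(n, multibyte))
			fmt.Printf("n=%d multibyte=%v: %s\n", n, multibyte, res)
		}
	}
}
```

Calling `b.SetBytes` is what produces a speed column alongside time/op. Isolating one method per benchmark keeps unrelated decoding work out of the measurement.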


Compared to #620. The map benchmark is slower on that run, but when run separately it was on par. Allocations are omitted because there is no change.

name                                   old time/op    new time/op    delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/10Valid-2                   39.0ns ± 0%    35.7ns ± 0%   -8.37%  (p=0.000 n=8+10)
ScanComments/1kValid-2                   3.34µs ± 0%    3.04µs ± 1%   -8.92%  (p=0.000 n=10+10)
ScanComments/1MValid-2                   3.41ms ± 0%    3.10ms ± 0%   -9.00%  (p=0.000 n=9+9)
ScanComments/10ValidUtf8-2               23.7ns ± 0%    20.9ns ± 1%  -11.68%  (p=0.000 n=9+10)
ScanComments/1kValidUtf8-2               2.66µs ± 0%    2.40µs ± 0%   -9.79%  (p=0.000 n=10+9)
ScanComments/1MValidUtf8-2               2.72ms ± 0%    2.46ms ± 0%   -9.87%  (p=0.000 n=9+9)
ParseLiteralStringValid/1MValidUtf8-2    2.74ms ± 0%    2.50ms ± 0%   -8.58%  (p=0.000 n=10+10)
ParseLiteralStringValid/10Valid-2        38.2ns ± 0%    35.1ns ± 4%   -8.03%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2        3.34µs ± 0%    3.03µs ± 0%   -9.09%  (p=0.000 n=10+8)
ParseLiteralStringValid/1MValid-2        3.41ms ± 0%    3.10ms ± 0%   -9.25%  (p=0.000 n=10+8)
ParseLiteralStringValid/10ValidUtf8-2    22.8ns ± 0%    20.7ns ± 0%   -8.99%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2    2.67µs ± 0%    2.45µs ± 0%   -8.24%  (p=0.000 n=10+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2                22.7ms ± 0%    21.3ms ± 0%   -6.23%  (p=0.000 n=10+9)
UnmarshalDataset/canada-2                79.3ms ± 1%    78.3ms ± 0%   -1.19%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2          24.8ms ± 1%    24.2ms ± 0%   -2.40%  (p=0.000 n=9+9)
UnmarshalDataset/twitter-2               9.34ms ± 1%    8.96ms ± 0%   -4.02%  (p=0.000 n=10+9)
UnmarshalDataset/code-2                  98.4ms ± 1%    96.9ms ± 0%   -1.50%  (p=0.000 n=10+8)
UnmarshalDataset/example-2                167µs ± 0%     157µs ± 0%   -6.50%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2         537ns ± 1%     534ns ± 0%   -0.54%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2            788ns ± 1%     809ns ± 1%   +2.67%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2         48.1µs ± 0%    47.5µs ± 0%   -1.35%  (p=0.000 n=8+10)
Unmarshal/ReferenceFile/map-2            75.0µs ± 0%    74.3µs ± 0%   -0.95%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2              14.3µs ± 0%    13.5µs ± 0%   -5.86%  (p=0.000 n=10+10)

name                                   old speed      new speed      delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/10Valid-2                  334MB/s ± 0%   364MB/s ± 0%   +9.17%  (p=0.000 n=8+9)
ScanComments/1kValid-2                  308MB/s ± 0%   338MB/s ± 1%   +9.80%  (p=0.000 n=10+10)
ScanComments/1MValid-2                  308MB/s ± 0%   338MB/s ± 0%   +9.89%  (p=0.000 n=9+9)
ScanComments/10ValidUtf8-2              549MB/s ± 0%   622MB/s ± 1%  +13.23%  (p=0.000 n=9+10)
ScanComments/1kValidUtf8-2              386MB/s ± 0%   428MB/s ± 0%  +10.85%  (p=0.000 n=10+9)
ScanComments/1MValidUtf8-2              385MB/s ± 0%   427MB/s ± 0%  +10.95%  (p=0.000 n=9+9)
ParseLiteralStringValid/1MValidUtf8-2   383MB/s ± 0%   419MB/s ± 0%   +9.38%  (p=0.000 n=10+10)
ParseLiteralStringValid/10Valid-2       314MB/s ± 0%   342MB/s ± 4%   +8.76%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2       308MB/s ± 0%   338MB/s ± 0%  +10.01%  (p=0.000 n=10+8)
ParseLiteralStringValid/1MValid-2       307MB/s ± 0%   339MB/s ± 0%  +10.20%  (p=0.000 n=10+8)
ParseLiteralStringValid/10ValidUtf8-2   527MB/s ± 0%   578MB/s ± 0%   +9.87%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2   384MB/s ± 0%   419MB/s ± 0%   +8.98%  (p=0.000 n=10+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2              46.1MB/s ± 0%  49.2MB/s ± 0%   +6.65%  (p=0.000 n=10+9)
UnmarshalDataset/canada-2              27.8MB/s ± 1%  28.1MB/s ± 0%   +1.19%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2        22.5MB/s ± 1%  23.1MB/s ± 0%   +2.47%  (p=0.000 n=9+9)
UnmarshalDataset/twitter-2             47.4MB/s ± 0%  49.3MB/s ± 0%   +4.12%  (p=0.000 n=9+9)
UnmarshalDataset/code-2                27.3MB/s ± 1%  27.7MB/s ± 0%   +1.53%  (p=0.000 n=10+8)
UnmarshalDataset/example-2             48.4MB/s ± 0%  51.7MB/s ± 0%   +6.95%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2      20.5MB/s ± 1%  20.6MB/s ± 0%   +0.55%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2         14.0MB/s ± 1%  13.6MB/s ± 1%   -2.61%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2        109MB/s ± 0%   110MB/s ± 0%   +1.37%  (p=0.000 n=8+10)
Unmarshal/ReferenceFile/map-2          69.9MB/s ± 0%  70.5MB/s ± 0%   +0.96%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2            38.1MB/s ± 0%  40.5MB/s ± 0%   +6.23%  (p=0.000 n=10+10)


Compared to base (cc0d1a9), it's much slower, which is to be expected given that the parser and scanner are doing a lot more work.

name                                   old time/op    new time/op    delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/1MValid-2                    310µs ± 0%    3100µs ± 0%  +901.54%  (p=0.000 n=10+9)
ScanComments/10ValidUtf8-2               4.43ns ± 0%   20.91ns ± 1%  +372.46%  (p=0.000 n=8+10)
ScanComments/1kValidUtf8-2                310ns ± 0%    2402ns ± 0%  +675.69%  (p=0.000 n=9+9)
ScanComments/1MValidUtf8-2                310µs ± 0%    2456µs ± 0%  +692.92%  (p=0.000 n=10+9)
ScanComments/10Valid-2                   4.43ns ± 0%   35.71ns ± 0%  +706.85%  (p=0.000 n=8+10)
ScanComments/1kValid-2                    309ns ± 0%    3040ns ± 1%  +882.52%  (p=0.000 n=8+10)
ParseLiteralStringValid/1MValidUtf8-2     619µs ± 0%    2502µs ± 0%  +304.44%  (p=0.000 n=9+10)
ParseLiteralStringValid/10Valid-2        10.3ns ± 0%    35.1ns ± 4%  +240.18%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2         615ns ± 0%    3033ns ± 0%  +393.17%  (p=0.000 n=9+8)
ParseLiteralStringValid/1MValid-2         619µs ± 0%    3096µs ± 0%  +400.35%  (p=0.000 n=9+8)
ParseLiteralStringValid/10ValidUtf8-2    10.3ns ± 0%    20.7ns ± 0%  +100.76%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2     615ns ± 0%    2450ns ± 0%  +298.41%  (p=0.000 n=9+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2                21.0ms ± 0%    21.3ms ± 0%    +1.39%  (p=0.000 n=9+9)
UnmarshalDataset/canada-2                79.2ms ± 1%    78.3ms ± 0%    -1.09%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2          24.4ms ± 1%    24.2ms ± 0%    -0.91%  (p=0.000 n=10+9)
UnmarshalDataset/twitter-2               8.92ms ± 1%    8.96ms ± 0%    +0.47%  (p=0.006 n=10+9)
UnmarshalDataset/code-2                  95.9ms ± 0%    96.9ms ± 0%    +1.05%  (p=0.000 n=10+8)
UnmarshalDataset/example-2                155µs ± 0%     157µs ± 0%    +1.30%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2         525ns ± 2%     534ns ± 0%    +1.66%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2            780ns ± 1%     809ns ± 1%    +3.74%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2         37.8µs ± 2%    47.5µs ± 0%   +25.52%  (p=0.000 n=10+10)
Unmarshal/ReferenceFile/map-2            64.3µs ± 0%    74.3µs ± 0%   +15.66%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2              13.1µs ± 0%    13.5µs ± 0%    +3.03%  (p=0.000 n=9+10)

name                                   old speed      new speed      delta
pkg:github.com/pelletier/go-toml/v2 goos:linux goarch:amd64
ScanComments/1MValid-2                 3.39GB/s ± 0%  0.34GB/s ± 0%   -90.02%  (p=0.000 n=10+9)
ScanComments/10ValidUtf8-2             2.94GB/s ± 0%  0.62GB/s ± 1%   -78.83%  (p=0.000 n=8+10)
ScanComments/1kValidUtf8-2             3.32GB/s ± 0%  0.43GB/s ± 0%   -87.11%  (p=0.000 n=9+9)
ScanComments/1MValidUtf8-2             3.39GB/s ± 0%  0.43GB/s ± 0%   -87.39%  (p=0.000 n=10+9)
ScanComments/10Valid-2                 2.94GB/s ± 0%  0.36GB/s ± 0%   -87.60%  (p=0.000 n=8+9)
ScanComments/1kValid-2                 3.32GB/s ± 0%  0.34GB/s ± 1%   -89.82%  (p=0.000 n=8+10)
ParseLiteralStringValid/1MValidUtf8-2  1.70GB/s ± 0%  0.42GB/s ± 0%   -75.27%  (p=0.000 n=9+10)
ParseLiteralStringValid/10Valid-2      1.16GB/s ± 0%  0.34GB/s ± 4%   -70.60%  (p=0.000 n=9+9)
ParseLiteralStringValid/1kValid-2      1.67GB/s ± 0%  0.34GB/s ± 0%   -79.72%  (p=0.000 n=9+8)
ParseLiteralStringValid/1MValid-2      1.69GB/s ± 0%  0.34GB/s ± 0%   -80.01%  (p=0.000 n=9+8)
ParseLiteralStringValid/10ValidUtf8-2  1.16GB/s ± 0%  0.58GB/s ± 0%   -50.19%  (p=0.000 n=10+10)
ParseLiteralStringValid/1kValidUtf8-2  1.67GB/s ± 0%  0.42GB/s ± 0%   -74.90%  (p=0.000 n=9+9)
pkg:github.com/pelletier/go-toml/v2/benchmark goos:linux goarch:amd64
UnmarshalDataset/config-2              49.9MB/s ± 0%  49.2MB/s ± 0%    -1.38%  (p=0.000 n=9+9)
UnmarshalDataset/canada-2              27.8MB/s ± 1%  28.1MB/s ± 0%    +1.10%  (p=0.000 n=10+10)
UnmarshalDataset/citm_catalog-2        22.9MB/s ± 1%  23.1MB/s ± 0%    +0.92%  (p=0.000 n=10+9)
UnmarshalDataset/twitter-2             49.5MB/s ± 1%  49.3MB/s ± 0%    -0.47%  (p=0.006 n=10+9)
UnmarshalDataset/code-2                28.0MB/s ± 0%  27.7MB/s ± 0%    -1.04%  (p=0.000 n=10+8)
UnmarshalDataset/example-2             52.4MB/s ± 0%  51.7MB/s ± 0%    -1.28%  (p=0.000 n=9+9)
Unmarshal/SimpleDocument/struct-2      20.9MB/s ± 2%  20.6MB/s ± 0%    -1.64%  (p=0.001 n=10+10)
Unmarshal/SimpleDocument/map-2         14.1MB/s ± 1%  13.6MB/s ± 1%    -3.62%  (p=0.000 n=9+9)
Unmarshal/ReferenceFile/struct-2        139MB/s ± 2%   110MB/s ± 0%   -20.34%  (p=0.000 n=10+10)
Unmarshal/ReferenceFile/map-2          81.6MB/s ± 0%  70.5MB/s ± 0%   -13.54%  (p=0.000 n=10+8)
Unmarshal/HugoFrontMatter-2            41.7MB/s ± 0%  40.5MB/s ± 0%    -2.94%  (p=0.000 n=9+10)

Fixes some of #613:

  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_CommentDel
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_CommentLf
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_CommentNull
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_CommentUs
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_MultiDel
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_MultiLf
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_MultiNull
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_MultiUs
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawmultiDel
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawmultiLf
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawmultiNull
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawmultiUs
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawstringDel
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawstringLf
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawstringNull
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_RawstringUs
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_StringBs
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_StringDel
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_StringLf
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_StringNull
  • go test -tags testsuite -run TestTOMLTest_Invalid_Control_StringUs
  • go test -tags testsuite -run TestTOMLTest_Invalid_String_BadCodepoint
  • go test -tags testsuite -run TestTOMLTest_Invalid_String_BasicMultilineOutOfRangeUnicodeEscape1
  • go test -tags testsuite -run TestTOMLTest_Invalid_String_BasicMultilineOutOfRangeUnicodeEscape2
  • go test -tags testsuite -run TestTOMLTest_Invalid_String_BasicOutOfRangeUnicodeEscape1
  • go test -tags testsuite -run TestTOMLTest_Invalid_String_BasicOutOfRangeUnicodeEscape2
  • go test -tags testsuite -run TestTOMLTest_Valid_String_UnicodeEscape
  • go test -tags testsuite -run TestTOMLTest_Valid_String_UnicodeLiteral

@pelletier pelletier merged commit cd54472 into v2 Oct 15, 2021
@pelletier pelletier deleted the utf8 branch October 15, 2021 23:13
@pelletier pelletier added the bug Issues describing a bug in go-toml. label Oct 28, 2021
@pelletier pelletier changed the title Validate UTF-8 Decode: validate UTF-8 Oct 28, 2021