Fix 575, regression tests #582

kozross · 2023-04-18T01:18:35Z

Sorry for the long wait on this one folks. This fixes #575, as well as incorporating @Bodigrim's regression tests (which all pass, of course).

The reason this happened was a subtle mistake on my part. In brief, the way the SIMDed (both SSE and AVX) validator used to work is roughly as so:

Break the input up into 128 byte (64 byte on SSE) strides.
For each stride:
a. Check speculatively if the entire block is ASCII.
b. If the entire block is ASCII, plough on.
c. Otherwise, check if the block is UTF-8.
d. Accumulate error
Finish up the tail one byte at a time.

A key factor to making this work is the ability to 'preserve validation state' across strides: this means that, were we to have a stride where the last byte is the start of a multi-byte sequence, the validator is aware of this and can respond.

The issue I had to fix was that step 2a did two very naughty things:

alpha. Ignore the 'preserved state' from the previous stride entirely.
beta. Overwrite the 'preserved state' with a 'clean' one, indicating no carryover.

Thus, the issues observed previously. The checks for both SSE and AVX versions of speculative ASCII validation have been modified to take this into account. Near as I can tell, this has no impact on performance: the code was choked on memory movement (and the corresponding true dependencies), and register pressure was not increased by this change. I've added comments explaining exactly how both versions work in case someone else needs to maintain this code.

I also added an extra test to exclude the possibility that ASCII verification itself was broken (it wasn't).

…of ASCII

kozross · 2023-04-18T01:47:25Z

@Bodigrim - what version of QuickCheck gets pulled into our unusual architecture jobs? I'm also seeing some tests regarding lazy ByteStrings which failed even before I made any changes, but not consistently.

Bodigrim · 2023-04-18T23:36:44Z

@kozross it's QuickCheck-2.12. Feel free to bump uraimo/run-on-arch-action to v2.5 or whichever is the latest, and distro to ubuntu22.04, but I think it still is not enough for chooseEnum; use old plain choose instead.

kozross · 2023-04-19T01:28:36Z

I'm getting a bunch of failures for lazy bytestring tests on both Windows and 64-bit ARM. The NetBSD tests are just not starting at all. Not sure how to fix either issue.

clyring · 2023-04-19T13:31:02Z

I'm not sure what you mean by "failures for lazy bytestring tests on both Windows and 64-bit ARM." Are you seeing these locally? These are the only test failures I see in the CI logs:

windows

2023-04-19T01:24:48.8529648Z     Invalid UTF-8 ByteString:                              FAIL (0.45s)
2023-04-19T01:24:48.8545529Z       *** Failed! Falsified (after 20 tests and 188 shrinks):
2023-04-19T01:24:48.8570168Z       InvalidUtf8 {prefix = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL", invalid = "\244\176\142\169", suffix = "", asBS = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\244\176\142\169", length = 34}
2023-04-19T01:24:48.8583723Z       Use --quickcheck-replay=316071 to reproduce.
2023-04-19T01:24:48.8608578Z       Use -p '/Invalid UTF-8 ByteString/' to rerun this test only.
[...]
2023-04-19T01:24:48.8652669Z     Invalid UTF-8 ShortByteString:                         FAIL
2023-04-19T01:24:48.8653501Z       *** Failed! Falsified (after 47 tests and 35 shrinks):
2023-04-19T01:24:48.8654413Z       InvalidUtf8 {prefix = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL", invalid = "\240\131\183\168", suffix = "", asBS = "!\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\240\131\183\168", length = 34}
2023-04-19T01:24:48.8655491Z       Use --quickcheck-replay=723812 to reproduce.
2023-04-19T01:24:48.8656096Z       Use -p '/Invalid UTF-8 ShortByteString/' to rerun this test only.

arm64

2023-04-19T01:22:59.1431991Z       Invalid byte between spaces:                         FAIL
2023-04-19T01:22:59.1432363Z         *** Failed! Falsified (after 1 test):
2023-04-19T01:22:59.1432794Z         Use --quickcheck-replay=96163 to reproduce.
2023-04-19T01:22:59.1433296Z         Use -p '/Invalid byte between spaces/' to rerun this test only.
2023-04-19T01:22:59.1433711Z       Two invalid bytes between spaces:                    FAIL
2023-04-19T01:22:59.1434074Z         *** Failed! Falsified (after 1 test):
2023-04-19T01:22:59.1434498Z         Use --quickcheck-replay=334360 to reproduce.
2023-04-19T01:22:59.1434998Z         Use -p '/Two invalid bytes between spaces/' to rerun this test only.
2023-04-19T01:22:59.1435716Z       Three invalid bytes between spaces:                  FAIL
2023-04-19T01:22:59.1436076Z         *** Failed! Falsified (after 1 test):
2023-04-19T01:22:59.1436513Z         Use --quickcheck-replay=72422 to reproduce.
2023-04-19T01:22:59.1437029Z         Use -p '/Three invalid bytes between spaces/' to rerun this test only.

The NetBSD job is indeed currently broken. I haven't seen this s390x failure before; I just tried restarting the job.

kozross · 2023-04-19T19:15:39Z

@clyring - thanks for pointing that out to me: I thought the laziness stuff indicated failures due to exceptions. I'll try and repro those locally, but it's quite odd that these don't show up anywhere else.

Edit: The ARM stuff isn't too surprising actually: I forgot to put the fix in the NEON code, but that shouldn't be difficult.

Bodigrim · 2023-04-19T19:40:48Z

@kozross if you rebase atop the latest master, NetBSD failure should go away.

kozross · 2023-04-24T02:09:58Z

@Bodigrim - the NEON side of things is fixed now, using a similar fix to the SSE/AVX ones. However, the Windows issue is weird, and I'm unable to diagnose it. Here's what I can say for sure:

It's definitely a Windows-only problem: it triggers basically every time, but none of its seeds or specific cases fail on anything else I've been able to test.
The failure doesn't happen in SIMD code: whichever version of SSE or AVX Windows ends up compiling to, neither will get called on the failing cases, as they're too small.
Every case has the same form: a 32-byte-ish stretch of ASCII, followed by 2-3 bytes of nearly-valid UTF-8.
This seems to happen with GHC 9.6.1, and whichever C compiler it's bundled with. While our test matrix doesn't include 9.6 for Linux, I checked locally, and the issue doesn't seem to happen there.

My best guess is that there's a subtle form of UB at play here which gets compiled differently on Windows than on Mac or Linux. However, this both seems unlikely, and is nearly impossible for me to verify: it's clearly not a Clang-versus-GCC issue, as 9.6 on Windows bundles Clang AFAICT, and the Mac build doesn't hit this problem. Thus, it means it's some kind of Clang version divergence in the non-SIMDed validation code, and as it currently stands, I don't even know what Clang version gets bundled with GHC 9.6.1 on Windows, and have no good way to check.

If someone has access to a Windows machine, and some expertise with C-on-Windows-with-Clang, I'd appreciate the assistance, as I have neither.

Kleidukos · 2023-04-24T07:23:19Z

According to this comment, this might be Clang 15 haskell/cabal#8841 (comment)

Bodigrim · 2023-04-24T17:42:21Z

@kozross
On Windows GHC 9.4 (shipped with Clang 13.0.0) works fine, but GHC 9.6 (shipped with Clang 14.0.6) fails in the same manner as the CI job. Running GHC 9.4 with Clang 14.0.6 also fails, so you are likely right that the issue is connected to Clang version.

Bodigrim · 2023-04-24T18:07:12Z

@kozross could you please rebase once more? And feel free to set fail-fast: false for Windows jobs in your branch. Let's see which of them pass.

…of ASCII

sergv · 2023-04-25T22:02:21Z

Tests on Windows seem to fail because __builtin_ctzl accepts longs, which are 4 byte wide there. The argument to __builtin_ctzl is an 8-byte uint64_t which leads to truncation (or is it UB??). In any case it seems that __builtin_ctzll than accepts long long should be used instead to always process all 8 bytes. Cannot comment inline so will point thus.

kozross · 2023-04-25T22:32:31Z

@sergv - thanks, that's an excellent catch!

Bodigrim · 2023-04-25T23:02:17Z

CI looks good so far, great stuff!

@kozross could you possibly rebase one last time atop of #578 please? There've been a bug in shrinking which I just fixed + added one more test case.

…valid-utf8

kozross · 2023-04-25T23:19:04Z

@Bodigrim - done.

Bodigrim

Excellent, many thanks!

clyring · 2023-04-26T14:04:32Z

Gross. Why isn't -Wconversion on by default?

clyring

(To be clear, my previous comment was an expression of frustration rather than a request for this patch. Thanks @sergv for spotting that; I was not looking forward to diagnosing it.)

cbits/aarch64/is-valid-utf8.c

cbits/is-valid-utf8.c

Bodigrim · 2023-05-20T09:02:16Z

@kozross @clyring what remains to be done here? Only enabling -fno-strict-aliasing? I'm keen to get this merged and released.

kozross · 2023-05-20T19:04:32Z

@Bodigrim - yep, -fno-strict-aliasing is the only thing left to go as far as I can tell.

clyring · 2023-05-22T02:04:24Z

I'd rather rewrite to avoid strict-aliasing violations to begin with, but it doesn't make sense to hold up this patch over that. Let's add the flag to cc-options and merge.

kozross · 2023-05-22T05:13:45Z

@clyring This flag is used fairly often, and a rewrite to fix this would be tedious and pointless. There's nothing per-se wrong with disabling this behaviour: the standard is not a goal in and of itself.

Bodigrim · 2023-06-12T23:20:57Z

@clyring good to go now?

clyring

@clyring good to go now?

Yes. To quote myself:

Let's add the flag to cc-options and merge.

I'll create separate issues for the low-priority follow-up work.

clyring · 2023-06-13T13:50:23Z

@kozross @sergv @Bodigrim Thanks again!

* Refactor isValidUtf8 tests to enable shrinking * Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII * Repair invalid UTF-8 issue, more tests * Fix NEON ASCII check * Refactor isValidUtf8 tests to enable shrinking * Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII * Use ctzll to ensure 8 bytes get processed * Set -fno-strict-aliasing --------- Co-authored-by: Bodigrim <andrew.lelechenko@gmail.com>

Bodigrim added 2 commits February 18, 2023 21:35

Refactor isValidUtf8 tests to enable shrinking

0b1314b

Test isValidUtf8 on strings which are almost valid, with long ranges …

ad4c5a5

…of ASCII

kozross force-pushed the koz/invalid-utf8 branch from e4d7cdd to 33c4003 Compare April 18, 2023 01:25

Repair invalid UTF-8 issue, more tests

7be4489

kozross force-pushed the koz/invalid-utf8 branch from 33c4003 to 7be4489 Compare April 19, 2023 01:08

Merge branch 'master' into koz/invalid-utf8

4680898

kozross force-pushed the koz/invalid-utf8 branch 2 times, most recently from 8ca5425 to 71cbb89 Compare April 24, 2023 01:22

Fix NEON ASCII check

4c084d6

kozross force-pushed the koz/invalid-utf8 branch from 71cbb89 to 4c084d6 Compare April 24, 2023 01:47

Bodigrim mentioned this pull request Apr 24, 2023

Extend CI matrix #584

Merged

Bodigrim added 2 commits April 25, 2023 22:42

Refactor isValidUtf8 tests to enable shrinking

c830b45

Test isValidUtf8 on strings which are almost valid, with long ranges …

46be54f

…of ASCII

kozross added 2 commits April 26, 2023 10:44

Merge branch 'master' into koz/invalid-utf8

9bf994a

Use ctzll to ensure 8 bytes get processed

99f58db

Merge remote-tracking branch 'bodigrim/isValidUtf8-tests' into koz/in…

690a279

…valid-utf8

Bodigrim added this to the 0.11.5.0 milestone Apr 25, 2023

Bodigrim approved these changes Apr 25, 2023

View reviewed changes

Bodigrim requested a review from clyring April 25, 2023 23:46

clyring reviewed Apr 28, 2023

View reviewed changes

cbits/aarch64/is-valid-utf8.c Show resolved Hide resolved

cbits/is-valid-utf8.c Show resolved Hide resolved

cbits/is-valid-utf8.c Show resolved Hide resolved

cbits/is-valid-utf8.c Show resolved Hide resolved

Set -fno-strict-aliasing

2c92d63

clyring approved these changes Jun 13, 2023

View reviewed changes

clyring merged commit dac5675 into haskell:master Jun 13, 2023
22 of 24 checks passed

This was referenced Jun 13, 2023

Enable warnings in our C code #593

Open

Remove strict aliasing violations from our C code #594

Open

Bodigrim mentioned this pull request Jun 13, 2023

0.12.0.0 release planning #573

Closed

clyring mentioned this pull request Jun 16, 2023

Improve isValidUtf8 tests #578

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix 575, regression tests #582

Fix 575, regression tests #582

kozross commented Apr 18, 2023

kozross commented Apr 18, 2023

Bodigrim commented Apr 18, 2023

kozross commented Apr 19, 2023

clyring commented Apr 19, 2023

kozross commented Apr 19, 2023 •

edited

Bodigrim commented Apr 19, 2023

kozross commented Apr 24, 2023

Kleidukos commented Apr 24, 2023

Bodigrim commented Apr 24, 2023

Bodigrim commented Apr 24, 2023

sergv commented Apr 25, 2023

kozross commented Apr 25, 2023

Bodigrim commented Apr 25, 2023

kozross commented Apr 25, 2023

Bodigrim left a comment

clyring commented Apr 26, 2023

clyring left a comment

Bodigrim commented May 20, 2023

kozross commented May 20, 2023

clyring commented May 22, 2023

kozross commented May 22, 2023

Bodigrim commented Jun 12, 2023

clyring left a comment

clyring commented Jun 13, 2023

Fix 575, regression tests #582

Fix 575, regression tests #582

Conversation

kozross commented Apr 18, 2023

kozross commented Apr 18, 2023

Bodigrim commented Apr 18, 2023

kozross commented Apr 19, 2023

clyring commented Apr 19, 2023

kozross commented Apr 19, 2023 • edited

Bodigrim commented Apr 19, 2023

kozross commented Apr 24, 2023

Kleidukos commented Apr 24, 2023

Bodigrim commented Apr 24, 2023

Bodigrim commented Apr 24, 2023

sergv commented Apr 25, 2023

kozross commented Apr 25, 2023

Bodigrim commented Apr 25, 2023

kozross commented Apr 25, 2023

Bodigrim left a comment

Choose a reason for hiding this comment

clyring commented Apr 26, 2023

clyring left a comment

Choose a reason for hiding this comment

Bodigrim commented May 20, 2023

kozross commented May 20, 2023

clyring commented May 22, 2023

kozross commented May 22, 2023

Bodigrim commented Jun 12, 2023

clyring left a comment

Choose a reason for hiding this comment

clyring commented Jun 13, 2023

kozross commented Apr 19, 2023 •

edited