Optimize a scan of non-state-changing bytes with SSE2 instructions
This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.

The [_mm_cmpestri](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_cmpestri) operation appears to be quite slow
compared to an alternative approach based on [_mm_shuffle_epi8](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_shuffle_epi8).
The alternative applies `_mm_shuffle_epi8` separately to the low and high nibbles of each input byte and combines the two results with a bitwise AND,
which yields the LUT values for 16 bytes in one go (it also involves a handful of other SSE2 operations, all of which have good latency/throughput properties).
The resulting 16 LUT bytes can then be analyzed (again with vector instructions) to get the index of the first byte (if any)
that changes the state, that is, the first byte whose LUT value is zero.
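As a rough illustration of the lookup described above, here is a scalar Python model of what the vectorized code computes for one 16-byte chunk. The tables `A` and `B`, the helper `first_state_changing_index`, and the digits-only example are assumptions made for this sketch and are not taken from the commit. The SSE version would perform the same per-byte steps on all 16 bytes at once.

```python
def first_state_changing_index(chunk, A, B):
    """Return the index of the first byte whose combined lookup is zero,
    or len(chunk) if no byte in the chunk changes the state."""
    for idx, byte in enumerate(chunk):
        lo = byte & 0x0F   # low nibble selects an entry of A
        hi = byte >> 4     # high nibble selects an entry of B
        if (A[lo] & B[hi]) == 0:
            return idx
    return len(chunk)


# Hypothetical example: only ASCII digits '0'..'9' (0x30..0x39) keep the state.
A = [1 if lo <= 9 else 0 for lo in range(16)]
B = [1 if hi == 3 else 0 for hi in range(16)]

chunk = b"1234567890abcdef"
print(first_state_changing_index(chunk, A, B))  # prints 10 (the byte 'a')
```

In this model a return value equal to the chunk length means the whole chunk can be skipped and the scan continues with the next 16 bytes.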

The tricky part here is the following:

```
Find arrays A, B (uint8_t[16]) such that for all i, j = 0..15:
* (A[i] & B[j]) == 0 if LUT[i | (j << 4)] == 0
* (A[i] & B[j]) != 0 if LUT[i | (j << 4)] != 0  // Note: any non-zero value will do
```
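To make the conditions concrete, the sketch below builds `A` and `B` for a sample LUT and checks them exhaustively. It deliberately uses a simple constructive scheme rather than the Z3 search used in this commit: every distinct non-empty set of accepted low nibbles gets its own bit, so it only works when at most 8 such sets exist. The function name `build_nibble_tables` and the alphanumeric sample LUT are hypothetical, and the check assumes the two table values are combined with a bitwise AND.

```python
def build_nibble_tables(lut):
    """Build (A, B) from a 256-entry LUT, or return None if this simple
    one-bit-per-pattern scheme needs more than 8 bits."""
    A = [0] * 16
    B = [0] * 16
    bit_of_pattern = {}  # set of accepted low nibbles -> bit assigned to it
    for j in range(16):
        # Low nibbles i for which LUT[i | (j << 4)] is non-zero.
        pattern = frozenset(i for i in range(16) if lut[i | (j << 4)] != 0)
        if not pattern:
            continue  # B[j] stays 0, so every (i, j) pair ANDs to zero
        if pattern not in bit_of_pattern:
            if len(bit_of_pattern) == 8:
                return None  # ran out of bits in a uint8_t
            bit_of_pattern[pattern] = 1 << len(bit_of_pattern)
        bit = bit_of_pattern[pattern]
        B[j] = bit
        for i in pattern:
            A[i] |= bit
    return A, B


# Hypothetical sample LUT: non-zero for ASCII letters and digits only.
lut = [int(bytes([b]).isalnum()) for b in range(256)]
A, B = build_nibble_tables(lut)

# Check both conditions for every byte value.
ok = all(((A[b & 0x0F] & B[b >> 4]) != 0) == (lut[b] != 0) for b in range(256))
print(ok)  # prints True
```

A solver-based search is more general than this scheme, since it is free to give `B[j]` several bits and to share bits between patterns.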

To find `A` and `B` satisfying the above conditions, the [Z3](https://github.com/Z3Prover/z3) solver is used.
The npm package that wraps Z3 for use from TypeScript exposes an async API that the author of this change found awkward to work with,
so another package (synckit) was required to handle the async API of the Z3 wrapper.

Using llhttp as a benchmark framework, this change yields the following improvements:

```
Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

http: "seanmonstar/httparse" (C)
BEFORE: 8192.00 mb | 1456.72 mb/s | 2172811.81 ops/sec | 5.62 s
AFTER:  8192.00 mb | 1752.90 mb/s | 2614577.82 ops/sec | 4.67 s

~20% improvement

http: "nodejs/http-parser" (C)
BEFORE: 8192.00 mb | 1050.60 mb/s | 2118535.14 ops/sec | 7.80 s
AFTER:  8192.00 mb | 1167.42 mb/s | 2354101.76 ops/sec | 7.02 s

~11% improvement
```

For messages with more header fields, the improvement might be even larger.
ngrodzitski committed Oct 10, 2023
1 parent 4d7e352 commit 71da0d6
Showing 5 changed files with 3,077 additions and 28 deletions.
