
Perf(zstd): Improve 'matchLen' performance by vector instructions. #823

Closed
wants to merge 2 commits

Conversation

@zzzzwc zzzzwc commented Jun 7, 2023

When using zstd to compress CSV text, we found that the matchLen function takes a lot of time, so we tried using vector instructions to speed it up.
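For context, the scalar path being optimized works roughly like this (a simplified sketch in plain Go, not the repository's exact code): compare 8 bytes at a time with a 64-bit XOR, and locate the first differing byte with a trailing-zero count.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// matchLen returns the number of leading bytes that a and b have in common.
// It compares 8 bytes per iteration: XOR two little-endian 64-bit loads and
// divide the trailing-zero bit count by 8 to find the first differing byte.
func matchLen(a, b []byte) (n int) {
	for ; len(a) >= 8 && len(b) >= 8; a, b = a[8:], b[8:] {
		diff := binary.LittleEndian.Uint64(a) ^ binary.LittleEndian.Uint64(b)
		if diff != 0 {
			return n + bits.TrailingZeros64(diff)>>3
		}
		n += 8
	}
	// Byte-wise tail for the last <8 bytes.
	for i := range a {
		if i >= len(b) || a[i] != b[i] {
			break
		}
		n++
	}
	return n
}

func main() {
	fmt.Println(matchLen([]byte("hello world"), []byte("hello wo_ld"))) // 8
}
```

The PR replaces this loop's hot path with 32-byte AVX2 compares; the scalar version above remains the reference behavior.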

The following is the benchmark test command and the comparison with the old version:

⇒  go test -run=None -bench='Encoder_EncodeAll|Random' -count=6  >  new.txt
⇒  go test -run=None -bench='Encoder_EncodeAll|Random' -count=6  >  old.txt
⇒  benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/klauspost/compress/zstd
cpu: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
                                     │   old.txt    │               new.txt               │
                                     │    sec/op    │    sec/op     vs base               │
Encoder_EncodeAllXML-64                19.13m ±  3%   18.13m ±  2%   -5.21% (p=0.002 n=6)
Encoder_EncodeAllSimple/fastest-64     328.7µ ±  3%   344.9µ ±  2%   +4.94% (p=0.002 n=6)
Encoder_EncodeAllSimple/default-64     489.9µ ±  4%   500.9µ ±  3%   +2.24% (p=0.026 n=6)
Encoder_EncodeAllSimple/better-64      634.4µ ±  3%   632.9µ ±  3%        ~ (p=0.937 n=6)
Encoder_EncodeAllSimple/best-64        2.674m ±  6%   2.680m ±  1%        ~ (p=0.937 n=6)
Encoder_EncodeAllSimple4K/fastest-64   16.84µ ±  3%   16.99µ ±  8%        ~ (p=0.818 n=6)
Encoder_EncodeAllSimple4K/default-64   44.17µ ±  3%   46.14µ ±  1%   +4.45% (p=0.002 n=6)
Encoder_EncodeAllSimple4K/better-64    56.05µ ±  3%   55.94µ ±  3%        ~ (p=0.732 n=6)
Encoder_EncodeAllSimple4K/best-64      285.4µ ± 25%   292.3µ ±  3%        ~ (p=0.485 n=6)
Encoder_EncodeAllHTML-64               315.6µ ±  1%   308.4µ ±  3%   -2.27% (p=0.002 n=6)
Encoder_EncodeAllTwain-64              4.579m ±  1%   4.717m ±  3%   +3.02% (p=0.015 n=6)
Encoder_EncodeAllPi-64                 1.582m ±  3%   1.612m ±  3%        ~ (p=0.132 n=6)
Random4KEncodeAllFastest-64            1.450µ ±  2%   1.443µ ±  3%        ~ (p=0.517 n=6)
Random10MBEncodeAllFastest-64          5.692m ±  8%   5.131m ±  2%   -9.86% (p=0.002 n=6)
Random4KEncodeAllDefault-64            4.013µ ±  4%   4.007µ ±  3%        ~ (p=0.699 n=6)
RandomEncodeAllDefault-64              4.504m ± 24%   4.539m ±  2%        ~ (p=0.699 n=6)
Random10MBEncoderFastest-64            7.018m ±  5%   4.841m ±  3%  -31.02% (p=0.002 n=6)
RandomEncoderDefault-64                9.438m ±  7%   5.803m ± 39%  -38.51% (p=0.002 n=6)
geomean                                473.0µ         451.3µ         -4.58%

                                     │    old.txt    │               new.txt                │
                                     │      B/s      │      B/s       vs base               │
Encoder_EncodeAllXML-64                266.5Mi ±  3%   281.2Mi ±  3%   +5.50% (p=0.002 n=6)
Encoder_EncodeAllSimple/fastest-64     115.5Mi ±  3%   110.1Mi ±  2%   -4.71% (p=0.002 n=6)
Encoder_EncodeAllSimple/default-64     77.49Mi ±  4%   75.79Mi ±  3%   -2.18% (p=0.026 n=6)
Encoder_EncodeAllSimple/better-64      59.84Mi ±  3%   59.99Mi ±  3%        ~ (p=0.937 n=6)
Encoder_EncodeAllSimple/best-64        14.20Mi ±  6%   14.17Mi ±  1%        ~ (p=0.937 n=6)
Encoder_EncodeAllSimple4K/fastest-64   232.0Mi ±  3%   229.9Mi ±  7%        ~ (p=0.818 n=6)
Encoder_EncodeAllSimple4K/default-64   88.43Mi ±  3%   84.67Mi ±  1%   -4.25% (p=0.002 n=6)
Encoder_EncodeAllSimple4K/better-64    69.69Mi ±  3%   69.83Mi ±  3%        ~ (p=0.732 n=6)
Encoder_EncodeAllSimple4K/best-64      13.69Mi ± 20%   13.37Mi ±  3%        ~ (p=0.485 n=6)
Encoder_EncodeAllHTML-64               134.4Mi ±  1%   137.5Mi ±  4%   +2.32% (p=0.002 n=6)
Encoder_EncodeAllTwain-64              80.80Mi ±  1%   78.44Mi ±  3%   -2.93% (p=0.015 n=6)
Encoder_EncodeAllPi-64                 60.30Mi ±  3%   59.17Mi ±  3%        ~ (p=0.132 n=6)
Random4KEncodeAllFastest-64            2.631Gi ±  2%   2.644Gi ±  3%        ~ (p=0.485 n=6)
Random10MBEncodeAllFastest-64          1.716Gi ±  9%   1.903Gi ±  2%  +10.93% (p=0.002 n=6)
Random4KEncodeAllDefault-64            973.4Mi ±  4%   974.9Mi ±  3%        ~ (p=0.699 n=6)
RandomEncodeAllDefault-64              2.168Gi ± 20%   2.151Gi ±  2%        ~ (p=0.699 n=6)
Random10MBEncoderFastest-64            1.392Gi ±  6%   2.017Gi ±  3%  +44.93% (p=0.002 n=6)
RandomEncoderDefault-64                1.035Gi ±  7%   1.683Gi ± 28%  +62.69% (p=0.002 n=6)
geomean                                204.8Mi         214.6Mi         +4.79%

                                     │     old.txt     │                new.txt                │
                                     │      B/op       │     B/op       vs base                │
Encoder_EncodeAllXML-64                  0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/fastest-64       2.000 ±  0%       2.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/default-64       4.000 ± 25%       4.000 ± 25%       ~ (p=1.000 n=6)
Encoder_EncodeAllSimple/better-64        5.000 ±  0%       5.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/best-64          20.50 ±  2%       21.00 ±  5%       ~ (p=0.232 n=6)
Encoder_EncodeAllSimple4K/fastest-64     0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/default-64     0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/better-64      0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/best-64        1.000 ±  0%       1.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllHTML-64                 2.000 ±  0%       2.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllTwain-64                0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllPi-64                   12.00 ±  8%       13.00 ±  8%       ~ (p=0.242 n=6)
Random4KEncodeAllFastest-64              0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Random10MBEncodeAllFastest-64          40.25Ki ±  5%     36.76Ki ±  4%  -8.66% (p=0.002 n=6)
Random4KEncodeAllDefault-64              0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
RandomEncodeAllDefault-64                0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=6) ¹
Random10MBEncoderFastest-64            22.58Ki ±  1%     22.56Ki ±  0%       ~ (p=0.056 n=6)
RandomEncoderDefault-64                11.31Ki ±  1%     11.32Ki ±  0%       ~ (p=0.450 n=6)
geomean                                              ²                  +0.07%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │   old.txt    │              new.txt               │
                                     │  allocs/op   │ allocs/op   vs base                │
Encoder_EncodeAllXML-64                0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/fastest-64     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/default-64     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/better-64      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple/best-64        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/fastest-64   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/default-64   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/better-64    0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllSimple4K/best-64      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllHTML-64               0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllTwain-64              0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Encoder_EncodeAllPi-64                 0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Random4KEncodeAllFastest-64            0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Random10MBEncodeAllFastest-64          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Random4KEncodeAllDefault-64            0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
RandomEncodeAllDefault-64              0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Random10MBEncoderFastest-64            482.0 ± 0%     482.0 ± 0%       ~ (p=1.000 n=6) ¹
RandomEncoderDefault-64                242.0 ± 0%     242.0 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                           ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

From the comparison results it can be seen that when we compress large content (where matchLen can match long lengths), it is much faster than before.


zzzzwc commented Jun 7, 2023

I will also add code so this works correctly on amd64 CPUs that do not support AVX2.

@klauspost klauspost left a comment


I have experimented with dedicated assembly matching before, but never really found any convincing+consistent improvement.

These results are rather unexpected; random-input benchmarks really shouldn't be affected by this change at all:

Random10MBEncoderFastest-64            1.392Gi ±  6%   2.017Gi ±  3%  +44.93% (p=0.002 n=6)
RandomEncoderDefault-64                1.035Gi ±  7%   1.683Gi ± 28%  +62.69% (p=0.002 n=6)

I will benchmark and do a few tests.


Label("ret")
{
Store(ret, ReturnIndex(0))

VZEROUPPER missing.
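In avo terms, the fix is to issue VZEROUPPER before leaving the AVX2 path, so SSE code that runs afterwards does not pay the AVX-to-SSE transition penalty. A sketch of where it would go (not the PR's exact code):

```go
Label("ret")
VZEROUPPER() // clear upper YMM state before returning to SSE/scalar code
Store(ret, ReturnIndex(0))
RET()
```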

TZCNTQ(equalMaskBits, matchedLen)
Comment("if matched len > remaining len, just add remaining on ret")
CMPQ(remainLen, matchedLen)
CMOVQLT(remainLen, matchedLen)

This should be fairly predictable, so I suspect a branch will be faster.

If you keep it, check CMOV - not entirely sure this is always present on AMD64.
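In plain Go terms, the branch version of this clamp would look as follows (a sketch for illustration; the actual code is generated assembly):

```go
package main

import "fmt"

// clampToRemaining caps the matched length at the remaining input length.
// Written as a branch, as suggested: the match rarely runs past the
// remaining input, so the branch should predict well, whereas CMOV always
// pays a data dependency on both inputs.
func clampToRemaining(matched, remain int) int {
	if matched > remain {
		return remain
	}
	return matched
}

func main() {
	fmt.Println(clampToRemaining(40, 32)) // capped to remaining
	fmt.Println(clampToRemaining(7, 32))  // unchanged
}
```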

matchedLen := GP64()
NOTQ(equalMaskBits)
Comment("store first not equal position into matchedLen")
TZCNTQ(equalMaskBits, matchedLen)

Check for BMI
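TZCNT is a BMI1 instruction, so the fast path needs a feature gate. One way to express such a check (using the golang.org/x/sys/cpu package purely as an illustration; the repository has its own detection mechanism):

```go
import "golang.org/x/sys/cpu"

// useAVX2MatchLen reports whether the vectorized path is safe to run:
// the 256-bit loads need AVX2, and TZCNT needs BMI1.
func useAVX2MatchLen() bool {
	return cpu.X86.HasAVX2 && cpu.X86.HasBMI1
}
```

Without the gate, TZCNT silently decodes as BSF on pre-BMI1 CPUs, which happens to give the same answer for nonzero inputs but has different zero-input semantics.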

adata := YMM()
bdata := YMM()
equalMaskBytes := YMM()
VMOVDQU(Mem{Base: aptr}, adata)

As far as I can tell you are over-reading: you cannot read past the end of the provided slices, so this is not feasible as-is.
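A safe structure avoids the over-read by comparing full 32-byte blocks only while a whole block remains, then finishing with a byte loop. Sketched in plain Go (the inner byte loop stands in for the VMOVDQU/VPCMPEQB block compare):

```go
package main

import "fmt"

// matchLenChunks sketches how a wide-register loop must stop before the
// slice ends: compare 32-byte blocks only while a full block remains in
// bounds, then fall back to a byte-wise loop for the tail.
func matchLenChunks(a, b []byte) (n int) {
	if len(b) < len(a) {
		a = a[:len(b)] // only compare within both slices' bounds
	}
	for len(a)-n >= 32 {
		blockEqual := true
		for i := n; i < n+32; i++ {
			if a[i] != b[i] {
				blockEqual = false
				break
			}
		}
		if !blockEqual {
			break
		}
		n += 32
	}
	// Tail: fewer than 32 bytes remain, or the mismatch is in this block.
	for n < len(a) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	a := make([]byte, 100)
	b := make([]byte, 100)
	for i := range a {
		a[i], b[i] = byte(i), byte(i)
	}
	b[70] = 0xFF
	fmt.Println(matchLenChunks(a, b)) // first difference is at byte 70
}
```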


klauspost commented Jun 7, 2023

Current code is a big speed regression. Testing on CSV input:

Before/after (3 runs):

file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	8373	378.75
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	8342	380.19
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	8346	379.98
file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	21627	146.64
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	21093	150.35
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	21799	145.49

Level 2 (default)

file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	10543	300.82
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	10621	298.60
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	10634	298.22
file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	23413	135.46
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	23136	137.08
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	23133	137.10

add Pragma("noescape")

Co-authored-by: Klaus Post <klauspost@gmail.com>
@klauspost

Adding VZEROUPPER fixes that:

file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	9472	334.82
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	9524	332.98
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	9475	334.70

So if you can fix the issues, this could be feasible.

@klauspost

Tested on a few more bodies. Seems to be 5-10% faster.


zzzzwc commented Jun 7, 2023

[Quotes the "Current code is a big speed regression" benchmark comment above.]

Is there code in this repo for the above benchmark tests? I could not find it.

@klauspost

No, I have a separate tool that does that. Adding it would add a whole bunch of dependencies, which I am not interested in (and it is a bit messy).

If it is any use, here it is: https://gist.github.com/klauspost/de4b4d42fe3248fa99fd3f661badffe2

@klauspost

I copied over the matchlen assembly from S2, and it is faster than the AVX2-based version, while using nothing beyond baseline amd64:

file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	8171	388.12
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	8104	391.33
nyc-taxi-data-10M.csv	zskp	1	3325605752	641319332	8109	391.08
file	out	level	insize	outsize	millis	mb/s
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	10077	314.73
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	10259	309.13
nyc-taxi-data-10M.csv	zskp	2	3325605752	588976126	10052	315.50

klauspost added a commit that referenced this pull request Jun 9, 2023
Copied from the S2 implementation. 5-10% faster.

Replaces #823

zzzzwc commented Jun 12, 2023

The cost of matchLen looks like:

[chart: x is the matched length, y is the time cost; the green line is the scalar way, the blue line is the vectorized way]

I'm trying to reduce the fixed costs, 20 and 2.
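Under linear fits like these, the crossover point can be computed directly. A sketch (the fixed costs 20 and 2 come from the fits above; the per-byte slopes, one step per 32 versus per 8 bytes, are illustrative assumptions, not measured values):

```go
package main

import "fmt"

// crossover solves vecFixed + vecSlope*x = scalarFixed + scalarSlope*x for x:
// the match length at which the vectorized cost model becomes cheaper than
// the scalar one.
func crossover(vecFixed, vecSlope, scalarFixed, scalarSlope float64) float64 {
	return (vecFixed - scalarFixed) / (scalarSlope - vecSlope)
}

func main() {
	// Fixed costs 20 and 2 from the fits above; slopes are assumptions.
	fmt.Println(crossover(20, 1.0/32, 2, 1.0/8)) // 192
}
```

With these assumed slopes the break-even length is 192 bytes; the actual measured graph puts it higher, but either way well above typical match lengths.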

@klauspost

@zzzzwc I am not sure how you got the equations, but they show the problem fine. You show a crossover at 320 bytes. That is incredibly rare. Most matches are much less than 100 bytes.

VPMOVMSKB is a vector -> GPR register transfer. This incurs a 4-7 cycle latency depending on what platform you measure. Also having to add NOT is an additional 1 cycle. But I presume that is what your "20" accounts for.

This means the penalty for a 4-8 (and possibly up to 16) byte match is much higher. Also, when loading 32 bytes you are much more likely to cross a 64-byte cache line, which incurs a minor load penalty.

I would expect branch prediction to be better, since it is very likely that only one loop will be entered. However, that is not enough to give an overall improvement.

I think the crossover is lower than what your graph indicates. But you are optimizing the wrong thing. We don't want long matches to be faster. We want short matches to be faster, since they are the main bulk, and fast matches are not a problem for overall performance.

I tried adding it to the S2 assembly in #825 with an appropriate fallback, and again, it is not faster than the BSF code, so I left it disabled. I left the code in, so you can see how it is integrated.

It is important that you test on real data.

@klauspost klauspost closed this Jun 13, 2023