
fft gpu optimization #79

Open · wants to merge 14 commits into base: main

Conversation

@bchyl commented Jun 20, 2022

We reuse Bellperson and change its BLS12-381 to BN254. Below is FFT benchmark data from a machine with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz (80 cores), 35 GB of memory, and 4 T4 GPUs:

Testing FFT for 524288 elements...
GPU took 35ms.
CPU (64 cores) took 149ms.
Speedup: x4.257143
============================
Testing FFT for 1048576 elements...
GPU took 67ms.
CPU (64 cores) took 263ms.
Speedup: x3.925373
============================
Testing FFT3 for 1048576 elements...
GPU took 61ms.
CPU (64 cores) took 752ms.
Speedup: x12.327868
============================
Testing FFT for 2097152 elements...
GPU took 102ms.
CPU (64 cores) took 428ms.
Speedup: x4.1960783
============================
test fft::tests::fft ... ok
Testing FFT3 for 2097152 elements...
GPU took 98ms.
CPU (64 cores) took 1275ms.
Speedup: x13.010204
============================
Testing FFT3 for 4194304 elements...
GPU took 182ms.
CPU (64 cores) took 2540ms.
Speedup: x13.956044
============================
Testing FFT3 for 8388608 elements...
GPU took 332ms.
CPU (64 cores) took 5209ms.
Speedup: x15.689759
============================
Testing FFT3 for 16777216 elements...
GPU took 764ms.
CPU (64 cores) took 11522ms.
Speedup: x15.081152
============================

When the degree is above 2^19, the data show that the GPU has an increasing performance advantage.

Could someone help check whether this optimization works? Thank you very much.
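The crossover in these numbers suggests dispatching by size. A minimal sketch of what such a dispatch could look like; `GPU_FFT_MIN_LOG2`, `dispatch_fft`, and the backend closures are illustrative names, not this PR's actual API:

```rust
/// Sketch: route an FFT to the GPU only above a size threshold, since the
/// benchmarks above show the CPU winning below roughly 2^19 elements.
/// `gpu_fft`/`cpu_fft` are placeholders for the real backends.
const GPU_FFT_MIN_LOG2: u32 = 19; // hypothetical crossover, per the data above

fn dispatch_fft<F>(
    coeffs: &mut [F],
    log_n: u32,
    gpu_fft: impl FnOnce(&mut [F]),
    cpu_fft: impl FnOnce(&mut [F]),
) {
    // Offload only when the input is large enough to amortize GPU setup
    // and host<->device transfer costs.
    if cfg!(feature = "gpu") && log_n >= GPU_FFT_MIN_LOG2 {
        gpu_fft(coeffs);
    } else {
        cpu_fft(coeffs);
    }
}
```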

@CPerezz requested review from CPerezz and ed255 on Jun 20, 2022
Review thread: halo2_proofs/Cargo.toml (outdated)
@CPerezz (Member) commented Jun 20, 2022

Sorry @ed255, I assigned us both, but the PR is WIP although not marked as draft.

I will mark it as draft; let's wait until it's ready for review to check it.
Apologies for the noise.

@CPerezz marked this pull request as draft on Jun 20, 2022
@@ -62,11 +78,12 @@ rand_core = { version = "0.6", default-features = false, features = ["getrandom"
 getrandom = { version = "0.2", features = ["js"] }

 [features]
-default = ["shplonk"]
+default = ["shplonk", "gpu"]
@CPerezz (Member):

I'm not sure about adding this as default now.

We will need some time to update the CI infra to support GPU usage.
cc: @AronisAt79 @ntampakas @barryWhiteHat
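One common way to keep `gpu` out of the default feature set while the CI catches up is to compile a CPU fallback when the feature is off. A sketch of the pattern, not this PR's code; `fft_backend` is a hypothetical name:

```rust
// Gate the GPU path behind the feature flag and fall back to the CPU
// implementation, so machines without GPUs (e.g. CI runners) still build
// and pass tests with the default feature set.
#[cfg(feature = "gpu")]
fn fft_backend(coeffs: &mut [u64]) {
    // ... invoke the GPU kernels here ...
}

#[cfg(not(feature = "gpu"))]
fn fft_backend(coeffs: &mut [u64]) {
    // ... existing CPU FFT; the only path exercised by GPU-less CI ...
}
```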

@bchyl (Author):
yeah, thanks very much.

@bchyl (Author) commented Jul 4, 2022:

> I'm not sure about adding this as default now.
>
> We will need some time to update the CI infra to support GPU usage. cc: @AronisAt79 @ntampakas @barryWhiteHat

Hi @CPerezz, may I ask how long it will take for the CI infra to be ready?

In addition to updating the FFT data, last week we ported a multiexp patch, compatible with the pairing library, based on Filecoin's ec-gpu. The performance data are as follows:

running 1 test
Testing Multiexp for 1024 elements...
GPU took 10ms.
CPU took 6ms.
Speedup: x0.6
============================
Testing Multiexp for 2048 elements...
GPU took 8ms.
CPU took 5ms.
Speedup: x0.625
============================
Testing Multiexp for 4096 elements...
GPU took 8ms.
CPU took 6ms.
Speedup: x0.75
============================
Testing Multiexp for 8192 elements...
GPU took 10ms.
CPU took 12ms.
Speedup: x1.2
============================
Testing Multiexp for 16384 elements...
GPU took 12ms.
CPU took 22ms.
Speedup: x1.8333334
============================
Testing Multiexp for 32768 elements...
GPU took 16ms.
CPU took 40ms.
Speedup: x2.5
============================
Testing Multiexp for 65536 elements...
GPU took 25ms.
CPU took 83ms.
Speedup: x3.32
============================
Testing Multiexp for 131072 elements...
GPU took 34ms.
CPU took 169ms.
Speedup: x4.970588
============================
Testing Multiexp for 262144 elements...
GPU took 59ms.
CPU took 287ms.
Speedup: x4.8644066
============================
Testing Multiexp for 524288 elements...
GPU took 91ms.
CPU took 469ms.
Speedup: x5.1538463
============================
Testing Multiexp for 1048576 elements...
GPU took 152ms.
CPU took 864ms.
Speedup: x5.6842103
============================
Testing Multiexp for 2097152 elements...
GPU took 246ms.
CPU took 1707ms.
Speedup: x6.9390244
============================
Testing Multiexp for 4194304 elements...
GPU took 373ms.
CPU took 3431ms.
Speedup: x9.198391
============================
Testing Multiexp for 8388608 elements...
GPU took 574ms.
CPU took 6788ms.
Speedup: x11.825784
============================
Testing Multiexp for 16777216 elements...
GPU took 928ms.
CPU took 14723ms.
Speedup: x15.865302
============================
Testing Multiexp for 33554432 elements...
GPU took 1541ms.
test multiexp::tests::gpu_multiexp_consistency has been running for over 60 seconds
CPU took 32038ms.
Speedup: x20.790396
============================
Testing Multiexp for 67108864 elements...
GPU took 2997ms.
CPU took 65527ms.
Speedup: x21.864197
============================
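A minimal sketch of the kind of consistency-plus-timing harness that produces logs like the above; `bench_pair` is a hypothetical helper and the closures stand in for the real GPU/CPU multiexp kernels:

```rust
use std::time::Instant;

/// Times a GPU and a CPU implementation of the same operation on `n`
/// elements, prints the speedup, and asserts the results agree.
fn bench_pair<T: PartialEq + std::fmt::Debug>(
    n: usize,
    gpu: impl FnOnce() -> T,
    cpu: impl FnOnce() -> T,
) {
    println!("Testing Multiexp for {} elements...", n);

    let t = Instant::now();
    let gpu_out = gpu();
    let gpu_ms = t.elapsed().as_millis().max(1); // avoid divide-by-zero on tiny inputs
    println!("GPU took {}ms.", gpu_ms);

    let t = Instant::now();
    let cpu_out = cpu();
    let cpu_ms = t.elapsed().as_millis();
    println!("CPU took {}ms.", cpu_ms);

    // Both backends must agree exactly, or the speedup is meaningless.
    assert_eq!(gpu_out, cpu_out);
    println!("Speedup: x{}", cpu_ms as f32 / gpu_ms as f32);
    println!("============================");
}
```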

@Brechtpd:
Hi @bchyl, some of these GPU times for FFT/MSM look great! For the MSM, however, it seems like you'll have to test larger numbers of elements, because the current sizes are very small and you don't yet see the expected ~linear increase in time (for small multiexps other overheads dominate, which makes sense, but that doesn't really matter).

I do have a question about the CPU times for MSM and FFTs. They seem to be extremely slow for some reason, much slower than even when running them on a normal desktop CPU, and you're running them on a very powerful machine, so that doesn't make any sense. On a standard machine (8 CPU cores) you can do an FFT of 2^20 in roughly 0.25s and an MSM of 2^20 in less than 10 seconds (the exact numbers of course depend on the specific implementation). The numbers make it look like you're running on only a single core or something. Am I misinterpreting the data, or do you think there may be something up with the CPU performance numbers?
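One quick way to see the overhead regime in the posted numbers: GPU time per element keeps falling as n grows, so the measurements haven't settled into the ~linear regime yet. A small arithmetic check over GPU timings copied from the multiexp log above:

```rust
fn main() {
    // (log2 n, GPU ms) pairs copied from the multiexp log above.
    let samples = [(10u32, 10.0f64), (14, 12.0), (18, 59.0), (22, 373.0), (26, 2997.0)];
    for (log_n, ms) in samples {
        let per_elem_ns = ms * 1e6 / (1u64 << log_n) as f64;
        println!("2^{log_n}: {per_elem_ns:.1} ns/element");
    }
    // Falls from ~9766 ns/elem at 2^10 to ~45 ns/elem at 2^26: fixed overheads
    // dominate the small sizes (with Pippenger's log-factor savings continuing
    // at the large end).
}
```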

@bchyl (Author) commented Jul 5, 2022:

Hi @Brechtpd, thank you very much for your reference 8-core performance data.
We have double-checked our CPU data (the test machine has 80 cores) and it has been updated above.

In general, the MSM results are consistent with yours: 2^20 in less than 1s (864ms). But for FFT our result of 0.263s is about the same as your 8-core reference of 0.25s; it looks like CPU acceleration is not obvious beyond a certain number of cores.

@Brechtpd:

Great, thanks for the updated numbers! I think the limited scalability beyond a certain number of CPU cores is at least partially caused by the multi-threading approach followed by the current CPU FFT/MSM implementations. Not really sure how big the impact is of that with a CPU with that many cores though.
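As a rough illustration of why scaling can flatten like that, here is a generic Amdahl's-law model, not a measurement of this FFT; the 90% parallel fraction is an illustrative assumption:

```rust
// Amdahl's law: speedup(c) = 1 / ((1 - p) + p / c) for parallel fraction p.
fn amdahl_speedup(parallel_fraction: f64, cores: f64) -> f64 {
    1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
}

fn main() {
    // Assume 90% of the FFT parallelizes cleanly (an illustrative guess);
    // the rest is serial or memory-bandwidth-bound.
    for cores in [8.0, 16.0, 32.0, 64.0, 80.0] {
        println!("{:>2} cores -> bound x{:.1}", cores, amdahl_speedup(0.90, cores));
    }
    // Prints roughly x4.7, x6.4, x7.8, x8.8, x9.0: flat beyond a few dozen
    // cores, consistent with the 80-core FFT time matching an 8-core estimate.
}
```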

Review threads: halo2_proofs/src/lib.rs (outdated), halo2_proofs/src/lib.rs, halo2_proofs/src/gpu/error.rs (outdated), halo2_proofs/src/gpu/fft.rs (outdated), halo2_proofs/Cargo.toml (outdated)
@bchyl closed this Jun 20, 2022
@CPerezz (Member) commented Jun 21, 2022

Why was this closed, @bchyl? I thought we should review it.

@bchyl (Author) commented Jun 21, 2022

> Why was this closed, @bchyl? I thought we should review it.

Hi @CPerezz, sorry for the noise.
Yesterday I replied to you: "we will temporarily close this PR and re-submit a new one after CI passes for the refactored code."

Today we are refactoring the code and doing many more tests. We expect to reopen the PR in about 2 days, hopefully in time for the new release.

Thank you very much.

@bchyl reopened this Jun 23, 2022
@@ -31,14 +31,29 @@ harness = false
[dependencies]
backtrace = { version = "0.3", optional = true }
rayon = "1.5.1"
ff = "0.11"
@bchyl (Author):
NOTE: this will be restored when Fr/Fq in the pairing repo implement the char method for the PrimeField trait in ff/group.

Merge path: zkcrypto/ff and group -> pse/pairing -> starslabhq/ff-cl-gen -> pse gpu fft module -> pse halo2 fft
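For context on why `char` matters in that chain: the GPU source generator (ff-cl-gen/ec-gpu) specializes the kernel's field arithmetic to the modulus limbs at build time, and `PrimeField::char` is how older `ff` exposed that modulus. A hedged sketch of the limb extraction such a generator needs; `limbs_from_char_le` is illustrative, not ff-cl-gen's actual code:

```rust
/// Split a little-endian byte encoding of the field characteristic (modulus)
/// into 64-bit limbs, as a GPU kernel generator would embed them.
fn limbs_from_char_le(char_le_bytes: &[u8]) -> Vec<u64> {
    char_le_bytes
        .chunks(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect()
}
```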

@bchyl marked this pull request as ready for review on Jun 23, 2022
@bchyl (Author) commented Jun 27, 2022

> We reuse Bellperson and change its BLS12-381 to BN254. Below is FFT benchmark data from a machine with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz (80 cores), 35 GB of memory, and 4 T4 GPUs.
>
> When the degree is above 2^19, the data show that the GPU has an increasing performance advantage.
>
> Could someone help check whether this optimization works? Thank you very much.

After taking the params coeffs as the measurement input, the latest results are as follows:

Start:   prev_fft degree 1
End:     prev_fft degree 1 .........................................................3.132ms
Start:   gpu_fft degree 1
use multiple GPUs
End:     gpu_fft degree 1 ..........................................................637.135ms
Start:   prev_fft degree 2
End:     prev_fft degree 2 .........................................................18.405µs
Start:   gpu_fft degree 2
use multiple GPUs
End:     gpu_fft degree 2 ..........................................................430.600ms
Start:   prev_fft degree 3
End:     prev_fft degree 3 .........................................................27.227µs
Start:   gpu_fft degree 3
use multiple GPUs
End:     gpu_fft degree 3 ..........................................................388.614ms
Start:   prev_fft degree 4
End:     prev_fft degree 4 .........................................................37.148µs
Start:   gpu_fft degree 4
use multiple GPUs
End:     gpu_fft degree 4 ..........................................................403.945ms
Start:   prev_fft degree 5
End:     prev_fft degree 5 .........................................................49.842µs
Start:   gpu_fft degree 5
use multiple GPUs
End:     gpu_fft degree 5 ..........................................................392.746ms
Start:   prev_fft degree 6
End:     prev_fft degree 6 .........................................................68.417µs
Start:   gpu_fft degree 6
use multiple GPUs
End:     gpu_fft degree 6 ..........................................................389.343ms
Start:   prev_fft degree 7
End:     prev_fft degree 7 .........................................................2.001ms
Start:   gpu_fft degree 7
use multiple GPUs
End:     gpu_fft degree 7 ..........................................................417.015ms
Start:   prev_fft degree 8
End:     prev_fft degree 8 .........................................................1.758ms
Start:   gpu_fft degree 8
use multiple GPUs
End:     gpu_fft degree 8 ..........................................................396.529ms
Start:   prev_fft degree 9
End:     prev_fft degree 9 .........................................................1.870ms
Start:   gpu_fft degree 9
use multiple GPUs
End:     gpu_fft degree 9 ..........................................................395.902ms
Start:   prev_fft degree 10
End:     prev_fft degree 10 ........................................................2.174ms
Start:   gpu_fft degree 10
use multiple GPUs
End:     gpu_fft degree 10 .........................................................394.633ms
Start:   prev_fft degree 11
End:     prev_fft degree 11 ........................................................2.383ms
Start:   gpu_fft degree 11
use multiple GPUs
End:     gpu_fft degree 11 .........................................................391.948ms
Start:   prev_fft degree 12
End:     prev_fft degree 12 ........................................................2.514ms
Start:   gpu_fft degree 12
use multiple GPUs
End:     gpu_fft degree 12 .........................................................390.495ms
Start:   prev_fft degree 13
End:     prev_fft degree 13 ........................................................3.546ms
Start:   gpu_fft degree 13
use multiple GPUs
End:     gpu_fft degree 13 .........................................................394.464ms
Start:   prev_fft degree 14
End:     prev_fft degree 14 ........................................................4.992ms
Start:   gpu_fft degree 14
use multiple GPUs
End:     gpu_fft degree 14 .........................................................397.971ms
Start:   prev_fft degree 15
End:     prev_fft degree 15 ........................................................13.433ms
Start:   gpu_fft degree 15
use multiple GPUs
End:     gpu_fft degree 15 .........................................................420.388ms
Start:   prev_fft degree 16
End:     prev_fft degree 16 ........................................................15.060ms
Start:   gpu_fft degree 16
use multiple GPUs
End:     gpu_fft degree 16 .........................................................402.647ms
Start:   prev_fft degree 17
End:     prev_fft degree 17 ........................................................27.624ms
Start:   gpu_fft degree 17
use multiple GPUs
End:     gpu_fft degree 17 .........................................................415.804ms
Start:   prev_fft degree 18
End:     prev_fft degree 18 ........................................................68.354ms
Start:   gpu_fft degree 18
use multiple GPUs
End:     gpu_fft degree 18 .........................................................503.651ms
Start:   prev_fft degree 19
End:     prev_fft degree 19 ........................................................136.238ms
Start:   gpu_fft degree 19
use multiple GPUs
End:     gpu_fft degree 19 .........................................................574.728ms
Start:   prev_fft degree 20
End:     prev_fft degree 20 ........................................................253.138ms
Start:   gpu_fft degree 20
use multiple GPUs
End:     gpu_fft degree 20 .........................................................632.553ms
Start:   prev_fft degree 21
End:     prev_fft degree 21 ........................................................553.865ms
Start:   gpu_fft degree 21
use multiple GPUs
End:     gpu_fft degree 21 .........................................................701.713ms
test poly::domain::test_best_fft_multiple_gpu has been running for over 60 seconds
Start:   prev_fft degree 22
End:     prev_fft degree 22 ........................................................1.111s
Start:   gpu_fft degree 22
use multiple GPUs
End:     gpu_fft degree 22 .........................................................806.533ms
Start:   prev_fft degree 23
End:     prev_fft degree 23 ........................................................2.401s
Start:   gpu_fft degree 23
use multiple GPUs
End:     gpu_fft degree 23 .........................................................1.004s
Start:   prev_fft degree 24
End:     prev_fft degree 24 ........................................................4.570s
Start:   gpu_fft degree 24
use multiple GPUs
End:     gpu_fft degree 24 .........................................................1.369s
Start:   prev_fft degree 25
End:     prev_fft degree 25 ........................................................9.581s
Start:   gpu_fft degree 25
use multiple GPUs
End:     gpu_fft degree 25 .........................................................2.152s
Start:   prev_fft degree 26
End:     prev_fft degree 26 ........................................................21.325s
Start:   gpu_fft degree 26
use multiple GPUs
End:     gpu_fft degree 26 .........................................................4.464s
Start:   prev_fft degree 27
End:     prev_fft degree 27 ........................................................41.408s
Start:   gpu_fft degree 27
use multiple GPUs
End:     gpu_fft degree 27 .........................................................8.381s

The roughly constant ~390-430ms gpu_fft time at small degrees suggests fixed setup/transfer overhead dominates there, with the GPU pulling ahead of prev_fft from about degree 22. Below are numbers for the much better-performing version based on Filecoin's ec-gpu:

test fft_cpu::tests::parallel_fft_consistency ... ok
Testing FFT for 2^ 1 = 2 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
============================
Testing FFT for 2^ 2 = 4 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
============================
Testing FFT for 2^ 3 = 8 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
============================
Testing FFT for 2^ 4 = 16 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
============================
Testing FFT for 2^ 5 = 32 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
============================
Testing FFT for 2^ 6 = 64 elements...
GPU took 0ms.
CPU (64 cores) took 0ms.
Speedup: xNaN
============================
Testing FFT for 2^ 7 = 128 elements...
GPU took 0ms.
CPU (64 cores) took 12ms.
Speedup: xinf
============================
Testing FFT for 2^ 8 = 256 elements...
GPU took 3ms.
CPU (64 cores) took 7ms.
Speedup: x2.3333333
============================
Testing FFT for 2^ 9 = 512 elements...
GPU took 1ms.
CPU (64 cores) took 1ms.
Speedup: x1
============================
Testing FFT for 2^ 10 = 1024 elements...
GPU took 1ms.
CPU (64 cores) took 1ms.
Speedup: x1
============================
Testing FFT for 2^ 11 = 2048 elements...
GPU took 0ms.
CPU (64 cores) took 1ms.
Speedup: xinf
============================
Testing FFT for 2^ 12 = 4096 elements...
GPU took 0ms.
CPU (64 cores) took 1ms.
Speedup: xinf
============================
Testing FFT for 2^ 13 = 8192 elements...
GPU took 1ms.
CPU (64 cores) took 10ms.
Speedup: x10
============================
Testing FFT for 2^ 14 = 16384 elements...
GPU took 4ms.
CPU (64 cores) took 16ms.
Speedup: x4
============================
Testing FFT for 2^ 15 = 32768 elements...
GPU took 2ms.
CPU (64 cores) took 38ms.
Speedup: x19
============================
Testing FFT for 2^ 16 = 65536 elements...
GPU took 4ms.
CPU (64 cores) took 42ms.
Speedup: x10.5
============================
Testing FFT for 2^ 17 = 131072 elements...
GPU took 17ms.
CPU (64 cores) took 36ms.
Speedup: x2.1176472
============================
Testing FFT for 2^ 18 = 262144 elements...
GPU took 27ms.
CPU (64 cores) took 67ms.
Speedup: x2.4814816
============================
Testing FFT for 2^ 19 = 524288 elements...
GPU took 40ms.
CPU (64 cores) took 132ms.
Speedup: x3.3
============================
Testing FFT3 for 1048576 elements...
GPU took 80ms.
CPU (64 cores) took 742ms.
Speedup: x9.275
============================
Testing FFT for 2^ 20 = 1048576 elements...
GPU took 59ms.
CPU (64 cores) took 282ms.
Speedup: x4.779661
============================
Testing FFT for 2^ 21 = 2097152 elements...
GPU took 110ms.
CPU (64 cores) took 618ms.
Speedup: x5.6181817
============================
Testing FFT3 for 2097152 elements...
GPU took 120ms.
CPU (64 cores) took 2640ms.
Speedup: x22
============================
Testing FFT for 2^ 22 = 4194304 elements...
GPU took 130ms.
CPU (64 cores) took 1862ms.
Speedup: x14.323077
============================
Testing FFT for 2^ 23 = 8388608 elements...
GPU took 328ms.
CPU (64 cores) took 2753ms.
Speedup: x8.393292
============================
Testing FFT3 for 4194304 elements...
GPU took 203ms.
CPU (64 cores) took 4504ms.
Speedup: x22.187193
============================
Testing FFT for 2^ 24 = 16777216 elements...
GPU took 1074ms.
CPU (64 cores) took 5073ms.
Speedup: x4.7234635
============================
Testing FFT3 for 8388608 elements...
GPU took 414ms.
CPU (64 cores) took 9138ms.
Speedup: x22.072464
============================
Testing FFT for 2^ 25 = 33554432 elements...
GPU took 3301ms.
CPU (64 cores) took 11256ms.
Speedup: x3.4098759
============================
Testing FFT3 for 16777216 elements...
GPU took 1039ms.
CPU (64 cores) took 14948ms.
Speedup: x14.38691
============================
test fft::tests::fft_many has been running for over 60 seconds
test fft::tests::fft_one has been running for over 60 seconds
Testing FFT for 2^ 26 = 67108864 elements...
GPU took 4089ms.
CPU (64 cores) took 18344ms.
Speedup: x4.486182
============================
Testing FFT3 for 33554432 elements...
GPU took 2808ms.
CPU (64 cores) took 27082ms.
Speedup: x9.644587
============================
Testing FFT for 2^ 27 = 134217728 elements...
GPU took 5468ms.
CPU (64 cores) took 39812ms.
Speedup: x7.280907

@bchyl changed the title from 【WIP】fft gpu optimization to fft gpu optimization on Jun 27, 2022
@bchyl (Author) commented Jul 6, 2022

Hi @Brechtpd, we have just enabled the asm version of the basic field operations in pairing_bn256.

According to the following performance results, there is about a 20%-30% performance improvement for both CPU and GPU.

FFT:

Testing FFT for 2^20=1048576 elements...
GPU took 55ms.
CPU (64 cores) took 214ms.
Speedup: x3.8909092
============================
Testing FFT for 2^21=2097152 elements...
GPU took 95ms.
CPU (64 cores) took 385ms.
Speedup: x4.0526314
============================
Testing FFT for 2^22=4194304 elements...
GPU took 172ms.
CPU (64 cores) took 659ms.
Speedup: x3.8313954
============================
Testing FFT for 2^23=8388608 elements...
GPU took 317ms.
CPU (64 cores) took 1632ms.
Speedup: x5.148265
============================
Testing FFT for 2^24=16777216 elements...
GPU took 515ms.
CPU (64 cores) took 3187ms.
Speedup: x6.1883497
============================
Testing FFT for 2^25=33554432 elements...
GPU took 2280ms.
CPU (64 cores) took 7474ms.
Speedup: x3.2780702
============================
Testing FFT for 2^26=67108864 elements...
GPU took 3999ms.
CPU (64 cores) took 15191ms.
Speedup: x3.7986996

MSM:

Testing Multiexp for 1048576 elements...
GPU took 221ms.
CPU took 907ms.
Speedup: x4.1040726
============================
Testing Multiexp for 2097152 elements...
GPU took 346ms.
test multiexp::tests::gpu_multiexp_consistency has been running for over 60 seconds
CPU took 1543ms.
Speedup: x4.4595375
============================
Testing Multiexp for 4194304 elements...
GPU took 467ms.
CPU took 3999ms.
Speedup: x8.5631695
============================
Testing Multiexp for 8388608 elements...
GPU took 1107ms.
CPU took 6508ms.
Speedup: x5.878952
============================
Testing Multiexp for 16777216 elements...
GPU took 1121ms.
CPU took 13754ms.
Speedup: x12.2694025
============================
Testing Multiexp for 33554432 elements...
GPU took 2115ms.
CPU took 28523ms.
Speedup: x13.486052
============================
Testing Multiexp for 67108864 elements...
GPU took 4417ms.
CPU took 59907ms.
Speedup: x13.562825
============================
Testing Multiexp for 134217728 elements...
GPU took 8038ms.
CPU took 122464ms.
Speedup: x15.235631
============================

@CPerezz (Member) commented Jan 17, 2023

@ed255 @han0110 @kilic, is this still relevant and worth taking up again? Or should we close it for now?

@han0110 commented Jan 17, 2023

I think at some point we will need to start investigating GPU acceleration, and this thread already contains some useful measurements; perhaps we can keep it open until we really have a replacement?
