Improve the performance #3

Open
Meakk opened this issue Jan 26, 2024 · 24 comments

Comments
@Meakk

Meakk commented Jan 26, 2024

Background:

I have a 3D Gaussian splatting viewer in OpenGL which requires sorting a lot of points by view depth.
I'm running the depth computation in a compute shader, then sorting the points using a custom bitonic sort.
Since the bitonic sort is the bottleneck (~30 fps right now), I'm looking for a good radix sort implementation.

Issue:

So far, I've managed to call your radix sort and it works.
However, it's quite slow, and after debugging it seems like many compute dispatches are called with a surprisingly low number of workgroups: most of the time it's 128, sometimes 23953 (for information, the total number of points I'm sorting is 6131954).

Is there any parameter I can tweak to improve the performance or reduce the number of shader invocations?

@loryruta
Owner

Hi! First of all, thank you for reaching out and trying my project 😄

I'm already aware I'm using a low number of threads per workgroup: every workgroup has 64 threads, which is unreasonably low considering modern GPUs can support up to 1024.

Increasing the number of threads to 1024 would decrease the number of workgroups (and improve performance?).

This doesn't really match your observation though (i.e. the number of workgroups should be higher now, and then become lower once I increase the workgroup size):

it seems like many compute dispatches are called with a surprisingly low number of workgroups

About the workgroup count: the algorithm runs a Scan operation, so the workgroup count going from low to high (and back from high to low) across dispatches makes sense. Also, 23953 as the maximum number of workgroups is fine and should correspond to the dispatch where every item is processed:

THREADS_NUM = 64
ITEMS_PER_THREAD = 4
num_workgroups = ceil(6131954 / (64 * 4)) = 23953

So, back to your question:

Is there any parameter I can tweak to improve the performance or reduce the number of shader invocation?

Increase THREADS_NUM (in the shaders and in radix_sort.hpp). You could also try increasing ITEMS_PER_THREAD.
However, I can't guarantee it'll work without breaking anything else...

I'm going to pick this project back up in the coming days!
Thank you again for dusting it off 😅

@Meakk
Author

Meakk commented Jan 27, 2024

Thank you for your answer. I'll try that next week and let you know!

@loryruta
Owner

I have reworked the code on the dev branch:

https://github.com/loryruta/gpu-radix-sort/tree/dev

Hopefully the performance has improved as well; the next step is benchmarking.
Also, I have exposed two "hidden" primitives I was using in RadixSort: Reduce and BlellochScan (exclusive scan).

@Meakk
Author

Meakk commented Jan 30, 2024

The performance is better on the dev branch, but still too slow for my usage.
If I understood correctly, the time complexity of the radix sort depends on the number of radices and maybe on the distribution in the different buckets?
My keys are floats, which are probably not well distributed across the full 32-bit range (I'm also adding a constant offset to make sure all values are positive).
Maybe it would be useful to remap all my values to a single byte (from 0 to 255)?

@loryruta
Owner

but still too slow for my usage

You're still sorting 6131954 elements, right? What's your maximum acceptable execution time? In my setup (NVIDIA GeForce RTX 2060 SUPER), 6131954 elements are sorted in ~6ms.

the time complexity of the radix sort depends on the number of radices and maybe on the distribution in the different buckets

Yes. I'm still not sure how much it impacts the overall performance, but the distribution of the radices does have an impact: during the "counting step" I count the per-block and global radix occurrences using atomic operations. If the radices are all 0 (worst case), all atomic operations are concentrated on one memory location. However, I have already optimized this by first counting in shared memory and then aggregating the results into global memory.

So I'm not sure that's the main bottleneck.
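To illustrate the idea, here is a rough CPU-side analogy (using std::atomic; this is not the actual shader code): each block counts into a private histogram first and then merges it into the global one with only a handful of atomic adds.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int RADIX_BITS = 4;                // assumption: 4-bit radices
constexpr int RADIX_COUNT = 1 << RADIX_BITS; // 16 buckets

// One "block" counts its radices locally (no contention), then merges them
// into the global histogram with only RADIX_COUNT atomic adds.
void count_block(const std::vector<uint32_t>& keys, size_t begin, size_t end, int pass,
                 std::array<std::atomic<uint32_t>, RADIX_COUNT>& global_hist)
{
    std::array<uint32_t, RADIX_COUNT> local_hist{}; // plays the role of shared memory
    for (size_t i = begin; i < end; ++i)
        ++local_hist[(keys[i] >> (pass * RADIX_BITS)) & (RADIX_COUNT - 1)];
    for (int d = 0; d < RADIX_COUNT; ++d)
        global_hist[d].fetch_add(local_hist[d], std::memory_order_relaxed);
}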

Maybe it would be useful to remap all my values to a single byte (from 0 to 255)?

At the moment the radix sort algorithm isn't data-dependent, so the complexity doesn't depend on your data (except for the atomic operations I mentioned above). This means that whether your data is in [0, 0xF] or [0, 0xFFFFFFFF], it will always execute the same steps and visit all radices.

This can surely be optimized; but one question: if your data is in [0, 255], why not use a BucketSort?

I'm having float keys

Note, just to be sure: you can't run radix sort with float keys. You have to convert them to GLuint first!
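For reference, a common order-preserving float-to-GLuint conversion looks roughly like this (just a sketch, not something this library provides); for strictly positive floats the raw bit pattern already sorts correctly on its own:

#include <cstdint>
#include <cstring>

// Map a float key to a uint whose unsigned order matches the float order.
// Strictly positive floats only need the sign bit set; the ~bits branch
// also handles negative keys.
uint32_t float_key_to_uint(float key)
{
    uint32_t bits;
    std::memcpy(&bits, &key, sizeof(bits)); // bit-cast, no value conversion
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}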

@Meakk
Author

Meakk commented Jan 30, 2024

Yes, I'm still sorting 6131954 elements. It takes ~135ms on my RTX 3070.
But as I said in my previous message, the keys are positive floats, so radix sort works, but since it relies on the binary representation I suspect it's not well distributed.
I mentioned the [0, 255] range, but maybe that's a bit too aggressive and 16 bits might be better. I just suggested remapping the floats by doing something like key_int = floor( scale * (key_float - range_min) / (range_max - range_min) ), where scale is the size of the target range (e.g. 255 or 65535).
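For illustration, a CPU-side sketch of that remapping into 16 bits could look like this (range_min/range_max would be the per-frame depth bounds; in practice I'd do it in my depth compute shader):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compress float keys into [0, 65535] so the radix sort only needs to
// process 16 bits instead of 32. range_min/range_max are the current
// per-frame depth bounds.
std::vector<uint32_t> remap_keys(const std::vector<float>& keys, float range_min, float range_max)
{
    const float scale = 65535.0f / (range_max - range_min);
    std::vector<uint32_t> out(keys.size());
    for (size_t i = 0; i < keys.size(); ++i)
        out[i] = static_cast<uint32_t>(std::clamp((keys[i] - range_min) * scale, 0.0f, 65535.0f));
    return out;
}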

@Meakk
Author

Meakk commented Jan 30, 2024

Sorry, that timing was with the master branch. Using the dev branch gives ~80ms.

@loryruta
Owner

the keys are positive floats, so radix sort works, but since it relies on the binary representation I suspect it's not well distributed.

Oh okay, I wasn't aware the binary representation of positive floats was compatible with the radix sort algorithm; I've always sorted GLuint. Data distribution matters, and the only data-dependent point in the algorithm is the atomic operations mentioned above.

You can absolutely try to re-map your floats so that the radices are more evenly distributed.

The best case (i.e. the least memory contention from atomic operations) would be:
for every group of 4 bits, from LSB to MSB, roughly the same number of occurrences of each digit 0 through 15.

However, I'm not sure how much it would impact the performance.
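If you want to check how your keys look, a quick (hypothetical) CPU-side helper that prints the histogram of each 4-bit digit would be something like:

#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

// Print the histogram of every 4-bit digit (8 passes for 32-bit keys).
// A roughly flat histogram in every pass is the best case described above.
void print_digit_distribution(const std::vector<uint32_t>& keys)
{
    for (int pass = 0; pass < 8; ++pass) {
        std::array<size_t, 16> hist{};
        for (uint32_t k : keys)
            ++hist[(k >> (pass * 4)) & 0xF];
        std::printf("pass %d:", pass);
        for (size_t count : hist)
            std::printf(" %zu", count);
        std::printf("\n");
    }
}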


Another thing that could be done is to re-introduce iterating over multiple items per thread. I was doing that in the master branch but removed it in dev since I pretty much rewrote the algorithm.

@loryruta
Owner

loryruta commented Jan 30, 2024

Using the dev branch gives ~80ms

Wait, how is that possible? On my machine it takes only ~6ms, and we have comparable GPUs.

If you have time, please clone the repository, check out dev, and then:

mkdir build
cd build
cmake ..
cmake --build .
cd test
./glu_test [benchmark]

and send the output.

Alternatively, this is what I'm measuring (code):

RadixSort radix_sort; // Outside!
measure {
        radix_sort(key_buffer, val_buffer, 6131954); // ~6ms
}  

Note that the construction of the RadixSort class isn't included in the measurement; you should construct it once in your code!

@Meakk
Author

Meakk commented Jan 30, 2024

$ ./glu_test 
---------------------------------------------------------------- Device info
Device: NVIDIA GeForce RTX 3070/PCIe/SSE2
Vendor: NVIDIA Corporation
Version: 4.6.0 NVIDIA 545.29.06
GLSL version: 4.60 NVIDIA
GL_MAX_COMPUTE_WORK_GROUP_COUNT: (2147483647, 65535, 65535)
GL_MAX_COMPUTE_WORK_GROUP_SIZE: (1024, 1024, 64)
GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS: 1024
GL_MAX_COMPUTE_SHARED_MEMORY_SIZE: 49152
GL_MAX_COMPUTE_SHADER_STORAGE_BLOCKS: 16
GL_MAX_COMBINED_SHADER_STORAGE_BLOCKS: 96
GL_WARP_SIZE_NV: 32
----------------------------------------------------------------
Randomness seeded to: 1212726197
Num elements: 128; Seed: 1
Num elements: 256; Seed: 1
Num elements: 512; Seed: 1
Num elements: 1024; Seed: 1
Num elements: 2048; Seed: 1
Num elements: 10993; Seed: 1
Num elements: 14978; Seed: 1
Num elements: 16243; Seed: 1
Num elements: 18985; Seed: 1
Num elements: 23857; Seed: 1
Num elements: 27865; Seed: 1
Num elements: 33363; Seed: 1
Num elements: 41298; Seed: 1
Num elements: 45821; Seed: 1
Num elements: 47487; Seed: 1
===============================================================================
All tests passed (1233 assertions in 9 test cases)

I confirm I am measuring only the operator() call and excluding the shader compilation.

@loryruta
Owner

You missed the [benchmark] argument:

./glu_test [benchmark]

@Meakk
Author

Meakk commented Jan 30, 2024

Here it is:

$ ./glu_test [benchmark]
---------------------------------------------------------------- Device info
Device: NVIDIA GeForce RTX 3070/PCIe/SSE2
Vendor: NVIDIA Corporation
Version: 4.6.0 NVIDIA 545.29.06
GLSL version: 4.60 NVIDIA
GL_MAX_COMPUTE_WORK_GROUP_COUNT: (2147483647, 65535, 65535)
GL_MAX_COMPUTE_WORK_GROUP_SIZE: (1024, 1024, 64)
GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS: 1024
GL_MAX_COMPUTE_SHARED_MEMORY_SIZE: 49152
GL_MAX_COMPUTE_SHADER_STORAGE_BLOCKS: 16
GL_MAX_COMBINED_SHADER_STORAGE_BLOCKS: 96
GL_WARP_SIZE_NV: 32
----------------------------------------------------------------
Filters: [benchmark]
Randomness seeded to: 2379691280
Reduce; Num elements: 1024, Elapsed: 0.040 ms
Reduce; Num elements: 16384, Elapsed: 0.060 ms
Reduce; Num elements: 65536, Elapsed: 0.050 ms
Reduce; Num elements: 131072, Elapsed: 0.052 ms
Reduce; Num elements: 524288, Elapsed: 0.056 ms
Reduce; Num elements: 1048576, Elapsed: 0.085 ms
Reduce; Num elements: 16777216, Elapsed: 0.110 ms
Reduce; Num elements: 67108864, Elapsed: 0.106 ms
Reduce; Num elements: 134217728, Elapsed: 0.106 ms
Reduce; Num elements: 268435456, Elapsed: 0.114 ms
BlellochScan; Num elements: 1024, Elapsed: 0.038 ms
BlellochScan; Num elements: 16384, Elapsed: 1.400 ms
BlellochScan; Num elements: 65536, Elapsed: 0.085 ms
BlellochScan; Num elements: 131072, Elapsed: 0.098 ms
BlellochScan; Num elements: 524288, Elapsed: 0.099 ms
BlellochScan; Num elements: 1048576, Elapsed: 0.114 ms
BlellochScan; Num elements: 16777216, Elapsed: 0.158 ms
BlellochScan; Num elements: 67108864, Elapsed: 0.173 ms
BlellochScan; Num elements: 134217728, Elapsed: 0.150 ms
BlellochScan; Num elements: 268435456, Elapsed: 0.160 ms
Radix sort; Num elements: 1024, Elapsed: 48.501 ms
Radix sort; Num elements: 16384, Elapsed: 0.369 ms
Radix sort; Num elements: 65536, Elapsed: 0.423 ms
Radix sort; Num elements: 131072, Elapsed: 0.637 ms
Radix sort; Num elements: 524288, Elapsed: 0.692 ms
Radix sort; Num elements: 1048576, Elapsed: 2.788 ms
Radix sort; Num elements: 16777216, Elapsed: 4.817 ms
Radix sort; Num elements: 67108864, Elapsed: 13.530 ms
Radix sort; Num elements: 134217728, Elapsed: 38.760 ms
Radix sort; Num elements: 268435456, Elapsed: 85.152 ms
===============================================================================
test cases: 3 | 3 passed
assertions: - none -

@loryruta
Owner

Your results are even better than mine!

Radix sort; Num elements: 1048576, Elapsed: 2.788 ms
Radix sort; Num elements: 16777216, Elapsed: 4.817 ms

The elapsed time for 6131954 elements should be similar to these numbers... (i.e. the ~6ms I'm getting makes sense).
Why are you measuring ~80ms? I'll have to test further...

@Meakk
Author

Meakk commented Jan 30, 2024

Would it help if I sent you the .trace file produced by apitrace?

@Meakk
Author

Meakk commented Jan 30, 2024

I can also dump my float keys buffer if you prefer

@loryruta
Owner

Would it help if I sent you the .trace file produced by apitrace?

Yes please!

I can also dump my float keys buffer if you prefer

Yeah, I was about to ask for it!

Also, ./glu_test [benchmark] runs the radix sort on an uninitialized key/value array; that's the only difference I can think of with respect to your setup. So maybe some data dependency is showing up here...

@Meakk
Author

Meakk commented Jan 30, 2024

Here's the buffer in binary format; it contains the raw buffer. Even though it's floats, I guess it doesn't matter for the radix sort and you can reinterpret them as integers.

buffer.zip

@Meakk
Author

Meakk commented Jan 30, 2024

And the apitrace file: https://we.tl/t-3hiKrkNcZL

@loryruta
Owner

Here's the buffer in binary format; it contains the raw buffer. Even though it's floats, I guess it doesn't matter for the radix sort and you can reinterpret them as integers.

buffer.zip

Tested with your data:

Radix sort (issue #3); Num elements: 6131954, Elapsed: 5.528 ms

Code:

TEST_CASE("RadixSort-3-benchmark", "[.][benchmark]")
{
    const size_t k_num_elements = 6131954;

    FILE* f = fopen("/tmp/3-glu-buffer.bin", "rb"); // open in binary mode
    REQUIRE(f);

    std::vector<float> data(k_num_elements); // one float per key
    fread(data.data(), sizeof(GLfloat), k_num_elements, f);
    REQUIRE(fgetc(f) == -1); // File should be ended
    fclose(f);

    ShaderStorageBuffer key_buffer(data);
    ShaderStorageBuffer val_buffer(data); // Upload the same data for values

    RadixSort radix_sort;

    StopWatch stopwatch;

    radix_sort(key_buffer.handle(), val_buffer.handle(), k_num_elements);

    std::string duration_str = stopwatch.elapsed_time_str();
    printf("Radix sort (issue #3); Num elements: %zu, Elapsed: %s\n", k_num_elements, duration_str.c_str());
}

@Meakk
Author

Meakk commented Jan 31, 2024

Ok, thank you for investigating. ~5ms would be much better than my bitonic sort (~16ms).
I'll try to debug it on my side. Could the value buffer affect the performance somehow?

@loryruta
Owner

loryruta commented Feb 1, 2024

@Meakk it turned out the reason is that I was measuring the elapsed time incorrectly: I was using std::chrono without calling glFinish, so I wasn't waiting for the device to finish executing the algorithm.

I'm now using GL_TIME_ELAPSED query objects and the results are very different (and similar to what you observe):

Reduce; Num elements: 1024, Elapsed: 0.082 ms
Reduce; Num elements: 16384, Elapsed: 0.012 ms
Reduce; Num elements: 65536, Elapsed: 0.016 ms
Reduce; Num elements: 131072, Elapsed: 0.019 ms
Reduce; Num elements: 524288, Elapsed: 0.029 ms
Reduce; Num elements: 1048576, Elapsed: 0.049 ms
Reduce; Num elements: 16777216, Elapsed: 0.620 ms
Reduce; Num elements: 67108864, Elapsed: 2.515 ms
Reduce; Num elements: 134217728, Elapsed: 5.029 ms
Reduce; Num elements: 268435456, Elapsed: 10.054 ms
BlellochScan; Num elements: 1024, Elapsed: 0.993 ms
BlellochScan; Num elements: 16384, Elapsed: 0.082 ms
BlellochScan; Num elements: 65536, Elapsed: 0.102 ms
BlellochScan; Num elements: 131072, Elapsed: 0.115 ms
BlellochScan; Num elements: 524288, Elapsed: 0.169 ms
BlellochScan; Num elements: 1048576, Elapsed: 0.357 ms
BlellochScan; Num elements: 16777216, Elapsed: 4.501 ms
BlellochScan; Num elements: 67108864, Elapsed: 18.431 ms
BlellochScan; Num elements: 134217728, Elapsed: 37.874 ms
BlellochScan; Num elements: 268435456, Elapsed: 76.464 ms
Radix sort; Num elements: 1024, Elapsed: 0.668 ms
Radix sort; Num elements: 16384, Elapsed: 1.008 ms
Radix sort; Num elements: 65536, Elapsed: 1.778 ms
Radix sort; Num elements: 131072, Elapsed: 3.103 ms
Radix sort; Num elements: 524288, Elapsed: 10.601 ms
Radix sort; Num elements: 1048576, Elapsed: 20.630 ms
Radix sort (issue #3); Num elements: 6131954, Elapsed: 0.164 s  ---> Your data (you get ~80ms)
Radix sort; Num elements: 16777216, Elapsed: 0.313 s
Radix sort; Num elements: 67108864, Elapsed: 1.250 s
Radix sort; Num elements: 134217728, Elapsed: 2.523 s
Radix sort; Num elements: 268435456, Elapsed: 5.060 s
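
For reference, the measurement now looks roughly like this (a minimal sketch, assuming a current GL context; key_buffer/val_buffer/radix_sort are the same objects as in the test above):

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
radix_sort(key_buffer.handle(), val_buffer.handle(), k_num_elements);
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsed_ns = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns); // waits for the GPU result
printf("Elapsed: %.3f ms\n", elapsed_ns / 1e6);
glDeleteQueries(1, &query);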

@loryruta
Owner

loryruta commented Feb 1, 2024

Could the value buffer affect the performance somehow?

Removing the value_buffer would avoid a few writes, so yes. But changing the value_buffer data doesn't have an impact.

@loryruta
Owner

loryruta commented Feb 1, 2024

I mentioned the [0, 255] range, but maybe that's a bit too aggressive and 16 bits might be better

Mapping your data to a smaller range would surely be a first optimization.
Here are the timings you would get if you remapped your data into different ranges and sorted fewer bits (= fewer RadixSort iterations):

Radix sort (issue #3); Num elements: 6131954, Num steps: 1, Elapsed: 20.759 ms  // 4  bit sorting
Radix sort (issue #3); Num elements: 6131954, Num steps: 2, Elapsed: 39.247 ms  // 8  bit sorting <---
Radix sort (issue #3); Num elements: 6131954, Num steps: 3, Elapsed: 58.858 ms  // 12 bit sorting
Radix sort (issue #3); Num elements: 6131954, Num steps: 4, Elapsed: 77.727 ms  // 16 bit sorting <---
Radix sort (issue #3); Num elements: 6131954, Num steps: 5, Elapsed: 83.696 ms  // 20 bit sorting
Radix sort (issue #3); Num elements: 6131954, Num steps: 6, Elapsed: 99.351 ms  // 24 bit sorting
Radix sort (issue #3); Num elements: 6131954, Num steps: 7, Elapsed: 0.105 s    // 28 bit sorting
Radix sort (issue #3); Num elements: 6131954, Num steps: 8, Elapsed: 0.119 s    // 32 bit sorting <--- here you get ~80ms

Further optimization can maybe be achieved by reworking the algorithm. Now that I have proper measurements, I can work on it 👍

@Meakk
Author

Meakk commented Feb 2, 2024

Ok, that explains the huge difference.
I'll go ahead with my bitonic sort for now (I achieved ~15ms by optimizing the shader) and see if our users complain about performance.
In any case, I'm definitely interested in migrating to a radix sort at some point (it looks like the state of the art is the Onesweep method), so let me know if you make significant progress!
