Improve the performance #3
Hi! First of all, thank you for reaching out and trying my project 😄 I'm already aware I'm using a low number of threads per workgroup: every workgroup has 64 threads, which is unreasonably low considering modern GPUs can support up to 1024 threads. Increasing the thread count to 1024 would in turn decrease the number of workgroups (and improve performance?). This doesn't really match your observation though (i.e. the number of workgroups should be high now, and become lower once I increase the workgroup size):
About the workgroup counts: the algorithm runs a Scan operation, so the low-to-high (and high-to-low) progression of workgroup counts makes sense. Also,
So, back to your question:
Increase. I'm going to pick this project back up in the coming days! |
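To illustrate the shrinking dispatch sizes of a hierarchical scan mentioned above, here is a small CPU sketch. Assumptions: 64 threads per workgroup and one item per thread; `scan_dispatch_sizes` is a hypothetical helper for illustration, not part of the project:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: workgroups dispatched at each level of a hierarchical
// scan. Each workgroup reduces `workgroup_size` items to one partial sum, so
// the dispatch size shrinks by that factor per level until one group remains.
std::vector<std::size_t> scan_dispatch_sizes(std::size_t num_items,
                                             std::size_t workgroup_size) {
    std::vector<std::size_t> levels;
    std::size_t n = num_items;
    while (n > 1) {
        std::size_t groups = (n + workgroup_size - 1) / workgroup_size; // ceil-divide
        levels.push_back(groups);
        n = groups;
    }
    return levels;
}
```

For 6131954 items and 64-thread workgroups this yields dispatches of 95812, 1498, 24 and 1 workgroups: the kind of high-to-low progression described above.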
Thank you for your answer. I'll try that next week and let you know! |
I have reworked the code in https://github.com/loryruta/gpu-radix-sort/tree/dev. Hopefully I've also improved the performance; the next step is benchmarking. |
The performance is better on |
You're still sorting
Yes, I'm still not sure how much it impacts the overall performance, but the distribution of the radices does have an impact: during the "counting step" I count the per-block and global radix occurrences using atomic operations. If all radices are 0 (the worst case), all atomic operations are concentrated on a single memory location. However, I've already optimized this by first counting in shared memory and then aggregating the results into global memory, so I'm not sure that's the main bottleneck.
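As a rough CPU analogue of that two-level counting step (assumptions: 8-bit radices and a block size of 1024; `count_digits` is illustrative, not the project's actual kernel):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

constexpr int RADIX = 256;          // 8-bit digits (assumption)
constexpr std::size_t BLOCK = 1024; // items per block (assumption)

// Each "block" first builds a local histogram (shared memory on the GPU),
// then merges it into the global histogram -- the only step that would need
// atomics, and it runs once per digit per block instead of once per key.
std::array<std::uint32_t, RADIX>
count_digits(const std::vector<std::uint32_t>& keys, int shift) {
    std::array<std::uint32_t, RADIX> global{};
    for (std::size_t base = 0; base < keys.size(); base += BLOCK) {
        std::array<std::uint32_t, RADIX> local{}; // per-block histogram
        std::size_t end = std::min(keys.size(), base + BLOCK);
        for (std::size_t i = base; i < end; ++i)
            ++local[(keys[i] >> shift) & (RADIX - 1)];
        for (int d = 0; d < RADIX; ++d)
            global[d] += local[d]; // aggregation into "global memory"
    }
    return global;
}
```

With all-zero keys every increment targets `local[0]`, which is the contention worst case described above, just confined to shared memory.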
At the moment the radix sort algorithm isn't data dependent, so its complexity doesn't depend on your data (apart from the atomic operations mentioned above). This means that whether your data lies in [0, 0xF] or [0, 0xFFFFFFFF], it always executes the same steps and visits all radices. This can certainly be optimized; but a question: if your data is in [0, 255], why not use a bucket sort?
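For keys confined to [0, 255], the bucket/counting-sort idea suggested above is a single histogram pass plus a rebuild; a minimal CPU sketch:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counting sort for 8-bit keys: one pass to histogram, one pass to emit.
// No radix passes are needed because the whole key fits in one digit.
std::vector<std::uint8_t> counting_sort(const std::vector<std::uint8_t>& keys) {
    std::array<std::size_t, 256> count{};
    for (std::uint8_t k : keys) ++count[k];
    std::vector<std::uint8_t> out;
    out.reserve(keys.size());
    for (int v = 0; v < 256; ++v)
        out.insert(out.end(), count[v], static_cast<std::uint8_t>(v));
    return out;
}
```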
Note, just to be sure: you can't run the radix sort with float keys directly. You have to convert them to GLuint first! |
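A common order-preserving float-to-uint mapping (the usual "flip" trick, not necessarily the exact conversion this project expects) flips all bits of negative floats and only the sign bit of non-negative ones, so unsigned comparison matches float comparison for non-NaN values:

```cpp
#include <cstdint>
#include <cstring>

// Map a float to a uint32_t whose unsigned order matches the float order.
// (Standard radix-sort trick; valid for non-NaN inputs.)
std::uint32_t float_to_sortable_uint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);                // reinterpret the bits safely
    return (u & 0x80000000u) ? ~u                 // negative: flip everything
                             : (u | 0x80000000u); // non-negative: flip sign bit
}
```

If all keys are non-negative, only the sign-bit flip applies, which is why the raw bit pattern of positive floats already sorts correctly as unsigned integers.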
Yes, I'm still sorting |
Sorry, the timing is the |
Oh okay, I wasn't aware the binary representation of positive floats was compatible with the radix sort algorithm. I've always sorted You can absolutely try to re-map your floats such that the radices are more evenly distributed: The best case would be (i.e. the least memory contention from atomic operations): However, I'm not sure how much it would impact the performance. Another thing that could be done is to re-introduce iterating over multiple items per thread. I was doing it in |
Wait, how is this possible? To me it takes only Please, if you have time, clone the repository, check out
and send the output. Alternatively, this is what I'm measuring (code):

```cpp
RadixSort radix_sort; // Constructed outside the measured region!

measure {
    radix_sort(key_buffer, val_buffer, 6131954); // ~6ms
}
```

Note I've not included the construction of the |
I confirm I am measuring only the |
You've missed the [benchmark] argument:
|
Here it is:
|
Your results are even better than mine!
Elapsed time for |
Would it help if I send the .trace file produced by |
I can also dump my float keys buffer if you prefer |
Yes please!
Yeah, I was about to ask it! Also, |
Here's the buffer in binary format. It contains the raw buffer. Even though they're floats, I guess it doesn't matter for the bitonic sort and you can reinterpret them as integers. |
And the apitrace file: https://we.tl/t-3hiKrkNcZL |
Tested with your data:
Code:

```cpp
TEST_CASE("RadixSort-3-benchmark", "[.][benchmark]")
{
    const size_t k_num_elements = 6131954;

    FILE* f = fopen("/tmp/3-glu-buffer.bin", "rb"); // binary mode
    REQUIRE(f);

    std::vector<float> data(k_num_elements); // one float per element
    fread(data.data(), sizeof(GLfloat), k_num_elements, f);
    REQUIRE(fgetc(f) == -1); // File should be at EOF
    fclose(f);

    ShaderStorageBuffer key_buffer(data);
    ShaderStorageBuffer val_buffer(data); // Upload the same data for values

    RadixSort radix_sort;

    StopWatch stopwatch;
    radix_sort(key_buffer.handle(), val_buffer.handle(), k_num_elements);
    std::string duration_str = stopwatch.elapsed_time_str();

    printf("Radix sort (issue #3); Num elements: %zu, Elapsed: %s\n",
           k_num_elements, duration_str.c_str());
}
```
|
Ok thank you for investigating. ~5ms would be much better than my Bitonic Sort (~16ms). |
@Meakk it turned out the reason is that I was measuring elapsed time incorrectly: I was using I'm now using
|
Removing the |
Mapping your data to a smaller range would surely be a first optimization.
Further optimization can perhaps be achieved by reworking the algorithm. Now that I have proper measurements, I can work on it 👍 |
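One sketch of that first optimization, under the assumption that the depth keys lie in a known [min, max] range: quantizing them to 16 bits would halve the number of 8-bit radix passes. `quantize_depth` is hypothetical, and the quantization error of (max - min) / 65535 has to be acceptable for ordering nearby points:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical remapping: quantize a depth in [min_d, max_d] to 16 bits so
// the upper passes of a 32-bit radix sort become unnecessary (2 of 4 passes).
// Depths closer together than (max_d - min_d) / 65535 may collapse to one key.
std::uint16_t quantize_depth(float depth, float min_d, float max_d) {
    float t = (depth - min_d) / (max_d - min_d); // normalize to [0, 1]
    t = std::clamp(t, 0.0f, 1.0f);               // guard out-of-range inputs
    return static_cast<std::uint16_t>(t * 65535.0f + 0.5f);
}
```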
Ok, that explains the huge difference. |
Background:
I have a 3D Gaussian splatting viewer in OpenGL which requires sorting a lot of points by view depth.
I'm running the depth computation in a compute shader, then sorting the points using a custom bitonic sort.
Since the bitonic sort is the bottleneck (~30 fps right now), I'm looking for a good implementation of the radix sort.
Issue:
So far, I've managed to call your radix sort and it works.
However, it's quite slow and, after debugging, it seems like many compute dispatches are launched with a surprisingly low number of workgroups: most of the time it's 128, sometimes 23953 (for information, the total number of points I'm sorting is 6131954). Is there any parameter I can tweak to improve the performance or reduce the number of shader invocations?
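As a sanity check on those dispatch sizes: the workgroup count for a pass over the whole buffer is just a ceiling division. This is a sketch under the assumption that each workgroup covers `threads * items_per_thread` elements; the project's actual launch configuration may differ:

```cpp
#include <cstddef>

// Workgroups needed when each workgroup covers threads * items_per_thread
// elements (assumed launch configuration, not necessarily the project's).
std::size_t num_workgroups(std::size_t n, std::size_t threads,
                           std::size_t items_per_thread) {
    std::size_t per_group = threads * items_per_thread;
    return (n + per_group - 1) / per_group; // ceil-divide
}
```

With 6131954 elements, 64 threads per workgroup and 4 items per thread this gives 23953 workgroups, matching one of the dispatch sizes reported above; with 1 item per thread it would be 95812.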