
What would be the main design trade-offs when re-implementing in clean modern C++? #354

Open
mikeroberts3000 opened this issue May 4, 2024 · 2 comments


mikeroberts3000 commented May 4, 2024

Hi Andrej, this implementation is fantastic!

In your view, what would be the main design trade-offs if one were to re-implement the C code that is intended to run on the CPU in modern C++? By modern C++, I generally mean code that favors std containers over raw C-style arrays, favors std ranges and similar tools over raw loops, etc. Obviously, doing so would deviate somewhat from the minimal as-close-to-the-metal-as-possible design philosophy of this repository. And obviously, some kinds of code stand to benefit from this kind of transformation more than others, so maybe there isn't much benefit in this case. On the other hand, modern C++ may result in fewer lines of code without sacrificing readability (or performance), which also appears to be one of the design goals of this repo.

With all of that being said, how much speed do you think one would gain/lose from a clean modern C++ implementation, and how much simplicity/safety/flexibility do you think you'd gain/lose? What about re-implementing the raw CUDA kernels as thrust kernels?

Cheers,
Mike

Contributor

ngc92 commented May 5, 2024

This is a very ill-defined question, because people's ideas of what constitutes "clean" and "modern" differ. Personally, I wouldn't consider an abstraction that leads to a noticeable performance impact clean. In that case, you're using the wrong tool for the job.

That still leaves a lot of things you can do with C++ that are "cleaner" than the C counterparts. unique_ptr is a great tool, and probably more appropriate than std::vector here. Implementing pointwise operations in thrust / CUB likewise shouldn't result in a performance loss. There is a C++ implementation linked in the forks.

Btw, there isn't really any relation between whether code is C or C++ and how close it is to "bare metal"; e.g., cutlass implements fast matrix multiplication on GPUs with a lot of template magic that often ends in inline assembly.

Author

mikeroberts3000 commented May 5, 2024

Hi @ngc92, thanks for your comment.

  • I agree that my question is subjective, and different people have different opinions about what constitutes clean/modern code, especially in the C++ community. I am interested in your subjective opinion, and the subjective opinion of the other authors of this repo. I tried to highlight this nuance in my original question by prefixing it with the phrase, in your view...
  • With all due respect, I don't think my question is "very ill-defined", just because it is subjective. Consider the question, what's your favorite restaurant in SF? Is this question also "very ill-defined" because it is subjective?
  • I'm not arguing for or against std::unique_ptr or std::vector or any other specific algorithm or data structure from std. Instead, I'm wondering aloud if there are opportunities to reduce the lines of code without sacrificing readability or performance, using whatever tools from std you think might be appropriate, if any.
  • I agree that a thrust kernel that replaces a straightforward map/transform kernel implemented in raw CUDA (e.g., multiplying every element in a GPU array by 2) is unlikely to result in a noticeable performance loss. This type of kernel would be an example where, e.g., thrust::for_each(...) would probably reduce lines of code without sacrificing readability or performance. But the advantages of thrust might be less clear for other kernels. For example, a 2D convolution with a large stencil implemented naively using thrust::for_each(...) will likely sacrifice some performance compared to a raw CUDA kernel that makes careful use of GPU shared memory.
  • I agree that C code is not always low-level, and C++ code is not always high-level. But std data structures are often used in a way that is slightly "further away from the metal" than their raw C counterparts, e.g., when using accessor functions that perform some kind of bounds checking. This would be an example where using higher-level tools (std data structures with bounds checking) might sacrifice some performance. On the other hand, I like your example of cutlass using template specialization to optimize GPU kernels at compile-time (thrust also does a lot of this). This would be an example where using higher-level tools (C++ template specialization) might result in performance improvements without sacrificing readability or increasing lines of code. Again, I'm wondering aloud if you see opportunities in this repo to reduce the lines of GPU code (and surrounding code for launching kernels) without sacrificing readability or performance, using something like thrust.
