
What would be the main design trade-offs when re-implementing in clean modern C++? #354

Open
mikeroberts3000 opened this issue May 4, 2024 · 2 comments


mikeroberts3000 commented May 4, 2024

Hi Andrej, this implementation is fantastic!

In your view, what would be the main design trade-offs if one were to re-implement the C code that is intended to run on the CPU in modern C++? By modern C++, I generally mean code that favors std containers over raw C-style arrays, favors std ranges and similar tools over raw loops, etc. Obviously, doing so would deviate somewhat from the minimal as-close-to-the-metal-as-possible design philosophy of this repository. And obviously, some kinds of code stand to benefit from this kind of transformation more than others, so maybe there isn't much benefit in this case. On the other hand, modern C++ may result in fewer lines of code without sacrificing readability (or performance), which also appears to be one of the design goals of this repo.

With all of that being said, how much speed do you think one would gain/lose from a clean modern C++ implementation, and how much simplicity/safety/flexibility do you think you'd gain/lose? What about re-implementing the raw CUDA kernels as thrust kernels?

Cheers,
Mike

Contributor

ngc92 commented May 5, 2024

This is a very ill-defined question, because people's ideas of what constitutes "clean" and "modern" differ. Personally, I wouldn't consider an abstraction that leads to a noticeable performance impact clean. In that case, you're using the wrong tool for the job.

That still leaves a lot of things you can do with C++ that are "cleaner" than the C counterparts. unique_ptr is a great tool, and probably more appropriate than std::vector here. Implementing pointwise operations in thrust / CUB likewise shouldn't result in a performance loss. There is a C++ implementation linked in the forks.

Btw, there isn't really any relation between whether code is C or C++ and how close it is to "bare metal"; e.g., cutlass implements fast matrix multiplication on GPUs with a lot of template magic that often ends in inline assembly.

Author

mikeroberts3000 commented May 5, 2024

Hi @ngc92, thanks for your comment.

  • I agree that my question is subjective, and different people have different opinions about what constitutes clean/modern code, especially in the C++ community. I am interested in your subjective opinion, and the subjective opinion of the other authors of this repo. I tried to highlight this nuance in my original question by prefixing it with the phrase, in your view...
  • With all due respect, I don't think my question is "very ill-defined", just because it is subjective. Consider the question, what's your favorite restaurant in SF? Is this question also "very ill-defined" because it is subjective?
  • I'm not arguing for or against std::unique_ptr or std::vector or any other specific algorithm or data structure from std. Instead, I'm wondering aloud if there are opportunities to reduce the lines of code without sacrificing readability or performance, using whatever tools from std you think might be appropriate, if any.
  • I agree that a thrust kernel that replaces a straightforward map/transform kernel implemented in raw CUDA (e.g., multiplying every element in a GPU array by 2) is unlikely to result in a noticeable performance loss. This type of kernel would be an example where, e.g., thrust::for_each(...) would probably reduce lines of code without sacrificing readability or performance. But the advantages of thrust might be less clear for other kernels. For example, a 2D convolution with a large stencil implemented naively using thrust::for_each(...) will likely sacrifice some performance compared to a raw CUDA kernel that makes careful use of GPU shared memory.
  • I agree that C code is not always low-level, and C++ code is not always high-level. But std data structures are often used in a way that is slightly "further away from the metal" than their raw C counterparts, e.g., when using accessor functions that perform some kind of bounds checking. This would be an example where using higher-level tools (std data structures with bounds checking) might sacrifice some performance. On the other hand, I like your example of cutlass using template specialization to optimize GPU kernels at compile-time (thrust also does a lot of this). This would be an example where using higher-level tools (C++ template specialization) might result in performance improvements without sacrificing readability or increasing lines of code. Again, I'm wondering aloud if you see opportunities in this repo to reduce the lines of GPU code (and surrounding code for launching kernels) without sacrificing readability or performance, using something like thrust.
