Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memset builtins #45

Closed
wants to merge 4 commits into from
Closed

Memset builtins #45

wants to merge 4 commits into from

Conversation

abagusetty
Copy link
Contributor

  1. Use of device memset builtins for CUDA, HIP, SYCL improves device code gen
  2. YAKL_DEBUG issue with SYCL backend. printf is not supported in device code for yakl_throw()
  3. Updates the compilers to Aurora MPICH

This was referenced Jan 25, 2022
@mrnorman
Copy link
Owner

Thanks @abagusetty! I think I'll have time to work on this today. I looked up C's memset and have realized I didn't fully understand its limitations (integer param only). I think it might be appropriate to rename YAKL's memset to fill instead, which I'll consider for a later PR since I have to worry about backwards compatibility.

For SYCL, do we want to specialize the template for int parameters and continue to use memset, and use fill for all non-int parameters (rather than zero versus non-zero)?

Regarding CUDA and HIP, memset is synchronous, and I need to do some testing to see what the cost / benefit is of causing a sync versus faster memory filling. I might create a CPP define to control this behavior for users who feel the need to speed-up memset with integers who do not mind a synchronization point.

Thanks for catching the printf issue. I took it for granted that SYCL would allow it on the device. Is this being planned in later SYCL specs? It would be useful.

@abagusetty
Copy link
Contributor Author

Thanks @abagusetty! I think I'll have time to work on this today. I looked up C's memset and have realized I didn't fully understand its limitations (integer param only). I think it might be appropriate to rename YAKL's memset to fill instead, which I'll consider for a later PR since I have to worry about backwards compatibility.

For SYCL, do we want to specialize the template for int parameters and continue to use memset, and use fill for all non-int parameters (rather than zero versus non-zero)?

This might be a cleaner option. All forms (CUDA, HIP, SYCL, HOST) of memset can only handle int params. This might require to cast any 0.0 memsets from any use cases in the app code to int(0)s. sycl::fill might be used for both int and non-ints too since I can confirm that the int=0 moves through the memsets. I too haven't really measured the performance gains but thought that using device RT APIs might be beneficial to generate (a) better PTX/GCN/SPIR-V code gens for contiguous memory, (b) lighter over-head over the current parallel_for routines.

Regarding CUDA and HIP, memset is synchronous, and I need to do some testing to see what the cost / benefit is of causing a sync versus faster memory filling. I might create a CPP define to control this behavior for users who feel the need to speed-up memset with integers who do not mind a synchronization point.

hip/cudaMemset and hip/cudaMemsetAsync seems be to async with respect to the host except with some corner cases of using pinned host mem or unified mem. Might be a way to approach ?
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memset

Thanks for catching the printf issue. I took it for granted that SYCL would allow it on the device. Is this being planned in later SYCL specs? It would be useful.

An experimental form of printf is already supported but with some very minor limitations that YAKL is currently heading into. I can open an issue to fix it accordingly for tracking purposes.
https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/builtins.hpp#L67

@mrnorman
Copy link
Owner

Ah, it looks like cudaMemsetAsync has been in there for a while (at least since 10.0). Clearly memset is preferred then for int inputs, then. So here's the summary of what I think would be best:

  • For int inputs: memset for host, CUDA, HIP, and SYCL
  • For non-int inputs: fill for host and SYCL, and parallel_for for CUDA and HIP

Having an issue open for better printf support would be great. In the end, the user will be backtracing with gdb regardless, but it's nice to give the user some printed info at least.

@mrnorman
Copy link
Owner

I have to admit, though, I'm confused by the memset documentation. They keep saying "byte" value, but the input is int. Maybe doing it for general int values is a bad idea? I wish it were clearer.

@mrnorman
Copy link
Owner

Yeah, we'll just use memset for zero values then... Back to what you had originally.

@@ -115,72 +115,17 @@
sycl::access::address_space::global_space>
using relaxed_atomic_ref =
sycl::ext::oneapi::atomic_ref< T,
sycl::ext::oneapi::memory_order::relaxed,
sycl::ext::oneapi::memory_order::seq_cst,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abagusetty , why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was mostly for better stability for atomics for future SDK update and hardware releases in upcoming weeks. Currently this change would be no-different than previous relaxed

@@ -34,6 +34,12 @@
#define YAKL_CURRENTLY_ON_HOST() (! defined(__SYCL_DEVICE_ONLY__))
#define YAKL_CURRENTLY_ON_DEVICE() (defined(__SYCL_DEVICE_ONLY__))

#ifdef __SYCL_DEVICE_ONLY__
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONSTANT doesn't appear to be used anywhere. Is this for future SYCL work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the CONSTANT was for the use case of sycl::printf, https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/builtins.hpp#L67. For eg:

const CONSTANT char format[] = "KERNEL CHECK FAILED:\n   %s\n   %s\n";
sycl::ext::oneapi::experimental::printf(format,msg);

Given that printf(%s\n") with const char * is not really aligning with the above sycl::printf signature

@mrnorman
Copy link
Owner

All features in this branch have been merged to master.

@mrnorman mrnorman closed this Jan 25, 2022
@abagusetty abagusetty deleted the memset_builtins branch January 25, 2022 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants