
Adding an alias for a page migrating memory space #5289

Merged (19 commits) on Sep 24, 2022

Conversation

@JBludau (Contributor) commented Jul 28, 2022

This PR picks up on the discussion in #5193 and provides an alias for a page-migrating memory space.

Even though the participants in the discussion preferred the name DefaultSharedMemorySpace, I propose MigratingMemorySpace, but I have no problem changing that name. The reasoning is the following:

  • Drop the Default, as it is too close to the execution space aliases we have (DefaultExecutionSpace and DefaultHostExecutionSpace). I want to prevent the alias from being mistaken for an execution space, so it follows the convention we have for the memory spaces. Furthermore, if we stick with it being migrating, the Default loses its meaning, as there is only one such space per backend.
  • I propose Migrating, as it targets developers who actually want the property of memory that automatically moves to the accessing device and is then accessed locally. I expect they want the same behavior independent of the backend and deliberately choose this (especially given that not all backends support this feature). If they switch to another backend that has no page migration, this should not silently change behavior.
  • An alias for a memory space that is always accessible by host and device and available in (almost) all backends is useful. I propose to create a separate alias for this, maybe UniversalMemorySpace. I think this would need good documentation on its limitations and restrictions, but it would be clearer what the cost of the universal accessibility is.
  • We can actually specify, and thus test, what we expect from this memory and track whether we make any changes that would alter its behavior.

If we introduce this alias, we should reconsider removing Kokkos_ENABLE_CUDA_UVM, as there would then be a better way to specify that the user wants page migration, and we are moving toward a major release which includes HIP moving out of Experimental.

The specification we are testing is (see the usage sketch below):

  • Migrate on first touch in a new execution space
  • Migrate only when switching to a different execution space
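Roughly, a minimal usage sketch of the proposed alias could look as follows (shown with the SharedSpace name the PR eventually settles on; at this point in the discussion the proposed name was still MigratingMemorySpace):

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Accessible from every execution space; pages migrate to whichever
    // execution space touches them next.
    Kokkos::View<double*, Kokkos::SharedSpace> data("data", 1 << 20);

    // First touch in the device execution space: pages end up on the device.
    Kokkos::parallel_for(
        "fill_on_device",
        Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(0, data.extent(0)),
        KOKKOS_LAMBDA(const int i) { data(i) = 1.0; });
    Kokkos::fence();

    // Switching to the host execution space: pages migrate once, then
    // subsequent host accesses run at host-local speed.
    Kokkos::parallel_for(
        "scale_on_host",
        Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, data.extent(0)),
        [=](const int i) { data(i) *= 2.0; });
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```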

Missing:

  • write documentation: #157

@JBludau added the Enhancement label (Improve existing capability; will potentially require voting) on Jul 28, 2022
@JBludau (Contributor, Author) commented Jul 28, 2022

Quick note on the amount of memory we allocate and the number of repetitions the test is running:
It turned out that Nvidia's V100 had no problem cycling 10 times through 40% of its memory without reaching 100% of the core clock rate. That is the reason for the high number of (literal) boiler-plate warmup runs.

MI100 (and older) do not support page migration, so the test is disabled for these architectures.

For SYCL: I need to read more documentation on getting device attributes and finding out which devices support page migration.
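For reference, a sketch (not the guard code used in this PR) of how page-migration support can be queried from the CUDA or HIP runtime; MI100 and older report 0 for concurrent managed access, which is why the test has to be disabled there:

```c++
#if defined(KOKKOS_ENABLE_CUDA)
#include <cuda_runtime.h>
// Returns true if the given CUDA device supports concurrent managed access,
// i.e. real page migration between host and device.
bool device_supports_page_migration(int device) {
  int value = 0;
  cudaDeviceGetAttribute(&value, cudaDevAttrConcurrentManagedAccess, device);
  return value != 0;
}
#elif defined(KOKKOS_ENABLE_HIP)
#include <hip/hip_runtime.h>
// Same query through the HIP runtime.
bool device_supports_page_migration(int device) {
  int value = 0;
  hipDeviceGetAttribute(&value, hipDeviceAttributeConcurrentManagedAccess, device);
  return value != 0;
}
#endif
```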

@masterleinad (Contributor) left a comment:

Add support for SYCL using Kokkos::Experimental::SYCLSharedUSMSpace.

@JBludau (Contributor, Author) commented Jul 29, 2022

Add support for SYCL using Kokkos::Experimental::SYCLSharedUSMSpace.

Added. According to the documentation, the initial placement is not specified.
I will look at CUDA and HIP, especially what they do if we overcommit on migrating memory.
It might make sense for us to not expect anything about the initial placement.

@masterleinad (Contributor) commented:

Regarding the name, I'm not sure people would find it intuitive that this is a space they can access from both the host and the device (which is what people would mainly be interested in, IMHO).

@JBludau (Contributor, Author) commented Jul 29, 2022

Looks like our ROCm CI is building for Kokkos_ARCH_VEGA90A, which it does not have the hardware for. As the tests need real page migration support to pass, they are failing on our CI. Maybe we should add a label and specify the architecture. -> will do this in another PR -> this is outdated, as it was a problem with the autodetection of the arch; see below

@JBludau (Contributor, Author) commented Jul 29, 2022

Regarding the name, I'm not sure people would find it intuitive that this is a space they can access from both the host and the device (which is what people would mainly be interested in, IMHO).

But this space is far more than just accessible from both. It actively moves the memory and allows local access afterwards. We can add another alias for a space that is accessible everywhere and does not migrate (as I proposed in the description).

@masterleinad (Contributor) commented:

On Intel GPUs, I see no migration overhead for host->device but only for device->host, i.e., the loop time is constant on the device and the first access on the host is slower than the remaining ones. Also, the first access in the first host loop is slower than the first access in the later host loops. Thus, the test currently fails with

Initial placement on device is 1 we expect true 
Memory migrates on every space access is 0 we expect true 
Memory migrates only once per access 0 we expect true 

and I need

diff --git a/core/unit_test/TestPageMigration.hpp b/core/unit_test/TestPageMigration.hpp
index 9b6e1a051..ab4e0a494 100644
--- a/core/unit_test/TestPageMigration.hpp
+++ b/core/unit_test/TestPageMigration.hpp
@@ -122,7 +122,7 @@ TEST(TEST_CATEGORY, page_migration) {
   const unsigned int numWarmupRepetitions = 100;
   const unsigned int numDeviceHostCycles  = 3;
   double fractionOfDeviceMemory           = 0.4;
-  double threshold                        = 2.0;
+  double threshold                        = 1.5;
   size_t numBytes       = fractionOfDeviceMemory * getDeviceMemorySize();
   unsigned int numPages = numBytes / getBytesPerPage();
 
@@ -190,7 +190,7 @@ TEST(TEST_CATEGORY, page_migration) {
     if (cycle == 0 && indicatedPageMigrationsDevice == 0)
       initialPlacementOnDevice = true;
     else {
-      if (indicatedPageMigrationsDevice != 1) migratesOnlyOncePerAccess = false;
+      if (indicatedPageMigrationsDevice > 1) migratesOnlyOncePerAccess = false;
     }
 
     unsigned int indicatedPageMigrationsHost = std::count_if(

for it to pass.

@dalg24 (Member) commented Aug 2, 2022

Please justify that the unit test added here cannot run faster and achieve the same thing. (I would probably already complain about anything that takes more than a few seconds.)

test 7
      Start  7: KokkosCore_UnitTest_Cuda4

7: Test command: /var/jenkins/workspace/Kokkos/build/core/unit_test/KokkosCore_UnitTest_Cuda4
7: Test timeout computed to be: 1500
7: [==========] Running 1 test from 1 test suite.
7: [----------] Global test environment set-up.
7: [----------] 1 test from cuda_uvm
7: [ RUN      ] cuda_uvm.page_migration
7: [       OK ] cuda_uvm.page_migration (38209 ms)
7: [----------] 1 test from cuda_uvm (38209 ms total)
7: 
7: [----------] Global test environment tear-down
7: [==========] 1 test from 1 test suite ran. (38209 ms total)
7: [  PASSED  ] 1 test.
 7/61 Test  #7: KokkosCore_UnitTest_Cuda4 ....................   Passed   39.66 sec

@JBludau (Contributor, Author) commented Aug 3, 2022

Just some name ideas for the temporally local (migrating) variant and the fixed but universally accessible one (please don't hate me for these):
UniversalLocalMemorySpace and UniversalFixedMemorySpace
GlobalMovingMemorySpace and GlobalPinnedMemorySpace
MoveThenLocalMemorySpace and FixedButGlobalMemorySpace
SharedMovingMemorySpace and SharedFixedMemorySpace

@masterleinad (Contributor) commented:

I like Universal but would try to avoid Local and Global so maybe UniversalMovingMemorySpace and UniversalPinnedMemorySpace?

@PhilMiller (Contributor) commented:

UniversalMoving sounds fine to me.

UniversalPinned is more troublesome, because either or both of host-pinned and device-pinned could exist, with different properties for both performance and non-compute accessibility.

@crtrott (Member) commented Aug 4, 2022

I know it's somewhat confusing, but how about SharedSpace and SharedHostPinnedSpace? I know CUDA shared memory is something totally different; however, ironically, CUDA shared memory is the least shared of all allocations possible in CUDA except for registers ...

@JBludau (Contributor, Author) commented Aug 4, 2022

I know it's somewhat confusing, but how about SharedSpace and SharedHostPinnedSpace? I know CUDA shared memory is something totally different; however, ironically, CUDA shared memory is the least shared of all allocations possible in CUDA except for registers ...

Damien will fight you on this :-)

@JBludau (Contributor, Author) commented Aug 10, 2022

The unit test is now based on clock cycles rather than on wall-clock time, which eliminated the need for warmup runs and thus large memory chunks on AMD and Nvidia GPUs. @masterleinad, could you rerun this for Intel GPUs?

Furthermore, I adopted the name @crtrott suggested, but we can still change it. @crtrott, could you sum up your reasoning here to have it documented?

@JBludau (Contributor, Author) commented Aug 10, 2022

Please justify that the unit test added here cannot run faster and achieve the same thing. (I would probably already complain about anything that takes more than a few seconds.)

7: [       OK ] cuda_uvm.page_migration (38209 ms)

It should be on the order of milliseconds now.

@masterleinad (Contributor) commented:

The unit test is now based on clock cycles rather than on wall-clock time, which eliminated the need for warmup runs and thus large memory chunks on AMD and Nvidia GPUs. @masterleinad, could you rerun this for Intel GPUs?

No, it doesn't pass on Intel GPUs, and the numbers for device access are much higher than for host access even though the runtime measurements before showed the opposite. Note that this calls different functions on the host and on the device, and it is not implemented properly for SYCL+CUDA on the device. We use that function as a seed, so returning 0 is good enough for that purpose but not for the one here. The CI for SYCL+CUDA shows

[ RUN      ] sycl_shared_usm.page_migration
14: Page size as reported by os: 4096 bytes 
14: Allocating 100 pages of memory in pageMigratingMemorySpace.
14: Behavior found: 
14: Initial placement on device is: 1 we expect true 
14: Memory migrates back to GPU is: 0 we expect true 
14: Memory migrates at max once per access: 1 we expect true 
14: 
14: Please look at the following timings. A migration was marked detected if the time was larger than 0 for the device 
14: 
14: device timings of run 0:
14: TimingResult contains 10 results:
14: Duration of loop 0 is 0 clock cycles
14: Duration of loop 1 is 0 clock cycles
14: Duration of loop 2 is 0 clock cycles
14: Duration of loop 3 is 0 clock cycles
14: Duration of loop 4 is 0 clock cycles
14: Duration of loop 5 is 0 clock cycles
14: Duration of loop 6 is 0 clock cycles
14: Duration of loop 7 is 0 clock cycles
14: Duration of loop 8 is 0 clock cycles
14: Duration of loop 9 is 0 clock cycles
14: host timings of run 0:
14: TimingResult contains 10 results:
14: Duration of loop 0 is 20 clock cycles
14: Duration of loop 1 is 15 clock cycles
14: Duration of loop 2 is 15 clock cycles
14: Duration of loop 3 is 15 clock cycles
14: Duration of loop 4 is 15 clock cycles
14: Duration of loop 5 is 15 clock cycles
14: Duration of loop 6 is 14 clock cycles
14: Duration of loop 7 is 15 clock cycles
14: Duration of loop 8 is 14 clock cycles
14: Duration of loop 9 is 14 clock cycles
14: device timings of run 1:
14: TimingResult contains 10 results:
14: Duration of loop 0 is 0 clock cycles
14: Duration of loop 1 is 0 clock cycles
14: Duration of loop 2 is 0 clock cycles
14: Duration of loop 3 is 0 clock cycles
14: Duration of loop 4 is 0 clock cycles
14: Duration of loop 5 is 0 clock cycles
14: Duration of loop 6 is 0 clock cycles
14: Duration of loop 7 is 0 clock cycles
14: Duration of loop 8 is 0 clock cycles
14: Duration of loop 9 is 0 clock cycles
14: host timings of run 1:
14: TimingResult contains 10 results:
14: Duration of loop 0 is 18 clock cycles
14: Duration of loop 1 is 15 clock cycles
14: Duration of loop 2 is 14 clock cycles
14: Duration of loop 3 is 14 clock cycles
14: Duration of loop 4 is 16 clock cycles
14: Duration of loop 5 is 15 clock cycles
14: Duration of loop 6 is 15 clock cycles
14: Duration of loop 7 is 16 clock cycles
14: Duration of loop 8 is 14 clock cycles
14: Duration of loop 9 is 15 clock cycles
14: device timings of run 2:
14: TimingResult contains 10 results:
14: Duration of loop 0 is 0 clock cycles
14: Duration of loop 1 is 0 clock cycles
14: Duration of loop 2 is 0 clock cycles
14: Duration of loop 3 is 0 clock cycles
14: Duration of loop 4 is 0 clock cycles
14: Duration of loop 5 is 0 clock cycles
14: Duration of loop 6 is 0 clock cycles
14: Duration of loop 7 is 0 clock cycles
14: Duration of loop 8 is 0 clock cycles
14: Duration of loop 9 is 0 clock cycles
14: host timings of run 2:
14: TimingResult contains 10 results:
14: Duration of loop 0 is 18 clock cycles
14: Duration of loop 1 is 15 clock cycles
14: Duration of loop 2 is 15 clock cycles
14: Duration of loop 3 is 15 clock cycles
14: Duration of loop 4 is 16 clock cycles
14: Duration of loop 5 is 15 clock cycles
14: Duration of loop 6 is 15 clock cycles
14: Duration of loop 7 is 15 clock cycles
14: Duration of loop 8 is 15 clock cycles
14: Duration of loop 9 is 15 clock cycles
14: /var/jenkins/workspace/Kokkos/core/unit_test/TestPageMigration.hpp:206: Failure
14: Value of: passed
14:   Actual: false
14: Expected: true
14: [  FAILED  ] sycl_shared_usm.page_migration (57 ms)

All that is to say that a timing-based test would be better for testing the SYCL implementation.
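As an illustration, a wall-clock based measurement of the access loops could look roughly like the sketch below (timeInLoop and the view/policy names are made up here to mirror the test's structure; this is not the actual test code):

```c++
#include <Kokkos_Core.hpp>
#include <vector>

// Run `repetitions` increment loops over `data` in ExecSpace and return the
// wall-clock time of each loop; the first entry then contains any migration
// cost, the later ones should be close to local-memory speed.
template <typename ExecSpace, typename ViewType>
std::vector<double> timeInLoop(ViewType data, unsigned repetitions) {
  std::vector<double> times;
  times.reserve(repetitions);
  for (unsigned rep = 0; rep < repetitions; ++rep) {
    Kokkos::Timer timer;  // starts on construction
    Kokkos::parallel_for(
        "increment", Kokkos::RangePolicy<ExecSpace>(0, data.extent(0)),
        KOKKOS_LAMBDA(const int i) { data(i) += 1; });
    Kokkos::fence();  // make sure the kernel finished before reading the time
    times.push_back(timer.seconds());
  }
  return times;
}
```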

@masterleinad (Contributor) commented:

Also, the test fails for HIP in the CI with something like

4: [ RUN      ] hip_managed.page_migration
4: Page size as reported by os: 4096 bytes 
4: Allocating 100 pages of memory in pageMigratingMemorySpace.
4: Behavior found: 
4: Initial placement on device is: 0 we expect true 
4: Memory migrates back to GPU is: 0 we expect true 
4: Memory migrates at max once per access: 0 we expect true 
4: 
4: Please look at the following timings. A migration was marked detected if the time was larger than 13600 for the device 
4: 
4: device timings of run 0:
4: TimingResult contains 10 results:
4: Duration of loop 0 is 28412 clock cycles
4: Duration of loop 1 is 22968 clock cycles
4: Duration of loop 2 is 23048 clock cycles
4: Duration of loop 3 is 22856 clock cycles
4: Duration of loop 4 is 22896 clock cycles
4: Duration of loop 5 is 23355 clock cycles
4: Duration of loop 6 is 23344 clock cycles
4: Duration of loop 7 is 21965 clock cycles
4: Duration of loop 8 is 21477 clock cycles
4: Duration of loop 9 is 24143 clock cycles
4: host timings of run 0:
4: TimingResult contains 10 results:
4: Duration of loop 0 is 33 clock cycles
4: Duration of loop 1 is 33 clock cycles
4: Duration of loop 2 is 33 clock cycles
4: Duration of loop 3 is 33 clock cycles
4: Duration of loop 4 is 33 clock cycles
4: Duration of loop 5 is 32 clock cycles
4: Duration of loop 6 is 33 clock cycles
4: Duration of loop 7 is 33 clock cycles
4: Duration of loop 8 is 33 clock cycles
4: Duration of loop 9 is 32 clock cycles
4: device timings of run 1:
4: TimingResult contains 10 results:
4: Duration of loop 0 is 22179 clock cycles
4: Duration of loop 1 is 22560 clock cycles
4: Duration of loop 2 is 22308 clock cycles
4: Duration of loop 3 is 21959 clock cycles
4: Duration of loop 4 is 22544 clock cycles
4: Duration of loop 5 is 23154 clock cycles
4: Duration of loop 6 is 22839 clock cycles
4: Duration of loop 7 is 21566 clock cycles
4: Duration of loop 8 is 22493 clock cycles
4: Duration of loop 9 is 22335 clock cycles
4: host timings of run 1:
4: TimingResult contains 10 results:
4: Duration of loop 0 is 32 clock cycles
4: Duration of loop 1 is 33 clock cycles
4: Duration of loop 2 is 33 clock cycles
4: Duration of loop 3 is 33 clock cycles
4: Duration of loop 4 is 33 clock cycles
4: Duration of loop 5 is 33 clock cycles
4: Duration of loop 6 is 33 clock cycles
4: Duration of loop 7 is 33 clock cycles
4: Duration of loop 8 is 33 clock cycles
4: Duration of loop 9 is 33 clock cycles
4: device timings of run 2:
4: TimingResult contains 10 results:
4: Duration of loop 0 is 22573 clock cycles
4: Duration of loop 1 is 22335 clock cycles
4: Duration of loop 2 is 21409 clock cycles
4: Duration of loop 3 is 22105 clock cycles
4: Duration of loop 4 is 22678 clock cycles
4: Duration of loop 5 is 22888 clock cycles
4: Duration of loop 6 is 22534 clock cycles
4: Duration of loop 7 is 22149 clock cycles
4: Duration of loop 8 is 22226 clock cycles
4: Duration of loop 9 is 22630 clock cycles
4: host timings of run 2:
4: TimingResult contains 10 results:
4: Duration of loop 0 is 32 clock cycles
4: Duration of loop 1 is 33 clock cycles
4: Duration of loop 2 is 33 clock cycles
4: Duration of loop 3 is 33 clock cycles
4: Duration of loop 4 is 33 clock cycles
4: Duration of loop 5 is 33 clock cycles
4: Duration of loop 6 is 34 clock cycles
4: Duration of loop 7 is 32 clock cycles
4: Duration of loop 8 is 31 clock cycles
4: Duration of loop 9 is 31 clock cycles
4: /var/jenkins/workspace/Kokkos/core/unit_test/TestPageMigration.hpp:206: Failure
4: Value of: passed
4:   Actual: false
4: Expected: true
4: [  FAILED  ] hip_managed.page_migration (16 ms)

which is pretty close to my experience with SYCL.

@JBludau (Contributor, Author) commented Aug 11, 2022

Also, the test fails for HIP in the CI with something like

4: Expected: true
4: [ FAILED ] hip_managed.page_migration (16 ms)


which is pretty close to my experience with SYCL.

This should not even execute for HIP, given that the hardware in our CI has no proper page migration. I will investigate again; I thought the include guard on the test would prevent it.

@JBludau (Contributor, Author) commented Aug 11, 2022

BLOCKED by #5327 as it would break the CI otherwise

@JBludau (Contributor, Author) commented Aug 29, 2022

Documentation issue #149

@JBludau (Contributor, Author) commented Sep 6, 2022

Okay, I changed the following:

Except for OpenMPTarget and OpenACC, the SharedSpace alias is now defined. If there is a device (CUDA, HIP, SYCL), it points to the corresponding page-migrating memory space; in a host-only build, SharedSpace points to HostSpace. There is both a preprocessor define and a constexpr function for checking whether the feature is available. (We do not have a configure-time check, but this would be trivial for users to add if they need it.)
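For illustration, user code could guard on the feature roughly as below; the names KOKKOS_HAS_SHARED_SPACE and Kokkos::has_shared_space are assumptions for the preprocessor define and the constexpr check mentioned above, not quoted from this PR:

```c++
#include <Kokkos_Core.hpp>

// Fall back to HostSpace on backends where the alias is not defined
// (assumed macro name, see lead-in above).
#ifdef KOKKOS_HAS_SHARED_SPACE
using maybe_shared_space = Kokkos::SharedSpace;
#else
using maybe_shared_space = Kokkos::HostSpace;
#endif

void report_shared_space() {
  // Assumed constexpr name for the compile-time check described above.
  if constexpr (Kokkos::has_shared_space) {
    // SharedSpace points to the backend's page-migrating memory space
    // (e.g. CudaUVMSpace or SYCLSharedUSMSpace), or to HostSpace in a
    // host-only build.
  }
}
```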

The unit test now tests for the conditions @crtrott proposed.

Essentially, the semantics of SharedSpace are:
(1) every existing execution space type can access it;
(2) when accessing SharedSpace repeatedly from the same execution space, without accessing it from some other one in between, it performs close to the native memory space of that execution space.

Thus we do not evaluate the first access in a new ExecutionSpace, and we compare the subsequent accesses to the speed of pure local memory. If we detect more than 50% deviation in the memory speed, the test fails (see the sketch below).
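A small sketch of that pass/fail criterion (illustrative only, not the actual test code):

```c++
#include <algorithm>
#include <vector>

// Drop the first access after switching execution spaces (it pays for the
// migration), then require every remaining access to stay within `threshold`
// times the local-memory reference time (threshold = 1.5 for 50% deviation).
bool within_threshold(const std::vector<double>& sharedSpaceTimes,
                      double localMemoryTime, double threshold = 1.5) {
  return std::all_of(sharedSpaceTimes.begin() + 1, sharedSpaceTimes.end(),
                     [=](double t) { return t < threshold * localMemoryTime; });
}
```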

@JBludau JBludau requested a review from crtrott September 6, 2022 19:38
@crtrott (Member) left a comment:

I did not review the tests in detail, but my concerns regarding when this is defined are addressed.

@JBludau (Contributor, Author) commented Sep 9, 2022

Retest this please

Comment on lines 197 to 213
for (unsigned i = 0; i < numDeviceHostCycles; ++i) {
  // WARMUP GPU
  incrementInLoop<Kokkos::DefaultExecutionSpace>(
      deviceData,
      numWarmupRepetitions);  // warming up gpu
  // GET RESULTS DEVICE
  deviceResults.push_back(incrementInLoop<Kokkos::DefaultExecutionSpace>(
      migratableData, numRepetitions));

  // WARMUP HOST
  incrementInLoop<Kokkos::DefaultHostExecutionSpace>(
      hostData,
      numWarmupRepetitions);  // warming up host
  // GET RESULTS HOST
  hostResults.push_back(incrementInLoop<Kokkos::DefaultHostExecutionSpace>(
      migratableData, numRepetitions));
}
A reviewer (Contributor) commented:

Maybe I'm reading things wrong, but shouldn't the warmup calls here both access migratableData - i.e. pull the pages to the space being measured?

Or is the warmup being performed here the clock-speed warmup, to make sure that each core is running at full speed? If so, please elaborate the comments within this loop to clarify.

@JBludau (Contributor, Author) replied:

It is indeed to ensure the core clock is at its maximum when we do the actual measurement, which does include the page migration. Therefore, it should not use the migratableData.

@JBludau (Contributor, Author) replied:

I hope 3feb4d helps.

@JBludau (Contributor, Author) commented Sep 19, 2022

Retest this please

JBludau and others added 2 commits September 21, 2022 10:50
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Labels: Enhancement (Improve existing capability; will potentially require voting)

7 participants