-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding an alias for a page migrating memory space #5289
Conversation
Quick note on the amount of memory we allocate and the number of repetitions the test is running: MI100 (and below) do not support page migration, thus the test is disabled for these archs. For SYCL: I need to read more documentation on getting device attributes and looking which devices do support page migration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add support for SYCL using Kokkos::Experimental::SYCLSharedUSMSpace
.
added. According to the doc the initial placement is not specified. |
Regarding the name, I'm not sure if people would find it intuitive to understand that this is a space they can access from the host and the device (which is what people would mainly be interested IMHO). |
looks like our |
But this space is far more than just accessible from both. It actively moves the memory and allows local access afterwards. We can just add another alias for space that is accessible everywhere and does not migrate (As I proposed in the description) |
On
and I need diff --git a/core/unit_test/TestPageMigration.hpp b/core/unit_test/TestPageMigration.hpp
index 9b6e1a051..ab4e0a494 100644
--- a/core/unit_test/TestPageMigration.hpp
+++ b/core/unit_test/TestPageMigration.hpp
@@ -122,7 +122,7 @@ TEST(TEST_CATEGORY, page_migration) {
const unsigned int numWarmupRepetitions = 100;
const unsigned int numDeviceHostCycles = 3;
double fractionOfDeviceMemory = 0.4;
- double threshold = 2.0;
+ double threshold = 1.5;
size_t numBytes = fractionOfDeviceMemory * getDeviceMemorySize();
unsigned int numPages = numBytes / getBytesPerPage();
@@ -190,7 +190,7 @@ TEST(TEST_CATEGORY, page_migration) {
if (cycle == 0 && indicatedPageMigrationsDevice == 0)
initialPlacementOnDevice = true;
else {
- if (indicatedPageMigrationsDevice != 1) migratesOnlyOncePerAccess = false;
+ if (indicatedPageMigrationsDevice > 1) migratesOnlyOncePerAccess = false;
}
unsigned int indicatedPageMigrationsHost = std::count_if( for it to pass. |
Please justify that the unit test added here cannot ran faster and achieve the same thing. (I would probably already have complained for more than a few seconds.)
|
just some name ideas for the |
I like |
UniversalMoving sounds fine to me. UniversalPinned is more troublesome, because either or both of host or device pinned could exist, and with different properties on both performance and non-compute accessibility |
I know its somewhat confusing but how about: |
Damien will fight you on this :-) |
The unit test is now based on clock cycles rather than on wall clock time which eliminated the need for warmup runs and thus large memory chunks on AMD and Nvidia GPUs. @masterleinad could you rerun this for intel gpus? Furthermore, I adapted the name @crtrott suggested but we can still change it. @crtrott could you sum up your reasoning for this here to have it documented? |
should be in the order of |
No, it doesn't pass on Intel GPUs and the numbers for device access are much higher than for host access even though the runtime before showed the opposite. Note that this calls different functions on the host and on the device and it's not implemented properly for SYCL+CUDA on the device. We use the function as seed and thus returning 0 is good enough for that but not for the purpose here. CI for SYCL+Cuda shows
All that is to say that a timing-based test would be better for testing the |
Also, the test fails for HIP in the CI with something like
which is pretty close to my experience with SYCL. |
This should not even execute for HIP given the hardware in our CI has no proper page migration. Will investigate again. Thought the includeguard on the test would prevent it. |
BLOCKED by #5327 as it would break the CI otherwise |
4c50f94
to
f737917
Compare
Documentation issue #149 |
Okay, I changed the following: Except for The unit test now test for the conditions @crtrott proposed:
Thus we do not evaluate the first access in a new ExecutionSpace and compare the subsequent accesses to the speed of pure local memory. If we detect more than 50% deviation in the memory speed the test fails. |
…here is no migration
Co-authored-by: Phil Miller <pbmille@sandia.gov>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not in detail review the tests, but my concerns are addressed regarding when this is defined.
Retest this please |
core/perf_test/test_sharedSpace.cpp
Outdated
for (unsigned i = 0; i < numDeviceHostCycles; ++i) { | ||
// WARMUP GPU | ||
incrementInLoop<Kokkos::DefaultExecutionSpace>( | ||
deviceData, | ||
numWarmupRepetitions); // warming up gpu | ||
// GET RESULTS DEVICE | ||
deviceResults.push_back(incrementInLoop<Kokkos::DefaultExecutionSpace>( | ||
migratableData, numRepetitions)); | ||
|
||
// WARMUP HOST | ||
incrementInLoop<Kokkos::DefaultHostExecutionSpace>( | ||
hostData, | ||
numWarmupRepetitions); // warming up host | ||
// GET RESULTS HOST | ||
hostResults.push_back(incrementInLoop<Kokkos::DefaultHostExecutionSpace>( | ||
migratableData, numRepetitions)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm reading things wrong, but shouldn't the warmup calls here both access migratableData
- i.e. pull the pages to the space being measured?
Or is the warmup being performed here the clock-speed warmup, to make sure that each core is running at full speed? If so, please elaborate the comments within this loop to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed to ensure the core-clock is at max when we do the actual measurement that does include the page migration. Therefore, it should not use the migratableData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hope 3feb4d is helping
Retest this please |
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
Co-authored-by: Bruno Turcksin <bruno.turcksin@gmail.com>
This pr picks up on the discussion in #5193 and provides an alias for a page migrating memory space.
Even though the participants in the discussion preferred the name
DefaultSharedMemorySpace
I proposeMigratingMemorySpace
but have no problems changing that name. The reasoning was the following:Default
as it is to close to the execution space aliases we have (DefaultExecutionSpace
andDefaultHostExecutionSpace
). I wanted to prevent it being mistaken as an execution space, thus it follows the convention we have for the (memory)spaces. Furthermore, if we stick with it being migrating theDefault
looses its meaning as there is only one per backend.Migrating
, as it targets developers who actually want the property of a memory that automatically moves to the device and is accessed locally. I expect they want the same behavior independent of the backend and deliberately choose this (especially given that not all backends support this feature). If they switch to another backend that has no page migration, this should not silently chance behavior.UniversalMemorySpace
. I think this would need a good documentation on what the limitations and restrictions are, but it will be clearer what the cost for the universal accessibility is.If we introduce this alias, we should reconsider removing
Kokkos_ENABLE_CUDA_UVM
as there is now a better way to specify that the user wants page migration and we are moving to a major release which includesHIP
moving out ofExperimental
.The specification we are testing is:
Missing: