-
Notifications
You must be signed in to change notification settings - Fork 15k
[OpenMP][Offload] Handle non-memberof present/to/from entries irrespective of order.
#165494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…espective of order. For cases like: ```c map(alloc: x) map(to: x) ``` If the entry of `map(to: x)` is encountered after the entry for `map(alloc:x)`, we still want to do a data-transfer even though the ref-count of `x` was already 0, because the new allocation for `x` happened as part of the current directive. Similarly, for: ```c ... map(alloc: x) map(from: x) ``` If the entry for `map(from:x)` is encountered before the entry for `map(alloc:x)`, we want to do a data-transfer even though the ref-count was not 0 when looking at the `from` entry, because by the end of the directive, the ref-count of `x` will go down to zero. And for: ```c ... map(from : x) map(alloc, present: x) ``` If the "present" entry is encountered after the "from" entry, then it becomes a no-op, as the "from" entry will do an allocation if no match was found. In this PR, these are handled by the runtime via the following: * For `to` and `present`, we also look-up in the existing table where we tracked new allocations when making the decision for the entry. * For `from`, we keep track of any deferred data transfers and when the ref-count of a pointer goes to zero, see if there were any previously deferred `from` transfers for that pointer. This can be done in the compiler, and that would avoid any runtime overhead, but it would require creating two separate offload struct entries for the entry and exit mappings (even for the `target` construct), with properly decayed maps, and either: (1) sorted in order of: * `present > to > ...` for the implied `target enter data`; and * `from > ...` for the `target exit data` e.g. ```c #pragma omp target map(to: x) map(present, alloc: x) map(always, from: x) // has to be broken into: // from becomes alloc on entry: // #pragma omp target enter data map(present, alloc: x) // map(to: x) // map(alloc: x) // // "present" and "to" just "decay" into "alloc" // #pragma omp target exit data map(always, from: x) // map(alloc: x) // map(alloc: x) ``` Or, (2) Merged into one entry each on the `target enter/exit data` directives. ```c #pragma omp target map(to: x) map(present, alloc: x) map(always, from: x) // has to be broken into: // from becomes alloc on entry: // #pragma omp target enter data map(present, to: x) // // "present" and "to" just "decay" into "alloc" // #pragma omp target exit data map(always, from: x) ``` The number of entries on the two would need to stay the same on the two to avoid ref-count mismatch. (1) would be simpler, but won't likely work for cases like: ```c ... map(delete: x) map(from:x) ``` as there is no clear "winner" between the two. So, for such cases, the compiler would likely have to do (2), which is the cleanest solution, but will take longer to implement. For EXPR comparisons, it can build-upon the `AttachPtrExprComparator` that was implemented as part of llvm#153683, but that should probably wait for the PR to be merged to avoid conflicts. Another alternative is to sort the entries in the runtime, which may be slower than on-demand lookups/updates that this PR does, because we always would be doing this sorting even when not needed, but may be faster in others where the constant-time overhead of map/set insertions/lookups becomes too large because of the number of maps. But that will still have to worry about the `from` + `delete` case.
present/to/from entries irrespective of order.
I thought that the re-ordering may be an equivalent but simpler approach. But I think that it may not be. The issue for target entry, for example:
It's questionable if the re-ordering approach is still simpler if we need to check for overlapping memory. |
|
We don't have to worry about cases like OpenMP 6.0 page 286: However, the following is allowed. @dreachem might be able to confirm: int *p = ...;
#pragma omp target map(to : p[0:4]) map(present, from: p[0:4]) map(delete, storage: p[:])Which can be handled by re-ordering for the // (3.1)
#pragma omp target_enter_data map(present, from: p[0:4]) // "from" decays into "storage"
map(to: p[0:4]) // No transfer because "present"
map(delete, storage: p[:])) // No-op
// ...
// (3.2)
#pragma omp target_exit_data map(delete, storage: p[:]))
map(present, from: p[0:4])
map(to: p[0:4]) // "to" decays into "storage"There is no clear way to order "delete" vs "from" if they are on different maps. This can still be handled by the compiler like the following (or in the runtime by maintaining state, like this PR), but this case makes it more complicated for the compiler since the expressions are not identical. // Possible compiler codegen
// (4.1)
#pragma omp target_enter_data map(present, to: p[0:4]) // use list-item from "to"
// and add "present".
...
// (4.2)
#pragma omp target_exit_data map(present, delete, from: p[0:4])) // Use list-item from "from"
// and add "present", "delete"Another thing to think of, but maybe from the spec's perspective, is that should the modifiers from a For the following: int x;
int main(void) {
#pragma omp target map(delete, to: x) map(present, from: x)
;
}@dreachem, @mjklemm, is this expected to run OK? If we decay // Applying "present" to "enter_data" would cause runtime-error
// (5.1)
#pragma omp target_enter_data(present, to: x)
...
// (5.2)
#pragma omp target_exit_data(present, delete, from: x)But the user may be expecting to only do a present-check on // Ignoring modifiers from the decayed-to-storage clause.
// (6.1)
#pragma omp target_enter_data(to: x) // delete ignored for enter_data
...
// (6.2)
#pragma omp target_exit_data(present, from: x)But if we do so, then |
Hm, is this also true if the alloc map exists due to an implicit mapping? |
Why wouldn't |
The "delete" should just be handled at the end, after everything else. I'm not too familiar with the LLVM runtime implementation, so it's not clear to me what the technical difficulty is here. Is there a reason the algorithm in the map clause section (based on "mappable storage blocks") can't be used? It does require grouping individual maps into sets that apply to a common storage block. These are the pertinent sentences:
In general, the runtime would need to be involved in some of the matching because we allow overlapping storage for assumed-size arrays and use_device_addr. Other than these cases, the compiler should in theory be able to group the maps into sets where each set applies to the same storage block. |
So, you are describing For Let's suppose the construct was instead Here, it looks like we hit a possible unintended discrepancy. When we decided to make |
|
Tagging @ro-i for awareness. |
For cases like:
If the entry of
map(to: x)is encountered after the entry formap(alloc:x), we still want to do a data-transfer even though the ref-count ofxwas already 0, because the new allocation forxhappened as part of the current directive.Similarly, for:
If the entry for
map(from:x)is encountered before the entry formap(alloc:x), we want to do a data-transfer even though the ref-count was not 0 when looking at thefromentry, because by the end of the directive, the ref-count ofxwill go down to zero.For:
If the "present" entry is encountered after the "from" entry, then it becomes a no-op, as the "from" entry will do an allocation if no match was found.
And for the following:
We need to make sure that only one data transfer for
fromhappens, and it happens even whendeleteis encountered beforefrom.In this PR, these are handled by the runtime via the following:
toandpresent, we also look-up in the existing table where we tracked new allocations when making the decision for the entry.from, we keep track of any deferred data transfers and when the ref-count of a pointer goes to zero, see if there were any previously deferredfromtransfers for that pointer.fromanddelete+fromcases, we keep track of any successfuldeleteentries andfromtransfers.This can be done in the compiler, and that would avoid any runtime overhead, but it would require creating two separate offload struct entries for the entry and exit mappings (even for the
targetconstruct), with properly decayed maps, and either:(1) Sorted in the following order:
present > to/tofrom > ...for the impliedtarget enter data; andfrom/tofrom > ...for thetarget exit datae.g.Or,
(2) Merged into one entry each on the
target enter/exit datadirectives.The number of entries on the two would need to stay the same on the two to avoid ref-count mismatch.
(1) would be simpler, but won't work for cases like:
as there is no clear "winner" between the two. So, for such cases, the compiler would likely have to do (2), which is the cleanest solution, but will take longer to implement. For EXPR comparisons, it can build-upon the
AttachPtrExprComparatorthat was implemented as part of #153683, but that should probably wait for the PR to be merged to avoid conflicts.Another alternative is to sort the entries in the runtime, which may be slower than on-demand lookups/updates that this PR does, because we always would be doing this sorting even when not needed, but may be faster in others where the constant-time overhead of map/set insertions/lookups becomes too large because of the number of maps. But that will still have to worry about the
from+deletecase.