Skip to content

Relax the definition of SHMEM_TEAM_SHARED#307

Closed
davidozog wants to merge 1 commit intoopenshmem-org:masterfrom
davidozog:pr/team_shared_definition
Closed

Relax the definition of SHMEM_TEAM_SHARED#307
davidozog wants to merge 1 commit intoopenshmem-org:masterfrom
davidozog:pr/team_shared_definition

Conversation

@davidozog
Copy link
Collaborator

This PR changes SHMEM_TEAM_SHARED to include only a subset of the PEs that are accessible via shmem_ptr, not necessarily all of them. This would better support implementations that track teams with a (start, stride, size) triplet, as opposed to a table/list of PEs.

SHMEM_TEAM_SHARED to include only a subset of the PEs
that are accessible via shmem_ptr, not necessarily all
of them.

Signed-off-by: David Ozog <david.m.ozog@intel.com>
@minsii
Copy link
Collaborator

minsii commented Nov 8, 2019

@davidozog I actually think it might be more straightforward to user if the spec ensures that SHMEM_TEAM_SHARED contains all processes that can be accessed by shmem_ptr. I am not sure how an implementation can benefit from this change. Can you please explain more?

@davidozog
Copy link
Collaborator Author

Whoops, sorry for the slow response @minsii.

Implementations (like the SOS prototype) may prefer to represent teams using only a (start, stride, size) triplet. As of now, the entire teams proposal supports this design, except for SHMEM_TEAM_SHARED, because it may not be representable as a triplet. A simple example when this would occur is in a 6-PE application with PEs {0, 1, 3} all able to access each other via shmem_ptr and {2, 4, 5} accessible via shmem_ptr. My proposed change simply relaxes the constraint that all PEs accessible via shmem_ptr be included in SHMEM_TEAM_SHARED. In this case, the implementation might choose to set SHMEM_TEAM_SHARED to define teams {0, 1} and {2, 4}... or perhaps just {0} and {2}, which is what SOS does as of now.

This does not prevent implementations from including all the accessible PEs if they want to. I hope that the SOS prototype supports arbitrary PE sets eventually, but it would require implementing lookup-table translations between teams and is potentially a bit less efficient.

@minsii
Copy link
Collaborator

minsii commented Nov 13, 2019

Thanks for the explanation. I assume that SOS needs store network address related information for each remote PE, and any communication over SHMEM_TEAM_SHARED still goes to network. Thus, if there is a lookup-table translation required for mapping PE in SHMEM_TEAM_SHARED to the underlying network address, the translation cost has to be added onto every operation. Is my understanding correct?

@davidozog
Copy link
Collaborator Author

Yes, that's pretty much correct (except in the case when we enable XPMEM or CMA, then communication may not go through the network).

I was also referring to the efficiency of the shmem_team_translate_pe routine (if that matters). With triplet representations, this translation is just a little arithmetic, but with arbitrary team sets, it would need to be a lookup table.

@minsii
Copy link
Collaborator

minsii commented Nov 14, 2019

I do not think the cost of shmem_team_translate_pe really matters for applications. The user program usually calls it (at least in MPI applications AFAIK) at non-performance-critical path and it is likely only a one-time cost. For instance, the translated PEs are stored in an array for data transfer operations.

@nspark
Copy link
Contributor

nspark commented Nov 14, 2019

I'm sure it depends on the implementation, but the translation cost may also affect RMA and AMO operations if the implementation needs to use the world-based PE numbers. We already know we need table lookups for RKEYs and whatnot; the current rigidity of the teams structure (linear-stride PE triplet) is intended to minimize the overhead of these translations.

I would expect that at implementation with an RKEY table indexes this table by world-based PE numbers, and point-to-point operations on a team would need to know the world-based PE number to provide such an index. In that case, the translation overhead would appear in the critical path for low-latency operations. #JustAUserNotAnImplementer

@davidozog
Copy link
Collaborator Author

Yes - on the critical path of every RMA and AMO operation (outside the default context), the SOS prototype does a translation to TEAM_WORLD's index , which is simply team->start + team->stride * pe (assuming the team has a triplet representation). I'm not sure how much message rate would be affected by doing a lookup instead of the FMA here, but it's nice to have the option to keep the triplet approach even if all PEs accessible via shmem_ptr aren't included in TEAM_SHARED.

This is a little different than shmem_team_translate_pe... Sorry, I probably shouldn't have used that name in my comment above. I do agree that applications will rarely call this routine and it's not even on the critical path.

@minsii
Copy link
Collaborator

minsii commented Nov 14, 2019

I believe the lookup table overhead can be nonnegligible on issuing side (we have to pay the cost in MPI in some cases :-( ). Anyway, I agree that the spec should allow the implementation to define the scope of TEAM_SHARED that can provide better performance.

However, from a user's standpoint, is it confusing that both TEAM_SHARED and shmem_ptr provide the scope of load/store accessible PEs but TEAM_SHARED might be only a subset of the returned scope of shmem_ptr? Actually, according to the definition of shmem_ptr (section 9.1.8) , I think the scope of shmem_ptr is also implementation-defined. E.g., it is valid to return NULL for a remote PE even if it internally allocates symm heap on the same memory region shared with the local PE. The implementation defines whether a remote data object is load/store accessible. Please correct me if my understanding is wrong.

If the above understanding is correct, I'd think it is better to have a consistent PE scope for both routines.

@naveen-rn
Copy link
Contributor

Initially, the motivation for this relaxation made sense to me. But, I'm getting confused. To me, there is enough leeway in the current specification to implement a complete lookup table free PE subset list for all possible configurations on every team.

Yes - on the critical path of every RMA and AMO operation (outside the default context), the SOS prototype does a translation to TEAM_WORLD's index

@davidozog Are you saying that even for RMA and AMO operations performed through SHMEM_TEAM_SHARED and all its child teams - you need to convert to TEAM_WORLD's index? AFAIU, this is an implementation design choice and is not an absolute requirement.

I can understand this argument for shmem_team_translate_pe. I assume that we all agree that it doesn't matter, as it is not in the performance critical path.

@davidozog
Copy link
Collaborator Author

@naveen-rn Sorry for the confusion - I don't believe this discussion regarding the performance impact of PE index translation is a strong motivation for the change. To me, it's is a pertinent side-discussion.

The primary motivation for the proposal is that the definition of SHMEM_TEAM_SHARED may unnecessarily burden implementations to support arbitrary team sets. For instance, most (perhaps all) SOS collectives are implemented around the assumption that the active set has a linear stride. It would take an unreasonable amount of development to support collectives on arbitrary PE sets, and the benefit of implementing this would be very small. In practice, I don't believe this code would ever be exercised, unless a user passes a strange custom hostfile to the process launcher or in some other weird scenario...

I think @minsii is correct that implementations do have the freedom to change the behavior of shmem_ptr to include only PEs that have a regular stride, and that is a more realistic fix to the problem for implementation like SOS. So maybe this is a question of whether users would prefer to limit PE accessibility via shmem_ptr to directly reflect SHMEM_TEAM_SHARED membership, or do users prefer shmem_ptr to have access to as many PEs as possible at the cost of a relatively smaller SHMEM_TEAM_SHARED in the "weird scenarios" I mentioned. My opinion is that we should do the latter, at least until the teams API itself supports team creation with arbitrary PE sets. I do think advanced users may want arbitrary teams, but the proposal should not require it by way of SHMEM_TEAM_SHARED of all things.

Anyway, regarding Naveen's question:

@davidozog Are you saying that even for RMA and AMO operations performed through SHMEM_TEAM_SHARED and all its child teams - you need to convert to TEAM_WORLD's index? AFAIU, this is an implementation design choice and is not an absolute requirement.

I'm not saying "you" need to do it, only that SOS currently does it. It's definitely not an absolute requirement.

@minsii
Copy link
Collaborator

minsii commented Nov 14, 2019

So maybe this is a question of whether users would prefer to limit PE accessibility via shmem_ptr to directly reflect SHMEM_TEAM_SHARED membership, or do users prefer shmem_ptr to have access to as many PEs as possible at the cost of a relatively smaller SHMEM_TEAM_SHARED in the "weird scenarios" I mentioned

Either approach is fine with me, if I was a user. If we go with the latter, I think the spec should clearly define the expected usage of the two routines in order to avoid any confusion.

@naveen-rn
Copy link
Contributor

The collectives requirement (overlapping AS and Teams) make sense.

@jdinan
Copy link
Collaborator

jdinan commented Nov 20, 2019

I suggest that we do a quick straw poll. Please vote for your preference, as follows:

👍 -- SHMEM_TEAM_SHARED and shmem_ptr must match (keep the current semantic). This may require an implementation to restrict shmem_ptr to what can be captured in SHMEM_TEAM_SHARED in implementations that are strictly <start, stride, size>.

👎 -- SHMEM_TEAM_SHARED is a subset of shmem_ptr. This provides more flexibility to implementors and users. In most cases, user complexity is opt-in if apps choose to discover more shared memory than is captured by SHMEM_TEAM_SHARED.

@tonycurtis
Copy link

tonycurtis commented Nov 20, 2019 via email

@minsii
Copy link
Collaborator

minsii commented Nov 20, 2019

I voted 👍 SHMEM_TEAM_SHARED and shmem_ptr must match for consistent semantics. The latter might be useful for implementors but I do not see how it could help users. Do we expect that the two APIs benefit for different user scenarios?

@naveen-rn
Copy link
Contributor

  1. I agree with @minsii statement. In general, I would prefer to match the SHMEM_TEAM_SHARED with shmem_ptr. To make the shared team usable, it is not necessary that all PEs which are part of the same node that is capable of returning a non-NULL for shmem_ptr be part of the shared team. Instead, PEs which are part of the shared team should be the only ones which returns non-NULL for shmem_ptr. This way users can reliably use the shared team to get the shmem_ptr information. But, implementations can decide that based on the job configuration.

  2. Also, I would prefer to have a new API like shmem_team_ptr or update the definition of the shmem_ptr to use the index from SHMEM_TEAM_SHARED to retrieve the local ptr to make it usable.

@jdinan
Copy link
Collaborator

jdinan commented Dec 2, 2019

@naveen-rn It sounds like you are advocating for SHMEM_TEAM_SHARED <= shmem_ptr, but you voted for SHMEM_TEAM_SHARED == shmem_ptr.

@jdinan jdinan mentioned this pull request Dec 2, 2019
@naveen-rn
Copy link
Contributor

@jdinan Ah, I see the issue. I'm not sure, how to vote. May be I should clarify my previous statement with an example. Let us take an example job config where the following PEs share the same node: 0, 1, 7, 8, 100, 101 and all these PEs are capable of accessing other PEs SHEAP using shmem_ptr.

As per the options to vote:

  1. In SHMEM_TEAM_SHARED == shmem_ptr, there will be only one SHARED team with all the PEs part of the same team.
  2. In SHMEM_TEAM_SHARED <= shmem_ptr, there is a possibility that implementations could have three SHARED teams {0, 1}, {7, 8}, and {100, 101}. And, also shmem_ptr will return a non-NULL when any PE in the node is trying to access other PEs in the node.
  3. I think, my expectation would be where there could still be three SHARED teams {0, 1}, {7, 8}, and {100, 101}, but only PEs which are part of the same SHARED team will be able to return a non-NULL pointer for shmem_ptr. For example, if PE 0 uses shmem_ptr on either PE 7, 8, 100 or 101 - it will not return a NULL value. Only PE 1 will return a non-NULL for PE 0.

@shamisp
Copy link
Contributor

shamisp commented Dec 3, 2019

SHMEM_TEAM_SHARED == shmem_ptr

The relaxed option can help optimize OpenSHMEM implementation for a use case that I would identify as unusual for this community.

@minsii
Copy link
Collaborator

minsii commented Dec 3, 2019

@naveen-rn 's example captures my opinion exactly!

@jdinan
Copy link
Collaborator

jdinan commented Dec 4, 2019

Thank you all for the discussion and helpful examples. Looks like we have 3 votes for each option. A split like this doesn't seem to support a change to the existing text. Anybody have a case we should think about that would better motivate the change to SHMEM_TEAM_SHARED <= shmem_ptr?

@tonycurtis
Copy link

tonycurtis commented Dec 6, 2019 via email

@davidozog
Copy link
Collaborator Author

@tonycurtis - you raise a good point.

During the RMA WG call yesterday, we discussed similar issues with the definition of SHMEM_TEAM_SHARED. Let me capture a few observations we made:

  1. shmem_ptr may not have symmetry with regard to accessibility. For instance, PE 1 may return non-null values from shmem_ptr when passing PE 2, but perhaps PE 2 may return NULL values when passing PE 1. Mustn't we require shmem_ptr to have such symmetry for the definition of SHMEM_TEAM_SHARED to make sense?
  2. shmem_ptr may not return a consistent (null or non-null) value for ALL symmetric objects. For instance, a PE 1 might return non-null for object A on PE 2, but NULL for object B on PE 2. In particular, @jdinan says that OpenMPI-SHMEM returns null for objects on the data segment and non-null for objects on the symmetric heap. Our definition of SHMEM_TEAM_SHARED does not account for this, and to fix it, we need to decide whether shmem_ptr must return non-null for "all" or "at least one" symmetric object(s). (Side comment that may be obvious - it's also possible for PE 1 to return non-null for PE 2 on object A and object B, but for PE 2 to return non-null for PE 1 on object A, but not object B.)
  3. Similar to what @tonycurtis said - @manjugv argued that there is value in a predefined team with "host" or "node" wide accessibility. IIUC, directly mapping the SHMEM_TEAM_SHARED definition to shmem_ptr may prohibit implementations from exploiting such locality.

Anyway, I'm not sure where these observations take us, but it makes me leery of defining SHMEM_TEAM_SHARED in terms of shmem_ptr as it is now. I think we need to discuss further...

@tonycurtis
Copy link

tonycurtis commented Dec 7, 2019 via email

@davidozog
Copy link
Collaborator Author

Closing this PR, which is superseded by #325.

@davidozog davidozog closed this Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants