Move process name {jobid,vpid} down to the OPAL layer. #261
Conversation
Force-pushed from 992ee45 to 55f0bd3
This change looks extremely invasive, introducing new and extremely specialized containers and forcing their use. Is there any benefit, performance or productivity, to adopting this approach? Personally, I think that the process_name should be an undefined entity, without a clear specification. An accessor to get a rank out of it might be necessary in the current code, but OPAL clearly should not know what a job id is.
I think you may be misunderstanding the change, so let me try to explain the reasoning behind it, and offer a suggestion.

At the very beginning of the Open MPI project, we had a fairly lengthy discussion over the form of the process_name_t. At that time, all parties (including UTK) settled on the solution of a struct containing two 32-bit fields, one for a "jobid" and the other for a "vpid". Our rationale at that time was based on the desire to support heterogeneous situations, and big/little endian systems.

This continued until we reached the point where two things occurred. First, you asked that I move the database framework to OPAL so that each layer could have its own db. When I did that, I created a non-structured "opal_identifier_t" comprised of a flat 64-bit field and mapped it on top of the structured process_name_t used by OMPI. This was a mistake on my part, as it created a number of problems for hetero operations and alignment violations on big endian systems. However, we don't test those environments, and so it went undetected for quite some time. In fairness, Siegmar did point out the issues, and we did try to bandaid them over time - yet maintaining the support proved troublesome, and we frankly misunderstood the root cause of many of his problem reports.

Second, the error was further compounded when we moved the BTLs down to OPAL. At that time, you adopted my opal_identifier_t and extended its use even further, creating an additional ecosystem around it. In addition, we moved several other code areas down to OPAL to support the BTLs, including the modex operations (now embodied in the opal/pmix framework). These areas depend strongly on the process_name_t structure. In fact, we currently have to artificially define the structured form of the name in each of those code areas, populate them, and then memcpy them across into the opal_identifier_t. This is necessary because every resource manager we know about/support passes identifier info to us as a jobid (in various formats) and rank, and so we have to load the identifier using that info. Even with those efforts, we continue to encounter problems with support for hetero operations and big-endian systems.

The simplest solution we can find is to revert back to the original definition of the OMPI process_name_t, acknowledging that our original thinking was correct and that we made a mistake. So rather than this change being something new, it actually is a reversion of a portion of a past commit that has proven to create a problem within the code base.

I think we all understand that you have a desire to use the OMPI code base in some external project where an abstract opal_identifier_t would be helpful. What I would suggest is that you adopt the strategy previously employed in such circumstances, including when I've worked on external projects: add the abstraction on your side of the code. In other words, instead of us continuing to fight the abstraction problem in the OMPI code base where it doesn't really help us, the abstraction should be dealt with in the external project. You would need to write your own wrapper functions to convert back/forth from the opal_process_name_t struct, but that isn't an overly burdensome requirement for re-use of the code base. Note that nothing in OMPI cares about the actual values inside the jobid/vpid fields of the process_name_t - you are free to assign any value you like to them.

We just need that identifier to be structured so we can properly ensure alignment everywhere it is used, and to ensure hetero operations are fully supported in a more easily maintainable way. HTH
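For anyone following along, the hetero point above can be made concrete with a small standalone C sketch. The demo_* names and swap helpers below are illustrative stand-ins, not the actual OPAL declarations or API; the sketch only shows why byte-swapping a flat 64-bit identifier for big/little-endian exchange scrambles the two logical fields, while a structured {jobid, vpid} name lets each 32-bit field be converted independently:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Structured name: two 32-bit fields, as in the original OMPI design.
 * The demo_* names are illustrative, not the actual OPAL declarations. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_process_name_t;

/* Flat identifier: a single 64-bit value, like opal_identifier_t. */
typedef uint64_t demo_identifier_t;

/* Byte-swap helpers (GCC/Clang builtins), used here only for the demo. */
static uint32_t swap32(uint32_t v) { return __builtin_bswap32(v); }
static uint64_t swap64(uint64_t v) { return __builtin_bswap64(v); }

int main(void)
{
    demo_process_name_t name = { .jobid = 42, .vpid = 7 };
    demo_identifier_t id;

    /* The old scheme: memcpy the struct on top of the flat 64-bit field. */
    memcpy(&id, &name, sizeof(id));

    /* Heterogeneous exchange: swapping the flat value as one 8-byte unit
     * reverses all eight bytes, so the jobid bytes land in the vpid half
     * (and vice versa), whereas swapping each 32-bit field independently
     * keeps the two fields where they belong. */
    uint64_t swapped_flat = swap64(id);
    demo_process_name_t swapped_name = { swap32(name.jobid), swap32(name.vpid) };

    printf("flat 64-bit swap : 0x%016" PRIx64 "\n", swapped_flat);
    printf("per-field swap   : jobid=0x%08x vpid=0x%08x\n",
           swapped_name.jobid, swapped_name.vpid);
    return 0;
}
```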
Test FAILed. Build Log
Is there any way to have these not fail when help messages unrelated to the PR cause issues with the detection script (particularly when the script seems to only be checking oshmem...)? Also, the "details" link next to "Build finished" goes to a 404 on bgate.mellanox.com.
yep, will fix it.
ok,
@ggouaillardet Looks like this PR has now become stale (can't be merged automatically). :-( Can you refresh?
I've been working on the PMIx code release some more, and on the next phase of memory footprint reductions. In doing that, I believe I'm converging on a solution that may be more palatable to George while still resolving the hetero and SPARC issues. However, it will still be "extremely invasive", as there is no way to resolve these things without causing disruption somewhere. George's approach is causing significant problems in the RTE-OPAL interface. The currently proposed solution removes those, but shifts the disruption to external re-users of OPAL such as George. The revised approach would also be disruptive, as it will cause significant changes to ORTE and at least some changes in OPAL. However, it would leave us with a non-structured "identifier", although perhaps only a uint32_t instead of a uint64_t to conserve memory. So the question becomes: which of these two paths do we want to take?
Neither of these fixes is going to backport to the 1.8 series, so it is purely a question of setting up for the 1.9 branch. Both paths would land in time for that event.
@jsquyres will do
@ggouaillardet - yep, it failed when I misconfigured jenkins 3 days ago; it was fixed later.
Seems the openib btl is not working with this commit; this command fails:
19:58:59 + timeout -s SIGKILL 3m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Okay - well that is pretty useless since most of us don't have IB. How did it fail? Can you fix it?
The same command line on the "master" branch works as expected.
We could certainly check /* if only you would tell us what the heck is wrong */
I'll do the rebase now and then test helloworld on an IB cluster.
Force-pushed from 55f0bd3 to 804e1d4
@miked-mellanox I rebased, did the required porting, and updated my branch (and hence this PR).
Test FAILed. Build Log
Force-pushed from 804e1d4 to 991d618
I just fixed the remaining error in pmix/s2, rebased, and updated my branch.
Test FAILed. Build Log
It now passes the MPI part (the previous failure also passes now), but fails on the OSHMEM API (oshrun helloworld).
@miked-mellanox I will have a look and make a "blind" port ...
Force-pushed from 991d618 to 58b3e41
"blind" port commited ... |
Force-pushed from 58b3e41 to 668d152
Test FAILed. Build Log
@miked-mellanox @elenash I just fixed a bug that occurs when vader is used without knem, then pushed the fix to master, rebased my branch, and "asked" jenkins to test again.
@ggouaillardet I just ran it on master. Should I check it on your fork? Is this issue in any way related to the issue in oshmem? As I understand it, this issue has been in master since last week.
@elenash is the root cause a very small /tmp filesystem?
@elenash the thing is that @miked-mellanox and I observe different behaviors ...
@ggouaillardet - jenkins is testing again on your command (see the last box in this thread with the jenkins details).
Test FAILed. Build Log
@alex-mikheev @elenash - could you please comment on the reason for the sshmem memheap failure?
@ggouaillardet Did you test it with mxm?
@elenash no (but I'll give it a try shortly)
@ggouaillardet master is working after your fix, thanks!
@ggouaillardet Strangely, I still reproduce the issue with the vader btl on your branch.
@elenash on my side, I get a hang with --mca btl tcp,self on master. I'll write a simpler MPI (i.e., non-oshmem) reproducer and investigate.
The reason is that on master the sshmem verbs memory allocator turns on 'shared_mr' on ConnectIB. I pushed the fix 097b469 into master.
@ggouaillardet - could you please rebase on top of Alex's fix and let jenkins do the rest :)
Force-pushed from 668d152 to 84d94a7
retest this please
Test FAILed. Build Log
Force-pushed from 84d94a7 to 668d152
* opal_process_name_t is now a struct:
      typedef uint32_t opal_jobid_t;
      typedef uint32_t opal_vpid_t;
      typedef struct {
          opal_jobid_t jobid;
          opal_vpid_t vpid;
      } opal_process_name_t;
* new opal_proc_table_t class: this is a hash table (key is jobid) of hash tables (key is vpid) and is used to store per-opal_process_name_t info
* new OPAL_NAME dss type

This commit is co-authored by Ralph and Gilles.
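To make the commit's second bullet concrete, here is a minimal, self-contained C sketch of the two-level lookup idea. The demo_* types and functions are illustrative stand-ins, not the actual opal_proc_table_t API (which, presumably, is built on OPAL's existing hash table class); small bucket arrays with chaining are used only to keep the sketch short:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t opal_jobid_t;
typedef uint32_t opal_vpid_t;

typedef struct {
    opal_jobid_t jobid;
    opal_vpid_t  vpid;
} opal_process_name_t;

#define DEMO_BUCKETS 16

typedef struct demo_entry {              /* inner-table node, key = vpid  */
    opal_vpid_t        vpid;
    void              *value;
    struct demo_entry *next;
} demo_entry_t;

typedef struct demo_job {                /* outer-table node, key = jobid */
    opal_jobid_t     jobid;
    demo_entry_t    *vpids[DEMO_BUCKETS];
    struct demo_job *next;
} demo_job_t;

typedef struct {
    demo_job_t *jobs[DEMO_BUCKETS];      /* outer table, keyed by jobid   */
} demo_proc_table_t;

/* Store a value for {jobid, vpid}: find (or create) the per-job inner
 * table, then insert the vpid entry into it. */
static void demo_proc_table_set(demo_proc_table_t *t,
                                opal_process_name_t name, void *value)
{
    demo_job_t **jslot = &t->jobs[name.jobid % DEMO_BUCKETS];
    demo_job_t *j;
    for (j = *jslot; NULL != j && j->jobid != name.jobid; j = j->next) {
        ;   /* walk the chain looking for this jobid */
    }
    if (NULL == j) {
        j = calloc(1, sizeof(*j));
        j->jobid = name.jobid;
        j->next = *jslot;
        *jslot = j;
    }
    demo_entry_t *e = calloc(1, sizeof(*e));
    e->vpid = name.vpid;
    e->value = value;
    e->next = j->vpids[name.vpid % DEMO_BUCKETS];
    j->vpids[name.vpid % DEMO_BUCKETS] = e;
}

/* Look up per-process data: hash on jobid first, then on vpid. */
static void *demo_proc_table_get(demo_proc_table_t *t, opal_process_name_t name)
{
    for (demo_job_t *j = t->jobs[name.jobid % DEMO_BUCKETS]; j; j = j->next) {
        if (j->jobid != name.jobid) continue;
        for (demo_entry_t *e = j->vpids[name.vpid % DEMO_BUCKETS]; e; e = e->next) {
            if (e->vpid == name.vpid) return e->value;
        }
    }
    return NULL;   /* nothing stored for this process name */
}

int main(void)
{
    demo_proc_table_t table = { { NULL } };
    opal_process_name_t peer = { .jobid = 1234, .vpid = 3 };
    demo_proc_table_set(&table, peer, "per-process data for rank 3");
    printf("stored: %s\n", (char *)demo_proc_table_get(&table, peer));
    return 0;
}
```

One property of the two-level layout, in the sketch as presumably in the real class, is that everything stored for a given jobid hangs off a single outer entry, so per-job cleanup is straightforward.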
Force-pushed from 668d152 to ddb72d1
retest this please
Test FAILed. Build Log
Test FAILed. Build Log
retest this please
Test PASSed.
Committed by Ralph per telecon discussion on 11/11/2014. |