rmaps/ppr: Fix case where oversubscribe is ignored. #1327

Merged 1 commit on Apr 7, 2022


@awlauria awlauria commented Apr 5, 2022

There was a case where prte was returning an error
on a failure to map even though the user passed
:OVERSUBSCRIBE to the command line.
Introduced by: 8a3e938

The fix is simple: check whether :OVERSUBSCRIBE is set
in the run command and, if so, do not return an error.

Refs: open-mpi/ompi#10216

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

@awlauria awlauria requested a review from rhc54 April 5, 2022 20:51

@rhc54 rhc54 left a comment

I don't think this actually fixes the error you reference - e.g., it doesn't fix the cmd line in the deprecated case (the first case the user reported). I'm also not sure it really fixes the oversubscribed case - it will suppress the error, but I'm not sure it actually maps correctly.

What is the value of node->slots_available in this scenario? I suspect it is zero, which means the entire loop is getting ignored and we simply drop down to the "overflow" logic.

awlauria commented Apr 6, 2022

@rhc54 I verified it produced the same results as the OMPI v4.1 branch with orte. This code/logic is basically the same as that (minus the change referenced in this commit). So I am just restoring old behavior there. Now, maybe that old behavior is incorrect. You are right in that this doesn't address the command line, but that can come in a separate commit/fix.

Here is the output from v4.1 on an init/finalize program:

./exports/bin/mpirun -n 100 --map-by ppr:100:node:OVERSUBSCRIBE --report-bindings init_finalize
...
[c685f8n02:2553223] MCW rank 36 is not bound (or bound to all available processors)
[c685f8n02:2553212] MCW rank 29 is not bound (or bound to all available processors)
[c685f8n02:2553202] MCW rank 27 is not bound (or bound to all available processors)
[c685f8n02:2553211] MCW rank 28 is not bound (or bound to all available processors)
[c685f8n02:2553173] MCW rank 0 is not bound (or bound to all available processors)
[c685f8n02:2553176] MCW rank 3 is not bound (or bound to all available processors)
[c685f8n02:2553255] MCW rank 46 is not bound (or bound to all available processors)
[c685f8n02:2553178] MCW rank 5 is not bound (or bound to all available processors)
[c685f8n02:2553183] MCW rank 10 is not bound (or bound to all available processors)
[c685f8n02:2553210] MCW rank 26 is not bound (or bound to all available processors)
[c685f8n02:2553191] MCW rank 18 is not bound (or bound to all available processors)
[c685f8n02:2553186] MCW rank 13 is not bound (or bound to all available processors)
[c685f8n02:2553232] MCW rank 40 is not bound (or bound to all available processors)
[c685f8n02:2553184] MCW rank 11 is not bound (or bound to all available processors)
[c685f8n02:2553189] MCW rank 16 is not bound (or bound to all available processors)
[c685f8n02:2553177] MCW rank 4 is not bound (or bound to all available processors)
[c685f8n02:2553218] MCW rank 32 is not bound (or bound to all available processors)
[c685f8n02:2553221] MCW rank 35 is not bound (or bound to all available processors)
[c685f8n02:2553219] MCW rank 33 is not bound (or bound to all available processors)
[c685f8n02:2553224] MCW rank 37 is not bound (or bound to all available processors)
[c685f8n02:2553264] MCW rank 48 is not bound (or bound to all available processors)
[c685f8n02:2553195] MCW rank 21 is not bound (or bound to all available processors)
[c685f8n02:2553247] MCW rank 42 is not bound (or bound to all available processors)
[c685f8n02:2553272] MCW rank 53 is not bound (or bound to all available processors)
[c685f8n02:2553182] MCW rank 9 is not bound (or bound to all available processors)
[c685f8n02:2553301] MCW rank 63 is not bound (or bound to all available processors)
[c685f8n02:2553190] MCW rank 17 is not bound (or bound to all available processors)
[c685f8n02:2553228] MCW rank 39 is not bound (or bound to all available processors)
[c685f8n02:2553237] MCW rank 41 is not bound (or bound to all available processors)
[c685f8n02:2553213] MCW rank 31 is not bound (or bound to all available processors)
[c685f8n02:2553174] MCW rank 1 is not bound (or bound to all available processors)
[c685f8n02:2553261] MCW rank 49 is not bound (or bound to all available processors)
[c685f8n02:2553187] MCW rank 14 is not bound (or bound to all available processors)
[c685f8n02:2553257] MCW rank 47 is not bound (or bound to all available processors)
[c685f8n02:2553180] MCW rank 7 is not bound (or bound to all available processors)
[c685f8n02:2553252] MCW rank 44 is not bound (or bound to all available processors)
[c685f8n02:2553230] MCW rank 38 is not bound (or bound to all available processors)
[c685f8n02:2553203] MCW rank 25 is not bound (or bound to all available processors)
[c685f8n02:2553175] MCW rank 2 is not bound (or bound to all available processors)
[c685f8n02:2553299] MCW rank 61 is not bound (or bound to all available processors)
[c685f8n02:2553192] MCW rank 19 is not bound (or bound to all available processors)
[c685f8n02:2553194] MCW rank 23 is not bound (or bound to all available processors)
[c685f8n02:2553193] MCW rank 20 is not bound (or bound to all available processors)
[c685f8n02:2553293] MCW rank 59 is not bound (or bound to all available processors)
[c685f8n02:2553304] MCW rank 62 is not bound (or bound to all available processors)
[c685f8n02:2553188] MCW rank 15 is not bound (or bound to all available processors)
[c685f8n02:2553285] MCW rank 57 is not bound (or bound to all available processors)
[c685f8n02:2553185] MCW rank 12 is not bound (or bound to all available processors)
[c685f8n02:2553241] MCW rank 43 is not bound (or bound to all available processors)
[c685f8n02:2553271] MCW rank 51 is not bound (or bound to all available processors)
[c685f8n02:2553254] MCW rank 45 is not bound (or bound to all available processors)
[c685f8n02:2553300] MCW rank 60 is not bound (or bound to all available processors)
[c685f8n02:2553222] MCW rank 34 is not bound (or bound to all available processors)
[c685f8n02:2553179] MCW rank 6 is not bound (or bound to all available processors)
[c685f8n02:2553196] MCW rank 22 is not bound (or bound to all available processors)
[c685f8n02:2553214] MCW rank 30 is not bound (or bound to all available processors)
[c685f8n02:2553181] MCW rank 8 is not bound (or bound to all available processors)
[c685f8n02:2553279] MCW rank 54 is not bound (or bound to all available processors)
[c685f8n02:2553294] MCW rank 58 is not bound (or bound to all available processors)
[c685f8n02:2553282] MCW rank 55 is not bound (or bound to all available processors)
[c685f8n02:2553276] MCW rank 52 is not bound (or bound to all available processors)
[c685f8n02:2553375] MCW rank 69 is not bound (or bound to all available processors)
[c685f8n02:2553374] MCW rank 74 is not bound (or bound to all available processors)
[c685f8n02:2553288] MCW rank 56 is not bound (or bound to all available processors)
[c685f8n02:2553400] MCW rank 91 is not bound (or bound to all available processors)
[c685f8n02:2553385] MCW rank 84 is not bound (or bound to all available processors)
[c685f8n02:2553381] MCW rank 79 is not bound (or bound to all available processors)
[c685f8n02:2553397] MCW rank 92 is not bound (or bound to all available processors)
[c685f8n02:2553373] MCW rank 71 is not bound (or bound to all available processors)
[c685f8n02:2553411] MCW rank 98 is not bound (or bound to all available processors)
[c685f8n02:2553379] MCW rank 73 is not bound (or bound to all available processors)
[c685f8n02:2553369] MCW rank 68 is not bound (or bound to all available processors)
[c685f8n02:2553393] MCW rank 87 is not bound (or bound to all available processors)
[c685f8n02:2553378] MCW rank 78 is not bound (or bound to all available processors)
[c685f8n02:2553392] MCW rank 88 is not bound (or bound to all available processors)
[c685f8n02:2553368] MCW rank 67 is not bound (or bound to all available processors)
[c685f8n02:2553395] MCW rank 85 is not bound (or bound to all available processors)
[c685f8n02:2553367] MCW rank 66 is not bound (or bound to all available processors)
[c685f8n02:2553371] MCW rank 70 is not bound (or bound to all available processors)
[c685f8n02:2553372] MCW rank 72 is not bound (or bound to all available processors)
[c685f8n02:2553370] MCW rank 65 is not bound (or bound to all available processors)
[c685f8n02:2553384] MCW rank 83 is not bound (or bound to all available processors)
[c685f8n02:2553410] MCW rank 95 is not bound (or bound to all available processors)
[c685f8n02:2553402] MCW rank 94 is not bound (or bound to all available processors)
[c685f8n02:2553417] MCW rank 99 is not bound (or bound to all available processors)
[c685f8n02:2553396] MCW rank 90 is not bound (or bound to all available processors)
[c685f8n02:2553386] MCW rank 81 is not bound (or bound to all available processors)
[c685f8n02:2553377] MCW rank 75 is not bound (or bound to all available processors)
[c685f8n02:2553376] MCW rank 76 is not bound (or bound to all available processors)
[c685f8n02:2553412] MCW rank 93 is not bound (or bound to all available processors)
[c685f8n02:2553406] MCW rank 96 is not bound (or bound to all available processors)
[c685f8n02:2553404] MCW rank 89 is not bound (or bound to all available processors)
[c685f8n02:2553366] MCW rank 64 is not bound (or bound to all available processors)
[c685f8n02:2553413] MCW rank 97 is not bound (or bound to all available processors)
[c685f8n02:2553380] MCW rank 80 is not bound (or bound to all available processors)
[c685f8n02:2553390] MCW rank 86 is not bound (or bound to all available processors)
[c685f8n02:2553383] MCW rank 77 is not bound (or bound to all available processors)
[c685f8n02:2553382] MCW rank 82 is not bound (or bound to all available processors)
[awlauria@c685f8n02 v4.1.x]$ 

With this patch on ompi main:

[awlauria@c685f8n02 master]$ ./exports/bin/mpirun -n 100 --map-by ppr:100:node:OVERSUBSCRIBE --report-bindings init_finalize
[c685f8n02:2557476] MCW rank 0 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 1 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 4 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 9 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 7 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 5 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 10 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 11 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 8 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 12 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 6 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 13 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 18 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 17 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 19 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 22 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 24 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 25 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 29 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 2 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 20 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 23 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 30 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 31 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 42 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 3 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 15 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 21 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 34 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 36 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 37 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 43 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 14 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 26 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 16 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 33 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 28 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 38 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 32 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 35 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 44 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 27 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 45 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 40 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 39 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 46 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 41 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 47 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 49 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 48 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 52 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 54 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 55 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 58 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 59 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 57 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 51 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 63 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 67 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 50 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 66 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 62 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 53 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 70 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 71 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 60 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 61 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 72 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 73 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 74 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 56 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 68 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 69 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 64 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 78 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 80 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 65 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 75 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 82 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 77 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 83 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 81 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 76 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 79 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 85 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 84 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 86 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 88 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 87 is not bound (or bound to all available processors)
[c685f8n02:2557476] MCW rank 89 is not bound (or bound to all available processors)
...

If the old behavior is incorrect, I can try to fix the logic to be what it should be.

rhc54 commented Apr 6, 2022

I'm not saying the old behavior is incorrect. I'm only pointing out that I didn't change the logic you reference and now modify. The reason it is now generating an error is because slots_available is probably zero. If you then go down that code path, you'll see that the loop does absolutely nothing, and thus you fall down into the "overflow" logic down below.

In other words, your "fix" just disables the loop without generating an error. This seems kinda weird and likely to cause problems in the future. It also makes me wonder if we have other problems that will show as soon as someone types a slightly different cmd line.

The critical question is: is slots_available zero? If so, then we need to look at the logic - the correct answer may well be to set that variable to one in the case where oversubscribe is allowed.
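
The failure mode described above can be illustrated with a minimal sketch. This is not the PRRTE source: `demo_node_t`, `demo_fill_slots`, and `demo_map_node` are hypothetical names, and the structure is reduced to the two counters needed to show the control flow. The point is that when `slots_available` is zero the placement loop never executes, so every proc is left for the "overflow" step.

```c
#include <stdbool.h>

/* Hypothetical, simplified node record; not PRRTE's actual structure. */
typedef struct {
    int slots_available; /* free slots the mapper may still fill */
    int num_procs;       /* procs placed on this node so far */
} demo_node_t;

/* Place up to `wanted` procs, one per free slot.
 * Returns how many procs could NOT be placed. */
static int demo_fill_slots(demo_node_t *node, int wanted)
{
    int placed = 0;
    /* If slots_available starts at zero, this loop body never runs. */
    while (placed < wanted && node->slots_available > 0) {
        node->slots_available--;
        node->num_procs++;
        placed++;
    }
    return wanted - placed;
}

/* Map `wanted` procs onto one node. Leftover procs go through an
 * "overflow" step that succeeds only when oversubscription is allowed.
 * Returns 0 on success, -1 on error. */
static int demo_map_node(demo_node_t *node, int wanted, bool oversubscribe_ok)
{
    int leftover = demo_fill_slots(node, wanted);
    if (leftover == 0) {
        return 0;
    }
    if (!oversubscribe_ok) {
        return -1; /* cannot exceed the slot count */
    }
    node->num_procs += leftover; /* overflow placement */
    return 0;
}
```

In this sketch, a node with 44 free slots that is asked for 100 procs with oversubscription allowed places 44 procs in the loop and the remaining 56 in the overflow step. A node with zero free slots places all of them in the overflow step.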

awlauria commented Apr 6, 2022

@rhc54 no - slots available is 44.

$ ./exports/bin/mpirun -n 100 --map-by ppr:100:node:OVERSUBSCRIBE --report-bindings true 2>&1 | grep slots
node -> slots_available = 44

rhc54 commented Apr 7, 2022

I really had to scratch my head over this one - the change I made couldn't possibly have resulted in the change in behavior being reported in the cited OMPI issue. I had to go back and look at the difference between what is currently in OMPI v4.x vs what is in PRRTE - and there it is. The difference is simply that OMPI is wrong - it doesn't check for oversubscribe conditions at all and just puts the procs on the node, regardless of the number of slots.

PRRTE had been changed to detect oversubscription and error out if that was found, which is correct. However, it should also have checked if the user permitted oversubscribe. So the referenced change you cite above actually didn't cause the problem - PRRTE had this bug for a long time.

This change should fix that problem. You might want to go back and fix the OMPI v4 series as it is incorrect.

rhc54 commented Apr 7, 2022

Actually, I correct that statement. ORTE checks the oversubscribe condition later, after all the procs have been placed on the node. In fact, PRRTE does as well, so the correct fix is to simply remove that code block - PRRTE will detect oversubscribe and deal with it down below.

There are still two problems with the code. First, IIRC even specifying "oversubscribe allowed" cannot override a provided max-slots value. The check in the code block starting at line 380 doesn't seem quite correct in that regard. We should check the other mappers to see how they interpret the max-slots limit to ensure this code is consistent.

Second, it isn't clear to me that we correctly recover the slots if we rule that the job violates oversubscription limits and we terminate the mapping procedure. It looks like "slots_inuse" has been modified, but I don't see any place where it gets reset. Might happen in the state machine when the job gets reported as unable to map and the job object gets cleaned up, but I haven't verified that.

Regardless, the correct solution here is to remove the code block starting at line 315, and then check the max-slots interpretation.
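
As a rough illustration of a check performed after all procs have been placed, here is a minimal sketch. It is not the PRRTE code: `demo_rc_t` and `demo_check_after_placement` are hypothetical, and the convention that `slots_max == 0` means "no hard limit" is an assumption made for this example. It follows the recollection above that an "oversubscribe allowed" setting does not override a provided max-slots value.

```c
#include <stdbool.h>

/* Hypothetical result codes for this sketch only. */
typedef enum {
    DEMO_OK,
    DEMO_ERR_OVERSUBSCRIBED,
    DEMO_ERR_MAX_SLOTS
} demo_rc_t;

/* Check a node once, after every proc has been assigned to it.
 * slots:         number of slots on the node
 * slots_max:     hard upper limit; 0 means "no limit" (assumption)
 * procs_on_node: procs assigned to the node
 */
static demo_rc_t demo_check_after_placement(int slots, int slots_max,
                                            int procs_on_node,
                                            bool oversubscribe_ok)
{
    /* The hard limit applies even when oversubscription is allowed. */
    if (slots_max > 0 && procs_on_node > slots_max) {
        return DEMO_ERR_MAX_SLOTS;
    }
    /* Exceeding the slot count is an error only without permission. */
    if (procs_on_node > slots && !oversubscribe_ok) {
        return DEMO_ERR_OVERSUBSCRIBED;
    }
    return DEMO_OK;
}
```

With this ordering, 100 procs on a 44-slot node pass when oversubscription is allowed and no max-slots value is set, but fail if a max-slots value of, say, 64 is given.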

There was a case where prte was returning an error
on a failure to map even though the user passed
`:OVERSUBSCRIBE` to the command line.
Introduced by: 8a3e938

Refs: open-mpi/ompi#10216

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>

awlauria commented Apr 7, 2022

For what it is worth, the seq (rmaps_seq.c line 334) and rank_file (rmaps_rankfile.c line 315) have the same code block as ppr

rhc54 commented Apr 7, 2022

> For what it is worth, the seq (rmaps_seq.c line 334) and rank_file (rmaps_rankfile.c line 315) have the same code block as ppr

Sigh - okay, I'll take a look at them. Guess I'll also look into the max_slots thing.

@rhc54 rhc54 merged commit 317d42e into openpmix:master Apr 7, 2022
@awlauria awlauria deleted the ppr_oversubscribe branch April 7, 2022 18:48

rhc54 commented Apr 7, 2022

> > For what it is worth, the seq (rmaps_seq.c line 334) and rank_file (rmaps_rankfile.c line 315) have the same code block as ppr
>
> Sigh - okay, I'll take a look at them. Guess I'll also look into the max_slots thing.

I took a look and those files are fine - they check the oversubscription condition on-the-fly. The ppr case assigns all the procs to the node and then checks for oversubscription. The extra code block in ppr was causing it to do the check both on-the-fly and at the end, which is unnecessary.

It makes sense for seq and rank_file to do it on-the-fly because they don't work on a per-node basis, but rather assign one proc at a time to whatever node was given.
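
The two checking styles contrasted above can be sketched side by side. This is an illustration under simplifying assumptions, not the seq, rank_file, or ppr mapper code: `sketch_node_t`, `sketch_map_one_at_a_time`, and `sketch_map_per_node` are hypothetical names, and nodes are reduced to a slot count and a proc count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node record for this sketch only. */
typedef struct {
    int slots; /* slots on the node */
    int procs; /* procs placed so far */
} sketch_node_t;

/* On-the-fly style: each proc names its own target node, so the
 * oversubscription check runs before every single placement.
 * Returns the number of procs placed; stops at the first violation. */
static size_t sketch_map_one_at_a_time(sketch_node_t *nodes,
                                       const size_t *targets,
                                       size_t nprocs,
                                       bool oversubscribe_ok)
{
    for (size_t i = 0; i < nprocs; i++) {
        sketch_node_t *n = &nodes[targets[i]];
        if (n->procs >= n->slots && !oversubscribe_ok) {
            return i; /* this proc would oversubscribe its node */
        }
        n->procs++;
    }
    return nprocs;
}

/* Per-node style: assign all `ppn` procs to the node first, then
 * check once. Returns true if the result is acceptable. */
static bool sketch_map_per_node(sketch_node_t *node, int ppn,
                                bool oversubscribe_ok)
{
    node->procs += ppn;
    return node->procs <= node->slots || oversubscribe_ok;
}
```

Doing both checks in the per-node path would be redundant, since the single check at the end already covers every proc assigned to that node.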

rhc54 commented Apr 7, 2022

FWIW: max_slots appears to be treated consistently across the mappers, although I'm not sure I agree with the treatment. Still, it has been that way for quite some time - no point in changing it now.
