-
Notifications
You must be signed in to change notification settings - Fork 934
orte/util: Extend nidmap support for hostnames with sets of digits #4689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* The compression failed for hostnames of the form `c712f6n02` which have multiple sets of digits separated by other characters. So add a reverse prefix finding mechanism for the prefix. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
464e65a to
1d0cc65
Compare
|
For the reviewers:
|
|
My head hurts - all these squirrelly naming conventions make it hard to come up with any foolproof regex algorithm, I fear. It seems to me that this one will fail if the variation at the end isn't digits, but letters. In other words, if we have "prefix01a,prefix01b,prefix01c", then this algorithm won't find the commonality in any of them - yes? |
|
me too ... a simple fix (i did not try yet ...) could be diff --git a/orte/util/nidmap.c b/orte/util/nidmap.c
index fa05f68..0f50772 100644
--- a/orte/util/nidmap.c
+++ b/orte/util/nidmap.c
@@ -13,7 +13,7 @@
* Copyright (c) 2012-2014 Los Alamos National Security, LLC.
* All rights reserved.
* Copyright (c) 2013-2017 Intel, Inc. All rights reserved.
- * Copyright (c) 2014-2017 Research Organization for Information Science
+ * Copyright (c) 2014-2018 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
@@ -274,14 +274,12 @@ int orte_util_nidmap_create(opal_pointer_array_t *pool, char **regex)
len = strlen(node);
startnum = -1;
memset(prefix, 0, ORTE_MAX_NODE_PREFIX);
- numdigits = 0;
for (i=0, j=0; i < len; i++) {
/* valid hostname characters are ascii letters, digits and the '-' character. */
if (isdigit(node[i])) {
/* count the size of the numeric field - but don't
* add the digits to the prefix
*/
- numdigits++;
if (startnum < 0) {
/* okay, this defines end of the prefix */
startnum = i;
@@ -306,10 +304,13 @@ int orte_util_nidmap_create(opal_pointer_array_t *pool, char **regex)
}
/* convert the digits and get any suffix */
nodenum = strtol(&node[startnum], &sfx, 10);
+
if (NULL != sfx) {
suffix = strdup(sfx);
+ numdigits = (int)(sfx - &node[startnum]);
} else {
suffix = NULL;
+ numdigits = strlen(node[startnum]);
}
/* is this node name already on our list? */
found = false;if a hostname is made of several sets of consecutive digits, and you believe the parts to be compressed is more likely at the end rather than at the beginning, then the reverse method is fine. otherwise, try all the scenarios, and choose the shortest regex |
|
i just tested my patch and it seems to work. |
|
bot:mellanox:retest |
|
Yeah i'm not looking for a general solution here, just support for these types of node names which are extremely common for IBM systems where the hostname can often encode system name and host location within the system (e.g., rack or aisle location). I would think the reverse search would be the best in any case, except fully qualified hostnames, since most clusters I've see have hostnames like |
| if (startnum < 0) { | ||
| prefix[j++] = node[i]; | ||
| } | ||
| else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment does not look right to me.
I think that would tryreverse even with ‘x0y’
| numdigits++; | ||
| continue; | ||
| } | ||
| else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that would work with ‘c712f6n01a’
|
FYI, I wrote a test program I will push tomorrow in |
|
Having slept on it, I'm convinced this is a lost cause. There are far too many bars out there where someone can concoct these ridiculous schemes over a barrel of beer - we cannot hope to cover them all with a single, understandable algorithm. So let's create a regex framework that can start with two components - our original simple one (to be the default) and this "reverse" one for IBM. IBM can supply an MCA param to get theirs selected. If someone comes along with yet another scheme, then they can provide a plugin for resolving it. |
|
Sounds good to me. |
|
I think that's fine for master, but we/I will need a solution that I can pick to v3.x branches. So a version of this PR will be needed since we can't add a new framework to those branches. If I had an IBM component for the node regex I might be able to further compress some of the stranger cases that I've seen. So how about this for a plan.
|
|
Looks good to me. For the release branches, an even simpler solution could be to always parse host names in reverse mode, and stop at the end of the first set of digits. |
|
There is no problem moving a new framework into v3.1, and I would be leery of moving even this PR into the v3.0 series. Too much chance of it messing up someone's cluster out there - we cannot possibly test or imagine all the craziness. So I'd advocate for just doing it right the first time. SpectrumMPI can always put out their own version in the meantime. |
|
Yeah Spectrum MPI can (and does) their own version of this patch to support our machines on our branch of OMPI. But I know there are folks (e.g., ORNL) that pick up OMPI and try to run it on their IBM machines where they would need this fix. That's why I'm posting it here, and trying to get it into the release branches. I like having a test case where we can enumerate the hostname patterns that we have seen/support. So as we are building a framework we can test to make sure it covers all the cases. |
|
NP - let's build the framework and create a separate plugin for this case. If you can determine some way of identifying that you are in such an environment, then you can always auto-select your component. I'd just rather not push something like this into a prior release as it impacts everyone and we can't be sure of the side-effects. |
|
That's fine with me. I know we have been bitten by things like that before. That's partly why I'd love to have a unit test available that we can extend with new supported patterns, and run regularly. Even then putting in a change close to release seems to be asking for trouble :) Let's keep this PR open here for now. I won't be able to help much on the framework for the next few weeks. Having this PR up gives folks from the community something to pull in the mean time. |
rhc54
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's mark this so we know not to commit it. No issues with leaving it around until replaced.
|
@jjhursey When you are ready, you can just drop your algorithms into the orte/mca/regx/reverse component. |
|
@rhc54 the new Also, the inline patch above should be added to the |
|
No problem adding the API, and I expect someone will want to shift some of the code in the "fwd" component to base. I would prefer that you not add the inline code to "fwd" as the entire point of creating the framework was to avoid that scenario. The inline code should go into the "reverse" component. |
|
My inline patch (about |
|
Ah, no - I thought you meant the "reverse" patch. Sure, your change doesn't impact the basic algo. |
Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Scott Miller <scott.miller1@ibm.com>
Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit a056fde)
* The compression failed for hostnames of the form `c712f6n02` which have multiple sets of digits separated by other characters. So add a reverse prefix finding mechanism for the prefix. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com> Commit reference from PR open-mpi#4689: open-mpi#4689 (cherry picked from commit 1d0cc65)
Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit a056fde)
Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (back-ported from commit a056fde) Signed-off-by: Scott Miller <scott.miller1@ibm.com>
Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit 98073d7) Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* Refs open-mpi/ompi#4689 * Refs openpmix#657 Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (back-ported from commit a056fde) Signed-off-by: Scott Miller <scott.miller1@ibm.com>
c712f6n02whichhave multiple sets of digits separated by other characters. So
add a reverse prefix finding mechanism for the prefix.
c712f6n01,c712f6n02,c712f6n03c[6:712]f6n01,c[6:712]f6n02,c[6:712]f6n03@(3)c712f6n[2:1-3]@0(3)c[6:712]f6n02asc000712f6n02instead ofc712f6n02and fail to resolve the hostname.