orte/util: Extend nidmap support for hostnames with sets of digits #4689

jjhursey · 2018-01-10T02:02:46Z

The compression failed for hostnames of the form c712f6n02 which
have multiple sets of digits separated by other characters. So
add a reverse prefix finding mechanism for the prefix.
Given the nodelist of c712f6n01,c712f6n02,c712f6n03
- Old way created: c[6:712]f6n01,c[6:712]f6n02,c[6:712]f6n03@(3)
- New way created: c712f6n[2:1-3]@0(3)
This became a problem when tree launching with rsh as the next hop would decode c[6:712]f6n02 as c000712f6n02 instead of c712f6n02 and fail to resolve the hostname.

* The compression failed for hostnames of the form `c712f6n02` which have multiple sets of digits separated by other characters. So add a reverse prefix finding mechanism for the prefix. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>

jjhursey · 2018-01-10T02:07:02Z

For the reviewers:

I debated about removing the old way of a forward search for the prefix, and just leaving the reverse search for the prefix. I decided not to do so because I didn't know if there was a corner case I was missing here. So when you review if you think that we can ditch the old way, and just use the reverse method then I can re-create the patch to do that.

rhc54 · 2018-01-10T02:45:48Z

My head hurts - all these squirrelly naming conventions make it hard to come up with any foolproof regex algorithm, I fear. It seems to me that this one will fail if the variation at the end isn't digits, but letters. In other words, if we have "prefix01a,prefix01b,prefix01c", then this algorithm won't find the commonality in any of them - yes?

ggouaillardet · 2018-01-10T03:02:52Z

me too ...

a simple fix (i did not try yet ...) could be

diff --git a/orte/util/nidmap.c b/orte/util/nidmap.c
index fa05f68..0f50772 100644
--- a/orte/util/nidmap.c
+++ b/orte/util/nidmap.c
@@ -13,7 +13,7 @@
  * Copyright (c) 2012-2014 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2013-2017 Intel, Inc.  All rights reserved.
- * Copyright (c) 2014-2017 Research Organization for Information Science
+ * Copyright (c) 2014-2018 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
@@ -274,14 +274,12 @@ int orte_util_nidmap_create(opal_pointer_array_t *pool, char **regex)
         len = strlen(node);
         startnum = -1;
         memset(prefix, 0, ORTE_MAX_NODE_PREFIX);
-        numdigits = 0;
         for (i=0, j=0; i < len; i++) {
             /* valid hostname characters are ascii letters, digits and the '-' character. */
             if (isdigit(node[i])) {
                 /* count the size of the numeric field - but don't
                  * add the digits to the prefix
                  */
-                numdigits++;
                 if (startnum < 0) {
                     /* okay, this defines end of the prefix */
                     startnum = i;
@@ -306,10 +304,13 @@ int orte_util_nidmap_create(opal_pointer_array_t *pool, char **regex)
         }
         /* convert the digits and get any suffix */
         nodenum = strtol(&node[startnum], &sfx, 10);
+     
         if (NULL != sfx) {
             suffix = strdup(sfx);
+            numdigits = (int)(sfx - &node[startnum]);
         } else {
             suffix = NULL;
+            numdigits = strlen(node[startnum]);
         }
         /* is this node name already on our list? */
         found = false;

if a hostname is made of several sets of consecutive digits, and you believe the parts to be compressed is more likely at the end rather than at the beginning, then the reverse method is fine.
i have no strong opinion on that

otherwise, try all the scenarios, and choose the shortest regex

ggouaillardet · 2018-01-10T04:27:09Z

i just tested my patch and it seems to work.
obviously, the regex is suboptimal e.g. c[3:712]f6n01,c[3:712]f6n02,c[3:712]f6n03@0(3) instead of c712f6n[2:1-3]@0(3), but there is no absolute rule whether the digits should be searched at the beginning or the end.

artpol84 · 2018-01-10T04:48:23Z

bot:mellanox:retest

jjhursey · 2018-01-10T12:44:53Z

Yeah i'm not looking for a general solution here, just support for these types of node names which are extremely common for IBM systems where the hostname can often encode system name and host location within the system (e.g., rack or aisle location).

I would think the reverse search would be the best in any case, except fully qualified hostnames, since most clusters I've see have hostnames like odin012 or n37 or node-135 So the last set of digits are what is changing.

ggouaillardet · 2018-01-10T13:44:37Z

orte/util/nidmap.c

            if (startnum < 0) {
                prefix[j++] = node[i];
            }
+            else {


The comment does not look right to me.
I think that would tryreverse even with ‘x0y’

ggouaillardet · 2018-01-10T13:46:53Z

orte/util/nidmap.c

+                    numdigits++;
+                    continue;
+                }
+                else {


I do not think that would work with ‘c712f6n01a’

ggouaillardet · 2018-01-10T13:48:32Z

FYI, I wrote a test program I will push tomorrow in test/util/nidmap.c

rhc54 · 2018-01-10T14:42:50Z

Having slept on it, I'm convinced this is a lost cause. There are far too many bars out there where someone can concoct these ridiculous schemes over a barrel of beer - we cannot hope to cover them all with a single, understandable algorithm.

So let's create a regex framework that can start with two components - our original simple one (to be the default) and this "reverse" one for IBM. IBM can supply an MCA param to get theirs selected.

If someone comes along with yet another scheme, then they can provide a plugin for resolving it.

ggouaillardet · 2018-01-10T14:53:04Z

Sounds good to me.
We could also try all the components and keep the shortest regex. That behavior could be controlled by a MCA parameter.

jjhursey · 2018-01-10T14:56:36Z

I think that's fine for master, but we/I will need a solution that I can pick to v3.x branches. So a version of this PR will be needed since we can't add a new framework to those branches. If I had an IBM component for the node regex I might be able to further compress some of the stranger cases that I've seen.

So how about this for a plan.

Giles commits his nidmap test program with some use cases that we have seen.
I adjust this PR based upon review comments. Then I can cherry-pick this commit to the v3.x branches
We work on a new PR that breaks the nidmap into a proper framework.

ggouaillardet · 2018-01-10T15:06:24Z

Looks good to me.
Is it really too late for the v3.1.x series ?

For the release branches, an even simpler solution could be to always parse host names in reverse mode, and stop at the end of the first set of digits.
We could also try both directions and keep the shortest regex, and this could be tunable via a MCA param.

rhc54 · 2018-01-10T15:30:24Z

There is no problem moving a new framework into v3.1, and I would be leery of moving even this PR into the v3.0 series. Too much chance of it messing up someone's cluster out there - we cannot possibly test or imagine all the craziness. So I'd advocate for just doing it right the first time.

SpectrumMPI can always put out their own version in the meantime.

jjhursey · 2018-01-10T15:51:22Z

Yeah Spectrum MPI can (and does) their own version of this patch to support our machines on our branch of OMPI. But I know there are folks (e.g., ORNL) that pick up OMPI and try to run it on their IBM machines where they would need this fix. That's why I'm posting it here, and trying to get it into the release branches.

I like having a test case where we can enumerate the hostname patterns that we have seen/support. So as we are building a framework we can test to make sure it covers all the cases.

rhc54 · 2018-01-10T15:57:23Z

NP - let's build the framework and create a separate plugin for this case. If you can determine some way of identifying that you are in such an environment, then you can always auto-select your component. I'd just rather not push something like this into a prior release as it impacts everyone and we can't be sure of the side-effects.

jjhursey · 2018-01-10T16:06:01Z

That's fine with me. I know we have been bitten by things like that before. That's partly why I'd love to have a unit test available that we can extend with new supported patterns, and run regularly. Even then putting in a change close to release seems to be asking for trouble :)

Let's keep this PR open here for now. I won't be able to help much on the framework for the next few weeks. Having this PR up gives folks from the community something to pull in the mean time.

rhc54

Let's mark this so we know not to commit it. No issues with leaving it around until replaced.

rhc54 · 2018-01-11T05:17:09Z

@jjhursey When you are ready, you can just drop your algorithms into the orte/mca/regx/reverse component.

ggouaillardet · 2018-01-11T09:09:19Z

@rhc54 the new regx framework makes the test very painful to write.
In order to ease that, and possibly do some code factorisarion too, could you consider adding (or replace with) a extract_node_names(char *regex, char*** node_names) callback ?

Also, the inline patch above should be added to the fwd component (and I can do that but tomorrow only)

rhc54 · 2018-01-11T10:39:01Z

No problem adding the API, and I expect someone will want to shift some of the code in the "fwd" component to base. I would prefer that you not add the inline code to "fwd" as the entire point of creating the framework was to avoid that scenario. The inline code should go into the "reverse" component.

ggouaillardet · 2018-01-11T10:52:00Z

My inline patch (about numdigits computation) is a bug fix. Without it, fwd will likely generate an incorrect regex on IBM nodes (see @jjhursey description in the first comment). With it, it will generate a flat regex, and it is up to the reverse component to do a better job.
Are we discussing the same thing ?

rhc54 · 2018-01-11T10:55:05Z

Ah, no - I thought you meant the "reverse" patch. Sure, your change doesn't impact the basic algo.

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Scott Miller <scott.miller1@ibm.com>

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit a056fde)

* The compression failed for hostnames of the form `c712f6n02` which have multiple sets of digits separated by other characters. So add a reverse prefix finding mechanism for the prefix. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com> Commit reference from PR open-mpi#4689: open-mpi#4689 (cherry picked from commit 1d0cc65)

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit a056fde)

Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (back-ported from commit a056fde) Signed-off-by: Scott Miller <scott.miller1@ibm.com>

Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit 98073d7) Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>

* Refs open-mpi/ompi#4689 * Refs openpmix#657

* Refs open-mpi/ompi#4689 * Refs openpmix#657 Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (back-ported from commit a056fde) Signed-off-by: Scott Miller <scott.miller1@ibm.com>

jjhursey added the bug label Jan 10, 2018

jjhursey assigned rhc54 and ggouaillardet Jan 10, 2018

jjhursey requested review from ggouaillardet and rhc54 January 10, 2018 02:02

jjhursey force-pushed the fix/nidmap-multiset branch from 464e65a to 1d0cc65 Compare January 10, 2018 02:04

ggouaillardet reviewed Jan 10, 2018

View reviewed changes

orte/util/nidmap.c

numdigits++;

continue;

}

else {

Copy link

Contributor

ggouaillardet Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that would work with ‘c712f6n01a’

gpaulsen approved these changes Jan 10, 2018

View reviewed changes

rhc54 requested changes Jan 10, 2018

View reviewed changes

jsquyres added the ⚠️ WIP-DNM! label Jan 10, 2018

rhc54 mentioned this pull request Jan 11, 2018

Convert nidmap to regx framework #4699

Merged

ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Jan 12, 2018

regx/fwd: correctly handle node names with multiple set of digits

a056fde

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

ggouaillardet mentioned this pull request Jan 12, 2018

orte/regx: fix, revamp and enhancement #4704

Merged

rhc54 closed this in #4704 Jan 12, 2018

sam6258 pushed a commit to sam6258/ompi that referenced this pull request Jan 24, 2018

regx/fwd: correctly handle node names with multiple set of digits

2f437f1

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Scott Miller <scott.miller1@ibm.com>

sam6258 pushed a commit to sam6258/ompi that referenced this pull request Jan 25, 2018

regx/fwd: correctly handle node names with multiple set of digits

8a00fbf

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit a056fde)

jjhursey mentioned this pull request Jan 25, 2018

nidmap reverse prefix patch #4748

Merged

sam6258 pushed a commit to sam6258/ompi that referenced this pull request Jan 31, 2018

regx/fwd: correctly handle node names with multiple set of digits

31ae3ea

Refs. open-mpi#4689 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit a056fde)

ggouaillardet added a commit to ggouaillardet/pmix that referenced this pull request Feb 1, 2018

preg/native: correctly handle node names with multiple set of digits

98073d7

Refs. open-mpi/ompi#4689 Refs. open-mpi/ompi#4748 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

jjhursey mentioned this pull request Feb 1, 2018

preg/native: correctly handle node names with multiple set of digits openpmix/openpmix#658

Merged

jjhursey added a commit to jjhursey/openpmix that referenced this pull request Feb 1, 2018

server_regex: correctly handle node names with multiple set of digits

64720d5

* Refs open-mpi/ompi#4689 * Refs openpmix#657

jjhursey mentioned this pull request Feb 1, 2018

server_regex: correctly handle node names with multiple set of digits openpmix/openpmix#659

Merged

jjhursey added a commit to jjhursey/openpmix that referenced this pull request Feb 1, 2018

server_regex: correctly handle node names with multiple set of digits

49c6088

* Refs open-mpi/ompi#4689 * Refs openpmix#657 Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>

jjhursey deleted the fix/nidmap-multiset branch March 9, 2021 15:19

orte/util: Extend nidmap support for hostnames with sets of digits #4689

orte/util: Extend nidmap support for hostnames with sets of digits #4689

Uh oh!

Conversation

jjhursey commented Jan 10, 2018

Uh oh!

jjhursey commented Jan 10, 2018

Uh oh!

rhc54 commented Jan 10, 2018

Uh oh!

ggouaillardet commented Jan 10, 2018

Uh oh!

ggouaillardet commented Jan 10, 2018

Uh oh!

artpol84 commented Jan 10, 2018

Uh oh!

jjhursey commented Jan 10, 2018

Uh oh!

ggouaillardet Jan 10, 2018

Choose a reason for hiding this comment

Uh oh!

ggouaillardet Jan 10, 2018

Choose a reason for hiding this comment

Uh oh!

ggouaillardet commented Jan 10, 2018

Uh oh!

rhc54 commented Jan 10, 2018

Uh oh!

ggouaillardet commented Jan 10, 2018

Uh oh!

jjhursey commented Jan 10, 2018

Uh oh!

ggouaillardet commented Jan 10, 2018

Uh oh!

rhc54 commented Jan 10, 2018

Uh oh!

jjhursey commented Jan 10, 2018

Uh oh!

rhc54 commented Jan 10, 2018

Uh oh!

jjhursey commented Jan 10, 2018

Uh oh!

rhc54 left a comment

Choose a reason for hiding this comment

Uh oh!

rhc54 commented Jan 11, 2018

Uh oh!

ggouaillardet commented Jan 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhc54 commented Jan 11, 2018

Uh oh!

ggouaillardet commented Jan 11, 2018

Uh oh!

rhc54 commented Jan 11, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

ggouaillardet commented Jan 11, 2018 •

edited

Loading