Skip to content

Conversation

jsquyres
Copy link
Member

User Nikolay reported on the users mailing list that Open MPI 1.10.x has problems in OpenVZ containers.

I actually ran into a similar situation recently, and hacked a workaround (which is the initial patch on this PR). Specifically, per Nikolay's description, here's the networking picture in an OpenVZ container:

[root@ct111 /]# ifconfig
lo        Link encap:Local Loopback
         inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING  MTU:65536  Metric:1
         RX packets:24 errors:0 dropped:0 overruns:0 frame:0
         TX packets:24 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:1200 (1.1 KiB)  TX bytes:1200 (1.1 KiB)

venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
         inet addr:127.0.0.1  P-t-P:127.0.0.1  Bcast:0.0.0.0  Mask:255.255.255.255
         UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
         RX packets:855 errors:0 dropped:0 overruns:0 frame:0
         TX packets:774 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:122212 (119.3 KiB)  TX bytes:112304 (109.6 KiB)

venet0:0  Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
         inet addr:10.0.50.41  P-t-P:10.0.50.41  Bcast:10.0.50.41  Mask:255.255.255.255
         UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1

Note the 10.0.40.41/32 (!) address. That's just how OpenVZ sets up its networking environment -- it works fine.

Using --mca btl_tcp_if_include venet0:0 --mca oob_tcp_if_include venet0:0 does not work -- the jobs fail with the TCP BTL complaining about a broken pipe (!).

IIRC (from back when I created this patch), I think that somehow the TCP OOB is still using venet0, which ends up not working properly (i.e., it tries to open a socket to a peer, and it fails). The issue is that OMPI isn't ignoring localhost based on its 127.0.0.0/8 address, but rather based on its name of lo*. I.e., it still tries to use venet0, even though it's also 127.0.0.1 (same as lo).

I don't know offhand/remember why the oob_tcp_if_include venet0:0 doesn't fix the problem. ...something about string matching, perhaps?

Regardless, Nikolay replied that OMPI works for him when he applies this hackaround patch. We should probably do something better / more formal -- here's some not-fully-thought-out suggestions:

  • Ignore POINTOPOINT interfaces
  • Ignore 127.0.0.0/8 interfaces (i.e., add 127.0.0.0/8 to *_if_exclude default values)

This is not a good solution, but it's something to talk about.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@ggouaillardet
Copy link
Contributor

ggouaillardet commented Jun 27, 2016

@jsquyres
iirc, we include/exclude kernel interfaces only.
if an ip/netmask is used as a parameter, the matching kernel interface is searched and then included/excluded.
vnet0:0 is not a kernel interface, or it is the same than vnet0
=> --mca btl_tcp_if_include venet0:0 --mca oob_tcp_if_include venet0:0 also includes 127.0.0.1/8
=> --mca btl_tcp_if_exclude 127.0.0.0/8 --mca oob_tcp_if_exclude 127.0.0.0/8 does exclude lo, vnet0 but also vnet0:0

EDIT: Per comment below, changed "physical interface" to "kernel interface"

@jsquyres
Copy link
Member Author

@ggouaillardet I'm afraid I don't grok your comment: I don't know what you mean by "physical interfaces" -- the Linux IP interfaces have no relation to hardware...?

@ggouaillardet
Copy link
Contributor

@jsquyres i chose my words poorly, sorry for that.
please read "kernel interface" instead of "physical interface"
and let me elaborate

on my system, lo, eth1:0 and eth1 are up and running
eht1:0 has ip 127.0.0.1/255.255.255.255 and i run with
--mca btl_tcp_if_exclude eth1:0

in opal_if_list these three interfaces appear in that order.
lo has {if_index,if_kernel_index} (1,1)
eth1:0 has (2,3)
eth1 has (3,3)
/* eth1:0 and eth1 share the same if_kernel_index */

in mca_btl_tcp_component_create_instances, only the first interface of a given if_kernel_index is considered, that means only lo and eth1:0

since eth1:0 is excluded via the command line, only lo is available, and this interface is discarded in an other place
bottom line, no tcp interface can be used and MPI_Init aborts

now if you
mpirun --mca btl_tcp_if_exclude 127.0.0.1/8
then the tcp btl excludes lo and do not even bother trying and excluding eth1:0
so i guess eth1:0 does not get excluded and bad things happen when this interface is used for inter node communication.

now if eth1 appears before eth1:0 in opal_if_list, then i guess
mpirun --mca btl_tcp_if_exclude eth1:0 does not exclude anything

@jsquyres
Copy link
Member Author

Huh. It sounds like we're doing the wrong thing in include/exclude, then -- especially if multiple user interfaces can share the same kernel interface, but with a different IP address. Should we revamp this include/exclude IP interface routine? (I honestly, forget: is this in OPAL? If it's not, it seems like it should be, because there's multiple places in OMPI where we need to include/exclude IP interfaces)

More specifically: I'm wondering why we have the code skip interfaces with the same kernel ID. Did we (mistakenly?) assume that interfaces with the same kernel ID would be the same physical interface (and/or IP address), and therefore we wouldn't want to oversubscribe the bandwidth? I.e., I wonder what problem we were trying to fix by skipping interfaces with the same kernel ID, and whether we need to actually solve that problem another way.

For example, is this simplistic algorithm sufficient?

if include list is empty and exclude list is empty
    copy all_interfaces list to to_be_used list
else
    # This foreach goes over *all* interfaces -- even those with the same kernel ID
    foreach interface in all_interfaces
        if exclude list is not empty
            if interface name is in exclude list or interface CIDR is in exclude list
                continue
            else
                save interface in to_be_used list
        else if include list is not empty
            if interface name is in include list or interface CIDR is in include list
                save interface in to_be_used list
            else
                continue

I'm a little wary of the "everything" foreach in there:

  1. On the one hand, your case (and the OpenVZ case) show that skipping interfaces of the same kernel ID isn't a good idea
  2. But on the other hand, we must have done that for a reason / to fix some problem

So if we now do not skip interfaces with the same kernel ID, what problem are we (re)introducing?

@ggouaillardet
Copy link
Contributor

this include/exclude thing is done in the tcp btl.
I did not check how this is handled in the tcp oob.

did we do that because we do not want load balancing on several interfaces with the same kernel id ?
@bosilca iirc you are the author of the tcp btl, any thoughts ?

@jsquyres
Copy link
Member Author

FWIW, we also do this same kind of include/exclude stuff in the usnic BTL (because it's ethernet, so we use IP addressing, yadda yadda yadda). Sounds like a good improvement would be to move this functionality down to OPAL so that all of us can share it (and therefore exhibit the same include/exclude behavior across all IP interface selection in OMPI).

@jjhursey
Copy link
Member

FWIW I think moving this to OPAL sounds fine. I would just want to make sure we preserve the ability to include/exclude differently for the OOB and BTL components so we can separate the traffic if we have multiple interfaces. Maybe putting the logic in OPAL, but still preserve the different MCA options.

@ggouaillardet
Copy link
Contributor

fwiw I wrote a simple fix, but I screwed up when hitting the Comment button, so I will post it tomorrow.
include/exclude is still handled in btl/tcp, but it now uses interfaces instead of kernel interface.
that being said, only max one instance per kernel interface is created, so the change is minimal at the end.

@jsquyres
Copy link
Member Author

@jjhursey Yeah, we're on the same sheet of music here -- I just want the machinery for all the parsing / selecting / etc. down in OPAL, and different top-levels can invoke the machinery with different sets of inputs / outputs.

@bosilca do you remember why the TCP BTL will only add an interface if it has a unique kernel ID? Was it to prevent underlying device bandwidth oversubscription?

@ggouaillardet
Copy link
Contributor

here is the patch

diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index e4a547a..88b44c3 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -17,7 +17,7 @@
  *                         reserved.
  * Copyright (c) 2013-2015 NVIDIA Corporation.  All rights reserved.
  * Copyright (c) 2014-2015 Intel, Inc. All rights reserved.
- * Copyright (c) 2014-2015 Research Organization for Information Science
+ * Copyright (c) 2014-2016 Research Organization for Information Science
  *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
@@ -56,6 +56,7 @@
 #include <limits.h>

 #include "opal/mca/event/event.h"
+#include "opal/mca/if/if.h"
 #include "opal/util/ethtool.h"
 #include "opal/util/if.h"
 #include "opal/util/output.h"
@@ -668,6 +669,7 @@ static int mca_btl_tcp_component_create_instances(void)
     int if_index;
     int kif_count = 0;
     int *kindexes; /* this array is way too large, but never too small */
+    opal_if_t **intfs, *intf;
     char **include = NULL;
     char **exclude = NULL;
     char **argv;
@@ -681,6 +683,11 @@ static int mca_btl_tcp_component_create_instances(void)
     if (NULL == kindexes) {
         return OPAL_ERR_OUT_OF_RESOURCE;
     }
+    intfs = (opal_if_t **) malloc(sizeof(opal_if_t *) * if_count);
+    if (NULL == intfs) {
+        free(kindexes);
+        return OPAL_ERR_OUT_OF_RESOURCE;
+    }

     /* calculate the number of kernel indexes (number of IP interfaces) */
     {
@@ -712,6 +719,7 @@ static int mca_btl_tcp_component_create_instances(void)
             }
         }
     }
+    free(kindexes);

     /* allocate memory for btls */
     mca_btl_tcp_component.tcp_btls = (mca_btl_tcp_module_t**)malloc(mca_btl_tcp_component.tcp_num_links *
@@ -726,15 +734,37 @@ static int mca_btl_tcp_component_create_instances(void)
     /* if the user specified an interface list - use these exclusively */
     argv = include = split_and_resolve(&mca_btl_tcp_component.tcp_if_include,
                                        "include", true);
+    if_index = 0;
     while(argv && *argv) {
         char* if_name = *argv;
-        int if_index = opal_ifnametokindex(if_name);
-        if(if_index < 0) {
+        bool found = false;
+        int i;
+        for (intf =  (opal_if_t*)opal_list_get_first(&opal_if_list);
+             intf != (opal_if_t*)opal_list_get_end(&opal_if_list);
+             intf =  (opal_if_t*)opal_list_get_next(intf)) {
+            if (strcmp(intf->if_name, if_name) == 0) {
+                found = true;
+                break;
+            }
+        }
+        if (!found) {
             BTL_ERROR(("invalid interface \"%s\"", if_name));
             ret = OPAL_ERR_NOT_FOUND;
             goto cleanup;
         }
-        mca_btl_tcp_create(if_index, if_name);
+        found = false;
+        /* search a previous interface with the same kernel index */
+        for (i=0; i<if_index; i++) {
+            if (intfs[i]->if_kernel_index == intf->if_kernel_index) {
+                found = true;
+                break;
+            }
+        }
+        if (!found) {
+            /* add the interface and create it */
+            intfs[if_index++] = intf;
+            mca_btl_tcp_create(intf->if_kernel_index, intf->if_name);
+        }
         argv++;
     }

@@ -748,27 +778,49 @@ static int mca_btl_tcp_component_create_instances(void)
     /* if the interface list was not specified by the user, create
      * a BTL for each interface that was not excluded.
     */
+
+    /* retrieve all interfaces */
+    if_index = 0;
+    for (intf =  (opal_if_t*)opal_list_get_first(&opal_if_list);
+         intf != (opal_if_t*)opal_list_get_end(&opal_if_list);
+         intf =  (opal_if_t*)opal_list_get_next(intf)) {
+        intfs[if_index++] = intf;
+    }
+
     exclude = split_and_resolve(&mca_btl_tcp_component.tcp_if_exclude,
                                 "exclude", false);
     {
         int i;
-        for(i = 0; i < kif_count; i++) {
-            /* IF_NAMESIZE is defined in opal/util/if.h */
-            char if_name[IF_NAMESIZE];
-            if_index = kindexes[i];
-
-            opal_ifkindextoname(if_index, if_name, sizeof(if_name));
-
+        for(i = 0; i < if_index; i++) {
             /* check to see if this interface exists in the exclude list */
             argv = exclude;
             while(argv && *argv) {
-                if(strncmp(*argv,if_name,strlen(*argv)) == 0)
+                if(strncmp(*argv,intfs[i]->if_name,strlen(*argv)) == 0) {
+                    /* the interface must be excluded */
+                    intfs[i] = NULL;
                     break;
+                }
                 argv++;
             }
-            /* if this interface was not found in the excluded list, create a BTL */
-            if(argv == 0 || *argv == 0) {
-                mca_btl_tcp_create(if_index, if_name);
+        }
+        for (i = 0; i < if_index; i++) {
+            int j;
+            bool found = false;
+            if (NULL == intfs[i]) {
+                continue;
+            }
+            /* search a previous interface with the same kernel index */
+            for (j=0; j<i; j++) {
+                if (NULL == intfs[j]) {
+                    continue;
+                }
+                if (intfs[i]->if_kernel_index == intfs[j]->if_kernel_index) {
+                    found = true;
+                }
+            }
+            if (!found) {
+                /* add the interface and create it */
+                mca_btl_tcp_create(intfs[i]->if_kernel_index, intfs[i]->if_name);
             }
         }
     }
@@ -780,9 +832,7 @@ static int mca_btl_tcp_component_create_instances(void)
     if (NULL != exclude) {
         opal_argv_free(exclude);
     }
-    if (NULL != kindexes) {
-        free(kindexes);
-    }
+    free(intfs);
     return ret;
 }

@jsquyres
Copy link
Member Author

Please add a Signed-off-by line to this PR's commit.

@bwbarrett
Copy link
Member

@jsquyres / @ggouaillardet / @bosilca, I'm still not sure I totally grok what this PR is trying to fix, but where are we in closing out / tossing? We're a bit ripe on this PR...

@jsquyres
Copy link
Member Author

This hasn't come up again, so I'm willing to close it as "wontfix". If someone wants to bring it up again, no problem -- this PR was definitely a hackaround / placeholder for a real fix.

@jsquyres jsquyres closed this Sep 13, 2017
@jsquyres jsquyres deleted the pr/openvz branch September 13, 2017 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants