-
Notifications
You must be signed in to change notification settings - Fork 918
Support OpenVZ container networking #1814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is not a good solution, but it's something to talk about. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres EDIT: Per comment below, changed "physical interface" to "kernel interface" |
@ggouaillardet I'm afraid I don't grok your comment: I don't know what you mean by "physical interfaces" -- the Linux IP interfaces have no relation to hardware...? |
@jsquyres i chose my words poorly, sorry for that. on my system, in in since now if you now if |
Huh. It sounds like we're doing the wrong thing in include/exclude, then -- especially if multiple user interfaces can share the same kernel interface, but with a different IP address. Should we revamp this include/exclude IP interface routine? (I honestly, forget: is this in OPAL? If it's not, it seems like it should be, because there's multiple places in OMPI where we need to include/exclude IP interfaces) More specifically: I'm wondering why we have the code skip interfaces with the same kernel ID. Did we (mistakenly?) assume that interfaces with the same kernel ID would be the same physical interface (and/or IP address), and therefore we wouldn't want to oversubscribe the bandwidth? I.e., I wonder what problem we were trying to fix by skipping interfaces with the same kernel ID, and whether we need to actually solve that problem another way. For example, is this simplistic algorithm sufficient?
I'm a little wary of the "everything" foreach in there:
So if we now do not skip interfaces with the same kernel ID, what problem are we (re)introducing? |
this include/exclude thing is done in the tcp btl. did we do that because we do not want load balancing on several interfaces with the same kernel id ? |
FWIW, we also do this same kind of include/exclude stuff in the usnic BTL (because it's ethernet, so we use IP addressing, yadda yadda yadda). Sounds like a good improvement would be to move this functionality down to OPAL so that all of us can share it (and therefore exhibit the same include/exclude behavior across all IP interface selection in OMPI). |
FWIW I think moving this to OPAL sounds fine. I would just want to make sure we preserve the ability to include/exclude differently for the OOB and BTL components so we can separate the traffic if we have multiple interfaces. Maybe putting the logic in OPAL, but still preserve the different MCA options. |
fwiw I wrote a simple fix, but I screwed up when hitting the Comment button, so I will post it tomorrow. |
@jjhursey Yeah, we're on the same sheet of music here -- I just want the machinery for all the parsing / selecting / etc. down in OPAL, and different top-levels can invoke the machinery with different sets of inputs / outputs. @bosilca do you remember why the TCP BTL will only add an interface if it has a unique kernel ID? Was it to prevent underlying device bandwidth oversubscription? |
here is the patch diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index e4a547a..88b44c3 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -17,7 +17,7 @@
* reserved.
* Copyright (c) 2013-2015 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2014-2015 Intel, Inc. All rights reserved.
- * Copyright (c) 2014-2015 Research Organization for Information Science
+ * Copyright (c) 2014-2016 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
@@ -56,6 +56,7 @@
#include <limits.h>
#include "opal/mca/event/event.h"
+#include "opal/mca/if/if.h"
#include "opal/util/ethtool.h"
#include "opal/util/if.h"
#include "opal/util/output.h"
@@ -668,6 +669,7 @@ static int mca_btl_tcp_component_create_instances(void)
int if_index;
int kif_count = 0;
int *kindexes; /* this array is way too large, but never too small */
+ opal_if_t **intfs, *intf;
char **include = NULL;
char **exclude = NULL;
char **argv;
@@ -681,6 +683,11 @@ static int mca_btl_tcp_component_create_instances(void)
if (NULL == kindexes) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
+ intfs = (opal_if_t **) malloc(sizeof(opal_if_t *) * if_count);
+ if (NULL == intfs) {
+ free(kindexes);
+ return OPAL_ERR_OUT_OF_RESOURCE;
+ }
/* calculate the number of kernel indexes (number of IP interfaces) */
{
@@ -712,6 +719,7 @@ static int mca_btl_tcp_component_create_instances(void)
}
}
}
+ free(kindexes);
/* allocate memory for btls */
mca_btl_tcp_component.tcp_btls = (mca_btl_tcp_module_t**)malloc(mca_btl_tcp_component.tcp_num_links *
@@ -726,15 +734,37 @@ static int mca_btl_tcp_component_create_instances(void)
/* if the user specified an interface list - use these exclusively */
argv = include = split_and_resolve(&mca_btl_tcp_component.tcp_if_include,
"include", true);
+ if_index = 0;
while(argv && *argv) {
char* if_name = *argv;
- int if_index = opal_ifnametokindex(if_name);
- if(if_index < 0) {
+ bool found = false;
+ int i;
+ for (intf = (opal_if_t*)opal_list_get_first(&opal_if_list);
+ intf != (opal_if_t*)opal_list_get_end(&opal_if_list);
+ intf = (opal_if_t*)opal_list_get_next(intf)) {
+ if (strcmp(intf->if_name, if_name) == 0) {
+ found = true;
+ break;
+ }
+ }
+ if (!found) {
BTL_ERROR(("invalid interface \"%s\"", if_name));
ret = OPAL_ERR_NOT_FOUND;
goto cleanup;
}
- mca_btl_tcp_create(if_index, if_name);
+ found = false;
+ /* search a previous interface with the same kernel index */
+ for (i=0; i<if_index; i++) {
+ if (intfs[i]->if_kernel_index == intf->if_kernel_index) {
+ found = true;
+ break;
+ }
+ }
+ if (!found) {
+ /* add the interface and create it */
+ intfs[if_index++] = intf;
+ mca_btl_tcp_create(intf->if_kernel_index, intf->if_name);
+ }
argv++;
}
@@ -748,27 +778,49 @@ static int mca_btl_tcp_component_create_instances(void)
/* if the interface list was not specified by the user, create
* a BTL for each interface that was not excluded.
*/
+
+ /* retrieve all interfaces */
+ if_index = 0;
+ for (intf = (opal_if_t*)opal_list_get_first(&opal_if_list);
+ intf != (opal_if_t*)opal_list_get_end(&opal_if_list);
+ intf = (opal_if_t*)opal_list_get_next(intf)) {
+ intfs[if_index++] = intf;
+ }
+
exclude = split_and_resolve(&mca_btl_tcp_component.tcp_if_exclude,
"exclude", false);
{
int i;
- for(i = 0; i < kif_count; i++) {
- /* IF_NAMESIZE is defined in opal/util/if.h */
- char if_name[IF_NAMESIZE];
- if_index = kindexes[i];
-
- opal_ifkindextoname(if_index, if_name, sizeof(if_name));
-
+ for(i = 0; i < if_index; i++) {
/* check to see if this interface exists in the exclude list */
argv = exclude;
while(argv && *argv) {
- if(strncmp(*argv,if_name,strlen(*argv)) == 0)
+ if(strncmp(*argv,intfs[i]->if_name,strlen(*argv)) == 0) {
+ /* the interface must be excluded */
+ intfs[i] = NULL;
break;
+ }
argv++;
}
- /* if this interface was not found in the excluded list, create a BTL */
- if(argv == 0 || *argv == 0) {
- mca_btl_tcp_create(if_index, if_name);
+ }
+ for (i = 0; i < if_index; i++) {
+ int j;
+ bool found = false;
+ if (NULL == intfs[i]) {
+ continue;
+ }
+ /* search a previous interface with the same kernel index */
+ for (j=0; j<i; j++) {
+ if (NULL == intfs[j]) {
+ continue;
+ }
+ if (intfs[i]->if_kernel_index == intfs[j]->if_kernel_index) {
+ found = true;
+ }
+ }
+ if (!found) {
+ /* add the interface and create it */
+ mca_btl_tcp_create(intfs[i]->if_kernel_index, intfs[i]->if_name);
}
}
}
@@ -780,9 +832,7 @@ static int mca_btl_tcp_component_create_instances(void)
if (NULL != exclude) {
opal_argv_free(exclude);
}
- if (NULL != kindexes) {
- free(kindexes);
- }
+ free(intfs);
return ret;
}
|
Please add a Signed-off-by line to this PR's commit. |
@jsquyres / @ggouaillardet / @bosilca, I'm still not sure I totally grok what this PR is trying to fix, but where are we in closing out / tossing? We're a bit ripe on this PR... |
This hasn't come up again, so I'm willing to close it as "wontfix". If someone wants to bring it up again, no problem -- this PR was definitely a hackaround / placeholder for a real fix. |
User Nikolay reported on the users mailing list that Open MPI 1.10.x has problems in OpenVZ containers.
I actually ran into a similar situation recently, and hacked a workaround (which is the initial patch on this PR). Specifically, per Nikolay's description, here's the networking picture in an OpenVZ container:
Note the 10.0.40.41/32 (!) address. That's just how OpenVZ sets up its networking environment -- it works fine.
Using
--mca btl_tcp_if_include venet0:0 --mca oob_tcp_if_include venet0:0
does not work -- the jobs fail with the TCP BTL complaining about a broken pipe (!).IIRC (from back when I created this patch), I think that somehow the TCP OOB is still using
venet0
, which ends up not working properly (i.e., it tries to open a socket to a peer, and it fails). The issue is that OMPI isn't ignoring localhost based on its 127.0.0.0/8 address, but rather based on its name oflo*
. I.e., it still tries to usevenet0
, even though it's also 127.0.0.1 (same aslo
).I don't know offhand/remember why the
oob_tcp_if_include venet0:0
doesn't fix the problem. ...something about string matching, perhaps?Regardless, Nikolay replied that OMPI works for him when he applies this hackaround patch. We should probably do something better / more formal -- here's some not-fully-thought-out suggestions:
*_if_exclude
default values)