TCP BTL does not support Linux virtual interfaces #160
Imported from trac issue 3339. Created by jsquyres on 2012-10-02T17:20:21, last modified: 2014-04-18T22:36:27
Trac comment by roye on 2014-04-18 22:36:27: Hello,
This is clearly not going to happen in the v1.8 series. @rhc54 may well be doing this revamp-the-OPAL-if interface work, but only in terms of v1.9 or beyond.
Since George and I are mucking in this part of the code, I'm going to re-open this issue.
The reason we use kindex instead of index is that index is a way of enumerating the addresses associated with the interfaces on the host, but kindex is (supposed to be) the way to enumerate actual interfaces. So let's say we had a platform with two devices (A and B) and each has two addresses (A1, A2, B3, B4). That means we'll have 4 unique indexes, but only 2 unique kindexes. There's no advantage to spreading traffic across IPs on the same interface, which is why the BTL looks at kindexes instead of indexes. That, however, isn't why
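To make the index/kindex distinction concrete, here is a minimal sketch of the two-device example above. The tuple layout, the numeric values, and the helper name `group_by_kindex` are all made up for illustration; this is not the OPAL IF data structure.

```python
from collections import defaultdict

# Hypothetical enumeration: one entry per address.
# Layout (an assumption for this sketch): (index, kindex, ifname, address)
ADDRESSES = [
    (0, 1, "A", "A1"),
    (1, 1, "A", "A2"),
    (2, 2, "B", "B3"),
    (3, 2, "B", "B4"),
]


def group_by_kindex(addresses):
    """Collect addresses under the kindex of the interface that owns them."""
    groups = defaultdict(list)
    for _index, kindex, _ifname, addr in addresses:
        groups[kindex].append(addr)
    return dict(groups)


interfaces = group_by_kindex(ADDRESSES)
unique_indexes = {index for index, *_rest in ADDRESSES}

# Four address-level indexes, but only two interface-level kindexes.
print(len(unique_indexes), len(interfaces))
```

Picking one address per kindex group, rather than one per index, is what avoids spreading traffic across IPs that sit on the same interface.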
The kindex behavior is also a bit wrong on some platforms with regard to IPv6. On Linux, the kindexes are consistent across the IPv4 and IPv6 addresses (that is, we could build the same output as We never actually use the kindex in a call to the kernel. One potential solution to both the MacOS IPv6 problem and the virtual interfaces problem is to use the ifnames to build our own kindexes based on ifname. It would be an O(N^2) operation, but hopefully N would be sufficiently small that who cares.
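A rough sketch of the ifname-based idea, using a plain quadratic scan. The function name `assign_ifname_kindexes` is hypothetical, and the sketch assumes an alias such as eth0:0 is reported with its own ifname, distinct from eth0; if the interface layer reports both under the same name, this would not separate them.

```python
def assign_ifname_kindexes(ifnames):
    """Give each entry a synthetic kindex; entries sharing an ifname share it.

    For every entry we scan all earlier entries for a matching name,
    which is the quadratic cost mentioned above.
    """
    kindexes = []
    next_kindex = 0
    for i, name in enumerate(ifnames):
        assigned = None
        for j in range(i):  # linear scan over earlier entries
            if ifnames[j] == name:
                assigned = kindexes[j]
                break
        if assigned is None:  # first time this ifname is seen
            assigned = next_kindex
            next_kindex += 1
        kindexes.append(assigned)
    return kindexes


# One entry per address; eth0 carries two addresses in this made-up list.
names = ["eth0", "eth0", "eth0:0", "lo"]
print(assign_ifname_kindexes(names))
```

Since the kindex is never passed back to the kernel, the synthetic values only need to be unique per interface name within the process.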
If you create a virtual ethernet device in Linux, the TCP BTL gets confused.
This is because the Linux kernel will use the same kernel index for both interfaces -- the TCP BTL fundamentally assumes that all interfaces will have a unique kernel index (we use that kernel index for indexing and unique identification in modex data). This is clearly a bad assumption.
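The failure mode can be sketched with a table keyed by kernel index. The interface list, the addresses, and the helper `table_keyed_by` are invented for illustration, and a Python dict merely stands in for whatever structure the TCP BTL really uses.

```python
# Hypothetical interface list: (ifname, kernel index, address).
# eth0:0 is a virtual interface on top of eth0, so both report the same
# kernel index.
INTERFACES = [
    ("eth0", 2, "10.0.0.1"),
    ("eth0:0", 2, "192.168.1.1"),
]


def table_keyed_by(interfaces, key_fn):
    """Build a lookup table; a later entry with the same key replaces an earlier one."""
    table = {}
    for position, entry in enumerate(interfaces):
        table[key_fn(position, entry)] = entry
    return table


# Keyed by kernel index: the two interfaces collide and one is lost.
by_kernel_index = table_keyed_by(INTERFACES, lambda pos, entry: entry[1])

# Keyed by position in the process-local list: every entry keeps its own slot.
by_list_position = table_keyed_by(INTERFACES, lambda pos, entry: pos)

print(len(by_kernel_index), len(by_list_position))
```

Any identifier that is unique within the process avoids the collision; list position is used here only as the simplest stand-in for such an index.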
I chatted with Ralph about this on the phone: we're wondering why the kernel index was used at all. Why not use the OPAL IF index? That *is* unique (in a process), and is suitable for both indexing and identification in modex data.
Ralph is going to revamp the OPAL IF interface soon, anyway (e.g., convert it from a list to an array) and will likely be removing all the kernel index stuff. This will force changing the TCP BTL to use the OPAL IF index (instead of the kernel index). This will likely solve the problem.
Once we fix this, perhaps Bart at Atipa can test it for us (he ran into the issue because he has eth0:0 on his cluster head node to talk to the IPMI network. He doesn't usually run MPI jobs on the head node, but he did this once and ran into hangs/badness, and I helped diagnose the issue). :-)