Timeout on first connection attempt in multi-tenant mode after some time of inactivity (origin 1.1.0.1) #231
Comments
On the VXLAN interface, the VNID is lost when the service IP (172.30.184.188) is used.
If I use the pod IP (10.1.1.2), everything is OK. I think this is the root of the problem.
'vnid 0' is expected because the service proxy has global access.
Why do we have to wait for an ARP request?
Port 4 is the pod I want to connect to.
Yes, but the pods don't know about that; they think they are connected to an Ethernet network, so the source pod can't send any data to the destination until it knows its MAC address (since it has to write that address into the packet).
@barkbay Try editing /usr/bin/openshift-sdn-ovs-setup.sh on the node with the service; search for the rule that starts with "table=3" and ends with "goto_table:7", and change that to "goto_table:6" (and restart the node). Fixed?
@danwinship Thanks for the tip. I'll try your suggestion and update the issue.
/usr/bin/openshift-sdn-ovs-setup.sh does not exist in OpenShift v1.1.0.1.
Hm... the bug fixed in #236 was introduced after openshift-sdn-kube-subnet-setup.sh and openshift-sdn-multitenant-setup.sh were merged. Looking at your patch, it is basically bypassing the ARP cache table. So if that fixes things, that suggests that the problem is that the sender still has the MAC address cached, so it doesn't need to send an ARP, but OVS has dropped it from its cache. And looking at the rules, that definitely seems like it could happen...
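To make the hypothesis above concrete, here is a toy model of the two caches getting out of sync. It is only a sketch: `ExpiringTable`, `send`, and the 600-second sender lifetime are hypothetical names and values chosen for illustration, and the model does not reproduce the actual OVS rules. The 60-second value is the reduced `hard_timeout` mentioned in the report.

```python
from dataclasses import dataclass, field


@dataclass
class ExpiringTable:
    """Maps a key to the time it was learned; entries expire after `lifetime` seconds."""
    lifetime: float
    learned_at: dict = field(default_factory=dict)

    def learn(self, key: str, now: float) -> None:
        self.learned_at[key] = now

    def has(self, key: str, now: float) -> bool:
        t = self.learned_at.get(key)
        return t is not None and now - t < self.lifetime


def send(dst_ip: str, now: float, sender_arp: ExpiringTable, ovs_learn: ExpiringTable) -> str:
    """Describe what happens to the first packet of a new connection at time `now`."""
    if not sender_arp.has(dst_ip, now):
        # The sender has no MAC address, so it ARPs first.
        # In this model the ARP exchange repopulates both tables.
        sender_arp.learn(dst_ip, now)
        ovs_learn.learn(dst_ip, now)
        return "arp-then-delivered"
    if ovs_learn.has(dst_ip, now):
        return "delivered"
    # The sender skipped ARP, but the switch has forgotten the destination.
    return "dropped"


sender_arp = ExpiringTable(lifetime=600)  # hypothetical sender-side cache lifetime
ovs_learn = ExpiringTable(lifetime=60)    # hard_timeout value used in the report

timeline = [send("10.1.1.2", t, sender_arp, ovs_learn) for t in (0, 30, 61, 700)]
print(timeline)
```

In this model the packet sent at t=61 falls in the window where the sender still trusts its cached MAC address but the switch entry has already expired, which matches the "first connection times out, later ones succeed" symptom.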
Congratulations 👍
Your patch leaves redundant rules, so it's not quite right. Between this and #239, it looks like we'll need some heavy rewriting of the OVS rules. I'm working on it.
It seems we're facing the same problem on OpenShift Enterprise 3.0.2, between an OpenShift PHP pod running WordPress and a MySQL DB pod. At times (not really reproducible) WordPress complains that it could not connect to MySQL (Service via Cluster IP).
Hi all,
I've been stuck for several days on an issue where, after a long period of inactivity, I get a connection timeout on the first connection between two pods (subsequent connections are OK).
It seems to happen if no network frame was transmitted during a period equal to the "hard_timeout" value in the Open vSwitch learn table (table 8).
I have decreased this value from 900 seconds to 60 seconds, and I can consistently reproduce the issue if I wait 60 seconds between connection attempts.
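The reproduction described above (one connection attempt, then an idle wait, repeated) could be scripted roughly as follows. This is a minimal sketch, not the script used for the report: `attempt` and `probe` are hypothetical helper names, and the target port is not given in this thread, so any port passed in is an assumption.

```python
import socket
import time


def attempt(host: str, port: int, timeout: float = 5.0):
    """Try one TCP connection; return (succeeded, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


def probe(host: str, port: int, attempts: int = 5,
          idle_seconds: float = 60.0, timeout: float = 5.0):
    """Connect `attempts` times, staying idle for `idle_seconds` between attempts."""
    results = []
    for i in range(attempts):
        if i:
            time.sleep(idle_seconds)  # let the learned flow expire
        results.append(attempt(host, port, timeout))
    return results
```

For example, `probe("172.30.184.188", 3306)` would target the service IP from this report on a hypothetical port 3306. If the behaviour is as described, attempts made after a full idle period would be expected to fail or stall until the timeout, while back-to-back attempts (`idle_seconds=0`) succeed.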
Here is a small diagram of my setup: https://www.dropbox.com/s/esu2gtiz8y6js1x/use_case.png?dl=1
Here is the result of the debug.sh script: https://www.dropbox.com/s/k8x4i2308wpkqlu/openshift-sdn-debug-2015-12-20.tgz?dl=1
I would say that it is related to an ARP issue; here is what I see if I run tcpdump on the VXLAN connection (client side):
(By the way, you can see that the VNI is 0; I'm not sure whether that is expected in a multi-tenant environment.)
The interesting point here is that once the ARP request is done, subsequent connections are successful!
In the meantime, these flows are created in the OVS flow table on the client:
and these flows are created on the server:
Once they have expired the problem appears again.
As a side note, there is no issue if I use the pod IP (10.1.1.2) instead of the service IP (172.30.184.188).
Let me know if you need any other information; this problem prevents us from activating multi-tenant mode.
Thank you!