This repository has been archived by the owner on Jul 25, 2019. It is now read-only.

Timeout on first connection attempt in multi-tenant mode after some time of inactivity (origin 1.1.0.1) #231

Closed
barkbay opened this issue Dec 20, 2015 · 11 comments

@barkbay

barkbay commented Dec 20, 2015

Hi all,

I've been stuck for several days on an issue where, after a fairly long period of inactivity, the first connection between two pods times out (subsequent connections are fine):

lenaic-node-compute-1ba08.example.com origin-node[2055]: I1220 15:08:33.591561    2055 proxysocket.go:96] Mapped service "postgresql2/postgresql:postgresql" to endpoint 10.1.1.2:5432
lenaic-node-compute-1ba08.example.com origin-node[2055]: E1220 15:08:34.591741    2055 proxysocket.go:104] Dial failed: dial tcp 10.1.1.2:5432: i/o timeout
lenaic-node-compute-1ba08.example.com origin-node[2055]: I1220 15:08:34.591788    2055 roundrobin.go:130] NextEndpoint for service "postgresql2/postgresql:postgresql", srcAddr=10.1.2.2:37328: endpoints: [10.1.1.2:5432]
lenaic-node-compute-1ba08.example.com origin-node[2055]: I1220 15:08:34.591806    2055 proxysocket.go:96] Mapped service "postgresql2/postgresql:postgresql" to endpoint 10.1.1.2:5432
lenaic-node-compute-1ba08.example.com origin-node[2055]: E1220 15:08:36.592010    2055 proxysocket.go:104] Dial failed: dial tcp 10.1.1.2:5432: i/o timeout
lenaic-node-compute-1ba08.example.com origin-node[2055]: E1220 15:08:36.592043    2055 proxysocket.go:138] Failed to connect to balancer: failed to connect to an endpoint.

It seems to happen if no network frame has been transmitted for a period equal to the "hard_timeout" value of the Open vSwitch learn table (table 8).
I have decreased this value from 900 seconds to 60 seconds, and I can consistently reproduce the issue if I wait 60 seconds between connection attempts.
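
A quick, hedged way to watch those learned flows age out (the bridge name br0 is an assumption, taken from openshift-sdn's default):

# Dump the learned MAC -> tunnel-destination flows and their hard_timeout
ovs-ofctl dump-flows br0 "table=8"
# Re-run after the hard_timeout has elapsed with no traffic: the entries are gone,
# and the next connection attempt through the service proxy times out again
watch -n 10 'ovs-ofctl dump-flows br0 "table=8"'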

Here is a small diagram of my setup : https://www.dropbox.com/s/esu2gtiz8y6js1x/use_case.png?dl=1
Here is the result of the debug.sh script : https://www.dropbox.com/s/k8x4i2308wpkqlu/openshift-sdn-debug-2015-12-20.tgz?dl=1

I would say it is related to an ARP issue; here is what I see if I run tcpdump on the VXLAN connection (client side):

[root@lenaic-node-compute-1ba08 openshift]# tcpdump -i eth0 -nn -v udp port 4789 
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes 
14:11:22.222814 IP (tos 0x0, ttl 64, id 1049, offset 0, flags [DF], proto UDP (17), length 110) 
   192.168.55.233.36826 > 192.168.55.175.4789: VXLAN, flags [I] (0x08), vni 0 
IP (tos 0x0, ttl 64, id 3905, offset 0, flags [DF], proto TCP (6), length 60) 
   10.1.2.1.60708 > 10.1.1.2.5432: Flags [S], cksum 0x7bfd (correct), seq 2334882139, win 14100, options [mss 1410,sackOK,TS val 18773530 ecr 0,nop,wscale 7], length 0 
                                    [... cut a lot of SYN packets ...]
14:11:24.976028 IP (tos 0x0, ttl 64, id 1053, offset 0, flags [DF], proto UDP (17), length 110) 
   192.168.55.233.35083 > 192.168.55.175.4789: VXLAN, flags [I] (0x08), vni 0 
IP (tos 0x0, ttl 64, id 3443, offset 0, flags [DF], proto TCP (6), length 60) 
   10.1.2.1.60715 > 10.1.1.2.5432: Flags [S], cksum 0xb13d (correct), seq 1392144771, win 14100, options [mss 1410,sackOK,TS val 18776284 ecr 0,nop,wscale 7], length 0 
14:11:27.236170 IP (tos 0x0, ttl 64, id 1054, offset 0, flags [DF], proto UDP (17), length 78) 
   192.168.55.233.39892 > 192.168.55.175.4789: VXLAN, flags [I] (0x08), vni 0 
ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.1.2 tell 10.1.2.1, length 28 
14:11:27.236741 IP (tos 0x0, ttl 64, id 17103, offset 0, flags [DF], proto UDP (17), length 78) 
   192.168.55.175.37297 > 192.168.55.233.4789: VXLAN, flags [I] (0x08), vni 0 
ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.1.2 is-at 02:42:0a:01:01:02, length 28

(By the way, you can see that the vni is 0; I'm not sure whether that is expected in a multi-tenant environment.)
The interesting point here is that once the ARP exchange has happened, subsequent connections are successful!

In the meantime, these flows are created in the OVS flow table on the client:

cookie=0x0, duration=14.426s, table=8, n_packets=16, n_bytes=1390, hard_timeout=60, priority=200,dl_dst=02:42:0a:01:02:02 actions=load:0->NXM_NX_TUN_IPV4_DST[],output:4 
cookie=0x0, duration=14.426s, table=8, n_packets=3, n_bytes=126, hard_timeout=60, priority=200,dl_dst=02:9e:4d:7f:00:a3 actions=load:0->NXM_NX_TUN_IPV4_DST[],output:2 
cookie=0x0, duration=9.419s, table=8, n_packets=0, n_bytes=0, hard_timeout=60, priority=200,dl_dst=02:42:0a:01:01:02 actions=load:0xc0a837af->NXM_NX_TUN_IPV4_DST[],output:1

and these flows are created on the server:

cookie=0x0, duration=10.301s, table=8, n_packets=1, n_bytes=42, hard_timeout=60, priority=200,dl_dst=02:9e:4d:7f:00:a3 actions=load:0xc0a837e9->NXM_NX_TUN_IPV4_DST[],output:1 
cookie=0x0, duration=5.291s, table=8, n_packets=0, n_bytes=0, hard_timeout=60, priority=200,dl_dst=02:42:0a:01:01:02 actions=load:0->NXM_NX_TUN_IPV4_DST[],output:4

Once they have expired, the problem appears again.

As a side note, there is no issue if I use the pod IP (10.1.1.2) instead of the service IP (172.30.184.188).
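
For reference, a minimal way to reproduce the difference from the client side (addresses taken from the logs above; nc assumed to be available):

# First attempt after the idle period: the service IP times out, the pod IP does not
nc -z -v -w 5 172.30.184.188 5432   # service IP: i/o timeout on the first try
nc -z -v -w 5 10.1.1.2 5432         # pod IP: connects immediately
nc -z -v -w 5 172.30.184.188 5432   # service IP works again, until the learned flows expire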

Let me know if you need any other information; this problem prevents us from enabling multi-tenant mode.

Thank you !

@barkbay
Author

barkbay commented Dec 21, 2015

On the VXLAN interface, the VNID is lost when the service IP (172.30.184.188) is used:

[root@lenaic-node-compute-1ba08 openshift]# tcpdump -vvv -nn -i eth0 udp|grep vni
192.168.55.233.59854 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0
192.168.55.175.48585 > 192.168.55.233.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
192.168.55.233.59854 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0
192.168.55.233.59854 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0
[....]

If I use the pod IP (10.1.1.2), everything is OK:

[root@lenaic-node-compute-1ba08 openshift]# tcpdump -vvv -nn -i eth0 udp|grep vni
192.168.55.233.43037 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
192.168.55.175.44208 > 192.168.55.233.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
192.168.55.233.43037 > 192.168.55.175.4789: [no cksum] VXLAN, flags [I] (0x08), vni 13
[...]

I think this is the root of the problem.
Any ideas?

@rajatchopra
Contributor

'vnid 0' is expected because the service proxy has global access.
Not sure why the ARP learning takes so long that the connection times out.
/cc @danwinship

@barkbay
Author

barkbay commented Dec 21, 2015

Why do we have to wait for an ARP request?
Isn't it the purpose of table 6 to send the data to the right port?

cookie=0x0, duration=3756.657s, table=6, n_packets=82, n_bytes=6511, priority=200,ip,reg0=0 actions=goto_table:8
cookie=0x0, duration=3751.984s, table=6, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xd,nw_dst=10.1.1.2 actions=output:4

Port 4 is the pod I want to connect to.
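
For reference, a hedged way to check which OVS port number maps to which interface (bridge name br0 assumed):

# List the bridge ports with their OpenFlow port numbers; port 4 should be the veth of the target pod
ovs-ofctl show br0
# Cross-check the interface names attached to the bridge
ovs-vsctl list-ports br0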

@danwinship
Contributor

Why do we have to wait for an ARP request?
Isn't it the purpose of table 6 to send the data to the right port?

Yes, but the pods don't know about that; they think they are connected to an Ethernet network, so the source pod can't send any data to the destination until it knows the destination's MAC address (since it has to write that address into the packet).
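
A rough way to see this from the node (the pod's PID is a placeholder; nsenter assumed to be available):

# Inspect the client pod's neighbour/ARP cache from the host
nsenter -t <client-pod-pid> -n ip neigh show | grep 10.1.1.2
# A REACHABLE or STALE entry means the sender writes the cached MAC into the frame
# without re-ARPing, even if OVS has already forgotten which port that MAC lives on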

@danwinship
Contributor

@barkbay Try editing /usr/bin/openshift-sdn-ovs-setup.sh on the node with the service; search for the rule that starts with "table=3" and ends with "goto_table:7", and change that to "goto_table:6" (and restart the node). Fixed?
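
Roughly, as a sketch (file path from the comment above; the node service name is inferred from the origin-node logs at the top of this issue):

# Locate the table=3 rule that ends in goto_table:7
grep -n 'table=3' /usr/bin/openshift-sdn-ovs-setup.sh
# Change goto_table:7 to goto_table:6 in that one rule, then restart the node
systemctl restart origin-node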

@barkbay
Author

barkbay commented Jan 6, 2016

@danwinship Thanks for the tip.
FYI, here is the fix I'm currently testing on my cluster: https://gist.github.com/barkbay/cf57b2b4022bbc2d02ba
So far it also seems to solve the problem.

I'll try your suggestion and update the issue.

@barkbay
Author

barkbay commented Jan 13, 2016

/usr/bin/openshift-sdn-ovs-setup.sh does not exist in OpenShift v1.1.0.1

@danwinship
Contributor

/usr/bin/openshift-sdn-ovs-setup.sh does not exist in OpenShift v1.1.0.1

Hm... the bug fixed in #236 was introduced after openshift-sdn-kube-subnet-setup.sh and openshift-sdn-multitenant-setup.sh were merged.

Looking at your patch, it is basically bypassing the ARP cache table. So if that fixes things, that suggests that the problem is that the sender still has the MAC address cached, so doesn't need to send an ARP, but OVS has dropped it from its cache. And looking at the rules, that definitely seems like it could happen...
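
A hedged way to confirm that mismatch on the client node (bridge name br0 is an assumption; the MAC is the one from the ARP reply captured above):

# The sender (here the node running the service proxy) still has the destination MAC cached...
ip neigh show | grep 10.1.1.2
# ...while the learned flow for that MAC has already expired from table 8
ovs-ofctl dump-flows br0 "table=8,dl_dst=02:42:0a:01:01:02"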

@barkbay
Author

barkbay commented Jan 20, 2016

Looking at your patch, it is basically bypassing the ARP cache table. So if that fixes things, that suggests that the problem is that the sender still has the MAC address cached, so doesn't need to send an ARP, but OVS has dropped it from its cache. And looking at the rules, that definitely seems like it could happen...

Congratulations 👍
Indeed, if I regularly delete the ARP entries on the hosts, the problem no longer appears.
I need a battle plan. Do you think my patch is an acceptable solution?
Let me know how I can help.
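
A rough sketch of that periodic workaround (crude, and a workaround only; it needs to run more often than the learn table's hard_timeout):

# Flush the host's neighbour entries so the next proxied connection has to re-ARP,
# which repopulates the learned flows in table 8
ip -s neigh flush all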

@danwinship
Contributor

Your patch leaves redundant rules, so it's not quite right. Between this and #239, it looks like we'll need some heavy rewriting of the OVS rules. I'm working on it.

@mmaier2

mmaier2 commented Jan 22, 2016

It seems we're facing the same problem on OpenShift Enterprise 3.0.2, between an OpenShift PHP pod running WordPress and a MySQL DB pod. At times - not really reproducible - WordPress complains that it cannot connect to MySQL (service via cluster IP).
And indeed, any connection (curl, telnet, etc.) to the cluster IP and port of the MySQL service leads to a timeout. But after a connection is made to the IP of the MySQL pod itself, the service cluster IP works again, i.e. the MySQL server responds on the cluster IP.
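
A hedged reproduction sketch of the sequence described above (the IPs are placeholders; 3306 is the default MySQL port):

# First attempt after an idle period: the service cluster IP times out
nc -z -v -w 5 <mysql-cluster-ip> 3306
# Connect to the pod IP directly...
nc -z -v -w 5 <mysql-pod-ip> 3306
# ...after which the cluster IP responds again
nc -z -v -w 5 <mysql-cluster-ip> 3306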
