
DNS latency of 5s when using iptables forward in pod network traffic #62628

Closed
xiaoxubeii opened this issue Apr 16, 2018 · 19 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@xiaoxubeii
Member

xiaoxubeii commented Apr 16, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
DNS gets a 5s latency on the AAAA query when iptables forward is used for network traffic between pods.

What you expected to happen:
No latency.

How to reproduce it (as minimally and precisely as possible):

  • CNI configuration

    {
        "name": "mynet",
        "type": "macvlan",
        "master": "eth0",
        "ipam": {
            "type": "host-local",
            "subnet": "172.20.0.0/17",
            "rangeStart": "172.20.64.129",
            "rangeEnd": "172.20.64.254",
            "gateway": "172.20.127.254",
            "routes": [
                {"dst": "0.0.0.0/0"},
                {"dst": "172.20.80.0/24", "gw": "172.20.0.62"}
            ]
        }
    }
  • Network Architecture
    The cluster CIDR is 172.20.80.0/24, and the gateway is the current node. The cluster, pods and nodes are in an L2 network using VXLAN.

Anything else we need to know?:
If the CNI gateway for the cluster CIDR is the current node, network traffic between pods and services goes through the iptables FORWARD chain:

-P FORWARD ACCEPT
-A FORWARD -m comment --comment "kubernetes forward rules" -j KUBE-FORWARD

-N KUBE-FORWARD
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

With forwarding conntrack enabled, netfilter drops the first AAAA packet of a DNS request, which causes a DNS latency of 5s.
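For anyone reproducing this: a minimal way to confirm the conntrack race on the node, assuming conntrack-tools is installed (the service name below is just an example), is to watch the insert_failed counter while triggering parallel A/AAAA lookups from a pod:

# on the node: insert_failed increments each time an unconfirmed
# conntrack entry loses the race and its packet is dropped
conntrack -S | grep -o 'insert_failed=[0-9]*'

# in a pod with glibc: getaddrinfo sends the A and AAAA queries in parallel
for i in $(seq 1 20); do getent ahosts kubernetes.default >/dev/null; done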

Environment:

  • Kubernetes version (use kubectl version): v1.9.2
  • Cloud provider or hardware configuration: None
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.2.1511 (Core)
  • Kernel (e.g. uname -a): 3.10.0-327.18.2.el7.x86_64
  • Install tools: kubeadm
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Apr 16, 2018
@xiaoxubeii
Member Author

xiaoxubeii commented Apr 16, 2018

/sig network
/assign

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2018
@MrHohn
Member

MrHohn commented Apr 30, 2018

If forwarding conntrack is enabled, netfilter will drop the last AAAA record packet when requesting DNS

@xiaoxubeii Could you elaborate a bit more on this behavior? Is this a bug in netfilter? Or is it a bug in kube-proxy, in that it doesn't follow a certain standard while using netfilter? Thanks.

cc @bowei

@Quentin-M
Contributor

Quentin-M commented Apr 30, 2018

I am, by the way, experiencing the same thing. With Kubernetes 1.10 + CoreOS + Weave + CoreDNS/kube-dns + kube-proxy ipvs, I see a constant 5s latency on DNS resolution. tcpdump shows that the first AAAA requests get lost somehow: https://hastebin.com/banulayire.swift. With single-request or single-request-reopen, the issue is gone.
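(For reference, the workaround is a single extra line in the pod's /etc/resolv.conf, and only glibc 2.9+ resolvers honor it:)

# /etc/resolv.conf inside the pod (glibc only)
options single-request-reopen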

@Quentin-M
Contributor

Quentin-M commented Apr 30, 2018

@bboreham
Contributor

bboreham commented May 1, 2018

Most of the comments relate to things which will cause intermittent packet loss, but OP seems to be talking about a consistent symptom - every time you do the request it will drop the same packet.

Am I understanding the OP correctly?

I can’t imagine what would cause it to drop the last packet. How would it know it’s the last one?

@Quentin-M
Contributor

@bboreham The blog post I linked above explains the issue very well. It's a race condition with conntrack/SNAT. glibc/musl are very good at triggering it when sending A/AAAA lookups in parallel. Using single-request-reopen works around the issue by serializing the queries. A better fix (as documented) is to add --random-fully to every MASQUERADE rule (kubelet, kube-proxy, overlay).
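As an illustration only (not the rule kube-proxy or the CNI plugins actually install), a masquerade rule with fully random port allocation could look like the following; it assumes iptables 1.6.2+ and reuses the pod CIDR from this issue:

# hypothetical POSTROUTING rule: --random-fully randomizes the SNAT source
# port so two parallel UDP flows cannot race for the same conntrack tuple
iptables -t nat -A POSTROUTING -s 172.20.0.0/17 ! -d 172.20.0.0/17 \
    -j MASQUERADE --random-fully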

@xiaoxubeii
Member Author

@MrHohn @bboreham @Quentin-M I think the problem I hit is a race condition on conntrack insertions. I used the node as the gateway, so it redirects packets from pods to services. When iptables has:

-A KUBE-FORWARD -s 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 172.20.0.0/17 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

and we send DNS requests from a pod, the second DNS request arrives while the first one is still unconfirmed, so both requests have an unconfirmed conntrack entry. The second one is then dropped in nf_conntrack_confirm, which results in a DNS timeout and retransmit.

So a simple solution is to use single-request-reopen:

single-request-reopen (since glibc 2.9)
                     Sets RES_SNGLKUPREOP in _res.options.  The resolver
                     uses the same socket for the A and AAAA requests.  Some
                     hardware mistakenly sends back only one reply.  When
                     that happens the client system will sit and wait for
                     the second reply.  Turning this option on changes this
                     behavior so that if two requests from the same port are
                     not handled correctly it will close the socket and open
                     a new one before sending the second request.

It is more of a netfilter bug or flaw, but I think kube-dns also needs to do something to avoid it.

@Quentin-M
Contributor

Quentin-M commented May 2, 2018

Yes, this is what's described there. AFAIK, the real solution would be to patch kubelet, kube-proxy and the overlay networks (flannel, weave, calico, etc.), adding --random-fully to the MASQUERADE rules.

But I agree that your patch, while slowing down DNS lookups a little (well, it's not as bad as 5s+!), is simple and effective. People in other threads have also been mentioning deployment initializers, using dnsPolicy=None in Kubernetes 1.10, or manually mounting /etc/resolv.conf, but I'd rather not force cluster users to apply such workarounds when it's really an infrastructure issue.

Or it's a totally different issue?
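(For reference, the dnsPolicy: None route mentioned above looks roughly like this; it is only a sketch, and the nameserver IP and search domains are placeholders that depend on the cluster:)

apiVersion: v1
kind: Pod
metadata:
  name: dns-workaround-example        # hypothetical name
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.96.0.10                    # placeholder: cluster DNS service IP
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"
      - name: single-request-reopen   # ignored by musl-based images
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]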

@bboreham
Contributor

bboreham commented May 2, 2018

It's great that you are all agreed on the cause, but still: a race condition would randomly cause, or not cause, the problem.

So, was it consistent - happened every time, or occasional - sometimes happened?

@Quentin-M
Contributor

Fair point. I wonder if something like this might be happening instead? I am not familiar enough with networking to be able to tell.

@xiaoxubeii
Member Author

@Quentin-M I am not sure which one causes this problem, the conntrack match in KUBE-FORWARD or the MASQUERADE rules, because in my case both exist. And when I remove the conntrack match from KUBE-FORWARD, the problem is gone (or at least alleviated).

@bboreham In my case, it is consistent. It always drops the AAAA packet.

@bboreham
Contributor

bboreham commented May 8, 2018

The inestimable @brb has found another race condition, which will tend to cause the last packet to be dropped. weaveworks/weave#3287 (comment)

@Quentin-M
Contributor

Quentin-M commented May 15, 2018

I would just like to add here that the single-request(-reopen) workaround does not work with Alpine-based containers, as musl does not support the option (see below). Unfortunately, Alpine Linux is the base image of 90% of our infrastructure.

src/network/resolvconf.c

                if (!strncmp(line, "options", 7) && isspace(line[7])) {
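                        /* note: only ndots, attempts and timeout are parsed
                           here; single-request(-reopen) is not recognized */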
                        p = strstr(line, "ndots:");
                        if (p && isdigit(p[6])) {
                                p += 6;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->ndots = x > 15 ? 15 : x;
                        }
                        p = strstr(line, "attempts:");
                        if (p && isdigit(p[9])) {
                                p += 9;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->attempts = x > 10 ? 10 : x;
                        }
                        p = strstr(line, "timeout:");
                        if (p && (isdigit(p[8]) || p[8]=='.')) {
                                p += 8;
                                unsigned long x = strtoul(p, &z, 10);
                                if (z != p) conf->timeout = x > 60 ? 60 : x;
                        }
                        continue;
                }

src/network/lookup.h

struct resolvconf {
        struct address ns[MAXNS];
        unsigned nns, attempts, ndots;
        unsigned timeout;
};

I have reached out on freenode's #musl channel, but unfortunately it does not seem like there is much desire to add support for the option:

[16:19] <dalias> why not fix the bug causing it?
[16:20] <dalias> sprry
[16:20] <dalias> the option is not something that can be added, its contrary to the lookup architecture
[17:39] <dalias> quentinm, thanks for the report. i just don't know any good way to work around it on our side without nasty hacks
[17:40] <dalias> the architecture is not designed to support sequential queries

@xiaoxubeii
Member Author

/close

@Quentin-M
Contributor

I just posted a little write-up about our journey troubleshooting the issue, and how we worked around it in production: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/.

@szuecs
Member

szuecs commented Jul 9, 2018

@xiaoxubeii @Quentin-M what is the current favorite workaround for this?

We run flannel with VXLAN and see occasional blips in our monitoring where DNS requests spike up to 5s. One time we probably (not 100% sure) had a production incident because of that.

Should I port the script referenced by @Quentin-M to flannel or is there already something else?

@szuecs
Member

szuecs commented Jul 10, 2018

FYI: flannel-io/flannel#1001 (comment)
And the "port to flannel" https://github.com/szuecs/flannel-tc

@bboreham
Contributor

bboreham commented Aug 2, 2018

@xiaoxubeii why is this issue closed? @brb has fixed one of the kernel races but other causes remain.

@inter169

inter169 commented Aug 3, 2018

I coded a fix for musl on Alpine Linux 3.7, which removes the AAAA query by default (for AF_UNSPEC).

see #56903 (comment)

thanks,
harper
