
Add workaround for spurious retransmits leading to connection resets #1090

Closed
aaronlehmann opened this issue Apr 8, 2016 · 19 comments · May be fixed by #2275

Comments

aaronlehmann (Contributor) commented Apr 8, 2016

There is a longstanding issue over at distribution/distribution#785 where users reported connection resets trying to push to an AWS-hosted registry from inside the AWS network. After months, we've finally narrowed this down to a bad interaction between spurious TCP retransmits and the NAT rules that Docker sets up for bridge networking.

Here is a summary of what happens:

  1. For some reason, when an AWS EC2 machine connects to itself using its external-facing IP address, there are occasional packets with sequence numbers and timestamps that are far behind the rest.
  2. Normally these packets would be ignored as spurious retransmits. However, because the packets fall outside the TCP window, Linux's conntrack module marks them invalid, and their destination addresses do not get rewritten by DNAT.
  3. The packets are eventually interpreted as packets destined to the actual address/port in the IP/TCP headers. Since there is no flow matching these, the host sends a RST.
  4. The RST terminates the actual NAT'd connection, since its source address and port match those of the NAT'd connection.
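One way to check whether this mechanism is in play is to watch conntrack's invalid-packet counter climb while reproducing the resets. A small sketch, assuming the conntrack-tools CLI is installed (its `conntrack -S` output contains per-CPU `invalid=` counters); the helper function here is illustrative, not part of any tool:

```shell
# Sum the per-CPU "invalid" counters from `conntrack -S` output read on stdin.
# A count that rises while the reset reproduces supports the out-of-window theory.
sum_invalid() {
  grep -o 'invalid=[0-9]*' | cut -d= -f2 | awk '{s+=$1} END {print s+0}'
}

# usage (needs root): conntrack -S | sum_invalid
```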

I think it would be hugely helpful for libnetwork to include a workaround for this. It has affected a lot of users trying to use the registry in AWS, and it presumably affects other Dockerized applications as well. While I'll reach out to AWS to point out the spurious retransmits, I don't know if they'll be able to fix them, and there may also be other environments with similar issues.

I've found two possible workarounds:

  • Turn on conntrack's "be liberal" flag: echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal. This causes conntrack/NAT to treat packets outside the TCP window as part of the flow being tracked, instead of marking them invalid and causing them to be handled by the host.
  • Add a rule to drop invalid packets instead of allowing them to trigger RSTs: iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

Both of these can potentially affect non-Docker traffic. The former causes NAT to forward packets that it would otherwise err on the side of not forwarding, which seems relatively harmless, but it's a system-level setting, so it's not limited to Docker flows. The latter would drop any packets that conntrack deems invalid, system-wide, unless we added specific destination filters for the addresses/ports that Docker set up NAT rules for, which could add overhead.
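The two workarounds above can be sketched as a small script. The commands are the ones from this issue; the `DRY_RUN` wrapper is added purely for illustration, so the sketch can be exercised without root. Both settings remain system-wide, as noted.

```shell
# Print commands instead of executing when DRY_RUN=1 (both commands need root).
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

apply_workarounds() {
  # 1. Tell conntrack to keep tracking out-of-window packets ("be liberal").
  #    This sysctl name maps to /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
  #    on older kernels; see later comments for the newer path.
  run sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
  # 2. Drop packets conntrack marks INVALID before the host can answer with a RST.
  run iptables -I INPUT -m conntrack --ctstate INVALID -j DROP
}
```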

It may be too late to hope for a workaround to be included in Docker 1.11, but anything we can do on this front will really improve the lives of Docker users on AWS.

thaJeztah (Member) commented May 2, 2016

@aaronlehmann I saw the linked issue turned out to be a problem on AWS's side; is there still something that needs to be done in libnetwork?

aaronlehmann (Contributor, Author) commented May 2, 2016

@thaJeztah: This issue is a suggestion to work around problems like this in libnetwork. The problem came from a combination of invalid packets generated somewhere in AWS' infrastructure, and the NAT setup used by libnetwork reacting to those invalid packets by tearing down the connection. This means the invalid packets cause problems for Dockerized applications but they are harmless for most other setups. moby/moby#19532 revealed that this problem was also seen on a residential internet connection. I think there is value in finding a workaround.

jrabbit commented Jun 4, 2016

I'm being bitten by this in production; what more information could I provide?

middleagedman commented Jun 26, 2016

Same here. A simple Docker container build on an Arch Linux system on a residential connection, just trying to do a git clone from an HTTPS git site (Bitbucket):

  • GnuTLS recv error (-54): Error in the pull function.
  • Closing connection 1
  • error: RPC failed; result=56, HTTP code = 200
  • fatal: The remote end hung up unexpectedly
  • fatal: early EOF
  • fatal: index-pack failed

BenSjoberg commented Jul 28, 2016

Just ran into this on my office's internal network. Thankfully I found this page or all my hair would be ripped out by morning.

The iptables workaround did the trick for me, thanks very much for providing that. If it helps, I'm running Docker 1.11.2 on Ubuntu 16.04. Let me know if there's any more information I can give that would be useful.

GordonTheTurtle commented Aug 30, 2017

@aaronlehmann It has been detected that this issue has not received any activity in over 6 months. Can you please let us know if it is still relevant:

  • For a bug: do you still experience the issue with the latest version?
  • For a feature request: was your request appropriately answered in a later version?

Thank you!
This issue will be automatically closed in 1 week unless it is commented on.
For more information please refer to #1926

aaronlehmann (Contributor, Author) commented Aug 30, 2017

A fix was implemented in AWS. I don't think a workaround is necessary anymore.

mitchcapper commented Sep 7, 2017

I will note that this also happens on networks outside of AWS. The iptables fix does resolve it, but you first have to find this issue to learn that. The errors are very generic, so if implementing the fix in Docker is not a big deal, it would probably save some people many hours of research. :)

vduglued commented Oct 14, 2017

Any solution to this problem on a macOS host?

p53 commented Nov 3, 2017

We have a similar problem: downloading a file into our Docker image from Nexus throws "connection reset by peer". Adding the iptables rule fixes it.

guillon commented Sep 28, 2018

As has been reported multiple times (@middleagedman, @BenSjoberg, @mitchcapper, @p53), the iptables fix resolves the issue ("connection reset by peer", i.e. a RST packet sent at the TCP level).
Quick fix (ref @aaronlehmann): iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

The issue actually occurs in any container running on the default bridge network. Whether it occurs frequently or not depends on a lot of factors (bandwidth, latency, host load), but it will occur at some point. This issue is probably misunderstood most of the time and incorrectly attributed to a transient network partition, but it is not one: it is a bug in the NAT setup installed by Docker.

We hit this issue with a perfectly valid TCP client-server transfer (for instance, a curl in a container downloading a large file through HTTP from an external server at high throughput). Do the very same download from the host directly and all is fine; do it from a container on the same host and it breaks.

The problem, as already mentioned by @aaronlehmann, is that benign "invalid" packets to the SNAT'ed container (caused for instance by TCP window overflow due to high throughput but a slow client) are assigned to the host interface and incorrectly considered martians, which causes a connection reset.
This is a limitation of conntrack, which does not differentiate perfectly legal packets causing TCP window overflow from actually malformed packets (all get treated as INVALID). Hence the need to drop any conntrack INVALID packet when installing SNAT'ed virtual networks.

This problem is referenced in several places, due to this netfilter/conntrack limitation:
https://serverfault.com/a/312687
https://www.spinics.net/lists/netfilter/msg51409.html
Quoting the last link from netfilter mailing list:

If NAT is enabled, never ever let packets with INVALID state pass through, because NAT will skip them.
Best regards,
Jozsef

The source NAT rules in iptables are installed by Docker for its bridge network support and are thus incomplete.
It should be Docker's responsibility to set this up correctly.
Apparently this was never fixed, hence my request to re-open this issue.

I can attempt to make a pull request if that helps, or open a new issue if needed; let me know.

Note that the abandoned pull request attempt #1129 does not fix the issue, because the inserted rule does not drop the packets. There should be no filter on the destination, because at that point the destination is not yet NAT'ed. Any conntrack-invalid packets in the filter table's INPUT chain have to be dropped, as in: iptables -I INPUT -m conntrack --ctstate INVALID -j DROP.

guillon added a commit to guillon/libnetwork that referenced this issue Oct 2, 2018
Add drop of conntrack INVALID packets in input
such that invalid packets due to TCP window overflow do
not cause a connection reset.

Due to some netfilter/conntrack limitations, invalid packets
are never treated as NAT'ed but reassigned to the
host and considered martians.
This causes a RST response from the host and resets the connection.
As soon as NAT is setup, for bridge networks for instance,
invalid packets have to be dropped in input.

The implementation adds a generic DOCKER-INPUT chain prefilled
with a rule for dropping invalid packets and a return rule.
As soon as some bridge network is setup, the DOCKER-INPUT
chain call is inserted in the filter table INPUT chain.

Fixes moby#1090.

Signed-off-by: Christophe Guillon <christophe.guillon@st.com>
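As a rough illustration, the chain described in the commit message might be installed like this. This is a sketch only: the chain and rule names follow the commit message, the actual rules in PR #2275 may differ, and the `DRY_RUN` wrapper is added here so the sketch can be exercised without root.

```shell
# Print commands instead of executing when DRY_RUN=1 (iptables needs root).
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

install_docker_input() {
  run iptables -N DOCKER-INPUT                                     # dedicated chain
  run iptables -A DOCKER-INPUT -m conntrack --ctstate INVALID -j DROP
  run iptables -A DOCKER-INPUT -j RETURN                           # everything else falls through
  run iptables -I INPUT -j DOCKER-INPUT                            # hook in once a bridge network exists
}
```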
dcui commented Apr 30, 2019

FYI:
"/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal" has gone since 2016-08-13 (see "netfilter: remove ip_conntrack* sysctl compat code" https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adf0516845bcd0e626323c858ece28ee58c74455)

Now I think we should use "/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" instead.
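A hedged sketch of picking whichever knob the running kernel exposes. The helper function and its optional prefix argument are inventions for illustration (the prefix exists only so the path logic can be tested against a fake tree):

```shell
# Return the be_liberal sysctl file the kernel exposes: the newer
# net/netfilter path if present, else the old ipv4 compat path.
liberal_knob() {
  root="${1:-}"  # optional fake-root prefix, for testing only
  if [ -e "$root/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" ]; then
    echo "$root/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal"
  else
    echo "$root/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal"
  fi
}

# usage (needs root): echo 1 > "$(liberal_knob)"
```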

johannesboon commented Jun 8, 2019

FYI: This is also an issue for kubernetes that they are trying to solve with similar strategies:

kubernetes/kubernetes#74840

guillon commented Jun 10, 2019

Hi @aaronlehmann,
I think this issue was closed but never fixed; could you consider re-opening it?
Note that PR #2275 solves the issue.

unilynx commented Nov 28, 2019

I'm using neither AWS nor Kubernetes, and I see the issue too between our office network (where our CI runners run) and external resources at DigitalOcean or maxmind.com. It generally manifests itself as

curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

With tcpdumps I see lost but then reappearing packets (reappearing after about 90ms, or 200KB of data) triggering a RST. I'm not sure where the actual problem is; I'm assuming our ISP is doing something funky or link aggregation is mangling packets. It happens mostly during quiet hours, and the underlying network issue is probably something we have to live with, but a 90ms packet delay shouldn't terminate connections.

The liberal sysctl fixes our issue (and firewalling RST probably too), but as the issue is not AWS (or even K8S specific) I too think this issue should be reopened.

imbstack added a commit to taskcluster/monopacker that referenced this issue Apr 24, 2020
rwkarg commented Jun 22, 2020

This is impacting us as well, just using plain Docker. Should this issue be reopened?

leakingtapan commented Oct 26, 2020

Had the same issue on GCP when downloading a large file from inside a container using curl. The iptables rule solves the problem for me. Another workaround was to use wget instead of curl, though that workaround might not be generally applicable to all cases.

ssup2 commented Nov 19, 2020

Hello. To solve this problem, I developed a Kubernetes controller called network-node-manager. By simply deploying and configuring network-node-manager, you can apply the iptables -I INPUT -m conntrack --ctstate INVALID -j DROP rule to all nodes of a cluster. Please try it and give me feedback. Thanks.

https://github.com/kakao/network-node-manager

karunchennuri commented Dec 17, 2020

This issue pretty much exists in the non-AWS, non-GCP world as well. We run our clusters on-prem and were able to reproduce it, especially with outbound requests carrying larger payloads. Getting into details...

Problem: An app team reported an issue with their app's behavior. The app reaches an outbound external service with payloads of various sizes. In literal cURL terms, it's nothing but passing JSON payloads in --data-raw. What was weird was that requests went through fine with smaller payloads, but once the payload size reached a certain number of KB, the request went outbound through the firewall and was executed on the external service, yet the response never reached the container. We thought it was an intermittent issue, but no: we could reproduce it 100% of the time with a certain request payload size.

Steps we took to narrow down:

  • To rule out bad behavior in the app itself due to a coding issue, we wrote the simplest possible client, i.e. running cURL directly from within an SSH'd container instance.
  • We ran the curl with a smaller payload from the worker node where the container is hosted; this worked.
  • Ran the curl from the worker node with a larger payload; this worked.
  • We then ran the cURL with a smaller payload from within the container (app instance); this worked.
  • Ran the curl from the container with a larger payload; this failed (intermittently at times).
  • Took packet captures on the container virtual interface (overlay networking) and the default eth0 interface.
  • Packet captures on the virtual interface showed no abnormal behavior, but the pcap on eth0 showed RSTs from the worker node to the external service within a second or two of the request initiation.
  • We took captures on the external endpoint as well as on the firewall. All of them showed symptoms of the problem but not the root cause.
  • We tried running the same cURL on other clustered environments based on Kubernetes and could reproduce this issue on every Docker runtime. Cloud Foundry uses the Garden technology, but it still delegates the job to runC, the container runtime that Docker is also based on.
  • For us, running echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal on the worker node did the trick! Thanks @aaronlehmann for taking the time to draft this issue. Had we not stumbled on it, who knows how many man-hours we would have spent troubleshooting.

Since this is a system-level setting that impacts more than just Docker traffic, we are still looking for the best action that meets our environment's needs. I'm not inclined to give a resolution step, but just thought I'd share my thoughts/experience with this issue and how it took several man-hours of effort to identify the root cause. Reading through the above responses, I'm curious how this was fixed in AWS, and whether a fix exists in any Docker release (considering this issue showed up 4 years ago). If it is not yet fixed, what's the best way forward to reopen this issue?
