Protokube Gossip causing network congestion #11064
Comments
This is a tricky one ... The kOps version you are using is somewhat old, and newer versions will have updated mesh libraries, among other things. There is also #7427, which led to #7521. Also be aware that the 1.16 upgrade had an issue w.r.t. gossip: #8771
Agreed on both points: we're running an older version of kops, and reproducing with such cluster sizes is inaccessible to most. We are in the process of upgrading kops to ~1.18, and we have long-term plans to front the masters with a VIP, since master discovery is our primary use case for Protokube (rather than paying the cost of Gossip across >1k nodes just for master discovery). We mainly wanted to formally flag this, since we couldn't find an open or closed GitHub issue. And if anyone has immediate mitigation input, we'd welcome it. (For now, we're increasing TCP buffer sizes to mitigate; see the sketch below.)
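As an illustrative aside (not from the original thread): buffer tuning like that mentioned above would typically be done with sysctls along these lines. The exact values here are assumptions, not the ones used on these clusters.

```sh
# Raise the maximum socket buffer sizes (values are illustrative).
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# min / default / max TCP read and write buffers, in bytes.
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```

Settings like these can be persisted under /etc/sysctl.d/ so they survive reboots.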
I'd try upgrading to 1.16+ and see if the alternative mesh implementation improves things. How do you plan on implementing a VIP for the masters on AWS? Worth mentioning that upcoming versions of kops have several other important levers for scaling the control plane, such as dedicated API server nodes and etcd. This is especially useful for scaling larger clusters.
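For reference, the upgrade being suggested would roughly follow the standard kops flow below; the state store and cluster name here are placeholders.

```sh
# Placeholders: substitute your own state store and cluster name.
export KOPS_STATE_STORE=s3://my-kops-state-store
export NAME=my.example.com

# Bump the cluster to the new version, apply it, then roll the nodes.
kops upgrade cluster $NAME --yes
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes
```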
Yep, we're planning on going to 1.18. Optimistically, this will fix the Gossip issues, but even if it doesn't, we're re-thinking our master-discovery mechanism.
Still evaluating options, and I'm ramping up on kOps, but I'm interested in chasing the following options, roughly in order:
I'll come back to this issue and comment with a summary of what we end up doing.
Thanks. Layer 7 load balancing is a bit tricky because of TLS.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-contributor-experience at kubernetes/community.
1. What `kops` version are you running? The command `kops version` will display this information.
2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.
3. What cloud provider are you using?
AWS.
4. What commands did you run? What is the simplest way to reproduce this issue?
Protokube is containerized, attached to each node's host network. Some Docker inspection was provided (some information redacted); if more is needed, please ask.
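The inspection output itself is redacted above; as a sketch, commands along these lines would gather it (the container name protokube is an assumption):

```sh
# Confirm the container is attached to the host network.
docker ps --filter name=protokube
docker inspect --format '{{ .HostConfig.NetworkMode }}' protokube
# Capture the full (redactable) configuration for the report.
docker inspect protokube > protokube-inspect.json
```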
This reported issue started happening recently, after our cluster scaled up to >1,000 nodes. We believe it's a scaling issue, so reproducing is difficult.
5. What happened after the commands executed?
The container starts. It runs, opening connections to all other nodes, and all other nodes connect to it. Traffic is passed, and masters are discovered and persisted to `/etc/hosts`. Within <10 minutes, we start to see cluster degradation: TCP receive windows becoming full, TCP resets, and data transfers with RTT >10 ms starting to slow.
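As a sketch of how this kind of degradation can be observed from a node (the gossip port 3999 and the tool choice are assumptions, not details from this issue):

```sh
# Count established connections involving the assumed gossip port.
ss -tn state established '( sport = :3999 or dport = :3999 )' | wc -l

# Per-socket TCP internals: zero windows and high retransmit counts
# point at receive buffers filling up.
ss -tni '( sport = :3999 or dport = :3999 )'

# Host-wide counters for resets and receive-queue buffer pressure.
netstat -s | grep -iE 'reset|prune|collapse'
```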
6. What did you expect to happen?
Protokube starts and performs its function as expected, but nodes in the cluster eventually become degraded, causing other network traffic to be affected.
We do not expect this to happen: it caused network operations at RTT of 80 ms (or higher) to slow from 30 MB/sec to 300 KB/sec.
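As a rough sanity check on those numbers via the bandwidth-delay product: sustaining 30 MB/sec at 80 ms RTT requires a TCP window of about 30 MB/sec × 0.08 s ≈ 2.4 MB, while 300 KB/sec at the same RTT corresponds to a window of only about 24 KB, which is consistent with receive windows collapsing under gossip traffic.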
7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
This issue is happening across a few clusters consisting of >1k large nodes. Specific numbers redacted.
8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?
There is a third-party blog post describing an experience nearly identical to ours.