Protokube Gossip causing network congestion #11064

Closed
KetchupBomb opened this issue Mar 17, 2021 · 6 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@KetchupBomb

1. What kops version are you running? The command kops version will display
this information.

$ kops version
Version 1.15.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version --short
Client Version: v1.20.4-dirty
Server Version: v1.12.10

3. What cloud provider are you using?

AWS.

4. What commands did you run? What is the simplest way to reproduce this issue?

Protokube is containerized and attached to each node's host network. Some Docker inspection output is provided below; if more is needed, please ask (some information redacted):

$ sudo docker inspect $(sudo docker ps | grep -i protokube | awk '{print $1}') | jq '.[] | .State, .Config.Cmd'
{
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 6048,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2021-03-12T06:03:59.498290524Z",
  "FinishedAt": "0001-01-01T00:00:00Z"
}
[
  "/usr/bin/protokube",
  "--channels=s3://k8s/production.k8s.local/addons/bootstrap-channel.yaml",
  "--cloud=aws",
  "--containerized=true",
  "--dns-internal-suffix=internal.production.k8s.local",
  "--dns=gossip",
  "--etcd-backup-store=s3://k8s/production.k8s.local/backups/etcd/main",
  "--etcd-image=k8s.gcr.io/etcd:3.3.13",
  "--initialize-rbac=true",
  "--manage-etcd=true",
  "--master=false",
  "--peer-ca=/srv/kubernetes/ca.crt",
  "--peer-cert=/srv/kubernetes/etcd-peer.pem",
  "--peer-key=/srv/kubernetes/etcd-peer-key.pem",
  "--tls-auth=true",
  "--tls-ca=/srv/kubernetes/ca.crt",
  "--tls-cert=/srv/kubernetes/etcd.pem",
  "--tls-key=/srv/kubernetes/etcd-key.pem",
  "--v=4"
]

This reported issue started happening recently, after our cluster scaled up to >1,000 nodes. We believe it's a scaling issue, so reproducing is difficult.

5. What happened after the commands executed?

The container starts. It runs, opening connections to all other nodes, and all other nodes connect to it. Traffic is passed, and the masters are discovered and persisted to /etc/hosts.

Within <10 minutes, we start to see cluster degradation: TCP receive windows become full, TCP resets appear, and data transfers with RTT >10ms start to slow.
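
As a rough illustration (these commands are not part of the original report; the only assumption is that the process is named protokube), the gossip fan-out and per-socket TCP state on an affected node can be inspected with something like:

# Count established TCP connections owned by protokube (approximates the gossip mesh fan-out)
$ sudo ss -tnp | grep protokube | wc -l

# Per-socket TCP details (rcv_space, cwnd, retransmits) to spot filled receive windows
$ sudo ss -tni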

6. What did you expect to happen?

We expected Protokube to start and perform its function without side effects. Instead, nodes in the cluster eventually become degraded, and other network traffic is affected.

We did not expect this: it caused network operations with RTT of 80ms (or higher) to slow from 30 MB/sec to 300 KB/sec.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

This issue is happening across a few clusters, consisting of >1k, large nodes. Specific numbers redacted.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

# ~60ms RTT
$ ping -c 3 public-dev-test-delete-me.s3-us-west-2.amazonaws.com
PING s3-us-west-2-r-w.amazonaws.com (52.218.229.113) 56(84) bytes of data.
64 bytes from s3-us-west-2-r-w.amazonaws.com (52.218.229.113): icmp_seq=1 ttl=40 time=63.8 ms
64 bytes from s3-us-west-2-r-w.amazonaws.com (52.218.229.113): icmp_seq=2 ttl=40 time=62.6 ms
64 bytes from s3-us-west-2-r-w.amazonaws.com (52.218.229.113): icmp_seq=3 ttl=40 time=62.6 ms


# Protokube is running
$ sudo docker ps | grep -i protokube ; pgrep protokube
d1ad7cbdef2a        protokube:1.15.1                                                             "/usr/bin/protokube …"   4 days ago          Up 4 days                               musing_proskuriakova
6048


# Downloading a file is slow
$ curl https://public-dev-test-delete-me.s3-us-west-2.amazonaws.com/testfile > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 15 45.6M   15 7277k    0     0   333k      0  0:02:19  0:00:21  0:01:58  347k^C



# Pausing Protokube causes the same download to move quickly
$ sudo docker pause $(sudo docker ps | grep -i protokube | awk '{print $1}')
$ curl https://public-dev-test-delete-me.s3-us-west-2.amazonaws.com/testfile > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 76.6M  100 76.6M    0     0  32.2M      0  0:00:02  0:00:02 --:--:-- 34.6M
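
For completeness (not in the original report), the paused container can be resumed afterwards with the same container-ID lookup:

# Resume Protokube after the test
$ sudo docker unpause $(sudo docker ps | grep -i protokube | awk '{print $1}')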

9. Anything else we need to know?

There is a third-party blog post describing a nearly identical experience.

@olemarkus
Member

This is a tricky one ...

The kOps version you are using is somewhat old, and newer versions include updated dependencies, e.g. the mesh libraries.
Trying to reproduce something with 1000+ nodes is also somewhat hard for us maintainers.

There is also #7427, which led to #7521.
The milestone says 1.15, but it was actually released in 1.16.

Also be aware that the 1.16 upgrade had an issue with gossip: #8771

@KetchupBomb
Author

Agreed on both points: we're running an older version of kOps, and reproducing with such cluster sizes is inaccessible to most. We are in the process of upgrading kOps to ~1.18, and we have long-term plans to front the masters with a VIP, since master discovery is our primary use case for Protokube, and we'd rather not pay the cost of gossip across >1k nodes just for that.

We mainly wanted to formally flag this since we couldn't find an open or closed GitHub issue for it. If anyone has immediate mitigation input, we'd welcome it. (For now, we're increasing TCP buffer sizes to mitigate; a rough sketch follows below.)
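
As a sketch of that mitigation (the exact values here are illustrative, not the ones we actually deployed):

# Raise TCP buffer ceilings so high-RTT flows can keep their windows open (example values only)
$ sudo sysctl -w net.core.rmem_max=16777216
$ sudo sysctl -w net.core.wmem_max=16777216
$ sudo sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"
$ sudo sysctl -w net.ipv4.tcp_wmem="4096 262144 16777216"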

@olemarkus
Member

I'd try to upgrade to 1.16+ and see if the alternative mesh implementation improves things.

How do you plan on implementing VIP for the masters on AWS?

Worth mentioning that upcoming versions of kOps address several other areas that matter for scaling the control plane, such as dedicated API server nodes and etcd. This is especially useful for larger clusters.

@KetchupBomb
Author

@olemarkus: I'd try to upgrade to 1.16+ and see if the alternative mesh implementation improves things.

Yep, we're planning on going to 1.18. Optimistically, this will fix the gossip issues, but even if it doesn't, we're rethinking our master-discovery mechanism.

@olemarkus: How do you plan on implementing VIP for the masters on AWS?

Still evaluating options, and I'm ramping up on kOps, but I'm interested in pursuing the following options, roughly in order (a rough sketch of option 2 follows the list):

  1. Layer 7 or layer 4 ELB in front of auto-scaling groups which enforce master instance counts.
  2. A DNS solution where resolving the name will answer with either:
     • n A records, 1 for each master, or
     • 1 A record per query, round-robining between masters.
  3. Assign an IP to each master and leverage ECMP to load balance.
  4. Configure VRRP/keepalived to implement a proper VIP.
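
As a rough sketch of option 2 (the record name, zone ID, and master IPs below are all hypothetical):

# Publish one A record per master behind a single name; resolvers return all records and clients pick one
$ cat > masters.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.masters.example.internal",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [
        {"Value": "10.0.1.10"},
        {"Value": "10.0.2.10"},
        {"Value": "10.0.3.10"}
      ]
    }
  }]
}
EOF
$ aws route53 change-resource-record-sets --hosted-zone-id EXAMPLE_ZONE_ID --change-batch file://masters.json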

I'll come back to this issue and comment with a summary of what we end up doing.

@olemarkus
Member

Thanks.

Layer 7 load balancing is a bit tricky because of TLS.
A layer 4 LB or DNS is what kOps uses if you don't use gossip.
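
For illustration only (the cluster name below is a placeholder, and an existing gossip cluster can't simply be switched over): on a non-gossip cluster the L4 load balancer in front of the API servers comes from the cluster spec, roughly like this:

# Add an API load balancer to the cluster spec, then apply
$ kops edit cluster --name my.example.com
#   spec:
#     api:
#       loadBalancer:
#         type: Internal
$ kops update cluster --name my.example.com --yes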

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 22, 2021