Major reduction in performance upgrading to 2.7.4 using AWS cluster #2236
What do you observe? Can you post CPU/disk I/O/RAM output from the auth server? A 250-node cluster should work fine without any notable differences between 2.6 and 2.7; if anything, we made 2.7 faster, so this is unusual.
Ah, it's 250 trusted clusters. My first bet is that CA rotation heartbeats are putting load on DynamoDB, because they added polling; can you check there?
Anyway, I can help you troubleshoot the problem; if you want, I can jump on a chat/call with you tomorrow. Meanwhile, if it's production, you probably need to downgrade now. Just send me an email at sasha@gravitational.com re: this ticket and we can schedule some time.
At the worst we had 8 m5.xlarge nodes running the auth cluster with 20,000 units of provisioned read capacity on the DynamoDB table, and we were still getting throttling on the DynamoDB requests. In fact, it seemed the more DynamoDB capacity we provisioned, the more throttled requests we got. See https://i.imgur.com/FNqjO4I.png. CPU- and memory-wise, we had the auth servers running at 100% CPU and about 60% memory usage. Auth server CPU: https://i.imgur.com/IbHRF1r.png. The way the cluster is set up there is zero disk I/O, and network I/O averaged about 120 Mbps. The proxy servers were at about 70% utilisation with no disk I/O and similar network usage. Rollback has been done for the absolutely critical services and the system split in two, so I still have a 2.7.4 cluster exhibiting similar issues, just not at the magnitude of the original issue.
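For what it's worth, the throttling shown in the graph above can also be pulled as numbers from CloudWatch. A hypothetical sketch, assuming the AWS CLI is configured and the backend table is named `teleport-state` (the table name is an assumption; substitute the table backing your cluster state):

```shell
# Hypothetical sketch: sum DynamoDB read-throttle events over the last hour
# for the assumed table "teleport-state" (adjust name/region to your setup).
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=teleport-state \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum
```

Comparing `ReadThrottleEvents` against provisioned capacity over time makes it easier to see whether throttling tracks load or, as observed here, paradoxically grows with provisioned capacity.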
Oh also, on here is probably best; we're based in Perth, Western Australia, so it's just on 9 am here.
OK, thanks for the info. I will try to reproduce this week and get back to you with my findings. Meanwhile, can you give me all the specs and steps to reproduce this, so I can try and see what's going on? BTW, if you have time, you can activate the --diag-addr endpoint and collect some metric dumps for me to see what's bothering the auth server so much.
To get debug CPU and RAM profiles for me:
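The exact commands were not captured in this thread. A minimal sketch of pulling Go pprof profiles from Teleport's diagnostics endpoint, assuming the auth server was started with `--diag-addr=127.0.0.1:3000` (the address is an assumption; use whatever you configured):

```shell
DIAG=127.0.0.1:3000  # assumed --diag-addr value; adjust to your setup

# 30-second CPU profile (standard Go net/http/pprof path)
curl -o cpu.profile "http://${DIAG}/debug/pprof/profile?seconds=30"

# Heap (RAM) profile
curl -o heap.profile "http://${DIAG}/debug/pprof/heap"
```

The resulting files can then be inspected locally with `go tool pprof cpu.profile` or attached to the issue.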
Auth: m5.xlarge. Setup is as per the example Terraform scripts for an AWS cluster, modified to install the OSS version and not to run any nodes or the Grafana/InfluxDB server. The cloud instance is just for user auth and providing the trusted endpoint. Each of the remote clusters is an all-in-one node with the following config:
Process:
I'll see about getting those files directly. We do have the diag endpoint running and pushing to our InfluxDB cluster permanently; you can view a snapshot of the last 24 hours at https://snapshot.raintank.io/dashboard/snapshot/Q2Q1fvvuoFVeE3Py71EVo8fLF79qre4x. The data was a bit intermittent at the worst of the issue.
I was able to reproduce this and landed a couple of patches in 2.7, but work is still in progress; I have a couple of ideas to test in the next couple of days.
I have landed several patches that improve performance in the scenario mentioned above. Some notes:
I recommend you try this branch in a dev cluster and communicate the results back to me.
Yeah, I noticed 2.6 had the same problem when we rolled back: it was getting slower with each extra node added, and then the massive reconnect of all the remote clusters after the upgrade slammed it. Just to confirm: to see the improvement, would I only need to upgrade the main cluster, or do I need to upgrade all the remote ones as well?
I only tested the scenario where both the nodes and the cluster were on 3.0+ (my branch), so I'm not sure. I will leave the rest for you to verify on your own.
Finally had time to test this myself, looks good so far. |
What happened:
Upgraded our production cluster from 2.6.7 to 2.7.4 and are seeing a major reduction in performance of the Auth component, rendering the entire system unusable. The test cluster didn't see issues.
The cluster is a main cluster with ~250 single-node trusted clusters. The cloud infrastructure is based on the example Terraform scripts from Teleport: DynamoDB storage, S3 audit logs, auth nodes behind a network LB, proxy nodes behind a network LB.
The same cluster on 2.6.4 handled the load with no issue; the upgrade to 2.7.4 has brought the system to its knees, and logins to all nodes time out.
What you expected to happen:
System to work as before.
How to reproduce it (as minimally and precisely as possible):
Hard to say; an upgrade on a smaller test cluster with 5 trusted clusters worked fine with no issues.
Environment:
Teleport version (`teleport version`): Teleport v2.7.4 git:v2.7.4-0-g2fff1056
Tsh version (`tsh version`): Teleport v2.7.4 git:v2.7.4-0-g2fff1056
Browser environment
Relevant Debug Logs If Applicable