
Major reduction in performance upgrading to 2.7.4 using AWS cluster #2236

Closed
aashley opened this issue Sep 20, 2018 · 13 comments

aashley commented Sep 20, 2018

What happened:
Upgraded our production cluster from 2.6.7 to 2.7.4 and saw a major reduction in performance of the Auth component, rendering the entire system unusable. Our test cluster didn't see the issue.

The cluster is a main cluster with ~250 single-node trusted clusters. The cloud infrastructure is based on the example Terraform scripts from Teleport: DynamoDB storage, S3 audit logs, auth nodes behind a network LB, proxy nodes behind a network LB.

The same cluster on 2.6.4 handled the load with no issue; the upgrade to 2.7.4 has brought the system to its knees, and logins to all nodes time out.

What you expected to happen:
System to work as before.

How to reproduce it (as minimally and precisely as possible):
Hard to say; the upgrade on a smaller test cluster with 5 trusted clusters worked fine with no issues.

Environment:

  • Teleport version (use teleport version): Teleport v2.7.4 git:v2.7.4-0-g2fff1056
  • Tsh version (use tsh version): Teleport v2.7.4 git:v2.7.4-0-g2fff1056
  • OS (e.g. from /etc/os-release): Main Cluster: Debian 9. Remote Clusters: Ubuntu 16.04.4

klizhentas (Contributor) commented:

What do you observe?

  • Can you post CPU / disk IO / RAM output from the auth server?
  • Do you see anything in the logs?
  • Do you see any rate limiting on the DynamoDB side? (Check the CloudWatch metrics; a sample query is sketched below.)

A 250-node cluster should work fine without any notable difference between 2.6 and 2.7; if anything, we made 2.7 faster, so this is unusual.
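
One quick way to check for throttling from the CLI (a rough sketch, assuming the AWS CLI is configured; the table name below is a placeholder for whatever your Teleport backend table is called):

aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=teleport-backend-table \
  --start-time 2018-09-19T00:00:00Z --end-time 2018-09-20T00:00:00Z \
  --period 300 --statistics Sum

A non-zero Sum in any period means DynamoDB was rejecting reads, which will surface in Teleport as slow or failing logins.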

klizhentas (Contributor) commented:

Ah, it's 250 trusted clusters. My first bet is that the CA rotation heartbeats are putting load on DynamoDB, because they added polling; can you check there?
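
As a rough back-of-the-envelope illustration of why polling hurts at this scale (the actual poll interval is not stated in this thread, so the 5-second figure below is purely hypothetical):

250 clusters × 1 read / 5 s = 50 reads/sec

of steady background load on DynamoDB, before any login traffic is added on top.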

klizhentas (Contributor) commented:

Anyway, I can help you troubleshoot the problem; if you want, I can jump on a chat/call with you tomorrow. Meanwhile, if this is production, you probably need to downgrade now.

Just send me an email to sasha@gravitational.com re this ticket and we can schedule some time.

aashley (Author) commented Sep 20, 2018

At the worst we had 8 m5.xlarge nodes running the auth cluster with 20,000 units of provisioned read capacity on the DynamoDB table, and we were still getting throttling on the DynamoDB requests. In fact, it seemed the more DynamoDB capacity we provisioned, the more throttled requests we got. See https://i.imgur.com/FNqjO4I.png

CPU and memory wise, we had the auth servers running at 100% CPU and about 60% memory usage. Auth server CPU: https://i.imgur.com/IbHRF1r.png The way the cluster is set up there is zero disk IO, and the network IO averaged about 120 Mbps. The proxy servers were at about 70% utilisation with no disk IO and similar network usage.

The rollback has been done for the absolutely critical services and the system has been split in two, so I still have a 2.7.4 cluster exhibiting similar issues, just not at the magnitude of the original problem.

aashley (Author) commented Sep 20, 2018

Oh also, on here is probably best. We're based in Perth, Western Australia, so it's just on 9am here.

klizhentas (Contributor) commented:

OK, thanks for the info. I will try to reproduce this week and get back to you with my findings. Meanwhile, can you give me all the specs and steps to reproduce this, so I can try to see what's going on?

BTW, if you have time, you can activate the --diag-addr endpoint and collect some metric dumps for me, to see what's bothering the auth server so much.

klizhentas (Contributor) commented:

To get debug CPU and RAM profiles for me:

teleport start -d --diag-addr=127.0.0.1:6060
curl http://127.0.0.1:6060/debug/profile
curl http://127.0.0.1:6060/debug/heap
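
To actually inspect the dumps, one option is Go's pprof tool (a sketch, assuming the endpoints above return standard Go pprof profiles and that Go is installed wherever you analyze them):

curl -o cpu.profile http://127.0.0.1:6060/debug/profile
curl -o heap.profile http://127.0.0.1:6060/debug/heap
go tool pprof -top cpu.profile
go tool pprof -top heap.profile

The -top view lists the functions consuming the most CPU (or holding the most memory), which is usually enough to show where the auth server is spending its time.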

aashley (Author) commented Sep 20, 2018

Auth: m5.xlarge
Proxy: m5.large

The setup is as per the example Terraform scripts for an AWS cluster, modified to install the OSS version and not to run any nodes or the Grafana/InfluxDB server. The cloud instance is just for user auth and providing the trusted endpoint. Each of the remote clusters is an all-in-one node with the following config:

teleport:
  auth_servers:
  - 127.0.0.1:3025
  data_dir: "/var/lib/teleport"
auth_service:
  enabled: 'yes'
  cluster_name: od-server-00296
  listen_addr: 0.0.0.0:3025
  session_recording: proxy
  tokens:
  - proxy,node:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ssh_service:
  enabled: 'yes'
  labels:
    role: proxy
    env: dev
  commands:
  - name: hostname
    command:
    - "/bin/hostname"
    period: 1m0s
  - name: arch
    command:
    - "/bin/uname"
    - "-p"
    period: 1h0m0s
proxy_service:
  enabled: 'yes'
  https_key_file: "/etc/ssl/private/od-server-00296.key"
  https_cert_file: "/etc/ssl/certs/od-server-00296.pem"
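
For completeness, the trust relationship itself is established from each leaf cluster with a trusted_cluster resource pointing at the main cluster's proxy (created with tctl create trusted_cluster.yaml). The exact resource isn't shown in this issue; a hypothetical sketch, where the name, addresses, token and role mapping are all placeholders:

kind: trusted_cluster
version: v2
metadata:
  name: main
spec:
  enabled: true
  token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  web_proxy_addr: main-proxy.example.com:3080
  tunnel_addr: main-proxy.example.com:3024
  role_map:
  - remote: admin
    local: [admin]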

Process:

  • Cluster was installed new at 2.6.3, then upgraded to 2.6.4. Running with 2 proxies, 2 auth servers
  • Shutdown proxies
  • Shutdown all but one auth
  • Upgrade binaries on the running auth server, restart Teleport (a rough sketch of this step is below)
  • Confirm upgrade complete, upgrade and start second auth
  • Upgrade and start proxies
  • Upgrade two remote clusters for testing
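
For the binary upgrade step itself, the rough shape is as follows (a hypothetical sketch; the download URL, install paths and systemd unit name are assumptions based on a typical tarball install, not taken from this issue):

# on each auth server, one at a time
systemctl stop teleport
curl -O https://get.gravitational.com/teleport-v2.7.4-linux-amd64-bin.tar.gz
tar xzf teleport-v2.7.4-linux-amd64-bin.tar.gz
cp teleport/teleport teleport/tctl teleport/tsh /usr/local/bin/
systemctl start teleport
teleport version   # confirm v2.7.4 before moving to the next node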

I'll see about getting those files directly. We do have the diag endpoint running and pushing to our InfluxDB cluster permanently; you can view a snapshot of the last 24 hours at https://snapshot.raintank.io/dashboard/snapshot/Q2Q1fvvuoFVeE3Py71EVo8fLF79qre4x (the data was a bit intermittent at the worst of the issue).
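
As a quick sanity check that the diag listener is up and exporting data (later Teleport versions serve Prometheus-format metrics at /metrics on this listener; whether 2.7 already does is an assumption here):

curl -s http://127.0.0.1:6060/metrics | head

If that returns Prometheus-style metric lines, it is presumably the same data feeding the InfluxDB/Grafana snapshot above.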

klizhentas self-assigned this Sep 20, 2018
klizhentas (Contributor) commented:

I was able to reproduce this and have landed a couple of patches in 2.7, but work is still in progress; I have a couple of ideas to test in the next couple of days.

klizhentas (Contributor) commented Sep 27, 2018

I have landed several patches that improve performance in the scenario described above.
Due to the nature of the patches, they will be available after the 3.0 release, in this branch:

#2243

Some notes:

  • 2.6 was never handling 250 trusted clusters well either; it exhibits the same behavior as 2.7.
  • I reproduced exactly the behavior you described and made sure this branch fixes it: reads to the database with 250 clusters are below 100/sec.

I recommend you try this branch in a dev cluster and report the results back to me.

aashley (Author) commented Sep 28, 2018

Yeah, I noticed 2.6 had the same problem when we rolled back; it was getting slower with each extra node added, and then the mass reconnect of all the remote clusters after the upgrade slammed it.

Just to confirm: to see the improvement, would I only need to upgrade the main cluster, not all the remote ones? Or do I need to upgrade all the remote ones before I see the improvement?

klizhentas (Contributor) commented:

I only tested the scenario where both the nodes and the cluster were on 3.0+ with my branch, so I'm not sure. I will leave the rest for you to verify on your own.

aashley (Author) commented Oct 8, 2018

Finally had time to test this myself; it looks good so far.
