
Major reduction in performance upgrading to 2.7.4 using AWS cluster #2236

Closed
aashley opened this issue Sep 20, 2018 · 13 comments

aashley commented Sep 20, 2018

What happened:
Upgraded our production cluster from 2.6.7 to 2.7.4 and saw a major reduction in performance of the Auth component, rendering the entire system unusable. Our test cluster didn't see the issue.

The cluster is a main cluster with ~250 single-node trusted clusters. The cloud infrastructure is based on the example Terraform scripts from Teleport: DynamoDB storage, S3 audit logs, auth nodes behind a network LB, proxy nodes behind a network LB.

The same cluster on 2.6.4 handled the load with no issue; the upgrade to 2.7.4 has brought the system to its knees, and logins to all nodes time out.

What you expected to happen:
System to work as before.

How to reproduce it (as minimally and precisely as possible):
Hard to say; the upgrade on a smaller test cluster with 5 trusted clusters worked fine with no issues.

Environment:

  • Teleport version (use teleport version): Teleport v2.7.4 git:v2.7.4-0-g2fff1056
  • Tsh version (use tsh version): Teleport v2.7.4 git:v2.7.4-0-g2fff1056
  • OS (e.g. from /etc/os-release): Main Cluster: Debian 9. Remote Clusters: Ubuntu 16.04.4

klizhentas (Contributor) commented:

What do you observe?

  • Can you post CPU / disk IO / RAM output from the auth server?
  • Do you see anything in the logs?
  • Do you see any rate limiting on the DynamoDB side? (Check the CloudWatch metrics; a sample query is sketched below.)

A 250-node cluster should work fine without any notable difference between 2.6 and 2.7; if anything, we made 2.7 faster, so this is unusual.
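
One quick way to check for throttling from the CLI (a rough sketch, assuming the AWS CLI is configured; the table name below is a placeholder for whatever your Teleport backend table is called):

aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=teleport-backend-table \
  --start-time 2018-09-19T00:00:00Z --end-time 2018-09-20T00:00:00Z \
  --period 300 --statistics Sum

A non-zero Sum in any period means DynamoDB was rejecting reads, which will surface in Teleport as slow or failing logins.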

klizhentas (Contributor) commented:

Ah, it's 250 trusted clusters. My first bet is that the CA rotation heartbeats are putting load on DynamoDB, because they added polling; can you check there?
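
As a rough back-of-the-envelope illustration of why polling hurts at this scale (the actual poll interval is not stated in this thread, so the 5-second figure below is purely hypothetical):

250 clusters × 1 read / 5 s = 50 reads/sec

of steady background load on DynamoDB, before any login traffic is added on top.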

klizhentas (Contributor) commented:

Anyway, I can help you troubleshoot the problem; if you want, I can jump on a chat/call with you tomorrow. Meanwhile, if this is production, you probably need to downgrade now.

Just send me an email to sasha@gravitational.com re this ticket and we can schedule some time.

aashley (Author) commented Sep 20, 2018

At the worst we had 8 m5.xlarge nodes running the auth cluster with 20,000 units of provisioned read capacity on the DynamoDB table, and we were still getting throttling on the DynamoDB requests. In fact, it seemed the more DynamoDB capacity we provisioned, the more throttled requests we got. See https://i.imgur.com/FNqjO4I.png

CPU and memory wise, we had the auth servers running at 100% CPU and about 60% memory usage. Auth server CPU: https://i.imgur.com/IbHRF1r.png The way the cluster is set up there is zero disk IO, and the network IO averaged about 120 Mbps. The proxy servers were at about 70% utilisation with no disk IO and similar network usage.

The rollback has been done for the absolutely critical services and the system has been split in two, so I still have a 2.7.4 cluster exhibiting similar issues, just not at the magnitude of the original problem.

aashley (Author) commented Sep 20, 2018

Oh also, on here is probably best. We're based in Perth, Western Australia, so it's just on 9am here.

klizhentas (Contributor) commented:

OK, thanks for the info. I will try to reproduce this week and get back to you with my findings. Meanwhile, can you give me all the specs and steps to reproduce this, so I can try to see what's going on?

BTW, if you have time, you can activate the --diag-addr endpoint and collect some metric dumps for me, to see what's bothering the auth server so much.

klizhentas (Contributor) commented:

To get debug CPU and RAM profiles for me:

teleport start -d --diag-addr=127.0.0.1:6060
curl http://127.0.0.1:6060/debug/profile
curl http://127.0.0.1:6060/debug/heap
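
To actually inspect the dumps, one option is Go's pprof tool (a sketch, assuming the endpoints above return standard Go pprof profiles and that Go is installed wherever you analyze them):

curl -o cpu.profile http://127.0.0.1:6060/debug/profile
curl -o heap.profile http://127.0.0.1:6060/debug/heap
go tool pprof -top cpu.profile
go tool pprof -top heap.profile

The -top view lists the functions consuming the most CPU (or holding the most memory), which is usually enough to show where the auth server is spending its time.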

aashley (Author) commented Sep 20, 2018

Auth: m5.xlarge
Proxy: m5.large

The setup is as per the example Terraform scripts for an AWS cluster, modified to install the OSS version and not to run any nodes or the Grafana/InfluxDB server. The cloud instance is just for user auth and providing the trusted endpoint. Each of the remote clusters is an all-in-one node with the following config:

teleport:
  auth_servers:
  - 127.0.0.1:3025
  data_dir: "/var/lib/teleport"
auth_service:
  enabled: 'yes'
  cluster_name: od-server-00296
  listen_addr: 0.0.0.0:3025
  session_recording: proxy
  tokens:
  - proxy,node:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ssh_service:
  enabled: 'yes'
  labels:
    role: proxy
    env: dev
  commands:
  - name: hostname
    command:
    - "/bin/hostname"
    period: 1m0s
  - name: arch
    command:
    - "/bin/uname"
    - "-p"
    period: 1h0m0s
proxy_service:
  enabled: 'yes'
  https_key_file: "/etc/ssl/private/od-server-00296.key"
  https_cert_file: "/etc/ssl/certs/od-server-00296.pem"
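
For completeness, the trust relationship itself is established from each leaf cluster with a trusted_cluster resource pointing at the main cluster's proxy (created with tctl create trusted_cluster.yaml). The exact resource isn't shown in this issue; a hypothetical sketch, where the name, addresses, token and role mapping are all placeholders:

kind: trusted_cluster
version: v2
metadata:
  name: main
spec:
  enabled: true
  token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  web_proxy_addr: main-proxy.example.com:3080
  tunnel_addr: main-proxy.example.com:3024
  role_map:
  - remote: admin
    local: [admin]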

Process:

  • Cluster was installed new at 2.6.3, then upgraded to 2.6.4. Running with 2 proxies, 2 auth servers
  • Shutdown proxies
  • Shutdown all but one auth
  • Upgrade binaries on the running auth server, restart Teleport (a rough sketch of this step is below)
  • Confirm upgrade complete, upgrade and start second auth
  • Upgrade and start proxies
  • Upgrade two remote clusters for testing
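
For the binary upgrade step itself, the rough shape is as follows (a hypothetical sketch; the download URL, install paths and systemd unit name are assumptions based on a typical tarball install, not taken from this issue):

# on each auth server, one at a time
systemctl stop teleport
curl -O https://get.gravitational.com/teleport-v2.7.4-linux-amd64-bin.tar.gz
tar xzf teleport-v2.7.4-linux-amd64-bin.tar.gz
cp teleport/teleport teleport/tctl teleport/tsh /usr/local/bin/
systemctl start teleport
teleport version   # confirm v2.7.4 before moving to the next node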

I'll see about getting those files directly. We do have the diag endpoint running and pushing to our InfluxDB cluster permanently; you can view a snapshot of the last 24 hours at https://snapshot.raintank.io/dashboard/snapshot/Q2Q1fvvuoFVeE3Py71EVo8fLF79qre4x (the data was a bit intermittent at the worst of the issue).
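
As a quick sanity check that the diag listener is up and exporting data (later Teleport versions serve Prometheus-format metrics at /metrics on this listener; whether 2.7 already does is an assumption here):

curl -s http://127.0.0.1:6060/metrics | head

If that returns Prometheus-style metric lines, it is presumably the same data feeding the InfluxDB/Grafana snapshot above.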

klizhentas self-assigned this Sep 20, 2018
klizhentas (Contributor) commented:

I was able to reproduce this and have landed a couple of patches in 2.7, but work is still in progress; I have a couple of ideas to test in the next couple of days.

klizhentas (Contributor) commented Sep 27, 2018

I have landed several patches that improve performance in the scenario described above.
Due to the nature of the patches, they will be available after the 3.0 release, in this branch:

#2243

Some notes:

  • 2.6 was never handling 250 trusted clusters well either; it exhibits the same behavior as 2.7.
  • I reproduced exactly the behavior you described and made sure this branch fixes it: reads to the database with 250 clusters are below 100/sec.

I recommend you try this branch in a dev cluster and report the results back to me.

aashley (Author) commented Sep 28, 2018

Yeah, I noticed 2.6 had the same problem when we rolled back; it was getting slower with each extra node added, and then the mass reconnect of all the remote clusters after the upgrade slammed it.

Just to confirm: to see the improvement, would I only need to upgrade the main cluster, not all the remote ones? Or do I need to upgrade all the remote ones before I see the improvement?

klizhentas (Contributor) commented:

I only tested the scenario where both the nodes and the cluster were on 3.0+ with my branch, so I'm not sure. I will leave the rest for you to verify on your own.

aashley (Author) commented Oct 8, 2018

Finally had time to test this myself; it looks good so far.
