syncstorage-rs latency spikes #61

Closed
erkolson opened this issue Jul 20, 2020 · 2 comments
Labels: 8 Estimate - xl (Moderately complex, medium effort, some uncertainty) · bug (Something isn't working) · p1

erkolson commented Jul 20, 2020

Spanner latency spikes have been eliminated by dropping the BatchExpiry index, but the application is still occasionally experiencing latency spikes that break the latency SLA targets.

One example from overnight 7/18-7/19: Spanner has normal performance, while syncstorage has outlier latency:
[Screenshot: 2020-07-20 10:05 AM]

This incident caused 2 of the 9 running pods to max out active connections and get stuck. When pods are in this state, manual intervention is required (to kill them).

The result is increased request-handling latency until the "stuck" pods are deleted:
[Screenshot: 2020-07-20 10:25 AM]
[Screenshot: 2020-07-20 10:27 AM]
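
As an illustration only (this is not syncstorage-rs code, and the real session pool may behave differently), the sketch below shows one way a pod can end up in that state: if every slot of a bounded connection pool is held by stalled requests and checkouts have no timeout, new requests wait forever and the pod never recovers on its own.

```rust
// Hypothetical sketch of the "maxed-out active connections" failure mode.
// A tokio Semaphore stands in for a bounded Spanner session pool; the pool
// size and timeout below are assumptions for illustration.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;
use tokio::time::timeout;

const MAX_ACTIVE_CONNECTIONS: usize = 16; // assumed pool size

async fn handle_request(pool: Arc<Semaphore>) -> Result<(), &'static str> {
    // Without a checkout timeout, `pool.acquire().await` blocks forever once
    // all permits are held by stalled requests -- the pod then needs a manual
    // kill. Bounding the wait lets the request fail fast instead.
    let _permit = timeout(Duration::from_secs(5), pool.acquire())
        .await
        .map_err(|_| "pool checkout timed out")?
        .map_err(|_| "pool closed")?;

    // ... do the Spanner work while holding the permit ...
    Ok(())
}

#[tokio::main]
async fn main() {
    let pool = Arc::new(Semaphore::new(MAX_ACTIVE_CONNECTIONS));
    let _ = handle_request(pool).await;
}
```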

I'm seeing a number of errors like these at the time, but I cannot tell if they are the cause of the problem:

6-ALREADY_EXISTS Row [<fxa_uid>,<fxa_kid>,3,<another-id>,<yet-another-id>] in table batch_bsos already exists, status: 500 }"}}

(I removed the identifying information)

A database error occurred: RpcFailure: 8-RESOURCE_EXHAUSTED AnyAggregator ran out of memory during aggregation., status: 500 }"}}
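
For what it's worth, the numeric prefixes in those messages are standard gRPC status codes (6 = ALREADY_EXISTS, 8 = RESOURCE_EXHAUSTED). A small triage helper along these lines (hypothetical, not existing syncstorage-rs code) could make the distinction explicit when counting errors in the logs: the batch_bsos duplicate is a client-level conflict that is currently surfaced as a 500, while the aggregator error is Spanner-side resource pressure.

```rust
/// Hypothetical log-triage helper -- not part of syncstorage-rs.
/// Maps the numeric gRPC status codes seen in the errors above to a coarse
/// category so conflicts and backend pressure can be counted separately.
#[derive(Debug)]
enum SpannerErrorKind {
    /// 6 ALREADY_EXISTS, e.g. the duplicate batch_bsos row above.
    Conflict,
    /// 8 RESOURCE_EXHAUSTED, e.g. "AnyAggregator ran out of memory".
    Overloaded,
    /// Any other gRPC status code.
    Other(i32),
}

fn classify_grpc_code(code: i32) -> SpannerErrorKind {
    match code {
        6 => SpannerErrorKind::Conflict,
        8 => SpannerErrorKind::Overloaded,
        other => SpannerErrorKind::Other(other),
    }
}

fn main() {
    println!("{:?}", classify_grpc_code(6)); // Conflict
    println!("{:?}", classify_grpc_code(8)); // Overloaded
}
```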
@tublitzed tublitzed added this to Backlog: Misc in Services Engineering via automation Jul 20, 2020
@tublitzed tublitzed moved this from Backlog: Misc to Prioritized in Services Engineering Jul 20, 2020
@tublitzed tublitzed added the p1 label Jul 20, 2020
@tublitzed tublitzed moved this from Prioritized to Scheduled in Services Engineering Jul 20, 2020
@tublitzed tublitzed added the bug Something isn't working label Jul 21, 2020
@pjenvey pjenvey moved this from Scheduled to In Progress in Services Engineering Jul 23, 2020
@tublitzed tublitzed added the 8 Estimate - xl - Moderately complex, medium effort, some uncertainty. label Aug 3, 2020
pjenvey (Member) commented Aug 6, 2020

I haven't seen "stuck" pods during 0.5.x load tests on stage, but I see something somewhat similar.

Part of the challenge here is that the stage cluster's size scales up and down significantly for load testing, e.g. from 1-2 nodes when idle to 5-6 under load, then back down when the test finishes.

However, the canary node tends to stick around throughout, and a couple of different load tests against 0.5.x show the following:

  • The canary takes the brunt of the load test when it begins, often bumping its active connections significantly higher than the other nodes' (e.g. 72).
  • When the traffic concludes, the canary mysteriously maintains a number of active connections (e.g. 12-35) even though the cluster is almost completely idle.
  • This is correlated with upstream durations creeping up to several seconds. The cluster is idle, so the requests are mostly lightweight health checks (__lbheartbeat__ or __heartbeat__).

[Screenshot: 2020-08-05 6:20:08 PM]
[Screenshot: 2020-08-05 6:20:11 PM]

The canary isn't "stuck" here, but it's taking seconds for do-nothing health checks.
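
For context, a "do nothing" health check is about as cheap as a handler gets. The sketch below is only an assumed, minimal actix-web shape (not the actual syncstorage-rs handler), included to illustrate that multi-second responses on such an endpoint point at starved server workers rather than at the handler itself.

```rust
// Minimal, assumed handler shape -- not the actual syncstorage-rs code.
use actix_web::{web, App, HttpResponse, HttpServer};

// A do-nothing liveness endpoint: no allocation, no database, no I/O.
// If this takes seconds to answer, the workers themselves are tied up.
async fn lbheartbeat() -> HttpResponse {
    HttpResponse::Ok().finish()
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new().route("/__lbheartbeat__", web::get().to(lbheartbeat))
    })
    .bind("127.0.0.1:8000")?
    .run()
    .await
}
```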

Zooming out a bit you can see the pattern reflected in the Uptime Check:

[Screenshot: 2020-08-05 6:51 PM]
[Screenshot: 2020-08-05 6:48 PM]
[Screenshot: 2020-08-05 6:44 PM]

The 4 days of lengthy health checks (23-27), when the cluster was mostly idle in between a few load tests, are especially easy to see.

tublitzed (Contributor) commented:

Thank you, Phil, for the details here!

@erkolson:

  1. In terms of the referenced latency SLA targets: what are they? :) I.e., are you referencing that doc I made a while back with rough targets (that I need to revisit), or something else?
  2. Are you still seeing this issue in production? (I.e., it's marked as high priority for us now, and I want to confirm that's still accurate.)
  3. Do you have any suggestions on how we might continue to debug here?
