Investigate proper scaling and hardware platform for fxa-profile servers in prod #155
Comments
That's a graph of CPU over the last two weeks of this train. My concern is the "churn" you can see at various points in the graph. Zooming in to a particular date, October 5th, here's a view of 24 hours: looking at this we can see the morning churn, and tbh that's expected; we see that in many of our apps. However, if you look closer, you can see an additional 4 instances being spun up and falling off later in the night at 17:50, 17:29, 19:45, and 21:51 (times are local to my TZ, GMT-5). Ideally, I think we want to leverage autoscaling more as a reliability component rather than a capacity crutch or a substitute for the correct instance size. Sizing a stack up by 4 hosts periodically throughout the day just... feels wrong. One or maybe two in times of sustained load, sure, but scaling up by a factor of 3 or 4 on a daily basis feels like wrong-sizing.
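To make the "churn" concern concrete, here's a minimal sketch (with made-up sample numbers; the real data would come from the CloudWatch/Datadog series behind the graph) that counts distinct scale-up excursions above an assumed baseline and reports the peak overshoot:

```python
# Hypothetical sketch: quantify autoscaling "churn" from a series of
# (timestamp, instance_count) samples. BASELINE and the sample data are
# assumptions for illustration, not the real stack's numbers.

BASELINE = 4  # assumed steady-state instance count

def scale_up_events(samples, baseline=BASELINE):
    """Count distinct excursions above baseline and the peak overshoot."""
    events = 0
    peak_extra = 0
    above = False
    for _, count in samples:
        extra = count - baseline
        if extra > 0:
            if not above:
                events += 1   # a new excursion started
                above = True
            peak_extra = max(peak_extra, extra)
        else:
            above = False
    return events, peak_extra

samples = [
    ("08:00", 4), ("09:00", 6), ("10:00", 4),   # morning churn
    ("17:50", 5), ("19:45", 7), ("21:51", 8),   # evening spin-ups
    ("23:00", 4),
]
print(scale_up_events(samples))  # (2, 4): two excursions, peaking 4 over baseline
```

Runs like this, repeated daily, are what distinguish "reliability headroom" (rare, small excursions) from "wrong-sizing" (routine excursions of 3-4 hosts).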
From mtg: need to find time to discuss this with @jrgm
@ckolos would an appropriate action just be to "upgrade the default box by 3-4x"?
@seanmonstar maybe. Maybe move to lambda. |
@ckolos oh, our previous conversations made it seem like the load was coming from GET requests, not POST. If they're POST and the heavy load is processing image files, lambda is a possibility. If they are GET requests, then the 'workers' aren't even involved, and lambda would make no difference. Instead, it's just the web nodes being hit with thousands upon thousands of requests (thanks to Firefox having a sizable user base).
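One quick way to settle the GET-vs-POST question would be to tally request methods from the servers' access logs. A minimal sketch, assuming an nginx/Apache-style combined log format (the sample lines and paths are made up; the parsing would need adjusting to whatever the profile servers actually emit):

```python
# Hypothetical sketch: count HTTP methods in access-log lines to check
# whether traffic is dominated by GETs (web nodes) or POSTs (image
# uploads that could move to lambda). Log format is an assumption.

import re
from collections import Counter

# Matches the request portion of a combined-format log line:
# ... "METHOD /path HTTP/1.1" ...
REQUEST_RE = re.compile(r'"(GET|POST|PUT|DELETE|HEAD) [^ ]+ HTTP')

def method_counts(lines):
    counts = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample_log = [
    '1.2.3.4 - - [05/Oct/2015:09:00:01] "GET /v1/profile HTTP/1.1" 200 512',
    '1.2.3.4 - - [05/Oct/2015:09:00:02] "GET /v1/avatar HTTP/1.1" 200 2048',
    '5.6.7.8 - - [05/Oct/2015:09:00:03] "POST /v1/avatar/upload HTTP/1.1" 201 64',
]
print(method_counts(sample_log))  # Counter({'GET': 2, 'POST': 1})
```

If GETs dominate by orders of magnitude, lambda for image processing wouldn't touch the load driving the autoscaling.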


From https://bugzilla.mozilla.org/show_bug.cgi?id=1209811 via @ckolos:
"""
FxA Profile servers have been scaling up and down since ff 41 was released. While this is the point of autoscaling, it's unusual to see the stack grow by more than 100% during peak loads, and today the stack scaled from REDACTED to 2*REDACTED+2. For this reason, I propose that the current hardware platform is under-spec'd for the task and we should investigate alternatives that will allow the stack to run smaller and more efficiently, if possible.
"""
From IRL discussion, I think this is mostly dev work to figure out what's causing performance to be so spiky and whether there's anything we can do about it. It might involve something fun like moving the image processing work out to AWS lambda. But let's not jump to conclusions before we've measured things.
Our first task should be to confirm what is causing the performance issues. @ckolos can you link to e.g. a datadog graph that shows the problem you're seeing, so we can correlate events with activity on the server?