This repository has been archived by the owner. It is now read-only.

Investigate proper scaling and hardware platform for fxa-profile servers in prod #155

Closed
rfk opened this issue Oct 14, 2015 · 7 comments

@rfk (Member) commented Oct 14, 2015

From https://bugzilla.mozilla.org/show_bug.cgi?id=1209811 via @ckolos:

"""
FxA Profile servers have been scaling up and down since ff 41 was released. While this is the point of autoscaling, it's unusual to see a greater-than-100% stack size during peak loads, and today the stack scaled from REDACTED to 2*REDACTED+2. For this reason, I propose that the current hardware platform is under-spec'd for the task and we should investigate alternatives that will allow the stack to run smaller and more efficiently, if possible.
"""

From IRL discussion, I think this is mostly dev work to figure out what's causing performance to be so spiky and if there's anything we can do about it. It might involve something fun like e.g. moving the image processing work out to AWS lambda. But let's not jump to conclusions before we've measured things.

Our first task should be to confirm what is causing the performance issues. @ckolos can you link to e.g. a datadog graph that shows the problem you're seeing, so we can correlate events with activity on the server?

@rfk rfk added this to the FxA-0: quality milestone Oct 14, 2015
@ckolos commented Oct 14, 2015

[Graph: two-week CPU utilization]

That's a graph of CPU over the last two weeks of this train. My concern lies in the "churn" that you can see in various points of the graph. Zooming in to a particular date, October 5th, here's a view of 24 hours:

[Graph: 24 hours of profile CPU, Oct. 5th]

Looking at this we can see the morning churn and, tbh, that's expected; we see that in many of our apps. However, if you look closer, you can see that an additional 4 instances are spun up and then fall off later in the night, at 17:50, 17:29, 19:45, and 21:51 (times are local to my TZ, GMT-5).

Ideally, I think we want to leverage autoscaling more as a reliability component rather than a capacity crutch or a substitute for the correct instance size. Sizing a stack up by 4 hosts periodically throughout the day just... feels wrong. One or maybe even two in times of sustained load, sure, but scaling up by a factor of 3 or 4 on a daily basis feels like wrong-sizing.
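The right-sizing point can be illustrated with a rough capacity calculation. This is a hypothetical sketch: the request rates and per-instance capacities below are made-up numbers for illustration, not measurements from the fxa-profile stack.

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve peak_rps with some spare headroom."""
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)

# Hypothetical: a baseline stack sized for a steady 300 req/s...
baseline = instances_needed(peak_rps=300, per_instance_rps=100)
print(baseline)  # 4

# ...has to triple under a 900 req/s daily spike (autoscaling as a
# capacity crutch):
spike = instances_needed(peak_rps=900, per_instance_rps=100)
print(spike)  # 12

# An instance type with ~3x the per-box capacity absorbs the same spike
# with no scaling activity at all:
bigger = instances_needed(peak_rps=900, per_instance_rps=300)
print(bigger)  # 4
```

With the larger instance type, autoscaling is left to handle genuine failures and unusual load, which is the "reliability component" role described above.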

@vladikoff (Contributor) commented Oct 19, 2015

from mtg: need to find time to discuss this via @jrgm

@seanmonstar (Member) commented Oct 19, 2015

@ckolos is an appropriate action to just "upgrade the default box by 3-4x"?

@ckolos commented Oct 19, 2015

@seanmonstar maybe. Maybe move to lambda.

@seanmonstar (Member) commented Oct 19, 2015

@ckolos oh, our previous conversations made it seem like the load was coming from GET requests, not POST. If they're POST and the heavy load is processing image files, lambda is a possibility.

If they are GET requests, then the 'workers' aren't even involved, and lambda would make no difference. Instead, it's just the web nodes being hit with thousands upon thousands of requests (thanks to Firefox having a sizable user base).
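One way to settle the GET-vs-POST question is to tally methods from the servers' access logs. A minimal sketch, assuming common-log-format lines (the actual log format fxa-profile emits may differ):

```python
from collections import Counter

def count_methods(log_lines):
    """Tally HTTP methods from common-log-format access log lines.

    The request is the first double-quoted field of each line, in the
    form "METHOD path protocol".
    """
    counts = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]
            counts[request.split()[0]] += 1
        except IndexError:
            continue  # skip malformed lines
    return counts

# Example lines (paths are illustrative, not real fxa-profile routes):
sample = [
    '1.2.3.4 - - [05/Oct/2015:17:50:01 +0000] "GET /v1/profile HTTP/1.1" 200 512',
    '1.2.3.4 - - [05/Oct/2015:17:50:02 +0000] "POST /v1/avatar/upload HTTP/1.1" 201 87',
    '1.2.3.4 - - [05/Oct/2015:17:50:03 +0000] "GET /v1/avatar HTTP/1.1" 200 4096',
]
print(count_methods(sample))  # Counter({'GET': 2, 'POST': 1})
```

If the GET count dwarfs the POST count during the spiky periods, the load is on the web nodes and a lambda offload of image processing would not help.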

@vladikoff (Contributor) commented Nov 2, 2015

@jrgm to follow up with @rfk

@rfk (Member, Author) commented Nov 11, 2015

@jrgm and @ckolos, IIUC, are happy that we understand and accept the current behaviour of the boxes. No changes needed for now.

@rfk rfk closed this Nov 11, 2015
@rfk rfk removed the waffle:now label Nov 11, 2015