Test impact of metrics gathering on production traffic #81
Comments
Somewhat related to this:

@rfk, I'm thinking this should be a script in the repo so that we can easily re-run it as we add further metrics queries in the future. Rough methodology, first draft:

We can probably use the existing loadtest for this: https://github.com/mozilla/fxa-auth-server/tree/master/test/load unless @jrgm has a more up-to-date version.

Just curious, but are the values in |

Yep.

The loads framework just does as much work as it can in a fixed time interval, so the fact that the tests took ~300 seconds is expected, independent of the actual work being done on the database. (It's good that the load test reports no errors, though.)

Arrrgh! Thanks, I had no idea. I stupidly assumed it was performing a fixed number of operations. Back to the drawing board...

@philbooth it should also give you some summary of RPS or other throughput measure; we can look at kibana graphs during the run and eyeball for differences there as well.

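For anyone else tripped up by this: with a fixed-duration harness, elapsed time tells you nothing, and the signal is in how many requests complete within the window. A minimal sketch of the idea (the workloads and timings here are invented, not taken from the loads framework):

```python
import time

def run_fixed_duration(work, duration=0.2):
    """Perform `work` repeatedly for a fixed interval, loads-style.

    Returns (elapsed_seconds, completed_requests). Elapsed time is
    roughly constant by construction; only the request count varies
    with how expensive each unit of work is.
    """
    deadline = time.monotonic() + duration
    completed = 0
    while time.monotonic() < deadline:
        work()
        completed += 1
    return duration, completed

# Hypothetical workloads: the "loaded" one is 10x more expensive,
# standing in for queries slowed down by the metrics script.
_, baseline = run_fixed_duration(lambda: time.sleep(0.001))
_, under_load = run_fixed_duration(lambda: time.sleep(0.010))

# Both runs report the same duration; only throughput differs.
assert baseline > under_load
```

So comparing runs by wall-clock time will always show ~300 seconds; comparing them by completed-request counts is what reveals the slowdown.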
Using the fixed perf-testing script I now have, quelle surprise, some realistic-looking numbers. I'll paste the raw data in below, but the summary is:

Tests were executed on a large EC2 instance, 4 cores, running against the stage auth-server.

Raw data for one load-test process, five iterations:

Raw data for four load-test processes, nine iterations (I thought there might be greater variance in this set):

The percentage slowdown figure I gave is based on the median values for the total number of requests (I used the raw request counts because the requests-per-second number is rounded by the load-tester). @rfk, how do these fit in with your expectations / reasonable limits?

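The slowdown calculation itself is just a comparison of medians; a sketch with made-up per-iteration request totals (substitute the real figures from the raw data above):

```python
from statistics import median

# Hypothetical per-iteration totals of completed requests, with and
# without the metrics queries running alongside the load test.
requests_without_metrics = [80_000, 81_500, 79_800, 82_100, 80_400]
requests_with_metrics = [76_000, 75_200, 77_100, 74_900, 76_400]

# Medians rather than means, so one outlier iteration doesn't skew the
# result; raw request counts rather than RPS, because the load-tester
# rounds its requests-per-second figure.
baseline = median(requests_without_metrics)
loaded = median(requests_with_metrics)
slowdown_pct = (baseline - loaded) / baseline * 100

print(f"median slowdown: {slowdown_pct:.1f}%")  # 5.5% for these invented numbers
```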
This seems broadly as-expected - not blocking other queries by taking row locks or anything, just tying up resources on the server and making other queries run slower. Some other things it would be interesting to look at:

Also, how long does the metrics script take to run in its entirety?

There doesn't seem to be any difference, actually. Removing the isolation level and running the queries outside a transaction, I got 266 requests per second from one instance of the load-tests and 66 requests per second with four instances running (both medians from five runs).

Raw data, one load test:

Raw data, four load tests:

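For reference, the toggle being compared here is roughly "issue the batch of read-only queries inside one explicit transaction vs. in autocommit mode". A minimal stand-in using sqlite3 (the real script runs against MySQL, where an isolation level is also chosen; the table, column names, and data below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE sessionTokens (uid TEXT, uaDeviceType TEXT)")
conn.executemany(
    "INSERT INTO sessionTokens VALUES (?, ?)",
    [(f"uid{i}", "mobile" if i % 2 else "desktop") for i in range(1000)],
)

queries = [
    "SELECT COUNT(*) FROM sessionTokens",
    "SELECT COUNT(*) FROM sessionTokens WHERE uaDeviceType = 'mobile'",
]

# Variant 1: each query runs as its own implicit transaction (autocommit).
results_autocommit = [conn.execute(q).fetchone()[0] for q in queries]

# Variant 2: the whole batch runs inside one explicit transaction.
conn.execute("BEGIN")
results_txn = [conn.execute(q).fetchone()[0] for q in queries]
conn.execute("COMMIT")

# For read-only queries the answers are the same either way; the
# question being tested above is only whether the transactional
# variant costs more under concurrent load.
assert results_autocommit == results_txn
```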
~1.4 seconds without load, ~1.8 seconds under light load, ~2 seconds under heavy load (medians from five). Combined, this and the previous answer surprised me enough to think something was wrong; the metrics script couldn't really be doing the work I thought it was. But if I run the queries individually in the console, they are very quick there too. The slowest one being:

```
MySQL [fxa]> SELECT COUNT(DISTINCT a.uid) AS count
    -> FROM accounts AS a
    -> INNER JOIN sessionTokens AS s
    -> ON a.uid = s.uid
    -> WHERE a.createdAt < 1441358306643
    -> AND s.uaDeviceType = 'mobile';
+-------+
| count |
+-------+
|    29 |
+-------+
1 row in set (0.60 sec)
```

My next fear was that the database wasn't as populated as I thought it was. But nope, it has 54 million session tokens in there:

```
MySQL [fxa]> SELECT COUNT(*) AS count FROM sessionTokens;
+----------+
| count    |
+----------+
| 54119099 |
+----------+
1 row in set (8.85 sec)
```

And notice how much slower that last query is than the metrics queries. Why is that? It smells wrong to me; am I doing something stupid?

I'm not sure about this one yet because I don't know how to get on to the database host. Is there a way to find out from a mysql console or do I need to run |

Oh, hang on:

```
MySQL [fxa]> SELECT COUNT(*) AS count
    -> FROM accounts AS a
    -> INNER JOIN sessionTokens AS s
    -> ON a.uid = s.uid;
+--------+
| count  |
+--------+
| 119561 |
+--------+
1 row in set (0.38 sec)
```

It looks like, because the Does that sound correct? If so, it certainly invalidates my timings for the metrics script itself and my comparison without |

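The arithmetic behind that hunch: if only ~119k of the ~54M sessionTokens rows join to an accounts row, the metrics queries only ever touch a tiny fraction of the table, while `COUNT(*)` on sessionTokens alone must visit every row. A toy demonstration of that join-cardinality effect, with numbers scaled down and the schema invented:

```python
# Toy tables where most session tokens reference uids with no matching
# accounts row, mirroring a stage database whose two tables were
# populated independently.
accounts = {f"uid{i}" for i in range(100)}           # 100 accounts
session_tokens = [f"uid{i}" for i in range(50_000)]  # 50k tokens

# COUNT(*) over the whole sessionTokens table: touches every row.
full_count = len(session_tokens)

# COUNT(*) over the inner join: only rows whose uid has an accounts
# match contribute, so the "big" table barely matters.
join_count = sum(1 for uid in session_tokens if uid in accounts)

print(full_count, join_count)  # 50000 100
```

That would explain why the metrics queries look suspiciously cheap: they were effectively running against a 119k-row join, not a 54M-row table.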
Fwiw, I'm running a script that will (eventually) give me a local database that is both big and more representative for these queries to run against. When it's ready, I'll re-run the perf tests against it to see how the numbers compare to those above.

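A sketch of the shape such a population script might take, using sqlite3 rather than MySQL and a handful of rows rather than millions; the column names come from the queries above, while the row counts and device-type mix are invented:

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (uid TEXT PRIMARY KEY, createdAt INTEGER)")
conn.execute("CREATE TABLE sessionTokens (uid TEXT, uaDeviceType TEXT)")
conn.execute("CREATE INDEX sessionTokens_uid ON sessionTokens (uid)")

now_ms = int(time.time() * 1000)
rng = random.Random(42)
n_accounts = 10_000

# The key property: every session token references a real account, so
# the inner join is representative - unlike a database whose tables
# were filled independently of each other.
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(f"uid{i}", now_ms - rng.randrange(365 * 86_400_000))
     for i in range(n_accounts)],
)
conn.executemany(
    "INSERT INTO sessionTokens VALUES (?, ?)",
    [(f"uid{rng.randrange(n_accounts)}",
      rng.choice(["mobile", "desktop", "tablet"]))
     for _ in range(30_000)],
)
conn.commit()

# Sanity-check with one of the metrics queries from above.
(count,) = conn.execute(
    """SELECT COUNT(DISTINCT a.uid)
       FROM accounts AS a
       INNER JOIN sessionTokens AS s ON a.uid = s.uid
       WHERE s.uaDeviceType = 'mobile'"""
).fetchone()
assert 0 < count <= n_accounts
```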
Running locally against a database with 10 million accounts and 16 million session tokens, I saw:

Timings for the metrics script itself:

Raw data, one load test:

Raw data, four concurrent load tests:

Raw data, one load test, metrics queries running without |

@philbooth does the test report avg request time/latency by any chance? It would be another interesting axis for comparison.

Using the updated script:

Raw data, one load test:

Raw data, four concurrent load tests:

@jrgm what's your take on the above? Based on @philbooth's findings, it seems like this script is low-risk to try running occasionally in production in order to start getting these metrics flows, but we should continue to work towards a lower-impact way of gathering them for long-term use.

add a blocked label

This came up in my bi-weekly chat with Travis, who recommended that we push ahead with running it against the read-replica rather than trying to do it on the production db. I'm going to close this out, but leave #83 open to represent the outstanding work on that front.

(Filing this to capture the work as part of our planning process).
We're hoping to land some new metrics-gathering scripts in #72, but they're heavy queries that will make the auth db work hard. Let's try them out in stage and get a feel for how they might impact production traffic.