🔥 release-mainnet aggregator is unreachable #1310
Analysis
At first glance
It looks like a problem with the database being locked, triggering timeouts from the status service that scrapes the […]
We have also witnessed a high variance in the response times for the same routes (from […]
Reproducing the problem
We have been able to reproduce the problem on the […]
We have run the following commands sequentially:
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator
...
Requests per second: 118.30 [#/sec] (mean)
Time per request: 845.330 [ms] (mean)
Time per request: 8.453 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 767
66% 829
75% 873
80% 923
90% 1041
95% 1122
98% 1289
99% 1362
100% 1557 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/epoch-settings
...
Requests per second: 119.29 [#/sec] (mean)
Time per request: 838.319 [ms] (mean)
Time per request: 8.383 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 755
66% 833
75% 879
80% 930
90% 1056
95% 1190
98% 1290
99% 1317
100% 1557 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/snapshots
...
Requests per second: 78.72 [#/sec] (mean)
Time per request: 1270.301 [ms] (mean)
Time per request: 12.703 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 1155
66% 1380
75% 1494
80% 1557
90% 1748
95% 1920
98% 2122
99% 2196
100% 2501 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/mithril-stake-distributions
...
Requests per second: 24.00 [#/sec] (mean)
Time per request: 4165.801 [ms] (mean)
Time per request: 41.658 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 4094
66% 4118
75% 4138
80% 4155
90% 4227
95% 4303
98% 4350
99% 4688
100% 5205 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second: 24.00 [#/sec] (mean)
Time per request: 4165.801 [ms] (mean)
Time per request: 41.658 [ms] (mean, across all concurrent requests)
...
50% 4094
66% 4118
75% 4138
80% 4155
90% 4227
95% 4303
98% 4350
99% 4688
100% 5205 (longest request)
From these tests, we can deduce that there is probably a lock that is not released early enough by the routes. Also, the aggregator HTTP server does not seem to purge requests that are too old, and simply serves them when it can. This probably extends the delay to serve pages even further. When this delay is above […]
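
To illustrate the suspected effect of a lock that is held too long, here is a minimal sketch that is not taken from the aggregator code. `FakeConnection`, both handlers, and `run_concurrently` are hypothetical names, and the sleep stands in for whatever slow work a route does after reading from the database. It only shows how keeping a `Mutex` guard alive during slow work serializes concurrent requests, while dropping the guard early lets them overlap.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a database connection; not the aggregator's real type.
struct FakeConnection {
    rows: Vec<u64>,
}

// Keeps the guard alive for the whole handler, including the slow step.
fn handle_holding_lock(conn: &Mutex<FakeConnection>, work: Duration) -> usize {
    let guard = conn.lock().unwrap();
    let rows = guard.rows.clone();
    thread::sleep(work); // simulated slow work while the lock is still held
    rows.len()
}

// Copies the rows out in a short scope so the guard is dropped before the slow step.
fn handle_releasing_early(conn: &Mutex<FakeConnection>, work: Duration) -> usize {
    let rows = {
        let guard = conn.lock().unwrap();
        guard.rows.clone()
    }; // the guard is dropped here, so other threads can lock again
    thread::sleep(work); // simulated slow work without holding the lock
    rows.len()
}

// Runs `clients` concurrent calls to `handler` and returns the elapsed wall-clock time.
fn run_concurrently(
    clients: usize,
    work: Duration,
    handler: fn(&Mutex<FakeConnection>, Duration) -> usize,
) -> Duration {
    let conn = Arc::new(Mutex::new(FakeConnection { rows: vec![1, 2, 3] }));
    let start = Instant::now();
    let handles: Vec<_> = (0..clients)
        .map(|_| {
            let conn = Arc::clone(&conn);
            thread::spawn(move || handler(&conn, work))
        })
        .collect();
    for handle in handles {
        // Every handler sees the same three rows.
        assert_eq!(handle.join().unwrap(), 3);
    }
    start.elapsed()
}
```

Calling `run_concurrently(5, Duration::from_millis(30), handle_holding_lock)` takes at least five times the simulated work, because each request waits for the previous one to release the lock. The same call with `handle_releasing_early` should finish in roughly the time of a single request.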
Following the merge of PR #1316, we have run the same commands and here are the figures:
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator
...
Requests per second: 126.53 [#/sec] (mean)
Time per request: 790.333 [ms] (mean)
Time per request: 7.903 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 563
66% 641
75% 693
80% 728
90% 891
95% 968
98% 998
99% 1044
100% 1428 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/epoch-settings
...
Requests per second: 123.41 [#/sec] (mean)
Time per request: 810.336 [ms] (mean)
Time per request: 8.103 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 589
66% 662
75% 708
80% 743
90% 822
95% 887
98% 1092
99% 1204
100% 1264 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/snapshots
...
Requests per second: 122.67 [#/sec] (mean)
Time per request: 815.213 [ms] (mean)
Time per request: 8.152 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 626
66% 686
75% 730
80% 763
90% 898
95% 1019
98% 1245
99% 1297
100% 1509 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/mithril-stake-distributions
...
Requests per second: 60.72 [#/sec] (mean)
Time per request: 1646.788 [ms] (mean)
Time per request: 16.468 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 1367
66% 1544
75% 1678
80% 1771
90% 2015
95% 2286
98% 2656
99% 2740
100% 2817 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second: 40.60 [#/sec] (mean)
Time per request: 2462.868 [ms] (mean)
Time per request: 24.629 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 1963
66% 2254
75% 2487
80% 2630
90% 3036
95% 3422
98% 3725
99% 4000
100% 4400 (longest request)
We have seen significant improvements on the routes:
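
To put numbers on the change, the following sketch computes the relative change in requests per second for each route, using only the `ab` figures reported in this thread before and after PR #1316. The constant and function names are illustrative and do not come from the project.

```rust
// Requests per second reported by `ab` before and after PR #1316,
// copied from the runs in this thread: (route, before, after).
const THROUGHPUT: [(&str, f64, f64); 5] = [
    ("/aggregator", 118.30, 126.53),
    ("/aggregator/epoch-settings", 119.29, 123.41),
    ("/aggregator/artifact/snapshots", 78.72, 122.67),
    ("/aggregator/artifact/mithril-stake-distributions", 24.00, 60.72),
    ("/aggregator/certificates", 24.00, 40.60),
];

// Relative change in percent between two measurements.
fn relative_change_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

// Builds one summary line per route, with a signed percentage.
fn summarize() -> Vec<String> {
    THROUGHPUT
        .iter()
        .map(|(route, before, after)| {
            format!(
                "{route}: {before:.2} -> {after:.2} req/s ({:+.1}%)",
                relative_change_percent(*before, *after)
            )
        })
        .collect()
}
```

On these figures the two slowest routes gain the most: about +153% for `/aggregator/artifact/mithril-stake-distributions` and about +69% for `/aggregator/certificates`, while the root and `/aggregator/epoch-settings` routes change by less than 10%.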
Following the merge of PR #1314, we have run the same command for the route
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second: 31.24 [#/sec] (mean)
Time per request: 3201.020 [ms] (mean)
Time per request: 32.010 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
50% 3074
66% 3517
75% 3765
80% 3921
90% 4349
95% 4746
98% 5056
99% 5205
100% 5444 (longest request)
We have also run the same commands as previously on the other routes and the results were approximately the same. We see that performance has worsened with the addition of a LIMIT to the query. We suspect that the way it is implemented is responsible for this drop in performance, and we will try to implement it in a more efficient way.
Change the type |
Following the merge of PR #1333, we have run the same command for the route
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second: 60.25 [#/sec] (mean)
Time per request: 1659.740 [ms] (mean)
Time per request: 16.597 [ms] (mean, across all concurrent requests)
...
50% 1560
66% 1763
75% 1878
80% 1969
90% 2197
95% 2427
98% 3019
99% 3266
100% 4030 (longest request)
We see that the performance has improved 👍
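
As a reading aid for the `ab` figures in this thread, the sketch below shows how the two "Time per request" lines relate to "Requests per second" for a run at a fixed concurrency (`-c 100` here). The function names are illustrative. Small differences from the reported values are expected, because `ab` prints a rounded requests-per-second figure.

```rust
// Mean time per request in milliseconds, as seen by one client:
// concurrency * 1000 / requests-per-second.
fn mean_time_per_request_ms(concurrency: u32, requests_per_second: f64) -> f64 {
    f64::from(concurrency) * 1000.0 / requests_per_second
}

// Mean time per request in milliseconds across all concurrent requests:
// 1000 / requests-per-second.
fn mean_time_across_concurrent_ms(requests_per_second: f64) -> f64 {
    1000.0 / requests_per_second
}
```

For the last run on `/aggregator/certificates` (60.25 requests per second with 100 concurrent clients), these give about 1659.75 ms and 16.60 ms, in line with the 1659.740 ms and 16.597 ms that `ab` reported.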
Why
🔥 Alerts are received stating that the release-mainnet aggregator is unreachable.
2 incidents have been created:
What
The source of the problem must be identified and fixed swiftly.
Facts
/epoch-settings route (> 10s)
x4 times
~10-20 min
How
Now
testing-preview network
ConnectionWithFullMutex instead of Mutex<Connection> (will release the Mutex early)