Add rate limiting of /api/annotations (#5423)
* Add rate limiting per user & POST /api/annotations

Rate limit endpoints that are known to cause issues when over-requested,
on a per-authorization-token basis. Use the authorization token instead
of the IP address, since users from a university may share the same IP
address and the bulk of h's day-to-day users are students.

The main endpoint known to cause issues is /api/annotations:create.
When over-requested it can overstress the database with create requests,
causing long query times which ultimately hog the gunicorn worker time
that would otherwise be <1%.
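
The per-user keying described above can be sketched in a few lines of Python (a minimal illustration of the behaviour of the nginx `map` added in this commit; the function name is hypothetical, not code from h):

```python
def rate_limit_key(auth_header: str, remote_addr: str) -> str:
    """Choose the rate-limit key: the Authorization token when present,
    otherwise fall back to the client IP address."""
    return auth_header if auth_header else remote_addr

# Two students behind one university NAT get separate keys
# once they are logged in:
print(rate_limit_key("Bearer token-a", "192.0.2.1"))  # Bearer token-a
print(rate_limit_key("", "192.0.2.1"))                # 192.0.2.1
```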

* Add /api/badge and /assets rate limiting

/api/badge accounts for a large portion of our traffic but it's
not a very valuable endpoint. Rather than prioritizing these requests
by sending them to the server right away, deprioritize them by queueing
them and sending 1 per sec.

/assets is an endpoint that is rarely touched, but when it is, it's hit
a lot. Because of this, give it a larger-than-usual burst limit.

* Adjust the bursts towards allowing more requests

* Remove inherited response stat and add exact match

* Add custom 429 response & multiple zones

Add custom 429 json response for api requests.

Add multiple zones so that quotas on one request don't
impact another request. Re-using a zone means the queues are
shared, and that's not what we want, so make a different zone
for each endpoint.

Re-order the rate limits so that they follow an if/else format,
which is easier to read.
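
API clients can detect the rate-limited case by the status code and the JSON body; a minimal sketch of client-side handling, assuming the exact body shape the config returns (the helper name is hypothetical):

```python
import json

# Body returned by the rate-limited API endpoints with a 429 status
# (shape taken from the nginx config in this commit).
body = '{"status": "failure", "reason": "Request rate limit exceeded"}'

def should_back_off(status_code: int, response_body: str) -> bool:
    """Return True when the server signals rate limiting."""
    if status_code != 429:
        return False
    try:
        payload = json.loads(response_body)
    except ValueError:
        return True  # a 429 with a non-JSON body still means back off
    return payload.get("status") == "failure"

print(should_back_off(429, body))  # True
print(should_back_off(200, body))  # False
```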

* Replace comments w/ calcs w/ general statements

Replace the previous comments that contained detailed calculations
that may be system specific with more general statements about how
the number was chosen at a high level.

The following are the detailed calculations that were replaced:
- The 95th percentile time for a badge request is .042s.
 7.6% of worker time is spent handling these requests.
 Typical usage per user is around 50 rpm.
 Queue up badge requests rather than sending them directly
 to the server; this will allow other requests to take priority.

- The 95th percentile time for an asset is .013s.
 The maximum burst of requests from a single page
 https://hypothes.is/docs/help is 20 requests.
 <1% of worker time is spent handling these requests.
 The maximum expected request rate is 25rpm.
 Assume in frustration the user hammers on the refresh
 button 7 times in a row. Worst case this results in
 a burst of 140 requests.

- Each /api/annotations:create request has a 95th percentile
 response time of .56s and there are 12 gunicorn
 workers per host. Create requests account for <1% of the
 traffic on the host so assume .12 workers are allocated
 for /api/annotations:create requests.
   .12 workers * 1 request / .56 s = .21 requests/s, rounded up to ~1rps
 If too many of these requests happen back to back it can
 overwhelm the database so instead of letting a burst of
 requests pass to the server, queue these requests and
 only send 1 each second.
 Allow a user to queue up to 8 requests (8 times the expected rate).

- A bot may burst up to 50rpm and the client issues 5 requests upon
 loading the sidebar. Assume a max request rate of 15 rps in a burst.
 This means the queue size would be 14 requests.
 Allow a user to sustain the max bursty request rate for 3 seconds.
 This would mean they can request up to 45 requests in one second but
 only 1 new request each second after that.
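
The arithmetic behind the replaced comments can be sanity-checked with a short script (all numbers are taken from the bullets above):

```python
# Sanity-check the burst-size arithmetic from the commit message above.

# /api/annotations:create capacity estimate: 12 workers with <1% of
# traffic -> assume 0.12 workers for this endpoint, each request taking
# 0.56 s at the 95th percentile.
create_rps = (12 * 0.01) * 1 / 0.56
print(round(create_rps, 2))  # 0.21 requests/s, rounded up to 1 r/s

# /assets worst case: the heaviest page issues 20 asset requests and a
# frustrated user refreshes 7 times in a row.
assets_burst = 20 * 7
print(assets_burst)  # 140 -> burst=139 on top of the 1 r/s base rate

# /api worst case: a burst of 15 rps sustained for 3 seconds.
api_burst = 15 * 3
print(api_burst)  # 45 -> burst=44 on top of the 1 r/s base rate
```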
Hannah Stepanek committed Nov 23, 2018
1 parent 0bc6360 commit 3f7bcb0
Showing 1 changed file with 61 additions and 0 deletions.
61 changes: 61 additions & 0 deletions conf/nginx.conf
@@ -17,6 +17,22 @@ http {

access_log off;

# If there is an auth token, rate limit based on that,
# otherwise rate limit per IP.
map $http_authorization $limit_per_user {
    "" $binary_remote_addr;
    default $http_authorization;
}

# 1m stands for one megabyte, so each zone can store ~8k users.
# Users typically don't go over 1rps, including bots, so set the
# generic rate limit for all endpoints to 1rps.
limit_req_zone $limit_per_user zone=badge_user_1rps_limit:1m rate=1r/s;
limit_req_zone $limit_per_user zone=assets_user_1rps_limit:1m rate=1r/s;
limit_req_zone $limit_per_user zone=create_ann_user_1rps_limit:1m rate=1r/s;
limit_req_zone $limit_per_user zone=user_1rps_limit:1m rate=1r/s;
limit_req_status 429;

# We set fail_timeout=0 so that the upstream isn't marked as down if a single
# request fails (e.g. if gunicorn kills a worker for taking too long to handle
# a single request).
@@ -55,6 +71,10 @@ http {
return 302 "https://trello.com/b/2ajZ2dWe/public-roadmap";
}

location @api_error_429 {
    return 429 '{"status": "failure", "reason": "Request rate limit exceeded"}';
}

location / {
proxy_pass http://web;
proxy_http_version 1.1;
@@ -66,6 +86,47 @@ http {
proxy_set_header X-Forwarded-Server $http_host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Request-Start "t=${msec}";

# The /api/badge endpoint limit is chosen to limit the
# load from any single user, and take advantage of latency
# not being critical.
location /api/badge {
    limit_req zone=badge_user_1rps_limit burst=15;
    error_page 429 @api_error_429;

    proxy_pass http://web;
}

# The /assets rate limit is chosen so that the user
# can refresh the web page with the most asset links
# on it a few times in succession without hitting the
# limit.
location /assets {
    limit_req zone=assets_user_1rps_limit burst=139 nodelay;

    proxy_pass http://web;
}

# The POST /api/annotations limit is chosen to allow
# reasonable usage while preventing a single user from
# causing service disruption.
location = /api/annotations {
    limit_req zone=create_ann_user_1rps_limit burst=8;
    error_page 429 @api_error_429;

    proxy_pass http://web;
}

location /api {
    limit_req zone=user_1rps_limit burst=44 nodelay;
    error_page 429 @api_error_429;

    proxy_pass http://web;
}

# An overall rate limit was chosen to allow reasonable usage while
# preventing a single user from causing service disruption.
limit_req zone=user_1rps_limit burst=44 nodelay;
}
}
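
nginx's `limit_req` implements a leaky-bucket algorithm. A simplified Python model of the accept/reject behaviour of `rate=1r/s` with a `burst` queue (with `nodelay`, accepted requests are forwarded immediately rather than delayed); this is an illustration of the algorithm, not nginx's actual code:

```python
class LeakyBucket:
    """Simplified model of nginx limit_req: a fixed drain rate
    plus a bounded burst allowance per client key."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate      # allowed requests per second
        self.burst = burst    # extra requests that may exceed the rate
        self.excess = 0.0     # requests currently "in the bucket"
        self.last = 0.0       # timestamp of the previous request

    def allow(self, now: float) -> bool:
        # Drain the bucket at `rate` requests per second since the
        # last request, then see if one more request fits.
        self.excess = max(0.0, self.excess - (now - self.last) * self.rate)
        self.last = now
        if self.excess + 1.0 > self.burst + 1.0:
            return False      # over rate + burst -> rejected with 429
        self.excess += 1.0
        return True

# rate=1r/s burst=44 nodelay: 45 requests in the same second pass,
# the 46th is rejected.
bucket = LeakyBucket(rate=1.0, burst=44)
results = [bucket.allow(now=0.0) for _ in range(46)]
print(results.count(True))  # 45
print(results[-1])          # False
```

One second later a single new slot has drained free, matching the "only 1 new request each second after that" behaviour described in the commit message.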

