
Where is the best layer to add throttling to avoid being marked for abuse? #431

Closed
dmitshur opened this issue Sep 21, 2016 · 4 comments

@dmitshur
Member

dmitshur commented Sep 21, 2016

The GitHub API has some guidelines to follow to avoid hitting the abuse rate limit:

https://developer.github.com/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits

An application should follow those rules. Some of them can be enforced in a general/automated way via throttling, I think, and so they can be factored out into a library (rather than having to be done manually inside the application layer). Specifically:

  1. Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.

    If we can assume a single *github.Client maps to a single user or client ID, then this can be done by using a mutex or semaphore or similar.

  2. If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.

    Again, if we can assume a single *github.Client maps to a single user or client ID, then this can be done by keeping track of the last POST, PATCH, PUT, or DELETE request time, and sleeping if necessary to make the next one happen at least one second later.

  3. When you have been limited, wait the number of seconds specified in the Retry-After response header.

    We can detect abuse rate limit response (it's documented here), read the number of seconds specified in the Retry-After response header, and not allow making network requests until that time has passed (probably returning *RateLimitError error, or an abuse-specific version thereof, right away for all requests).

I see two places where this can be done:

  1. Possibly inside go-github library. Inside Client itself. Probably as a configurable option.
  2. Or outside of this library. As a higher-level wrapper layer around go-github library.

So far, go-github has been a relatively "low-level" API client, meaning it helps you out with type safety and provides structured output, but aside from that, it maps roughly 1:1 to making HTTP requests to the GitHub API.

It has taken some steps to be a smarter/higher-level client where it could be done so transparently, without blocking. For example, after #277, RateLimitError type was added, and after #347, go-github is smart enough to track rate limit reset time and not make network calls when the rate limit is known to still be exceeded, returning RateLimitError right away.

So far it has also been a "return-right-away" API, where none of the calls are artificially throttled on the client side; they just return an error if there's a rate limit problem. So adding throttling to the go-github client itself may seem out of place, and would belong in a higher-level wrapper around go-github.

On the other hand, if done as an option that can be controlled, perhaps this is the best place to do these things, since it means more people can use the GitHub API correctly with less work from their side, by default. The user of the github client could make calls concurrently, but they would block and get serialized by the client as to follow the GH API guidelines. It would effectively shift some of the throttling that would otherwise be done on GitHub server side, to the go-github library client side.

@willnorris, @gmlewis, what are your thoughts on this? Where is the best place for this functionality?

Related issues: #152, #153, #277, #347, #304.

@gmlewis
Collaborator

gmlewis commented Sep 21, 2016

To be honest, every time I write a big API-dependent job, I do this:

```go
githubDelay := 720 * time.Millisecond // 5000 QPH = 83.3 QPM = 1.38 QPS = 720ms/query
```

and then call time.Sleep(githubDelay) before each and every API call. That way, no matter how long it takes to run my batch job, I never hit the API quota limit. The downside to this is if you have any other jobs running as that same GitHub owner (of the Personal Access Token), the two jobs might interfere with each other, which Will and I have been talking about a lot. We are hoping that the new Integrations API will help with this situation.

I know this doesn't really help much... but is just another data point.

@gmlewis
Collaborator

gmlewis commented Feb 1, 2017

Shortly after I wrote the comment above, I switched to using a time.Ticker:

```go
const githubRate = 720 * time.Millisecond
...
ticker := time.NewTicker(githubRate)
...
<-ticker.C // Don't hammer GitHub API
// call to GitHub API
```

which allows my tool to run for potentially hours and not abuse the rate limit.

bubg-dev pushed a commit to bubg-dev/go-github that referenced this issue Jun 16, 2017
It's similar to RateLimitError, but it's a different type of rate limit
error. It is documented at:

https://developer.github.com/v3/#abuse-rate-limits

Parse and include the Retry-After header value, if present. It is
documented at:

https://developer.github.com/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits

According to GitHub support, its type is an integer value representing
seconds.

Helps google#431.
nathanleiby added a commit to Clever/microplane that referenced this issue Dec 8, 2017
We do this because the Github API triggers abuse warnings if a single
user makes lots of concurrent requests. By having a shared ticker, we
ensure that Github API requests don't happen in parallel.

Moreover, we also ensure that there's a global rate limit of 1 request
to the GitHub API per 720 milliseconds. We're following the approach
shown here: google/go-github#431

This means that we'll stay within the rate limit requirement
of 5000 requests-per-hour for authenticated users.

```
5000 QPH = 83.3 QPM = 1.38 QPS = 720ms/query
```
nathanleiby added a commit to Clever/microplane that referenced this issue Dec 8, 2017
@radeksimko
Contributor

Here's how we implemented linked best practices in our repository:
https://github.com/terraform-providers/terraform-provider-github/blob/fa73654b66e37b1fd8d886141d9c2974e24ba42f/github/transport.go#L42-L109

I'd be more than happy to raise a PR if there's any interest in bringing these transports upstream along with tests. I don't expect everyone will want/need to use them, but it would IMO be nice to keep them upstream.

@gmlewis
Collaborator

gmlewis commented Aug 6, 2021

Closing issue for lack of activity/interest. Feel free to reopen.

@gmlewis gmlewis closed this as completed Aug 6, 2021