
Add sanity caching and retry controls #1744

Merged: 8 commits into hub4j:main from the sanity-cache branch, Nov 22, 2023
Conversation

bitwiseman (Member) commented Nov 19, 2023

Description

  • Add one query per second limit on getRateLimit() and getMeta()
  • Add configurability and randomization to retries

Mitigate the most severe parts of #1728

@samrocketman @KeepItSimpleStupid
Please take a look.

Before submitting a PR:

  • Changes must not break binary backwards compatibility. If you are unclear on how to make the change you think is needed while maintaining backward compatibility, see CONTRIBUTING.md for details.
  • Add JavaDocs and other comments explaining the behavior.
  • When adding or updating methods that fetch entities, add @link JavaDoc entries to the relevant documentation on https://docs.github.com/en/rest .
  • Add tests that cover any added or changed code. This generally requires capturing snapshot test data. See CONTRIBUTING.md for details.
  • Run mvn -D enable-ci clean install site locally. If this command doesn't succeed, your change will not pass CI.
  • Push your changes to a branch other than main. You will create your PR from that branch.

When creating a PR:

  • Fill in the "Description" above with a clear summary of the changes. This includes:
    • If this PR fixes one or more issues, include "Fixes #" lines for each issue.
    • Provide links to relevant documentation on https://docs.github.com/en/rest where possible. If not including links, explain why not.
  • All lines of new code should be covered by tests as reported by code coverage. Any lines that are not covered must have PR comments explaining why they cannot be covered. For example, "Reaching this particular exception is hard and is not a particularly common scenario."
  • Enable "Allow edits from maintainers".


codecov bot commented Nov 19, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (2062439) 80.16% compared to head (7a73735) 80.41%.

Files Patch % Lines
src/main/java/org/kohsuke/github/GitHubClient.java 97.36% 1 Missing and 1 partial ⚠️
...va/org/kohsuke/github/GitHubSanityCachedValue.java 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1744      +/-   ##
============================================
+ Coverage     80.16%   80.41%   +0.24%     
- Complexity     2302     2309       +7     
============================================
  Files           217      218       +1     
  Lines          6964     7015      +51     
  Branches        371      371              
============================================
+ Hits           5583     5641      +58     
+ Misses         1150     1141       -9     
- Partials        231      233       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

 */
<E extends Throwable> T get(SupplierThrows<T, E> query) throws E {
    synchronized (lock) {
        if (Instant.now().getEpochSecond() > lastQueriedAtEpochSeconds) {
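For readers without the full diff open, here is a minimal sketch of how a once-per-second sanity cache along these lines can work (the snippet above appears to come from the new GitHubSanityCachedValue class reported by Codecov). This is an illustration under assumptions, not the PR's exact code: the class and field names below, other than those visible in the snippet, are made up, as is the refresh logic past the truncated line.

```java
import java.time.Instant;
import java.util.function.Supplier; // stand-in for the PR's SupplierThrows

// Hypothetical sketch: a value that is re-queried at most once per second.
class SanityCachedValue<T> {
    private final Object lock = new Object();
    private long lastQueriedAtEpochSeconds = 0;
    private T lastResult = null;

    T get(Supplier<T> query) {
        synchronized (lock) {
            // Only hit the backend if at least one full second has elapsed
            // since the last query; otherwise return the cached result.
            if (Instant.now().getEpochSecond() > lastQueriedAtEpochSeconds) {
                lastResult = query.get();
                lastQueriedAtEpochSeconds = Instant.now().getEpochSecond();
            }
            return lastResult;
        }
    }
}
```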
KeepItSimpleStupid (Contributor) commented on the one-second check:
Maybe the delay of "1 second" should be configurable as well ?

Co-authored-by: Fabien Thouny <3714318+KeepItSimpleStupid@users.noreply.github.com>
@samrocketman left a comment:

Can we not use properties in static fields? They require a JVM restart, which can be a challenge on a heavily active system. Any particular reason you don't want to do a property lookup where the integers get used?

Defining startup properties isn't generally bad, but if you're trying to tune the retry behavior it's a drag to have downtime every time you tweak it.

e.g. use System.setProperty within the script console, and then set the system property on startup once you have found an acceptable range of values.
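To make the distinction concrete, here is a sketch of the two approaches being discussed. The property name and default value below are placeholders for illustration, not ones this PR defines.

```java
// Hypothetical illustration; the property name and default are placeholders.
class RetryTuningExample {
    // Read once at class load: changing the property later has no effect
    // until the JVM is restarted.
    private static final int MIN_RETRY_INTERVAL_AT_STARTUP =
            Integer.getInteger("org.example.minRetryInterval", 100);

    // Read at the point of use: System.setProperty(...) from a script console
    // takes effect on the next call, with no restart needed.
    static int minRetryIntervalNow() {
        return Integer.getInteger("org.example.minRetryInterval", 100);
    }
}
```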

@@ -632,11 +651,15 @@ private static IOException interpretApiError(IOException e,

private static void logRetryConnectionError(IOException e, URL url, int retries) throws IOException {
    // There are a range of connection errors where we want to wait a moment and just automatically retry
    long sleepTime = minRetryInterval;
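The line above seeds sleepTime from minRetryInterval. Since the description lists randomization of retries, here is one way a randomized sleep in the [min, max] range could be computed. This is a sketch of the general technique under the assumption that a maxRetryInterval setting exists, not necessarily the PR's exact implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: pick a sleep time uniformly between the configured
// minimum and maximum so concurrent clients do not all retry in lockstep.
static long randomizedSleepTime(long minRetryInterval, long maxRetryInterval) {
    long sleepTime = minRetryInterval;
    if (maxRetryInterval > minRetryInterval) {
        sleepTime += ThreadLocalRandom.current().nextLong(maxRetryInterval - minRetryInterval + 1);
    }
    return sleepTime;
}
```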


I suggest setting minRetryInterval and maxRetryInterval via Integer.getInteger property here.

      LOGGER.log(INFO,
-             e.getMessage() + " while connecting to " + url + ". Sleeping " + GitHubClient.retryTimeoutMillis
+             e.getMessage() + " while connecting to " + url + ". Sleeping " + sleepTime


I recommend creating a trace ID here for debug logging. That way an admin can search debug logs and find related logs for a single retry sequence. Here's an example

When I enable debug logging for the class mentioned, it is so active in parallel that all of the logs come in out of order. Because of that, using the trace ID as a prefix on all of the logs enabled me to search for a series of logs along with their retries. It also let me find the maximum retry count across logs, which helps an admin with tuning.

For example, I default to 30 retries in my class, but I found in practice with GitHub it could retry up to 28 times. Because that was so close to the max retry limit, I increased the retry limit to 60 in my particular setup.

I also set the minimum time between retries to 1000ms and the maximum to 3000ms. I've found GitHub requiring retries for up to 1 minute in these scenarios because of secondary API limits.

The new secondary API limits are very aggressive at the moment.
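As an illustration of the trace-ID idea (not code from this PR), a short identifier could be generated once per retry sequence and prefixed to every related log line, for example:

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: one short ID per retry sequence so an admin can grep
// the debug log for every line belonging to that sequence.
class RetryTraceExample {
    private static final Logger LOGGER = Logger.getLogger(RetryTraceExample.class.getName());

    static void logRetry(String traceId, int attempt, int maxRetries, String url, long sleepMillis) {
        LOGGER.log(Level.INFO, "[" + traceId + "] retry " + attempt + "/" + maxRetries
                + " for " + url + "; sleeping " + sleepMillis + " ms");
    }

    public static void main(String[] args) {
        String traceId = UUID.randomUUID().toString().substring(0, 8);
        logRetry(traceId, 1, 30, "https://api.github.com/meta", 1000);
        logRetry(traceId, 2, 30, "https://api.github.com/meta", 2500);
    }
}
```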


I realize the logging mechanisms have to change a little bit with my feedback; they're not as straightforward as making the change just in this area of code.

bitwiseman (Member, Author) commented:

@samrocketman

I was thinking property lookup is slow and I don't want to do it per query. But compared to query times it's probably negligible. I see your point about down time.

This isn't intended as a long-term feature as it stands right now, and I want to make as little change to existing behavior as possible.

I'd be okay with saying that if any of the environment variables are set on startup then they will be checked for every query. So, you have to opt in to the behavior on startup.
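To illustrate the opt-in idea (shown with a system property for consistency with the discussion above; the property name and default are placeholders, not necessarily what the PR uses):

```java
// Hypothetical opt-in: if the property is present at startup, re-read it on every
// query; otherwise stick with the built-in default captured at class load.
class OptInTuningExample {
    private static final Integer MIN_RETRY_INTERVAL_AT_STARTUP =
            Integer.getInteger("org.example.minRetryInterval"); // null when unset at startup

    private static final int DEFAULT_MIN_RETRY_INTERVAL = 100;

    static int currentMinRetryInterval() {
        if (MIN_RETRY_INTERVAL_AT_STARTUP == null) {
            return DEFAULT_MIN_RETRY_INTERVAL; // not opted in: no per-query lookup
        }
        // Opted in: look the property up again so runtime tuning via
        // System.setProperty takes effect without a restart.
        return Integer.getInteger("org.example.minRetryInterval", MIN_RETRY_INTERVAL_AT_STARTUP);
    }
}
```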

samrocketman commented Nov 19, 2023

> @samrocketman
>
> I was thinking property lookup is slow and I don't want to do it per query. But compared to query times it's probably negligible. I see your point about down time.
>
> This isn't intended as a long-term feature as it stands right now, and I want to make as little change to existing behavior as possible.
>
> I'd be okay with saying that if any of the environment variables are set on startup then they will be checked for every query. So, you have to opt in to the behavior on startup.

Being tunable at all is a plus, really. In the case of pipeline API interactions, we're moving to a weird but workable hack where we obtain one of 10 flocks at random, meaning there can be up to 10 clients active with GitHub at a time in pipelines. That's kind of how bad it is, though; we're at that point.

We're kind of pegged against GH limits, so I do think property lookup is negligible. But either way, if that's how it is, we can work with it. It solves a critical issue on our end with dropped pipeline jobs not being created when they should be.

bitwiseman (Member, Author) commented:

@samrocketman
Updated with a trace ID for logging.

You said:

> Maybe the delay of "1 second" should be configurable as well ?

Perhaps, but why add the complexity? My thought right now is this is purely a sanity check. One second is sane and simple.
If this is enough to bring stability, we're done. If not, we can add configurability.

KeepItSimpleStupid (Contributor) commented:

> You said:
>
> > Maybe the delay of "1 second" should be configurable as well ?
>
> Perhaps, but why add the complexity? My thought right now is this is purely a sanity check. One second is sane and simple. If this is enough to bring stability, we're done. If not, we can add configurability.

@bitwiseman This comment was from me: I was not sure if you wanted to adopt an iterative approach or if you wanted to anticipate all the needs ;) But it's perfectly fine like that!

For my information, once it's merged and released, a new version of this plugin would be needed for usage in Jenkins, right?

Thanks a lot !

bitwiseman (Member, Author) commented:

@KeepItSimpleStupid

> @bitwiseman This comment was from me: I was not sure if you wanted to adopt an iterative approach or if you wanted to anticipate all the needs ;) But it's perfectly fine like that!

Yes, iterative.

> For my information, once it's merged and released, a new version of this plugin would be needed for usage in Jenkins, right?

Yes.

bitwiseman (Member, Author) commented:

@samrocketman
Does this look ready to merge?

bitwiseman (Member, Author) commented:

@samrocketman
Approved?

samrocketman commented:
> @samrocketman
> Approved?

Away on holiday at the moment, so I can't easily review from mobile, but I'll take a look.

@samrocketman left a comment:

These changes look good to me. Thanks for implementing and accommodating feedback. Interesting use of a thread-local string; I learn something every review in Jenkins projects.

bitwiseman merged commit b2b3e1c into hub4j:main on Nov 22, 2023
11 checks passed
bitwiseman deleted the sanity-cache branch on November 22, 2023 at 22:53
bitwiseman (Member, Author) commented:

Released in 1.318.

samrocketman commented:

I subscribed to the github-api plugin repo for releases.
