Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected long delay uploading BOM artifact s3 bucket #3745

Closed
MarkEWaite opened this issue Sep 5, 2023 · 10 comments
Closed

Unexpected long delay uploading BOM artifact s3 bucket #3745

MarkEWaite opened this issue Sep 5, 2023 · 10 comments

Comments

@MarkEWaite
Copy link

Service(s)

ci.jenkins.io

Summary

While building and running full tests of:

I was surprised to see a 35 minute delay between the start of the stash steps and the sh step that follows the stash steps. The delay seems to be due to a timeout that is reported in the logfile.

The crucial log lines with the retry of uploads due to a timeout were:

[2023-09-05T21:06:41.473Z] [Pipeline] stash
[2023-09-05T21:07:08.126Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/weekly.tgz
[2023-09-05T21:07:06.548Z] [Pipeline] stash
[2023-09-05T21:22:24.175Z] Retrying upload after: java.io.IOException: java.util.concurrent.TimeoutException
[2023-09-05T21:22:33.840Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/2.414.x.tgz
[2023-09-05T21:22:32.216Z] [Pipeline] stash
[2023-09-05T21:37:50.727Z] Retrying upload after: java.io.IOException: java.util.concurrent.TimeoutException
[2023-09-05T21:38:00.290Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/2.401.x.tgz

The detailed log file looks like this:

[2023-09-05T21:06:38.449Z] + launchable record session --build jenkins-Tools-bom-PR-2464-3-2.387.x --flavor platform=linux --flavor jdk=17
[2023-09-05T21:06:40.703Z] [Pipeline] stash
[2023-09-05T21:06:43.078Z] Stashed 1 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/launchable-session-2.387.x.txt.tgz
[2023-09-05T21:06:41.455Z] [Pipeline] }
[2023-09-05T21:06:41.460Z] [Pipeline] // withCredentials
[2023-09-05T21:06:41.464Z] [Pipeline] }
[2023-09-05T21:06:41.468Z] [Pipeline] // dir
[2023-09-05T21:06:41.473Z] [Pipeline] stash
[2023-09-05T21:07:08.126Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/weekly.tgz
[2023-09-05T21:07:06.548Z] [Pipeline] stash
[2023-09-05T21:22:24.175Z] Retrying upload after: java.io.IOException: java.util.concurrent.TimeoutException
[2023-09-05T21:22:33.840Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/2.414.x.tgz
[2023-09-05T21:22:32.216Z] [Pipeline] stash
[2023-09-05T21:37:50.727Z] Retrying upload after: java.io.IOException: java.util.concurrent.TimeoutException
[2023-09-05T21:38:00.290Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/2.401.x.tgz
[2023-09-05T21:37:58.665Z] [Pipeline] stash
[2023-09-05T21:43:40.258Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/PR-2464/3/stashes/2.387.x.tgz
[2023-09-05T21:43:38.681Z] [Pipeline] sh
[2023-09-05T21:43:39.745Z] + mvn -Dset.changelist -Dexpression=settings.localRepository -q -DforceStdout help:evaluate
[2023-09-05T21:43:44.125Z] [Pipeline] sh
[2023-09-05T21:43:45.193Z] + mvn -Dset.changelist -Dexpression=project.version -q -DforceStdout help:evaluate
[2023-09-05T21:43:48.333Z] [Pipeline] echo
[2023-09-05T21:43:48.334Z] Collecting 2413.vf0b_b_487d483c from /home/jenkins/agent/workspace/Tools_bom_PR-2464@tmp/m2repo for possible Incrementals publishing
[2023-09-05T21:43:48.335Z] [Pipeline] dir
[2023-09-05T21:43:48.335Z] Running in /home/jenkins/agent/workspace/Tools_bom_PR-2464@tmp/m2repo
[2023-09-05T21:43:48.335Z] [Pipeline] {
[2023-09-05T21:43:48.341Z] [Pipeline] fingerprint
[2023-09-05T21:43:48.432Z] Recording fingerprints

Reproduction steps

I don't have detailed steps that will duplicate the problem. I will check other build logs for the BOM job to see if I can assess how frequently a timeout is reported in the jobs.

@MarkEWaite MarkEWaite added the triage Incoming issues that need review label Sep 5, 2023
@dduportal
Copy link
Contributor

I can't see slowness except on a bom build. This problem looks the same as the sh slow steps when there are too much concurrency (ref. jenkinsci/bom#1969 (comment)).

cc @jglick for info

@dduportal dduportal added this to the infra-team-sync-2023-09-12 milestone Sep 8, 2023
@dduportal dduportal removed the triage Incoming issues that need review label Sep 8, 2023
@jglick
Copy link

jglick commented Sep 8, 2023

This problem looks the same as the sh slow steps

Perhaps, but from a first glance I just suspect there was some sort of AWS brownout. When using artifact-manager-s3, the uploads and downloads are directly between agent and S3, so performance issues inside the controller are unlikely to matter.

@dduportal
Copy link
Contributor

jenkins-infra/jenkins-infra#3062 was deployed to ci.jenkins.io (also trusted.ci and cert.ci e.g. all puppet-managed controller):

Capture d’écran 2023-09-12 à 08 36 48

@MarkEWaite can you retry?

@jglick is there any test or checks you want us to try on ci.jenkins.io?

@jglick
Copy link

jglick commented Sep 12, 2023

Just checking speed of large builds such as bom, and correctness of console text in case there are any unusual console filters installed (I cannot think of any we would be using here).

@dduportal
Copy link
Contributor

Triggering a bom build to check results: https://ci.jenkins.io/job/Tools/job/bom/job/master/2097/

@dduportal
Copy link
Contributor

Currently checking https://ci.jenkins.io/job/Tools/job/bom/job/master/2108/ (which is marked as kept foreverer for now)

@dduportal
Copy link
Contributor

Looks like the timings are better: a few min for staches at max: far from the 35 min you saw.

  • @jglick assumption about a potential transient error on AWS S3 when it happened
  • The addition of the flag did not seem to have any bad side effect at first sight

For info, the bom build took ~ 3h30 with a lot of agent reconnections. Worth restart work in jenkinsci/bom#1969

@MarkEWaite
Copy link
Author

Thanks @dduportal! In bom build 2106 (also successful) I see that the time between the entry that starts the stash and the "stashed n file(s)" line is most commonly a little over 20 seconds. The lines look like this:

[2023-09-17T08:17:59.491Z] [Pipeline] stash
[2023-09-17T08:18:21.298Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/weekly.tgz
[2023-09-17T08:18:20.405Z] [Pipeline] stash
[2023-09-17T08:18:43.619Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/2.414.x.tgz
[2023-09-17T08:18:42.727Z] [Pipeline] stash
[2023-09-17T08:19:05.939Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/2.401.x.tgz
[2023-09-17T08:19:05.049Z] [Pipeline] stash
[2023-09-17T08:19:29.140Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/2.387.x.tgz

There are no retries mentioned in that log. I assume a 20 second upload time is expected for those artifacts due to their size?

@dduportal
Copy link
Contributor

There are no retries mentioned in that log.

Sorry I wrote a bit too quickly: I meant "retries at agent level". The build https://ci.jenkins.io/job/Tools/job/bom/job/master/2106/ had 3 agents removed (due to a spot instance being reclaimed, seen in AWS EC2 cloudwatch).
When an agent is removed (connection timeout, instance reclaimed, etc.), the corresponding block is retried, including the eventual stash/unstash (which must happen within a node block).

I see that the time between the entry that starts the stash and the "stashed n file(s)" line is most commonly a little over 20 seconds. The lines look like this:

[2023-09-17T08:17:59.491Z] [Pipeline] stash
[2023-09-17T08:18:21.298Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/weekly.tgz
[2023-09-17T08:18:20.405Z] [Pipeline] stash
[2023-09-17T08:18:43.619Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/2.414.x.tgz
[2023-09-17T08:18:42.727Z] [Pipeline] stash
[2023-09-17T08:19:05.939Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/2.401.x.tgz
[2023-09-17T08:19:05.049Z] [Pipeline] stash
[2023-09-17T08:19:29.140Z] Stashed 4 file(s) to https://ci-jenkins-io-artifacts.s3.us-east-2.amazonaws.com/ci.jenkins.io/Tools/bom/master/2106/stashes/2.387.x.tgz

I assume a 20 second upload time is expected for those artifacts due to their size?

It is mostly due to their sizes yes. But we can't rule out the slowness we see on the sh steps due to high concurrency.
The pipeline flow table is also a really practical way to visualize the time spent on each step: https://ci.jenkins.io/job/Tools/job/bom/job/master/2106/flowGraphTable/ (I can't thanks @jglick for teaching me this tool)

@dduportal
Copy link
Contributor

Closing as the initial problem (stash at 30 min each) seems solved (but the slow stash/sh steps are NOT resolved!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants