
Artifact Caching Proxy (ACP) does not cache artifacts from Maven Central #3969

Closed · basil opened this issue Mar 1, 2024 · 25 comments

@basil (Collaborator) commented Mar 1, 2024

Service(s)

Artifact-caching-proxy

Summary

The mirrorOf directive in our Artifact Caching Proxy (ACP) configuration in https://github.com/jenkins-infra/jenkins-infra/blob/ac501efc3cee3b613761e3026dd9798ddf827ec4/dist/profile/templates/jenkinscontroller/casc/artifact-caching-proxy.yaml.erb#L14 contains !central, which means that Central artifacts are downloaded directly from https://repo.maven.apache.org and not via ACP. This introduces a degree of fragility into our builds. While https://repo.maven.apache.org has not gone down completely, its performance fluctuates widely—sometimes it is quite fast, while other times it exhibits latency high enough to triple or quadruple build times, in some cases long enough for builds to time out and fail outright.
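For illustration, a minimal sketch of the kind of mirror stanza involved (paraphrased; the id, name, and URL below are placeholders, not the production values from the linked template):

```xml
<settings>
  <mirrors>
    <mirror>
      <id>artifact-caching-proxy</id>
      <name>Artifact Caching Proxy</name>
      <!-- Mirror every repository *except* central: Central artifacts
           bypass ACP and are fetched directly from repo.maven.apache.org. -->
      <mirrorOf>*,!central</mirrorOf>
      <url>https://acp.example.jenkins.io/public/</url> <!-- placeholder URL -->
    </mirror>
  </mirrors>
</settings>
```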

For local builds that use a persistent ~/.m2/repository without ACP, the performance of Central is not very important. Building core or a plugin for the first time on a clean system will indeed result in a lot of Central artifacts being downloaded from https://repo.maven.apache.org, which could take a long time if you get unlucky and latency happens to be high, but at least the result will be cached for subsequent builds. Over time, the cache will fill up with commonly-used artifacts and the only Central artifacts downloaded from https://repo.maven.apache.org will be a handful of new releases.

In contrast, CI builds do not have a persistent ~/.m2/repository, so every CI run (which includes multiple builds on different JDK versions and platforms) will download all Central artifacts from https://repo.maven.apache.org each time, exposing us to build fragility caused by performance problems in infrastructure that is out of our control. The design of CI builds not having a persistent ~/.m2/repository is predicated on high-performance artifact downloading, which is in conflict with the use of an uncached third-party server in each build. To resolve this conflict, Central artifacts for CI builds should be cached in ACP, where they can be reliably downloaded with low latency in CI builds without a persistent ~/.m2/repository.

Reproduction steps

Run a plugin, core, or BOM build 6 times a day for 6 days and observe the wide variance in build times, driven by the current performance of a third-party service that is out of our control.

@dduportal (Contributor)

Started working on this topic with @smerle33

The main "setup" challenge, if we remove the !central from the settings.xml setup, we'll also need to setup ACP to use the Maven Central as a source of artifacts (it does not today as we used to have repo.jenkins-ci.org doing this through the virtual repository including the mirror of Maven central which we removed since then).

We are working on this ACP setup and we'll update the issue accordingly

@dduportal (Contributor)

Yes, I've found a setup which works, but it will require changes to the settings.xml (Central serves artifacts under the relative path /maven2, which differs from the /public path of our setup).

Additionally, I've found that disabling IPv6 resolution decreases the number of system calls made by Nginx when resolving DNS names. I believe this could help in production on any of our dual-stack networks.

More work is needed to formally prove the improvement, but it is heading in a good direction.
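To make both points concrete, here is a minimal Nginx sketch (the listen port, resolver address, and upstream mapping are illustrative assumptions, not the production ACP configuration):

```nginx
server {
    listen 8080;

    # Skip AAAA lookups when resolving upstream DNS names,
    # reducing resolver system calls on dual-stack networks.
    resolver 8.8.8.8 ipv6=off;

    # Hypothetical mapping of the proxy's /public prefix onto
    # Maven Central's /maven2 layout.
    location /public/ {
        rewrite ^/public/(.*)$ /maven2/$1 break;
        proxy_pass https://repo.maven.apache.org;
        proxy_ssl_server_name on;  # send SNI when proxying over TLS
    }
}
```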

@dduportal (Contributor)


Another interesting performance point: with the current setup, the settings.xml specifies 2 mirrors. Maven appears to try the first mirror and, if the artifact is not found there, fall back to the second, and so on.
The performance penalty of this "try one after the other" approach seems higher than tuning the ACP Nginx to do the same: Nginx is presumably more efficient, and it has the benefit of caching the result once it has resolved the fallback.

I initially wanted to introduce the "central caching" as a new feature to avoid breaking existing setups during the deployments, but as I'm seeing a 2x to 10x performance boost in dependency resolution on my local ACP test, I believe we should try delegating the fallback mechanism fully to Nginx; see the sketch below.
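A common Nginx pattern for this kind of server-side fallback, sketched here under assumed names (the real ACP Helm chart may be structured differently), intercepts a 404 from the first upstream and retries a second one:

```nginx
server {
    listen 8080;

    location / {
        proxy_pass https://repo.jenkins-ci.org/public/;
        proxy_intercept_errors on;   # let error_page handle upstream errors
        error_page 404 = @central;   # on a miss, fall back to Maven Central
    }

    location @central {
        rewrite ^/(.*)$ /maven2/$1 break;  # Central's artifact root is /maven2
        proxy_pass https://repo.maven.apache.org;
        proxy_ssl_server_name on;
    }
}
```

With proxy_cache layered on top, the cost of the fallback round trip is paid once per artifact rather than on every build.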

@dduportal (Contributor) commented Mar 27, 2024

First step: Update of the ACP helm chart where the Nginx configuration is located - jenkins-infra/helm-charts#1110

The local tests are really promising for performance: I get a 5x to 10x faster rm -rf ~/.m2/repository && mvn clean -U with https://github.com/jenkinsci/jenkins-infra-test-plugin. Still need to see if it extends to other builds.

@basil (Collaborator, Author) commented Mar 27, 2024

I believe we should try delegating the fallback mechanism fully to Nginx.

The performance improvement is compelling, but the (potential) downside is that now these builds are using a different implementation of the fallback mechanism than local builds. As long as the two stay in sync, I think this should be fine. In other words, I believe the performance benefit is worth the (small) risk of diverging from local builds, simply mentioning it for completeness.

@dduportal (Contributor)

Thanks for the feedback and the confirmation. I believe we'll have to carefully check the Maven plugin and the incremental mechanism.

Proceeding with the deployment of this new setup; we'll check the results over the next 24h.

dduportal added a commit to jenkins-infra/pipeline-library that referenced this issue Mar 28, 2024
This PR is not expected to be merged or reviewed and will stay as a draft as a matter of caution.

It is aimed for testing jenkins-infra/helpdesk#3969
dduportal added a commit to jenkinsci/jenkins-infra-test-plugin that referenced this issue Mar 28, 2024
…#847

This PR is not expected to be merged or reviewed and will stay as a draft as a matter of caution.

It is aimed for testing jenkins-infra/helpdesk#3969 and relies on jenkins-infra/pipeline-library#847
@dduportal (Contributor)

Deployed, messages added (status, ci.jenkins.io homepage, IRC). Email to developers incoming.

First tests show no errors on builds of jenkinsci/jenkins-infra-test. I've triggered a BOM build to "load" the cache (note the cache is distributed, so we need a few builds before hitting a fully cached state).

I've opened jenkins-infra/pipeline-library#847 to be used to print the Maven transfer information if needed to debug (example: jenkinsci/jenkins-infra-test-plugin#120).

@dduportal (Contributor)

First results on the branch master of https://github.com/jenkinsci/jenkins-infra-test-plugin on ci.jenkins.io:

  • Build 222 (ACP not caching Maven Central):
    • Linux (21): [INFO] Total time: 12:27 min (do)
    • Windows (17): [INFO] Total time: 02:35 min (azure)
  • Build 223 (ACP not caching Maven Central):
    • Linux (21): [INFO] Total time: 02:43 min (aws)
    • Windows (17): [INFO] Total time: 02:59 min (azure)
  • Build 224 (new ACP):
    • Linux (21): [INFO] Total time: 12:34 min (do)
    • Windows (17): [INFO] Total time: 03:21 min (azure)
  • Build 225 (new ACP):
    • Linux (21): [INFO] Total time: 01:19 min (do)
    • Windows (17): [INFO] Total time: 02:25 min (azure)
  • Build 226 (new ACP):
    • Linux (21): [INFO] Total time: 01:41 min (aws)
    • Windows (17): [INFO] Total time: 01:04 min (azure)
  • Build 227 (new ACP):
    • Linux (21): [INFO] Total time: 01:23 min (do)
    • Windows (17): [INFO] Total time: 01:12 min (azure)

=> At first glance, caching Maven Central seems beneficial:

  • On DigitalOcean (do), without any doubt: from 12 min to ~2 min! I believe the us-east-1 (where the Maven Central server we specify is located) <-> Frankfurt, Germany (DigitalOcean) round trip could be a cause, and caching Central clearly changes the behavior here.
  • On AWS and Azure (both in us-east-1) we also see an improvement, though less impressive: still almost 1 min saved on a 3 min build!

=> I've triggered multiple builds of BOM and Core to warm up the caches everywhere (the BOM prep stage runs randomly on DigitalOcean or AWS; Core is spread across the 3 cloud providers).

@dduportal (Contributor)

Update: watching the logs of DigitalOcean's ACP instances and filtering out the HTTP/200 entries, I got 2 kinds of messages:

  • Nginx buffering the proxy response body to temp files (always a Jenkins WAR: a big artifact, most probably bigger than the configured in-memory buffer). Not sure of the performance impact though (to be watched):
2024/03/28 08:52:12 [warn] 20#20: *8535 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/7/93/0000001937 while reading upstream, client: 10.244.1.227, server: _, request: "GET /org/jenkins-ci/main/jenkins-war/2.401.3/jenkins-war-2.401.3.war HTTP/1.1", upstream: "https://3.5.3.112:443/aol-jenkinsci/filestore/76/7661876c65beb83030ebc30ff67ca22dc780f53b?X-Artifactory-username=anonymous&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=releases&X-Artifactory-packageType=maven&X-Artifactory-artifactPath=org%2Fjenkins-ci%2Fmain%2Fjenkins-war%2F2.401.3%2Fjenkins-war-2.401.3.war&X-Artifactory-projectKey=default&x-jf-traceId=bcd35093bb886267&response-content-disposition=attachment%3Bfilename%3D%22jenkins-war-2.401.3.war%22&response-content-type=application%2Fjava-archive&X-Amz-Security-Token=<redacted>
  • A few HTTP/404 with a macOS user agent. I believe these are from @smerle33's local tests (confirmation in progress):
10.244.1.227 - - [28/Mar/2024:08:38:34 +0000] "GET /public/io/jenkins/tools/incrementals/git-changelist-maven-extension/1.6/git-changelist-maven-extension-1.6.pom HTTP/1.1" 404 192 1.145 "-" "Apache-Maven/3.9.6 (Java 17.0.7; Mac OS X 14.2.1)" "10.114.0.5" "34.225.228.28:443 : 44.209.124.32:443 : 146.75.120.209:443" "404 : 404 : 404" "0.638 : 0.374 : 0.128"

@dduportal (Contributor)

* A few HTTP/404 with a macOS user agent. I believe these are from @smerle33's local tests (confirmation in progress):

Confirmed: @smerle33 is in the ingress access logs and maps to all the 404 requests ✅

@dduportal (Contributor)

Update:

  • Disk usage for the cache increased a bit, but we clearly have space: from ~5 to 6.5 GB on DigitalOcean (for both instances), on a 100 GB volume (note: we have monitoring of the persistent volumes that alerts us when 80% or more is used).

@dduportal (Contributor)

The prep stage of the BOM build on its master branch has been improving since the new ACP configuration was applied:

  • Build 2640 (on Digital Ocean)
08:56:02  [INFO] --------------------------------[ hpi ]---------------------------------
09:19:16  [INFO] 
...
Total time:  32:26 min # First mvn command
  • Build 2641 (on Digital Ocean)
10:12:45  [INFO] --------------------------------[ hpi ]---------------------------------
10:18:17  [INFO] 
10:18:17  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---
...
Total time:  10:40 min  # First mvn command
  • Build 2642 (on Digital Ocean)
10:39:22  [INFO] --------------------------------[ hpi ]---------------------------------
10:40:33  [INFO] 
10:40:33  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---
...
[INFO] Total time:  04:46 min  # First mvn command

I'll check the results in the same way on the Core and git-plugin builds later today.

@basil (Collaborator, Author) commented Mar 28, 2024

I watched some plugin builds today, and performance was back to the level I remember it being a few months ago.

@MarkEWaite

All the plugin builds that I watched today were performing very well. Builds whose compile stage previously took more than 5 minutes were consistently at 2 minutes or less.

@MarkEWaite

I detected one surprise from the artifact caching proxy on Digital Ocean. https://ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1/ reports on the Java 11 build that an artifact unexpectedly had less content than was expected.

I've marked the build as "keep forever" in case you want to investigate further. https://ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1/pipeline-console/?selected-node=96 is the specific task that shows the message.

@smerle33 (Contributor)

I detected one surprise from the artifact caching proxy on Digital Ocean. ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1 reports on the Java 11 build that an artifact unexpectedly had less content than was expected.

I've marked the build as "keep forever" in case you want to investigate further. ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1/pipeline-console/?selected-node=96 is the specific task that shows the message.

ACP received this "connection closed" from JFrog:

21#21: *37961 upstream prematurely closed connection while reading upstream
GET /io/jenkins/plugins/echarts-api/5.4.3-1/echarts-api-5.4.3-1.hpi

As the second build went through (https://ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/2/),
I am removing the "keep forever" flag.

@MarkEWaite

Thanks so much for investigating @smerle33. I'm closing this as "not planned".

@MarkEWaite closed this as not planned Mar 29, 2024
@MarkEWaite reopened this Mar 29, 2024
@MarkEWaite

And reopening, because I failed to notice that this issue covered the entire topic, not just my report of a transient failure. Apologies for the noise of closing and reopening.

@dduportal (Contributor)

I believe we can close this particular issue as "Done" under the following assumptions:

=> Without any objection, I'll close this in the next few hours.

@basil (Collaborator, Author) commented Mar 29, 2024

I agree that this issue can be closed. Is this another transient issue?

jenkinsci/active-choices-plugin#236 (comment)

@dduportal (Contributor)

I agree that this issue can be closed. Is this another transient issue?

jenkinsci/active-choices-plugin#236 (comment)

Oh, it says status code: 500, reason phrase: Internal Server Error (500), which is a bit more serious than a 404. Let's analyze the logs (infra team) to determine whether it is serious, reproducible, or transient.

@dduportal (Contributor) commented Mar 29, 2024

I agree that this issue can be closed. Is this another transient issue?

jenkinsci/active-choices-plugin#236 (comment)

Looks like a transient issue at first sight:

  • The artifact org/jenkins-ci/plugins/durable-task/550.v0930093c4b_a_6/durable-task-550.v0930093c4b_a_6.hpi is now present in the cache of the 6 ACP instances (including the 2 DigitalOcean replicas) with the same fingerprint.
  • Confirmed in the access logs (in Datadog) that this artifact has been downloaded successfully on DigitalOcean at least twice per replica (HTTP/200 served from the cached file).

A few notes:

There are only 2 lines of error logs for the whole ACP service (6 replicas across the 3 clouds) in the past 24h, as per the Datadog access log collection:

=> Both happened in DigitalOcean, and both artifacts are associated with a lot of warning messages such as

an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream

which I saw yesterday on certain big artifacts such as Jenkins WARs.

Currently checking the metrics to see how Nginx's memory buffers are behaving: I believe setting proxy_max_temp_file_size (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_max_temp_file_size) to 0, which prevents the "write buffer to file" behavior, could help, since we end up caching the file from the buffer anyway. But the warnings might be a sign of memory limits, so I prefer to check metrics before acting.
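For reference, a minimal sketch of that tuning in an Nginx proxy block (the upstream is a placeholder, not the ACP chart's actual configuration):

```nginx
location / {
    proxy_pass https://upstream.example.org;  # hypothetical upstream

    # Never spill buffered upstream responses to disk: when the in-memory
    # buffers are full, the response is passed to the client synchronously
    # instead of being written to a temporary file.
    proxy_max_temp_file_size 0;
}
```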

@dduportal (Contributor)

Update:

  • No more ERROR messages of this kind: the only remaining error messages (10 in 6 days) are related to the upstream remote hitting network timeouts, i.e. unrelated to this setup (and already present at a similar rate before 1.0.0).
  • Over the past 2 days, for ~4.3M HTTP/200 entries in the ACP access logs, we had around 43K warning messages ("an upstream response is buffered to a temporary file"). That's a low ratio (~1%), even though we want to eliminate it in the long term.
  • ACP metrics show we are using less memory than before (and the CPU is doing almost nothing).
[Screenshot: ACP memory and CPU metrics, 2024-04-05]

As such, I'm going to close this issue (even if a follow-up effort will be spent on tuning the proxy buffering).

dduportal added a commit to dduportal/helm-charts that referenced this issue Apr 5, 2024
… gzip with upstream

This change is motivated by jenkins-infra/helpdesk#3969 (comment).
The goal is to decrease or remove the warning messages:

```
an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream
```

It includes the following 2 elements:

- Increasing the proxy buffer size given the size of the files served by ACP. It maps to JFrog Artifactory recommended settings
(https://jfrog.com/help/r/artifactory-how-to-enable-tls-within-the-jfrog-platform/artifactory-nginx-configuration).
    - Note: This will increase memory usage, but the ACP metrics show that there is enough memory to handle this change.

- Enable gzip on requests to the upstream to decrease the size of the in-memory buffers required for responses
    - Note: This will increase CPU usage, but the ACP metrics show that the CPU is doing almost nothing.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
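A hedged illustration of those two elements in Nginx terms (a sketch: the directive values are illustrative, the upstream is a placeholder, and the actual change lives in the linked helm-charts PR):

```nginx
location / {
    proxy_pass https://upstream.example.org;  # hypothetical upstream

    # Larger in-memory buffers for the big artifacts ACP serves,
    # in the spirit of the JFrog Artifactory Nginx settings.
    proxy_buffer_size 128k;
    proxy_buffers 40 128k;
    proxy_busy_buffers_size 256k;

    # Ask the upstream for gzip-compressed responses so each response
    # needs less buffer space (trades CPU for memory).
    proxy_set_header Accept-Encoding gzip;
}
```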
@dduportal (Contributor)

For info: jenkins-infra/helm-charts#1119 has been opened to fine-tune the buffers.
