
Artifact Caching Proxy (ACP) does not cache artifacts from Maven Central #3969

Closed · basil opened this issue Mar 1, 2024 · 25 comments

@basil (Collaborator) commented Mar 1, 2024

Service(s)

Artifact-caching-proxy

Summary

The mirrorOf directive in our Artifact Caching Proxy (ACP) configuration in https://github.com/jenkins-infra/jenkins-infra/blob/ac501efc3cee3b613761e3026dd9798ddf827ec4/dist/profile/templates/jenkinscontroller/casc/artifact-caching-proxy.yaml.erb#L14 contains !central, which means that Central artifacts are downloaded directly from https://repo.maven.apache.org and not via ACP. This introduces a degree of fragility into our builds. While https://repo.maven.apache.org has not gone down completely, its performance fluctuates widely—sometimes it is quite fast, while other times it exhibits latency high enough to triple or quadruple build times, in some cases long enough for builds to time out and fail outright.
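For illustration, a minimal sketch of the kind of mirror stanza involved (paraphrased; the id, name, and URL below are placeholders, not the production values from the linked template):

```xml
<settings>
  <mirrors>
    <mirror>
      <id>artifact-caching-proxy</id>
      <name>Artifact Caching Proxy</name>
      <!-- Mirror every repository *except* central: Central artifacts
           bypass ACP and are fetched directly from repo.maven.apache.org. -->
      <mirrorOf>*,!central</mirrorOf>
      <url>https://acp.example.jenkins.io/public/</url> <!-- placeholder URL -->
    </mirror>
  </mirrors>
</settings>
```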

For local builds that use a persistent ~/.m2/repository without ACP, the performance of Central is not very important. Building core or a plugin for the first time on a clean system will indeed result in a lot of Central artifacts being downloaded from https://repo.maven.apache.org, which could take a long time if you get unlucky and latency happens to be high, but at least the result will be cached for subsequent builds. Over time, the cache will fill up with commonly-used artifacts and the only Central artifacts downloaded from https://repo.maven.apache.org will be a handful of new releases.

In contrast, CI builds do not have a persistent ~/.m2/repository, so every CI run (which includes multiple builds on different JDK versions and platforms) will download all Central artifacts from https://repo.maven.apache.org each time, exposing us to build fragility caused by performance problems in infrastructure that is out of our control. The design of CI builds not having a persistent ~/.m2/repository is predicated on high-performance artifact downloading, which is in conflict with the use of an uncached third-party server in each build. To resolve this conflict, Central artifacts for CI builds should be cached in ACP, where they can be reliably downloaded with low latency in CI builds without a persistent ~/.m2/repository.

Reproduction steps

Run a plugin, core, or BOM build 6 times a day for 6 days and observe the wide variance in build times, driven by the current performance of a third-party service that is out of our control.

@dduportal (Contributor)

Started working on this topic with @smerle33

The main "setup" challenge, if we remove the !central from the settings.xml setup, we'll also need to setup ACP to use the Maven Central as a source of artifacts (it does not today as we used to have repo.jenkins-ci.org doing this through the virtual repository including the mirror of Maven central which we removed since then).

We are working on this ACP setup and we'll update the issue accordingly

@dduportal (Contributor)

Yes, I've found a setup which works, but it will require changes to the settings.xml (Central serves artifacts under the relative path /maven2, which differs from the /public path of our setup).

Additionally, I've found that disabling IPv6 resolution decreases the number of system calls made by Nginx when resolving DNS names. I believe this could help in production on any of our dual-stack networks.

More work is needed to formally prove the improvement, but it is heading in a good direction.
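To make both points concrete, here is a minimal Nginx sketch (the listen port, resolver address, and upstream mapping are illustrative assumptions, not the production ACP configuration):

```nginx
server {
    listen 8080;

    # Skip AAAA lookups when resolving upstream DNS names,
    # reducing resolver system calls on dual-stack networks.
    resolver 8.8.8.8 ipv6=off;

    # Hypothetical mapping of the proxy's /public prefix onto
    # Maven Central's /maven2 layout.
    location /public/ {
        rewrite ^/public/(.*)$ /maven2/$1 break;
        proxy_pass https://repo.maven.apache.org;
        proxy_ssl_server_name on;  # send SNI when proxying over TLS
    }
}
```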

@dduportal (Contributor)


Another interesting performance point: with the current setup, the settings.xml specifies 2 mirrors. Maven appears to try the first mirror and, if the artifact is not found there, fall back to the second, and so on.
The performance penalty of this "try one after the other" approach seems higher than tuning the ACP Nginx to do the same: Nginx is presumably more efficient, and it has the benefit of caching the result once it has resolved the fallback.

I initially wanted to introduce the "central caching" as a new feature to avoid breaking existing setups during the deployments, but as I'm seeing a 2x to 10x performance boost in dependency resolution on my local ACP test, I believe we should try delegating the fallback mechanism fully to Nginx; see the sketch below.
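A common Nginx pattern for this kind of server-side fallback, sketched here under assumed names (the real ACP Helm chart may be structured differently), intercepts a 404 from the first upstream and retries a second one:

```nginx
server {
    listen 8080;

    location / {
        proxy_pass https://repo.jenkins-ci.org/public/;
        proxy_intercept_errors on;   # let error_page handle upstream errors
        error_page 404 = @central;   # on a miss, fall back to Maven Central
    }

    location @central {
        rewrite ^/(.*)$ /maven2/$1 break;  # Central's artifact root is /maven2
        proxy_pass https://repo.maven.apache.org;
        proxy_ssl_server_name on;
    }
}
```

With proxy_cache layered on top, the cost of the fallback round trip is paid once per artifact rather than on every build.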

@dduportal (Contributor) commented Mar 27, 2024

First step: Update of the ACP helm chart where the Nginx configuration is located - jenkins-infra/helm-charts#1110

The local tests are really promising for performance: I get a 5x to 10x faster rm -rf ~/.m2/repository && mvn clean -U with https://github.com/jenkinsci/jenkins-infra-test-plugin. Still need to see if it extends to other builds.

@basil (Collaborator, Author) commented Mar 27, 2024

I believe we should try delegating the fallback mechanism fully to Nginx.

The performance improvement is compelling, but the (potential) downside is that now these builds are using a different implementation of the fallback mechanism than local builds. As long as the two stay in sync, I think this should be fine. In other words, I believe the performance benefit is worth the (small) risk of diverging from local builds, simply mentioning it for completeness.

@dduportal (Contributor)

Thanks for the feedback and the confirmation. I believe we'll have to carefully check the Maven plugin and the incremental mechanism.

Proceeding with the deployment of this new setup; we'll check the results over the next 24h.

dduportal added a commit to jenkins-infra/pipeline-library that referenced this issue Mar 28, 2024
This PR is not expected to be merged or reviewed and will stay as a draft as a matter of caution.

It is aimed for testing jenkins-infra/helpdesk#3969
dduportal added a commit to jenkinsci/jenkins-infra-test-plugin that referenced this issue Mar 28, 2024
…#847

This PR is not expected to be merged or reviewed and will stay as a draft as a matter of caution.

It is aimed for testing jenkins-infra/helpdesk#3969 and relies on jenkins-infra/pipeline-library#847
@dduportal (Contributor)

Deployed, messages added (status, ci.jenkins.io homepage, IRC). Email to developers incoming.

First tests show no errors on builds of jenkinsci/jenkins-infra-test. I've triggered a BOM build to "load" the cache (note the cache is distributed, so we need a few builds before hitting a fully cached state).

I've opened jenkins-infra/pipeline-library#847 to be used to print the Maven transfer information if needed to debug (example: jenkinsci/jenkins-infra-test-plugin#120).

@dduportal (Contributor)

First results on the branch master of https://github.com/jenkinsci/jenkins-infra-test-plugin on ci.jenkins.io:

  • Build 222 (ACP not caching Maven Central):
    • Linux (21): [INFO] Total time: 12:27 min (do)
    • Windows (17): [INFO] Total time: 02:35 min (azure)
  • Build 223 (ACP not caching Maven Central):
    • Linux (21): [INFO] Total time: 02:43 min (aws)
    • Windows (17): [INFO] Total time: 02:59 min (azure)
  • Build 224 (new ACP):
    • Linux (21): [INFO] Total time: 12:34 min (do)
    • Windows (17): [INFO] Total time: 03:21 min (azure)
  • Build 225 (new ACP):
    • Linux (21): [INFO] Total time: 01:19 min (do)
    • Windows (17): [INFO] Total time: 02:25 min (azure)
  • Build 226 (new ACP):
    • Linux (21): [INFO] Total time: 01:41 min (aws)
    • Windows (17): [INFO] Total time: 01:04 min (azure)
  • Build 227 (new ACP):
    • Linux (21): [INFO] Total time: 01:23 min (do)
    • Windows (17): [INFO] Total time: 01:12 min (azure)

=> At first glance, caching Maven Central seems beneficial:

  • On DigitalOcean (do), without any doubt: from 12 min to ~2 min! I believe the us-east-1 (where the Maven Central server we specify is located) <-> Frankfurt, Germany (DigitalOcean) round trip could be a cause, and caching Central clearly changes the behavior here.
  • On AWS and Azure (both in us-east-1) we also see an improvement, though less impressive: still almost 1 min saved on a 3 min build!

=> I've triggered multiple builds of BOM and Core to warm up the caches everywhere (the BOM prep stage runs randomly on DigitalOcean or AWS; Core is spread across the 3 cloud providers).

@dduportal (Contributor)

Update: watching the logs of DigitalOcean's ACP instances and filtering out the HTTP/200 entries, I got 2 kinds of messages:

  • Nginx buffering the proxy response body to temp files (always a Jenkins WAR: a big artifact, most probably bigger than the configured in-memory buffer). Not sure of the performance impact though (to be watched):
2024/03/28 08:52:12 [warn] 20#20: *8535 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/7/93/0000001937 while reading upstream, client: 10.244.1.227, server: _, request: "GET /org/jenkins-ci/main/jenkins-war/2.401.3/jenkins-war-2.401.3.war HTTP/1.1", upstream: "https://3.5.3.112:443/aol-jenkinsci/filestore/76/7661876c65beb83030ebc30ff67ca22dc780f53b?X-Artifactory-username=anonymous&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=releases&X-Artifactory-packageType=maven&X-Artifactory-artifactPath=org%2Fjenkins-ci%2Fmain%2Fjenkins-war%2F2.401.3%2Fjenkins-war-2.401.3.war&X-Artifactory-projectKey=default&x-jf-traceId=bcd35093bb886267&response-content-disposition=attachment%3Bfilename%3D%22jenkins-war-2.401.3.war%22&response-content-type=application%2Fjava-archive&X-Amz-Security-Token=<redacted>
  • A few HTTP/404 with a macOS user agent. I believe these are from @smerle33's local tests (confirmation in progress):
10.244.1.227 - - [28/Mar/2024:08:38:34 +0000] "GET /public/io/jenkins/tools/incrementals/git-changelist-maven-extension/1.6/git-changelist-maven-extension-1.6.pom HTTP/1.1" 404 192 1.145 "-" "Apache-Maven/3.9.6 (Java 17.0.7; Mac OS X 14.2.1)" "10.114.0.5" "34.225.228.28:443 : 44.209.124.32:443 : 146.75.120.209:443" "404 : 404 : 404" "0.638 : 0.374 : 0.128"

@dduportal (Contributor)

* A few HTTP/404 with a macOS user agent. I believe these are from @smerle33's local tests (confirmation in progress):

Confirmed: @smerle33 is in the ingress access logs and maps to all the 404 requests ✅

@dduportal (Contributor)

Update:

  • Disk usage for the cache increased a bit, but we clearly have space: from ~5 to 6.5 GB on DigitalOcean (for both instances), on a 100 GB volume (note: we have monitoring of the persistent volumes that alerts us when 80% or more is used).

@dduportal (Contributor)

The prep stage of the BOM build on its master branch has been improving since the new ACP configuration was applied:

  • Build 2640 (on Digital Ocean)
08:56:02  [INFO] --------------------------------[ hpi ]---------------------------------
09:19:16  [INFO] 
...
Total time:  32:26 min # First mvn command
  • Build 2641 (on Digital Ocean)
10:12:45  [INFO] --------------------------------[ hpi ]---------------------------------
10:18:17  [INFO] 
10:18:17  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---
...
Total time:  10:40 min  # First mvn command
  • Build 2642 (on Digital Ocean)
10:39:22  [INFO] --------------------------------[ hpi ]---------------------------------
10:40:33  [INFO] 
10:40:33  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---
...
[INFO] Total time:  04:46 min  # First mvn command

I'll check the results in the same way on the Core and git-plugin builds later today.

@basil (Collaborator, Author) commented Mar 28, 2024

I watched some plugin builds today, and performance was back to the level I remember it being a few months ago.

@MarkEWaite

All the plugin builds that I watched today were performing very well. Builds whose compile stage previously took more than 5 minutes were consistently at 2 minutes or less.

@MarkEWaite

I detected one surprise from the artifact caching proxy on Digital Ocean. https://ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1/ reports on the Java 11 build that an artifact unexpectedly had less content than was expected.

I've marked the build as "keep forever" in case you want to investigate further. https://ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1/pipeline-console/?selected-node=96 is the specific task that shows the message.

@smerle33 (Contributor)

I detected one surprise from the artifact caching proxy on Digital Ocean. ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1 reports on the Java 11 build that an artifact unexpectedly had less content than was expected.

I've marked the build as "keep forever" in case you want to investigate further. ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/1/pipeline-console/?selected-node=96 is the specific task that shows the message.

ACP received this "connection closed" from JFrog:

21#21: *37961 upstream prematurely closed connection while reading upstream
GET /io/jenkins/plugins/echarts-api/5.4.3-1/echarts-api-5.4.3-1.hpi

As the second build went through (https://ci.jenkins.io/job/Core/job/jenkins/job/PR-9099/2/),
I am removing the "keep forever" flag.

@MarkEWaite

Thanks so much for investigating @smerle33. I'm closing this as "not planned".

@MarkEWaite closed this as not planned Mar 29, 2024
@MarkEWaite reopened this Mar 29, 2024
@MarkEWaite

And reopening, because I failed to notice that this issue covered the entire topic, not just my report of a transient failure. Apologies for the noise of closing and reopening.

@dduportal (Contributor)

I believe we can close this particular issue as "Done" under the following assumptions:

=> Without any objection, I'll close this in the next few hours.

@basil (Collaborator, Author) commented Mar 29, 2024

I agree that this issue can be closed. Is this another transient issue?

jenkinsci/active-choices-plugin#236 (comment)

@dduportal (Contributor)

I agree that this issue can be closed. Is this another transient issue?

jenkinsci/active-choices-plugin#236 (comment)

Oh, it says status code: 500, reason phrase: Internal Server Error (500), which is a bit more serious than a 404. Let's analyze the logs (infra team) to determine whether it is serious, reproducible, or transient.

@dduportal (Contributor) commented Mar 29, 2024

I agree that this issue can be closed. Is this another transient issue?

jenkinsci/active-choices-plugin#236 (comment)

Looks like a transient issue at first sight:

  • The artifact org/jenkins-ci/plugins/durable-task/550.v0930093c4b_a_6/durable-task-550.v0930093c4b_a_6.hpi is now present in the cache of the 6 ACP instances (including the 2 DigitalOcean replicas) with the same fingerprint.
  • Confirmed in the access logs (in Datadog) that this artifact has been downloaded successfully on DigitalOcean at least twice per replica (HTTP/200 served from the cached file).

A few notes:

There are only 2 lines of error logs for the whole ACP service (6 replicas across the 3 clouds) in the past 24h, as per the Datadog access log collection:

=> Both happened in DigitalOcean, and both artifacts are associated with a lot of warning messages such as

an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream

which I saw yesterday on certain big artifacts such as Jenkins WARs.

Currently checking the metrics to see how Nginx's memory buffers are behaving: I believe setting proxy_max_temp_file_size (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_max_temp_file_size) to 0, which prevents the "write buffer to file" behavior, could help, since we end up caching the file from the buffer anyway. But the warnings might be a sign of memory limits, so I prefer to check metrics before acting.
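For reference, a minimal sketch of that tuning in an Nginx proxy block (the upstream is a placeholder, not the ACP chart's actual configuration):

```nginx
location / {
    proxy_pass https://upstream.example.org;  # hypothetical upstream

    # Never spill buffered upstream responses to disk: when the in-memory
    # buffers are full, the response is passed to the client synchronously
    # instead of being written to a temporary file.
    proxy_max_temp_file_size 0;
}
```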

@dduportal (Contributor)

Update:

  • No more ERROR messages of this kind: the only remaining error messages (10 in 6 days) are related to the upstream remote hitting network timeouts, i.e. unrelated to this setup (and already present at a similar rate before 1.0.0).
  • Over the past 2 days, for ~4.3M HTTP/200 entries in the ACP access logs, we had around 43K warning messages ("an upstream response is buffered to a temporary file"). That's a low ratio (~1%), even though we want to eliminate it in the long term.
  • ACP metrics show we are using less memory than before (and the CPU is doing almost nothing).
[Screenshot: ACP memory and CPU metrics, 2024-04-05]

As such, I'm going to close this issue (even if a follow-up effort will be spent on tuning the proxy buffering).

dduportal added a commit to dduportal/helm-charts that referenced this issue Apr 5, 2024
… gzip with upstream

This change is motivated by jenkins-infra/helpdesk#3969 (comment).
The goal is to decrease or remove the warning messages:

```
an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream
```

It includes the following 2 elements:

- Increasing the proxy buffer size given the size of the files served by ACP. It maps to JFrog Artifactory recommended settings
(https://jfrog.com/help/r/artifactory-how-to-enable-tls-within-the-jfrog-platform/artifactory-nginx-configuration).
    - Note: This will increase memory usage, but the ACP metrics show that there is enough memory to handle this change.

- Enable gzip on requests to the upstream to decrease the size of the in-memory buffers required for responses
    - Note: This will increase CPU usage, but the ACP metrics show that the CPU is doing almost nothing.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
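A hedged illustration of those two elements in Nginx terms (a sketch: the directive values are illustrative, the upstream is a placeholder, and the actual change lives in the linked helm-charts PR):

```nginx
location / {
    proxy_pass https://upstream.example.org;  # hypothetical upstream

    # Larger in-memory buffers for the big artifacts ACP serves,
    # in the spirit of the JFrog Artifactory Nginx settings.
    proxy_buffer_size 128k;
    proxy_buffers 40 128k;
    proxy_busy_buffers_size 256k;

    # Ask the upstream for gzip-compressed responses so each response
    # needs less buffer space (trades CPU for memory).
    proxy_set_header Accept-Encoding gzip;
}
```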
@dduportal (Contributor)

For info: jenkins-infra/helm-charts#1119 has been opened to fine-tune the buffers.
