Suddenly getting very slow builds on inner container with Sysbox v0.4.1 #435
Comments
Hi @TomKeyte, thanks for your detailed description of the issue. Question: is the problem reproduced when a particular workflow (e.g., an image build) is executed within the gha-runner? If so, can you provide an example of this CI-pipeline job so that I can reproduce it?
Also, assuming that the instruction causing the delay is part of your CI pipeline (i.e., it's not a GHA-internal instruction, which is unlikely), could you please try to run it directly within the sysbox container itself? For example, if you suspect an image-build step, you can try running an equivalent action with the `docker` CLI directly.
Aside from my questions above, please provide this info too:
I think I should be able to reproduce the problem once I have all this info. Thanks!
Hi @rodnymolina, appreciate you getting back so quickly :) I have (temporarily) resolved our issue by downgrading from sysbox 0.4.1 to 0.4.0. I'm afraid I can't really show the particular action being run, as it could expose company internals, but in summary it's just:

```yaml
name: Run Tests
jobs:
  test:
    runs-on: [self-hosted, test-runner]
    steps:
      # SETUP
      - name: Checkout
        uses: actions/checkout@v2
      # ...Steps to configure the env
      - name: Build images
        run: >
          docker build ...
```

I can also confirm that if I exec into the inner container & manually run `docker build`, the slowness reproduces. Tailing the sysbox-fs journal, I can see a lot of messages like:
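(For reference, a minimal way to tail the sysbox-fs logs, assuming a standard systemd-based install where the service is named `sysbox-fs`:)

```bash
# Follow sysbox-fs log output live while a build runs inside the container.
sudo journalctl -u sysbox-fs.service -f

# Or inspect just the window around a slow build:
sudo journalctl -u sysbox-fs.service --since "15 min ago"
```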
@TomKeyte, thanks, it's good to know that the issue is not GHA-specific ... Unfortunately, the log message above is too generic and doesn't really help us root-cause the problem. That's why I was wondering if you could help us by trying to reproduce with a stripped-down Dockerfile (without company-specific instructions/tools) so that we can expedite root-cause analysis. Also, please be aware that we are currently making important optimizations in the code to address issues like this one, but we won't know for sure whether they fix your setup until we can reproduce it locally.
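A stripped-down repro along the lines being requested might look like the sketch below; the ~350M dummy context mirrors the size mentioned in the original report, and all paths and tags are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical minimal repro: time several sequential builds of a trivial
# image with a large (~350M) build context, run inside the Sysbox container.
set -euo pipefail

mkdir -p repro && cd repro
head -c 350M /dev/urandom > payload.bin   # dummy file to inflate the context

cat > Dockerfile <<'EOF'
FROM alpine:3.18
COPY payload.bin /payload.bin
EOF

for i in 1 2 3; do
  echo "=== build #$i ==="
  time docker build -q -t repro:"$i" .
done
```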
I am pretty sure the underlying cause for this is Sysbox's interception of the `*xattr` syscalls. In the upcoming v0.5.0 release, it will be possible to disable Sysbox's interception of the `*xattr` syscalls.
As you know we had problems with 0.4.1, so we skipped it and updated directly from 0.4.0 to 0.5.0, and now have an enormous increase in build times in our GitLab infrastructure. I tested with a gitlab-ci job that builds all of our team's docker images on the same server (bare metal). We run Debian 11 with the backports kernel 5.16.12-1 and userns-remap set to sysbox. The sysbox-fs log size increased heavily, with roughly 40,000 lines like the following today:
With sysbox 0.4.0 / docker-dind 20.10.7 we saw these lines about 5 times per day. It seems that building several docker images sequentially in the same (temporary) private dind can trigger this behavior: every new docker build is slower than the one before.
Hi @nudgegoonies, thanks for the update. For sysbox v0.5.0, did you try passing the `--allow-trusted-xattr=false` option to sysbox-fs? See this sysbox doc for more details.
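On a host where Sysbox was installed via the package (systemd units `sysbox`, `sysbox-fs`, `sysbox-mgr`), one way to pass such a sysbox-fs flag is a systemd drop-in; a sketch, with the binary path to be verified via `systemctl cat sysbox-fs` on your system:

```bash
# Create a drop-in that appends the flag to sysbox-fs's command line.
sudo mkdir -p /etc/systemd/system/sysbox-fs.service.d
sudo tee /etc/systemd/system/sysbox-fs.service.d/override.conf >/dev/null <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/sysbox-fs --allow-trusted-xattr=false
EOF

# Reload units and restart the whole Sysbox service group.
sudo systemctl daemon-reload
sudo systemctl restart sysbox
```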
Apart from Sysbox's interception of the `*xattr` syscalls (described in this prior comment), we've also noticed another factor that contributes to inner Docker builds running slower inside a Sysbox container: the inner Docker engine does not use "native overlay diffs" when creating the build. This can be seen in the following docker engine log inside the Sysbox container:
The relevant source code is in the Moby repo, here. Basically, dockerd assumes that when it's running inside a user namespace (as in Sysbox containers), it can't use the native diff for overlay2, because it assumes it can't set the `trusted.overlay.opaque` xattr. The slowdown is significant: I noticed up to 6x on a quick experiment where the inner dockerd was modified to use native overlay2 diffs even when running inside a user namespace, versus an unmodified inner dockerd.
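One way to check whether an inner Docker engine is affected: `docker info` reports a "Native Overlay Diff" field under its storage-driver section, and dockerd logs a warning when it falls back. A quick check from inside the Sysbox container (the `journalctl` line assumes the inner Docker runs under systemd):

```bash
# Storage driver in use by the inner Docker engine (expect "overlay2"):
docker info --format '{{.Driver}}'

# Whether native overlay diffs are enabled:
docker info | grep -i 'native overlay diff'

# The fallback is also visible in the inner dockerd logs:
journalctl -u docker.service | grep -i 'native diff'
```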
@ctalledo Thank you very much! With that option it works! Now sysbox, docker and docker-dind are up to date and it is even faster than sysbox 0.4.0 was:
Hi @nudgegoonies, that's great to hear! Yes, Sysbox v0.5.0 has some performance optimizations that make it faster than v0.4.0; glad you are seeing those.
That's interesting; I would have assumed that passing env-vars to containers would be easy in all CI systems, given that it's a fairly common operation.
Closing this issue since we now have a solution for the slowdown in Sysbox v0.5.0. Please re-open if you see any such issues with v0.5.0.
Yes, in theory. There is a private dind configured via the gitlab runner running with every build, dinds configured as services in pipelines, shared dinds started by systemd, etc. Using the one place in sysbox-mgr is easier.
What's the preferred way to configure `--allow-trusted-xattr=false` when using sysbox-deploy-k8s?
I'm finding that docker builds are still about 2x slower with `--allow-trusted-xattr=false`.
Hi @mike-chen-samsung, yes, we need to add a ConfigMap or similar to allow users to easily configure Sysbox flags via the sysbox-deploy-k8s daemonset. We've not yet had cycles to do this unfortunately. Having said that, I think Sysbox should switch the default of `allow-trusted-xattr` to `false`.
It's been a while since I measured, but I don't recall seeing 2x slower. A couple of factors that could be contributing: Docker not using native overlay diffs, and the fact that the Docker Engine inside a Sysbox container goes through an extra network bridge. I suspect the former is the main issue though.
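A rough way to quantify how much of the slowdown remains, assuming the same build context is available both on the host and inside a Sysbox container named `sysbox-ci` (a hypothetical name):

```bash
# Baseline: build directly on the host's Docker engine.
time docker build -t bench:host .

# Same build via the inner Docker engine inside the Sysbox container.
docker exec sysbox-ci sh -c 'cd /repro && time docker build -t bench:inner .'
```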
I checked:
Can you show me the log from the docker daemon inside the Sysbox container, so I can double-check? It should be using its overlay2 driver in general, except for native overlay diffs.
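How to collect those logs depends on how the inner engine runs; two common cases, with all container and pod names hypothetical:

```bash
# Inner container boots systemd (typical Sysbox "VM-like" setup):
docker exec my-sysbox-ct journalctl -u docker.service --no-pager | tail -n 50

# dind-style sidecar on Kubernetes: dockerd logs go to the container's stdout.
kubectl logs my-runner-pod -c dind | tail -n 50
```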
Sorry, I should have mentioned that I am using GitHub ARC with Sysbox, so it's running a dind setup. I've pasted a cleaned-up Pod yaml at the bottom.
Pod yaml
Thanks; the dockerd logs inside the sysbox container don't look good; it should have used its overlayfs driver (as it would on a regular host or VM). Can you post the output of `docker info` from inside the container?
Thanks @mike-chen-samsung; the output confirms the problem: inside the container, docker is not using the overlay2 driver. I need to understand why the docker engine inside the container is failing with:

I noticed you are using docker engine v24.0.2; let me check if something changed in that version that is causing a problem when running inside a Sysbox container.
I tried docker v24.0.2 inside a Sysbox container (created with Docker) and it works fine:
Notice how docker engine reports it's using overlay2 as expected:
I also tried with something that is a lot closer to what you have in your pod yaml and it also works fine:
Hi @mike-chen-samsung, I was able to repro in a GKE pod, using a pod spec similar to the one you posted above. I suspect it's something in that pod spec. Just for a sanity check, I also ran a pod with this spec and it worked fine (i.e., docker used the overlay2 driver inside the pod).
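For illustration, a minimal sanity-check pod along those lines might look like the sketch below; the image and names are hypothetical, and it assumes the `sysbox-runc` RuntimeClass installed by sysbox-deploy-k8s, not the exact spec used in the test:

```bash
# Launch a minimal Sysbox pod whose entrypoint starts an inner dockerd and
# prints `docker info`, so the storage driver is visible in the pod logs.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysbox-dind-check
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: dind
    image: nestybox/alpine-docker:latest
    command: ["sh", "-c", "dockerd >/var/log/dockerd.log 2>&1 & sleep 5; docker info"]
EOF

kubectl logs sysbox-dind-check | grep -iA2 'storage driver'
```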
Hello @ctalledo, wondering if you have an update to share.
Hi @mike-chen-samsung, apologies for the belated response. I've not yet had a chance to look into this; I will allocate some cycles this coming week to get to the bottom of it. Thanks for your patience.
Our company runs our GitHub actions CI pipelines on a docker-in-docker setup using sysbox.
The setup is:
- The GitHub actions run inside a container running on the sysbox runtime
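For context, launching such a runner container with the Sysbox runtime typically looks roughly like this (image and name are placeholders):

```bash
docker run -d --runtime=sysbox-runc --name gha-runner my-org/actions-runner:latest
```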
All of a sudden on Monday, our docker builds, which usually take ~5 minutes, stopped completing even after ~2 hours.
If I exec into the action-runner container and run them manually, I can see that the `transferring context` step in particular is taking an eternity. Here's a portion of the output (the total build context is ~350M):
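To see where the time goes in that step, BuildKit's plain progress output timestamps each phase, and the context size can be checked up front; a quick diagnostic sketch:

```bash
# Rough size of the build context directory:
du -sh .

# Per-step timing, including "transferring context":
DOCKER_BUILDKIT=1 docker build --progress=plain . 2>&1 | tail -n 30
```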
So far I have:
However, the problem persists.
The server does have unattended upgrades enabled, and between last week (when things were fine) and this week (when things have gone wrong), I can see that the following packages were upgraded on the host:
The docker-compose file used to start the runner is:
Any help debugging/fixing this greatly appreciated :)