
Many parallel calls to docker exec make Docker unresponsive #45595

Closed
GeorgDangl opened this issue May 22, 2023 · 17 comments · Fixed by #45624
Labels
kind/bug status/0-triage version/24.0

Comments

@GeorgDangl

Description

Since today, we've experienced a problem where our CI system could no longer run our tests properly.

After a bit of investigation, we found how to resolve it, but not the root cause. This happens on two independent servers, one on Ubuntu 22.04, the other on 18.04. Both are running Docker 24.0.1.

Essentially, we're launching an image of SQL Server using a tmpfs volume for the data files. Before the tests, we're creating a database with seed data. Then, for every test run, we run docker exec to copy the data (and log) files, and attach them as a new database. This is running in parallel without any concurrency limit, and has worked before.

Now we found that when doing this, all Docker commands become unresponsive after a short time. E.g. docker stats no longer refreshes correctly and shows only dashes (--) instead of the actual information.

Other Docker commands, e.g. docker stop, take extremely long (if they finish at all). The tests usually run in 10 minutes; now they take several hours, with most tests failing due to a timeout exception when calling Docker commands from within the test suite.

Reproduce

  1. First I'm starting SQL Server
docker run -e ACCEPT_EULA=Y -e "SA_PASSWORD=yourStrong(#)Password" \
--mount type=tmpfs,destination=/var/opt/mssql/data \
--mount type=tmpfs,destination=/var/opt/mssql/log \
--mount type=tmpfs,destination=/var/opt/mssql/secrets \
dangl/mssql-tmpfs:latest
  2. Then I'm creating a quick script to just do some cp operations via docker exec in parallel:
touch concurrent_exec.sh
for i in {1..50}
do
  echo "docker exec <CONTAINER_ID> sh -c \"cp /var/opt/mssql/data/master.mdf /var/opt/mssql/data/$i.mdf\" &" >> concurrent_exec.sh
done
echo "docker exec <CONTAINER_ID> sh -c \"cp /var/opt/mssql/data/master.mdf /var/opt/mssql/data/last.mdf\"" >> concurrent_exec.sh
  3. Now I execute sh concurrent_exec.sh and watch the output of docker stats in another session. docker stats will freeze for a time (but eventually recover). It takes longer the more parallel operations I do.

The actual behavior depends on how many parallel calls I'm making. E.g. with 5, there's no (noticeable) interruption, with 50 it already blocks for around a minute.
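
For completeness, the same kind of parallel exec load can also be generated directly against the API instead of the CLI; below is a rough, untested Go sketch using the Docker Go SDK (the container ID is a placeholder):

package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "sync"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        log.Fatal(err)
    }
    ctx := context.Background()

    var wg sync.WaitGroup
    for i := 1; i <= 50; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            // One exec per goroutine, all created and started concurrently.
            resp, err := cli.ContainerExecCreate(ctx, "<CONTAINER_ID>", types.ExecConfig{
                Cmd:          []string{"sh", "-c", fmt.Sprintf("cp /var/opt/mssql/data/master.mdf /var/opt/mssql/data/%d.mdf", i)},
                AttachStdout: true,
                AttachStderr: true,
            })
            if err != nil {
                log.Println(err)
                return
            }
            // Attaching starts the exec; draining the stream waits for it to finish,
            // similar to a foreground "docker exec".
            attach, err := cli.ContainerExecAttach(ctx, resp.ID, types.ExecStartCheck{})
            if err != nil {
                log.Println(err)
                return
            }
            defer attach.Close()
            _, _ = io.Copy(io.Discard, attach.Reader)
        }(i)
    }
    wg.Wait()
}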

Expected behavior

No response

docker version

Client: Docker Engine - Community
 Version:           24.0.1
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        6802122
 Built:             Fri May 19 18:06:21 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.1
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       463850e
  Built:            Fri May 19 18:06:21 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    24.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.18.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 24.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-72-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.7GiB
 Name: Ubuntu-2204-jammy-amd64-base
 ID: 74911907-4761-4663-906c-a79557c4089a
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

I've tried it out on a fresh server, had the same behavior there. I'd be happy to assist with any further details, but I would need some guidance on what is relevant and how to get that.

@GeorgDangl added kind/bug, status/0-triage labels May 22, 2023
@thaJeztah
Member

You mention "since today"; was this working ok before on the same version of Docker Engine, or were you running a different version before today?

@GeorgDangl
Author

I did update Docker Engine today; it worked on previous versions. But I'm not sure what we updated, I just ran 'apt-get upgrade'. I saw that 24.0.1 was released on May 19th, so I assumed it might have been that.

We have done such integration tests for two years, and didn’t have this problem before.

@thaJeztah
Member

Thanks! Yes, mostly curious between which versions the change happened on your side. Based on the timeline, I expect that to be the 23.0 -> 24.0 upgrade; if you have a system to test on, you could try downgrading to the latest 23.0.x release (https://docs.docker.com/engine/install/ubuntu/#install-docker-engine - see the "specific version" tab in that section).

I know there's been some refactoring of the exec code as part of docker 24.0.x; off the top of my head, these two come to mind (but no idea if they're related);

/cc @corhere if you have immediate ideas.

@GeorgDangl
Author

Ok, I'll try it out with 23.0 and check if there's any difference. I'm pretty busy right now, but I'll be able to check it later today.

For the downgrade, can I just uninstall Docker, and then install 23.0?

Also, FYI, I just checked it locally on my Windows machine, where everything works fine, but that's on 20.10.21, so quite a bit older.

@thaJeztah
Member

You should probably be able to use the steps outlined in that section and use the --allow-downgrades option. The latest 23.0.x release is 23.0.6 (but the apt-cache madison docker-ce | awk '{ print $3 }' command should show a list of the available versions).

I should mention that it's always best practice to verify upgrades/downgrades in a testing environment before rolling out to production, but from your description it looks like this is a testing environment (so "probably" no critical content stored in containers/volumes/etc) 😅

My initial thinking is that possibly the exec processes don't terminate after the process completes, but I haven't been able to try to reproduce your steps yet.

Also wondering if it would make a difference if the docker exec tasks were started with the -d ("detach") option instead of running with &, but I haven't tried whether that works.
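
For reference, in API terms the -d option roughly maps to creating and starting the exec detached; a rough Go SDK sketch (untested, the function name is just illustrative):

import (
    "context"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

// startDetachedExec starts an exec without attaching to it, so the call returns
// as soon as the process has been started rather than when it finishes.
func startDetachedExec(ctx context.Context, cli *client.Client, containerID string, cmd []string) error {
    resp, err := cli.ContainerExecCreate(ctx, containerID, types.ExecConfig{
        Cmd:    cmd,
        Detach: true,
    })
    if err != nil {
        return err
    }
    return cli.ContainerExecStart(ctx, resp.ID, types.ExecStartCheck{Detach: true})
}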

@GeorgDangl
Author

Yeah, that's not on critical systems. It's just two dedicated servers that act as agents for the CI/CD system. We're not blocked since we can still run everything on hosted runners or locally.

About the -d detach option: We're actually running the exec commands differently. We have a .NET test suite for the integration tests. Every single test has an initialization and dispose hook, where we separately call docker exec to copy the database files (and later delete them). So it's not one huge "batch" of exec commands, but since the tests are running in parallel they might be called concurrently.
If I understand it correctly, running with -d would not wait for the command to finish, right? That wouldn't work for us, because a next step is an SQL command to attach the new database file to SQL Server.

@thaJeztah
Member

If I understand it correctly, running with -d would not wait for the command to finish, right?

Ah, yes, it won't wait; or at least, the docker exec would run detached, so it would complete in the background.

Right, so your case is "roughly" multiple of these in parallel:

docker exec <do copy>

# wait for copy to complete
docker exec <do sql command>

I guess for your example (as /var/opt/mssql/data/ is a bind-mount from the host), the actual copy could be done on the host itself instead of using a docker exec for that, but that's a bit beside the point here (for this ticket); we should look at what changed between 23.0 and 24.0 that causes the change in behavior.

I asked a colleague on my team to try and reproduce the issue, and to do a git bisect if possible (at least to see if we can narrow down the cause).

@GeorgDangl
Author

GeorgDangl commented May 23, 2023

Almost😀

The only time we're calling exec is to internally copy the SQL database files, the actual SQL is done via a direct TCP connection to the container.

And, we're running on tmpfs, so there's no actual host directory mapping present. We're using tmpfs because it speeds up the tests by a lot, and it's all just ephemeral data for the current test run.

Roughly, the workflow is like this:

  1. Start container, with tmpfs mounts for the data files
  2. Connect via TCP to SQL Server, create a database and initialize it with seed data
  3. Again use SQL via TCP to detach the database
  4. For every test, run docker exec to copy the database files inside the tmpfs mount, then send SQL via TCP to attach the copied database files as a new database
  5. Run tests (they internally also do SQL calls via TCP)
  6. Teardown, first SQL to detach the database, then docker exec to delete the database files (to reduce memory usage in the tmpfs mount)

Points 4. and 6. run per test, and the tests run in parallel. We can work around the issue by preventing parallel calls to docker exec, everything else can still stay parallelized.

And, to make it still a bit more complicated, we're not actually calling docker exec via the CLI but we're using Docker.DotNet to handle everything in the .NET integration tests, which looks a bit like this:

public static async Task CopySeedDatabaseFilesForNewDatabase(string newDatabaseName,
    string dockerContainerId)
{
    var dockerClient = DockerContainerUtilities.GetDockerClient();

    try
    {
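        // Workaround: _semaphore (configured with a concurrency of 1) serializes the docker exec calls, see the note below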
        await _semaphore.WaitAsync();
        var dataFile = await dockerClient
            .Exec
            .ExecCreateContainerAsync(dockerContainerId,
            new ContainerExecCreateParameters
            {
                Cmd = new List<string>
                {
                    "cp",
                    $"/var/opt/mssql/data/{SqlServerDockerCollectionFixture.INITIAL_SEED_DATABASE_NAME}.mdf",
                    $"/var/opt/mssql/data/{newDatabaseName}.mdf"
                }
            });

        await dockerClient.Exec.StartContainerExecAsync(dataFile.ID);

        var logFile = await dockerClient
                .Exec
                .ExecCreateContainerAsync(dockerContainerId,
                new ContainerExecCreateParameters
                {
                    Cmd = new List<string>
                    {
                        "cp",
                        $"/var/opt/mssql/data/{SqlServerDockerCollectionFixture.INITIAL_SEED_DATABASE_NAME}_log.ldf",
                        $"/var/opt/mssql/data/{newDatabaseName}_log.ldf"
                    }
                });

        await dockerClient.Exec.StartContainerExecAsync(logFile.ID);
    }
    finally
    {
        _semaphore.Release();
    }
}

Above, the _semaphore is like an async lock, here configured with a concurrency of 1. That's already the workaround. We didn't have this before, and that resulted in the unresponsiveness.

The dockerClient variable is returning a cached instance, and if on Linux, it connects to unix:///var/run/docker.sock.

But, I could also reproduce this on two servers with the repro steps from the issue, where I just called docker exec via the CLI multiple times in parallel on a SQL Server image inside a tmpfs directory.

@laurazard
Member

laurazard commented May 23, 2023

Right, I've been looking into this (thanks for the ping @thaJeztah), and I haven't been able to start bisecting yet but the gist so far is –

There's no specific tmpfs or other issue here: I can replicate it with just

for i in {1..50}
do
    docker exec test-container sh -c "touch $i.txt" &
done

And then I see the symptoms described by @GeorgDangl.

If I check:

$ ps aux | grep exec

...
laurabr+   23843  0.0  0.7 1105448 22648 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 22.txt
laurabr+   23844  0.0  0.7 1105448 23220 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 23.txt
laurabr+   23845  0.0  0.7 1179240 22748 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 24.txt
laurabr+   23846  0.0  0.7 1179496 23596 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 25.txt
laurabr+   23847  0.0  0.7 1179496 23624 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 26.txt
laurabr+   23848  0.0  0.7 1105192 23380 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 27.txt
laurabr+   23849  0.0  0.7 1253288 23416 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 28.txt
laurabr+   23850  0.0  0.7 1179496 22460 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 29.txt
laurabr+   23851  0.0  0.7 1179240 22136 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 30.txt
laurabr+   23852  0.0  0.7 1179496 22400 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 31.txt
laurabr+   23853  0.0  0.7 1253288 22808 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 32.txt
laurabr+   23854  0.0  0.7 1179240 23452 pts/3   Sl   13:12   0:00 docker exec test-container sh -c touch 33.txt
...

We have a lot of hanging execs, so maybe this is just a case of us keeping all these clients/connections open and that causing the issue. I'll need to confirm with the bisect whether this behaviour is new – although I'm curious, since you also mention:

The dockerClient variable is returning a cached instance, and if on Linux, it connects to unix:///var/run/docker.sock.

and still see the issue, so maybe having 50+ clients connected isn't the issue (of course, even with 1 client we could be messing up and leaving a ton of open connections).

@corhere
Contributor

corhere commented May 23, 2023

Using the following reproducer, the behaviour I observed is that when many execs are started concurrently, all the execs take an unreasonable amount of time to run on v24.0.1 but not v23.0.6.

$ docker run --rm -it --privileged docker:24.0.1-dind sh
# unset DOCKER_HOST
# dockerd -D &
# docker run --rm -d --name pause1 alpine sleep infinity
# time sh -c 'myjobs=""; for i in $(seq 1 50); do docker exec pause1 true & myjobs="$myjobs $!"; done; for pid in $myjobs; do wait $pid; done'

The inflection point for execs to start taking an unreasonable amount of time appears to be somewhere around 16-20 concurrent execs on my machine, though not consistently. At 20 concurrent execs I am seeing times between 1s and 7s, and at 50 execs it takes between 45 seconds and a minute.

The debug logs are very interesting, as well. The first few execs appear to complete their lifecycles almost instantaneously. Then at some point there is a flood of TaskExecStarted events, which then slows to a trickle: the gap between events is a precise multiple of 1s: most commonly 2s, but sometimes 1s or 3s. After the last TaskExecStarted, all the TaskExit events flood in instantaneously. The precision of the delays between the TaskExecStarted events is very suspicious. Whether the root cause lies within the dockerd changes or is a regression in containerd 1.7 remains to be determined.

@laurazard
Member

laurazard commented May 23, 2023

Tiny update, bisect is done – looks like this started happening with b752462. Will need to take a bit of a closer look to see why the change is breaking this and how we can mitigate it. I've got an idea, will try to have a PR in soon.

Oh hey @corhere 👋 this might interest you :)

@thaJeztah
Member

Thanks! Ah, yes, in hindsight we probably should've started the Slack thread in the maintainers channel, not our internal one ☺️ (my bad! posted it in our team's channel)

@corhere
Contributor

corhere commented May 23, 2023

On a hunch, I instrumented

if err = p.Start(ctx); err != nil {
    // use new context for cleanup because old one may be cancelled by user, but leave a timeout to make sure
    // we are not waiting forever if containerd is unresponsive or to work around fifo cancelling issues in
    // older containerd-shim
    ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()
    p.Delete(ctx)
    return nil, wrapError(err)
}
to log how long it takes for the RPC to return. On my first try with 30 concurrent exec starts, all the logs came back with seventeen seconds. With 50 concurrent, 59s! I think that's pretty conclusive: if more than some threshold of TaskService().Start() requests are in flight concurrently for the same task, containerd takes an excessive amount of time to handle them all. That is also consistent with b752462 being the first commit with which the issue could be reproduced, as that is the commit which removed the locking that had effectively serialized all exec requests for a particular container.
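
To illustrate what that serialization amounts to, here is a hypothetical sketch (not the actual moby code) of a per-container lock held around the Start() RPC, so exec starts for the same container queue in the daemon instead of piling up in containerd:

import (
    "context"
    "sync"
)

// execSerializer would be held per container.
type execSerializer struct {
    mu sync.Mutex
}

// startExec runs the given start function (e.g. the containerd TaskService().Start()
// call) with the per-container lock held, so at most one exec start per container
// is in flight at any time.
func (s *execSerializer) startExec(ctx context.Context, start func(context.Context) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    return start(ctx)
}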

@thaJeztah
Member

Yes, we came to the "previously they were (either intentionally or unintentionally) serialized" conclusion as well. But that was just before I had to head out, so I haven't checked my Slack to see if @laurazard found more details (all credit to her for doing the bisecting / digging).

@GeorgDangl
Author

That's good news, so I wasn't imagining it😀 I hope it's fine if I don't try it out on older versions, since it seems like you've already been able to find when the behavior changed.

I'd be happy to help if there's anything else I can do.

@corhere changed the title from "Many parallel calls to docker exec to copy a file inside the container on a tmpfs volume make Docker unresponsive" to "Many parallel calls to docker exec make Docker unresponsive" May 23, 2023
@thaJeztah
Member

Discussed this in the maintainers call; while the containerd bug still has to be looked into (no fix / root cause / ETA yet), we can add a temporary (famous last words), but this time deliberate, serialisation back.

@GeorgDangl
Copy link
Author

Thank you guys for the quick resolution😀
