Fix TestCreateServiceSecretFileMode, TestCreateServiceConfigFileMode #42960
Conversation
@AkihiroSuda @cpuguy83 PTAL
integration/service/create_test.go
Outdated
```diff
-	})
-	assert.NilError(t, err)
-	assert.Check(t, is.Equal(len(tasks), 1))
+	tasks := swarm.GetRunningTasks(t, client, serviceID)
```
This utility should probably also be updated to check the actual state, but leaving that for a follow-up (similar to;
moby/integration/internal/swarm/states.go
Lines 59 to 70 in 7b9275c
```go
	var running int
	var taskError string
	for _, task := range tasks {
		switch task.Status.State {
		case swarmtypes.TaskStateRunning:
			running++
		case swarmtypes.TaskStateFailed:
			if task.Status.Err != "" {
				taskError = task.Status.Err
			}
		}
	}
```
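Roughly, such a follow-up could be a stricter task lookup. A minimal sketch only (the `runningTasks` helper below is hypothetical and not part of this PR; it assumes the usual `client`, `filters`, and `swarmtypes` packages used by the integration tests):

```go
package swarmhelpers // hypothetical helper package, sketch only

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	swarmtypes "github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

// runningTasks lists the service's tasks and keeps only those whose observed
// state is actually "running", instead of accepting any task of the service.
func runningTasks(ctx context.Context, apiClient client.APIClient, serviceID string) ([]swarmtypes.Task, error) {
	tasks, err := apiClient.TaskList(ctx, types.TaskListOptions{
		Filters: filters.NewArgs(
			filters.Arg("service", serviceID),
			filters.Arg("desired-state", "running"),
		),
	})
	if err != nil {
		return nil, err
	}
	var running []swarmtypes.Task
	for _, task := range tasks {
		if task.Status.State == swarmtypes.TaskStateRunning {
			running = append(running, task)
		}
	}
	return running, nil
}
```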
LOL! okay, okay, I'll fix
force-pushed from f674eaa to 9b41144
So this PR should fix the bug in the test (looking for "any" task), but doesn't solve flakiness in tasks failing to start (i.e., we end up without running tasks within XX seconds);
Let me try to get a bundle to see why the tasks fail to start
force-pushed from 4941809 to 634b490
I think I found the actual problem with this test (also see #37132 (comment)), and ... it's silly 🤦 Updated this PR, and the description.
integration/service/create_test.go
Outdated
```diff
@@ -267,7 +267,7 @@ func TestCreateServiceSecretFileMode(t *testing.T) {
 	serviceID := swarm.CreateService(t, d,
 		swarm.ServiceWithReplicas(instances),
 		swarm.ServiceWithName(serviceName),
-		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret || /bin/top"}),
+		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}),
```
FWIW, we don't really need these tasks to keep running (running once and printing the output would be sufficient), but keeping them running makes sure SwarmKit doesn't continue reconciling / spinning up new tasks.
Possibly we could simplify the test to get the results out in some different way, but I'm gonna leave that for a future exercise.
Not super familiar with this test code, but what image is it running? Some `sleep(1)` implementations (including the GNU coreutils version) accept `inf` as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.
> Not super familiar with this test code, but what image is it running?

The default for these tests is to use `busybox`.

The `Dockerfile` has a `frozen-images` stage, which downloads some images that are used in integration tests. Originally this was added to:

- be sure we always use the exact same image
- speed up the tests (no `docker pull` for each of the tests); this also helped with hiccups / outages of Docker Hub, if they happened

Nowadays it also comes in handy with rate limits 😅 (although we should set up a "pull only" token in CI)
> Some `sleep(1)` implementations (including the GNU coreutils version) accept `inf` as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.

Oh, TIL about `inf` being a valid option. Looks like busybox `sleep` supports that value.

I'll update the test; I suspect there will be various other tests that use <some arbitrary, randomly picked long timeout>; we can go through our tests and update those as well.
force-pushed from 634b490 to c3ba756
```diff
-	body, err := client.ContainerLogs(ctx, tasks[0].Status.ContainerStatus.ContainerID, types.ContainerLogsOptions{
+	body, err := client.ServiceLogs(ctx, serviceID, types.ContainerLogsOptions{
 		Tail: "1",
```
1 line should be enough 😅 (famous last words)
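For reference, a minimal standalone sketch of reading a service's last log line with `client.ServiceLogs` (this is a hypothetical example program, not the test code itself; it assumes a reachable daemon and takes the service ID as its first argument):

```go
package main

import (
	"context"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
	"github.com/docker/docker/pkg/stdcopy"
)

func main() {
	apiClient, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer apiClient.Close()

	// Read the last log line of the service as a whole, rather than resolving a
	// single task's container and calling ContainerLogs on it.
	body, err := apiClient.ServiceLogs(context.Background(), os.Args[1], types.ContainerLogsOptions{
		ShowStdout: true,
		Tail:       "1",
	})
	if err != nil {
		panic(err)
	}
	defer body.Close()

	// Service logs are multiplexed; stdcopy splits the stdout/stderr frames.
	if _, err := stdcopy.StdCopy(os.Stdout, os.Stderr, body); err != nil {
		panic(err)
	}
}
```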
@samuelkarp @AkihiroSuda this LGTY? This should bring CI into slightly better shape
LGTM with a minor suggestion
Fix TestCreateServiceSecretFileMode, TestCreateServiceConfigFileMode

Looks like this test was broken from the start, and fully relied on a race condition. (Test was added in 65ee7ff)

The problem is in the service's command: `ls -l /etc/config || /bin/top`, which will either:

- exit immediately if the secret is mounted correctly at `/etc/config` (which it should)
- keep running with `/bin/top` if the above failed

After the service is created, the test enters a race condition, checking for 1 task to be running (which it occasionally is), after which it proceeds and looks up the list of tasks of the service, to get the log output of `ls -l /etc/config`.

This is another race: first of all, the original filter for that task lookup did not filter by `running`, so it would pick "any" task of the service (either failed, running, or "completed" (successfully exited) tasks). In the meantime though, SwarmKit kept reconciling the service and creating new tasks, so even if the test was able to get the ID of the correct task, that task may already have exited and been removed (task-limit is 5 by default), so only if the test was "lucky" would it be able to get the logs; chances were that it would be "too late", and the task already gone.

The problem can be easily reproduced when running the steps manually:

```console
echo 'CONFIG' | docker config create myconfig -

docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config || /bin/top'
```

The above creates the service, but it keeps retrying, because each task exits immediately (followed by SwarmKit reconciling and starting a new task):

```console
mjntpfkkyuuc1dpay4h00c4oo
overall progress: 0 out of 1 tasks
1/1: ready     [======================================>            ]
verify: Detected task failure
^COperation continuing in background.
Use `docker service ps mjntpfkkyuuc1dpay4h00c4oo` to check progress.
```

And checking the tasks for the service reveals that tasks exit cleanly (no error), but _do exit_, so swarm just keeps reconciling and spinning up new tasks:

```console
docker service ps myservice --no-trunc

ID                          NAME             IMAGE                                                                                    NODE             DESIRED STATE   CURRENT STATE                     ERROR   PORTS
2wmcuv4vffnet8nybg3he4v9n   myservice.1      busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Ready           Ready less than a second ago
5p8b006uec125iq2892lxay64    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete less than a second ago
k8lpsvlak4b3nil0zfkexw61p    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 6 seconds ago
vsunl5pi7e2n9ol3p89kvj6pn    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 11 seconds ago
orxl8b6kt2l6dfznzzd4lij4s    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 17 seconds ago
```

This patch changes the service's command to `sleep`, so that a successful task (after successfully performing `ls -l /etc/config`) continues to be running until the service is deleted. With that change, the service should (usually) reconcile immediately, which removes the race condition, and should also make it faster :)

This patch changes the tests to use client.ServiceLogs() instead of using the service's tasklist to directly access container logs. This should also fix some failures that happened if some tasks failed to start before reconciling, in which case client.TaskList() (with the current filters) could return more tasks than anticipated (as it also contained the exited tasks):

```
=== RUN   TestCreateServiceSecretFileMode
    create_test.go:291: assertion failed: 2 (int) != 1 (int)
--- FAIL: TestCreateServiceSecretFileMode (7.88s)
=== RUN   TestCreateServiceConfigFileMode
    create_test.go:355: assertion failed: 2 (int) != 1 (int)
--- FAIL: TestCreateServiceConfigFileMode (7.87s)
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
force-pushed from c3ba756 to 13cff6d
Changed. I'll merge this one, once CI completes
CI passed, but failed in cleanup afterwards;
hopefully fixes #37132
Looks like this test was broken from the start, and fully relied on a race
condition. (Test was added in 65ee7ff / #36130)
The problem is in the service's command: `ls -l /etc/config || /bin/top`, which will either:

- exit immediately if the secret is mounted correctly at `/etc/config` (which it should)
- keep running with `/bin/top` if the above failed

After the service is created, the test enters a race condition, checking for 1 task to be running (which it occasionally is), after which it proceeds and looks up the list of tasks of the service, to get the log output of `ls -l /etc/config`.

This is another race: first of all, the original filter for that task lookup did not filter by `running`, so it would pick "any" task of the service (either failed, running, or "completed" (successfully exited) tasks).

In the meantime though, SwarmKit kept reconciling the service and creating new tasks, so even if the test was able to get the ID of the correct task, that task may already have exited and been removed (task-limit is 5 by default), so only if the test was "lucky" would it be able to get the logs; chances were that it would be "too late", and the task already gone.
The problem can be easily reproduced when running the steps manually:
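```console
echo 'CONFIG' | docker config create myconfig -

docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config || /bin/top'
```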
The above creates the service, but it keeps retrying, because each task exits
immediately (followed by SwarmKit reconciling and starting a new task);
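```console
mjntpfkkyuuc1dpay4h00c4oo
overall progress: 0 out of 1 tasks
1/1: ready     [======================================>            ]
verify: Detected task failure
^COperation continuing in background.
Use `docker service ps mjntpfkkyuuc1dpay4h00c4oo` to check progress.
```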
And checking the tasks for the service reveals that tasks exit cleanly (no error),
but _do exit_, so swarm just keeps reconciling, and spinning up new tasks;
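```console
docker service ps myservice --no-trunc

ID                          NAME             IMAGE                                                                                    NODE             DESIRED STATE   CURRENT STATE                     ERROR   PORTS
2wmcuv4vffnet8nybg3he4v9n   myservice.1      busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Ready           Ready less than a second ago
5p8b006uec125iq2892lxay64    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete less than a second ago
k8lpsvlak4b3nil0zfkexw61p    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 6 seconds ago
vsunl5pi7e2n9ol3p89kvj6pn    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 11 seconds ago
orxl8b6kt2l6dfznzzd4lij4s    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 17 seconds ago
```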
This patch changes the service's command to `sleep`, so that a successful task (after successfully performing `ls -l /etc/config`) continues to be running until the service is deleted. With that change, the service should (usually) reconcile immediately, which removes the race condition, and should also make it faster :)
This patch changes the tests to use client.ServiceLogs() instead of using the
service's tasklist to directly access container logs. This should also fix some
failures that happened if some tasks failed to start before reconciling, in which
case client.TaskList() (with the current filters) could return more tasks than
anticipated (as it also contained the exited tasks);
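```
=== RUN   TestCreateServiceSecretFileMode
    create_test.go:291: assertion failed: 2 (int) != 1 (int)
--- FAIL: TestCreateServiceSecretFileMode (7.88s)
=== RUN   TestCreateServiceConfigFileMode
    create_test.go:355: assertion failed: 2 (int) != 1 (int)
--- FAIL: TestCreateServiceConfigFileMode (7.87s)
```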