
Fix TestCreateServiceSecretFileMode, TestCreateServiceConfigFileMode #42960

Merged

Conversation


@thaJeztah thaJeztah commented Oct 22, 2021

hopefully fixes #37132

Looks like this test was broken from the start, and fully relied on a race
condition. (Test was added in 65ee7ff / #36130)

The problem is in the service's command: `ls -l /etc/config || /bin/top`, which
will either:

  • exit immediately if the secret is mounted correctly at `/etc/config` (which it should)
  • keep running with `/bin/top` if the above failed

After the service is created, the test enters a race condition, checking for 1
task to be running (which it occasionally is), after which it proceeds and looks
up the list of tasks of the service, to get the log output of `ls -l /etc/config`.

This is another race: first of all, the original filter for that task lookup did
not filter by `running`, so it would pick "any" task of the service (either failed,
running, or "completed" (successfully exited) tasks).

In the meantime though, SwarmKit kept reconciling the service and creating new
tasks, so even if the test was able to get the ID of the correct task, that task
may already have exited and been removed (the task-limit is 5 by default). Only
if the test was "lucky" would it be able to get the logs; more likely, it would
be "too late", and the task already gone.
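For reference, the task-limit mentioned above is the swarm-wide task-history
retention limit; it can be tuned on an existing swarm if needed:

    docker swarm update --task-history-limit 5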

The problem can be easily reproduced when running the steps manually:

    echo 'CONFIG' | docker config create myconfig -

    docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config || /bin/top'

The above creates the service, but it keeps retrying, because each task exits
immediately (followed by SwarmKit reconciling and starting a new task);

    mjntpfkkyuuc1dpay4h00c4oo
    overall progress: 0 out of 1 tasks
    1/1: ready     [======================================>            ]
    verify: Detected task failure
    ^COperation continuing in background.
    Use `docker service ps mjntpfkkyuuc1dpay4h00c4oo` to check progress.

And checking the tasks for the service reveals that tasks exit cleanly (no error),
but _do exit_, so swarm just keeps reconciling and spinning up new tasks;

    docker service ps myservice --no-trunc
    ID                          NAME              IMAGE                                                                                    NODE             DESIRED STATE   CURRENT STATE                     ERROR     PORTS
    2wmcuv4vffnet8nybg3he4v9n   myservice.1       busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Ready           Ready less than a second ago
    5p8b006uec125iq2892lxay64    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete less than a second ago
    k8lpsvlak4b3nil0zfkexw61p    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 6 seconds ago
    vsunl5pi7e2n9ol3p89kvj6pn    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 11 seconds ago
    orxl8b6kt2l6dfznzzd4lij4s    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 17 seconds ago

This patch changes the service's command to `sleep`, so that a successful task
(after successfully performing `ls -l /etc/config`) continues running until
the service is deleted. With that change, the service should (usually) reconcile
immediately, which removes the race condition, and should also make it faster :)
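For comparison, the equivalent manual repro with the patched command (shown here
with `sleep inf`, as used in the final revision of this change) converges
immediately and stays converged:

    docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config && sleep inf'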

This patch also changes the tests to use `client.ServiceLogs()` instead of using the
service's task list to directly access container logs. This should also fix some
failures that happened if some tasks failed to start before reconciling, in which
case `client.TaskList()` (with the current filters) could return more tasks than
anticipated (as it also contained the exited tasks);

    === RUN   TestCreateServiceSecretFileMode
        create_test.go:291: assertion failed: 2 (int) != 1 (int)
    --- FAIL: TestCreateServiceSecretFileMode (7.88s)
    === RUN   TestCreateServiceConfigFileMode
        create_test.go:355: assertion failed: 2 (int) != 1 (int)
    --- FAIL: TestCreateServiceConfigFileMode (7.87s)
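For illustration, a minimal sketch of the new log-reading approach (not the
verbatim test code: only `ServiceLogs` and `Tail: "1"` appear in the hunks later
in this conversation; the `ShowStdout` option and the `stdcopy` demultiplexing
step are assumptions on my part, and imports of `bytes`, `io`,
`github.com/docker/docker/api/types`, and `github.com/docker/docker/pkg/stdcopy`
are assumed):

    // Read the tail of the logs from the service itself, rather than from a
    // single task's container (which may already have been removed).
    body, err := client.ServiceLogs(ctx, serviceID, types.ContainerLogsOptions{
    	ShowStdout: true,
    	Tail:       "1",
    })
    assert.NilError(t, err)
    defer body.Close()

    // The log stream is multiplexed (no TTY); stdcopy splits stdout from stderr.
    var stdout bytes.Buffer
    _, err = stdcopy.StdCopy(&stdout, io.Discard, body)
    assert.NilError(t, err)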

@thaJeztah
Member Author

@AkihiroSuda @cpuguy83 PTAL

    -	})
    -	assert.NilError(t, err)
    -	assert.Check(t, is.Equal(len(tasks), 1))
    +	tasks := swarm.GetRunningTasks(t, client, serviceID)
Member Author


This utility should probably also be updated to check the actual state, but I'm leaving that for a follow-up; something similar to:

    // Count the tasks that are actually running, and capture the error
    // from a failed task (if any).
    var running int
    var taskError string
    for _, task := range tasks {
    	switch task.Status.State {
    	case swarmtypes.TaskStateRunning:
    		running++
    	case swarmtypes.TaskStateFailed:
    		if task.Status.Err != "" {
    			taskError = task.Status.Err
    		}
    	}
    }

@thaJeztah
Member Author

LOL! okay, okay, I'll fix

    === RUN   TestCreateServiceSecretFileMode
        create_test.go:286: assertion failed: 1 (int) != 1 (instances uint64)
    --- FAIL: TestCreateServiceSecretFileMode (2.46s)

@thaJeztah
Member Author

So this PR should fix the bug in the test (looking for "any" task), but doesn't solve the flakiness of tasks failing to start (i.e., we end up without running tasks within XX seconds);

    === RUN   TestCreateServiceConfigFileMode
        create_test.go:342: timeout hit after 10s: running task count at 0 waiting for 1 (total tasks: 3)
    --- FAIL: TestCreateServiceConfigFileMode (11.73s)

Let me try to get a bundle to see why the tasks fail to start.

@thaJeztah thaJeztah force-pushed the fix_TestCreateServiceSecretFileMode branch 3 times, most recently from 4941809 to 634b490 on October 26, 2021 at 11:41
@thaJeztah
Member Author

I think I found the actual problem with this test (also see #37132 (comment)), and ... it's silly 🤦

Updated this PR, and the description.

    @@ -267,7 +267,7 @@ func TestCreateServiceSecretFileMode(t *testing.T) {
     	serviceID := swarm.CreateService(t, d,
     		swarm.ServiceWithReplicas(instances),
     		swarm.ServiceWithName(serviceName),
    -		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret || /bin/top"}),
    +		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}),
Member Author


FWIW, we don't really need these tasks to keep running (running once and print the output would be sufficient), but keeping them running makes sure SwarmKit doesn't continue reconciling / spinning up new tasks.

Possibly we could simplify the test to get the results out in some different way, but I'm gonna leave that for a future exercise.

Member


Not super familiar with this test code, but what image is it running? Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.

Member Author


> Not super familiar with this test code, but what image is it running?

The default for these tests is to use busybox.

The Dockerfile has a frozen-images stage, which downloads some images that are used in integration tests. Originally this was added to:

  • be sure we always use the exact same image
  • speed up the tests (no docker pull for each of the tests); this also helped with hiccups / outages of Docker Hub, if they would happen

Nowadays it also comes in handy with rate limits 😅 (although, we should set up a "pull only" token in CI)

> Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.

Oh, TIL about `inf` being a valid option. Looks like busybox `sleep` supports that value.

I'll update the test; I suspect there are various other tests that use <some arbitrary, randomly picked long timeout>; we can go through our tests and update those as well.
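As an illustrative sanity check that busybox's `sleep` accepts `inf`:

    docker run --rm busybox sleep inf

(this blocks indefinitely rather than exiting with an "invalid number" error, so the argument is accepted)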

@thaJeztah thaJeztah force-pushed the fix_TestCreateServiceSecretFileMode branch from 634b490 to c3ba756 on October 26, 2021 at 12:06

    -	body, err := client.ContainerLogs(ctx, tasks[0].Status.ContainerStatus.ContainerID, types.ContainerLogsOptions{
    +	body, err := client.ServiceLogs(ctx, serviceID, types.ContainerLogsOptions{
    +		Tail: "1",
Member Author


1 line should be enough 😅 (famous last words)

@thaJeztah
Member Author

The failure on Windows is another flaky test (#42612):

    === RUN   TestRequestReleaseAddressDuplicate
        allocator_test.go:1531: IP 198.168.1.244/23 was previously allocated
    --- FAIL: TestRequestReleaseAddressDuplicate (0.01s)

Also failing: TestNetworkDBIslands (tracked through #42459)

Kicking CI again.

@thaJeztah
Member Author

@samuelkarp @AkihiroSuda this LGTY? It should bring CI into slightly better shape.

Member

@samuelkarp samuelkarp left a comment


LGTM with a minor suggestion


@thaJeztah thaJeztah force-pushed the fix_TestCreateServiceSecretFileMode branch from c3ba756 to 13cff6d on October 27, 2021 at 08:56
@thaJeztah
Member Author

Changed `sleep 600` to `sleep inf` (no other changes).
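Reconstructing from the earlier hunk, the changed line should now read:

    -		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}),
    +		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep inf"}),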

I'll merge this one once CI completes.

@thaJeztah
Member Author

CI passed, but failed in cleanup afterwards;

    Unable to delete '/home/ubuntu/workspace/moby_PR-42960'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts. (Discarded 13 additional exceptions)


Successfully merging this pull request may close these issues.

Flaky test: TestServicePlugin, TestCreateServiceSecretFileMode