
Fix TestCreateServiceSecretFileMode, TestCreateServiceConfigFileMode #42960

Merged

Conversation


@thaJeztah thaJeztah commented Oct 22, 2021

hopefully fixes #37132

Looks like this test was broken from the start, and fully relied on a race
condition. (Test was added in 65ee7ff / #36130)

The problem is in the service's command: `ls -l /etc/config || /bin/top`, which
will either:

  • exit immediately if the secret is mounted correctly at `/etc/config` (which it should)
  • keep running with `/bin/top` if the above failed

After the service is created, the test enters a race condition, checking for 1
task to be running (which it occasionally is), after which it proceeds and looks
up the list of tasks of the service, to get the log output of `ls -l /etc/config`.

This is another race: first of all, the original filter for that task lookup did
not filter by `running`, so it would pick "any" task of the service (either failed,
running, or "completed" (successfully exited) tasks).

In the meantime though, SwarmKit kept reconciling the service and creating new
tasks, so even if the test was able to get the ID of the correct task, that task
may already have exited and been removed (the task-limit is 5 by default). Only
if the test was "lucky" would it be able to get the logs; more likely, it would
be "too late", and the task already gone.
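For reference, the task-limit mentioned above is the swarm-wide task-history
retention limit; it can be tuned on an existing swarm if needed:

    docker swarm update --task-history-limit 5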

The problem can be easily reproduced when running the steps manually:

    echo 'CONFIG' | docker config create myconfig -

    docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config || /bin/top'

The above creates the service, but it keeps retrying, because each task exits
immediately (followed by SwarmKit reconciling and starting a new task);

    mjntpfkkyuuc1dpay4h00c4oo
    overall progress: 0 out of 1 tasks
    1/1: ready     [======================================>            ]
    verify: Detected task failure
    ^COperation continuing in background.
    Use `docker service ps mjntpfkkyuuc1dpay4h00c4oo` to check progress.

And checking the tasks for the service reveals that tasks exit cleanly (no error),
but _do exit_, so swarm just keeps reconciling and spinning up new tasks;

    docker service ps myservice --no-trunc
    ID                          NAME              IMAGE                                                                                    NODE             DESIRED STATE   CURRENT STATE                     ERROR     PORTS
    2wmcuv4vffnet8nybg3he4v9n   myservice.1       busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Ready           Ready less than a second ago
    5p8b006uec125iq2892lxay64    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete less than a second ago
    k8lpsvlak4b3nil0zfkexw61p    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 6 seconds ago
    vsunl5pi7e2n9ol3p89kvj6pn    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 11 seconds ago
    orxl8b6kt2l6dfznzzd4lij4s    \_ myservice.1   busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57   docker-desktop   Shutdown        Complete 17 seconds ago

This patch changes the service's command to `sleep`, so that a successful task
(after successfully performing `ls -l /etc/config`) continues running until
the service is deleted. With that change, the service should (usually) reconcile
immediately, which removes the race condition, and should also make it faster :)
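For comparison, the equivalent manual repro with the patched command (shown here
with `sleep inf`, as used in the final revision of this change) converges
immediately and stays converged:

    docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config && sleep inf'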

This patch also changes the tests to use `client.ServiceLogs()` instead of using the
service's task list to directly access container logs. This should also fix some
failures that happened if some tasks failed to start before reconciling, in which
case `client.TaskList()` (with the current filters) could return more tasks than
anticipated (as it also contained the exited tasks);

    === RUN   TestCreateServiceSecretFileMode
        create_test.go:291: assertion failed: 2 (int) != 1 (int)
    --- FAIL: TestCreateServiceSecretFileMode (7.88s)
    === RUN   TestCreateServiceConfigFileMode
        create_test.go:355: assertion failed: 2 (int) != 1 (int)
    --- FAIL: TestCreateServiceConfigFileMode (7.87s)
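For illustration, a minimal sketch of the new log-reading approach (not the
verbatim test code: only `ServiceLogs` and `Tail: "1"` appear in the hunks later
in this conversation; the `ShowStdout` option and the `stdcopy` demultiplexing
step are assumptions on my part, and imports of `bytes`, `io`,
`github.com/docker/docker/api/types`, and `github.com/docker/docker/pkg/stdcopy`
are assumed):

    // Read the tail of the logs from the service itself, rather than from a
    // single task's container (which may already have been removed).
    body, err := client.ServiceLogs(ctx, serviceID, types.ContainerLogsOptions{
    	ShowStdout: true,
    	Tail:       "1",
    })
    assert.NilError(t, err)
    defer body.Close()

    // The log stream is multiplexed (no TTY); stdcopy splits stdout from stderr.
    var stdout bytes.Buffer
    _, err = stdcopy.StdCopy(&stdout, io.Discard, body)
    assert.NilError(t, err)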

@thaJeztah
Member Author

@AkihiroSuda @cpuguy83 PTAL

    -	})
    -	assert.NilError(t, err)
    -	assert.Check(t, is.Equal(len(tasks), 1))
    +	tasks := swarm.GetRunningTasks(t, client, serviceID)
Member Author


This utility should probably also be updated to check the actual state, but I'm leaving that for a follow-up; something similar to:

    // Count the tasks that are actually running, and capture the error
    // from a failed task (if any).
    var running int
    var taskError string
    for _, task := range tasks {
    	switch task.Status.State {
    	case swarmtypes.TaskStateRunning:
    		running++
    	case swarmtypes.TaskStateFailed:
    		if task.Status.Err != "" {
    			taskError = task.Status.Err
    		}
    	}
    }

@thaJeztah
Member Author

LOL! okay, okay, I'll fix

    === RUN   TestCreateServiceSecretFileMode
        create_test.go:286: assertion failed: 1 (int) != 1 (instances uint64)
    --- FAIL: TestCreateServiceSecretFileMode (2.46s)

@thaJeztah
Member Author

So this PR should fix the bug in the test (looking for "any" task), but doesn't solve the flakiness of tasks failing to start (i.e., we end up without running tasks within XX seconds);

    === RUN   TestCreateServiceConfigFileMode
        create_test.go:342: timeout hit after 10s: running task count at 0 waiting for 1 (total tasks: 3)
    --- FAIL: TestCreateServiceConfigFileMode (11.73s)

Let me try to get a bundle to see why the tasks fail to start.

@thaJeztah thaJeztah force-pushed the fix_TestCreateServiceSecretFileMode branch 3 times, most recently from 4941809 to 634b490 on October 26, 2021 at 11:41
@thaJeztah
Member Author

I think I found the actual problem with this test (also see #37132 (comment)), and ... it's silly 🤦

Updated this PR, and the description.

    @@ -267,7 +267,7 @@ func TestCreateServiceSecretFileMode(t *testing.T) {
     	serviceID := swarm.CreateService(t, d,
     		swarm.ServiceWithReplicas(instances),
     		swarm.ServiceWithName(serviceName),
    -		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret || /bin/top"}),
    +		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}),
Member Author


FWIW, we don't really need these tasks to keep running (running once and print the output would be sufficient), but keeping them running makes sure SwarmKit doesn't continue reconciling / spinning up new tasks.

Possibly we could simplify the test to get the results out in some different way, but I'm gonna leave that for a future exercise.

Member


Not super familiar with this test code, but what image is it running? Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.

Member Author


> Not super familiar with this test code, but what image is it running?

The default for these tests is to use busybox.

The Dockerfile has a frozen-images stage, which downloads some images that are used in integration tests. Originally this was added to:

  • be sure we always use the exact same image
  • speed up the tests (no docker pull for each of the tests); this also helped with hiccups / outages of Docker Hub, if they would happen

Nowadays it also comes in handy with rate limits 😅 (although, we should set up a "pull only" token in CI)

> Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.

Oh, TIL about `inf` being a valid option. Looks like busybox `sleep` supports that value.

I'll update the test; I suspect there are various other tests that use <some arbitrary, randomly picked long timeout>; we can go through our tests and update those as well.
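As an illustrative sanity check that busybox's `sleep` accepts `inf`:

    docker run --rm busybox sleep inf

(this blocks indefinitely rather than exiting with an "invalid number" error, so the argument is accepted)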

@thaJeztah thaJeztah force-pushed the fix_TestCreateServiceSecretFileMode branch from 634b490 to c3ba756 on October 26, 2021 at 12:06

    -	body, err := client.ContainerLogs(ctx, tasks[0].Status.ContainerStatus.ContainerID, types.ContainerLogsOptions{
    +	body, err := client.ServiceLogs(ctx, serviceID, types.ContainerLogsOptions{
    +		Tail: "1",
Member Author


1 line should be enough 😅 (famous last words)

@thaJeztah
Member Author

The failure on Windows is another flaky test (#42612):

    === RUN   TestRequestReleaseAddressDuplicate
        allocator_test.go:1531: IP 198.168.1.244/23 was previously allocated
    --- FAIL: TestRequestReleaseAddressDuplicate (0.01s)

Also failing: TestNetworkDBIslands (tracked through #42459)

Kicking CI again.

@thaJeztah
Member Author

@samuelkarp @AkihiroSuda this LGTY? It should bring CI into slightly better shape.

Member

@samuelkarp samuelkarp left a comment


LGTM with a minor suggestion


@thaJeztah thaJeztah force-pushed the fix_TestCreateServiceSecretFileMode branch from c3ba756 to 13cff6d on October 27, 2021 at 08:56
@thaJeztah
Member Author

Changed `sleep 600` to `sleep inf` (no other changes).
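Reconstructing from the earlier hunk, the changed line should now read:

    -		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}),
    +		swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep inf"}),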

I'll merge this one once CI completes.

@thaJeztah
Member Author

CI passed, but failed in cleanup afterwards;

    Unable to delete '/home/ubuntu/workspace/moby_PR-42960'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts. (Discarded 13 additional exceptions)


Successfully merging this pull request may close these issues.

Flaky test: TestServicePlugin, TestCreateServiceSecretFileMode