Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Swarm tests debug #38080

Closed
wants to merge 7 commits into from
Closed

Conversation

kolyshkin
Copy link
Contributor

Just playing with TestSwarmClusterRotateUnlockKey a bit

Since commit 17173ef checkSwarmLockedToUnlocked() no longer
require its third argument, so remove it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Using MNT_FORCE flag does not make sense for nsfs. Using MNT_DETACH
though might help.

2. When -check.vv is added to TESTFLAGS, there are a lot of messages
like this one:

> unmount of /tmp/dxr/d847fd103a4ba/netns failed: invalid argument

and some like

> unmount of /tmp/dxr/dd245af642d94/netns failed: no such file or directory

The first one means directory is not a mount point, the second one
means it's gone. Do ignore both of these.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A timer is leaking on every daemon start and stop.
Probably nothing major, but given the amount of
daemon starts/stops during tests, it's better to
be accurate about it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

OK the failure in DockerSwarmSuite.TestSwarmClusterRotateUnlockKey

01:27:19.614 docker_cli_swarm_test.go:1454:
01:27:19.614 c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
01:27:19.614 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc00083d0e0), Stderr:[]uint8(nil)} ("exit status 1")
01:27:19.614 ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.

happens (occasionally) because d1 (which is a leader) losts both d2 and d3, and thus it loses quorum and steps itself down from the leader to a follower:

time="2018-10-25T03:51:55.478817549Z" level=warning msg="58bdb01f53b70120 stepped down to follower since quorum is not active" module=raft node.id=l3s9bcv95laiiflz2pya15gi3
time="2018-10-25T03:51:55.478849752Z" level=info msg="58bdb01f53b70120 became follower at term 2" module=raft node.id=l3s9bcv95laiiflz2pya15gi3
time="2018-10-25T03:51:55.478859001Z" level=info msg="raft.node: 58bdb01f53b70120 lost leader 58bdb01f53b70120 at term 2" module=raft node.id=l3s9bcv95laiiflz2pya15gi3
time="2018-10-25T03:51:55.478929983Z" level=error msg="soft state changed, node no longer a leader, resetting and cancelling all waits" raft_id=58bdb01f53b70120
time="2018-10-25T03:51:55.478964151Z" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=l3s9bcv95laiiflz2pya15gi3

what exactly causes that is not yet clear to me...

@kolyshkin
Copy link
Contributor Author

Found a possible cause. Testing a fix (which makes sense for other tests, too, and will be applied if proved usable).

@kolyshkin kolyshkin force-pushed the swarm-tests branch 4 times, most recently from 8dd6f99 to 1547db6 Compare October 25, 2018 21:48
@kolyshkin
Copy link
Contributor Author

Out of 5 runs, got a TestAPISwarmLeaderElection failure on z (I haven't patched this test yet). Guess it makes sense to patch all the swarm tests.

@kolyshkin kolyshkin force-pushed the swarm-tests branch 5 times, most recently from 72f830b to 7c3bf5b Compare October 26, 2018 03:39
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Oct 26, 2018

Patched all the tests, reset the counter, running repeated CI again.

Runs 1-5, no failures.

Run 6, failure on experimental in TestSwarmLockUnlockCluster:

00:25:37.850 START: docker_cli_swarm_test.go:1143: DockerSwarmSuite.TestSwarmLockUnlockCluster
00:25:37.850 START: check_test.go:320: DockerSwarmSuite.SetUpTest
00:25:37.851 PASS: check_test.go:320: DockerSwarmSuite.SetUpTest	0.000s
00:25:37.851 
00:25:37.851 [d67a16aac9c2d] waiting for daemon to start
00:25:38.353 [d67a16aac9c2d] daemon started
00:25:38.353 
00:25:39.153 [d598c4764df87] waiting for daemon to start
00:25:39.655 [d598c4764df87] daemon started
00:25:39.655 
00:25:40.440 [d53d8f36873f9] waiting for daemon to start
00:25:40.941 [d53d8f36873f9] daemon started
00:25:40.941 
00:25:46.731 [d598c4764df87] exiting daemon
00:25:46.731 [d598c4764df87] waiting for daemon to start
00:25:47.233 [d598c4764df87] daemon started
00:25:47.233 
00:25:56.355 [d598c4764df87] exiting daemon
00:25:56.698 [d67a16aac9c2d] exiting daemon
00:25:56.698 [d67a16aac9c2d] waiting for daemon to start
00:25:57.200 [d67a16aac9c2d] daemon started
00:25:57.200 
00:26:13.697 [d53d8f36873f9] exiting daemon
00:26:13.697 [d53d8f36873f9] waiting for daemon to start
00:26:14.200 [d53d8f36873f9] daemon started
00:26:14.200 
00:26:18.372 [d598c4764df87] waiting for daemon to start
00:26:18.874 [d598c4764df87] daemon started
00:26:18.874 
00:26:24.997 [d598c4764df87] exiting daemon
00:26:24.998 [d598c4764df87] waiting for daemon to start
00:26:25.499 [d598c4764df87] daemon started
00:26:25.499 
00:26:26.037 [d67a16aac9c2d] exiting daemon
00:26:26.038 [d67a16aac9c2d] waiting for daemon to start
00:26:46.129 [d67a16aac9c2d] daemon started
00:26:46.130 
00:26:46.240 docker_cli_swarm_test.go:1198:
00:26:46.246     checkSwarmLockedToUnlocked(c, d)
00:26:46.246 docker_cli_swarm_test.go:1017:
00:26:46.254     c.Assert(getNodeStatus(c, d), checker.Equals, swarm.LocalNodeStateActive)
00:26:46.255 ... obtained swarm.LocalNodeState = "pending"
00:26:46.255 ... expected swarm.LocalNodeState = "active"
00:26:46.255 
00:26:46.260 START: check_test.go:347: DockerSwarmSuite.TearDownTest
00:26:46.272 [d67a16aac9c2d] exiting daemon
00:26:46.276 [d598c4764df87] exiting daemon
00:26:51.285 [d53d8f36873f9] exiting daemon
00:26:51.289 PASS: check_test.go:347: DockerSwarmSuite.TearDownTest	5.028s
00:26:51.289 
00:26:51.289 FAIL: docker_cli_swarm_test.go:1143: DockerSwarmSuite.TestSwarmLockUnlockCluster
00:26:51.289 

@kolyshkin kolyshkin force-pushed the swarm-tests branch 4 times, most recently from 706c6ad to 6f9b564 Compare October 26, 2018 19:46
@kolyshkin
Copy link
Contributor Author

Runs 7-9: no failures.

Run 10/60 [▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]

Failure in TestAPISwarmLeaderElection on power:

00:20:07.222 
00:20:07.223 START: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
00:20:07.223 START: check_test.go:320: DockerSwarmSuite.SetUpTest
00:20:07.223 PASS: check_test.go:320: DockerSwarmSuite.SetUpTest	0.000s
00:20:07.223 
00:20:07.292 [d88e13dd68846] waiting for daemon to start
00:20:07.727 [d88e13dd68846] daemon started
00:20:07.727 
00:20:08.300 [d29173e92c51e] waiting for daemon to start
00:20:08.803 [d29173e92c51e] daemon started
00:20:08.803 
00:20:10.963 [dbef89344f8cd] waiting for daemon to start
00:20:11.466 [dbef89344f8cd] daemon started
00:20:11.466 
00:20:12.854 [d88e13dd68846] exiting daemon
00:20:32.857 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
00:20:32.857 START: check_test.go:347: DockerSwarmSuite.TearDownTest
00:20:32.858 Daemon d88e13dd68846 is not started
00:20:34.028 [d29173e92c51e] exiting daemon
00:20:35.272 [dbef89344f8cd] exiting daemon
00:20:35.279 PASS: check_test.go:347: DockerSwarmSuite.TearDownTest	2.423s
00:20:35.280 
00:20:35.280 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
00:20:35.280 

This is repeated 6 times in different tests, with slight
minor variations. Let's factor it out, for clarity.

While at it, simplify the code: instead of more complex
parsing of "docker swarm init|update --autolock" output (1)
and checking if the key is also present in
"docker swarm unlock-key" output (2), get the key
from (2) and check it is present in (1).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
..

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin force-pushed the swarm-tests branch 2 times, most recently from 3f55b07 to 8faac3e Compare October 29, 2018 22:09
@kolyshkin
Copy link
Contributor Author

Run 18/60 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]

Failure on power, TestAPISwarmLeaderElection

00:11:55.420
00:11:55.420 START: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
00:11:55.421 START: check_test.go:320: DockerSwarmSuite.SetUpTest
00:11:55.421 PASS: check_test.go:320: DockerSwarmSuite.SetUpTest 0.000s
00:11:55.421
00:11:55.489 [d38e3c8be7c55] waiting for daemon to start
00:11:55.925 [d38e3c8be7c55] daemon started
00:11:55.925
00:11:56.528 [d79776d753327] waiting for daemon to start
00:11:57.031 [d79776d753327] daemon started
00:11:57.032
00:11:59.193 [d3223dbf271f8] waiting for daemon to start
00:11:59.699 [d3223dbf271f8] daemon started
00:11:59.699
00:12:01.074 [d38e3c8be7c55] exiting daemon
00:12:21.076 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
00:12:21.077 START: check_test.go:347: DockerSwarmSuite.TearDownTest
00:12:21.077 Daemon d38e3c8be7c55 is not started
00:12:22.245 [d79776d753327] exiting daemon
00:12:28.418 [d3223dbf271f8] exiting daemon
00:12:28.423 PASS: check_test.go:347: DockerSwarmSuite.TearDownTest 7.347s
00:12:28.424
00:12:28.424 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
00:12:28.424

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Oct 30, 2018

Run 19/60 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]

Two failures, one on power in TestAPISwarmLeaderElection:

00:19:42.491
00:19:42.491 START: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
00:19:42.491 START: check_test.go:320: DockerSwarmSuite.SetUpTest
00:19:42.491 PASS: check_test.go:320: DockerSwarmSuite.SetUpTest 0.000s
00:19:42.491
00:19:42.493 [d724a4bdcf700] waiting for daemon to start
00:19:42.996 [d724a4bdcf700] daemon started
00:19:42.996
00:19:43.615 [dd109cf60cb89] waiting for daemon to start
00:19:44.118 [dd109cf60cb89] daemon started
00:19:44.118
00:19:45.293 [d82deaefcb384] waiting for daemon to start
00:19:45.795 [d82deaefcb384] daemon started
00:19:45.795
00:19:47.188 [d724a4bdcf700] exiting daemon
00:20:07.191 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
00:20:07.191 START: check_test.go:347: DockerSwarmSuite.TearDownTest
00:20:07.191 Daemon d724a4bdcf700 is not started
00:20:12.201 [dd109cf60cb89] exiting daemon
00:20:13.405 [d82deaefcb384] exiting daemon
00:20:13.412 PASS: check_test.go:347: DockerSwarmSuite.TearDownTest 6.221s
00:20:13.412
00:20:13.412 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
00:20:13.412

and a new failure on z, TestSwarmPublishAdd:

00:35:47.832
00:35:47.832 START: docker_cli_swarm_test.go:274: DockerSwarmSuite.TestSwarmPublishAdd
00:35:47.832 START: check_test.go:320: DockerSwarmSuite.SetUpTest
00:35:47.832 PASS: check_test.go:320: DockerSwarmSuite.SetUpTest 0.000s
00:35:47.832
00:35:47.834 [d3f9f547e33bb] waiting for daemon to start
00:35:48.346 [d3f9f547e33bb] daemon started
00:35:48.346
00:35:49.973 docker_cli_swarm_test.go:286:
00:35:49.981 c.Assert(err, checker.IsNil, check.Commentf("%s", out))
00:35:49.985 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000617da0), Stderr:[]uint8(nil)} ("exit status 1")
00:35:49.986 ... Error response from daemon: rpc error: code = Unknown desc = update out of sequence
00:35:49.986
00:35:49.986
00:35:49.986 START: check_test.go:347: DockerSwarmSuite.TearDownTest
00:35:51.903 [d3f9f547e33bb] exiting daemon
00:35:51.911 PASS: check_test.go:347: DockerSwarmSuite.TearDownTest 1.928s
00:35:51.911
00:35:51.911 FAIL: docker_cli_swarm_test.go:274: DockerSwarmSuite.TestSwarmPublishAdd
00:35:51.912

......

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

Some previous findings on TestAPISwarmLeaderElection are in #37833

@kolyshkin kolyshkin force-pushed the swarm-tests branch 2 times, most recently from 4a5b90b to e9d385d Compare October 30, 2018 16:09
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Oct 30, 2018

Run 20 -- no failures

Run 21/60 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]

Failure on power in TestSwarmClusterRotateUnlockKey:

00:33:55.218
00:33:55.218 START: docker_cli_swarm_test.go:1316: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
00:33:55.218 START: check_test.go:320: DockerSwarmSuite.SetUpTest
00:33:55.218 PASS: check_test.go:320: DockerSwarmSuite.SetUpTest 0.000s
00:33:55.218
00:33:55.223 [db8afafbf2e95] waiting for daemon to start
00:33:55.726 [db8afafbf2e95] daemon started
00:33:55.726
00:33:56.313 [d178b2e905cdf] waiting for daemon to start
00:33:56.833 [d178b2e905cdf] daemon started
00:33:56.833
00:33:57.985 [db1bfab8e751e] waiting for daemon to start
00:33:58.488 [db1bfab8e751e] daemon started
00:33:58.488
00:34:04.759 [d178b2e905cdf] exiting daemon
00:34:04.761 [d178b2e905cdf] waiting for daemon to start
00:34:05.266 [d178b2e905cdf] daemon started
00:34:05.266
00:34:10.502 [db1bfab8e751e] exiting daemon
00:34:10.503 [db1bfab8e751e] waiting for daemon to start
00:34:11.006 [db1bfab8e751e] daemon started
00:34:11.006
00:34:36.221 docker_cli_swarm_test.go:1379:
00:34:36.235 c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
00:34:36.236 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000627e60), Stderr:[]uint8(nil)} ("exit status 1")
00:34:36.236 ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
00:34:36.236
00:34:36.236
00:34:36.237 START: check_test.go:347: DockerSwarmSuite.TearDownTest
00:34:36.590 [db8afafbf2e95] exiting daemon
00:34:41.603 [d178b2e905cdf] exiting daemon
00:34:41.612 [db1bfab8e751e] exiting daemon
00:34:41.618 PASS: check_test.go:347: DockerSwarmSuite.TearDownTest 5.382s
00:34:41.618
00:34:41.618 FAIL: docker_cli_swarm_test.go:1316: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
00:34:41.619

@kolyshkin kolyshkin force-pushed the swarm-tests branch 7 times, most recently from b168e90 to 5643aed Compare November 2, 2018 00:40
When starting docker daemons for swarm testing, we disable iptables
and use lo for communication (in order to avoid network conflicts).

The problem is, these options are lost on restart, that can lead
to any sorts of network conflicts and thus connectivity issues
between swarm nodes.

Fix this.

29/60 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]

Failures so far:
* TestSwarmLockUnlockCluster      ▓▓
* TestAPISwarmLeaderElection      ▓▓▓▓
* TestSwarmClusterRotateUnlockKey ▓▓▓
* TestSwarmPublishAdd             ▓

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@vdemeester
Copy link
Member

@kolyshkin giving the "status" and the time without activity on this PR, I'm gonna close it 👼

@vdemeester vdemeester closed this Jan 10, 2019
@kolyshkin
Copy link
Contributor Author

Right, thanks. The good part of it is already merged in #38127, the rest is just [futile] attempts to figure out what's going on with flaky swarm tests.

@kolyshkin kolyshkin deleted the swarm-tests branch January 10, 2019 22:16
@thaJeztah
Copy link
Member

Perhaps we could use testing.Verbose() to enable additional debugging on some tests (https://splice.com/blog/lesser-known-features-go-test/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants