Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey #38885

Closed
thaJeztah opened this issue Mar 16, 2019 · 2 comments · Fixed by #39616 or #47009
Labels
area/testing kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.

Comments

@thaJeztah
Member

Let's create a separate issue for this one (also tracked in #33041 and #37306).

Seen failing in https://jenkins.dockerproject.org/job/Docker-PRs-experimental/44501/console (and many other times)

03:25:28 FAIL: docker_cli_swarm_test.go:1316: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
03:25:28
03:25:28 Creating a new daemon
03:25:28 [dcd909916369d] waiting for daemon to start
03:25:28 [dcd909916369d] daemon started
03:25:28
03:25:28 Creating a new daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28
03:25:28 Creating a new daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28
03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28
03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28
03:25:28 docker_cli_swarm_test.go:1386:
03:25:28     c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
03:25:28 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0008f00a0), Stderr:[]uint8(nil)} ("exit status 1")
03:25:28 ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
03:25:28
03:25:28
03:25:28 [dcd909916369d] exiting daemon
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [d899b634e4c28] exiting daemon
03:25:32

This is the test:

// This differs from `TestSwarmRotateUnlockKey` because that one rotates a single node, which is the leader.
// This one keeps the leader up, and asserts that other manager nodes in the cluster also have their unlock
// key rotated.
func (s *DockerSwarmSuite) TestSwarmClusterRotateUnlockKey(c *check.C) {
	if runtime.GOARCH == "s390x" {
		c.Skip("Disabled on s390x")
	}
	if runtime.GOARCH == "ppc64le" {
		c.Skip("Disabled on ppc64le")
	}
	d1 := s.AddDaemon(c, true, true) // leader - don't restart this one, we don't want leader election delays
	d2 := s.AddDaemon(c, true, true)
	d3 := s.AddDaemon(c, true, true)
	outs, err := d1.Cmd("swarm", "update", "--autolock")
	c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
	unlockKey := getUnlockKey(d1, c, outs)
	// Rotate multiple times
	for i := 0; i != 3; i++ {
		outs, err = d1.Cmd("swarm", "unlock-key", "-q", "--rotate")
		c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
		// Strip \n
		newUnlockKey := outs[:len(outs)-1]
		c.Assert(newUnlockKey, checker.Not(checker.Equals), "")
		c.Assert(newUnlockKey, checker.Not(checker.Equals), unlockKey)
		d2.RestartNode(c)
		d3.RestartNode(c)
		for _, d := range []*daemon.Daemon{d2, d3} {
			c.Assert(getNodeStatus(c, d), checker.Equals, swarm.LocalNodeStateLocked)
			outs, _ := d.Cmd("node", "ls")
			c.Assert(outs, checker.Contains, "Swarm is encrypted and needs to be unlocked")
			cmd := d.Command("swarm", "unlock")
			cmd.Stdin = bytes.NewBufferString(unlockKey)
			result := icmd.RunCmd(cmd)
			if result.Error == nil {
				// On occasion, the daemon may not have finished
				// rotating the KEK before restarting. The test is
				// intentionally written to explore this behavior.
				// When this happens, unlocking with the old key will
				// succeed. If we wait for the rotation to happen and
				// restart again, the new key should be required this
				// time.
				time.Sleep(3 * time.Second)
				d.RestartNode(c)
				cmd = d.Command("swarm", "unlock")
				cmd.Stdin = bytes.NewBufferString(unlockKey)
				result = icmd.RunCmd(cmd)
			}
			result.Assert(c, icmd.Expected{
				ExitCode: 1,
				Err:      "invalid key",
			})
			outs, _ = d.Cmd("node", "ls")
			c.Assert(outs, checker.Contains, "Swarm is encrypted and needs to be unlocked")
			cmd = d.Command("swarm", "unlock")
			cmd.Stdin = bytes.NewBufferString(newUnlockKey)
			icmd.RunCmd(cmd).Assert(c, icmd.Success)
			c.Assert(getNodeStatus(c, d), checker.Equals, swarm.LocalNodeStateActive)
			outs, err = d.Cmd("node", "ls")
			c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
			c.Assert(outs, checker.Not(checker.Contains), "Swarm is encrypted and needs to be unlocked")
		}
		unlockKey = newUnlockKey
	}
}

d1 = dcd909916369d
d2 = de6869a9c7827
d3 = d899b634e4c28
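
To make the console output easier to follow, the mapping above can be applied to each log line mechanically. The sketch below is only a reading aid: labelLine and daemonNames are names made up for this example, and the IDs are the ones from this particular CI run.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// daemonNames maps the daemon IDs seen in this CI run to the variables
// used in the test (d1, d2, d3).
var daemonNames = map[string]string{
	"dcd909916369d": "d1",
	"de6869a9c7827": "d2",
	"d899b634e4c28": "d3",
}

// idPattern matches a bracketed hexadecimal daemon ID such as [de6869a9c7827].
var idPattern = regexp.MustCompile(`\[([0-9a-f]+)\]`)

// labelLine adds the test variable name next to any known daemon ID.
// Lines without a known ID are returned unchanged.
func labelLine(line string) string {
	return idPattern.ReplaceAllStringFunc(line, func(m string) string {
		id := strings.Trim(m, "[]")
		if name, ok := daemonNames[id]; ok {
			return "[" + id + " " + name + "]"
		}
		return m
	})
}

func main() {
	fmt.Println(labelLine("03:25:28 [de6869a9c7827] exiting daemon"))
	fmt.Println(labelLine("03:25:28 Creating a new daemon"))
}
```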

03:25:28 FAIL: docker_cli_swarm_test.go:1316: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
03:25:28

Create 3 daemons:

Daemon 1 (d1 = dcd909916369d)

d1 := s.AddDaemon(c, true, true) // leader - don't restart this one, we don't want leader election delays

03:25:28 Creating a new daemon
03:25:28 [dcd909916369d] waiting for daemon to start
03:25:28 [dcd909916369d] daemon started
03:25:28

Daemon 2 (d2 = de6869a9c7827)

d2 := s.AddDaemon(c, true, true)

03:25:28 Creating a new daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28

Daemon 3 (d3 = d899b634e4c28)

d3 := s.AddDaemon(c, true, true)

03:25:28 Creating a new daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28

In a loop (3 times):

Iteration 1:

Restart daemon d2

03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28

Restart daemon d3

03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28

Iteration 2:

Restart daemon d2

03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28

Restart daemon d3

03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28

The test fails here:

outs, err = d.Cmd("node", "ls")
c.Assert(err, checker.IsNil, check.Commentf("%s", outs))

03:25:28 docker_cli_swarm_test.go:1386:
03:25:28     c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
03:25:28 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0008f00a0), Stderr:[]uint8(nil)} ("exit status 1")
03:25:28 ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
03:25:28
03:25:28
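
The error message asks for "more than half of the managers" to be online. For the three managers in this test that means at least two. The arithmetic is sketched below; quorum and hasQuorum are illustrative names, not swarmkit functions. Whether a temporary loss of quorum while d2 and d3 are restarting or still locked is the actual cause of this failure has not been established in this issue.

```go
package main

import "fmt"

// quorum returns the smallest number of managers that is more than half of total.
func quorum(total int) int {
	return total/2 + 1
}

// hasQuorum reports whether the online managers form a majority of total.
func hasQuorum(online, total int) bool {
	return online >= quorum(total)
}

func main() {
	const managers = 3 // d1, d2 and d3 all join as managers in this test
	for online := 1; online <= managers; online++ {
		fmt.Printf("online=%d needed=%d ok=%v\n",
			online, quorum(managers), hasQuorum(online, managers))
	}
}
```

With only d1 online (1 of 3), the majority rule is not satisfied. With d1 plus one unlocked node (2 of 3), it is.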

Teardown:

03:25:28 [dcd909916369d] exiting daemon
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [d899b634e4c28] exiting daemon
03:25:32

Logs:

@thaJeztah thaJeztah changed the title flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey Flaky test DockerSwarmSuite.TestSwarmClusterRotateUnlockKey Mar 16, 2019
@thaJeztah thaJeztah added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. area/testing labels Mar 16, 2019
@thaJeztah thaJeztah changed the title Flaky test DockerSwarmSuite.TestSwarmClusterRotateUnlockKey Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey Sep 1, 2019
@thaJeztah
Member Author

Looks like this test is still flaky #39883 (comment)

@thaJeztah
Member Author

Opened #39885 to add more debugging
