Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration-cli: add more debugging for TestSwarmClusterRotateUnlockKey #39885

Merged

Conversation

thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Sep 9, 2019

Relates to #38885 Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey

This test was updated in b79adac (#39616), but is still flaky #39883 (comment);

20:24:13  FAIL: docker_cli_swarm_test.go:1333: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] daemon started
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d204a02ba4780] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  [d873d6a842829] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  docker_cli_swarm_test.go:1413:
20:24:13      c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
20:24:13  ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000934240), Stderr:[]uint8(nil)} ("exit status 1")
20:24:13  ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
20:24:13
20:24:13
20:24:13  [d6f95e679cb65] Stopping daemon
20:24:13  [d6f95e679cb65] exiting daemon
20:24:13  [d6f95e679cb65] Daemon stopped
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped

The interesting bit there is that the retry loop should have a 3 second sleep before retrying,
but looking at the failure above, the test started (and failed) within a second, which means that
a different error / output was returned.

This patch adds some additional debugging to that test to see if we can catch the reason
this test is still flaky.

@thaJeztah
Copy link
Member Author

Logs from that failing test;

d6f95e679cb65.log
d204a02ba4780.log
d873d6a842829.log

@thaJeztah
Copy link
Member Author

thaJeztah commented Sep 9, 2019

Failure on RS1 is #39857

https://ci.docker.com/public/blue/rest/organizations/jenkins/pipelines/moby/branches/PR-39885/runs/1/nodes/200/log/?start=0

[2019-09-09T20:37:36.900Z] ok  	github.com/docker/docker/daemon/logger	0.525s	coverage: 43.0% of statements
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:35Z" level=info msg="Trying to get region from EC2 Metadata"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=info msg="Log stream already exists" errorCode=ResourceAlreadyExistsException logGroupName= logStreamName= message= origError="<nil>"
[2019-09-09T20:37:36.900Z] --- FAIL: TestLogBlocking (0.02s)
[2019-09-09T20:37:36.900Z]     cloudwatchlogs_test.go:313: Expected to be able to read from stream.messages but was unable to
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg=Error
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg="Failed to put log events" errorCode=InvalidSequenceTokenException logGroupName=groupName logStreamName=streamName message="use token token" origError="<nil>"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg="Failed to put log events" errorCode=DataAlreadyAcceptedException logGroupName=groupName logStreamName=streamName message="use token token" origError="<nil>"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=info msg="Data already accepted, ignoring error" errorCode=DataAlreadyAcceptedException logGroupName=groupName logStreamName=streamName message="use token token"
[2019-09-09T20:37:36.900Z] FAIL
[2019-09-09T20:37:36.900Z] coverage: 78.2% of statements
[2019-09-09T20:37:36.900Z] FAIL	github.com/docker/docker/daemon/logger/awslogs	0.630s

@thaJeztah
Copy link
Member Author

rebased; @cpuguy83 @tiborvass ptal

Improving CI automation moved this from In progress to Reviewer approved Sep 13, 2019
This test was updated in b79adac, but is still flaky;

```
20:24:13  FAIL: docker_cli_swarm_test.go:1333: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] daemon started
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d204a02ba4780] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  [d873d6a842829] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  docker_cli_swarm_test.go:1413:
20:24:13      c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
20:24:13  ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000934240), Stderr:[]uint8(nil)} ("exit status 1")
20:24:13  ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
20:24:13
20:24:13
20:24:13  [d6f95e679cb65] Stopping daemon
20:24:13  [d6f95e679cb65] exiting daemon
20:24:13  [d6f95e679cb65] Daemon stopped
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
```

The interesting bit there is that the retry loop should have a 3 second sleep before retrying,
but looking at the failure above, the test started (and failed) within a second, which means that
a different error / output was returned.

This patch adds some additional debugging to that test to see if we can catch the reason
this test is still flaky.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah thaJeztah force-pushed the debug_flaky_TestSwarmClusterRotateUnlockKey branch from d3ab0ee to 78d137d Compare September 19, 2019 13:01
@thaJeztah
Copy link
Member Author

rebased

@tiborvass tiborvass merged commit 3cfb680 into moby:master Sep 19, 2019
Improving CI automation moved this from Reviewer approved to Done Sep 19, 2019
@thaJeztah thaJeztah deleted the debug_flaky_TestSwarmClusterRotateUnlockKey branch September 19, 2019 20:05
@thaJeztah thaJeztah added this to the 20.03.0 milestone Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Improving CI
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants