integration-cli: add more debugging for TestSwarmClusterRotateUnlockKey #39885

thaJeztah · 2019-09-09T20:21:04Z

Relates to #38885 Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey

This test was updated in b79adac (#39616), but is still flaky (see #39883 (comment));

20:24:13  FAIL: docker_cli_swarm_test.go:1333: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] daemon started
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d204a02ba4780] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  [d873d6a842829] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  docker_cli_swarm_test.go:1413:
20:24:13      c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
20:24:13  ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000934240), Stderr:[]uint8(nil)} ("exit status 1")
20:24:13  ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
20:24:13
20:24:13
20:24:13  [d6f95e679cb65] Stopping daemon
20:24:13  [d6f95e679cb65] exiting daemon
20:24:13  [d6f95e679cb65] Daemon stopped
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped

The interesting bit there is that the retry loop should sleep for 3 seconds before retrying,
but in the failure above the test started (and failed) within a second, which means that
a different error / output was returned than the one the retry handles.

This patch adds some additional debugging to that test to see if we can catch the reason
this test is still flaky.

thaJeztah · 2019-09-09T20:23:32Z

Logs from that failing test;

d6f95e679cb65.log
d204a02ba4780.log
d873d6a842829.log

thaJeztah · 2019-09-09T21:43:31Z

Failure on RS1 is #39857

https://ci.docker.com/public/blue/rest/organizations/jenkins/pipelines/moby/branches/PR-39885/runs/1/nodes/200/log/?start=0

[2019-09-09T20:37:36.900Z] ok  	github.com/docker/docker/daemon/logger	0.525s	coverage: 43.0% of statements
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:35Z" level=info msg="Trying to get region from EC2 Metadata"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=info msg="Log stream already exists" errorCode=ResourceAlreadyExistsException logGroupName= logStreamName= message= origError="<nil>"
[2019-09-09T20:37:36.900Z] --- FAIL: TestLogBlocking (0.02s)
[2019-09-09T20:37:36.900Z]     cloudwatchlogs_test.go:313: Expected to be able to read from stream.messages but was unable to
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg=Error
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg="Failed to put log events" errorCode=InvalidSequenceTokenException logGroupName=groupName logStreamName=streamName message="use token token" origError="<nil>"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg="Failed to put log events" errorCode=DataAlreadyAcceptedException logGroupName=groupName logStreamName=streamName message="use token token" origError="<nil>"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=info msg="Data already accepted, ignoring error" errorCode=DataAlreadyAcceptedException logGroupName=groupName logStreamName=streamName message="use token token"
[2019-09-09T20:37:36.900Z] FAIL
[2019-09-09T20:37:36.900Z] coverage: 78.2% of statements
[2019-09-09T20:37:36.900Z] FAIL	github.com/docker/docker/daemon/logger/awslogs	0.630s

thaJeztah · 2019-09-13T22:50:44Z

rebased; @cpuguy83 @tiborvass ptal

This test was updated in b79adac, but is still flaky;

```
20:24:13  FAIL: docker_cli_swarm_test.go:1333: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] daemon started
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d204a02ba4780] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  [d873d6a842829] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  docker_cli_swarm_test.go:1413:
20:24:13      c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
20:24:13  ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000934240), Stderr:[]uint8(nil)} ("exit status 1")
20:24:13  ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
20:24:13
20:24:13
20:24:13  [d6f95e679cb65] Stopping daemon
20:24:13  [d6f95e679cb65] exiting daemon
20:24:13  [d6f95e679cb65] Daemon stopped
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
```

The interesting bit there is that the retry loop should have a 3 second sleep before retrying,
but looking at the failure above, the test started (and failed) within a second, which means that
a different error / output was returned.

This patch adds some additional debugging to that test to see if we can catch the reason
this test is still flaky.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

thaJeztah · 2019-09-19T13:02:07Z

rebased

thaJeztah added status/2-code-review area/testing area/swarm labels Sep 9, 2019

thaJeztah added this to In progress in Improving CI via automation Sep 9, 2019

GordonTheTurtle added the status/0-triage label Sep 9, 2019

thaJeztah mentioned this pull request Sep 9, 2019

Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey #38885

Closed

thaJeztah removed the status/0-triage label Sep 9, 2019

thaJeztah mentioned this pull request Sep 9, 2019

pkg/parsers/kernel: gofmt hex value (preparation for Go 1.13+) #39883

Merged

thaJeztah force-pushed the debug_flaky_TestSwarmClusterRotateUnlockKey branch from bbef0e8 to d3ab0ee Compare September 13, 2019 22:49

Improving CI automation moved this from In progress to Reviewer approved Sep 13, 2019

tiborvass approved these changes Sep 13, 2019

thaJeztah force-pushed the debug_flaky_TestSwarmClusterRotateUnlockKey branch from d3ab0ee to 78d137d Compare September 19, 2019 13:01

tiborvass approved these changes Sep 19, 2019

tiborvass merged commit 3cfb680 into moby:master Sep 19, 2019

Improving CI automation moved this from Reviewer approved to Done Sep 19, 2019

thaJeztah deleted the debug_flaky_TestSwarmClusterRotateUnlockKey branch September 19, 2019 20:05

thaJeztah added the process/cherry-pick label Sep 25, 2019

thaJeztah mentioned this pull request Sep 25, 2019

[19.03 backport] Testing and Jenkinsfile changes [step 1] docker-archive/engine#382

Merged

vikramhh mentioned this pull request Nov 18, 2019

Bump hcsshim to b3f49c06ffaeef24d09c6c08ec8ec8425a0303e2 #40128

Closed

thaJeztah added this to the 20.03.0 milestone Apr 2, 2020

thaJeztah removed the process/cherry-pick label Feb 18, 2022

thaJeztah mentioned this pull request Jan 2, 2024

De-flake TestSwarmClusterRotateUnlockKey... again... maybe? #47009

Merged
