
Flaky test: TestDockerNetworkHostModeUngracefulDaemonRestart #19368

Closed
clnperez opened this issue Jan 15, 2016 · 14 comments
@clnperez
Contributor

Description of problem:
This one has failed a few times in the past couple of days with gccgo:
https://jenkins.dockerproject.org/job/Docker%20Master%20%28gccgo%29/1285/consoleFull
https://jenkins.dockerproject.org/job/Docker%20Master%20%28gccgo%29/1287/consoleFull
https://jenkins.dockerproject.org/job/Docker%20Master%20%28gccgo%29/1294/consoleFull

docker version:
1.10.10-dev (latest upstream)

docker info:
❗ This is to make Gordon happy, and should be ignored since it's from my laptop (running in a container), not one of Docker's test nodes.

./docker info

Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 1.10.0-dev
Storage Driver: devicemapper
Pool Name: docker-253:2-5914896-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop2
Metadata file: /dev/loop3
Data Space Used: 11.8 MB
Data Space Total: 107.4 GB
Data Space Available: 84.95 GB
Metadata Space Used: 581.6 kB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.147 GB
Udev Sync Supported: false
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Either use --storage-opt dm.thinpooldev or use --storage-opt dm.no_warn_on_loop_devices=true to suppress this warning.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.82 (2013-10-04)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 4.2.8-200.fc22.x86_64
Operating System: Ubuntu 14.04.3 LTS (containerized)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.678 GiB
Name: 703f7cbf4c89
ID: 5VYE:UHVY:NVE7:FXOH:6XFB:MT4G:2WOX:BXUT:5E7S:VFRS:EL4S:6HOX
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

uname -a:
❗ same comment as above.
4.2.8-200.fc22.x86_64

Environment details (AWS, VirtualBox, physical, etc.):
docker's jenkins (see above links)

How reproducible:
Flaky; I've only ever seen it in Docker's Jenkins builds.

Steps to Reproduce:

  1. someone commits something
  2. it gets merged
  3. the jenkins build is triggered

Actual Results:
Sometimes the test fails

Expected Results:
The test always passes

Additional info:
I'll try to look into it today. I haven't tried recreating it locally but wanted to get it into an issue so others can "me too" it and so I don't forget about it.

@thaJeztah
Member

Just had this one again with gccgo https://jenkins.dockerproject.org/job/Docker-PRs-gccgo/748/console

@thaJeztah
Member

02:11:35 
02:11:35 ----------------------------------------------------------------------
02:11:35 FAIL: docker_cli_network_unix_test.go:945: TestDockerNetworkHostModeUngracefulDaemonRestart.pN59_github_com_docker_docker_integration_cli.DockerNetworkSuite
02:11:35 
02:11:35 [d82681000] waiting for daemon to start
02:11:35 [d82681000] daemon started
02:11:35 [d82681000] exiting daemon
02:11:35 [d82681000] waiting for daemon to start
02:11:35 [d82681000] daemon started
02:11:35 docker_cli_network_unix_test.go:966:
02:11:35     c.Assert(err, checker.IsNil)
02:11:35 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc20d85cb80)} ("exit status 1")
02:11:35 
02:11:35 [d82681000] exiting daemon
02:11:37 
02:11:37 ----------------------------------------------------------------------

@clnperez
Contributor Author

Thanks @thaJeztah. I was able to get it to fail locally at least once last week, but I'll have to try again since I keep getting sidetracked and I've since lost that container.

@clnperez
Contributor Author

This also fails on ARM with golang. https://jenkins.dockerproject.org/job/Docker-PRs-arm/97/console

@tophj-ibm
Contributor

some debug info

FAIL: docker_cli_network_unix_test.go:995: TestDockerNetworkHostModeUngracefulDaemonRestart.pN59_github_com_docker_docker_integration_cli.DockerNetworkSuite

[d46248000] waiting for daemon to start
[d46248000] daemon started
[d46248000] exiting daemon
[d46248000] waiting for daemon to start
[d46248000] daemon started
docker_cli_network_unix_test.go:1017:
    c.Assert(err, checker.IsNil, check.Commentf(fmt.Sprintf("Error starting %s: %s", cName, runningOut)))
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc209785360)} ("exit status 1")
... Error starting hostc-9: 
Error: No such image or container: hostc-9


[d46248000] exiting daemon

@clnperez
Contributor Author

So apparently @tophj-ibm can recreate this a lot more easily than I can, but I put in an inaccurate debug message. It should be an "inspect error," not a "start error," in case that confuses anyone.

@thaJeztah
Member

@mavenugo
Contributor

mavenugo commented Feb 1, 2016

@thaJeztah i will look into it.

@clnperez
Contributor Author

clnperez commented Feb 2, 2016

I've been looking into this, and I've seen it fail in two ways: 1) the container doesn't exist, or 2) the container isn't started. It's always the last container that's the problem. The Start() function just checks whether the daemon is responding to requests. So we can either add a sleep to the test, or rework Start() a bit. I have a feeling that reworking Start() might also require adding some logic in the daemon itself.

@mavenugo
Contributor

mavenugo commented Feb 3, 2016

@clnperez thanks. I'm trying to understand why this is seen almost consistently in the gccgo CI but not in other CI runs.

@clnperez
Contributor Author

clnperez commented Feb 3, 2016

@mavenugo Go code compiled with gccgo isn't as optimized as Go code compiled with gc (there may be some flags we could add to our builds, but I'm not sure anyone has dug into it much), so things run more slowly. I've seen that pretty consistently. That doesn't prove this is a timing issue, but it could be why we only see it on gccgo.

@clnperez
Contributor Author

clnperez commented Feb 4, 2016

Hm @tiborvass, @mavenugo, looks like this failed again on gccgo: https://jenkins.dockerproject.org/job/Docker%20Master%20%28gccgo%29/1584/consoleFull

le sigh

@icecrime
Contributor

@clnperez @mavenugo Any ideas? This is becoming a huge pain point; I'll skip the test if we can't figure out a better way.

@tophj-ibm
Contributor

Just starting to look into this issue again.

I'm getting this error when trying to load the last container that was originally started (not necessarily the last container to be restarted):

Failed to load container 7d00772d8d3242210243177bc2142f0b406008db6fc27e6b41a3ec5e9119d555: EOF

It's possible the daemon kill is happening before the container has fully started; I'll continue to investigate.

tophj-ibm added a commit to tophj-ibm/moby that referenced this issue Feb 11, 2016
Fixes moby#19368 by waiting until all container statuses are running
before killing the daemon

Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>