Flaky test: TestCheckpoint #38963

Open
thaJeztah opened this issue Mar 28, 2019 · 5 comments
Labels
area/checkpoint: Related to (experimental) checkpoint/restore (CRIU)
area/testing
kind/bug: Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.
kind/experimental

Comments

@thaJeztah
Member

This test was added recently in #38452, but it looks to be flaky.

It has been seen failing on #38910, #38441, #38952, and #38939.

For example:

https://jenkins.dockerproject.org/job/Docker-PRs-experimental/44645/console

01:42:48 === RUN   TestCheckpoint
01:42:50 --- FAIL: TestCheckpoint (1.54s)
01:42:50     checkpoint_test.go:40: Error (criu/util.c:816): exited, status=3
01:42:50         Warn  (criu/net.c:2840): Unable to get socket network namespace
01:42:50         Warn  (criu/net.c:2840): Unable to get tun network namespace
01:42:50         Warn  (criu/sk-unix.c:229): unix: Unable to open a socket file: Bad address
01:42:50         Warn  (criu/net.c:2840): Unable to get socket network namespace
01:42:50         Warn  (criu/kerndat.c:881): Can't keep kdat cache on non-tempfs
01:42:50         Looks good.
01:42:50     checkpoint_test.go:51: Start a container
01:42:50     checkpoint_test.go:69: ++ type -P true
01:42:50         ++ type -P ip6tables-restore
01:42:50         + mount --bind /bin/true /sbin/ip6tables-restore
01:42:50         ++ type -P true
01:42:50         ++ type -P ip6tables-save
01:42:50         + mount --bind /bin/true /sbin/ip6tables-save
01:42:50     checkpoint_test.go:81: Do a checkpoint and leave the container running
01:42:50     checkpoint_test.go:24: Exec: [touch /tmp/test-file]
01:42:50     checkpoint_test.go:28: 
01:42:50     checkpoint_test.go:115: Do a checkpoint and stop the container
01:42:50     checkpoint_test.go:117: assertion failed: error is not nil: Error response from daemon: Cannot checkpoint container 632ee0139e1239e1fa4a6bbd6e37ad7c21500fe4d2d8ee8139d40e26cebd6e92: cannot checkpoint a stopped container: unknown
01:42:50     checkpoint_test.go:77: ++ type -P ip6tables-restore
01:42:50         + umount -c -i -l /sbin/ip6tables-restore
01:42:50         ++ type -P ip6tables-save
01:42:50         + umount -c -i -l /sbin/ip6tables-save
01:42:50     main_test.go:32: assertion failed: error is not nil: Error response from daemon: Container 632ee0139e1239e1fa4a6bbd6e37ad7c21500fe4d2d8ee8139d40e26cebd6e92 is not paused: failed to unpause container 632ee0139e1239e1fa4a6bbd6e37ad7c21500fe4d2d8ee8139d40e26cebd6e92
thaJeztah added the kind/bug, kind/experimental, and area/testing labels Mar 28, 2019
@thaJeztah
Member Author

ping @kolyshkin @avagin @rst0git

@thaJeztah
Member Author

Attaching the daemon and test logs:

docker.log
test.log

@rst0git
Contributor

rst0git commented Mar 28, 2019

returned error: Cannot checkpoint container 632ee0139e1239e1fa4a6bbd6e37ad7c21500fe4d2d8ee8139d40e26cebd6e92: cannot checkpoint a stopped container: unknown

@thaJeztah
Member Author

Yes, something weird is definitely happening (perhaps an earlier test doesn't clean up properly after itself?)

@kolyshkin
Contributor

This looks like a race between Exec (whose process has not been deleted yet) and Checkpoint. I have to look deeper, but it seems Checkpoint sets the exec task state to Paused; then (p *process) Delete() is called (asynchronously, from some event loop) and cannot delete that process because of its state, here:
https://github.com/containerd/containerd/blob/f2a20ead833f8caf3ffc12be058d6ce668b4ebed/process.go#L204

Projects
Improving CI
  
To do

3 participants