Podman hangs with deadlock on /var/lib/containers/storage/storage.lock #1146

Closed
mandre opened this issue Jul 24, 2018 · 18 comments

mandre commented Jul 24, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description

Got podman in a locked state:

[heat-admin@openstack-controller-0 ~]$ ps aux | grep podman
root       29708  0.0  0.1 480508 15104 ?        Sl   08:47   0:00 /usr/bin/podman run --user root --name docker-puppet-horizon --env PUPPET_TAGS=file,file_line,concat,augeas,cron,horizon_config --env NAME=horizon --env HOSTNAME=openstack-controller-0 --env NO_ARCHIVE= --env STEP=6 --volume /etc/localtime:/etc/localtime:ro --volume /tmp/tmpYRgU6v:/etc/config.pp:ro,z --volume /etc/puppet/:/tmp/puppet-etc/:ro,z --volume /usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro --volume /var/lib/config-data:/var/lib/config-data/rw,z --volume /dev/log:/dev/log:rw --volume /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume /var/lib/docker-puppet/docker-puppet.sh:/var/lib/docker-puppet/docker-puppet.sh:rw,z --entrypoint /var/lib/docker-puppet/docker-puppet.sh --net host --volume /etc/hosts:/etc/hosts:ro 192.168.24.1:8787/tripleomaster/centos-binary-horizon:current-tripleo
root       30967  0.7  0.1 341240 15840 ?        Sl   08:47   0:33 /usr/bin/podman run --user root --name docker-puppet-cinder --env PUPPET_TAGS=file,file_line,concat,augeas,cron,cinder_config,file,concat,file_line,cinder_config,file,concat,file_line,cinder_config,file,concat,file_line --env NAME=cinder --env HOSTNAME=openstack-controller-0 --env NO_ARCHIVE= --env STEP=6 --volume /etc/localtime:/etc/localtime:ro --volume /tmp/tmpFMcsQy:/etc/config.pp:ro,z --volume /etc/puppet/:/tmp/puppet-etc/:ro,z --volume /usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro --volume /var/lib/config-data:/var/lib/config-data/rw,z --volume /dev/log:/dev/log:rw --volume /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume /var/lib/docker-puppet/docker-puppet.sh:/var/lib/docker-puppet/docker-puppet.sh:rw,z --entrypoint /var/lib/docker-puppet/docker-puppet.sh --net host --volume /etc/hosts:/etc/hosts:ro 192.168.24.1:8787/tripleomaster/centos-binary-cinder-api:current-tripleo
root       31014  0.0  0.1 331732 11892 ?        Sl   08:47   0:00 /usr/bin/podman inspect 192.168.24.1:8787/tripleomaster/centos-binary-cron:current-tripleo
heat-ad+   50808  0.0  0.0 112708   968 pts/0    S+   09:59   0:00 grep --color=auto podman

strace shows it's waiting on the lock at "/var/lib/containers/storage/storage.lock":

[heat-admin@openstack-controller-0 ~]$ sudo strace podman ps
[...]
stat("/var/run/containers/storage", {st_mode=S_IFDIR|0700, st_size=80, ...}) = 0
stat("/var/lib/containers/storage", {st_mode=S_IFDIR|0700, st_size=150, ...}) = 0
stat("/var/lib/containers/storage/mounts", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
stat("/var/lib/containers/storage/tmp", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
stat("/var/lib/containers/storage/overlay", {st_mode=S_IFDIR|0700, st_size=12288, ...}) = 0
openat(AT_FDCWD, "/var/lib/containers/storage/storage.lock", O_RDWR|O_CREAT, 0600) = 5
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
getrandom("\272\267\26?\301\310GUa\36L\255+\24I\\\251\235\2333\273\367\227\313-\326K\327\233\371\252\244", 32, 0) = 32
getpid()                                = 51161
fcntl(5, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}
^Cstrace: Process 51161 detached
 <detached ...>

[heat-admin@openstack-controller-0 ~]$ sudo lsof /var/lib/containers/storage/storage.lock
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
podman  29708 root    4u   REG  252,2       64 117440877 /var/lib/containers/storage/storage.lock
podman  30967 root    4u   REG  252,2       64 117440877 /var/lib/containers/storage/storage.lock
podman  31014 root    4uW  REG  252,2       64 117440877 /var/lib/containers/storage/storage.lock
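
For context on what the strace and lsof output above show: the hang is in fcntl(F_SETLKW), an advisory POSIX file lock request that waits until every conflicting lock on storage.lock is released, so the podman ps above blocks for as long as another podman process keeps its write lock. A minimal Go sketch of that locking pattern (an illustration only, not c/storage's actual code; the path is the one from this report and needs root, but any writable file shows the same behavior):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Open (or create) the lock file, like the openat() call in the strace above.
	f, err := os.OpenFile("/var/lib/containers/storage/storage.lock",
		os.O_RDWR|os.O_CREATE, 0600)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	lk := syscall.Flock_t{
		Type:   syscall.F_WRLCK, // exclusive (write) lock
		Whence: 0,               // SEEK_SET
		Start:  0,
		Len:    0, // 0 = lock the whole file
	}
	fmt.Println("requesting write lock; this blocks while another process holds it...")
	// F_SETLKW ("set lock, wait") is the call podman is stuck in above.
	if err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLKW, &lk); err != nil {
		panic(err)
	}
	fmt.Println("lock acquired")
}

Run two copies of this and the second one hangs exactly the way podman ps does, until the first exits and the kernel releases its lock.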

Steps to reproduce the issue:

  1. Launch several containers at the same time

  2. Wait for podman to hang
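
As a rough illustration of step 1, a concurrent-launch sketch along these lines produces the same contention; the image name, the container names, and the count of 10 are placeholders, not the TripleO tooling from the report above:

package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	const n = 10                                            // number of concurrent podman invocations (placeholder)
	image := "registry.access.redhat.com/rhel7/rhel:latest" // placeholder image

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("stress-%d", i)
			// Every invocation opens and locks files under
			// /var/lib/containers/storage, so concurrent runs contend
			// for storage.lock and the per-driver layers.lock.
			out, err := exec.Command("podman", "run", "--rm", "--name", name, image, "true").CombinedOutput()
			if err != nil {
				fmt.Printf("%s: %v\n%s", name, err, out)
			}
		}(i)
	}
	wg.Wait()
}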

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:       0.6.1
Go Version:    go1.9.4
OS/Arch:       linux/amd64

Output of podman info:

I can't provide the output of podman info; it hangs.

mheon (Member) commented Jul 24, 2018

Looks like a c/storage issue, probably around initialization.
@nalind PTAL

rhatdan (Member) commented Jul 24, 2018

If you kill the podman inspect, do things clear up?

mandre (Author) commented Jul 24, 2018

Sorry, I tore down the environment and worked around the issue by disabling concurrency in my script. I'll post again when I reproduce the lock.

By the way, I've noticed I'm running quite an old version of podman. I've seen there are RPMs for 0.6.4 around (I'm using 0.6.1 from centos-extras). Is version 0.7.x already packaged for CentOS?

mheon (Member) commented Jul 24, 2018

@lsm5 is a good source for an answer on that

rhatdan (Member) commented Jul 24, 2018

We release a new podman every week; in the version number, 7 == July and 3 == third week.

rhatdan (Member) commented Jul 24, 2018

@lsm5 We really need to update the README on how to get the latest packaged version for each distro.

mheon added the bug label Jul 27, 2018
mheon (Member) commented Jul 27, 2018

I'm going to leave this open for other people to report potential locking issues, but it's hard to track these down unless we have more to go on.

edsantiago (Collaborator) commented

Reproduced fairly quickly (once - haven't tried again) with the attached script. Looks similar, but this time the deadlock is in podman logs:

# ps auxww|grep podman
root     14402  0.4  0.7 782152 31488 pts/3    SNl+ 17:17   0:01 podman run --detach --name foo registry.access.redhat.com/rhel7/rhel:latest bash -c for NUM in `seq 1 64`; do trap "echo Received $NUM, ignoring..." $NUM; done; echo READY; while :; do echo READY; sleep 0.5; done
root     14513  0.0  0.7 706308 29200 pts/3    Sl+  17:17   0:00 podman logs foo

# strace -p 14513
strace: Process 14513 attached
fcntl(6, F_SETLKW, .... ^C
# lsof -p 14513|grep ' 6u'
podman  14513 root    6u      REG  252,1       64  655571 /var/lib/containers/storage/overlay-layers/layers.lock


# lsof /var/lib/containers/storage/{storage.lock,overlay-layers/layers.lock}
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
podman  14402 root    5u   REG  252,1       64 655568 /var/lib/containers/storage/storage.lock
podman  14402 root    6uW  REG  252,1       64 655571 /var/lib/containers/storage/overlay-layers/layers.lock
podman  14513 root    5uW  REG  252,1       64 655568 /var/lib/containers/storage/storage.lock
podman  14513 root    6u   REG  252,1       64 655571 /var/lib/containers/storage/overlay-layers/layers.lock

Login access available on request until, say, 4pm EDT. After that I'll kill these and start hammering again to see how reliable that reproducer is. (Speaking of which: much of the script is mumbo-jumbo incantations. It may be possible to reduce it some more; I've just been playing somewhat haphazardly).

podman-deadlock-test.txt

mheon (Member) commented Aug 20, 2018

Looks like the same locks as last time.
@nalind PTAL - looks like a deadlock in c/storage: one process holds layers.lock but not storage.lock, and the other the opposite.
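
Spelling the diagnosis out against the lsof output above: podman run (14402) holds layers.lock and is waiting for storage.lock, while podman logs (14513) holds storage.lock and is waiting for layers.lock. A minimal in-process Go sketch of that ABBA ordering problem, with sync.Mutex standing in for the two file locks (an illustration, not c/storage code):

package main

import (
	"sync"
	"time"
)

func main() {
	var storageLock, layersLock sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	// The "podman run" side: takes layers.lock first, then wants storage.lock.
	go func() {
		defer wg.Done()
		layersLock.Lock()
		time.Sleep(100 * time.Millisecond) // let both sides grab their first lock
		storageLock.Lock()                 // blocks forever
		storageLock.Unlock()
		layersLock.Unlock()
	}()

	// The "podman logs" side: takes storage.lock first, then wants layers.lock.
	go func() {
		defer wg.Done()
		storageLock.Lock()
		time.Sleep(100 * time.Millisecond)
		layersLock.Lock() // blocks forever
		layersLock.Unlock()
		storageLock.Unlock()
	}()

	wg.Wait() // never returns; the Go runtime aborts with "all goroutines are asleep - deadlock!"
}

The standard cure is to make every code path acquire the locks in one agreed order (e.g. always storage.lock before layers.lock), which is presumably what a fix in c/storage would have to enforce.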

mheon (Member) commented Aug 20, 2018

@edsantiago managed to reproduce with an strace. Seems to be a race between one of the calls in the image library's NewFromLocal and subsequent calls to storage.GetStore()

mheon (Member) commented Aug 20, 2018

The only call to c/storage in NewFromLocal seems to be store.Image()

mheon (Member) commented Aug 20, 2018

One side seems to be the overlay driver's get() and the other something in load() in c/storage

EmilienM (Contributor) commented

I'm also having this bug when deploying ~30 containers on a busy VM.

EmilienM (Contributor) commented

I've uploaded an sosreport in case that's something that can help: https://files.macchi.pro:8443/sosreport/

steveb (Contributor) commented Sep 11, 2018

I have a self-contained script[1] which shows buildah hitting what I believe is the same lock problem.

For this script, the build usually blocks on the FROM directive, sometimes on the COPY directive. While blocked, calls to "podman images" also block indefinitely. Killing the buildah processes unblocks other calls.

You can tweak the concurrency variable to cause more or less stress. When I set concurrency=4, I see different race issues. For example, sometimes an image doesn't get built, and the following is logged:

error creating build container: error creating container: the container name "centos-working-container-2" is already in use by "f5bda7beecec6896b09eea6d406d733199a4f8b98afc9150fdac4f883f59cdce". You have to remove that container to be able to reuse that name.: that name is already in use

[1] http://paste.openstack.org/show/729907/

mheon (Member) commented Sep 12, 2018

containers/storage#213 should fix this.

steveb (Contributor) commented Sep 12, 2018

I ran my buildah torture script with concurrency=16 (!) and can no longer reproduce my lockup.

[1] http://paste.openstack.org/show/729907/

rhatdan (Member) commented Sep 13, 2018

Awesome.

rhatdan closed this as completed Sep 13, 2018
Luap99 added commits to Luap99/libpod referencing this issue (Sep 21 to Nov 1, 2020), each noting: "This is only temporary until the cobra PR containers#1146 is merged." (Signed-off-by: Paul Holzinger <paul.holzinger@web.de>)