lxd forkstart ... failed to get real path for ... #2814
Comments
cpaelzer commented on Jan 24, 2017:
FYI, I hit that with both:
cpaelzer commented on Jan 24, 2017:
Deleted all containers, then deleted all local images, and then ran "lxc launch" from scratch.
cpaelzer commented on Jan 24, 2017:
I recognized 2bbf30d457ff as one of the images I deleted.
$ sudo ls /var/lib/lxd/images/
OK, that is definitely more than what should still be there, so I dropped all of it. Now everything is empty (zfs list, lxc list and lxc image list), yet something old still seems to affect it: afterwards lxc still doesn't know about any images or containers, but zfs still has an image and a mountpoint. In there is the base of an image.
cpaelzer commented on Jan 24, 2017:
I switched to dir-based backing and things work again as they should.
This sounds like the symlinks in ? Well, the first part of the issue at least.
cpaelzer commented on Jan 24, 2017:
Hi Christian. BTW, for whatever it is worth, I realized it happened on file-backed zfs but not on disk-backed. While reproducing the issue as requested I found that it is now working. Actually, I'll set it up as zfs on a disk instead of on a file before doing so.
cpaelzer commented on Jan 24, 2017:
Ha. Once more, for reference, here is the current error, captured at the same time as you asked:
testkvm-xenial-from is the container that failed to launch.
cpaelzer commented on Jan 24, 2017:
FYI, the series of lxc commands grepped from my log leading up to the issue.
From your log: I suspect this is really the root cause of why the images didn't disappear. Also, although you removed the images from zfs, you also have to remove them from LXD's database in order to really kill them off. I suspect you didn't do that, which is why you're in this confused state. However, the above log entry seems like a legit bug.
cpaelzer commented on Jan 24, 2017:
Hi tycho,
When you say deleted, do you mean you ran "DELETE FROM images WHERE ..." in lxd's database, or just a zfs delete?
cpaelzer commented on Jan 24, 2017:
On Tue, Jan 24, 2017 at 5:05 PM, Tycho Andersen ***@***.***> wrote:

> When you say deleted, do you mean you ran "DELETE FROM images WHERE ..." in lxd's database, or just a zfs delete?

lxd image list shows hashes; for each I ran "lxd image delete <hash>", and eventually "lxd image list" is empty.
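The delete-every-image loop described above can be sketched as a small shell helper. This is a hypothetical sketch, not from the thread: the `delete_all_images` name is mine, and it assumes an lxc client recent enough to support `--format csv`.

```shell
#!/bin/sh
# Hypothetical helper for the cleanup described above: remove every image
# through the LXD client (never via "zfs destroy"), so LXD's database and
# the ZFS datasets stay in sync.
delete_all_images() {
    # -c f limits output to the fingerprint column; --format csv drops the
    # ASCII table decoration so we get one fingerprint per line.
    lxc image list -c f --format csv | while read -r fingerprint; do
        lxc image delete "$fingerprint"
    done
    # When this prints nothing, LXD no longer knows about any image.
    lxc image list -c f --format csv
}
```

Deleting through the client rather than `zfs destroy` matters here: it keeps LXD's database in sync with the storage backend, which is exactly the consistency the thread is worried about.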
Oh, I see. That should be okay, then. I was referring to , which isn't enough to actually delete the images.
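For reference, the direct-database route mentioned earlier ("DELETE FROM images WHERE ...") would look something like the sketch below. This is a hedged assumption, not the thread's procedure: in LXD 2.x the database lived at /var/lib/lxd/lxd.db with an images table keyed by fingerprint; verify path and schema for your version, stop LXD first, and back up the file.

```shell
#!/bin/sh
# Hypothetical sketch of removing an image record straight from LXD's
# sqlite database. Only relevant when "lxc image delete" cannot clean up;
# normally the client command is the right tool.
purge_image_record() {
    fingerprint="$1"
    # Path and table layout are assumptions for LXD 2.x.
    sqlite3 /var/lib/lxd/lxd.db \
        "DELETE FROM images WHERE fingerprint LIKE '${fingerprint}%';"
}
```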
The original error and some of the following problems could be explained by a "zfs umount -a" having been run on the machine. This would have unmounted all the ZFS mountpoints that LXD relies on and removed the mountpoints, making all our symlinks invalid. Deleting containers and images would then have made things worse, as LXD wouldn't have been able to detect that they were on ZFS, causing partial image and container removal, leaving the zfs entries behind, and making a big mess on the system.

In your last example, this would mean that "testkvm-xenial-from.zfs" is empty as it's not mounted.
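The invalid-symlink state described above is easy to check for. A minimal sketch (the `check_dangling_links` name is mine, and the directory argument is parameterized purely for illustration; the real path in the thread is /var/lib/lxd/containers):

```shell
#!/bin/sh
# Hypothetical check for the broken state described above: list symlinks in
# LXD's containers directory whose targets no longer resolve, which is what
# an unexpected "zfs umount -a" would leave behind.
check_dangling_links() {
    dir="${1:-/var/lib/lxd/containers}"
    for link in "$dir"/*; do
        # -L: it is a symlink; ! -e: its target does not resolve
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            echo "dangling: $link"
        fi
    done
}
```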
cpaelzer commented on Jan 24, 2017:
On Tue, Jan 24, 2017 at 5:55 PM, Stéphane Graber ***@***.***> wrote:

> In your last example, this would mean that "testkvm-xenial-from.zfs" is empty as it's not mounted.

IIRC so far it was always mounted and not empty when the issue showed up. Do you want me to reproduce again and confirm that when I hit the issue, before any delete/cleanup, just as I did before for Brauner? Before, I reported on the content of "/var/lib/lxd/containers", but I can easily add what zfs thinks is mounted and what is in testkvm-xenial-from.zfs at that time. But that will be tomorrow for me. Please ping here if you'd need other (or additional) data to be fetched right after the issue occurs.
Yeah, the output of both "zfs list -t all" and "cat /proc/mounts" would be useful to see what's going on. That, and confirming that the directory isn't empty.
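The requested state dump can be collected in one go. A minimal sketch (the `collect_zfs_state` name and the grep filter are mine, not from the thread):

```shell
#!/bin/sh
# Hypothetical diagnostic bundle for the failure above: what ZFS thinks
# exists, what the kernel actually has mounted, and whether the container
# directory is really empty.
collect_zfs_state() {
    zfs list -t all                                    # datasets, snapshots
    grep zfs /proc/mounts 2>/dev/null || echo "no zfs mounts active"
    ls -la /var/lib/lxd/containers/ 2>/dev/null || true
}
```

Capturing all three views at the moment of failure, before any cleanup, is the point: the bug report hinges on ZFS and the kernel disagreeing about what is mounted.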
stgraber added the Incomplete label on Jan 24, 2017
cpaelzer commented on Jan 25, 2017:
FYI - since I run the automation that seems to work as a reproducer on multiple systems, I've found that my s390x box also hits the same case (my x86 system has no zfs backing atm; I'm pretty sure it would hit it as well). That at least confirms that it is not just "one" system, but that it is reproducible on more than that. After hitting the error on this system I again see this issue and can't launch new containers. And yeah, Stephane - it seems empty right when the issue occurs.
zfs reports that as a known mount point, but while empty it seems mounted to me.
Your comments got me thinking about the mount state, and I wanted to see if anything changes if I do
Knowing that hash from the error, I found that this is just the image I wanted to spawn (Xenial).
And in there is
So I'd assume what happened was that at some point the Xenial base image (not the individual guest) got unmounted. Yet this was not caught by any tooling, and because of that, file content got placed directly in the path. I moved the content off that path as I wanted to compare it with what would be in there when zfs is mounted, but the zfs mount was empty. After mounting the zfs path onto the dir I retried starting an instance, but since the zfs mount for the base image is empty it still runs into: Deleting the image via "lxc image delete 7a53ade547cf" unmounts the path again as expected. Cleaning up and switching to dir-backed for now - please advise how to further debug.
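The check-and-remount dance described above can be scripted. A sketch under the assumption that the dataset name is known; the `ensure_image_mounted` name and the example dataset path are placeholders, not taken from the report:

```shell
#!/bin/sh
# Hypothetical helper matching the manual steps above: ask ZFS where the
# dataset should be mounted, check whether the kernel agrees, and remount
# if not. Example (placeholder): ensure_image_mounted lxd/images/7a53ade547cf
ensure_image_mounted() {
    dataset="$1"
    # -H -o value prints just the mountpoint property, no header
    mp=$(zfs get -H -o value mountpoint "$dataset")
    if ! mountpoint -q "$mp"; then
        echo "remounting $dataset at $mp"
        zfs mount "$dataset"
    fi
}
```

Note the failure mode the thread describes: if files get written into the path while the dataset is unmounted, remounting hides them, so move any stray content aside before running something like this.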
Ok, so that seems to be pretty reliably reproducible for you. Can you give us step-by-step instructions (commands, please) to get into that state from a cleanly installed LXD host with ZFS? Since you're the only one who's ever reported anything like this, there must be something odd somewhere in there which is causing a zfs mount failure at some point, but without being able to reproduce this ourselves, the odds of figuring it out are pretty slim.
cpaelzer commented on Jan 26, 2017:
Sure. Run stages 1 & 2 from test-dev-ppa.sh and just comment out the rest - so far, with zfs enabled, it ends up in the bad state in about 50% of the cases. It seems to always work on the first try, so just run it in a loop until you are in the bad state. If you want to start simple, these are all the lxc commands from my log leading to the issue. You'd just have to pick up a few minor things (like my kvm profile) from the git above. Much simpler for sure, but since I don't know whether what I do in the containers is important, I wanted to pass on the full test set as well. Let me know if this works to reproduce - otherwise we have to consider me passing you a login to a broken system.
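Running the reproducer in a loop until the bad state appears could look like this. A generic sketch: `run_until_failure` is a name I made up, and the reproducer script is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical retry harness: run the reproducer repeatedly and stop at the
# first failure, so the broken state is left in place for inspection.
# Usage sketch: run_until_failure ./test-dev-ppa.sh
run_until_failure() {
    attempts=0
    while "$@"; do
        attempts=$((attempts + 1))
        echo "attempt $attempts succeeded, retrying"
    done
    echo "reproducer failed after $attempts successful runs; state preserved"
}
```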
powersj commented on Feb 28, 2017:
Ran into this same issue today on our amd64 test system. If there is additional data you wish me to collect, let me know. I will try to get steps to reproduce, but my observation has been that it starts happening after we run cpaelzer's tests as he describes above.
powersj commented on Mar 11, 2017:
Another time: I have noticed a correlation when this failure occurs: I have existing containers running on the daily image, and when a new container attempts to launch, it also attempts to download the new daily image. I have also noticed that if I go into /var/lib/lxd/images and blow away the offending image, everything works again. However, in this case I have tests running on other containers, so I do not want to do that.
@powersj, that sounds like a race between image {deletion,creation} and container creation. Storage-api capable LXD instances are a little better in this regard, since I gave them a slightly better locking mechanism. If you have machines using LXD 2.11, can you try to reproduce there?
powersj commented on Mar 13, 2017:
I have updated our amd64 and ppc64el systems to LXD 2.11 via the ppa. Will report back later this week.
powersj commented on Mar 15, 2017:
We had LXD failures occur last night while running on LXD 2.11. However, when I got on the system and tried to launch images, I was able to. In the past, to eventually recover from this situation I had to go in and delete the offending image in /var/lib/lxd/images. The error message has also changed, so I am hoping this is due to your locking mechanism:
@powersj, thanks! Can you please also provide the output of `zfs list -t all`? :)
powersj commented on Mar 15, 2017:
Here you are: https://paste.ubuntu.com/24183037/ — @brauner, I can also get you on the system if you want.
Hm, I wonder whether this is the same error. If it is, then we should be able to fix it, because the fact that the umount failed might be caused by not using
cpaelzer commented on Mar 15, 2017:
Happy to provide a test automation that, as a side effect, finds bugs even in software it is not supposed to test :-)
But really, brauner, we can get you onto the system to check in place if that helps.
jfkw commented on Mar 15, 2017:
I'm seeing the same error on Gentoo Linux. Config info on this system:
cpaelzer commented on Jan 24, 2017:
Required information
Issue description
Hi,
I ran into a situation where I was unable to start any more lxd containers. It only seems to happen on the systems I have that use zfs as the backing store for containers. I don't think it is related, but so far it is only occurring on my ppc64el boxes (which just happen to be the zfs ones atm).
It seems to work once (starting a bunch of containers and working with them), but after I removed them and re-created them from scratch I ran into:
Here a lxc info after the failed launch:
Log:
Ok, I thought it could be my KVM profile, so I tried without it:
Well, that is a different error, but still not working.
Yet FYI here the current KVM profile:
Steps to reproduce
I'm guessing here :-/
Information to attach
Related bugs / discussions I've found: