lxc doesn't boot anymore. #7076

Closed
esosan opened this issue Mar 24, 2020 · 40 comments

@esosan esosan commented Mar 24, 2020

  • Distribution: ubuntu
  • Distribution version: 18.04.4
  • The output of "lxc info" or if that fails:
    • Kernel version: 4.15.0-91-generic
    • LXC version:
    • LXD version: 3.23 (snap)
    • Storage backend in use: lvm (xfs)

Issue description

LXD fails to start. I think the problem is related to some update applied during the startup process.

No containers start and I can't use the lxc command.

Here is the /var/snap/lxd/common/lxd/logs/lxd.log output. It keeps looping.

t=2020-03-24T09:13:07+0100 lvl=info msg="LXD 3.23 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-03-24T09:13:07+0100 lvl=info msg="Kernel uid/gid map:" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - u 0 0 4294967295" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - g 0 0 4294967295" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Configured LXD uid/gid map:" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - u 0 1000000 1000000000" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - g 0 1000000 1000000000" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Kernel features:" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - netnsid-based network retrieval: no" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - uevent injection: no" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - seccomp listener: no" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - seccomp listener continue syscalls: no" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - cgroup layout: hybrid" 
t=2020-03-24T09:13:07+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - shiftfs support: disabled" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Initializing local database" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-03-24T09:13:07+0100 lvl=info msg="REST API daemon:" 
t=2020-03-24T09:13:07+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-03-24T09:13:07+0100 lvl=info msg=" - binding TCP socket" socket=[::]:8443
t=2020-03-24T09:13:07+0100 lvl=info msg="Initializing global database" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Firewall loaded driver \"xtables\"" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Initializing storage pools" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Applying patch \"storage_rename_custom_volume_add_project\"" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Skipping already renamed custom volume \"default_####data\" in pool \"data\"" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Skipping already renamed custom volume \"default_####Virtual\" in pool \"data\"" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Skipping already renamed custom volume \"default_####data\" in pool \"data\"" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Skipping already renamed custom volume \"####virtual\" in pool \"data\"" 
t=2020-03-24T09:13:07+0100 lvl=info msg="Skipping already renamed custom volume \"####virtual\" in pool \"data\"" 
t=2020-03-24T09:13:08+0100 lvl=info msg="Renaming custom volume \"####report\" in pool \"data\" to \"default_####report\"" 
t=2020-03-24T09:13:19+0100 lvl=eror msg="Failed to start the daemon: Failed applying patch \"storage_rename_custom_volume_add_project\": Failed to run: lvrename /dev/lxcDATA/custom_####report /dev/lxcDATA/custom_default_####report: Existing logical volume \"custom_####report\" not found in volume group \"lxcDATA\"" 

@tomponline tomponline commented Mar 24, 2020

@esosan this is being caused by the custom volumes in projects upgrade patch failing.

Please can you show me the output of the lvs command on your system.

Also, are you using the LVM volume group for non-LXD volumes by any chance?

@tomponline tomponline self-assigned this Mar 24, 2020

@esosan esosan commented Mar 24, 2020

Hi tomponline, here is my lvs output.

There is an LVM VG. The VG used by lxc is a nested LVM.

  LV                                    VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
   containers_####--PROD                 lxc     -wi-a-----  11.00g                                                    
   containers_####--mgnl--auth--PROD     lxc     -wi-a-----  11.00g                                                    
   containers_####--mgnl--pub--PROD      lxc     -wi-a-----  11.00g                                                    
   containers_awstats                    lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  24.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####--PROD                 lxc     -wi-a-----  11.00g                                                    
   containers_####--PROD                 lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_mgnl--####                 lxc     -wi-a-----  11.00g                                                    
   containers_####db                     lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####--auth--PROD           lxc     -wi-a-----  11.00g                                                    
   containers_####--pub--PROD            lxc     -wi-a-----  11.00g                                                    
   containers_####                       lxc     -wi-a-----  11.00g                                                    
   containers_####--gtw--PROD            lxc     -wi-a-----  11.00g                                                    
   containers_####--PROD                 lxc     -wi-a-----  11.00g                                                    
   custom_default_####                   lxcDATA -wi-a-----   5.00g                                                    
   custom_default_####                   lxcDATA -wi-a-----  15.00g                                                    
   custom_default_####                   lxcDATA -wi-a-----   2.00g                                                    
   custom_default_####--auth--virtual    lxcDATA -wi-a-----  11.00g                                                    
   custom_default_####--pub--virtual     lxcDATA -wi-a-----  11.00g                                                    
   backup                                vg      -wi-ao----  60.00g                                                    
   ####ASSETS                            vg      -wi-ao---- 500.00m                                                    
   ####VIDEO                             vg      -wi-ao---- 600.00g                                                    
   home                                  vg      -wi-ao---- <10.00g                                                    
   lxc00                                 vg      -wi-ao----  50.00g                                                    
   lxc01                                 vg      -wi-ao----  50.00g                                                    
   lxc02                                 vg      -wi-ao----  50.00g                                                    
   lxc03                                 vg      -wi-ao----  30.00g                                                    
   lxc04                                 vg      -wi-ao----  60.00g                                                    
   lxc05                                 vg      -wi-ao---- 100.00g                                                    
   lxcDATA00                             vg      -wi-ao----  50.00g                                                    
   lxcDATA01                             vg      -wi-a-----  50.00g                                                    
   sambashare                            vg      -wi-ao----   5.00g                                                    
   swap                                  vg      -wi-ao----  15.00g      

@tomponline tomponline commented Mar 24, 2020

@esosan so what appears to be the issue is that in your LXD database you have a custom volume called ####report, which LXD would expect to have an associated LVM volume called custom_####report in your lxcDATA volume group.

As per the new custom volumes in projects feature, the patch is attempting to rename the custom_####report LVM volume to custom_default_####report, but from your lvs output I can see that the expected custom_####report volume doesn't exist.

Is it possible that the custom volume has been removed in the past manually from the volume group and the database entry left in LXD?
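A quick way to confirm on your side (a sketch only, assuming the lxcDATA volume group named in the error; the redacted volume name is kept redacted here):

sudo lvs lxcDATA
sudo lvdisplay /dev/lxcDATA/custom_####report

The second command should fail with a "not found" error if the logical volume really is missing.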


@tomponline tomponline commented Mar 24, 2020

@esosan in that output you also seem to have multiple logical volumes in the same volume group sharing the same name, which shouldn't be possible. Is that a direct dump of the lvs output?


@esosan esosan commented Mar 24, 2020

@tomponline can I mail you the original output to tomp@tomp.uk?


@tomponline tomponline commented Mar 24, 2020

@esosan yes that would be fine thanks


@tomponline tomponline commented Mar 24, 2020

@esosan ah, so have you artificially blanked out some of the LV names to avoid publicly posting them?


@tomponline tomponline commented Mar 24, 2020

@esosan please can you also send me the unredacted error log from LXD


@esosan esosan commented Mar 24, 2020

done


@tomponline tomponline commented Mar 24, 2020

@esosan so I didn't get your unredacted LXD error log, but looking at your unredacted lvs output I can see that there is no logical volume ending in report. I suspect that at some point a logical volume was removed manually rather than deleted through LXD's CLI, and as such LXD doesn't know it has gone. This is why the rename is failing.

If you could send me the unredacted error log I can advise which logical volume you need to manually re-create.


@esosan esosan commented Mar 24, 2020

I'm not really sure because this server has been online for a very long time, but it seems strange to me that I would have manually removed the volume. I just mounted the container volume and the report directory exists and is full of up-to-date data.


@tomponline tomponline commented Mar 24, 2020

@esosan thanks for the log; it's as I thought, the logical volume has been manually deleted at some point and LXD doesn't know about it.

So I suggest re-creating the volume manually so that LXD can rename it and start cleanly.

The full name has been redacted here.
The size doesn't matter if you are going to delete it later (via lxc storage volume delete <pool> <volname>):

sudo lvcreate -n custom_###creport -L 1G lxcDATA
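Once the placeholder exists, restarting the daemon should let the patch complete, after which the placeholder can be removed through LXD. Roughly (a sketch, assuming the snap service name and the pool name data from the log above):

sudo systemctl restart snap.lxd.daemon
lxc storage volume delete data <volname>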

However I am interested that you say you can still mount the report volume, can you show me how you do that?


@esosan esosan commented Mar 24, 2020

Sorry for the misunderstanding. I mounted the container's root volume: the directory where the report volume is supposed to be mounted has files in it. I did it to see if the report was still in use.


@esosan esosan commented Mar 24, 2020

GREAT! All the containers are coming back online!

Thank you very much @tomponline.
I really appreciate your commitment to this project!


@tomponline tomponline commented Mar 24, 2020

@esosan excellent, glad that resolved the issue!

@tomponline tomponline closed this Mar 24, 2020

@rldleblanc rldleblanc commented Mar 26, 2020

I'm having the same problem, but haven't used LVM for my volumes. I do use BTRFS and have automated snapshots.

My error is slightly different:
"Failed to start the daemon: Failed applying patch \"storage_rename_custom_volume_add_project\": Volume must not be a snapshot"


@tomponline tomponline commented Mar 26, 2020

@rldleblanc can you send the full output of the lxd startup logs please so we can see what it is trying to rename.

@tomponline tomponline reopened this Mar 26, 2020

@rldleblanc rldleblanc commented Mar 26, 2020

It is trying to rename snapshots that I did remove. There could be hundreds of them... :( If I have to restart the daemon after every one to find the next one, that will be sad. Can I edit the database directly to drop those snapshots?


@tomponline tomponline commented Mar 26, 2020

@rldleblanc please can you enable debug logging so I can get a picture of what it is trying to do.

And you're saying that you've removed snapshots manually rather than using LXD cli? Thanks


@tomponline tomponline commented Mar 26, 2020

sudo snap set lxd daemon.debug=true
sudo systemctl reload snap.lxd.daemon
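The extra debug output should then show up in the same log file referenced earlier, e.g.:

sudo tail -f /var/snap/lxd/common/lxd/logs/lxd.log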

@rldleblanc rldleblanc commented Mar 26, 2020

Or at least query the database to generate all the snapshots in a loop. I thought I was using the LXD CLI, but I must not have been. I sent the logs to your e-mail.


@stgraber stgraber commented Mar 26, 2020

You can get an idea of roughly what's in the DB with:

sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin "SELECT * FROM storage_volumes;"

But note that this is a convenience read-only DB dump; making changes to it will not help.

In general the best option is to manually re-create what's missing and then properly delete it through the LXD API.

If that's not an option, then you can write a SQL patch file at /var/snap/lxd/common/lxd/database/patch.global.sql which will be run by LXD on startup, but that's obviously far riskier as it's easy to accidentally remove the wrong records.
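For illustration only, a hypothetical patch file dropping a single stale snapshot record might look like this (the table name is the one queried later in this thread; verify the exact rows against the read-only dump before attempting anything like it):

cat <<'EOF' | sudo tee /var/snap/lxd/common/lxd/database/patch.global.sql
DELETE FROM storage_volumes_snapshots WHERE name = '<snapshot-name>';
EOF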


@tomponline tomponline commented Mar 26, 2020

Thanks @rldleblanc, the issue looks to be this volume, mythtv_data/mythtv.mythtv_data.2019-11-05_06:34:59, because it contains a / in the name (which is supposed to be used exclusively to indicate a snapshot volume). Let me see how this could occur.


@tomponline tomponline commented Mar 26, 2020

Thanks @stgraber. Yes @rldleblanc, it would be good to get a copy of what's in your LXD database for comparison too.


@rldleblanc rldleblanc commented Mar 26, 2020

I had just found the db and installed sqlite when @stgraber mentioned it. At least that way I can generate all the snapshots again. Sending the db to your e-mail.


@tomponline tomponline commented Mar 26, 2020

@rldleblanc I would hold off making any DB or storage changes right now, as it may not be missing snapshots. A copy of the storage_volumes table would be useful.


@rldleblanc rldleblanc commented Mar 26, 2020

I did remove all those snapshots as they were holding too much storage for very transient data.


@rldleblanc rldleblanc commented Mar 26, 2020

The '/' in the name allowed me to group snapshots for containers into subdirectories.


@tomponline tomponline commented Mar 26, 2020

@rldleblanc ok, so what is confusing me right now is that the parent volume mythtv_data has a volume type of 2 (meaning custom) in the storage_volumes table, which is expected.

However the snapshots should not have been returned as part of the DB query when selecting custom volumes to rename.

id|name|storage_pool_id|node_id|type|description|project_id
56|mythtv_data|4|1|2||1

The snapshots are held in a separate table

select * from storage_volumes_snapshots where storage_volume_id = 56;
id|storage_volume_id|name|description
5874|56|mythtv.mythtv_data.2019-11-05_06:34:59|
5892|56|mythtv.mythtv_data.2019-11-06_06:35:21|
5910|56|mythtv.mythtv_data.2019-11-07_06:34:13|
5928|56|mythtv.mythtv_data.2019-11-08_06:34:42|
5946|56|mythtv.mythtv_data.2019-11-09_06:34:09|
5964|56|mythtv.mythtv_data.2019-11-10_06:35:22|
5982|56|mythtv.mythtv_data.2019-11-11_06:34:07|
6000|56|mythtv.mythtv_data.2019-11-12_06:35:07|
6018|56|mythtv.mythtv_data.2019-11-13_06:33:52|
6036|56|mythtv.mythtv_data.2019-11-14_06:35:14|
6054|56|mythtv.mythtv_data.2019-11-15_06:34:07|
6072|56|mythtv.mythtv_data.2019-11-16_06:35:40|
6090|56|mythtv.mythtv_data.2019-11-17_06:34:09|
6108|56|mythtv.mythtv_data.2019-11-18_06:36:31|
6126|56|mythtv.mythtv_data.2019-11-19_06:34:27|
6144|56|mythtv.mythtv_data.2019-11-20_06:35:05|
6162|56|mythtv.mythtv_data.2019-11-21_06:35:29|
6180|56|mythtv.mythtv_data.2019-11-22_06:35:06|
6198|56|mythtv.mythtv_data.2019-11-23_06:36:43|
6216|56|mythtv.mythtv_data.2019-11-24_06:37:32|
6234|56|mythtv.mythtv_data.2019-11-25_06:34:43|
6252|56|mythtv.mythtv_data.2019-11-26_06:36:32|
6270|56|mythtv.mythtv_data.2019-11-27_06:34:16|
6288|56|mythtv.mythtv_data.2019-11-28_06:36:18|
6306|56|mythtv.mythtv_data.2019-11-29_06:37:14|
6324|56|mythtv.mythtv_data.2019-11-30_06:34:18|
6342|56|mythtv.mythtv_data.2019-12-01_06:35:30|
6360|56|mythtv.mythtv_data.2019-12-02_06:32:18|
6378|56|mythtv.mythtv_data.2019-12-03_06:33:39|
6396|56|mythtv.mythtv_data.2019-12-04_06:32:12|

@tomponline tomponline commented Mar 26, 2020

@rldleblanc I'm going to try and re-create this locally and see if the query is not doing the right thing.

tomponline added a commit to tomponline/lxd that referenced this issue Mar 26, 2020
…st of custom volumes to be renamed

Snapshots of custom volumes will be renamed as part of parent volume rename.

Fixes lxc#7076

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>

@tomponline tomponline commented Mar 26, 2020

@rldleblanc I think this is caused by a bug in the database query used to select the custom volumes to rename; it is currently returning snapshots too. I'm not sure whether the function used to retrieve the custom volumes is at fault or whether that was expected, but I've added a fix to the patch to ignore any snapshots returned by the function.

I've asked @stgraber and @freeekanayaka whether the DB function itself needs changing also.


@rldleblanc rldleblanc commented Mar 26, 2020

Thanks for digging into this. Is there anything I can do on my end to get the environment up?


@rldleblanc rldleblanc commented Mar 26, 2020

I tried to downgrade, but the schema change prevented it from starting up.


@tomponline tomponline commented Mar 26, 2020

@rldleblanc it's a bit tricky, but if you rename the BTRFS directories in storage-pools/<pool>/custom/ to have the prefix default_, and likewise the directories in storage-pools/<pool>/custom-snapshots/ to also have a prefix of default_, then the patch should detect this and skip over them.
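Something along these lines should do it (a rough sketch only, assuming the snap's default storage path and that nothing is already prefixed; substitute your pool name for <pool>):

sudo sh -c 'cd /var/snap/lxd/common/lxd/storage-pools/<pool>/custom && for d in */; do mv "${d%/}" "default_${d%/}"; done'
sudo sh -c 'cd /var/snap/lxd/common/lxd/storage-pools/<pool>/custom-snapshots && for d in */; do mv "${d%/}" "default_${d%/}"; done'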


@rldleblanc rldleblanc commented Mar 26, 2020

It is skipping those directories, but it's still trying to change the non-existent snapshot.


@rldleblanc rldleblanc commented Mar 26, 2020

Not sure how to manually apply the patch from your PR. I guess I'll have to wait for a new build.


@tomponline tomponline commented Mar 26, 2020

Can you mail me the new error you're getting please?


@rldleblanc rldleblanc commented Mar 27, 2020

I downloaded the edge version of the snap and it started up; my containers are running again. I'd like to move back to stable once this fix is included.
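(For reference, switching channels would be something like sudo snap refresh lxd --channel=edge, and later sudo snap refresh lxd --channel=stable to move back.)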


@stgraber stgraber commented Mar 27, 2020

We have the fix in candidate now and it will be in stable in the next hour hopefully.


@rldleblanc rldleblanc commented Mar 27, 2020

I had to create all the missing btrfs snapshots (since snaps are just subvolumes, I just created empty subvols), then I could delete them all through LXD (it failed to remove a snapshot if it didn't exist). Easy enough to do with some one-liners now that LXD is running. Thanks, I'll keep an eye out for it and move back to stable.
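For anyone hitting the same thing, the kind of one-liner meant here might look roughly like this (a sketch, assuming the snap's default paths and the default_ project prefix; the names are placeholders):

sudo btrfs subvolume create "/var/snap/lxd/common/lxd/storage-pools/<pool>/custom-snapshots/default_<volume>/<snapshot>"
lxc storage volume delete <pool> <volume>/<snapshot>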
