docs: Improve troubleshooting documentation
Due to the temporarily doubled ESP space usage, it is now easier to run
into the out of space issue (once). Use this opportunity to document how
to proceed in this case.

Furthermore, recovery in case of ESP corruption is now slightly more
involved, because not all files are rewritten all the time. Document the
full recovery instructions.
alois31 committed Oct 2, 2023
1 parent 4a44929 commit 76c1b94
Showing 3 changed files with 70 additions and 9 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -13,8 +13,8 @@ supports UEFI.

## ⚡ Quickstart ⚡

-If you want to try this out, head over [here](./docs/QUICK_START.md) for
-instructions.
+If you want to try this out, head over [here](./docs/QUICK_START.md) for instructions.
+In case of any issues, have a look at the [troubleshooting document](./docs/TROUBLESHOOTING.md).

## 🪛 Get Involved 🪛

7 changes: 0 additions & 7 deletions docs/QUICK_START.md
@@ -284,13 +284,6 @@ System:
That's all! 🥳

-## Troubleshooting
-
-If your system doesn't boot with Secure Boot enabled, the most likely
-issue is that Lanzaboote could not verify a cryptographic hash. To
-recover from this, disable Secure Boot in your firmware
-settings. Please file a bug, if you hit this issue.
-
## Disabling Secure Boot and Lanzaboote

When you want to permanently get back to a system without the Secure
68 changes: 68 additions & 0 deletions docs/TROUBLESHOOTING.md
@@ -0,0 +1,68 @@
# Troubleshooting

## Bootloader installation fails with "Failed to install files. … No space left on device (os error 28)"

During the bootloader installation process, Lanzaboote must copy the kernel and initrd to the EFI system partition (ESP).
It is quite possible that the ESP is not large enough to hold these files for all installed generations, in which case this error occurs.

In this case, you must first delete some generations (e.g. run `nix-collect-garbage --delete-older-than 7d` to delete all generations more than one week old).
After that, some space on the ESP must be freed manually.
To achieve this, delete some kernels and initrds in `/boot/EFI/nixos` (they will be recreated in the next step if they are in fact still required).
Finally, run `nixos-rebuild boot` again to finish the installation process that was interrupted by the error.

It is recommended to run a garbage collection regularly, and to monitor the ESP usage (particularly if the ESP is quite small), to prevent this issue from happening again in the future.
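
To make the monitoring advice concrete, here is a minimal sketch of a usage check. It assumes the ESP is mounted at `/boot`, as the paths in this document suggest. The function names and the 80% default threshold are illustrative choices and are not part of Lanzaboote.

```shell
# Sketch only: helper names and the threshold are assumptions.

# Print the usage percentage (without the % sign) of the filesystem holding $1.
esp_usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# List the files in directory $1, largest first.
largest_files() {
  ls -lS -- "$1"
}

# Report usage of mount point $1 (default /boot) and return non-zero
# when it reaches threshold $2 (percent, default 80).
check_esp() {
  local mount_point="${1:-/boot}"
  local threshold="${2:-80}"
  local used
  used="$(esp_usage_percent "$mount_point")"
  if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: $mount_point is ${used}% full"
    return 1
  fi
  echo "OK: $mount_point is ${used}% full"
}
```

For example, `check_esp /boot 80 || largest_files /boot/EFI/nixos` prints the largest kernels and initrds only when the ESP is getting full.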

**Warning:** It is recommended to not delete the currently booted kernel and initrd, and to not reboot the system before running `nixos-rebuild boot` again, to minimize the risk of accidentally rendering the system unbootable.

**Note:** When upgrading Lanzaboote from version 0.3.0, or from git master prior to the merge of PR #204, ESP space usage is temporarily doubled.
Hence it is possible for this error to occur even if plenty of free space (but less than half of the ESP) was available prior to the installation.
In this case, it is not necessary to delete any generations, and you can proceed directly to deleting some kernels and initrds before running the installation again.

## Power failed during bootloader installation, and now the system does not boot any more

Due to the shortcomings of the FAT32 filesystem, in rare cases, it is possible for the ESP to become corrupted after power loss.
With Lanzaboote enabled, this will lead to "secure boot errors" or "hash verification failures" (the exact wording depends on the firmware).
In these cases, recovery is usually still possible with the steps below.

**Note:** If the system fails to boot after the Linux kernel has already been started, then the problem is not caused by a corrupted ESP.
In this case, the steps below will not help, and standard rollback procedures should be followed instead.

### The system can still boot an older generation

In case an older generation still works, the recovery can be carried out from within the booted system.
Run `nix-shell -p sbctl` to ensure the tools required for recovery are available.

1. Run `sudo sbctl verify /boot/EFI/Linux/nixos-generation-*.efi` to check the Lanzaboote stubs.
Files that have a cross mark on their left are corrupted and must be deleted.
2. Run `for file in /boot/EFI/nixos/*.efi; do hash=$(nix-hash --flat --type sha256 --base32 "$file" | tr -d = | tr /+ _-); if [[ $file != *$hash.efi ]]; then echo $file; fi; done` to check the kernels and initrds.
Any files that are printed are corrupted and must be deleted.
3. Run `nixos-rebuild boot`.
This should reinstate all files that are required for the newer generations to boot.
4. Reboot the system; it should now work again.
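
The one-liner from step 2 is dense, so here is the same check spelled out as a function. This is a sketch rather than an official tool. The name `list_corrupted` is made up for illustration, and the guard against an empty hash is an addition that the one-liner does not have.

```shell
# Print every *.efi file in directory $1 (default /boot/EFI/nixos) whose
# file name does not end in the hash of its own content.
list_corrupted() {
  local dir="${1:-/boot/EFI/nixos}"
  local file hash
  for file in "$dir"/*.efi; do
    [ -e "$file" ] || continue  # the glob matched nothing
    # Same hash pipeline as in step 2.
    hash="$(nix-hash --flat --type sha256 --base32 "$file" | tr -d '=' | tr '/+' '_-')"
    if [ -z "$hash" ]; then
      echo "could not hash $file" >&2
      continue
    fi
    case "$file" in
      *"$hash.efi") ;;             # name matches content: file is intact
      *) printf '%s\n' "$file" ;;  # mismatch: file is likely corrupted
    esac
  done
}
```

Running `list_corrupted` without arguments checks `/boot/EFI/nixos`. Any path it prints is a candidate for deletion before running `nixos-rebuild boot`.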

### The system cannot boot any generation anymore

If no available generation can boot any more, the system must be recovered from a rescue system.
First make sure that you have a recent NixOS install medium available.

**Note:** Nix versions from before August 2023 contain a bug that can prevent `nixos-enter` from working.
A more recent medium must be used for the recovery procedure to work reliably.

1. Disable Secure Boot in the firmware settings.
The NixOS install medium is not signed and thus cannot be booted when Secure Boot is active.
2. Boot the NixOS install medium.
3. Mount all partitions belonging to the system to be recovered under `/mnt`, just like you would for installation.
   1. In case the ESP does not mount, or only mounts in read-only mode, due to corruption, try `fsck.fat` first.
      If that fails as well or the ESP still does not mount, it needs to be reformatted using `mkfs.fat`.
4. Delete the corrupted files on the ESP, using `rm -fr /mnt/boot/EFI/Linux/nixos-generation-*.efi /mnt/boot/EFI/nixos`.
5. Enter the recovery shell by running `nixos-enter`.
Then, run `nixos-rebuild boot` to install the bootloader again.
6. Exit the recovery shell and unmount all filesystems.
7. Reboot the system to verify that everything works again.
8. Enable Secure Boot again in the firmware settings.
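
Steps 3 to 6 can be sketched as a short script to run as root on the install medium. The device paths below are assumptions (the `nixos` and `boot` filesystem labels), so substitute the devices of the system being recovered. `ROOT_DEV`, `ESP_DEV`, `DRY_RUN`, `run` and `recover_esp` are hypothetical names introduced for this example. By default the script only prints the commands it would run. The `fsck.fat` / `mkfs.fat` fallback from step 3 is not covered.

```shell
# Assumed device paths; override them via the environment.
ROOT_DEV="${ROOT_DEV:-/dev/disk/by-label/nixos}"
ESP_DEV="${ESP_DEV:-/dev/disk/by-label/boot}"
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or print it in dry-run mode (printed without shell quoting).
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

recover_esp() {
  # Step 3: mount the root filesystem and the ESP under /mnt.
  run mount "$ROOT_DEV" /mnt &&
  run mount "$ESP_DEV" /mnt/boot &&
  # Step 4: delete the possibly corrupted Lanzaboote stubs, kernels and initrds.
  run rm -fr /mnt/boot/EFI/Linux/nixos-generation-*.efi /mnt/boot/EFI/nixos &&
  # Step 5: reinstall the bootloader from inside the installed system.
  run nixos-enter --root /mnt --command 'nixos-rebuild boot' &&
  # Step 6: unmount everything again.
  run umount -R /mnt
}
```

Call `recover_esp` once to review the printed commands, then set `DRY_RUN=0` and call it again to actually perform the recovery.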

## The system doesn't boot with Secure Boot enabled

The most likely cause is that Lanzaboote could not verify a cryptographic hash.
To recover from this, disable Secure Boot in your firmware settings.
Please file a bug if you hit this issue.
