Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create an issue when database backups fail because the system runs out of resources #109020

Merged
merged 5 commits into from Jan 30, 2024

Conversation

bdraco
Copy link
Member

@bdraco bdraco commented Jan 28, 2024

This should wait to merge until after home-assistant/supervisor#4843 so we don't have a have a lot of issues in the queue that we can do little about.

Proposed change

If the backup takes too long, the system can run out of resources to store the event queue because we have to queue them in memory until the database backup finishes and the write lock is released. When this happens we logged a warning in the log which was easy to miss, and users sometimes found out their database backup had failed at the worst possible time (when they needed to restore it.)

The repair issue increases the chance the user will see the problem without forcing the rest of their backup to hard fail (since it only affects the database). This is especially relevant if they're doing backups via API to do an update, or using an add-on like the Google Drive Backup or other Backup system. For testing I did backups via api and verified I got the error in the UI.

This issue is expected to be extremely rare after home-assistant/supervisor#4843

Screenshot 2024-01-28 at 12 54 32 PM Screenshot 2024-01-28 at 12 54 47 PM

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Ruff (ruff format homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

…urces

If the backup takes too long, the system can run out of resources
to store the event queue. When this happens we logged a warning
in the log which was easy to miss, and users sometimes found out
their backup had failed at the worst possible time (when they needed
to restore it.)
@home-assistant
Copy link

Hey there @home-assistant/core, mind taking a look at this pull request as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of recorder can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign recorder Removes the current integration label and assignees on the pull request, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

@codyc1515
Copy link
Contributor

It would be good to create an issue when the system storage space falls under a certain amount, say, 10 or 15% (if that is not already the case), so that users can treat storage issues well in advance (certainly before backups start failing).

@bdraco
Copy link
Member Author

bdraco commented Jan 28, 2024

It would be good to create an issue when the system storage space falls under a certain amount, say, 10 or 15% (if that is not already the case), so that users can treat storage issues well in advance (certainly before backups start failing).

That is not in scope for this PR

@codyc1515
Copy link
Contributor

codyc1515 commented Jan 28, 2024

I understand that but we are just treating the symptom here (backups failing), rather than the cause (low / no space). While I like the idea, I feel that it doesn't seem like a solution that scales well.

Are we going to add alerts for every other area of HA that fails due to low / no space too? Another example could be a HAOS or HA Core docker image that does not have enough space to be able to update.

We could instead check available disk space before creating a backup or performing an upgrade instead and if there is not enough space available, fail to start the backup / upgrade process with an exception.

@bdraco
Copy link
Member Author

bdraco commented Jan 28, 2024

This PR has nothing to do with low space, its about the lock being held too long because the backup itself it taking too long.

Resources in this case mean: cpu time, memory, I/O, or anything else that would make the backup too slow. It's intentionally generic because we honestly don't know why, and it's something the user is going to have to investigate. After the supervisor PR is merged hopefully this will only happen in very extreme cases.

If the backup fails because it runs out of space, they're already going to get a notification in the UI.

The problem here is the backup looks like it's successful but it really wasn't since we gave up the lock and now the database back up is no longer good but the rest of the backup is ok. The alternative would to make it a hard failure but that might put the user in an even a worse state because then they can't even back up core. I think it's better to be able to make a back up even without the database than get nothing at all.

@bdraco bdraco changed the title Create an issue when backups fail because the system runs out of resources Create an issue when database backups fail because the system runs out of resources Jan 28, 2024
Copy link
Contributor

@codyc1515 codyc1515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bdraco bdraco marked this pull request as ready for review January 28, 2024 22:56
@bdraco bdraco requested a review from a team as a code owner January 28, 2024 22:56
@frenck
Copy link
Member

frenck commented Jan 30, 2024

I think the PR is good, but can we make the datetime a little bit more end-user friendly?

CleanShot 2024-01-30 at 18 44 41@2x

:)

@bdraco
Copy link
Member Author

bdraco commented Jan 30, 2024

Lol, isoformat seemed super friendly to me, but I'm obviously not the user in this case. I'll pick a more human readable format

@bdraco
Copy link
Member Author

bdraco commented Jan 30, 2024

I think %H:%M:%S is sufficient here

@codyc1515
Copy link
Contributor

Is it not straightforward to use these? image

@bdraco
Copy link
Member Author

bdraco commented Jan 30, 2024

Is it not straightforward to use these?

Repair issues can't send datetime or timestamps to the frontend, only strings AFAIK

Copy link
Member

@frenck frenck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @bdraco 👍

../Frenck

@bdraco
Copy link
Member Author

bdraco commented Jan 30, 2024

🤞 that this is rare as I think it's going to be, but if it's not, I guess it'll motivate to improve the Backup performance more. Certainly better to know than vs finding out the hard way

We still have a lot of opportunities to reduce I/O and disk writes which also save storage lifetime that haven't been explored yet.

@bdraco
Copy link
Member Author

bdraco commented Jan 30, 2024

Thanks

@bdraco bdraco merged commit a222447 into dev Jan 30, 2024
53 checks passed
@bdraco bdraco deleted the backup_failed_repair_issue branch January 30, 2024 20:23
@github-actions github-actions bot locked and limited conversation to collaborators Jan 31, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants