Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Live db migrations and recovery #49036

Merged
merged 5 commits into from Apr 12, 2021
Merged

Conversation

bdraco
Copy link
Member

@bdraco bdraco commented Apr 11, 2021

Breaking change

Run database migrations in the background after started event

Screen Shot 2021-04-11 at 8 40 21 PM

  • Database migrations no longer block startup
  • The migration is started shortly after the started event since they are cpu intensive (mostly just sqlite since its in the same process) and we do not want to compete with startup.
  • Events are queued and processed when the migration is completed
  • There is a safety to start discarding events if more than ~30000 happen before migration is completed to avoid memory exhaustion
  • A notification is shown when the migration is in progress "The database is being upgraded. Integrations such as logbook and history that read the database may return inconsistent results until the migration completes."
  • The sqlite database recovery logic can now recover at point after setup including purge, migration, and event insert. In short we can always (hopefully) recover and start a new db without a restart.
  • The quick check is no longer performed on unclean shutdown since we can always recover live. The db_integrity_check option has been deprecated.
  • Recorder __init__.py coverage increased to 98% 🎊

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Example entry for configuration.yaml:

# Example configuration.yaml

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • Untested files have been added to .coveragerc.

The integration reached or maintains the following Integration Quality Scale:

  • No score or internal
  • 🥈 Silver
  • 🥇 Gold
  • 🏆 Platinum

To help with the load of incoming pull requests:

@probot-home-assistant
Copy link

Hey there @home-assistant/core, mind taking a look at this pull request as its been labeled with an integration (history) you are listed as a codeowner for? Thanks!
(message by CodeOwnersMention)

@bdraco bdraco force-pushed the live_db_migrations branch 3 times, most recently from d1347ab to a1ab895 Compare April 11, 2021 01:46
@bdraco bdraco force-pushed the live_db_migrations branch 6 times, most recently from 9edcb35 to a94153c Compare April 11, 2021 05:02
@bdraco bdraco changed the title Live db migrations Live db migrations and recovery Apr 11, 2021
@bdraco bdraco force-pushed the live_db_migrations branch 3 times, most recently from 83b62d8 to f4ba123 Compare April 11, 2021 06:00
@bdraco bdraco marked this pull request as ready for review April 11, 2021 06:06
@bdraco bdraco force-pushed the live_db_migrations branch 5 times, most recently from 69f12e3 to bb05150 Compare April 11, 2021 08:40
@bdraco
Copy link
Member Author

bdraco commented Apr 11, 2021

Pushed again to cover one more existing line. At this point testing has been solid

@bdraco
Copy link
Member Author

bdraco commented Apr 11, 2021

Tested this on my large install with

drop index ix_events_event_type_time_fired;
drop index ix_events_context_parent_id;
delete from schema_changes where change_id>=2;

Restart

@bdraco
Copy link
Member Author

bdraco commented Apr 11, 2021

More exhaustive test

update schema_changes set schema_version=0 where change_id=1;
delete from schema_changes where change_id>=2;
drop index ix_events_context_parent_id;
drop index ix_events_event_type_time_fired;
drop index ix_states_entity_id_last_updated;
drop index ix_states_event_id;
drop index ix_states_last_updated;
drop index ix_events_time_fired;

@bdraco
Copy link
Member Author

bdraco commented Apr 11, 2021

I think we should warn that performance may be degraded during the upgrade so they don't wonder why.

@bdraco bdraco force-pushed the live_db_migrations branch 4 times, most recently from 646d603 to 753fa95 Compare April 11, 2021 19:45
- Database migrations no longer block startup
- The migration is started shortly after the started event since they are cpu intensive and we do not want to compete with startup.
- Events are queued and processed when the migration is completed
- There is a safety to start discarding events if more than 10000 happen before migration is completed
- A notification is shown when the migration is in progress "The database is being upgraded. Integrations such as logbook and history that read the database may return inconsistent results until the migration completes."
- The database recovery logic can now recover at point after setup including purge, migration, and event insert. In short we can _always_ (hopefully) recover and start a new db  without a restart.
- The quick check is no longer performed on unclean shutdown since we can always recover live. The `db_integrity_check` option has been deprecated.
@marvin-w
Copy link
Contributor

Being one of the people that waited for about 25 hours during the last update this is a really awesome change! Well done!

"""Migrate schema to the latest version."""
persistent_notification.create(
self.hass,
"System performance will temporarily degrade during the database upgrade. Integrations that read the database, such as logbook and history, may return inconsistent results until the upgrade completes.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that they should not restart their Home Assistant installation?

We probably want to show this in the frontend somewhere too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's a good idea 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe fire an event 'recorder_migration_started' 'recorder_migration_ended'. We could then show a warning on the server tools by the restart button that they probably shouldn't do it

Copy link
Member Author

@bdraco bdraco Apr 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if shutdown will block actually since the thread won't exit until it finished. I'll test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System performance will temporarily degrade during the database upgrade. Do not power down or restart the system until the upgrade completes. Integrations that read the database, such as logbook and history, may return inconsistent results until the upgrade completes.

Changed wording to ^

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The webserver will shutdown but the database migration continues and restart is blocked until it finishes because the thread won't join.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2021-04-12 06:08:27 WARNING (MainThread) [homeassistant.core] Timed out waiting for shutdown stage 1 to complete, the shutdown will continue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shutdown blocks until the migration is complete because the thread can't join. I'm going to do another PR to check if migration is in progress and throw when calling SERVICE_HOMEASSISTANT_STOP or SERVICE_HOMEASSISTANT_RESTART so they don't end up in this state unexpectedly.

Thread 213 (idle): "MainThread"
    _wait_for_tstate_lock (threading.py:1027)
    join (threading.py:1011)
    shutdown (concurrent/futures/thread.py:236)
    close (homeassistant/runner.py:92)
    run (asyncio/runners.py:51)
    run (homeassistant/runner.py:126)
    main (homeassistant/__main__.py:313)
    <module> (homeassistant/__main__.py:321)
    _run_code (runpy.py:87)
    _run_module_as_main (runpy.py:194)
Thread 237 (idle): "SyncWorker_1"
    _wait_for_tstate_lock (threading.py:1027)
    join (threading.py:1011)
    shutdown (homeassistant/components/recorder/__init__.py:364)
    run (concurrent/futures/thread.py:57)
    _worker (concurrent/futures/thread.py:80)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 245 (idle): "Recorder"
    migrate_schema (homeassistant/components/recorder/migration.py:60)
    _migrate_schema_and_setup_run (homeassistant/components/recorder/__init__.py:523)
    run (homeassistant/components/recorder/__init__.py:435)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 251 (idle): "Thread-8"
    loop (paho/mqtt/client.py:1167)
    loop_forever (paho/mqtt/client.py:1779)
    _thread_main (paho/mqtt/client.py:3452)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 259 (idle): "Thread-9"
    loop (paho/mqtt/client.py:1167)
    loop_forever (paho/mqtt/client.py:1779)
    _thread_main (paho/mqtt/client.py:3452)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified everything went down cleanly and restarted successfully once the migration did finish.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service block started here #49098

Need to think about how to tell the frontend as it needs to know on reload.

Copy link
Member

@balloob balloob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great ! 🎉

Had a couple of minor comments.

Dev automation moved this from Needs review to Reviewer approved Apr 12, 2021
@bdraco bdraco merged commit 9368891 into home-assistant:dev Apr 12, 2021
Dev automation moved this from Reviewer approved to Done Apr 12, 2021
@bdraco bdraco deleted the live_db_migrations branch April 12, 2021 06:43
@bdraco
Copy link
Member Author

bdraco commented Apr 12, 2021

Thanks. Working on the shutdown block pr now

@github-actions github-actions bot locked and limited conversation to collaborators Apr 13, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Dev
  
Done
Development

Successfully merging this pull request may close these issues.

recorder starts reporting "database disk image is malformed" error
4 participants