Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGTERM during migration does not cleanly abort migration #37866

Closed
devurandom opened this issue Jan 18, 2024 · 5 comments · Fixed by #41089
Closed

SIGTERM during migration does not cleanly abort migration #37866

devurandom opened this issue Jan 18, 2024 · 5 comments · Fixed by #41089
Assignees
Labels
Operation/ Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness .Team/BackendComponents also known as BEC Type:Bug Product defects
Milestone

Comments

@devurandom
Copy link
Contributor

Describe the bug

When Metabase v48.3 receives a SIGTERM while it is running migrations (after "Migration lock is cleared. Running ${n} migrations ...", but before "Migration complete in ${x} s"), Metabase will shut down ("Metabase Shutting Down ...") but not cleanly abort the migration / clear the migration lock (subsequent runs of Metabase will report "liquibase.exception.LockException: Database has migration lock; cannot run migrations.")

Once this happened, the workaround is running the mentioned command java -jar metabase.jar migrate release-locks, but until this has been done Metabase cannot start again.

To Reproduce

  1. Start Metabase v47.9 and have it set up its database
  2. Update to v48.3
  3. Wait until you see it log "Migration lock is cleared. Running ${n} migrations ..."
  4. Send SIGTERM (before "Migration complete")
  5. Wait for Metabase to shut down
  6. Start it again
  7. Observe startup failure and "Database has migration lock; cannot run migrations."

Expected behavior

Upon SIGTERM Metabase aborts the migration and clears the migration lock

Logs

Example logs of the migration termination:

2024-01-18 00:14:14,102 INFO metabase.core :: Setting up and migrating Metabase DB. Please sit tight, this may take a minute...
2024-01-18 00:14:14,104 INFO db.setup :: Verifying postgres Database Connection ...
2024-01-18 00:14:15,237 INFO db.setup :: Successfully verified PostgreSQL 14.7 application database connection. ✅
2024-01-18 00:14:15,238 INFO db.setup :: Checking if a database downgrade is required...
2024-01-18 00:14:17,567 INFO db.setup :: Running Database Migrations...
2024-01-18 00:14:17,568 INFO db.setup :: Setting up Liquibase...
2024-01-18 00:14:17,645 INFO db.setup :: Liquibase is ready.
2024-01-18 00:14:17,645 INFO db.liquibase :: Checking if Database has unrun migrations...
2024-01-18 00:14:19,958 INFO db.liquibase :: Database has unrun migrations. Waiting for migration lock to be cleared...
2024-01-18 00:14:20,381 INFO db.liquibase :: Migration lock is cleared. Running 55 migrations ...
2024-01-18 00:14:20,690 ERROR middleware.log :: GET /api/health 503 379.9 µs (0 DB calls) 
{:status "initializing", :progress 0.3}

2024-01-18 00:14:22,920 INFO metabase.core :: Metabase Shutting Down ...
2024-01-18 00:14:22,921 INFO metabase.server :: Shutting Down Embedded Jetty Webserver
2024-01-18 00:14:22,995 INFO metabase.core :: Metabase Shutdown COMPLETE

Example logs of subsequent startup failure:

2024-01-18 00:14:39,308 INFO metabase.core :: Setting up and migrating Metabase DB. Please sit tight, this may take a minute...
2024-01-18 00:14:39,310 INFO db.setup :: Verifying postgres Database Connection ...
2024-01-18 00:14:39,629 INFO db.setup :: Successfully verified PostgreSQL 14.7 application database connection. ✅
2024-01-18 00:14:39,630 INFO db.setup :: Checking if a database downgrade is required...
2024-01-18 00:14:39,922 INFO db.setup :: Running Database Migrations...
2024-01-18 00:14:39,922 INFO db.setup :: Setting up Liquibase...
2024-01-18 00:14:39,962 INFO db.setup :: Liquibase is ready.
2024-01-18 00:14:39,962 INFO db.liquibase :: Checking if Database has unrun migrations...
2024-01-18 00:14:40,349 INFO db.liquibase :: Database has unrun migrations. Waiting for migration lock to be cleared...
2024-01-18 00:14:42,358 WARN util.jvm :: auto-retry metabase.db.liquibase$wait_for_migration_lock_to_be_cleared$fn__47884@347004f6: Database has migration lock; cannot run migrations. You can force-release these locks by running `java -jar metabase.jar migrate release-locks`.
2024-01-18 00:14:43,379 ERROR middleware.log :: GET /api/health 503 2.5 ms (0 DB calls)
{:status "initializing", :progress 0.3}
	
2024-01-18 00:14:44,365 WARN util.jvm :: auto-retry metabase.db.liquibase$wait_for_migration_lock_to_be_cleared$fn__47884@347004f6: Database has migration lock; cannot run migrations. You can force-release these locks by running `java -jar metabase.jar migrate release-locks`.
2024-01-18 00:14:46,372 WARN util.jvm :: auto-retry metabase.db.liquibase$wait_for_migration_lock_to_be_cleared$fn__47884@347004f6: Database has migration lock; cannot run migrations. You can force-release these locks by running `java -jar metabase.jar migrate release-locks`.
2024-01-18 00:14:48,378 WARN util.jvm :: auto-retry metabase.db.liquibase$wait_for_migration_lock_to_be_cleared$fn__47884@347004f6: Database has migration lock; cannot run migrations. You can force-release these locks by running `java -jar metabase.jar migrate release-locks`.
2024-01-18 00:14:50,385 WARN util.jvm :: auto-retry metabase.db.liquibase$wait_for_migration_lock_to_be_cleared$fn__47884@347004f6: Database has migration lock; cannot run migrations. You can force-release these locks by running `java -jar metabase.jar migrate release-locks`.
2024-01-18 00:14:52,393 ERROR metabase.core :: Metabase Initialization FAILED
liquibase.exception.LockException: Database has migration lock; cannot run migrations. You can force-release these locks by running `java -jar metabase.jar migrate release-locks`.
	at metabase.db.liquibase$wait_for_migration_lock_to_be_cleared$fn__47884.invoke(liquibase.clj:204)
	at metabase.util.jvm$do_with_auto_retries.invokeStatic(jvm.clj:176)
	at metabase.util.jvm$do_with_auto_retries.invoke(jvm.clj:165)
	at metabase.util.jvm$do_with_auto_retries.invokeStatic(jvm.clj:183)
	at metabase.util.jvm$do_with_auto_retries.invoke(jvm.clj:165)
	at metabase.util.jvm$do_with_auto_retries.invokeStatic(jvm.clj:183)
	at metabase.util.jvm$do_with_auto_retries.invoke(jvm.clj:165)
	at metabase.util.jvm$do_with_auto_retries.invokeStatic(jvm.clj:183)
	at metabase.util.jvm$do_with_auto_retries.invoke(jvm.clj:165)
	at metabase.util.jvm$do_with_auto_retries.invokeStatic(jvm.clj:183)
	at metabase.util.jvm$do_with_auto_retries.invoke(jvm.clj:165)
	at metabase.util.jvm$do_with_auto_retries.invokeStatic(jvm.clj:183)
	at metabase.util.jvm$do_with_auto_retries.invoke(jvm.clj:165)
	at metabase.db.liquibase$wait_for_migration_lock_to_be_cleared.invokeStatic(liquibase.clj:199)
	at metabase.db.liquibase$wait_for_migration_lock_to_be_cleared.invoke(liquibase.clj:195)
	at metabase.db.liquibase$migrate_up_if_needed_BANG_.invokeStatic(liquibase.clj:216)
	at metabase.db.liquibase$migrate_up_if_needed_BANG_.invoke(liquibase.clj:209)
	at metabase.db.setup$fn__50134$_AMPERSAND_f__50135$fn__50136.invoke(setup.clj:80)
	at metabase.db.liquibase$fn__47860$_AMPERSAND_f__47861.invoke(liquibase.clj:131)
	at metabase.db.liquibase$fn__47860$fn__47863.invoke(liquibase.clj:124)
	at metabase.db.setup$fn__50134$_AMPERSAND_f__50135.doInvoke(setup.clj:75)
	at clojure.lang.RestFn.invoke(RestFn.java:445)
	at clojure.lang.AFn.applyToHelper(AFn.java:160)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:673)
	at clojure.core$apply.invoke(core.clj:662)
	at metabase.db.setup$fn__50134$fn__50144.doInvoke(setup.clj:56)
	at clojure.lang.RestFn.invoke(RestFn.java:445)
	at metabase.db.setup$fn__50172$_AMPERSAND_f__50173.invoke(setup.clj:148)
	at metabase.db.setup$fn__50172$fn__50179.invoke(setup.clj:142)
	at metabase.db.setup$fn__50182$_AMPERSAND_f__50183$fn__50186$fn__50187.invoke(setup.clj:166)
	at metabase.util.jvm$do_with_us_locale.invokeStatic(jvm.clj:239)
	at metabase.util.jvm$do_with_us_locale.invoke(jvm.clj:225)
	at metabase.db.setup$fn__50182$_AMPERSAND_f__50183$fn__50186.invoke(setup.clj:161)
	at metabase.db.setup$fn__50182$_AMPERSAND_f__50183.invoke(setup.clj:160)
	at metabase.db.setup$fn__50182$fn__50194.invoke(setup.clj:154)
	at metabase.db$setup_db_BANG_$fn__50203.invoke(db.clj:69)
	at metabase.db$setup_db_BANG_.invokeStatic(db.clj:64)
	at metabase.db$setup_db_BANG_.invoke(db.clj:55)
	at metabase.core$init_BANG__STAR_.invokeStatic(core.clj:113)
	at metabase.core$init_BANG__STAR_.invoke(core.clj:98)
	at metabase.core$init_BANG_.invokeStatic(core.clj:156)
	at metabase.core$init_BANG_.invoke(core.clj:151)
	at metabase.core$start_normally.invokeStatic(core.clj:168)
	at metabase.core$start_normally.invoke(core.clj:162)
	at metabase.core$entrypoint.invokeStatic(core.clj:201)
	at metabase.core$entrypoint.doInvoke(core.clj:195)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.lang.Var.applyTo(Var.java:705)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$apply.invoke(core.clj:662)
	at metabase.bootstrap$_main.invokeStatic(bootstrap.clj:31)
	at metabase.bootstrap$_main.doInvoke(bootstrap.clj:28)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at metabase.bootstrap.main(Unknown Source)
2024-01-18 00:14:52,399 INFO metabase.core :: Metabase Shutting Down ...
2024-01-18 00:14:52,399 INFO metabase.server :: Shutting Down Embedded Jetty Webserver
2024-01-18 00:14:52,402 INFO metabase.core :: Metabase Shutdown COMPLETE

Information about your Metabase installation

Metabase v48.3 on Linux.

Severity

can cause downtime, but easy workaround

Additional context

https://metaboat.slack.com/archives/C013N8XL286/p1705573653217009

Different issue, not the problem I ran into, but similar cause: #30360

@devurandom devurandom changed the title SIGTERM during migration does not safely abort migration SIGTERM during migration does not cleanly abort migration Jan 18, 2024
@paoliniluis paoliniluis added Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness Operation/ and removed .Needs Triage labels Jan 18, 2024
@darksciencebase darksciencebase added the .Team/BackendComponents also known as BEC label Jan 25, 2024
@crisptrutski crisptrutski self-assigned this Mar 26, 2024
@crisptrutski
Copy link
Contributor

@devurandom I'm looking into this, and unfortunately the library we're using to run migrations does not support being interrupted or aborted. I'm doing some research into whether it's possible to extend it to add this support, but I'm wondering if you have thoughts on either of the following simpler solutions:

  1. Block the application shutdown on the migrations finishing as usual.

    I suspect part of the reason you killed the process was due to the migrations taking a long time, but just want to confirm. This seems much safer than the current behavior though.

  2. Release the migration locks automatically just before the process terminates.

    This saves you the trouble of running the command manually, but increases the risk of corrupting the application database.

@crisptrutski
Copy link
Contributor

So I've found a way to skip the migrations remaining after the one in progress, but I'm concerned about just terminating the current migration since there's no telling what state that'll leave the database in. I'm inclined to block on waiting for the current migration with a long timeout.

@devurandom
Copy link
Contributor Author

@devurandom I'm looking into this, and unfortunately the library we're using to run migrations does not support being interrupted or aborted.

Is it possible to make it run migrations in transactions? Is that feasible for our setup?

  1. Block the application shutdown on the migrations finishing as usual.

    I suspect part of the reason you killed the process was due to the migrations taking a long time, but just want to confirm. This seems much safer than the current behavior though.

That would work.

@devurandom
Copy link
Contributor Author

@devurandom I'm looking into this, and unfortunately the library we're using to run migrations does not support being interrupted or aborted. I'm doing some research into whether it's possible to extend it to add this support, [...]

@imrkd found these related Liquibase issues:

From a very cursory reading, their approach to solve this appears to be to release the migration lock when the process receives SIGTERM and there is a Liquibase that implements this, see. e.g. liquibase/liquibase#1453 (comment).

  1. Release the migration locks automatically just before the process terminates.

    This saves you the trouble of running the command manually, but increases the risk of corrupting the application database.

From reading the Liquibase documentation I think this should be safe:

Liquibase attempts to execute each changeset in a transaction that is committed at the end, or rolled back if there is an error.
(https://docs.liquibase.com/concepts/changelogs/changeset.html)

When you make a database deployment with the update command, Liquibase attempts to create an atomic "transaction block" around each changeset to ensure that it is either fully deployed to the database or not deployed at all. This protects your database from incomplete deployments with unpredictable behavior.
(https://docs.liquibase.com/concepts/changelogs/attributes/run-in-transaction.html)

@crisptrutski
Copy link
Contributor

crisptrutski commented Apr 5, 2024

Thanks for doing all the digging! Interesting to see all the proposals that had been made upstream, I hadn't thought to look there and just fumbled my way through the source code in my IDE.

Liquibase attempts to execute each changeset in a transaction

Unfortunately our migrations typically involve DDL queries which are not transactional or even idempotent for all our app databases.

The favored approach in that thread was something I handn't thought of - having the locks released as a connection closed hook on the database side. That's really clever, but requires us to not only re-implement that functionality, which could be brittle for upgrading, it would also require driver specific ways to register it. Seems like a lot of complexity, especially since we're uncertain about sticking with liquibase. I'll have a look at whether that 3rd party plugin would be simple for us to integrate.

For now I'm going to go with something really simple to ship with 50.0 - waiting on the current migrations for say 20s before quitting, and automatically releasing the lock when we do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Operation/ Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness .Team/BackendComponents also known as BEC Type:Bug Product defects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants