
Sync task never runs on subsequent databases if one of the databases is down/etc #14817

Closed
flamber opened this issue Feb 15, 2021 · 2 comments · Fixed by #15043
Labels: Administration/Metadata & Sync, Priority:P1, Type:Bug

flamber (Contributor) commented Feb 15, 2021

Describe the bug
If you have configured 3 databases, where the 1st is working, the 2nd is inaccessible (down, invalid credentials, wrong certificate, etc.), and the 3rd is working, then only the 1st database is synced the first time the task runs. The task scheduler then gets into a BLOCKED state, which means it will not sync any databases anymore until the instance is restarted, at which point the behavior simply repeats (depending on when Metabase was started; see step 1 about ID vs. sync time).

The workaround is to manually sync each database via Admin > Databases > (your-db) > Sync database schema now.

To Reproduce

  1. Set up 3 databases; the order is important, meaning the ID they get in the application database (or the sync time, which can be defined per minute from 0.38.0).
    1st is MySQL (metabase/qa-databases:mysql-sample-5.7), 2nd is Oracle (wnameless/oracle-xe-11g-r2), 3rd is MySQL (metabase/qa-databases:mysql-sample-8).
    All three databases perform their initial sync+fingerprint+scan on setup.
    I have chosen Oracle as the 2nd database (the down-trigger) since it logs a lot during the failed sync and includes a bunch of sample data, but any database type can probably be used.
  2. Shut down Metabase and shut down the 2nd database (Oracle).
  3. Start Metabase and wait for 1 hour (or use faketime if you have better things to do).
  4. Admin > Troubleshooting > Jobs > metabase.task.sync-and-analyze.job > view trigger.

Expected behavior
Skip inaccessible (down, invalid credentials, wrong certificate, etc.) databases (#7526) and continue syncing the remaining ones. The Quartz scheduler should never get into a BLOCKED state.
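The expected behavior can be sketched as a per-database loop that isolates failures, so one unreachable database never prevents the rest from syncing. This is a minimal illustration, not Metabase's actual (Clojure) code; the `Database` interface and method names are hypothetical:

```java
import java.util.List;

public class SyncAllDatabases {

    // Hypothetical stand-in for one configured database; sync() throws if the
    // database is unreachable (down, invalid credentials, wrong certificate, etc.).
    interface Database {
        String name();
        void sync() throws Exception;
    }

    // Sync every database, isolating failures per database so that one
    // inaccessible database never prevents the others from syncing.
    static int syncAll(List<Database> databases) {
        int synced = 0;
        for (Database db : databases) {
            try {
                db.sync();
                synced++;
            } catch (Exception e) {
                // Log and move on; never let the exception escape the scheduled
                // job, or the scheduler can be left in a bad state.
                System.err.println("Sync failed for " + db.name() + ": " + e.getMessage());
            }
        }
        return synced;
    }
}
```

With this structure, the 1st-working / 2nd-down / 3rd-working setup from the reproduction steps would still sync databases 1 and 3.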

Information about your Metabase Installation:
I've only tested 0.37.9, since it takes a long time to reproduce, even with faketime or similar tools, but I'm guessing this problem has existed for a long time.

Additional context
I can consistently get into the blocked state on startup if Metabase starts the sync task scheduler (INFO metabase.events :: Starting events listener: :metabase.events.sync-database/Sync) but the next_fire_time has already passed.
I'm triggering that with faketime -f '+0 x60'; depending on when it's started, it might already have passed the event. Change to a higher speed multiplier if it's difficult to get into this state.

Having a way to completely disable scheduled sync would be very helpful #10398

flamber added the Type:Bug, Priority:P1, and Administration/Metadata & Sync labels Feb 15, 2021
camsaul added this to the 0.38.1 milestone Feb 22, 2021
robdaemon self-assigned this Feb 27, 2021
robdaemon (Contributor) commented:

Hrm, I'm not sure BLOCKED is the issue here; BLOCKED is set because the sync job has DisallowConcurrentExecution set to true here: https://github.com/metabase/metabase/blob/release-x.38.x/src/metabase/task/sync_databases.clj#L74

Still digging in more to see if I can figure it out. This is mainly around our misfire settings: syncs are configured not to retry until the "next fire time" if they fail or are missed; there is no retry until then. If the restart happens after that, we miss another day of syncing. Maybe we need a different misfire setting for this job, so it runs at least once a day.
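For reference, the two Quartz knobs discussed in this comment look roughly like this in the Quartz 2.x Java API. This is a sketch only; the job class, trigger identity, and cron expression are illustrative, not Metabase's actual configuration:

```java
import org.quartz.*;

// DisallowConcurrentExecution is why Quartz reports the trigger as BLOCKED
// while it believes an instance of the job is still running.
@DisallowConcurrentExecution
public class SyncAndAnalyzeJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // ... sync logic ...
    }
}

class TriggerSetup {
    static Trigger buildTrigger() {
        return TriggerBuilder.newTrigger()
            .withIdentity("sync-and-analyze.trigger")
            .withSchedule(CronScheduleBuilder.cronSchedule("0 50 * * * ? *")
                // Misfire policy: if a firing was missed, fire once immediately
                // and then resume the normal schedule, instead of silently
                // waiting for the next scheduled time.
                .withMisfireHandlingInstructionFireAndProceed())
            .build();
    }
}
```

The default misfire handling ("smart policy") can effectively skip a missed firing, which matches the "we miss another day of syncing" behavior described above.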

camsaul changed the milestone from 0.38.1 to 0.38.2 on Mar 3, 2021
robdaemon pushed a commit that referenced this issue Mar 3, 2021
It's possible for the scheduler to get in a weird state if the sync
fails while it executes.

This change makes it *only* recreate a job/task if the schedule has
changed or if it is missing. Previously, clearing the state at every
start had bad effects if the JVM had terminated during the sync.

#14817
robdaemon pushed a commit that referenced this issue Mar 5, 2021 (same commit message as above)
robdaemon pushed a commit that referenced this issue Mar 5, 2021 (same commit message as above)
camsaul linked a pull request Mar 9, 2021 that will close this issue
robdaemon pushed a commit that referenced this issue Mar 12, 2021
It's possible for the scheduler to get in a weird state if the sync
fails while it executes.

This change makes it *only* recreate a job/task if the schedule has
changed or if it is missing. Previously, clearing the state at every
start had bad effects if the JVM had terminated during the sync.

Adds a vector of Exception classes that signal a "fatal" exception
during sync for a specific database. If these exceptions occur, the sync
for that database stops and will pick up next time.

This will have to be expanded per driver, but I don't see a way around
that, as each driver will have its own unique way of failing.

#14817
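The "fatal exceptions" idea from this commit can be sketched as follows. This is a Java illustration of the concept only; the actual implementation is Clojure inside Metabase's sync code, and the listed exception classes are just plausible examples, not the real per-driver lists:

```java
import java.util.List;

public class FatalSyncErrors {

    // Exception classes that signal "this database is unreachable; abort its
    // sync and let the next scheduled run retry". A real implementation would
    // extend this list per driver, since each driver fails in its own way.
    static final List<Class<? extends Exception>> FATAL = List.of(
        java.net.ConnectException.class,
        java.sql.SQLNonTransientConnectionException.class);

    // A sync failure is fatal if the thrown exception, or any exception in
    // its cause chain, is an instance of one of the fatal classes.
    static boolean isFatal(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            for (Class<? extends Exception> c : FATAL) {
                if (c.isInstance(cur)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A fatal classification stops the remaining sync steps for that one database, while non-fatal errors (e.g. a bad row during a scan) can be logged and skipped without aborting.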
camsaul (Member) commented Mar 17, 2021

Fixed by #15043

camsaul closed this as completed Mar 17, 2021