
Sync task never runs on subsequent databases if one of the databases is down/etc #14817

Closed
flamber opened this issue Feb 15, 2021 · 2 comments · Fixed by #15043
Labels: Administration/Metadata & Sync, Priority:P1, Type:Bug

flamber (Contributor) commented Feb 15, 2021

Describe the bug
If you have configured 3 databases, where the 1st is working, the 2nd is inaccessible (down, invalid credentials, wrong certificate, etc.), and the 3rd is working, then only the 1st database is synced the first time the task runs. The task scheduler then gets into a BLOCKED state, which means it will not sync any databases anymore until the instance is restarted, at which point the behavior simply repeats (depending on when Metabase was started; see step 1 about ID vs. sync time).

The workaround is to manually sync each database via Admin > Databases > (your-db) > Sync database schema now.

To Reproduce

  1. Set up 3 databases; the order is important, meaning the ID they get in the application database (or the sync time, which can be defined per minute from 0.38.0).
    1st is MySQL (metabase/qa-databases:mysql-sample-5.7), 2nd is Oracle (wnameless/oracle-xe-11g-r2), 3rd is MySQL (metabase/qa-databases:mysql-sample-8).
    All three databases perform their initial sync+fingerprint+scan on setup.
    I have chosen Oracle as the 2nd database (the down-trigger) since it logs a lot during the failed sync and includes a bunch of sample data, but any database type can probably be used.
  2. Shut down Metabase and shut down the 2nd database (Oracle).
  3. Start Metabase and wait for 1 hour (or use faketime if you have better things to do).
  4. Admin > Troubleshooting > Jobs > metabase.task.sync-and-analyze.job > view trigger.

Expected behavior
Skip inaccessible (down, invalid credentials, wrong certificate, etc.) databases (#7526) and continue syncing the remaining ones. The Quartz scheduler should never get into a BLOCKED state.
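The expected behavior can be sketched as a per-database loop that isolates failures, so one unreachable database never prevents the rest from syncing. This is a minimal illustration, not Metabase's actual (Clojure) code; the `Database` interface and method names are hypothetical:

```java
import java.util.List;

public class SyncAllDatabases {

    // Hypothetical stand-in for one configured database; sync() throws if the
    // database is unreachable (down, invalid credentials, wrong certificate, etc.).
    interface Database {
        String name();
        void sync() throws Exception;
    }

    // Sync every database, isolating failures per database so that one
    // inaccessible database never prevents the others from syncing.
    static int syncAll(List<Database> databases) {
        int synced = 0;
        for (Database db : databases) {
            try {
                db.sync();
                synced++;
            } catch (Exception e) {
                // Log and move on; never let the exception escape the scheduled
                // job, or the scheduler can be left in a bad state.
                System.err.println("Sync failed for " + db.name() + ": " + e.getMessage());
            }
        }
        return synced;
    }
}
```

With this structure, the 1st-working / 2nd-down / 3rd-working setup from the reproduction steps would still sync databases 1 and 3.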

Information about your Metabase Installation:
I've only tested 0.37.9, since it takes a long time to reproduce, even with faketime or similar tools, but I'm guessing this problem has existed for a long time.

Additional context
I can consistently get into the blocked state on startup if Metabase starts the sync task scheduler (INFO metabase.events :: Starting events listener: :metabase.events.sync-database/Sync) but the next_fire_time has already passed.
I'm triggering that with faketime -f '+0 x60'; depending on when it's started, it might already have passed the event. Change to a higher speed multiplier if it's difficult to get into this state.

Having a way to completely disable scheduled sync would be very helpful #10398

flamber added the Type:Bug, Priority:P1, and Administration/Metadata & Sync labels Feb 15, 2021
camsaul added this to the 0.38.1 milestone Feb 22, 2021
robdaemon self-assigned this Feb 27, 2021
robdaemon (Contributor) commented:

Hrm, I'm not sure BLOCKED is the issue here; BLOCKED is set because the sync job has DisallowConcurrentExecution set to true here: https://github.com/metabase/metabase/blob/release-x.38.x/src/metabase/task/sync_databases.clj#L74

Still digging in more to see if I can figure it out. This is mainly around our misfire settings: syncs are configured not to retry until the "next fire time" if they fail or are missed; there is no retry until then. If the restart happens after that, we miss another day of syncing. Maybe we need a different misfire setting for this job, so it runs at least once a day.
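For reference, the two Quartz knobs discussed in this comment look roughly like this in the Quartz 2.x Java API. This is a sketch only; the job class, trigger identity, and cron expression are illustrative, not Metabase's actual configuration:

```java
import org.quartz.*;

// DisallowConcurrentExecution is why Quartz reports the trigger as BLOCKED
// while it believes an instance of the job is still running.
@DisallowConcurrentExecution
public class SyncAndAnalyzeJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // ... sync logic ...
    }
}

class TriggerSetup {
    static Trigger buildTrigger() {
        return TriggerBuilder.newTrigger()
            .withIdentity("sync-and-analyze.trigger")
            .withSchedule(CronScheduleBuilder.cronSchedule("0 50 * * * ? *")
                // Misfire policy: if a firing was missed, fire once immediately
                // and then resume the normal schedule, instead of silently
                // waiting for the next scheduled time.
                .withMisfireHandlingInstructionFireAndProceed())
            .build();
    }
}
```

The default misfire handling ("smart policy") can effectively skip a missed firing, which matches the "we miss another day of syncing" behavior described above.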

camsaul changed the milestone from 0.38.1 to 0.38.2 on Mar 3, 2021
robdaemon pushed a commit that referenced this issue Mar 3, 2021
It's possible for the scheduler to get in a weird state if the sync
fails while it executes.

This change makes it *only* recreate a job/task if the schedule has
changed or if it is missing. Previously, clearing the state at every
start had bad effects if the JVM had terminated during the sync.

#14817
robdaemon pushed a commit that referenced this issue Mar 5, 2021 (same commit message as above)
robdaemon pushed a commit that referenced this issue Mar 5, 2021 (same commit message as above)
camsaul linked a pull request Mar 9, 2021 that will close this issue
robdaemon pushed a commit that referenced this issue Mar 12, 2021
It's possible for the scheduler to get in a weird state if the sync
fails while it executes.

This change makes it *only* recreate a job/task if the schedule has
changed or if it is missing. Previously, clearing the state at every
start had bad effects if the JVM had terminated during the sync.

Adds a vector of Exception classes that signal a "fatal" exception
during sync for a specific database. If these exceptions occur, the sync
for that database stops and will pick up next time.

This will have to be expanded per driver, but I don't see a way around
that, as each driver will have its own unique way of failing.

#14817
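The "fatal exceptions" idea from this commit can be sketched as follows. This is a Java illustration of the concept only; the actual implementation is Clojure inside Metabase's sync code, and the listed exception classes are just plausible examples, not the real per-driver lists:

```java
import java.util.List;

public class FatalSyncErrors {

    // Exception classes that signal "this database is unreachable; abort its
    // sync and let the next scheduled run retry". A real implementation would
    // extend this list per driver, since each driver fails in its own way.
    static final List<Class<? extends Exception>> FATAL = List.of(
        java.net.ConnectException.class,
        java.sql.SQLNonTransientConnectionException.class);

    // A sync failure is fatal if the thrown exception, or any exception in
    // its cause chain, is an instance of one of the fatal classes.
    static boolean isFatal(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            for (Class<? extends Exception> c : FATAL) {
                if (c.isInstance(cur)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A fatal classification stops the remaining sync steps for that one database, while non-fatal errors (e.g. a bad row during a scan) can be logged and skipped without aborting.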
camsaul (Member) commented Mar 17, 2021

Fixed by #15043

camsaul closed this as completed Mar 17, 2021