When model cache process is interrupted by app restart, model cache becomes unusable #30360

Open
likeshumidity opened this issue Apr 25, 2023 · 10 comments
Labels
.Backend .Escalation Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness Querying/Cache Querying/Models aka Datasets .Team/QueryProcessor :hammer_and_wrench: Type:Bug Product defects

Comments

@likeshumidity
Contributor

likeshumidity commented Apr 25, 2023

Is your feature request related to a problem? Please describe.

  • When Metabase is hosted on Heroku, it does not handle the SIGTERM sent on app restarts gracefully if a scheduled process such as model caching is running at the same time. The model caching never completes, so the model cache stays in a "pending" state. (See Some model cache refresh are stuck in "pending" // monitor model caching tasks #29689 for details of the issue.)
    • Adding retries may also cause a restart loop if the scheduled process is part of the cause of the SIGTERM, so a retry limit may be needed if retries are enabled.

Describe the solution you'd like

  • Track when the model cache is interrupted (and therefore invalid) and retry the model caching process (perhaps up to 2 more times?)
  • Retry scheduled processes interrupted by SIGTERM after restart, up to 2 times

Describe alternatives you've considered

  • Current behavior of ignoring SIGTERM, which leaves model caching in an unusable state
  • Current behavior of leaving the model cache in an unusable state if the process is interrupted by an app restart

How important is this feature to you?

@luizarakaki
Contributor

I think something is missing in the description.

What happens with cached Models when the instance is killed? If the user tries to access them, do they get an error?

@likeshumidity
Contributor Author

Thanks @luizarakaki, just updated the description and finished the sentence.

@dpsutton
Contributor

At first read, this is working correctly. The fact that the cache is in a "pending" state means it will not be substituted in. The original design of this is that interruptions like this just bust the cache until the next refresh happens. If the cache stays in the "pending" state and never gets updated that is a bug. If it gets fixed on the next scheduled refresh then I consider it working well. Do you know which is the case here?

@dpsutton
Contributor

Based on

(def ^:private refreshable-states
  "States of `persisted_info` records which can be refreshed."
  #{"creating" "persisted" "error"})

I guess that the cache being in a "pending" state means it is forever disabled. We'll have to fix this. The good thing is that we disallow concurrent execution:

(jobs/defjob ^{org.quartz.DisallowConcurrentExecution true
               :doc "Refresh persisted tables job"}
  PersistenceRefresh [job-context]
  (refresh-job-fn! job-context))

That means the refresh job runs serially, so "pending" isn't a concern. The only issue is making sure that the cache isn't currently undergoing a manual refresh. We should be able to use the refresh_begin or state_change_at columns of the persisted_info table to guard against that.
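As a rough sketch of that idea (illustrative only, not the actual fix; the one-hour staleness cutoff is an arbitrary assumption), a startup or pre-refresh step could flip rows stuck in "pending" back into one of the refreshable states so the next scheduled run picks them up:

-- Hypothetical cleanup sketch: move long-stuck "pending" rows into "error",
-- which is in refreshable-states, so the regular refresh job reconsiders them.
-- The 1-hour cutoff is an assumption, not a Metabase default.
UPDATE persisted_info
SET state = 'error'
WHERE state = 'pending'
  AND state_change_at < NOW() - INTERVAL '1 hour';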

@likeshumidity
Contributor Author

It stays pending until the next refresh, but if refreshes are scheduled every 24 hours and uncached queries are very expensive, performance can take a significant hit or the instance can even become unresponsive until the cache is successfully rebuilt.

@dpsutton
Contributor

@likeshumidity I think that is acceptable. The original design of this was always "cached when possible". From Bruno in the Notion doc:

To be extra clear: there’s absolutely no promise at any point that the original query won’t be run if caching is on. That’s part of why it’s called caching!

Improving this is always good. But I'm not sure I'd call this a bug (unless the "pending" state prevents the cache from being repopulated on the next scheduled refresh).

@luizarakaki
Contributor

Right... Thanks for the fix there.

Yeah, this was by design. So I agree with the "new feature" label there.

@paoliniluis paoliniluis added Querying/Models aka Datasets Querying/Cache Type:Bug Product defects Priority:P2 Average run of the mill bug and removed .Needs Triage Type:New Feature labels Apr 25, 2023
@pierrickouw

pierrickouw commented Apr 26, 2023

Hi (original reporter here)! Thanks @likeshumidity for the report!

Thanks for looking into it! Some additional info:

It stays pending until the next refresh, but if refreshes are scheduled every 24 hours and uncached queries are very expensive, performance can take a significant hit or the instance can even become unresponsive until the cache is successfully rebuilt.

Actually, it never refreshes until we disable all the caches and re-enable them. We had one cache stuck for more than 10 days (even though we have a 1h refresh), and it took us a long time to figure out what was wrong with our Metabase instance. So this sentence is correct:

unless the "pending" state prevents the cache from being repopulated on the next scheduled refresh.

We would be fine with the cache being unavailable between two runs when this problem occurs, but as of today, we have to manually restart (all of) them and manually monitor them.

@likeshumidity
Contributor Author

@dpsutton re: your last comment, I was wrong about the problem correcting itself on the next model cache run. It does require manual intervention: disabling and re-enabling the cache.

@ranquild ranquild added .Team/QueryProcessor :hammer_and_wrench: .Backend labels Jun 2, 2023
@pierrickouw

FWIW, and for other people: I plugged the Metabase application database into Metabase as a data source and created an Alert in Metabase with this query, to at least be notified if something is off:

select * from persisted_info 
where active=true
and state != 'persisted'
and refresh_begin >= NOW() - INTERVAL '10 minutes';
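For example, a variant like the following (the one-hour cutoff is just an illustration; tune it to your own refresh interval) would only flag rows that have been stuck in a non-persisted state for longer than a full refresh window:

select * from persisted_info
where active = true
and state != 'persisted'
and refresh_begin < NOW() - INTERVAL '1 hour';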

@likeshumidity likeshumidity added Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness .Escalation and removed Priority:P2 Average run of the mill bug labels Feb 23, 2024