When model cache process is interrupted by app restart, model cache becomes unusable #30360

Open
likeshumidity opened this issue Apr 25, 2023 · 10 comments
Labels
.Backend .Escalation Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness Querying/Cache Querying/Models aka Datasets .Team/QueryProcessor :hammer_and_wrench: Type:Bug Product defects

Comments

@likeshumidity
Contributor

likeshumidity commented Apr 25, 2023

Is your feature request related to a problem? Please describe.

  • When Metabase is hosted on Heroku, it does not handle the SIGTERM sent on app restarts gracefully if a scheduled process such as model caching is running at the same time. The model caching never completes, so the model cache stays in a "pending" state. (See Some model cache refresh are stuck in "pending" // monitor model caching tasks #29689 for details of the issue.)
    • Adding retries may also cause a restart loop if the scheduled process is part of the cause of the SIGTERM, so a retry limit may be needed if retries are enabled.

Describe the solution you'd like

  • Track when the model cache is interrupted (and therefore invalid) and retry the model caching process (perhaps up to 2 more times?)
  • Retry scheduled processes interrupted by SIGTERM after restart, up to 2 times

Describe alternatives you've considered

  • Current behavior of ignoring SIGTERM, which leaves model caching in an unusable state
  • Current behavior of leaving the model cache in an unusable state if the process is interrupted by an app restart

How important is this feature to you?

@luizarakaki
Contributor

I think something is missing in the description.

What happens with cached Models when the instance is killed? If the user tries to access them, do they get an error?

@likeshumidity
Contributor Author

Thanks @luizarakaki, just updated the description and finished the sentence.

@dpsutton
Contributor

At first read, this is working correctly. The fact that the cache is in a "pending" state means it will not be substituted in. The original design of this is that interruptions like this just bust the cache until the next refresh happens. If the cache stays in the "pending" state and never gets updated that is a bug. If it gets fixed on the next scheduled refresh then I consider it working well. Do you know which is the case here?

@dpsutton
Contributor

Based on

(def ^:private refreshable-states
  "States of `persisted_info` records which can be refreshed."
  #{"creating" "persisted" "error"})

I guess that the cache being in a "pending" state means it is forever disabled. We'll have to fix this. The good thing is that we disallow concurrent execution:

(jobs/defjob ^{org.quartz.DisallowConcurrentExecution true
               :doc "Refresh persisted tables job"}
  PersistenceRefresh [job-context]
  (refresh-job-fn! job-context))

That means the refresh job runs serially, so "pending" isn't a concern. The only issue is making sure that the cache isn't currently undergoing a manual refresh. We should be able to use the refresh_begin or state_change_at columns of the persisted_info table to guard against that.
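As a rough sketch of that idea (illustrative only, not the actual fix; the one-hour staleness cutoff is an arbitrary assumption), a startup or pre-refresh step could flip rows stuck in "pending" back into one of the refreshable states so the next scheduled run picks them up:

-- Hypothetical cleanup sketch: move long-stuck "pending" rows into "error",
-- which is in refreshable-states, so the regular refresh job reconsiders them.
-- The 1-hour cutoff is an assumption, not a Metabase default.
UPDATE persisted_info
SET state = 'error'
WHERE state = 'pending'
  AND state_change_at < NOW() - INTERVAL '1 hour';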

@likeshumidity
Contributor Author

It stays pending until the next refresh, but if refreshes are scheduled every 24 hours and uncached queries are very expensive, performance can take a significant hit or the instance can even become unresponsive until the cache is successfully rebuilt.

@dpsutton
Contributor

@likeshumidity I think that is acceptable. The original design of this was always "cached when possible". From Bruno in the Notion doc:

To be extra clear: there’s absolutely no promise at any point that the original query won’t be run if caching is on. That’s part of why it’s called caching!

Improving this is always good. But I'm not sure I'd call this a bug (unless the "pending" state prevents the cache from being repopulated on the next scheduled refresh).

@luizarakaki
Contributor

Right... Thanks for the fix there.

Yeah, this was by design. So I agree with the "new feature" label there.

@paoliniluis paoliniluis added Querying/Models aka Datasets Querying/Cache Type:Bug Product defects Priority:P2 Average run of the mill bug and removed .Needs Triage Type:New Feature labels Apr 25, 2023
@pierrickouw

pierrickouw commented Apr 26, 2023

Hi (original reporter here)! Thanks @likeshumidity for the report!

Thanks for looking into it! Some additional info:

It stays pending until the next refresh, but if refreshes are scheduled every 24 hours and uncached queries are very expensive, performance can take a significant hit or the instance can even become unresponsive until the cache is successfully rebuilt.

Actually, it never refreshes until we disable all the caches and re-enable them. We had one cache stuck for more than 10 days (even though we have a 1h refresh), and it took us a long time to figure out what was wrong with our Metabase instance. So this sentence is correct:

unless the "pending" state prevents the cache from being repopulated on the next scheduled refresh.

We would be fine with the cache being unavailable between two runs when this problem occurs, but as of today, we have to manually restart (all of) them and manually monitor them.

@likeshumidity
Contributor Author

@dpsutton re: your last comment, I was wrong about the problem correcting itself on the next model cache run. It does require manual intervention: disabling and re-enabling the cache.

@ranquild ranquild added .Team/QueryProcessor :hammer_and_wrench: .Backend labels Jun 2, 2023
@pierrickouw

FWIW, and for other people: I plugged the Metabase application database into Metabase as a data source and created an Alert in Metabase with this query, to at least be notified if something is off:

select * from persisted_info 
where active=true
and state != 'persisted'
and refresh_begin >= NOW() - INTERVAL '10 minutes';
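For example, a variant like the following (the one-hour cutoff is just an illustration; tune it to your own refresh interval) would only flag rows that have been stuck in a non-persisted state for longer than a full refresh window:

select * from persisted_info
where active = true
and state != 'persisted'
and refresh_begin < NOW() - INTERVAL '1 hour';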

@likeshumidity likeshumidity added Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness .Escalation and removed Priority:P2 Average run of the mill bug labels Feb 23, 2024