
Rework places maintenance code #5115

Closed
bendk opened this issue Sep 1, 2022 · 8 comments

bendk (Contributor) commented Sep 1, 2022

A while back we updated the run_maintenance function to prune old history if the places DB was over a certain limit. It looks like Fenix wants to start using it, but we've identified a few areas that could be improved:

  • Helping consumers decide how frequently to run it. It's unclear how often this should be run; partly that's a matter of metrics and tuning, but one thing we could do to help is return a boolean indicating whether the DB is still over the size limit after pruning (see the sketch after this list). This would help consumers decide how quickly to schedule the next run.
  • Better interruption support. Right now there's no good way to interrupt a run_maintenance() call. The interrupt() method will interrupt any operation, so it's not really safe for consumers to call it if they only want to interrupt maintenance. We should add support for interruption that's limited to certain kinds of operations.
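
One possible shape for that first point, just to make it concrete (the names here are illustrative, not the real places API):

```rust
// Hypothetical sketch, not the existing API: run_maintenance() prunes old
// history and then reports whether the DB file is still over the
// caller-supplied size limit.
pub fn run_maintenance(db: &PlacesDb, db_size_limit_bytes: u64) -> Result<bool> {
    prune_old_visits(db, db_size_limit_bytes)?; // hypothetical pruning step
    let size_after = db_file_size(db)?;         // hypothetical size query
    // `true` tells the consumer to schedule the next run sooner.
    Ok(size_after > db_size_limit_bytes)
}
```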

┆Issue is synchronized with this Jira Task
┆Sprint End Date: 2022-09-16

bendk (Contributor, Author) commented Sep 1, 2022

For that second part, I think we should try one more time to get a good interruption system. What if PlacesDb held an AtomicU32 that stored the interruption state? 0 would mean either that no operation was happening or that the current operation should be interrupted. All other values would specify a type of operation: 1 means a general write, 2 means maintenance, 3 means incremental search, etc. When you lock the DB you would store the type of operation you were about to do, then periodically check whether the value had been reset to 0. Other threads could interrupt only particular types of operations using compare-and-swap.
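
Very roughly, something like this (the operation codes and names are just illustrative, none of this is existing PlacesDb API):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical operation codes; 0 doubles as "idle" and "please interrupt".
const IDLE_OR_INTERRUPTED: u32 = 0;
const OP_WRITE: u32 = 1;
const OP_MAINTENANCE: u32 = 2;
const OP_INCREMENTAL_SEARCH: u32 = 3;

pub struct InterruptState {
    current_op: AtomicU32,
}

impl InterruptState {
    // The thread holding the DB lock records what it's about to do.
    pub fn begin(&self, op: u32) {
        self.current_op.store(op, Ordering::SeqCst);
    }

    // The working thread polls this periodically and bails out when it's true.
    pub fn was_interrupted(&self) -> bool {
        self.current_op.load(Ordering::SeqCst) == IDLE_OR_INTERRUPTED
    }

    // The working thread resets the state when it finishes.
    pub fn finish(&self) {
        self.current_op.store(IDLE_OR_INTERRUPTED, Ordering::SeqCst);
    }

    // Another thread interrupts only a specific kind of operation: the
    // compare-and-swap only fires if `op` is what's currently running.
    pub fn interrupt(&self, op: u32) -> bool {
        self.current_op
            .compare_exchange(op, IDLE_OR_INTERRUPTED, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}
```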

This seems like a nice simplification of the current InterruptScope system. If we wanted to use that system to interrupt only maintenance, the calls get complex. I think it would be something like: you call prepare_maintenance(), which returns an object with an interrupt scope, then you store that interrupt scope in a shared variable, then you call the run_maintenance() method to start maintenance using that interrupt scope. That all seems very cumbersome just to add interrupt support. For something like autocomplete it would be even more difficult.

The main drawback I see is that it becomes harder to interrupt a specific operation, rather than a type of operation. However, it doesn't seem like we need that.

mhammond (Member) commented Sep 6, 2022

> For that second part, I think we should try one more time to get a good interruption system

Stepping back a little, I think the interruption requirement comes from mozilla-mobile/fenix#7227 (comment)

> so that we can cancel long running maintenance tasks if the app is killed in order to avoid putting the storage layer into a bad state.

So this seems to be met by our existing interrupt support - we don't actually care what's running, just interrupt the world. But then the discussion turned into making sure we don't interrupt the "wrong" operation - but is there actually a wrong operation at shutdown? Or is there a non-shutdown use-case I've forgotten about?

@jonalmeida / @csadilek, what are the actual constraints here?

csadilek (Collaborator) commented Sep 6, 2022

> Or is there a non-shutdown use-case I've forgotten about?

@mhammond Yes, I agree that interrupting the world would be fine in case the app is being shut down, but the case we wanted to discuss with you is this:

  • The device becomes idle and all other constraints are met too
  • Maintenance hasn't run in > 24 hours (or whatever threshold we decide on)
  • run_maintenance is called
  • While run_maintenance is executing, any of the constraints could change and no longer be met, e.g., the device/app is active again (no longer idle).

Ideally, we should then cancel the maintenance operation and run again the next time the device is idle.

If we keep it running we risk slowing down the app / device.

If we interrupt via the single writer, we would currently risk interrupting the wrong operation, e.g., if the user opened a link and we wanted to record the visit while run_maintenance is executing. The regular sync worker could also currently overlap with a run_maintenance call.

If we're reasonably confident that a single run_maintenance call takes a short amount of time, then maybe we should not worry about interrupting it and just let it complete? In the case of shutdown, would it even matter if we interrupted if there's no risk of the storage ending up in a bad state?

I am currently leaning towards not interrupting (unless we can make sure it's the right operation), but I'm wondering what you think. Do we have a rough idea how long a single execution of run_maintenance with default parameters takes?

bendk (Contributor, Author) commented Sep 7, 2022

So it seems like we're leaning towards not interrupting. What if we updated the result of the operation to give you some metrics that could be useful both for scheduling and for double-checking that this isn't leading to pathological cases? I'm thinking we could return a result with the following fields (rough sketch after the list):

  • Did we do any pruning?
  • Total time taken. This should give you good insight into whether the operation is taking too long for some users.
  • What was the size of the DB before run_maintenance ran?
  • What was the size after? This could be useful for scheduling: if the size is larger than the target size, you could lower the delay before the next call; otherwise you could wait longer.
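
Concretely, maybe something like this (field names are just for illustration, not a final API):

```rust
// Hypothetical result returned by run_maintenance().
pub struct RunMaintenanceMetrics {
    pub pruned_visits: bool, // did we prune any history this run?
    pub total_time_ms: u64,  // how long the whole run took
    pub db_size_before: u64, // DB file size in bytes before the run
    pub db_size_after: u64,  // DB file size in bytes after the run
}
```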

> In the case of shutdown, would it even matter if we interrupted if there's no risk of the storage ending up in a bad state?

AFAICT there's no chance of ending up in a bad state.

> Do we have a rough idea how long a single execution of run_maintenance with default parameters takes?

I don't have an estimate, but I think Grisha's analysis in fenix#7227 is correct. In almost all cases it should take a very short time, but in rare cases it might take longer. I think the most likely slow case is the very first VACUUM we run after not having run it for a while.

csadilek (Collaborator) commented Sep 7, 2022

> So it seems like we're leaning towards not interrupting.

@bendk yes, if you and @mhammond agree as well. I think this would be a good starting point.

> What if we updated the result of the operation to give you some metrics that could be useful for both scheduling purposes and double checking that this isn't leading to pathological cases?

That sounds good. Then we can record telemetry for it and land it in Nightly (maybe even behind a feature flag first). This will let us look at the data, especially total time taken, and we can still decide later whether we need to add interrupt logic. I expect the data to be significantly different on Release (more diverse devices), but even then we will learn and can adjust.

> If the size is larger than the target size, then you could lower the delay before the next call. Otherwise you could wait a longer time.

What is the target size in this case?

/cc @jonalmeida

bendk (Contributor, Author) commented Sep 7, 2022

> What is the target size in this case?

Whatever value you pass in to us. Desktop uses 75 MiB, which seems like a good starting point (https://searchfox.org/mozilla-central/source/toolkit/components/places/PlacesExpiration.sys.mjs#47).
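
So, assuming the result struct sketched above, a consumer could do something like this to pick the next delay (the numbers are just placeholders):

```rust
use std::time::Duration;

// Desktop's default size limit, per PlacesExpiration.sys.mjs.
const DB_SIZE_LIMIT_BYTES: u64 = 75 * 1024 * 1024;

fn next_maintenance_delay(metrics: &RunMaintenanceMetrics) -> Duration {
    if metrics.db_size_after > DB_SIZE_LIMIT_BYTES {
        // Still over the target size: schedule another run soon.
        Duration::from_secs(60 * 60)
    } else {
        // Under the target: the normal daily cadence is fine.
        Duration::from_secs(24 * 60 * 60)
    }
}
```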

bendk (Contributor, Author) commented Sep 8, 2022

One slight tweak after writing this code: I think we should record metrics on the time taken, but not return it as part of the metrics struct. The reason is that Glean's timing distribution metric requires you to use its own timer: you can't pass it a specific value to record, nor can you read back the value it recorded.
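
For context, the pattern looks roughly like this (this assumes the Glean Rust SDK's timing-distribution API; the metric and method names are hypothetical):

```rust
// Sketch only: the timer id comes from the Glean metric itself, so there's no
// separate duration value that run_maintenance() could also hand back.
fn run_maintenance_timed(db: &PlacesDb, db_size_limit: u64) -> Result<RunMaintenanceMetrics> {
    let timer_id = metrics::places_manager::run_maintenance_time.start();
    let result = db.run_maintenance(db_size_limit);
    match &result {
        Ok(_) => metrics::places_manager::run_maintenance_time.stop_and_accumulate(timer_id),
        Err(_) => metrics::places_manager::run_maintenance_time.cancel(timer_id),
    }
    result
}
```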

bendk (Contributor, Author) commented Oct 20, 2022

AFAICT all the app-services code is ready now that #5123 has been merged. Closing this one.
