
Rework places maintenance code #5115

Closed
bendk opened this issue Sep 1, 2022 · 8 comments

bendk (Contributor) commented Sep 1, 2022

A while back we updated the run_maintenance function to prune old history if the places DB was over a certain limit. It looks like Fenix wants to start using it, but we've identified a few areas that could be improved:

  • Helping consumers decide how frequently to run it. It's unclear how often this should be run; partly that's a matter of metrics and tuning, but one thing we could do to help is return a boolean indicating whether the DB is still over the size limit after pruning (see the sketch after this list). This would help consumers decide how quickly to schedule the next run.
  • Better interruption support. Right now there's no good way to interrupt a run_maintenance() call. The interrupt() method will interrupt any operation, so it's not really safe for consumers to call it if they only want to interrupt maintenance. We should add support for interruption that's limited to certain kinds of operations.
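
One possible shape for that first point, just to make it concrete (the names here are illustrative, not the real places API):

```rust
// Hypothetical sketch, not the existing API: run_maintenance() prunes old
// history and then reports whether the DB file is still over the
// caller-supplied size limit.
pub fn run_maintenance(db: &PlacesDb, db_size_limit_bytes: u64) -> Result<bool> {
    prune_old_visits(db, db_size_limit_bytes)?; // hypothetical pruning step
    let size_after = db_file_size(db)?;         // hypothetical size query
    // `true` tells the consumer to schedule the next run sooner.
    Ok(size_after > db_size_limit_bytes)
}
```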

┆Issue is synchronized with this Jira Task
┆Sprint End Date: 2022-09-16

bendk (Contributor, Author) commented Sep 1, 2022

For that second part, I think we should try one more time to get a good interruption system. What if PlacesDb held an AtomicU32 that stored the interruption state? 0 would mean either that no operation was happening or that the current operation should be interrupted. All other values would specify a type of operation: 1 means a general write, 2 means maintenance, 3 means incremental search, etc. When you lock the DB you would store the type of operation you were about to do, then periodically check whether the value had been reset to 0. Other threads could interrupt only particular types of operations using compare-and-swap.
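
Very roughly, something like this (the operation codes and names are just illustrative, none of this is existing PlacesDb API):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical operation codes; 0 doubles as "idle" and "please interrupt".
const IDLE_OR_INTERRUPTED: u32 = 0;
const OP_WRITE: u32 = 1;
const OP_MAINTENANCE: u32 = 2;
const OP_INCREMENTAL_SEARCH: u32 = 3;

pub struct InterruptState {
    current_op: AtomicU32,
}

impl InterruptState {
    // The thread holding the DB lock records what it's about to do.
    pub fn begin(&self, op: u32) {
        self.current_op.store(op, Ordering::SeqCst);
    }

    // The working thread polls this periodically and bails out when it's true.
    pub fn was_interrupted(&self) -> bool {
        self.current_op.load(Ordering::SeqCst) == IDLE_OR_INTERRUPTED
    }

    // The working thread resets the state when it finishes.
    pub fn finish(&self) {
        self.current_op.store(IDLE_OR_INTERRUPTED, Ordering::SeqCst);
    }

    // Another thread interrupts only a specific kind of operation: the
    // compare-and-swap only fires if `op` is what's currently running.
    pub fn interrupt(&self, op: u32) -> bool {
        self.current_op
            .compare_exchange(op, IDLE_OR_INTERRUPTED, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}
```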

This seems like a nice simplification of the current InterruptScope system. If we wanted to use that system to interrupt only maintenance, the calls get complex. I think it would be something like: you call prepare_maintenance(), which returns an object with an interrupt scope, then you store that interrupt scope in a shared variable, then you call the run_maintenance() method to start maintenance using that interrupt scope. That all seems very cumbersome just to add interrupt support. For something like autocomplete it would be even more difficult.

The main drawback I see is that it becomes harder to interrupt a specific operation, rather than a type of operation. However, it doesn't seem like we need that.

mhammond (Member) commented Sep 6, 2022

> For that second part, I think we should try one more time to get a good interruption system

Stepping back a little, I think the interruption requirement comes from mozilla-mobile/fenix#7227 (comment)

> so that we can cancel long running maintenance tasks if the app is killed in order to avoid putting the storage layer into a bad state.

So this seems to be met by our existing interrupt support - we don't actually care what's running, just interrupt the world. But then the discussion turned into making sure we don't interrupt the "wrong" operation - but is there actually a wrong operation at shutdown? Or is there a non-shutdown use-case I've forgotten about?

@jonalmeida / @csadilek, what are the actual constraints here?

csadilek (Collaborator) commented Sep 6, 2022

> Or is there a non-shutdown use-case I've forgotten about?

@mhammond Yes, I agree that interrupting the world would be fine in case the app is being shut down, but the case we wanted to discuss with you is this:

  • The device becomes idle and all other constraints are met too
  • Maintenance hasn't run in > 24 hours (or whatever threshold we decide on)
  • run_maintenance is called
  • While run_maintenance is executing, any of the constraints could change and no longer be met, e.g., the device/app is active again (no longer idle).

Ideally, we should then cancel the maintenance operation and run again the next time the device is idle.

If we keep it running we risk slowing down the app / device.

If we interrupt via the single writer, we would currently risk interrupting the wrong operation, e.g., if the user opened a link and we wanted to record the visit while run_maintenance is executing. The regular sync worker could also currently overlap with a run_maintenance call.

If we're reasonably confident that a single run_maintenance call takes a short amount of time, then maybe we should not worry about interrupting it and just let it complete? In the case of shutdown, would it even matter if we interrupted if there's no risk of the storage ending up in a bad state?

I am currently leaning towards not interrupting (unless we can make sure it's the right operation), but I'm wondering what you think. Do we have a rough idea how long a single execution of run_maintenance with default parameters takes?

bendk (Contributor, Author) commented Sep 7, 2022

So it seems like we're leaning towards not interrupting. What if we updated the result of the operation to give you some metrics that could be useful both for scheduling and for double-checking that this isn't leading to pathological cases? I'm thinking we could return a result with the following fields (rough sketch after the list):

  • Did we do any pruning?
  • Total time taken. This should give you good insight into whether the operation is taking too long for some users.
  • What was the size of the DB before run_maintenance ran?
  • What was the size after? This could be useful for scheduling: if the size is larger than the target size, you could lower the delay before the next call; otherwise you could wait longer.
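
Concretely, maybe something like this (field names are just for illustration, not a final API):

```rust
// Hypothetical result returned by run_maintenance().
pub struct RunMaintenanceMetrics {
    pub pruned_visits: bool, // did we prune any history this run?
    pub total_time_ms: u64,  // how long the whole run took
    pub db_size_before: u64, // DB file size in bytes before the run
    pub db_size_after: u64,  // DB file size in bytes after the run
}
```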

> In the case of shutdown, would it even matter if we interrupted if there's no risk of the storage ending up in a bad state?

AFAICT there's no chance of ending up in a bad state.

> Do we have a rough idea how long a single execution of run_maintenance with default parameters takes?

I don't have an estimate, but I think Grisha's analysis in fenix#7227 is correct. In almost all cases it should take a very short time, but in rare cases it might take longer. I think the most likely slow case is the very first VACUUM we run after not having run it for a while.

csadilek (Collaborator) commented Sep 7, 2022

> So it seems like we're leaning towards not interrupting.

@bendk yes, if you and @mhammond agree as well. I think this would be a good starting point.

> What if we updated the result of the operation to give you some metrics that could be useful for both scheduling purposes and double checking that this isn't leading to pathological cases?

That sounds good. Then we can record telemetry for it and land it in Nightly (maybe even behind a feature flag first). This will let us look at the data, especially total time taken, and we can still decide later whether we need to add interrupt logic. I expect the data to be significantly different on Release (more diverse devices), but even then we will learn and can adjust.

> If the size is larger than the target size, then you could lower the delay before the next call. Otherwise you could wait a longer time.

What is the target size in this case?

/cc @jonalmeida

bendk (Contributor, Author) commented Sep 7, 2022

> What is the target size in this case?

Whatever value you pass in to us. Desktop uses 75 MiB, which seems like a good starting point (https://searchfox.org/mozilla-central/source/toolkit/components/places/PlacesExpiration.sys.mjs#47).
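
So, assuming the result struct sketched above, a consumer could do something like this to pick the next delay (the numbers are just placeholders):

```rust
use std::time::Duration;

// Desktop's default size limit, per PlacesExpiration.sys.mjs.
const DB_SIZE_LIMIT_BYTES: u64 = 75 * 1024 * 1024;

fn next_maintenance_delay(metrics: &RunMaintenanceMetrics) -> Duration {
    if metrics.db_size_after > DB_SIZE_LIMIT_BYTES {
        // Still over the target size: schedule another run soon.
        Duration::from_secs(60 * 60)
    } else {
        // Under the target: the normal daily cadence is fine.
        Duration::from_secs(24 * 60 * 60)
    }
}
```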

bendk (Contributor, Author) commented Sep 8, 2022

One slight tweak after writing this code: I think we should record metrics on the time taken, but not return it as part of the metrics struct. The reason is that Glean's timing distribution metric requires you to use its own timer: you can't pass it a specific value to record, nor can you read back the value it recorded.
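
For context, the pattern looks roughly like this (this assumes the Glean Rust SDK's timing-distribution API; the metric and method names are hypothetical):

```rust
// Sketch only: the timer id comes from the Glean metric itself, so there's no
// separate duration value that run_maintenance() could also hand back.
fn run_maintenance_timed(db: &PlacesDb, db_size_limit: u64) -> Result<RunMaintenanceMetrics> {
    let timer_id = metrics::places_manager::run_maintenance_time.start();
    let result = db.run_maintenance(db_size_limit);
    match &result {
        Ok(_) => metrics::places_manager::run_maintenance_time.stop_and_accumulate(timer_id),
        Err(_) => metrics::places_manager::run_maintenance_time.cancel(timer_id),
    }
    result
}
```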

bendk (Contributor, Author) commented Oct 20, 2022

AFAICT all the app-services code is ready now that #5123 has been merged. Closing this one.
