
Implement history pruning/expiration API #611

Closed
thomcc opened this issue Feb 1, 2019 · 3 comments


See mozilla-mobile/android-components#1372

Ideas:

  • We can safely delete visits that aren't in the most recent 20.
    • It would be nice to remember that these URLs are pruned so that we don't upload a tombstone unnecessarily. Needs thought though.
  • We can delete low frecency URLs (without remembering tombstones)

@thomcc added this to the History API - v1 milestone Feb 4, 2019

@rfk added the in progress label Feb 4, 2019

@grigoryk referenced this issue Feb 5, 2019: [meta] PlacesAPI History deletion APIs #621 (closed, 3 of 3 tasks complete)

@thomcc self-assigned this Feb 5, 2019

@thomcc commented Feb 8, 2019

I think the first cut of this might stub out pruneDestructively, and leave runMaintenance as "just VACUUM" (VACUUMing saves ~4MB for the DB in #639). But here are my thoughts on this (some of which are actually @mhammond's thoughts too):

3 APIs, probably

  • fun wipeLocal()
  • fun runMaintenance()
  • fun pruneDestructively()

wipeLocal()

Delete all history items in the database. These changes are not synced.

For a non-sync user, this is complete data loss.

For a sync user, the data will still be available remotely, and may be synced down as part of the
next sync, but that is not guaranteed (#637
will make it more likely, if we do it).
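
Something like this, purely as a sketch of the intent (assuming rusqlite and a desktop-Places-style schema; moz_places, moz_historyvisits, and the existence of a tombstones table are assumptions, not confirmed names in this component):

```rust
use rusqlite::Connection;

/// Sketch of wipeLocal(): drop all local history without recording
/// tombstones, so nothing about the deletion is uploaded to Sync.
/// Table names mirror desktop Places and are assumptions here.
fn wipe_local(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "BEGIN;
         DELETE FROM moz_historyvisits;
         DELETE FROM moz_places;
         -- Deliberately no inserts into any tombstones table: these
         -- deletions must stay local and never sync.
         COMMIT;",
    )
}
```

The point the sketch makes explicit is that no tombstones are written, which is exactly what makes this "complete data loss" for a non-sync user.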

runMaintenance()

Initially this will just call VACUUM. Eventually it will do other things. It would be nice if
A-C or consuming applications could orchestrate it so that it gets called once per day.

Until that happens, we have to assume it doesn't get called and can't rely on it.

Eventually, we may decide to perform a lighter version of pruning that shouldn't lose too much. It's possible that that lighter version can be applied on the fly during sync.
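
For the first cut described above, runMaintenance() would be little more than a VACUUM. A rough sketch (again rusqlite, not the component's actual code):

```rust
use rusqlite::Connection;

/// Sketch of the initial runMaintenance(): just VACUUM, which rewrites
/// the database file and reclaims free pages (~4MB for the DB in #639).
fn run_maintenance(conn: &Connection) -> rusqlite::Result<()> {
    // VACUUM can't run inside a transaction, and it briefly needs extra
    // disk space for the rewritten copy of the database file.
    conn.execute_batch("VACUUM;")
}
```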

pruneDestructively()

The name is intended to indicate to the caller that they should not be calling
this just as a matter of course.

This deletes user history without uploading the changes to sync. It might need to take a parameter
for how aggressive to be.

Due to the way Sync works, this is inherently destructive if the device plans on
ever writing history back up again.

Some thoughts on how we can/should implement this:

  1. Probably we want to keep the oldest history visit in addition to the ~N most recent ones (a rough sketch combining this with point 2 appears after this list).

    • This is a bit unintuitive, but we'd like to avoid a case like the following:
      1. Pruning occurs, and deletes everything but the last, IDK, 20 reddit.com visits,
        all of which happened in the past day. The user isn't aware of this at all.
      2. User deletes the past day of history because they were doing
        something they would like to hide, like online gift shopping.
      3. Because it's a deliberate deletion that includes all the known reddit.com
        visits, we upload a tombstone for reddit.com.
      4. Other clients sync, see the tombstone, delete all their visits for reddit.com
        (far more than 20)
      5. Reddit is no longer part of autocomplete on those clients, and while it will
        return, it will have a far lower frecency.
    • Keeping the oldest means they'd have to delete it too, which is less likely to be
      done accidentally (this isn't really true after the first sync...).
  2. We can consider scrapping titles of low-frecency items that haven't been visited recently.

    • Titles are basically optional anyway.
    • We'd like to avoid syncing title-less items, so don't do it for new ones.
    • Not having a title makes autocomplete work less well, so we should avoid doing it
      for things with high frecency.
    • Probably should consider title length as a heuristic here too.
  3. We can consider scrapping redundant visits of items that are frequently visited.

    • This will work better if we do #610
    • Needs some work defining "redundant".
    • This is lower impact, as
      #639 shows that most
      of the size is in moz_places, although all the indices do make trimming this
      somewhat worthwhile.
  4. Deleting certain kinds of URLs entirely (e.g. data: URLs) might be worthwhile, but it's likely to be low impact since most users will have none.

  5. ??? @mhammond and I had a meeting about this, and he probably remembers a couple I've missed
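
To make points 1 and 2 concrete, here's a rough, unoptimized sketch (rusqlite; moz_historyvisits/moz_places and their column names are assumptions modeled on desktop Places, and the thresholds are made-up parameters):

```rust
use rusqlite::{params, Connection};

/// Rough sketch of pruneDestructively(): keep each place's ~N most recent
/// visits plus its single oldest visit (point 1), then scrap titles of
/// low-frecency places that haven't been visited recently (point 2).
/// Schema and column names are assumptions, and nothing here writes
/// tombstones, so none of it is synced.
fn prune_destructively(
    conn: &mut Connection,
    keep_per_place: u32,        // e.g. 20
    title_frecency_cutoff: i64, // scrap titles below this frecency...
    title_age_cutoff_us: i64,   // ...if not visited since this timestamp (µs)
) -> rusqlite::Result<()> {
    let tx = conn.transaction()?;

    // Point 1: a visit survives if it is among the N most recent for its
    // place, or if it is that place's oldest visit (so a later "delete
    // today's history" doesn't tombstone the whole URL on other devices).
    tx.execute(
        "DELETE FROM moz_historyvisits
         WHERE id NOT IN (
             SELECT v.id FROM moz_historyvisits v
             WHERE (SELECT COUNT(*) FROM moz_historyvisits newer
                    WHERE newer.place_id = v.place_id
                      AND newer.visit_date > v.visit_date) < ?1
                OR v.visit_date = (SELECT MIN(visit_date)
                                   FROM moz_historyvisits oldest
                                   WHERE oldest.place_id = v.place_id)
         )",
        params![keep_per_place],
    )?;

    // Point 2: titles are basically optional, so drop them for cold,
    // low-frecency places; autocomplete suffers a little, but the rows
    // (and their indices) shrink.
    tx.execute(
        "UPDATE moz_places
         SET title = NULL
         WHERE frecency < ?1
           AND last_visit_date < ?2",
        params![title_frecency_cutoff, title_age_cutoff_us],
    )?;

    tx.commit()
}
```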

More ridiculous thoughts

  • We could consider compressing the data.

The places database linked in
#639 compresses to 29MB
(from 70MB) in under half a second for me with zstd on default settings (xz
is able to get it down to 21MB, but it takes 30s to do so. gzip on standard
settings took 3s and got it to 35MB, and increasing the settings was unable to
get it any smaller).

This is a half-baked idea though, because:

  1. It actually requires more disk space (briefly while compressing).
  2. It assumes we'll eventually get to the point where we want the data again. If
    we think that, we should probably avoid deleting it in the first place. The
    handful of MB we save by deleting history is not going to make the difference
    here.

@linacambridge / @grigoryk / @mhammond Thoughts?

EDIT:

It's worth finding out what the 'acceptable' size is: mozilla-mobile/android-components#1372 (comment)

It's also worth noting that pruning (both aggressive and otherwise) should probably be idempotent in order to prevent/mitigate various types of misuse.

@mhammond commented Feb 8, 2019

I agree with all the above. Other points Thom and I chatted about:

  • In general, and with some limits, I believe we want to avoid removing old visits which are the last visit for a URL. For example, if I last visited interesting-site.com 3 months ago, we should try and keep this visit even if we pruned more recent visits for more popular sites - nothing delights me more than the awesomebar re-finding a site I'm looking for but haven't visited recently.

I'd even go so far as to say that a goal of pruning should be that, as much as possible, the set of known sites ordered by frecency should roughly be the same before and after pruning (a small sketch of checking that property is at the end of this comment).

  • It's still not exactly clear to me what guidance we should offer for pruneDestructively(). We know that we will need some regular maintenance of history just like desktop - we don't want it to grow without bound. We also expect that in response to storage pressure we might want to be more aggressive (and then be less aggressive if storage pressure goes away - possibly even repopulating from sync in that scenario?)

  • Ideally we would not sync values we know we are going to prune next time we are asked. IOW, after a first-sync, it should be that our "regular maintenance" process should be a no-op, because Sync respected whatever our retention/deletion policy is. Similarly, if we are in a low-storage situation, we don't want sync to use up all available storage trying to add visits which would have been removed by the aggressive maintenance if we hadn't crashed due to running out of storage :)

(Note that I'm not suggesting we need to get this fancy in the first cut, but we should understand where we want to end up so it can inform our early iterations)
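
To make the frecency-ordering goal above concrete, a tiny sketch of how we might check it in a test (helper and schema names are assumptions, and exact top-N equality is of course stricter than "roughly the same"; prune_destructively is the sketch from the previous comment):

```rust
use rusqlite::{params, Connection, Result};

/// Hypothetical invariant check: pruning shouldn't reshuffle the sites the
/// awesomebar would surface. Compares the top-N places by frecency before
/// and after pruning. Schema names are assumptions.
fn top_places_by_frecency(conn: &Connection, n: u32) -> Result<Vec<i64>> {
    let mut stmt = conn.prepare(
        "SELECT id FROM moz_places ORDER BY frecency DESC, id LIMIT ?1",
    )?;
    let ids = stmt
        .query_map(params![n], |row| row.get::<_, i64>(0))?
        .collect::<Result<Vec<_>>>()?;
    Ok(ids)
}

fn prune_preserves_frecency_order(conn: &mut Connection) -> Result<bool> {
    let before = top_places_by_frecency(conn, 100)?;
    prune_destructively(conn, 20, 100, 0)?; // thresholds are illustrative
    let after = top_places_by_frecency(conn, 100)?;
    Ok(before == after)
}
```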

@linacambridge commented Feb 8, 2019

Ramblings, in no particular order. 😄

  • The three APIs, wipeLocal, runMaintenance, and pruneDestructively, make sense. I really like that the names are explicit about when they should be called, and we could always play around with how aggressive they are once we wire up our consumers.
  • Keeping older visits around, both to surface them in the awesomebar and to protect against accidental tombstoning of expired Places, definitely makes sense.
  • Compacting frequently visited sites also makes sense. It would help if we synced and round-tripped the frecency, or at least enough info to recalculate it. (I guess we technically sync the frecency today, as the sortindex. But it's far from stable, even between Desktops, and gets clobbered every time we reupload the record).
  • Other indicators, like dwell time and the signals Activity Stream wanted to collect, would give us more useful heuristics for relevancy.
  • We could roll up (sample? bucket? I'm not sure what the right term for this is) visits, with coarser granularity for older dates. The history UI shows today, yesterday, last week, last 6 months, and older than 6 months, and we could store aggregate counts (and frec_score, if we push on #610...though, I guess we'd always store that) for those durations. A rough sketch of one way to do this follows the list.
  • Going ahead with #610 is a good idea in general, to insulate frecency from compaction and expiration.
  • I like the idea of pruning titles for low-frecency items. Desktop also aggressively prunes "exotic" visits older than 60 days: those with long URLs, redirects, and downloads. Along with visit compaction, that might give us enough space savings while also remaining useful for the awesomebar.
  • Riffing on @mhammond's idea, adapting to storage pressure (pruning when it increases, repopulating from Sync when it goes away) is cool. We could store date ranges and counts for the history records we have on the server, and figure out how far to backfill based on that, as well as how far we got last time before pruning.
  • I'm not sure about compressing the database, for the reasons you mentioned. We'll need to decompress it, anyway, or store older visits in a separate database that we'd compress and page in on demand. I think we'd still need some kind of expiration strategy in the end, as having it grow unbounded would eventually impact the perf of all queries run on it, increase the time to vacuum, and increase file fragmentation.
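
For the roll-up idea above, something like this is what I have in mind (purely illustrative: moz_places_visit_rollups is a hypothetical table, visit dates are assumed to be microseconds since the Unix epoch, and the 30-day bucket size is arbitrary):

```rust
use rusqlite::{params, Connection};

/// Purely illustrative sketch of rolling up old visits: everything older
/// than `cutoff_us` is collapsed into per-place, ~30-day bucket counts.
/// moz_places_visit_rollups is a hypothetical table.
fn roll_up_old_visits(conn: &mut Connection, cutoff_us: i64) -> rusqlite::Result<()> {
    let tx = conn.transaction()?;
    tx.execute_batch(
        "CREATE TABLE IF NOT EXISTS moz_places_visit_rollups (
             place_id     INTEGER NOT NULL,
             bucket_start INTEGER NOT NULL, -- start of a ~30-day bucket
             visit_count  INTEGER NOT NULL,
             PRIMARY KEY (place_id, bucket_start)
         );",
    )?;
    // Fold each old visit into its (place, bucket) aggregate row.
    // The upsert (ON CONFLICT) needs SQLite 3.24+.
    tx.execute(
        "INSERT INTO moz_places_visit_rollups (place_id, bucket_start, visit_count)
         SELECT place_id,
                (visit_date / 2592000000000) * 2592000000000, -- 30 days in µs
                COUNT(*)
         FROM moz_historyvisits
         WHERE visit_date < ?1
         GROUP BY place_id, (visit_date / 2592000000000)
         ON CONFLICT(place_id, bucket_start)
         DO UPDATE SET visit_count = visit_count + excluded.visit_count",
        params![cutoff_us],
    )?;
    // The individual old rows aren't needed once they've been counted.
    tx.execute(
        "DELETE FROM moz_historyvisits WHERE visit_date < ?1",
        params![cutoff_us],
    )?;
    tx.commit()
}
```

If #610 lands, an aggregated frec_score (or enough to recompute it) could live in the same hypothetical table.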

thomcc added a commit that referenced this issue Feb 14, 2019

@thomcc closed this in d289393 Feb 14, 2019
