
Use exponential decay frecency algorithm #610

Open
thomcc opened this issue Feb 1, 2019 · 10 comments

Algorithm description: https://wiki.mozilla.org/User:Jesse/NewFrecency

See also https://bugzilla.mozilla.org/show_bug.cgi?id=704025

There are a few reasons to do this, but the main ones are

  • Performance
  • It removes the need to eventually expose some sort of idle maintenance API that the application would be responsible for calling roughly once a day (ideally).
  • Would lead to a substantially simpler origins implementation when compared to #429.
  • Allows for pruning history without losing frecency info (important for Android).

See also #609, which helps perf some but makes an idle maintenance API more important.


@thomcc commented Feb 1, 2019

Basic idea (It took some mathing out to make this work for visits not inserted at the current time):

  • Instead of storing frecency scores, we store something a bit more idiosyncratic: the date at which the frecency score becomes 1. I'm going to call this date frec_date even though that's not really what it is.

  • There's also a notion of a frec_score, which is roughly frecency; however, it's a floating-point number and has a different value range (it's always above 1.0).

  • There's a function frec_date_to_frec_score(frec_date: i64) -> f64, which is implemented as (in Rust pseudocode) (((frec_date - now()) as f64) * LAMBDA).exp(), where LAMBDA is std::f64::consts::LN_2 / 30.0.

  • There's a function frec_score_to_frec_date(frec_score: f64) -> i64 which is implemented more or less as (now() as f64 + frec_score.ln() / LAMBDA) as i64 (same LAMBDA as before).

  • There's also a function score_constant_for_visit(visit_type: VisitTransition) -> f64, which would basically map visit types to

    embed_visit_bonus: 0,
    framed_link_visit_bonus: 0,
    link_visit_bonus: 100,
    typed_visit_bonus: 2000,
    bookmark_visit_bonus: 75,
    download_visit_bonus: 0,
    permanent_redirect_visit_bonus: 0,
    temporary_redirect_visit_bonus: 0,
    redirect_source_visit_bonus: 25,
    default_visit_bonus: 0,
    unvisited_bookmark_bonus: 140,
    unvisited_typed_bonus: 200,
    reload_visit_bonus: 0,

    • Maybe we'd need to tweak them for the new formula, although I suspect they'd probably be fine, since LAMBDA is taken from how the decay works currently.
    • Also, maybe this should be a table and not a function; IDK, it's not important. If we want to weigh local/remote visits differently, this is where we'd do it (it would need to take visit_type and is_local).
  • When a new visit is inserted with date == now(), we update the places record to store frec_score_to_frec_date(frec_date_to_frec_score(place.frec_date) + score_constant_for_visit(new_visit.visit_type)) as its new frec date.

  • When a new visit is inserted in the past (e.g. from sync), we need to apply decay to its score, so instead of score_constant_for_visit(new_visit.visit_type), we'd add score_constant_for_visit(new_visit.visit_type) * decay_modifier_for_visit_at_date(new_visit.visit_date).

  • Where decay_modifier_for_visit_at_date(date: i64) -> f64 is ((now() - date).max(0) as f64 * -LAMBDA).exp().

    • Importantly, this returns 1.0 if date == now(), so we don't need to special case insertions in the past vs current insertions.
  • It's likely that the whole score_constant_for_visit(new_visit.visit_type) * decay_modifier_for_visit_at_date(new_visit.visit_date) expression should be rolled up into one score_modifier_for_visit(visit.visit_type, visit.visit_date, <other things here like is_local, is_bookmarked, etc>) function or something.

  • When history visits should be removed (in a way that affects frecency -- note that if we don't want their removals to affect frecency, we have that option now -- just do nothing!), we just take the inverse step: update frec_date in moz_places to be frec_score_to_frec_date(frec_date_to_frec_score(place.frec_date) - score_modifier_for_visit(visit.visit_type, visit.visit_date, ...)). That is, we perform the opposite of the step we did when inserting it (just change '+' to '-'); there's a sketch of both update steps after the code below.

Other things to note:

  • We can directly order by frec_date where we're ordering by frecency -- the further in the future that date is, the higher our frecency is.
    • This lets us avoid needing an expression index on frec_date_to_frec_score or something.

EDIT: Bleh, just realized the dates need to be in days, need to go through and fix the math here, everything should more or less stay the same though (since mostly we're dealing with time deltas)

EDIT2: Here's a version of working-ish code that does it and handles the days/ms conversions

    const LAMBDA: f64 = std::f64::consts::LN_2 / 30.0;
    const MS_PER_DAY: f64 = 1000.0 * 60.0 * 60.0 * 24.0;

    /// Days between two timestamps (clamped to 0.0 if `past` is actually after `present`).
    fn days_between(past: Timestamp, present: Timestamp) -> f64 {
        (present.0 - past.0).max(0) as f64 / MS_PER_DAY
    }

    // TODO: take is_local as an arg?
    pub fn score_modifier_for_visit(visit_type: VisitTransition, visit_date: Timestamp) -> f64 {
        // Decays from 1.0 (a visit right now) towards 0.0 the older the visit is.
        let modifier = (days_between(visit_date, Timestamp::now()) * -LAMBDA).exp();
        let bonus = crate::frecency::DEFAULT_FRECENCY_SETTINGS
            .get_transition_bonus(visit_type, true, false) as f64;
        bonus * modifier
    }

    pub fn frec_date_to_frec_score(frec_date: Timestamp) -> f64 {
        // The further in the future frec_date is, the higher the score.
        (days_between(Timestamp::now(), frec_date) * LAMBDA).exp()
    }

    pub fn frec_score_to_frec_date(frec_score: f64) -> Timestamp {
        // This is phrased a little awkwardly to handle NaNs automatically.
        let frec_score = if frec_score >= 0.0 { frec_score } else { 0.0 };
        let offset_in_days = frec_score.ln() / LAMBDA;
        let offset_in_ms = offset_in_days * MS_PER_DAY;
        let now_in_ms = Timestamp::now().0 as f64;
        Timestamp((now_in_ms + offset_in_ms) as i64)
    }
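
For completeness, here's a rough (untested) sketch of how the insert/remove steps above would use these helpers to update a place's frec_date. Place and Visit here are just stand-ins for whatever row types we actually end up with; and since frec_date orders the same way as the score, queries can keep sorting on it directly.

    // Hypothetical stand-ins for the real moz_places / moz_historyvisits rows.
    struct Place {
        frec_date: Timestamp,
    }

    struct Visit {
        visit_type: VisitTransition,
        visit_date: Timestamp,
    }

    /// Bump a place's frec_date for a newly inserted visit. Works for both current and
    /// past (e.g. synced) visits, since score_modifier_for_visit already applies decay.
    pub fn apply_visit(place: &mut Place, visit: &Visit) {
        let score = frec_date_to_frec_score(place.frec_date)
            + score_modifier_for_visit(visit.visit_type, visit.visit_date);
        place.frec_date = frec_score_to_frec_date(score);
    }

    /// Inverse step, for when a visit is removed in a way that should affect frecency.
    pub fn unapply_visit(place: &mut Place, visit: &Visit) {
        let score = frec_date_to_frec_score(place.frec_date)
            - score_modifier_for_visit(visit.visit_type, visit.visit_date);
        place.frec_date = frec_score_to_frec_date(score);
    }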
@rfk commented Feb 3, 2019

Thanks for raising the visibility of this, Thom! For completeness I want to also surface a comment from the linked bug:


From https://bugzilla.mozilla.org/show_bug.cgi?id=704025#c2:

The advantages of this approach would be many:

  • The current painful point in visits addition is frecency recalculation, so this may speed up visits addition
  • The recalculation on idle-daily may be ineffective for all those users who don't hit idle (mobile users, for example, but other categories may exist).
  • It's cheap, so would be suitable for all platforms, including mobile.
  • Expiring old visits cannot "regress" frecency.

There are also downsides:

  • it needs careful testing, to check how well it matches compared to the old one.
  • while the old frecency also adapts to page removals (by requesting a recalculation), this won't. Bumping down an unwanted match by the right amount may be harder.
  • while the old frecency adapts to bonus changes (for example if a page is unstarred, its frecency gets bumped down) this won't. More generically, the bonuses (bookmarked, typed, ...) handling has to be figured out yet.
  • while the old frecency can be rebuilt at any time, starting from the data in the database, this one can't be rebuilt, since it may be based on past visits that no longer exist. An approximation could be rebuilt from the available data, but it may not be exactly the same.

For our case, another disadvantage might be divergence from desktop behaviour, which risks complicating any future upstreaming of this work into desktop, and also risks confusing users if their awesomebar behaves differently on desktop vs mobile (although I imagine we have that situation already due to the way history sync works).

How do you see the advantages/disadvantages tradeoff playing out here, both in the Fenix timeline and longer-term?

@thomcc commented Feb 3, 2019

In general, I think the wiki page leaving much of the math as an exercise for the reader (or, IDK, maybe it was in the expired mathbin link) is part of why most of those listed disadvantages were assumed when they're not actually a problem. Most are addressed in my follow-up.

it needs careful testing, to check how well it matches compared to the old one.

This doesn't seem particularly true prior to merging back into Desktop.

while the old frecency also adapts to pages removals (by requesting a recalculation), this won't. Bumping down an unwanted match by the right amount may be harder.

This is addressed in my follow up directly.

while the old frecency adapts to bonus changes (for example if a page is unstarred, its frecency gets bumped down) this won't. More generically, the bonuses (bookmarked, typed, ...) handling has to be figured out yet.

This is relatively easy (essentially a removal and an addition of the new score). Bonuses are explicitly addressed both by the post and my follow up. Handling bookmarked items with no visits would just be a constant addition to the score.

while the old frecency can be rebuilt at any time, starting from the data in the database, this one can't be rebuilt, since it may be based on past and no more existing visits. A guessed one could be rebuilt with available data, but it may not be exactly the same.

This is only true if we discard visits, and is going to be inherently true...

another disadvantage might be divergence from desktop behaviour

OTOH having implemented this and used it in practice might be desirable for desktop, since it proves out the concept.

although I imagine we have that situation already due to the way history sync works

We very much do. Android prunes history very aggressively and almost completely discards synced history items for frecency computations because it overwhelms the awesomebar and you only see remote visits.

@rfk commented Feb 3, 2019

Most are addressed in my follow up.

My apologies, I had read your followup in the bug, but somehow missed that it was directly responding to the items from the "disadvantages" list. Thanks for walking me through it :-)

@thomcc commented Feb 4, 2019

This is only true if we discard visits, and is going to be inherently true...

It's worth noting that it won't be exactly the same, but it will be close. The older the missing visits are, the less they affect the frecency score, and so the less important they are. It will never be exactly the same, but it will be asymptotically the same.

@mhammond commented Feb 6, 2019

I think it would be great to find some time to look at moving to this. I think that needs to include some objective comparison of the existing scheme with this one, although I'm not too sure what that means exactly; something hand-wavy like "for multiple, significant corpora of history, the ranking of sites using both schemes must be very close to the same, except for [insert known reasons they would differ]", or something?
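
To make that slightly less hand-wavy, here's one purely illustrative sketch of what "very close to the same ranking" could mean in code: score the same corpus of pages under both schemes and compute a rank correlation (ties ignored for simplicity; old_scores/new_scores are assumed to come from running both implementations over a real history corpus).

    /// Rank positions for a slice of scores (highest score gets rank 0). Ignores ties.
    fn ranks(scores: &[f64]) -> Vec<f64> {
        let mut order: Vec<usize> = (0..scores.len()).collect();
        order.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
        let mut out = vec![0.0; scores.len()];
        for (rank, &idx) in order.iter().enumerate() {
            out[idx] = rank as f64;
        }
        out
    }

    /// Spearman rank correlation between the two scoring schemes over the same pages:
    /// 1.0 means identical ranking, values near 0.0 mean the rankings are unrelated.
    fn ranking_similarity(old_scores: &[f64], new_scores: &[f64]) -> f64 {
        let (old_ranks, new_ranks) = (ranks(old_scores), ranks(new_scores));
        let n = old_scores.len() as f64;
        let d_sq: f64 = old_ranks
            .iter()
            .zip(&new_ranks)
            .map(|(a, b)| (a - b) * (a - b))
            .sum();
        1.0 - (6.0 * d_sq) / (n * (n * n - 1.0))
    }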

@thomcc commented Feb 6, 2019

I was thinking of trying to use some system to basically brute force it (or maybe something that converges quicker than brute force, like hill climbing?) to find 'good' values for the constants (where good == same as current), something I could let run searching overnight.

@agatheblues commented May 24, 2019

Hello there! I'm trying to implement this algorithm! Have you noticed that the decay is a little slower than with a traditional exponential decay (where you would store all the visits)?

Example 1: Here the score is calculated by doing SUM(exp(-LAMBDA * (visit_date - visit_k))), which is what's described in the 'Proposed new definition' paragraph of https://wiki.mozilla.org/User:Jesse/NewFrecency

| visit_date | score |
| --- | --- |
| 1558159200000 | 1 |
| 1558180800000 | 1.994240424 |
| 1558184400000 | 2.992321484 |
| 1558188000000 | 3.989442149 |
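
(For reference, a small Rust sketch of this sum, using the same timestamps in milliseconds; it should reproduce the score column above.)

    // Reference score: sum over visits so far of exp(-LAMBDA * days(visit_date - visit_k)),
    // where the visit being added contributes exp(0) = 1.
    const LAMBDA: f64 = std::f64::consts::LN_2 / 30.0;
    const MS_PER_DAY: f64 = 1000.0 * 60.0 * 60.0 * 24.0;

    fn score_at(visit_date: i64, visits_so_far: &[i64]) -> f64 {
        visits_so_far
            .iter()
            .map(|&v| (-LAMBDA * ((visit_date - v) as f64 / MS_PER_DAY)).exp())
            .sum()
    }

    fn main() {
        let visits = [1558159200000i64, 1558180800000, 1558184400000, 1558188000000];
        for i in 0..visits.len() {
            println!("{} {}", visits[i], score_at(visits[i], &visits[..=i]));
        }
    }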

Example 2: Here the score is calculated relatively to freq_date.

  • Compute the absolute time difference TAU between visit_date and freq_date in days
  • Updated score is exp(LAMBDA * TAU) + exp(-LAMBDA * TAU)
  • Convert the score to a new TAU = ln(score) / LAMBDA
  • Convert new TAU to milliseconds
  • Store new freq_date = visit_date + new_tau_in_ms

| visit_date | score | tau (days) | tau (ms) | freq_date |
| --- | --- | --- | --- | --- |
| 1558159200000 | 1 | 0 | 0 | 1558159200000 |
| 1558180800000 | 2.000033365 | 30.00072202 | 2592062383 | 1560772862383 |
| 1558184400000 | 2.498582085 | 39.63328846 | 3424316123 | 1561608716123 |
| 1558188000000 | 2.896790328 | 46.03365794 | 3977308046 | 1562165308046 |
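
(And the same data pushed through the freq_date procedure, transcribed directly from the steps above, so the two columns can be compared.)

    const LAMBDA: f64 = std::f64::consts::LN_2 / 30.0;
    const MS_PER_DAY: f64 = 1000.0 * 60.0 * 60.0 * 24.0;

    fn main() {
        let visits = [1558159200000i64, 1558180800000, 1558184400000, 1558188000000];
        // First visit: score 1, tau 0, so freq_date starts at the visit date itself.
        let mut freq_date = visits[0];
        for &visit_date in &visits[1..] {
            // Absolute difference between the new visit and the stored freq_date, in days.
            let tau = (visit_date - freq_date).abs() as f64 / MS_PER_DAY;
            let score = (LAMBDA * tau).exp() + (-LAMBDA * tau).exp();
            // Convert the score back to a tau (days), then to ms past the visit date.
            let new_tau_ms = (score.ln() / LAMBDA) * MS_PER_DAY;
            freq_date = visit_date + new_tau_ms as i64;
            println!("{} {} {}", visit_date, score, freq_date);
        }
    }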

I guess my question is, am I doing this correctly or does this look fishy? Would there be a way for me to test this more extensively? Do you maybe have other resources I could look at to really understand how this is supposed to work? Thanks!

@thomcc commented May 24, 2019

That looks relatively sane to me. Unfortunately there aren't good ways of testing this extensively beyond, say, importing your desktop history frecency and comparing against that. This is largely a theoretical algorithm that has not seen much use in practice.

@agatheblues commented May 27, 2019

Thanks for taking a look! I'll see if I can find edge cases to test it more thoroughly :)

@jdragojevic removed this from the History Polish milestone Jun 26, 2019
