Add retention cookbook #153
cookbooks/retention.md
Outdated
@@ -0,0 +1,211 @@ | |||
*Authored by the Product Data Science Team. Please direct questions/concerns to Ben Miroglio (bmiroglio).* |
drive-by: you'll have to add `Miroglio` to the `.spelling` file (similar with `interpretability`). You can either also put `bmiroglio` in the dictionary or surround it with backticks.
Thanks for the guidance--Fixed!
cookbooks/retention.md
Outdated
|
||
For example, let’s say we are calculating retention for new Firefox users. Each user can then be anchored by their `profile_creation_date`, and we can count the number of users who submitted a ping between 7-13 days after profile creation (1 Week retention), 14-20 days after profile creation (2 Week Retention), etc. | ||
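The day ranges above (7-13 days is week 1, 14-20 days is week 2) amount to integer division of "days since profile creation" by 7. The following is a minimal sketch of that arithmetic in plain Python, not the cookbook's PySpark code; the function names and the tuple layout of `pings` are assumptions made for illustration.

```python
from datetime import date


def retention_week(profile_creation_date, ping_date):
    """Return N such that the ping falls 7N to 7N+6 days after profile creation."""
    days_since_creation = (ping_date - profile_creation_date).days
    return days_since_creation // 7


def n_week_retention(pings, n):
    """Fraction of clients with at least one ping in week N.

    `pings` is a hypothetical iterable of
    (client_id, profile_creation_date, ping_date) tuples.
    """
    all_clients = set()
    retained = set()
    for client_id, profile_creation_date, ping_date in pings:
        all_clients.add(client_id)
        if retention_week(profile_creation_date, ping_date) == n:
            retained.add(client_id)
    if not all_clients:
        return 0.0
    return len(retained) / len(all_clients)


# Made-up data: client "a" pings 8 days after creation, client "b" 17 days after.
example_pings = [
    ("a", date(2018, 4, 1), date(2018, 4, 1)),
    ("a", date(2018, 4, 1), date(2018, 4, 9)),
    ("b", date(2018, 4, 3), date(2018, 4, 3)),
    ("b", date(2018, 4, 3), date(2018, 4, 20)),
]
print(n_week_retention(example_pings, 1))  # only "a" is active in week 1
```

Each client is anchored to its own `profile_creation_date`, so the week boundaries differ from client to client.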
|
||
#### Example Methodology |
Would it be too difficult to add a pure SQL example?
In addition to or in place of the current example?
I think in addition to would be fine, the more examples the merrier (IMO).
Agreed :). I'll add one in.
It seems like it'd be hard to write an efficient query without a view somewhere. The query needs to keep track of the set of clientIds over a cumulative window.
I'm working on a solution that uses views, as I don't think it is realistic to do everything in one giant query. It's aimed more at folks who are SQL-inclined and not very familiar with PySpark.
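To make the view-based idea concrete, here is one possible shape for it. This is a sketch only, not the solution referred to in this thread. It runs the SQL through Python's built-in `sqlite3` so that it can be executed anywhere. The `pings` table, its columns, and the `client_weeks` view are hypothetical stand-ins rather than `main_summary`, and the `julianday` date arithmetic is SQLite-specific and would need translating for another engine.

```python
import sqlite3


def weekly_retention(rows):
    """Return {week_number: fraction of clients active in that week}.

    `rows` is a hypothetical list of
    (client_id, profile_creation_date, submission_date) with ISO date strings.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pings (
            client_id TEXT,
            profile_creation_date TEXT,
            submission_date TEXT
        );
        CREATE VIEW client_weeks AS
        SELECT DISTINCT
            client_id,
            CAST((julianday(submission_date)
                  - julianday(profile_creation_date)) / 7 AS INTEGER) AS week_number
        FROM pings
        WHERE submission_date >= profile_creation_date;
    """)
    conn.executemany("INSERT INTO pings VALUES (?, ?, ?)", rows)
    # The view holds one row per (client, week), so COUNT(*) counts clients.
    result = conn.execute("""
        SELECT
            week_number,
            COUNT(*) * 1.0 / (SELECT COUNT(DISTINCT client_id) FROM pings) AS retention
        FROM client_weeks
        GROUP BY week_number
        ORDER BY week_number
    """).fetchall()
    conn.close()
    return dict(result)


# Made-up data for two clients.
example_rows = [
    ("a", "2018-04-01", "2018-04-01"),
    ("a", "2018-04-01", "2018-04-09"),
    ("b", "2018-04-03", "2018-04-03"),
    ("b", "2018-04-03", "2018-04-20"),
]
print(weekly_retention(example_rows))
```

The view does the per-client deduplication once, so the final query only has to count rows per week instead of tracking sets of client IDs itself.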
Looks great to me! A couple of nits inline.
Documenting retention is a huge asset to Firefox and I imagine writing this document has already taken a lot of your time. Accordingly, I want to make sure this documentation lands. Feel free to submit without addressing these nits if they introduce too much stop energy. We can improve upon this doc once it's submitted. The only exception would be methodological concerns that may mislead future analysts.
Thanks for this PR, Ben. Very exciting work! Retention is nuanced and foundational to Firefox. It's great to have this written down for the first time!
Now we can load in a subset of `main_summary` and construct the necessary fields for retention calculations: | ||
|
||
```python | ||
ms = spark.sql(""" |
Nit: this SQL statement should follow the style guide. In particular, please:
cookbooks/retention.md
Outdated
|
||
*Not quite*. Turns out you next look at `active_ticks` and `total_uri_count` and find that sync users report much higher numbers for these measures as well. Now how can we explain this difference in retention? | ||
|
||
There could be an entirely separate cookbook devoted to answering this question, however this contrived example is meant to demonstrate that simply comparing retention numbers between two groups isn't capturing the full story. Sans an experiment or model-based approach, all we can say is "enabling sync is **associated** with higher retention numbers." There is still value in this assertion, however it should be stressed that association/correlation != causation! |
Great explanation and example. It's worth bolding correlation != causation so the reader can get the gist of this example while scanning.
good point!
I plan to add a pure SQL example per @fbertsch's request, however I think this is ok to merge for the moment after getting @SuYoungHong's review.
Left a few typo nits. At a high level, this feels like it hits a great sweet spot of being both concise and specific.
cookbooks/retention.md
Outdated
) | ||
``` | ||
|
||
Peaking at 6-Week Retention |
Peaking -> Peeking
cookbooks/retention.md
Outdated
def from_unixtime_handler(ut): | ||
""" | ||
Converts unix time (in days) to a string in %Y%m%d format. | ||
This is spark UDF. |
Nit: This is a spark UDF
nice catches! Fixed.
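For readers without the full diff, the docstring quoted above describes a conversion from "days since the Unix epoch" to a `%Y%m%d` string. The body of the PR's `from_unixtime_handler` is not shown in this thread, so the following is only a guess at equivalent logic in plain Python, under the hypothetical name `days_since_epoch_to_yyyymmdd`.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_since_epoch_to_yyyymmdd(days):
    """Convert a count of days since 1970-01-01 to a %Y%m%d string.

    Returns None for missing input so it can tolerate null values.
    """
    if days is None:
        return None
    return (EPOCH + timedelta(days=int(days))).strftime("%Y%m%d")


print(days_since_epoch_to_yyyymmdd(17532))  # 2018-01-01 is 17532 days after the epoch
```

To use a function like this on a Spark column, it could be wrapped with `pyspark.sql.functions.udf` and a `StringType()` return type.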
Overall, looks great. Awesome work.
I'd ask that you add a note distinguishing between the fixed point method you use for retention (where the weekN period is different for clients that start on different days) and a cohort approach.
Just a quick note like new user vs existing user should be good!
cookbooks/retention.md
Outdated
|
||
Retention measures the rate at which users are *continuing* to use Firefox, making it one of the more important metrics we track. We commonly measure retention between releases, experiment cohorts, and various Firefox subpopulations to better understand how a change to the user experience or use of a specific feature affect behavior. | ||
|
||
### N Week Retention |
good work! One more distinction I would make about types of N week retention:
rolling vs cohorted
Here you're using a rolling approach to N week retention (week N is between 7N and 7N + 6 days after the client's profile creation date), but a lot of people are using cohorts to calculate retention:
Define weekly cohorts for your population (the clients that start between 20180101 - 20180107 = cohort1, 20180108 - 20180114 = cohort2, etc.). For each cohort, define weekN retention as having appeared in the week period N weeks after the cohort period (i.e. week1 means appearing in 20180108 - 20180114 for cohort 1).
So essentially, using an anchor period instead of an anchor point.
Would be a good idea to clarify this distinction, since this is many people's understanding of N week retention right now.
Good point.
I use cohorts when there isn't a clear anchor point (e.g. users that enabled sync: there is no clear "sync was enabled on this date" indicator without looking across a client's history). Might be worth adding an example like this as well in a section titled "What if you don't have an anchor point?".
When you do have an anchor point however, IMO the rolling method is the most precise (and least expensive). Let me know your thoughts @SuYoungHong !
cookbooks/retention.md
Outdated
|
||
``` | ||
|
||
we observe that 35.6% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in [this notebook](https://gist.github.com/benmiroglio/fc708e5905fad33b43adb9c90e38ebf4). |
mentioned offline, but update copy to reflect code numbers
good catch! will fix.
cookbooks/retention.md
Outdated
**Please be sure to specify whether or not your retention analysis is for new or existing users.** | ||
|
||
|
||
|
Note on Point Anchoring vs Period Anchoring
As defined above, N week retention needs some anchor to start N from.
However, we should note, the anchor itself can be a single point (such as a day) or a period (such as a week), also known as a cohort or acquisition period.
The code example above is using a point anchor of a day, and defining the N week for each client independently, depending on which day they start.
Many previous retention analyses at Mozilla, however, use a period anchor and calculate retention per cohort. For example:
- Define a week (20180101 to 20180107) as the anchor/acquisition/cohort period.
- Define clients who have `profile_creation_date` within these dates as a 'cohort'.
- Define week N as the period from 20180101 + 7N days to 20180107 + 7N days.
- For the cohort, calculate retention for each N.
- Repeat for each cohort in the period of interest.
The major difference between a 1 week period anchor and a 1 day point anchor is that, with the period anchor, the weekN retention period is the same for a client who started at the beginning of the anchor period and a client who started at the end of it. With a point anchor, those same two clients have weekN retention periods that differ by the number of days between their start dates.
Point anchoring is preferred; however, both methods convey the same information.
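The difference between the two anchors can be checked with a few lines of date arithmetic. This is a minimal sketch, assuming 7-day cohort periods that start on 20180101; `week_window` and `cohort_start_for` are hypothetical helper names, not part of the cookbook.

```python
from datetime import date, timedelta


def week_window(anchor, n):
    """Return the (first_day, last_day) of week N counted from `anchor`."""
    start = anchor + timedelta(days=7 * n)
    return start, start + timedelta(days=6)


def cohort_start_for(profile_creation_date, first_cohort_start):
    """Snap a profile creation date to the first day of its 7-day cohort period."""
    offset = (profile_creation_date - first_cohort_start).days
    return first_cohort_start + timedelta(days=7 * (offset // 7))


first_cohort = date(2018, 1, 1)
early_client = date(2018, 1, 1)  # created on the first day of the cohort week
late_client = date(2018, 1, 7)   # created on the last day of the cohort week

# Point anchor: each client is anchored to its own creation date.
print(week_window(early_client, 1))
print(week_window(late_client, 1))

# Period anchor: both clients are anchored to the start of their cohort week.
print(week_window(cohort_start_for(early_client, first_cohort), 1))
print(week_window(cohort_start_for(late_client, first_cohort), 1))
```

With the point anchor, the two week 1 windows start six days apart. With the period anchor, both clients share the window 20180108 to 20180114.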
cookbooks/retention.md
Outdated
|
||
we observe that 35.6% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in [this notebook](https://gist.github.com/benmiroglio/fc708e5905fad33b43adb9c90e38ebf4). | ||
|
||
|
A couple of cleaning caveats that might be worth mentioning.
First is the client submission latency, which can be somewhere between 2 and 5 days. `clients_daily` accounts for this when aggregating clients per day, but it's not without its own issues.
There are also noisy clients with inconsistent profile creation and sub-session dates. Different thresholds will affect the consistency of results. The cookbook could point to an existing baseline in the future.
While I agree these are potential issues, I think it is beyond the scope of the cookbook since they are general data issues subject to most Firefox analysis. I don't want to bog things down with all potential caveats, but rather serve as a jumping off point methodologically-speaking.
Since retention is usually calculated to compare two or more groups of users, I feel safe to exclude these caveats since each group is subject to the same pitfalls (in this case).
@acmiyaguchi Let me know if you feel this is justified!
Yep, sounds good to me. A six week waiting period and relatively consistent results will probably take care of those caveats across analyses.
@SuYoungHong I wrote up a section that simplifies cohort-based retention, putting it in terms of the anchor point approach so the document is a little more cohesive. Please review!
looks good to me! ready for merging!
🎉 🎉 🎉
This is a cookbook that guides the reader through retention analysis. This is the most recent draft after an informal review by the majority of the Product Data Science Team.