
Add retention cookbook #153

Merged
merged 7 commits into from Jun 22, 2018

Conversation

benmiroglio
Contributor

This is a cookbook that guides the reader through retention analysis. It is the most recent draft, following an informal review by the majority of the Product Data Science Team.

@@ -0,0 +1,211 @@
*Authored by the Product Data Science Team. Please direct questions/concerns to Ben Miroglio (bmiroglio).*
Contributor

drive-by: you'll have to add Miroglio to the .spelling file (similar with interpretability). You can either also put bmiroglio in the dictionary or surround it with backticks.

Contributor Author

Thanks for the guidance--Fixed!


For example, let’s say we are calculating retention for new Firefox users. Each user can then be anchored by their `profile_creation_date`, and we can count the number of users who submitted a ping between 7-13 days after profile creation (1 Week Retention), 14-20 days after profile creation (2 Week Retention), etc.
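As an aside, here is a minimal sketch of the week bucketing that paragraph describes (plain Python; the function name and date arithmetic are illustrative, not part of the cookbook):

```python
from datetime import date

def retention_week(profile_creation_date, submission_date):
    """Return the N-week retention bucket a ping falls into:
    1 for days 7-13 after profile creation, 2 for days 14-20, etc."""
    return (submission_date - profile_creation_date).days // 7

# A ping 10 days after profile creation counts toward 1 Week Retention.
assert retention_week(date(2018, 4, 1), date(2018, 4, 11)) == 1
```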

#### Example Methodology
Contributor

Would it be too difficult to add a pure SQL example?

Contributor Author

In addition to or in place of the current example?

Contributor

I think in addition to would be fine, the more examples the merrier (IMO).

Contributor Author

Agreed :). I'll add one in.

Contributor

It seems like it'd be hard to write an efficient query without a view somewhere. The query needs to keep track of the set of clientIds over a cumulative window.

Contributor Author

I'm working on a solution that uses views, as I don't think it's realistic to do everything in one giant query. It's more for the folks who are SQL-inclined and not super familiar with PySpark.
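As a rough sketch of the view-based direction being discussed (this is not the example that eventually landed; the table and column names such as `main_summary`, `client_id`, `profile_creation_date`, and `submission_date_s3`, and the date range, are assumptions about the schema):

```python
# Rough sketch only: register one view of per-client anchor dates, then join
# pings back to it to bucket them into weeks. Names and dates are assumptions.
spark.sql("""
    SELECT
        client_id,
        MIN(from_unixtime(profile_creation_date * 86400, 'yyyyMMdd')) AS pcd
    FROM
        main_summary
    WHERE
        submission_date_s3 >= '20180401'
    GROUP BY
        client_id
""").createOrReplaceTempView("anchors")

retention = spark.sql("""
    SELECT
        a.pcd,
        FLOOR(DATEDIFF(TO_DATE(m.submission_date_s3, 'yyyyMMdd'),
                       TO_DATE(a.pcd, 'yyyyMMdd')) / 7) AS week_num,
        COUNT(DISTINCT m.client_id) AS n_clients
    FROM
        main_summary m
        JOIN anchors a ON m.client_id = a.client_id
    WHERE
        m.submission_date_s3 >= '20180401'
    GROUP BY
        1, 2
""")
```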

Contributor

@harterrt left a comment

Looks great to me! A couple of nits inline.

Documenting retention is a huge asset to Firefox and I imagine writing this document has already taken a lot of your time. Accordingly, I want to make sure this documentation lands. Feel free to submit without addressing these nits if they introduce too much stop energy. We can improve upon this doc once it's submitted. The only exception would be methodological concerns that may mislead future analysts.

Thanks for this PR, Ben. Very exciting work! Retention is nuanced and foundational to Firefox. It's great to have this written down for the first time!

Now we can load in a subset of `main_summary` and construct the necessary fields for retention calculations:

```python
ms = spark.sql("""
```
Contributor

Nit: this SQL statement should follow the style guide. In particular, please:

  • Capitalize reserved words like WHERE, SELECT, and AND (link)
  • Add a newline after root keywords like SELECT and WHERE (link)
  • Indent lines 126-130 one additional level to clarify the WHERE statement block
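For readers wanting a concrete picture of those style points, here is a placeholder query (the columns and filter values are hypothetical, since the cookbook's actual statement is truncated above):

```python
# Placeholder query, shown only to illustrate the style points above:
# capitalized keywords, newlines after root keywords, and an indented WHERE block.
ms = spark.sql("""
    SELECT
        client_id,
        profile_creation_date,
        submission_date_s3
    FROM
        main_summary
    WHERE
        app_name = 'Firefox'
        AND submission_date_s3 >= '20180401'
        AND submission_date_s3 <= '20180603'
""")
```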


*Not quite*. Turns out you next look at `active_ticks` and `total_uri_count` and find that sync users report much higher numbers for these measures as well. Now how can we explain this difference in retention?

There could be an entirely separate cookbook devoted to answering this question; however, this contrived example is meant to demonstrate that simply comparing retention numbers between two groups doesn't capture the full story. Sans an experiment or model-based approach, all we can say is "enabling sync is **associated** with higher retention numbers." There is still value in this assertion, but it should be stressed that association/correlation != causation!
Contributor

Great explanation and example. It's worth bolding correlation != causation so the reader can get the gist of this example while scanning.

Contributor Author

good point!

@benmiroglio
Contributor Author

I plan to add a pure SQL example per @fbertsch's request; however, I think this is OK to merge for the moment after getting @SuYoungHong's review.

Contributor

@jklukas left a comment

Left a few typo nits. At a high level, this feels like it hits a great sweet spot of being both concise and specific.

```python
)
```

Peaking at 6-Week Retention
Contributor

Peaking -> Peeking

def from_unixtime_handler(ut):
    """
    Converts unix time (in days) to a string in %Y%m%d format.
    This is spark UDF.
Contributor

Nit: This is a spark UDF

Contributor Author

nice catches! Fixed.
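For reference, a sketch of what the completed helper might look like once registered as a UDF; only the docstring comes from the PR, and the body shown here is an assumption:

```python
from datetime import datetime, timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def from_unixtime_handler(ut):
    """
    Converts unix time (in days) to a string in %Y%m%d format.
    This is a spark UDF.
    """
    if ut is None:
        return None
    # Assumed body: count days forward from the unix epoch.
    return (datetime(1970, 1, 1) + timedelta(days=int(ut))).strftime("%Y%m%d")

# Register as a Spark UDF so it can be applied to a DataFrame column.
from_unixtime_udf = udf(from_unixtime_handler, StringType())
```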

Contributor

@SuYoungHong left a comment

Overall, looks great. Awesome work.

I would request that you make a note distinguishing between the fixed-point method you use for retention (where the week N period is different for clients that start on different days) and a cohort approach.

Just a quick note, like new user vs. existing user, should be good!


Retention measures the rate at which users are *continuing* to use Firefox, making it one of the more important metrics we track. We commonly measure retention between releases, experiment cohorts, and various Firefox subpopulations to better understand how a change to the user experience or use of a specific feature affects behavior.

### N Week Retention
Contributor

good work! One more distinction I would make about types of N week retention:

rolling vs cohorted

Here you're using a rolling approach to N week retention (week N is between 7N and 7N + 6 days after a client's profile creation date), but a lot of people are using cohorts to calculate retention:

Define weekly cohorts for your population (clients that start between 20180101 and 20180107 = cohort 1, 20180108 to 20180114 = cohort 2, etc.). For each cohort, define week N retention as having appeared in the week period N weeks after the cohort period (i.e. week 1 means appearing in 20180108 to 20180114 for cohort 1).

So essentially, using an anchor period instead of an anchor point.

Would be a good idea to clarify this distinction, since this is many people's understanding of N week retention right now.
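To make that cohorted definition concrete, here is a small illustrative sketch (plain Python; the 20180101 anchor date and 0-based cohort numbering are assumptions for the example):

```python
from datetime import date, timedelta

ANCHOR = date(2018, 1, 1)  # start of the first weekly cohort (assumed)

def cohort_index(profile_creation_date):
    """0-based weekly cohort a client belongs to (0 = 20180101-20180107, ...)."""
    return (profile_creation_date - ANCHOR).days // 7

def cohort_week_n_period(cohort, n):
    """Date range that counts as week N retention for a given cohort."""
    start = ANCHOR + timedelta(days=7 * (cohort + n))
    return start, start + timedelta(days=6)

# Cohort 0 (profiles created 20180101-20180107): week 1 is 20180108-20180114,
# matching the definition above.
assert cohort_index(date(2018, 1, 5)) == 0
assert cohort_week_n_period(0, 1) == (date(2018, 1, 8), date(2018, 1, 14))
```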

Contributor Author

Good point.

I use cohorts when there isn't a clear anchor point (e.g. users that enabled sync: there is no clear 'sync was enabled on this date' indicator without looking across a client's history). Might be worth adding an example like this as well, in a section titled "What if you don't have an anchor point?".

When you do have an anchor point, however, IMO the rolling method is the most precise (and least expensive). Let me know your thoughts, @SuYoungHong!



we observe that 35.6% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in [this notebook](https://gist.github.com/benmiroglio/fc708e5905fad33b43adb9c90e38ebf4).
Contributor

Mentioned offline, but please update the copy to reflect the numbers in the code output.

Contributor Author

good catch! will fix.

**Please be sure to specify whether your retention analysis is for new or existing users.**



Contributor

Note on Point Anchoring vs Period Anchoring

As defined above, N week retention needs some anchor to start N from.

However, we should note that the anchor itself can be a single point (such as a day) or a period (such as a week), also known as a cohort or acquisition period.

The code example above uses a point anchor of a day, defining week N for each client independently depending on which day they start.

Many previous retention analyses at Mozilla, however, use a period anchor and calculate retention per cohort. For example:

  • Define a week (20180101 to 20180107) as the anchor/acquisition/cohort period.
  • Define clients who have profile_creation_date within these dates as a 'cohort'.
  • Define week N as 20180101 + 7N to 20180107 + 7N.
  • For the cohort, calculate each week N retention.
  • Repeat for each cohort in the period of interest.

The major difference is that with a 1 week period anchor, the week N retention period is the same for a client who started at the beginning of the anchor period and one who started at the end of it, whereas with a 1 day point anchor, those same two clients' week N retention periods differ by the number of days between their start dates.

Point anchoring is preferred; however, both methods convey the same information.
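A tiny illustration of that difference (plain Python; the dates are made up for the example):

```python
from datetime import date

ANCHOR = date(2018, 1, 1)   # start of the weekly cohort period (assumed)
ping = date(2018, 1, 16)    # day a ping was submitted (made up)

early = date(2018, 1, 1)    # client who started at the beginning of the week
late = date(2018, 1, 7)     # client who started at the end of the week

# Point anchoring: the week index is relative to each client's own start day.
point_weeks = [(ping - pcd).days // 7 for pcd in (early, late)]

# Period anchoring: the week index is relative to the shared cohort period.
period_weeks = [(ping - ANCHOR).days // 7 for _ in (early, late)]

print(point_weeks)   # [2, 1]: the same ping lands in different weeks per client
print(period_weeks)  # [2, 2]: both clients share the cohort's week index
```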


we observe that 35.6% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in [this notebook](https://gist.github.com/benmiroglio/fc708e5905fad33b43adb9c90e38ebf4).


Contributor

A couple of cleaning caveats that might be worth mentioning.

First is client submission latency, which can be somewhere between 2-5 days. `clients_daily` accounts for this when aggregating clients per day, but it's not without its own issues.

There are also noisy clients with inconsistent profile creation and sub-session dates. Different thresholds will affect the consistency of results. The cookbook could point to an existing baseline in the future.
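For readers who do want to apply such filters, here is a hedged sketch of the kinds of checks being described (the column names and the 5-day latency cutoff are assumptions, not an agreed-upon baseline):

```python
# Illustrative only: thresholds and column names are assumptions.
cleaned = spark.sql("""
    SELECT
        client_id,
        profile_creation_date,
        submission_date_s3
    FROM
        main_summary
    WHERE
        -- drop clients whose pings predate their own profile creation date
        TO_DATE(submission_date_s3, 'yyyyMMdd') >=
            TO_DATE(FROM_UNIXTIME(profile_creation_date * 86400))
        -- wait out client submission latency before trusting recent days
        AND submission_date_s3 <= DATE_FORMAT(DATE_SUB(CURRENT_DATE(), 5), 'yyyyMMdd')
""")
```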

Contributor Author

While I agree these are potential issues, I think they are beyond the scope of the cookbook since they are general data issues that affect most Firefox analyses. I don't want to bog things down with every potential caveat; the cookbook should instead serve as a methodological jumping-off point.

Since retention is usually calculated to compare two or more groups of users, I feel safe excluding these caveats, as each group is subject to the same pitfalls (in this case).

@acmiyaguchi Let me know if you feel this is justified!

Contributor

Yep, sounds good to me. A six week waiting period and relatively consistent results will probably take care of those caveats across analyses.

@benmiroglio
Contributor Author

@SuYoungHong I wrote up a section that simplifies cohort-based retention, putting it in terms of the anchor point approach so the document is a little more cohesive. Please review!

@SuYoungHong
Contributor

Looks good to me! Ready for merging!

@benmiroglio merged commit f4d45d2 into mozilla:master on Jun 22, 2018
@harterrt
Contributor

🎉 🎉 🎉
