
Add retention cookbook #153

Merged
merged 7 commits into from Jun 22, 2018

Conversation

benmiroglio
Contributor

This is a cookbook that guides the reader through retention analysis. It is the most recent draft, following an informal review by the majority of the Product Data Science Team.

@@ -0,0 +1,211 @@
*Authored by the Product Data Science Team. Please direct questions/concerns to Ben Miroglio (bmiroglio).*
Contributor

drive-by: you'll have to add Miroglio to the .spelling file (similar with interpretability). You can either also put bmiroglio in the dictionary or surround it with backticks.

Contributor Author

Thanks for the guidance--Fixed!


For example, let’s say we are calculating retention for new Firefox users. Each user can then be anchored by their `profile_creation_date`, and we can count the number of users who submitted a ping between 7-13 days after profile creation (1 Week Retention), 14-20 days after profile creation (2 Week Retention), etc.
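As an aside, here is a minimal sketch of the week bucketing that paragraph describes (plain Python; the function name and date arithmetic are illustrative, not part of the cookbook):

```python
from datetime import date

def retention_week(profile_creation_date, submission_date):
    """Return the N-week retention bucket a ping falls into:
    1 for days 7-13 after profile creation, 2 for days 14-20, etc."""
    return (submission_date - profile_creation_date).days // 7

# A ping 10 days after profile creation counts toward 1 Week Retention.
assert retention_week(date(2018, 4, 1), date(2018, 4, 11)) == 1
```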

#### Example Methodology
Contributor

Would it be too difficult to add a pure SQL example?

Contributor Author

In addition to or in place of the current example?

Contributor

I think in addition to would be fine, the more examples the merrier (IMO).

Contributor Author

Agreed :). I'll add one in.

Contributor

It seems like it'd be hard to write an efficient query without a view somewhere. The query needs to keep track of the set of clientIds over a cumulative window.

Contributor Author

I'm working on a solution that uses views, as I don't think it's realistic to do everything in one giant query. It's more for the folks who are SQL-inclined and not super familiar with PySpark.
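As a rough sketch of the view-based direction being discussed (this is not the example that eventually landed; the table and column names such as `main_summary`, `client_id`, `profile_creation_date`, and `submission_date_s3`, and the date range, are assumptions about the schema):

```python
# Rough sketch only: register one view of per-client anchor dates, then join
# pings back to it to bucket them into weeks. Names and dates are assumptions.
spark.sql("""
    SELECT
        client_id,
        MIN(from_unixtime(profile_creation_date * 86400, 'yyyyMMdd')) AS pcd
    FROM
        main_summary
    WHERE
        submission_date_s3 >= '20180401'
    GROUP BY
        client_id
""").createOrReplaceTempView("anchors")

retention = spark.sql("""
    SELECT
        a.pcd,
        FLOOR(DATEDIFF(TO_DATE(m.submission_date_s3, 'yyyyMMdd'),
                       TO_DATE(a.pcd, 'yyyyMMdd')) / 7) AS week_num,
        COUNT(DISTINCT m.client_id) AS n_clients
    FROM
        main_summary m
        JOIN anchors a ON m.client_id = a.client_id
    WHERE
        m.submission_date_s3 >= '20180401'
    GROUP BY
        1, 2
""")
```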

Contributor

@harterrt left a comment

Looks great to me! A couple of nits inline.

Documenting retention is a huge asset to Firefox and I imagine writing this document has already taken a lot of your time. Accordingly, I want to make sure this documentation lands. Feel free to submit without addressing these nits if they introduce too much stop energy. We can improve upon this doc once it's submitted. The only exception would be methodological concerns that may mislead future analysts.

Thanks for this PR, Ben. Very exciting work! Retention is nuanced and foundational to Firefox. It's great to have this written down for the first time!

Now we can load in a subset of `main_summary` and construct the necessary fields for retention calculations:

```python
ms = spark.sql("""
```
Contributor

Nit: this SQL statement should follow the style guide. In particular, please:

  • Capitalize reserved words like WHERE, SELECT, and AND (link)
  • Add a newline after root keywords like SELECT and WHERE (link)
  • Indent lines 126-130 one additional level to clarify the WHERE statement block
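For readers wanting a concrete picture of those style points, here is a placeholder query (the columns and filter values are hypothetical, since the cookbook's actual statement is truncated above):

```python
# Placeholder query, shown only to illustrate the style points above:
# capitalized keywords, newlines after root keywords, and an indented WHERE block.
ms = spark.sql("""
    SELECT
        client_id,
        profile_creation_date,
        submission_date_s3
    FROM
        main_summary
    WHERE
        app_name = 'Firefox'
        AND submission_date_s3 >= '20180401'
        AND submission_date_s3 <= '20180603'
""")
```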


*Not quite*. Turns out you next look at `active_ticks` and `total_uri_count` and find that sync users report much higher numbers for these measures as well. Now how can we explain this difference in retention?

There could be an entirely separate cookbook devoted to answering this question; however, this contrived example is meant to demonstrate that simply comparing retention numbers between two groups doesn't capture the full story. Sans an experiment or model-based approach, all we can say is "enabling sync is **associated** with higher retention numbers." There is still value in this assertion, but it should be stressed that association/correlation != causation!
Contributor

Great explanation and example. It's worth bolding correlation != causation so the reader can get the gist of this example while scanning.

Contributor Author

good point!

@benmiroglio
Contributor Author

I plan to add a pure SQL example per @fbertsch's request; however, I think this is OK to merge for the moment after getting @SuYoungHong's review.

Contributor

@jklukas left a comment

Left a few typo nits. At a high level, this feels like it hits a great sweet spot of being both concise and specific.

```python
)
```

Peaking at 6-Week Retention
Contributor

Peaking -> Peeking

def from_unixtime_handler(ut):
    """
    Converts unix time (in days) to a string in %Y%m%d format.
    This is spark UDF.
Contributor

Nit: This is a spark UDF

Contributor Author

nice catches! Fixed.
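For reference, a sketch of what the completed helper might look like once registered as a UDF; only the docstring comes from the PR, and the body shown here is an assumption:

```python
from datetime import datetime, timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def from_unixtime_handler(ut):
    """
    Converts unix time (in days) to a string in %Y%m%d format.
    This is a spark UDF.
    """
    if ut is None:
        return None
    # Assumed body: count days forward from the unix epoch.
    return (datetime(1970, 1, 1) + timedelta(days=int(ut))).strftime("%Y%m%d")

# Register as a Spark UDF so it can be applied to a DataFrame column.
from_unixtime_udf = udf(from_unixtime_handler, StringType())
```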

Contributor

@SuYoungHong left a comment

Overall, looks great. Awesome work.

I would request that you make a note distinguishing between the fixed-point method you use for retention (where the week N period is different for clients that start on different days) and a cohort approach.

Just a quick note, like new user vs. existing user, should be good!


Retention measures the rate at which users are *continuing* to use Firefox, making it one of the more important metrics we track. We commonly measure retention between releases, experiment cohorts, and various Firefox subpopulations to better understand how a change to the user experience or use of a specific feature affects behavior.

### N Week Retention
Contributor

good work! One more distinction I would make about types of N week retention:

rolling vs cohorted

Here you're using a rolling approach to N week retention (week N is between 7N and 7N + 6 days after a client's profile creation date), but a lot of people are using cohorts to calculate retention:

Define weekly cohorts for your population (clients that start between 20180101 and 20180107 = cohort 1, 20180108 to 20180114 = cohort 2, etc.). For each cohort, define week N retention as having appeared in the week period N weeks after the cohort period (i.e. week 1 means appearing in 20180108 to 20180114 for cohort 1).

So essentially, using an anchor period instead of an anchor point.

Would be a good idea to clarify this distinction, since this is many people's understanding of N week retention right now.
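To make that cohorted definition concrete, here is a small illustrative sketch (plain Python; the 20180101 anchor date and 0-based cohort numbering are assumptions for the example):

```python
from datetime import date, timedelta

ANCHOR = date(2018, 1, 1)  # start of the first weekly cohort (assumed)

def cohort_index(profile_creation_date):
    """0-based weekly cohort a client belongs to (0 = 20180101-20180107, ...)."""
    return (profile_creation_date - ANCHOR).days // 7

def cohort_week_n_period(cohort, n):
    """Date range that counts as week N retention for a given cohort."""
    start = ANCHOR + timedelta(days=7 * (cohort + n))
    return start, start + timedelta(days=6)

# Cohort 0 (profiles created 20180101-20180107): week 1 is 20180108-20180114,
# matching the definition above.
assert cohort_index(date(2018, 1, 5)) == 0
assert cohort_week_n_period(0, 1) == (date(2018, 1, 8), date(2018, 1, 14))
```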

Contributor Author

Good point.

I use cohorts when there isn't a clear anchor point (e.g. users that enabled sync: there is no clear 'sync was enabled on this date' indicator without looking across a client's history). Might be worth adding an example like this as well, in a section titled "What if you don't have an anchor point?".

When you do have an anchor point, however, IMO the rolling method is the most precise (and least expensive). Let me know your thoughts, @SuYoungHong!



we observe that 35.6% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in [this notebook](https://gist.github.com/benmiroglio/fc708e5905fad33b43adb9c90e38ebf4).
Contributor

Mentioned offline, but please update the copy to reflect the numbers in the code output.

Contributor Author

good catch! will fix.

**Please be sure to specify whether your retention analysis is for new or existing users.**



Contributor

Note on Point Anchoring vs Period Anchoring

As defined above, N week retention needs some anchor to start N from.

However, we should note that the anchor itself can be a single point (such as a day) or a period (such as a week), also known as a cohort or acquisition period.

The code example above uses a point anchor of a day, defining week N for each client independently depending on which day they start.

Many previous retention analyses at Mozilla, however, use a period anchor and calculate retention per cohort. For example:

  • Define a week (20180101 to 20180107) as the anchor/acquisition/cohort period.
  • Define clients who have profile_creation_date within these dates as a 'cohort'.
  • Define week N as 20180101 + 7N to 20180107 + 7N.
  • For the cohort, calculate each week N retention.
  • Repeat for each cohort in the period of interest.

The major difference is that with a 1 week period anchor, the week N retention period is the same for a client who started at the beginning of the anchor period and one who started at the end of it, whereas with a 1 day point anchor, those same two clients' week N retention periods differ by the number of days between their start dates.

Point anchoring is preferred; however, both methods convey the same information.
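A tiny illustration of that difference (plain Python; the dates are made up for the example):

```python
from datetime import date

ANCHOR = date(2018, 1, 1)   # start of the weekly cohort period (assumed)
ping = date(2018, 1, 16)    # day a ping was submitted (made up)

early = date(2018, 1, 1)    # client who started at the beginning of the week
late = date(2018, 1, 7)     # client who started at the end of the week

# Point anchoring: the week index is relative to each client's own start day.
point_weeks = [(ping - pcd).days // 7 for pcd in (early, late)]

# Period anchoring: the week index is relative to the shared cohort period.
period_weeks = [(ping - ANCHOR).days // 7 for _ in (early, late)]

print(point_weeks)   # [2, 1]: the same ping lands in different weeks per client
print(period_weeks)  # [2, 2]: both clients share the cohort's week index
```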


we observe that 35.6% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in [this notebook](https://gist.github.com/benmiroglio/fc708e5905fad33b43adb9c90e38ebf4).


Contributor

A couple of cleaning caveats that might be worth mentioning.

First is client submission latency, which can be somewhere between 2-5 days. `clients_daily` accounts for this when aggregating clients per day, but it's not without its own issues.

There are also noisy clients with inconsistent profile creation and sub-session dates. Different thresholds will affect the consistency of results. The cookbook could point to an existing baseline in the future.
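For readers who do want to apply such filters, here is a hedged sketch of the kinds of checks being described (the column names and the 5-day latency cutoff are assumptions, not an agreed-upon baseline):

```python
# Illustrative only: thresholds and column names are assumptions.
cleaned = spark.sql("""
    SELECT
        client_id,
        profile_creation_date,
        submission_date_s3
    FROM
        main_summary
    WHERE
        -- drop clients whose pings predate their own profile creation date
        TO_DATE(submission_date_s3, 'yyyyMMdd') >=
            TO_DATE(FROM_UNIXTIME(profile_creation_date * 86400))
        -- wait out client submission latency before trusting recent days
        AND submission_date_s3 <= DATE_FORMAT(DATE_SUB(CURRENT_DATE(), 5), 'yyyyMMdd')
""")
```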

Contributor Author

While I agree these are potential issues, I think they are beyond the scope of the cookbook since they are general data issues that affect most Firefox analyses. I don't want to bog things down with every potential caveat; the cookbook should instead serve as a methodological jumping-off point.

Since retention is usually calculated to compare two or more groups of users, I feel safe excluding these caveats, as each group is subject to the same pitfalls (in this case).

@acmiyaguchi Let me know if you feel this is justified!

Contributor

Yep, sounds good to me. A six week waiting period and relatively consistent results will probably take care of those caveats across analyses.

@benmiroglio
Contributor Author

@SuYoungHong I wrote up a section that simplifies cohort-based retention, putting it in terms of the anchor point approach so the document is a little more cohesive. Please review!

@SuYoungHong
Contributor

Looks good to me! Ready for merging!

@benmiroglio merged commit f4d45d2 into mozilla:master on Jun 22, 2018
@harterrt
Contributor

🎉 🎉 🎉
