### Cohort Analysis

A `cohort` is a group of individuals who share some characteristic of interest at the time we start observing them. Common choices for that characteristic are described below.

`Cohort analysis` is a useful way to compare groups of entities over time. Many important behaviors take weeks, months, or years to occur or evolve, and cohort analysis is a way to understand these changes. Cohort analysis provides a framework for detecting correlations between cohort characteristics and these long-term trends, which can lead to hypotheses about the causal drivers.

Cohort analysis can be used to monitor new cohorts of users or customers and assess how they compare to previous cohorts. Such monitoring can provide an early alert signal that something has gone wrong (or right) for new customers. Cohort analysis is also used to mine historical data. A/B tests are the gold standard for determining causality, but we can’t go back in time and run every test for every question about the past in which we are interested.

`Cohort grouping` is often based on a start date: the customer’s first purchase or subscription date, the date a student started school, and so on. 

> **Cohort or Segment?**
>
> A `cohort` is a group of users (or other entities) who have a common starting date and are followed over time. A `segment` is a grouping of users who share a common characteristic or set of characteristics at a point in time, regardless of their starting date. 

#### Types of cohort analysis
---

**Retention**

Retention is concerned with whether the cohort member has a record in the time series on a particular date, expressed as a number of periods from the starting date. This is useful in any kind of organization in which repeated actions are expected, from playing an online game to using a product or renewing a subscription, and it helps to answer questions about how sticky or engaging a product is and how many entities can be expected to appear on future dates.

**Survivorship**

Survivorship is concerned with how many entities remained in the data set for a certain length of time or longer, regardless of the number or frequency of actions up to that time. Survivorship is useful for answering questions about the proportion of the population that can be expected to remain—either in a positive sense by not churning or passing away, or in a negative sense by not graduating or fulfilling some requirement.
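
As a rough illustration of the survivorship idea, the sketch below computes the share of entities whose observed span reaches a minimum length. The `tenure` table, its column names, and the `survival_share` helper are hypothetical and not part of the legislators data set.

```python
import pandas as pd

# Hypothetical toy data: one row per entity with its first and last observed dates.
tenure = pd.DataFrame({
    "entity_id": ["a", "b", "c", "d"],
    "first_seen": pd.to_datetime(["2020-01-01"] * 4),
    "last_seen": pd.to_datetime(["2020-06-30", "2021-03-15", "2022-01-10", "2020-01-20"]),
})

def survival_share(df, min_days):
    """Share of entities whose observed span is at least min_days long."""
    span_days = (df["last_seen"] - df["first_seen"]).dt.days
    # The number of actions in between is ignored; only the span matters.
    return (span_days >= min_days).mean()

share_one_year = survival_share(tenure, 365)
print(share_one_year)
```

Here two of the four entities (`b` and `c`) are still present a year or more after they were first seen, so the share is 0.5.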

**Returnship**

Returnship or repeat purchase behavior is concerned with whether an action has happened more than some minimum threshold of times—often simply more than once—during a fixed window of time. This type of analysis is useful in situations in which the behavior is intermittent and unpredictable, such as in retail, where it characterizes the share of repeat purchasers in each cohort within a fixed time window.
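
A minimal sketch of a returnship calculation, assuming a hypothetical `purchases` table with one row per purchase. The `repeat_share` helper and the 30-day and 60-day windows are illustrative choices, not values from the text.

```python
import pandas as pd

# Hypothetical toy data: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "purchase_date": pd.to_datetime([
        "2021-01-05", "2021-02-10", "2021-01-07",
        "2021-01-09", "2021-06-01", "2021-01-20",
    ]),
})

def repeat_share(df, window_days, min_actions=2):
    """Share of customers with at least min_actions purchases inside the window."""
    # Each customer's window starts at their own first purchase.
    first = df.groupby("customer_id")["purchase_date"].transform("min")
    in_window = df[(df["purchase_date"] - first).dt.days < window_days]
    counts = in_window.groupby("customer_id").size()
    return (counts >= min_actions).mean()

share_30 = repeat_share(purchases, 30)
share_60 = repeat_share(purchases, 60)
print(share_30, share_60)
```

Widening the window from 30 to 60 days raises the repeat share from one third to two thirds in this toy data, which is why the window should be fixed before cohorts are compared.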

**Cumulative**

Cumulative calculations are concerned with the total number or amounts measured at one or more fixed time windows, regardless of when they happened during that window. Cumulative calculations are often used in calculations of customer lifetime value (LTV or CLTV).
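
The sketch below shows one way a cumulative calculation could look: the total amount inside a fixed window, divided by the number of cohort members. The `orders` table, its columns, and the `cumulative_per_member` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: order amounts with days elapsed since each customer's first order.
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b"],
    "days_since_first": [0, 40, 0, 10],
    "amount": [20.0, 30.0, 15.0, 5.0],
})

def cumulative_per_member(df, window_days):
    """Total amount inside the window divided by the number of cohort members."""
    in_window = df[df["days_since_first"] < window_days]
    # When an order happened inside the window does not matter, only that it did.
    return in_window["amount"].sum() / df["customer_id"].nunique()

value_30 = cumulative_per_member(orders, 30)
value_90 = cumulative_per_member(orders, 90)
print(value_30, value_90)
```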

#### The Legislators Data Set

The SQL examples in this chapter will use a data set of past and present members of the United States Congress.

Congress has two chambers, the Senate (“sen” in the data set) and the House of Representatives (“rep”). Each state has two senators, and they are elected for six-year terms. Representatives are allocated to states based on population; each representative has a district that they alone represent. Representatives are elected for two-year terms.

Actual terms in either chamber can be shorter in the event that the legislator dies or is elected or appointed to a higher office. Legislators accumulate power and influence via leadership positions the longer they are in office, and thus standing for re-election is common. 

Finally, a legislator may belong to a political party, or they may be an “independent.” In the modern era, the vast majority of legislators are Democrats or Republicans, and the rivalry between the two parties is well known.

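In [1]:
import sqlite3

import pandas as pd

# Load the two CSV files that make up the legislators data set.
df_01 = pd.read_csv('legislators.csv')
df_02 = pd.read_csv('legislators_terms.csv')

conn = sqlite3.connect("sql-analysis-02.db")

# if_exists='replace' lets this cell be re-run without a "table already exists" error;
# index=False keeps the DataFrame index from being stored as an extra column.
df_01.to_sql('legislators', conn, if_exists='replace', index=False)
df_02.to_sql('legislators_terms', conn, if_exists='replace', index=False)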

#### Retention

Retention analysis uses the count of entities or sum of money or actions present in the data set for each period from the starting date, and it normalizes by dividing this number by the count or sum of entities, money, or actions in the first time period. The result is expressed as a percentage, and retention in the starting period is always 100%. Over time, retention based on counts generally declines and can never exceed 100%, whereas money- or action-based retention, while often declining, can increase and be greater than 100% in a time period. Retention analysis output is typically displayed as a table or as a graph; the graph form is referred to as a retention curve.
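
Before turning to the legislators data, the normalization described above can be sketched on a toy example. The `activity` table and its values are hypothetical; the point is only to show the division by the period-0 count.

```python
import pandas as pd

# Hypothetical toy data: one row per (entity, period) in which the entity was present.
activity = pd.DataFrame({
    "entity_id": ["a", "a", "a", "b", "b", "c"],
    "period": [0, 1, 2, 0, 1, 0],
})

# Count distinct entities present in each period.
retained = activity.groupby("period")["entity_id"].nunique()

# Normalize by the size of the cohort in the starting period.
pct_retained = retained / retained.loc[0]
print(pct_retained)
```

Period 0 is 1.0 by construction, and count-based retention cannot rise above it because only members of the starting cohort are counted.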

In [4]:
query_01 = """
        SELECT id_bioguide,
            MIN(term_start) AS first_term
        FROM legislators_terms
        GROUP BY 1
        """

sql_01 = pd.read_sql(query_01, conn)
sql_01

| | id_bioguide | first_term |
|---|---|---|
| 0 | A000001 | 1951-01-03 |
| 1 | A000002 | 1947-01-03 |
| 2 | A000003 | 1817-12-01 |
| 3 | A000004 | 1843-12-04 |
| 4 | A000005 | 1887-12-05 |
| ... | ... | ... |
| 12513 | Z000013 | 1976-12-28 |
| 12514 | Z000014 | 1983-01-03 |
| 12515 | Z000016 | 1967-01-10 |
| 12516 | Z000017 | 2015-01-06 |


In [72]:
query_02 = """
        -- Dates are stored as 'YYYY-MM-DD' text, so SQLite's subtraction uses only the
        -- leading year digits: period is the difference in calendar years.
        SELECT b.term_start - a.first_term AS period,
            COUNT(DISTINCT a.id_bioguide) AS cohort_retained
        FROM
        (
            SELECT id_bioguide,
                MIN(term_start) AS first_term
            FROM legislators_terms
            GROUP BY 1
        ) a
        JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide
        GROUP BY 1
        """

sql_02 = pd.read_sql(query_02, conn)
sql_02.head()

| | period | cohort_retained |
|---|---|---|
| 0 | 0 | 12518 |
| 1 | 1 | 148 |
| 2 | 2 | 7120 |
| 3 | 3 | 124 |
| 4 | 4 | 4925 |
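
The `period` values are whole numbers because the dates are stored as text: when SQLite subtracts two `'YYYY-MM-DD'` strings it converts each one to a number using only its leading digits, which here is the year. The sketch below checks this on literal dates in a throwaway in-memory database; `conn_demo` is a hypothetical name chosen so that the notebook's `conn` is left untouched.

```python
import sqlite3

# Separate in-memory connection so the notebook's database is not affected.
conn_demo = sqlite3.connect(":memory:")

row = conn_demo.execute(
    """
    SELECT '1953-01-03' - '1951-01-03' AS implicit_diff,
           CAST(strftime('%Y', '1953-01-03') AS INTEGER)
             - CAST(strftime('%Y', '1951-01-03') AS INTEGER) AS explicit_diff,
           '1951-12-31' - '1951-01-03' AS same_year,
           '1952-01-03' - '1951-12-31' AS across_new_year
    """
).fetchone()
conn_demo.close()

print(row)
```

The implicit and explicit forms agree, but both measure calendar-year differences rather than elapsed time: two dates three days apart yield a period of 1 if they straddle a new year, while dates almost a year apart within the same year yield 0.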


In [73]:
query_03 = """
        SELECT period,
            FIRST_VALUE(cohort_retained) OVER (ORDER BY period) AS cohort_size,
            cohort_retained,
            cohort_retained * 1.0 / FIRST_VALUE(cohort_retained) OVER (ORDER BY period) AS pct_retained
        FROM
        (
            SELECT b.term_start - a.first_term AS period,
                COUNT(DISTINCT a.id_bioguide) AS cohort_retained
            FROM
            (
                SELECT id_bioguide,
                    MIN(term_start) AS first_term
                FROM legislators_terms 
                GROUP BY 1
            ) a
            JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide 
            GROUP BY 1
        )
        """

sql_03 = pd.read_sql(query_03, conn)
sql_03.head()

| | period | cohort_size | cohort_retained | pct_retained |
|---|---|---|---|---|
| 0 | 0 | 12518 | 12518 | 1.0 |
| 1 | 1 | 12518 | 148 | 0.011823 |
| 2 | 2 | 12518 | 7120 | 0.568781 |
| 3 | 3 | 12518 | 124 | 0.009906 |
| 4 | 4 | 12518 | 4925 | 0.393433 |


In [74]:
query_04 = """
        SELECT cohort_size,
            MAX(CASE WHEN period = 0 THEN pct_retained END) AS yr0,
            MAX(CASE WHEN period = 1 THEN pct_retained END) AS yr1,
            MAX(CASE WHEN period = 2 THEN pct_retained END) AS yr2,
            MAX(CASE WHEN period = 3 THEN pct_retained END) AS yr3,
            MAX(CASE WHEN period = 4 THEN pct_retained END) AS yr4
        FROM
        (
            SELECT period,
                FIRST_VALUE(cohort_retained) OVER (ORDER BY period) AS cohort_size,
                cohort_retained,
                cohort_retained * 1.0 / FIRST_VALUE(cohort_retained) OVER (ORDER BY period) AS pct_retained
            FROM
            (
                SELECT b.term_start - a.first_term AS period,
                    COUNT(DISTINCT a.id_bioguide) AS cohort_retained
                FROM
                (
                    SELECT id_bioguide,
                        MIN(term_start) AS first_term
                    FROM legislators_terms 
                    GROUP BY 1
                ) a
                JOIN legislators_terms b ON a.id_bioguide = b.id_bioguide 
                GROUP BY 1
            )
        )
        GROUP BY 1
        """

sql_04 = pd.read_sql(query_04, conn)
sql_04.head()

| | cohort_size | yr0 | yr1 | yr2 | yr3 | yr4 |
|---|---|---|---|---|---|---|
| 0 | 12518 | 1.0 | 0.011823 | 0.568781 | 0.009906 | 0.393433 |


#### Adjusting Time Series to Increase Retention Accuracy

In the legislators data set, we have a record for a term’s start date, but we are missing the notion that this “entitles” a legislator to serve for two or six years, depending on the chamber.

Calculating retention using a start and end date defined in the data is the most accurate approach. For the following examples, we will consider legislators retained in a particular year if they were still in office as of the last day of the year, December 31. Prior to the Twentieth Amendment to the US Constitution, terms began on March 4, but afterward the start date moved to January 3, or to a subsequent weekday if the third falls on a weekend. Legislators can be sworn in on other days of the year due to special off-cycle elections or appointments to fill vacant seats. As a result, `term_start` dates cluster in January but are spread across the year.

While we could pick another day, using **December 31** as the reference date is one way to normalize around these varying start dates.
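
As a rough sketch of the December 31 rule, the example below counts how many legislators are in office on each year-end date by checking whether that date falls between a term's start and end. It runs on hypothetical toy rows in an in-memory database. The `terms` and `year_ends` tables, the `X1` to `X3` ids, and the presence of a `term_end` column are assumptions made for illustration.

```python
import sqlite3

# Throwaway in-memory database with hypothetical term records.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE terms (id TEXT, term_start TEXT, term_end TEXT)")
demo.executemany(
    "INSERT INTO terms VALUES (?, ?, ?)",
    [
        ("X1", "2015-01-06", "2017-01-03"),  # regular two-year term
        ("X1", "2017-01-03", "2019-01-03"),  # re-elected
        ("X2", "2016-06-01", "2017-01-03"),  # sworn in mid-year to fill a vacancy
        ("X3", "2016-02-01", "2016-11-30"),  # left before the end of the year
    ],
)

# One row per reference date; ISO-formatted text dates compare correctly as strings.
demo.execute("CREATE TABLE year_ends (d TEXT)")
demo.executemany(
    "INSERT INTO year_ends VALUES (?)",
    [("2015-12-31",), ("2016-12-31",), ("2017-12-31",), ("2018-12-31",)],
)

rows = demo.execute(
    """
    SELECT y.d, COUNT(DISTINCT t.id) AS in_office
    FROM year_ends y
    LEFT JOIN terms t ON y.d BETWEEN t.term_start AND t.term_end
    GROUP BY y.d
    ORDER BY y.d
    """
).fetchall()
demo.close()

print(rows)
```

Note the trade-off this choice carries: `X3` served during 2016 but is never counted, because the term ended before December 31.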