In [7]:
import pandas as pd
from sqlalchemy import create_engine

df_01 = pd.read_csv('legislators.csv')
df_02 = pd.read_csv('legislators_terms.csv')

engine = create_engine('postgresql://testuser:testpass@localhost:5432/postgresql_analysis')

#df_01.to_sql("legislators", engine)
#df_02.to_sql("legislators_terms", engine)

In [8]:
import psycopg2 as pg2

con = pg2.connect(host='localhost',
                  user='testuser',
                  password='testpass',
                  database='postgresql_analysis')
con.autocommit = True
cur = con.cursor()

### Cohort Analysis

A `cohort` is a group of individuals who share some characteristic of interest, described below, at the time we start observing them.

`Cohort analysis` is a useful way to compare groups of entities over time. Many important behaviors take weeks, months, or years to occur or evolve, and cohort analysis is a way to understand these changes. Cohort analysis provides a framework for detecting correlations between cohort characteristics and these long-term trends, which can lead to hypotheses about the causal drivers.

Cohort analysis can be used to monitor new cohorts of users or customers and assess how they compare to previous cohorts. Such monitoring can provide an early alert signal that something has gone wrong (or right) for new customers. Cohort analysis is also used to mine historical data. A/B tests are the gold standard for determin‐ing causality, but we can’t go back in time and run every test for every question about the past in which we are interested.

`Cohort grouping` is often based on a start date: the customer’s first purchase or subscription date, the date a student started school, and so on. 

> **Cohort or Segment?**
>
> A `cohort` is a group of users (or other entities) who have a common starting date and are followed over time. A `segment` is a grouping of users who share a common characteristic or set of characteristics at a point in time, regardless of their starting date. 

#### Types of cohort anlaysis
---

**Retention**

Retention is concerned with whether the cohort member has a record in the time series on a particular date, expressed as a number of periods from the starting date. This is useful in any kind of organization in which repeated actions are expected, from playing an online game to using a product or renewing a subscription, and it helps to answer questions about how sticky or engaging a product is and how many entities can be expected to appear on future dates.

**Survivorship**

Survivorship is concerned with how many entities remained in the data set for a certain length of time or longer, regardless of the number or frequency of actions up to that time. Survivorship is useful for answering questions about the proportion of the population that can be expected to remain—either in a positive sense by not churning or passing away, or in a negative sense by not graduating or fulfilling some requirement.

**Returnship**

Returnship or repeat purchase behavior is concerned with whether an action has happened more than some minimum threshold of times—often simply more than once—during a fixed window of time. This type of analysis is useful in situations in which the behavior is intermittent and unpredictable, such as in retail, where it characterizes the share of repeat purchasers in each cohort within a fixed time window.

**Cumulative**

Cumulative calculations are concerned with the total number or amounts measured at one or more fixed time windows, regardless of when they happened during that window. Cumulative calculations are often used in calculations of customer lifetime value (LTV or CLTV).

#### The Legislators Data Set

The SQL examples in this chapter will use a data set of past and present members of the United States Congress.

Congress has two chambers, the Senate (“sen” in the data set) and the House of Representatives (“rep”). Each state has two senators, and they are elected for six-year terms. Representatives are allocated to states based on population; each representative has a district that they alone represent. Representatives are elected for two-year terms.

Actual terms in either chamber can be shorter in the event that the legislator dies or is elected or appointed to a higher office. Legislators accumulate power and influence via leadership positions the longer they are in office, and thus standing for re-election is common. 

Finally, a legislator may belong to a political party, or they may be an “independent.” In the modern era, the vast majority of legislators are Democrats or Republicans, and the rivalry between the two parties is well known.

#### Retention

Retention analysis uses the count of entities or sum of money or actions present in the data set for each period from the starting date, and it normalizes by dividing this number by the count or sum of entities, money, or actions in the first time period. The result is expressed as a percentage, and retention in the starting period is always 100%. Over time, retention based on counts generally declines and can never exceed 100%, whereas money- or action-based retention, while often declining, can increase and be greater than 100% in a time period. Retention analysis output is typically displayed in either table or graph form, which is referred to as a retention curve.

In [9]:
query_01 = """
        SELECT id_bioguide
            ,min(term_start) as first_term
        FROM legislators_terms 
        GROUP BY 1
        """

sql_01 = pd.read_sql(query_01, con)
sql_01

Unnamed: 0,id_bioguide,first_term
0,A000118,1975-01-14
1,P000281,1933-03-09
2,K000039,1933-03-09
3,A000306,1907-12-02
4,O000095,1949-01-03
...,...,...
12513,G000331,1949-01-03
12514,M000103,1867-03-04
12515,B000255,1821-12-03
12516,L000152,1891-12-07


In [10]:
query_02 = """
        SELECT date_part('year', age(b.term_start, a.first_term)) as periods
            ,count(distinct a.id_bioguide) as cohort_retained
        FROM
        (
                SELECT id_bioguide
                ,min(term_start) as first_term
                FROM legislators_terms 
                GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        GROUP BY 1
        """

sql_02 = pd.read_sql(query_02, con)
sql_02.head()

Unnamed: 0,periods,cohort_retained
0,0.0,12518
1,1.0,3600
2,2.0,3619
3,3.0,1831
4,4.0,3210


In [11]:
query_03 = """
        SELECT period
            ,first_value(cohort_retained) over (order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / first_value(cohort_retained) over (order by period) as pct_retained
        FROM
        (
            SELECT date_part('year',age(b.term_start,a.first_term)) as period
            ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                    SELECT id_bioguide
                    ,min(term_start) as first_term
                    FROM legislators_terms 
                    GROUP BY 1
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            GROUP BY 1
        ) aa
        """

sql_03 = pd.read_sql(query_03, con)
sql_03.head()

Unnamed: 0,period,cohort_size,cohort_retained,pct_retained
0,0.0,12518,12518,1.0
1,1.0,12518,3600,0.287586
2,2.0,12518,3619,0.289104
3,3.0,12518,1831,0.146269
4,4.0,12518,3210,0.256431


In [12]:
query_04 = """
        SELECT cohort_size
            ,max(case when period = 0 then pct_retained end) as yr0
            ,max(case when period = 1 then pct_retained end) as yr1
            ,max(case when period = 2 then pct_retained end) as yr2
            ,max(case when period = 3 then pct_retained end) as yr3
            ,max(case when period = 4 then pct_retained end) as yr4
        FROM
        (
            SELECT period
            ,first_value(cohort_retained) over (order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / first_value(cohort_retained) over (order by period) as pct_retained
            FROM
            (
                SELECT 
                date_part('year',age(b.term_start,a.first_term)) as period
                ,count(*) as cohort_retained
                FROM
                (
                        SELECT id_bioguide
                        ,min(term_start) as first_term
                        FROM legislators_terms 
                        GROUP BY 1
                ) a
                JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
                GROUP BY 1
            ) aa
        ) aaa
        GROUP BY 1
        """

sql_04 = pd.read_sql(query_04, con)
sql_04.head()

Unnamed: 0,cohort_size,yr0,yr1,yr2,yr3,yr4
0,12647,1.0,0.284732,0.28655,0.145015,0.253973


#### Adjusting Time Series to Increase Retention Accuracy

In the legislators data set, we have a record for a term’s start date, but we are missing the notion that this “entitles” a legislator to serve for two or six years, depending on the chamber.

Calculating retention using a start and end date defined in the data is the most accurate approach. For the following examples, we will consider legislators retained in a particular year if they were still in office as of the last day of the year, December 31. Prior to the Twentieth Amendment to the US Constitution, terms began on March 4, but afterward the start date moved to January 3, or to a subsequent weekday if the third falls on a weekend. Legislators can be sworn in on other days of the year due to special off-cycle elections or appointments to fill vacant seats. As a result, term_start dates cluster in January but are spread across the year. 

While we could pick another day, **December 31** is a strategy for **normalizing** around these varying start dates.

In [13]:
query_05 = """
       SELECT a.id_bioguide, a.first_term
           ,b.term_start, b.term_end
           ,c.date
           ,date_part('year',age(c.date,a.first_term)) as period
       FROM
       (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms 
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
        and c.month_name = 'December' and c.day_of_month = 31
        """

sql_05 = pd.read_sql(query_05, con)
sql_05

Unnamed: 0,id_bioguide,first_term,term_start,term_end,date,period
0,B000944,1993-01-05,1993-01-05,1995-01-03,1993-12-31,0.0
1,B000944,1993-01-05,1993-01-05,1995-01-03,1994-12-31,1.0
2,C000127,1993-01-05,1993-01-05,1995-01-03,1993-12-31,0.0
3,C000127,1993-01-05,1993-01-05,1995-01-03,1994-12-31,1.0
4,C000141,1987-01-06,1987-01-06,1989-01-03,1987-12-31,0.0
...,...,...,...,...,...,...
98710,D000355,1955-01-05,2009-01-06,2011-01-03,2010-12-31,55.0
98711,D000355,1955-01-05,2011-01-05,2013-01-03,2011-12-31,56.0
98712,D000355,1955-01-05,2011-01-05,2013-01-03,2012-12-31,57.0
98713,D000355,1955-01-05,2013-01-03,2015-01-03,2013-12-31,58.0


Note that even though more than 11 months may have elapsed between being sworn in in January and December 31, the first year still appears as 0.

In [14]:
query_06 = """
        SELECT coalesce(date_part('year',age(c.date,a.first_term)),0) as period
            ,count(distinct a.id_bioguide) as cohort_retained
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms 
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
        and c.month_name = 'December' and c.day_of_month = 31
        GROUP BY 1
        """

sql_06 = pd.read_sql(query_06, con)
sql_06.head()

Unnamed: 0,period,cohort_retained
0,0.0,12518
1,1.0,12328
2,2.0,8166
3,3.0,8069
4,4.0,5862


A `coalesce` function is used on period to set a default value of 0 when null. This handles the cases in which a legislator’s term starts and ends in the same year, giving credit for serving in that year

In [15]:
query_07 = """
        SELECT period
            ,first_value(cohort_retained) over (order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (order by period) as pct_retained
        FROM
        (
            SELECT coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms 
                GROUP BY 1
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            GROUP BY 1
        ) aa
        """

sql_07 = pd.read_sql(query_07, con)
sql_07.head()

Unnamed: 0,period,cohort_size,cohort_retained,pct_retained
0,0.0,12518,12518,1.0
1,1.0,12518,12328,0.984822
2,2.0,12518,8166,0.652341
3,3.0,12518,8069,0.644592
4,4.0,12518,5862,0.468286


<img align='left' width="459" alt="Screen Shot 2022-04-22 at 1 30 07 PM" src="https://user-images.githubusercontent.com/73784742/164609369-65fb2fef-6f45-4ceb-aef0-4190b1413837.png">

Almost all legislators are still in office in year 1, and the first big drop-off occurs in year 2, when some representatives will fail to be reelected.

---

If the data set does not contain an end date, there are a couple of options for imputing one. 

One option is to add a fixed interval to the start date, when the length of a subscription or term is known. 

In [16]:
query_08 = """
        SELECT a.id_bioguide, a.first_term
            ,b.term_start
            ,case when b.term_type = 'rep' then b.term_start + interval '2 years'
                  when b.term_type = 'sen' then b.term_start + interval '6 years'
                  end as term_end
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms 
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        """

sql_08 = pd.read_sql(query_08, con)
sql_08

Unnamed: 0,id_bioguide,first_term,term_start,term_end
0,B000944,1993-01-05,1993-01-05,1995-01-05
1,C000127,1993-01-05,1993-01-05,1995-01-05
2,C000141,1987-01-06,1987-01-06,1989-01-06
3,C000174,1983-01-03,1983-01-03,1985-01-03
4,C001070,2007-01-04,2007-01-04,2013-01-04
...,...,...,...,...
44058,D000355,1955-01-05,2007-01-04,2009-01-04
44059,C000714,1965-01-04,2017-01-03,2019-01-03
44060,D000355,1955-01-05,2009-01-06,2011-01-06
44061,D000355,1955-01-05,2011-01-05,2013-01-05


This block of code can then be plugged into the retention code to derive the period and pct_retained. The drawback to this method is that it fails to capture instances in which a legislator did not complete a full term, which can happen in the event of death or appointment to a higher office.

---

A second option is to use the subsequent starting date, minus one day, as the term_end date.

This can be calculated with the `lead` window function. This function is similar to the lag function we’ve used previously, but rather than returning a value from a row earlier in the partition, it returns a value from a row later in the partition, as determined in the ORDER BY clause. The default is one row, which we will use here, but the function has an optional argument indicating a different number of rows.

In [17]:
query_09 = """
        SELECT a.id_bioguide, a.first_term
            ,b.term_start
            ,lead(b.term_start) over (partition by a.id_bioguide order by b.term_start) 
             - interval '1 day' as term_end
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms 
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        """

sql_09 = pd.read_sql(query_09, con)
sql_09

Unnamed: 0,id_bioguide,first_term,term_start,term_end
0,A000001,1951-01-03,1951-01-03,NaT
1,A000002,1947-01-03,1947-01-03,1949-01-02
2,A000002,1947-01-03,1949-01-03,1951-01-02
3,A000002,1947-01-03,1951-01-03,1953-01-02
4,A000002,1947-01-03,1953-01-03,1955-01-04
...,...,...,...,...
44058,Z000017,2015-01-06,2015-01-06,2017-01-02
44059,Z000017,2015-01-06,2017-01-03,2019-01-02
44060,Z000017,2015-01-06,2019-01-03,NaT
44061,Z000018,2015-01-06,2015-01-06,2017-01-02


This code block can then be plugged into the retention code. This method has a couple of drawbacks. First, when there is no subsequent term, the lead function returns null, leaving that term without a term_end. A default value, such as a default interval shown in the last example, could be used in such cases. The second drawback is that this method assumes that terms are always consecutive, with no time spent out of office. Although most legislators tend to serve continuously until their congressional careers end, there are certainly examples of gaps between terms spanning several years.

#### Cohorts Derived from the Time Series Itself

The most common way to create the cohorts is based on the first or minimum date or time that the entity appears in the time series. This means that only one table is necessary for the cohort retention analysis: the time series itself. Cohorting by the first appearance or action is interesting because often groups that start at different times behave differently.

Time-based cohorts can be grouped by any time granularity that is meaningful to the organization, though weekly, monthly, or yearly cohorts are common. If you’re not sure what grouping to use, try running the cohort analysis with different groupings, without making the cohort sizes too small, to see where meaningful patterns emerge.

In [18]:
query_10 = """
        SELECT date_part('year',a.first_term) as first_year
            ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
            ,count(distinct a.id_bioguide) as cohort_retained
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms 
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
        and c.month_name = 'December' and c.day_of_month = 31
        GROUP BY 1,2
        """

sql_10 = pd.read_sql(query_10, con)
sql_10

Unnamed: 0,first_year,period,cohort_retained
0,1789.0,0.0,89
1,1789.0,1.0,89
2,1789.0,2.0,57
3,1789.0,3.0,56
4,1789.0,4.0,42
...,...,...,...
6010,2019.0,2.0,4
6011,2019.0,3.0,4
6012,2019.0,4.0,4
6013,2019.0,5.0,4


In [19]:
query_11 = """
        SELECT first_year
            ,period
            ,first_value(cohort_retained) over (partition by first_year order by period) as cohort_size
            ,cohort_retained
            ,round(cohort_retained * 1.0 / first_value(cohort_retained) over (partition by first_year order by period),2) as pct_retained
        FROM
        (
            SELECT date_part('year', a.first_term) as first_year
                ,coalesce(date_part('year',age(c.date, a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms 
                GROUP BY 1
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            GROUP BY 1,2
        ) aa
        """

sql_11 = pd.read_sql(query_11, con)
sql_11

Unnamed: 0,first_year,period,cohort_size,cohort_retained,pct_retained
0,1789.0,0.0,89,89,1.00
1,1789.0,1.0,89,89,1.00
2,1789.0,2.0,89,57,0.64
3,1789.0,3.0,89,56,0.63
4,1789.0,4.0,89,42,0.47
...,...,...,...,...,...
6010,2019.0,2.0,93,4,0.04
6011,2019.0,3.0,93,4,0.04
6012,2019.0,4.0,93,4,0.04
6013,2019.0,5.0,93,4,0.04


In [20]:
query_12 = """
        SELECT first_century, period
            ,first_value(cohort_retained) over (partition by first_century order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (partition by first_century order by period) as pct_retained
        FROM
        (
            SELECT date_part('century',a.first_term) as first_century
                ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms 
                GROUP BY 1
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            GROUP BY 1,2
        ) aa
        ORDER BY 1,2
        """

sql_12 = pd.read_sql(query_12, con)
sql_12

Unnamed: 0,first_century,period,cohort_size,cohort_retained,pct_retained
0,18.0,0.0,368,368,1.000000
1,18.0,1.0,368,360,0.978261
2,18.0,2.0,368,242,0.657609
3,18.0,3.0,368,233,0.633152
4,18.0,4.0,368,149,0.404891
...,...,...,...,...,...
169,21.0,17.0,760,43,0.056579
170,21.0,18.0,760,15,0.019737
171,21.0,19.0,760,14,0.018421
172,21.0,20.0,760,2,0.002632


<img align="left" width="448" alt="Screen Shot 2022-04-22 at 2 02 53 PM" src="https://user-images.githubusercontent.com/73784742/164613123-1b289ed0-193e-4a9d-88aa-4381a658c485.png">

Retention in the early years has been higher for those first elected in the 20th or 21st century. The 21st century is still underway, and thus many of those legislators have not had the opportunity to stay in office for five or more years, though they are still included in the denominator.

In [21]:
query_13 = """
        SELECT distinct id_bioguide
            ,min(term_start) over (partition by id_bioguide) as first_term
            ,first_value(state) over (partition by id_bioguide order by term_start) as first_state
        FROM legislators_terms 
        """

sql_13 = pd.read_sql(query_13, con)
sql_13

Unnamed: 0,id_bioguide,first_term,first_state
0,C000001,1893-08-07,GA
1,R000584,2009-01-06,ID
2,W000215,1975-01-14,CA
3,A000250,1931-12-07,NY
4,S000145,1933-03-09,IN
...,...,...,...
12513,G000244,1885-12-07,MO
12514,H000241,1975-01-14,VA
12515,C000632,1845-12-01,NY
12516,S000122,1849-12-03,NY


In [22]:
query_14 = """
         SELECT first_state, period
            ,first_value(cohort_retained) over (partition by first_state order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (partition by first_state order by period) as pct_retained
        FROM
        (
            SELECT a.first_state
                ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT distinct id_bioguide
                    ,min(term_start) over (partition by id_bioguide) as first_term
                    ,first_value(state) over (partition by id_bioguide order by term_start) as first_state
                FROM legislators_terms 
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            GROUP BY 1,2
        ) aa
        """

sql_14 = pd.read_sql(query_14, con)
sql_14

Unnamed: 0,first_state,period,cohort_size,cohort_retained,pct_retained
0,AK,0.0,19,19,1.000000
1,AK,1.0,19,19,1.000000
2,AK,2.0,19,15,0.789474
3,AK,3.0,19,15,0.789474
4,AK,4.0,19,13,0.684211
...,...,...,...,...,...
2365,WY,34.0,43,1,0.023256
2366,WY,35.0,43,1,0.023256
2367,WY,36.0,43,1,0.023256
2368,WY,37.0,43,1,0.023256


<img align="left" width="446" alt="Screen Shot 2022-04-22 at 2 09 18 PM" src="https://user-images.githubusercontent.com/73784742/164613920-2d8bad17-61c3-41a4-ae06-2409ed5d3bfa.png">

Those elected in Illinois and Massachusetts have the highest retention, while New Yorkers have the lowest retention.

#### Difining the Cohort from a Separte Table

In [23]:
query_15 = """
        SELECT d.gender
            ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
            ,count(distinct a.id_bioguide) as cohort_retained
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms 
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
        LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
        and c.month_name = 'December' and c.day_of_month = 31
        JOIN legislators d on a.id_bioguide = d.id_bioguide
        GROUP BY 1,2
        ORDER BY 2,1
        """

sql_15 = pd.read_sql(query_15, con)
sql_15

Unnamed: 0,gender,period,cohort_retained
0,F,0.0,366
1,M,0.0,12152
2,F,1.0,349
3,M,1.0,11979
4,F,2.0,261
...,...,...,...
95,M,55.0,3
96,M,56.0,3
97,M,57.0,2
98,M,58.0,1


In [24]:
query_16 = """
        SELECT gender, period
            ,first_value(cohort_retained) over (partition by gender order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (partition by gender order by period) as pct_retained
        FROM
        (
            SELECT d.gender
                ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms 
                GROUP BY 1
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            JOIN legislators d on a.id_bioguide = d.id_bioguide
            GROUP BY 1,2
        ) aa
        ORDER BY 2,1
        """

sql_16 = pd.read_sql(query_16, con)
sql_16

Unnamed: 0,gender,period,cohort_size,cohort_retained,pct_retained
0,F,0.0,366,366,1.000000
1,M,0.0,12152,12152,1.000000
2,F,1.0,366,349,0.953552
3,M,1.0,12152,11979,0.985764
4,F,2.0,366,261,0.713115
...,...,...,...,...,...
95,M,55.0,12152,3,0.000247
96,M,56.0,12152,3,0.000247
97,M,57.0,12152,2,0.000165
98,M,58.0,12152,1,0.000082


<img align="left" width="448" alt="Screen Shot 2022-04-22 at 2 23 12 PM" src="https://user-images.githubusercontent.com/73784742/164615625-13d280da-c04d-4e5c-bfc5-a8f97b3de0ca.png">

Retention is higher for female legislators than for their male counterparts for periods 2 through 29. **The first female legislator did not take office until 1917, when Jeannette Rankin joined the House as a Republican representative from Montana.** As we saw earlier, retention has increased in more recent centuries.

In [25]:
query_17 = """
        SELECT gender, period
            ,first_value(cohort_retained) over (partition by gender order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (partition by gender order by period) as pct_retained
        FROM
        (
            SELECT d.gender
                ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms 
                GROUP BY 1
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            JOIN legislators d on a.id_bioguide = d.id_bioguide
            WHERE a.first_term between '1917-01-01' and ' 1999-12-31'
            GROUP BY 1,2
        ) aa
        ORDER BY 2,1
        """

sql_17 = pd.read_sql(query_17, con)
sql_17

Unnamed: 0,gender,period,cohort_size,cohort_retained,pct_retained
0,F,0.0,200,200,1.000000
1,M,0.0,3833,3833,1.000000
2,F,1.0,200,187,0.935000
3,M,1.0,3833,3769,0.983303
4,F,2.0,200,149,0.745000
...,...,...,...,...,...
95,M,55.0,3833,2,0.000522
96,M,56.0,3833,2,0.000522
97,M,57.0,3833,1,0.000261
98,M,58.0,3833,1,0.000261


<img align="left" width="441" alt="Screen Shot 2022-04-22 at 2 30 00 PM" src="https://user-images.githubusercontent.com/73784742/164616495-11d7e20f-35c7-469d-8d67-42df47de25db.png">

Male legislators still outnumber female legislators, but by a smaller margin. With the revised cohorts, male legislators have higher retention through year 7, but starting in year 12, female legisla‐ tors have higher retention. The difference between the two gender-based cohort analyses underscores the importance of setting up appropriate cohorts and ensuring that they have comparable amounts of time to be present or complete other actions of interest. 

#### Dealing with Sparse Cohorts

In [26]:
query_18 = """
        SELECT first_state, gender, period
            ,first_value(cohort_retained) over (partition by gender order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (partition by gender order by period) as pct_retained
        FROM
        (
            SELECT a.first_state, d.gender
                ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT id_bioguide
                    ,min(term_start) over (partition by id_bioguide) as first_term
                    ,first_value(state) over (partition by id_bioguide) as first_state
                FROM legislators_terms
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide 
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            JOIN legislators d on a.id_bioguide = d.id_bioguide
            WHERE a.first_term between '1917-01-01' and ' 1999-12-31'
            GROUP BY 1,2,3
        ) aa
        ORDER BY 1,2,3
        """

sql_18 = pd.read_sql(query_18, con)
sql_18

Unnamed: 0,first_state,gender,period,cohort_size,cohort_retained,pct_retained
0,AK,M,0.0,13,13,1.000000
1,AK,M,1.0,13,13,1.000000
2,AK,M,2.0,13,11,0.846154
3,AK,M,3.0,13,11,0.846154
4,AK,M,4.0,13,9,0.692308
...,...,...,...,...,...,...
2994,WY,M,23.0,13,2,0.153846
2995,WY,M,24.0,13,1,0.076923
2996,WY,M,25.0,13,1,0.076923
2997,WY,M,26.0,13,1,0.076923


<img align="left" width="451" alt="Screen Shot 2022-04-22 at 2 42 20 PM" src="https://user-images.githubusercontent.com/73784742/164618306-c96b3e26-c2c9-4cfe-b988-c7c27b46da6f.png">

Alaska did not have any female legislators, while Arizona’s female retention curve disappears after year 3. Only California, a large state with many legislators, has complete retention curves for both genders. This pattern repeats for other small and large states.

In [27]:
query_19 = """
        SELECT aa.gender, aa.first_state, cc.period, aa.cohort_size
        FROM
        (
            SELECT b.gender, a.first_state
                ,count(distinct a.id_bioguide) as cohort_size
            FROM 
            (
                SELECT distinct id_bioguide
                    ,min(term_start) over (partition by id_bioguide) as first_term
                    ,first_value(state) over (partition by id_bioguide order by term_start) as first_state
                FROM legislators_terms 
            ) a
            JOIN legislators b on a.id_bioguide = b.id_bioguide
            WHERE a.first_term between '1917-01-01' and '1999-12-31' 
            GROUP BY 1,2
        ) aa
        JOIN
        (
            SELECT generate_series as period 
            FROM generate_series(0,20,1)
        ) cc on 1 = 1
        """

sql_19 = pd.read_sql(query_19, con)
sql_19

Unnamed: 0,gender,first_state,period,cohort_size
0,F,AL,0,3
1,F,AR,0,5
2,F,AZ,0,2
3,F,CA,0,25
4,F,CO,0,2
...,...,...,...,...
2137,M,VT,20,18
2138,M,WA,20,57
2139,M,WI,20,84
2140,M,WV,20,54


In [28]:
query_20 = """
        SELECT aaa.gender, aaa.first_state, aaa.period, aaa.cohort_size
            ,coalesce(ddd.cohort_retained,0) as cohort_retained
            ,coalesce(ddd.cohort_retained,0) * 1.0 / aaa.cohort_size as pct_retained
        FROM
        (
        SELECT aa.gender, aa.first_state, cc.period, aa.cohort_size
        FROM
        (
            SELECT b.gender, a.first_state
            ,count(distinct a.id_bioguide) as cohort_size
            FROM 
            (
                    SELECT distinct id_bioguide
                    ,min(term_start) over (partition by id_bioguide) as first_term
                    ,first_value(state) over (partition by id_bioguide 
                                              order by term_start) as first_state
                    FROM legislators_terms 
            ) a
            JOIN legislators b on a.id_bioguide = b.id_bioguide 
            WHERE a.first_term between '1917-01-01' and '1999-12-31' 
            GROUP BY 1,2
        ) aa
        JOIN
        (
            SELECT generate_series as period 
            FROM generate_series(0,20,1)
        ) cc on 1 = 1
        ) aaa
        LEFT JOIN
        (
            SELECT d.first_state, g.gender
                ,coalesce(date_part('year',age(f.date,d.first_term)),0) as period
                ,count(distinct d.id_bioguide) as cohort_retained
            FROM
            (
                SELECT distinct id_bioguide
                    ,min(term_start) over (partition by id_bioguide) as first_term
                    ,first_value(state) over (partition by id_bioguide order by term_start) as first_state
                FROM legislators_terms 
            ) d
            JOIN legislators_terms e on d.id_bioguide = e.id_bioguide 
            LEFT JOIN date_dim f on f.date between e.term_start and e.term_end 
            and f.month_name = 'December' and f.day_of_month = 31
            JOIN legislators g on d.id_bioguide = g.id_bioguide
            WHERE d.first_term between '1917-01-01' and '1999-12-31'
            GROUP BY 1,2,3
        ) ddd on aaa.gender = ddd.gender and aaa.first_state = ddd.first_state 
        and aaa.period = ddd.period
        ORDER BY 1,2,3
        """

sql_20 = pd.read_sql(query_20, con)
sql_20

Unnamed: 0,gender,first_state,period,cohort_size,cohort_retained,pct_retained
0,F,AL,0,3,3,1.000000
1,F,AL,1,3,1,0.333333
2,F,AL,2,3,0,0.000000
3,F,AL,3,3,0,0.000000
4,F,AL,4,3,0,0.000000
...,...,...,...,...,...,...
2137,M,WY,16,27,7,0.259259
2138,M,WY,17,27,7,0.259259
2139,M,WY,18,27,2,0.074074
2140,M,WY,19,27,2,0.074074


In [29]:
query_21 = """
        SELECT gender, first_state, cohort_size
            ,max(case when period = 0 then pct_retained end) as yr0
            ,max(case when period = 2 then pct_retained end) as yr2
            ,max(case when period = 4 then pct_retained end) as yr4
            ,max(case when period = 6 then pct_retained end) as yr6
            ,max(case when period = 8 then pct_retained end) as yr8
            ,max(case when period = 10 then pct_retained end) as yr10
        FROM
        (
            SELECT aaa.gender, aaa.first_state, aaa.period, aaa.cohort_size
                ,coalesce(ddd.cohort_retained,0) as cohort_retained
                ,coalesce(ddd.cohort_retained,0) * 1.0 / aaa.cohort_size as pct_retained
            FROM
            (
            SELECT aa.gender, aa.first_state, cc.period, aa.cohort_size
            FROM
            (
                SELECT b.gender, a.first_state
                    ,count(distinct a.id_bioguide) as cohort_size
                FROM 
                (
                    SELECT distinct id_bioguide
                        ,min(term_start) over (partition by id_bioguide) as first_term
                        ,first_value(state) over (partition by id_bioguide order by term_start) as first_state
                    FROM legislators_terms 
                ) a
                JOIN legislators b on a.id_bioguide = b.id_bioguide 
                WHERE a.first_term between '1917-01-01' and '1999-12-31' 
                GROUP BY 1,2
            ) aa
            JOIN
            (
                SELECT generate_series as period
                FROM generate_series(0,20,1)
            ) cc on 1 = 1
            ) aaa
            LEFT JOIN
            (
                SELECT d.first_state, g.gender
                    ,coalesce(date_part('year',age(f.date,d.first_term)),0) as period
                    ,count(distinct d.id_bioguide) as cohort_retained
                FROM
                (
                    SELECT distinct id_bioguide
                        ,min(term_start) over (partition by id_bioguide) as first_term
                        ,first_value(state) over (partition by id_bioguide order by term_start) as first_state
                    FROM legislators_terms 
                ) d
                JOIN legislators_terms e on d.id_bioguide = e.id_bioguide 
                LEFT JOIN date_dim f on f.date between e.term_start and e.term_end 
                and f.month_name = 'December' and f.day_of_month = 31
                JOIN legislators g on d.id_bioguide = g.id_bioguide
                WHERE d.first_term between '1917-01-01' and '1999-12-31'
                GROUP BY 1,2,3
            ) ddd on aaa.gender = ddd.gender and aaa.first_state = ddd.first_state 
            and aaa.period = ddd.period
        ) a
        GROUP BY 1,2,3
        """

sql_21 = pd.read_sql(query_21, con)
sql_21

Unnamed: 0,gender,first_state,cohort_size,yr0,yr2,yr4,yr6,yr8,yr10
0,F,AL,3,1.0,0.000000,0.000000,0.000000,0.000000,0.000000
1,F,AR,5,1.0,0.800000,0.200000,0.400000,0.400000,0.400000
2,F,AZ,2,1.0,0.500000,0.000000,0.000000,0.000000,0.000000
3,F,CA,25,1.0,0.920000,0.800000,0.640000,0.680000,0.680000
4,F,CO,2,1.0,1.000000,1.000000,1.000000,1.000000,1.000000
...,...,...,...,...,...,...,...,...,...
97,M,VT,18,1.0,0.666667,0.611111,0.555556,0.555556,0.555556
98,M,WA,57,1.0,0.789474,0.771930,0.666667,0.631579,0.491228
99,M,WI,84,1.0,0.738095,0.666667,0.595238,0.488095,0.392857
100,M,WV,54,1.0,0.666667,0.574074,0.407407,0.333333,0.277778


#### Defining Cohorts from Dates Other Than the First Date

In [30]:
query_22 = """
        SELECT distinct id_bioguide, term_type, date('2000-01-01') as first_term
            ,min(term_start) as min_start
        FROM legislators_terms 
        WHERE term_start <= '2000-12-31' and term_end >= '2000-01-01'
        GROUP BY 1,2,3
        """

sql_22 = pd.read_sql(query_22, con)
sql_22

Unnamed: 0,id_bioguide,term_type,first_term,min_start
0,C000858,sen,2000-01-01,1997-01-07
1,G000333,sen,2000-01-01,1995-01-04
2,M000350,rep,2000-01-01,1999-01-06
3,L000169,rep,2000-01-01,1999-01-06
4,G000386,sen,2000-01-01,1999-01-06
...,...,...,...,...
536,H000789,rep,2000-01-01,1999-01-06
537,O000107,rep,2000-01-01,1999-01-06
538,B000072,rep,2000-01-01,1999-01-06
539,H000615,rep,2000-01-01,1999-01-06


In [31]:
query_23 = """
        SELECT term_type, period
            ,first_value(cohort_retained) over (partition by term_type order by period) as cohort_size
            ,cohort_retained
            ,cohort_retained * 1.0 / 
             first_value(cohort_retained) over (partition by term_type order by period) as pct_retained
        FROM
        (
            SELECT a.term_type
                ,coalesce(date_part('year',age(c.date,a.first_term)),0) as period
                ,count(distinct a.id_bioguide) as cohort_retained
            FROM
            (
                SELECT distinct id_bioguide, term_type
                    ,date('2000-01-01') as first_term
                    ,min(term_start) as min_start
                FROM legislators_terms
                WHERE term_start <= '2000-12-31' and term_end >= '2000-01-01'
                GROUP BY 1,2,3
            ) a
            JOIN legislators_terms b on a.id_bioguide = b.id_bioguide
            and b.term_start >= a.min_start
            LEFT JOIN date_dim c on c.date between b.term_start and b.term_end 
            and c.month_name = 'December' and c.day_of_month = 31
            and c.year >= 2000
            GROUP BY 1,2
        ) aa
        """

sql_23 = pd.read_sql(query_23, con)
sql_23

Unnamed: 0,term_type,period,cohort_size,cohort_retained,pct_retained
0,rep,0.0,440,440,1.0
1,rep,1.0,440,392,0.890909
2,rep,2.0,440,389,0.884091
3,rep,3.0,440,340,0.772727
4,rep,4.0,440,338,0.768182
5,rep,5.0,440,308,0.7
6,rep,6.0,440,306,0.695455
7,rep,7.0,440,267,0.606818
8,rep,8.0,440,262,0.595455
9,rep,9.0,440,223,0.506818


<img align="left" width="446" alt="Screen Shot 2022-04-22 at 3 36 12 PM" src="https://user-images.githubusercontent.com/73784742/164629084-7d3fd545-9b82-4575-ae68-db708eb9bf1c.png">

Despite longer terms for senators, retention among the two cohorts was similar, and was actually worse for senators after 10 years. A further analysis comparing the different years they were first elected, or other cohort attributes, might yield some interesting insights.

#### Survivorship

**Survivorship**, also called *survival analysis*, is concerned with questions about how long something lasts, or the duration of time until a particular event such as churn or death. Survivorship analysis can answer questions about the share of the population that is likely to remain past a certain amount of time. Cohorts can help identify or at least provide hypotheses about which characteristics or circumstances increase or decrease the survival likelihood.

For example, if we want to know the share of game players who survive for a week or longer, we can check for actions that occur after a week from starting and consider those players still surviving.

In [32]:
query_24 = """
        SELECT id_bioguide
            ,min(term_start) as first_term
            ,max(term_start) as last_tertm
        FROM legislators_terms
        GROUP BY 1
        """

sql_24 = pd.read_sql(query_24, con)
sql_24

Unnamed: 0,id_bioguide,first_term,last_tertm
0,A000118,1975-01-14,1977-01-04
1,P000281,1933-03-09,1937-01-05
2,K000039,1933-03-09,1951-01-03
3,A000306,1907-12-02,1939-01-03
4,O000095,1949-01-03,1951-01-03
...,...,...,...
12513,G000331,1949-01-03,1949-01-03
12514,M000103,1867-03-04,1867-03-04
12515,B000255,1821-12-03,1825-12-05
12516,L000152,1891-12-07,1895-12-02


In [33]:
query_25 = """
        SELECT id_bioguide
            ,date_part('century',min(term_start)) as first_century
            ,min(term_start) as first_term
            ,max(term_start) as last_tertm
            ,date_part('year',age(max(term_start),min(term_start))) as tenure
        FROM legislators_terms
        GROUP BY 1
        """

sql_25 = pd.read_sql(query_25, con)
sql_25

Unnamed: 0,id_bioguide,first_century,first_term,last_tertm,tenure
0,A000118,20.0,1975-01-14,1977-01-04,1.0
1,P000281,20.0,1933-03-09,1937-01-05,3.0
2,K000039,20.0,1933-03-09,1951-01-03,17.0
3,A000306,20.0,1907-12-02,1939-01-03,31.0
4,O000095,20.0,1949-01-03,1951-01-03,2.0
...,...,...,...,...,...
12513,G000331,20.0,1949-01-03,1949-01-03,0.0
12514,M000103,19.0,1867-03-04,1867-03-04,0.0
12515,B000255,19.0,1821-12-03,1825-12-05,4.0
12516,L000152,19.0,1891-12-07,1895-12-02,3.0


- Ten Yars

In [34]:
query_26 = """
        SELECT first_century
            ,count(distinct id_bioguide) as cohort_size
            ,count(distinct case when tenure >= 10 then id_bioguide end) as servived_10
            ,count(distinct case when tenure >= 10 then id_bioguide end) * 1.0 
             / count(distinct id_bioguide) as pct_servived_10
        FROM
        (
            SELECT id_bioguide
                ,date_part('century',min(term_start)) as first_century
                ,min(term_start) as first_term
                ,max(term_start) as last_tertm
                ,date_part('year',age(max(term_start),min(term_start))) as tenure
            FROM legislators_terms
            GROUP BY 1
        ) a
        GROUP BY 1
        """

sql_26 = pd.read_sql(query_26, con)
sql_26

Unnamed: 0,first_century,cohort_size,servived_10,pct_servived_10
0,18.0,368,83,0.225543
1,19.0,6299,892,0.14161
2,20.0,5091,1853,0.363976
3,21.0,760,119,0.156579


- Five Terms

In [35]:
query_27 = """
        SELECT first_century
            ,count(distinct id_bioguide) as cohort_size
            ,count(distinct case when total_terms >= 5 then id_bioguide end) as servived_5
            ,count(distinct case when total_terms >= 5 then id_bioguide end) * 1.0 
             / count(distinct id_bioguide) as pct_servived_5
        FROM
        (
            SELECT id_bioguide
                ,date_part('century',min(term_start)) as first_century
                ,count(term_start) as total_terms
            FROM legislators_terms
            GROUP BY 1
        ) a
        GROUP BY 1
        """

sql_27 = pd.read_sql(query_27, con)
sql_27

Unnamed: 0,first_century,cohort_size,servived_5,pct_servived_5
0,18.0,368,63,0.171196
1,19.0,6299,711,0.112875
2,20.0,5091,2153,0.422903
3,21.0,760,205,0.269737


In [36]:
query_28 = """
        SELECT a.first_century, b.terms
            ,count(distinct id_bioguide) as cohort
            ,count(distinct case when a.total_terms >= b.terms then id_bioguide end) as cohort_survived
            ,count(distinct case when a.total_terms >= b.terms then id_bioguide end) * 1.0 
             / count(distinct id_bioguide) as pct_servived
        FROM
        (
            SELECT id_bioguide
                ,date_part('century',min(term_start)) as first_century
                ,count(term_start) as total_terms
            FROM legislators_terms
            GROUP BY 1
        ) a
        JOIN
        (
            SELECT generate_series as terms
            FROM generate_series(1,20,1)
        ) b on 1 = 1
        GROUP BY 1,2
        """

sql_28 = pd.read_sql(query_28, con)
sql_28

Unnamed: 0,first_century,terms,cohort,cohort_survived,pct_servived
0,18.0,1,368,368,1.000000
1,18.0,2,368,249,0.676630
2,18.0,3,368,153,0.415761
3,18.0,4,368,96,0.260870
4,18.0,5,368,63,0.171196
...,...,...,...,...,...
75,21.0,16,760,0,0.000000
76,21.0,17,760,0,0.000000
77,21.0,18,760,0,0.000000
78,21.0,19,760,0,0.000000


<img align="left" width="446" alt="Screen Shot 2022-04-22 at 4 07 19 PM" src="https://user-images.githubusercontent.com/73784742/164645726-b20d73f9-8918-4074-8eab-de7e3307c3dc.png">

Survivorship was highest in the 20th century, a result that agrees with results we saw previously in which retention was also highest in the 20th century.

#### Returnship, or Repeat Purchase Behavior

Survivorship is useful for understanding how long a cohort is likely to stick around. Another useful type of cohort analysis seeks to understand whether a cohort member can be expected to return within a given window of time and the intensity of activity during that window. This is called *returnship* or *repeat purchase behavior*.

To make fair comparisons between cohorts with different starting dates, we need to create an analysis based on a time box, or a fixed window of time from the first date, and consider whether cohort members returned within that window. This way, every cohort has an equal amount of time under consideration, so long as we include only those cohorts for which the full window has elapsed. 

how many legislators have more than one term type, and specifically, **what share of them start as representatives and go on to become senators (some senators later become representatives, but that is much less common)**. Since relatively few make this transition, we’ll cohort legislators by the century in which they first became a representative.

In [37]:
query_29 = """
        SELECT date_part('century',a.first_term) as cohort_century
            ,count(id_bioguide) as reps
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms
            WHERE term_type = 'rep'
            GROUP BY 1
        ) a
        GROUP BY 1
        ORDER BY 1
        """

sql_29 = pd.read_sql(query_29, con)
sql_29

Unnamed: 0,cohort_century,reps
0,18.0,299
1,19.0,5773
2,20.0,4481
3,21.0,683


To find the representatives who later became senators. This is accomplished with the clauses b.term_type = 'sen' and b.term_start > a.first_term

In [38]:
query_30 = """
        SELECT date_part('century',a.first_term) as cohort_century
            ,count(distinct a.id_bioguide) as rep_and_sen
        FROM
        (
            SELECT id_bioguide, min(term_start) as first_term
            FROM legislators_terms
            WHERE term_type = 'rep'
            GROUP BY 1
        ) a
        JOIN legislators_terms b on a.id_bioguide = b.id_bioguide
        and b.term_type = 'sen' and b.term_start > a.first_term
        GROUP BY 1
        """

sql_30 = pd.read_sql(query_30, con)
sql_30

Unnamed: 0,cohort_century,rep_and_sen
0,18.0,57
1,19.0,329
2,20.0,254
3,21.0,25


Calculate the percent of representatives who became senators.

In [39]:
query_31 = """
        SELECT aa.cohort_century
        ,bb.rep_and_sen * 1.0 / aa.reps as pct_rep_and_sen
        FROM
        (
            SELECT date_part('century',a.first_term) as cohort_century
                ,count(id_bioguide) as reps
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                WHERE term_type = 'rep'
                GROUP BY 1
            ) a
            GROUP BY 1
        ) aa
        LEFT JOIN
        (
            SELECT date_part('century',b.first_term) as cohort_century
                ,count(distinct b.id_bioguide) as rep_and_sen
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                WHERE term_type = 'rep'
                GROUP BY 1
            ) b
            JOIN legislators_terms c on b.id_bioguide = c.id_bioguide
            and c.term_type = 'sen' and c.term_start > b.first_term
            GROUP BY 1
        ) bb on aa.cohort_century = bb.cohort_century
        """

sql_31 = pd.read_sql(query_31, con)
sql_31

Unnamed: 0,cohort_century,pct_rep_and_sen
0,18.0,0.190635
1,19.0,0.056989
2,20.0,0.056684
3,21.0,0.036603


Representatives from the 18th century were most likely to become senators. However, we have not yet applied a time box to ensure a fair comparison. While we can safely assume that all legislators who served in the 18th and 19th centuries are no longer living, many of those who were first elected in the 20th and 21st centuries are still in the middle of their careers.

In [40]:
query_32 = """
        SELECT aa.cohort_century
        ,bb.rep_and_sen * 1.0 / aa.reps as pct_rep_and_sen
        FROM
        (
            SELECT date_part('century',a.first_term) as cohort_century
                ,count(id_bioguide) as reps
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                WHERE term_type = 'rep'
                GROUP BY 1
            ) a
            WHERE first_term <= '2009-12-31'
            GROUP BY 1
        ) aa
        LEFT JOIN
        (
            SELECT date_part('century',b.first_term) as cohort_century
            ,count(distinct b.id_bioguide) as rep_and_sen
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                WHERE term_type = 'rep'
                GROUP BY 1
            ) b
            JOIN legislators_terms c on b.id_bioguide = c.id_bioguide
            and c.term_type = 'sen' and c.term_start > b.first_term
            WHERE age(c.term_start, b.first_term) <= interval '10 years'
            GROUP BY 1
        ) bb on aa.cohort_century = bb.cohort_century
        """

sql_32 = pd.read_sql(query_32, con)
sql_32

Unnamed: 0,cohort_century,pct_rep_and_sen
0,18.0,0.09699
1,19.0,0.024424
2,20.0,0.034814
3,21.0,0.076364


With this new adjustment, the 18th century still had the highest share of representatives becoming senators within 10 years, but the 21st century has the second-highest share, and the 20th century had a higher share than the 19th.

Since 10 years is somewhat arbitrary, we might also want to compare several time windows. One option is to run the query several times with different intervals and note the results. Another option is to calculate multiple windows in the same result set by using a set of CASE statements inside of count distinct aggregations to form the intervals, rather than specifying the interval in the WHERE clause

In [41]:
query_33 = """
        SELECT aa.cohort_century::int as cohort_century
            ,round(bb.rep_and_sen_5_yrs * 1.0 / aa.reps,4) as pct_5_yrs
            ,round(bb.rep_and_sen_10_yrs * 1.0 / aa.reps,4) as pct_10_yrs
            ,round(bb.rep_and_sen_15_yrs * 1.0 / aa.reps,4) as pct_15_yrs
        FROM
        (
            SELECT date_part('century',a.first_term) as cohort_century
            ,count(id_bioguide) as reps
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                WHERE term_type = 'rep'
                GROUP BY 1
            ) a
            WHERE first_term <= '2009-12-31'
            GROUP BY 1
        ) aa
        LEFT JOIN
        (
            SELECT date_part('century',b.first_term) as cohort_century
                ,count(distinct case when age(c.term_start, b.first_term) 
                    <= interval '5 years' then b.id_bioguide end) as rep_and_sen_5_yrs
                ,count(distinct case when age(c.term_start, b.first_term) 
                    <= interval '10 years' then b.id_bioguide end) as rep_and_sen_10_yrs
                ,count(distinct case when age(c.term_start, b.first_term) 
                    <= interval '15 years' then b.id_bioguide end) as rep_and_sen_15_yrs
            FROM
            (
                SELECT id_bioguide, min(term_start) as first_term
                FROM legislators_terms
                WHERE term_type = 'rep'
                GROUP BY 1
            ) b
            JOIN legislators_terms c on b.id_bioguide = c.id_bioguide
            and c.term_type = 'sen' and c.term_start > b.first_term
            GROUP BY 1
        ) bb on aa.cohort_century = bb.cohort_century
        """

sql_33 = pd.read_sql(query_33, con)
sql_33

Unnamed: 0,cohort_century,pct_5_yrs,pct_10_yrs,pct_15_yrs
0,18,0.0502,0.097,0.1438
1,19,0.0088,0.0244,0.0409
2,20,0.01,0.0348,0.0478
3,21,0.04,0.0764,0.0873


<img align="left" width="445" alt="Screen Shot 2022-04-22 at 4 37 23 PM" src="https://user-images.githubusercontent.com/73784742/164651279-23e76f20-b7d2-48fb-acd4-08f26980030f.png">

The cohorts based on century are replaced with cohorts based on the first decade, and the trends over 10 and 20 years are shown. Conversion of representatives to senators during the first few decades of the new US legislature was clearly different from patterns in the years since.

#### Cumulative Calculations

Cumulative cohort analysis can be used to establish *cumulative lifetime value*, also called *customer lifetime value* (the acronyms CLTV and LTV are used interchangeably), and to monitor newer cohorts in order to be able to predict what their full LTV will be. This is possible because early behavior is often highly correlated with long-term behavior. Users of a service who return frequently in their first days or weeks of using it tend to be the most likely to stay around over the long term. Customers who buy a second or third time early on are likely to continue purchasing over a longer time period. Subscribers who renew after the first month or year are often likely to stick around over many subsequent months or years.

In [42]:
query_34 = """
        SELECT date_part('century',a.first_term)::int as century
            ,first_type
            ,count(distinct a.id_bioguide) as cohort
            ,count(b.term_start) as terms
        FROM
        (
            SELECT distinct id_bioguide
                ,first_value(term_type) over (partition by id_bioguide order by term_start) as first_type
                ,min(term_start) over (partition by id_bioguide) as first_term
                ,min(term_start) over (partition by id_bioguide) + interval '10 years' as first_plus_10
            FROM legislators_terms
        ) a
        LEFT JOIN legislators_terms b on a.id_bioguide = b.id_bioguide and b.term_start between a.first_term and a.first_plus_10
        GROUP BY 1,2
        """

sql_34 = pd.read_sql(query_34, con)
sql_34

Unnamed: 0,century,first_type,cohort,terms
0,18,rep,297,760
1,18,sen,71,101
2,19,rep,5744,12165
3,19,sen,555,795
4,20,rep,4473,16203
5,20,sen,618,1008
6,21,rep,683,2203
7,21,sen,77,118


The largest cohort is that of representatives first elected in the 19th century, but the cohort with the largest number of terms started within 10 years is that of representatives first elected in the 20th century. This type of calculation can be useful for understanding the overall contribution of a cohort to an organization.

In [43]:
query_35 = """
        SELECT century
            ,max(case when first_type = 'rep' then cohort end) as rep_cohort
            ,max(case when first_type = 'rep' then terms_per_leg end) as avg_rep_terms
            ,max(case when first_type = 'sen' then cohort end) as sen_cohort
            ,max(case when first_type = 'sen' then terms_per_leg end) as avg_sen_terms
        FROM
        (
            SELECT date_part('century',a.first_term)::int as century
                ,first_type
                ,count(distinct a.id_bioguide) as cohort
                ,count(b.term_start) as terms
                ,count(b.term_start) * 1.0 / count(distinct a.id_bioguide) as terms_per_leg
            FROM
            (
                SELECT distinct id_bioguide
                ,first_value(term_type) over (partition by id_bioguide order by term_start) as first_type
                ,min(term_start) over (partition by id_bioguide) as first_term
                ,min(term_start) over (partition by id_bioguide) + interval '10 years' as first_plus_10
                FROM legislators_terms
            ) a
            LEFT JOIN legislators_terms b on a.id_bioguide = b.id_bioguide and b.term_start between a.first_term and a.first_plus_10
            GROUP BY 1,2
        ) aa
        GROUP BY 1
        """

sql_35 = pd.read_sql(query_35, con)
sql_35

Unnamed: 0,century,rep_cohort,avg_rep_terms,sen_cohort,avg_sen_terms
0,18,297,2.558923,71,1.422535
1,19,5744,2.117862,555,1.432432
2,20,4473,3.622401,618,1.631068
3,21,683,3.225476,77,1.532468


With the cumulative terms normalized by the cohort size, we can now confirm that representatives first elected in the 20th century had the highest average number of terms, while those who started in the 19th century had the fewest number of terms on average. Senators have fewer but longer terms than their representative peers, and again those who started in the 20th century have had the highest number of terms on average.

Cumulative calculations are often used in customer lifetime value calculations. LTV is usually calculated using monetary measures, such as total dollars spent by a customer, or the gross margin (revenue minus costs) generated by a customer across their lifetime. To facilitate comparisons between cohorts, the “lifetime” is often chosen to reflect average customer lifetime, or periods that are convenient to analyze, such as 3, 5, or 10 years. 