## Chapter 1: Cohort Analysis

Understand customers based on their unique behavioral attributes.

Grouping customers is a powerful analytics technique that enables a business to customize its product offering and marketing strategy.
For example, we can group customers by the month of their first purchase, segment them by their recency, frequency and monetary values, or run k-means clustering to identify similar groups of customers based on their purchasing behavior. You will dig deeper into customer purchasing habits and uncover actionable insights.

Cohort analysis is a descriptive analytics tool.
It groups customers into mutually exclusive cohorts, which are then measured over time. Cohort analysis provides deeper insights than so-called vanity metrics: it helps explain high-level trends by providing insights on metrics across both the product and the customer lifecycle.

**There are three major types of cohorts**:

- *Time cohorts* are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows how customer behavior depends on when they started using the company's products or services. The time frame may be monthly, quarterly, or even daily.

- *Behavior cohorts* group customers by the type of product or service they purchased or signed up for in the past. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.

- *Size cohorts* refer to the various sizes of customers who purchase a company's products or services. This categorization can be based on the amount spent in some period after acquisition, or on the product type that accounted for most of the customer's order amount in some period.

**The main elements of the cohort analysis**:

- The cohort analysis data is typically formatted as a pivot table.
- The row values represent the cohort. In this case it's the month of the first purchase, and customers are pooled into these groups based on their first-ever purchase.
- The column values represent months since acquisition. This can be measured in other time periods such as days, or even hours or minutes, depending on the scope of the analysis.
- Finally, the metrics are in the table cells. Here, we have the count of active customers. The first column, with cohort index 1, represents the total number of customers in that cohort, since this is the month of their first transaction. We will use this data in the next lessons to calculate the retention rate and other metrics.
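
The elements listed above can be illustrated with a minimal sketch. The `toy` DataFrame and its values are hypothetical and only stand in for real transaction data; the column names `InvoiceMonth`, `CohortMonth` and `CohortIndex` are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical toy transactions: one row per purchase.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3, 3, 3],
    'InvoiceMonth': pd.to_datetime([
        '2011-01-01', '2011-02-01',                # customer 1
        '2011-01-01', '2011-03-01',                # customer 2
        '2011-02-01', '2011-02-01', '2011-03-01',  # customer 3
    ]),
})

# Rows of the pivot: the cohort is the month of each customer's first purchase.
toy['CohortMonth'] = toy.groupby('CustomerID')['InvoiceMonth'].transform('min')

# Columns of the pivot: months since acquisition, starting at 1.
toy['CohortIndex'] = (
    (toy['InvoiceMonth'].dt.year - toy['CohortMonth'].dt.year) * 12
    + (toy['InvoiceMonth'].dt.month - toy['CohortMonth'].dt.month)
    + 1
)

# Cells of the pivot: number of distinct active customers.
cohort_counts = toy.pivot_table(
    index='CohortMonth',
    columns='CohortIndex',
    values='CustomerID',
    aggfunc='nunique',
)
print(cohort_counts)
```

Cells with no activity (for example, a recent cohort that has not yet reached a later month) come out as `NaN`.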



#### Summary

* What is Cohort Analysis?

    - Mutually exclusive segments - cohorts
    - Compare metrics across **product** lifecycle
    - Compare metrics across **customer** lifecycle

* Types of cohorts

    - Time cohorts
    - Behavior cohorts
    - Size cohorts
    
* Elements of the cohort analysis

    - Pivot table
    - Assigned cohort in rows
    - Cohort index in columns
    - Metrics in the table

In [1]:
# import the required libraries
import numpy as np
import pandas as pd
import datetime as dt


In [2]:
online = pd.read_csv('Online Retail.csv', sep=';')

In [3]:
online.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 1/12/2010 8:26 | 2,55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 1/12/2010 8:26 | 3,39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 1/12/2010 8:26 | 2,75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 1/12/2010 8:26 | 3,39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 1/12/2010 8:26 | 3,39 | 17850.0 | United Kingdom |


In [4]:
online.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null object
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 33.1+ MB


In [5]:
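# convert object to datetime
# NOTE: raw strings such as '1/12/2010 8:26' are ambiguous. If the file is day-first,
# pass dayfirst=True. The outputs shown below come from the default parsing used here.
online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'])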

In [6]:
# convert object to float, step 1: replace the decimal comma with a dot
online['UnitPrice'] = online['UnitPrice'].str.replace(',', '.', regex=False)

In [7]:
# step 2: parse the strings as numbers; unparseable values become NaN
online['UnitPrice'] = pd.to_numeric(online['UnitPrice'], errors='coerce')

In [8]:
online.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object
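
As an alternative to converting `UnitPrice` after loading, `pd.read_csv` accepts a `decimal` argument that parses decimal commas directly. The sketch below uses a hypothetical two-row in-memory sample that mimics the layout of the file (`;` separator, `,` decimal mark); it is not the notebook's original approach.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the file's layout; not the real data file.
sample = io.StringIO(
    "InvoiceNo;Description;UnitPrice\n"
    "536365;WHITE METAL LANTERN;3,39\n"
    "536365;CREAM CUPID HEARTS COAT HANGER;2,75\n"
)

# decimal=',' tells the parser that ',' is the decimal mark,
# so UnitPrice is read as float instead of object.
parsed = pd.read_csv(sample, sep=';', decimal=',')
print(parsed.dtypes)
```

With this option, the two conversion cells for `UnitPrice` would not be needed, assuming every value in the column is a well-formed number.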

In [9]:
online.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-01-12 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-01-12 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-01-12 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-01-12 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-01-12 08:26:00 | 3.39 | 17850.0 | United Kingdom |


### Time cohorts

We will segment customers into acquisition cohorts based on the month of their first purchase. We will then assign a cohort index to each purchase of the customer.

The cohort index represents the number of months since the first transaction. Time-based cohorts group customers by the time they completed their first activity.
We will mark each transaction based on its relative time period since the first purchase.
In the next step, we will calculate metrics like retention or average spend value, and build a heatmap.

In [10]:
# Define a function that truncates a timestamp to its date (midnight of that day)
def get_day(x):
    return dt.datetime(x.year, x.month, x.day)

# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) 

# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay'] 

# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')

# View the top 5 rows
print(online.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  UnitPrice  CustomerID         Country InvoiceDay  \
0 2010-01-12 08:26:00       2.55     17850.0  United Kingdom 2010-01-12   
1 2010-01-12 08:26:00       3.39     17850.0  United Kingdom 2010-01-12   
2 2010-01-12 08:26:00       2.75     17850.0  United Kingdom 2010-01-12   
3 2010-01-12 08:26:00       3.39     17850.0  United Kingdom 2010-01-12   
4 2010-01-12 08:26:00       3.39     17850.0  United Kingdom 2010-01-12   

   CohortDay  
0 2010-01-12  
1 2010-01-12  
2 2010-01-12  
3 2010-01-12  
4 2010-01-12  
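
The cell above truncates each invoice to its day. The monthly acquisition cohorts described at the start of this section could be built the same way by truncating to the first day of the month instead. This is a minimal sketch: `get_month`, `InvoiceMonth`, `CohortMonth` and the `toy` data are assumptions introduced for illustration.

```python
import datetime as dt

import pandas as pd


def get_month(x):
    """Truncate a timestamp to the first day of its month."""
    return dt.datetime(x.year, x.month, 1)


# Hypothetical toy data standing in for the `online` DataFrame.
toy = pd.DataFrame({
    'CustomerID': [17850, 17850, 13047, 13047],
    'InvoiceDate': pd.to_datetime([
        '2010-12-01 08:26', '2011-02-15 10:00',
        '2011-01-20 09:00', '2011-01-25 14:30',
    ]),
})

# Month of each transaction.
toy['InvoiceMonth'] = toy['InvoiceDate'].apply(get_month)

# Month of each customer's first transaction (the acquisition cohort).
toy['CohortMonth'] = toy.groupby('CustomerID')['InvoiceMonth'].transform('min')
print(toy)
```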


### Calculate time offset in days - part 1

Calculating time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.

First, we will create six variables that capture the integer values of the year, month and day for the invoice and cohort dates, using the `get_date_int()` function.

In [11]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

In [12]:
# Get the integers for date parts from the `InvoiceDay` column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')

# Get the integers for date parts from the `CohortDay` column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')

**Calculate time offset in days - part 2**

Now we have six Series with the year, month and day values for the invoice and cohort dates: `invoice_year`, `cohort_year`, `invoice_month`, `cohort_month`, `invoice_day`, and `cohort_day`.

Calculate the difference between the invoice and cohort dates in years, months and days separately, and then calculate the total difference in days between the two. This will be the days offset, which we will use to visualize the customer count.

In [13]:
# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Calculate difference in days
days_diff = invoice_day - cohort_day

# Combine the differences into an approximate day offset
# (assumes 365-day years and 30-day months); +1 so the first day has index 1
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1
print(online.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  UnitPrice  CustomerID         Country InvoiceDay  \
0 2010-01-12 08:26:00       2.55     17850.0  United Kingdom 2010-01-12   
1 2010-01-12 08:26:00       3.39     17850.0  United Kingdom 2010-01-12   
2 2010-01-12 08:26:00       2.75     17850.0  United Kingdom 2010-01-12   
3 2010-01-12 08:26:00       3.39     17850.0  United Kingdom 2010-01-12   
4 2010-01-12 08:26:00       3.39     17850.0  United Kingdom 2010-01-12   

   CohortDay  CohortIndex  
0 2010-01-12          1.0  
1 2010-01-12          1.0  
2 2010-01-12          1.0  
3 20
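
The formula above approximates the offset with 365-day years and 30-day months. As a point of comparison, the exact offset can be obtained by subtracting the two datetime columns directly. The sketch below uses hypothetical dates, and `ExactIndex` and `ApproxIndex` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical invoice and cohort days chosen to expose the approximation error.
toy = pd.DataFrame({
    'InvoiceDay': pd.to_datetime(['2011-01-31', '2011-03-01', '2011-01-31']),
    'CohortDay': pd.to_datetime(['2011-01-31', '2011-01-31', '2010-12-31']),
})

# Exact offset: calendar days between the two dates, +1 so the first day is 1.
toy['ExactIndex'] = (toy['InvoiceDay'] - toy['CohortDay']).dt.days + 1

# Approximate offset, following the same formula as the cell above.
years_diff = toy['InvoiceDay'].dt.year - toy['CohortDay'].dt.year
months_diff = toy['InvoiceDay'].dt.month - toy['CohortDay'].dt.month
days_diff = toy['InvoiceDay'].dt.day - toy['CohortDay'].dt.day
toy['ApproxIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1

print(toy)
```

The two indexes agree on the acquisition day and drift apart by a few days across month and year boundaries, which may or may not matter depending on how coarse the analysis is.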

<h2 class="p-3 mb-2 bg-primary text-white">Monthly data</h2>

#### Calculate cohort metrics

- How many customers were originally in each cohort in the `cohort_counts` table?
- How many of them were active in the following months?


We will start by using the cohort counts table from our previous lesson to calculate customer retention.
Retention measures how many customers from each cohort have returned in the subsequent months.

1. Select the first column, which is the total number of customers in the cohort.
2. Calculate the ratio of how many of these customers came back in the subsequent months. This ratio is the retention rate.

Note: the first month's retention will, by definition, be 100% for all cohorts, because the number of active customers in the first month is the size of the cohort.
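
The two steps above can be sketched as follows. The `cohort_counts` values here are hypothetical and only illustrate the shape of the table; `cohort_sizes` and `retention` are names assumed for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical cohort counts: rows are cohorts, columns are cohort indexes.
cohort_counts = pd.DataFrame(
    [[100, 40, 30], [80, 20, np.nan], [50, np.nan, np.nan]],
    index=pd.Index(['2010-12', '2011-01', '2011-02'], name='CohortMonth'),
    columns=pd.Index([1, 2, 3], name='CohortIndex'),
)

# Step 1: the first column holds the size of each cohort.
cohort_sizes = cohort_counts.iloc[:, 0]

# Step 2: divide every column by the cohort size, row by row.
retention = cohort_counts.divide(cohort_sizes, axis=0)

# Show retention as percentages.
print((retention * 100).round(1))
```

Passing `axis=0` makes the division align on the row index, so each row is divided by its own cohort size.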

**Customer retention**

Customer retention is a very useful metric for understanding how many of all the customers are still active. Which of the following best describes customer retention?

- [X] Percentage of active customers out of total customers
        - **Correct!** Retention gives you the percentage of active customers compared to the total number of customers.
- [ ] Percentage of active customers compared to a previous month
        - **Incorrect submission:** This metric sounds more like a monthly change in active customers.
- [ ] Number of average active customers each month
        - **Incorrect submission:** Retention is a percentage metric while this is an absolute number.
- [ ] Active customers on the first month. 
        - **Incorrect submission:** Retention is a percentage metric while this is an absolute number.


**Calculate retention rate from scratch**

You have seen how to create retention and average quantity metric tables for the monthly acquisition cohorts. Now it's your turn to calculate the average price metric and see if there are any differences in shopping patterns across time and across cohorts.

**Calculate average price**

You will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts.
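
One possible way to compute the average price metric is sketched below, assuming the transactions already carry `CohortMonth` and `CohortIndex` columns. The `toy` values are hypothetical.

```python
import pandas as pd

# Hypothetical transactions with cohort labels already assigned.
toy = pd.DataFrame({
    'CohortMonth': ['2010-12', '2010-12', '2010-12', '2011-01', '2011-01'],
    'CohortIndex': [1, 1, 2, 1, 1],
    'UnitPrice': [2.0, 4.0, 5.0, 1.0, 3.0],
})

# Average unit price per cohort and cohort index, reshaped into a pivot table.
average_price = (
    toy.groupby(['CohortMonth', 'CohortIndex'])['UnitPrice']
    .mean()
    .unstack('CohortIndex')
    .round(1)
)
print(average_price)
```

Reading along a row shows how a cohort's average price changes over its lifetime; reading down a column compares cohorts at the same point in their lifecycle.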

### Visualize average quantity metric

**Heatmap**
- Easiest way to visualize cohort analysis
- Includes both data and visuals
- Only a few lines of code with seaborn
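
A heatmap of this kind could be drawn roughly as follows. This is a sketch under assumptions: the `average_quantity` table is hypothetical, and the colormap and figure size are arbitrary choices rather than settings taken from the course.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical average quantity per cohort (rows) and cohort index (columns).
average_quantity = pd.DataFrame(
    [[11.0, 14.6, 15.0], [10.0, 12.6, np.nan], [10.8, np.nan, np.nan]],
    index=pd.Index(['2010-12', '2011-01', '2011-02'], name='CohortMonth'),
    columns=pd.Index([1, 2, 3], name='CohortIndex'),
)

fig, ax = plt.subplots(figsize=(6, 4))

# annot=True prints the values in the cells, so the chart shows data and color.
sns.heatmap(average_quantity, annot=True, fmt='.1f', cmap='Blues', ax=ax)
ax.set_title('Average quantity by monthly cohort')
```

Cells that are `NaN` are left blank, which gives the cohort heatmap its typical triangular shape.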

<div class="panel panel-primary">
      <div class="panel-heading"><h1>Customer retention</h1></div>
</div>

To view the retention results, see: [Monthly data](Ej1.ipynb)