# Cohort analysis

Cohort analysis is a descriptive analytics tool. It groups customers into mutually exclusive cohorts that are measured over time. It provides deeper insights than so-called vanity metrics, and it helps to understand high-level trends by showing metrics across both the product and the customer lifecycle.

# Types of cohorts
## Time cohorts
Customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows how customer behaviour depends on when they started using the company's products or services. The time frame can be monthly, quarterly or even daily.

## Behaviour cohorts
Customers grouped by the product they purchased or the service they subscribed to in the past. Customers who signed up for a basic-level service may behave differently from those who went premium. Understanding the needs of the various cohorts can help a company design custom-made products or services for particular segments.

## Size cohorts
Customers grouped by their size as buyers of the company's products or services. This categorization can be based on the amount spent in some period of time after acquisition, or on the product type that accounted for most of the customer's order amount in that period.

# Elements of cohort analysis
## Pivot table
- Assigned cohort in rows
- Cohort index in columns
- Metrics in the table
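
As a minimal sketch of that layout, the toy example below pivots a small long-format table into the cohort shape. The data and the names `long_format`, `toy_pivot` and `Customers` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (cohort, cohort index) pair.
long_format = pd.DataFrame({
    "CohortMonth": ["2011-01", "2011-01", "2011-02"],
    "CohortIndex": [1, 2, 1],
    "Customers": [100, 40, 80],
})

# Rows: assigned cohort. Columns: cohort index. Cells: the metric.
toy_pivot = long_format.pivot(
    index="CohortMonth", columns="CohortIndex", values="Customers"
)
print(toy_pivot)
```

Cells for periods that a cohort has not reached yet come out as missing values, which is why real cohort tables have a triangular shape.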

In [None]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


In [None]:
cohort_counts = pd.read_csv("./data/chapter_1/cohort_counts.csv")

In [None]:
cohort_counts


There are 332 customers who made their first transaction in January 2011.

# Time Cohorts

Time-based cohorts group customers by the time they completed their first activity. Here we segment customers into acquisition cohorts based on the month of their first purchase. We then assign a cohort index to each of the customer's transactions: the number of months elapsed since that first purchase, starting at 1 for the acquisition month. In the next step we calculate metrics such as retention or average spend and visualize them as a heatmap.
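
Before applying this to the real data, here is a small sketch of the idea on a hypothetical five-row table. The column names mirror the ones used in this notebook, but the values and the `toy_tx` name are invented for the example.

```python
import pandas as pd

# Hypothetical toy transactions for two customers.
toy_tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-15", "2011-02-03", "2012-01-20", "2011-03-10", "2011-03-25"]
    ),
})

# Truncate each date to the first day of its month.
toy_tx["InvoiceMonth"] = toy_tx["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# The cohort month is the month of each customer's first purchase.
toy_tx["CohortMonth"] = toy_tx.groupby("CustomerID")["InvoiceMonth"].transform("min")

# Months elapsed since acquisition, starting at 1 for the acquisition month.
years_diff = toy_tx["InvoiceMonth"].dt.year - toy_tx["CohortMonth"].dt.year
months_diff = toy_tx["InvoiceMonth"].dt.month - toy_tx["CohortMonth"].dt.month
toy_tx["CohortIndex"] = years_diff * 12 + months_diff + 1

print(toy_tx[["CustomerID", "InvoiceMonth", "CohortMonth", "CohortIndex"]])
```

Customer 1 buys again one year after acquisition, so that transaction gets cohort index 13; both purchases of customer 2 fall in the acquisition month and get index 1.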





In [None]:
online = pd.read_csv("./data/chapter_1/online.csv", parse_dates=["InvoiceDate"])
online.head()

In [None]:
def get_month(x):
    return dt.datetime(x.year, x.month, 1)

online["InvoiceMonth"] = online["InvoiceDate"].apply(get_month)
online["CohortMonth"] = online.groupby("CustomerID")["InvoiceMonth"].transform("min")

In [None]:
online

In [None]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

invoice_year, invoice_month, _ = get_date_int(online, "InvoiceMonth")
cohort_year, cohort_month, _ = get_date_int(online, "CohortMonth")

years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month
online["CohortIndex"] = years_diff * 12 + months_diff + 1
online.head()

In [None]:
cohort_data = online.groupby(["CohortMonth", "CohortIndex"])["CustomerID"].nunique().reset_index()

cohort_counts = cohort_data.pivot(index="CohortMonth", columns="CohortIndex", values="CustomerID")


In [None]:
cohort_counts


The first column shows the initial number of customers in each cohort (100% of the cohort). The following columns show how many of those customers were still active in each subsequent month.
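
To make the relation between counts and percentages concrete, the sketch below turns a hypothetical two-cohort count table into retention rates by dividing each row by its first column. The numbers and the `toy_*` names are made up for the illustration.

```python
import pandas as pd

# Hypothetical cohort counts: rows are cohorts, columns are cohort indices.
toy_counts = pd.DataFrame(
    {1: [100, 80], 2: [40, 20], 3: [25, None]},
    index=pd.Index(["2011-01", "2011-02"], name="CohortMonth"),
)

# The first column holds the initial size of each cohort.
toy_sizes = toy_counts.iloc[:, 0]

# Divide every row by its own cohort size to get the share of active customers.
toy_retention = toy_counts.divide(toy_sizes, axis=0) * 100
print(toy_retention.round(1))
```

Passing `axis=0` aligns the cohort sizes with the rows, so each cohort is divided by its own starting size rather than by a column total.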

# Metrics

## Retention Rate

In [None]:
cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0).round(3)*100
retention

## Other Metrics


In [None]:
cohort_data = online.groupby(["CohortMonth", "CohortIndex"])["Quantity"].mean()
cohort_data = cohort_data.reset_index()
average_quantity = cohort_data.pivot(index="CohortMonth", columns="CohortIndex", values="Quantity")

average_quantity = average_quantity.round(2)
average_quantity

# Visualizing Cohort Analysis




In [None]:
plt.figure(figsize=(12, 8))
plt.title("Cohort Analysis: Retention Rates")
sns.heatmap(
    retention,
    annot=True,
    fmt=".0f",
    cmap="Blues",
    linewidths=0.5,
    linecolor="white",
    cbar_kws={"label": "Retention Rate (%)"},
)
plt.xlabel("Cohort Index")
plt.show()

# Recency, Frequency, Monetary (RFM) segmentation
We assign customers to segments depending on their recency, frequency and monetary values.

## Recency
How recent each customer's last purchase is, measured as the time elapsed since that purchase. The lower, the better: every company wants its customers to be recent and active.

## Frequency
How many purchases the customer has made in the last 12 months. The period can change depending on the product lifecycle, etc.

## Monetary Value
How much the customer has spent in the last 12 months. The period can change depending on the product lifecycle, etc.

Once we have calculated these values, we can group them into categories such as high, medium or low, using percentiles, a Pareto 80/20 split, or custom splits based on business knowledge.
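
As a small sketch of the percentile option, `pd.qcut` can split a variable into groups of roughly equal size and label them. The spend values below are hypothetical, and three groups are used only to match the high/medium/low wording.

```python
import pandas as pd

# Hypothetical spend values for nine customers.
spend = pd.Series([5, 12, 20, 35, 50, 80, 120, 200, 400], name="MonetaryValue")

# Split into three groups at the 33rd and 67th percentiles and name them.
spend_group = pd.qcut(spend, q=3, labels=["low", "medium", "high"])
print(pd.concat([spend, spend_group.rename("Group")], axis=1))
```

A custom split based on business knowledge would use `pd.cut` with explicit bin edges instead of `pd.qcut`.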




In [None]:
print('Min: {}, Max: {}'.format(online["InvoiceDate"].min(), online["InvoiceDate"].max()))



Let's set a hypothetical snapshot_day, as if we were running the analysis the day after the last recorded transaction.

In [None]:
snapshot_day = max(online.InvoiceDate) + dt.timedelta(days=1)
snapshot_day

In [None]:
online["TotalSum"] = online["Quantity"] * online["UnitPrice"]

datamart = online.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_day - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'
})

datamart.rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalSum': 'MonetaryValue'
}, inplace=True)
datamart = datamart.reset_index()
datamart.head()

Once the Recency, Frequency and Monetary Value have been calculated for each customer, we can group the customers into four groups (quartiles) for each metric, depending on their RFM values.

In [None]:
# Let's create the labels for the recency, frequency and monetary value
# Recency labels
# These labels are sorted in descending order since we want to assign the highest value to the most recent customers
r_labels = range(4, 0, -1)
r_quartiles = pd.qcut(datamart['Recency'], 4, labels=r_labels)
datamart['R'] = r_quartiles

# Frequency labels
# These labels are sorted in ascending order since we want to assign the highest value to the most frequent customers
f_labels = range(1, 5)
f_quartiles = pd.qcut(datamart['Frequency'], 4, labels=f_labels)
datamart['F'] = f_quartiles

# Monetary labels
# These labels are sorted in ascending order since we want to assign the highest value to the most valuable customers
m_labels = range(1, 5)
m_quartiles = pd.qcut(datamart['MonetaryValue'], 4, labels=m_labels)
datamart['M'] = m_quartiles

datamart.head()

It's now time to create the RFM segment (concatenation of the RFM quartile values) and the RFM score (sum of those values).
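
The difference between the two is easiest to see on a tiny hypothetical table of quartile values. The `toy_rfm` name and its numbers are invented for this sketch.

```python
import pandas as pd

# Hypothetical quartile values for three customers (4 is best, 1 is worst).
toy_rfm = pd.DataFrame({"R": [4, 1, 3], "F": [4, 1, 2], "M": [3, 1, 4]})

# Segment: the three quartile values concatenated as text.
toy_rfm["RFM_Segment"] = (
    toy_rfm["R"].astype(str) + toy_rfm["F"].astype(str) + toy_rfm["M"].astype(str)
)

# Score: the three quartile values added together (ranges from 3 to 12).
toy_rfm["RFM_Score"] = toy_rfm[["R", "F", "M"]].sum(axis=1)
print(toy_rfm)
```

The segment keeps the three dimensions apart, so `324` and `243` are different segments, while the score collapses them into a single number and treats both as 9.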

In [None]:
datamart['RFM_Segment'] = datamart.R.astype(str) + datamart.F.astype(str) + datamart.M.astype(str)
# R, F and M are categorical after pd.qcut, so cast them to int before summing
datamart['RFM_Score'] = datamart[['R', 'F', 'M']].astype(int).sum(axis=1)
datamart.sort_values('RFM_Score', ascending=True).head(10)

Let's check the size of the different segments; it is always good practice to do so.
The RFM_Segment lets us directly select 'similar' customers.

In [None]:
datamart.groupby('RFM_Segment').size().sort_values(ascending=False).head(10)

In [None]:
datamart[datamart['RFM_Segment'] == '111'].head(10)

## Summary metrics per RFM score



In [None]:
datamart.groupby('RFM_Score').agg({
    "Recency": 'mean',
    "Frequency": 'mean',
    "MonetaryValue": 'mean',
    "RFM_Score": 'count'
}).round(1)

In [None]:
datamart["RFM_Segment"].nunique()

This segmentation is useful but can still be confusing. To improve usability, we can regroup these segments into named ones, such as Gold, Silver and Bronze.


In [None]:
def segment_me(df):
    if df["RFM_Score"] >= 9:
        return "Gold"
    elif df["RFM_Score"] >= 5 and df["RFM_Score"] < 9:
        return "Silver"
    else:
        return "Bronze"

datamart["General_Segment"] = datamart.apply(segment_me, axis=1)
datamart.groupby("General_Segment").agg({
    "Recency": "mean",
    "Frequency": "mean",
    "MonetaryValue": "mean",
    "RFM_Score": "count",
}).round(1)



In real life, this process could require several iterations to find the best segmentation for your business.

# Data Preprocessing

## K-means clustering
Why K-means
- One of the most popular unsupervised learning methods
- Pretty fast
- Works well as long as the assumptions about the data are correct:
    - Symmetric distribution of variables (not skewed)
    > When facing skewed variables, a logarithmic transformation can help make the distribution more symmetrical. It works on positive values only.
    - Variables have the same average values
    - Variables have the same variance
    > RFM data does not have the same average values or the same variance.
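
To illustrate the effect of the logarithmic transformation on skewness, the sketch below uses a hypothetical right-skewed sample drawn from a log-normal distribution, not the RFM data. It compares the sample skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample drawn from a log-normal distribution.
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# The log of a log-normal variable is normal, so skewness should drop towards 0.
unskewed = np.log(skewed)

print("skew before:", round(skewed.skew(), 2))
print("skew after: ", round(unskewed.skew(), 2))
```

Real variables are rarely exactly log-normal, so the skewness after the transformation will usually be reduced rather than removed.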

In [None]:
datamart[['Recency', 'Frequency', 'MonetaryValue']].describe()

The best way to identify skewed variables is to plot their distributions.

In [None]:
sns.histplot(datamart['Recency'], kde=True)
plt.show()

In [None]:
sns.histplot(datamart["Frequency"], kde=True)
plt.show()


In [None]:
frequency_log = np.log(datamart['Frequency'])
sns.histplot(frequency_log, kde=True)
plt.show()

## Centering and Scaling variables

### Assessing the issue
A simple .describe() of the variables can help us identify the presence of the issue.

In [None]:
# Subtracting the mean from every value will center the data around 0
datamart_rfm = datamart[['Recency', 'Frequency', 'MonetaryValue']]
datamart_centered = datamart_rfm - datamart_rfm.mean()
datamart_centered.describe().round(2)


In [None]:
# Dividing by the standard deviation will scale the data to a standard deviation of 1
datamart_scaled = datamart_rfm/datamart_rfm.std()
datamart_scaled.describe().round(2)


These operations can be done manually or using the sklearn StandardScaler class.
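
One detail worth knowing when comparing both approaches: pandas computes the sample standard deviation by default (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on a hypothetical two-column table named `toy_xy`, shows that the manual result matches scikit-learn once `ddof=0` is used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column dataset with very different scales.
toy_xy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 100.0]})

# Manual standardization. ddof=0 selects the population standard deviation,
# which is what StandardScaler uses; pandas defaults to ddof=1.
manual = (toy_xy - toy_xy.mean()) / toy_xy.std(ddof=0)

# The same transformation with scikit-learn.
with_sklearn = StandardScaler().fit_transform(toy_xy)

print(np.allclose(manual.to_numpy(), with_sklearn))
```

With the pandas default the two results differ by a constant factor per column, which does not change the relative distances used by clustering on a single dataset.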



In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
datamart_sklearn_scaled = scaler.fit_transform(datamart_rfm)
print('mean: ', datamart_sklearn_scaled.mean(axis=0).round(2))
print('std: ', datamart_sklearn_scaled.std(axis=0).round(2))


The order in which the operations are performed is important, since some operations cannot be applied to negative values (the log transformation) and others generate negative values (centering):

1. Unskew the data (log transformation)
2. Center the variables to the same mean
3. Scale them to the same standard deviation
4. Store the result as a separate array to be used for clustering
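
The four steps can be sketched end to end as follows. The `toy_rfm_raw` values are hypothetical and strictly positive so that the log is defined; real monetary values that are zero or negative would need to be handled before this step.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM values; all strictly positive so the log is defined.
toy_rfm_raw = pd.DataFrame({
    "Recency": [2, 10, 45, 200, 365],
    "Frequency": [50, 12, 5, 2, 1],
    "MonetaryValue": [2500.0, 640.0, 120.0, 35.0, 9.5],
})

# 1. Unskew with a log transformation.
toy_log = np.log(toy_rfm_raw)

# 2 and 3. Center to mean 0 and scale to standard deviation 1.
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_log)

# 4. Store the result separately, keeping the original index and column names.
toy_normalized = pd.DataFrame(
    toy_scaled, index=toy_rfm_raw.index, columns=toy_rfm_raw.columns
)
print(toy_normalized.describe().round(2))
```

Keeping the untransformed table around is useful later, because cluster summaries are easier to interpret in the original units.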


To find the number of clusters you can use:
- Elbow criterion
- Silhouette coefficient
- Experimentation and interpretation

It's important that the clusters make sense at the business level and are actionable.
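
As a rough sketch of the silhouette option, the example below scores several values of k with scikit-learn's `silhouette_score` and keeps the highest one. It runs on hypothetical synthetic blobs rather than the RFM data, and `n_init=10` is set explicitly as an assumption to keep the behaviour stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Compute the silhouette coefficient for several candidate values of k.
# It is only defined for k >= 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# A higher coefficient means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real customer data the maximum is usually less clear-cut, so the score is best read together with the elbow plot and the business interpretation of each cluster.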

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=1)
kmeans.fit(datamart_sklearn_scaled)
cluster_labels = kmeans.labels_

In [None]:
datamart_rfm_k2 = datamart_rfm.assign(Cluster = cluster_labels)

In [None]:
datamart_rfm_k2.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
})

### Elbow

In [None]:
sse={}

for k in range(1, 15):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)

    # Fit KMeans on the normalized dataset
    kmeans.fit(datamart_sklearn_scaled)

    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_


In [None]:
# Add the plot title "The Elbow Method"
plt.title("The Elbow Method")

# Add X-axis label "k"
plt.xlabel("k")

# Add Y-axis label "SSE"
plt.ylabel("SSE")

# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()


To be completed...
