<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/product-analytics/cohort_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cohort Analysis in Python

Cohort analysis is a relatively simple technique for understanding how customer behavior changes over time. It can be applied to different metrics: conversion, retention, revenue, and so on.

A **cohort** is a group of customers who share a characteristic, such as sign-up date, geographical location, or acquisition channel. **Cohort analysis** tracks these cohorts over time to identify common patterns or behaviors.
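As a minimal sketch of the definition above, the snippet below assigns each customer to a cohort based on the month of their first order. The `orders` frame is hypothetical toy data; only the column names `CustomerID` and `InvoiceDate` are borrowed from the data set used later in this notebook.

```python
import pandas as pd

# Hypothetical toy orders (not taken from the real data set)
orders = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B", "C"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-01-20", "2011-03-02", "2011-02-14",
    ]),
})

# Cohort = month of each customer's first order
first_order = orders.groupby("CustomerID")["InvoiceDate"].transform("min")
orders["cohort"] = first_order.dt.to_period("M")

# Number of distinct customers in each cohort
cohort_sizes = orders.groupby("cohort")["CustomerID"].nunique()
print(cohort_sizes)
```

Here customers A and B fall into the January 2011 cohort and customer C into the February 2011 cohort. Any other shared characteristic (country, acquisition channel) could serve as the grouping key instead.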

When conducting a cohort analysis, it is crucial to consider the relationship between the metric we are tracking and the business model. Depending on the company's goals, we can focus on user retention, conversion rate, revenue, etc.

Furthermore, cohort analysis can help us observe how changes in the product affect the user behavior under analysis, which makes it possible to measure the impact of product updates or new features. We should be able to see whether the improvement efforts had any effect on users' behavior.
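One common way to make such comparisons is a retention matrix: one row per cohort, one column per period since the first order. The following is a rough sketch on hypothetical toy data, assuming monthly cohorts; the names `orders`, `period`, `active`, and `retention` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy orders (not taken from the real data set)
orders = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B", "C", "C"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-01-20",
        "2011-03-02", "2011-02-14", "2011-03-20",
    ]),
})

first_order = orders.groupby("CustomerID")["InvoiceDate"].transform("min")
orders["cohort"] = first_order.dt.to_period("M")

# Whole calendar months between the cohort month and the order month
orders["period"] = (
    (orders["InvoiceDate"].dt.year - first_order.dt.year) * 12
    + (orders["InvoiceDate"].dt.month - first_order.dt.month)
)

# Distinct active customers per cohort and period, pivoted to a matrix
active = (
    orders.groupby(["cohort", "period"])["CustomerID"]
    .nunique()
    .unstack("period")
)

# Divide each row by the cohort size (period 0) to get retention rates
retention = active.divide(active[0], axis=0)
print(retention)
```

Reading down a column compares cohorts at the same age, so a cohort acquired after a product change can be compared with earlier ones. Cells for periods that a cohort has not reached yet are left as `NaN`.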

In [0]:
#@title ## Setup
#@markdown * Import dependencies
#@markdown * Download dataset (UCI Online Retail Data Set)
#@markdown ---
#@markdown **Data Set Information:**
#@markdown This Online Retail data set contains all the transactions
#@markdown occurring for a UK-based and registered, non-store online retailer
#@markdown between 01/12/2010 and 09/12/2011. The company mainly sells unique
#@markdown all-occasion giftware. Many customers of the company are wholesalers.

import pandas as pd
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import matplotlib.colors as mcolors

from operator import attrgetter


!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx -q


In [6]:
# Read the identifier columns as strings so they are treated as labels, not numbers
df = pd.read_excel('Online Retail.xlsx',
                   dtype={'CustomerID': str, 'InvoiceNo': str},
                   parse_dates=['InvoiceDate'])

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
249148,558887,22423,REGENCY CAKESTAND 3 TIER,2,2011-07-04 15:26:00,12.75,18265,United Kingdom
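To illustrate why the identifier columns are loaded as strings, here is a small sketch that uses `pd.read_csv` on an in-memory sample as a stand-in for `pd.read_excel` (both accept `dtype` and `parse_dates`). The CSV text is hypothetical, and the `C`-prefixed invoice number is an assumption about how cancellations are labeled.

```python
import io
import pandas as pd

# Hypothetical sample; the second row has no customer ID
csv_text = (
    "InvoiceNo,CustomerID,InvoiceDate\n"
    "536365,17850,2010-12-01 08:26:00\n"
    "C536379,,2010-12-01 09:41:00\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"InvoiceNo": str, "CustomerID": str},  # keep IDs as labels
    parse_dates=["InvoiceDate"],                  # parse timestamps on load
)

print(toy.dtypes)
print(toy)
```

With string IDs, a missing customer stays a missing value instead of forcing the whole column to floating point, and non-numeric invoice numbers load without errors.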


In [9]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  object        
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 33.1+ MB


In [0]:
df.dropna(subset=['CustomerID'], inplace=True)

In [13]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Quantity,406829.0,12.061303,248.69337,-80995.0,2.0,5.0,12.0,80995.0
UnitPrice,406829.0,3.460471,69.315162,0.0,1.25,1.95,3.75,38970.0


In [0]:
df = df[df['Quantity'] > 0]
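The two cleaning steps so far (dropping rows without a `CustomerID`, then keeping only positive quantities) can be checked on a small example. The frame below is hypothetical; treating negative quantities as returns or cancellations, and the `C` prefix on the invoice number, are assumptions about this data set rather than something the notebook states.

```python
import pandas as pd

# Hypothetical rows: one cancellation and one order without a customer ID
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381"],
    "Quantity": [6, -1, 12, 3],
    "CustomerID": ["17850", "14527", None, "15311"],
})

# Step 1: orders cannot be assigned to a cohort without a customer ID
cleaned = toy.dropna(subset=["CustomerID"])
# Step 2: keep only rows with a positive quantity
cleaned = cleaned[cleaned["Quantity"] > 0]

print(f"kept {len(cleaned)} of {len(toy)} rows")
```

Reporting how many rows each step removes is a cheap sanity check before building cohorts on the remaining data.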

In [26]:
n_orders = df.groupby(['CustomerID'])['InvoiceNo'].nunique()
n_orders

CustomerID
12346     1
12347     7
12348     4
12349     1
12350     1
         ..
18280     1
18281     1
18282     2
18283    16
18287     3
Name: InvoiceNo, Length: 4339, dtype: int64
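Each invoice spans several rows (one per product), so counting rows would overstate the number of orders. The sketch below contrasts `count()` with `nunique()` on hypothetical toy data; `line_items` and `n_orders_toy` are illustrative names.

```python
import pandas as pd

# Hypothetical line items: invoice 536365 has two rows, 536367 has two rows
toy = pd.DataFrame({
    "CustomerID": ["17850", "17850", "17850", "13047", "13047"],
    "InvoiceNo": ["536365", "536365", "536366", "536367", "536367"],
})

# Rows per customer (line items), not orders
line_items = toy.groupby("CustomerID")["InvoiceNo"].count()
# Distinct invoices per customer, which is the number of orders
n_orders_toy = toy.groupby("CustomerID")["InvoiceNo"].nunique()

print(pd.DataFrame({"line_items": line_items, "orders": n_orders_toy}))
```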

In [0]:
import numpy as np
mult_orders_perc = np.sum(n_orders > 1) / n_orders.size  # share of customers with more than one order
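The variable name suggests the share of customers who ordered more than once. As a sketch of that calculation, the snippet below uses the first four values from the `n_orders` output above as a toy series; `repeat_share` is an illustrative name, and the mean of a boolean mask is used as one way to express a proportion.

```python
import pandas as pd

# First four customers from the n_orders output, used as a toy sample
n_orders_toy = pd.Series(
    {"12346": 1, "12347": 7, "12348": 4, "12349": 1}, name="InvoiceNo"
)

# The mean of a boolean mask is the fraction of True values
repeat_share = (n_orders_toy > 1).mean()
print(f"{100 * repeat_share:.2f}% of customers ordered more than once.")
```

A low share of repeat customers would leave most retention cells nearly empty, so this number is worth checking before building the cohort matrix.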