Binning using pd.cut(), pd.qcut()

In [1]:
import pandas as pd
import numpy as np

In [2]:
ages = np.arange(20)
ages

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [3]:
pd.cut(ages,bins=4,labels = ['Youth','YouthAdult','MiddleAged','Senior']).value_counts()

Youth         5
YouthAdult    5
MiddleAged    5
Senior        5
Name: count, dtype: int64
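
With an integer `bins`, `pd.cut` splits the range of the data into equal-width intervals, which is why each label above receives five ages. A minimal sketch, reusing the same `ages` array, that asks for the computed edges with `retbins=True` (the names `binned`, `edges` and `widths` are illustrative only):

```python
import numpy as np
import pandas as pd

ages = np.arange(20)

# retbins=True also returns the bin edges that pd.cut computed
binned, edges = pd.cut(ages, bins=4, retbins=True)

# Equal-width bins: the gaps between consecutive edges are (almost) equal;
# pandas moves the lowest edge slightly below the minimum so that 0 is included
widths = np.diff(edges)
print(edges)
print(widths)
```

The edges depend only on the minimum and maximum of the data, not on how the values are distributed between them.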

In [4]:
pd.cut(ages, bins = [0,7,15,20], right=False).value_counts()

[0, 7)      7
[7, 15)     8
[15, 20)    5
Name: count, dtype: int64
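
By default `pd.cut` builds intervals that are closed on the right, so the lowest edge itself is excluded. The sketch below compares the default with `right=False` on the same edges; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ages = np.arange(20)
edges = [0, 7, 15, 20]

# Default right=True: intervals are (0, 7], (7, 15], (15, 20], so age 0 gets no bin (NaN)
right_closed = pd.Series(pd.cut(ages, bins=edges))
# right=False: intervals are [0, 7), [7, 15), [15, 20), so every age gets a bin
left_closed = pd.Series(pd.cut(ages, bins=edges, right=False))

print(right_closed.isna().sum())  # number of ages left unbinned with the default
print(left_closed.isna().sum())   # number of ages left unbinned with right=False
```

Passing `include_lowest=True` is another way to keep the lowest edge when the intervals stay closed on the right.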

In [5]:
bins = pd.IntervalIndex.from_tuples([(0, 5), (6, 9), (10, 19)],closed='neither')
pd.cut(ages, bins = bins).value_counts()

(0, 5)      4
(6, 9)      2
(10, 19)    8
Name: count, dtype: int64
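
The counts above add up to 14, not 20: with `closed='neither'` the endpoints belong to no interval, and values in the gaps between intervals are not binned either. A small sketch, with illustrative variable names, that lists the ages left without a bin:

```python
import numpy as np
import pandas as pd

ages = np.arange(20)
bins = pd.IntervalIndex.from_tuples([(0, 5), (6, 9), (10, 19)], closed='neither')

# Index the result by age so the unbinned values are easy to read off
binned = pd.Series(pd.cut(ages, bins=bins), index=ages)

# Ages that sit on an open endpoint belong to no interval and become NaN
unassigned = binned[binned.isna()].index.tolist()
print(unassigned)
```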

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('D:\\Data Preparation\\Data\\OnlineRetail2.csv')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203422 entries, 0 to 203421
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    203422 non-null  object 
 1   StockCode    203422 non-null  object 
 2   Description  202623 non-null  object 
 3   Quantity     203422 non-null  int64  
 4   InvoiceDate  203422 non-null  object 
 5   UnitPrice    203422 non-null  float64
 6   CustomerID   150039 non-null  float64
 7   Country      203422 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 12.4+ MB


In [8]:
df[['InvoiceNo','StockCode','Quantity']].duplicated().sum()

1585

In [9]:
df['Quantity'].describe()

count    203422.000000
mean          9.585684
std         240.921315
min      -74215.000000
25%           1.000000
50%           3.000000
75%          10.000000
max       74215.000000
Name: Quantity, dtype: float64

In [10]:
df.isnull().sum()/df.shape[0]

InvoiceNo      0.000000
StockCode      0.000000
Description    0.003928
Quantity       0.000000
InvoiceDate    0.000000
UnitPrice      0.000000
CustomerID     0.262425
Country        0.000000
dtype: float64

DATA CLEANING

1. Data Cleaning
- Drop duplicated rows
- Drop records with Quantity <= 0
- Drop records with null CustomerID

In [11]:
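# Drop fully duplicated rows, then rows that repeat the same invoice line
df = df.drop_duplicates()
df = df.drop_duplicates(subset = ['InvoiceNo','StockCode','Quantity'])
# .copy() avoids SettingWithCopyWarning on the column assignments that follow
df = df[df['CustomerID'].notnull()].copy()
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df = df[df['Quantity'] > 0].copy()
df.shape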

(145000, 8)

RFM (Recency, Frequency, Monetary) analysis is a behavior-based approach to customer segmentation.
   It groups customers by their previous purchase transactions: how recently, how often, and how much each customer bought.
   In this section we practice building customer segments with the RFM model.

- Create an empty DataFrame named customer_df
- Append a new column named CustomerID. This column stores the unique ID of each customer
- Create a new column named ‘TotalRevenue’ as the product of two columns Quantity and UnitPrice 

In [12]:
## Create a new DataFrame named customer_df with a single column 'CustomerID', one row per unique customer
customer_df = pd.DataFrame()
customer_df['CustomerID'] = df.CustomerID.unique()

In [13]:
## Create a new column named 'TotalRevenue' as the product of the columns Quantity and UnitPrice
df['TotalRevenue'] = df['Quantity'] * df['UnitPrice']

- Create a new dataframe named frequency_df. This dataframe has only two columns:
    - The ‘CustomerID’ column stores the unique ID of each customer
    - The ‘Frequency’ column stores the number of distinct invoices of each customer
- Create a new dataframe named monetary_df. This dataframe has only two columns:
    - The ‘CustomerID’ column stores the unique ID of each customer
    - The ‘Monetary’ column stores the total revenue of each customer
- Create a new dataframe named recency_df. This dataframe has only two columns:
    - The ‘CustomerID’ column stores the unique ID of each customer
    - The ‘Recency’ column stores the number of days between the customer's last purchase and the latest InvoiceDate in the data


In [14]:
df.groupby('CustomerID')[['InvoiceNo']].nunique()

            InvoiceNo
CustomerID           
12346.0             1
12347.0             3
12348.0             2
12350.0             1
12352.0             5
...               ...
18273.0             1
18280.0             1
18281.0             1
18283.0             7


In [15]:
# Frequency: number of distinct invoices per customer
frequency_df = df.groupby('CustomerID').InvoiceNo.nunique().reset_index()
frequency_df.columns = ['CustomerID','Frequency']
customer_df  = pd.merge(customer_df,frequency_df,on='CustomerID',how='left')
# Monetary: total revenue per customer
monetary_df = df.groupby('CustomerID').TotalRevenue.sum().reset_index()
monetary_df.columns = ['CustomerID','Monetary']
customer_df = pd.merge(customer_df,monetary_df,on='CustomerID',how='left')
# Recency: days between each customer's last purchase and the latest invoice date in the data
recency_df = (df.InvoiceDate.max() - df.groupby('CustomerID')\
              .InvoiceDate.max()).dt.days.reset_index()
recency_df.columns = ['CustomerID','Recency']
customer_df = pd.merge(customer_df,recency_df,on='CustomerID',how='left')
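
The same three measures can also be computed in a single pass with named aggregation. The sketch below is one possible variant, not the notebook's method. It runs on a small hypothetical frame `toy` that stands in for `df`, and the names `snapshot`, `rfm` and `LastPurchase` are illustrative only.

```python
import pandas as pd

# Hypothetical transactions standing in for the cleaned df
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-01-10',
                                   '2011-01-05', '2011-01-05']),
    'TotalRevenue': [10.0, 5.0, 20.0, 7.5, 2.5],
})

# Reference date: the latest invoice date in the data
snapshot = toy['InvoiceDate'].max()

# One groupby produces Frequency, Monetary and the last purchase date
rfm = toy.groupby('CustomerID').agg(
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('TotalRevenue', 'sum'),
    LastPurchase=('InvoiceDate', 'max'),
).reset_index()

# Recency in whole days relative to the reference date
rfm['Recency'] = (snapshot - rfm['LastPurchase']).dt.days
print(rfm)
```

This avoids the three intermediate frames and merges, at the cost of being less close to the step-by-step wording of the exercise.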

In [16]:
customer_df.describe()

         CustomerID    Frequency       Monetary      Recency
count   2724.000000  2724.000000    2724.000000  2724.000000
mean   15283.547724     2.717695    1251.721623    59.764684
std     1717.888344     3.878238    4577.361618    47.913115
min    12346.000000     1.000000       2.900000     0.000000
25%    13799.750000     1.000000     244.417500    20.000000
50%    15237.500000     2.000000     492.360000    46.000000
75%    16766.250000     3.000000    1067.407500    94.000000
max    18287.000000    67.000000  127410.230000   177.000000


2. Add segment values to the RFM table using quartiles. For example, if a frequency value falls in the first quartile it is replaced by 1, if it falls in the second quartile it is replaced by 2, and so on. Recency is scored in reverse, because a smaller recency (a more recent purchase) is better.
- Hint: use pd.qcut to create new columns named r_quantile, f_quantile and m_quantile in the dataframe customer_df

In [17]:
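# Recency: lower is better, so the labels run from 4 down to 1
customer_df['r_quantile'] = pd.qcut(customer_df['Recency'],q=4,labels = [4,3,2,1])
# Frequency: the minimum and the 25% quantile are both 1, so one bin edge is duplicated;
# duplicates='drop' leaves three bins, hence three labels
customer_df['f_quantile'] = pd.qcut(customer_df['Frequency'],q=4,labels = [1,2,3],duplicates='drop')
customer_df['m_quantile'] = pd.qcut(customer_df['Monetary'],q=4,labels = [1,2,3,4])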
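
Frequency needs `duplicates='drop'` because many customers share the same low value, so some quartile edges coincide. The sketch below reproduces this on a hypothetical skewed series `freq` (not the notebook's data): without the option `pd.qcut` raises a `ValueError`, and with it one bin is lost.

```python
import pandas as pd

# Hypothetical skewed frequencies: half of the customers bought exactly once
freq = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 4, 10])

# The minimum and the 25% quantile are both 1, so the default call fails
try:
    pd.qcut(freq, q=4)
    raised = False
except ValueError:
    raised = True
print(raised)

# duplicates='drop' removes the repeated edge, leaving three bins and three labels
scores, edges = pd.qcut(freq, q=4, labels=[1, 2, 3],
                        duplicates='drop', retbins=True)
print(edges)
print(scores.value_counts().sort_index())
```

Because one edge is dropped, the remaining bins no longer hold a quarter of the customers each, and the number of labels has to match the number of bins that survive.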

3. In customer_df, create a new column named RFM_Score. The formula for RFM_Score is as follows:
             RFM_Score = r_quantile + f_quantile + m_quantile

In [18]:
# The quantile columns are categorical, so cast them to int before summing
customer_df['RFM_Score'] = customer_df[['r_quantile','f_quantile','m_quantile']].astype(int).sum(axis=1)

4. Based on RFM_Score, divide customers into 3 segments (low-value, mid-value and high-value) so that the following rules are satisfied:
 - The high-value segment contains no more than 20% of all customers.
 - The mid-value segment contains no less than 30% of all customers.

In [19]:
customer_df['RFM_Score']\
.describe(percentiles=[0.01*i for i in range(0,100,3)])

count    2724.000000
mean        6.514684
std         2.461784
min         3.000000
0%          3.000000
3%          3.000000
6%          3.000000
9%          3.000000
12%         4.000000
15%         4.000000
18%         4.000000
21%         4.000000
24%         4.000000
27%         5.000000
30%         5.000000
33%         5.000000
36%         5.000000
39%         5.000000
42%         6.000000
45%         6.000000
48%         6.000000
50%         6.000000
51%         6.000000
54%         6.000000
57%         7.000000
60%         7.000000
63%         7.000000
66%         7.000000
69%         8.000000
72%         8.000000
75%         8.000000
78%         9.000000
81%         9.000000
84%        10.000000
87%        10.000000
90%        10.000000
93%        11.000000
96%        11.000000
99%        11.000000
max        11.000000
Name: RFM_Score, dtype: float64
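
Since RFM_Score takes only a handful of integer values, the cumulative share of customers at each score is a more direct way to pick cut points than a long percentile list. The sketch below uses a hypothetical `scores` series standing in for `customer_df['RFM_Score']`, so its numbers are not the notebook's.

```python
import pandas as pd

# Hypothetical scores standing in for customer_df['RFM_Score'] (100 customers)
scores = pd.Series([3]*10 + [4]*15 + [5]*15 + [6]*15 + [7]*12
                   + [8]*10 + [9]*7 + [10]*9 + [11]*7)

# Share of customers at each score, then the running total from the lowest score up
cumulative = scores.value_counts(normalize=True).sort_index().cumsum()
print(cumulative)
```

Reading the running total at a candidate cut gives the share of customers at or below that score.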

In [20]:
customer_df['Segment'] = pd.cut(customer_df['RFM_Score'],bins=[2,6,9,11],
                                labels = ['Low Value','Mid Value','High Value'])

In [21]:
customer_df['Segment'].value_counts(normalize=True)

Segment
Low Value     0.556535
Mid Value     0.275330
High Value    0.168135
Name: proportion, dtype: float64
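
The shares above can be checked against the two rules of step 4. The high-value share (about 16.8%) stays under 20%, but the mid-value share (about 27.5%) falls short of the required 30%, so bins=[2,6,9,11] does not satisfy the second rule. The percentile table suggests that a lower first cut, such as 5 instead of 6, would move enough customers into the mid-value segment. The sketch below shows one way to test candidate cuts. The helpers `segment_shares` and `rules_met` and the `scores` series are hypothetical, so rerun the check on `customer_df['RFM_Score']` before settling on bins.

```python
import pandas as pd

def segment_shares(scores, low_cut, high_cut):
    """Share of customers per segment when scores are cut at low_cut and high_cut."""
    segments = pd.cut(scores,
                      bins=[scores.min() - 1, low_cut, high_cut, scores.max()],
                      labels=['Low Value', 'Mid Value', 'High Value'])
    return segments.value_counts(normalize=True)

def rules_met(shares):
    """High value at most 20% and mid value at least 30% of all customers."""
    return bool(shares['High Value'] <= 0.20 and shares['Mid Value'] >= 0.30)

# Shares printed above for bins=[2, 6, 9, 11]
printed = pd.Series({'Low Value': 0.556535, 'Mid Value': 0.275330, 'High Value': 0.168135})
printed_ok = rules_met(printed)
print(printed_ok)

# Hypothetical scores standing in for customer_df['RFM_Score'] (100 customers)
scores = pd.Series([3]*10 + [4]*15 + [5]*15 + [6]*15 + [7]*12
                   + [8]*10 + [9]*7 + [10]*9 + [11]*7)

# Compare two candidate lower cuts while keeping the upper cut at 9
results = {low_cut: rules_met(segment_shares(scores, low_cut, 9)) for low_cut in (5, 6)}
print(results)
```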