# Customer Segmentation in Python

## Chapter-1 Cohort Analysis

### Assign daily acquisition cohort

1. First, extract only the "date" part of a "datetime object" in "YYYY-MM-DD HH:MM:SS" format, for cases where the "time" part is not needed.
2. Then use the "groupby" method to collect all the dates on which each "customer" made a purchase into one variable ("grouping").
3. Take the smallest value of this variable with ".transform('min')" to get the first day on which each customer made a purchase.
4. This creates the "cohort" (here, a group of customers), which is the first step of the analysis: every customer is grouped by the day of their first purchase.

In [None]:
import datetime as dt

# Define a function that keeps only the date part of a datetime
def get_day(x): return dt.datetime(x.year, x.month, x.day)

# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) 

# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay'] 

# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')

# View the top 5 rows
print(online.head())
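
The steps above can be tried on a small hand-made table. This is only a sketch: the `toy` DataFrame and its values are hypothetical, and `.dt.normalize()` is used as a vectorized stand-in for applying `get_day` row by row.

```python
import pandas as pd

# Hypothetical toy transactions; the real `online` dataset has more columns
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2],
    'InvoiceDate': pd.to_datetime([
        '2011-01-05 10:15:00', '2011-01-20 09:00:00',
        '2011-01-07 14:30:00', '2011-02-01 16:45:00',
        '2011-03-12 11:00:00']),
})

# Drop the time part, keeping midnight of the same day
toy['InvoiceDay'] = toy['InvoiceDate'].dt.normalize()

# First purchase day per customer, broadcast back to every row of that customer
toy['CohortDay'] = toy.groupby('CustomerID')['InvoiceDay'].transform('min')

print(toy[['CustomerID', 'InvoiceDay', 'CohortDay']])
```

Because `transform` returns one value per original row, the result can be assigned straight back as a new column.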

### Calculate time offset in days

1. Next comes the "time offset", i.e. the time difference, calculated here in days.
2. The calculation works on "integer" values, so the "date" information is first converted to integers by a function.
3. In other words, we first build "A" (invoice date) and "B" (cohort date) as numbers, then compute "A-B".

In [None]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

# Get the integers for date parts from the InvoiceDay column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')

# Get the integers for date parts from the CohortDay column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')

# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Calculate difference in days
days_diff = invoice_day - cohort_day

# Combine the differences into a day offset
# (approximation: 365-day years and 30-day months; +1 so the first day has index 1)
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1
print(online.head())
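
The offset above treats every year as 365 days and every month as 30 days, so it can drift from the calendar. As a sketch for comparison only (the `toy` dates are hypothetical), subtracting the two datetime columns gives the exact number of days:

```python
import pandas as pd

# Hypothetical invoice and cohort days for three transactions
toy = pd.DataFrame({
    'InvoiceDay': pd.to_datetime(['2011-01-05', '2011-03-01', '2012-01-05']),
    'CohortDay': pd.to_datetime(['2011-01-05', '2011-01-05', '2011-01-05']),
})

# Approximation used in the notes: 365-day years and 30-day months
approx = ((toy['InvoiceDay'].dt.year - toy['CohortDay'].dt.year) * 365
          + (toy['InvoiceDay'].dt.month - toy['CohortDay'].dt.month) * 30
          + (toy['InvoiceDay'].dt.day - toy['CohortDay'].dt.day) + 1)

# Exact calendar difference: subtracting datetimes yields a timedelta
exact = (toy['InvoiceDay'] - toy['CohortDay']).dt.days + 1

print(pd.DataFrame({'approx': approx, 'exact': exact}))
```

The two versions agree on the first and last rows and differ by one day on the middle row, because January has 31 days and February 2011 has 28.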

### Count monthly active customers from each cohort 

In [None]:
grouping = online.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique)
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot(index='CohortMonth',
                                  columns='CohortIndex',
                                  values='CustomerID')
print(cohort_counts)
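
The cell above groups by `CohortMonth` and a monthly `CohortIndex`, which these notes do not construct. One way they might be built, mirroring the daily logic, is sketched below; the `toy` data and the `InvoiceMonth` helper column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy transactions
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'InvoiceDate': pd.to_datetime(['2011-01-05', '2011-03-20', '2011-02-07',
                                   '2011-02-25', '2011-03-02']),
})

# Truncate each invoice date to the first day of its month
toy['InvoiceMonth'] = toy['InvoiceDate'].dt.to_period('M').dt.to_timestamp()

# Month of each customer's first purchase
toy['CohortMonth'] = toy.groupby('CustomerID')['InvoiceMonth'].transform('min')

# Months elapsed since the cohort month, starting at 1
toy['CohortIndex'] = ((toy['InvoiceMonth'].dt.year - toy['CohortMonth'].dt.year) * 12
                      + (toy['InvoiceMonth'].dt.month - toy['CohortMonth'].dt.month) + 1)

# Unique active customers per cohort and month offset, in wide form
cohort_counts = (toy.groupby(['CohortMonth', 'CohortIndex'])['CustomerID']
                    .nunique()
                    .unstack('CohortIndex'))
print(cohort_counts)
```

Here `.nunique().unstack()` is a shorter route to the same wide table that `reset_index()` followed by `pivot()` produces.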

### Calculate retention rate from scratch

1. For each CohortMonth, we split the data between the start and end of the analysis window into as many groups as there are time periods.
2. We stored the number of "unique" customers in each group in a variable (and did not forget to call reset_index).
3. Then we built a "pivot table" whose index is the cohort date, whose columns are the "cohort time period", and whose values are the number of distinct customers.
4. Dividing all values in this table by each group's starting size shows how much of that group we kept over time. This is called the "retention rate".

In [None]:
grouping = online.groupby(['CohortMonth', 'CohortIndex'])

# Count the number of unique customer IDs in each group
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()

# Create a pivot 
cohort_counts = cohort_data.pivot(index='CohortMonth', 
                                  columns='CohortIndex', 
                                  values='CustomerID')

# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]

# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)

retention.round(3) * 100
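
To make the division step concrete, here is a minimal sketch on a hypothetical two-cohort table (the counts are made up for illustration):

```python
import pandas as pd

# Hypothetical cohort counts: rows are cohorts, columns are months since acquisition
cohort_counts = pd.DataFrame(
    {1: [100.0, 80.0], 2: [40.0, 20.0], 3: [25.0, float('nan')]},
    index=pd.Index(['2011-01', '2011-02'], name='CohortMonth'))
cohort_counts.columns.name = 'CohortIndex'

# The first column holds each cohort's starting size
cohort_sizes = cohort_counts.iloc[:, 0]

# axis=0 aligns cohort_sizes with the row index, so each row is divided by its own size
retention = cohort_counts.divide(cohort_sizes, axis=0)
print(retention)
```

The first column of `retention` is always 1.0, and a cohort that has not yet reached a given month stays `NaN`.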

### Calculate average price

1. First, one of the most important points to take away is the "grouping" logic. Here "grouping" appears to cover the whole "online" dataset, yet "cohort_data" contains only 3 features.
2. The logic is probably this: we assigned only "UnitPrice" to "cohort_data", but we did so through the "grouping" variable, i.e. the dataset grouped by "CohortMonth" and "CohortIndex". So even when a single column is selected, the features used for grouping come along with it. I hope I have understood this correctly.

In [None]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex']) 

# Calculate the average of the unit price column
cohort_data = grouping['UnitPrice'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot
average_price = cohort_data.pivot(index='CohortMonth',
                                  columns='CohortIndex',
                                  values='UnitPrice')
print(average_price.round(1))
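
The note at the top of this section asks why selecting one column from a groupby still yields three features. A small sketch (hypothetical `toy` data) suggests the answer: the group keys become the index of the aggregated result, and `reset_index()` turns them back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data with one extra column that is never selected
toy = pd.DataFrame({
    'CohortMonth': ['2011-01', '2011-01', '2011-02'],
    'CohortIndex': [1, 2, 1],
    'UnitPrice': [2.0, 4.0, 3.0],
    'Quantity': [5, 1, 2],
})

grouping = toy.groupby(['CohortMonth', 'CohortIndex'])

# Selecting one column and aggregating gives a Series indexed by the group keys
avg_price = grouping['UnitPrice'].mean()
print(avg_price.index.names)

# reset_index() moves the keys out of the index into regular columns
flat = avg_price.reset_index()
print(flat.columns.tolist())
```

Columns that were neither grouped on nor selected, such as `Quantity` here, do not appear in the result.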

## Cohort analysis visualization

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.title('Retention rates')
sns.heatmap(data = retention,
            annot = True,
            fmt = '.0%',
            vmin = 0.0,
            vmax = 0.5,
            cmap = 'BuGn')
plt.show()

## Chapter-2 Recency, Frequency, Monetary Value analysis

### Calculate Spend quartiles (q=4)

Split the customers into 4 equal-sized groups according to how much they spent.

In [None]:
# Create a spend quartile with 4 groups and labels ranging from 1 through 4 
spend_quartile = pd.qcut(data['Spend'], q=4, labels=range(1,5))

# Assign the quartile values to the Spend_Quartile column in data
data['Spend_Quartile'] = spend_quartile

# Print data with sorted Spend values
print(data.sort_values('Spend'))

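### Calculate Recency quartiles (q=4)

We do the same for "recency". The labels run from 4 down to 1, so the most recent customers (lowest Recency_Days) get the highest label.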

In [None]:
# Store labels from 4 to 1 in a decreasing order
r_labels = list(range(4, 0, -1))

# Create a recency quartile with 4 groups and pass the previously created labels
recency_quartiles = pd.qcut(data['Recency_Days'], q=4, labels=r_labels)

# Assign the quartile values to the Recency_Quartile column in `data`
data['Recency_Quartile'] = recency_quartiles 

# Print `data` with sorted Recency_Days values
print(data.sort_values('Recency_Days'))
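
A minimal sketch of how `pd.qcut` assigns the two kinds of labels, using a hypothetical eight-row `data` table (the values are made up so that each quartile holds exactly two rows):

```python
import pandas as pd

# Hypothetical customers with spend and recency values
data = pd.DataFrame({
    'Spend': [10, 20, 30, 40, 50, 60, 70, 80],
    'Recency_Days': [5, 15, 25, 35, 45, 55, 65, 75],
})

# Increasing labels: higher spend gets a higher label
data['Spend_Quartile'] = pd.qcut(data['Spend'], q=4, labels=list(range(1, 5)))

# Decreasing labels: fewer days since the last purchase gets a higher label
data['Recency_Quartile'] = pd.qcut(data['Recency_Days'], q=4,
                                   labels=list(range(4, 0, -1)))

print(data)
```

`qcut` cuts at quantiles of the data, so the groups are equal in size rather than equal in width.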

### Calculate RFM values

1. The snapshot date is created so that the day of the analysis can be treated as if it were today.
2. datamart

In [None]:
import datetime

# Snapshot date: one day after the latest invoice in the data
snapshot_date = max(online.InvoiceDate) + datetime.timedelta(days=1)

# Calculate Recency, Frequency and Monetary value for each customer 
datamart = online.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'})

# Rename the columns 
datamart.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalSum': 'MonetaryValue'}, inplace=True)

# Print top 5 rows
print(datamart.head())

            Frequency  MonetaryValue  Recency
CustomerID                                   
12747              25         948.70        3
12748             888        7046.16        1
12749              37         813.45        4
12820              17         268.02        4
12822               9         146.15       71
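
The same calculation can be checked by hand on a tiny table. This sketch uses hypothetical `toy` invoices and pandas named aggregation, which sets the output column names directly instead of renaming afterwards:

```python
import pandas as pd

# Hypothetical invoices for two customers
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-09', '2011-11-10']),
    'TotalSum': [20.0, 30.0, 15.0],
})

# Treat the day after the last invoice as "today"
snapshot_date = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

# Named aggregation: output_column=(input_column, aggregation)
toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot_date - x.max()).days),
    Frequency=('InvoiceNo', 'count'),
    MonetaryValue=('TotalSum', 'sum'),
)
print(toy_rfm)
```

Customer 1 last bought on 2011-12-09, one day before the snapshot, so Recency is 1; customer 2 last bought 30 days before it.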

### Calculate 3 groups for Recency, Frequency and MonetaryValue & Calculate RFM Score

In [None]:
# Create labels for Recency, Frequency and MonetaryValue
r_labels = range(3, 0, -1); f_labels = range(1, 4); m_labels = range(1, 4)

# Assign these labels to three equal percentile groups 
r_groups = pd.qcut(datamart['Recency'], q=3, labels=r_labels)
f_groups = pd.qcut(datamart['Frequency'], q=3, labels=f_labels)
m_groups = pd.qcut(datamart['MonetaryValue'], q=3, labels=m_labels)

# Create new columns R, F and M
datamart = datamart.assign(R=r_groups.values, F=f_groups.values, M=m_groups.values)

# Calculate RFM_Score (cast the categorical labels to int before summing)
datamart['RFM_Score'] = datamart[['R','F','M']].astype(int).sum(axis=1)
print(datamart['RFM_Score'].head())

def join_rfm(x): 
    return str(x['R']) + str(x['F']) + str(x['M'])

datamart['RFM_Segment'] = datamart.apply(join_rfm, axis=1)

### Creating custom segments

In [None]:
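# Define rfm_level function
# Note: these thresholds assume quartile-based R, F and M scores (RFM_Score from 3 to 12);
# with three groups per metric the maximum score is 9, so 'Top' is never reached
def rfm_level(df):
    if df['RFM_Score'] >= 10:
        return 'Top'
    elif df['RFM_Score'] >= 6:
        return 'Middle'
    else:
        return 'Low'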

# Create a new variable RFM_Level
datamart['RFM_Level'] = datamart.apply(rfm_level, axis=1)

# Print the header with top 5 rows to the console
print(datamart.head())
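
As an alternative to a row-wise `apply`, the same three levels could be assigned with `pd.cut`. This is a sketch under the assumption that `RFM_Score` holds integers between 3 and 12; the `scores` table is hypothetical.

```python
import pandas as pd

# Hypothetical RFM scores covering the boundaries of each level
scores = pd.DataFrame({'RFM_Score': [3, 5, 6, 9, 10, 12]})

# Bins are right-inclusive: (2, 5] -> Low, (5, 9] -> Middle, (9, 12] -> Top
scores['RFM_Level'] = pd.cut(scores['RFM_Score'],
                             bins=[2, 5, 9, 12],
                             labels=['Low', 'Middle', 'Top'])
print(scores)
```

For integer scores these bin edges reproduce the `>= 10` and `>= 6` thresholds of `rfm_level`, without a Python-level loop over rows.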

### Analyzing custom segments

In [None]:
# Calculate average values for each RFM_Level, and return a size of each segment 
rfm_level_agg = datamart.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    # Return the size of each segment
    'MonetaryValue': ['mean', 'count']
}).round(1)

# Print the aggregated dataset
print(rfm_level_agg)

         Frequency MonetaryValue       Recency
               mean          mean count    mean
RFM_Level                                      
Low             3.2          52.7  1075   180.8
Middle         10.7         202.9  1547    73.9
Top            47.1         959.7  1021    20.3

## Chapter-3 Data pre-processing for clustering

### Detect skewed variables

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
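# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(datamart['Recency'], kde=True)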
plt.show()

### Manage skewness: logarithmic transformation (positive values only)

In [None]:
import numpy as np
frequency_log = np.log(datamart['Frequency'])
sns.histplot(frequency_log, kde=True)
plt.show()
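
Besides looking at the plot, skewness can be checked numerically with pandas' `.skew()`. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed variable such as Frequency; the distribution and its parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, strictly positive data (stand-in for Frequency)
rng = np.random.default_rng(0)
frequency = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=1000))

# The logarithm is only defined for positive values
assert (frequency > 0).all()
frequency_log = np.log(frequency)

# A skewness near 0 indicates a roughly symmetric distribution
print('raw skew:', round(frequency.skew(), 2))
print('log skew:', round(frequency_log.skew(), 2))
```

If a variable can contain zeros or negative values, `np.log` cannot be applied directly and another transformation is needed.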

## Centering and scaling variables

In [None]:
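# Center: subtract each column's mean, so every column has mean 0
data_centered = data - data.mean()
data_centered.describe().round(2)

# Scale: divide each column by its standard deviation, so every column has std 1
data_scaled = data / data.std()
data_scaled.describe().round(2)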

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(datamart_rfm)
datamart_normalized = scaler.transform(datamart_rfm)
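
The manual steps and `StandardScaler` can be compared directly. One detail worth knowing: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population version (`ddof=0`). A sketch on a hypothetical two-column table:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical small dataset
data = pd.DataFrame({'Recency': [1.0, 5.0, 9.0, 25.0],
                     'Frequency': [2.0, 4.0, 6.0, 20.0]})

# Manual centering and scaling in one step; ddof=0 matches StandardScaler
manual = (data - data.mean()) / data.std(ddof=0)

# Same transformation with scikit-learn
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual.to_numpy(), scaled))
```

With the pandas default `ddof=1` the values would differ slightly, which hardly matters for large datasets but shows up on small ones.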

### Pre-process RFM data

In [None]:
# Unskew the data
datamart_log = np.log(datamart_rfm)

# Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(datamart_log)

# Scale and center the data
datamart_normalized = scaler.transform(datamart_log)

# Create a pandas DataFrame
datamart_normalized = pd.DataFrame(data=datamart_normalized, index=datamart_rfm.index, columns=datamart_rfm.columns)

## Chapter-4 Customer Segmentation with K-means

### Run KMeans

In [None]:
# Import KMeans 
from sklearn.cluster import KMeans

# Initialize KMeans
kmeans = KMeans(n_clusters=3, random_state=1) 

# Fit k-means clustering on the normalized data set
kmeans.fit(datamart_normalized)

# Extract cluster labels
cluster_labels = kmeans.labels_

### Assign labels to raw data

In [None]:
# Create a DataFrame by adding a new cluster label column
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)

# Group the data by cluster
grouped = datamart_rfm_k3.groupby(['Cluster'])

# Calculate average RFM values and segment sizes per cluster value
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)

### Choosing the number of Clusters

Methods to define the number of clusters
    
    a. Visual methods - elbow criterion
    b. Mathematical methods - silhouette coefficient
    c. Experimentation and interpretation
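
The silhouette coefficient from item b is not shown in these notes. A minimal sketch with scikit-learn's `silhouette_score` follows; the synthetic three-blob data `X` is an assumption standing in for the normalized RFM data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# The silhouette score needs at least 2 clusters, so k starts at 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher is better; values range from -1 to 1
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On real customer data the peak is usually less clear-cut than on this synthetic example, which is why the notes also list experimentation and interpretation.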

### Elbow criterion method

In [None]:
# Import key libraries
from sklearn.cluster import KMeans
import seaborn as sns
from matplotlib import pyplot as plt

# Fit KMeans and calculate SSE for each k
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=1)
    # datamart_normalized is the normalized RFM data from the pre-processing step
    kmeans.fit(datamart_normalized)
    sse[k] = kmeans.inertia_  # sum of squared distances to closest cluster center

# Plot SSE for each k
plt.title('The Elbow Method')
plt.xlabel('k'); plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

## Profile and interpret segments

Approaches to build customer personas
    
    a. Summary statistics for each cluster e.g. average RFM values
    b. Snake plots (from market research)
    c. Relative importance of cluster attributes compared to population

In [None]:
# Note: cluster_labels here should come from a KMeans model fitted with n_clusters=2
datamart_rfm_k2 = datamart_rfm.assign(Cluster=cluster_labels)

datamart_rfm_k2.groupby(['Cluster']).agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count'],
}).round(0)

Repeat the same for k=3, then compare the average RFM values of each clustering solution.

### Snake plots to understand and compare segments

    a. Market research technique to compare different segments
    b. Visual representation of each segment's attributes
    c. Need to first normalize data (center & scale)
    d. Plot each cluster's average normalized values of each attribute

In [None]:
# Prepare data for a snake plot

# Transform datamart_normalized into a DataFrame and add a Cluster column
datamart_normalized = pd.DataFrame(datamart_normalized,
                                   index=datamart_rfm.index,
                                   columns=datamart_rfm.columns)
datamart_normalized['Cluster'] = datamart_rfm_k3['Cluster']

# Melt the data into a long format so RFM values and metric names are stored in 1 column each
datamart_melt = pd.melt(datamart_normalized.reset_index(),
                        id_vars=['CustomerID', 'Cluster'],
                        value_vars=['Recency', 'Frequency', 'MonetaryValue'],
                        var_name='Attribute',
                        value_name='Value')

# Visualize a snake plot
plt.title('Snake plot of standardized variables')
sns.lineplot(x="Attribute", y="Value", hue='Cluster', data=datamart_melt)
plt.show()

### Relative importance of segment attributes

    a. Useful technique to identify relative importance of each segment's attribute
    b. Calculate average values of each cluster
    c. Calculate average values of population
    d. Calculate importance score by dividing them and subtracting 1

In [None]:
cluster_avg = datamart_rfm_k3.groupby(['Cluster']).mean()
population_avg = datamart_rfm.mean()
relative_imp = cluster_avg / population_avg - 1

# Analyze and plot relative importance

# The further a ratio is from 0, the more important that attribute is for a segment
# relative to the total population.

relative_imp.round(2)

         Recency  Frequency  MonetaryValue
Cluster                                   
0          -0.82       1.68           1.83
1           0.84      -0.84          -0.86
2          -0.15      -0.34          -0.42

# Plot a heatmap for easier interpretation:
plt.figure(figsize=(8, 2))
plt.title('Relative importance of attributes')
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()
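
The importance score can be verified by hand on a tiny example. This sketch uses a hypothetical four-customer table with two attributes and two clusters:

```python
import pandas as pd

# Hypothetical customers with cluster assignments
toy = pd.DataFrame({
    'Recency': [10, 20, 100, 110],
    'Frequency': [20, 10, 2, 8],
    'Cluster': [0, 0, 1, 1],
})

# Average of each attribute within each cluster
cluster_avg = toy.groupby('Cluster').mean()

# Average of each attribute over the whole population
population_avg = toy.drop(columns='Cluster').mean()

# Ratio minus 1: 0 means "same as the population average"
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

Cluster 0 has an average Recency of 15 against a population average of 60, giving 15 / 60 - 1 = -0.75, i.e. 75% below average.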