**Author**: Moch Nabil Farras Dhiya (10120034)

**E-mail**: nabilfarras923@gmail.com

-------------------

**Disclaimer**: The **dataset** used in this analysis is a public dataset retrieved from [Customer Personality Analysis - Kaggle](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount).

# Background

## Attributes

**People**

*   ID: Customer's unique identifier
*   Year_Birth: Customer's birth year
*   Education: Customer's education level
*   Marital_Status: Customer's marital status
*   Income: Customer's yearly household income
*   Kidhome: Number of children in customer's household
*   Teenhome: Number of teenagers in customer's household
*   Dt_Customer: Date of customer's enrollment with the company
*   Recency: Number of days since customer's last purchase
*   Complain: 1 if the customer complained in the last 2 years, 0 otherwise

**Products**

*   MntWines: Amount spent on wine in last 2 years
*   MntFruits: Amount spent on fruits in last 2 years
*   MntMeatProducts: Amount spent on meat in last 2 years
*   MntFishProducts: Amount spent on fish in last 2 years
*   MntSweetProducts: Amount spent on sweets in last 2 years
*   MntGoldProds: Amount spent on gold in last 2 years

**Promotion**


*   NumDealsPurchases: Number of purchases made with a discount
*   AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
*   AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
*   AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
*   AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
*   AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
*   Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

**Place**

*   NumWebPurchases: Number of purchases made through the company’s website
*   NumCatalogPurchases: Number of purchases made using a catalogue
*   NumWebVisitsMonth: Number of visits to company’s website in the last month
*   NumStorePurchases: Number of purchases made directly in stores

## Goals

Perform clustering to summarize customer segments.

# Connect to Google Drive

In [36]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [37]:
%cd /content/gdrive/My Drive/Portfolio/Data Science/Python/Customer Segmentation/CSV

/content/gdrive/My Drive/Portfolio/Data Science/Python/Customer Segmentation/CSV


# Import Packages

In [38]:
import pandas as pd
import numpy as np
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

from statistics import mean
from scipy.stats import skew

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import davies_bouldin_score, silhouette_score, calinski_harabasz_score
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering

from yellowbrick.cluster import KElbowVisualizer
from collections import defaultdict

# Import Data

In [39]:
# data = pd.read_csv("../Data/final_data.csv", index_col = 0)

# rfm = pd.read_csv("../Data/RFM/rfm_analysis.csv", index_col = 0)

data = pd.read_csv("final_data.csv", index_col = 0)

rfm = pd.read_csv("rfm_analysis.csv", index_col = 0)

# Final Segmentation

In [40]:
data = pd.merge(data[['ID', 'Marital_Status', 'Education', 'Income_Class', 'Age_Class', 'Kidhome', 'Teenhome']], rfm, how = 'inner', on = 'ID')
data

Unnamed: 0,ID,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Cluster,Frequency_Cluster,Wines_Cluster,Fruits_Cluster,Meat_Cluster,Fish_Cluster,Sweets_Cluster,Gold_Cluster
0,5524,0,1,2,3,0,0,1,2,1,1,2,2,1,1
1,2174,0,1,1,3,1,1,1,0,0,0,0,0,0,0
2,4141,1,1,3,2,0,0,0,1,1,1,0,1,0,0
3,6182,1,1,0,0,1,0,0,0,0,0,0,0,0,0
4,5324,1,3,2,1,1,0,2,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2207,10870,1,1,2,2,0,1,1,1,1,1,0,0,2,2
2208,4001,1,3,2,3,2,1,1,2,1,0,0,0,0,0
2209,7270,0,1,2,1,0,0,2,1,2,1,1,0,0,0
2210,8235,1,2,3,3,0,1,0,1,1,0,1,1,0,1


In [41]:
data.to_csv("cluster_data.csv")

In [42]:
min_customer = 10

## Functions

In [43]:
# Count customers per demographic segment within one product cluster
def get_potential_customer(df, product_cluster, cluster):
  temp = df.loc[df[product_cluster] == cluster][['Marital_Status', 'Education', 'Income_Class', 'Age_Class', 'Kidhome', 'Teenhome']] \
          .groupby(['Marital_Status', 'Education', 
                    'Income_Class', 'Age_Class', 
                    'Kidhome', 'Teenhome']) \
          .value_counts().reset_index()
  temp = temp.rename(columns = {0: 'Potential_Customers'})

  return temp

In [44]:
def get_potential_percentage(df, temp, min_customer):
  re = df[['Marital_Status', 'Education',  'Income_Class', 'Age_Class', 'Kidhome', 'Teenhome']] \
            .groupby(['Marital_Status', 'Education', 
                      'Income_Class', 'Age_Class', 
                      'Kidhome', 'Teenhome']) \
            .value_counts().reset_index()

  re = re.rename(columns = {0: 'Customers'})
  re = pd.merge(re, temp, how = 'inner', on = ['Marital_Status', 'Education', 
                                                    'Income_Class', 'Age_Class',
                                                    'Kidhome', 'Teenhome'])

  # Keep only segments with at least `min_customer` customers
  # (.copy() avoids a SettingWithCopyWarning on the assignment below)
  re = re.loc[re['Customers'] >= min_customer].copy()
  re['Potential_Percentage'] = 100 * re['Potential_Customers'] / re['Customers']
  re = re.sort_values(by = 'Potential_Percentage', ascending = False).reset_index(drop = True)
  
  return re

In [45]:
def display_rfm(df, product_cluster):
  df = df[['Recency_Cluster', 'Frequency_Cluster', product_cluster]] \
            .groupby(['Recency_Cluster', 'Frequency_Cluster', product_cluster]) \
            .value_counts().reset_index()
  df = df.rename(columns = {0: 'value'})

  # Recency and Frequency Percentage
  recency_data = round(100 * df.groupby('Recency_Cluster')['value'].sum() / \
                       df['value'].sum(), 3)
  frequency_data = round(100 * df.groupby('Frequency_Cluster')['value'].sum() / \
                         df['value'].sum(), 3)
  monetary_data = round(100 * df.groupby(product_cluster)['value'].sum() / \
                        df['value'].sum(), 3)

  display(recency_data)
  display(frequency_data)
  display(monetary_data)

In [46]:
def get_concern(df, product_data, mask):
  concern_data = df[df['ID'].isin(mask)][['Marital_Status', 'Education', 
                                          'Income_Class', 'Age_Class', 
                                          'Kidhome', 'Teenhome']] \
                      .groupby(['Marital_Status', 'Education', 
                                'Income_Class', 'Age_Class', 
                                'Kidhome', 'Teenhome']) \
                      .value_counts().reset_index()
  concern_data = concern_data.rename(columns = {0: 'Recency_Customers'})
  concern_data = pd.merge(concern_data, product_data, how = 'inner', on = ['Marital_Status', 'Education', 
                                                                           'Income_Class', 'Age_Class',
                                                                           'Kidhome', 'Teenhome'])
  concern_data['Recency_Percentage'] = 100 * concern_data['Recency_Customers'] / concern_data['Customers']
  concern_data = concern_data.sort_values(by = ['Potential_Percentage', 'Recency_Percentage'], ascending = False).reset_index(drop = True)
  
  return concern_data

## Wines

In [47]:
# Potential Wines Customer
temp = get_potential_customer(data, 'Wines_Cluster', 2)

wines = get_potential_percentage(data, temp, min_customer)
wines

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,1,3,3,0,0,0,11,9,81.818182
1,1,3,3,2,0,1,10,8,80.0
2,1,3,3,1,0,0,15,9,60.0
3,0,2,3,2,0,0,10,6,60.0
4,1,3,3,3,0,0,16,9,56.25
5,0,3,2,2,0,1,14,7,50.0
6,0,3,3,2,0,0,14,7,50.0
7,0,2,3,0,0,0,11,5,45.454545
8,1,1,3,2,0,1,22,10,45.454545
9,1,1,3,3,0,0,32,14,43.75


Keep only the characteristics for which the share of potential customers is at least a chosen threshold.
Here, we set the threshold to **60%**, which we consider a reasonable level for saying that a set of characteristics describes the potential customers we are looking for.
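As a rough sketch of the arithmetic behind `Potential_Percentage`, the snippet below rebuilds four rows of the wines table by hand and applies both filters. The frame `toy` and the helper name `filter_segments` are made up for illustration and are not part of this notebook.

```python
import pandas as pd

# Counts copied from four rows of the wines table above.
toy = pd.DataFrame({
    "Customers": [11, 10, 15, 16],
    "Potential_Customers": [9, 8, 9, 9],
})

def filter_segments(df, min_customer=10, threshold=60):
    """Hypothetical helper: keep segments that are large enough and whose
    share of potential customers (in percent) reaches the threshold."""
    out = df.loc[df["Customers"] >= min_customer].copy()
    out["Potential_Percentage"] = 100 * out["Potential_Customers"] / out["Customers"]
    return out.loc[out["Potential_Percentage"] >= threshold].reset_index(drop=True)

top = filter_segments(toy)
print(top)
```

The last toy row (9 of 16 customers, 56.25%) falls below the threshold and is dropped, while the first three rows are kept.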

In [48]:
threshold = 60
top_wines = wines.loc[wines['Potential_Percentage'] >= threshold].reset_index(drop = True)
top_wines.head(10)

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,1,3,3,0,0,0,11,9,81.818182
1,1,3,3,2,0,1,10,8,80.0
2,1,3,3,1,0,0,15,9,60.0
3,0,2,3,2,0,0,10,6,60.0


From above, we can see that potential wine buyers are most likely to meet these criteria (the three married, doctoral-degree rows of the table):

First segment
1.   Is **married**.
2.   Possess a **doctoral degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   Is **younger than 40**.
5.   **Do not** have any kid nor teen at home.

Second segment
1.   Is **married**.
2.   Possess a **doctoral degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   Around **50 - 65** years old.
5.   **Do not** have any kid and **have** 1 teen at home.

Third segment
1.   Is **married**.
2.   Possess a **doctoral degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   **Around 40 - 50** years old.
5.   **Do not** have any kid nor teen at home.
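The segment descriptions above are read off integer class codes. A minimal sketch of that decoding step follows. The code-to-label maps are inferred from the prose of this notebook rather than taken from the preprocessing code, so treat them as assumptions; `LABELS` and `describe_segments` are hypothetical names.

```python
import pandas as pd

# Code-to-label maps inferred from the segment descriptions in this notebook.
# Codes that the text never describes are left out on purpose.
LABELS = {
    "Marital_Status": {0: "single", 1: "married"},
    "Education": {1: "graduate", 2: "master", 3: "doctoral"},
    "Income_Class": {2: "upper-middle", 3: "high"},
    "Age_Class": {0: "<40", 1: "40-50", 2: "50-65", 3: ">65"},
}

def describe_segments(df):
    """Hypothetical helper: return a copy with class codes replaced by labels.
    Codes missing from LABELS become NaN."""
    out = df.copy()
    for col, mapping in LABELS.items():
        out[col] = out[col].map(mapping)
    return out

# The three wine segments listed above, written as class codes.
segments = pd.DataFrame({
    "Marital_Status": [1, 1, 1],
    "Education": [3, 3, 3],
    "Income_Class": [3, 3, 3],
    "Age_Class": [0, 2, 1],
    "Kidhome": [0, 0, 0],
    "Teenhome": [0, 1, 0],
})
labeled = describe_segments(segments)
print(labeled)
```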

Next, we check whether the customers in these segments are active.

In [49]:
wines_potential = data.loc[((data['Marital_Status'] == 1) &
                            (data['Education'] == 3) &
                            (data['Income_Class'] == 3) &
                            ((data['Age_Class'] == 0) | (data['Age_Class'] == 1)) &
                            (data['Kidhome'] == 0) &
                            (data['Teenhome'] == 0)) |
                           
                           ((data['Marital_Status'] == 1) &
                            (data['Education'] == 3) &
                            (data['Income_Class'] == 3) &
                            (data['Age_Class'] == 2) &
                            (data['Kidhome'] == 0) &
                            (data['Teenhome'] == 1))] \
                    .groupby(['Marital_Status', 'Education', 
                              'Income_Class', 'Age_Class', 
                              'Kidhome', 'Teenhome']) \
                    .value_counts().reset_index()

display_rfm(wines_potential, 'Wines_Cluster')

Recency_Cluster
0    38.889
1    41.667
2    19.444
Name: value, dtype: float64

Frequency_Cluster
0     8.333
1    58.333
2    33.333
Name: value, dtype: float64

Wines_Cluster
0     5.556
1    22.222
2    72.222
Name: value, dtype: float64

Notice that the percentage of **inactive potential customers** (Recency_Cluster 0, 38.889%) is **quite high**, even though these customers contribute the most to our revenue. This is concerning.
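The percentages printed above are relative frequencies of each cluster label. A minimal sketch of the same idea, assuming one row per customer, uses `value_counts(normalize=True)`. The toy labels below are invented, and cluster 0 is treated as the inactive group, as in the rest of this notebook.

```python
import pandas as pd

# Invented labels for ten customers; 0 is treated as the inactive cluster.
toy = pd.DataFrame({"Recency_Cluster": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]})

# Share of customers per cluster, in percent, ordered by cluster label.
share = (100 * toy["Recency_Cluster"].value_counts(normalize=True)).sort_index().round(3)
print(share)

inactive_share = share.loc[0]
```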

In [50]:
# Distribution of potential customers who are close to churning (Recency_Cluster 0)
mask = rfm[(rfm['Recency_Cluster'] == 0) & (rfm['Wines_Cluster'] == 2)]['ID'].values
concern_wines = get_concern(data, wines, mask)
concern_wines

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Customers,Customers,Potential_Customers,Potential_Percentage,Recency_Percentage
0,1,3,3,0,0,0,2,11,9,81.818182,18.181818
1,1,3,3,2,0,1,2,10,8,80.0,20.0
2,1,3,3,1,0,0,6,15,9,60.0,40.0
3,0,2,3,2,0,0,3,10,6,60.0,30.0
4,1,3,3,3,0,0,2,16,9,56.25,12.5
5,0,3,3,2,0,0,4,14,7,50.0,28.571429
6,0,3,2,2,0,1,2,14,7,50.0,14.285714
7,0,2,3,0,0,0,3,11,5,45.454545,27.272727
8,1,1,3,2,0,1,4,22,10,45.454545,18.181818
9,1,1,3,3,0,0,2,32,14,43.75,6.25


Notice that inactivity is **not concentrated** in any single set of characteristics. Also, notice that the churn rate of our potential customers in the three segments above is **relatively low** (**<=20%**), **except** for the customers who are **40-50 years old** (40%). Thus, there are 2 options here:

1. Run a **discount / campaign** to attract them to buy wines again.
2. **Shift** our focus to a fourth segment, which is exactly the same as the third except that its customers are **>65** years old. Its **potential** rate is fairly high (**56.25%**) and its **churn** rate is only **12.5%**.
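The churn figures discussed here are simple ratios, sketched below with counts copied from the married, doctoral-degree rows of the table above. The column `Above_Limit` and the variable `churn_limit` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Counts copied from the married, doctoral-degree rows of the table above.
toy = pd.DataFrame({
    "Age_Class": [0, 2, 1, 3],
    "Customers": [11, 10, 15, 16],
    "Recency_Customers": [2, 2, 6, 2],
})

churn_limit = 20  # percent; the cut-off used in the text
toy["Recency_Percentage"] = 100 * toy["Recency_Customers"] / toy["Customers"]
toy["Above_Limit"] = toy["Recency_Percentage"] > churn_limit
print(toy)
```

Only the `Age_Class` 1 row (6 of 15 customers, 40%) exceeds the limit, which is the segment singled out above.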


## Fruits

In [51]:
# Potential Fruits Customer
temp = get_potential_customer(data, 'Fruits_Cluster', 2)

fruits = get_potential_percentage(data, temp, min_customer)
fruits

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,1,3,1,0,0,18,9,50.0
1,0,1,3,2,0,0,18,9,50.0
2,1,1,2,2,0,0,10,5,50.0
3,1,1,3,0,0,0,20,9,45.0
4,1,2,3,2,0,0,23,10,43.478261
5,0,1,3,0,0,0,23,8,34.782609
6,1,1,3,3,0,1,10,3,30.0
7,0,3,3,2,0,0,14,4,28.571429
8,1,1,3,3,0,0,32,9,28.125
9,1,1,3,1,0,0,26,7,26.923077


Keep only the characteristics for which the share of potential customers is at least a chosen threshold.
Here, we set the threshold to **60%**, which we consider a reasonable level for saying that a set of characteristics describes the potential customers we are looking for.

However, notice that **no characteristic** reaches that threshold. Thus, we lower the threshold to **50%**.
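The fallback from 60% to 50% can also be written as a small loop. This is only a sketch of the idea: `select_top` is a hypothetical helper, and the toy percentages are copied from the first four rows of the fruits table.

```python
import pandas as pd

def select_top(df, thresholds=(60, 50)):
    """Hypothetical helper: try each threshold in order and return the first
    non-empty selection together with the threshold that produced it."""
    for threshold in thresholds:
        top = df.loc[df["Potential_Percentage"] >= threshold].reset_index(drop=True)
        if not top.empty:
            return top, threshold
    return df.iloc[0:0], None  # nothing passed any threshold

# Potential percentages copied from the first four rows of the fruits table.
toy = pd.DataFrame({"Potential_Percentage": [50.0, 50.0, 50.0, 45.0]})
top, used = select_top(toy)
print(used, len(top))
```

Returning the threshold alongside the selection makes it explicit which cut-off was actually applied for each product.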

In [52]:
threshold = 50
top_fruits = fruits.loc[fruits['Potential_Percentage'] >= threshold].reset_index(drop = True)
top_fruits.head(10)

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,1,3,1,0,0,18,9,50.0
1,0,1,3,2,0,0,18,9,50.0
2,1,1,2,2,0,0,10,5,50.0


From above, we can see that potential fruit buyers are most likely to meet these criteria:

First segment
1.   Is **single**.
2.   Possess a **graduate degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   Around **40 - 65** years old.
5.   **Do not** have any kid nor teen at home.

Second segment
1.   Is **married**.
2.   Possess a **graduate degree**.
3.   Have an **upper-middle income** (USD 51K++ Annual income).
4.   Around **50 - 65** years old.
5.   **Do not** have any kid nor teen at home.

Next, we check whether the customers in these segments are active.

In [53]:
fruits_potential = data.loc[((data['Marital_Status'] == 0) &
                             (data['Education'] == 1) &
                             (data['Income_Class'] == 3) &
                             ((data['Age_Class'] == 1) | (data['Age_Class'] == 2)) &
                             (data['Kidhome'] == 0) &
                             (data['Teenhome'] == 0))|
                            
                            ((data['Marital_Status'] == 1) &
                             (data['Education'] == 1) &
                             (data['Income_Class'] == 2) &
                             (data['Age_Class'] == 2) &
                             (data['Kidhome'] == 0) &
                             (data['Teenhome'] == 0))] \
                    .groupby(['Marital_Status', 'Education', 
                              'Income_Class', 'Age_Class', 
                              'Kidhome', 'Teenhome']) \
                    .value_counts().reset_index()

display_rfm(fruits_potential, 'Fruits_Cluster')

Recency_Cluster
0    26.087
1    19.565
2    54.348
Name: value, dtype: float64

Frequency_Cluster
0    23.913
1    54.348
2    21.739
Name: value, dtype: float64

Fruits_Cluster
0    17.391
1    32.609
2    50.000
Name: value, dtype: float64

Notice that the percentage of **inactive potential customers** (Recency_Cluster 0, 26.087%) is **high**, even though these customers contribute the most to our revenue. This is concerning.

In [54]:
# Distribution of potential customers who are close to churning (Recency_Cluster 0)
mask = rfm[(rfm['Recency_Cluster'] == 0) & (rfm['Fruits_Cluster'] == 2)]['ID'].values
concern_fruits = get_concern(data, fruits, mask)
concern_fruits

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Customers,Customers,Potential_Customers,Potential_Percentage,Recency_Percentage
0,0,1,3,2,0,0,4,18,9,50.0,22.222222
1,0,1,3,1,0,0,2,18,9,50.0,11.111111
2,1,1,2,2,0,0,1,10,5,50.0,10.0
3,1,1,3,0,0,0,4,20,9,45.0,20.0
4,1,2,3,2,0,0,2,23,10,43.478261,8.695652
5,0,1,3,0,0,0,2,23,8,34.782609,8.695652
6,0,3,3,2,0,0,1,14,4,28.571429,7.142857
7,1,1,3,3,0,0,3,32,9,28.125,9.375
8,1,1,3,1,0,0,5,26,7,26.923077,19.230769
9,1,1,3,2,0,0,8,45,12,26.666667,17.777778


Notice that inactivity is **not concentrated** in any single set of characteristics. Also, notice that the churn rate of our potential customers is **relatively low** (**<=20%**), **except** for the single customers who are **50-65 years old** (22.2%). Since this **exceeds** the 20% level by only about **2 percentage points** and we **cannot shift** our attention to another customer segment (the others have low potential rates), the only option we have is to run a **discount / campaign** to attract them to buy fruits again.

## Meat

In [55]:
# Potential Meat Customer
temp = get_potential_customer(data, 'Meat_Cluster', 2)

meat = get_potential_percentage(data, temp, min_customer)
meat

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,2,3,2,0,0,10,8,80.0
1,0,2,3,0,0,0,11,7,63.636364
2,0,1,3,2,0,0,18,11,61.111111
3,1,1,3,1,0,0,26,14,53.846154
4,1,1,3,3,0,0,32,16,50.0
5,0,1,3,0,0,0,23,11,47.826087
6,1,1,3,2,0,0,45,21,46.666667
7,1,3,3,2,0,0,15,7,46.666667
8,1,2,3,3,0,0,15,7,46.666667
9,1,1,3,0,0,0,20,9,45.0


Keep only the characteristics for which the share of potential customers is at least a chosen threshold.
Here, we set the threshold to **60%**, which we consider a reasonable level for saying that a set of characteristics describes the potential customers we are looking for.

In [56]:
threshold = 60
top_meat = meat.loc[meat['Potential_Percentage'] >= threshold].reset_index(drop = True)
top_meat.head(10)

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,2,3,2,0,0,10,8,80.0
1,0,2,3,0,0,0,11,7,63.636364
2,0,1,3,2,0,0,18,11,61.111111


From above, we can see that potential meat buyers are most likely to meet these criteria:

First segment
1.   Is **single**.
2.   Possess a **master's degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   Is **younger than 40** or around **50 - 65** years old.
5.   **Do not** have any kid nor teen at home.

Second segment
1.   Is **single**.
2.   Possess a **graduate degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   Around **50 - 65** years old.
5.   **Do not** have any kid nor teen at home.

Next, we check whether the customers in these segments are active.

In [57]:
meat_potential = data.loc[((data['Marital_Status'] == 0) &
                           (data['Education'] == 2) &
                           (data['Income_Class'] == 3) &
                           (((data['Age_Class'] == 0)) | (data['Age_Class'] == 2)) &
                           (data['Kidhome'] == 0) &
                           (data['Teenhome'] <= 0)) |
                          
                          ((data['Marital_Status'] == 0) &
                           (data['Education'] == 1) &
                           (data['Income_Class'] == 3) &
                           (data['Age_Class'] == 2) &
                           (data['Kidhome'] == 0) &
                           (data['Teenhome'] <= 0))] \
                    .groupby(['Marital_Status', 'Education', 
                              'Income_Class', 'Age_Class', 
                              'Kidhome', 'Teenhome']) \
                    .value_counts().reset_index()

display_rfm(meat_potential, 'Meat_Cluster')

Recency_Cluster
0    38.462
1    25.641
2    35.897
Name: value, dtype: float64

Frequency_Cluster
0    15.385
1    69.231
2    15.385
Name: value, dtype: float64

Meat_Cluster
0     2.564
1    30.769
2    66.667
Name: value, dtype: float64

Notice that the percentage of **inactive potential customers** (Recency_Cluster 0, 38.462%) is **high**, even though these customers contribute the most to our revenue. This is concerning.

In [58]:
# Distribution of potential customers who are close to churning (Recency_Cluster 0)
mask = rfm[(rfm['Recency_Cluster'] == 0) & (rfm['Meat_Cluster'] == 2)]['ID'].values
concern_meat = get_concern(data, meat, mask)
concern_meat

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Customers,Customers,Potential_Customers,Potential_Percentage,Recency_Percentage
0,0,2,3,2,0,0,3,10,8,80.0,30.0
1,0,2,3,0,0,0,2,11,7,63.636364,18.181818
2,0,1,3,2,0,0,3,18,11,61.111111,16.666667
3,1,1,3,1,0,0,8,26,14,53.846154,30.769231
4,1,1,3,3,0,0,5,32,16,50.0,15.625
5,0,1,3,0,0,0,3,23,11,47.826087,13.043478
6,1,1,3,2,0,0,9,45,21,46.666667,20.0
7,1,2,3,3,0,0,3,15,7,46.666667,20.0
8,1,3,3,2,0,0,1,15,7,46.666667,6.666667
9,1,1,3,0,0,0,4,20,9,45.0,20.0


Notice that inactivity is **not concentrated** in any single set of characteristics. Also, notice that the churn rate of our potential customers is **relatively low** (**<=20%**), **except** for the customers with a master's degree who are **50-65 years old** (30%). Since this **exceeds** the 20% level by **10 percentage points** and these are our main customers for this product, we have to **work hard** to make sure that they **do not** leave us. One option is to run a **discount / campaign** to attract them to buy meat again.

## Fish

In [59]:
# Potential Fish Customer
temp = get_potential_customer(data, 'Fish_Cluster', 2)

fish = get_potential_percentage(data, temp, min_customer)
fish

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,1,3,1,0,0,18,9,50.0
1,1,1,3,3,0,1,10,4,40.0
2,0,2,3,2,0,0,10,4,40.0
3,1,1,3,2,0,0,45,18,40.0
4,0,1,3,0,0,0,23,9,39.130435
5,0,1,3,2,0,0,18,7,38.888889
6,1,1,3,3,0,0,32,12,37.5
7,1,1,3,1,0,1,16,6,37.5
8,0,3,3,2,0,0,14,5,35.714286
9,1,2,3,2,0,0,23,8,34.782609


Keep only the characteristics for which the share of potential customers is at least a chosen threshold.
Here, we set the threshold to **60%**, which we consider a reasonable level for saying that a set of characteristics describes the potential customers we are looking for.

However, notice that **no customer segment** reaches that threshold. Thus, we lower the threshold to **50%** in this case.

In [60]:
threshold = 50
top_fish = fish.loc[fish['Potential_Percentage'] >= threshold].reset_index(drop = True)
top_fish.head(10)

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,1,3,1,0,0,18,9,50.0


From above, we can see that potential fish buyers are most likely to meet these criteria:

First segment
1.   Is **single**.
2.   Possess a **graduate degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   Around **40 - 50** years old.
5.   **Do not** have any kid nor teen at home.

Next, we check whether the customers in this segment are active.

In [61]:
fish_potential = data.loc[((data['Marital_Status'] == 0) &
                           (data['Education'] == 1) &
                           (data['Income_Class'] == 3) &
                           (data['Age_Class'] == 1) &
                           (data['Kidhome'] == 0) &
                           (data['Teenhome'] == 0))] \
                    .groupby(['Marital_Status', 'Education', 
                              'Income_Class', 'Age_Class', 
                              'Kidhome', 'Teenhome']) \
                    .value_counts().reset_index()

display_rfm(fish_potential, 'Fish_Cluster')

Recency_Cluster
0    16.667
1    27.778
2    55.556
Name: value, dtype: float64

Frequency_Cluster
0    22.222
1    61.111
2    16.667
Name: value, dtype: float64

Fish_Cluster
0    27.778
1    22.222
2    50.000
Name: value, dtype: float64

Notice that 16.667% of the customers in this segment are **inactive** (Recency_Cluster 0), even though this segment contributes the most to our revenue for this product.

In [62]:
# Distribution of potential customers who are close to churning (Recency_Cluster 0)
mask = rfm[(rfm['Recency_Cluster'] == 0) & (rfm['Fish_Cluster'] == 2)]['ID'].values
concern_fish = get_concern(data, fish, mask)
concern_fish

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Customers,Customers,Potential_Customers,Potential_Percentage,Recency_Percentage
0,1,1,3,2,0,0,7,45,18,40.0,15.555556
1,1,1,3,3,0,1,1,10,4,40.0,10.0
2,0,1,3,0,0,0,4,23,9,39.130435,17.391304
3,0,1,3,2,0,0,1,18,7,38.888889,5.555556
4,1,1,3,3,0,0,5,32,12,37.5,15.625
5,1,1,3,1,0,1,1,16,6,37.5,6.25
6,0,3,3,2,0,0,2,14,5,35.714286,14.285714
7,1,2,3,2,0,0,2,23,8,34.782609,8.695652
8,1,1,3,1,0,0,5,26,9,34.615385,19.230769
9,1,1,3,0,0,0,3,20,6,30.0,15.0


Notice that inactivity is **not concentrated** in any single set of characteristics. Thus, we will try to **generalize** the **marketing strategies** to bring back these types of customers, as well as target new potential customers who match these criteria.

In addition, notice that our best segment (single, graduate degree, high income, 40-50 years old) does not appear in the table above. None of its potential customers fall in Recency_Cluster 0, which means they are all active or nearly active customers.
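A segment disappears from the table because `get_concern` joins with `how = 'inner'`, which drops segments that have no inactive potential customers. The sketch below, on invented toy frames with a shortened key list, shows how a left merge would keep such a segment with a count of 0 instead. This is an alternative for illustration, not what the notebook does.

```python
import pandas as pd

keys = ["Marital_Status", "Age_Class"]  # shortened key list for the example

# Segment-level table (like `fish`) and counts of inactive potential customers.
product_data = pd.DataFrame({
    "Marital_Status": [0, 1],
    "Age_Class": [1, 2],
    "Customers": [18, 45],
})
inactive_counts = pd.DataFrame({
    "Marital_Status": [1],
    "Age_Class": [2],
    "Recency_Customers": [7],
})

# A left merge keeps every segment; missing counts become 0 instead of vanishing.
merged = pd.merge(product_data, inactive_counts, how="left", on=keys)
merged["Recency_Customers"] = merged["Recency_Customers"].fillna(0).astype(int)
merged["Recency_Percentage"] = 100 * merged["Recency_Customers"] / merged["Customers"]
print(merged)
```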

## Sweets

In [63]:
# Potential Sweets Customer
temp = get_potential_customer(data, 'Sweets_Cluster', 2)

sweets = get_potential_percentage(data, temp, min_customer)
sweets

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,1,3,2,0,0,18,10,55.555556
1,0,2,3,2,0,0,10,5,50.0
2,0,1,3,2,0,1,15,7,46.666667
3,0,2,3,0,0,0,11,5,45.454545
4,1,1,3,0,0,0,20,8,40.0
5,0,1,3,1,0,0,18,7,38.888889
6,1,1,3,3,0,0,32,12,37.5
7,1,1,3,1,0,0,26,9,34.615385
8,1,2,3,3,0,0,15,5,33.333333
9,1,1,3,1,0,1,16,5,31.25


Keep only the characteristics for which the share of potential customers is at least a chosen threshold.
Here, we set the threshold to **60%**, which we consider a reasonable level for saying that a set of characteristics describes the potential customers we are looking for.

However, notice that **no segment** reaches that threshold. Thus, we lower the threshold to **50%** in this case.

In [64]:
threshold = 50
top_sweets = sweets.loc[sweets['Potential_Percentage'] >= threshold].reset_index(drop = True)
top_sweets.head(10)

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,1,3,2,0,0,18,10,55.555556
1,0,2,3,2,0,0,10,5,50.0


From above, we can see that potential sweets buyers are most likely to meet these criteria:

First segment
1.   Marital status does not matter, although the highest-ranked rows in the table above are **single** customers (Marital_Status 0).
2.   Possess a **graduate degree**.
3.   Have a **high income** (USD 68K++ Annual income).
4.   **Do not** have any kid nor teen at home.   

Next, we check whether the customers in this segment are active.

In [65]:
sweets_potential = data.loc[(data['Education'] == 1) &
                            (data['Income_Class'] == 3) &
                            (data['Kidhome'] == 0) &
                            (data['Teenhome'] == 0)] \
                    .groupby(['Marital_Status', 'Education', 
                              'Income_Class', 'Age_Class', 
                              'Kidhome', 'Teenhome']) \
                    .value_counts().reset_index()

display_rfm(sweets_potential, 'Sweets_Cluster')

Recency_Cluster
0    36.585
1    28.780
2    34.634
Name: value, dtype: float64

Frequency_Cluster
0    10.732
1    72.683
2    16.585
Name: value, dtype: float64

Sweets_Cluster
0    33.659
1    35.610
2    30.732
Name: value, dtype: float64

Notice that the percentage of **inactive potential customers** (Recency_Cluster 0, 36.585%) is **high**, even though these customers contribute the most to our revenue. This is concerning.

In [66]:
# Distribution of potential customers who are close to churning (Recency_Cluster 0)
mask = rfm[(rfm['Recency_Cluster'] == 0) & (rfm['Sweets_Cluster'] == 2)]['ID'].values
concern_sweets = get_concern(data, sweets, mask)
concern_sweets

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Customers,Customers,Potential_Customers,Potential_Percentage,Recency_Percentage
0,0,1,3,2,0,0,4,18,10,55.555556,22.222222
1,0,2,3,2,0,0,3,10,5,50.0,30.0
2,0,1,3,2,0,1,1,15,7,46.666667,6.666667
3,0,2,3,0,0,0,3,11,5,45.454545,27.272727
4,1,1,3,0,0,0,3,20,8,40.0,15.0
5,0,1,3,1,0,0,2,18,7,38.888889,11.111111
6,1,1,3,3,0,0,3,32,12,37.5,9.375
7,1,1,3,1,0,0,3,26,9,34.615385,11.538462
8,1,2,3,3,0,0,2,15,5,33.333333,13.333333
9,1,1,3,1,0,1,2,16,5,31.25,12.5


Notice that inactivity is **not concentrated** in any single set of characteristics. Thus, we will try to **generalize** the **marketing strategies** to bring back these types of customers, as well as target new potential customers who match these criteria.

Also, notice that our potential buyers all have a high churn rate (**>20%**). Since the **other segments** have considerably **lower potential** rates, it is recommended to **focus** on making these customers **keep buying** our product instead of promoting to other customer segments.

## Gold

In [67]:
# Potential Gold Customer
temp = get_potential_customer(data, 'Gold_Cluster', 2)

gold = get_potential_percentage(data, temp, min_customer)
gold

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Customers,Potential_Customers,Potential_Percentage
0,0,2,2,2,0,1,14,6,42.857143
1,1,1,3,3,0,1,10,4,40.0
2,1,1,3,0,0,0,20,8,40.0
3,0,1,2,1,0,1,14,5,35.714286
4,0,1,3,1,0,0,18,6,33.333333
5,0,1,3,3,0,0,23,7,30.434783
6,0,2,3,2,0,0,10,3,30.0
7,1,1,3,2,0,0,45,13,28.888889
8,1,1,3,3,0,0,32,9,28.125
9,1,1,3,2,0,1,22,6,27.272727


Keep only the characteristics for which the share of potential customers is at least a chosen threshold.
Here, we set the threshold to **60%**, which we consider a reasonable level for saying that a set of characteristics describes the potential customers we are looking for.

However, notice that **no customer segment** reaches that threshold. Since the **highest** potential rate is merely **42.857%**, we cannot say that any segment clearly holds our potential customers. So, we have 2 options here:
1. **Do not** spend any money on marketing gold-related products.
2. **Bet** our money on the **"best" customer segment**, even though the success rate is only about 43%.

If we choose the 2nd option, we get the following results.

In [68]:
threshold = 60
top_gold = gold.iloc[0].reset_index(drop = True)
top_gold.head(10)

0     0.000000
1     2.000000
2     2.000000
3     2.000000
4     0.000000
5     1.000000
6    14.000000
7     6.000000
8    42.857143
Name: 0, dtype: float64
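The output above shows positions 0 to 8 instead of column names because `reset_index(drop = True)` on a single row (a Series) discards its labels. A small sketch of two ways to keep them, using a toy frame with values copied from the first two rows of the gold table:

```python
import pandas as pd

# Three columns copied from the first two rows of the gold table.
toy = pd.DataFrame({
    "Customers": [14, 10],
    "Potential_Customers": [6, 4],
    "Potential_Percentage": [42.857143, 40.0],
})

as_series = toy.iloc[0]    # Series indexed by column name
as_frame = toy.iloc[[0]]   # one-row DataFrame, column names preserved

print(as_series)
print(as_frame)
```

Passing a list to `iloc` keeps the result as a DataFrame, which displays the same way as the other `top_*` tables.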

From above, we can see that potential gold buyers are most likely to meet these criteria:

First segment
1.   Is **single**.
2.   Possess a **master's degree**.
3.   Have an **upper-middle income** (USD 51K++ Annual income).
4.   Around **50 - 65** years old.
5.   **Do not** have any kid and **have** 1 teen at home.

Next, we check whether the customers in this segment are active.

In [69]:
gold_potential = data.loc[((data['Marital_Status'] == 0) &
                           (data['Education'] == 2) &
                           (data['Income_Class'] == 2) &
                           (data['Age_Class'] == 2) &
                           (data['Kidhome'] == 0) &
                           (data['Teenhome'] == 1))] \
                    .groupby(['Marital_Status', 'Education', 
                              'Income_Class', 'Age_Class', 
                              'Kidhome', 'Teenhome']) \
                    .value_counts().reset_index()

display_rfm(gold_potential, 'Gold_Cluster')

Recency_Cluster
0    28.571
1    35.714
2    35.714
Name: value, dtype: float64

Frequency_Cluster
1    21.429
2    78.571
Name: value, dtype: float64

Gold_Cluster
0    35.714
1    21.429
2    42.857
Name: value, dtype: float64

Notice that the percentage of **inactive potential customers** is terribly **high**, even though they are the ones who contribute the most to our revenue. This is concerning.
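
For reference, a per-cluster percentage like the ones printed above could be computed along these lines. The implementation of `display_rfm` is not shown in this section, so `cluster_share` is a hypothetical stand-in, and the 4 / 5 / 5 split over 14 customers is an assumption chosen because it is consistent with the recency percentages in the output.

```python
import pandas as pd

def cluster_share(df, column):
    """Percentage of rows in each cluster of `column`, rounded to 3 decimals."""
    share = df[column].value_counts(normalize=True).sort_index() * 100
    return share.round(3)

# Assumed toy data: 14 customers split 4 / 5 / 5 across recency clusters
toy_rfm = pd.DataFrame({"Recency_Cluster": [0] * 4 + [1] * 5 + [2] * 5})

recency_share = cluster_share(toy_rfm, "Recency_Cluster")
print(recency_share)
```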

In [70]:
# Distribution of potential customers who are close to churning, per segment
mask = rfm[(rfm['Recency_Cluster'] == 0) & (rfm['Gold_Cluster'] == 2)]['ID'].values
concern_gold = get_concern(data, gold, mask)
concern_gold

Unnamed: 0,Marital_Status,Education,Income_Class,Age_Class,Kidhome,Teenhome,Recency_Customers,Customers,Potential_Customers,Potential_Percentage,Recency_Percentage
0,0,2,2,2,0,1,1,14,6,42.857143,7.142857
1,1,1,3,0,0,0,5,20,8,40.0,25.0
2,1,1,3,3,0,1,1,10,4,40.0,10.0
3,0,1,2,1,0,1,3,14,5,35.714286,21.428571
4,0,1,3,1,0,0,1,18,6,33.333333,5.555556
5,0,1,3,3,0,0,1,23,7,30.434783,4.347826
6,0,2,3,2,0,0,1,10,3,30.0,10.0
7,1,1,3,2,0,0,4,45,13,28.888889,8.888889
8,1,1,3,3,0,0,3,32,9,28.125,9.375
9,1,1,3,2,0,1,2,22,6,27.272727,9.090909


Notice that the inactive customers are **not concentrated** in any particular set of characteristics. Thus, we will try to **generalize** the **marketing strategies** to bring back these types of customers, as well as to target new potential customers who match these criteria.

In addition, our **"best" customer segment** also has a **very low churn** rate of only **7%**, which is a good sign. We should work on **promoting** the **product** more to **this customer segment**.
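
The two percentages for the "best" segment can be checked by hand from the first row of the table above. The formula below is inferred from the table columns (the body of `get_concern` is not shown here), so treat it as an assumption that happens to reproduce every row.

```python
def percentage(part, whole):
    """Share of `part` in `whole`, expressed in percent."""
    return part / whole * 100

# Values copied from row 0 of the table above
customers = 14            # Customers
potential_customers = 6   # Potential_Customers
recency_customers = 1     # Recency_Customers

potential_pct = percentage(potential_customers, customers)
recency_pct = percentage(recency_customers, customers)

print(f"Potential: {potential_pct:.2f}%, at risk of churning: {recency_pct:.2f}%")
```

So the 7% churn figure corresponds to a single customer out of 14, which is worth keeping in mind given how small the segment is.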

# Summary & Recommendations

## Wines

**Top customer characteristics**

-------------------

First segment
1.   Is **married**.
2.   Possesses a **doctoral degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is **younger than 40**.
5.   **Does not** have any kids or teens at home.

Second segment
1.   Is **married**.
2.   Possesses a **doctoral degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is around **50 - 65** years old.
5.   **Does not** have any kids and **has** 1 teen at home.

Third segment
1.   Is **married**.
2.   Possesses a **doctoral degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is **around 40 - 50** years old.
5.   **Does not** have any kids or teens at home.

**Recommended Actions**

-------------------

1. Stage a **discount / campaign** to attract customers in the **second segment** to buy wines again, since our potential customers' churn rates are **relatively low** **except** in that segment.
2. **Divert** part of our **attention** to a fourth segment, which is exactly the same as the third, except that its customers are **>65** years old. This is because its **potential** rate is quite high, **56.25%**, with a **churn** rate of only **12.5%**.

## Fruits

**Top customer characteristics**

-------------------

First segment
1.   Is **single**.
2.   Possesses a **graduate degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is around **40 - 65** years old.
5.   **Does not** have any kids or teens at home.

Second segment
1.   Is **married**.
2.   Possesses a **graduate degree**.
3.   Has an **upper-middle income** (USD 51K+ annual income).
4.   Is around **50 - 65** years old.
5.   **Does not** have any kids or teens at home.

**Recommended Actions**

-------------------

Stage a **discount / campaign** to attract our potential customers in the **second segment**. Although their churn rate **exceeds** the threshold by only **2%**, we have no other choice: we **cannot divert** our attention to another customer segment, since the others have a low potential rate.

## Meat

**Top customer characteristics**

-------------------

First segment
1.   Is **single**.
2.   Possesses a **master's degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is **younger than 40** or around **50 - 65** years old.
5.   **Does not** have any kids or teens at home.

Second segment
1.   Is **single**.
2.   Possesses a **graduate degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is around **50 - 65** years old.
5.   **Does not** have any kids or teens at home.

**Recommended Actions**

-------------------

Stage a **discount / campaign** to attract our potential customers in the **second segment**. The churn rate for this segment **exceeds** the threshold by a wide margin (**10%**), and these customers are essentially our main customers for this product.

## Fish

**Top customer characteristics**

-------------------

First segment
1.   Is **single**.
2.   Possesses a **graduate degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is around **40 - 50** years old.
5.   **Does not** have any kids or teens at home.

**Recommended Actions**

-------------------

Focus on **promoting** the **products** to users whose characteristics match those stated above, since (almost) **all** of our best potential customers are **active** customers.

## Sweets

**Top customer characteristics**

-------------------

First segment
1.   Marital status does not matter, although married customers are **more likely** to buy.
2.   Possesses a **graduate degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   **Does not** have any kids or teens at home.

**Recommended Actions**

-------------------

**Divert** our focus to making these customers **keep buying** our product instead of promoting to other customer segments, since all of our potential buyers have a high churn rate (**>20%**).

## Gold

There are 2 options here:
1. **Do not** spend any money marketing gold-related products.
2. **Bet** our money on the **"best" customer segment**, even though the success rate is only 42%.

**Top customer characteristics**

-------------------

First segment
1.   Is **married**.
2.   Possesses a **graduate degree**.
3.   Has a **high income** (USD 68K+ annual income).
4.   Is **at least** in their **40s**.
5.   **Does not** have any kids and has **at most** 1 teen at home.

**Recommended Actions**

-------------------

Work on **promoting** the **product** more to **this customer segment**, since its churn rate is very low.