# Example - K-means clustering

---

Customer Personality Analysis is a detailed examination of a company's ideal customers. It helps a business understand its customers and tailor products to the distinct needs, behaviors, and concerns of different customer types.

With such an analysis, a business can refine its products based on the preferences of specific customer segments. Rather than spending resources to market a new product to the entire customer database, the company can identify the segments most likely to be interested and direct its marketing at those segments only, which uses resources more efficiently and raises the chance that the product is adopted.

### License: Apache 2.0
### Source link: https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering/data

## Customer Marketing Dataset – Feature Descriptions

- **ID**: Unique identifier for each individual in the dataset.
- **Year_Birth**: The birth year of the individual.
- **Education**: The highest level of education attained by the individual.
- **Marital_Status**: The marital status of the individual.
- **Income**: The annual income of the individual.
- **Kidhome**: The number of young children in the household.
- **Teenhome**: The number of teenagers in the household.
- **Dt_Customer**: The date when the customer was first enrolled or became a part of the company's database.
- **Recency**: The number of days since the last purchase or interaction.
- **MntWines**: The amount spent on wines.
- **MntFruits**: The amount spent on fruits.
- **MntMeatProducts**: The amount spent on meat products.
- **MntFishProducts**: The amount spent on fish products.
- **MntSweetProducts**: The amount spent on sweet products.
- **MntGoldProds**: The amount spent on gold products.
- **NumDealsPurchases**: The number of purchases made with a discount or as part of a deal.
- **NumWebPurchases**: The number of purchases made through the company's website.
- **NumCatalogPurchases**: The number of purchases made through catalogs.
- **NumStorePurchases**: The number of purchases made in physical stores.
- **NumWebVisitsMonth**: The number of visits to the company's website in a month.
- **AcceptedCmp3**: Binary indicator (1 or 0) whether the individual accepted the third marketing campaign.
- **AcceptedCmp4**: Binary indicator (1 or 0) whether the individual accepted the fourth marketing campaign.
- **AcceptedCmp5**: Binary indicator (1 or 0) whether the individual accepted the fifth marketing campaign.
- **AcceptedCmp1**: Binary indicator (1 or 0) whether the individual accepted the first marketing campaign.
- **AcceptedCmp2**: Binary indicator (1 or 0) whether the individual accepted the second marketing campaign.
- **Complain**: Binary indicator (1 or 0) whether the individual has made a complaint.
- **Z_CostContact**: A constant cost associated with contacting a customer.
- **Z_Revenue**: A constant revenue associated with a successful campaign response.
- **Response**: Binary indicator (1 or 0) whether the individual responded to the marketing campaign.
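
The sample rows shown further down list `Dt_Customer` values such as `04-09-2012` and `21-08-2013`, which suggests a day-month-year layout. The following is a minimal sketch, assuming that format holds for the whole column, of parsing the column into datetimes and deriving a tenure feature. The column name `Customer_Days` and the choice of reference date are illustrative and not part of the dataset.

```python
import pandas as pd

# Sample values copied from the dataset preview; assumed format is DD-MM-YYYY.
sample = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})

# An explicit format avoids pandas guessing month-first for ambiguous dates.
sample["Dt_Customer"] = pd.to_datetime(sample["Dt_Customer"], format="%d-%m-%Y")

# Hypothetical derived feature: days enrolled relative to the latest enrollment date.
reference_date = sample["Dt_Customer"].max()
sample["Customer_Days"] = (reference_date - sample["Dt_Customer"]).dt.days
print(sample)
```

A numeric tenure column like this can be clustered directly, whereas the raw date string cannot.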


In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

from sklearn.preprocessing import OneHotEncoder

In [16]:
data = pd.read_csv('data/customer_segmentation.csv')
data

,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,13-06-2013,46,709,...,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,10-06-2014,56,406,...,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,25-01-2014,91,908,...,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,24-01-2014,8,428,...,3,0,0,0,0,0,0,3,11,0


In [17]:
df = data.copy()

In [18]:
df.shape

(2240, 29)

In [19]:
df.isna().sum()

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

In [20]:
df['Income'] = df['Income'].fillna(df['Income'].mean())
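
Mean imputation is one reasonable default here. Income distributions are often right-skewed, though, and in that case the median can be a more robust fill value. A small sketch on made-up numbers (not taken from the dataset) comparing the two:

```python
import numpy as np
import pandas as pd

# Toy income values (illustrative only) with one missing entry and one outlier.
income = pd.Series([30000.0, 40000.0, 50000.0, np.nan, 600000.0], name="Income")

mean_filled = income.fillna(income.mean())      # pulled upward by the outlier
median_filled = income.fillna(income.median())  # less sensitive to the outlier

print(mean_filled[3], median_filled[3])
```

With only 24 of 2240 values missing, either choice is likely to have a small effect on the clusters; the difference matters more when outliers are large.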

In [21]:
df.isna().sum()

ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64

In [22]:
df['Education'].value_counts()

Education
Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: count, dtype: int64

In [23]:
df['Marital_Status'].value_counts()

Marital_Status
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64
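
The counts above show three labels (`Alone`, `Absurd`, `YOLO`) with only a handful of rows each. One option, shown here as a sketch rather than as this notebook's method, is to fold such labels into an existing or catch-all category before encoding, so that they do not each become a nearly all-zero column. The mapping below is an assumption about what the labels mean.

```python
import pandas as pd

# Toy column that mimics the label set seen above (one row per label, illustrative).
status = pd.Series(
    ["Married", "Together", "Single", "Divorced", "Widow", "Alone", "Absurd", "YOLO"],
    name="Marital_Status",
)

# Assumed mapping: treat 'Alone' as 'Single' and park unclear labels under 'Other'.
rare_map = {"Alone": "Single", "Absurd": "Other", "YOLO": "Other"}
status_clean = status.replace(rare_map)

print(status_clean.value_counts())
```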

In [24]:
def ohe_data(dataset: pd.DataFrame, feature: str) -> pd.DataFrame:
    """
    Encode a categorical feature using One-Hot Encoding (drop='first').
    
    :param dataset: Main dataset that contains the feature.
    :param feature: The name of the feature to be one-hot encoded.
    :return: A DataFrame with the encoded columns ready to be merged with the main dataset.
    """
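    enc = OneHotEncoder(sparse_output=False, drop='first')
    encoded = enc.fit_transform(dataset[[feature]])
    # Reuse the source index so a later join aligns rows even if the index is not 0..n-1.
    encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out([feature]), index=dataset.index)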
    
    return encoded_df

In [25]:
def concat_to_main_dataset(dataset: pd.DataFrame, encoded_data: pd.DataFrame, feature: str) -> pd.DataFrame:
    """
    Join the encoded columns to the main dataset and remove the original feature.
    
    :param dataset: The main dataset to update.
    :param encoded_data: The one-hot encoded DataFrame returned by ohe_data().
    :param feature: The original feature name to be removed.
    :return: The updated dataset with encoded columns and without the original feature.
    """
    dataset = dataset.join(encoded_data)
    dataset = dataset.drop(columns=[feature])
    
    return dataset

In [26]:
for f in ['Education', 'Marital_Status']:
    encoded_d = ohe_data(df, f)
    df = concat_to_main_dataset(df, encoded_d, f)
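
For comparison, pandas offers `pd.get_dummies`, which with `drop_first=True` produces the same kind of columns in a single call. A minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Master", "Basic"],
    "Income": [58138.0, 58293.0, 69245.0, 20000.0],
})

# drop_first=True drops the first category in sorted order ('Basic' here),
# mirroring OneHotEncoder(drop='first'); dtype=float matches the 0.0/1.0 output.
encoded = pd.get_dummies(toy, columns=["Education"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
```

The `OneHotEncoder` route keeps a fitted encoder that can be reused on new data with the same categories, which `pd.get_dummies` does not provide; for a one-off exploratory notebook either approach may be adequate.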

In [27]:
df

,ID,Year_Birth,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,...,Education_Graduation,Education_Master,Education_PhD,Marital_Status_Alone,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single,Marital_Status_Together,Marital_Status_Widow,Marital_Status_YOLO
0,5524,1957,58138.0,0,0,04-09-2012,58,635,88,546,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,2174,1954,46344.0,1,1,08-03-2014,38,11,1,6,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,4141,1965,71613.0,0,0,21-08-2013,26,426,49,127,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,6182,1984,26646.0,1,0,10-02-2014,26,11,4,20,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,5324,1981,58293.0,1,0,19-01-2014,94,173,43,118,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,61223.0,0,1,13-06-2013,46,709,43,182,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2236,4001,1946,64014.0,2,1,10-06-2014,56,406,0,30,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2237,7270,1981,56981.0,0,0,25-01-2014,91,908,48,217,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2238,8235,1956,69245.0,0,1,24-01-2014,8,428,30,214,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
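
Because K-means relies on Euclidean distances, a column with a large range such as `Income` would dominate the 0/1 dummy columns unless the features are put on a comparable scale. Below is a hedged sketch of that step on a few rows taken from the preview above. The use of `StandardScaler`, the value `n_clusters=2`, and the column subset are assumptions for illustration, not this notebook's settings.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# A few rows from the preview, restricted to three numeric columns.
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0, 64014.0],
    "MntWines": [635, 11, 426, 11, 173, 406],
    "Education_PhD": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
})

# Standardize each column to zero mean and unit variance before clustering.
scaled = StandardScaler().fit_transform(toy)

# n_clusters=2 is arbitrary here; n_init and random_state are set for repeatability.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(scaled)
print(labels)
```

Identifier and raw date columns such as `ID` and `Dt_Customer` carry no distance meaning and would normally be left out of the feature matrix.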
