# Machine Learning II Project - Customer Segmentation


Group 17
- Joel Mendes - 20221825
- Lourenço Martins - 20222043
- Margarida Sardinha - 20221959

This project's goal is to segment a fictional retail company's customers into clusters, based on their demographic and purchasing data, and then create targeted promotions for each cluster detected.

This Jupyter notebook includes all the functions and code required for our clustering solution, which is the basis for the coupons and promotions created.

## Importing data and libraries

In [None]:
# Uncomment to install the extra dependencies
# %pip install umap-learn
# %pip install pyECLAT

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# Importing functions
from preprocessing_py_files.preprocessing import *
from preprocessing_py_files.feature_engineering import *
from preprocessing_py_files.feature_selection import *

from visualization_py_files.initial_visualizations import *
from visualization_py_files.ploting import *

from modeling_py_files.apriori_clusters import *
from modeling_py_files.dimensionality_reduction import *
from modeling_py_files.modeling import *
from modeling_py_files.association_rules import *

from extra_py_files.ExtraCredit import *
from extra_py_files.helping_functions import *

In [8]:
cust_basket = pd.read_csv('data/customer_basket.csv')
cust_info = pd.read_csv('data/customer_info.csv', index_col=0)
product_mapping = pd.read_excel('data/product_mapping.xlsx')

## Exploratory Data Analysis

The first step of any ML project is to clean and visualize the data. In our case, this included creating new variables and removing unwanted ones, scaling all numerical variables, and removing outliers, both random ones and those that form clusters a priori.

In [None]:
# Initial visualizations
display(cust_info.head())
cust_info.info()  # info() prints its summary and returns None, so display() is not needed
display(cust_info.describe())
display(cust_info.describe(include=object))

display(cust_basket.head())
cust_basket.info()  # info() prints its summary and returns None, so display() is not needed
display(cust_basket.describe())
display(cust_basket.describe(include=object))

product_mapping

#### Inconsistencies found

In [None]:
# Variables with missing values
cust_info.isnull().sum()

In [None]:
# A percentage cannot be negative
cust_info['percentage_of_products_bought_promotion'].min()

### Preprocessing

In [None]:
# Applying the preproc functions to customer_info
info_treated = custinfo_feature_eng(cust_info)
info_scaled = scaling_imputation(info_treated)

In [29]:
# Applying the preproc functions to customer_basket
basket_treated = cust_basket_preproc(cust_basket)
basket_encoded = cust_basket_encoding(basket_treated)

We end up with the following dataframes:

* *info_treated* - Unscaled, but with the new features, to be used in interpretation of clusters. The cluster labels will be added here.
* *info_scaled* - Scaled with Robust Scaler, with missing values imputed with KNN, to be used for modelling.
* *basket_treated* - Cleaned up, to serve as reference if needed.
* *basket_encoded* - With TransactionEncoder applied, to be used for association rules.
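
As a rough illustration of what *info_scaled* contains, the sketch below applies scikit-learn's `RobustScaler` and `KNNImputer` to a toy dataframe. The column names, the `n_neighbors=2` setting and the scale-then-impute order are assumptions made for this example; the actual logic lives in `scaling_imputation` and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

# Toy data (hypothetical columns) with one extreme value and one missing value
toy = pd.DataFrame({
    "spend_groceries": [100.0, 120.0, 110.0, 5000.0, 115.0],
    "spend_electronics": [20.0, np.nan, 25.0, 30.0, 22.0],
})

# Robust scaling: subtract the median and divide by the interquartile range.
# Missing values are ignored when fitting and stay missing after transforming.
scaled = pd.DataFrame(
    RobustScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# KNN imputation: fill each gap with the mean of the nearest rows' values
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(scaled),
    columns=toy.columns,
    index=toy.index,
)

print(imputed)
```

Scaling before imputing is one reasonable choice here: KNN is distance-based, so features on very different scales would otherwise dominate the neighbour search.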

### Visualizations

In [None]:
plot_population(info_treated)

In [None]:
plot_distributions_grid(info_treated, ['customer_name', 'age', 'gender', 'vegetarian'], figsize=(20, 15), bins=30)

In [None]:
plot_variable_correlation(info_scaled, ['customer_name', 'education'])

### Feature Selection

Based on the results above, we remove the unwanted features: those so highly correlated with another feature that they acted as proxies for it.
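
One common way to automate this kind of filter is sketched below. The helper name `drop_highly_correlated` and the 0.9 threshold are hypothetical and chosen only for illustration; the project's `feature_selection` function may use a different rule or a hand-picked list of columns.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: "a_copy" is almost a linear function of "a", "b" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(toy)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```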

In [31]:
info_scaled = feature_selection(info_scaled)

## Customer Segmentation

### Preparing the data

Only the variables relating to purchase history are passed to the clustering algorithms.

The initial plan was to create two clustering solutions and merge them based on the results. Due to time constraints, only the most relevant variables, those relating to purchase history, are used. In the future, the project could be expanded by using the demographic variables as well.

In [48]:
modeldf_purchase, modeldf_demog = custinfo_separator(info_scaled)

### K-Means Clustering

In [34]:
# Checking the optimum number of clusters to look for
dispersion = create_dispersion_list(modeldf_purchase)
plot_elbow_graph(dispersion)
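
For reference, the elbow method can be sketched as follows on synthetic data. This is a minimal illustration that assumes the dispersion measure is K-Means inertia (the within-cluster sum of squares); `create_dispersion_list` may compute it differently, and the blob data and range of k are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (stand-in for the real features)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-Means for each candidate k and record the inertia
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```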

In [49]:
allocate_clusters_kmeans(info_treated, modeldf_purchase, n_clusters=7)

In [None]:
plot_cluster_description(modeldf_purchase)

In [None]:
plot_cluster_sizes(modeldf_purchase)

### Hierarchical Clustering

In [None]:
agg_clust = create_agg_clusters(modeldf_purchase)

In [None]:
fig, ax = plt.subplots()
plt.title("Hierarchical Clustering Dendrogram")
# plot the dendrogram, truncated to at most 50 levels
plot_dendrogram(agg_clust, truncate_mode="level", p=50)
plt.show()

In [None]:
allocate_clusters_aggclust(modeldf_purchase, modeldf_purchase, n_clusters=8)

In [None]:
modeldf_purchase.groupby(['cluster_hierarchical']).mean()

In [None]:
modeldf_purchase.mean()

In [None]:
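from sklearn.metrics import confusion_matrix

# Cross-tabulate the K-Means labels against the hierarchical (Ward) labels
pd.DataFrame(
    confusion_matrix(sample_pp.cluster_kmeans, sample_pp.cluster_hierarchical),
    index = ['K-means {} Cluster'.format(i) for i in np.arange(0,8)],
    columns = ['Ward {} Cluster'.format(i) for i in np.arange(0,8)],
)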

In [None]:
modeldf_purchase.groupby(['cluster_hierarchical']).size().plot(kind='bar')
plt.show()

In [None]:
eps_values = np.arange(0.1, 1.1, 0.1)
min_samples_values = range(2, 11)

# Iterate over the parameter grid to find the best parameters
# (placeholder: the model fit is disabled here; see the DBSCAN section)

for min_samples_test in min_samples_values:
    for eps_test in eps_values:
        # model = DBSCAN(eps=eps_test, min_samples=min_samples_test)
        # model.fit_predict(data_preprocessed)
        pass

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

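def plot_k_distance(X, k):
    # Querying the fitted data returns each point as its own nearest neighbour
    # (distance 0), so request k + 1 neighbours and take the last column.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, indices = nbrs.kneighbors(X)
    distances = np.sort(distances[:, k], axis=0)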
    plt.plot(distances)
    plt.xlabel('Points sorted by distance')
    plt.ylabel(f'{k}-th Nearest Neighbor Distance')
    plt.title(f'k-distance Graph for k={k}')
    plt.show()

# Plot the k-distance graph
plot_k_distance(sample_pp_sc, k=4)

In [None]:
from scipy.spatial import distance_matrix

# Calculate the distance matrix
dist_matrix = distance_matrix(sample_purchase, sample_pp_sc)

# Summary statistics of pairwise distances
print('Min distance:', np.min(dist_matrix))
print('Max distance:', np.max(dist_matrix))
print('Mean distance:', np.mean(dist_matrix))
print('Median distance:', np.median(dist_matrix))

### DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

eps_range = np.arange(1, 2.5, 0.1)
min_samples_range = range(5, 13)

# Helper that returns the cluster sizes and the number of noise points (label -1)
def print_cluster_info(labels):
    unique_labels, counts = np.unique(labels, return_counts=True)
    cluster_info = dict(zip(unique_labels, counts))

    noise_points = cluster_info.get(-1, 0)
    clusters = {k: v for k, v in cluster_info.items() if k != -1}

    return clusters, noise_points

# Grid search over the parameter ranges
for min_samples in min_samples_range:
    for eps_val in eps_range:
        model = DBSCAN(eps=eps_val, min_samples=min_samples)
        labels = model.fit_predict(sample_purchase)

        clusters, noise_points = print_cluster_info(labels)

        if noise_points <= 500:
            print(f'eps: {round(eps_val,2)}, min_samples: {min_samples}')
            print(f'Cluster sizes: {clusters}')
            print(f'Number of noise points: {noise_points}')
            print('-' * 40)

In [None]:
allocate_clusters_dbscan(sample_pp, sample_pp_sc, eps=2.5, min_samples=8)

sample_pp.groupby(['cluster_dbscan']).mean()

In [None]:
sample_pp.groupby(['cluster_dbscan']).size().plot(kind='bar')
plt.show()

### Meanshift

In [None]:
from sklearn.cluster import estimate_bandwidth

estimate_bandwidth(sample_pp_sc, quantile=0.15)

In [None]:
allocate_clusters_meanshift(sample_pp, sample_pp_sc, bandwidth=4)

In [None]:
sample_pp.groupby(['cluster_meanshift']).mean()

In [None]:
sample_pp.groupby(['cluster_meanshift']).size().plot(kind='bar')
plt.show()

### Segment Descriptions

In [None]:
apply_tsne(modeldf_purchase)

In [None]:
apply_umap(modeldf_purchase)

### Segment Comparison

## Association Rules

Ideally, we would split the dataset into train and test sets and evaluate how well the rules describe the entire population, but time constraints made this impossible. This is something to improve on in a future project.

In [None]:
cluster_dfs = association_rules_preproc(info_treated, basket_encoded)

for index in [0,6]:
    association_rules_apriori(cluster_dfs, index)

The higher the lift, the more specific the association rules are to the cluster at hand. To find rules with higher lift, we try lowering the minimum support.
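
As a reminder of what lift measures, the sketch below computes it by hand on a made-up set of baskets. For a rule A -> B, lift is support(A and B) / (support(A) * support(B)); values above 1 mean the items appear together more often than they would if they were independent. The baskets and helper functions are illustrative only and are not part of the project code.

```python
# Toy transactions (made up for illustration)
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent, transactions):
    """Lift of the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / (
        support(antecedent, transactions) * support(consequent, transactions)
    )

# support(bread) = 3/5, support(butter) = 3/5, support(both) = 2/5
rule_lift = lift({"bread"}, {"butter"}, baskets)
print(f"lift(bread -> butter) = {rule_lift:.3f}")
```

Lowering the minimum support lets rarer itemsets into the candidate pool, and rare itemsets are where high-lift rules tend to appear, at the cost of rules backed by fewer transactions.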