# Comparison of obtained models

In [1]:
import pandas as pd

rfm = pd.read_csv("./../data/clustered/rfm.csv")
rfm["kmeans_cluster"] = rfm["kmeans_cluster"] + 1
rfm["kmeans_cluster"] = rfm["kmeans_cluster"].astype("category")
rfm

Unnamed: 0,recency,frequency,monetary,clusters,kmeans_cluster
0,107,166,8026.24,0,4
1,2229,7,96.42,0,2
2,664,43,1364.72,0,3
3,2277,7,112.33,0,2
4,207,4,35.72,0,1
...,...,...,...,...,...
136383,1,1,14.18,6,1
136384,1,2,36.03,5,1
136385,1,2,74.61,5,1
136386,1,3,15.51,5,1


In [2]:
rfmv = pd.read_csv("./../data/clustered/rfmv.csv")
rfmv["kmeans_cluster"] = rfmv["kmeans_cluster"] + 1
rfmv["kmeans_cluster"] = rfmv["kmeans_cluster"].astype("category")

rfmv

Unnamed: 0,recency,frequency,monetary,variety,kmeans_cluster
0,274,166,8026.24,91,3
1,2396,7,96.42,6,2
2,831,43,1364.72,36,4
3,2444,7,112.33,7,2
4,374,4,35.72,4,1
...,...,...,...,...,...
136632,168,6,30.55,6,1
136633,168,2,36.03,2,1
136634,168,2,74.61,2,1
136635,168,3,15.51,2,1


In [3]:
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

experiments = []

## RFM: 100 experiments

In [4]:
y = rfm["kmeans_cluster"]
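for i in range(100):
    # Draw a stratified 10% subsample; only the "test" part is used below.
    X_train, X_test, y_train, y_test = train_test_split(
        rfm, y, test_size=0.1, stratify=y
    )

    # Standardise the RFM features of the subsample.
    scaler = StandardScaler()

    scaled_customers = scaler.fit_transform(
        X_test[["recency", "frequency", "monetary"]]
    )
    # Refit K-means on the subsample and keep its cluster labels.
    kmeans = KMeans(n_clusters=5)
    kmeans.fit(scaled_customers)
    labels = kmeans.labels_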

    experiments.append(
        {
            "model": "RFM",
            "experiment_no": i,
            "silouhette_score": round(silhouette_score(scaled_customers, labels), 2),
            "calinski_harabasz_score": round(
                calinski_harabasz_score(scaled_customers, labels), 2
            ),
            "davies_bouldin_score": round(
                davies_bouldin_score(scaled_customers, labels), 2
            ),
        }
    )

## RFMV: 100 experiments

In [5]:
y = rfmv["kmeans_cluster"]
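for i in range(100):
    # Draw a stratified 10% subsample; only the "test" part is used below.
    X_train, X_test, y_train, y_test = train_test_split(
        rfmv, y, test_size=0.1, stratify=y
    )

    # Standardise the RFMV features of the subsample.
    scaler = StandardScaler()

    scaled_customers = scaler.fit_transform(
        X_test[["recency", "frequency", "monetary", "variety"]]
    )
    # Refit K-means on the subsample and keep its cluster labels.
    kmeans = KMeans(n_clusters=4)
    kmeans.fit(scaled_customers)
    labels = kmeans.labels_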

    experiments.append(
        {
            "model": "RFMV",
            "experiment_no": i,
            "silouhette_score": round(silhouette_score(scaled_customers, labels), 2),
            "calinski_harabasz_score": round(
                calinski_harabasz_score(scaled_customers, labels), 2
            ),
            "davies_bouldin_score": round(
                davies_bouldin_score(scaled_customers, labels), 2
            ),
        }
    )

In [6]:
# Track results
report = pd.DataFrame(experiments)
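
Before running any test, it can help to look at the per-model mean and spread of a metric. The sketch below shows one way to do this with `groupby`. It runs on a synthetic stand-in for `experiments` (the names `demo_experiments` and `demo_report` and all values are invented for illustration), not on the real results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `experiments`; the values are made up for illustration.
rng = np.random.default_rng(0)
demo_experiments = [
    {
        "model": model,
        "experiment_no": i,
        "silouhette_score": round(float(rng.normal(loc, 0.02)), 2),
    }
    for model, loc in [("RFM", 0.60), ("RFMV", 0.50)]
    for i in range(100)
]
demo_report = pd.DataFrame(demo_experiments)

# Mean, standard deviation and number of runs of the metric for each model.
summary = demo_report.groupby("model")["silouhette_score"].agg(
    ["mean", "std", "count"]
)
print(summary)
```

The same `groupby` call should work on the real `report` frame, since it has the same `model` and `silouhette_score` columns.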

## Statistics review


### Looking for a Gaussian variable

* H<sub>0</sub>: the variable is Gaussian
* H<sub>1</sub>: the variable is not Gaussian

In [7]:
import scipy.stats as st

alpha = 0.01

for var in ["silouhette_score", "calinski_harabasz_score", "davies_bouldin_score"]:
    k2, p = st.normaltest(report[var])

    if p > alpha:
        print(f"The variable '{var}' follows a normal distribution. (p-value = {p:.4f})")
    else:
        print(
            f"The variable '{var}' doesn't follow a normal distribution. (p-value = {p:.4f})"
        )

The variable 'silouhette_score' follows a normal distribution. (p-value = 0.0655)
The variable 'calinski_harabasz_score' doesn't follow a normal distribution. (p-value = 0.0000)
The variable 'davies_bouldin_score' doesn't follow a normal distribution. (p-value = 0.0000)
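
The test above is run on the pooled scores of both models. If the two models have different mean scores, the pooled sample can look non-Gaussian even when each model's scores are Gaussian on their own. A minimal sketch of running the same test per model follows. It uses synthetic data (the `demo_report` frame, its `score` column and all values are invented for illustration).

```python
import numpy as np
import pandas as pd
import scipy.stats as st

alpha = 0.01
rng = np.random.default_rng(42)

# Synthetic scores (made up): each model is Gaussian around its own mean,
# so the pooled sample is bimodal.
demo_report = pd.DataFrame(
    {
        "model": ["RFM"] * 100 + ["RFMV"] * 100,
        "score": np.concatenate(
            [rng.normal(0.60, 0.01, 100), rng.normal(0.50, 0.01, 100)]
        ),
    }
)

# Pooled test, mirroring the cell above.
_, p_pooled = st.normaltest(demo_report["score"])
print(f"pooled: p = {p_pooled:.4f}")

# Same test applied to each model separately.
p_by_model = {}
for model, group in demo_report.groupby("model"):
    _, p = st.normaltest(group["score"])
    p_by_model[model] = float(p)
    verdict = "not rejected" if p > alpha else "rejected"
    print(f"{model}: normality {verdict} (p = {p:.4f})")
```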


In [8]:

import plotly.express as px

fig1 = px.histogram(report, x='silouhette_score', color='model')
fig1.show()

In [9]:
fig2 = px.histogram(report, x='calinski_harabasz_score', color='model')
fig2.show()

In [10]:
fig3 = px.histogram(report, x='davies_bouldin_score', color='model')
fig3.show()

Normality is rejected for the Calinski-Harabasz and Davies-Bouldin scores. The silhouette score is the only variable for which normality is not rejected at the 1% level (p-value = 0.0655), so we will use Student's and Bartlett's tests on it to check whether its distribution differs significantly between the two models.

To compare the two models, we must show that their scores are statistically different.

For this, we will use two statistical tests:

* Bartlett's test, which tests the equality of variances;
* Student's t-test, which tests the equality of means.

The hypothesis H<sub>0</sub> is that the two groups have the same variance (Bartlett) and the same mean (Student). If either of these statistical properties varies significantly from one group to the other, the hypothesis is rejected.

We set our risk threshold α at 1%.
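
The decision rule described above can be sketched as a small helper that runs both tests and compares each p-value to α. The function name `compare_groups` and the synthetic samples are hypothetical and only serve as illustration; the SciPy calls are the same ones used in this notebook.

```python
import numpy as np
import scipy.stats as st

alpha = 0.01  # risk threshold from the text


def compare_groups(a, b, alpha=alpha):
    """Hypothetical helper: run Bartlett's and Student's tests on two samples."""
    _, p_var = st.bartlett(a, b)
    _, p_mean = st.ttest_ind(a, b, equal_var=True)
    return {
        # H0 is rejected when the p-value falls below alpha.
        "equal_variance_rejected": bool(p_var < alpha),
        "equal_mean_rejected": bool(p_mean < alpha),
        "p_variance": float(p_var),
        "p_mean": float(p_mean),
    }


# Synthetic samples (made up): different means, same variance.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(2.0, 1.0, 100)

result = compare_groups(group_a, group_b)
print(result)
```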

In [11]:
rfm = report[report.model == "RFM"]["silouhette_score"]
rfmv = report[report.model == "RFMV"]["silouhette_score"]

st.bartlett(rfm, rfmv)

BartlettResult(statistic=44.220568772244064, pvalue=2.933823155879102e-11)

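The H<sub>0</sub> hypothesis on the variance is rejected since the p-value (about 2.9e-11) is far smaller than 0.01: the two groups do not have the same variance.

In [12]:
st.ttest_ind(rfm, rfmv, equal_var=True)

Ttest_indResult(statistic=28.016990495773857, pvalue=8.138107509649175e-71)

The H<sub>0</sub> hypothesis on the mean is rejected since the p-value is less than 0.01. Because Bartlett's test rejects equal variances, Welch's variant (`equal_var=False`) would be the more appropriate choice here; with equal group sizes it yields the same t statistic, so the conclusion does not change.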
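
When equal variances cannot be assumed, `scipy.stats.ttest_ind` can be run with `equal_var=False`, which performs Welch's t-test. The sketch below compares the two variants on synthetic samples (the names `sample_a` and `sample_b` and all values are invented for illustration, not taken from the experiments).

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)

# Synthetic samples (made up) with equal sizes but clearly different variances.
sample_a = rng.normal(0.60, 0.005, 100)
sample_b = rng.normal(0.50, 0.030, 100)

pooled = st.ttest_ind(sample_a, sample_b, equal_var=True)  # Student's t-test
welch = st.ttest_ind(sample_a, sample_b, equal_var=False)  # Welch's t-test

print(f"Student: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.3g}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
```

With equal group sizes the two statistics coincide. Only the degrees of freedom, and hence the p-value, differ.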

### Findings

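The two groups are statistically different regarding the silhouette score. Since the t statistic is positive, the mean silhouette score of RFM is higher than that of RFMV, so we conclude that, by this metric, the RFMV model produces lower-quality clusters than the RFM model.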

In [13]:
report.to_csv("./../data/metrics/report.csv", index=False)