https://www.kaggle.com/datasets/arjunbhasin2013/ccdata

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/kurmukovai/ds-for-business/88ded3b36c5cc97c26756b4c62c98bbbf99deba3/2022/seminar-5/CC%20GENERAL.csv')
df.dropna(inplace=True)
df.head(3)

# Credit Card dataset

- CUST_ID - Identification of Credit Card holder (Categorical)
- BALANCE - Balance amount left in their account to make purchases
- BALANCE_FREQUENCY - How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES - Amount of purchases made from account
- ONEOFF_PURCHASES - Maximum purchase amount done in one-go
- INSTALLMENTS_PURCHASES - Amount of purchase done in installment
- CASH_ADVANCE - Cash in advance given by the user
- PURCHASES_FREQUENCY - How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY - How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY - How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY - How frequently the cash in advance is being paid
- CASH_ADVANCE_TRX - Number of transactions made with "Cash in Advance"
- PURCHASES_TRX - Number of purchase transactions made
- CREDIT_LIMIT - Limit of Credit Card for user
- PAYMENTS - Amount of Payment done by user
- MINIMUM_PAYMENTS - Minimum amount of payments made by user
- PRC_FULL_PAYMENT - Percent of full payment paid by user
- TENURE - Tenure of credit card service for user

# Make CUST_ID the index

In [None]:
df.CUST_ID.nunique(), df.shape

In [None]:
df.index = df['CUST_ID']
df = df.drop('CUST_ID', axis=1)
df.head(3)

# Start with basic EDA

In [None]:
df.head(3)

# 1. Plot features distribution

- Plot a 3-by-6 grid of subplots, each showing a histogram of one feature's distribution (e.g. using `plt.subplots(...)`).
- Add a title to each subplot and set its font size.
- Remove the last (empty) subplot or make it invisible.

Save the resulting graph to PDF and upload it to the reporting form.
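
The layout described above can be sketched as follows. This is a minimal sketch on synthetic data, not the assignment solution: the `toy` DataFrame, its `feature_i` column names, and the output path are made up for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for 17 numeric features (hypothetical column names).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.exponential(size=(200, 17)),
                   columns=[f"feature_{i}" for i in range(17)])

fig, axes = plt.subplots(3, 6, figsize=(24, 10))
flat_axes = axes.ravel()  # 18 axes in row-major order

for ax, col in zip(flat_axes, toy.columns):
    ax.hist(toy[col], bins=30)
    ax.set_title(col, fontsize=10)

# Hide every axis that has no feature assigned (here: only the last one).
for ax in flat_axes[len(toy.columns):]:
    ax.set_visible(False)

fig.tight_layout()
pdf_path = os.path.join(tempfile.gettempdir(), "feature_histograms.pdf")
fig.savefig(pdf_path)  # the file extension selects the PDF writer
```

Flattening the 3-by-6 array of axes with `ravel()` lets a single loop pair each axis with one column, and whatever is left over is hidden.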

In [None]:
import matplotlib.pyplot as plt

# 2. Preprocess features

All clustering algorithms used here require some kind of feature scaling.
We will use standardization (or "z-scoring"):

$$X_{std} = \frac{X - mean(X)}{std(X)}$$

What is the mean of all columns in `X_std`? What is the standard deviation of all columns in `X_std`?
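
A minimal sketch of the formula on a toy DataFrame (the column names and values are made up). It assumes column-wise statistics computed with pandas; printing the column means and standard deviations of the result is one way to explore the question above.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns on very different scales.
toy = pd.DataFrame({
    "balance": [100.0, 2500.0, 40.0, 900.0],
    "purchase_frequency": [0.1, 0.9, 0.5, 0.3],
})

# Column-wise z-scoring: subtract the mean, divide by the standard deviation.
# Note: pandas' .std() uses ddof=1 (sample std) by default, whereas
# np.std and sklearn's StandardScaler use ddof=0.
toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_std.mean(axis=0))
print(toy_std.std(axis=0))
```

The choice of `ddof` changes the scaled values slightly, so it is worth using the same convention when computing and when checking the standard deviation.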


## Scientific notation

Recall that Python sometimes uses so-called scientific notation for very small numbers. For example, `6.993531e-17` by definition means:

$$6.993531 \cdot 10^{-17} = \frac{6.993531}{10^{17}} = 0.00000000000000006993531$$

For more details, see https://sparrow.dev/python-scientific-notation/ .

For the purposes of this home assignment, treat all numbers with absolute value smaller than $0.000001$ as $0$.
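
A short sketch of both points: how the same number looks in scientific and fixed-point notation, and one possible way to apply the "effectively zero" rule. The `TOLERANCE` name and the sample `values` array are illustrative.

```python
import numpy as np

tiny = 6.993531e-17      # scientific-notation literal
print(tiny)              # Python displays it in scientific notation
print(f"{tiny:.25f}")    # the same number with 25 fixed decimals

# Treat anything with absolute value below 1e-6 as zero.
TOLERANCE = 1e-6
values = np.array([6.993531e-17, -3.2e-9, 0.25, 1.0])
cleaned = np.where(np.abs(values) < TOLERANCE, 0.0, values)
print(cleaned)
```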

In [None]:
X_standardized = ...

# 3. KMeans

What does K in KMeans mean?


# Run KMeans with 5 clusters

In [None]:
import numpy as np
from tqdm import tqdm
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

In [None]:
kmeans = KMeans(n_clusters=5, n_init='auto')
labels = kmeans.fit_predict(X_standardized)

In [None]:
np.unique(labels, return_counts=True)

# 4. Run KMeans with different number of clusters


Which number of clusters is optimal according to the silhouette score?
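
As an illustration of how silhouette scores can be compared across values of k, here is a minimal sketch on synthetic blobs (not the credit card data). It uses `sklearn.metrics.silhouette_score` directly and does not rely on the `plot_utils` helper. `n_init=10` and `random_state=0` are choices made here for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X_toy, labels_k)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```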


In [None]:
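def run_kmeans(x, kmin=2, kmax=20):
    """Fit KMeans for every k in [kmin, kmax] and return the list of inertias.

    Cluster labels for each k are stored in the global `results_df`,
    which must be defined before this function is called.
    """
    inertia = []
    for k in tqdm(range(kmin, kmax+1)):
        kmeans = KMeans(n_clusters=k, n_init='auto')
        kmeans.fit(x)
        results_df[f'clusters_kmeans_{k}'] = kmeans.predict(x)
        inertia.append(kmeans.inertia_)
    return inertia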

In [None]:
results_df = pd.DataFrame()
kmin, kmax = 2, 20
n_clusters = range(kmin, kmax+1)

# standardized
inertia = run_kmeans(X_standardized, kmin, kmax)

In [None]:
plt.plot(n_clusters, inertia);
plt.xlabel('Number of clusters')
plt.ylabel('Inertia');

In [None]:
from plot_utils import plot_silhouette
# If you run this in Google Colab, copy the code from `plot_utils.py` into a notebook cell

In [None]:
plot_silhouette(X_standardized, kmin=4, kmax=21, step=4)

# 5. Hierarchical clustering

What is hierarchical clustering?


# 6. Run hierarchical clustering

with different types of linkage looking for 8-12 clusters (choose one number):
 - "single"
 - "complete"
 - "ward"
 - "average"
 
For each type of linkage, print the sizes of the resulting clusters. Which of the linkages result in a non-degenerative clustering?

**Degenerative** clustering is a clustering of the data into N clusters where most of the data falls into a small subset of clusters, while the rest of the clusters contain only a handful of points each (e.g. 1, 5, or 10).
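
A minimal sketch of how cluster sizes can be printed for each linkage, using synthetic data (one dense blob plus a few scattered points) rather than the credit card features. The number of clusters (4) is arbitrary here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic data: one dense blob plus a few far-away points.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(size=(300, 2)),
    rng.normal(loc=8.0, scale=3.0, size=(10, 2)),
])

sizes_by_linkage = {}
for linkage in ["single", "complete", "ward", "average"]:
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    labels_toy = model.fit_predict(X_toy)
    # np.bincount counts how many points received each cluster label.
    sizes_by_linkage[linkage] = np.sort(np.bincount(labels_toy))[::-1]
    print(linkage, sizes_by_linkage[linkage])
```

Sorting the sizes in descending order makes it easy to spot a clustering where one cluster holds nearly everything.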


In [None]:
from sklearn.cluster import AgglomerativeClustering

# 7. Plot dendrogram

For Ward linkage from the previous question, what distance threshold will result in 10 clusters?
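
Besides reading the cut height off the plot, the threshold can be derived from the merge distances. The sketch below uses synthetic data and a hypothetical `target_k`. It assumes scikit-learn's `distances_` attribute, which is filled in when `distance_threshold` is set, and the rule that merges at or above the threshold are not performed.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))  # synthetic data, not the credit card set

# distance_threshold=0 builds the full tree and fills in `distances_`.
full_tree = AgglomerativeClustering(
    distance_threshold=0, n_clusters=None, linkage="ward"
).fit(X_toy)

target_k = 5
merge_heights = np.sort(full_tree.distances_)
# A cut between the k-th largest and the (k-1)-th largest merge height
# leaves exactly k - 1 merges undone, i.e. k clusters.
lower = merge_heights[-target_k]
upper = merge_heights[-(target_k - 1)]
threshold = (lower + upper) / 2

cut = AgglomerativeClustering(
    distance_threshold=threshold, n_clusters=None, linkage="ward"
).fit(X_toy)
print(threshold, cut.n_clusters_)
```

Any value strictly above `lower` and not exceeding `upper` would give the same number of clusters; the midpoint is just a convenient choice.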


In [None]:
from plot_utils import plot_dendrogram
# If you run this in Google Colab, copy the code from `plot_utils.py` into a notebook cell

In [None]:
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='ward')
model = model.fit(X_standardized)

plt.figure(figsize=(15,15))
plt.title("Hierarchical Clustering Dendrogram") 
plot_dendrogram(model, truncate_mode="level", p=4)  # plot only the top levels of the dendrogram (depth set by p)
plt.xlabel("Number of points in node in round brackets or an index of a point (no brackets)")
plt.show()

# Visualization

To visualize our multi-dimensional data we will apply two different dimensionality reduction techniques: PCA and t-SNE (pronounced "tee-snee").

# 8. Select all true statements about PCA


# Run PCA with 2 components

Unlike KMeans and hierarchical clustering, PCA only requires data centering (without data scaling).
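
To illustrate the difference between centering and scaling: scikit-learn's `PCA` centers its input internally but does not rescale it, so explicit centering should leave the explained variance ratios unchanged, while scaling changes them. A minimal sketch on synthetic data (all array names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with a large offset and unequal column scales.
X_toy = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2]) + 100.0

X_centered = X_toy - X_toy.mean(axis=0)

ratio_raw = PCA(n_components=2).fit(X_toy).explained_variance_ratio_
ratio_centered = PCA(n_components=2).fit(X_centered).explained_variance_ratio_
print(ratio_raw, ratio_centered)

# Scaling, by contrast, changes which directions dominate.
X_scaled = X_centered / X_centered.std(axis=0)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_
print(ratio_scaled)
```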

In [None]:
X_mean = df - df.mean(axis=0)

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(2)
X_pca2 = pca.fit_transform(X_mean)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

In [None]:
kmeans = KMeans(n_clusters=10, n_init='auto')
labels = kmeans.fit_predict(X_standardized)

plt.scatter(X_pca2[:, 0], X_pca2[:, 1], c=labels); # Use kmeans labels for the color

PCA visualizations are not always informative, so we will try t-SNE. First, however, we need to select the number of PCA components.

# 9. Select number of PCA components

based on the explained variance ratio. What is the minimal number of PCA components that explains **at least 95%** of the data variance (`X_mean`)?
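
One common way to answer this kind of question is to accumulate the explained variance ratios and find the first position that reaches the target. A minimal sketch on synthetic data; the feature scales and the `target` name are made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 6 features with decreasing variance.
X_toy = rng.normal(size=(500, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])

pca_full = PCA().fit(X_toy)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# First position where the cumulative ratio reaches the target,
# converted from a 0-based index to a component count.
target = 0.95
n_components_needed = int(np.argmax(cumulative >= target)) + 1
print(cumulative)
print(n_components_needed)
```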


# These are the `loadings` of the first principal component, which explains about 47% of the data variance


In [None]:
components = dict(zip(df.columns, pca.components_[0]))
components = sorted(components.items(), key=lambda x: x[1], reverse=True)

for c, w in components:
    print(c, np.round(w, 3))

# 10. Select all correct statements

based on the whole PCA analysis


# TSNE visualization

# 11. What is t-SNE?


In [None]:
# !pip install opentsne

In [None]:
from openTSNE import TSNE

pca = PCA(10)
X_pca10 = pca.fit_transform(X_mean)
embedding = TSNE().fit(X_pca10)

In [None]:
plt.scatter(embedding[:, 0], embedding[:, 1]);

In [None]:
ac = AgglomerativeClustering(n_clusters=8, linkage='ward')
prediction = ac.fit_predict(X_pca10)
np.unique(prediction, return_counts=True)

### tSNE with AgglomerativeClustering labels

In [None]:
plt.scatter(embedding[:, 0], embedding[:, 1], c=prediction, cmap='Paired');

### tSNE with KMeans clustering labels

In [None]:
kmeans = KMeans(n_clusters=8, n_init='auto')
labels = kmeans.fit_predict(X_standardized)

In [None]:
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Paired');

# 12. Which metrics are suitable for clusterings comparison?



# 13. Compare Kmeans and Agglomerative clusterings

with 8 clusters using the Adjusted Rand Index (ARI, `adjusted_rand_score`). What is the value of ARI?
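
A small sketch of how `adjusted_rand_score` behaves, using hand-made label vectors (not clustering output): it compares partitions, so renaming the cluster labels does not change the score.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Same partition as labels_a, only the label names differ.
labels_b = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
# A different partition of the same nine points.
labels_c = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)
print(ari_same, ari_diff)
```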


Visually, KMeans and Agglomerative clustering (with Ward linkage) produce drastically different results. We will try removing outliers to see whether that affects the results.

# Remove outliers

In [None]:
def detect_outliers_very_simple(x, col):
    """A naive outlier detector.

    Flags values below the 5th percentile for BALANCE_FREQUENCY and TENURE,
    and values above the 95th percentile for every other column.
    """
    if col in ['BALANCE_FREQUENCY', 'TENURE']:
        return x < x.quantile(0.05)
    else:
        return x > x.quantile(0.95)

In [None]:
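cols_outliers = dict()
# Start with "not an outlier" for every customer; the Series keeps the CUST_ID index.
outliers = pd.Series(False, index=df.index)

for col in df.columns:
    cols_outliers[col] = detect_outliers_very_simple(df[col], col)
    # A customer is an outlier if flagged in at least one column (element-wise OR).
    outliers |= cols_outliers[col]
    print(col, np.round(cols_outliers[col].sum() / df.shape[0] * 100), '%')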

In [None]:
X_mean['is_outlier'] = outliers

# 14. What is the percentage of detected outliers (to the whole data)?
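
One way to turn per-column boolean masks into a single percentage, sketched on a toy DataFrame (the column names, index values, and numbers are made up):

```python
import pandas as pd

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "spend": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "tenure": [6, 12, 12, 12, 12, 12, 12, 12, 12, 12],
}, index=[f"C{i}" for i in range(10)])

# One boolean mask per column; a row is an outlier if ANY mask is True.
high_spend = toy["spend"] > toy["spend"].quantile(0.95)
low_tenure = toy["tenure"] < toy["tenure"].quantile(0.05)
is_outlier = high_spend | low_tenure

# The mean of a boolean Series is the fraction of True values.
outlier_percent = is_outlier.mean() * 100
print(is_outlier[is_outlier].index.tolist())
print(outlier_percent)
```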


# 15. Repeat the analysis on filtered data

Run:
1. PCA with 10 components
2. KMeans with **12 clusters** (using PCA representation), set `n_init='auto'`
3. Hierarchical clustering with **12 clusters** (using PCA representation)
4. Compare 2 and 3 using adjusted mutual information

What is the value of AMI?
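
The four steps above can be sketched end to end on synthetic data. This is a hedged illustration, not the assignment solution: the data, the number of components (2), the number of clusters (3), and `n_init=10` with `random_state=0` are all choices made for this toy example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score

# Synthetic data: three well-separated groups in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0, 10.0, -10.0, 10.0, -10.0]])
X_toy = np.vstack([c + rng.normal(size=(40, 5)) for c in centers])

# 1. Reduce dimensionality, 2.-3. cluster twice, 4. compare the labelings.
X_reduced = PCA(n_components=2).fit_transform(X_toy)
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
labels_agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_reduced)

ami = adjusted_mutual_info_score(labels_km, labels_agg)
print(ami)
```

Both algorithms are run on the same PCA representation, so any disagreement measured by AMI comes from the clustering methods themselves rather than from different inputs.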


In [None]:
X_mean_filtered = X_mean.query('is_outlier==False')

In [None]:
from sklearn.metrics import adjusted_mutual_info_score

# 16. Plot a subplot with points colored according to kmeans and agglomerative clustering

Draw a figure with 1 row and 2 columns of subplots using the TSNE embeddings. Title each subplot according to the source of the cluster colors (kmeans or agglomerative). Save it as a PDF and upload it to the submission form.
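
A minimal layout sketch with placeholder inputs: random 2-D points stand in for the embedding and the two label vectors are made up, so only the plotting pattern carries over. The output path is illustrative.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

# Placeholder inputs: random 2-D points and two made-up label vectors.
rng = np.random.default_rng(0)
embedding_toy = rng.normal(size=(300, 2))
labels_kmeans_toy = rng.integers(0, 4, size=300)
labels_agg_toy = rng.integers(0, 4, size=300)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
panels = [("KMeans", labels_kmeans_toy), ("Agglomerative", labels_agg_toy)]

# Same points in both panels; only the coloring differs.
for ax, (name, panel_labels) in zip(axes, panels):
    ax.scatter(embedding_toy[:, 0], embedding_toy[:, 1],
               c=panel_labels, cmap="Paired", s=8)
    ax.set_title(f"Colored by {name} clusters")

fig.tight_layout()
pdf_path = os.path.join(tempfile.gettempdir(), "tsne_clusters.pdf")
fig.savefig(pdf_path)
```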

In [None]:
embedding = TSNE().fit(X_pca10)

# Plotly visualization

> ChatGPT: `How to plot an interactive scatter plot in python so I can mouse over a point to see some label, provide a code example.`

> `What if my data source is stored in pandas DataFrame?`

> `How to add a color to each point?`

Finalize with some manual edits (change the title, add color alpha, etc.).

## Try to mouse over the points on the graph

In [None]:
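# Assumes `prediction` (agglomerative) and `prediction_kmeans` (KMeans) hold the
# labels computed on the filtered data in step 15, one label per row of `embedding`.
x_plot = pd.DataFrame(index=X_mean_filtered.index)
x_plot['tsne1'] = embedding[:, 0]
x_plot['tsne2'] = embedding[:, 1]
x_plot['customer_id'] = range(embedding.shape[0])
x_plot['cluster_agg'] = prediction
x_plot['cluster_kmeans'] = prediction_kmeans
x_plot = x_plot.reset_index()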

In [None]:
# !pip install plotly

In [None]:
import plotly.express as px

In [None]:
fig = px.scatter(x_plot, x='tsne1', y='tsne2', color='cluster_agg', hover_data=['CUST_ID'])

# Customize aspect
fig.update_traces(marker=dict(size=5, line=dict(width=0.5)), selector=dict(mode='markers'))
fig.update_layout(title='TSNE plot of customers clusters', xaxis_title='x', yaxis_title='y')
fig.update_yaxes(scaleanchor="x", scaleratio=1)

fig.show()


# 17. Interpret the resulting clusters

Try to interpret the resulting clusters. You can use a smaller number of clusters (e.g. 4-8) and any clustering algorithm you want. Provide a short but detailed report (under 300 words). Save it to PDF and upload it to the submission form.