In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# 8 - Clustering (part 2)

In this notebook, we will introduce other clustering algorithms and the metrics that can be used for evaluating them.

Specifically, we will look at the following clustering algorithms:
- KMeans ([Documentation of KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)),
- DBSCAN ([Documentation of DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#)),
- hierarchical clustering ([Documentation of AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering))

and at the following techniques for evaluating clustering models:
- the "elbow method"

The dataset we are going to use is the same *fraud detection* dataset which we have used in some of the previous sessions.
In case you do not remember the details: it contains a list of transactions from an online retailer. For each transaction, we have several attributes, as well as a label indicating whether the transaction was fraudulent or not.

Since we have the target labels, we could just as well use a classification algorithm on this dataset (as we did in the 3rd session); but in this notebook we are going to focus exclusively on clustering.

# Index

- [0. Imports](#0.)
- [1. Analysis of different clustering algorithms on artificial data](#1.)
    - [1.1 Creation of the artificial blobs of data](#1.1)
    - [1.2 Analysis of the clustering algorithms when using default parameters](#1.2)
    - [1.3 Let's try to select the number of clusters](#1.3)
    - [1.4 What parameters can we change with DBSCAN?](#1.4)
    - [1.5 Experimenting with different distributions of artificial data](#1.5)
- [2. Clustering on the fraud detection dataset](#2.)
    - [2.1 Analysis of the dataset](#2.1)
    - [2.2 Data preparation for clustering](#2.2)
    - [2.3 Clustering](#2.3)
    - [2.4 KMeans](#2.4)
    - [2.5 Evaluating the KMeans algorithm](#2.5)
    - [2.6 The elbow method](#2.6)
    - [2.7 AgglomerativeClustering](#2.7)
    - [2.8 DBSCAN](#2.8)


## 0.
## Imports

[Index](#Index)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

from sklearn.cluster import (
    KMeans, 
    AgglomerativeClustering, 
    DBSCAN
)

from sklearn.metrics import (
    homogeneity_score,
    completeness_score,
    v_measure_score,
    silhouette_score,
    calinski_harabasz_score,
)

## 1.

## Analyzing different clustering algorithms on artificial data
[Index](#Index)

### 1.1

### Creation of the artificial blobs of data
[Index](#Index)

In [None]:
n_samples = 1500
random_state = 170
alpha=0.5

X, y = make_blobs(n_samples=n_samples, random_state=random_state)

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X[:, 0], X[:, 1], alpha=alpha)
plt.show()

### 1.2

### Analysis of the clustering algorithms when using default parameters
[Index](#Index)

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(22,6))

y_pred = KMeans(random_state=random_state).fit_predict(X)
scatter = ax[0].scatter(X[:, 0], X[:, 1], alpha=alpha, c=y_pred)
ax[0].add_artist(ax[0].legend(*scatter.legend_elements(), title="Cluster"))  # this line is to add a legend with the ID of each cluster
ax[0].set_title("KMeans")

y_pred = DBSCAN().fit_predict(X)
scatter = ax[1].scatter(X[:, 0], X[:, 1], alpha=alpha, c=y_pred)
ax[1].add_artist(ax[1].legend(*scatter.legend_elements(), title="Cluster"))
ax[1].set_title("DBSCAN")

y_pred = AgglomerativeClustering().fit_predict(X)
scatter = ax[2].scatter(X[:, 0], X[:, 1], alpha=alpha, c=y_pred)
ax[2].add_artist(ax[2].legend(*scatter.legend_elements(), title="Cluster"))
ax[2].set_title("AgglomerativeClustering")

plt.show()

- For `KMeans` and `AgglomerativeClustering` we can specify the desired number of clusters as an argument (`n_clusters`). The two plots show a different number of clusters because the default value differs (8 for `KMeans`, 2 for `AgglomerativeClustering`).
- For `DBSCAN` we cannot explicitly set the number of clusters: it "finds" the number of clusters depending on other arguments.
- In `KMeans` and `AgglomerativeClustering` all the points are assigned to a cluster; in `DBSCAN` some points are given the label `-1`, which marks the "noisy samples" (and thus should not be considered a cluster).
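
The three observations above can also be checked numerically instead of reading them off the plots. The cell below is a minimal sketch: it regenerates the same blobs under the hypothetical name `X_demo` so that it runs on its own, prints the default `n_clusters` of the two estimators, and counts the clusters and noisy samples that `DBSCAN` finds with its default arguments.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Same artificial blobs as in section 1.1, regenerated so the cell is self-contained
X_demo, _ = make_blobs(n_samples=1500, random_state=170)

# Default number of clusters of the two estimators that accept `n_clusters`
default_kmeans = KMeans().n_clusters
default_agglo = AgglomerativeClustering().n_clusters
print("KMeans default n_clusters:", default_kmeans)
print("AgglomerativeClustering default n_clusters:", default_agglo)

# DBSCAN: count the clusters it found and the points labelled as noise (-1)
dbscan_labels = DBSCAN().fit_predict(X_demo)
n_noise = int(np.sum(dbscan_labels == -1))
n_found = int(np.sum(np.unique(dbscan_labels) != -1))
print("DBSCAN found %d clusters and %d noisy samples" % (n_found, n_noise))
```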

### 1.3

### Let's try to select the number of clusters
[Index](#Index)

Let's leave DBSCAN aside for the moment, as it does not have an `n_clusters` parameter.

First of all, I define a function for plotting the KMeans and AgglomerativeClustering results: I will have to plot them several times, and this keeps the notebook more readable.

In [None]:
def plot_KMeans_and_AgglomerativeClustering(X, n_clusters, random_state=random_state, alpha=alpha):

    fig, ax = plt.subplots(1, 2, figsize=(14,6))

    y_pred = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X)
    scatter = ax[0].scatter(X[:, 0], X[:, 1], alpha=alpha, c=y_pred)
    ax[0].add_artist(ax[0].legend(*scatter.legend_elements(), title="Cluster"))
    ax[0].set_title("KMeans(n_clusters=%d)" % n_clusters)

    y_pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    scatter = ax[1].scatter(X[:, 0], X[:, 1], alpha=alpha, c=y_pred)
    ax[1].add_artist(ax[1].legend(*scatter.legend_elements(), title="Cluster"))
    ax[1].set_title("AgglomerativeClustering(n_clusters=%d)" % n_clusters)

    plt.show()

Let's start with the "correct" number of clusters (i.e. 3).

In [None]:
plot_KMeans_and_AgglomerativeClustering(X, n_clusters=3)

What happens if we change the number of clusters?
We did this analysis for KMeans in the previous notebook, but now let's repeat it with AgglomerativeClustering as well.

In [None]:
plot_KMeans_and_AgglomerativeClustering(X, n_clusters=2)

There is an interesting difference between the two plots! According to KMeans, some points which are very close to the top-center blob are actually marked as belonging to the same cluster as the bottom left blob.

Why is that? We can answer that question by looking at the centers of the two clusters.

In [None]:
n_clusters = 2

fig, ax = plt.subplots(figsize=(6,6))

kmeans = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
y_pred = kmeans.predict(X)
scatter = ax.scatter(X[:, 0], X[:, 1], alpha=0.2, c=y_pred)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, label='cluster centers')
ax.add_artist(ax.legend(*scatter.legend_elements(), title="Cluster", loc='lower right'))
ax.set_title("KMeans(n_clusters=%d)" % n_clusters)
ax.legend()

plt.show()
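
The plot of the centers hints at the explanation: KMeans assigns each point to the closest cluster center, regardless of which blob the point visually belongs to. The cell below is a small sketch that verifies this by computing the distances by hand; `X_demo`, `kmeans_demo` and the other names are hypothetical and only used here, and `n_init=10` is set explicitly so that the behaviour does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same artificial blobs as in section 1.1, regenerated so the cell is self-contained
X_demo, _ = make_blobs(n_samples=1500, random_state=170)
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=170).fit(X_demo)

# Euclidean distance from every point to every center: shape (n_samples, n_clusters)
distances = np.linalg.norm(
    X_demo[:, np.newaxis, :] - kmeans_demo.cluster_centers_[np.newaxis, :, :],
    axis=2,
)

# Index of the closest center for each point, compared with the KMeans assignment
nearest_center = distances.argmin(axis=1)
predicted = kmeans_demo.predict(X_demo)
agreement = float(np.mean(nearest_center == predicted))
print("Fraction of points assigned to their nearest center: %.3f" % agreement)
```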

In [None]:
plot_KMeans_and_AgglomerativeClustering(X, n_clusters=4)

In [None]:
plot_KMeans_and_AgglomerativeClustering(X, n_clusters=5)

In general, we can see that `KMeans` tends to create clusters with regular shapes, while `AgglomerativeClustering` clusters have boundaries which depend on the density of points in each area.

Both models seem to distinguish pretty well between the different blobs, if we use `n_clusters` >= the number of blobs (i.e. they do not assign elements of different blobs to the same clusters).
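
Since the artificial data comes with the blob of origin of each point, the last statement can be checked with the homogeneity score (imported in section 0), which equals 1 when every cluster contains points of a single blob. This is only a sketch on regenerated data (`X_demo`, `y_demo` and `blob_purity` are hypothetical names), not part of the original analysis.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score

# Same artificial blobs as in section 1.1, this time keeping the blob labels
X_demo, y_demo = make_blobs(n_samples=1500, random_state=170)

blob_purity = {}
for n_clusters in [2, 3, 4, 5]:
    models = [
        ("KMeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=170)),
        ("AgglomerativeClustering", AgglomerativeClustering(n_clusters=n_clusters)),
    ]
    for name, model in models:
        y_pred = model.fit_predict(X_demo)
        # Homogeneity is 1.0 only if no cluster mixes points of different blobs
        blob_purity[(name, n_clusters)] = homogeneity_score(y_demo, y_pred)
        print("%-25s n_clusters=%d homogeneity=%.4f"
              % (name, n_clusters, blob_purity[(name, n_clusters)]))
```

With 2 clusters and 3 blobs the score is necessarily below 1, because at least one cluster has to contain points of two different blobs.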

### 1.4 
### What parameters can we change with DBSCAN?
[Index](#Index)

Again, let's define a function which I can use to plot the clusters.

In [None]:
def plot_DBSCAN(dbscan, X):
    fig, ax = plt.subplots(figsize=(6,6))
    y_pred = dbscan.fit_predict(X)
    scatter = ax.scatter(X[:, 0], X[:, 1], alpha=alpha, c=y_pred)
    ax.add_artist(ax.legend(*scatter.legend_elements(), title="Cluster"))
    ax.set_title("DBSCAN")
    plt.show()

In [None]:
plot_DBSCAN(DBSCAN(), X)

Let's look at the documentation of [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#).
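
The two arguments that matter most are `eps` (the radius of the neighbourhood of a point, default `0.5`) and `min_samples` (the number of neighbours needed for a point to be a core point, default `5`). Before looking at the plots, the cell below sketches how the number of clusters and of noisy samples changes with `eps`; it works on regenerated blobs (`X_demo` is a hypothetical name) and keeps `min_samples` at its default.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same artificial blobs as in section 1.1, regenerated so the cell is self-contained
X_demo, _ = make_blobs(n_samples=1500, random_state=170)

eps_values = [0.3, 0.5, 0.75, 0.9]
noise_counts = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    n_found = int(np.sum(np.unique(labels) != -1))  # clusters, excluding the noise label
    n_noise = int(np.sum(labels == -1))             # points labelled as noise
    noise_counts.append(n_noise)
    print("eps = %.2f -> %d clusters, %d noisy samples" % (eps, n_found, n_noise))
```

For a fixed `min_samples`, a larger `eps` can only enlarge each neighbourhood, so the number of noisy samples never increases as `eps` grows.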

In [None]:
plot_DBSCAN(DBSCAN(eps=0.75), X)

In [None]:
plot_DBSCAN(DBSCAN(eps=0.9), X)

In [None]:
plot_DBSCAN(DBSCAN(eps=0.3), X)

### 1.5

### Experimenting with different distributions of artificial data
[Index](#Index)

In [None]:
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_transformed = np.dot(X, transformation)

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], alpha=alpha)
plt.show()

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(22,22))

for idx, n_clusters in enumerate([2, 3, 5]):
    y_pred = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X_transformed)
    scatter = ax[idx][0].scatter(X_transformed[:, 0], X_transformed[:, 1], alpha=alpha, c=y_pred)
    ax[idx][0].add_artist(ax[idx][0].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][0].set_title("KMeans (n_clusters=%d)" % n_clusters)

for idx, eps in enumerate([0.3, 0.5, 0.8]):
    y_pred = DBSCAN(eps=eps).fit_predict(X_transformed)
    scatter = ax[idx][1].scatter(X_transformed[:, 0], X_transformed[:, 1], alpha=alpha, c=y_pred)
    ax[idx][1].add_artist(ax[idx][1].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][1].set_title("DBSCAN (eps = %.2f)" % eps)

for idx, n_clusters in enumerate([2, 3, 5]):
    y_pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_transformed)
    scatter = ax[idx][2].scatter(X_transformed[:, 0], X_transformed[:, 1], alpha=alpha, c=y_pred)
    ax[idx][2].add_artist(ax[idx][2].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][2].set_title("AgglomerativeClustering (n_clusters = %d)" % n_clusters)

plt.show()

---

### Different variance

In [None]:
X_var, y_var = make_blobs(n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state)

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X_var[:, 0], X_var[:, 1], alpha=alpha)
plt.show()

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(22,22))

for idx, n_clusters in enumerate([2, 3, 5]):
    y_pred = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X_var)
    scatter = ax[idx][0].scatter(X_var[:, 0], X_var[:, 1], alpha=alpha, c=y_pred)
    ax[idx][0].add_artist(ax[idx][0].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][0].set_title("KMeans (n_clusters=%d)" % n_clusters)

for idx, eps in enumerate([0.3, 0.5, 0.8]):
    y_pred = DBSCAN(eps=eps).fit_predict(X_var)
    scatter = ax[idx][1].scatter(X_var[:, 0], X_var[:, 1], alpha=alpha, c=y_pred)
    ax[idx][1].add_artist(ax[idx][1].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][1].set_title("DBSCAN (eps = %.2f)" % eps)

for idx, n_clusters in enumerate([2, 3, 5]):
    y_pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_var)
    scatter = ax[idx][2].scatter(X_var[:, 0], X_var[:, 1], alpha=alpha, c=y_pred)
    ax[idx][2].add_artist(ax[idx][2].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][2].set_title("AgglomerativeClustering (n_clusters = %d)" % n_clusters)

plt.show()

---

### Unevenly sized blobs

In [None]:
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X_filtered[:, 0], X_filtered[:, 1], alpha=alpha)
plt.show()

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(22,22))

for idx, n_clusters in enumerate([2, 3, 5]):
    y_pred = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X_filtered)
    scatter = ax[idx][0].scatter(X_filtered[:, 0], X_filtered[:, 1], alpha=alpha, c=y_pred)
    ax[idx][0].add_artist(ax[idx][0].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][0].set_title("KMeans (n_clusters=%d)" % n_clusters)

for idx, eps in enumerate([0.3, 0.5, 0.8]):
    y_pred = DBSCAN(eps=eps).fit_predict(X_filtered)
    scatter = ax[idx][1].scatter(X_filtered[:, 0], X_filtered[:, 1], alpha=alpha, c=y_pred)
    ax[idx][1].add_artist(ax[idx][1].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][1].set_title("DBSCAN (eps = %.2f)" % eps)

for idx, n_clusters in enumerate([2, 3, 5]):
    y_pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_filtered)
    scatter = ax[idx][2].scatter(X_filtered[:, 0], X_filtered[:, 1], alpha=alpha, c=y_pred)
    ax[idx][2].add_artist(ax[idx][2].legend(*scatter.legend_elements(), title="Cluster"))
    ax[idx][2].set_title("AgglomerativeClustering (n_clusters = %d)" % n_clusters)

plt.show()

## 2.

## Clustering on the fraud detection dataset
[Index](#Index)

### 2.1

### Analysis of the dataset
[Index](#Index)

In [None]:
df = pd.read_csv('payment_fraud.csv')

<div class="alert alert-block alert-danger">
<b>Q: we often had to define in advance the column names. However, this time we didn't have to do so. Why?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>

<div class="alert alert-block alert-danger">
<b>Q: Display 1 random row of the dataframe.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Print the shape of the dataframe</b>
</div>

<div class="alert alert-block alert-info">
<b>Some clustering algorithms (e.g. AgglomerativeClustering) require a lot of memory. In order not to have any problems in running them on my machine during this session, I subsample the original dataframe so as to keep only a few of the original samples.</b>
</div>

In [None]:
df_benign = df[df['label']==0]
df_fraud = df[df['label']!=0]

df = pd.concat([df_benign.sample(2000, random_state=random_state), df_fraud], ignore_index=True)

In [None]:
df.shape

<div class="alert alert-block alert-danger">
<b>Q: Show the type of each feature.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Display the number of occurrences of each value of 'paymentMethod'.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of 'accountAgeDays'.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of 'localTime'.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of 'paymentMethodAgeDays'.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Plot 'paymentMethodAgeDays' versus 'accountAgeDays'. Can you see a relationship between them? If so, what is it? Also, try to do the same thing separating fraudulent and not fraudulent entries (e.g. by plotting them in different colors).</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: The plots above suggest the reason why this was a very easy dataset (if you remember, back in the 3rd session we got 100% accuracy on it!). What is that reason?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>

<div class="alert alert-block alert-info">
As in previous sessions, I drop the `accountAgeDays` column, in order to create a more difficult dataset.
</div>

In [None]:
df = df.drop('accountAgeDays', axis=1)

In [None]:
df.sample()

---

<div class="alert alert-block alert-danger">
<b>Q: Try to look for any correlations between 'paymentMethodAgeDays' and 'numItems'. Do that separating fraudulent and not fraudulent entries as well (e.g. by plotting them in different colors).</b>
</div>

### 2.2
### Data preparation for clustering
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Create a new dataframe performing one hot encoding on the feature(s) that require so.</b>
</div>

In [None]:
# df_one_hot = ...

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to perform scaling.</b>
</div>

In [None]:
# numeric_cols = ...

# standard_scaler = StandardScaler().fit...
# df_one_hot[numeric_cols] = standard_scaler...
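
If you do not remember how the two steps work, the cell below is a sketch on a tiny toy dataframe. The dataframe `toy` and its columns `amount` and `method` are hypothetical and are not the columns of the fraud dataset; the sketch assumes `pd.get_dummies` for the one hot encoding and `StandardScaler` for the scaling, but other choices are possible.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy dataframe with one numeric and one categorical feature (hypothetical columns)
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "method": ["card", "paypal", "card", "credit"],
    "label": [0, 0, 1, 0],
})

# One hot encoding: the categorical column is replaced by one column per category
toy_one_hot = pd.get_dummies(toy, columns=["method"])

# Scaling: only the numeric feature is standardized (zero mean, unit variance)
toy_numeric_cols = ["amount"]
toy_one_hot[toy_numeric_cols] = StandardScaler().fit_transform(toy_one_hot[toy_numeric_cols])

print(toy_one_hot)
```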

---

### 2.3
### Clustering
[Index](#Index)

Documentation:
- [kmeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
- [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN)
- [hierarchical clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)

**Preliminary**: I create a new dataframe which does not contain the true label, and an array which contains only the true labels

In [None]:
true_labels = df_one_hot['label'].values
df = df_one_hot.drop(['label'], axis=1)

### 2.4
### `KMeans`
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Define a <code>KMeans</code> object with 2 clusters and perform the clustering. Also, measure the time elapsed for training.</b>
</div>

In [None]:
# kmeans = ...

<div class="alert alert-block alert-danger">
<b>Q: Print the coordinates of the center of each cluster.</b>
</div>

In [None]:
# TODO

### 2.5
### Evaluating the `KMeans` algorithm
[Index](#Index)

There are many possible metrics for clustering evaluation: some can be used when the ground truth labels are known, some can be used when the true labels are unknown.

**Extrinsic evaluation** - If the ground truth is known (as in this case, since we have the `label` column) you could use:
- [homogeneity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score)
- [completeness](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score)
- [v-measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html#sklearn.metrics.v_measure_score)

**Intrinsic evaluation** - If the ground truth is not known, you could use:
- [silhouette](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score) - "a higher Silhouette Coefficient score relates to a model with better defined clusters"
- [Calinski-Harabasz (C-H) Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html) - the score is the ratio of the between-cluster dispersion to the within-cluster dispersion, so a higher score relates to a model with better defined clusters
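
To get a feeling for what these metrics measure, here is a small sketch on toy data (all names below are hypothetical and not used elsewhere in the notebook). In the toy labelling, the second class is split into two clusters: every cluster is "pure", so the homogeneity is 1, but the completeness is lower because one class is spread over several clusters. The V-measure is the harmonic mean of the two.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    homogeneity_score,
    completeness_score,
    v_measure_score,
    silhouette_score,
    calinski_harabasz_score,
)

# Extrinsic metrics: they compare true labels with predicted clusters
toy_true = [0, 0, 1, 1]
toy_pred = [0, 0, 1, 2]   # class 1 is split into clusters 1 and 2
h = homogeneity_score(toy_true, toy_pred)
c = completeness_score(toy_true, toy_pred)
v = v_measure_score(toy_true, toy_pred)
print("homogeneity = %.4f, completeness = %.4f, v-measure = %.4f" % (h, c, v))

# Intrinsic metrics: they only need the data and the cluster assignment
# (here the blob of origin is used as the cluster assignment)
X_demo, y_demo = make_blobs(n_samples=300, random_state=170)
sil = silhouette_score(X_demo, y_demo)
ch = calinski_harabasz_score(X_demo, y_demo)
print("silhouette = %.4f, C-H index = %.1f" % (sil, ch))
```

Note the order of the arguments: the extrinsic metrics take `(labels_true, labels_pred)`, while the intrinsic ones take `(X, labels)`.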

<div class="alert alert-block alert-danger">
<b>Q: Above, I have inserted links to the documentation of each metric. Looking at the documentation and the examples provided here, try to compute the metrics for the KMeans clustering we performed.</b>
</div>

**Extrinsic evaluation**

In [None]:
print("Homogeneity score  = %.4f" % homogeneity_score(true_labels, kmeans.labels_))
# TODO: completeness_score
# TODO: v_measure_score

**Intrinsic evaluation**

In [None]:
print("Silhouette score  = %.4f" % silhouette_score(df, kmeans.labels_))

<div class="alert alert-block alert-danger">
<b>Q: Print the Calinski-Harabasz score.</b>
</div>

In [None]:
# TODO: calinski_harabasz_score

<div class="alert alert-block alert-danger">
<b>Q: Now perform k-means clustering for k=3, compute the metrics and compare the results.</b>
</div>

In [None]:
# fit KMeans and evaluate it

### 2.6
### The elbow method
[Index](#Index)


In KMeans, how do you choose the best number of clusters?

Looking at the value of a metric alone is not very helpful, as many metrics tend to keep increasing (or decreasing) as the number of clusters increases.
We can use the elbow method: that is, we compute the evaluation metrics for different values of `n_clusters` and observe how the evaluation metrics change.
Then, we pick as the number of clusters the value at which the chosen evaluation metric shows a "step" (the elbow).
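
The elbow method is often illustrated with the `inertia_` attribute of a fitted `KMeans` (the sum of squared distances of the samples to their closest center), which is not used elsewhere in this notebook. The cell below is a sketch of the idea on the artificial blobs of section 1, where we know that the "correct" number of clusters is 3; `X_demo`, `k_values`, `inertia_values` and `drops` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same artificial blobs as in section 1.1, regenerated so the cell is self-contained
X_demo, _ = make_blobs(n_samples=1500, random_state=170)

k_values = list(range(1, 9))
inertia_values = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=170).fit(X_demo)
    inertia_values.append(model.inertia_)

# How much the inertia drops when going from k-1 to k clusters:
# the elbow is where the drops become small compared to the previous ones
drops = [inertia_values[i - 1] - inertia_values[i] for i in range(1, len(inertia_values))]
for k, inertia, drop in zip(k_values[1:], inertia_values[1:], drops):
    print("k = %d; inertia = %10.1f; drop w.r.t. k-1 = %10.1f" % (k, inertia, drop))
```

Plotting `inertia_values` against `k_values`, in the same way as in the cells below, shows the same information as a curve.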

**Homogeneity score**

In [None]:
homogeneity_values = []
x_range = range(2, 20) 
for k in x_range:
    t0 = time.time()
    kmeans = KMeans(n_clusters=k, random_state=random_state).fit(df)
    homogeneity_values.append(homogeneity_score(true_labels, kmeans.labels_))
    print("k = %2d; Elapsed time = %.2f s" % (k, time.time()-t0))

In [None]:
fig, ax = plt.subplots()
ax.plot(x_range, homogeneity_values)
ax.set_xticks(x_range)
ax.set_title('KMEANS - Homogeneity for varying K')
ax.set_xlabel('K')
ax.grid()
plt.show()

**Calinski Harabasz Index**

In [None]:
ch_values = []
for k in x_range:
    t0 = time.time()
    kmeans = KMeans(n_clusters=k, random_state=random_state).fit(df)
    ch_values.append(calinski_harabasz_score(df, kmeans.labels_))
    print("k = %2d; Elapsed time = %.2f s" % (k, time.time()-t0))

In [None]:
fig, ax = plt.subplots()
ax.plot(x_range, ch_values)
ax.set_xticks(x_range)
ax.set_title('KMEANS - C-H Index for varying K')
ax.set_xlabel('K')
ax.grid()
plt.show()

<div class="alert alert-block alert-danger">
<b>Q: Repeat the analysis with the elbow method using the silhouette score</b>
</div>

### 2.7
### `AgglomerativeClustering`
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Repeat the analysis with the elbow method using the homogeneity score on AgglomerativeClustering.</b>
</div>

### 2.8
### `DBSCAN`
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Repeat the analysis with the elbow method using the homogeneity score and C-H Index on DBSCAN. Remember that with DBSCAN you cannot select the <code>n_clusters</code> parameter, you have to select the <code>eps</code> parameter.</b>
</div>
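
One practical hint: `calinski_harabasz_score` (like `silhouette_score`) raises an error when fewer than 2 distinct labels are present, which can happen with DBSCAN for some values of `eps`. The cell below sketches one possible guard on the artificial blobs; `safe_ch_score` is a hypothetical helper, and excluding the noisy samples (label `-1`) before computing the score is an assumption of this sketch, not the only reasonable choice.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Same artificial blobs as in section 1.1, regenerated so the cell is self-contained
X_demo, _ = make_blobs(n_samples=1500, random_state=170)

def safe_ch_score(X, labels):
    """Hypothetical helper: C-H index on the non-noise points, NaN if it is undefined."""
    mask = labels != -1                       # assumption: noisy samples are ignored
    if len(np.unique(labels[mask])) < 2:      # the score needs at least 2 clusters
        return float("nan")
    return calinski_harabasz_score(X[mask], labels[mask])

ch_by_eps = {}
for eps in [0.3, 0.5, 0.8, 5.0]:
    labels = DBSCAN(eps=eps).fit_predict(X_demo)
    ch_by_eps[eps] = safe_ch_score(X_demo, labels)
    print("eps = %.2f; C-H index = %.1f" % (eps, ch_by_eps[eps]))
```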

**Homogeneity**

**Calinski Harabasz Index**

---