# 8 - Data visualization & clustering

---

In this notebook, we will continue to explore data visualization techniques, then focus on clustering algorithms and on ways to evaluate them.

The dataset used in this notebook is the same one we used in the third session.
In case you do not remember the details: the dataset contains a list of transactions from an online retailer. For each transaction, we have several attributes, as well as a label indicating whether the transaction was fraudulent or not.

#### Import the required libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import StandardScaler

#### Define the filename for the training data

In [None]:
FILENAME = 'payment_fraud.csv'

#### Read the file

In [None]:
df = pd.read_csv(FILENAME)

<div class="alert alert-block alert-danger">
<b>Q: We often had to define the column names in advance. However, this time we didn't have to. Why?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Display 10 random rows of the dataframe.</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Show the columns of the dataframe</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Print the number of rows of the dataframe</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Print the number of columns of the dataframe</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Show the type of each feature.</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Display the number of occurrences of each value of 'paymentMethod'.</b>
</div>

---

## Data visualization

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of 'accountAgeDays'.</b>
</div>

In [None]:
fig, ax = plt.subplots()
# TODO
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of 'localTime'.</b>
</div>

In [None]:
fig, ax = plt.subplots()
# TODO
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of 'paymentMethodAgeDays'.</b>
</div>

In [None]:
fig, ax = plt.subplots()
# TODO
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: Plot the same three distributions as above but separately for the fraudulent and not fraudulent transactions.</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Plot 'paymentMethodAgeDays' versus 'accountAgeDays'. Can you see a relationship between them? If so, what is it? Also, try to do the same thing separating fraudulent and not fraudulent entries (e.g. by plotting them in different colors).</b>
</div>

In [None]:
x = # TODO
y = # TODO
colors = # TODO

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
# TODO
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: The plots above suggest why this was a very easy dataset (if you remember, we got 100% accuracy on it!). What is the reason?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Try to look for any correlations between 'paymentMethodAgeDays' and 'numItems'. Do that separating fraudulent and not fraudulent entries as well (e.g. by plotting them in different colors).</b>
</div>

In [None]:
x = # TODO
y = # TODO
colors = # TODO

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
# TODO
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: Display with a bar plot the number of transactions for each 'paymentMethod'</b>
</div>

---

## Data preparation for clustering

<div class="alert alert-block alert-danger">
<b>Q: Create a new dataframe by performing one-hot encoding on the feature(s) that require it.</b>
</div>

In [None]:
df_one_hot = # TODO

<div class="alert alert-block alert-danger">
<b>Q: Complete the cell below in order to perform scaling.</b>
</div>

In [None]:
numeric_cols = # TODO

standard_scaler = StandardScaler().fit # TODO
df_one_hot[numeric_cols] = standard_scaler # TODO

---

# Clustering

Documentation:
- [kmeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
- [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN)
- [hierarchical clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)

---

#### Import the classes for clustering

In [None]:
from sklearn.cluster import (
    KMeans, 
    AgglomerativeClustering, 
    DBSCAN
)

---

### Let's start with KMeans

You can fit a clustering model in much the same way as you fit a regression or classification model.

In [None]:
kmeans = KMeans(n_clusters=3).fit(df_one_hot.drop(['label'], axis=1))

You can print the coordinates of the center of each cluster.

In [None]:
kmeans.cluster_centers_

If you want, you can print the cluster label assigned to each row of the dataframe.

In [None]:
kmeans.labels_
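
To make the shapes of these two attributes concrete, here is a minimal sketch on made-up data. The array `toy_points` and the name `toy_kmeans` are hypothetical and are not part of the exercise. The `random_state` and `n_init` arguments are set only so that the sketch is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points each.
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# random_state and n_init are fixed only to make the sketch reproducible.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)

print(toy_kmeans.labels_)           # one cluster index per row of toy_points
print(toy_kmeans.cluster_centers_)  # one row of coordinates per cluster
```

Note that the cluster indices are arbitrary: which group is called 0 and which is called 1 carries no meaning.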

Let's save those labels in a new column of the dataframe

In [None]:
df_one_hot['cluster'] = kmeans.labels_

You can get an intuition of how well the clustering algorithm performed by looking at how the elements of each class are distributed across the three clusters:

In [None]:
df_one_hot.groupby(['label', 'cluster']).size()
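
The same counts can also be arranged as a table with one row per true label and one column per cluster, which some people find easier to read. The sketch below uses a small made-up dataframe (`toy` is a hypothetical name) rather than the fraud data.

```python
import pandas as pd

# Hypothetical toy frame with a ground-truth label and a cluster assignment.
toy = pd.DataFrame({
    'label':   [0, 0, 0, 1, 1, 1],
    'cluster': [0, 0, 1, 1, 1, 1],
})

# Long format: one count per (label, cluster) pair that actually occurs.
toy_counts = toy.groupby(['label', 'cluster']).size()

# Wide format: rows are true labels, columns are clusters, missing pairs are 0.
toy_table = pd.crosstab(toy['label'], toy['cluster'])
print(toy_table)
```

A clustering that matches the labels well shows most of each row concentrated in a single column.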

---

### Evaluating the clustering algorithms

There are many possible metrics for clustering evaluation: some can be used when the ground truth labels are known, some can be used when the true labels are unknown.

If the ground truth is known, you can use, for instance:
- [homogeneity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score)
- [completeness](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score)
- [v-measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html#sklearn.metrics.v_measure_score)

If the ground truth is not known, you can use, for instance:
- [silhouette](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score) - "a higher Silhouette Coefficient score relates to a model with better defined clusters"
- [Davies-Bouldin Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score) - "a lower Davies-Bouldin index relates to a model with better separation between the clusters"
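
As a rough illustration of how a ground-truth-free metric behaves, the sketch below computes the silhouette for two different assignments of the same made-up points. All names prefixed with `toy_` are hypothetical. One assignment follows the two visible groups, while the other mixes them.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two well-separated groups of three points each.
toy_X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

toy_good = [0, 0, 0, 1, 1, 1]  # follows the two groups
toy_bad = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_score = silhouette_score(toy_X, toy_good)
bad_score = silhouette_score(toy_X, toy_bad)
print(good_score, bad_score)
```

The assignment that follows the groups obtains the higher silhouette, which is consistent with the quoted rule that a higher score relates to better defined clusters.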

---

#### Import the metrics

In [None]:
from sklearn.metrics import (
    homogeneity_score,
    completeness_score,
    v_measure_score,
    silhouette_score,
    davies_bouldin_score,
)

<div class="alert alert-block alert-danger">
<b>Q: In the cell above, I have imported the metrics. Following the examples given below, try to compute the other metrics for the clusters computed above with k-means.</b>
</div>

Remember to keep separate:
- the predicted labels
- the column 'label' containing the ground truth (be careful not to consider it as a feature while testing!!)
- the features.

In [None]:
predicted_clusters = df_one_hot['cluster'].values
true_labels = df_one_hot['label'].values

In [None]:
print(homogeneity_score(true_labels, predicted_clusters))
# TODO: completeness_score
# TODO: v_measure_score

In [None]:
print(silhouette_score(df_one_hot.drop(['label', 'cluster'], axis=1), kmeans.labels_))
# TODO: davies_bouldin_score

---

<div class="alert alert-block alert-danger">
<b>Q: After that, perform k-means clustering for k=4, compute the same metrics and compare the results. Which one performed better?</b>
</div>

Remember: better metrics don't always mean better clustering.
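
One way to see this is with homogeneity, sketched below on made-up labels (the `toy_` names are hypothetical). A clustering that puts every point in its own cluster is useless in practice, yet it is perfectly homogeneous, because no cluster mixes two classes.

```python
from sklearn.metrics import homogeneity_score

# Hypothetical ground truth: two classes of two points each.
toy_true = [0, 0, 1, 1]

toy_sensible = [0, 0, 1, 1]    # one cluster per class
toy_singletons = [0, 1, 2, 3]  # every point in its own cluster

h_sensible = homogeneity_score(toy_true, toy_sensible)
h_singletons = homogeneity_score(toy_true, toy_singletons)
print(h_sensible, h_singletons)
```

Both assignments reach the maximum homogeneity, so this metric alone cannot tell them apart. This is one reason to look at several metrics together, and at the number of clusters, before declaring a winner.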

---

## Let's have a look at the elbow method in action

In [None]:
silhouette_values = []
# Use only the features: drop the ground truth and the 'cluster' column added earlier
features = df_one_hot.drop(['label', 'cluster'], axis=1)
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k).fit(features)
    silhouette_values.append(silhouette_score(features, kmeans.labels_))

In [None]:
fig, ax = plt.subplots()
ax.plot(range(2, 20), silhouette_values)
ax.set_xticks(range(2, 20))
ax.set_title('KMEANS - Silhouette for varying K')
ax.set_xlabel('K')
plt.show()
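
The cells above track the silhouette. The elbow method is often described instead in terms of the within-cluster sum of squares, which scikit-learn exposes as `KMeans.inertia_`. The sketch below is an alternative illustration on synthetic data, not the approach taken in this notebook. The blob centers, `toy_blobs`, and `toy_inertias` are all made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical synthetic data: three compact, well-separated blobs.
toy_blobs, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

toy_ks = list(range(1, 7))
toy_inertias = []
for k in toy_ks:
    # random_state and n_init are fixed only for reproducibility.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_blobs)
    toy_inertias.append(model.inertia_)

for k, inertia in zip(toy_ks, toy_inertias):
    print(k, round(inertia, 1))
```

On data like this, the inertia falls sharply up to the true number of blobs and only marginally afterwards. That bend is the "elbow". On real data the bend is usually much less pronounced.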

---

<div class="alert alert-block alert-danger">
<b>Q: Now compute and plot the value of the v_measure_score for different K</b>
</div>

In [None]:
v_measure_values = []
for k in range(2, 20):
    kmeans = # TODO
    v_measure_values.append() # TODO

In [None]:
fig, ax = plt.subplots()
ax.plot(range(2, 20), v_measure_values)
ax.set_xticks(range(2, 20))
ax.set_title('KMEANS - v-measure for varying K')
ax.set_xlabel('K')
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: Now try to see how the v-measure depends on K for hierarchical clustering.</b>
</div>

In [None]:
v_measure_values = []
for k in range(2, 10):
    agglomerative = # TODO
    v_measure_values.append() # TODO

In [None]:
fig, ax = plt.subplots()
ax.plot(range(2, 10), v_measure_values)
ax.set_xticks(range(2, 10))
ax.set_title('HIERARCHICAL CLUSTERING - v-measure for varying K')
ax.set_xlabel('K')
plt.show()

---

<div class="alert alert-block alert-danger">
<b>Q: If you have time, try to compare the results obtained with K-Means and hierarchical clustering with the ones you obtain with DBSCAN.</b>
</div>
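
Before trying DBSCAN on the real data, it may help to know how it differs from the other two algorithms: it does not take a number of clusters, and it marks points that belong to no cluster with the label `-1`. The sketch below shows this on made-up points. The values of `eps` and `min_samples` are assumptions chosen to suit this toy data and will not transfer to the fraud dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one isolated point.
toy_data = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
    [50.0, 50.0],
])

# eps and min_samples are illustrative values picked for this toy data.
toy_dbscan = DBSCAN(eps=1.5, min_samples=2).fit(toy_data)
toy_db_labels = toy_dbscan.labels_
print(toy_db_labels)

# Count the clusters found, ignoring the noise label -1.
n_found = len(set(toy_db_labels) - {-1})
print(n_found)
```

When comparing against K-Means or hierarchical clustering, keep in mind that the noise points are counted by the metrics as if `-1` were one more cluster, unless you filter them out first.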

---