<a href="https://colab.research.google.com/github/mnijhuis-dnb/Artificial_Intelligence_and_Machine_Learning_for_SupTech/blob/main/Tutorials/Tutorial%206%20Finding%20clusters%20and%20neighbours.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Artificial Intelligence and Machine Learning for SupTech  
Tutorial 6: Finding clusters and neighbors

*	Implementing K-means and DBSCAN
*	Hierarchical clustering: Bottom-up or Top-down?
*	Visual inspection of results

<br/>

14 March 2023  

**Instructors**  
Prof. Iman van Lelyveld (iman.van.lelyveld@vu.nl)<br/>
Dr. Michiel Nijhuis (m.nijhuis@dnb.nl)  

In [None]:
!gdown 1PCu4jNahysRpZ72z31KHpVkyAOp6nrKj

In this tutorial we group companies based on a number of factors and check whether the resulting groups differ in revenue a couple of years later.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/content/company_data.csv', index_col=0)

We are going to cluster the companies using the 2016 data and then look at how the clusters differ in the 2018 data.

In [None]:
df_2016 = df.loc[df['year'] == 2016, :]
df_2018 = df.loc[df['year'] == 2018, :]

Select a number of columns in the dataframe that could be predictors of revenue two years down the line. For example, high R&D investments could translate into larger revenue a few years later, so that could be a column to select.

In [None]:
selected_columns = ['','','','']
df_2016_limited = df_2016[selected_columns]

We are going to cluster this data using the KMeans algorithm.

In [None]:
from sklearn.cluster import KMeans

In [None]:
n_clusters = 3

kmeans = KMeans(
    n_clusters=n_clusters,
    n_init=10,  # set explicitly; the default differs between scikit-learn versions
    random_state=0,
).fit(df_2016_limited)

Let's have a look at the results

In [None]:
pd.Series(kmeans.labels_, index=df_2016_limited.index, name='clusters')
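
To make the structure of `labels_` concrete, here is a minimal sketch on made-up data. The `toy` frame, its column names, and the blob centers are hypothetical and not part of the company dataset; the sketch only shows that each row receives one integer label and how to count the rows per cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
toy = pd.DataFrame(points, columns=['feature_a', 'feature_b'])

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# One integer label per row, aligned with the DataFrame index
toy_labels = pd.Series(toy_kmeans.labels_, index=toy.index, name='clusters')

# Number of rows assigned to each cluster
cluster_sizes = toy_labels.value_counts().sort_index()
print(cluster_sizes)
```

Counting the rows per cluster is a quick first check before looking at the statistics of each cluster.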

Now it is time to find out whether your clusters actually separate the data well. Calculate the summary statistics of each cluster.

In [None]:
# Summary statistics for each cluster in turn (works for any n_clusters)
for i in range(n_clusters):
  display(df_2016_limited[kmeans.labels_==i].describe())
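
An alternative to describing each cluster separately is to group the rows by their label, which puts all clusters in one table. A minimal sketch, using hypothetical toy data in place of `df_2016_limited` and `kmeans.labels_`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_2016_limited and kmeans.labels_
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    'feature_b': [5.0, 6.0, 7.0, 50.0, 60.0, 70.0],
})
toy_labels = np.array([0, 0, 0, 1, 1, 1])

# Group by the label array (same length as the frame) and describe each cluster
per_cluster = toy.groupby(toy_labels).describe()

# A compact view with only the per-cluster means
cluster_means = toy.groupby(toy_labels).mean()
print(cluster_means)
```

With the real data, the same pattern would be `df_2016_limited.groupby(kmeans.labels_).describe()`.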

You can also show your clusters visually. Here we do so in two dimensions: pick two of the columns you selected and use the code below to plot the results. How does the separation of the data look?

In [None]:
col_a = selected_columns[0]
col_b = selected_columns[1]

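# One color per cluster
colors = iter(plt.cm.rainbow(np.linspace(0, 1, n_clusters)))
for i in range(n_clusters):
  plot_df = df_2016_limited[kmeans.labels_==i]
  plt.plot(
    plot_df.loc[:, col_a], plot_df.loc[:, col_b],
    color=next(colors),
    marker='o', markersize=3, lw=0,
    label=f'Cluster {i}'
  )
# Label the axes so it is clear which columns are shown
plt.xlabel(col_a)
plt.ylabel(col_b)
plt.legend()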

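Another approach to clustering is agglomerative clustering, a bottom-up form of hierarchical clustering: every point starts in its own cluster, and the closest clusters are merged step by step until the requested number of clusters remains.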

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
agg = AgglomerativeClustering(
    n_clusters=n_clusters
).fit(df_2016_limited)

Compare the results of the agglomerative clustering to those of the KMeans clustering.

In [None]:
# Summary statistics for each agglomerative cluster in turn
for i in range(n_clusters):
  display(df_2016_limited[agg.labels_==i].describe())

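Use a confusion matrix to get an idea of how the clusters of the two methods map onto each other. The label numbers are arbitrary, so matching clusters do not have to lie on the diagonal.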

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(kmeans.labels_, agg.labels_)
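
Because cluster numbers carry no meaning, two methods can find the same grouping under different labels. The sketch below illustrates this with made-up label arrays (`labels_a` and `labels_b` are hypothetical). It uses `pd.crosstab` for a labelled version of the table and, as one possible summary, the adjusted Rand index, which does not depend on how the clusters are numbered.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: the same grouping, but numbered differently
labels_a = np.array([0, 0, 0, 1, 1, 2, 2, 2])
labels_b = np.array([2, 2, 2, 0, 0, 1, 1, 1])

# Rows: clusters of method A, columns: clusters of method B
table = pd.crosstab(
    pd.Series(labels_a, name='method_a'),
    pd.Series(labels_b, name='method_b'),
)
print(table)

# 1.0 means identical groupings; values near 0 mean agreement at chance level
ari = adjusted_rand_score(labels_a, labels_b)
print(ari)
```

In the table each row has a single non-zero entry, which shows a one-to-one mapping between the clusters even though none of the counts sit on the diagonal.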

We can also use the silhouette score to assess the quality of the clustering.

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score
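
The silhouette value of a sample lies between -1 and 1; values close to 1 indicate that the sample sits well inside its own cluster and far from the nearest other cluster. A minimal sketch on made-up data (the array `X` and its two blobs are hypothetical) shows how the two imported functions relate:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two well-separated blobs of 30 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 2)),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)  # one value per row
overall = silhouette_score(X, labels)      # mean of the per-point values
print(overall)
```

`silhouette_score` gives a single number per clustering, whereas `silhouette_samples` keeps the detail needed for a plot per cluster.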

In [None]:
def plot_silhouettes(model):
  # One panel per cluster, sharing the y-axis so the panels are comparable
  fig, axes = plt.subplots(1, n_clusters, sharey=True, figsize=[int(5*n_clusters),5])

  # Silhouette value of every sample, given the model's cluster labels
  silhouette_values = silhouette_samples(df_2016_limited, model.labels_)

  colors = iter(plt.cm.rainbow(np.linspace(0, 1, n_clusters)))
  for cluster in np.unique(model.labels_):
    ax = axes[cluster]
    color = next(colors)
    sils = silhouette_values[model.labels_ == cluster]
    sils = sorted(sils)
    ax.bar(range(len(sils)), sils, color=color, width=2)

    # The dashed line marks the average silhouette value of the cluster
    sils_avg = np.mean(sils)
    ax.axhline(sils_avg, lw=3, ls='--', color=color)
    ax.set_title(f'Cluster {cluster}\n(avg. silhouette: {sils_avg:.3f})')

  fig.tight_layout()

# Call the function at the top level, outside the function body
plot_silhouettes(kmeans)

Can you make the same plot for the agglomerative clustering?

In [None]:
plot_silhouettes(agg)

Now we can see whether the clusters you have defined are also related to the revenue in 2018. Can you combine the 2016 and 2018 data and compare the 2018 revenue statistics between the clusters?

In [None]:
df_revenue = df_2016.merge(df_2018, how='left', left_index=True, right_index=True, suffixes=['_2016','_2018'])[['Revenue_2016', 'Revenue_2018']]
# Relative change: the difference divided by the mean of the two years
df_revenue['Revenue_difference'] = (df_revenue['Revenue_2018'] - df_revenue['Revenue_2016']) / (0.5*(df_revenue['Revenue_2018'] + df_revenue['Revenue_2016']))
df_revenue['labels'] = kmeans.labels_
df_revenue.groupby('labels')['Revenue_difference'].mean()
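
A symmetric relative difference divides the change in revenue by the mean of the two years, so the subtraction has to be evaluated before the division. The sketch below works this out on made-up figures (the `toy` frame and its values are hypothetical) to make the arithmetic easy to check by hand.

```python
import pandas as pd

# Hypothetical revenues for three companies in two clusters
toy = pd.DataFrame({
    'Revenue_2016': [100.0, 200.0, 50.0],
    'Revenue_2018': [150.0, 200.0, 25.0],
    'labels': [0, 0, 1],
})

change = toy['Revenue_2018'] - toy['Revenue_2016']              # 50, 0, -25
midpoint = 0.5 * (toy['Revenue_2018'] + toy['Revenue_2016'])    # 125, 200, 37.5
toy['Revenue_difference'] = change / midpoint                   # 0.4, 0.0, -0.667

# Average relative change per cluster
per_cluster_change = toy.groupby('labels')['Revenue_difference'].mean()
print(per_cluster_change)
```

Splitting the calculation into `change` and `midpoint` keeps the order of operations explicit and avoids relying on operator precedence in one long expression.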