# Clustering Metrics

## Clustering Metrics

First, we generate a small synthetic dataset to demonstrate the metrics. Because it has only two features, the easiest way to understand it is to plot it. Keep in mind that visualization will not always be possible, as your data may have more than 3 dimensions.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np

np.random.seed(42)

# Generate random blob data; the KMeans predictions follow in the next cell
X, y_true = make_blobs(n_samples=1000, centers=4, cluster_std=1, random_state=42)


In [None]:
y_pred_6 = KMeans(n_clusters=6, random_state=42, n_init='auto').fit_predict(X)
y_pred_5 = KMeans(n_clusters=5, random_state=42, n_init='auto').fit_predict(X)
y_pred_4 = KMeans(n_clusters=4, random_state=42, n_init='auto').fit_predict(X)
y_pred_3 = KMeans(n_clusters=3, random_state=42, n_init='auto').fit_predict(X)
# Random labels, used as a deliberately bad clustering for comparison
y_wrong = np.random.randint(4, size=1000)

predichos = [y_pred_3, y_pred_4, y_pred_5, y_pred_6]

# Create the subplots side by side: original data, one per KMeans run, random labels
fig, axs = plt.subplots(1, 2 + len(predichos), figsize=(25, 5))

axs[0].scatter(X[:, 0], X[:, 1], c='k', alpha=0.5)
axs[0].set_title('Original data')

# idx starts at 1, so idx + 2 matches the number of clusters (3 to 6)
for idx, y_preds in enumerate(predichos, 1):
    axs[idx].scatter(X[:, 0], X[:, 1], c=y_preds)
    axs[idx].set_title(f'{idx+2} clusters found')
axs[-1].scatter(X[:, 0], X[:, 1], c=y_wrong)
axs[-1].set_title('Bad clustering')


### Silhouette score

The silhouette score measures how similar each sample is to its own cluster compared with the nearest neighboring cluster. It takes values between -1 and 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate that the groups overlap, and negative values indicate that many points sit closer to another group than to the one they were assigned to. The closer the result is to 1, the better:

In [None]:
from sklearn.metrics import silhouette_score


c3 = silhouette_score(X, y_pred_3)
c4 = silhouette_score(X, y_pred_4)
c5 = silhouette_score(X, y_pred_5)
c6 = silhouette_score(X, y_pred_6)
wrong = silhouette_score(X, y_wrong)

print(f'Silhouette Score for 3: {c3:0.2f}, 4: {c4:0.2f}, 5: {c5:0.2f}, 6: {c6:0.2f} and random: {wrong:0.2f}.')

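To make the definition concrete, here is a minimal sketch that recomputes the score by hand and compares it with `silhouette_score`. For each sample, `a` is the mean distance to the other members of its cluster and `b` is the mean distance to the members of the nearest other cluster; the sample's score is `(b - a) / max(a, b)`. The names `X_demo`, `labels_demo`, and `silhouette_by_hand` are made up for this illustration, and the sketch assumes Euclidean distance (the scikit-learn default).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Small demo dataset (hypothetical names, independent of the notebook's X)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1, random_state=0)
labels_demo = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_demo)


def silhouette_by_hand(X, labels):
    """Mean of (b - a) / max(a, b) over all samples, using Euclidean distance."""
    distances = pairwise_distances(X)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters keep a score of 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the distance of the sample to itself is 0, so divide by n_own - 1)
        a = distances[i, own].sum() / (n_own - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(distances[i, labels == c].mean()
                for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


manual = silhouette_by_hand(X_demo, labels_demo)
reference = silhouette_score(X_demo, labels_demo)
print(f'By hand: {manual:0.4f}, scikit-learn: {reference:0.4f}')
```

The explicit loop is slow on large datasets; it is only meant to show where the number comes from.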

### Calinski-Harabasz Index

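The Calinski-Harabasz index (also known as the variance ratio criterion) is the ratio of the dispersion between groups to the dispersion within groups. The higher the value of this metric, the better the clustering. It has no upper bound, so it is only useful for comparing different clusterings of the same data.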

In [None]:
from sklearn.metrics import calinski_harabasz_score


c3 = calinski_harabasz_score(X, y_pred_3)
c4 = calinski_harabasz_score(X, y_pred_4)
c5 = calinski_harabasz_score(X, y_pred_5)
c6 = calinski_harabasz_score(X, y_pred_6)
wrong = calinski_harabasz_score(X, y_wrong)

print(f'Calinski-Harabasz Index for 3: {c3:0.2f}, 4: {c4:0.2f}, 5: {c5:0.2f}, 6: {c6:0.2f} and random: {wrong:0.2f}.')

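The ratio behind this index can be sketched by hand. Assuming `n` samples and `k` clusters, the between-cluster dispersion is the sum of each cluster's size times the squared distance from its centroid to the overall mean, and the within-cluster dispersion is the sum of squared distances from each point to its own centroid. The names `X_demo`, `labels_demo`, and `calinski_harabasz_by_hand` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Small demo dataset (hypothetical names, independent of the notebook's X)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1, random_state=0)
labels_demo = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_demo)


def calinski_harabasz_by_hand(X, labels):
    """(between / (k - 1)) / (within / (n - k)) with squared Euclidean distances."""
    n_samples = len(X)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in cluster_ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Between-cluster term: cluster size times squared centroid offset
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Within-cluster term: squared distances of members to their centroid
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))


manual = calinski_harabasz_by_hand(X_demo, labels_demo)
reference = calinski_harabasz_score(X_demo, labels_demo)
print(f'By hand: {manual:0.2f}, scikit-learn: {reference:0.2f}')
```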

### Davies-Bouldin Index

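The Davies-Bouldin index compares the "compactness" of each cluster with its separation from the most similar other cluster, and averages the result over all clusters. The lower the value of this metric, the better the grouping, with 0 as the minimum.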

In [None]:
from sklearn.metrics import davies_bouldin_score


c3 = davies_bouldin_score(X, y_pred_3)
c4 = davies_bouldin_score(X, y_pred_4)
c5 = davies_bouldin_score(X, y_pred_5)
c6 = davies_bouldin_score(X, y_pred_6)
wrong = davies_bouldin_score(X, y_wrong)

print(f'Davies-Bouldin Index for 3: {c3:0.2f}, 4: {c4:0.2f}, 5: {c5:0.2f}, 6: {c6:0.2f} and random: {wrong:0.2f}.')

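As a rough illustration of how "compactness" and separation combine, the sketch below recomputes the index by hand. For each cluster it takes the average distance of its members to the centroid (its scatter), forms the ratio `(scatter_i + scatter_j) / distance(centroid_i, centroid_j)` against every other cluster, keeps the worst (largest) ratio, and averages those worst cases. The names `X_demo`, `labels_demo`, and `davies_bouldin_by_hand` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Small demo dataset (hypothetical names, independent of the notebook's X)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1, random_state=0)
labels_demo = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_demo)


def davies_bouldin_by_hand(X, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: average distance from each member of a cluster to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])
    worst = np.zeros(k)
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst[i] = max(ratios)  # similarity to the most similar other cluster
    return worst.mean()


manual = davies_bouldin_by_hand(X_demo, labels_demo)
reference = davies_bouldin_score(X_demo, labels_demo)
print(f'By hand: {manual:0.4f}, scikit-learn: {reference:0.4f}')
```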

## Conclusion

And there you have it: these were some of the clustering metrics offered by scikit-learn. You may have noticed that the functions for calculating supervised learning metrics follow a pattern: true values as the first argument and predicted values after. The unsupervised learning functions follow a similar pattern, except that you pass the input variables first, followed by the assigned clusters.
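
The two calling patterns can be put side by side in a short sketch. `accuracy_score` stands in here for the supervised metrics, and the toy label lists and the names `X_demo` and `labels_demo` are assumptions made only for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, silhouette_score

# Supervised pattern: metric(y_true, y_pred)
y_true_toy = [0, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0]  # three of the four labels match
supervised = accuracy_score(y_true_toy, y_pred_toy)

# Unsupervised pattern: metric(X, labels), no ground truth required
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels_demo = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_demo)
unsupervised = silhouette_score(X_demo, labels_demo)

print(f'accuracy_score(y_true, y_pred) = {supervised:0.2f}')
print(f'silhouette_score(X, labels) = {unsupervised:0.2f}')
```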

It's also important to highlight that each metric has its own purpose and there isn't a universally better metric for evaluating a model, as it depends on the problem at hand and the user's needs.
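
One way to see how the metrics relate is to compute all three for several values of `k` and check which `k` each one prefers. The sketch below assumes a fresh blob dataset (`X_demo`) and hypothetical names `results` and `best_k`; whether the metrics agree depends on the data, so no particular outcome is implied.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Demo dataset (hypothetical name, independent of the notebook's X)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1, random_state=42)

# Collect the three metrics for each candidate number of clusters
results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    results[k] = {
        'silhouette': silhouette_score(X_demo, labels),
        'calinski_harabasz': calinski_harabasz_score(X_demo, labels),
        'davies_bouldin': davies_bouldin_score(X_demo, labels),
    }

# Higher is better for the first two metrics, lower is better for Davies-Bouldin
best_k = {
    'silhouette': max(results, key=lambda k: results[k]['silhouette']),
    'calinski_harabasz': max(results, key=lambda k: results[k]['calinski_harabasz']),
    'davies_bouldin': min(results, key=lambda k: results[k]['davies_bouldin']),
}

for name, k in best_k.items():
    print(f'{name}: best k = {k}')
```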

In the next chapter, we'll see how we can visualize our metrics to present them to the user or to better understand them ourselves.