# Clustering
#### Part of the course on "Foundations of machine learning", Department of Mathematics and Statistics, University of Turku, Finland
#### Lectures available on YouTube: https://youtube.com/playlist?list=PLbkSohdmxoVAZ9DEHEWHjeGK7Ei-DjKHI&si=Msu74_I0qhLrRWcu
#### Code available on GitHub: https://github.com/ionpetre/FoundML_course_assignments

#### This notebook is based on the following sources: 
> https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

> https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering (KARNIKA KAPOOR)

> https://www.kaggle.com/code/mihirjhaveri/wholesale-customer-retail-uci (MIHIR JHAVERI)

> https://www.kaggle.com/code/kautumn06/yellowbrick-clustering-evaluation-examples (KRISTENMCINTYRE)

Clustering is a fundamental unsupervised learning technique that groups similar data points based on their intrinsic characteristics. The objective is to identify patterns and structures in the dataset without any predefined labels. The algorithm partitions the data into clusters whose members share common traits and features. This grouping helps reveal the underlying relationships and similarities in the data, aiding data analysis, pattern recognition, and decision-making. Clustering algorithms such as Gaussian mixture models, K-means, agglomerative (hierarchical) clustering, and DBSCAN are used in many applications, including customer segmentation, image recognition, anomaly detection, and recommendation systems.

We demonstrate in this notebook the following methods:
 1. Gaussian mixture models (GMM)
 2. K-means
 3. Agglomerative (hierarchical) clustering with Ward linkage
 4. Agglomerative (hierarchical) clustering with average linkage
 5. DBSCAN

We first demonstrate these methods on a labeled dataset with a known number of clusters, so that we can assess the quality of the clustering. We then apply them to an unlabeled dataset, where the optimal number of clusters has to be determined and the meaning of the clusters identified. 

#### Load the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import seaborn as sns
import matplotlib.colors as mcolors

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score 

In [None]:
# Reset the seed of the random number generator, for reproducibility purposes

import os

def reset_seed(SEED = 0):
    """Reset the seed for every random library in use (System, numpy)"""

    os.environ['PYTHONHASHSEED']=str(SEED)
    np.random.seed(SEED)


reset_seed(150)

In [None]:
# Prepare the style of the plots

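# Matplotlib 3.6 renamed "seaborn-whitegrid" to "seaborn-v0_8-whitegrid"; use whichever is available
style = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(style)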
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

## I. Demo clustering on the Iris dataset (labels known, number of clusters known)

#### The Iris Dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

This dataset consists of data collected on 150 Iris flowers, 50 from each of three types: Setosa, Versicolour, and Virginica. For each flower in the dataset we have its Iris type and its petal and sepal length and width.
The goal here is to cluster the flowers using only their petal and sepal measurements, and then to compare the resulting clusters with the known Iris types.

In [None]:
# Import the Iris dataset from the sklearn library. 
# The organisation of the data in the sklearn library is described at https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html. 

from sklearn.datasets import load_iris
iris_X = load_iris(return_X_y=False, as_frame=True)['frame']

# Check the dataset (info() prints its summary directly and returns None, so no print() is needed)
iris_X.info()

In [None]:
iris_X.info()

#### Visualise the data in 2D, color the datapoints depending on their labels. This suggests the clusters we look for.

In [None]:
# Standardize the features (zero mean, unit variance) for easier parameter selection
feature_cols = ['sepal length (cm)',
                'sepal width (cm)',
                'petal length (cm)',
                'petal width (cm)']
iris_X[feature_cols] = StandardScaler().fit_transform(iris_X[feature_cols])

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
true_colors = np.array(['red', 'green', 'blue'])  # map categorical values to colors

# Plot the dataset in 2D based on its sepal characteristics

iris_X.plot(kind="scatter", 
            x="sepal length (cm)", 
            y="sepal width (cm)", 
            color=true_colors[iris_X['target']],
            title = 'Sepal characteristics',
            ax = axes [0],
            #figsize=(5,3),
           )

# Plot the dataset in 2D based on its petal characteristics

iris_X.plot(kind="scatter", 
            x="petal length (cm)", 
            y="petal width (cm)", 
            color=true_colors[iris_X['target']],
            title = 'Petal characteristics',
            ax = axes [1],
            #figsize=(5,3),
           )

**As we knew from other assignments, the petal characteristics provide a much better separation of the data**

Our clustering algorithms will separate the data using all features. 

In [None]:
# Run on the Iris data, without the target values (finding the clusters is the objective here)
X = iris_X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]

In [None]:
# ============
# Create the cluster objects
# ============

three_means = KMeans(
    n_clusters=3,
    n_init=10,  # Number of times the k-means algorithm is run with different centroid seeds. 
                # The final result is the best of the n_init runs in terms of inertia. 
    random_state=2023,
)

agg_ward = AgglomerativeClustering(
    n_clusters=3, 
    linkage="ward", # ‘ward’ minimizes the variance of the clusters being merged.
    metric = 'euclidean'
)

agg_average = AgglomerativeClustering(
    n_clusters=3,
    linkage="average", # ‘average’ uses the average of the distances of each observation of the two sets.
    metric="euclidean",
)

# DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is
# a popular density-based clustering algorithm used in data mining and machine learning.
# It groups together data points that are close to each other in space,
# identifying clusters based on their density and labeling points in sparse regions as noise (outliers).
# It determines the number of clusters itself from eps and min_samples,
# so it has no n_clusters parameter.

dbscan = DBSCAN(
    eps=0.5,
    min_samples = 5,
    metric="euclidean",
    n_jobs = -1,
)

gmm = GaussianMixture(
    n_components=3,
    covariance_type="full",
    random_state=2023,
    max_iter = 1000,
    tol = 1e-3, # The convergence threshold. 
                # EM iterations will stop when the lower bound average gain is below this threshold.
)

clustering_algorithms = (
    ("KMeans", three_means),
    ("Agglomerative Ward", agg_ward),
    ("Agglomerative Average", agg_average),
    ("DBSCAN", dbscan),
    ("Gaussian Mixture", gmm),
)

In [None]:
# Map cluster ids 0-2 to colors. DBSCAN labels noise points as -1, which indexes the last entry ('grey').
colors = np.array(['orange', 'olive', 'purple', 'grey'])


fig, axes = plt.subplots(nrows=10, ncols=2, figsize=(15,70))

plot_num = 0

for name, algorithm in clustering_algorithms:
    t0 = time.time()
    algorithm.fit(X)
    t1 = time.time()
    
    if hasattr(algorithm, "labels_"):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)

    iris_X.plot(kind="scatter", 
                x="sepal length (cm)", 
                y="sepal width (cm)", 
                color=colors[y_pred],
                title = name,
                ax=axes[plot_num,0],
                #figsize=(8,15),
               )
    
    iris_X.plot(kind="scatter", 
                x="sepal length (cm)", 
                y="sepal width (cm)", 
                color=true_colors[iris_X['target']],
                title = 'True labels',
                ax = axes [plot_num,1],
                #figsize=(5,3),
               )

    plot_num += 1
    
    iris_X.plot(kind="scatter", 
                x="petal length (cm)", 
                y="petal width (cm)", 
                color=colors[y_pred],
                title = name,
                ax=axes[plot_num,0],
                #figsize=(8,15),
               )
    
    iris_X.plot(kind="scatter", 
                x="petal length (cm)", 
                y="petal width (cm)", 
                color=true_colors[iris_X['target']],
                title = 'True labels',
                ax=axes[plot_num,1],
                #figsize=(5,3),
               )
    
    plot_num += 1
    
plt.show()

**NOTE.** Agglomerative Average and DBSCAN seem a little worse on this dataset than the other methods. 
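
One thing to keep in mind when reading the DBSCAN plots: DBSCAN does not force every point into a cluster. Points in sparse regions receive the label `-1` (noise). The sketch below illustrates this on a few hypothetical toy points; the data and the parameter values are made up for illustration and are not tuned for Iris.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups of five points and one isolated point
group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X_toy = np.vstack([group, group + 5.0, [[20.0, 20.0]]])

# Illustrative parameters: a point is a core point if at least 3 points
# (itself included) lie within distance 0.5 of it
labels = DBSCAN(eps=0.5, min_samples=3).fit(X_toy).labels_

n_clusters = len(set(labels.tolist()) - {-1})  # -1 marks noise, it is not a cluster id
n_noise = int(np.sum(labels == -1))            # points left unassigned

print("labels:", labels)
print("clusters:", n_clusters, "noise points:", n_noise)
```

Because noise points still carry a label, scores such as the silhouette coefficient treat `-1` as one more cluster unless those points are filtered out first.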

#### Since we know the true labels for the Iris dataset, we can calculate several different metrics for the quality of the clustering results. 
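
These label-based scores compare two partitions of the data rather than the numeric cluster ids, so a clustering does not have to reuse the ids of the true labels to score well. A small sketch on hypothetical toy labels (not the Iris results):

```python
from sklearn.metrics import adjusted_rand_score, completeness_score, homogeneity_score

# Hypothetical toy labels, made up for illustration
y_true = [0, 0, 0, 1, 1, 1]

# Same grouping as y_true, but with the cluster ids swapped
y_swapped = [1, 1, 1, 0, 0, 0]

# Each true class is split over two clusters: every cluster is pure,
# but no class is kept together in a single cluster
y_split = [0, 0, 1, 2, 2, 3]

ari_swapped = adjusted_rand_score(y_true, y_swapped)
hom_split = homogeneity_score(y_true, y_split)
com_split = completeness_score(y_true, y_split)

print(f"ARI with swapped ids:       {ari_swapped:.3f}")
print(f"Homogeneity when split:     {hom_split:.3f}")
print(f"Completeness when split:    {com_split:.3f}")
```

Swapping the ids leaves the adjusted Rand index at its maximum, while over-splitting keeps homogeneity perfect and lowers completeness.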

In [None]:
cluster_scores = pd.DataFrame(np.nan, 
                  index=['KMeans',
                         'Agglomerative Ward',
                         'Agglomerative Average',
                         'DBSCAN',
                         'Gaussian Mixture'
                        ], 
                  columns=['Homogeneity', 
                           'Completeness',
                           'V-measure',
                           'Adjusted Rand Index',
                           'Adjusted Mutual Information',
                           'Silhouette Coefficient',
                          ]
                 )

for name, algorithm in clustering_algorithms:
    if hasattr(algorithm, "labels_"):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)
        
    cluster_scores.loc[name] = [
        homogeneity_score(iris_X['target'], y_pred),
        completeness_score(iris_X['target'],  y_pred),
        v_measure_score(iris_X['target'],y_pred),
        adjusted_rand_score(iris_X['target'], y_pred),
        adjusted_mutual_info_score(iris_X['target'], y_pred),
        silhouette_score(X, y_pred)                                
    ]    
    
cluster_scores.style.highlight_max(color = 'lightgreen', axis = 0)

#### Conclusion: GaussianMixture outperforms the other methods on this dataset. On the silhouette coefficient, agglomerative average does better. 
Note: on a dataset with unknown labels, the silhouette coefficient is the only one of these metrics that we can use to measure the quality of the clustering results. 

## II. Demo clustering on a dataset with unknown labels, unknown number of clusters

### We use the UCI Wholesale customers dataset https://archive.ics.uci.edu/dataset/292/wholesale+customers

Features: 

1)	FRESH: annual spending (m.u.) on fresh products (Continuous);
2)	MILK: annual spending (m.u.) on milk products (Continuous);
3)	GROCERY: annual spending (m.u.) on grocery products (Continuous);
4)	FROZEN: annual spending (m.u.) on frozen products (Continuous);
5)	DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
6)	DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
7)	CHANNEL: customers' channel - Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal);
8)	REGION: customers' region - Lisbon, Oporto or Other (Nominal).


In [None]:
from sklearn.datasets import fetch_openml

wholesale_X, wholesale_y = fetch_openml(
    data_id=1511,
    as_frame=True,
    return_X_y=True,
    parser = 'auto'
)


In [None]:
wholesale_y.value_counts()

# 1 encodes 'hotel' customers and 2 encodes 'retail' customers

In [None]:
wholesale_X

In [None]:
# The data seems to have had its columns renamed in OpenML. We bring them to their original format. 

wholesale_X.rename(
    columns={"V2": "Fresh", 
             "V3": "Milk",
             "V4": "Grocery",
             "V5": "Frozen",
             "V6": "Detergents_Paper",
             "V7": "Delicatessen",
            },
    inplace = True,
)

# Add the channels to the dataset
wholesale_X = pd.concat([wholesale_X, wholesale_y.to_frame()], axis=1)

# V1 is identical to "Region" and we drop it
wholesale_X.drop(['V1'], axis=1, inplace=True)

wholesale_X.info()

**Note: no missing values!** 

In [None]:
# Use the data without the region and the channel
X = wholesale_X[["Fresh", "Milk","Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]]

# normalize dataset for easier parameter selection
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
X.info()

In [None]:
# Visualise each data category per channel (hotel/retail) and per region (three of them)

sns.set(style="ticks", color_codes=True)

for item in ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]:
    sns.catplot(x="Channel", y=item, hue="Region", kind="bar", errorbar=None, data=wholesale_X)
    plt.title(f"Item - {item}")

plt.show()

In [None]:
# ============
# Create the cluster objects
# ============


k_means = KMeans(
    n_init=10,  # Number of times the k-means algorithm is run with different centroid seeds. 
                # The final result is the best of the n_init runs in terms of inertia. 
    random_state=2023,
)

agg_ward = AgglomerativeClustering(
    linkage="ward", # ‘ward’ minimizes the variance of the clusters being merged.
    metric = 'euclidean',
)

agg_average = AgglomerativeClustering(
    linkage="average", # ‘average’ uses the average of the distances of each observation of the two sets.
    metric="euclidean",
)


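# GaussianMixture is not a scikit-learn clusterer: it takes n_components instead of n_clusters
# and does not expose a labels_ attribute. The small wrapper below adapts it so that it can be
# used like the other clustering objects.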
from sklearn.base import ClusterMixin
class GaussianMixtureCluster(GaussianMixture, ClusterMixin):
    """Subclass of GaussianMixture to make it a ClusterMixin."""

    def fit(self, X):
        super().fit(X)
        self.labels_ = self.predict(X)
        return self

    def get_params(self, **kwargs):
        output = super().get_params(**kwargs)
        output["n_clusters"] = output.get("n_components", None)
        return output

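    def set_params(self, **kwargs):
        # Translate n_clusters to n_components only when it is actually passed;
        # otherwise n_components would be overwritten with None
        if "n_clusters" in kwargs:
            kwargs["n_components"] = kwargs.pop("n_clusters")
        return super().set_params(**kwargs)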



gmm = GaussianMixtureCluster(
    covariance_type="full",
    random_state=2023,
    max_iter = 1000,
    tol = 1e-3, # The convergence threshold. 
                # EM iterations will stop when the lower bound average gain is below this threshold.
)

clustering_algorithms = (
    ("KMeans", k_means),
    ("Agglomerative Ward", agg_ward),
    ("Agglomerative Average", agg_average),
    ("Gaussian Mixture", gmm),
)

### Maximize the silhouette score to find the optimal number of clusters.
The silhouette coefficient measures how well each sample fits its assigned cluster compared with the nearest other cluster; the silhouette score is its mean over all samples. Its value ranges from -1 to 1.
- 1: clusters are well apart from each other and clearly distinguished.
- 0: clusters overlap; samples lie on or near the boundary between neighboring clusters.
- -1: samples have probably been assigned to the wrong cluster.
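
For a single sample, the silhouette coefficient is $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other samples in its own cluster and $b$ is the mean distance to the samples of the nearest other cluster. The sketch below checks this by hand on hypothetical 1D toy data and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical 1D toy data: two tight groups that are far apart
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

# Manual computation for the first sample (value 0.0)
a = abs(0.0 - 1.0)                            # mean distance to the rest of its own cluster
b = (abs(0.0 - 10.0) + abs(0.0 - 11.0)) / 2   # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

# scikit-learn: per-sample coefficients and their mean (the silhouette score)
s_all = silhouette_samples(X_toy, labels_toy)
score = silhouette_score(X_toy, labels_toy)

print(f"manual: {s_manual:.4f}, sklearn: {s_all[0]:.4f}, mean over samples: {score:.4f}")
```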

In [None]:
from yellowbrick.cluster import KElbowVisualizer


for name, algorithm in clustering_algorithms:
    visualizer = KElbowVisualizer(algorithm, 
                              k = (2,11), 
                              metric = 'silhouette', 
                              timings = False,
                              locate_elbow = False,
                              force_model=True,
                             )

    visualizer.fit(X)    # Fit the data to the visualizer
    visualizer.show()    # Draw the results

In [None]:
dbscan = DBSCAN(
    eps=0.5,
    min_samples = 5,
    metric="euclidean",
    p=2,
    n_jobs = -1,
)

dbscan.fit(X)
unique, counts = np.unique(dbscan.labels_, return_counts=True)
print(np.asarray((unique, counts)).T)
print("Number of clusters (without the outliers) identified by DBSCAN:", 
      np.sum(np.array(unique) >= 0, axis=0)
     )

> The algorithms agree in their suggestions for the optimal number of clusters: 2. 

> The next step would be to validate these clusters, either by evaluating some quality metrics (say their silhouette score profile), or by identifying their meaning somehow. This is often difficult and requires domain-specific knowledge about the dataset.

> In this notebook instead, we will only visualize the clusters to see how they differ. 

> Our data is multi-dimensional, which makes visualising it difficult. We will use PCA to project it onto the first 2 principal components.
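
A 2D projection shows only part of the data's variance, so it can be useful to check how much the first two components retain before reading too much into the plot. A minimal sketch on hypothetical synthetic data (the array `X_demo` is made up and merely stands in for a standardized feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 6 features, built so that two
# directions carry most of the variance
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

pca = PCA(n_components=2, random_state=0)
X_demo_2d = pca.fit_transform(X_demo)

# Fraction of the total variance kept by the two components
retained = pca.explained_variance_ratio_.sum()

print("projected shape:", X_demo_2d.shape)
print(f"variance retained by 2 components: {retained:.1%}")
```

If the retained fraction is low, clusters that look mixed in the 2D plot may still be well separated in the original feature space.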

In [None]:
from sklearn.decomposition import PCA

X_2comp = pd.DataFrame(
    PCA(n_components = 2, random_state = 2023).fit_transform(X),
    columns = ['PC1', 'PC2']
)

# No hue is used here, so no palette is passed
sns.relplot(
    x="PC1", 
    y="PC2", 
    data=X_2comp, 
    height=6,
).fig.suptitle("UCI Wholesale", fontsize=14)


In [None]:
k_means = KMeans(
    n_clusters=2,
    n_init=10,  # Number of times the k-means algorithm is run with different centroid seeds. 
                # The final result is the best of the n_init runs in terms of inertia. 
    random_state=2023,
)

agg_ward = AgglomerativeClustering(
    n_clusters=2, 
    linkage="ward", # ‘ward’ minimizes the variance of the clusters being merged.
    metric = 'euclidean',
)

agg_average = AgglomerativeClustering(
    n_clusters=2,
    linkage="average", # ‘average’ uses the average of the distances of each observation of the two sets.
    metric="euclidean",
)


gmm = GaussianMixture(
    n_components=2,
    covariance_type="full",
    random_state=2023,
    max_iter = 1000,
    tol = 1e-3, # The convergence threshold. 
                # EM iterations will stop when the lower bound average gain is below this threshold.
)

optimal_clustering_algorithms = (
    ("KMeans", k_means),
    ("Agglomerative Ward", agg_ward),
    ("Agglomerative Average", agg_average),
    ("Gaussian Mixture", gmm),
    ("DBSCAN", dbscan)
)


for name, model in optimal_clustering_algorithms:
    model.fit(X)    # Fit the data to the model

In [None]:
# sns.relplot creates its own figure for each model, so no explicit plt.figure() is needed

for name, model in optimal_clustering_algorithms:
    print(name)
    if hasattr(model, "labels_"):
        y_pred = model.labels_.astype(int)
    else:
        y_pred = model.predict(X)
        
    X_2comp['cluster'] = y_pred  # one cluster id per row

    sns.relplot(
        x="PC1", 
        y="PC2", 
        hue="cluster", 
        data=X_2comp, 
        height=6,
        palette = sns.color_palette(palette = 'tab10'),
    ).fig.suptitle(name, fontsize=12)
    plt.show()

## Challenge: cluster the California housing dataset

In [None]:
# Load the dataset from sklearn, add the target to the main dataset

from sklearn.datasets import fetch_california_housing

calif_X, calif_y = fetch_california_housing(return_X_y=True, as_frame=True)

calif_X = pd.concat([calif_X, calif_y.to_frame()], axis=1)
del calif_y

#### Q1: How many features do you have in the dataset? 
#### Q2: How many datapoints do you have in the dataset? 
#### Q3: Are there missing values in the dataset? 

In [None]:
# Your code here



In [None]:
# normalize the dataset on all features, except Latitude and Longitude
# Your code here



#### Q4: Cluster the housing dataset using only the "HouseAge" feature. What is the optimal number of clusters (from 3 to 10) suggested by the silhouette score? 

Hint: to train on a single feature, some changes in the datastructure may be needed. Try applying to_numpy() and reshape(-1, 1) to your single-feature data, it may help. 
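
The reason for the hint is that scikit-learn estimators expect a 2D array of shape `(n_samples, n_features)`, while a single DataFrame column is 1D. A short sketch on a hypothetical four-row stand-in for the housing data:

```python
import pandas as pd

# Hypothetical stand-in for one column of the housing data
df_demo = pd.DataFrame({"HouseAge": [5.0, 12.0, 30.0, 41.0]})

column_1d = df_demo["HouseAge"].to_numpy()   # 1D array, one value per sample
column_2d = column_1d.reshape(-1, 1)         # 2D array with a single feature column
column_df = df_demo[["HouseAge"]]            # double brackets keep a 2D DataFrame

print(column_1d.shape)  # (4,)
print(column_2d.shape)  # (4, 1)
print(column_df.shape)  # (4, 1)
```

Either of the two 2D forms can be passed to `fit`.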

In [None]:
# We will use K-means in this assignment and the silhouette score to find the optimal number of clusters.
# The elbow method is sometimes difficult to interpret: several values may look equally plausible.
# Instead, we will select the number of clusters that gives the maximal silhouette score.
# Below we set "locate_elbow" to False and we read the maximum off the plot. 

# Your code here to setup the K-means model and the KElbowVisualizer



In [None]:
# Your code here to fit KElbow and visualize the results



Train the K-means model using the optimal number of clusters. 
Visualise the clusters using the code below.

#### Q5. How old are the houses in the cluster with the most recent houses? (0-5, 0-8, 0-12, 0-15, 0-21) 

In [None]:
k_means = KMeans(
    n_clusters = HERE_THE_NUMBER_OF_CLUSTERS,
    n_init = 5,  # Number of times the k-means algorithm is run with different centroid seeds. 
                 # The final result is the best of the n_init runs in terms of inertia. 
    random_state = 2023,
)

k_means.fit(X['HouseAge'].to_numpy().reshape(-1, 1)) 
y_pred = k_means.labels_.astype(int)

calif_X['cluster'] = y_pred

fig = plt.figure()


sns.stripplot(
    data = calif_X, 
    x = 'HouseAge',
    hue="cluster", 
)

plt.show()

del k_means

#### Q6: Cluster the housing dataset using only the "MedInc" feature. What is the optimal number of clusters (from 3 to 10) suggested by the silhouette score? 

In [None]:
# Your code here



In [None]:
fig = plt.figure()

k_means = KMeans(
    n_clusters = HERE_THE_NUMBER_OF_CLUSTERS,
    n_init = 5,  # Number of times the k-means algorithm is run with different centroid seeds. 
                 # The final result is the best of the n_init runs in terms of inertia. 
    random_state = 2023,
)

k_means.fit(X['MedInc'].to_numpy().reshape(-1, 1)) 
y_pred = k_means.labels_.astype(int)

calif_X['cluster'] = y_pred

sns.relplot(
    x="Longitude", 
    y="Latitude", 
    hue="cluster", 
    data=calif_X, 
    height=6,
    palette = 'tab10', #palette = ['red', 'blue', 'green'], #sns.color_palette("Paired"),
);

plt.show()

del k_means

#### Q7: Cluster the housing dataset using only the "MedHouseVal" feature. What is the optimal number of clusters (from 3 to 10) suggested by the silhouette score? 

In [None]:
# Your code here



In [None]:
fig = plt.figure()

k_means = KMeans(
    n_clusters = HERE_THE_NUMBER_OF_CLUSTERS,
    n_init = 5,  # Number of times the k-means algorithm is run with different centroid seeds. 
                 # The final result is the best of the n_init runs in terms of inertia. 
    random_state = 2023,
)

k_means.fit(X['MedHouseVal'].to_numpy().reshape(-1, 1)) 
y_pred = k_means.labels_.astype(int)

calif_X['cluster'] = y_pred

sns.relplot(
    x="Longitude", 
    y="Latitude", 
    hue="cluster", 
    data=calif_X, 
    height=6,
    palette = sns.color_palette(palette = 'tab10'),
)

plt.show()
del k_means

#### Q8: Cluster the housing dataset using all features except latitude and longitude. What is the optimal number of clusters (from 3 to 10) suggested by the silhouette score? 

In [None]:
# Your code here



In [None]:
fig = plt.figure()

k_means = KMeans(
    n_clusters = HERE_THE_NUMBER_OF_CLUSTERS,
    n_init = 5,  # Number of times the k-means algorithm is run with different centroid seeds. 
                 # The final result is the best of the n_init runs in terms of inertia. 
    random_state = 2023,
)

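# Q8 asks for all features except Latitude and Longitude, not only MedHouseVal.
# Assumption: X is your normalized feature DataFrame; the two columns are dropped if present.
k_means.fit(X.drop(columns=['Latitude', 'Longitude'], errors='ignore'))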
y_pred = k_means.labels_.astype(int)

calif_X['cluster'] = y_pred

sns.relplot(
    x="Longitude", 
    y="Latitude", 
    hue="cluster", 
    data=calif_X, 
    height=6,
    palette = sns.color_palette(palette = 'tab10'),
)

plt.show()
del k_means