<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/UNSUPERVISED_LEARNING_K_MEANS_CLUSTERING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## UNSUPERVISED LEARNING - K-MEANS CLUSTERING


In this notebook, we will demonstrate how to build and evaluate K-means clustering as an example of an unsupervised learning technique. We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). The aim is to cluster the patients into groups with common characteristics.

# Import Libraries

First, we need to import some libraries that will be used during the creation and evaluation of K-means clustering models.

In [None]:
import pandas as pd
import warnings
from sklearn.cluster import KMeans
#warnings.filterwarnings('ignore')

# Data Preparation

**Clone the dataset Repository**

The prepared dataset after cleaning, removing outliers, and feature engineering can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_EDA.csv file. Read the input data into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/cardio_EDA.csv",sep=";")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 53659 records with 14 features for each record. Twelve features are numeric and the rest are objects (strings).

# Clean Data and Remove Outliers

This data has been processed in previous notebooks
- Data Cleaning: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb
- Feature Selection and Feature Engineering: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_FEATURE_SELECTION_AND_FEATURE_ENGINEERING.ipynb

As the sample of the dataset above shows, some features are highly correlated, such as the age and age_year features, so we need to drop one of them. We will also drop any features that are not needed, such as the 'id' feature.

In [None]:
df.drop(['id','age'],axis=1, inplace=True)
df.head()

# Encode Categorical Data

We will use one-hot encoding through the pandas get_dummies() function to encode the data in the 'gender' and 'smoke' features.

In [None]:
df = pd.get_dummies(df)
df.head()

Remember to drop one of the columns that resulted from the one-hot encoding of each feature. Also, make sure that the original features ('gender' and 'smoke') are dropped too.

In [None]:
df.drop(['gender_female','smoke_No'],axis=1,inplace=True)
df.head()
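
As a minimal sketch of the same idea on a toy dataframe (the values below are hypothetical and only the column names mirror the notebook), get_dummies() can also drop one indicator column per feature directly through its drop_first parameter. This is an alternative to the explicit drop() call above, not the notebook's own approach.

```python
import pandas as pd

# Toy frame with hypothetical values; 'gender' and 'smoke' mirror the notebook's categorical features.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "smoke": ["No", "Yes", "No"],
    "weight": [62.0, 85.0, 70.0],
})

# drop_first=True keeps k-1 indicator columns per categorical feature,
# dropping the first category in sorted order ('female' and 'No' here).
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Numeric columns pass through unchanged, and the original 'gender' and 'smoke' columns are replaced by their indicator columns.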

# Train And Evaluate K-means clustering

**Scaling/Normalizing Features**

In the beginning, we need to scale/normalize all features within the same range. Here, we will use the MinMaxScaler from sklearn as below

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range = (0,1))

x=df.copy()
scaler.fit(x)
x_normalized = scaler.transform(x)
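
To make the scaling step concrete, the sketch below applies MinMaxScaler to a small array of hypothetical values and reproduces the result by hand with (value - column minimum) / (column maximum - column minimum). The array and variable names are illustrative assumptions, not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two features on different scales (e.g. height in cm, weight in kg).
data = np.array([[150.0, 50.0],
                 [170.0, 70.0],
                 [190.0, 110.0]])

# Scale every column into the range [0, 1].
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(data)

# Same result by hand: subtract the column minimum, divide by the column range.
manual = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(scaled)
```

Scaling matters for K-means because the algorithm relies on Euclidean distances, so a feature with a large numeric range would otherwise dominate the clustering.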

**Train K-means clustering**

K-means is unsupervised, so there are no target labels and no training/testing split is needed here. We fit the model on all normalized records, starting with 10 clusters.

In [None]:
km = KMeans( n_clusters = 10 )
km.fit_predict( x_normalized )


The K-means cluster centers are also computed during the fit step and are available through the cluster_centers_ attribute. However, it is difficult to draw them in the feature domain (10 features, so a plot would need 10 dimensions).

In [None]:
km_cluster_centers_ = km.cluster_centers_
print(km_cluster_centers_)
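
Because the model above was fitted on normalized data, its centers are expressed in the [0, 1] range. One possible way to read them in the original units is the scaler's inverse_transform() method. The sketch below shows this on synthetic data; the names raw, scaler_demo, and km_demo are hypothetical and do not refer to the notebook's variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: two hypothetical features on different scales.
rng = np.random.default_rng(0)
raw = np.column_stack([rng.uniform(140, 200, 200), rng.uniform(40, 120, 200)])

# Fit the scaler, then cluster in the normalized space.
scaler_demo = MinMaxScaler().fit(raw)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler_demo.transform(raw))

# Map the centers from the normalized space back to the original units.
centers_original_units = scaler_demo.inverse_transform(km_demo.cluster_centers_)
print(centers_original_units)
```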

**Evaluate K-means clustering**

The sum of squared errors (SSE) is used to evaluate the quality of clustering. It is the sum of the squared distances between each record and the center of its cluster, and it is available as the inertia_ attribute of the sklearn K-means model. A low value means that the records within every cluster are close to its center.

In [None]:
km.inertia_
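
As an illustration of what inertia_ measures, the sketch below recomputes the SSE by hand on synthetic data generated with make_blobs and compares it with the fitted model's attribute. The data and the names X_demo and km_demo are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four groups.
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# For every point, look up the center of the cluster it was assigned to.
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]

# SSE: sum of squared distances from each point to its assigned center.
sse_manual = np.sum((X_demo - assigned_centers) ** 2)
print(sse_manual, km_demo.inertia_)
```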

**Number of Clusters**

To determine the best number of clusters, we will measure the SSE for different numbers of clusters. The SSE should decrease as the number of clusters grows. We then choose the number of clusters just before the SSE curve flattens. This search takes some time, depending on the minimum and maximum number of clusters allowed. It is recommended to use a large step size when the search range is wide, and then to narrow the range and reduce the step size until the best number of clusters is found.

We will start with a minimum of 2 clusters, a maximum of 102 (exclusive in Python's range(), so the last value tried is 92), and a step size of 10.

In [None]:
SSE=[]
MinClusters=2
MaxClusters=102
StepClusters=10
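for k in range(MinClusters,MaxClusters,StepClusters):
    # Fit K-means with k clusters; note that x is the unscaled DataFrame, not x_normalized
    km = KMeans(n_clusters=k)
    km.fit_predict(x)
    # Record the SSE (inertia) for this k
    SSE.append([k,km.inertia_])
    print([k,km.inertia_])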

Let us plot the SSE (inertia) against the number of clusters.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
SSE=np.array(SSE)
plt.plot(SSE[:,0],SSE[:,1],color = 'red')
plt.scatter(SSE[:,0],SSE[:,1],color = 'blue')

As can be observed, the curve starts to flatten around 20 clusters. So, we will refine the search with a minimum of 2 clusters, a maximum of 35 (exclusive, so the last value tried is 34), and a step size of 2.

In [None]:
SSE=[]
MinClusters=2
MaxClusters=35
StepClusters=2
for k in range(MinClusters,MaxClusters,StepClusters):
    km = KMeans(n_clusters=k)
    km.fit_predict(x)
    # Record the SSE (inertia) for this k
    SSE.append([k,km.inertia_])
    print([k,km.inertia_])

SSE=np.array(SSE)
plt.plot(SSE[:,0],SSE[:,1],color = 'red')
plt.scatter(SSE[:,0],SSE[:,1],color = 'blue')

This approach to choosing the number of clusters is called the Elbow method.
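
Reading the elbow from a plot is a visual judgment. As a rough numeric aid (a heuristic assumed here, not part of the notebook), the sketch below prints the relative SSE drop between consecutive numbers of clusters on synthetic data with three well-separated groups. The drop becomes small once the curve flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (hypothetical centers).
X_demo, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 10], [10, -10]],
                       cluster_std=0.5, random_state=0)

# SSE (inertia) for each candidate number of clusters.
ks = list(range(1, 7))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_ for k in ks]

# Relative SSE drop obtained by moving from k-1 to k clusters.
rel_drop = {ks[i]: (sse[i - 1] - sse[i]) / sse[i - 1] for i in range(1, len(ks))}
for k, drop in rel_drop.items():
    print(k, round(drop, 3))
```

On this data, the large drops stop after three clusters, which matches the number of groups that were generated.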

Another approach to determining the best number of clusters is the silhouette metric. For each data point, the silhouette score compares how close the point is to the other points in its own cluster with how close it is to the points in the nearest other cluster. The score ranges from -1 to 1, and values closer to 1 indicate better-separated clusters.

In [None]:
from sklearn.metrics import silhouette_score
MinClusters=2
MaxClusters=35
StepClusters=2
for k in range(MinClusters,MaxClusters,StepClusters):
    km = KMeans(n_clusters=k)
    cluster_labels = km.fit_predict(x)
    silhouette_avg = silhouette_score(x, cluster_labels)
    print("For n_clusters =", k,
          "The average silhouette_score is :", silhouette_avg)
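
Instead of reading the printed scores by eye, the scores can be stored and the number of clusters with the highest average silhouette selected programmatically. The sketch below does this on synthetic data with three well-separated groups; the data and the names X_demo, scores, and best_k are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical centers).
X_demo, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 10], [10, -10]],
                       cluster_std=0.5, random_state=0)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the number of clusters with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

On large datasets the full computation can be slow because it relies on pairwise distances. silhouette_score() accepts a sample_size argument (together with random_state) to score a random subset instead.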

So the best number of clusters according to both methods is 6.

In [None]:
k = 6
km = KMeans(n_clusters=k)
km.fit_predict(x)

# Saving and Loading Models

We will use the joblib library, which is covered in the scikit-learn model persistence documentation (https://scikit-learn.org/stable/modules/model_persistence.html), to save and load the models. To save the model, we use the dump() function:

In [None]:
import joblib as jb
jb.dump(km, './Model_km.joblib')

And to load the K-means model, we will use the load() method

In [None]:
km_joblib = jb.load('./Model_km.joblib')
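
A quick sanity check after loading is to confirm that the restored model produces the same cluster assignments as the original. The sketch below shows such a round trip on synthetic data, writing to a temporary directory. The file name and variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Fit a small model on synthetic data.
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Save the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "Model_km_demo.joblib")
    joblib.dump(km_demo, model_path)
    km_restored = joblib.load(model_path)

# The restored model should assign every point to the same cluster.
same_labels = np.array_equal(km_demo.predict(X_demo), km_restored.predict(X_demo))
print(same_labels)
```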

# Cluster Data Using K-means Models

To assign each record to a cluster, we will use the loaded model and store the predicted cluster label in a new 'Class' column.

In [None]:
y_predict = km_joblib.predict(x)
dfnew=x.copy()
dfnew['Class']=y_predict
dfnew.head()
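
Once every record carries a cluster label, one way to interpret the clusters is to compare the average feature values and the number of records per cluster. The sketch below shows this on a toy dataframe; the values are hypothetical and the column names are only illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data with two obvious groups (hypothetical values).
toy = pd.DataFrame({
    "age_year": [40, 42, 41, 63, 65, 64],
    "weight": [60.0, 62.0, 61.0, 90.0, 92.0, 91.0],
})

# Cluster the records and attach the label, as done with dfnew above.
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)
labeled = toy.copy()
labeled["Class"] = km_demo.predict(toy)

# Average feature values and number of records per cluster.
profile = labeled.groupby("Class").mean()
sizes = labeled["Class"].value_counts()
print(profile)
print(sizes)
```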