# Step 1 : Importing Libraries
*We need the following libraries to carry out the workflow:*
+  ucimlrepo: for fetching the dataset from the UCI Machine Learning Repository.
+  pandas: for data manipulation and analysis.
+  numpy: for numerical operations.
+  pycaret: for analyzing and comparing the results of different models.

In [1]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
from pycaret.clustering import *
from pycaret.datasets import get_data

# Step 2 : Import the required dataset
*Here, we import the Obesity dataset ("Estimation of obesity levels based on eating habits and physical condition", UCI id 544).*

In [2]:
estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition = fetch_ucirepo(id=544) 

X = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.features 
y = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.targets 

*Let's have a look at the metadata of the dataset we have imported. The data contains 17 attributes (16 features plus the target `NObeyesdad`) and 2111 records.*<br>
*The target takes the following values:*
+   Insufficient Weight
+   Normal Weight
+   Overweight Level I
+   Overweight Level II
+   Obesity Type I
+   Obesity Type II 
+   Obesity Type III

In [3]:
print(estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.metadata) 

{'uci_id': 544, 'name': 'Estimation of obesity levels based on eating habits and physical condition ', 'repository_url': 'https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition', 'data_url': 'https://archive.ics.uci.edu/static/public/544/data.csv', 'abstract': 'This dataset include data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. ', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate'], 'num_instances': 2111, 'num_features': 16, 'feature_types': ['Integer'], 'demographics': ['Gender', 'Age'], 'target_col': ['NObeyesdad'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2019, 'last_updated': 'Tue Dec 19 2023', 'dataset_doi': '10.24432/C5H31Z', 'creators': [], 'intro_paper': {'title': 'Dataset f

*Let's also take a look at the different variables and their associated information such as type of feature, null values, etc.*

In [4]:
print(estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.variables) 

                              name     role         type demographic  \
0                           Gender  Feature  Categorical      Gender   
1                              Age  Feature   Continuous         Age   
2                           Height  Feature   Continuous        None   
3                           Weight  Feature   Continuous        None   
4   family_history_with_overweight  Feature       Binary        None   
5                             FAVC  Feature       Binary        None   
6                             FCVC  Feature      Integer        None   
7                              NCP  Feature   Continuous        None   
8                             CAEC  Feature  Categorical        None   
9                            SMOKE  Feature       Binary        None   
10                            CH2O  Feature   Continuous        None   
11                             SCC  Feature       Binary        None   
12                             FAF  Feature   Continuous        

*Let's look at the first five rows of the dataset*

In [5]:
X.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation


In [6]:
y.head()

Unnamed: 0,NObeyesdad
0,Normal_Weight
1,Normal_Weight
2,Normal_Weight
3,Overweight_Level_I
4,Overweight_Level_II


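# Step 3 : Converting categorical features
*We need to convert the categorical features to numeric values: binary features become 0/1, the ordered frequency features (`CAEC`, `CALC`) become integers from 0 to 3, and `MTRANS` is one-hot encoded.*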

In [7]:
# Work on a copy so the assignments below do not modify a view of the fetched data
X = X.copy()

# Binary features: 1 for "Male"/"yes", 0 otherwise
X["Gender"] = (X["Gender"] == "Male").astype(int)
for col in ["FAVC", "family_history_with_overweight", "SMOKE", "SCC"]:
    X[col] = (X[col] == "yes").astype(int)

# Ordinal features: encode the frequency levels in increasing order
frequency_levels = {'no': 0, 'Sometimes': 1, 'Frequently': 2, 'Always': 3}
for col in ["CAEC", "CALC"]:
    X[col] = X[col].map(frequency_levels)

# Nominal feature: one-hot encode the transport mode, dropping the first level
X = pd.get_dummies(X, columns=['MTRANS'], drop_first=True)
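
The effect of this encoding can be checked on a tiny frame. The sketch below is not part of the original notebook: it uses a made-up three-row sample (the rows are assumptions that only mimic a few of the dataset's columns) and shows the three kinds of conversion side by side.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the obesity data.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "SMOKE": ["no", "yes", "no"],
    "CAEC": ["Sometimes", "Frequently", "no"],
    "MTRANS": ["Public_Transportation", "Walking", "Automobile"],
})

# Binary columns: 1 for the "positive" category, 0 otherwise.
sample["Gender"] = (sample["Gender"] == "Male").astype(int)
sample["SMOKE"] = (sample["SMOKE"] == "yes").astype(int)

# Ordinal column: keep the natural order of the frequency levels.
frequency_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}
sample["CAEC"] = sample["CAEC"].map(frequency_order)

# Nominal column: one-hot encode, dropping the first level to avoid redundancy.
encoded = pd.get_dummies(sample, columns=["MTRANS"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first dummy level means the omitted category (here `Automobile`, the first in alphabetical order) is represented by zeros in all remaining `MTRANS_*` columns.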

# Step 4 : Using PyCaret
*We apply different clustering techniques under different preprocessing configurations to analyse how effective the models are in each scenario.*
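
The comparison in this step relies on three internal clustering metrics that read in different directions: Silhouette lies between -1 and 1 and is better when higher, Calinski-Harabasz is better when higher, and Davies-Bouldin is non-negative and better when lower. A minimal sketch of computing them, assuming scikit-learn is installed (PyCaret depends on it) and using synthetic blobs rather than the obesity data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with three groups (an assumption, not the obesity dataset).
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

scores = {
    "Silhouette": silhouette_score(points, labels),                 # higher is better
    "Calinski-Harabasz": calinski_harabasz_score(points, labels),   # higher is better
    "Davies-Bouldin": davies_bouldin_score(points, labels),         # lower is better
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because the metrics point in opposite directions, a configuration should be judged on all three together rather than on a single number.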

In [12]:
rows = [ 'Silhouette', 'Calinski-Harabasz', 'Davies-Bouldin']
Type = ['No Data Preprocessing', 'Using Normalization', 'Using Transform', 'Using PCA', 'Using T+N', 'Using T+N+PCA']

# List of dictionaries containing the different arguments for setup function
setup_args = [
    {'verbose': False},
    {'normalize': True, 'normalize_method': 'zscore', 'verbose': False},
    {'transformation': True, 'transformation_method': 'yeo-johnson', 'verbose': False},
    {'pca': True, 'pca_method': 'linear', 'verbose': False},
    {'transformation': True, 'transformation_method': 'yeo-johnson', 'normalize': True, 'normalize_method': 'zscore', 'verbose': False},
    {'pca': True, 'pca_method': 'linear', 'normalize': True, 'normalize_method': 'zscore', 'transformation': True, 'transformation_method': 'yeo-johnson', 'verbose': False}
]
models=['kmeans','hclust','meanshift']
for k in models:
    data = {}
    for j, setup_arg in enumerate(setup_args):
        for i in range(3):
            print(k, Type[j], "with", i+3, "clusters")
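            # Re-run setup so each configuration gets its own preprocessing pipeline
            ModelParameters = setup(data=X, **setup_arg)
            Model = create_model(k, num_clusters=i+3)
            metrics = get_metrics()

            # Note: the labels are scored against the original X, not the
            # transformed data, so these values can differ from the metrics
            # PyCaret prints for the preprocessed configurations.
            silhouette_score_function = metrics.loc['silhouette', 'Score Function']
            silhouette_score = silhouette_score_function(X, Model.labels_)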

            chs_score_function = metrics.loc['chs', 'Score Function']
            Calinski_Harabasz_score = chs_score_function(X, Model.labels_)

            db_score_function = metrics.loc['db', 'Score Function']
            Davies_Bouldin_score = db_score_function(X, Model.labels_)

            data[(Type[j], 'c={}'.format(i+3))] = [silhouette_score, Calinski_Harabasz_score, Davies_Bouldin_score]
    if k == 'kmeans':
        kmeans_metrics = pd.DataFrame(data = data,index = rows)
    elif k == 'hclust':
        hclust_metrics = pd.DataFrame(data = data,index = rows)
    else:
        meanshift_metrics = pd.DataFrame(data = data,index = rows)


kmeans No Data Preprocessing with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5031,4729.2488,0.6675,0,0,0


kmeans No Data Preprocessing with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4752,4818.2645,0.6997,0,0,0


kmeans No Data Preprocessing with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4318,4599.8678,0.7502,0,0,0


kmeans Using Normalization with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1408,198.0316,2.4523,0,0,0


kmeans Using Normalization with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.14,189.7881,2.212,0,0,0


kmeans Using Normalization with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1562,192.5705,1.7464,0,0,0


kmeans Using Transform with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.6937,219757.5847,0.4166,0,0,0


kmeans Using Transform with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5895,199768.5249,0.4985,0,0,0


kmeans Using Transform with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5517,211877.6767,0.6193,0,0,0


kmeans Using PCA with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5031,4729.2477,0.6675,0,0,0


kmeans Using PCA with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4752,4818.2595,0.6997,0,0,0


kmeans Using PCA with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4237,4592.9025,0.7709,0,0,0


kmeans Using T+N with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1436,195.112,2.5313,0,0,0


kmeans Using T+N with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1002,186.0736,2.05,0,0,0


kmeans Using T+N with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.0921,179.7465,1.9072,0,0,0


kmeans Using T+N+PCA with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.0904,194.8814,2.501,0,0,0


kmeans Using T+N+PCA with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1079,188.5959,2.1589,0,0,0


kmeans Using T+N+PCA with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.0943,181.0831,2.0552,0,0,0


hclust No Data Preprocessing with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4757,4302.4911,0.7001,0,0,0


hclust No Data Preprocessing with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4675,4346.0394,0.667,0,0,0


hclust No Data Preprocessing with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4151,4114.6325,0.7314,0,0,0


hclust Using Normalization with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1477,172.5966,2.6492,0,0,0


hclust Using Normalization with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1529,170.969,2.2802,0,0,0


hclust Using Normalization with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.1596,174.4808,1.8468,0,0,0


hclust Using Transform with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.6937,219757.5847,0.4166,0,0,0


hclust Using Transform with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5923,199358.3515,0.4885,0,0,0


hclust Using Transform with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5548,211263.2927,0.6107,0,0,0


hclust Using PCA with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4757,4302.4938,0.7001,0,0,0


hclust Using PCA with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4675,4346.0411,0.667,0,0,0


hclust Using PCA with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.4151,4114.6363,0.7314,0,0,0


hclust Using T+N with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.2153,168.9287,2.4209,0,0,0


hclust Using T+N with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.198,168.1136,2.0814,0,0,0


hclust Using T+N with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.2046,172.0094,1.7095,0,0,0


hclust Using T+N+PCA with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.2153,168.9288,2.4209,0,0,0


hclust Using T+N+PCA with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.198,168.1137,2.0814,0,0,0


hclust Using T+N+PCA with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.2046,172.0095,1.7095,0,0,0


meanshift No Data Preprocessing with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5592,4309.4903,0.5983,0,0,0


meanshift No Data Preprocessing with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5592,4309.4903,0.5983,0,0,0


meanshift No Data Preprocessing with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5592,4309.4903,0.5983,0,0,0


meanshift Using Normalization with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3003,30.6594,0.776,0,0,0


meanshift Using Normalization with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3003,30.6594,0.776,0,0,0


meanshift Using Normalization with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3003,30.6594,0.776,0,0,0


meanshift Using Transform with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.964,231576.6177,0.0608,0,0,0


meanshift Using Transform with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.964,231576.6177,0.0608,0,0,0


meanshift Using Transform with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.964,231576.6177,0.0608,0,0,0


meanshift Using PCA with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5592,4309.4966,0.5983,0,0,0


meanshift Using PCA with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5592,4309.4966,0.5983,0,0,0


meanshift Using PCA with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5592,4309.4966,0.5983,0,0,0


meanshift Using T+N with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3182,36.0077,0.7927,0,0,0


meanshift Using T+N with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3182,36.0077,0.7927,0,0,0


meanshift Using T+N with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3182,36.0077,0.7927,0,0,0


meanshift Using T+N+PCA with 3 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3182,36.0077,0.7927,0,0,0


meanshift Using T+N+PCA with 4 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3182,36.0077,0.7927,0,0,0


meanshift Using T+N+PCA with 5 clusters


Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.3182,36.0077,0.7927,0,0,0


# Step 5 : Printing the final metrics for different models

In [13]:
kmeans_metrics

Unnamed: 0_level_0,No Data Preprocessing,No Data Preprocessing,No Data Preprocessing,Using Normalization,Using Normalization,Using Normalization,Using Transform,Using Transform,Using Transform,Using PCA,Using PCA,Using PCA,Using T+N,Using T+N,Using T+N,Using T+N+PCA,Using T+N+PCA,Using T+N+PCA
Unnamed: 0_level_1,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5
Silhouette,0.503127,0.475232,0.431835,0.159241,0.030817,0.044751,0.062337,-0.098254,-0.128453,0.503127,0.475232,0.42366,0.146546,-0.039181,0.008905,0.086009,-0.026553,0.005027
Calinski-Harabasz,4729.250305,4818.263076,4599.874823,873.562521,528.353819,419.449999,384.60245,277.110536,216.554677,4729.250305,4818.263076,4592.904212,808.964756,429.772455,429.679536,650.518657,438.7468,545.248518
Davies-Bouldin,0.667549,0.699678,0.750224,1.712856,2.172773,3.06581,2.814921,3.757068,8.267418,0.667549,0.699678,0.770938,1.83003,6.814159,2.529282,5.373655,4.758358,2.26067


In [14]:
hclust_metrics

Unnamed: 0_level_0,No Data Preprocessing,No Data Preprocessing,No Data Preprocessing,Using Normalization,Using Normalization,Using Normalization,Using Transform,Using Transform,Using Transform,Using PCA,Using PCA,Using PCA,Using T+N,Using T+N,Using T+N,Using T+N+PCA,Using T+N+PCA,Using T+N+PCA
Unnamed: 0_level_1,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5
Silhouette,0.475726,0.467548,0.415117,0.060391,-0.029765,-0.042453,0.062337,-0.099628,-0.129376,0.475726,0.467548,0.415117,0.098841,-0.014811,-0.029596,0.098841,-0.014811,-0.029596
Calinski-Harabasz,4302.495839,4346.043925,4114.637298,436.363741,309.647539,232.338639,384.60245,276.709678,216.276108,4302.495839,4346.043925,4114.637298,392.483965,266.722296,200.14228,392.483965,266.722296,200.14228
Davies-Bouldin,0.700061,0.666997,0.731406,2.471153,2.976022,3.615776,2.814921,3.737062,8.253889,0.700061,0.666997,0.731406,4.727314,4.503711,4.842121,4.727314,4.503711,4.842121


In [15]:
meanshift_metrics

Unnamed: 0_level_0,No Data Preprocessing,No Data Preprocessing,No Data Preprocessing,Using Normalization,Using Normalization,Using Normalization,Using Transform,Using Transform,Using Transform,Using PCA,Using PCA,Using PCA,Using T+N,Using T+N,Using T+N,Using T+N+PCA,Using T+N+PCA,Using T+N+PCA
Unnamed: 0_level_1,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5,c=3,c=4,c=5
Silhouette,0.559179,0.559179,0.559179,-0.740924,-0.740924,-0.740924,0.019492,0.019492,0.019492,0.559179,0.559179,0.559179,-0.742555,-0.742555,-0.742555,-0.742555,-0.742555,-0.742555
Calinski-Harabasz,4309.498187,4309.498187,4309.498187,4.867212,4.867212,4.867212,158.963052,158.963052,158.963052,4309.498187,4309.498187,4309.498187,5.356018,5.356018,5.356018,5.356018,5.356018,5.356018
Davies-Bouldin,0.598328,0.598328,0.598328,2.757867,2.757867,2.757867,1.720077,1.720077,1.720077,0.598328,0.598328,0.598328,3.614551,3.614551,3.614551,3.614551,3.614551,3.614551
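
For the configurations that rescale the data, these tables do not match the scores PyCaret printed during the run (for example, k-means "Using Transform" with 3 clusters shows a Silhouette of 0.6937 in the log but 0.062337 in the table). The likely reason is that the loop scores the labels against the original `X`, whereas the printed metrics are computed in the transformed feature space. A sketch of that effect with scikit-learn on synthetic data (the data and the scaler are assumptions for illustration, not the notebook's pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales: the second dominates raw distances.
raw = np.column_stack([
    np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)]),
    rng.normal(0, 1000, 200),
])

# Cluster in the standardized space, where the first feature's two groups stand out.
scaled = StandardScaler().fit_transform(raw)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Same labels, two feature spaces, two different scores.
score_in_scaled_space = silhouette_score(scaled, labels)
score_in_raw_space = silhouette_score(raw, labels)
print(f"scaled space: {score_in_scaled_space:.3f}")
print(f"raw space:    {score_in_raw_space:.3f}")
```

When comparing preprocessing configurations, it helps to decide up front which feature space the scores should be measured in and to use it consistently.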


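*Hence, upon analysing the above tables, we can compare the performance of the different techniques under different preprocessing configurations on the same dataset. Mean shift is the best algorithm among them: with no preprocessing it achieves the highest Silhouette score (0.559) and the lowest Davies-Bouldin score (0.598), although k-means reaches a higher Calinski-Harabasz score (4818.26 with 4 clusters).*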
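
One way to make such a comparison reproducible is to rank the columns of a metrics table programmatically. The sketch below assumes a frame shaped like `kmeans_metrics`, rebuilt by hand from three of its columns (values rounded from the k-means table above), and takes into account that Davies-Bouldin is better when lower:

```python
import pandas as pd

# Three columns copied (rounded) from the k-means table; the rest are omitted.
columns = pd.MultiIndex.from_tuples([
    ("No Data Preprocessing", "c=3"),
    ("No Data Preprocessing", "c=4"),
    ("Using Normalization", "c=3"),
])
metrics = pd.DataFrame(
    [
        [0.5031, 0.4752, 0.1592],
        [4729.25, 4818.26, 873.56],
        [0.6675, 0.6997, 1.7129],
    ],
    index=["Silhouette", "Calinski-Harabasz", "Davies-Bouldin"],
    columns=columns,
)

# Pick the best configuration per metric, respecting each metric's direction.
best = {
    "Silhouette": metrics.loc["Silhouette"].idxmax(),                # higher is better
    "Calinski-Harabasz": metrics.loc["Calinski-Harabasz"].idxmax(),  # higher is better
    "Davies-Bouldin": metrics.loc["Davies-Bouldin"].idxmin(),        # lower is better
}
for metric, config in best.items():
    print(metric, "->", config)
```

The metrics do not always agree on a single winner, which is why the conclusion is best stated per metric.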