## Install PyCaret & the UCI ML Repository Package (`ucimlrepo`)
`&> /dev/null` redirects both the standard output and the standard error of the command to the null device, a special file that discards everything written to it.

In [1]:
!pip install pycaret &> /dev/null
print("PyCaret Installed Successfully!!")

PyCaret Installed Successfully!!


In [2]:
!pip install ucimlrepo &> /dev/null
print("UCI Lib Repo Installed Successfully!!")

UCI Lib Repo Installed Successfully!!
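
Because `&> /dev/null` also hides error messages, the success message in the cells above prints even when `pip` fails. The following is a minimal sketch of one way to stay quiet and still notice a failure, using Python's standard `subprocess` module. The helper name `run_quietly` is hypothetical, and the two commands are harmless stand-ins rather than real installs.

```python
import subprocess
import sys


def run_quietly(args):
    """Run a command, discard its stdout and stderr, and return the exit code."""
    completed = subprocess.run(
        args,
        stdout=subprocess.DEVNULL,  # same effect as redirecting stdout to /dev/null
        stderr=subprocess.DEVNULL,  # same effect as redirecting stderr to /dev/null
    )
    return completed.returncode


# Stand-in commands: one that succeeds and one that exits with an error code.
ok_code = run_quietly([sys.executable, "-c", "print('this text is discarded')"])
fail_code = run_quietly([sys.executable, "-c", "import sys; sys.exit(3)"])

print("first command succeeded:", ok_code == 0)
print("second command succeeded:", fail_code == 0)
```

In a notebook, the same idea could be applied to an install by passing `[sys.executable, "-m", "pip", "install", "pycaret"]` and printing the success message only when the returned code is 0.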


## Importing the [Online Shoppers Purchasing Intention Dataset](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset) from the UCI Machine Learning Repository


In [3]:
from ucimlrepo import fetch_ucirepo

Fetch the dataset

In [4]:
data = fetch_ucirepo(id=468)

Features and target as DataFrames

In [5]:
X = data.data.features
Y = data.data.targets # target: whether the session ended in a purchase

# Clustering

Import the necessary libraries

In [6]:
import time # to measure the start and end time of the clustering run
import pandas as pd
from pycaret.clustering import * # for setup parameters

Start the timer

In [7]:
start = time.time()

Function to **create a clustering model** and return its evaluation metrics.

Because `verbose=False` is passed, **`setup`** does not display its output.

In [8]:
def create_cluster_model(X, num_clusters, algorithm, normalization=False, transformation=False, pca=False):
    # Configure the PyCaret experiment. The branches are mutually exclusive,
    # so only the first enabled option is applied when several flags are True.
    if normalization:
        setup(data=X, normalize=True, normalize_method='zscore', verbose=False)
    elif transformation:
        setup(data=X, transformation=True, transformation_method='yeo-johnson', verbose=False)
    elif pca:
        setup(data=X, pca=True, pca_method='linear', verbose=False)
    else:
        setup(data=X, verbose=False)

    # Fit the model, then pull() the metrics table produced by the last call
    create_model(algorithm, num_clusters=num_clusters, verbose=False)
    return pull()
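
The function above uses an `if`/`elif` chain, so a scenario that enables several flags only applies the first one. As a hedged sketch of an alternative, the helper below collects every requested step into a single set of keyword arguments that could be passed to one `setup` call. The name `build_setup_kwargs` is hypothetical; the parameter names and values are the ones already used in the function above. Results from such a variant are not part of this notebook.

```python
def build_setup_kwargs(normalization=False, transformation=False, pca=False):
    """Collect the setup() keyword arguments for every enabled preprocessing step."""
    kwargs = {"verbose": False}
    if normalization:
        kwargs.update(normalize=True, normalize_method="zscore")
    if transformation:
        kwargs.update(transformation=True, transformation_method="yeo-johnson")
    if pca:
        kwargs.update(pca=True, pca_method="linear")
    return kwargs


# Both steps are present when two flags are enabled together.
combined = build_setup_kwargs(normalization=True, transformation=True)
print(combined)

# Assumed usage inside the notebook (needs PyCaret and the DataFrame X):
# setup(data=X, **build_setup_kwargs(normalization=True, transformation=True))
```

Independent `if` statements replace the `elif` branches, which is what lets the combined scenarios differ from the single-step ones.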

Defining different types of **Data Pre-Processing** techniques.

In [9]:
scenarios = [
    ("No Data PreProcessing", False, False, False),
    ("Normalization", True, False, False),
    ("Transformation", False, True, False),
    ("PCA", False, False, True),
    ("Transformation + Normalization", True, True, False),
    ("Transformation + Normalization + PCA", True, True, True)
]

Defining the **Clustering Algorithms** to compare.

In [10]:
algorithms = ['kmeans', 'hclust', 'dbscan']

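Define the numbers of clusters to try. DBSCAN determines the number of clusters itself, so its results do not vary with this setting.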

In [11]:
num_clusters_list = [3, 4, 5]
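
The three lists defined so far span a grid of experiments. As a small illustration, the sketch below enumerates that grid with `itertools.product` to show how many models the later loop trains. The variable `scenario_names` is introduced here only for illustration and holds the names from the `scenarios` list.

```python
from itertools import product

algorithms = ['kmeans', 'hclust', 'dbscan']
num_clusters_list = [3, 4, 5]
scenario_names = [
    "No Data PreProcessing",
    "Normalization",
    "Transformation",
    "PCA",
    "Transformation + Normalization",
    "Transformation + Normalization + PCA",
]

# Every (algorithm, number of clusters, scenario) combination, in the same
# order as three nested for-loops over these lists.
grid = list(product(algorithms, num_clusters_list, scenario_names))

print("Number of runs:", len(grid))
print("First run:", grid[0])
print("Last run:", grid[-1])
```

Each combination calls `setup` and `create_model` once, which is why the full run takes several minutes.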

Run every combination and collect the metrics

In [12]:
# List that collects one dictionary of metrics per run
results_list = []

# Train a model for every combination and store its metrics
for algorithm in algorithms:
    for num_clusters in num_clusters_list:
        for scenario in scenarios:
            name, normalize, transform, pca = scenario
            results = create_cluster_model(X, num_clusters=num_clusters, algorithm=algorithm, normalization=normalize, transformation=transform, pca=pca)
            silhouette = results['Silhouette'].iloc[0]
            calinski_harabasz = results['Calinski-Harabasz'].iloc[0]
            davies_bouldin = results['Davies-Bouldin'].iloc[0]
            results_list.append({
                "Algorithm": algorithm,
                "Number of Clusters": num_clusters,
                "Scenario": name,
                "Silhouette": silhouette,
                "Calinski-Harabasz": calinski_harabasz,
                "Davies-Bouldin": davies_bouldin
            })

# Convert the collected metrics into a DataFrame
results_df = pd.DataFrame(results_list)

Total execution time of the program.

In [13]:
end = time.time()
print("Total execution time:", (end-start), "seconds, or", (end-start)/60, "minutes")

Total execution time: 325.9873549938202 seconds, or 5.433122583230337 minutes
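
The timing cells above subtract two `time.time()` readings. A minimal alternative sketch uses `time.perf_counter()`, which is intended for measuring intervals, together with a small formatting helper. The name `format_elapsed` is hypothetical, and the short `sleep` stands in for the clustering loop.

```python
import time


def format_elapsed(seconds):
    """Format a duration in seconds as seconds and minutes, rounded to two decimals."""
    return f"{seconds:.2f} seconds ({seconds / 60:.2f} minutes)"


start = time.perf_counter()
time.sleep(0.05)  # stand-in for the clustering loop
elapsed = time.perf_counter() - start

print("Demo run took", format_elapsed(elapsed))

# The duration reported by the notebook, formatted the same way.
reported = format_elapsed(325.9873549938202)
print("Notebook run took", reported)
```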


Pivot the results so that each scenario becomes a column

In [14]:
# Pivoting the DataFrame
results_pivot = results_df.pivot(index=['Algorithm', 'Number of Clusters'], columns='Scenario')
results_pivot

**Silhouette**

| Algorithm | Number of Clusters | No Data PreProcessing | Normalization | PCA | Transformation | Transformation + Normalization | Transformation + Normalization + PCA |
|---|---|---|---|---|---|---|---|
| dbscan | 3 | -0.3585 | -0.4274 | -0.3585 | -0.7146 | -0.4274 | -0.4274 |
| dbscan | 4 | -0.3585 | -0.4274 | -0.3585 | -0.7146 | -0.4274 | -0.4274 |
| dbscan | 5 | -0.3585 | -0.4274 | -0.3585 | -0.7146 | -0.4274 | -0.4274 |
| hclust | 3 | 0.6663 | 0.0607 | 0.6663 | 0.5475 | 0.0607 | 0.0607 |
| hclust | 4 | 0.667 | 0.0813 | 0.667 | 0.4754 | 0.0813 | 0.0813 |
| hclust | 5 | 0.6644 | 0.0891 | 0.6644 | 0.4593 | 0.0891 | 0.0891 |
| kmeans | 3 | 0.695 | 0.1147 | 0.695 | 0.548 | 0.1704 | 0.0799 |
| kmeans | 4 | 0.6758 | 0.0858 | 0.6758 | 0.5027 | 0.0771 | 0.0574 |
| kmeans | 5 | 0.6224 | 0.0541 | 0.6278 | 0.4865 | 0.0905 | 0.1021 |

**Calinski-Harabasz**

| Algorithm | Number of Clusters | No Data PreProcessing | Normalization | PCA | Transformation | Transformation + Normalization | Transformation + Normalization + PCA |
|---|---|---|---|---|---|---|---|
| dbscan | 3 | 2.9168 | 11.1629 | 2.9168 | 30.7557 | 11.1629 | 11.1629 |
| dbscan | 4 | 2.9168 | 11.1629 | 2.9168 | 30.7557 | 11.1629 | 11.1629 |
| dbscan | 5 | 2.9168 | 11.1629 | 2.9168 | 30.7557 | 11.1629 | 11.1629 |
| hclust | 3 | 14724.3807 | 721.7452 | 14724.3807 | 311755.546 | 721.7452 | 721.7452 |
| hclust | 4 | 15605.0983 | 728.5675 | 15605.0983 | 292623.8974 | 728.5675 | 728.5675 |
| hclust | 5 | 16904.5948 | 739.824 | 16904.5948 | 296311.3836 | 739.824 | 739.824 |
| kmeans | 3 | 15202.0688 | 768.0329 | 15202.0688 | 315502.989 | 970.0911 | 902.9892 |
| kmeans | 4 | 17232.3126 | 667.4294 | 17232.3126 | 318296.6009 | 726.0686 | 759.8312 |
| kmeans | 5 | 20334.6668 | 838.2072 | 19802.1918 | 308221.5927 | 955.0055 | 839.1009 |

**Davies-Bouldin**

| Algorithm | Number of Clusters | No Data PreProcessing | Normalization | PCA | Transformation | Transformation + Normalization | Transformation + Normalization + PCA |
|---|---|---|---|---|---|---|---|
| dbscan | 3 | 0.9412 | 1.4134 | 0.9412 | 1.5281 | 1.4134 | 1.4134 |
| dbscan | 4 | 0.9412 | 1.4134 | 0.9412 | 1.5281 | 1.4134 | 1.4134 |
| dbscan | 5 | 0.9412 | 1.4134 | 0.9412 | 1.5281 | 1.4134 | 1.4134 |
| hclust | 3 | 0.5806 | 2.3787 | 0.5806 | 0.5065 | 2.3787 | 2.3787 |
| hclust | 4 | 0.4986 | 2.3035 | 0.4986 | 0.6176 | 2.3035 | 2.3035 |
| hclust | 5 | 0.5471 | 2.3236 | 0.5471 | 0.6537 | 2.3236 | 2.3236 |
| kmeans | 3 | 0.5601 | 2.5171 | 0.5601 | 0.5108 | 2.119 | 2.747 |
| kmeans | 4 | 0.5416 | 2.4386 | 0.5416 | 0.5843 | 2.291 | 2.0494 |
| kmeans | 5 | 0.5643 | 2.1077 | 0.5175 | 0.6168 | 2.2757 | 1.8619 |
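
When reading the table above, note the direction of each metric: higher Silhouette (range -1 to 1) and higher Calinski-Harabasz values indicate better-separated clusters, while a lower Davies-Bouldin value is better. The sketch below illustrates this with the "No Data PreProcessing" column only, with values copied from the table. The dictionary names are chosen here for illustration.

```python
# Scores for the "No Data PreProcessing" scenario, keyed by
# (algorithm, number of clusters). Values are copied from the table above.
silhouette = {
    ("dbscan", 3): -0.3585, ("dbscan", 4): -0.3585, ("dbscan", 5): -0.3585,
    ("hclust", 3): 0.6663, ("hclust", 4): 0.6670, ("hclust", 5): 0.6644,
    ("kmeans", 3): 0.6950, ("kmeans", 4): 0.6758, ("kmeans", 5): 0.6224,
}
davies_bouldin = {
    ("dbscan", 3): 0.9412, ("dbscan", 4): 0.9412, ("dbscan", 5): 0.9412,
    ("hclust", 3): 0.5806, ("hclust", 4): 0.4986, ("hclust", 5): 0.5471,
    ("kmeans", 3): 0.5601, ("kmeans", 4): 0.5416, ("kmeans", 5): 0.5643,
}

# Higher is better for Silhouette, lower is better for Davies-Bouldin.
best_by_silhouette = max(silhouette, key=silhouette.get)
best_by_davies_bouldin = min(davies_bouldin, key=davies_bouldin.get)

print("Best by Silhouette:", best_by_silhouette, silhouette[best_by_silhouette])
print("Best by Davies-Bouldin:", best_by_davies_bouldin, davies_bouldin[best_by_davies_bouldin])
```

The two metrics pick different configurations here, so it is worth comparing several metrics before settling on one model.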


In [15]:
print(results_pivot)

                                        Silhouette                        \
Scenario                     No Data PreProcessing Normalization     PCA   
Algorithm Number of Clusters                                               
dbscan    3                                -0.3585       -0.4274 -0.3585   
          4                                -0.3585       -0.4274 -0.3585   
          5                                -0.3585       -0.4274 -0.3585   
hclust    3                                 0.6663        0.0607  0.6663   
          4                                 0.6670        0.0813  0.6670   
          5                                 0.6644        0.0891  0.6644   
kmeans    3                                 0.6950        0.1147  0.6950   
          4                                 0.6758        0.0858  0.6758   
          5                                 0.6224        0.0541  0.6278   

                                                                            \
Scenario 

In [16]:
results_pivot.to_csv('result.csv')
print("Result file saved successfully!!")

Result file saved successfully!!
