$\textbf{PROGRAMMING ASSIGNMENT - LECTURE 3}$
---

Instructions: Choose a dataset of your liking and perform the following:

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.
3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (use only papers published by IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. In addition to the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the results of the three performance indices.
5. Compare and contrast the performance indices: what are the advantages and disadvantages of ARI, NMI, and FMI, and when should each be used?
6. Using K-Modes and hierarchical clustering, perform categorical data clustering on the same dataset, and use FMI, ARI, and NMI to compare performance.
7. Write your report using LaTeX. Your report should focus on the "whys and whats" of each performance metric (e.g., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).

---

In [59]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score

from sklearn.preprocessing import LabelEncoder

In [60]:
dataset = fetch_ucirepo(id=53)

X = dataset.data.features
y = dataset.data.targets

# The column headers are stored under dataset.data, not on dataset itself
df = pd.DataFrame(dataset.data.original, columns=dataset.data.headers)

In [61]:
df = df.dropna()

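# Encode the class labels as integers (LabelEncoder sorts the class names)
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(df['class'])

# Treat every column as categorical: cast each value to a string and map it
# to an integer code. This overwrites df in place, so from here on both the
# feature columns and 'class' hold integer codes, not the original values.
label_encoders = {}
for column in df.columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column].astype(str))
    label_encoders[column] = le

# Separate features and target variable
X = df.drop(columns=['class'])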

# OneHotEncode the features for clustering
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)
X_encoded = enc.fit_transform(X)

In [62]:
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_encoded)
y_pred_kmeans = kmeans.labels_

# Compute metrics for KMeans
ari_kmeans = adjusted_rand_score(y_true, y_pred_kmeans)
nmi_kmeans = normalized_mutual_info_score(y_true, y_pred_kmeans)
fmi_kmeans = fowlkes_mallows_score(y_true, y_pred_kmeans)

print(f"KMeans ARI: {ari_kmeans}, NMI: {nmi_kmeans}, FMI: {fmi_kmeans}")

# K-Modes clustering
from kmodes.kmodes import KModes

km = KModes(n_clusters=3, init='Huang', random_state=42)
km_clusters = km.fit_predict(X)

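# Hierarchical clustering
from scipy.cluster.hierarchy import linkage, fcluster

# Ward linkage works on Euclidean distances, so the integer codes in X are
# treated as numeric values here rather than as unordered categories
Z = linkage(X, method='ward')
# Cut the dendrogram so that at most 3 flat clusters remain
hier_clusters = fcluster(Z, 3, criterion='maxclust')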

# Compute metrics for K-Modes
ari_kmodes = adjusted_rand_score(y_true, km_clusters)
nmi_kmodes = normalized_mutual_info_score(y_true, km_clusters)
fmi_kmodes = fowlkes_mallows_score(y_true, km_clusters)

# Compute metrics for Hierarchical
ari_hier = adjusted_rand_score(y_true, hier_clusters)
nmi_hier = normalized_mutual_info_score(y_true, hier_clusters)
fmi_hier = fowlkes_mallows_score(y_true, hier_clusters)

print(f"KModes ARI: {ari_kmodes}, NMI: {nmi_kmodes}, FMI: {fmi_kmodes}")
print(f"Hierarchical ARI: {ari_hier}, NMI: {nmi_hier}, FMI: {fmi_hier}")

KMeans ARI: 0.37852958394680664, NMI: 0.4408899270859355, FMI: 0.5854489694615445
KModes ARI: 0.09216062573782456, NMI: 0.13269189892685265, FMI: 0.43880184250711013
Hierarchical ARI: 0.7311985567707746, NMI: 0.7700836616487869, FMI: 0.8221697785442927


In [63]:
import warnings
warnings.filterwarnings('ignore')

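# df was label-encoded in place in an earlier cell, so these four columns
# hold integer codes rather than the raw measurements
X = df[['sepal length', 'sepal width', 'petal length', 'petal width']]

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

y_pred = kmeans.labels_

# 'class' is already integer-coded by the earlier LabelEncoder loop, so this
# replace() matches no string keys and leaves the labels unchanged
y_true = df['class'].replace({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2})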

ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
fmi = fowlkes_mallows_score(y_true, y_pred)

print(f"Adjusted Rand Index (ARI): {ari}")
print(f"Normalized Mutual Information (NMI): {nmi}")
print(f"Fowlkes-Mallows Index (FMI): {fmi}")


Adjusted Rand Index (ARI): 0.6652734126084092
Normalized Mutual Information (NMI): 0.6762038347420312
Fowlkes-Mallows Index (FMI): 0.7770910213718182


#### Comparative Analysis of Clustering Performance Metrics: ARI, NMI, and FMI

This report compares the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and the Fowlkes-Mallows Index (FMI) as performance metrics for clustering algorithms. The notebook loads UCI dataset 53 (Iris, with the columns `sepal length`, `sepal width`, `petal length`, `petal width`, and `class`) and applies K-Means, K-Modes, and hierarchical clustering to it.

Clustering performance metrics are essential for evaluating the quality of clustering results. ARI, NMI, and FMI are all external metrics: each scores a predicted partition against known class labels, and each has its own advantages and disadvantages.
All columns were label-encoded before clustering, and each three-cluster result was evaluated using ARI, NMI, and FMI. In the recorded runs, hierarchical clustering scored highest (ARI 0.731, NMI 0.770, FMI 0.822) and K-Modes lowest (ARI 0.092, NMI 0.133, FMI 0.439); in every run FMI was the largest of the three scores and ARI the smallest.
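
One way to see how the three scores differ is to apply them to labelings that carry no information at all. The sketch below is not part of the original experiment: the helper `random_baseline` and its parameters (150 samples, 3 balanced classes, 200 trials) are assumptions chosen only for illustration. It averages each metric over uniformly random labelings; ARI should average close to 0 because it is corrected for chance, while FMI should stay well above 0 (roughly 1/k for k balanced clusters).

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
)


def random_baseline(n_samples=150, n_classes=3, n_trials=200, seed=0):
    """Average ARI, NMI, and FMI of uniformly random labelings.

    All sizes are illustrative assumptions, not values from the report.
    """
    rng = np.random.default_rng(seed)
    # Balanced ground truth: n_samples // n_classes points per class
    y_true = np.repeat(np.arange(n_classes), n_samples // n_classes)
    scores = {"ARI": [], "NMI": [], "FMI": []}
    for _ in range(n_trials):
        # A labeling drawn independently of the ground truth
        y_rand = rng.integers(0, n_classes, size=y_true.size)
        scores["ARI"].append(adjusted_rand_score(y_true, y_rand))
        scores["NMI"].append(normalized_mutual_info_score(y_true, y_rand))
        scores["FMI"].append(fowlkes_mallows_score(y_true, y_rand))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}


baseline = random_baseline()
for name, value in baseline.items():
    print(f"{name} on random labels: {value:.3f}")
```

Because the three metrics start from different chance levels, a gap between FMI and ARI on real data does not by itself mean that one metric "rates" the clustering better than the other.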


#### Discussion
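- Adjusted Rand Index (ARI)
    - ARI corrects the Rand index for chance agreement: it equals 1 for identical partitions, has an expected value of 0 for random labelings, and can be negative when agreement is worse than chance. This makes it suitable for comparing clustering results with ground truth labels. However, its value still depends on the number and sizes of the clusters, so partitions with very different cluster counts should be compared with care.

- Normalized Mutual Information (NMI)
    - NMI rescales mutual information to the range [0, 1], which makes it suitable for comparing different clustering results. It is not adjusted for chance, however, so it tends to rise as the number of clusters grows, and it may not capture all aspects of clustering quality.

- Fowlkes-Mallows Index (FMI)
    - FMI is the geometric mean of pairwise precision and recall, where a "positive" is a pair of samples assigned to the same cluster. It is intuitive but may not be as commonly used as ARI and NMI. Like NMI, it is not adjusted for chance, so random labelings score above 0, which is consistent with FMI exceeding ARI in every run in this notebook.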
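
The pair-counting view behind ARI and FMI can be made concrete with a small worked example. The sketch below is illustrative only: the helper names (`pair_counts`, `fmi_from_pairs`, `ari_from_pairs`) and the six-point toy labeling are assumptions, not part of the notebook. It counts sample pairs that are grouped together in both partitions (TP), only in the prediction (FP), only in the ground truth (FN), or in neither (TN), then derives both scores from those counts and compares them with scikit-learn.

```python
from math import comb, isclose, sqrt

from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score
from sklearn.metrics.cluster import contingency_matrix


def pair_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) counted over all unordered sample pairs."""
    c = contingency_matrix(y_true, y_pred)
    n = int(c.sum())
    # Pairs that share both a class and a cluster
    tp = sum(comb(int(v), 2) for v in c.ravel())
    # Pairs that share a class (row sums) or a cluster (column sums)
    same_true = sum(comb(int(v), 2) for v in c.sum(axis=1))
    same_pred = sum(comb(int(v), 2) for v in c.sum(axis=0))
    fp = same_pred - tp
    fn = same_true - tp
    tn = comb(n, 2) - tp - fp - fn
    return tp, fp, fn, tn


def fmi_from_pairs(tp, fp, fn):
    """Geometric mean of pairwise precision and recall.

    Assumes at least one same-cluster pair in each partition.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)


def ari_from_pairs(tp, fp, fn, tn):
    """Pair-count form of the Adjusted Rand Index."""
    numerator = 2 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator


# Toy labeling (an assumption for illustration): 6 points, 2 classes, 3 clusters
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 2, 2]

tp, fp, fn, tn = pair_counts(y_true, y_pred)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("FMI by hand:", fmi_from_pairs(tp, fp, fn))
print("FMI sklearn:", fowlkes_mallows_score(y_true, y_pred))
print("ARI by hand:", ari_from_pairs(tp, fp, fn, tn))
print("ARI sklearn:", adjusted_rand_score(y_true, y_pred))
```

FMI uses only TP, FP, and FN, whereas the ARI numerator subtracts the product FN x FP. That subtraction is the chance correction, and it is what pulls ARI below FMI on the same pair counts in this example.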

#### Conclusion
This study highlights the importance of selecting appropriate clustering performance metrics. ARI, NMI, and FMI each have their strengths and are suitable for different scenarios. Understanding their advantages and disadvantages is crucial for accurate clustering evaluation.