## Programming Assignment
#### Submitted by Maria Eloisa H. Garcia

----

#### Instructions:

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.

In [1]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import numpy as np
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import OneHotEncoder
import networkx as nx
import itertools
import warnings
from sklearn.cluster import AgglomerativeClustering
from kmodes.kmodes import KModes
from sklearn.preprocessing import LabelEncoder 

In [2]:
# fetch datasets
soybean = fetch_ucirepo(id=91)
zoo = fetch_ucirepo(id=111) 
heart_disease = fetch_ucirepo(id=45) 
dermatology = fetch_ucirepo(id=33)
breast_cancer = fetch_ucirepo(id=15)
mushroom = fetch_ucirepo(id=73) 

In [3]:
# convert to dataframes
X = soybean.data.features
y = soybean.data.targets 
soybean_df = pd.merge(X, y, left_index=True, right_index=True)

X = zoo.data.features
y = zoo.data.targets 
zoo_df = pd.merge(X, y, left_index=True, right_index=True)

X = heart_disease.data.features
y = heart_disease.data.targets 
heart_disease_df = pd.merge(X, y, left_index=True, right_index=True)

X = dermatology.data.features
y = dermatology.data.targets 
dermatology_df = pd.merge(X, y, left_index=True, right_index=True)

X = breast_cancer.data.features
y = breast_cancer.data.targets 
breast_cancer_df = pd.merge(X, y, left_index=True, right_index=True)

X = mushroom.data.features
y = mushroom.data.targets 
mushroom_df = pd.merge(X, y, left_index=True, right_index=True)

soybean_df = soybean_df.dropna()
zoo_df = zoo_df.dropna()
heart_disease_df = heart_disease_df.dropna()
dermatology_df = dermatology_df.dropna()
breast_cancer_df = breast_cancer_df.dropna()
mushroom_df = mushroom_df.dropna()

In [4]:
def jaccard_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

def ochiai_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = np.sqrt(len(set1) * len(set2))
    return intersection / denominator

def overlap_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    min_length = min(len(set1), len(set2))
    return intersection / min_length

def dice_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = len(set1) + len(set2)
    return 2 * intersection / denominator

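def graph_based_representation(data):
    # Build a feature-by-feature similarity matrix from the Jaccard coefficient
    # of each pair of columns' value sets. For one-hot input every column's
    # value set is a subset of {0, 1}.
    num_samples, num_features = data.shape
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        similarity_matrix[i, j] = jaccard_coefficient(set(data[:, i]), set(data[:, j]))
        similarity_matrix[j, i] = similarity_matrix[i, j]
    G = nx.from_numpy_array(similarity_matrix)  # graph view of the matrix; not used below
    # p (embedding dimension) is read from the global defined in In [5].
    # SpectralEmbedding uses its default affinity here, so the rows of
    # similarity_matrix are treated as feature vectors.
    embedding = SpectralEmbedding(n_components=p)
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix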

def joint_operation(data, representation_matrix):
    return np.dot(data, representation_matrix)

def mean_operation(data, representation_matrix):
    return np.mean(np.dot(data, representation_matrix), axis=1)

def perform_clustering(data, k):
    kmeans = KMeans(n_clusters=k)
    return kmeans.fit_predict(data)
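
As a quick sanity check of the four set-similarity coefficients defined above, the sketch below restates them in plain Python (using `math.sqrt` in place of `np.sqrt`, an assumption made only to keep the example dependency-free) and evaluates them on two small hypothetical sets that are not taken from the datasets.

```python
import math

def jaccard_coefficient(set1, set2):
    # |A ∩ B| / |A ∪ B|
    return len(set1 & set2) / len(set1 | set2)

def ochiai_coefficient(set1, set2):
    # |A ∩ B| / sqrt(|A| * |B|)
    return len(set1 & set2) / math.sqrt(len(set1) * len(set2))

def overlap_coefficient(set1, set2):
    # |A ∩ B| / min(|A|, |B|)
    return len(set1 & set2) / min(len(set1), len(set2))

def dice_coefficient(set1, set2):
    # 2 * |A ∩ B| / (|A| + |B|)
    return 2 * len(set1 & set2) / (len(set1) + len(set2))

# Hypothetical category sets, for illustration only.
a = {"red", "green", "blue"}
b = {"green", "blue", "yellow", "black"}

scores = {
    "jaccard": jaccard_coefficient(a, b),
    "ochiai": ochiai_coefficient(a, b),
    "overlap": overlap_coefficient(a, b),
    "dice": dice_coefficient(a, b),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All four coefficients share the same numerator (the size of the intersection) and differ only in how they normalize it, which is why they rank pairs of sets similarly but on different scales.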

3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (use only papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. Aside from the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the results of each performance index.

In [5]:
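# define parameters
p = 10  # embedding dimension used by graph_based_representation
q = 10  # not used in the cells below
k = 3   # number of clusters, applied to every dataset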

results = []

# data preprocessing and clustering
datasets = ["soybean_df", "zoo_df", "heart_disease_df", "dermatology_df", "breast_cancer_df", "mushroom_df"]
for dataset_name in datasets:
    dataset = globals()[dataset_name]
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        try:
            # X is the full dataframe, so the target (last) column is encoded together with the features
            X = dataset
            enc = OneHotEncoder()
            X_encoded = enc.fit_transform(X)
            representation_matrix = graph_based_representation(X_encoded.toarray())
            integrated_data = joint_operation(X_encoded.toarray(), representation_matrix)
            labels = perform_clustering(integrated_data, k)

            true_labels = dataset.iloc[:, -1] 
            ARI = adjusted_rand_score(true_labels, labels)
            NMI = normalized_mutual_info_score(true_labels, labels)
            FMI = fowlkes_mallows_score(true_labels, labels)
            results.append([dataset_name, ARI, NMI, FMI])
        except UserWarning as e:
            print(f"Warning: {e}")

results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_df)

            Dataset       ARI       NMI       FMI
0        soybean_df  0.427922  0.615203  0.602283
1            zoo_df  0.714645  0.734450  0.809650
2  heart_disease_df  0.305414  0.373894  0.545601
3    dermatology_df  0.442272  0.586775  0.620302
4  breast_cancer_df  0.716976  0.640601  0.859445
5       mushroom_df  0.289424  0.344347  0.600922



5. Compare and contrast the performance indices. What are the advantages and disadvantages of ARI, NMI, and FMI, and when should each be used?
    
    = The Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Fowlkes-Mallows Index (FMI) are external validation metrics: each compares a predicted clustering against a reference labeling. ARI counts the pairs of samples on which the two clusterings agree and corrects that count for chance, so random labelings score close to 0 regardless of the number or size of clusters; its drawback is that it can be negative, which makes it less intuitive than a plain proportion. NMI quantifies the information shared by two clusterings and normalizes the mutual information to a value between 0 and 1, where 1 indicates identical clusterings; it is not corrected for chance and tends to rise as the number of clusters grows. FMI is the geometric mean of pairwise precision and recall; it is easy to interpret but, like NMI, is not corrected for chance, so even a poor clustering can receive a clearly positive score.
    
    In practice, ARI is preferred when the clusterings being compared differ in the number or size of clusters and a chance-corrected score is needed. NMI is suitable when an information-theoretic comparison is wanted and the number of clusters is similar across the results being compared. FMI is valuable when pairwise precision and recall are the quantities of interest.
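
To make the "geometric mean of precision and recall" description of FMI concrete, the sketch below counts sample pairs directly. It is a minimal pure-Python illustration, not the scikit-learn implementation used in the cells above; the function name `pairwise_precision_recall_fmi` and the toy labels are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pairwise_precision_recall_fmi(true_labels, pred_labels):
    """Return (precision, recall, FMI) computed over all pairs of samples."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the reference
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, sqrt(precision * recall)

# Toy labels, for illustration only.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2]

precision, recall, fmi = pairwise_precision_recall_fmi(true_labels, pred_labels)
print(f"precision={precision:.3f} recall={recall:.3f} FMI={fmi:.3f}")
```

Enumerating pairs is quadratic in the number of samples, so this form is only meant for small examples; it makes explicit that FMI never looks at pairs that are separated in both labelings.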

6. Using K-modes and hierarchical clustering, perform categorical data clustering on the same datasets, and use FMI, ARI, and NMI to compare performance.

In [6]:
results_categorical = []

for dataset_name in datasets:
    dataset = globals()[dataset_name]
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        try:
            X_cat = dataset.iloc[:, :-1]
            true_labels_cat = dataset.iloc[:, -1]

            encoder = LabelEncoder()
            X_cat_encoded = X_cat.apply(encoder.fit_transform)

            km = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
            km_labels = km.fit_predict(X_cat_encoded)
            ARI_km = adjusted_rand_score(true_labels_cat, km_labels)
            NMI_km = normalized_mutual_info_score(true_labels_cat, km_labels)
            FMI_km = fowlkes_mallows_score(true_labels_cat, km_labels)

            ac = AgglomerativeClustering(n_clusters=k, linkage='ward')
            ac_labels = ac.fit_predict(X_cat_encoded)
            ARI_ac = adjusted_rand_score(true_labels_cat, ac_labels)
            NMI_ac = normalized_mutual_info_score(true_labels_cat, ac_labels)
            FMI_ac = fowlkes_mallows_score(true_labels_cat, ac_labels)
            
            results_categorical.append([dataset_name + " (Kmodes)", ARI_km, NMI_km, FMI_km])
            results_categorical.append([dataset_name + " (Hierarchical)", ARI_ac, NMI_ac, FMI_ac])
        except UserWarning as e:
            print(f"Warning: {e}")

results_categorical_df = pd.DataFrame(results_categorical, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_categorical_df)

                            Dataset       ARI       NMI       FMI
0               soybean_df (Kmodes)  0.653689  0.837666  0.783908
1         soybean_df (Hierarchical)  0.653689  0.837666  0.783908
2                   zoo_df (Kmodes)  0.721570  0.714510  0.812035
3             zoo_df (Hierarchical)  0.461701  0.585056  0.645736
4         heart_disease_df (Kmodes)  0.152201  0.173102  0.443410
5   heart_disease_df (Hierarchical)  0.009543  0.010944  0.353434
6           dermatology_df (Kmodes)  0.283247  0.348970  0.480600
7     dermatology_df (Hierarchical)  0.032201  0.078147  0.293614
8         breast_cancer_df (Kmodes)  0.368440  0.447053  0.659501
9   breast_cancer_df (Hierarchical)  0.781678  0.688725  0.893555
10             mushroom_df (Kmodes)  0.325039  0.369164  0.650832
11       mushroom_df (Hierarchical)  0.298722  0.431275  0.610515


7. Write your report using LaTeX. Your report should focus on the "whys and whats" of each performance metric (i.e., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).

    = The ARI, NMI, and FMI metrics are critical in evaluating clustering performance in various domains. For example, in the analysis of single-cell RNA sequencing data, ARI is used as one of the benchmark metrics to evaluate the effectiveness of batch-effect correction methods. Both NMI and ARI are utilized in deep image clustering to evaluate clustering methods' performance. Similarly, in network-guided sparse subspace clustering, NMI and ARI are exploited to evaluate different clustering techniques.

    These metrics are selected because they give quantitative insight into the quality of clustering results. For task-oriented clustering in dialogues, NMI and ARI are used along with other metrics to evaluate clustering algorithms. In constrained clustering with pairwise and cardinality constraints, ARI is used to measure the resulting clustering performance.
    
    FMI is numerically higher than ARI in every row of the two result tables, and higher than NMI in every row except the soybean_df rows. This does not mean the clusterings are better under FMI; it reflects how the indices are scaled. ARI subtracts the agreement expected by chance, so a clustering that is only slightly better than random scores near 0. FMI is the geometric mean of pairwise precision and recall with no chance correction, so the same clustering still receives a clearly positive score. NMI is widely used but is also not corrected for chance and has been criticized for its sensitivity to the number of clusters and to cluster size imbalance, which can lead to misleading results.

    For this reason FMI is best read alongside ARI and NMI rather than in place of them: FMI gives an interpretable precision-and-recall view of the pairwise agreement, while ARI indicates how much of that agreement exceeds chance.
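
One reason FMI values tend to sit above ARI values is that FMI is not adjusted for chance. The sketch below illustrates this with a deliberately uninformative prediction that puts every sample in one cluster. It is a minimal pure-Python version of the three indices built from the contingency table, under the assumptions that there are at least two samples and that NMI is normalized by the arithmetic mean of the two entropies (chosen here to match the scikit-learn default); the helper names are hypothetical.

```python
from collections import Counter
from math import comb, log, sqrt

def _pair_counts(true_labels, pred_labels):
    # Pairs grouped together in both labelings, in the reference, in the prediction.
    joint = Counter(zip(true_labels, pred_labels))
    together_both = sum(comb(n, 2) for n in joint.values())
    together_true = sum(comb(n, 2) for n in Counter(true_labels).values())
    together_pred = sum(comb(n, 2) for n in Counter(pred_labels).values())
    return together_both, together_true, together_pred

def fmi_score(true_labels, pred_labels):
    both, in_true, in_pred = _pair_counts(true_labels, pred_labels)
    if in_true == 0 or in_pred == 0:
        return 0.0
    return both / sqrt(in_true * in_pred)

def ari_score(true_labels, pred_labels):
    both, in_true, in_pred = _pair_counts(true_labels, pred_labels)
    expected = in_true * in_pred / comb(len(true_labels), 2)  # chance agreement
    maximum = 0.5 * (in_true + in_pred)
    if maximum == expected:
        return 1.0  # both labelings are trivial and identical in structure
    return (both - expected) / (maximum - expected)

def nmi_score(true_labels, pred_labels):
    n = len(true_labels)
    rows, cols = Counter(true_labels), Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mi = sum((nij / n) * log(nij * n / (rows[u] * cols[v]))
             for (u, v), nij in joint.items())
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    denom = 0.5 * (h_true + h_pred)  # arithmetic-mean normalization (assumption)
    return mi / denom if denom > 0 else 1.0

# Toy reference labels and a prediction that carries no information.
true_labels = [0, 0, 0, 1, 1, 1]
one_cluster = [0, 0, 0, 0, 0, 0]

ari = ari_score(true_labels, one_cluster)
nmi = nmi_score(true_labels, one_cluster)
fmi = fmi_score(true_labels, one_cluster)
print(f"ARI={ari:.3f} NMI={nmi:.3f} FMI={fmi:.3f}")
```

In this toy case ARI and NMI are both 0 while FMI stays well above 0, because every same-class pair counts as a correctly grouped pair when all samples share one cluster. A high FMI on its own therefore says little about how far a clustering is from chance.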