### PROGRAMMING ASSIGNMENT
---

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.
3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (only use papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. Aside from the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the results of each performance index.
5. Compare and contrast each performance index, what are the advantages and disadvantages of ARI, NMI, and FMI, and when to use each?
6. Using Kmodes and Hierarchical Clustering, use the same dataset and perform categorical data clustering, use FMI, ARI, and NMI for the comparison of performance.
7. Write your report using LaTeX. Your report should focus on the "whys and whats" of each performance metric (e.g., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).

In [10]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import numpy as np
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import OneHotEncoder
import networkx as nx
import itertools
import warnings
from sklearn.cluster import AgglomerativeClustering
from kmodes.kmodes import KModes
from sklearn.preprocessing import LabelEncoder 

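# Fetch datasets from the UCI repository
soybean = fetch_ucirepo(id=91)
zoo = fetch_ucirepo(id=111)
heart_disease = fetch_ucirepo(id=45)
dermatology = fetch_ucirepo(id=33)
breast_cancer = fetch_ucirepo(id=15)
mushroom = fetch_ucirepo(id=73)

# Combine features and target into one DataFrame per dataset, target last.
# Assumption: the cell that originally built the *_df frames is not shown, so any
# cleaning it performed (e.g. missing-value handling) is not reproduced here.
def to_frame(ds):
    return pd.concat([ds.data.features, ds.data.targets], axis=1)

soybean_df = to_frame(soybean)
zoo_df = to_frame(zoo)
heart_disease_df = to_frame(heart_disease)
dermatology_df = to_frame(dermatology)
breast_cancer_df = to_frame(breast_cancer)
mushroom_df = to_frame(mushroom)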

In [35]:
import pandas as pd

# Define a function to load and describe datasets
def load_and_describe_dataset(name, dataset):
    print(f"Dataset: {name}")
    print(f"Shape: {dataset.shape}")
    print("\nColumns:")
    print(dataset.columns.to_list())
    print("\nInfo:")
    dataset.info()  # info() prints its summary itself and returns None
    print("\nDescription:")
    print(dataset.describe())
    print("\nSample Data:")
    print(dataset.head())
    print("\n" + "="*50 + "\n")  # Separator line

# Assuming soybean_df, zoo_df, heart_disease_df, dermatology_df, breast_cancer_df, and mushroom_df
# are already-built DataFrames with the target in the last column

# Load and describe datasets
load_and_describe_dataset("Soybean", soybean_df)
load_and_describe_dataset("Zoo", zoo_df)
load_and_describe_dataset("Heart Disease", heart_disease_df)
load_and_describe_dataset("Dermatology", dermatology_df)
load_and_describe_dataset("Breast Cancer", breast_cancer_df)
load_and_describe_dataset("Mushroom", mushroom_df)


Dataset: Soybean
Shape: (47, 36)

Columns:
['date', 'plant-stand', 'precip', 'temp', 'hail', 'crop-hist', 'area-damaged', 'severity', 'seed-tmt', 'germination', 'plant-growth', 'leaves', 'leafspots-halo', 'leafspots-marg', 'leafspot-size', 'leaf-shread', 'leaf-malf', 'leaf-mild', 'stem', 'lodging', 'stem-cankers', 'canker-lesion', 'fruiting-bodies', 'external-decay', 'mycelium', 'int-discolor', 'sclerotia', 'fruit-pods', 'fruit-spots', 'seed', 'mold-growth', 'seed-discolor', 'seed-size', 'shriveling', 'roots', 'class']

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             47 non-null     int64 
 1   plant-stand      47 non-null     int64 
 2   precip           47 non-null     int64 
 3   temp             47 non-null     int64 
 4   hail             47 non-null     int64 
 5   crop-hist        47 non-null     int64 
 6   

In [37]:
import pandas as pd
import numpy as np
import warnings
import itertools
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import OneHotEncoder

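def jaccard_coefficient(set1, set2):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = len(set1.union(set2))
    if union == 0:  # both sets empty: avoid dividing by zero
        return 0.0
    return len(set1.intersection(set2)) / union

def graph_based_representation(data, k):
    """Embed the columns of `data` into at most k spectral dimensions."""
    num_samples, num_features = data.shape
    # Pairwise Jaccard similarity between the sets of values taken by each column.
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        similarity_matrix[i, j] = jaccard_coefficient(set(data[:, i]), set(data[:, j]))
        similarity_matrix[j, i] = similarity_matrix[i, j]
    # SpectralEmbedding keeps its default nearest-neighbors affinity, so each row of
    # similarity_matrix is treated as a feature vector, not as a precomputed affinity.
    embedding = SpectralEmbedding(n_components=min(num_features, k))
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix

def perform_clustering(data, k):
    # n_init is set explicitly to the former default (10) to avoid the FutureWarning.
    kmeans = KMeans(n_clusters=k, n_init=10)
    return kmeans.fit_predict(data)

def analyze_dataset(dataset_name, dataset, k):
    """Cluster one dataset and score the result against its class labels."""
    results = []
    X = dataset.iloc[:, :-1].values  # Features
    y = dataset.iloc[:, -1].values   # Target labels
    enc = OneHotEncoder()
    X_encoded = enc.fit_transform(X).toarray()  # One-hot encode features
    # One embedding row per one-hot column: (n_onehot_columns, n_components)
    representation_matrix = graph_based_representation(X_encoded, k)
    # Project the samples onto the embedding: (n_samples, n_components)
    integrated_data = np.dot(X_encoded, representation_matrix)
    k = min(k, X_encoded.shape[1])  # Ensure k does not exceed number of features
    labels = perform_clustering(integrated_data, k)
    ARI = adjusted_rand_score(y, labels)
    NMI = normalized_mutual_info_score(y, labels)
    FMI = fowlkes_mallows_score(y, labels)
    results.append([dataset_name, ARI, NMI, FMI])
    return results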

# Define parameters
p = 10
q = 10
k = 3

# Define dataset names and load datasets
datasets = ["soybean_df", "zoo_df", "heart_disease_df", "dermatology_df", "breast_cancer_df", "mushroom_df"]

# Analyze each dataset
results = []
for dataset_name in datasets:
    dataset = globals()[dataset_name]
    # UserWarnings are silenced for the duration of each run.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        results.extend(analyze_dataset(dataset_name, dataset, k))

# Create DataFrame for results
results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_df)




            Dataset       ARI       NMI       FMI
0        soybean_df  0.183357  0.341804  0.436280
1            zoo_df  0.379489  0.466597  0.577069
2  heart_disease_df  0.057919  0.050427  0.385016
3    dermatology_df  0.259515  0.300969  0.458960
4  breast_cancer_df  0.253030  0.270470  0.598487
5       mushroom_df  0.033229  0.099499  0.478837


**Compare and contrast each performance index, what are the advantages and disadvantages of ARI, NMI, and FMI, and when to use each?**
The Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Fowlkes-Mallows Index (FMI) are external metrics that evaluate a clustering against reference labels. ARI compares two clusterings by examining every pair of samples and counting the pairs on which the two agree (grouped together in both, or separated in both), then corrects that count for the agreement expected by chance. Its advantage is this chance adjustment: a score of 1 signifies perfect agreement, a score near 0 indicates chance-level agreement, and negative scores indicate agreement worse than chance. Its limitations are that its value depends on the number and sizes of the clusters being compared and that it requires ground-truth labels.

In contrast, NMI computes the mutual information between the cluster assignments and the ground-truth labels and normalizes it by the entropies of the two labelings. The score ranges from 0 to 1, with 1 representing perfect agreement. NMI is advantageous in that it accounts for the distribution of clusters and labels; however, it is not adjusted for chance, so it tends to favor clusterings with more clusters, and it also requires ground-truth labels.
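The bias toward many clusters can be seen on a toy example. The following is a minimal sketch, not the notebook's method: it reimplements NMI in plain Python, assuming natural logarithms and normalization by the arithmetic mean of the two entropies (which I believe matches the default of `normalized_mutual_info_score`). The helper names `entropy`, `mutual_information`, and `nmi`, and the twelve-sample labeling, are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())

def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies (assumed convention)."""
    denom = (entropy(a) + entropy(b)) / 2
    return mutual_information(a, b) / denom if denom > 0 else 1.0

# Hypothetical data: two balanced classes of six samples each.
true_labels = [0] * 6 + [1] * 6
singletons = list(range(12))   # every sample in its own cluster
one_cluster = [0] * 12         # every sample in the same cluster

print("singletons :", round(nmi(true_labels, singletons), 3))
print("one cluster:", round(nmi(true_labels, one_cluster), 3))
```

Putting every sample in its own cluster carries no grouping information, yet it still receives a clearly positive NMI, while a single all-inclusive cluster scores 0. This is why NMI values are hard to compare between results with different cluster counts.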

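FMI is the geometric mean of pairwise precision and recall: of the pairs of samples placed in the same cluster, how many share a class, and of the pairs that share a class, how many are placed in the same cluster. Its value ranges from 0 to 1, with 1 representing perfect agreement. FMI is simple to interpret, but it is not corrected for chance, so an uninformative clustering can still receive a substantial score, and like ARI and NMI it requires ground-truth labels. Each metric has advantages and disadvantages, and the appropriate choice depends on the context and aims of the analysis: ARI is the safer default when cluster counts differ between methods, NMI suits an information-theoretic reading, and FMI suits cases where pairwise precision and recall are the quantities of interest.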
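For reference, the three indices can be written in terms of pair counts and entropies. The notation here is an assumption introduced for this sketch: $TP$ counts pairs grouped together in both labelings, $FP$ pairs grouped together only in the predicted clustering, $FN$ pairs grouped together only in the reference labels, and $TN$ pairs separated in both; $U$ and $V$ denote the two labelings. The NMI form shown uses arithmetic-mean normalization, which is one of several conventions.

```latex
\begin{align}
\mathrm{FMI} &= \frac{TP}{\sqrt{(TP + FP)\,(TP + FN)}} \\
\mathrm{RI}  &= \frac{TP + TN}{TP + FP + FN + TN}, \qquad
\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{1 - \mathbb{E}[\mathrm{RI}]} \\
\mathrm{NMI}(U, V) &= \frac{2\, I(U; V)}{H(U) + H(V)}
\end{align}
```

As a worked example, take six samples with reference labels $(0,0,0,1,1,1)$ and predicted clusters $(0,0,1,1,2,2)$. Of the 15 pairs, $TP = 2$, $FP = 1$, $FN = 4$, and $TN = 8$, so $\mathrm{FMI} = 2/\sqrt{3 \cdot 6} \approx 0.471$ and $\mathrm{RI} = 10/15 \approx 0.667$. The expected Rand index for these cluster sizes is $0.56$, giving $\mathrm{ARI} = (0.667 - 0.56)/(1 - 0.56) \approx 0.242$. The same partition therefore scores almost twice as high on FMI as on ARI.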

**Using Kmodes and Hierarchical Clustering, use the same dataset and perform categorical data clustering, use FMI, ARI, and NMI for the comparison of performance.**

In [39]:
results_categorical = []

for dataset_name in datasets:
    dataset = globals()[dataset_name]
    # UserWarnings are silenced for the duration of each run.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        X_cat = dataset.iloc[:, :-1]           # categorical features
        true_labels_cat = dataset.iloc[:, -1]  # ground-truth classes

        # Integer-encode each column independently (the encoder is refit per column).
        encoder = LabelEncoder()
        X_cat_encoded = X_cat.apply(encoder.fit_transform)

        # K-modes with Huang initialization, best of 5 runs.
        km = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
        km_labels = km.fit_predict(X_cat_encoded)
        ARI_km = adjusted_rand_score(true_labels_cat, km_labels)
        NMI_km = normalized_mutual_info_score(true_labels_cat, km_labels)
        FMI_km = fowlkes_mallows_score(true_labels_cat, km_labels)

        # Agglomerative (hierarchical) clustering; Ward linkage uses Euclidean
        # distances on the integer codes.
        ac = AgglomerativeClustering(n_clusters=k, linkage='ward')
        ac_labels = ac.fit_predict(X_cat_encoded)
        ARI_ac = adjusted_rand_score(true_labels_cat, ac_labels)
        NMI_ac = normalized_mutual_info_score(true_labels_cat, ac_labels)
        FMI_ac = fowlkes_mallows_score(true_labels_cat, ac_labels)

        results_categorical.append([f"{dataset_name} (Kmodes)", ARI_km, NMI_km, FMI_km])
        results_categorical.append([f"{dataset_name} (Hierarchical)", ARI_ac, NMI_ac, FMI_ac])

results_categorical_df = pd.DataFrame(results_categorical, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_categorical_df)

                            Dataset       ARI       NMI       FMI
0               soybean_df (Kmodes)  0.433978  0.620631  0.614300
1         soybean_df (Hierarchical)  0.653689  0.837666  0.783908
2                   zoo_df (Kmodes)  0.721570  0.714510  0.812035
3             zoo_df (Hierarchical)  0.461701  0.585056  0.645736
4         heart_disease_df (Kmodes)  0.205622  0.212495  0.477551
5   heart_disease_df (Hierarchical)  0.009543  0.010944  0.353434
6           dermatology_df (Kmodes)  0.534244  0.693801  0.691163
7     dermatology_df (Hierarchical)  0.032201  0.078147  0.293614
8         breast_cancer_df (Kmodes)  0.347079  0.419293  0.646015
9   breast_cancer_df (Hierarchical)  0.781678  0.688725  0.893555
10             mushroom_df (Kmodes)  0.324104  0.335668  0.640111
11       mushroom_df (Hierarchical)  0.298722  0.431275  0.610515


**Your report should be focused on the "whys and whats" of each performance metric (e.g., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).**

ARI, NMI, and FMI are widely used to assess clustering across domains. For example, ARI is used to test batch-effect correction algorithms for single-cell RNA sequencing data, and NMI and ARI are standard in deep image clustering. NMI and ARI are likewise applied to network-guided sparse subspace clustering.

These metrics provide essential information on clustering quality. In dialogue-based clustering, for instance, NMI and ARI are used to evaluate clustering methods. ARI has also been used to evaluate clustering under pairwise and cardinality constraints.

In the results above, FMI is at least as large as ARI in every row, and it exceeds NMI in most rows; the exceptions are soybean (K-modes and hierarchical) and dermatology (K-modes), where NMI is slightly higher. The main reason is chance correction. FMI is the geometric mean of pairwise precision and recall, so a clustering that happens to place many same-class pairs together scores well above 0 even when the assignment is uninformative, particularly with few, large clusters. ARI subtracts the agreement expected from random labelings, which pulls its score toward 0 (heart disease, hierarchical: ARI 0.010 versus FMI 0.353). NMI is normalized by entropy but not adjusted for chance, and it tends to rise with the number of clusters. The problem with ARI and NMI is therefore mostly one of interpretation: ARI is near 0 for any chance-level result and can be negative, and NMI values are hard to compare across different cluster counts. FMI remains attractive for its simplicity and its direct pairwise interpretation, but a higher FMI does not by itself mean a better clustering; it is best reported alongside ARI or NMI.
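The gap between FMI and ARI can be reproduced on a degenerate clustering. The following is a minimal sketch in plain Python, assuming the standard pair-counting definitions of both indices; the helpers `pair_counts`, `fmi`, and `ari` and the twelve-sample labeling are illustrative and are not taken from the notebook, which uses the scikit-learn implementations.

```python
import math
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count sample pairs by whether they are grouped together in each labeling."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1          # together in both
        elif same_pred:
            fp += 1          # together only in the prediction
        elif same_true:
            fn += 1          # together only in the reference
        else:
            tn += 1          # separated in both
    return tp, fp, fn, tn

def fmi(true_labels, pred_labels):
    """Fowlkes-Mallows index: geometric mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(true_labels, pred_labels)
    denom = math.sqrt((tp + fp) * (tp + fn))
    return tp / denom if denom else 0.0

def ari(true_labels, pred_labels):
    """Adjusted Rand index in its pair-counting form."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return 2 * (tp * tn - fp * fn) / denom if denom else 1.0

# Hypothetical data: two balanced classes, and a "clustering" that lumps
# every sample into a single cluster.
true_labels = [0] * 6 + [1] * 6
one_cluster = [0] * 12

print("FMI:", round(fmi(true_labels, one_cluster), 3))
print("ARI:", round(ari(true_labels, one_cluster), 3))
```

The single-cluster assignment recovers every same-class pair (perfect pairwise recall), so FMI is well above 0.5 even though the clustering says nothing about the classes, while ARI is exactly 0. The same mechanism helps explain rows in the results where FMI is moderate but ARI is close to 0.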