# Cluster Validity

#### Garrett McCue


1. Implement three cluster validity indices (one must be the CS index).
2. Taking into consideration the results you obtained after implementing VAT, iVAT (as a pre-clustering technique) and FCM (as a clustering technique), compare the results you will get after applying the validity indices with the previous results of VAT (iVAT) and FCM. What do you think the best choice for the number of clusters (C) will be?

In [5]:
import numpy as np
import pandas as pd
from fcmeans import FCM
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [7]:
# import data
hap_df = pd.read_csv("data/world_happiness_rankings_2022.csv")
ranking_df = hap_df[['RANK', 'Country']]
metrics_df = hap_df.drop(['RANK', 'Country'], axis=1)

hap_df.head()

| Unnamed: 0 | RANK | Country | Happiness score | Whisker-high | Whisker-low | Dystopia (1.83) + residual | Explained by: GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finland | 7.821 | 7.886 | 7.756 | 2.518 | 1.892 | 1.258 | 0.775 | 0.736 | 0.109 | 0.534 |
| 1 | 2 | Denmark | 7.636 | 7.71 | 7.563 | 2.226 | 1.953 | 1.243 | 0.777 | 0.719 | 0.188 | 0.532 |
| 2 | 3 | Iceland | 7.557 | 7.651 | 7.464 | 2.32 | 1.936 | 1.32 | 0.803 | 0.718 | 0.27 | 0.191 |
| 3 | 4 | Switzerland | 7.512 | 7.586 | 7.437 | 2.153 | 2.026 | 1.226 | 0.822 | 0.677 | 0.147 | 0.461 |
| 4 | 5 | Netherlands | 7.415 | 7.471 | 7.359 | 2.137 | 1.945 | 1.206 | 0.787 | 0.651 | 0.271 | 0.419 |


In [13]:
# scale data
metrics_df = StandardScaler().fit_transform(metrics_df)

# apply 2D PCA to data
pca_2 = PCA(n_components=2)
pca_2_data = pca_2.fit_transform(metrics_df)
pca_2_df = pd.DataFrame(data=pca_2_data, columns=['PC1', 'PC2'])
pca_2_ranking_df = pd.concat([ranking_df, pca_2_df], axis=1)

# apply 3D PCA to data
pca_3 = PCA(n_components=3)
pca_3_data = pca_3.fit_transform(metrics_df)
pca_3_df = pd.DataFrame(data=pca_3_data, columns=['PC1', 'PC2', 'PC3'])
pca_3_ranking_df = pd.concat([ranking_df, pca_3_df], axis=1)

X_2d = pca_2_df.to_numpy()
X_3d = pca_3_df.to_numpy()

### Classification entropy

Lower values are better: an index near 0 means the memberships are close to crisp, i.e. each point belongs almost entirely to one cluster.
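To make the "lower is better" reading concrete, here is a minimal sketch that evaluates the entropy on two hand-made membership matrices. The function name `classification_entropy`, its `base` argument, and the toy matrices are illustrative assumptions, not part of the original notebook; base 10 is chosen only to match the `ce` function used here.

```python
import numpy as np

def classification_entropy(u, base=10.0):
    """Entropy of a fuzzy membership matrix u with shape (n_samples, n_clusters)."""
    u = np.asarray(u, dtype=float)
    n = u.shape[0]
    # Replace zeros by 1 inside the log so that 0 * log(0) counts as 0, not NaN.
    safe_u = np.where(u > 0, u, 1.0)
    terms = u * np.log(safe_u)  # every term is <= 0
    return float(abs(terms.sum()) / (n * np.log(base)))

# Crisp partition: every point belongs fully to one cluster.
crisp = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
# Maximally fuzzy partition: equal membership in both clusters.
uniform = np.full((3, 2), 0.5)

print(classification_entropy(crisp))    # lower bound of the index
print(classification_entropy(uniform))  # upper bound for 2 clusters, log10(2)
```

The crisp matrix sits at the lower bound and the uniform matrix at the upper bound, so values between the two indicate how fuzzy a partition is.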

In [2]:
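def ce(u):
    # u: fuzzy membership matrix of shape (n_samples, n_clusters)
    n, c = u.shape
    # base-10 log is used here; a membership of exactly 0 would give nan (0 * -inf)
    return abs((np.log10(u) * u).sum() / n)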

In [21]:
print("2D Clustering: ")
for n_clusters in range(2, 6):
    fcm = FCM(n_clusters=n_clusters, m=2)
    fcm.fit(X_2d)
    u = fcm.u
    index = ce(u)

    print(f"classification entropy with {n_clusters} number of clusters is: {index}")

print("\n3D Clustering: ")
for n_clusters in range(2, 6):
    fcm = FCM(n_clusters=n_clusters, m=2)
    fcm.fit(X_3d)
    u = fcm.u
    index = ce(u)

    print(f"classification entropy with {n_clusters} number of clusters is: {index}")

2D Clustering: 
classification entropy with 2 number of clusters is: 0.15636960316532073
classification entropy with 3 number of clusters is: 0.26217203099401803
classification entropy with 4 number of clusters is: 0.31092072988828295
classification entropy with 5 number of clusters is: 0.36385658331854787

3D Clustering: 
classification entropy with 2 number of clusters is: 0.18630141922819146
classification entropy with 3 number of clusters is: 0.299758939293352
classification entropy with 4 number of clusters is: 0.37074073024041704
classification entropy with 5 number of clusters is: 0.4326485534979038


### Partition Coefficient

Larger values are better: the index ranges from 1/C (completely fuzzy memberships) to 1 (a crisp partition).
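The bounds of the partition coefficient can be checked on two hand-made membership matrices. This is only a sketch: the name `partition_coefficient` and the toy matrices are assumptions introduced for illustration.

```python
import numpy as np

def partition_coefficient(u):
    """Mean of the squared memberships; u has shape (n_samples, n_clusters)."""
    u = np.asarray(u, dtype=float)
    return float(np.square(u).sum() / u.shape[0])

# Crisp partition of 4 points into 4 clusters (identity matrix).
crisp = np.eye(4)
# Completely fuzzy partition: every membership equals 1/C with C = 4.
uniform = np.full((4, 4), 0.25)

print(partition_coefficient(crisp))    # upper bound of the index
print(partition_coefficient(uniform))  # lower bound, equal to 1/C
```

Any real FCM result should fall between these two extremes, which makes values from different C comparable only with the shrinking lower bound 1/C in mind.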

In [23]:
def pc(u):
    n, c = u.shape
    return np.square(u).sum() / n

In [24]:
print("2D Clustering: ")
for n_clusters in range(2, 6):
    fcm = FCM(n_clusters=n_clusters, m=2)
    fcm.fit(X_2d)
    u = fcm.u
    index = pc(u)

    print(f"partition coefficient with {n_clusters} number of clusters is: {index}")

print("\n3D Clustering: ")
for n_clusters in range(2, 6):
    fcm = FCM(n_clusters=n_clusters, m=2)
    fcm.fit(X_3d)
    u = fcm.u
    index = pc(u)

    print(f"partition coefficient with {n_clusters} number of clusters is: {index}")

2D Clustering: 
partition coefficient with 2 number of clusters is: 0.778128432086986
partition coefficient with 3 number of clusters is: 0.6568531183751598
partition coefficient with 4 number of clusters is: 0.6264371757621778
partition coefficient with 5 number of clusters is: 0.580008461955948

3D Clustering: 
partition coefficient with 2 number of clusters is: 0.7298308248311745
partition coefficient with 3 number of clusters is: 0.6048014776172559
partition coefficient with 4 number of clusters is: 0.5508124458325865
partition coefficient with 5 number of clusters is: 0.5014876229884749


### CS Index

N: the number of samples  
X: the data points  
C: the number of clusters  
ai: the number of data points in cluster i  
vi: the cluster center of cluster i
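With these symbols, the CS index is usually written as the ratio of average within-cluster scatter to average separation between cluster centers. The formula below is a reconstruction from the standard definition, not taken from this notebook; $A_i$ (the set of points assigned to cluster $i$) and $d(\cdot,\cdot)$ (the Euclidean distance) are notation introduced here for illustration.

$$
CS(C) = \frac{\dfrac{1}{C}\sum_{i=1}^{C}\left[\dfrac{1}{a_i}\sum_{x_j \in A_i}\max_{x_k \in A_i} d(x_j, x_k)\right]}{\dfrac{1}{C}\sum_{i=1}^{C}\left[\min_{j \neq i} d(v_i, v_j)\right]}
$$

The numerator shrinks when clusters are compact and the denominator grows when centers are far apart, so a smaller CS value indicates a better partition.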

In [66]:
def euclidean_dist(p, q):
    temp = p - q
    sum_sq = np.dot(temp.T, temp)
    return np.sqrt(sum_sq)


In [69]:
def cs(X, centers, labels, n_clusters):
    # between data points: for each cluster, average over its points the
    # distance to the farthest point in the same cluster
    max_sum = 0
    for c in range(n_clusters):  # fcm.predict returns 0-based labels
        X_c = X[labels == c]  # boolean mask keeps one row per point
        ai = len(X_c)
        if ai == 0:
            continue  # an empty cluster adds nothing to the scatter
        far_sum = 0
        for i in X_c:
            max_i = 0
            for j in X_c:
                dist = euclidean_dist(i, j)
                if dist > max_i:
                    max_i = dist
            far_sum += max_i
        max_sum += far_sum / ai
    between_points = max_sum / n_clusters

    # between clusters: distance from each center to its nearest other center
    min_sum = 0
    for i, vi in enumerate(centers):
        min_vi = np.inf
        for j, vj in enumerate(centers):
            if i == j:
                continue  # skip the zero distance of a center to itself
            dist = euclidean_dist(vi, vj)
            if dist < min_vi:
                min_vi = dist
        min_sum += min_vi

    between_clusters = min_sum / n_clusters

    return between_points / between_clusters

In [70]:
print("2D Clustering: ")
for n_clusters in range(2, 6):
    fcm = FCM(n_clusters=n_clusters, m=2)
    fcm.fit(X_2d)
    centers_2d = fcm.centers
    labels_2d = fcm.predict(X_2d)
    index = cs(X_2d, centers=centers_2d, labels=labels_2d, n_clusters=n_clusters)

    print(f"CS index with {n_clusters} number of clusters is: {index}")

print("\n3D Clustering: ")
for n_clusters in range(2, 6):
    fcm = FCM(n_clusters=n_clusters, m=2)
    fcm.fit(X_3d)
    centers_3d = fcm.centers
    labels_3d = fcm.predict(X_3d)
    index = cs(X_3d, centers=centers_3d, labels=labels_3d, n_clusters=n_clusters)

    print(f"CS index with {n_clusters} number of clusters is: {index}")
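As a cross-check, the CS index described in this section can also be sketched in vectorised NumPy and verified on a toy data set small enough to compute by hand. The function name `cs_index` and the toy points, centers, and labels are assumptions made for illustration; labels are taken to be 0-based hard assignments, and empty clusters are simply skipped.

```python
import numpy as np

def cs_index(X, centers, labels):
    """Ratio of within-cluster scatter to separation between centers (lower is better)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)

    scatter = 0.0
    for c in range(len(centers)):
        X_c = X[labels == c]  # boolean mask selects the points of cluster c
        if len(X_c) == 0:
            continue  # assumption: an empty cluster contributes no scatter
        # Pairwise Euclidean distances inside the cluster, shape (a_c, a_c).
        pairwise = np.linalg.norm(X_c[:, None, :] - X_c[None, :, :], axis=-1)
        # For each point take its farthest neighbour, then average over the cluster.
        scatter += pairwise.max(axis=1).mean()

    # Distances between centers; the diagonal is masked so a center
    # is never counted as its own nearest neighbour.
    center_dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(center_dist, np.inf)
    separation = center_dist.min(axis=1).sum()

    # The 1/C factors of numerator and denominator cancel.
    return scatter / separation

# Two tight clusters of two points each, 10 units apart.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers_toy = np.array([[0.0, 0.5], [10.0, 0.5]])
labels_toy = np.array([0, 0, 1, 1])

print(cs_index(X_toy, centers_toy, labels_toy))
```

In the toy case each point's farthest neighbour in its own cluster is 1 unit away and each center's nearest other center is 10 units away, so the ratio is (1 + 1) / (10 + 10) = 0.1.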