# Clustering with the Fuzzy C-means Algorithm

#### Garrett McCue

The goal of this assignment is to apply the fuzzy c-means clustering algorithm to the dataset used for [VAT/iVAT analysis](https://nbviewer.org/github/mcqueg/Unsupervised_ML/blob/main/VAT.ipynb) and compare the clustering results with the results from VAT/iVAT.


## Fuzzy C-means

![](https://miro.medium.com/max/700/1*O5Ynz1UI6ClCs-Bdf-MG9A.png)

Fuzzy C-means (FCM) clustering is a form of "soft clustering". Soft clustering allows a data point to belong to more than one cluster, with the degree of belonging determined by its similarity to each cluster, which is defined by a distance measure. For example, a data point could have a membership of 0.25 in cluster A and 0.75 in cluster B; which cluster it is finally assigned to can then be decided by a similarity threshold. If the threshold were set at 0.7, this data point would be grouped within cluster B.

The algorithm works by assigning each observation a membership value for every potential cluster. Membership values are based on the distance from the observation to the centroid of a cluster: the closer a data point is to a centroid, the greater its membership value for that cluster. The membership values of a single observation sum to 1 across all clusters. The goal of the algorithm is to minimize its objective function, a membership-weighted sum of the distances between the data points and the cluster centroids. Minimizing this function groups points with the centroids they are closest to. On each iteration, the algorithm updates the membership matrix and then the cluster centroids.
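To make the thresholding example concrete, here is a minimal sketch. The membership values, the cluster names, and the `"unassigned"` fallback for points below the threshold are illustrative assumptions, not output from the dataset used in this notebook.

```python
import numpy as np

# Hypothetical membership values for three observations over clusters A and B.
memberships = np.array([
    [0.25, 0.75],  # the example from the text
    [0.90, 0.10],
    [0.55, 0.45],  # no cluster reaches the threshold
])
cluster_names = ["A", "B"]
threshold = 0.7

# Each row sums to 1, as required of fuzzy memberships.
assert np.allclose(memberships.sum(axis=1), 1.0)

# Assign the highest-membership cluster only when it clears the threshold.
assigned = [
    cluster_names[row.argmax()] if row.max() >= threshold else "unassigned"
    for row in memberships
]
print(assigned)
```

How to treat points that stay below the threshold is a design choice; simply taking the highest membership for every point is another common option.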

## FCM Algorithm

$ X = \{x_1, x_2, ... , x_N\} $ : the set of data points  
$ V = \{v_1, v_2, ... , v_c\}$ : the set of cluster centers

$N$ : number of data points  
$q$ : fuzziness exponent, with $q > 1$  
$c$ : number of cluster centers  
$d_{ij}$ : Euclidean distance between the $i^\text{th}$ data point and the $j^\text{th}$ cluster center  
$\mu_{ij}$ : the membership of the $i^\text{th}$ data point to the $j^\text{th}$ cluster center  
$\cup$ : membership matrix of all $\mu_{ij}$ membership values, with shape $N \times c$  
$v_{j}$ : the $j^\text{th}$ cluster center  
$\beta$ : termination criterion

The goal is to minimize the objective function

$$J(\cup,V) = \sum_{i=1}^{N}\sum_{j=1}^{c}\mu_{ij}^{q}\,d(\vec{x_i} , \vec{v_j})^{2} $$

where $d(\vec{x_i} , \vec{v_j})$ is the Euclidean distance between the $i^\text{th}$ data point and the $j^\text{th}$ cluster center.
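The objective can be evaluated directly with NumPy. The sketch below assumes the usual squared-Euclidean form of the FCM objective; the function name `fcm_objective` and the toy arrays are hypothetical and are not part of the `fcmeans` package.

```python
import numpy as np

def fcm_objective(X, V, U, q=2.0):
    """Sum over points i and clusters j of U[i, j]**q * ||X[i] - V[j]||**2."""
    # Pairwise squared Euclidean distances, shape (N, c).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** q) * d2).sum())

# Toy data: three points on a line, two centers, hand-picked memberships.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
V = np.array([[0.0, 0.0], [4.0, 0.0]])
U = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

print(fcm_objective(X, V, U, q=2.0))  # only the middle point contributes: 0.25*1 + 0.25*9
```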

1.  Randomly select $v_j$, cluster centers

2.  Generate $\cup$, the fuzzy membership matrix, by calculating $\mu_{ij}$ for all $N$ data points

    $$
    \mu_{ij}=\frac{1}{\displaystyle\sum_{k=1}^c\left(\frac{d(\vec{x_i} ,\vec{v_j})}{d(\vec{x_i} , \vec{v_k})
    }\right)^{\frac{2}{q - 1}}}, \;\forall\: i=1,2,...,N \; \text{and} \; j=1,2,...,c
    $$

3.  Compute new $v_j$, cluster centers, based on $\cup$
    $$v_j = \frac{\sum_{i=1}^N \mu_{ij}^q x_i}{\sum_{i=1}^N\mu_{ij}^{q}}, \; \forall\: j = 1,2,...,c$$

4.  Repeat steps (2) and (3) until the change between successive iterations (for example, in $J$ or in $\cup$) falls below $\beta$, or until a set maximum number of iterations is reached.
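The four steps can be sketched in plain NumPy as follows. This is an illustrative implementation under stated assumptions, not the code used by the `fcmeans` package: the function names, the optional `init_centers` argument, and the stopping rule (largest change in any membership value below `beta`) are choices made for this example.

```python
import numpy as np

def update_memberships(X, V, q):
    """Step 2: membership of every point in every cluster, shape (N, c)."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # guard against division by zero when a point sits on a center
    # ratio[i, j, k] = d_ij / d_ik
    ratio = d[:, :, None] / d[:, None, :]
    return 1.0 / (ratio ** (2.0 / (q - 1.0))).sum(axis=2)

def update_centers(X, U, q):
    """Step 3: each center is the mean of the points weighted by membership**q."""
    W = U ** q
    return (W.T @ X) / W.sum(axis=0)[:, None]

def fuzzy_c_means(X, c, q=2.0, beta=1e-5, max_iter=150, init_centers=None, seed=0):
    # Step 1: pick c data points at random as initial centers unless some are given.
    if init_centers is None:
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=c, replace=False)]
    else:
        V = np.asarray(init_centers, dtype=float)
    U = update_memberships(X, V, q)
    # Step 4: alternate the two updates until the memberships stop changing.
    for _ in range(max_iter):
        V = update_centers(X, U, q)
        U_new = update_memberships(X, V, q)
        converged = np.abs(U_new - U).max() < beta
        U = U_new
        if converged:
            break
    return V, U
```

Calling `fuzzy_c_means(X, c=3, q=2.0)` on an `(N, d)` array returns the centers with shape `(c, d)` and the membership matrix with shape `(N, c)`. Like other alternating schemes, the result can depend on the initial centers.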


### Load Libraries


In [None]:
import pandas as pd
import numpy as np
from fcmeans import FCM
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook"


## Load and Process Data


In [None]:
# import data
hap_df = pd.read_csv("data/world_happiness_rankings_2022.csv")
ranking_df = hap_df[['RANK', 'Country']]
metrics_df = hap_df.drop(['RANK', 'Country'], axis=1)

hap_df.head()


In [None]:
# scale data
metrics_df = StandardScaler().fit_transform(metrics_df)

# apply 2D PCA to data
pca_2 = PCA(n_components=2)
pca_2_data = pca_2.fit_transform(metrics_df)
pca_2_df = pd.DataFrame(data=pca_2_data, columns=['PC1', 'PC2'])
pca_2_ranking_df = pd.concat([ranking_df, pca_2_df], axis=1)

# apply 3D PCA to data
pca_3 = PCA(n_components=3)
pca_3_data = pca_3.fit_transform(metrics_df)
pca_3_df = pd.DataFrame(data=pca_3_data, columns=['PC1', 'PC2', 'PC3'])
pca_3_ranking_df = pd.concat([ranking_df, pca_3_df], axis=1)


## Applying FCM

When applying FCM we specify the number of clusters ($c$) and the fuzziness exponent ($q$), which controls how strongly the clusters are allowed to overlap.
Based on the cluster tendency of the dataset found with VAT/iVAT, there could be 3 to 5 clusters.

Parameters to use:

- c = [3, 4, 5]
- q = [2, 3, 4]
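To see what the fuzziness values do, the sketch below computes the memberships of a single point for each $q$, using the algebraically equivalent form $\mu_{ij} = d_{ij}^{-2/(q-1)} / \sum_k d_{ik}^{-2/(q-1)}$. The helper name `memberships_for_point` and the distances are made up for illustration.

```python
import numpy as np

def memberships_for_point(distances, q):
    """Memberships of one point given its distances to each cluster center."""
    d = np.asarray(distances, dtype=float)
    inv = d ** (-2.0 / (q - 1.0))
    return inv / inv.sum()

# A point twice as far from the second center as from the first.
distances = [1.0, 2.0]
for q in [2, 3, 4]:
    print(q, memberships_for_point(distances, q).round(3))
```

As $q$ grows, the memberships move toward an even split, so larger values produce fuzzier clusters, while values close to 1 approach a hard assignment.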


In [None]:
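# TODO: run FCM for 3 values each of c and q (10-12 results)

# parameter grids: number of clusters (c) and fuzziness exponent (q)
c = [3, 4, 5]
q = [2, 3, 4]

# fit a single model first: 2D PCA data, 3 clusters, q = 2
X_pca2 = pca_2_df.to_numpy()
fcm_2d_3c_2q = FCM(n_clusters=3, m=2)
fcm_2d_3c_2q.fit(X_pca2)

# cluster centers and hard labels (highest-membership cluster per point)
cens_2d_3c_2q = fcm_2d_3c_2q.centers
labels_2d_3c_2q = fcm_2d_3c_2q.predict(X_pca2)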

# plot result
fig = make_subplots(specs=[[{'secondary_y': True}]])

fig.add_trace(
    go.Scatter(x=X_pca2[:, 0], y=X_pca2[:, 1],
               mode='markers', marker=dict(color=labels_2d_3c_2q))
)
# overlay the cluster centers as black crosses
fig.add_trace(
    go.Scatter(x=cens_2d_3c_2q[:, 0], y=cens_2d_3c_2q[:, 1], mode='markers',
               marker_color='rgb(0,0,0)', marker_size=20, marker_symbol='x')
)
fig.update_layout(showlegend=False)
#fig.show()


In [None]:
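def compare_fcm(data, c_list, q_list, title):
    """Fit FCM for every (c, q) pair and plot each result in its own subplot."""
    # one row per cluster count, one column per fuzziness value
    titles = [f"c={n_c}, q={fuzz}" for n_c in c_list for fuzz in q_list]
    fig = make_subplots(rows=len(c_list), cols=len(q_list), subplot_titles=titles)
    for i, n_c in enumerate(c_list, start=1):
        for j, fuzz in enumerate(q_list, start=1):
            fcm = FCM(n_clusters=n_c, m=fuzz)
            fcm.fit(data)
            centers = fcm.centers
            labels = fcm.predict(data)
            # data points colored by their highest-membership cluster
            fig.add_trace(go.Scatter(x=data[:, 0], y=data[:, 1], mode='markers',
                                     marker=dict(color=labels)), row=i, col=j)
            # cluster centers
            fig.add_trace(go.Scatter(x=centers[:, 0], y=centers[:, 1], mode='markers',
                                     marker_color='rgb(0,0,0)', marker_size=10,
                                     marker_symbol='x'), row=i, col=j)
    fig.update_layout(title_text=title, showlegend=False)
    # return -> fig with all combinations (9 scatters for 3 x 3 inputs)
    return fig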


c=3
r=2
specs={'secondary_y': True}
specs_c = [specs for x in range(c)]
fig_test = make_subplots(specs=[specs_c for x in range(r)], rows=2, cols=3)

fig_test.add_trace(
    go.Scatter(x=X_pca2[:, 0], y=X_pca2[:, 1], mode='markers', marker=dict(color=labels_2d_3c_2q)),
               row=1, col=1
)

fig_test.add_trace(
    go.Scatter(x=cens_2d_3c_2q[:, 0], y=cens_2d_3c_2q[:, 1], mode='markers',
               marker_color='rgb(20,170,73)', marker_size=10, marker_symbol='x'),
               row=1, col=1
)
fig_test.update_layout(showlegend=False)

fig_test.show()


## FCM Results


In [None]:

# TODO: plot function for all combinations of C & q


## Clustering Analysis and comparison with VAT/iVAT


1. [Fuzzy C-Means Clustering with Python](https://towardsdatascience.com/fuzzy-c-means-clustering-with-python-f4908c714081)
