# Clustering with the Fuzzy C-means Algorithm

#### Garrett McCue

The goal of this assignment is to apply the fuzzy c-means clustering algorithm to the dataset used for [VAT/iVAT analysis](https://nbviewer.org/github/mcqueg/Unsupervised_ML/blob/main/VAT.ipynb) and compare the clustering results with the results from VAT/iVAT.


## Fuzzy C-means

![](https://miro.medium.com/max/700/1*O5Ynz1UI6ClCs-Bdf-MG9A.png)

Fuzzy C-means (FCM) clustering is a form of "soft clustering". Soft clustering allows a data point to belong to more than one cluster, with a degree of membership based on its similarity to each cluster as defined by a distance measure. For example, a data point could have a membership of 0.25 in cluster A [A = 0.25] and a membership of 0.75 in cluster B [B = 0.75]; when a single assignment is needed, it can be decided by a membership threshold. If the threshold were set at 0.7, this data point would be grouped within cluster B.

The algorithm works by assigning each observation a membership value for every potential cluster. Membership values are based on the distance from the observation to the centroid of each cluster: the closer a data point is to a centroid, the greater its membership value for that cluster. The membership values of a single observation sum to 1 across all clusters.

The goal of the algorithm is to minimize its objective function, which is the membership-weighted sum of the distances between data points and cluster centroids. Minimizing the objective function groups each point most strongly with its closest centroid. The algorithm is iterative, updating the membership matrix and the cluster centroids on each iteration.
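
To make the membership idea concrete, here is a minimal sketch that computes the memberships of a single point from its distances to two centroids, using the usual FCM membership formula with fuzziness 2. The point, the centroids, and the 0.7 threshold are hypothetical values chosen for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical 2D point and two hypothetical cluster centroids (illustrative only)
point = np.array([3.0, 0.0])
centroids = np.array([[0.0, 0.0],   # cluster A
                      [4.0, 0.0]])  # cluster B
q = 2.0  # fuzziness

# Euclidean distance from the point to each centroid
dists = np.linalg.norm(centroids - point, axis=1)

# Membership: inverse distance raised to 2/(q-1), normalized so the values sum to 1
inv = dists ** (-2.0 / (q - 1.0))
memberships = inv / inv.sum()

# Optional hard assignment using a membership threshold
threshold = 0.7
assigned = [name for name, mu in zip("AB", memberships) if mu >= threshold]
print(memberships, assigned)
```

The point is three times closer to centroid B than to centroid A, so B receives most of the membership and is the only cluster that clears the threshold.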

## FCM Algorithm

$X = \{x_1, x_2, ... , x_N\}$ : the set of data points  
$V = \{v_1, v_2, ... , v_c\}$ : the set of cluster centers

$N$ : number of data points  
$q$ : fuzziness ($q > 1$)  
$c$ : number of cluster centers  
$d_{ij}$ : Euclidean distance between the $i^\text{th}$ data point and the $j^\text{th}$ cluster center  
$\mu_{ij}$ : the membership of the $i^\text{th}$ data point in the $j^\text{th}$ cluster  
$\cup$ : membership matrix of all $\mu_{ij}$ membership values, with shape $N \times c$  
$v_{j}$ : the $j^\text{th}$ cluster center  
$\beta$ : termination criterion

The goal is to minimize:

$$J(\cup,V) = \sum_{i=1}^{N}\sum_{j=1}^{c}\mu_{ij}^{q}\,d(\vec{x_i} , \vec{v_j})^2 $$

where $d(\vec{x_i} , \vec{v_j})$ is the Euclidean distance between the $i^\text{th}$ data point and the $j^\text{th}$ cluster center.
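
As a rough illustration of the objective, the sketch below evaluates it with NumPy for a fixed membership matrix and two candidate sets of centers. It assumes squared Euclidean distances, the usual FCM convention. The function name `fcm_objective` and the toy values are hypothetical.

```python
import numpy as np

def fcm_objective(X, V, U, q=2.0):
    """Sum over points i and clusters j of U[i, j]**q * ||X[i] - V[j]||**2."""
    # Pairwise squared Euclidean distances, shape (N, c)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** q) * d2).sum())

# Toy data: two tight groups on a line (illustrative values)
X = np.array([[0.0], [1.0], [9.0], [10.0]])
U = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])

good_V = np.array([[0.5], [9.5]])  # centers placed inside each group
poor_V = np.array([[4.0], [6.0]])  # centers placed between the groups

J_good = fcm_objective(X, good_V, U)
J_poor = fcm_objective(X, poor_V, U)
print(J_good, J_poor)
```

Centers that sit inside the groups they represent give a much smaller objective value than centers placed between the groups, which is the behavior the minimization exploits.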

1.  Randomly select $v_j$, cluster centers

2.  Generate $\cup$, the fuzzy membership matrix, by calculating $\mu_{ij}$ for all $N$ data points

    $$
    \mu_{ij}=\frac{1}{\displaystyle\sum_{k=1}^c\left(\frac{d(\vec{x_i} ,\vec{v_j})}{d(\vec{x_i} , \vec{v_k})
    }\right)^{\frac{2}{q - 1}}}, \;\forall\: i=1,2,...,N \; \text{and} \; j=1,2,...,c
    $$

3.  Compute new $v_j$, cluster centers, based on $\cup$
    $$v_j = \frac{\sum_{i=1}^N \mu_{ij}^q x_i}{\sum_{i=1}^N\mu_{ij}^{q}}, \; \forall\: j = 1,2,...,c$$

4.  Repeat steps (2) and (3) until the change between successive iterations (in $J$ or in $\cup$) falls below $\beta$, or until a set maximum number of iterations is reached.
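
The four steps above can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used by the `fcmeans` package: the function name `fuzzy_c_means`, its defaults, the choice of data points as initial centers, and the stopping rule (largest change in any membership value below `beta`) are all assumptions made for the example.

```python
import numpy as np

def fuzzy_c_means(X, c, q=2.0, beta=1e-5, max_iter=150, seed=0):
    """Minimal FCM sketch. Returns centers V, shape (c, d), and memberships U, shape (N, c)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick c distinct data points as the initial centers (an assumption)
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    U = np.zeros((len(X), c))
    for _ in range(max_iter):
        # Step 2: memberships from distances (small floor avoids division by zero)
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        inv = d ** (-2.0 / (q - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 3: centers as membership-weighted means of the data
        W = U_new ** q
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 4: stop once no membership changes by more than beta
        converged = np.abs(U_new - U).max() < beta
        U = U_new
        if converged:
            break
    return V, U

# Usage on synthetic data: two hypothetical Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
V, U = fuzzy_c_means(X, c=2)
labels = U.argmax(axis=1)  # hard labels, if needed, from the largest membership
```

Each row of `U` holds one point's memberships across the clusters, so taking the row-wise maximum is one simple way to turn the soft result into hard labels for plotting.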


### Load Libraries


In [1]:
import pandas as pd
import numpy as np
from fcmeans import FCM
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook"


## Load and Process Data


In [2]:
# import data
hap_df = pd.read_csv("data/world_happiness_rankings_2022.csv")
ranking_df = hap_df[['RANK', 'Country']]
metrics_df = hap_df.drop(['RANK', 'Country'], axis=1)

hap_df.head()


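|   | RANK | Country | Happiness score | Whisker-high | Whisker-low | Dystopia (1.83) + residual | Explained by: GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption |
|---|------|---------|-----------------|--------------|-------------|----------------------------|------------------------------|------------------------------|---------------------------------------|--------------------------------------------|--------------------------|-----------------------------------------|
| 0 | 1 | Finland | 7.821 | 7.886 | 7.756 | 2.518 | 1.892 | 1.258 | 0.775 | 0.736 | 0.109 | 0.534 |
| 1 | 2 | Denmark | 7.636 | 7.71 | 7.563 | 2.226 | 1.953 | 1.243 | 0.777 | 0.719 | 0.188 | 0.532 |
| 2 | 3 | Iceland | 7.557 | 7.651 | 7.464 | 2.32 | 1.936 | 1.32 | 0.803 | 0.718 | 0.27 | 0.191 |
| 3 | 4 | Switzerland | 7.512 | 7.586 | 7.437 | 2.153 | 2.026 | 1.226 | 0.822 | 0.677 | 0.147 | 0.461 |
| 4 | 5 | Netherlands | 7.415 | 7.471 | 7.359 | 2.137 | 1.945 | 1.206 | 0.787 | 0.651 | 0.271 | 0.419 |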


In [3]:
# scale data
metrics_df = StandardScaler().fit_transform(metrics_df)

# apply 2D PCA to data
pca_2 = PCA(n_components=2)
pca_2_data = pca_2.fit_transform(metrics_df)
pca_2_df = pd.DataFrame(data=pca_2_data, columns=['PC1', 'PC2'])
pca_2_ranking_df = pd.concat([ranking_df, pca_2_df], axis=1)

# apply 3D PCA to data
pca_3 = PCA(n_components=3)
pca_3_data = pca_3.fit_transform(metrics_df)
pca_3_df = pd.DataFrame(data=pca_3_data, columns=['PC1', 'PC2', 'PC3'])
pca_3_ranking_df = pd.concat([ranking_df, pca_3_df], axis=1)
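
Before clustering on the projections, it can help to check how much variance the first two or three components retain. The sketch below reproduces standard scaling and PCA with plain NumPy on synthetic data; the random matrix is a hypothetical stand-in and is not the happiness data. With scikit-learn, the same ratios are available from the fitted objects, for example `pca_2.explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 rows, 6 correlated columns (not the happiness data)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(100, 6))

# Standardize each column to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal axes, sorted by variance explained
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = S ** 2 / np.sum(S ** 2)

# Projection onto the first two principal components
scores_2d = Z @ Vt[:2].T
print(explained_ratio.cumsum()[:3])
```

If the cumulative ratio for two components is low, structure that separates clusters may live in the discarded components, which is worth keeping in mind when comparing the 2D and 3D results.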


## Applying and Visualizing FCM

When applying FCM we can specify the number of clusters (c) and the fuzziness (q), which controls how much the clusters are allowed to overlap.
Based on the cluster tendency of the dataset assessed with VAT/iVAT, there appear to be 3 to 5 clusters.

Parameters used:

- c = [5, 4, 3]
- q = [2, 2.5, 3]
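
To give a feel for what the fuzziness values do, this small sketch applies the FCM membership formula to one point at fixed distances from three centers, once for each q in the list above. The distances and the helper name `memberships` are hypothetical and chosen only for illustration.

```python
import numpy as np

def memberships(dists, q):
    """FCM memberships of one point given its distances to every cluster center."""
    inv = np.asarray(dists, dtype=float) ** (-2.0 / (q - 1.0))
    return inv / inv.sum()

# Hypothetical distances from one point to three centers (illustrative only)
dists = [1.0, 2.0, 4.0]
for q in [2, 2.5, 3]:
    print(q, np.round(memberships(dists, q), 3))
```

As q increases, the membership in the nearest cluster shrinks and the values move toward an even split, so larger q produces softer, more overlapping clusters.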


In [4]:
def compare_fcm(data, dims, c_list, q_list, title):
    '''
    Plot FCM clusterings of the data for every combination of c and q.

    Parameters:
     data : np.ndarray of shape (n_samples, n_dims), data to apply the FCM algorithm to
     dims : str, dimensionality of the data, either '2D' or '3D'
     c_list : list of cluster counts to consider
     q_list : list of fuzziness values to consider
     title : title to be applied to the figure

    Returns:
     fig : plotly figure with one scatter subplot per (c, q) combination
    '''
    r = len(c_list)  # number of rows to include in subplots
    c = len(q_list)  # number of columns to include in subplots
    # set specs based on dims
    if dims == '3D':
        specs = {'type': 'scene'}
    elif dims == '2D':
        specs = {'secondary_y': True}
    else:
        raise ValueError("dims must be '2D' or '3D'")
    specs_c = [specs for _ in range(c)]  # one spec per subplot column
    # create titles for each subplot that specify the c and q values for that plot
    subplot_titles = []
    for i in c_list:
        for j in q_list:
            temp_title = "c:{} q:{}".format(i, j)
            subplot_titles.append(temp_title)

    # create subplots based on c_list and q_list size with specs for each plot
    fig = make_subplots(specs=[specs_c for x in range(1, r+1)],
                        rows=r, cols=c,
                        subplot_titles=subplot_titles,
                        y_title="Cluster Count",
                        x_title="Fuzzy Values",
                        vertical_spacing=0.05,
                        horizontal_spacing=0.01)
    # loop through c_list & q_list applying FCM with the specified c and q parameters
    for i in range(1, len(c_list)+1):
        for j in range(1, len(q_list)+1):
            fcm = FCM(n_clusters=c_list[i-1], m=q_list[j-1])
            fcm.fit(data)
            # compute centers and labels for plotting the scatter
            centers = fcm.centers
            labels = fcm.predict(data)

            if dims == '3D':
                # plot the scatter at row i, column j
                fig.add_trace(
                    go.Scatter3d(x=data[:, 0], y=data[:, 1], z=data[:, 2],
                                 mode='markers', marker=dict(color=labels, size=4, opacity=0.65)),
                    row=i, col=j
                )
            elif dims == '2D':
                # plot the scatter at row i, column j
                fig.add_trace(
                    go.Scatter(x=data[:, 0], y=data[:, 1],
                               mode='markers', marker=dict(color=labels, opacity=0.65, size=7)),
                    row=i, col=j
                )
                # plot the centroids for the subplot at row i, column j
                fig.add_trace(
                    go.Scatter(x=centers[:, 0], y=centers[:, 1],
                               mode='markers', marker_color='rgb(0,0,0)',
                               marker_size=12, marker_symbol='x'),
                    row=i, col=j
                )

    # update figure aesthetics
    fig.update_annotations(font_size=17)
    fig.update_xaxes(showticklabels=False)
    fig.update_yaxes(showticklabels=False)
    fig.update_layout(showlegend=False,
                              title=title,
                              title_font_size=20,
                              title_x=0.5,
                              title_y=.99,
                              margin_l=60,
                              margin_r=10,
                              margin_b=55,
                              margin_t=60,
                              height=1000,
                              width=1100)

    # return fig with all combinations (9 scatters)
    return fig


### Applying FCM to the 2D PCA projected data


In [9]:
c_list = [5, 4, 3]
q_list = [2, 2.5, 3]

X_pca2 = pca_2_df.to_numpy()
fig_2d = compare_fcm(X_pca2, dims='2D', c_list=c_list, q_list=q_list,
                     title='2D PCA: Fuzzy C-means Clustering')
fig_2d.show()


### Applying FCM to the 3D PCA projected data


In [8]:
X_pca3 = pca_3_df.to_numpy()
fig_3d = compare_fcm(X_pca3, dims='3D', c_list=c_list, q_list=q_list,
                     title='3D PCA: Fuzzy C-means Clustering')
fig_3d.show()


## Clustering Analysis and comparison with VAT/iVAT

Fuzzy C-means was applied to the 2D dataset with 3, 4, and 5 clusters and fuzziness values of 2, 2.5, and 3. As the number of clusters increased, the centroids began to converge towards the center of the scatter. The clusterings with 5 groups had two relatively close clusters crowded in the middle, while the clustering with 3 groups split the scatter into three columns of equal size. The best clustering appeared to be the one with 4 clusters, which is consistent with the assumptions drawn from the VAT/iVAT images shown below.

The fuzziness values did not change the general group representations much, but a few data points in the center of the scatter plot are passed between clusters. The density of points in this region may be why some points swap cluster memberships, and it illustrates how sensitive the clustering is to the distance measure used. To better estimate the groupings, a different distance measure could be implemented, such as the Mahalanobis distance or the Gustafson-Kessel (GK) distance.

The 3D clustering results are interesting because they lead to the same conclusion that 5 clusters is too many, but the cluster boundaries with 4 clusters also appear unclear. The clustering with 3 clusters has more clearly defined boundaries than the other two, although the right choice depends on the question being asked. If a more generalized clustering is desired, then 3 clusters would be a better choice than the others. The 3D VAT/iVAT images are hard to compare with the 3D fuzzy c-means clustering, because the VAT/iVAT images are 2D renderings of a 3D dataset. The loss of a dimension can compress the data onto itself, hiding the class separations along the z-axis.

Further exploration of cluster differences using a more robust distance measure could improve the clustering further and lead to more accurate group representations within each cluster.
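
To illustrate why the choice of distance measure matters, the sketch below compares the Euclidean and Mahalanobis distances for two points that are equally far from the center of an elongated cluster. The cluster is synthetic and the helper name `mahalanobis` is hypothetical; this is not the Gustafson-Kessel algorithm itself, only the distance it builds on.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical elongated cluster: wide along x, narrow along y
cluster = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])
center = cluster.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(cluster, rowvar=False))

def mahalanobis(x, center, cov_inv):
    """Distance from x to center, scaled by the cluster's covariance."""
    diff = x - center
    return float(np.sqrt(diff @ cov_inv @ diff))

a = center + np.array([2.0, 0.0])  # offset along the cluster's long axis
b = center + np.array([0.0, 2.0])  # offset across the cluster's narrow axis

print("Euclidean:  ", np.linalg.norm(a - center), np.linalg.norm(b - center))
print("Mahalanobis:", mahalanobis(a, center, cov_inv), mahalanobis(b, center, cov_inv))
```

Both points are the same Euclidean distance from the center, but the Mahalanobis distance treats the point along the long axis as much closer, which is how a covariance-aware measure can follow non-spherical clusters.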



<table><tr>
<td> <img src="/Users/garrettmccue/lewisU/DATA55100/Unsupervised_ML_Assignments/figures/vat/vat_2d.png" alt="Image" style="width: 450px;"/> </td>
<td> <img src="/Users/garrettmccue/lewisU/DATA55100/Unsupervised_ML_Assignments/figures/vat/i_vat_2d.png" alt="Image" style="width: 450px;"/> </td>
</tr></table>

<table><tr>
<td> <img src="/Users/garrettmccue/lewisU/DATA55100/Unsupervised_ML_Assignments/figures/vat/vat_3d.png" alt="Image" style="width: 450px;"/> </td>
<td> <img src="/Users/garrettmccue/lewisU/DATA55100/Unsupervised_ML_Assignments/figures/vat/i_vat_3d.png" alt="Image" style="width: 450px;"/> </td>
</tr></table>




1. [Fuzzy C-Means Clustering with Python](https://towardsdatascience.com/fuzzy-c-means-clustering-with-python-f4908c714081)
