# Clustering with mixed data

- This notebook briefly demonstrates four approaches to clustering mixed (continuous and categorical) data
- It complements a [post on my website](https://www.tomasbeuzen.com/post/clustering-mixed-data/)

## Imports

> Note: you may need to install the following non-standard packages:

```
pip install prince
pip install kmodes
```

In [6]:
import numpy as np
import pandas as pd
from prince import FAMD
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
from sklearn.preprocessing import StandardScaler
random_state = 1234
pd.options.plotting.backend = "plotly"

In [7]:
def plot_cluster(X, y, title="Cluster plot"):
    fig = X.plot.scatter(x='X1', y='X2', color=y)
    fig.update_layout(
        autosize=False, width=500, height=500,
        coloraxis=dict(showscale=False, colorscale='Portland'),
        font=dict(size=18),
        title=dict(text=title, x=0.5, y=0.95, xanchor='center'),
    )
    fig.update_traces(marker=dict(size=15))
    return fig

## Make data

- Below we'll make some synthetic data for clustering
- The data will have 50 observations, 3 features and 3 clusters
- We convert one of the features (`X3`) to a categorical with "LOW" and "HIGH" values so that we can demonstrate different approaches to clustering mixed data
- We standardise the two remaining continuous features (`X1` and `X2`) so that they are on the same scale for clustering
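
The two preparation steps can be written out by hand to make them explicit. This is a minimal sketch on made-up numbers (the array `raw` is hypothetical, not the notebook's data): it z-scores two continuous columns with plain NumPy and thresholds a third column at zero.

```python
import numpy as np

# Hypothetical raw data: 5 observations, 3 continuous features
raw = np.array([
    [2.0, 10.0, -1.5],
    [4.0, 20.0, 0.5],
    [6.0, 30.0, 2.0],
    [8.0, 40.0, -0.2],
    [10.0, 50.0, 1.1],
])

# Standardise the first two columns: subtract the mean and divide by the
# population standard deviation (ddof=0, the convention StandardScaler uses)
continuous = raw[:, :2]
scaled = (continuous - continuous.mean(axis=0)) / continuous.std(axis=0)

# Turn the third column into a two-level categorical
category = np.where(raw[:, 2] < 0, "LOW", "HIGH")

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
print(category)
```

After this step each continuous column contributes on a comparable scale to Euclidean distances, which is what k-means relies on.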

In [8]:
X, y = make_blobs(n_samples=50, centers=3, n_features=3, random_state=random_state)
X = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
X['X3'] = np.where(X['X3'] < 0, 'LOW', 'HIGH')
con_feats = ['X1', 'X2'] 
cat_feats = ['X3']
scale = StandardScaler()
X[con_feats] = scale.fit_transform(X[con_feats])
X.head()

|   | X1 | X2 | X3 |
|---|---|---|---|
| 0 | -0.495194 | 0.963114 | HIGH |
| 1 | -0.548021 | -1.762852 | LOW |
| 2 | 1.101047 | 0.935499 | LOW |
| 3 | -0.69472 | -1.779252 | LOW |
| 4 | 1.261093 | 0.964404 | LOW |


- Below we plot our synthetic data (using our two continuous features as the x and y axes)
- There are 3 quite distinct blobs shown in blue, red, and yellow
- However, some blue and red points are mixed together, and it will be interesting to see how well each approach captures this

In [9]:
plot_cluster(X, y, "True Data")

## 1. Cluster based on continuous data only

- First we'll ignore the categorical feature (which standard algorithms like k-means and DBSCAN don't handle) and cluster on the continuous features only
- The results are not too bad: we pick up the 3 main clusters, but we do not identify the mixed blue/red data we saw earlier
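
Because the data are synthetic, the visual comparison can be backed up with a number. One option (not used in the original notebook) is the adjusted Rand index from `sklearn.metrics`, which compares two partitions regardless of how the clusters are labelled. The sketch below regenerates similar blobs rather than reusing the notebook's variables, and sets `n_init` explicitly as an assumption to keep behaviour stable across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Regenerate blobs like the notebook's and keep only the two continuous features
features, true_labels = make_blobs(
    n_samples=50, centers=3, n_features=3, random_state=1234
)
continuous = StandardScaler().fit_transform(features[:, :2])

# Cluster on the continuous features only
predicted = KMeans(n_clusters=3, n_init=10, random_state=1234).fit_predict(continuous)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
score = adjusted_rand_score(true_labels, predicted)
print(f"Adjusted Rand index: {score:.3f}")

# The index ignores label names: a relabelled copy of the truth still scores 1.0
relabelled = (true_labels + 1) % 3
print(adjusted_rand_score(true_labels, relabelled))
```

The same score can be computed for each of the approaches in this notebook to compare them on equal terms.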

In [10]:
model = KMeans(n_clusters=3, random_state=random_state).fit(X[con_feats])
pred = model.labels_
plot_cluster(X, pred, "Continuous Only")





## 2. Encode categorical data

- Next we'll encode the categorical data using one-hot encoding (OHE). You may also want to try scaling the data after OHE, but I didn't do that here for succinctness
- The results are better than before: we get our 3 blobs, and we also identify some of the blue/red mixed data
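
To see why scaling after OHE can matter, it helps to look at what the dummy columns do to Euclidean distance. The sketch below uses a small hypothetical frame (not the notebook's data): a category mismatch flips both dummy columns and so adds a fixed distance of √2, whatever the scale of the continuous features.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed data: two continuous features and one categorical
frame = pd.DataFrame({
    "X1": [0.0, 0.0, 1.0],
    "X2": [0.0, 0.0, 1.0],
    "X3": ["LOW", "HIGH", "LOW"],
})

# One-hot encode, then cast to float so dummies behave like any other feature
encoded = pd.get_dummies(frame).astype(float)
print(encoded.columns.tolist())

points = encoded.to_numpy()

# Rows 0 and 1 differ only in category: both dummy columns flip
category_gap = np.linalg.norm(points[0] - points[1])

# Rows 0 and 2 share a category but differ by 1 in each continuous feature
continuous_gap = np.linalg.norm(points[0] - points[2])

print(category_gap, continuous_gap)
```

In this example a category mismatch weighs exactly as much as a one-unit shift in both continuous features. Rescaling the dummy columns changes that trade-off, which is why it is worth experimenting with.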

In [11]:
model = KMeans(n_clusters=3, random_state=random_state).fit(pd.get_dummies(X))
pred = model.labels_
plot_cluster(X, pred, "Encoded Categorical Data")





## 3. The k-prototypes algorithm

- K-prototypes can work directly with the categorical data, without the need for encoding
- I defer to the [KPrototypes documentation](https://github.com/nicodv/kmodes) for an explanation of how the algorithm works
- The results are similar to the above: we get our 3 blobs, and we also identify some of the blue/red mixed data
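
As a rough intuition for how k-prototypes avoids encoding, the dissimilarity described by Huang combines squared Euclidean distance on the numeric features with a count of categorical mismatches, weighted by a factor usually called gamma. The function below is an illustrative sketch of that idea, not the `kmodes` implementation; `mixed_dissimilarity`, the example points, and the gamma values are all hypothetical.

```python
import numpy as np

def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma):
    """Squared Euclidean distance plus gamma times the number of category mismatches."""
    numeric_part = float(np.sum((np.asarray(num_a) - np.asarray(num_b)) ** 2))
    categorical_part = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Hypothetical observation and two candidate prototypes
point_num, point_cat = [0.0, 0.0], ["LOW"]
proto_1 = ([0.5, 0.5], ["HIGH"])  # closer numerically, different category
proto_2 = ([1.0, 0.0], ["LOW"])   # further numerically, same category

for gamma in (0.1, 1.0):
    d1 = mixed_dissimilarity(point_num, proto_1[0], point_cat, proto_1[1], gamma)
    d2 = mixed_dissimilarity(point_num, proto_2[0], point_cat, proto_2[1], gamma)
    print(f"gamma={gamma}: to proto_1={d1:.2f}, to proto_2={d2:.2f}")
```

With a small gamma the point is assigned to the numerically closer prototype; with a larger gamma the matching category wins. `KPrototypes` accepts a `gamma` argument for tuning this balance.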

In [12]:
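# Pass random_state so the result is reproducible, as with the KMeans models
pred = KPrototypes(n_clusters=3, random_state=random_state).fit_predict(X, categorical=[2])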
plot_cluster(X, pred.astype(float), "k-prototypes")

## 4. FAMD followed by clustering

- Our final approach is to use FAMD (factor analysis for mixed data) to convert our mixed continuous and categorical data into derived continuous components (I chose 3 components here)
- I defer to the [Prince documentation](https://github.com/kormilitzin/Prince) for an explanation of how the FAMD algorithm works
- The results are interesting here: we do get our 3 blobs, but the bottom-left blob is not very uniform. However, we perfectly identify the mixed labels around (`X1=-1`, `X2=0`), which none of the previous approaches managed to do.
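
For intuition, FAMD is commonly described as PCA applied to z-scored continuous columns together with one-hot columns that are divided by the square root of each category's proportion and then centred. The sketch below implements that textbook description with NumPy. It is not Prince's implementation, so signs and scaling may differ from `famd.row_coordinates(X)`; the function name `famd_like_coordinates` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def famd_like_coordinates(frame, continuous_cols, categorical_cols, n_components):
    """PCA on z-scored continuous columns plus proportion-weighted dummy columns."""
    # Continuous block: z-score each column
    cont = frame[continuous_cols].to_numpy(dtype=float)
    cont = (cont - cont.mean(axis=0)) / cont.std(axis=0)

    # Categorical block: one-hot encode, weight by 1/sqrt(proportion), then centre
    dummies = pd.get_dummies(frame[categorical_cols]).to_numpy(dtype=float)
    proportions = dummies.mean(axis=0)
    cat = dummies / np.sqrt(proportions)
    cat = cat - cat.mean(axis=0)

    # PCA via SVD of the combined, centred matrix; row coordinates are U * S
    combined = np.hstack([cont, cat])
    u, s, _ = np.linalg.svd(combined, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical mixed data with the same column layout as the notebook
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "X1": rng.normal(size=20),
    "X2": rng.normal(size=20),
    "X3": np.where(rng.normal(size=20) < 0, "LOW", "HIGH"),
})
coords = famd_like_coordinates(demo, ["X1", "X2"], ["X3"], n_components=3)
print(coords.shape)
```

The resulting coordinates are purely continuous, so any standard algorithm such as k-means can be applied to them.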

In [13]:
famd = FAMD(n_components=3).fit(X)
famd.row_coordinates(X).head()

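| Row | Component 0 | Component 1 | Component 2 |
|---|---|---|---|
| 0 | -0.070665 | 1.363094 | 0.016198 |
| 1 | 1.403583 | -1.341015 | 0.040926 |
| 2 | -1.533976 | -0.188276 | 0.246985 |
| 3 | 1.524438 | -1.293881 | 0.111368 |
| 4 | -1.672744 | -0.232328 | 0.174506 |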


In [14]:
model = KMeans(n_clusters=3, random_state=random_state).fit(famd.row_coordinates(X))
pred = model.labels_
plot_cluster(X, pred, "FAMD + Clustering")



