# K-Prototypes Tech Talk
# By Palermo Penano

The summary below closely follows the original paper, [Huang 1998](https://link.springer.com/article/10.1023/A:1009769707641).

## Outline
* Unsupervised machine learning and what it means to cluster data
    * How does regular K-means work?
    * Other clustering algos
* What are its advantages and limitations?
* Why is using a standard distance metric problematic?
* K-prototypes
* What libraries are available?
* Application to Colombia vulnerability index

## References

* Python implementation
    * https://pypi.org/project/kmodes/
    * https://github.com/nicodv/kmodes
* Other articles and notebooks using k-prototypes
    * https://www.kaggle.com/rohanadagouda/unsupervised-learning-using-k-prototype-and-dbscan
    * https://towardsdatascience.com/the-k-prototype-as-clustering-algorithm-for-mixed-data-type-categorical-and-numerical-fe7c50538ebb

* Source for this demo
    * Dataset
        * https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/
    * Analysis
        * https://towardsdatascience.com/clustering-datasets-having-both-numerical-and-categorical-variables-ed91cdca0677
        * https://www.kaggle.com/paulinan/bank-customer-segmentation

* [Unsupervised clustering with mixed categorical and continuous data](https://www.tomasbeuzen.com/post/clustering-mixed-data/)
* https://www.kaggle.com/ashydv/bank-customer-clustering-k-modes-clustering







## How does K-means work

### K-means algorithm
* Initialize with a random set of cluster vectors (these are m-dimensional vectors where m is the number of features in the dataset)
* In a loop:
    * Allocate each point in the data to the nearest cluster based on the Euclidean distance
    * For each cluster, update the cluster vector by calculating the mean of all the records assigned to that cluster (average of each component across all vectors belonging to the same cluster).
    * Repeat the first step by checking and reallocating each data point based on the new cluster vectors
    * Repeat until the desired number of iterations is reached or no record changes cluster
    
Here's a visualization of the algorithm:
    
    
![kmeans_algo](./imgs/kmeans_algo.png)
[Source](https://en.wikipedia.org/wiki/K-means_clustering)
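
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the synthetic two-blob data, and the choice to initialize from randomly picked data points are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on a 2-D float array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Allocation step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Stop when no record changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    return labels, centroids


# Synthetic data: two blobs of 50 points each (illustrative only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

Because the result depends on the random initialization, production code usually runs several restarts and keeps the best one.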

## Why K-means shouldn't be used with categorical variables

Categorical variables take one of a limited, and usually fixed, number of possible values. If the values can be ordered, the variable is called `ordinal`. If they cannot be ordered, it is called `nominal`. Here are some examples:

* ordinal: risk rating, rank, day of the week
* nominal: political party, region of residence, occupation

There are several ways these variables are used in models. For unsupervised ML, some approaches include one-hot encoding, label encoding, feature aggregation, or excluding the feature from the model.

For one-hot encoding, the categorical feature is converted into several binary features, each representing one category of the original feature. This works when there are few categories, but becomes a computational burden when the feature has hundreds or thousands of unique values (e.g. zip codes).
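
As a small illustration of one-hot encoding, here is a sketch using `pandas.get_dummies`. The toy `hair` column is hypothetical and is not part of the German credit data used later.

```python
import pandas as pd

# Hypothetical toy data: one nominal feature with three categories.
toy = pd.DataFrame({"hair": ["blond", "red", "black", "red"]})

# One binary column per category, so the column count grows with cardinality.
one_hot = pd.get_dummies(toy, columns=["hair"])
print(one_hot.columns.tolist())  # ['hair_black', 'hair_blond', 'hair_red']
```

With a feature like zip code, the same call would produce one column per distinct zip code, which is where the computational burden comes from.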

Label encoding only works for ordinal data, and even then the encoded values must follow the same ordering as the original categories.
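
The sketch below contrasts the two cases using `pandas.Categorical`. The `risk` and `hair` values are made up for the example. For the ordinal feature the category order is passed explicitly. For the nominal feature pandas falls back to alphabetical order, which has no real meaning.

```python
import pandas as pd

# Ordinal feature: the integer codes follow the real ordering low < medium < high.
risk = pd.Series(["low", "high", "medium", "low"])
risk_order = ["low", "medium", "high"]
risk_codes = pd.Categorical(risk, categories=risk_order, ordered=True).codes
print(list(risk_codes))  # [0, 2, 1, 0]

# Nominal feature: the codes are arbitrary (alphabetical here), so a distance
# computed on them would claim "red" is farther from "black" than "blond" is.
hair = pd.Series(["blond", "red", "black"])
hair_codes = pd.Categorical(hair).codes
print(list(hair_codes))  # [1, 2, 0]
```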

For a categorical feature with high cardinality, one solution is to aggregate its categories. For example, zip codes can be aggregated into regions that each contain many zip codes. Following the aggregation, one can then apply one-hot encoding.

When the approaches above fail or are infeasible to apply, the data scientist may exclude the categorical feature from the model altogether.

## Why is using standard distance metric problematic for categorical data?


For numerical data, it is natural to define the center of mass for a collection of points using common distance metrics, such as the Euclidean distance. Say you have three points in your dataset containing annual income: `[$100k, $45k, $65k]`. One metric for the center of mass is just the average, in this case, `$70k`. 

But instead of income, suppose you had categories of hair color: `[blond, red, black]`. How can one define center of mass in this case?


## K-prototypes

K-prototypes combines k-means and k-modes so that it can handle both numeric and categorical variables. The k-modes algorithm differs from k-means in the following ways:

- Instead of Euclidean distance, it uses a dissimilarity measure designed for categorical objects
- Instead of the mean of a set of points, it uses the mode as the cluster centroid, found with a frequency-based approach


### K-modes

Dissimilarity measure for categorical attributes

```
x1 = [finance, married, golf] 
	is similar to 
x2 = [finance, married, squash] 
	but dissimilar to 
x3 = [healthcare, single, soccer]
```

More formally,

<img src="./imgs/dissimilarity_measure.png" alt="drawing" width="250"/>
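
Assuming the measure in the figure is the simple matching dissimilarity from Huang 1998 (the number of attributes on which two records disagree), it can be sketched as follows. The helper name `matching_dissimilarity` is chosen for this example only.

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


x1 = ["finance", "married", "golf"]
x2 = ["finance", "married", "squash"]
x3 = ["healthcare", "single", "soccer"]

print(matching_dissimilarity(x1, x2))  # 1
print(matching_dissimilarity(x1, x3))  # 3
```

A smaller count means the two records are more alike, which matches the intuition in the example above: `x1` is close to `x2` and far from `x3`.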

Use the mode to represent the "center" of a set of vectors containing only categoricals. The mode itself need not be part of the original set. The mode is defined as a vector Q that minimizes the sum of the dissimilarities between itself and every vector in the set, using the dissimilarity measure above.

More formally,

<img src="./imgs/cat_mode.png" alt="drawing" width="500"/>
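
Under the simple matching measure, Huang 1998 shows that this minimizer is obtained by taking the most frequent category of each attribute separately. Here is a minimal sketch of that rule. The function name `categorical_mode` and the three toy records are assumptions for illustration.

```python
from collections import Counter


def categorical_mode(records):
    """Per-attribute most frequent category across equal-length records."""
    columns = zip(*records)
    return [Counter(col).most_common(1)[0][0] for col in columns]


records = [
    ["finance", "married", "golf"],
    ["finance", "single", "squash"],
    ["healthcare", "married", "squash"],
]
mode = categorical_mode(records)
print(mode)  # ['finance', 'married', 'squash']
```

Note that the resulting mode does not appear among the three input records. When two categories are tied for most frequent, the mode is not unique and this sketch simply keeps the one seen first.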

Now that we have an approach for defining how close two vectors containing categorical values are, and a technique for determining the center of mass for a set of categorical-valued vectors, we can use an algorithm similar to k-means to find the clusters.

The k-modes algorithm:

* Select the k initial modes, one for each cluster
* In a loop:
    * Allocate each point in the data to the nearest cluster based on the dissimilarity measure defined above
    * Update the mode of the cluster based on the set of points allocated to it
    * Repeat the first step by checking and reallocating each data point based on the new cluster modes
    * Repeat until no record changes cluster or the desired number of iterations is reached
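
The steps above can be sketched in plain Python. This follows the batch version summarized in the list and is not the `kmodes` library's implementation. The names `dissim` and `kmodes`, the random choice of initial modes from the data, and the toy records are all assumptions for the example.

```python
import random
from collections import Counter


def dissim(a, b):
    # Simple matching: number of mismatched attributes.
    return sum(x != y for x, y in zip(a, b))


def kmodes(records, k, max_iter=100, seed=0):
    """Minimal k-modes sketch; returns (labels, modes)."""
    rng = random.Random(seed)
    # Assumption: initial modes are k distinct records drawn at random.
    unique = sorted(set(map(tuple, records)))
    modes = [list(r) for r in rng.sample(unique, k)]
    labels = [None] * len(records)

    for _ in range(max_iter):
        # Allocate each record to the mode with the smallest dissimilarity.
        new_labels = [
            min(range(k), key=lambda j: dissim(r, modes[j])) for r in records
        ]
        if new_labels == labels:  # no record changed cluster
            break
        labels = new_labels

        # Update each mode attribute-wise from the records assigned to it.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster is empty
                modes[j] = [
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                ]
    return labels, modes


records = [
    ["finance", "married", "golf"],
    ["finance", "married", "squash"],
    ["finance", "single", "golf"],
    ["healthcare", "single", "soccer"],
    ["healthcare", "single", "rugby"],
    ["healthcare", "married", "soccer"],
]
labels, modes = kmodes(records, k=2)
```

As with k-means, the clusters found depend on the initial modes, so several restarts are usually worthwhile.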

### Combining k-means and k-modes

Combining k-means and k-modes requires a single function that combines the Euclidean distance metric with the categorical dissimilarity measure. Formally, the function is formulated as

<img src="./imgs/combined_euc_dissm.png" alt="drawing" width="300"/>

As with the k-means and k-modes algorithms above, the optimization problem boils down to finding the k cluster vectors (which now contain both numeric and categorical values) that minimize the combined Euclidean distance and dissimilarity measure between each record and its assigned cluster vector, summed over all clusters.
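
Here is a hedged sketch of such a combined measure, assuming the form used in Huang 1998: squared Euclidean distance on the numeric attributes plus a weight `gamma` times the number of categorical mismatches. The function name `mixed_dissimilarity` and the sample values are hypothetical.

```python
import numpy as np


def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on numerics plus gamma-weighted mismatches."""
    numeric_part = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


d = mixed_dissimilarity(
    x_num=[1.0, 2.0], x_cat=["A11", "A93"],
    proto_num=[0.0, 0.0], proto_cat=["A11", "A92"],
    gamma=0.5,
)
print(d)  # 5.5
```

The weight `gamma` controls how much a categorical mismatch counts relative to numeric distance. A `gamma` of zero reduces the measure to plain k-means on the numeric columns. As far as I know, passing `gamma=None` to `KPrototypes` in the `kmodes` library lets the library derive the weight from the data.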

In [1]:
import warnings
from typing import List, Tuple

import numpy as np
import pandas as pd 
import random
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from matplotlib import style
from kmodes.kprototypes import KPrototypes
warnings.filterwarnings("ignore")

# Clustering German Credit Data

In [7]:
col_names = [
    "chk_acct",          # 1
    "duration",          # 2
    "credit_his",        # 3
    "purpose",           # 4
    "amount",            # 5
    "saving_acct",       # 6
    "present_emp",       # 7
    "installment_rate",  # 8
    "sex",               # 9
    "other_debtor",      # 10          
    "present_resid",     # 11
    "property",          # 12
    "age",               # 13
    "other_install",     # 14
    "housing",           # 15
    "n_credits",         # 16
    "job",               # 17
    "n_people",          # 18
    "telephone",         # 19
    "foreign",           # 20
    "response"           # 21
]

df = pd.read_csv('./data/german/german.data',
                 sep=' ', 
                 header=None,
                 names=col_names)

In [8]:
cont_cols = [
    "duration",          # 2
    "amount",            # 5
    "installment_rate",  # 8
    "present_resid",     # 11
    "age",               # 13
    "n_credits",         # 16
    "n_people"           # 18
]
cat_cols = [
    "chk_acct",          # 1
    "credit_his",        # 3
    "present_emp",       # 7
    "sex",               # 9
    "property",          # 12
    "housing"            # 15
]

In [9]:
df = df.loc[:, cont_cols+cat_cols]

In [10]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
duration            1000 non-null int64
amount              1000 non-null int64
installment_rate    1000 non-null int64
present_resid       1000 non-null int64
age                 1000 non-null int64
n_credits           1000 non-null int64
n_people            1000 non-null int64
chk_acct            1000 non-null object
credit_his          1000 non-null object
present_emp         1000 non-null object
sex                 1000 non-null object
property            1000 non-null object
housing             1000 non-null object
dtypes: int64(7), object(6)
memory usage: 101.7+ KB


In [11]:
df[cont_cols].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| duration | 1000.0 | 20.903 | 12.058814 | 4.0 | 12.0 | 18.0 | 24.0 | 72.0 |
| amount | 1000.0 | 3271.258 | 2822.736876 | 250.0 | 1365.5 | 2319.5 | 3972.25 | 18424.0 |
| installment_rate | 1000.0 | 2.973 | 1.118715 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| present_resid | 1000.0 | 2.845 | 1.103718 | 1.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| age | 1000.0 | 35.546 | 11.375469 | 19.0 | 27.0 | 33.0 | 42.0 | 75.0 |
| n_credits | 1000.0 | 1.407 | 0.577654 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| n_people | 1000.0 | 1.155 | 0.362086 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |


In [12]:
def summarize_cats(df: pd.DataFrame, cat_cols: List[str] = []) -> pd.DataFrame:
    '''Create table summarizing categorical variables
    To display more values in Values col, set
    pd.set_option('display.max_colwidth', 100)
    '''

    df_cats = df.loc[:, cat_cols]

    print(f"Dataset Shape: {df_cats.shape}")
    summary = pd.DataFrame(df_cats.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Column Name'] = summary['index']
    summary = summary[['Column Name', 'dtypes']]
    summary['Missing'] = df_cats.isnull().sum().values
    summary['Uniques'] = df_cats.nunique().values

    for name in summary['Column Name'].value_counts().index:

        # List unique values
        list_uniques = [str(v) for v in df_cats[name].unique()]
        summary.loc[summary['Column Name'] == name,
                    'Values'] = ' | '.join(list_uniques)

        # Calculate entropy
        shares = df_cats[name].value_counts(normalize=True)
        summary.loc[summary['Column Name'] == name, 'Entropy'] = round(
            stats.entropy(shares, base=2), 2)

    return summary

In [13]:
summarize_cats(df, cat_cols=cat_cols)

Dataset Shape: (1000, 6)


| | Column Name | dtypes | Missing | Uniques | Values | Entropy |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | chk_acct | object | 0 | 4 | A11 \| A12 \| A14 \| A13 | 1.8 |
| 1 | credit_his | object | 0 | 5 | A34 \| A32 \| A33 \| A30 \| A31 | 1.71 |
| 2 | present_emp | object | 0 | 5 | A75 \| A73 \| A74 \| A71 \| A72 | 2.16 |
| 3 | sex | object | 0 | 4 | A93 \| A92 \| A91 \| A94 | 1.53 |
| 4 | property | object | 0 | 4 | A121 \| A122 \| A124 \| A123 | 1.95 |
| 5 | housing | object | 0 | 3 | A152 \| A153 \| A151 | 1.14 |


In [14]:
df

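| | duration | amount | installment_rate | present_resid | age | n_credits | n_people | chk_acct | credit_his | present_emp | sex | property | housing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 1169 | 4 | 4 | 67 | 2 | 1 | A11 | A34 | A75 | A93 | A121 | A152 |
| 1 | 48 | 5951 | 2 | 2 | 22 | 1 | 1 | A12 | A32 | A73 | A92 | A121 | A152 |
| 2 | 12 | 2096 | 2 | 3 | 49 | 1 | 2 | A14 | A34 | A74 | A93 | A121 | A152 |
| 3 | 42 | 7882 | 2 | 4 | 45 | 1 | 2 | A11 | A32 | A74 | A93 | A122 | A153 |
| 4 | 24 | 4870 | 3 | 4 | 53 | 2 | 2 | A11 | A33 | A73 | A93 | A124 | A153 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 12 | 1736 | 3 | 4 | 31 | 1 | 1 | A14 | A32 | A74 | A92 | A121 | A152 |
| 996 | 30 | 3857 | 4 | 4 | 40 | 1 | 1 | A11 | A32 | A73 | A91 | A122 | A152 |
| 997 | 12 | 804 | 4 | 4 | 38 | 1 | 1 | A14 | A32 | A75 | A93 | A123 | A152 |
| 998 | 45 | 1845 | 4 | 4 | 23 | 1 | 1 | A11 | A32 | A73 | A93 | A124 | A153 |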


# Data analysis

# Clustering using K-Prototypes

In [33]:
# Get index of categorical columns
cat_cols_idx = [df.columns.get_loc(c) for c in cat_cols if c in df]

kp = KPrototypes(
        n_clusters=5,
        max_iter=100,
        init='Huang',
        n_init=10,
        gamma=None,
        verbose=0,
        random_state=None,
        n_jobs=-1,
)

clusters = kp.fit_predict(df, categorical=cat_cols_idx)

In [35]:
kp.cluster_centroids_

array([['25.310734463276837', '4120.12429378531', '2.6271186440677967',
        '2.926553672316384', '34.86440677966102', '1.423728813559322',
        '1.192090395480226', 'A14', 'A32', 'A73', 'A93', 'A123', 'A152'],
       ['40.36585365853659', '12576.463414634147', '2.3902439024390243',
        '2.975609756097561', '36.80487804878049', '1.3902439024390243',
        '1.1219512195121952', 'A12', 'A32', 'A73', 'A93', 'A124', 'A152'],
       ['33.12096774193548', '7243.096774193548', '2.5161290322580645',
        '2.846774193548387', '36.983870967741936', '1.5',
        '1.185483870967742', 'A14', 'A32', 'A73', 'A93', 'A123', 'A152'],
       ['13.3', '1160.8243243243244', '3.3027027027027027',
        '2.8027027027027027', '36.075675675675676', '1.4081081081081082',
        '1.154054054054054', 'A14', 'A32', 'A73', 'A93', 'A121', 'A152'],
       ['19.930555555555557', '2426.0833333333335', '3.0416666666666665',
        '2.829861111111111', '34.486111111111114', '1.3576388888888888',
    

In [36]:
kp.cluster_centroids_[0]

array(['25.310734463276837', '4120.12429378531', '2.6271186440677967',
       '2.926553672316384', '34.86440677966102', '1.423728813559322',
       '1.192090395480226', 'A14', 'A32', 'A73', 'A93', 'A123', 'A152'],
      dtype='<U32')

In [None]:
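# How does KPrototypes know which variables are categorical and which are numeric?
# Through the `categorical` argument of fit_predict: a list of column indices
# (cat_cols_idx above). All remaining columns are treated as numeric.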