In this mini-lecture, we study how to measure the association (correlation) between two categorical variables.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import os

import scipy.stats as ss

In [2]:
path="C:\\Users\\gao\\GAO_Jupyter_Notebook\\Datasets"
os.chdir(path)

#path="C:\\Users\\pgao\\Documents\\PGZ Documents\\Programming Workshop\\PYTHON\\Open Courses on Python\\Udemy Course on Python\Introduction to Data Science Using Python\\datasets"
#os.chdir(path)

When we think of correlation, most of us think of Pearson's correlation coefficient $r=\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_{i}-\overline{y})^{2}}}$ for two continuous variables $x, y$. When the two variables are categorical, the definition becomes trickier. One common option is to apply one-hot encoding first, breaking each level of each categorical feature into its own 0-or-1 feature. Pearson's correlation can then be computed, but the result quickly becomes too complex to analyse. For example, one-hot encoding turns the 22 categorical features of a simple dataset into a 112-feature dataset, and when we plot the correlation table as a heat-map, we get something basically unreadable.

This is where Cramer's V statistic comes in handy (usually denoted by $\varphi_{c} \in [0, 1]$). It is a measure of association between two nominal variables, giving a value between 0 (no association) and 1 (complete association). The statistic measures the intercorrelation of two discrete variables and may be used with variables having two or more levels. $\varphi_{c}$ is a symmetric measure: it does not matter which variable we place in the columns and which in the rows, and the order of the rows and columns does not matter either. The statistic is based on the **phi-coefficient (mean square contingency coefficient)**, a measure of association for two binary variables introduced by Karl Pearson, with $\phi=\sqrt{\frac{\chi^{2}}{n}}$, where $\chi^{2}$ is Pearson's chi-squared statistic of the contingency table and $n$ is the number of observations.
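To make the one-hot blow-up concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are invented for illustration and are not part of the lecture's datasets). It only shows how the number of features grows from the number of columns to the total number of levels, which is what makes the resulting correlation table hard to read.

```python
import pandas as pd

# Hypothetical toy data: three categorical columns with 3, 2 and 4 levels.
toy = pd.DataFrame({
    "colour": ["red", "green", "blue", "red", "green", "blue"],
    "shape": ["round", "flat", "round", "flat", "round", "flat"],
    "size": ["S", "M", "L", "XL", "S", "M"],
})

# One-hot encode: every level of every listed column becomes its own 0/1 column.
encoded = pd.get_dummies(toy, columns=list(toy.columns)).astype(int)

# The encoded width equals the total number of levels across all columns.
n_levels = sum(toy[col].nunique() for col in toy.columns)
print(f"{toy.shape[1]} categorical columns -> {encoded.shape[1]} one-hot columns")

# Pearson correlation is now defined, but the table is (levels x levels).
corr = encoded.corr()
print(corr.shape)
```

With only three columns the table is still manageable, but its size grows with the square of the total number of levels.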

Let $n$ denote the grand total of observations in the contingency table of the two variables, with $K$ the number of columns and $R$ the number of rows. Then Cramer's V is defined as $\varphi_{c}=\sqrt{\frac{\phi^{2}}{\min(K-1, R-1)}}=\sqrt{\frac{\chi^{2}/n}{\min(K-1, R-1)}}$. Unlike Pearson's correlation, Cramer's V is never negative: nominal categories have no order, so an association has a strength but no direction.
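The definition above can be sketched directly in NumPy. This is a minimal illustration of the plain (uncorrected) formula, assuming every row and column total is positive; the function name `cramers_v_uncorrected` and the two small tables are hypothetical and chosen only to show the two extremes.

```python
import numpy as np

def cramers_v_uncorrected(table):
    """Plain Cramer's V computed term by term from the definition."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()  # grand total of observations
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    phi2 = chi2 / n
    r, k = observed.shape  # R rows, K columns
    return np.sqrt(phi2 / min(k - 1, r - 1))

# Hypothetical tables: one variable fully determines the other ...
perfect = [[30, 0], [0, 20]]
# ... versus rows that are exact multiples of each other (independence).
independent = [[10, 20], [20, 40]]

v_perfect = cramers_v_uncorrected(perfect)
v_independent = cramers_v_uncorrected(independent)
print(v_perfect, v_independent)
```

The first table should sit at the upper end of the $[0, 1]$ range and the second at the lower end, matching the interpretation of $\varphi_{c}$ as strength without direction.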

In [3]:
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)
tips.head()

,total_bill,tip,sex,smoker,day,time,size,total_bill_cut
0,16.99,1.01,Female,No,Sun,Dinner,2,"[15, 20)"
1,10.34,1.66,Male,No,Sun,Dinner,3,"[10, 15)"
2,21.01,3.5,Male,No,Sun,Dinner,3,"[20, 25)"
3,23.68,3.31,Male,No,Sun,Dinner,2,"[20, 25)"
4,24.59,3.61,Female,No,Sun,Dinner,4,"[20, 25)"


In [4]:
print(tips["day"].value_counts(), "\n")
print(tips["time"].value_counts())

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64 

Dinner    176
Lunch      68
Name: time, dtype: int64


In [5]:
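def cramers_v(confusion_matrix):
    """Calculate Cramer's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma, Wicher (2013),
    Journal of the Korean Statistical Society 42: 323-328.
    `confusion_matrix` is the contingency table of observed counts.
    """
    confusion_matrix = np.asarray(confusion_matrix)  # also accept a DataFrame
    chi2 = ss.chi2_contingency(confusion_matrix)[0]  # Pearson chi-squared statistic
    n = confusion_matrix.sum()                       # grand total of observations
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2 and bias-corrected numbers of rows and columns
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))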

confusion_matrix = pd.crosstab(tips["day"], tips["time"])
print(type(confusion_matrix))
confusion_matrix

<class 'pandas.core.frame.DataFrame'>


day,Lunch,Dinner
Thur,61,1
Fri,7,12
Sat,0,87
Sun,0,76


In [6]:
print("Cramer's V: ", cramers_v(confusion_matrix.values))

Cramer's V:  0.9386619340722221
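As a quick sanity check of the symmetry claims made earlier, the sketch below re-applies the same bias-corrected formula to the day-by-time counts, typed in by hand from the crosstab output above, after transposing the table and after shuffling its rows. The name `cramers_v_corrected` is a hypothetical restatement of the notebook's `cramers_v`, repeated here only so that the block runs on its own.

```python
import numpy as np
import scipy.stats as ss

def cramers_v_corrected(table):
    """Restates the notebook's bias-corrected Cramer's V for a standalone check."""
    table = np.asarray(table)
    chi2 = ss.chi2_contingency(table)[0]
    n = table.sum()
    phi2 = chi2 / n
    r, k = table.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

# Counts copied from the crosstab above (rows: Thur, Fri, Sat, Sun; columns: Lunch, Dinner).
day_time = np.array([[61, 1], [7, 12], [0, 87], [0, 76]])

v_original = cramers_v_corrected(day_time)
v_transposed = cramers_v_corrected(day_time.T)            # swap rows and columns
v_shuffled = cramers_v_corrected(day_time[[3, 1, 0, 2]])  # reorder the day rows
print(v_original, v_transposed, v_shuffled)
```

All three values should agree up to floating-point rounding, and the first should match the value printed by the notebook cell above.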


There are other ways to look at the correlation of categorical variables, such as using the condition number of the covariance matrix of the data matrix, or association rule mining. We will not elaborate on them here.

#### References:
   - https://www.kaggle.com/uciml/mushroom-classification
   - https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9