# Multiple Correspondence Analysis (MCA)

PCA is the usual way to explore the inner structure of a set of features, but it is not appropriate here because the data is categorical. Instead, we can use Multiple Correspondence Analysis (MCA).

MCA is to categorical data what PCA is to continuous data: a method of data analysis that detects and represents underlying structures in a dataset of categorical variables.

--> MCA cannot handle missing values, so we need to deal with the NAs first.
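To make the PCA analogy concrete, here is a minimal NumPy sketch of what MCA computes under the hood: a correspondence analysis of the one-hot (indicator) matrix. It is only an illustration, not the `prince` implementation. The helper names `indicator_matrix` and `mca_eigenvalues` and the toy answers are assumptions made up for this example.

```python
import numpy as np

def indicator_matrix(columns):
    """One-hot encode a list of categorical columns into one 0/1 matrix."""
    blocks = []
    for values in columns:
        values = np.asarray(values)
        categories = np.unique(values)
        blocks.append((values[:, None] == categories[None, :]).astype(float))
    return np.hstack(blocks)

def mca_eigenvalues(columns):
    """Principal inertias of a correspondence analysis of the indicator matrix."""
    Z = indicator_matrix(columns)
    P = Z / Z.sum()        # correspondence matrix
    r = P.sum(axis=1)      # row masses
    c = P.sum(axis=0)      # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    singular_values = np.linalg.svd(S, compute_uv=False)
    return singular_values ** 2

# Hypothetical toy answers to two questions (not taken from the survey data).
recline = ["yes", "yes", "no", "no", "yes", "no"]
annoyed = ["no", "no", "yes", "yes", "no", "yes"]

eigenvalues = mca_eigenvalues([recline, annoyed])
total_inertia = eigenvalues.sum()
print(eigenvalues.round(3))
print(round(total_inertia, 3))  # equals K/J - 1 for K categories and J variables
```

The total inertia depends only on the number of categories K and the number of variables J, which matters when reading the percentages of explained variance further down.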

In [3]:
import pandas as pd
import prince

import sys

PATH_CODED_SCV = '../data/coded.csv'
PATH_FLYING_ETIQUETTE_CSV = '../data/flying-etiquette.csv'

# First attempt: the coded dataset with every row that contains an NA dropped.
df = pd.read_csv(PATH_CODED_SCV, sep=',')
df_cleaned = df.dropna()

# Treat every column except "How tall are you?" as categorical.
categorical_columns = [col for col in df_cleaned.columns if col != "How tall are you?"]



mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(df_cleaned[categorical_columns])
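# prince can also be fitted on data that is already one-hot encoded.
# In that case one_hot must be False, so the dummies are not encoded again.
one_hot = pd.get_dummies(df_cleaned[categorical_columns])

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)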


In [4]:
mca.eigenvalues_summary  # Explained variance is very low: only 1.36% for all three components.

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.157 | 0.54% | 0.54% |
| 1 | 0.125 | 0.43% | 0.98% |
| 2 | 0.111 | 0.39% | 1.36% |


In [5]:
# Second attempt: use the imputed version of the data.
df = pd.read_csv(PATH_FLYING_ETIQUETTE_CSV, sep=',')
categorical_columns = [col for col in df.columns if col != "How tall are you?"]



mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(df[categorical_columns])
# Same fit on pre-encoded data; one_hot=False because the input is already one-hot encoded.
one_hot = pd.get_dummies(df[categorical_columns])

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)


In [6]:
mca.eigenvalues_summary  # Explained variance is still very small: under 2.5% for all three components.

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.753 | 1.06% | 1.06% |
| 1 | 0.508 | 0.72% | 1.78% |
| 2 | 0.505 | 0.71% | 2.49% |


In [7]:
# Third attempt: also exclude the "RespondentID" column, using the data with all NAs dropped.
exclude_columns = ['How tall are you?', 'RespondentID']

categorical_columns = [col for col in df_cleaned.columns if col not in exclude_columns]

mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(df_cleaned[categorical_columns])
# Same fit on pre-encoded data; one_hot=False because the input is already one-hot encoded.
one_hot = pd.get_dummies(df_cleaned[categorical_columns])

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)

In [8]:
mca.eigenvalues_summary  # Explained variance increases noticeably (11.87% cumulative), but is still very small.

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.125 | 4.79% | 4.79% |
| 1 | 0.098 | 3.79% | 8.58% |
| 2 | 0.085 | 3.29% | 11.87% |
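As a rough check on the three summaries above, dividing an eigenvalue by its share of variance recovers the total inertia that the percentages are measured against. The numbers below are copied from the first component of each summary. Because they are rounded, the results are only approximate (roughly 29, 71 and 2.6). The dictionary labels are made up for this sketch.

```python
# (eigenvalue, share of variance) of component 0, copied from the three summaries.
first_components = {
    "coded, NAs dropped": (0.157, 0.0054),
    "imputed": (0.753, 0.0106),
    "coded, NAs dropped, no RespondentID": (0.125, 0.0479),
}

# total inertia = eigenvalue / share of variance
implied_total_inertia = {
    label: eigenvalue / share
    for label, (eigenvalue, share) in first_components.items()
}

for label, total in implied_total_inertia.items():
    print(f"{label}: total inertia ~ {total:.1f}")
```

The eigenvalues themselves barely change between the first and the third attempt (0.157 vs. 0.125). What changes is the total they are compared with, which suggests that most of the jump in explained variance comes from removing the many categories of the identifier column.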


In [9]:
#Visualization
mca.plot(
    df_cleaned[categorical_columns],
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)

Such small explained variances could be due to the following reasons:


Categorical Data Nature: MCA is designed for categorical data, which it analyses as one-hot (indicator) columns that only take the values 0 and 1. This inherent discreteness usually gives a much lower explained share per component than a PCA on continuous data, where values vary smoothly.

Sparsity: The indicator matrix is sparse. Each observation has exactly one non-zero entry per variable and zeros for every other category, which spreads the inertia over many dimensions.

Scaling: MCA does not analyse raw variances. Each indicator column is weighted by the inverse of its category frequency (chi-square distances), so the total inertia is fixed by the number of variables and categories rather than by the spread of the data. Percentages of that total therefore tend to look small.

Data Characteristics: The distribution and characteristics of the categorical data itself play a significant role. Highly imbalanced or skewed categorical variables can lead to smaller explained variances.

Number of Categories: Variables with a large number of categories can result in smaller explained variances because the information is spread out across many categories. An identifier such as RespondentID is the extreme case, since every row has its own category. This is consistent with the increase in explained variance once it was excluded.
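The effect of the number of categories can be illustrated on synthetic data. The sketch below runs a plain NumPy correspondence analysis of the indicator matrix (an illustration, not the `prince` implementation) on two strongly related variables, once alone and once with a unique-per-row column added. The helpers `one_hot` and `first_component_share`, the variables `q1`, `q2`, `row_id`, and the data are all assumptions made up for this example.

```python
import numpy as np

def one_hot(values):
    """0/1 indicator columns for one categorical variable."""
    values = np.asarray(values)
    categories = np.unique(values)
    return (values[:, None] == categories[None, :]).astype(float)

def first_component_share(columns):
    """Share of total inertia carried by the first MCA component."""
    Z = np.hstack([one_hot(col) for col in columns])
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    eigenvalues = np.linalg.svd(S, compute_uv=False) ** 2
    return eigenvalues[0] / eigenvalues.sum()

rng = np.random.default_rng(0)
n = 200
# Synthetic data: q2 copies q1 for most rows, so there is real structure to find.
q1 = rng.integers(0, 3, size=n)
q2 = np.where(rng.random(n) < 0.9, q1, rng.integers(0, 3, size=n))
row_id = np.arange(n)  # unique per row, like an identifier column

share_without_id = first_component_share([q1, q2])
share_with_id = first_component_share([q1, q2, row_id])
print(f"without id column: {share_without_id:.1%}")
print(f"with id column:    {share_with_id:.1%}")
```

The structure between `q1` and `q2` is the same in both runs. Only the denominator changes, because the identifier column adds one category per row to the total inertia.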


NOTE: Imputation could also be part of the problem. Some questions have nominal answers and were imputed by mode as well, which can introduce considerable bias. But since MCA can only deal with data without NAs, we are quite limited here.
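A small sketch of why mode imputation can bias nominal answers: every missing value is assigned to the category that is already the most frequent, so its share grows. The answers below are hypothetical and not taken from the survey.

```python
import pandas as pd

# Hypothetical nominal answers with missing values (not taken from the survey).
answers = pd.Series(["window", "aisle", "window", None, None, "middle", None, "window"])

mode_value = answers.mode()[0]        # most frequent observed category
imputed = answers.fillna(mode_value)  # mode imputation

share_before = (answers.dropna() == mode_value).mean()
share_after = (imputed == mode_value).mean()
print(f"share of '{mode_value}' among observed answers: {share_before:.0%}")
print(f"share of '{mode_value}' after mode imputation:  {share_after:.0%}")
```

In this toy case the most frequent answer goes from 3 of 5 observed values to 6 of 8 values after imputation, and the imputed rows all look identical to MCA on that question.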

--> MCA is probably not a good technique to explore this data.
