# Using Multiple Correspondence Analysis to understand the relationships among characteristics of notable people

Multiple Correspondence Analysis (MCA) is a type of factor analysis for categorical data. A common use of MCA is to produce biplots that visualize the joint distribution of several categorical variables. Here we use it to understand the relationships among nominal (categorical) characteristics of notable people from the BHHT data.  These data can be thought of as a 4-way contingency table (sex x occupation x birth era x region).  The goal of MCA is to visualize the structure of this contingency table as a graph.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import prince
import matplotlib.pyplot as plt

Modify the path below to point to the data file as needed.

In [None]:
pa = Path("/home/kshedden/data/Teaching/bhht")

The entire dataset is around 2.3 million rows.  You can restrict to the first million rows when exploring, then switch to the whole dataset for final analyses.

In [None]:
ca = ["birth", "death", "gender", "un_region", "level1_main_occ", "name"]
df = pd.read_csv(pa / Path("cross-verified-database.csv.gz"), usecols=ca, encoding="latin-1", nrows=1000000)

Rename the variables so that they fit on the plot.

In [None]:
df = df.rename({"level1_main_occ": "occ", "gender": "sex", "un_region": "reg"}, axis=1)
df = df[["birth", "occ", "sex", "reg", "name"]].dropna()

Since very few people in the dataset were born before 1500, we exclude them here.

In [None]:
df = df.query("birth >= 1500")

Remove very infrequent or difficult-to-interpret categories.

In [None]:
df = df.loc[df.occ != "Other", :]
df = df.loc[df.occ != "Missing", :]
df = df.loc[df.sex != "Other", :]

Create a "century of birth" variable by rounding the birth year to the nearest multiple of 100. We store it as a string so that it is interpreted as a nominal variable.

In [None]:
# Round the birth year to the nearest century; store it as a string so it is treated as nominal.
df = df.assign(era=df.birth.round(-2).astype(int).astype(str))
df = df.drop("birth", axis=1)
df.head()
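As a quick check of what the rounding step does, the sketch below applies the same idea to a few made-up birth years (the values are illustrative, not taken from the BHHT data). It shows that rounding with `-2` maps each year to the nearest multiple of 100 rather than truncating, so the label "1900" covers births from roughly 1850 to 1949.

```python
import pandas as pd

# Made-up birth years, for illustration only.
years = pd.Series([1849.0, 1851.0, 1912.0, 1988.0])

# Round to the nearest multiple of 100, then convert to a string label.
era = years.round(-2).astype(int).astype(str)
print(era.tolist())  # ['1800', '1900', '1900', '2000']
```

If labels based on the calendar century were preferred instead, integer division (`years // 100 * 100`) would be one alternative; the analysis here uses nearest-century rounding.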

The goal of MCA is to visualize the contingency table below.

In [None]:
df.groupby(["sex", "reg", "occ", "era"]).count().unstack().fillna(0)
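To see the shape of such a table on a small scale, here is a sketch using a made-up data frame with three of the variables (the rows are invented for illustration). It counts the rows in each observed combination of categories and spreads the last grouping variable across the columns.

```python
import pandas as pd

# A tiny invented dataset with the same kind of nominal variables.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "occ": ["Culture", "Leadership", "Culture", "Culture", "Leadership"],
    "era": ["1800", "1900", "1900", "1900", "1800"],
})

# Count rows per (sex, occ, era) cell; era becomes the column dimension.
table = toy.groupby(["sex", "occ", "era"]).size().unstack(fill_value=0)
print(table)
```

Combinations of `sex` and `occ` that never occur in the data do not appear as rows, so the flattened table only lists observed cells.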

Below we fit factors to the data using multiple correspondence analysis (MCA).

In [None]:
mca = prince.MCA(n_components=4)
df = df.drop(columns=["name"])
mca = mca.fit(df)
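To make the fitting step less of a black box, the following is a minimal NumPy sketch of the textbook indicator-matrix formulation of MCA, applied to a made-up data frame. It is an illustration under that assumption, not a description of how `prince` is implemented; `prince` may use different scaling or corrections, so its numbers need not match.

```python
import numpy as np
import pandas as pd

# A tiny invented dataset with two nominal variables.
toy = pd.DataFrame({
    "sex": ["F", "M", "M", "F", "M", "F"],
    "occ": ["Culture", "Leadership", "Leadership", "Culture", "Sports", "Culture"],
})

# One-hot (indicator) matrix: one column per category of each variable.
dummies = pd.get_dummies(toy, columns=list(toy.columns))
Z = dummies.to_numpy(dtype=float)

# Correspondence matrix and its row/column masses.
P = Z / Z.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)

# Standardized residuals from the independence model r c^T.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The SVD yields the factors; squared singular values are the inertias.
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the categories (the points shown in an MCA plot).
col_coords = pd.DataFrame((Vt.T * sv) / np.sqrt(c)[:, None], index=dummies.columns)
print(col_coords.iloc[:, :2])
```

In this formulation the total inertia equals (number of categories - number of variables) / number of variables, which is (5 - 2) / 2 = 1.5 for the toy data.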

We can make an interactive plot of the column coordinates. The row coordinates are omitted because there are too many rows (people) to plot.

In [None]:
mca.plot(df, show_row_markers=False, show_row_labels=False)

We can make a more informative static version of this plot by coloring the levels of the same parent variable in a common color, and by connecting the points that correspond to ordered variables.

In [None]:
cols = {"occ": "orange", "sex": "purple", "reg": "lime", "era": "navy"}

def mca_plot(mca, df, cols, jx, jy):
    cc = mca.column_coordinates(df)
    xmin, xmax = cc.iloc[:, jx].min(), cc.iloc[:, jx].max()
    d = xmax - xmin
    xmin -= 0.1*d
    xmax += 0.1*d
    ymin, ymax = cc.iloc[:, jy].min(), cc.iloc[:, jy].max()
    d = ymax - ymin
    ymin -= 0.1*d
    ymax += 0.1*d

    plt.clf()
    plt.grid(True)
    for k in cols.keys():
        cx = cc[cc.index.str.startswith(k)]
        if k == "era":
            plt.plot(cx.iloc[:, jx], cx.iloc[:, jy], "-", color=cols[k])
        for i in range(cx.shape[0]):
            plt.text(cx.iloc[i, jx], cx.iloc[i, jy], cx.index[i], color=cols[k],
                     ha="center", va="center")
    plt.xlabel("Component %d" % (jx + 1))
    plt.ylabel("Component %d" % (jy + 1))
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)

The most informative projection of the columns is spanned by the first two factors, as plotted below:

In [None]:
mca_plot(mca, df, cols, 0, 1)

Recall that the angle between two vectors corresponding to categories of different variables encodes the correlation between the indicators for those variable categories.  This is illustrated by a few examples below.

In [None]:
np.corrcoef(df.occ=="Culture", df.sex=="Female")

In [None]:
np.corrcoef(df.occ=="Leadership", df.sex=="Female")

In [None]:
np.corrcoef(df.reg=="Africa", df.occ=="Sports/Games")

In [None]:
np.corrcoef(df.occ=="Leadership", df.era=="1800")
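One way to compare these correlations with the plot is to compute the cosine of the angle between two category points in the plotted plane. The helper below is a sketch; `angle_cosine` is a hypothetical name, and the coordinates are made up rather than taken from the fitted model.

```python
import numpy as np

def angle_cosine(u, v):
    """Cosine of the angle between two coordinate vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 2-D coordinates for three categories (illustration only).
a = [0.8, 0.3]
b = [0.6, 0.5]
c = [-0.7, -0.2]

print(angle_cosine(a, b))  # positive: the points lie in similar directions
print(angle_cosine(a, c))  # negative: the points lie in opposing directions
```

In this notebook the vectors would be rows of `mca.column_coordinates(df)` restricted to the two plotted components. The exact row labels depend on the `prince` version, so it is safest to look them up in the index rather than assume a naming pattern.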

Below we plot the third and fourth factors (indices 2 and 3, since Python counts from zero).  These capture structure in the data that is not visible in the projection onto the first two factors.

In [None]:
mca_plot(mca, df, cols, 2, 3)

In an MCA plot, information about the variables is encoded both in the angles between the category scores and in the magnitudes of those scores.  The categories of a variable that is uncorrelated with all other variables have scores very close to the origin.  To demonstrate this, we create a variable that is independent of the others and include it in the MCA.

In [None]:
df1 = df.copy()
df1["fake"] = np.random.choice([0, 1], df.shape[0])
cols1 = cols.copy()
cols1["fake"] = "red"

mca1 = prince.MCA(n_components=4)
mca1 = mca1.fit(df1)
mca1.transform(df1)
mca_plot(mca1, df1, cols1, 0, 1)
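To check this numerically rather than by eye, one could compute the length of each category's score in the plotted plane. The sketch below does this for a made-up coordinate table; `score_norms` is a hypothetical helper, and the index labels and values are invented for illustration.

```python
import numpy as np
import pandas as pd

def score_norms(cc, jx=0, jy=1):
    """Euclidean length of each category's score in the (jx, jy) plane, sorted ascending."""
    xy = cc.iloc[:, [jx, jy]].to_numpy(dtype=float)
    return pd.Series(np.sqrt((xy ** 2).sum(axis=1)), index=cc.index).sort_values()

# Made-up column coordinates (illustration only).
toy_cc = pd.DataFrame(
    {0: [0.9, -0.6, 0.01, -0.01], 1: [0.4, -0.8, 0.02, -0.02]},
    index=["sex_Female", "sex_Male", "fake_0", "fake_1"],
)

norms = score_norms(toy_cc)
print(norms)
```

Applied to `mca1.column_coordinates(df1)`, the categories of the independent variable would be expected to appear at the top of the sorted list, with lengths near zero.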