# Understanding the relationships among characteristics of notable people

"Notability" is a social construct with no fixed definition.  The BHHT data contain four main attributes describing the notable people -- sex, occupation, birth era, and region.  These characteristics can be used to understand the social construction of notability in different contexts.  For example, we can consider how the joint distribution of sex and occupation varies geographically (by region) and over time.

The characteristics of notable people in the BHHT data are [nominal](https://en.wikipedia.org/wiki/Nominal_category) (except birth year, which can be recoded if desired to an ordinal "birth era" variable).  We can thus represent the data as a 4-way [contingency table](https://en.wikipedia.org/wiki/Contingency_table) (sex x occupation x birth era x region). This contingency table reflects the joint distribution of the four characteristics in the population of interest.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import prince
import matplotlib.pyplot as plt

Modify the path below to point to the data file.

In [None]:
pa = Path("/home/kshedden/data/Teaching/bhht")

The entire dataset is around 2.3 million rows.  The cell below reads the first two million rows; you can reduce `nrows` (for example, to one million) when exploring, then remove it to use the whole dataset for final analyses.

In [None]:
ca = ["birth", "death", "gender", "un_region", "level1_main_occ", "name"]
df = pd.read_csv(pa / Path("cross-verified-database.csv.gz"), usecols=ca, encoding="latin-1", nrows=2000000)

Rename the variables so that they fit better as labels on plots.

In [None]:
df = df.rename({"level1_main_occ": "occ", "gender": "sex", "un_region": "reg"}, axis=1)
df = df[["birth", "occ", "sex", "reg", "name"]].dropna()

We will focus on people born in 1500 or later.

In [None]:
df = df.query("birth >= 1500")
df.head()

Remove categories that are very infrequent or difficult to interpret.

In [None]:
df = df.loc[df.occ != "Other", :]
df = df.loc[df.occ != "Missing", :]
df = df.loc[df.sex != "Other", :]

Create an "era" variable by rounding the birth year to the nearest century, and convert it to a string so that it is interpreted as a nominal variable.

In [None]:
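# Round birth year to the nearest hundred and store it as a string label.
# Building the column in one step avoids assigning strings into a float column.
df = df.assign(era=df.birth.round(-2).astype(int).astype(str))
df = df.drop("birth", axis=1)
df.head()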
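
Because `round(-2)` rounds to the nearest hundred, the label 1800 covers roughly 1750 to 1849 rather than the calendar century. The sketch below contrasts this with a floor-based alternative on a few made-up years (not taken from BHHT); the names `years`, `nearest`, and `floored` are purely illustrative.

```python
import pandas as pd

# Hypothetical birth years, chosen to avoid exact half-century ties.
years = pd.Series([1749.0, 1751.0, 1849.0, 1901.0])

# Nearest hundred, which is how the "era" variable is built in this notebook.
nearest = years.round(-2).astype(int)

# Alternative: the hundred at or below the birth year.
floored = (years // 100 * 100).astype(int)

print(pd.DataFrame({"year": years, "nearest": nearest, "floored": floored}))
```

Either coding gives an ordinal era variable; the choice only shifts where the boundaries between eras fall.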

We won't need the names anymore so we create a copy of the data that omits them.

In [None]:
dx = df.drop(columns=["name"])
dx

We will aim to understand notability by studying the contingency table below.

In [None]:
tab = dx.groupby(["sex", "reg", "occ", "era"]).size().unstack().fillna(0)
tab
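
Lower-dimensional views of such a table are obtained by summing over the variables that are not of interest. As a minimal sketch on made-up data (the small frame `toy` is hypothetical, not BHHT), the following builds a three-way table and collapses it to the two-way sex x occupation margin.

```python
import pandas as pd

# Hypothetical records with three nominal attributes.
toy = pd.DataFrame({
    "sex": ["F", "M", "M", "F", "M", "M"],
    "occ": ["Culture", "Culture", "Leadership", "Leadership", "Culture", "Leadership"],
    "reg": ["Asia", "Asia", "Europe", "Europe", "Europe", "Asia"],
})

# Three-way table: rows are (sex, occ) pairs, columns are regions,
# and empty cells are filled with zero.
tab3 = toy.groupby(["sex", "occ", "reg"]).size().unstack(fill_value=0)

# Sum over region to obtain the sex x occupation margin.
margin = tab3.sum(axis=1).unstack(fill_value=0)
print(margin)
```

The same pattern applies to the four-way table: summing over era and region leaves the overall sex x occupation distribution, while selecting a single region or era before summing gives the corresponding conditional table.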

## Pearson residuals

One way to gain insight from a multi-way contingency table is to form Pearson residuals.  These residuals have the form (observed - expected) / SD(observed), where "observed" and "expected" are the observed and expected counts for each cell of the contingency table.  The expected count is obtained under an independence model, in which the probability of a cell is the product of the marginal probabilities of its four categories.  The standard deviation is obtained by treating each cell count as approximately Poisson, so that SD(observed) is estimated by the square root of the expected count.  The Pearson residuals identify the cells whose observed counts are furthest (in statistical terms) from what would be expected if the four attributes were independent of one another.

In [None]:
# Convert the table to long form with one row per cell.
long = tab.stack().reset_index().rename(columns={0: "count"})
n = long["count"].sum()
vx = ["sex", "reg", "occ", "era"]
vp = ["%s_p" % x for x in vx]
# Marginal proportion of each category, aligned with the cells.
for v in vx:
    long[v + "_p"] = long.groupby(v)["count"].transform("sum") / n
# Expected counts under independence, then the Pearson residuals.
long["exp"] = n * long[vp].prod(axis=1)
long["chi2_resid"] = (long["count"] - long["exp"]) / np.sqrt(long["exp"])
long = long.sort_values(by="chi2_resid")
long
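
To make the calculation concrete, here is a small sketch on a hypothetical 2 x 2 table (the counts are invented for illustration). Under independence the expected count in a cell is the product of its row and column totals divided by the grand total, and the Pearson residual divides the deviation from this value by the square root of the expected count.

```python
import numpy as np

# Invented counts for a 2 x 2 table.
obs = np.array([[30.0, 10.0],
                [20.0, 40.0]])
n = obs.sum()

# Expected counts under independence: (row total * column total) / n.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# Pearson residuals and their sum of squares.
resid = (obs - expected) / np.sqrt(expected)
chi2 = (resid**2).sum()

print(expected)
print(resid.round(2))
print(chi2)
```

The sum of the squared residuals is the usual Pearson chi-square statistic for the table, which is why these are sometimes called chi-square residuals.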

Below is a plot of the order statistics (sorted values) of the Pearson residuals.

In [None]:
plt.grid(True)
plt.xlabel("Rank")
plt.ylabel("Pearson residual")
plt.plot(np.sort(long["chi2_resid"]))

## Multiple Correspondence Analysis

Multiple Correspondence Analysis (MCA) is a type of factor analysis for categorical data. A common use of MCA is to produce biplots that can be used to visualize the joint distribution of several categorical variables. Here we use MCA to understand the relationships among the characteristics of the people in the BHHT data.

Below we fit factors to the data using multiple correspondence analysis (MCA).

In [None]:
mca = prince.MCA(n_components=4)
mca = mca.fit(dx)
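
For intuition about what such a fit computes, the following is a rough sketch of one standard formulation of MCA: correspondence analysis of the indicator (dummy) matrix via a singular value decomposition. It is not the `prince` implementation, whose scaling conventions and corrections may differ; the function name `mca_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def mca_sketch(df):
    """Column principal coordinates and inertias from the indicator matrix."""
    # One indicator column per category of every variable.
    Z = pd.get_dummies(df, columns=list(df.columns)).astype(float)
    P = Z.to_numpy() / Z.to_numpy().sum()   # correspondence matrix
    r = P.sum(axis=1)                       # row masses
    c = P.sum(axis=0)                       # column masses
    # Standardized residuals from the independence model r c'.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Column principal coordinates: D_c^{-1/2} V diag(sv).
    G = (Vt.T * sv) / np.sqrt(c)[:, None]
    return pd.DataFrame(G, index=Z.columns), sv**2

# Made-up data with two nominal variables.
toy = pd.DataFrame({
    "sex": ["F", "M", "M", "F", "M", "F"],
    "occ": ["Culture", "Sports", "Leadership", "Culture", "Sports", "Leadership"],
})
coords, inertia = mca_sketch(toy)
print(coords.iloc[:, :2])
print(inertia.sum())
```

In this formulation the total inertia equals (J - Q) / Q, where J is the total number of categories and Q is the number of variables, so here it is (5 - 2) / 2 = 1.5.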

We can make an interactive plot of the column coordinates (there are too many rows to plot the row coordinates as well):

In [None]:
mca.plot(dx, show_row_markers=False, show_row_labels=False)

We can make a more informative static version of this plot by coloring the levels of the same parent variable in a common color, and by connecting the points that correspond to ordered variables.

In [None]:
cols = {"occ": "orange", "sex": "purple", "reg": "lime", "era": "navy"}

def mca_plot(mca, df, cols, jx, jy):
    cc = mca.column_coordinates(df)
    xmin, xmax = cc.iloc[:, jx].min(), cc.iloc[:, jx].max()
    d = xmax - xmin
    xmin -= 0.1*d
    xmax += 0.1*d
    ymin, ymax = cc.iloc[:, jy].min(), cc.iloc[:, jy].max()
    d = ymax - ymin
    ymin -= 0.1*d
    ymax += 0.1*d

    plt.clf()
    plt.grid(True)
    for k in cols.keys():
        cx = cc[cc.index.str.startswith(k)]
        if k == "era":
            plt.plot(cx.iloc[:, jx], cx.iloc[:, jy], "-", color=cols[k])
        for i in range(cx.shape[0]):
            plt.text(cx.iloc[i, jx], cx.iloc[i, jy], cx.index[i], color=cols[k],
                     ha="center", va="center")
    plt.xlabel("Component %d" % (jx + 1))
    plt.ylabel("Component %d" % (jy + 1))
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)

The most informative projection of the columns is spanned by the first two factors, as plotted below:

In [None]:
mca_plot(mca, dx, cols, 0, 1)

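Recall that the angle between two vectors (drawn from the origin) corresponding to categories of different variables encodes the correlation between the indicators for those categories: an acute angle suggests a positive correlation, an obtuse angle a negative correlation, and a right angle little or no correlation.  This is illustrated by a few examples below.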

In [None]:
np.corrcoef(dx.occ=="Culture", dx.sex=="Female")

In [None]:
np.corrcoef(dx.occ=="Leadership", dx.sex=="Female")

In [None]:
np.corrcoef(dx.reg=="Oceania", dx.occ=="Sports/Games")

In [None]:
np.corrcoef(dx.occ=="Leadership", dx.era=="1800")
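
The link between angles and correlations rests on a general identity: the Pearson correlation of two variables equals the cosine of the angle between their mean-centered vectors. The sketch below checks this for two made-up indicator vectors (the arrays `a` and `b` are hypothetical). It illustrates the identity only, not the MCA computation, and in a two-dimensional biplot the correspondence may hold only approximately.

```python
import numpy as np

# Two invented 0/1 indicator vectors over eight observations.
a = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
b = np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=float)

# Center each vector, then compute the cosine of the angle between them.
ac = a - a.mean()
bc = b - b.mean()
cosine = ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc))

# Compare with the Pearson correlation reported by NumPy.
corr = np.corrcoef(a, b)[0, 1]
angle_deg = np.degrees(np.arccos(cosine))
print(corr, cosine, angle_deg)
```

Here the correlation is 0.5, corresponding to an angle of 60 degrees between the centered vectors.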

Below we plot factors 2 and 3 (the third and fourth factors, since Python counts from zero).  These factors capture structure in the data that is not captured by factors 0 and 1.

In [None]:
mca_plot(mca, dx, cols, 2, 3)

In an MCA plot, information about the variables is encoded both in the angles between category scores and in the magnitudes of those scores.  Categories of a variable that is uncorrelated with all other variables have scores with very small magnitudes, so they are plotted close to the origin.  To demonstrate this, we create a variable that is independent of the others and include it in the MCA.

In [None]:
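dx1 = dx.copy()
# Random labels that are independent of the other variables; string labels
# ensure that the variable is treated as categorical.
dx1["fake"] = np.random.choice(["0", "1"], dx.shape[0])
cols1 = cols.copy()
cols1["fake"] = "red"

mca1 = prince.MCA(n_components=4)
mca1 = mca1.fit(dx1)
mca_plot(mca1, dx1, cols1, 0, 1)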