# Multi-Way Tables and Simpson's Paradox

In the previous lesson, we summarized two categorical variables by cross-tabulating their frequencies. Now we'll see how to introduce a third variable. We'll continue with the Titanic data set.

Note: you have already performed many of these calculations in a previous notebook for tables involving two variables; the same techniques carry over to tables involving three variables.

In [1]:
import pandas as pd

In [2]:
df_titanic = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/titanic.csv")
df_titanic

Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,survived,pclass,crew
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0,3.0,
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0,3.0,
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,0,3.0,
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1,3.0,
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,1,3.0,
...,...,...,...,...,...,...,...,...,...,...,...
2202,"Wynn, Mr. Walter",male,41.0,deck crew,B,England,,,1,,deck crew
2203,"Yearsley, Mr. Harry",male,40.0,victualling crew,S,England,,,1,,victualling crew
2204,"Young, Mr. Francis James",male,32.0,engineering crew,S,England,,,0,,engineering crew
2205,"Zanetti, Sig. Minio",male,20.0,restaurant staff,S,England,,,0,,restaurant staff


Recall that we also created the "type" variable (passenger or crew).

In [25]:
def class_to_type(c):
  if c in ["1st", "2nd", "3rd"]:
    return "passenger"
  else:
    return "crew"

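# Label each person as passenger or crew based on their class
df_titanic["type"] = df_titanic["class"].map(class_to_type)

# Marginal distribution of gender: sum the joint proportions over survival status
pd.crosstab(df_titanic["survived"], df_titanic["gender"], normalize=True).sum(axis=0)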

gender
female    0.221568
male      0.778432
dtype: float64

We can create a two-way table to summarize the joint distribution of type and survival status for those onboard the Titanic.

In [4]:
joint_type_survived = pd.crosstab(
    df_titanic["type"],
    df_titanic["survived"],
    normalize=True
)

joint_type_survived

survived,0,1
type,Unnamed: 1_level_1,Unnamed: 2_level_1
crew,0.307657,0.095605
passenger,0.370186,0.226552


Each number in this table represents a joint proportion. For example:
$$ P(\text{crew}, \text{survived}) = 0.095605. $$
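
To see what `normalize=True` does on a small scale, here is a minimal sketch using a made-up six-row data frame (`df_toy` and its values are hypothetical, not taken from the Titanic data). Each cell is a count divided by the total number of rows, so the cells of the table sum to 1.

```python
import pandas as pd

# Hypothetical toy data: 6 people, NOT from the Titanic data set
df_toy = pd.DataFrame({
    "type": ["crew", "crew", "crew", "passenger", "passenger", "passenger"],
    "survived": [0, 0, 1, 0, 1, 1],
})

# Joint proportions: each cell is (count in cell) / (total number of rows)
joint_toy = pd.crosstab(df_toy["type"], df_toy["survived"], normalize=True)

# The cells of a joint distribution sum to 1
total = joint_toy.to_numpy().sum()
```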

We might want to know whether crew members or passengers survived at higher rates. To do this, we have to compare the *conditional* proportions
\begin{align}
P(\text{survived} | \text{crew}) & & \text{vs.} & & P(\text{survived} | \text{passenger}).
\end{align}

We have learned how to calculate conditional distributions using broadcasting.

In [19]:
survived_given_type = joint_type_survived.divide(
    joint_type_survived.sum(axis=1),
    axis=0
)

survived_given_type

survived,0,1
type,Unnamed: 1_level_1,Unnamed: 2_level_1
crew,0.762921,0.237079
passenger,0.620349,0.379651

From the table, it is apparent that passengers survived at higher rates than crew members:
$$ P(\text{survived}|\text{crew}) = 0.237079 < 0.379651 = P(\text{survived}|\text{passenger}). $$
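
As an aside, `pd.crosstab` can also produce conditional proportions directly: passing `normalize="index"` divides each row by its row total. The sketch below compares the two routes on hypothetical toy data (`df_toy` is made up for illustration), so it can be run without downloading anything.

```python
import pandas as pd

# Hypothetical toy data: 4 crew (1 survivor) and 6 passengers (3 survivors)
df_toy = pd.DataFrame({
    "type": ["crew"] * 4 + ["passenger"] * 6,
    "survived": [0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
})

# Route 1: joint distribution, then divide each row by its row total
joint = pd.crosstab(df_toy["type"], df_toy["survived"], normalize=True)
via_broadcast = joint.divide(joint.sum(axis=1), axis=0)

# Route 2: let crosstab normalize within each row
via_crosstab = pd.crosstab(df_toy["type"], df_toy["survived"], normalize="index")
```

Both tables should agree, and each row sums to 1 because it is a conditional distribution of **survived** given **type**.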



## Controlling for a Variable

But is this the whole story? We know also that survival rates for males and females were very different. Will the trend between the survival rates for crew and passengers still hold after we _control_ for **gender**?

To do this, let's determine the joint distribution of these two variables and a third variable, **gender**. In principle, the frequencies could be represented using a three-dimensional table, but it is difficult to visualize more than two dimensions on paper or on a screen. So we put two of the variables along one dimension and one variable along the other, creating a _three-way table_.

In [6]:
joint_gender_type_survived = pd.crosstab(
    [df_titanic["gender"], df_titanic["type"]],
    df_titanic["survived"],
    normalize=True
)

joint_gender_type_survived

Unnamed: 0_level_0,survived,0,1
gender,type,Unnamed: 2_level_1,Unnamed: 3_level_1
female,crew,0.001359,0.009062
female,passenger,0.057544,0.153602
male,crew,0.306298,0.086543
male,passenger,0.312642,0.07295


Of course, we could have chosen any two of the variables to place along the rows, or had the two variables along the columns instead of the rows. The particular representation above was chosen because it makes it easy to find survival rates for each gender and type, i.e.,
$$ P(\text{survived} | \textbf{gender}, \textbf{type}), $$
where **gender** is either "male" or "female" and **type** is either "crew" or "passenger". Recall that the conditional proportion is calculated as
$$ P(\text{survived} | \textbf{gender}, \textbf{type}) = \frac{P(\text{survived}, \textbf{gender}, \textbf{type})}{P(\textbf{gender}, \textbf{type})}. $$
The numerator comes from the joint distribution above. The denominator can be calculated by summing over the possible values of **survived**---in other words, across each row, over the columns.

In [11]:
joint_gender_type = joint_gender_type_survived.sum(axis=1)
joint_gender_type

gender  type     
female  crew         0.010421
        passenger    0.211146
male    crew         0.392841
        passenger    0.385591
dtype: float64

To obtain the conditional probabilities, we simply divide the joint distribution by the marginal.

In [12]:
survived_given_gender_type = joint_gender_type_survived.divide(
    joint_gender_type,
    axis=0
)

survived_given_gender_type

Unnamed: 0_level_0,survived,0,1
gender,type,Unnamed: 2_level_1,Unnamed: 3_level_1
female,crew,0.130435,0.869565
female,passenger,0.272532,0.727468
male,crew,0.7797,0.2203
male,passenger,0.810811,0.189189
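
As a check on a single entry, the survival rate for female crew can be recomputed by hand from the joint and marginal proportions shown earlier. The result agrees with the table up to rounding in the displayed digits:
$$ P(\text{survived} | \text{female}, \text{crew}) = \frac{P(\text{survived}, \text{female}, \text{crew})}{P(\text{female}, \text{crew})} = \frac{0.009062}{0.010421} \approx 0.8696. $$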


Now, let's compare the survival rates of passengers and crew members for females and males separately.

- For females, crew members survived at a higher rate:
$$ P(\text{survived} | \text{female}, \text{crew}) = 0.869565 > 0.727468 = P(\text{survived} | \text{female}, \text{passenger}) $$
- For males, crew members also survived at a higher rate:
$$ P(\text{survived} | \text{male}, \text{crew}) = 0.220300 > 0.189189 = P(\text{survived} | \text{male}, \text{passenger}) $$

But remember, we found earlier that passengers survived at a higher rate overall:
$$ P(\text{survived}|\text{crew}) = 0.237079 < 0.379651 = P(\text{survived}|\text{passenger}). $$

How is it possible that both male and female crew members survived at a higher rate, yet crew members survived at a lower rate overall? This surprising phenomenon is known as _Simpson's paradox_.

## Simpson's Paradox

*Simpson's paradox* is a phenomenon where a relationship or trend that is present in all of several groups disappears or reverses when the data is aggregated. In the Titanic data set, both male and female crew members survived at higher rates, but when we aggregated over gender, the trend reversed.
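
The reversal can be stated as a simple check. The sketch below hard-codes the survival rates reported in this lesson (so it does not need the data file) and tests whether the comparison holds within every gender yet flips in the aggregate. The variable names are illustrative only.

```python
import pandas as pd

# Survival rates copied from the tables in this lesson
rates_by_gender = pd.DataFrame(
    {"crew": [0.869565, 0.220300], "passenger": [0.727468, 0.189189]},
    index=pd.Index(["female", "male"], name="gender"),
)
overall = pd.Series({"crew": 0.237079, "passenger": 0.379651})

# Within every gender, did crew survive at a higher rate than passengers?
crew_higher_in_each_group = (
    rates_by_gender["crew"] > rates_by_gender["passenger"]
).all()

# In the aggregate, did crew survive at a higher rate than passengers?
crew_higher_overall = overall["crew"] > overall["passenger"]

# Simpson's paradox: the trend in every group is reversed in the aggregate
reversal = bool(crew_higher_in_each_group and not crew_higher_overall)
print(reversal)  # True
```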

In order to investigate Simpson's paradox (and also to play with some Pandas commands), we first reorganize the proportions. First, we keep only the survival rate, dropping the death rate (since it is just one minus the survival rate).

In [13]:
survived_given_gender_type[1]

gender  type     
female  crew         0.869565
        passenger    0.727468
male    crew         0.220300
        passenger    0.189189
Name: 1, dtype: float64

Next, we rearrange these proportions into a two-way table, with gender along one dimension and type along the other. This can be achieved in code by "unstacking" a level of the index. (There are two "levels": **gender** and **type**.)

In [14]:
survival_rates_by_gender_type = survived_given_gender_type[1].unstack(level="type")
survival_rates_by_gender_type

type,crew,passenger
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
female,0.869565,0.727468
male,0.2203,0.189189
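
To get a feel for `unstack`, the sketch below rebuilds the survival-rate Series by hand (values copied from the output above) and unstacks each level in turn. Whichever level is named moves from the row index to the columns.

```python
import pandas as pd

# Survival rates copied from the output above, indexed by (gender, type)
rates = pd.Series(
    [0.869565, 0.727468, 0.220300, 0.189189],
    index=pd.MultiIndex.from_product(
        [["female", "male"], ["crew", "passenger"]],
        names=["gender", "type"],
    ),
)

# Move "type" to the columns: rows are genders
by_type = rates.unstack(level="type")

# Move "gender" to the columns instead: rows are types
by_gender = rates.unstack(level="gender")
```

The two results hold the same numbers with rows and columns swapped, so the choice of level is purely a matter of presentation.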


Caution: the proportions in this table do not represent a distribution. They do not add up to 1.0. These proportions originally came from the conditional distribution of **survived** given **gender** and **type**, but we dropped the death rates from the data.

For comparison, recall that we computed the overall survival rates for each of crew and passenger as a column in `survived_given_type`. Let's extract that column (which will be a Series), convert it to a DataFrame (`to_frame()`) and treat it as a row (`T` for transpose).

In [15]:
survival_rates_by_type = survived_given_type[1].to_frame().T
survival_rates_by_type

type,crew,passenger
1,0.237079,0.379651


Now we can append the overall survival rates to `survival_rates_by_gender_type` using `pd.concat`, which stacks the rows of two data frames on top of one another.

In [16]:
pd.concat([survival_rates_by_gender_type, survival_rates_by_type])

type,crew,passenger
female,0.869565,0.727468
male,0.2203,0.189189
1,0.237079,0.379651
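
The row label `1` in the combined table is left over from the **survived** column name. One possible way to make the table easier to read is to rename that row before concatenating. This is a sketch with the values typed in by hand, and the label `"overall"` is our own choice.

```python
import pandas as pd

# Values copied from the tables above
by_gender_type = pd.DataFrame(
    {"crew": [0.869565, 0.220300], "passenger": [0.727468, 0.189189]},
    index=["female", "male"],
)
overall = pd.Series({"crew": 0.237079, "passenger": 0.379651}, name=1)

# Convert to a one-row DataFrame and relabel the row from 1 to "overall"
overall_row = overall.to_frame().T.rename(index={1: "overall"})

# Stack the gender-specific rates on top of the overall rates
combined = pd.concat([by_gender_type, overall_row])
```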


The overall survival rates are weighted averages of the survival rates for each gender. If we look at the survival rates for crew members:

- The survival rate for female crew is 87.0%.
- The survival rate for male crew is 22.0%.
- The overall survival rate for all crew is 23.7%, which is between the gender-specific survival rates, but much closer to the survival rate for male crew.

Likewise, if we look at the survival rates for passengers:

- The survival rate for female passengers is 72.7%.
- The survival rate for male passengers is 18.9%.
- The overall survival rate for all passengers is 38.0%, which is closer to the middle of the gender-specific survival rates.

Why would the survival rate for crew members be so close to the survival rate for male crew? To answer this question, let's examine the weights that go into this weighted average.

In mathematical notation, the overall survival rate can be decomposed as:
$$ \underbrace{P(\text{survived} | \textbf{type})}_{\text{overall survival rate}} = \sum_{\textbf{gender}} \underbrace{P(\textbf{gender} | \textbf{type})}_{\text{weight}} \underbrace{P(\text{survived} | \textbf{gender}, \textbf{type})}_{\text{gender-specific survival rate}}. $$
So we see that the weights are $P(\textbf{gender} | \textbf{type})$.

First, we calculate this conditional distribution from the joint distribution of **gender** and **type**.

In [17]:
joint_gender_type = pd.crosstab(
    df_titanic["gender"],
    df_titanic["type"],
    normalize=True
)

gender_given_type = joint_gender_type.divide(
    joint_gender_type.sum(axis=0),
    axis=1
)

gender_given_type

type,crew,passenger
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
female,0.025843,0.353834
male,0.974157,0.646166


Notice that 97.4% of crew members were male! So the lower male survival rate is going to dominate the weighted average when we calculate the overall survival rate for crew members. On the other hand, the gender ratio for passengers was more balanced, so their overall survival rate will end up being closer to the middle of the male and female survival rates.
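
Plugging the displayed numbers into the decomposition for crew members shows how heavily the male term dominates (values rounded as in the tables):
$$ P(\text{survived} | \text{crew}) = \underbrace{0.025843}_{P(\text{female} | \text{crew})} \times 0.869565 + \underbrace{0.974157}_{P(\text{male} | \text{crew})} \times 0.220300 \approx 0.022472 + 0.214607 = 0.237079. $$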

Now, we calculate the weighted average, using the conditional distribution of gender as "weights" that we multiply by the survival rates. Then, we sum over the genders to get the weighted averages---i.e., the overall survival rates.

In [18]:
(gender_given_type * survival_rates_by_gender_type).sum(axis=0)

type
crew         0.237079
passenger    0.379651
dtype: float64

Check that these match the overall survival rates that we calculated above.

So the secret of Simpson's paradox lies in two facts:

1. Survival rates were generally much lower for males than for females.
2. Because crew members were predominantly male, their overall survival rate was weighted so heavily towards the lower male survival rate that it ended up below the overall survival rate for passengers.
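
One way to see the role of the second fact is a "what if" calculation: keep the crew's gender-specific survival rates, but weight them by the passengers' gender mix instead of the crew's own. This is a hypothetical scenario for illustration, not something observed in the data. The rates and weights below are copied from the tables in this lesson.

```python
import pandas as pd

# Gender-specific survival rates, P(survived | gender, type)
rates = pd.DataFrame(
    {"crew": [0.869565, 0.220300], "passenger": [0.727468, 0.189189]},
    index=["female", "male"],
)

# Gender mix within each type, P(gender | type)
weights = pd.DataFrame(
    {"crew": [0.025843, 0.974157], "passenger": [0.353834, 0.646166]},
    index=["female", "male"],
)

# Actual overall rates: each type is weighted by its own gender mix
actual = (weights * rates).sum(axis=0)

# Hypothetical: crew survival rates weighted by the passengers' gender mix
crew_with_passenger_mix = (weights["passenger"] * rates["crew"]).sum()
```

Under this hypothetical reweighting, the crew's overall survival rate would be about 0.45, above the passengers' 0.38, which agrees with the direction of the gender-specific comparisons.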

Simpson's paradox means that we have to be careful when comparing proportions from a two-way table, such as survival rates for crew and passengers. When we control for a third variable, such as **gender**, the direction of the effect could change!