# Titanic Wrangling

In this practice activity you'll continue to work with the Titanic dataset in ways that flex what you've learned about both data wrangling and data visualization.

---
format:
  html:
    embed-resources: true
---


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

data_dir = "https://dlsun.github.io/pods/data/"
df_titanic = pd.read_csv(data_dir + "titanic.csv")

# Keep only rows that have class & embarked info.
# Fallback: if "class" is missing but a numeric "pclass" exists, build "class"
# with the same labels this dataset uses ("1st", "2nd", "3rd").
df = df_titanic.copy()
if "class" not in df.columns and "pclass" in df.columns:
    _map = {1: "1st", 2: "2nd", 3: "3rd"}
    df["class"] = df["pclass"].map(_map)

df = df.dropna(subset=["class", "embarked"])
df.head()


Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,0
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,1


## 1. Filter the data to include passengers only. Calculate the joint distribution (cross-tab) between a passenger's class and where they embarked.

In [2]:
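# Note: df still contains crew rows at this point; the passenger-only
# filter this question asks for is applied in the next code cell.
joint = pd.crosstab(df["class"], df["embarked"])
joint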


class / embarked,B,C,Q,S
1st,3,143,3,175
2nd,6,26,7,245
3rd,0,102,113,494
deck crew,23,0,0,43
engineering crew,43,0,0,281
restaurant staff,0,0,0,69
victualling crew,122,0,0,309
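
A cross-tab of counts becomes a joint distribution once each cell is divided by the grand total. The sketch below does this in plain Python for the passenger rows only, with counts copied from the 1st, 2nd, and 3rd rows of the table above. The names `ports`, `counts`, and `joint_prob` are illustrative and not part of the notebook.

```python
# Passenger-only counts copied from the cross-tab
# (keys: class, list positions: embarkation ports in the order of `ports`).
ports = ["B", "C", "Q", "S"]
counts = {
    "1st": [3, 143, 3, 175],
    "2nd": [6, 26, 7, 245],
    "3rd": [0, 102, 113, 494],
}

# Grand total of passengers with a known class and port.
total = sum(sum(row) for row in counts.values())

# Joint distribution: P(class, embarked) = cell count / grand total.
joint_prob = {
    cls: {port: n / total for port, n in zip(ports, row)}
    for cls, row in counts.items()
}

print(f"total passengers = {total}")                      # 1317
print(f"P(3rd, S) = {joint_prob['3rd']['S']:.3f}")        # 0.375
```

Passing `normalize="all"` to `pd.crosstab` on the passenger-only rows should produce the same proportions directly.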


## 2. Using the joint distribution that you calculated above, calculate the following:

* the conditional distribution of their class given where they embarked
* the conditional distribution of where they embarked given their class

Use the conditional distributions that you calculate to answer the following questions:

* What proportion of 3rd class passengers embarked at Southampton?
* What proportion of Southampton passengers were in 3rd class?

In [6]:

# Keep passengers only (drop the crew categories)
passenger_classes = ["1st", "2nd", "3rd"]
df_pax = df[df["class"].isin(passenger_classes)].copy()

# Joint counts: class (rows) by embarkation port (columns)
joint = pd.crosstab(df_pax["class"], df_pax["embarked"])
display(joint)

# normalize="columns": each column sums to 1, giving P(class | embarked)
cond_class_given_embarked = pd.crosstab(
    df_pax["class"], df_pax["embarked"], normalize="columns"
)
# normalize="index": each row sums to 1, giving P(embarked | class)
cond_embarked_given_class = pd.crosstab(
    df_pax["class"], df_pax["embarked"], normalize="index"
)
display(cond_class_given_embarked)
display(cond_embarked_given_class)

# Read the two requested proportions off the conditional tables
prop_S_given_3rd = cond_embarked_given_class.loc["3rd", "S"]
prop_3rd_given_S = cond_class_given_embarked.loc["3rd", "S"]

print(f"P(S | 3rd) = {prop_S_given_3rd:.3f}")
print(f"P(3rd | S) = {prop_3rd_given_S:.3f}")



class / embarked,B,C,Q,S
1st,3,143,3,175
2nd,6,26,7,245
3rd,0,102,113,494


class / embarked,B,C,Q,S
1st,0.333333,0.527675,0.02439,0.191466
2nd,0.666667,0.095941,0.056911,0.268053
3rd,0.0,0.376384,0.918699,0.540481


class / embarked,B,C,Q,S
1st,0.009259,0.441358,0.009259,0.540123
2nd,0.021127,0.091549,0.024648,0.862676
3rd,0.0,0.143865,0.159379,0.696756


P(S | 3rd) = 0.697
P(3rd | S) = 0.540



Most 3rd-class passengers (≈70%) embarked at Southampton, and about 54% of Southampton passengers were in 3rd class. So 3rd class boarded mainly at Southampton and was the largest class there, while 1st class was the largest group at Cherbourg (≈53%).
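
As a cross-check, both proportions can be recomputed by hand from the passenger-only joint counts: the numerator is the same cell (3rd class, Southampton) in each case, and only the denominator changes (row total versus column total). This is a minimal plain-Python sketch; the variable names are illustrative, and the counts are copied from the passenger-only cross-tab output.

```python
# Passenger-only counts copied from the cross-tab
# (keys: class, list positions: embarkation ports in the order of `ports`).
ports = ["B", "C", "Q", "S"]
counts = {
    "1st": [3, 143, 3, 175],
    "2nd": [6, 26, 7, 245],
    "3rd": [0, 102, 113, 494],
}

s = ports.index("S")
third_and_s = counts["3rd"][s]                               # shared numerator

# Denominator for P(S | 3rd): everyone in 3rd class (row total).
third_total = sum(counts["3rd"])
# Denominator for P(3rd | S): everyone who embarked at Southampton (column total).
southampton_total = sum(row[s] for row in counts.values())

p_s_given_3rd = third_and_s / third_total
p_3rd_given_s = third_and_s / southampton_total

print(f"P(S | 3rd) = {third_and_s} / {third_total} = {p_s_given_3rd:.3f}")
print(f"P(3rd | S) = {third_and_s} / {southampton_total} = {p_3rd_given_s:.3f}")
```

The two results differ because the conditioning group differs: 709 third-class passengers in one case, 914 Southampton passengers in the other.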


## 3. Make a visualization showing the distribution of a passenger's class, given where they embarked.

Discuss the pros and cons of using this visualization versus the distributions you calculated before, to answer the previous questions.

In [11]:


viz_df = (
    cond_class_given_embarked
    .reset_index()
    .melt(id_vars="class", var_name="embarked", value_name="proportion")
)


fig = px.bar(
    viz_df,
    x="embarked",
    y="proportion",
    color="class",
    barmode="group",
    text=viz_df["proportion"].map(lambda x: f"{x:.2f}")
)
fig.update_layout(
    title="Distribution of Passenger Class, Given Where They Embarked (P(class | embarked))",
    yaxis=dict(title="Proportion", tickformat=".0%"),
    xaxis_title="Embarkation Port"
)
fig.show()


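* Pros: the chart makes it easy to compare class proportions within each embarkation port at a glance, and the overall pattern is much clearer visually than in the tables.
* Cons: exact values are harder to read than in the tables, it is less obvious that the bars for each port sum to 1, and the chart shows only P(class | embarked), so it cannot answer the P(S | 3rd) question on its own.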