# Titanic Survivor Predictor
**Author:** *Kamau Wa Wainaina*

## Loading Datasets.

In [3]:
# Library for loading datasets.
import pandas as pd
# Library for linear algebra.
import numpy as np

In [5]:
# Load the datasets.
path = "../../../Data/titanic/"
train = pd.read_csv(path+"train.csv")
test = pd.read_csv(path+"test.csv")

In [7]:
pd.set_option("display.max_colwidth", None) # Ensures column content isn't truncated.

Let's peek at the first five rows of both train and test.

In [10]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


We observe the following about the data:
- PassengerId seems to identify each observation (should make it the index).
- Survived column in train is the target we're trying to predict.
- Next we should perform EDA to know more about the data. 

In [15]:
# First, let's make passenger id the index in both datasets.
train = train.set_index("PassengerId")
test = test.set_index("PassengerId")
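
As an aside, the index can also be set while loading. The sketch below assumes a tiny in-memory CSV (the `csv_text` string is a made-up stand-in for `train.csv`) and uses the `index_col` argument of `pd.read_csv`, which should give the same result as calling `set_index` afterwards.

```python
import io
import pandas as pd

# A tiny stand-in for train.csv (illustrative rows only, not the real file).
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

# index_col sets the index while reading, so no separate set_index call is needed.
sample = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(sample.index.name)
print(sample.columns.tolist())
```

Either approach works; setting the index at load time simply saves one step per dataset.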

## Exploratory Data Analysis.

In this section I want to investigate the following (will focus on train to avoid data snooping):
1. Data types and columns with missing values.
2. Number of those who survived.
    - Categorized by Sex, Age, and Pclass.
3. How expensive was the trip.
4. How name can be used to predict survivors.
5. Similarly, how is ticket related to survivors.

1. **Data types and columns with missing values.**

The info function shows both data types and missing values.

In [21]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


Let's create a function to calculate the percentage of missing values in each column.

In [24]:
def missing_percent(data):
    """Print the percentage of missing values for every column that has any."""
    has_missing_vals = data.isnull().any() # Flags each column that has at least one missing value.
    cols_with_missing = has_missing_vals[has_missing_vals].index # Keeps only the flagged columns.

    for col in cols_with_missing:
        missing_count = data[col].isnull().sum() # Counts number of True since True == 1.
        total_count = len(data[col])
        percent = np.round((missing_count/total_count)*100, 2) # Named percent to avoid shadowing the function.
        print(f"{col} has {percent}% of the values missing.")

missing_percent(train)

Age has 19.87% of the values missing.
Cabin has 77.1% of the values missing.
Embarked has 0.22% of the values missing.


**Cabin has the highest proportion of missing data (77.1%), followed by Age (19.87%) and Embarked (0.22%). Additionally, 6 columns have numeric dtypes and 5 have the object dtype, although some numeric columns, such as Survived and Pclass, encode categories.**
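
The same percentages can be computed without an explicit loop. This is a minimal sketch on a made-up frame (`sample` and its values are assumptions, not the Titanic data): the column-wise mean of `isnull()` is the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for `train`.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isnull() yields booleans; their column-wise mean is the fraction missing.
missing_pct = (sample.isnull().mean() * 100).round(2)
# Keep only columns that have missing values, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Returning a Series instead of printing also makes the result easy to reuse later, for example when deciding which columns to impute.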

2. **Number of those who survived.**

How many people survived?

In [29]:
survived = train["Survived"].sum() # Survived records 1 as survived and 0 as perished.
print(f"{survived} people survived.")

342 people survived.


What was the survival rate?

In [32]:
total_passengers = len(train)
survival_rate = np.round((survived/total_passengers)*100, 2)
print(f"The survival rate of boarding the titanic was {survival_rate}%.")

The survival rate of boarding the titanic was 38.38%.


Of those who survived how many were female and male?

In [35]:
survived_female = train.query("Sex == 'female'")["Survived"].sum() # Works because Survived has 1 and 0.
survived_male = train.query("Sex == 'male'")["Survived"].sum()
print(f"{survived_female} females survived while {survived_male} males survived.")

233 females survived while 109 males survived.


Which gender had a better survival rate?

In [38]:
total_female = len(train.query("Sex == 'female'"))
total_male = len(train.query("Sex == 'male'"))

female_rate = np.round((survived_female/total_female)*100, 2)
male_rate = np.round((survived_male/total_male)*100, 2)

overall_female_rate = np.round((survived_female/total_passengers)*100, 2)
overall_male_rate = np.round((survived_male/total_passengers)*100, 2)

print(f"Among females, the survival rate was {female_rate}% whereas among males it was {male_rate}%.")
print(f"Female survivors made up {overall_female_rate}% of all passengers whereas male survivors made up {overall_male_rate}%.")

Among females, the survival rate was 74.2% whereas among males it was 18.89%.
Female survivors made up 26.15% of all passengers whereas male survivors made up 12.23%.
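
The two figures above use different denominators: the first divides by the number of passengers of that sex, the second by everyone aboard. A small sketch with `groupby` makes the distinction explicit; the five-row `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical passengers (not the real data).
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

grouped = sample.groupby("Sex")["Survived"]
# Rate within each sex: survivors of that sex / passengers of that sex.
within_group_rate = (grouped.mean() * 100).round(2)
# Share of everyone aboard: survivors of that sex / all passengers.
share_of_all = (grouped.sum() / len(sample) * 100).round(2)
print(pd.DataFrame({"within_group_%": within_group_rate, "share_of_all_%": share_of_all}))
```

Only the within-group figure is a survival rate in the usual sense; the other depends on how many passengers of each sex were aboard.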


Did age affect survival rate?

*To answer this question, I'll create age buckets to make analysis easier.*

In [42]:
print(f" Minimum age: {train['Age'].min()} \n Maximum age: {train['Age'].max()}")

 Minimum age: 0.42 
 Maximum age: 80.0


In [44]:
# Define age buckets
bins = [0, 12, 18, 35, 60, 100]  # Specify bucket edges
labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']  # Specify labels for the buckets

# Create the age buckets
train['Age_group'] = pd.cut(train['Age'], bins=bins, labels=labels)
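
To see where the bucket edges fall, here is a short sketch with a few hand-picked ages (the `ages` values are chosen for illustration). By default `pd.cut` builds right-inclusive intervals, so an age of exactly 12 lands in `Child` and exactly 35 in `Young Adult`, while a missing age stays missing.

```python
import pandas as pd

bins = [0, 12, 18, 35, 60, 100]
labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']

# Ages chosen to sit on or next to bucket edges, plus one missing value.
ages = pd.Series([0.42, 12, 12.5, 18, 35, 36, 80, None])
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())
```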

*Next, I'll calculate how many people survived per age group*

In [47]:
survived_age_group_dict = {} # This dict will help while calculating survival rates.
for group in train["Age_group"].dropna().unique(): # dropna() skips passengers whose age is unknown.
    # I set the engine to python because numexpr, which runs .query, doesn't support nullable values.
    survived_age_group = train.query("Age_group == @group", engine="python")["Survived"].sum()
    survived_age_group_dict[group] = survived_age_group
    print(f"{survived_age_group} {group} survived.")

137 Young Adult survived.
78 Adult survived.
40 Child survived.
30 Teen survived.
5 Senior survived.


*I can now answer the question of whether age affected survival rate.*

In [50]:
for group, survival_count in survived_age_group_dict.items():
    
    if pd.isna(group): # This avoids cases where the age isn't known.
        continue
        
    total_age_group = len(train.query("Age_group == @group", engine="python"))
    age_group_rate = np.round((survival_count/total_age_group)*100, 2)
    overall_age_group_rate = np.round((survival_count/total_passengers)*100, 2)
    
    print(f"Among {group}, the survival rate was {age_group_rate}%.")
    print(f"{group} survivors made up {overall_age_group_rate}% of all passengers.")
    print(f"{'-'*60}")

Among Young Adult, the survival rate was 38.27%.
Young Adult survivors made up 15.38% of all passengers.
------------------------------------------------------------
Among Adult, the survival rate was 40.0%.
Adult survivors made up 8.75% of all passengers.
------------------------------------------------------------
Among Child, the survival rate was 57.97%.
Child survivors made up 4.49% of all passengers.
------------------------------------------------------------
Among Teen, the survival rate was 42.86%.
Teen survivors made up 3.37% of all passengers.
------------------------------------------------------------
Among Senior, the survival rate was 22.73%.
Senior survivors made up 0.56% of all passengers.
------------------------------------------------------------
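
The per-group counts and rates can also be gathered in one table. This is a rough sketch on made-up ages (the `sample` frame is an assumption); `groupby` leaves out rows whose age group is missing, and `observed=True` drops buckets that have no passengers.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, including one with an unknown age.
sample = pd.DataFrame({
    "Age": [2.0, 10.0, 30.0, 25.0, 70.0, np.nan],
    "Survived": [1, 0, 1, 0, 0, 1],
})
bins = [0, 12, 18, 35, 60, 100]
labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']
sample["Age_group"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# One row per non-empty age group: survivors, group size, and survival rate.
summary = sample.groupby("Age_group", observed=True)["Survived"].agg(
    survived="sum", total="count", rate="mean"
)
summary["rate"] = (summary["rate"] * 100).round(2)
print(summary)
```

Showing the group size next to the rate helps when a bucket is small, since a rate based on a handful of passengers is less reliable.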


How did passenger classes affect survival?

In [53]:
survived_pclass_dict = {} # This dict will help while calculating survival rates.
for pclass in train["Pclass"].unique():
    survived_pclass = train.query("Pclass == @pclass")["Survived"].sum()
    survived_pclass_dict[pclass] = survived_pclass
    print(f"{survived_pclass} passengers from passenger class {pclass} survived.")

119 passengers from passenger class 3 survived.
136 passengers from passenger class 1 survived.
87 passengers from passenger class 2 survived.


In [55]:
for pclass, survival_count in survived_pclass_dict.items():
            
    total_pclass = len(train.query("Pclass == @pclass"))
    pclass_rate = np.round((survival_count/total_pclass)*100, 2)
    overall_pclass_rate = np.round((survival_count/total_passengers)*100, 2)
    
    print(f"Among passenger class {pclass}, the survival rate was {pclass_rate}%.")
    print(f"Class {pclass} survivors made up {overall_pclass_rate}% of all passengers.")
    print(f"{'-'*70}")

Among passenger class 3, the survival rate was 24.24%.
Class 3 survivors made up 13.36% of all passengers.
----------------------------------------------------------------------
Among passenger class 1, the survival rate was 62.96%.
Class 1 survivors made up 15.26% of all passengers.
----------------------------------------------------------------------
Among passenger class 2, the survival rate was 47.28%.
Class 2 survivors made up 9.76% of all passengers.
----------------------------------------------------------------------
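
Sex and passenger class can also be looked at together. The sketch below is one possible way to do it, using `pivot_table` on a made-up frame (`sample` is an assumption, so the numbers say nothing about the real data). Because `Survived` holds 0 and 1, its mean within each cell is that cell's survival rate.

```python
import pandas as pd

# Hypothetical passengers (not the real data).
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Rows are sexes, columns are passenger classes, values are survival rates in percent.
rate_table = sample.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
) * 100
print(rate_table.round(2))
```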


**Only about 38% of the passengers in the training set survived. Women were far more likely to survive than men (74.2% vs. 18.89%). Among the age groups, children had the best chances of survival (57.97%) and seniors the worst (22.73%). First-class passengers had the highest survival rate (62.96%), compared with 47.28% in second class and 24.24% in third class.**

3. **How expensive was the trip.**

Overall, how expensive was the trip?

In [60]:
mean_fare = np.ceil(train["Fare"].mean())
print(f"On average the passengers paid {mean_fare} pounds.")

On average the passengers paid 33.0 pounds.
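
Note that `np.ceil` always rounds up, so the printed figure is an upper bound on the mean rather than the nearest value. A quick sketch using only the five fares visible in `train.head()` (so the result is not the mean over the whole dataset):

```python
import numpy as np
import pandas as pd

# The five fares shown in train.head(); not the full Fare column.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])
mean_fare = fares.mean()

# np.ceil rounds up to the next whole number; round keeps two decimals.
print(np.ceil(mean_fare), round(mean_fare, 2))
```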


Which age group paid the most? 

In [63]:
for group in train["Age_group"].dropna().unique(): # dropna() skips passengers whose age is unknown.
    group_mean_fare = np.ceil(train.query("Age_group == @group", engine="python")["Fare"].mean())
    print(f"{group} paid {group_mean_fare} pounds on average.")
    print(f"{'-'*40}")

Young Adult paid 30.0 pounds on average.
----------------------------------------
Adult paid 45.0 pounds on average.
----------------------------------------
Child paid 32.0 pounds on average.
----------------------------------------
Teen paid 34.0 pounds on average.
----------------------------------------
Senior paid 42.0 pounds on average.
----------------------------------------


**The average fare varied with age group: adults and seniors paid the most (45 and 42 pounds), while young adults paid the least (30 pounds), even less than children (32 pounds).**

4. **How name can be used to predict survivors.**

This one is a bit challenging, I'll admit. However, let's look at the dataframe and see if there is something we can extract.

In [77]:
train.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_group,Salutation
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Young Adult,Mr
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,Adult,Mrs
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Young Adult,Miss
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Young Adult,Mrs
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Young Adult,Mr
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,Mr
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Adult,Mr
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Child,Master
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Young Adult,Mrs
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Teen,Mrs


Let's start by extracting the salutations present in each name.

In [73]:
def extract_salutation(data):
    other_names = data.split(",")[1] # Retrieves the names other than the surname.
    salutation = other_names.split(".")[0] # All salutations seem to end in a full stop.
    salutation = salutation.strip()
    return salutation

In [75]:
train["Salutation"] = train["Name"].apply(extract_salutation)
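
The same extraction can be sketched with pandas string methods instead of `apply`. The regular expression below is my own assumption about the name format (surname, comma, salutation, full stop); it captures the text between the first comma and the next full stop. The three names are taken from the rows shown earlier.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
])

# Capture everything after the first comma up to (not including) the next full stop.
salutations = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(salutations.tolist())
```

A name that doesn't match the pattern becomes `NaN` here rather than raising an error, which may or may not be the behaviour you want.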

Is there any correlation between these salutations and survival?

In [81]:
salutation_survived = train.groupby("Salutation")["Survived"].sum()
salutation_survived 

Salutation
Capt              0
Col               1
Don               0
Dr                3
Jonkheer          0
Lady              1
Major             1
Master           23
Miss            127
Mlle              2
Mme               1
Mr               81
Mrs              99
Ms                1
Rev               0
Sir               1
the Countess      1
Name: Survived, dtype: int64

What were the survival rates of these groups?