# Titanic Survivor Predictor
**Author:** *Kamau Wa Wainaina*

## Loading Datasets.

In [3]:
# Library for loading datasets.
import pandas as pd
# Library for linear algebra.
import numpy as np

In [4]:
# Load the datasets.
path = "../../../Data/titanic/"
train = pd.read_csv(path+"train.csv")
test = pd.read_csv(path+"test.csv")

Let's peek at the first five rows of both train and test.

In [6]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


We observe the following about the data:
- PassengerId seems to identify each observation (should make it the index).
- Survived column in train is the target we're trying to predict.
- Next we should perform EDA to know more about the data. 

In [9]:
# First, let's make passenger id the index in both datasets.
train = train.set_index("PassengerId")
test = test.set_index("PassengerId")
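
As an aside, the same index can be set while the file is being read, by passing `index_col` to `pd.read_csv`. The sketch below is only an illustration: the in-memory CSV text and the name `toy` are made up for the example and are not the real Titanic files.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for train.csv (values are illustrative).
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

# index_col makes PassengerId the index at load time, so no set_index call is needed.
toy = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(toy.index.name)
print(toy.columns.tolist())
```

Either approach gives the same result; calling `set_index` after loading simply keeps the loading step and the indexing decision separate.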

## Exploratory Data Analysis.

In this section I want to investigate the following (will focus on train to avoid data snooping):
1. Data types and columns with missing values.
2. Number of those who survived.
    - Categorized by Sex, Age, and Pclass.
3. How name can be used to predict survivors.
4. Similarly, how ticket is related to survivors.
5. Expense of the trip.
    - Curious whether the port of embarkation influenced this.

1. **Data types and columns with missing values.**

The `info` method shows both the data types and the non-null count of each column.

In [14]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


Let's create a function to calculate percentage of missing information.

In [16]:
def missing_percent(data):
    """Print the percentage of missing values for each column that has any."""
    has_missing_vals = data.isnull().any() # True for each column with at least one missing value.
    cols_with_missing = has_missing_vals[has_missing_vals].index # Keep only the columns flagged True.

    for col in cols_with_missing:
        missing_count = data[col].isnull().sum() # Counts number of True since True == 1.
        total_count = len(data[col])
        percent = np.round((missing_count/total_count)*100, 2)
        print(f"{col} has {percent}% of the values missing.")

missing_percent(train)

Age has 19.87% of the values missing.
Cabin has 77.1% of the values missing.
Embarked has 0.22% of the values missing.


We observe that Cabin has the largest percentage of missing information followed by Age and finally Embarked.
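
The same percentages can also be computed without an explicit loop, since the mean of a boolean column is the fraction of `True` values. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; the column-wise mean is the missing fraction.
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only columns that actually have missing values, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order gives the ranking directly, which is the comparison made in the observation above.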

2. **Number of those who survived.**

How many people survived?

In [35]:
survived = train["Survived"].sum() # Survived records 1 as survived and 0 as perished.
print(f"{survived} people survived.")

342 people survived.


What was the survival rate?

In [37]:
total_passengers = len(train)
survival_rate = np.round((survived/total_passengers)*100, 2)
print(f"The survival rate aboard the Titanic was {survival_rate}%.")

The survival rate aboard the Titanic was 38.38%.


Of those who survived, how many were female and how many were male?

In [53]:
survived_female = train.query("Sex == 'female'")["Survived"].sum() # Works because Survived has 1 and 0.
survived_male = train.query("Sex == 'male'")["Survived"].sum()
print(f"{survived_female} females survived while {survived_male} males survived.")

233 females survived while 109 males survived.


Which gender had a better survival rate?

In [57]:
total_female = len(train.query("Sex == 'female'"))
total_male = len(train.query("Sex == 'male'"))

female_rate = np.round((survived_female/total_female)*100, 2)
male_rate = np.round((survived_male/total_male)*100, 2)

# Share of all passengers who were female (or male) survivors. These are not survival rates.
overall_female_rate = np.round((survived_female/total_passengers)*100, 2)
overall_male_rate = np.round((survived_male/total_passengers)*100, 2)

print(f"Among females, the survival rate was {female_rate}% whereas among males it was {male_rate}%.")
print(f"Female survivors made up {overall_female_rate}% of all passengers whereas male survivors made up {overall_male_rate}%.")

Among females, the survival rate was 74.2% whereas among males it was 18.89%.
Female survivors made up 26.15% of all passengers whereas male survivors made up 12.23%.
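
The per-sex counts and rates computed above could also be obtained in a single pass with `groupby`. The sketch below runs on a small made-up frame (`toy` and its values are hypothetical); because Survived is coded 0/1, its sum is the number of survivors and its mean is the survival rate within each group.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Named aggregation: one row per sex with survivor count, group size, and rate.
summary = toy.groupby("Sex")["Survived"].agg(
    survivors="sum", total="count", rate="mean"
)
summary["rate"] = (summary["rate"] * 100).round(2)
print(summary)

# Grouping by several columns works the same way, e.g. by Sex and Pclass.
by_sex_class = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
print(by_sex_class)
```

The same pattern should extend to the other categories listed for this step, such as Pclass on its own or binned Age.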
