### The Titanic Project

#### 1. The problem


~~~~
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered 
“unsinkable” RMS Titanic sank after colliding with an iceberg. 

Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting 
in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some 
groups of people were more likely to survive than others.


~~~~

The Titanic project requires you to perform several steps, including:
  - Explore the data to understand the dataset.
  - Create visualizations to better understand the relationships between variables.
  - Clean the data and fix or remove incomplete records.
  - Look for insights about "what sorts of people were more likely to survive?"
  - Build and optimize a model to predict whether a passenger survived.


The file containing the Titanic survival data is located at:

`./data/train.csv`


This link from Kaggle has more details: https://www.kaggle.com/c/titanic

#### 2. The Datasets

The data has been split into two groups:

  - training set (train.csv)
  - test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
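
The rule behind gender_submission.csv (predict survival for every female passenger and for no one else) can be sketched in a few lines. This is only an illustration: the three-row `test_df` is a hypothetical stand-in for test.csv, and the output file name in the final comment is an assumption.

```python
import pandas as pd

# Hypothetical three-passenger stand-in for test.csv (illustration only).
test_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Baseline rule: predict 1 (survived) for females, 0 for everyone else.
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": (test_df["Sex"] == "female").astype(int),
})

print(submission)

# To write a submission-style file, one could call:
# submission.to_csv("submission.csv", index=False)
```

A baseline like this gives a reference score that any trained model should beat.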


|Variable| Definition | Field values |
|:---|:---|:---|
| survival | Passenger survived | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Gender | |
| Age | Age in years | |
| sibsp | Number of siblings/spouses aboard | |
| parch | Number of parents/children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | Cabin codes have info about location in the ship |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
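
The coded values in the table can be turned into readable labels with a dictionary lookup. The sketch below assumes pandas and uses a small made-up `sample` frame; the mapping dictionaries and the new column names are illustrative, not part of the dataset.

```python
import pandas as pd

# Code-to-label mappings taken from the data dictionary above.
EMBARKED_PORTS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
PCLASS_LABELS = {1: "1st", 2: "2nd", 3: "3rd"}

# Made-up rows for illustration; the last one has no embarkation port.
sample = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["C", "S", None]})

# Series.map looks each value up in the dict; unknown or missing
# values come back as NaN instead of raising an error.
decoded = sample.assign(
    PclassLabel=sample["Pclass"].map(PCLASS_LABELS),
    Port=sample["Embarked"].map(EMBARKED_PORTS),
)

print(decoded)
```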

#### 3. Helper functions

In [9]:
import enum
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import List
%matplotlib inline

class Fld(enum.Enum):
    PassengerId = 0;  Survived = 1;   Pclass = 2; Name = 3;  Sex = 4
    Age = 5;          SibSp = 6;      Parch = 7;  Ticket = 8; Fare = 9;  
    Cabin = 10;       Embarked = 11


def load_titanic_csv(file_name):
    data = []
    with open(file_name, newline='') as csv_file:
        line_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
        first_line = True
        for row in line_reader:
            if first_line:
                first_line = False
                continue
            data.append(row)
    return data


def get_field(row, field_name):
    idx = field_name.value
    return row[idx]


def print_row(row):
    for idx, field in enumerate(Fld):
        fld_name = field.name
        print('{0:>2}: {1:<12} = {2}'.format(idx, fld_name, row[idx]))
    
def print_titanic_fields():
    for idx, field in enumerate(Fld):
        if idx % 4 == 0:
            print(' ')
        print('{0:<2}: {1:<15}'.format(idx, field), end='\t')
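
The helpers above address fields by position through `Fld`. One possible alternative, sketched here under the assumption that the header row is present, is `csv.DictReader`, which keys each record by column name. `SAMPLE_CSV` is a trimmed, illustrative excerpt rather than the real file, and `load_rows_as_dicts` is a hypothetical name.

```python
import csv
import io

# Trimmed, illustrative excerpt in the same layout as train.csv.
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38
'''


def load_rows_as_dicts(csv_file):
    """Return each CSV record as a dict keyed by the header names."""
    return list(csv.DictReader(csv_file))


rows = load_rows_as_dicts(io.StringIO(SAMPLE_CSV))

# Fields are looked up by name, so no index enum is needed.
print(rows[1]["Name"])
print(rows[1]["Sex"])
```

The trade-off is a little more memory per row in exchange for code that does not break if the column order changes.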


#### 4. Project & Questions

In [10]:
# Loads the data.
data = load_titanic_csv('./data/train.csv')

In [11]:
# List the fields.
print_titanic_fields()

 
0 : Fld.PassengerId	1 : Fld.Survived   	2 : Fld.Pclass     	3 : Fld.Name       	 
4 : Fld.Sex        	5 : Fld.Age        	6 : Fld.SibSp      	7 : Fld.Parch      	 
8 : Fld.Ticket     	9 : Fld.Fare       	10: Fld.Cabin      	11: Fld.Embarked   	

In [19]:
# Get a field from a row in the data.
value = get_field(data[1], Fld.Fare)
print(value)

71.2833
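
Note that `csv.reader` returns every field as a string, so the fare above is the text `'71.2833'`, not a number. A small conversion helper is one way to handle this; `to_float` is a hypothetical name, and returning `None` for blank fields is an assumption about how missing values should be treated.

```python
def to_float(text, default=None):
    """Convert a CSV field to float; return `default` for blank or invalid text."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return default


fare = to_float("71.2833")   # a normal numeric field
missing_age = to_float("")   # blank fields (e.g. a missing Age) fall back to default

print(fare, missing_age)
```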


In [12]:
print_row(data[1])

 0: PassengerId  = 2
 1: Survived     = 1
 2: Pclass       = 1
 3: Name         = Cumings, Mrs. John Bradley (Florence Briggs Thayer)
 4: Sex          = female
 5: Age          = 38
 6: SibSp        = 1
 7: Parch        = 0
 8: Ticket       = PC 17599
 9: Fare         = 71.2833
10: Cabin        = C85
11: Embarked     = C


In [13]:
# Sample code to filter a list of rows.
def get_rows_by_passenger_class(data, pass_class):
    rows = []
    for row in data:
        # CSV fields are strings, so compare against str(pass_class).
        if get_field(row, Fld.Pclass) == str(pass_class):
            rows.append(row)
    return rows

In [14]:
# Select the first and second classes.
first_class = get_rows_by_passenger_class(data, 1)
second_class = get_rows_by_passenger_class(data, 2)

In [15]:
# Prints the number of rows in each class.
print('First class = {0}'.format(len(first_class)))
print('Second class = {0}'.format(len(second_class)))

First class = 216
Second class = 184


In [16]:
# Question 1. Implement the function filter_by_sex and print the total for female and male.
# Hint: Use get_field(row, Fld.Sex) to get the passenger gender.

def filter_by_sex(data, key_value):
    rows = []
    for row in data:
        if get_field(row, Fld.Sex) == str(key_value):
            rows.append(row)
    return rows

males = filter_by_sex(data, 'male')
females = filter_by_sex(data, 'female')
print('males = {0}'.format(len(males)))
print('females = {0}'.format(len(females)))

In [21]:
def filter_by_field(data, fld, key_value):
    rows = []
    for row in data:
        if get_field(row, fld) == str(key_value):
            rows.append(row)
    return rows

grp1 = filter_by_field(data, Fld.Pclass, 1)
grp2 = filter_by_field(data, Fld.Pclass, 2)
# Prints the number of rows in each class.
print('first = {0}'.format(len(grp1)))
print('second = {0}'.format(len(grp2)))

first = 216
second = 184


In [22]:
import pandas as pd
titanic_df = pd.read_csv("./data/train.csv")
titanic_df.shape

(891, 12)

In [24]:
titanic_df.shape[1]

12

In [30]:
(titanic_df["Age"].isnull()) 

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [25]:
idx = (titanic_df["Age"].isnull()) & (titanic_df["Sex"] == "female") & (titanic_df["Pclass"] == 1)
idx

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

In [26]:
titanic_df[idx]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
166,167,1,1,"Chibnall, Mrs. (Edith Martha Bowerman)",female,,0,1,113505,55.0,E33,S
256,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2,,C
306,307,1,1,"Fleming, Miss. Margaret",female,,0,0,17421,110.8833,,C
334,335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinshei...",female,,1,0,PC 17611,133.65,,S
375,376,1,1,"Meyer, Mrs. Edgar Joseph (Leila Saks)",female,,1,0,PC 17604,82.1708,,C
457,458,1,1,"Kenyon, Mrs. Frederick R (Marion)",female,,1,0,17464,51.8625,D21,S
669,670,1,1,"Taylor, Mrs. Elmer Zebley (Juliet Cummins Wright)",female,,1,0,19996,52.0,C126,S
849,850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)",female,,1,0,17453,89.1042,C92,C
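
The selection above works by combining boolean Series with `&`, with each comparison wrapped in parentheses. A minimal sketch on a made-up four-row frame shows the same pattern; because `True` counts as 1, summing the mask gives the number of matching rows.

```python
import pandas as pd

# Made-up rows for illustration; None in Age becomes NaN.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "female"],
    "Pclass": [1, 1, 1, 3],
    "Age": [None, 38.0, None, None],
})

# Each condition yields a boolean Series; & combines them element-wise.
mask = (
    sample["Age"].isnull()
    & (sample["Sex"] == "female")
    & (sample["Pclass"] == 1)
)

match_count = int(mask.sum())   # number of rows where all conditions hold
matches = sample[mask]          # the matching rows themselves

print(match_count)
print(matches)
```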


In [27]:
# Total females = 314
# Total males = 577
print('Females = {0}'.format(len(females)))
print('Males   = {0}'.format(len(males)))

Females = 314
Males   = 577


In [36]:
# Question 2: 
# For each field listed below, generate a printed report 
# showing the counts for each value. Like the report from Question 1.
#   Fld.Survived
#   Fld.SibSp  
#   Fld.Parch 
#   Fld.Embarked

In [29]:
dict() # Dict

{}

In [30]:
# Helper function.
from typing import Any, List

def filter_by_fld(data: List[List[str]], fld: Fld, key_value: Any) -> List[List[str]]:
    """Return the rows whose `fld` value equals str(key_value)."""
    rows = []
    for row in data:
        if get_field(row, fld) == str(key_value):
            rows.append(row)
    return rows

In [32]:
survived = filter_by_fld(data, Fld.Survived, 1)
perished = filter_by_fld(data, Fld.Survived, 0)
print('Survived = {0}'.format(len(survived)))
print('Perished = {0}'.format(len(perished)))

Survived = 342
Perished = 549
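
Filtering once per value works when the values are known in advance. For fields such as SibSp or Parch, one possible approach is to count every distinct value in a single pass with `collections.Counter`. The sketch below uses made-up rows and a hard-coded column index; `count_values` and `print_counts` are hypothetical helper names.

```python
from collections import Counter


def count_values(rows, column_index):
    """Count how often each distinct value appears in one column."""
    return Counter(row[column_index] for row in rows)


def print_counts(title, counts):
    """Print a small report, most frequent value first."""
    print('-- Counts for [{0}] --'.format(title))
    for value, total in counts.most_common():
        print('{0:<8} = {1}'.format(value, total))


# Made-up rows in (PassengerId, Survived, Pclass) order, illustration only.
sample_rows = [
    ["1", "0", "3"],
    ["2", "1", "1"],
    ["3", "1", "3"],
    ["4", "1", "1"],
    ["5", "0", "3"],
]
SURVIVED_IDX = 1

survived_counts = count_values(sample_rows, SURVIVED_IDX)
print_counts("Survived", survived_counts)
```

With the notebook's own helpers, the column index would come from `Fld.<name>.value` instead of a hard-coded constant.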


#### Pandas Library

In [33]:
import pandas as pd
titanic_df = pd.read_csv("./data/train.csv")
titanic_df.shape

(891, 12)

In [54]:
# Question 3: with Pandas
# For each field listed below, generate a printed report 
# showing the counts for each value. Like the report from Question 1.
#   Survived
#   SibSp  
#   Parch 
#   Embarked


In [62]:
titanic_df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [34]:
for fld in ['Survived', 'SibSp', 'Parch', 'Embarked']:
    print(f'-- Counts for [{fld}] --')
    print(titanic_df[fld].value_counts())
    print(' ')

-- Counts for [Survived] --
0    549
1    342
Name: Survived, dtype: int64
 
-- Counts for [SibSp] --
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
 
-- Counts for [Parch] --
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
 
-- Counts for [Embarked] --
S    644
C    168
Q     77
Name: Embarked, dtype: int64
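
The Embarked counts above add up to 889 (644 + 168 + 77), two fewer than the 891 rows, because `value_counts()` skips missing values by default. Passing `dropna=False` makes the gap visible. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up values for illustration; None marks a missing port.
embarked = pd.Series(["S", "C", None, "S", "Q", None], name="Embarked")

default_counts = embarked.value_counts()            # missing values are skipped
full_counts = embarked.value_counts(dropna=False)   # missing values get their own row
missing = int(embarked.isnull().sum())              # direct count of missing values

print(full_counts)
print('missing =', missing)
```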
 


In [64]:
# Question 4: 
# Create a field 'age_group' based on the following rules.
# age_group = 1 -> Age is null and is female and first class
# age_group = 2 -> is male and second class
# age_group = 3 -> if not in one of the other groups.

In [35]:
age_grp = "age_group"
titanic_df[age_grp] = 3
idx_grp_1 = (titanic_df["Age"].isnull()) & (titanic_df["Sex"] == "female") & (titanic_df["Pclass"] == 1)
titanic_df.loc[idx_grp_1, age_grp] = 1
idx_grp_2 = (titanic_df["Sex"] == "male") & (titanic_df["Pclass"] == 2)
titanic_df.loc[idx_grp_2, age_grp] = 2
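
The same rules can also be written as one expression with `numpy.select`, which picks the first matching condition for each row and falls back to a default. This is an alternative sketch, not the approach used in the cell above, and `sample` is a made-up frame.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 2, 1, 3],
    "Age": [np.nan, 30.0, 38.0, np.nan],
})

# One boolean condition per group, in priority order.
conditions = [
    sample["Age"].isnull() & (sample["Sex"] == "female") & (sample["Pclass"] == 1),
    (sample["Sex"] == "male") & (sample["Pclass"] == 2),
]

# Rows matching no condition receive the default group 3.
sample["age_group"] = np.select(conditions, [1, 2], default=3)

print(sample)
```

Since the two conditions cannot both be true for the same passenger (one requires female, the other male), the order of the conditions does not affect the result here.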

In [36]:
titanic_df[titanic_df[age_grp] == 1].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,age_group
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C,1
166,167,1,1,"Chibnall, Mrs. (Edith Martha Bowerman)",female,,0,1,113505,55.0,E33,S,1
256,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2,,C,1
306,307,1,1,"Fleming, Miss. Margaret",female,,0,0,17421,110.8833,,C,1
334,335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinshei...",female,,1,0,PC 17611,133.65,,S,1


In [37]:
titanic_df[titanic_df[age_grp] == 2].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,age_group
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S,2
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0,,S,2
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S,2
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S,2
70,71,0,2,"Jenkin, Mr. Stephen Curnow",male,32.0,0,0,C.A. 33111,10.5,,S,2


In [38]:
# Question 5: 
# Create the 'age_group' counts report.
#
titanic_df[age_grp].value_counts().sort_index()

1      9
2    108
3    774
Name: age_group, dtype: int64
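
Counts alone do not say which groups fared better. Since `Survived` is coded 0/1, its mean within a group is that group's survival rate. The sketch below uses made-up values, so the rates it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up group labels and outcomes, illustration only.
sample = pd.DataFrame({
    "age_group": [1, 1, 2, 2, 3, 3, 3],
    "Survived":  [1, 1, 0, 1, 0, 0, 1],
})

# For each group: number of passengers and share that survived.
summary = sample.groupby("age_group")["Survived"].agg(["count", "mean"])

print(summary)
```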

In [93]:
# Question 6: 
# List the records in which 'Age' is null.
#

titanic_df.loc[titanic_df['Age'].isnull()].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Group,Age_Group,age_group
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,1000,3,3
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S,1000,2,2
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C,1000,3,3
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C,1000,3,3
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q,1000,3,3
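
Listing the null records is the first half of the cleaning step. A possible follow-up, sketched here on made-up data, is to count missing values per column and fill the missing ages. Using the median as the fill value is an assumption for illustration; other strategies (dropping rows, group-wise medians) may suit the real data better.

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "Age": [22.0, None, 38.0, None, 30.0],
    "Cabin": [None, "C85", None, None, "E46"],
    "Fare": [7.25, 71.2833, 7.925, 8.05, 51.8625],
})

# Missing values per column.
missing_per_column = sample.isnull().sum()

# Fill missing ages with the median of the known ages (an assumed strategy).
age_median = sample["Age"].median()
filled = sample.assign(Age=sample["Age"].fillna(age_median))

print(missing_per_column)
print(filled)
```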
