### The Titanic

#### 1. The problem


~~~~
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered 
“unsinkable” RMS Titanic sank after colliding with an iceberg. 

Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting 
in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some 
groups of people were more likely to survive than others.


~~~~

The Titanic project consists of several steps, which are documented in detail in the `02_titanic_notes` notebook.

The Titanic survival dataset is located at:

`./data/train.csv`


Additional details are available on the Kaggle competition page: https://www.kaggle.com/c/titanic

#### 2. The Datasets

The dataset has been split into two groups:

__training set (train.csv)__

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

__test set (test.csv)__

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

__Dataset schema__


| Variable | Definition | Field values |
|:---|:---|:---|
| survival | Passenger survived | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Gender | male, female |
| Age | Age in years | |
| sibsp | # of siblings/spouses aboard | |
| parch | # of parents/children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | Cabin codes have info about location in the ship |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

#### 3. Reading data without __Pandas__

In [2]:
import enum
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import List
%matplotlib inline

class Fld(enum.Enum):
    PassengerId = 0;  Survived = 1;   Pclass = 2; Name = 3;  Sex = 4
    Age = 5;          SibSp = 6;      Parch = 7;  Ticket = 8; Fare = 9;  
    Cabin = 10;       Embarked = 11


def load_titanic_csv(file_name):
    """Read the CSV file and return its data rows (header excluded) as lists of strings."""
    with open(file_name, newline='') as csv_file:
        line_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
        next(line_reader, None)  # Skip the header row.
        return list(line_reader)


def get_field(row, field_name):
    idx = field_name.value
    return row[idx]


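def print_row(row):
    # Use each member's value as the column index so the lookup does not
    # depend on the order in which the enum members are declared.
    for field in Fld:
        print('{0:>2}: {1:<12} = {2}'.format(field.value, field.name, row[field.value]))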
    
def print_titanic_fields():
    for idx, field in enumerate(Fld):
        if idx % 4 == 0:
            print(' ')
        print('{0:<2}: {1:<15}'.format(idx, field.name), end='\t')

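As an alternative to the `Fld` enum, the standard library's `csv.DictReader` reads the header row itself and returns one dictionary per passenger. The sketch below is not part of the original notebook: `load_titanic_dicts` and `SAMPLE_CSV` are hypothetical names, and the two data rows are copied from outputs shown elsewhere in this notebook so that the example runs without the data file.

```python
import csv
import io

# Header plus two rows in the same layout as train.csv.
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
'''


def load_titanic_dicts(csv_file):
    """Return one dict per passenger, keyed by the column names in the header."""
    return list(csv.DictReader(csv_file))


rows = load_titanic_dicts(io.StringIO(SAMPLE_CSV))
print(rows[0]['Fare'])        # Fields are still strings.
print(rows[1]['Age'] == '')   # Missing values arrive as empty strings.
```

With a real file, the same function could be called as `load_titanic_dicts(open('./data/train.csv', newline=''))`. The trade-off is that dictionaries use more memory per row than plain lists.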

In [3]:
data = load_titanic_csv('./data/train.csv')

In [4]:
print_titanic_fields()

 
0 : PassengerId    	1 : Survived       	2 : Pclass         	3 : Name           	 
4 : Sex            	5 : Age            	6 : SibSp          	7 : Parch          	 
8 : Ticket         	9 : Fare           	10: Cabin          	11: Embarked       	

In [5]:
# Get a field from a row in the data.
value = get_field(data[1], Fld.Fare)
print(value)

71.2833

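Note that `csv.reader` returns every field as a string, so the fare printed above is the text `'71.2833'`, not a number. A minimal sketch of a conversion helper follows; `to_float` is a hypothetical name, and mapping empty fields to `None` is an assumption about how missing values should be handled.

```python
from typing import Optional


def to_float(value: str) -> Optional[float]:
    """Convert a raw CSV field to float; empty fields become None."""
    return float(value) if value.strip() else None


fare = to_float('71.2833')   # The Fare value printed above.
age = to_float('')           # An empty field, as found when Age is missing.
print(fare > 50)
print(age is None)
```

Converting explicitly avoids string comparisons such as `'9' > '71'`, which evaluate to `True` because strings are compared character by character.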

In [80]:
print_row(data[1])

 0: PassengerId  = 2
 1: Survived     = 1
 2: Pclass       = 1
 3: Name         = Cumings, Mrs. John Bradley (Florence Briggs Thayer)
 4: Sex          = female
 5: Age          = 38
 6: SibSp        = 1
 7: Parch        = 0
 8: Ticket       = PC 17599
 9: Fare         = 71.2833
10: Cabin        = C85
11: Embarked     = C


In [7]:
# Sample code to filter a list of rows.
def get_rows_by_passenger_class(data, pass_class):
    rows = []
    for row in data:
        # CSV fields are strings, so compare against the string form of the class.
        if get_field(row, Fld.Pclass) == str(pass_class):
            rows.append(row)
    return rows

In [9]:
# Select the first and third classes.
first_class = get_rows_by_passenger_class(data, 1)
third_class = get_rows_by_passenger_class(data, 3)

In [10]:
# Prints the number of rows in each class.
print('First class = {0}'.format(len(first_class)))
print('Third class = {0}'.format(len(third_class)))

First class = 216
Third class = 491


##### 4.1 Implement the function filter_by_sex and print the total for female and male.


In [11]:
def filter_by_sex(data, key_value):
    rows = []
    for row in data:
        if  get_field(row, Fld.Sex) == str(key_value):
            rows.append(row) 
    return rows

males = filter_by_sex(data, 'male')
females = filter_by_sex(data, 'female')
print('Males   = {0:>4}'.format(len(males)))
print('Females = {0:>4}'.format(len(females)))

Males   =  577
Females =  314


##### 4.2 Let's make the function more generic by adding a field parameter.

In [12]:
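from typing import Any

def filter_by_field(data: List[List[str]], field: Fld, key_value: Any) -> List[List[str]]:
    rows = []
    for row in data:
        # CSV fields are strings, so compare against the string form of the key.
        if get_field(row, field) == str(key_value):
            rows.append(row)
    return rows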

group_1 = filter_by_field(data, Fld.Pclass, 1)
group_3 = filter_by_field(data, Fld.Pclass, 3)
# Prints the number of rows in each class.
print('First class group = {0}'.format(len(group_1)))
print('Third class group = {0}'.format(len(group_3)))

First class group = 216
Third class group = 491

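The filter can be generalized one step further by passing a predicate function instead of a single field and value, which allows conditions on several fields at once. This is a sketch only: `filter_rows`, `PCLASS`, and `SEX` are hypothetical names, and the sample rows are copied from outputs shown elsewhere in this notebook.

```python
from typing import Callable, List

Row = List[str]

# Column positions, matching Fld.Pclass.value and Fld.Sex.value.
PCLASS, SEX = 2, 4


def filter_rows(data: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in data if predicate(row)]


sample = [
    ['2', '1', '1', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
     'female', '38', '1', '0', 'PC 17599', '71.2833', 'C85', 'C'],
    ['6', '0', '3', 'Moran, Mr. James',
     'male', '', '0', '0', '330877', '8.4583', '', 'Q'],
    ['32', '1', '1', 'Spencer, Mrs. William Augustus (Marie Eugenie)',
     'female', '', '1', '0', 'PC 17569', '146.5208', 'B78', 'C'],
]

first_class_women = filter_rows(
    sample, lambda row: row[PCLASS] == '1' and row[SEX] == 'female')
print(len(first_class_women))
```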

##### 4.3 Print the totals for the fields: Survived, SibSp, Parch, Embarked.

In [13]:
# Another helper function: returns the distinct values of a field, sorted.
def get_field_values(data: List[List[str]], field: Fld) -> List[str]:
    values = set()
    for row in data:
        values.add(get_field(row, field))
    return sorted(values)

# Let's first print the possible values for a field.
print('SibSp values = {0}'.format(get_field_values(data, Fld.SibSp)))
print('Parch values = {0}'.format(get_field_values(data, Fld.Parch)))
print('Embarked values = {0}'.format(get_field_values(data, Fld.Embarked)))


SibSp values = ['0', '1', '2', '3', '4', '5', '8']
Parch values = ['0', '1', '2', '3', '4', '5', '6']
Embarked values = ['', 'C', 'Q', 'S']


In [14]:
# Then use the functions above to build the report.
field_to_report = [Fld.Survived, Fld.SibSp, Fld.Parch, Fld.Embarked]
for field in field_to_report:
    print(' ')
    print('-- Counts for [{0}] --'.format(field.name))
    field_values = get_field_values(data, field)
    for field_value in field_values:
        the_group = filter_by_field(data, field, field_value)
        print('"{0}" = {1} records'.format(field_value, len(the_group)))
    

 
-- Counts for [Survived] --
"0" = 549 records
"1" = 342 records
 
-- Counts for [SibSp] --
"0" = 608 records
"1" = 209 records
"2" = 28 records
"3" = 16 records
"4" = 18 records
"5" = 5 records
"8" = 7 records
 
-- Counts for [Parch] --
"0" = 678 records
"1" = 118 records
"2" = 80 records
"3" = 5 records
"4" = 4 records
"5" = 5 records
"6" = 1 records
 
-- Counts for [Embarked] --
"" = 2 records
"C" = 168 records
"Q" = 77 records
"S" = 644 records

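The report above calls `filter_by_field` once per distinct value, so the data is scanned repeatedly. A possible single-pass alternative uses `collections.Counter` from the standard library. This is a sketch, not the notebook's method: `count_field_values` and `EMBARKED` are hypothetical names, and the sample rows are adapted from outputs shown elsewhere in this notebook.

```python
from collections import Counter

EMBARKED = 11  # Column position, matching Fld.Embarked.value.


def count_field_values(data, index):
    """Count how often each value occurs in the given column, in one pass."""
    return Counter(row[index] for row in data)


sample = [
    ['2', '1', '1', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
     'female', '38', '1', '0', 'PC 17599', '71.2833', 'C85', 'C'],
    ['6', '0', '3', 'Moran, Mr. James',
     'male', '', '0', '0', '330877', '8.4583', '', 'Q'],
    ['18', '1', '2', 'Williams, Mr. Charles Eugene',
     'male', '', '0', '0', '244373', '13', '', 'S'],
    ['32', '1', '1', 'Spencer, Mrs. William Augustus (Marie Eugenie)',
     'female', '', '1', '0', 'PC 17569', '146.5208', 'B78', 'C'],
]

counts = count_field_values(sample, EMBARKED)
for value, total in sorted(counts.items()):
    print('"{0}" = {1} records'.format(value, total))
```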

#### 5. Loading & Processing using __Pandas__

In [15]:
import pandas as pd
titanic_df = pd.read_csv("./data/train.csv")
titanic_df.shape

(891, 12)

In [63]:
columns = titanic_df.shape[1]
columns

12

##### 5.1 Print the total for female and male.

In [16]:
# Create filters
ds_males = (titanic_df["Sex"] == "male")
ds_females = (titanic_df["Sex"] == "female")
# Summing a boolean Series counts its True values (True is treated as 1).
print('Males   = {0}'.format(ds_males.sum()))
print('Females = {0}'.format(ds_females.sum()))


Males   = 577
Females = 314

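The counting trick works because pandas treats `True` as 1 and `False` as 0. The sketch below illustrates this on a small Series; the five `Sex` values are taken from rows shown elsewhere in this notebook, and `sex` and `is_male` are hypothetical names.

```python
import pandas as pd

sex = pd.Series(['female', 'male', 'male', 'female', 'male'])
is_male = (sex == 'male')        # Boolean Series, one entry per row.

print(is_male.sum())             # Number of True values.
print(len(sex[is_male]))         # Same count, by selecting the rows first.
print(is_male.mean())            # Fraction of rows that are True.
```

Selecting the rows and taking `len` gives the same number, but summing the mask avoids building the intermediate subset.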

##### 5.2 Print the totals for the fields: Survived, SibSp, Parch, Embarked.

In [110]:
for fld in ['Survived', 'SibSp', 'Parch', 'Embarked']:
    print(f'-- Counts for [{fld}] --')
    print(titanic_df[fld].value_counts().to_string(dtype=False))
    print(' ')

-- Counts for [Survived] --
0    549
1    342
 
-- Counts for [SibSp] --
0    608
1    209
2     28
4     18
3     16
8      7
5      5
 
-- Counts for [Parch] --
0    678
1    118
2     80
5      5
3      5
4      4
6      1
 
-- Counts for [Embarked] --
S    644
C    168
Q     77
 

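The pure-Python report lists two records with an empty `Embarked` value, while the pandas output above shows only S, C, and Q. This is because `value_counts()` leaves out missing values by default. A minimal sketch of how to include them follows; the small Series is illustrative data, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values only; one entry is missing.
embarked = pd.Series(['C', 'Q', 'S', np.nan, 'C'])

print(embarked.value_counts())              # NaN is left out.
print(embarked.value_counts(dropna=False))  # NaN gets its own row.
print(embarked.isna().sum())                # Number of missing entries.
```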

#### 6. Some other pandas features.

##### 6.1 Building a composite index

In [95]:
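# Combine conditions with and ('&'); or ('|') works the same way.
# Each comparison needs its own parentheses because '&' and '|' bind tighter than '=='.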
idx = (titanic_df["Age"].isnull()) & (titanic_df["Sex"] == "female") & (titanic_df["Pclass"] == 1)
idx.head()

0    False
1    False
2    False
3    False
4    False
dtype: bool

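The cell above only uses `&`. The sketch below shows `&`, `|`, and `~` (not) together on a small DataFrame. The five rows are reduced versions of passengers shown elsewhere in this notebook, and `sample_df` and the mask names are hypothetical.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'Name': ['Cumings', 'Moran', 'Williams', 'Spencer', 'Masselmani'],
    'Sex': ['female', 'male', 'male', 'female', 'female'],
    'Pclass': [1, 3, 2, 1, 3],
    'Age': [38.0, np.nan, np.nan, np.nan, np.nan],
})

# and: all conditions must hold.
mask_and = (sample_df['Age'].isnull()
            & (sample_df['Sex'] == 'female')
            & (sample_df['Pclass'] == 1))

# or: at least one condition must hold.
mask_or = (sample_df['Pclass'] == 1) | (sample_df['Pclass'] == 2)

# not: invert a mask.
mask_not = ~sample_df['Age'].isnull()

print(sample_df[mask_and]['Name'].tolist())
print(sample_df[mask_or]['Name'].tolist())
print(sample_df[mask_not]['Name'].tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operator forms are used.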
##### 6.2 Applying the index to the data

In [67]:
titanic_df[idx].head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 31 | 32 | 1 | 1 | Spencer, Mrs. William Augustus (Marie Eugenie) | female | NaN | 1 | 0 | PC 17569 | 146.5208 | B78 | C |
| 166 | 167 | 1 | 1 | Chibnall, Mrs. (Edith Martha Bowerman) | female | NaN | 0 | 1 | 113505 | 55.0 | E33 | S |
| 256 | 257 | 1 | 1 | Thorne, Mrs. Gertrude Maybelle | female | NaN | 0 | 0 | PC 17585 | 79.2 | NaN | C |
| 306 | 307 | 1 | 1 | Fleming, Miss. Margaret | female | NaN | 0 | 0 | 17421 | 110.8833 | NaN | C |
| 334 | 335 | 1 | 1 | Frauenthal, Mrs. Henry William (Clara Heinshei... | female | NaN | 1 | 0 | PC 17611 | 133.65 | NaN | S |


##### 6.3 Create a field 'age_group' based on the following rules.

* age_group 1 -> Age is null, female, and first class
* age_group 2 -> male and second class
* age_group 3 -> everyone not in one of the other groups

In [17]:
age_grp = "age_group"
# Set everybody to group 3 first.
titanic_df[age_grp] = 3

# Set the records for group 1 next
idx_grp_1 = (titanic_df["Age"].isnull()) & (titanic_df["Sex"] == "female") & (titanic_df["Pclass"] == 1)
titanic_df.loc[idx_grp_1, age_grp] = 1

# Finally set the records for group 2
idx_grp_2 = (titanic_df["Sex"] == "male") & (titanic_df["Pclass"] == 2)
titanic_df.loc[idx_grp_2, age_grp] = 2

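The same rules can be written as a single expression with `numpy.select`, which takes a list of conditions, a list of values, and a default. This is an alternative sketch, not the notebook's approach. `sample_df` is a hypothetical five-row frame built from passengers shown elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'Name': ['Cumings', 'Moran', 'Williams', 'Spencer', 'Masselmani'],
    'Sex': ['female', 'male', 'male', 'female', 'female'],
    'Pclass': [1, 3, 2, 1, 3],
    'Age': [38.0, np.nan, np.nan, np.nan, np.nan],
})

conditions = [
    # Group 1: Age is null, female, first class.
    sample_df['Age'].isnull() & (sample_df['Sex'] == 'female') & (sample_df['Pclass'] == 1),
    # Group 2: male, second class.
    (sample_df['Sex'] == 'male') & (sample_df['Pclass'] == 2),
]

# Rows matching no condition receive the default value 3.
sample_df['age_group'] = np.select(conditions, [1, 2], default=3)
print(sample_df[['Name', 'age_group']])
```

When several conditions match a row, `numpy.select` uses the first one, whereas sequential `.loc` assignments let the last one win. The two groups here cannot overlap, so both approaches give the same result.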
In [21]:
# Select records from group 1
titanic_df[titanic_df[age_grp] == 1].head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | age_group |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 31 | 32 | 1 | 1 | Spencer, Mrs. William Augustus (Marie Eugenie) | female | NaN | 1 | 0 | PC 17569 | 146.5208 | B78 | C | 1 |
| 166 | 167 | 1 | 1 | Chibnall, Mrs. (Edith Martha Bowerman) | female | NaN | 0 | 1 | 113505 | 55.0 | E33 | S | 1 |
| 256 | 257 | 1 | 1 | Thorne, Mrs. Gertrude Maybelle | female | NaN | 0 | 0 | PC 17585 | 79.2 | NaN | C | 1 |
| 306 | 307 | 1 | 1 | Fleming, Miss. Margaret | female | NaN | 0 | 0 | 17421 | 110.8833 | NaN | C | 1 |
| 334 | 335 | 1 | 1 | Frauenthal, Mrs. Henry William (Clara Heinshei... | female | NaN | 1 | 0 | PC 17611 | 133.65 | NaN | S | 1 |


In [22]:
# Select records from group 2
titanic_df[titanic_df[age_grp] == 2].head()

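| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | age_group |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0 | NaN | S | 2 |
| 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0 | NaN | S | 2 |
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0 | D56 | S | 2 |
| 33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5 | NaN | S | 2 |
| 70 | 71 | 0 | 2 | Jenkin, Mr. Stephen Curnow | male | 32.0 | 0 | 0 | C.A. 33111 | 10.5 | NaN | S | 2 |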


##### 6.4 Report the counts for each 'age_group' value


In [102]:
titanic_df[age_grp].value_counts().sort_index().to_frame()

| | age_group |
|:---|:---|
| 1 | 9 |
| 2 | 108 |
| 3 | 774 |


##### 6.5 List the records in which 'Age' is null.

In [25]:
titanic_df.loc[titanic_df['Age'].isnull()].head()

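| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | age_group |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 3 |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0 | NaN | S | 2 |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.225 | NaN | C | 3 |
| 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.225 | NaN | C | 3 |
| 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 3 |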

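Before listing the affected records, it can help to see how many values are missing in each column. A minimal sketch using `isnull().sum()` follows; `sample_df` is a hypothetical four-row frame whose values come from passengers shown in this notebook.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'Name': ['Cumings', 'Moran', 'Spencer', 'Fynney'],
    'Age': [38.0, np.nan, np.nan, 35.0],
    'Cabin': ['C85', np.nan, 'B78', np.nan],
})

# isnull() gives a boolean frame; summing it counts the True values per column.
missing = sample_df.isnull().sum()
print(missing)
```

On the full dataset the equivalent call would be `titanic_df.isnull().sum()`.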