<h1>Titanic: Binary Regression Analysis</h1>
<h3>Competition Description </h3>
<p>
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
    
Data Source: https://www.kaggle.com/c/titanic/data
</p>

In [1]:
import pandas as pd
import numpy as np
import random
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

path = 'data/'
submission_data = 'gender_submission.csv'
test_data = 'test.csv'
train_data = 'train.csv'
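# Note: gender_submission.csv is Kaggle's sample submission (it predicts that
# all and only female passengers survive), not the true labels for test.csv.
actual_results = 'gender_submission.csv'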

actual = pd.read_csv(path+submission_data, keep_default_na=False)
test = pd.read_csv(path+test_data, keep_default_na=False)
train = pd.read_csv(path+train_data, keep_default_na=False)

actual.head(5)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [2]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,1.0,0.0,31.0
max,891.0,1.0,3.0,8.0,6.0,512.3292


In [3]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

<h3>Data Validation</h3>
<p>
    The first step is to validate the data: check for irregularities and missing values, and gauge its size.
</p>
<p>
    However, we don't need to check every field. Certain fields (Name, Ticket, Cabin, Embarked) are left out of our model because their values are free-form or inconsistent (e.g. strings mixing letters and digits). As such, we only need to clean the attributes we will actually use: the numeric attributes Pclass, SibSp, Parch, Fare and Age, plus Sex, which is text but is easily encoded as 0 or 1.
</p>
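<p>
    As a minimal sketch of the same check, the missing values can also be counted per column with pandas' default NA handling. The four-row CSV below is made up for illustration and only mimics the layout of train.csv; the names <code>sample</code>, <code>missing_per_feature</code> and <code>complete_rows</code> are assumptions, not part of the notebook.
</p>

```python
import io
import pandas as pd

# Hypothetical four-row sample in the same layout as train.csv (not real data).
csv_text = """PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare
1,3,male,22,1,0,7.25
2,1,female,,1,0,71.28
3,3,female,26,0,0,7.92
4,2,male,,0,0,13.00
"""

features = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex', 'Age']

# With pandas' default NA handling, empty cells are read as NaN.
sample = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each feature column.
missing_per_feature = sample[features].isna().sum()

# Number of rows in which every feature is present.
complete_rows = len(sample.dropna(subset=features))

print(missing_per_feature)
print(complete_rows)
```

<p>
    A per-column count shows which attribute is responsible for the incomplete rows, which a single row total does not.
</p>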

In [4]:
# Check how much missing data for the following attributes: Pclass, SibSp, Parch, Fare, Sex, Age

features = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex', 'Age']

df = pd.read_csv(path+train_data,
                  dtype={
                      'Pclass': str,
                      'SibSp': str,
                      'Parch': str,
                      'Fare': str,
                      'Sex': str,
                      'Age': str,
                  },
                  keep_default_na=False)

for feature in features:
    df = df[df[feature] != ""]
    
# From .describe(), we saw 891 rows. len(df) is now the number of rows where all feature values are present
len(df)

714

<h3>Model 1</h3>

<p>
    From the previous cell, only 714 of the 891 rows in our train file have all of our "feature" attributes filled in. The remaining 177 rows (~20%) are a significant share of the data, but in our first test we will proceed and build our model by simply removing them.
</p>

<p>
    Using scikit-learn's built-in estimators, we'll train our models on:
    <ul>
        <li>Pclass</li>
        <li>Age</li>
        <li>Sex</li>
        <li>Fare</li>
        <li>SibSp (# of siblings / spouses aboard the Titanic)</li>
        <li>Parch (# of parents / children aboard the Titanic)</li>
    </ul>
</p>
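<p>
    The cleaning steps for these features can be sketched as one reusable function, so the same logic applies to both the train and test files. This is only an illustration: <code>prepare</code> is a hypothetical helper, the rows are made up, and it assumes the file was read with pandas' default NA handling so that blanks appear as NaN.
</p>

```python
import pandas as pd

def prepare(frame, features):
    """Return a copy with complete rows only and Sex encoded as 0/1 (hypothetical helper)."""
    out = frame.dropna(subset=features).copy()
    # female -> 1, male -> 0, matching the encoding used in this notebook
    out['Sex'] = out['Sex'].map({'female': 1, 'male': 0})
    return out

features = ['Pclass', 'Age', 'Sex', 'Fare', 'SibSp', 'Parch']

# Made-up rows for illustration only.
raw = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Age': [22.0, None, 30.0],
    'Sex': ['male', 'female', 'female'],
    'Fare': [7.25, 71.28, 13.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 2],
})

clean = prepare(raw, features)
print(clean)
```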

In [5]:
# Remove rows where any feature (in practice, Age) is empty

features = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex', 'Age']

train = pd.read_csv(path+train_data,
                  dtype={
                      'Pclass': str,
                      'SibSp': str,
                      'Parch': str,
                      'Fare': str,
                      'Sex': str,
                      'Age': str,
                  },
                  keep_default_na=False)

# Convert female to 1, male to 0
train.loc[train.Sex == 'female', 'Sex'] = "1"
train.loc[train.Sex == 'male', 'Sex'] = "0"

for feature in features:
    # Remove blanks
    train = train[train[feature] != '']  
    
    # Convert values from Str to Int or Float
    curr_feature = train[feature]
    curr_feature = pd.to_numeric(curr_feature)
    train[feature] = curr_feature
    

test = pd.read_csv(path+test_data,
                  dtype={
                      'Pclass': str,
                      'SibSp': str,
                      'Parch': str,
                      'Fare': str,
                      'Sex': str,
                      'Age': str,
                  },
                  keep_default_na=False)

# Convert female to 1, male to 0
test.loc[test.Sex == 'female', 'Sex'] = "1"
test.loc[test.Sex == 'male', 'Sex'] = "0"

for feature in features:
    # Remove blanks
    test = test[test[feature] != '']  
    
    # Convert values from Str to Int or Float
    curr_feature = test[feature]
    curr_feature = pd.to_numeric(curr_feature)
    test[feature] = curr_feature


train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",1,4.0,1,1,PP 9549,16.7,G6,S


In [6]:
# Store the reference results in a dict keyed by PassengerId
actual = pd.read_csv(path+actual_results, keep_default_na=False)
survival_record_of_passenger_id = {}
for i in range(len(actual)):
    passenger_id = actual['PassengerId'][i]
    survival_status = actual['Survived'][i]
    survival_record_of_passenger_id[passenger_id] = survival_status

In [7]:

# Features for training our model
features = ['Pclass', 'Age', 'Sex', 'Fare', 'SibSp', 'Parch']
X = train[features]
Y = train['Survived']

# DecisionTreeRegressor
dtr_model = DecisionTreeRegressor(random_state=0)
dtr_model.fit(X,Y)

# DecisionTreeClassifier
dtc_model = DecisionTreeClassifier(random_state=0)
dtc_model.fit(X,Y)

# Logistic regression (a linear classifier)
lin_model = LogisticRegression(random_state=0, solver='lbfgs')
lin_model.fit(X,Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

<h3> Generate Random Sample and Test </h3>

<p>
    Choose any number of samples from our test data to run through our models.
</p>
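<p>
    A minimal sketch of the sampling step, assuming we want the same rows on every run: passing <code>random_state</code> to <code>DataFrame.sample</code> fixes the draw, and the reference labels can be looked up by PassengerId with <code>Series.map</code>. The two small frames below are made up, and the names <code>test_df</code>, <code>reference</code> and <code>labels_by_id</code> are assumptions.
</p>

```python
import pandas as pd

# Made-up stand-ins for the test features and the reference labels.
test_df = pd.DataFrame({
    'PassengerId': [892, 893, 894, 895, 896],
    'Pclass': [3, 3, 2, 3, 3],
    'Sex': [0, 1, 0, 0, 1],
})
reference = pd.DataFrame({
    'PassengerId': [892, 893, 894, 895, 896],
    'Survived': [0, 1, 0, 0, 1],
})

# random_state fixes the draw, so rerunning the cell gives the same rows.
sample = test_df.sample(n=3, random_state=0)

# Look up labels by PassengerId rather than by row position.
labels_by_id = reference.set_index('PassengerId')['Survived']
sample_labels = sample['PassengerId'].map(labels_by_id)

print(sample.assign(Reference=sample_labels))
```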

In [8]:
num = 30
test_sample = test[features]  # Keep only the feature columns used for training
test_sample = test_sample.sample(n=num)  # Draw num random rows for testing

actual_results = []
indices = list(test_sample.index.values)
for i in indices:
    passenger_id = test['PassengerId'][i]  # Get passenger ID from test data
    survived = survival_record_of_passenger_id[passenger_id]  # Cross check if PassengerId survived
    actual_results.append(survived)  # Store results
    
test_sample.head(5)

Unnamed: 0,Pclass,Age,Sex,Fare,SibSp,Parch
193,2,61.0,0,12.35,0,0
62,3,18.0,0,7.75,0,0
401,2,38.0,0,21.0,1,0
402,1,22.0,1,59.4,0,1
3,3,27.0,0,8.6625,0,0


In [9]:
dtr_predictions = dtr_model.predict(test_sample)
dtc_predictions = dtc_model.predict(test_sample)
lin_model_predictions = lin_model.predict(test_sample)

print(dtr_predictions)
print(dtc_predictions)
print(lin_model_predictions)

[0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 1. 1. 1.]
[0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 1]
[0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 0 0]


In [10]:
df = pd.DataFrame({})
df['ActualRes'] = actual_results
df['DecisionTreeRegressorRes'] = dtr_predictions
df['DecisionTreeClassifierRes'] = dtc_predictions
df['LinearModelRes'] = lin_model_predictions
df

Unnamed: 0,ActualRes,DecisionTreeRegressorRes,DecisionTreeClassifierRes,LinearModelRes
0,0,0.0,0,0
1,0,0.0,0,0
2,0,0.0,0,0
3,1,1.0,1,1
4,0,1.0,1,0
5,0,0.0,0,0
6,0,1.0,1,1
7,0,0.0,0,0
8,1,0.0,0,0
9,0,0.0,0,0


<h3>Model 1 Result</h3>
<p>
    We will perform a simple check to measure the accuracy of each model's predictions on the sample.
</p>
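<p>
    The same check can be sketched as a vectorized comparison with NumPy. <code>accuracy_percent</code> is a hypothetical helper, and the labels and predictions below are made up for illustration.
</p>

```python
import numpy as np

def accuracy_percent(actual, predicted):
    """Percentage of predictions that equal the reference labels (hypothetical helper)."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    # Element-wise comparison gives booleans; their mean is the hit rate.
    return 100.0 * np.mean(actual == predicted)

# Made-up labels and predictions for illustration.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]
print(accuracy_percent(actual, predicted))
```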

In [11]:
scores = [0, 0, 0]

for i in range(len(df)):
    actual_res = int(df['ActualRes'][i])
    
    res1 = int(df['DecisionTreeRegressorRes'][i])
    res2 = int(df['DecisionTreeClassifierRes'][i])
    res3 = int(df['LinearModelRes'][i])
    
    if actual_res == res1:
        scores[0] += 1
    if actual_res == res2:
        scores[1] += 1
    if actual_res == res3:
        scores[2] += 1

score_percentage = [(100.0*i/num) for i in scores]

score_percentage

[56.666666666666664, 56.666666666666664, 80.0]

<h3>Result Analysis</h3>
<p>
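    In this run, the logistic regression model gave the best prediction rate (80%, versus about 57% for both decision trees). Because the sample is only 30 random passengers, the ranking can change from run to run.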
</p>
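<p>
    To get a rough sense of how much a score from 30 passengers can move, here is a worked calculation of the standard error of a proportion. It assumes the normal approximation to the binomial, which is only a rough guide at this sample size; the variable names are illustrative.
</p>

```python
import math

n = 30    # number of sampled passengers
p = 0.80  # best observed accuracy in the scores above

# Standard error of a proportion under the normal approximation.
standard_error = math.sqrt(p * (1 - p) / n)

# Rough 95% interval: observed value plus or minus 1.96 standard errors.
margin = 1.96 * standard_error
low, high = p - margin, p + margin

print(f"standard error: {standard_error:.3f}")
print(f"approximate 95% interval: {low:.2f} to {high:.2f}")
```

<p>
    The margin comes out at roughly 14 percentage points either way, so small differences between models on a sample of this size should be read with caution.
</p>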

<h3>Model 2</h3>
<p>
For our second model, we make slight alterations to our training data (before introducing additional and alternative models). Specifically, we combine the columns that describe how many family members a passenger has on board. We also analyze the relationship between passenger class and ticket fare, to see whether Fare is actually a valuable feature in our analysis.
</p>

In [12]:
# Remove rows where any feature (in practice, Age) is empty

features = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex', 'Age']

train = pd.read_csv(path+train_data,
                  dtype={
                      'Pclass': str,
                      'SibSp': str,
                      'Parch': str,
                      'Fare': str,
                      'Sex': str,
                      'Age': str,
                  },
                  keep_default_na=False)

# Convert female to 1, male to 0
train.loc[train.Sex == 'female', 'Sex'] = "1"
train.loc[train.Sex == 'male', 'Sex'] = "0"

for feature in features:
    # Remove blanks
    train = train[train[feature] != '']  
    
    # Convert values from Str to Int or Float
    curr_feature = train[feature]
    curr_feature = pd.to_numeric(curr_feature)
    train[feature] = curr_feature
    

test = pd.read_csv(path+test_data,
                  dtype={
                      'Pclass': str,
                      'SibSp': str,
                      'Parch': str,
                      'Fare': str,
                      'Sex': str,
                      'Age': str,
                  },
                  keep_default_na=False)

# Convert female to 1, male to 0
test.loc[test.Sex == 'female', 'Sex'] = "1"
test.loc[test.Sex == 'male', 'Sex'] = "0"

for feature in features:
    # Remove blanks
    test = test[test[feature] != '']  
    
    # Convert values from Str to Int or Float
    curr_feature = test[feature]
    curr_feature = pd.to_numeric(curr_feature)
    test[feature] = curr_feature

train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",1,4.0,1,1,PP 9549,16.7,G6,S


In [13]:
# Let us do basic statistical analysis on Fare and Pclass

classes = [1,2,3]
class_stats = []

for c in classes:
    my_filter = train[train['Pclass'] == c]
    temp = my_filter['Fare']
    
    obj = {}
    obj['Pclass'] = c
    obj['mean'] = np.mean(temp)
    obj['median'] = np.median(temp)
    obj['min'] = min(temp)
    obj['max'] = max(temp)
    class_stats.append(obj)

class_stats

[{'Pclass': 1,
  'mean': 87.96158225806447,
  'median': 69.3,
  'min': 0.0,
  'max': 512.3292},
 {'Pclass': 2,
  'mean': 21.47155606936416,
  'median': 15.0458,
  'min': 10.5,
  'max': 73.5},
 {'Pclass': 3,
  'mean': 13.229435211267623,
  'median': 8.05,
  'min': 0.0,
  'max': 56.4958}]

There is a clear trend: the lower the value of Pclass, the "higher" the class of the ticket and therefore the higher its cost (seen in the mean). However, looking at the min and max fares of each ticket class, we see some inconsistencies. For instance, while the mean class 3 fare was only about 13, the highest class 3 fare paid was over 56, well above the class 2 mean of about 21. For this model, we therefore use ONLY Pclass to evaluate whether a passenger survived, and remove Fare from our features.
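
The per-class statistics can also be sketched with a single groupby, and the overall relationship summarized with a correlation coefficient. The fares below are made up for illustration; the real values come from train.csv.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Fare': [80.0, 50.0, 21.0, 13.0, 7.25, 8.05],
})

# One row per class with the mean, median, min and max fare.
class_stats = fares.groupby('Pclass')['Fare'].agg(['mean', 'median', 'min', 'max'])
print(class_stats)

# Pearson correlation between class number and fare; a negative value
# means higher class numbers tend to have lower fares.
print(fares['Pclass'].corr(fares['Fare']))
```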

In [14]:
# Combine SibSp and Parch into one column - Family
# SibSp and Parch are already numeric, so the columns can be added directly
train['Family'] = train['SibSp'] + train['Parch']
test['Family'] = test['SibSp'] + test['Parch']

In [15]:
# Features for training our model
features = ['Pclass', 'Age', 'Sex', 'Family']
X = train[features]
Y = train['Survived']

# DecisionTreeRegressor
dtr_model = DecisionTreeRegressor(random_state=0)
dtr_model.fit(X,Y)

# DecisionTreeClassifier
dtc_model = DecisionTreeClassifier(random_state=0)
dtc_model.fit(X,Y)

# Logistic regression (a linear classifier)
lin_model = LogisticRegression(random_state=0, solver='lbfgs')
lin_model.fit(X,Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [16]:
num = 30
test_sample = test[features]  # Keep only the feature columns used for training
test_sample = test_sample.sample(n=num)  # Draw num random rows for testing

actual_results = []
indices = list(test_sample.index.values)
for i in indices:
    passenger_id = test['PassengerId'][i]  # Get passenger ID from test data
    survived = survival_record_of_passenger_id[passenger_id]  # Cross check if PassengerId survived
    actual_results.append(survived)  # Store results
    
test_sample.head(5)

Unnamed: 0,Pclass,Age,Sex,Family
165,3,26.0,1,2
56,3,35.0,0,0
293,1,53.0,0,2
397,1,48.0,1,2
400,1,30.0,1,0


In [17]:
dtr_predictions = dtr_model.predict(test_sample)
dtc_predictions = dtc_model.predict(test_sample)
lin_model_predictions = lin_model.predict(test_sample)

print(dtr_predictions)
print(dtc_predictions)
print(lin_model_predictions)

[1.         0.         0.         1.         1.         1.
 0.14285714 0.         0.07692308 0.         0.5        0.
 1.         0.         0.         0.33333333 0.5        0.
 1.         0.         1.         0.         0.5        0.
 0.75       1.         1.         0.         1.         1.        ]
[1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1]
[1 0 0 1 1 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1]


In [18]:
df = pd.DataFrame({})
df['ActualRes'] = actual_results
df['DecisionTreeRegressorRes'] = dtr_predictions
df['DecisionTreeClassifierRes'] = dtc_predictions
df['LinearModelRes'] = lin_model_predictions
df

Unnamed: 0,ActualRes,DecisionTreeRegressorRes,DecisionTreeClassifierRes,LinearModelRes
0,1,1.0,1,1
1,0,0.0,0,0
2,0,0.0,0,0
3,1,1.0,1,1
4,1,1.0,1,1
5,1,1.0,1,1
6,0,0.142857,0,0
7,0,0.0,0,1
8,0,0.076923,0,0
9,0,0.0,0,0


In [19]:
scores = [0, 0, 0]

for i in range(len(df)):
    actual_res = int(df['ActualRes'][i])
    
    # Note: int() truncates, so a fractional regressor output such as 0.75 is scored as 0
    res1 = int(df['DecisionTreeRegressorRes'][i])
    res2 = int(df['DecisionTreeClassifierRes'][i])
    res3 = int(df['LinearModelRes'][i])
    
    if actual_res == res1:
        scores[0] += 1
    if actual_res == res2:
        scores[1] += 1
    if actual_res == res3:
        scores[2] += 1

score_percentage = [(100.0*i/num) for i in scores]

score_percentage

[83.33333333333333, 80.0, 90.0]
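
In Model 2 the DecisionTreeRegressor returns fractional values, because each leaf predicts the mean of the training labels that fall into it. The scoring loop converts these with int(), which truncates. A minimal sketch of the difference between truncating and cutting at 0.5 follows; the 0.5 cutoff is an assumption for illustration, not something the notebook uses, and the values are a few of the kind printed above.

```python
import numpy as np

# A few regressor outputs of the kind printed above.
regressor_output = np.array([1.0, 0.0, 0.75, 0.5, 0.33333333, 0.14285714])

# Converting to int truncates toward zero, so 0.75 becomes 0.
truncated = regressor_output.astype(int)

# Cutting at 0.5 treats the output as an estimated survival fraction.
thresholded = (regressor_output >= 0.5).astype(int)

print(truncated)
print(thresholded)
```

The two conversions disagree on every value from 0.5 up to (but not including) 1, so the regressor's score depends on which one is used.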