# Predict Titanic Survival 

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage, sinking to the bottom of the Atlantic Ocean after hitting an iceberg, bringing down 1,502 of 2,224 passengers onboard.

In this project we will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data we will be using for training our model is provided by Kaggle. (link: https://www.kaggle.com/c/titanic/data)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


Bad key "text.kerning_factor" on line 4 in
C:\Users\Parnashri\Anaconda_3\lib\site-packages\matplotlib\mpl-data\stylelib\_classic_test_patch.mplstyle.
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.1.3/matplotlibrc.template
or from the matplotlib source distribution


<h3>Data</h3>

In [2]:
passengers = pd.read_csv('passengers.csv')

In [3]:
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<h3>Cleaning the Data</h3>

Given the saying, “women and children first,” Sex and Age seem like good features to predict survival. Let’s map the text values in the Sex column to a numerical value. Updating Sex such that:

'female': 1       

'male': 0

In [4]:
passengers['Sex'] = passengers.Sex.apply(lambda x: 1 if x=='female' else 0)

In [5]:
passengers['Age'].values

array([22.  , 38.  , 26.  , 35.  , 35.  ,   nan, 54.  ,  2.  , 27.  ,
       14.  ,  4.  , 58.  , 20.  , 39.  , 14.  , 55.  ,  2.  ,   nan,
       31.  ,   nan, 35.  , 34.  , 15.  , 28.  ,  8.  , 38.  ,   nan,
       19.  ,   nan,   nan, 40.  ,   nan,   nan, 66.  , 28.  , 42.  ,
         nan, 21.  , 18.  , 14.  , 40.  , 27.  ,   nan,  3.  , 19.  ,
         nan,   nan,   nan,   nan, 18.  ,  7.  , 21.  , 49.  , 29.  ,
       65.  ,   nan, 21.  , 28.5 ,  5.  , 11.  , 22.  , 38.  , 45.  ,
        4.  ,   nan,   nan, 29.  , 19.  , 17.  , 26.  , 32.  , 16.  ,
       21.  , 26.  , 32.  , 25.  ,   nan,   nan,  0.83, 30.  , 22.  ,
       29.  ,   nan, 28.  , 17.  , 33.  , 16.  ,   nan, 23.  , 24.  ,
       29.  , 20.  , 46.  , 26.  , 59.  ,   nan, 71.  , 23.  , 34.  ,
       34.  , 28.  ,   nan, 21.  , 33.  , 37.  , 28.  , 21.  ,   nan,
       38.  ,   nan, 47.  , 14.5 , 22.  , 20.  , 17.  , 21.  , 70.5 ,
       29.  , 24.  ,  2.  , 21.  ,   nan, 32.5 , 32.5 , 54.  , 12.  ,
         nan, 24.  ,

 Filling all the empty Age values in passengers with the mean age.

In [6]:
passengers = passengers.fillna(value={'Age':passengers.Age.mean()})

Given the strict class system onboard the Titanic, let’s utilize the Pclass column, or the passenger class, as another feature.

Creating a new column named FirstClass that stores:

'1': for all passengers in first class 

'0': for all other passengers.

In [7]:
passengers['FirstClass'] = passengers.Pclass.apply(lambda x: 1 if x==1 else 0)

Creating a new column named SecondClass that stores:

'1': for all passengers in second class

'0': for all other passengers

In [8]:
passengers['SecondClass'] = passengers.Pclass.apply(lambda x: 1 if x==2 else 0)

In [9]:
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0


<h3>Selecting and Splitting the Data</h3>

Selecting the columns we want to build our model on.

In [10]:
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]

In [11]:
survival = passengers[['Survived']]

In [12]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(features, survival, train_size=0.8)

<h3>Normalizing the Data</h3>

Since sklearn‘s Logistic Regression implementation uses Regularization, we need to scale our feature data. 

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
ss = StandardScaler()

In [15]:
ss.fit_transform(xtrain, ytrain)
ss.transform(xtest)

array([[ 1.38493061, -0.4919706 , -0.54488848, -0.51007548],
       [ 1.38493061,  0.01976827,  1.8352379 , -0.51007548],
       [-0.72205784,  0.01976827, -0.54488848,  1.96049418],
       [-0.72205784,  0.01976827,  1.8352379 , -0.51007548],
       [-0.72205784,  0.01976827, -0.54488848,  1.96049418],
       [ 1.38493061, -1.25586056, -0.54488848, -0.51007548],
       [-0.72205784,  0.27191936, -0.54488848,  1.96049418],
       [ 1.38493061, -0.79752658, -0.54488848, -0.51007548],
       [-0.72205784, -0.72113759, -0.54488848, -0.51007548],
       [-0.72205784,  0.01976827, -0.54488848, -0.51007548],
       [-0.72205784,  2.18164425,  1.8352379 , -0.51007548],
       [-0.72205784, -1.02669357, -0.54488848, -0.51007548],
       [-0.72205784,  0.80664233,  1.8352379 , -0.51007548],
       [-0.72205784, -0.79752658, -0.54488848, -0.51007548],
       [ 1.38493061,  0.80664233, -0.54488848,  1.96049418],
       [-0.72205784, -0.56835959, -0.54488848, -0.51007548],
       [-0.72205784,  0.

<h3>Creating and Evaluating the Model</h3>

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
lr = LogisticRegression()

Fitting the model will perform gradient descent to find the feature coefficients that minimize the log-loss for the training data.

In [18]:
lr.fit(xtrain, ytrain)

  y = column_or_1d(y, warn=True)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [19]:
lr.score(xtrain, ytrain)

0.7949438202247191

Scoring the model on the training data will run the data through the model and make final classifications on survival for each passenger in the training set. The score returned is the percentage of correct classifications, or the accuracy.

In [20]:
lr.score(xtest, ytest)

0.8156424581005587

Similarly, scoring the model on the testing data will run the data through the model and make final classifications on survival for each passenger in the test set.

In [21]:
list(zip(['Sex','Age','FirstClass','SecondClass'],lr.coef_[0]))

[('Sex', 2.3996680080732298),
 ('Age', -0.03576943142722004),
 ('FirstClass', 2.0759755332147733),
 ('SecondClass', 1.2631192047205846)]

<h3>Predicting with the Model</h3>

Let’s use our model to make predictions on the survival of a few fateful passengers.

We have information for 3rd class passenger Jack and 1st class passenger Rose, stored in NumPy arrays. The arrays store 4 feature values, in the following order:

Sex: represented by a 0 for male and 1 for female

Age: represented as an integer in years

FirstClass: with a 1 indicating the passenger is in first class

SecondClass: with a 1 indicating the passenger is in second class

In [22]:
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])

In [23]:
You = np.array([1.0,23,1.0,0.0])

In [24]:
#Combining passenger arrays
sample_passengers = np.array([Jack, Rose, You])

Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on.

In [25]:
sample_passengers = ss.transform(sample_passengers)
sample_passengers

array([[-0.72205784, -0.72113759, -0.54488848, -0.51007548],
       [ 1.38493061, -0.95030457,  1.8352379 , -0.51007548],
       [ 1.38493061, -0.4919706 ,  1.8352379 , -0.51007548]])

Who will survive, and who will sink? 
Using our model’s .predict() method on sample_passengers to find out.

In [26]:
lr.predict_proba(sample_passengers)

array([[0.9907604 , 0.0092396 ],
       [0.0048189 , 0.9951811 ],
       [0.00489816, 0.99510184]])

The 1st column is the probability of a passenger perishing on the Titanic, and the 2nd column is the probability of a passenger surviving the sinking (which was calculated by our model to make the final classification decision).