#### Predict Titanic Survival
The RMS Titanic set sail on its maiden voyage in 1912, bound from Southampton, England, for New York City. The ship never completed the voyage: it struck an iceberg and sank in the Atlantic Ocean, killing 1,502 of the 2,224 passengers and crew on board.

In this project, you will build a logistic regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.
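
As a rough illustration of what a logistic regression model computes, the sketch below passes a weighted sum of features through the sigmoid function to get a probability between 0 and 1. The weights, intercept, and feature values are made-up numbers chosen for illustration only; they are not taken from the Titanic model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and one feature row (illustration only)
weights = np.array([1.0, -0.5])
intercept = -0.2
features = np.array([0.8, 1.5])

# Linear combination of the features, then squashed into a probability
z = np.dot(weights, features) + intercept
probability = sigmoid(z)
print(probability)
```

A probability above 0.5 would be read as "survived" and one below 0.5 as "did not survive".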

The data we will be using for training our model is provided by Kaggle. Feel free to make the model better on your own and submit it to the Kaggle Titanic competition!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
# Load the passenger data
passengers = pd.read_csv('datasets/titanic_data.csv')
print(passengers)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

In [5]:
# Encode the Sex column numerically: 1 = female, 0 = male
passengers['Sex'] = (passengers['Sex'] == 'female').astype(int)
print(passengers)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name  Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    0  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1   
2                               Heikkinen, Miss. Laina    1  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  35.0      1   
4                             Allen, Mr. William Henry    0  35.0      0   
..                                                 ...  ...   ...    ...   
886        

In [7]:
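# Fill the NaN values in the Age column with the mean age
# print(passengers['Age'].values)
# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())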
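
To see what mean imputation does, here is a minimal sketch on a toy frame. The ages below are hypothetical values, not rows from the Titanic file; the point is only to show the missing-value count before and after the fill.

```python
import numpy as np
import pandas as pd

# Toy data with one missing age (hypothetical values, not from the Titanic file)
toy = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0]})

# Count missing values before filling
missing_before = toy['Age'].isna().sum()

# mean() skips NaN by default, so the fill value is the mean of the known ages
mean_age = toy['Age'].mean()
toy['Age'] = toy['Age'].fillna(mean_age)

# Count missing values after filling
missing_after = toy['Age'].isna().sum()
print(missing_before, missing_after, mean_age)
```

Filling with the mean keeps every row usable for training, at the cost of treating all passengers with unknown age as average-aged.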

In [8]:
# Create a FirstClass indicator column (1 if Pclass == 1, else 0)
passengers['FirstClass'] = (passengers['Pclass'] == 1).astype(int)

In [9]:
# Create a SecondClass indicator column (1 if Pclass == 2, else 0)
passengers['SecondClass'] = (passengers['Pclass'] == 2).astype(int)
print(passengers.head(3))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name  Sex   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    0  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1      0   
2                             Heikkinen, Miss. Laina    1  26.0      0      0   

             Ticket     Fare Cabin Embarked  FirstClass  SecondClass  
0         A/5 21171   7.2500   NaN        S           0            0  
1          PC 17599  71.2833   C85        C           1            0  
2  STON/O2. 3101282   7.9250   NaN        S           0            0  
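
The two indicator columns are a hand-rolled one-hot encoding of `Pclass` in which third class is the implied baseline. As an alternative sketch, not the approach used in this notebook, `pd.get_dummies` can produce the same columns. The toy rows below are hypothetical.

```python
import pandas as pd

# Toy frame covering the three passenger classes (hypothetical rows)
toy = pd.DataFrame({'Pclass': [3, 1, 2, 1]})

# One indicator column per class, stored as 0/1 integers
dummies = pd.get_dummies(toy['Pclass'], prefix='Pclass', dtype=int)

# Keep first and second class; third class is the implied baseline
toy['FirstClass'] = dummies['Pclass_1']
toy['SecondClass'] = dummies['Pclass_2']
print(toy)
```

Leaving out the third-class column avoids a redundant feature: a passenger who is in neither first nor second class must be in third.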


In [10]:
# Select the desired features
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers['Survived']

In [11]:
# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(features, survival, test_size = 0.3, random_state = 123)

# Scale the feature data so it has mean = 0 and standard deviation = 1
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
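
The scaler is fit on the training data only and then reused on the test data, so no information from the test set leaks into the scaling. A small sketch of that behavior, using hypothetical `[Sex, Age]` rows and hypothetical `toy_*` names:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test rows: [Sex, Age] (hypothetical values)
toy_train = np.array([[0.0, 20.0], [1.0, 30.0], [0.0, 40.0], [1.0, 50.0]])
toy_test = np.array([[1.0, 35.0]])

toy_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses those training statistics on unseen rows
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # per-column mean of the scaled training data
print(toy_train_scaled.std(axis=0))   # per-column standard deviation
print(toy_test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the scaled test rows generally do not, because they are measured against the training statistics.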

In [12]:
# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Score the model on the train data
train_score = model.score(X_train, y_train)
print(train_score)

0.7865168539325843


In [13]:
# Score the model on the test data
test_score = model.score(X_test, y_test)
print(test_score)

0.7798507462686567


In [14]:
# Inspect the coefficients (order: Sex, Age, FirstClass, SecondClass)
print(model.coef_)

[[ 1.20761751 -0.49375661  0.9867789   0.41433824]]
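
The raw array is easier to read when each coefficient is paired with its feature name. The sketch below copies the printed coefficients and ranks them by absolute size. Because the features were standardized, each coefficient describes the effect of a one-standard-deviation change in that feature, so treat the odds ratios as a rough guide rather than a per-unit effect.

```python
import numpy as np

# Feature order matches the columns selected for training
feature_names = ['Sex', 'Age', 'FirstClass', 'SecondClass']
# Coefficients copied from the printed model.coef_ output
coefficients = np.array([1.20761751, -0.49375661, 0.9867789, 0.41433824])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-standard-deviation increase in the scaled feature
odds_ratios = np.exp(coefficients)

# List features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefficients))
for i in order:
    print(f"{feature_names[i]:<12} coef={coefficients[i]:+.3f}  odds ratio={odds_ratios[i]:.2f}")
```

Sex has the largest coefficient, followed by FirstClass; Age is the only feature with a negative coefficient, meaning older passengers were predicted as less likely to survive.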


In [15]:
# Sample passenger features, in the order [Sex, Age, FirstClass, SecondClass]
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
You = np.array([0.0, 30.0, 0.0, 1.0])

In [16]:
# Combine passenger arrays
sample_passengers = np.array([Jack, Rose, You])

# Scale the sample passenger features
sample_passengers = scaler.transform(sample_passengers)

In [17]:
# Make survival predictions! Columns are [P(did not survive), P(survived)]
survival_probabilities = model.predict_proba(sample_passengers)
print(survival_probabilities)

[[0.86568337 0.13431663]
 [0.04315545 0.95684455]
 [0.7736865  0.2263135 ]]
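
Each row holds the two class probabilities for one sample passenger (Jack, Rose, You), with the columns ordered by class label: 0, then 1. The sketch below copies the printed probabilities and shows how class labels follow from them by picking the more probable class in each row.

```python
import numpy as np

# Probabilities copied from the printed predict_proba output
# Column 0: P(Survived = 0), column 1: P(Survived = 1)
probabilities = np.array([
    [0.86568337, 0.13431663],   # Jack
    [0.04315545, 0.95684455],   # Rose
    [0.7736865, 0.2263135],     # You
])

# Each row sums to 1, up to rounding in the printed values
row_sums = probabilities.sum(axis=1)
print(row_sums)

# Picking the more probable class per row gives the class labels
labels = probabilities.argmax(axis=1)
print(labels)
```

For a two-class model this is the same as predicting survival whenever the second column exceeds 0.5.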


In [18]:
# Predict the class labels (1 = survived, 0 = did not survive)
predictions = model.predict(sample_passengers)
print(predictions)

[0 1 0]
