# The Challenge:

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

# Notes:

| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them
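The age convention in the notes above can be turned into a quick check. The sketch below uses a small made-up DataFrame (the values are illustrative, not rows from train.csv), and the column names `is_infant` and `age_estimated` are hypothetical helpers, not part of the dataset.

```python
import pandas as pd

# Illustrative ages only; not rows from the competition data.
ages = pd.DataFrame({"Age": [0.42, 22.0, 28.5, 40.5, 0.75, None]})

# Infants: fractional ages below 1 year.
ages["is_infant"] = ages["Age"] < 1

# Estimated ages: at least 1 and ending in .5 (the "xx.5" convention).
ages["age_estimated"] = (ages["Age"] >= 1) & (ages["Age"] % 1 == 0.5)

print(ages)
```

Missing ages compare as False in both checks, so they are flagged as neither infant nor estimated.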

# 1. Import Libraries:

In [1]:
#DF
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix

#Common Model Algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

#Common Model Helpers
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score as auc
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

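#Deep Learning
import tensorflow as tf

# Standalone Keras (prints "Using TensorFlow backend." on import).
# `models` and `layers` are the names used to build the network below.
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras import models
from keras import layers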

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')

Using TensorFlow backend.


# 2. Meet and Greet Data:

This is the meet and greet step. Get to know your data by first name and learn a little bit about it. What does it look like (datatype and values), what makes it tick (independent/feature variable(s)), and what are its goals in life (dependent/target variable(s))? Think of it like a first date: get acquainted before you start poking at it.

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
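A first look at the data usually means checking a few rows, the column datatypes, and which column is the target. The sketch below does this on a tiny invented stand-in for train.csv so that it runs without the file; the rows are made up and only a handful of columns are included.

```python
import pandas as pd

# A tiny stand-in for train.csv; the rows are invented for illustration.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

print(sample.head())       # first rows: what the values look like
print(sample.dtypes)       # datatype of each column
print(sample.describe())   # summary statistics for numeric columns

# Separate the target (dependent variable) from the features.
target = "Survived"
features = [c for c in sample.columns if c != target]
print(features)
```

With the real file, the same three calls on `train` would give the equivalent overview.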

# 4. Modelling:

In [3]:
X = train.copy()
y = train['Survived']

In [4]:
X = X.drop('Survived',axis=1)

In [5]:
X.isnull().sum().sort_values(ascending = False).head(1) # column with the most missing values; a candidate to drop

Cabin    687
dtype: int64

In [6]:
X = X.drop('Cabin',axis =1)
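The decision to drop `Cabin` rests on how much of the column is missing. As a rough sketch, the share of missing values per column can be computed as below; `missing_fraction` is a hypothetical helper and the toy rows are invented. The last lines redo the arithmetic for the count reported above (687 missing) against the 891 rows of the training set.

```python
import pandas as pd

def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Share of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# Invented rows for illustration.
toy = pd.DataFrame({
    "Cabin": [None, "C85", None, None],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
fractions = missing_fraction(toy)
print(fractions)

# The counts reported in this notebook: 687 missing out of 891 rows.
cabin_share = 687 / 891
print(round(cabin_share, 3))  # 0.771
```

Roughly 77% of the `Cabin` values are missing, which is why imputing the column would mostly invent data.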

In [7]:
# Categorical (object) and numeric column names
cat = X.select_dtypes(include=['object']).columns.tolist()
num = X.select_dtypes(exclude=['object']).columns.tolist()
my_cols = cat + num

In [8]:
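# Numeric columns: fill missing values with the median, then standardize
num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
    ])

# Categorical columns: fill missing values with the mode, then one-hot encode
# (categories unseen during fit are encoded as all zeros)
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num),
        ('cat', cat_transformer, cat),
        ])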

In [9]:
X_prepared = preprocessor.fit_transform(X)

print(X_prepared.shape)

(891, 1583)
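The width of 1583 columns is much larger than the number of raw features because one-hot encoding creates one column per distinct value, and the object columns here include free-text fields such as `Name` and `Ticket`. `Name` alone is likely to contribute about one column per passenger. The sketch below, on invented rows, shows how the expected width can be estimated and how high-cardinality columns could be filtered out; the `max_levels` threshold is an assumption for illustration, not something this notebook uses.

```python
import pandas as pd

# Invented rows; Name is unique per passenger, Ticket nearly so.
toy = pd.DataFrame({
    "Name": ["A. Smith", "B. Jones", "C. Brown", "D. White"],
    "Ticket": ["T1", "T2", "T2", "T3"],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
cat_cols = ["Name", "Ticket", "Sex"]
num_cols = ["Fare"]

# One-hot encoding creates one column per distinct value.
per_column = toy[cat_cols].nunique()
expected_width = int(per_column.sum()) + len(num_cols)
print(per_column)
print(expected_width)  # 10

# Hypothetical filter: keep categorical columns with few distinct values.
max_levels = 2  # assumed threshold, not from the notebook
low_card = [c for c in cat_cols if per_column[c] <= max_levels]
print(low_card)  # ['Sex']
```

Columns that are unique per row carry no information the model can generalize from, so dropping or re-engineering them is a common choice.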


In [10]:
X_train, X_valid, y_train, y_valid = train_test_split(X_prepared, y, test_size=0.20, train_size=0.80, random_state=42)

In [11]:
print("Data Shape: {}".format(train.shape))
print("X_train Shape: {}".format(X_train.shape))
print("y_train Shape: {}".format(y_train.shape))
print("X_valid Shape: {}".format(X_valid.shape))

Data Shape: (891, 12)
X_train Shape: (712, 1583)
y_train Shape: (712,)
X_valid Shape: (179, 1583)
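Note that the preprocessor here is fitted on all 891 rows before the split, so the validation rows influence the imputation and scaling statistics. A common alternative, sketched below on synthetic data and not what this notebook does, is to split first and fit the transformers on the training rows only. The variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features with a few missing values (illustration only).
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(100, 3))
X_raw[::10, 0] = np.nan
y_raw = rng.integers(0, 2, size=100)

# Split the raw data first...
X_tr, X_va, y_tr, y_va = train_test_split(
    X_raw, y_raw, test_size=0.20, random_state=42
)

# ...then learn medians and scaling from the training rows only.
num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_tr_prepared = num_pipe.fit_transform(X_tr)
X_va_prepared = num_pipe.transform(X_va)  # reuse training statistics

print(X_tr_prepared.shape, X_va_prepared.shape)
```

The validation score then reflects how the whole pipeline behaves on rows it has never seen.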


In [12]:
y_train = np.asarray(y_train).astype('float32')
y_valid = np.asarray(y_valid).astype('float32')

# 5. Keras Deep Learning:

In [13]:
model = models.Sequential()

model.add(layers.Dense(24, activation='relu', input_shape=(X_train.shape[1],)))
model.add(layers.Dense(24, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
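To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-unit pair plus one bias per unit. The sketch below does that arithmetic in plain Python; `dense_params` is a hypothetical helper, and the input width of 1583 is taken from the shapes printed earlier.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 1583          # width of X_train reported above
layer_units = [24, 24, 1]  # the three Dense layers of the model

total = 0
n_in = n_features
for units in layer_units:
    total += dense_params(n_in, units)
    n_in = units

print(total)  # 38641
```

That is 38,641 parameters fitted on 712 training rows, so the validation metrics deserve close attention for signs of overfitting.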

In [14]:
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [15]:
model.fit(X_train,y_train,epochs=20,validation_data=(X_valid, y_valid))

Train on 712 samples, validate on 179 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x15284946548>
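The training log alone does not show which kinds of errors the model makes. One way to look closer is a confusion matrix on the validation split. The sketch below uses invented arrays as stand-ins for `y_valid` and `model.predict(X_valid)`, so it runs without the trained model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-ins for y_valid and model.predict(X_valid); values are invented.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
probs = np.array([[0.1], [0.6], [0.8], [0.4], [0.2], [0.9], [0.3], [0.7]])

# Turn sigmoid outputs into class labels with a 0.5 cut-off.
y_pred = (probs[:, 0] > 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
accuracy = (y_true == y_pred).mean()
print(cm)
print(accuracy)  # 0.75
```

In the matrix, the off-diagonal cells count passengers predicted to survive who did not, and the reverse.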

In [16]:
X_test = test[my_cols].copy()

In [17]:
X_test_prepared = preprocessor.transform(X_test)



In [18]:
predictions = model.predict(X_test_prepared)

In [19]:
predictions_2 = predictions > 0.5
predictions_2 = predictions_2.astype(int)

In [20]:
len(predictions_2[:,0])

418

In [21]:
output = pd.DataFrame({'PassengerId': X_test.PassengerId,
                       'Survived': predictions_2[:,0]})

output.to_csv('submission.csv', index=False)
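Before uploading, it can help to sanity-check the submission frame. The sketch below is one possible check, with `check_submission` as a hypothetical helper and an invented three-row frame; with the real `output`, the expected row count would be the 418 reported above.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["PassengerId", "Survived"]
    assert len(df) == expected_rows
    assert df["PassengerId"].is_unique
    assert set(df["Survived"].unique()) <= {0, 1}

# Invented example; with the real file, expected_rows would be 418.
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(demo, expected_rows=3)
print("submission looks valid")
```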