# Modelling Titanic Survivorship

For today's layup lines, you will use a dataset that originated from Kaggle. We will not be entering the competition, but the exercise is a solid checkpoint of where you stand in the data science workflow. By the end, you should have a good idea of the areas where you excel and the areas that need some review.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")

In [None]:
# Read the data into a DataFrame.
df = pd.read_csv('data/titanic.csv')

In [None]:
df.head(2)

### Data prep

In [None]:
# Do we have missing data?
df.isnull().sum()

### Feature engineering


In [None]:
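# Let's fill missing Age values with the column mean.
# Assign the result back; calling fillna(inplace=True) on a column selection is deprecated in recent pandas.
df['Age'] = df['Age'].fillna(df['Age'].mean())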

In [None]:
# Convert Sex to binary
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
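
One thing worth checking after a `map` call like the one above: any value that is not a key in the dictionary becomes `NaN`, and re-running the cell on an already converted column turns every value into `NaN`. The sketch below illustrates this on a small made-up Series (the values are hypothetical, not rows from the Titanic file).

```python
import pandas as pd

# Hypothetical values for illustration; not rows from the Titanic file.
sex_demo = pd.Series(['male', 'female', 'Female', None])

mapping = {'male': 1, 'female': 0}
mapped = sex_demo.map(mapping)

# 'Female' (capitalised) and None are not keys in the mapping, so both become NaN.
n_unmapped = int(mapped.isnull().sum())
print("unmapped values:", n_unmapped)

# Applying the same mapping a second time finds no matching keys at all,
# which is what happens when the notebook cell is run twice.
remapped = mapped.map(mapping)
print("all NaN after second map:", bool(remapped.isnull().all()))
```

A quick `df['Sex'].isnull().sum()` after the conversion is a cheap way to catch either problem.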

### Feature selection

Note: you _must_ name your feature list `feature_cols`, because those features will be used in your final prediction.

In [None]:
# Create X and y.
feature_cols = ['Pclass', 'Age', 'Sex']

# The DataFrame loaded above is named df; df_train is never defined.
X = df[feature_cols]
y = df['Survived']

### Train-Test-Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Standardize data (not needed for tree-based models; included here for reference)

Why do we need to standardize after splitting our data? Why not before?

"By fitting the scaler on the full dataset prior to splitting (option #1), information about the test set is used to transform the training set, which in turn is passed downstream."

See https://datascience.stackexchange.com/questions/38395/standardscaler-before-and-after-splitting-data for a discussion.

In [None]:
# from sklearn.preprocessing import MinMaxScaler

# scaler = MinMaxScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)
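
To make the leakage point concrete, here is a minimal sketch comparing a scaler fitted on all rows with one fitted on the training rows only. The data is a synthetic stand-in for a numeric column such as `Age` (an assumption for illustration, not the Titanic file), and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for one numeric feature (assumption, not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=30, scale=12, size=(200, 1))

X_demo_train, X_demo_test = train_test_split(X_demo, test_size=0.3, random_state=42)

# Leaky: the scaler sees the test rows when it learns the min and max.
leaky_scaler = MinMaxScaler().fit(X_demo)

# Clean: the scaler learns the min and max from the training rows only.
clean_scaler = MinMaxScaler().fit(X_demo_train)

print("leaky min/max:", leaky_scaler.data_min_, leaky_scaler.data_max_)
print("clean min/max:", clean_scaler.data_min_, clean_scaler.data_max_)

# The test rows are transformed with the training statistics, never refitted.
X_demo_test_scaled = clean_scaler.transform(X_demo_test)
print("scaled test range:", X_demo_test_scaled.min(), X_demo_test_scaled.max())
```

With the clean scaler, scaled test values can fall slightly outside the 0 to 1 range. That is expected: it reflects that the test rows were unseen when the scaler was fitted.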

### Run a few models (classifiers)

`Survived` is a binary label, so this is a classification problem. Try out a number of different classification techniques and adjust the hyperparameters to find the best fit.

In [None]:
# import 
from sklearn.tree import DecisionTreeClassifier

# instantiate
DT = DecisionTreeClassifier(random_state=0)

# fit
DT.fit(X_train, y_train)

# predict
y_pred = DT.predict(X_test)
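
Trying several techniques could look like the loop below, which scores a few candidate classifiers with 5-fold cross-validation. This is only a sketch: the data comes from `make_classification` as a stand-in for the three Titanic features, and the choice of candidates, `cv=5`, and the `*_demo` names are assumptions rather than part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data with three features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Candidate models to compare; swap in any other scikit-learn classifier.
candidates = {
    'decision_tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

# Mean cross-validated accuracy for each candidate.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(f"{name}: {cv_scores[name]:.3f}")
```

On the real data you would pass `X_train` and `y_train` instead of the synthetic arrays and keep the test set untouched until the final check.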

### Score your model

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
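
An accuracy number is easier to interpret next to a baseline. The sketch below uses small hand-written label arrays (hypothetical, not model output from this notebook) to compare a model's accuracy with always predicting the majority class, and prints the confusion matrix for a per-class view.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for ten passengers.
y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])

model_acc = accuracy_score(y_true_demo, y_pred_demo)

# Baseline: always predict the most common class in the labels.
majority_class = np.bincount(y_true_demo).argmax()
baseline_pred = np.full_like(y_true_demo, majority_class)
baseline_acc = accuracy_score(y_true_demo, baseline_pred)

print("model accuracy:   ", model_acc)
print("baseline accuracy:", baseline_acc)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In the notebook, the same comparison would use `y_test` and `y_pred` in place of the demo arrays.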

### Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Each hyperparameter appears once; a repeated dict key silently overrides the earlier entry.
parameters = {'max_depth': range(1, 20, 2),
              'min_samples_split': range(3, 20, 2)}

DT = DecisionTreeClassifier()

grid_search = GridSearchCV(DT, parameters)

grid_search.fit(X_train, y_train)

In [None]:
# What is the best estimator?
grid_search.best_estimator_

In [None]:
# What was the best accuracy score you got?
print(grid_search.best_score_)

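
Note that `best_score_` is the mean cross-validated score of the best parameter combination on the training folds, not the accuracy on the held-out test set. The sketch below shows both numbers side by side. It runs on synthetic data from `make_classification` with a smaller grid (both are assumptions, chosen so the example runs quickly on its own), and the `*_demo` and `Xd_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# A reduced grid over the same two hyperparameters tuned above.
demo_grid = {'max_depth': range(1, 10, 2),
             'min_samples_split': range(3, 12, 4)}

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), demo_grid, cv=5, scoring='accuracy'
)
demo_search.fit(Xd_train, yd_train)

print("best params:      ", demo_search.best_params_)
print("mean CV accuracy: ", demo_search.best_score_)

# score() uses the refitted best estimator on data the search never saw.
test_acc = demo_search.score(Xd_test, yd_test)
print("held-out accuracy:", test_acc)
```

In the notebook, `grid_search.score(X_test, y_test)` would give the corresponding held-out figure, provided the cell that fits `grid_search` has been run first.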