# Introduction

Like many people, I picked the Titanic Challenge as my first Kaggle competition; it is also my first machine learning project outside of econometrics courses. In this project I walk through the basics of dealing with a binary classification problem. I can't thank Stack Overflow and various data science blogs enough for teaching me how to tune hyperparameters using grid and random search. I can't remember all of the specific sites, but I have the basics down now and can't wait to apply them to a new project when I have the time.

# Setting the Environment (loading up the basics)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
%matplotlib inline
import seaborn as sns # data visualization
sns.set()

# Importing the data

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# Data Exploration

This section looks at the shape of the data (how many observations and variables) and at basic descriptions of the values for each variable.

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.describe()

In [None]:
train.describe(include = ['O'])

In [None]:
train.info()

In [None]:
train.isnull().sum()

In [None]:
test.shape

In [None]:
test.describe()

In [None]:
test.describe(include = ['O'])

In [None]:
test.info()

In [None]:
test.isnull().sum()

# Finding Relationships between Independent Variables and Survival

Quick note: PassengerId and Name are identifiers and should not affect survival.

In [None]:
survived = train[train.Survived == 1]
not_survived = train[train.Survived == 0]
print('Survived: %i (%.1f%%)' %(len(survived), float(100*len(survived)/(len(survived)+len(not_survived)))))
print('Did not Survive: %i (%.1f%%)' %(len(not_survived), float(100*len(not_survived)/(len(survived)+len(not_survived)))))

## Pclass vs. Survival

In [None]:
print(train.Pclass.value_counts())
# Gives the number of people in each class

print(train.groupby('Pclass').Survived.value_counts())
# Counts the number of people in each class who survived and did not survive

print(train[['Pclass', 'Survived']].groupby('Pclass', as_index = False).mean())
# Counts the proportion of people in each class that survived (1 = Survived and 0 = Did Not Survive, so mean = proportion who survived)

In [None]:
#train.groupby('Pclass').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'Pclass', y = 'Survived', data = train)

The results suggest that people in 1st class were most likely to survive, followed by those in 2nd class. Those in 3rd class were the least likely to survive.

Perhaps this is a result of 1st class being the closest to the deck, followed by 2nd class, while 3rd class is near the bottom of the ship.
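
The "mean = proportion" trick used in the cell above can be checked on a tiny table. This is a minimal sketch on made-up numbers (the `toy` frame is an assumption for illustration, not the Titanic file): the mean of a 0/1 column within each group matches survivors divided by passengers.

```python
import pandas as pd

# Toy data, made up for illustration (not the Titanic file)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = share of ones within each group
rate_by_mean = toy.groupby("Pclass")["Survived"].mean()

# Same quantity computed explicitly as survivors / passengers
counts = toy.groupby("Pclass")["Survived"].agg(["sum", "count"])
rate_by_count = counts["sum"] / counts["count"]

print(rate_by_mean)
print(rate_by_count)
```

Both printed series should agree, which is why the shorter `.mean()` form is enough in the cells above.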

## Sex vs. Survival

In [None]:
print(train.Sex.value_counts())
# Displays the number of each sex

print(train.groupby('Sex').Survived.value_counts())
# Displays the number of each sex that survived and did not survive

print(train.groupby('Sex').Survived.mean())
# Displays the proportion of each sex that survived or did not survive

In [None]:
#train.groupby('Sex').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'Sex', y = 'Survived', data = train)

Females were far more likely than males to survive the Titanic. I hypothesize this is due to the "women and children first" policy.

## Age vs. Survival

In [None]:
train.Age.describe()

In [None]:
age = train.Age.dropna()
sns.distplot(age, bins = 25, kde = False)

The passengers on the Titanic tended to be younger adults in their 20s and 30s. Among the children, the age skews to very young.

In [None]:
# Age bands: 1 = up to 16, 2 = (16, 32], 3 = (32, 48], 4 = (48, 64], 5 = (64, 80]; 0 marks a missing Age
train['AgeBand'] = np.where(train.Age <= 16, 1, 
                            np.where((train.Age > 16) & (train.Age <= 32), 2, 
                                     np.where((train.Age > 32) & (train.Age <= 48), 3, 
                                              np.where((train.Age > 48) & (train.Age <= 64), 4, 
                                                      np.where((train.Age > 64) & (train.Age <= 80), 5, 0)))))

In [None]:
print(train.AgeBand.value_counts())
# Displays the number of passengers in each age band

print(train.groupby('AgeBand').Survived.value_counts())
# Displays the number in each age band that survived and did not survive

print(train.groupby('AgeBand').Survived.mean())
# Displays the proportion of each age band that survived

In [None]:
train.groupby('AgeBand').Age.describe()

In [None]:
#train.groupby('AgeBand').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'AgeBand', y = 'Survived', data = train)

People aged 16 and under were the most likely group to survive, supporting the "women and children first" hypothesis. The AgeBand variable is used only for this analysis and will not be used in the final model.
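
For reference, the same banding can be written more compactly with `pd.cut`. This is only a sketch on made-up ages (the `ages` series is an assumption, not the real column). One difference from the nested `np.where` version: missing ages stay missing here instead of landing in a separate band 0.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
ages = pd.Series([4, 16, 17, 32, 40, 64, 71, np.nan], name="Age")

# Right-closed bins: (0, 16], (16, 32], (32, 48], (48, 64], (64, 80]
bins = [0, 16, 32, 48, 64, 80]
age_band = pd.cut(ages, bins=bins, labels=[1, 2, 3, 4, 5])

print(age_band.tolist())
```

Whether missing ages should be dropped or kept as their own group depends on the question being asked; for a quick survival-rate plot either choice seems defensible.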

## SibSp vs. Survival

In [None]:
train.SibSp.describe()

Most people came without a spouse or sibling.

In [None]:
print(train.SibSp.value_counts())
# Displays the number of passengers for each SibSp value

print(train.groupby('SibSp').Survived.value_counts())
# Displays the number for each SibSp value that survived and did not survive

print(train.groupby('SibSp').Survived.mean())
# Displays the proportion for each SibSp value that survived

People with 2 or fewer siblings or spouses on the Titanic (assuming a big chunk of the people with SibSp = 1 were traveling with a spouse) were more likely to survive than those with many. I hypothesize this is because it is difficult to round up a big group in a crisis. But it does appear better to have 1 or 2 siblings and/or a spouse with you than to be alone.

In [None]:
#train.groupby('SibSp').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'SibSp', y = 'Survived', data = train)

## Parch vs. Survival

In [None]:
train.Parch.describe()

Most people came without parents or children.

In [None]:
print(train.Parch.value_counts())
# Displays the number of passengers for each Parch value

print(train.groupby('Parch').Survived.value_counts())
# Displays the number for each Parch value that survived and did not survive

print(train.groupby('Parch').Survived.mean())
# Displays the proportion for each Parch value that survived

Having 3 or more parents or children with you is associated with high survival rates, but some number combinations have small sample sizes so there is more variation in those groups.

In [None]:
#train.groupby('Parch').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'Parch', y = 'Survived', data = train)

## Embarked Point vs. Survival

In [None]:
train.Embarked.describe(include = ['O'])

In [None]:
print(train.Embarked.value_counts())
# Displays the number of passengers from each embarkation point

print(train.groupby('Embarked').Survived.value_counts())
# Displays the number from each embarkation point that survived and did not survive

print(train.groupby('Embarked').Survived.mean())
# Displays the proportion from each embarkation point that survived

Those who embarked at Southampton and those who embarked at Cherbourg (more variation here) were the most likely to survive.

In [None]:
#train.groupby('Embarked').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'Embarked', y = 'Survived', data = train)

## FamilySize vs. Survival

In [None]:
train['FamilySize'] = train.SibSp + train.Parch

In [None]:
train.FamilySize.describe()

The plurality of people on the Titanic were alone, at least in this sample.

In [None]:
print(train.FamilySize.value_counts())
# Displays the number of passengers for each family size

print(train.groupby('FamilySize').Survived.value_counts())
# Displays the number for each family size that survived and did not survive

print(train.groupby('FamilySize').Survived.mean())
# Displays the proportion for each family size that survived

Having 3 or fewer family members is associated with higher likelihood of survival, and being alone is more dangerous than having another person or 2 people with you.

In [None]:
#train.groupby('FamilySize').Survived.mean().plot(kind = 'bar')
sns.barplot(x = 'FamilySize', y = 'Survived', data = train)

## Fare

In [None]:
train.Fare.describe()

In [None]:
sns.distplot(train.Fare, bins = 15, kde = False)

I took the natural log of the fare to try to make the data less spread out. You can see above that the mass of the distribution sits on the left side, with the mean a little off the y-axis, like a log-normal distribution.
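
To put a number on "less spread out", here is a rough sketch that measures skewness before and after the log. The `fares` array is synthetic (log-normal draws, an assumption standing in for the real Fare column) and `skewness` is a small helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "fares" (log-normal draws), not the real column
fares = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

raw_skew = skewness(fares)
log_skew = skewness(np.log(fares))
print(f"skew before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness near zero after the transform is what a log-normal-like variable should give; the real Fare column also contains zeros, which is why the cell below guards the log with `np.where`.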

In [None]:
train['logFare'] = np.where(train.Fare != 0, np.log(train.Fare), train.Fare)

In [None]:
sns.distplot(train.logFare, bins = 15)

# Creating a Model

## Data Pre-Processing

I reload and process the data here so I can skip to this section when I take breaks and come back to working on it.

In [None]:
from sklearn.impute import SimpleImputer

#Loads the data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

#Creates the FamilySize and logFare variables
train['FamilySize'] = train.SibSp + train.Parch
train['logFare'] = np.where(train.Fare != 0, np.log(train.Fare), train.Fare)
test['FamilySize'] = test.SibSp + test.Parch
test['logFare'] = np.where(test.Fare != 0, np.log(test.Fare), test.Fare)

#Puts the features that should have no effect on survival in a list
cols_to_drop = ['Name', 'Ticket', 'Cabin', 'PassengerId']

#Drops the aforementioned features
train = train.drop(cols_to_drop, axis=1)
X_test = test.drop(cols_to_drop, axis=1)

#Creates boolean variables for categorical features
train_data = pd.get_dummies(train)
X_test = pd.get_dummies(X_test)

#Creates the training feature matrix and the training target vector
X_train = train_data.drop('Survived', axis=1)
y_train = train_data.Survived

#Replaces missing values with the training-set averages
#(the imputer is fit on the training data only and then applied to the test data)
my_imputer = SimpleImputer()
X_train = my_imputer.fit_transform(X_train)
X_test = my_imputer.transform(X_test)
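
A note on the imputation step above: this is a minimal sketch, on made-up numbers, of how `SimpleImputer` behaves when it is fit on the training data and then only applied to the test data. The arrays `X_tr` and `X_te` are illustrative assumptions, not the Titanic columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with missing values
X_tr = np.array([[10.0], [20.0], [np.nan]])   # training mean of observed values = 15
X_te = np.array([[100.0], [np.nan]])          # observed test value = 100

imputer = SimpleImputer()            # default strategy is "mean"
imputer.fit(X_tr)                    # learn the mean from the training data only
X_te_filled = imputer.transform(X_te)

# The missing test value is filled with the training mean, not the test mean
print(X_te_filled.ravel())
```

Calling `fit_transform` on the test set instead would fill the gap with 100, so the two sets would be imputed with different values for the same feature.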

## Importing Model Modules

In [None]:
from xgboost import XGBClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

#Holds out 30% of the training data for scoring models and for model selection
train_X, test_X, train_y, test_y = train_test_split(X_train, y_train, train_size = 0.7, test_size = 0.3, random_state = 0)

### Support Vector Machine

Below the hyperparameters are selected.

In [None]:
#Defines a function to return the best parameters for the SVC model
#Note: gamma is ignored by the linear kernel, so only C affects the fit here
def svc_param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    param_grid = {'C': Cs, 'gamma' : gammas}
    grid_search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_

svc_param_selection(X_train, y_train, 5)

I used an article (https://medium.com/@aneesha/svm-parameter-tuning-in-scikit-learn-using-gridsearchcv-2413c02125a0) to learn how to choose the hyperparameters for the SVC model. The selected values are C = 0.01 and gamma = 0.001.
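
One caveat about the search above, as far as I understand scikit-learn's `SVC`: gamma only enters the rbf, poly and sigmoid kernels, so with `kernel='linear'` it should not change the fitted model. The sketch below checks this on synthetic data from `make_classification` (an assumption standing in for the Titanic features).

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Same C and linear kernel, very different gamma values
small_gamma = svm.SVC(kernel="linear", C=0.01, gamma=0.001).fit(X, y)
large_gamma = svm.SVC(kernel="linear", C=0.01, gamma=1.0).fit(X, y)

# Compare the decision values of the two fitted models
same = np.allclose(small_gamma.decision_function(X),
                   large_gamma.decision_function(X))
print("identical decision values:", same)
```

If the two models agree, the gamma reported by the grid search is simply the first value tried among ties, and only C carries information.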

Below the model is tested for accuracy using cross-validation with 10 folds.

In [None]:
my_svc_model = svm.SVC(C = 0.01, kernel ='linear', gamma = 0.001)
my_svc_model.fit(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(my_svc_model, X_train, y_train, cv=kfold)
print("SVM Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
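
For readers new to the pattern, here is a self-contained sketch of the same 10-fold scoring on synthetic data. The data from `make_classification` and the choice of `LogisticRegression` are assumptions made so the example runs without the Kaggle files.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# shuffle=True is needed for random_state to have any effect
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# One accuracy score per fold: train on 9 folds, score on the held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("fold accuracies:", np.round(scores, 2))
print("mean: %.2f%%  std: %.2f%%" % (scores.mean() * 100, scores.std() * 100))
```

The mean summarizes expected accuracy and the standard deviation shows how much it moves from fold to fold, which is the pair of numbers printed for every model in this section.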

### Random Forest

Below I use random search instead of grid search to do the hyperparameter tuning because a grid search would take forever.

In [None]:
# Number of trees: 10 evenly spaced values from 200 to 2000 (steps of 200)
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' meant the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['sqrt', 'log2']
# Sets a maximum number of levels in each tree to avoid overfitting
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Sets a minimum number of observations to allow a node to make a split
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


my_rf_model = RandomForestClassifier()
my_rf_model = RandomizedSearchCV(estimator = my_rf_model, param_distributions = random_grid, n_iter = 100, cv = 5, verbose=2, random_state=7, n_jobs = -1)
my_rf_model.fit(X_train, y_train)
my_rf_model.best_params_
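
To see why a full grid search would "take forever" here, a quick back-of-the-envelope count helps. The sketch below only uses the number of candidate values per hyperparameter from the grid above; the dictionary `grid_sizes` is written by hand for this illustration.

```python
import math

# Number of candidate values per hyperparameter in the random grid
grid_sizes = {
    "n_estimators": 10,
    "max_features": 2,
    "max_depth": 12,          # 11 depths plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}
cv_folds = 5
n_iter = 100

# Every combination would be fit once per fold in a grid search
full_grid = math.prod(grid_sizes.values())
print("combinations in the full grid:", full_grid)
print("grid search fits:  ", full_grid * cv_folds)
print("random search fits:", n_iter * cv_folds)
```

The random search samples 100 of the 4,320 combinations, so it needs 500 forest fits instead of 21,600.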

In [None]:
my_forest_model = RandomForestClassifier(n_estimators = 200,
                                         min_samples_split = 5,
                                         min_samples_leaf = 4,
                                         max_features = 'sqrt',
                                         max_depth = 80,
                                         bootstrap = True)
my_forest_model.fit(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(my_forest_model, X_train, y_train, cv=kfold)
print("Random Forest Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

### K-Nearest Neighbors Classifier

Below the hyperparameters are selected.

In [None]:
#Creates a list of possible ks from 1 to 30
k_range = list(range(1, 31))
#Creates a list of 2 possible weighting options
weight_options = ['uniform', 'distance']
#Creates a dictionary containing the k_range and weight_options that is the parameter grid
param_grid = {'n_neighbors': k_range, 'weights': weight_options}

my_knn_model = KNeighborsClassifier(algorithm = 'brute')
clf = GridSearchCV(my_knn_model, param_grid, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)

Below the model is tested for accuracy using cross-validation with 10 folds.

In [None]:
my_knn_model = KNeighborsClassifier(n_neighbors = 13, weights = 'distance')
my_knn_model.fit(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(my_knn_model, X_train, y_train, cv=kfold)
print("Knn Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

### Gaussian Naive Bayes Classifier

In [None]:
my_gnb_model = GaussianNB()
my_gnb_model.fit(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(my_gnb_model, X_train, y_train, cv=kfold)
print("GNB Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

### Logistic Regression

In [None]:
my_logit_model = LogisticRegression(solver = 'liblinear')
my_logit_model.fit(X_train, y_train)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(my_logit_model, X_train, y_train, cv=kfold)
print("Logistic Regression Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

### Extreme Gradient Boosted Tree

A grid search is used to find the best learning rate, max depth and min child weight.

In [None]:
my_xgb_model = XGBClassifier()

parameters = {'nthread':[4],
              'objective':['binary:logistic'],
              'learning_rate': [0.03, 0.04, 0.05, 0.06, 0.07, 0.08], 
              'max_depth': [5, 6, 7, 8],
              'min_child_weight': [9, 10, 11, 12, 13],
              'silent': [1],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [10], #kept small so the grid search doesn't take too long
              'missing':[-999],
              'seed': [7]}

clf = GridSearchCV(my_xgb_model, parameters, n_jobs=5, 
                   cv=StratifiedKFold(n_splits=5), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clf.fit(X_train, y_train)

print(clf.best_params_)

By a little experimentation, the best number of estimators appears to be 125.

In [None]:
my_xgb_model = XGBClassifier(colsample_bytree = 0.7, 
                             learning_rate = 0.07, 
                             max_depth = 5, 
                             min_child_weight = 9, 
                             missing = -999, 
                             n_estimators = 125, 
                             nthread = 4, 
                             objective = 'binary:logistic', 
                             seed = 7, 
                             silent = 1, 
                             subsample = 0.8)
my_xgb_model.fit(X_train, y_train, early_stopping_rounds = 5, eval_set = [(test_X, test_y)], verbose = False)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(my_xgb_model, X_train, y_train, cv=kfold)
print("XGBTree Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

## Final Model

I am going with a Gradient Boosted Tree, but it is a close call between it and the Random Forest. The Gradient Boosted Tree has a smaller standard deviation in accuracy score, which is why I decided in favor of XGBoost.

I reload everything so I can skip down here and mess around without needing to find and run select cells before running this one.

In [None]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

train['FamilySize'] = train.SibSp + train.Parch
train['logFare'] = np.where(train.Fare != 0, np.log(train.Fare), train.Fare)

cols_to_drop = ['Name', 'Ticket', 'Cabin', 'PassengerId']

train.Pclass = train.Pclass.astype(str)
train = train.drop(cols_to_drop, axis=1)
test.Pclass = test.Pclass.astype(str)
X_test = test.drop(cols_to_drop, axis=1).copy()

X_test['FamilySize'] = X_test.SibSp + X_test.Parch
X_test['logFare'] = np.where(X_test.Fare != 0, np.log(X_test.Fare), X_test.Fare)

train_data = pd.get_dummies(train)
X_test = pd.get_dummies(X_test)

X_train = train_data.drop('Survived', axis=1)
y_train = train_data.Survived

my_imputer = SimpleImputer()
X_train = my_imputer.fit_transform(X_train)
X_test = my_imputer.transform(X_test)

train_X, test_X, train_y, test_y = train_test_split(X_train, y_train, train_size = 0.7, test_size = 0.3, random_state = 0)

my_xgb_model = XGBClassifier(colsample_bytree = 0.7, 
                             learning_rate = 0.07, 
                             max_depth = 5, 
                             min_child_weight = 9, 
                             missing = -999, 
                             n_estimators = 125, 
                             nthread = 4, 
                             objective = 'binary:logistic', 
                             seed = 1337, 
                             silent = 1, 
                             subsample = 0.8)
my_xgb_model.fit(X_train, y_train, early_stopping_rounds = 5, eval_set = [(test_X, test_y)], verbose = False)

my_predictions = my_xgb_model.predict(X_test)

jcleme_submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": my_predictions})

jcleme_submission.to_csv('jcleme_xgb_submission.csv', index = False)