<center><h1>Kaggle Competition: Titanic Data Set</h1></center>

This Jupyter Notebook is my attempt at the Kaggle Titanic competition.  I'll add notes and explanations throughout to clarify my approach and my thinking at each step.  



<center><h3>Step 1: Data Wrangling</h3></center>

In this step, I'll import the data from the CSV files, explore the data, and "clean" it by putting it in a format more conducive to machine learning.  

In [2]:
# Import all necessary libraries
import pandas as pd
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression



In [3]:
raw_df = pd.read_csv("train.csv")
pred_df = pd.read_csv("test.csv")

# Separate the labels (a Series) from the feature columns.
train = raw_df.drop("Survived", axis=1)
labels = raw_df["Survived"]

# Replace categorical values for embarked, sex with numerical values
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
train["Sex"] = train["Sex"].map({"female": 0, "male": 1})

# Drop names, ticket numbers, cabins
train.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
train.head()

   PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0            1       3    1  22.0      1      0   7.2500       0.0
1            2       1    0  38.0      1      0  71.2833       1.0
2            3       3    0  26.0      0      0   7.9250       0.0
3            4       1    0  35.0      1      0  53.1000       0.0
4            5       3    1  35.0      0      0   8.0500       0.0


The data is in better shape, but we still have to deal with our null values.  
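Before filling anything in, it helps to see which columns actually contain nulls. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data) to show how `isnull().sum()` reports the count of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": [0.0, 1.0, np.nan, 0.0],
})

# Number of missing entries in each column.
null_counts = toy.isnull().sum()
print(null_counts)
```

Running the same call on `train` would show which of its columns need attention.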

In [4]:
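# Fill missing ages with the median age.
train["Age"] = train["Age"].fillna(train["Age"].median())
# Embarked is now a numeric code, so the median of those codes is used as the fill value.
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].median())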
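One thing to keep in mind when filling a coded categorical column with its median: the median of the codes is not guaranteed to be a valid code. The sketch below, on a made-up series (`codes` is hypothetical), shows that case and an alternative that fills with the most common value instead. This is only an illustration of the trade-off, not a change to the approach above.

```python
import numpy as np
import pandas as pd

# Made-up category codes with one missing entry.
codes = pd.Series([0.0, 0.0, 1.0, 2.0, np.nan])

# The median of an even number of codes can fall between two categories.
median_fill = codes.median()

# The mode is always one of the existing categories.
most_common = codes.mode().iloc[0]
filled = codes.fillna(most_common)

print(median_fill, most_common)
```

Here the median is 0.5, which corresponds to no category, while the mode is 0.0.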

In [5]:
# Store column names for data that will be used by our classifier.
parameters = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

In [6]:
train["Age"].describe()

count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [7]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    
    # Print the results
    print ("Trained model in {:.4f} seconds".format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions with a fitted classifier and returns the F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    
    # Print and return results
    print ("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred)


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    print("-"*50)
    
    # Indicate the classifier and the training set size
    print ("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print ("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print ("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
    print("-"*50)
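Since the helpers above report F1 scores, here is a small worked example of what that number means. F1 is the harmonic mean of precision and recall; the labels below are made up purely for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3

# Harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)
print(manual_f1, f1_score(y_true, y_pred))
```

Both values agree, which confirms that `f1_score` is computing the harmonic mean for the positive class.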

In [8]:
scorer = make_scorer(f1_score)
X_train, X_test, y_train, y_test = train_test_split(train, labels, test_size=.30, random_state=42)

# We'll try several different ML algorithms and go with the one that has the highest F1 score.
clf1 = DecisionTreeClassifier()
clf2 = SVC()
clf3 = GaussianNB()
clf4 = RandomForestClassifier()
clf5 = XGBClassifier(
        nthread = 4,
        silent = 1
        )

all_clfs = [clf1, clf2, clf3, clf4, clf5]

for clf in all_clfs:
    train_predict(clf, X_train, y_train, X_test, y_test)



--------------------------------------------------
Training a DecisionTreeClassifier using a training set size of 623. . .
Trained model in 0.0028 seconds
Made predictions in 0.0005 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.7225.
--------------------------------------------------
--------------------------------------------------
Training a SVC using a training set size of 623. . .
Trained model in 0.0167 seconds
Made predictions in 0.0092 seconds.
F1 score for training set: 0.9935.
Made predictions in 0.0041 seconds.
F1 score for test set: 0.0526.
--------------------------------------------------
--------------------------------------------------
Training a GaussianNB using a training set size of 623. . .
Trained model in 0.0012 seconds
Made predictions in 0.0004 seconds.
F1 score for training set: 0.7188.
Made predictions in 0.0004 seconds.
F1 score for test set: 0.7545.
--------------------------------------------------

Initial Results:

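SVC performed very poorly: an F1 score of 0.9935 on the training set but only 0.0526 on the test set points to severe overfitting.  One likely cause is that the features are unscaled (and still include PassengerId), which an RBF-kernel SVC is sensitive to.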
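A common remedy worth trying for an SVC is to standardize the features first. The sketch below is not run on the Titanic data; it uses a synthetic data set from `make_classification` with one column deliberately blown up in scale, and wraps the SVC in a pipeline with `StandardScaler`. All variable names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (not the real data).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X[:, 0] *= 1000.0  # exaggerate one feature's scale, like an ID or fare column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize each feature to zero mean and unit variance before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
scaled_f1 = f1_score(y_te, scaled_svc.predict(X_te))
print("F1 with scaling: {:.4f}".format(scaled_f1))
```

Because the scaler sits inside the pipeline, it is fit on the training split only and then applied to the test split, which avoids leaking test statistics into training.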

Best untuned results: Gaussian Naive Bayes and XGBoost classifiers.  There's not enough of a difference between their scores to declare a clear winner, so we'll tune them both and see which one works best.  
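When two models score this closely on a single train/test split, cross-validation gives a steadier comparison. The sketch below uses `cross_val_score` (already imported at the top of the notebook) on synthetic data, with two scikit-learn classifiers as stand-ins. The data and the `candidates` dictionary are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (not the real data).
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

candidates = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validated F1 for each candidate model.
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    cv_results[name] = scores
    print("{}: mean F1 {:.4f} (std {:.4f})".format(name, scores.mean(), scores.std()))
```

Comparing the mean together with the standard deviation shows whether a gap between two models is larger than the fold-to-fold noise.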

There's not much tuning that can be done to a GaussianNB classifier.  The main parameter you can tweak is the class priors, and those are better estimated from the data itself.  
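To make the point about priors concrete, the sketch below fits `GaussianNB` twice on a tiny made-up data set: once with default settings, where the priors come from the class frequencies, and once with the `priors` argument set by hand.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up one-feature data: three samples of class 0, two of class 1.
X = np.array([[1.0], [1.2], [0.9], [3.0], [3.1]])
y = np.array([0, 0, 0, 1, 1])

# Default: priors are estimated from the class frequencies in y.
default_nb = GaussianNB().fit(X, y)
print(default_nb.class_prior_)

# Manual override: force equal priors for both classes.
fixed_nb = GaussianNB(priors=[0.5, 0.5]).fit(X, y)
print(fixed_nb.class_prior_)
```

The default fit recovers priors of 0.6 and 0.4, which is exactly the class balance in the data.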

That leaves XGBoost.  

In [9]:
# Grid of XGBoost hyperparameters to search over.
# Named param_grid so it does not overwrite the earlier `parameters` column list.
param_grid = {'objective': ['binary:logistic'],
              'learning_rate': [ii * 0.01 for ii in range(1, 6)],  # the so-called `eta` value
              'max_depth': [2, 3, 4],
              'min_child_weight': list(range(1, 5)),
              'silent': [0],
              'subsample': [0.7, 0.8, 0.9],
              'colsample_bytree': [0.5, 0.6, 0.7, 0.8],
              'n_estimators': [5, 10, 15],  # number of trees; increase (e.g. to 1000) for better results
              'missing': [-999],
              'seed': [42]}

clf = GridSearchCV(clf5, param_grid,
                   scoring='f1',
                   refit=True)
clf.fit(X_train, y_train)
clf = clf.best_estimator_

#train_predict(clf, X_train, y_train, X_test, y_test)

ValueError: 'auc' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
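
The error message above lists `roc_auc`, not `auc`, as the valid scoring name for area under the ROC curve. As a minimal sketch of a grid search that uses that name, the example below runs on synthetic data with a small decision tree grid. The estimator and grid are stand-ins chosen to keep the example quick, not the XGBoost setup from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (not the real data).
X, y = make_classification(n_samples=200, n_features=7, random_state=42)

# 'roc_auc' is the scoring string scikit-learn accepts for ROC AUC.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 3, 4]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The same `scoring` argument also accepts `'f1'`, which matches the metric used everywhere else in this notebook.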