# GETTING STARTED WITH KAGGLE COMPETITIONS
Author: *Melissa Liao*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We will go through my first Kaggle competition project, `Titanic - Machine Learning from Disaster`. The project follows this workflow:
1. Download and load the data given by the site: `train.csv` and `test.csv`.
2. Inspect the data to analyze any relationship between the features.
3. Preprocess any discrete, nominal and string data.
4. Fit and compare models using cross-validation (using our defined functions).
5. Apply hyperparameter tuning using grid search (using our defined functions) to find the best model.
6. Retrain the best model on the full training data and predict on the test data.
7. Submit predicted results to Kaggle and conclude my interpretations.
8. Reflect on this mini-project experience.

## 0. Function definitions

We will define our own functions to make it easier to find the model and hyperparameters that give the best accuracy scores for prediction.

In [2]:
from sklearn.model_selection import cross_validate

def get_classifier_cv_score(model, X, y, scoring='accuracy', cv=7):
    '''Calculate train and validation scores of classifier (model) using cross-validation
        
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        scoring (str): a scoring string accepted by sklearn.model_selection.cross_validate()
        cv (int): number of cross-validation folds, see sklearn.model_selection.cross_validate()
        
        returns: mean training score, mean validation score
    
    '''
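    # cross_validate clones and refits the model on each fold, so this fit is not
    # needed for the scores; it only leaves `model` fitted on the full data.
    model.fit(X, y)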
    scores = cross_validate(model, X, y, cv=cv, 
                            scoring=scoring, 
                            return_train_score=True)
    return scores['train_score'].mean(), scores['test_score'].mean()
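
To see what the helper returns, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the `LogisticRegression` model are illustrative assumptions and not part of the Titanic project; the call to `cross_validate` mirrors the one inside `get_classifier_cv_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (assumption: stands in for real features)
X_demo, y_demo = make_classification(n_samples=210, n_features=6, random_state=0)

# Same call pattern as in get_classifier_cv_score: 7 folds, accuracy scoring
scores = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                        cv=7, scoring='accuracy', return_train_score=True)

# cross_validate returns a dict of arrays with one entry per fold
mean_train = scores['train_score'].mean()
mean_valid = scores['test_score'].mean()
print("train: {:.3f}, validation: {:.3f}".format(mean_train, mean_valid))
```

A training score far above the validation score would hint at overfitting, which is why the helper returns both means.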

In [3]:
def print_grid_search_result(grid_search):
    '''Prints summary of best model from GridSearchCV object.
    
        For the best model of the grid search, print:
        - parameters 
        - mean cross-validation (validation) score
        
        The score is printed with 3 decimal places.
        grid_search (sklearn GridSearchCV): Fitted GridSearchCV object
        returns: None

    '''
    print("Best parameters: {}".format(grid_search.best_params_))
    print("Best cross-validation score: {:.3f}".format(grid_search.best_score_))
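
As a rough illustration of what these two attributes hold, the sketch below runs a small grid search on synthetic data. The data, the `KNeighborsClassifier` model and the grid values are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical grid: the values are chosen only for illustration
param_grid = {'n_neighbors': [1, 5, 15]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)

# best_score_ is the mean validation score of the best candidate across the folds
best_mean_test = grid.cv_results_['mean_test_score'][grid.best_index_]
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation score: {:.3f}".format(grid.best_score_))
print("Same as mean_test_score at best_index_:", best_mean_test == grid.best_score_)
```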

In [4]:
import mglearn

def plot_grid_search_results(grid_search):
    '''For grids with 2 hyperparameters, create a heatmap plot of test scores
        grid_search (sklearn GridSearchCV): Fitted GridSearchCV object
        uses mglearn.tools.heatmap() for plotting.
        
    '''
    results = pd.DataFrame(grid_search.cv_results_)
    params = sorted(grid_search.param_grid.keys())
    assert len(params) == 2, "We can only plot two parameters."
    
    # reshape to (number of params[0] values, number of params[1] values);
    # the second dimension (columns) must be the fast-changing parameter
    scores = np.array(results.mean_test_score).reshape(len(grid_search.param_grid[params[0]]),
                                                      len(grid_search.param_grid[params[1]]))

    # plot the mean cross-validation scores
    # x-axis needs to be the fast changing parameter
    mglearn.tools.heatmap(scores, 
                          xlabel=params[1], 
                          xticklabels=grid_search.param_grid[params[1]], 
                          ylabel=params[0], 
                          yticklabels=grid_search.param_grid[params[0]],
                          cmap="viridis", fmt="%0.3f")
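
The reshape above relies on the order in which `GridSearchCV` lists its candidates: parameter names are sorted, and the last sorted name varies fastest. The following sketch checks that assumption on synthetic data by rebuilding the score matrix with explicit lookups. The `SVC` model and the grid values are hypothetical, and plotting is left out so that `mglearn` is not required.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# Hypothetical 2 x 3 grid; 'C' sorts before 'gamma', so 'gamma' varies fastest
param_grid = {'C': [0.1, 1.0], 'gamma': [0.01, 0.1, 1.0]}
grid = GridSearchCV(SVC(), param_grid, cv=3).fit(X_demo, y_demo)

params = sorted(param_grid.keys())  # ['C', 'gamma']
scores = np.array(grid.cv_results_['mean_test_score']).reshape(
    len(param_grid[params[0]]), len(param_grid[params[1]]))

# Rebuild the same matrix by looking up each candidate explicitly
expected = np.empty_like(scores)
for candidate, score in zip(grid.cv_results_['params'],
                            grid.cv_results_['mean_test_score']):
    row = param_grid['C'].index(candidate['C'])
    col = param_grid['gamma'].index(candidate['gamma'])
    expected[row, col] = score

print(scores.shape)                      # rows follow 'C', columns follow 'gamma'
print(np.array_equal(scores, expected))  # reshape and explicit lookup agree
```

Note that this only works when `param_grid` is a single dictionary; a list of dictionaries has no `.keys()` and does not form a rectangular grid.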

## 1. Load data
The Titanic data can be downloaded from https://www.kaggle.com/c/titanic/data. Place `train.csv` and `test.csv` in your Jupyter Notebook directory.

In [5]:
train_data = pd.read_csv("train.csv")
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [6]:
test_data = pd.read_csv("test.csv")
test_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


A brief description of what each column represents:
* Pclass - ticket class (1st, 2nd and 3rd class)
* Sex - passenger's sex
* Age - age in years
* SibSp - number of siblings and/or spouses aboard the Titanic
* Parch - number of parents and/or children aboard the Titanic
* Ticket - ticket number of the passenger
* Fare - passenger fare
* Cabin - cabin number
* Embarked - port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)

### 1.1 Prepare the feature matrix and target vector

Since the test data doesn't include the ground truth on which passengers survived, only the training data is split into a feature matrix `X` and a target vector `y`, and we print the shape and type of both. Knowing what each column represents, I discard the `PassengerId`, `Name` and `Ticket` columns. These columns identify the individual passenger, with values that are unique or nearly unique per row, so they aren't relevant for the objective of the project. Also, a passenger's port of embarkation, `Embarked`, has no direct connection to where the incident occurred, so we may drop that column as well.

In [7]:
train_data = train_data.drop(['PassengerId', 'Name', 'Ticket', 'Embarked'], axis=1)
train_data.head()

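| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN |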


In [8]:
X_train = train_data.drop('Survived', axis=1)
y_train = train_data['Survived']

print('X.shape={}, type(X)={}'.format(X_train.shape, type(X_train)))
print('y.shape={}, type(y)={}'.format(y_train.shape, type(y_train)))

X.shape=(891, 7), type(X)=<class 'pandas.core.frame.DataFrame'>
y.shape=(891,), type(y)=<class 'pandas.core.series.Series'>
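
One way to back up the decision to drop identifier-like columns is to look at the share of distinct values per column. The sketch below does this on the five rows shown by `train_data.head()`, typed in by hand with a subset of the columns; on the full `train.csv` the ratios may differ, so treat it as an illustration of the check rather than a result.

```python
import pandas as pd

# The five rows shown by train_data.head() above (subset of columns, typed by hand)
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

# Share of distinct values per column: 1.0 means every row is unique
unique_ratio = sample.nunique() / len(sample)
print(unique_ratio)

# Drop the identifier-like columns, then split the features from the target
features = sample.drop(['PassengerId', 'Ticket', 'Embarked', 'Survived'], axis=1)
target = sample['Survived']
print(features.columns.tolist(), target.shape)
```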


In [9]:
X_test = test_data.drop(['PassengerId', 'Name', 'Ticket', 'Embarked'], axis=1)
X_test.head()

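| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | NaN |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0 | NaN |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | NaN |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | NaN |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | NaN |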


In [10]:
print('X_test.shape={}, type(X_test)={}'.format(X_test.shape, type(X_test)))

X_test.shape=(418, 7), type(X_test)=<class 'pandas.core.frame.DataFrame'>
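
Because the same columns are dropped from both files, the train and test feature matrices should end up with identical columns in the same order. A minimal sketch of that sanity check follows; `drop_unused` is a hypothetical helper introduced only for this example, and the two one-row frames are stand-ins built from the first rows shown above.

```python
import pandas as pd

# Same column list as in the cells above
DROP_COLS = ['PassengerId', 'Name', 'Ticket', 'Embarked']

def drop_unused(df, target=None):
    """Drop identifier-like columns and, if given, the target column (hypothetical helper)."""
    cols = DROP_COLS + ([target] if target else [])
    return df.drop(columns=cols)

# One-row stand-ins for train.csv and test.csv (first row of each output above)
train_demo = pd.DataFrame([{
    'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris',
    'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171',
    'Fare': 7.25, 'Cabin': None, 'Embarked': 'S'}])
test_demo = pd.DataFrame([{
    'PassengerId': 892, 'Pclass': 3, 'Name': 'Kelly, Mr. James',
    'Sex': 'male', 'Age': 34.5, 'SibSp': 0, 'Parch': 0, 'Ticket': '330911',
    'Fare': 7.8292, 'Cabin': None, 'Embarked': 'Q'}])

X_train_demo = drop_unused(train_demo, target='Survived')
X_test_demo = drop_unused(test_demo)

# A model fitted on the training features expects the same columns at prediction time
same_columns = list(X_train_demo.columns) == list(X_test_demo.columns)
print(list(X_train_demo.columns))
print("Train and test columns match:", same_columns)
```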
