# GETTING STARTED WITH KAGGLE COMPETITIONS
Author: *Melissa Liao*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We will go through my first Kaggle competition project, `Titanic - Machine Learning from Disaster`. The project follows this workflow:
1. Download and load the data given by the site: `train.csv` and `test.csv`.
2. Inspect the data to analyze any relationship between the features.
3. Preprocess any discrete, nominal and string data.
4. Fit and compare models using cross-validation (using our defined functions).
5. Apply hyperparameter tuning using grid search (using our defined functions) to find the best model.
6. Retrain the best model on the training data and predict the test data.
7. Submit the predicted results to Kaggle and summarize my interpretations.
8. Reflect on this mini-project experience.

## 0. Function definitions

We will define our own functions to make it easier to find the model and hyperparameters that give the best accuracy scores for prediction.

In [2]:
from sklearn.model_selection import cross_validate

def get_classifier_cv_score(model, X, y, scoring='accuracy', cv=7):
    '''Calculate train and validation scores of classifier (model) using cross-validation
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        scoring (str): a scoring string accepted by sklearn.model_selection.cross_validate()
        cv (int): number of cross-validation folds, see sklearn.model_selection.cross_validate()
        
        returns: mean training score, mean validation score
    
    '''
    # cross_validate() fits clones of the model, so this fit does not affect the
    # returned scores; it only leaves the passed-in model fitted on all of X, y.
    model.fit(X, y)
    scores = cross_validate(model, X, y, cv=cv, 
                            scoring=scoring, 
                            return_train_score=True)
    return scores['train_score'].mean(), scores['test_score'].mean()
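
As a quick illustration of what this helper returns, here is a minimal sketch that makes the same `cross_validate()` call directly. The synthetic data from `make_classification` and the `LogisticRegression` model are assumptions for demonstration only; they are not the Titanic features or the models compared later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=210, n_features=5, random_state=0)

# Same call the helper makes: 7 folds, accuracy, training scores included
scores = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                        cv=7, scoring='accuracy', return_train_score=True)

# The helper returns these two means, in this order
train_mean = scores['train_score'].mean()
valid_mean = scores['test_score'].mean()
print("train: {:.3f}, validation: {:.3f}".format(train_mean, valid_mean))
```

A training mean that is far above the validation mean usually hints at overfitting, which is why the helper reports both.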

In [3]:
def print_grid_search_result(grid_search):
    '''Prints summary of best model from GridSearchCV object.
    
        For the best model of the grid search, print:
        - parameters 
        - mean cross-validation (validation) score
        
        scores are printed with 3 decimal places.
        grid_search (sklearn GridSearchCV): Fitted GridSearchCV object
        returns: None

    '''
    print("Best parameters: {}".format(grid_search.best_params_))
    print("Best cross-validation score: {:.3f}".format(grid_search.best_score_))
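
The sketch below shows, under stated assumptions, where the two printed values come from on a fitted `GridSearchCV` object. The synthetic data, the `SVC` model, and the parameter grid are hypothetical choices for illustration, not the ones used later in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical two-parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_demo, y_demo)

# The same two lines the helper prints
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.3f}".format(grid_search.best_score_))

# best_score_ is the mean validation score of the best candidate
best_mean = grid_search.cv_results_['mean_test_score'][grid_search.best_index_]
```

Note that `best_score_` is averaged over the held-out folds, so it is a validation score rather than a training score.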

In [4]:
import mglearn

def plot_grid_search_results(grid_search):
    '''For grids with 2 hyperparameters, create a heatmap plot of test scores
        grid_search (sklearn GridSearchCV): Fitted GridSearchCV object
        uses mglearn.tools.heatmap() for plotting.
        
    '''
    results = pd.DataFrame(grid_search.cv_results_)
    params = sorted(grid_search.param_grid.keys())
    assert len(params) == 2, "We can only plot two parameters."
    
    # second dimension in reshape is the columns; it needs to be the fast changing parameter
    scores = np.array(results.mean_test_score).reshape(len(grid_search.param_grid[params[0]]),
                                                      len(grid_search.param_grid[params[1]]))

    # plot the mean cross-validation scores
    # x-axis needs to be the fast changing parameter
    mglearn.tools.heatmap(scores, 
                          xlabel=params[1], 
                          xticklabels=grid_search.param_grid[params[1]], 
                          ylabel=params[0], 
                          yticklabels=grid_search.param_grid[params[0]],
                          cmap="viridis", fmt="%0.3f")
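
The reshape above relies on the order in which scikit-learn enumerates grid candidates: parameter names are sorted, and the last sorted name changes fastest. The following sketch checks that ordering with `ParameterGrid` on a hypothetical two-parameter grid (the names `C` and `gamma` and their values are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical grid; key order in the dict does not matter
param_grid = {'gamma': [0.01, 0.1], 'C': [0.1, 1, 10]}
params = sorted(param_grid.keys())       # ['C', 'gamma']
combos = list(ParameterGrid(param_grid))  # same candidate order GridSearchCV uses

n_rows = len(param_grid[params[0]])  # slow-changing parameter -> rows (y-axis)
n_cols = len(param_grid[params[1]])  # fast-changing parameter -> columns (x-axis)

# Reshape candidate indices exactly like the scores are reshaped in the plot helper
index_grid = np.arange(len(combos)).reshape(n_rows, n_cols)

# Every cell (i, j) should hold the i-th value of params[0] and j-th value of params[1]
layout_ok = all(
    combos[index_grid[i, j]][params[0]] == param_grid[params[0]][i]
    and combos[index_grid[i, j]][params[1]] == param_grid[params[1]][j]
    for i in range(n_rows) for j in range(n_cols)
)
print(layout_ok)
```

If the two lengths were swapped in the reshape, the heatmap cells would no longer line up with their tick labels whenever the two parameters have different numbers of values.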