# Modeling

In [1]:
%matplotlib inline
import cPickle
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC



Grab the engineered data

In [2]:
def read_pickle(file_name):
    # The context manager closes the file even if unpickling raises
    with open(file_name, 'rb') as f:
        return cPickle.load(f)


train = read_pickle('data/train.engineered')
test = read_pickle('data/test.engineered')
outcomes = read_pickle('data/outcomes.engineered')
outcomes_le = read_pickle('data/outcomes_le.engineered')

Split the `train` data into training and test sets using the hold-out method. Although there is a DataFrame named `test`, it is really the set we want to make predictions against, and we don't have labeled examples for it, so it can't be used to estimate the error.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    np.array(train), outcomes, test_size = 0.2, random_state = 10)
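
The idea behind the hold-out split above can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: `holdout_split` is a hypothetical helper, its rounding rule is an assumption, and it will not reproduce the exact rows that `train_test_split` selects with `random_state = 10`.

```python
import random


def holdout_split(rows, labels, test_size=0.2, seed=10):
    """Shuffle the row indices once, then use the first `test_size` share as the hold-out set."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    # A seeded generator makes the split repeatable between runs
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))  # rounding rule is an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as train_test_split: X_train, X_test, y_train, y_test
    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))
```

The hold-out rows are never seen during fitting, which is what makes the score on them an estimate of out-of-sample error.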

In [4]:
# Check that `train` and `test` have the same columns in the same order
list(train) == list(test)

True

## Baseline model

Though I suspect other models will make more accurate predictions, let me quickly try a logistic regression model as a baseline, grid-searching over the regularization strength `C` and the penalty type.

In [5]:
def train_test_model(model, hyperparameters, X_train, X_test, y_train, y_test):
    """
    Given a [model] and a grid of candidate [hyperparameters], along with
    matrices from a hold-out split, runs a grid search on the training set
    using log-loss scoring and 5-fold cross-validation, prints the log loss
    on the hold-out set, and returns the fitted grid search.
    """
    optimized_model = GridSearchCV(
        model, hyperparameters, cv = 5, n_jobs = -1, scoring = 'log_loss')
    optimized_model.fit(X_train, y_train)
    print 'Optimized parameters:', optimized_model.best_params_
    # The scorer is greater-is-better, so it returns the negated log loss;
    # take the absolute value to report the loss itself
    print 'Log loss:', np.absolute(optimized_model.score(X_test, y_test))
    return optimized_model


def create_submission(name, model, train, outcomes, outcomes_le, test):
    """
    Refit the best estimator from [model] on all of [train], predict class
    probabilities for [test], and write them to submissions/[name] in the
    format Kaggle expects.
    """
    clf = model.best_estimator_
    # Refit on the full labeled data, including the rows held out earlier
    clf.fit(np.array(train), outcomes)
    probs = clf.predict_proba(np.array(test))
    results = pd.DataFrame(probs)
    # Map the integer column labels back to the original outcome names
    results.columns = list(outcomes_le.inverse_transform(list(results)))
    results['ID'] = pd.read_csv('data/test.csv')[['ID']].astype(int)
    results = results[['ID', 'Adoption', 'Died', 'Euthanasia', 
                       'Return_to_owner', 'Transfer']]
    results.to_csv('submissions/' + name, index = False)
    return None
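
Since every model here is scored by log loss, it may help to see the metric written out by hand. The sketch below is a minimal illustration rather than scikit-learn's `log_loss`: `multiclass_log_loss` is a hypothetical helper, it assumes labels are integer-encoded column indices, and the clipping value `eps` is an assumption chosen to avoid `log(0)`.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log of the probability assigned to each row's true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so a confident but wrong prediction gives a large, finite penalty
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Reference point: a uniform guess over five classes scores ln(5), about 1.609
uniform_score = multiclass_log_loss([3], [[0.2] * 5])
```

Lower is better, and the uniform-guess value gives a useful yardstick for the five outcome classes in this problem.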

In [6]:
%%time
logit_model = train_test_model(
    LogisticRegression(), 
    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']}, 
    X_train, X_test, y_train, y_test)

Optimized parameters: {'penalty': 'l1', 'C': 1}
Log loss: 0.887987793278
CPU times: user 11.1 s, sys: 72 ms, total: 11.2 s
Wall time: 1min 49s
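
To give a sense of how much work the search above does, the grid can be expanded by hand. This is a rough sketch, not `GridSearchCV`'s internals: `expand_grid` is a hypothetical helper that enumerates every combination in the dictionary passed to `train_test_model`.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


logit_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2']}
candidates = expand_grid(logit_grid)

n_folds = 5  # matches cv = 5 in train_test_model
# Each candidate is fit once per fold; the final refit of the best one is extra
n_fits = len(candidates) * n_folds
```

With 7 values of `C` and 2 penalties there are 14 candidates, so 5-fold cross-validation performs 70 fits before the best setting is refit on the whole training set.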




In [7]:
create_submission('first_submission.csv', logit_model, 
                  train, outcomes, outcomes_le, test)

My hold-out estimate of the test error turned out to be far too optimistic: the log loss on the public leaderboard for this model is 1.83, compared to 0.89 here. Let's also try logistic regression with an elastic net penalty instead of an L1 penalty.

In [8]:
%%time
logit_en_model = train_test_model(
    SGDClassifier(penalty = 'elasticnet', loss = 'log'), 
    {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}, 
    X_train, X_test, y_train, y_test)

Optimized parameters: {'alpha': 0.001}
Log loss: 0.898063749436
CPU times: user 2.16 s, sys: 44 ms, total: 2.21 s
Wall time: 9.05 s


The result was a slightly higher hold-out log loss (0.898 vs. 0.888), so it's time to try a more advanced model.
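
For reference, an elastic net penalty blends the L1 and L2 penalties. The sketch below follows the form described in scikit-learn's SGD documentation as I understand it; `elastic_net_penalty` is a hypothetical helper, and the `l1_ratio` default of 0.15 is assumed to match `SGDClassifier`, which was left at its default in the cell above.

```python
def elastic_net_penalty(weights, alpha, l1_ratio=0.15):
    """Regularization term added to the loss: a weighted mix of L1 and L2 penalties."""
    l1 = sum(abs(w) for w in weights)   # encourages exact zeros (sparsity)
    l2 = sum(w * w for w in weights)    # shrinks all weights smoothly
    # l1_ratio = 1 gives a pure L1 penalty, l1_ratio = 0 a pure L2 penalty
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * 0.5 * l2)
```

Only `alpha` was searched here, so the mix between the two penalties stayed fixed; adding `l1_ratio` to the grid would be one way to explore it.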

## Random forest

In [14]:
%%time
rf_model = train_test_model(
    RandomForestClassifier(random_state = 1),
    {'n_estimators': [100, 500, 800, 1000, 1500, 2000]},
    X_train, X_test, y_train, y_test
)

Optimized parameters: {'n_estimators': 2000}
Log loss: 0.82965091448
CPU times: user 1min 9s, sys: 928 ms, total: 1min 9s
Wall time: 7min 45s


In [15]:
create_submission('second_submission.csv', rf_model, 
                  train, outcomes, outcomes_le, test)