## Click Prediction Project

Matt Leffers, John Burt

This notebook uses hyperparameter tuning to optimize an SGDClassifier click predictor.

Here, I use GridSearchCV to tune the SGDClassifier hyperparameters. I also added a pipeline, which will make it easier to tune different data selection and transformation schemes in the future.
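
As a rough illustration of how a transformer step could later be tuned alongside the classifier, here is a minimal sketch on synthetic data. The `StandardScaler` step, the `demo_*` names, and the data are assumptions made for illustration only; they are not part of this notebook's pipeline. The classifier loss is left at its library default so the sketch does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features on different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 1000.0])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 > 0).astype(int)

# Two named steps: a transformer followed by the classifier.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SGDClassifier(max_iter=50, tol=None, random_state=0)),
])

# Parameters of either step are addressed as stepname__paramname.
demo_parameters = {
    "scaler__with_std": (True, False),
    "clf__alpha": (1e-2, 1e-4),
}

demo_search = GridSearchCV(demo_pipeline, demo_parameters, cv=3)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

Because both steps live in one pipeline, the scaler is refit on each training fold, so the held-out fold never influences the scaling.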


In [1]:
from IPython.display import HTML
from IPython.display import Image

import pandas as pd
import numpy as np


In [2]:
nrows2read = 500000
location='./data/'
converters = {"site_id": lambda x: int(x, 16),
              "site_domain": lambda x: int(x, 16),
              "site_category": lambda x: int(x, 16),
              "app_id": lambda x: int(x, 16),
              "app_domain": lambda x: int(x, 16),
              "app_category": lambda x: int(x, 16),
              "device_id": lambda x: int(x, 16),
              "device_model": lambda x: int(x, 16),
              "device_type": lambda x: int(x, 16),
              "device_ip": lambda x: int(x, 16),
             }
# Import only the first nrows2read rows
data = pd.read_csv(location + 'train.csv', nrows=nrows2read, converters=converters)
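
To show what the `converters` argument does, the sketch below reads a tiny in-memory CSV whose values are made up for illustration (they are not taken from `train.csv`). Each listed column is passed through `int(x, 16)`, which parses the hexadecimal string into an integer.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with hex-encoded columns.
csv_text = (
    "id,click,site_id,device_type\n"
    "1,0,1fbe01fe,1\n"
    "2,1,85f751fd,a\n"
)

def hex_to_int(x):
    """Parse a hexadecimal string such as '1fbe01fe' into an int."""
    return int(x, 16)

demo = pd.read_csv(
    io.StringIO(csv_text),
    converters={"site_id": hex_to_int, "device_type": hex_to_int},
)
print(demo)
```

Columns not listed in `converters` (here `id` and `click`) are parsed with the usual pandas type inference.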


In [3]:
# extract our X and y variables for training
y = data['click'].copy()
X = data[data.columns.values[2:]].copy()

# Balance the classes: keep all click records and the first nsamps no-click records,
# where nsamps is the number of click records
y0 = y[y==0]
X0 = X[y==0]
y1 = y[y==1]
X1 = X[y==1]

nsamps = y1.shape[0]

print("original data = %d rows: %d clicks, %d nonclicks %1.1f%% clicks"%(
    y.shape[0], y1.shape[0], y0.shape[0], 100*y1.shape[0]/y.shape[0]))

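# Series.append / DataFrame.append were removed in pandas 2.0; pd.concat is the equivalent
y_eq = pd.concat([y1[:nsamps], y0[:nsamps]], ignore_index=True)
X_eq = pd.concat([X1[:nsamps], X0[:nsamps]], ignore_index=True)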

print("training data = %d rows, equal# clicks/nonclicks "%(y_eq.shape[0]))


original data = 500000 rows: 82037 clicks, 417963 nonclicks 16.4% clicks
training data = 164074 rows, equal# clicks/nonclicks 
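
The cell above keeps the first `nsamps` non-click rows in file order. If a random draw is preferred, one possible variant is sketched below. The helper name `balanced_subsample` and the toy data are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def balanced_subsample(X, y, random_state=0):
    """Return X, y with equal numbers of rows where y == 0 and y == 1."""
    # Size of the smaller class determines how many rows to draw per class.
    n = int(min((y == 0).sum(), (y == 1).sum()))
    idx1 = y[y == 1].sample(n=n, random_state=random_state).index
    idx0 = y[y == 0].sample(n=n, random_state=random_state).index
    keep = list(idx1) + list(idx0)
    return X.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)

# Toy stand-in data (hypothetical): 3 clicks, 7 non-clicks.
y_toy = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="click")
X_toy = pd.DataFrame({"feature": range(10)})

X_bal, y_bal = balanced_subsample(X_toy, y_toy)
print(y_bal.value_counts().to_dict())
```

Fixing `random_state` keeps the draw reproducible between runs.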


In [9]:
from time import time
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define our own SGDClassifier defaults; any parameter not tuned below keeps these values.
clf_defaults = {
    'loss' : 'log',  # logistic regression loss; renamed 'log_loss' in scikit-learn >= 1.1
    'alpha' : 1e-03, 
    'penalty' : 'l2',
    'max_iter' : 10,
    'tol' : None
    }

# Create a pipeline so that a transformer can be tuned together with the SGDClassifier.
# (transformer not implemented yet)
pipeline = Pipeline([    
    ('clf', SGDClassifier(**clf_defaults)),
])

# Define the parameters and values we want to test.
# Uncommenting more parameters will give better exploring power but will
#   increase processing time in a combinatorial way. I suggest tuning <= 3
#   parameters at a time.
# Note the naming format: pipelineobjectname__paramname
parameters = {
    'clf__alpha': (1e-01, 1e-02, 1e-03, 1e-04),
#     'clf__penalty': ( 'none', 'l2', 'l1', 'elasticnet'),
    'clf__penalty': ( 'l2', 'l1'),
    'clf__max_iter': (50, 100, 200, 500),
}

# Create the grid search object.
# Note that "n_jobs=-1" means that the search will use all of the 
#  computer's available processing cores to speed things up.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

print("Performing grid search...")
# print("parameters:")
# print(parameters)
t0 = time()

# Run the grid search to find the best parameters for the classifier.
grid_search.fit(X_eq, y_eq)

print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))


Performing grid search...
Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed:  3.6min finished


done in 250.791s

Best score: 0.532
Best parameters set:
	clf__alpha: 0.01
	clf__max_iter: 500
	clf__penalty: 'l2'
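
The "32 candidates, totalling 96 fits" line in the log above follows from the size of the grid: 4 values of alpha × 2 penalties × 4 values of max_iter, each evaluated on 3 folds. A quick way to check the count before launching a long search is sketched below; the fold count of 3 is read from the log, and newer scikit-learn releases default to 5 folds instead.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search cell above.
parameters = {
    'clf__alpha': (1e-01, 1e-02, 1e-03, 1e-04),
    'clf__penalty': ('l2', 'l1'),
    'clf__max_iter': (50, 100, 200, 500),
}

n_candidates = len(ParameterGrid(parameters))  # every combination of the listed values
n_folds = 3                                    # taken from the log line above
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # prints: 32 96
```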


Results with all data:

done in 711.312s

Best score: 0.836
Best parameters set:
	clf__alpha: 0.0001
	clf__max_iter: 100
	clf__penalty: 'l2'
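
One way to put an accuracy score in context is to compare it with a majority-class baseline. The sketch below rebuilds a label vector from the class counts printed earlier (82037 clicks, 417963 non-clicks) and scores a classifier that always predicts the most frequent class. Whether the "all data" run used exactly these rows is not stated in this notebook, so treat the comparison as an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts from the "original data" line printed earlier in this notebook.
n_clicks, n_nonclicks = 82037, 417963
y_all = np.concatenate([
    np.ones(n_clicks, dtype=int),
    np.zeros(n_nonclicks, dtype=int),
])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_dummy = np.zeros((y_all.shape[0], 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_all)
baseline_acc = baseline.score(X_dummy, y_all)
print("Majority-class accuracy: %0.3f" % baseline_acc)
```

If a tuned model's accuracy on unbalanced data is close to this baseline, a metric such as log loss or ROC AUC (via the `scoring` argument of GridSearchCV) may be more informative than accuracy.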