## Click Prediction Project

Matt Leffers, John Burt

This notebook uses hyperparameter tuning to optimize an XGBoost-based click-prediction classifier.

Here, I use GridSearchCV to tune the classifier hyperparameters. I also added a pipeline, which will make it easier to tune different data selection and transformation schemes in the future.
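
As a minimal sketch of how a pipeline and a grid search fit together, the example below wraps a scaler and a classifier in a scikit-learn `Pipeline` and tunes it with `GridSearchCV`. Everything in it is an assumption for illustration: the synthetic data from `make_classification`, the step names `scale` and `clf`, the tiny grid, and `GradientBoostingClassifier` standing in for `xgb.XGBClassifier` so that the example runs without XGBoost installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (hypothetical; not the click dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# hypothetical step names; the classifier is a stand-in for xgb.XGBClassifier
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "clf__learning_rate": [0.05, 0.2],
    "clf__n_estimators": [20, 50],
}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=1)
search.fit(X_demo, y_demo)

print("best score: %0.3f" % search.best_score_)
print("best params:", search.best_params_)
```

Parameters of a transformer step can be added to the same grid with the same `step__param` naming, which is what makes it straightforward to tune data transformations alongside the classifier.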

To install XGBoost:
- `conda install py-xgboost`

Some XGBoost references:
- https://github.com/dmlc/xgboost

- https://www.kaggle.com/phunter/xgboost-with-gridsearchcv


In [2]:
import pandas as pd
import numpy as np
import pickle

pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
# set all font sizes
matplotlib.rcParams.update({'font.size': 18})

# set all line widths
matplotlib.rcParams.update({'lines.linewidth': 2})

# set all symbol sizes
matplotlib.rcParams.update({'lines.markersize': 10})

matplotlib.rcParams.update({'axes.facecolor': 'white'})
matplotlib.rcParams.update({'axes.edgecolor': 'black'})

# nrows2read = 500000
location='./data/'
clickfilename = 'train'
userdatafilename = 'train_users_only'

converters = {"site_id": lambda x: int(x, 16),
              "site_domain": lambda x: int(x, 16),
              "site_category": lambda x: int(x, 16),
              "app_id": lambda x: int(x, 16),
              "app_domain": lambda x: int(x, 16),
              "app_category": lambda x: int(x, 16),
              "device_id": lambda x: int(x, 16),
              "device_model": lambda x: int(x, 16),
              "device_type": lambda x: int(x, 16),
              "device_ip": lambda x: int(x, 16),
             }
#Import only the first nrows2read rows
# data=pd.read_csv(location+'train.csv', nrows=nrows2read, converters=converters) 

clickcsvpath = location+clickfilename+'.csv'
clickpicklepath = location+clickfilename+'.pkl'
userdatapath = location+userdatafilename+'.pkl'

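try:
    # fast path: load the cached DataFrame if the pickle already exists
    print('reading original pickled data...')
    with open(clickpicklepath, 'rb') as handle:
        data = pickle.load(handle)

except (OSError, pickle.UnpicklingError, EOFError):
    # pickle missing or unreadable: fall back to the csv and rebuild the cache
    print('pickle not available: reading original csv file')
    # import csv file, converting the hex-encoded columns to integers
    data = pd.read_csv(clickcsvpath, converters=converters)
    # save data
    with open(clickpicklepath, 'wb') as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)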


reading original pickled data...
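
The cell above follows a load-from-cache-or-rebuild pattern. A minimal, standard-library sketch of that pattern is shown below. The names `load_cached`, `slow_build`, and the temporary file path are hypothetical and used only for illustration, and a small dictionary stands in for the DataFrame.

```python
import os
import pickle
import tempfile

def load_cached(pickle_path, build):
    """Return the pickled object at pickle_path, or build it and cache it."""
    try:
        with open(pickle_path, "rb") as handle:
            return pickle.load(handle)
    except (OSError, pickle.UnpicklingError, EOFError):
        # cache missing or unreadable: rebuild from the slow source
        obj = build()
        with open(pickle_path, "wb") as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
        return obj

calls = []

def slow_build():
    # stand-in for the expensive pd.read_csv call
    calls.append(1)
    # a hex string converted to an int, as the converters do
    return {"site_id": int("ff", 16)}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.pkl")
    first = load_cached(path, slow_build)    # builds and writes the cache
    second = load_cached(path, slow_build)   # reads the cache, no rebuild

print(first == second, len(calls))
```

Catching only the exceptions that signal a missing or unreadable cache keeps unrelated errors (such as a keyboard interrupt) from silently triggering a full csv reload.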


### Equalize number of click vs non-click samples for training

In [11]:
# extract our X and y variables for training
y = data['click'].copy()
X = data[data.columns.values[2:]].copy()

# from the larger dataset, subsample nsamps click and no-click records
y0 = y[y==0]
X0 = X[y==0]
y1 = y[y==1]
X1 = X[y==1]

# nsamps = y1.shape[0] # use as many samples as possible
nsamps = 2000000

print("original data = %d rows: %d clicks, %d nonclicks %1.1f%% clicks"%(
    y.shape[0], y1.shape[0], y0.shape[0], 100*y1.shape[0]/y.shape[0]))

# pd.concat replaces Series.append/DataFrame.append, which were removed in pandas 2.0
y_eq = pd.concat([y1[:nsamps], y0[:nsamps]], ignore_index=True)
X_eq = pd.concat([X1[:nsamps], X0[:nsamps]], ignore_index=True)

print("training data = %d rows, equal# clicks/nonclicks "%(y_eq.shape[0]))


original data = 40428967 rows: 6865066 clicks, 33563901 nonclicks 17.0% clicks
training data = 4000000 rows, equal# clicks/nonclicks 


original data = 500000 rows: 82037 clicks, 417963 nonclicks 16.4% clicks
training data = 164074 rows, equal# clicks/nonclicks 
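
The balancing cell takes the first `nsamps` rows of each class. If the file is ordered in some way (for example by time), a random draw per class may be more representative. The sketch below shows one way to do that with pandas. The helper `balanced_sample`, its arguments, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def balanced_sample(X, y, n_per_class, seed=0):
    """Randomly draw up to n_per_class rows from each class of y."""
    parts = []
    for label in sorted(y.unique()):
        # row labels belonging to this class
        idx = y.index[(y == label).to_numpy()]
        n = min(n_per_class, len(idx))
        # sample row labels without replacement
        parts.append(pd.Series(idx).sample(n=n, random_state=seed))
    keep = pd.concat(parts).to_numpy()
    return X.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)

# toy data with roughly 20% positives (hypothetical, not the click data)
rng = np.random.default_rng(0)
y_toy = pd.Series((rng.random(1000) < 0.2).astype(int), name="click")
X_toy = pd.DataFrame({"feature": rng.integers(0, 50, size=1000)})

n_pos = int(y_toy.sum())
X_bal, y_bal = balanced_sample(X_toy, y_toy, n_per_class=n_pos)
print(y_bal.value_counts().to_dict())
```

Fixing `seed` keeps the subsample reproducible between runs, which matters when comparing grid search results.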

In [13]:
from time import time
from scipy import stats
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# NOTE: the following tips were pasted from some example code:
# - this is a brute-force scan over all parameter combinations
# - max_depth is usually 6, 7 or 8
# - learning rate is around 0.05, but small changes may make a big difference
# - tuning min_child_weight, subsample and colsample_bytree helps
#   fight overfitting
# - n_estimators is the number of boosting rounds
# - finally, ensembling xgboost over multiple seeds may reduce variance
parameters = {
            'nthread':[4], # with hyperthreading, xgboost may become slower
            'objective':['binary:logistic'],
            'learning_rate': [.2, .5, 1], # the so-called `eta` value
            'max_depth': [7],
            'min_child_weight': [11],
            'silent': [1],
            'subsample': [0.8],
            'colsample_bytree': [0.7],
            'n_estimators': [100,500,1000], # number of trees (boosting rounds)
            'missing':[-999],
            }

xgb_model = xgb.XGBClassifier()

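# Create the grid search object.
# Note that "n_jobs=-1" means that the search will use all of the
#  computer's available processing cores to speed things up.
# cv=3 is set explicitly to match the 3-fold run recorded in the output;
#  newer scikit-learn releases default to 5 folds.
grid_search = GridSearchCV(xgb_model, parameters, n_jobs=-1, cv=3, verbose=1)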

print("Performing grid search...")
# print("parameters:")
# print(parameters)
t0 = time()

# Run the grid search to find the best parameters for the classifier.
grid_search.fit(X_eq, y_eq)

print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))


Performing grid search...
Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed: 212.0min finished


done in 12951.565s

Best score: 0.656
Best parameters set:
	colsample_bytree: 0.7
	learning_rate: 0.2
	max_depth: 7
	min_child_weight: 11
	missing: -999
	n_estimators: 100
	nthread: 4
	objective: 'binary:logistic'
	silent: 1
	subsample: 0.8
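
The output line "Fitting 3 folds for each of 9 candidates, totalling 27 fits" follows directly from the grid: only `learning_rate` and `n_estimators` have more than one value, so there are 3 × 3 = 9 combinations, each fitted once per fold. The short sketch below reproduces that count with `itertools.product`. It copies the grid from the cell above and assumes the 3 folds reported in the output.

```python
from itertools import product

# grid copied from the grid search cell
parameters = {
    'nthread': [4],
    'objective': ['binary:logistic'],
    'learning_rate': [.2, .5, 1],
    'max_depth': [7],
    'min_child_weight': [11],
    'silent': [1],
    'subsample': [0.8],
    'colsample_bytree': [0.7],
    'n_estimators': [100, 500, 1000],
    'missing': [-999],
}
n_folds = 3  # as reported in the grid search output

# every combination of one value per parameter is one candidate
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]

n_fits = len(candidates) * n_folds
print("%d candidates x %d folds = %d fits" % (len(candidates), n_folds, n_fits))
```

Because the fit count grows multiplicatively, adding a second value to any other parameter doubles the number of fits, which is worth keeping in mind given that this 27-fit search took over three hours.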
