## Click Prediction Project

Matt Leffers
John Burt

This notebook trains an SGDClassifier (logistic regression fitted by stochastic gradient descent) to predict ad clicks.


In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.

For this competition, we have provided 11 days' worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms? The winning models from this competition will be released under an open-source license.

Avazu: https://www.kaggle.com/c/avazu-ctr-prediction

File descriptions

    train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.
    test - Test set. 1 day of ads for testing your model predictions.
    sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.
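
The All-0.5 Benchmark mentioned above predicts a click probability of 0.5 for every ad. As a rough illustration, and assuming submissions are scored by logarithmic loss (an assumption here; the metric is not stated in the file descriptions), such a constant prediction scores ln(2), about 0.693, whatever the labels are. The labels below are made up for the example.

```python
import math

def log_loss(y_true, p_pred):
    """Mean binary log loss for 0/1 labels and predicted click probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [0, 0, 1, 0, 1, 0]        # made-up labels, for illustration only
benchmark = [0.5] * len(labels)    # the "All-0.5" prediction
all_half_loss = log_loss(labels, benchmark)
print(round(all_half_loss, 4))     # ln(2), i.e. 0.6931
```

Under that assumption, a model is only useful if its log loss falls below this value.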

Data fields

    id: ad identifier
    click: 0/1 for non-click/click
    hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
    C1 -- anonymized categorical variable
    banner_pos
    site_id
    site_domain
    site_category
    app_id
    app_domain
    app_category
    device_id
    device_ip
    device_model
    device_type
    device_conn_type
    C14-C21 -- anonymized categorical variables
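
The `hour` field packs a date and an hour into one number. A minimal sketch of decoding it with the standard library follows; the helper name `parse_hour` is hypothetical and is not used elsewhere in this notebook.

```python
from datetime import datetime

def parse_hour(hour):
    """Parse a YYMMDDHH value (int or str) into a datetime."""
    return datetime.strptime(str(hour), "%y%m%d%H")

ts = parse_hour(14091123)
print(ts)            # 2014-09-11 23:00:00
print(ts.hour)       # 23
print(ts.weekday())  # 3 (Thursday; Monday is 0)
```

Hour of day and day of week extracted this way could serve as extra features, although the model below uses the raw `hour` value as is.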


In [3]:
from IPython.display import HTML
from IPython.display import Image

import numpy as np
import pandas as pd
from scipy import stats
# import xlearn as xl

# logistic regression (L1 regularization) and other linear models
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression


In [4]:
nrows2read = 500000
location = './data/'

# Columns parsed as base-16 (hexadecimal) integers while reading.
hex_columns = ["site_id", "site_domain", "site_category",
               "app_id", "app_domain", "app_category",
               "device_id", "device_model", "device_type", "device_ip"]
converters = {col: (lambda x: int(x, 16)) for col in hex_columns}

# Import only the first nrows2read rows
data = pd.read_csv(location + 'train.csv', nrows=nrows2read, converters=converters)
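
To make the effect of the converters concrete, here is a small sketch that runs the same `int(x, 16)` conversion on a made-up two-row CSV. The sample values and the helper name `parse_hex` are illustrative only and do not come from `train.csv`.

```python
import io
import pandas as pd

# Made-up sample shaped like a slice of train.csv (values are illustrative only).
sample_csv = io.StringIO(
    "id,click,site_id\n"
    "1,0,000000ff\n"
    "2,1,0000a1b2\n"
)

def parse_hex(x):
    """Interpret a hexadecimal string as an integer."""
    return int(x, 16)

sample = pd.read_csv(sample_csv, converters={"site_id": parse_hex})
print(sample["site_id"].tolist())                    # [255, 41394]
print(format(int(sample.loc[1, "site_id"]), "08x"))  # 0000a1b2
```

The conversion is reversible, so each integer is still just an identifier. A linear model will nevertheless treat it as a magnitude, which is worth keeping in mind when reading the coefficients later.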


In [5]:
# extract our X and y variables for training
y = data['click'].copy()
# features: every column after the first two (id, click)
X = data.iloc[:, 2:].copy()

In [14]:
# build a balanced set: all nsamps click rows plus the first nsamps no-click rows (file order, not random)
y0 = y[y==0]
X0 = X[y==0]
y1 = y[y==1]
X1 = X[y==1]

nsamps = y1.shape[0]

print("original data = %d rows: %d clicks, %d nonclicks %1.1f%% clicks"%(
    y.shape[0], y1.shape[0], y0.shape[0], 100*y1.shape[0]/y.shape[0]))

# Series.append / DataFrame.append were removed in pandas 2.0; pd.concat replaces them
y_eq = pd.concat([y1[:nsamps], y0[:nsamps]], ignore_index=True)
X_eq = pd.concat([X1[:nsamps], X0[:nsamps]], ignore_index=True)

print("training data = %d rows, equal# clicks/nonclicks "%(y_eq.shape[0]))


original data = 500000 rows: 82037 clicks, 417963 nonclicks 16.4% clicks
training data = 164074 rows, equal# clicks/nonclicks 
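
The cell above keeps the first `nsamps` non-click rows, and since the training file is ordered chronologically those rows come from the earliest hours. A possible alternative, sketched here on toy data, is to draw the non-clicks at random. The function name `balance_by_undersampling` and the `seed` parameter are hypothetical, and the sketch assumes there are at least as many non-clicks as clicks.

```python
import pandas as pd

def balance_by_undersampling(X, y, seed=0):
    """Keep every click row and an equally sized random sample of non-click rows."""
    click_idx = y.index[y == 1]
    nonclick_idx = y[y == 0].sample(n=len(click_idx), random_state=seed).index
    X_bal = pd.concat([X.loc[click_idx], X.loc[nonclick_idx]], ignore_index=True)
    y_bal = pd.concat([y.loc[click_idx], y.loc[nonclick_idx]], ignore_index=True)
    return X_bal, y_bal

# Toy data, for illustration only: 2 clicks among 8 rows.
toy_y = pd.Series([0, 0, 1, 0, 0, 1, 0, 0])
toy_X = pd.DataFrame({"banner_pos": [0, 1, 0, 0, 1, 1, 0, 0]})

X_bal, y_bal = balance_by_undersampling(toy_X, toy_y)
print(len(y_bal), int(y_bal.sum()))  # 4 rows in total, 2 of them clicks
```

Fixing `seed` keeps the sample reproducible between runs.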


In [15]:
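from sklearn import linear_model

# model = linear_model.LogisticRegression(penalty='l1', C=1.0, verbose=True)
# Logistic regression trained by SGD. The loss is named 'log_loss' in
# scikit-learn 1.1+; older releases call it 'log'.
model = linear_model.SGDClassifier(loss='log_loss', max_iter=10, tol=None, verbose=0)
# Note: this fits on the full, unbalanced X and y, not on X_eq / y_eq.
model.fit(X, y)
# Accuracy is measured on the same rows the model was trained on.
Accuracy = model.score(X, y)
print('model accuracy:', Accuracy)
# One coefficient per input column.
coeff_df = pd.DataFrame({'feature': X.columns, 'coefficient': model.coef_.ravel()})
Y_mean = y.mean()
print('Target average:', Y_mean)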


model accuracy: 0.791042
Target average: 0.164074
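
One way to read the accuracy above is to compare it with a trivial baseline that always predicts the majority class (non-click). The sketch below does that arithmetic using only the numbers printed in this notebook; the variable names are chosen for the example.

```python
n_rows = 500000            # rows read from train.csv
n_clicks = 82037           # clicks among those rows
model_accuracy = 0.791042  # training accuracy printed above

click_rate = n_clicks / n_rows
# Always predicting the more common class is right this often:
baseline_accuracy = max(click_rate, 1 - click_rate)

print(f"click rate: {click_rate:.6f}")                     # 0.164074
print(f"majority baseline accuracy: {baseline_accuracy:.6f}")  # 0.835926
print(f"model minus baseline: {model_accuracy - baseline_accuracy:+.6f}")  # -0.044884
```

On these numbers the always-non-click baseline is more accurate than the fitted model, which suggests that accuracy alone is a weak yardstick for data this imbalanced.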
