In [152]:
%matplotlib inline
import math
import pandas as pd
import numpy as np
from patsy import dmatrices

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf

# import graphlab as gl
import plotly.plotly as py # interactive graphing
from plotly.graph_objs import Bar, Scatter, Marker, Layout
import matplotlib.pyplot as plt
# import seaborn as sns

Data Preparation
======================

In [203]:
df = pd.read_csv('firstable.csv')
df.columns = df.columns.str.lower()
df.shape

(481943, 15)

**The dataset has 481,943 rows and 15 columns. For this analysis, the variables of interest are clicks and engagements on impressions served by Sharethrough, so we keep only the rows that have at least one impression served by Sharethrough. That leaves 165,513 rows and the same 15 columns.**

In [204]:
# In this analysis we assume that we are only interested in ad slots that
# have impressions served by Sharethrough
df = df[df['impressions']>0]
#dimension of the dataset
df.shape
# np.corrcoef(df_ss[])

(165513, 15)

**Next, we create two new variables, click-through rate (CTR) and engagement rate:  
CTR = clicks / impressions  
engagement rate = total_engagements / impressions  
Some CTR and engagement rate values are larger than 1, which doesn't make sense, so those rows are removed.**

In [205]:
# CTR is important to us because we want a better CTR prediction so we can tell the client
df['CTR'] = df['clicks'] / df['impressions']
df['Engagement_Rate'] = df['total_engagements'] / df['impressions']
# A CTR or engagement rate larger than one doesn't make sense, so we drop those rows
df = df[(df['CTR'] <= 1) & (df['Engagement_Rate'] <= 1)]
df.shape

(165486, 17)
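
The rate calculation and the filter above can be checked on a few hand-made rows. This is a minimal sketch on toy data; the values are made up and only the column names follow the notebook.

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from firstable.csv)
toy = pd.DataFrame({
    "impressions":       [10, 4, 2, 5],
    "clicks":            [1, 0, 3, 5],
    "total_engagements": [2, 1, 1, 6],
})

toy["CTR"] = toy["clicks"] / toy["impressions"].astype(float)
toy["Engagement_Rate"] = toy["total_engagements"] / toy["impressions"].astype(float)

# A rate above 1 means more clicks (or engagements) than impressions,
# so a row is kept only if both rates are at most 1.
valid = (toy["CTR"] <= 1) & (toy["Engagement_Rate"] <= 1)
clean = toy[valid]
print(clean)
```

Row 2 is dropped because its CTR is 1.5, and row 3 is dropped because its engagement rate is 1.2 even though its CTR is exactly 1.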

**Quick look at the data:  
Since we are going to fit a logistic model, ads that did not get any clicks are labeled 0 and ads that got at least one click are labeled 1. This variable is the target we try to predict.**

In [176]:
df.head(1)

Unnamed: 0,placement_key,country,device_category,eligible_impression_requests,total_impression_requests,filled_pages,impressions,visible_impressions,clicks,total_engagements,video_plays,autoplay_views,guaranteed_brand_safe,location_type,layout_type,CTR,Engagement_Rate
0,c1687c86,US,Smartphone,241,284,1,1,1,0,0,0,0,0,below the post,single,0.0,0.0


In [206]:
# Binary click label: 1 if the row has at least one click, 0 otherwise (a label, not a rate)
def foo(j):
    if j['clicks'] > 0: return 1
    else: return 0

df['CLICKS'] = df.apply(lambda row: foo(row), axis = 1)

In [11]:
df['CLICKS'].mean()

0.23418100088814775
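
The row-wise `apply` used for the label can be slow on a frame of this size. A vectorized comparison should give the same 0/1 column; the sketch below shows the idea on toy data with made-up values.

```python
import pandas as pd

# Toy click counts (hypothetical values)
toy = pd.DataFrame({"clicks": [0, 3, 0, 1]})

# The comparison yields booleans; astype(int) turns them into 0/1 labels.
toy["CLICKS"] = (toy["clicks"] > 0).astype(int)

# The mean of a 0/1 column is the share of rows with at least one click.
share_clicked = toy["CLICKS"].mean()
print(share_clicked)
```

Read this way, the 0.234 above is the share of rows with at least one click, which is not the same quantity as the average of the `CTR` column.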

**To understand the relationship between the target variable and the other features, we calculate the share of rows with at least one click (the mean of CLICKS) for each value of device_category, country, guaranteed_brand_safe, location_type and layout_type. This gives a basic idea of how much each feature will help us predict the target variable.**

In [179]:
print(df.groupby('device_category')['CLICKS'].mean())
## looks like there is some difference in click rate across device types

device_category
Desktop       0.187246
Smartphone    0.267812
Tablet        0.230685
Name: CLICKS, dtype: float64


**1. device_category and clicks: noticeable differences between device types**

In [180]:
print(df.groupby('country')['CLICKS'].mean())
## seems like there is also some difference in click rate across countries

country
AU    0.027995
BE    0.010000
CA    0.159643
DE    0.033705
ES    0.013592
FI    0.004950
FR    0.020859
GB    0.186646
IT    0.006061
US    0.370687
Name: CLICKS, dtype: float64


**2. country and clicks: strong evidence of differences between countries**

In [181]:
print(df.groupby('guaranteed_brand_safe')['CLICKS'].mean())
# I wouldn't say there is a big difference...

guaranteed_brand_safe
0    0.354991
1    0.208065
Name: CLICKS, dtype: float64


**3. brand safe and clicks: 0.355 vs. 0.208, judged a small difference here**

In [182]:
print(df.groupby('location_type')['CLICKS'].mean())
## We can see that there's definitely a difference here

location_type
below the post    0.280809
gallery           0.203320
in-feed           0.208759
mid-post          0.368070
other             0.185260
Name: CLICKS, dtype: float64


**4. location_type and clicks: there's a decent difference that might help the model**

In [183]:
print(df.groupby('layout_type')['CLICKS'].mean())
## The layout doesn't make that big a difference

layout_type
multiple           0.163374
multiple_manual    0.336830
single             0.235386
Name: CLICKS, dtype: float64


**5. layout_type and clicks: there's a decent difference that might help the model**


**So, I am going to include device_category, country, location_type and layout_type as features for the model. Besides the categorical variables, total_impression_requests, filled_pages and visible_impressions will also be included in the model.**
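
The five group-by checks above repeat one pattern, so they can be written as a single helper. The sketch below is illustrative: `click_share_by` is a hypothetical helper name and the toy rows are made up. It also reports the row count per level, since a rate computed from very few rows is less trustworthy.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the notebook
toy = pd.DataFrame({
    "device_category": ["Desktop", "Desktop", "Smartphone",
                        "Smartphone", "Tablet", "Tablet"],
    "layout_type":     ["single", "multiple", "single",
                        "single", "multiple", "single"],
    "CLICKS":          [0, 1, 1, 1, 0, 0],
})


def click_share_by(frame, column, target="CLICKS"):
    """Share of clicked rows and number of rows for each level of `column`."""
    return frame.groupby(column)[target].agg(["mean", "count"])


summaries = {col: click_share_by(toy, col)
             for col in ["device_category", "layout_type"]}
for col, table in summaries.items():
    print(table)
```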

Modeling 
----------------
1. Split the dataset  
2. Set dummy variables for categorical variables
3. Fit the model and check model performance
4. Use the model to predict the test data and obtain the test error 
5. Model cross-validation 

In [188]:
## Splitting the data
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]

+ Splitting the data randomly
  - 80% training data 
  - 20% testing 
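
The random mask gives a split of roughly 80/20 that changes on every run. As a hedged alternative, not what the notebook ran, scikit-learn's `train_test_split` gives an exact 80/20 split, a fixed seed and the same class balance in both parts. The data below is synthetic and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df (hypothetical values), with about 23% clicked rows
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "visible_impressions": rng.randint(1, 50, size=1000),
    "CLICKS": (rng.rand(1000) < 0.23).astype(int),
})

# random_state makes the split repeatable; stratify keeps the share of
# clicked rows nearly identical in the training and test parts.
train_demo, test_demo = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["CLICKS"]
)
print(len(train_demo), len(test_demo))
```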

In [207]:
formula = ('CLICKS ~ total_impression_requests + filled_pages + visible_impressions + C(country) + '
           'C(device_category) + C(location_type) + C(layout_type)')
# Full data set, used for the cross-validation further down
y, X = dmatrices(formula, df, return_type="dataframe")
# Train/test design matrices (assumes every category level appears in both parts)
train_y, train_X = dmatrices(formula, train, return_type="dataframe")
test_y, test_X = dmatrices(formula, test, return_type="dataframe")

+ Dummy Variables
  - create dummy variables for all the categorical variables and put them back into the data frame
  - the following shows what the data looks like after creating the dummy variables

In [192]:
train_X.head(1)

Unnamed: 0,Intercept,C(country)[T.BE],C(country)[T.CA],C(country)[T.DE],C(country)[T.ES],C(country)[T.FI],C(country)[T.FR],C(country)[T.GB],C(country)[T.IT],C(country)[T.US],...,C(device_category)[T.Tablet],C(location_type)[T.gallery],C(location_type)[T.in-feed],C(location_type)[T.mid-post],C(location_type)[T.other],C(layout_type)[T.multiple_manual],C(layout_type)[T.single],total_impression_requests,filled_pages,visible_impressions
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,284.0,1.0,1.0
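
When the training and test sets are encoded separately, a category level missing from one of them would give the two design matrices different columns. The sketch below shows one way to guard against that. It swaps in `pd.get_dummies` with fixed category levels instead of patsy, and the helper name `encode` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): the test part only contains "US"
train_demo = pd.DataFrame({"country": ["US", "GB", "FI"], "filled_pages": [1, 2, 3]})
test_demo = pd.DataFrame({"country": ["US", "US"], "filled_pages": [4, 5]})

# Fix the level list from the training data so both parts share one encoding.
levels = sorted(train_demo["country"].unique())


def encode(frame):
    """One-hot encode `country` against the fixed level list, dropping the first level."""
    out = frame.copy()
    out["country"] = pd.Categorical(out["country"], categories=levels)
    return pd.get_dummies(out, columns=["country"], drop_first=True, dtype=float)


train_X_demo = encode(train_demo)
test_X_demo = encode(test_demo)
print(list(test_X_demo.columns))
```

Dropping the first level mirrors the reference-level coding visible in the output above, where `AU` has no column of its own.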


In [211]:
train_y = np.ravel(train_y)
y = np.ravel(y)

- Flatten the target variable into a 1-D array so it can be passed to the model in the next step

In [195]:
model = LogisticRegression()
model = model.fit(train_X, train_y)
print(model.score(train_X, train_y))

0.88987836499572881

- **0.8898** is the accuracy on the training set. It's good, but we want to know the **predictive power** of the model on data it has not seen

In [196]:
predicted = model.predict(test_X)
probs = model.predict_proba(test_X)

Result
--------

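- On the test set, precision is 89% for click = 0 and 91% for click = 1. Recall is 98% for click = 0 but only 60% for click = 1, so the model misses about 40% of the slots that were actually clicked

| Target value  | Predicted 0 | Predicted 1 | Precision |
| ------------- |:-----------:|:-----------:|:---------:|
|       0       | 24976 |  471  |   0.89    |
|       1       | 3111  | 4647  |   0.91    |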

- This looks like a very pleasant result. However, we still want to run a cross-validation to make sure it did not happen by chance

In [198]:
print(metrics.confusion_matrix(test_y, predicted))
print(metrics.classification_report(test_y, predicted))

[[24976   471]
 [ 3111  4647]]
             precision    recall  f1-score   support

        0.0       0.89      0.98      0.93     25447
        1.0       0.91      0.60      0.72      7758

avg / total       0.89      0.89      0.88     33205
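
The precision and recall figures in the report follow directly from the four counts in the confusion matrix. The sketch below redoes that arithmetic with NumPy, using only the numbers printed above.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual classes (0, 1),
# columns are predicted classes (0, 1).
cm = np.array([[24976, 471],
               [3111, 4647]], dtype=float)

correct = cm.diagonal()

# Precision: of everything predicted as a class, how much really is that class.
precision = correct / cm.sum(axis=0)
# Recall: of everything that really is a class, how much was found.
recall = correct / cm.sum(axis=1)
# Accuracy: share of all test rows that were classified correctly.
accuracy = correct.sum() / cm.sum()

print(precision.round(2), recall.round(2), round(accuracy, 3))
```

The overall test accuracy comes out at about 0.892, in line with the training accuracy.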



**The cross validation I picked here is 10-fold cross-validation**

In 10-fold cross-validation, the original sample is randomly partitioned into 10 equal size subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 10-1 subsamples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data. The 10 results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
- source : http://www.openml.org/a/estimation-procedures/1
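
The procedure described above can be written out as an explicit loop. This is a minimal sketch on synthetic data from `make_classification`, not the ad data, and the seeds are arbitrary. Note that `cross_val_score` with `cv=10` and a classifier uses stratified folds without shuffling, so its partition differs from the shuffled `KFold` shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: 500 rows, 5 numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
validation_counts = np.zeros(len(X_demo), dtype=int)

for train_idx, valid_idx in kfold.split(X_demo):
    # Fit on the 9 training folds, score on the single held-out fold.
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[valid_idx], y_demo[valid_idx]))
    validation_counts[valid_idx] += 1

# Averaging the 10 fold accuracies gives the single estimate.
print(np.mean(fold_scores))
```

`validation_counts` ends up as all ones, which is the "each observation is used for validation exactly once" property.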

In [213]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
table = {'CV': ['cv1','cv2','cv3','cv4','cv5','cv6','cv7','cv8','cv9','cv10'],'scores': scores }
print(pd.DataFrame(table))

     CV    scores
0   cv1  0.888399
1   cv2  0.889728
2   cv3  0.891843
3   cv4  0.891951
4   cv5  0.889292
5   cv6  0.890319
6   cv7  0.889292
7   cv8  0.886331
8   cv9  0.889956
9  cv10  0.894368
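
The ten fold scores can be summarized by their mean and spread and set against a naive baseline. The sketch below uses the scores printed above. The baseline is an added reference point, not part of the original analysis: it is the accuracy of always predicting "no click", taken from the 0.234 share of clicked rows reported earlier.

```python
import numpy as np

# Fold accuracies copied from the output above
scores = np.array([0.888399, 0.889728, 0.891843, 0.891951, 0.889292,
                   0.890319, 0.889292, 0.886331, 0.889956, 0.894368])

mean_acc = scores.mean()   # about 0.890
std_acc = scores.std()     # spread across folds

# Accuracy of a model that always predicts 0 (no click)
clicked_share = 0.23418100088814775
baseline_acc = 1 - clicked_share

print(mean_acc, std_acc, baseline_acc)
```

The folds agree closely with each other, and the mean sits above the always-zero baseline of roughly 0.77.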


Conclusion 
==============

**With the logistic model, we were able to predict whether a placement (ad slot) will be clicked with around 89% accuracy. This gives us a lot of information: for instance, ad slots that are more likely to be clicked are worth more than ad slots that are not.**