# w261 Final Project - Clickthrough Rate Prediction


Team 14   
Brian Musisi, Pri Nonis, Vinicio Del Sola, Laura Chutny   
Fall 2019, Section 3

## Table of Contents

* __Section 1__ - Question Formulation
* __Section 2__ - Algorithm Explanation
* __Section 3__ - EDA & Challenges
* __Section 4__ - Algorithm Implementation
* __Section 5__ - Course Concepts

# __Section 1__ - Question Formulation

Advertisers on websites make money when people click on an ad, visit the advertiser's site, and then purchase something. Understanding the rate (or probability) at which people click on an ad is therefore important: higher click-through rates have the potential to generate more revenue. This study does not address the next step, which is how an advertiser converts a person who has clicked through to their site into a paying customer. Instead, our question is how to predict the click-through rate for a given (unseen) ad based on the training data supplied to the model. In other words, for a given ad, what is the probability that a person will click on it? Ads cost money, so advertisers need to know which ads will generate more clicks and are therefore more valuable.

This is a binary classification problem: a 'positive' result (1) if the ad is clicked on and a 'negative' result (0) if it is not. The classes are highly imbalanced, with far more impressions (views of the ad) that end with no click (0) than impressions that result in a click (1). We therefore need to decide how to trade off false positives (type 1 errors, where we predict a click that did not actually happen) against false negatives (type 2 errors, where we fail to predict a click that actually happened). Because advertisers pay more for ads that are clicked, we want to be conservative in our predictions and avoid false positives.

Note that click-through rate (CTR) is defined as the number of clicks on an ad divided by the total number of impressions of that ad. In this dataset, each example is one impression of an ad.
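
As a minimal sketch of this definition, the snippet below computes CTR from a list of 0/1 impression labels. The function name `click_through_rate` and the toy labels are hypothetical and are not taken from the project code.

```python
def click_through_rate(labels):
    """Return clicks / impressions, where each label is 1 (click) or 0 (no click)."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one impression")
    return sum(labels) / len(labels)


# Hypothetical toy data: 10 impressions, 2 of which were clicked.
impressions = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
ctr = click_through_rate(impressions)
print(f"CTR = {ctr:.2f}")
```

Because each row of the dataset is one impression, the mean of the label column over a set of rows is the empirical CTR for those rows.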

# __Section 2__ - Algorithm Explanation

1. Training Algorithm Choices:  
- Logistic Regression
- Decision Tree / Forest
- Factorization Machine

2. Loss function:
- Log Loss (Cross Entropy)
- Exponential Loss
- Hinge Loss (as a proxy for 0/1 loss) - cannot be used with plain gradient descent because it is not differentiable at every point (it has a kink where the margin equals 1), but it can be optimized with subgradient methods, since a subgradient exists everywhere
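
The three candidate losses can be sketched for a single example as follows. This is an illustration only, under stated assumptions: log loss takes a label in {0, 1} and a predicted probability, while the exponential and hinge losses take a label in {-1, +1} and a raw model score. All function names are hypothetical.

```python
import math


def log_loss(y, p, eps=1e-15):
    """Cross entropy for one example; y in {0, 1}, p = predicted P(y = 1)."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def exponential_loss(y_pm, score):
    """Exponential loss; y_pm in {-1, +1}, score is the raw model output."""
    return math.exp(-y_pm * score)


def hinge_loss(y_pm, score):
    """Hinge loss; zero once the margin y_pm * score reaches 1."""
    return max(0.0, 1.0 - y_pm * score)


def hinge_subgradient(y_pm, score):
    """One valid subgradient of the hinge loss with respect to score.

    At the kink (margin exactly 1) any value between -y_pm and 0 is valid;
    this sketch picks 0.
    """
    return -y_pm if y_pm * score < 1 else 0.0


print(log_loss(1, 0.9), exponential_loss(1, 0.5), hinge_loss(1, 0.5))
```

The subgradient function is what replaces the gradient in the update rule when the hinge loss is used.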

3. Hyperparameter tuning

4. Evaluation Metric
- Accuracy is not a good metric. If 96% of the test examples are truly 0, a model that predicts 0 for every example achieves 96% accuracy while failing to identify a single actual positive (1).
- With a goal of limiting false positives, precision (TP/(TP + FP)) will work well. We would trade this off against minimizing the number of false negatives, measured by sensitivity (recall): TP/(TP + FN).
- Or we could combine them into a single score that balances precision (P) and sensitivity (S), the F-score: 2\*((P\*S)/(P+S))
- Average precision, the area under the precision-recall curve (PR AUC), summarizes performance across all decision thresholds and is therefore more informative than a single F-score.
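
The metrics above can be sketched in plain Python as follows. The helper names and toy data are hypothetical, the average precision uses a simple step-wise sum that ignores tied scores, and a metric with a zero denominator is reported as 0.0 by assumption.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties not handled)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at the rank of each true positive
    return ap / total_pos if total_pos else 0.0


# Toy illustration of the accuracy problem: 1 click in 25 impressions,
# and a model that always predicts "no click".
y_true = [1] + [0] * 24
always_zero = [0] * 25
acc, prec, rec, f1 = classification_metrics(y_true, always_zero)
print(acc, prec, rec, f1)
```

In this toy case the always-zero model scores high on accuracy but zero on precision, recall and F-score, which is the behaviour the first bullet warns about.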

# __Section 3__ - EDA & Challenges

In [36]:
import pandas  as pd
import numpy   as np
import time    as ti

import seaborn           as sns
import matplotlib.pyplot as plt
import ipywidgets        as widgets

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg         import Vectors
from pyspark.ml.feature        import OneHotEncoderEstimator, StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml                import Pipeline

from pyspark.sql               import SparkSession, SQLContext
from pyspark.sql.types         import StructType, StructField, StringType, FloatType
from pyspark.sql.functions     import countDistinct, col, desc

from os.path                   import exists

In [12]:
def initSpark(workingSet):
    
    workingSet['ss'] = SparkSession.builder \
                                   .config('spark.driver.memory', '240G') \
                                   .getOrCreate()
    workingSet['sc'] = workingSet['ss'].sparkContext
    workingSet['sq'] = SQLContext(workingSet['sc'])

In [21]:
def loadData(workingSet):

    start = ti.time()
    
    if  not exists('../data/criteo.parquet.full'):

        ds = StructType([StructField(f'ctr'    ,  FloatType(), True)                      ] + \
                        [StructField(f'i{f:02}',  FloatType(), True) for f in range(1, 14)] + \
                        [StructField(f's{f:02}', StringType(), True) for f in range(1, 27)])

        df = workingSet['sq'].read.format('csv') \
                             .options(header = 'true', delimiter = '\t') \
                             .schema(ds) \
                             .load('../data/train.txt')

        df.write.parquet('../data/criteo.parquet.full')

    df = workingSet['ss'].read.parquet('../data/criteo.parquet.full')

    workingSet['df_full'    ] = df
    workingSet['df_toy'     ] = df.sample(fraction = 0.01, seed = 2019)

    workingSet['num_columns'] = [c for c in df.columns if c.startswith('i')]
    workingSet['cat_columns'] = [c for c in df.columns if c.startswith('s')]
    workingSet['all_columns'] = [c for c in df.columns if c != 'ctr']
    
    print(f'\nFinished DataFrame Loading in {ti.time()-start:.3f} Seconds\n')

In [29]:
workingSet = {}

initSpark(workingSet)
loadData(workingSet)


Finished DataFrame Loading in 0.126 Seconds



- 80/10/10 split Training/Dev/Test
- Schema: numerical variables (and their normalization); categorical variables (and their encoding, e.g. indexing)
- Distribution of values: mean, median, skewness
- Number of NaNs and our approach to handling them
- Feature engineering: how to increase/reduce features, and the implications

## 3.1 Data Split
Data was split before processing and stored using parquet files. An 80/10/10 train/dev/test split was used.

In [27]:
def splitData(workingSet):

    start = ti.time()
    
    if  not exists('../data/criteo.parquet.train') or \
        not exists('../data/criteo.parquet.test' ) or \
        not exists('../data/criteo.parquet.dev'  )    :

        train, test, dev = workingSet['df_full'].randomSplit([0.8, 0.1, 0.1], seed = 2019)
        
        train.write.parquet('../data/criteo.parquet.train')
        test.write.parquet('../data/criteo.parquet.test')
        dev.write.parquet('../data/criteo.parquet.dev')
        
    workingSet['df_train'] = workingSet['ss'].read.parquet('../data/criteo.parquet.train')
    workingSet['df_test' ] = workingSet['ss'].read.parquet('../data/criteo.parquet.test')
    workingSet['df_dev'  ] = workingSet['ss'].read.parquet('../data/criteo.parquet.dev')
    
    print(f'\nFinished DataFrame Splitting in {ti.time()-start:.3f} Seconds\n')    

In [30]:
splitData(workingSet)


Finished DataFrame Splitting in 0.305 Seconds



## 3.2 Numerical Variables

Basic statistics for all the numerical variables were computed first and are reported in the table below. Median and skewness were added to the standard summary.

In [15]:
import pandas as pd
num_stats=pd.read_csv('../notebooks/num_stats.csv', index_col='variable')
num_stats

| variable | count    | mean         | stddev       | min  | max        | median | Pearson2Skew |
|----------|----------|--------------|--------------|------|------------|--------|--------------|
| i01      | 20035247 | 3.50006      | 9.427662     | 0.0  | 5775.0     | 1.0    | 0.79555      |
| i02      | 36673203 | 105.881091   | 391.940275   | -3.0 | 257675.0   | 3.0    | 0.787475     |
| i03      | 28802877 | 26.910415    | 396.797641   | 0.0  | 65535.0    | 7.0    | 0.150533     |
| i04      | 28723574 | 7.322778     | 8.799146     | 0.0  | 969.0      | 4.0    | 1.132875     |
| i05      | 35725965 | 18545.307976 | 69435.515294 | 0.0  | 23159456.0 | 2868.0 | 0.677347     |
| i06      | 28469401 | 116.147388   | 391.380525   | 0.0  | 431037.0   | 34.0   | 0.629674     |
| i07      | 35085776 | 16.32269     | 65.524064    | 0.0  | 34536.0    | 4.0    | 0.564191     |
| i08      | 36654939 | 12.517513    | 16.816877    | 0.0  | 6047.0     | 8.0    | 0.805889     |
| i09      | 35085776 | 106.106119   | 220.289905   | 0.0  | 29019.0    | 40.0   | 0.900261     |
| i10      | 20035247 | 0.617435     | 0.683969     | 0.0  | 10.0       | 1.0    | -1.677994    |
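
The `Pearson2Skew` values appear consistent with Pearson's second skewness coefficient, 3 × (mean − median) / stddev. That reading is inferred from the tabulated numbers rather than stated in the notebook, so the sketch below should be treated as an assumption; the function name is hypothetical.

```python
def pearson_second_skewness(mean, median, stddev):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stddev."""
    return 3.0 * (mean - median) / stddev


# Inputs copied from the i01 and i10 rows of the statistics table.
skew_i01 = pearson_second_skewness(3.50006, 1.0, 9.427662)
skew_i10 = pearson_second_skewness(0.617435, 1.0, 0.683969)
print(skew_i01, skew_i10)
```

A positive value (mean above median) indicates a long right tail, which matches most of the variables here; i10 is the exception, with its mean below its median.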




Following up on the basic statistics, the numerical variables were analysed in three steps:  
1) Plot histograms and scatter plots of each variable, along with a box plot and violin plot of a 10% sample of each variable.  
2) Determine the distribution of each variable.  
3) Apply a standardization to each variable.  
 
The variables are standardized not to make them 'normal', but to bring their values into a range of approximately (-1, 3) so that machine learning techniques can be applied. The following table shows each variable, its observed distribution, and the standardization applied.

| Numerical Variable | Distribution Type          | Standardization          |
|--------------------|----------------------------|--------------------------|
| i01                | Exponentially Decreasing   | i01' = i01/(2*SD)        |
| i02                | Truncated Skewed Normal    | i02' = (i02 - median)/SD |
| i03                | Exponentially Decreasing   | i03' = i03/SD            |
| i04                | Truncated Skewed Normal    | i04' = (i04-median)/SD   |
| i05                | Truncated Skewed Normal    | i05' = (i05-median)/SD   |
| i06                | Exponentially Decreasing   | i06' = i06/(2*SD)        |
| i07                | Exponentially Decreasing   | i07' = i07/(2*SD)        |
| i08                | Exponentially Decreasing   | i08' = i08/(2*SD)        |
| i09                | Truncated Skewed Normal    | i09' = (i09-median)/SD   |
| i10                | Sigmoid                    | i10' = i10/Max(i10)      |
| i11                | Truncated Skewed Normal    | i11' = (i11-median)/SD   |
| i12                | Exponentially Decreasing   | i12' = i12/(2*SD)        |
| i13                | Truncated Skewed Normal    | i13' = (i13-median)/SD   |

In [37]:
outputs  = []

def featureAnalysisNumerical(df, feature):
    # Draw four diagnostic plots for one numerical feature into its own Output widget

    output = widgets.Output()
    data   = df[~np.isnan(df[feature])]   # drop rows where the feature is missing
    x      = data[feature]
    
    with output:
        fig = plt.figure(figsize = (28, 7))

        ax1 = fig.add_subplot(1, 4, 1)
        sns.boxplot(x = x, ax = ax1)
      # ax1.set(xlim=(-1,40))
        
        ax2 = fig.add_subplot(1, 4, 2)
        sns.violinplot(x = x, ax = ax2)
      # ax2.set(xlim=(-1,40))

        ax3 = fig.add_subplot(1, 4, 3)
        sns.distplot(x, hist = True, color = 'red', ax = ax3)
      # ax3.set(xlim=(-1,40))

        ax4 = fig.add_subplot(1, 4, 4)
        # plot from the filtered frame so x, y and hue all refer to the same rows
        sns.scatterplot(x = feature, y = 'ctr', hue = 'ctr', data = data, ax = ax4)

        plt.show()
        
    return [output]

pf  = workingSet['df_toy'].sample(fraction = 0.001, seed = 2019).cache().toPandas()
tab = widgets.Tab()
for n, feature in enumerate(workingSet['num_columns']):
    outputs += featureAnalysisNumerical(pf, feature)
    tab.set_title(n, feature)

tab.children = outputs
display(tab)


# __Section 4__ - Algorithm Implementation

1. Toy example with hand calculation/ simple code

2. Parallel implementation using MLlib (or similar)
- Challenges
- Validation

# __Section 5__ - Course Concepts