# w261 Final Project - Click-through Rate Prediction


Team 14   
Brian Musisi, Pri Nonis, Vinicio Del Sola, Laura Chutny   
Fall 2019, Section 3

## Table of Contents

* __Section 1__ - Question Formulation
* __Section 2__ - Algorithm Explanation
* __Section 3__ - EDA & Challenges
* __Section 4__ - Algorithm Implementation
* __Section 5__ - Course Concepts

# __Section 1__ - Question Formulation

Advertisers on websites make money when people click on an ad, visit the advertiser's site and then purchase something. This means that understanding the rate (or probability) at which people click on an ad is important - higher 'click-through' rates have the potential for more revenue. This study will not address the next step, which is how an advertiser converts a person who has 'clicked-through' to their site into a paying customer. Instead, our question is how to predict the click through rate for a given (unseen) ad based on the training data supplied to the model. In other words, for a given ad, what is the probability that a person will click on the ad? Ads cost money, so advertisers need to know which ads will generate more clicks and thus which ads are more valuable to the advertiser.

This is a binary classification problem: a 'positive' result (1) if the ad is clicked on and a 'negative' result (0) if it is not. There is a very large imbalance between the classes, with far more impressions (views of the ad) that end with no click (0) than impressions that result in a click (1). We therefore need to decide which kind of error to prioritize: false positives (type 1 errors, where we predict a click that did not actually happen) or false negatives (type 2 errors, where we do not predict a click when there actually was one). Because advertisers pay more for ads that are clicked, we want to be conservative in our predictions and avoid false positives.

Note that Click-Through Rate is defined as the number of clicks on an ad divided by the total number of impressions of that ad. In this case, each example in the dataset is one impression of an ad.

In order to be effective, this type of predictor should achieve an AUC, or 'Area Under the ROC (Receiver Operating Characteristic) Curve', of approximately 0.75-0.8. The ROC curve plots the true-positive rate against the false-positive rate. As the curve bows further towards the top-left corner, the achievable true-positive rate for a given false-positive rate increases, as does the area under the curve. Another way to say this, as summarized on Wikipedia (__3__), is that the AUC is "equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one". The 2nd place winner of the Criteo Kaggle competition had an AUC of 0.8097 (__1__), and another attempt at prediction on the Criteo dataset reported an AUC of 0.79 (__2__).
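The ranking interpretation of the AUC quoted above can be checked directly on a handful of scores. The following is a minimal sketch, not the evaluation code used in this project: `pairwise_auc` and the toy `labels` and `scores` are hypothetical names and values chosen only for illustration.

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 2 clicks (1) and 4 non-clicks (0)
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc = pairwise_auc(labels, scores)
print(auc)  # 6 of the 8 positive/negative pairs are ordered correctly: 0.75
```

The pairwise form is quadratic in the number of examples, so it is only useful as an illustration; on the full dataset a rank-based or library implementation would be needed.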

__References - Section 1__  
(1) M. Jahrer, “Can anyone tell me the AUC of benchmark model ?,” kaggle.com, 2014. [Online]. Available: https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/9821.   

(2) wormhole developers, “Binary Classification on the Criteo CTR Dataset,” wormhole.readthedocs.io, 2015. [Online]. Available: https://wormhole.readthedocs.io/en/latest/tutorial/criteo_kaggle.html.   

(3) “Receiver Operating Characteristic,” wikipedia.org, Nov-2019. [Online]. Available: https://en.wikipedia.org/wiki/Receiver_operating_characteristic.    

# __Section 2__ - Algorithm Explanation

There are many possible choices of algorithm for binary classification problems. Logistic Regression, Decision Tree / Decision Tree Forest, Support Vector Machine (SVM), and Factorization Machine have each proven to be worthy choices. In this project, for our toy model and main algorithm, we decided to focus on the SVM, as this algorithm had not been a focus of the original Kaggle Criteo competition. We will also investigate the performance of several other algorithms.

The Support Vector Machine is an algorithm that divides a set of data into two classes. In 2 dimensions, we can think of this as a line separating the classes. The SVM extends this concept to feature spaces of any dimension, where it creates a 'best-fit' hyperplane to separate the classes. The function that implicitly maps the data into a space of a different dimension is called a kernel.

The class labels are redefined from $y_i \in \{0, 1\}$ to $y_i \in \{-1, 1\}$, and in this case the prediction is made with a linear kernel, i.e. a linear hyperplane whose equation is $\left(w^\top x_i+b\right) = 0$. The predicted class is the side of the hyperplane on which an example falls:  

$$ class = \text{sign}\left(w^\top x_i +b\right) $$  

where, for training examples that lie on or outside the margin:  

$$
class  = \left\{
        \begin{array}{ll}
            1, & \quad \text{if} \quad \left(w^\top x_i+b\right) \geq 1 \\
            -1, & \quad \text{if} \quad \left(w^\top x_i+b\right) \leq -1
        \end{array}
    \right.
$$


The goal in Support Vector Machines is to maximize the distance between the hyperplane and the examples closest to it; in other words, to maximize the 'margin' (**1**). The margin can be thought of in two dimensions as shown in the following diagram:

<figure>
  <img src="SVM_Margin.png" width="350">
    <figcaption> <b>Support Vector Machine Margin (2)</b> </figcaption>
</figure>

From trigonometry and vector algebra, we can compute the distance between the two margins by adding the perpendicular distances from the positive and negative margins to the hyperplane that solves $\left(w^\top x_i+b\right)=0$. This distance, the margin, is $\frac{2}{\parallel w\parallel }$. This allows us to say that in order to maximize the margin, we must minimize $\parallel w\parallel$. In order to maintain differentiability, we will actually minimize $\parallel w\parallel^2$, because the same value of $w$ satisfies both minimizations.
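As a sketch of the geometry behind this result, the standard point-to-hyperplane distance formula can be applied to one point on each margin. The symbols $x_+$ and $x_-$ are illustrative names (not used elsewhere in this report) for points lying exactly on the positive and negative margins:

$$ d(x) = \frac{\left| w^\top x + b \right|}{\parallel w\parallel} $$

$$ w^\top x_+ + b = 1, \quad w^\top x_- + b = -1 \quad \Rightarrow \quad d(x_+) + d(x_-) = \frac{1}{\parallel w\parallel} + \frac{1}{\parallel w\parallel} = \frac{2}{\parallel w\parallel} $$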

This allows us to write what is called the 'hard-margin' Support Vector Machine. The objective function is:

$$ \min_{w} J'(w) = \min_{w} \frac{\lambda}{2} \parallel w \parallel^{2} $$  
such that $y_i\left(w^\top x_i +b\right)\ge 1$

This is called a 'hard-margin' SVM, because the data have to be on one or the other side of our hyperplane. Unfortunately, the world is not usually that kind - we may be able to separate a large amount of our data with the hyperplane, but there will still be some samples that just aren't classified correctly, no matter where we place the hyperplane. For this reason, we need to allow some 'slack', or 'softness' in the margin. In order to achieve this we introduce the Hinge Loss function:



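$$ \ell_{hinge}\left(w, b; x_i, y_i\right) = \left(1 - y_i\left(w^\top x_i + b\right)\right)_+ = \max\left(0,\ 1 - y_i\left(w^\top x_i + b\right)\right) $$

The loss is zero for examples on the correct side of the margin and grows linearly with the amount by which an example violates it.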









2. Loss function:
- Log Loss (Cross Entropy)
- Exponential Loss
- Hinge Loss (as a proxy for 0/1 loss) - can't be used with plain gradient descent because it is not differentiable at the hinge point (where $y_i\left(w^\top x_i+b\right) = 1$), but it can be optimized with subgradient methods, which only need a valid subgradient at that point

3. Hyperparameter tuning

4. Evaluation Metric
- Accuracy is not a good metric with imbalanced classes - we could achieve excellent accuracy by predicting 0 for every test example. If 96% of the examples are truly 0, we would have a great accuracy of 96%, but in actual fact we would have missed every one of the actual positive values (1).
- With a goal of limiting the false positives, precision (TP/(TP + FP)) will work well. We would trade this off against minimizing the number of false negatives (sensitivity (recall): TP/(TP+FN)). 
- Or we could combine them, balancing precision against sensitivity, by using the F-score: 2\*((P\*S)/(P+S))
- Average precision - the area under the precision-recall curve (PR AUC) - summarizes performance across all thresholds and so gives better inference than a single F-score.
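The metrics in the list above can be computed from simple counts. The sketch below is illustrative only: `precision_recall_f1` is a hypothetical helper, and the 100 impressions with 4 clicks are made-up numbers that mirror the 96% example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall (sensitivity) and F-score for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up data: 100 impressions, of which the first 4 were clicked
y_true = [1] * 4 + [0] * 96

# Model A predicts "no click" for everything
all_zero = [0] * 100
accuracy_a = sum(t == p for t, p in zip(y_true, all_zero)) / len(y_true)
metrics_a = precision_recall_f1(y_true, all_zero)

# Model B flags 5 impressions, 3 of which are real clicks
some_pos = [1, 1, 1, 0] + [1, 1] + [0] * 94
metrics_b = precision_recall_f1(y_true, some_pos)

print(accuracy_a, metrics_a)  # high accuracy, but precision = recall = F = 0
print(metrics_b)              # precision 0.6, recall 0.75
```

Model A looks excellent on accuracy alone and is useless on every other metric, which is the reason accuracy is set aside here.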

5. Toy example with hand calculation/ simple code
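As a starting point for the toy example, the following is a minimal sketch of a soft-margin linear SVM trained by batch subgradient descent on the objective $\frac{\lambda}{2}\parallel w\parallel^2$ plus the mean hinge loss. It is not the project's implementation: the six 2-D points, the helper names (`dot`, `objective`, `train_linear_svm`, `predict`) and the values of `lam`, `lr` and `epochs` are all assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def objective(w, b, X, y, lam):
    """(lam / 2) * ||w||^2 plus the mean hinge loss over the examples."""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return 0.5 * lam * dot(w, w) + hinge / len(X)

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent; labels must be -1 or +1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # only examples inside the margin (or misclassified) contribute
            if yi * (dot(w, xi) + b) < 1.0:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical, linearly separable toy data with 0/1 labels
X = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
     [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]]
labels01 = [1, 1, 1, 0, 0, 0]
y = [1 if label == 1 else -1 for label in labels01]   # redefine {0, 1} as {-1, 1}

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, xi) for xi in X]
print(w, b, predictions)
```

At a point where the hinge is exactly at its kink, the code takes zero as the subgradient of the loss term, which is one valid choice among several.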

**VINICIO and/or LAURA**

__References - Section 2__  

(1) H. Daumé, A Course in Machine Learning. 2017.   

(2) K. K. Mahto, “Demystifying Maths of SVM,” Medium, 18-Apr-2019. [Online]. Available: https://towardsdatascience.com/demystifying-maths-of-svm-13ccfe00091e. [Accessed: 11-Dec-2019].   




# __Section 3__ - EDA & Challenges

**LAURA** Will clean up this section

## 3.1 Load Data

In [2]:
from code.engineering import *

In [5]:
Engineering.setupSpark(application = 'eda')

[18:16:35] : Starting Spark Initialization
[18:16:40] : Stopping Spark Initialization



In [6]:
Engineering.importData(location = 'data')

[18:16:42] : Starting Data Import
[18:16:42] : Stopping Data Import



In [7]:
Engineering.splitsData(ratios = [0.8, 0.1, 0.1])

[18:16:44] : Starting Data Splits
[18:16:47] : Stopping Data Splits



The data was split before any EDA into Training/Validation/Test files, at an 80/10/10 ratio.
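The internals of `Engineering.splitsData` are not shown here. As a rough illustration of what an 80/10/10 random split does, the sketch below assigns each row to a split with a seeded random draw; `split_rows`, its `seed` and the integer rows are hypothetical stand-ins, and this is similar in spirit to (but not the same as) Spark's `DataFrame.randomSplit`.

```python
import random

def split_rows(rows, ratios=(0.8, 0.1, 0.1), seed=2019):
    """Randomly assign each row to one split with the given probabilities."""
    rng = random.Random(seed)
    bounds, total = [], 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)          # cumulative upper bound of each split
    splits = [[] for _ in ratios]
    for row in rows:
        u = rng.random() * total
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)    # guard against floating-point edge cases
    return splits

rows = list(range(10000))             # stand-in for the impressions
train, valid, test = split_rows(rows)
print(len(train), len(valid), len(test))
```

Because assignment is random per row, the split sizes are only approximately 80/10/10; what matters for the EDA is that the validation and test rows are set aside before any statistics are computed.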

## 3.2 Basic EDA

- Discuss Schema - numerical variables (and normalization) & Distribution of values - mean, median, skewness

### 3.2.1 Numerical Variables

The basic statistics for all the numerical variables were computed first and are reported in the table below. Median and skewness were added. 

In [4]:
num_measures = load(open('data/model.pickled.normed.num_measures', 'rb'))
num_measures

| feature | count      | mean         | stddev      | min  | max        | median |
|---------|------------|--------------|-------------|------|------------|--------|
| n01     | 20033724.0 | 3.502948     | 9.459785    | 0.0  | 5775.0     | 1.0    |
| n02     | 36669584.0 | 105.835815   | 389.627045  | -3.0 | 36664.0    | 3.0    |
| n03     | 28799232.0 | 26.907507    | 397.482605  | 0.0  | 65535.0    | 7.0    |
| n04     | 28721092.0 | 7.32299      | 8.787283    | 0.0  | 933.0      | 4.0    |
| n05     | 35723496.0 | 18537.207031 | 69412.34375 | 0.0  | 23159456.0 | 2842.0 |
| n06     | 28468040.0 | 116.084457   | 382.811981  | 0.0  | 430898.0   | 34.0   |
| n07     | 35083980.0 | 16.340485    | 66.314934   | 0.0  | 56311.0    | 4.0    |
| n08     | 36651352.0 | 12.516854    | 16.639774   | 0.0  | 6047.0     | 8.0    |
| n09     | 35083980.0 | 106.111786   | 220.357422  | 0.0  | 29019.0    | 40.0   |
| n10     | 20033724.0 | 0.617539     | 0.684164    | 0.0  | 10.0       | 1.0    |


Following up on the basic statistics, we took the following steps to analyse the numerical variables:  
1) Plot histograms and scatter plots of each variable, along with a box plot and violin plot of a 10% sample of each variable.  
2) Determine the distribution of each variable.  
3) Apply a standardization to each variable.  
 
Variables are standardized not to create 'normal' variables, but to bring the values into a region of approximately (-1, 3) so that machine learning techniques can be applied. The following table shows each variable, its observed distribution, and the standardization applied.

| Numerical Variable | Distribution Type          | Standardization            |
|--------------------|----------------------------|----------------------------|
| i01                | Exponentially Decreasing   | i01' = i01/(2*SD)          |
| i02                | Truncated Skewed Normal    | i02' = (i02 - median)/SD   |
| i03                | Exponentially Decreasing   | i03' = i03/SD              |
| i04                | Truncated Skewed Normal    | i04' = (i04 - median)/SD   |
| i05                | Truncated Skewed Normal    | i05' = (i05 - median)/SD   |
| i06                | Exponentially Decreasing   | i06' = i06/(2*SD)          |
| i07                | Exponentially Decreasing   | i07' = i07/(2*SD)          |
| i08                | Exponentially Decreasing   | i08' = i08/(2*SD)          |
| i09                | Truncated Skewed Normal    | i09' = (i09 - median)/SD   |
| i10                | Sigmoid                    | i10' = i10/Max(i10)        |
| i11                | Truncated Skewed Normal    | i11' = (i11 - median)/SD   |
| i12                | Exponentially Decreasing   | i12' = i12/(2*SD)          |
| i13                | Truncated Skewed Normal    | i13' = (i13 - median)/SD   |
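The three kinds of standardization in the table can be sketched in a few lines. This is an illustration rather than the code behind `Engineering.numDoStandardize`: the function names and sample values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics

def scale_by_sd(values, factor=1.0):
    """x' = x / (factor * SD), for the exponentially decreasing features."""
    sd = statistics.stdev(values)      # sample SD (an assumption)
    return [v / (factor * sd) for v in values]

def center_on_median(values):
    """x' = (x - median) / SD, for the truncated skewed-normal features."""
    med = statistics.median(values)
    sd = statistics.stdev(values)
    return [(v - med) / sd for v in values]

def scale_by_max(values):
    """x' = x / max(x), for the sigmoid-like feature."""
    top = max(values)
    return [v / top for v in values]

# Hypothetical non-null values of one numeric feature, with a long right tail
values = [0.0, 1.0, 2.0, 4.0, 13.0]

halved = scale_by_sd(values, factor=2.0)   # the x / (2 * SD) form
centered = center_on_median(values)        # the median maps to 0
unit = scale_by_max(values)                # the maximum maps to 1
print(halved, centered, unit)
```

In a real pipeline the median, SD and maximum would be measured on the training split only and then reused unchanged for the validation and test splits.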

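In [37]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from IPython.display import display

outputs = []

def featureAnalysisNumerical(df, feature):
    """Render box, violin, distribution and scatter plots for one numeric feature."""
    output = widgets.Output()
    data   = df[~np.isnan(df[feature])]   # drop rows where the feature is null

    with output:
        fig = plt.figure(figsize = (28, 7))

        # pass ax explicitly so each plot is drawn on its own subplot
        ax1 = fig.add_subplot(1, 4, 1)
        sns.boxplot(x = data[feature], ax = ax1)
      # ax1.set(xlim=(-1,40))

        ax2 = fig.add_subplot(1, 4, 2)
        sns.violinplot(x = data[feature], ax = ax2)
      # ax2.set(xlim=(-1,40))

        # distplot is deprecated in seaborn >= 0.11; histplot(..., kde = True) replaces it
        ax3 = fig.add_subplot(1, 4, 3)
        sns.distplot(data[feature], hist = True, color = 'red', ax = ax3)
      # ax3.set(xlim=(-1,40))

        # plot from the filtered frame so x and y stay aligned
        ax4 = fig.add_subplot(1, 4, 4)
        sns.scatterplot(x = feature, y = 'ctr', data = data, hue = 'ctr', ax = ax4)

        plt.show()

    return [output]

# small sample of the toy dataset, converted to pandas for plotting
pf  = workingSet['df_toy'].sample(fraction = 0.001, seed = 2019).cache().toPandas()
tab = widgets.Tab()
for feature in workingSet['num_columns']:
    outputs += featureAnalysisNumerical(pf, feature)

# set the children before the titles, one tab per numeric feature
tab.children = outputs
for n, feature in enumerate(workingSet['num_columns']):
    tab.set_title(n, feature)

display(tab)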

Tab(children=(Output(), Output(), Output(), Output(), Output(), Output(), Output(), Output(), Output(), Output…

### 3.2.2 Character Features

**PRI / BRIAN** Please add some commentary and simplified code snippets here

- Discuss Schema - Character variables and (indexing? or whatever else we do with them)

The dataset contains 26 categorical features, which we labelled c01 to c26 based on their order in the dataset. All of them had been hashed. We analyzed these features and found that many of them had very high cardinality. The number of unique categories for each feature is shown below:

In [37]:
cat_measures = load(open('data/model.pickled.normed.filled.masked-060000.cat_measures', 'rb'))

df             = pd.DataFrame.from_dict({ k: [len(v)] for k,v in cat_measures['distinct'].items() }, orient = 'index', columns = ['distinct'])
df['frequent'] = [len(v) for v in cat_measures['frequent'].values()]
df['uncommon'] = [len(v) for v in cat_measures['uncommon'].values()]
df['top five'] = [', '.join(v[:5]) for v in cat_measures['frequent'].values()]

df

| feature | distinct | frequent | uncommon | top five                                         |
|---------|----------|----------|----------|--------------------------------------------------|
| c01     | 1460     | 23       | 1437     | 05db9164, 68fd1e64, 5a9ed9b0, 8cf07265, be589b51 |
| c02     | 581      | 100      | 481      | 38a947a1, 207b2d81, 38d50e09, 1cfdf714, 287130e0 |
| c03     | 8381767  | 45       | 8381722  | deadbeef, d032c263, 02cf9876, aa8c1539, 9143c832 |
| c04     | 1883842  | 60       | 1883782  | c18be181, deadbeef, 29998ed1, d16679b9, 85dd697c |
| c05     | 305      | 13       | 292      | 25c83c98, 4cf72387, 43b19349, 384874ce, 30903e74 |
| c06     | 24       | 7        | 17       | 7e0ccccf, fbad5c96, fe6b92e5, deadbeef, 13718bbd |
| c07     | 12488    | 62       | 12426    | 1c86e0eb, dc7659bd, 7195046d, 5e64ce5f, 468a0854 |
| c08     | 633      | 17       | 616      | 0b153874, 5b392875, 1f89b562, 37e4aa92, 062b5529 |
| c09     | 3        | 2        | 1        | a73ee510, 7cc72ec2                               |
| c10     | 88956    | 42       | 88914    | 3b08e48b, efea433b, fbbf2c95, fa7d0797, 03e48276 |


As shown above, the number of distinct categories ranges from 3 (c09) to 8,381,767 (c03) among the features listed. Working with features of such high cardinality presents challenges, so we examined ways to reduce the cardinality of these features, which included:
1. For each feature, keep only the most commonly occurring categories in the training set and encode all other, rare categories as one category
2. Use Field-aware Factorization Machines, which handle high cardinality data in recommendation and click-through rate datasets by default
3. Hash the categorical features to a smaller feature space with possible collisions
4. Drop columns with extreme cardinality, since these may correspond to features like user IDs or street addresses that ordinarily may not provide much value to the dataset.

In the end, we chose option 1 because this allowed us to create a dataset that could be used with multiple algorithms while allowing us to shrink the feature space. We also found that, for most features, many of the categories appeared in under 1% of the training set (shown below), and so encoding the rare categories into a special category would greatly reduce the potential feature space.
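Option 1 can be sketched as a fit step on the training data followed by a masking step that can be applied to any split. This is only an illustration of the idea: `find_frequent`, `mask_uncommon`, the `min_share` threshold and the toy categories are hypothetical, while `rarebeef` is the special category name used in the feature engineering described later.

```python
from collections import Counter

RARE = "rarebeef"   # special category that absorbs all rare categories

def find_frequent(train_values, min_share=0.01):
    """Categories that make up more than min_share of the training values."""
    counts = Counter(train_values)
    cutoff = min_share * len(train_values)
    return {category for category, n in counts.items() if n > cutoff}

def mask_uncommon(values, frequent):
    """Replace every category outside the frequent set with RARE."""
    return [v if v in frequent else RARE for v in values]

# Hypothetical training column: two common categories and two rare ones
train_column = ["a"] * 60 + ["b"] * 38 + ["c", "d"]
frequent = find_frequent(train_column)

# The frequent set is fit on train only, then reused on unseen data
masked = mask_uncommon(["a", "c", "never-seen"], frequent)
print(sorted(frequent), masked)
```

Fitting the frequent set on the training split alone means that categories which only appear in validation or test data fall into the rare bucket instead of creating new features.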

Another issue we had to deal with for the categorical features, just as with the numeric features, was how to handle null values. Many of the categorical variables had a high number of null values, with some columns having over 40% null values. Below is a breakdown of the percentage of null values for each column. We would need an effective way to deal with these null values.

### 3.2.3 General Feature Engineering

**PRI / BRIAN** Please add some commentary and simplified code snippets here

The EDA guided how we went about our feature engineering. The items that we needed to tackle in our feature engineering were:
1. Scale the numeric features so that they are within the same range, using an appropriate scaler based on the feature
2. Deal with null values. For the numeric features, null values were replaced with the median, while for the categorical features, null values were replaced with a special category, "deadbeef".
3. Reduce the number of categories in each column by encoding rare categories as "rarebeef" and only keeping those that appear more than 1% of the time
4. One-hot encode the categorical features so that they can be used to train the model
5. Reduce the number of features in the dataset by using a feature selector based on the chi-squared test
6. Create interaction variables between the features in the dataset


#### Numeric Features
The numeric features had different ranges, so it was important to scale them, especially when working with linear models like Logistic Regression. As shown in the EDA above, the numeric features had different distributions, so each was scaled according to its distribution. The considerations made when choosing a scaler were:

1. 
2. 
3. 

#### Dealing with the Null Values
The null values were dealt with differently for numeric and categorical features. For the numeric features, the null values were imputed using the median. For the categorical features, the null values were all encoded to a category labelled **deadbeef**.

#### Reducing the cardinality of categorical features
As mentioned before, the cardinality of the categorical features was handled by looking at each feature and encoding all the rare categories to one special category called **rarebeef**.

#### One-hot encode the categorical features
To be able to use the categorical features in our machine learning algorithms, we had to turn the categorical features into numeric features. To do this, we first used Spark's StringIndexer to create a string index for each of the categories within the features and then one-hot encoded them.
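The two steps can be illustrated without Spark. The sketch below is a plain-Python approximation, not the project's pipeline: `fit_index` and `one_hot` are hypothetical helpers, the index is ordered by descending frequency (the default ordering of Spark's StringIndexer), and unlike Spark's one-hot encoder defaults no category is dropped.

```python
from collections import Counter

def fit_index(train_values):
    """Map each category to an integer, most frequent category first."""
    counts = Counter(train_values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))  # ties broken by name
    return {category: i for i, category in enumerate(ordered)}

def one_hot(value, index):
    """Indicator vector with a single 1 at the category's index."""
    vector = [0] * len(index)
    if value in index:            # unseen categories encode as all zeros
        vector[index[value]] = 1
    return vector

# Hypothetical training column after null filling and rare-category masking
train_column = ["deadbeef", "05db9164", "05db9164",
                "rarebeef", "05db9164", "deadbeef"]
index = fit_index(train_column)

encoded = [one_hot(v, index) for v in ["05db9164", "deadbeef", "rarebeef"]]
print(index, encoded)
```

Each category becomes its own 0/1 column, which is why reducing cardinality before this step matters so much for the final feature count.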

#### Feature Selection
After carrying out the other feature engineering steps, we had a large number of features. To determine which ones we should keep and to reduce dimensionality, we used PySpark's ChiSqSelector to choose the best features. This uses the Chi-Squared Test of Independence to score each feature against the outcome variable. The variant we used keeps a fixed number of top features, i.e. those with the strongest association with the outcome variable.
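To make the selection criterion concrete, the sketch below computes the chi-squared statistic of a 2x2 contingency table between a binary (one-hot) feature and the click label, and keeps the top-scoring features. It illustrates the test only and is not ChiSqSelector itself; `chi2_binary`, `select_top` and the toy columns are hypothetical.

```python
def chi2_binary(feature, label):
    """Chi-squared statistic of independence for two 0/1 sequences."""
    n = len(label)
    observed = [[0, 0], [0, 0]]            # observed[feature value][label value]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n  # count expected under independence
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def select_top(features, label, k):
    """Names of the k features with the largest chi-squared statistic."""
    scores = {name: chi2_binary(column, label) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: 8 impressions and three one-hot columns
label = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "f_a": [1, 1, 1, 1, 0, 0, 0, 0],   # identical to the label
    "f_b": [1, 0, 1, 0, 1, 0, 1, 0],   # independent of the label
    "f_c": [1, 1, 1, 0, 0, 0, 0, 1],   # partially associated
}

top_two = select_top(features, label, k=2)
print(top_two)
```

A larger statistic means stronger evidence against independence between the feature and the label, so the uninformative column is the one that gets dropped.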

#### Create Interaction Variables
Finally, we also explored the creation of interaction variables to add more features and to capture possible interactions between the variables. The interaction features we created were between the numeric and the categorical features. 
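One simple way to form numeric-by-categorical interactions is to multiply every numeric value by every one-hot indicator, which yields one new feature per (numeric, category) pair. The sketch below assumes that construction; `interact` and the toy vectors are hypothetical and are not taken from the project code.

```python
def interact(numeric, onehot):
    """All pairwise products between numeric values and one-hot indicators."""
    return [x * c for x in numeric for c in onehot]

# Hypothetical example: 2 scaled numeric features and a 3-category one-hot vector
numeric = [0.5, -1.0]
onehot = [0, 1, 0]

crossed = interact(numeric, onehot)
print(crossed)          # non-zero only where the active category is

# The same construction on 13 numeric and 987 one-hot features
n_interactions = 13 * 987
print(n_interactions)   # 12831 additional features
```

The interaction block is non-zero only in the positions of the active category, so the result stays sparse even though the nominal feature count grows multiplicatively.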

- number of NaNs and what our approach is
- Feature Engineering - how we increased/reduced features and implications.

In [8]:
min = 60000 # minimum occurrence threshold for a category of a categorical feature; generates 1150 one-hot encoded categorical features (note: this name shadows the builtin min())
top =   987 # number of one-hot encoded categorical features kept by ChiSqSelector : 987 top categorical features + 13 numerical features = 1000 total features

In [None]:
Engineering.numDoMeasurement(subset = 'train', iStep = f'', fit = True)

Engineering.numDoStandardize(subset = 'train', iStep = f'')
Engineering.numDoStandardize(subset = 'tests', iStep = f'')
Engineering.numDoStandardize(subset = 'valid', iStep = f'')

Engineering.catFillUndefined(subset = 'train', iStep = f'normed')
Engineering.catFillUndefined(subset = 'tests', iStep = f'normed')
Engineering.catFillUndefined(subset = 'valid', iStep = f'normed')

Engineering.catFindFrequents(subset = 'train', iStep = f'normed.filled', min = min, fit = True)

Engineering.catMaskUncommons(subset = 'train', iStep = f'normed.filled', min = min)
Engineering.catMaskUncommons(subset = 'tests', iStep = f'normed.filled', min = min)
Engineering.catMaskUncommons(subset = 'valid', iStep = f'normed.filled', min = min)

Engineering.catDoCodeFeature(subset = 'train', iStep = f'normed.filled.masked-{min:06d}', fit = True)
Engineering.catDoCodeFeature(subset = 'tests', iStep = f'normed.filled.masked-{min:06d}')
Engineering.catDoCodeFeature(subset = 'valid', iStep = f'normed.filled.masked-{min:06d}')

Engineering.catDoPickFeature(subset = 'train', iStep = f'normed.filled.masked-{min:06d}.encode', top = top, fit = True)
Engineering.catDoPickFeature(subset = 'tests', iStep = f'normed.filled.masked-{min:06d}.encode', top = top)
Engineering.catDoPickFeature(subset = 'valid', iStep = f'normed.filled.masked-{min:06d}.encode', top = top)

Engineering.allDoPackFeature(subset = 'train', iStep = f'normed.filled.masked-{min:06d}.encode.picked-{top:06d}', fit = True)
Engineering.allDoPackFeature(subset = 'tests', iStep = f'normed.filled.masked-{min:06d}.encode.picked-{top:06d}')
Engineering.allDoPackFeature(subset = 'valid', iStep = f'normed.filled.masked-{min:06d}.encode.picked-{top:06d}')

In [None]:
Engineering.toyTakeSubSample(subset = 'train', iStep = f'normed.filled.masked-{min:06d}.encode.picked-{top:06d}.packed', len = 8000)
Engineering.toyTakeSubSample(subset = 'tests', iStep = f'normed.filled.masked-{min:06d}.encode.picked-{top:06d}.packed', len = 1000)
Engineering.toyTakeSubSample(subset = 'valid', iStep = f'normed.filled.masked-{min:06d}.encode.picked-{top:06d}.packed', len = 1000)

__References - Section 3__

# __Section 4__ - Algorithm Implementation

## 4.1. Parallel implementation of main algorithm using MLLib
- challenges
- validation
- __VINICIO__ add commentary on 'main' algorithm (SVM?) here

## 4.2 Scaled Classification on Full Dataset using Spark ML




**BRIAN** Add brief commentar to go with table

On the full dataset, we applied a scalable version of our algorithm from Spark MLLib together with other algorithms. We used Logistic Regression as our baseline, and thereafter applied our main algorithm, the Linear SVM, and other tree-based models.

|             Classifier | Parameters                                                                           | Feature Count                | Engineering Notes                                                                                           | Train AUC | Validation AUC | Test AUC |
|-----------------------:|--------------------------------------------------------------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------|-----------|----------------|----------|
|     LogisticRegression | <ul> <li>maxIter = 100</li> <li>family = binomial</li> <li>regParam = 0.0</li> </ul> | 1000 = 13 + 987              | <ul> <li>baseline 13 num + 987 cat features</li> </ul>                                                      | 73.64%    | 73.63%         | 73.65%   |
|              LinearSVC |                                                                                      | 1000 = 13 + 987              | <ul> <li>baseline 13 num + 987 cat features</li> </ul>                                                      |           |                |          |
| DecisionTreeClassifier |                                                                                      | 1000 = 13 + 987              | <ul> <li>baseline 13 num + 987 cat features</li> </ul>                                                      |           |                |          |
| RandomForestClassifier |                                                                                      | 1000 = 13 + 987              | <ul> <li>baseline 13 num + 987 cat features</li> </ul>                                                      |           |                |          |
|          GBTClassifier |                                                                                      | 1000 = 13 + 987              | <ul> <li>baseline 13 num + 987 cat features</li> </ul>                                                      |           |                |          |
|     LogisticRegression |                                                                                      | 13831 = 13 + 987 + 12831     | <ul> <li>baseline 13 num + 987 cat features</li> <li>additional 13 num x 987 cat interactions</li> </ul>    |           |                |          |
|     LogisticRegression |                                                                                      | 56571 = 13 + 56558           | <ul> <li>extended 13 num + 56558 cat features</li> </ul>                                                    | 77.88%    | 77.75%         | 77.75%   |
|     LogisticRegression |                                                                                      | 791503 = 13 + 56535 + 734955 | <ul> <li>extended 13 num + 56535 cat features</li> <li>additional 13 num x 56535 cat interactions</li> </ul> |           |                |          |

As shown above, none of the algorithms was able to beat the baseline set by the Logistic Regression classifier, with the closest algorithm being the Linear SVM. This is in line with what we would expect in terms of similarity between the Linear SVM and Logistic Regression, because an SVM with a linear kernel behaves similarly to Logistic Regression, the main differentiator being the loss: logistic loss for Logistic Regression and hinge loss for the SVM.
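The difference between the two losses is easiest to see as a function of the signed margin $y_i\left(w^\top x_i+b\right)$. The following sketch simply evaluates both standard loss definitions at a few hypothetical margin values; the function names are illustrative.

```python
import math

def logistic_loss(margin):
    """Logistic loss log(1 + exp(-m)) for a signed margin m."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Hinge loss max(0, 1 - m) for a signed margin m."""
    return max(0.0, 1.0 - margin)

# Signed margins: negative means misclassified, above 1 means outside the margin
margins = [-2.0, 0.0, 1.0, 3.0]
table = [(m, logistic_loss(m), hinge_loss(m)) for m in margins]

for m, log_l, hinge_l in table:
    print(f"margin {m:+.1f}  logistic {log_l:.4f}  hinge {hinge_l:.4f}")
```

Both losses penalize misclassified examples roughly linearly, but the hinge loss is exactly zero once an example is beyond the margin, whereas the logistic loss keeps shrinking without ever reaching zero.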

Discussion about L1/L2 (elasticnet) and lambda (regParam) for Logistic Regression here. - **LAURA** reword

<figure>
  <img src="OverUnderFitting.png" width="550">
    <figcaption> <b>Over and Under Fitting Tradeoff (1)</b> </figcaption>
</figure>

What we are seeing so far is that we have not yet entered 'overfitting' territory: the training and validation AUC are still very close to each other. Because we are not overfitting, increasing lambda (regParam) will not help us.

The alpha (the elastic net mix between L1 and L2) probably does no good either, because the features have already been heavily engineered and selected. Increasing alpha (more L1) means we are taking more features out, which decreases our results.
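For reference, the elastic net penalty that Spark ML adds to the Logistic Regression loss combines the two parameters as $\lambda\left(\alpha\parallel w\parallel_1 + \frac{1-\alpha}{2}\parallel w\parallel_2^2\right)$, with $\lambda$ set by `regParam` and $\alpha$ by `elasticNetParam`. The sketch below evaluates that penalty for a hypothetical weight vector; the function name and the numbers are illustrative only.

```python
def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(wj) for wj in w)
    l2_squared = sum(wj * wj for wj in w)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

w = [3.0, -4.0]   # hypothetical weights
lam = 0.1         # hypothetical regularization strength

pure_l2 = elastic_net_penalty(w, lam, alpha=0.0)   # ridge: 0.1 * 0.5 * 25
pure_l1 = elastic_net_penalty(w, lam, alpha=1.0)   # lasso: 0.1 * 7
mixed   = elastic_net_penalty(w, lam, alpha=0.5)
print(pure_l2, pure_l1, mixed)
```

With `regParam = 0.0`, as in the baseline row of the results table, the penalty vanishes regardless of alpha.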


__References - Section 4__

(1) https://www.jeremyjordan.me/evaluating-a-machine-learning-model/


# __Section 5__ - Course Concepts

**LAURA** to start input here

__References - Section 5__