# Criteo Click Through Rate Prediction 

### W261 - Machine Learning at Scale 
Spring Semester  
Ben Arnoldy, Kenneth Chen, Nick Conidas, Rohini Kashibatla, Pavan Kurapati

Criteo is an advertising company that specializes in web display ads and gathers click-through information via its ad services. In 2014, Criteo launched a Click Through Rate (CTR) prediction competition hosted on Kaggle. The competition provides two files, train.txt and test.txt. Each row of train.txt carries a label: `1` for click and `0` for no-click. test.txt contains the same features without labels, and the task is to predict whether the user clicks the web ad or not. 

### Dataset construction:

The training dataset consists of a portion of Criteo's traffic over a period
of 7 days. Each row corresponds to a display ad served by Criteo and the first
column indicates whether this ad has been clicked or not.
The positive (clicked) and negatives (non-clicked) examples have both been
subsampled (but at different rates) in order to reduce the dataset size.

There are 13 features taking integer values (mostly count features) and 26
categorical features. The values of the categorical features have been hashed
onto 32 bits for anonymization purposes. 
The semantic of these features is undisclosed. Some features may have missing values.

The rows are chronologically ordered.

The test set is computed in the same way as the training set but it 
corresponds to events on the day following the training period. 
The first column (label) has been removed.

## Rubrics 
## Question 1: Question Formulation
Introduce the goal of your analysis. What questions will you seek to answer, why do people perform this kind of analysis on this kind of data? Preview what level of performance your model would need to achieve to be practically useful.

## Question 2: Algorithm Explanation
Create your own toy example that matches the dataset provided and use this toy example to explain the math behind the algorithm that you will perform.

## Question 3: EDA & Discussion of Challenges
Determine 2-3 relevant EDA tasks that will help you make decisions about how you implement the algorithm to be scalable. Discuss any challenges that you anticipate based on the EDA you perform.

## Question 4: Algorithm Implementation
Develop a 'homegrown' implementation of the algorithm, apply it to the training dataset and evaluate your results on the test set.

## Question 5: Application of Course Concepts
Pick 3-5 key course concepts and discuss how your work on this assignment illustrates an understanding of these concepts.

In [1]:
# imports
import re
import ast
import time
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt

In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [4]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "hw5_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

In [5]:
# load the data into Spark RDDs for convenience of use later (RUN THIS CELL AS IS)
trainRDD = sc.textFile('data/train.txt')
testRDD = sc.textFile('data/test.txt')
sampleRDD = sc.textFile('data/dac_sample.txt')

In [6]:
!head -n 2 data/train.txt

0	1	1	5	0	1382	4	15	2	181	1	2		2	68fd1e64	80e26c9b	fb936136	7b4723c4	25c83c98	7e0ccccf	de7995b8	1f89b562	a73ee510	a8cd5504	b2cb9c98	37c9c164	2824a5f6	1adce6ef	8ba8b39a	891b62e7	e5ba7672	f54016b9	21ddcdc9	b1252a9d	07b5194c		3a171ecb	c5c50484	e8b83407	9727dd16
0	2	0	44	1	102	8	2	2	4	1	1		4	68fd1e64	f0cf0024	6f67f7e5	41274cd7	25c83c98	fe6b92e5	922afcc0	0b153874	a73ee510	2b53e5fb	4f1b46f3	623049e6	d7020589	b28479f6	e6c5b5cd	c92f3b61	07c540c4	b04e4670	21ddcdc9	5840adea	60f6221e		3a171ecb	43f13e8b	e8b83407	731c3655


In [7]:
!head -n 2 data/test.txt

	29	50	5	7260	437	1	4	14		1	0	6	5a9ed9b0	a0e12995	a1e14474	08a40877	25c83c98		964d1fdd	5b392875	a73ee510	de89c3d2	59cd5ae7	8d98db20	8b216f7b	1adce6ef	78c64a1d	3ecdadf7	3486227d	1616f155	21ddcdc9	5840adea	2c277e62		423fab69	54c91918	9b3e8820	e75c9ae9
27	17	45	28	2	28	27	29	28	1	1		23	68fd1e64	960c983b	9fbfbfd5	38c11726	25c83c98	7e0ccccf	fe06fd10	062b5529	a73ee510	ca53fc84	67360210	895d8bbb	4f8e2224	f862f261	b4cc2435	4c0041e5	e5ba7672	b4abdd09	21ddcdc9	5840adea	36a7ab86		32c7478e	85e4d73f	010f6491	ee63dd9b


In [8]:
!wc -l data/train.txt
!wc -l data/dac_sample.txt

 45840617 data/train.txt
  100000 data/dac_sample.txt


## Question 1: Question Formulation
Introduce the goal of your analysis. What questions will you seek to answer, why do people perform this kind of analysis on this kind of data? Preview what level of performance your model would need to achieve to be practically useful.

## Objectives 

Given the click-through dataset, our objective is to accurately predict whether or not a display ad will be clicked. Since the majority of features are categorical and their values are hashed to preserve privacy, our main goal is to engineer features and judge them by model performance. There are two caveats in designing the feature engineering and carrying out the exploratory data analysis (EDA). 

1. The dataset is extremely large, with more than 45 million samples collected over 7 days. To process data of this size, we can either extract a small sample and run the EDA on it, or design an algorithm that computes in parallel from the start. We will do both. On the small sample, we analyze the correlations between features and their distributions (histograms). We also implement an algorithm in Spark that allows parallel computation over the full dataset. 

2. The second caveat is the categorical features in our dataset. Most Python libraries offer a way to impute missing values, for example with the mean or median of the feature in which the missing values occur. However, the majority of our features are categorical, and Spark's imputer does not support categorical variables. <a href="https://spark.apache.org/docs/2.2.0/ml-features.html">Ref</a> 

We will go step by step through how we explore our data and design an algorithm to implement in Spark in the following sections.  


## Question 2: Algorithm Explanation

This is the fundamental concept in feature engineering when exploring our Criteo dataset. 

## Data Exploration 

```
  45,840,617 (Criteo dataset in Kaggle) 
      |
      |
   100,000 (sample) 
    |-- train data      : 80,000 (80%) 
    |-- validation data : 10,000 (10%)
    |-- test data       : 10,000 (10%)
```

Initially, we explore a 100,000-row sample of the original 45 million rows. To test the performance of our classifiers, we split the sample into roughly 80% train data (about 80,000), 10% validation data (about 10,000) and 10% final test data (about 10,000). 

## Features Engineering (One Hot Encoding Vs Features Hashing) 

Each sample has `40` variables. The first variable (column) is the label, `1` or `0`, where `1` indicates a click. The remaining `39` variables are features, of which `13` are numeric and the other `26` are hashed values. Some of them have missing values. We treat all 39 features as categorical. To capture all the features, we explore the two most common feature engineering methods for categorical variables: **one hot encoding (OHE)** and **feature hashing**. 

### 1. One Hot Encoding (OHE)

As the name suggests, one hot encoding (OHE) expands every unique category in the dataset into its own binary column. 

```
|         | feature1 | feature2 | feature3 | 
|---------|----------|----------|----------|
|sample 1 | black    | round    | matte    |
|sample 2 | white    | square   | shiny    |
|sample 3 | blue     | round    | matte    |
|sample 4 | black    | round    | shiny    | 


|         |black |white |blue ||round |square ||matte  |shiny | 
|---------|------|------|-----||------|-------||-------|------|
|sample 1 | 1    | 0    | 0   || 1    |  0    || 1     | 0    |  
|sample 2 | 0    | 1    | 0   || 0    |  1    || 0     | 1    |
|sample 3 | 0    | 0    | 1   || 1    |  0    || 1     | 0    | 
|sample 4 | 1    | 0    | 0   || 1    |  0    || 0     | 1    | 


index    = [0, 1, 2, 3, 4, 5, 6]
--------------------------------
sample 1 = [1, 0, 0, 1, 0, 1, 0] 
sample 2 = [0, 1, 0, 0, 1, 0, 1]
sample 3 = [0, 0, 1, 1, 0, 1, 0]
sample 4 = [1, 0, 0, 1, 0, 0, 1]
```
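The toy table above can be reproduced with a short plain-Python sketch. It is only an illustration of the idea, not the Spark implementation used later; the names `samples`, `ohe_index` and `encode` are chosen here for the example.

```python
# Toy data from the table above: one row per sample, one column per feature.
samples = [
    ["black", "round", "matte"],
    ["white", "square", "shiny"],
    ["blue", "round", "matte"],
    ["black", "round", "shiny"],
]

# Pass 1: give every distinct (column, category) pair its own index,
# column by column, in order of first appearance.
ohe_index = {}
for col in range(len(samples[0])):
    for row in samples:
        key = (col, row[col])
        if key not in ohe_index:
            ohe_index[key] = len(ohe_index)

def encode(row):
    """Return the dense one hot vector for one sample."""
    vector = [0] * len(ohe_index)
    for col, category in enumerate(row):
        vector[ohe_index[(col, category)]] = 1
    return vector

encoded = [encode(row) for row in samples]
for i, vector in enumerate(encoded, start=1):
    print("sample", i, "=", vector)
```

Note that the dictionary has to be complete before any sample can be encoded, which is why OHE needs a full pass over the data first.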

The advantage of one hot encoding is that, when the dataset is relatively small, it conveniently captures all the unique features. We can also use a sparse vector to represent each sample more compactly. However, one hot encoding becomes computationally expensive when the dataset is extremely large, as with our Criteo Kaggle dataset, which has 45 million samples and at least 33 million unique features. 

The other disadvantage of one hot encoding is that, to capture all the unique features, we need a full pass over the entire dataset before any sample can be encoded, which makes the computation more expensive for large data. However, we will use OHE in our exploratory data analysis and compare it with the feature hashing method described below. 


### 2. Features Hashing 

Feature hashing is a feature engineering method that is extremely powerful for handling large data. It also lends itself to **"online learning"**, in which the model does not need to be retrained from scratch when new data arrives. The model can learn online and update its parameters as needed. 

As the name implies, we use a hash function to hash every feature value in our dataset. The hash value is reduced to a table index with the modulo operation, so we need to choose an ample number of hash indexes up front to give all the **unique** features enough room. 

```
|         | feature1 | feature2 | feature3 | 
|---------|----------|----------|----------|
|sample 1 | black    | round    | matte    |
|sample 2 | white    | square   | shiny    |
|sample 3 | blue     | round    | matte    |
|sample 4 | black    | round    | shiny    | 
```
Imagine we don't know how many unique features exist in our dataset. We generously create a hash table to hold all of them, starting with $2^3 = 8$ indexes for our dataset. 
```
|         |    list of features    | 
|---------|------------------------|
|sample 1 | [black, round, matte]  |  
|sample 2 | [white, square, shiny] |
|sample 3 | [blue, round, matte]   |
|sample 4 | [black, round, shiny]  |
```
Hash functions come in different output sizes. One of the most common is md5, which produces 128 bits; we will use the md5 hash function.
```
step 1: feature = feature.encode('utf-8')                    # feature needs to be encoded
step 2: hashlib.md5(feature)                                 # hash object
step 3: hashlib.md5(feature).hexdigest()                     # hash object converted to hexadecimal format 
step 4: int(hashlib.md5(feature).hexdigest(), 16)            # converting hex to base 10 
step 5: int(int(hashlib.md5(feature).hexdigest(), 16) % 8)   # modulo to match the hash table, here 2^3 = 8
eg    : int(int(hashlib.md5('black'.encode('utf-8')).hexdigest(), 16) % 8) 

|         | index list |   
|---------|------------|
|sample 1 | [1, 2, 3]  |  
|sample 2 | [4, 0, 6]  |
|sample 3 | [7, 2, 3]  |
|sample 4 | [1, 2, 6]  |


index    = [0, 1, 2, 3, 4, 5, 6, 7]
-----------------------------------
sample 1 = [0, 1, 1, 1, 0, 0, 0, 0] 
sample 2 = [1, 0, 0, 0, 1, 0, 1, 0]
sample 3 = [0, 0, 1, 1, 0, 0, 0, 1]
sample 4 = [0, 1, 1, 0, 0, 0, 1, 0]

```
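The five steps above can be run directly in Python. This is a minimal sketch using only the standard library; `hash_index` and `hashed_vector` are names chosen for the example. The index table above is illustrative, so the buckets that md5 actually returns for these strings may differ from it.

```python
import hashlib

def hash_index(feature, num_buckets=8):
    """Steps 1-5: encode, md5, hex digest, base-10 integer, modulo table size."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def hashed_vector(features, num_buckets=8):
    """Dense vector with one slot per bucket; colliding features add up."""
    vector = [0] * num_buckets
    for feature in features:
        vector[hash_index(feature, num_buckets)] += 1
    return vector

samples = [
    ["black", "round", "matte"],
    ["white", "square", "shiny"],
    ["blue", "round", "matte"],
    ["black", "round", "shiny"],
]

for row in samples:
    print(row, [hash_index(f) for f in row], hashed_vector(row))
```

Each sample is converted on its own, without first collecting the set of categories in the whole dataset.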
As described above, feature hashing also produces a one-hot-style vector. By generously allocating more index slots than there are unique features, we make **hash collisions** unlikely. Even when collisions do occur in a large hash table, they should have little effect on the accuracy of our model. 

The most significant advantage of feature hashing is that we **don't** need to go over the entire dataset before converting a sample's features into vector space. For example, sample 1 comes with the features `black, round, matte`. We don't need to know how many unique colors there are in our dataset; it could be only `black`, or many more. The modulo step of the hash maps every unique feature to an index based on the remainder. Similarly, our $2^3 = 8$ hash table also takes care of the `round` feature. 
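To make the online learning remark concrete, here is a minimal sketch of a logistic regression that takes one stochastic gradient step per incoming example over hashed features. It is an illustration under assumptions, not the model trained later in this notebook: the learning rate, the example stream, and the categories `purple` and `oval` are invented for the example.

```python
import hashlib
import math

NUM_BUCKETS = 2 ** 3  # same toy table size as in the example above

def bucket(feature):
    """Map a feature string to a bucket with md5 and modulo."""
    return int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS

def predict(weights, bias, features):
    """Sigmoid of the summed bucket weights plus the bias."""
    z = bias + sum(weights[bucket(f)] for f in features)
    z = max(min(z, 20.0), -20.0)  # bound the raw score for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, features, label, lr=0.1):
    """One gradient step on the log loss of a single example.
    Updates weights in place and returns the new bias."""
    error = predict(weights, bias, features) - label
    for f in features:
        weights[bucket(f)] -= lr * error
    return bias - lr * error

weights = [0.0] * NUM_BUCKETS
bias = 0.0

# Hypothetical stream; the last example contains categories never seen before.
stream = [
    (["black", "round", "matte"], 1),
    (["white", "square", "shiny"], 0),
    (["purple", "oval", "matte"], 1),
]
for features, label in stream:
    bias = sgd_step(weights, bias, features, label)

print(predict(weights, bias, ["black", "round", "shiny"]))
```

Because the bucket of a new category is computed on the fly, the weight vector never has to be resized or rebuilt when unseen categories arrive.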



## EDA. Average Click Through Rate from the entire dataset (45 million rows)

In [9]:
# This approach takes about 6.5 minutes (see the timing printed below). 
start = time.time()
print("The average click through rate is : ", trainRDD.map(lambda x: int(x.split('\t')[0])).mean())
print("Time taken : {} seconds".format(time.time() - start))

The average click through rate is :  0.25622338372976045
Time taken : 387.6914131641388 seconds


In [10]:
from pyspark.accumulators import AccumulatorParam

class FloatAccumulatorParam(AccumulatorParam):
    """
    Custom accumulator parameter for floating point sums.
    
    IMPORTANT: accumulators should only be updated inside actions to avoid double counting.
    We strongly recommend the 'foreach' action, as used in avgCTR below.
    """
    def zero(self, value):
        return value
    def addInPlace(self, val1, val2):
        return val1 + val2
    
def avgCTR(dataRDD):
    """Return the average click through rate of an RDD of tab-separated lines."""
    
    clickCount = sc.accumulator(0.0)
    totAccum = sc.accumulator(0.0)
    
    def countCTR(line):
        # the label (1 = click, 0 = no click) is the first tab-separated field
        cnt = line.split('\t')[0]
        clickCount.add(int(cnt))
        totAccum.add(1)
        
    # foreach is an action, so each line updates the accumulators exactly once
    dataRDD.foreach(countCTR)
    
    average = clickCount.value/totAccum.value
        
    return average     

# This approach takes about 2.9 minutes (see the timing printed below). 
start = time.time()
CTR = avgCTR(trainRDD) 
print("The average click through rate is {}".format(CTR)) 
print("Time taken : {} seconds".format(time.time() - start))

The average click through rate is 0.2562233837297609
Time taken : 176.37068128585815 seconds


### Observation 

The average click through rate is about `0.26`, i.e., roughly 26 of every 100 display ads in this dataset were clicked. Because the clicked and non-clicked examples were subsampled at different rates, this figure describes the dataset rather than Criteo's raw traffic. Even so, most of the displayed ads are not clicked. Ideally, we would want close to 100% of display ads to be clicked, and a click through rate of at least 80% would enhance the profits from display ads. This suggests that some features are critical to whether a display ad is clicked and could be a focus for the advertising team.  

In addition, the in-line lambda version took roughly twice as long as the accumulator version to calculate the average click through rate (about 388 seconds versus 176 seconds). 
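Both versions compute the same thing: a click count and a row count gathered in one pass. The plain-Python sketch below shows that pattern with a per-partition function and a merge function. The partitions are hypothetical two-row lists invented for the example. PySpark's `RDD.aggregate(zeroValue, seqOp, combOp)` takes functions of this shape, although this notebook does not use it.

```python
from functools import reduce

def seq_op(acc, line):
    """Fold one tab-separated line into a (clicks, rows) pair."""
    clicks, rows = acc
    return clicks + int(line.split("\t")[0]), rows + 1

def comb_op(left, right):
    """Merge the (clicks, rows) pairs of two partitions."""
    return left[0] + right[0], left[1] + right[1]

# Hypothetical partitions standing in for chunks of train.txt.
partitions = [
    ["1\t5\tabc", "0\t2\tdef"],
    ["0\t7\tabc", "0\t1\txyz"],
]

partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
clicks, rows = reduce(comb_op, partials, (0, 0))
ctr = clicks / rows
print(ctr)  # 0.25
```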

## 100,000 samples split into Train, Validation and Test dataset

In [11]:
!head -n 2 data/dac_sample.txt

0	1	1	5	0	1382	4	15	2	181	1	2		2	68fd1e64	80e26c9b	fb936136	7b4723c4	25c83c98	7e0ccccf	de7995b8	1f89b562	a73ee510	a8cd5504	b2cb9c98	37c9c164	2824a5f6	1adce6ef	8ba8b39a	891b62e7	e5ba7672	f54016b9	21ddcdc9	b1252a9d	07b5194c		3a171ecb	c5c50484	e8b83407	9727dd16
0	2	0	44	1	102	8	2	2	4	1	1		4	68fd1e64	f0cf0024	6f67f7e5	41274cd7	25c83c98	fe6b92e5	922afcc0	0b153874	a73ee510	2b53e5fb	4f1b46f3	623049e6	d7020589	b28479f6	e6c5b5cd	c92f3b61	07c540c4	b04e4670	21ddcdc9	5840adea	60f6221e		3a171ecb	43f13e8b	e8b83407	731c3655


In [12]:
!wc -l data/dac_sample.txt

  100000 data/dac_sample.txt


In [13]:
sampleRDD1 = sampleRDD.map(lambda x: x.replace('\t', ','))

# Splitting the data
weights = [.8, .1, .1]
seed = 42
# Use randomSplit with weights and seed
TrainData, ValData, TestData = sampleRDD1.randomSplit(weights,seed)

# count the data
nTrain = TrainData.count()
nVal = ValData.count()
nTest = TestData.count()
print("Number of train data: ", nTrain)
print("Number of val data  : ", nVal)
print("Number of test data : ", nTest)
print("number of total data: ", nTrain + nVal + nTest)
print(sampleRDD1.take(1))

Number of train data:  80053
Number of val data  :  9941
Number of test data :  10006
number of total data:  100000
['0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
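The split sizes printed above (80,053 / 9,941 / 10,006) are close to 80/10/10 but not exact, because each record is assigned to a split by an independent random draw. The sketch below imitates that behaviour in plain Python. It is a simplified illustration, not Spark's implementation; `random_split` is a name chosen for the example.

```python
import random

def random_split(records, weights, seed):
    """Assign each record to one split by an independent uniform draw."""
    rng = random.Random(seed)
    total = sum(weights)

    # cumulative upper bounds, e.g. [0.8, 0.9, 1.0] for weights [.8, .1, .1]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for record in records:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(record)
                break
        else:
            # guard against floating point rounding at the top edge
            splits[-1].append(record)
    return splits

train, val, test = random_split(range(10000), [.8, .1, .1], seed=42)
print(len(train), len(val), len(test))
```

The three sizes always add up to the input size, but each one only approximates its target share.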


## Question 4: Algorithm Implementation

This is a step-by-step walk-through of how we developed a homegrown algorithm with parallel computation in mind. Since our dataset is very large, we first applied the one hot encoding approach, which must capture all the unique features, to our small `100,000`-sample dataset. 

## Step by step walk through to create one hot encoding

One sample 
```
['0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
```
### Step 1 (Features selection) 

Remove the first column, i.e., label `1` or `0`. 

```
['0']
['1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
```

### Step 2 (Index each feature)

Pair each of the remaining `39` variables with its column index. This step is critical because we then use the index as the key, so that in the reducer step we can count how many unique categories exist for each index.  

```
[[(0, '1'),      # index 0, feature 1 with the category '1'
  (1, '1'),
  (2, '5'),
  (3, '0'),
  (4, '1382'), ...],
 [(0, '5'),      # index 0, feature 1 with the category '5'
  (1, '10'), 
  (2, '34'), ...], ...,
  []]
```

### Step 3 (Select unique) 

The index `0` can have many categories. For example: 

```
(0, '1')
(0, '5')
(0, '15')
(0, '8')
(0, '1')      --> (already had above), will be removed by distinct()
(0, '15')     --> (already had above), will be removed by distinct() 
...
```
To count the unique categories for each index, we use `distinct()`. The `distinct()` step is expensive, especially when the data is very large, because it shuffles the data. 

### Step 3b [Optional] (Mapper and Reducer stage) 

1. Map every index and emit them with `1` so that we can reduce them. 
2. Reduce by (lambda x, y: x + y) 

```
[(0, 145),
 (1, 2483),
 (2, 864),
 (3, 130),
 (4, 20352),
 ...
```

There are 145 unique categories for the index `0`, which is the first feature out of `39` features in our sample. 
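The distinct-then-count logic of Steps 2 to 4 can be checked on a few rows in plain Python. The three shortened rows below are hypothetical (a label followed by five features) and are only meant to show the mechanics.

```python
from collections import defaultdict

# Hypothetical shortened rows: label first, then five features.
lines = [
    "0,1,1,5,,68fd1e64",
    "1,5,1,34,,80e26c9b",
    "0,8,1,5,7,68fd1e64",
]

# Steps 2 and 3: index every feature, keep each (index, category) pair once.
distinct_pairs = set()
for line in lines:
    for index, category in enumerate(line.split(",")[1:]):
        distinct_pairs.add((index, category))

# Step 4: emit (index, 1) for every distinct pair and sum by index.
categories_per_feature = defaultdict(int)
for index, _ in distinct_pairs:
    categories_per_feature[index] += 1

result = sorted(categories_per_feature.items())
print(result)  # [(0, 3), (1, 1), (2, 2), (3, 2), (4, 2)]
```

Note that the empty string of a missing value counts as a category of its own, as it does in the notebook code.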

### Step 4 (Create One Hot Dictionary) 

We need to create an index dictionary for all the unique features. Once we have the dictionary, we then go over all the samples and create a sparse vector based on what features each sample has. 

```
1. zipWithIndex() assigns an index to every item, starting from `0`.
2. Each emitted record becomes a tuple of ((feature index, category), index).
3. [((0, '1'), 0), ((4, '1382'), 1), ((7, '2'), 2), ((13, '68fd1e64'), 3), ((14, '80e26c9b'), 4), ...]
4. collectAsMap() builds a dictionary, using the first element [0] of each tuple as the key (a composite key)
5. the index becomes the dict value

{(0, '1'): 0,
 (4, '1382'): 1,
 (7, '2'): 2,
 (13, '68fd1e64'): 3,
 ...,
 ...,
 (36, '9a1c7e3d'): 234357}
```
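A plain-Python analogue of `distinct()`, `zipWithIndex()` and `collectAsMap()` is sketched below on hypothetical three-column rows. The pairs are sorted here only to make the result deterministic; in Spark the order of the assigned indexes depends on how the data is partitioned.

```python
# Hypothetical rows: label first, then two features.
rows = [
    "0,1,a",
    "1,5,a",
    "0,1,b",
]

# distinct(): every (feature index, category) pair once
distinct_pairs = {
    (index, category)
    for line in rows
    for index, category in enumerate(line.split(",")[1:])
}

# zipWithIndex() + collectAsMap(): composite key -> position
ohe_dict = {pair: position for position, pair in enumerate(sorted(distinct_pairs))}
print(ohe_dict)
```

With `n` distinct pairs the assigned positions run from `0` to `n - 1`.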

### Step 5 (One Hot Encoding) 

To feed our RDD into the classifiers in later steps, we use Spark's LabeledPoint, which holds the label first and the features second, here as a sparse vector. Reference <a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=labeledpoint">here</a>. This is what we want to obtain for each sample of our RDD. 

```
[LabeledPoint(0.0, (234358,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,117215,117216,117217,117218,117219,117220,117221,117222,117223,117224,117225,117226,117227,117228,117229,117230,117231,117232,117233,117234,117235],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))] 
```
1. The first `0.0` at tuple index`[0]` is the sample label. 
2. `234358` is the length of the vector space. 
3. The first list `[]` is the index list for all the unique features for the sample. 
4. The second list `[]` is the value of the unique features. Since this is OHE, it's `1`. 


- To achieve this, we need to feed our original RDD because we need the original label for each sample. 
- To have the sparse vector, we already created OHE dictionary in Step 4. We will then create our sample into `(index, category)` format, and search in our OHE dictionary for their corresponding index.  
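The lookup described in the two points above can be sketched without Spark. The function below returns the label together with a `(size, indices, values)` triple, which mirrors the arguments of a sparse vector. The name `to_sparse` and the tiny dictionary are assumptions made for the example.

```python
def to_sparse(line, ohe_dict):
    """Return (label, (vector size, sorted indices, values)) for one sample."""
    fields = line.split(",")
    label = float(fields[0])
    indices = sorted(
        ohe_dict[(index, category)]
        for index, category in enumerate(fields[1:])
        if (index, category) in ohe_dict  # skip categories unseen in training
    )
    values = [1.0] * len(indices)
    return label, (len(ohe_dict), indices, values)

# Hypothetical dictionary built from training data.
ohe_dict = {(0, "1"): 0, (0, "5"): 1, (1, "a"): 2, (1, "b"): 3}

seen = to_sparse("1,5,b", ohe_dict)
unseen = to_sparse("0,9,a", ohe_dict)
print(seen)    # (1.0, (4, [1, 3], [1.0, 1.0]))
print(unseen)  # (0.0, (4, [2], [1.0]))
```

A category that was not in the training dictionary simply contributes no index, so validation and test samples can have fewer non-zero entries than features.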

### Step 6 (Classifiers) 

We use logistic regression because our task is binary classification: the label is either `1` or `0`. We use the log loss function to check model performance. As a rule of thumb, the nearer the log loss is to zero, the better the predicted probabilities match the labels. 
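For reference, the two quantities can be written out as follows. The notation is chosen here for illustration: $x_i$ is the encoded feature vector of sample $i$, $y_i \in \{0, 1\}$ its label, $w$ the weight vector, $b$ the intercept and $N$ the number of samples.

$$
p_i = \sigma(w^\top x_i + b) = \frac{1}{1 + e^{-(w^\top x_i + b)}}
$$

$$
\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
$$

A sample with $y_i = 1$ contributes $-\log p_i$ and a sample with $y_i = 0$ contributes $-\log(1 - p_i)$, so confident wrong predictions are penalized heavily.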

In [20]:
def indexFeatures(line):
    # pair each feature with its column index after removing the first column, which is the label

    features = []

    for index,feature in enumerate(line.split(',')[1:]):
        idxFeat = index,feature
        features.append(idxFeat)
        
    return features

In [15]:
indexedTrainFeat = TrainData.map(indexFeatures)

totalCat = (indexedTrainFeat
            .flatMap(lambda x: x)
            .distinct()
            .map(lambda x: (x[0], 1))
            .reduceByKey(lambda x, y: x + y)
            .sortByKey()
            .collect())

print(totalCat)

[(0, 145), (1, 2483), (2, 864), (3, 130), (4, 20352), (5, 1899), (6, 583), (7, 131), (8, 1803), (9, 8), (10, 84), (11, 66), (12, 254), (13, 484), (14, 495), (15, 36200), (16, 21463), (17, 134), (18, 12), (19, 7236), (20, 232), (21, 3), (22, 9863), (23, 3682), (24, 34184), (25, 2748), (26, 26), (27, 4864), (28, 28933), (29, 10), (30, 2416), (31, 1230), (32, 4), (33, 32056), (34, 10), (35, 14), (36, 10849), (37, 49), (38, 8359)]


In [16]:
# Counting total number of categories
total = 0
for item in totalCat:
    total += item[1]
print("Total number of unique categories for all features: ", total)

Total number of unique categories for all features:  234358


In [23]:
def createOHEDict(dataRDD):
    """Create a one hot encoding (OHE) dictionary from the entire dataset.
    distinct() ensures each (feature index, category) pair appears only once,
    so every pair gets its own dictionary index."""
    
    OHEDict = (dataRDD.map(indexFeatures)
                       .flatMap(lambda x: x)
                       .distinct()
                       .zipWithIndex()
                       .collectAsMap())
    return OHEDict

In [24]:
trainOHEDict = createOHEDict(TrainData)
print(trainOHEDict[(15, 'f5717f7e')])

trainOHEFeatures = len(trainOHEDict.keys())
print(trainOHEFeatures)

993
234358


## I. One Hot Encoding 
## Preparing each dataset for our classifier model

In [32]:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

def sparseOHE(feature_list, OHEDict):
    """Emitting SparseVector from the entire one hot encoding dictionary 
    for one sample data at a time
    """

    index_list = []
    
    for feat in feature_list:
        if feat in OHEDict:                     # this step is for other dataset when the unique feature might not be in the OHE dict
            index_list.append(OHEDict[feat])
        
    index_list.sort()
    
    # create a list of 1.0s for each feature
    values = [1.] * len(index_list)

    return SparseVector(len(OHEDict.keys()), index_list, values)

def parseRDD(line, OHEDict):
    """Emit a LabeledPoint(label, sparseVector) for one comma-separated sample.
    label = 1 or 0 
    sparseVector = (total number of features, [list of indexes], [list of corresponding values])
    """
    fields = line.split(',')
    
    # pair each feature with its column index, skipping the label in column 0
    features = [(index, feature) for index, feature in enumerate(fields[1:])]
    
    return LabeledPoint(float(fields[0]), sparseOHE(features, OHEDict))

In [33]:
OHETrainData = TrainData.map(lambda line: parseRDD(line, trainOHEDict))
OHEValData = ValData.map(lambda line: parseRDD(line, trainOHEDict))
OHETestData = TestData.map(lambda line: parseRDD(line, trainOHEDict))
print("1st sample of feature engineered train data")
print(OHETrainData.take(1))
print("-"*50)
print("1st sample of feature engineered validation data")
print(OHEValData.take(1))
print("-"*50)
print("1st sample of feature engineered test data")
print(OHETestData.take(1))

1st sample of feature engineered train data
[LabeledPoint(0.0, (234358,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,117215,117216,117217,117218,117219,117220,117221,117222,117223,117224,117225,117226,117227,117228,117229,117230,117231,117232,117233,117234,117235],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]
--------------------------------------------------
1st sample of feature engineered validation data
[LabeledPoint(0.0, (234358,[2,8,15,21,45,49,50,52,61,96,135,160,164,671,1355,5140,5141,5142,117223,117226,117238,117242,117258,117261,117262,117267,117269,117284,117287,118378,119236,120375,122236,122237,122238,122239,122240,122241,170030],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]
--------------------------------------------------
1st sample of feature e

## Classifiers
## Logistic Regression on Criteo CTR Data




In [79]:
# Model definition

from pyspark.mllib.classification import LogisticRegressionWithSGD

# adjustable hyperparameters
numIters = 50
stepSize = 5.
regParam = 1e-6
regType = 'l2'
includeIntercept = True

model0 = LogisticRegressionWithSGD.train(OHETrainData,\
                                         iterations=numIters,\
                                         step=stepSize,\
                                         regParam=regParam,\
                                         regType=regType,\
                                         intercept=includeIntercept)

print("Model weights for the first 5 variables or features developed on train data\n")
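# sorted() orders the weights by value, so the slice below shows the five
# smallest (most negative) weights, not the weights of the first five features
sortedWeights = sorted(model0.weights)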
print(sortedWeights[:5])
print("-"*50)
print("Model intercept\n")
print(model0.intercept)

Model weights for the first 5 variables or features developed on train data

[-0.34718045287758437, -0.345875988465329, -0.3324849806726338, -0.31288488747657667, -0.31038881937206875]
--------------------------------------------------
Model intercept

0.596061956931456


In [80]:
from math import log
from math import exp 

def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probability and label.
    Since log(0) is undefined, we nudge p away from exactly 0 or 1 by a small value (epsilon). 
    """
    epsilon = 10e-12  # equals 1e-11

    if p == 1:
        p = p - epsilon
    elif p == 0:
        p = p + epsilon
    
    if y == 1:
        return -log(p)
    elif y == 0: 
        return -log(1-p)

def calcProb(x, w, intercept):
    """Calculate the probability for an observation given a set of weights and intercept.
    Note    : We'll bound our raw prediction between 20 and -20 for numerical purposes.
    Returns : float: A probability between 0 and 1 for each sample using the model developed earlier, i.e., LogisticRegressionwithSGD
    """
    
    rawPrediction = x.dot(w) + intercept

    # Bound the raw prediction value
    rawPrediction = min(rawPrediction, 20)
    rawPrediction = max(rawPrediction, -20)
    
    # pass it through the sigmoid function
    return (1 + exp(-rawPrediction))**(-1)

def evaluateResults(model, data):
    """Calculates the log loss for the data given the model."""
    
    # generate the predictions and the labels and combine them
    predictions = data.map(lambda x: calcProb(x.features, model.weights, model.intercept))
    labels = data.map(lambda x: x.label)
    pred_labels = predictions.zip(labels)
    
    # compute the log-loss, sum it, and divide by the count
    logLoss = (pred_labels.map(lambda x: computeLogLoss(x[0], x[1])) 
                .reduce(lambda x,y: x + y))/pred_labels.count()
    
    return logLoss

### Calculating the baseline for logloss

In [81]:
# calculate the overall ctr
classOneFracTrain = OHETrainData.map(lambda x: x.label).reduce(lambda x,y: x + y) / OHETrainData.count()
print(classOneFracTrain)

# using the overall ctr, applies to each sample and calculate the logloss
logLossTrBase = (OHETrainData.map(lambda x: computeLogLoss(classOneFracTrain,x.label))
                             .reduce(lambda x,y: x + y))/OHETrainData.count()

print('Baseline Train Logloss = {0:.3f}\n'.format(logLossTrBase))

0.22712452999887575
Baseline Train Logloss = 0.536
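As a sanity check, the train baseline can be recomputed from the click rate alone. When every sample is given the same probability $p$, equal to the fraction of clicks, the average log loss reduces to the binary entropy of $p$. The sketch below uses the click fraction printed above and needs no Spark.

```python
import math

def log_loss(p, y):
    """Log loss of one prediction p for label y (1 or 0)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

# Fraction of clicks in the train split, as printed above.
ctr = 0.22712452999887575

# A share ctr of the samples has label 1, the rest has label 0.
baseline = ctr * log_loss(ctr, 1) + (1.0 - ctr) * log_loss(ctr, 0)
print(round(baseline, 3))  # 0.536
```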



### Model Performance (logloss) on the train data

In [82]:
trainingPredictions = OHETrainData.map(lambda x: calcProb(x.features,model0.weights,model0.intercept))

print("Probability prediction on first 5 samples using LogisticRegression with Stochastic Gradient Descent")
print(trainingPredictions.take(5))

print("-"*50)
logLossTrLR0 = evaluateResults(model0, OHETrainData)
print('OHE Features Train Logloss:\n\tBaseline = {0:.3f}\n\tLogistic Regression model = {1:.3f}'.format(logLossTrBase, logLossTrLR0))

Probability prediction on first 5 samples using LogisticRegression with Stochastic Gradient Descent
[0.28783712176420123, 0.12093271783135372, 0.3152123993814, 0.1700386154024066, 0.5676127642926942]
--------------------------------------------------
OHE Features Train Logloss:
	Baseline = 0.536
	Logistic Regression model = 0.467


### Validation Dataset to check the model performance

Since we developed our model on the train dataset, we now apply it to the validation dataset and check its performance, i.e., the logloss. If need be, we go back to the model and fine-tune the hyperparameters to bring the logloss as near to zero as possible. 

In [83]:
logLossValBase = OHEValData.map(lambda x: computeLogLoss(classOneFracTrain,x.label))\
.reduce(lambda x,y: x + y) / OHEValData.count()

logLossValLR0 = evaluateResults(model0,OHEValData)
print ('OHE Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogistic Regression model = {1:.3f}'
       .format(logLossValBase, logLossValLR0))

OHE Features Validation Logloss:
	Baseline = 0.527
	Logistic Regression model = 0.465


In [84]:
logLossTestBase = OHETestData.map(lambda x: computeLogLoss(classOneFracTrain,x.label))\
.reduce(lambda x,y: x + y) / OHETestData.count()

logLossTestLR0 = evaluateResults(model0,OHETestData)
print ('OHE Features Test Logloss:\n\tBaseline = {0:.3f}\n\tLogistic Regression model = {1:.3f}'
       .format(logLossTestBase, logLossTestLR0))

OHE Features Test Logloss:
	Baseline = 0.539
	Logistic Regression model = 0.472


## Conclusion from Logistic Regression using One Hot Encoding  

Our logistic regression with stochastic gradient descent performs slightly better than the baseline on every split: the log loss drops from 0.536 to 0.467 on train, from 0.527 to 0.465 on validation, and from 0.539 to 0.472 on test. The baseline assigns the average click through rate of the train data as the predicted probability for every sample and computes the log loss on that. Since a model that predicts perfectly would have a log loss of `0`, lower values indicate better performance. 



## II. Features Hashing

In [93]:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from collections import defaultdict
import hashlib

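def hashFunction(numBuckets, rawFeats):
    """Calculate a feature dictionary {bucket: count} for an observation's features based on hashing."""
    
    mapping = {}
    for ind, category in rawFeats:
        # NOTE: joining category and index without a separator is ambiguous, e.g.
        # category '1' at index 1 and an empty category at index 11 both give '11'
        # and are then counted once; kept as-is so the recorded outputs still match
        featureString = (category + str(ind)).encode('utf-8')
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)

    # count how many distinct feature strings fall into each bucket
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)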

def parseHashPoint(line, numBuckets):
    """Create a LabeledPoint for this observation using hashing."""

    # split once: the first field is the label, the rest are the raw features
    fields = line.split(',')
    label = float(fields[0])
    features = fields[1:]
    
    # pair each raw feature with its column index: (index, value)
    indexedFeat = list(enumerate(features))
    
    # hash the features into a {bucket: count} dictionary
    hashed = hashFunction(numBuckets, indexedFeat)
    
    # SparseVector expects its indices in ascending order, so sort by bucket
    sorted_hash = sorted(hashed.items())
    
    # separate the bucket indices from their counts
    sorted_idx = [bucket for bucket, count in sorted_hash]
    sorted_feat = [count for bucket, count in sorted_hash]
    
    # create the sparse vector
    vector = SparseVector(numBuckets, sorted_idx, sorted_feat)
    
    # create the labelled point
    new_point = LabeledPoint(label, vector)
    
    # return the labelled point
    return new_point

### Features engineered by Hashing

In [94]:
# generously sizing the hash table at 2^15 = 32,768 buckets
numBuckets = 2 ** 15
FHTrainData = TrainData.map(lambda x: parseHashPoint(x,numBuckets))
FHValData = ValData.map(lambda x: parseHashPoint(x,numBuckets))
FHTestData = TestData.map(lambda x: parseHashPoint(x,numBuckets))

print("1st sample of feature hashed on train data")
print(FHTrainData.take(1))
print("-"*50)
print("1st sample of feature hashed on validation data")
print(FHValData.take(1))
print("-"*50)
print("1st sample of feature hashed on test data")
print(FHTestData.take(1))

1st sample of feature hashed on train data
[LabeledPoint(0.0, (32768,[1305,2883,3807,4814,4866,4913,6952,7117,9985,10316,11512,11722,12365,13893,14735,15816,16198,17761,19274,21604,22256,22563,22785,24855,25202,25533,25721,26487,26656,27668,28211,29152,29402,29873,30039,31484,32493,32708],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]
--------------------------------------------------
1st sample of feature hashed on validation data
[LabeledPoint(0.0, (32768,[1580,2817,3338,3668,3807,4533,4667,5302,5725,7077,7998,8316,8759,8909,9114,11558,11722,12089,13606,13687,19274,20821,21734,22256,22580,22943,23554,23587,24234,25533,25818,25947,27939,28405,30065,30683,31035,31663],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]
--------------------------------------------------
1st s

In [95]:
# Model definition on Feature hashed train dataset

from pyspark.mllib.classification import LogisticRegressionWithSGD

# adjustable hyperparameters
numIters = 50
stepSize = 10.
regParam = 1e-6
regType = 'l2'
includeIntercept = True

model1 = LogisticRegressionWithSGD.train(FHTrainData,\
                                         iterations=numIters,\
                                         step=stepSize,\
                                         regParam=regParam,\
                                         regType=regType,\
                                         intercept=includeIntercept)

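# note: the weights are sorted in ascending order, so this prints the 5 smallest
# (most negative) weights, not the weights of the first 5 features
print("Model weights for the first 5 variables or features developed on train data\n")
sortedWeights = sorted(model1.weights)
print(sortedWeights[:5])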
print("-"*50)
print("Model intercept\n")
print(model1.intercept)

Model weights for the first 5 variables or features developed on train data

[-0.46290298227041093, -0.4016629631670746, -0.3823074413344563, -0.3628581823423947, -0.34856830469924743]
--------------------------------------------------
Model intercept

0.5703368706362044


In [97]:
trainingPredictions = FHTrainData.map(lambda x: calcProb(x.features,model1.weights,model1.intercept))

print("Probability prediction on first 5 samples using LogisticRegression with Stochastic Gradient Descent")
print(trainingPredictions.take(5))

print("-"*50)
logLossTrLR1 = evaluateResults(model1, FHTrainData)
print('Hashed Features Train Logloss:\n\tBaseline = {0:.3f}\n\tLogistic Regression model = {1:.3f}'.format(logLossTrBase, logLossTrLR1))

Probability prediction on first 5 samples using LogisticRegression with Stochastic Gradient Descent
[0.29951974875230575, 0.10025528557432511, 0.3040045161813861, 0.16816282602629815, 0.5388862232280243]
--------------------------------------------------
Hashed Features Train Logloss:
	Baseline = 0.536
	Logistic Regression model = 0.458


In [98]:
logLossValBase = FHValData.map(lambda x: computeLogLoss(classOneFracTrain, x.label))\
.reduce(lambda x,y: x + y) / FHValData.count()

logLossValLR1 = evaluateResults(model1, FHValData)
print ('Hashed Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogistic Regression model = {1:.3f}'
       .format(logLossValBase, logLossValLR1))

Hashed Features Validation Logloss:
	Baseline = 0.527
	Logistic Regression model = 0.460


In [99]:
logLossTestBase = FHTestData.map(lambda x: computeLogLoss(classOneFracTrain, x.label))\
.reduce(lambda x,y: x + y) / FHTestData.count()

logLossTestLR1 = evaluateResults(model1, FHTestData)
print ('Hashed Features Test Logloss:\n\tBaseline = {0:.3f}\n\tLogistic Regression model = {1:.3f}'
       .format(logLossTestBase, logLossTestLR1))

Hashed Features Test Logloss:
	Baseline = 0.539
	Logistic Regression model = 0.465


### Observation 

Feature hashing does not differ significantly from one-hot encoding in terms of log loss (validation: `0.460` vs. `0.465`; test: `0.465` vs. `0.472`). The advantage is that we reduced the feature space from `234,358` to `32,768` dimensions, an 86% reduction in the number of features the model has to handle. 

These results were obtained on a `100,000`-sample dataset. 

## Question 5: Application of Course Concepts
Pick 3-5 key course concepts and discuss how your work on this assignment illustrates an understanding of these concepts.

1. Parallel computation 
2. Dimensionality Reduction 
3. 

## 1. Parallel computation 


## 2. Dimensionality Reduction (PCA vs Features Hashing) 

Although both PCA and feature hashing reduce the feature dimension, the underlying idea behind the reduction differs in each approach. 

### PCA (Principal Component Analysis) 

In PCA, all features are considered. PCA computes the eigenvectors of the covariance matrix of the features, i.e., the vectors that the matrix only scales by a scalar value (the eigenvalue) without changing their direction. Each eigenvector (principal component) is a linear combination of the original features, and its eigenvalue equals the variance of the data along that direction. The components are ranked by eigenvalue: the first captures the highest variance, and each subsequent component captures the most remaining variance while staying orthogonal to the earlier ones. The more variance the retained components explain, the more of the dataset's structure the reduced representation preserves. The original features contribute unequally to each component: one component may be dominated by `10000` features, while another is dominated by just `200`. In principle, PCA keeps the directions of the feature space that carry the most variance. 
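
As a compact summary of the description above, the standard PCA relations can be written as follows. The symbols are our own notation and do not appear elsewhere in this notebook: $X_c$ is the mean-centered data matrix with $n$ samples and $d$ features, $\Sigma$ is its covariance matrix, and $(\lambda_k, v_k)$ are its eigenvalue and eigenvector pairs.

$$
\Sigma = \frac{1}{n-1} X_c^\top X_c, \qquad \Sigma v_k = \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
$$

$$
\text{fraction of variance explained by the first } m \text{ components} = \frac{\sum_{k=1}^{m} \lambda_k}{\sum_{k=1}^{d} \lambda_k}
$$

Keeping only the first $m$ components reduces the dimension from $d$ to $m$ while retaining this fraction of the total variance.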

### Features Hashing

In contrast to PCA, feature hashing is constrained by the size of the feature space we define up front. We could generously set it to $2^{15} = 32,768$, i.e., the model is trained on `32,768` hashed features. We have no control over which original features stay separate and which get merged. If we start with a small feature space, it is more likely that many distinct features end up sharing a bucket through **hash collisions**, so the model can no longer tell them apart. 

It is also noteworthy that when the dataset is extremely large, running PCA on millions of features is computationally expensive, and feature hashing is recommended.  
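
The effect of the bucket count on collisions can be sketched in pure Python. This is an illustrative sketch, not part of the pipeline: `bucket_of` mirrors the MD5-based mapping in `hashFunction` above, while `count_collisions` and the toy features are hypothetical.

```python
import hashlib

def bucket_of(index, category, num_buckets):
    """Map an (index, category) pair to a bucket, in the style of hashFunction."""
    feature_string = (category + str(index)).encode('utf-8')
    return int(hashlib.md5(feature_string).hexdigest(), 16) % num_buckets

def count_collisions(features, num_buckets):
    """Number of features that land in a bucket already taken by another feature."""
    occupied = {bucket_of(i, c, num_buckets) for i, c in features}
    return len(features) - len(occupied)

# hypothetical toy data: 1,000 distinct (column index, category) pairs;
# the trailing underscore keeps the concatenated feature strings distinct
toy_features = [(i % 26, 'cat{}_'.format(i)) for i in range(1000)]

for num_buckets in (2 ** 6, 2 ** 10, 2 ** 15):
    print(num_buckets, count_collisions(toy_features, num_buckets))
```

With only 64 buckets, at least 936 of the 1,000 toy features must share a bucket; the count falls as the table grows, which is the trade-off between memory and collisions that the choice of `numBuckets` controls.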
