# Criteo Click Through Rate Prediction 

### W261 - Machine Learning at Scale 
Spring Semester  
Ben Arnoldy, Kenneth Chen, Nick Conidas, Rohini Kashibatla, Pavan Kurapati

Criteo is an advertising company that specializes in web ads and gathers click-through information via its ad services. In 2014, Criteo launched a Click Through Rate (CTR) prediction competition hosted on Kaggle. The competition provides two files, train.txt and test.txt. Each row of train.txt carries a label: `1` for click and `0` for no click. The rows of test.txt contain the same features without labels, and the task is to predict whether the user clicks the web ad or not.

### Dataset construction:

The training dataset consists of a portion of Criteo's traffic over a period
of 7 days. Each row corresponds to a display ad served by Criteo and the first
column indicates whether this ad has been clicked or not.
The positive (clicked) and negative (non-clicked) examples have both been
subsampled (but at different rates) in order to reduce the dataset size.

There are 13 features taking integer values (mostly count features) and 26
categorical features. The values of the categorical features have been hashed
onto 32 bits for anonymization purposes. 
The semantics of these features are undisclosed. Some features may have missing values.

The rows are chronologically ordered.

The test set is computed in the same way as the training set but it 
corresponds to events on the day following the training period. 
The first column (label) has been removed.
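
The row layout described above (one label, 13 integer features, 26 categorical features, tab-separated, with possibly empty fields) can be sketched as a small parser. This is a minimal illustration and not part of the notebook's pipeline: the name `parse_criteo_line`, the `has_label` flag, and the choice of `None` for missing values are assumptions made here.

```python
NUM_INT = 13   # integer (mostly count) features
NUM_CAT = 26   # hashed categorical features

def parse_criteo_line(line, has_label=True):
    """Split one tab-separated row into (label, int_features, cat_features).

    Missing fields become None. For test rows (no label column) pass
    has_label=False; the returned label is then None.
    """
    # strip only the newline: a trailing tab marks an empty last field
    fields = line.rstrip('\n').split('\t')
    offset = 1 if has_label else 0
    expected = offset + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError("expected {} fields, got {}".format(expected, len(fields)))
    label = int(fields[0]) if has_label else None
    ints = [int(f) if f != '' else None for f in fields[offset:offset + NUM_INT]]
    cats = [f if f != '' else None for f in fields[offset + NUM_INT:]]
    return label, ints, cats

# synthetic row: label 1, 13 integer fields (12th missing), 26 categorical fields (last missing)
row = '\t'.join(['1'] + ['3'] * 11 + ['', '7'] + ['ab12cd34'] * 25 + [''])
label, ints, cats = parse_criteo_line(row)
```

Keeping missing values as `None` instead of a number leaves the choice of imputation to a later step.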

In [1]:
# imports
import re
import ast
import time
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
import matplotlib.pyplot as plt

In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [4]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "hw5_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

In [90]:
# load the data into Spark RDDs for convenience of use later (RUN THIS CELL AS IS)
trainRDD = sc.textFile('data/train.txt')
testRDD = sc.textFile('data/test.txt')
sampleRDD = sc.textFile('data/dac_sample.txt')

In [38]:
!head -n 2 data/train.txt

0	1	1	5	0	1382	4	15	2	181	1	2		2	68fd1e64	80e26c9b	fb936136	7b4723c4	25c83c98	7e0ccccf	de7995b8	1f89b562	a73ee510	a8cd5504	b2cb9c98	37c9c164	2824a5f6	1adce6ef	8ba8b39a	891b62e7	e5ba7672	f54016b9	21ddcdc9	b1252a9d	07b5194c		3a171ecb	c5c50484	e8b83407	9727dd16
0	2	0	44	1	102	8	2	2	4	1	1		4	68fd1e64	f0cf0024	6f67f7e5	41274cd7	25c83c98	fe6b92e5	922afcc0	0b153874	a73ee510	2b53e5fb	4f1b46f3	623049e6	d7020589	b28479f6	e6c5b5cd	c92f3b61	07c540c4	b04e4670	21ddcdc9	5840adea	60f6221e		3a171ecb	43f13e8b	e8b83407	731c3655


In [9]:
!head -n 2 data/test.txt

	29	50	5	7260	437	1	4	14		1	0	6	5a9ed9b0	a0e12995	a1e14474	08a40877	25c83c98		964d1fdd	5b392875	a73ee510	de89c3d2	59cd5ae7	8d98db20	8b216f7b	1adce6ef	78c64a1d	3ecdadf7	3486227d	1616f155	21ddcdc9	5840adea	2c277e62		423fab69	54c91918	9b3e8820	e75c9ae9
27	17	45	28	2	28	27	29	28	1	1		23	68fd1e64	960c983b	9fbfbfd5	38c11726	25c83c98	7e0ccccf	fe06fd10	062b5529	a73ee510	ca53fc84	67360210	895d8bbb	4f8e2224	f862f261	b4cc2435	4c0041e5	e5ba7672	b4abdd09	21ddcdc9	5840adea	36a7ab86		32c7478e	85e4d73f	010f6491	ee63dd9b


In [10]:
!wc -l data/train.txt

 45840617 data/train.txt


In [44]:
# part d - provided FloatAccumulator class (RUN THIS CELL AS IS)

from pyspark.accumulators import AccumulatorParam

class FloatAccumulatorParam(AccumulatorParam):
    """
    Custom accumulator for use in page rank to keep track of various masses.
    
    IMPORTANT: accumulators should only be called inside actions to avoid duplication.
    We strongly recommend you use the 'foreach' action in your implementation below.
    """
    def zero(self, value):
        return value
    def addInPlace(self, val1, val2):
        return val1 + val2

In [61]:
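def avgCTR(dataRDD):
    """Return the fraction of rows whose label (first column) is 1."""
    # accumulators are updated on the executors and read back on the driver
    clickCount = sc.accumulator(0.0)
    totAccum = sc.accumulator(0.0)

    def countCTR(line):
        # the label is the first tab-separated field: 1 = click, 0 = no click
        label = line.split('\t')[0]
        clickCount.add(int(label))
        totAccum.add(1)

    # update the accumulators inside an action (foreach) so every row is counted once
    dataRDD.foreach(countCTR)

    return clickCount.value / totAccum.value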

In [54]:
# This approach takes 6 minutes. 
trainRDD.map(lambda x: int(x.split('\t')[0])).mean()

0.25622338372976045

In [63]:
# This approach takes 2.7 minutes. 
start = time.time()
CTR = avgCTR(trainRDD) 
print("The average click through rate is {}".format(CTR)) 
print("Time taken : {} seconds".format(time.time() - start))

The average click through rate is 0.2562233837297609
Time taken : 164.37067008018494 seconds
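
A third way to get the same average, sketched here on the assumption that a single pass with `RDD.aggregate` is acceptable, is to carry a `(clicks, rows)` pair through the data. The helper names `seq_op` and `comb_op` are illustrative and do not appear in the notebook. The example runs them on two small in-memory lists standing in for partitions, so it can be checked without Spark; the Spark call is shown as a comment and was not timed.

```python
from functools import reduce

def seq_op(acc, line):
    """Fold one row into a (clicks, rows) pair."""
    clicks, rows = acc
    return clicks + int(line.split('\t')[0]), rows + 1

def comb_op(a, b):
    """Merge the (clicks, rows) pairs of two partitions."""
    return a[0] + b[0], a[1] + b[1]

# With Spark: clicks, rows = trainRDD.aggregate((0, 0), seq_op, comb_op)

# local stand-in for two partitions (made-up rows, label in the first field)
part1 = ["0\t1\t5", "1\t2\t7"]
part2 = ["0\t3\t9", "0\t4\t1"]
partials = [reduce(seq_op, part, (0, 0)) for part in (part1, part2)]
clicks, rows = reduce(comb_op, partials, (0, 0))
ctr = clicks / rows
```

Unlike the accumulator version, this keeps all state in the return value, so no shared variables are involved.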


## Observation 
The average click-through rate is about `0.256`, meaning that roughly 26 of every 100 display ads in this dataset were clicked. Because the positive and negative examples were subsampled at different rates, this figure describes the sampled data and may not match the rate in Criteo's real traffic. Even so, most of the served ads were not clicked, which points to weak performance of the display ads. Ideally, we would want close to 100% of display ads to be clicked, and a click-through rate of at least 80% would improve the profits from display ads. This suggests that some features are critical to whether a display ad is clicked and could be a focus for the advertising team.
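
Since the dataset description says clicked and non-clicked rows were subsampled at different rates, the observed rate can in principle be mapped back to the original traffic if those rates are known. The sketch below shows the arithmetic only. The sampling rates are not disclosed in the dataset description, so `keep_pos` and `keep_neg` (and the function name `correct_ctr`) are hypothetical, and the numbers in the example are made up.

```python
def correct_ctr(observed_ctr, keep_pos, keep_neg):
    """Estimate the pre-subsampling CTR.

    observed_ctr: fraction of clicked rows in the subsampled data
    keep_pos:     fraction of clicked rows that were kept (hypothetical)
    keep_neg:     fraction of non-clicked rows that were kept (hypothetical)
    """
    if not (0 < keep_pos <= 1 and 0 < keep_neg <= 1):
        raise ValueError("keep rates must be in (0, 1]")
    # undo the subsampling: each kept row stands for 1 / keep_rate original rows
    pos = observed_ctr / keep_pos
    neg = (1 - observed_ctr) / keep_neg
    return pos / (pos + neg)

# made-up example: all clicks kept, one in ten non-clicks kept
example = correct_ctr(0.25, keep_pos=1.0, keep_neg=0.1)
# equal keep rates leave the observed rate unchanged
unchanged = correct_ctr(0.25, keep_pos=0.5, keep_neg=0.5)
```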

## 100,000 sample dataset

In [11]:
!head -n 2 data/dac_sample.txt

0	1	1	5	0	1382	4	15	2	181	1	2		2	68fd1e64	80e26c9b	fb936136	7b4723c4	25c83c98	7e0ccccf	de7995b8	1f89b562	a73ee510	a8cd5504	b2cb9c98	37c9c164	2824a5f6	1adce6ef	8ba8b39a	891b62e7	e5ba7672	f54016b9	21ddcdc9	b1252a9d	07b5194c		3a171ecb	c5c50484	e8b83407	9727dd16
0	2	0	44	1	102	8	2	2	4	1	1		4	68fd1e64	f0cf0024	6f67f7e5	41274cd7	25c83c98	fe6b92e5	922afcc0	0b153874	a73ee510	2b53e5fb	4f1b46f3	623049e6	d7020589	b28479f6	e6c5b5cd	c92f3b61	07c540c4	b04e4670	21ddcdc9	5840adea	60f6221e		3a171ecb	43f13e8b	e8b83407	731c3655


In [12]:
!wc -l data/dac_sample.txt

  100000 data/dac_sample.txt


In [16]:
import os.path
# path to the 100,000-row sample, relative to the notebook
fileName = 'data/dac_sample.txt'

if os.path.isfile(fileName):
    rawData = (sc
               .textFile(fileName, 2)   # read with at least 2 partitions
               .map(lambda x: x.replace('\t', ',')))  # convert tab-separated fields to comma-separated
    print(rawData.take(1))

['0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']


In [18]:
# split the sample into train / validation / test sets (80% / 10% / 10%)
weights = [.8, .1, .1]
seed = 42
# Use randomSplit with weights and seed
TrainData, ValData, TestData = rawData.randomSplit(weights,seed)
# Cache the data
TrainData.cache()
ValData.cache()
TestData.cache()

# count the data
nTrain = TrainData.count()
nVal = ValData.count()
nTest = TestData.count()
print(nTrain, nVal, nTest, nTrain + nVal + nTest)
print(rawData.take(1))

80053 9941 10006 100000
['0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']
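
The three counts do not match the 80/10/10 weights exactly because `randomSplit` assigns each row at random, so the split sizes are only approximately proportional to the weights. A quick check of how far the realized fractions are from the requested ones, using the counts printed above (the helper name `split_fractions` is an assumption of this sketch):

```python
def split_fractions(counts):
    """Return each split's share of the total row count."""
    total = sum(counts)
    return [c / total for c in counts]

weights = [.8, .1, .1]
counts = [80053, 9941, 10006]   # nTrain, nVal, nTest from the output above

fractions = split_fractions(counts)
# largest absolute gap between a realized fraction and its requested weight
max_dev = max(abs(f - w) for f, w in zip(fractions, weights))
```

For these counts every split is within a tenth of a percentage point of its requested weight.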


In [27]:
TrainData.take(1)

['0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']

In [175]:
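def initGF(dataRDD):
    """Extract integer features from tab-separated rows, filling missing values with 0."""

    def sparse(line):
        features = []
        # field 0 is the label; [1:13] keeps integer features 1-12
        # (the 13th integer feature, at index 13, is not included)
        for feat in line.split('\t')[1:13]:
            if feat == '':
                features.append(0)   # missing value -> 0
            else:
                features.append(int(feat))
        return features

    # map is lazy: nothing is computed until an action is called
    tempRDD = dataRDD.map(sparse)
    return tempRDD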
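
Replacing missing values with 0 makes them indistinguishable from genuine zero counts, so it can help to know how many fields are empty in each column before imputing. The sketch below counts empty fields per integer column on a list of raw rows. The function name `count_missing` and its `start`/`stop` parameters are assumptions of this example; the two rows are the label and integer fields of the first two lines of the sample file shown earlier.

```python
def count_missing(lines, start=1, stop=14):
    """Count empty fields per column for columns start..stop-1 of tab-separated rows."""
    counts = [0] * (stop - start)
    for line in lines:
        for i, field in enumerate(line.split('\t')[start:stop]):
            if field == '':
                counts[i] += 1
    return counts

# label + 13 integer fields of the first two sample rows (12th integer field is empty)
rows = [
    "0\t1\t1\t5\t0\t1382\t4\t15\t2\t181\t1\t2\t\t2",
    "0\t2\t0\t44\t1\t102\t8\t2\t2\t4\t1\t1\t\t4",
]
missing = count_missing(rows)

# On the driver, a small slice of the RDD could be inspected the same way,
# e.g. count_missing(sampleRDD.take(1000))
```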

In [176]:
start = time.time()
sampleGF = initGF(sampleRDD)
print("Time taken {} seconds".format(time.time() - start))

Time taken 0.0005571842193603516 seconds


In [177]:
sampleGF.collect()

[[1, 1, 5, 0, 1382, 4, 15, 2, 181, 1, 2, 0],
 [2, 0, 44, 1, 102, 8, 2, 2, 4, 1, 1, 0],
 [2, 0, 1, 14, 767, 89, 4, 2, 245, 1, 3, 3],
 [0, 893, 0, 0, 4392, 0, 0, 0, 0, 0, 0, 0],
 [3, -1, 0, 0, 2, 0, 3, 0, 0, 1, 1, 0],
 [0, -1, 0, 0, 12824, 0, 0, 0, 6, 0, 0, 0],
 [0, 1, 2, 0, 3168, 0, 0, 1, 2, 0, 0, 0],
 [1, 4, 2, 0, 0, 0, 1, 0, 0, 1, 1, 0],
 [0, 44, 4, 8, 19010, 249, 28, 31, 141, 0, 1, 0],
 [0, 35, 0, 1, 33737, 21, 1, 2, 3, 0, 1, 0],
 [0, 2, 632, 0, 56770, 0, 0, 5, 65, 0, 0, 0],
 [0, 6, 6, 6, 421, 109, 1, 7, 107, 0, 1, 0],
 [0, -1, 0, 0, 1465, 0, 17, 0, 4, 0, 4, 0],
 [0, 2, 11, 5, 10262, 34, 2, 4, 5, 0, 1, 0],
 [0, 51, 84, 4, 3633, 26, 1, 4, 8, 0, 1, 0],
 [0, 2, 1, 18, 20255, 0, 0, 1, 1306, 0, 0, 0],
 [1, 987, 0, 2, 105, 2, 1, 2, 2, 1, 1, 0],
 [0, 1, 0, 0, 16597, 557, 3, 5, 123, 0, 1, 0],
 [0, 24, 4, 2, 2056, 12, 6, 10, 83, 0, 1, 0],
 [7, 102, 0, 3, 780, 15, 7, 15, 15, 1, 1, 0],
 [0, 47, 0, 0, 6399, 38, 19, 10, 143, 0, 10, 0],
 [0, 1, 80, 0, 1848, 287, 1, 4, 46, 0, 1, 0],
 [0, 0, 14, 6, 

In [178]:
sampleDF = spark.createDataFrame(sampleGF)

In [179]:
sampleDF.show()

+---+---+---+---+-----+---+---+---+----+---+---+---+
| _1| _2| _3| _4|   _5| _6| _7| _8|  _9|_10|_11|_12|
+---+---+---+---+-----+---+---+---+----+---+---+---+
|  1|  1|  5|  0| 1382|  4| 15|  2| 181|  1|  2|  0|
|  2|  0| 44|  1|  102|  8|  2|  2|   4|  1|  1|  0|
|  2|  0|  1| 14|  767| 89|  4|  2| 245|  1|  3|  3|
|  0|893|  0|  0| 4392|  0|  0|  0|   0|  0|  0|  0|
|  3| -1|  0|  0|    2|  0|  3|  0|   0|  1|  1|  0|
|  0| -1|  0|  0|12824|  0|  0|  0|   6|  0|  0|  0|
|  0|  1|  2|  0| 3168|  0|  0|  1|   2|  0|  0|  0|
|  1|  4|  2|  0|    0|  0|  1|  0|   0|  1|  1|  0|
|  0| 44|  4|  8|19010|249| 28| 31| 141|  0|  1|  0|
|  0| 35|  0|  1|33737| 21|  1|  2|   3|  0|  1|  0|
|  0|  2|632|  0|56770|  0|  0|  5|  65|  0|  0|  0|
|  0|  6|  6|  6|  421|109|  1|  7| 107|  0|  1|  0|
|  0| -1|  0|  0| 1465|  0| 17|  0|   4|  0|  4|  0|
|  0|  2| 11|  5|10262| 34|  2|  4|   5|  0|  1|  0|
|  0| 51| 84|  4| 3633| 26|  1|  4|   8|  0|  1|  0|
|  0|  2|  1| 18|20255|  0|  0|  1|1306|  0|  

In [180]:
sampleDF.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: long (nullable = true)
 |-- _3: long (nullable = true)
 |-- _4: long (nullable = true)
 |-- _5: long (nullable = true)
 |-- _6: long (nullable = true)
 |-- _7: long (nullable = true)
 |-- _8: long (nullable = true)
 |-- _9: long (nullable = true)
 |-- _10: long (nullable = true)
 |-- _11: long (nullable = true)
 |-- _12: long (nullable = true)



In [129]:
sampleDF.count()

100000

In [181]:
# https://stackoverflow.com/questions/51831874/how-to-get-correlation-matrix-values-pyspark

from pyspark.mllib.stat import Statistics

col_names = sampleDF.columns
# convert each Row to a plain sequence of values for mllib
features = sampleDF.rdd.map(lambda row: row[0:])
# Pearson correlation matrix of the 12 integer feature columns
corr_mat = Statistics.corr(features, method="pearson")
corr_df = pd.DataFrame(corr_mat)
corr_df.index, corr_df.columns = col_names, col_names

In [182]:
print(corr_df.to_string())

           _1        _2        _3        _4        _5        _6        _7        _8        _9       _10       _11       _12
_1   1.000000  0.035891  0.005063  0.072212 -0.067207 -0.068811  0.480229  0.025994  0.051337  0.447186  0.267833  0.112176
_2   0.035891  1.000000 -0.010739 -0.089524 -0.010694 -0.014957  0.016917 -0.016743 -0.007257  0.025012  0.019018 -0.003951
_3   0.005063 -0.010739  1.000000  0.046064 -0.004894  0.008781  0.006377  0.047232  0.039657 -0.003239  0.011224 -0.001866
_4   0.072212 -0.089524  0.046064  1.000000 -0.085205  0.026127  0.038853  0.367403  0.245387  0.144144  0.069665  0.021059
_5  -0.067207 -0.010694 -0.004894 -0.085205  1.000000  0.010853 -0.051607 -0.040148 -0.053966 -0.150039 -0.111524 -0.020224
_6  -0.068811 -0.014957  0.008781  0.026127  0.010853  1.000000 -0.028241  0.005038  0.221015 -0.150347 -0.040539 -0.013809
_7   0.480229  0.016917  0.006377  0.038853 -0.051607 -0.028241  1.000000  0.021495  0.195873  0.224310  0.645329  0.117434
_8   0.0
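
A full correlation table is hard to scan by eye, so one option is to list the feature pairs with the largest absolute correlation. The sketch below does this on a plain nested list; the function name `top_pairs` is an assumption, and the 3x3 matrix contains only the `_1`, `_4`, `_7` entries copied from the printed output above.

```python
def top_pairs(names, matrix, k=3):
    """Return the k off-diagonal pairs with the largest |correlation|."""
    pairs = []
    n = len(names)
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only: the matrix is symmetric
            pairs.append((names[i], names[j], matrix[i][j]))
    pairs.sort(key=lambda t: abs(t[2]), reverse=True)
    return pairs[:k]

names = ['_1', '_4', '_7']
matrix = [
    [1.000000, 0.072212, 0.480229],
    [0.072212, 1.000000, 0.038853],
    [0.480229, 0.038853, 1.000000],
]
best = top_pairs(names, matrix, k=1)

# With the notebook's DataFrame this could be called as
# top_pairs(list(corr_df.columns), corr_df.values.tolist())
```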

In [188]:
# https://stackoverflow.com/questions/52214404/how-to-get-the-correlation-matrix-of-a-pyspark-data-frame

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

vector_col = "corr_features"
assembler = VectorAssembler(inputCols=sampleDF.columns, outputCol=vector_col)
df_vector = assembler.transform(sampleDF).select(vector_col)

# get correlation matrix
matrix = Correlation.corr(df_vector, vector_col)
matrix.collect()[0]["pearson({})".format(vector_col)].values

array([ 1.        ,  0.03589145,  0.00506312,  0.07221178, -0.06720698,
       -0.06881115,  0.48022876,  0.02599412,  0.05133746,  0.44718551,
        0.26783259,  0.11217636,  0.03589145,  1.        , -0.01073907,
       -0.08952401, -0.01069403, -0.01495693,  0.01691694, -0.01674285,
       -0.00725722,  0.02501235,  0.01901848, -0.00395116,  0.00506312,
       -0.01073907,  1.        ,  0.04606407, -0.00489405,  0.00878104,
        0.00637663,  0.04723187,  0.03965702, -0.00323904,  0.01122397,
       -0.00186575,  0.07221178, -0.08952401,  0.04606407,  1.        ,
       -0.08520458,  0.02612704,  0.03885344,  0.36740252,  0.24538707,
        0.14414358,  0.06966468,  0.02105947, -0.06720698, -0.01069403,
       -0.00489405, -0.08520458,  1.        ,  0.01085306, -0.05160656,
       -0.04014843, -0.05396592, -0.15003879, -0.11152414, -0.0202244 ,
       -0.06881115, -0.01495693,  0.00878104,  0.02612704,  0.01085306,
        1.        , -0.02824132,  0.00503771,  0.22101521, -0.15
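
The `.values` attribute returns the matrix as one flat array, which is harder to read than the table form. A minimal sketch of regrouping a flat list into square rows follows; `to_square` is a hypothetical helper, and the nine values are the `_1`, `_2`, `_3` entries from the output above. Because a correlation matrix is symmetric, the result is the same whether the flat values are stored row by row or column by column. In Spark itself, calling `toArray()` on the matrix object instead of `.values` should give a two-dimensional array directly.

```python
def to_square(flat, n):
    """Regroup a flat sequence of n*n values into n rows of n values."""
    if len(flat) != n * n:
        raise ValueError("expected {} values, got {}".format(n * n, len(flat)))
    return [list(flat[i * n:(i + 1) * n]) for i in range(n)]

flat = [1.0, 0.03589145, 0.00506312,
        0.03589145, 1.0, -0.01073907,
        0.00506312, -0.01073907, 1.0]
square = to_square(flat, 3)
```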