# Final Project
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Fall 2018`__  
Throughout this course you’ve engaged with key principles required to develop scalable machine learning analyses for structured and unstructured data. Working in Hadoop Streaming and Spark, you’ve learned to translate common machine learning algorithms into Map-Reduce style implementations. You’ve developed the ability to evaluate machine learning approaches in terms of both their predictive performance and their scalability. For the final project you will demonstrate these skills by solving a machine learning challenge on a new dataset. Your job is to perform click-through rate (CTR) prediction on a large dataset of Criteo advertising data made public as part of a Kaggle competition a few years back. As you perform your analysis, keep in mind that we are not grading you on the final performance of your model or on how ‘advanced’ your techniques are, but rather on your ability to explain and develop a scalable machine learning approach to answering a real question.

More about the dataset:
https://www.kaggle.com/c/criteo-display-ad-challenge

# Notebook Set-Up
Before starting your homework, run the following cells to confirm your setup.

In [37]:
import re
import ast
import time
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [38]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [39]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [40]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "fproj_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext

In [72]:
# Toy dataset: try out FeatureHasher on a small DataFrame
from pyspark.ml.feature import FeatureHasher

cols = ["id", "category1", "category2"]
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], cols)
df.show()
# Hash all three columns into a 4-dimensional sparse feature vector
hasher = FeatureHasher(inputCols=cols, outputCol="features")
hasher.setNumFeatures(4)
featurized = hasher.transform(df)
featurized.select("features").show(truncate=False)

+---+---------+---------+
| id|category1|category2|
+---+---------+---------+
|  0|        a|        1|
|  1|        b|        2|
|  2|        c|        3|
|  3|        a|        4|
|  4|        a|        4|
|  5|        c|        3|
+---+---------+---------+

+-------------------+
|features           |
+-------------------+
|(4,[0],[2.0])      |
|(4,[0,1],[3.0,1.0])|
|(4,[0],[6.0])      |
|(4,[0],[8.0])      |
|(4,[0],[9.0])      |
|(4,[0],[9.0])      |
+-------------------+
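
As a rough illustration of the hashing trick behind the output above, the sketch below maps each column of one toy row into a fixed number of buckets. String values contribute an indicator of 1.0 at the bucket of `column=value`, and numeric values are added at the bucket of the column name. It uses `hashlib.md5` as a stand-in hash, so the bucket indices will not match Spark's `FeatureHasher`, which uses MurmurHash3. The helper names `bucket` and `hash_features` are hypothetical.

```python
import hashlib

def bucket(key, num_features):
    """Map a string key to a bucket index with a stable (non-Spark) hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hash_features(row, num_features):
    """Hash a dict of column -> value into a dense list of length num_features."""
    vec = [0.0] * num_features
    for col, value in row.items():
        if isinstance(value, str):
            # categorical: add an indicator at the bucket of "col=value"
            vec[bucket(f"{col}={value}", num_features)] += 1.0
        else:
            # numeric: add the value itself at the bucket of the column name
            vec[bucket(col, num_features)] += float(value)
    return vec

# First row of the toy DataFrame above
row = {"id": 0, "category1": "a", "category2": 1}
vec = hash_features(row, num_features=4)
print(vec)
```

With only four buckets, different columns frequently collide and their contributions are summed, which is why so few distinct indices appear in the Spark output.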



In [41]:
import os
import sys
import os.path
import pyspark

# Read the raw file one record per line and convert tabs to commas
with open('data/train2000', 'r') as file:
    dacContents = file.read().split("\n")
dacContents = [x.strip().replace('\t', ',') for x in dacContents]

numPartitions = 2
rawData = sc.parallelize(dacContents)

In [42]:
weights = [.8, .1, .1]
seed = 42
rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights,seed)
# Cache the data
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print(nTrain, nVal, nTest, nTrain + nVal + nTest)
print(rawData.take(3))

1585 203 213 2001
['id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21', '1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79', '10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79']


In [43]:
# get all categorical (object dtype) columns of a pandas DataFrame
def get_categorical(df):
    categorical = [var for var in df.columns if df[var].dtype == 'O']
    return categorical

# get all numerical (non-object dtype) columns of a pandas DataFrame
def get_numerical(df):
    numerical = [var for var in df.columns if df[var].dtype != 'O']
    return numerical


In [44]:
from pyspark.ml.feature import FeatureHasher
from pyspark.sql import SQLContext, Row
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.types import IntegerType
from pyspark.ml.classification import LogisticRegression

regParams = [1e-6,1e-3]
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("data/train2000")
df.printSchema()

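# NOTE: the input columns include "click" (the label) and "id" (a row identifier),
# so both are hashed into the feature vector along with the real predictors
hasher = FeatureHasher(inputCols=["id", "click", "hour", "C1","banner_pos","site_id","site_domain"],
                       outputCol="features")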
df.select("id", "click", "hour", "C1","banner_pos","site_id","site_domain").show(2)

hasher.setNumFeatures(20)
print("Number of features ",hasher.getNumFeatures())
testDF = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("data/train_tail_10")
testDF = testDF.select("id", "hour", "C1","banner_pos","site_id","site_domain","click")
testDF= testDF.withColumn("click", testDF["click"].cast(IntegerType()))
hashertest = FeatureHasher(inputCols=["id", "hour", "click","C1","banner_pos","site_id","site_domain"],
                           outputCol="features")
hashertest.setNumFeatures(20)
ftestDF = hashertest.transform(testDF)

featurized = hasher.transform(df)
df.show(2)
featurized.show(2)
featurized.select('features').show(5,truncate=False)
numIters = 5
regType = 'l1'
includeIntercept = True


stepSizes = [1,10]

f = featurized.select('click','features')
f.show(10)
f = f.withColumn("click", f["click"].cast(IntegerType()))

lr = LogisticRegression(labelCol="click",featuresCol="features",maxIter = 10)
lrModel = lr.fit(f)
print("lrmodel",lrModel)

predictions = lrModel.transform(ftestDF)
predictions.show(20)
testDF.show(20)
predictions.show(10)



root
 |-- id: string (nullable = true)
 |-- click: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- banner_pos: string (nullable = true)
 |-- site_id: string (nullable = true)
 |-- site_domain: string (nullable = true)
 |-- site_category: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- app_domain: string (nullable = true)
 |-- app_category: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_ip: string (nullable = true)
 |-- device_model: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- device_conn_type: string (nullable = true)
 |-- C14: string (nullable = true)
 |-- C15: string (nullable = true)
 |-- C16: string (nullable = true)
 |-- C17: string (nullable = true)
 |-- C18: string (nullable = true)
 |-- C19: string (nullable = true)
 |-- C20: string (nullable = true)
 |-- C21: string (nullable = true)

+--------------------+-----+--------+----+----------+--------+-

In [45]:
selected = predictions.select("hour", "prediction", "probability", "banner_pos", "site_id","rawPrediction","click")
display(selected)
selected.count()
selected.show(10)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Inspect the schema of the predictions
predictions.printSchema()

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol="click")
evaluator.evaluate(predictions)

evaluator.getMetricName()


DataFrame[hour: string, prediction: double, probability: vector, banner_pos: string, site_id: string, rawPrediction: vector, click: int]

+--------+----------+--------------------+----------+--------+--------------------+-----+
|    hour|prediction|         probability|banner_pos| site_id|       rawPrediction|click|
+--------+----------+--------------------+----------+--------+--------------------+-----+
|14103023|       0.0|[0.91796131731855...|         0|85f751fd|[2.41496437568790...|    0|
|14103023|       0.0|[0.80644370577670...|         0|83a0ad1a|[1.42706569810265...|    1|
|14103023|       0.0|[0.80644196465003...|         1|856e6d3f|[1.42705454366892...|    0|
|14103023|       0.0|[0.91796131731855...|         0|85f751fd|[2.41496437568790...|    0|
|14103023|       0.0|[0.92361582717442...|         0|85f751fd|[2.49252070098601...|    0|
|14103023|       0.0|[0.88185793228089...|         1|e151e245|[2.01014310451765...|    1|
|14103023|       0.0|[0.92361582717442...|         0|85f751fd|[2.49252070098601...|    0|
|14103023|       0.0|[0.86533995168475...|         1|f61eaaae|[1.86036899554122...|    0|
|14103023|

'areaUnderROC'

In [46]:
print(lr.explainParams())
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())

# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(f)
# this will likely take a fair amount of time because of the number of models that we're creating and testing

# Use test set to measure the accuracy of our model on new data
predictions = cvModel.transform(ftestDF)

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: click)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

__`REMINDER:`__ If you are running this notebook on the course docker container, you can monitor the progress of your jobs using the Spark UI at: http://localhost:4040/jobs/

In [47]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.5631035966032081

In [48]:
print('Model Intercept: ', cvModel.bestModel.intercept)

Model Intercept:  -1.630758568797585


In [49]:
weights = cvModel.bestModel.coefficients
weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
weightsDF = sqlContext.createDataFrame(weights, ["Feature Weight"])
display(weightsDF)
weightsDF.show(2)

DataFrame[Feature Weight: double]

+-------------------+
|     Feature Weight|
+-------------------+
|0.03499620827637826|
| 0.2983161853161314|
+-------------------+
only showing top 2 rows



In [50]:
selected = predictions.select("click", "prediction", "probability", "banner_pos", "C1")
display(selected)
selected.count()
selected.show(10)

DataFrame[click: int, prediction: double, probability: vector, banner_pos: string, C1: string]

+-----+----------+--------------------+----------+----+
|click|prediction|         probability|banner_pos|  C1|
+-----+----------+--------------------+----------+----+
|    0|       0.0|[0.85972227103750...|         0|1005|
|    1|       0.0|[0.77520730867717...|         0|1005|
|    0|       0.0|[0.81854245291356...|         1|1005|
|    0|       0.0|[0.85972227103750...|         0|1005|
|    0|       0.0|[0.86002566403318...|         0|1005|
|    1|       0.0|[0.85972227103750...|         1|1005|
|    0|       0.0|[0.86002566403318...|         0|1005|
|    0|       0.0|[0.83413944644018...|         1|1005|
|    1|       0.0|[0.85972227103750...|         0|1005|
|    0|       0.0|[0.80756060095088...|         0|1005|
+-----+----------+--------------------+----------+----+
only showing top 10 rows



# Question 1: Question Formulation 
Introduce the goal of your analysis. What questions will you seek to answer? Why do people perform this kind of analysis on this kind of data? Preview what level of performance your model would need to achieve to be practically useful.

# Question 2: Algorithm Explanation
Create your own toy example that matches the dataset provided and use this toy example to explain the math behind the algorithm that you will use.
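
As one possible starting point, assuming logistic regression (the model fitted in the set-up cells above) is the algorithm being explained, the core equations can be sketched as follows. The symbols $\mathbf{w}$ (weights), $b$ (intercept), $\eta$ (learning rate) and $n$ (number of records) are notation chosen for this sketch, not taken from the assignment.

$$
\hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_i + b)}}
$$

$$
J(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
$$

$$
\nabla_{\mathbf{w}} J = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)\, \mathbf{x}_i,
\qquad
\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)
$$

$$
\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla_{\mathbf{w}} J,
\qquad
b \leftarrow b - \eta\, \frac{\partial J}{\partial b}
$$

Because the gradient is a sum of independent per-record terms, each record's contribution can be computed in a map step and the contributions added in a reduce step.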

# Question 3: EDA & Discussion of Challenges
Determine 2-3 relevant EDA tasks that will help you make decisions about how you implement the algorithm to be scalable. Discuss any challenges that you anticipate based on the EDA you perform.
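
One EDA task that often informs scalability decisions is counting the distinct values in each categorical column, since high-cardinality columns make one-hot encoding expensive. The sketch below does this in plain Python on a few columns of the two sample records printed in the set-up section. The helper name `column_cardinality` is hypothetical, and on the full dataset the same idea would need a distributed aggregation rather than in-memory sets.

```python
from collections import defaultdict

# A few columns from the two sample records printed earlier in this notebook
header = ["click", "C1", "banner_pos", "site_id", "device_ip", "device_model"]
rows = [
    ["0", "1005", "0", "1fbe01fe", "ddd2926e", "44956a24"],
    ["0", "1005", "0", "1fbe01fe", "96809ac8", "711ee120"],
]

def column_cardinality(header, rows):
    """Return the number of distinct values observed in each column."""
    distinct = defaultdict(set)
    for row in rows:
        for col, value in zip(header, row):
            distinct[col].add(value)
    return {col: len(distinct[col]) for col in header}

cardinality = column_cardinality(header, rows)
print(cardinality)
```

Even in two records, `device_ip` and `device_model` already take two distinct values each, a hint that such columns may need hashing or frequency thresholds at scale.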

In [11]:
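# projectRDD is not defined in this notebook; this cell assumes it holds raw
# tab-separated records with the click label as the first field
average_CTR = projectRDD.map(lambda x: int(x.split('\t')[0])).mean()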

In [12]:
average_CTR

0.25622338372976045
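
The average CTR above also gives a simple reference point for model quality. Assuming log loss is the evaluation metric (an assumption of this sketch, not something fixed by the assignment), a model that predicts the average CTR for every record scores the value computed below, and a useful model should beat it. The helper name `log_loss_constant` is hypothetical.

```python
import math

def log_loss_constant(p):
    """Expected log loss of always predicting probability p
    when the true positive rate is also p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

average_ctr = 0.25622338372976045  # value printed by the cell above
baseline = log_loss_constant(average_ctr)
print(round(baseline, 4))
```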

# Question 4: Algorithm Implementation 
Develop a 'homegrown' implementation of the algorithm, apply it to the training dataset, and evaluate your results on the test set.
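
A minimal sketch of what a homegrown implementation could look like, assuming logistic regression trained by batch gradient descent. It uses Python's built-in `map` and `functools.reduce` to mimic the map and reduce steps that would run on an RDD. The function names, the learning rate, the iteration count and the toy records are all made up for illustration.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights):
    """Predicted click probability; features[0] is a constant 1.0 bias input."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def record_gradient(record, weights):
    """'Map' step: log-loss gradient contribution of one (label, features) record."""
    label, features = record
    error = predict(features, weights) - label
    return [error * x for x in features]

def add_vectors(a, b):
    """'Reduce' step: element-wise sum of two gradient vectors."""
    return [x + y for x, y in zip(a, b)]

def log_loss(records, weights):
    """Mean log loss of the model over a list of (label, features) records."""
    total = 0.0
    for label, features in records:
        p = predict(features, weights)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(records)

def train(records, n_features, learning_rate=0.5, n_iters=200):
    """Batch gradient descent: map per-record gradients, reduce, then update."""
    weights = [0.0] * n_features
    for _ in range(n_iters):
        gradients = map(lambda r: record_gradient(r, weights), records)
        total = reduce(add_vectors, gradients)
        weights = [w - learning_rate * g / len(records)
                   for w, g in zip(weights, total)]
    return weights

# Made-up toy records: (click label, [bias, feature_1, feature_2])
toy_records = [
    (1, [1.0, 1.0, 0.0]),
    (0, [1.0, 0.0, 1.0]),
    (1, [1.0, 1.0, 1.0]),
    (0, [1.0, 0.0, 0.0]),
]
initial_loss = log_loss(toy_records, [0.0, 0.0, 0.0])
weights = train(toy_records, n_features=3)
final_loss = log_loss(toy_records, weights)
print(initial_loss, final_loss)
```

In Spark, the two helper functions would plausibly be passed to `rdd.map` and `rdd.reduce`, with the current weights shipped to the workers on each iteration.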

# Question 5: Application of Course Concepts
Pick 3-5 key course concepts and discuss how your work on this assignment illustrates an understanding of these concepts. 