# Final Project
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Fall 2018`__  
Throughout this course you’ve engaged with key principles required to develop scalable machine learning analyses for structured and unstructured data. Working in Hadoop Streaming and Spark you’ve learned to translate common machine learning algorithms into Map-Reduce style implementations. You’ve developed the ability to evaluate Machine Learning approaches both in terms of their predictive performance as well as their scalability. For the final project you will demonstrate these skills by solving a machine learning challenge on a new dataset. Your job is to perform Click Through Rate prediction on a large dataset of Criteo advertising data made public as part of a Kaggle competition a few years back. As you perform your analysis, keep in mind that we are not grading you on the final performance of your model or how ‘advanced’ the techniques you use but rather on your ability to explain and develop a scalable machine learning approach to answering a real question.

More about the dataset:
https://www.kaggle.com/c/criteo-display-ad-challenge

# Notebook Set-Up
Before starting your homework run the following cells to confirm your setup.

In [8]:
import re
import ast
import time
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.ml.feature import FeatureHasher
from pyspark.sql import SQLContext, Row
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.types import IntegerType
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import FeatureHasher
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [9]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [10]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [11]:
# start Spark Session
from pyspark.sql import SparkSession
app_name = "fproj_notebook"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext
print(sc)

<SparkContext master=yarn appName=pyspark-shell>


In [14]:
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
#lines = sc.textFile("data/train50000.txt")
lines = sc.textFile("gs://midsw261/data/train.txt")

parts = lines.map(lambda l: l.split("\t"))
projectRDD = parts.map(lambda p: Row(label=int(p[0]), I1=p[1], I2=p[2],\
                    I3=p[3], I4=p[4], I5=p[5], I6=p[6],\
                    I7=p[7], I8=p[8], I9=p[9], I10=p[10],\
                    I11=p[11], I12=p[12], I13=p[13], C1=p[14], C2=p[15], C3=p[16],\
                    C4=p[17], C5=p[18], C6=p[19], C7=p[20], C8=p[21], C9=p[22],\
                    C10=p[23], C11=p[24], C12=p[25], C13=p[26], C14=p[27], C15=p[28],\
                    C16=p[29], C17=p[30], C18=p[31], C19=p[32], C20=p[33], C21=p[34],\
                    C22=p[35], C23=p[36], C24=p[37], C25=p[38], C26=p[39]))

# Infer the schema, and register the DataFrame as a table.
projectDF = sqlContext.createDataFrame(projectRDD)
projectDF.registerTempTable("projectTable")
projectDF= projectDF.withColumn("I1", projectDF["I1"].cast(IntegerType()))
projectDF= projectDF.withColumn("I2", projectDF["I2"].cast(IntegerType()))
projectDF= projectDF.withColumn("I3", projectDF["I3"].cast(IntegerType()))
projectDF= projectDF.withColumn("I4", projectDF["I4"].cast(IntegerType()))
projectDF= projectDF.withColumn("I5", projectDF["I5"].cast(IntegerType()))
projectDF= projectDF.withColumn("I6", projectDF["I6"].cast(IntegerType()))
projectDF= projectDF.withColumn("I8", projectDF["I8"].cast(IntegerType()))
projectDF= projectDF.withColumn("I9", projectDF["I9"].cast(IntegerType()))
projectDF= projectDF.withColumn("I10", projectDF["I10"].cast(IntegerType()))
projectDF= projectDF.withColumn("I12", projectDF["I12"].cast(IntegerType()))
projectDF= projectDF.withColumn("I13", projectDF["I13"].cast(IntegerType()))
projectDF= projectDF.withColumn("I7", projectDF["I7"].cast(IntegerType()))
projectDF= projectDF.withColumn("I11", projectDF["I11"].cast(IntegerType()))
projectDF= projectDF.withColumn("I1", projectDF["I1"].cast(IntegerType()))
projectDF= projectDF.withColumn("label", projectDF["label"].cast(IntegerType()))


#remove cat logs = C3, C4, C12, C16, C21 and C24 
# include - C6,C9,C14,C17,C20,C22,C23
# remove -  I3, I2, I8, I9
# remove I15 - outliers
# include - I4 & I13, I8 & I13, I7 & I11, I1 & I7
# since #I13 appears in noth I4 and I8 - include - I4,I13,I7 & I11, I1 & I7

categorical_features = ['C1','C2','C5','C6','C7','C8','C9','C10','C11','C11','C13','C14','C15','C17','C18','C19','C20','C22','C23','C25','C26']
numeric_features = ['I4','I13','I7','I11', 'I1']
label_and_numeric = ['label']+numeric_features

allcols = categorical_features + numeric_features

#numeric_features = ["I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10", "I11", "I12", "I13"]
#categorical_features =  ['C6','C9','C14','C17','C20','C22','C23']
#label_and_numeric = ["label", "I1", "I2", "I3", "I4", "I5", "I6", "I7", "I8", "I9", "I10", "I11", "I12", "I13"]

In [15]:
splits=projectDF.randomSplit([0.8, 0.2],2018)
trainDF,testDF = splits
testDF.show(2)
testDF.printSchema()
cols = testDF.columns
print(cols)

+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+---+--------+---+--------+---+--------+--------+---+---+--------+--------+--------+--------+--------+--------+--------+----+----+---+----+----+---+----+----+----+---+---+---+---+-----+
|      C1|     C10|     C11|     C12|     C13|     C14|     C15|     C16|     C17|     C18|C19|      C2|C20|     C21|C22|     C23|     C24|C25|C26|      C3|      C4|      C5|      C6|      C7|      C8|      C9|  I1| I10|I11| I12| I13| I2|  I3|  I4|  I5| I6| I7| I8| I9|label|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+---+--------+---+--------+---+--------+--------+---+---+--------+--------+--------+--------+--------+--------+--------+----+----+---+----+----+---+----+----+----+---+---+---+---+-----+
|013c8fe1|233e3a0c|a7b606c4|6aaba33c|eae197fd|b28479f6|2d0bb053|b041b04a|e5ba7672|2804effd|   |421b43cd|   |723b4dfd|   |3a171ecb|b34f3128|   |   |8162da11|29998ed1|25c83c9

In [16]:
%%time

# Copied code from the cells below - Delete start from here

num_of_features = 256 # 2^8
#allcols = label_and_numeric + categorical_features
hasher = FeatureHasher(inputCols=allcols,outputCol="features",categoricalCols=categorical_features)
hasher.setNumFeatures(num_of_features)

featurizedTrain = hasher.transform(trainDF)
featurizedTrain.select("label","features").show(2,truncate=False)

featurizedTest = hasher.transform(testDF)

featurizedTest.select("features").show(2,truncate=False)

lr = LogisticRegression(labelCol="label",featuresCol="features",maxIter = 10)
lrModel = lr.fit(featurizedTrain)
print(lrModel)

predictions = lrModel.transform(featurizedTest)
predictions.show(2)

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol="label")
evaluator.evaluate(predictions)
evaluator.getMetricName()
print('Model Intercept: ', lrModel.intercept)
weights = lrModel.coefficients
weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
weightsDF = sqlContext.createDataFrame(weights, ["Feature Weight"])
display(weightsDF)
weightsDF.show(2)
print("Evaluator:")
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol="label")
eval = evaluator.evaluate(predictions)
print("AUC",eval)

print("Sample Distribuion :")
clicked = predictions.filter(predictions.prediction == 1.0).count()
notclicked = predictions.filter(predictions.prediction == 0.0).count()

TP = predictions.select("label", "prediction").filter("label = 0 and prediction = 0").count()

TN = predictions.select("label", "prediction").filter("label = 1 and prediction = 1").count()
FP = predictions.select("label", "prediction").filter("label = 0 and prediction = 1").count()
FN = predictions.select("label", "prediction").filter("label = 1 and prediction = 0").count()
total = predictions.select("label").count()
total = float(total)

print("Total :",total)
print("TN :",TN)
print("FP :",FP)
print("FN :",FN)
print("TP :",TP)

print("\nConfusion Matrix:")
print(TP,'\t',FP)
print(FN,'\t',TN)
print("\n")

accuracy	= (TP + TN) / total
precision   = (TP + FP) / total
recall      = (TP + FN) / total
F1		= 2/(1/precision + 1/recall)

print("Accuracy :",accuracy)
print("Precision :",precision)
print("Recall :",recall)
print("F1 :",F1)


+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                                                             |
+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |(256,[6,34,35,36,54,55,74,79,81,86,95,115,119,140,175,183,185,199,210,213,225,231,250],[1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,4.0,1.0,1.0,1.0,1.0])|
|1    |(256,[0,32,36,37,38,46,54,56,57,61,68,81,83,86,95,109,115,153,173,175,183,199,210],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,3.0,4.0,3.0])   |
+-----+-------------------------------------------

DataFrame[Feature Weight: double]

+--------------------+
|      Feature Weight|
+--------------------+
|-0.05627587597116992|
|-0.05281664021179919|
+--------------------+
only showing top 2 rows

Evaluator:
AUC 0.6948185957767022
Sample Distribuion :
Total : 9170441.0
TN : 302040
FP : 208731
FN : 2047947
TP : 6611723

Confusion Matrix:
6611723 	 208731
2047947 	 302040


Accuracy : 0.7539182684889418
Precision : 0.7437432943519292
Recall : 0.9443024604814534
F1 : 0.8321085146088715
CPU times: user 540 ms, sys: 224 ms, total: 764 ms
Wall time: 17min 23s
