# DATASCI W261: Machine Learning at Scale

**Nick Hamlin** (nickhamlin@gmail.com)  
  
Time of Submission: 11:15 PM EST, Friday, April 29, 2016  
W261-3, Spring 2016  
Week 13 Homework

### Submission Notes:
- For each problem, we've included a summary of the question as posed in the instructions.  In many cases, we have omitted the full text to keep the final submission as uncluttered as possible.  For reference, a link to the original instructions appears under "Useful References and Notebook Setup" below.
- Some parts of this notebook do not render cleanly as a PDF.  In those cases, please refer to [the complete rendered notebook on Github](https://github.com/nickhamlin/mids_261_homework/blob/master/HW10/MIDS-W261-2015-HWK-Week13-Hamlin-Thomas-Baek-Danish.ipynb).


### Useful References and Notebook Setup:
- **[Original Assignment Instructions](https://www.dropbox.com/s/gsti4plbst7ena3/MIDS-MLS-HW-13.txt?dl=0)**


In [2]:
#Only needed to start Spark locally, unnecessary in AWS 
import os
import sys #current as of 9/26/2015
spark_home = os.environ['SPARK_HOME'] = \
   '/Users/nicholashamlin/spark-1.6.1-bin-hadoop2.6/'

if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.9-src.zip'))
execfile(os.path.join(spark_home,'python/pyspark/shell.py'))
 

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:57:58)
SparkContext available as sc, HiveContext available as sqlContext.


## HW 13.4: Criteo Phase 2 baseline

### Problem Statement:

SPECIAL NOTE:
Please share your findings with the class via the Google Group as they become available. You will get brownie points for this.  Once results are shared, please use them and build on them.

The Criteo data for this challenge is located in the following S3/Dropbox buckets:

On Dropbox see:
     https://www.dropbox.com/sh/dnevke9vsk6yj3p/AABoP-Kv2SRxuK8j3TtJsSv5a?dl=0

Raw Data:  (Training, Validation and Test data)
https://console.aws.amazon.com/s3/home?region=us-west-1#&bucket=criteo-dataset&prefix=rawdata/

Hashed Data: Training, Validation and Test data in hash encoded (10,000 buckets) and sparse representation
https://console.aws.amazon.com/s3/home?region=us-west-1#&bucket=criteo-dataset&prefix=processeddata/


Using the training dataset, validation dataset and testing dataset in the Criteo bucket perform the following experiment:

Write Spark code (borrow from Phase 1 of this project) to train a logistic regression model with the following hyperparameters:

- Number of buckets for hashing: 1,000
- Logistic Regression: no regularization term
- Logistic Regression: step size = 10

Report the AWS cluster configuration that you used and how long in minutes and seconds it takes to complete this job.

Report in tabular form the AUC value (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) for the Training, Validation, and Testing datasets.
Report in tabular form the logLossTest for the Training, Validation, and Testing datasets.

Don't forget to put a caption on your tables (above each table).

### Import packages and setup raw data in Spark

In [14]:
#Load required dependencies
import os
import time
from math import log,exp
from collections import OrderedDict,defaultdict
from itertools import product
import hashlib

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.evaluation import BinaryClassificationMetrics

### Load data - LOCAL VERSION

In [4]:
#Load Raw Data - LOCAL VERSION
fileName='dac_sample.txt'
if os.path.isfile(fileName):
    rawData = (sc
               .textFile(fileName, 2)
               .map(lambda x: x.replace('\t', ',')))  # work with either ',' or '\t' separated data
rawData.count()

100000

In [5]:
#SPLIT DATA INTO TRAIN/VALIDATION/TEST - LOCAL TESTING
weights = [.8, .1, .1]
seed = 42
# Use randomSplit with weights and seed
rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)
# Cache the data
rawTrainData.cache()
rawValidationData.cache()
rawTestData.cache()

nTrain = rawTrainData.count()
nVal = rawValidationData.count()
nTest = rawTestData.count()
print "NUMBER OF RECORDS"
print "Training: {0}".format(nTrain)
print "Validation: {0}".format(nVal)
print "Test: {0}".format(nTest)
print "Total: {0}".format(nTrain + nVal + nTest)


NUMBER OF RECORDS
Training: 79911
Validation: 10075
Test: 10014
Total: 100000


### Load data - FULL VERSION

In [2]:
#Load Raw Data - EMR VERSION WITH SAMPLE DATA
fileName="s3://hamlin-mids-261/dac_sample.txt"
rawData = (sc.textFile(fileName, 2).map(lambda x: x.replace('\t', ',')))
#rawData.cache()
rawData.count()

100000

In [3]:
#Load ONE file first
testFile="s3://criteo-dataset/rawdata/test/part-00000"
rawTestData = (sc.textFile(testFile, 2).map(lambda x: x.replace('\t', ',')))
nTest = rawTestData.count()
print "Test: {0}".format(nTest)

Test: 27807


In [4]:
trainFile="s3://criteo-dataset/rawdata/train/part-*"
validationFile="s3://criteo-dataset/rawdata/validation/part-*"
testFile="s3://criteo-dataset/rawdata/test/part-*"

In [5]:
#Load test data first to make sure everything works
rawTestData = (sc.textFile(testFile, 2).map(lambda x: x.replace('\t', ',')))
rawTestData.cache()
nTest = rawTestData.count()
print "Test: {0}".format(nTest)

Test: 4586343


In [6]:
rawTrainData = (sc.textFile(trainFile, 2).map(lambda x: x.replace('\t', ','))).cache()
rawValidationData = (sc.textFile(validationFile, 2).map(lambda x: x.replace('\t', ','))).cache()
#rawTestData = (sc.textFile(testFile, 2).map(lambda x: x.replace('\t', ','))).cache()

nTrain = rawTrainData.count()
nVal = rawValidationData.count()
#nTest = rawTestData.count()
print "NUMBER OF RECORDS"
print "Training: {0}".format(nTrain)
print "Validation: {0}".format(nVal)
print "Test: {0}".format(nTest)
print "Total: {0}".format(nTrain + nVal + nTest)


NUMBER OF RECORDS
Training: 36669090
Validation: 4585184
Test: 4586343
Total: 45840617


### Define functions to evaluate models
(These are recycled from HW 12)

In [6]:
def computeLogLoss(p, y):
    """Calculates the value of log loss for a given probability and label.

    Note:
        log(0) is undefined, so a small value (epsilon) is always added to the argument
        of the log.  This keeps it strictly positive when p is 0 (for y=1) or 1 (for y=0).

    Args:
        p (float): A probability between 0 and 1.
        y (int): A label.  Takes on the values 0 and 1.

    Returns:
        float: The log loss value.
    """
    epsilon = 10e-12  # note: 10e-12 equals 1e-11
    if y==1:
        return -log(p+epsilon)
    elif y==0:
        return -log(1-p+epsilon)
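
To get a feel for the scale of the log loss values reported later, here is a minimal Spark-free sketch of the same formula. The name `log_loss` and the example probabilities are our own illustrations, not part of the assignment code.

```python
from math import log

EPSILON = 10e-12  # same constant as computeLogLoss (equals 1e-11)

def log_loss(p, y):
    """Log loss for a single prediction; mirrors computeLogLoss."""
    if y == 1:
        return -log(p + EPSILON)
    return -log(1 - p + EPSILON)

# Illustrative predictions for an observation whose true label is 1
confident_right = log_loss(0.9, 1)  # roughly 0.105
uninformed = log_loss(0.5, 1)       # roughly 0.693, i.e. ln(2)
confident_wrong = log_loss(0.1, 1)  # roughly 2.303
worst_case = log_loss(0.0, 1)       # roughly 25.33, capped by epsilon

print(confident_right, uninformed, confident_wrong, worst_case)
```

The epsilon term is what keeps the worst case finite: without it, a prediction of exactly 0 for a positive label would raise a math domain error.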

In [7]:
def getP(x, w, intercept):
    """Calculate the probability for an observation given a set of weights and intercept.

    Note:
        We'll bound our raw prediction between 20 and -20 for numerical purposes.

    Args:
        x (SparseVector): A vector with values of 1.0 for features that exist in this
            observation and 0.0 otherwise.
        w (DenseVector): A vector of weights (betas) for the model.
        intercept (float): The model's intercept.

    Returns:
        float: A probability between 0 and 1.
    """
    rawPrediction=x.dot(w)+intercept
    

    # Bound the raw prediction value
    rawPrediction = min(rawPrediction, 20)
    rawPrediction = max(rawPrediction, -20)
    
    output = (1+exp(-rawPrediction))**-1
    return output
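
The bounding step in `getP` matters because `exp` overflows for large arguments. The sketch below (plain Python, with a hypothetical helper name `clipped_sigmoid`) shows the effect of clipping the raw prediction to the range -20 to 20 before applying the sigmoid.

```python
from math import exp

def clipped_sigmoid(raw, bound=20.0):
    """Sigmoid of a raw prediction after clipping it to [-bound, bound]."""
    raw = max(min(raw, bound), -bound)
    return 1.0 / (1.0 + exp(-raw))

p_mid = clipped_sigmoid(0.0)       # exactly 0.5
p_high = clipped_sigmoid(1000.0)   # just below 1
p_low = clipped_sigmoid(-1000.0)   # just above 0

# Without clipping, the exponential itself fails for an extreme input
try:
    exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(p_mid, p_high, p_low, overflowed)
```

Clipping at 20 limits probabilities to roughly 2e-9 away from 0 and 1, which is far finer than anything the log loss calculation needs.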

In [8]:
def evaluateResults(model, data):
    """Calculates the log loss for the data given the model.

    Args:
        model (LogisticRegressionModel): A trained logistic regression model.
        data (RDD of LabeledPoint): Labels and features for each observation.

    Returns:
        float: Log loss for the data.
    """
    output=data.map(lambda x: computeLogLoss(getP(x.features,model.weights,model.intercept), x.label)).sum()/data.count()
    return output

In [9]:
#Define Hash Function

def hashFunction(numBuckets, rawFeats, printMapping=False):
    """Calculate a feature dictionary for an observation's features based on hashing.

    Note:
        Use printMapping=True for debug purposes and to better understand how the hashing works.

    Args:
        numBuckets (int): Number of buckets to use as features.
        rawFeats (list of (int, str)): A list of features for an observation.  Represented as
            (featureID, value) tuples.
        printMapping (bool, optional): If true, the mappings of featureString to index will be
            printed.

    Returns:
        dict of int to float:  The keys will be integers which represent the buckets that the
            features have been hashed to.  The value for a given key will contain the count of the
            (featureID, value) tuples that have hashed to that key.
    """
    mapping = {}
    for ind, category in rawFeats:
        featureString = category + str(ind)
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
    if(printMapping): print mapping
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)
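
The following is a small Spark-free sketch of the same hashing idea, written so that it also runs on Python 3 (where `hashlib.md5` needs bytes). The function name `hash_features` and the sample feature values are hypothetical. It shows that shrinking the number of buckets forces distinct features to collide.

```python
import hashlib
from collections import defaultdict

def hash_features(num_buckets, raw_feats):
    """Hash (featureID, value) pairs into bucket counts, as hashFunction does."""
    mapping = {}
    for ind, category in raw_feats:
        feature_string = category + str(ind)
        digest = hashlib.md5(feature_string.encode("utf-8")).hexdigest()
        mapping[feature_string] = int(digest, 16) % num_buckets
    counts = defaultdict(float)
    for bucket in mapping.values():
        counts[bucket] += 1.0
    return dict(counts)

# Hypothetical observation with three categorical features
sample = [(0, "mouse"), (1, "black"), (2, "salmon")]

wide = hash_features(1000, sample)  # many buckets: collisions are unlikely
narrow = hash_features(2, sample)   # two buckets, three features: a collision is guaranteed

print(wide)
print(narrow)
```

With only two buckets, at least one bucket must hold a count of 2 or more, which is the information loss that a larger bucket count is meant to reduce.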

In [10]:
#Use Hash function to create labeled points with hashed features

def parseHashPoint(point, numBuckets):
    """Create a LabeledPoint for this observation using hashing.

    Args:
        point (str): A comma separated string where the first value is the label and the rest are
            features.
        numBuckets: The number of buckets to hash to.

    Returns:
        LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
            features.
    """
    output=[]
    features=point.split(',')
    label=features[0]
    for i,j in enumerate(features[1:]):
        output.append((i,j))
    output.sort()
    hashResult=hashFunction(numBuckets,output)
    sortedHashResult=OrderedDict(sorted(hashResult.items(), key=lambda t: t[0]))
    sparse=SparseVector(numBuckets,sortedHashResult.keys(),sortedHashResult.values())
    return LabeledPoint(label,sparse)


### Generate Features and Train Baseline Model

- Number of buckets for hashing: 1,000
- Logistic Regression: no regularization term
- Logistic Regression: step size = 10

In [18]:
#Baseline model parameters
stepSize=10 #GD stepsize
regType=None #Regularization
numBucketsCTR = 1000 #Number of buckets into which we want to hash features

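#Hash all three datasets
#NOTE: map() and cache() are lazy, so no hashing actually runs until the first action (here, training)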
start_hash_time=time.time()
hashTrainData = rawTrainData.map(lambda point: parseHashPoint(point,numBucketsCTR))
hashTrainData.cache()
hashValidationData = rawValidationData.map(lambda point: parseHashPoint(point,numBucketsCTR))
hashValidationData.cache()
hashTestData = rawTestData.map(lambda point: parseHashPoint(point,numBucketsCTR))
hashTestData.cache()

start_train_time = time.time()
#Signature for reference: train(data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=0.01, regType='l2', intercept=False, validateData=True, convergenceTol=0.001)
model = LogisticRegressionWithSGD.train(hashTrainData, step=stepSize, regType=regType, intercept=True)

end_time=time.time()
print "Feature hashing for all data took {0} seconds".format(start_train_time-start_hash_time)
print "Model trained in {0} seconds".format(end_time-start_train_time)


Feature hashing for all data took 0.0308191776276 seconds
Model trained in 1915.60110497 seconds
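
The hashing time of about 0.03 seconds is best read as the time needed to define the transformations, not to run them. Spark transformations such as `map` are lazy, so the hashing work is folded into the training time. The sketch below is a plain-Python analogy using a generator (not Spark itself); `slow_hash` and its sleep are stand-ins we made up for an expensive per-record step.

```python
import time

def slow_hash(x):
    """Stand-in for an expensive per-record transformation."""
    time.sleep(0.01)
    return x * 2

records = range(20)

t0 = time.time()
lazy = (slow_hash(x) for x in records)  # defines the work, runs nothing yet
t1 = time.time()
materialized = list(lazy)               # the work happens here, like an RDD action
t2 = time.time()

define_seconds = t1 - t0
run_seconds = t2 - t1
print("define: {0:.4f}s, run: {1:.4f}s".format(define_seconds, run_seconds))
```

In the notebook, one way to time hashing on its own would be to call an action such as `count()` on each cached RDD before recording `start_train_time`.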


In [19]:
model.save(sc, 's3://hamlin-mids-261/hw13_baseline')

### Evaluate results

In [20]:
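#Approximate runtime: 10-15 minutes
#NOTE: predict() returns thresholded 0/1 labels (default threshold 0.5) unless model.clearThreshold()
#is called first, so the AUC values below are computed from hard class labels rather than raw scores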
trainResults = hashTrainData.map(lambda lp: (float(model.predict(lp.features)), lp.label))
validationResults = hashValidationData.map(lambda lp: (float(model.predict(lp.features)), lp.label))
testResults = hashTestData.map(lambda lp: (float(model.predict(lp.features)), lp.label))

trainMetrics = BinaryClassificationMetrics(trainResults)
validationMetrics = BinaryClassificationMetrics(validationResults)
testMetrics = BinaryClassificationMetrics(testResults)

print "AUC RESULTS"
print "Training: {0}".format(trainMetrics.areaUnderROC)
print "Validation: {0}".format(validationMetrics.areaUnderROC)
print "Test: {0}".format(testMetrics.areaUnderROC)

AUC RESULTS
Training: 0.582128506212
Validation: 0.582249919283
Test: 0.582221271582
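
Because an MLlib logistic regression model returns 0/1 labels from `predict` unless its threshold is cleared, the AUC values above are computed from hard class predictions. The sketch below is a minimal illustration of how much ranking information that can discard. The `auc` helper and the toy labels and probabilities are hypothetical, and we have not verified how large the effect is on the Criteo data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: positives are ranked well, but every probability is below 0.5
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.45, 0.40, 0.30, 0.35, 0.20, 0.15, 0.10, 0.05]
hard = [1.0 if p > 0.5 else 0.0 for p in probs]

auc_from_scores = auc(probs, labels)  # 14 of 15 pairs ordered correctly
auc_from_labels = auc(hard, labels)   # all predictions tie, so 0.5

print(auc_from_scores, auc_from_labels)
```

If this applies here, calling `model.clearThreshold()` before scoring would be a natural next experiment.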


In [21]:
#Approximate runtime: 4 minutes
logLossTrain = evaluateResults(model, hashTrainData)
logLossVal = evaluateResults(model, hashValidationData)
logLossTest = evaluateResults(model, hashTestData)
print "LOGLOSS RESULTS"
print "Training: {0}".format(logLossTrain)
print "Validation: {0}".format(logLossVal)
print "Test: {0}".format(logLossTest)

LOGLOSS RESULTS
Training: 0.505463996931
Validation: 0.505676112056
Test: 0.505602800332


## HW 13.5: Criteo Phase 2 hyperparameter tuning
SPECIAL NOTE:
Please share your findings with the class via the Google Group as they become available. You will get brownie points for this.  Once results are shared, please use them and build on them.
 

Using the training dataset, validation dataset and testing dataset in the Criteo bucket perform the following experiments:

- Write Spark code (borrow from Phase 1 of this project) to train a logistic regression model with various hyperparameters. Do a grid search of the hyperparameter space and determine optimal settings using the validation set.

- Number of buckets for hashing: 1,000, 10,000, ... (explore different values here)
- Logistic Regression: regularization term: [1e-6, 1e-3] (explore other values here also)
- Logistic Regression: step size: explore different step sizes. Focus on a step size of 1 initially.

Report the AWS cluster configuration that you used and how long in minutes and seconds it takes to complete this job.

Report in tabular form and using heatmaps the AUC values (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) for the Training, Validation, and Testing datasets.
Report in tabular form and using heatmaps the logLossTest for the Training, Validation, and Testing datasets.

Don't forget to put a caption on your tables (above the table) and on your heatmap figures (put caption below figures) detailing the experiment associated with each table or figure (data, algorithm used, parameters and settings explored).

Discuss the optimal setting to solve this problem in terms of the following:
- Features
- Learning algorithm
- Spark cluster

Justify your recommendations based on your experimental results and cross reference with table numbers and figure numbers. Also highlight key results with annotations, both textual and line and box based, on your tables and graphs.

### Setup infrastructure for gridsearch strategy

In [31]:
def test_configuration(numBucketsCTR,regParam,stepSize,rawTrainData=rawTrainData,rawValidationData=rawValidationData,rawTestData=rawTestData):
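    """Hash the data into numBucketsCTR buckets, train an L2-regularized logistic regression
    model for 10 iterations, and return AUC, log loss, and timing results for all three datasets."""
    #Hash dataset (lazy: the work runs when the data is first used during training)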
    start_hash_time=time.time()
    hashTrainData = rawTrainData.map(lambda point: parseHashPoint(point,numBucketsCTR))
    hashTrainData.cache()
    hashValidationData = rawValidationData.map(lambda point: parseHashPoint(point,numBucketsCTR))
    hashValidationData.cache()
    hashTestData = rawTestData.map(lambda point: parseHashPoint(point,numBucketsCTR))
    hashTestData.cache()
    
    start_train_time = time.time()
    model = LogisticRegressionWithSGD.train(hashTrainData, step=stepSize, iterations=10,regType='l2',regParam=regParam, intercept=True)

    start_test_time=time.time()
    #print "Model trained in {0} seconds".format(end_time-start_train_time)

    trainResults = hashTrainData.map(lambda lp: (float(model.predict(lp.features)), lp.label))
    validationResults = hashValidationData.map(lambda lp: (float(model.predict(lp.features)), lp.label))
    testResults = hashTestData.map(lambda lp: (float(model.predict(lp.features)), lp.label))

    trainMetrics = BinaryClassificationMetrics(trainResults)
    validationMetrics = BinaryClassificationMetrics(validationResults)
    testMetrics = BinaryClassificationMetrics(testResults)
    
    trainAUC=trainMetrics.areaUnderROC
    validationAUC=validationMetrics.areaUnderROC
    testAUC=testMetrics.areaUnderROC

#     print "AUC RESULTS"
#     print "Training: {0}".format(trainAUC)
#     print "Validation: {0}".format(validationAUC)
#     print "Test: {0}".format(testAUC)
    
#     print ""
    logLossTrain = evaluateResults(model, hashTrainData)
    logLossVal = evaluateResults(model, hashValidationData)
    logLossTest = evaluateResults(model, hashTestData)
    
    end_test_time=time.time()
    
#     print "LOGLOSS RESULTS"
#     print "Training: {0}".format(logLossTrain)
#     print "Validation: {0}".format(logLossVal)
#     print "Test: {0}".format(logLossTest)
    
    trainTime=start_test_time-start_train_time
    testTime=end_test_time-start_test_time
    
    output={'AUC':{'train':trainAUC,'val':validationAUC,'test':testAUC},
            'logloss':{'train':logLossTrain,'val':logLossVal,'test':logLossTest},
            'time':{'train':trainTime,'test':testTime},
            'params':(numBucketsCTR,regParam,stepSize)
           }
    return output

In [36]:
hashOptions=[1000,5000,10000]
regOptions=[1e-6,1e-4,1e-3]
stepOptions=[1,5,10]
regType='l2'

allOptions=list(product(hashOptions,regOptions,stepOptions))

results=[]
for option in allOptions[0:4]:
    print "Parameters: "+str(option)
    result=test_configuration(option[0],option[1],option[2])
    print "Train Time: "+str(result['time']['train'])
    print "Test Time: "+ str(result['time']['test'])
    print ""
    results.append(result)
    

Parameters: (1000, 1e-06, 1)
Train Time: 20.0864210129
Test Time:6.33702111244

Parameters: (1000, 1e-06, 5)
Train Time: 19.9464569092
Test Time:6.75753808022

Parameters: (1000, 1e-06, 10)
Train Time: 20.9382989407
Test Time:7.13368105888

Parameters: (1000, 0.0001, 1)
Train Time: 20.9715909958
Test Time:6.80100393295
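
Only the first four of the grid's configurations were run above. As a rough sketch, the code below enumerates the full grid and extrapolates a total runtime from the four reported timings (rounded to two decimals). It assumes every configuration costs about the same, which probably understates the cost of the larger bucket counts.

```python
from itertools import product

hash_options = [1000, 5000, 10000]
reg_options = [1e-6, 1e-4, 1e-3]
step_options = [1, 5, 10]

# product() varies the last option fastest, matching the order of the output above
grid = list(product(hash_options, reg_options, step_options))

# (train seconds, test seconds) for the four completed runs, rounded
timings = [(20.09, 6.34), (19.95, 6.76), (20.94, 7.13), (20.97, 6.80)]
mean_seconds = sum(train + test for train, test in timings) / len(timings)

# ASSUMPTION: every configuration takes about as long as these four
estimated_total = mean_seconds * len(grid)
minutes, seconds = divmod(estimated_total, 60)

print("{0} configurations, about {1:.1f}s each".format(len(grid), mean_seconds))
print("Estimated total: {0} min {1:.1f} s".format(int(minutes), seconds))
```

Under that assumption the full grid of 27 configurations would take a little over 12 minutes.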



In [39]:
#Print/Store stuff locally so we can plot offline
import pickle

with open('results.pkl','wb') as f:
    pickle.dump(results,f)

for i in results:
    print i

{'logloss': {'test': 0.5117910075966668, 'train': 0.5091918344622549, 'val': 0.5023164247618207}, 'AUC': {'test': 0.5002185314685315, 'train': 0.5001685062317179, 'val': 0.5001614619897322}, 'params': (1000, 1e-06, 1), 'time': {'test': 6.337021112442017, 'train': 20.086421012878418}}
{'logloss': {'test': 0.5185708179328563, 'train': 0.5147507714302086, 'val': 0.5045939386342658}, 'AUC': {'test': 0.5171926507352342, 'train': 0.5160440283128466, 'val': 0.5146948755352495}, 'params': (1000, 1e-06, 5), 'time': {'test': 6.757538080215454, 'train': 19.946456909179688}}
{'logloss': {'test': 1.0516782300224607, 'train': 1.0428988816807345, 'val': 1.01291456786128}, 'AUC': {'test': 0.503391452257295, 'train': 0.5022164592153732, 'val': 0.5018056053824256}, 'params': (1000, 1e-06, 10), 'time': {'test': 7.133681058883667, 'train': 20.93829894065857}}
{'logloss': {'test': 0.5117934859334248, 'train': 0.5091946411956635, 'val': 0.5023195207957308}, 'AUC': {'test': 0.5002185314685315, 'train': 0.500

In [40]:
with open('results.pkl','rb') as f:
    loadedResults=pickle.load(f)

for i in loadedResults:
    print i

{'logloss': {'test': 0.5117910075966668, 'train': 0.5091918344622549, 'val': 0.5023164247618207}, 'AUC': {'test': 0.5002185314685315, 'train': 0.5001685062317179, 'val': 0.5001614619897322}, 'params': (1000, 1e-06, 1), 'time': {'test': 6.337021112442017, 'train': 20.086421012878418}}
{'logloss': {'test': 0.5185708179328563, 'train': 0.5147507714302086, 'val': 0.5045939386342658}, 'AUC': {'test': 0.5171926507352342, 'train': 0.5160440283128466, 'val': 0.5146948755352495}, 'params': (1000, 1e-06, 5), 'time': {'test': 6.757538080215454, 'train': 19.946456909179688}}
{'logloss': {'test': 1.0516782300224607, 'train': 1.0428988816807345, 'val': 1.01291456786128}, 'AUC': {'test': 0.503391452257295, 'train': 0.5022164592153732, 'val': 0.5018056053824256}, 'params': (1000, 1e-06, 10), 'time': {'test': 7.133681058883667, 'train': 20.93829894065857}}
{'logloss': {'test': 0.5117934859334248, 'train': 0.5091946411956635, 'val': 0.5023195207957308}, 'AUC': {'test': 0.5002185314685315, 'train': 0.500
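
The saved dictionaries can be turned into the tabular report the assignment asks for. The sketch below uses only the validation metrics from the three complete records printed above (the fourth is truncated in the output, so it is left out) and picks the best configuration under each metric. The variable names are our own.

```python
# Validation metrics copied from the three complete records above
results = [
    {"params": (1000, 1e-06, 1),
     "AUC": {"val": 0.5001614619897322},
     "logloss": {"val": 0.5023164247618207}},
    {"params": (1000, 1e-06, 5),
     "AUC": {"val": 0.5146948755352495},
     "logloss": {"val": 0.5045939386342658}},
    {"params": (1000, 1e-06, 10),
     "AUC": {"val": 0.5018056053824256},
     "logloss": {"val": 1.01291456786128}},
]

# Caption goes above the table, as the instructions require
print("Validation AUC and log loss by (buckets, regParam, stepSize)")
print("{0:<22}{1:>10}{2:>10}".format("params", "AUC", "logloss"))
for r in results:
    print("{0:<22}{1:>10.4f}{2:>10.4f}".format(
        str(r["params"]), r["AUC"]["val"], r["logloss"]["val"]))

# Lower log loss is better; higher AUC is better
best_by_logloss = min(results, key=lambda r: r["logloss"]["val"])
best_by_auc = max(results, key=lambda r: r["AUC"]["val"])
print("Best by log loss:", best_by_logloss["params"])
print("Best by AUC:", best_by_auc["params"])
```

The two metrics disagree on these three runs (step size 1 has the lowest validation log loss, step size 5 the highest validation AUC), so the choice of selection metric should be stated alongside any recommendation.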

## End of Submission

In [22]:
#Baseline model parameters
stepSize=10 #GD stepsize
regType=None #Regularization
numBucketsCTR = 1000 #Number of buckets into which we want to hash features
numIterations=10

#Hash all three datasets
start_hash_time=time.time()
hashTrainData = rawTrainData.map(lambda point: parseHashPoint(point,numBucketsCTR))
hashTrainData.cache()
hashValidationData = rawValidationData.map(lambda point: parseHashPoint(point,numBucketsCTR))
hashValidationData.cache()
hashTestData = rawTestData.map(lambda point: parseHashPoint(point,numBucketsCTR))
hashTestData.cache()

start_train_time = time.time()
#Signature for reference: train(data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=0.01, regType='l2', intercept=False, validateData=True, convergenceTol=0.001)
model2 = LogisticRegressionWithSGD.train(hashTrainData, step=stepSize, iterations=numIterations,regType=regType, intercept=True)

end_time=time.time()
print "Feature hashing for all data took {0} seconds".format(start_train_time-start_hash_time)
print "Model trained in {0} seconds".format(end_time-start_train_time)

Model trained in 57.8550970554 seconds
