###HW14.3 Field-aware Factorization Machine  FFM test
**Download the Spark libFM from https://github.com/zhengruifeng/spark-libFM <br><br>
Run the following Field-aware Factorization Machine test: <br><br>
https://github.com/zhengruifeng/spark-libFM/blob/master/src/main/scala/TestFM.scala <br><br>
Describe the dataset. Describe the two experiments: fm1 and fm2 and discuss your results.**

**Dataset:**<br>
The dataset comprises approximately 2.4M lines of the format: <br>
*label feature0:value0 feature1:value1 ... featureN:valueN*

The lines are in sparse format, the labels take values -1 or 1, and the feature values are either binary or take a fractional value between 0 and 1. The total uncompressed data size is about 2.2G.

**Differences in the two factorizations:**
* fm1 trains an FFM using stochastic gradient descent for 100 iterations. This is an iterative method that processes one data point at a time and adjusts the model based on the gradient of the loss function. fm1 uses all of the data in each iteration (it can be configured to use lesS).
* fm2 trains an FFM using limited-memory Broyden–Fletcher–Goldfarb–Shanno for 20 iterations. This algorithm approximates Newton's method (using the gradient and an approximate Hessian) to find a more direct path to convergence than gradient descent.

Both experiments use equivalent parameters otherwise, including for interactions (use global bias term, use one-way interactions, use 4 factors for pairwise interactions) and regularization (all regularization parameters set to 0). There is no output from FFMTest, but we can compare runtimes: with 10 m3.xlarge instances, fm1 required approximately 2 minutes and fm2 13 minutes (15 minutes total).

The original scala code provides no output, so I wrote a function and created an additional RDD calculate log loss. Both fm1 and fm2 had identical log loss of 0.8520897668623415.

Below are the commands and code used to set up the cluster and compile and run the code.

In [None]:
# set up a cluster of 10 m3.xlarge instances
./spark/ec2/spark-ec2 --key-pair=jamesr261 --identity-file=jamesr261.pem -s 12 -t m3.xlarge \
    --region=us-west-1 --zone=us-west-1a launch jr

This is the scala file for running the factorizations.

In [None]:
import com.github.fommil.netlib.BLAS._
import com.github.fommil.netlib.BLAS.{getInstance => blas}

import scala.util.Random

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.util.MLUtils

object TestFM extends App {

  override def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setAppName("TESTFM"))

    val training = MLUtils.loadLibSVMFile(sc, "s3n://<access_key>:<secret_key>@ucb-mids-mls-jamesroute-hw5/url_combined", false, -1, 20).cache()

    //    val task = args(1).toInt
    //    val numIterations = args(2).toInt
    //    val stepSize = args(3).toDouble
    //    val miniBatchFraction = args(4).toDouble

    val dataSize = training.count()

    // run with SGD, report log loss
    val fm1 = FMWithSGD.train(training, task = 1, numIterations = 100, stepSize = 0.15, miniBatchFraction = 1.0, dim = (true, true, 4), regParam = (0, 0, 0), initStd = 0.1)
    val logLoss_fm1 = training.map { point =>
      fm1.predict(point.features)
    }.sum() / dataSize
    println("training log loss for fm1 = " + logLoss_fm1)

    // run with LBFGS, report log loss 
    val fm2 = FMWithLBFGS.train(training, task = 1, numIterations = 20, numCorrections = 5, dim = (true, true, 4), regParam = (0, 0, 0), initStd = 0.1)
    val logLoss_fm2 = training.map { point =>
      fm2.predict(point.features)
    }.sum() / dataSize
    println("training log loss for fm2 = " + logLoss_fm2)

  }
}

This is the SBT file used to build the jar.

In [None]:
name := "TESTFM"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.5.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.5.1"
)

This is the command to submit the job to Spark, run from the root of the project directory.

In [None]:
# driver mem set to 6G to avoid java heap memory issues
~/spark/bin/spark-submit --driver-memory 6G \
    --master spark://ec2-52-53-250-84.us-west-1.compute.amazonaws.com:7077 \
    target/scala-2.10/testfm_2.10-1.0.jar 

###HW14.4 Replicate Criteo Challenge winning solution


This is code used to preprocess the raw dataset into one with 1000 hashed features, output in LibSVM format. If we get GBDT and feature hashing to work, this step will be unnecessary.

In [3]:
%%writefile preprocess.py
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext, SparkConf

from math import log, exp
from collections import defaultdict
import hashlib
import sys

# Calculate a feature dictionary for an observation's features based on hashing
def hashFunction(numBuckets, rawFeats, printMapping=False):
    mapping = {}
    for ind, category in rawFeats:
        featureString = category + str(ind)
        mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
    if(printMapping): print mapping
    sparseFeatures = defaultdict(float)
    for bucket in mapping.values():
        sparseFeatures[bucket] += 1.0
    return dict(sparseFeatures)

# Converts a comma separated string into a list of (featureID, value) tuples
def parsePoint(point):
    feature_list = point.split(',')[1:]
    return [(idx, feature) for idx, feature in enumerate(feature_list)]

# Create a LabeledPoint for this observation using hashing
def parseHashPoint(point, numBuckets):
    # parse the points
    point_list = parsePoint(point)
    #get the label of the point
    label = point.split(',')[0]
    
    features = hashFunction(numBuckets, point_list)
    return LabeledPoint(label, SparseVector(numBuckets, features))


if __name__ == "__main__":
    sc = SparkContext(appName="preprocess")
    input_file=sys.argv[1]
    output=sys.argv[2]
    numBucketsCTR=int(sys.argv[3])

    # read in the files and parse tab to comma
    rawData = sc.textFile(input_file, 20).map(lambda x: x.replace('\t', ','))
    
    # cache the data
    rawData.cache()
    
    # hash the data
    hashData = rawData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    MLUtils.saveAsLibSVMFile(hashData, output)

Overwriting preprocess.py


This is the scala file for running FFM on the training dataset and calculating log loss and AUC on the training set. It is compiled and run the same way as the code in 14.3.

In [None]:
import com.github.fommil.netlib.BLAS._
import com.github.fommil.netlib.BLAS.{getInstance => blas}

import scala.util.Random

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics


object TestFM extends App {

  def computeLogLoss(p: Double, y: Double): Double = {
    val epsilon = 10E-12
    if(y == 0){
      if(p == 1){
        return -math.log(1 - p + epsilon)
      }else{
        return -math.log(1 - p)
      }
    }else{
      if(p == 0){
        return -math.log(epsilon + p)
      }else{
        return -math.log(p)
      }
    }
  }

  override def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setAppName("TESTFM"))

    //    "hdfs://ns1/whale-tmp/url_combined"
    val training = MLUtils.loadLibSVMFile(sc, "s3n://<access_key>:<secret_key>@ucb-mids-mls-jamesroute-hw5/criteo-train-1000", false, -1, 20).cache()

    val dataSize = training.count()

    //    val task = args(1).toInt
    //    val numIterations = args(2).toInt
    //    val stepSize = args(3).toDouble
    //    val miniBatchFraction = args(4).toDouble

    val fm1 = FMWithSGD.train(training, task = 1, numIterations = 100, stepSize = 0.15, miniBatchFraction = 1.0, dim = (true, true, 4), regParam = (0, 0, 0), initStd = 0.1)

    val preds_fm1 = training.map { point =>
      val prediction = fm1.predict(point.features)
      (prediction, point.label)
    }

    val logLoss_fm1 = preds_fm1.map { pred_label =>
      computeLogLoss(pred_label._1, pred_label._2)
      }.sum() / dataSize

    val metrics = new BinaryClassificationMetrics(preds_fm1)
    val auROC = metrics.areaUnderROC

    println("training log loss = " + logLoss_fm1)
    println("Area under ROC = " + auROC)
  }
}