In [1]:
val PATH = "file:///Users/lzz/work/SparkML/"

$ f(w) := \lambda\, R(w) + \frac{1}{n} \sum_{i=1}^n L(w; x_i, y_i). $
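
The objective above can be evaluated directly on a small in-memory dataset. The following is a minimal sketch in plain Scala, not MLlib code: the names `Objective`, `value`, `squaredLoss` and `l2` are made up for illustration, and the weight vector is called `w` as in the tables below.

```scala
object Objective {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum

  // Regularized objective: lambda * R(w) + (1/n) * sum_i L(w; x_i, y_i)
  def value(w: Vec,
            data: Seq[(Vec, Double)],
            lambda: Double,
            loss: (Vec, Vec, Double) => Double,
            reg: Vec => Double): Double =
    lambda * reg(w) + data.map { case (x, y) => loss(w, x, y) }.sum / data.size

  // Illustrative choices (assumptions): squared loss and the L2 regularizer.
  val squaredLoss: (Vec, Vec, Double) => Double =
    (w, x, y) => 0.5 * math.pow(dot(w, x) - y, 2)
  val l2: Vec => Double = w => 0.5 * dot(w, w)
}
```

Any other loss or regularizer with the same function signature can be passed to `value` in place of the two shown here.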

## Loss functions

 | loss function $ L(w; x, y) $ | gradient or sub-gradient
 --------------------- | ------------------------- | ----------------------------
hinge loss | $ \max \{0, 1-y w^T x \}, \quad y \in \{-1, +1\} $ | $ \begin{cases}-y \cdot x & \text{if } y w^T x < 1, \\ 0 & \text{otherwise}.\end{cases} $
logistic loss | $ \log(1+\exp( -y w^T x)), \quad y \in \{-1, +1\} $ | $ -y \left(1-\frac{1}{1+\exp(-y w^T x)} \right) \cdot x $
squared loss | $ \frac{1}{2} (w^T x - y)^2, \quad y \in \mathbb{R} $ | $ ( w^T x - y) \cdot x $
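
The three losses and their (sub-)gradients translate almost line by line into code. This is a sketch in plain Scala for illustration only. The object name `Losses` and its helpers are hypothetical and are not part of MLlib.

```scala
object Losses {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum
  def scale(c: Double, x: Vec): Vec = x.map(_ * c)

  // Hinge loss, y in {-1, +1}
  def hinge(w: Vec, x: Vec, y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))
  def hingeGrad(w: Vec, x: Vec, y: Double): Vec =
    if (y * dot(w, x) < 1.0) scale(-y, x) else Array.fill(x.length)(0.0)

  // Logistic loss, y in {-1, +1}
  def logistic(w: Vec, x: Vec, y: Double): Double =
    math.log1p(math.exp(-y * dot(w, x)))
  def logisticGrad(w: Vec, x: Vec, y: Double): Vec =
    scale(-y * (1.0 - 1.0 / (1.0 + math.exp(-y * dot(w, x)))), x)

  // Squared loss, y real-valued
  def squared(w: Vec, x: Vec, y: Double): Double =
    0.5 * math.pow(dot(w, x) - y, 2)
  def squaredGrad(w: Vec, x: Vec, y: Double): Vec =
    scale(dot(w, x) - y, x)
}
```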

## Regularizers

 | regularizer $ R(w) $ | gradient or sub-gradient
 --------------------- | ------------------------- | ----------------------------
 zero (unregularized) | $ 0 $ | $ 0 $
 L2 | $ \frac{1}{2}\|w\|_2^2 $ | $ w $
 L1 | $ \|w\|_1 $ | $ \mathrm{sign}(w) $
 elastic net | $ \alpha \|w\|_1 + (1-\alpha)\frac{1}{2}\|w\|_2^2 $ | $ \alpha\, \mathrm{sign}(w) + (1-\alpha) w $
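
The regularizers in the table can be written out in the same way. Again this is only an illustrative sketch in plain Scala. `Regularizers` and its method names are assumptions, not MLlib identifiers.

```scala
object Regularizers {
  type Vec = Array[Double]

  // L2: 0.5 * ||w||_2^2, with gradient w
  def l2(w: Vec): Double = 0.5 * w.map(v => v * v).sum
  def l2Grad(w: Vec): Vec = w.clone()

  // L1: ||w||_1, with sub-gradient sign(w) (0 is chosen at w_j = 0)
  def l1(w: Vec): Double = w.map(v => math.abs(v)).sum
  def l1SubGrad(w: Vec): Vec = w.map(v => math.signum(v))

  // Elastic net: alpha * L1 + (1 - alpha) * L2
  def elasticNet(alpha: Double)(w: Vec): Double =
    alpha * l1(w) + (1 - alpha) * l2(w)
  def elasticNetSubGrad(alpha: Double)(w: Vec): Vec =
    w.map(v => alpha * math.signum(v) + (1 - alpha) * v)
}
```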
 

## Linear Support Vector Machines (SVMs)

Linear SVMs are trained with L2 regularization by default; L1 can be set instead. The loss function is the hinge loss:
$ L(w;x,y) := \max \{0, 1-y w^T x \}. $
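
To make the training idea concrete, here is a sketch of a single sub-gradient step on one example for the hinge loss with an L2 penalty. It is a simplification written in plain Scala under assumed names (`SvmStep`, `step`, `lambda`, `stepSize`), with labels in {-1, +1}. It is not how `SVMWithSGD` is implemented internally.

```scala
object SvmStep {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum

  // One sub-gradient step on lambda * 0.5 * ||w||^2 + max(0, 1 - y * w^T x).
  def step(w: Vec, x: Vec, y: Double, lambda: Double, stepSize: Double): Vec = {
    // The hinge term contributes -y * x only when the margin is below 1.
    val active = y * dot(w, x) < 1.0
    w.indices.map { j =>
      val lossGrad = if (active) -y * x(j) else 0.0
      w(j) - stepSize * (lambda * w(j) + lossGrad)
    }.toArray
  }
}
```

Starting from `w = (0, 0)`, a step on a positive example moves `w` toward `x`. Once the margin reaches 1, only the L2 term shrinks the weights.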


In [3]:
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")

Area under ROC = 1.0
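
The call to `clearThreshold()` matters for the ROC computation. As far as I understand the model's behavior, `predict` returns a 0/1 label while a threshold is set (0.0 by default for SVMs), and returns the raw margin once the threshold is cleared. The sketch below mimics that logic in plain Scala. The names `ThresholdSketch`, `margin` and `predict` are illustrative and not the MLlib source.

```scala
object ThresholdSketch {
  // Raw score: w^T x + b
  def margin(w: Array[Double], b: Double, x: Array[Double]): Double =
    w.zip(x).map { case (p, q) => p * q }.sum + b

  // With a threshold, return a 0/1 label; without one, return the raw score.
  def predict(w: Array[Double], b: Double, x: Array[Double],
              threshold: Option[Double]): Double = {
    val m = margin(w, b, x)
    threshold match {
      case Some(t) => if (m > t) 1.0 else 0.0
      case None    => m
    }
  }
}
```

Raw scores are what `BinaryClassificationMetrics` needs in order to sweep over thresholds when it builds the ROC curve.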


In [4]:
import org.apache.spark.mllib.optimization.L1Updater

val svmAlg = new SVMWithSGD()
svmAlg.optimizer.
  setNumIterations(200).
  setRegParam(0.1).
  setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
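
To my understanding, `L1Updater` handles the L1 penalty by soft thresholding the weights after each gradient step, with a shrinkage amount that depends on `regParam` and the current step size. The sketch below shows only the soft-thresholding operator in plain Scala. `SoftThreshold` and `shrinkage` are hypothetical names, and the exact shrinkage schedule is left out.

```scala
object SoftThreshold {
  // Shrink each weight toward zero by `shrinkage`; weights smaller than
  // the shrinkage in absolute value become zero.
  def apply(w: Array[Double], shrinkage: Double): Array[Double] =
    w.map(v => math.signum(v) * math.max(0.0, math.abs(v) - shrinkage))
}
```

This is why L1 regularization tends to produce sparse models: small weights are set exactly to zero rather than merely reduced.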

## Logistic regression

Logistic regression uses the logistic loss as its loss function:

$ L(w;x,y) :=  \log(1+\exp( -y w^T x)). $

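When $ -y w^T x \le 0 $, we have $ \exp(-y w^T x) \le 1 $, so $ \log(\exp(-y w^T x)) \le 0 $ and the loss would not be positive. Adding 1 inside the logarithm keeps its argument above 1, so the loss is always positive.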
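
The effect of the added 1 can be checked numerically. The sketch below evaluates the logistic loss as a function of the margin $ m = y w^T x $ in plain Scala, using a rearrangement that avoids overflow for very negative margins. `LogisticLoss` and `loss` are illustrative names.

```scala
object LogisticLoss {
  // log(1 + exp(-m)) for margin m = y * w^T x.
  // For m <= 0 use the identity log(1 + exp(-m)) = -m + log(1 + exp(m)),
  // which avoids computing exp of a large positive number.
  def loss(m: Double): Double =
    if (m > 0) math.log1p(math.exp(-m))
    else -m + math.log1p(math.exp(m))
}
```

For example, at margin 5 the value `math.log(math.exp(-5.0))` is negative, while `LogisticLoss.loss(5.0)` is a small positive number.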

In [8]:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)

// Compute predicted labels on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

Precision = 1.0
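
The value printed here is an overall figure rather than a per-class one. As I understand `MulticlassMetrics`, it is the fraction of test points whose predicted label equals the true label. A plain-Scala sketch of that computation on an in-memory sequence, with the assumed name `OverallPrecision`, looks like this:

```scala
object OverallPrecision {
  // Fraction of (prediction, label) pairs where the two agree.
  def apply(predictionAndLabels: Seq[(Double, Double)]): Double =
    predictionAndLabels.count { case (p, l) => p == l }.toDouble /
      predictionAndLabels.size
}
```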


## Regression

In [9]:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile( PATH + "data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model (save fails with FileAlreadyExistsException if myModelPath already exists)
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")

training Mean Squared Error = 6.207597210613578
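
The mean squared error in the cell above is the average of the squared differences between labels and predictions. The same calculation on a small in-memory sequence, as a plain-Scala sketch with the assumed name `Mse`:

```scala
object Mse {
  // Mean of (value - prediction)^2 over all pairs.
  def apply(valuesAndPreds: Seq[(Double, Double)]): Double =
    valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum /
      valuesAndPreds.size
}
```

Because the cell evaluates on the same data it trained on, the reported number is a training error and says little about performance on unseen data.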


Name: org.apache.hadoop.mapred.FileAlreadyExistsException
Message: Output directory hdfs://localhost:9000/user/lzz/myModelPath/metadata already exists
StackTrace: org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1089)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989)
org.apache.spark.rdd.PairRDDFu