# Linear least squares, Lasso, and ridge regression

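The regularized objective to minimize is:
$$f(w) := \lambda\, R(w) + \frac{1}{n} \sum_{i=1}^n L(w;x_i,y_i)$$
Here $R(w)$ is the regularizer, $\lambda \ge 0$ is the regularization parameter, and $L(w;x_i,y_i)$ is the loss on the training example $(x_i, y_i)$.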

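For regression problems, linear least squares is the most commonly used formulation for minimizing a loss function. Its loss is the squared loss: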
$$L(w;x,y) :=  \frac{1}{2} (w^T x - y)^2$$
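
Because the example in this section trains with stochastic gradient descent, the gradient of this loss is the quantity each update relies on. A one-line derivation via the chain rule, added here for illustration:

$$\nabla_w L(w;x,y) = (w^T x - y)\, x$$

The gradient is the feature vector $x$ scaled by the prediction error $w^T x - y$, so examples that are already predicted exactly contribute nothing to the update.
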
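Different regression methods use different regularization: ordinary least squares (also called linear least squares) uses no regularization, ridge regression uses L2 regularization, and Lasso uses L1 regularization.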
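
To make the three variants concrete, here is a minimal plain-Scala sketch (no Spark required) that evaluates the objective $f(w)$ with each regularizer. It assumes the usual conventions $R(w) = \frac{1}{2}\|w\|_2^2$ for L2 and $R(w) = \|w\|_1$ for L1, which the text above does not spell out. The object and function names are illustrative and are not part of MLlib.

```scala
// Sketch of f(w) = lambda * R(w) + (1/n) * sum_i L(w; x_i, y_i).
// All names below are hypothetical helpers, not MLlib API.
object RegularizedObjective {
  type Vec = Array[Double]

  // Inner product w^T x
  def dot(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "dimension mismatch")
    a.zip(b).map { case (ai, bi) => ai * bi }.sum
  }

  // Squared loss: L(w; x, y) = 1/2 * (w^T x - y)^2
  def squaredLoss(w: Vec, x: Vec, y: Double): Double = {
    val residual = dot(w, x) - y
    0.5 * residual * residual
  }

  // Regularizers R(w) (assumed conventions, see the text above)
  val noReg: Vec => Double = _ => 0.0                            // ordinary least squares
  val l2: Vec => Double = w => 0.5 * w.map(wi => wi * wi).sum    // ridge regression
  val l1: Vec => Double = w => w.map(wi => math.abs(wi)).sum     // Lasso

  // Regularized objective over a non-empty data set of (features, label) pairs
  def objective(w: Vec, data: Seq[(Vec, Double)], lambda: Double, reg: Vec => Double): Double = {
    require(data.nonEmpty, "data must not be empty")
    val meanLoss = data.map { case (x, y) => squaredLoss(w, x, y) }.sum / data.size
    lambda * reg(w) + meanLoss
  }
}
```

For example, with $w = (1, 1)$ and the two made-up points $((1, 2), 3)$ and $((2, 0), 1)$, the mean loss is 0.25. With $\lambda = 0.1$ the objective is therefore 0.25 without regularization, about 0.35 with L2, and about 0.45 with L1.
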
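## Example
This example uses LinearRegressionWithSGD to build a simple linear model that predicts label values, then computes the mean squared error on the training data.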

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Base directory of the data on the author's machine; adjust to your environment
val PATH = "file:///Users/lzz/work/SparkML/"
// Load and parse the data: each line has the form "label,feature1 feature2 ..."
val data = sc.textFile(PATH + "data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate the model on the training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)

// Save and load the model.
// Note: save throws FileAlreadyExistsException if "myModelPath" already exists.
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
```

Output:

```
training Mean Squared Error = 6.207597210613578
```

In this run the final `model.save` call then failed, because the output directory already existed (for example, left over from an earlier run of the same cell). The captured stack trace is truncated:

```
Name: org.apache.hadoop.mapred.FileAlreadyExistsException
Message: Output directory hdfs://localhost:9000/user/lzz/myModelPath/metadata already exists
StackTrace: org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1089)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989)
org.apache.spark.rdd.PairRDDFu
```
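
One way to make the save step re-runnable, assuming it is acceptable to discard the previously saved model, is to delete the target directory before saving. The sketch below uses the Hadoop `FileSystem` API through Spark's Hadoop configuration. The helper name `saveOverwriting` is hypothetical and not part of MLlib.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LinearRegressionModel

// Hypothetical helper (not MLlib API): remove any previous output, then save.
// WARNING: this recursively deletes whatever is stored at `path`.
def saveOverwriting(sc: SparkContext, model: LinearRegressionModel, path: String): Unit = {
  val target = new Path(path)
  // Resolve the file system that owns the path (HDFS, local, ...) from Spark's Hadoop configuration
  val fs = target.getFileSystem(sc.hadoopConfiguration)
  if (fs.exists(target)) {
    fs.delete(target, true) // second argument: delete recursively
  }
  model.save(sc, path)
}
```

With this helper in scope, `saveOverwriting(sc, model, "myModelPath")` could stand in for the plain `model.save(sc, "myModelPath")` call. An alternative that avoids deleting anything is to save each run under a fresh path.
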

RidgeRegressionWithSGD and LassoWithSGD can be used in the same way as LinearRegressionWithSGD.
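
As a rough sketch of what that looks like, the snippet below trains both regularized variants on a tiny hand-made data set. It assumes the same MLlib version as the example above, in which the static `train(input, numIterations, stepSize, regParam)` overloads are available. The toy data and the values of `stepSize` and `regParam` are made up for illustration and are not tuned.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD, RidgeRegressionWithSGD}

// Reuses an already running SparkContext (e.g. the notebook's) if there is one;
// otherwise starts a local one.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("RidgeLassoSketch").setMaster("local[2]"))

// Hypothetical toy data: (label, features)
val toyData = ctx.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(2.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(3.0, Vectors.dense(1.0, 1.0))
)).cache()

val iterations = 100
val stepSize = 0.1   // assumed value, not tuned
val regParam = 0.01  // regularization parameter (lambda), assumed value

// L2-regularized (ridge) and L1-regularized (Lasso) linear regression
val ridgeModel = RidgeRegressionWithSGD.train(toyData, iterations, stepSize, regParam)
val lassoModel = LassoWithSGD.train(toyData, iterations, stepSize, regParam)

println("ridge weights = " + ridgeModel.weights)
println("lasso weights = " + lassoModel.weights)
```

Both calls return a model with the same `predict` and `save` methods used in the example above, so the evaluation code carries over unchanged.
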

To run the above application, follow the instructions in the Self-Contained Applications section of the Spark quick-start guide. Be sure to also include spark-mllib in your build file as a dependency.
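
For an sbt build, the dependency declaration might look like the following sketch. The project name and the Scala and Spark version numbers are assumptions; replace them with the versions that match your cluster.

```scala
// build.sbt (sketch; name and version numbers are assumptions)
name := "linear-regression-example"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // "provided": the Spark runtime supplies these jars when the job is submitted
  "org.apache.spark" %% "spark-core"  % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.5.2" % "provided"
)
```
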