## Getting Started with Spark MLlib in One Hour

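MLlib is Spark's machine learning library. Its main features are:

* Utilities: linear algebra, statistics, and data handling
* Feature engineering: feature extraction, transformation, and selection
* Common algorithms: classification, regression, clustering, collaborative filtering, and dimensionality reduction
* Model tuning: model evaluation and hyperparameter tuning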

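MLlib consists of two separate packages:

spark.mllib contains the RDD-based machine learning API. It has been in maintenance mode since Spark 2.0 and receives no new features; the Spark 2.x documentation anticipated its removal in 3.0, so it is not recommended for new code.

spark.ml contains the DataFrame-based machine learning API. It can be used to build machine learning workflows (Pipelines) and is the recommended API.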

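### 1. Basic MLlib concepts

DataFrame: how MLlib stores data. Its columns can hold feature vectors, labels, and raw text or images.

Transformer: has a transform method. It converts one DataFrame into another by appending one or more columns.

Estimator: has a fit method. It takes a DataFrame as input, trains on it, and produces a Transformer.

Pipeline: has a setStages method. It chains multiple Transformers and Estimators in order; a Pipeline is itself an Estimator, and calling fit on it returns a PipelineModel.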
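
To make the Transformer/Estimator distinction concrete, here is a minimal sketch. The toy data and the column names (`category`, `categoryIndex`) are made up for illustration: `Tokenizer` is a Transformer that needs no training, while `StringIndexer` is an Estimator whose `fit` returns a model that is itself a Transformer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, Tokenizer}

val spark = SparkSession.builder().master("local[2]").appName("concepts").getOrCreate()
import spark.implicits._

// Toy data, made up for this illustration.
val df = Seq(("spark is fast", "tech"), ("I like tea", "food"), ("spark ml", "tech"))
  .toDF("text", "category")

// Transformer: no training step; transform() appends the "words" column.
val tokenized = new Tokenizer().setInputCol("text").setOutputCol("words").transform(df)

// Estimator: fit() learns the category-to-index mapping and returns a
// StringIndexerModel, which is a Transformer.
val indexerModel = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIndex")
  .fit(tokenized)
val indexed = indexerModel.transform(tokenized)

indexed.show(false)
```

Each call appends columns rather than modifying existing ones, which is what lets stages be chained in a Pipeline.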


### 2. Pipeline example

Task: use a logistic regression model to predict whether a sentence contains the word "spark".



In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.{MulticlassClassificationEvaluator,BinaryClassificationEvaluator}
import org.apache.spark.ml.{Pipeline,PipelineModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row


Note: if these imports fail with `error: object ml is not a member of package org.apache.spark`, the spark-mllib module is not on the kernel's classpath; add it before running the cells below.

In [41]:
val spark = SparkSession.builder()
   .master("local[4]").appName("ml")
   .getOrCreate()

import spark.implicits._

spark = org.apache.spark.sql.SparkSession@3cb7bce


org.apache.spark.sql.SparkSession@3cb7bce

**1. Prepare the data**

In [84]:
val dftrain = Seq((0L,"a b c d e spark",1.0),
                (1L,"a c f",0.0),
                (2L,"spark hello world",1.0),
                (3L,"hadoop mapreduce",0.0),
                (4L,"I love spark", 1.0),
                (5L,"big data",0.0)).toDF("id","text","label")
dftrain.show

+---+-----------------+-----+
| id|             text|label|
+---+-----------------+-----+
|  0|  a b c d e spark|  1.0|
|  1|            a c f|  0.0|
|  2|spark hello world|  1.0|
|  3| hadoop mapreduce|  0.0|
|  4|     I love spark|  1.0|
|  5|         big data|  0.0|
+---+-----------------+-----+



dftrain = [id: bigint, text: string ... 1 more field]


[id: bigint, text: string ... 1 more field]

**2. Build the model**

In [127]:
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
println(tokenizer.getClass)

val hashingTF = new HashingTF().setNumFeatures(100)
   .setInputCol(tokenizer.getOutputCol)
   .setOutputCol("features")
println(hashingTF.getClass)

val lr = new LogisticRegression().setLabelCol("label")
//println(lr.explainParams)
lr.setFeaturesCol("features").setMaxIter(10).setRegParam(0.01)
println(lr.getClass)

val pipe = new Pipeline().setStages(Array(tokenizer,hashingTF,lr))
println(pipe.getClass)

class org.apache.spark.ml.feature.Tokenizer
class org.apache.spark.ml.feature.HashingTF
class org.apache.spark.ml.classification.LogisticRegression
class org.apache.spark.ml.Pipeline


tokenizer = tok_a186006a3ca3
hashingTF = hashingTF_89fe3cde38ef
lr = logreg_8d6a68ed16b5
pipe = pipeline_2a8d7c734272


pipeline_2a8d7c734272
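
Before fitting, it can help to run the feature stages by hand and look at what they produce. The sketch below assumes the same stage settings as above, applied to a two-row sample; the names `tok`, `tf`, and `sample` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val spark = SparkSession.builder().master("local[2]").appName("stages").getOrCreate()
import spark.implicits._

val sample = Seq((0L, "I love spark"), (1L, "big data")).toDF("id", "text")

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setNumFeatures(100)
  .setInputCol(tok.getOutputCol).setOutputCol("features")

// Tokenizer lowercases the text and splits it on whitespace.
val words = tok.transform(sample)
// HashingTF hashes each word into one of 100 buckets and counts occurrences,
// producing a sparse vector of size 100.
val features = tf.transform(words)

features.select("words", "features").show(false)
```

With only 100 buckets, distinct words can collide in the same index, which is one reason a model trained on so few features and rows may predict poorly.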

**3. Train the model**

In [88]:
val model = pipe.fit(dftrain)
print(model.getClass)

class org.apache.spark.ml.PipelineModel

model = pipeline_8c6ec9126745


pipeline_8c6ec9126745
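
A fitted `PipelineModel` keeps its fitted stages in order, so the trained logistic regression can be pulled out and inspected. This is a sketch that rebuilds a shortened version of the pipeline so it runs on its own; the names `train`, `fitted`, and `lrModel` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val spark = SparkSession.builder().master("local[2]").appName("inspect-model").getOrCreate()
import spark.implicits._

// Same shape as dftrain above, shortened to four rows.
val train = Seq((0L, "a b c d e spark", 1.0), (1L, "a c f", 0.0),
                (2L, "spark hello world", 1.0), (3L, "hadoop mapreduce", 0.0))
  .toDF("id", "text", "label")

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setNumFeatures(100).setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val fitted = new Pipeline().setStages(Array(tok, tf, lr)).fit(train)

// The last fitted stage is the trained logistic regression model.
val lrModel = fitted.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"numFeatures = ${lrModel.numFeatures}, intercept = ${lrModel.intercept}")
println(s"non-zero coefficients = ${lrModel.coefficients.numNonzeros}")
```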

**4. Use the model**

In [93]:
val dftest = Seq((7L,"spark job",1.0),(9L,"hello world",0.0),
                 (10L,"a b c d e",0.0),(11L,"you can you up",0.0),
                (12L,"spark is easy to use.",1.0)).toDF("id","text","label")
dftest.show

val dfresult = model.transform(dftest)

dfresult.selectExpr("text","features","probability","prediction").show

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  7|           spark job|  1.0|
|  9|         hello world|  0.0|
| 10|           a b c d e|  0.0|
| 11|      you can you up|  0.0|
| 12|spark is easy to ...|  1.0|
+---+--------------------+-----+

+--------------------+--------------------+--------------------+----------+
|                text|            features|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|           spark job|(100,[5,70],[1.0,...|[0.35046042897667...|       1.0|
|         hello world|(100,[48,50],[1.0...|[0.33560921515515...|       1.0|
|           a b c d e|(100,[22,61,70,78...|[0.19082246657270...|       1.0|
|      you can you up|(100,[25,28,33],[...|[0.81519423235142...|       0.0|
|spark is easy to ...|(100,[5,21,60,81,...|[0.47768327161195...|       1.0|
+--------------------+--------------------+--------------------+----------+



dftest = [id: bigint, text: string ... 1 more field]
dfresult = [id: bigint, text: string ... 6 more fields]


[id: bigint, text: string ... 6 more fields]
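
The `probability` column holds a vector with one entry per class, which is awkward to filter or sort on. One possible way to extract the positive-class probability as a plain double is a small UDF. In this sketch, `scored` is a stand-in for `dfresult` with made-up probabilities, and `positiveProb` and `p1` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.{Vector, Vectors}

val spark = SparkSession.builder().master("local[2]").appName("probability").getOrCreate()
import spark.implicits._

// Stand-in for the model output; the probabilities are made up.
val scored = spark.createDataFrame(Seq(
  ("spark job", Vectors.dense(0.3, 0.7)),
  ("big data", Vectors.dense(0.8, 0.2))
)).toDF("text", "probability")

// Index 1 of the probability vector is the estimated probability of label 1.0.
val positiveProb = udf((v: Vector) => v(1))
val withP1 = scored.withColumn("p1", positiveProb($"probability"))

withP1.show(false)
```

The same UDF should apply to `dfresult` directly, since its `probability` column has the same vector type.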

**5. Evaluate the model**

In [106]:
val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
    .setPredictionCol("prediction").setLabelCol("label")

println(evaluator.explainParams())

println(s"\naccuracy = ${evaluator.evaluate(dfresult)}")

labelCol: label column name (default: label, current: label)
metricName: metric name in evaluation (f1|weightedPrecision|weightedRecall|accuracy) (default: f1, current: accuracy)
predictionCol: prediction column name (default: prediction, current: prediction)

accuracy = 0.6


evaluator = mcEval_853d92ed6dd6


mcEval_853d92ed6dd6
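
Accuracy depends on the 0.5 decision threshold. For a binary task, the `BinaryClassificationEvaluator` imported at the top offers a threshold-independent alternative such as area under the ROC curve. The sketch below uses a made-up `score` column so that it runs standalone; with the pipeline above, the evaluator's default `rawPrediction` column produced by `LogisticRegression` could be used instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val spark = SparkSession.builder().master("local[2]").appName("auc").getOrCreate()

// Made-up labels and scores, for illustration only.
val scoredLabels = spark.createDataFrame(Seq(
  (1.0, 0.9), (0.0, 0.2), (1.0, 0.6), (0.0, 0.7)
)).toDF("label", "score")

// The raw prediction column may be a double score or a vector of per-class scores.
val aucEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("score")
  .setMetricName("areaUnderROC")

val auc = aucEvaluator.evaluate(scoredLabels)
println(s"areaUnderROC = $auc")
```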

**6. Save the model**

In [130]:
model.save("mymodel.model")

// Alternatively, overwrite any existing model at the target path
//model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// The unfitted pipeline can be saved to disk as well
//pipe.write.overwrite().save("/tmp/unfit-lr-model")

In [120]:
// Reload the model from the path it was saved to
val model_loaded = PipelineModel.load("mymodel.model")
model_loaded.transform(dftest).select("text","label","prediction").show

+--------------------+-----+----------+
|                text|label|prediction|
+--------------------+-----+----------+
|           spark job|  1.0|       1.0|
|         hello world|  0.0|       1.0|
|           a b c d e|  0.0|       1.0|
|      you can you up|  0.0|       0.0|
|spark is easy to ...|  1.0|       1.0|
+--------------------+-----+----------+



model_loaded = pipeline_8c6ec9126745



pipeline_8c6ec9126745

### 3. Loading data

spark.read can load data in formats such as csv, image, libsvm, and text.

In [136]:
// Load images

val dfimage = spark.read.format("image").option("dropInvalid", true).load("../imagedata")
dfimage.printSchema


root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)



dfimage = [image: struct<origin: string, height: int ... 4 more fields>]


[image: struct<origin: string, height: int ... 4 more fields>]

In [137]:
dfimage.selectExpr("image.*").show

+--------------------+------+-----+---------+----+--------------------+
|              origin|height|width|nChannels|mode|                data|
+--------------------+------+-----+---------+----+--------------------+
|file:///Users/lia...|   803|  998|        4|  24|[F2 F9 FC FF F2 F...|
|file:///Users/lia...|   626|  886|        4|  24|[FF FF FF FF FF F...|
+--------------------+------+-----+---------+----+--------------------+
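
The csv and libsvm readers work along the same lines as the image reader. The sketch below writes two tiny sample files to a temporary directory so that it is runnable as-is; the file names and contents are made up.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("load-data").getOrCreate()

// Write small sample files (hypothetical names and contents).
val dir = Files.createTempDirectory("mllib-demo")
val csvPath = dir.resolve("sample.csv")
Files.write(csvPath,
  "id,text,label\n0,I love spark,1.0\n1,big data,0.0\n".getBytes(StandardCharsets.UTF_8))
val svmPath = dir.resolve("sample.libsvm")
Files.write(svmPath, "1 1:0.5 3:1.0\n0 2:2.0\n".getBytes(StandardCharsets.UTF_8))

// CSV: use the first line as the header and infer column types.
val dfcsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toUri.toString)
dfcsv.printSchema()

// libsvm: yields a double "label" column and a sparse "features" vector column.
val dfsvm = spark.read.format("libsvm").load(svmPath.toUri.toString)
dfsvm.show(false)
```

A libsvm DataFrame already has the `label` and `features` columns that spark.ml estimators expect by default, so it can be passed to `fit` without further feature assembly.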



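### 4. Feature engineering

Spark's feature processing lives mainly in the spark.ml.feature package and covers the following:

* Feature extraction: TF-IDF, Word2Vec, CountVectorizer, FeatureHasher

* Feature transformation: OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0), Normalizer, Imputer (fills missing values), StandardScaler, MinMaxScaler, Tokenizer (splits text into words), 
  StopWordsRemover, SQLTransformer, Bucketizer, Interaction (interaction terms), Binarizer (thresholds values to 0/1), NGram, ...

* Feature selection: VectorSlicer (vector slicing), RFormula, ChiSqSelector (chi-squared test)

* LSH: locality-sensitive hashing, widely used for approximate nearest-neighbor search and clustering on very large datasets.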
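
As a small illustration of how these pieces combine, the sketch below chains three of the classes listed above: `Tokenizer` and `StopWordsRemover` (Transformers) followed by `CountVectorizer` (an Estimator that learns a vocabulary). The sample sentences and the column names `filtered` and `counts` are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover, Tokenizer}

val spark = SparkSession.builder().master("local[2]").appName("features").getOrCreate()
import spark.implicits._

val docs = Seq((0L, "I love spark"), (1L, "spark is easy to use")).toDF("id", "text")

// Split text into lowercase words.
val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)

// Drop common English stop words using the default list.
val filtered = new StopWordsRemover()
  .setInputCol("words").setOutputCol("filtered")
  .transform(words)

// Learn a vocabulary ordered by term frequency, then count terms per document.
val cvModel = new CountVectorizer()
  .setInputCol("filtered").setOutputCol("counts")
  .fit(filtered)
val counted = cvModel.transform(filtered)

println(cvModel.vocabulary.mkString(", "))
counted.select("filtered", "counts").show(false)
```

Unlike `HashingTF`, `CountVectorizer` keeps an explicit vocabulary, so each vector index can be mapped back to a word at the cost of an extra pass over the data.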


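### 5. Classification models

### 6. Regression models

### 7. Clustering models

### 8. Dimensionality reduction models

### 9. Model tuning

### 10. Statistics utilities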