## Spark MLlib in One Hour

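MLlib is Spark's machine learning library. Its main capabilities are:

* Utilities: linear algebra, statistics, and data handling tools
* Feature engineering: feature extraction, feature transformation, feature selection
* Common algorithms: classification, regression, clustering, collaborative filtering, dimensionality reduction
* Model tuning: model evaluation and hyperparameter tuning

The MLlib library consists of two separate parts.

`spark.mllib` contains the RDD-based machine learning API. It has been in maintenance mode since Spark 2.0 (bug fixes only, no new features) and was slated for eventual removal, so it is not recommended for new code.

`spark.ml` contains the DataFrame-based machine learning API. It can be used to build machine learning workflows with Pipeline and is the recommended API.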

In [1]:
%AddDeps org.apache.spark spark-mllib_2.11 2.3.1

Note: the Maven artifact ID is `spark-mllib_2.11`, with a hyphen. The underscore spelling `spark_mllib_2.11` does not exist on Maven Central and fails to resolve.


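### 1. MLlib basic concepts

DataFrame: the form in which MLlib stores data. Its columns can hold feature vectors, labels, raw text, and images.

Transformer: has a `transform` method. It converts one DataFrame into another by appending one or more columns.

Estimator: has a `fit` method. It takes a DataFrame as input, trains on it, and produces a Transformer.

Pipeline: has a `setStages` method. It chains Transformers and Estimators in order; calling `fit` on the Pipeline yields a PipelineModel, which is itself a Transformer.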
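
To make the `fit`/`transform` contract concrete, here is a minimal plain-Scala sketch. The traits `ToyTransformer` and `ToyEstimator`, the `Table` alias, and the length-based example are hypothetical stand-ins invented for illustration; they are not Spark classes and only mimic the shape of the real API.

```scala
// Hypothetical stand-ins for the MLlib abstractions (NOT Spark's classes).
// A "table" is just a sequence of rows, each row a map from column name to value.
type Table = Seq[Map[String, Any]]

trait ToyTransformer { def transform(t: Table): Table }
trait ToyEstimator { def fit(t: Table): ToyTransformer }

// Transformer: appends a "length" column computed from the "text" column.
object AddLength extends ToyTransformer {
  def transform(t: Table): Table =
    t.map(row => row + ("length" -> row("text").toString.length))
}

// Estimator: learns the mean length during fit, then returns a Transformer
// that appends a "longerThanMean" column based on that learned value.
object MeanLengthEstimator extends ToyEstimator {
  def fit(t: Table): ToyTransformer = {
    val mean = t.map(row => row("length").asInstanceOf[Int]).sum.toDouble / t.size
    new ToyTransformer {
      def transform(x: Table): Table =
        x.map(row => row + ("longerThanMean" -> (row("length").asInstanceOf[Int] > mean)))
    }
  }
}

val toyTrain: Table = Seq(Map("text" -> "spark"), Map("text" -> "big data"))
val withLength = AddLength.transform(toyTrain)      // stage 1: a Transformer
val toyModel = MeanLengthEstimator.fit(withLength)  // stage 2: an Estimator is fitted
val toyResult = toyModel.transform(withLength)      // the fitted model is a Transformer
```

Each stage only appends columns, which is why stages can be chained: the output table of one stage is a valid input for the next.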


### 2. Pipeline example

Task: use a logistic regression model to predict whether a sentence contains the word "spark".



In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.{MulticlassClassificationEvaluator,BinaryClassificationEvaluator}
import org.apache.spark.ml.{Pipeline,PipelineModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row


In [2]:
val spark = SparkSession.builder()
   .master("local[4]").appName("ml")
   .getOrCreate()

import spark.implicits._

spark = org.apache.spark.sql.SparkSession@f4ea567


org.apache.spark.sql.SparkSession@f4ea567

**1. Prepare the data**

In [3]:
val dftrain = Seq((0L,"a b c d e spark",1.0),
                (1L,"a c f",0.0),
                (2L,"spark hello world",1.0),
                (3L,"hadoop mapreduce",0.0),
                (4L,"I love spark", 1.0),
                (5L,"big data",0.0)).toDF("id","text","label")
dftrain.show

+---+-----------------+-----+
| id|             text|label|
+---+-----------------+-----+
|  0|  a b c d e spark|  1.0|
|  1|            a c f|  0.0|
|  2|spark hello world|  1.0|
|  3| hadoop mapreduce|  0.0|
|  4|     I love spark|  1.0|
|  5|         big data|  0.0|
+---+-----------------+-----+



dftrain = [id: bigint, text: string ... 1 more field]


[id: bigint, text: string ... 1 more field]

**2. Build the model**

In [4]:
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
println(tokenizer.getClass)

val hashingTF = new HashingTF().setNumFeatures(100)
   .setInputCol(tokenizer.getOutputCol)
   .setOutputCol("features")
println(hashingTF.getClass)

val lr = new LogisticRegression().setLabelCol("label")
//println(lr.explainParams)
lr.setFeaturesCol("features").setMaxIter(10).setRegParam(0.01)
println(lr.getClass)

val pipe = new Pipeline().setStages(Array(tokenizer,hashingTF,lr))
println(pipe.getClass)

class org.apache.spark.ml.feature.Tokenizer
class org.apache.spark.ml.feature.HashingTF
class org.apache.spark.ml.classification.LogisticRegression
class org.apache.spark.ml.Pipeline


tokenizer = tok_f117a88d34eb
hashingTF = hashingTF_83b157337b02
lr = logreg_9a278b7ad216
pipe = pipeline_bbe1e386268d


pipeline_bbe1e386268d

**3. Train the model**

In [5]:
val model = pipe.fit(dftrain)
print(model.getClass)

class org.apache.spark.ml.PipelineModel

model = pipeline_bbe1e386268d


pipeline_bbe1e386268d

**4. Use the model**

In [6]:
val dftest = Seq((7L,"spark job",1.0),(9L,"hello world",0.0),
                 (10L,"a b c d e",0.0),(11L,"you can you up",0.0),
                (12L,"spark is easy to use.",1.0)).toDF("id","text","label")
dftest.show

val dfresult = model.transform(dftest)

dfresult.selectExpr("text","features","probability","prediction").show

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  7|           spark job|  1.0|
|  9|         hello world|  0.0|
| 10|           a b c d e|  0.0|
| 11|      you can you up|  0.0|
| 12|spark is easy to ...|  1.0|
+---+--------------------+-----+

+--------------------+--------------------+--------------------+----------+
|                text|            features|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|           spark job|(100,[5,70],[1.0,...|[0.35046042897668...|       1.0|
|         hello world|(100,[48,50],[1.0...|[0.33560921515516...|       1.0|
|           a b c d e|(100,[22,61,70,78...|[0.19082246657270...|       1.0|
|      you can you up|(100,[25,28,33],[...|[0.81519423235142...|       0.0|
|spark is easy to ...|(100,[5,21,60,81,...|[0.47768327161195...|       1.0|
+--------------------+--------------------+--------------------+----------+



dftest = [id: bigint, text: string ... 1 more field]
dfresult = [id: bigint, text: string ... 6 more fields]


[id: bigint, text: string ... 6 more fields]

**5. Evaluate the model**

In [7]:
dfresult.printSchema

root
 |-- id: long (nullable = false)
 |-- text: string (nullable = true)
 |-- label: double (nullable = false)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [9]:
val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
    .setPredictionCol("prediction").setLabelCol("label")

println(evaluator.explainParams())

evaluator.evaluate(dfresult)

println(s"\nf1 = ${evaluator.evaluate(dfresult)}")

labelCol: label column name (default: label, current: label)
metricName: metric name in evaluation (f1|weightedPrecision|weightedRecall|accuracy) (default: f1, current: f1)
predictionCol: prediction column name (default: prediction, current: prediction)

f1 = 0.5666666666666667


evaluator = mcEval_36c245695257


mcEval_36c245695257
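
The evaluator is configured with `setMetricName("f1")`, so the 0.5667 it reports is a weighted F1 score, not accuracy. The plain-Scala sketch below recomputes both metrics by hand from the five (label, prediction) pairs in the test output above. The helper names are made up for this illustration, and the weighting (each class's share of the true labels) is the usual definition of weighted F1, assumed here to match what the evaluator does.

```scala
// (label, prediction) pairs copied from the dftest results above.
val labelAndPred = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))

// F1 of a single class, treating that class as the positive one.
def f1ForClass(c: Double, pairs: Seq[(Double, Double)]): Double = {
  val tp = pairs.count { case (l, p) => l == c && p == c }.toDouble
  val fp = pairs.count { case (l, p) => l != c && p == c }.toDouble
  val fn = pairs.count { case (l, p) => l == c && p != c }.toDouble
  if (tp == 0.0) 0.0
  else {
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    2 * precision * recall / (precision + recall)
  }
}

// Weighted F1: per-class F1 averaged with each class's share of the labels as weight.
def weightedF1(pairs: Seq[(Double, Double)]): Double =
  pairs.map(_._1).distinct.map { c =>
    val weight = pairs.count(_._1 == c).toDouble / pairs.size
    weight * f1ForClass(c, pairs)
  }.sum

// Accuracy: fraction of rows where the prediction equals the label.
def accuracy(pairs: Seq[(Double, Double)]): Double =
  pairs.count { case (l, p) => l == p }.toDouble / pairs.size

val toyF1 = weightedF1(labelAndPred)        // about 0.5667, matching the evaluator
val toyAccuracy = accuracy(labelAndPred)    // 3 of 5 correct
```

To have the evaluator report accuracy instead, `setMetricName("accuracy")` is one of the options listed by `explainParams` above.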

**6. Save the model**

In [10]:
model.write.overwrite().save("mymodel.model")

// Now we can optionally save the fitted pipeline to disk
//model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk
//pipe.write.overwrite().save("/tmp/unfit-lr-model")

In [12]:
// Reload the model
val model_loaded = PipelineModel.load("mymodel.model")
model_loaded.transform(dftest).select("text","label","prediction").show

+--------------------+-----+----------+
|                text|label|prediction|
+--------------------+-----+----------+
|           spark job|  1.0|       1.0|
|         hello world|  0.0|       1.0|
|           a b c d e|  0.0|       1.0|
|      you can you up|  0.0|       0.0|
|spark is easy to ...|  1.0|       1.0|
+--------------------+-----+----------+



model_loaded = pipeline_bbe1e386268d


pipeline_bbe1e386268d

### 3. Loading data

`spark.read` can load data in formats such as csv, image, libsvm, and text.

In [13]:
// Load images

val dfimage = spark.read.format("image").option("dropInvalid", true).load("imagedata")
dfimage.printSchema


root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)



dfimage = [image: struct<origin: string, height: int ... 4 more fields>]


[image: struct<origin: string, height: int ... 4 more fields>]

In [14]:
dfimage.selectExpr("image.*").show

+--------------------+------+-----+---------+----+--------------------+
|              origin|height|width|nChannels|mode|                data|
+--------------------+------+-----+---------+----+--------------------+
|file:///Users/lia...|   640|  640|        3|  16|[04 13 33 02 11 3...|
|file:///Users/lia...|   287|  562|        4|  24|[91 4D 2F FF C7 6...|
|file:///Users/lia...|   276|  619|        3|  16|[E3 E0 BB E3 E0 B...|
|file:///Users/lia...|   338|  600|        3|  16|[A7 7E 5E A8 7F 5...|
+--------------------+------+-----+---------+----+--------------------+



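### 4. Feature engineering

Spark's feature-processing tools live mainly in the `spark.ml.feature` package and cover the following areas.

* Feature extraction: TF-IDF, Word2Vec, CountVectorizer, FeatureHasher

* Feature transformation: OneHotEncoderEstimator, Normalizer, Imputer (missing-value imputation), StandardScaler, MinMaxScaler, Tokenizer (splits text into words), 
  StopWordsRemover, SQLTransformer, Bucketizer, Interaction (interaction terms), Binarizer (thresholding to 0/1), NGram, ...

* Feature selection: VectorSlicer (vector slicing), RFormula, ChiSqSelector (chi-squared test)

* LSH: locality-sensitive hashing, widely used for nearest-neighbor search, clustering, and similar tasks on very large datasets.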


**1. CountVectorizer**

CountVectorizer extracts term-frequency (word count) features from text.

In [18]:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show()

cvm.transform(df).show()

+---+---------------+--------------------+
| id|          words|            features|
+---+---------------+--------------------+
|  0|      [a, b, c]|(3,[0,1,2],[1.0,1...|
|  1|[a, b, b, c, a]|(3,[0,1,2],[2.0,2...|
+---+---------------+--------------------+

+---+---------------+--------------------+
| id|          words|            features|
+---+---------------+--------------------+
|  0|      [a, b, c]|(3,[0,1,2],[1.0,1...|
|  1|[a, b, b, c, a]|(3,[0,1,2],[2.0,2...|
+---+---------------+--------------------+



df = [id: int, words: array<string>]
cvModel = cntVec_324b7869593e
cvm = cntVecModel_4947f6142271


cntVecModel_4947f6142271
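
What the model with the a-priori vocabulary computes can be reproduced in a few lines of plain Scala. This is only a sketch: `countVector` is a hypothetical helper, and it returns dense vectors for readability, whereas Spark emits sparse ones.

```scala
// Count how often each vocabulary term occurs in a document.
def countVector(vocab: Seq[String], words: Seq[String]): Seq[Double] =
  vocab.map(term => words.count(_ == term).toDouble)

val toyVocab = Seq("a", "b", "c")
val counts0 = countVector(toyVocab, Seq("a", "b", "c"))            // one of each term
val counts1 = countVector(toyVocab, Seq("a", "b", "b", "c", "a"))  // two a, two b, one c
```

Position `i` of the result is the count of `vocab(i)`, which is how the `(3,[0,1,2],[...])` sparse vectors in the output above should be read.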

**2. TF-IDF**

TF-IDF down-weights common terms that appear in a large share of the documents.

In [15]:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)   

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(20,[0,5,9,17],[0...|
|  0.0|(20,[2,7,9,13,15]...|
|  1.0|(20,[4,6,13,15,18...|
+-----+--------------------+



sentenceData = [label: double, sentence: string]
tokenizer = tok_be783ff7cdf7
wordsData = [label: double, sentence: string ... 1 more field]
hashingTF = hashingTF_5689f46573df
featurizedData = [label: double, sentence: string ... 2 more fields]
idf = idf_2ce0d13b12f9
idfModel = idf_2ce0d13b12f9
rescaledData = [label: double, sentence: string ... 3 more fields]


[label: double, sentence: string ... 3 more fields]
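
The down-weighting comes from the inverse document frequency. The Spark documentation gives it as `log((|D| + 1) / (DF(t, D) + 1))`, where `|D|` is the number of documents and `DF(t, D)` is the number of documents containing term `t`. A plain-Scala sketch of that formula follows; the function names are chosen for this illustration.

```scala
// Smoothed inverse document frequency, as described in the Spark ML guide.
def idf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

// TF-IDF weight of a term: raw term frequency times its IDF.
def tfidf(termFreq: Double, numDocs: Long, docFreq: Long): Double =
  termFreq * idf(numDocs, docFreq)

val idfEverywhere = idf(3, 3)       // term occurs in all 3 documents
val idfRare = idf(3, 1)             // term occurs in 1 of 3 documents
val weightRare = tfidf(2.0, 3, 1)   // that rare term appears twice in a document
```

A term that occurs in every document gets an IDF of zero, so it contributes nothing to the feature vector however often it appears.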

**3. Word2Vec**

Word2Vec uses a shallow neural network to learn word vectors that capture semantic similarity between words in text.

In [16]:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }

Text: [Hi, I, heard, about, Spark] => 
Vector: [-0.008142343163490296,0.02051363289356232,0.03255096450448036]

Text: [I, wish, Java, could, use, case, classes] => 
Vector: [0.043090314205203734,0.035048123182994974,0.023512658663094044]

Text: [Logistic, regression, models, are, neat] => 
Vector: [0.038572299480438235,-0.03250147425569594,-0.01552378609776497]



documentDF = [text: array<string>]
word2Vec = w2v_c890d08f1667
model = w2v_c890d08f1667
result = [text: array<string>, result: vector]


[text: array<string>, result: vector]
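
According to the Spark documentation, the fitted model turns each document into one vector by averaging the vectors of its words, which is why every document above maps to a single 3-dimensional vector. A minimal sketch of that averaging step, using hypothetical 2-dimensional word vectors rather than values learned by Spark:

```scala
// Average a document's word vectors component by component.
def averageVectors(wordVectors: Seq[Seq[Double]]): Seq[Double] =
  wordVectors.transpose.map(component => component.sum / wordVectors.size)

// Hypothetical vectors for a three-word document.
val docVector = averageVectors(Seq(Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 0.0)))
```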

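**4. OneHotEncoderEstimator**

OneHotEncoderEstimator converts categorical features, given as category indices, into one-hot vectors. In Spark 3.0 and later the class is named `OneHotEncoder`.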

In [26]:
import org.apache.spark.ml.feature.OneHotEncoderEstimator

val df = spark.createDataFrame(Seq(
  (0.0, 1.0),
  (1.0, 0.0),
  (2.0, 1.0),
  (0.0, 2.0),
  (0.0, 1.0),
  (2.0, 0.0)
)).toDF("categoryIndex1", "categoryIndex2")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))
val model = encoder.fit(df)

val encoded = model.transform(df)
encoded.show()

+--------------+--------------+-------------+-------------+
|categoryIndex1|categoryIndex2| categoryVec1| categoryVec2|
+--------------+--------------+-------------+-------------+
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           1.0|           0.0|(2,[1],[1.0])|(2,[0],[1.0])|
|           2.0|           1.0|    (2,[],[])|(2,[1],[1.0])|
|           0.0|           2.0|(2,[0],[1.0])|    (2,[],[])|
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           2.0|           0.0|    (2,[],[])|(2,[0],[1.0])|
+--------------+--------------+-------------+-------------+



df = [categoryIndex1: double, categoryIndex2: double]
encoder = oneHotEncoder_1cc186061b9c
model = oneHotEncoder_1cc186061b9c
encoded = [categoryIndex1: double, categoryIndex2: double ... 2 more fields]


[categoryIndex1: double, categoryIndex2: double ... 2 more fields]
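
The `(2,[],[])` rows above are not missing values. By default the encoder drops the last category (`dropLast = true`), so with three categories the vectors have two slots and category 2.0 is encoded as all zeros. A plain-Scala sketch of that behavior, with `oneHot` as a hypothetical helper that returns dense vectors:

```scala
// One-hot encode `index` out of `numCategories`. With dropLast the last category
// becomes the all-zeros vector, so the output has numCategories - 1 slots.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Seq[Double] = {
  val size = if (dropLast) numCategories - 1 else numCategories
  Seq.tabulate(size)(i => if (i == index) 1.0 else 0.0)
}

val firstCategory = oneHot(0, 3)                   // slot 0 set
val lastCategory = oneHot(2, 3)                    // all zeros: the dropped category
val lastKept = oneHot(2, 3, dropLast = false)      // three slots, slot 2 set
```

Dropping one category keeps the encoded columns linearly independent, which matters for models such as linear regression with an intercept.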

**5. FeatureHasher**

When there are too many features, FeatureHasher can reduce dimensionality by hashing them into a fixed-size vector.

In [19]:
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

val featurized = hasher.transform(dataset)
featurized.show(false)

+----+-----+---------+------+--------------------------------------------------------+
|real|bool |stringNum|string|features                                                |
+----+-----+---------+------+--------------------------------------------------------+
|2.2 |true |1        |foo   |(262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0])|
|3.3 |false|2        |bar   |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.3])  |
|4.4 |false|3        |baz   |(262144,[22406,70644,174475,187923],[1.0,1.0,4.4,1.0])  |
|5.5 |false|4        |foo   |(262144,[70644,101499,174475,257907],[1.0,1.0,5.5,1.0]) |
+----+-----+---------+------+--------------------------------------------------------+



dataset = [real: double, bool: boolean ... 2 more fields]
hasher = featureHasher_e574faff1d9a
featurized = [real: double, bool: boolean ... 3 more fields]


[real: double, bool: boolean ... 3 more fields]
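
The idea behind the output above is the hashing trick. Per the Spark documentation, a numeric column is placed at the slot given by the hash of its column name and keeps its value, while a string or boolean column is placed at the slot given by the hash of `column=value` with an indicator value of 1.0. The sketch below mimics this in plain Scala under two stated simplifications: it uses `String.hashCode` instead of Spark's MurmurHash3, and 16 slots instead of the default 262144, so the slot indices will not match Spark's.

```scala
// Hashing trick: map each feature to a slot of a fixed-size sparse vector.
// Result: slot index -> value. Features that collide in a slot are summed.
def hashFeatures(numeric: Map[String, Double],
                 categorical: Map[String, String],
                 numFeatures: Int): Map[Int, Double] = {
  // Non-negative modulo, because hashCode may be negative.
  def slot(key: String): Int = ((key.hashCode % numFeatures) + numFeatures) % numFeatures
  val entries =
    numeric.toSeq.map { case (col, v) => slot(col) -> v } ++
    categorical.toSeq.map { case (col, v) => slot(s"$col=$v") -> 1.0 }
  entries.groupBy(_._1).map { case (i, vs) => i -> vs.map(_._2).sum }
}

val hashedRow = hashFeatures(
  numeric = Map("real" -> 2.2),
  categorical = Map("bool" -> "true", "string" -> "foo"),
  numFeatures = 16)
```

No vocabulary has to be built or stored, at the price of possible collisions when the number of slots is small.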

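**6. IndexToString and StringIndexer**

IndexToString and StringIndexer are inverse operations.

The former converts a categorical feature encoded as numeric indices back into strings.

The latter converts a categorical feature stored as strings into numeric indices.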



In [20]:
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

println(s"Transformed string column '${indexer.getInputCol}' " +
    s"to indexed column '${indexer.getOutputCol}'")
indexed.show()

val inputColSchema = indexed.schema(indexer.getOutputCol)
println(s"StringIndexer will store labels in output column metadata: " +
    s"${Attribute.fromStructField(inputColSchema).toString}\n")

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")

val converted = converter.transform(indexed)

println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
    s"column '${converter.getOutputCol}' using labels in metadata")
converted.select("id", "categoryIndex", "originalCategory").show()



Transformed string column 'category' to indexed column 'categoryIndex'
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+

StringIndexer will store labels in output column metadata: {"vals":["a","c","b"],"type":"nominal","name":"categoryIndex"}

Transformed indexed column 'categoryIndex' back to original string column 'originalCategory' using labels in metadata
+---+-------------+----------------+
| id|categoryIndex|originalCategory|
+---+-------------+----------------+
|  0|          0.0|               a|
|  1|          2.0|               b|
|  2|          1.0|               c|
|  3|          0.0|               a|
|  4|          0.0|               a|
|  5|          1.0|               c|
+---+-------------+----------------+



df = [id: int, category: string]
indexer = strIdx_210232843fab
indexed = [id: int, category: string ... 1 more field]
inputColSchema = StructField(categoryIndex,DoubleType,false)
converter = idxToStr_d7d6742718af
converted = [id: int, category: string ... 2 more fields]


[id: int, category: string ... 2 more fields]

In [25]:
indexed.schema(indexer.getOutputCol)

Attribute.fromStructField(inputColSchema).toString

{"vals":["a","c","b"],"type":"nominal","name":"categoryIndex"}
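
The mapping a → 0.0, c → 1.0, b → 2.0 follows from StringIndexer's default ordering: labels are indexed by descending frequency, so the most frequent label gets index 0. A plain-Scala sketch of both directions follows. The helper names are hypothetical, and the alphabetical tie-break is an assumption of this sketch (the example data has no ties).

```scala
// Order labels by descending frequency; ties broken alphabetically (sketch assumption).
def fitLabels(values: Seq[String]): Seq[String] =
  values.groupBy(identity).toSeq
    .map { case (label, occurrences) => (label, occurrences.size) }
    .sortBy { case (label, count) => (-count, label) }
    .map(_._1)

val toyLabels = fitLabels(Seq("a", "b", "c", "a", "a", "c"))

// StringIndexer direction: string -> index.
val toIndex: Map[String, Double] =
  toyLabels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

// IndexToString direction: index -> string, using the same label list.
def backToString(index: Double): String = toyLabels(index.toInt)
```

The label list plays the role of the column metadata printed above: it is all that is needed to invert the indexing.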

**7. StandardScaler (standardization)**

In [28]:
import org.apache.spark.ml.feature.StandardScaler

val dataFrame = spark.read.format("libsvm").load("/Users/liangyun/ProgramFiles/"+
        "spark-2.4.3-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)

// Normalize each feature to have unit standard deviation.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()

+-----+--------------------+--------------------+
|label|            features|      scaledFeatures|
+-----+--------------------+--------------------+
|  0.0|(692,[127,128,129...|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|(692,[124,125,126...|


dataFrame = [label: double, features: vector]
scaler = stdScal_62b0d9d05e04
scalerModel = stdScal_62b0d9d05e04
scaledData = [label: double, features: vector ... 1 more field]


[label: double, features: vector ... 1 more field]
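
With `setWithStd(true)` and `setWithMean(false)`, each feature is divided by its standard deviation without being centered, which keeps sparse vectors sparse. The Spark API documentation states that the corrected sample standard deviation (dividing by n - 1) is used. A plain-Scala sketch on a single toy column; the helper names and the choice to map zero-variance columns to 0.0 are assumptions of this illustration.

```scala
// Corrected sample standard deviation (divides by n - 1).
def sampleStd(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
}

// Scale one feature column; centering and scaling can be toggled independently.
def scaleColumn(xs: Seq[Double], withMean: Boolean, withStd: Boolean): Seq[Double] = {
  val mean = if (withMean) xs.sum / xs.size else 0.0
  val std = if (withStd) sampleStd(xs) else 1.0
  xs.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
}

val scaledOnly = scaleColumn(Seq(2.0, 4.0, 6.0), withMean = false, withStd = true)
val centeredAndScaled = scaleColumn(Seq(2.0, 4.0, 6.0), withMean = true, withStd = true)
```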

**8. MinMaxScaler (min-max scaling)**

In [29]:
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics and generate MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [min, max].
val scaledData = scalerModel.transform(dataFrame)
println(s"Features scaled to range: [${scaler.getMin}, ${scaler.getMax}]")
scaledData.select("features", "scaledFeatures").show()


Features scaled to range: [0.0, 1.0]
+--------------+--------------+
|      features|scaledFeatures|
+--------------+--------------+
|[1.0,0.1,-1.0]| [0.0,0.0,0.0]|
| [2.0,1.1,1.0]| [0.5,0.1,0.5]|
|[3.0,10.1,3.0]| [1.0,1.0,1.0]|
+--------------+--------------+



dataFrame = [id: int, features: vector]
scaler = minMaxScal_79c98923f40c
scalerModel = minMaxScal_79c98923f40c
scaledData = [id: int, features: vector ... 1 more field]


[id: int, features: vector ... 1 more field]
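
Each feature is rescaled independently with `(x - min) / (max - min) * (outMax - outMin) + outMin`, where `min` and `max` are taken over that feature's column. The plain-Scala sketch below applies this to the second feature of the example, (0.1, 1.1, 10.1). `minMaxScale` is a hypothetical helper; the constant-column rule follows the Spark documentation.

```scala
// Rescale one feature column to [outMin, outMax].
def minMaxScale(xs: Seq[Double], outMin: Double = 0.0, outMax: Double = 1.0): Seq[Double] = {
  val (lo, hi) = (xs.min, xs.max)
  xs.map { x =>
    if (hi == lo) 0.5 * (outMin + outMax) // constant column: midpoint of the target range
    else (x - lo) / (hi - lo) * (outMax - outMin) + outMin
  }
}

val mmCol = minMaxScale(Seq(0.1, 1.1, 10.1))   // second feature column from the example
```

The result is approximately (0.0, 0.1, 1.0), matching the middle entries of `scaledFeatures` above.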

**9. MaxAbsScaler (max-abs scaling)**

In [30]:
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -8.0)),
  (1, Vectors.dense(2.0, 1.0, -4.0)),
  (2, Vectors.dense(4.0, 10.0, 8.0))
)).toDF("id", "features")

val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics and generate MaxAbsScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [-1, 1]
val scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show()

+--------------+----------------+
|      features|  scaledFeatures|
+--------------+----------------+
|[1.0,0.1,-8.0]|[0.25,0.01,-1.0]|
|[2.0,1.0,-4.0]|  [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]|   [1.0,1.0,1.0]|
+--------------+----------------+



dataFrame = [id: int, features: vector]
scaler = maxAbsScal_89f335f08a40
scalerModel = maxAbsScal_89f335f08a40
scaledData = [id: int, features: vector ... 1 more field]


[id: int, features: vector ... 1 more field]
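
MaxAbsScaler divides each feature by the largest absolute value in its column, so values land in [-1, 1] and zeros stay zero. A plain-Scala sketch on the third feature of the example, (-8.0, -4.0, 8.0); `maxAbsScale` is a hypothetical helper, and leaving an all-zero column unchanged is a choice made for this sketch.

```scala
// Divide one feature column by its maximum absolute value.
def maxAbsScale(xs: Seq[Double]): Seq[Double] = {
  val maxAbs = xs.map(x => math.abs(x)).max
  if (maxAbs == 0.0) xs else xs.map(_ / maxAbs)
}

val maCol = maxAbsScale(Seq(-8.0, -4.0, 8.0))   // third feature column from the example
```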

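**10. SQLTransformer**

SQLTransformer transforms a DataFrame with a SQL statement, much like registering the DataFrame as a temporary view and querying it; `__THIS__` stands for the input DataFrame.

Unlike a temporary view, it can be used as a Transformer stage in a Pipeline.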

In [32]:
import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()

+---+---+---+---+----+
| id| v1| v2| v3|  v4|
+---+---+---+---+----+
|  0|1.0|3.0|4.0| 3.0|
|  2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+



df = [id: int, v1: double ... 1 more field]
sqlTrans = sql_f7174c4e1104


sql_f7174c4e1104

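**11. Imputer**

Imputer fills in missing values, which can be represented by `Double.NaN`. It is an Estimator: `fit` computes a fill value for each input column (the mean, by default) and returns a model that performs the replacement.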

In [33]:
import org.apache.spark.ml.feature.Imputer

val df = spark.createDataFrame(Seq(
  (1.0, Double.NaN),
  (2.0, Double.NaN),
  (Double.NaN, 3.0),
  (4.0, 4.0),
  (5.0, 5.0)
)).toDF("a", "b")

val imputer = new Imputer()
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("out_a", "out_b"))

val model = imputer.fit(df)
model.transform(df).show()

+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+



df = [a: double, b: double]
imputer = imputer_85599015b326
model = imputer_85599015b326


imputer_85599015b326
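
The filled values above are column means computed over the non-missing entries: (1 + 2 + 4 + 5) / 4 = 3.0 for `a` and (3 + 4 + 5) / 3 = 4.0 for `b`. A plain-Scala sketch of mean imputation on one column, with `imputeMean` as a hypothetical helper:

```scala
// Replace NaN entries with the mean of the non-NaN entries of the same column.
def imputeMean(xs: Seq[Double]): Seq[Double] = {
  val present = xs.filterNot(_.isNaN)
  val mean = present.sum / present.size
  xs.map(x => if (x.isNaN) mean else x)
}

val outA = imputeMean(Seq(1.0, 2.0, Double.NaN, 4.0, 5.0))          // column a
val outB = imputeMean(Seq(Double.NaN, Double.NaN, 3.0, 4.0, 5.0))   // column b
```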

**12. ChiSqSelector**

When the label is discrete, ChiSqSelector selects features according to the chi-squared test statistic.

In [34]:
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()

ChiSqSelector output with top 1 features selected
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+



data = List((7,[0.0,0.0,18.0,1.0],1.0), (8,[0.0,1.0,12.0,0.0],0.0), (9,[1.0,0.0,15.0,0.1],0.0))
df = [id: int, features: vector ... 1 more field]
selector = chiSqSelector_6f2ed137f410
result = [id: int, features: vector ... 2 more fields]


[id: int, features: vector ... 2 more fields]

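**13. Locality-sensitive hashing (LSH)**

Locality-sensitive hashing is a widely used hashing technique, commonly applied to nearest-neighbor search, clustering, and outlier detection on very large datasets.

Its defining property is that points close to each other are hashed into the same bucket with high probability, while points far apart are hashed into different buckets with high probability.

BucketedRandomProjectionLSH uses Euclidean distance as its metric and takes the bucketed projection of a point onto a random direction as the hash value.

MinHashLSH uses Jaccard distance, a measure of how different two sets are, as its metric. It takes the minimum of the hash values of all elements in a set as the hash value.

The Jaccard distance is defined as: $$d(A,B)=1-\frac{|A\cap B|}{|A\cup B|}$$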


In [35]:
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "features")

val key = Vectors.dense(1.0, 0.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
println("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, "EuclideanDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()





The hashed dataset where hashed values are stored in the column 'hashes':
+---+-----------+--------------------+
| id|   features|              hashes|
+---+-----------+--------------------+
|  0|  [1.0,1.0]|[[0.0], [0.0], [-...|
|  1| [1.0,-1.0]|[[-1.0], [-1.0], ...|
|  2|[-1.0,-1.0]|[[-1.0], [-1.0], ...|
|  3| [-1.0,1.0]|[[0.0], [0.0], [-...|
+---+-----------+--------------------+

Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:
+---+---+-----------------+
|idA|idB|EuclideanDistance|
+---+---+-----------------+
|  1|  4|              1.0|
|  0|  6|              1.0|
|  1|  7|              1.0|
|  3|  5|              1.0|
|  0|  4|              1.0|
|  3|  6|              1.0|
|  2|  7|              1.0|
|  2|  5|              1.0|
+---+---+-----------------+

Approximately searching dfA for 2 nearest neighbors of the key:
+---+----------+--------------------+-------+
| id|  features|              hashes|distCol|
+---+----------+--------------------+-------+


dfA = [id: int, features: vector]
dfB = [id: int, features: vector]
key = [1.0,0.0]
brp = brp-lsh_dd546f0c3270
model = brp-lsh_dd546f0c3270


brp-lsh_dd546f0c3270

In [36]:
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("JaccardDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
// It may return less than 2 rows when not enough approximate near-neighbor candidates are
// found.
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()

The hashed dataset where hashed values are stored in the column 'hashes':
+---+--------------------+--------------------+
| id|            features|              hashes|
+---+--------------------+--------------------+
|  0|(6,[0,1,2],[1.0,1...|[[2.25592966E8], ...|
|  1|(6,[2,3,4],[1.0,1...|[[2.25592966E8], ...|
|  2|(6,[0,2,4],[1.0,1...|[[2.25592966E8], ...|
+---+--------------------+--------------------+

Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:
+---+---+---------------+
|idA|idB|JaccardDistance|
+---+---+---------------+
|  1|  4|            0.5|
|  0|  5|            0.5|
|  1|  5|            0.5|
|  2|  5|            0.5|
+---+---+---------------+

Approximately searching dfA for 2 nearest neighbors of the key:
+---+--------------------+--------------------+-------+
| id|            features|              hashes|distCol|
+---+--------------------+--------------------+-------+
|  1|(6,[2,3,4],[1.0,1...|[[2.25592966E8], ...|   0.75|
+---+--------------------+--------------------+-------+

dfA = [id: int, features: vector]
dfB = [id: int, features: vector]
key = (6,[1,3],[1.0,1.0])
mh = mh-lsh_ce80a4f003d3
model = mh-lsh_ce80a4f003d3


mh-lsh_ce80a4f003d3
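
The `JaccardDistance` and `distCol` values in the MinHashLSH output can be checked by hand. Each sparse vector is treated as the set of its non-zero indices, and the Jaccard distance is one minus the size of the intersection divided by the size of the union. A plain-Scala sketch, with `jaccardDistance` as a helper written for this illustration:

```scala
// Jaccard distance between two sets of active indices.
def jaccardDistance(a: Set[Int], b: Set[Int]): Double =
  1.0 - (a intersect b).size.toDouble / (a union b).size

// dfA id 1 = {2, 3, 4} and dfB id 4 = {2, 3, 5}: 2 shared out of 4 distinct indices.
val joinDistance = jaccardDistance(Set(2, 3, 4), Set(2, 3, 5))

// key = {1, 3} and dfA id 1 = {2, 3, 4}: 1 shared out of 4 distinct indices.
val keyDistance = jaccardDistance(Set(1, 3), Set(2, 3, 4))
```

These give 0.5 and 0.75, the same values as the first row of the similarity join and the nearest-neighbor result above.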

### 5. Classification models

### 6. Regression models

### 7. Clustering models

### 8. Dimensionality reduction models

### 9. Model tuning

### 10. Statistical tools