# LightGBM
We will demonstrate how to use the LightGBM quantile regressor with TrainRegressor and ComputeModelStatistics on the Triazines dataset.

This sample demonstrates how to use the following APIs:

- TrainRegressor
- LightGBMRegressor
- ComputeModelStatistics

## Dataset Review

The Adult dataset we are going to use is publicly available at the UCI Machine Learning Repository. This data derives from census data, and consists of information about 48842 individuals and their annual income. We will use this information to predict if an individual earns <=50K or >50k a year. The dataset is rather clean, and consists of both numeric and categorical variables.

Attribute Information:

- age: continuous
- workclass: Private,Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- fnlwgt: continuous
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc...
- education-num: continuous
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent...
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners...
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany...
- Target/Label: - <=50K, >50K

In [1]:
from pyspark.sql.types import DoubleType, StringType, StructField, StructType
from pyspark.sql import SparkSession
import pyspark
spark = SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
            .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
            .getOrCreate()
import mmlspark
from mmlspark.lightgbm import LightGBMClassifier

In [2]:
schema = StructType([
  StructField("age", DoubleType(), False),
  StructField("workclass", StringType(), False),
  StructField("fnlwgt", DoubleType(), False),
  StructField("education", StringType(), False),
  StructField("education_num", DoubleType(), False),
  StructField("marital_status", StringType(), False),
  StructField("occupation", StringType(), False),
  StructField("relationship", StringType(), False),
  StructField("race", StringType(), False),
  StructField("sex", StringType(), False),
  StructField("capital_gain", DoubleType(), False),
  StructField("capital_loss", DoubleType(), False),
  StructField("hours_per_week", DoubleType(), False),
  StructField("native_country", StringType(), False),
  StructField("income", StringType(), False)
])

dataset = spark.read.format("csv").schema(schema).load("/home/robin/datatsets/adult/adult.data")
cols = dataset.columns

In [3]:
# print some basic info
print("records read: " + str(dataset.count()))
print("Schema: ")
dataset.printSchema()
dataset.limit(5).toPandas()

records read: 32561
Schema: 
root
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)



Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


## Preprocess Data

Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. There are 2 ways we can do this.

- Category Indexing

This is basically assigning a numeric value to each category from {0, 1, 2, ...numCategories-1}. This introduces an implicit ordering among your categories, and is more suitable for ordinal variables (eg: Poor: 0, Average: 1, Good: 2)

- One-Hot Encoding

This converts categories into binary vectors with at most one nonzero value (eg: (Blue: [1, 0]), (Green: [0, 1]), (Red: [0, 0]))

In this dataset, we have ordinal variables like education (Preschool - Doctorate), and also nominal variables like relationship (Wife, Husband, Own-child, etc). For simplicity's sake, we will use One-Hot Encoding to convert all categorical variables into binary vectors. It is possible here to improve prediction accuracy by converting each categorical column with an appropriate method.

Here, we will use a combination of StringIndexer and OneHotEncoderEstimator to convert the categorical variables. The OneHotEncoderEstimator will return a SparseVector. Note: OneHotEncoderEstimator is renamed as OneHotEncoder in Spark 3.0.

Since we will have more than 1 stage of feature transformations, we use a Pipeline to tie the stages together. This simplifies our code.

In [4]:
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

from distutils.version import LooseVersion

categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    if LooseVersion(pyspark.__version__) < LooseVersion("3.0"):
        from pyspark.ml.feature import OneHotEncoderEstimator
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    else:
        from pyspark.ml.feature import OneHotEncoder
        encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

The above code basically indexes each categorical column using the StringIndexer, and then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row.

We use the StringIndexer again to encode our labels to label indices.

In [5]:
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

Use a VectorAssembler to combine all the feature columns into a single vector column. This includes both the numeric columns and the one-hot encoded binary vector columns in our dataset.

In [6]:
# Transform all features into a vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

Run the stages as a Pipeline. This puts the data through all of the feature transformations we described in a single call.

In [7]:
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)
preppedDataDF.printSchema()
preppedDataDF.limit(5).toPandas()

root
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- workclassIndex: double (nullable = false)
 |-- workclassclassVec: vector (nullable = true)
 |-- educationIndex: double (nullable = false)
 |-- educationclassVec: vector (nullable = true)
 |-- marital_statusIndex: double (nullable = false)
 |-- marital_statusclassVec: vector (nullable = true)
 |-- occupationIndex: double (nullable = false)
 |-- occupationclassVec: vec

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,...,relationshipIndex,relationshipclassVec,raceIndex,raceclassVec,sexIndex,sexclassVec,native_countryIndex,native_countryclassVec,label,features
0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,...,1.0,"(0.0, 1.0, 0.0, 0.0, 0.0)",0.0,"(1.0, 0.0, 0.0, 0.0)",0.0,(1.0),0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."
1,50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0)",0.0,"(1.0, 0.0, 0.0, 0.0)",0.0,(1.0),0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,...,1.0,"(0.0, 1.0, 0.0, 0.0, 0.0)",0.0,"(1.0, 0.0, 0.0, 0.0)",0.0,(1.0),0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
3,53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,...,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0)",1.0,"(0.0, 1.0, 0.0, 0.0)",0.0,(1.0),0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,...,4.0,"(0.0, 0.0, 0.0, 0.0, 1.0)",1.0,"(0.0, 1.0, 0.0, 0.0)",1.0,(0.0),9.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [8]:
# Keep relevant columns
selectedcols = ["label", "features"]
dataset = preppedDataDF.select(selectedcols)
dataset.limit(5).toPandas()

Unnamed: 0,label,features
0,0.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."
1,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
3,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [9]:
### Randomly split data into training and test sets. set seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())

22838
9723


## 模型构建

In [10]:
from mmlspark.lightgbm import LightGBMClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

lgb = LightGBMClassifier(
    objective="binary",
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=60,
    baggingFreq=1,
    baggingSeed=696,
    earlyStoppingRound=30,
    learningRate=0.1,
    lambdaL1=1.0,
    lambdaL2=45.0,
    maxDepth=3,
    numLeaves=128,
    baggingFraction=0.7,
    featureFraction=0.7,
    numIterations=800,
    verbosity=30
)

装载各个阶段到Pipeline流水线中，执行训练：

In [11]:
model = lgb.fit(trainingData)

In [12]:
model

LightGBMClassificationModel_263f97e5e776

## 预测:

In [13]:
train_preds = model.transform(trainingData)
test_preds = model.transform(testData)

我们看看模型训练效果：

In [14]:
binaryEvaluator = BinaryClassificationEvaluator()
print ("Train AUC: " + str(binaryEvaluator.evaluate(train_preds, {binaryEvaluator.metricName: "areaUnderROC"})))
print ("Test AUC: " + str(binaryEvaluator.evaluate(test_preds, {binaryEvaluator.metricName: "areaUnderROC"})))

Train AUC: 0.9403789003910289
Test AUC: 0.9228858550840612


当然，我们可以把预测概率结果和真实label取出来，方便进行计算其他自定义指标，例如KS等。

In [15]:
train_prob_list = [row.probability[0] for row in train_preds.select('probability').collect()]
train_label_list = [row.label for row in train_preds.select('label').collect()]
 
test_prob_list = [row.probability[0] for row in test_preds.select('probability').collect()]
test_label_list = [row.label for row in test_preds.select('label').collect()]

DataFrame在经过一系列transform之后，会多出4列，分别为features|rawPrediction|probability|prediction|，rawPrediction是树模型最后的一对儿得分，和为0；在经过sigmoid之后，得到probability里的一对儿概率值，其和为1，分别表示模型判定该样本为两个分类的可能性；而prediction则是模型预测的样本类别。

## lightGBM调参之PySpark + Grid Search

In [16]:
import numpy as np
import mmlspark
from mmlspark.lightgbm import LightGBMClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

lgb = LightGBMClassifier(
    objective="binary",
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=60,
    baggingFreq=1,
    baggingSeed=696,
    earlyStoppingRound=30,
    learningRate=0.1,
    # lambdaL1=1.0,
    # lambdaL2=45.0,
    maxDepth=3,
    numLeaves=128,
    baggingFraction=0.7,
    featureFraction=0.7,
    minSumHessianInLeaf=0.001,
    numIterations=800,
    verbosity=1
)

### 设置Grid Search参数组：

In [17]:
paramGrid = ParamGridBuilder() \
    .addGrid(lgb.lambdaL1, list(np.arange(1.0, 3.0, 1.0))) \
    .addGrid(lgb.lambdaL2, list(np.arange(1.0, 4.0, 1.0))) \
    .build()

设置完成之后，我们可以看一下参数都是哪些：

In [18]:
for param in paramGrid:
    print(param.values())

dict_values([1.0, 1.0])
dict_values([1.0, 2.0])
dict_values([1.0, 3.0])
dict_values([2.0, 1.0])
dict_values([2.0, 2.0])
dict_values([2.0, 3.0])


## 交叉验证选择模型
官方提供了两种模型选择的方式：CrossValidator和TrainValidationSplit，可以参考官方文档。CrossValidator和TrainValidationSplit的区别在于：CrossValidator会每次选取一部分训练集建模，去预测另外一部分训练集，这样会有K个预测的分数（K折交叉验证），最后模型的预测分数为K个分数的平均；而TrainValidationSplit则只会训练预测一次。这里，我们试着给出一个CrossValidator的例子

In [19]:
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
cross_vallidator = CrossValidator(estimator=lgb,
                          estimatorParamMaps=paramGrid, 
                          evaluator=evaluator, 
                          numFolds=3)
model = cross_vallidator.fit(trainingData)

最后，我们可以得到最好的模型，并看最好的模型的特征重要性：

In [20]:
print(model.bestModel)
print(model.bestModel.getFeatureImportances())

LightGBMClassificationModel_31fe08e1785d
[66.0, 78.0, 44.0, 7.0, 28.0, 25.0, 24.0, 0.0, 46.0, 53.0, 32.0, 20.0, 18.0, 16.0, 16.0, 10.0, 19.0, 4.0, 11.0, 8.0, 6.0, 5.0, 2.0, 84.0, 49.0, 45.0, 12.0, 31.0, 9.0, 57.0, 39.0, 57.0, 35.0, 48.0, 49.0, 39.0, 20.0, 38.0, 25.0, 49.0, 24.0, 36.0, 9.0, 41.0, 55.0, 43.0, 50.0, 63.0, 47.0, 28.0, 23.0, 16.0, 89.0, 45.0, 22.0, 20.0, 14.0, 4.0, 5.0, 9.0, 0.0, 4.0, 10.0, 5.0, 0.0, 11.0, 6.0, 6.0, 0.0, 1.0, 0.0, 11.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 837.0, 933.0, 317.0, 318.0, 275.0, 568.0]


但是，官方api查看每个结果对应的超参数却是非常不友好，我们只好自己想办法，这里我们参考了这篇博客 LightGBM Hyper Parameters Tuning in Spark：

In [21]:
def params_extract(model):
    """
    function extact hyperparameter information from a CrossValidatorModel
    input: a CrossValidatorModel instance, model fit by CrossValidator in pyspark.ml.tuning
    output: a dictionary with key(hyperparameters setting), value(evaluator's metrics, r2, auc,...)
    """
    length = len(model.avgMetrics)
    res = {}
    for i in range(length):
        s = ""
        paraDict = model.extractParamMap()[model.estimatorParamMaps][i]
        for j in paraDict.keys():
            s += str(j).split("__")[1] + "  "
            s += str(paraDict[j]) + "  "
        res[s.strip()] = model.avgMetrics[i]
    return {k: v for k, v in sorted(res.items(), key=lambda item: item[1])}

In [22]:
params_extract(model)

{'lambdaL1  1.0  lambdaL2  2.0': 0.9236232416179453,
 'lambdaL1  1.0  lambdaL2  1.0': 0.9236650282485419,
 'lambdaL1  1.0  lambdaL2  3.0': 0.9236839543034484,
 'lambdaL1  2.0  lambdaL2  3.0': 0.9237781089311508,
 'lambdaL1  2.0  lambdaL2  1.0': 0.9238784689337032,
 'lambdaL1  2.0  lambdaL2  2.0': 0.9240805941795885}

### 官方调参探讨
我们通过CrossValidator可以获得最佳的模型，但是会有一个问题：这个最佳是拟合训练集的最佳，而不是我们给出的验证集的最佳；即使是TrainValidationSplit，我们也不能自定义验证集并传入，只能随机选择验证集。这样对于那些样本时间先后顺序不敏感的数据是影响不大的，比如图像等，但是对于交易类数据，我们希望可以根据时间先后顺序自定义训练集，验证集和测试集，并且根据验证集的效果来确定最佳参数和模型。因此，我们就需要换种方式达到目的。

In [23]:
lightGBMs = list()
for lambdaL1 in list(np.arange(1.0, 3.0, 1.0)):
    for lambdaL2 in list(np.arange(1.0, 4.0, 1.0)):
        lightGBMs.append(lgb.setLambdaL1(lambdaL1).setLambdaL2(lambdaL2))
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")
metrics = []
models = []
 
# 选择验证集效果最好的模型
for learner in lightGBMs:
    model = learner.fit(trainingData)
    models.append(model)
    scoredData = model.transform(testData)
    metrics.append(evaluator.evaluate(scoredData))
best_metric = max(metrics)
best_model = models[metrics.index(max(metrics))] 
 
# 得到测试集上AUC
scored_test = best_model.transform(testData)
print(evaluator.evaluate(scored_test))

0.9237224880382701


## lightGBM调参之PySpark + mmlspark + Grid Search

上面, 我们记录了分别用PySpark中自带的CrossValidator和更通用的生成多个分类器同时执行训练预测的方式选取最好的模型。其中CrossValidator并不能得到验证集上最佳的分类器，而是得到训练集上最佳的效果。而mmlspark当中却有更为简单的方式，既可以得到验证集上最佳的效果，也可以方便地记录我们每一组参数对应的结果，是一种很好的方式。

In [24]:
from mmlspark.lightgbm import LightGBMClassifier
from mmlspark.automl import *
from mmlspark.train import TrainClassifier, ComputeModelStatistics

In [25]:
lgb = LightGBMClassifier(
    objective="binary",
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=60,
    baggingFreq=1,
    baggingSeed=696,
    earlyStoppingRound=20,
    learningRate=0.1,
    #lambdaL1=1.0,
    #lambdaL2=45.0,
    maxDepth=3,
    numLeaves=128,
    baggingFraction=0.7,
    featureFraction=0.7,
    minSumHessianInLeaf=0.001,
    numIterations=800,
    verbosity=1
)

以lambdaL1和lambdaL2为例，设置4组不同的参数，这里可以自己记录每组参数，方便和后面各模型效果对应，就不再赘述了：

In [26]:
lightGBMs = list()
for lambdaL1 in list(np.arange(1.0, 3.0, 1.0)):
    for lambdaL2 in list(np.arange(1.0, 3.0, 1.0)):
        lightGBMs.append(lgb.setLambdaL1(lambdaL1).setLambdaL2(lambdaL2))

当然，这里可以用这种方式来设置参数，更方便

In [27]:
import itertools
lightGBMs = list()
params = itertools.product([1.0, 2.0], [1.0, 2.0])
for param in params:
    lightGBMs.append(lgb.setLambdaL1(param[0]).setLambdaL2(param[1]))

利用mmlspark.train模块当中的TrainClassifier类训练模型:

In [28]:
lgb_models = [TrainClassifier(model=lgb, labelCol="label").fit(trainingData) for lgb in lightGBMs]

利用mmlspark.automl当中的FindBestModel类，寻找在验证集上效果最好的模型：

In [29]:
best_model = FindBestModel(evaluationMetric='AUC', models=lgb_models).fit(testData)

我们可以看看最好的模型效果：

In [30]:
best_model.getBestModelMetrics().collect()

[Row(evaluation_type='Classification', confusion_matrix=DenseMatrix(2, 2, [6041.0, 353.0, 1307.0, 2022.0], False), accuracy=0.8292708011930474, precision=0.607389606488435, recall=0.8513684210526316, AUC=0.9234257227172484)]

也可以看看所有4个模型在验证集上的效果：

In [31]:
best_model.getAllModelMetrics().collect()

[Row(model_name='TrainClassifier_84a31e2a4d4d', metric=0.9234257227172484, parameters='featuresCol: TrainClassifier_229f52ed8c8e_features, labelCol: label, predictionCol: prediction, probabilityCol: probability, rawPredictionCol: rawPrediction'),
 Row(model_name='TrainClassifier_c8bc9ba50c7b', metric=0.9234257227172484, parameters='featuresCol: TrainClassifier_5f6badd6c4a9_features, labelCol: label, predictionCol: prediction, probabilityCol: probability, rawPredictionCol: rawPrediction'),
 Row(model_name='TrainClassifier_53de8573a8b8', metric=0.9234257227172484, parameters='featuresCol: TrainClassifier_2ca4d026570c_features, labelCol: label, predictionCol: prediction, probabilityCol: probability, rawPredictionCol: rawPrediction'),
 Row(model_name='TrainClassifier_e15e3f26dca1', metric=0.9234257227172484, parameters='featuresCol: TrainClassifier_2cc6adae0df3_features, labelCol: label, predictionCol: prediction, probabilityCol: probability, rawPredictionCol: rawPrediction')]

## 预测测试集

In [32]:
predictions = best_model.transform(testData)
metrics = ComputeModelStatistics().transform(predictions)
print("Best model's AUC on test set = "
      + "{0:.2f}%".format(metrics.first()["AUC"] * 100))

Best model's AUC on test set = 92.34%


我们可以看看metrics里面都是什么：

In [33]:
metrics.collect()

[Row(evaluation_type='Classification', confusion_matrix=DenseMatrix(2, 2, [6041.0, 353.0, 1307.0, 2022.0], False), accuracy=0.8292708011930474, precision=0.607389606488435, recall=0.8513684210526316, AUC=0.9234257227172484)]

结果也是一个list，包含混淆矩阵，accuracy,precision和recall：

总体来说，这种方式来进行分布式训练，比PySpark自带的api更方便一些，推荐mmlspark方式。