### Machine Learning with Spark: Tuning

##### ML Pipelines API
  * DataFrame &check;
  * Transformer / Estimator / Pipeline &check;
  * __CrossValidator / ParamGridBuilder / Evaluator__ &xlArr;

Extended example with the diamonds dataset, focusing on more elements of the ML process:
* Handling Categorical Variables
* Evaluation
* Tuning
* Cross-validation

In [4]:
spark.read.option("header", True).csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv").printSchema()

In [5]:
data = spark.read.option("header", True).option("inferSchema", True) \
        .csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
  
data.show()

In [6]:
data.printSchema()

We'll look at the features in more detail ... but right away we see we'll have to do something about string-typed features. 

The price (label) is an integer, not a double. In many cases, an integer can be auto-widened to a double, but there may be some places we'll have to watch out.

Also, that "\_c0" (a.k.a. the row number or row ID) ... not only is it not a feature, but it can leak information about the label if the file is ordered in some meaningful way:

In [8]:
display(data.select("_c0", "price").sample(False, 0.02, 42)) # what does this tell us? :)

_c0,price
7,336
102,2760
116,2762
398,554
506,2822
518,2824
523,2825
557,2831
572,2833
606,2839
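
One way to read the sample above: order the rows by `_c0` and check how often the price rises from one sampled row to the next. The sketch below does that in plain Python on the ten rows shown (no Spark needed); the helper name `fraction_increasing` is made up for this illustration.

```python
# The ten (_c0, price) pairs shown in the sample output above.
sample = [(7, 336), (102, 2760), (116, 2762), (398, 554), (506, 2822),
          (518, 2824), (523, 2825), (557, 2831), (572, 2833), (606, 2839)]

def fraction_increasing(pairs):
    """Share of adjacent rows (ordered by row ID) where the price goes up."""
    ordered = sorted(pairs)  # sort by row ID
    rises = sum(1 for (_, p1), (_, p2) in zip(ordered, ordered[1:]) if p2 > p1)
    return rises / (len(ordered) - 1)

print(fraction_increasing(sample))
```

In this small sample, 8 of the 9 adjacent pairs are increasing, which suggests the file is largely ordered by price, so the row ID carries information about the label.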


In [9]:
# We can get rid of the row number and fix the price type:

data2 = data.drop("_c0").withColumn("label", data["price"].cast("double")).drop("price")
data2.show()

In [10]:
display(data2.describe())

summary,carat,cut,color,clarity,depth,table,x,y,z,label
count,53940.0,53940,53940,53940,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.7979397478679852,,,,61.74940489432624,57.45718390804603,5.731157211716609,5.734525954764462,3.538733778272332,3932.799721913237
stddev,0.4740112444054196,,,,1.4326213188336523,2.2344905628213247,1.1217607467924915,1.1421346741235616,0.7056988469499883,3989.439738146397
min,0.2,Fair,D,I1,43.0,43.0,0.0,0.0,0.0,326.0
max,5.01,Very Good,J,VVS2,79.0,95.0,10.74,58.9,31.8,18823.0


In [11]:
display(data2.sample(False, 0.05, 42))

carat,cut,color,clarity,depth,table,x,y,z,label
0.24,Very Good,I,VVS1,62.3,57.0,3.95,3.98,2.47,336.0
0.3,Very Good,I,SI1,62.6,57.0,4.25,4.28,2.67,405.0
0.3,Very Good,I,SI1,63.0,57.0,4.28,4.32,2.71,405.0
0.75,Premium,E,SI1,59.9,54.0,6.0,5.96,3.58,2760.0
0.74,Ideal,G,SI1,61.6,55.0,5.8,5.85,3.59,2760.0
0.75,Premium,G,VS2,61.7,58.0,5.85,5.79,3.59,2760.0
0.73,Ideal,F,VS2,62.7,53.0,5.8,5.75,3.62,2762.0
0.71,Good,E,VS2,59.2,61.0,5.8,5.88,3.46,2772.0
0.72,Ideal,G,SI1,61.8,56.0,5.72,5.75,3.55,2776.0
0.53,Very Good,D,VVS2,57.5,64.0,5.34,5.37,3.08,2782.0


In [12]:
display(data2.filter(data2['x'] <= 3))

carat,cut,color,clarity,depth,table,x,y,z,label
1.07,Ideal,F,SI2,61.6,56.0,0.0,6.62,0.0,4954.0
1.0,Very Good,H,VS2,63.3,53.0,0.0,0.0,0.0,5139.0
1.14,Fair,G,VS1,57.5,67.0,0.0,0.0,0.0,6381.0
1.56,Ideal,G,VS2,62.2,54.0,0.0,0.0,0.0,12800.0
1.2,Premium,D,VVS1,62.1,59.0,0.0,0.0,0.0,15686.0
2.25,Premium,H,SI2,62.8,59.0,0.0,0.0,0.0,18034.0
0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0,2130.0
0.71,Good,F,SI2,64.1,60.0,0.0,0.0,0.0,2130.0


Looks like there are at least a few incomplete records in here: a dimension of 0.0 cannot be a real measurement.

Now we need to do something about the categorical features: cut, color, and clarity.

In [15]:
data2.select("cut").distinct().show()

In [16]:
display(data2.groupBy("cut").count())

cut,count
Premium,13791
Ideal,21551
Good,4906
Fair,1610
Very Good,12082


First, we need to convert the categorical values to numbers.

We can do that with a StringIndexer

* https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer

Then we need to one-hot encode
* In the "old days" we would create several OneHotEncoder objects
* Now we use OneHotEncoderEstimator https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator (in Spark 3.0 and later this class is named OneHotEncoder)
  * Proper pattern for Estimator
  * Handles multiple columns at once (yea!)
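
To make the two steps concrete, here is a plain-Python sketch of what indexing followed by one-hot encoding produces for `cut`, using the counts from the table above. It assumes the Spark defaults as I understand them (StringIndexer orders labels by descending frequency, and the one-hot encoder drops the last category). The helper names are invented for the illustration, and this is not Spark's implementation.

```python
# Counts for "cut" taken from the groupBy output above.
cut_counts = {"Premium": 13791, "Ideal": 21551, "Good": 4906,
              "Fair": 1610, "Very Good": 12082}

def build_index(counts):
    """Most frequent label gets index 0, the next gets 1, and so on."""
    ordered = sorted(counts, key=lambda label: -counts[label])
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

cut_index = build_index(cut_counts)
for label, idx in cut_index.items():
    print(label, idx, one_hot(idx, len(cut_index)))
```

Dropping the last category keeps the encoded columns from summing to one on every row, which would otherwise be collinear with the intercept of a linear model.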

Now let's automate this work a bit: we'll use functional collections to create our feature helpers, and a pipeline to wrap them

In [19]:
from pyspark.ml.feature import *
from pyspark.ml import Pipeline

categoricalFields = ["cut", "color", "clarity"]

indexers = [StringIndexer(inputCol=f, outputCol=f + "Index") for f in categoricalFields]

encoder = OneHotEncoderEstimator(inputCols=[f + "Index" for f in categoricalFields], outputCols=[f + "Vec" for f in categoricalFields])

pipeline = Pipeline(stages=indexers + [encoder])

model = pipeline.fit(data2)

model.transform(data2).show()

That looks pretty good. Next, we need to bring all of our features together into a single vector. We've seen a helper that does exactly that: VectorAssembler.

In [21]:
assembler = VectorAssembler(inputCols=[f + "Vec" for f in categoricalFields] + 
                            ["carat", "depth", "table", "x", "y", "z"], outputCol="features")

Pipeline(stages=indexers + [encoder, assembler]).fit(data2).transform(data2).select("features").show(truncate=False)
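
Conceptually, the assembler just concatenates scalar columns and vector columns into one flat vector, in the order given. A minimal plain-Python sketch follows; the `assemble` helper and the example `cutVec` value are illustrative assumptions, and only one encoded column is included to keep it short.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued fields of a row into one flat list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)  # vector column
        else:
            features.append(float(value))             # scalar column
    return features

# First row of the earlier sample, with an assumed 4-slot encoding for its cut.
row = {"cutVec": [0.0, 0.0, 1.0, 0.0], "carat": 0.24, "depth": 62.3,
       "table": 57.0, "x": 3.95, "y": 3.98, "z": 2.47}

features = assemble(row, ["cutVec", "carat", "depth", "table", "x", "y", "z"])
print(len(features), features)
```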

Let's finish the pipeline by adding the Linear Regression algorithm

In [23]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()
completePipeline = Pipeline(stages=indexers + [encoder, assembler, lr])

Now we're ready to train and do an initial test

In [25]:
train, test = data2.randomSplit([0.75, 0.25])

model = completePipeline.fit(train)

predictions = model.transform(test).select("label", "prediction")

display(predictions.sample(False, 0.05))

label,prediction
369.0,-398.8387255873667
530.0,786.4002368391675
465.0,834.936489195937
552.0,805.7941041719162
419.0,21.362239752617995
740.0,1012.8156133688792
403.0,237.06372093607027
456.0,361.7973948672261
605.0,879.3006219979961
600.0,664.9835450630488


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Evaluator
### Calculates statistics on our models indicating
* goodness-of-fit, explanation of variance
* error quantities, precision/recall/etc.

### Generates one stat at a time
* "mode-ful" switching of stat via setter

### Why? Designed for integration and for Spark, not just us
* In particular, answers question "Which is better?"

### RegressionEvaluator, BinaryClassificationEvaluator, ...

In [27]:
from pyspark.ml.evaluation import RegressionEvaluator

eval = RegressionEvaluator()
eval.evaluate(predictions)

In [28]:
eval.setMetricName("r2")
eval.evaluate(predictions)
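
For reference, both metrics can be computed by hand (RMSE is, as far as I know, the evaluator's default when no metric name is set). The sketch below applies the usual definitions of RMSE and R² to the ten sampled (label, prediction) pairs shown earlier. It is a plain-Python illustration with made-up helper names, and on such a small sample the values will not match what the evaluator reports for the full test set.

```python
import math

# (label, prediction) pairs copied from the sampled output above.
pairs = [(369.0, -398.8387255873667), (530.0, 786.4002368391675),
         (465.0, 834.936489195937), (552.0, 805.7941041719162),
         (419.0, 21.362239752617995), (740.0, 1012.8156133688792),
         (403.0, 237.06372093607027), (456.0, 361.7973948672261),
         (605.0, 879.3006219979961), (600.0, 664.9835450630488)]

def rmse(pairs):
    """Root mean squared error: lower is better."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

def r2(pairs):
    """1 - SS_res / SS_tot: higher is better, 1.0 is a perfect fit."""
    mean_y = sum(y for y, _ in pairs) / len(pairs)
    ss_res = sum((y - p) ** 2 for y, p in pairs)
    ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)
    return 1.0 - ss_res / ss_tot

print("rmse:", rmse(pairs))
print("r2:", r2(pairs))
```

Note the opposite directions: a smaller RMSE is better while a larger R² is better, which is why an evaluator has to know which way to compare when it is asked "Which is better?".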

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) ParamGridBuilder

### Helper to specify a grid of (hyper)params
* Several params, chosen based on algorithm/model type
* Several values for each param
* Allows Spark to find/try every combination of values!


### Example: Parameter Grid for Tuning a Decision Tree
| Parameter  | Test Value 1 | Test Value 2 | Test Value 3 | (etc.) |
|------------|--------------|--------------|--------------|--------|
| `maxDepth` | 6            | 10           | 12           | `...`  |
| `maxBins`  | 16           | 32           | 48           | `...`  | 
| `...`      | `...`        | `...`        | `...`        | `...`  |
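
The grid is simply the Cartesian product of the value lists. A plain-Python sketch using the `maxDepth` and `maxBins` values from the table (this mirrors the idea behind `ParamGridBuilder`, not its actual implementation; `build_grid` is a hypothetical helper):

```python
from itertools import product

def build_grid(param_values):
    """Return one dict per combination of the supplied parameter values."""
    names = list(param_values)
    return [dict(zip(names, combo))
            for combo in product(*(param_values[n] for n in names))]

grid = build_grid({"maxDepth": [6, 10, 12], "maxBins": [16, 32, 48]})
print(len(grid))  # 3 values x 3 values = 9 combinations
for params in grid:
    print(params)
```

The number of combinations multiplies with every parameter added, so grids get expensive quickly.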

In [30]:
from pyspark.ml.tuning import *

paramGrid = ParamGridBuilder().addGrid(lr.elasticNetParam, [0.3, 0.7]).addGrid(lr.regParam, [0.01, 0.1]).build()

cv = CrossValidator().setEstimator(completePipeline).setEvaluator(eval).setEstimatorParamMaps(paramGrid).setNumFolds(3) 

cvModel = cv.fit(train)
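
To see what `cv.fit` is doing, here is a plain-Python sketch of k-fold splitting and of the resulting number of model fits. It assumes the standard scheme (each parameter combination is trained once per fold, then the winner is refit on all of the training data). The function names are invented for illustration, and Spark assigns rows to folds randomly rather than by position.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs; each row validates exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def total_fits(n_param_maps, k):
    """Fits during the search, plus one final refit of the best combination."""
    return n_param_maps * k + 1

for training, validation in k_fold_indices(n_rows=9, k=3):
    print("train:", sorted(training), "validate:", validation)

# 2 elasticNetParam values x 2 regParam values, 3 folds
print(total_fits(n_param_maps=4, k=3))
```

Because the estimator here is the whole pipeline, every one of those fits re-runs the indexers, the encoder, and the assembler as well as the regression.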

How different was the performance across the different parameter sets?

In [32]:
cvModel.avgMetrics

In [33]:
# Given a CrossValidatorModel cvModel, how can we find out which hyperparams produced the "best" model chosen by the CrossValidator?

cvModel.getEstimatorParamMaps()

In [34]:
for pair in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
  print(pair)
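
Picking the winner from those pairs depends on the metric's direction. A minimal sketch, with hypothetical parameter maps and metric values standing in for `getEstimatorParamMaps()` and `avgMetrics` (the numbers are invented, not results from this run):

```python
def best_params(param_maps, avg_metrics, larger_is_better):
    """Return the (params, metric) pair with the best average metric."""
    pairs = list(zip(param_maps, avg_metrics))
    pick = max if larger_is_better else min
    return pick(pairs, key=lambda pair: pair[1])

# Hypothetical stand-ins for the cross-validation output.
param_maps = [{"elasticNetParam": 0.3, "regParam": 0.01},
              {"elasticNetParam": 0.3, "regParam": 0.1},
              {"elasticNetParam": 0.7, "regParam": 0.01},
              {"elasticNetParam": 0.7, "regParam": 0.1}]
avg_metrics = [0.9180, 0.9182, 0.9181, 0.9185]  # made-up r2 values

print(best_params(param_maps, avg_metrics, larger_is_better=True))
```

With the real objects, the direction can be read from the evaluator via `eval.isLargerBetter()`, which, as far as I know, is true for `r2` and false for `rmse`.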

After training the CrossValidatorModel, `cvModel.bestModel` will contain a model trained on all of the training data using the best hyperparams.

However, we could also train that "best model" ourselves:

In [36]:
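# Substitute whichever values came out best in the cross-validation output above.
lr = LinearRegression(regParam=0.1, elasticNetParam=0.7)

# completePipeline still points at the original lr object, so build a new
# pipeline that actually contains the re-configured estimator.
finalPipeline = Pipeline(stages=indexers + [encoder, assembler, lr])
finalModel = finalPipeline.fit(train)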

Run the final model against the test set

In [38]:
predictions = finalModel.transform(test).select("label", "prediction")
eval.evaluate(predictions)

In [39]:
eval.setMetricName("rmse")
eval.evaluate(predictions)

Have a quick look at the errors...

In [41]:
display(predictions.sample(False, 0.05).selectExpr("prediction-label as error"))

error
280.1759317096614
300.16535980240246
121.95868956027071
-210.09312489639245
165.82154978507106
501.51317534136
-797.5433482866549
-159.19753230535275
282.7720712524367
-755.5193619743991


In [42]:
finalModel.stages[-1].coefficients

In [43]:
finalModel.stages[-1].intercept

Linear regression? Really? Can't we do something a bit fancier?
Let's take a quick look at a Gradient-Boosted Tree Regression:

In [45]:
from pyspark.ml.regression import GBTRegressor

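gbt = GBTRegressor()

# Tree models can split on indexed categories directly, so we feed the
# "Index" columns and skip the one-hot encoder.
assembler = VectorAssembler(inputCols=[f + "Index" for f in categoricalFields] + ["carat", "depth", "table", "x", "y", "z"], outputCol="features")
gbtPipeline = Pipeline(stages=indexers + [assembler, gbt])

# Note: this draws a fresh random split, so the score below is not measured
# on the same test rows as the linear model's.
train, test = data2.randomSplit([0.75, 0.25])

gbtModel = gbtPipeline.fit(train)
predictions = gbtModel.transform(test)
eval.evaluate(predictions)  # eval is still set to "rmse" from the earlier cell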

Note: If you need the official XGBoost, there is a package for distributed training via Spark
* http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
* https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html