## Machine Learning with Spark

##### Legacy API vs. Modern API
* ML Pipelines API
  * DataFrame &check;
  * __Transformer / Estimator / Pipeline &xlArr;__ 
  * CrossValidator / ParamGridBuilder / Evaluator (next module)
* The RDD-based API is in maintenance mode; it is expected to be deprecated and eventually removed (possibly around 3.0)

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Uniform API

### Feature processors: 
 * `Transformer` or
 * `Estimator` + `Model`

### ML Algorithms: 
* `Estimator`

### ML Models: 
* `Model`

### Every processor, algorithm, and tuning tool uses the same API!
* Low cognitive load
* "If you're bored, we're all winning!"  
(because your brain is now free to work on the hard and interesting stuff)

##### Documentation

* Programming guide with explanations and examples: http://spark.apache.org/docs/latest/ml-guide.html
* API docs: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html

In [5]:
path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"

spark.read.csv(path).show()

We can use the header data as column names through the regular reader option call:

In [7]:
spark.read.option("header", True).csv(path).show()

In this example, we're just going to look at carat weight as a linear predictor of price. So we'll grab those two columns and use the `inferSchema` option to get numeric types (recall that the CSV reader treats every column as a string by default):

In [9]:
data = spark.read.option("header", True) \
            .option("inferSchema", True) \
            .csv(path) \
            .select("carat", "price")

Let's have a quick look to see if there's any hope for our linear regression:

In [11]:
display(data.sample(False, 0.01))

carat,price
0.23,404
0.3,405
0.7,2777
0.73,2791
1.04,2801
0.71,2822
0.74,2855
0.31,557
0.73,2858
0.71,2891


What's the Pearson correlation?

In [13]:
data.stat.corr("price", "carat")
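As a rough check on what `corr` computes, the Pearson coefficient can be worked out by hand. This Spark-free sketch uses only the ten rows from the sample output above, so its value will differ from the full-dataset result; the helper name `pearson` is made up for illustration.

```python
from math import sqrt

# The ten (carat, price) rows from the sample output above, not the full dataset.
carat = [0.23, 0.3, 0.7, 0.73, 1.04, 0.71, 0.74, 0.31, 0.73, 0.71]
price = [404, 405, 2777, 2791, 2801, 2822, 2855, 557, 2858, 2891]


def pearson(xs, ys):
    """Pearson r: co-variation of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r = pearson(carat, price)
print(round(r, 3))
```

The coefficient is symmetric, which is why the argument order in `corr("price", "carat")` does not matter.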

Ok, so there is a chance we'll get something useful out... 

But what do we need to feed in?

Vectors!

The training (and validation, test, etc.) data needs to be collected into a single column of type `Vector`.

Luckily, there's a built-in `Transformer` that takes one or more numeric columns and collects them into a new column containing a `Vector`.

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Patterns Underlying Spark ML

### Snap-together brick model

### Encapsulation of processing
* **Transformer**
* **Estimator**
* **Pipeline**

### Evaluation / Tuning
* **Evaluator**
* **CrossValidator**
* **ParamGridBuilder**

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Transformer

### Processes features

### Typically a "map" operation:
  * Easily vectorized/parallelized
  * For example: `Binarizer`

### Run by calling `transform(..)`
  * For example: `aTransformer.transform(aDataFrame)`

### Extend Spark
  * by extending `UnaryTransformer` or
  * by extending `Transformer`

In [17]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["carat"], outputCol="features")

First, look at the next cell to make concrete what this transformer does, and what transformers do in general.

Next, take a look at the VectorAssembler docs and maybe even the source code.

In [19]:
assembler.transform(data).show()

Now we'll add the `LinearRegression` algorithm. The algorithm builds a model from data.

`LinearRegression` is an `Estimator`: it has to look at all the data and then build a new piece of state, the `Model`, which can make a prediction for each row. (A `Model` is a subtype of `Transformer`.)

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Estimator

### More complex feature processing, and/or predictions

### Typically one or many "reduce" operations
* Builds valuable state

### Produces a Model (Transformer) to encapsulate state
* Generate state & model by calling `anEstimator.fit(aDataFrame)`
* Resulting model is a `Transformer`

### Extend Spark 
* by creating a specific Model subclass and 
* an Estimator that generates it
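The Estimator/Model split can be illustrated without Spark. In the toy sketch below (the class names `ToyLinearEstimator` and `ToyLinearModel` are hypothetical and not part of Spark), `fit` reduces the whole dataset to a few sums, and the resulting model applies the learned state row by row.

```python
class ToyLinearModel:
    """Holds learned state; transform() is a row-wise map, like a Transformer."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def transform(self, rows):
        # Append a prediction to each (x, y) row.
        return [(x, y, self.slope * x + self.intercept) for x, y in rows]


class ToyLinearEstimator:
    """fit() scans all rows (a reduce) and returns a model."""

    def fit(self, rows):
        n = len(rows)
        sum_x = sum(x for x, _ in rows)
        sum_y = sum(y for _, y in rows)
        sum_xy = sum(x * y for x, y in rows)
        sum_xx = sum(x * x for x, _ in rows)
        # Closed-form simple least squares from the accumulated sums.
        slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
        intercept = (sum_y - slope * sum_x) / n
        return ToyLinearModel(slope, intercept)


rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
toy_model = ToyLinearEstimator().fit(rows)
toy_predictions = toy_model.transform(rows)
print(toy_model.slope, toy_model.intercept)
```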

In [22]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol="price")

We can run each of these components separately to see how they work (we'll see a shortcut in just a minute).

In [24]:
train, test = data.randomSplit([0.75, 0.25])

lrModel = lr.fit(assembler.transform(train))

lrModel.transform(assembler.transform(test)).show()

Let's package the processing steps together so that we don't need to run them
* one at a time
* again for each dataset (training, validation, test, etc.)

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Pipeline

### Represents composition of
* various Transformers' `.transform(..)` methods
* various Estimators' `.fit(..)` and result's `.transform(..)` methods

### "<a href="https://en.wikipedia.org/wiki/Tacit_programming" target="_blank">Point-free</a>" operations

### A Pipeline is itself an `Estimator` (supports composition)

### Instead of this...

<pre>
model = est2.fit(
             est1.fit(tf2.transform(tf1.transform(data)))
                 .transform(tf2.transform(tf1.transform(data))))</pre>

### We use this...

<pre>model = Pipeline(stages=[tf1, tf2, est1, est2]).fit(data)</pre>

The `Pipeline` is an `Estimator` that represents composing a series of `Transformer`s or `Estimator`-`Model` pairs.

When we add an `Estimator` to a pipeline (without specifically fitting the `Estimator` first), we are performing composition -- `Pipeline`s are themselves `Estimator`s, so we're making a new `Estimator` that includes the `LinearRegression` algorithm as a component part.
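What fitting a pipeline does with its stages can be sketched in plain Python. This is a toy model of the idea, not Spark's implementation, and every class name in it is hypothetical. Stages that have a `fit` method are fitted and replaced by their models, and each stage's output feeds the next.

```python
class AddOne:
    """Toy transformer: a stateless map."""

    def transform(self, data):
        return [x + 1 for x in data]


class Centerer:
    """Toy estimator: learns the mean, returns a model that subtracts it."""

    def fit(self, data):
        return CentererModel(sum(data) / len(data))


class CentererModel:
    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class ToyPipeline:
    """An estimator made of stages; fit() returns a model made of transformers."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            fitted.append(model)
            data = model.transform(data)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


pipeline_model = ToyPipeline([AddOne(), Centerer()]).fit([1, 2, 3])
print(pipeline_model.transform([4, 5]))
```

Because `ToyPipeline` itself has a `fit` method, it can be used as a stage inside another `ToyPipeline`, which mirrors the composition property described above.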

In [28]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, lr])

In [29]:
model = pipeline.fit(train)

In [30]:
summary = model.stages[-1].summary
print(summary.r2)
print(summary.rootMeanSquaredError)

Note that this is a summary of the model as fitted to the training data, not of its performance on the test set. So it shows training error, or "apparent error."
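For reference, the two statistics printed above follow the usual definitions: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A small Spark-free sketch with made-up numbers (the helper names `rmse` and `r2` are hypothetical):

```python
from math import sqrt


def rmse(labels, predicted):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(labels)
    return sqrt(sum((y - p) ** 2 for y, p in zip(labels, predicted)) / n)


def r2(labels, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Made-up labels and predictions, only to exercise the formulas.
labels = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]
print(rmse(labels, predicted), r2(labels, predicted))
```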

In [32]:
display(summary.residuals.sample(False, 0.05)) # training residuals

residuals
854.8344395417018
953.8344395417018
953.8344395417018
959.8344395417018
959.8344395417018
965.8344395417018
986.8344395417018
1064.8344395417018
853.0076469121407
882.0076469121407


Now ... how did we do on test data?

In [34]:
predictions = model.transform(test)

In [35]:
display(predictions.sample(False, 0.02).selectExpr("prediction - price as error"))

error
-807.8344395417018
-986.8344395417018
-1169.8344395417018
-836.0076469121407
-865.0076469121407
-851.1808542825795
-790.5272690234569
-522.7004763938958
-704.7004763938958
-920.7004763938958


## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Evaluator
### Calculates statistics on our models indicating
* goodness-of-fit, explanation of variance
* error quantities, precision/recall/etc.

### Generates one stat at a time
* "mode-ful": the stat to compute is switched via a setter (e.g., `setMetricName`)

### Why? Designed for integration and for use by Spark itself, not just by us
* In particular, answers question "Which is better?"

### RegressionEvaluator, BinaryClassificationEvaluator, ...

In [37]:
from pyspark.ml.evaluation import *

# Named `evaluator` rather than `eval` to avoid shadowing Python's built-in eval()
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")

In [38]:
evaluator.evaluate(predictions)

In [39]:
for line in evaluator.explainParams().split("\n"):
    print(line)

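It looks like we did about as well (or badly) on the test data as on the training data: the evaluator's default metric is RMSE, so this number is directly comparable to the training `rootMeanSquaredError` printed earlier.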

Recap: what have we looked at, and what have we not looked at?

Looked at:
* Basic data preparation (types, DataFrame, vectors)
* Feature pre-processing helper example: VectorAssembler
* Role and type of a Transformer
* Model-building algorithm example: LinearRegression
* Role and type of an Estimator
* Pipeline
* Some basic graphs and statistics along the way

Have *not* looked at:
* Cleaning, deskewing, other data pre-processing
* Various data prep helpers
* Other algorithms
* Model tuning and cross-validation
* Combining Spark with other models and tools (sklearn, deep learning, etc.)
* Data-parallelism strategy

__But now we know enough to try a regression lab!__ (feel free to cut/paste from this notebook as needed)