# Fast ML Inference with Apache Spark

#### Apache Spark is great for training models across large datasets using a distributed compute cluster

In this webinar, we assume that you are familiar with the core Apache Spark Dataframe/Dataset and Machine Learning tools, and that you have encountered -- or want to avoid encountering! -- a common obstacle to deployment of Spark ML models.

*Terminology:* In machine learning, the term "inference" refers to the process of using a model to make predictions on new data. That is, the model is *already* trained to a satisfactory level, and a business would like to deploy that model to predict (aka "score") new data records. An example might be a credit-card fraud model which estimates the likelihood of a transaction being fraudulent. Once the model is ready to go, it needs to be deployed (typically as a service) where it can be used to test new incoming transactions.

__What is the big challenge to deploying Spark ML models?__

Spark is optimized to process large amounts of data using large amounts of compute hardware. When that large processing task is performing ML inference across a large set of data records (e.g., choosing the best marketing offer for each of 10 million mailing list recipients) and that inference can be done as a batch job, everything works great.

However, in some use cases (e.g., fraud detection, intrusion detection, online ad bidding) there are different performance requirements: we may need to make a prediction very fast (within a few milliseconds) and for only a small number of data records (maybe just one).

And, of course, many models may need to work both ways: large batch prediction as well as small, fast on-demand prediction.

__Because Apache Spark was not designed to handle this small-data, low-latency case, its normal prediction APIs are far too slow to be usable in these situations.__

Happily, there are numerous solutions to the problem, and we're here to discuss those so your project doesn't end up crashing and burning by using the wrong pattern.

### Building a Very Simple Model to Demonstrate

In order to have a model to work with, we'll quickly train a very simple linear regression against the Diamonds (from R/ggplot) dataset.

To keep it incredibly simple, we'll look at just one predictor: the carat weight.

Here's the code to train and eval the model:

In [4]:
input_file = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"

In [5]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

data = spark.read.option('header', True).csv(input_file) \
        .selectExpr('CAST(carat AS double) AS carat', 'CAST(price AS double) AS price')

train, test = data.randomSplit([0.75, 0.25])

p = Pipeline(stages=[VectorAssembler(inputCols=['carat'], outputCol='caratVec'), 
                     LinearRegression(featuresCol='caratVec', labelCol='price')])

model = p.fit(train)

And just as a smoke test, to make sure we're on the right track, we'll evaluate:

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator

RegressionEvaluator(labelCol='price').evaluate(model.transform(test))
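
For reference, `RegressionEvaluator` reports root-mean-squared error (RMSE) unless a different `metricName` is set. Here is a minimal plain-Python sketch of that metric, so the number above is easier to interpret. The prices below are made up for illustration and are not the Diamonds results.

```python
import math

def rmse(labels, predictions):
    """Root-mean-squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices and predictions, for illustration only
example_labels = [1000.0, 2000.0, 3000.0]
example_predictions = [1100.0, 1900.0, 3100.0]
print(rmse(example_labels, example_predictions))
```

An RMSE is in the same units as the label, so here it reads as "typical error in dollars".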

Ok, we have a model, even if it's not great. Now let's generate some "new data records" which we would like to score using this model:

In [9]:
sample = spark.range(3).selectExpr("id + 1 as carat")
sample.show()

And, as a baseline, we'll use the official, standard Spark ML Pipeline API to make our predictions:

In [11]:
model.transform(sample).show()

Notice that this prediction takes 200-500ms, far too long for the low-latency inference use cases we may need to target.
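
If you'd like to measure this yourself rather than read it off the notebook UI, a small timing helper is enough. The sketch below is one possible approach: `time_call_ms` is a hypothetical helper name, and the stand-in workload is there only so the code runs without Spark.

```python
import statistics
import time

def time_call_ms(fn, repeats=5):
    """Call fn() `repeats` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations

# Stand-in workload so the sketch runs anywhere; in the notebook one could pass
# lambda: model.transform(sample).collect() instead.
timings = time_call_ms(lambda: sum(range(1000)))
print("median ms: {:.3f}".format(statistics.median(timings)))
```

Looking at the median of several runs rather than a single run helps separate warm-up cost from steady-state latency.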

To make things worse, the demo environment I'm using today exhibits somewhat *better* performance on this task than a real (distributed) cluster. I.e., on a proper Spark cluster we would expect perf to get even worse.

__Ok, what is the problem?__

As we've discussed, Spark is designed for large-scale distributed processing, so it spends a lot of time planning for that, even when it's not necessary. In this case, we know that scoring a record takes a trivial single multiply and single add, and even in Python we can do much better than Spark.

Here are the parameters:

In [13]:
linearRegressionModel = model.stages[-1]
c0 = linearRegressionModel.coefficients[0]
i = linearRegressionModel.intercept
print(c0,i)

And here are the predictions in Python:

In [15]:
[c0*carat + i for carat in [1,2,3]]

So Spark is adding 10x-100x overhead to this inference case.
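
The same idea extends to more than one predictor: a linear model is just a dot product plus an intercept. Here is a minimal sketch of a Spark-free scorer. `LinearScorer` is a hypothetical class and the parameter values are invented; in the notebook they would come from `linearRegressionModel.coefficients` and `linearRegressionModel.intercept`.

```python
class LinearScorer:
    """Scores records as dot(coefficients, features) + intercept, with no Spark involved."""

    def __init__(self, coefficients, intercept):
        self.coefficients = [float(c) for c in coefficients]
        self.intercept = float(intercept)

    def predict(self, features):
        if len(features) != len(self.coefficients):
            raise ValueError("expected {} features".format(len(self.coefficients)))
        return sum(c * x for c, x in zip(self.coefficients, features)) + self.intercept

# Hypothetical parameters, not the values learned from the Diamonds data
scorer = LinearScorer(coefficients=[2.0], intercept=-1.0)
print([scorer.predict([carat]) for carat in [1, 2, 3]])
```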

### What Can We Do?

The first and simplest option is to use `Model.predict`.

This allows us to score a record on the driver, without using the Dataframe infrastructure and scheduler.

There are pros and cons to this approach, so let's try it and then review:

In [18]:
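from pyspark.serializers import PickleSerializer
from pyspark.ml.linalg import Vectors

def make_single_prediction(model, predictor):
  # Pickle a one-element dense vector so it can be handed to the JVM
  data = bytearray(PickleSerializer().dumps(Vectors.dense([predictor])))
  # Rebuild the vector on the JVM side using Spark's internal ML SerDe
  obj = sc._jvm.org.apache.spark.ml.python.MLSerDe.loads(data)
  # Call the Scala model's predict() directly, bypassing the DataFrame API
  return model._java_obj.predict(obj)

In [19]:
[make_single_prediction(linearRegressionModel, carat) for carat in [1,2,3]]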

Pros:
* It's fast -- close to the raw Python compute time, not the Spark-scheduled compute time
* It maintains the original model's compute semantics exactly (since it uses the original model)
* Doesn't require anything outside of Spark itself

Cons: 
* API is only public in the most recent version(s) of Spark (2.4 or so)
* API in 2.4 is only in Scala, so we need to go "under the hood" a little to use it in Python
* Most critically, it doesn't support `Pipeline` or any feature pre-processing, so it's up to us to perform any prep on the data record and deliver a `Vector` to the model
* Spark is an awfully large and complex (and expensive) piece of software to use to perform a multiply and an add
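
To illustrate the "it's up to us to perform any prep" point, here is a rough sketch of the manual step that replaces the `VectorAssembler`: picking the input columns out of a raw record, in a fixed order, as floats. `record_to_features` is a hypothetical helper and the record is made up. In the notebook, the resulting list could be wrapped with `Vectors.dense(...)` before calling `predict`.

```python
def record_to_features(record, input_cols):
    """Turn a dict-like record into an ordered list of floats, following input_cols order."""
    missing = [name for name in input_cols if name not in record]
    if missing:
        raise KeyError("record is missing columns: {}".format(missing))
    return [float(record[name]) for name in input_cols]

# Hypothetical incoming record; fields not listed in input_cols are ignored
incoming = {"carat": "1.5", "cut": "Ideal"}
features = record_to_features(incoming, input_cols=["carat"])
print(features)
```

Anything more elaborate in the original `Pipeline` (indexing, scaling, one-hot encoding) would need the same kind of hand-written equivalent, which is where this approach gets risky.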

### What Else Can We Try, Sticking Close to Spark?

Microsoft has released (as part of their Azure ML stack and MMLSpark) an adapter for Spark Structured Streaming that allows you to expose a streaming job as a REST service.

Combining that adapter with Spark's experimental (2.3+) Continuous / Low-Latency Streaming feature allows us to get a REST service for scoring records quickly.

*Note: this demo requires installing Microsoft's open-source MMLSpark as a package - see https://github.com/Azure/mmlspark*

In [22]:
%sh 

rm -rf /tmp/ck
mkdir /tmp/ck

In [23]:
import mmlspark
from pyspark.sql.functions import udf, col, length
from pyspark.sql.types import *

df = spark.readStream.continuousServer().address("localhost", 8888, "my_api").load() \
     .parseRequest(StructType().add("carat", DoubleType()))

replies = model.transform(df).makeReply("prediction")

server = replies\
    .writeStream.continuousServer().trigger(continuous="1 second") \
    .replyTo("my_api") \
    .queryName("my_query") \
    .option("checkpointLocation", "file:///tmp/ck") \
    .start()

In [24]:
import requests

for carat in [1,2,3]:
  data = u'{"carat":' + str(carat) + '}'
  r = requests.post(data=data, url="http://localhost:8888/my_api")
  print("Response {}".format(r.text))

In [25]:
server.stop()

This approach is just as fast as `model.predict` once it's warmed up... and it's much better because it allows us to reuse the feature engineering `Pipeline` and maintain its semantics. Let's look at the pros and cons here:

__Pros__
* Maintains full `Pipeline` API and semantics
* Uses native Spark APIs (ML Pipelines, Dataframe, Structured Streaming)
* Offers a fast REST service

__Cons__
* Involves deploying a large software platform to perform small, limited operations
* Requires an unusual architectural adaptation
* Relies on experimental continuous streaming mode
* Substantial complexity - my motto: never use streaming [for anything] if you don't have to, and if you do, then treat it as a first-class citizen
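
One small client-side note: the request loop above builds its JSON body by string concatenation, which is fine for a single numeric field but gets fragile as records grow. A minimal sketch of the same body built with the standard `json` module follows. `build_request_body` is a hypothetical helper; the `carat` field name matches the schema passed to `parseRequest`.

```python
import json

def build_request_body(carat):
    """Serialize one record as a JSON object with a single double-valued carat field."""
    return json.dumps({"carat": float(carat)})

bodies = [build_request_body(carat) for carat in [1, 2, 3]]
print(bodies)
```

Each string in `bodies` could then be passed as `data=` to `requests.post`, exactly as in the loop above.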

### How Can We Improve This Further?

Arguably, the model -- with its simple rules and minimal compute needs -- should be able to be extracted from Spark (or any training environment) and deployed elsewhere.

There are several ways to do this, and we'll start with the one that stays closest to Spark.

MLeap replicates the Spark ML components in a way that allows them to run locally (not distributed) against a data structure called a LeapFrame, which behaves much like a small, local DataFrame. MLeap models can be deployed in a small MLeap runtime that you can use as a black-box service, or (since it's open source) you can integrate it into any JVM-based service app.

*Note: this demo requires installing both the MLeap Scala libraries and the MLeap Python front-end; see http://mleap-docs.combust.ml/getting-started/spark.html and http://mleap-docs.combust.ml/getting-started/py-spark.html*

In [28]:
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

In [29]:
%sh rm -f /tmp/pyspark.example.zip

In [30]:
model.serializeToBundle("jar:file:/tmp/pyspark.example.zip", model.transform(sample))

In [31]:
%sh ls -la /tmp

Here we can see the exported artifact, called an __MLeap Bundle__, ready to use in the MLeap runtime or in a JVM app of your own creation.
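
Since the bundle was written through a `jar:file:` URI ending in `.zip`, it should be readable as an ordinary zip archive, so you can peek inside with the standard library. The sketch below is an assumption-laden illustration: `list_bundle_entries` is a hypothetical helper, and the demo builds a stand-in archive with placeholder entry names (not the real bundle layout) so it runs without MLeap installed.

```python
import os
import tempfile
import zipfile

def list_bundle_entries(bundle_path):
    """Return the sorted names of all entries stored in a zip archive."""
    with zipfile.ZipFile(bundle_path) as bundle:
        return sorted(bundle.namelist())

def demo():
    # Build a tiny stand-in archive; the entry names are placeholders only
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "stand_in_bundle.zip")
        with zipfile.ZipFile(path, "w") as archive:
            archive.writestr("placeholder.json", "{}")
            archive.writestr("root/placeholder.json", "{}")
        return list_bundle_entries(path)

print(demo())
```

In the notebook, calling `list_bundle_entries("/tmp/pyspark.example.zip")` would show what the export actually produced.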

Pros:
* Maintains Spark semantics
* Separates model from Spark platform
  * Allows fast scoring in lots of other places without requiring Spark
* Has a multiyear track record
* Supported by MLflow platform
* F/OSS
* Also supports "bundling" and exporting models trained in Scikit-Learn and TensorFlow

Cons:
* Only runs with the MLeap runtime
* Not an industry standard format
* Won't interoperate with other tools outside of the { Spark, TensorFlow, Scikit-Learn, MLeap } ecosystem
* Is not a huge OSS project; could be some risk in the future

If you want to see an example that doesn't just show the Spark export, but also the deployment (in MLeap Serving), take a look at my tutorial (https://www.youtube.com/watch?v=KOehXxEgXFM). The first part explains the issues we've discussed today and builds a model; the last 5-6 minutes show deployment and scoring.

### What About Industry-Standard Model Formats for Deployment?

There are three formats worth considering:
* PMML
* PFA
* ONNX

### PMML

<img src="https://materials.s3.amazonaws.com/i/PMML_Logo.png">

This is the grand-daddy standard format, dating to around 1996. It is widely supported by proprietary (non-OSS) machine learning tools, and has some spotty support in OSS.

__Example__

Here is an example of a logistic regression classifier trained using R on the Iris dataset:

(http://dmg.org/pmml/pmml_examples/rattle_pmml_examples/IrisMultinomReg.xml)

<img src="https://materials.s3.amazonaws.com/i/UFJlBqq.png" width=1000>

__The good:__
* XML based
* Widely used, well known
* Interoperable

__The bad:__
* Spark does not support exporting Pipelines as PMML
  * The tools which allow Spark Pipeline -> PMML export are almost entirely part of the Openscoring/JPMML ecosystem, which is very fine work but published under the AGPL (extremely non-permissive) license, and commercial licenses are supported by a very small company.
* There is no permissive, open-source, widely used high-performance serving library for PMML models ... again you'll run into JPMML almost everywhere ...
  * The permissive licensed version is very old; the modern version is AGPL

*PMML might be right for you, but that requires considerations beyond the purely technical*
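
To make the format a bit more concrete for our own use case, here is a rough sketch of what a single-predictor linear regression looks like in PMML-style XML, and how little code it takes to score it. The document is hand-written and heavily simplified (a real export would also carry a `Header`, `DataDictionary` and `MiningSchema`), the numbers are invented, and `score_regression` is a hypothetical helper, not part of any PMML library.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified PMML-style document with made-up parameters
PMML_TEXT = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="-2000.0">
      <NumericPredictor name="carat" coefficient="7500.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}

def score_regression(pmml_text, record):
    """Evaluate the first RegressionTable as intercept + sum(coefficient * value).

    Ignores the optional exponent attribute and categorical predictors.
    """
    root = ET.fromstring(pmml_text.strip())
    table = root.find(".//pmml:RegressionTable", NS)
    total = float(table.get("intercept"))
    for predictor in table.findall("pmml:NumericPredictor", NS):
        total += float(predictor.get("coefficient")) * float(record[predictor.get("name")])
    return total

print(score_regression(PMML_TEXT, {"carat": 2.0}))
```

A production scorer has to handle the full specification (transformations, missing values, many model types), which is why the serving-library question above matters so much.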

### PFA

<img src="https://materials.s3.amazonaws.com/i/PFA_Logo-200x200.png">

PFA (created around 2016) is intended to be a modern replacement for PMML, and offers a variety of advantages over PMML.

##### "As data analyses mature, they must be hardened — they must have fewer dependencies, a more maintainable structure, and they must be robust against errors." - DMG

<img src="https://materials.s3.amazonaws.com/i/KuQPUbx.png" width=800>

__Example__

Here are some data records:

<img src="https://materials.s3.amazonaws.com/i/vsvToXy.png" width=600>

And a PFA document which returns the square-root of the sum of the squares of a record's x, y, and z values:

<img src="https://materials.s3.amazonaws.com/i/tIlag9o.png" width=600>

__PFA Pros:__
* Well-known semantic and security guarantees
* Open-source, permissively licensed implementations for JVM and Python
* Supports extremely large variety of operations
* Provides interchangeable compact/binary and human-readable representations

__Cons:__
* Very little industry support so far
* IBM Open Source launched a project (Aardpfark) to support Spark ML Pipeline export to PFA
  * But it only has a single v0.1 release thus far (June 2018), and it's missing some critical pieces to make it useful

*PFA might be the future, but it definitely isn't the present*
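
For comparison with the PFA example above, here is the same computation (square root of the sum of squares of x, y, and z) written as plain Python. This is only an equivalent of what the document computes, not PFA itself, and the records are hypothetical rather than the ones in the figure.

```python
import math

def magnitude(record):
    """Square root of the sum of squares of a record's x, y and z fields."""
    return math.sqrt(record["x"] ** 2 + record["y"] ** 2 + record["z"] ** 2)

# Hypothetical records, for illustration only
records = [{"x": 3.0, "y": 4.0, "z": 0.0}, {"x": 1.0, "y": 2.0, "z": 2.0}]
print([magnitude(r) for r in records])
```

The appeal of PFA is that the same logic, expressed as a document rather than as code, can be checked and executed safely by any conforming engine.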

### ONNX (Open Neural Network eXchange)

Originally created by Facebook and Microsoft as an industry collaboration for import/export of neural networks, ONNX has grown to include support for "traditional" ML models and interop with many software libraries, and it has both software (CPU + GPU accelerated) and hardware (Intel, Qualcomm, etc.) runtimes.

https://onnx.ai/

* Created by Facebook and Microsoft; AWS now on board
* DAG-based model
* Built-in operators, data types
* Extensible -- e.g., ONNX-ML
* Goal is to allow tools to share a single model format

<img src="https://materials.s3.amazonaws.com/i/9byVguG.png" width=500>

Pros:
* Most major deep learning tools have ONNX support
* MIT license makes it both OSS and business friendly
* Seems to achieve its first-order goal of allowing tools interop for neural nets
* As of 2019, is the closest thing we have to an open, versatile, next-gen format *with wide support*
* Protobuf format is compact and typesafe
* Biggest weakness was "classical" ML and feature engineering support -- this is now being shored up
* Microsoft open-sourced (Dec 2018) a high-perf runtime (GPU, CPU, language bindings, etc.) https://azure.microsoft.com/en-us/blog/onnx-runtime-is-now-open-source/
  * Being used as part of Windows ML / Azure ML
  * https://github.com/Microsoft/onnxruntime
* In Q1-Q2 of 2019, Microsoft added a Spark ML Pipeline exporter to the `onnxmltools` project
  * https://github.com/onnx/onnxmltools
  
*Of the "standard/open" formats, ONNX clearly has the most momentum in the past year or two.*

Let's take a look at exporting SparkML to ONNX!

In [39]:
dbutils.library.installPyPI("onnxmltools")

In [40]:
from onnxmltools import convert_sparkml
from onnxmltools.convert.sparkml import buildInitialTypesSimple, buildInputDictSimple

In [41]:
model_onnx = convert_sparkml(model, 'Diamonds', buildInitialTypesSimple(test.select("carat")))

In [42]:
print(model_onnx)

In [43]:
with open("/tmp/diamonds.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

In [44]:
%sh cp /tmp/diamonds.onnx /dbfs/FileStore/diamonds.onnx

In [45]:
%sh ls -la /dbfs/FileStore/*.onnx

Ok so we have an ONNX model from our Spark Pipeline. To show that it actually works, and complete an end-to-end demo, in the next module we'll look at a simple Python app that uses Microsoft's high-perf `onnxruntime` to score some requests in a web service.

You don't need to download your ONNX model, but if you want to, you can <a href="/files/diamonds.onnx">click here</a> to download (e.g., if you'd like to build a scoring/inference server on your own machine.)
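
If you do try scoring the downloaded file locally, the core of it might look roughly like the sketch below. It rests on several assumptions: that `numpy` and `onnxruntime` are installed, that the exported model takes one float32 tensor of shape 1x1 per input column, and that the input is named after the column. `make_feed` and `score_with_onnxruntime` are hypothetical helper names, and only the pure-Python part is exercised here.

```python
def make_feed(record, input_names):
    """Shape one record as {input_name: [[value]]}, i.e. a 1x1 batch per input."""
    return {name: [[float(record[name])]] for name in input_names}

def score_with_onnxruntime(model_path, record):
    """Load the model and score one record (assumes numpy and onnxruntime are installed)."""
    # Imported lazily so the pure-Python helper above is usable without them
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path)
    input_names = [inp.name for inp in session.get_inputs()]
    feed = {name: np.array(rows, dtype=np.float32)
            for name, rows in make_feed(record, input_names).items()}
    # Passing None asks the runtime to return every model output
    return session.run(None, feed)

print(make_feed({"carat": 2}, ["carat"]))
```

In a real service you would create the `InferenceSession` once at startup and reuse it for every request, since loading the model is the expensive part.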