
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Batch Deployment

Batch inference is the most common way of deploying machine learning models.  This lesson introduces several strategies for batch deployment, including pure Python, Spark, and the JVM.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Explore batch deployment options
 - Predict on a Pandas DataFrame and save the results
 - Predict on a Spark DataFrame and save the results
 - Compare other batch deployment options

### Inference in Batch

Batch deployment represents the vast majority of use cases for deploying machine learning models.<br><br>

* This normally means running the predictions from a model and saving them somewhere for later use
* For live serving, results are often saved to a database that will serve the saved prediction quickly  
* In other cases, such as populating emails, they can be stored in less performant data stores such as a blob store
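
As a minimal sketch of the "save now, serve later" pattern described above, the following stores precomputed predictions in a keyed table and serves them by lookup.  The in-memory SQLite database, the table layout, and the sample rows are assumptions made for illustration; a production system would use its own serving database.

```python
import sqlite3

# Hypothetical precomputed predictions: (listing_id, zipcode, predicted_price).
rows = [
    (1, "94110", 185.0),
    (2, "94110", 240.5),
    (3, "94117", 310.0),
]

# In-memory stand-in for a real serving database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    "listing_id INTEGER PRIMARY KEY, zipcode TEXT, prediction REAL)"
)
conn.executemany("INSERT INTO predictions VALUES (?, ?, ?)", rows)
conn.commit()


def lookup_prediction(connection, listing_id):
    """Return the saved prediction for one listing, or None if it is missing."""
    row = connection.execute(
        "SELECT prediction FROM predictions WHERE listing_id = ?", (listing_id,)
    ).fetchone()
    return row[0] if row else None


served = lookup_prediction(conn, 2)
```

Serving is then a key lookup rather than a model call, which is what makes the saved prediction fast to return.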

Writing the results of your inference can be optimized in a number of ways...<br><br>  

* For large volumes of data, writes should be performed in parallel
* **The access pattern for the saved predictions should also guide how the data is written** 
  - For static files or data warehouses, partitioning speeds up data reads  
  - For databases, indexing the database on the relevant query generally improves performance 
  - In either case, the partition or index works like the index in a book: it lets you skip ahead to the relevant content
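
The two ideas above, parallel writes and partitioning by access pattern, can be sketched with the standard library alone.  The sample rows, the `zipcode=<value>` directory naming, and the helper names are assumptions for illustration, and a temporary directory stands in for a blob store.

```python
import csv
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical predictions; in practice these would come from a model.
predictions = [
    {"listing_id": 1, "zipcode": "94110", "prediction": 185.0},
    {"listing_id": 2, "zipcode": "94110", "prediction": 240.5},
    {"listing_id": 3, "zipcode": "94117", "prediction": 310.0},
]
FIELDS = ["listing_id", "zipcode", "prediction"]


def write_partition(base_dir, zipcode, rows):
    """Write one partition to <base_dir>/zipcode=<value>/predictions.csv."""
    part_dir = os.path.join(base_dir, "zipcode={}".format(zipcode))
    os.makedirs(part_dir, exist_ok=True)
    file_path = os.path.join(part_dir, "predictions.csv")
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return file_path


def write_partitioned(base_dir, rows, max_workers=4):
    """Group rows by zip code and write each group on a worker thread."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["zipcode"]].append(row)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_partition, base_dir, zipcode, group)
            for zipcode, group in groups.items()
        ]
        return sorted(future.result() for future in futures)


def read_partition(base_dir, zipcode):
    """Read only the file for one zip code, skipping every other partition."""
    file_path = os.path.join(base_dir, "zipcode={}".format(zipcode), "predictions.csv")
    with open(file_path, newline="") as f:
        return list(csv.DictReader(f))


output_dir = tempfile.mkdtemp()  # temporary stand-in for a blob store path
written_files = write_partitioned(output_dir, predictions)
rows_94110 = read_partition(output_dir, "94110")
```

A reader that only needs one zip code opens a single file instead of scanning every prediction.  Note that `csv.DictReader` returns every value as a string.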

There are a few other considerations to ensure the accuracy of your model...<br><br>  

* First is to make sure that your model matches expectations
  - We'll cover this in further detail in the model drift section 
* Second is to **retrain your model on the entirety of your dataset**  
  - A train/test split is a good method for tuning hyperparameters and estimating how the model will perform on unseen data
  - Retraining the model on the entirety of your data ensures that as much information as possible is factored into the model
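
The "estimate on a split, then retrain on everything" workflow above can be sketched as follows.  To stay self-contained this uses a hand-written one-feature least-squares fit rather than `sklearn`, and the bedroom and price values are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: number of bedrooms vs. nightly price.
bedrooms = [1, 2, 3, 4, 1, 2, 3, 4]
price = [100.0, 150.0, 210.0, 260.0, 110.0, 160.0, 190.0, 240.0]

# Step 1: hold out the last two rows to estimate performance on unseen data.
train_x, test_x = bedrooms[:6], bedrooms[6:]
train_y, test_y = price[:6], price[6:]
candidate = fit_line(train_x, train_y)
holdout_mse = mse(candidate, test_x, test_y)

# Step 2: once the approach is settled, refit on every row for deployment.
final_model = fit_line(bedrooms, price)
```

The hold-out error describes the candidate model, so it is only an estimate for the final model, which has seen more data.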

Run the following cell to set up our environment.

In [6]:
%run "./Includes/Classroom-Setup"

### Inference in Pure Python

Inference in Python leverages the predict functionality of the machine learning package or MLflow wrapper.
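
As a rough sketch of what this looks like, any object exposing a `predict` method can be scored in fixed-size chunks.  `MeanPriceModel` and `predict_in_batches` are hypothetical names introduced here for illustration; a trained `sklearn` estimator or MLflow wrapper would take the model's place.

```python
class MeanPriceModel:
    """Hypothetical stand-in for a trained model exposing predict()."""

    def __init__(self, price_per_bedroom):
        self.price_per_bedroom = price_per_bedroom

    def predict(self, rows):
        # One prediction per input row, in the same order.
        return [self.price_per_bedroom * row["bedrooms"] for row in rows]


def predict_in_batches(model, rows, batch_size):
    """Score rows in fixed-size chunks and return one prediction per row, in order."""
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        results.extend(model.predict(batch))
    return results


listings = [{"bedrooms": b} for b in [1, 2, 3, 4, 5]]
model = MeanPriceModel(price_per_bedroom=50.0)
scored = predict_in_batches(model, listings, batch_size=2)
```

Chunking keeps memory use bounded when the input is too large to score in one call.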

Import the data.  **Do not perform a train/test split.**

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> It is common to skip the train/test split in training a final model.

In [9]:
import pandas as pd

df = pd.read_csv("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv")

X = df.drop(["price"], axis=1)
y = df["price"]

Train a final model.

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=100, max_depth=5)
rf.fit(X, y)

predictions = X.copy()
predictions["prediction"] = rf.predict(X)

mse = mean_squared_error(y, predictions["prediction"]) # This is computed on the same data the model was trained on

Save the results and partition by zipcode.  Note that zip code was indexed.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Should zip code be modeled as a continuous or categorical feature?

In [13]:
import os

path = userhome + "/ml-production/mlflow-model-training/batch-predictions/"

dbutils.fs.rm(path.replace("/dbfs", "dbfs:"), True)

# Write one directory per zip code so that readers can load a single partition
for zipcode, partition in predictions.groupby("zipcode"):
  dirpath = path + str(zipcode)
  print("Writing to {}".format(dirpath))
  os.makedirs(dirpath, exist_ok=True)
  
  partition.to_csv(dirpath + "/predictions.csv")

Log the model and predictions.

In [15]:
import mlflow.sklearn

# mlflow.set_experiment(experiment_name=experimentPath)

with mlflow.start_run(run_name="Final RF Model") as run: 
  mlflow.sklearn.log_model(rf, "random-forest-model")
  mlflow.log_metric("mse on training data", mse)
  
  mlflow.log_artifacts(path, "predictions")
  
  run_info = run.info

### Inference in Spark

Models trained in various machine learning libraries can be applied at scale using Spark.  To do this, use `mlflow.pyfunc.spark_udf` and pass in the `SparkSession`, the name of the model, and the run ID.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Using UDFs in Spark means that supporting libraries must be installed on every node in the cluster.  `sklearn` is installed on Databricks clusters by default.  When using other libraries, install them through the UI to ensure that they will work in a UDF.

Create a Spark DataFrame from the Pandas DataFrame.

In [18]:
XDF = spark.createDataFrame(X)

display(XDF)

MLflow easily produces a Spark user defined function (UDF).  This bridges the gap between Python environments and applying models at scale using Spark.

In [20]:
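# MLflow 1.0+ takes a model URI; older releases took the artifact path plus a run_id argument
model_uri = "runs:/{}/random-forest-model".format(run_info.run_id)
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri)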

Apply the model as a standard UDF using the column names as the input to the function.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Before Python 3.7, a function call could pass at most 255 arguments, so a model applied in this way was limited to 255 features.  Python 3.7 removed this limit.

In [22]:
predictionDF = XDF.withColumn("prediction", pyfunc_udf(*X.columns))

display(predictionDF)

### Other Deployment Options

There are a number of other common batch deployment options.  One common use case is going from a Python environment for training to a Java environment.  Here are a few tools:<br><br>

 - **An Easy Port to Java:** In certain models, such as linear regression, the coefficients of a trained model can be taken and implemented by hand in Java.  This can work with tree-based models as well.
 - **Re-serializing for Java:** Since Python uses Pickle by default to serialize, a library like <a href="https://github.com/jpmml/jpmml-sklearn" target="_blank">jpmml-sklearn</a> can de-serialize `sklearn` models and re-serialize them for use in Java environments.
 - **Leveraging Library Functionality:** Some libraries include the ability to deploy to Java, such as <a href="https://github.com/dmlc/xgboost/tree/master/jvm-packages" target="_blank">xgboost4j</a>.
 - **Containers:** Containerized solutions are becoming increasingly popular since they provide the encapsulation and reliability of jars while offering more deployment options than just the Java environment.
 
Finally, <a href="http://mleap-docs.combust.ml/" target="_blank">MLeap</a> is a common, open source serialization format and execution engine for Spark, `sklearn`, and `TensorFlow`.
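
The "easy port" option above can be sketched from the Python side: export the parameters of a linear model in a language-neutral format, then score by hand with the same arithmetic a Java port would use.  The JSON layout, the coefficient values, and the `predict_from_export` helper are assumptions for illustration, not the output of any particular library.

```python
import json

# Hypothetical parameters, as they might be read from a trained linear model.
exported = json.dumps({
    "intercept": 57.5,
    "coefficients": {"bedrooms": 48.0, "bathrooms": 20.0},
})


def predict_from_export(exported_json, features):
    """Compute intercept + sum(coefficient * feature) from the exported JSON."""
    params = json.loads(exported_json)
    total = params["intercept"]
    for name, weight in params["coefficients"].items():
        total += weight * features[name]
    return total


price = predict_from_export(exported, {"bedrooms": 2, "bathrooms": 1})
```

Because the scoring logic is only a weighted sum, re-implementing it in Java requires parsing the JSON and little else.  Comparing the ported predictions against the original model on a sample of rows is a sensible check.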

## Review
**Question:** What are the main considerations in batch deployments?  
**Answer:** The following considerations help determine the best way to deploy batch inference results:
* How the data will be queried
* How the data will be written 
* The training and deployment environment
* What data the final model is trained on

**Question:** How can you optimize inference reads and writes?  
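**Answer:** Writes can be optimized by managing parallelism.  In Spark, this would mean managing the partitions of a DataFrame such that work is evenly distributed and you have the most efficient connections back to the target database.  Reads can be optimized by matching the storage layout to the access pattern: partition static files or data warehouse tables, and index databases on the relevant query.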

**Question:** How can I deploy models trained in Python in a Java environment?  
**Answer:** There are a number of ways to do this.  It's not unreasonable to just export model coefficients or trees in a random forest and parse them in Java.  This works well as a minimum viable product.  You can also look at different libraries that can serialize models in a way that the JVM can make use of them.  `jpmml-sklearn` and `xgboost4j` are two examples of this.  Finally, you can re-implement Python libraries in Java if needed.

## Next Steps

Start the next lesson, [Streaming Deployment]($./07-Streaming-Deployment ).

## Additional Topics & Resources

**Q:** Where can I find more information on UDFs created by MLflow?  
**A:** See the <a href="https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html" target="_blank">MLflow documentation for details</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>