
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Inference with Pandas UDFs

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
- Build a scikit-learn model, track it with MLflow, and apply it at scale using **Pandas Scalar Iterator UDFs** and **`mapInPandas()`**

To learn more about Pandas UDFs and what changed in Spark 3.0, see this <a href="https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html" target="_blank">blog post</a>.

In [0]:
%pip install mlflow

In [0]:
%run ./Includes/Classroom-Setup

#### Train sklearn model and log it with MLflow

In [0]:
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

with mlflow.start_run(run_name="sklearn-random-forest") as run:
    # Enable autologging so parameters, metrics, and the fitted model are logged to this run
    mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True, log_models=True)
    # Load the data with pandas (dbfs:/ paths are read through the /dbfs/ mount) and drop the zipcode column
    df = pd.read_csv(f"{datasets_dir}/airbnb/sf-listings/airbnb-cleaned-mlflow.csv".replace("dbfs:/", "/dbfs/")).drop(["zipcode"], axis=1)
    # Hold out a test set; "price" is the label
    X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)

    # Create and fit the model
    rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)

#### Create Spark DataFrame

In [0]:
spark_df = spark.createDataFrame(X_test)

### Pandas/Vectorized UDFs

Spark 2.3 introduced **Pandas UDFs** (also called vectorized UDFs) for Python to improve the efficiency of UDFs. Pandas UDFs use Apache Arrow to transfer data and pandas to process it in batches rather than one row at a time. Let's see how that helps improve our processing time.

* <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank">Blog post</a>
* <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#pyspark-usage-guide-for-pandas-with-apache-arrow" target="_blank">Documentation</a>

<img src="https://databricks.com/wp-content/uploads/2017/10/image1-4.png" alt="Benchmark" width ="500" height="1500">

Pandas UDFs rely on two components:
* <a href="https://arrow.apache.org/" target="_blank">Apache Arrow</a>, an in-memory columnar data format that Spark uses to efficiently transfer data between JVM and Python processes with near-zero (de)serialization cost. See more <a href="https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html" target="_blank">here</a>.
* pandas, used inside the function to work with pandas instances and APIs.

**NOTE**: As of Spark 3.0, you should define your Pandas UDF using Python type hints.
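
To make the batching idea concrete, the sketch below mimics, outside of Spark, what happens to a single column: the rows are cut into batches, and a type-hinted Series-to-Series function is called once per batch. The names `add_cleaning_fee` and `run_in_batches`, the fee value, and the `max_records_per_batch` argument are all hypothetical stand-ins; in a real job Spark performs the slicing and the Arrow transfer itself.

```python
import pandas as pd

def add_cleaning_fee(price: pd.Series) -> pd.Series:
    # Vectorized: operates on the whole batch at once, not row by row.
    return price + 25.0

def run_in_batches(func, column: pd.Series, max_records_per_batch: int) -> pd.Series:
    # Hypothetical stand-in for Spark: slice the column into batches,
    # call the function once per batch, and stitch the results together.
    outputs = []
    for start in range(0, len(column), max_records_per_batch):
        batch = column.iloc[start:start + max_records_per_batch]
        outputs.append(func(batch))
    return pd.concat(outputs)

prices = pd.Series([100.0, 80.0, 250.0, 60.0, 120.0])
result = run_in_batches(add_cleaning_fee, prices, max_records_per_batch=2)
print(result.tolist())  # [125.0, 105.0, 275.0, 85.0, 145.0]
```

Because the function body only touches pandas objects, it can be checked like this on a laptop before it is wrapped with `@pandas_udf`.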

In [0]:
from pyspark.sql.functions import pandas_udf

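@pandas_udf("double")
def predict(*args: pd.Series) -> pd.Series:
    # Each argument is one input column, delivered as a pandas Series for the current batch
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path) # Load model (repeated for every batch)
    pdf = pd.concat(args, axis=1) # Reassemble the columns into a single DataFrame
    return pd.Series(model.predict(pdf))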

prediction_df = spark_df.withColumn("prediction", predict(*spark_df.columns))
display(prediction_df)

host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,prediction
1.0,29,37.750853665952526,-122.47896134638864,0,0,4.0,1.0,0.0,4.0,0,2.0,194.0,96.0,10.0,10.0,10.0,10.0,9.0,9.0,138.7427902906762
2.0,12,37.79569442370353,-122.417081972524,1,1,2.0,1.5,1.0,1.0,0,2.0,124.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,131.24509694021717
2.0,7,37.76393574011793,-122.43001124805248,0,1,2.0,1.0,1.0,1.0,0,5.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.26194822031346
1.0,7,37.76690648031917,-122.43792377044348,1,0,7.0,2.0,3.0,3.0,0,3.0,3.0,93.0,10.0,9.0,10.0,10.0,10.0,10.0,413.6222277778268
1.0,2,37.77491545710221,-122.44027012206556,6,1,1.0,1.0,1.0,1.0,0,1.0,21.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.3876183458104
39.0,19,37.729883744746296,-122.42672685799468,1,1,2.0,1.0,1.0,1.0,0,30.0,15.0,89.0,8.0,8.0,9.0,9.0,8.0,9.0,48.90435216613397
4.0,30,37.714110738500814,-122.4072828875996,1,1,2.0,1.0,1.0,1.0,0,1.0,20.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,80.48652556180105
54.0,6,37.78663277349695,-122.4085188120046,17,1,2.0,1.0,0.0,1.0,0,1.0,0.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,78.87385270408892
1.0,15,37.78294900804669,-122.38856041539098,0,0,10.0,2.0,2.0,8.0,0,1.0,6.0,93.0,9.0,9.0,8.0,9.0,10.0,8.0,359.32030526097446
2.0,7,37.76852191665309,-122.4278718063526,0,1,2.0,1.0,1.0,1.0,0,2.0,127.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,124.4406137083362


### Pandas Scalar Iterator UDF

If your model is very large, a Pandas scalar UDF incurs high overhead because it reloads the same model for every batch in the same Python worker process. As of Spark 3.0, a Pandas UDF can accept an iterator of pandas.Series or pandas.DataFrame, so you can load the model once and reuse it for every batch in the iterator.

This way, the cost of any set-up is incurred fewer times. Spark splits the data into batches of at most **`spark.conf.get('spark.sql.execution.arrow.maxRecordsPerBatch')`** records, which is 10,000 by default. When the number of records you’re working with is greater than that, the data spans several batches and you should see a speed-up over a Pandas scalar UDF, which would load the model once per batch.

It has the general syntax of: 

**`@pandas_udf(...)
def predict(iterator):
    model = ... # load model
    for features in iterator:
        yield model.predict(features)`**
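
The difference in set-up cost can be counted without a cluster. The following is a minimal sketch in plain pandas, assuming a hypothetical `StubModel` and `load_model` in place of the MLflow model and a hand-built list of batches in place of what Spark would deliver; it records how often the model is loaded under each style.

```python
from typing import Iterator
import pandas as pd

load_log = []  # one entry is appended each time the (stub) model is loaded

class StubModel:
    # Hypothetical stand-in for a model that is expensive to load.
    def predict(self, features: pd.DataFrame):
        return features.sum(axis=1).to_numpy()

def load_model() -> StubModel:
    load_log.append("loaded")
    return StubModel()

def predict_per_batch(features: pd.DataFrame) -> pd.Series:
    model = load_model()  # scalar style: runs for every batch
    return pd.Series(model.predict(features))

def predict_iter(batches: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
    model = load_model()  # iterator style: runs once for the whole iterator
    for features in batches:
        yield pd.Series(model.predict(features))

batches = [pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}) for _ in range(3)]

load_log.clear()
scalar_out = [predict_per_batch(b) for b in batches]
scalar_loads = len(load_log)

load_log.clear()
iter_out = list(predict_iter(iter(batches)))
iter_loads = len(load_log)

print(scalar_loads, iter_loads)  # 3 1
```

Both styles return the same predictions; only the number of loads differs, and that gap grows with the number of batches a worker processes.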

In [0]:
from typing import Iterator, Tuple

@pandas_udf("double")
def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path) # Load model once, before iterating over the batches
    for features in iterator:
        # With several input columns, each element is a tuple of pandas Series (one per column)
        pdf = pd.concat(features, axis=1)
        yield pd.Series(model.predict(pdf))

prediction_df = spark_df.withColumn("prediction", predict(*spark_df.columns))
display(prediction_df)

host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,prediction
1.0,29,37.750853665952526,-122.47896134638864,0,0,4.0,1.0,0.0,4.0,0,2.0,194.0,96.0,10.0,10.0,10.0,10.0,9.0,9.0,138.7427902906762
2.0,12,37.79569442370353,-122.417081972524,1,1,2.0,1.5,1.0,1.0,0,2.0,124.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,131.24509694021717
2.0,7,37.76393574011793,-122.43001124805248,0,1,2.0,1.0,1.0,1.0,0,5.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.26194822031346
1.0,7,37.76690648031917,-122.43792377044348,1,0,7.0,2.0,3.0,3.0,0,3.0,3.0,93.0,10.0,9.0,10.0,10.0,10.0,10.0,413.6222277778268
1.0,2,37.77491545710221,-122.44027012206556,6,1,1.0,1.0,1.0,1.0,0,1.0,21.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.3876183458104
39.0,19,37.729883744746296,-122.42672685799468,1,1,2.0,1.0,1.0,1.0,0,30.0,15.0,89.0,8.0,8.0,9.0,9.0,8.0,9.0,48.90435216613397
4.0,30,37.714110738500814,-122.4072828875996,1,1,2.0,1.0,1.0,1.0,0,1.0,20.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,80.48652556180105
54.0,6,37.78663277349695,-122.4085188120046,17,1,2.0,1.0,0.0,1.0,0,1.0,0.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,78.87385270408892
1.0,15,37.78294900804669,-122.38856041539098,0,0,10.0,2.0,2.0,8.0,0,1.0,6.0,93.0,9.0,9.0,8.0,9.0,10.0,8.0,359.32030526097446
2.0,7,37.76852191665309,-122.4278718063526,0,1,2.0,1.0,1.0,1.0,0,2.0,127.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,124.4406137083362


### Pandas Function API

Instead of using a Pandas UDF, we can use a Pandas Function API. This category, new in Apache Spark 3.0, lets you directly apply a native Python function that takes and returns pandas instances to a PySpark DataFrame. The Pandas Function APIs supported in Apache Spark 3.0 are grouped map, map, and co-grouped map.

**`mapInPandas()`** takes an iterator of pandas.DataFrame as input and outputs another iterator of pandas.DataFrame. It's flexible and easy to use if your model requires all of your columns as input, but the whole DataFrame must be serialized and deserialized, because every column is passed to the function. You can control the size of each pandas.DataFrame with the **`spark.sql.execution.arrow.maxRecordsPerBatch`** config.

In [0]:
def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path) # Load model once for all batches
    for features in iterator:
        # Return the input columns plus a "prediction" column, matching the schema passed to mapInPandas
        yield pd.concat([features, pd.Series(model.predict(features), name="prediction")], axis=1)

display(spark_df.mapInPandas(predict, """`host_total_listings_count` DOUBLE,`neighbourhood_cleansed` BIGINT,`latitude` DOUBLE,`longitude` DOUBLE,`property_type` BIGINT,`room_type` BIGINT,`accommodates` DOUBLE,`bathrooms` DOUBLE,`bedrooms` DOUBLE,`beds` DOUBLE,`bed_type` BIGINT,`minimum_nights` DOUBLE,`number_of_reviews` DOUBLE,`review_scores_rating` DOUBLE,`review_scores_accuracy` DOUBLE,`review_scores_cleanliness` DOUBLE,`review_scores_checkin` DOUBLE,`review_scores_communication` DOUBLE,`review_scores_location` DOUBLE,`review_scores_value` DOUBLE, `prediction` DOUBLE""")) 

host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,prediction
1.0,29,37.750853665952526,-122.47896134638864,0,0,4.0,1.0,0.0,4.0,0,2.0,194.0,96.0,10.0,10.0,10.0,10.0,9.0,9.0,138.7427902906762
2.0,12,37.79569442370353,-122.417081972524,1,1,2.0,1.5,1.0,1.0,0,2.0,124.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,131.24509694021717
2.0,7,37.76393574011793,-122.43001124805248,0,1,2.0,1.0,1.0,1.0,0,5.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.26194822031346
1.0,7,37.76690648031917,-122.43792377044348,1,0,7.0,2.0,3.0,3.0,0,3.0,3.0,93.0,10.0,9.0,10.0,10.0,10.0,10.0,413.6222277778268
1.0,2,37.77491545710221,-122.44027012206556,6,1,1.0,1.0,1.0,1.0,0,1.0,21.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.3876183458104
39.0,19,37.729883744746296,-122.42672685799468,1,1,2.0,1.0,1.0,1.0,0,30.0,15.0,89.0,8.0,8.0,9.0,9.0,8.0,9.0,48.90435216613397
4.0,30,37.714110738500814,-122.4072828875996,1,1,2.0,1.0,1.0,1.0,0,1.0,20.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,80.48652556180105
54.0,6,37.78663277349695,-122.4085188120046,17,1,2.0,1.0,0.0,1.0,0,1.0,0.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,78.87385270408892
1.0,15,37.78294900804669,-122.38856041539098,0,0,10.0,2.0,2.0,8.0,0,1.0,6.0,93.0,9.0,9.0,8.0,9.0,10.0,8.0,359.32030526097446
2.0,7,37.76852191665309,-122.4278718063526,0,1,2.0,1.0,1.0,1.0,0,2.0,127.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,124.4406137083362
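
The iterator-in, iterator-out contract can also be exercised locally before running on a cluster. This is a rough sketch assuming a hypothetical `StubModel` (a made-up rule of 50 per guest) in place of the MLflow model and two hand-built batches in place of the ones Spark would supply; `predict_batches` is likewise an illustrative name.

```python
from typing import Iterator
import pandas as pd

class StubModel:
    # Hypothetical model: predicts 50 per person accommodated.
    def predict(self, features: pd.DataFrame):
        return (features["accommodates"] * 50.0).to_numpy()

def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = StubModel()  # in the lesson this would be mlflow.sklearn.load_model(...)
    for features in batches:
        # assign() attaches the NumPy predictions by position, so the result
        # does not depend on how the incoming batch happens to be indexed.
        yield features.assign(prediction=model.predict(features))

batches = [
    pd.DataFrame({"accommodates": [4.0, 2.0], "bedrooms": [0.0, 1.0]}),
    pd.DataFrame({"accommodates": [7.0], "bedrooms": [3.0]}, index=[10]),
]
scored = pd.concat(predict_batches(iter(batches)))
print(scored)
```

The second batch deliberately carries a non-default index. `pd.concat(..., axis=1)` aligns rows on the index, so it relies on each batch and the new prediction Series sharing the same 0-based index; `assign()` with a NumPy array is one way to sidestep that dependency.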


#### Alternatively, derive the schema from the existing DataFrame

In [0]:
from pyspark.sql.functions import lit
from pyspark.sql.types import DoubleType

schema = spark_df.withColumn("prediction", lit(None).cast(DoubleType())).schema
display(spark_df.mapInPandas(predict, schema)) 

host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,prediction
1.0,29,37.750853665952526,-122.47896134638864,0,0,4.0,1.0,0.0,4.0,0,2.0,194.0,96.0,10.0,10.0,10.0,10.0,9.0,9.0,138.7427902906762
2.0,12,37.79569442370353,-122.417081972524,1,1,2.0,1.5,1.0,1.0,0,2.0,124.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,131.24509694021717
2.0,7,37.76393574011793,-122.43001124805248,0,1,2.0,1.0,1.0,1.0,0,5.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.26194822031346
1.0,7,37.76690648031917,-122.43792377044348,1,0,7.0,2.0,3.0,3.0,0,3.0,3.0,93.0,10.0,9.0,10.0,10.0,10.0,10.0,413.6222277778268
1.0,2,37.77491545710221,-122.44027012206556,6,1,1.0,1.0,1.0,1.0,0,1.0,21.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,129.3876183458104
39.0,19,37.729883744746296,-122.42672685799468,1,1,2.0,1.0,1.0,1.0,0,30.0,15.0,89.0,8.0,8.0,9.0,9.0,8.0,9.0,48.90435216613397
4.0,30,37.714110738500814,-122.4072828875996,1,1,2.0,1.0,1.0,1.0,0,1.0,20.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,80.48652556180105
54.0,6,37.78663277349695,-122.4085188120046,17,1,2.0,1.0,0.0,1.0,0,1.0,0.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,78.87385270408892
1.0,15,37.78294900804669,-122.38856041539098,0,0,10.0,2.0,2.0,8.0,0,1.0,6.0,93.0,9.0,9.0,8.0,9.0,10.0,8.0,359.32030526097446
2.0,7,37.76852191665309,-122.4278718063526,0,1,2.0,1.0,1.0,1.0,0,2.0,127.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,124.4406137083362
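
If you prefer the string form of the schema but do not want to type it by hand, it can be generated from the pandas dtypes. The helper below is a hypothetical sketch: `ddl_schema` and the `PANDAS_TO_SPARK_SQL` mapping are not part of any library, and the mapping is an assumption that covers only the two dtypes appearing in this lesson's data (`float64` as `DOUBLE`, `int64` as `BIGINT`).

```python
import pandas as pd

# Assumed mapping, limited to the dtypes that occur in this lesson's data.
PANDAS_TO_SPARK_SQL = {"float64": "DOUBLE", "int64": "BIGINT"}

def ddl_schema(pdf: pd.DataFrame, extra=None) -> str:
    # Build one "`name` TYPE" entry per column, then append any extra fields.
    fields = [(name, PANDAS_TO_SPARK_SQL[str(dtype)]) for name, dtype in pdf.dtypes.items()]
    fields += list(extra or [])
    return ", ".join(f"`{name}` {sql_type}" for name, sql_type in fields)

sample = pd.DataFrame({"latitude": [37.75], "room_type": [0], "beds": [4.0]})
schema_string = ddl_schema(sample, extra=[("prediction", "DOUBLE")])
print(schema_string)
# `latitude` DOUBLE, `room_type` BIGINT, `beds` DOUBLE, `prediction` DOUBLE
```

A column with any other dtype raises a `KeyError` here, which is intentional: it forces you to decide on the Spark type explicitly rather than guess.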


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>