
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Pandas UDF Lab

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:
- **Perform model inference at scale using a Pandas UDF created from MLflow**

In [0]:
%run "../Includes/Classroom-Setup"

#### In the cell below, we train the same model on the same dataset as in the lesson and use <a href="https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html" target="_blank">autolog</a> to record metrics, parameters, and the model in MLflow.

In [0]:
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

with mlflow.start_run(run_name="sklearn-random-forest") as run:
    # Enable autologging 
    mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True, log_models=True)
    
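    # Load the data; pandas reads from the local /dbfs/ mount rather than the dbfs:/ URI
    csv_path = f"{datasets_dir}/airbnb/sf-listings/airbnb-cleaned-mlflow.csv".replace("dbfs:/", "/dbfs/")
    df = pd.read_csv(csv_path)

    # Split features (every column except price) and label (price) into train and test sets
    X = df.drop(["price"], axis=1)
    y = df["price"].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)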

    # Create model, train it, and create predictions
    rf = RandomForestRegressor(n_estimators=100, max_depth=10)
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)

#### Let's convert our Pandas DataFrame to a Spark DataFrame for distributed inference.

In [0]:
spark_df = spark.createDataFrame(df)

### MLflow UDF

Here, instead of using **`mlflow.sklearn.load_model(model_path)`**, we would like to use **`mlflow.pyfunc.spark_udf()`**.

This approach can reduce compute and memory overhead because the model is loaded into memory only once per Python process. When we generate predictions for a DataFrame, each Python process reuses its copy of the model instead of loading the same model more than once. In some cases this can be more performant than using a Pandas Iterator UDF.
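
To make the "load once per process" idea concrete, here is a minimal plain-Python sketch. It is an illustration only, not how MLflow implements **`spark_udf`** internally: **`MeanModel`**, **`load_model`**, and the batch data are hypothetical stand-ins, and plain lists take the place of Pandas batches so the cell runs without Spark.

```python
load_count = 0  # counts how often the hypothetical model is loaded


class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def load_model():
    """Hypothetical loader; a real one would deserialize the model from storage."""
    global load_count
    load_count += 1
    return MeanModel()


def predict_per_batch(batch):
    # Naive pattern: the model is reloaded for every batch
    model = load_model()
    return model.predict(batch)


def predict_with_reuse(batches):
    # Load-once pattern: one load, then the same model serves every batch
    model = load_model()
    for batch in batches:
        yield model.predict(batch)


# Three small batches of feature rows (made-up values)
batches = [[[1.0, 3.0], [2.0, 4.0]], [[10.0, 20.0]], [[0.0, 0.0]]]

naive = [predict_per_batch(batch) for batch in batches]
naive_loads = load_count

load_count = 0
reused = list(predict_with_reuse(iter(batches)))
reused_loads = load_count

print(f"naive loads: {naive_loads}, reused loads: {reused_loads}")
```

Both patterns return the same predictions; the difference is only in how many times the model is loaded, which matters when loading is expensive relative to scoring a batch.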

In the cell below, fill in the **`model_path`** variable and the **`mlflow.pyfunc.spark_udf`** function. You can refer to this <a href="https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf" target="_blank">documentation</a> for help.

In [0]:
# TODO

model_path = f"runs:/{run.info.run_id}/model"
predict = mlflow.pyfunc.spark_udf(spark, model_path)

After loading the model using **`mlflow.pyfunc.spark_udf`**, we can now perform model inference at scale.

In the cell below, fill in the blank to use the **`predict`** function you have defined above to predict the price based on the features.

In [0]:
# TODO

features = X_train.columns
display(spark_df.withColumn("prediction", predict(*features)))

host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,prediction
1.0,0,0,37.769310377340766,-122.43385634489,0,0,3.0,1.0,1.0,2.0,0,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,170.0,182.86826362169955
2.0,1,1,37.745112331410034,-122.42101788836888,0,0,5.0,1.0,2.0,3.0,0,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,235.0,221.72550126988065
10.0,2,0,37.766689597862175,-122.45250461761628,0,1,2.0,4.0,1.0,1.0,0,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,65.0,74.19615498151461
4.0,3,2,37.73074592978503,-122.44840862635226,1,1,1.0,2.0,1.0,1.0,0,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,60.0,75.94533988046499
10.0,2,0,37.76487219421756,-122.45182799146508,1,1,2.0,4.0,1.0,1.0,0,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,65.0,85.13805002728341
2.0,0,0,37.77524858589268,-122.43637374831292,1,0,5.0,1.5,2.0,2.0,0,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,575.0,416.19715319350144
1.0,0,3,37.78470745496073,-122.44555431261594,0,0,7.0,1.0,2.0,1.0,0,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,255.0,278.2570760800651
2.0,4,1,37.75918889708064,-122.42236687240562,0,1,3.0,1.0,1.0,2.0,0,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,139.0,124.59759417721054
1.0,4,1,37.75174004606522,-122.4094205953428,0,0,4.0,2.5,3.0,3.0,0,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,285.0,420.462478299909
1.0,5,4,37.76258885144137,-122.40543055237004,1,1,2.0,1.0,1.0,1.0,0,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,135.0,105.80905903251865
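
As a quick sanity check on the output above, the sketch below computes the mean absolute error between the **`price`** and **`prediction`** columns for the ten displayed rows. The values are copied from the table with predictions rounded to two decimals. Note the assumptions: this is plain Python on a hand-copied sample, and because **`spark_df`** contains the training rows as well as the test rows, the number is not a held-out error estimate. On the full DataFrame, an aggregation in Spark would be the more likely choice.

```python
# (price, prediction) pairs copied from the ten rows displayed above,
# with predictions rounded to two decimals
rows = [
    (170.0, 182.87),
    (235.0, 221.73),
    (65.0, 74.20),
    (60.0, 75.95),
    (65.0, 85.14),
    (575.0, 416.20),
    (255.0, 278.26),
    (139.0, 124.60),
    (285.0, 420.46),
    (135.0, 105.81),
]

# Absolute error per row, then the mean over the sample
abs_errors = [abs(price - prediction) for price, prediction in rows]
mae = sum(abs_errors) / len(abs_errors)

# Zero-based position of the row with the largest absolute error
worst_index = max(range(len(rows)), key=lambda i: abs_errors[i])

print(f"MAE over {len(rows)} displayed rows: {mae:.2f}")
print(f"Largest error at row {worst_index}: "
      f"actual={rows[worst_index][0]}, predicted={rows[worst_index][1]}")
```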


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>