
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# AutoML

- **<a href="https://docs.databricks.com/applications/machine-learning/automl.html" target="_blank">Databricks AutoML</a> helps you automatically build machine learning models both through a UI and programmatically.** 
- **It prepares the dataset for model training, then runs and records a set of trials (using Hyperopt) that create, tune, and evaluate multiple models.** 

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you will:<br>
 - **Use AutoML to automatically train and tune your models**
 - **Run AutoML in Python and through the UI**
 - **Interpret the results of an AutoML run**

In [0]:
%run "./Includes/Classroom-Setup"

#### Currently, AutoML trains only single-node models, drawn from XGBoost and sklearn, and optimizes the hyperparameters of each.

In [0]:
file_path = f"{datasets_dir}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)
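
A note on the split: `randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 80/20, and the seed makes the assignment repeatable. The sketch below illustrates that idea in plain Python. It is not Spark's implementation, and `bernoulli_split` is a hypothetical helper written only for this illustration.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to the given weights (illustrative helper, not a Spark API)."""
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = bernoulli_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # close to 800 and 200, but not guaranteed exact
```

Because every row is drawn independently, two runs with the same seed and the same input give the same split, while the split sizes vary slightly around the requested proportions.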

We can now use AutoML to search for the optimal <a href="https://docs.databricks.com/applications/machine-learning/automl.html#regression" target="_blank">regression</a> model. 

Required parameters:
* **`dataset`** - Input Spark or pandas DataFrame that contains the training features and targets. A Spark DataFrame is converted to a pandas DataFrame under the hood by calling `.toPandas()`, which collects all of the data onto the driver, so make sure it fits in driver memory or the run can fail with an out-of-memory (OOM) error.
* **`target_col`** - Column name of the target labels

We will also specify these optional parameters:
* **`primary_metric`** - Primary metric to select the best model. Each trial will compute several metrics, but this one determines which model is selected from all the trials. One of **`r2`** (default, R squared), **`mse`** (mean squared error), **`rmse`** (root mean squared error), **`mae`** (mean absolute error) for regression problems.
* **`timeout_minutes`** - The maximum time to wait for the AutoML trials to complete. **`timeout_minutes=None`** runs the trials without any time limit.
* **`max_trials`** - The maximum number of trials to run. When **`max_trials=None`**, the maximum number of trials runs to completion.
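
To make the four `primary_metric` options concrete, here is a minimal pure-Python sketch of the standard definitions. The function name `regression_metrics` and the toy prices are made up for illustration; AutoML computes these metrics itself during each trial.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute mse, rmse, mae and r2 from two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Toy values, not taken from the Airbnb dataset.
prices = [100.0, 150.0, 200.0, 250.0]
preds = [110.0, 140.0, 190.0, 260.0]
metrics = regression_metrics(prices, preds)
print(metrics)
```

Note the direction of each metric: lower is better for `mse`, `rmse`, and `mae`, while higher is better for `r2`. `rmse` is in the same units as the target (here, price), which often makes it easier to interpret than `mse`.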

In [0]:
from databricks import automl

summary = automl.regress(train_df, target_col="price", primary_metric="rmse", timeout_minutes=5, max_trials=10)

| Metric | Train | Validation | Test |
|---|---|---|---|
| rmse | 355.06 | 140.465 | 200.551 |
| mae | 93.149 | 81.461 | 80.28 |
| score | 0.201 | 0.454 | 0.219 |
| r2_score | 0.201 | 0.454 | 0.219 |
| mse | 126067.734 | 19730.454 | 40220.554 |
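
As a quick sanity check when reading this output, the `rmse` row should be the square root of the `mse` row for every split. The sketch below verifies that with the values copied from the table above; the dictionary names are only for this illustration.

```python
import math

# Values copied from the AutoML output above.
mse = {"train": 126067.734, "validation": 19730.454, "test": 40220.554}
rmse = {"train": 355.06, "validation": 140.465, "test": 200.551}

for split in mse:
    recomputed = math.sqrt(mse[split])
    print(f"{split}: sqrt(mse)={recomputed:.3f}, reported rmse={rmse[split]:.3f}")
```

The values agree to the displayed precision. One plausible reading of the table is that the training split contains a few listings with very large errors: its `rmse` is far higher than the validation `rmse`, while the `mae` values are much closer, and `rmse` penalizes large errors more heavily than `mae` does.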


After running the previous cell, you will notice two notebooks and an MLflow experiment:
* **`Data exploration notebook`** - contains a profiling report that summarizes each input column, including its values, their frequencies, and other statistics
* **`Best trial notebook`** - shows the source code for reproducing the best trial conducted by AutoML
* **`MLflow experiment`** - contains high level information, such as the root artifact location, experiment ID, and experiment tags. The list of trials contains detailed summaries of each trial, such as the notebook and model location, training parameters, and overall metrics.

Dig into these notebooks and the MLflow experiment - what do you find?

#### Additionally, AutoML reports a short list of metrics from the best trial.

In [0]:
print(summary.best_trial)
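
Beyond the best trial, it can be useful to compare all trials on one metric. The sketch below assumes that the summary exposes a list of trials (for example as `summary.trials`) and that each trial carries a `metrics` dictionary; both names, and the `val_rmse` key, are assumptions to check against the AutoML Python API reference for your runtime. `rank_trials` is a hypothetical helper, and the stand-in objects hold made-up values so the example runs without a cluster.

```python
from types import SimpleNamespace


def rank_trials(trials, metric_key, lower_is_better=True):
    """Return the trials that report `metric_key`, best first.
    Hypothetical helper; assumes each trial has a `metrics` dict."""
    scored = [t for t in trials if metric_key in t.metrics]
    return sorted(
        scored,
        key=lambda t: t.metrics[metric_key],
        reverse=not lower_is_better,
    )


# Stand-ins for trial objects; the run IDs and metric values are made up.
fake_trials = [
    SimpleNamespace(mlflow_run_id="run-a", metrics={"val_rmse": 150.0}),
    SimpleNamespace(mlflow_run_id="run-b", metrics={"val_rmse": 140.0}),
    SimpleNamespace(mlflow_run_id="run-c", metrics={}),  # metric missing
]

ranked = rank_trials(fake_trials, "val_rmse")
print([t.mlflow_run_id for t in ranked])  # ['run-b', 'run-a']
```

On a real run, the call would look something like `rank_trials(summary.trials, "val_rmse")`, again assuming those attribute and key names match your AutoML version. For metrics where higher is better, such as R squared, pass `lower_is_better=False`.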

#### Now we can test the model that we got from AutoML against our test data. We'll be using <a href="https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf" target="_blank">mlflow.pyfunc.spark_udf</a> to register our model as a UDF and apply it in parallel to our test data.

In [0]:
# Load the best trial as an MLflow Model
import mlflow

model_uri = f"runs:/{summary.best_trial.mlflow_run_id}/model"

predict = mlflow.pyfunc.spark_udf(spark, model_uri)
pred_df = test_df.withColumn("prediction", predict(*test_df.drop("price").columns))
display(pred_df)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na,prediction
f,flexible,f,1.0,Bernal Heights,37.73615,-122.41245,House,Private room,2.0,1.0,1.0,2.0,Real Bed,1.0,194.0,91.0,9.0,9.0,10.0,10.0,9.0,9.0,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70.70239614363673
f,flexible,f,1.0,Castro/Upper Market,37.76702,-122.43518,Guest suite,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,3.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,190.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,216.0272665562921
f,flexible,f,1.0,Financial District,37.78424,-122.39925,Apartment,Private room,2.0,1.0,1.0,1.0,Real Bed,180.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,100.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,113.53368614430504
f,flexible,f,1.0,Inner Richmond,37.7787,-122.4554,House,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,3.0,6.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,325.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,294.78471920325563
f,flexible,f,1.0,Nob Hill,37.79256,-122.42135,House,Private room,1.0,1.0,1.0,1.0,Real Bed,140.0,2.0,60.0,7.0,6.0,8.0,8.0,9.0,7.0,200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,119.45675566717912
f,flexible,f,1.0,Noe Valley,37.75369,-122.42577,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,30.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,156.46323082675332
f,flexible,f,1.0,Outer Mission,37.71969,-122.44378,House,Private room,2.0,1.0,0.0,2.0,Real Bed,1.0,24.0,86.0,9.0,9.0,10.0,10.0,9.0,9.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.56808639252036
f,flexible,f,1.0,Pacific Heights,37.79586,-122.43035,Apartment,Private room,1.0,1.0,1.0,1.0,Real Bed,30.0,1.0,80.0,10.0,10.0,10.0,10.0,10.0,10.0,160.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,152.52666508935445
f,flexible,f,1.0,Western Addition,37.7752,-122.43765,Apartment,Entire home/apt,3.0,1.0,0.0,1.0,Real Bed,90.0,6.0,100.0,9.0,9.0,10.0,10.0,10.0,9.0,132.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,147.8294662037682
f,flexible,f,1.0,Western Addition,37.77814,-122.44079,Condominium,Private room,2.0,1.0,1.0,1.0,Real Bed,3.0,5.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,195.47584707871857


In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName('r2').evaluate(pred_df)
print(f"RMSE on test dataset: {rmse:.3f}")
print(f"R2 on test dataset: {r2:.3f}")
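
The evaluator reports a single aggregate number, so it can also help to look at which listings contribute the largest errors. The sketch below does this in plain Python for the ten rows displayed earlier, with predictions rounded to two decimals; `abs_error` is a helper defined only for this illustration.

```python
# (neighbourhood, property_type, price, prediction), copied from the ten
# displayed rows of pred_df; predictions are rounded to two decimals.
rows = [
    ("Bernal Heights", "House", 86.0, 70.70),
    ("Castro/Upper Market", "Guest suite", 190.0, 216.03),
    ("Financial District", "Apartment", 100.0, 113.53),
    ("Inner Richmond", "House", 325.0, 294.78),
    ("Nob Hill", "House", 200.0, 119.46),
    ("Noe Valley", "Apartment", 200.0, 156.46),
    ("Outer Mission", "House", 80.0, 75.57),
    ("Pacific Heights", "Apartment", 160.0, 152.53),
    ("Western Addition", "Apartment", 132.0, 147.83),
    ("Western Addition", "Condominium", 100.0, 195.48),
]


def abs_error(row):
    """Absolute difference between the listed price and the prediction."""
    _, _, price, prediction = row
    return abs(price - prediction)


# Sort so the rows with the largest errors come first.
worst_first = sorted(rows, key=abs_error, reverse=True)
for neighbourhood, property_type, price, prediction in worst_first[:3]:
    print(
        f"{neighbourhood} ({property_type}): price={price:.0f}, "
        f"predicted={prediction:.2f}, abs error={abs(price - prediction):.2f}"
    )
```

On the full `pred_df`, the same idea could be expressed in Spark by adding an absolute-error column with `withColumn` and `pyspark.sql.functions.abs`, then ordering by it in descending order.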

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>