
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>



# Experimentation with Mosaic AI AutoML

In this demo, we will explore the capabilities of AutoML (Automated Machine Learning) within Databricks. AutoML allows you to automatically build, train, and select the best machine learning models for your data without manual intervention. We will create an AutoML experiment, evaluate the results, and register the best model.

**Learning Objectives**

_By the end of this lesson, you will be able to:_

1. **Establish a baseline champion model with AutoML:**
    - Create an AutoML experiment using the Experiments UI.
    - (Optional) Create an AutoML experiment using the `databricks` Python library. 
    - Understand the various configuration options available in AutoML experiments.
2. **Model and notebook inspection:**
    - Evaluate the results of an AutoML experiment using the UI and identify the best model, called the champion model.
    - (Optional) Evaluate the results of an AutoML experiment using a notebook and identify the champion model.  
    - Open and explore the automatically generated notebook for the champion model. 
3. **Register the champion model:**
    - Register the model at the Workspace level. 
    - Register the model within Unity Catalog.


## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **15.4.x-cpu-ml-scala2.12**, **15.4.x-scala2.12**


## Classroom Setup

Before starting the lesson, we need to build some data assets and define some configuration variables required for this demonstration. The output of the following cell is hidden so our space isn't cluttered. To view the details of the output, you can hover over the next cell and click the eye icon.

The cell after the setup, titled `View Setup Variables`, displays the various variables that were created. You can click the Catalog icon in the notebook space to the right to see that your catalog was created with no data.

In [0]:
%run ../Includes/Classroom-Setup-02

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Resetting the learning environment:
| dropping the catalog "labuser8027617_1732830335_zsw6_da"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01"

Validating the locally installed datasets:
| listing local files...(0 seconds)
| validation completed...(0 seconds total)
Creating & using the catalog "labuser8027617_1732830335_zsw6_da"...(1 seconds)

Predefined tables in "labuser8027617_1732830335_zsw6_da.default":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/labuser8027617_1732830335@vocareum.com/get-started-with-databricks-for-machine-learning
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01

Setup completed (3 seconds)


In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"User DB Location:  {DA.paths.user_db}")

Username:          labuser8027617_1732830335@vocareum.com
Catalog Name:      labuser8027617_1732830335_zsw6_da
Schema Name:       default
Working Directory: dbfs:/mnt/dbacademy-users/labuser8027617_1732830335@vocareum.com/get-started-with-databricks-for-machine-learning
User DB Location:  None



## Prepare Data

For this demonstration, we will utilize a fictional dataset from a Telecom Company, which includes customer information. This dataset encompasses customer demographics, including gender, as well as internet subscription details such as subscription plans and payment methods.

To get started, execute the code block below. 

This will create the `customers` table and allow us to explore its features.


In [0]:
DA.create_customers_table()
spark.table('customers').printSchema()

root
 |-- CustomerID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- Churn: string (nullable = true)



## Part 1: Create An AutoML Experiment

Here we will discuss two methods for creating an AutoML experiment:
1. Using the Databricks UI. 
2. (Optional) Programmatically using the `databricks` Python library.


### Create An AutoML Experiment With The UI

Let's initiate an AutoML experiment to construct a baseline model for predicting customer churn. The target field for this prediction will be the `Churn` field.

Follow these step-by-step instructions to create an AutoML experiment:

1. Navigate to **Experiments** under **Machine Learning** in the left sidebar menu.

2. Click on **Create AutoML experiment** located at the top.

3. Choose a cluster to execute the experiment.

4. For the ML problem type, opt for **Classification**.

5. To select the `customers` table as the input training data, select `Browse` under `Input training dataset` and navigate to the catalog and `default` schema that were created previously. Alternatively, you can type your name in Catalogs within the **Select training data** menu to find your table.

6. Specify **`Churn`** as the **Prediction target**.

7. Deselect the **CustomerID** field as it's not needed as a feature.

8. In **Experiment name** you will see an automatically generated string like `Churn_customers-2024_10_16-10_20`. Append your first and last name, or some other unique identifier, to this string, for example `Churn_customers-2024_10_16-10_20_firstName_lastName`. If a message indicates that the name is taken, append a different identifier to the experiment name.

9. In the **Advanced Configuration** section, set the **Timeout** to **5 minutes**.

10. Click on **Start AutoML**. 

**Optional Advanced Configuration:**

- You have the flexibility to choose the **evaluation metric** and your preferred **training framework**.

- If your dataset includes a timeseries field, you can define it when splitting the dataset.
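If you later drive AutoML from code instead of the UI, these optional settings appear to correspond to keyword arguments of `automl.classify`. The sketch below only assembles such arguments into a dictionary so that the mapping is visible. The parameter names `primary_metric`, `exclude_frameworks`, and `time_col` are assumptions based on the Databricks AutoML Python API and should be checked against the documentation for your runtime. `build_classify_kwargs` is a hypothetical helper, not part of any library.

```python
def build_classify_kwargs(target_col, timeout_minutes=5, exclude_cols=None,
                          primary_metric=None, exclude_frameworks=None, time_col=None):
    """Collect keyword arguments for automl.classify, omitting options left unset."""
    options = {
        "target_col": target_col,                  # Prediction target in the UI
        "timeout_minutes": timeout_minutes,        # Timeout in Advanced Configuration
        "exclude_cols": exclude_cols,              # columns deselected in the UI
        "primary_metric": primary_metric,          # evaluation metric (assumed parameter name)
        "exclude_frameworks": exclude_frameworks,  # training frameworks to skip (assumed name)
        "time_col": time_col,                      # time column used for splitting (assumed name)
    }
    # Drop unset options so that AutoML falls back to its own defaults.
    return {key: value for key, value in options.items() if value is not None}


kwargs = build_classify_kwargs(
    target_col="Churn",
    exclude_cols=["CustomerID"],
    primary_metric="roc_auc",
    exclude_frameworks=["xgboost"],
)
print(kwargs)
```

In a notebook, the dictionary could then be passed along with the data, for example `automl.classify(dataset=spark.table("customers"), **kwargs)`.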


### (Optional) Create An AutoML Experiment Within A Notebook

We can also programmatically kick off an AutoML experiment using the `databricks` Python library. After clicking **Run** on the following cell, you can go over to **Experiments** on the left menu bar and navigate to the experiment.

In [0]:
from databricks import automl
from datetime import datetime
# Define parameters for the classify function
table_name = "customers" # The table containing our features and labels
target_col = "Churn" # The variable we are trying to predict
exclude_cols = ["CustomerID"] # Exclude the CustomerID column
timeout_minutes = 5  # The maximum time in minutes to run the experiment


# Run the AutoML classification experiment; AutoML trains candidate models and generates a notebook for the best run
automl_run = automl.classify(
    dataset=spark.table(table_name), 
    target_col=target_col, 
    exclude_cols=exclude_cols, 
    timeout_minutes=timeout_minutes)

2024/11/28 22:25:24 INFO databricks.automl.client.manager: AutoML will optimize for F1 score metric, which is tracked as val_f1_score in the MLflow experiment.
2024/11/28 22:25:26 INFO databricks.automl.client.manager: MLflow Experiment ID: 4412028206669628
2024/11/28 22:25:26 INFO databricks.automl.client.manager: MLflow Experiment: https://dbc-40a12f74-fe09.cloud.databricks.com/?o=2972073332228503#mlflow/experiments/4412028206669628
2024/11/28 22:27:02 INFO databricks.automl.client.manager: Data exploration notebook: https://dbc-40a12f74-fe09.cloud.databricks.com/?o=2972073332228503#notebook/4412028206669646
2024/11/28 22:31:22 INFO databricks.automl.client.manager: AutoML experiment completed successfully.


| Metric | Train | Validation | Test |
|---|---|---|---|
| f1_score | 0.141 | 0.08 | 0.174 |
| recall_score | 0.082 | 0.045 | 0.105 |
| roc_auc | 0.716 | 0.718 | 0.731 |
| false_negatives | 1003.0 | 378.0 | 325.0 |
| false_positives | 78.0 | 35.0 | 36.0 |
| example_count | 4168.0 | 1436.0 | 1372.0 |
| precision_score | 0.533 | 0.34 | 0.514 |
| true_positives | 89.0 | 18.0 | 38.0 |
| precision_recall_auc | 0.438 | 0.425 | 0.449 |
| true_negatives | 2998.0 | 1005.0 | 973.0 |
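To see how the precision, recall, and F1 values in this output relate to the raw counts, the following sketch recomputes them from the Test column. It uses only the standard formulas for a single positive class; `precision_recall_f1` is a hypothetical helper name, and treating churned customers as the positive class is an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts copied from the Test column: true_positives, false_positives, false_negatives.
precision, recall, f1 = precision_recall_f1(tp=38, fp=36, fn=325)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.514 recall=0.105 f1=0.174
```

The low recall is what drags F1 down: of the 363 positive examples in the test split (38 + 325), the model identifies only 38. A baseline like this is a starting point for comparison rather than a model to deploy as-is.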


## Part 2: Best Run Inspection

Next, we will inspect the best run produced by AutoML:

1. Using the Databricks UI. 
2. (Optional) Programmatically using the `databricks` Python library.


### Best Run Inspection With The UI

Once the experiment is finished, it's time to examine the best run:

1. Access the completed experiment in the **Experiments** section.

2. Identify the best model run by evaluating the displayed **metrics**. Alternatively, you can click on **View notebook for the best model** to access the automatically generated notebook for the top-performing model.

3. Utilize the **Chart** tab to compare and contrast the various models generated during the experiment.

You can find all details for the runs on the experiment page. The columns include the framework used (e.g., Scikit-Learn, XGBoost), evaluation metrics (e.g., Accuracy, F1 Score), and links to the corresponding notebooks for each model. This allows you to make informed decisions about selecting the best model for your specific use case.



### (Optional) Best Run Inspection Within A Notebook 

Alternatively, we can approach the inspection programmatically. First, we will display only the experiments that we have created. Second, we will create a Pandas DataFrame that contains F1-score information to help pick the champion model.

**Notebooks for Other Experiment Trials in AutoML**

For classification and regression experiments, the AutoML-generated notebooks for data exploration and for the best trial in your experiment are automatically imported to your workspace. For all trials besides the best trial, the notebooks **are NOT created** automatically. If you need to use these notebooks, you can manually import them into your workspace with the **`automl.import_notebook`** Python API.

In [0]:
import mlflow

# Initialize MLflow client
client = mlflow.tracking.MlflowClient()

# List all experiments using search_experiments()
experiments = client.search_experiments()

# Loop through experiments and check if the username is part of the experiment name or artifact location
for experiment in experiments:
    if DA.username in experiment.name or DA.username in experiment.artifact_location:
        print(f"Experiment ID: {experiment.experiment_id}, \nName: {experiment.name}, \nArtifact Location: {experiment.artifact_location}\n")

Experiment ID: 4412028206669628, 
Name: /Users/labuser8027617_1732830335@vocareum.com/databricks_automl/Churn_global_temp.automl_17c9cec7_1366_44dc_880f_400e2dedf56b_2024-25-28_22-25-24, 
Artifact Location: dbfs:/databricks/mlflow-tracking/4412028206669628



Let's grab only the latest experiment, assuming that's the experiment we want. You can modify this code to grab a specific experiment if needed.

In [0]:
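# Take the first match; search_experiments() is assumed to list the most recently updated experiment first
latest_experiment_name = None  # stays None if no experiment matches the username
for experiment in experiments:
    if DA.username in experiment.name or DA.username in experiment.artifact_location:
        latest_experiment_name = experiment.name
        break
print(f"Latest experiment name: {latest_experiment_name}")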

Latest experiment name: /Users/labuser8027617_1732830335@vocareum.com/databricks_automl/Churn_global_temp.automl_17c9cec7_1366_44dc_880f_400e2dedf56b_2024-25-28_22-25-24
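The loop above takes the first matching experiment, so "latest" depends on the order in which `search_experiments()` returns its results. A minimal sketch that makes the ordering explicit is shown below. It assumes each experiment object exposes a `last_update_time` attribute (epoch milliseconds), as MLflow's `Experiment` entity does to our understanding. `latest_experiment_for_user` is a hypothetical helper, and the stand-in objects use made-up values so that the example runs without a workspace.

```python
from types import SimpleNamespace


def latest_experiment_for_user(experiments, username):
    """Return the user's most recently updated experiment, or None if there is none."""
    mine = [
        e for e in experiments
        if username in (e.name or "") or username in (e.artifact_location or "")
    ]
    # last_update_time is assumed to be epoch milliseconds; missing values sort first.
    return max(mine, key=lambda e: e.last_update_time or 0, default=None)


# Stand-in objects with hypothetical values; in the notebook you would pass
# client.search_experiments() and DA.username instead.
fake_experiments = [
    SimpleNamespace(name="/Users/alice@example.com/databricks_automl/old",
                    artifact_location="dbfs:/databricks/mlflow-tracking/1",
                    last_update_time=1_000),
    SimpleNamespace(name="/Users/alice@example.com/databricks_automl/new",
                    artifact_location="dbfs:/databricks/mlflow-tracking/2",
                    last_update_time=2_000),
    SimpleNamespace(name="/Users/bob@example.com/databricks_automl/other",
                    artifact_location="dbfs:/databricks/mlflow-tracking/3",
                    last_update_time=3_000),
]

latest = latest_experiment_for_user(fake_experiments, "alice@example.com")
print(latest.name)
```

Returning `None` when nothing matches lets the caller decide how to handle a user with no experiments, instead of failing later on an undefined variable.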


Now that we have the name of our latest experiment, we can construct a Pandas DataFrame that grabs the F1-score for visual inspection.

In [0]:
import mlflow
import pandas as pd

# Step 1: Get the experiment by name
experiment = mlflow.get_experiment_by_name(latest_experiment_name)

# Initialize an empty list to store the data for the DataFrame
data = []

if experiment:
    experiment_id = experiment.experiment_id
    
    # Step 2: Retrieve all runs from the experiment
    runs = mlflow.search_runs(experiment_ids=[experiment_id])
    
    # Step 3: Access the F1 scores for training, validation, and test data
    for _, run in runs.iterrows():
        run_id = run['run_id']
        
        # Fetch the run data
        run_data = mlflow.get_run(run_id)
        
        # Fetch F1 scores
        train_f1_score = run_data.data.metrics.get('training_f1_score', None)
        val_f1_score = run_data.data.metrics.get('val_f1_score', None)
        test_f1_score = run_data.data.metrics.get('test_f1_score', None)
        
        # Append the data for this run to the list
        data.append({
            'run_id': run_id,
            'train_f1_score': train_f1_score,
            'validation_f1_score': val_f1_score,
            'test_f1_score': test_f1_score
        })

    # Convert the list of data into a pandas DataFrame
    df = pd.DataFrame(data)
    
    # Display or return the DataFrame
    display(df)
else:
    print(f"Experiment {latest_experiment_name} not found.")

| run_id | train_f1_score | validation_f1_score | test_f1_score |
|---|---|---|---|
| f486e965bf2a41c4a3afdd0746d5c199 | 0.1413820492454328 | 0.0801781737193764 | 0.1739130434782608 |
| ed53366b4b3a4b11b4fbff89225614c1 | 0.0 | 0.0 | 0.0 |
| cf25dab8799f4473aa7b4903a381e714 | 0.0 | 0.0 | 0.0 |
| cc0f3b9bc8b04c78b83dcc3961c5373d | 0.0 | 0.0 | 0.0 |
| d65f89a34d7f41cabe8c90c9bd14a042 |  |  |  |
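With the scores collected, picking the champion amounts to taking the run with the highest validation F1 while skipping runs that logged no metric. The sketch below does this in plain Python on the rows from the output above. `pick_champion` is a hypothetical helper, and using validation F1 as the selection criterion is an assumption that mirrors the metric AutoML reported optimizing in this experiment.

```python
import math

# Rows copied from the output above; None marks a run that logged no F1 metrics.
runs = [
    {"run_id": "f486e965bf2a41c4a3afdd0746d5c199", "validation_f1_score": 0.0801781737193764},
    {"run_id": "ed53366b4b3a4b11b4fbff89225614c1", "validation_f1_score": 0.0},
    {"run_id": "cf25dab8799f4473aa7b4903a381e714", "validation_f1_score": 0.0},
    {"run_id": "cc0f3b9bc8b04c78b83dcc3961c5373d", "validation_f1_score": 0.0},
    {"run_id": "d65f89a34d7f41cabe8c90c9bd14a042", "validation_f1_score": None},
]


def pick_champion(rows, metric="validation_f1_score"):
    """Return the row with the highest metric value, skipping missing (None or NaN) values."""
    scored = [r for r in rows if r[metric] is not None and not math.isnan(r[metric])]
    return max(scored, key=lambda r: r[metric], default=None)


champion = pick_champion(runs)
print(champion["run_id"])  # prints: f486e965bf2a41c4a3afdd0746d5c199
```

Selecting on the validation score rather than the test score keeps the test split as an untouched estimate of how the chosen model generalizes.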


## Part 3: Register the Best Model Using the UI

1. If you have not already done so, navigate to **Experiments** on the left sidebar menu.
1. Find the AutoML experiment you ran previously and click on the name. 
1. Select the name of the top run in the table. Notice that the sort widget is set to `val_f1_score` automatically, so the table prioritizes this metric by default.
1. Select **Register model** at the top right. 
1. In the dialog box, you will be presented with two options: Workspace Model Registry and Unity Catalog. If you choose to register the model at the workspace level, you only need to enter the name of the model. If you wish to use Unity Catalog, you will need to provide the model name in the format `<catalog_name>.<schema_name>.<model_name>`. To keep our data assets manageable for the clean-up, let's select **Unity Catalog** and enter **`churn_prediction_base_model`** as the model name, prefixed with the catalog and schema we've been working in. For example, `firstName_lastName_7pig_da.default.churn_prediction_base_model` would be acceptable.
1. Finally, click the **Register** button to complete the registration process. You will see a message indicating the registration process has started. 
1. Navigate to **Catalog**. After finding the catalog and schema where you saved your model, you will find your newly registered model either in the **Catalog Explorer** or by clicking on the schema and selecting the **Models** tab. 
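The registration steps above also have a programmatic counterpart. The following is a minimal sketch, assuming that the champion run logged its model under the artifact path `model` (an assumption; check the run's Artifacts tab) and that MLflow is available in a Databricks workspace. `uc_model_name`, `run_model_uri`, and `register_champion` are hypothetical helper names, and the values passed at the bottom are placeholders.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build the three-level Unity Catalog name <catalog_name>.<schema_name>.<model_name>."""
    return f"{catalog}.{schema}.{model}"


def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI for a logged model; the default artifact path is an assumption."""
    return f"runs:/{run_id}/{artifact_path}"


def register_champion(run_id: str, catalog: str, schema: str, model: str):
    """Register the run's model in Unity Catalog (requires MLflow and a Databricks workspace)."""
    import mlflow  # imported lazily so the helpers above work without MLflow installed

    # Point the registry at Unity Catalog instead of the workspace model registry.
    mlflow.set_registry_uri("databricks-uc")
    return mlflow.register_model(run_model_uri(run_id), uc_model_name(catalog, schema, model))


# Placeholder values for illustration only.
print(uc_model_name("my_catalog", "default", "churn_prediction_base_model"))
print(run_model_uri("abc123"))
```

In this notebook, a call might look like `register_champion(run_id, DA.catalog_name, DA.schema_name, "churn_prediction_base_model")`, where `run_id` is the ID of the champion run.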

Your model is now registered and ready for inferencing. We will discuss how to query a model in the next lesson.


## Classroom Clean-up

After completing the demo, clean up any resources created.

- Delete saved models from model registry.

- Delete AutoML experiment from **Experiments** page.

Run the following cell to remove lesson-specific assets created during this lesson.

In [0]:
DA.cleanup()

## Conclusion And Next Steps

In this demo, we learned how to utilize AutoML as a low-code solution for establishing a baseline champion model for a sample dataset. We learned about the various configurations that can be used within the UI. We also briefly explored how to approach AutoML from a programmatic point of view. Finally, we introduced the idea of registering your model for inferencing. In the next lesson, we will dig more into concepts such as packaging custom code, staging a model, and inferencing our registered models.


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>