
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>



# 3.Lab - Getting Started with Databricks for Machine Learning

In this lab, we will build an end-to-end ML model pipeline on Databricks. First, we will train a model and track it with MLflow. Next, we will register the model in Unity Catalog and mark it as ready for staging. In the last part of the lab, we will walk through deploying the registered model with Model Serving, then query the model through a REST endpoint and examine its behavior through the integrated monitoring dashboard.



## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtimes: **15.4.x-cpu-ml-scala2.12**, **15.4.x-scala2.12**


## Lab Setup

To ensure a smooth experience, follow these initial steps:

1. Run the provided classroom setup script. This script will establish necessary configuration variables tailored to each user. Execute the following code cell:

In [0]:
%run ../Includes/Classroom-Setup-Lab

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Resetting the learning environment:
| dropping the catalog "labuser8027617_1732916382_2w2l_da"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01"

Validating the locally installed datasets:
| listing local files...(0 seconds)
| validation completed...(0 seconds total)
Creating & using the catalog "labuser8027617_1732916382_2w2l_da"...(1 seconds)

Predefined tables in "labuser8027617_1732916382_2w2l_da.default":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/labuser8027617_1732916382@vocareum.com/get-started-with-databricks-for-machine-learning
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01

Setup completed (3 seconds)



**Other Conventions:**

Throughout this lab, we'll make use of the object `DA`, which provides critical variables. Execute the code block below to see various variables that will be used in this notebook:


In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

Username:          labuser8027617_1732916382@vocareum.com
Catalog Name:      labuser8027617_1732916382_2w2l_da
Schema Name:       default
Working Directory: dbfs:/mnt/dbacademy-users/labuser8027617_1732916382@vocareum.com/get-started-with-databricks-for-machine-learning
Dataset Location:  dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01



## Data Ingestion
The first step in this lab is to ingest data from CSV files and save it as Delta tables. Then, we will join the customer details and customer demographics data into a new `customers` table.

In [0]:
%sql
USE CATALOG ${DA.catalog_name};

In [0]:
print("CSV file paths:")
print(f"{DA.paths.datasets}/telco/customer-demographics.csv")
print(f"{DA.paths.datasets}/telco/customer-details.csv")

CSV file paths:
dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01/telco/customer-demographics.csv
dbfs:/mnt/dbacademy-datasets/get-started-with-databricks-for-machine-learning/v01/telco/customer-details.csv


Here is some sample code that will be helpful for creating tables. 
```
CREATE TABLE IF NOT EXISTS <table_name>;
COPY INTO <table_name>
  FROM "<file_path>" -- see DA object path 
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('inferSchema' = 'true', 'header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');
```
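
As a rough illustration, the template above can also be filled in programmatically. The helper below is a hypothetical sketch: the name `build_copy_into_sql` is not part of the lab or of any Databricks API, and the file path is a placeholder. It only assembles the statement text and does not execute anything.

```python
def build_copy_into_sql(table_name: str, file_path: str) -> list[str]:
    """Return the two SQL statements from the template as strings.

    Hypothetical helper for illustration only; nothing is executed here.
    """
    create_stmt = f"CREATE TABLE IF NOT EXISTS {table_name}"
    copy_stmt = (
        f"COPY INTO {table_name}\n"
        f'  FROM "{file_path}"\n'
        "  FILEFORMAT = CSV\n"
        "  FORMAT_OPTIONS ('inferSchema' = 'true', 'header' = 'true')\n"
        "  COPY_OPTIONS ('mergeSchema' = 'true')"
    )
    return [create_stmt, copy_stmt]


statements = build_copy_into_sql(
    "customer_demographics",
    "dbfs:/path/to/customer-demographics.csv",  # placeholder, not the lab's real path
)
for stmt in statements:
    print(stmt)
    print("---")
```

In a notebook, each string could then be passed to `spark.sql(stmt)`, one statement per call.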

In [0]:
%sql
--Create table customer_demographics and copy data into it
CREATE TABLE IF NOT EXISTS customer_demographics;
COPY INTO customer_demographics
  FROM "${DA.paths.datasets}/telco/customer-demographics.csv"
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('inferSchema' = 'true', 'header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');

--Create table customer_details and copy data into it
CREATE TABLE IF NOT EXISTS customer_details;
COPY INTO customer_details
  FROM "${DA.paths.datasets}/telco/customer-details.csv"
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('inferSchema'= 'true', 'header'= 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');


num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
7043,7043,0


In [0]:
%sql
--Create a customer table with combined data
CREATE OR REPLACE TABLE customers AS
SELECT cd.customerID AS ID,
    cd.PhoneService, 
    cd.Contract,
    cd.PaymentMethod,
    cd.MonthlyCharges, 
    cd.Churn,
    cdm.gender,
    cdm.tenure,
    cdm.Dependents
FROM customer_details cd
LEFT JOIN customer_demographics cdm ON cd.customerID = cdm.customerID
WHERE isnotnull(cd.Churn);

SELECT * FROM customers;

ID,PhoneService,Contract,PaymentMethod,MonthlyCharges,Churn,gender,tenure,Dependents
7590-VHVEG,No,Month-to-month,Electronic check,29.85,No,Female,1,No
5575-GNVDE,Yes,One year,Mailed check,56.95,No,Male,34,No
3668-QPYBK,Yes,Month-to-month,Mailed check,53.85,Yes,Male,2,No
7795-CFOCW,No,One year,Bank transfer (automatic),42.3,No,Male,45,No
9237-HQITU,Yes,Month-to-month,Electronic check,70.7,Yes,Female,2,No
9305-CDSKC,Yes,Month-to-month,Electronic check,99.65,Yes,Female,8,No
1452-KIOVK,Yes,Month-to-month,Credit card (automatic),89.1,No,Male,22,Yes
6713-OKOMC,No,Month-to-month,Mailed check,29.75,No,Female,10,No
7892-POOKP,Yes,Month-to-month,Electronic check,104.8,Yes,Female,28,No
6388-TABGU,Yes,One year,Bank transfer (automatic),56.15,No,Male,62,Yes


In [0]:
import pyspark.pandas as ps
import pandas as pd
from pyspark.sql.functions import col

sdf = spark.sql("SELECT * FROM customers")
sdf = sdf.drop("ID")

pdf = ps.DataFrame(sdf)

training_df = ps.get_dummies(
    pdf,
    columns=[
        "gender",
        "Dependents",
        "PaymentMethod",
        "Contract",
        "PhoneService"
    ],
    dtype="float64",
).to_pandas()

In [0]:
display(training_df)

MonthlyCharges,Churn,tenure,gender_Female,gender_Male,Dependents_No,Dependents_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Contract_Month-to-month,Contract_One year,Contract_Two year,PhoneService_No,PhoneService_Yes
29.85,No,1,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
56.95,No,34,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
53.85,Yes,2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
42.3,No,45,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
70.7,Yes,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
99.65,Yes,8,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
89.1,No,22,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
29.75,No,10,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
104.8,Yes,28,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
56.15,No,62,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
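
The column layout in the output above comes from one-hot encoding: each categorical column is replaced by one `<column>_<value>` indicator column per category. The following plain-Python function is a minimal sketch of that idea on toy rows. It is not how `ps.get_dummies` is implemented, and the name `one_hot` is made up for illustration.

```python
def one_hot(rows, columns):
    """One-hot encode the given columns of a list of dicts (illustrative only)."""
    # Collect the sorted set of categories seen in each encoded column.
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        # Copy through the columns that are not being encoded.
        out = {k: v for k, v in row.items() if k not in columns}
        for c in columns:
            for value in categories[c]:
                out[f"{c}_{value}"] = 1.0 if row[c] == value else 0.0
        encoded.append(out)
    return encoded


rows = [
    {"tenure": 1, "gender": "Female", "Contract": "Month-to-month"},
    {"tenure": 34, "gender": "Male", "Contract": "One year"},
]
encoded = one_hot(rows, ["gender", "Contract"])
print(encoded[0])
```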


## Model Tracking and Management with MLflow

In this section, we will use MLflow to track and manage models. First, we will load the features and split them into training and test sets.

### Train and Track Model

Next, we'll train a machine learning model using scikit-learn.

In [0]:
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X = training_df.drop("Churn", axis=1)
y = training_df["Churn"]

# Convert categorical labels to numerical labels
y = y.map({'Yes': 1.0, 'No': 0.0})

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [0]:
mlflow.set_experiment(f"/Users/{DA.username}/LAB-Get-Started-with-Databricks-for-ML")

2024/11/29 23:28:14 INFO mlflow.tracking.fluent: Experiment with name '/Users/labuser8027617_1732916382@vocareum.com/LAB-Get-Started-with-Databricks-for-ML' does not exist. Creating a new experiment.


<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/2715468748136830', creation_time=1732922894876, experiment_id='2715468748136830', last_update_time=1732922894876, lifecycle_stage='active', name='/Users/labuser8027617_1732916382@vocareum.com/LAB-Get-Started-with-Databricks-for-ML', tags={'mlflow.experiment.sourceName': '/Users/labuser8027617_1732916382@vocareum.com/LAB-Get-Started-with-Databricks-for-ML',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': 'labuser8027617_1732916382@vocareum.com',
 'mlflow.ownerId': '5215585975143301'}>

In [0]:
with mlflow.start_run(run_name = 'gs_db_ml_LAB_run') as run:
    # Enable automatic logging of input samples, metrics, parameters, and models.
    # Autologging only captures fit() calls made after it is enabled.
    mlflow.sklearn.autolog(
        log_input_examples = True,
        silent = True
    )

    # Initialize the Random Forest classifier
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Fit the model on the training data
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)

    # Log the F1 score on the held-out test set
    mlflow.log_metric("test_f1", f1_score(y_test, y_pred))

    # Log the model explicitly with an input example and a signature
    mlflow.sklearn.log_model(
        rf_classifier,
        artifact_path = "model-artifacts",
        input_example=X_train[:3],
        signature=infer_signature(X_train, y_train)
    )

    # URI of the logged model, used later for registration
    model_uri = f"runs:/{run.info.run_id}/model-artifacts"



2024/11/29 23:28:48 INFO mlflow.tracking._tracking_service.client: 🏃 View run gs_db_ml_LAB_run at: dbc-40a12f74-fe09.cloud.databricks.com/ml/experiments/2715468748136830/runs/ff75816ef09a4c1c973d1779ec860739.
2024/11/29 23:28:48 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: dbc-40a12f74-fe09.cloud.databricks.com/ml/experiments/2715468748136830.
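
The `test_f1` metric logged in the run is the F1 score for the positive class (churn = 1.0), the harmonic mean of precision and recall. The function below is a plain-Python sketch of that calculation on made-up labels; `f1_binary` is a hypothetical name, and in the lab the value comes from scikit-learn's `f1_score`.

```python
def f1_binary(y_true, y_pred, positive=1.0):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives, so the score is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true_demo = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
y_pred_demo = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
score = f1_binary(y_true_demo, y_pred_demo)
print(round(score, 4))
```

With precision and recall both at 2/3 in this toy case, the F1 score is also 2/3.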



### Register the Model

Now, let's register the trained model in the model registry:

1. Use the logged model from the previous step.
2. Provide a name and description for the model.
3. Register the model to Unity Catalog.

In [0]:

# Modify the registry uri to point to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

model_name = f"gs_db_ml_LAB_{DA.unique_name('-')}"

# Register the model in the model registry
registered_model = mlflow.register_model(model_uri=model_uri, name=model_name)

Successfully registered model 'labuser8027617_1732916382_2w2l_da.default.gs_db_ml_lab_labuser8027617-1732916382-2w2l-da-gsml'.


Created version '1' of model 'labuser8027617_1732916382_2w2l_da.default.gs_db_ml_lab_labuser8027617-1732916382-2w2l-da-gsml'.



### Manage Model Stages

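Now that the model is registered in Unity Catalog, we can manage its lifecycle using the UI or the API. Models in Unity Catalog use aliases rather than the stages of the workspace Model Registry, so in this lab we will use the API to assign the `staging` alias to the registered model version.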

In [0]:
from mlflow.tracking.client import MlflowClient

# Initialize an MLflow Client
client = MlflowClient()

# Assign a "staging" alias to model version 1
client.set_registered_model_alias(
    name= registered_model.name,  # The registered model name
    alias="staging",  # The alias representing the staging environment
    version=registered_model.version  # The version of the model you want to move to "staging"
)
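
Once an alias is assigned, a model version can be referenced by alias instead of by version number, using a URI of the form `models:/<model_name>@<alias>`. The snippet below is a small sketch that only builds such a URI; the helper name and the catalog, schema, and model names are placeholders, not values from this lab.

```python
def model_uri_for_alias(catalog: str, schema: str, model: str, alias: str) -> str:
    """Build a `models:/<catalog>.<schema>.<model>@<alias>` URI (illustrative helper)."""
    return f"models:/{catalog}.{schema}.{model}@{alias}"


# Placeholder names; in the lab, `registered_model.name` already holds
# the full three-level name.
uri = model_uri_for_alias("my_catalog", "default", "my_model", "staging")
print(uri)
```

In a notebook with the registry URI set to `databricks-uc`, a URI like this could be passed to `mlflow.pyfunc.load_model(uri)` to load whichever version currently carries the alias.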

**Note 🚨: The instructions below are for demonstration purposes only. Please do not provision an endpoint; one was already created during the workspace setup.**


## Mosaic AI Model Serving

**Setting Up Model Serving**

We can create Model Serving endpoints with the Databricks Machine Learning API or the Databricks Machine Learning UI. An endpoint can serve any registered Python MLflow model in the **Model Registry**.

To keep things simple, this lab uses the Model Serving UI for creating, managing, and using Model Serving endpoints. We can create model serving endpoints from the **"Serving"** page or directly from a registered model's page under **"Models"**.

Let's go through the steps of creating a model serving endpoint in Models page. **You will not actually create the endpoint.**

- Go to **Models**. 

- Select **Unity Catalog** at the top and select **Owned by me** as well.

- Select the model you want to serve under the **Name** column. Notice this will take you to the Catalog menu. 

- Click the **Serve this model** button on the top right. This will take you to the **Serving endpoints** screen.

- Next, under **General**, enter a name for the endpoint. The name should be unique (for instance, include your first and last name). An example name is **get-started-model-serving-endpoint**.

- There are several configuration options under **Served entities** that we will not discuss here. Leave **Entity** and **Compute type** at their default values. For **Compute scale-out**, select **small**. You can select **Scale to zero** for this lesson as well. We will be deleting the endpoint at the end of this lesson, so this doesn't matter too much for our purposes.

- **Do not click Create** at the bottom right. The above instructions are only for demonstration purposes. **Do not provision an endpoint.**

    - If you happen to accidentally create an endpoint, you can navigate to the left sidebar and click on **Serving**. Then click on the model you began provisioning and click on the 3 vertical dots at the top right. Select **Delete**. Again, **Do not provision an endpoint.**

- If you do click **Create**, you might see an error saying "Endpoint with name 'get-started-model-serving-endpoint' already exists." This is because we already set up this endpoint during the demo notebook. **Do not provision an endpoint by changing the name.**



### Query Serving Endpoint

Let's use the deployed model for real-time inference. Here’s a step-by-step guide for querying an endpoint in Databricks Model Serving:

- Go to the **Serving** endpoints page and select the endpoint you want to query.

- Click the **Use** button in the top-right corner.

- There are four methods for querying an endpoint: **browser**, **CURL**, **Python**, and **SQL**. For now, let's use the easiest method and query right in the **browser** window. With this method, we need to provide the input parameters in JSON format. If you used `mlflow.sklearn.autolog()` with `log_input_examples = True`, you registered an example with MLflow, which appears automatically when selecting **browser**. If not, you will need to provide that input manually or update your code to include autologging (see `03 - Getting Started with MLflow and Mosaic AI Model Serving`).

- Click **Send request**.

- The **Response** field on the right panel will show the result of the inference.
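
For the **Python** method, the request boils down to an authenticated POST with a JSON body. The sketch below builds such a request without sending it. It assumes the `dataframe_split` input format and the usual `/serving-endpoints/<name>/invocations` path; the host, token, and the shortened feature set are placeholders (a real request needs every feature column the model was trained on), so compare it with the code shown under the endpoint's **Use** dialog before relying on it.

```python
import json
import urllib.request


def build_scoring_request(host, endpoint_name, token, records):
    """Build (but do not send) a POST request for a serving endpoint."""
    columns = list(records[0].keys())
    payload = {
        "dataframe_split": {
            "columns": columns,
            "data": [[row[c] for c in columns] for row in records],
        }
    }
    return urllib.request.Request(
        url=f"https://{host}/serving-endpoints/{endpoint_name}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scoring_request(
    host="example.cloud.databricks.com",           # placeholder workspace host
    endpoint_name="get-started-model-serving-endpoint",
    token="<personal-access-token>",               # placeholder, never hard-code real tokens
    records=[{"MonthlyCharges": 29.85, "tenure": 1, "gender_Female": 1.0}],  # shortened example row
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(request)`) is left out on purpose, since the lab asks you to query the endpoint from the UI.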


## Classroom Clean-up

After completing the lab, it's important to clean up any resources that were created.

Run the following cell to remove lesson-specific assets created during this lesson.

In [0]:
DA.cleanup()



## Conclusion

In this lab, we walked through an end-to-end machine learning workflow on the Databricks Data Intelligence Platform. From data ingestion to model deployment, we covered the essential steps: data preparation, model training, tracking, registration, and serving. By using MLflow for model tracking and management and Model Serving for deployment, we saw how Databricks supports building and deploying ML models in a single workflow. After this lab, you should have a solid understanding of the Databricks capabilities for ML tasks and how they can streamline your development process.


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>