In [0]:
# Install packages that aren't natively on a Databricks cluster
install.packages("carrier")

* installing *source* package ‘carrier’ ...
** package ‘carrier’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (carrier)
Installing package into ‘/local_disk0/.ephemeral_nfs/envs/rEnv-66f7f07d-db19-4abc-a55e-7806af304366’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/carrier_0.3.0.4.tar.gz'
Content type 'application/x-gzip' length 9719 bytes
downloaded 9719 bytes


The downloaded source packages are in
	‘/tmp/RtmpnetyvW/downloaded_packages’
NULL

# Introduction to the Notebook

This notebook demonstrates the following key functionalities:

1. **Loading Data from Unity Catalog**: We will begin by loading our dataset from Unity Catalog, which provides a centralized data management solution.

2. **Training a Decision Tree in R**: Next, we will utilize R to train a decision tree model on the loaded dataset, showcasing the integration of R within our workflow.

3. **Logging with MLflow**: We will log our model training process and metrics using MLflow, which helps in tracking experiments and managing the machine learning lifecycle.

4. **Registering the Model to Unity Catalog using Python**: Finally, we will register the trained model back to Unity Catalog using Python, ensuring that it is available for future use and deployment.

Let's get started!

# This cell loads all required R libraries for Spark, MLflow, modeling, and model serialization.

In [0]:
# Load required libraries
library(sparklyr)   # Interface to Apache Spark for big data processing
library(dplyr)      # Data manipulation and transformation functions
library(mlflow)     # MLflow for tracking experiments and managing models
library(rpart)      # Functions for creating decision tree models
library(carrier)    # Serialization of models for use with MLflow



# Spark Connection and Data Loading from Unity Catalog

This section describes how to establish a connection to Apache Spark and load data from Unity Catalog. Unity Catalog is a unified governance solution for all data assets in the Databricks Lakehouse. It provides a centralized way to manage and access data across various sources.

To connect to Spark, we typically use the following code snippet:


In [0]:
# Connect to Spark using Databricks integration
spark <- spark_connect(method = "databricks")

# Load the iris dataset from Unity Catalog table
iris_tbl <- tbl(spark, "pedroz_e2edata_dev.default.iris_data")

# Collect the data into an R data frame for local modeling
iris_df <- collect(iris_tbl)



In [0]:
# Display the loaded iris data for inspection to understand its structure and contents
display(iris_df)

sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,species,id
5.1,3.5,1.4,0.2,0.0,1.0
4.9,3.0,1.4,0.2,0.0,2.0
4.7,3.2,1.3,0.2,0.0,3.0
4.6,3.1,1.5,0.2,0.0,4.0
5.0,3.6,1.4,0.2,0.0,5.0
5.4,3.9,1.7,0.4,0.0,6.0
4.6,3.4,1.4,0.3,0.0,7.0
5.0,3.4,1.5,0.2,0.0,8.0
4.4,2.9,1.4,0.2,0.0,9.0
4.9,3.1,1.5,0.1,0.0,10.0




This cell ensures that the target column is treated as a factor for classification tasks.  

Converting the target variable into a factor is essential for classification algorithms, as it allows the model to understand that the output variable is categorical.

In [0]:
# Ensure species is a factor
iris_df$species <- as.factor(iris_df$species)



# Split the data into train and test

In [0]:
# Split the data into training and test sets for model evaluation
set.seed(42)  # Set seed for reproducibility of random sampling
train_idx <- sample(seq_len(nrow(iris_df)), size = 0.8 * nrow(iris_df))  # Randomly select 80% of the data for training
train_df <- iris_df[train_idx, ]  # Create training dataset using the selected indices
test_df <- iris_df[-train_idx, ]  # Create test dataset using the remaining 20% of the data



This cell trains a Decision Tree Classifier using the training data.  
The 'id' column is excluded from the training process to ensure that only relevant features are used for model training.

In [0]:
# Train a decision tree classifier on the iris data
# The model predicts 'species' using the four feature columns: sepal_length_cm, sepal_width_cm, petal_length_cm, and petal_width_cm.
# The 'id' column is excluded from training because it does not provide relevant information for predicting the species.
model <- rpart(
  species ~ sepal_length_cm + sepal_width_cm + petal_length_cm + petal_width_cm,
  data = iris_df,
  method = "class"
)
# The id column is excluded from training



This cell evaluates the model on the test set and prints the accuracy.

In [0]:
# Predict species on the test set using the trained model
pred <- predict(model, test_df, type = "class")

# Compute and print the accuracy of the model on the test set
accuracy <- mean(pred == test_df$species)
cat(sprintf("Test accuracy: %.3f\n", accuracy))

Test accuracy: 0.967

This cell sets up the MLflow experiment for tracking runs.  
MLflow is a platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment.  
By setting up an experiment, we can log parameters, metrics, and artifacts for each run, making it easier to compare and analyze results.

In [0]:
# Set the MLflow experiment for tracking model runs, which allows for organizing and comparing different runs of the model training process
mlflow_set_experiment("/Users/pedro.zanlorensi@databricks.com/my_custom_iris_r_experiment")

[1] "3458162921732935"

This cell defines the model signature for MLflow logging and Unity Catalog registration.

In [0]:
# Define the model signature for the iris classifier
# This specifies the input and output schema for MLflow and Unity Catalog
# Inputs: sepal_length, sepal_width, petal_length, petal_width (all of type double)
# Output: species (of type string)
signature <- list(
  inputs = list(
    list(type = "double", name = "sepal_length"),
    list(type = "double", name = "sepal_width"),
    list(type = "double", name = "petal_length"),
    list(type = "double", name = "petal_width")
  ),
  outputs = list(
    list(type = "string")
  )
)



## Patch mlflow_log_model to support the signature argument for Unity Catalog registration
This workaround is necessary to ensure that the model signature is logged correctly for compatibility with Unity Catalog, which requires specific metadata to be present.
For more info, check out this great blog post by : [1  Log R Models to Unity Catalog
](https://zacdav-db.github.io/dbrx-r-compendium/chapters/mlflow/log-to-uc.html)

In [0]:
# Patch mlflow_log_model to support the signature argument for Unity Catalog registration
mlflow_log_model <- function(model, artifact_path, signature = NULL, ...) {
  format_signature <- function(signature) {
    lapply(signature, function(x) {
      jsonlite::toJSON(x, auto_unbox = TRUE)
    })
  }
  temp_path <- fs::path_temp(artifact_path)
  model_spec <- mlflow_save_model(
    model, path = temp_path, model_spec = list(
      utc_time_created = mlflow:::mlflow_timestamp(),
      run_id = mlflow:::mlflow_get_active_run_id_or_start_run(),
      artifact_path = artifact_path, 
      flavors = list(),
      signature = format_signature(signature)
    ), ...
  )
  res <- mlflow_log_artifact(path = temp_path, artifact_path = artifact_path)
  tryCatch({
    mlflow:::mlflow_record_logged_model(model_spec)
  },
  error = function(e) {
    warning(
      paste("Logging model metadata to the tracking server has failed, possibly due to older",
            "server version. The model artifacts have been logged successfully.",
            "In addition to exporting model artifacts, MLflow clients 1.7.0 and above",
            "attempt to record model metadata to the  tracking store. If logging to a",
            "mlflow server via REST, consider  upgrading the server version to MLflow",
            "1.7.0 or above.", sep=" ")
    )
  })
  res
}

# Override the function in the mlflow namespace
assignInNamespace("mlflow_log_model", mlflow_log_model, ns = "mlflow")



This cell logs the model and metrics to MLflow, including the signature, and ends the run.

In [0]:
# Log the model and metrics to MLflow, including the model signature
run <- mlflow_start_run()  # Start a new MLflow run to track the experiment

mlflow_log_metric("accuracy", accuracy)  # Log the accuracy metric to MLflow for tracking performance

# Wrap the rpart model in a crate object for MLflow logging
r_func <- carrier::crate(
  function(newdata) {
    stats::predict(model, newdata = newdata, type = "class")  # Define prediction function for the model
  },
  model = model  # Pass the trained model to the crate
)

# Log the model with the defined signature for Unity Catalog compatibility
mlflow_log_model(r_func, "iris_r_class_model", signature = signature)  # Log the model with its signature for reproducibility

mlflow_end_run()  # End the MLflow run to finalize the logging of metrics and model

Uploading artifacts:   0%|          | 0/2 [00:00<?, ?it/s]Uploading artifacts:  50%|█████     | 1/2 [00:00<00:00,  3.44it/s]Uploading artifacts:  50%|█████     | 1/2 [00:00<00:00,  3.44it/s]Uploading artifacts: 100%|██████████| 2/2 [00:00<00:00,  4.96it/s]Uploading artifacts: 100%|██████████| 2/2 [00:00<00:00,  4.96it/s]Uploading artifacts: 100%|██████████| 2/2 [00:00<00:00,  4.65it/s]
2025/12/15 18:50:15 INFO mlflow.store.artifact.cli: Logged artifact from local dir /tmp/RtmpnetyvW/iris_r_class_model to artifact_path=iris_r_class_model
Root URI: dbfs:/databricks/mlflow-tracking/3458162921732935/a3934b5d88d041518e549ec4375ee0b3/artifacts
# A tibble: 1 × 13
  run_id              run_uuid experiment_id run_name status start_time         
  <chr>               <chr>    <chr>         <chr>    <chr>  <dttm>             
1 a3934b5d88d041518e… a3934b5… 345816292173… worried… FINIS… 2025-12-15 18:50:12
# ℹ 7 more variables: end_time <dttm>, artifact_uri <chr>,
#   lifecycle_stage <chr>, 

This cell retrieves the MLflow run ID for use in model registration.  

The run ID is essential for tracking the model's training process and ensuring that the correct model version is registered in MLflow.

In [0]:
# Retrieve the MLflow run ID for use in model registration, which allows tracking and versioning of the model associated with this specific run
run$run_id

[1] "a3934b5d88d041518e549ec4375ee0b3"

# Save the model to the UC using Python

The next cell upgrades the MLflow Python client to ensure support for Unity Catalog. This is important for leveraging the features and functionalities provided by Unity Catalog in your machine learning workflows.

In [0]:
%python

%pip install --upgrade "mlflow[databricks]==3.5.0"

dbutils.library.restartPython()

Collecting mlflow==3.5.0 (from mlflow[databricks]==3.5.0)
  Downloading mlflow-3.5.0-py3-none-any.whl.metadata (30 kB)
Collecting mlflow-skinny==3.5.0 (from mlflow==3.5.0->mlflow[databricks]==3.5.0)
  Downloading mlflow_skinny-3.5.0-py3-none-any.whl.metadata (31 kB)
Collecting mlflow-tracing==3.5.0 (from mlflow==3.5.0->mlflow[databricks]==3.5.0)
  Downloading mlflow_tracing-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting fastmcp<3,>=2.0.0 (from mlflow==3.5.0->mlflow[databricks]==3.5.0)
  Downloading fastmcp-2.14.1-py3-none-any.whl.metadata (20 kB)
Collecting authlib>=1.6.5 (from fastmcp<3,>=2.0.0->mlflow==3.5.0->mlflow[databricks]==3.5.0)
  Downloading authlib-1.6.6-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting cyclopts>=4.0.0 (from fastmcp<3,>=2.0.0->mlflow==3.5.0->mlflow[databricks]==3.5.0)
  Downloading cyclopts-4.3.0-py3-none-any.whl.metadata (12 kB)
Collecting exceptiongroup>=1.2.2 (from fastmcp<3,>=2.0.0->mlflow==3.5.0->mlflow[databricks]==3.5.0)
  Downloading exceptiongro

This cell registers the R model to Unity Catalog using the MLflow Python API and sets the 'champion' alias.

In [0]:
%python
import mlflow
from mlflow.tracking import MlflowClient

# Set the MLflow registry URI to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

run_id = "a3934b5d88d041518e549ec4375ee0b3"  # Replace with your actual run ID
artifact_path = "iris_r_class_model"  # Replace with your actual artifact path

# Build the run URI using the run ID and artifact path
run_uri = f"runs:/{run_id}/{artifact_path}"

catalog = "pedroz_e2edata_dev"
schema = "default"
model_name = "iris_r_class_model_uc"

# Construct the full model name for Unity Catalog
full_model_name_uc = f"{catalog}.{schema}.{model_name}"

# Register the R model to Unity Catalog using the run URI
registered_model = mlflow.register_model(run_uri, full_model_name_uc)

# Initialize the MLflow client to interact with the model registry
client = MlflowClient(registry_uri="databricks-uc")

# Set the 'champion' alias for the registered model to indicate its status
client.set_registered_model_alias(
    name=full_model_name_uc,
    alias="champion",
    version=registered_model.version
)

Registered model 'pedroz_e2edata_dev.default.iris_r_class_model_uc' already exists. Creating a new version of this model...


Downloading artifacts:   0%|          | 0/2 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/2 [00:00<?, ?it/s]

🔗 Created version '7' of model 'pedroz_e2edata_dev.default.iris_r_class_model_uc': https://adb-4181970831265458.18.azuredatabricks.net/explore/data/models/pedroz_e2edata_dev/default/iris_r_class_model_uc/version/7?o=4181970831265458
