# MLflow on Databricks

Thursday, 8 May, 2025

[Invitation on Luma](https://lu.ma/hictrxvv), [LinkedIn](https://www.linkedin.com/groups/9307761/), [Meetup](https://www.meetup.com/warsaw-data-engineering/events/307662118/)


# Agenda

1. The official product page of [Managed MLflow](https://www.databricks.com/product/managed-mlflow)
1. Intro to MLflow and Databricks Machine Learning
1. MLflow's [examples/databricks](https://github.com/mlflow/mlflow/tree/master/examples/databricks)
1. What It Takes to Execute MLflow's `dev/pyproject.py` Locally

Total meeting duration: **1h 15min**


# Event Question

[What would you like to hear about at the meetup? Pitch an interesting idea for future editions](https://www.meetup.com/warsaw-data-engineering/events/307662118/attendees/) 🙏

1. Interesting problems encountered in production with big data pipelines
1. CI/CD with DAB
1. What's new in Databricks
1. How to use vector search and your own models
1. Learn more about MLflow

# 📢 News

Things worth watching out for...


## New members in Warsaw Data Engineering!

[You now have 591 members!](https://www.meetup.com/warsaw-data-engineering/)

What interested you in the Warsaw Data Engineering Meetup that made you decide to join?

1. I work as a Data Engineer and this topic is close to me
1. The topics


## New Versions

What has changed in the tooling space since we last met? In other words, hunting down features worth learning more about.

* [DSPy 2.6.23](https://github.com/stanfordnlp/dspy/releases/tag/2.6.23)
    * [Support streaming in async DSPy program](https://github.com/stanfordnlp/dspy/pull/8144)
    * [Support token streaming with json adapter](https://github.com/stanfordnlp/dspy/pull/8158)
    * [Utility that converts async stream to sync stream](https://github.com/stanfordnlp/dspy/pull/8162)
* [PydanticAI 0.1.10](https://github.com/pydantic/pydantic-ai/releases/tag/v0.1.10)
    * [Handle multi-modal and error responses from MCP tool calls](https://github.com/pydantic/pydantic-ai/pull/1618)
    * [Store additional usage details from Anthropic](https://github.com/pydantic/pydantic-ai/pull/1549)
* [OpenAI Agents SDK 0.0.14](https://github.com/openai/openai-agents-python/releases/tag/v0.0.14)
    * [Add usage to context in streaming](https://github.com/openai/openai-agents-python/pull/595)
* [Dagster 1.10.13](https://github.com/dagster-io/dagster/releases/tag/1.10.13)
    * [Added Scala Spark / Dagster Pipes guide](https://docs.dagster.io/guides/build/external-pipelines/scalaspark-pipeline)


# MLflow and Databricks Machine Learning

It all started with [Manage model lifecycle in Unity Catalog](https://docs.databricks.com/aws/en/machine-learning/manage-model-lifecycle/) and [Tutorials: Get started with AI and machine learning](https://docs.databricks.com/aws/en/machine-learning/ml-tutorials).


## Discovery of the Day

A data engineer's take on the matter:

> The key is to think about a **model training workload** as Python code and an **ML model** as a directory with a bunch of files.


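```py
import mlflow

# Start a run; it stays active until end_run() is called
mlflow.start_run()

# Grab a handle to the active run while it is still open
model_run = mlflow.active_run()

mlflow.end_run()

# Inspect the run's metadata (run ID, experiment ID, artifact URI, ...)
print(model_run.info)
```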
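
To make the "model is a directory" idea concrete, here is a minimal standard-library sketch that lists the files of a model directory. The hypothetical `fake_model_dir` helper writes placeholder files whose names (`MLmodel`, `model.pkl`, `requirements.txt`) mimic what MLflow's scikit-learn flavor typically produces; they are created locally and are not real MLflow output.

```py
import tempfile
from pathlib import Path


def fake_model_dir(root: Path) -> Path:
    """Create a stand-in model directory (hypothetical helper, placeholder contents)."""
    model_dir = root / "model"
    model_dir.mkdir()
    for name in ("MLmodel", "model.pkl", "requirements.txt"):
        (model_dir / name).write_text(f"placeholder for {name}\n")
    return model_dir


def list_model_files(model_dir: Path) -> list[str]:
    """Return the relative paths of all files in a model directory, sorted."""
    return sorted(
        str(p.relative_to(model_dir)) for p in model_dir.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    model_files = list_model_files(fake_model_dir(Path(tmp)))

print(model_files)
```

Pointing `list_model_files` at a model directory downloaded from a real run should show the same kind of picture: a handful of plain files.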


# MLflow's examples/databricks

[examples/databricks](https://github.com/mlflow/mlflow/tree/master/examples/databricks)


## Step 0. Clone MLflow Repo

`git clone https://github.com/mlflow/mlflow`


## Step 1. Install Dependencies


```
uv pip install databricks-connect
uv pip install scikit-learn
```


### ❤️ Editable Install

[Development Mode (a.k.a. “Editable Installs”)](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)


```
uv pip install -e .
```
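
One way to confirm that an editable install is in effect is to check where Python imports the package from: the path should typically point into the cloned repository rather than `site-packages`. A small sketch, where `module_origin` is a hypothetical helper and `json` stands in for `mlflow` so the snippet runs anywhere:

```py
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None if not installed."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Replace "json" with "mlflow" after `uv pip install -e .`
print(module_origin("json"))
```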


## Step 2. Run Experiment


```
❯ python examples/databricks/dbconnect.py --cluster-id xxx
2025/05/08 17:51:04 INFO mlflow.tracking.fluent: Experiment with name '/Users/jacek@japila.pl/dbconnect' does not exist. Creating a new experiment.
🏃 View run smiling-ox-667 at: https://curriculum-dev.cloud.databricks.com/ml/experiments/1275781889574864/runs/b88fd8406e7d410bac8992258093ef5d
🧪 View experiment at: https://curriculum-dev.cloud.databricks.com/ml/experiments/1275781889574864
Traceback (most recent call last):
  File "/Users/jacek/oss/mlflow/examples/databricks/dbconnect.py", line 56, in <module>
    main()
    ~~~~^^
  File "/Users/jacek/oss/mlflow/examples/databricks/dbconnect.py", line 37, in main
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)
TypeError: log_model() got an unexpected keyword argument 'name'
```
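
The `TypeError` suggests that the example on `master` and the installed MLflow disagree on the `log_model` signature. A plausible explanation, stated as an assumption rather than a confirmed diagnosis: the script targets a newer API where the model is named with `name=`, while the installed release still expects `artifact_path=`. The sketch below shows how `inspect.signature` can pick whichever keyword a function accepts; `log_model_old` and `log_model_new` are hypothetical stand-ins, not MLflow functions.

```py
import inspect


def log_model_old(model, artifact_path=None, signature=None):
    """Hypothetical stand-in for an older API that takes `artifact_path`."""
    return f"logged under {artifact_path}"


def log_model_new(model, name=None, signature=None):
    """Hypothetical stand-in for a newer API that takes `name`."""
    return f"logged under {name}"


def model_name_kwarg(log_model) -> str:
    """Return the keyword the given function accepts for the model's name."""
    params = inspect.signature(log_model).parameters
    return "name" if "name" in params else "artifact_path"


for fn in (log_model_old, log_model_new):
    kwarg = model_name_kwarg(fn)
    print(fn.__name__, "->", fn(object(), **{kwarg: "model"}))
```

In practice, matching the installed MLflow version to the version the example was written for is the cleaner fix; the signature check is mostly a diagnostic.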

# What It Takes to Execute MLflow's dev/pyproject.py Locally

1. What I learnt while reviewing MLflow's source code, where I found [dev/pyproject.py](https://github.com/mlflow/mlflow/blob/master/dev/pyproject.py) and tried to execute it locally.
1. How uv helped.

Why does it even matter?! 🤨


## Step 0. Clone MLflow Repo

`git clone https://github.com/mlflow/mlflow`


## Step 1. uvx python dev/pyproject.py

<br>

```
❯ uvx python dev/pyproject.py
Traceback (most recent call last):
  File "/Users/jacek/oss/mlflow/./dev/pyproject.py", line 10, in <module>
    import toml
ModuleNotFoundError: No module named 'toml'
```
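
`uvx` runs the command in an isolated, temporary environment, so nothing the script imports beyond the standard library is available unless it is requested explicitly. Before installing anything, it can help to check which imports the current interpreter can resolve. A minimal sketch: `missing_modules` is a hypothetical helper, and the module list is illustrative (only `toml` is confirmed by the traceback above).

```py
import importlib.util


def missing_modules(names):
    """Return the names that cannot be imported by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "toml" comes from the traceback; "json" is a standard-library control
print(missing_modules(["json", "toml"]))
```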


## Step 2. Set Up Dev Env


`uv venv .dev_pyproject_py_deep_dive`

`source .dev_pyproject_py_deep_dive/bin/activate`


## Step 3. Virtual Envs in Python

Please note that I'm a JVM dev (and only very recently switched to Python).


`uv pip install toml`

`python ./dev/pyproject.py`

Running `type python` is when it finally clicked how virtual envs work 🔥

[venv — Creation of virtual environments](https://docs.python.org/3/library/venv.html)
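
What `type python` shows in the shell can also be seen from inside Python: activating a virtual environment puts its `bin` directory first on `PATH`, and the interpreter found there reports a `sys.prefix` that differs from `sys.base_prefix`. A small sketch (`in_virtualenv` is a hypothetical helper name):

```py
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("prefix:     ", sys.prefix)
print("base prefix:", sys.base_prefix)
print("in venv:    ", in_virtualenv())
```

Running it before and after `source .dev_pyproject_py_deep_dive/bin/activate` should show the interpreter path and prefix switching to the virtual environment.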

## Step 4. It Works 🥳


`uv pip install pyyaml`

> ⚠️ NOTE
>
> All the dev deps are in [dev/requirements.txt](https://github.com/mlflow/mlflow/blob/master/dev/requirements.txt)

`uv pip install packaging`

`brew install taplo`

`python ./dev/pyproject.py` seems to change nothing, huh?! 🤨

💎 Think about what the script does and you will know why nothing seems to have changed 😉
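
One reading of the hint, offered as an assumption about the script rather than a description of its code: if a generator rewrites files whose content is already up to date, a rerun leaves the working tree unchanged. The sketch below illustrates that idea with a hypothetical `write_if_changed` helper and made-up file content.

```py
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path; return True only if the file's content changed."""
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "generated.toml"
    first_run = write_if_changed(target, 'name = "demo"\n')
    second_run = write_if_changed(target, 'name = "demo"\n')

print(first_run, second_run)  # True False
```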

# That's all Folks 👋

![Warner Bros., Public domain, via Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/e/ea/Thats_all_folks.svg)


# 💡 Ideas for Future Events

1. [Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/index.html) with uv and pydantic
1. Explore more [Pydantic](https://docs.pydantic.dev/latest/) features
1. Create a new DAB template with `uv` as the project management tool (based on `default-python` template). Start from `databricks bundle init --help`.



## MLflow Prompt Registry

In [MLflow 2.21.0](https://github.com/mlflow/mlflow/releases/tag/v2.21.0):

>  **Prompt Registry**: MLflow Prompt Registry is a powerful tool that streamlines prompt engineering and management in your GenAI applications. It enables you to version, track, and reuse prompts across your organization.

[MLflow Prompt Registry](https://mlflow.org/docs/latest/prompts/)

## MLflow Tracing

In [MLflow 2.21.0](https://github.com/mlflow/mlflow/releases/tag/v2.21.0):

>  **Enhanced Tracing Capabilities**: MLflow Tracing now supports synchronous/asynchronous generators and auto-tracing for Async OpenAI, providing more flexible and comprehensive tracing options.

[MLflow Tracing for LLM Observability](https://mlflow.org/docs/latest/tracing/)


## Databricks Asset Bundles and Library Dependencies

[PyPI package](https://docs.databricks.com/aws/en/dev-tools/bundles/library-dependencies#pypi-package)

Databricks CLI v0.244.0: [Support all version identifiers as per PEP440 in environment deps](https://github.com/databricks/cli/releases/tag/v0.244.0)


## Databricks Asset Bundles and Setting the Target Catalog and Schema

Databricks CLI v0.243.0: [Use schema field for pipeline in builtin template](https://github.com/databricks/cli/releases/tag/v0.243.0):

> The schema field implies the lifecycle of tables is no longer tied to the lifecycle of the pipeline, as was the case with the target field.

[Set the target catalog and schema](https://docs.databricks.com/aws/en/dlt/target-schema)

## uv with PyTorch

uv 0.6.9: [Add experimental --torch-backend to the PyTorch guide](https://github.com/astral-sh/uv/releases/tag/0.6.9)

[Using uv with PyTorch](https://docs.astral.sh/uv/guides/integration/pytorch/)