# Trace, Evaluate, Iterate: Building Reliable AI Systems with MLflow

In this demo, we show how to use MLflow to trace and evaluate an AI social media style transfer system. MLflow provides a suite of tools for GenAI application development, which poses several common challenges:

- **Different combinations of models, prompts, and inference parameters can have huge impacts on a GenAI application's performance.**
- **It can be difficult to define what makes a GenAI model's responses "good" or "bad," making it challenging to evaluate GenAI application performance.**
- **Identifying sources of errors or breakdowns in application logic is very difficult when dealing with complex GenAI applications.**

We will see how MLflow can be used to address these common challenges in GenAI application development.

## Problem Setup: Generating Social Media Posts in Your Voice with AI
- By default, AI models will generate writing that does not sound like it was written by you or by your organization.
- In many cases, it will obviously be AI generated.
- It might also contain significant inaccuracies.

In this notebook, we will use GenAI models to write social media posts in a specified voice and style. We will use MLflow to keep track of our experiments, make sure the posts are rigorously grounded in source material, and evaluate how well they correspond to the target style.

## Setup
- Install Packages
- We will be using models available through the Foundation Models API, so there is no need to configure API keys for a GenAI model provider.

In [0]:
%pip install -q mlflow==2.21.3 openai==1.75.0 requests==2.31.0 markdownify
%restart_python

## Generating our First Post
We will need:

- An **example post** to use as a style reference (eventually, we will use a set of multiple examples)
- A **source document or website** with content to use as a source of information for a new post. We will write a helper function that can pull the text from a webpage and convert it to markdown.
- A **template prompt** to tie the above information together
- A **system prompt** explaining the task to the model

Let's try the following. Feel free to experiment with your own prompts and ideas.



In [0]:
import requests
from markdownify import markdownify

system_prompt_1 = """You are a social media content specialist who can precisely match writing styles. Your task is to:
1. Analyze the provided example post(s) to understand their style and tone
2. Generate a new LinkedIn post about the given topic that perfectly matches this style
3. Return only the generated post, nothing else.
"""

user_template = """
example posts:
{example_posts}

topic:
{topic}

additional instructions:
{additional_instructions}
"""


# Helper function to convert a webpage to markdown
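def webpage_to_markdown(url):
    # Get webpage content; fail fast on HTTP errors instead of converting an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text

    # Convert to markdown
    markdown_content = markdownify(html_content)

    return markdown_content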

# Function to assemble the prompt
def generate_prompt(
    system, user_template, example_posts, topic, additional_instructions
):
    """Generate a prompt for the LLM based on the example posts, topic, and additional instructions."""
    example_posts = "\n".join(
        [f"Example {i+1}:\n{post}" for i, post in enumerate(example_posts)]
    )
    prompt = user_template.format(
        example_posts=example_posts,
        topic=topic,
        additional_instructions=additional_instructions,
    )

    formatted_prompt = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    return formatted_prompt
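
Before sending anything to a model, it can help to inspect the assembled messages. The sketch below is a self-contained check that re-declares a shortened template; the `demo_` names and sample strings are illustrative assumptions, not part of the notebook. It also shows one pitfall of `str.format`: a list passed as `topic` is interpolated as its Python repr, brackets and quotes included, so a plain string is usually what you want.

```python
# Shortened stand-in for the notebook's user_template (illustrative only).
demo_template = """
example posts:
{example_posts}

topic:
{topic}
"""


def demo_build_messages(system, template, example_posts, topic):
    """Assemble chat messages the same way generate_prompt does."""
    numbered = "\n".join(
        f"Example {i + 1}:\n{post}" for i, post in enumerate(example_posts)
    )
    user_content = template.format(example_posts=numbered, topic=topic)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]


# Passing the topic as a plain string interpolates the text itself.
as_string = demo_build_messages("Be concise.", demo_template, ["Post A"], "MLflow docs")

# Passing a one-element list interpolates the list's repr instead.
as_list = demo_build_messages("Be concise.", demo_template, ["Post A"], ["MLflow docs"])

print(as_string[1]["content"])
print("['MLflow docs']" in as_list[1]["content"])
```

Printing the user message once before the first model call is a cheap way to catch formatting surprises like this.
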

### Querying an AI Model—Without Tracing

In [0]:
url = "https://mlflow.org/docs/latest/llms/chat-model-intro/index.html"
markdown_content = webpage_to_markdown(url)

additional_instructions = """This post will be written for the MLflow LinkedIn account.
Maintain a professional but approachable tone. Developers are the primary audience, so keep 
the content technically focused but accessible."""

example_post = """MLflow's GenAI evaluation metrics now work as callable functions as of MLflow 2.17, making them easier to use and integrate.

Now you can use metrics like answer_relevance, answer_correctness, faithfulness, and toxicity directly as functions—no need to go through mlflow.evaluate() anymore if you're just prototyping with individual metrics or integrating metric calls into systems where mlflow.evaluate is not necessary.

This means:

🔍 Easier debugging during prototyping
🔌 More flexible integration options
🎯 Works with or without other MLflow features

Learn more:
📚 Docs: https://lnkd.in/gyBzcrDr
📝 Release notes: https://lnkd.in/gBrNQfFC

#MachineLearning #AI #LLMs #LLMOps #Evals"""

In [0]:
from databricks.sdk import WorkspaceClient
from openai import OpenAI

SERVING_ENDPOINT = "https://<your-workspace-name>.cloud.databricks.com/serving-endpoints"

w = WorkspaceClient()

# Create a temporary token
tmp_token = w.tokens.create(lifetime_seconds=1200).token_value

# Set up the OpenAI client to use Databricks Model Serving
client = OpenAI(
    api_key=tmp_token,
    base_url=SERVING_ENDPOINT,
)

In [0]:
messages = generate_prompt(
    system_prompt_1,
    user_template,
    [example_post],
    markdown_content,  # pass the page text itself; a list would be rendered as its repr
    additional_instructions,
)
response = client.chat.completions.create(
    model="databricks-llama-4-maverick",
    # model = "databricks-meta-llama-3-3-70b-instruct",
    messages=messages,
)

response

### With Tracing

In [0]:
import mlflow

EXPERIMENT_NAME = "<your-experiment-name>"

mlflow.openai.autolog()
mlflow.set_experiment(EXPERIMENT_NAME)

response = client.chat.completions.create(
    model="databricks-llama-4-maverick",
    messages=messages,
)

## Capture the Full Execution Flow

We are not limited to tracing the AI inputs and outputs. We can also trace the `webpage_to_markdown` and `generate_prompt` functions. This can help us to identify bugs anywhere in the execution flow.

Let's update these functions by adding the `@mlflow.trace` decorator.

In [0]:
# trace the webpage_to_markdown function
@mlflow.trace(span_type="FUNCTION")
def webpage_to_markdown(url):
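    # Get webpage content; fail fast on HTTP errors instead of converting an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text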

    # Convert to markdown
    markdown_content = markdownify(html_content)

    return markdown_content


# trace the generate_prompt function
@mlflow.trace(span_type="FUNCTION")
def generate_prompt(
    system, user_template, example_posts, topic, additional_instructions
):
    """Generate a prompt for the LLM based on the example posts, topic, and additional instructions."""
    example_posts = "\n".join(
        [f"Example {i+1}:\n{post}" for i, post in enumerate(example_posts)]
    )
    prompt = user_template.format(
        example_posts=example_posts,
        topic=topic,
        additional_instructions=additional_instructions,
    )

    formatted_prompt = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    return formatted_prompt


# create a parent span
with mlflow.start_span(name="Generate Post", span_type="CHAIN") as span:
    markdown_content = webpage_to_markdown(url)
    messages = generate_prompt(
        system_prompt_1,
        user_template,
        [example_post],
        markdown_content,
        additional_instructions,
    )
    response = client.chat.completions.create(
        model="databricks-llama-4-maverick",
        messages=messages,
    )

response
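
To build intuition for what a trace records, here is a toy decorator written with the standard library only. It is not MLflow's implementation and does not use MLflow at all; `toy_trace`, `SPANS`, and the two demo functions are hypothetical names. It records each call's name, inputs, output, status, and duration, which is roughly the kind of information a span exposes when you are debugging.

```python
import functools
import time

SPANS = []  # hypothetical in-memory store; a real tracer persists spans elsewhere


def toy_trace(func):
    """Record name, inputs, output, status, and duration of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        span = {"name": func.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        try:
            span["output"] = func(*args, **kwargs)
            span["status"] = "OK"
            return span["output"]
        except Exception as exc:
            # Keep the failure visible in the span, then re-raise unchanged.
            span["status"] = "ERROR"
            span["error"] = repr(exc)
            raise
        finally:
            span["duration_s"] = time.perf_counter() - start
            SPANS.append(span)

    return wrapper


@toy_trace
def fetch_text(url):
    # Stand-in for a page download; no network access happens here.
    return f"<h1>content of {url}</h1>"


@toy_trace
def build_prompt(text):
    return [{"role": "user", "content": text}]


messages_demo = build_prompt(fetch_text("https://example.com"))
for span in SPANS:
    print(span["name"], span["status"])
```

Unlike this toy, the traced functions above are also nested under a parent span, so the whole request can be read as one tree.
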

## Evaluating the Generated Post
- Suppose that during our experimentation we found a couple of prompts and a couple of models that were especially promising, and we want to compare them to determine which combination is best.
- Factual correctness and style similarity are our highest priorities.
- We will use the built-in `faithfulness` metric to make sure the generated texts are true to the source texts.
- We will create a custom `style_similarity_metric` to measure the similarity between the generated post and the example posts.
- Both of these are *LLM-as-a-judge* metrics, meaning that another large language model is used to grade the outputs.

### Built-in Metrics

In [0]:
from mlflow.metrics.genai import faithfulness

faithfulness_metric = faithfulness(model="endpoints:/databricks-claude-3-7-sonnet")

result = response.choices[0].message.content

print(
    faithfulness_metric(
        predictions=result,
        inputs=additional_instructions,  # ignored
        context=markdown_content,
    )
)

### Custom Metric: Style Similarity

In [0]:
post_example_1 = """MLflow's GenAI evaluation metrics now work as callable functions as of MLflow 2.17, making them easier to use and integrate.

Now you can use metrics like answer_relevance, answer_correctness, faithfulness, and toxicity directly as functions—no need to go through mlflow.evaluate() anymore if you're just prototyping with individual metrics or integrating metric calls into systems where mlflow.evaluate is not necessary.

This means:

🔍 Easier debugging during prototyping
🔌 More flexible integration options
🎯 Works with or without other MLflow features

Check it out in action ⬇️

Learn more:
📚 Docs: https://lnkd.in/gyBzcrDr
📝 Release notes: https://lnkd.in/gBrNQfFC

#MachineLearning #AI #LLMs #LLMOps #Evals"""

post_example_2 = """If you're already building with Python ML libraries, adding mlflow.autolog() to your code instantly gives you production-grade experiment tracking, model management, and observability—no extra infrastructure or code changes needed.

The automatic logging works across a remarkable breadth of libraries—from GenAI frameworks like LangChain, OpenAI, and LlamaIndex to traditional ML and deep learning libraries like PyTorch, scikit-learn, and Fastai.

MLflow's autolog feature changes this equation. With a single line—mlflow.autolog()—you get automatic logging of:

📊 Training metrics and parameters for scikit-learn, PyTorch, many other ML frameworks
🔄 LLM traces, prompts, responses, and tool calls for OpenAI and LangChain
📦 Model signatures and artifacts
💾 Dataset information and example inputs

The best part is that it works out of the box with the most popular libraries in the Python ML ecosystem: no need to modify your existing training code or add manual logging statements.

Read more: https://lnkd.in/e_aTp6HH

#machinelearning #mlops #ai #llmops"""

post_example_3 = """New tutorial: Step-by-step guide to building a tool-calling LLM application using MLflow's ChatModel wrapper and tracing system.

This tutorial shows you how to:

🔧 Create a tool-calling model using mlflow.pyfunc.ChatModel
🔄 Implement OpenAI function calling with automatic input/output handling
🔍 Add comprehensive tracing to debug multi-step LLM interactions
🚀 Deploy your model with full MLflow lifecycle management

The guide includes a practical example building a weather information agent, showing how ChatModel simplifies complex LLM patterns while providing enterprise-grade observability.

Check out the complete tutorial here: https://lnkd.in/gdTw8N2S

#MLOps #AIEngineering #LLMOps #AI"""

example_posts = [post_example_1, post_example_2, post_example_3]


#### Define Evaluation Examples
- Evaluation examples give the LLM judge guidance on high- and low-scoring outputs.
- Here, we will generate two examples: one following the style of the example posts (technical but approachable) and one with an over-the-top tone and hype-oriented language.

In [0]:
from mlflow.metrics.genai import EvaluationExample

similar_post = """
MLflow's ChatModel and PythonModel classes serve different needs when deploying GenAI applications. Here's when to use each:

ChatModel simplifies GenAI deployment with standardized OpenAI-compatible interfaces. This means:

🔗 Immediate compatibility with existing OpenAI-based tools and workflows
🚀 Pre-defined model signatures that work out of the box
📊 Streamlined integration with MLflow's tracking and evaluation features

PythonModel is your choice when you need complete control over:

🛠️ Custom input/output schemas for specialized use cases
🔄 Complex data transformations beyond standard chat patterns
⚙️ Fine-grained model behavior and deployment configurations

For most conversational AI applications, ChatModel's standardized approach helps you avoid common deployment pitfalls while maintaining consistent interfaces across your GenAI services. Consider PythonModel when your use case requires specialized data handling or custom interaction patterns.

See the comment below for links to in-depth tutorials on ChatModel 👇 

#MLflow #LLMOps #MachineLearning #GenAI"""

dissimilar_post = """
🔥 HOLY MOLY! MLflow Just Dropped Something INSANE for AI Deployment! 🤯
TWO EPIC WAYS to deploy your next-gen AI:
1️⃣ ChatModel: The No-BS Fast Track!

INSTANT OpenAI compatibility 🤝
Zero headaches, works RIGHT NOW 🚀
All the tracking & metrics you're craving 📈

2️⃣ PythonModel: For When You Need to GO WILD!

Customize EVERYTHING 🎨
Transform data like a BOSS 💪
Ultimate control = Ultimate POWER! ⚡️

Don't sleep on this update! Your AI deployment game is about to get ABSOLUTELY CRACKED! 🚀✨
#MLflowGang #AIrevolution #FutureIsNow #TechTwitter
"""

evaluation_example_1 = EvaluationExample(
    input=additional_instructions,
    output=similar_post,
    grading_context={"examples": example_posts},
    score=5,
    justification="This post is a perfect match to the style of the example posts."
)

evaluation_example_2 = EvaluationExample(
    input=additional_instructions,
    output=dissimilar_post,
    grading_context={"examples": example_posts},
    score=1,
    justification=("The post earns a 1/5 for maintaining the basic bullet-point structure and use of emojis, "
                   "but significantly overplays the informal tone with phrases like 'HOLY MOLY!' and "
                   "'ABSOLUTELY CRACKED' that aren't present in the examples. While the example posts balance "
                   "professional enthusiasm with technical detail, this submission sacrifices information "
                   "density for excessive hype and casual language that goes well beyond the controlled "
                   "informality shown in the reference posts.")
)

#### Define the Custom Grading Rubric

In this section, we use the `make_genai_metric` function to define our custom metric. Critically, we must include a metric definition and a grading prompt that tells the judge model how to score the outputs.

In [0]:
from mlflow.metrics.genai import make_genai_metric

style_similarity_metric = make_genai_metric(
    name="style_similarity",
    definition=(
        "Style similarity measures how well a generated social media post matches the style, tone, "
        "and vocabulary of provided example posts. This includes analyzing the similarity of tone, "
        "word choice, punctuation, sentence structure, and stylistic elements like hashtags and emojis. "
        "Content similarity should not factor into the style similarity score. Post length is of minimal importance."
    ),
    grading_prompt=(
        "Style Similarity: Score the generated post's similarity to the example posts on a scale from 0 to 5:\n"
        "- Score 0: No stylistic similarity at all\n"
        "- Score 1: Minimal stylistic similarity\n"
        "- Score 2: Some stylistic elements match but significant differences exist\n"
        "- Score 3: Moderate stylistic similarity in tone, vocabulary, or structure\n"
        "- Score 4: High stylistic similarity across most elements\n"
        "- Score 5: Could be written by the same author\n\n"
        "Consider:\n"
        "- Tone: similarity in voice and attitude\n"
        "- Vocabulary: similarity in word choice and complexity\n"
        "- Style: similarity in punctuation, sentence structure, hashtags, and emojis"
    ),
    examples=[evaluation_example_1, evaluation_example_2],
    version="v1",
    model="endpoints:/databricks-claude-3-7-sonnet",
    parameters={"temperature": 0.0, "max_tokens": 1000},
    aggregations=["mean", "variance", "p90"],
    grading_context_columns=["examples"],
    greater_is_better=True
)
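
The `aggregations` argument controls how per-row scores are summarized. As a rough illustration of what `mean`, `variance`, and `p90` mean for a list of judge scores, the sketch below computes them by hand. It assumes population variance and a linearly interpolated 90th percentile (NumPy's defaults); MLflow's exact implementation may differ, and `percentile` and `summarize_scores` are hypothetical helpers.

```python
import statistics


def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list (NumPy's default method)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize_scores(scores):
    """Summarize per-row judge scores with the three aggregations named above."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),  # population variance (assumption)
        "p90": percentile(scores, 90),
    }


summary = summarize_scores([1, 2, 3, 4, 5])
print(summary)
```

On a 0 to 5 scale with few rows, the variance is often the most informative of the three, since it shows whether a prompt is consistently good or only good on average.
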

#### Test our Custom Metric
- MLflow metrics work as Python callables, so we can invoke the metric directly on an example to test it out.

In [0]:
too_formal_example = """MLflow has introduced distinct deployment paradigms through its ChatModel and PythonModel classes, each serving specific implementation requirements in GenAI applications.
ChatModel implements a standardized deployment framework utilizing OpenAI-compatible interfaces, offering several advantages:

Full compatibility with existing OpenAI infrastructure and workflows
Implementation of predefined model signatures ensuring immediate functionality
Seamless integration with MLflow's comprehensive tracking and evaluation systems

Conversely, PythonModel provides advanced customization capabilities for specialized requirements:

Implementation of bespoke input/output schemas
Advanced data transformation protocols beyond standard conversational patterns
Granular control over model behavior and deployment specifications

For standard conversational AI implementations, ChatModel's structured approach mitigates common deployment challenges while maintaining consistent interfaces across GenAI services. PythonModel remains the optimal choice for implementations requiring specialized data handling protocols or custom interaction patterns.
For detailed implementation guidelines, please refer to the accompanying documentation.
Reference: MLflow Documentation
"""

style_similarity_metric(
    predictions=too_formal_example,
    inputs=additional_instructions,
    examples=[example_posts],
)

## `mlflow.evaluate` for Structured Experimentation

In [0]:
system_prompt_2 = """You are a social media content specialist with expertise in matching writing styles and voice across platforms. Your task is to:

1. Analyze the provided example post(s) by examining:
   - Writing style, tone, and voice
   - Sentence structure and length
   - Use of hashtags, emojis, and formatting
   - Engagement techniques and calls-to-action

2. Generate a new LinkedIn post about the given topic that matches:
   - The identified writing style and tone
   - Similar structure and formatting choices
   - Equivalent use of platform features and hashtags
   - Comparable engagement elements

3. Return only the generated post, formatted exactly as it would appear on LinkedIn, without any additional commentary or explanations."""

system_prompts = {"concise": system_prompt_1, "detailed": system_prompt_2}

### Generate the Evaluation Dataset

In [0]:
import pandas as pd
import openai
from openai import OpenAI
import mlflow
import requests
from markdownify import markdownify

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a temporary token
tmp_token = w.tokens.create(lifetime_seconds=1200).token_value

# Set up the OpenAI client to use Databricks Model Serving
client = OpenAI(
    api_key=tmp_token,
    base_url=SERVING_ENDPOINT,
)

user_template = """
example posts:
{example_posts}

topic:
{topic}


additional instructions:
{additional_instructions}
"""


@mlflow.trace(span_type="FUNCTION")
def webpage_to_markdown(url):
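    # Get webpage content; fail fast on HTTP errors instead of converting an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text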

    # Convert to markdown
    markdown_content = markdownify(html_content)

    return markdown_content


# trace the generate_prompt function
@mlflow.trace(span_type="FUNCTION")
def generate_prompt(
    system, user_template, example_posts, topic, additional_instructions
):
    """Generate a prompt for the LLM based on the example posts, topic, and additional instructions."""
    example_posts = "\n".join(
        [f"Example {i+1}:\n{post}" for i, post in enumerate(example_posts)]
    )
    prompt = user_template.format(
        example_posts=example_posts,
        topic=topic,
        additional_instructions=additional_instructions,
    )

    formatted_prompt = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    return formatted_prompt


def generate_evaluation_dataset(
    system_prompts: dict,
    client: openai.OpenAI,
    models: list,
    additional_instructions: str,
    example_posts: list,
):
    mlflow_pages = {
        "Tutorial: Custom GenAI Models using ChatModel": "https://mlflow.org/docs/latest/llms/chat-model-guide/index.html",
        "MLflow Tracing Schema": "https://mlflow.org/docs/latest/tracing/tracing-schema",
        "MLflow AI Gateway (Experimental)": "https://mlflow.org/docs/latest/llms/deployments/index.html",
        "MLflow LLM Evaluation": "https://mlflow.org/docs/latest/llms/llm-evaluate/index.html",
        "LLM Evaluation with MLflow Example Notebook": "https://mlflow.org/docs/latest/llms/llm-evaluate/notebooks/question-answering-evaluation.html",
        "MLflow Tracing for LLM Observability": "https://mlflow.org/docs/latest/tracing",
        "Deep Learning": "https://mlflow.org/docs/latest/deep-learning/index.html",
        "DSPy Quickstart": "https://mlflow.org/docs/latest/llms/dspy/notebooks/dspy_quickstart.html",
        "Building Custom Python Function Models with MLflow": "https://mlflow.org/docs/latest/traditional-ml/creating-custom-pyfunc/index.html",
        "Quickstart with MLflow PyTorch Flavor": "https://mlflow.org/docs/latest/deep-learning/pytorch/quickstart/pytorch_quickstart.html",
    }
    results = []

    for page_name, page_url in mlflow_pages.items():
        for prompt_name, system_prompt in system_prompts.items():
            for model in models:
                with mlflow.start_span(
                    name="eval_dataset_generation",
                    span_type="CHAIN",
                ) as parent_span:
                    parent_span.set_inputs(
                        {
                            "model": model,
                            "system_prompt": system_prompt,
                            "context_page": page_name,
                        }
                    )
                    page_content = webpage_to_markdown(page_url)
                    messages = generate_prompt(
                        system_prompt,
                        user_template,
                        example_posts,
                        page_content,
                        additional_instructions,
                    )

                    response = client.chat.completions.create(
                        model=model,
                        messages=messages,
                    )

                    results.append(
                        {
                            "model": model,
                            "system_prompt": prompt_name,
                            "context_page": page_name,
                            "context_full": page_content,
                            "additional_instructions": additional_instructions,
                            "output": response.choices[0].message.content,
                            "example_posts": example_posts,
                        }
                    )

                    parent_span.set_outputs(
                        {"output": response.choices[0].message.content}
                    )

    return pd.DataFrame(results)
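
Because `generate_evaluation_dataset` loops over every page, system prompt, and model, the number of model calls grows multiplicatively. The sketch below enumerates the same grid with `itertools.product` so you can estimate the row count before spending tokens. The page names are placeholders; the counts (10 pages, 2 prompts, 2 models) mirror this notebook's configuration.

```python
from itertools import product

pages = [f"page_{i}" for i in range(10)]  # placeholder page names
prompt_names = ["concise", "detailed"]
model_names = [
    "databricks-meta-llama-3-3-70b-instruct",
    "databricks-meta-llama-3-1-8b-instruct",
]

# One row (and one chat completion request) per combination.
grid = list(product(pages, prompt_names, model_names))
print(len(grid))
print(grid[0])
```

Note that the page fetch sits inside the innermost loop, so each page is downloaded once per prompt and model combination. Hoisting the fetch to the outer loop would reduce that to one download per page, at the cost of moving it outside the per-combination span.
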

In [0]:
TABLE_FULL_NAME = "<catalog-name>.<schema-name>.<table-name>"

if not spark.catalog.tableExists(TABLE_FULL_NAME):
    eval_data = generate_evaluation_dataset(
        system_prompts=system_prompts,
        client=client,
        models=[
            #"databricks-llama-4-maverick",
            "databricks-meta-llama-3-3-70b-instruct",
            "databricks-meta-llama-3-1-8b-instruct"],
        additional_instructions=additional_instructions,
        example_posts=example_posts,
    )

    # Convert the Pandas DataFrame to a PySpark DataFrame
    spark_df = spark.createDataFrame(eval_data)

    # Write the eval data to a Delta table
    spark_df.write.format("delta").saveAsTable(TABLE_FULL_NAME)
else:
    eval_data = spark.table(TABLE_FULL_NAME)

### Evaluate the Generated Dataset for Faithfulness and Style Similarity

In [0]:
eval_data = spark.table(TABLE_FULL_NAME)
display(eval_data)

In [0]:
import uuid
import json

# Assume eval_data, style_similarity_metric, faithfulness_metric are defined and valid

run_name = f"full-dataset-eval-{uuid.uuid4()}"
with mlflow.start_run(run_name=run_name) as run:
    run_id = run.info.run_id
    print(f"Started run '{run_name}' ({run_id}).")

    # Evaluate the entire dataframe. Assumes success.
    eval_results = mlflow.evaluate(
        data=eval_data,
        predictions="output",
        extra_metrics=[
            style_similarity_metric,
            faithfulness_metric,
        ],
        evaluators="default",
        evaluator_config={
            "col_mapping": {
                "inputs": "additional_instructions",
                "context": "context_full",
                "examples": "example_posts"
            }
        }
    )
    mlflow.log_param("evaluation_status", "success")

In [0]:
display(eval_results.tables['eval_results_table'])
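
To compare combinations, the per-row scores can be averaged by model and system prompt. The sketch below does this on hand-written toy rows using only the standard library. The score values are invented for illustration, and the column name `style_similarity/v1/score` is an assumption about how the results table names metric columns, so check the actual column names first. With a pandas DataFrame, a `groupby` on the same two columns would do the same job.

```python
from collections import defaultdict
from statistics import fmean

SCORE_COL = "style_similarity/v1/score"  # assumed column name; verify in your table

# Toy rows standing in for the evaluation results table (values are made up).
rows = [
    {"model": "model-a", "system_prompt": "concise", SCORE_COL: 4},
    {"model": "model-a", "system_prompt": "concise", SCORE_COL: 5},
    {"model": "model-a", "system_prompt": "detailed", SCORE_COL: 3},
    {"model": "model-b", "system_prompt": "concise", SCORE_COL: 2},
    {"model": "model-b", "system_prompt": "detailed", SCORE_COL: None},  # no score
]


def mean_score_by_combo(rows, score_col):
    """Average a score column per (model, system_prompt), skipping missing scores."""
    buckets = defaultdict(list)
    for row in rows:
        if row[score_col] is not None:
            buckets[(row["model"], row["system_prompt"])].append(row[score_col])
    return {combo: fmean(scores) for combo, scores in buckets.items()}


means = mean_score_by_combo(rows, SCORE_COL)
for combo, score in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(combo, score)
```

Running the same grouping on the faithfulness scores gives a second ranking; a combination is only a good candidate if it does well on both.
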