# GenAI Tracing and Evaluation with MLflow 3 

https://mlflow.org/docs/latest/genai/data-model/experiments 


In [1]:
%reload_ext autoreload
%autoreload 2

import getpass
import os
import sys
from pathlib import Path

import openai


### Secrets and Environment Variables

MLflow:<br>
`MLFLOW_TRACKING_URI`<br>

OpenAI API Key:<br>
`OPENAI_API_KEY`


In [2]:
os.environ["MLFLOW_TRACKING_URI"] = "http://0.0.0.0:5001"
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API key: ")
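
Re-running the cell above overwrites both variables and prompts for the key every time. A minimal sketch of a more forgiving setup follows, assuming a hypothetical helper `ensure_env` (not part of MLflow or the OpenAI SDK). It keeps a value that is already set and only falls back to a default or an interactive prompt when the variable is missing.

```python
import getpass
import os


def ensure_env(name, default=None, secret=False):
    """Return os.environ[name], filling it in only when it is missing.

    `ensure_env` is a hypothetical helper for this notebook, not a library API.
    """
    if not os.environ.get(name):
        if default is not None:
            os.environ[name] = default
        elif secret:
            # Prompt without echoing the value to the notebook output
            os.environ[name] = getpass.getpass(f"Enter {name}: ")
        else:
            raise KeyError(f"{name} is not set and no default was given")
    return os.environ[name]


# Same tracking URI as above, but a value that is already set wins
tracking_uri = ensure_env("MLFLOW_TRACKING_URI", default="http://0.0.0.0:5001")

# Prompts only if the key is missing (left commented out so the cell never blocks):
# ensure_env("OPENAI_API_KEY", secret=True)
```

With this pattern the notebook can also be run non-interactively, as long as both variables are exported beforehand.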

### Check connection to MLflow


In [3]:
import mlflow 

# List experiments in MLflow
mlflow.search_experiments()

[<Experiment: artifact_location='mlflow-artifacts:/754569395076850406', creation_time=1750142806599, experiment_id='754569395076850406', last_update_time=1750142806599, lifecycle_stage='active', name='5-genai-with-mlflow-3', tags={}>,
 <Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1750142521697, experiment_id='0', last_update_time=1750142521697, lifecycle_stage='active', name='Default', tags={}>]

In [4]:
# Set up MLflow experiment
mlflow.set_experiment("5-genai-with-mlflow-3")

<Experiment: artifact_location='mlflow-artifacts:/754569395076850406', creation_time=1750142806599, experiment_id='754569395076850406', last_update_time=1750142806599, lifecycle_stage='active', name='5-genai-with-mlflow-3', tags={}>

# Example 1: Tracing a GenAI Application



https://mlflow.org/docs/latest/genai/getting-started/tracing/tracing-notebook 

## Automatic Tracing of LLM calls

In [5]:
# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Create an OpenAI client
client = openai.OpenAI()


# Use the trace decorator to capture the application's entry point
@mlflow.trace
def my_app(input: str):
    # This call is automatically instrumented by `mlflow.openai.autolog()`
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    return response.choices[0].message.content


my_app(input="What is MLflow?")

'MLflow is an open-source platform designed to manage the machine learning (ML) lifecycle, which includes various stages such as experimentation, reproducibility, and deployment. It provides tools to streamline the process of developing, tracking, and deploying machine learning models. MLflow has several key components:\n\n1. **MLflow Tracking**: This component allows users to log experiments, track parameters, metrics, and artifacts (like models and datasets). You can record the results of different runs and compare them, making it easier to understand model performance.\n\n2. **MLflow Projects**: A way to package data science code in a reusable and reproducible format. It includes a convention for organizing code and can specify dependencies and the environment in which the project should run. Projects are often defined using a `MLproject` file.\n\n3. **MLflow Models**: This component simplifies the process of deploying machine learning models in various formats. It supports multiple
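
To make the role of the decorator more concrete, here is a toy sketch of the general idea behind function-level tracing: wrap the call, record its inputs, output, and duration, and store the record somewhere. It uses only the standard library. `toy_trace` and `RECORDS` are hypothetical names, and this is not how `mlflow.trace` is implemented; MLflow records spans and logs them to the tracking server.

```python
import functools
import time

RECORDS = []  # in-memory stand-in for a trace store (hypothetical)


def toy_trace(func):
    """Record name, inputs, output, and duration of each call to `func`."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        RECORDS.append(
            {
                "name": func.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.perf_counter() - start,
            }
        )
        return result

    return wrapper


@toy_trace
def shout(text: str) -> str:
    return text.upper()


shout("what is mlflow?")
print(RECORDS[0]["name"], RECORDS[0]["output"])
```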

## Tracing LangChain🦜⛓️

https://mlflow.org/docs/latest/genai/tracing/integrations/listing/langchain 


In [6]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI


# Enabling autolog for LangChain will enable trace logging.
mlflow.langchain.autolog()

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7, max_tokens=1000)

prompt_template = PromptTemplate.from_template(
    "Answer the question as if you are {person}, fully embodying their style, wit, personality, and habits of speech. "
    "Emulate their quirks and mannerisms to the best of your ability, embracing their traits—even if they aren't entirely "
    "constructive or inoffensive. The question is: {question}"
)

chain = prompt_template | llm | StrOutputParser()

# Let's test another call
chain.invoke(
    {
        "person": "Linus Torvalds",
        "question": "Can I just set everyone's access to sudo to make things easier?",
    }
)

'Oh, come on. Seriously? You want to just hand out sudo access like it\'s candy on Halloween? That\'s a recipe for disaster, my friend. It\'s like giving the keys to your house to the neighborhood kids and saying, "Hey, feel free to do whatever you want!" \n\nSudo is powerful; it’s like a magic wand that can turn you into a god of the system. But with great power comes great responsibility—or, in this case, a lot of broken systems. You really want to trust *everyone* with that kind of power? It\'s like saying, "Here, take my life savings, and I trust you not to blow it on useless crap!"\n\nInstead, how about you think a little more carefully about who really needs sudo access? Only give it to those who actually understand what they\'re doing. It\'s not just about ease; it\'s about security and stability. If you make it too easy, you\'ll end up with a system that’s more broken than a toddler’s toy after a playdate.\n\nSo, no. Don\'t do it. Be smart. Be cautious. And for heaven\'s sake, 

## Token Usage Tracking

https://mlflow.org/docs/latest/genai/tracing/integrations/listing/langchain#token-usage-tracking


In [7]:
# Execute the chain defined in the previous example
chain.invoke(
    {
        "person": "Linus Torvalds",
        "question": "Can I just set everyone's access to sudo to make things easier?",
    }
)

# Get the trace object just created
last_trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id=last_trace_id)

# Print the token usage
total_usage = trace.info.token_usage
print("== Total token usage: ==")
print(f"  Input tokens: {total_usage['input_tokens']}")
print(f"  Output tokens: {total_usage['output_tokens']}")
print(f"  Total tokens: {total_usage['total_tokens']}")

# Print the token usage for each LLM call
print("\n== Token usage for each LLM call: ==")
for span in trace.data.spans:
    if usage := span.get_attribute("mlflow.chat.tokenUsage"):
        print(f"{span.name}:")
        print(f"  Input tokens: {usage['input_tokens']}")
        print(f"  Output tokens: {usage['output_tokens']}")
        print(f"  Total tokens: {usage['total_tokens']}")

== Total token usage: ==
  Input tokens: 81
  Output tokens: 283
  Total tokens: 364

== Token usage for each LLM call: ==
ChatOpenAI:
  Input tokens: 81
  Output tokens: 283
  Total tokens: 364
Completions:
  Input tokens: 81
  Output tokens: 283
  Total tokens: 364
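
The per-span loop above can be wrapped in a small reusable function. This is a minimal sketch that assumes only the accessors already used in this cell (`trace.info.token_usage`, `trace.data.spans`, `span.name`, `span.get_attribute`); `token_usage_report` is a hypothetical helper name. Note that in the output above the `ChatOpenAI` and `Completions` spans each report the same 364 tokens and the trace-level total is also 364, so per-span figures should not be summed naively.

```python
def token_usage_report(trace):
    """Collect trace-level and per-span token usage into plain dicts."""
    per_span = {}
    for span in trace.data.spans:
        # Spans without token usage yield a falsy value and are skipped
        usage = span.get_attribute("mlflow.chat.tokenUsage")
        if usage:
            per_span[span.name] = dict(usage)
    # Fall back to an empty dict when the trace carries no usage at all
    return {"total": dict(trace.info.token_usage or {}), "per_span": per_span}
```

It could be called as `token_usage_report(mlflow.get_trace(trace_id=mlflow.get_last_active_trace_id()))`, mirroring the lookup in the cell above.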


# Example 2: Tracing LangGraph🦜🕸️

https://mlflow.org/docs/latest/genai/tracing/integrations/listing/langgraph 


In [8]:
from typing import Literal

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# Enabling tracing for LangGraph (LangChain)
mlflow.langchain.autolog()


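@tool
def get_weather(city: Literal["nyc", "sf"]) -> str:
    """Use this to get weather information."""
    if city == "nyc":
        return "It might be cloudy in nyc"
    elif city == "sf":
        return "It's always sunny in sf"
    # Guard against values outside the declared Literal options
    raise ValueError(f"Unknown city: {city}")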


llm = ChatOpenAI(model="gpt-4o-mini")
tools = [get_weather]
graph = create_react_agent(llm, tools)

# Invoke the graph
result = graph.invoke(
    {"messages": [{"role": "user", "content": "what is the weather in sf?"}]}
)
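
The cell above stores the agent's final state in `result` but never displays it. Here is a minimal sketch for pulling out the last message, assuming the returned state has the same `{"messages": [...]}` shape that was passed in and that each returned message exposes a `.content` attribute. `final_answer` is a hypothetical helper.

```python
def final_answer(state):
    """Return the text of the last message in a LangGraph-style state dict."""
    messages = state.get("messages", [])
    if not messages:
        return None
    last = messages[-1]
    # Message objects expose `.content`; plain dicts use the "content" key
    if isinstance(last, dict):
        return last.get("content")
    return getattr(last, "content", None)
```

Calling `print(final_answer(result))` after the cell above would then show the agent's reply.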

# Example 3: Prompt Management

https://mlflow.org/docs/latest/genai/mlflow-3/genai-agent

In [9]:

system_prompt = mlflow.genai.register_prompt(
    name="chatbot_prompt",
    template="You are a chatbot that can answer questions about IT. Answer this question: {{question}}",
    commit_message="Initial version of chatbot",
)

2025/06/18 13:06:16 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: chatbot_prompt, version 2


In [10]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(system_prompt.to_single_brace_format())
chain = prompt | ChatOpenAI(temperature=0.7) | StrOutputParser()
question = "What is MLflow?"
print(chain.invoke({"question": question}))
# MLflow is an open-source platform for managing the end-to-end machine learning lifecycle...

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It allows data scientists to track experiments, package and share models, and deploy models into production. MLflow supports multiple machine learning libraries and languages, making it a versatile tool for managing machine learning projects.
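
The registered template uses double braces (`{{question}}`), while the LangChain template built here expects single braces, which is why the cell calls `to_single_brace_format()`. The conversion can be pictured with a small regex. This is an illustrative stand-in under that assumption, not MLflow's actual implementation, and `to_single_braces` is a hypothetical name.

```python
import re


def to_single_braces(template: str) -> str:
    """Rewrite {{name}} placeholders as {name}, leaving other text untouched."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", r"{\1}", template)


template = (
    "You are a chatbot that can answer questions about IT. "
    "Answer this question: {{question}}"
)
converted = to_single_braces(template)

# The converted string now works with str.format-style substitution
print(converted.format(question="What is MLflow?"))
```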


In [11]:
# set the active model for linking traces
mlflow.set_active_model(name="langchain_model")

# Enable autologging so that interactive traces from the client are automatically linked to a LoggedModel
mlflow.langchain.autolog()

questions = [
    "What is MLflow Tracking and how does it work?",
    "What is Unity Catalog?",
    "What are user-defined functions (UDFs)?",
]
outputs = []

for question in questions:
    outputs.append(chain.invoke({"question": question}))

# Fetch the active model's ID and list the traces linked to it
active_model_id = mlflow.get_active_model_id()
mlflow.search_traces(model_id=active_model_id)

2025/06/18 13:06:18 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-732d980d5a3d4888b45e48ab408d2ea0


| | trace_id | trace | client_request_id | state | request_time | execution_duration | request | response | trace_metadata | tags | spans | assessments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83f9cb0595bb49c692f45d6bda9017e7 | Trace(trace_id=83f9cb0595bb49c692f45d6bda9017e7) | | TraceState.OK | 1750226916626 | 1462 | {'question': 'What are user-defined functions ... | User-defined functions (UDFs) are functions th... | {'mlflow.trace.tokenUsage': '{"input_tokens": ... | {'mlflow.artifactLocation': 'mlflow-artifacts:... | [{'trace_id': 'Nnnppo3ormgwD0VNSlmxcQ==', 'spa... | [] |
| 1 | 030f0811299d422db4d4089ae35be8bf | Trace(trace_id=030f0811299d422db4d4089ae35be8bf) | | TraceState.OK | 1750226780487 | 136103 | {'question': 'What is Unity Catalog?'} | Unity Catalog is a software distribution platf... | {'mlflow.trace.tokenUsage': '{"input_tokens": ... | {'mlflow.artifactLocation': 'mlflow-artifacts:... | [{'trace_id': 'T2jidXnLAE57d+Y07KXcaA==', 'spa... | [] |
| 2 | 76f04578492a4b58b44e379566332936 | Trace(trace_id=76f04578492a4b58b44e379566332936) | | TraceState.OK | 1750226778484 | 1981 | {'question': 'What is MLflow Tracking and how ... | MLflow Tracking is a component of the MLflow o... | {'mlflow.trace.tokenUsage': '{"input_tokens": ... | {'mlflow.artifactLocation': 'mlflow-artifacts:... | [{'trace_id': 'NDJjPGVwilZzygBdSBLiqQ==', 'spa... | [] |
| 3 | 54b2743004ad42af8479d915dd9cb509 | Trace(trace_id=54b2743004ad42af8479d915dd9cb509) | | TraceState.OK | 1750144198308 | 1895 | {'question': 'What are user-defined functions ... | User-defined functions (UDFs) are functions th... | {'mlflow.trace.tokenUsage': '{"input_tokens": ... | {'mlflow.artifactLocation': 'mlflow-artifacts:... | [{'trace_id': 'hIjaCuymvVHxKbrjJkCjzw==', 'spa... | [] |
| 4 | 65c1e93bc6e341e69b833da7feb7838f | Trace(trace_id=65c1e93bc6e341e69b833da7feb7838f) | | TraceState.OK | 1750144197383 | 913 | {'question': 'What is Unity Catalog?'} | Unity Catalog is a platform offered by Unity T... | {'mlflow.trace.tokenUsage': '{"input_tokens": ... | {'mlflow.artifactLocation': 'mlflow-artifacts:... | [{'trace_id': 'Z5ORDO1OVq8rlX5Pory0Dw==', 'spa... | [] |
| 5 | 9745568a6fd34c74a38bbb350a5c9592 | Trace(trace_id=9745568a6fd34c74a38bbb350a5c9592) | | TraceState.OK | 1750144195574 | 1799 | {'question': 'What is MLflow Tracking and how ... | MLflow Tracking is an open-source platform tha... | {'mlflow.trace.tokenUsage': '{"input_tokens": ... | {'mlflow.artifactLocation': 'mlflow-artifacts:... | [{'trace_id': 'favoOh4dapa1toCdXdOQuw==', 'spa... | [] |
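
The `trace_metadata` column above stores token usage as a JSON string under the `mlflow.trace.tokenUsage` key. Here is a minimal sketch for summing usage over several traces, assuming each metadata value is a dict shaped like the truncated cells above and carries the `input_tokens`, `output_tokens`, and `total_tokens` fields printed earlier. `sum_token_usage` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
import json

USAGE_KEY = "mlflow.trace.tokenUsage"


def sum_token_usage(metadata_rows):
    """Add up token counts across trace metadata dicts; skip rows without usage."""
    totals = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
    for metadata in metadata_rows:
        raw = metadata.get(USAGE_KEY)
        if not raw:
            continue
        usage = json.loads(raw)  # the value is a JSON-encoded string
        for field in totals:
            totals[field] += usage.get(field, 0)
    return totals


# Illustrative rows only; real values would come from the `trace_metadata` column
sample = [
    {USAGE_KEY: '{"input_tokens": 10, "output_tokens": 20, "total_tokens": 30}'},
    {USAGE_KEY: '{"input_tokens": 5, "output_tokens": 7, "total_tokens": 12}'},
    {},
]
print(sum_token_usage(sample))
```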


# Example 4: Evaluate the agent's performance

In [12]:
# Prepare the eval dataset in a pandas DataFrame
import pandas as pd

eval_df = pd.DataFrame(
    {
        "messages": questions,
        "expected_response": [
            """MLflow Tracking is a key component of the MLflow platform designed to record and manage machine learning experiments. It enables data scientists and engineers to log parameters, code versions, metrics, and artifacts in a systematic way, facilitating experiment tracking and reproducibility.\n\nHow It Works:\n\nAt the heart of MLflow Tracking is the concept of a run, which is an execution of a machine learning code. Each run can log the following:\n\nParameters: Input variables or hyperparameters used in the model (e.g., learning rate, number of trees). Metrics: Quantitative measures to evaluate the model's performance (e.g., accuracy, loss). Artifacts: Output files like models, datasets, or images generated during the run. Source Code: The version of the code or Git commit hash used. These logs are stored in a tracking server, which can be set up locally or on a remote server. The tracking server uses a backend storage (like a database or file system) to keep a record of all runs and their associated data.\n\n Users interact with MLflow Tracking through its APIs available in multiple languages (Python, R, Java, etc.). By invoking these APIs in the code, you can start and end runs, and log data as the experiment progresses. Additionally, MLflow offers autologging capabilities for popular machine learning libraries, automatically capturing relevant parameters and metrics without manual code changes.\n\nThe logged data can be visualized using the MLflow UI, a web-based interface that displays all experiments and runs. This UI allows you to compare runs side-by-side, filter results, and analyze performance metrics over time. It aids in identifying the best models and understanding the impact of different parameters.\n\nBy providing a structured way to record experiments, MLflow Tracking enhances collaboration among team members, ensures transparency, and makes it easier to reproduce results. 
It integrates seamlessly with other MLflow components like Projects and Model Registry, offering a comprehensive solution for managing the machine learning lifecycle.""",
            """Unity Catalog is a feature in Databricks that allows you to create a centralized inventory of your data assets, such as tables, views, and functions, and share them across different teams and projects. It enables easy discovery, collaboration, and reuse of data assets within your organization.\n\nWith Unity Catalog, you can:\n\n1. Create a single source of truth for your data assets: Unity Catalog acts as a central repository of all your data assets, making it easier to find and access the data you need.\n2. Improve collaboration: By providing a shared inventory of data assets, Unity Catalog enables data scientists, engineers, and other stakeholders to collaborate more effectively.\n3. Foster reuse of data assets: Unity Catalog encourages the reuse of existing data assets, reducing the need to create new assets from scratch and improving overall efficiency.\n4. Enhance data governance: Unity Catalog provides a clear view of data assets, enabling better data governance and compliance.\n\nUnity Catalog is particularly useful in large organizations where data is scattered across different teams, projects, and environments. It helps create a unified view of data assets, making it easier to work with data across different teams and projects.""",
            """User-defined functions (UDFs) in the context of Databricks and Apache Spark are custom functions that you can create to perform specific tasks on your data. These functions are written in a programming language such as Python, Java, Scala, or SQL, and can be used to extend the built-in functionality of Spark.\n\nUDFs can be used to perform complex data transformations, data cleaning, or to apply custom business logic to your data. Once defined, UDFs can be invoked in SQL queries or in DataFrame transformations, allowing you to reuse your custom logic across multiple queries and applications.\n\nTo use UDFs in Databricks, you first need to define them in a supported programming language, and then register them with the SparkSession. Once registered, UDFs can be used in SQL queries or DataFrame transformations like any other built-in function.\n\nHere\'s an example of how to define and register a UDF in Python:\n\n```python\nfrom pyspark.sql.functions import udf\nfrom pyspark.sql.types import IntegerType\n\n# Define the UDF function\ndef multiply_by_two(value):\n    return value * 2\n\n# Register the UDF with the SparkSession\nmultiply_udf = udf(multiply_by_two, IntegerType())\n\n# Use the UDF in a DataFrame transformation\ndata = spark.range(10)\nresult = data.withColumn("multiplied", multiply_udf(data.id))\nresult.show()\n```\n\nIn this example, we define a UDF called `multiply_by_two` that multiplies a given value by two. We then register this UDF with the SparkSession using the `udf` function, and use it in a DataFrame transformation to multiply the `id` column of a DataFrame by two.""",
        ],
        "predictions": outputs,
    }
)

# Start a run to represent the evaluation job
with mlflow.start_run() as evaluation_run:
    eval_dataset = mlflow.data.from_pandas(
        df=eval_df,
        name="eval_dataset",
        targets="expected_response",
        predictions="predictions",
    )
    mlflow.log_input(dataset=eval_dataset)
    # Run the evaluation based on extra metrics
    # Current active model will be automatically used
    result = mlflow.evaluate(
        data=eval_dataset,
        extra_metrics=[
            mlflow.metrics.genai.answer_correctness("openai:/gpt-4o"),
            mlflow.metrics.genai.answer_relevance("openai:/gpt-4o"),
        ],
        # This is needed since answer_correctness looks for 'inputs' field
        evaluator_config={"col_mapping": {"inputs": "messages"}},
    )

result.tables["eval_results_table"]
#                                         messages  ...                  answer_relevance/v1/justification
# 0  What is MLflow Tracking and how does it work?  ...  The output directly addresses the input questi...
# 1                         What is Unity Catalog?  ...  The output is completely irrelevant to the inp...
# 2        What are user-defined functions (UDFs)?  ...  The output directly addresses the input questi...

2025/06/18 13:08:56 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-732d980d5a3d4888b45e48ab408d2ea0
2025/06/18 13:08:56 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.
2025/06/18 13:08:56 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:02<00:00,  2.28s/it]
100%|██████████| 1/1 [00:02<00:00,  2.15s/it]
100%|██████████| 3/3 [00:03<00:00,  1.06s/it]
100%|██████████| 3/3 [00:02<00:00,  1.01it/s]


🏃 View run angry-foal-616 at: http://0.0.0.0:5001/#/experiments/754569395076850406/runs/79a16114eaf84a39accf45e53b11c990
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/754569395076850406


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 243.25it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 254.39it/s]


| | messages | expected_response | predictions | answer_correctness/v1/score | answer_correctness/v1/justification | answer_relevance/v1/score | answer_relevance/v1/justification |
|---|---|---|---|---|---|---|---|
| 0 | What is MLflow Tracking and how does it work? | MLflow Tracking is a key component of the MLfl... | MLflow Tracking is a component of the MLflow o... | 5 | The output is correct and demonstrates a high ... | 5 | The output directly addresses the input questi... |
| 1 | What is Unity Catalog? | Unity Catalog is a feature in Databricks that ... | Unity Catalog is a software distribution platf... | 1 | The output is completely incorrect. It describ... | 1 | The output is completely irrelevant to the inp... |
| 2 | What are user-defined functions (UDFs)? | User-defined functions (UDFs) in the context o... | User-defined functions (UDFs) are functions th... | 3 | The output addresses the general concept of us... | 5 | The output directly addresses the input questi... |
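
The results table makes it easy to spot weak answers, such as the Unity Catalog row scored 1 on both metrics. Here is a minimal sketch for flagging such rows, assuming the table has been converted to a list of dicts (for example with pandas' `to_dict("records")`) and using a hypothetical threshold of 3. Only the questions and scores shown above are reproduced, and `flag_low_scores` is a hypothetical helper.

```python
SCORE_COLUMNS = ("answer_correctness/v1/score", "answer_relevance/v1/score")


def flag_low_scores(rows, threshold=3):
    """Return the questions whose score on any metric is below `threshold`."""
    flagged = []
    for row in rows:
        if any(row[col] < threshold for col in SCORE_COLUMNS):
            flagged.append(row["messages"])
    return flagged


rows = [
    {"messages": "What is MLflow Tracking and how does it work?",
     "answer_correctness/v1/score": 5, "answer_relevance/v1/score": 5},
    {"messages": "What is Unity Catalog?",
     "answer_correctness/v1/score": 1, "answer_relevance/v1/score": 1},
    {"messages": "What are user-defined functions (UDFs)?",
     "answer_correctness/v1/score": 3, "answer_relevance/v1/score": 5},
]
print(flag_low_scores(rows))  # ['What is Unity Catalog?']
```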


In [16]:
# Assignment 5

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

deck_relevance_metric = make_genai_metric(
    name="deck_relevance",
    definition=(
        "Deck relevance measures how well the suggested Clash Royale deck aligns with the user's "
        "preferences and the current meta. It evaluates if the deck composition, playstyle, and card "
        "choices are appropriate and effective given the user's input and the latest meta decks."
    ),
    grading_prompt=(
        "Deck Relevance: Score the deck suggestion based on how well it matches the user's preferences "
        "and the meta. Consider the following scoring rubric:\n"
        "- Score 1: Deck is irrelevant or completely mismatched to user preferences and meta.\n"
        "- Score 2: Deck has some relevant cards but overall poorly aligned with preferences or meta.\n"
        "- Score 3: Deck is moderately relevant with a fair match to preferences and meta.\n"
        "- Score 4: Deck is mostly relevant and well aligned with preferences and meta.\n"
        "- Score 5: Deck is highly relevant, perfectly matching user preferences and meta."
    ),
    examples=[
        EvaluationExample(
            input=(
                "User preferences: Aggressive playstyle, likes Hog Rider and Fireball.\n"
                "Suggested deck: Hog Rider, Fireball, Mini P.E.K.K.A, Zap, Musketeer, Skeletons, Cannon, Ice Spirit."
            ),
            output=(
                "The suggested deck strongly matches the user's aggressive playstyle and preferred cards. "
                "It includes Hog Rider and Fireball and supports aggressive tactics."
            ),
            score=5,
            justification=(
                "The deck perfectly aligns with the user's preferences and is meta-relevant, "
                "making it an excellent suggestion."
            ),
        ),
        EvaluationExample(
            input=(
                "User preferences: Defensive playstyle, prefers spells and control.\n"
                "Suggested deck: Giant, Balloon, Freeze, Rage, Minions, Fireball, Arrows, Musketeer."
            ),
            output=(
                "The suggested deck includes a mix of high-damage and control cards but lacks synergy with the user's "
                "defensive playstyle. While it has some relevant cards, it overall does not align well with the user's "
                "preferences."
            ),
            score=2,
            justification=(
                "The deck is mostly offensive and does not align well with the user's defensive and control preferences."
            ),
        ),
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question about Clash Royale decks from a professional Clash Royale player perspective."
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        name="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[deck_relevance_metric],  # use the deck relevance metric defined above
    )
print(results.metrics)

Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 408.46it/s]
2025/06/18 14:38:55 INFO mlflow.tracking.fluent: Active model is set to the logged model with ID: m-254cea2b519b44449e5471b1974583ac
2025/06/18 14:38:55 INFO mlflow.tracking.fluent: Use `mlflow.set_active_model` to set the active model to a different one if needed.
2025/06/18 14:38:56 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.


🏃 View run marvelous-snail-10 at: http://0.0.0.0:5001/#/experiments/754569395076850406/runs/2ac902d0e0fd4cc894b229118d225343
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/754569395076850406


MlflowException: Error: Metric calculation failed for the following metrics:
Metric 'deck_relevance' requires the following:
- missing columns ['inputs'] need to be defined or mapped

Below are the existing column names for the input/output data:
Input Columns: ['messages', 'expected_response', 'predictions']
Output Columns: ['predictions']

To resolve this issue, you may need to:
- specify any required parameters
- if you are missing columns, check that there are no circular dependencies among your
metrics, and you may want to map them to an existing column using the following
configuration:
evaluator_config={'col_mapping': {<missing column name>: <existing column name>}}
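
The error says the metric needs an `inputs` column, while the data only provides `messages`, `expected_response`, and `predictions`. Following the message's own suggestion, and the `evaluator_config` already used in Example 4, mapping `inputs` to `messages` is the likely fix. The sketch below uses a hypothetical helper, `unmapped_columns`, to check a mapping against the column names before re-running the evaluation. It does not call MLflow.

```python
# Mapping borrowed from the Example 4 evaluation call
EVALUATOR_CONFIG = {"col_mapping": {"inputs": "messages"}}


def unmapped_columns(required, available, col_mapping):
    """Return required columns that are neither present nor mapped to a present column."""
    missing = []
    for column in required:
        # A column is satisfied directly or through its mapped name
        target = col_mapping.get(column, column)
        if target not in available:
            missing.append(column)
    return missing


available = ["messages", "expected_response", "predictions"]
print(unmapped_columns(["inputs"], available, {}))
print(unmapped_columns(["inputs"], available, EVALUATOR_CONFIG["col_mapping"]))
```

If the second call returns an empty list, passing `evaluator_config=EVALUATOR_CONFIG` to the `mlflow.evaluate` call above should address the missing-column error, assuming no other required columns are absent.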