# Evaluate Retriever with Doc IDs

In MLflow 2.8.0, we introduced a new model type, "retriever", to the `mlflow.evaluate()` API. It helps you evaluate the retriever in a RAG application and provides a built-in metric, `precision_at_k`.

This notebook illustrates how to use `mlflow.evaluate()` to evaluate the retriever in a RAG application. It has the following sections:

1. Evaluation dataset preparation
2. Metrics definition
3. Calling `mlflow.evaluate()`
4. Result Analysis

In [0]:
%pip install --upgrade --force-reinstall git+https://github.com/bbqiu/mlflow@retriever-recall

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting git+https://github.com/bbqiu/mlflow@retriever-recall
  Cloning https://github.com/bbqiu/mlflow (to revision retriever-recall) to /tmp/pip-req-build-szvgb1w3
  Running command git clone --filter=blob:none --quiet https://github.com/bbqiu/mlflow /tmp/pip-req-build-szvgb1w3
  Running command git checkout -b retriever-recall --track origin/retriever-recall
  Switched to a new branch 'retriever-recall'
  branch 'retriever-recall' set up to track 'origin/retriever-recall'.
  Resolved https://github.com/bbqiu/mlflow to commit ace64d76b3e5e86db120f34339f6b5878fbb8304
  Running command git submodule update --init --recursive -q
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.t

In [0]:
dbutils.library.restartPython()

In [0]:
import mlflow

mlflow.__version__

'2.7.2.dev0'

## Evaluation dataset preparation

When evaluating a retriever, we recommend saving the retrieved document IDs into a static dataset, represented as a pandas DataFrame or an MLflow Pandas Dataset, that contains the input queries, the retrieved relevant document IDs, and the ground-truth relevant document IDs.

A "document ID" should be a string that identifies a document.

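For each row, the retrieved relevant document IDs and the ground-truth relevant document IDs should be provided as a tuple of document ID strings. The example dataset in this notebook uses lists of strings instead, which the run shown here also accepted.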

The column name of the retrieved relevant document IDs should be specified by the `predictions` parameter, and the column name of the ground-truth relevant document IDs should be specified by the `targets` parameter.

Alternatively, you can use a function that returns a tuple of document ID strings for
the evaluation. The function should take a Pandas DataFrame as input and return a Pandas
DataFrame with the same number of rows, where each row contains a tuple of document ID
strings.
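As a minimal sketch of the function form described above, the toy retriever below takes a pandas DataFrame of queries and returns a DataFrame with one tuple of document ID strings per row. The lookup table, the document IDs, and the name `retrieve_doc_ids` are hypothetical; a real retriever would query a vector store or search index instead.

```python
import pandas as pd

# Hypothetical in-memory index; a real retriever would query a vector store.
_TOY_INDEX = {
    "What is MLflow?": ("doc-mlflow-index", "doc-mlflow-quick-start"),
    "What is Databricks?": ("doc-intro-index",),
}


def retrieve_doc_ids(inputs: pd.DataFrame) -> pd.DataFrame:
    """Return one tuple of document ID strings per input row."""
    # Unknown questions map to an empty tuple (nothing retrieved).
    doc_ids = [_TOY_INDEX.get(question, ()) for question in inputs["questions"]]
    return pd.DataFrame({"retrieved_context": doc_ids}, index=inputs.index)


queries = pd.DataFrame({"questions": ["What is MLflow?", "Unknown question"]})
result = retrieve_doc_ids(queries)
print(result)
```

A function like this would be passed as the `model` argument of `mlflow.evaluate()` in place of a `predictions` column.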

Here is a simple example dataset that illustrates the expected data format.

In [0]:
import pandas as pd

data = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "What is Databricks?",
            "How to serve a model on Databricks?",
            "How to enable MLflow Autologging for my workspace by default?",
        ],
        "retrieved_context": [
            [
                "https://docs.databricks.com/en/mlflow/index.html",
                "https://docs.databricks.com/en/mlflow/quick-start.html",
            ],
            [
                "https://docs.databricks.com/en/introduction/index.html",
                "https://docs.databricks.com/en/getting-started/overview.html",
            ],
            [
                "https://docs.databricks.com/en/machine-learning/model-serving/index.html",
                "https://docs.databricks.com/en/machine-learning/model-serving/model-serving-intro.html",
            ],
            [],
        ],
        "ground_truth_context": [
            ["https://docs.databricks.com/en/mlflow/index.html"],
            ["https://docs.databricks.com/en/introduction/index.html"],
            [
                "https://docs.databricks.com/en/machine-learning/model-serving/index.html",
                "https://docs.databricks.com/en/machine-learning/model-serving/llm-optimized-model-serving.html",
            ],
            ["https://docs.databricks.com/en/mlflow/databricks-autologging.html"],
        ],
    }
)

## Metric Definition

A built-in metric `mlflow.metrics.precision_at_k(k)` is available for the retriever evaluation.

This metric computes a score between 0 and 1 for each row, representing the precision of the
retriever at the given k value. The score is the number of relevant documents among those
retrieved, divided by the total number of documents retrieved or k, whichever is smaller.
If no documents are retrieved, the score is 0 in the run shown in this notebook (see the row
with an empty retrieved list in the results table).

The `k` parameter should be a positive integer representing the number of retrieved documents
to evaluate for each row. `k` defaults to 3.
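The per-row calculation can be sketched in plain Python. This is an illustration of the definition, not MLflow's implementation: the function name is made up, the document IDs are shortened stand-ins for the URLs in the dataset, and scoring an empty retrieval as 0 is an assumption chosen to match the results table in this notebook.

```python
from typing import Sequence


def precision_at_k(retrieved: Sequence[str], ground_truth: Sequence[str], k: int = 3) -> float:
    """Illustrative per-row precision@k; not MLflow's actual implementation."""
    top_k = list(retrieved)[:k]  # only the first k retrieved IDs count
    if not top_k:
        return 0.0  # assumption: an empty retrieval scores 0
    relevant = set(ground_truth)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)  # denominator is min(k, number retrieved)


# Shortened, hypothetical document IDs standing in for the URLs in the dataset.
rows = [
    (["mlflow/index", "mlflow/quick-start"], ["mlflow/index"]),
    ([], ["mlflow/databricks-autologging"]),
]
scores = [precision_at_k(retrieved, truth, k=3) for retrieved, truth in rows]
print(scores)  # [0.5, 0.0]
```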

This metric is a default metric for the `retriever` model type.

When the model type is `"retriever"`, this metric is calculated automatically with the
default `k` value of 3. To use another `k` value, pass it through the `evaluator_config` parameter
of the `mlflow.evaluate()` API as follows: `evaluator_config={"retriever_k": <k_value>}`.


```python
# Case 1: Specifying the model type
evaluate_results = mlflow.evaluate(
    data=data,
    model_type="retriever",
    targets="ground_truth_context",
    predictions="retrieved_context",
    evaluators="default",
    # Override the default k of 3 for the built-in retriever metrics
    evaluator_config={"retriever_k": 5},
)
```

Alternatively, you can directly specify the `mlflow.metrics.precision_at_k(<k_value>)` metric
in the `extra_metrics` parameter of the `mlflow.evaluate()` API without specifying a model
type. In this case, the `k` value specified in the `evaluator_config` parameter will be
ignored.


```python
# Case 2: Specifying the extra_metrics
evaluate_results = mlflow.evaluate(
    data=data,
    targets="ground_truth_context",
    predictions="retrieved_context",
    # k is passed directly to the metric constructor
    extra_metrics=[mlflow.metrics.precision_at_k(5)],
)
```


## Calling mlflow.evaluate()

In [0]:
with mlflow.start_run() as run:
    evaluate_results = mlflow.evaluate(
        data=data,
        model_type="retriever",
        targets="ground_truth_context",
        predictions="retrieved_context",
        evaluators="default",
        evaluator_config={"retriever_k": 3},
    )

2023/10/27 23:08:12 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 23:08:12 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: precision_at_3
2023/10/27 23:08:12 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: recall_at_3


In [0]:
import pprint

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(evaluate_results.metrics)

{   'precision_at_3/mean': 0.375,
    'precision_at_3/p90': 0.5,
    'precision_at_3/variance': 0.046875,
    'recall_at_3/mean': 0.625,
    'recall_at_3/p90': 1.0,
    'recall_at_3/variance': 0.171875}
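The aggregate values above can be reproduced by hand from the per-row scores listed in the results table later in this notebook. The sketch below assumes population variance and a 90th percentile with linear interpolation; those choices happen to match the reported numbers, but they are inferred from the output rather than taken from MLflow's source. The `percentile` helper is illustrative.

```python
import math
from statistics import fmean, pvariance


def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Per-row scores copied from the eval_results_table of this run.
precision_scores = [0.5, 0.5, 0.5, 0.0]
recall_scores = [1.0, 1.0, 0.5, 0.0]

summary = {}
for name, scores in [("precision_at_3", precision_scores), ("recall_at_3", recall_scores)]:
    summary[f"{name}/mean"] = fmean(scores)
    summary[f"{name}/variance"] = pvariance(scores)  # population variance
    summary[f"{name}/p90"] = percentile(scores, 90)

print(summary)
```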


## Result Analysis

You can view the per-row scores in the logged "eval_results_table.json" artifact, either by loading it into a pandas DataFrame (shown below) or by opening the MLflow run comparison UI.

In [0]:
display(evaluate_results.tables["eval_results_table"])



| questions | ground_truth_context | retrieved_context | precision_at_3/score | recall_at_3/score |
| --- | --- | --- | --- | --- |
| What is MLflow? | List(https://docs.databricks.com/en/mlflow/index.html) | List(https://docs.databricks.com/en/mlflow/index.html, https://docs.databricks.com/en/mlflow/quick-start.html) | 0.5 | 1.0 |
| What is Databricks? | List(https://docs.databricks.com/en/introduction/index.html) | List(https://docs.databricks.com/en/introduction/index.html, https://docs.databricks.com/en/getting-started/overview.html) | 0.5 | 1.0 |
| How to serve a model on Databricks? | List(https://docs.databricks.com/en/machine-learning/model-serving/index.html, https://docs.databricks.com/en/machine-learning/model-serving/llm-optimized-model-serving.html) | List(https://docs.databricks.com/en/machine-learning/model-serving/index.html, https://docs.databricks.com/en/machine-learning/model-serving/model-serving-intro.html) | 0.5 | 0.5 |
| How to enable MLflow Autologging for my workspace by default? | List(https://docs.databricks.com/en/mlflow/databricks-autologging.html) | List() | 0.0 | 0.0 |
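The `recall_at_3/score` column is not defined in the text above. A minimal sketch, assuming the conventional definition (relevant documents found in the top k, divided by the number of ground-truth documents), gives the same per-row values as the table. The function and the shortened document IDs are illustrative, not MLflow's code.

```python
def recall_at_k(retrieved, ground_truth, k=3):
    """Illustrative per-row recall@k under an assumed, conventional definition."""
    relevant = set(ground_truth)
    if not relevant:
        return 0.0  # assumption: report 0 when there is no ground truth
    found = relevant.intersection(list(retrieved)[:k])
    return len(found) / len(relevant)


# Shortened, hypothetical IDs mirroring the third and fourth rows of the table.
serving_row = recall_at_k(
    retrieved=["model-serving/index", "model-serving/model-serving-intro"],
    ground_truth=["model-serving/index", "model-serving/llm-optimized-model-serving"],
)
autologging_row = recall_at_k(retrieved=[], ground_truth=["mlflow/databricks-autologging"])
print(serving_row, autologging_row)  # 0.5 0.0
```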


In [0]:
evaluate_results.tables["eval_results_table"]["ground_truth_context"] = evaluate_results.tables[
    "eval_results_table"
]["ground_truth_context"].astype(tuple)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <command-420022480716680>, line 1
----> 1 evaluate_results.tables["eval_results_table"]["ground_truth_context"] = evaluate_results.tables["eval_results_table"]["ground_truth_context"].astype(tuple)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3531e620-387e-4c5b-aa5f-0354c0bfda65/lib/python3.10/site-packages/pandas/core/generic.py:6534, in NDFrame.astype
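The `TypeError` above is consistent with `astype` expecting a dtype, which `tuple` is not. One possible alternative, sketched here on a small stand-in DataFrame rather than on the real results table, is to convert each cell element-wise with `Series.map`. The column contents are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the evaluation results table.
table = pd.DataFrame({"ground_truth_context": [["doc1"], ["doc2", "doc3"]]})

# Convert each list element-wise instead of casting the whole column.
table["ground_truth_context"] = table["ground_truth_context"].map(tuple)
print(table["ground_truth_context"].tolist())  # [('doc1',), ('doc2', 'doc3')]
```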

In [0]:
import pandas as pd

df = pd.DataFrame({"A": [("doc1",), ("doc2",), ("doc3",)]})
print("Original DataFrame:")
print(df)

# Using apply can lead to "unwrapping"
df["B"] = df["A"].apply(lambda x: x)
print("\nDataFrame after apply:")
print(df)

