# Evaluate a HuggingFace LLM with mlflow.evaluate()

This guide shows how to load a pre-trained HuggingFace pipeline, log it to MLflow, and use `mlflow.evaluate()` to compute built-in metrics as well as custom LLM-judged metrics for the model.

For detailed information, please read the following documentation:
https://mlflow.org/docs/latest/llms/llm-evaluate/index.html



## Start MLflow Server

You can either:

- Start a local tracking server by running `mlflow ui` in the same directory as your notebook
  - Please follow [this section of the contributing guide](https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#javascript-and-ui) to get the UI set up.
- Use a tracking server, as described in [this overview](https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html)
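
If you go with a separately running tracking server, the notebook needs to know where it is. MLflow reads the `MLFLOW_TRACKING_URI` environment variable for this. The sketch below assumes the server listens on `http://127.0.0.1:5000`, the default address for `mlflow ui`; that address is an assumption, so replace it with your own server's URL. If you simply run `mlflow ui` next to the notebook and rely on the local `./mlruns` directory, this step may not be needed.

```python
import os

# Assumption: the tracking server listens on MLflow's default address.
# Change this value if you started the server with a different host or port.
# setdefault() leaves an already-configured URI untouched.
os.environ.setdefault("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000")

print("Tracking URI:", os.environ["MLFLOW_TRACKING_URI"])
```

Set the variable before the first MLflow call in the notebook so that every run is logged to the same place.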

## Install necessary dependencies

In [18]:
%pip install -q mlflow transformers torch torchvision evaluate datasets openai==0.27.9 tiktoken fastapi rouge_score textstat

Note: you may need to restart the kernel to use updated packages.


## Load pretrained HuggingFace pipeline

Here we are loading a text summarization pipeline, but you can also use a text generation or question answering pipeline.

In [19]:
from transformers import pipeline

summarizer = pipeline("summarization", model="Falconsai/text_summarization")

## Log model to MLflow

In [20]:
import mlflow

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="falcons",
        input_example="Please summarize the following article:\n article",
        registered_model_name="falconsai-summarization",
    )

Your max_length is set to 200, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
Registered model 'falconsai-summarization' already exists. Creating a new version of this model...
2023/12/01 14:24:04 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: falconsai-summarization, version 8
Created version '8' of model 'falconsai-summarization'.


## Load Evaluation Data

Load a dataset from HuggingFace to use for evaluation.

In [21]:
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")

To create our `inputs` column, we prepend a summarization prompt to each article.

In [22]:
eval_df = pd.DataFrame(dataset["test"])
eval_df["inputs"] = "Please summarize the following article:\n" + eval_df["article"]

display(eval_df.head(10))

| | article | highlights | id | inputs |
|---|---|---|---|---|
| 0 | (CNN)The Palestinian Authority officially beca... | Membership gives the ICC jurisdiction over all... | f001ec5c4704938247d27a44948eebb37ae98d01 | Please summarize the following article:\n(CNN)... |
| 1 | (CNN)Never mind cats having nine lives. A stra... | Theia, a bully breed mix, was apparently hit b... | 230c522854991d053fe98a718b1defa077a8efef | Please summarize the following article:\n(CNN)... |
| 2 | (CNN)If you've been following the news lately,... | Mohammad Javad Zarif has spent more time with ... | 4495ba8f3a340d97a9df1476f8a35502bcce1f69 | Please summarize the following article:\n(CNN)... |
| 3 | (CNN)Five Americans who were monitored for thr... | 17 Americans were exposed to the Ebola virus w... | a38e72fed88684ec8d60dd5856282e999dc8c0ca | Please summarize the following article:\n(CNN)... |
| 4 | (CNN)A Duke student has admitted to hanging a ... | Student is no longer on Duke University campus... | c27cf1b136cc270023de959e7ab24638021bc43f | Please summarize the following article:\n(CNN)... |
| 5 | (CNN)He's a blue chip college basketball recru... | College-bound basketball star asks girl with D... | 1b2cc634e2bfc6f2595260e7ed9b42f77ecbb0ce | Please summarize the following article:\n(CNN)... |
| 6 | (CNN)Governments around the world are using th... | Amnesty's annual death penalty report catalogs... | e2706dce6cf26bc61b082438188fdb6e130d9e40 | Please summarize the following article:\n(CNN)... |
| 7 | (CNN)Andrew Getty, one of the heirs to billion... | Andrew Getty's death appears to be from natura... | 0d3c8c276d079c4c225f034c69aa024cdab7869d | Please summarize the following article:\n(CNN)... |
| 8 | (CNN)Filipinos are being warned to be on guard... | Once a super typhoon, Maysak is now a tropical... | 6222f33c2c79b80be437335eeb3f488509e92cf5 | Please summarize the following article:\n(CNN)... |
| 9 | (CNN)For the first time in eight years, a TV l... | Bob Barker returned to host "The Price Is Righ... | 2bd8ada1de6a7b02f59430cc82045eb8d29cf033 | Please summarize the following article:\n(CNN)... |


## Define Extra Metrics

Create a custom LLM-judged metric named `answer_quality` using `make_genai_metric()`. We need to provide a metric definition and a grading rubric, plus a few graded examples for the LLM judge to use as references.

In [23]:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria: fluency, clarity, and conciseness. Each of the criteria is defined as follows:
  - Fluency measures how naturally and smoothly the output reads.
  - Clarity measures how understandable the output is.
  - Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""

answer_quality_rubric = """Answer quality: Below are the details for different scores:
  - Score 1: The output is entirely incomprehensible and cannot be read.
  - Score 2: The output conveys some meaning, but needs a lot of improvement in fluency, clarity, and conciseness.
  - Score 3: The output is understandable but still needs improvement.
  - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
  - Score 5: The output reads smoothly, is easy to understand, and is concise. There is no clear way to improve the output on these criteria."""

example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it including experiment tracking model packaging versioning and deployment as well as a platform simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. However, it still conveys some meaning so this output deserves a score of 2.",
)

example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)

answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_rubric,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)

print(answer_quality_metric)

EvaluationMetric(name=answer_quality, greater_is_better=True, long_name=answer_quality, version=v1, metric_details=
Task:
You must return the following fields in your response one below the other:
score: Your numerical score for the model's answer_quality based on the rubric
justification: Your step-by-step reasoning about the model's answer_quality score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_quality based on the input and output.
A definition of answer_quality and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the
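
The judge prompt above asks for two fields, `score` and `justification`, one below the other. As a rough illustration of why that fixed layout matters, here is a minimal sketch of pulling those fields out of a reply. `parse_judge_response` is a hypothetical helper written for this guide, not MLflow's own parser, and the two sample replies are made up.

```python
import re
from typing import Optional, Tuple


def parse_judge_response(text: str) -> Tuple[Optional[int], str]:
    """Return (score, justification), or (None, message) if the layout is missing."""
    # Expect "score: <integer>" followed by "justification: <free text>".
    match = re.search(
        r"score:\s*(\d+)\s*justification:\s*(.*)",
        text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return None, "Failed to extract score and justification."
    return int(match.group(1)), match.group(2).strip()


# Made-up replies: one follows the requested layout, one does not.
well_formed = "score: 4\njustification: The output is fluent and clear."
free_form = "I think this summary is pretty good overall."

print(parse_judge_response(well_formed))
print(parse_judge_response(free_form))
```

A reply that ignores the layout yields no score, so that row ends up without a value for the metric.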

We can also load one of the predefined metrics. Here we use `answer_correctness` with GPT-4 as the judge.

In [24]:
from mlflow.metrics.genai import answer_correctness

answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

print(answer_correctness_metric)

EvaluationMetric(name=answer_correctness, greater_is_better=True, long_name=answer_correctness, version=v1, metric_details=
Task:
You must return the following fields in your response one below the other:
score: Your numerical score for the model's answer_correctness based on the rubric
justification: Your step-by-step reasoning about the model's answer_correctness score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_correctness based on the input and output.
A definition of answer_correctness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand th

## Evaluate

We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.

In [25]:
import os

os.environ["OPENAI_API_KEY"] = "redacted"
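
Rather than pasting the key into a notebook cell, you may prefer to export `OPENAI_API_KEY` in the shell that launches Jupyter and only check for it here. The sketch below shows one way to do that; `require_env` is a hypothetical helper name, not part of MLflow or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before evaluating.")
    return value


# Report the status without ever printing the key itself.
try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY is set")
except RuntimeError as err:
    print(err)
```

Failing early like this avoids finding out about a missing key only after the model predictions have already been computed.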

Call `mlflow.evaluate()` on the first 10 rows of the data. With `model_type="text-summarization"`, we get toxicity, readability (Flesch-Kincaid and ARI grade level), and ROUGE scores as built-in metrics. We also pass the two metrics defined above through the `extra_metrics` parameter so that they are evaluated as well.

In [26]:
import mlflow

with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        model_type="text-summarization",
        targets="highlights",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
    )

Downloading artifacts:   0%|          | 0/13 [00:00<?, ?it/s]
2023/12/01 14:24:05 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
Downloading artifacts: 100%|██████████| 13/13 [00:10<00:00,  1.29it/s] 
2023/12/01 14:24:16 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/12/01 14:24:16 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
Token indices sequence length is longer than the specified maximum sequence length for this model (789 > 512). Running this sequence through the model will result in indexing errors
Your max_length is set to 200, but your input_length is only 161. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=80)
2023/12/01 14:24:40 INFO mlflow.models.evaluation.defau
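
For intuition about the ROUGE scores this evaluation reports, ROUGE-1 measures unigram overlap between a generated summary and its reference. The sketch below is a deliberately simplified version: it lowercases and splits on whitespace, with no stemming or other normalization, so its numbers will not match the `rouge_score` package exactly. `rouge1_f1` and the sample sentences are illustrative only.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Five of six tokens are shared, so precision and recall are both 5/6.
print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L is based on the longest common subsequence instead of fixed-size n-grams.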

## View results

`results.metrics` is a dictionary with the aggregate values for all the metrics calculated.

In [27]:
results.metrics

{'toxicity/v1/mean': 0.0016064770505181513,
 'toxicity/v1/variance': 1.9989291306231105e-06,
 'toxicity/v1/p90': 0.0036141288001090284,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 7.31,
 'flesch_kincaid_grade_level/v1/variance': 4.450899999999999,
 'flesch_kincaid_grade_level/v1/p90': 9.79,
 'ari_grade_level/v1/mean': 8.84,
 'ari_grade_level/v1/variance': 3.9204000000000008,
 'ari_grade_level/v1/p90': 11.3,
 'rouge1/v1/mean': 0.3388889861705919,
 'rouge1/v1/variance': 0.010298884566027731,
 'rouge1/v1/p90': 0.4338874680306905,
 'rouge2/v1/mean': 0.12654092075859483,
 'rouge2/v1/variance': 0.003022103718907744,
 'rouge2/v1/p90': 0.18981132075471696,
 'rougeL/v1/mean': 0.23649862112921,
 'rougeL/v1/variance': 0.004164158957865274,
 'rougeL/v1/p90': 0.28343873517786566,
 'rougeLsum/v1/mean': 0.2624236382358652,
 'rougeLsum/v1/variance': 0.006182113389247971,
 'rougeLsum/v1/p90': 0.3278787878787879,
 'answer_correctness/v1/mean': 3.4444444444444446,
 'answer_correctne

We can also view `eval_results_table`, which shows us the metrics for each row of data.

In [28]:
results.tables["eval_results_table"]

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 95.26it/s] 


| | article | id | inputs | highlights | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | rouge1/v1/score | rouge2/v1/score | rougeL/v1/score | rougeLsum/v1/score | answer_correctness/v1/score | answer_correctness/v1/justification | answer_quality/v1/score | answer_quality/v1/justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (CNN)The Palestinian Authority officially beca... | f001ec5c4704938247d27a44948eebb37ae98d01 | Please summarize the following article:\n(CNN)... | Membership gives the ICC jurisdiction over all... | The Palestinian Authority officially became th... | 47 | 0.001351 | 9.7 | 11.3 | 0.324324 | 0.166667 | 0.27027 | 0.27027 | 4.0 | The output provided by the model is mostly cor... | 3.0 | The output is understandable and fluent, but i... |
| 1 | (CNN)Never mind cats having nine lives. A stra... | 230c522854991d053fe98a718b1defa077a8efef | Please summarize the following article:\n(CNN)... | Theia, a bully breed mix, was apparently hit b... | Theia is a friendly white-and-black bully bree... | 54 | 0.002589 | 5.7 | 6.7 | 0.456522 | 0.088889 | 0.282609 | 0.326087 | 3.0 | The output provided by the model addresses a c... | 4.0 | The output is fluent and clear, as it reads sm... |
| 2 | (CNN)If you've been following the news lately,... | 4495ba8f3a340d97a9df1476f8a35502bcce1f69 | Please summarize the following article:\n(CNN)... | Mohammad Javad Zarif has spent more time with ... | Mohammad Javad Zarif has been U.S. Secretary o... | 58 | 0.004271 | 8.1 | 9.3 | 0.32 | 0.136986 | 0.266667 | 0.266667 | 3.0 | The output provided by the model addresses a c... | 3.0 | The output is understandable but still needs i... |
| 3 | (CNN)Five Americans who were monitored for thr... | a38e72fed88684ec8d60dd5856282e999dc8c0ca | Please summarize the following article:\n(CNN)... | 17 Americans were exposed to the Ebola virus w... | Five americans who were monitored for three we... | 53 | 0.000439 | 6.9 | 9.7 | 0.382022 | 0.137931 | 0.247191 | 0.314607 | 4.0 | The output provided by the model is mostly cor... | 4.0 | The output is fluent and clear, as it provides... |
| 4 | (CNN)A Duke student has admitted to hanging a ... | c27cf1b136cc270023de959e7ab24638021bc43f | Please summarize the following article:\n(CNN)... | Student is no longer on Duke University campus... | Duke student admits to hanging a noose from a ... | 44 | 0.000296 | 5.9 | 6.0 | 0.35 | 0.102564 | 0.225 | 0.25 | | Failed to extract score and justification. Raw... | | Failed to extract score and justification. Raw... |
| 5 | (CNN)He's a blue chip college basketball recru... | 1b2cc634e2bfc6f2595260e7ed9b42f77ecbb0ce | Please summarize the following article:\n(CNN)... | College-bound basketball star asks girl with D... | Trey Moses and Ellie Meredith asked Ellie to b... | 50 | 0.003541 | 3.5 | 6.3 | 0.061538 | 0.0 | 0.061538 | 0.061538 | 3.0 | The output provided by the model addresses a c... | 3.0 | The output is understandable but still needs i... |
| 6 | (CNN)Governments around the world are using th... | e2706dce6cf26bc61b082438188fdb6e130d9e40 | Please summarize the following article:\n(CNN)... | Amnesty's annual death penalty report catalogs... | Amnesty International says governments are usi... | 63 | 0.002296 | 10.6 | 11.3 | 0.431373 | 0.2 | 0.27451 | 0.333333 | 3.0 | The output provided by the model addresses a c... | 4.0 | The output is fluent and clear, as it reads sm... |
| 7 | (CNN)Andrew Getty, one of the heirs to billion... | 0d3c8c276d079c4c225f034c69aa024cdab7869d | Please summarize the following article:\n(CNN)... | Andrew Getty's death appears to be from natura... | The coroner's preliminary assessment is there ... | 48 | 0.000145 | 9.3 | 11.0 | 0.361446 | 0.098765 | 0.192771 | 0.192771 | 4.0 | The output provided by the model is mostly cor... | 4.0 | The output is fluent and clear, as it reads sm... |
| 8 | (CNN)Filipinos are being warned to be on guard... | 6222f33c2c79b80be437335eeb3f488509e92cf5 | Please summarize the following article:\n(CNN)... | Once a super typhoon, Maysak is now a tropical... | Maysak is now classified as a tropical storm, ... | 59 | 0.000935 | 8.0 | 9.4 | 0.338028 | 0.144928 | 0.253521 | 0.28169 | 4.0 | The output provided by the model is mostly cor... | 4.0 | The output is fluent and clear, as it provides... |
| 9 | (CNN)For the first time in eight years, a TV l... | 2bd8ada1de6a7b02f59430cc82045eb8d29cf033 | Please summarize the following article:\n(CNN)... | Bob Barker returned to host "The Price Is Righ... | Bob Barker hosted "The Price Is Right" for 35 ... | 44 | 0.000201 | 5.4 | 7.4 | 0.363636 | 0.188679 | 0.290909 | 0.327273 | 3.0 | The output provided by the model addresses a c... | 4.0 | The output is fluent and clear, as it is easy ... |
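
The `mean`, `variance`, and `p90` entries in `results.metrics` are aggregates over per-row scores like the ones in this table. As a sanity check, the sketch below recomputes a few of them with the standard library, using the ten Flesch-Kincaid scores and the `answer_correctness` scores copied from this run. It assumes population variance and a linearly interpolated 90th percentile, which is what reproduces the reported numbers here; `percentile` is a helper written for this example.

```python
import math
import statistics


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Per-row scores copied from the evaluation table for this run.
fk_scores = [9.7, 5.7, 8.1, 6.9, 5.9, 3.5, 10.6, 9.3, 8.0, 5.4]
# Row 4 has no judge score because the score could not be extracted.
correctness = [4.0, 3.0, 3.0, 4.0, None, 3.0, 3.0, 4.0, 4.0, 3.0]
graded = [score for score in correctness if score is not None]

print(round(statistics.fmean(fk_scores), 4))      # 7.31
print(round(statistics.pvariance(fk_scores), 4))  # 4.4509
print(round(percentile(fk_scores, 90), 4))        # 9.79
print(round(statistics.fmean(graded), 4))         # 3.4444
```

The last value matches `answer_correctness/v1/mean` only when the row without a score is left out, so an aggregate judge metric may be based on fewer rows than the table contains.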


Finally, we can view our evaluation results in the MLflow UI under the Evaluation tab. Here, we can choose which columns to group by and a column to compare on.

![](https://i.imgur.com/uDmh4M0.png)