# LLM Evaluation with MLflow Example Notebook

In this notebook, we demonstrate how to evaluate LLMs and RAG systems with MLflow, using simple metrics such as toxicity, LLM-judged metrics such as relevance, and custom LLM-judged metrics such as professionalism.

We need to set our OpenAI API key, since we will be using GPT-4 for our LLM-judged metrics.

To set your private key safely, either export it in a command-line terminal for the current session or, to make it available in all of your sessions, add the following entry to your shell's startup file (e.g., .bashrc or .zshrc):

`OPENAI_API_KEY=<your openai API key>`
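Before running any LLM-judged metric, it can help to fail fast when the key is missing. The following is a minimal sketch using only the standard library; `require_env` and `mask` are hypothetical helper names, not part of MLflow or the OpenAI client.

```python
import os
from collections.abc import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to your "
            "shell startup file before starting the notebook."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safe to print."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]


# Demonstration with a stand-in mapping (the key below is a placeholder).
# In the notebook you would call require_env("OPENAI_API_KEY") and let it
# read os.environ.
demo_env = {"OPENAI_API_KEY": "sk-placeholder-1234"}
print(mask(require_env("OPENAI_API_KEY", demo_env)))
```

Passing the mapping as a parameter keeps the check testable without touching the real environment.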

In [23]:
!pip install 'mlflow[genai]' tf-keras textstat

Collecting tf-keras
  Downloading tf_keras-2.17.0-py3-none-any.whl.metadata (1.6 kB)
Collecting textstat
  Downloading textstat-0.7.4-py3-none-any.whl.metadata (14 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.16.0-py3-none-any.whl.metadata (3.2 kB)
Downloading tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
Downloading textstat-0.7.4-py3-none-any.whl (105 kB)
Downloading pyphen-0.16.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pyphen, textstat, tf-keras
Successfully installed pyphen-0.16.0 textstat-0.7.4 tf-keras-2.17.0


In [17]:
%load_ext dotenv
%dotenv

import openai
import pandas as pd

import mlflow

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [42]:
MLFLOW_TRACKING_URI = "http://localhost:5001"

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

## Basic Question-Answering Evaluation

Create a test case of `inputs` that will be passed into the model and `ground_truth` which will be used to compare against the generated output from the model.

In [43]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)
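Because each `inputs` entry is compared against the `ground_truth` entry in the same row, it may be worth checking the columns before building the dataframe. This is a small sketch on plain Python lists; `validate_eval_columns` is a hypothetical helper, and the ground-truth strings are shortened to the first sentence of the ones above.

```python
def validate_eval_columns(columns, required=("inputs", "ground_truth")):
    """Check that required columns exist, align row-for-row, and hold non-empty text."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for name in required:
        for row, text in enumerate(columns[name]):
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"{name}[{row}] is empty or not a string")
    return lengths[required[0]]


columns = {
    "inputs": ["What is MLflow?", "What is Spark?"],
    "ground_truth": [
        "MLflow is an open-source platform for managing the end-to-end "
        "machine learning (ML) lifecycle.",
        "Apache Spark is an open-source, distributed computing system "
        "designed for big data processing and analytics.",
    ],
}
print(validate_eval_columns(columns), "rows ready for evaluation")
```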

Create a simple OpenAI model that asks `gpt-4o-mini` to answer the question in two sentences, then call `mlflow.evaluate()` with the model and the evaluation dataframe.

In [44]:
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics

2024/10/24 14:50:33 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/10/24 14:50:35 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2024/10/24 14:50:36 INFO mlflow.tracking._tracking_service.client: 🏃 View run invincible-shark-965 at: http://localhost:5001/#/experiments/0/runs/08b5bccfff114719949afe18cc153074.
2024/10/24 14:50:36 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5001/#/experiments/0.


{'toxicity/v1/mean': 0.00013739550195168704,
 'toxicity/v1/variance': 9.55839459172402e-13,
 'toxicity/v1/p90': 0.0001381776382913813,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 14.450000000000001,
 'flesch_kincaid_grade_level/v1/variance': 4.622500000000001,
 'flesch_kincaid_grade_level/v1/p90': 16.17,
 'ari_grade_level/v1/mean': 19.15,
 'ari_grade_level/v1/variance': 2.1025000000000027,
 'ari_grade_level/v1/p90': 20.310000000000002,
 'exact_match/v1': 0.0}
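The two grade-level metrics are readability scores. As a rough illustration, the sketch below implements the standard published Flesch-Kincaid and Automated Readability Index formulas from raw counts. The counts in the demo are made up, and the library that produces the scores above (presumably `textstat`, installed earlier) does its own tokenization and syllable counting, so its results for a given text may differ.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index from character, word, and sentence counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical counts for a two-sentence answer with long, technical words.
print(round(flesch_kincaid_grade(words=40, sentences=2, syllables=80), 2))
print(round(automated_readability_index(characters=240, words=40, sentences=2), 2))
```

Both formulas grow with sentence length and word length, which is why dense technical answers tend to score at a college grade level or higher.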

Inspect the evaluation results table as a dataframe to see row-by-row metrics and further assess model performance.

In [45]:
results.tables["eval_results_table"]

| | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score |
|---|---|---|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open-source platform for managing... | MLflow is an open-source platform designed to ... | 43 | 0.000138 | 16.6 | 20.6 |
| 1 | What is Spark? | Apache Spark is an open-source, distributed co... | Apache Spark is an open-source, distributed co... | 51 | 0.000136 | 12.3 | 17.7 |
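The aggregate metrics reported earlier can be reproduced from these per-row scores. The sketch below assumes a population variance and a linearly interpolated 90th percentile; that assumption is inferred from the numbers in this notebook rather than taken from MLflow's source, and `percentile` and `aggregate` are hypothetical helper names.

```python
import math
from statistics import fmean, pvariance


def percentile(values, q):
    """Percentile with linear interpolation between neighbors, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def aggregate(scores):
    """Summarize per-row scores as mean, population variance, and p90."""
    return {
        "mean": fmean(scores),
        "variance": pvariance(scores),
        "p90": percentile(scores, 90),
    }


# Per-row scores copied from the results table above.
fk_scores = [16.6, 12.3]
ari_scores = [20.6, 17.7]

print({name: round(value, 4) for name, value in aggregate(fk_scores).items()})
print({name: round(value, 4) for name, value in aggregate(ari_scores).items()})
```

With only two rows, the p90 sits nine tenths of the way from the lower score to the higher one, so it says little about the tail; larger evaluation sets make it more informative.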


## LLM-judged correctness with OpenAI GPT-4

Construct an answer similarity metric using the `answer_similarity()` metric factory function.

In [46]:
from mlflow.metrics.genai import EvaluationExample, answer_similarity

# Create an example that shows what answer_similarity means for this problem.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "targets": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)

# Construct the metric using OpenAI GPT-4 as the judge
answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[example])

print(answer_similarity_metric)

EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_similarity based on the rubric
justification: Your reasoning about the model's answer_similarity score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_similarity based on the input and output.
A definition of answer_similarity and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them be
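The judge prompt above asks for a reply in two lines, `score:` followed by `justification:`. As an illustration of what consuming such a reply involves, here is a minimal parsing sketch. It is not MLflow's own parser; `parse_judge_reply` is a hypothetical helper and the sample reply is invented.

```python
import re

# "score:" must be followed by an integer on its own line.
_SCORE = re.compile(r"^\s*score\s*:\s*(\d+)\s*$", re.IGNORECASE | re.MULTILINE)
# The justification runs from its label to the end of the reply.
_JUSTIFICATION = re.compile(
    r"^\s*justification\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE | re.DOTALL
)


def parse_judge_reply(reply: str):
    """Extract (score, justification) from a two-field judge reply."""
    score_match = _SCORE.search(reply)
    justification_match = _JUSTIFICATION.search(reply)
    if not score_match or not justification_match:
        raise ValueError("reply must contain both 'score:' and 'justification:' lines")
    return int(score_match.group(1)), justification_match.group(1).strip()


sample_reply = "score: 4\njustification: The answer covers the main points."
print(parse_judge_reply(sample_reply))
```

Raising on a malformed reply, rather than returning a default score, keeps a misbehaving judge from silently skewing the aggregate metrics.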

Call `mlflow.evaluate()` again, this time with your new `answer_similarity_metric`.

In [47]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[answer_similarity_metric],  # use the answer similarity metric created above
    )
results.metrics

2024/10/24 14:52:01 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/10/24 14:52:02 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


2024/10/24 14:52:09 INFO mlflow.tracking._tracking_service.client: 🏃 View run rare-lark-648 at: http://localhost:5001/#/experiments/0/runs/2c693b33b281471892339b265fb6e8ee.
2024/10/24 14:52:09 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5001/#/experiments/0.


{'toxicity/v1/mean': 0.000139602372655645,
 'toxicity/v1/variance': 1.8213994765550745e-13,
 'toxicity/v1/p90': 0.00013994379551149905,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 13.85,
 'flesch_kincaid_grade_level/v1/variance': 0.5625,
 'flesch_kincaid_grade_level/v1/p90': 14.45,
 'ari_grade_level/v1/mean': 19.75,
 'ari_grade_level/v1/variance': 0.30250000000000077,
 'ari_grade_level/v1/p90': 20.19,
 'exact_match/v1': 0.0,
 'answer_similarity/v1/mean': 4.0,
 'answer_similarity/v1/variance': 0.0,
 'answer_similarity/v1/p90': 4.0}

See the row-by-row LLM-judged answer similarity scores and justifications.

In [48]:
results.tables["eval_results_table"]

| | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | answer_similarity/v1/score | answer_similarity/v1/justification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open-source platform for managing... | MLflow is an open-source platform designed to ... | 57 | 0.000139 | 14.6 | 20.3 | 4 | The model's output accurately describes MLflow... |
| 1 | What is Spark? | Apache Spark is an open-source, distributed co... | Apache Spark is an open-source distributed com... | 47 | 0.00014 | 13.1 | 19.2 | 4 | The output effectively describes what Apache S... |
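On a larger evaluation set, reading every justification is impractical, so one option is to surface only the rows the judge scored below a chosen bar. A small sketch on plain dictionaries follows; `rows_needing_review` is a hypothetical helper, and the rows reuse the scores and truncated justifications shown above.

```python
def rows_needing_review(rows, score_key, threshold):
    """Return rows whose judge score is below the threshold, lowest score first."""
    flagged = [row for row in rows if row[score_key] < threshold]
    return sorted(flagged, key=lambda row: row[score_key])


SCORE = "answer_similarity/v1/score"
JUSTIFICATION = "answer_similarity/v1/justification"

# Values copied from the results table above (justifications are truncated there).
rows = [
    {
        "inputs": "What is MLflow?",
        SCORE: 4,
        JUSTIFICATION: "The model's output accurately describes MLflow...",
    },
    {
        "inputs": "What is Spark?",
        SCORE: 4,
        JUSTIFICATION: "The output effectively describes what Apache S...",
    },
]

for row in rows_needing_review(rows, SCORE, threshold=5):
    print(row["inputs"], "->", row[JUSTIFICATION])
```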


## Custom LLM-judged metric for professionalism

Create a custom metric that measures the professionalism of the model outputs. Use `make_genai_metric` with a metric definition, grading prompt, grading example, and judge model configuration.

In [49]:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language"
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts. "
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. "
        "- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(professionalism_metric)

EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's professionalism based on the rubric
justification: Your reasoning about the model's professionalism score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called professionalism based on the input and output.
A definition of professionalism and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before complet
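A rubric written as adjacent string literals is concatenated by Python with no separator, so every piece has to carry its own trailing space or newline. One alternative, sketched here, is to keep the score levels in a dictionary and join them explicitly. `build_grading_prompt` is a hypothetical helper; the resulting string is meant to be passed as the `grading_prompt` argument shown above.

```python
def build_grading_prompt(metric_name: str, intro: str, levels: dict[int, str]) -> str:
    """Render a rubric with one score level per line, in ascending score order."""
    if not levels:
        raise ValueError("at least one score level is required")
    lines = [f"{metric_name}: {intro.strip()}"]
    for score in sorted(levels):
        lines.append(f"- Score {score}: {levels[score].strip()}")
    return "\n".join(lines)


# Level descriptions shortened from the professionalism rubric above.
professionalism_levels = {
    1: "Language is extremely casual, informal, and may include slang or colloquialisms.",
    2: "Language is casual but generally respectful and avoids strong informality or slang.",
    3: "Language is balanced and avoids extreme informality or formality.",
    4: "Language is noticeably formal, respectful, and avoids casual elements.",
    5: "Language is excessively formal, respectful, and avoids casual elements.",
}

grading_prompt = build_grading_prompt(
    "Professionalism",
    "If the answer is written using a professional tone, below are the "
    "details for different scores:",
    professionalism_levels,
)
print(grading_prompt)
```

Joining with newlines guarantees that no two score levels run together, whatever order they were written in.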

Call `mlflow.evaluate()` with your new professionalism metric.

In [50]:
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric],  # use the professionalism metric we created above
    )
print(results.metrics)

2024/10/24 14:52:28 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/10/24 14:52:29 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


2024/10/24 14:52:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run nimble-ant-199 at: http://localhost:5001/#/experiments/0/runs/5ba77b7cac9e4cae8a9ae1de4b995a84.
2024/10/24 14:52:35 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5001/#/experiments/0.


{'toxicity/v1/mean': 0.00013975516048958525, 'toxicity/v1/variance': 5.398332805167781e-13, 'toxicity/v1/p90': 0.00014034294727025554, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 15.65, 'flesch_kincaid_grade_level/v1/variance': 3.802500000000004, 'flesch_kincaid_grade_level/v1/p90': 17.21, 'ari_grade_level/v1/mean': 20.7, 'ari_grade_level/v1/variance': 3.2400000000000024, 'ari_grade_level/v1/p90': 22.14, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}


In [51]:
results.tables["eval_results_table"]

| | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | professionalism/v1/score | professionalism/v1/justification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open-source platform for managing... | MLflow is an open-source platform designed for... | 55 | 0.000139 | 17.6 | 22.5 | 4 | The language used in the response is formal an... |
| 1 | What is Spark? | Apache Spark is an open-source, distributed co... | Apache Spark is an open-source distributed com... | 42 | 0.00014 | 13.7 | 18.9 | 4 | The language used in the response is formal an... |


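Let's see whether we can improve on `basic_qa_model` by creating a new model with a different system prompt.

Call `mlflow.evaluate()` using the new model and compare the professionalism score with the previous run. In the output below, the mean professionalism score is 4.0, the same as before, while the responses are much longer (token counts of 251 and 247 versus 55 and 42).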

In [52]:
with mlflow.start_run() as run:
    system_prompt = "Answer the following question using extreme formality."
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric],
    )
print(results.metrics)

2024/10/24 14:52:37 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/10/24 14:52:42 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


2024/10/24 14:52:48 INFO mlflow.tracking._tracking_service.client: 🏃 View run resilient-mare-739 at: http://localhost:5001/#/experiments/0/runs/18a504b7bff644bab7d7f9161af31456.
2024/10/24 14:52:48 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5001/#/experiments/0.


{'toxicity/v1/mean': 0.0002254065329907462, 'toxicity/v1/variance': 2.251930386804448e-09, 'toxicity/v1/p90': 0.0002633701398735866, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 16.2, 'flesch_kincaid_grade_level/v1/variance': 0.25, 'flesch_kincaid_grade_level/v1/p90': 16.599999999999998, 'ari_grade_level/v1/mean': 19.9, 'ari_grade_level/v1/variance': 0.6399999999999983, 'ari_grade_level/v1/p90': 20.54, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}


In [53]:
results.tables["eval_results_table"]

| | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | professionalism/v1/score | professionalism/v1/justification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open-source platform for managing... | MLflow is an open-source platform designed to ... | 251 | 0.000273 | 16.7 | 20.7 | 4 | The language used in the response is formal, r... |
| 1 | What is Spark? | Apache Spark is an open-source, distributed co... | Apache Spark, as it is formally known, is an o... | 247 | 0.000178 | 15.7 | 19.1 | 4 | The language used in the response is formal an... |
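To compare two prompts, it is easier to look at metric differences than at two separate dictionaries. The sketch below computes per-metric deltas for the keys both runs share; `metric_deltas` is a hypothetical helper, and the values are copied from the two `results.metrics` printouts for the professionalism runs above.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


# Selected means from the run using the two-sentence system prompt.
baseline_metrics = {
    "professionalism/v1/mean": 4.0,
    "flesch_kincaid_grade_level/v1/mean": 15.65,
    "ari_grade_level/v1/mean": 20.7,
}
# The same metrics from the run using the "extreme formality" system prompt.
candidate_metrics = {
    "professionalism/v1/mean": 4.0,
    "flesch_kincaid_grade_level/v1/mean": 16.2,
    "ari_grade_level/v1/mean": 19.9,
}

for name, delta in metric_deltas(baseline_metrics, candidate_metrics).items():
    print(f"{name}: {delta:+.2f}")
```

With only two evaluation rows, small deltas are well within run-to-run variation of the model's answers, so treat them as a prompt for closer inspection rather than as evidence of improvement.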


In [54]:
results

<mlflow.models.evaluation.base.EvaluationResult at 0x136b5bf10>