# LLM Evaluation with MLflow Example Notebook

This notebook demonstrates how to evaluate LLMs and RAG systems with MLflow, using simple metrics such as perplexity and toxicity, LLM-judged metrics such as relevance, and custom LLM-judged metrics such as professionalism.

In [None]:
%pip install git+https://github.com/mlflow/mlflow.git@master
%pip install openai tiktoken textstat evaluate transformers torch

In [None]:
import openai
import pandas as pd
import os
import mlflow

Set the OpenAI API key.

In [None]:
os.environ["OPENAI_API_KEY"] = "redacted"

## Basic Question-Answering Evaluation

Create a simple OpenAI model that asks `gpt-3.5-turbo` to answer each question in two sentences.

In [None]:
system_prompt = "Answer the following question in two sentences"
basic_qa_model = mlflow.openai.log_model(
    model="gpt-3.5-turbo",
    task=openai.ChatCompletion,
    artifact_path="model",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "{question}"},
    ],
)
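The `{question}` placeholder in the user message is filled in from the input data at prediction time. The snippet below is only a rough sketch of that idea using plain string formatting; `fill_messages` is a hypothetical helper written for this illustration, not part of MLflow, and MLflow's own substitution logic may differ.

```python
# Illustration only: mimics the idea of filling "{question}" from one input row.
# This is NOT MLflow's implementation; fill_messages is a hypothetical helper.
system_prompt = "Answer the following question in two sentences"
message_template = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "{question}"},
]


def fill_messages(template, **variables):
    """Return a new message list with placeholders substituted."""
    return [
        {"role": m["role"], "content": m["content"].format(**variables)}
        for m in template
    ]


filled = fill_messages(message_template, question="What is MLflow?")
print(filled[1]["content"])
```

The template itself is left untouched, so the same message list can be reused for every row of the evaluation data.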

Create a test case of `inputs` that will be passed into the model and `ground_truth` which will be used to compare against the generated output from the model.

In [None]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
            "What is Python?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
            "Python is a high-level programming language that was created by Guido van Rossum and released in 1991. It emphasizes code readability and allows developers to express concepts in fewer lines of code than languages like C++ or Java. Python is used in various domains, including web development, scientific computing, data analysis, and machine learning.",
        ],
    }
)

Call `mlflow.evaluate()` with the model and evaluation dataframe. 

In [None]:
results = mlflow.evaluate(
    basic_qa_model.model_uri,
    eval_df,
    targets="ground_truth", # specify which column corresponds to the expected output
    model_type="question-answering", # model type indicates which metrics are relevant for this task
    evaluators="default",
)
results.metrics

Inspect the evaluation results table as a dataframe to see row-by-row metrics and further assess model performance.

In [None]:
eval_table = results.tables["eval_results_table"]
display(eval_table)

## LLM-judged correctness with OpenAI GPT-4

Construct an answer similarity metric using the `answer_similarity()` metric factory function.

In [None]:
from mlflow.metrics import EvaluationExample, answer_similarity

# Create an example that shows the judge what answer similarity means for this problem.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is, "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)

# Construct the metric using OpenAI GPT-4 as the judge
answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[example])

print(answer_similarity_metric)

Call `mlflow.evaluate()` again, this time with your new `answer_similarity_metric`.

In [None]:
results = mlflow.evaluate(
    basic_qa_model.model_uri,
    eval_df,
    targets="ground_truth",
    model_type="question-answering",
    evaluators="default",
    extra_metrics=[answer_similarity_metric]
)
results.metrics

See the row-by-row LLM-judged answer similarity scores and justifications.

In [None]:
eval_table = results.tables["eval_results_table"]
eval_table

TODO: Find the best and worst input
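One possible way to find the best and worst inputs is to look up the rows with the highest and lowest score. The sketch below runs on a stand-in dataframe with made-up scores so that it works on its own; the column name `answer_similarity/v1/score` is an assumption, so check `eval_table.columns` for the name your MLflow version actually produces.

```python
import pandas as pd

# Stand-in for results.tables["eval_results_table"]; the scores are made up.
# The column name is an assumption; check eval_table.columns for the real one.
SCORE_COL = "answer_similarity/v1/score"
toy_table = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is Spark?", "What is Python?"],
        SCORE_COL: [4, 2, 5],
    }
)

# idxmax/idxmin return the index label of the highest/lowest score.
best_row = toy_table.loc[toy_table[SCORE_COL].idxmax()]
worst_row = toy_table.loc[toy_table[SCORE_COL].idxmin()]

print("best:", best_row["inputs"])
print("worst:", worst_row["inputs"])
```

With the real table, replace `toy_table` with `eval_table` and read the justification column next to the score to see why the judge rated a row the way it did.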

## Custom LLM-judged metric for professionalism

Create a custom metric that scores the professionalism of the model outputs. Use `make_genai_metric` with a metric definition, grading prompt, grading example, and judge model configuration.

In [None]:
from mlflow.metrics import EvaluationExample, make_genai_metric

professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language."
    ),
    # Adjacent string literals are concatenated, so each piece ends with a space.
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 0: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts. "
        "- Score 1: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal business or academic settings. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
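The `aggregations` argument asks for the mean, variance, and 90th percentile of the per-row scores. As a rough sketch of what those summaries mean, the snippet below computes them with NumPy on made-up scores. Whether MLflow uses the same variance and percentile conventions is an assumption, so treat the numbers as illustrative only.

```python
import numpy as np

# Made-up per-row professionalism scores, for illustration only.
scores = np.array([2, 3, 4, 4, 1])

summary = {
    "mean": float(np.mean(scores)),
    # Population variance (ddof=0); MLflow's convention is assumed, not verified.
    "variance": float(np.var(scores)),
    # NumPy interpolates linearly between data points by default.
    "p90": float(np.percentile(scores, 90)),
}
print(summary)
```

A high `p90` next to a low mean suggests that only a few answers are professional, which is easier to spot in the row-by-row table than in the aggregates.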

TODO: Try out your new professionalism metric on a sample output to make sure it behaves as you expect

Call `mlflow.evaluate` with your new professionalism metric. 

In [None]:
results = mlflow.evaluate(
    basic_qa_model.model_uri,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    extra_metrics=[professionalism_metric]
)
print(results.metrics)

In [None]:
results.tables["eval_results_table"]

The professionalism score of the `basic_qa_model` is not very good. Let's create a new model that performs better.

In [None]:
system_prompt = "Answer the following question in two sentences using professional language"
professional_qa_model = mlflow.openai.log_model(
    model="gpt-3.5-turbo",
    task=openai.ChatCompletion,
    artifact_path="model",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "{question}"},
    ],
)

Call `mlflow.evaluate()` using the new model. Observe that the professionalism score has increased!

In [None]:
results = mlflow.evaluate(
    professional_qa_model.model_uri,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    extra_metrics=[professionalism_metric]
)
print(results.metrics)

In [None]:
results.tables["eval_results_table"]
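Because `results` is reassigned on every call to `mlflow.evaluate()`, comparing the two models means keeping each run's `results.metrics` under its own name. The sketch below shows one way to compute the difference between two metric dictionaries. The numbers are placeholders rather than output from this notebook, the key name `professionalism/v1/mean` is an assumption, and `metric_deltas` is a hypothetical helper.

```python
# Placeholder aggregate metrics; these numbers are NOT results from this notebook.
# The key name is an assumption about how MLflow labels the metric.
basic_metrics = {"professionalism/v1/mean": 2.0}
professional_metrics = {"professionalism/v1/mean": 3.5}


def metric_deltas(before, after):
    """Return after - before for every metric present in both dicts."""
    return {key: after[key] - before[key] for key in before.keys() & after.keys()}


deltas = metric_deltas(basic_metrics, professional_metrics)
print(deltas)
```

A positive delta means the second model scored higher, which is the desired direction here because the metric was created with `greater_is_better=True`.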