# LLM RAG Evaluation with MLflow Example Notebook

In this notebook, we demonstrate how to evaluate a RAG system with MLflow.

In [1]:
import os

Set the OpenAI API key.

In [2]:
os.environ["OPENAI_API_KEY"] = "redacted"

In [3]:
import pandas as pd

import mlflow

## Create a RAG system

Use LangChain and Chroma to create a RAG system that answers questions based on the MLflow documentation.

In [4]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [5]:
loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
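
To make the `chain_type="stuff"` setting concrete: a stuff-style chain retrieves the top-k chunks for a question and concatenates ("stuffs") them into one prompt for the LLM. The sketch below illustrates that idea in plain Python with a toy word-overlap score. It is not LangChain's implementation, and `score`, `retrieve`, and `build_prompt` are hypothetical helpers. Clamping `k` to the number of stored chunks mirrors what a vector store has to do when asked for more results than it holds.

```python
import re


def score(question: str, chunk: str) -> int:
    """Toy relevance score: number of distinct words shared by question and chunk."""
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", chunk.lower()))
    return len(q_words & c_words)


def retrieve(question: str, chunks: list, k: int = 4) -> list:
    """Return the k highest-scoring chunks (fewer if the store holds fewer than k)."""
    k = min(k, len(chunks))
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list) -> str:
    """Concatenate all retrieved chunks into a single prompt (the "stuff" step)."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative chunks only; the real notebook indexes the MLflow docs page.
chunks = [
    "MLflow is an open source platform for the machine learning lifecycle.",
    "Chroma is a vector store.",
    "Autologging can be disabled with mlflow.autolog(disable=True).",
]
top = retrieve("What is MLflow?", chunks, k=4)
prompt = build_prompt("What is MLflow?", top)
```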

## Evaluate the RAG system using `mlflow.evaluate()`

Create a simple function that runs each question through the RAG chain and collects the outputs.

In [6]:
def model(input_df):
    answer = []
    for index, row in input_df.iterrows():
        answer.append(qa(row["questions"]))

    return answer
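
Because the chain was built with `return_source_documents=True`, each call is expected to return a dictionary rather than a bare string, so `model` returns a list of dictionaries. The sketch below swaps in a stub for `qa` (`fake_qa` is hypothetical and makes no API calls) to show that shape. The key names `query`, `result`, and `source_documents` are taken from the result columns and the `predictions="result"` argument that appear later in this notebook.

```python
def fake_qa(question):
    """Hypothetical stand-in for the RetrievalQA chain; makes no network calls."""
    return {
        "query": question,
        "result": f"(stub answer to: {question})",
        "source_documents": ["(stub chunk)"],
    }


def model_sketch(questions):
    # Same loop as `model`, but over a plain list instead of DataFrame rows.
    return [fake_qa(q) for q in questions]


outputs = model_sketch(["What is MLflow?", "How to log_table()?"])

# The "result" key holds the answer text that is later selected as the prediction.
answers = [row["result"] for row in outputs]
```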

Create an eval dataset

In [7]:
eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "How to run Mlflow.evalaute()?",
            "How to log_table()?",
            "How to load_table()?",
        ],
    }
)

Create a faithfulness metric

In [8]:
from mlflow.metrics import faithfulness, EvaluationExample

# Create a good and bad example for faithfulness in the context of this problem
faithfulness_examples = [
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions. In Databricks, autologging is enabled by default. ",
        score=2,
        justification="The output uses the mlflow.autolog() function provided in the context, but the claim that autologging is enabled by default in Databricks is not supported by the context.",
        grading_context={
            "context": "mlflow.autolog(log_input_examples: bool = False, log_model_signatures: bool = True, log_models: bool = True, log_datasets: bool = True, disable: bool = False, exclusive: bool = False, disable_for_unsupported_versions: bool = False, silent: bool = False, extra_tags: Optional[Dict[str, str]] = None) → None[source] Enables (or disables) and configures autologging for all supported integrations. The parameters are passed to any autologging integrations that support them. See the tracking docs for a list of supported autologging integrations. Note that framework-specific configurations set at any point will take precedence over any configurations set by this function."
        },
    ),
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions.",
        score=5,
        justification="The output provides a solution that is using the mlflow.autolog() function that is provided in the context.",
        grading_context={
            "context": "mlflow.autolog(log_input_examples: bool = False, log_model_signatures: bool = True, log_models: bool = True, log_datasets: bool = True, disable: bool = False, exclusive: bool = False, disable_for_unsupported_versions: bool = False, silent: bool = False, extra_tags: Optional[Dict[str, str]] = None) → None[source] Enables (or disables) and configures autologging for all supported integrations. The parameters are passed to any autologging integrations that support them. See the tracking docs for a list of supported autologging integrations. Note that framework-specific configurations set at any point will take precedence over any configurations set by this function."
        },
    ),
]

faithfulness_metric = faithfulness(model="openai:/gpt-4", examples=faithfulness_examples)
print(faithfulness_metric)

EvaluationMetric(name=faithfulness, greater_is_better=True, long_name=faithfulness, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Faithfulness is only evaluated with the provided output and provided context, please ignore the provided input entirely when scoring faithfulness. Faithfulne

Create an answer relevance metric

In [9]:
from mlflow.metrics import answer_relevance

answer_relevance_metric = answer_relevance(model="openai:/gpt-4")
print(answer_relevance_metric)

EvaluationMetric(name=answer_relevance, greater_is_better=True, long_name=answer_relevance, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_relevance based on the input and output.
A definition of answer_relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Answer relevance measures the appropriateness and applicability of the output with respect to the input. Scores should reflect the extent to 

In [10]:
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, answer_relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)

2023/10/25 15:11:34 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/25 15:11:34 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
Using pad_token, but it is not set yet.


2023/10/25 15:12:14 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/25 15:12:14 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/25 15:12:14 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: perplexity
Using pad_token, but it is not set yet.


2023/10/25 15:12:17 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/25 15:12:17 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/25 15:12:17 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/25 15:12:17 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: faithfulness
2023/10/25 15:12:27 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_relevance
2023/10/25 15:12:27 INFO mlflow.metrics.genai.genai_metric: Failed to score model on payload. Error: MlflowException("Error response from OpenAI:\n {'error': {'message': 'Rate limit reached for gpt-4 in organization org-n6HsoZWRns6Q46qNiDMiTFU7 on tokens per min. Limit: 10000 / min. Please try again in 6ms. Contact us through our help center at help.openai.com if you continue to have issues.', 'type': 'tokens', 'param': None, 'code': 'rate_li

{'toxicity/v1/mean': 0.00019068770416197367, 'toxicity/v1/variance': 2.3719048510286652e-09, 'toxicity/v1/p90': 0.0002462980875861831, 'toxicity/v1/ratio': 0.0, 'perplexity/v1/mean': 112.43641662597656, 'perplexity/v1/variance': 7690.103411271099, 'perplexity/v1/p90': 202.41207427978514, 'flesch_kincaid_grade_level/v1/mean': 6.25, 'flesch_kincaid_grade_level/v1/variance': 34.042500000000004, 'flesch_kincaid_grade_level/v1/p90': 12.880000000000003, 'ari_grade_level/v1/mean': 8.875, 'ari_grade_level/v1/variance': 33.62687499999999, 'ari_grade_level/v1/p90': 15.030000000000001, 'faithfulness/v1/mean': 4.0, 'faithfulness/v1/variance': 3.0, 'faithfulness/v1/p90': 5.0, 'answer_relevance/v1/mean': 5.0, 'answer_relevance/v1/variance': 0.0, 'answer_relevance/v1/p90': 5.0}
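
The `col_mapping` entry in `evaluator_config` tells the evaluator which columns to pass for the argument names a metric expects: here `inputs` is read from `questions` and `context` from `source_documents`. The sketch below illustrates that lookup on a single row. `resolve_metric_args` is a hypothetical helper, not MLflow's internal code, and the fallback to a same-named column is an assumption made for illustration.

```python
def resolve_metric_args(row, expected_args, col_mapping):
    """Pick a value for each metric argument, renaming via col_mapping when present."""
    resolved = {}
    for arg in expected_args:
        column = col_mapping.get(arg, arg)  # assumption: fall back to the same name
        if column not in row:
            raise KeyError(f"metric argument {arg!r} needs column {column!r}")
        resolved[arg] = row[column]
    return resolved


col_mapping = {"inputs": "questions", "context": "source_documents"}

# One illustrative row combining the dataset column and the model output columns.
row = {
    "questions": "What is MLflow?",
    "result": "(model answer)",
    "source_documents": ["(retrieved chunk)"],
}
args = resolve_metric_args(row, ["inputs", "context"], col_mapping)
```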


In [11]:
results.tables["eval_results_table"]

| | questions | outputs | query | source_documents | latency | token_count | toxicity/v1/score | perplexity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | faithfulness/v1/score | faithfulness/v1/justification | answer_relevance/v1/score | answer_relevance/v1/justification |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What is MLflow? | MLflow is an open source platform for managin... | What is MLflow? | [{'lc_attributes': {}, 'lc_namespace': ['langc... | 4.172886 | 176 | 0.000208 | 28.626591 | 15.4 | 18.9 | 5 | The output provided by the model is completely... | | |
| 1 | How to run Mlflow.evalaute()? | You can use the Mlflow.evaluate() function to... | How to run Mlflow.evalaute()? | [{'lc_attributes': {}, 'lc_namespace': ['langc... | 1.699869 | 48 | 0.000263 | 21.14967 | 7.0 | 6.0 | 1 | The output claims that there is a function cal... | | |
| 2 | How to log_table()? | log_table() is not a function in MLflow. | How to log_table()? | [{'lc_attributes': {}, 'lc_namespace': ['langc... | 4.128658 | 11 | 0.000148 | 206.053131 | 0.1 | 5.0 | 5 | The output claim that "log_table() is not a fu... | | |
| 3 | How to load_table()? | load_table() is not a function in MLflow. | How to load_table()? | [{'lc_attributes': {}, 'lc_namespace': ['langc... | 4.035665 | 11 | 0.000144 | 193.916275 | 2.5 | 5.6 | 5 | The output claim that "load_table() is not a f... | 5.0 | The output directly addresses the input questi... |
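
The aggregate values in `results.metrics` can be cross-checked against the per-row scores in this table. The sketch below uses only the standard library and assumes population variance and a linearly interpolated 90th percentile. Those assumptions reproduce the reported faithfulness and Flesch-Kincaid figures, but they are inferred from the numbers rather than taken from MLflow's source. Missing scores are skipped, which is consistent with `answer_relevance/v1/variance` being 0.0 when only one row was scored.

```python
import math
import statistics


def p90(values):
    """90th percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = 0.9 * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(scores):
    """Mean, population variance, and p90 over the rows that were scored."""
    present = [float(s) for s in scores if s is not None]  # skip unscored rows
    return {
        "mean": statistics.fmean(present),
        "variance": statistics.pvariance(present),
        "p90": p90(present),
    }


# Per-row scores copied from the results table above.
faithfulness_scores = [5, 1, 5, 5]
flesch_kincaid_scores = [15.4, 7.0, 0.1, 2.5]
answer_relevance_scores = [None, None, None, 5.0]

faithfulness_summary = summarize(faithfulness_scores)  # mean 4.0, variance 3.0, p90 5.0
flesch_summary = summarize(flesch_kincaid_scores)
relevance_summary = summarize(answer_relevance_scores)
```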
