# LLM RAG Evaluation with MLflow Example Notebook

In this notebook, we demonstrate how to evaluate a RAG (retrieval-augmented generation) system with MLflow.

In [0]:
%pip install git+https://github.com/mlflow/mlflow.git@master
%pip install openai tiktoken textstat evaluate transformers torch langchain chromadb
dbutils.library.restartPython()

In [0]:
import openai
import pandas as pd
import os
import mlflow

Set OpenAI Key

In [0]:
# Replace the placeholder with your own key; never commit a real key to a shared notebook.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
openai.api_key = os.environ["OPENAI_API_KEY"]
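Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key was exported in the environment before the notebook started; `require_env` is a hypothetical helper name, not part of `openai` or MLflow:

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or fail with a clear message."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Usage in the notebook (commented out so the sketch runs without a key):
# openai.api_key = require_env("OPENAI_API_KEY")
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later from inside the chain.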

## Create a RAG system

Use LangChain and Chroma to create a RAG system that answers questions based on the MLflow documentation.

In [0]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [0]:
loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
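To make the retrieve-then-"stuff" idea concrete, here is a toy sketch in plain Python. It is not how LangChain, Chroma, or OpenAI embeddings are implemented: word counts stand in for embeddings, a sorted list stands in for the vector store, and all function names (`bag_of_words`, `retrieve`, `stuff_prompt`) and the prompt wording are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy 'embedding': lowercase word counts (a stand-in for real embeddings)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


def stuff_prompt(question, context_chunks):
    """'Stuff' every retrieved chunk into a single prompt for the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up chunks, standing in for the split MLflow documentation.
chunks = [
    "MLflow Tracking records parameters, metrics, and artifacts for each run.",
    "mlflow.evaluate() scores a model on an evaluation dataset.",
    "Chroma is a vector store for embeddings.",
]
question = "How do I evaluate a model with mlflow.evaluate?"
top = retrieve(question, chunks, k=1)
prompt = stuff_prompt(question, top)
```

In the real chain, the retriever plays the role of `retrieve` and the `"stuff"` chain type plays the role of `stuff_prompt`: all retrieved chunks go into one prompt, which is simple but bounded by the model's context window.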

## Evaluate the RAG system using `mlflow.evaluate()`

Create a simple function that runs each input question through the RAG chain and collects the chain outputs.

In [0]:
def model(input_df):
    # Run every question through the RAG chain and collect the outputs.
    answer = []
    for index, row in input_df.iterrows():
        answer.append(qa(row["questions"]))

    return answer
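Because the chain is built with `return_source_documents=True`, each call returns a dictionary with `query`, `result`, and `source_documents` keys, which is why the evaluation config later refers to `result` and `source_documents`. The sketch below shows that shape offline. `fake_qa` and `model_sketch` are hypothetical stand-ins, the strings are placeholders (the real chain returns LangChain `Document` objects as sources), and plain dicts replace DataFrame rows so nothing beyond the standard library is needed.

```python
def fake_qa(question):
    """Hypothetical stand-in for the `qa` chain; no LLM or vector store is called."""
    return {
        "query": question,
        "result": f"(answer to: {question})",
        "source_documents": ["(retrieved chunk 1)", "(retrieved chunk 2)"],
    }


def model_sketch(rows, chain=fake_qa):
    """Same loop as `model`, iterating over plain dicts instead of DataFrame rows."""
    return [chain(row["questions"]) for row in rows]


outputs = model_sketch([{"questions": "What is MLflow?"}])
print(sorted(outputs[0]))  # keys of the first output record
```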

Create an eval dataset

In [0]:
eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "How to run mlflow.evaluate()?",
            "How to log_table()?",
            "How to load_table()?"
        ],
    }
)

Create a relevance metric

In [0]:
from mlflow.metrics.genai.metric_definitions import relevance

relevance_metric = relevance(model="openai:/gpt-3.5-turbo-16k")
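The relevance metric is judged by an LLM: the question, the answer, and the retrieved context are sent to a judge model, which replies with a score and a justification. The sketch below illustrates that general pattern only. The prompt wording, the reply format, and the names `build_grading_prompt`, `parse_judgement`, and `stub_judge` are all assumptions for illustration and do not reproduce MLflow's built-in prompt or parsing.

```python
import re


def build_grading_prompt(question, answer, context):
    """Hypothetical grading prompt; the built-in metric's prompt is more detailed."""
    return (
        "Rate from 1 to 5 how relevant the answer is to the question, "
        "given the context.\n"
        f"Question: {question}\nAnswer: {answer}\nContext: {context}\n"
        "Reply as 'score: <1-5>' followed by 'justification: <text>'."
    )


def parse_judgement(reply):
    """Extract (score, justification); the score is None if the reply is unparsable."""
    score = re.search(r"score:\s*([1-5])\b", reply, re.IGNORECASE)
    why = re.search(r"justification:\s*(.+)", reply, re.IGNORECASE | re.DOTALL)
    return (int(score.group(1)) if score else None, why.group(1).strip() if why else "")


def stub_judge(prompt):
    """Stand-in for the judge LLM so the sketch runs offline."""
    return "score: 4\njustification: The answer addresses the question using the context."


score, justification = parse_judgement(
    stub_judge(build_grading_prompt("What is MLflow?", "An open source ML platform.", "MLflow docs excerpt"))
)
```

Since the judge is itself a model, scores can vary between judge models and runs, so it helps to read the justifications alongside the numbers.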

In [0]:
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    # Use the relevance metric defined above; latency() is called to build the metric.
    extra_metrics=[relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "predicted_column": "result",
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        },
    },
)
print(results.metrics)
display(results.tables['eval_results_table'])