# LLM RAG Evaluation with MLflow Example Notebook

In this notebook, we will demonstrate how to evaluate various RAG systems with MLflow.

In [None]:
%pip install "mlflow>=2.8.1"
%pip install openai
%pip install chromadb==0.4.15
%pip install langchain==0.0.348
%pip install tiktoken
%pip install 'mlflow[genai]'
%pip install databricks-sdk --upgrade

In [None]:
dbutils.library.restartPython()

In [None]:

import os
import pandas as pd
import mlflow
import chromadb
import openai
import langchain

In [None]:
# check mlflow version
mlflow.__version__

'2.9.1'

In [None]:
# check chroma version
chromadb.__version__

'0.4.18'

## Set-up Databricks Workspace Secrets

To use the secrets referenced in this notebook, first create them by following the [guide to Databricks Secrets here](https://docs.databricks.com/en/security/secrets/secrets.html). We strongly recommend using the [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html) to set secrets, so that they are stored securely within your workspace.

To safely store and access your API key for Azure OpenAI, set the following when registering your secret:

- **KEY_NAME**: The name that you will be setting for your Azure OpenAI Key
- **SCOPE_NAME**: The referenced scope that your secret will reside in, within Databricks Secrets
- **OPENAI_API_KEY**: Your Azure OpenAI Key

As an example, you would set these keys through a terminal as follows:

```bash
    databricks secrets put-secret "<SCOPE_NAME>" "<KEY_NAME>" --string-value "<OPENAI_API_KEY>"
```

In [None]:
# Set your Scope and Key Names that you used when registering your API KEY from the Databricks CLI
# Do not put your OpenAI API Key in the notebook!
SCOPE_NAME = ...
KEY_NAME = ...

In [None]:
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get(scope=SCOPE_NAME, key=KEY_NAME)
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
# Set OPENAI_API_BASE to the URL of your OpenAI instance on Azure
os.environ["OPENAI_API_BASE"] = "https://<NAME_OF_YOUR_INSTANCE>.openai.azure.com/"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "gpt-35-turbo"
os.environ["OPENAI_ENGINE"] = "gpt-35-turbo"
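
Before creating the endpoint, it can help to confirm that the variables above are actually set. The following is a minimal sketch: `missing_env_vars` is a hypothetical helper (not part of MLflow or Databricks), and the list of required names simply mirrors the cell above.

```python
import os

# Names taken from the cell above; adjust the list if your setup differs.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "OPENAI_API_BASE",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing an explicit `env` mapping makes the check easy to try without touching the real environment.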

## Create and Test Endpoint on MLflow for OpenAI

In [None]:
import mlflow
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

endpoint_name = "<your-endpoint-name>"
client.create_endpoint(
    name=endpoint_name,
    config={
        "served_entities": [
            {
                "name": "test-gpt",  # Provide a unique identifying name for your deployments endpoint
                "external_model": {
                    "name": "gpt-3.5-turbo",
                    "provider": "openai",
                    "task": "llm/v1/completions",
                    "openai_config": {
                        "openai_api_type": "azure",
                        "openai_api_key": os.environ.get("OPENAI_API_KEY"),
                        "openai_api_base": os.environ.get("OPENAI_API_BASE"),
                        "openai_deployment_name": "gpt-35-turbo",
                        "openai_api_version": "2023-05-15",
                    },
                },
            }
        ],
    },
)
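
The cell above passes the key value read from the environment. Databricks also documents a secret-reference form for external model endpoints, `{{secrets/<scope>/<key>}}`, which lets the endpoint resolve the key itself. The sketch below only builds that string; `secret_reference` is a hypothetical helper, and the exact syntax is an assumption you should confirm against the Databricks external models documentation for your workspace.

```python
def secret_reference(scope: str, key: str) -> str:
    """Build a secret reference string of the form {{secrets/<scope>/<key>}}."""
    if not scope or not key:
        raise ValueError("Both scope and key must be non-empty")
    return "{{secrets/" + scope + "/" + key + "}}"


# Hypothetical scope and key names, for illustration only.
print(secret_reference("my_scope", "my_openai_key"))  # {{secrets/my_scope/my_openai_key}}
```

If your workspace supports it, the returned string could replace the `os.environ.get("OPENAI_API_KEY")` value in `openai_config`.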


In [None]:
print(client.predict(
    endpoint=endpoint_name,
    inputs={
        "prompt": "How is Pi calculated? Be very concise.",
        "max_tokens": 100,
    }
))

## Create RAG POC with LangChain and log with MLflow

Use LangChain and Chroma to create a RAG system that answers questions based on the MLflow documentation.

In [None]:
import os
import pandas as pd
import mlflow
import chromadb
import openai
from langchain import LLMChain, PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.llms import OpenAI, Databricks
from langchain.embeddings.databricks import DatabricksEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

loader = WebBaseLoader(
    [
        "https://mlflow.org/docs/latest/index.html",
        "https://mlflow.org/docs/latest/tracking/autolog.html",
        "https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html",
        "https://mlflow.org/docs/latest/python_api/mlflow.deployments.html",
    ]
)

documents = loader.load()
CHUNK_SIZE = 1000
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

llm = Databricks(
    endpoint_name=endpoint_name,  # the endpoint created in the previous section
    extra_params={
        "temperature": 0.1,
        "top_p": 0.1,
        "max_tokens": 500,
    },  # parameters used in AI Playground
)


# create the embedding function using Databricks Foundation Model APIs
embedding_function = DatabricksEmbeddings(endpoint="databricks-bge-large-en")
docsearch = Chroma.from_documents(texts, embedding_function)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # Retrieve the top 3 chunks; retrieval options are passed through search_kwargs
    retriever=docsearch.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
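
With `return_source_documents=True`, calling the chain returns a dictionary rather than a plain string, and the later evaluation cells rely on its `result` and `source_documents` keys. The sketch below shows one way to summarize such a response. It uses a stand-in `FakeDocument` class and a hand-written response so that it runs without any endpoint; both are assumptions for illustration, not LangChain objects or real model output.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_response(response: dict) -> dict:
    """Extract the answer and the distinct source URLs, preserving retrieval order."""
    sources = []
    for doc in response.get("source_documents", []):
        url = doc.metadata.get("source")
        if url and url not in sources:
            sources.append(url)
    return {"answer": response["result"], "sources": sources}


# Hand-written example shaped like the chain's output (not a real model response).
example = {
    "query": "What is MLflow?",
    "result": "<answer text>",
    "source_documents": [
        FakeDocument("chunk 1", {"source": "https://mlflow.org/docs/latest/index.html"}),
        FakeDocument("chunk 2", {"source": "https://mlflow.org/docs/latest/index.html"}),
    ],
}
print(summarize_response(example))
```

Several chunks often come from the same page, which is why the helper removes duplicate URLs.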


## Evaluate the Vector Database and Retrieval using `mlflow.evaluate()`

### Create an eval dataset (Golden Dataset)

Instead of writing every test question by hand, you can [leverage the power of an LLM to generate synthetic data for testing](https://mlflow.org/docs/latest/llms/rag/notebooks/question-generation-retrieval-evaluation.html). Whichever approach you choose, craft a dataset that mirrors the expected inputs and outputs of your RAG application.

In [None]:
import ast

EVALUATION_DATASET_PATH = "https://raw.githubusercontent.com/mlflow/mlflow/master/examples/llms/RAG/static_evaluation_dataset.csv"

# Load the static evaluation dataset from the URL above
synthetic_eval_data = pd.read_csv(EVALUATION_DATASET_PATH)

# Deserialize the source and retrieved doc ids, which the CSV stores as stringified lists
synthetic_eval_data["source"] = synthetic_eval_data["source"].apply(ast.literal_eval)
synthetic_eval_data["retrieved_doc_ids"] = synthetic_eval_data["retrieved_doc_ids"].apply(ast.literal_eval)
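
The `ast.literal_eval` step is needed because a CSV file stores each list as text. The following sketch shows the conversion on a made-up in-memory row, using only the standard library; the row is illustrative and not taken from the real dataset.

```python
import ast
import csv
import io

# A tiny CSV with one list-valued column, written the way a list is serialized to text.
raw_csv = 'question,source\n"What is MLflow?","[\'https://mlflow.org/docs/latest/index.html\']"\n'

row = next(csv.DictReader(io.StringIO(raw_csv)))
print(type(row["source"]).__name__)  # str

# literal_eval safely parses Python literals (lists, strings, numbers) without running code.
parsed = ast.literal_eval(row["source"])
print(type(parsed).__name__, parsed)  # list ['https://mlflow.org/docs/latest/index.html']
```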

In [None]:
display(synthetic_eval_data)

### Evaluate the Embedding Model with MLflow
You can explore the full dataset, but this demo uses a smaller set of data points.

In [None]:
eval_data = pd.DataFrame(
    {
        "question": [
            "What is MLflow?",
            "What is Databricks?",
            "How to serve a model on Databricks?",
            "How to enable MLflow Autologging for my workspace by default?",
        ],
        "source": [
            ["https://mlflow.org/docs/latest/index.html"],
            ["https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html"],
            ["https://mlflow.org/docs/latest/python_api/mlflow.deployments.html"],
            ["https://mlflow.org/docs/latest/tracking/autolog.html"],
        ],
    }
)


In [None]:
from typing import List
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

def evaluate_embedding(embedding_function):
    CHUNK_SIZE = 1000
    list_of_documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
    docs = text_splitter.split_documents(list_of_documents)
    retriever = Chroma.from_documents(docs, embedding_function).as_retriever()

    def retrieve_doc_ids(question: str) -> List[str]:
        docs = retriever.get_relevant_documents(question)
        doc_ids = [doc.metadata["source"] for doc in docs]
        return doc_ids

    def retriever_model_function(question_df: pd.DataFrame) -> pd.Series:
        return question_df["question"].apply(retrieve_doc_ids)

    with mlflow.start_run() as run:
        evaluate_results = mlflow.evaluate(
            model=retriever_model_function,
            data=eval_data,
            model_type="retriever",
            targets="source",
            evaluators="default",
        )
    return evaluate_results

result1 = evaluate_embedding(DatabricksEmbeddings(endpoint="databricks-bge-large-en"))
#result2 = evaluate_embedding(SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2"))

eval_results_of_retriever_df_bge = result1.tables["eval_results_table"]
#eval_results_of_retriever_df_MiniLM = result2.tables["eval_results_table"]
display(eval_results_of_retriever_df_bge)
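
Each evaluation result also exposes a `metrics` dictionary of aggregate scores, so two embedding models can be compared side by side. A minimal sketch follows; `compare_metrics` is a hypothetical helper, and the metric names and numbers are placeholders standing in for two `metrics` dictionaries, not real results.

```python
def compare_metrics(runs: dict) -> list:
    """Return one row per metric with each run's value, or None where a run lacks it."""
    metric_names = sorted({name for metrics in runs.values() for name in metrics})
    return [
        {"metric": name, **{run: metrics.get(name) for run, metrics in runs.items()}}
        for name in metric_names
    ]


# Placeholder dictionaries; in the notebook these would come from each result's .metrics.
placeholder_runs = {
    "bge": {"precision_at_3/mean": 0.5, "recall_at_3/mean": 1.0},
    "MiniLM": {"precision_at_3/mean": 0.25},
}
for row in compare_metrics(placeholder_runs):
    print(row)
```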

### Evaluate Different Top K Strategies with MLflow

In [None]:
with mlflow.start_run() as run:
    evaluate_results = mlflow.evaluate(
        data=eval_results_of_retriever_df_bge,
        targets="source",
        predictions="outputs",
        evaluators="default",
        extra_metrics=[
            mlflow.metrics.precision_at_k(1),
            mlflow.metrics.precision_at_k(2),
            mlflow.metrics.precision_at_k(3),
            mlflow.metrics.recall_at_k(1),
            mlflow.metrics.recall_at_k(2),
            mlflow.metrics.recall_at_k(3),
            mlflow.metrics.ndcg_at_k(1),
            mlflow.metrics.ndcg_at_k(2),
            mlflow.metrics.ndcg_at_k(3),
        ],
    )

display(evaluate_results.tables["eval_results_table"])
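
To make the metrics above concrete, here is a minimal sketch of the textbook definitions of precision, recall, and NDCG at k for a single query with binary relevance. The `simple_*` functions are illustrative names, the document ids are made up, and the sketch assumes retrieved ids are unique; MLflow's own implementations may treat duplicates and empty inputs differently.

```python
import math


def simple_precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def simple_recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def simple_ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


retrieved = ["a", "b", "c"]  # ranked ids returned by a retriever (made up)
relevant = {"a", "c"}  # ground-truth ids (made up)
for k in (1, 2, 3):
    print(
        k,
        simple_precision_at_k(retrieved, relevant, k),
        simple_recall_at_k(retrieved, relevant, k),
        round(simple_ndcg_at_k(retrieved, relevant, k), 3),
    )
```

In this example, raising k from 2 to 3 lifts recall from 0.5 to 1.0 because the second relevant id sits at rank 3.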

### Evaluate the Chunking Strategy with MLflow

In [None]:
from typing import List

def evaluate_chunk_size(chunk_size):
    list_of_documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    docs = text_splitter.split_documents(list_of_documents)
    embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    retriever = Chroma.from_documents(docs, embedding_function).as_retriever()

    def retrieve_doc_ids(question: str) -> List[str]:
        docs = retriever.get_relevant_documents(question)
        doc_ids = [doc.metadata["source"] for doc in docs]
        return doc_ids

    def retriever_model_function(question_df: pd.DataFrame) -> pd.Series:
        return question_df["question"].apply(retrieve_doc_ids)

    with mlflow.start_run() as run:
        evaluate_results = mlflow.evaluate(
            model=retriever_model_function,
            data=eval_data,
            model_type="retriever",
            targets="source",
            evaluators="default",
        )
    return evaluate_results

result1 = evaluate_chunk_size(1000)
result2 = evaluate_chunk_size(2000)


display(result1.tables["eval_results_table"])
display(result2.tables["eval_results_table"])
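
Chunk size matters because it changes how many chunks the vector store holds and how much text each one carries. The sketch below illustrates this with a simplified separator-based splitter. It is an approximation written for this example, not LangChain's `CharacterTextSplitter` implementation, and the sample text is synthetic.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Ten synthetic paragraphs of 93 characters each.
sample = "\n\n".join(f"Paragraph {i}: " + "x" * 80 for i in range(10))

for size in (200, 400):
    chunks = split_text(sample, size)
    print(size, len(chunks), max(len(chunk) for chunk in chunks))
```

Doubling the chunk size here reduces the number of chunks, so each retrieved chunk carries more context but is a coarser match for the question.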

## Evaluate the RAG system using `mlflow.evaluate()`
Create a simple function that runs each input through the RAG chain.

In [None]:
def model(input_df):
    return input_df["questions"].map(qa).tolist()

## Create an eval dataset (Golden Dataset)

In [None]:
eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "What is Databricks?",
            "How to serve a model on Databricks?",
            "How to enable MLflow Autologging for my workspace by default?",
        ],
    }
)
display(eval_df)

## Evaluate using LLM as a Judge and Basic Metric

Use the relevance metric to judge how relevant each answer is to the question and the retrieved context. Other metrics are available as well.


In [None]:
from mlflow.deployments import set_deployments_target
from mlflow.metrics.genai.metric_definitions import relevance

set_deployments_target("databricks")  # To retrieve all endpoints in your Databricks Workspace

# You can also use any model you have hosted on Databricks, models from the Marketplace,
# or models in the Foundation Model APIs
relevance_metric = relevance(model="endpoints:/databricks-llama-2-70b-chat")

with mlflow.start_run():
    results = mlflow.evaluate(
        model,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        predictions="result",
        extra_metrics=[relevance_metric, mlflow.metrics.latency()],
        evaluator_config={
            "col_mapping": {
                "inputs": "questions",
                "context": "source_documents",
            }
        }
    )
    print(results.metrics)

display(results.tables["eval_results_table"])