# Monitoring a RAG chain with OpenAI, LangChain, and MLflow

This notebook is a quick tutorial on how to use [MLflow Tracing](https://mlflow.org/docs/latest/llms/tracing/index.html) to improve observability in your Retrieval-Augmented Generation (RAG) application. RAG chains can involve many steps, and when failures or unexpected responses happen, it can be difficult to pinpoint exactly what went wrong. MLflow Tracing helps by letting you view the inputs and outputs of each intermediate step in your workflow, which makes debugging and iteration more effective.

We'll be building a simple question-answering app in this notebook. For convenience, we're using [LangChain](https://www.langchain.com/) here, since MLflow has a built-in integration with it, but traces can also be instrumented manually to suit any use case. Let's dive in!

## Setting up the environment

The following cells set up our dev environment by installing the necessary libraries and setting our OpenAI API key.

```python
%pip install -Uqq mlflow langchain langchain-chroma langchain-openai
```

```python
from getpass import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")
```

## Enabling tracing

When using LangChain, you can enable tracing automatically with a single call to `mlflow.langchain.autolog()`. You can also instrument traces manually with the `@mlflow.trace` function decorator. For more details, check out the [MLflow docs](https://mlflow.org/docs/latest/llms/tracing/index.html#trace-decorator).

```python
import mlflow

mlflow.set_experiment("openai-rag-demo")
mlflow.langchain.autolog()
```

```
2024/09/05 13:55:08 INFO mlflow.tracking.fluent: Experiment with name 'openai-rag-demo' does not exist. Creating a new experiment.
```


## Setting up our vector store

The first step in building a RAG app is embedding our documents. Here, we'll use OpenAI's `text-embedding-3-small` model to generate the embeddings and store them in an in-memory [Chroma](https://docs.trychroma.com/) instance. Later, we'll query this vector store to fetch the documents most relevant to the user's question.
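To make the retrieval step a little more concrete, here is a simplified, pure-Python sketch of similarity search: each document is represented by a vector, and a query returns the `k` documents whose vectors have the highest cosine similarity to the query vector. The three-dimensional vectors below are made up for illustration (real `text-embedding-3-small` embeddings have 1536 dimensions), and cosine similarity is an assumption here; Chroma's actual distance metric and indexing may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], doc_vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    # Rank document IDs by similarity to the query, highest first, and keep k of them.
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine_similarity(query, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy embeddings, one per document.
doc_vectors = {
    "numpy": [0.9, 0.1, 0.0],
    "pandas": [0.7, 0.3, 0.1],
    "mlflow": [0.1, 0.9, 0.2],
    "langchain": [0.0, 0.4, 0.9],
}
query_vector = [0.2, 0.95, 0.1]  # pretend embedding of "What is MLflow?"

print(top_k(query_vector, doc_vectors, k=3))  # ['mlflow', 'pandas', 'langchain']
```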

```python
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

documents = [
    Document(
        page_content="NumPy is a powerful Python library used for numerical computing. It provides support for large multidimensional arrays and matrices along with a collection of high-level mathematical functions."
    ),
    Document(
        page_content="Pandas is a Python library primarily used for data manipulation and analysis. It provides data structures like DataFrame, which makes working with structured data intuitive and efficient."
    ),
    Document(
        page_content="PyTorch is an open-source machine learning library. It is widely used for deep learning applications and provides dynamic computational graphs for flexibility."
    ),
    Document(
        page_content="MLflow is an open-source platform to manage the machine learning lifecycle. It supports tracking experiments, packaging code into reproducible runs, and sharing and deploying models."
    ),
    Document(
        page_content="Langchain is an open-source library designed to build applications powered by language models. It provides a flexible interface to chain together components like prompts, memory, and tools."
    ),
]

# Provide the `persist_directory` argument to save the vector store to disk
vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
```

## Building our chain

Our full RAG chain has the following steps:

1. Embed the question and fetch relevant documents
2. Construct a prompt using the fetched documents
3. Feed the prompt to the chat model (in this case, we'll be using OpenAI's `gpt-4o-mini`)

The following cell sets up each of these steps and links them together into the final chain.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# This handles Step #1. The retriever will be invoked using the user's question,
# and will perform a similarity search over the embedded documents to retrieve
# the top 3 most relevant ones.
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

# This handles Step #2. Placeholders are used to inject
# both the user's question and the retrieved context.
prompt = ChatPromptTemplate.from_template("""
Answer the following question using the provided context. If the information
required to answer the question is not contained within the context, simply
respond that you do not know.

{question}

Context:
{context}
""")

# This handles Step #3. Feel free to change the
# parameters of the model to whatever you wish!
# Note that max_tokens caps the response length, so longer answers will be cut off.
chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=50,
)

# Finally, we link them all together using the LangChain Expression Language (LCEL) pipe operator.
# The dict runs the retriever on the question and passes the question through unchanged,
# producing the {"context": ..., "question": ...} inputs that the prompt expects.
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | chat_model
```

```python
# Now, we can simply invoke the chain with our question,
# and the full pipeline will be executed!
chain.invoke("What is MLflow?")
```

```
AIMessage(content='MLflow is an open-source platform to manage the machine learning lifecycle. It supports tracking experiments, packaging code into reproducible runs, and sharing and deploying models.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 32, 'prompt_tokens': 156, 'total_tokens': 188}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_f33667828e', 'finish_reason': 'stop', 'logprobs': None}, id='run-38b81240-02a0-42ee-8d23-26c3460b7c7f-0', usage_metadata={'input_tokens': 156, 'output_tokens': 32, 'total_tokens': 188})
```

## Viewing the trace

After the chain is invoked, MLflow automatically captures a trace of the execution. To view the trace, open the MLflow UI, which you can start by running the `mlflow ui` command in the working directory containing this notebook.

The trace will be contained in the "openai-rag-demo" experiment, under the "Traces" tab:

![MLflow Trace UI](../../images/mlflow_trace.png)

This view lets you step through each intermediate step of the chain's execution and see its inputs and outputs at a glance. The UI also supports searching through spans, which makes it easier to debug unexpected responses even in large, complex applications.

The trace data is also accessible through the Python client. Call `mlflow.get_last_active_trace()` to fetch the most recent trace, then call `.to_dict()` on the result to convert it to a Python dictionary. The `mlflow.search_traces()` and `mlflow.get_trace()` [APIs](https://mlflow.org/docs/latest/llms/tracing/index.html#searching-and-retrieving-traces) can also retrieve traces by filter conditions or by ID.

```python
mlflow.get_last_active_trace().to_dict()
```

{'info': {'request_id': '82bce1ed74b84701906de60ce1a2003b',
  'experiment_id': '871370117864542956',
  'timestamp_ms': 1725515725835,
  'execution_time_ms': 1032,
  'status': 'OK',
  'request_metadata': {'mlflow.trace_schema.version': '2',
   'mlflow.traceInputs': '"What is MLflow?"',
   'mlflow.traceOutputs': '{"content": "MLflow is an open-source platform to manage the machine learning lifecycle. It supports tracking experiments, packaging code into reproducible runs, and sharing and deploying models.", "additional_kwargs": {"refusal": null}, "response...'},
  'tags': {'mlflow.source.name': '/Users/daniel.lok/miniconda3/envs/dev/lib/python3.9/site-packages/ipykernel_launcher.py',
   'mlflow.source.type': 'LOCAL',
   'mlflow.traceName': 'RunnableSequence',
   'mlflow.artifactLocation': 'file:///Users/daniel.lok/openai-cookbook/examples/third_party/mlruns/871370117864542956/traces/82bce1ed74b84701906de60ce1a2003b/artifacts'}},
 'data': {'spans': [{'name': 'RunnableSequence',
    'conte