# RAG Pipeline with MLflow Tracking, Tracing & Evaluation

This notebook builds a complete Retrieval-Augmented Generation (RAG) pipeline with LangChain and integrates it with MLflow for experiment tracking, tracing, and evaluation. It covers:


- **RAG Pipeline Construction**: Build a complete RAG system using LangChain components
- **MLflow Integration**: Track experiments, parameters, and artifacts
- **Tracing**: Monitor inputs, outputs, retrieved documents, scores, prompts, and timings
- **Evaluation**: Use MLflow's built-in scorers to assess RAG performance
- **Best Practices**: Implement proper configuration management and reproducible experiments

We'll build a RAG system that can answer questions about academic papers by:
1. Loading and chunking documents from ArXiv
2. Creating embeddings and a vector store
3. Setting up a retrieval-augmented generation chain
4. Tracking all experiments with MLflow
5. Evaluating the system's performance

![System Diagram](https://miro.medium.com/v2/resize:fit:720/format:webp/1*eiw86PP4hrBBxhjTjP0JUQ.png)

#### Setup

In [None]:
%pip install -U langchain mlflow langchain-community arxiv pymupdf langchain-text-splitters langchain-openai

In [1]:
import os
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Correctness, ExpectationsGuidelines
from langchain_community.document_loaders import ArxivLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

In [None]:
os.environ["OPENAI_API_KEY"] = "<YOUR OPENAI API KEY>"

mlflow.set_experiment("LangChain-RAG-MLflow")
mlflow.langchain.autolog()

Define all hyperparameters and configuration in a centralized dictionary. This makes it easy to:
- Track different experiment configurations
- Reproduce results
- Perform hyperparameter tuning

**Key Parameters**:
- `chunk_size`: Size of text chunks for document splitting
- `chunk_overlap`: Overlap between consecutive chunks
- `retriever_k`: Number of documents to retrieve
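- `embeddings_model`: OpenAI embedding model used to vectorize the chunks
- `system_prompt`: Instructions given to the LLM ahead of the retrieved context
- `llm`: Language model for generation
- `temperature`: Sampling temperature for the LLM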

In [3]:
CONFIG = {
    "chunk_size": 400,
    "chunk_overlap": 80,
    "retriever_k": 3,
    "embeddings_model": "text-embedding-3-small",
    "system_prompt": "You are a helpful assistant. Use the following context to answer the question. Use three sentences maximum and keep the answer concise.",
    "llm": "gpt-5-nano",
    "temperature": 0,
}
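
Keeping the configuration in one dictionary also makes it straightforward to generate variants for tuning. The sketch below is one possible way to expand a small search space into a list of configurations; `BASE_CONFIG`, `SEARCH_SPACE`, and `expand_grid` are hypothetical names introduced for illustration, and the candidate values are arbitrary rather than recommended.

```python
from itertools import product

# Hypothetical base configuration mirroring a subset of CONFIG above.
BASE_CONFIG = {"chunk_size": 400, "chunk_overlap": 80, "retriever_k": 3}

# Hypothetical search space; the values are illustrative, not recommendations.
SEARCH_SPACE = {"chunk_size": [200, 400, 800], "retriever_k": [3, 5]}


def expand_grid(base: dict, space: dict) -> list[dict]:
    """Return one config dict per combination in `space`, layered over `base`."""
    keys = list(space)
    variants = []
    for values in product(*(space[key] for key in keys)):
        variant = dict(base)  # copy so the base config is never mutated
        variant.update(zip(keys, values))
        variants.append(variant)
    return variants


variants = expand_grid(BASE_CONFIG, SEARCH_SPACE)
print(len(variants))  # 3 chunk sizes x 2 retriever_k values = 6
```

Each variant could then be logged in its own MLflow run with `mlflow.log_params(variant)`, the same call the evaluation cell uses for `CONFIG`.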

#### ArXiv Document Loading and Processing

In [4]:
# Load documents from ArXiv
loader = ArxivLoader(
    query="1706.03762",
    load_max_docs=1,
)
docs = loader.load()
print(docs[0].metadata)

# Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CONFIG["chunk_size"],
    chunk_overlap=CONFIG["chunk_overlap"],
)
chunks = splitter.split_documents(docs)

# Join chunks into a single string
def join_chunks(chunks):
    return "\n\n".join([chunk.page_content for chunk in chunks])


{'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, our
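
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks, so its chunks are not fixed-width windows); `sliding_window_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# The last 4 characters of one chunk reappear at the start of the next.
print(pieces[0][-4:], pieces[1][:4])
```

A larger overlap reduces the chance that a sentence is cut off from its context at a chunk boundary, at the cost of more chunks to embed and store.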

#### Vector Store and Retriever Setup

In [5]:
# Create embeddings
embeddings = OpenAIEmbeddings(model=CONFIG["embeddings_model"])

# Create vector store from documents
vectorstore = InMemoryVectorStore.from_documents(
    chunks,
    embedding=embeddings,
)

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": CONFIG["retriever_k"]})
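
Conceptually, the retriever embeds the query and returns the `k` chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity, which is assumed here as the ranking measure. The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical, and the vectors are made up rather than produced by an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up 3-dimensional "embeddings"; real embedding vectors are much longer.
toy_index = {
    "attention": [0.9, 0.1, 0.0],
    "recurrence": [0.1, 0.9, 0.0],
    "bleu_scores": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(nearest)
```

Raising `retriever_k` gives the model more context to draw on but also lengthens the prompt, so it is a natural parameter to vary between runs.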

#### RAG Chain Construction using [LCEL](https://python.langchain.com/docs/concepts/lcel/)

Flow:
1. Query → Retriever (finds relevant chunks)
2. Chunks → join_chunks (creates context)
3. Context + Query → Prompt Template
4. Prompt → Language Model → Response


In [6]:
# Initialize the language model
llm = ChatOpenAI(model=CONFIG["llm"], temperature=CONFIG["temperature"])

# Create the prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", CONFIG["system_prompt"] + "\n\nContext:\n{context}\n\n"),
    ("human", "\n{question}\n"),
])

# Construct the RAG chain
rag_chain = (
    {
        "context": retriever | RunnableLambda(join_chunks),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)
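
The data flow of the chain can be hard to see through the `|` operators. The following plain-Python sketch mirrors the same four steps with ordinary function calls. Every function in it (`fake_retriever`, `join_texts`, `build_prompt`, `fake_llm`, `run_pipeline`) is a hypothetical stand-in: no LangChain component or model is called, and the "model" simply echoes part of its prompt.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns canned chunks regardless of the question."""
    return ["The Transformer relies solely on attention.", "It dispenses with recurrence."]


def join_texts(texts: list[str]) -> str:
    """Same role as join_chunks, but over plain strings."""
    return "\n\n".join(texts)


def build_prompt(context: str, question: str) -> str:
    """Mimic the system/human template as one flat string."""
    return f"Context:\n{context}\n\nQuestion: {question}"


def fake_llm(prompt: str) -> str:
    """Hypothetical model: echoes the first context line instead of generating text."""
    return prompt.splitlines()[1]


def run_pipeline(question: str) -> str:
    # Both branches receive the same raw question, like the dict at the head of rag_chain.
    inputs = {"context": join_texts(fake_retriever(question)), "question": question}
    return fake_llm(build_prompt(**inputs))


answer = run_pipeline("What is the main idea of the paper?")
print(answer)
```

The dictionary at the start of `rag_chain` plays the role of `inputs` here: one branch transforms the question into context, while `RunnablePassthrough()` forwards the question unchanged.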

#### Prediction Function with MLflow Tracing

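Wrap the chain in a prediction function decorated with `@mlflow.trace`. Together with the `mlflow.langchain.autolog()` call made during setup, each invocation is traced and records: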
- Input queries
- Retrieved documents
- Generated responses
- Execution time
- Chain intermediate steps

In [7]:
@mlflow.trace
def predict_fn(question: str) -> str:
    return rag_chain.invoke(question)

# Test the prediction function
sample_question = "What is the main idea of the paper?"
response = predict_fn(sample_question)
print(f"Question: {sample_question}")
print(f"Response: {response}")

Question: What is the main idea of the paper?
Response: The main idea is to replace recurrent/convolutional sequence models with a pure attention-based architecture called the Transformer. It uses self-attention to model dependencies between all positions in the input and output, enabling full parallelization and better handling of long-range relations. This approach achieves strong results on translation and can extend to other modalities.
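
To make concrete what a trace record contains, here is a toy decorator that captures a function's inputs, output, and duration. It is only a conceptual stand-in and not how `mlflow.trace` is implemented; real traces also contain nested spans for the retriever, prompt, and model calls. `toy_trace`, `TRACE_LOG`, and `toy_predict` are hypothetical names.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory sink; MLflow stores traces in its tracking backend


def toy_trace(func):
    """Record inputs, output, and wall-clock duration of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        TRACE_LOG.append({
            "name": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper


@toy_trace
def toy_predict(question: str) -> str:
    # Stub prediction; a real predict_fn would invoke the RAG chain.
    return f"stub answer to: {question}"


toy_predict("What is the main idea of the paper?")
print(TRACE_LOG[-1]["name"], "->", TRACE_LOG[-1]["output"])
```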


#### Evaluation Dataset and Scoring

Define an evaluation dataset and run systematic evaluation using [MLflow's built-in scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/#available-scorers):

<u>Evaluation Components:</u>
- **Dataset**: Questions with expected concepts and facts
- **Scorers**: 
  - `RelevanceToQuery`: Measures how relevant the response is to the question
  - `Correctness`: Evaluates factual accuracy of the response
  - `ExpectationsGuidelines`: Checks that output matches expectation guidelines

<u>Best Practices:</u>
- Create diverse test cases covering different query types
- Include expected concepts to guide evaluation
- Use multiple scoring metrics for comprehensive assessment
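
Before spending LLM-judge calls, it can help to sanity-check the dataset's shape. The sketch below assumes that each record's `inputs` dictionary is passed to the prediction function as keyword arguments, so its keys should match the function's parameter names. `validate_eval_dataset` and `toy_predict_fn` are hypothetical helpers, and the check assumes every parameter of the prediction function is required.

```python
import inspect


def validate_eval_dataset(dataset: list[dict], predict_fn) -> list[str]:
    """Return a list of human-readable problems; an empty list means the records look usable."""
    expected_params = set(inspect.signature(predict_fn).parameters)
    problems = []
    for index, record in enumerate(dataset):
        inputs = record.get("inputs")
        if not isinstance(inputs, dict):
            problems.append(f"record {index}: missing 'inputs' dict")
            continue
        if set(inputs) != expected_params:
            problems.append(
                f"record {index}: input keys {sorted(inputs)} do not match "
                f"predict_fn parameters {sorted(expected_params)}"
            )
        if not record.get("expectations"):
            problems.append(f"record {index}: no 'expectations' provided")
    return problems


def toy_predict_fn(question: str) -> str:  # hypothetical stand-in for predict_fn
    return "stub"


good = [{
    "inputs": {"question": "What does the attention mechanism do?"},
    "expectations": {"expected_facts": ["attention weighs parts of the input"]},
}]
bad = [{"inputs": {"query": "What does the attention mechanism do?"}, "expectations": {}}]
print(validate_eval_dataset(good, toy_predict_fn))
print(validate_eval_dataset(bad, toy_predict_fn))
```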

In [8]:
# Define evaluation dataset
eval_dataset = [
    {
        "inputs": {"question": "What is the main idea of the paper?"},
        "expectations": {
            "key_concepts": ["attention mechanism", "transformer", "neural network"],
            "expected_facts": ["attention mechanism is a key component of the transformer model"],
            "guidelines": ["The response must be factual and concise"],
        }
    },
    {
        "inputs": {"question": "What's the difference between a transformer and a recurrent neural network?"},
        "expectations": {
            "key_concepts": ["sequential", "attention mechanism", "hidden state"],
            "expected_facts": ["transformer processes data in parallel while RNN processes data sequentially"],
            "guidelines": ["The response must be factual and focus on the difference between the two models"],
        }
    },
    {
        "inputs": {"question": "What does the attention mechanism do?"},
        "expectations": {
            "key_concepts": ["query", "key", "value", "relationship", "similarity"],
            "expected_facts": ["attention allows the model to weigh the importance of different parts of the input sequence when processing it"],
            "guidelines": ["The response must be factual and explain the concept of attention"],
        }
    }
]

# Run evaluation with MLflow
with mlflow.start_run(run_name="baseline_eval") as run:
    # Log configuration parameters
    mlflow.log_params(CONFIG)

    # Run evaluation
    results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=predict_fn,
        scorers=[RelevanceToQuery(), Correctness(), ExpectationsGuidelines()],
    )


2025/08/23 20:14:39 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.
2025/08/23 20:14:39 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.




✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: baseline_eval
  Run ID: a2218d9f24c9415f8040d3b77af103a9

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.
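
The evaluation dataset also carries a `key_concepts` list for each question. Going by the scorer descriptions above, the three built-in scorers focus on relevance, expected facts, and guidelines, so a cheap deterministic check on key concepts can complement them. The function below is a hypothetical sketch using case-insensitive substring matching; it is not an MLflow scorer by itself. MLflow provides a `scorer` decorator in `mlflow.genai.scorers` for registering custom checks, but that wiring is left out here and should be verified against the MLflow documentation.

```python
def key_concept_coverage(response: str, key_concepts: list[str]) -> float:
    """Fraction of key concepts that appear in the response (case-insensitive substring match)."""
    if not key_concepts:
        return 0.0
    text = response.lower()
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)


# Illustrative answer text, not an actual model output from this notebook.
answer_text = "The Transformer replaces recurrence with an attention mechanism."
concepts = ["attention mechanism", "transformer", "neural network"]
coverage = key_concept_coverage(answer_text, concepts)
print(f"{coverage:.2f}")  # two of the three concepts are present
```

Substring matching is brittle (it misses paraphrases such as "self-attention layers" for "attention mechanism"), which is one reason to keep LLM-judge scorers alongside a check like this.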



#### Launch MLflow UI to check out the results

<u>What you'll see in the UI:</u>
- **Experiments**: Compare different RAG configurations
- **Runs**: Individual experiment runs with metrics and parameters
- **Traces**: Detailed execution traces showing retrieval and generation steps
- **Evaluation Results**: Scoring metrics and detailed comparisons
- **Artifacts**: Saved models, datasets, and other files

Navigate to `http://localhost:5000` after running the command below.

In [None]:
!mlflow ui

You should see something like this:

![MLflow UI image](https://miro.medium.com/v2/resize:fit:720/format:webp/1*Cx7MMy53pAP7150x_hvztA.png)