
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>



# Assembling and Evaluating a RAG Application

In the previous demo, we created a Vector Search index. To build a complete RAG application, we now connect all the components you have learned so far and evaluate the performance of the RAG pipeline.

After evaluating the performance of the RAG pipeline, we will create and deploy a new Model Serving Endpoint to perform RAG.

**Learning Objectives:**

*By the end of this demo, you will be able to:*

- Describe embeddings, vector databases, and search/retrieval as key components of implementing performant RAG applications.
- Assemble a RAG pipeline by combining various components.
- Build a RAG evaluation pipeline with MLflow evaluation functions.
- Register a RAG pipeline to the Model Registry.


## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.
   
   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **15.4.x-cpu-ml-scala2.12**



**🚨 Important: This demonstration relies on the resources established in the previous one. Please ensure you have completed the prior demonstration before starting this one.**


## Classroom Setup

Install required libraries.

In [0]:
%pip install -U -qq databricks-vectorsearch langchain==0.3.7 flashrank langchain-databricks PyPDF2
dbutils.library.restartPython()

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
jupyter-server 1.23.4 requires anyio<4,>=3.1.0, but you have anyio 4.9.0 which is incompatible.
langchain-community 0.0.38 requires langchain-core<0.2.0,>=0.1.52, but you have langchain-core 0.3.63 which is incompatible.
numba 0.57.1 requires numpy<1.25,>=1.21, but you have numpy 1.26.4 which is incompatible.
ydata-profiling 4.5.1 requires numpy<1.24,>=1.16.0, but you have numpy 1.26.4 which is incompatible.
ydata-profiling 4.5.1 requires pydantic<2,>=1.8.1, but you have pydantic 2.11.7 which is incompatible.
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-04



The examples and models presented in this course are intended solely for demonstration and educational purposes. Please note that the models and prompt examples may sometimes contain offensive, inaccurate, biased, or harmful content.


**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

Username:          labuser10822842_1751576779@vocareum.com
Catalog Name:      dbacademy
Schema Name:       labuser10822842_1751576779
Working Directory: /Volumes/dbacademy/ops/labuser10822842_1751576779@vocareum_com
Dataset Location:  NestedNamespace (arxiv='/Volumes/dbacademy_arxiv/v01', dais='/Volumes/dbacademy_dais/v01', news='/Volumes/dbacademy_news/v01', docs='/Volumes/dbacademy_docs/v01')


## Demo Overview

As the diagram below shows, this demo focuses on the inference section (highlighted in green). The previous demos focused on Step 1, data preparation and vector storage. Now it is time to put all the components together to create a RAG application.

The flow will be the following:

- A user asks a question
- The question is sent to our serverless Chatbot RAG endpoint
- The endpoint computes the embedding of the question and uses the Vector Search index to find documents similar to it
- The endpoint creates a prompt enriched with the retrieved documents
- The prompt is sent to the Foundation Model Serving Endpoint
- We display the output to our users!
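The flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in this demo: `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the two stand-in functions replace the Vector Search index and the Foundation Model endpoint so that the sketch runs without any workspace resources.

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Run one RAG request: retrieve context, build the prompt, call the model."""
    docs = retrieve(question)        # embedding + similarity search
    context = "\n\n".join(docs)      # enrich the prompt with the retrieved docs
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)          # foundation model call


# Hypothetical stand-ins so the sketch runs without any endpoint.
def fake_retrieve(question: str) -> List[str]:
    return ["Generative AI can draft text.", "Generative AI can produce images."]


def fake_generate(prompt: str) -> str:
    return f"(model reply to a prompt of {len(prompt)} characters)"


reply = answer_question("How does Generative AI impact humans?", fake_retrieve, fake_generate)
print(reply)
```

Passing the retriever and the model in as callables keeps the orchestration logic independent of any particular endpoint, which is the same separation the LangChain chain in this demo relies on.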


![genai-as-01-llm-rag-self-managed-flow-2](../Includes/images/genai-as-01-llm-rag-self-managed-flow-2.png)



## Set Up the RAG Components

In this section, we first reference the components created in the previous demos. Next, we set up the retriever component for the application. Then, we combine all the components. In the final step, we register the application as a model in the Model Registry with Unity Catalog.

### Set Up the Retriever

We will set up the Vector Search endpoint that we created in the previous demos as the retriever. The retriever first fetches the 10 most similar chunks from the index and then reranks them, returning the 2 most relevant documents for the query.


In [0]:
# components we created before
# assign vs search endpoint by username
vs_endpoint_prefix = "vs_endpoint_"

vs_endpoint_name = vs_endpoint_prefix + str(get_fixed_integer(DA.unique_name("_")))
print(f"Assigned Vector Search endpoint name: {vs_endpoint_name}.")

vs_index_fullname = f"{DA.catalog_name}.{DA.schema_name}.pdf_text_self_managed_vs_index"

Assigned Vector Search endpoint name: vs_endpoint_2.


In [0]:
from databricks.vector_search.client import VectorSearchClient
from langchain_databricks import DatabricksEmbeddings
from langchain_core.runnables import RunnableLambda
from langchain.docstore.document import Document
from flashrank import Ranker, RerankRequest

def get_retriever(cache_dir="/tmp"):

    def retrieve(query, k: int=10):
        if isinstance(query, dict):
            query = next(iter(query.values()))

        # get the vector search index
        vsc = VectorSearchClient(disable_notice=True)
        vs_index = vsc.get_index(endpoint_name=vs_endpoint_name, index_name=vs_index_fullname)
        
        # get the query vector
        embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")
        query_vector = embeddings.embed_query(query)
        
        # get similar k documents
        return query, vs_index.similarity_search(
            query_vector=query_vector,
            columns=["pdf_name", "content"],
            num_results=k)

    def rerank(query, retrieved, cache_dir, k: int=2):
        # format result to align with reranker lib format 
        passages = []
        for doc in retrieved.get("result", {}).get("data_array", []):
            new_doc = {"file": doc[0], "text": doc[1]}
            passages.append(new_doc)       
        # Load the flashrank ranker
        ranker = Ranker(model_name="rank-T5-flan", cache_dir=cache_dir)

        # rerank the retrieved documents
        rerankrequest = RerankRequest(query=query, passages=passages)
        results = ranker.rerank(rerankrequest)[:k]

        # format the results of rerank to be ready for prompt
        return [Document(page_content=r.get("text"), metadata={"source": r.get("file")}) for r in results]

    # the retriever is a runnable sequence of retrieving and reranking.
    return RunnableLambda(retrieve) | RunnableLambda(lambda x: rerank(x[0], x[1], cache_dir))

# test our retriever
question = {"input": "How does Generative AI impact humans?"}
vectorstore = get_retriever(cache_dir = f"{DA.paths.working_dir}/opt")
similar_documents = vectorstore.invoke(question)
print(f"Relevant documents: {similar_documents}")


Relevant documents: [Document(metadata={'source': 'dbfs:/Volumes/dbacademy_arxiv/v01/arxiv-articles/2302.09419.pdf'}, page_content='There are 52 classes in R52, which consists of 70\naverage tokens. It is divided into 6,532 and 2,568 training and testing texts.\nTopic Labeling (TL) The task mainly obtains the meaning of the ﬁle by deﬁning complex ﬁle themes. It\nis a critical component of topic analysis technology, which aims at simplifying topic analysis by assigning\neach article to one or more topics. Here, we introduce a few in detail.\nDBpedia [485] It is a large-scale multilingual knowledge base generated by Wikipedia’s most commonly\nused information boxes. It releases DBpedia every month, adding or removing classes and attributes in each\nversion. The most popular version of DBpedia has 14 categories, separated into 560,000 training data and\n70,000 testing data. The number of average tokens is 55.\nOhsumed [486] This is a biomedical literature database. The number of texts is 

Trace(request_id=tr-febd1d3618424192a0933cb155167020)
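The retriever above follows a two-stage pattern: a broad similarity search (10 candidates) followed by a reranker that keeps only the best 2. The sketch below shows that pattern with toy components so it runs anywhere. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for the embedding endpoint, and the word-overlap scorer in `rerank` stands in for the FlashRank model, which scores passages very differently.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: List[str], k: int = 10) -> List[str]:
    """Stage 1: similarity search that returns up to k candidates."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: List[str], k: int = 2) -> List[str]:
    """Stage 2: rescore the candidates with a different scorer and keep the top k."""
    terms = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(terms & set(doc.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:k]


corpus = [
    "generative ai helps humans write",
    "cats sleep all day",
    "ai impacts humans at work",
    "the weather is sunny",
]
query = "how does generative ai impact humans"
top_docs = rerank(query, retrieve(query, corpus, k=3), k=2)
print(top_docs)
```

Retrieving more candidates than are finally needed gives the reranker room to promote a passage that the first-stage similarity score ranked too low.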

### Set Up the Foundation Model

Our chatbot will use the `llama3.3` foundation model to generate answers.

While the model is available using the built-in [Foundation endpoint](/ml/endpoints), we can use the Databricks LangChain chat model wrapper (`ChatDatabricks`) to easily build our chain.

Note: multiple types of endpoints or LangChain models can be used.

- Databricks Foundation models (what we'll use)
- Your fine-tuned model
- An external model provider (such as Azure OpenAI)

In [0]:
from langchain_databricks import ChatDatabricks

# test Databricks Foundation LLM model
chat_model = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct", max_tokens = 300)
print(f"Test chat model: {chat_model.invoke('What is Generative AI?')}")


Test chat model: content='Generative AI refers to a type of artificial intelligence that is capable of generating new, original content, such as images, videos, music, text, or even entire datasets. This is in contrast to traditional AI, which is typically focused on analyzing and processing existing data.\n\nGenerative AI uses complex algorithms and neural networks to learn patterns and structures from large datasets, and then generates new content based on that learning. The goal of generative AI is to create new, synthetic data that is similar in style, structure, and quality to the original data.\n\nThere are several types of generative AI, including:\n\n1. **Generative Adversarial Networks (GANs)**: GANs consist of two neural networks that work together to generate new content. One network generates new data, while the other network evaluates the generated data and tells the first network whether it is realistic or not.\n2. **Variational Autoencoders (VAEs)**: VAEs are a type of n

Trace(request_id=tr-a7d212101b864108a79080b332245a35)

## Assembling the Complete RAG Solution

Let's now merge the retriever and the model into a single LangChain chain.

We will use a custom LangChain prompt template so that our assistant gives appropriate answers.

Take some time to try different templates and adjust your assistant's tone and personality to suit your requirements.

![genai-as-01-llm-rag-self-managed-model-2](../Includes/images/genai-as-01-llm-rag-self-managed-model-2.png)

Some important notes about the LangChain formatting:

* Context documents retrieved from the vector store are joined with newlines and inserted into the `{context}` placeholder of the prompt.
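To make the formatting concrete, here is a minimal sketch of how retrieved documents can be "stuffed" into a prompt. It does not call LangChain: `Doc` and `stuff_documents` are hypothetical names, the template is a shortened version of the one used in this demo, and the blank-line separator is an assumption that you should check against the LangChain version you use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Shortened template with the same two placeholders as the demo's prompt.
TEMPLATE = "<context>\n{context}\n</context>\n\nQuestion: {input}\n\nAnswer:\n"


def stuff_documents(docs: List[Doc], question: str, separator: str = "\n\n") -> str:
    """Join the document texts and fill the prompt template."""
    context = separator.join(d.page_content for d in docs)
    return TEMPLATE.format(context=context, input=question)


docs = [Doc("first chunk"), Doc("second chunk")]
print(stuff_documents(docs, "How does Generative AI impact humans?"))
```

Only `page_content` reaches the prompt in this sketch; the metadata (for example the source file) stays available for display alongside the answer.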

In [0]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate


TEMPLATE = """You are an assistant for GENAI teaching class. You are answering questions related to Generative AI and how it impacts human life. If the question is not related to one of these topics, kindly decline to answer. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible.
Use the following pieces of context to answer the question at the end:

<context>
{context}
</context>

Question: {input}

Answer:
"""
prompt = PromptTemplate(template=TEMPLATE, input_variables=["context", "input"])

# convert the LangChain Document objects in the context to plain dicts so that MLflow can infer the model signature
def unwrap_document(answer):
  return answer | {"context": [{"metadata": r.metadata, "page_content": r.page_content} for r in answer["context"]]}

question_answer_chain = create_stuff_documents_chain(chat_model, prompt)
chain = create_retrieval_chain(get_retriever(), question_answer_chain)|RunnableLambda(unwrap_document)

In [0]:
question = {"input": "How does Generative AI impact humans?"}
answer = chain.invoke(question)
print(answer)


{'input': 'How does Generative AI impact humans?', 'context': [{'metadata': {'source': 'dbfs:/Volumes/dbacademy_arxiv/v01/arxiv-articles/2302.09419.pdf'}, 'page_content': 'There are 52 classes in R52, which consists of 70\naverage tokens. It is divided into 6,532 and 2,568 training and testing texts.\nTopic Labeling (TL) The task mainly obtains the meaning of the ﬁle by deﬁning complex ﬁle themes. It\nis a critical component of topic analysis technology, which aims at simplifying topic analysis by assigning\neach article to one or more topics. Here, we introduce a few in detail.\nDBpedia [485] It is a large-scale multilingual knowledge base generated by Wikipedia’s most commonly\nused information boxes. It releases DBpedia every month, adding or removing classes and attributes in each\nversion. The most popular version of DBpedia has 14 categories, separated into 560,000 training data and\n70,000 testing data. The number of average tokens is 55.\nOhsumed [486] This is a biomedical lite

Trace(request_id=tr-df5703ce7ba349c59a4fc101d1ffd9c8)
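The `unwrap_document` step deserves a closer look, because it is what makes the chain output serializable. The sketch below reproduces the idea without LangChain or MLflow: `Doc` is a hypothetical stand-in for the LangChain `Document` class, and the sample values are made up. The dict union operator `|` requires Python 3.9 or later.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any]


def unwrap_document(answer: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the chain output whose context holds plain dicts."""
    plain = [{"metadata": d.metadata, "page_content": d.page_content} for d in answer["context"]]
    # dict union builds a new dict; the right-hand "context" replaces the original
    return answer | {"context": plain}


raw = {"input": "q", "context": [Doc("some text", {"source": "a.pdf"})], "answer": "a"}
clean = unwrap_document(raw)

# The unwrapped output contains only built-in types, so it serializes cleanly.
print(json.dumps(clean))
```

Because `|` creates a new dictionary, the original chain output is left untouched, which is convenient when the `Document` objects are still needed elsewhere.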

## Save the Model to Model Registry in UC

Now that our model is ready and evaluated, we can register it within our Unity Catalog schema. 

After registering the model, you can view the model and its versions in the **Catalog Explorer**.

In [0]:
from mlflow.models import infer_signature
import mlflow
import langchain

# set model registry to UC
mlflow.set_registry_uri("databricks-uc")
model_name = f"{DA.catalog_name}.{DA.schema_name}.rag_app_demo4"

with mlflow.start_run(run_name="rag_app_demo4") as run:
    signature = infer_signature(question, answer)
    model_info = mlflow.langchain.log_model(
        chain,
        loader_fn=get_retriever, 
        artifact_path="chain",
        registered_model_name=model_name,
        pip_requirements=[
            "mlflow==" + mlflow.__version__,
            "langchain==" + langchain.__version__,
            "databricks-vectorsearch",
        ],
        input_example=question,
        signature=signature
    )


2025/07/03 22:00:23 INFO mlflow: Attempting to auto-detect Databricks resource dependencies for the current langchain model. Dependency auto-detection is best-effort and may not capture all dependencies of your langchain model, resulting in authorization errors when serving or querying your model. We recommend that you explicitly pass `resources` to mlflow.langchain.log_model() to ensure authorization to dependent resources succeeds when the model is deployed.


Successfully registered model 'dbacademy.labuser10822842_1751576779.rag_app_demo4'.


Created version '1' of model 'dbacademy.labuser10822842_1751576779.rag_app_demo4'.
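Once a version exists, it is addressed by a model URI built from the three-level Unity Catalog name and the version number. The helpers below are a small sketch of that naming convention only; `uc_model_name` and `model_version_uri` are hypothetical names, the schema value is a placeholder, and nothing here contacts the registry. The resulting `models:/<name>/<version>` string is the form that MLflow loading functions such as `mlflow.pyfunc.load_model` accept.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build the three-level Unity Catalog model name: catalog.schema.model."""
    return f"{catalog}.{schema}.{model}"


def model_version_uri(full_name: str, version: int) -> str:
    """Build the MLflow model URI for a specific registered version."""
    if full_name.count(".") != 2:
        raise ValueError("expected a three-level name: catalog.schema.model")
    return f"models:/{full_name}/{version}"


# "my_schema" is a placeholder; in this demo the schema comes from DA.schema_name.
name = uc_model_name("dbacademy", "my_schema", "rag_app_demo4")
print(model_version_uri(name, 1))
```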



## Clean up Resources

This is the final demo. You can delete all resources created in this course.


## Conclusion

In this demo, we walked through building a complete RAG application with several Databricks products. First, we referenced the RAG components created in the earlier demos, namely the Vector Search endpoint and the Vector Search index. Next, we built the retriever component and set up the foundation model. We then assembled the entire RAG application and evaluated the performance of the pipeline using MLflow's LLM evaluation functions. As a final step, we registered the RAG application as a model in the Model Registry with Unity Catalog.


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="blank">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy" target="blank">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use" target="blank">Terms of Use</a> | 
<a href="https://help.databricks.com/" target="blank">Support</a>