<a href="https://colab.research.google.com/github/ppicello/Atlas-Search-eWorkshop/blob/main/MDB_RAG_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MongoDB Vector Search Workshop - RETRIEVAL AUGMENTED GENERATION (RAG)

In this notebook we will build a RAG architecture to answer questions about PDF documents uploaded to a folder in your Google Drive account. We will use [MongoDB Atlas](https://www.mongodb.com/products/platform/atlas-vector-search) as the vector store, [OpenAI](https://openai.com/) for both the embedding model and the LLM, and [LangChain](https://www.langchain.com/) as the LLM framework. Google Drive was chosen simply to make it easier for participants to upload documents.


> There are no local requirements to run this notebook; you should be able to run it entirely in Google Colab.

![](https://drive.google.com/uc?export=view&id=1QWp1TpFQmFcv9lcmkAtpARTrzyBBUDHY)


# Google Colab instructions

A notebook is a list of cells. Cells contain either explanatory text or executable code and its output. Click a cell to select it.

Below is a **code cell**. Click in the cell to select it and execute the contents in the following ways:

* Click the **Play icon** in the left gutter of the cell;
* Type **Shift+Enter** to run the cell and move focus to the next cell (adding one if none exists)

In [None]:
a = 10
print(a)

10


# Step 0 - Install dependencies
In this step we will install the required dependencies (pymongo, langchain, etc.). This can take a couple of minutes.


In [None]:
! python -m pip install pymongo pypdf langchain openai tiktoken gradio

Collecting pymongo
  Downloading pymongo-4.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (677 kB)
Collecting pypdf
  Downloading pypdf-4.0.2-py3-none-any.whl (283 kB)
Collecting langchain
  Downloading langchain-0.1.9-py3-none-any.whl (816 kB)
Collecting openai
  Downloading openai-1.12.0-py3-none-any.whl (226 kB)
Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)

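# Step 1 - Set up MongoDB and OpenAI

In the cell below you will be asked for your Atlas connection string, your OpenAI API key and a username.

*Example:*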
```
mongodb_uri='mongodb+srv://<username>:<password>@cluster0.4m8aa.mongodb.net/?retryWrites=true&w=majority'
openai_apikey='sk-XXXXXXXXXXXXXX'
username = 'picellopaolo'
```

Feel free to use your own Atlas cluster and OpenAI API key if you have them.

**Remember to choose your username**, it will be used to isolate your vectors from other participants when using the same Atlas cluster (please use only lowercase letters, ex: `jamessmith`)
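
Because the username becomes a collection name, it can help to check it before using it. The following is a minimal sketch of such a check; `is_valid_username` is a hypothetical helper that is not part of the workshop code, and it assumes "only lowercase letters" means the ASCII letters a to z.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumption: a valid username is one or more ASCII lowercase letters.
USERNAME_PATTERN = re.compile(r"[a-z]+")


def is_valid_username(username: str) -> bool:
    """Return True when the username contains only lowercase ASCII letters."""
    return USERNAME_PATTERN.fullmatch(username) is not None


print(is_valid_username("jamessmith"))   # True
print(is_valid_username("James Smith"))  # False
```

In the notebook you could call it on `username_widget.value` before creating the collection.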

In [None]:
import ipywidgets as widgets
import os

mongodb_uri_widget = widgets.Password(
    description='Your Atlas URI:',
    disabled=False,
    style=dict(description_width='125px')
)

openai_api_key_widget = widgets.Password(
    description='Your OpenAI API key:',
    disabled=False,
    style=dict(description_width='125px')
)

username_widget = widgets.Text(
    value='',
    placeholder='username',
    description='Your username:',
    style=dict(description_width='125px'),
    disabled=False
)


display(mongodb_uri_widget)
display(openai_api_key_widget)
display(username_widget)

Password(description='Your Atlas URI:', style=DescriptionStyle(description_width='125px'))

Password(description='Your OpenAI API key:', style=DescriptionStyle(description_width='125px'))

Text(value='', description='Your username:', placeholder='username', style=DescriptionStyle(description_width=…

In the following cell we create the MongoDB collection that will store our chunks and their vectors. The collection is created in the `rag_demo` database and is named after your username.



In [None]:
from pymongo import MongoClient
import os

mongo_db_name = 'rag_demo'
mongo_coll_name = username_widget.value

mongo_client = MongoClient(mongodb_uri_widget.value)
mongo_coll = mongo_client[mongo_db_name][mongo_coll_name]
mongo_db_and_coll_path = '{}.{}'.format(mongo_db_name, mongo_coll_name)

# Delete existing documents -- run before demo
mongo_coll.delete_many({})

doc_count = mongo_coll.count_documents({})
'{} document count is {:,}'.format(mongo_db_and_coll_path, doc_count)

'rag_demo.picellopaolo document count is 0'

# Step 2 - Connect GDrive
Allow access to Google Drive

In [None]:
#mount GDrive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Create a folder within your Google Drive account. This folder will be used to upload the documents.


In [None]:
import os

folder_name = 'RAG_workshop_documents'
folder_path = '/content/drive/MyDrive/' + folder_name

# Create the folder in Google Drive if it does not exist yet
if not os.path.exists(folder_path):
    os.mkdir(folder_path)

# Step 3 - Upload some PDFs to the newly created GDrive folder
Upload some sample PDFs to the newly created folder in your Google Drive account.

You should find a new folder called `RAG_workshop_documents` at the root level of your Google Drive account. You are now ready to drop some PDF documents into this folder.


> ⛔ **Do not upload any document containing private, confidential or sensitive data!!** 🚫


> If you don't know which documents to upload, try the *Practical MongoDB Aggregations Book.pdf* available [here](https://www.practical-mongodb-aggregations.com/)






*Example:*
![](https://drive.google.com/uc?export=view&id=1dcEVyvsP3do5t-g_KonJROTUvapUg2aT)


In [None]:
# List documents in the folder
for filename in os.listdir(folder_path):
    print(filename)

Real-Time-Data-via-Event-Driven-Architecture.pdf
MongoDB_Atlas_Search_Transforming.pdf
MongoDB_Atlas_Security_Controls-v7k3rbhi3p.pdf
MongoDB_Best_Practices_Guide.pdf
Embedding-GenAI-with-MongoDB.pdf
Event-Driven-Applications.pdf
MongoDB_&_FSI_in_the_Blockchain_Era.pdf
Practical MongoDB Aggregations Book.pdf
rag.png
sample_documents.png
logos.png
example_collection.png
example_ui.png
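
The listing above mixes PDFs with image files, so only part of the folder is relevant for the next step. As a minimal sketch, the hypothetical helper `list_pdfs` below keeps only files with a `.pdf` extension, ignoring case. It runs against a temporary folder so it can be tried outside Colab; the file names in the demo are made up.

```python
import os
import tempfile


def list_pdfs(folder: str) -> list[str]:
    """Return the sorted names of files in `folder` with a .pdf extension (case-insensitive)."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.splitext(name)[1].lower() == ".pdf"
    )


# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["guide.pdf", "REPORT.PDF", "rag.png"]:
        open(os.path.join(tmp, name), "w").close()
    demo_pdfs = list_pdfs(tmp)

print(demo_pdfs)  # ['REPORT.PDF', 'guide.pdf']
```

In the notebook, the same function could be called as `list_pdfs(folder_path)`.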


# Step 4 - Loop through the different files and split them into chunks (by page)

To split the documents into chunks we leverage LangChain. Each document is split into multiple chunks (by page). Each chunk contains the page content and metadata (original document and page number).

In [None]:
from langchain.document_loaders import PyPDFLoader

chunked_docs = {}

for filename in os.listdir(folder_path):
    # Only process PDF files; skip images and other file types
    if filename.lower().endswith('.pdf'):
        print(filename)
        loader = PyPDFLoader(os.path.join(folder_path, filename))
        # load_and_split() returns a list of LangChain Document chunks
        chunked_docs[filename] = loader.load_and_split()
        print('computed ' + str(len(chunked_docs[filename])) + ' chunks for document: ' + filename)


Real-Time-Data-via-Event-Driven-Architecture.pdf
computed 23 chunks for document: Real-Time-Data-via-Event-Driven-Architecture.pdf
MongoDB_Atlas_Search_Transforming.pdf
computed 23 chunks for document: MongoDB_Atlas_Search_Transforming.pdf
MongoDB_Atlas_Security_Controls-v7k3rbhi3p.pdf
computed 34 chunks for document: MongoDB_Atlas_Security_Controls-v7k3rbhi3p.pdf
MongoDB_Best_Practices_Guide.pdf
computed 30 chunks for document: MongoDB_Best_Practices_Guide.pdf
Embedding-GenAI-with-MongoDB.pdf
computed 17 chunks for document: Embedding-GenAI-with-MongoDB.pdf
Event-Driven-Applications.pdf
computed 18 chunks for document: Event-Driven-Applications.pdf
MongoDB_&_FSI_in_the_Blockchain_Era.pdf
computed 20 chunks for document: MongoDB_&_FSI_in_the_Blockchain_Era.pdf
Practical MongoDB Aggregations Book.pdf
computed 280 chunks for document: Practical MongoDB Aggregations Book.pdf
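
To make the structure of a chunk concrete, here is a minimal sketch that mimics it with plain Python. The `Chunk` dataclass and the `chunk_by_page` function are hypothetical stand-ins, not LangChain classes; the field names `page_content` and `metadata` follow the ones used elsewhere in this notebook, and the zero-based page number is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk: page text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_by_page(pages: list[str], source: str) -> list[Chunk]:
    """Create one chunk per non-empty page, recording the source file and page number."""
    return [
        Chunk(page_content=text, metadata={"source": source, "page": number})
        for number, text in enumerate(pages)
        if text.strip()  # skip blank pages
    ]


pages = ["Introduction to aggregations", "   ", "The $group stage"]
chunks = chunk_by_page(pages, "example.pdf")
for chunk in chunks:
    print(chunk.metadata, "->", chunk.page_content)
```

The blank second page is skipped, so two chunks are produced and the second one keeps its original page number.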


# Step 5 - Generate Vectors for each chunk
We will use the `text-embedding-ada-002` model from OpenAI to generate the vectors. The vectors will be stored in the `rag_demo` database, in the collection named after your username.

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

embeddings_model = OpenAIEmbeddings(
    model='text-embedding-ada-002',
    openai_api_key=openai_api_key_widget.value
)


for key, value in chunked_docs.items():
    print('Computing vectors for document: ' + key)
    # Embed every chunk of this document and insert the results into the collection
    vector_db = MongoDBAtlasVectorSearch.from_documents(
        value,
        embeddings_model,
        collection=mongo_coll
    )



Computing vectors for document: Real-Time-Data-via-Event-Driven-Architecture.pdf
Computing vectors for document: MongoDB_Atlas_Search_Transforming.pdf
Computing vectors for document: MongoDB_Atlas_Security_Controls-v7k3rbhi3p.pdf
Computing vectors for document: MongoDB_Best_Practices_Guide.pdf
Computing vectors for document: Embedding-GenAI-with-MongoDB.pdf
Computing vectors for document: Event-Driven-Applications.pdf
Computing vectors for document: MongoDB_&_FSI_in_the_Blockchain_Era.pdf
Computing vectors for document: Practical MongoDB Aggregations Book.pdf


In [None]:
doc_count = mongo_coll.count_documents({})
'MongoDB document count in {} is {:,}'.format(mongo_db_and_coll_path, doc_count)

'MongoDB document count in rag_demo.picellopaolo is 445'

The collection in MongoDB will look something like this. As you can see, every MongoDB document holds a reference to the source document (in the `source` field), the content of the page (in the `text` field), the vector (in the `embedding` field) and the page number (in the `page` field).

![](https://drive.google.com/uc?export=view&id=18qnK-o3MXMH1YZb76UsVLK6IxZ0suY51)
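
The document shape described above can also be sketched as a plain Python dictionary. `build_chunk_document` is a hypothetical helper, not something LangChain exposes; it only illustrates the four fields and checks that the vector has the 1536 dimensions produced by `text-embedding-ada-002`. The zero-filled vector in the demo is a placeholder, not a real embedding.

```python
# text-embedding-ada-002 returns vectors with 1536 dimensions
EMBEDDING_DIMENSIONS = 1536


def build_chunk_document(text, embedding, source, page):
    """Build a dictionary with the same fields as the stored MongoDB documents."""
    if len(embedding) != EMBEDDING_DIMENSIONS:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSIONS} dimensions, got {len(embedding)}"
        )
    return {
        "text": text,                  # content of the page
        "embedding": list(embedding),  # vector computed by the embedding model
        "source": source,              # path of the original PDF
        "page": page,                  # page number within the PDF
    }


# Placeholder vector for illustration only
doc = build_chunk_document("The $group stage", [0.0] * 1536, "example.pdf", 2)
print(sorted(doc.keys()))  # ['embedding', 'page', 'source', 'text']
```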

# Step 6 - Create MongoDB Atlas Vector Search index

Now that we have computed our vectors and loaded our chunks into MongoDB Atlas, we need to define a Vector Search index.

In [None]:
from pymongo.errors import OperationFailure
import inspect

mongo_index_def = {
    'name': 'rag_demo_index',
    'definition': {
        'mappings': {
            'dynamic': True,
            'fields': {
                'embedding': {
                    'type': 'knnVector',
                    'dimensions': 1536,
                    'similarity': 'cosine'
                }
            }
        }
    }
}

try:
    mongo_coll.create_search_index(mongo_index_def)
    print('Search index is building')
except OperationFailure as e:
    print(e.details['codeName'])

IndexAlreadyExists
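
The index definition uses `cosine` as its similarity function. As a minimal sketch of what that measure computes, the pure-Python function below returns the cosine of the angle between two vectors. It is only an illustration of the formula; Atlas computes similarity internally and may report a normalized score rather than this raw value.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and the vector length does not affect the result.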


# Step 7 - Set up the question function

We are now ready to test our RAG. Let's first build the function that will submit questions to our LLM. The function also returns the top chunk retrieved by Atlas Vector Search, so that the UI can show the retrieved content next to the final answer generated by the LLM.

In this case the prompt is hidden by LangChain; you can think of it as something like "*Answer the following question, but only using these documents that I'm passing to you*".

In [None]:
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

vector_db = MongoDBAtlasVectorSearch.from_connection_string(
    mongodb_uri_widget.value,
    mongo_db_and_coll_path,
    embeddings_model,
    index_name='rag_demo_index'
)

def query_data(query):

    # Convert question to vector using OpenAI embeddings
    # Perform Atlas Vector Search using Langchain's vectorStore
    # similarity_search returns MongoDB documents most similar to the query

    docs = vector_db.similarity_search(query, k=1)
    as_output = docs[0].page_content

    # Leveraging Atlas Vector Search paired with Langchain's QARetriever

    # Define the LLM that we want to use -- note that this is the Language Generation Model and NOT an Embedding Model
    # Here the model is set explicitly to gpt-3.5-turbo-instruct;
    # if model_name is omitted, LangChain falls back to its own default OpenAI model

    llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key_widget.value, temperature=0)

    # Get VectorStoreRetriever: Specifically, Retriever for MongoDB VectorStore.
    retriever = vector_db.as_retriever()

    # Load "stuff" documents chain. Stuff documents chain takes a list of documents,
    # inserts them all into a prompt and passes that prompt to an LLM.

    # Default prompt: "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."

    qa = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=retriever)

    # Execute the chain
    retriever_output = qa.run(query)

    # Return Atlas Vector Search output, and output generated using RAG Architecture
    return as_output, retriever_output
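
To picture what the "stuff" chain does with the retrieved chunks, here is a minimal sketch using plain string formatting. It is not LangChain's implementation: `build_stuff_prompt` is a hypothetical function, and the layout of the final prompt is an approximation built around the default instructions quoted in the code comments above.

```python
# Default instructions as quoted in the notebook's code comments
DEFAULT_INSTRUCTIONS = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)


def build_stuff_prompt(context_chunks, question):
    """Insert all retrieved chunks into a single prompt, followed by the question."""
    context = "\n\n".join(context_chunks)
    return (
        f"{DEFAULT_INSTRUCTIONS}\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


prompt = build_stuff_prompt(
    ["Use $group to group documents.", "Use $match to filter documents."],
    "How can I group documents in MongoDB?",
)
print(prompt)
```

Because every retrieved chunk is placed into one prompt, the number and size of the chunks directly affect how much text is sent to the LLM.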

# Step 8 - Create UI

Let's build a very simple UI to test our RAG.

In [None]:
import gradio as gr
from gradio.themes.base import Base
with gr.Blocks(theme=Base(), title="Question Answering App using Vector Search + RAG") as demo:
    gr.Markdown(
        """
        # Question Answering App using Atlas Vector Search + RAG Architecture
        """)
    textbox = gr.Textbox(label="Enter your Question:")
    with gr.Row():
        button = gr.Button("Submit", variant="primary")
    with gr.Column():
        output1 = gr.Textbox(lines=1, max_lines=10, label="Output with just Atlas Vector Search (returns text field as is):")
        output2 = gr.Textbox(lines=1, max_lines=10, label="Output generated by chaining Atlas Vector Search to Langchain's RetrieverQA + OpenAI LLM:")

    # Call query_data function upon clicking the Submit button

    button.click(query_data, textbox, outputs=[output1, output2])

demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://50adcc3b1bc8ce6a26.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




For example, if I ask:

> *how can I group documents in mongodb? Can you give me an example?*

I can see the relevant chunk that was retrieved and sent to the LLM, together with the final generated answer:



> *To group documents in MongoDB, you can use the $group stage in the aggregation pipeline. For example, if you have a collection of flight seats and you want to group them by flight ID and count the number of first class seats, you can use the following aggregation pipeline:*

```
db.seats.aggregate([
 {
   $match: { "class": "first" }
 },
 {
   $group: { _id: "$flightId", firstClassSeats: { $sum: 1 } }
 }
])
```


Feel free to try different queries and different documents.

Example UI:
![](https://drive.google.com/uc?export=view&id=1d-bCZsuH4g-TpOrPVK650ZwflJ3r_OJw)

# Step 9 (optional) - Make it fun: Modify the prompt

In [None]:
from langchain.chains.prompt_selector import ConditionalPromptSelector, is_chat_model
from langchain_core.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

vector_db = MongoDBAtlasVectorSearch.from_connection_string(
    mongodb_uri_widget.value,
    mongo_db_and_coll_path,
    embeddings_model,
    index_name='rag_demo_index'
)

def query_data_custom_prompt(query):

    # Convert question to vector using OpenAI embeddings
    # Perform Atlas Vector Search using Langchain's vectorStore
    # similarity_search returns MongoDB documents most similar to the query

    docs = vector_db.similarity_search(query, k=1)
    as_output = docs[0].page_content

    # Leveraging Atlas Vector Search paired with Langchain's QARetriever

    # Define the LLM that we want to use -- note that this is the Language Generation Model and NOT an Embedding Model
    # Here the model is set explicitly to gpt-3.5-turbo-instruct;
    # if model_name is omitted, LangChain falls back to its own default OpenAI model

    llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key_widget.value, temperature=0)

    # Get VectorStoreRetriever: Specifically, Retriever for MongoDB VectorStore.
    retriever = vector_db.as_retriever()

    ################# MODIFY THE PROMPT ##################
    # To keep the code working DO NOT delete the {context} and
    # {question} bit of the prompt. You can add instructions
    # for the LLM at the beginning or at the end of the prompt
    ######################################################

    prompt_template = """

      ***WRITE YOUR MODIFIED PROMPT HERE*** example: I want you to act as a rapper. You have to give the answer as if it were an American rap song, everything should be in rhyme.

      Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

      {context}

      Question: {question}

      Helpful Answer:"""

    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )

    chain_type_kwargs = {"prompt": PROMPT}

    # Load "stuff" documents chain. Stuff documents chain takes a list of documents,
    # inserts them all into a prompt and passes that prompt to an LLM.
    # Default prompt: "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    qa = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=retriever, chain_type_kwargs=chain_type_kwargs)

    # Execute the chain
    retriever_output = qa.run(query)

    # Return Atlas Vector Search output, and output generated using RAG Architecture
    return as_output, retriever_output

In [None]:
query_data_custom_prompt("what are some best practices when it comes to indexing in mongoDB?")

('20\nIndexing\nEliminate unnecessary indexes. Indexes are resource-intensive and consume RAM and disk. As fields \nare updated, associated indexes must be maintained, incurring additional CPU and disk I/O overhead. \nIf you’re running fully managed databases on MongoDB Atlas, the built-in Performance Advisor \nsuggests dropping unused, redundant, and hidden indexes to improve write performance and increase \nstorage space.\nReduce the size and performance overhead of indexes by only including documents that will be \naccessed through the index. For example, create a partial index on the orderID field that only includes \norder documents with an orderStatus of “In progress, ” or only indexes the emailAddress field for \ndocuments where it exists.Figure 7: Remove indexes that are unused, either because the field is not used to query the database or because the \nindex is redundant.',
 " Yo, listen up, I'm here to drop some knowledge\n      Indexes are important, but don't go overboard, 

Some example outputs using the "rapper prompt":


```
Q: How can i group data in mongodb?

MongoDB's Aggregation Framework is the key,
It allows you to group data with just a few clicks.
An aggregation pipeline is what you need,
To process, transform, and return results with speed.

The pipeline is a set of operations in order,
Each stage takes the output of the previous, no need to border.
With $match and $group, you can filter and combine,
Reducing code and data manipulation, now that's just fine.

Data modeling is important, don't forget,
Store related data together, it's a safe bet.
Subdocuments and embedded models, oh my,
Update related data with just one atomic write, oh my my.

So if you want to group data in MongoDB,
Just use the Aggregation Framework, it's easy as can be.
No need for joins, just store data in a document,
With MongoDB, your data will always be on point.
```


```
Q: what are some of the best practices when it comes to security in MongoDB?

When it comes to security, MongoDB's got your back
Their product security teams work together, that's a fact
They find and prevent issues, building features anew
Reviewing code and tracking, they know what to do

Their customer-facing software goes through a CI/CD
With peer-review and automated testing, it's secure as can be
And if a security incident does arise
MongoDB's got a policy, they'll inform you in no time

Patching and change management, they've got it down
With automated tools and monitoring, they'll never frown
They keep their server software updated, that's for sure
And with change management, they've got a QA test plan in store

So when it comes to security, MongoDB's got the best
With their best practices guide, you'll never be stressed
Just follow their tips and you'll be good to go
MongoDB's got your back, that's all you need to know.
```

Feel free to modify the prompt and play around with it to see how it can affect your results.
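
When editing the prompt it is easy to delete `{context}` or `{question}` by accident. As a minimal sketch, the hypothetical helper `missing_variables` below uses Python's standard `string.Formatter` to report which of the two required placeholders are absent from a template. It is not part of LangChain and assumes the template uses the same `{name}` placeholder syntax shown in the code above.

```python
from string import Formatter

# The two placeholders the prompt must keep
REQUIRED_VARIABLES = {"context", "question"}


def missing_variables(template: str) -> set[str]:
    """Return the required placeholders that do not appear in the template."""
    found = {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # literal-only segments have no field name
    }
    return REQUIRED_VARIABLES - found


good = "Answer as a rapper.\n\n{context}\n\nQuestion: {question}\n\nHelpful Answer:"
bad = "Answer as a rapper.\n\nQuestion: {question}\n\nHelpful Answer:"

print(missing_variables(good))  # set()
print(missing_variables(bad))   # {'context'}
```

Running such a check on `prompt_template` before building the chain gives a clearer message than an error raised later during the query.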


Let's update our UI to add a button that calls this new function. This will allow us to quickly compare how prompt engineering affects the results of an LLM.

In [None]:
with gr.Blocks(theme=Base(), title="Question Answering App using Vector Search + RAG") as demo:
    gr.Markdown(
        """
        # Question Answering App using Atlas Vector Search + RAG Architecture
        """)
    textbox = gr.Textbox(label="Enter your Question:")
    with gr.Row():
        button = gr.Button("standard prompt", variant="primary")
        button2 = gr.Button("modified prompt", variant="primary")
    with gr.Column():
        output1 = gr.Textbox(lines=1, max_lines=10, label="Output with just Atlas Vector Search (returns text field as is):")
        output2 = gr.Textbox(lines=1, max_lines=10, label="Output generated by chaining Atlas Vector Search to Langchain's RetrieverQA + OpenAI LLM:")

    # Call the matching query function when each button is clicked

    button.click(query_data, textbox, outputs=[output1, output2])
    button2.click(query_data_custom_prompt, textbox, outputs=[output1, output2])

demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://4349b40ec383c936b7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Example question:

```
What are some best practices when it comes to indexing in mongoDB?
```

Standard response:
```
Some best practices for indexing in MongoDB include eliminating unnecessary indexes,
reducing the size and performance overhead of indexes, and only including documents
that will be accessed through the index. It is also important to consider the limitations
and performance overhead of using database search for more sophisticated search experiences.
```

Modified prompt (rap song) response:
```
Listen up, I'll give you some tips
For indexing in MongoDB, don't let it slip
Indexes are resource-intensive, that's a fact
So eliminate unnecessary ones, that's the hack

As fields are updated, indexes must be maintained
That means more CPU and disk I/O, it's a pain
But don't worry, I'll tell you what to do
To improve write performance, here's a clue

If you're on MongoDB Atlas, you're in luck
The Performance Advisor will help you unstuck
It suggests dropping unused and redundant ones
To increase storage space and have more fun

Only include documents that will be accessed
Through the index, don't leave them neglected
Create partial indexes, that's the key
For example, only include "In progress" orders, you see

Or just index the email addresses that exist
Don't waste resources, that's the gist
So remember, when it comes to indexing
Be smart and efficient, that's the best thing.
```

# Conclusion

In this lab we learned how to build a RAG system using MongoDB, OpenAI and LangChain, and we saw how powerful prompt engineering can be. It was meant as a fun experiment to give you hands-on experience with this new architectural pattern.

Now think of your customers and try to uncover use cases where RAG and vector search can bring value.




# Advanced use cases

Some possible ways to improve this RAG system:
- different chunking techniques
- leverage metadata when filtering results (`filter` option in `$vectorSearch`)
- prompt engineering (use different prompts)
- support multiple data types (txt, gdocs, etc)
- combine with keyword search for hybrid search
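
As a starting point for the metadata filtering idea, the sketch below builds an aggregation pipeline that uses the `filter` option of `$vectorSearch` to restrict results to a single source document. `build_vector_search_pipeline` is a hypothetical helper; the index name and field names are the ones used in this notebook, while the `num_candidates` and `limit` defaults are arbitrary assumptions. Filtering also requires the filtered field to be indexed appropriately, so the index definition from Step 6 would likely need to be adapted; check the Atlas documentation before relying on this.

```python
def build_vector_search_pipeline(query_vector, source=None, limit=5, num_candidates=100):
    """Build a $vectorSearch pipeline, optionally filtered by the `source` field."""
    stage = {
        "index": "rag_demo_index",      # index name used in this notebook
        "path": "embedding",            # field that stores the vectors
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if source is not None:
        # Assumption: `source` is indexed so that it can be used as a filter
        stage["filter"] = {"source": {"$eq": source}}
    return [
        {"$vectorSearch": stage},
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "source": 1,
                "page": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Short placeholder vector for illustration; a real query vector has 1536 dimensions
pipeline = build_vector_search_pipeline([0.1, 0.2], source="example.pdf", limit=3)
print(pipeline[0]["$vectorSearch"]["filter"])
```

The resulting list could then be passed to `mongo_coll.aggregate(pipeline)`, with the query vector produced by the same embedding model used for the chunks.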
