# Draft - In Review

In this tutorial, you will build a retrieval-augmented generation (RAG) application using the [LangChain](https://www.langchain.com/) framework, [OpenAI](https://platform.openai.com/docs/introduction) models, and [Gradio](https://www.gradio.app/) for the interface. We'll guide you through building a question-answering application that uses a vector database to give more accurate and informed responses. This hands-on session aims to provide a high-level understanding of these concepts and demonstrate their practical application in real-world scenarios.

## How to run this notebook

This Colab Notebook is designed to be run against your own MongoDB Atlas cluster. You can [sign up for a *free* MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register) and create a free cluster.

You will also need an OpenAI account with some credit. If you don't have one, you can [sign up for an OpenAI account](https://platform.openai.com/signup). Then create an OpenAI API key. This requires a paid account with enough credit: OpenAI API requests stop working once the credit balance reaches $0.

## Running in Google Colab

You need to configure two secrets (using the key icon on the side of the page).

Set `mongodb_connection_string` to a valid MongoDB connection string. This should include a valid username and password, _and_ a database name, like this:


    mongodb+srv://USERNAME:PASSWORD@sandbox.abcdef.mongodb.net/DATABASE?retryWrites=true&w=majority
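
The database name matters because the notebook later calls `get_default_database()`, which relies on the name embedded in the connection string. As a rough sketch, the check below uses only the standard library to confirm that a URI carries a database name. The helper name `database_name_from_uri` is hypothetical, and the sketch assumes any special characters in the username or password are percent-encoded.

```python
from urllib.parse import urlsplit


def database_name_from_uri(uri):
    """Return the database name embedded in a MongoDB URI, or None if absent."""
    # For ".../DATABASE?retryWrites=true" the path component is "/DATABASE".
    path = urlsplit(uri).path
    name = path.lstrip("/")
    return name or None


example_uri = (
    "mongodb+srv://USERNAME:PASSWORD@sandbox.abcdef.mongodb.net/"
    "DATABASE?retryWrites=true&w=majority"
)
print(database_name_from_uri(example_uri))
```

If the function returns `None`, add the database name to the connection string before running the rest of the notebook.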

Set `openai_api_key`, which should store the OpenAI key you can create on the [OpenAI API key page](https://platform.openai.com/api-keys).

Finally, you need a [GitHub Access Token](https://github.com/settings/personal-access-tokens/new), stored as a secret named `github_access_token`.

## Running locally

If you wish to run the notebook locally, you'll need to set the _environment variables_ `openai_api_key` and `mongodb_connection_string` to the values described in the previous section.
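
Before starting the notebook locally, it can help to confirm that both variables are actually set. The following is a minimal sketch of such a check; the helper name `missing_settings` is hypothetical and not part of the notebook.

```python
import os

# The two environment variables the notebook reads when running outside Colab.
REQUIRED_SETTINGS = ("openai_api_key", "mongodb_connection_string")


def missing_settings(environ):
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_SETTINGS if not environ.get(name)]


missing = missing_settings(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```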

In [1]:
try:
  # If we're in a colab environment, load the configured secrets:
  from google.colab import userdata
  openai_api_key = userdata.get('openai_api_key')
  mongodb_connection_string = userdata.get('mongodb_connection_string')
except ImportError:
  # If the notebook is running outside of colab, configuration is via
  # environment variables.
  import os
  openai_api_key = os.environ['openai_api_key']
  mongodb_connection_string = os.environ['mongodb_connection_string']


In [2]:
# Install the necessary libraries
import sys
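# openai and tiktoken back the OpenAI embeddings and LLM, unstructured is used by
# DirectoryLoader, and gradio provides the web interface.
!{sys.executable} -m pip install langchain pymongo openai tiktoken gradio unstructured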


We're going to divide this task into two steps:

* Load some data into MongoDB and add vector embeddings.
* Query the data using vector search and pass the results to OpenAI to interpret them.

First...

## Loading some data

In [5]:
# Import the following libraries:
from pymongo import MongoClient
from langchain_core.documents import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain.document_loaders import DirectoryLoader, GithubFileLoader
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA


In [None]:
# In this tutorial, we will load three text files from a directory using the DirectoryLoader.
# Save these files to a directory named sample_files:
# "log_example.txt"

# 2023-08-16T16:43:06.537+0000 I MONGOT [63528f5c2c4f78275d37902d-f5-u6-a0 BufferlessChangeStreamApplier] [63528f5c2c4f78275d37902d-f5-u6-a0 BufferlessChangeStreamApplier] Starting change stream from opTime=Timestamp{value=7267960339944178238, seconds=1692203884, inc=574}
# 2023-08-16T16:43:06.543+0000 W MONGOT [63528f5c2c4f78275d37902d-f5-u6-a0 BufferlessChangeStreamApplier] [c.x.m.r.m.common.SchedulerQueue] cancelling queue batches for 63528f5c2c4f78275d37902d-f5-u6-a0
# 2023-08-16T16:43:06.544+0000 E MONGOT [63528f5c2c4f78275d37902d-f5-u6-a0 InitialSyncManager] [BufferlessInitialSyncManager 63528f5c2c4f78275d37902d-f5-u6-a0] Caught exception waiting for change stream events to be applied. Shutting down.
# com.xgen.mongot.replication.mongodb.common.InitialSyncException: com.mongodb.MongoCommandException: Command failed with error 286 (ChangeStreamHistoryLost): 'Executor error during getMore :: caused by :: Resume of change stream was not possible, as the resume point may no longer be in the oplog.' on server atlas-6keegs-shard-00-01.4bvxy.mongodb.net:27017.
# 2023-08-16T16:43:06.545+0000 I MONGOT [indexing-lifecycle-3] [63528f5c2c4f78275d37902d-f5-u6-a0 ReplicationIndexManager] Transitioning from INITIAL_SYNC to INITIAL_SYNC_BACKOFF.
# 2023-08-16T16:43:18.068+0000 I MONGOT [config-monitor] [c.x.m.config.provider.mms.ConfCaller] Conf call response has not changed. Last update date: 2023-08-16T16:43:18Z.
# 2023-08-16T16:43:36.545+0000 I MONGOT [indexing-lifecycle-2] [63528f5c2c4f78275d37902d-f5-u6-a0 ReplicationIndexManager] Transitioning from INITIAL_SYNC_BACKOFF to INITIAL_SYNC.



# "chat_conversation.txt"

# Alfred: Hi, can you explain to me how compression works in MongoDB?
# Bruce: Sure! MongoDB supports compression of data at rest. It uses either zlib or snappy compression algorithms at the collection level. When data is written, MongoDB compresses and stores it compressed. When data is read, MongoDB uncompresses it before returning it. Compression reduces storage space requirements.
# Alfred: Interesting, that's helpful to know. Can you also tell me how indexes are stored in MongoDB?
# Bruce: MongoDB indexes are stored in B-trees. The internal nodes of the B-trees contain keys that point to children nodes or leaf nodes. The leaf nodes contain references to the actual documents stored in the collection. Indexes are stored in memory and also written to disk. The in-memory B-trees provide fast access for queries using the index.
# Alfred: Ok that makes sense. Does MongoDB compress the indexes as well?
# Bruce: Yes, MongoDB also compresses the index data using prefix compression. This compresses common prefixes in the index keys to save space. However, the compression is lightweight and focused on performance vs storage space. Index compression is enabled by default.
# Alfred: Great, that's really helpful context on how indexes are handled. One last question - when I query on a non-indexed field, how does MongoDB actually perform the scanning?
# Bruce: MongoDB performs a collection scan if a query does not use an index. It will scan every document in the collection in memory and on disk to select the documents that match the query. This can be resource intensive for large collections without indexes, so indexing improves query performance.
# Alfred: Thank you for the detailed explanations Bruce, I really appreciate you taking the time to walk through how compression and indexes work under the hood in MongoDB. Very helpful!
# Bruce: You're very welcome! I'm glad I could explain the technical details clearly. Feel free to reach out if you have any other MongoDB questions.

# "aerodynamics.txt"

# Boundary layer control, achieved using suction or blowing methods, can significantly reduce the aerodynamic drag on an aircraft's wing surface. The yaw angle of an aircraft, indicative of its side-to-side motion, is crucial for stability and is controlled primarily by the rudder. With advancements in computational fluid dynamics (CFD), engineers can accurately predict the turbulent airflow patterns around complex aircraft geometries, optimizing their design for better performance.
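
The loader cell further down expects these files to exist in `./sample_files`. One way to create them is sketched below for `aerodynamics.txt`, using the text shown above. The helper name `write_sample_file` is hypothetical, and the other two files can be written the same way.

```python
from pathlib import Path

# Text of the "aerodynamics.txt" sample shown above.
AERODYNAMICS_TEXT = (
    "Boundary layer control, achieved using suction or blowing methods, can "
    "significantly reduce the aerodynamic drag on an aircraft's wing surface. "
    "The yaw angle of an aircraft, indicative of its side-to-side motion, is "
    "crucial for stability and is controlled primarily by the rudder. "
    "With advancements in computational fluid dynamics (CFD), engineers can "
    "accurately predict the turbulent airflow patterns around complex aircraft "
    "geometries, optimizing their design for better performance."
)


def write_sample_file(directory, filename, text):
    """Create the directory if needed and write one sample file into it."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / filename
    path.write_text(text, encoding="utf-8")
    return path


sample_path = write_sample_file("sample_files", "aerodynamics.txt", AERODYNAMICS_TEXT)
print(f"Wrote {sample_path}")
```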

In [None]:
# Connect to MongoDB and select the collection that will store the documents
client = MongoClient(mongodb_connection_string)
collection_name = "collection_of_text_blobs"
collection = client.get_default_database().get_collection(collection_name)

In [7]:
# Let's initialise the directory loader
loader = DirectoryLoader('./sample_files', glob="./*.txt", show_progress=True)
data = loader.load()


In [None]:
# Define the OpenAI Embedding Model we want to use for the source data
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [None]:
# Initialise the VectorStore: this embeds each document and inserts it into the collection
vectorStore = MongoDBAtlasVectorSearch.from_documents( data, embeddings, collection=collection )

In [None]:
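# Create an Atlas Vector Search index on the collection using the JSON definition below.
# This is an index definition to enter in the Atlas UI, not Python code.
# numDimensions is 1536 to match the vectors produced by the default OpenAI embedding model.
# Unless index_name is passed, LangChain's MongoDBAtlasVectorSearch is expected to look for an index named "default".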
{
  "type": "vectorSearch",
  "fields": [{
    "path": "embedding",
    "numDimensions": 1536,
    "similarity": "cosine",
    "type": "vector"
  }]
}
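
The `"similarity": "cosine"` setting means documents are ranked by the cosine of the angle between the query vector and each stored vector. The sketch below illustrates that ranking idea in plain Python with toy three-dimensional vectors. It is an illustration only: the real embeddings have 1536 dimensions, Atlas searches through the index rather than comparing against every document, and the names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vector, stored_documents):
    """Return the stored document whose embedding is closest to the query."""
    return max(
        stored_documents,
        key=lambda doc: cosine_similarity(query_vector, doc["embedding"]),
    )


# Toy documents with made-up 3-dimensional embeddings.
documents = [
    {"text": "log entry", "embedding": [1.0, 0.0, 0.0]},
    {"text": "chat transcript", "embedding": [0.0, 1.0, 0.0]},
    {"text": "aerodynamics note", "embedding": [0.0, 0.2, 1.0]},
]

best = most_similar([0.1, 0.9, 0.1], documents)
print(best["text"])
```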

## Performing vector search using Atlas Vector Search


In [None]:
# Define the OpenAI Embedding Model we want to use
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)



In [None]:
# Initialize the Vector Store
vectorStore = MongoDBAtlasVectorSearch( collection, embeddings )

In [None]:
# Define a function that answers a question in two ways: with semantic similarity search alone,
# and with the RAG architecture using RetrievalQA.
# The search-only step is included to highlight the difference between the two outputs.

def query_data(query):
    # Convert the question to a vector using OpenAI embeddings and
    # perform Atlas Vector Search using LangChain's vectorStore.
    # similarity_search returns the MongoDB documents most similar to the query.

    docs = vectorStore.similarity_search(query, k=1)
    as_output = docs[0].page_content

    # Next, use retrieval-augmented generation to answer the question:
    # Atlas Vector Search paired with LangChain's RetrievalQA chain.

    # Define the LLM that we want to use -- note that this is the language generation model and NOT an embedding model.
    # No model is specified below, so LangChain uses its default OpenAI model,
    # which depends on the installed LangChain version.

    llm = OpenAI(openai_api_key=openai_api_key, temperature=0)


    # Get VectorStoreRetriever: Specifically, Retriever for MongoDB VectorStore.
    # Implements _get_relevant_documents which retrieves documents relevant to a query.
    retriever = vectorStore.as_retriever()

    # Load "stuff" documents chain. Stuff documents chain takes a list of documents,
    # inserts them all into a prompt and passes that prompt to an LLM.

    qa = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=retriever)

    # Execute the chain

    retriever_output = qa.run(query)


    # Return Atlas Vector Search output, and output generated using RAG Architecture
    return as_output, retriever_output
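
To make the "stuff" step more concrete, here is a rough, library-free sketch of the idea: the retrieved documents are concatenated into a single context block, which is placed in a prompt together with the question. The prompt wording and the name `build_stuff_prompt` are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Insert every retrieved document into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "MongoDB indexes are stored in B-trees.",
    "Index compression is enabled by default.",
]
prompt = build_stuff_prompt("How are indexes stored in MongoDB?", retrieved)
print(prompt)
```

Because all documents go into one prompt, this approach suits a small number of short documents; a large result set could exceed the model's context window.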

In [None]:
# Create a web interface for the app using Gradio

import gradio as gr
from gradio.themes.base import Base

with gr.Blocks(theme=Base(), title="Question Answering App using Vector Search + RAG") as demo:
    gr.Markdown(
        """
        # Question Answering App using Atlas Vector Search + RAG Architecture
        """)
    textbox = gr.Textbox(label="Enter your Question:")
    with gr.Row():
        button = gr.Button("Submit", variant="primary")
    with gr.Column():
        output1 = gr.Textbox(lines=1, max_lines=10, label="Output with just Atlas Vector Search (returns text field as is):")
        output2 = gr.Textbox(lines=1, max_lines=10, label="Output generated by chaining Atlas Vector Search to LangChain's RetrievalQA + OpenAI LLM:")

    # Call the query_data function when the Submit button is clicked
    button.click(query_data, textbox, outputs=[output1, output2])

demo.launch(debug=True, share=True)


## Output



In [None]:
# Log analysis example

In [None]:
# Chat conversation example

In [None]:
# Sentiment analysis example

In [None]:
# Precise answer retrieval example

## Next Steps