[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/Haystack_MongoDB_Atlas_RAG.ipynb)


# Haystack and MongoDB Atlas RAG notebook

Install dependencies:

In [1]:
pip install haystack-ai mongodb-atlas-haystack tiktoken



## Set up the MongoDB Atlas connection and OpenAI


* Set the MongoDB connection string. Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.

* Set the OpenAI API key. Steps to obtain an API key are [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).

In [2]:
import getpass
import os

In [3]:
os.environ["MONGO_CONNECTION_STRING"] = getpass.getpass(
    "Enter your MongoDB connection string:"
)

Enter your MongoDB connection string:··········


In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your Open AI Key:")

Enter your Open AI Key:··········
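
Before building any pipelines, it can help to confirm that both settings are actually present. The following is a minimal sketch, not part of the original notebook: `missing_settings` and `looks_like_mongo_uri` are hypothetical helpers, and the URI check only tests for the two standard MongoDB schemes rather than validating the whole string.

```python
import os

# The two environment variables set in the cells above.
REQUIRED_VARS = ("MONGO_CONNECTION_STRING", "OPENAI_API_KEY")


def missing_settings(env=os.environ):
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def looks_like_mongo_uri(value):
    """Cheap sanity check: MongoDB URIs start with one of these two schemes."""
    return value.startswith(("mongodb://", "mongodb+srv://"))


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
elif not looks_like_mongo_uri(os.environ["MONGO_CONNECTION_STRING"]):
    print("MONGO_CONNECTION_STRING does not look like a MongoDB URI.")
else:
    print("Both settings are present.")
```

The check never prints the secret values themselves, which keeps them out of the notebook output.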


## Create vector search index on collection

Follow this [tutorial](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/) to create a vector search index on the `test_collection` collection of the `haystack_test` database.

Verify that the index is named `vector_index` and that its definition specifies:
```
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```
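
The definition above can also be assembled and checked in Python before it is pasted into the Atlas UI. This is a sketch under stated assumptions: `vector_index_definition` is a hypothetical helper, and the default of 1536 dimensions matches the `text-embedding-ada-002` embeddings this notebook produces. A different embedding model would need a different `numDimensions`.

```python
import json

# Similarity functions accepted by Atlas Vector Search index definitions.
ALLOWED_SIMILARITY = {"cosine", "dotProduct", "euclidean"}


def vector_index_definition(path="embedding", num_dimensions=1536, similarity="cosine"):
    """Build the index definition as a dict, rejecting obviously bad values."""
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be positive")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = vector_index_definition()
print(json.dumps(definition, indent=2))
```

Recent PyMongo releases can create search indexes programmatically, so a dict like this could likely be passed to the driver instead of the UI. Check the driver documentation for the exact call supported by your version.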

### Set up the vector store to load documents

In [5]:
from haystack import Document, Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.retrievers.mongodb_atlas import (
    MongoDBAtlasEmbeddingRetriever,
)
from haystack_integrations.document_stores.mongodb_atlas import (
    MongoDBAtlasDocumentStore,
)

# Create some example documents
documents = [
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome."),
]

document_store = MongoDBAtlasDocumentStore(
    database_name="haystack_test",
    collection_name="test_collection",
    vector_search_index="vector_index",
)

Build the writer pipeline to load the documents:

In [6]:
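# Setting up a document writer to handle the insertion of documents into the MongoDB collection.
# DuplicatePolicy.SKIP leaves documents whose IDs already exist untouched, so a re-run writes nothing new.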
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

# Initializing a document embedder to convert text content into vectorized form.
doc_embedder = OpenAIDocumentEmbedder()

# Creating a pipeline for indexing documents. The pipeline includes embedding and writing documents.
indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")

# Connecting the components of the pipeline for document flow.
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")

# Running the pipeline with the list of documents to index them in MongoDB.
indexing_pipe.run({"doc_embedder": {"documents": documents}})

Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]


{'doc_embedder': {'meta': {'model': 'text-embedding-ada-002',
   'usage': {'prompt_tokens': 32, 'total_tokens': 32}}},
 'doc_writer': {'documents_written': 0}}
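
The `documents_written: 0` in the output above is what one would expect if the same three documents were already in the collection from an earlier run, because `DuplicatePolicy.SKIP` does not overwrite existing IDs. The sketch below imitates that behavior with a plain dict. `content_id` and `write_skip_duplicates` are hypothetical helpers, and hashing only the content is a simplification of how Haystack derives document IDs.

```python
import hashlib


def content_id(content):
    """Derive a stable ID from the text, so identical content maps to the same ID."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def write_skip_duplicates(store, contents):
    """Insert only contents whose ID is not in the store; return how many were written."""
    written = 0
    for content in contents:
        doc_id = content_id(content)
        if doc_id in store:
            continue  # SKIP semantics: keep the existing entry
        store[doc_id] = content
        written += 1
    return written


store = {}
contents = [
    "My name is Jean and I live in Paris.",
    "My name is Mark and I live in Berlin.",
    "My name is Giorgio and I live in Rome.",
]
first_run = write_skip_duplicates(store, contents)
second_run = write_skip_duplicates(store, contents)
print(first_run, second_run)  # prints "3 0"
```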

## Build a RAG Pipeline

Let's create a pipeline that retrieves relevant documents, augments the prompt with them, and generates a response to the user's question.

In [9]:
# Prompt template: lists the retrieved documents as context, followed by the user's query.
prompt_template = """
    You are an assistant allowed to use the following context documents.
    Documents:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    Query: {{query}}
    Answer:
"""

# Setting up a retrieval-augmented generation (RAG) pipeline for generating responses.
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())

# Adding a component for retrieving related documents from MongoDB based on the query embedding.
rag_pipeline.add_component(
    instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=15),
    name="retriever",
)

# Building prompts based on retrieved documents to be used for generating responses.
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template), name="prompt_builder"
)

# Adding a language model generator to produce the final text output.
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")

# Connecting the components of the RAG pipeline to ensure proper data flow.
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7fc98d95bdf0>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: MongoDBAtlasEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
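
The `text_embedder.embedding -> retriever.query_embedding` connection is where the ranking happens: the query vector is compared against the stored document vectors using the similarity function configured on the index (cosine here). The following brute-force sketch illustrates the idea only. The three-dimensional vectors are made-up toy values rather than real embeddings, `cosine_similarity` and `top_k` are hypothetical helpers, and Atlas uses an approximate nearest-neighbor index instead of scanning every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, embedded_docs, k):
    """Score every (text, embedding) pair and return the k best as (score, text)."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in embedded_docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
toy_docs = [
    ("My name is Jean and I live in Paris.", [0.9, 0.1, 0.0]),
    ("My name is Mark and I live in Berlin.", [0.1, 0.9, 0.1]),
    ("My name is Giorgio and I live in Rome.", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 0.8, 0.1]

ranked = top_k(toy_query, toy_docs, k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

With only three documents in the collection, the `top_k=15` used in the pipeline simply returns all of them, so the prompt contains every document and the generator has to pick out the relevant one.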

Let's test the pipeline:

In [12]:
query = "Where does mark live?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": query},
        "prompt_builder": {"query": query},
    }
)
print(result["llm"]["replies"][0])

Mark lives in Berlin.
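
To see roughly what text reaches the generator for this query, the prompt can be assembled by hand. This sketch mimics the template with plain string joins, so its whitespace will differ from what `PromptBuilder` actually renders. `render_prompt` is a hypothetical helper, and the document list assumes the retriever returned all three example documents.

```python
def render_prompt(documents, query):
    """Assemble a prompt with the same sections as the notebook's template."""
    lines = [
        "You are an assistant allowed to use the following context documents.",
        "Documents:",
    ]
    lines.extend(documents)  # one retrieved document per line
    lines.append("")
    lines.append(f"Query: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


retrieved = [
    "My name is Mark and I live in Berlin.",
    "My name is Jean and I live in Paris.",
    "My name is Giorgio and I live in Rome.",
]
prompt = render_prompt(retrieved, "Where does mark live?")
print(prompt)
```

The answer "Mark lives in Berlin." comes from the context placed in the prompt, not from anything the model knew beforehand, which is the point of the retrieval step.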
