## Data Ingestion into MongoDB Database

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/chat_with_pdf_mongodb_openai_langchain_POLM_AI_Stack.ipynb)

**Steps to create a MongoDB database**
- [Register for a free MongoDB Atlas Account](https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&utm_source=workshop&utm_medium=organic_social&utm_content=rag%20to%20agents%20notebook&utm_term=richmond.alake)
- [Create a Cluster](https://www.mongodb.com/docs/guides/atlas/cluster/)
- [Get your connection string](https://www.mongodb.com/docs/guides/atlas/connection-string/)

## Vector Index Creation

- [Create an Atlas Vector Search Index](https://www.mongodb.com/docs/compass/current/indexes/create-vector-search-index/)

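- If you are following this notebook, make sure you create the vector search index on the right database (`anthropic_demo`) and collection (`research`).

Below is the vector search index definition for this notebook. `numDimensions` must match the length of the vectors your embedding model produces (1536 for the default `OpenAIEmbeddings` model used in the code below), and `path` names the document field that stores them: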

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
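
As a rough illustration of what `"similarity": "cosine"` measures, the sketch below computes cosine similarity in plain Python. Atlas performs this comparison server-side and may rescale the result into its own score range; `cosine_similarity` is a hypothetical helper for intuition only, and the 2-dimensional vectors stand in for the 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity is 1.0
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
# Perpendicular vectors: similarity is 0.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Because only direction matters, two chunks whose embeddings point the same way count as similar regardless of vector magnitude.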

- Name your vector search index `vector_index` if you are following this notebook; the code below passes that name as `index_name`.
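
The index can also be created from code instead of the UI. The following is a minimal sketch, assuming PyMongo 4.7 or later (where `SearchIndexModel` accepts `type="vectorSearch"`) and a cluster that permits programmatic search index creation. The helper names `build_vector_index_definition` and `create_vector_index` are hypothetical and not part of this notebook.

```python
def build_vector_index_definition(num_dimensions=1536, path="embedding", similarity="cosine"):
    """Return the index definition shown above as a Python dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, index_name="vector_index"):
    """Create the vector search index on a PyMongo collection (hypothetical helper)."""
    # Imported lazily so the definition helper works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=index_name,
        type="vectorSearch",  # assumption: requires PyMongo 4.7+
    )
    # Returns the name of the index that was requested.
    return collection.create_search_index(model=model)


definition = build_vector_index_definition()
print(definition)
```

With a connected client, a call such as `create_vector_index(client["anthropic_demo"]["research"])` would request the index. Index builds are asynchronous, so the index may take a short while to become queryable.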




## Code

In [None]:
! pip install --quiet langchain pymongo langchain-openai langchain-community pypdf
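
The main cell below splits the PDF with `chunk_size=1000` and `chunk_overlap=200`. To make those two settings concrete, here is a simplified fixed-width sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); it only illustrates how overlap makes consecutive chunks share text. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2500-character input standing in for extracted PDF text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.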


In [None]:
import os

from google.colab import userdata
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pymongo import MongoClient

# Set up your OpenAI API key
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

# Set up MongoDB connection
mongo_uri = userdata.get("MONGO_URI")
db_name = "anthropic_demo"
collection_name = "research"

client = MongoClient(mongo_uri, appname="devrel.showcase.chat_with_pdf")
db = client[db_name]
collection = db[collection_name]

# Set up document loading and splitting
loader = PyPDFLoader("mapping_llms.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

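# Set up embeddings and vector store
# Note: from_documents embeds and inserts the chunks each time this cell runs,
# so rerunning it adds duplicate documents to the collection.
embeddings = OpenAIEmbeddings()
vector_store = MongoDBAtlasVectorSearch.from_documents(
    texts, embeddings, collection=collection, index_name="vector_index"
)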

# Set up retriever and language model
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Set up RAG pipeline
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)


# Run a query through the RAG chain; return the answer and the retrieved source chunks
def process_query(query):
    result = qa_chain.invoke({"query": query})
    return result["result"], result["source_documents"]


# Example usage
query = "What is the document about?"
answer, sources = process_query(query)
print(f"Answer: {answer}")
print("Sources:")
for doc in sources:
    print(f"- {doc.metadata['source']}: {doc.page_content[:100]}...")

# Don't forget to close the MongoDB connection when done
client.close()

Answer: The document is about a significant advance in understanding the inner workings of AI models, specifically focusing on the interpretation of the features inside a large language model called Claude Sonnet. It discusses how millions of concepts are represented within the model, the ability to manipulate these features to see how the model's responses change, and the potential implications for making AI models safer and more trustworthy.
Sources:
- mapping_llms.pdf: As for the scientific risk, the proof is in the pudding.
We successfully extracted millions of featu...
- mapping_llms.pdf: A map of the features near an "Inner Conflict" feature, including clusters
related to balancing trad...
- mapping_llms.pdf: Interpret bility
M apping the M ind of a Large
Language M odel
21 May 2024
Today we report a signifi...
- mapping_llms.pdf: English word in a dictionary is made by combining letters, and
every sentence is made by combining w...
- mapping_llms.pdf: answer  I have no physical 