## MongoDB Atlas

To use MongoDB Atlas, you must first deploy a cluster. To get started, see the Atlas [quick start](https://www.mongodb.com/docs/atlas/getting-started/).
Then create an Atlas database and an Atlas Vector Search index so that you can search vectors.

Follow the MongoDB Atlas guides below:
- [Create new cluster](https://www.mongodb.com/docs/atlas/tutorial/create-new-cluster/)
- [Connect to database](https://www.mongodb.com/docs/atlas/driver-connection/)
- [Create an Atlas Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/)
- [Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

### Benefits
- Use MongoDB itself as the vector store.

In [1]:
! pip3 install -qU  markdownify  langchain-upstage rank_bm25 pymongo langchain langchain-mongodb




In [None]:

%load_ext dotenv
%dotenv
# UPSTAGE_API_KEY
# ... KEY

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [39]:
from pymongo.mongo_client import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_upstage import UpstageEmbeddings

"""
Your connection string should use the following format:
mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
"""


# Connect to your Atlas cluster
client = MongoClient('mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net')
# Define collection and index name
db_name = "db"
collection_name = "collection"
atlas_collection = client[db_name][collection_name]
vector_search_index = "vector_index"
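
If the username or password contains characters such as `@` or `/`, they need to be percent-encoded before they go into the connection string. The following is a minimal sketch of one way to do that with the standard library. The helper name `build_atlas_uri` and the sample credentials and host are hypothetical and only serve as illustration.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Return an SRV connection string with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical values for illustration only; use your own cluster details.
uri = build_atlas_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)  # mongodb+srv://app_user:p%40ss%2Fword@cluster0.example.mongodb.net
```

The resulting string can be passed to `MongoClient` in place of the hard-coded URI. Reading the credentials from environment variables (for example via the `dotenv` extension loaded earlier) keeps them out of the notebook.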

# Create Indexes

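Create the Atlas Vector Search index with the field definitions below. `numDimensions` must match the output size of the embedding model. See [Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search) for the available field types.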

```json
{
  "fields": [
    {
      "numDimensions": 4096,
      "path": "embedding",
      "similarity": "dotProduct",
      "type": "vector"
    },
    {
      "path": "metadata.team",
      "type": "filter"
    },
    {
      "path": "metadata._id",
      "type": "filter"
    }
  ]
}
```
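
The same definition can also be built and submitted from Python instead of the Atlas UI. The sketch below assumes PyMongo 4.7 or later, where `SearchIndexModel` accepts a `type` argument. The helper names `build_vector_index_definition` and `create_vector_index` are hypothetical and not part of any library.

```python
def build_vector_index_definition(
    num_dimensions, path="embedding", similarity="dotProduct", filter_paths=()
):
    """Build an Atlas Vector Search index definition as a plain dict."""
    fields = [
        {
            "type": "vector",
            "path": path,
            "numDimensions": num_dimensions,
            "similarity": similarity,
        }
    ]
    # Each filter field lets queries pre-filter on that path.
    fields.extend({"type": "filter", "path": p} for p in filter_paths)
    return {"fields": fields}


def create_vector_index(collection, name, definition):
    """Submit the definition to Atlas (assumes PyMongo 4.7 or later)."""
    # Imported here so the definition builder works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(definition=definition, name=name, type="vectorSearch")
    return collection.create_search_index(model=model)


definition = build_vector_index_definition(
    4096, filter_paths=["metadata.team", "metadata._id"]
)
print(definition)
```

Calling `create_vector_index(atlas_collection, vector_search_index, definition)` would then request the index. Atlas builds it asynchronously, so it may take a short while before queries can use it.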

In [40]:
from langchain.schema import Document
import hashlib
import uuid

def generate_uuid_from_text(text):
    hash_object = hashlib.md5(text.encode('utf-8'))
    return str(uuid.UUID(hash_object.hexdigest()))

def get_documents_with_ids(documents):
    documents_with_ids = []
    for doc in documents:
        doc_id = generate_uuid_from_text(doc.page_content)
        new_doc = Document(page_content=doc.page_content, metadata={'_id': doc_id})
        documents_with_ids.append(new_doc)
    return documents_with_ids
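
Because the ID is derived from an MD5 hash of the text, the same content always maps to the same `_id`, which is what makes the duplicate check later in this notebook possible. A small self-contained sketch of that property, repeating the helper so the block runs on its own:

```python
import hashlib
import uuid


def generate_uuid_from_text(text: str) -> str:
    """Format the MD5 digest of the text as a UUID-style string."""
    return str(uuid.UUID(hashlib.md5(text.encode("utf-8")).hexdigest()))


a = generate_uuid_from_text("Korea is a beautiful country to visit in the spring.")
b = generate_uuid_from_text("Korea is a beautiful country to visit in the spring.")
c = generate_uuid_from_text("The best time to visit Korea is in the fall.")

print(a == b)  # True: identical text gives an identical ID
print(a == c)  # False: different text gives a different ID
```

Note that any change to the text, even whitespace, produces a different ID, so a re-split with different chunk settings is treated as new content.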

In [41]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


sample_text = [
    "Korea is a beautiful country to visit in the spring.",
    "The best time to visit Korea is in the fall.",
    "Best way to find bug is using unit test.",
    "Python is a great programming language for beginners.",
    "Sung Kim is a great teacher.",
]

splits = RecursiveCharacterTextSplitter().create_documents(sample_text)
documents_with_ids = get_documents_with_ids(splits)

print(documents_with_ids)

vectorstore = MongoDBAtlasVectorSearch.from_documents(
    documents=documents_with_ids,
    collection=atlas_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=vector_search_index
)

[Document(page_content='Korea is a beautiful country to visit in the spring.', metadata={'_id': 'ffdcd54b-d518-f7df-f467-d5a89e134915'}), Document(page_content='The best time to visit Korea is in the fall.', metadata={'_id': '4c433943-7d7d-bac3-8b2c-950e2ecf501b'}), Document(page_content='Best way to find bug is using unit test.', metadata={'_id': '64435a9d-c192-58eb-1b75-296b8426d51f'}), Document(page_content='Python is a great programming language for beginners.', metadata={'_id': '1f4d8b9f-d54d-897b-e3eb-5aa5711c15ba'}), Document(page_content='Sung Kim is a great teacher.', metadata={'_id': '3aee06fb-708c-3404-1d42-7f22c75895fb'})]


In [13]:
# Check whether a document ID is already in the vector store.
# The vector store is backed by a MongoDB collection, so we can query it directly.
def is_in_vectorstore(collection, document_id):
    result = collection.find_one({"_id": document_id})
    return result is not None

In [45]:
is_in_vectorstore(atlas_collection, generate_uuid_from_text("Hello, new sentence"))

False

In [46]:
is_in_vectorstore(atlas_collection, documents_with_ids[0].metadata['_id'])

True
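
Checking one ID per `find_one` call means one round trip per document. For larger batches, a single query with `$in` is one possible alternative. The sketch below is a hedged illustration rather than part of the original notebook. The helper names `find_existing_ids` and `filter_new_documents` are hypothetical, and `collection` is assumed to be a PyMongo collection whose documents have an `_id` taken from `metadata['_id']`.

```python
def find_existing_ids(collection, document_ids):
    """Return the subset of document_ids that already exist in the collection."""
    ids = list(document_ids)
    if not ids:
        return set()
    # Project only _id so the stored embeddings are not transferred.
    cursor = collection.find({"_id": {"$in": ids}}, {"_id": 1})
    return {doc["_id"] for doc in cursor}


def filter_new_documents(collection, documents):
    """Keep only documents whose metadata['_id'] is not stored yet."""
    existing = find_existing_ids(collection, [d.metadata["_id"] for d in documents])
    return [d for d in documents if d.metadata["_id"] not in existing]
```

For example, `filter_new_documents(atlas_collection, documents_with_ids)` would return only the documents that still need to be embedded and inserted.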

In [16]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [47]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 125


In [48]:
from langchain_mongodb import MongoDBAtlasVectorSearch

vectorstore = MongoDBAtlasVectorSearch(
    collection=atlas_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=vector_search_index
)
retriever = vectorstore.as_retriever()


# Keep only the splits whose content-derived ID is not yet in the collection.
# The lookup must use the generated ID, not the raw page content.
unique_splits = [
    split
    for split in splits
    if not is_in_vectorstore(
        atlas_collection, generate_uuid_from_text(split.page_content)
    )
]
print(len(unique_splits))
unique_splits_with_ids = get_documents_with_ids(unique_splits)

# 3. Embed and index the splits that are not yet in the vector store
if len(unique_splits) > 0:
    MongoDBAtlasVectorSearch.from_documents(
        documents=unique_splits_with_ids,
        collection=atlas_collection,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
        index_name=vector_search_index,
    )

125


In [52]:
# Query the retriever
search_result = retriever.invoke("How to find problems in code?")
print(search_result)
print(search_result[0].page_content[:100])


[Document(page_content='Best way to find bug is using unit test.', metadata={'_id': '64435a9d-c192-58eb-1b75-296b8426d51f'}), Document(page_content="<p id='13' style='font-size:16px'>introduced bugs immediately. Several bug-finding techni-<br>ques could be used, including code inspections, unit testing,<br>and the use of static analysis tools. Since these steps would<br>be taken right after a code change was made, the developer<br>would still retain the full mental context of the change. This<br>holds promise for reducing the time required to find<br>software bugs and reducing the time that bugs stay resident<br>in software before removal.</p><br>", metadata={'_id': 'd9531bfe-b3ed-7151-eef5-1cf41c1105a7'}), Document(page_content="<p id='39' style='font-size:20px'>2.2 Mining Buggy Patterns</p><br><p id='40' style='font-size:16px'>One thread of research attempts to find buggy or clean code<br>patterns in the history of development of a software project.</p><br><p id='41' style='font-size