## MongoDB Atlas

To use MongoDB Atlas, you must first deploy a cluster. To get started head over to Atlas here: [quick start](https://www.mongodb.com/docs/atlas/getting-started/).
Create an Atlas database and create an Atlas Search Index to search vectors.

Follow below MongoDB Atlas guide
- [Create new cluster](https://www.mongodb.com/docs/atlas/tutorial/create-new-cluster/)
- [Connect to database](https://www.mongodb.com/docs/atlas/driver-connection/)
- [Create an Atlas Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/)
- [Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

### benefits?
- use mongoDB itself!

In [1]:
! pip3 install -qU  markdownify  langchain-upstage rank_bm25 pymongo langchain langchain-mongodb


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:

%load_ext dotenv
%dotenv
# UPSTAGE_API_KEY
# ... KEY

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [40]:
from pymongo.mongo_client import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_upstage import UpstageEmbeddings

"""
Your connection string should use following format:
mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
"""


# Connect to your Atlas cluster
client = MongoClient('mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net')

# Define collection and index name
db_name = "db"
collection_name = "collection"
atlas_collection = client[db_name][collection_name]
vector_search_index = "vector_index"

# Create Indexes

create atlas index fields.

[Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "dotProduct",
      "type": "vector"
    },
    {
      "path": "metadata.team",
      "type": "filter"
    },
    {
      "path": "metadata._id",
      "type": "filter"
    }
  ]
}
```

In [53]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema import Document
import hashlib
import uuid

sample_text = [
    "Korea is a beautiful country to visit in the spring.",
    "The best time to visit Korea is in the fall.",
    "Best way to find bug is using unit test.",
    "Python is a great programming language for beginners.",
    "Sung Kim is a great teacher.",
]

splits = RecursiveCharacterTextSplitter().create_documents(sample_text)

def generate_uuid_from_text(text):
    hash_object = hashlib.md5(text.encode('utf-8'))
    return str(uuid.UUID(hash_object.hexdigest()))

documents_with_ids = []
for doc in splits:
    doc_id = generate_uuid_from_text(doc.page_content)
    new_doc = Document(page_content=doc.page_content, metadata={'_id': doc_id})
    documents_with_ids.append(new_doc)
print(documents_with_ids)

vectorstore = MongoDBAtlasVectorSearch.from_documents(
    documents=documents_with_ids,
    collection=atlas_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large", api_key="up_gg262m01hW6eoW2xdf1NXT1VVTDj0"),
    index_name=vector_search_index
)

[Document(page_content='Korea is a beautiful country to visit in the spring.', metadata={'_id': 'ffdcd54b-d518-f7df-f467-d5a89e134915'}), Document(page_content='The best time to visit Korea is in the fall.', metadata={'_id': '4c433943-7d7d-bac3-8b2c-950e2ecf501b'}), Document(page_content='Best way to find bug is using unit test.', metadata={'_id': '64435a9d-c192-58eb-1b75-296b8426d51f'}), Document(page_content='Python is a great programming language for beginners.', metadata={'_id': '1f4d8b9f-d54d-897b-e3eb-5aa5711c15ba'}), Document(page_content='Sung Kim is a great teacher.', metadata={'_id': '3aee06fb-708c-3404-1d42-7f22c75895fb'})]


In [58]:
# check if text is in the vector store
# vector store is mongodb collection
def is_in_vectorstore(collection, document_id):
    result = collection.find_one({"_id": document_id})
    return result is not None

In [60]:
is_in_vectorstore(atlas_collection, "Hello, new sentence")

False

In [70]:
is_in_vectorstore(atlas_collection, documents_with_ids[0].metadata['_id'])

True

In [71]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

ValueError: Did not find UPSTAGE_API_KEY, please add an environment variable `UPSTAGE_API_KEY` which contains it, or pass  `UPSTAGE_API_KEY` as a named parameter.

In [9]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 125


In [10]:
from langchain_chroma import Chroma
from langchain_mongodb import MongoDBAtlasVectorSearch

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large"),
)
retriever = vectorstore.as_retriever()


unique_splits = [
    split for split in splits if not is_in_vectorstore(vectorstore, split.page_content)
]
print(len(unique_splits))

# 3. Embed & indexing if it's not in the vector store
if len(unique_splits) > 0:
    vectorstore = Chroma.from_documents(
        ids=[split.page_content for split in unique_splits],
        persist_directory="./chroma_db",
        documents=unique_splits,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
)

125


In [13]:
from langchain_chroma import Chroma

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large"),
)
retriever = vectorstore.as_retriever()

unique_splits = [
    split for split in splits if not is_in_vectorstore(vectorstore, split.page_content)
]

# It's already in the vector store so we don't need to index it again.
print(len(unique_splits))

if len(unique_splits) > 0:
    vectorstore = Chroma.from_documents(
        ids=[split.page_content for split in unique_splits],
        persist_directory="./chroma_db",
        documents=unique_splits,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
)



0


In [15]:
# Query the retriever
search_result = retriever.invoke("How to find problems in code?")
print(search_result[0].page_content[:100])


<p id='13' style='font-size:16px'>introduced bugs immediately. Several bug-finding techni-<br>ques c
