## MongoDB Atlas

To use MongoDB Atlas, you must first deploy a cluster. The Atlas [quick start](https://www.mongodb.com/docs/atlas/getting-started/) walks through the setup.
Then create an Atlas database and an Atlas Vector Search index so that the stored vectors can be searched.

Follow these MongoDB Atlas guides:
- [Create new cluster](https://www.mongodb.com/docs/atlas/tutorial/create-new-cluster/)
- [Connect to database](https://www.mongodb.com/docs/atlas/driver-connection/)
- [Create an Atlas Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/)
- [Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

### Benefits
- The vectors live in MongoDB itself, next to the documents, so no separate vector database is needed.

In [18]:
! pip3 install -qU  markdownify  langchain-upstage rank_bm25 pymongo langchain langchain-mongodb

In [19]:

%load_ext dotenv
%dotenv
# UPSTAGE_API_KEY
# ... KEY

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
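The cell above relies on a `.env` file to supply the keys. A minimal sketch of a fail-fast check follows; `missing_env` and `require_env` are hypothetical helpers, not part of `dotenv`, and the variable names are the ones this notebook reads.

```python
import os
from typing import Iterable, List, Mapping


def missing_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in env."""
    return [name for name in names if not env.get(name)]


def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> None:
    """Raise early, with a readable message, if any variable is missing."""
    missing = missing_env(names, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env(["UPSTAGE_API_KEY", "MONGODB_ATLAS_CLUSTER_URI"])` right after `%dotenv` would surface a missing key before any network call is made.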


In [20]:
import warnings

warnings.filterwarnings("ignore")

In [21]:
from pymongo.mongo_client import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_upstage import UpstageEmbeddings
import os

# Your connection string should use the following format:
# mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
MONGODB_ATLAS_CLUSTER_URI = os.environ["MONGODB_ATLAS_CLUSTER_URI"]

# Connect to your Atlas cluster
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
# Define collection and index name
DB_NAME = "langchain_db"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"

db_collection = client[DB_NAME][COLLECTION_NAME]
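The connection string format above breaks if the username or password contains characters such as `@`, `:` or `/`, which must be percent-encoded. A small sketch of assembling the URI safely is shown here; `build_atlas_uri` is a hypothetical helper and the host value in the usage note is a placeholder.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"
```

For example, `build_atlas_uri("user", "p@ss/word", "cluster0.abcde.mongodb.net")` encodes the password as `p%40ss%2Fword`, and the result can be stored in `MONGODB_ATLAS_CLUSTER_URI`.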

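### Create Indexes

Create the Atlas Vector Search index fields for the collection. `path` must name the document field that holds the vector, and `numDimensions` must match the output size of the embedding model.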

[Create Index Fields](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search)

```json
{
  "fields": [
    {
      "numDimensions": 4096,
      "path": "embedding",
      "similarity": "dotProduct",
      "type": "vector"
    }
  ]
}
```
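The same definition can be built in Python and, as one option, submitted through the driver instead of the Atlas UI. This is a sketch under assumptions: `vector_index_definition` and `create_vector_index` are hypothetical helpers, and the driver call assumes a PyMongo version (4.7 or later) whose `SearchIndexModel` accepts `type="vectorSearch"`.

```python
from typing import Any, Dict

# Similarity functions accepted by Atlas Vector Search
ALLOWED_SIMILARITIES = {"euclidean", "cosine", "dotProduct"}


def vector_index_definition(
    num_dimensions: int, path: str = "embedding", similarity: str = "dotProduct"
) -> Dict[str, Any]:
    """Return an index definition shaped like the JSON above."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be positive")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, index_name: str, definition: Dict[str, Any]) -> str:
    """Submit the definition through PyMongo (assumes PyMongo 4.7+)."""
    # Imported lazily so the builder above works without pymongo installed
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(definition=definition, name=index_name, type="vectorSearch")
    return collection.create_search_index(model=model)
```

With the names used in this notebook, the call would be `create_vector_index(db_collection, ATLAS_VECTOR_SEARCH_INDEX_NAME, vector_index_definition(4096))`. Atlas builds the index asynchronously, so it may not be queryable immediately.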

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


sample_text = [
    "Korea is a beautiful country to visit in the spring.",
    "The best time to visit Korea is in the fall.",
    "Best way to find bug is using unit test.",
    "Python is a great programming language for beginners.",
    "Sung Kim is a great teacher.",
]

splits = RecursiveCharacterTextSplitter().create_documents(sample_text)

print(splits)

vectorstore = MongoDBAtlasVectorSearch.from_documents(
    documents=splits,
    collection=db_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME
)

[Document(page_content='Korea is a beautiful country to visit in the spring.'), Document(page_content='The best time to visit Korea is in the fall.'), Document(page_content='Best way to find bug is using unit test.'), Document(page_content='Python is a great programming language for beginners.'), Document(page_content='Sung Kim is a great teacher.')]
batch_size: 5


In [30]:
db_collection.find_one({"text":"Hello, new sentence"}) is not None

False

In [31]:
db_collection.find_one({"text":splits[0].page_content}) is not None

True

In [25]:
from langchain_upstage import UpstageLayoutAnalysisLoader


loader = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = loader.load()  # or loader.lazy_load()

In [26]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 125
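To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on separators first); `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share an overlap.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size that overlap by chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk starts with the last character of the previous one.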


In [27]:
from langchain_mongodb import MongoDBAtlasVectorSearch

vectorstore = MongoDBAtlasVectorSearch(
    collection=db_collection,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME
)
retriever = vectorstore.as_retriever()


unique_splits = [
    split for split in splits if not db_collection.find_one({"text":split.page_content})
]
print(len(unique_splits))

# 3. Embed and index only the splits that are not already in the vector store
if len(unique_splits) > 0:
    MongoDBAtlasVectorSearch.from_documents(
        documents=unique_splits,
        collection=db_collection,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    )

0
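The duplicate check above issues one `find_one` per split. A possible alternative, sketched here, fetches all matching texts with a single `$in` query and also drops duplicates within the batch. `filter_new_splits` is a hypothetical helper, and the `text` field name is assumed to match the one queried above.

```python
from typing import Any, List, Sequence


def filter_new_splits(splits: Sequence[Any], collection: Any, text_key: str = "text") -> List[Any]:
    """Return splits whose page_content is not yet stored, preserving order."""
    texts = [split.page_content for split in splits]
    if not texts:
        return []
    # One round trip: fetch only the text field of documents that already exist
    cursor = collection.find({text_key: {"$in": texts}}, {text_key: 1, "_id": 0})
    existing = {doc[text_key] for doc in cursor}
    seen = set()
    new_splits = []
    for split in splits:
        content = split.page_content
        if content in existing or content in seen:
            continue  # already stored, or repeated earlier in this batch
        seen.add(content)
        new_splits.append(split)
    return new_splits
```

In this notebook the call would be `unique_splits = filter_new_splits(splits, db_collection)`. For very large batches the `$in` list may need to be sent in slices.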


In [28]:
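# Query the retriever
search_result = retriever.invoke("How to find problems in code?")
print(search_result)
# Guard against an empty result, e.g. when the vector index is missing or still building
if search_result:
    print(search_result[0].page_content[:100])
else:
    print("No documents returned; check that the vector index exists and has finished building.")


[]
No documents returned; check that the vector index exists and has finished building.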
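An empty result list is consistent with a vector index that does not exist yet or is still building, although the output above does not confirm the cause. The sketch below polls until the index reports itself as queryable. `wait_until_queryable` is a hypothetical helper, and it assumes a PyMongo version that provides `Collection.list_search_indexes` and index documents that carry a `queryable` field.

```python
import time
from typing import Any, Callable


def wait_until_queryable(
    collection: Any,
    index_name: str,
    timeout_s: float = 120.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the search index until it is queryable; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Calling `wait_until_queryable(db_collection, ATLAS_VECTOR_SEARCH_INDEX_NAME)` before `retriever.invoke(...)` would distinguish "index not ready" from "no matching documents".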