# Example of combining Undatasio PDF parsing with LangChain and MongoDB

![](example_content/undatasio_example.png)

_By stay, Tech Enthusiast @Undatasio_
- - -
🚀 Let's begin this example.

   😃 😎 😝

📣 This notebook shows how to retrieve data from a PDF parsed by the Undatasio platform: the parsed output is loaded as a LangChain Document object, split into chunks, stored in MongoDB, and queried by vector similarity.

##### 📚 Below are the steps I took for this example:
- 📄 Upload the PDF file to be parsed to the undatasio platform.
  - _Install the undatasio Python library._
  - _Import environment variables._
  - _Use the undatasio Python library to convert the output to a langchain document object._
- 📝 Split the LangChain document, store the chunks in MongoDB, and run QA queries.
  - _First, install the Python libraries for LangChain and MongoDB._
  - _Next, use LangChain's RecursiveCharacterTextSplitter to split the original document._
  - _Then, embed the chunks and store them in a MongoDB Atlas vector store._
  - _Finally, use the vector_store object to run QA queries._

🎃 This is the entire process for this example. I hope you can gain some experience from it.

**The cells below take a PDF file processed by the Undatasio platform, convert it into a LangChain Document object, split it into chunks, and store the chunks in a MongoDB database for querying.**

#### Installing the **Undatasio** Python API library

In [1]:
# install undatasio
!pip install -U -q undatasio

Install the **python-dotenv** module and load environment variables using the **load_dotenv()** function.

> If you are unsure which environment variables are required, you can check the file named dev.env for explanations of the environment variables.

In [2]:
!conda install -c conda-forge python-dotenv -y -q

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [3]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

True
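
Since `load_dotenv()` returns `True` as soon as the file is found, it can be worth checking that the variables this notebook reads later are actually set. Below is a minimal sketch of such a check. The helper name `missing_env_vars` is hypothetical, and the list of names only covers the two variables used in this notebook (`UNDATASIO_API_KEY` and `MONGODB_URI`); dev.env may describe more.

```python
import os

# Names read later in this notebook; dev.env may list additional ones.
REQUIRED_VARS = ("UNDATASIO_API_KEY", "MONGODB_URI")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv('.env')` gives an early, readable warning instead of a failure deeper in the Undatasio or MongoDB calls.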

#### Use the Undatasio Python SDK
_To create an **UnDatasIO** object, you need a token from the Undatasio platform; a task name is optional._

In [4]:
from undatasio.undatasio import UnDatasIO

undatasio_obj = UnDatasIO(os.getenv("UNDATASIO_API_KEY"))

_The **get_result_to_langchain_document** function of the Undatasio object returns a Langchain Document object. Parameters for this function can be gleaned from the data returned by the **show_version** function._

In [5]:
lc_document = undatasio_obj.get_result_to_langchain_document(
    type_info=['text'],
    file_name='1d8c9bc374114b6e901da.pdf',
    version='v26'
)
lc_document



#### Install the required third-party libraries.

In [6]:
!pip install --upgrade --quiet langchain langchain-community langchain-core langchain-mongodb langchain_huggingface pymongo

Import all necessary classes and methods.

In [7]:
import getpass, os, pymongo, pprint
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

Splitting a **LangChain Document** object using the **RecursiveCharacterTextSplitter** class.

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents([lc_document])
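
To get a feel for what `chunk_size=200` and `chunk_overlap=20` mean, here is a simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraphs, lines, and spaces, while this toy version (the function name `sliding_chunks` is made up for illustration) cuts purely by character position.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both chunks, which tends to help retrieval.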

Create an embedding model object for **BAAI/bge-m3** from Hugging Face.

In [9]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "BAAI/bge-m3"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
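
The vector search index created later in this notebook declares a `numDimensions` value, and that value has to match the length of the vectors the embedding model produces. A small sketch for checking this is shown below. `embedding_dimension` is a hypothetical helper, and `FakeEmbedder` is only a stand-in so the snippet runs without downloading a model; the one assumption about the real object is that it exposes LangChain's `embed_query` method.

```python
def embedding_dimension(embedder, probe_text="dimension probe"):
    """Return the length of the vector produced for a single string."""
    return len(embedder.embed_query(probe_text))


class FakeEmbedder:
    """Stand-in for demonstration only: returns a fixed-size zero vector."""

    def __init__(self, size):
        self.size = size

    def embed_query(self, text):
        return [0.0] * self.size


print(embedding_dimension(FakeEmbedder(8)))  # prints 8
```

In the notebook itself, calling `embedding_dimension(embedding)` on the HuggingFaceEmbeddings object gives the number to compare against the index definition.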

Create a MongoDB client object, select the database and collection, and choose a name for the vector search index.

In [10]:
client = MongoClient(os.getenv("MONGODB_URI"))
db_name = "langchain_db"
collection_name = "test"
collection = client[db_name][collection_name]
vector_search_index = "vector_index"

Create a **MongoDBAtlasVectorSearch** object.

In [11]:
vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents = docs,
    embedding = embedding,
    collection = collection,
    index_name = vector_search_index
)

Create a SearchIndexModel object and add the vector search index to the MongoDB collection.

In [13]:
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                # BAAI/bge-m3 produces 1024-dimensional dense vectors
                "numDimensions": 1024,
                "similarity": "cosine"
            },
            {
                "type": "filter",
                "path": "source"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)

collection.create_search_index(model=search_index_model)


'vector_index'
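
`create_search_index` returns the index name right away, but Atlas builds the index in the background, so a query issued immediately afterwards may come back empty. The sketch below polls until the index reports itself as queryable. The helper name `wait_until_queryable` and its timeout values are my own choices; it assumes the documents returned by PyMongo's `list_search_indexes` carry a boolean `queryable` field, as they do on Atlas.

```python
import time


def wait_until_queryable(collection, index_name, timeout=120.0, poll_interval=5.0):
    """Poll the collection's search indexes until index_name is queryable.

    Returns True once the index is ready, or False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

In this notebook the call would be `wait_until_queryable(collection, vector_search_index)`, placed before the first similarity search.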

In [14]:
query="All eyes are on the July policy meetings."

In [15]:
results = vector_store.similarity_search(query=query)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
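
For intuition about what `similarity_search` does with the `cosine` setting, here is a toy, in-memory version: score every stored vector against the query vector and keep the best `k`. This is only an illustration under simplifying assumptions. Atlas Vector Search uses an index rather than scanning every vector, and the two-dimensional vectors and document ids below are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up 2-D vectors standing in for real embeddings.
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_docs, k=2))  # prints ['a', 'c']
```

The default `k=4` mirrors the number of results `similarity_search` returns when no `k` is passed.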