# Example of combining Undatasio PDF parsing with LangChain and ChromaDB

![](example_content/undatasio_example.png)

_By stay, Tech Enthusiast @Undatasio_

- - -

#### Installing the **Undatasio** Python API library

In [1]:
# install undatasio
!pip install -U -q undatasio

Install the **python-dotenv** module and load environment variables using the **load_dotenv()** function.

> If you are unsure which environment variables are required, see the explanations in the file named **dev.env**.

In [2]:
!conda install -c conda-forge python-dotenv -y -q

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [3]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

True
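
Because **os.getenv** returns `None` for a variable that is not set, a missing key would only surface later as an authentication error. The sketch below fails fast instead. It assumes the key is stored under the name `UNDATASIO_API_KEY`, as in the following cell; `require_env` is a hypothetical helper written for this example, not part of undatasio or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error message; adjust it to point at your own env file.
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

After **load_dotenv()** has run, `require_env("UNDATASIO_API_KEY")` could be used in place of `os.getenv("UNDATASIO_API_KEY")`.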

#### Use the Undatasio Python SDK
_To create an **UnDatasIO** object, you need a token from the Undatasio platform; a task name is optional._

In [4]:
from undatasio.undatasio import UnDatasIO

undatasio_obj = UnDatasIO(os.getenv("UNDATASIO_API_KEY"))

_The **get_result_to_langchain_document** method of the **UnDatasIO** object returns a LangChain Document object. Suitable parameter values can be found in the data returned by the **show_version** method._

In [5]:
lc_document = undatasio_obj.get_result_to_langchain_document(
    type_info=['text'],
    file_name='1d8c9bc374114b6e901da.pdf',
    version='v26'
)
lc_document



#### Install the required third-party libraries.

In [6]:
!pip install -U -q langchain_chroma langchain_huggingface langchain-text-splitters chromadb

Create an embedding model object for **BAAI/bge-m3** from Hugging Face.

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "BAAI/bge-m3"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

Initialize a **Chroma** object from **langchain_chroma**.

In [8]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="foo",
    embedding_function=embedding,
    # other params...
)

Split the **LangChain Document** object using the **RecursiveCharacterTextSplitter** class.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents([lc_document])
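
To make the effect of **chunk_size** and **chunk_overlap** concrete, here is a simplified sketch that cuts text into fixed-size character windows. It is not the algorithm used by **RecursiveCharacterTextSplitter**, which first tries to split on separators such as paragraph breaks, newlines, and spaces; `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, which is the role **chunk_overlap=20** plays for the 200-character chunks above.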

Add the split text chunks to the vector store.

In [10]:
# Use each chunk's string representation as its ID; add_documents returns the list of IDs.
ids = [str(doc) for doc in docs]
vector_store.add_documents(documents=docs, ids=ids)

["page_content='2. Profit-takingxpullback has developed into a minor correction. While the10\\%\ncorrection Since mid-May is broadly in-line with the historical norms of most technical' metadata={'source': '_v26_1d8c9bc374114b6e901da.pdf_[text]'}",
 "page_content='bull runs, the six-week market weakness has prompted increasing investor questions\nabout the strengthof thepolicy put, and concerns regarding a redux of the powerful but' metadata={'source': '_v26_1d8c9bc374114b6e901da.pdf_[text]'}",
 "page_content='short-livedRe0peningrallyinlate2022/early2023.Empirically,inthe23episodesinthe\npast 20 years where MSCl China rallied more than20\\%, the market almost in all cases' metadata={'source': '_v26_1d8c9bc374114b6e901da.pdf_[text]'}",
 "page_content='(22 out of 23) experienced at least a5\\% pullback after entering a technical bull phase.\nThese corrections averaged12\\%by magnitude, and 32 days in duration, although their' metadata={'source': '_v26_1d8c9bc374114b6e901da.pdf_[text]'}"

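The IDs shown above are the full string representation of each chunk, so they are long, and two chunks with identical text and metadata would receive identical IDs. One possible alternative, sketched here under the assumption that short deterministic IDs are preferred, derives each ID from the source name, the chunk position, and a hash of the text. `make_chunk_ids` is a hypothetical helper, not part of LangChain or Chroma.

```python
import hashlib


def make_chunk_ids(texts: list[str], source: str) -> list[str]:
    """Build one short, deterministic ID per chunk from its source, position, and text."""
    ids = []
    for index, text in enumerate(texts):
        digest = hashlib.sha256(f"{source}:{index}:{text}".encode("utf-8")).hexdigest()
        # The position prefix keeps IDs unique even when two chunks have the same text.
        ids.append(f"{index}-{digest[:12]}")
    return ids


print(make_chunk_ids(["first chunk", "second chunk", "first chunk"], source="example.pdf"))
```

With the objects from this notebook, the call might look like `make_chunk_ids([d.page_content for d in docs], source=docs[0].metadata["source"])`, and the result could be passed as **ids** to **add_documents**.
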
Query the vector store.

In [12]:
query = "All eyes are on the July policy meetings."

In [13]:
results = vector_store.similarity_search(query=query, k=1)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

* 3.All eyes are on the July policy meetings. July will be a hectic month for China
policy watchers: The Third Plenum of the Chinese Communist Party is scheduled for July [{'source': '_v26_1d8c9bc374114b6e901da.pdf_[text]'}]
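
Conceptually, **similarity_search** embeds the query and returns the **k** stored chunks whose vectors are closest to it. The toy sketch below illustrates that ranking step with hand-written three-dimensional vectors and cosine similarity. Both are assumptions made for illustration: real embeddings come from the **BAAI/bge-m3** model, the distance metric depends on how the collection is configured, and `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    """Return the texts of the k stored vectors most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up texts and vectors, standing in for embedded chunks.
toy_store = [
    ("policy meetings in July", [0.9, 0.1, 0.0]),
    ("market pullback statistics", [0.1, 0.9, 0.2]),
    ("earnings revisions", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.1], toy_store, k=1))  # ['policy meetings in July']
```

A vector store performs the same kind of ranking, typically with an approximate nearest-neighbor index rather than a full sort.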
