# Example: Combining undatasio PDF Parsing with LangChain and Redis

![](example_content/undatasio_example.png)

_By stay, Tech Enthusiast @Undatasio_
- - -
🚀 Let's begin this example.

   😃 😎 😝

📣 This notebook demonstrates how to retrieve formatted data from a Markdown file that the undatasio platform produced by parsing a PDF, using LangChain and Redis.

##### 📚 Below are the steps I took for this example:
- 📄 Upload the PDF file to be parsed to the undatasio platform.
  - _Install the undatasio Python library._
  - _Load environment variables._
  - _Use the undatasio Python library to convert the output to a LangChain Document object._
- 📝 Split the LangChain document, store the chunks in Redis, and perform QA queries.
  - _First, install all relevant Python libraries for LangChain and Redis._
  - _Next, create the embedding model and the RedisVectorStore object._
  - _Then, use LangChain's RecursiveCharacterTextSplitter to split the original document and add the chunks to the vector store._
  - _Finally, use the vector_store object to ask QA questions._

🎃 That is the entire process for this example. I hope you find it useful.

#### Installing the **Undatasio** Python API library

In [1]:
# install undatasio
!pip install -U -q undatasio

Install the **python-dotenv** module and load environment variables using the **load_dotenv()** function.

> If you are unsure which environment variables are required, you can check the file named dev.env for explanations of the environment variables.

In [2]:
!conda install -c conda-forge python-dotenv -y -q

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [3]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

True
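
The later cells read two variables with `os.getenv`: `UNDATASIO_API_KEY` and `REDIS_URI`. The following is a minimal sketch of a check that reports missing variables right after loading the `.env` file. The helper name `missing_env_vars` is hypothetical and is not part of python-dotenv.

```python
import os

# The two variable names below are the ones this notebook reads with os.getenv.
REQUIRED_VARS = ("UNDATASIO_API_KEY", "REDIS_URI")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this before creating the `UnDatasIO` object or the vector store makes a misconfigured `.env` file visible early, instead of surfacing later as an authentication or connection error.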

#### Use the Undatasio python SDK
_To create an **UnDatasIO** object, you need a token from the Undatasio platform and, optionally, a task name._

In [4]:
from undatasio.undatasio import UnDatasIO

undatasio_obj = UnDatasIO(os.getenv("UNDATASIO_API_KEY"))

_The **get_result_to_langchain_document** function of the Undatasio object returns a Langchain Document object. Parameters for this function can be gleaned from the data returned by the **show_version** function._

In [5]:
lc_document = undatasio_obj.get_result_to_langchain_document(
    type_info=['text'],
    file_name='1d8c9bc374114b6e901da.pdf',
    version='v26'
)
lc_document



#### Install the required third-party libraries.

In [6]:
!pip install -U -q langchain langchain-redis langchain-huggingface

Import the necessary classes and functions for the example.

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_redis import RedisVectorStore

Create an embedding model object for **BAAI/bge-m3** from Hugging Face.

In [8]:
model_name = "BAAI/bge-m3"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
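
With `normalize_embeddings` set to `False`, the vectors keep whatever length the model produces. Setting it to `True` asks the underlying sentence-transformers encoder to scale each vector to unit length. The sketch below only illustrates that scaling on a made-up two-dimensional vector, using plain Python; the helper names are hypothetical.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = l2_norm(vector)
    return list(vector) if norm == 0 else [x / norm for x in vector]


raw = [3.0, 4.0]            # made-up stand-in for an embedding
unit = l2_normalize(raw)    # same direction, length close to 1
print(l2_norm(raw), l2_norm(unit))
```

Normalization does not change the direction of a vector, so it leaves cosine-based rankings unchanged; it matters mainly when the store compares vectors by inner product or Euclidean distance.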

Create a **RedisVectorStore** object using embeddings and the **redis_url**.
_You can run a Redis instance in Docker to execute the current notebook example._
> docker pull redis/redis-stack-server:latest
>
> docker run -dit --name redis -p 6379:6379 redis/redis-stack-server:latest

In [9]:
vector_store = RedisVectorStore(
    index_name="undatasio_demo",
    embeddings=embedding,
    redis_url=os.getenv("REDIS_URI"),
)
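
The `redis_url` above comes from the `REDIS_URI` environment variable. For the Docker container started with `-p 6379:6379`, a plausible value is `redis://localhost:6379`; that value is an assumption, since the notebook does not show the contents of `.env`. The sketch below uses only the standard library to parse such a URI and test whether the port accepts TCP connections. Both helper names are hypothetical.

```python
import socket
from urllib.parse import urlparse


def parse_redis_uri(uri):
    """Split a redis:// URI into (host, port), defaulting to port 6379."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("redis", "rediss"):
        raise ValueError(f"Not a Redis URI: {uri!r}")
    return parsed.hostname or "localhost", parsed.port or 6379


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = parse_redis_uri("redis://localhost:6379")  # assumed value
print(f"Redis reachable at {host}:{port}: {is_reachable(host, port)}")
```

A `False` result usually means the container is not running or the port is not published; it does not verify that the server on that port is actually Redis.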

Split the **LangChain Document** object using the **RecursiveCharacterTextSplitter** class.

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents([lc_document])
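
To make the `chunk_size=200` and `chunk_overlap=20` settings concrete, here is a simplified sketch that cuts text into fixed character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are often shorter than the limit. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=200, chunk_overlap=20):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(500))  # 500 placeholder characters
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

Each window starts 180 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next. That overlap helps a sentence that straddles a boundary remain retrievable from at least one chunk.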

Add the split text chunks to the vector store.

In [11]:
# Sequential string IDs ("1", "2", ...), one per chunk.
ids = [str(i + 1) for i in range(len(docs))]
vector_store.add_documents(documents=docs, ids=ids)

['undatasio_demo:c41d4198e601457abb98c8ae89526f35',
 'undatasio_demo:0e23b42320734b0cb4a8b3ac39263a97',
 'undatasio_demo:bc265a68c45646a0a3e889daaded8789',
 'undatasio_demo:d529bfc3d23d4db38818302a5bee95b0',
 'undatasio_demo:9b50a1fb035c44a5a6ae1acdc15a9ef0',
 'undatasio_demo:77bbe1b77a1f4f2388f3d401bd5f6143',
 'undatasio_demo:ae260df8f38141a69751e629ef4f7d0a',
 'undatasio_demo:c33464bcad8e429196cfab296dafc8aa',
 'undatasio_demo:26f1e4c5543b4c54a1fedeaf80902c92',
 'undatasio_demo:62a791ac712d43d69694a97ed3214cba',
 'undatasio_demo:55fc0e103eac4f38bebdb9e1420d9c06',
 'undatasio_demo:e3e3a6a6cd4e4448923f238930d6a514',
 'undatasio_demo:ef8b29b11c5742799f07a2d6c88dd3fa',
 'undatasio_demo:e7cdc106a29140569c06d133343eb051',
 'undatasio_demo:cdbc8adc769643069fcf49107ef2e015',
 'undatasio_demo:0fc3feaf1daf416695cef57181cd1c55',
 'undatasio_demo:557d6d33a0834bc4b6516ccaa8766b1a',
 'undatasio_demo:b961fd733c1d4e6998ad612e7853b139',
 'undatasio_demo:b6536d3acbe44617a598c42a557152f7',
 'undatasio_

Query the vector store.

In [12]:
query = "All eyes are on the July policy meetings."

In [13]:
results = vector_store.similarity_search(query=query, k=1)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

* 3.All eyes are on the July policy meetings. July will be a hectic month for China
policy watchers: The Third Plenum of the Chinese Communist Party is scheduled for July [{}]
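
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that ranking step in plain Python, assuming a cosine metric and using made-up three-dimensional vectors in place of real embeddings. The actual search runs inside Redis and may use a different metric or an approximate index depending on how the store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional vectors standing in for real embeddings.
toy_index = {
    "chunk-1": [1.0, 0.0, 0.0],
    "chunk-2": [0.7, 0.7, 0.0],
    "chunk-3": [0.0, 0.0, 1.0],
}
best = top_k([0.9, 0.1, 0.0], toy_index, k=1)
print(best[0][0])
```

With `k=1`, as in the query above, only the single best match is returned; raising `k` returns more candidates in descending order of similarity.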
