# Preprocessing Documents and Storing Them in Qdrant

In this notebook, we preprocess documents before storing them in the Qdrant vector database. The steps are:
1. Load the files into memory
2. Chunk the extracted text
3. Vectorize the chunks
4. Store the chunks and their vectors in the Qdrant database

Let's import the required packages first. We use `UnstructuredMarkdownLoader` from the `langchain-community` package and `PyMuPDF4LLMLoader` from the `langchain-pymupdf4llm` package. If you haven't installed these packages yet, you can use the following command.

`pip install -U langchain-community unstructured langchain-pymupdf4llm langchain-openai qdrant-client`

In [8]:
import os

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct, HnswConfigDiff
from langchain_openai.embeddings import OpenAIEmbeddings

from langchain_pymupdf4llm import PyMuPDF4LLMLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredMarkdownLoader

## Loading File to Memory and Chunk The Extracted Text

As the heading suggests, we load files in the `.md` and `.pdf` formats. I chose these two formats because I have some learning notes that I wrote in Notion. I plan to extend the local LLM's capabilities so that it can value a stock qualitatively.

In [3]:
# Load files
supported_file_formats = [ '.md', '.pdf' ]
documents_pth = "../documents"

document_ls = os.listdir(documents_pth)
supported_doc_ls = [ doc for doc in document_ls if doc.endswith(tuple(supported_file_formats)) ]
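
Note that `str.endswith` is case-sensitive, so a file named `report.PDF` would be skipped by the filter above. A minimal sketch of a case-insensitive variant follows. `filter_supported` is a hypothetical helper name that is not part of this notebook, and it assumes plain file names as input.

```python
from pathlib import PurePath


def filter_supported(file_names, suffixes=(".md", ".pdf")):
    """Return the names whose extension is in `suffixes`, ignoring case."""
    allowed = {suffix.lower() for suffix in suffixes}
    return [name for name in file_names if PurePath(name).suffix.lower() in allowed]


# Toy file names standing in for the output of os.listdir(documents_pth)
print(filter_supported(["notes.md", "report.PDF", "image.png", "README"]))
```

The result could replace `supported_doc_ls` if your document folder mixes upper-case and lower-case extensions.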

Load each file and split its text using a recursive character text splitter.

In [17]:
def load_and_split_document(file_path, text_splitter):
    """Load a document with the loader that matches its file format, then split it into chunks."""
    if file_path.endswith('.md'):
        loader = UnstructuredMarkdownLoader(file_path)
    elif file_path.endswith('.pdf'):
        loader = PyMuPDF4LLMLoader(file_path)
    else:
        print(f"File {file_path} is not in a supported file format")
        return []

    document = loader.load_and_split(text_splitter=text_splitter)

    return document

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunk_documents_ls = []
for doc_pth in supported_doc_ls:
    chunk_documents_ls.extend(load_and_split_document(os.path.join(documents_pth, doc_pth), text_splitter=text_splitter))
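
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts text into fixed-size windows sharing some characters with their neighbours. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraphs and sentences. `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the previous one, which is the role `chunk_overlap=200` plays for the 1000-character chunks above: context at a chunk boundary is less likely to be lost.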

## Vectorize The Extracted Text

To vectorize the text, we use the `OpenAIEmbeddings` class from the `langchain-openai` package. We vectorize all chunks as a batch using the `.embed_documents` method.

In [18]:
OPENAI_ENDPOINT="http://localhost:8080/v1"
OPENAI_API_KEY="my-openai-api-key"

emb_client = OpenAIEmbeddings(
    base_url=OPENAI_ENDPOINT,
    api_key=OPENAI_API_KEY,
)

chunk_content_ls = [ doc.page_content for doc in chunk_documents_ls ]
embeddings_documents = emb_client.embed_documents(chunk_content_ls)

In [19]:
len(embeddings_documents)  # Check the number of embedded documents

13
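
Besides the number of vectors, it can help to check their length, because the collection's vector size has to match the output dimension of the embedding model. The sketch below uses toy vectors in place of `embeddings_documents`. `embedding_dimension` is a hypothetical helper, not part of any library used here.

```python
def embedding_dimension(vectors):
    """Return the shared length of all vectors; raise if the list is empty or lengths differ."""
    if not vectors:
        raise ValueError("no vectors to inspect")
    lengths = {len(vector) for vector in vectors}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(lengths)}")
    return lengths.pop()


# Toy vectors standing in for `embeddings_documents`
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(embedding_dimension(toy_embeddings))  # 3
```

Calling the helper on the real embeddings would give the value to use as the vector size of the collection.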

## Store The Data Into Qdrant Database

1. We connect our client to the Qdrant vector database running on localhost. If you did not install Qdrant using Docker, you can pass `":memory:"` to use Qdrant as a temporary in-memory database.
2. We define the collection's structure, such as the vector size and the HNSW configuration.
3. We wrap each vector and its payload in a `PointStruct`, then store the points using the `.upsert` method.
4. To search, we use the `.query_points` method with the vectorized query as input.
5. To delete a collection, we use the `.delete_collection` method.

In [None]:
qdrant_client = QdrantClient(url="http://localhost:6333")
# qdrant_client = QdrantClient(":memory:") # use this if you did not install qdrant using docker

Define a collection and its data structure

In [21]:
COLLECTION_NAME = "my_collections"
VECTOR_SIZE = 1024 # depends on your embedding model
M_VALUE = 32 # Number of edges per node in the HNSW graph
EF_CONSTRUCT = 128 # Number of neighbours considered while building the index

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,
        hnsw_config=HnswConfigDiff(
            m=M_VALUE,
            ef_construct=EF_CONSTRUCT,
            full_scan_threshold=1000
        )
    ),
    on_disk_payload=True
)

True
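
The collection uses `Distance.COSINE`, so search scores reflect the angle between the query vector and each stored vector, with higher scores meaning more similar. As a rough illustration of the metric itself (not of Qdrant's internal implementation), cosine similarity can be computed in plain Python as follows. `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

Because only the direction matters, scaling a vector does not change its score against a query.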

Prepare each point and its payload before inserting it into the Qdrant vector database.

In [22]:
data_points = []

for i, (chunk, emb) in enumerate(zip(chunk_documents_ls, embeddings_documents)):
    temp_payload = {
        "source": chunk.metadata.get("source", None),
        "text": chunk.page_content,
    }

    point = PointStruct(
        id=i,
        vector=emb,
        payload=temp_payload
    )
    data_points.append(point)

len(data_points)

13
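
The points above use the loop index as their ID, so upserting a different set of documents into the same collection later would overwrite points that share an index. One possible alternative, sketched here as an assumption rather than this notebook's approach, derives a deterministic UUID from the source path and the chunk position. Qdrant accepts unsigned integers and UUIDs as point IDs. `make_point_id` is a hypothetical helper.

```python
import uuid


def make_point_id(source, chunk_index):
    """Return a deterministic UUID string for a chunk, based on its source and position."""
    # uuid5 hashes the name within a namespace, so the same input always gives the same ID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{chunk_index}"))


first = make_point_id("notes.md", 0)
again = make_point_id("notes.md", 0)
other = make_point_id("notes.md", 1)
print(first == again, first == other)  # True False
```

With IDs like these, re-running the notebook on the same files updates the existing points instead of colliding with points from other files.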

Upsert the points into the Qdrant vector database.

In [23]:
operation_info = qdrant_client.upsert(
    collection_name=COLLECTION_NAME,
    wait=True,
    points=data_points
)

print(operation_info)

operation_id=0 status=<UpdateStatus.COMPLETED: 'completed'>
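
With 13 points a single upsert call is fine. For much larger collections, it may be preferable to send the points in smaller batches so that each request stays a manageable size. A minimal sketch of the batching step is shown below. `batched` is a hypothetical helper and the batch size is an arbitrary choice.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy integers standing in for `data_points`
print(list(batched(list(range(5)), batch_size=2)))
# [[0, 1], [2, 3], [4]]
```

In this notebook, each batch could be passed as the `points` argument of `qdrant_client.upsert` inside a loop.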


Run a simple search to check whether the data has been inserted.

In [29]:
query = "margin of safety"

search_result = qdrant_client.query_points(
    collection_name=COLLECTION_NAME,
    query=emb_client.embed_query(query),
    with_payload=True,
    limit=5
).points

for res in search_result:
    print(f"ID: {res.id}, Score: {res.score}, Payload: {res.payload.get('source')}")
    print(f"Text: {res.payload['text'][:50]}\n")

ID: 1, Score: 0.5689739, Payload: ../documents\Book “The Snowball Warren Buffet and The Business.md
Text: Over the long run facts will be more important tha

ID: 8, Score: 0.53517836, Payload: ../documents\Book “Warren Buffet and The Interpretation of Financial report.md
Text: In the cash flow statement, you want to make sure 

ID: 4, Score: 0.49515468, Payload: ../documents\Book “The Snowball Warren Buffet and The Business.md
Text: Companies listed on the stock markets are subject 

ID: 2, Score: 0.49213064, Payload: ../documents\Book “The Snowball Warren Buffet and The Business.md
Text: Focus on cheap and dislike stock. You can also com

ID: 12, Score: 0.48533383, Payload: ../documents\how-to-identify-superior-stock_warren-buffet.pdf
Text: ## **Finding business with moat quickly**

You can



Delete the collection once you no longer need it.

In [20]:
qdrant_client.delete_collection(collection_name=COLLECTION_NAME)

True