# Processing: Load, Chunk, and Prepare for Vector DB

This notebook walks step by step through loading, enriching,
chunking, and storing documents in a vector database.
At each stage it compares the outputs interactively, with visualizations.

Build from scratch a chain that:
- Takes input documents (here, a set of arXiv PDFs).
- Chunks each document, keeping its metadata.
- Embeds the document chunks.
- Saves the embeddings to a vector DB.

In [1]:
from langchain_community.document_loaders import ArxivLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
from langchain.schema import Document
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from langchain.embeddings import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from sklearn.decomposition import PCA
import pandas as pd
import plotly.express as px
import tiktoken


In [2]:

# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="LLM",
    load_max_docs=50,
    top_k_results=20
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)

In [3]:
docs = loader.load()
len(docs)


20

# Splitter (these are the variables that change)

In [4]:
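encoding_name = "cl100k_base"
tokenizer = tiktoken.get_encoding(encoding_name)
# Pass the encoding explicitly; from_tiktoken_encoder otherwise defaults to "gpt2".
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name=encoding_name,
    chunk_size=300,  # maximum tokens per chunk
    chunk_overlap=30,  # tokens shared between consecutive chunks
)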
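
To make the two splitter variables concrete, here is a minimal sketch of how a chunk size of 300 and an overlap of 30 interact. It is a simplified fixed-window illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter breaks on separators first and then merges pieces up to the size limit). The function `sliding_token_chunks` and the placeholder tokens are hypothetical and stand in for real tiktoken tokens.

```python
def sliding_token_chunks(tokens, chunk_size=300, chunk_overlap=30):
    """Cut a token list into fixed windows that share `chunk_overlap` tokens.

    Hypothetical helper for illustration only; the recursive splitter
    in the notebook produces chunks of varying length instead.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Placeholder tokens standing in for a 1000-token document.
tokens = [f"tok{i}" for i in range(1000)]
chunks = sliding_token_chunks(tokens)

print([len(c) for c in chunks])
print(chunks[1][:30] == chunks[0][-30:])  # consecutive chunks share 30 tokens
```

With these settings each new chunk starts 270 tokens after the previous one, so raising the overlap increases the number of chunks (and embeddings) for the same document.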

# Chunking

In [6]:
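final_chunks = []  # every chunk from every document, in load order
chunk_by_doc = {}  # document title -> list of that document's chunks
for doc in docs:
    doc_chunks = []
    for i, chunk in enumerate(text_splitter.split_text(doc.page_content)):
        # Copy so each chunk carries its own metadata plus its position in the document.
        metadata = doc.metadata.copy()
        metadata["chunk_index"] = i
        doc_chunks.append(Document(page_content=chunk, metadata=metadata))
    # Documents without a "Title" share the fallback key and overwrite each other here.
    chunk_by_doc[doc.metadata.get("Title", "Document")] = doc_chunks
    final_chunks.extend(doc_chunks)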
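
Before embedding, it can help to check what the chunking loop produced. The sketch below counts chunks per title and verifies that each title's `chunk_index` values run from 0 to n-1 without gaps or repeats, which also flags two documents that share a title. The `Doc` dataclass is a hypothetical stand-in for `Document` so the example runs without LangChain, and the sample data is made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain.schema.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_counts_by_title(chunks):
    """Count chunks per title, using the same fallback key as the loop above."""
    return Counter(c.metadata.get("Title", "Document") for c in chunks)


def has_contiguous_indices(chunks):
    """Return True if every title's chunk_index values are exactly 0..n-1."""
    seen = {}
    for c in chunks:
        title = c.metadata.get("Title", "Document")
        seen.setdefault(title, []).append(c.metadata["chunk_index"])
    return all(sorted(v) == list(range(len(v))) for v in seen.values())


# Made-up sample chunks, shaped like the entries of final_chunks.
sample = [
    Doc("a0", {"Title": "Paper A", "chunk_index": 0}),
    Doc("a1", {"Title": "Paper A", "chunk_index": 1}),
    Doc("b0", {"Title": "Paper B", "chunk_index": 0}),
]

counts = chunk_counts_by_title(sample)
print(counts.most_common())
print(has_contiguous_indices(sample))
```

In the notebook, the same two functions could be called on `final_chunks` directly, since `Document` exposes the same `page_content` and `metadata` attributes.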

# Storing

In [14]:
# Store the chunks in a Qdrant vector DB.
# Assumes OPENAI_API_KEY is set and a Qdrant server is listening on localhost:6333.

embedding = OpenAIEmbeddings()

db = Qdrant.from_documents(
    documents=final_chunks,
    embedding=embedding,
    location="localhost:6333",
    collection_name="LLM-papers openai",
    prefer_grpc=False,
)

print("\n--- Stored in Qdrant Vector DB ---")
print(f"Collection: {db.collection_name}")


--- Stored in Qdrant Vector DB ---
Collection: LLM-papers openai
