# Processing: Load, Chunk, and Prepare for Vector DB

This notebook walks step by step through loading, enriching, chunking, and
storing documents in a vector database.
It compares the outputs of each stage interactively, with visualizations.

The goal is to build, from scratch, a chain that:
- Takes input documents (here, a set of arXiv PDF papers).
- Chunks each document, keeping its metadata.
- Embeds the document chunks.
- Saves the embeddings to a vector DB.

In [1]:
from langchain_community.document_loaders import ArxivLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
from langchain.schema import Document
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from langchain.embeddings import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from sklearn.decomposition import PCA
import pandas as pd
import plotly.express as px


In [2]:

# Search arXiv for papers matching "LLM".
# ArxivLoader supports all arguments of `ArxivAPIWrapper`.
loader = ArxivLoader(
    query="LLM",
    load_max_docs=50,
    top_k_results=20,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)

In [3]:
docs = loader.load()
len(docs)


20

# Splitter

In [4]:
# V1: pick an embedding model from the Hugging Face Hub and load its tokenizer,
# so that chunk sizes are measured in that model's tokens.
model_id = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=300,
    chunk_overlap=30,
    add_start_index=True,
    strip_whitespace=True,
    # separators=["\n\n", "\n", ".", " ", ""],
)
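
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding window over a token list. It is a simplification, not what `RecursiveCharacterTextSplitter` does: the real splitter cuts on separators first and then merges pieces up to the token budget. The function name `sliding_window_chunks` and the toy tokens are hypothetical and exist only for illustration.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Hypothetical toy input: ten placeholder tokens instead of real tokenizer output.
toy_tokens = [f"t{i}" for i in range(10)]
toy_chunks = sliding_window_chunks(toy_tokens, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

With `chunk_size=300` and `chunk_overlap=30` as configured above, each new window in this simplified model would advance by 270 tokens.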

In [5]:
# Inspect the attributes of the first loaded document (metadata and content).
vars(docs[0])

{'id': None,
 'metadata': {'Published': '2024-12-23',
  'Title': 'Trustworthy and Efficient LLMs Meet Databases',
  'Authors': 'Kyoungmin Kim, Anastasia Ailamaki',
  'Summary': 'In the rapidly evolving AI era with large language models (LLMs) at the core,\nmaking LLMs more trustworthy and efficient, especially in output generation\n(inference), has gained significant attention. This is to reduce plausible but\nfaulty LLM outputs (a.k.a hallucinations) and meet the highly increased\ninference demands. This tutorial explores such efforts and makes them\ntransparent to the database community. Understanding these efforts is essential\nin harnessing LLMs in database tasks and adapting database techniques to LLMs.\nFurthermore, we delve into the synergy between LLMs and databases, highlighting\nnew opportunities and challenges in their intersection. This tutorial aims to\nshare with database researchers and practitioners essential concepts and\nstrategies around LLMs, reduce the unfamiliarit

# Chunking

In [6]:
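final_chunks = []  # all chunks across all documents, in order
chunk_by_doc = {}  # document title -> list of that document's chunks
for doc in docs:
    doc_chunks = []
    # `split_text` returns plain strings, so `add_start_index` has no effect on this path.
    for i, chunk in enumerate(text_splitter.split_text(doc.page_content)):
        # Copy the parent metadata so every chunk keeps its provenance,
        # and record the chunk's position within the document.
        metadata = doc.metadata.copy()
        metadata["chunk_index"] = i
        doc_chunks.append(Document(page_content=chunk, metadata=metadata))
    # Documents that share a title (or lack one) overwrite each other in this dict.
    chunk_by_doc[doc.metadata.get("Title", "Document")] = doc_chunks
    final_chunks.extend(doc_chunks)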
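
Before storing anything, it can help to check that each chunk stays within the token budget and that `chunk_index` runs consecutively per document. The following is a minimal sketch under stated assumptions: `Chunk` is a hypothetical stand-in for LangChain's `Document`, `summarize_chunks` is an illustrative helper that is not part of the notebook, and whitespace splitting stands in for the real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_chunks(
    chunk_by_doc: Dict[str, List[Chunk]],
    count_tokens: Callable[[str], int],
) -> Dict[str, dict]:
    """Report chunk count, largest chunk size, and index consistency per title."""
    summary = {}
    for title, chunks in chunk_by_doc.items():
        lengths = [count_tokens(c.page_content) for c in chunks]
        indices = [c.metadata.get("chunk_index") for c in chunks]
        summary[title] = {
            "n_chunks": len(chunks),
            "max_tokens": max(lengths, default=0),
            # True when the indices are exactly 0, 1, 2, ... in order
            "indices_ok": indices == list(range(len(chunks))),
        }
    return summary


# Toy data, not taken from the notebook's arXiv results.
toy = {
    "Paper A": [
        Chunk("one two three", {"chunk_index": 0}),
        Chunk("four five", {"chunk_index": 1}),
    ],
    "Paper B": [Chunk("six", {"chunk_index": 0})],
}
toy_summary = summarize_chunks(toy, count_tokens=lambda text: len(text.split()))
print(toy_summary)
```

In the notebook itself, one would presumably pass `chunk_by_doc` directly and count tokens with something like `lambda text: len(tokenizer.tokenize(text))`, since `Document` exposes the same `page_content` and `metadata` attributes.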

# Storing

In [14]:
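# Step 3: Store in Qdrant Vector DB

# Embed with the same model whose tokenizer sized the chunks.
embedding = HuggingFaceEmbeddings(model_name=model_id)

# Assumes a Qdrant server is already listening on localhost:6333.
db = Qdrant.from_documents(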
    documents=final_chunks,
    embedding=embedding,
    location="localhost:6333",
    collection_name="LLM-papers all-mpnet-base-v2",
    prefer_grpc=False,
)

print("\n--- Stored in Qdrant Vector DB ---")
print(f"Collection: {db.collection_name}")


--- Stored in Qdrant Vector DB ---
Collection: LLM-papers all-mpnet-base-v2
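
Once the chunks are stored, retrieval amounts to finding the stored vectors closest to a query vector. The sketch below shows the idea as a brute-force cosine-similarity search in plain Python. It is an illustration only: it assumes cosine similarity as the metric, a real vector DB such as Qdrant uses an approximate index rather than scanning every vector, and the names `cosine_similarity` and `top_k` and the 2-dimensional toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k([1.0, 0.1], toy_vectors, k=2)
print(toy_hits)
```

In the notebook, the equivalent query would go through the vector store, for example with `db.similarity_search(query, k=...)`, which embeds the query text and returns the matching chunks with their metadata.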
