## Data Ingestion for Deep RAG

In this notebook, we'll load the extracted data into a Qdrant vector database:

- **Markdown**: Page-level chunks with metadata
- **Tables**: Separate documents with context and page numbers
- **Images**: Text descriptions embedded (generated in notebook 06-01b)
- **Hybrid Search**: Dense (semantic) + Sparse (keyword) embeddings

**Prerequisites:**
- Run notebook 06-01 first to extract PDFs
- Run notebook 06-01b to generate image descriptions
- Qdrant server running on localhost:6333
- Google API key set in .env file

**Output:**
- Single Qdrant collection with all content types
- Rich metadata for filtering (company, year, quarter, doc_type, page)
- Deduplication using file hashes

**Make sure your Qdrant Docker container is running before you continue.**

https://qdrant.tech/

| Feature          | **Qdrant** | **Chroma**       | **FAISS** | Weaviate     | Milvus | Pinecone |
| ---------------- | ---------- | ---------------- | --------- | ------------ | ------ | -------- |
| Open Source      | ✅ Yes      | ✅ Yes            | ✅ Yes     | ⚠️ Open-core | ✅ Yes  | ❌ No     |
| DB vs Library    | DB         | DB (dev-focused) | Library   | DB           | DB     | Managed  |
| Hybrid Search    | ✅ Native   | ❌                | ❌         | ✅            | ⚠️     | ✅        |
| Metadata Filter  | ✅ Strong   | ⚠️ Basic         | ❌         | ✅            | ✅      | ✅        |
| Production Ready | ✅ Yes      | ❌ (POC)          | ❌         | ✅            | ✅      | ✅        |
| Local / Offline  | ✅ Yes      | ✅ Yes            | ⚠️        | ⚠️           | ⚠️     | ❌        |


### 0. Qdrant API Setup

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

from qdrant_client import QdrantClient

qdrant_client = QdrantClient(
    url="https://d599160b-804b-42db-8d0d-c1dd093fa909.us-east4-0.gcp.cloud.qdrant.io:6333", 
    api_key = os.getenv("QDRANT_API_KEY")
)

print(qdrant_client.get_collections())

collections=[]


In [9]:
qdrant_client = QdrantClient(
    url="http://localhost:6333"
)

print(qdrant_client.get_collections())

collections=[CollectionDescription(name='financial_docs')]
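The two cells above differ only in the URL and the API key. A minimal sketch of picking both from the environment is shown here; `QDRANT_URL` is a hypothetical variable name (the notebook hard-codes the URL), while `QDRANT_API_KEY` is the one already read above.

```python
import os


def qdrant_connection_kwargs(env=None):
    """Build keyword arguments for QdrantClient from environment variables."""
    env = os.environ if env is None else env
    # QDRANT_URL is an assumed variable name; fall back to the local server.
    kwargs = {"url": env.get("QDRANT_URL", "http://localhost:6333")}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        # Only a cloud cluster needs the key; the local server runs without one.
        kwargs["api_key"] = api_key
    return kwargs


print(qdrant_connection_kwargs({}))
```

The resulting dictionary could then be unpacked as `QdrantClient(**qdrant_connection_kwargs())`, so the same cell works against both the local and the remote server.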


### 1. Setup and Imports

In [10]:
import hashlib
from pathlib import Path

from langchain_google_genai import GoogleGenerativeAIEmbeddings

from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse

from langchain_core.documents import Document
from qdrant_client import QdrantClient

### 2. Configuration

In [11]:
# Paths
MARKDOWN_DIR = "data/rag-data/markdown"
TABLES_DIR = "data/rag-data/tables"
IMAGES_DESC_DIR = "data/rag-data/images_desc"

# Qdrant Configuration
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "models/gemini-embedding-001"

### 3. Initialize Embeddings and Client

In [12]:
# Embeddings
embeddings = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL)
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

In [14]:
result = embeddings.embed_query('anything')
len(result)

3072

In [29]:
result = sparse_embeddings.embed_query('hi hello')
result

SparseVector(indices=[948991206, 613153351], values=[1.0, 1.0])

In [30]:
result = sparse_embeddings.embed_documents(['hi', 'hello'])
result

[SparseVector(indices=[948991206], values=[1.6877434821696136]),
 SparseVector(indices=[613153351], values=[1.6877434821696136])]
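The query vector above holds 1.0 for each token, while each document vector holds about 1.69. One plausible explanation is the BM25 term-frequency saturation applied on the document side only. The sketch below assumes the parameters `k=1.2`, `b=0.75` and an average document length of 256 tokens; these values are an assumption about the `Qdrant/bm25` model, not something the notebook states, and the sketch covers only the term-frequency part of BM25.

```python
def bm25_term_weight(tf, doc_len, k=1.2, b=0.75, avg_len=256.0):
    """Term-frequency part of BM25 for one token (no IDF factor)."""
    # Length normalisation: short documents get a boost, long ones a penalty.
    norm = 1 - b + b * doc_len / avg_len
    return tf * (k + 1) / (tf + k * norm)


# A one-token document such as 'hi': the token occurs once, document length is 1.
weight = bm25_term_weight(tf=1, doc_len=1)
print(weight)  # about 1.6877 under the assumed parameters
```

Under these assumptions the value agrees with the document weights printed above, which suggests why `embed_query` and `embed_documents` return different values for the same token.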

### 4. Create or Recreate Collection

In [None]:
# # Create the vector store on a remote Qdrant Cloud cluster
# vector_store = QdrantVectorStore.from_documents(
#     documents=[],
#     embedding=embeddings,
#     sparse_embedding=sparse_embeddings,
#     url="https://d599160b-804b-42db-8d0d-c1dd093fa909.us-east4-0.gcp.cloud.qdrant.io:6333", 
#     api_key = os.getenv("QDRANT_API_KEY"),
#     collection_name = COLLECTION_NAME,
#     retrieval_mode=RetrievalMode.HYBRID,
#     force_recreate=False
# )

In [97]:
# Create the vector store on the local Qdrant server
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    url="http://localhost:6333", 
    collection_name = COLLECTION_NAME,
    retrieval_mode=RetrievalMode.HYBRID,
    force_recreate=False
)

In [98]:
vector_store.client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='financial_docs')])

### 5. Helper Functions

In [48]:
def extract_metadata_from_filename(filename: str):
    """
    Extract metadata from filename.
    
    Expected format: CompanyName DocType [Quarter] Year.pdf
    Examples:
        - Amazon 10-Q Q1 2024.pdf
        - Microsoft 10-K 2023.pdf
    """

    filename = filename.replace('.pdf', '').replace('.md', '')
    parts = filename.split()

    return {
        'company_name': parts[0],
        'doc_type': parts[1],
        'fiscal_quarter': parts[2] if len(parts)==4 else None,
        'fiscal_year': parts[-1]
    }

extract_metadata_from_filename('apple 10-k 2023.md')

{'company_name': 'apple',
 'doc_type': '10-k',
 'fiscal_quarter': None,
 'fiscal_year': '2023'}
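Splitting on whitespace works for one-word company names, but a name such as "Bank of America" would put "of" into `doc_type`. A regex-based variant is sketched below. `parse_filing_name` is a hypothetical helper, and restricting `doc_type` to 10-K and 10-Q is an assumption based on the two examples in the docstring.

```python
import re

# Assumed layout: "<company> <10-K|10-Q> [Q1-Q4] <year>", case-insensitive.
FILENAME_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+(?P<doc_type>10-[KQ])"
    r"(?:\s+(?P<quarter>Q[1-4]))?\s+(?P<year>\d{4})$",
    re.IGNORECASE,
)


def parse_filing_name(filename):
    """Parse a filing name into metadata; raise ValueError if it does not match."""
    stem = re.sub(r"\.(pdf|md)$", "", filename, flags=re.IGNORECASE).strip()
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return {
        "company_name": match.group("company"),
        "doc_type": match.group("doc_type"),
        "fiscal_quarter": match.group("quarter"),  # None for annual reports
        "fiscal_year": match.group("year"),
    }


print(parse_filing_name("Bank of America 10-K 2023.md"))
```

Raising on an unexpected name makes a misnamed file visible at ingestion time instead of storing wrong metadata silently.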

In [49]:
def compute_file_hash(file_path: Path):
    """Return the SHA-256 hex digest of a file, read in 4 KB blocks."""

    sha256_hash = hashlib.sha256()

    with open(file_path, 'rb') as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)

    return sha256_hash.hexdigest()


In [50]:
compute_file_hash(Path(r'data\rag-data\markdown\amazon\amazon 10-k 2023.md'))

'05f2d434b6eee52a5bbb4155a78068b2eda1eeda86b7af55335beb0634ac0398'

In [51]:
compute_file_hash(Path(r'data\rag-data\markdown\amazon\amazon 10-k 2023 copy.md'))

'05f2d434b6eee52a5bbb4155a78068b2eda1eeda86b7af55335beb0634ac0398'
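Both files produce the same digest because the hash depends only on the bytes, not on the file name. The self-contained sketch below reproduces that behaviour with temporary files; the file names and contents are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=4096):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.md"
    copy = Path(tmp) / "report copy.md"
    edited = Path(tmp) / "report edited.md"
    original.write_bytes(b"Total revenue: 100\n")
    copy.write_bytes(b"Total revenue: 100\n")    # same bytes, different name
    edited.write_bytes(b"Total revenue: 101\n")  # one character changed

    same = sha256_of_file(original) == sha256_of_file(copy)
    changed = sha256_of_file(original) != sha256_of_file(edited)

print(same, changed)
```

A renamed copy is therefore detected as a duplicate, while any change to the content yields a new hash and triggers re-ingestion.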

In [69]:
# Fetch the first page of ingested points (up to 1,000)
all_points = vector_store.client.scroll(
    collection_name=COLLECTION_NAME,
    limit=1_000,
    with_payload=True,
    offset=None
)

In [70]:
all_points[1]


'5d635eb5-97bc-422b-ad3f-867f9ecb04ef'

In [60]:
all_points[0][0].payload['metadata']['file_hash']

'455276692b26d8b0d04bcd2eceab403e885b3cc2ec92991b608649bab0956488'

In [None]:
def get_processed_hashes():
    """Return the set of file hashes already stored in the collection."""
    
    processed_hashes = set()
    offset = None

    while True:
        points, offset = vector_store.client.scroll(
                            collection_name=COLLECTION_NAME,
                            limit=10_000,
                            with_payload=True,
                            offset=offset
                        )

        if not points:
            break
        
        processed_hashes.update(
            point.payload.get("metadata", {}).get("file_hash")
            for point in points
            if point.payload.get("metadata", {}).get("file_hash") is not None
        )

        if offset is None:
            break

    return processed_hashes

In [72]:
processed_hashes = get_processed_hashes()

In [74]:
len(processed_hashes)

1098
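The loop in `get_processed_hashes` relies on `scroll` returning a pair: the current page of points and the offset of the next page, which is `None` on the last page. The sketch below exercises the same loop against `FakeScrollClient`, a hypothetical in-memory stand-in that only mimics this contract, so it runs without a Qdrant server.

```python
from types import SimpleNamespace


class FakeScrollClient:
    """Hypothetical stand-in that mimics the (points, next_offset) result of scroll."""

    def __init__(self, payloads):
        self._points = [SimpleNamespace(id=i, payload=p) for i, p in enumerate(payloads)]

    def scroll(self, collection_name, limit, with_payload=True, offset=None):
        start = 0 if offset is None else offset
        page = self._points[start:start + limit]
        # Signal the last page with None, as the loop expects.
        next_offset = start + limit if start + limit < len(self._points) else None
        return page, next_offset


def collect_hashes(client, collection_name, page_size=2):
    """Walk every page and collect the distinct file hashes."""
    hashes = set()
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            with_payload=True,
            offset=offset,
        )
        for point in points:
            file_hash = (point.payload or {}).get("metadata", {}).get("file_hash")
            if file_hash is not None:
                hashes.add(file_hash)
        if offset is None:
            break
    return hashes


client = FakeScrollClient([
    {"metadata": {"file_hash": "aaa"}},
    {"metadata": {"file_hash": "aaa"}},  # second page of the same file
    {"metadata": {"file_hash": "bbb"}},
    {"metadata": {}},                    # point without a hash is skipped
    {"metadata": {"file_hash": "ccc"}},
])
hashes = collect_hashes(client, "financial_docs")
print(sorted(hashes))
```

Because every page of a markdown file carries the same `file_hash`, the set ends up with one entry per file rather than one per point.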

In [84]:
# extract the page number from the file path
import re

def extract_page_number(file_path: Path):
    pattern = r'page_(\d+)'
    match = re.search(pattern=pattern, string=file_path.stem)
    return int(match.group(1)) if match else None

In [86]:
file_path = Path(r'data\rag-data\images_desc\google\google 10-k 2023\page_28.md')
extract_page_number(file_path)

28

### 6. Ingestion Function

In [104]:
def ingest_file_in_db(file_path, processed_hashes):
    """Ingest one .md file into the vector store, skipping files whose hash is already known."""

    file_hash = compute_file_hash(file_path)
    if file_hash in processed_hashes:
        print(f"Skipping already ingested file: {file_path}")
        return

    path_str = str(file_path)
    if 'markdown' in path_str:
        content_type = 'text'
        doc_name = file_path.name
    elif 'tables' in path_str:
        content_type = 'tables'
        doc_name = file_path.parent.name
    elif 'images_desc' in path_str:
        content_type = 'image'
        doc_name = file_path.parent.name
    else:
        content_type = 'unknown'
        doc_name = file_path.name

    content = file_path.read_text(encoding='utf-8')

    base_metadata = extract_metadata_from_filename(doc_name)

    base_metadata.update({
        'content_type': content_type,
        'file_hash': file_hash,
        'source_file': doc_name
    })

    if content_type == 'text':
        # Markdown: split on the page-break marker and store one document per page
        pages = content.split('<!-- page break -->')
        documents = []
        for idx, page in enumerate(pages, start=1):
            metadata = base_metadata.copy()
            metadata.update({'page': idx})
            documents.append(Document(page_content=page, metadata=metadata))

        vector_store.add_documents(documents)

    else:
        # Tables and image descriptions: one document per file, page number taken from the file name
        page_num = extract_page_number(file_path)
        metadata = base_metadata.copy()
        metadata.update({'page': page_num})
        documents = [Document(page_content=content, metadata=metadata)]

        vector_store.add_documents(documents)


    processed_hashes.add(file_hash)
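Splitting on `<!-- page break -->` can yield empty strings, for example for a blank page in the PDF, and embedding an empty page adds nothing to the index. A possible refinement is sketched below: `split_pages` is a hypothetical helper that drops blank pages while keeping the original page numbers.

```python
PAGE_BREAK = "<!-- page break -->"


def split_pages(content, marker=PAGE_BREAK):
    """Return (page_number, text) pairs, skipping pages that are empty after stripping."""
    pages = []
    for page_number, text in enumerate(content.split(marker), start=1):
        text = text.strip()
        if text:
            # Keep the position-based number so citations still match the PDF.
            pages.append((page_number, text))
    return pages


sample = "Cover page\n<!-- page break -->\n\n<!-- page break -->\nBalance sheet"
print(split_pages(sample))
```

In this sample the second page is blank, so only pages 1 and 3 are returned and the page numbers stay aligned with the source document.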


In [None]:
file_path = Path(r'data\rag-data\markdown\amazon\amazon 10-k 2023.md')
processed_hashes = get_processed_hashes()  # always returns a set (empty if nothing is ingested yet)

ingest_file_in_db(file_path, processed_hashes)

In [None]:
from tqdm import tqdm

base_path = Path('data/rag-data')
all_md_files = list(base_path.rglob("*.md"))

processed_hashes = get_processed_hashes()  # always returns a set (empty if nothing is ingested yet)

for md_file in tqdm(all_md_files):
    ingest_file_in_db(md_file, processed_hashes)
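The hash check prevents duplicates only while `processed_hashes` is up to date. A complementary safeguard is to give every page a deterministic point ID, so that ingesting the same page twice overwrites the existing point instead of adding a new one. The sketch below derives such an ID with `uuid.uuid5`; `page_point_id` is a hypothetical helper, and passing the result through an `ids=` argument of `add_documents` is an assumption to check against the installed `langchain-qdrant` version.

```python
import uuid


def page_point_id(file_hash, page):
    """Derive a stable UUID string from a file hash and a page number."""
    # uuid5 is deterministic: the same namespace and name always give the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_hash}:{page}"))


first = page_point_id("05f2d434", 1)
again = page_point_id("05f2d434", 1)
other = page_point_id("05f2d434", 2)
print(first == again, first == other)
```

The same input always maps to the same ID, while a different page of the same file gets a different one.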

### 7. Verify Ingestion

In [108]:
collection_info = vector_store.client.get_collection(COLLECTION_NAME)
collection_info



### 8. Test Search

In [111]:
query = "what is the tesla's revenue"
results = vector_store.similarity_search(query)
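Nothing in this search call restricts the payload, so a query that names one company can still return chunks from another. The metadata stored during ingestion allows narrowing the search. The sketch below builds a filter in the JSON shape used by the Qdrant REST API; `build_metadata_filter` is a hypothetical helper, and the `metadata.` key prefix follows the payload layout seen earlier in `payload['metadata']['file_hash']`.

```python
def build_metadata_filter(**conditions):
    """Build a Qdrant-style filter that requires every given metadata field to match."""
    return {
        "must": [
            {"key": f"metadata.{field}", "match": {"value": value}}
            for field, value in conditions.items()
            if value is not None  # skip fields the caller left unset
        ]
    }


filter_dict = build_metadata_filter(company_name="meta", fiscal_year="2024", fiscal_quarter=None)
print(filter_dict)
```

As an assumption to verify, the dictionary could be converted with `qdrant_client.models.Filter(**filter_dict)` and passed as `filter=` to `vector_store.similarity_search`.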

In [112]:
results

[Document(metadata={'company_name': 'meta', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': '2024', 'content_type': 'tables', 'file_hash': '459a6644aa4ab684fc242f5492e2438d9594865aa9a562d702d7ecebb724de03', 'source_file': 'meta 10-k 2024', 'page': 101, '_id': '663bb7f2-6366-417c-b672-4b50f3ee6b3c', '_collection_name': 'financial_docs'}, page_content='**Page:** 101\n\n| Total revenue  | $ 164,501                 | $ 134,902                 | $ 116,609                 |\nRevenue disaggregated by geography, based on the addresses of our customers, consists of the following (in millions):\n\n|                              | Year Ended December 31,   | Year Ended December 31,   | Year Ended December 31,   |\n|------------------------------|---------------------------|---------------------------|---------------------------|\n|                              | 2024                      | 2023                      | 2022                      |\n| United States and Canada (1) | $ 63,20