## Data Ingestion for Deep RAG

In this notebook, we'll load the extracted data into a Qdrant vector database:

- **Markdown**: Page-level chunks with metadata
- **Tables**: Separate documents with context and page numbers
- **Images**: Text descriptions embedded (generated in notebook 06-01b)
- **Hybrid Search**: Dense (semantic) + Sparse (keyword) embeddings

**Prerequisites:**
- Run notebook 06-01 first to extract PDFs
- Run notebook 06-01b to generate image descriptions
- Qdrant server running on localhost:6333
- Google API key (`GOOGLE_API_KEY`) set in the `.env` file

**Output:**
- Single Qdrant collection with all content types
- Rich metadata for filtering (`company_name`, `fiscal_year`, `fiscal_quarter`, `doc_type`, `content_type`, `page`)
- Deduplication using SHA-256 file hashes

### 1. Setup and Imports

In [1]:
from dotenv import load_dotenv
load_dotenv()

import hashlib
from pathlib import Path

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse
from langchain_core.documents import Document
from qdrant_client import QdrantClient

### 2. Configuration

In [2]:
# Paths
MARKDOWN_DIR = "data/rag-data/markdown"
TABLES_DIR = "data/rag-data/tables"
IMAGES_DESC_DIR = "data/rag-data/images_desc"

# Qdrant Configuration
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "models/gemini-embedding-001"

### 3. Initialize Embeddings and Client

In [None]:
# Embeddings
embeddings = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL)
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

# Qdrant client
qdrant_client = QdrantClient(url="http://localhost:6333")

print("✓ Embeddings and client initialized")

✓ Embeddings and client initialized


### 4. Create or Recreate Collection

In [4]:
# # Delete collection if exists
# if qdrant_client.collection_exists(COLLECTION_NAME):
#     qdrant_client.delete_collection(COLLECTION_NAME)
#     print(f"✓ Deleted existing collection: {COLLECTION_NAME}")

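# Create vector store
# NOTE: force_recreate=True drops and rebuilds the collection on every run,
# so previously ingested data (and its file hashes) is discarded.
# Set it to False to keep existing data and rely on hash-based deduplication.
vector_store = QdrantVectorStore.from_documents(
    documents=[],  # Empty initialization
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    url="http://localhost:6333",
    collection_name=COLLECTION_NAME,
    retrieval_mode=RetrievalMode.HYBRID,
    force_recreate=True
)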

print(f"✓ Created collection: {COLLECTION_NAME}")

✓ Created collection: financial_docs
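
Hybrid mode means each query is answered by two rankings, one from the dense vectors and one from the sparse BM25 vectors, which are then merged into a single list. A common way to merge ranked lists is reciprocal rank fusion (RRF). The sketch below is a plain-Python illustration of that idea, not the code Qdrant runs: the function name `reciprocal_rank_fusion`, the constant `k=60`, and the toy document ids are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked id lists; k=60 is an assumed, commonly used constant."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings (hypothetical ids): one from dense search, one from sparse search
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Ids that appear in both lists collect two contributions, which is why `a` and `c` end up ahead of `b` and `d`.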


In [18]:
vector_store.client

<qdrant_client.qdrant_client.QdrantClient at 0x1d82d56fc50>

### 5. Helper Functions

In [5]:
def extract_metadata_from_filename(filename: str) -> dict:
    """
    Extract metadata from filename.
    
    Expected format: CompanyName DocType [Quarter] Year.md
    Examples:
        - Amazon 10-K 2024.md
        - Amazon 10-Q Q1 2024.md
    """
    name = filename.replace('.md', '').replace('.pdf', '')
    parts = name.split()
    
    return {
        'company_name': parts[0],
        'doc_type': parts[1],
        'fiscal_quarter': parts[2] if len(parts) == 4 else None,
        'fiscal_year': int(parts[-1])
    }
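
The function above trusts the filename layout: a name that does not follow it can silently produce wrong fields or fail with an error that is hard to trace back. As a possible safeguard, here is a minimal sketch of a stricter parser. The name `parse_filing_name` and the regular expression are hypothetical, derived only from the "CompanyName DocType [Quarter] Year" format in the docstring, and may need adjusting for other naming schemes.

```python
import re

# Assumed pattern for "CompanyName DocType [Quarter] Year"; adjust to your naming
FILENAME_PATTERN = re.compile(
    r"^(?P<company>\S+) (?P<doc_type>\S+)(?: (?P<quarter>[Qq][1-4]))? (?P<year>\d{4})$"
)


def parse_filing_name(filename: str) -> dict:
    """Hypothetical strict variant: raises instead of guessing on odd names."""
    # Drop a trailing .md or .pdf extension, if present
    name = re.sub(r"\.(md|pdf)$", "", filename)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return {
        "company_name": match["company"],
        "doc_type": match["doc_type"],
        "fiscal_quarter": match["quarter"],  # None when the quarter is absent
        "fiscal_year": int(match["year"]),
    }


print(parse_filing_name("Amazon 10-K 2024.md"))
print(parse_filing_name("Amazon 10-Q Q1 2024.md"))
```

Raising early on an unexpected name keeps malformed metadata out of the collection, at the cost of stopping the run until the file is renamed.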

In [6]:
def compute_file_hash(file_path: Path) -> str:
    """Compute SHA-256 hash for deduplication."""
    sha256_hash = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()
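
Reading the file in 4096-byte blocks keeps memory use flat for large files without changing the result. The sketch below checks that claim on a temporary file: the helper name `chunked_sha256` and the sample content are made up for the illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def chunked_sha256(file_path: Path, block_size: int = 4096) -> str:
    """Hash a file block by block, mirroring the loop used above."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Sample content (hypothetical), larger than one block so several reads occur
content = b"page one\n<!-- page break -->\npage two\n" * 1000

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.md"
    path.write_bytes(content)
    streamed = chunked_sha256(path)

# Hashing everything at once should give the same digest
one_shot = hashlib.sha256(content).hexdigest()
print(streamed == one_shot)
```

Because the digest depends only on the bytes, two files with identical content share a hash regardless of their names, and the second one is skipped during ingestion.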

In [38]:
all_points = vector_store.client.scroll(
    collection_name=COLLECTION_NAME,
    limit=10000,
    with_payload=True
)

# scroll() returns (points, next_page_offset); inspect the first point's hash
all_points[0][0].payload['metadata']['file_hash']

'455276692b26d8b0d04bcd2eceab403e885b3cc2ec92991b608649bab0956488'

In [41]:
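def get_processed_hashes() -> set:
    """Get file hashes already in Qdrant."""
    hashes = set()
    offset = None

    # A single scroll() call returns at most `limit` points, so page through
    # the collection until no next-page offset is returned
    while True:
        points, offset = vector_store.client.scroll(
            collection_name=COLLECTION_NAME,
            limit=10000,
            offset=offset,
            with_payload=True
        )
        for point in points:
            file_hash = point.payload['metadata'].get('file_hash')
            if file_hash:  # Skip points stored without a hash
                hashes.add(file_hash)
        if offset is None:
            break

    print(f"Already processed: {len(hashes)} files")
    return hashes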

hashes = get_processed_hashes()

Already processed: 1098 files


### 6. Ingestion Functions

In [8]:
def ingest_markdown_file(md_path: Path, processed_hashes: set):
    """Ingest markdown file split by pages."""
    file_hash = compute_file_hash(md_path)
    if file_hash in processed_hashes:
        print(f"  [SKIP] {md_path.name}")
        return 0
    
    # Read and split by page breaks
    markdown_text = md_path.read_text(encoding='utf-8')
    pages = markdown_text.split("<!-- page break -->")
    
    # Get metadata from filename
    file_metadata = extract_metadata_from_filename(md_path.name)
    
    # Create documents for each page
    documents = []
    for page_num, page_text in enumerate(pages, start=1):
        page_content = page_text.strip()
        if page_content:
            metadata = file_metadata.copy()
            metadata['content_type'] = 'text'
            metadata['page'] = page_num
            metadata['file_hash'] = file_hash
            metadata['source_file'] = md_path.name
            
            documents.append(Document(page_content=page_content, metadata=metadata))
    
    # Add to vector store
    if documents:
        vector_store.add_documents(documents)
        processed_hashes.add(file_hash)
        print(f"  ✓ {md_path.name} ({len(documents)} pages)")
    
    return len(documents)
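
The page-splitting step is easy to check in isolation. In this minimal sketch, the helper `split_pages` and the sample text are hypothetical; the marker string and the numbering rule follow the function above. Note that page numbers are assigned before empty pages are dropped, so the numbering can have gaps.

```python
from typing import List, Tuple

PAGE_BREAK = "<!-- page break -->"


def split_pages(markdown_text: str) -> List[Tuple[int, str]]:
    """Return (page_number, content) pairs, skipping pages with no text."""
    pages = []
    for page_num, page_text in enumerate(markdown_text.split(PAGE_BREAK), start=1):
        content = page_text.strip()
        if content:
            pages.append((page_num, content))
    return pages


# Hypothetical three-page document whose second page is blank
sample = "Cover page<!-- page break -->\n\n<!-- page break -->Item 1. Business"
print(split_pages(sample))
```

Keeping the original position as the page number means the `page` metadata still points at the right page of the source document even when blank pages are skipped.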

In [9]:
def ingest_table_file(table_path: Path, doc_name: str, processed_hashes: set):
    """Ingest a single table file."""
    file_hash = compute_file_hash(table_path)
    if file_hash in processed_hashes:
        return 0
    
    # Read table content
    table_content = table_path.read_text(encoding='utf-8')
    
    # Extract metadata from filename
    file_metadata = extract_metadata_from_filename(doc_name + '.md')
    
    # Extract table number and page number from filename
    stem = table_path.stem
    parts = stem.split('_')
    table_num = int(parts[1])
    page_num = int(parts[3]) if len(parts) >= 4 else None
    
    # Create metadata
    metadata = file_metadata.copy()
    metadata['content_type'] = 'table'
    metadata['table_number'] = table_num
    metadata['page'] = page_num
    metadata['file_hash'] = file_hash
    metadata['source_file'] = table_path.name
    
    # Create document and add to vector store
    doc = Document(page_content=table_content, metadata=metadata)
    vector_store.add_documents([doc])
    processed_hashes.add(file_hash)
    
    return 1
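
The table and page numbers come from the file stem by position. The sketch below isolates that parsing so it can be tried on sample names. The helper `parse_table_stem` is hypothetical, and the `table_<n>_page_<m>` naming is an assumption inferred from the indices used above, not something the extraction notebook is confirmed to produce.

```python
from typing import Optional, Tuple


def parse_table_stem(stem: str) -> Tuple[int, Optional[int]]:
    """Split a stem on underscores and read the numbers by position."""
    parts = stem.split("_")
    table_num = int(parts[1])
    # The page number is optional: only read it when a fourth part exists
    page_num = int(parts[3]) if len(parts) >= 4 else None
    return table_num, page_num


# Assumed example names
print(parse_table_stem("table_3_page_12"))
print(parse_table_stem("table_7"))
```

A stem whose second part is not a number raises `ValueError` from `int()`, which surfaces a misnamed file immediately rather than storing incorrect metadata.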

In [10]:
def ingest_image_description(desc_path: Path, doc_name: str, processed_hashes: set):
    """Ingest image description file."""
    file_hash = compute_file_hash(desc_path)
    if file_hash in processed_hashes:
        return 0
    
    # Read description
    description = desc_path.read_text(encoding='utf-8')
    
    # Extract metadata from filename
    file_metadata = extract_metadata_from_filename(doc_name + '.md')
    
    # Extract page number from filename (page_5.md)
    page_num = int(desc_path.stem.split('_')[1])
    
    # Create metadata
    metadata = file_metadata.copy()
    metadata['content_type'] = 'image'
    metadata['page'] = page_num
    metadata['file_hash'] = file_hash
    metadata['source_file'] = desc_path.name
    
    # Create document and add to vector store
    doc = Document(page_content=description, metadata=metadata)
    vector_store.add_documents([doc])
    processed_hashes.add(file_hash)
    
    return 1

In [11]:
def ingest_company_tables(company_dir: Path, processed_hashes: set) -> int:
    """Ingest all tables for a company."""
    table_count = 0
    
    for doc_dir in company_dir.iterdir():
        if doc_dir.is_dir():
            for table_file in doc_dir.glob("table_*.md"):
                table_count += ingest_table_file(table_file, doc_dir.name, processed_hashes)
    
    return table_count

In [12]:
def ingest_company_image_descriptions(company_dir: Path, processed_hashes: set) -> int:
    """Ingest all image descriptions for a company."""
    desc_count = 0
    
    for doc_dir in company_dir.iterdir():
        if doc_dir.is_dir():
            for desc_file in doc_dir.glob("page_*.md"):
                desc_count += ingest_image_description(desc_file, doc_dir.name, processed_hashes)
    
    return desc_count

### 7. Process All Data

In [13]:
# Get already processed files
processed_hashes = get_processed_hashes()

# Process markdown files
print("\n=== Ingesting Markdown Files ===")
markdown_path = Path(MARKDOWN_DIR)
md_files = list(markdown_path.rglob("*.md"))
print(f"Found {len(md_files)} markdown files\n")

total_pages = 0
for idx, md_path in enumerate(md_files, 1):
    print(f"[{idx}/{len(md_files)}]", end=" ")
    total_pages += ingest_markdown_file(md_path, processed_hashes)

print(f"\nTotal pages ingested: {total_pages}")

Already processed: 0 files

=== Ingesting Markdown Files ===
Found 28 markdown files

[1/28]   ✓ amazon 10-k 2023.md (93 pages)
[2/28]   ✓ amazon 10-k 2024.md (88 pages)
[3/28]   ✓ amazon 10-q q1 2024.md (52 pages)
[4/28]   ✓ amazon 10-q q1 2025.md (50 pages)
[5/28]   ✓ amazon 10-q q2 2024.md (51 pages)
[6/28]   ✓ amazon 10-q q2 2025.md (51 pages)
[7/28]   ✓ amazon 10-q q3 2024.md (147 pages)
[8/28]   ✓ apple 10-k 2023.md (79 pages)
[9/28]   ✓ apple 10-k 2024.md (120 pages)
[10/28]   ✓ apple 10-q q1 2024.md (27 pages)
[11/28]   ✓ apple 10-q q2 2024.md (27 pages)
[12/28]   ✓ apple 10-q q4 2023.md (27 pages)
[13/28]   ✓ apple 8-k q4 2023.md (8 pages)
[14/28]   ✓ google 10-k 2023.md (110 pages)
[15/28]   ✓ google 10-k 2024.md (107 pages)
[16/28]   ✓ google 10-q q1 2025.md (52 pages)
[17/28]   ✓ google 10-q q2 2024.md (55 pages)
[18/28]   ✓ google 10-q q2 2025.md (59 pages)
[19/28]   ✓ google 10-q q3 2024.md (56 pages)
[20/28]   ✓ meta 10-k 2024.md (149 pages)
[21/28]   ✓ meta 10-q q1 2024

In [14]:
# Process tables
print("\n=== Ingesting Tables ===")
tables_path = Path(TABLES_DIR)
company_dirs = [d for d in tables_path.iterdir() if d.is_dir()]
print(f"Found {len(company_dirs)} companies\n")

total_tables = 0
for idx, company_dir in enumerate(company_dirs, 1):
    print(f"[{idx}/{len(company_dirs)}] {company_dir.name}...", end=" ")
    count = ingest_company_tables(company_dir, processed_hashes)
    total_tables += count
    print(f"✓ {count} tables")

print(f"\nTotal tables ingested: {total_tables}")


=== Ingesting Tables ===
Found 5 companies

[1/5] amazon... ✓ 299 tables
[2/5] apple... ✓ 187 tables
[3/5] google... ✓ 372 tables
[4/5] meta... ✓ 85 tables
[5/5] meta10-k... ✓ 67 tables

Total tables ingested: 1010


In [15]:
# Process image descriptions
print("\n=== Ingesting Image Descriptions ===")
images_desc_path = Path(IMAGES_DESC_DIR)
company_dirs = [d for d in images_desc_path.iterdir() if d.is_dir()]
print(f"Found {len(company_dirs)} companies\n")

total_images = 0
for idx, company_dir in enumerate(company_dirs, 1):
    print(f"[{idx}/{len(company_dirs)}] {company_dir.name}...", end=" ")
    count = ingest_company_image_descriptions(company_dir, processed_hashes)
    total_images += count
    print(f"✓ {count} images")

print(f"\nTotal image descriptions ingested: {total_images}")


=== Ingesting Image Descriptions ===
Found 5 companies

[1/5] amazon... ✓ 0 images
[2/5] apple... ✓ 0 images
[3/5] google... ✓ 0 images
[4/5] meta... ✓ 60 images
[5/5] meta10-k... ✓ 0 images

Total image descriptions ingested: 60


### 8. Verify Ingestion

In [23]:
collection_info = vector_store.client.get_collection(COLLECTION_NAME)
collection_info



### 9. Test Search

In [24]:
# Test hybrid search
query = "What is Amazon's revenue?"
results = vector_store.similarity_search(query, k=5)

results

[Document(metadata={'company_name': 'amazon', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': 2023, 'content_type': 'text', 'page': 69, 'file_hash': '05f2d434b6eee52a5bbb4155a78068b2eda1eeda86b7af55335beb0634ac0398', 'source_file': 'amazon 10-k 2023.md', '_id': 'f3bd1773-1487-495f-9534-f979e27a3606', '_collection_name': 'financial_docs'}, page_content="Net sales by groups of similar products and services, which also have similar economic characteristics, is as follows (in millions):\n\n|                                 | Year Ended December 31,   | Year Ended December 31,   | Year Ended December 31,   |\n|---------------------------------|---------------------------|---------------------------|---------------------------|\n|                                 | 2021                      | 2022                      | 2023                      |\n| Net Sales:                      |                           |                           |                           |\n| Online store