## Advanced RAG - Data Ingestion Pipeline
### Load Extracted Data into Qdrant with Multimodal Embeddings

**Learning Objectives:**
- Load the markdown, tables, and images extracted in the 06-01 notebook
- Create hybrid embeddings (dense + sparse)
- Store in Qdrant with rich metadata
- Support multimodal search (text + images)

**Prerequisites:**
- Run the 06-01 notebook first to extract the PDFs into markdown, images, and tables
- Qdrant server running on localhost:6333

**What This Notebook Does:**
1. Load markdown files (split by page breaks)
2. Load tables with context
3. Load images with multimodal embeddings
4. Store everything in a single Qdrant collection, tagged with `content_type` metadata
5. Enable hybrid retrieval with deduplication

### Setup and Configuration

In [None]:
from dotenv import load_dotenv
load_dotenv()

import hashlib
from pathlib import Path

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse
from langchain_core.documents import Document

In [None]:
# Configuration
MARKDOWN_DIR = "data/rag-data/rag-markdown"
TABLES_DIR = "data/rag-data/rag-tables"
IMAGES_DIR = "data/rag-data/rag-images"
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "multimodalembedding@001"  # Vertex AI Multimodal

### Initialize Vertex AI Embeddings, BM25, and Qdrant

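**Hybrid Retrieval**: Combines dense (semantic) vectors with sparse (keyword, BM25) vectors, so a query can match on meaning as well as on exact terms. The two result lists are fused into a single ranking.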
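
To make the fusion step concrete, here is a minimal plain-Python sketch of Reciprocal Rank Fusion (RRF), the method named in the search cell later in this notebook. It is an illustration only, not Qdrant's internal code: the function name, the sample document IDs, and the constant `k = 60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the dense and sparse retrievers
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists (`doc_b`, `doc_a`) end up ahead of documents that appear in only one, which is the behaviour hybrid retrieval relies on.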

In [None]:
# Multimodal embeddings (Vertex AI) - works for text AND images
embeddings = VertexAIEmbeddings(model_name=EMBEDDING_MODEL)

# Sparse embeddings (BM25)
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

# Initialize vector store with hybrid retrieval
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    collection_name=COLLECTION_NAME,
    url="http://localhost:6333",
    retrieval_mode=RetrievalMode.HYBRID,
    force_recreate=True,  # drops any existing collection with this name on every run
)

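### Metadata Extraction from Filename

In [None]:
def extract_metadata_from_filename(filename: str) -> dict:
    """Extract metadata from a filename such as 'amazon 10-q q1 2024.md'.

    Expected pattern: '<company> <doc_type> [<quarter>] <year>.<ext>'.
    """
    # Drop any extension (.md, .pdf, ...) before splitting on whitespace
    name = Path(filename).stem
    parts = name.split()

    metadata = {}
    metadata['company_name'] = parts[0]
    metadata['doc_type'] = parts[1]

    if len(parts) == 4:
        # Quarterly filing: company, doc type, quarter, year
        metadata['fiscal_quarter'] = parts[2]
        metadata['fiscal_year'] = int(parts[3])
    else:
        # Annual filing: company, doc type, year
        metadata['fiscal_quarter'] = None
        metadata['fiscal_year'] = int(parts[2])

    return metadata

In [None]:
extract_metadata_from_filename('amazon 10-k 2023.pdf')

### Extract Text from PDF Pages

In [None]:
# NOTE: extract_pdf_pages is not defined in this notebook; define or import it before running this cell
pages = extract_pdf_pages('data/rag-data/amazon/amazon 10-q q1 2024.pdf')
print(f"Total pages: {len(pages)}")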

### Track Processed Files

### Document Ingestion Pipeline

### Process All Markdown Files
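
As a rough sketch of this step, the snippet below splits one markdown string into per-page records that carry `content_type` metadata. Several details are assumptions, not something this notebook confirms: the page-break marker `<!-- page break -->`, the helper name `split_markdown_pages`, the `source_file` and `page` keys, and the `"text"` value for `content_type`. Plain dicts stand in for `Document` objects so the example runs without any external services.

```python
from pathlib import Path

# Assumed page-break marker; use whatever the 06-01 extraction actually wrote
PAGE_BREAK = "<!-- page break -->"


def split_markdown_pages(text: str, source_file: str, marker: str = PAGE_BREAK) -> list[dict]:
    """Split markdown text on a page-break marker into per-page records."""
    records = []
    for page_number, page in enumerate(text.split(marker), start=1):
        content = page.strip()
        if not content:
            continue  # skip empty pages but keep the original page numbering
        records.append({
            "page_content": content,
            "metadata": {
                "content_type": "text",
                "source_file": Path(source_file).name,
                "page": page_number,
            },
        })
    return records


sample = (
    "# Cover\n\nIntro text\n"
    "<!-- page break -->\n\n"
    "<!-- page break -->\n## Balance Sheet\n\nTotal assets"
)
records = split_markdown_pages(sample, "data/rag-data/rag-markdown/amazon 10-k 2023.md")
for record in records:
    print(record["metadata"]["page"], record["page_content"][:20])
```

Each record maps directly onto `Document(page_content=..., metadata=...)`, so the resulting list could be passed to `vector_store.add_documents(...)` once converted.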

### Helper Functions

In [None]:
def compute_file_hash(file_path: str) -> str:
    """Compute SHA-256 hash of file content for deduplication."""
    sha256_hash = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()
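
One way a content hash like this could be used to skip files that were already ingested is sketched below. It is a hypothetical approach, not the notebook's actual tracking code: the JSON manifest file `processed_files.json` and the helpers `load_manifest`, `files_to_ingest`, and `mark_processed` are invented for illustration, and the hash function is repeated so the block runs on its own.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(file_path: Path) -> str:
    """SHA-256 of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(4096), b""):
            digest.update(block)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> dict:
    """Return the {file_hash: file_name} manifest, or an empty one."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())
    return {}


def files_to_ingest(directory: Path, manifest: dict) -> list[Path]:
    """List markdown files whose content hash is not in the manifest yet."""
    return [p for p in sorted(directory.glob("*.md")) if file_sha256(p) not in manifest]


def mark_processed(file_path: Path, manifest: dict, manifest_path: Path) -> None:
    """Record a file as ingested and persist the manifest."""
    manifest[file_sha256(file_path)] = file_path.name
    manifest_path.write_text(json.dumps(manifest, indent=2))


# Demo in a temporary directory so nothing in the real data folders is touched
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "amazon 10-k 2023.md").write_text("# Annual report")
    manifest_path = data_dir / "processed_files.json"

    manifest = load_manifest(manifest_path)
    first_run = files_to_ingest(data_dir, manifest)
    for path in first_run:
        mark_processed(path, manifest, manifest_path)

    # A second pass over unchanged files finds nothing new to ingest
    second_run = files_to_ingest(data_dir, load_manifest(manifest_path))
    print(len(first_run), len(second_run))
```

Keying the manifest on the content hash instead of the filename means a renamed but unchanged file is still skipped, while an edited file is picked up again.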

In [None]:
collection_info = vector_store.client.get_collection(COLLECTION_NAME)
print(f"Total documents in Qdrant: {collection_info.points_count}")

In [None]:
# Hybrid search with RRF (Reciprocal Rank Fusion)
# Alternative queries to try (only the uncommented one is used):
# query = "What are Amazon's cash flows?"
# query = "What is Amazon's profit and loss statement?"
query = "asset base and earning"
results = vector_store.similarity_search(query, k=5)

In [None]:
results

In [None]:
from IPython.display import display, Markdown

for res in results:
    display(Markdown(res.page_content))