## Data Ingestion for Deep RAG

In this notebook, we load the extracted data into a Qdrant vector database:

- **Markdown**: Page-level chunks with metadata
- **Tables**: Separate documents with context and page numbers
- **Images**: Text descriptions embedded (generated in notebook 06-01b)
- **Hybrid Search**: Dense (semantic) + Sparse (keyword) embeddings

**Prerequisites:**
- Run the image description and Docling data extraction notebooks first
- Qdrant server running on localhost:6333
- Google API key set in .env file

**Output:**
- Single Qdrant collection with all content types
- Rich metadata for filtering (company, year, quarter, doc_type, page)
- Deduplication using file hashes

**Make sure the Qdrant vector database Docker container is running**

https://qdrant.tech/

### 0. Qdrant API Setup

In [31]:
import os
from dotenv import load_dotenv
load_dotenv()

from qdrant_client import QdrantClient

qdrant_client = QdrantClient(
    url="https://21dc13f4-3860-4af2-a0db-965efea65ef7.us-east4-0.gcp.cloud.qdrant.io:6333", 
    api_key = os.getenv("QDRANT_API_KEY")
)

print(qdrant_client.get_collections())

collections=[]


In [51]:
# Local setup: Qdrant running in Docker on localhost

qdrant_client = QdrantClient(
    url="http://localhost:6333"
)

print(qdrant_client.get_collections())

collections=[CollectionDescription(name='financial_docs')]


### 1. Setup and Imports

In [3]:
import hashlib
from pathlib import Path

from langchain_google_genai import GoogleGenerativeAIEmbeddings

from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse

from langchain_core.documents import Document
from qdrant_client import QdrantClient

### 2. Configuration

In [35]:
# Paths
MARKDOWN_DIR = "data/rag-data/markdown"
TABLES_DIR = "data/rag-data/tables"
IMAGES_DESC_DIR = "data/rag-data/images_desc"

# Qdrant Configuration
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "models/gemini-embedding-001"

### 3. Initialize Embeddings and Client

In [36]:
# Embeddings
embeddings = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL)
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

In [37]:
result = embeddings.embed_query('anything')
len(result)

3072

In [38]:
result = sparse_embeddings.embed_query('hi hello')
result

SparseVector(indices=[948991206, 613153351], values=[1.0, 1.0])

In [39]:
result = sparse_embeddings.embed_documents(['hi', 'hello'])
result

[SparseVector(indices=[948991206], values=[1.6877434821696136]),
 SparseVector(indices=[613153351], values=[1.6877434821696136])]
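
To see how these sparse vectors are used at query time, the minimal sketch below scores a query against documents with a sparse dot product: only indices present in both vectors contribute to the score. The helper name `sparse_dot` is hypothetical, and the real scoring happens inside Qdrant (which may additionally apply an IDF modifier if the collection is configured for it). The indices and values are copied from the outputs above.

```python
def sparse_dot(q_indices, q_values, d_indices, d_values):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    d_lookup = dict(zip(d_indices, d_values))
    # Only token indices shared by the query and the document contribute
    return sum(qv * d_lookup[qi] for qi, qv in zip(q_indices, q_values) if qi in d_lookup)

# Values copied from the cell outputs above
query = ([948991206, 613153351], [1.0, 1.0])     # embed_query('hi hello')
doc_hi = ([948991206], [1.6877434821696136])      # embed_documents(['hi', ...])
doc_hello = ([613153351], [1.6877434821696136])   # embed_documents([..., 'hello'])

score_hi = sparse_dot(*query, *doc_hi)
score_hello = sparse_dot(*query, *doc_hello)
print(score_hi, score_hello)
```

Each document shares exactly one token with the query, so both receive the same score. A document containing neither token would score zero on the sparse side.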

### 4. Create or Recreate Collection

In [None]:
# # Create vector store at Remote location
# vector_store = QdrantVectorStore.from_documents(
#     documents=[],
#     embedding=embeddings,
#     sparse_embedding=sparse_embeddings,
#     url="https://d599160b-804b-42db-8d0d-c1dd093fa909.us-east4-0.gcp.cloud.qdrant.io:6333", 
#     api_key = os.getenv("QDRANT_API_KEY"),
#     collection_name = COLLECTION_NAME,
#     retrieval_mode=RetrievalMode.HYBRID,
#     force_recreate=False
# )

In [52]:
# Create the vector store on the local Qdrant instance
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    url="http://localhost:6333", 
    collection_name = COLLECTION_NAME,
    retrieval_mode=RetrievalMode.HYBRID,
    force_recreate=False
)

In [53]:
vector_store.client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='financial_docs')])
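
In hybrid mode, each query produces two ranked lists (one from the dense vectors, one from the sparse vectors) that have to be merged into a single ranking. One widely used fusion method is Reciprocal Rank Fusion (RRF), and the sketch below illustrates the idea in plain Python. It is an illustration only: the function name, the constant `k=60`, and the document IDs are assumptions, and the fusion that Qdrant applies is performed server-side.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items near the top of any list receive a larger contribution
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ordered]

dense_ranking = ["a", "b", "c"]    # hypothetical semantic-search result order
sparse_ranking = ["b", "d", "a"]   # hypothetical keyword-search result order

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that appear in both lists (`a` and `b` here) rise above documents that appear in only one, which is the behaviour we want from combining semantic and keyword retrieval.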

### 5. Helper Functions

In [42]:
def extract_metadata_from_filename(filename: str):
    """
    Extract metadata from filename.
    
    Expected format: CompanyName DocType [Quarter] Year.pdf
    Examples:
        - Amazon 10-Q Q1 2024.pdf
        - Microsoft 10-K 2023.pdf
    """

    filename = filename.replace('.pdf', '').replace('.md', '')
    parts = filename.split()

    return {
        'company_name': parts[0],
        'doc_type': parts[1],
        'fiscal_quarter': parts[2] if len(parts)==4 else None,
        'fiscal_year': parts[-1]
    }

extract_metadata_from_filename('apple 10-k 2023.md')

{'company_name': 'apple',
 'doc_type': '10-k',
 'fiscal_quarter': None,
 'fiscal_year': '2023'}
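
The helper above splits on whitespace, so it assumes a single-word company name. As a hedged alternative, the sketch below parses the same `CompanyName DocType [Quarter] Year` pattern with a regular expression, which tolerates multi-word company names and fails loudly on unexpected filenames. The function name `parse_filing_name` and the assumption that the document type is always `10-K` or `10-Q` are illustrative and not part of this notebook's pipeline.

```python
import re

# Assumed layout: "<company> <10-K|10-Q> [Q1-Q4] <4-digit year>"
FILING_PATTERN = re.compile(
    r"^(?P<company_name>.+?)\s+"
    r"(?P<doc_type>10-[kq])"
    r"(?:\s+(?P<fiscal_quarter>q[1-4]))?"
    r"\s+(?P<fiscal_year>\d{4})$",
    re.IGNORECASE,
)

def parse_filing_name(filename: str) -> dict:
    """Parse a filing filename into metadata, raising on unexpected formats."""
    stem = re.sub(r"\.(pdf|md)$", "", filename.strip(), flags=re.IGNORECASE)
    match = FILING_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return match.groupdict()

print(parse_filing_name("apple 10-k 2023.md"))
print(parse_filing_name("Amazon 10-Q Q1 2024.pdf"))
print(parse_filing_name("Berkshire Hathaway 10-K 2023.pdf"))
```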

In [43]:
def compute_file_hash(file_path: Path):

    sha256_hash = hashlib.sha256()

    with open(file_path, 'rb') as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)

    return sha256_hash.hexdigest()


In [45]:
from pathlib import Path
Path.cwd()


PosixPath('/mnt/c/Z_data/projects/Convolve/downloads/Multi-Agent-Deep-RAG')

In [56]:
Path("data/rag-data").exists()

True

In [57]:
compute_file_hash(Path('data/rag-data/markdown/amazon/amazon 10-k 2023.md'))

'fc7817c1b8473b2619bedf24fd8a094d9dbd638ee28546bc9f779937efbfcd1a'

In [59]:
# compute_file_hash(Path('data/rag-data/markdown/amazon/amazon 10-k 2023 copy.md'))
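
Because the hash is computed from the file's bytes rather than its name, a renamed copy of a file produces the same digest, which is what makes hash-based deduplication work. The sketch below demonstrates this with temporary files. `sha256_of_file` is a hypothetical stand-in that mirrors `compute_file_hash`, and the file names and contents are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(file_path: Path) -> str:
    """Hash a file in 4 KB blocks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(4096), b""):
            digest.update(block)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.md"
    renamed_copy = Path(tmp) / "report copy.md"
    edited = Path(tmp) / "report edited.md"

    original.write_text("Revenue grew in 2023.", encoding="utf-8")
    renamed_copy.write_text("Revenue grew in 2023.", encoding="utf-8")
    edited.write_text("Revenue grew in 2024.", encoding="utf-8")

    # Same bytes give the same hash; a one-character change gives a different one
    same = sha256_of_file(original) == sha256_of_file(renamed_copy)
    different = sha256_of_file(original) != sha256_of_file(edited)

print(same, different)
```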

In [60]:
# Fetch the first page of points to inspect which files have been ingested
all_points = vector_store.client.scroll(
    collection_name=COLLECTION_NAME,
    limit=1_000,
    with_payload=True,
    offset=None
)

In [61]:
all_points[1]


In [62]:
# Raises IndexError while the collection is still empty (no points returned)
all_points[0][0].payload['metadata']['file_hash']

IndexError: list index out of range

In [18]:
def get_processed_hashes():
    
    processed_hashes = set()
    offset = None

    while True:
        points, offset = vector_store.client.scroll(
                            collection_name=COLLECTION_NAME,
                            limit=10_000,
                            with_payload=True,
                            offset=offset
                        )

        if not points:
            break
        
        processed_hashes.update(point.payload['metadata']['file_hash'] for point in points)

        if offset is None:
            break

    return processed_hashes

In [19]:
processed_hashes = get_processed_hashes()

In [20]:
len(processed_hashes)

0
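
The loop in `get_processed_hashes` follows a cursor-style pagination protocol: each call returns a page of points plus the offset to pass to the next call, and a `None` offset signals the last page. The sketch below reproduces that control flow against an in-memory stand-in. `fake_scroll` and the sample payloads are hypothetical and only mimic the `(points, next_offset)` shape used above. The sketch also uses `.get()` so that points without a `file_hash` are skipped instead of raising `KeyError`.

```python
def fake_scroll(records, limit, offset=None):
    """Hypothetical stand-in for a paginated scroll: returns (page, next_offset)."""
    start = offset or 0
    page = records[start:start + limit]
    next_offset = start + limit if start + limit < len(records) else None
    return page, next_offset

def collect_hashes(records, limit=2):
    hashes = set()
    offset = None
    while True:
        points, offset = fake_scroll(records, limit=limit, offset=offset)
        for payload in points:
            file_hash = payload.get("metadata", {}).get("file_hash")
            if file_hash is not None:
                hashes.add(file_hash)
        if offset is None:  # no further pages
            break
    return hashes

records = [
    {"metadata": {"file_hash": "aaa", "page": 1}},
    {"metadata": {"file_hash": "aaa", "page": 2}},
    {"metadata": {"file_hash": "bbb", "page": 1}},
    {"metadata": {}},  # a point without a file hash
    {"metadata": {"file_hash": "ccc", "page": 1}},
]
print(sorted(collect_hashes(records)))
```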

In [21]:
# extract the page number from the file path
import re

def extract_page_number(file_path: Path):
    pattern = r'page_(\d+)'
    match = re.search(pattern=pattern, string=file_path.stem)
    return int(match.group(1)) if match else None

In [22]:
file_path = Path('data/rag-data/images_desc/google/google 10-k 2023/page_28.md')
extract_page_number(file_path)

28
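
Table files carry two numbers in their names (for example `table_23_page_20.md`, a name that appears in this notebook's ingestion output), and the `page_(\d+)` pattern picks up only the page. The small sketch below pulls out both values, assuming the `table_<index>_page_<page>` naming seen in that filename. The helper `extract_table_and_page` is hypothetical and is not used elsewhere in the notebook.

```python
import re
from pathlib import Path

def extract_table_and_page(file_path: Path):
    """Return (table_index, page_number), or (None, None) if the name does not match."""
    match = re.search(r"table_(\d+)_page_(\d+)", file_path.stem)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))

table_idx, page = extract_table_and_page(
    Path("data/rag-data/tables/google/google 10-q q3 2024/table_23_page_20.md")
)
print(table_idx, page)
```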

### 6. Ingestion Function

In [63]:
def ingest_file_in_db(file_path, processed_hashes):

    file_hash = compute_file_hash(file_path)
    if file_hash in processed_hashes:
        # Skip files whose content is already in the collection
        print(f"Following file has been already uploaded: {file_path}")
        return

    path_str = str(file_path)
    if 'markdown' in path_str:
        content_type = 'text'
        doc_name = file_path.name
    elif 'tables' in path_str:
        content_type = 'tables'
        doc_name = file_path.parent.name
    elif 'images_desc' in path_str:
        content_type = 'image'
        doc_name = file_path.parent.name
    else:
        content_type = 'unknown'
        doc_name = file_path.name

    content = file_path.read_text(encoding='utf-8')

    base_metadata = extract_metadata_from_filename(doc_name)

    base_metadata.update({
        'content_type': content_type,
        'file_hash': file_hash,
        'source_file': doc_name
    })

    if content_type == 'text':
        # Markdown: split on page breaks and create one document per page
        pages = content.split('<!-- page break -->')
        documents = []
        for idx, page in enumerate(pages, start=1):
            metadata = base_metadata.copy()
            metadata.update({'page': idx})
            documents.append(Document(page_content=page, metadata=metadata))

        vector_store.add_documents(documents)

    else:
        # Tables and image descriptions: one document per file; the page number comes from the filename
        page_num = extract_page_number(file_path)
        metadata = base_metadata.copy()
        metadata.update({'page': page_num})
        documents = [Document(page_content=content, metadata=metadata)]

        vector_store.add_documents(documents)


    processed_hashes.add(file_hash)
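
The markdown branch above turns one file into one record per page by splitting on the `<!-- page break -->` marker and numbering pages from 1. The sketch below isolates that step using plain dictionaries instead of `Document` objects, so it runs without any dependencies. The function name `split_into_pages` and the choice to drop whitespace-only pages are assumptions, not the behaviour of the function above.

```python
PAGE_BREAK = "<!-- page break -->"

def split_into_pages(content: str, base_metadata: dict) -> list:
    """Split markdown into page-level records, keeping the original page numbers."""
    records = []
    for page_number, page_text in enumerate(content.split(PAGE_BREAK), start=1):
        if not page_text.strip():
            continue  # assumption: skip blank pages but keep numbering aligned
        records.append({
            "page_content": page_text.strip(),
            "metadata": {**base_metadata, "page": page_number},
        })
    return records

sample = "Cover page<!-- page break -->   <!-- page break -->Income statement"
records = split_into_pages(sample, {"company_name": "amazon", "fiscal_year": "2023"})
for record in records:
    print(record["metadata"]["page"], record["page_content"])
```

Copying `base_metadata` into each record (rather than mutating it) matters for the same reason as `base_metadata.copy()` above: every page needs its own `page` value.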


In [64]:
file_path = Path('data/rag-data/markdown/amazon/amazon 10-k 2023.md')
processed_hashes = get_processed_hashes()

ingest_file_in_db(file_path, processed_hashes)

In [65]:
from tqdm import tqdm

base_path = Path('data/rag-data')
all_md_files = list(base_path.rglob("*.md"))

for md_file in tqdm(all_md_files):
    ingest_file_in_db(md_file, processed_hashes)

  7%|▋         | 79/1118 [00:38<08:45,  1.98it/s]

Following file has been already uploaded: data/rag-data/markdown/amazon/amazon 10-k 2023.md


 83%|████████▎ | 924/1118 [11:49<01:35,  2.02it/s]  

Following file has been already uploaded: data/rag-data/tables/google/google 10-q q3 2024/table_23_page_20.md


100%|██████████| 1118/1118 [13:23<00:00,  1.39it/s]
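
The loop above calls `vector_store.add_documents` once per file, which for many small table and image files means many small embedding requests. One option (an assumption, not something this notebook does) is to accumulate documents across files and flush them in fixed-size batches. The generic helper below shows only the batching logic; `iter_batches` and the batch size are illustrative.

```python
from itertools import islice

def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

batches = list(iter_batches(range(7), batch_size=3))
print(batches)
```

Each batch could then be passed to a single `add_documents` call. File hashes should only be added to `processed_hashes` after the batch containing them has been written successfully.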


### 7. Verify Ingestion

In [66]:
collection_info = vector_store.client.get_collection(COLLECTION_NAME)
collection_info



### 8. Test Search

In [None]:
query = "what is the Zomato's revenue"
results = vector_store.similarity_search(query)

In [68]:
results

[Document(metadata={'company_name': 'meta', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': '2023', 'content_type': 'tables', 'file_hash': '1e621c69111b6c99fd696decbd3b59c33a0669862bb357ba69424dce6654a9a8', 'source_file': 'meta 10-k 2023', 'page': 77, '_id': '99cc37ad-d473-431a-8c74-c53f63ed189c', '_collection_name': 'financial_docs'}, page_content="**Page:** 77\n\nChanges in foreign exchange rates had a favorable impact on our total revenue in the full year 2023 compared to the same period in 2022. If we had translated revenue for the full year 2023 using the prior year's monthly exchange rates for our settlement or billing currencies other than the U.S. dollar, our total  revenue  and  advertising  revenue  would  have  been  $134.53  billion  and  $131.57  billion,  respectively.  Using  these  constant  rates,  total  revenue  and advertising revenue would have been $374 million and $379 million lower than actual total revenue and advertising revenue, respectively, for t