# 🏥 Offline Vector Indexing Pipeline (Medical Education)

## 🎯 Overview
This notebook serves as the **Offline Preprocessing & Indexing Pipeline** for the Adaptive RAG system. Its sole purpose is to transform raw medical documents into a searchable vector index.

### 🚫 Scope & Usage
- **Offline Only**: This notebook is run *once* (or periodically) to build the database.
- **No RAG/LLM**: It does not perform retrieval or generation.
- **Zero Medical Advice**: It processes text data deterministically without interpretation.

### 📂 Output Artifacts
This pipeline will generate the following files in a `./vector_store/` directory:
1. `index.faiss`: The FAISS vector index for fast similarity search.
2. `metadata.pkl`: Per-chunk metadata (document ID, source filename, character offset), keyed by chunk ID.
3. `texts.pkl`: The normalized chunk text corresponding to each vector, keyed by chunk ID.
4. `config.json`: Configuration used (embedding model name, chunk size) to ensure consistency.

### ⚠️ CRITICAL WARNING
> **Consistency is Key**: The embedding model used here (`sentence-transformers/all-MiniLM-L6-v2`) **MUST** match the model used in the online query notebook. Mismatched models will result in random/garbage retrieval.
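
One way to enforce this on the query side is to compare the model name stored in `config.json` with the model the query notebook is about to load. The sketch below assumes the `embedding_model` key that the save step of this notebook writes; `assert_same_model` is a hypothetical helper, not part of any library, and the example config values are illustrative.

```python
import json

def assert_same_model(config_text: str, query_model: str) -> None:
    """Raise if the index was built with a different embedding model."""
    config = json.loads(config_text)
    indexed_model = config["embedding_model"]
    if indexed_model != query_model:
        raise ValueError(
            f"Index was built with {indexed_model!r}, "
            f"but the query side loads {query_model!r}."
        )

# Example config in the shape this pipeline writes (values are illustrative).
example_config = json.dumps({
    "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
    "chunk_size": 400,
})

# Passes silently because the names match exactly.
assert_same_model(example_config, "sentence-transformers/all-MiniLM-L6-v2")
```

The comparison is a plain string match, so both notebooks should use the same full identifier (including the `sentence-transformers/` prefix).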


In [None]:
# @title 📦 Install Dependencies
# We need specific libraries for OCR, PDF processing, and Vector Indexing.

# 1. System dependencies for PDF handling and OCR
!apt-get update -qq
!apt-get install -y poppler-utils tesseract-ocr

# 2. Python Libraries
# sentence-transformers: For generating state-of-the-art text embeddings
# faiss-cpu: Facebook AI Similarity Search (efficient vector storage)
# pytesseract: Wrapper for Google's Tesseract-OCR
# pdf2image: To convert PDF pages to images for OCR
# opencv-python: For image preprocessing (noise removal)
# tqdm: For progress bars
!pip install -q sentence-transformers faiss-cpu pytesseract pdf2image opencv-python numpy tqdm

print("✅ Libraries installed successfully.")


In [None]:
# @title ⚙️ Environment Setup
import os
import sys
import shutil
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Create the output directory for our vector store
OUTPUT_DIR = './vector_store'
if os.path.exists(OUTPUT_DIR):
    shutil.rmtree(OUTPUT_DIR) # Clean start
os.makedirs(OUTPUT_DIR)
print(f"📂 Created output directory: {OUTPUT_DIR}")

# Tesseract Config (usually found automatically, but good to ensure availability)
try:
    import pytesseract
    # Check if tesseract is in path
    pytesseract.get_tesseract_version()
    print("✅ Tesseract OCR is available.")
except Exception:
    print("❌ Tesseract OCR not found. Please verify installation.")
    raise


## 🔢 Step 1: Configuration

We need to know how many documents you intend to process. We assign internal IDs to maintain lineage.
Batch processing allows us to track progress and manage memory effectively.


In [None]:
# @title Input Document Count
try:
    num_documents = int(input("Enter number of documents to process: "))
    print(f"📋 We will process {num_documents} documents.")
except ValueError:
    num_documents = 1
    print("⚠️ Invalid input. Defaulting to 1 document.")


In [None]:
# @title 📤 Upload Documents
from google.colab import files

print(f"Please upload {num_documents} file(s) (PDF, JPG, PNG)...")
uploaded = files.upload()

source_files = list(uploaded.keys())

if len(source_files) == 0:
    raise ValueError("No files uploaded. Exiting.")

print("\nFiles to be processed:")
for i, f in enumerate(source_files):
    print(f"{i}: {f}")
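
Before OCR starts, it can help to separate supported uploads from unsupported ones and to compare the result with the document count entered earlier. The following is a minimal sketch; `split_supported` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the filenames are made up for illustration.

```python
import os

# Extensions the OCR step of this notebook handles.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}

def split_supported(filenames):
    """Partition filenames into (supported, skipped) by file extension."""
    supported, skipped = [], []
    for name in filenames:
        # splitext handles names without a dot, unlike name.split('.')[-1]
        ext = os.path.splitext(name)[1].lower()
        if ext in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            skipped.append(name)
    return supported, skipped

# Illustrative upload list and expected count.
uploaded_names = ["anatomy.PDF", "slide.png", "notes.docx", "README"]
expected_count = 3

supported, skipped = split_supported(uploaded_names)
print("Supported:", supported)
print("Skipped:", skipped)
if len(supported) != expected_count:
    print(f"⚠️ Expected {expected_count} documents, found {len(supported)} supported file(s).")
```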


In [None]:
# @title 🔍 OCR Text Extraction
from pdf2image import convert_from_path
import pytesseract
from tqdm import tqdm
import cv2
import numpy as np
from PIL import Image
import io

# Data structure to hold raw text
documents = [] 
# Format: [{'doc_id': int, 'source': str, 'raw_text': str}]

print("🚀 Starting OCR extraction... (This may take time via Tesseract)")

for doc_idx, filename in enumerate(source_files):
    print(f"\n📄 Processing {filename} ({doc_idx+1}/{len(source_files)})...")
    
    full_text = ""
    file_ext = filename.split('.')[-1].lower()
    
    try:
        if file_ext == 'pdf':
            # Convert PDF to list of images
            images = convert_from_path(filename)
            
            for page_image in images:
                # Convert to grayscale before OCR
                gray = page_image.convert('L')
                # Text extraction
                text = pytesseract.image_to_string(gray)
                full_text += text + "\n"
                
        elif file_ext in ['jpg', 'jpeg', 'png']:
            # Same grayscale conversion for standalone images
            image = Image.open(filename).convert('L')
            text = pytesseract.image_to_string(image)
            full_text += text
        else:
            print(f"⚠️ Skipping unsupported file type: {filename}")
            continue
            
        # Store result
        documents.append({
            "doc_id": doc_idx,
            "source": filename,
            "raw_text": full_text
        })
        print(f"   ✅ Extracted {len(full_text)} characters from {filename}")
        
    except Exception as e:
        print(f"   ❌ Error processing {filename}: {e}")

print(f"\n🏁 OCR Complete. Processed {len(documents)} documents.")


In [None]:
# @title 🧹 Noise Removal & Normalization
import re
import unicodedata

def normalize_text(text):
    # 1. Unicode normalization (NFKD to decompose special chars)
    text = unicodedata.normalize('NFKD', text)
    
    # 2. Lowercase (Consistent for embeddings)
    text = text.lower()
    
    # 3. Remove common artifact patterns (e.g., page numbers like 'page 1 of 5').
    #    The text is already lowercase, so the pattern is lowercase too.
    text = re.sub(r'page\s+\d+(?:\s+of\s+\d+)?', '', text)
    
    # 4. Collapse excess whitespace/newlines (done last so removals leave no double spaces)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

print("Cleaning text...\n")
for doc in documents:
    original_len = len(doc['raw_text'])
    doc['clean_text'] = normalize_text(doc['raw_text'])
    cleaned_len = len(doc['clean_text'])
    
    print(f"Doc {doc['doc_id']} ({doc['source']}): reduced {original_len} -> {cleaned_len} chars")
    
# Note: Normalization is lossy (case and line breaks are discarded), but the original OCR output
# stays in doc['raw_text']. Embeddings are built from doc['clean_text'] only.
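
The effect of NFKD on typical OCR output can be checked directly with the standard library. The strings below are illustrative, not taken from any uploaded document. Note that NFKD splits accented letters into a base letter plus a combining mark; it does not remove the mark.

```python
import unicodedata

# Illustrative inputs: an "fi" ligature, an accented letter, and the micro sign.
samples = ["\ufb01brosis", "caf\u00e9", "10 \u00b5g"]

for s in samples:
    decomposed = unicodedata.normalize("NFKD", s)
    print(f"{s!r} (len {len(s)}) -> {decomposed!r} (len {len(decomposed)})")

ligature = unicodedata.normalize("NFKD", "\ufb01brosis")
accented = unicodedata.normalize("NFKD", "caf\u00e9")
micro = unicodedata.normalize("NFKD", "10 \u00b5g")
```

The ligature becomes plain `fi`, the micro sign becomes the Greek letter mu, and `é` becomes `e` followed by U+0301. If accent-free text is wanted, the combining marks would have to be stripped in a separate step.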


## 🧩 Chunking Strategy

We use a **sliding window** approach to chunking. 

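- **Chunk Size**: Number of characters per chunk (this notebook chunks by character, not by token). Keeping this around 400-500 characters helps each chunk capture a single concept.
- **Overlap**: Essential for medical text. Ensures that context (like a disease name appearing at the end of chunk A) is carried over to chunk B.
- **Model Limit**: `all-MiniLM-L6-v2` truncates input beyond 256 word pieces by default, so chunks should stay well below that. At roughly 4 characters per token, 400 characters is about 100 tokens.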
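
The window arithmetic can be sketched as a small standalone function. `sliding_window` is a hypothetical helper that mirrors the chunking cell further down (step = size - overlap, short tails dropped); the 1000-character input is synthetic.

```python
def sliding_window(text, size=400, overlap=80, min_len=50):
    """Return (start, chunk) pairs using a character-based sliding window."""
    if not 0 <= overlap < size:
        # overlap >= size would give a zero or negative step
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if len(piece) >= min_len:  # drop very short tails
            windows.append((start, piece))
    return windows

# Synthetic 1000-character text.
text = "".join(chr(ord("a") + i % 26) for i in range(1000))
windows = sliding_window(text)

for start, piece in windows:
    print(start, len(piece))
# 0 400
# 320 400
# 640 360
```

With a step of 320, a fourth window would start at 960 but holds only 40 characters, so it is dropped. Nothing is lost, because those characters already sit at the end of the third chunk.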


In [None]:
# @title Chunking Parameters

# Configurable parameters
MAX_TOKENS = 256   # Model truncation limit in word pieces; not used directly because we chunk by character
CHUNK_SIZE = 400   # Characters (approx 100 tokens)
CHUNK_OVERLAP = 80 # Characters (approx 20 tokens)

print(f"Configuration: Size={CHUNK_SIZE}, Overlap={CHUNK_OVERLAP}")


In [None]:
# @title 🔪 Execute Chunking

chunks = []
chunk_counter = 0

for doc in documents:
    text = doc['clean_text']
    source = doc['source']
    doc_id = doc['doc_id']
    
    # Simple sliding window by character
    # (For production, consider nltk sentence tokenizer or recursive chunking)
    for i in range(0, len(text), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk_text = text[i : i + CHUNK_SIZE]
        
        # Skip chunks that are too small (noise)
        if len(chunk_text) < 50:
            continue
            
        chunks.append({
            "chunk_id": chunk_counter,
            "doc_id": doc_id,
            "text": chunk_text,
            "source": source,
            "position": i
        })
        chunk_counter += 1

print(f"✅ Generated {len(chunks)} chunks from {len(documents)} documents.")
# Example peek
if chunks:
    print("Sample Chunk:", chunks[0])


In [None]:
# @title 🧠 Load Embedding Model
from sentence_transformers import SentenceTransformer

MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'

print(f"Loading model: {MODEL_NAME}...")
# This downloads the model weights
embedding_model = SentenceTransformer(MODEL_NAME)

dim = embedding_model.get_sentence_embedding_dimension()
print(f"✅ Model loaded. Embedding Dimension: {dim}")
print("⚠️ WARNING: If you change this model, you MUST re-run this entire notebook. Index varies by model!")


In [None]:
# @title ⚡ Generate Embeddings
import numpy as np

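# Extract text list
chunk_texts = [c['text'] for c in chunks]
if not chunk_texts:
    raise ValueError("No chunks to encode. Check the OCR and chunking steps.")

print(f"Encoding {len(chunk_texts)} chunks... (uses GPU if available, otherwise CPU)")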

# Generate embeddings
# show_progress_bar=True provides a tqdm bar automatically
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, convert_to_numpy=True)

# Ensure float32 for FAISS
embeddings = embeddings.astype(np.float32)

print(f"✅ Embeddings shape: {embeddings.shape}")


In [None]:
# @title 🗄️ Create FAISS Index
import faiss

dimension = embeddings.shape[1]

# Create L2 index (Euclidean distance). IndexFlatL2 reports squared L2 distances.
# For unit-length embeddings, squared L2 = 2 - 2 * cosine similarity, so ranking by L2
# gives the same order as ranking by cosine similarity (the two are not proportional).
# To get cosine scores directly, normalize the vectors and use IndexFlatIP (Inner Product).
# Here we stick to standard L2 for robust testing.
index = faiss.IndexFlatL2(dimension)

# Add vectors to index
index.add(embeddings)

print(f"✅ FAISS Index created. Total vectors: {index.ntotal}")
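
The relationship between L2 distance and cosine similarity for unit-length vectors can be verified numerically without FAISS. This is a pure-Python sketch with made-up 3-dimensional vectors; real embeddings from this model have 384 dimensions, and the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for a query and three chunk embeddings.
query = normalize([1.0, 2.0, 2.0])
docs = [normalize(v) for v in ([2.0, 4.0, 4.1], [-1.0, 0.5, 3.0], [3.0, 0.0, 0.0])]

for d in docs:
    # The two printed numbers agree: ||q - d||^2 = 2 - 2 * cos(q, d)
    print(round(squared_l2(query, d), 6), round(2 - 2 * dot(query, d), 6))

# Smallest distance first vs. largest cosine first: same ordering.
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
print(by_l2, by_cos)
```

The identity only holds for unit-length vectors, so it is worth confirming that the stored embeddings are normalized before relying on it.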


In [None]:
# @title 🎒 Prepare Metadata Store

# FAISS only stores vectors. It doesn't know what text belongs to which vector.
# We need a 'Sidecar' storage: ID -> Data mapping.

metadata_store = {}
text_store = {}

for i, chunk in enumerate(chunks):
    # chunk['chunk_id'] corresponds to the index in FAISS (sequential 0..N)
    # In this simple case, index ID == chunk_id because we added them in order.
    
    c_id = chunk['chunk_id']
    
    # Metadata: Source info
    metadata_store[c_id] = {
        "doc_id": chunk['doc_id'],
        "source": chunk['source'],
        "position": chunk['position']
    }
    
    # Text: The actual content
    text_store[c_id] = chunk['text']

print(f"✅ Prepared metadata for {len(metadata_store)} items.")
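
On the query side, the IDs returned by a FAISS search would be mapped back through these two dictionaries. The sketch below uses hand-written IDs and distances in place of a real `index.search` result; `resolve_hits` is a hypothetical helper and the store contents are illustrative. FAISS pads the ID array with `-1` when fewer than `k` neighbors exist, so those entries are skipped.

```python
def resolve_hits(ids, distances, text_store, metadata_store):
    """Join search result IDs with their text and metadata."""
    hits = []
    for idx, dist in zip(ids, distances):
        key = int(idx)  # also accepts NumPy integer types
        if key == -1:   # empty slot: fewer than k results were available
            continue
        hits.append({
            "chunk_id": key,
            "distance": float(dist),
            "text": text_store[key],
            **metadata_store[key],
        })
    return hits

# Illustrative sidecar stores in the same shape as the cell above.
toy_texts = {0: "the heart has four chambers", 1: "the left ventricle pumps blood"}
toy_metadata = {
    0: {"doc_id": 0, "source": "cardio.pdf", "position": 0},
    1: {"doc_id": 0, "source": "cardio.pdf", "position": 320},
}

# Stand-in for one row of a search result with k=3 (values are made up).
hits = resolve_hits([1, 0, -1], [0.12, 0.57, float("inf")], toy_texts, toy_metadata)
for hit in hits:
    print(hit["chunk_id"], hit["source"], hit["position"], hit["text"])
```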


In [None]:
# @title 💾 Save to Disk
import pickle
import json
from datetime import datetime

# Paths
index_path = os.path.join(OUTPUT_DIR, 'index.faiss')
metadata_path = os.path.join(OUTPUT_DIR, 'metadata.pkl')
texts_path = os.path.join(OUTPUT_DIR, 'texts.pkl')
config_path = os.path.join(OUTPUT_DIR, 'config.json')

# 1. Save FAISS Index
faiss.write_index(index, index_path)

# 2. Save Metadata (Pickle)
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata_store, f)
    
# 3. Save Texts (Pickle)
with open(texts_path, 'wb') as f:
    pickle.dump(text_store, f)

# 4. Save Config (JSON) for reproducibility
config_data = {
    "embedding_model": MODEL_NAME,
    "chunk_size": CHUNK_SIZE,
    "chunk_overlap": CHUNK_OVERLAP,
    "num_documents": len(documents),
    "total_chunks": len(chunks),
    "timestamp": str(datetime.now())
}
with open(config_path, 'w') as f:
    json.dump(config_data, f, indent=4)

print("💾 All artifacts saved successfully to ./vector_store/:")
print(f"  - {index_path}")
print(f"  - {metadata_path}")
print(f"  - {texts_path}")
print(f"  - {config_path}")

# Create a zip for easy download
shutil.make_archive('vector_store_backup', 'zip', OUTPUT_DIR)
print("\n📦 Created 'vector_store_backup.zip' for download.")
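
The query notebook would need to read these files back and confirm they belong together. The following sketch covers the pickle and JSON sidecars only (the index itself would be read with `faiss.read_index`). `load_sidecars` is a hypothetical helper, and the toy files written to the temporary directory are illustrative.

```python
import json
import os
import pickle
import tempfile

def load_sidecars(store_dir):
    """Load metadata.pkl, texts.pkl and config.json and cross-check them."""
    with open(os.path.join(store_dir, "metadata.pkl"), "rb") as f:
        metadata = pickle.load(f)
    with open(os.path.join(store_dir, "texts.pkl"), "rb") as f:
        texts = pickle.load(f)
    with open(os.path.join(store_dir, "config.json")) as f:
        config = json.load(f)

    if metadata.keys() != texts.keys():
        raise ValueError("metadata.pkl and texts.pkl have different chunk IDs")
    if config["total_chunks"] != len(texts):
        raise ValueError("config.json chunk count does not match texts.pkl")
    return metadata, texts, config

# Toy round trip in a temporary directory (contents are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    toy_metadata = {0: {"doc_id": 0, "source": "a.pdf", "position": 0},
                    1: {"doc_id": 0, "source": "a.pdf", "position": 320}}
    toy_texts = {0: "first chunk", 1: "second chunk"}
    toy_config = {"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
                  "total_chunks": 2}
    with open(os.path.join(tmp, "metadata.pkl"), "wb") as f:
        pickle.dump(toy_metadata, f)
    with open(os.path.join(tmp, "texts.pkl"), "wb") as f:
        pickle.dump(toy_texts, f)
    with open(os.path.join(tmp, "config.json"), "w") as f:
        json.dump(toy_config, f)

    metadata, texts, config = load_sidecars(tmp)

print(f"Loaded {len(texts)} chunks built with {config['embedding_model']}")
```

Unpickling can execute arbitrary code, so `.pkl` files should only be loaded when they come from a trusted run of this pipeline.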


## 🏁 Pipeline Complete

**Summary:**
We have successfully converted your raw documents into a searchable Vector Database.

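**Artifacts Created:**
1. `index.faiss`: The FAISS index holding one embedding vector per chunk.
2. `metadata.pkl`: Links vectors to document sources.
3. `texts.pkl`: The readable chunk text returned to the LLM.
4. `config.json`: The embedding model name and chunking parameters used for this build.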

**Next Steps:**
- Download `vector_store_backup.zip`.
- Upload it to your **Online RAG Notebook**.
- Load the same `sentence-transformers/all-MiniLM-L6-v2` model there to query this data.

> **Note:** If you add new documents later, you must re-run this entire pipeline to regenerate the index.
