# 🏥 Offline Vector Indexing Pipeline (Medical Education)

## 📋 Overview
This notebook is the **Preprocessing Factory** for the Adaptive RAG system.
It transforms raw medical documents (PDFs/Images) into a mathematical **Vector Index** that the main AI system can search.

### 🔄 Pipeline Workflow
1.  **Input**: Raw PDF/Image files.
2.  **OCR**: Extract text using Tesseract.
3.  **Cleaning**: Normalize and remove noise.
4.  **Chunking**: Split text into overlapping sliding windows.
5.  **Embedding**: Convert text chunks into Vectors (Numbers).
6.  **Indexing**: Build a FAISS Vector Database.
7.  **Output**: Save artifacts (`index.faiss`, `metadata.pkl`, `texts.pkl`).

---


## 🛠️ Step 1: Install Dependencies
**What it does**:
- Installs OCR tools (Tesseract, Poppler).
- Installs Python libraries for PDF processing (`pdf2image`), Vector Search (`faiss-cpu`), and embeddings (`sentence-transformers`).

**Input**: None
**Output**: System tools and Python libraries installed.
**Role**: Infrastructure Setup.


In [None]:
# @title 📦 Install Dependencies
!apt-get update -qq
!apt-get install -y poppler-utils tesseract-ocr
!pip install -q sentence-transformers faiss-cpu pytesseract pdf2image opencv-python numpy tqdm

print("✅ Libraries installed successfully.")


## ⚙️ Step 2: Environment Setup
**What it does**:
- Deletes any existing `./vector_store` directory and recreates it empty, so results from earlier runs are discarded.
- Checks that the Tesseract binary is installed and callable.

**Input**: None
**Output**: A clean directory ready for data.
**Role**: Workspace Preparation.


In [None]:
# @title ⚙️ Environment Setup
import os
import shutil
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

OUTPUT_DIR = './vector_store'
if os.path.exists(OUTPUT_DIR):
    shutil.rmtree(OUTPUT_DIR) # Clean start
os.makedirs(OUTPUT_DIR)
print(f"📂 Created output directory: {OUTPUT_DIR}")

try:
    import pytesseract
    pytesseract.get_tesseract_version()
    print("✅ Tesseract OCR is available.")
except Exception:
    print("❌ Tesseract OCR not found. Please verify installation.")
    raise
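
The check above only probes Tesseract, while `pdf2image` also relies on Poppler's command-line tools. A minimal sketch of a check that covers both, assuming the binaries are named `tesseract` and `pdftoppm` and live on the `PATH` (the helper `missing_tools` is hypothetical, not part of this notebook):

```python
import shutil

def missing_tools(names):
    """Return the tool names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

# Assumption: `pdftoppm` comes from poppler-utils, `tesseract` from tesseract-ocr.
required = ["tesseract", "pdftoppm"]
missing = missing_tools(required)
if missing:
    print(f"Missing system tools: {', '.join(missing)}")
else:
    print("All system tools found.")
```

Running this before the upload step would surface a missing Poppler install early instead of during PDF conversion.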


## 🔢 Step 3: Input Configuration
**What it does**:
- Asks how many files you plan to process.
- Falls back to 1 if the input is not a whole number.

**Input**: User types a number (e.g., "3").
**Output**: `num_documents` variable set.
**Role**: Job Configuration.


In [None]:
# @title 🔢 Input Count
try:
    num_documents = int(input("Enter number of documents to process: "))
    print(f"📄 We will process {num_documents} documents.")
except ValueError:
    num_documents = 1
    print("⚠️ Invalid input. Defaulting to 1 document.")
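
The cell above accepts zero or negative numbers, because `int()` parses them without error. A small sketch of a stricter parser, assuming the same fallback of 1 should apply to those values (`parse_document_count` is a hypothetical helper):

```python
def parse_document_count(raw, default=1):
    """Parse a positive integer from user input, falling back to `default`."""
    try:
        count = int(raw.strip())
    except ValueError:
        return default
    # Treat zero and negative counts as invalid as well.
    return count if count > 0 else default

print(parse_document_count("3"))
print(parse_document_count("abc"))
print(parse_document_count("-2"))
```

In the notebook it could wrap the `input(...)` call in place of the bare `int(...)` conversion.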


## 📤 Step 4: Upload Documents
**What it does**:
- Opens the Google Colab file picker.
- Allows you to upload PDF, JPG, or PNG files.

**Input**: Files from your computer.
**Output**: Files saved to Colab runtime.
**Role**: Data Ingestion.


In [None]:
# @title 📤 Upload Files
from google.colab import files

print(f"Please upload {num_documents} file(s) (PDF, JPG, PNG)...")
uploaded = files.upload()

source_files = list(uploaded.keys())
if not source_files:
    raise ValueError("No files uploaded. Exiting.")

print("\nFiles to be processed:")
for i, f in enumerate(source_files): print(f"{i}: {f}")
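
Nothing in the upload cell compares the number of uploaded files with `num_documents`, and unsupported file types only show up later, during OCR. A hedged sketch of an early review step, using plain file names so it runs outside Colab (`review_upload` and `SUPPORTED_EXTENSIONS` are hypothetical names):

```python
import os

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}

def review_upload(expected_count, filenames):
    """Split names into supported/unsupported and flag a count mismatch."""
    supported, unsupported = [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return {
        "supported": supported,
        "unsupported": unsupported,
        "count_matches": len(filenames) == expected_count,
    }

# Example file names are made up for illustration.
report = review_upload(2, ["anatomy.PDF", "notes.txt", "xray.png"])
print(report)
```

In the notebook, `filenames` would be `source_files` and `expected_count` would be `num_documents`.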


## 🔍 Step 5: OCR (Text Extraction)
**What it does**:
- Converts PDFs into images.
- Uses Tesseract OCR to read text from those images.
- Handles both PDFs and raw Images (JPG/PNG).

**Input**: Raw files (PDF/Image).
**Output**: Raw text strings for each document.
**Role**: Digitization (converting pixels to text).


In [None]:
# @title 🔍 Run OCR
from pdf2image import convert_from_path
import pytesseract
from tqdm import tqdm
from PIL import Image

documents = []  # Each entry: {'doc_id', 'source', 'raw_text'}
print("🚀 Starting OCR extraction...")

for doc_idx, filename in enumerate(source_files):
    print(f"\n📄 Processing {filename} ({doc_idx+1}/{len(source_files)})...")
    full_text = ""
    file_ext = filename.split('.')[-1].lower()

    try:
        if file_ext == 'pdf':
            # Render every page to an image, then OCR the pages in order.
            images = convert_from_path(filename)
            for image in images:
                full_text += pytesseract.image_to_string(image) + "\n"
        elif file_ext in ['jpg', 'jpeg', 'png']:
            full_text += pytesseract.image_to_string(Image.open(filename))
        else:
            print(f"⚠️ Skipping unsupported file: {filename}")
            continue

        documents.append({"doc_id": doc_idx, "source": filename, "raw_text": full_text})
        print(f"   ✅ Extracted {len(full_text)} characters.")
    except Exception as e:
        # A failure in one file is reported and the loop moves on to the next file.
        print(f"   ❌ Error: {e}")

print(f"\n🏁 OCR Complete. Processed {len(documents)} docs.")
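
`convert_from_path(filename)` renders every page of a PDF before OCR starts, which can use a lot of memory on long documents. One option is to render in page ranges through the `first_page` and `last_page` arguments of `convert_from_path`. The helper below only computes the 1-indexed ranges; its name `page_ranges` and the batch size of 10 are assumptions for illustration:

```python
def page_ranges(total_pages, batch_size=10):
    """Yield inclusive, 1-indexed (first_page, last_page) pairs covering all pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)

# A 25-page document split into batches of 10 pages.
print(list(page_ranges(25, batch_size=10)))
```

Each pair could then be passed as `convert_from_path(filename, first_page=first, last_page=last)`. The total page count would have to come from elsewhere, for example `pdf2image.pdfinfo_from_path`.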


## 🧹 Step 6: Noise Removal
**What it does**:
- Normalizes text (lowercasing, Unicode NFKD normalization).
- Removes artifacts like "Page 1 of 5".
- Removes excess whitespace.

**Input**: Raw OCR text.
**Output**: Normalized text, stored as `clean_text` on each document.
**Role**: Data Cleaning.


In [None]:
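# @title 🧹 Clean Text
import re
import unicodedata

def normalize_text(text):
    # NFKD splits ligatures and accented characters into simpler code points.
    text = unicodedata.normalize('NFKD', text)
    text = text.lower()
    # Remove page markers first so the whitespace pass can close the gaps they leave.
    text = re.sub(r'page\s+\d+\s+of\s+\d+', '', text)
    text = re.sub(r'page\s+\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text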

print("Cleaning text...\n")
for doc in documents:
    doc['clean_text'] = normalize_text(doc['raw_text'])
    print(f"Doc {doc['doc_id']}: reduced {len(doc['raw_text'])} -> {len(doc['clean_text'])} chars")
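
To see what the NFKD step does, the sketch below runs it on a made-up sample containing a ligature and an accented letter. NFKD expands the `ﬁ` ligature into `fi` and splits `é` into `e` plus a combining accent, so the string gets longer, not shorter. Stripping the combining marks afterwards is an optional extra step shown only for illustration; the notebook does not do it.

```python
import unicodedata

# "\ufb01" is the single-character "fi" ligature; "\u00e9" is a precomposed "é".
sample = "\ufb01brosis caf\u00e9"
decomposed = unicodedata.normalize("NFKD", sample)
print(len(sample), len(decomposed))

# Optional (not in the notebook): drop combining marks to get plain ASCII letters.
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(stripped)
```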


## ✂️ Step 7: Chunking (Sliding Window)
**What it does**:
- Splits long documents into smaller segments (chunks).
- Uses **overlap** so that text near a chunk boundary also appears at the start of the next chunk.

**Input**: Cleaned text, plus config (chunk size = 400 chars, overlap = 80 chars, so the window advances 320 chars per step).
**Output**: List of Chunk objects.
**Role**: Granularity Control (preparing text for the Embedding Model).


In [None]:
# @title ✂️ Execute Chunking
CHUNK_SIZE = 400
CHUNK_OVERLAP = 80

chunks = []
chunk_counter = 0  # Global ID across all documents; later doubles as the FAISS row ID

for doc in documents:
    text = doc['clean_text']
    # Advance by (size - overlap) so consecutive windows share CHUNK_OVERLAP characters.
    for i in range(0, len(text), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk_text = text[i : i + CHUNK_SIZE]
        if len(chunk_text) < 50: continue # Skip noise

        chunks.append({
            "chunk_id": chunk_counter,
            "doc_id": doc['doc_id'],
            "text": chunk_text,
            "source": doc['source'],
            "position": i  # Character offset of the chunk within clean_text
        })
        chunk_counter += 1

print(f"✅ Generated {len(chunks)} chunks.")
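
The window arithmetic can be checked without any text by computing only the character spans. The sketch below mirrors the loop above, including the 50-character minimum; `window_spans` is a hypothetical helper, not part of the notebook.

```python
def window_spans(text_length, chunk_size=400, overlap=80, min_length=50):
    """Return the (start, end) character spans a sliding window would produce."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    spans = []
    for start in range(0, text_length, step):
        end = min(start + chunk_size, text_length)
        if end - start < min_length:
            continue  # Same short-chunk filter as the chunking cell
        spans.append((start, end))
    return spans

# A 1000-character document: the final 40-character window is dropped.
print(window_spans(1000))
# A 400-character document: the second span lies entirely inside the first.
print(window_spans(400))
```

The second case shows that a short trailing window can duplicate text already covered by the previous chunk, which may matter when counting distinct passages.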


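## 🧠 Step 8: Load Embedding Model
**What it does**:
- Loads `sentence-transformers/all-MiniLM-L6-v2`.
- This model converts text into 384-dimensional vectors.

**Critical**: This MUST match the model used in the Inference Notebook. Query vectors from a different model live in a different embedding space, so distances against this index would be meaningless.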

**Input**: Model Name.
**Output**: Loaded Model in memory.
**Role**: Neural Encoder Loading.


In [None]:
# @title 🧠 Load Model
from sentence_transformers import SentenceTransformer
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading model: {MODEL_NAME}...")
embedding_model = SentenceTransformer(MODEL_NAME)
print("✅ Model loaded.")


## 🔢 Step 9: Generate Embeddings
**What it does**:
- Passes all text chunks through the Neural Network.
- Returns a matrix of floating point numbers.

**Input**: List of text strings.
**Output**: Numpy array of shape `(Num_Chunks, 384)`.
**Role**: Vectorization.


In [None]:
# @title 🔢 Compute Embeddings
import numpy as np
chunk_texts = [c['text'] for c in chunks]
print(f"Encoding {len(chunk_texts)} chunks...")
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, convert_to_numpy=True)
embeddings = embeddings.astype(np.float32)
print(f"✅ Embeddings shape: {embeddings.shape}")


## 📚 Step 10: Create FAISS Index
**What it does**:
- Creates a flat `IndexFlatL2` index, which stores the vectors as-is and answers queries by exact, exhaustive L2 (Euclidean) distance comparison.
- Adds the vectors to this index.

**Input**: Embeddings Matrix.
**Output**: Populated FAISS Index.
**Role**: Database Creation.


In [None]:
# @title 📚 Build Index
import faiss
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print(f"✅ FAISS Index created. Total vectors: {index.ntotal}")
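
Conceptually, a flat L2 index compares the query against every stored vector and returns the closest ones. The pure-Python sketch below illustrates that computation on toy 2-D vectors. It is not how FAISS is implemented, and `flat_l2_search` is a hypothetical name. Like `IndexFlatL2`, it reports squared distances.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustively rank `vectors` by squared Euclidean distance to `query`."""
    scored = []
    for row_id, vec in enumerate(vectors):
        dist_sq = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist_sq, row_id))
    scored.sort()  # Smallest distance first
    return scored[:k]

# Toy data: three stored vectors and one query.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
results = flat_l2_search(stored, [0.5, 0.0], k=2)
print(results)
```

Each result pairs a squared distance with the row position of the stored vector, which is the same ID the metadata dictionaries in the next step are keyed by.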


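## 🗂️ Step 11: Configure Metadata
**What it does**:
- Creates "sidecar" dictionaries that link every vector ID back to its original text and source file. This works because `chunk_id` counts up from 0 in the same order the embeddings were added, so it equals the vector's row position in the FAISS index.
- FAISS stores only the vectors; these dictionaries store the text and provenance.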

**Input**: Chunk list.
**Output**: `metadata_store` and `text_store` dictionaries.
**Role**: Data Mapping.


In [None]:
# @title 🗂️ Prepare Meta Stores
metadata_store = {}
text_store = {}
for chunk in chunks:
    c_id = chunk['chunk_id']
    metadata_store[c_id] = {"doc_id": chunk['doc_id'], "source": chunk['source'], "position": chunk['position']}
    text_store[c_id] = chunk['text']
print(f"✅ Prepared metadata for {len(metadata_store)} items.")
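
At query time, the IDs returned by a FAISS search have to be turned back into text and source information through these dictionaries. A minimal sketch of that lookup on toy data, assuming the IDs arrive as plain integers (`resolve_hits` is a hypothetical helper). FAISS pads its results with `-1` when fewer than `k` vectors are available, so those entries are skipped.

```python
def resolve_hits(ids, metadata_store, text_store):
    """Map vector IDs to their text and metadata, skipping -1 placeholders."""
    hits = []
    for row_id in ids:
        if row_id == -1 or row_id not in text_store:
            continue
        entry = dict(metadata_store[row_id])  # Copy so the store is not mutated
        entry["text"] = text_store[row_id]
        hits.append(entry)
    return hits

# Toy stores shaped like the ones built above; the file name is made up.
toy_metadata = {
    0: {"doc_id": 0, "source": "a.pdf", "position": 0},
    1: {"doc_id": 0, "source": "a.pdf", "position": 320},
}
toy_texts = {0: "first chunk", 1: "second chunk"}

hits = resolve_hits([1, -1, 0], toy_metadata, toy_texts)
print(hits)
```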


## 💾 Step 12: Save to Disk
**What it does**:
- Serializes (saves) all artifacts to the `./vector_store` folder.
- Zips the folder for easy download.

**Input**: FAISS index, `metadata_store`, `text_store`.
**Output**: `vector_store_backup.zip`.
**Role**: Persistence.


In [None]:
# @title 💾 Save & Zip
import pickle
import json
from datetime import datetime

index_path = os.path.join(OUTPUT_DIR, 'index.faiss')
metadata_path = os.path.join(OUTPUT_DIR, 'metadata.pkl')
texts_path = os.path.join(OUTPUT_DIR, 'texts.pkl')

faiss.write_index(index, index_path)
with open(metadata_path, 'wb') as f: pickle.dump(metadata_store, f)
with open(texts_path, 'wb') as f: pickle.dump(text_store, f)

print("✅ All artifacts saved.")
shutil.make_archive('vector_store_backup', 'zip', OUTPUT_DIR)
print("📦 Created 'vector_store_backup.zip' for download.")
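
Step 8 requires the inference side to load the same embedding model, yet the saved artifacts do not record which model built the index. One possible way to capture that is a small JSON manifest written next to the other files. The file name `config.json`, the helper `write_manifest`, and the manifest keys are all assumptions for illustration; the inference notebook is not known to read such a file.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def write_manifest(output_dir, model_name, dimension, chunk_size, chunk_overlap, num_vectors):
    """Write a JSON manifest describing how the index was built; return its path."""
    manifest = {
        "model_name": model_name,
        "dimension": dimension,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "num_vectors": num_vectors,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    path = os.path.join(output_dir, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return path

# Demonstration in a temporary directory; in the notebook this would be OUTPUT_DIR.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = write_manifest(
        tmp, "sentence-transformers/all-MiniLM-L6-v2", 384, 400, 80, num_vectors=3
    )
    with open(manifest_path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["model_name"], loaded["dimension"])
```

For the manifest to end up inside the zip, it would have to be written before the `shutil.make_archive` call.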
