# 🏥 Offline Vector Indexing Pipeline (Medical Education)

## 📋 Overview
This notebook is the **Preprocessing Factory** for the Adaptive RAG system.
It transforms raw medical documents (PDFs/Images) into a searchable FAISS **Vector Index**, plus the lookup tables the main AI system needs to map search hits back to their text.

### 🔄 Pipeline Workflow
1.  **Input**: Raw PDF/Image files.
2.  **OCR**: Extract text using Tesseract.
3.  **Cleaning**: Normalize and remove noise.
4.  **Chunking**: Split text into overlapping sliding windows.
5.  **Embedding**: Convert text chunks into Vectors (Numbers).
6.  **Indexing**: Build a FAISS Vector Database.
7.  **Output**: Save artifacts (`index.faiss`, `metadata.pkl`, `texts.pkl`).

---


## 🛠️ Step 1: Install Dependencies
**What it does**:
- In Colab, installs the OCR tools (Tesseract, Poppler).
- In Colab, installs the Python libraries for PDF processing (`pdf2image`), OCR (`pytesseract`), vector search (`faiss-cpu`), and embeddings (`sentence-transformers`).
- In a local environment, installs nothing and prints a reminder to install these yourself.

**Input**: None
**Output**: System tools and Python libraries installed (Colab only).
**Role**: Infrastructure Setup.


In [1]:
# @title 📦 Install Dependencies
import sys

if 'google.colab' in sys.modules:
    !apt-get update -qq
    !apt-get install -y poppler-utils tesseract-ocr
    !pip install -q faiss-cpu gradio ipykernel jupyter numpy opencv-python pdf2image pickle-mixin pillow pytesseract requests scikit-learn sentence-transformers tqdm
    print("✅ Libraries installed successfully (Colab).")
else:
    print("✅ Skipping apt-get (Local Environment detected).")
    print("Ensure Tesseract OCR and Poppler are installed on your system.")
    print("Run: pip install -r requirements.txt")


✅ Skipping apt-get (Local Environment detected).
Ensure Tesseract OCR and Poppler are installed on your system.
Run: pip install -r requirements.txt
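
In a local environment the cell above only prints reminders. A minimal sketch of an automated check follows, assuming the Tesseract and Poppler executables are called `tesseract` and `pdftoppm` and are expected on the `PATH`. The helper name `check_environment` and the two lists are illustrative, not part of the notebook.

```python
import importlib.util
import shutil

# Executables the pipeline shells out to (assumed names on PATH).
REQUIRED_TOOLS = ["tesseract", "pdftoppm"]
# Import names of the Python packages used in the later steps.
REQUIRED_MODULES = ["pdf2image", "pytesseract", "faiss", "sentence_transformers", "numpy"]


def check_environment(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return the tools and modules that could not be found."""
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return {"tools": missing_tools, "modules": missing_modules}


report = check_environment()
print("Missing tools:", report["tools"] or "none")
print("Missing modules:", report["modules"] or "none")
```

Tools that are installed but not on the `PATH`, such as a Conda copy of Tesseract that Step 2 locates by file path, would still be reported as missing by this check.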


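## ⚙️ Step 2: Environment Setup
**What it does**:
- Deletes any existing `./vector_store` directory and recreates it empty, so every run starts from a clean slate.
- Locates the Tesseract executable (Conda install first, then the standard Windows paths) and confirms it responds with a version number.

**Input**: None
**Output**: An empty output directory and a working `pytesseract` configuration.
**Role**: Workspace Preparation.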


In [2]:
# @title ⚙️ Environment Setup
import os
import shutil
import logging
import sys

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

OUTPUT_DIR = './vector_store'
if os.path.exists(OUTPUT_DIR):
    shutil.rmtree(OUTPUT_DIR) # Clean start
os.makedirs(OUTPUT_DIR)
print(f"📂 Created output directory: {OUTPUT_DIR}")

# --- TESSERACT SETUP ---
try:
    import pytesseract
    # Check for Conda-installed Tesseract
    # Typically in <Env>/Library/bin/tesseract.exe on Windows
    conda_prefix = sys.prefix
    conda_tesseract = os.path.join(conda_prefix, 'Library', 'bin', 'tesseract.exe')

    if os.path.exists(conda_tesseract):
        pytesseract.pytesseract.tesseract_cmd = conda_tesseract
        print(f"🔍 Found Conda Tesseract at: {conda_tesseract}")
    elif os.name == 'nt':
        # Fallback to standard install locations
        possible_paths = [
            r'C:\Program Files\Tesseract-OCR\tesseract.exe',
            r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe',
            os.path.join(os.getenv('LOCALAPPDATA', ''), r'Tesseract-OCR\tesseract.exe')
        ]
        for p in possible_paths:
            if os.path.exists(p):
                pytesseract.pytesseract.tesseract_cmd = p
                print(f"🔍 Found System Tesseract at: {p}")
                break

    version = pytesseract.get_tesseract_version()
    print(f"✅ Tesseract OCR is available (Version: {version}).")
except Exception as e:
    print("❌ Tesseract OCR not found. Please run: conda install -c conda-forge tesseract")
    print(f"Details: {e}")


📂 Created output directory: ./vector_store
🔍 Found Conda Tesseract at: v:\anaconda3\envs\venv\Library\bin\tesseract.exe
✅ Tesseract OCR is available (Version: 5.5.1).


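## 🔢 Step 3: Input Configuration
**What it does**:
- Asks how many files you plan to process.
- Stores the answer in `num_documents`; non-numeric input falls back to 1.
- The number is only shown in the upload prompt of Step 4. It does not limit how many files are actually processed.

**Input**: User types a number (e.g., "3").
**Output**: `num_documents` variable set.
**Role**: Job Configuration.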


In [3]:
# @title 🔢 Input Count
try:
    num_documents = int(input("Enter number of documents to process: "))
    print(f"📄 We will process {num_documents} documents.")
except ValueError:
    num_documents = 1
    print("⚠️ Invalid input. Defaulting to 1 document.")


📄 We will process 6 documents.

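The count entered here is never used to restrict the file list (6 was entered, 5 files were processed). If a hard limit is wanted, one possible approach is sketched below. `parse_document_count` and `limit_files` are hypothetical helpers, not part of the notebook; they reject non-positive values and truncate the list.

```python
def parse_document_count(raw, default=1):
    """Parse user input into a positive integer, falling back to `default`."""
    try:
        value = int(raw.strip())
    except ValueError:
        return default
    # Zero and negative counts make no sense for a batch, so use the default.
    return value if value > 0 else default


def limit_files(files, count):
    """Keep at most `count` files, preserving their order."""
    return list(files)[:count]


print(parse_document_count("6"))                     # → 6
print(parse_document_count("six"))                   # → 1
print(limit_files(["a.pdf", "b.pdf", "c.pdf"], 2))   # → ['a.pdf', 'b.pdf']
```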

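## 📤 Step 4: Upload Documents
**What it does**:
- In Colab, opens the file picker so you can upload PDF, JPG, or PNG files.
- In a local environment, scans the `./raw_documents` folder for those file types instead (the folder is created if it is missing).

**Input**: Files from your computer.
**Output**: `source_files`, the list of file paths to process.
**Role**: Data Ingestion.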


In [4]:
# @title 📤 Upload Files (Hybrid)
try:
    from google.colab import files
    print(f"Please upload {num_documents} file(s) (PDF, JPG, PNG)...")
    uploaded = files.upload()
    source_files = list(uploaded.keys())
except ImportError:
    import os
    print("💻 Local Environment detected. Scanning './raw_documents' folder...")
    input_dir = './raw_documents'
    if not os.path.exists(input_dir):
        os.makedirs(input_dir)
        print(f"⚠️ Created {input_dir}. Please put your files there and re-run.")
        source_files = []
    else:
        source_files = [f for f in os.listdir(input_dir) if f.lower().endswith(('.pdf', '.jpg', '.jpeg', '.png'))]
        source_files = [os.path.join(input_dir, f) for f in source_files]

if len(source_files) == 0:
    print("⚠️ No files found. Please upload to Colab or place files in './raw_documents' locally.")
else:
    print("\nFiles to be processed:")
    for i, f in enumerate(source_files):
        print(f"{i}: {f}")


💻 Local Environment detected. Scanning './raw_documents' folder...

Files to be processed:
0: ./raw_documents\bacterial.pdf
1: ./raw_documents\comon_overview.pdf
2: ./raw_documents\fungal.pdf
3: ./raw_documents\parasitic_infection.pdf
4: ./raw_documents\viral.pdf
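
`os.listdir` returns entries in arbitrary order, so the `doc_id` assigned to each file in the next step can differ between machines. A small sketch of a deterministic scan follows, assuming the same folder layout. `list_source_files` and `SUPPORTED_EXTENSIONS` are illustrative names, and the temporary folder only stands in for `./raw_documents`.

```python
import tempfile
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def list_source_files(input_dir):
    """Return supported files in `input_dir`, sorted by path for stable doc_ids."""
    folder = Path(input_dir)
    if not folder.is_dir():
        return []
    return sorted(
        str(p) for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.PDF", "a.png", "notes.txt"]:
        (Path(tmp) / name).write_text("")
    found = [Path(p).name for p in list_source_files(tmp)]

print(found)  # → ['a.png', 'b.PDF']
```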


## 🔍 Step 5: OCR (Text Extraction)
**What it does**:
- Converts PDFs into images.
- Uses Tesseract OCR to read text from those images.
- Handles both PDFs and raw Images (JPG/PNG).

**Input**: Raw files (PDF/Image).
**Output**: Raw text strings for each document.
**Role**: Digitization (converting pixels to text).


In [5]:
# @title 🔍 Run OCR
from pdf2image import convert_from_path
import pytesseract
from tqdm import tqdm
from PIL import Image

documents = [] # [{'doc_id', 'source', 'raw_text'}]
print("🚀 Starting OCR extraction...")

for doc_idx, filename in enumerate(source_files):
    print(f"\n📄 Processing {filename} ({doc_idx+1}/{len(source_files)})...")
    full_text = ""
    file_ext = filename.split('.')[-1].lower()

    try:
        if file_ext == 'pdf':
            # Assuming poppler is in PATH (installed via Conda)
            images = convert_from_path(filename)

            for image in images:
                full_text += pytesseract.image_to_string(image) + "\n"
        elif file_ext in ['jpg', 'jpeg', 'png']:
            full_text += pytesseract.image_to_string(Image.open(filename))
        else:
            print(f"⚠️ Skipping unsupported file: {filename}")
            continue

        documents.append({"doc_id": doc_idx, "source": filename, "raw_text": full_text})
        print(f"   ✅ Extracted {len(full_text)} characters.")
    except Exception as e:
        print(f"   ❌ Error: {e}")
        if "poppler" in str(e).lower():
            print("   💡 TIP: Ensure Poppler is installed in your Conda environment (conda install -c conda-forge poppler).")

print(f"\n🏁 OCR Complete. Processed {len(documents)} docs.")


🚀 Starting OCR extraction...

📄 Processing ./raw_documents\bacterial.pdf (1/5)...
   ✅ Extracted 41795 characters.

📄 Processing ./raw_documents\comon_overview.pdf (2/5)...
   ✅ Extracted 27768 characters.

📄 Processing ./raw_documents\fungal.pdf (3/5)...
   ✅ Extracted 10284 characters.

📄 Processing ./raw_documents\parasitic_infection.pdf (4/5)...
   ✅ Extracted 86982 characters.

📄 Processing ./raw_documents\viral.pdf (5/5)...
   ✅ Extracted 150177 characters.

🏁 OCR Complete. Processed 5 docs.


## 💾 Step 5a: Save Extracted Text (Debug)
**What it does**: Saves the raw OCR text to `.txt` files in `./extracted_texts` so you can inspect them.


In [6]:
# @title 💾 Save Extracted Text
import os

DEBUG_DIR = './extracted_texts'
os.makedirs(DEBUG_DIR, exist_ok=True)

print(f"💾 Saving extracted text to {DEBUG_DIR}...")
for doc in documents:
    filename = os.path.basename(doc['source']) + ".txt"
    save_path = os.path.join(DEBUG_DIR, filename)
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write(doc['raw_text'])
    print(f"   📝 Saved: {filename}")

print("\n✅ Text saved for inspection.")


💾 Saving extracted text to ./extracted_texts...
   📝 Saved: bacterial.pdf.txt
   📝 Saved: comon_overview.pdf.txt
   📝 Saved: fungal.pdf.txt
   📝 Saved: parasitic_infection.pdf.txt
   📝 Saved: viral.pdf.txt

✅ Text saved for inspection.


## 🧹 Step 6: Noise Removal
**What it does**:
- Normalizes text (Unicode NFKD normalization, lowercasing).
- Collapses runs of whitespace, including newlines, into single spaces.
- Removes page markers such as "page 1 of 5" or "page 3".

**Input**: Raw OCR text.
**Output**: Cleaned text stored as `clean_text` on each document.
**Role**: Data Cleaning.


In [7]:
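# @title 🧹 Clean Text
import re
import unicodedata

def normalize_text(text):
    # Decompose compatibility characters (e.g. ligatures) that OCR may emit.
    text = unicodedata.normalize('NFKD', text)
    text = text.lower()
    # Collapse every whitespace run, including newlines, into a single space.
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove page markers; the longer pattern runs first so " of N" is not left behind.
    # Note: whitespace was collapsed earlier, so a removed marker can leave a double space.
    text = re.sub(r'page \d+ of \d+', '', text)
    text = re.sub(r'page \d+', '', text)
    return text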

print("Cleaning text...\n")
for doc in documents:
    doc['clean_text'] = normalize_text(doc['raw_text'])
    print(f"Doc {doc['doc_id']}: reduced {len(doc['raw_text'])} -> {len(doc['clean_text'])} chars")


Cleaning text...

Doc 0: reduced 41795 -> 41306 chars
Doc 1: reduced 27768 -> 27365 chars
Doc 2: reduced 10284 -> 10049 chars
Doc 3: reduced 86982 -> 86138 chars
Doc 4: reduced 150177 -> 148755 chars
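
Because `normalize_text` collapses whitespace before it strips page markers, each removed marker can leave a double space behind, and the patterns also match inside longer words such as "homepage 3". A possible variant is sketched below. It is not a drop-in replacement (it would change the character counts reported above), and `normalize_text_v2` and `PAGE_MARKER` are illustrative names.

```python
import re
import unicodedata

# "page 3 of 12" or "page 3" as whole words; \s+ also covers line breaks.
PAGE_MARKER = re.compile(r'\bpage\s+\d+(?:\s+of\s+\d+)?\b')


def normalize_text_v2(text):
    text = unicodedata.normalize('NFKD', text).lower()
    # Replace markers with a space first, then collapse whitespace once at the end.
    text = PAGE_MARKER.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()


sample = "Fever  and\nrash.\nPage 2 of 5\nHomepage 3 lists causes."
print(normalize_text_v2(sample))  # → fever and rash. homepage 3 lists causes.
```

Collapsing whitespace last guarantees the result never contains two consecutive spaces, whatever was removed before it.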


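## ✂️ Step 7: Chunking (Sliding Window)
**What it does**:
- Splits long documents into smaller segments (chunks) of up to 400 characters.
- Uses an 80-character **Overlap**, so a new chunk starts every 320 characters and context isn't cut off at the edge.
- Drops any chunk shorter than 50 characters as noise.

**Input**: Clean text plus config (Chunk Size=400 chars, Overlap=80 chars).
**Output**: List of chunk dictionaries (`chunk_id`, `doc_id`, `text`, `source`, `position`).
**Role**: Granularity Control (preparing text for the Embedding Model).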

In [8]:
# @title ✂️ Execute Chunking
CHUNK_SIZE = 400
CHUNK_OVERLAP = 80

chunks = []
chunk_counter = 0

for doc in documents:
    text = doc['clean_text']
    # Advance by CHUNK_SIZE - CHUNK_OVERLAP (320) so consecutive chunks share 80 characters.
    for i in range(0, len(text), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk_text = text[i : i + CHUNK_SIZE]
        if len(chunk_text) < 50:
            continue  # Skip noise

        chunks.append({
            "chunk_id": chunk_counter,
            "doc_id": doc['doc_id'],
            "text": chunk_text,
            "source": doc['source'],
            "position": i
        })
        chunk_counter += 1

print(f"✅ Generated {len(chunks)} chunks.")


✅ Generated 982 chunks.

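The chunk count can be checked by hand. With a window of 400 and an overlap of 80, a new chunk starts every 320 characters, and a final fragment shorter than 50 characters is dropped. The sketch below applies that rule to the cleaned lengths reported in Step 6; the helper `expected_chunks` is illustrative and not part of the notebook.

```python
CHUNK_SIZE = 400
CHUNK_OVERLAP = 80
MIN_CHUNK_CHARS = 50  # same threshold as the "skip noise" check


def expected_chunks(text_length):
    """Count the chunks the sliding window keeps for a text of this length."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    count = 0
    for start in range(0, text_length, step):
        chunk_length = min(CHUNK_SIZE, text_length - start)
        if chunk_length >= MIN_CHUNK_CHARS:
            count += 1
    return count


clean_lengths = [41306, 27365, 10049, 86138, 148755]  # from the Step 6 output
per_doc = [expected_chunks(n) for n in clean_lengths]
print(per_doc, sum(per_doc))  # → [129, 86, 32, 270, 465] 982
```

Only the first document loses a chunk to the 50-character rule: its last window would hold 26 characters.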

## 🧠 Step 8: Load Embedding Model
**What it does**:
- Loads `sentence-transformers/all-MiniLM-L6-v2`.
- This model converts text into 384-dimensional vectors.

**Critical**: This MUST match the model used in the Inference Notebook.

**Input**: Model Name.
**Output**: Loaded Model in memory.
**Role**: Neural Encoder Loading.


In [9]:
# @title 🧠 Load Model
from sentence_transformers import SentenceTransformer
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading model: {MODEL_NAME}...")
embedding_model = SentenceTransformer(MODEL_NAME)
print("✅ Model loaded.")


2026-01-28 13:42:53,715 - INFO - Use pytorch device_name: cpu
2026-01-28 13:42:53,717 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


Loading model: sentence-transformers/all-MiniLM-L6-v2...


BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.

✅ Model loaded.


## 🔢 Step 9: Generate Embeddings
**What it does**:
- Passes all text chunks through the embedding model in batches.
- Returns a `float32` matrix with one 384-dimensional row per chunk.

**Input**: List of text strings.
**Output**: Numpy array of shape `(Num_Chunks, 384)`.
**Role**: Vectorization.


In [10]:
# @title 🔢 Compute Embeddings
import numpy as np
chunk_texts = [c['text'] for c in chunks]
print(f"Encoding {len(chunk_texts)} chunks...")
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, convert_to_numpy=True)
embeddings = embeddings.astype(np.float32)
print(f"✅ Embeddings shape: {embeddings.shape}")


Encoding 982 chunks...


Batches:   0%|          | 0/31 [00:00<?, ?it/s]

✅ Embeddings shape: (982, 384)

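A quick back-of-the-envelope check of the numbers in this output, assuming `encode` ran with its default `batch_size` of 32 (the cell does not override it):

```python
import math

num_chunks = 982
dimension = 384
bytes_per_float32 = 4
batch_size = 32  # assumed: the default of SentenceTransformer.encode

# Number of forward passes shown by the "Batches" progress bar.
num_batches = math.ceil(num_chunks / batch_size)
# Raw size of the embeddings matrix in memory.
matrix_bytes = num_chunks * dimension * bytes_per_float32

print(num_batches)                     # → 31
print(matrix_bytes)                    # → 1508352
print(round(matrix_bytes / 2**20, 2))  # → 1.44 (MiB)
```

At roughly 1.5 KiB per chunk, the matrix stays small enough to hold in memory comfortably for a corpus of this size.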

## 📚 Step 10: Create FAISS Index
**What it does**:
- Creates a flat FAISS index (`IndexFlatL2`) that performs exact, exhaustive L2 (Euclidean) distance search.
- Adds the vectors to this index; each vector's ID is its row position in the embeddings matrix.

**Input**: Embeddings Matrix.
**Output**: Populated FAISS Index.
**Role**: Database Creation.


In [11]:
# @title 📚 Build Index
import faiss
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print(f"✅ FAISS Index created. Total vectors: {index.ntotal}")


2026-01-28 13:46:39,075 - INFO - Loading faiss with AVX2 support.
2026-01-28 13:46:39,487 - INFO - Successfully loaded faiss with AVX2 support.


✅ FAISS Index created. Total vectors: 982

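`IndexFlatL2` is an exhaustive index: a query is compared against every stored vector and the closest ones are returned, with distances reported as squared Euclidean values. The pure-Python sketch below mimics that behaviour on toy integer vectors for illustration only. `flat_l2_search` is a hypothetical helper, not a FAISS function, and it ignores the batching and SIMD optimizations of the real library.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return (distance, position) pairs of the k nearest vectors, closest first."""
    scored = [(squared_l2(vec, query), pos) for pos, vec in enumerate(vectors)]
    scored.sort()
    return scored[:k]


toy_vectors = [[0, 0], [2, 0], [0, 3], [5, 5]]
print(flat_l2_search(toy_vectors, query=[2, 1], k=2))  # → [(1, 1), (5, 0)]
```

The returned positions play the same role as the IDs FAISS hands back: they are row numbers in insertion order.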

## 🗂️ Step 11: Configure Metadata
**What it does**:
- Creates "Sidecar" dictionaries that link every Vector ID back to its original text and source file.
- FAISS stores only the vectors; these dictionaries store the chunk text, source path, and character position.

**Input**: Chunk list.
**Output**: `metadata_store` and `text_store` dictionaries.
**Role**: Data Mapping.


In [12]:
# @title 🗂️ Prepare Meta Stores
metadata_store = {}
text_store = {}
for chunk in chunks:
    c_id = chunk['chunk_id']
    metadata_store[c_id] = {
        "doc_id": chunk['doc_id'],
        "source": chunk['source'],
        "position": chunk['position'],
    }
    text_store[c_id] = chunk['text']
print(f"✅ Prepared metadata for {len(metadata_store)} items.")


✅ Prepared metadata for 982 items.

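At query time FAISS returns row positions, and because the chunks were embedded and added in `chunk_id` order, those positions double as keys into the two stores. A sketch of that lookup follows, using small stand-in dictionaries with the same layout. `resolve_hits` and the demo data are hypothetical; FAISS fills unused result slots with `-1`, which the sketch skips.

```python
def resolve_hits(ids, distances, metadata_store, text_store):
    """Turn one row of search output into readable hit records."""
    hits = []
    for idx, dist in zip(ids, distances):
        idx = int(idx)  # search results arrive as NumPy integers
        if idx < 0 or idx not in text_store:
            continue  # -1 marks an empty result slot
        hits.append({
            "chunk_id": idx,
            "distance": float(dist),
            "text": text_store[idx],
            **metadata_store[idx],
        })
    return hits


# Stand-in stores with the same layout as metadata_store / text_store.
demo_metadata = {
    0: {"doc_id": 0, "source": "bacterial.pdf", "position": 0},
    1: {"doc_id": 0, "source": "bacterial.pdf", "position": 320},
}
demo_texts = {0: "first chunk text", 1: "second chunk text"}

hits = resolve_hits([1, -1], [0.42, 0.0], demo_metadata, demo_texts)
print(hits[0]["source"], hits[0]["position"])  # → bacterial.pdf 320
```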

## 💾 Step 12: Save to Disk
**What it does**:
- Serializes (saves) the index and both dictionaries to the `./vector_store` folder.
- Zips the folder for easy download.

**Input**: Index and dictionaries.
**Output**: `index.faiss`, `metadata.pkl`, `texts.pkl`, and `vector_store_backup.zip`.
**Role**: Persistence.


In [13]:
# @title 💾 Save & Zip
import os
import pickle
import shutil

import faiss

index_path = os.path.join(OUTPUT_DIR, 'index.faiss')
metadata_path = os.path.join(OUTPUT_DIR, 'metadata.pkl')
texts_path = os.path.join(OUTPUT_DIR, 'texts.pkl')

faiss.write_index(index, index_path)
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata_store, f)
with open(texts_path, 'wb') as f:
    pickle.dump(text_store, f)

print("✅ All artifacts saved.")
# Archive the whole output folder next to the notebook.
shutil.make_archive('vector_store_backup', 'zip', OUTPUT_DIR)
print("📦 Created 'vector_store_backup.zip' for download.")


✅ All artifacts saved.
📦 Created 'vector_store_backup.zip' for download.

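Step 8 stresses that the inference side must load the same embedding model, but nothing in the saved artifacts records which model was used. One way to make that checkable is to store a small config file next to the index. This is a sketch under assumptions: the file name `config.json`, its keys, and both helper names are hypothetical and not part of the artifacts this notebook writes.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_index_config(output_dir, model_name, dimension, chunk_size, chunk_overlap, num_vectors):
    """Record how the index was built so the inference side can verify it."""
    config = {
        "model_name": model_name,
        "dimension": dimension,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "num_vectors": num_vectors,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    path = os.path.join(output_dir, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


def check_index_config(path, expected_model_name):
    """Load the config and fail loudly if the embedding model differs."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    if config["model_name"] != expected_model_name:
        raise ValueError(
            f"Index was built with {config['model_name']!r}, not {expected_model_name!r}"
        )
    return config


# Demo in a throwaway folder, using the values from this run.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_index_config(
        tmp, "sentence-transformers/all-MiniLM-L6-v2", 384, 400, 80, 982
    )
    loaded = check_index_config(config_path, "sentence-transformers/all-MiniLM-L6-v2")

print(loaded["dimension"], loaded["num_vectors"])  # → 384 982
```

In the notebook, such a call would need to run before `shutil.make_archive` so that the zip includes the config file.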