# ***Multi-Modal Document Retrieval System***

---



**Step 1 - Importing and Installing Libraries**

In [1]:
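# Install the Tesseract OCR engine (system package) and the Python libraries.
# -y answers the apt confirmation prompt, which a notebook cell cannot do interactively.
!sudo apt-get install -y tesseract-ocr
!pip install pytesseract sentence-transformers faiss-cpu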

import pytesseract
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np
import os

print("Colab Environment Setup Complete!")

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: pytesseract, faiss-cpu
Successfully installed faiss-cpu-1.13.2 pytesseract-0.3.13
Colab Environment Setup Complete!
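
`pytesseract` is only a wrapper: it shells out to the `tesseract` executable installed by the `apt-get` line, so a successful `pip install` alone does not guarantee that OCR will work. As a minimal sketch (the helper name `tesseract_available` is hypothetical and not part of the notebook), the binary can be checked with the standard library before running Step 3:

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary_name) is not None


if not tesseract_available():
    print("Tesseract binary not found; run the apt-get install step first.")
```

If the check fails outside Colab, the install command may differ by platform; the `apt-get` line above is specific to Debian/Ubuntu-style systems.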


**Step 2 - Generate Dummy Document Dataset**

In [2]:
from PIL import Image, ImageDraw, ImageFont
import os

dataset_folder = "dataset"
os.makedirs(dataset_folder, exist_ok=True)

def create_image(filename, text_content):
    img = Image.new('RGB', (800, 400), color=(255, 255, 255))
    d = ImageDraw.Draw(img)

    try:
        font = ImageFont.truetype("LiberationSans-Regular.ttf", 24)
    except OSError:
        # Font file not found on this system; fall back to Pillow's built-in font.
        font = ImageFont.load_default()

    d.text((20, 20), text_content, fill=(0, 0, 0), font=font)

    path = os.path.join(dataset_folder, filename)
    img.save(path)
    print(f"Created: {path}")

text_invoice = """
INVOICE #001
Date: 2023-10-25
To: John Doe

Items:
1. Web Development - $500
2. Server Setup - $150

Total Due: $650
"""
create_image("invoice_001.png", text_invoice)

text_meeting = """
MINUTES OF MEETING
Project: Alpha Launch
Date: 2023-11-01

- The backend API is 90% complete.
- Need to fix the login bug on the frontend.
- Deployment scheduled for next Friday.
"""
create_image("meeting_notes.png", text_meeting)

text_support = """
SUPPORT TICKET #404
User: Alice Smith
Issue: Cannot access the database.

Resolution:
- Reset the firewall rules.
- Restarted the service.
- Connection confirmed stable.
"""
create_image("support_log.png", text_support)

print("\nDataset generation complete.")

Created: dataset/invoice_001.png
Created: dataset/meeting_notes.png
Created: dataset/support_log.png

Dataset generation complete.


**Step 3 - Extract Text from Images (OCR Pipeline)**

In [3]:
import pytesseract
from PIL import Image
import os
import glob

image_files = glob.glob("dataset/*.png")

extracted_data = {}

print(f"Found {len(image_files)} documents. Starting OCR processing...\n")

for img_path in image_files:
    filename = os.path.basename(img_path)

    try:
        image = Image.open(img_path)

        text = pytesseract.image_to_string(image)

        clean_text = text.strip()

        extracted_data[filename] = clean_text

        # Preview the first two lines; show '...' when OCR returns fewer.
        preview = clean_text.splitlines()
        print(f"✅ Successfully read: {filename}")
        print("   --- Content Preview ---")
        print(f"   {preview[0] if preview else '...'}")
        print(f"   {preview[1] if len(preview) > 1 else '...'}")
        print("   -----------------------\n")

    except Exception as e:
        print(f"❌ Error reading {filename}: {e}")

print("OCR Extraction Complete! We now have the text ready for AI processing.")

Found 3 documents. Starting OCR processing...

✅ Successfully read: meeting_notes.png
   --- Content Preview ---
   MINUTES OF MEETING
   Project: Alpha Launch
   -----------------------

✅ Successfully read: support_log.png
   --- Content Preview ---
   SUPPORT TICKET #404
   User: Alice Smith
   -----------------------

✅ Successfully read: invoice_001.png
   --- Content Preview ---
   INVOICE #001
   Date: 2023-10-25
   -----------------------

OCR Extraction Complete! We now have the text ready for AI processing.
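
The loop above only calls `strip()` on the OCR result. On real scans Tesseract often returns stray blank lines and irregular spacing, so an optional clean-up pass before embedding may help. The following is a sketch under that assumption; `normalize_ocr_text` is a hypothetical helper, not something the notebook defines, and the clean synthetic images here may not need it:

```python
import re


def normalize_ocr_text(raw_text):
    """Collapse runs of spaces/tabs and drop empty lines from OCR output."""
    cleaned_lines = (
        re.sub(r"[ \t]+", " ", line).strip() for line in raw_text.splitlines()
    )
    return "\n".join(line for line in cleaned_lines if line)


sample = "  INVOICE   #001 \n\n\n Date:\t2023-10-25 \n"
print(normalize_ocr_text(sample))
```

If used, it would replace `text.strip()` when filling `extracted_data`, leaving the rest of the pipeline unchanged.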


**Step 4 - Generate Embeddings & Build Vector Index (FAISS)**

In [4]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

filenames = list(extracted_data.keys())
texts = list(extracted_data.values())

print("Loading AI Model (Sentence-Transformers)...")
model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"Generating embeddings for {len(texts)} documents...")
embeddings = model.encode(texts)

dimension = embeddings.shape[1]

index = faiss.IndexFlatL2(dimension)

index.add(embeddings)

print("\n------------------------------------------------")
print(f"✅ Indexing Complete!")
print(f"   Number of documents indexed: {index.ntotal}")
print(f"   Vector Dimension: {dimension}")
print("   We are ready to search!")
print("------------------------------------------------")

Loading AI Model (Sentence-Transformers)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating embeddings for 3 documents...

------------------------------------------------
✅ Indexing Complete!
   Number of documents indexed: 3
   Vector Dimension: 384
   We are ready to search!
------------------------------------------------
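
The index built above is a `faiss.IndexFlatL2`, which compares the query against every stored vector and reports squared Euclidean distances, so smaller scores mean closer matches. The NumPy sketch below mimics that computation on toy 2-D vectors standing in for the real 384-dimensional embeddings; the function name `brute_force_l2_search` is hypothetical. For unit-length vectors the squared distance equals 2 - 2·cos(angle), so it ranges from 0 (same direction) to 4 (opposite direction). Whether the embeddings in this notebook are unit-length is an assumption about the model, not something the notebook checks.

```python
import numpy as np


def brute_force_l2_search(doc_vectors, query_vector, k=2):
    """Return (squared_distances, indices) of the k nearest rows, closest first."""
    docs = np.asarray(doc_vectors, dtype=np.float32)
    query = np.asarray(query_vector, dtype=np.float32)
    # Squared Euclidean distance from the query to every document vector.
    squared_distances = np.sum((docs - query) ** 2, axis=1)
    order = np.argsort(squared_distances)[:k]
    return squared_distances[order], order


# Toy unit vectors: same direction, orthogonal, and opposite to the query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

dists, idx = brute_force_l2_search(docs, query, k=3)
print(idx, dists)  # indices 0, 1, 2 with squared distances 0, 2, 4
```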


**Step 5 - Interactive CLI Application**
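
The CLI cell in this step calls `search_documents`, but the cell that defines it does not appear in this export. The following is a minimal sketch of what such a function could look like, reconstructed from the printed results further down; it is not the original implementation. It assumes the `model`, `index`, `filenames`, and `texts` variables from Step 4 are still in scope. The `top_k` parameter, the returned list, and the 100-character snippet length are guesses inferred from that output.

```python
def search_documents(query, top_k=2):
    """Embed the query, look up the nearest documents, and print them."""
    # Assumes `model`, `index`, `filenames`, and `texts` exist from Step 4.
    query_vector = model.encode([query])  # one row per query
    k = min(top_k, len(filenames))
    distances, indices = index.search(query_vector, k)

    print(f"\n🔍 Query: '{query}'")
    print(f"   Top {k} Results:")
    print("   " + "-" * 30)

    results = []
    for distance, doc_id in zip(distances[0], indices[0]):
        # Flatten newlines so the snippet fits on one line (length is a guess).
        snippet = texts[doc_id].replace("\n", " ")[:100]
        print(f"   📄 Document: {filenames[doc_id]}")
        print(f"      Score: {distance:.4f}")  # lower = closer match
        print(f"      Snippet: {snippet}...")
        print("   " + "-" * 30)
        results.append((filenames[doc_id], float(distance)))
    return results
```

Returning the `(filename, score)` pairs in addition to printing them is a design choice made here so the function can be reused or tested outside the CLI loop.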

In [6]:
def run_cli_app():
    print("==================================================")
    print("   Multi-Modal Document Search Engine")
    print("   Type 'exit' to quit.")
    print("==================================================")

    while True:
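        # Strip surrounding whitespace so inputs like " exit " are recognised.
        user_query = input("\nEnter your search query: ").strip()

        if user_query.lower() in ('exit', 'quit', 'q'):
            print("Exiting application.")
            break

        # Ignore empty input and prompt again.
        if not user_query:
            continue

        # search_documents() must already be defined in an earlier cell.
        search_documents(user_query)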

run_cli_app()

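==================================================
   Multi-Modal Document Search Engine
   Type 'exit' to quit.
==================================================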

Enter your search query: How much is the bill

🔍 Query: 'How much is the bill'
   Top 2 Results:
   ------------------------------
   📄 Document: invoice_001.png
      Score: 1.0730
      Snippet: INVOICE #001 Date: 2023-10-25 To: John Doe  Items: 1. Web Development - $500 2. Server Setup - $150 ...
   ------------------------------
   📄 Document: meeting_notes.png
      Score: 1.9929
      Snippet: MINUTES OF MEETING Project: Alpha Launch Date: 2023-11-01  - The backend API is 90% complete. - Need...
   ------------------------------

Enter your search query: api

🔍 Query: 'api'
   Top 2 Results:
   ------------------------------
   📄 Document: meeting_notes.png
      Score: 1.5407
      Snippet: MINUTES OF MEETING Project: Alpha Launch Date: 2023-11-01  - The backend API is 90% complete. - Need...
   ------------------------------
   📄 Document: invoice_001.png
      Score: 1.7386
      Snippet: INVOICE #001 Date: 2023-10