# Vector Search System with FAISS

This notebook builds a vector search system with FAISS (Facebook AI Similarity Search) over the text, table, and image embeddings produced in the previous step.

## Features
- Load embeddings from previous step
- Create FAISS indices for fast similarity search
- Text-to-text search (find similar text chunks)
- Text-to-table search (find relevant tables)
- Text-to-image search (CLIP multimodal search)
- Unified search interface
- Result ranking and filtering


## 1. Setup and Imports


In [None]:
# Install FAISS (uncomment if needed)
# %pip install faiss-cpu


In [None]:
import json
import os
import numpy as np
from typing import List, Dict, Any, Union
import torch
from sentence_transformers import SentenceTransformer
import clip
import faiss
import matplotlib.pyplot as plt  # needed by the display helpers defined later
from PIL import Image

print("All imports successful!")


All imports successful!


## 2. Configuration and Load Data


In [2]:
# Configuration
EMBEDDINGS_DIR = "extracted_data/embeddings"
METADATA_DIR = "extracted_data/metadata"
INDICES_DIR = "extracted_data/indices"
os.makedirs(INDICES_DIR, exist_ok=True)

# Load models
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

text_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
print("Models loaded!")


Using device: cuda




Models loaded!


In [3]:
# Load embeddings and chunks
print("Loading embeddings...")
text_embeddings = np.load(os.path.join(EMBEDDINGS_DIR, "text_embeddings.npy"))
table_embeddings = np.load(os.path.join(EMBEDDINGS_DIR, "table_embeddings.npy"))
image_embeddings = np.load(os.path.join(EMBEDDINGS_DIR, "image_embeddings.npy"))

with open(os.path.join(METADATA_DIR, "text_chunks.json"), 'r', encoding='utf-8') as f:
    text_chunks = json.load(f)
with open(os.path.join(METADATA_DIR, "table_chunks.json"), 'r', encoding='utf-8') as f:
    table_chunks = json.load(f)
with open(os.path.join(METADATA_DIR, "image_chunks.json"), 'r', encoding='utf-8') as f:
    image_chunks = json.load(f)

print(f"Loaded: {len(text_chunks)} text, {len(table_chunks)} tables, {len(image_chunks)} images")


Loading embeddings...
Loaded: 859 text, 114 tables, 73 images
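
Before indexing, it can help to confirm that each embedding matrix lines up with its metadata list, because the search functions later use the FAISS row id to index straight into `text_chunks`, `table_chunks`, or `image_chunks`. The sketch below is one way to do that check. `validate_embeddings` is a hypothetical helper that is not part of the notebook, and the toy array stands in for the real `.npy` files.

```python
import numpy as np

def validate_embeddings(embeddings: np.ndarray, chunks: list, name: str,
                        atol: float = 1e-3) -> dict:
    """Check that an embedding matrix matches its metadata list (hypothetical helper)."""
    if embeddings.ndim != 2:
        raise ValueError(f"{name}: expected a 2-D array, got shape {embeddings.shape}")
    if embeddings.shape[0] != len(chunks):
        raise ValueError(f"{name}: {embeddings.shape[0]} vectors but {len(chunks)} chunks")
    norms = np.linalg.norm(embeddings, axis=1)
    return {
        "count": int(embeddings.shape[0]),
        "dim": int(embeddings.shape[1]),
        "is_float32": bool(embeddings.dtype == np.float32),
        # Inner-product search only behaves like cosine similarity on unit vectors
        "is_normalized": bool(np.allclose(norms, 1.0, atol=atol)),
    }

# Toy data standing in for text_embeddings / text_chunks
demo = np.array([[3.0, 4.0], [1.0, 0.0]], dtype=np.float32)
demo /= np.linalg.norm(demo, axis=1, keepdims=True)
report = validate_embeddings(demo, [{"content": "a"}, {"content": "b"}], "demo")
print(report)
```

In the notebook this would be called once per modality, for example `validate_embeddings(text_embeddings, text_chunks, "text")`.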


## 3. Create FAISS Indices


In [4]:
# Create FAISS indices for fast similarity search
print("Creating FAISS indices...")

# Text index
text_index = faiss.IndexFlatIP(text_embeddings.shape[1])  # Inner product; equals cosine similarity only if the stored vectors are L2-normalized
text_index.add(text_embeddings)
faiss.write_index(text_index, os.path.join(INDICES_DIR, "text_index.faiss"))
print(f"Text index: {text_index.ntotal} vectors")

# Table index
table_index = faiss.IndexFlatIP(table_embeddings.shape[1])
table_index.add(table_embeddings)
faiss.write_index(table_index, os.path.join(INDICES_DIR, "table_index.faiss"))
print(f"Table index: {table_index.ntotal} vectors")

# Image index
image_index = faiss.IndexFlatIP(image_embeddings.shape[1])
image_index.add(image_embeddings)
faiss.write_index(image_index, os.path.join(INDICES_DIR, "image_index.faiss"))
print(f"Image index: {image_index.ntotal} vectors")

print(f"\nIndices saved to: {INDICES_DIR}")


Creating FAISS indices...
Text index: 859 vectors
Table index: 114 vectors
Image index: 73 vectors

Indices saved to: extracted_data/indices
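
`IndexFlatIP` performs an exhaustive inner-product search, so on L2-normalized vectors its scores are cosine similarities. The NumPy sketch below reproduces that idea without FAISS to make the relationship concrete. `brute_force_top_k` and `l2_normalize` are illustrative names and the random vectors are placeholder data, not the notebook's embeddings.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, top_k: int = 5):
    """Score every stored vector by inner product and return the top_k best."""
    scores = vectors @ query               # one inner product per stored vector
    order = np.argsort(-scores)[:top_k]    # highest score first
    return scores[order], order

rng = np.random.default_rng(0)
vectors = l2_normalize(rng.normal(size=(100, 8)).astype(np.float32))
query = l2_normalize(rng.normal(size=8).astype(np.float32))

scores, ids = brute_force_top_k(query, vectors, top_k=3)

# Cosine similarity computed explicitly, for comparison
cosine = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
print(np.allclose(scores, cosine[ids], atol=1e-5))
```

If the stored embeddings were not normalized, the inner-product ranking would also reward vector length, which is why the search functions normalize the query as well.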


## 4. Search Functions


In [5]:
def search_text(query: str, top_k: int = 5):
    """Search for similar text chunks."""
    query_embedding = text_model.encode([query], normalize_embeddings=True)
    distances, indices = text_index.search(query_embedding, top_k)
    
    results = []
    for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
        if idx == -1:
            break
        chunk = text_chunks[idx]
        results.append({
            'rank': i + 1,
            'score': float(distance),
            'content': chunk['content'],
            'source': chunk['source_file'],
            'page': chunk['page_number']
        })
    return results

def search_tables(query: str, top_k: int = 5):
    """Search for similar table chunks."""
    query_embedding = text_model.encode([query], normalize_embeddings=True)
    distances, indices = table_index.search(query_embedding, top_k)
    
    results = []
    for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
        if idx == -1:
            break
        chunk = table_chunks[idx]
        results.append({
            'rank': i + 1,
            'score': float(distance),
            'content': chunk['content'],
            'source': chunk['source_file'],
            'page': chunk['page_number']
        })
    return results

def search_images(query: str, top_k: int = 5):
    """Text-to-image search using CLIP."""
    text_tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_features = clip_model.encode_text(text_tokens)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        query_embedding = text_features.cpu().numpy()
    
    distances, indices = image_index.search(query_embedding, top_k)
    
    results = []
    for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
        if idx == -1:
            break
        chunk = image_chunks[idx]
        results.append({
            'rank': i + 1,
            'score': float(distance),
            'image_path': chunk['image_path'],
            'source': chunk['source_file'],
            'page': chunk['page_number'],
            'type': chunk.get('image_type', 'unknown')
        })
    return results

print("Search functions defined!")


Search functions defined!
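
The three search functions above share the same loop for turning FAISS output into ranked result dicts. One possible refactor, sketched here under the assumption that only the copied metadata fields differ, is a small shared helper. `format_hits` and `TEXT_FIELDS` are hypothetical names, and the chunks are toy data.

```python
from typing import Any, Dict, List, Sequence

def format_hits(scores: Sequence[float], ids: Sequence[int],
                chunks: List[Dict[str, Any]],
                fields: Dict[str, str]) -> List[Dict[str, Any]]:
    """Turn one row of search output into ranked result dicts.

    fields maps an output key to a chunk key, e.g. {"source": "source_file"}.
    """
    results = []
    for rank, (score, idx) in enumerate(zip(scores, ids), start=1):
        if idx == -1:  # the notebook treats -1 as "no more hits"
            break
        chunk = chunks[idx]
        hit = {"rank": rank, "score": float(score)}
        for out_key, chunk_key in fields.items():
            hit[out_key] = chunk[chunk_key]
        results.append(hit)
    return results

# Field mapping matching search_text / search_tables
TEXT_FIELDS = {"content": "content", "source": "source_file", "page": "page_number"}

demo_chunks = [
    {"content": "alpha", "source_file": "a.pdf", "page_number": 1},
    {"content": "beta", "source_file": "b.pdf", "page_number": 2},
]
hits = format_hits([0.9, 0.5, 0.0], [1, 0, -1], demo_chunks, TEXT_FIELDS)
print(hits)
```

With a helper like this, `search_text` would reduce to encoding the query, calling `text_index.search`, and passing `distances[0]` and `indices[0]` through.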


## 4b. Image Query Functions (NEW!)

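This section adds **image-to-image** search using CLIP image embeddings. **Image-to-text** and image-to-table queries are stubbed out for now: the text and table indices hold 384-d embeddings while CLIP produces 512-d embeddings, so those calls return an error with a suggested workaround.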


In [10]:
def search_with_image_query(image_path: str, search_type: str = 'images', top_k: int = 5):
    """
    Search using an IMAGE as the query (true multimodal search!)
    
    Args:
        image_path: Path to the query image
        search_type: 'images' for image-to-image. 'text' and 'tables' are accepted
            but not supported yet; they return an error dict.
        top_k: Number of results to return
    
    Returns:
        A list of result dicts for 'images'; otherwise a dict with an 'error' key
    """
    # Load and preprocess the query image
    try:
        query_image = Image.open(image_path).convert('RGB')
        image_tensor = clip_preprocess(query_image).unsqueeze(0).to(device)
    except Exception as e:
        print(f"Error loading image: {e}")
        return []
    
    # Encode the image with CLIP
    with torch.no_grad():
        image_features = clip_model.encode_image(image_tensor)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        query_embedding = image_features.cpu().numpy()
    
    # Search based on type
    if search_type == 'images':
        # IMAGE-TO-IMAGE: Find similar images
        distances, indices = image_index.search(query_embedding, top_k)
        results = []
        for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
            if idx == -1:
                break
            chunk = image_chunks[idx]
            results.append({
                'rank': i + 1,
                'score': float(distance),
                'image_path': chunk['image_path'],
                'source': chunk['source_file'],
                'page': chunk['page_number'],
                'type': chunk.get('image_type', 'unknown')
            })
        return results
    
    elif search_type == 'text':
        # IMAGE-TO-TEXT is not supported yet. The text index holds 384-d
        # embeddings while CLIP image embeddings are 512-d, so the two
        # cannot be compared directly. Possible fixes:
        #   Option 1: Index the text chunks with CLIP's text encoder (shared space)
        #   Option 2: Train a projection layer between the two spaces (advanced)
        # Until then, return an error and suggest querying with the image's OCR text.
        print("Note: Image-to-text search is not supported (512-d CLIP vs 384-d text embeddings)")
        print("For now, consider using the image's OCR text as a text query")
        
        return {
            'error': 'Dimension mismatch',
            'suggestion': 'Use the image OCR text as a text query instead',
            'image_path': image_path
        }
    
    elif search_type == 'tables':
        # IMAGE-TO-TABLE: Similar to image-to-text
        print("Note: Image-to-table search requires dimension alignment")
        return {
            'error': 'Dimension mismatch',
            'suggestion': 'Use the image OCR text as a text query for tables',
            'image_path': image_path
        }
    
    else:
        return {'error': f'Unknown search type: {search_type}'}


def search_images_by_image(image_path: str, top_k: int = 5):
    """
    IMAGE-TO-IMAGE search: Find similar images using an image query.
    
    Args:
        image_path: Path to the query image
        top_k: Number of results to return
    
    Returns:
        List of similar images
    """
    return search_with_image_query(image_path, search_type='images', top_k=top_k)


def unified_search(query: Union[str, dict], top_k: int = 5):
    """
    Unified search interface that handles both text and image queries.
    
    Args:
        query: Either a text string OR a dict with {'type': 'image', 'path': 'image.png', 'search_in': 'images'}
        top_k: Number of results to return
    
    Returns:
        Search results
    """
    if isinstance(query, str):
        # Text query - search all modalities
        return {
            'text_results': search_text(query, top_k),
            'table_results': search_tables(query, top_k),
            'image_results': search_images(query, top_k)
        }
    elif isinstance(query, dict) and query.get('type') == 'image':
        # Image query
        image_path = query.get('path')
        search_in = query.get('search_in', 'images')  # Default to image-to-image
        return search_with_image_query(image_path, search_type=search_in, top_k=top_k)
    else:
        return {'error': 'Invalid query format'}


print("✅ Image query functions added!")
print("New functions:")
print("  - search_images_by_image(image_path, top_k)")
print("  - search_with_image_query(image_path, search_type, top_k)")
print("  - unified_search(query, top_k)  [handles both text and image queries]")


✅ Image query functions added!
New functions:
  - search_images_by_image(image_path, top_k)
  - search_with_image_query(image_path, search_type, top_k)
  - unified_search(query, top_k)  [handles both text and image queries]
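
`search_with_image_query` returns a list on success and a dict on failure, which forces callers to check the type of the result. A hedged alternative is to wrap both cases in one response shape and to test dimensions up front. `make_response` and `check_dimensions` below are hypothetical helpers. In the notebook, the index dimension could be read from the FAISS index's `d` attribute and the query dimension from `query_embedding.shape[1]`.

```python
from typing import Any, Dict, List, Optional

def check_dimensions(query_dim: int, index_dim: int) -> Optional[str]:
    """Return an error message if the query cannot be searched in the index."""
    if query_dim != index_dim:
        return f"Dimension mismatch: query is {query_dim}-d, index is {index_dim}-d"
    return None

def make_response(results: Optional[List[Dict[str, Any]]] = None,
                  error: Optional[str] = None) -> Dict[str, Any]:
    """Uniform envelope so callers never have to test list vs dict."""
    return {"ok": error is None, "results": results or [], "error": error}

# 512-d CLIP image embedding against the 384-d text index
failed = make_response(error=check_dimensions(512, 384))

# 512-d CLIP image embedding against the 512-d image index
assert check_dimensions(512, 512) is None
succeeded = make_response(results=[{"rank": 1, "score": 0.83}])

print(failed)
print(succeeded)
```

The trade-off is that existing callers which iterate over the returned list directly would need to read `response["results"]` instead.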


In [None]:
def display_image_results(results: List[Dict], max_display: int = 5, figsize_per_image: tuple = (5, 4)):
    """
    Display image search results with actual images visualized.
    
    Args:
        results: List of search results from search_images() or search_images_by_image()
        max_display: Maximum number of images to display
        figsize_per_image: Size of each subplot (width, height)
    """
    if not results:
        print("No results to display")
        return
    
    num_to_show = min(len(results), max_display)
    
    # Calculate grid dimensions
    cols = min(3, num_to_show)  # Max 3 columns
    rows = (num_to_show + cols - 1) // cols
    
    fig, axes = plt.subplots(rows, cols, figsize=(figsize_per_image[0] * cols, figsize_per_image[1] * rows))
    
    # plt.subplots returns a single Axes, a 1-D array, or a 2-D array
    # depending on the grid shape; normalize all three to a flat array
    axes = np.atleast_1d(axes).ravel()
    
    for idx, result in enumerate(results[:num_to_show]):
        try:
            # Load and display image
            img_path = result['image_path']
            img = Image.open(img_path)
            
            axes[idx].imshow(img)
            axes[idx].axis('off')
            
            # Add title with rank and score
            title = f"#{result['rank']} (Score: {result['score']:.3f})\n"
            title += f"{result['source']}\nPage {result['page']}"
            axes[idx].set_title(title, fontsize=10, fontweight='bold')
            
        except Exception as e:
            axes[idx].text(0.5, 0.5, f'Error loading image:\n{str(e)}', 
                          ha='center', va='center', fontsize=8)
            axes[idx].axis('off')
    
    # Hide unused subplots
    for idx in range(num_to_show, len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()


def display_single_image(image_path: str, title: str = None):
    """
    Display a single image.
    
    Args:
        image_path: Path to the image file
        title: Optional title for the image
    """
    try:
        img = Image.open(image_path)
        plt.figure(figsize=(8, 6))
        plt.imshow(img)
        plt.axis('off')
        if title:
            plt.title(title, fontsize=12, fontweight='bold')
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"Error displaying image: {e}")


def search_and_display_images(query: str, top_k: int = 5):
    """
    Search for images and display them immediately.
    
    Args:
        query: Text query
        top_k: Number of results
    """
    print(f"🔍 Searching for: '{query}'")
    print("=" * 80)
    
    results = search_images(query, top_k)
    
    if not results:
        print("No results found.")
        return results
    
    print(f"\nFound {len(results)} results:\n")
    for r in results:
        print(f"#{r['rank']} (Score: {r['score']:.4f})")
        print(f"   {r['source']} (Page {r['page']})")
        print(f"   {r['image_path']}\n")
    
    print("=" * 80)
    print("Displaying images...\n")
    display_image_results(results, max_display=top_k)
    
    return results


def search_images_by_image_and_display(image_path: str, top_k: int = 5):
    """
    Search for similar images using an image query and display results.
    
    Args:
        image_path: Path to query image
        top_k: Number of results
    """
    print(f"🔍 Searching with image: {image_path}")
    print("=" * 80)
    
    # Display query image first
    print("\n📷 Query Image:")
    display_single_image(image_path, "Query Image")
    
    # Search for similar images
    results = search_images_by_image(image_path, top_k)
    
    if not results:
        print("No results found.")
        return results
    
    print(f"\n✅ Found {len(results)} similar images:\n")
    for r in results:
        print(f"#{r['rank']} (Score: {r['score']:.4f})")
        print(f"   {r['source']} (Page {r['page']})")
        print(f"   {r['image_path']}\n")
    
    print("=" * 80)
    print("Displaying similar images...\n")
    display_image_results(results, max_display=top_k)
    
    return results


print("✅ Visualization functions added!")
print("New functions:")
print("  - display_image_results(results, max_display)")
print("  - display_single_image(image_path, title)")
print("  - search_and_display_images(query, top_k)  [search + auto-display]")
print("  - search_images_by_image_and_display(image_path, top_k)  [search + auto-display]")
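
The grid layout in `display_image_results` caps the number of columns at three and uses ceiling division for the rows. The standalone sketch below isolates that arithmetic so it can be checked without matplotlib. `grid_shape` is a hypothetical helper name.

```python
def grid_shape(num_items: int, max_cols: int = 3):
    """Return (rows, cols) for a grid that holds num_items with at most max_cols columns."""
    if num_items <= 0:
        return (0, 0)
    cols = min(max_cols, num_items)
    rows = (num_items + cols - 1) // cols  # ceiling division
    return (rows, cols)

shapes = {n: grid_shape(n) for n in (1, 3, 4, 5, 7)}
print(shapes)
# {1: (1, 1), 3: (1, 3), 4: (2, 3), 5: (2, 3), 7: (3, 3)}
```

Whenever `rows * cols` exceeds the number of results, the extra axes are the ones the "Hide unused subplots" loop turns off.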


## 5b. Test Image Queries (NEW!)


In [11]:
# Test IMAGE-TO-IMAGE search
print("="*80)
print("TESTING IMAGE-TO-IMAGE SEARCH")
print("="*80)

# Use one of our extracted charts as a query image
query_image_path = image_chunks[0]['image_path']  # First chart
print(f"\nQuery Image: {query_image_path}")
print(f"Source: {image_chunks[0]['source_file']} (Page {image_chunks[0]['page_number']})")

# Find similar images
results = search_images_by_image(query_image_path, top_k=5)

print(f"\nTop 5 Similar Images:\n")
for r in results:
    print(f"#{r['rank']} (Score: {r['score']:.4f})")
    print(f"  Image: {r['image_path']}")
    print(f"  Type: {r['type']}, Source: {r['source']} (Page {r['page']})\n")

print("="*80)


TESTING IMAGE-TO-IMAGE SEARCH

Query Image: extracted_data\charts\1. Annual Report 2023-24_page11_chart1.png
Source: 1. Annual Report 2023-24.pdf (Page 11)

Top 5 Similar Images:

#1 (Score: 1.0003)
  Image: extracted_data\charts\1. Annual Report 2023-24_page11_chart1.png
  Type: chart, Source: 1. Annual Report 2023-24.pdf (Page 11)

#2 (Score: 0.8257)
  Image: extracted_data\charts\3. FYP-Handbook-2023_page52_chart1.png
  Type: chart, Source: 3. FYP-Handbook-2023.pdf (Page 52)

#3 (Score: 0.7988)
  Image: extracted_data\charts\3. FYP-Handbook-2023_page56_chart1.png
  Type: chart, Source: 3. FYP-Handbook-2023.pdf (Page 56)

#4 (Score: 0.7881)
  Image: extracted_data\charts\3. FYP-Handbook-2023_page59_chart1.png
  Type: chart, Source: 3. FYP-Handbook-2023.pdf (Page 59)

#5 (Score: 0.7688)
  Image: extracted_data\charts\1. Annual Report 2023-24_page13_chart3.png
  Type: chart, Source: 1. Annual Report 2023-24.pdf (Page 13)
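
The top hit above is the query image itself, because the query was taken from the indexed set. Its score is 1.0003 rather than exactly 1.0, which is most likely rounding error from reduced-precision arithmetic during normalization, although the notebook does not confirm this. When the query comes from the index, the self-match can be removed and the remaining hits re-ranked. `drop_self_match` is a hypothetical helper that assumes paths can be compared as plain strings, and the paths below are shortened toy values.

```python
from typing import Any, Dict, List

def drop_self_match(results: List[Dict[str, Any]], query_path: str) -> List[Dict[str, Any]]:
    """Remove the query image from its own results and renumber the ranks."""
    kept = [dict(r) for r in results if r["image_path"] != query_path]  # copy, don't mutate input
    for rank, r in enumerate(kept, start=1):
        r["rank"] = rank
    return kept

demo_results = [
    {"rank": 1, "score": 1.0003, "image_path": "charts/a.png"},
    {"rank": 2, "score": 0.8257, "image_path": "charts/b.png"},
    {"rank": 3, "score": 0.7988, "image_path": "charts/c.png"},
]
neighbours = drop_self_match(demo_results, "charts/a.png")
print(neighbours)
```

To still return `top_k` neighbours after the filter, the search itself would need to request `top_k + 1` hits.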



In [12]:
# Test UNIFIED SEARCH interface
print("\n" + "="*80)
print("TESTING UNIFIED SEARCH INTERFACE")
print("="*80)

# Test 1: Text query
print("\n1. TEXT QUERY:")
text_query = "computer science programs"
print(f"   Query: '{text_query}'")
results = unified_search(text_query, top_k=2)
print(f"   Found {len(results['text_results'])} text results, {len(results['table_results'])} table results, {len(results['image_results'])} image results")
print(f"   Top text result: {results['text_results'][0]['content'][:100]}...")

# Test 2: Image query
print("\n2. IMAGE QUERY:")
image_query = {
    'type': 'image',
    'path': image_chunks[5]['image_path'],
    'search_in': 'images'
}
print(f"   Query Image: {image_query['path']}")
results = unified_search(image_query, top_k=3)
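if isinstance(results, list):
    print(f"   Found {len(results)} similar images")
    if results:
        print(f"   Top result: {results[0]['image_path']}")
else:
    # unified_search returns an error dict for unsupported image queries
    print(f"   Search error: {results.get('error')}")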

print("\n" + "="*80)
print("✅ Multimodal query handling complete!")
print("   - Text queries: ✅")
print("   - Image queries: ✅")
print("="*80)



TESTING UNIFIED SEARCH INTERFACE

1. TEXT QUERY:
   Query: 'computer science programs'
   Found 2 text results, 2 table results, 2 image results
   Top text result: Computer Science programs at the FAST School of Computing, Karachi Campus. The evaluation encompasse...

2. IMAGE QUERY:
   Query Image: extracted_data\charts\1. Annual Report 2023-24_page14_chart1.png
   Found 3 similar images
   Top result: extracted_data\charts\1. Annual Report 2023-24_page14_chart1.png

✅ Multimodal query handling complete!
   - Text queries: ✅
   - Image queries: ✅


## 5. Test Searches


## ✅ Updated Summary

**Complete Multimodal Vector Search System**

### Implemented Features:

**1. Text Queries:**
- ✅ Text-to-text search (semantic search in documents)
- ✅ Text-to-table search (find relevant tables)
- ✅ Text-to-image search (CLIP-powered visual search)

**2. Image Queries:** ✨ **NEW!**
- ✅ **Image-to-image search** (find visually similar charts/figures)
- ✅ **Unified search interface** supporting both text and image inputs
- ✅ **CLIP-based image encoding** for true multimodal retrieval

**3. Infrastructure:**
- ✅ FAISS indices for fast similarity search (saved to disk)
- ✅ Ranked results with similarity scores
- ✅ Source file and page number tracking
- ✅ Support for 1046 chunks (859 text + 114 tables + 73 images)

### What's Still Missing:
- ❌ **Evaluation metrics** (Precision@K, Recall@K, MAP)

### Next Steps:
1. Implement retrieval quality evaluation metrics
2. Build a RAG system with LLM integration
3. Create a web interface for the search system
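
As a starting point for the evaluation metrics listed as missing, the sketch below implements Precision@K, Recall@K, and average precision for a single query. MAP is the mean of average precision over many queries. The retrieved ids and the relevance labels here are made up for illustration, since the notebook has no ground-truth judgments yet.

```python
from typing import Sequence, Set

def precision_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k

def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

def average_precision(retrieved: Sequence[int], relevant: Set[int]) -> float:
    """Mean of precision values taken at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical chunk ids returned for one query, and hypothetical relevance labels
retrieved = [3, 7, 1, 9, 4]
relevant = {3, 1, 8}

p3 = precision_at_k(retrieved, relevant, 3)
r3 = recall_at_k(retrieved, relevant, 3)
ap = average_precision(retrieved, relevant)
print(round(p3, 4), round(r3, 4), round(ap, 4))
```

To use these with the search functions, each result would need to carry its chunk index, and a small set of queries would need hand-labeled relevant chunks.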


In [6]:
# Test text search
query = "computer science programs"
print(f"Text Search: '{query}'\n")
results = search_text(query, top_k=3)
for r in results:
    print(f"#{r['rank']} (Score: {r['score']:.4f})")
    print(f"  {r['content'][:150]}...")
    print(f"  Source: {r['source']} (Page {r['page']})\n")


Text Search: 'computer science programs'

#1 (Score: 0.6736)
  Computer Science programs at the FAST School of Computing, Karachi Campus. The evaluation encompassed a thorough examination of course content, curric...
  Source: 1. Annual Report 2023-24.pdf (Page 39)

#2 (Score: 0.6538)
  Networks, Software Testing, Software Engineering, Machine Intelligence, Image Processing, Neural Networks, Embedded Systems, RF Systems, and Control S...
  Source: 1. Annual Report 2023-24.pdf (Page 7)

#3 (Score: 0.5978)
  - - - - - Computer Science - - - - Electrical Engineering - - - - Management Sciences - - - - - - Sciences & Humanities - - - - Software Engineering -...
  Source: 1. Annual Report 2023-24.pdf (Page 13)


In [7]:
# Test image search (text-to-image)
query = "table showing degree programs"
print(f"Image Search: '{query}'\n")
results = search_images(query, top_k=3)
for r in results:
    print(f"#{r['rank']} (Score: {r['score']:.4f})")
    print(f"  Image: {r['image_path']}")
    print(f"  Type: {r['type']}, Source: {r['source']} (Page {r['page']})\n")


Image Search: 'table showing degree programs'

#1 (Score: 0.3647)
  Image: extracted_data\charts\1. Annual Report 2023-24_page12_chart1.png
  Type: chart, Source: 1. Annual Report 2023-24.pdf (Page 12)

#2 (Score: 0.3403)
  Image: extracted_data\charts\1. Annual Report 2023-24_page70_chart2.png
  Type: chart, Source: 1. Annual Report 2023-24.pdf (Page 70)

#3 (Score: 0.3359)
  Image: extracted_data\charts\1. Annual Report 2023-24_page71_chart2.png
  Type: chart, Source: 1. Annual Report 2023-24.pdf (Page 71)
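
The text-to-text scores in the recorded outputs (roughly 0.60 to 0.67) and the text-to-image scores (roughly 0.34 to 0.36) fall in different ranges, since they come from different embedding models. A single global cutoff would therefore treat the modalities unevenly. The sketch below shows one way to filter the dict returned by `unified_search` with a separate threshold per modality. `filter_by_score` and `filter_unified` are hypothetical helpers, and the threshold values are arbitrary placeholders rather than tuned numbers.

```python
from typing import Any, Dict, List

def filter_by_score(results: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep only results whose similarity score reaches min_score."""
    return [r for r in results if r["score"] >= min_score]

def filter_unified(unified: Dict[str, List[Dict[str, Any]]],
                   thresholds: Dict[str, float]) -> Dict[str, List[Dict[str, Any]]]:
    """Apply a per-modality threshold; modalities without a threshold are left unfiltered."""
    return {
        key: filter_by_score(hits, thresholds.get(key, float("-inf")))
        for key, hits in unified.items()
    }

# Scores copied from the recorded outputs; other result fields omitted for brevity
unified = {
    "text_results": [{"rank": 1, "score": 0.6736}, {"rank": 2, "score": 0.6538},
                     {"rank": 3, "score": 0.5978}],
    "table_results": [],
    "image_results": [{"rank": 1, "score": 0.3647}, {"rank": 2, "score": 0.3403},
                      {"rank": 3, "score": 0.3359}],
}
thresholds = {"text_results": 0.6, "image_results": 0.35}  # illustrative values only
filtered = filter_unified(unified, thresholds)
print({key: len(hits) for key, hits in filtered.items()})
```

Sensible thresholds would have to come from inspecting scores on labeled queries, which ties back to the evaluation step listed under next steps.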


## Summary

The vector search system is now ready. You can:
- Search text chunks semantically
- Search tables
- Perform text-to-image search with CLIP
- Get relevant results ranked by similarity

Next steps: Build a RAG system or web interface!
