# Vector Database Demo

This notebook walks through the core concepts and a working demo of a vector database using Facebook AI Similarity Search (FAISS).

## Objectives
- Understand what a vector database is
- Create embeddings from sample text
- Store embeddings in FAISS
- Query FAISS for similar documents

In [None]:
# Install required libraries
%pip install faiss-cpu sentence-transformers -q

## Step 1: Generate Embeddings from Text using Sentence Transformers

SentenceTransformer provides a simple API for converting sentences or texts into high-dimensional dense vectors (aka embeddings) that capture their semantic meaning.

### Steps
1. Tokenization
    - "The cat sat on the mat." → ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
1. Each token is mapped to an integer ID from the model's vocabulary
    - ['the', 'cat', 'sat', 'on', 'the', 'mat', '.'] → [101, 4523, 3546, 1203, 101, 2981, 119] (illustrative IDs)
1. Embedding Layer (Word → Vector)
    - Each token ID is looked up in a learned embedding table, which is a matrix of floats with one row per vocabulary entry.
    - 101 might become:
        - [0.01, 0.15, -0.23, ..., 0.07]  (384-dimensional vector)
    - These vectors are dense (most or all values are non-zero) and are learned during training.
        - 'cat'  → [0.12, -0.8, 0.9, ...]
        - 'dog'  → [0.11, -0.78, 0.91, ...]
    - 'cat' and 'dog' have similar vectors because they mean similar things.
    - All tokens are now represented as vectors. These are initial token-level embeddings; they do not yet capture context or meaning beyond the individual token.
1. Transformer Layers
    - The model applies attention mechanisms to process all tokens together, layer by layer.
    - At each layer, each token's vector is updated based on:
        - What other words are nearby
        - The relationships between words (e.g. subjects, objects, actions)
        - Learned patterns from training on massive amounts of text
    - After several layers (e.g. 6 in all-MiniLM-L6-v2), the vectors become contextualized:
        - The vector for "cat" reflects that it is the subject
        - The vector for "sat" reflects that it is a verb referring to the cat
    - This contextualization is what lets the model represent meaning beyond individual words.
1. Pooling / Sentence Embedding
    - Once each word has a contextual vector, you combine them into a single vector for the whole sentence.
    - This could be done using:
        - The vector for the [CLS] token
            - Many transformer models (like BERT) add a special [CLS] token at the start of every input; its final-layer vector can be used as a summary of the whole input.
        - Mean pooling (average of all token vectors)
            - Take the average (mean) of all token embeddings, ignoring padding tokens ([PAD]) so they do not dilute the result.
            - Gives equal weight to every remaining token in the sentence.
        - Max pooling
            - For each dimension in the embedding vector, pick the maximum value across all token vectors.
            - Think of it as capturing the most strongly activated (strongest signal) feature.
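
The lookup and pooling steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: the embedding table, token IDs, and 3-dimensional vectors are made up, the transformer layers are skipped entirely (a real model pools the contextualized vectors, not raw table lookups), and only padding is masked out. Whether [CLS] and [SEP] are included in the mean varies by implementation.

```python
# Toy illustration only: the vocabulary, IDs, and vector values are made up.
# Real models use hundreds of dimensions and learned weights.

# Hypothetical embedding table: token ID -> 3-dimensional vector.
embedding_table = {
    0: [0.0, 0.0, 0.0],    # [PAD]
    1: [0.2, -0.1, 0.4],   # [CLS]
    2: [0.1, 0.3, -0.2],   # "the"
    3: [0.9, -0.8, 0.5],   # "cat"
    4: [-0.4, 0.6, 0.7],   # "sat"
}

token_ids = [1, 2, 3, 4, 0]        # [CLS] the cat sat [PAD]
attention_mask = [1, 1, 1, 1, 0]   # 0 marks padding

# Embedding lookup: one vector per token ID.
token_vectors = [embedding_table[token_id] for token_id in token_ids]


def cls_pooling(vectors):
    """Use the vector of the first token ([CLS]) as the sentence vector."""
    return vectors[0]


def mean_pooling(vectors, mask):
    """Average each dimension over the non-padding tokens."""
    kept = [vec for vec, keep in zip(vectors, mask) if keep == 1]
    return [sum(dim) / len(kept) for dim in zip(*kept)]


def max_pooling(vectors, mask):
    """Take the per-dimension maximum over the non-padding tokens."""
    kept = [vec for vec, keep in zip(vectors, mask) if keep == 1]
    return [max(dim) for dim in zip(*kept)]


print("CLS pooling: ", cls_pooling(token_vectors))
print("Mean pooling:", mean_pooling(token_vectors, attention_mask))
print("Max pooling: ", max_pooling(token_vectors, attention_mask))
```

All three strategies turn a variable number of token vectors into one fixed-size vector, which is what makes sentences of different lengths comparable.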


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2', device="cpu")

texts = [
    "The cat sat on the mat.",
    "The dog played in the yard.",
    "There is a cat under the table.",
    "Dogs love playing fetch.",
    "A man is sitting on a bench."
]

# Generate vector embeddings. encode() tokenizes each sentence, feeds the
# tokens through the transformer, and returns one dense vector per sentence.
embeddings = model.encode(texts, show_progress_bar=True)

# 5 text inputs, each represented by a 384-dimensional vector,
# so the expected shape is (5, 384).
print("Shape of embeddings:", embeddings.shape)
print(embeddings)

## Step 2: Store Vectors in Facebook AI Similarity Search (FAISS) Index and Perform Search

### The Problem: Search by Meaning, Not Words

Imagine you want to build a system that can answer:
- "What produces energy in a cell?"

You have documents like:
- "The mitochondria is the powerhouse of the cell."
- "Photosynthesis occurs in plant cells."

A traditional keyword-based search engine may not connect "energy" to "mitochondria", because the key terms of the query and the document do not overlap.

We need to search by meaning, not exact wording.
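
To make the limitation concrete, here is a minimal sketch of naive keyword matching on the example above. The `keywords` helper and the variable names are hypothetical, and real keyword search engines use more refined scoring, but the underlying issue is the same.

```python
import string


def keywords(text):
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


cell_query = "What produces energy in a cell?"
cell_docs = [
    "The mitochondria is the powerhouse of the cell.",
    "Photosynthesis occurs in plant cells.",
]

# Score each document by how many words it shares with the query.
for doc in cell_docs:
    shared = keywords(cell_query) & keywords(doc)
    print(f"{len(shared)} shared word(s) {sorted(shared)}: {doc}")
```

Each document shares exactly one word with the query ("cell" and "in" respectively), so word overlap cannot tell the relevant sentence from the irrelevant one, and "energy" matches nothing at all.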

### The Solution: Store Vectors in FAISS

Step-by-Step
1. Convert documents to vectors
    - Each sentence or paragraph is encoded into a dense vector (embedding) that captures its meaning.

1. Store those vectors in FAISS
    - FAISS is a fast vector index optimized for similarity search in high-dimensional space.

1. Convert user queries to vectors
    - The question "What produces energy in a cell?" is turned into a vector.

1. Search FAISS for similar vectors
    - FAISS compares your query vector to the stored vectors and returns the most similar ones, ranked by L2 (Euclidean) distance or inner product. Inner product on L2-normalized vectors is equivalent to cosine similarity.

1. Use the top results
    - You can now return the top sentences, or pass them to a language model to generate an answer.
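
The two similarity measures mentioned in step 4 are closely related. For unit-length vectors, the squared L2 distance equals `2 - 2 * cosine similarity`, so both produce the same ranking. The sketch below checks this numerically in plain Python; vectors `a` and `b` reuse the toy 'cat' and 'dog' values from Step 1, and `c` is a made-up unrelated vector.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Sum of squared per-dimension differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([0.12, -0.80, 0.90])   # toy 'cat' vector
b = l2_normalize([0.11, -0.78, 0.91])   # toy 'dog' vector
c = l2_normalize([-0.70, 0.40, 0.10])   # made-up unrelated vector

for name, other in [("b", b), ("c", c)]:
    cos = cosine_similarity(a, other)
    dist = squared_l2(a, other)
    print(f"a vs {name}: cosine={cos:.4f}  squared_l2={dist:.4f}  2-2cos={2 - 2 * cos:.4f}")
```

The last two columns agree for each pair, which is why an L2 index over normalized embeddings ranks results the same way cosine similarity would.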

In [None]:
import faiss
import numpy as np

# FAISS needs to know how many dimensions each vector has.
dimension = embeddings.shape[1]  # e.g. 384
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
# FAISS requires float32 (np.float32) input.
# Sentence Transformers already returns float32 by default, but the explicit
# cast guards against embeddings from other sources that arrive as float64.
index.add(np.array(embeddings).astype('float32'))
print("Number of vectors in the index:", index.ntotal)

# Query with a new sentence
query = "A dog is playing outside."
query_vector = model.encode([query])
# D → Distances (IndexFlatL2 returns squared L2, i.e. squared Euclidean, distances)
# I → Indices of the matching vectors in the index
D, I = index.search(np.array(query_vector).astype('float32'), k=3)

print("\nTop 3 most similar texts:")
for idx, dist in zip(I[0], D[0]):
    print(f"- {texts[idx]} (squared L2 distance: {dist:.4f})")
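
For intuition, the exact search performed by a flat L2 index can be approximated conceptually as a brute-force scan: compute the squared L2 distance from the query to every stored vector, sort, and keep the `k` smallest. The sketch below does this in plain Python on made-up 2-dimensional vectors. The function `flat_l2_search` is hypothetical and is not how FAISS is implemented internally, which relies on optimized native code.

```python
def flat_l2_search(stored, query_vec, k):
    """Return (distances, indices) of the k stored vectors closest to query_vec."""
    scored = [
        (sum((s - q) ** 2 for s, q in zip(vec, query_vec)), i)
        for i, vec in enumerate(stored)
    ]
    scored.sort()  # smallest squared distance first
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]


# Made-up 2-dimensional vectors standing in for stored embeddings.
toy_vectors = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
]

toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 0.5], k=2)
print(toy_indices, toy_distances)  # [1, 0] [0.25, 1.25]
```

A scan like this touches every stored vector on every query, which is fine for five sentences but is the cost that approximate index types are designed to reduce on large collections.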


## Summary
- We used `sentence-transformers` to convert text into dense vectors.
- We stored the vectors in a FAISS index.
- We queried the index to find semantically similar text.