# Preparing Your Environment

Before running this notebook, first create a Python environment with `jupyter` installed. If you use conda, you can run a command like the one below:

```
conda create -n vector-databases -yq python=3.12 jupyter 
```

Once your environment is ready, link it to the notebook and install the packages as you work through the notebook. Alternatively, you can install all packages in one go:

```
pip install python-dotenv openai pinecone torch transformers open_clip_torch langchain-experimental
```

# Vector Databases

## 1. Vectors & Embeddings (Theory)

### 1.1. What is a Vector

A *vector* is a mathematical representation of data as an ordered list of numbers. In the context of *vector databases* and *machine learning*, a vector represents the features or characteristics of an object (such as a word, image, or user) in a numerical form that computers can efficiently process and compare.

#### Representing a Vector

A vector is an element of an `n-dimensional` space ($\mathbb{R}^n$), written as `V = [v1, v2, v3, ..., vn]`. Each component `vi` is a real number representing a specific feature or dimension.

**Example Sentence Representation**
<div align="center">
    <img src="imgs/vec_rep1.jpeg" alt="Alt text" width="700" height="200" center>
</div>


**Example Image Representation**
<div align="center">
    <img src="imgs/vec_rep2.jpeg" alt="Alt text" width="700" height="200" center>
</div>


#### Vector Magnitude & Direction

Intuitively, think of a vector as a `point` or an `arrow` in space. It has both a direction and a magnitude.

<div align="center">
    <img src="imgs/vector-position-in-plane.png" alt="Alt text" width="500" height="500" center>
</div>


**A vector is not just a point**: it is a point relative to something (an arrow).

If you only plot a vector as (3, 2), it looks like a single point. But mathematically, a vector represents movement or change, not just a location.

For example:
  - The point (3, 2) can represent a location in space.
  - But the vector (3, 2) represents “go 3 units right and 2 units up”: a displacement from the origin (0, 0).

So every vector has:
- Direction → where it points.
- Magnitude → how long it is (how far it goes).

> In a high-dimensional space (more than 3 dimensions) we can no longer draw the arrows, but direction and magnitude are defined in exactly the same way.
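
As a minimal sketch of the two properties above, the snippet below computes the magnitude and the direction (as a unit vector) of the vector (3, 2). The helper names `magnitude` and `direction` are illustrative choices for this example, not functions used elsewhere in the notebook.

```python
import math

def magnitude(v):
    """Length (Euclidean norm) of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

def direction(v):
    """Unit vector pointing the same way as v (undefined for the zero vector)."""
    length = magnitude(v)
    if length == 0:
        raise ValueError("the zero vector has no direction")
    return [x / length for x in v]

v = [3, 2]
print(magnitude(v))   # sqrt(3^2 + 2^2) = sqrt(13), about 3.61
print(direction(v))   # about [0.83, 0.55]

# The same two functions work unchanged in higher dimensions
w = [1, 2, 2, 4]      # 1 + 4 + 4 + 16 = 25
print(magnitude(w))   # 5.0
```

Nothing in the code depends on the number of dimensions, which is why the idea carries over to embeddings with hundreds or thousands of components.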

### 1.2. What is an Embedding Model

An *embedding model* is a type of machine learning model that transforms complex, high-dimensional data (such as `words`, `images`, or `documents`) into dense `numerical vectors` (called embeddings) that *capture their meaning or relationships*.

> **An Embedding** is the vector representation of an object (text, image, sound, etc.) in a continuous, lower-dimensional space.

<div align="center">
    <img src="imgs/How-Embeddings-Work.jpg" alt="Alt text" width="500" height="300" center>
</div>

**Why does an embedding model need to be trained?**

Training is how the model learns to map similar things close together in vector space.

For example:
- Words like “king”, “queen”, “prince” end up close to each other.
- Images of cats cluster together and are far from cars.
- Customers with similar behaviors have nearby embeddings.

So, distance between vectors reflects semantic similarity.

<div align="center">
    <img src="imgs/word_embeddings_toy_example.png" alt="Alt text" width="500" height="500" center>
</div>

> Notice how cat and kitten are close, because their meanings are related.

### 1.3. Understanding `Embeddings` Vector Magnitude & Direction

<div align="center">
    <img src="imgs/mag_direction.jpg" alt="Alt text" width="400" height="300" center>
</div>

**Magnitude**

The magnitude (length) of a vector can be read as how much of something is present, like intensity or confidence (i.e., how strongly something expresses certain features). Formally, the magnitude of an n-dimensional vector is computed as $\| \mathbf{v} \| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$; in three dimensions this is $\sqrt{x^2 + y^2 + z^2}$.


**Direction**

The direction of a vector represents which features are active or what pattern it corresponds to. Think of direction as the identity of the concept.



### 1.4. Embedding Models Overview

There are several embedding models available today, each representing a step forward in how machines understand and encode meaning. From simple statistical methods to deep contextual models, these embeddings transform raw data (like text or images) into numerical vectors that capture relationships and meaning. Below are some of the popular models, each with its own story, architecture, and strengths.

#### Word2Vec (2013, Google)


Word2Vec was one of the first models to revolutionize natural language processing by demonstrating that words could be represented as dense vectors in a continuous vector space. This idea laid the foundation for modern embedding-based systems, including today’s vector databases.

Word2Vec is known for its simplicity, interpretability, and computational efficiency. However, it produces static embeddings, meaning that each word has a single fixed vector regardless of context. Because embeddings are static, Word2Vec cannot capture word meaning based on context.

For Example:
- I deposited money in the bank
- I sat by the river bank

In Word2Vec, the word `bank` has one vector, even though the meanings are completely different.
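
To make the limitation concrete, here is a toy sketch of a static embedding lookup. The table `static_embeddings` and its values are invented for illustration and are not real Word2Vec vectors; the point is only that the lookup never consults the sentence.

```python
# Toy static embedding table: one fixed vector per word (values are made up)
static_embeddings = {
    "bank": [0.21, -0.47, 0.83],
    "river": [0.10, 0.62, -0.30],
    "money": [0.75, -0.12, 0.44],
}

def embed_word(word, sentence):
    """Look up a word's vector. The sentence is accepted but never used,
    which is exactly the limitation of a static embedding."""
    return static_embeddings[word]

v1 = embed_word("bank", "I deposited money in the bank")
v2 = embed_word("bank", "I sat by the river bank")
print(v1 == v2)  # True: the same vector in both contexts
```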


Contextual embedding models, such as BERT and the later models presented below, generate different vectors depending on context, which is critical for modern applications and search. Word2Vec is rarely used in modern applications. Today it mainly serves educational purposes, historical understanding of embeddings, and very lightweight or legacy systems where performance constraints outweigh semantic accuracy.

#### FastText (2016, Facebook AI)


FastText builds on Word2Vec’s foundations by addressing one of its major limitations: out-of-vocabulary (OOV) and rare words. Instead of learning a single vector per word, FastText represents each word as a bag of character n-grams, allowing it to capture subword information such as prefixes, suffixes, and word stems.

This approach makes FastText particularly robust for:
- Morphologically rich languages
- Spelling variations and typos
- Previously unseen words

For example, even if the word “unhappiness” was never seen during training, FastText can still build a vector for it from the character n-grams it shares with known words, which cover pieces like:
- “un”
- “happi”
- “ness”

Despite these improvements, FastText still produces static embeddings. The word “apple” has the same vector in:
- I ate an apple
- Apple released a new iPhone

It cannot model context-dependent meaning, which limits its effectiveness for words whose sense depends on the surrounding sentence.
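
The character n-gram idea described above can be sketched in a few lines. This is an illustrative reimplementation, not the FastText library's code: the function name `char_ngrams` is made up for this example, and the n-gram lengths of 3 to 6 with `<` and `>` as word-boundary markers follow the setup described in the FastText paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, with '<' and '>'
    marking the word boundaries."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams

print(sorted(char_ngrams("where", min_n=3, max_n=3)))
# ['<wh', 'ere', 'her', 're>', 'whe']

# An unseen word still shares many n-grams with words seen in training
unseen = char_ngrams("unhappiness")
known = char_ngrams("happiness") | char_ngrams("unhappy")
shared = unseen & known
print(len(shared), "of", len(unseen), "n-grams are shared with known words")
```

Because the vector of a word is built from the vectors of its n-grams, the shared pieces give the unseen word a sensible starting representation.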

#### BERT (2018, Google)


BERT (Bidirectional Encoder Representations from Transformers) marked a major leap in language representation. Unlike Word2Vec or FastText, BERT produces contextual embeddings, meaning the representation of a word depends on the entire sentence it appears in. This allows BERT to capture deep semantic and syntactic relationships between words.

BERT is a foundation model for many NLP tasks, including:
- Question answering
- Text classification
- Chatbots
- Semantic understanding

While powerful, BERT is computationally intensive: it uses a lot of memory, has slower inference, and is not optimized for large-scale embedding generation.

#### OpenAI Embeddings (2022–present)

OpenAI’s embedding models (such as `text-embedding-3-small` and `text-embedding-3-large`) represent a modern generation of general-purpose, high-quality embeddings. They are trained on large, diverse datasets and explicitly optimized for semantic similarity, search, and clustering, making them particularly well suited for vector database applications.

Unlike earlier embedding approaches that were task- or domain-specific, OpenAI embeddings capture a broad understanding of natural language across topics, writing styles, and use cases. They are designed to work well out of the box, without requiring additional fine-tuning or complex preprocessing.

**Why OpenAI Embeddings Are Widely Used**
- Strong performance on semantic retrieval tasks
- Consistent vector quality across domains
- Easy integration via a managed API
- Optimized for production-scale systems

These characteristics make them a common choice for:
- Semantic search
- Recommendation systems
- Retrieval-Augmented Generation (RAG)
- Reasoning over large document collections

While OpenAI embeddings offer excellent quality and convenience, they come with important considerations:
- Proprietary and API-based (no self-hosting)
- Ongoing cost per request
- Data leaves your infrastructure, which may be a concern for sensitive or regulated environments

For these reasons, OpenAI embeddings are often contrasted with open-source alternatives such as BGE, where teams trade convenience for control and cost predictability.

#### BGE (BAAI General Embedding)

BGE (BAAI General Embedding) is a family of high-quality open-source embedding models developed by BAAI (Beijing Academy of Artificial Intelligence). BGE models are designed specifically for semantic search, retrieval, and RAG (Retrieval-Augmented Generation) tasks. They are widely used as drop-in alternatives to proprietary embeddings (e.g. OpenAI) in vector databases.

**Key Characteristics**
- Sentence & document embeddings
- Optimized for retrieval (query ↔ document similarity)
- Open-source and self-hostable
- Strong performance on MTEB benchmarks
- Available in multiple sizes and languages

#### CLIP (Contrastive Language–Image Pre-training)

CLIP is a multimodal embedding model developed by OpenAI that learns joint representations of text and images. Unlike text-only embedding models, CLIP embeds both text and images into the same vector space, which enables cross-modal search (for example, finding images from a text query).

OpenAI released CLIP’s code and pretrained weights publicly, but not the dataset it was trained on. Open reproductions trained on public data also exist, such as `OpenCLIP` models trained on the LAION datasets.

An embedding model like CLIP can support use cases like:
- Image search engines
- E-commerce product search
- Visual recommendation systems
- Content moderation
- Multimedia knowledge bases

### 1.5. Distance Measures

<div align="center">
    <img src="imgs/DistanceMeasuresEx.png" alt="Alt text" width="700" height="300" center>
</div>

#### Cosine similarity

Measures how similar the directions of two vectors are, ignoring their lengths; in other words, it is the cosine of the angle between the two vectors. The cosine similarity between two vectors **A** and **B** is given by:

$$
\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
$$


**Range:**

−1 to +1

**Meaning of values:**
| Value  | Meaning                                                     |
| :----- | :---------------------------------------------------------- |
| **+1** | Vectors point in the **same direction** (identical meaning) |
| **0**  | Vectors are **orthogonal** (no relationship / unrelated)    |
| **-1** | Vectors point in **opposite directions** (opposite meaning) |


In [14]:
import math

def dot_product(a, b):
    """Compute the dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    """Compute the magnitude (Euclidean norm) of a vector."""
    return math.sqrt(sum(x**2 for x in v))

def cosine_similarity(a, b):
    """
    Compute the cosine similarity between two vectors a and b.
    Uses helper functions for clarity.
    """
    mag_a = norm(a)
    mag_b = norm(b)

    # Handle division by zero (if one vector is all zeros)
    if mag_a == 0 or mag_b == 0:
        return 0.0

    return dot_product(a, b) / (mag_a * mag_b)

#### Euclidean distance


Measures how far apart two vectors are in space, considering both direction and length; in other words, it is the straight-line distance between two points. The Euclidean distance between two vectors **A** and **B** is given by:

$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
$$

**Range**:

0 to +∞

**Meaning of values:**
| Value             | Meaning                                      |
| :---------------- | :------------------------------------------- |
| **0**             | Vectors are **identical**                    |
| **Larger number** | Vectors are **further apart** (less similar) |


In [15]:
def subtract_vectors(a, b):
    """Subtract vector b from vector a (element-wise)."""
    return [x - y for x, y in zip(a, b)]

def euclidean_distance(a, b):
    """
    Compute the Euclidean distance between two vectors a and b.
    """
    diff = subtract_vectors(a, b)
    return norm(diff)


#### Example

<div align="center">
    <img src="imgs/points_graph1.png" alt="Alt text" width="500" height="500" center>
</div>


In [None]:
a = [1,1]
b = [2,2]
c = [-1,-1]

print(cosine_similarity(a, a))


# A and B point in the same direction → cosine similarity = 1
# Even though B is twice as long, it’s still perfectly aligned with A
# (the printed value may differ from 1 by a tiny floating-point error).
print(cosine_similarity(a, b))

# A and C point in opposite directions → cosine similarity = -1
print(cosine_similarity(a, c))

In [None]:
print(euclidean_distance(a, a))


# A and B are close (same direction, similar position)
print(euclidean_distance(a, b))


# A and C are far (different direction)
print(euclidean_distance(a, c))

#### Practical use

| Use Case                         | Metric                 | Why                                                                                                           |
| :------------------------------- | :--------------------- | :------------------------------------------------------------------------------------------------------------ |
| **Semantic search / embeddings** | **Cosine similarity**  | Focuses on meaning (direction). Magnitude differences are irrelevant because embeddings are often normalized. |
| **Clustering numeric data**      | **Euclidean distance** | Distance in scale matters (e.g., prices, coordinates).                                                        |
| **Physics or geometry problems** | **Euclidean distance** | Actual spatial distances are meaningful.                                                                      |


---
## 2. Vectors & Embeddings (Practical)

### 2.1. Using OpenAI Embedding Models: `text-embedding-3-small/large`

In this sub-section, we demonstrate how to generate embeddings using OpenAI’s proprietary models. These models are hosted on OpenAI’s servers and provide high-quality, ready-to-use embeddings without any local setup. While convenient and reliable, they require an API key and depend on external infrastructure for all requests. Before you move on, you first need to create an OpenAI API key on [OpenAI's platform](https://platform.openai.com/). Once you have your API key, you can store it in your `.env` file or provide it directly in your notebook, as shown in the cells below.

In [None]:
!pip install openai python-dotenv

In [7]:
# loading OPENAI_API_KEY from .env file
from dotenv import load_dotenv
import os
load_dotenv()
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# alternatively, provide your API key directly in the notebook
# OPENAI_API_KEY = "YOUR API KEY HERE"

In [8]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

In [None]:
resp = client.embeddings.create(
    model="text-embedding-3-small",   # or "text-embedding-3-large"
    input="I will deposit money in the bank"
)

print(resp.data[0].embedding)
print(len(resp.data[0].embedding))

In [11]:
def embed_sentences_open_ai(sentences):

    # Request embeddings
    resp = client.embeddings.create(
        model="text-embedding-3-small",   # or "text-embedding-3-large"
        input=sentences
    )

    # extract embeddings from response and 
    # return in a list of embeddings
    embeddings = list()
    for item in resp.data:
        embeddings.append(item.embedding)
    
    return embeddings

### 2.2. Using Open Source Embedding Models: An example with `bge-large-en-v1.5`

`Proprietary AI models`, such as those offered by **OpenAI** and similar vendors, have played a significant role in accelerating the adoption of large-scale language and multimodal systems. However, relying on closed, hosted models introduces several structural drawbacks that become increasingly relevant as AI systems move from experimentation to production. **First**, proprietary models operate as black boxes: their architectures, training data, and optimization strategies are not fully disclosed. This limits transparency, makes debugging and auditing difficult, and creates challenges for compliance, reproducibility, and long-term maintainability. **Second**, usage is typically bound to external APIs, introducing latency, availability dependencies, and vendor lock-in. Cost can also scale unpredictably with usage, making budgeting and optimization harder at scale. **Finally**, data privacy and governance constraints may prevent sensitive or regulated data from being processed by third-party services altogether.

In contrast, `open-source models` hosted on platforms like **Hugging Face** provide a compelling alternative. These models can be downloaded, inspected, fine-tuned, and deployed entirely within private infrastructure. This enables full control over data flow, inference costs, and system behavior, while eliminating external dependencies at runtime. The Hugging Face ecosystem also offers a rich and rapidly evolving catalog of high-quality models for text, image, and multimodal tasks, many of which rival or exceed proprietary solutions in specific domains such as semantic similarity, retrieval, and embeddings.

Crucially, modern open models (e.g., `BGE`, `E5`, `Nomic`) are designed to cover the same use cases as common proprietary embeddings. They integrate with standard tooling, support offline execution after the initial download, and allow teams to build scalable, production-grade AI systems without sacrificing performance or control. Open-source models hosted on Hugging Face are increasingly chosen by organizations seeking transparency, cost efficiency, and long-term architectural flexibility without compromising on quality.

#### Creating Text Embeddings with `bge-large-en-v1.5`

As an example of using locally hosted, open-source models, we will work with the `bge-large-en-v1.5` model from BAAI’s BGE (BAAI General Embedding) family. BGE models are specifically designed for generating high-quality sentence embeddings, making them well-suited for semantic search and other similarity-based tasks. Unlike proprietary models such as OpenAI’s embeddings, these models can be hosted locally (on your own server or even your personal machine) without requiring an account on any external platform. They are fully open-source and free to use.

In the next few cells, we will demonstrate how to host `bge-large-en-v1.5` locally and create an `embed_sentences_bge` function as a drop-in alternative to the `embed_sentences_open_ai` function we defined earlier.

In [None]:
! pip install torch transformers

**If you have an NVIDIA GPU (recommended)**

Install CUDA-enabled PyTorch by following the instructions at https://pytorch.org/get-started/locally/

> Example:
> 
> pip install torch --index-url https://download.pytorch.org/whl/cu121

In the cell below we download the model files from the `Hugging Face` Hub. After the first download, the model runs fully locally/offline.

In [None]:
# Import required libraries
import torch
from transformers import AutoTokenizer, AutoModel

# -----------------------------
# Model configuration
# -----------------------------
# Hugging Face model name
# BAAI: the organization/user that published the model
# bge-large-en-v1.5: the specific model version (BGE = BAAI General Embedding, large English model, v1.5)
MODEL_NAME = "BAAI/bge-large-en-v1.5"

# Select device:
# - "cuda" if GPU is available
# - otherwise fall back to CPU
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# -----------------------------
# Load tokenizer and model
# -----------------------------
# NOTE:
# - The first run will download the model
# - After that, everything runs fully offline
# - Files are cached in ~/.cache/huggingface/
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.to(DEVICE)

# Set model to evaluation mode
# (important: disables dropout, improves consistency)
model.eval()

In [3]:
def embed_sentence_bge(sentence: str) -> list:
    """
    Convert a single sentence into a normalized embedding vector.

    Parameters
    ----------
    sentence : str
        The input sentence to embed.

    Returns
    -------
    list
        A list of floats representing the sentence embedding.
        (Length = 1024 for bge-large-en-v1.5)
    """

    # Tokenize the input sentence
    # - padding/truncation ensure safe input length
    # - return_tensors="pt" returns PyTorch tensors
    inputs = tokenizer(
        sentence,
        return_tensors="pt",
        padding=True,
        truncation=True
    )

    # Move input tensors to the same device as the model
    inputs = {key: value.to(DEVICE) for key, value in inputs.items()}

    # Disable gradient calculation (faster + less memory)
    with torch.no_grad():

        # Forward pass through the model
        outputs = model(**inputs)

        # BGE models use the CLS token (first token) as sentence embedding
        # Shape: [1, hidden_size]
        embedding = outputs.last_hidden_state[:, 0]

        # Normalize embedding to unit length
        # This is critical for cosine similarity search
        embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)

    # Convert from:
    # PyTorch tensor -> CPU -> Python list of floats
    return embedding[0].cpu().tolist()

In [None]:
sentence = "I will deposit money in the bank"
embedding = embed_sentence_bge(sentence)
print(len(embedding))
print(embedding)

In [5]:
def embed_sentences_bge(sentences):
    embeddings = list()
    for sentence in sentences:
        embeddings.append(
            embed_sentence_bge(sentence)
        )
    return embeddings

#### Other Open-Source Models for Embeddings

While BGE provides a robust example of locally hosted sentence embeddings, it is by no means the only option. The Hugging Face ecosystem hosts a wide variety of open-source models for text, image, and multimodal embeddings, many of which are fully compatible with the same workflow we demonstrated.

These models can often serve as drop-in replacements: in many cases, all you need to do is change the model name in your code. Check each model card first, though, because the recommended pooling strategy (CLS token vs. mean pooling), input prefixes, and normalization steps can differ between models.

Below are some notable examples organized by embedding type, but you can view the full list of models [here](https://huggingface.co/models)

**Text Embeddings**
| Model                                     | Notes                                                           |
| ----------------------------------------- | --------------------------------------------------------------- |
| `BAAI/bge-large-en-v1.5`                  | High-quality, English sentence embeddings, local hosting        |
| `intfloat/e5-large-v2`                    | Excellent retrieval performance, optimized for semantic search  |
| `nomic-ai/nomic-embed-text-v1.5`          | BERT-style long-context model, supports local deployment        |
| `sentence-transformers/all-mpnet-base-v2` | Lightweight and fast, widely used in semantic similarity tasks  |


**Image Embeddings**
| Model                                   | Notes                                                                     |
| --------------------------------------- | ------------------------------------------------------------------------- |
| `openai/clip-vit-large-patch14`         | Gold standard for image embeddings, compatible with text-image similarity |
| `openai/clip-vit-base-patch32`          | Smaller and faster alternative                                            |
| `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` | Strong open-source CLIP model for image embeddings                        |

**Text ↔ Image Embeddings (Multimodal)**
| Model                                   | Notes                                                                               |
| --------------------------------------- | ----------------------------------------------------------------------------------- |
| `openai/clip-vit-large-patch14`         | Text and image share the same vector space, ideal for retrieval and semantic search |
| `openai/clip-vit-base-patch32`          | Faster and smaller multimodal model                                                 |
| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | OpenCLIP variant, high-quality open-source alternative                              |


### 2.3. Normalizing Embeddings Before Using Cosine Similarity

When working with text embeddings, similarity is usually measured using `cosine similarity`, because we care about `semantic meaning`, not vector magnitude. Cosine similarity already ignores magnitude, but it is still good practice to **normalize** embeddings first: once every vector has unit length, the dot product and Euclidean distance also depend only on direction, and similarity becomes cheaper to compute. Normalizing means scaling the values of an embedding (i.e. its feature values) so that the magnitude of the corresponding vector (i.e. its length) becomes exactly 1, where the magnitude is:

$$
\| \mathbf{v} \| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}
$$


#### Practical implication for Normalizing Embeddings

  - Comparison between any two embeddings depends only on their angle, i.e., how close their directions are (pure semantic comparison).
  - This makes cosine similarity equal to the dot product, simplifying computation and guaranteeing similarity values in the range [-1, 1].
  - Euclidean distance can still be computed and gives exactly the same ranking as cosine similarity, because for unit-length vectors the squared distance equals $2 - 2\cos\theta$.
  - Magnitude-based signals (like “intensity of meaning”) are not available from these embeddings.

$$
\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
$$
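
The implications listed above can be checked numerically. The sketch below uses two arbitrary 2-D vectors (`p` and `q`, chosen only for illustration) and shows that after normalization the plain dot product equals the cosine similarity, and that the squared Euclidean distance equals `2 - 2 * cosine`. The helper names `dot` and `normalize` are local to this example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

p = [3.0, 4.0]
q = [1.0, 7.0]

# Cosine similarity on the raw vectors needs both magnitudes
cos_raw = dot(p, q) / (math.sqrt(dot(p, p)) * math.sqrt(dot(q, q)))

# On unit vectors, the dot product alone gives the same value
p_hat, q_hat = normalize(p), normalize(q)
cos_unit = dot(p_hat, q_hat)

# Squared Euclidean distance between the unit vectors
diff = [x - y for x, y in zip(p_hat, q_hat)]
dist_sq = dot(diff, diff)

print(round(cos_raw, 6), round(cos_unit, 6))          # the two values agree
print(round(dist_sq, 6), round(2 - 2 * cos_unit, 6))  # and so do these
```

Since the distance is a decreasing function of the cosine, sorting neighbors by smallest Euclidean distance or by largest cosine similarity produces the same order for normalized embeddings.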

#### Working with Normalized Embeddings

With BGE-style models, embeddings are not normalized by default, so we explicitly normalize them ourselves. In contrast, OpenAI embedding models (e.g. text-embedding-3-small, text-embedding-3-large) return unit-length embeddings automatically. This normalization happens out of the box, without extra steps.

In [22]:
embed_sentences = embed_sentences_open_ai
# embed_sentences = embed_sentences_bge

In [None]:
words = ["happy", "joyful", "cat", "dog", "king", "queen"]
embeddings = embed_sentences(words)

for word, embedding in zip(words, embeddings):
    # norm() is the helper defined in the cosine similarity cell above
    word_norm = round(norm(embedding), 2)
    print(f"The magnitude of the word: {word} is {word_norm}")

### 2.4. Practical Example

In [27]:
# embed_sentences = embed_sentences_bge
embed_sentences = embed_sentences_open_ai

In [28]:
words = ["happy", "joyful", "cat", "dog", "king", "queen"]
embeddings = embed_sentences(words)

In [None]:
def get_distance_measure(embedding, other_embeddings, distance_measure_fn):
    distance_measures = list()
    for other_embedding in other_embeddings:
        distance_measures.append(round(distance_measure_fn(embedding, other_embedding), 2))
    return distance_measures

get_distance_measure(embedding=embeddings[0], other_embeddings=embeddings, distance_measure_fn=euclidean_distance)

In [None]:
# Cosine similarity
header = "\t".join(f"{w:>10}" for w in words)
print(f"{'':>10}\t{header}")

for i, embedding in enumerate(embeddings):
    distance_arr = get_distance_measure(
        embedding, embeddings, distance_measure_fn=cosine_similarity
    )
    row = "\t".join(f"{d:>10.4f}" for d in distance_arr)
    print(f"{words[i]:>10}\t{row}")

In [None]:
# Euclidean Distance
header = "\t".join(f"{w:>10}" for w in words)
print(f"{'':>10}\t{header}")

for i, embedding in enumerate(embeddings):
    distance_arr = get_distance_measure(
        embedding, embeddings, distance_measure_fn=euclidean_distance
    )
    row = "\t".join(f"{d:>10.4f}" for d in distance_arr)
    print(f"{words[i]:>10}\t{row}")

---
## 3. Vector Databases (Theory)

A vector database stores and manages high-dimensional vector data. Data points are stored as arrays of numbers called **vectors**, which are organized (`indexed`) in a way that allows for efficient, low-latency similarity-based queries.

> Vector databases are growing in popularity because they deliver the speed and performance needed to drive generative artificial intelligence (AI) use cases and applications. According to Gartner®, by 2026, more than 30% of enterprises will have adopted vector databases to build their foundation models with relevant business data ([source](https://www.gartner.com/account/signin?method=initialize&TARGET=http%3A%2F%2Fwww.gartner.com%2Fdocument%2F4705699%3Fref%3DTypeAheadSearch)).

The applications for vector databases are vast and growing. Some key use cases include:

- Retrieval-augmented generation (RAG)
- Conversational AI
- Recommendation engines
- Vector search

### 3.1. Vector databases versus traditional databases

#### Data Structure

The nature of data has undergone a profound transformation. It's no longer confined to structured information easily stored in traditional databases. Unstructured data (including social media posts, images, videos, audio clips and more) is growing 30% to 60% year over year.

While relational databases excel at managing structured and semi-structured datasets in specific formats, they are not optimized for unstructured data (e.g., documents, audio files).

Unlike traditional relational databases with rows and columns, data points in a vector database are represented by vectors with a fixed number of dimensions. These are essentially unstructured data transformed into embeddings. Because they use high-dimensional vector embeddings, vector databases are better able to handle unstructured datasets.

> Vectors can represent complex objects such as words, images, videos and audio ... and are typically generated by an ML model.

#### Query Structure

Traditional search typically represents data by using discrete tokens or features, such as keywords, tags or metadata. Traditional searches rely on exact matches to retrieve relevant results. For example, a search for "smartphone" would return results containing the word "smartphone."

In contrast, vector search represents data as dense vectors, which are vectors with most or all elements being nonzero. Vectors are represented in a continuous vector space, the mathematical space in which data is represented as vectors. The vector representations enable similarity search. For example, a vector search for “smartphone” might also return results for “cellphone” and “mobile devices.”

> Each dimension of the dense vector corresponds to a latent feature or aspect of the data. A latent feature is an underlying characteristic or attribute that is not directly observed but inferred from the data through mathematical models or algorithms. Latent features capture the hidden patterns and relationships in the data, enabling more meaningful and accurate representations of items as vectors in a high-dimensional space.

> When we query the scalar index to retrieve rows or records, we generally query for exact matches. The power of indexes using vector embeddings that capture semantic information is we can instead search the index for approximate matches. We provide a vector as input and ask the vector index to return other vectors similar to the input or query vector. This allows us to search large datasets of vectors very quickly. The class of algorithms used to build and search vector indexes is called Approximate Nearest Neighbor (ANN) search. ANN algorithms rely on a similarity measure to determine the nearest neighbors. The vector index must be constructed based on a particular similarity metric.

#### Types of Databases

<div align="center">
    <img src="imgs/1_4l3TBZGVwRpH8o0pGMSbJg.gif" alt="Alt text" width="500" height="500" center>
</div>

Image source: [Medium post](https://medium.com/@tushar_datascience/12-types-of-databases-you-must-know-in-2025-a-complete-guide-1586d3df19cb) by Tushar Mahuri (2025)

### 3.2. Components of a Vector Database 

#### Vector Storage

Vector databases store vector embeddings, the outputs of an embedding model. They also store each vector's metadata, for example a title, a description, and the original data type, which can be used to filter queries.

By ingesting and storing these embeddings, the database can quickly retrieve the results of a similarity search, matching the user's prompt with similar vector embeddings.
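As a rough sketch of this storage model, the snippet below keeps each vector next to its metadata and applies an equality filter on the metadata before ranking. The `store`, `upsert`, and `search` names and the `where` parameter are assumptions made for illustration, not the API of any particular database.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

store = {}  # record id -> {"vector": [...], "metadata": {...}}

def upsert(record_id, vector, metadata):
    store[record_id] = {"vector": vector, "metadata": metadata}

def search(query_vec, top_k=2, where=None):
    """Rank records by similarity, keeping only those whose metadata matches `where`."""
    where = where or {}
    matches = []
    for record_id, record in store.items():
        if all(record["metadata"].get(key) == value for key, value in where.items()):
            similarity = cosine_similarity(query_vec, record["vector"])
            matches.append((similarity, record_id))
    matches.sort(reverse=True)
    return [record_id for _, record_id in matches[:top_k]]

# Hypothetical records with toy 2-d vectors.
upsert("doc1", [0.9, 0.1], {"title": "Phones", "type": "text"})
upsert("doc2", [0.8, 0.3], {"title": "Tablets", "type": "text"})
upsert("img1", [0.95, 0.05], {"title": "Phone photo", "type": "image"})

text_hits = search([1.0, 0.0], where={"type": "text"})
all_hits = search([1.0, 0.0], top_k=1)
print(text_hits, all_hits)
```

Without a filter the image record is the closest match; with the `type` filter it is excluded before any ranking happens.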

#### Vector Indexing

Vectors need to be indexed to accelerate searches within high-dimensional data spaces. Vector databases create indexes on vector embeddings for search functions. Indexing maps the vectors to new data structures that enable faster similarity or distance searches, such as nearest neighbor searches, between vectors.

There are different techniques that can be used for indexing. Vectors can be indexed by using algorithms such as hierarchical navigable small world (HNSW), locality-sensitive hashing (LSH) or product quantization (PQ).


> Vector databases use indexing techniques to enable faster searching. Vector indexing and distance-calculating algorithms such as nearest neighbor search can help optimize performance when searching for relevant results across large datasets with millions, if not billions, of data points. One consideration is that with indexing, vector databases provide `approximate` results. Applications requiring greater accuracy might need to use no or flat indexing or a different kind of database at the cost of a slower processing speed.

#### Similarity search based on querying or prompting

When a user queries a vector database, the query is first converted into a vector embedding using the same embedding model that produced the stored and indexed data. The database then calculates distances between the query vector and the vectors stored in the index to find similar results.

Closeness is measured with a similarity metric such as cosine similarity, and a search algorithm such as nearest neighbor search uses that metric to find the closest vectors.

The database then returns the most similar vectors or nearest neighbors to the query vector according to the similarity ranking.

<div align="center">
    <img src="imgs/Vector Databases.drawio.png" alt="Alt text" width="550" height="200" center>
</div>

---
## 4. Vector Indexes (Theory)

### 4.1. How does a vector index work

An index in a database is a data structure that helps the system find rows quickly without scanning the entire table, much like the index of a book. Indexes make read queries faster, though they use extra space and can slightly slow down writes.

#### Traditional Database Index

In traditional databases, an index is built on scalar data: columns with single values, such as numbers, dates, or strings. Each row represents a fact or object, and columns describe its attributes or link to other tables. Queries on a traditional index usually look for exact matches (e.g., find all users with age = 30).
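To make the exact-match behavior concrete, here is a small sketch of a scalar index, assuming a sorted list searched with binary search as a stand-in for the B-tree style structures real databases use. The `users` table and `find_by_age` helper are hypothetical.

```python
import bisect

users = [
    {"name": "Ana", "age": 30},
    {"name": "Ben", "age": 25},
    {"name": "Cleo", "age": 30},
    {"name": "Dan", "age": 41},
]

# The "index": (age, row position) pairs kept in sorted order.
age_index = sorted((user["age"], row) for row, user in enumerate(users))

def find_by_age(age):
    """Exact-match lookup using binary search instead of a full table scan."""
    lo = bisect.bisect_left(age_index, (age, -1))
    hi = bisect.bisect_right(age_index, (age, len(users)))
    return [users[row]["name"] for _, row in age_index[lo:hi]]

print(find_by_age(30))
print(find_by_age(99))
```

The lookup either finds rows with exactly that value or finds nothing; there is no notion of "close to 30" unless the query asks for a range explicitly.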

#### Vector Index

A vector index is designed for high-dimensional vector data, such as embeddings produced by models like BERT, BGE, or OpenAI embeddings. Vector indexes enable fast and accurate similarity search and retrieval of vector embeddings from a large dataset of objects.

The class of algorithms used to build and search vector indexes is called **Approximate Nearest Neighbor (ANN) search**. ANN algorithms rely on a similarity measure to determine the nearest neighbors. The vector index must be constructed based on a particular similarity metric. 

Approximate Nearest Neighbor (ANN) algorithms achieve high performance by trading perfect accuracy for speed and scalability. Instead of guaranteeing the exact closest vectors, ANN methods return vectors that are very likely to be among the nearest neighbors.
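One common way to quantify this trade-off is recall@k: the share of the true nearest neighbors that the approximate search also returned. The sketch below assumes you already have both result lists; the ids are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical results for the same query.
exact = ["vec10", "vec42", "vec7", "vec3", "vec99"]    # from an exhaustive search
approx = ["vec10", "vec7", "vec42", "vec15", "vec99"]  # from an ANN index

recall = recall_at_k(approx, exact)
print(recall)
```

Here the ANN result misses one of the five true neighbors, so recall is 4 out of 5. Index parameters are typically tuned until recall is acceptable for the application at the desired query speed.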

### 4.2. Flat indexing

Flat indexing is an index strategy where we store each vector as is, with no modifications. This approach is easy to implement and provides perfect accuracy, but it is slow: the similarity between the query vector and every other vector in the index is computed.

We then return the `K` vectors with the highest similarity score (equivalently, the smallest distance).

Flat indexing is the right choice when perfect accuracy is required and speed is not a consideration. If the dataset is small, flat indexing may also be a good choice, as the search speed can still be reasonable.
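A flat index is short enough to sketch in full. The version below is a minimal illustration, assuming cosine similarity as the metric; the `FlatIndex` class name and the toy vectors are not part of the notebook.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class FlatIndex:
    """Stores vectors unchanged and compares the query against every one of them."""

    def __init__(self):
        self.vectors = {}

    def insert(self, vec_id, vector):
        self.vectors[vec_id] = vector

    def query(self, query_vec, top_k=3):
        scored = (
            (cosine_similarity(query_vec, vector), vec_id)
            for vec_id, vector in self.vectors.items()
        )
        # nlargest keeps only the top_k highest-similarity entries.
        return [(vec_id, sim) for sim, vec_id in heapq.nlargest(top_k, scored)]

flat = FlatIndex()
flat.insert("a", [1.0, 0.0])
flat.insert("b", [0.7, 0.7])
flat.insert("c", [0.0, 1.0])
flat.insert("d", [-1.0, 0.0])

top2 = flat.query([0.9, 0.1], top_k=2)
print(top2)
```

Every query costs one similarity computation per stored vector, which is why this approach stops being practical as the dataset grows.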

### 4.3. Locality Sensitive Hashing (LSH) indexes

Locality Sensitive Hashing is an indexing strategy that optimizes for speed and finding an `approximate` nearest neighbor, instead of doing an exhaustive search to find the actual nearest neighbor as is done with flat indexing.

The index is built using a hashing function. Vector embeddings that are near each other are hashed to the same bucket. We can then store all these similar vectors in a single table or bucket.

When a query vector is provided, its nearest neighbors can be found by hashing the query vector and then computing the similarity metric only for the vectors that hashed to the same bucket. This results in a much smaller search than flat indexing, where the similarity metric is computed over the whole dataset, and so greatly increases query speed.


<div align="center">
    <img src="imgs/Locality-sensitive-hashing-LSH.png" alt="Alt text" width="300" height="300" center>
</div>

### 4.4. Inverted file (IVF) indexes

Inverted file (IVF) indexes are similar to LSH in that the goal is to first map the query vector to a smaller subset of the vector space and then only search that smaller space for approximate nearest neighbors. This greatly reduces the number of vectors we need to compare the query vector to, thus speeding up our ANN search.

In LSH that subset of vectors was produced by a hashing function. In IVF, the vector space is partitioned or clustered, and then centroids of each cluster are found. For a given query vector, we then find the closest centroid. Then for that centroid, we search all the vectors in the associated cluster.

Note that there is a potential problem when the query vector is near the edge of multiple clusters. In this case, the nearest vectors may be in the neighboring cluster. In these cases, we generally need to search multiple clusters.


> IVF splits the whole dataset into several clusters using techniques like K-means clustering. Each vector of the database is assigned to a specific cluster. When a new query comes in, the system doesn't traverse the whole dataset. Instead, it identifies the nearest or most similar clusters and searches for matching vectors within those clusters.


<div align="center">
    <img src="imgs/0_MB4pj6br37hutgCu.webp" alt="Alt text" width="400" height="350" center>
</div>
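The IVF idea can be sketched with a small k-means and one list of ids per cluster. This is a simplified illustration under several assumptions: Euclidean distance, k-means seeded with the first `k` points so the result is deterministic, and a hypothetical `nprobe` parameter that controls how many neighboring clusters are searched.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

class IVFIndex:
    def __init__(self, vectors, num_clusters):
        self.vectors = vectors  # id -> vector
        self.centroids = kmeans(list(vectors.values()), num_clusters)
        self.lists = [[] for _ in range(num_clusters)]  # cluster -> ids
        for vec_id, vector in vectors.items():
            self.lists[self._nearest_centroids(vector, 1)[0]].append(vec_id)

    def _nearest_centroids(self, vector, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: euclidean(vector, self.centroids[c]),
        )
        return order[:n]

    def query(self, query_vec, top_k=1, nprobe=1):
        """Search only the nprobe clusters whose centroids are closest to the query."""
        candidates = [
            vec_id
            for cluster in self._nearest_centroids(query_vec, nprobe)
            for vec_id in self.lists[cluster]
        ]
        candidates.sort(key=lambda vec_id: euclidean(query_vec, self.vectors[vec_id]))
        return candidates[:top_k]

# Two well-separated toy groups.
vectors = {
    "a": [0.0, 0.0], "b": [0.2, 0.1], "c": [0.1, 0.3],
    "x": [5.0, 5.0], "y": [5.2, 4.9], "z": [4.8, 5.1],
}
ivf = IVFIndex(vectors, num_clusters=2)
print(ivf.query([4.9, 5.0], top_k=2))
print(ivf.query([0.1, 0.1], top_k=3, nprobe=2))
```

Raising `nprobe` addresses the cluster-edge problem described above, at the cost of comparing against more vectors.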


### 4.5. Hierarchical Navigable Small Worlds (HNSW) indexes


Hierarchical Navigable Small World (HNSW) is one of the most popular algorithms for building a vector index. It is very fast and efficient. 

HNSW is a multi-layered graph approach to indexing data. The lowest layer contains every vector in the index. Each layer above it keeps only a subset of the points from the layer below, so the number of data points shrinks roughly exponentially as we move up. Within a layer, points are connected to their nearest neighbors, and a point that appears in several layers links those layers together.

To search the index, we start at the highest layer of the graph and find the closest match to the query vector there. That match becomes the starting point in the next layer down, where we again find the closest matches to the query vector. We continue this process until we reach the lowest layer in the graph.

<div align="center">
    <img src="imgs/2d5d5f8b5aa8f575398f16cef31467dc95e4ed8b-3080x2136.avif" alt="Alt text" width="500" height="350" center>
</div>
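The layered search can be illustrated with a heavily simplified sketch. It assumes hand-built neighbor lists over seven points on a line and follows only the single best neighbor at each step; a real HNSW implementation assigns points to layers randomly during insertion and tracks a list of candidates rather than one node. All names below are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = {
    "a": [0.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0], "d": [3.0, 0.0],
    "e": [4.0, 0.0], "f": [5.0, 0.0], "g": [6.0, 0.0],
}

# Ordered from the top (sparsest) layer to the bottom layer, which holds every point.
layers = [
    {"a": ["g"], "g": ["a"]},
    {"a": ["c"], "c": ["a", "e"], "e": ["c", "g"], "g": ["e"]},
    {
        "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"],
        "e": ["d", "f"], "f": ["e", "g"], "g": ["f"],
    },
]
entry_point = "a"

def greedy_search(graph, start, query):
    """Walk to whichever neighbor is closer to the query until no neighbor improves."""
    current = start
    while True:
        best = min(
            graph[current],
            key=lambda node: euclidean(points[node], query),
            default=current,
        )
        if euclidean(points[best], query) >= euclidean(points[current], query):
            return current
        current = best

def layered_search(query):
    """Descend layer by layer, reusing each layer's result as the next starting point."""
    current = entry_point
    for graph in layers:
        current = greedy_search(graph, current, query)
    return current

print(layered_search([4.2, 0.0]))
print(layered_search([1.1, 0.0]))
```

The upper layers act like express lanes: the search covers long distances in a few hops before the bottom layer refines the result among close neighbors.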


### 4.6. Summary


There are many considerations for choosing an appropriate vector index strategy. First, we have use case considerations. How fast do you need the results? How accurate do you need them to be? All vector index methods have some balance between the speed at which we can retrieve data and find similar vectors and how accurate the results will be.

Another consideration is memory usage. Different algorithms may greatly increase the amount of data that needs to be stored to run ANN searches efficiently.

The index must also be built before we can start executing queries against it, so we need to consider how complex the build is. We should also consider how difficult it is to update the index when new vectors are added. If the entire index has to be recomputed each time a new vector is added, we need to ask ourselves how often the index will be updated.

---
## 5. Implementing a Vector Index

### 5.1. Implementing an LSH Index

In [20]:
import random
def random_vector(dim):
    """Generate a random vector with components in [-1, 1]."""
    return [random.uniform(-1, 1) for _ in range(dim)]

#### LSH Index Structure

In [21]:
from collections import defaultdict

num_tables = 2
num_hashes = 3
dim = 3
# ----
tables = dict()
for i in range(num_tables):
    tables[i] = {
        "planes": [random_vector(dim) for _ in range(num_hashes)],
        "data": defaultdict(list)
    }

#### Sign Projection (Hashing Function)

 - For each hyperplane, we calculate the `dot product` between our vector and the hyperplane vector.
 - If it's `positive` (or zero), the vector lies on one side → output '1'.
 - If it's `negative`, the vector lies on the other side → output '0'.
 - The sequence of bits (like `10110010`) becomes a `hash key`, meaning all vectors with the same key are likely to be close in angle (similar direction).

> We use DOT PRODUCT, not COSINE SIMILARITY, because the purpose is not to measure how similar two vectors are (`vec`, `hp`). Instead, it's to decide which side of a random hyperplane (hp) a vector (vec) lies on. Both quantities always have the same sign, so the cheaper dot product is enough.

In [22]:
def get_hash(vec, hyperplanes):
    bits = ['1' if dot_product(vec, hp) >= 0 else '0' for hp in hyperplanes]
    return ''.join(bits)
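A short worked example may help here. The block repeats `get_hash` and defines its own `dot_product` so it runs on its own, and it assumes three fixed hyperplanes in 2-d chosen by hand for illustration (the index above draws them at random).

```python
def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def get_hash(vec, hyperplanes):
    bits = ['1' if dot_product(vec, hp) >= 0 else '0' for hp in hyperplanes]
    return ''.join(bits)

# Hypothetical, hand-picked hyperplane normals.
planes = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

v1 = [0.9, 0.2]
v2 = [0.8, 0.3]    # points in nearly the same direction as v1
v3 = [-0.7, 0.6]   # points in a very different direction

for name, vec in (("v1", v1), ("v2", v2), ("v3", v3)):
    print(name, get_hash(vec, planes))
```

`v1` and `v2` fall on the same side of all three hyperplanes and so receive the same key, while `v3` lands in a different bucket.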

#### LSH Insert 

For each table:
 - Compute the hash (binary string) for the vector.
 - Store (id, vector) in that bucket.

This way, similar vectors are likely to share the same hash key and end up in the same bucket.

> 💡 Example:
> If two vectors produce the same hash '10110001' in a table, they'll be stored together in the same list.

In [23]:
def lsh_insert(id, vector):
    for i in range(num_tables):
        table = tables[i]
        h = get_hash(vector, table["planes"])
        table["data"][h].append({"id": id, "vector": vector})

#### LSH Query

- Compute the hash for the query vector in each table.
- Gather all items stored in those hash buckets (these are candidate neighbors).
- Remove duplicates by keying the candidates on their id.
- Compute cosine similarity between the query and each candidate.
- Sort by similarity and return the top-k most similar ones.

> 💡 Why:
> LSH doesn't give the exact nearest neighbors, but it gives a small subset of likely candidates that you can then rank exactly using cosine similarity. That's why it's much faster than checking all vectors.

In [24]:
def lsh_query(query_vec, top_k=3):
    """Query nearest neighbors of query_vec."""
    candidates = dict()
    for i in range(num_tables):
        table = tables[i]
        h = get_hash(query_vec, table["planes"])
        for candidate_data in table["data"].get(h, list()):
            candidate_id = candidate_data["id"]
            candidates[candidate_id] = candidate_data

    n_candidates = len(candidates.keys())
    print(f"The number of candidates to be searched is {n_candidates}")

    if n_candidates == 0:
        return list()
        
    sims = list()
    for candidate_id, candidate_data in candidates.items():
        sims.append({
            "id": candidate_id,
            "similarity": cosine_similarity(query_vec, candidate_data["vector"]),
        })

    sims.sort(key=lambda x: x["similarity"], reverse=True)
    return sims[:top_k]

#### LSH Index Implementation as a Class


In [25]:
class LSH:
    def __init__(self, dim, num_hashes=3, num_tables=2):
        self.dim = dim
        self.num_hashes = num_hashes
        self.num_tables = num_tables
        self.tables = self._init_tables()

    def _init_tables(self):
        tables = dict()
        for i in range(self.num_tables):
            tables[i] = {
                "planes": [random_vector(self.dim) for _ in range(self.num_hashes)],
                "data": defaultdict(list)
            }
        return tables

    def _get_hash(self, vec, hyperplanes):
        bits = ['1' if dot_product(vec, hp) >= 0 else '0' for hp in hyperplanes]
        return ''.join(bits)

    def insert(self, id, vector):
        for i in range(self.num_tables):
            table = self.tables[i]
            h = self._get_hash(vector, table["planes"])
            table["data"][h].append({"id": id, "vector": vector})
            

    def query(self, query_vec, top_k=3):

        candidates = dict()
        for i in range(self.num_tables):
            table = self.tables[i]
            h = self._get_hash(query_vec, table["planes"])
            for candidate_data in table["data"].get(h, list()):
                candidate_id = candidate_data["id"]
                candidates[candidate_id] = candidate_data

        n_candidates = len(candidates.keys())
        print(f"The number of candidates to be searched is {n_candidates}")

        if n_candidates == 0:
            return list()
            
        sims = list()
        for candidate_id, candidate_data in candidates.items():
            sims.append({
                "id": candidate_id,
                "similarity": cosine_similarity(query_vec, candidate_data["vector"]),
            })

        sims.sort(key=lambda x: x["similarity"], reverse=True)
        return sims[:top_k]
    
    def explain(self):
        for i in range(self.num_tables):

            table = self.tables[i]
            print(f"Table {i+1}:")

            for h, candidates in table["data"].items():
                print(f"  Hash: {h} --> contains {len(candidates)} candidates")
            print()

### 5.2. Let's Try Our LSH Index

In [26]:
# Randomly generate vector data
dim = 16
vectors = {f"vec{i}": [random.uniform(-1, 1) for _ in range(dim)] for i in range(5000)}

In [None]:
# create lsh Index and insert data
lsh_index = LSH(dim=dim, num_hashes=4, num_tables=3)
for id, vec in vectors.items():
    lsh_index.insert(id, vec)
lsh_index.explain()

In [28]:
# Query a near-duplicate
query_vec = [v + random.uniform(-0.09, 0.09) for v in vectors["vec10"]]

In [None]:
results = lsh_index.query(query_vec, top_k=5)

print("Nearest neighbors:")
print("-" * 50)
for result in results:
    print(f"{result['id']}: similarity = {result['similarity']:.4f}")
    print("-" * 50)
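To reason about how `num_hashes` and `num_tables` affect the candidate count printed above, the collision probability can be estimated. The sketch below assumes the classic random-hyperplane result, where two vectors at angle θ agree on one bit with probability 1 − θ/π. That result holds for hyperplanes drawn from a rotationally symmetric distribution, so for the uniform components used in this notebook it is only an approximation. `candidate_probability` is an illustrative name.

```python
import math

def candidate_probability(cos_sim, num_hashes, num_tables):
    """Estimated probability that a vector shares at least one bucket with the query."""
    theta = math.acos(max(-1.0, min(1.0, cos_sim)))
    p_bit = 1.0 - theta / math.pi          # same side of one hyperplane
    p_table = p_bit ** num_hashes          # same key in one table
    return 1.0 - (1.0 - p_table) ** num_tables  # same key in any table

for sim in (0.99, 0.9, 0.5, 0.0):
    estimate = candidate_probability(sim, num_hashes=4, num_tables=3)
    print(f"cosine similarity {sim}: {estimate:.3f}")
```

More hashes per table make buckets smaller and more selective, while more tables give a true neighbor extra chances to collide with the query, so the two parameters trade speed against recall.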

---
## 6. Implementing a Vector Database

### 6.1. Vector Database Implementation

In [30]:
class VectorDatabase:

    DIM = 1536  # Default dimension for OpenAI's text-embedding-3-small

    def __init__(self, num_tables=2, num_hashes=3):
        self.openai_client = OpenAI(api_key=OPENAI_API_KEY) # OpenAI client for generating embeddings
        self.lsh_index = LSH(dim=self.DIM, num_tables=num_tables, num_hashes=num_hashes)
        self.data = {}

    def _get_embedding(self, text):
        return self.openai_client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

    def upsert(self, id, text, metadata=None):
        # Default to None instead of dict(): a mutable default would be shared across calls
        metadata = metadata if metadata is not None else {}
        vector = self._get_embedding(text)
        self.data[id] = {"vector": vector, "text": text, "metadata": metadata}
        self.lsh_index.insert(id, vector)

    def search(self, query_text, top_k=3):
        query_vec = self._get_embedding(query_text)
        results = self.lsh_index.query(query_vec, top_k)
        enriched = []
        for result in results:
            id = result["id"]
            similarity = round(result["similarity"], 4)
            record = self.data.get(id)  # get the record from the data dictionary
            if record:
                enriched.append({
                    "id": id,
                    "similarity": similarity,
                    "text": record["text"],
                    "metadata": record["metadata"]
                })
        return enriched

In [31]:
db = VectorDatabase(num_hashes=2, num_tables=2)

#### Data Preparation for Insertion

In [32]:
records = [
    { "_id": "rec1", "chunk_text": "The Eiffel Tower was completed in 1889 and stands in Paris, France.", "category": "history" },
    { "_id": "rec2", "chunk_text": "Photosynthesis allows plants to convert sunlight into energy.", "category": "science" },
    { "_id": "rec3", "chunk_text": "Albert Einstein developed the theory of relativity.", "category": "science" },
    { "_id": "rec4", "chunk_text": "The mitochondrion is often called the powerhouse of the cell.", "category": "biology" },
    { "_id": "rec5", "chunk_text": "Shakespeare wrote many famous plays, including Hamlet and Macbeth.", "category": "literature" },
    { "_id": "rec6", "chunk_text": "Water boils at 100°C under standard atmospheric pressure.", "category": "physics" },
    { "_id": "rec7", "chunk_text": "The Great Wall of China was built to protect against invasions.", "category": "history" },
    { "_id": "rec8", "chunk_text": "Honey never spoils due to its low moisture content and acidity.", "category": "food science" },
    { "_id": "rec9", "chunk_text": "The speed of light in a vacuum is approximately 299,792 km/s.", "category": "physics" },
    { "_id": "rec10", "chunk_text": "Newton's laws describe the motion of objects.", "category": "physics" },
    { "_id": "rec11", "chunk_text": "The human brain has approximately 86 billion neurons.", "category": "biology" },
    { "_id": "rec12", "chunk_text": "The Amazon Rainforest is one of the most biodiverse places on Earth.", "category": "geography" },
    { "_id": "rec13", "chunk_text": "Black holes have gravitational fields so strong that not even light can escape.", "category": "astronomy" },
    { "_id": "rec14", "chunk_text": "The periodic table organizes elements based on their atomic number.", "category": "chemistry" },
    { "_id": "rec15", "chunk_text": "Leonardo da Vinci painted the Mona Lisa.", "category": "art" },
    { "_id": "rec16", "chunk_text": "The internet revolutionized communication and information sharing.", "category": "technology" },
    { "_id": "rec17", "chunk_text": "The Pyramids of Giza are among the Seven Wonders of the Ancient World.", "category": "history" },
    { "_id": "rec18", "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans.", "category": "biology" },
    { "_id": "rec19", "chunk_text": "The Pacific Ocean is the largest and deepest ocean on Earth.", "category": "geography" },
    { "_id": "rec20", "chunk_text": "Chess is a strategic game that originated in India.", "category": "games" },
    { "_id": "rec21", "chunk_text": "The Statue of Liberty was a gift from France to the United States.", "category": "history" },
    { "_id": "rec22", "chunk_text": "Coffee contains caffeine, a natural stimulant.", "category": "food science" },
    { "_id": "rec23", "chunk_text": "Thomas Edison invented the practical electric light bulb.", "category": "inventions" },
    { "_id": "rec24", "chunk_text": "The moon influences ocean tides due to gravitational pull.", "category": "astronomy" },
    { "_id": "rec25", "chunk_text": "DNA carries genetic information for all living organisms.", "category": "biology" },
    { "_id": "rec26", "chunk_text": "Rome was once the center of a vast empire.", "category": "history" },
    { "_id": "rec27", "chunk_text": "The Wright brothers pioneered human flight in 1903.", "category": "inventions" },
    { "_id": "rec28", "chunk_text": "Bananas are a good source of potassium.", "category": "nutrition" },
    { "_id": "rec29", "chunk_text": "The stock market fluctuates based on supply and demand.", "category": "economics" },
    { "_id": "rec30", "chunk_text": "A compass needle points toward the magnetic north pole.", "category": "navigation" },
    { "_id": "rec31", "chunk_text": "The universe is expanding, according to the Big Bang theory.", "category": "astronomy" },
    { "_id": "rec32", "chunk_text": "Elephants have excellent memory and strong social bonds.", "category": "biology" },
    { "_id": "rec33", "chunk_text": "The violin is a string instrument commonly used in orchestras.", "category": "music" },
    { "_id": "rec34", "chunk_text": "The heart pumps blood throughout the human body.", "category": "biology" },
    { "_id": "rec35", "chunk_text": "Ice cream melts when exposed to heat.", "category": "food science" },
    { "_id": "rec36", "chunk_text": "Solar panels convert sunlight into electricity.", "category": "technology" },
    { "_id": "rec37", "chunk_text": "The French Revolution began in 1789.", "category": "history" },
    { "_id": "rec38", "chunk_text": "The Taj Mahal is a mausoleum built by Emperor Shah Jahan.", "category": "history" },
    { "_id": "rec39", "chunk_text": "Rainbows are caused by light refracting through water droplets.", "category": "physics" },
    { "_id": "rec40", "chunk_text": "Mount Everest is the tallest mountain in the world.", "category": "geography" },
    { "_id": "rec41", "chunk_text": "Octopuses are highly intelligent marine creatures.", "category": "biology" },
    { "_id": "rec42", "chunk_text": "The speed of sound is around 343 meters per second in air.", "category": "physics" },
    { "_id": "rec43", "chunk_text": "Gravity keeps planets in orbit around the sun.", "category": "astronomy" },
    { "_id": "rec44", "chunk_text": "The Mediterranean diet is considered one of the healthiest in the world.", "category": "nutrition" },
    { "_id": "rec45", "chunk_text": "A haiku is a traditional Japanese poem with a 5-7-5 syllable structure.", "category": "literature" },
    { "_id": "rec46", "chunk_text": "The human body is made up of about 60% water.", "category": "biology" },
    { "_id": "rec47", "chunk_text": "The Industrial Revolution transformed manufacturing and transportation.", "category": "history" },
    { "_id": "rec48", "chunk_text": "Vincent van Gogh painted Starry Night.", "category": "art" },
    { "_id": "rec49", "chunk_text": "Airplanes fly due to the principles of lift and aerodynamics.", "category": "physics" },
    { "_id": "rec50", "chunk_text": "Renewable energy sources include wind, solar, and hydroelectric power.", "category": "energy" }
]

#### Insert Data

In [33]:
# insert facts into the database
for record in records:
    db.upsert(
        id=record["_id"],
        text=record["chunk_text"],
        metadata={"category": record["category"]}
    )

In [None]:
k, v = list(db.data.items())[1]
print(k)
v

In [None]:
db.lsh_index.explain()

#### Query Data

In [None]:
search_text = "Famous historical structures and monuments"
results = db.search(search_text, top_k=10)

print("\n🔍 Search Results:")
for r in results:
    print(f"ID: {r['id']}, Similarity: {r['similarity']}, Text: {r['text']}, Metadata: {r['metadata']}")

### 6.2. Reranking 

Vector search retrieves the top N most similar documents based on embeddings, but those results are not always the most relevant to the specific query. Reranking is a technique that takes the initial list of retrieved results and reorders them based on a deeper, more accurate assessment of their relevance to the user's query.

> A reranking model scores and reorders the returned candidates using a deeper semantic comparison, ensuring the final documents are more contextually accurate. This leads to noticeably better answer quality and fewer hallucinations.


A vector DB stores vectors and uses nearest-neighbor search. So when you query it, it answers the question:

"Which items are closest in vector space to this query embedding?"

But embeddings are not perfect:
- They capture general meaning
- They can get confused by wording
- They may return things that are related but not contextually correct

**Example**

Query: "How do I change my billing address?"

Top vector results might include:
- "Billing API documentation" (similar word surface form)
- "How to change your email address" (similar topic)
- "Invoice generation" (same semantic domain)

They're close, but not all are truly what you want.

**What Reranking Does**

Reranking takes the top N results (e.g., top 20 or 50) and runs a more expensive but more accurate model (like a cross-encoder) to evaluate:

"Given the query and each individual document, how relevant is this really?"

This second (reranking) model compares the pair (query, document) directly. So it's more precise, but too slow to run on the whole database.

**The Two-Stage Workflow**
- Vector search: fast filter → gets 20 candidates
- Reranker: slow but precise → sorts those 20 accurately

This gives you speed and quality.
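The second stage can be sketched as a function that rescores the candidates returned by the vector search. To keep the example runnable offline, it assumes a stand-in scorer based on token overlap; `token_overlap_score`, the candidate texts, and the similarity values are all hypothetical, and a real pipeline would call a cross-encoder model in place of the scorer.

```python
def token_overlap_score(query, document):
    """Stand-in relevance scorer: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)

def rerank(query, candidates, score_fn=token_overlap_score, top_n=3):
    """Score each (query, document) pair directly and reorder by that score."""
    scored = [{**c, "rerank_score": score_fn(query, c["text"])} for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]

# Stage 1 output (hypothetical): ordered by vector similarity.
candidates = [
    {"id": "doc1", "similarity": 0.83, "text": "Billing API documentation"},
    {"id": "doc2", "similarity": 0.81, "text": "How to change your email address"},
    {"id": "doc3", "similarity": 0.80, "text": "How do I update or change my billing address"},
]

# Stage 2: rescore only these few candidates.
reranked = rerank("how do i change my billing address", candidates, top_n=2)
for item in reranked:
    print(item["id"], round(item["rerank_score"], 2))
```

The document that ranked last by vector similarity moves to the top once each pair is scored directly, which is the effect reranking is meant to achieve.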

---
## 7. Introduction to Pinecone 

Vector databases are specialized systems designed to store, index, and search high-dimensional vectors (embeddings) efficiently. As AI applications such as semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG) have grown in adoption, a rich ecosystem of vector databases has emerged, each offering different trade-offs in performance, scalability, deployment model, and developer experience.

Popular solutions in this space include the managed service **Pinecone** and open-source options like **Qdrant**, **Weaviate**, **Milvus**, and **Chroma**, several of which also offer managed cloud deployments. While these databases share the same core goal of fast and accurate vector similarity search, they differ in how much infrastructure management, scaling, and operational complexity they place on the user.

**Pinecone** stands out as a fully managed, cloud-native vector database designed specifically for production AI workloads. It abstracts away infrastructure concerns such as indexing strategies, sharding, and scaling, allowing developers to focus on building AI features rather than operating databases. Pinecone offers low-latency similarity search, rich metadata filtering, and tight integration with modern embedding and LLM workflows, making it a strong choice for real-world semantic search and RAG systems.

In this section, we will focus on Pinecone's core concepts, architecture, and usage patterns, while keeping in mind how it fits into the broader vector database landscape.

| **Feature**                       | **Pinecone**                     | **Qdrant**           | **Weaviate**             | **Milvus**          | **Chroma**          |
| --------------------------------- | -------------------------------- | -------------------- | ------------------------ | ------------------- | ------------------- |
| **Type**                          | Managed Cloud                    | Open Source / Cloud  | Open Source / Cloud      | Open Source         | Local / Lightweight |
| **Ease of Setup**                 | ⭐⭐⭐⭐ (very easy)                 | ⭐⭐                   | ⭐⭐                       | ⭐                   | ⭐⭐⭐⭐                |
| **Performance**                   | ⭐⭐⭐⭐                             | ⭐⭐⭐⭐                 | ⭐⭐⭐                      | ⭐⭐⭐⭐                | ⭐⭐                  |
| **Scalability**                   | ⭐⭐⭐⭐                             | ⭐⭐⭐                  | ⭐⭐⭐⭐                     | ⭐⭐⭐⭐                | ⭐                   |
| **Metadata Filtering**            | ✅                                | ✅                    | ✅                        | ✅                   | ✅                   |
| **Hybrid Search (Text + Vector)** | ✅                                | ✅                    | ✅                        | ✅                   | ❌                   |
| **Best For**                      | Production apps, managed scaling | Open-source projects | Semantic + hybrid search | Research / Big data | Rapid prototyping   |
| **License / Pricing**             | Commercial (free tier)           | Apache 2.0           | BSD                      | Apache 2.0          | Apache 2.0          |


### 7.1. Pinecone Overview

Pinecone is a managed vector database designed to store, index, and query high-dimensional vectors at scale. It handles the heavy lifting of infrastructure, scalability, and performance, so you can focus on building AI applications.

#### Project Structure

- Each project contains a single `database`
- A database can include multiple `indexes`

> Typically, you create one index per use case, for example different indexes for different embedding models or retrieval strategies.

#### Namespaces

Within an index, records are partitioned into `namespaces`, and every upsert, query, or other data operation targets exactly one namespace. This has two main benefits:
  - Multitenancy: When you need to isolate data between customers, you can use one namespace per customer and direct each customer's writes and queries to their dedicated namespace. Pinecone's "Implement multitenancy" guide covers this end to end.
  - Faster queries: When you divide records into namespaces in a logical way, you speed up queries by ensuring only relevant records are scanned. The same applies to fetching records, listing record IDs, and other data operations.

Namespaces are created automatically during upsert: if a namespace doesn't exist, it is created implicitly.

> A single Pinecone operation always targets one namespace. To search across several namespaces, each namespace must be queried separately and the results merged on the client side.
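The one-namespace-per-customer pattern might look like the sketch below. The helpers pass `namespace=` to the index's `upsert` and `query` methods, which is how the Pinecone Python client scopes an operation. To keep the example runnable without an API key, it uses `RecordingIndex`, a hypothetical stand-in that only records the calls it receives; with a real client you would pass the handle returned by `pc.Index(...)` instead. The tenant ids and record values are made up.

```python
class RecordingIndex:
    """Hypothetical stand-in for a Pinecone index handle; it only records calls."""

    def __init__(self):
        self.calls = []

    def upsert(self, vectors, namespace):
        self.calls.append(("upsert", namespace, len(vectors)))

    def query(self, vector, top_k, namespace, include_metadata=False):
        self.calls.append(("query", namespace, top_k))
        return {"matches": []}

def upsert_for_tenant(index, tenant_id, records):
    """Write records into the tenant's own namespace."""
    index.upsert(vectors=records, namespace=tenant_id)

def query_for_tenant(index, tenant_id, query_vector, top_k=3):
    """Search only the tenant's namespace."""
    return index.query(
        vector=query_vector,
        top_k=top_k,
        namespace=tenant_id,
        include_metadata=True,
    )

index = RecordingIndex()
record = {"id": "rec1", "values": [0.1, 0.2, 0.3], "metadata": {"category": "history"}}

upsert_for_tenant(index, "customer-a", [record])
upsert_for_tenant(index, "customer-b", [record])
response = query_for_tenant(index, "customer-a", [0.1, 0.2, 0.3])
print(index.calls)
```

Because the tenant id is the only thing that selects the namespace, routing every read and write through helpers like these keeps one customer's data out of another customer's results.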

[check](https://www.youtube.com/shorts/gp5bFF4QNCQ)

#### Dense vs Sparse Vectors

A Pinecone index can store dense vectors, sparse vectors, or both, enabling semantic, keyword, or hybrid search within the same index.

**Dense Embeddings: Use Cases**

Dense vectors capture semantic meaning and are typically produced by embedding models like OpenAI, BGE, or CLIP.
- Semantic search – find documents or passages based on meaning, not exact keywords
- Similarity search – find similar texts, images, or products
- Recommendation systems – recommend content or items based on user behavior or preferences
- RAG (Retrieval-Augmented Generation) – retrieve relevant context to enrich LLM responses
- Multimodal search – search images using text or other images (e.g., OpenCLIP)

👉 Best when understanding meaning and context matters more than exact wording.

**Sparse Embeddings: Use Cases**

Sparse vectors represent data where most values are zero and usually correspond to keyword-based or token-based representations (e.g., BM25, TF-IDF).
- Keyword search – exact or near-exact term matching
- Filtering by rare or specific terms – names, IDs, error codes, part numbers
- Search in technical or legal documents where exact wording matters
- Explainable search results – easier to understand why a result matched

👉 Best when precision and exact matches are critical.

**Hybrid Search: Use Cases**

Hybrid search combines dense semantic understanding with sparse keyword precision.
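One common way to balance the two signals is a convex combination controlled by a weight `alpha`. The sketch below assumes sparse vectors in the `{"indices": [...], "values": [...]}` layout and scores documents with a plain dot product; `weight_hybrid_query`, `hybrid_score`, and the toy vectors are illustrative, not a specific library API.

```python
def weight_hybrid_query(dense, sparse, alpha):
    """Scale a dense/sparse query pair: alpha=1 is pure semantic, alpha=0 is pure keyword."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    weighted_dense = [value * alpha for value in dense]
    weighted_sparse = {
        "indices": sparse["indices"],
        "values": [value * (1 - alpha) for value in sparse["values"]],
    }
    return weighted_dense, weighted_sparse

def hybrid_score(dense_q, sparse_q, dense_d, sparse_d):
    """Dense dot product plus the dot product over shared sparse indices."""
    dense_part = sum(x * y for x, y in zip(dense_q, dense_d))
    doc_weights = dict(zip(sparse_d["indices"], sparse_d["values"]))
    sparse_part = sum(
        value * doc_weights.get(index, 0.0)
        for index, value in zip(sparse_q["indices"], sparse_q["values"])
    )
    return dense_part + sparse_part

# Hypothetical query and documents.
query_dense = [1.0, 0.0]
query_sparse = {"indices": [7], "values": [1.0]}   # token id 7 is the query keyword

semantic_doc = ([1.0, 0.0], {"indices": [3], "values": [1.0]})  # same meaning, other words
keyword_doc = ([0.0, 1.0], {"indices": [7], "values": [1.0]})   # same keyword, other meaning

for alpha in (0.75, 0.25):
    dense_q, sparse_q = weight_hybrid_query(query_dense, query_sparse, alpha)
    print(
        alpha,
        hybrid_score(dense_q, sparse_q, *semantic_doc),
        hybrid_score(dense_q, sparse_q, *keyword_doc),
    )
```

With a high `alpha` the semantically similar document wins, and with a low `alpha` the keyword match wins, so the weight can be tuned per use case.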

#### Pinecone Data Model

Each record stored in Pinecone consists of three main components:
- ID – a unique identifier for the vector
- Vector – the numerical embedding representing meaning
- Metadata – additional structured information used for filtering and context

### 7.2. Initializing Pinecone Client

Pinecone is a fully managed, cloud-based vector database. It's easy to set up, requires no server management, and integrates seamlessly with OpenAI embeddings and LangChain, making it a good fit for a hands-on learning experience.

> Set up the database and get your API key from [here](https://www.pinecone.io/)

In [None]:
!pip install pinecone

In [None]:
from dotenv import load_dotenv
import os
load_dotenv()
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]

In [None]:
from pinecone import Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

### 7.3. Creating an Index 

Pinecone offers two types of indexes for storing vector data: `dense` indexes store dense vectors for `semantic` search, and `sparse` indexes store sparse vectors for `lexical/keyword` search. For our example, we will create a `dense` index.

To convert the input data into embeddings (i.e. vector format), Pinecone offers integrated embedding models (e.g. `llama-text-embed-v2`) that can be used out of the box. Alternatively, you can generate embeddings with an external model (e.g. OpenAI's `text-embedding-3-small`) and store the resulting vectors directly in Pinecone. For this example, we will use the integrated model `llama-text-embed-v2`. This means that when we upsert and search our data, Pinecone will automatically generate the vectors.

| Manage Embeddings Externally | Manage Embeddings Within the DB |
|----------------------------|----------------------------|
| You generate embeddings in your application code using an external model or service, explicitly define vector dimensions and metrics, and manually upsert vectors into the index. This gives you full control over model choice, versioning, preprocessing, and embedding lifecycle, but requires extra infrastructure and code to manage embedding generation and updates. | The database automatically generates embeddings using a built-in model based on a mapped text field. You don't manage vector dimensions or embedding generation directly, which simplifies the pipeline and reduces boilerplate, but ties you to the database-supported models and limits control over embedding customization and versioning. |
| ![Picture 1](imgs/pc_index_external_model.png) | ![Picture 2](imgs/pc_index_managed_model.png) |
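
To illustrate the extra step on the "external" side of the table, the sketch below turns text records into `id` / `values` / `metadata` records before they would be upserted. Everything here is an assumption for illustration: `fake_embed` is a deterministic stand-in with no semantic meaning (a real pipeline would call an embedding model), and `to_vector_records` is a hypothetical helper name.

```python
import hashlib

def fake_embed(text, dimension=8):
    """Stand-in for an external embedding model: deterministic, NOT semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dimension]]

def to_vector_records(text_records, embed_fn, text_field="chunk_text"):
    """Turn text records into id / values / metadata records ready for upsert."""
    vector_records = []
    for rec in text_records:
        # Everything except the id and the embedded text becomes metadata.
        metadata = {k: v for k, v in rec.items() if k not in ("_id", text_field)}
        metadata[text_field] = rec[text_field]  # keep the source text for display
        vector_records.append({
            "id": rec["_id"],
            "values": embed_fn(rec[text_field]),
            "metadata": metadata,
        })
    return vector_records

sample_texts = [
    {"_id": "rec1", "chunk_text": "The Eiffel Tower stands in Paris.", "category": "history"},
]
vector_records = to_vector_records(sample_texts, fake_embed)
print(vector_records[0]["id"], len(vector_records[0]["values"]), vector_records[0]["metadata"])
```

With an integrated model, this conversion happens inside the database, which is why the records in the next section contain only text and metadata fields.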


In [44]:
# Create a dense index with integrated embedding
index_name = "my-index"
if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        # specify where the index will be hosted (cloud provider and region)
        cloud="aws",
        region="us-east-1",
        
        # specify the embedding model to use
        embed={
            "model":"llama-text-embed-v2",
            # specify the field to embed
            "field_map":{"text": "chunk_text"}
        }, 
    )

### 7.4. Data Preparation & Insertion 

#### Data Preparation

Prepare a sample dataset of factual statements from different domains such as history, physics, technology, and music. Model the data as records with an ID, text, and category.

> As you will notice below, each record has a unique `_id` field and a `chunk_text` field. `chunk_text` was defined at index creation as the field to embed (i.e. the field that will be transformed into a vector representation). All additional fields are stored as record metadata. We can filter by metadata when searching or deleting records.

In [45]:
records = [
    { "_id": "rec1", "chunk_text": "The Eiffel Tower was completed in 1889 and stands in Paris, France.", "category": "history" },
    { "_id": "rec2", "chunk_text": "Photosynthesis allows plants to convert sunlight into energy.", "category": "science" },
    { "_id": "rec3", "chunk_text": "Albert Einstein developed the theory of relativity.", "category": "science" },
    { "_id": "rec4", "chunk_text": "The mitochondrion is often called the powerhouse of the cell.", "category": "biology" },
    { "_id": "rec5", "chunk_text": "Shakespeare wrote many famous plays, including Hamlet and Macbeth.", "category": "literature" },
    { "_id": "rec6", "chunk_text": "Water boils at 100°C under standard atmospheric pressure.", "category": "physics" },
    { "_id": "rec7", "chunk_text": "The Great Wall of China was built to protect against invasions.", "category": "history" },
    { "_id": "rec8", "chunk_text": "Honey never spoils due to its low moisture content and acidity.", "category": "food science" },
    { "_id": "rec9", "chunk_text": "The speed of light in a vacuum is approximately 299,792 km/s.", "category": "physics" },
    { "_id": "rec10", "chunk_text": "Newton's laws describe the motion of objects.", "category": "physics" },
    { "_id": "rec11", "chunk_text": "The human brain has approximately 86 billion neurons.", "category": "biology" },
    { "_id": "rec12", "chunk_text": "The Amazon Rainforest is one of the most biodiverse places on Earth.", "category": "geography" },
    { "_id": "rec13", "chunk_text": "Black holes have gravitational fields so strong that not even light can escape.", "category": "astronomy" },
    { "_id": "rec14", "chunk_text": "The periodic table organizes elements based on their atomic number.", "category": "chemistry" },
    { "_id": "rec15", "chunk_text": "Leonardo da Vinci painted the Mona Lisa.", "category": "art" },
    { "_id": "rec16", "chunk_text": "The internet revolutionized communication and information sharing.", "category": "technology" },
    { "_id": "rec17", "chunk_text": "The Pyramids of Giza are among the Seven Wonders of the Ancient World.", "category": "history" },
    { "_id": "rec18", "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans.", "category": "biology" },
    { "_id": "rec19", "chunk_text": "The Pacific Ocean is the largest and deepest ocean on Earth.", "category": "geography" },
    { "_id": "rec20", "chunk_text": "Chess is a strategic game that originated in India.", "category": "games" },
    { "_id": "rec21", "chunk_text": "The Statue of Liberty was a gift from France to the United States.", "category": "history" },
    { "_id": "rec22", "chunk_text": "Coffee contains caffeine, a natural stimulant.", "category": "food science" },
    { "_id": "rec23", "chunk_text": "Thomas Edison invented the practical electric light bulb.", "category": "inventions" },
    { "_id": "rec24", "chunk_text": "The moon influences ocean tides due to gravitational pull.", "category": "astronomy" },
    { "_id": "rec25", "chunk_text": "DNA carries genetic information for all living organisms.", "category": "biology" },
    { "_id": "rec26", "chunk_text": "Rome was once the center of a vast empire.", "category": "history" },
    { "_id": "rec27", "chunk_text": "The Wright brothers pioneered human flight in 1903.", "category": "inventions" },
    { "_id": "rec28", "chunk_text": "Bananas are a good source of potassium.", "category": "nutrition" },
    { "_id": "rec29", "chunk_text": "The stock market fluctuates based on supply and demand.", "category": "economics" },
    { "_id": "rec30", "chunk_text": "A compass needle points toward the magnetic north pole.", "category": "navigation" },
    { "_id": "rec31", "chunk_text": "The universe is expanding, according to the Big Bang theory.", "category": "astronomy" },
    { "_id": "rec32", "chunk_text": "Elephants have excellent memory and strong social bonds.", "category": "biology" },
    { "_id": "rec33", "chunk_text": "The violin is a string instrument commonly used in orchestras.", "category": "music" },
    { "_id": "rec34", "chunk_text": "The heart pumps blood throughout the human body.", "category": "biology" },
    { "_id": "rec35", "chunk_text": "Ice cream melts when exposed to heat.", "category": "food science" },
    { "_id": "rec36", "chunk_text": "Solar panels convert sunlight into electricity.", "category": "technology" },
    { "_id": "rec37", "chunk_text": "The French Revolution began in 1789.", "category": "history" },
    { "_id": "rec38", "chunk_text": "The Taj Mahal is a mausoleum built by Emperor Shah Jahan.", "category": "history" },
    { "_id": "rec39", "chunk_text": "Rainbows are caused by light refracting through water droplets.", "category": "physics" },
    { "_id": "rec40", "chunk_text": "Mount Everest is the tallest mountain in the world.", "category": "geography" },
    { "_id": "rec41", "chunk_text": "Octopuses are highly intelligent marine creatures.", "category": "biology" },
    { "_id": "rec42", "chunk_text": "The speed of sound is around 343 meters per second in air.", "category": "physics" },
    { "_id": "rec43", "chunk_text": "Gravity keeps planets in orbit around the sun.", "category": "astronomy" },
    { "_id": "rec44", "chunk_text": "The Mediterranean diet is considered one of the healthiest in the world.", "category": "nutrition" },
    { "_id": "rec45", "chunk_text": "A haiku is a traditional Japanese poem with a 5-7-5 syllable structure.", "category": "literature" },
    { "_id": "rec46", "chunk_text": "The human body is made up of about 60% water.", "category": "biology" },
    { "_id": "rec47", "chunk_text": "The Industrial Revolution transformed manufacturing and transportation.", "category": "history" },
    { "_id": "rec48", "chunk_text": "Vincent van Gogh painted Starry Night.", "category": "art" },
    { "_id": "rec49", "chunk_text": "Airplanes fly due to the principles of lift and aerodynamics.", "category": "physics" },
    { "_id": "rec50", "chunk_text": "Renewable energy sources include wind, solar, and hydroelectric power.", "category": "energy" }
]
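
Since every field other than `_id` and `chunk_text` becomes metadata, a metadata filter narrows the candidate set before similarity is considered. The sketch below emulates that idea locally on a few of the records above. It is not Pinecone's implementation: `matches` is a hypothetical helper that handles only flat `$eq` and `$in` style conditions, and treating a bare value as `$eq` is a convention of this sketch.

```python
def matches(record, flt):
    """Check a record against a flat filter using `$eq` / `$in` style conditions."""
    for field_name, condition in flt.items():
        value = record.get(field_name)
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # bare value is shorthand in this sketch
        for op, operand in condition.items():
            if op == "$eq" and value != operand:
                return False
            elif op == "$in" and value not in operand:
                return False
            elif op not in ("$eq", "$in"):
                raise ValueError(f"unsupported operator: {op}")
    return True

sample_records = [
    {"_id": "rec1", "category": "history"},
    {"_id": "rec6", "category": "physics"},
    {"_id": "rec15", "category": "art"},
    {"_id": "rec37", "category": "history"},
]

history_or_art = {"category": {"$in": ["history", "art"]}}
print([r["_id"] for r in sample_records if matches(r, history_or_art)])
```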

#### Insert/Upsert 

Now we will upsert the data into our index. Because we set up the index with an integrated embedding model, we only provide the textual statements (via the field `chunk_text`) and Pinecone converts them to dense vectors automatically.

> Note that Pinecone is eventually consistent, so there can be a slight delay before new or changed records are visible to queries.

In [None]:
# Target the index
dense_index = pc.Index(index_name)

# Upsert the records into a namespace
namespace_name = "my-namespace"
dense_index.upsert_records(namespace_name, records)

# Wait for the upserted vectors to be indexed
import time
time.sleep(10)

# View stats for the index
stats = dense_index.describe_index_stats()
print(stats)
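
A fixed `time.sleep(10)` can be either too long or too short under eventual consistency. As an alternative, here is a minimal polling sketch. The helper `wait_until` and its parameters are hypothetical; `get_count` is any zero-argument callable, and with Pinecone it could be a function that reads the record count out of `describe_index_stats()` (the exact response field is an assumption to verify against the SDK documentation).

```python
import time

def wait_until(get_count, expected, timeout=30.0, interval=1.0):
    """Poll `get_count()` until it returns at least `expected` or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        count = get_count()
        if count >= expected:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {count} of {expected} records visible")
        time.sleep(interval)

# Demo with a fake counter that becomes consistent on the third call.
fake_counts = iter([0, 20, 50])
visible = wait_until(lambda: next(fake_counts), expected=50, interval=0.01)
print(visible)
```

Polling returns as soon as the records are visible and fails loudly if they never appear, which makes notebook runs more predictable than a fixed delay.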

### 7.5. Index Query & Reranking 

#### Index Query: Searching Data

Now we search the dense index for records that are most semantically similar to the query, "Famous historical structures and monuments". Again, because the index is set up with an integrated embedding model, we provide the query as text and Pinecone converts the text to a dense vector automatically.

> Notice that we specify the `Namespace` we would like to perform the search operation on

In [None]:
# Define the query
query = "Famous historical structures and monuments"

# Search the dense index
results = dense_index.search(
    namespace=namespace_name,
    query={
        "top_k": 10,
        "inputs": {
            'text': query
        },
    },
)

# Print the results
for hit in results['result']['hits']:
    print(f"id: {hit['_id']:<5} | score: {round(hit['_score'], 2):<5} | category: {hit['fields']['category']:<10} | text: {hit['fields']['chunk_text']:<50}")

#### Reranking

Vector databases retrieve results based on embedding similarity (usually cosine similarity or dot product). While this works well for semantic closeness, the top-ranked vectors aren't always the most relevant in context, especially when:
  - The query is ambiguous (e.g., "Apple" could mean the fruit or the company).
  - The embeddings miss fine-grained meaning (subtle differences between retrieved results).
  - The user's intent isn't purely semantic (e.g., preferring recent, local, or authoritative content).

As a result, raw vector search may return items that are similar in meaning but not optimal in relevance.

> Re-ranking improves search quality by taking the initial list of retrieved documents (from the vector DB) and ordering them more intelligently. It typically uses a secondary model (like a cross-encoder or LLM) that compares the query with each result directly to predict a more accurate relevance score.

To get a more accurate ranking, we can rerank the initial results based on their relevance to the query. Reranking is used as part of a two-stage vector retrieval process to improve the quality of results.
  - First we query an index for a given number of relevant results
  - Second we send the query and results to a reranking model.

The reranking model scores the results based on their semantic relevance to the query and returns a new, more accurate ranking. This approach is one of the simplest methods for improving quality in retrieval augmented generation (RAG) pipelines.
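
The two stages can be sketched locally with toy data. In the sketch below, stage 1 ranks all documents by cosine similarity and stage 2 re-scores only the shortlisted candidates. The function `overlap_score` is a deliberately simple stand-in for a reranking model (a real one would be a cross-encoder or similar), and all names, vectors, and texts are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def overlap_score(query, text):
    """Stand-in for a reranking model: fraction of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)

def two_stage_search(query, query_vec, docs, top_k=3, top_n=2):
    # Stage 1: cheap vector similarity over all documents.
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)[:top_k]
    # Stage 2: re-score only the candidates with the (more expensive) reranker.
    return sorted(candidates, key=lambda d: overlap_score(query, d["text"]), reverse=True)[:top_n]

toy_docs = [
    {"id": "a", "vector": [1.0, 0.0], "text": "apple pie recipe"},
    {"id": "b", "vector": [0.9, 0.1], "text": "apple iphone release date"},
    {"id": "c", "vector": [0.0, 1.0], "text": "train timetable"},
]
hits = two_stage_search("apple iphone", [1.0, 0.05], toy_docs, top_k=2, top_n=2)
print([h["id"] for h in hits])  # "a" wins stage 1, but "b" moves to the top after reranking
```

The point of the split is cost: the first stage touches every document with a cheap comparison, while the second stage spends more effort on a handful of candidates.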

Pinecone provides hosted reranking models, so it's easy to manage two-stage vector retrieval on a single platform. We can use a hosted model to rerank results as an integrated part of a query, or use a hosted or external model to rerank results as a standalone operation.

In [None]:
# Search the dense index and rerank results
reranked_results = dense_index.search(
    namespace=namespace_name,
    query={
        "top_k": 10,
        "inputs": {
            'text': query
        }
    },
    rerank={
        "model": "bge-reranker-v2-m3",
        "top_n": 10,
        "rank_fields": ["chunk_text"]
    }   
)

# Print the reranked results
for hit in reranked_results['result']['hits']:
    print(f"id: {hit['_id']}, score: {round(hit['_score'], 2)}, text: {hit['fields']['chunk_text']}, category: {hit['fields']['category']}")

Reranking results is one of the most effective ways to improve search accuracy and relevance, but there are many other techniques to consider. For example:
  - Filtering by metadata: When records contain additional metadata, you can limit the search to records matching a filter expression.
  - Hybrid search: You can add lexical search to capture precise keyword matches (e.g., product SKUs, email addresses, domain-specific terms) in addition to semantic matches.
  - Chunking strategies: You can chunk your content in different ways to get better results. Consider factors like the length of the content, the complexity of queries, and how results will be used in your application.
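
As one example of a chunking strategy, the sketch below splits text into fixed-size word windows that overlap, so a sentence cut at a boundary still appears whole in a neighboring chunk. The helper `chunk_words` and its default sizes are hypothetical; sentence-aware or token-based splitting may suit real content better.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows of `chunk_size`, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

demo_text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(demo_text, chunk_size=4, overlap=1):
    print(chunk)
```

Larger chunks carry more context per record but dilute the embedding; smaller chunks are more precise but may lose surrounding meaning, so the sizes are worth tuning against real queries.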

### 7.6. Clean Up !

Let's clean up our resources

In [49]:
pc.delete_index(index_name)

---
## 8. Practical Usecases 

### 8.1. Usecase: Image Search and Similar Images Recommendation Using Embeddings

So far, we've mainly worked with text, turning words into vectors and searching for meaning. In this section, we take a big step forward and move into something even more powerful: multimodal search.

We'll use `OpenCLIP`, an embedding model based on OpenAI's [CLIP](https://openai.com/index/clip/), to implement two use cases:

- Search images by providing an input `text` query
- Search for visually similar images given an input `image` as the query

OpenCLIP uses the same embedding model to convert both images and text into vectors in a shared space. This allows us to do something that feels almost magical: search for images using text, or find similar images using another image, all with the same vector database.

You'll see how an image or a short phrase can be embedded into the same vector space, indexed once, and queried in multiple ways. This is a perfect example of how vector databases enable flexible, real-world AI systems that go beyond single-modality search to cross-modal understanding.

In [None]:
! pip install torch open_clip_torch langchain-experimental

##### Auxiliary Functions

In [57]:
def read_img(path):
    with open(path, "rb") as image_file:
        return image_file.read()

In [58]:
import base64
def image_to_base64(img):
    encoded = base64.b64encode(img)
    return encoded.decode("utf-8")

##### Initializing our Model

In [59]:
# requires that open_clip_torch is installed
from langchain_experimental.open_clip import OpenCLIPEmbeddings 

The code below initializes an OpenCLIP embedding model that we will use to convert images and text into vector representations.

- `model_name = "ViT-g-14"` specifies the architecture of the image encoder. `ViT-g-14` is a large Vision Transformer model that produces high-quality visual embeddings, making it well-suited for similarity search and retrieval tasks.

- `checkpoint = "laion2b_s34b_b88k"` tells OpenCLIP which pretrained weights to load. This checkpoint was trained on a massive dataset of image–text pairs (`LAION-2B`), allowing the model to understand rich visual and semantic relationships.

- `OpenCLIPEmbeddings(...)` creates an embedding object that can embed both images and text into the same vector space. This is what enables multimodal search, such as finding images using text or finding similar images using another image.

After this step, `clip_embd` can be used to generate embeddings that we store in a vector database and query for similarity.

In [60]:
model_name = "ViT-g-14"
checkpoint = "laion2b_s34b_b88k"
clip_embd = OpenCLIPEmbeddings(model_name=model_name, checkpoint=checkpoint)

##### Data Preparation

In [61]:
root_path = "resources/fashion"
categories = ["jackets", "shirts", "shoes"]
all_images = dict()
idx = 0

for category in categories:
    dir_path = f"{root_path}/{category}"
    for file_name in os.listdir(dir_path):
        idx += 1
        img_id = f"img{idx}"
        file_path = f"{dir_path}/{file_name}"
        img = read_img(file_path)
        #
        all_images[img_id] = {
            "file_name": file_name,
            "file_path": file_path,
            "img": img,
            "img_b64": image_to_base64(img),
            "category": category
        }

In [None]:
img_data = list(all_images.values())[0]
img_data

In [63]:
records = list()
for img_id, img_data in all_images.items():
    img_embedding = clip_embd.embed_image([img_data['file_path']])[0]
    records.append({
        "id": img_id,
        "values": img_embedding, 
        "metadata": {"category": img_data["category"]}
    })

In [None]:
len(records)

In [None]:
len(records[0]['values'])

In [None]:
from numpy.linalg import norm  # L2 norm; imported here in case it is not already in scope
print(norm(records[0]['values']))

In [None]:
records[0]

##### Creating our Index & Data Insertion

In [68]:
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=PINECONE_API_KEY)

In [69]:
index_name = "my-index"
# Create a dense index
if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ),
        vector_type="dense",
        dimension=1024,
        metric="cosine"       
    )

In [70]:
# Target the index
dense_index = pc.Index(index_name)
namespace_name = "my-namespace"

In [None]:
dense_index.upsert(vectors=records, namespace=namespace_name)

# Wait for the upserted vectors to be indexed
import time
time.sleep(10)

# View stats for the index
stats = dense_index.describe_index_stats()
print(stats)

##### Querying our Data

In [72]:
from IPython.display import display, HTML, Image

In [73]:
def get_images_html_by_vector(query_vector, top_k):

    # get images
    query_response = dense_index.query(
        vector=query_vector,
        top_k=top_k,
        namespace=namespace_name,
        include_metadata=False
    )

    # generate html for display
    html_str = ""
    for match in query_response['matches']:
        iid = match['id']
        ib64 = all_images[iid]["img_b64"]
        html_str += f'<img src="data:image/png;base64,{ib64}" width="100" height="100" style="margin-right:10px"/>'

    return html_str

Let's begin with a text-based search, where we describe what we're looking for in words and let the model retrieve the most similar images.

In [None]:
query_text = "checked shirt"
query_vector = clip_embd.embed_documents([query_text])[0]
display_html = get_images_html_by_vector(query_vector, top_k=5)
display(HTML(display_html))

Now we'll switch to image-based search, where the query itself is an image and we retrieve visually similar images from the vector database.

In [None]:
query_img_path = "resources/fashion/jackets/IBBW23-5553_999_1_9c6e4cd2-8819-41ee-b6d5-aaf7adff3c4a.webp"
query_img = read_img(query_img_path)
Image(data=query_img, width="100", height="100")

In [None]:
query_vector = clip_embd.embed_image([query_img_path])[0]
display_html = get_images_html_by_vector(query_vector, top_k=5)
display(HTML(display_html))

##### Clean Up !

In [77]:
pc.delete_index(index_name)

### 8.2. Usecase: Retrieval-Augmented Generation for AI Customer Support Enhancement

In this example, we show how vector databases can be used to build a RAG pipeline. We demonstrate this by illustrating how a customer support agent (LLM) can improve its responses by connecting to an external knowledge base (a vector database). This example assumes prior knowledge of how LLMs work.

Our Customer Support Agent operates within a sneaker store, assisting customers with a wide range of inquiries. It can answer questions about order returns, shipping status, account management, product availability, and more.

#### What is a RAG

**Retrieval-augmented generation (RAG)** is a technique used to enhance the accuracy and reliability of generative AI models by incorporating information retrieved from relevant external data sources.

Without RAG, an LLM generates responses based solely on the knowledge it acquired during training. With RAG, an information retrieval component is added: the user's query is first used to retrieve relevant information from an external data source. Both the user query and the retrieved information are then provided to the LLM, allowing it to combine its training knowledge with up-to-date or domain-specific data to produce more accurate and reliable responses.

<div align="center">
    <img src="imgs/rag.jpg" alt="Alt text" width="600" height="350" center>
</div>

Image credit goes to [this](https://aws.amazon.com/what-is/retrieval-augmented-generation/) article from AWS. Feel free to read it for a nice introduction to RAG.
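
The retrieve-then-generate flow described above can be sketched without any external service. In this illustration, `retrieve` is a toy word-overlap retriever standing in for a vector database, `generate` is a placeholder for an LLM call, and the knowledge base entries are invented sample text; none of these names come from a real library.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Toy retriever: rank passages by word overlap with the query (a vector DB would use embeddings)."""
    q = set(query.lower().split())
    scored = sorted(knowledge_base, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return scored[:top_k]

def build_prompt(query, passages):
    """Combine the retrieved passages and the user query into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Placeholder for an LLM call; here it just reports the prompt size."""
    return f"[LLM would answer using a prompt of {len(prompt)} characters]"

# Invented sample knowledge base, for illustration only.
demo_knowledge_base = [
    "Returns are accepted within 30 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards never expire.",
]
demo_query = "Can I return shoes after 10 days"
demo_passages = retrieve(demo_query, demo_knowledge_base, top_k=1)
demo_prompt = build_prompt(demo_query, demo_passages)
print(demo_prompt)
print(generate(demo_prompt))
```

The rest of this section follows the same three steps, with Pinecone as the retriever and an OpenAI model as the generator.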


#### Let's Initialize Our Knowledge Base

Our Customer Support Agent leverages previous human-led support conversations, handled by business-aware employees, to gain contextual understanding of the business and provide more informed answers. To enable this, all past conversations are ingested into a vector database, allowing for efficient retrieval and contextual grounding.

In [78]:
from pinecone import Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

First, let's load the customer support conversations from disk.

In [None]:
import json
file_path = "resources/customer_support_conversations.jsonl"

def load_conversations(file_path):

    """
    Loads the conversations from the jsonl file
    Args:
        file_path: The path to the jsonl file
    Returns:
        conversations: A list of conversations
    """

    conversations = list()
    with open(file_path, "r", encoding="utf-8") as f:
        buffer = ""
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            buffer += line
            # Check if buffer is a complete JSON object
            try:
                convo = json.loads(buffer)
                conversations.append(convo)
                buffer = ""  # reset buffer for next object
            except json.JSONDecodeError:
                # not complete yet, keep reading lines
                buffer += "\n"
    return conversations


conversations = load_conversations(file_path)
print(f"Total conversations loaded: {len(conversations)}")

print("Example conversations:")
for conversation in conversations[0:3]:
    for message in conversation["messages"]:
        print(message)
    print("-----")

Then we reshape each conversation into a record suitable for insertion into the vector database.

In [None]:
# creating a single TEXT field that will be given to the Vector database
# This will be used for matching the user queries
for idx, conversation in enumerate(conversations):
    conversation["_id"] = f"conversation_{idx+1}"
    conversation["conversation"] = "\n".join(conversation["messages"])
    conversation.pop("messages")

for conversation in conversations[0:3]:
    print(f"Conversation Id: {conversation['_id']}")
    print(conversation["conversation"])
    print("-----")

In [None]:
conversation.keys()

Now, let's create our index and insert all our customer support conversations into it.

In [82]:
# Create a dense index with integrated embedding
index_name = "customer-support-conversations"
namespace_name = "customer-support-conversations"
if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        # specify where the index will be hosted (cloud provider and region)
        cloud="aws",
        region="us-east-1",
        # specify the embedding model to use
        embed={
            "model":"llama-text-embed-v2",
            # specify the field to embed
            "field_map":{"text": "conversation"}
        }
    )

In [83]:
def split_list_into_sublists(lst, n):
    """
    Splits a list `lst` into `n` sublists of roughly equal size.
    """
    k, m = divmod(len(lst), n)
    return [lst[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n)]

conversations = split_list_into_sublists(conversations, 4)
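
The helper above fixes the number of sublists. When the constraint is instead a maximum number of records per request, capping the batch size may be a more direct fit. A minimal sketch, where `batched` and `batch_size` are hypothetical names and the right limit should be taken from the current Pinecone documentation:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Ten dummy items in batches of at most four.
batch_sizes = [len(batch) for batch in batched(list(range(10)), batch_size=4)]
print(batch_sizes)
```

Unlike a fixed number of sublists, this keeps every request under the cap no matter how large the dataset grows.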

In [None]:
# Target the index
dense_index = pc.Index(index_name)

# Upsert the records
for idx, conversations_subset in enumerate(conversations):
    print(f"Inserting Sublist: {idx+1}")
    dense_index.upsert_records(namespace_name, conversations_subset)

In [None]:
# View stats for the index
stats = dense_index.describe_index_stats()
print(stats)

Let's give the vector database a try.

In [86]:
def get_relevant_context(query):
    # Search the dense index
    results = dense_index.search(
        namespace=namespace_name,
        query={
            "top_k": 3,
            "inputs": {
                'text': query
            }
        }
    )

    context = ""

    for idx, hit in enumerate(results['result']['hits']):
        context += f"Conversation: {idx+1} \n"
        context += hit['fields']['conversation']
        context += "\n\n"
    
    return context

In [None]:
# Define the query
query = "How long does a shipment take outside of EU"
print(get_relevant_context(query))

#### RAG Based Customer Support Agent

Example without RAG

In [None]:
import openai

# Main interaction
query = "Hello, Am I able to return my Sneakers, I have purchased them 10 days ago"
response = openai.responses.create(
    model="gpt-4.1-nano",
    input=query, 
)
print(response.output_text)

Example with RAG

In [89]:
# defining agent instructions 
instructions = """
You are a customer support agent working for an online sneaker store. Your role is to help customers by \
answering their questions accurately, clearly, and politely.

You will be given:
 - A user query
 - A context containing relevant information retrieved from past customer support conversations and \
internal knowledge

Use only the provided context to answer the user's question. If the context does not contain \
enough information to answer confidently, clearly state that you do not have sufficient \
information and suggest an appropriate next step.

Your responses should be helpful, concise, and aligned with the store's policies and tone of voice.
"""

In [None]:
query = "Hello, Am I able to return my Sneakers, I have purchased them 10 days ago"
context = get_relevant_context(query)

query_with_context = f"""
Answer the following using the provided context:

### Question: 
{query}

### Context: 
{context}
"""

print(query_with_context)

In [None]:
response = openai.responses.create(
    instructions=instructions, 
    model="gpt-4.1-nano",
    input=query_with_context, 
)
print(response.output_text)

#### Clean Up!

In [92]:
pc.delete_index(index_name)

---
# Congratulations !

Congratulations on completing the course! You should be really proud of the effort and curiosity you showed. Keep learning, exploring, and believing in yourself; this is just the beginning. Wishing you continued success on your learning journey.

M. ElSioufy