# Basic Vector Search from Scratch

For this exercise we will implement basic vector search
from scratch using only NumPy.<br/>
This will give us a feel
for what's happening under the hood in vector databases.

In [None]:
!pip install numpy pytest

## Euclidean distance

There are many ways to measure the distance between two vectors.
Let's write a function that computes the Euclidean distance
between two vectors.

This function should take two vectors as input and return
the Euclidean distance between them.

For more details, see this [Kaggle page](https://www.kaggle.com/code/paulrohan2020/euclidean-distance-and-normalizing-a-vector).
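
As a quick sanity check of the definition, the sketch below computes the distance for a 3-4-5 right triangle by hand (square root of the summed squared differences) and compares it with `np.linalg.norm`. The variable names `a`, `b`, `manual`, and `via_norm` are illustrative only.

```python
import numpy as np

# Two points that form a 3-4-5 right triangle with the origin.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Definition: square root of the sum of squared coordinate differences.
manual = np.sqrt(np.sum((a - b) ** 2))

# Same quantity via NumPy's vector 2-norm of the difference.
via_norm = np.linalg.norm(a - b)

print(manual, via_norm)  # 5.0 5.0
```

Both routes give the same number, which is why a single call to `np.linalg.norm` on the difference vector is enough.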


In [1]:
import numpy as np

In [2]:
def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    """
    Compute the Euclidean distance between two vectors.

    Parameters
    ----------
    v1 : np.ndarray
        First vector.
    v2 : np.ndarray
        Second vector.

    Returns
    -------
    float
        Euclidean distance between `v1` and `v2`.
    """
    if v1.shape != v2.shape:
        raise ValueError("Vectors must have the same shape.")
    return np.linalg.norm(v1 - v2)

## KNN search

Using the distance function you just wrote, write a function that 
finds the k-nearest neighbors of a query vector.

This function should take as input a query vector, a 2D array of database vectors,
and an integer k, the number of nearest neighbors to return. It should return
the k database vectors that are nearest to the query vector.


In [3]:
def find_nearest_neighbors(query: np.ndarray,
                           vectors: np.ndarray,
                           k: int = 1) -> np.ndarray:
    """
    Find k-nearest neighbors of a query vector.

    Parameters
    ----------
    query : np.ndarray
        Query vector.
    vectors : np.ndarray
        Vectors to search.
    k : int, optional
        Number of nearest neighbors to return, by default 1.

    Returns
    -------
    np.ndarray
        The `k` nearest neighbors of `query` in `vectors`.
    """
    if k < 1:
        raise ValueError("k must be at least 1.")
    if k > vectors.shape[0]:
        raise ValueError("k must not exceed the number of vectors.")
    distances = np.array([euclidean_distance(query, v) for v in vectors])
    nearest_indices = np.argsort(distances)[:k]
    return vectors[nearest_indices]
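
The loop above calls the distance function once per row. As an alternative sketch (not part of the exercise solution), the same Euclidean distances can be computed in one step by letting NumPy broadcast the query against every row. The helper name `knn_vectorized` and the toy arrays `db` and `q` are hypothetical.

```python
import numpy as np

def knn_vectorized(query: np.ndarray, vectors: np.ndarray, k: int = 1) -> np.ndarray:
    """Return the k rows of `vectors` closest to `query` (Euclidean)."""
    # (n, d) - (d,) broadcasts to (n, d); the norm along axis=1 gives n distances.
    distances = np.linalg.norm(vectors - query, axis=1)
    # argsort orders the row indices from smallest to largest distance.
    nearest_indices = np.argsort(distances)[:k]
    return vectors[nearest_indices]

# Toy database: the second and fourth rows are the closest to the query.
db = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]])
q = np.array([1.0, 1.0])
result = knn_vectorized(q, db, k=2)
print(result)
```

The result rows come back ordered from nearest to farthest, matching the behavior of the loop-based version.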

## Other distance metrics

For this problem we'll write a new distance function and modify 
our nearest neighbors function to accept a distance metric.


Write a function that computes the [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors.

In [4]:
from typing import Union

def cosine_distance(v1: np.ndarray, v2: np.ndarray) -> Union[float, np.ndarray]:
    """
    Compute the cosine distance between two vectors.

    Parameters
    ----------
    v1 : np.ndarray
        First vector.
    v2 : np.ndarray
        Second vector.

    Returns
    -------
    float
        Cosine distance between `v1` and `v2`.
    """
    if v1.shape != v2.shape:
        raise ValueError("Vectors must have the same shape.")
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    if norm_v1 == 0 or norm_v2 == 0:
        raise ValueError("One of the vectors is zero.")
    return 1 - (dot_product / (norm_v1 * norm_v2))

**HINT** Make sure you understand the difference between cosine similarity and cosine distance: the distance is 1 minus the similarity.
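
To make the distinction concrete, here is a small sketch that evaluates both quantities for three simple cases. The helper `cosine_similarity` and the example vectors are illustrative names, not part of the exercise code.

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Cosine of the angle between v1 and v2 (assumes neither is the zero vector).
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

x = np.array([1.0, 0.0])
cases = {
    "same direction": np.array([2.0, 0.0]),
    "orthogonal": np.array([0.0, 1.0]),
    "opposite": np.array([-1.0, 0.0]),
}
for name, y in cases.items():
    sim = cosine_similarity(x, y)
    # Distance is defined here as 1 - similarity.
    print(f"{name}: similarity={sim:+.1f}, distance={1 - sim:.1f}")
```

Similarity lies in [-1, 1], with 1 meaning the vectors point the same way; distance flips and shifts that range to [0, 2], with 0 meaning the vectors point the same way. Note that vector length does not matter: `[2, 0]` is at cosine distance 0 from `[1, 0]`.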

Now, rewrite the `find_nearest_neighbors` function to accept a distance metric so you can use either Euclidean or cosine distance.

In [5]:
def find_nearest_neighbors(query: np.ndarray,
                           vectors: np.ndarray,
                           k: int = 1,
                           distance_metric: str = "euclidean") -> np.ndarray:
    """
    Find k-nearest neighbors of a query vector with a configurable
    distance metric.

    Parameters
    ----------
    query : np.ndarray
        Query vector.
    vectors : np.ndarray
        Vectors to search.
    k : int, optional
        Number of nearest neighbors to return, by default 1.
    distance_metric : str, optional
        Distance metric to use, by default "euclidean".

    Returns
    -------
    np.ndarray
        The `k` nearest neighbors of `query` in `vectors`.
    """
    if k < 1:
        raise ValueError("k must be at least 1.")
    if k > vectors.shape[0]:
        raise ValueError("k must not exceed the number of vectors.")
    if distance_metric == "euclidean":
        distances = np.array([euclidean_distance(query, v) for v in vectors])
    elif distance_metric == "cosine":
        distances = np.array([cosine_distance(query, v) for v in vectors])
    else:
        raise ValueError(f"Unknown distance metric: {distance_metric}")
    nearest_indices = np.argsort(distances)[:k]
    return vectors[nearest_indices]
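
One possible variation, sketched here as an assumption rather than the exercise's intended solution, is to look the metric up in a dictionary of callables so that supporting another metric does not require another `elif` branch. The names `METRICS` and `nearest_by_metric` are hypothetical, and the two lambdas stand in for the `euclidean_distance` and `cosine_distance` functions defined earlier.

```python
from typing import Callable, Dict

import numpy as np

Metric = Callable[[np.ndarray, np.ndarray], float]

# Hypothetical registry; the lambdas mirror the two distance functions above.
METRICS: Dict[str, Metric] = {
    "euclidean": lambda a, b: float(np.linalg.norm(a - b)),
    "cosine": lambda a, b: float(
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    ),
}

def nearest_by_metric(query, vectors, k=1, distance_metric="euclidean"):
    if distance_metric not in METRICS:
        raise ValueError(f"Unknown distance metric: {distance_metric}")
    metric = METRICS[distance_metric]
    distances = np.array([metric(query, v) for v in vectors])
    return vectors[np.argsort(distances)[:k]]

# Toy data where the two metrics pick different neighbors.
db = np.array([[10.0, 0.0], [1.0, 1.0], [0.0, 3.0]])
q = np.array([1.0, 0.0])
nearest_euclidean = nearest_by_metric(q, db, k=1, distance_metric="euclidean")
nearest_cosine = nearest_by_metric(q, db, k=1, distance_metric="cosine")
print(nearest_euclidean)  # [[1. 1.]]
print(nearest_cosine)     # [[10.  0.]]
```

In this toy data, `[10, 0]` points in exactly the same direction as the query but is far away, so cosine distance prefers it while Euclidean distance prefers the nearby `[1, 1]`.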

## Exploration

Now that we have a nearest neighbors function that accepts a distance metric, <br/>
let's explore the differences between Euclidean distance and cosine distance.

Would you expect the same or different answers?
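
One way to reason about this: expanding the square gives `||q - v||^2 = ||q||^2 + ||v||^2 - 2 (q · v)`. If every database vector `v` has unit length, both the Euclidean distance and the cosine distance depend on `v` only through the dot product `q · v`, so the two metrics rank the neighbors identically. Without normalization they can disagree. The sketch below checks the normalized case numerically; the seed, the helper `rankings`, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
query = rng.random(5)
raw = rng.random((10, 5))
# Scale every database row to unit length.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

def rankings(q: np.ndarray, db: np.ndarray):
    """Return neighbor orderings under Euclidean and cosine distance."""
    euclid = np.linalg.norm(db - q, axis=1)
    cosine = 1 - (db @ q) / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))
    return np.argsort(euclid), np.argsort(cosine)

euclid_order, cosine_order = rankings(query, unit)
same_when_normalized = np.array_equal(euclid_order, cosine_order)
print("Same ranking with unit-length database vectors:", same_when_normalized)
```

Calling `rankings(query, raw)` on the unnormalized rows is a simple way to see whether the two orderings diverge for a given random draw.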

In [6]:
# You might find this function useful

def generate_vectors(num_vectors: int, num_dim: int,
                     normalize: bool = True) -> np.ndarray:
    """
    Generate random embedding vectors.

    Parameters
    ----------
    num_vectors : int
        Number of vectors to generate.
    num_dim : int
        Dimensionality of the vectors.
    normalize : bool, optional
        Whether to normalize the vectors, by default True.

    Returns
    -------
    np.ndarray
        Randomly generated `num_vectors` vectors with `num_dim` dimensions.
    """
    vectors = np.random.rand(num_vectors, num_dim)
    if normalize:
        vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors
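
Because the generator draws fresh random numbers on every call, each run produces different vectors. As a hedged sketch, a seeded variant makes runs repeatable and lets us confirm that normalization gives every row unit length. The function `generate_vectors_seeded` and its `seed` parameter are hypothetical additions, not part of the exercise code.

```python
import numpy as np

def generate_vectors_seeded(num_vectors: int, num_dim: int, seed: int = 0) -> np.ndarray:
    """Hypothetical seeded variant: uniform [0, 1) entries, rows scaled to unit length."""
    rng = np.random.default_rng(seed)
    vectors = rng.random((num_vectors, num_dim))
    # keepdims=True keeps shape (n, 1) so each row is divided by its own norm.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

sample = generate_vectors_seeded(4, 3)
row_norms = np.linalg.norm(sample, axis=1)
print(np.allclose(row_norms, 1.0))  # True
```

Calling the function twice with the same seed returns identical arrays, which makes it easier to compare the two distance metrics on the same data.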

In [7]:
# Generate random vectors
num_vectors = 10
num_dim = 5
query_vector = np.random.rand(num_dim)
vectors = generate_vectors(num_vectors, num_dim)

# Find nearest neighbors using Euclidean distance
euclidean_neighbors = find_nearest_neighbors(query_vector, vectors, k=3, distance_metric="euclidean")

# Find nearest neighbors using Cosine distance
cosine_neighbors = find_nearest_neighbors(query_vector, vectors, k=3, distance_metric="cosine")

print("Query Vector:", query_vector)
print("Generated Vectors:\n", vectors)
print("\nNearest Neighbors (Euclidean):\n", euclidean_neighbors)
print("\nNearest Neighbors (Cosine):\n", cosine_neighbors)

Query Vector: [0.80636498 0.52121972 0.74415507 0.89949023 0.61875503]
Generated Vectors:
 [[0.20415003 0.601279   0.27825146 0.68177622 0.23354579]
 [0.54384121 0.31200666 0.64159128 0.06749047 0.43668553]
 [0.56417267 0.43084884 0.53022862 0.41990873 0.19650124]
 [0.64436778 0.57988366 0.17021103 0.41594323 0.21574182]
 [0.64402178 0.31733336 0.52761181 0.28936737 0.34989683]
 [0.63492711 0.46717881 0.00216189 0.61476738 0.02584422]
 [0.08536769 0.74536814 0.60104077 0.24188502 0.1318344 ]
 [0.36746172 0.57368529 0.05404907 0.39545543 0.61363734]
 [0.54338715 0.62924478 0.29450522 0.46569972 0.07191556]
 [0.32538946 0.61333766 0.69953277 0.05211984 0.16086031]]

Nearest Neighbors (Euclidean):
 [[0.56417267 0.43084884 0.53022862 0.41990873 0.19650124]
 [0.64402178 0.31733336 0.52761181 0.28936737 0.34989683]
 [0.64436778 0.57988366 0.17021103 0.41594323 0.21574182]]

Nearest Neighbors (Cosine):
 [[0.56417267 0.43084884 0.53022862 0.41990873 0.19650124]
 [0.64402178 0.31733336 0.527611