# **Vector processing and customization in FAISS**

FAISS is a powerful library for nearest-neighbor search and clustering. While its core functionality revolves around indexing and searching vectors, FAISS also provides tools for advanced **vector processing** and **customization**.


In [1]:
import faiss
import numpy as np

## Vector normalization
Normalization scales vectors to unit length (magnitude of 1). This is especially useful for metrics like cosine similarity, where the direction of a vector matters more than its magnitude. We normalize vectors for two reasons:
- Equal weight in similarity computations: Without normalization, vectors with larger magnitudes can dominate inner-product scores, even when other vectors point in a more similar direction.
- Cosine similarity: Cosine similarity depends only on the angle between vectors, not their length. Normalizing every vector to unit length removes length from the comparison.

FAISS provides a built-in function, `normalize_L2`, that scales each vector to unit L2 norm (magnitude). It modifies the array in place and expects a `float32` array.

In [2]:
# Generate random vectors
dimension = 128  # Length of each vector
num_vectors = 1000  # Total number of vectors
data = np.random.random((num_vectors, dimension)).astype('float32')

# Normalize the dataset
faiss.normalize_L2(data)

# Check the norms of the vectors (should be close to 1)
norms = np.linalg.norm(data, axis=1)
print("Norm of the first 5 vectors after normalization:", norms[:5])

Norm of the first 5 vectors after normalization: [1.         1.         0.99999994 1.0000001  0.99999994]


Here, we create 1000 randomly generated vectors, each with 128 dimensions. Then, we scale every vector so that its magnitude (norm) is 1. Finally, we compute the norm of each vector and print the first 5. The values differ from 1 only by `float32` rounding error.
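
To make the scaling step concrete, here is a minimal NumPy sketch of the same arithmetic: each row is divided by its own L2 norm. It illustrates the computation and is not how FAISS implements it. Unlike `normalize_L2`, it returns a new array instead of modifying the input, and it does not guard against zero-length vectors. The array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((5, 8), dtype=np.float32)

# Divide each row by its own L2 norm; keepdims=True keeps the norms as a
# column so the division broadcasts row by row.
row_norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / row_norms

# Every row of unit_vectors now has a norm of (approximately) 1.
print(np.linalg.norm(unit_vectors, axis=1))
```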

#### Using normalized vectors for cosine similarity
Cosine similarity measures the cosine of the angle between two vectors. For unit-length vectors it equals the inner product (dot product), so after normalization we can use `IndexFlatIP` (inner product) to search by cosine similarity.

In [3]:
# Create an index for inner product search
index = faiss.IndexFlatIP(dimension)

# Add normalized vectors to the index
index.add(data)

# Create a query vector and normalize it
query_vector = np.random.random((1, dimension)).astype('float32')
faiss.normalize_L2(query_vector)

# Perform a search
k = 5
distances, indices = index.search(query_vector, k)

print("Indices of nearest neighbors:", indices)
print("Cosine similarity scores:", distances)

Indices of nearest neighbors: [[828 527 986 333 791]]
Cosine similarity scores: [[0.84198797 0.83181256 0.82972383 0.8245963  0.82215536]]


We create an index that uses the inner product (dot product) as the similarity measure. Because the vectors have unit length, the inner product is the same as cosine similarity. We add the normalized vectors to the index, generate a query vector, and normalize it in the same way as the dataset. Finally, we search for the top 5 nearest neighbors of the query vector.

The `distances` returned will represent the similarity scores (higher is more similar), and the `indices` represent the positions of the nearest neighbors in the dataset.
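
The equivalence between inner product and cosine similarity can be checked without FAISS. The sketch below, which uses its own randomly generated arrays (so its numbers will not match the output above), computes cosine similarity directly from raw vectors and compares it with the inner product of the normalized vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_data = rng.random((1000, 128), dtype=np.float32)
raw_query = rng.random(128, dtype=np.float32)

# Cosine similarity computed directly from the raw vectors.
cosine = (raw_data @ raw_query) / (
    np.linalg.norm(raw_data, axis=1) * np.linalg.norm(raw_query)
)

# Inner product of the normalized vectors.
unit_data = raw_data / np.linalg.norm(raw_data, axis=1, keepdims=True)
unit_query = raw_query / np.linalg.norm(raw_query)
inner_product = unit_data @ unit_query

# Top-5 by descending score, the same ordering an inner-product search uses.
top5 = np.argsort(-inner_product)[:5]
print("Scores agree:", np.allclose(cosine, inner_product, atol=1e-5))
print("Top-5 indices:", top5)
```

If the query were left unnormalized, the ranking would stay the same (every score is scaled by the same factor), but the scores would no longer lie in the cosine range of -1 to 1.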

## Custom distance metrics
Sometimes, the default similarity metrics like L2 distance (Euclidean) or inner product aren't sufficient. We might need a custom metric, such as:
- Weighted Euclidean distance.
- Manhattan distance.
- Pre-processed distances (e.g., scaling vector components).

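Custom metrics allow us to tailor similarity computations to specific applications, such as domain-specific weighting of vector dimensions. FAISS has no hook for arbitrary user-defined distance functions, and most index types support only L2 and inner product. When a metric can be expressed as L2 or inner product on modified vectors, we can preprocess the vectors before adding them to the index.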

#### Example: Weighted Euclidean distance
Apply a weight vector to scale each dimension before indexing. The same weights must be applied to the query vector before searching.

In [4]:
# Define a weight vector
weights = np.random.random(dimension).astype('float32')

# Preprocess data: Apply weights
weighted_data = data * weights

# Index the weighted data
index = faiss.IndexFlatL2(dimension)
index.add(weighted_data)

# Preprocess query vector and search
weighted_query = query_vector * weights
distances, indices = index.search(weighted_query, k)

print("Distances with weighted Euclidean metric:", distances)

Distances with weighted Euclidean metric: [[0.09643585 0.09670793 0.09840918 0.10057575 0.1029233 ]]
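
One detail is worth checking by hand. `IndexFlatL2` reports squared L2 distances, and scaling both vectors by `weights` before taking the squared difference weights each dimension by `weights**2`, not by `weights`. The NumPy sketch below (with its own illustrative arrays `x`, `q`, and `w`) verifies this and shows that scaling by `np.sqrt(w)` makes `w` itself the per-dimension weight.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(128)
q = rng.random(128)
w = rng.random(128)

# Squared L2 distance after scaling both vectors by w ...
scaled = np.sum((x * w - q * w) ** 2)
# ... equals a weighted sum whose per-dimension weights are w**2.
weighted_by_w_squared = np.sum(w**2 * (x - q) ** 2)

# Scaling by sqrt(w) instead makes w itself the per-dimension weight.
scaled_sqrt = np.sum((x * np.sqrt(w) - q * np.sqrt(w)) ** 2)
weighted_by_w = np.sum(w * (x - q) ** 2)

print(np.isclose(scaled, weighted_by_w_squared))
print(np.isclose(scaled_sqrt, weighted_by_w))
```

Which form is appropriate depends on how the weights are defined for the application. The sketch only requires that the weights be non-negative so that the square root exists.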


#### Example: Manhattan distance (L1 distance)
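Most FAISS index types support only L2 and inner product. (Recent versions of the flat index also accept other metric types, such as `faiss.METRIC_L1`.) When the index type is limited to L2, we can approximate a Manhattan search in two steps:
1. Retrieve candidates: Search with another metric (like L2 distance) to get a small set of nearest neighbors.
2. Postprocess results: Compute the Manhattan distance to each candidate manually and re-rank the candidates by it.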

In [5]:
# Create an L2 index to generate candidates for Manhattan re-ranking
index = faiss.IndexFlatL2(dimension)
index.add(data)

# Perform a preliminary search with L2 (to reduce the candidate set)
k = 10  # Top-k neighbors to retrieve
distances, indices = index.search(query_vector, k)

# Postprocess: Calculate Manhattan distance manually for top candidates
def compute_manhattan_distance(query, candidates):
    return np.sum(np.abs(candidates - query), axis=1)

# Retrieve the candidate vectors
candidate_vectors = data[indices[0]]

# Compute Manhattan distances for the candidates
manhattan_distances = compute_manhattan_distance(query_vector, candidate_vectors)

# Re-rank by Manhattan distance
manhattan_sorted_indices = np.argsort(manhattan_distances)

# Display the sorted results
print("Original indices (by L2):", indices[0])
print("Re-ranked indices (by Manhattan):", indices[0][manhattan_sorted_indices])
print("Manhattan distances:", manhattan_distances[manhattan_sorted_indices])


Original indices (by L2): [828 527 986 333 791 317 504 201 132 826]
Re-ranked indices (by Manhattan): [828 986 527 826 317 201 791 333 504 132]
Manhattan distances: [4.9906335 5.201288  5.263769  5.328466  5.367114  5.3852134 5.400983
 5.4567366 5.4767303 5.5304623]


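FAISS is optimized for L2 and inner product search, so this two-stage approach avoids computing the Manhattan distance to every vector in a large dataset. The result is approximate: a true Manhattan nearest neighbor that falls outside the L2 candidate set is missed. Retrieving more candidates (a larger `k`) before re-ranking reduces that risk.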
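
To see how much the candidate step can cost in accuracy, the sketch below compares the two-stage result with an exact brute-force Manhattan search. It uses NumPy only and its own random data, so the L2 candidate search is simulated with a full sort rather than a FAISS index; the overlap it reports depends on the data and on `k`.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 128), dtype=np.float32)
query = rng.random(128, dtype=np.float32)
k = 10

# Exact Manhattan neighbors by brute force over the whole dataset.
l1_all = np.sum(np.abs(data - query), axis=1)
exact_top_k = np.argsort(l1_all)[:k]

# Two-stage approximation: take the k nearest by squared L2, then re-rank by L1.
l2_all = np.sum((data - query) ** 2, axis=1)
candidates = np.argsort(l2_all)[:k]
reranked = candidates[np.argsort(l1_all[candidates])]

# Count how many exact Manhattan neighbors survived the L2 candidate step.
overlap = len(set(exact_top_k.tolist()) & set(reranked.tolist()))
print(f"{overlap} of {k} exact Manhattan neighbors were in the L2 candidate set")
```

A common remedy is to retrieve several times more candidates than the number of results needed and keep only the top results after re-ranking.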

## Vector transformation
Transformations modify vectors to:
- Reduce dimensionality (e.g., PCA).
- Optimize the vector space for quantization or search efficiency (e.g., OPQ).

We transform vectors because:
- PCA: Reduce storage and computation costs by lowering dimensions while retaining most of the variance.
- OPQ: Rotate and reorder vector components for better quantization and search performance.

### Principal component analysis (PCA)
PCA reduces the dimensionality of vectors by projecting them onto a lower-dimensional space that retains as much of the variance as possible. FAISS provides a `PCAMatrix` object to train and apply this transformation.

In [6]:
# Define a PCA matrix to reduce dimensionality to 50
pca = faiss.PCAMatrix(dimension, 50)
pca.train(data)  # Train PCA on the dataset

# Transform the dataset
reduced_data = pca.apply_py(data)
print("Original shape:", data.shape)
print("Reduced shape:", reduced_data.shape)

# Use reduced data in an index
index = faiss.IndexFlatL2(50)
index.add(reduced_data)

Original shape: (1000, 128)
Reduced shape: (1000, 50)
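
For intuition about what the projection does, here is a rough NumPy sketch of PCA through a singular value decomposition. It is an illustration under simplifying assumptions, not the FAISS implementation, and its variable names are hypothetical. It also shows two points the cell above leaves implicit: a query must be projected with the same mean and axes as the data, and the share of variance retained can be measured.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 128))
query = rng.random((1, 128))
n_components = 50

# Center the data, then take the top right-singular vectors as principal axes.
mean = data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:n_components]  # shape (50, 128)

# Project the data and the query with the same mean and axes.
reduced_data = (data - mean) @ components.T
reduced_query = (query - mean) @ components.T

# Share of the total variance kept by the first 50 components.
retained = np.sum(singular_values[:n_components] ** 2) / np.sum(singular_values**2)
print(reduced_data.shape, reduced_query.shape, f"variance retained: {retained:.1%}")
```

With the FAISS objects above, the equivalent step is to call `pca.apply_py(query_vector)` and search the reduced index with the result. Uniform random data spreads its variance across all dimensions, so the retained share here is modest; real embeddings often concentrate more variance in the leading components.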


### Optimized product quantization (OPQ)
OPQ is an advanced technique used to improve product quantization (PQ). It focuses on transforming the original vector space to a new space where the vectors are easier to quantize (compress) while maintaining the search accuracy.

PQ splits a vector into smaller subspaces and quantizes each subspace individually. This reduces memory usage, but the way the vector is split into subspaces can introduce quantization errors. OPQ improves PQ by first rotating or reordering the vector components so that the subspaces are better aligned for quantization. This minimizes quantization errors and achieves a better trade-off between memory usage and search accuracy. FAISS provides an `OPQMatrix` for this transformation.
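
A rotation can help quantization without harming the search because an orthogonal transformation leaves exact L2 distances unchanged; only the quantization error changes. The sketch below checks this with a random orthogonal matrix, which is a stand-in chosen for illustration and not a trained OPQ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.random(d)
y = rng.random(d)

# Build a random orthogonal matrix from the QR decomposition of a Gaussian matrix.
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))

# The distance between x and y is the same before and after rotating both.
dist_before = np.linalg.norm(x - y)
dist_after = np.linalg.norm(rotation @ x - rotation @ y)
print(np.isclose(dist_before, dist_after))
```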

##### Step 1: Define and train the OPQ matrix

In [7]:
# Define an OPQ matrix with 4 subspaces
opq = faiss.OPQMatrix(dimension, 4)
opq.train(data)

First, we create an OPQ transformation matrix configured for a product quantizer with 4 subspaces. The number `4` is arbitrary and can be adjusted based on the dataset size and the desired trade-off between accuracy and compression, but it should match the number of subspaces of the PQ index used later and divide the vector dimension evenly. We then train the OPQ matrix on the dataset (`data`). Training learns how to rotate and reorder the vectors in a way that reduces the error of the subsequent product quantization step.

##### Step 2: Apply OPQ transformation

In [8]:
# Apply OPQ transformation
transformed_data = opq.apply_py(data)

After training, we apply the transformation to the dataset. This rotates and reorders the vector components so that they suit product quantization better. The result, `transformed_data`, has the same shape as the original data and can typically be quantized with lower error than the original vectors.

##### Step 3: Define and train the product quantization (PQ) index

In [9]:
# Create a Product Quantization index
pq = faiss.IndexPQ(dimension, 4, 8)  # 4 subspaces, 8 bits per subspace

# Train the PQ index on the transformed data
pq.train(transformed_data)

We now define a product quantization index (`IndexPQ`) on the transformed data. The number 4 represents the number of subspaces into which each vector will be split, and 8 is the number of bits used for quantization in each subspace. Then, we train the PQ index on the transformed dataset. The goal is to learn the best way to compress the data into a more memory-efficient format while still enabling effective searches.
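
The memory saving follows from simple arithmetic on these parameters. The sketch below works it out for the values used here, assuming `float32` input vectors; the variable names are illustrative.

```python
dimension = 128
num_subspaces = 4
bits_per_code = 8

sub_dimension = dimension // num_subspaces        # components per subspace
centroids_per_subspace = 2**bits_per_code         # codebook size per subspace
raw_bytes = dimension * 4                         # float32 uses 4 bytes per component
code_bytes = num_subspaces * bits_per_code // 8   # bytes per compressed vector

# One-time cost of storing the codebooks themselves, in float32.
codebook_bytes = num_subspaces * centroids_per_subspace * sub_dimension * 4

print(sub_dimension, centroids_per_subspace, raw_bytes, code_bytes, codebook_bytes)
# prints: 32 256 512 4 131072
```

Each vector shrinks from 512 bytes to a 4-byte code, at the price of describing every 32-dimensional sub-vector by one of only 256 centroids. Using more subspaces raises accuracy and code size together.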

##### Step 4: Add transformed data to the PQ index

In [10]:
# Add the transformed data to the PQ index
pq.add(transformed_data)

After training the PQ index, we add the transformed vectors (from the OPQ step) to the index. These vectors are now ready to be searched efficiently.

##### Step 5: Apply OPQ transformation to the query vector and perform the search

In [11]:
# Apply OPQ transformation to the query vector
transformed_query = opq.apply_py(query_vector)

# Search with OPQ-transformed vectors
distances, indices = pq.search(transformed_query, k)
print("Indices of nearest neighbors:", indices)

Indices of nearest neighbors: [[  7 504 349 931 260 776 440 718 415 333]]


We apply the same OPQ transformation to the query vector. This ensures that the query vector is in the same transformed space as the data vectors, enabling a consistent and efficient search. Then, we perform a search using the transformed query vector. The `pq.search()` function looks for the top `k` nearest neighbors to the query vector, based on the product quantization representation in the transformed space.

The `distances` are approximate squared L2 distances between the query vector and the quantized neighbors (lower is closer), and `indices` give the positions of these nearest neighbors in the index.