### Introduction to FAISS for Vector Search in Python


#### What is FAISS?
FAISS (Facebook AI Similarity Search) is an open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is widely used in machine learning and information retrieval applications where you need to find similar items in a dataset, such as images or documents.

#### Prerequisites for Installing FAISS

To use FAISS with Python, you need to install a few prerequisites:

1. **Python 3.6 or later**
2. **FAISS Installation**: FAISS can be installed via pip, but there are different versions based on whether you want GPU acceleration or just CPU-based computation.
3. **NumPy**
4. **Optional - CUDA**: If you want GPU acceleration.

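You can install FAISS via pip as follows. The PyPI wheels are community-maintained; the FAISS project's own installation guide recommends conda, so check there if pip gives you trouble: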

- For CPU only:
  ```bash
  pip install faiss-cpu
  ```

- For GPU support:
  ```bash
  pip install faiss-gpu
  ```

#### Setting Up FAISS for Vector Search

Let's explore how to use FAISS for a simple vector search example in Python. We'll walk through creating a set of random vectors and performing a similarity search to find the closest matches.

##### Step 1: Import Libraries

```python
import numpy as np
import faiss
```

##### Step 2: Create Data to Search Against

Let's generate some random vectors to use as our dataset. We'll use NumPy to create 1,000 vectors with 128 dimensions each, stored as `float32` because that is the data type FAISS expects.

```python
dim = 128
num_vectors = 1000
dataset = np.random.random((num_vectors, dim)).astype('float32')
```

##### Step 3: Build the FAISS Index

We need to create a FAISS index to store our dataset. Here, we'll use `IndexFlatL2`, a brute-force index that compares the query against every stored vector using L2 (Euclidean) distance. A flat index needs no training step, so building it only requires adding the vectors.

```python
index = faiss.IndexFlatL2(dim)
index.add(dataset)
print(f"Number of vectors in the index: {index.ntotal}")
```

```
Number of vectors in the index: 1000
```

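To make the brute-force idea concrete, here is a NumPy-only sketch of the kind of computation a flat L2 index performs: measure the squared Euclidean distance from the query to every vector, then keep the smallest ones. It is an illustration under stated assumptions (a seeded random dataset, a hypothetical `nearest` variable), not FAISS's actual implementation, which is heavily optimized.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_vectors, k = 128, 1000, 10
dataset = rng.random((num_vectors, dim), dtype=np.float32)
query = rng.random((1, dim), dtype=np.float32)

# Squared Euclidean distance from the query to every dataset vector.
diffs = dataset - query                 # broadcasts to (num_vectors, dim)
sq_dists = np.sum(diffs * diffs, axis=1)

# Indices of the k smallest distances, closest first.
nearest = np.argsort(sq_dists)[:k]
print(nearest)
print(sq_dists[nearest])
```

The cost grows linearly with the number of stored vectors, which is why flat indexes suit small and medium datasets best.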

##### Step 4: Perform a Similarity Search

Now that we have our index, let's create a query vector and find its nearest neighbors.

```python
query_vector = np.random.random((1, dim)).astype('float32')

# Search the index for the 10 nearest neighbors
k = 10
_, indices = index.search(query_vector, k)
print("Indices of the ten nearest neighbors:", indices)
```

```
Indices of the ten nearest neighbors: [[898 930 496 406 145 266 546 984 205 237]]
```


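The output lists the indices of the ten dataset vectors closest to our query vector, ordered from nearest to farthest by L2 distance. Because the dataset and the query are random and unseeded, your indices will differ from the ones shown here.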

#### Embeddings with BERT -> FAISS
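The same pattern works for text embeddings. The snippets below show the general idea rather than a complete program: they assume that `embeddings` is a 2-D `float32` NumPy array with one row per article, and that `torch`, a `tokenizer`, and a BERT `model` (presumably from the Hugging Face `transformers` library) are already loaded.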
```python
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
print(f"Number of vectors in the index: {index.ntotal}")
```

```python
# Generate an embedding for the query
query = "What is FAISS used for?"
query_inputs = tokenizer(query, return_tensors='pt')
with torch.no_grad():
    query_embedding = model(**query_inputs).last_hidden_state.mean(dim=1).numpy()

# Search the index for the 5 most similar articles
k = 5
_, article_indices = index.search(query_embedding, k)
print("Indices of the most relevant articles:", article_indices)
```
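The pooling step in the query snippet, `.last_hidden_state.mean(dim=1)`, averages the per-token vectors into a single vector per text. The NumPy sketch below illustrates only the shapes involved, using a random stand-in array instead of real model output; the sequence length of 9 is arbitrary and 768 is the hidden size of BERT-base.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model output: (batch, tokens, hidden size).
batch, seq_len, hidden = 1, 9, 768
last_hidden_state = rng.standard_normal((batch, seq_len, hidden)).astype(np.float32)

# Average over the token axis to get one vector per input text.
query_embedding = last_hidden_state.mean(axis=1)

print(query_embedding.shape, query_embedding.dtype)  # (1, 768) float32
```

The result is a 2-D `float32` array with one row per query, which is the input shape `index.search` expects.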

#### Summary

FAISS is a powerful tool for similarity search and is easy to set up in Python. It runs on both CPU and GPU, and it offers index types designed for datasets too large for brute-force search. In this example, we created an index, added vectors to it, and ran a similarity search.

To go further, experiment with other FAISS index types, such as `IndexIVFFlat`, which trades a little accuracy for faster searches on larger datasets, or use GPU acceleration to handle millions of vectors.

#### More Reading
https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

https://github.com/facebookresearch/faiss