## Faiss

A demo of Faiss for nearest-neighbor (NN) search.

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.

## Installing Faiss via conda

### CPU-only version

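Run the following command in a terminal. The `-n lang` option installs the package into the conda environment named `lang`; substitute the name of your own environment.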

`conda install -c pytorch faiss-cpu -n lang`

## Prepare some synthetic data

The code below follows the Faiss getting-started guide: https://github.com/facebookresearch/faiss/wiki/Getting-started

In [4]:
import numpy as np

import faiss                   # make faiss available

In [6]:
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries

np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.

xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

In [13]:
xb[10000][0]

10.635717
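The value above is consistent with how the data was built: row `i` gets `i / 1000` added to its first coordinate, so row 10000 holds 10 plus a uniform draw from [0, 1). The following is a minimal sketch of that construction on a smaller array. The sizes `d = 8` and `nb = 5000` and the name `base_first` are chosen only for this illustration and are not part of the original notebook.

```python
import numpy as np

d = 8          # small dimension, chosen only for this illustration
nb = 5000      # small database size, chosen only for this illustration

rng = np.random.RandomState(1234)
xb = rng.random_sample((nb, d)).astype('float32')
base_first = xb[:, 0].copy()          # first coordinate before the shift
xb[:, 0] += np.arange(nb) / 1000.     # same index-dependent shift as above

# The shift applied to row i is i / 1000, up to float32 rounding.
shift = xb[:, 0] - base_first
print(shift[0], shift[1000], shift[4000])
```

Because only the first coordinate drifts with the row index, vectors with nearby indices tend to be closer to each other than vectors with distant indices, which makes the search results easier to eyeball.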

## Building an index and adding the vectors to it

There are many types of indexes. We use the simplest one, `IndexFlatL2`, which performs brute-force L2 distance search over the stored vectors.

In [8]:
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)        # a flat index needs no training, so this prints True
index.add(xb)                  # add vectors to the index
print(index.ntotal)

assert index.ntotal == nb

True
100000


## Searching

The basic search operation that can be performed on an index is the k-nearest-neighbor search, i.e., for each query vector, find its k nearest neighbors in the database.
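To make the definition concrete, here is a minimal NumPy sketch of what a brute-force k-nearest-neighbor search computes. It is not Faiss code: the function name `knn_bruteforce` and the small random arrays are hypothetical and exist only for this illustration. It returns squared L2 distances, which, as far as the Faiss documentation describes, is also what `IndexFlatL2` reports.

```python
import numpy as np

def knn_bruteforce(xb, xq, k):
    """Return (D, I): squared L2 distances and row indices of the k
    nearest database vectors in xb for each query vector in xq."""
    # Pairwise differences, shape (nq, nb, d). Simple but memory-hungry,
    # so this is only suitable for small arrays.
    diff = xq[:, None, :] - xb[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)                    # shape (nq, nb)
    I = np.argsort(sq_dist, axis=1, kind='stable')[:, :k]  # k closest per query
    D = np.take_along_axis(sq_dist, I, axis=1)             # their distances
    return D, I

rng = np.random.RandomState(0)
xb_small = rng.random_sample((200, 16)).astype('float32')

# Query with the first 5 database vectors: each should find itself first.
D_small, I_small = knn_bruteforce(xb_small, xb_small[:5], k=4)
print(I_small[:, 0])
print(D_small[:, 0])
```

Each row of `I` lists neighbor indices from nearest to farthest, and the matching row of `D` holds the corresponding distances in ascending order.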

In [18]:
k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check: each database vector should find itself first, at distance 0
print(I)
print(D)

import time
start = time.process_time()    # CPU time of this process, not wall-clock time
D, I = index.search(xq, k)     # actual search

print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

# Note: the measured interval also includes the two print calls above
print("Time elapsed: {}".format(time.process_time() - start))

[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[0.        7.1751733 7.207629  7.2511625]
 [0.        6.3235645 6.684581  6.7999454]
 [0.        5.7964087 6.391736  7.2815123]
 [0.        7.2779055 7.5279865 7.6628466]
 [0.        6.7638035 7.2951202 7.3688145]]
[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
Time elapsed: 11.889017000000003
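A note on the timing: `time.process_time()` reports CPU time consumed by the process, which can differ from the wall-clock time a user waits, particularly if the search runs on several threads. The sketch below measures both around a single call. The helper name `timed` is hypothetical, and the `sum` workload is only a stand-in so the snippet runs without Faiss.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, wall_seconds, cpu_seconds).
    Hypothetical helper, not part of Faiss or the original notebook."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

# Stand-in workload; in the notebook this could instead be
# timed(index.search, xq, k), which would return ((D, I), wall, cpu).
total, wall, cpu = timed(sum, range(1_000_000))
print("wall: {:.4f}s  cpu: {:.4f}s".format(wall, cpu))
```

Timing only the call itself also keeps the `print` statements out of the measured interval.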
