# Neural Information Retrieval Using Nearest Neighbor


Neural information retrieval (IR) uses embeddings to find similar items in a search space built from the search corpus.

<br/>
We will use the FAISS library from Facebook for this hands-on presentation. Other solutions and benchmark data are listed <a href="http://ann-benchmarks.com/">here</a>.


We will look at search `performance` (query time) and `recall@1`.
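As a rough illustration of the metric, recall@1 can be read as the fraction of queries whose top-ranked result is the expected item. The sketch below assumes a hypothetical helper `recall_at_k` (it is not part of FAISS) and uses made-up result IDs.

```python
import numpy as np

def recall_at_k(retrieved_ids, expected_ids, k=1):
    """Fraction of queries whose expected ID appears in the top-k results.

    retrieved_ids: (n_queries, >=k) array of result IDs, best match first.
    expected_ids:  (n_queries,) array with the single correct ID per query.
    """
    retrieved_ids = np.asarray(retrieved_ids)[:, :k]
    expected_ids = np.asarray(expected_ids).reshape(-1, 1)
    hits = (retrieved_ids == expected_ids).any(axis=1)
    return float(hits.mean())

# Made-up example: 4 queries, top-2 results each.
toy_results = np.array([[0, 3], [1, 0], [3, 2], [3, 1]])
toy_expected = np.array([0, 1, 2, 3])
toy_recall_1 = recall_at_k(toy_results, toy_expected, k=1)  # 3 of 4 top-1 hits
toy_recall_2 = recall_at_k(toy_results, toy_expected, k=2)  # all 4 found in top 2
```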

## Load the libraries

In [13]:
from pathlib import Path
import numpy as np
import pandas as pd
import faiss
import datasets

## Load the embeddings of the image corpus

In [14]:
dset = datasets.load_from_disk("../data/processed_embeddings")
# These embeddings will be used to build the search space.
corpus = dset['embeddings']


corpus = np.array(corpus).astype('float32')
corpus = np.unique(corpus, axis=0)
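Note that `np.unique(..., axis=0)` both drops exact duplicate rows and returns the remaining rows in sorted (lexicographic) order, so row positions in `corpus` no longer match row positions in `dset`. A small sketch with made-up vectors shows the effect:

```python
import numpy as np

toy = np.array([[0.5, 1.0],
                [-1.0, 2.0],
                [0.5, 1.0],    # exact duplicate of row 0
                [0.5, 0.0]], dtype="float32")

# return_index gives, for each unique row, its first position in `toy`.
deduped, first_pos = np.unique(toy, axis=0, return_index=True)

print(deduped)    # rows come back sorted: [-1, 2], [0.5, 0], [0.5, 1]
print(first_pos)  # [1 3 0]
```

Keeping an array like `first_pos` is one possible way to map a search result back to the original dataset row.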

In [15]:
corpus.shape

(24958, 512)

In [16]:
corpus

array([[-0.8442147 ,  0.1623359 ,  0.3072559 , ...,  0.40091127,
        -0.20468342, -0.11151588],
       [-0.82529676,  0.26502204,  0.05470146, ...,  0.27426067,
        -0.56668615, -0.08001572],
       [-0.8063217 , -0.19098167, -0.22377765, ...,  0.2548957 ,
        -0.12389499,  0.27585742],
       ...,
       [ 0.9416133 ,  0.25078115,  0.20232391, ...,  0.35934162,
        -0.11814432, -0.4162678 ],
       [ 0.9556082 ,  0.44172812,  0.06752564, ...,  0.00269235,
         0.08827013, -0.36235547],
       [ 1.0331686 ,  0.42330503, -0.11260805, ...,  0.16301963,
         0.00906937, -0.25052726]], dtype=float32)

In [17]:
dimension = corpus.shape[-1]
dimension

512

## Create the <a href="https://github.com/facebookresearch/faiss">FAISS</a> index
<br/>FAISS builds the vector search space from the corpus embeddings.
<br/>We start with a flat index (`IndexFlatL2`), which stores the vectors as they are and compares each query against all of them; FAISS also offers <a href="https://github.com/facebookresearch/faiss/wiki/Faiss-indexes">other, more efficient indexes</a>.
<br/>FAISS supports both CPU and GPU indexes. GPU search is <a href="https://github.com/facebookresearch/faiss/wiki/Comparing-GPU-vs-CPU">generally faster</a> than CPU search, as long as the complete index fits in memory.

### Flat Index

In [18]:
xb = corpus
xb.shape
dimension = xb.shape[-1]
# Initialize the flat index for the data dimension (512 in this example).
index = faiss.IndexFlatL2(dimension)


In [19]:
index.is_trained


True

In [20]:
# add data to the index. This is a CPU based index.
index.add(xb)                


In [21]:
xq = xb


In [22]:
len(xq)


24958

Number of nearest neighbors (results) to retrieve per query:

In [23]:
k = 1

#### Index Search
The `search` method returns, for each query vector, the indices (`I`) of the `k` most similar corpus vectors and their distances (`D`) to the query. For `IndexFlatL2`, the distances are squared Euclidean (L2) distances.
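To make the two return values concrete, here is a minimal NumPy sketch of what an exhaustive L2 search computes. It is an illustration only, not FAISS's implementation; `brute_force_search` is a hypothetical helper, the data is random, and the function returns squared distances to mirror `IndexFlatL2`.

```python
import numpy as np

def brute_force_search(xb, xq, k):
    """Exhaustive k-nearest-neighbour search under squared L2 distance.

    Returns (D, I): D[i, j] is the squared distance from query i to its
    j-th nearest corpus vector, and I[i, j] is that vector's row in xb.
    """
    # Pairwise differences, shape (n_queries, n_corpus, dim). Fine for toy
    # sizes; far too memory-hungry for a real corpus.
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    I = np.argsort(dist, axis=1, kind="stable")[:, :k]
    D = np.take_along_axis(dist, I, axis=1)
    return D, I

rng = np.random.default_rng(0)
toy_xb = rng.standard_normal((100, 8)).astype("float32")
# Query with the first five corpus vectors; each should find itself first.
toy_D, toy_I = brute_force_search(toy_xb, toy_xb[:5], k=3)
```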

Search for a single vector:

In [24]:
%%timeit
D, I = index.search(xq[:1], k)   

18.6 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Search for all vectors in the corpus:

In [26]:
%%time
D, I = index.search(xq, k)     

CPU times: user 29.8 s, sys: 54.5 ms, total: 29.9 s
Wall time: 10.3 s


Squared L2 distance from each query vector to its nearest corpus vector:

In [27]:
D

array([[0.0000000e+00],
       [7.6293945e-05],
       [0.0000000e+00],
       ...,
       [6.1035156e-05],
       [0.0000000e+00],
       [0.0000000e+00]], dtype=float32)
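The small nonzero values (for example `7.6293945e-05`) for a vector's distance to itself are consistent with float32 rounding. One plausible cause, stated here as an assumption about the internals rather than a confirmed fact, is that batched searches evaluate the distance through the expansion ||x||^2 + ||y||^2 - 2 x·y instead of summing squared differences. The sketch below compares the two forms on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 512)).astype("float32")

# Direct form: sum of squared differences; exactly zero for x vs. itself.
direct = ((x - x) ** 2).sum(axis=1)

# Expanded form: ||x||^2 + ||y||^2 - 2 x.y with y = x, all in float32.
sq_norm = (x * x).sum(axis=1)
dot = np.einsum("ij,ij->i", x, x)
expanded = sq_norm + sq_norm - 2.0 * dot

print(direct.max())            # 0.0
print(np.abs(expanded).max())  # tiny; may or may not be exactly 0
```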

ID of the nearest corpus vector for each query:



In [29]:
I

array([[    0],
       [    1],
       [    2],
       ...,
       [24955],
       [24956],
       [24957]])

Because the queries are the corpus vectors themselves and their IDs are sequential, the ideal top-1 result for query `i` is ID `i`, so `I[:, 0]` should equal `0, 1, 2, ...`.

In [30]:
z = I[:, 0] == np.arange(len(xq))
z

array([ True,  True,  True, ...,  True,  True,  True])

In [31]:
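# Note: "recall@1" here is the number of correct top-1 hits;
# divide by len(z) to get the fraction.
{
    "recall@1": z.sum(),
    "num_vectors": len(z),
    "mismatch": len(z) - z.sum(),
}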


{'recall@1': 24957, 'num_vectors': 24958, 'mismatch': 1}

In [32]:
np.arange(len(xq))[z]

array([    0,     1,     2, ..., 24955, 24956, 24957])

In [33]:
z

array([ True,  True,  True, ...,  True,  True,  True])

In [34]:
np.where(~z)

(array([3766]),)

In [35]:
D[3766]

array([0.], dtype=float32)

In [36]:
I[3766]

array([3765])
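The single mismatch reports a distance of `0.` to a different row (`3765`), even though `np.unique` removed exact duplicates. This suggests that rows 3765 and 3766 are near-duplicates whose difference is lost in float32 rounding. A minimal sketch of how one might check such cases, using a hypothetical helper `inspect_mismatches` and made-up data in place of the notebook's `xb`, `xq`, and `I`:

```python
import numpy as np

def inspect_mismatches(xb, xq, I):
    """For each query whose top-1 result is not its own row, report
    (query row, returned row, exact squared distance in float64)."""
    expected = np.arange(len(xq))
    rows = []
    for q in np.flatnonzero(I[:, 0] != expected):
        returned = int(I[q, 0])
        diff = xq[q].astype("float64") - xb[returned].astype("float64")
        rows.append((int(q), returned, float(diff @ diff)))
    return rows

# Made-up corpus: row 2 is row 1 plus a tiny perturbation.
toy_xb = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1e-6], [5.0, 5.0]],
                  dtype="float32")
# Pretend the index returned row 1 for query 2.
toy_I = np.array([[0], [1], [1], [3]])
report = inspect_mismatches(toy_xb, toy_xb, toy_I)
print(report)
```

A very small but nonzero exact distance in the report would point to near-duplicate rows rather than a faulty search.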

### FAISS IVF Index

<img src="https://d33wubrfki0l68.cloudfront.net/44acb1425f25e30ca058daec92bdb209c6c47ad2/e92fc/images/faiss5.png" width="500"/>

<p> Image from Pinecone Faiss Tutorial </p>
https://www.pinecone.io/learn/faiss-tutorial/


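**Parameters**:
- `nlist`: number of clusters (cells) the corpus is partitioned into when the index is trained
- `nprobe`: number of clusters searched for each query; higher values scan more vectors and trade speed for recall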
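The role of the two parameters can be sketched with a toy inverted-file search in NumPy. This illustrates the idea only and is not how FAISS implements `IndexIVFFlat`; the helper names (`nearest_centroids`, `build_toy_ivf`, `toy_ivf_search`) and the data are made up, and the clustering is a few plain k-means iterations.

```python
import numpy as np

def nearest_centroids(x, centroids, n):
    """IDs of the n closest centroids for every row of x."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d, axis=1, kind="stable")[:, :n]

def build_toy_ivf(xb, nlist, n_iter=10, seed=0):
    """Cluster xb into nlist cells; return (centroids, cell ID per vector)."""
    rng = np.random.default_rng(seed)
    centroids = xb[rng.choice(len(xb), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = nearest_centroids(xb, centroids, 1)[:, 0]
        for c in range(nlist):
            members = xb[assign == c]
            if len(members):  # leave empty cells unchanged
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroids(xb, centroids, 1)[:, 0]
    return centroids, assign

def toy_ivf_search(xb, centroids, assign, xq, nprobe):
    """Top-1 search that scans only the nprobe cells closest to each query."""
    probes = nearest_centroids(xq, centroids, nprobe)
    result = np.full(len(xq), -1)
    for i, q in enumerate(xq):
        candidates = np.flatnonzero(np.isin(assign, probes[i]))
        if len(candidates) == 0:  # all probed cells are empty: keep -1
            continue
        d = ((xb[candidates] - q) ** 2).sum(axis=1)
        result[i] = candidates[np.argmin(d)]
    return result

rng = np.random.default_rng(1)
toy_xb = rng.standard_normal((500, 16)).astype("float32")
centroids, assign = build_toy_ivf(toy_xb, nlist=8)
top1_one_cell = toy_ivf_search(toy_xb, centroids, assign, toy_xb, nprobe=1)
top1_all_cells = toy_ivf_search(toy_xb, centroids, assign, toy_xb, nprobe=8)
```

Because every query in this sketch is itself a corpus vector, it falls in the cell that contains it, so even `nprobe=1` finds it. Queries that are not in the corpus can miss true neighbours that sit in unprobed cells, which is where a larger `nprobe` helps.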

In [37]:
nlist = 20  # number of clusters (cells) to partition the corpus into
quantizer = faiss.IndexFlatL2(dimension)  # coarse quantizer: assigns each vector to its nearest centroid
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

In [38]:
assert not index.is_trained
index.train(xb)
assert index.is_trained

In [39]:
xq = xb
# add the corpus vectors to the trained IVF index
index.add(xb)         


Search for a single vector:

In [40]:
%%timeit


index.nprobe = 1              # default nprobe is 1

D, I = index.search(xq[:1], k)     # actual search

1.39 ms ± 4.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Search for the entire corpus:

In [41]:
%%time


index.nprobe = 1              

D, I = index.search(xq, k)     # actual search

CPU times: user 29.4 s, sys: 54 ms, total: 29.5 s
Wall time: 7.48 s


In [42]:
z = I[:, 0] == np.arange(len(xq))
{
    "recall@1": z.sum(),
    "num_vectors": len(z),
    "mismatch": len(z) - z.sum(),
}


{'recall@1': 24958, 'num_vectors': 24958, 'mismatch': 0}

Increase the number of cells that are probed:

In [44]:
%%time
index.nprobe = 5              # probe 5 of the 20 clusters (the default is 1)

D, I = index.search(xq, k)    

CPU times: user 2min 21s, sys: 220 ms, total: 2min 21s
Wall time: 35.6 s


In [45]:
z = I[:, 0] == np.arange(len(xq))
{
    "recall@1": z.sum(),
    "num_vectors": len(z),
    "mismatch": len(z) - z.sum(),
}


{'recall@1': 24958, 'num_vectors': 24958, 'mismatch': 0}