### Neural Information Retrieval Using Nearest Neighbor


Neural information retrieval (IR) uses embeddings to search for similar items in a search space built from the corpus data.
<br/>
We will use the FAISS library from Facebook for this hands-on walkthrough. Other solutions, along with benchmark data, are listed <a href="http://ann-benchmarks.com/">here</a>.

### Load the libraries

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import faiss
import datasets

### Load the embeddings of the image corpus

In [2]:
dset = datasets.load_from_disk("../data/processed_embeddings")
# These embeddings are used to build the search space.
corpus = dset['embeddings']


# FAISS expects float32 input
corpus = np.array(corpus).astype('float32')
# drop duplicate embeddings (np.unique also sorts the rows)
corpus = np.unique(corpus, axis=0)
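
One side effect of the deduplication step is worth spelling out: `np.unique(..., axis=0)` returns the unique rows in sorted order, so row `i` of `corpus` no longer corresponds to row `i` of the dataset. The sketch below uses a made-up toy array (not the real embeddings) to show how `return_index=True` could be used to keep a mapping back to the original rows.

```python
import numpy as np

# Toy stand-in for the embeddings matrix (values are made up for illustration).
toy = np.array([[0.5, 1.0],
                [0.1, 2.0],
                [0.5, 1.0],   # duplicate of row 0
                [0.1, 2.0]],  # duplicate of row 1
               dtype="float32")

# unique_rows: duplicates dropped, rows sorted lexicographically
# first_seen:  position of the first occurrence of each unique row in `toy`
unique_rows, first_seen = np.unique(toy, axis=0, return_index=True)

print(unique_rows.shape)  # (2, 2)
print(first_seen)         # [1 0]
```

With such a mapping, an id returned by the index (a row of the deduplicated array) can be translated back to a dataset row via `first_seen[id]`.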

In [3]:
corpus.shape

(24958, 512)

In [4]:
corpus

array([[-0.8442147 ,  0.1623359 ,  0.3072559 , ...,  0.40091127,
        -0.20468342, -0.11151588],
       [-0.82529676,  0.26502204,  0.05470146, ...,  0.27426067,
        -0.56668615, -0.08001572],
       [-0.8063217 , -0.19098167, -0.22377765, ...,  0.2548957 ,
        -0.12389499,  0.27585742],
       ...,
       [ 0.9416133 ,  0.25078115,  0.20232391, ...,  0.35934162,
        -0.11814432, -0.4162678 ],
       [ 0.9556082 ,  0.44172812,  0.06752564, ...,  0.00269235,
         0.08827013, -0.36235547],
       [ 1.0331686 ,  0.42330503, -0.11260805, ...,  0.16301963,
         0.00906937, -0.25052726]], dtype=float32)

### Create the <a href="https://github.com/facebookresearch/faiss">FAISS</a> index
<br/>FAISS builds the vector search space from the corpus embeddings.
<br/>We use a flat index (<code>IndexFlatL2</code>), which stores the vectors as they are and searches them exhaustively, but <a href="https://github.com/facebookresearch/faiss/wiki/Faiss-indexes">other, more efficient indexes</a> are also available in FAISS.
<br/>FAISS supports both CPU and GPU indexes. GPU search is <a href="https://github.com/facebookresearch/faiss/wiki/Comparing-GPU-vs-CPU">generally faster</a> than CPU search, as long as the complete index fits in GPU memory.
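
To judge whether an index fits in memory, a rough estimate helps. A flat index keeps every vector uncompressed as float32, so its size is approximately the number of vectors times the dimension times 4 bytes. The numbers below are taken from `corpus.shape` in this notebook; index bookkeeping overhead is ignored, so treat the result as a lower bound.

```python
n_vectors, dim = 24958, 512   # corpus.shape as reported in this notebook
bytes_per_float32 = 4

# raw vector storage only; index bookkeeping overhead is ignored
size_bytes = n_vectors * dim * bytes_per_float32
print(f"{size_bytes / 2**20:.1f} MiB")  # 48.7 MiB
```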

In [35]:
xb = corpus
xb.shape
dimension = xb.shape[-1]
# Initialize a flat (exhaustive) L2 index for the data dimension, which is 512 in this example.
index = faiss.IndexFlatL2(dimension)


In [10]:
index.is_trained


True

In [36]:
# Add the corpus vectors to the index. This is a CPU-based index.
index.add(xb)


In [12]:
xq = xb


In [13]:
len(xq)


100000

In [14]:
k = 1  # number of nearest neighbors to return per query

### Index Search
The `search` method returns, for each query vector, the indices (I) of the k most similar corpus vectors and their distances (D) from the query vector. For `IndexFlatL2`, the reported distances are squared Euclidean (L2) distances.
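
As a rough mental model of what a flat L2 search computes, here is a minimal NumPy sketch. It is not how FAISS is implemented internally; `flat_l2_search` and the toy arrays are hypothetical names introduced only for this illustration. Both outputs have shape `(n_queries, k)`.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exhaustive search: squared L2 distance from every query to every corpus row."""
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, evaluated for all pairs at once
    d2 = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * (xq @ xb.T) + (xb ** 2).sum(axis=1)
    d2 = np.maximum(d2, 0.0)               # rounding can push exact zeros slightly negative
    I = np.argsort(d2, axis=1)[:, :k]      # ids of the k closest corpus rows, nearest first
    D = np.take_along_axis(d2, I, axis=1)  # their squared distances
    return D, I

toy_xb = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]], dtype="float32")
toy_xq = np.array([[0.0, 0.0]], dtype="float32")
D_toy, I_toy = flat_l2_search(toy_xb, toy_xq, k=2)
print(I_toy)  # [[0 1]]
print(D_toy)  # [[ 0. 25.]]  squared distance: 25, not 5
```

Expanding the distance into norms and a dot product is fast but loses a little float32 precision. Small non-zero distances for a vector matched against itself are consistent with that kind of rounding, although this is an inference and not something the notebook confirms.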

In [15]:
%%timeit
D, I = index.search(xq[:1], k)   

74.4 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [16]:
%%time
D, I = index.search(xq, k)     

CPU times: user 15min 42s, sys: 975 ms, total: 15min 43s
Wall time: 3min 3s


In [17]:
D

array([[3.0517578e-05],
       [3.0517578e-05],
       [0.0000000e+00],
       ...,
       [0.0000000e+00],
       [6.1035156e-05],
       [1.2207031e-04]], dtype=float32)

In [18]:
I

array([[    0],
       [    1],
       [    2],
       ...,
       [99997],
       [99998],
       [99999]])

In [19]:
# True where a query's nearest neighbor is the query vector itself
z = I[:, 0] == np.arange(len(xq))

In [20]:
z.sum(), len(z), len(z) - z.sum()

(100000, 100000, 0)

In [21]:
np.arange(len(xq))[z]

array([    0,     1,     2, ..., 99997, 99998, 99999])

In [22]:
z

array([ True,  True,  True, ...,  True,  True,  True])

In [23]:
np.where(~z)

(array([], dtype=int64),)

In [24]:
D[3766]

array([3.0517578e-05], dtype=float32)

In [25]:
I[3766]

array([3766])

## FAISS IVF index

In [26]:
nlist = 20  # number of clusters (inverted lists)
quantizer = faiss.IndexFlatL2(dimension)  # coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

In [27]:
assert not index.is_trained
index.train(xb)
assert index.is_trained
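
The training step is what distinguishes an IVF index from a flat one: it learns `nlist` cluster centroids, each vector is then stored in the inverted list of its nearest centroid, and a query scans only the `nprobe` closest lists. The NumPy sketch below illustrates that idea on random toy data. It is a simplification under stated assumptions, not the FAISS implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def nearest_centroids(x, centroids, n):
    """Ids of the n closest centroids for every row of x (squared L2)."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :n]

def train_centroids(x, nlist, n_iter=10, seed=0):
    """A few rounds of Lloyd's k-means; returns nlist centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=nlist, replace=False)]
    for _ in range(n_iter):
        assign = nearest_centroids(x, centroids, 1)[:, 0]
        for c in range(nlist):
            members = x[assign == c]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[c] = members.mean(axis=0)
    return centroids

def ivf_search(xb, lists, centroids, xq, k, nprobe):
    """Scan only the nprobe inverted lists closest to each query."""
    I = np.full((len(xq), k), -1, dtype=np.int64)
    D = np.full((len(xq), k), np.inf, dtype=np.float32)
    probes = nearest_centroids(xq, centroids, nprobe)
    for qi, q in enumerate(xq):
        cand = np.concatenate([lists[c] for c in probes[qi]])  # candidate corpus ids
        d2 = ((xb[cand] - q) ** 2).sum(axis=1)
        order = np.argsort(d2)[:k]
        I[qi, :len(order)] = cand[order]
        D[qi, :len(order)] = d2[order]
    return D, I

rng = np.random.default_rng(0)
toy_xb = rng.standard_normal((500, 8)).astype("float32")   # random toy corpus
toy_centroids = train_centroids(toy_xb, nlist=5)           # "train"
toy_assign = nearest_centroids(toy_xb, toy_centroids, 1)[:, 0]
toy_lists = [np.flatnonzero(toy_assign == c) for c in range(5)]  # "add"
D_ivf, I_ivf = ivf_search(toy_xb, toy_lists, toy_centroids, toy_xb[:10], k=1, nprobe=1)
```

Raising `nprobe` scans more lists per query, which costs more time but lowers the chance of missing a true neighbor that sits in an adjacent cluster.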

In [37]:
xq = xb
# add the corpus vectors to the IVFFlat index
index.add(xb)


In [29]:
%%timeit


index.nprobe = 1              # default nprobe is 1

D, I = index.search(xq[:1], k)     # actual search

2.57 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [30]:
%%time


index.nprobe = 1              # default nprobe is 1

D, I = index.search(xq, k)     # actual search

CPU times: user 7min 31s, sys: 135 ms, total: 7min 32s
Wall time: 56.6 s


In [31]:
z = I[:, 0] == np.arange(len(xq))
z.sum(), len(z), len(z) - z.sum()

(100000, 100000, 0)

In [32]:
%%time
index.nprobe = 5              # search the 5 nearest clusters per query (default is 1)

D, I = index.search(xq, k)    

CPU times: user 37min 26s, sys: 538 ms, total: 37min 27s
Wall time: 4min 41s


In [33]:
z = I[:, 0] == np.arange(len(xq))
z.sum(), len(z), len(z) - z.sum()

(100000, 100000, 0)
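
The checks above query the index with vectors that are already stored in it. That is an easy case for an IVF index: a stored vector would be expected to sit in the list of its own nearest centroid, so even `nprobe = 1` finds it. A more informative measure is recall on queries that are not in the index, comparing the IVF results against the exact results of the flat index. The helper below is a minimal sketch of that comparison; `recall_at_k` and the toy id arrays are hypothetical and stand in for the `I` arrays returned by the two indexes.

```python
import numpy as np

def recall_at_k(I_approx, I_exact):
    """Fraction of exact neighbors that the approximate search also returned."""
    hits = sum(len(np.intersect1d(a, e)) for a, e in zip(I_approx, I_exact))
    return hits / I_exact.size

# Hypothetical id arrays of shape (n_queries, k). In the notebook they would come
# from index.search on the IVF index and on the flat index for the same queries.
I_exact_toy = np.array([[3, 7], [1, 4], [9, 2]])
I_approx_toy = np.array([[3, 7], [1, 5], [8, 2]])

recall = recall_at_k(I_approx_toy, I_exact_toy)
print(f"{recall:.2f}")  # 0.67 (4 of 6 exact neighbors recovered)
```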