## Use Faiss to create an index

- Faiss is a dense retrieval method (nearest-neighbor search over embedding vectors)
<br>
- Elasticsearch is a sparse retrieval method (exact term-match retrieval)

Index structures correspond to various trade-offs with respect to:
<br> 
- search time
- search quality
- memory used per index vector
- training time
- need for external data for unsupervised training

In [None]:
import os 

In [1]:
import pandas as pd
import numpy as np
import torch
import faiss

# s3-based object store file system (s3 by amazon)
# import s3fs

from sentence_transformers import SentenceTransformer
# from vector_engine.utils import vector_search, id2details

In [2]:
# build vectors first
df = pd.read_csv('../chat_logs.csv')


In [3]:
retrieve_idx = [1,3,5,7,9]
retrieval = list(df[df.log_id==idx]['line_message'] for idx in retrieve_idx)

In [4]:
# id==1,3,5,7,9
retrieval

[0    one of these days soon I need to see what geas...
 Name: line_message, dtype: object,
 2    so its today or not this week
 Name: line_message, dtype: object,
 4    oh, guess its not today
 Name: line_message, dtype: object,
 6    I'm doing real work
 Name: line_message, dtype: object,
 8    sigh
 Name: line_message, dtype: object]

In [5]:
df.head()

Unnamed: 0,log_id,line_count,user,line_message,date_of_log
0,1,2.0,jcater,one of these days soon I need to see what geas...,2001-06-27
1,2,3.0,neilt,anyway i'm off to do real work for a couple of...,2001-06-27
2,3,4.0,neilt,so its today or not this week,2001-06-27
3,4,5.0,neilt,we are still waiting for a new forms/geas driver,2001-06-27
4,5,6.0,jcater,"oh, guess its not today",2001-06-27


In [6]:
large = df['line_message'][:1000]

In [7]:
small = df['line_message'][:390].to_list()
small

['one of these days soon I need to see what geas has to offer',
 "anyway i'm off to do real work for a couple of days",
 'so its today or not this week',
 'we are still waiting for a new forms/geas driver',
 'oh, guess its not today',
 'cant remember who volunteered for that',
 "I'm doing real work",
 'are the geas docs pretty decent?',
 'sigh',
 'um, what it is and how to use it :)',
 'its just a black box interface to objects',
 'dont know what the docs say',
 'but if you cant find what you are looking for, let me know and well create it',
 "jamest_: what's wrong?",
 'this damn bot',
 "what's it doing",
 'annoying me',
 'nothing more, nothing less',
 'and I should not even be messing with it',
 'no real work today?',
 'jamest_: if its related',
 'jamest_: ash keeps kicking my ssh session off',
 'if i am away for any time at all',
 "but my heart fills with pity whenever I think of poor masta, hudled over his keyboard, shuddering cause he's missed out on 12 hours of goat references in 

In [8]:
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(small)
sentence_embeddings.shape

(390, 768)

In [9]:
sentence_embeddings

array([[ 0.0162252 , -0.44534183,  3.056395  , ..., -0.8601831 ,
        -0.19799072, -0.14249888],
       [ 0.17303751, -0.5800499 ,  2.513252  , ..., -0.99075675,
        -0.24516724, -0.16175953],
       [-0.02583414, -0.5347643 ,  2.0662234 , ..., -0.8746693 ,
         1.1000949 ,  0.19394492],
       ...,
       [ 0.19004726, -0.00930948,  0.9470397 , ...,  0.10992029,
         0.8386087 ,  0.18612799],
       [ 0.38973647,  0.42429236,  2.1174006 , ...,  0.342076  ,
        -0.3939347 ,  0.08969497],
       [-0.12694564, -0.12167815,  1.0466301 , ..., -0.64359504,
        -0.35453606, -0.03942579]], dtype=float32)

In [10]:
# initialize index flat L2 (exhaustive search)
# no groupings or transformations needed, so no training
d = sentence_embeddings.shape[1]
index = faiss.IndexFlatL2(d)

In [11]:
d

768

In [12]:
# check whether index requires training
index.is_trained

True

In [13]:
index.add(sentence_embeddings)

In [14]:
index.ntotal

390

In [15]:
# search given a query and number of nearest neighbors to return k
k = 5
xq = model.encode(["work today"])

In [16]:
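%time
# note: a bare `%time` line times an empty statement, not the search below,
# so the timings printed here are not search timings;
# put `%%time` on the first line of the cell to time the whole cell
# do search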
D, I = index.search(xq,k)
print(I)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 42.9 µs
[[147 366   2 154   1]]


In [1]:
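# D holds the squared L2 distances to the k nearest neighbors
# (the NameError below most likely comes from running this cell in a fresh kernel, before the search cell)
print(D)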

NameError: name 'D' is not defined

In [15]:
# look up the messages at hard-coded positions 2, 1, 0, 6, 4
# (these are not the IDs returned by the search above)
large.iloc[[2,1,0,6,4]]

2                        so its today or not this week
1    anyway i'm off to do real work for a couple of...
0    one of these days soon I need to see what geas...
6                                  I'm doing real work
4                              oh, guess its not today
Name: line_message, dtype: object
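
To make explicit what an exhaustive L2 search does, here is a minimal NumPy sketch. It is not Faiss's implementation; the function name `flat_l2_search` and the toy vectors are made up for illustration. Like `index.search`, it returns distances `D` and indices `I` of shape `(n_queries, k)`, and the distances are squared L2, as `IndexFlatL2` reports them.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Brute-force k-nearest-neighbor search under squared L2 distance."""
    # pairwise squared distances between every query and every database vector,
    # shape (n_queries, n_database)
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # indices of the k smallest distances per query, closest first
    I = np.argsort(dist, axis=1, kind="stable")[:, :k]
    D = np.take_along_axis(dist, I, axis=1)
    return D, I

# toy data (hypothetical, not the chat-log embeddings)
xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
xq = np.array([[0.9, 0.1]], dtype=np.float32)

D, I = flat_l2_search(xb, xq, k=2)
print(I)  # nearest database rows for the single query
print(D)  # their squared L2 distances
```

Every query is compared against every stored vector, which is why a flat index needs no training and why its cost grows linearly with the number of vectors.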

## Approximate search: partitioning the index

- partitioning is a way to speed up search.
<br>
- the vectors are partitioned into Voronoi cells, each represented by a centroid.
<br>
- **a query is compared only against the vectors in the cell(s) nearest to it, which reduces the scope of our search**
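
The partitioning idea can be sketched in plain NumPy as follows. This is an illustration under assumptions, not Faiss internals: the helper names `assign_to_cells` and `build_inverted_lists` are hypothetical, and the centroids are picked by hand here, whereas an IVF index learns them with k-means during `train`.

```python
import numpy as np

def assign_to_cells(xb, centroids):
    """For each vector, return the index of its nearest centroid (its Voronoi cell)."""
    dist = ((xb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def build_inverted_lists(cell_ids, nlist):
    """Map each cell to the positions of the vectors that fall into it."""
    return {c: np.flatnonzero(cell_ids == c) for c in range(nlist)}

# hand-picked toy centroids and vectors (hypothetical data)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]], dtype=np.float32)
xb = np.array([[0.5, 0.2], [9.0, 9.5], [1.0, -1.0], [11.0, 10.0]], dtype=np.float32)

cell_ids = assign_to_cells(xb, centroids)
lists = build_inverted_lists(cell_ids, nlist=2)
print(cell_ids)
print(lists)
```

At query time only the lists of the closest cells are scanned, so the number of distance computations drops roughly in proportion to the number of cells.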

In [16]:
nlist = 10 # how many cells/partitions
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist) # clustering

In [17]:
index.is_trained

False

In [18]:
# train index on our data before adding any data to the index
# Faiss warns when given fewer than 39 * nlist = 390 training points
index.train(sentence_embeddings)
index.is_trained

True

In [19]:
# add data to index
index.add(sentence_embeddings)
index.ntotal

390

In [20]:
%time
D, I = index.search(xq, k)
print(I)

CPU times: user 4 µs, sys: 2 µs, total: 6 µs
Wall time: 11.9 µs
[[  2   1 369   6 341]]


## Improve search accuracy

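- improve search accuracy by increasing the number of probes (`nprobe`): how many nearby cells to search. The default is 1; setting it to `nlist` (10 here) searches every cell, so the results match the flat index above.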
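
Why more probes help can be seen in a small NumPy sketch. The function `ivf_search` and the toy data are hypothetical, and this is only a simplified picture of what an IVF index does. The true nearest neighbor sits just across a cell boundary, so probing one cell misses it and probing both finds it.

```python
import numpy as np

def ivf_search(xb, centroids, xq, k, nprobe):
    """Search only the nprobe cells whose centroids are closest to the query xq (1-D)."""
    # assign every database vector to its nearest centroid
    cell_ids = ((xb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    # rank cells by centroid distance to the query and keep the closest nprobe
    cell_dist = ((centroids - xq) ** 2).sum(axis=1)
    probed = np.argsort(cell_dist, kind="stable")[:nprobe]
    # candidates are the vectors stored in the probed cells
    candidates = np.flatnonzero(np.isin(cell_ids, probed))
    dist = ((xb[candidates] - xq) ** 2).sum(axis=1)
    order = np.argsort(dist, kind="stable")[:k]
    return candidates[order], dist[order]

# toy setup (hypothetical): two cells along the x axis
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
xb = np.array([[0.0, 0.0], [4.9, 0.0], [9.0, 0.0], [11.0, 0.0]])
xq = np.array([5.2, 0.0])  # closest vector is xb[1], which lives in the other cell

ids1, _ = ivf_search(xb, centroids, xq, k=1, nprobe=1)
ids2, _ = ivf_search(xb, centroids, xq, k=1, nprobe=2)
print(ids1, ids2)
```

With `nprobe` equal to the number of cells every vector becomes a candidate, so the search is exhaustive again and the speed advantage of partitioning is lost.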

In [21]:
index.nprobe = 10

In [22]:
%time
D, I = index.search(xq,k)
print(I)

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.39 µs
[[147 366   2 154   1]]


## Vector Reconstruction

## Index Quantization

- we use quantization to reduce the size of the stored vectors
<br>
- one way to compress our vectors is product quantization (PQ)
- steps: (1) split each vector into subvectors, (2) cluster each set of subvectors, (3) replace each subvector with the ID of its nearest centroid
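
The three steps can be sketched with NumPy and a plain Lloyd's k-means. This is a rough illustration, not how Faiss implements PQ: all function names (`kmeans`, `pq_train`, `pq_encode`, `pq_decode`) are hypothetical, the data is random, and the sizes are kept tiny so it runs quickly. Storing codes as `uint8` assumes at most 8 bits per centroid ID.

```python
import numpy as np

def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (k, dim)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_train(x, m, bits):
    """Steps 1 and 2: split into m subvectors, cluster each subspace into 2**bits centroids."""
    return [kmeans(sub, 2 ** bits) for sub in np.split(x, m, axis=1)]

def pq_encode(x, codebooks):
    """Step 3: replace each subvector with the ID of its nearest centroid."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Approximate reconstruction: look up each centroid ID and concatenate."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16)).astype(np.float32)  # toy vectors, d = 16

codebooks = pq_train(x, m=4, bits=3)   # 4 subvectors, 8 centroids each
codes = pq_encode(x, codebooks)        # shape (200, 4), one ID per subvector
approx = pq_decode(codes, codebooks)   # lossy reconstruction, shape (200, 16)
print(codes.shape, codes.dtype)
print(x.nbytes, codes.nbytes)          # raw float32 storage vs. PQ codes
```

For the sizes used in this notebook, the same arithmetic gives 768 × 4 = 3072 bytes per raw float32 vector versus 8 bytes of PQ codes per vector with `m = 8` and `bits = 8`, at the cost of approximate distances.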

In [24]:
m = 8 # number of subvectors, i.e. centroid IDs per compressed vector (d must be divisible by m)
bits = 8 # number of bits per centroid ID, i.e. 2**bits = 256 centroids per subvector (independent of m)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)

In [25]:
index.is_trained

False

In [26]:
index.train(sentence_embeddings)



In [None]:
index.add(sentence_embeddings)

In [None]:
index.nprobe=10

In [None]:
%time
D, I = index.search(xq,k)
print(I)