<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#PQLite-explained" data-toc-modified-id="PQLite-explained-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>PQLite explained</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#How-does-pqlite-works" data-toc-modified-id="How-does-pqlite-works-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>How does pqlite works</a></span></li></ul></li></ul></li><li><span><a href="#Understanding-pq.fit" data-toc-modified-id="Understanding-pq.fit-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Understanding <code>pq.fit</code></a></span></li><li><span><a href="#Understanding-pq.index" data-toc-modified-id="Understanding-pq.index-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Understanding <code>pq.index</code></a></span></li><li><span><a href="#Understanding-pq.search" data-toc-modified-id="Understanding-pq.search-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Understanding <code>pq.search</code></a></span><ul class="toc-item"><li><span><a href="#Searching-with-filtering" data-toc-modified-id="Searching-with-filtering-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Searching with filtering</a></span></li></ul></li></ul></div>

## PQLite explained

In [1]:
%load_ext autoreload

%autoreload 2

In [2]:
import pyximport
pyximport.install()

import pqlite
pqlite.__path__
import time

import jina
from jina.math.distance import cdist

import sklearn
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

import random
import numpy as np
from pqlite import PQLite


#### How does pqlite works

PQLite starts every search with a coarse step that narrows the search to a few cells.

When elements are added to PQLite, they are stored in cells.

Each cell holds roughly `n_datapoints / n_cells` elements.
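The cell assignment described above can be sketched in plain NumPy. This is only an illustration and not PQLite's implementation: the prototypes here are a random sample of the data (an assumption made for brevity), whereas a real coarse quantizer would learn them from the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

n_datapoints, dim, n_cells = 480, 128, 10
X = rng.normal(size=(n_datapoints, dim)).astype(np.float32)

# Hypothetical cell prototypes: a random sample of the data.
# A trained coarse quantizer would learn these instead.
prototypes = X[rng.choice(n_datapoints, size=n_cells, replace=False)]

# Squared Euclidean distance from every vector to every prototype.
dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)

# Each vector is stored in the cell of its closest prototype.
cell_ids = dists.argmin(axis=1)
cell_sizes = np.bincount(cell_ids, minlength=n_cells)

print("elements per cell:", cell_sizes)
print("average per cell :", n_datapoints / n_cells)
```

The cell sizes always sum to the number of stored vectors, but individual cells can deviate a lot from the average.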

In [3]:
Nq = 1
D = 128 
top_k = 100
n_cells = 10
n_subvectors = 32

In [4]:
!rm -rf ./data

## Understanding `pq.fit`

Internally, calling `pq.train(Xtr)` makes the `pq` object learn a quantizer, which is stored in `pq.pq_codec`.

Training does not add any data to `pq`; data is only added once `pq.index()` is called.

As a result, the cells in `pq` are still empty right after training.

Let us see what we have after training PQLite on a dataset generated with 500 examples.


In [8]:

Nt = 500

np.random.seed(1234)
Xtr, Xte = train_test_split(make_blobs(n_samples = Nt, n_features = D)[0].astype(np.float32), test_size=20)

# the column schema: (name:str, dtype:type, create_index: bool)
pq = PQLite(dim=D, 
            n_cells=n_cells,
            n_subvectors=n_subvectors, 
            columns=[('price',float), ('category', str)])

pq.train(Xtr)


2021-12-10 11:09:06.588 | INFO     | pqlite.index:__init__:89 - Initialize VQ codec (K=10)
2021-12-10 11:09:06.589 | INFO     | pqlite.index:__init__:99 - Initialize PQ codec (n_subvectors=32)
2021-12-10 11:09:06.602 | INFO     | pqlite.index:train:141 - Start training VQ codec (K=10) with 480 data...
2021-12-10 11:09:06.619 | INFO     | pqlite.index:train:147 - Start training PQ codec (n_subvectors=32) with 480 data...
2021-12-10 11:09:07.707 | INFO     | pqlite.index:train:152 - The pqlite is successfully trained!
2021-12-10 11:09:07.708 | INFO     | pqlite.index:dump_model:297 - Save the trained parameters to data/0a7dfc558abb6bc6cb48db43ccf64964


Note that the cells are empty because we have not added any data yet.

In [9]:
[c.size for c in pq.cell_tables] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [10]:
!du -h data

 16K	data/cell_1
 16K	data/cell_6
 16K	data/cell_8
 16K	data/cell_9
 16K	data/cell_7
 16K	data/cell_0
140K	data/0a7dfc558abb6bc6cb48db43ccf64964
 16K	data/cell_5
 16K	data/cell_2
 16K	data/cell_3
 16K	data/cell_4
300K	data


Information about hyperparams

In [11]:
pq.pq_codec.fit(Xtr)



In [12]:
print(f'pq.n_cells = {pq.n_cells }')
print(f'pq.n_subvectors = {pq.n_subvectors }')
print(f'pq.n_probe = {pq.n_probe}')

pq.n_cells = 10
pq.n_subvectors = 32
pq.n_probe = 16


In [13]:
pq.n_cells

10

Nevertheless, we can already use the quantizer stored in `pq.pq_codec` to encode data.

In [14]:
pq.encode(Xte[[0]])

array([[132,  40,  25,  30, 160,  92,   9,  47, 131, 255,  17, 178,  41,
         53,   5,  52, 255, 153, 163,  80, 203, 164, 231, 106,   8,  98,
        243,  35, 201,  25,  74, 222]], dtype=uint8)

This quantizer uses one codebook for each subspace of the product space.

Since `n_subvectors = 32`, there are 32 subspaces, each trained on the corresponding columns of the training data.

In this case `pq.d_subvector` is 4, so each of the slices `Xtr[:,0:4], Xtr[:,4:8], Xtr[:,8:12], ...` has its own codebook. This is consistent with `128/32 = 4`.
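As a small illustration of the slicing, the following NumPy sketch views each 128-dimensional vector as 32 consecutive slices of 4 values. The array `X` is synthetic and only stands in for the training data.

```python
import numpy as np

D, n_subvectors = 128, 32
d_subvector = D // n_subvectors  # 4

# Synthetic stand-in for the training data: 3 vectors of dimension D.
X = np.arange(3 * D, dtype=np.float32).reshape(3, D)

# View every vector as n_subvectors consecutive slices of d_subvector values.
slices = X.reshape(len(X), n_subvectors, d_subvector)

# Slice m covers the columns [m * d_subvector, (m + 1) * d_subvector).
assert np.array_equal(slices[:, 0, :], X[:, 0:4])
assert np.array_equal(slices[:, 1, :], X[:, 4:8])

print(slices.shape)  # (3, 32, 4)
```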

All codebooks are stored in `pq.pq_codec.codebooks`

In [15]:
pq.pq_codec.codebooks.shape

(32, 256, 4)

Each codebook is a matrix of shape `(K, d_subvector)`, where `K` is the number of prototypes per subspace.

In [16]:
pq.pq_codec.codebooks[0].shape

(256, 4)

##### Understanding the encoding

Once the codec in `pq.pq_codec` is fitted, we can encode the data.
Encoding takes a real-valued vector, splits it into slices of size `pq.d_subvector`, and assigns each slice to the closest prototype in the codebook of the corresponding subspace.

For example, we can take a slice of a vector and find the prototype it is matched to.


In [17]:
slice_0 = Xte[0][0:4]
slice_0

array([-5.496039 ,  6.284065 ,  1.7528343,  2.411145 ], dtype=float32)

In [18]:
dists_to_prototypes_slice_0 = np.sum((pq.pq_codec.codebooks[0] - slice_0)**2, axis=1)
print(dists_to_prototypes_slice_0.shape)
print(dists_to_prototypes_slice_0.argmin())

(256,)
132


Repeating this process for each slice will encode our vector in the PQ space.

This can be done using `pq.encode`

In [19]:
pq.encode(Xte[[0]])

array([[132,  40,  25,  30, 160,  92,   9,  47, 131, 255,  17, 178,  41,
         53,   5,  52, 255, 153, 163,  80, 203, 164, 231, 106,   8,  98,
        243,  35, 201,  25,  74, 222]], dtype=uint8)

Internally, this method delegates to the `.encode` method of the stored `pq_codec`.

In [20]:
pq.pq_codec.encode(Xte[[0]])

array([[132,  40,  25,  30, 160,  92,   9,  47, 131, 255,  17, 178,  41,
         53,   5,  52, 255, 153, 163,  80, 203, 164, 231, 106,   8,  98,
        243,  35, 201,  25,  74, 222]], dtype=uint8)
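To make the per-slice assignment concrete, here is a minimal NumPy sketch of product-quantization encoding. It is not the code behind `pq.encode`: the function `pq_encode` is a hypothetical helper, and the codebooks are random stand-ins with the same shape as `pq.pq_codec.codebooks`.

```python
import numpy as np

def pq_encode(X, codebooks):
    """Encode the rows of X given codebooks of shape (n_subvectors, K, d_subvector)."""
    n_subvectors, K, d_subvector = codebooks.shape
    assert X.shape[1] == n_subvectors * d_subvector
    assert K <= 256  # every code must fit in a uint8
    codes = np.empty((len(X), n_subvectors), dtype=np.uint8)
    for m in range(n_subvectors):
        sub = X[:, m * d_subvector:(m + 1) * d_subvector]
        # Squared distances between each slice and the K prototypes: shape (n, K).
        d = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(32, 256, 4)).astype(np.float32)  # random stand-in
X = rng.normal(size=(5, 128)).astype(np.float32)

codes = pq_encode(X, codebooks)
print(codes.shape, codes.dtype)  # (5, 32) uint8
```

Each 128-dimensional float vector is thus stored as 32 bytes, one prototype index per subspace.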

##### pq._vecs_storage

`pq` stores the quantized data in `pq._vecs_storage`. This is a list of `n_cells` elements, each a matrix holding the quantized data added to `pq`. Note that if no data has been added, these matrices contain only 0 values.


## Understanding `pq.index`

We have seen that `pq` has not stored a single example

In [21]:
[c.size for c in pq.cell_tables] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

To add examples, we first wrap them in a `DocumentArray` and then call `pq.index`.

In [22]:
from jina import DocumentArray, Document

In [23]:
np.random.choice((100,25,10)),np.random.choice(['comics','movies','audiobook'])

(25, 'movies')

In [190]:
CATEGORIES = ['comics','movies','audiobook']
da = DocumentArray([Document(id=f'{i}', 
                             embedding=Xtr[i], 
                             tags={
                                   'price': np.random.choice((5.,10.,25.,100.)),
                                   'category':np.random.choice(CATEGORIES),
                                 }) for i in range(len(Xtr))])
    

In [192]:
len(da)

480

In [196]:
da[0].tags['price']

5.0

Before indexing, `./data` already contains some folders holding the basic data structures used to store the indexed data.

In [26]:
!du -h data

 16K	data/cell_1
 16K	data/cell_6
 16K	data/cell_8
 16K	data/cell_9
 16K	data/cell_7
 16K	data/cell_0
140K	data/0a7dfc558abb6bc6cb48db43ccf64964
 16K	data/cell_5
 16K	data/cell_2
 16K	data/cell_3
 16K	data/cell_4
300K	data


In [27]:
pq.index(da)

2021-12-10 11:09:25.358 | DEBUG    | pqlite.container:insert:203 - => 480 new docs added


In [197]:
[c.size for c in pq.cell_tables] 

[62, 72, 76, 38, 41, 78, 29, 50, 6, 28]

After indexing, every cell directory has grown, reflecting the data stored in it.

In [29]:
!du -h data

 92K	data/cell_1
 48K	data/cell_6
 20K	data/cell_8
 48K	data/cell_9
 68K	data/cell_7
 76K	data/cell_0
140K	data/0a7dfc558abb6bc6cb48db43ccf64964
 88K	data/cell_5
 92K	data/cell_2
 60K	data/cell_3
 56K	data/cell_4
788K	data


If we sum the elements across cells, we will see that the total matches the length of the indexed `DocumentArray`.

In [30]:
np.sum([c.size for c in pq.cell_tables] ) == len(da)

True

The cell information is 

In [31]:
print(f'The number of cells is n_cells={pq.n_cells}')

print('\nCells can be accessed in pq.cell_tables')
print(f'\twe have len(pq.cell_tables)={len(pq.cell_tables)} cells')
print(f'\nWe have added len(Xtr)={len(Xtr)} elements to pq')

The number of cells is n_cells=10

Cells can be accessed in pq.cell_tables
	we have len(pq.cell_tables)=10 cells

We have added len(Xtr)=480 elements to pq


Note that `pq.cell_tables` is a list of  `CellTable` objects

In [32]:
pq.cell_tables[0]

<pqlite.storage.table.CellTable at 0x7fbaf95beeb0>


Each CellTable allows you to `insert`, `query` and `delete` vectors

We can inspect how many elements are in a cell using `.count()`

In [33]:
pq.cell_tables[0].count()

62

Not all `cell_tables` contain the same number of elements, because the vectors are not evenly distributed among the cell prototypes. Nevertheless, the sum of the elements across cells equals the number of added elements.

In [34]:
elements_per_cell = [pq_cell_table.count() for pq_cell_table in pq.cell_tables]
print('elements_per_cell =', elements_per_cell)
print('total number of elements added =', np.sum(elements_per_cell))
print('np.sum(elements_per_cell) == len(Xtr) is ',np.sum(elements_per_cell) == len(Xtr))

elements_per_cell = [62, 72, 76, 38, 41, 78, 29, 50, 6, 28]
total number of elements added = 480
np.sum(elements_per_cell) == len(Xtr) is  True


## Understanding `pq.search`


Internally, `pq.search` first computes the distance between each query and the prototypes that define the cells. The cells whose prototypes are closest to the query are then selected as the search space. The number of selected cells is `pq.n_probe`, a hyperparameter of the algorithm.

Since `pq.n_probe` is larger than `pq.n_cells` in this case, all the cells will be searched.

In [35]:
pq.n_probe, pq.n_cells

(16, 10)
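The coarse selection step can be sketched as follows. This is an illustration under stated assumptions, not PQLite's code: `select_cells` is a hypothetical helper, the prototypes are random, and clipping `n_probe` to the number of cells is a choice made for this sketch.

```python
import numpy as np

def select_cells(queries, prototypes, n_probe):
    """Return, for each query, the ids of its n_probe closest cells."""
    # Assumption of this sketch: never ask for more cells than exist.
    n_probe = min(n_probe, len(prototypes))
    dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # Sort the cells of each query by distance and keep the closest ones.
    return np.argsort(dists, axis=1)[:, :n_probe]

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(10, 128))  # one prototype per cell
queries = rng.normal(size=(5, 128))      # a batch of 5 queries

print(select_cells(queries, prototypes, n_probe=16).shape)  # (5, 10)
print(select_cells(queries, prototypes, n_probe=3).shape)   # (5, 3)
```

With 10 cells and `n_probe=16`, every query ends up visiting all 10 cells, which is the situation in this notebook.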

Note that `pq.search` can be called with a batch of vectors. It ends up calling `pq.search_cells` with the full batch of queries and an array of arrays that holds, at each position, the ids of the cells that best match the corresponding query. So if 5 queries are passed to `pq.search`, it passes to `self.search_cells` an array of size `(len(queries), max(pq.n_probe, pq.n_cells))`.

The `.search_cells` method iterates over the queries and computes the distance between each query and all retrieved elements in the activated cells.

For each query in the batch, the Asymmetric Distance Computation (ADC) is performed using `pq.pq_codec.precompute_adc`, which returns a table of shape `(pq.n_subvectors, pq.pq_codec.n_clusters)`. In our case, this is a matrix of shape `(32, 256)`.


In [96]:
query = Xtr[[10]]

In [97]:
dtable = pq.pq_codec.precompute_adc(query[0])
dtable.dtable.shape

(32, 256)

We can do this faster with a Cython function as follows

In [98]:
import pqlite.pq_bind
from pqlite.pq_bind import precompute_adc_table

In [99]:
d_subvector = int(query.shape[1]/pq.pq_codec.n_subvectors)

In [100]:
dt = precompute_adc_table(query[0], 
                          d_subvector,
                          pq.pq_codec.n_clusters,
                          pq.pq_codec.codebooks)

In [101]:
np.mean(np.asarray(dt) - dtable.dtable)

0.0

This table contains the distance between each subvector of the query and each prototype in the codebook of the corresponding subspace.
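A table of this kind can be computed in a few lines of NumPy. The sketch below assumes squared Euclidean distances and uses random codebooks as a stand-in; `adc_table` is a hypothetical helper and is not claimed to match `precompute_adc` numerically.

```python
import numpy as np

def adc_table(query, codebooks):
    """Distance from each query slice to each prototype: shape (n_subvectors, K)."""
    n_subvectors, K, d_subvector = codebooks.shape
    # Reshape the query so that slice m broadcasts against codebook m.
    q = query.reshape(n_subvectors, 1, d_subvector)
    return ((codebooks - q) ** 2).sum(axis=2)

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(32, 256, 4))  # random stand-in codebooks
query = rng.normal(size=128)

table = adc_table(query, codebooks)
print(table.shape)  # (32, 256)

# Entry (m, k) is the squared distance between slice m of the query
# and prototype k of codebook m.
m, k = 3, 17
assert np.isclose(table[m, k], np.sum((codebooks[m, k] - query[4 * m:4 * m + 4]) ** 2))
```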

The call chain is therefore `search` -> `search_cells` -> `precomputed_k = pq_codec.precompute_adc(query_k)` -> `ivfpq_topk`.

For each `query_k` we compute the ADC table, which is then used to compute the distance between the query and the database entries.

When a filter is provided, the computations are done only on a subset of the database: distances are computed between the query and the examples that come from the selected cells and satisfy the conditions of the filter.

```
self.ivfpq_topk(precomputed, cells=cell_idx,conditions=conditions,k=k )
```
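The effect of combining conditions with a top-k search can be illustrated with a boolean mask. This sketch is independent of `ivfpq_topk`: the helper `filtered_topk` and the toy distances and prices are made up for the example.

```python
import numpy as np

def filtered_topk(dists, mask, k):
    """Return indices and distances of the k closest candidates allowed by mask."""
    candidates = np.flatnonzero(mask)        # positions that pass the filter
    k = min(k, len(candidates))
    order = np.argsort(dists[candidates])[:k]
    idx = candidates[order]
    return idx, dists[idx]

# Toy data: distance to the query and a 'price' tag for five candidates.
dists = np.array([5.0, 1.0, 3.0, 2.0, 4.0])
prices = np.array([100.0, 5.0, 25.0, 10.0, 5.0])

idx, d = filtered_topk(dists, prices < 25, k=2)
print(idx, d)  # [1 3] [1. 2.]
```

Candidate 2 is closer than candidate 4, but it is excluded because its price does not satisfy the condition.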

In [102]:
pq.pq_codec.codebooks.shape

(32, 256, 4)

In [103]:
precomputed = pq.pq_codec.precompute_adc(query[0])
precomputed.dtable.shape

(32, 256)

Given a set of candidate datapoints from the database (whose PQ codes are already stored), we want the distances between the query and the candidates. This is done with `precomputed.adist(codes)`, which returns the distance between the query and each code in `codes`.

Recall that each subvector in the quantized space represents 4 values in the original space, and those 4 real values are represented by a single integer from 0 to `n_clusters - 1`.
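The lookup behind an asymmetric distance can be sketched as follows. This is an assumption-laden illustration rather than the implementation of `adist`: it uses squared Euclidean distances, random codebooks and codes, and a hypothetical helper `adc_distances`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subvectors, K, d_subvector = 32, 256, 4

codebooks = rng.normal(size=(n_subvectors, K, d_subvector))  # random stand-in
query = rng.normal(size=n_subvectors * d_subvector)
codes = rng.integers(0, K, size=(6, n_subvectors)).astype(np.uint8)

# ADC table: distance from every query slice to every prototype.
table = ((codebooks - query.reshape(n_subvectors, 1, d_subvector)) ** 2).sum(axis=2)

def adc_distances(table, codes):
    """Sum, over subspaces, the table entry selected by each code."""
    rows = np.arange(table.shape[0])
    return table[rows, codes].sum(axis=1)

approx = adc_distances(table, codes)

# Sanity check: the same value is obtained by decoding the codes back to
# vectors and measuring the squared distance to the (unquantized) query.
reconstructed = codebooks[np.arange(n_subvectors), codes].reshape(len(codes), -1)
exact_to_reconstruction = ((reconstructed - query) ** 2).sum(axis=1)
print(np.allclose(approx, exact_to_reconstruction))  # True
```

The point of the table is that each candidate costs only `n_subvectors` lookups and additions, regardless of the original dimension.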

In [104]:
pq.pq_codec.d_subvector, pq.pq_codec.n_clusters

(4, 256)

We can look at the indexed (quantized) data in `cell_k` using `pq._vec_indexes[cell_k]._data`

In [105]:
pq._vec_indexes[0]._data

array([[232,  25,  35, ..., 215, 113,  10],
       [ 31, 168, 236, ...,  39,  28,  38],
       [198, 218, 220, ..., 159, 212, 233],
       ...,
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=uint8)

This is an array of the PQ codes of the vectors that have been indexed.

Note that many rows are 0 because memory is preallocated to avoid frequent resizing.

In [106]:
pq._vec_indexes[0]._data.shape

(10240, 32)

We can see that the first row full of 0 values in a cell sits at the index equal to the number of items in that cell.

In [107]:
print(np.where(pq._vec_indexes[0]._data.sum(axis=1)==0)[0][0])
print(np.where(pq._vec_indexes[1]._data.sum(axis=1)==0)[0][0])
print([c.size for c in pq.cell_tables])

62
72
[62, 72, 76, 38, 41, 78, 29, 50, 6, 28]


Note that this check is not fully reliable, since a vector in the original feature space could be mapped to a PQ code made only of zeros, `[0, 0, ..., 0]`.
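A preallocated buffer with an explicit size counter avoids this ambiguity. The class below is a hypothetical sketch of that idea; it is not PQLite's storage class, and the growth policy (doubling the capacity) is an assumption.

```python
import numpy as np

class CodeBuffer:
    """Append-only store of uint8 codes with preallocated capacity."""

    def __init__(self, n_subvectors, initial_capacity=4):
        self._data = np.zeros((initial_capacity, n_subvectors), dtype=np.uint8)
        self.size = 0  # number of rows actually in use

    def append(self, codes):
        codes = np.atleast_2d(codes).astype(np.uint8)
        needed = self.size + len(codes)
        if needed > len(self._data):
            # Grow geometrically so that resizes stay infrequent.
            new_capacity = max(needed, 2 * len(self._data))
            grown = np.zeros((new_capacity, self._data.shape[1]), dtype=np.uint8)
            grown[:self.size] = self._data[:self.size]
            self._data = grown
        self._data[self.size:needed] = codes
        self.size = needed

    @property
    def codes(self):
        # Only the rows in use, excluding the zero padding.
        return self._data[:self.size]

buf = CodeBuffer(n_subvectors=4)
buf.append(np.zeros(4))       # a legitimate all-zero code
buf.append(np.ones((5, 4)))
print(buf.size, buf.codes.shape, buf._data.shape)  # 6 (6, 4) (8, 4)
```

Here the first all-zero row is at index 0 even though 6 codes are stored, so only `size` gives the right count.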

In [108]:
precomputed = pq.pq_codec.precompute_adc(query[0])

In [198]:
from jina import DocumentArray
da = DocumentArray([Document(embedding=query[0]),
                    Document(embedding=Xtr[0])])

We can search for matches of the documents in a `DocumentArray` using `.search`.

Note that this call does not return anything.

In [199]:
pq.search(da,limit=5)

Instead, the `DocumentArray` is updated in place: each of its documents receives its matches.

In [200]:
[m.id for m in da[0].matches]

['10', '207', '135', '272', '398']

We can manually look at the Euclidean distances with

In [201]:
[x.scores['euclidean'].value for x in da[0].matches]

[5.459360599517822,
 173.40541076660156,
 184.3723602294922,
 187.82205200195312,
 188.35154724121094]

In [202]:
[x.scores['euclidean'].value for x in da[1].matches]

[6.382546424865723,
 149.67112731933594,
 157.03163146972656,
 181.40065002441406,
 184.99688720703125]

The search method visits the selected cells, retrieves the elements of each cell, and computes their distances to the query. In each cell, the method `.search` is called.

Note that the distance 5.45 appears in one of the cells if we exhaustively search across cells for the closest matches.

In [203]:
[pq._vec_indexes[i].search(query[0], 1) for i in range(len(pq._vec_indexes))]

[(array([5587.41992188]), array([62])),
 (array([5587.41992188]), array([72])),
 (array([5587.41992188]), array([76])),
 (array([5587.41992188]), array([38])),
 (array([5587.41992188]), array([41])),
 (array([5.4593606]), array([2])),
 (array([188.35154724]), array([25])),
 (array([173.40541077]), array([23])),
 (array([5587.41992188]), array([6])),
 (array([5587.41992188]), array([28]))]

In [204]:
[pq._vec_indexes[i].search(Xtr[0], 1) for i in range(len(pq._vec_indexes))]

[(array([5676.70800781]), array([62])),
 (array([5676.70800781]), array([72])),
 (array([5676.70800781]), array([76])),
 (array([5676.70800781]), array([38])),
 (array([5676.70800781]), array([41])),
 (array([6.38254642]), array([0])),
 (array([202.37226868]), array([11])),
 (array([185.46856689]), array([19])),
 (array([5676.70800781]), array([6])),
 (array([5676.70800781]), array([28]))]

An important observation is that the closest elements in many cells are much farther away than the best elements in a few cells. This suggests there is no need to look into all cells at query time (for these examples).

### Searching with filtering

We can filter according to a set of tags of the documents

In [214]:
!rm -rf ./data

In [215]:

Nt = 500

np.random.seed(1234)
Xtr, Xte = train_test_split(make_blobs(n_samples = Nt, n_features = D)[0].astype(np.float32), test_size=20)

# the column schema: (name:str, dtype:type, create_index: bool)
pq = PQLite(dim=D, 
            n_cells=n_cells,
            n_subvectors=n_subvectors, 
            columns=[('price',float), ('category', str)])

pq.train(Xtr)

CATEGORIES = ['comics','movies','audiobook']
da = DocumentArray([Document(id=f'{i}', 
                             embedding=Xtr[i], 
                             tags={
                                   'price': np.random.choice((5.,10.,25.,100.)),
                                   'category':np.random.choice(CATEGORIES),
                                 }) for i in range(len(Xtr))])
    
pq.index(da)

2021-12-10 11:46:46.248 | INFO     | pqlite.index:__init__:89 - Initialize VQ codec (K=10)
2021-12-10 11:46:46.248 | INFO     | pqlite.index:__init__:99 - Initialize PQ codec (n_subvectors=32)
2021-12-10 11:46:46.264 | INFO     | pqlite.index:train:141 - Start training VQ codec (K=10) with 480 data...
2021-12-10 11:46:46.279 | INFO     | pqlite.index:train:147 - Start training PQ codec (n_subvectors=32) with 480 data...
2021-12-10 11:46:47.408 | INFO     | pqlite.index:train:152 - The pqlite is successfully trained!
2021-12-10 11:46:47.408 | INFO     | pqlite.index:dump_model:297 - Save the trained parameters to data/0a7dfc558abb6bc6cb48db43ccf64964
2021-12-10 11:46:47.494 | DEBUG    | pqlite.container:insert:203 - => 480 new docs added


In [252]:
query_da = DocumentArray([Document(embedding=Xtr[0], tags={'price':0.23})])

pq.search(query_da, filter={'price': {'$lt': 150}}, limit=5)

In [253]:
[x.scores['euclidean'].value for x in query_da[0].matches]

[6.262570858001709,
 148.74696350097656,
 161.92678833007812,
 178.64292907714844,
 181.718994140625]

In [254]:
[x.tags['price'] for x in query_da[0].matches]

[<jina.types.struct.StructView () at 140440233829376>,
 <jina.types.struct.StructView () at 140440233830288>,
 <jina.types.struct.StructView () at 140440233831728>,
 <jina.types.struct.StructView () at 140440233829664>,
 <jina.types.struct.StructView () at 140440233832112>]

In [219]:
query_da[0].matches[0].tags['price']

<jina.types.struct.StructView () at 140440209600720>

In [220]:
# why is this empty?
query_da[0].tags['price']

0.23

If no conditions are introduced...

In [223]:
pq.search(query_da,  limit=10)
[x.tags['price'] for x in query_da[0].matches]

[<jina.types.struct.StructView () at 140440212043712>,
 <jina.types.struct.StructView () at 140440212043856>,
 <jina.types.struct.StructView () at 140440212043760>,
 <jina.types.struct.StructView () at 140440212043952>,
 <jina.types.struct.StructView () at 140440212042128>,
 <jina.types.struct.StructView () at 140440212043472>,
 <jina.types.struct.StructView () at 140440212043904>,
 <jina.types.struct.StructView () at 140440209585056>,
 <jina.types.struct.StructView () at 140440209585584>,
 <jina.types.struct.StructView () at 140440209543072>]

In [224]:
query_da = DocumentArray([Document(embedding=Xtr[0], tags={'price':50})])
conditions = [('category', '=', 'movies')]
pq.search(query_da, conditions, limit=20)

In [225]:
([x.tags['category']=='movies' for x in query_da[0].matches])

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]

In [160]:
#da.embeddings.shape

In [161]:
#pq.cell_tables[0].count()

In [162]:
pq.cell_tables[0].__dict__

{'_conn_name': ':memory:',
 '_name': 'table_0',
 '_conn': <sqlite3.Connection at 0x7fba982fc030>,
 '_columns': ['price FLOAT', 'category TEXT'],
 '_indexed_keys': {'category', 'price'}}
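The attributes above suggest that each cell table is backed by an in-memory SQLite table with indexed `price` and `category` columns. As a rough illustration of how such a table can answer a filter, here is a standalone `sqlite3` sketch. The `doc_id` column, the index names, and the rows are hypothetical and are not taken from PQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns 'price FLOAT' and 'category TEXT' mirror the dump above;
# 'doc_id' is a hypothetical key added for this example.
conn.execute("CREATE TABLE table_0 (doc_id TEXT PRIMARY KEY, price FLOAT, category TEXT)")
conn.execute("CREATE INDEX idx_price ON table_0 (price)")
conn.execute("CREATE INDEX idx_category ON table_0 (category)")

rows = [
    ("0", 5.0, "comics"),
    ("1", 25.0, "movies"),
    ("2", 100.0, "movies"),
    ("3", 10.0, "audiobook"),
]
conn.executemany("INSERT INTO table_0 VALUES (?, ?, ?)", rows)

# Only documents that satisfy the conditions become search candidates.
cursor = conn.execute(
    "SELECT doc_id FROM table_0 WHERE category = ? AND price < ? ORDER BY doc_id",
    ("movies", 50.0),
)
matching = [row[0] for row in cursor]
print(matching)  # ['1']
```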