<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Asymmetric-Distance-computation" data-toc-modified-id="Asymmetric-Distance-computation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Asymmetric Distance computation</a></span><ul class="toc-item"><li><span><a href="#cython-version" data-toc-modified-id="cython-version-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>cython version</a></span></li></ul></li><li><span><a href="#Speed-up-precompute_adc-function" data-toc-modified-id="Speed-up-precompute_adc-function-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Speed up <code>precompute_adc</code> function</a></span></li></ul></div>

In [1]:
%load_ext cython

import Cython

In [2]:
Cython.__version__

'0.29.24'

In [3]:
import numpy as np

## Asymmetric Distance computation

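Currently the code computes the asymmetric distances with a single fancy-indexing expression:

```python
dists = np.sum(self.dtable[range(M), codes], axis=1)
```
With `codes` of shape `(N, M)` and `self.dtable` of shape `(M, n_clusters)`, this is equivalent to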

```python
dists = np.zeros((N, )).astype(np.float32)
for n in range(N):
    for m in range(M):
        dists[n] += self.dtable[m][codes[n][m]]

```
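
As a quick sanity check of this equivalence for a batch with `N > 1`, here is a minimal standalone sketch. The sizes and the names `dists_vec` and `dists_loop` are illustrative assumptions, not taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 8, 16  # illustrative sizes, not the notebook's
dtable = rng.random((M, K), dtype=np.float32)
codes = rng.integers(0, K, size=(N, M))

# Vectorized: the row index m (shape (M,)) broadcasts against codes (shape (N, M)),
# so element [n, m] of the result is dtable[m, codes[n, m]].
dists_vec = dtable[np.arange(M), codes].sum(axis=1)

# Explicit double loop over vectors and subvectors.
dists_loop = np.zeros(N, dtype=np.float32)
for n in range(N):
    for m in range(M):
        dists_loop[n] += dtable[m, codes[n, m]]

# Equal up to float32 rounding (the two versions sum in a different order).
assert np.allclose(dists_vec, dists_loop, rtol=1e-5)
```

The comparison uses `np.allclose` rather than exact equality because float32 sums depend on the order of accumulation.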

In [59]:
M = 32
np.random.seed(123)

n_cluster = 256
dtable = np.array(np.random.random((M, n_cluster)), 'float32')

np.random.seed(123)
# Note: randint([M]*M) draws each of the M codes from [0, M), not from [0, n_cluster)
pq_codes_batch = np.array([np.random.randint([M]*M)])
N, M = pq_codes_batch.shape


In [62]:
pq_codes_batch

array([[30, 13, 30,  2, 28,  2,  6, 17, 19, 10, 27, 25, 22,  1,  0, 17,
        30, 15,  9,  0, 14,  0, 15, 25, 19, 14, 29,  4,  0, 16,  4, 17]])

In [6]:
dtable[range(M),pq_codes_batch].sum(axis=1)

array([17.402649], dtype=float32)

In [7]:
def distances_loop_py(N, M, dtable):
    # Reads the codes from the global pq_codes_batch defined above
    dists = np.zeros(N, dtype=np.float32)
    for n in range(N):
        for m in range(M):
            dists[n] += dtable[m, pq_codes_batch[n, m]]
    return dists

In [8]:
distances_loop_py(1,M,dtable)

array([17.402647], dtype=float32)

The last digit differs from the vectorized result (17.402649) because the two versions add the float32 values in a different order, so the rounding differs.

In [9]:
dtable.shape

(32, 256)

In [51]:
%timeit distances_loop_py(1,M,dtable)

12.7 µs ± 484 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [52]:
%timeit dtable[range(M),pq_codes_batch].sum(axis=1)

6.81 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### cython version

In [11]:
%%cython
cimport numpy as cnp
cimport cython
             
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef distances_loop_cy(long M,float[:,:] dtable,long[:] pq_code):
    cdef float dist = 0
    cdef int m
    
    for m in range(M):
        dist += dtable[m, pq_code[m]]

    return dist

In [14]:
pq_code = pq_codes_batch.flatten()

In [13]:
%timeit distances_loop_cy(M, dtable, pq_code)

481 ns ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


We can make the function generic with Cython fused types (`integral`, `floating`), but the type dispatch adds overhead when it is called from Python.

In [39]:
%%cython
cimport numpy as cnp
cimport cython
from cython cimport integral, floating
             
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef floating distances_loop_cy2(integral M,
                                  floating[:,:] dtable,
                                  integral[:] pq_code):
    cdef floating dist = 0
    cdef integral m 
    
    for m in range(M):
        dist += dtable[m, pq_code[m]]

    return dist

In [36]:
%timeit distances_loop_cy2(M, dtable, pq_code)

1.84 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


## Speed up `precompute_adc` function

In [158]:

import sklearn
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

from pqlite.core.codec.pq import PQCodec

D = 128  # dimensionality / number of features
n_clusters = 256
M = 32  # number of subvectors used by the codec

top_k = 100
n_cells = 18
n_subvectors = 64  # note: differs from M = 32 passed to PQCodec below
Nt = 5000
d_subvector = D // M  # 4 features per subvector

np.random.seed(123)
Xtr, Xte = train_test_split(
    make_blobs(n_samples=Nt, n_features=D)[0].astype(np.float32),
    test_size=20,
)

pq = PQCodec(D, M, n_clusters)

In [127]:
pq.fit(Xtr)

In [129]:
codebooks = pq.codebooks
codebooks.shape

(32, 256, 4)

In [77]:
%timeit pq.precompute_adc(Xtr[4,:])

513 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [118]:
query = Xtr[4,:]
#distance_table_from_class = pq.precompute_adc(query)
#distance_table_from_class.dtable.shape

In [163]:
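%%cython
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef precompute_adc(float[:] query,
                     long n_subvectors,
                     long n_clusters,
                     long d_subvector,
                     float[:,:,:] codebooks):
    cdef Py_ssize_t m
    # Typed memoryviews do not support arithmetic, so view them as
    # ndarrays first (np.asarray does not copy the data).
    query_arr = np.asarray(query)
    codebooks_arr = np.asarray(codebooks)

    dtable = np.empty((n_subvectors, n_clusters), dtype=np.float32)
    for m in range(n_subvectors):
        query_sub = query_arr[m * d_subvector : (m + 1) * d_subvector]
        # Squared Euclidean distance from the query subvector to each centroid
        dtable[m, :] = np.linalg.norm(codebooks_arr[m] - query_sub, axis=1) ** 2
    return dtable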


In [166]:
#precompute_adc(query, n_subvectors, n_clusters, d_subvector, codebooks)

In [167]:
codebooks[2]

array([[-1.7292025 , -2.7353914 , -3.7812238 ,  4.992066  ],
       [ 0.94363374, -4.7408605 ,  0.47655922,  0.88397473],
       [-1.4082857 , -1.5802077 , -4.2403502 ,  5.134796  ],
       ...,
       [-3.4168549 , -6.32316   , -7.942589  , -3.663811  ],
       [-3.6777914 , -5.175822  , -8.004511  , -5.853117  ],
       [-0.8550133 , -4.7895107 , -1.6036859 ,  2.6832523 ]],
      dtype=float32)

In [168]:
m = 0
query_sub = query[ m * d_subvector : (m + 1) * d_subvector]
query_sub.shape

(4,)

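We can avoid the slices, and the temporary arrays that `codebooks[m] - query_sub` allocates, by indexing `query` and `codebooks` element by element. This uses less memory and improves speed.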
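
To illustrate the idea, here is a minimal pure-Python sketch of the slice-free loop, checked against the sliced version. The function names `precompute_adc_sliced` and `precompute_adc_indexed` and the small sizes are hypothetical, chosen only for this example. In plain Python the triple loop is slow; it is shown because it has the loop shape that Cython typed memoryviews can compile to C without temporaries.

```python
import numpy as np

def precompute_adc_sliced(query, codebooks):
    """Reference version: one slice and one temporary array per subvector."""
    M, K, d = codebooks.shape
    dtable = np.empty((M, K), dtype=np.float32)
    for m in range(M):
        query_sub = query[m * d:(m + 1) * d]
        dtable[m, :] = np.linalg.norm(codebooks[m] - query_sub, axis=1) ** 2
    return dtable

def precompute_adc_indexed(query, codebooks):
    """Same table, reading query[m * d + j] directly instead of slicing."""
    M, K, d = codebooks.shape
    dtable = np.empty((M, K), dtype=np.float32)
    for m in range(M):
        for k in range(K):
            acc = 0.0
            for j in range(d):
                diff = codebooks[m, k, j] - query[m * d + j]
                acc += diff * diff  # accumulate the squared distance
            dtable[m, k] = acc
    return dtable

rng = np.random.default_rng(0)
M, K, d = 4, 8, 3  # small illustrative sizes (hypothetical)
codebooks = rng.random((M, K, d), dtype=np.float32)
query = rng.random(M * d, dtype=np.float32)

table_sliced = precompute_adc_sliced(query, codebooks)
table_indexed = precompute_adc_indexed(query, codebooks)
assert np.allclose(table_sliced, table_indexed, rtol=1e-4, atol=1e-6)
```

The indexed version also skips the square root that `np.linalg.norm` takes before the result is squared again.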