# FAISS for UTF-8 Multihot Decoding

This notebook explores how to use FAISS and checks whether it is useful for decoding embedding vectors into code indices of the codebook through nearest-neighbour search (possibly with cosine similarity).
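On the cosine similarity question: for unit-length vectors, squared L2 distance and cosine similarity are tied by $\|a-b\|^2 = 2 - 2\cos(a,b)$, so once both sides are normalized the nearest neighbour under L2 is also the most similar under cosine. A minimal NumPy sketch with random stand-in data (array names and sizes are illustrative, not taken from this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.random((50, 16))     # stand-in codebook
queries = rng.random((5, 16))    # stand-in decoder outputs

def l2_normalize(x):
    """Scale every row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

codes_n = l2_normalize(codes)
queries_n = l2_normalize(queries)

# Cosine similarity and squared L2 distance between every query and every code
cos = queries_n @ codes_n.T
sq_l2 = ((queries_n[:, None, :] - codes_n[None, :, :]) ** 2).sum(axis=2)

identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cos)
same_ranking = np.array_equal(cos.argmax(axis=1), sq_l2.argmin(axis=1))
print(identity_holds, same_ranking)
```

In FAISS terms this would mean normalizing the vectors and using an inner-product index such as `faiss.IndexFlatIP`, whereas the rest of this notebook works with plain L2 on unnormalized vectors.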


In [1]:
import numpy as np
import itertools
from itertools import combinations
import faiss
from utf8_encoder import *

Loading faiss with AVX2 support.
Loading faiss.


Load the pre-computed utf-8 codes and codebook


In [2]:
import torch
import torch.nn as nn
from langmodels.utf8codec import *
from langmodels import utf8codec

In [3]:
# utf8codebook = load_obj("utf8-codes/multihot64short-embeds")
# idx2char = load_obj("utf8-codes/multihot64short-idx2char")
# char2idx = load_obj("utf8-codes/multihot64short-char2idx")

# utf8codebook = load_obj("utf8-codes/utf8_code_matrix_2seg.npy")
utf8codebook = np.load("utf8-codes/utf8_code_matrix_2seg.npy").astype("float32")
idx2char = load_obj("utf8-codes/num2txt_2seg.pkl")
char2idx = load_obj("utf8-codes/txt2num_2seg.pkl")

In [4]:
code_matrix, txt2code, code2txt, txt2num, num2txt = utf8codec._load_codebook()

In [5]:
utf8codebook.dtype

dtype('float32')

In [6]:
# d=64
d=324
indexl2 = faiss.IndexFlatL2(d)  # the 2-segment codebook is small, so an exact flat index held entirely in memory is enough
# faiss.index_factory()  # <- this function for index creation

From the documentation, an IVF (inverted file) index type might make decoding faster.

It might also be good to keep the index on the CPU: GPU operations may not be needed for training or testing, since decoding is only required when reconstructing text for the user to look at. Given that I have much more RAM than GPU RAM (a factor of 8), this would be a nice thing.

IVF indices need training

Product Quantization (PQ) might also work well here, because the codebook entries are binary while the network outputs are float32. After quantization, the most significant digits of the output vector might be enough to recognize similarity (just an idea, I don't know whether it holds).

This might need *OPQ rotation* and/or *RemapDimensionsTransform* (from the documentation) to improve the PQ coding (these transformations can be trained), plus *rq = faiss.IndexRefineFlat(q)* to refine the ranking after the initial pre-ranking.

From documentation:

    "The IVFFlat is often the fastest option, so the PQ variants are useful if memory is limited."



Another nice property is that we can retrieve the K nearest neighbours instead of only the closest one, which can bring several benefits for decoding.
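One such benefit is a confidence signal: the gap between the first and second distance tells how unambiguous a decode is. The sketch below is a brute-force NumPy stand-in for a flat L2 search (squared distances, as `IndexFlatL2` reports them) on a toy two-hot codebook. The codebook, noise level, and helper name are assumptions made for illustration:

```python
from itertools import combinations
import numpy as np

def knn_squared_l2(queries, codebook, k):
    """Brute-force k nearest neighbours; returns (distances, indices), closest first."""
    diff = queries[:, None, :] - codebook[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

# Toy codebook: every code has exactly two active bits, so all codes are distinct
dim = 12
pairs = list(combinations(range(dim), 2))
codebook = np.zeros((len(pairs), dim), dtype="float32")
for row, (a, b) in enumerate(pairs):
    codebook[row, [a, b]] = 1.0

# Stand-in for decoder outputs: the true code plus a little noise
rng = np.random.default_rng(0)
queries = codebook + rng.normal(0, 0.01, codebook.shape).astype("float32")

D, I = knn_squared_l2(queries, codebook, k=2)
margin = D[:, 1] - D[:, 0]       # large margin = confident decode
print(I[:3], margin.min())
```

A small margin could be used to flag characters whose decoding deserves a second look.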

I could try iterative decoding: several passes over the same input, feeding each result back in again and again. This might lead to some nice surprises (stabilizing the result when there are doubts or several candidate options?).
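A rough sketch of how such an iterative pass could be organized: snap the vector to its nearest code, push that code through a round trip, and stop once the chosen index no longer changes. The `roundtrip` callable is hypothetical and stands in for the encoder plus decoder. The toy codebook and offset below exist only to make the sketch runnable:

```python
import numpy as np

def nearest_code(vec, codebook):
    """Index of the codebook row closest to vec in squared L2 distance."""
    return int(((codebook - vec) ** 2).sum(axis=1).argmin())

def iterative_decode(vec, codebook, roundtrip, max_passes=5):
    """Repeat snap-to-code and round trip until the index is stable; returns the index history."""
    history = []
    for _ in range(max_passes):
        idx = nearest_code(vec, codebook)
        if history and idx == history[-1]:
            break                          # fixed point reached
        history.append(idx)
        vec = roundtrip(codebook[idx])     # hypothetical encode + decode pass
    return history

codebook = np.eye(8, dtype="float32")              # toy one-hot codebook
noisy_roundtrip = lambda code: code + 0.05         # stand-in for the autoencoder
start = codebook[3] * 0.9 + 0.02                   # slightly distorted version of code 3

history = iterative_decode(start, codebook, noisy_roundtrip)
print(history)
```

Whether the real autoencoder converges like this, or oscillates between candidates, is exactly what the experiment would have to show.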

In [7]:
utf8codebook.shape

(1984, 324)

In [8]:
indexl2.add(utf8codebook)

Now I encode the whole codebook and decode it with the overfitted, pre-trained autoencoder.

First, load the pre-trained models.

In [9]:
# utf8embedd = torch.load("trained_models/utf8Autoencoder_embedding_2segments_d48.pth")
# utf8decod = torch.load("trained_models/utf8Autoencoder_decoder_2segments_d48.pth")

utf8embedd = torch.load("trained_models/utf8Autoencoder_embedding_2segments_d64.pth")
utf8decod = torch.load("trained_models/utf8Autoencoder_decoder_2segments_d64.pth")

Now I get the dataset and the codes for each element, so I can test every code and see the errors.

In [10]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [11]:
utf8embedd = utf8embedd.to(device)
utf8decod = utf8decod.to(device)

In [12]:
# utf8embedd

In [13]:
# utf8decod

In [14]:
# prepare data to encode and decode
chars, idxs = list(char2idx.keys()), list(char2idx.values())

In [15]:
# list(zip(chars,idxs))

In [16]:
indices = np.stack(idxs)
tindices = torch.from_numpy(indices).to(device)

In [17]:
# list(indices)[-100:]

In [18]:
tindices[-30:]

tensor([1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,
        1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
        1981, 1982, 1983,    2,    3,   15], device='cuda:0')

In [19]:
tindices = tindices.unsqueeze(0)
tindices.shape

torch.Size([1, 1987])

In [20]:
enc_multihot, enc32 = utf8embedd(tindices)

In [21]:
enc_multihot.shape, enc32.shape


(torch.Size([1, 1987, 324]), torch.Size([1, 1987, 64]))

In [22]:
dec = utf8decod(enc32)

In [23]:
type(dec)

torch.Tensor

In [24]:
bdec = dec > 0.

In [25]:
bdec

tensor([[[True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         ...,
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True]]], device='cuda:0')

In [26]:
cmp = bdec.squeeze() == enc_multihot.squeeze()

In [27]:
cmp.shape

torch.Size([1987, 324])

In [28]:
npdec = dec.squeeze().cpu().detach().numpy()

In [29]:
npdec.dtype

dtype('float32')

In [30]:
npdec.shape

(1987, 324)

In [31]:
k=2
D,I = indexl2.search(npdec, k)
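Besides asking for the k closest codes, a flat index can also return every code within a distance threshold through `range_search(x, thresh)`, which gives back a triple `(lims, D, I)` where the hits for query `i` sit in `I[lims[i]:lims[i+1]]`. The NumPy function below is only a stand-in that mimics that layout on toy data, under the assumption that the threshold applies to squared L2 distances. It is not the FAISS implementation:

```python
import numpy as np

def range_search_flat(queries, codebook, thresh):
    """NumPy stand-in mimicking the (lims, D, I) layout of a range search."""
    dist = ((queries[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    lims, D, I = [0], [], []
    for row in dist:
        hits = np.flatnonzero(row < thresh)    # codes below the threshold
        D.extend(row[hits])
        I.extend(hits)
        lims.append(lims[-1] + len(hits))      # running offset into D and I
    return np.array(lims), np.array(D, dtype="float32"), np.array(I, dtype="int64")

codebook = np.eye(4, dtype="float32")
queries = np.array([[0.9, 0.1, 0.0, 0.0],      # close to code 0 only
                    [0.5, 0.5, 0.0, 0.0]],     # equally close to codes 0 and 1
                   dtype="float32")

# With the real index this would be: lims, D, I = indexl2.range_search(x, thresh)
lims, D, I = range_search_flat(queries, codebook, thresh=0.6)
per_query = [I[lims[i]:lims[i + 1]].tolist() for i in range(len(queries))]
print(lims, per_query)
```

Unlike a k-nearest search, the number of hits varies per query, so an ambiguous vector shows up as a query with more than one hit.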

In [32]:
# search by distance threshold (range) instead of the K closest
indexl2.range_search?

Signature: indexl2.range_search(x, thresh)
Docstring: <no docstring>
File:      ~/venv3/lib/python3.7/site-packages/faiss-1.6.0-py3.7.egg/faiss/__init__.py
Type:      method


In [33]:
faiss.MatrixStats(npdec).comments

'analyzing 1987 vectors of size 324\nno NaN or Infs in data\n1984 vectors are distinct (99.85%)\nvector 15 has 2 copies\nrange of L2 norms=[1.41507, 1.73204] (0 null vectors)\nmatrix contains 1.08 % 0 entries\nno constant dimensions\nno dimension has a too large mean\nstddevs per dimension are in [1.01375e-10 0.248155]\n101 dimensions have negligible stddev wrt. the largest dimension, they could be ignored'
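The report counts 1984 distinct vectors out of 1987, which matches the 1984 rows of the codebook, and flags 101 dimensions with negligible standard deviation. A small NumPy sketch of how both checks could be reproduced by hand, on synthetic data. The `1e-3` cut-off is an assumption, since the threshold `MatrixStats` uses is not shown here:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 8)).astype("float32")    # stand-in for the decoder outputs
x[:, 5] = 0.25                                # a constant dimension carries no information
x[42] = x[7]                                  # plant one duplicated row

# Distinct rows and how often each one occurs
unique_rows, counts = np.unique(x, axis=0, return_counts=True)
n_distinct = len(unique_rows)

# Dimensions whose spread is negligible compared with the widest one
std = x.std(axis=0)
negligible = np.flatnonzero(std < 1e-3 * std.max())   # cut-off is an assumption
x_reduced = np.delete(x, negligible, axis=1)
print(n_distinct, negligible, x_reduced.shape)
```

Dropping such dimensions before indexing would shrink the index, at the cost of keeping a mapping back to the original 324 positions.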

In [34]:
type(I), I.dtype

(numpy.ndarray, dtype('int64'))

In [35]:
I.shape

(1987, 2)

In [36]:
I0 = I[:,0] 
errvals = np.roll(I0,1) - I0

In [37]:
errs = [i for i in errvals if abs(i) != 1]

In [38]:
len(errs)

3
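These three flags are not necessarily decoding errors. Assuming the decoded indices follow the same layout as `tindices` printed earlier (all codes in order, then 2, 3 and 15 again), the `np.roll` check trips on the wrap-around at position 0, on the jump from 1983 to 2, and on the jump from 3 to 15. Comparing against the expected indices directly avoids that. The arrays below are synthetic stand-ins built under that assumption:

```python
import numpy as np

# Assumed layout: every code in order, then the special tokens 2, 3 and 15 repeated
expected = np.concatenate([np.arange(1984), [2, 3, 15]])
decoded = expected.copy()        # stand-in for I[:, 0] after a perfect decode

# The roll-based check: flags any position whose predecessor is not off by one
roll_flags = np.flatnonzero(np.abs(np.roll(decoded, 1) - decoded) != 1)

# Direct check: positions where the decoded index differs from the expected one
mismatches = np.flatnonzero(decoded != expected)
print(roll_flags, mismatches)
```

With the real data, `expected` would be the `indices` array that was fed to the encoder.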

In [39]:
I0.shape

(1987,)

In [40]:
I0[-3:]

array([ 2,  3, 15])

In [41]:
for i in [2,3,15]:
    print(idx2char[i])

<start>
<end>
unk


In [42]:
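res = []
for i in range(len(idx2char)):
    original_char = idx2char[i]
    decoded_char = idx2char[I0[i]]  # separate name, so the dimension `d` is not overwritten
    if original_char != decoded_char:
        res.append((i, original_char, decoded_char))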

In [43]:
len(res)

0

In [44]:
res

[]


### For encoding dimension 32

There are 102 errors on the first choice when decoding. Anything above zero is a problem, so I have to go back and train a new encoder, maybe with more dimensions, to see what happens.

Anyway, I still have to find a way to reach ZERO errors with the overfitted model.

### For encoding dimension 48

For dimension 48 there are only 7 errors:

    [(17, '\x11', '>'),
     (24, '\x18', '>'),
     (33, '!', '>'),
     (35, '#', '>'),
     (41, ')', '\\'),
     (65, 'A', '>'),
     (90, 'Z', '>')]

### For encoding dimension 64

There are no errors at all at this encoding dimensionality, with only 8000 batches (125 epochs, 22 minutes of training)  ;)  YAY

In [45]:
I,D

(array([[  0,  30],
        [  1,  66],
        [  2,  77],
        ...,
        [  2,  77],
        [  3,  56],
        [ 15, 102]]), array([[0.01601672, 2.007584  ],
        [0.01579618, 2.0060966 ],
        [0.01610327, 2.003381  ],
        ...,
        [0.01610327, 2.003381  ],
        [0.01605701, 2.001755  ],
        [0.01596403, 2.0020938 ]], dtype=float32))

In [46]:
torch.sum(bdec), torch.sum(~cmp), torch.sum(enc_multihot), np.sum(utf8codebook)

(tensor(636845, device='cuda:0'),
 tensor(631015, device='cuda:0'),
 tensor(5830., device='cuda:0'),
 5824.0)

For the encoding of dimension 32, `sum(cmp) == 102` in the comparison, and the different sums for the codebook and `enc_multihot` seem to show some mismatch, which I still have to check.
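Two observations on the sums above may help with that check. First, `bdec` is True for 636845 of the 1987 × 324 = 643788 entries while the targets only contain 5830 ones. One possible explanation, not verified against the decoder, is that its outputs lie strictly between 0 and 1, so `dec > 0.` passes almost everything and a 0.5 threshold would be the natural choice. Second, 5830 − 5824 = 6, which would fit the three repeated special-token rows if each carries two active bits. The sketch below illustrates the threshold effect on synthetic data only:

```python
import numpy as np

rng = np.random.default_rng(0)
target = (rng.random((20, 12)) < 0.2).astype("float32")   # synthetic multihot codes

# Stand-in for a decoder whose outputs stay strictly between 0 and 1:
# close to 1 where the target bit is set, close to 0 elsewhere.
noise = rng.normal(0, 0.01, target.shape)
dec = np.clip(target * 0.9 + 0.05 + noise, 1e-6, 1 - 1e-6)

bits_at_zero = dec > 0.0      # nearly everything passes
bits_at_half = dec > 0.5      # only the active bits pass

errors_at_zero = int((bits_at_zero != target.astype(bool)).sum())
errors_at_half = int((bits_at_half != target.astype(bool)).sum())
print(errors_at_zero, errors_at_half)
```

This does not affect the FAISS decoding itself, which searches on the raw float outputs in `npdec` rather than on the thresholded bits.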

In [47]:
bdec.shape

torch.Size([1, 1987, 324])

In [48]:
enc_multihot.shape

torch.Size([1, 1987, 324])

In [49]:
enc_multihot.shape, enc32.shape, dec.shape

(torch.Size([1, 1987, 324]),
 torch.Size([1, 1987, 64]),
 torch.Size([1, 1987, 324]))

In [50]:
# what I should do is create several indices and test the decoding in all of those
# for performance to see which one is better in precision, memory usage and speed

Now that encoding and decoding work, I save the new encoding (the result of passing the entire domain through the overfitted encoder) to disk. This is now the new "overfitted" embedding of dimension 64.

In [51]:
embedding_matrix = enc32.detach().cpu().numpy()  # shape (1, 1987, 64): keeps the batch axis and the 3 repeated special tokens

In [52]:
np.save("utf8-codes/utf8_codebook_overfit_matrix_2seg_dim64.npy", embedding_matrix)

In [53]:
# # using the GPU
# res = faiss.StandardGpuResources()  # use a single GPU
# gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, indexl2)


In [54]:
# print(gpu_index_flat.ntotal)


The index seems quite big here for the encoding of the complete 4-segment UTF-8 codebook (over 1M data points).

GPU memory usage with the index on the GPU: 3028 MiB

GPU memory usage without the index: 1573 MiB

So the IndexFlatL2 over the entire UTF-8 codebook, coded as vectors of dimension 64, takes 3028 - 1573 = 1455 MiB.
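As a sanity check on that figure, the raw vectors of a flat float32 index take n · d · 4 bytes. The sketch below assumes roughly one million vectors of dimension 64 (the exact count is not stated here). Under that assumption the raw data is far smaller than the measured difference, so part of the 1455 MiB would have to come from elsewhere, for example GPU-side working memory, which I have not verified:

```python
def flat_index_mib(n_vectors, dim, bytes_per_value=4):
    """Raw storage of a flat float32 index: one dense n x d matrix, in MiB."""
    return n_vectors * dim * bytes_per_value / 2**20

measured_mib = 3028 - 1573                    # GPU readings with and without the index
raw_mib = flat_index_mib(1_000_000, 64)       # assumes about 1M vectors of dimension 64

print(measured_mib, round(raw_mib, 2))
```

Plugging in the real vector count and the dimension actually stored in the index would show how much of the gap remains.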


In [55]:
# 3028 - 1573