# FAISS for UTF-8 Multihot Decoding

This notebook explores how to use faiss and checks whether it is useful for decoding embedding vectors into code indices of the codebook (possibly via cosine similarity).
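On the cosine-similarity question: for unit-length vectors, ranking by cosine similarity and ranking by L2 distance give the same nearest neighbour, because the squared distance equals 2 - 2·cos. The sketch below checks this with plain NumPy on a made-up random binary codebook (the data, sizes and the `l2_normalize` helper are assumptions for illustration, not the notebook's codebook). In faiss, the rough counterpart would be normalizing the vectors and using an inner-product index, which is not tried here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the codebook: 50 random binary codes of dim 32.
codebook = (rng.random((50, 32)) < 0.5).astype(np.float64)
codebook[:, 0] = 1.0  # make sure no row is all zeros, so every norm is > 0
# Hypothetical decoder outputs: the codes plus a little noise.
queries = codebook + 0.05 * rng.standard_normal(codebook.shape)


def l2_normalize(x):
    """Scale every row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


cb_n = l2_normalize(codebook)
q_n = l2_normalize(queries)

# Cosine similarity is the inner product of unit vectors.
cos_best = np.argmax(q_n @ cb_n.T, axis=1)

# Squared L2 distance between unit vectors equals 2 - 2 * cosine.
d2 = ((q_n[:, None, :] - cb_n[None, :, :]) ** 2).sum(axis=-1)
l2_best = np.argmin(d2, axis=1)

print("same ranking winner for every query:", np.array_equal(cos_best, l2_best))
```

Without normalization the two rankings can differ, since multihot codes with more active bits have larger norms.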


In [2]:
import numpy as np
import itertools
from itertools import combinations
import faiss
from utf8_encoder import *

Loading faiss with AVX2 support.
Loading faiss.


Load the pre-computed utf-8 codes and codebook


In [20]:
import torch
import torch.nn as nn
from langmodels.utf8codec import *
from langmodels import utf8codec

In [6]:
# utf8codebook = load_obj("utf8-codes/multihot64short-embeds")
# idx2char = load_obj("utf8-codes/multihot64short-idx2char")
# char2idx = load_obj("utf8-codes/multihot64short-char2idx")

# utf8codebook = load_obj("utf8-codes/utf8_code_matrix_2seg.npy")
utf8codebook = np.load("utf8-codes/utf8_code_matrix_2seg.npy").astype("float32")
idx2char = load_obj("utf8-codes/num2txt_2seg.pkl")
char2idx = load_obj("utf8-codes/txt2num_2seg.pkl")

In [21]:
code_matrix, txt2code, code2txt, txt2num, num2txt = utf8codec._load_codebook()

In [7]:
utf8codebook.dtype

dtype('float32')

In [12]:
# d=64
d=324
indexl2 = faiss.IndexFlatL2(d)
# faiss.index_factory()  # <- this function for index creation

From the documentation, an IVF (inverted file) index type might be faster for decoding.

It might also be good to keep the index on the CPU, as GPU operations might not be needed for training or testing: the decoding can be done only when a text reconstruction is needed for user visualization. Since I have much more RAM than GPU RAM (a factor of 8), this would be a nice thing.

IVF indices need training before vectors can be added.
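To make the training requirement concrete, here is a toy NumPy illustration of the inverted-file idea: "training" learns coarse centroids with a few k-means steps, every codebook row is filed under its closest centroid, and a query scans only the `nprobe` closest lists. This is a sketch of the concept and not faiss's implementation; the function names, the random codebook and the parameter values are all made up for the example.

```python
import numpy as np


def sqdist(a, b):
    """Pairwise squared L2 distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


def train_ivf(xb, nlist, n_iter=10, seed=0):
    """'Training': k-means centroids, then one inverted list per centroid."""
    rng = np.random.default_rng(seed)
    centroids = xb[rng.choice(len(xb), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(sqdist(xb, centroids), axis=1)
        for c in range(nlist):
            members = xb[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(sqdist(xb, centroids), axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists


def search_ivf(xq, xb, centroids, lists, nprobe):
    """Nearest neighbour, scanning only the nprobe closest lists per query."""
    probe = np.argsort(sqdist(xq, centroids), axis=1)[:, :nprobe]
    result = np.full(len(xq), -1, dtype=np.int64)  # -1 means "nothing scanned"
    for i, q in enumerate(xq):
        cand = np.concatenate([lists[c] for c in probe[i]])
        if cand.size > 0:
            result[i] = cand[np.argmin(((xb[cand] - q) ** 2).sum(axis=-1))]
    return result


rng = np.random.default_rng(1)
xb = (rng.random((200, 32)) < 0.5).astype(np.float64)  # toy binary codebook
xq = xb + 0.05 * rng.standard_normal(xb.shape)         # toy decoder outputs

nlist = 8
centroids, lists = train_ivf(xb, nlist)
exact = np.argmin(sqdist(xq, xb), axis=1)                   # brute force
full = search_ivf(xq, xb, centroids, lists, nprobe=nlist)   # scans everything
fast = search_ivf(xq, xb, centroids, lists, nprobe=2)       # scans 2 of 8 lists
print("recall with nprobe=2:", np.mean(fast == exact))
```

Probing all lists reproduces the brute-force result; probing fewer trades accuracy for speed, which matters here because the goal is zero decoding errors.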

Product Quantization (PQ) could also be a good fit, because the codebook entries are binary while the outputs of the network are float32. If quantized, the most significant digits of the output vector might be enough to recognize similarity (just an idea, I don't know whether this holds).

It might need *OPQ rotation* and/or *RemapDimensionsTransform* (from the documentation) to improve the PQ coding (these transformations can be trained), and *rq = faiss.IndexRefineFlat(q)* to refine the ranking after the initial pre-ranking.

From documentation:

    "The IVFFlat is often the fastest option, so the PQ variants are useful if memory is limited."



Another nice property is that we can retrieve the k nearest neighbours instead of only the closest one, which can give several benefits for decoding.
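One possible benefit, sketched below with brute-force NumPy rather than faiss: the gap between the first and second neighbour distances can serve as a confidence signal, so rows with a small gap can be flagged for a closer look. The `knn_l2` helper, the random codebook and the 0.5 threshold are assumptions chosen for illustration only.

```python
import numpy as np


def knn_l2(queries, codebook, k):
    """Brute-force k nearest neighbours by squared L2 distance."""
    d2 = ((queries[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]            # k closest rows per query
    dist = np.take_along_axis(d2, idx, axis=1)     # their distances, ascending
    return dist, idx


rng = np.random.default_rng(0)
codebook = (rng.random((100, 32)) < 0.5).astype(np.float64)  # toy codes
queries = codebook[:10] + 0.05 * rng.standard_normal((10, 32))

D, I = knn_l2(queries, codebook, k=2)

# A large margin means the first choice is clearly better than the runner-up.
margin = D[:, 1] - D[:, 0]
ambiguous = np.flatnonzero(margin < 0.5)  # 0.5 is an arbitrary threshold
print("first choices:", I[:, 0])
print("ambiguous rows:", ambiguous)
```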

I could try iterative decoding: several passes over the same input, feeding each result back in again and again. This might lead to some nice surprises (stabilizing the result when there are doubts or several candidate options?).
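A minimal sketch of what such an iterative loop could look like, assuming the decoded code is pushed through the autoencoder again until the chosen index stops changing. `noisy_roundtrip` is a hypothetical stand-in for the real encode/decode pass, and the tiny 2-of-6 multihot codebook is made up for the example. A pass limit is included because nothing guarantees convergence.

```python
import numpy as np
from itertools import combinations

# Toy multihot codebook: every code has exactly 2 of 6 bits set (15 codes).
codebook = np.zeros((15, 6))
for row, hot in enumerate(combinations(range(6), 2)):
    codebook[row, list(hot)] = 1.0


def nearest(vec, codebook):
    """Index of the codebook row closest to vec (squared L2)."""
    return int(np.argmin(((codebook - vec) ** 2).sum(axis=1)))


def iterative_decode(vec, roundtrip, codebook, max_passes=5):
    """Decode, re-run the decoded code through `roundtrip`, and repeat
    until the index is stable or max_passes is reached."""
    previous = None
    for n_pass in range(1, max_passes + 1):
        idx = nearest(vec, codebook)
        if idx == previous:
            return idx, n_pass
        previous = idx
        vec = roundtrip(codebook[idx])
    return previous, max_passes


def noisy_roundtrip(code):
    # Hypothetical stand-in for the autoencoder: slightly squashed output.
    return 0.9 * code + 0.05


idx, passes = iterative_decode(codebook[7] + 0.1, noisy_roundtrip, codebook)
print(idx, passes)
```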

In [15]:
utf8codebook.shape

(1984, 324)

In [14]:
indexl2.add(utf8codebook)

Now I encode the whole codebook and decode it with the overfitted, pre-trained autoencoder.

First load the pre-trained models

In [18]:
utf8embedd = torch.load("trained_models/utf8Autoencoder_embedding_2segments.pth")
utf8decod = torch.load("trained_models/utf8Autoencoder_decoder_2segments.pth")

Now I get the dataset and the code for each element, to test all the codes and see the errors.

In [23]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [25]:
utf8embedd = utf8embedd.to(device)
utf8decod = utf8decod.to(device)

In [95]:
# utf8embedd

In [96]:
# utf8decod

In [34]:
# prepare data to encode and decode
chars, idxs = list(char2idx.keys()), list(char2idx.values())

In [97]:
# list(zip(chars,idxs))

In [37]:
indices = np.stack(idxs)
tindices = torch.from_numpy(indices).to(device)

In [98]:
# list(indices)[-100:]

In [42]:
tindices[-30:]

tensor([1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,
        1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
        1981, 1982, 1983,    2,    3,   15], device='cuda:0')

In [44]:
tindices = tindices.unsqueeze(0)
tindices.shape

torch.Size([1, 1987])

In [49]:
enc_multihot, enc32 = utf8embedd(tindices)

In [99]:
# enc

In [50]:
dec = utf8decod(enc32)

In [100]:
# dec

In [75]:
bdec = dec > 0.

In [76]:
bdec

tensor([[[ True, False, False,  ..., False, False, False],
         [ True, False, False,  ..., False, False, False],
         [ True, False, False,  ..., False, False, False],
         ...,
         [ True, False, False,  ..., False, False, False],
         [ True, False, False,  ..., False, False, False],
         [ True, False, False,  ..., False, False, False]]], device='cuda:0')

In [83]:
cmp = bdec.squeeze() == enc_multihot.squeeze()

In [91]:
cmp.shape

torch.Size([1987, 324])

In [109]:
npdec = dec.cpu().detach().numpy()

In [111]:
npdec.dtype

dtype('float32')

In [114]:
k=2
D,I = indexl2.search(npdec[0], k)

In [116]:
I.shape

(1987, 2)

In [135]:
I0 = I[:,0] 
errvals = np.roll(I0,1) - I0

In [136]:
errs = [i for i in errvals if abs(i) != 1]

In [137]:
len(errs)

206

In [145]:
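res = []
for i in range(len(idx2char.keys())):
    o = idx2char[i]          # original character
    dchar = idx2char[I0[i]]  # decoded character (first neighbour); avoids shadowing the dimension d
    if o != dchar:
        res.append((i, o, dchar))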

In [147]:
len(res)

102

In [148]:
res

[(4, '\x04', 'E'),
 (11, '\x0b', '\x0e'),
 (22, '\x16', 'x'),
 (32, ' ', '+'),
 (38, '&', 'z'),
 (48, '0', 'z'),
 (50, '2', 'x'),
 (65, 'A', 'x'),
 (67, 'C', 'U'),
 (70, 'F', '+'),
 (81, 'Q', '+'),
 (82, 'R', 'x'),
 (98, 'b', 'E'),
 (100, 'd', '+'),
 (121, 'y', 'z'),
 (165, '¥', '\x86'),
 (171, '«', '\x86'),
 (180, '´', '\x86'),
 (229, 'å', 'î'),
 (235, 'ë', 'î'),
 (244, 'ô', 'î'),
 (293, 'ĥ', 'ċ'),
 (299, 'ī', 'ċ'),
 (308, 'Ĵ', 'ċ'),
 (357, 'ť', 'Œ'),
 (363, 'ū', 'Œ'),
 (372, 'Ŵ', 'ł'),
 (421, 'ƥ', 'Ɓ'),
 (427, 'ƫ', 'Ɓ'),
 (436, 'ƴ', 'Ɓ'),
 (485, 'ǥ', 'ǋ'),
 (491, 'ǫ', 'ǋ'),
 (500, 'Ǵ', 'ǋ'),
 (549, 'ȥ', 'ȋ'),
 (555, 'ȫ', 'ȋ'),
 (564, 'ȴ', 'ȋ'),
 (613, 'ɥ', 'ɂ'),
 (619, 'ɫ', 'ɂ'),
 (628, 'ɴ', 'ɂ'),
 (677, 'ʥ', 'ʋ'),
 (683, 'ʫ', 'ʋ'),
 (692, 'ʴ', 'ʋ'),
 (741, '˥', '˕'),
 (747, '˫', '˕'),
 (756, '˴', '˕'),
 (805, '̥', '́'),
 (811, '̫', '́'),
 (820, '̴', '́'),
 (869, 'ͥ', '\u0379'),
 (875, 'ͫ', '\u0379'),
 (884, 'ʹ', '\u0379'),
 (933, 'Υ', '\u038b'),
 (939, 'Ϋ', '\u038b'),
 (948, 'δ', '\

There are 102 decoding errors on the first choice. Anything more than zero is a problem, so I have to go back and try training a new encoder, maybe with more dimensions, to see what happens.

Anyway, I still have to find a way to reach ZERO errors on the overfitted model.
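A side note on counting: the `np.roll` check earlier assumes consecutive indices, so a single wrong row can be flagged twice and the places where the index sequence jumps are flagged too, which may be why it reports 206 where the character comparison finds 102. A more direct count compares the first neighbour with the index that was fed in. The arrays below are made-up toy values, not the notebook's data.

```python
import numpy as np

# Toy data: indices fed to the encoder, with 2 and 3 repeated at the end.
expected = np.array([0, 1, 2, 3, 4, 5, 2, 3])
# Toy first-neighbour results: only row 3 is decoded wrongly.
decoded = np.array([0, 1, 2, 7, 4, 5, 2, 3])

# Direct comparison: exactly the rows whose decoded index is wrong.
wrong_rows = np.flatnonzero(decoded != expected)

# Roll-based check: flags every place where consecutive values do not
# differ by exactly 1, including the wrap-around and the jump back to 2.
steps = np.roll(decoded, 1) - decoded
roll_flags = np.flatnonzero(np.abs(steps) != 1)

print("wrong rows:", wrong_rows)
print("roll flags:", roll_flags)
```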

In [115]:
I,D

(array([[ 0, 61],
        [ 1, 84],
        [ 2, 33],
        ...,
        [ 2, 33],
        [ 3, 49],
        [15, 20]]), array([[0.01641178, 2.0125747 ],
        [0.01641798, 2.0115097 ],
        [0.01641083, 2.0119002 ],
        ...,
        [0.01641083, 2.0119002 ],
        [0.01641297, 2.0119598 ],
        [0.01641273, 2.0115497 ]], dtype=float32))

In [94]:
torch.sum(bdec), torch.sum(~cmp), torch.sum(enc_multihot), np.sum(utf8codebook)

(tensor(5728, device='cuda:0'),
 tensor(102, device='cuda:0'),
 tensor(5830., device='cuda:0'),
 5824.0)

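`torch.sum(~cmp)` is 102, which matches the gap between `torch.sum(enc_multihot)` (5830) and `torch.sum(bdec)` (5728). The sums of `enc_multihot` (5830) and of the codebook (5824) also differ; `tindices` has 1987 entries while the codebook has 1984 rows (indices 2, 3 and 15 apparently show up again at the end), which might account for it. Both mismatches still have to be checked.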
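The two effects can be examined separately: split the mismatched bits into missing bits and spurious bits, and count how many rows are fed in more than once. If there are no spurious bits, the number of mismatches equals the difference between the two sums, which is the pattern the numbers above show (5830 - 5728 = 102). A sketch on made-up toy arrays:

```python
import numpy as np

# Toy codebook with 3 codes, and an index list where code 1 appears twice.
codebook = np.array([[1, 0, 1, 0],
                     [0, 1, 1, 0],
                     [1, 1, 0, 0]], dtype=bool)
indices = np.array([0, 1, 2, 1])

target = codebook[indices]   # what the encoder should emit (multihot)
pred = target.copy()
pred[2, 0] = False           # simulate one bit lost by the decoder

false_neg = int((target & ~pred).sum())   # bits that should be on but are off
false_pos = int((~target & pred).sum())   # bits that are on but should be off
mismatched = int((target != pred).sum())  # same quantity as sum(~cmp)

# Rows fed in more than once inflate target.sum() relative to codebook.sum().
extra_rows = len(indices) - len(np.unique(indices))
sum_gap = int(target.sum()) - int(codebook.sum())

print(false_neg, false_pos, mismatched, extra_rows, sum_gap)
```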

In [85]:
bdec.shape

torch.Size([1, 1987, 324])

In [86]:
enc_multihot.shape

torch.Size([1, 1987, 324])

In [89]:
enc_multihot.shape, enc32.shape, dec.shape

(torch.Size([1, 1987, 324]),
 torch.Size([1, 1987, 32]),
 torch.Size([1, 1987, 324]))

In [None]:
# what I should do is create several indices and test the decoding in all of those
# for performance to see which one is better in precision, memory usage and speed

In [None]:
# using the GPU
res = faiss.StandardGpuResources()  # use a single GPU
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, indexl2)


In [None]:
print(gpu_index_flat.ntotal)


The index seems quite big here.

Memory usage of the GPU with the index loaded: 3028 MiB

Memory usage of the GPU without the index: 1573 MiB

So the total memory usage attributed to the index is 1455 MiB for the IndexFlatL2 on the entire UTF-8 codebook coded in vectors of dim 64.


In [10]:
3028 - 1573

1455
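For comparison, a quick estimate of what the raw vectors of a flat index need, using the shape printed earlier in this notebook (1984 rows of dim 324, float32). A dim-64 codebook with a similar number of rows would need even less. The large gap suggests that most of the 1455 MiB is not the vectors themselves; a likely candidate is memory that the faiss GPU resources object reserves for temporary buffers, but that is an assumption still to be verified.

```python
# Raw storage of a flat float32 index: one 4-byte value per dimension per row.
n_vectors = 1984       # rows of utf8codebook, from utf8codebook.shape
dim = 324              # vector dimension, from utf8codebook.shape
bytes_per_float32 = 4

raw_bytes = n_vectors * dim * bytes_per_float32
raw_mib = raw_bytes / 2**20

measured_mib = 3028 - 1573  # GPU memory difference observed above

print(f"raw vectors: {raw_mib:.2f} MiB, measured difference: {measured_mib} MiB")
```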