# MultiHot Sparse Encoding

These tests build on the previous Error Correction Codes experiments. For those I still haven't found an easy way of decoding from a neural network, but there should be something there.

In this study I build an encoding for UTF-8 that can handle all the possible (valid) codes.

The idea is to reduce the input space, being able to encode any existing text with a finite code that should be much smaller than current one-hot encoder approaches. This should provide a universal first layer for text encoding.

I have an idea on how to work with the next layers, but for the moment let's make a first one that actually works.


The first step is to build an **overfitted** autoencoder; this is just to validate that the codes are feasible.

A couple of sources:

- [UTF-8 complete table](https://www.utf8-chartable.de/unicode-utf8-table.pl)
- [UTF-8 Wikipedia](https://en.wikipedia.org/wiki/UTF-8)


In [3]:
import numpy as np
import pickle
import torch
from utf8_encoder import *

The UTF-8 universe does NOT cover the entire $2^{32}$ domain; the limitations are explained in [the UTF-8 description](https://en.wikipedia.org/wiki/UTF-8):

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|----------------|--------------------|-----------------|----------------|----------|----------|----------|----------|
| 1              | 7                  | U+0000          | U+007F         | 0xxxxxxx |          |          |          |
| 2              | 11                 | U+0080          | U+07FF         | 110xxxxx | 10xxxxxx |          |          |
| 3              | 16                 | U+0800          | U+FFFF         | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 4              | 21                 | U+10000         | U+10FFFF       | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
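
To make the table concrete, here is a minimal sketch that builds the bytes by hand following the bit layout above and compares the result with Python's built-in codec. `utf8_encode` is a hypothetical helper written only for this illustration; it is not part of `utf8_encoder`.

```python
def utf8_encode(cp):
    """Encode one code point following the byte layout of the table above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value: %r" % cp)
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with the built-in codec on one character per row of the table.
for ch in ["A", "é", "€", "😀"]:
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

The payload bits marked with `x` are exactly what the chunked encoding discussed in this notebook keeps; the fixed prefixes carry no extra information once the number of bytes is known.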

I'll then compute the different table parts and append them when needed.

The number of elements in the table should be at most $2^{21}$, so I need to create a sort of index that can handle the 4 cases.
It seems I'll have to create 4 different conversion tables.




In fact ... it seems that I can just split the UTF-8 value into chunks and do a one-hot per part:

- there are only 4 segment ranges, which can be coded as one-hot too (Hamming or another ECC can be added there)
- the largest part is 7 bits -> 128 values
- the others contain 6 bits -> 64 values

The prefix of each byte can be taken away and replaced by the initial one-hot.

So a complete code would have size $4 + 128 + 64 + 64 + 64 = 324$

instead of the dimension 1,112,064 needed to one-hot encode any UTF-8 value.

The encoder is much simpler than I thought for this case. Later I can add ECC for each part; knowing that there is only one active bit in each part makes the task easier.

This embedding can still be reduced, but it should already be sparse enough to make a good input.
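
The 324-wide layout described above can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the helper names (`multihot_indices`, `encode_char`, `decode_vector`) are made up here, unused byte slots are simply left at zero, and the layout is the 324-element one from the text, not necessarily the one produced by `create_tables`.

```python
SEGMENT_BITS = 4     # one-hot over the number of bytes (1 to 4)
FIRST_BITS = 128     # payload of the first byte (up to 7 bits)
CONT_BITS = 64       # payload of each continuation byte (6 bits)
CODE_SIZE = SEGMENT_BITS + FIRST_BITS + 3 * CONT_BITS  # 324

# Masks that strip the length prefix from the first byte, and the
# prefixes to put back when decoding.
FIRST_MASK = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}
FIRST_PREFIX = {1: 0x00, 2: 0xC0, 3: 0xE0, 4: 0xF0}


def multihot_indices(ch):
    """Return the positions of the active bits for one character."""
    raw = ch.encode("utf-8")
    n = len(raw)
    idx = [n - 1]                                        # segment one-hot
    idx.append(SEGMENT_BITS + (raw[0] & FIRST_MASK[n]))  # first-byte payload
    for k in range(1, n):                                # continuation bytes
        offset = SEGMENT_BITS + FIRST_BITS + (k - 1) * CONT_BITS
        idx.append(offset + (raw[k] & 0x3F))
    return idx


def encode_char(ch):
    """Dense 0/1 vector of length CODE_SIZE."""
    vec = [0] * CODE_SIZE
    for i in multihot_indices(ch):
        vec[i] = 1
    return vec


def decode_vector(vec):
    """Invert encode_char by restoring the UTF-8 prefixes."""
    active = [i for i, v in enumerate(vec) if v]  # ascending positions
    n = active[0] + 1
    raw = [FIRST_PREFIX[n] | (active[1] - SEGMENT_BITS)]
    for k in range(1, n):
        offset = SEGMENT_BITS + FIRST_BITS + (k - 1) * CONT_BITS
        raw.append(0x80 | (active[k + 1] - offset))
    return bytes(raw).decode("utf-8")


for ch in ["A", "é", "€", "😀"]:
    print(ch, multihot_indices(ch), decode_vector(encode_char(ch)))
```

Each character activates at most 5 of the 324 positions, so storing only the index list is already a compact sparse representation.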

In [2]:
4 + 128 + 64 + 64 + 64 

324

In [3]:
# number of parameters for a one-hot by chunks encoding:
chunk_sizes = [4, 5, 6, 8, 12]
n_params = []
for c in chunk_sizes:
    n_params.append((c, (32 // c) * 2**c))

In [4]:
n_params

[(4, 128), (5, 192), (6, 320), (8, 1024), (12, 8192)]

In [5]:
import torch

emd = torch.nn.Embedding(2**10, 300)

In [6]:
model_parameters = filter(lambda p: p.requires_grad, emd.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])


In [7]:
params

307200

In [8]:
# from https://discuss.pytorch.org/t/how-do-i-check-the-number-of-parameters-of-a-model/4325/7
# counting the number of (trainable) parameters of a pytorch model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [9]:
count_parameters(emd)

307200

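The embedding layer is equivalent to a fully connected layer applied to a one-hot input ... this means a LOT of parameters: one row of weights per symbol.

An effective one-hot embedding over all of UTF-8, for embedding sizes from 50 to 300, would need: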

In [10]:
for i in [50,100,200,300]:
    print(i, 1112064 * i)

50 55603200
100 111206400
200 222412800
300 333619200


Which means I don't want to train that layer ... it would not even fit in my GPU
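
For comparison, the same count can be done for a first layer that takes the 324-element multi-hot code as input. This is a rough sketch that assumes a single bias-free linear layer in both cases; `first_layer_weights` is a name I made up for the illustration.

```python
N_CODE_POINTS = 1112064   # valid Unicode code points (one-hot input size)
MULTIHOT_SIZE = 324       # 4 + 128 + 64 + 64 + 64


def first_layer_weights(input_size, embedding_dim):
    """Number of weights of a bias-free linear layer input_size -> embedding_dim."""
    return input_size * embedding_dim


for dim in (50, 100, 200, 300):
    one_hot = first_layer_weights(N_CODE_POINTS, dim)
    multi_hot = first_layer_weights(MULTIHOT_SIZE, dim)
    print(dim, one_hot, multi_hot, one_hot // multi_hot)
```

The ratio is the same for every embedding size, since it only depends on the two input widths.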

There is another thing: the embedding layer learns from the sample input, which means it will ignore all values that don't appear or are underrepresented (a known issue). My goal is to deal with this with meta-learning techniques, while always being able to keep adding new inputs.


In [11]:
tables = create_tables(segments=3)

number of codes =  59328
number of code_exceptions =  4224


In [12]:
# tables = create_tables()  # 4 segments by default
# if the previous line is executed gives:
# number of codes =  1107904
# number of code_exceptions =  790656


In [13]:
np.save("utf8_code_matrix_3seg", tables[0])

In [14]:
save_obj(tables[1], "txt2code_3seg")
save_obj(tables[2], "code2txt_3seg")
save_obj(tables[3], "txt2num_3seg")
save_obj(tables[4], "num2txt_3seg")

In [15]:
t2c = tables[1]
c2t = tables[2]
n2t = tables[4]
t2n = tables[3]

In [16]:
# checking that all the tables have the right number of codes
tables[0].shape, len(t2n.keys()), len(n2t.keys()), len(tables[1].keys()), len(tables[2].keys())

((59328, 388), 59328, 59328, 59328, 59328)

Although Wikipedia says:

> UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

we have managed to encode only 1107904 codes, so we are somehow missing 4160 codes that Python can't encode from bytes. I won't deal with this for the moment; I'll just trust that Python knows how to encode UTF-8 (or I should start writing tests to find out whether Python has a bug and file a ticket ... but I must stay strong and follow my goal without diverging, as I have almost no time).

In [17]:
1112064 - 1107904

4160
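
A quick way to cross-check the reference number without the tables is to let Python's own codec encode every code point and group the results by byte length. This is only a sketch: `count_by_utf8_length` is a hypothetical helper, it goes from code points to bytes (the opposite direction of the table generation), and it does not touch `create_tables`.

```python
from collections import Counter


def count_by_utf8_length():
    """Count valid code points per UTF-8 byte length using the built-in codec."""
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values and cannot be encoded
        counts[len(chr(cp).encode("utf-8"))] += 1
    return counts


counts = count_by_utf8_length()
for n_bytes in sorted(counts):
    print(n_bytes, counts[n_bytes])
print("total", sum(counts.values()))
```

The per-length counts give a reference to compare each segment of the generated tables against, which should help locate where the 4160 codes go missing.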

In [18]:
128 + (2**5 * 2**6)+ (2**4 * (2**6)**2) + (2**3 * (2**6)**3)

2164864

In [19]:
2**21 + 2**16 + 2**11 + 2**7

2164864
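
The 2164864 above counts every payload bit pattern, valid or not. As a hedged sketch, the patterns can be split per byte length into valid codes and the ones UTF-8 forbids (overlong forms, surrogates, values above U+10FFFF). The names are made up for this illustration, and the numbers count bit patterns only; they are not meant to reproduce the `code_exceptions` figure printed by `create_tables`, which may count things differently.

```python
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21}
FIRST_CODE_POINT = {1: 0x0000, 2: 0x0080, 3: 0x0800, 4: 0x10000}
LAST_UNICODE = 0x10FFFF
N_SURROGATES = 0xE000 - 0xD800  # 2048 values, all inside the 3-byte range


def pattern_breakdown():
    """Split the raw payload patterns of each byte length into valid and invalid."""
    rows = {}
    for n_bytes, bits in PAYLOAD_BITS.items():
        total = 2 ** bits
        # Payload values below the first code point of this length are overlong forms.
        overlong = FIRST_CODE_POINT[n_bytes]
        # Only the 21-bit payload can exceed the last Unicode code point.
        too_large = max(0, total - (LAST_UNICODE + 1))
        surrogates = N_SURROGATES if n_bytes == 3 else 0
        valid = total - overlong - too_large - surrogates
        rows[n_bytes] = {"total": total, "valid": valid, "invalid": total - valid}
    return rows


rows = pattern_breakdown()
for n_bytes, row in rows.items():
    print(n_bytes, row)
print("raw patterns:", sum(r["total"] for r in rows.values()))
print("valid codes: ", sum(r["valid"] for r in rows.values()))
```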

In [20]:
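# cumulative start index of each segment; the last value is the total number of raw codes
print("indices for the segments: ", 0, 128, (128 + 2**5 * 2**6), (128 + 2**5 * 2**6 + 2**4 * (2**6)**2), (128 + 2**5 * 2**6 + 2**4 * (2**6)**2 + 2**3 * (2**6)**3) )

indices for the segments:  0 128 2176 67712 2164864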


In [21]:
# from:  https://stackoverflow.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

In [22]:
take(100, n2t.items())

[(0, '\x00'),
 (1, '\x01'),
 (2, '\x02'),
 (3, '\x03'),
 (4, '\x04'),
 (5, '\x05'),
 (6, '\x06'),
 (7, '\x07'),
 (8, '\x08'),
 (9, '\t'),
 (10, '\n'),
 (11, '\x0b'),
 (12, '\x0c'),
 (13, '\r'),
 (14, '\x0e'),
 (15, '\x0f'),
 (16, '\x10'),
 (17, '\x11'),
 (18, '\x12'),
 (19, '\x13'),
 (20, '\x14'),
 (21, '\x15'),
 (22, '\x16'),
 (23, '\x17'),
 (24, '\x18'),
 (25, '\x19'),
 (26, '\x1a'),
 (27, '\x1b'),
 (28, '\x1c'),
 (29, '\x1d'),
 (30, '\x1e'),
 (31, '\x1f'),
 (32, ' '),
 (33, '!'),
 (34, '"'),
 (35, '#'),
 (36, '$'),
 (37, '%'),
 (38, '&'),
 (39, "'"),
 (40, '('),
 (41, ')'),
 (42, '*'),
 (43, '+'),
 (44, ','),
 (45, '-'),
 (46, '.'),
 (47, '/'),
 (48, '0'),
 (49, '1'),
 (50, '2'),
 (51, '3'),
 (52, '4'),
 (53, '5'),
 (54, '6'),
 (55, '7'),
 (56, '8'),
 (57, '9'),
 (58, ':'),
 (59, ';'),
 (60, '<'),
 (61, '='),
 (62, '>'),
 (63, '?'),
 (64, '@'),
 (65, 'A'),
 (66, 'B'),
 (67, 'C'),
 (68, 'D'),
 (69, 'E'),
 (70, 'F'),
 (71, 'G'),
 (72, 'H'),
 (73, 'I'),
 (74, 'J'),
 (75, 'K'),
 (76, 'L

In [23]:
t2n['\x09']

9

In [24]:
len(take(10, t2c.items())[0][1])

388

In [25]:
import torch
from torch import sparse


In [26]:
codes = torch.from_numpy(tables[0])

In [27]:
# from https://discuss.pytorch.org/t/how-to-convert-a-dense-matrix-to-a-sparse-one/7809

def to_sparse(x):
    """ converts dense tensor x to sparse format """
    x_typename = torch.typename(x).split('.')[-1]
    sparse_tensortype = getattr(torch.sparse, x_typename)

    indices = torch.nonzero(x)
    if len(indices.shape) == 0:  # if all elements are zeros
        return sparse_tensortype(*x.shape)
    indices = indices.t()
    values = x[tuple(indices[i] for i in range(indices.shape[0]))]
    return sparse_tensortype(indices, values, x.size())


In [28]:
scodes = to_sparse(codes)

In [29]:
scodes.is_sparse

True

In [30]:
type(scodes)

torch.sparse.DoubleTensor

In [31]:
# pytorch sparse can't be saved yet ... not implemented for the moment (I should do it myself and send the patch)
# torch.save(scodes, "utf8-codes.pt")
# save_obj(scodes, "utf8-codes.torch")

In [32]:
import scipy as sp
import scipy.sparse

In [33]:
spcodes = sp.sparse.coo_matrix(tables[0])

In [34]:
save_obj(spcodes, "utf8-codes-scipy-sparse_3seg")

So, for the moment we have the possibility to encode all UTF-8 characters, but the complete table is still a bit big. I'll try to cut the memory use, because 6.8GB for the "dense" matrix representation is too much. In sparse mode the matrix is only 83MB for the entire dataset. Nevertheless, there are many characters that I will not be using for the first tests, so using only part of the table will (should) be enough.

So I'll see how big the encoder is when keeping only the first 3 segments instead of all 4 (this should be enough for most applications), so we can encode:

number of codes =  59328

number of code_exceptions =  4224

With this subset, and still keeping the code layout for 4 segments, the entire code is now 206MB on disk in dense mode and 3.6MB in sparse mode (this layout is scalable: the rest of the codes can be added later without redoing the architecture).

By also reducing the number of bytes in the code (3 bytes max instead of 4), dropping the last one that we are not using for this application anyway, the complete "dense" code goes down to 177MB on disk and 3.6MB in sparse mode.

I would not recommend doing this all the time, as it restricts the power of the input network to only known elements (and we want to handle all the possible codes), but for my tests it reduces the memory usage, the number of parameters and the processing time.

When using 4 segments there are 452 elements per code; when using 3 there are 388.
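
As a rough cross-check of the 3-segment dense sizes, the matrix size can be estimated from its shape. This sketch assumes 8 bytes per value (float64, which matches the `DoubleTensor` type seen earlier) and ignores any file header; the helper names are hypothetical.

```python
def dense_size_bytes(n_codes, code_width, bytes_per_value=8):
    """Size of a dense n_codes x code_width matrix, assuming float64 by default."""
    return n_codes * code_width * bytes_per_value


def to_mib(n_bytes):
    """Convert bytes to mebibytes (2**20 bytes)."""
    return n_bytes / 2**20


n_codes = 59328  # codes available with the first 3 segments
for code_width in (452, 388):
    size = dense_size_bytes(n_codes, code_width)
    print(code_width, size, round(to_mib(size), 1))
```

The two estimates come out near 205 MiB and 176 MiB, in the same range as the dense on-disk sizes quoted above for the 452 and 388 element codes.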

So I can start playing with it without worrying about memory ;)



In [5]:
d = np.load("utf8_code_matrix_3seg.npy")

In [7]:
d = torch.from_numpy(d)

In [10]:
cudad = d.cuda()

In [12]:
len(d)

59328

In [1]:
# trying the autoencoder now, this is just to see if it works before going on with a more complex setup
import numpy as np
import pickle
import torch
from utf8_encoder import *
from utf8vae import *
from torch.utils.data import TensorDataset

In [2]:
import torch.nn.functional as F

# Reconstruction loss (MSE, averaged over all elements) + KL divergence (summed over all elements and batch)
def loss_function(recon_x, x, mu, logvar, vector_size, channels=1):
    # Alternatives tried for the reconstruction term:
    #     F.binary_cross_entropy(recon_x, x, size_average=False)
    #     F.nll_loss(recon_x, x)
    # Note: this is an MSE term, not a binary cross-entropy.
    recon_loss = F.mse_loss(recon_x, x)

    # see Appendix B from VAE paper:
    # Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014
    # https://arxiv.org/abs/1312.6114
    # -KL = 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + KLD
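
To sanity-check the KL term on its own, here is a small pure-Python version of the same closed-form expression for a diagonal Gaussian against a standard normal. It is just a sketch for checking values by hand; `kl_to_standard_normal` is a hypothetical helper that works on plain lists instead of tensors.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over the latent dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


# Zero when the posterior already is the standard normal.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Grows as the mean moves away from zero.
print(kl_to_standard_normal([1.0], [0.0]))
```

Because this term is summed while the MSE term is averaged, the two parts of the loss are on different scales, which is worth keeping in mind when reading the loss values.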


In [7]:

from torch import optim


def train_overfit():
    # generate dataset inputs (basically the same as the encoding)
    # We are going to overfit on purpose
    n_epochs = 100
    n_batches = 10  # optimisation steps per epoch; every step sees the whole code matrix
    segments = 3  # I use 3 as it will be much MUCH faster and smaller for my resources than 4 segments
    in_size = 388  # code width for 3 segments
    hidd_size = 100
    code_size = 50
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    vector_size = code_size
    channels = 1
    batch_size = 10  # only used by the (disabled) DataLoader below
    datafile = "utf8_code_matrix_3seg.npy"
    log_interval = 10

    model = UTF8VAE(in_size, hidd_size, code_size, segments=segments)
    # loader = DataLoader(UTF8Dataset("utf8_code_matrix_3seg.npy"), batch_size=batch_size)

    name = "utf8-vae-3segments-overfit"
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    # train_loader, test_loader = get_loaders(batch_size, transformation)
    # we are overfitting, so train and test are the same thing.

    data = torch.from_numpy(np.load(datafile)).float()
    # Everything stays on the CPU for now; moving to `device` is disabled.
    # data = data.to(device)
    # model = model.to(device)

    model.train()
    for epoch in range(n_epochs):
        train_loss = 0
        for batch_idx in range(n_batches):
            optimizer.zero_grad()
            recon_batch, mu, logvar = model(data)
            loss = loss_function(recon_batch, data, mu, logvar, vector_size, channels)
            loss.backward()
            train_loss += loss.item()
            optimizer.step()
            if batch_idx % log_interval == 0:
                # progress is measured in steps, since each step uses the full dataset
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx, n_batches,
                    100. * batch_idx / n_batches,
                    loss.item() / len(data)))

        print('====> Epoch: {} Average loss: {:.4f}'.format(
              epoch, train_loss / len(data)))

    model.save_model(name, "saved_models")
    # train_all(models)
    # train_all(models)


In [8]:
train_overfit()

====> Epoch: 0 Average loss: 0.8654
====> Epoch: 1 Average loss: 0.1975
====> Epoch: 2 Average loss: 0.0826
====> Epoch: 3 Average loss: 0.0448
====> Epoch: 4 Average loss: 0.0267
====> Epoch: 5 Average loss: 0.0184
====> Epoch: 6 Average loss: 0.0138
====> Epoch: 7 Average loss: 0.0111
====> Epoch: 8 Average loss: 0.0094
====> Epoch: 9 Average loss: 0.0082
====> Epoch: 10 Average loss: 0.0073
====> Epoch: 11 Average loss: 0.0067
====> Epoch: 12 Average loss: 0.0062
====> Epoch: 13 Average loss: 0.0059
====> Epoch: 14 Average loss: 0.0056
====> Epoch: 15 Average loss: 0.0054
====> Epoch: 16 Average loss: 0.0052
====> Epoch: 17 Average loss: 0.0051
====> Epoch: 18 Average loss: 0.0049
====> Epoch: 19 Average loss: 0.0049
====> Epoch: 20 Average loss: 0.0048
====> Epoch: 21 Average loss: 0.0047
====> Epoch: 22 Average loss: 0.0047
====> Epoch: 23 Average loss: 0.0046
====> Epoch: 24 Average loss: 0.0046
====> Epoch: 25 Average loss: 0.0046
====> Epoch: 26 Average loss: 0.0045
====> Epoch

TypeError: Can't convert 'int' object to str implicitly

In [19]:
torch.__version__

'1.0.0'