# DRAFT - V1

This is the first draft, where things are still going in. It needs to be finished first (tasks, validation, and so on) and then reordered, reworded, etc.

# A Comprehensive Study of Multiple Encoding Techniques on Low Resource Embeddings for Character-Level NLP

## Flexible Universal Character Level Encoding method for Text in NLP

    Leonardo M. Rocha
    leo <dot> m <dot> rocha <at> gmail <dot> com
    

## Abstract

Currently NLP deals with *Out of Vocabulary* (OOV) symbols in different ways, which leads to several, not necessarily efficient, ways of pre-processing NLP datasets to be able to deal with them.
In this work we present **Segment-Multihot-Encoding** (SME), a technique that deals with OOV by encoding all (or part) of the UTF-8 domain in a fixed, computationally efficient multi-hot encoding that can be further compressed by overfitting. This technique eliminates the need for complex and compute-intensive pre-processing, replacing it with a much simpler one that works for *any* dataset. This work focuses on being able to encode a symbol, as once it is encoded the network can be fine-tuned later to handle previously unseen ones.

This work presents the SME technique for UTF-8, which we call **SME-UTF8**; the source code and encoding vectors are also provided, with usage examples.

An extra advantage of this methodology is that, being a deterministic and well-defined process with a small representation size, it can be implemented efficiently in hardware.

This work is also presented directly as an IPython notebook so that it can be executed where needed. We call this the Executable Paper; the idea is to improve reproducibility.



## Introduction and Related Work

Currently for NLP tasks there is the need to first analyze the input domain and encode it to deal with *Out of Vocabulary* (OOV) words or symbols and Polysemy. 

It is important to separate (and we do in this work) the encoding part (to be able to represent the symbol) from the learning to use those symbols (the network to be able to do something useful with it) as this work focuses solely on being able to encode all feasible symbols in a defined text encoding domain. In this case the work is done for UTF-8 which is the most used text encoding in the web [94.6% according to w3techs](https://w3techs.com/technologies/details/en-utf8).



Diverse techniques deal differently with OOV, ranging from techniques that cannot deal with them, like [GloVe - Pennington et al. 2014](https://nlp.stanford.edu/pubs/glove.pdf) or [Word2Vec - Mikolov et al. 2013](https://arxiv.org/abs/1301.3781), to others, such as [Universal Sentence Encoder - Cer et al. 2018](https://arxiv.org/abs/1803.11175) or FastText, that can encode OOV with subword embeddings.
One of the most used techniques is [Byte Pair Encoding from Neural Machine Translation of Rare Words with Subword Units - Sennrich et al. 2015](https://arxiv.org/abs/1508.07909), which has the advantage of compressing the input size and hence accelerating training compared to a full character-level input on the current SoTA.


[ELMo - Peters et al. 2018](https://arxiv.org/abs/1802.05365)

[Transformer - ]()
[ULM-FiT - ]()
[BERT - ]()
[AlBERT - ]()
[CamemBERT]()


... TODO more papers and references here



All current methods deal with subdomains of the possible inputs, which is enough for most tasks; nevertheless, their weakness is that they cannot deal with **all** possible input symbols, which for the current study means the whole UTF-8 domain.

In the case of continual learning the need to add new symbols is a given, be it due to adding a new domain in the same language or new languages.

The goal of this work is to try to set a more standard way to deal with all possible symbols in a defined encoding standard.

It is also worth noting that the goal of this paper is NOT to obtain orthonormal representations; instead we are looking for over-represented spaces with redundant information in different forms that could be exploited by the network differently on a per-case basis.

This work analyzes the UTF-8 encoding and presents a technique to encode part or all of the UTF-8 domain in a computationally efficient way. The same technique can be used for other text encodings, and as UTF-8 is a superset of ASCII and can represent every Unicode code point, the same encoding matrix can be applied without any modification to such datasets.

There is another point to make about computational complexity: all the current SoTA methods are trained on clusters (and at prices) that are unavailable to most users. The current work is part of a larger effort to get enough performance on commercially available (and relatively accessible) single GPUs for end users (at the time of writing the NVidia RTX2080ti is one of the most computationally powerful cards on the market).


### Sections

This paper first presents an analysis of the UTF-8 encoding.

Then it describes the construction of the proposed multi-hot code.

After that it works on compressing the multi-hot code with overfitting (yes, OVERFITTING).

Then it evaluates the codes and compares the results with other encoding methods.

Finally it presents the conclusion and future work.


## UTF-8 Analysis

### One-Hot encoding

[One-hot encoding](https://en.wikipedia.org/wiki/One-hot) is one of the most used methods to encode categorical variables. In State of the Art NLP tasks it is used to encode the input symbols; this is computationally expensive, and the goal here is to reduce this complexity, leaving memory and compute for other, more complex tasks in the network.

### Number of code-points

As this work tries to encode all the characters representable in UTF-8, we first have to check how many there are.

From [Wikipedia UTF-8](https://en.wikipedia.org/wiki/UTF-8):

UTF-8 is a variable-width character encoding capable of encoding all **1,112,064** valid Unicode code points.

$$ 17 \times 2^{16} = 1114112 $$ code points, minus 2,048 technically-invalid surrogate code points, gives $1114112 - 2048 = 1112064$.

That is, if encoding with one-hot we would need about 1.1M parameters per neuron in the input layer, which is expensive. The goal is to reduce this complexity (which we argue is unnecessary) by orders of magnitude.
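The count above can be checked directly with the Python standard library. This is only a minimal sketch; the function name `count_encodable_code_points` is illustrative and is not part of the accompanying source code.

```python
def count_encodable_code_points() -> int:
    """Count the code points that Python is able to encode as UTF-8."""
    total_code_points = 17 * 2 ** 16  # 17 planes of 65,536 code points each
    encodable = 0
    for cp in range(total_code_points):
        try:
            chr(cp).encode("utf-8")
            encodable += 1
        except UnicodeEncodeError:
            # Surrogate code points (U+D800..U+DFFF) are rejected by the codec.
            pass
    return encodable


if __name__ == "__main__":
    print(count_encodable_code_points())
```

The loop visits every code point once, so it takes a moment to run, but it needs no external data.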

## UTF-8 structure and Encoding Details

The entire UTF-8 universe is NOT the entire $2^{32}$ domain; there are limitations, explained in [the UTF-8 description](https://en.wikipedia.org/wiki/UTF-8):

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|----------------|--------------------|-----------------|----------------|----------|----------|----------|----------|
| 1              | 7                  | U+0000          | U+007F         | 0xxxxxxx |          |          |          |
| 2              | 11                 | U+0080          | U+07FF         | 110xxxxx | 10xxxxxx |          |          |
| 3              | 16                 | U+0800          | U+FFFF         | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 4              | 21                 | U+10000         | U+10FFFF       | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

The UTF-8 code is formed by up to 4 segments (bytes); we will refer to this often during the current work.

The number of elements in the table is at most $2^{21}$. There is only the need to create an index that can handle the 4 cases, which can be done with 4 different conversion tables.


In fact it is possible to just cut the UTF-8 value in chunks and do one-hot per different part:

- there are only 4 segment ranges, which can be coded in one-hot to add redundancy (either Hamming or another ECC could also be added there)
- the largest chunk has 8 bits -> 256 values
- the others contain 6 bits -> 64 values

The prefix of each byte can be taken away and replaced by the initial one-hot.

So a complete code would have dimension $4 + 256 + 64 + 64 + 64 = 452$.

Instead of having a vector of dimension 1,112,064 to encode any utf-8 value, one with dimension 452 would be able to encode everything in the utf-8 domain.
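The chunking described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the accompanying `sparse_encoders` module: the names `sme_utf8_encode` and `sme_utf8_decode` are hypothetical, the first byte is one-hot encoded with its full 8-bit value, and each continuation byte is reduced to its 6 payload bits.

```python
SEGMENT_DIMS = 4       # one-hot: how many bytes the UTF-8 code uses
FIRST_BYTE_DIMS = 256  # one-hot of the first byte (full 8-bit value, assumed)
CONT_BYTE_DIMS = 64    # one-hot of the 6 payload bits of a continuation byte
VECTOR_SIZE = SEGMENT_DIMS + FIRST_BYTE_DIMS + 3 * CONT_BYTE_DIMS  # 452


def sme_utf8_encode(char):
    """Encode a single character as a 452-dimensional multi-hot list."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    raw = char.encode("utf-8")
    vec = [0] * VECTOR_SIZE
    vec[len(raw) - 1] = 1            # segment indicator
    vec[SEGMENT_DIMS + raw[0]] = 1   # first byte
    base = SEGMENT_DIMS + FIRST_BYTE_DIMS
    for i, byte in enumerate(raw[1:]):
        # Drop the '10' prefix of the continuation byte, keep its 6 low bits.
        vec[base + i * CONT_BYTE_DIMS + (byte & 0x3F)] = 1
    return vec


def sme_utf8_decode(vec):
    """Invert sme_utf8_encode by reading the active position of each block."""
    n_bytes = vec[:SEGMENT_DIMS].index(1) + 1
    raw = [vec[SEGMENT_DIMS:SEGMENT_DIMS + FIRST_BYTE_DIMS].index(1)]
    base = SEGMENT_DIMS + FIRST_BYTE_DIMS
    for i in range(n_bytes - 1):
        block = vec[base + i * CONT_BYTE_DIMS: base + (i + 1) * CONT_BYTE_DIMS]
        raw.append(0x80 | block.index(1))  # restore the '10' prefix
    return bytes(raw).decode("utf-8")
```

With this layout a character that needs `n` bytes activates `n + 1` positions out of 452, so even 4-byte symbols stay very sparse.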

This embedding can still be reduced, but it should already be sparse enough to make a good input. The goal here is to have a sparse vector that keeps each vector far enough from the others, at least by one dimension. Adding the redundancy code (the first 4 dimensions) makes the distance even bigger for vectors that should be further apart, taking into account the locality of the UTF-8 encoding (each character set is close to the ones used with it; segment 3 encodes mostly CJK, Chinese-Japanese-Korean).

#### Notes
It is worth noting here that the first author also has experience in communications, which allowed the analysis, during the course of this research, of multiple Error Correcting Codes (ECCs) and different kinds of encoding (for example encoding as a Fourier series). The conclusion is that even if one-hot is the best in distance, other codes can be used, and a sparse multi-hot is the simplest to implement (and the fastest to encode). As a note, one pending task is to analyze ECCs in an end-to-end manner for a neural network. Some of these analyses (without proper formatting) can be found in the annex notebooks folders of this repository or in the [text subfolder in the minibrain project](https://github.com/leomrocha/minibrain/tree/master/predictors/sequence/text), where most of the experimental code is located.

## Encoding details

### UTF-8 Segments
To cut even more memory consumption the table can be generated for 1-4 segments of the utf-8 code, taking into account that the 4th segment is mostly composed of:
* Supplementary Multilingual Plane (SMP) of historic scripts
* Supplementary Special-purpose Plane (SSP)
* Private Use Areas (SSU)
* Invalid Codes

We can safely ignore this 4th segment (for the purpose of this article and most usages), which accounts for most of the code points.

If CJK, Indic and some Miscellaneous Symbols are not (and will not be) needed, then the 3rd segment can be safely ignored too, reducing even more the memory consumption of the application.

So the result would be:

| Segment | # of code points | First index | Last index | Vector Size | # code exceptions | Size (MB) |Matrix Size (MB) | Sparse Size (MB) |
|---------------|-----------------------|-------------|-------------|------------|-------------------|-------------------|-----------|-----------|
| 4             | 1107904             | 61440        | 1107904     | 452 | 790656  | 11538.59   | 3820.59 | 83.59 |
| 3             | 59328               | 2112         | 61439       | 388 | 4224    | 530.71     | 175.62  | 3.59  |
| 2             | 1984                | 128          | 2111        | 324 | 128     |  14.84     | 4.90    | 0.09  |
| 1             | 128                 | 0            | 127         | 260 | 0       |  0.77      | 0.25    | 0.005  |

Where:
* Segment: number of segments used from utf-8 to generate the code
* \# of Code Points: The total encoded code points generated
* Vector Size: The embedding vector size
* First / Last Index: corresponds to the segment first and last index in the embedding matrix
* \# Code Exceptions: Number of code exceptions during encoding with Python, notice that we use the standard library for this.
* Size (MB): Size of the embedding matrix and conversion dictionaries (from-to code) once saved in disk in Dense mode
* Matrix Size (MB): Size of the embedding matrix in disk in Dense Mode
* Sparse Size (MB): Size of the embedding matrix in disk in Sparse mode


Notice that the code for 1 segment corresponds to one-hot encoding of ASCII encoding (plus the vector of size 4 that we don't modify in any case)

## Signaling the Start and End of a Sequence 

To this end we can use the codes available in UTF-8 and add the mapping to the encoding and decoding dictionaries (not the matrix). The first 0x20 codes in UTF-8 are communication and control codes; we can use those for our NLP purposes, or we could choose the invalid bytes 0xC0 and 0xC1, or codes larger than U+10FFFF (at the end of segment 4 in our vocabulary for this paper).

In order to avoid any issues, and as we do not plan to use the current models for communication protocols but only for NLP purposes (although this encoding could also allow the network to deal directly with network communication protocols), we decide to use the control codes at the beginning of the first segment.

To this end there are 3 codes that are re-mapped by design and signaled by the following:
* \<start\>: 0x02 - STX Start of Text
* \<end\>: 0x03 - ETX End of Text
* \<unk\>: 0x15 - NAK  Negative Acknowledge

The \<unk\> (Unknown) element should not be needed, as by design there should be no unrepresented symbols in the code, but it is added for completeness and in case of future use.

The mapping is created in a way to be as close as possible to the original meaning of the control symbols.
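A minimal sketch of this re-mapping follows. The dictionary name `SPECIAL_TOKENS` and the helper `wrap_sequence` are hypothetical and only illustrate the idea; the real encoding and decoding dictionaries live in the accompanying source code.

```python
# Control codes reused as sequence markers (values taken from the list above).
SPECIAL_TOKENS = {
    "<start>": "\x02",  # STX, Start of Text
    "<end>": "\x03",    # ETX, End of Text
    "<unk>": "\x15",    # NAK, Negative Acknowledge
}


def wrap_sequence(text):
    """Surround a text with the start and end markers before encoding it."""
    return SPECIAL_TOKENS["<start>"] + text + SPECIAL_TOKENS["<end>"]


if __name__ == "__main__":
    print(wrap_sequence("hola").encode("utf-8"))
```

Because the markers are ordinary single-byte UTF-8 characters, they are encoded by the 1-segment code like any other symbol and need no extra rows in the matrix.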


### Codes Creation

This section is dedicated to code execution and measurements to fill the table above

In [1]:
from sparse_encoders import create_measure_tables
create_measure_tables()

number of codes =  128
number of code_exceptions =  0
number of codes =  1984
number of code_exceptions =  128
number of codes =  59328
number of code_exceptions =  4224
number of codes =  1107904
number of code_exceptions =  790656
| Segments | exec_time (sec) |  matrix_shape | Size in Disk (MB): | Matrix Size in Disk (MB):            | Sparse Matrix Size in Disk (MB): |code path
| 1 | 0.004 | (128, 260) | 0.78 | 0.25 | 0.00 | codes/utf8_codes-1seg.pkl |
| 2 | 0.040 | (1984, 324) | 14.85 | 4.90 | 0.09 | codes/utf8_codes-2seg.pkl |
| 3 | 1.479 | (59328, 388) | 530.71 | 175.62 | 3.59 | codes/utf8_codes-3seg.pkl |
| 4 | 33.929 | (1107904, 452) | 11538.59 | 3820.59 | 83.59 | codes/utf8_codes-4seg.pkl |


### Observations

### Execution of previous results

The execution of all the previous code is done in a single thread of an Intel i7 7700.

#### Embedding Sizes

The size of the embedding matrix grows with the number of code points and the embedding size. Observing the size of the matrices, their sparse representation is negligible in comparison with current models.
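As a rough cross-check of the dense sizes, the reported "Matrix Size" values are consistent with one 8-byte entry per matrix element, measured in MiB. The entry width is an inference from the numbers in the table, not something stated by the generating code, and `dense_size_mib` is an illustrative helper.

```python
def dense_size_mib(rows, cols, bytes_per_entry=8):
    """Size of a dense rows x cols matrix in MiB (8-byte entries assumed)."""
    return rows * cols * bytes_per_entry / 2 ** 20


if __name__ == "__main__":
    # Matrix shapes taken from the measurement table above.
    for rows, cols in [(128, 260), (1984, 324), (59328, 388), (1107904, 452)]:
        print(rows, cols, round(dense_size_mib(rows, cols), 2))
```

Under the same assumption, storing the entries as single bytes would divide the dense sizes by 8, which may already be enough for the 3-segment code.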

NVidia provides support for sparse operations with [cuSPARSE](https://developer.nvidia.com/cusparse) ([documentation](https://docs.nvidia.com/cuda/cusparse/index.html)), which means we can use these matrices.

Nevertheless many applications work in dense mode, and in that case working with fewer than 4 segments would be advisable. The 3-segment embedding provides enough representation for practically all languages in current use, and the 2-segment one already supports many languages (CJK, Indic and some symbols need the 3rd segment, as stated in a previous section).

#### Execution Time

As seen in the code execution above, even if the vectors are big to store and download, their creation is deterministic and they can be recreated in less than a minute on a single thread of an off-the-shelf CPU.


### PC Configuration:

#### Hardware

    intel i7 7700
    64GB of RAM
    GPU-0 GTX1080 -> runs the GUI and other tasks, sometimes used for train and testing
    GPU-1 RTX2080ti -> only used for computation

#### Software

    Cuda V10.1
    Pytorch V1.3
    Python 3.7


## Overfitting Compression

In the literature overfitting is an evil creature, but in this case, as we know the entire domain, we are going to use it to our advantage by overfitting the sparse input (the multi-hot encoded vector) into an embedding vector smaller than the input; the goal here is lossless compression.

This is done to be able to make a more informed decision at the end of the study and to show the comparative results. We are able to show that overfitting underperforms manual encoding (which comes as no surprise): where overfitting fails to decode some vectors, manually created codes are smaller and repeatable.

Once the network (basically an overfitted autoencoder) is trained, a new encoding matrix is generated by passing each element of the input domain through the autoencoder and taking its latent vector; the resulting matrix can be given directly as an Embedding to the network.

As the UTF-8 coding uses a maximum of 4 bytes for the code representation, a binary vector needs at least 32 dimensions to represent it directly. The smallest code would therefore be the UTF-8 code itself used as the embedding (which we should also test as input to be able to compare the results).

Any binary vector representation that handles the complete domain in this way must then be at least of dimension 32. Here we'll test several dimensions for each number of codes. The representation of the 1-segment coding is done just for completeness; the one for 2 segments might be useful, but with only 1984 elements a one-hot encoding does not pose a big problem with current resources. The code starts to be more interesting for the 3-segment and 4-segment codings.
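The baseline of using the UTF-8 code directly as a 32-dimensional binary embedding could look like the sketch below. The helper name `utf8_bits` and the choice of padding shorter codes with zero bytes on the right are assumptions made only for illustration.

```python
def utf8_bits(char):
    """Return the UTF-8 bytes of one character as a list of 32 bits."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # Pad codes shorter than 4 bytes with zero bytes (assumed convention).
    raw = char.encode("utf-8").ljust(4, b"\x00")
    # Most significant bit of each byte first.
    return [(byte >> shift) & 1 for byte in raw for shift in range(7, -1, -1)]


if __name__ == "__main__":
    print(utf8_bits("a"))
```

Unlike the multi-hot codes, this vector is dense in information and two unrelated symbols can differ by a single bit, which is the trade-off the comparison is meant to measure.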



The training is done with the following configuration:

    Batch Size: the number of symbols, each batch contains every symbol in the domain exactly once
    Network Configuration: Autoencoder
    Loss: CrossEntropy
    
And we measure:

    Output Vector Embedding Size
    Execution Time
    Matrix Size on Disk (here only the dense matrix is taken into account as there should be close to no sparsity)
    

The experiments on overfitting were run on different vector sizes, from 32 to 128, for encodings using 3 and 4 segments.

**TODO do the runs again (with a better and more clean script) and put here the results**

## Sparse Codes

In this section we develop a methodology and different codes with different vector sizes and distances. All the generated codes are saved and later used for testing in NLP tasks.

Two different ideas are used to create the sparse codes. The first one is to choose k from N; the second is to generate a multi-hot by combining different smaller codes of co-prime sizes, which leads to longer cycles in the combination of the codes.

Although any code order might be enough, we look for a repeatable, predictable and deterministic process that reaches the same values each time we recompute the code.

## Sparse Codes, Choosing *k* of *N* ${N\choose k}$ 

For completeness, this study also deals with different sparse coding techniques. The basic idea is to choose $k$ active elements out of a vector of dimension $N$.

We basically need to compute ${N\choose k}$,

where $32 \le N \le M$, with $M$ defined as the maximum value of the desired vector dimension,

and $k$ should be minimized to increase the sparsity of the vector as much as possible.

We can again add some redundancy as in the previous multi-hot code, i.e. the first 4 elements should indicate which of the UTF-8 segments is being used for the code point.

${N\choose k} = \frac{N!}{k!(N-k)!}$
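The formula can be used to search for the smallest $N$ that offers enough code words for a given $k$. The sketch below is not the `sparse_Nk_dimension_analysis` function from the accompanying code; `n_choose_k` and `smallest_n` are illustrative names, and no lower bound on $N$ is enforced.

```python
from math import factorial


def n_choose_k(n, k):
    """Binomial coefficient N! / (k! (N - k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))


def smallest_n(code_points, k):
    """Smallest vector size N such that N-choose-k covers all code points."""
    n = k
    while n_choose_k(n, k) < code_points:
        n += 1
    return n


if __name__ == "__main__":
    for code_points, k in [(128, 2), (1984, 3), (59328, 4), (1107904, 5)]:
        n = smallest_n(code_points, k)
        # Columns: needed codes, available codes, N, k, sparsity ratio k / N
        print(code_points, n_choose_k(n, k), n, k, round(k / n, 3))
```

The sparsity ratio is simply $k / N$, the fraction of active elements in each vector.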

TODO There might be an issue with this method (to explore in the validation part): we **cannot** use a multi-softmax decoding method; instead, classic multi-label classification methods need to be used (a Sigmoid layer), followed by vector similarity. In the case of mixing both the ${N\choose k}$ and co-prime methods, the decoding part of the network architecture gets a bit more complex.

The vector sizes are explored here:

In [2]:
from sparse_encoders import sparse_Nk_dimension_analysis

In [3]:
%%time
results = sparse_Nk_dimension_analysis()
len(results)

CPU times: user 25.1 ms, sys: 434 µs, total: 25.6 ms
Wall time: 25.2 ms


18

In [4]:
# results: (code points needed, possible code points, vector size, number of  ones in code, sparsity ratio)
results

[(128, 136, 17, 2, '0.118'),
 (128, 165, 11, 3, '0.273'),
 (128, 210, 10, 4, '0.400'),
 (128, 252, 10, 5, '0.500'),
 (128, 210, 10, 6, '0.600'),
 (1984, 2016, 64, 2, '0.031'),
 (1984, 2024, 24, 3, '0.125'),
 (1984, 2380, 17, 4, '0.235'),
 (1984, 2002, 14, 5, '0.357'),
 (1984, 3003, 14, 6, '0.429'),
 (59328, 59640, 72, 3, '0.042'),
 (59328, 66045, 37, 4, '0.108'),
 (59328, 65780, 26, 5, '0.192'),
 (59328, 74613, 22, 6, '0.273'),
 (1107904, 1125180, 190, 3, '0.016'),
 (1107904, 1150626, 74, 4, '0.054'),
 (1107904, 1221759, 45, 5, '0.111'),
 (1107904, 1344904, 34, 6, '0.176')]

We can observe that the sizes of the vectors in these representations are much smaller than the manual one-hot-by-segment code created before. We can again use the same technique as in the first part and add 4 leading vector elements that represent the segment to which the symbol belongs.

There is another point to take into account: the size of the vector is important, as hardware is better adapted to certain sizes. It also has consequences on the kinds of techniques that can be applied, for example grouped convolutions.

It is then convenient to have vector sizes that are powers of two and also multiples of 96 (tensor operation sizes in NVidia tensor cores are of size 96 .... Something to understand better here: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/)

## Sparse Codes, multiple joint co-prime codes

The idea here is to create multiple one-hot codes, each of a prime size (which is the simplest way to choose co-primes), and then join the codes.

Again we look for combinations that minimize the vector size.
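One way to read this construction: the symbol index is reduced modulo each co-prime size and each remainder is one-hot encoded. Because the sizes are pairwise co-prime, every index below their product gets a distinct code (Chinese remainder theorem). The function names below are illustrative and not part of `sparse_encoders`.

```python
def coprime_multihot(index, moduli):
    """Concatenate one one-hot block of (index mod m) per co-prime size m."""
    vec = []
    for m in moduli:
        block = [0] * m
        block[index % m] = 1
        vec.extend(block)
    return vec


def code_stats(moduli):
    """Return (vector size, number of distinct codes, sparsity ratio)."""
    size = sum(moduli)
    capacity = 1
    for m in moduli:
        capacity *= m
    return size, capacity, round(len(moduli) / size, 3)


if __name__ == "__main__":
    print(code_stats((3, 5, 11, 13)))
    print(coprime_multihot(7, (3, 5, 11, 13)))
```

Every code word has exactly one active element per block, so the number of ones is constant across the whole code.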

In [5]:
%%time
from sparse_encoders import multihot_primes_conf_finder
_, codes_1seg, codes_2seg, codes_3seg, codes_4seg= multihot_primes_conf_finder()

CPU times: user 273 ms, sys: 452 µs, total: 274 ms
Wall time: 275 ms


The shortest codes that can handle the needed codebook sizes are:

In [6]:
codes_1seg[0], codes_2seg[0], codes_3seg[0], codes_4seg[0]

(((2, 3, 5, 7), 4, 0.235, 17, 210),
 ((3, 5, 11, 13), 4, 0.125, 32, 2145),
 ((11, 13, 19, 23), 4, 0.061, 66, 62491),
 ((23, 31, 37, 43), 4, 0.03, 134, 1134383))

To build the codes, we pad each code with redundancy so that we get an embedding vector of a size in $[32, 48, 64, 96, 128, 192, 256, 384]$.

We can also look for a defined sparsity in the final vector embeddings. As an extra property, all the vectors of a code have the same number of active elements (and hence the same parity), which is nice to have as a check during decoding.

To take those two elements into account we'll select from the smallest vector sizes; we can see some of these here:

In [7]:
codes_1seg[:5], codes_2seg[:5], codes_3seg[:5], codes_4seg[:5]

([((2, 3, 5, 7), 4, 0.235, 17, 210),
  ((3, 5, 11), 3, 0.158, 19, 165),
  ((2, 5, 13), 3, 0.15, 20, 130),
  ((2, 7, 11), 3, 0.15, 20, 154),
  ((3, 5, 13), 3, 0.143, 21, 195)],
 [((3, 5, 11, 13), 4, 0.125, 32, 2145),
  ((2, 7, 11, 13), 4, 0.121, 33, 2002),
  ((3, 5, 7, 19), 4, 0.118, 34, 1995),
  ((3, 7, 11, 13), 4, 0.118, 34, 3003),
  ((3, 5, 11, 17), 4, 0.111, 36, 2805)],
 [((11, 13, 19, 23), 4, 0.061, 66, 62491),
  ((11, 13, 17, 29), 4, 0.057, 70, 70499),
  ((11, 17, 19, 23), 4, 0.057, 70, 81719),
  ((7, 13, 23, 29), 4, 0.056, 72, 60697),
  ((7, 17, 19, 29), 4, 0.056, 72, 65569)],
 [((23, 31, 37, 43), 4, 0.03, 134, 1134383),
  ((23, 29, 37, 47), 4, 0.029, 136, 1159913),
  ((23, 29, 41, 43), 4, 0.029, 136, 1175921),
  ((17, 37, 41, 43), 4, 0.029, 138, 1108927),
  ((19, 29, 43, 47), 4, 0.029, 138, 1113571)])

Also, to fill the remaining space we can choose to combine co-prime codes with N choose k. This has the advantage of generating different patterns with redundant information that can be exploited by the network later.

### Code Generation

First we generate the codes for the sparse and co-prime methods without filling up to the hardware-optimized size, then we do the filling. All codes are saved to be able to experiment with them later and compare results.
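A deterministic way to generate the ${N\choose k}$ code words is to enumerate the $k$-subsets of the $N$ positions in a fixed order. The sketch below uses the lexicographic order of `itertools.combinations`; this ordering is an assumption, and `sparse_nk_codes` is an illustrative name rather than the `create_sparse_Nk_codes` function used in the cells that follow.

```python
from itertools import combinations, islice


def sparse_nk_codes(n_codes, n, k):
    """Return the first n_codes binary vectors of size n with k active bits."""
    codes = []
    for positions in islice(combinations(range(n), k), n_codes):
        vec = [0] * n
        for pos in positions:
            vec[pos] = 1
        codes.append(vec)
    if len(codes) < n_codes:
        raise ValueError("n-choose-k is too small for the requested codes")
    return codes


if __name__ == "__main__":
    codes = sparse_nk_codes(1984, 24, 3)
    print(len(codes), codes[0], codes[-1])
```

Since the enumeration order is fixed, recomputing the code always yields the same matrix, which matches the repeatability requirement stated earlier.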

In [8]:
from sparse_encoders import create_sparse_Nk_codes, all_multihot_primes

In [9]:
%%time 
codes = create_sparse_Nk_codes()

| Segments | code size | Vector Size | N | k |exec_time (sec) |  Matrix Size in Disk (MB):                | Sparse Matrix Size in Disk (MB): |code path
| 1 | 128 | (128, 17) | 17 | 2 | 0.001 | 0.00 | 0.00 | codes/utf8_sparse_codes-1_N-17_k-2_seg |
| 2 | 1984 | (1984, 24) | 24 | 3 | 0.003 | 0.05 | 0.05 | codes/utf8_sparse_codes-2_N-24_k-3_seg |
| 3 | 59328 | (59328, 37) | 37 | 4 | 0.050 | 2.09 | 2.04 | codes/utf8_sparse_codes-3_N-37_k-4_seg |
| 4 | 1107904 | (1107904, 45) | 45 | 5 | 0.929 | 47.55 | 47.55 | codes/utf8_sparse_codes-4_N-45_k-5_seg |
CPU times: user 924 ms, sys: 60 ms, total: 984 ms
Wall time: 983 ms


In [10]:
%%time 
cp_codes = all_multihot_primes()

| Segments | code size | Vector Size | primes | exec_time (sec) | Matrix Size in Disk (MB): | Sparse Matrix Size in Disk (MB): | code path |
| 1 | 128 | (128, 19) | (3, 5, 11) | 0.001 | 0.00 | 0.00 | codes/utf8_coprime_codes-128_primes-(3, 5, 11)_1_seg |
| 2 | 1984 | (1984, 32) | (3, 5, 11, 13) | 0.001 | 0.06 | 0.07 | codes/utf8_coprime_codes-1984_primes-(3, 5, 11, 13)_2_seg |
| 3 | 59328 | (59328, 66) | (11, 13, 19, 23) | 0.016 | 3.73 | 2.04 | codes/utf8_coprime_codes-59328_primes-(11, 13, 19, 23)_3_seg |
| 4 | 1107904 | (1107904, 134) | (23, 31, 37, 43) | 0.478 | 141.58 | 38.04 | codes/utf8_coprime_codes-1107904_primes-(23, 31, 37, 43)_4_seg |
CPU times: user 425 ms, sys: 72.2 ms, total: 497 ms
Wall time: 496 ms


As we can observe, these codes are light on memory (at least up to 3 segments, which should be enough for most NLP tasks in most languages); with a sparse matrix representation, all the representations are low on memory consumption.

### Combining sparse codes

To comply with efficient hardware vector sizes it might be more advantageous to combine different codes, as they cycle differently. Two representations will be built: one starting from the complete ${N\choose k}$ code and completing it with co-prime coding, and the other starting from the co-prime code and completing it with ${N\choose k}$.

Starting from the ${N\choose k}$ codes we construct smaller codes than when starting from the co-prime technique; this has the consequence of having a complete duplicate code, which gives more redundancy than the smaller codes.

The configuration is chosen somewhat arbitrarily, just trying to reach a good trade-off while keeping the vector dimension fixed to the sizes named above. The configuration is the following:


TODO here make this into a nice looking table and explicit what NCODES[X] is

    # N choose k + coprime multihot
    # code dim, N,k,target dim, prime dim, primes
    Nk_coprimes = [(NCODES[0], 17, 2, 32, 15, (3, 5, 7)),
                   (NCODES[1], 24, 3, 48, 24, (5, 8, 11)),
                   (NCODES[2], 37, 4, 64, 27, (3, 5, 8, 11)),
                   (NCODES[3], 45, 5, 96, 51, (3, 7, 11, 13, 17))]

    # coprime multihot + N choose k
    # code dim, primes, (N, k)
    code_config = [(NCODES[0], (3, 5, 11), (13, 3)),
                   (NCODES[1], (3, 5, 11, 13), (32, 3)),
                   (NCODES[2], (11, 13, 19, 23), (30, 4)),
                   (NCODES[3], (23, 31, 37, 43), (58, 5))]


## Redundant Codes Methods

To add redundancy to the codes we can try several methods. We can do it basically mixing methods, or adding a new one.

~~While linear transformations can achieve multiple reorderings of the input vectors, having different and redundant representations will make it easier for the learning to find the parts that are best to interpret or pay attention to for the task at hand.~~

### Single Cycle multi-one-hot Segmentation Method

Basically this method assigns, within the vector representation, a dimension to each part (segment) of the code. It goes as follows:

Let $C$ be a code of vector dimension $d$, i.e. $dim(C)=d$, where each $c \in C$ is formed only by binary elements $\{0,1\}$.

Let $s$ be a defined subset of the dimensions of $C$, with $dim(s)=ds < dim(C)=d$, and let $s_i$, $i \in [1,ds]$, be its elements.

Let $N$ be a set composed of (different) elements of code $C$ (note that $N$ does not have to be complete in the sense of containing all the possible elements of $C$), where $dim(N)=n$.

$N[s_i]_j = int \left( \dfrac{j}{\dfrac{n}{ds}} \right)$, where $j$ is the index of the vector $N_j$ in $N$, $s_i$ is the $i^{th}$ dimension of the set of dimensions $s$ of $N_j$, and $int(x)$ is the integer part of the value $x$.
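One possible reading of the formula is that the $n$ code words are split into $ds$ consecutive blocks of equal length, and the block number of word $j$ selects which of the $ds$ reserved dimensions is active. The sketch below follows that reading, which is an interpretation and not confirmed by the source; it uses integer arithmetic, `(j * ds) // n`, to avoid floating point rounding, and the function names are illustrative.

```python
def segment_block(j, n, ds):
    """Integer part of j / (n / ds): the block that code word j falls into."""
    if not 0 <= j < n:
        raise ValueError("j must be an index into the n code words")
    return (j * ds) // n


def segment_one_hot(j, n, ds):
    """One-hot vector over the ds reserved dimensions for code word j."""
    vec = [0] * ds
    vec[segment_block(j, n, ds)] = 1
    return vec


if __name__ == "__main__":
    # 8 code words spread over 4 reserved dimensions: two words per block.
    print([segment_block(j, 8, 4) for j in range(8)])
```

Each reserved dimension then completes a single cycle over the whole code book, in contrast to the co-prime blocks, which cycle many times.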

#### Extra Notes on Mixed and redundant Codebooks

Normally a network will create its own redundancy depending on the needs of the application; the goal of this study is to create a code that can also be used for applications that continually learn different tasks in different domains and multiple languages.

While the codebooks generated by this method will be mostly ignored if the application is single-language only (English for example), later adding parallel columns (another study in progress) will make use of different parts of the code, and the embedded redundancies might (to try/prove/future work) allow for easier learning.

In [11]:
from sparse_encoders import create_choose_Nk_coprimes_codes, create_coprimes_choose_Nk_codes


In [12]:
%%time
nkcp = create_choose_Nk_coprimes_codes()

| Segments | code size | Vector Size | N | k | primes |exec_time (sec) |  Matrix Size in Disk (MB):                        | Sparse Matrix Size in Disk (MB): |code path
| 1 | 128 | (128, 32) | 17 | 2 | (3, 5, 7) | 0.001 | 0.00 | 0.01 | codes/utf8_N-17k-2-coprime_codes-128_primes-(3, 5, 7)_1_seg |
| 2 | 1984 | (1984, 48) | 24 | 3 | (5, 8, 11) | 0.002 | 0.09 | 0.10 | codes/utf8_N-24k-3-coprime_codes-1984_primes-(5, 8, 11)_2_seg |
| 3 | 59328 | (59328, 64) | 37 | 4 | (3, 5, 8, 11) | 0.054 | 3.62 | 4.07 | codes/utf8_N-37k-4-coprime_codes-59328_primes-(3, 5, 8, 11)_3_seg |
| 4 | 1107904 | (1107904, 96) | 45 | 5 | (3, 7, 11, 13, 17) | 1.222 | 101.43 | 95.09 | codes/utf8_N-45k-5-coprime_codes-1107904_primes-(3, 7, 11, 13, 17)_4_seg |
CPU times: user 1.19 s, sys: 92.4 ms, total: 1.28 s
Wall time: 1.28 s


In [13]:
%%time 
cpnk = create_coprimes_choose_Nk_codes()

| Segments | code size | Vector Size | N | k | primes |exec_time (sec) |  Matrix Size in Disk (MB):                        | Sparse Matrix Size in Disk (MB): |code path
| 1 | 128 | (128, 32) | 13 | 3 | (3, 5, 11) | 0.001 | 0.00 | 0.01 | codes/utf8_coprime_codes-128_primes-(3, 5, 11)_N-13k-3_1-seg |
| 2 | 1984 | (1984, 64) | 32 | 3 | (3, 5, 11, 13) | 0.003 | 0.12 | 0.12 | codes/utf8_coprime_codes-1984_primes-(3, 5, 11, 13)_N-32k-3_2-seg |
| 3 | 59328 | (59328, 96) | 30 | 4 | (11, 13, 19, 23) | 0.040 | 5.43 | 2.98 | codes/utf8_coprime_codes-59328_primes-(11, 13, 19, 23)_N-30k-4_3-seg |
| 4 | 1107904 | (1107904, 192) | 58 | 5 | (23, 31, 37, 43) | 3.118 | 202.86 | 85.58 | codes/utf8_coprime_codes-1107904_primes-(23, 31, 37, 43)_N-58k-5_4-seg |
CPU times: user 2.95 s, sys: 205 ms, total: 3.15 s
Wall time: 3.16 s


As the codes that seem the most useful for most languages are the ones derived from 2 and 3 segments of UTF-8, we dedicate special attention to them and generate an extra code here in two versions, with two-fold and three-fold redundancy.
The two-fold redundancy version will be of a non-compliant vector size (not in the vector size list specified in the section above); the other will be of size $2^n$.

#### 2 Segments Code

For completeness and to have extra redundancy we also create a code that contains both the complete co-prime and ${N \choose k}$ codes for the 2 segment case (the other cases are complete already), that is, a vector of size 64 where:

| coprimes | N | k | Size | Target Size | remaining |
|----------|---|---|------|-------------|-----------|
| (3, 5, 11, 13) | 24 | 3 | 56 | 64 | 8 |

We need to use the remaining elements to create a good enough repetition that is not in the same period/cycle as the other existing parts of the code (the reason why the co-prime code is created instead of using another method). The idea is to maximize the information obtained when looking at only part of the code.

Some ideas on how to use the remaining 8 dimensions are:
1. One piece of information that we can add is the segment to which the code belongs, using 2 dimensions and keeping 6 for another coding.
2. Using a co-prime coding of 3 and 5 -> issue: already used (maybe creating and sorting it?)
3. Using an ${8 \choose 2}$ coding, but this is a sub-period of ${24 \choose 3}$; the same applies to ${6 \choose 2}$
4. We can do ${5 \choose 2}$ and ${3 \choose 1}$; this might just work **(selected)**
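The selected option can be sketched as a 64-dimensional vector made of the co-prime code (32 dimensions), the ${24 \choose 3}$ code (24), and the two short fillers ${5 \choose 2}$ (5) and ${3 \choose 1}$ (3). The block order, the lexicographic ordering of the subsets and the way the short fillers wrap around are all assumptions made for this illustration; `two_segment_code` is not a function of the accompanying code.

```python
from itertools import combinations

COPRIMES = (3, 5, 11, 13)                 # 3 + 5 + 11 + 13 = 32 dimensions
CHOOSE_PARTS = ((24, 3), (5, 2), (3, 1))  # 24 + 5 + 3 = 32 dimensions

# Pre-compute every k-subset once, in lexicographic order (assumed ordering).
SUBSETS = {(n, k): list(combinations(range(n), k)) for n, k in CHOOSE_PARTS}


def two_segment_code(index):
    """Build the 64-dimensional redundant code for one symbol index."""
    vec = []
    for m in COPRIMES:
        block = [0] * m
        block[index % m] = 1
        vec.extend(block)
    for n, k in CHOOSE_PARTS:
        table = SUBSETS[(n, k)]
        block = [0] * n
        # Short fillers wrap around, so they cycle with their own period.
        for pos in table[index % len(table)]:
            block[pos] = 1
        vec.extend(block)
    return vec


if __name__ == "__main__":
    print(two_segment_code(0))
```

Each vector has 4 + 3 + 2 + 1 = 10 active elements out of 64, and the fillers repeat with periods 10 and 3, which differ from the periods of the co-prime blocks.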



TODO: For completeness we can also create a vector of size 64 with different co-prime techniques, using some of the codes sorted as in point 2 above. This has the advantage of being able to use Softmax in the decoding layer.




#### 3 Segments Code

For completeness and extra redundancy we also create a code that contains both a complete co-prime code and a complete ${N \choose k}$ code for the 3 segment case (the other cases are complete already), that is, a vector of size 128 where:

| coprimes | N | k | Size | Target Size | remaining |
|----------|---|---|------|-------------|-----------|
| (11, 13, 19, 23) | 37 | 4 | 103 | 128 | 25 |

The remaining 25 elements needed to complete dimension 128 are again treated as an extra redundancy element. We can add several types, but the idea is to keep not only redundancy but also sparsity: doing a ${25 \choose x}$ for $x \in [2,3,4,5,6]$ works correctly at sparsities $[0.078, 0.086, 0.093, 0.10, 0.11]$ respectively. If $x=6$ the code is complete again, giving a triplicate with different patterns that should allow the network to use the information differently.

As in the 2 segments code, we can use the remaining dimensions to fill our needs. There are mainly two ideas to work on:
* Do a complete coding with ${25 \choose x}$ **(selected)**. This is the simplest one; we choose $x=4$ arbitrarily as it keeps sparsity below $10\%$ and gives a big enough cycle of ${25 \choose 4} = 12650$.
* Do a partial coding with 3 dimensions to encode the segment the symbol belongs to, plus a ${22 \choose x}$ coding. This method also makes a complete code if $x=5$ and keeps the same sparsity as the previous option, but carries specific redundant information about the UTF-8 code block.
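
The sparsity and cycle figures for the ${25 \choose x}$ options can be reproduced with a short calculation. This is a sketch under one assumption: sparsity is counted as the number of ones divided by the vector size, with 4 ones from the co-prime part, 4 ones from the ${37 \choose 4}$ part and $x$ ones from the spare 25 dimensions.

```python
from math import comb

VECTOR_SIZE = 128
BASE_ONES = 4 + 4    # assumption: 4 co-prime one-hots + k=4 ones of the 37-choose-4 part
SPARE_DIMS = 25
CODEPOINTS = 59328   # symbols covered by the 3 segment code

rows = []
for x in range(2, 7):
    sparsity = (BASE_ONES + x) / VECTOR_SIZE
    cycle = comb(SPARE_DIMS, x)          # patterns available before repeating
    is_complete = cycle >= CODEPOINTS    # one distinct pattern per symbol?
    rows.append((x, round(sparsity, 3), cycle, is_complete))

for row in rows:
    print(row)
```

With this counting, only $x=6$ gives a cycle that covers all 59328 symbols on its own; smaller values of $x$ repeat patterns and rely on the other parts of the code to disambiguate.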


TODO: For completeness we can also create vectors of size 96 and 128 with different co-prime techniques, using some of the codes sorted as in point 2 above. This has the advantage of being able to use Softmax in the decoding layer.



## Preliminary Testing Decisions

Due to time and resource restrictions, the first set of experiments will only cover the following setting, decided after some thought:

#### Code:
* The code will be a multi-hot sparse coding (already decided beforehand)
* The code will be a co-prime + $N \choose k$ method due to vector size
~~* The code will be set as a coprime based method (easier to apply multiple Softmax)~~
~~* The redundancy will be set as a coprime based method, but segmented by the total number of elements in the code divided by the cycle of the prime (Single Cycle Segmentation Method above).~~
* The code will comprise 2 ~~and 3~~ segments of UTF-8 only, ~~the 3 segment one being the most significant for a complete multi-lingual setup and~~ the 2 segment one being much smaller and thus faster to encode and decode for simpler testing purposes.
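
As an illustration of the co-prime + $N \choose k$ decision, the following sketch assembles a 2 segment codebook in plain Python. It is not the `sparse_encoders` implementation: the function names are hypothetical, the co-prime part is assumed to be a concatenation of one-hot blocks of `index % p`, the $N \choose k$ part is assumed to enumerate combinations in lexicographic order, and the 8 spare dimensions are simply left at zero.

```python
from itertools import combinations, islice


def coprime_part(index, primes):
    """Concatenate one one-hot block per prime, set at position index % p."""
    bits = []
    for p in primes:
        block = [0] * p
        block[index % p] = 1
        bits.extend(block)
    return bits


def build_codebook(n_codes, primes, n, k, target_size):
    """Build n_codes multi-hot vectors of length target_size (illustrative only)."""
    patterns = islice(combinations(range(n), k), n_codes)
    codebook = []
    for index, ones in enumerate(patterns):
        choose_block = [0] * n
        for pos in ones:
            choose_block[pos] = 1
        vec = coprime_part(index, primes) + choose_block
        vec += [0] * (target_size - len(vec))  # spare dimensions unused in this sketch
        codebook.append(vec)
    return codebook


codebook = build_codebook(1984, (3, 5, 11, 13), n=24, k=3, target_size=64)
print(len(codebook), len(codebook[0]), sum(codebook[0]))
```

Each vector in this sketch has 7 ones out of 64 dimensions (4 from the co-prime blocks and 3 from the combination), and all 1984 vectors are distinct.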


#### Character Level Encoder:
* The encoder layers will first pass through a few (maybe 2 or 3) linear layers on the vector dimension (spatial) before the temporal part of the NN 

#### Character Level Decoder
* The decoder will be composed first of a transformer layer with the same number of heads as the number of ones in each vector of the codebook
~~* The decoder will finish (optionally and by configuration) by a multi-softmax layer (multiple softmax, one for each part of the code that contains a one-hot)~~
* The decoder will work directly on the vector (spatial) dimension only
* The final decoding will be done with a vector similarity library (faiss)


In [14]:
# Codes for the testing, include 2 and 3 segments
from sparse_encoders import create_prelim_testing_codes


In [15]:
%%time
test_codes = create_prelim_testing_codes()

| Segments | code size | Vector Size | N | k | primes | cycles | exec_time (sec) | Matrix Size in Disk (MB) | Sparse Matrix Size in Disk (MB) | code path |
|----------|-----------|-------------|---|---|--------|--------|-----------------|--------------------------|---------------------------------|-----------|
| 2 | 1984 | (1984, 64) | 24 | 3 | (3, 5, 11, 13) | (6, 2) | 0.005 | 0.12 | 0.13 | codes/utf8_2-seg_1984-codepoints_64-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(6, 2)_dense |
| 3 | 59328 | (59328, 128) | 37 | 4 | (11, 13, 19, 23) | (11, 7, 4, 3) | 0.080 | 7.24 | 4.49 | codes/utf8_3-seg_59328-codepoints_128-dim_N-37-k4_coprimes-(11, 13, 19, 23)_cycles-(11, 7, 4, 3)_dense |
CPU times: user 79.3 ms, sys: 7.77 ms, total: 87.1 ms
Wall time: 86.2 ms


In [16]:
# test_codes

As we can see, the codes are really small (in boolean dtype) and should not cause any memory issue. The only issue is that the Encoder type in PyTorch asks for Float as the input matrix and Long as the index code in the encoding. It would be a great memory saving to make that type in PyTorch handle the boolean and 2-byte unsigned int (uint16) representations, as uint16 can represent the whole spectrum needed for these embeddings.

This remains a TODO task.
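
To give a rough idea of the potential saving, the following sketch compares the storage of the two codebooks and of the indices under different dtypes. The byte widths are assumptions (1 byte per boolean, 4 per float32, 2 per uint16, 8 per int64), and the helper names are hypothetical.

```python
BYTES = {"bool": 1, "float32": 4, "uint16": 2, "int64": 8}  # assumed widths


def mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / 2**20


def codebook_mib(n_codes, dims, dtype):
    """Dense storage of an (n_codes, dims) codebook in the given dtype."""
    return mib(n_codes * dims * BYTES[dtype])


shapes = {"2-seg": (1984, 64), "3-seg": (59328, 128)}
for name, (n_codes, dims) in shapes.items():
    print(name,
          f"bool: {codebook_mib(n_codes, dims, 'bool'):.2f} MiB,",
          f"float32: {codebook_mib(n_codes, dims, 'float32'):.2f} MiB")

# Every index of the 3 segment code fits in an unsigned 16 bit integer.
fits_uint16 = 59328 <= 2**16
index_saving = BYTES["int64"] // BYTES["uint16"]  # bytes saved per stored index, as a ratio
print(fits_uint16, index_saving)
```

Under these assumptions the boolean sizes agree with the dense matrix sizes reported in the table above, the float version is four times larger, and uint16 indices would take a quarter of the space of 64 bit ones.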

## Decoding

Decoding the vectors can be done by cosine similarity (or any other vector similarity method); in this case we use the [faiss](https://github.com/facebookresearch/faiss) library from Facebook AI Research.
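
The idea of similarity-based decoding can be sketched with a brute-force search in plain Python. This is a stand-in to illustrate the principle on a toy codebook, not the faiss API; the function names and the toy data are made up for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def decode(vector, codebook):
    """Return the index of the codebook row most similar to `vector`."""
    return max(range(len(codebook)), key=lambda i: cosine(vector, codebook[i]))


# Toy multi-hot codebook with three symbols.
toy_codebook = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]
noisy_output = [0.1, 0.8, 0.2, 0.0, 0.7, 0.1]  # imperfect network output
decoded_index = decode(noisy_output, toy_codebook)
print(decoded_index)
```

A dedicated similarity library does the same nearest-neighbour lookup far more efficiently once the codebook has tens of thousands of rows.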

There are other ways of dealing with decoding, since the vector is binary and we know its parity beforehand.

One trick for decoding is to replace the single big Softmax layer at the end with multiple ones (Multi-Softmax), one per part of the code, as given by the code configuration. It is known in the domain that big softmax layers are one of the performance issues, so much so that [multiple](https://openreview.net/pdf?id=HkwZSG-CZ) [studies](https://papers.nips.cc/paper/7312-sigsoftmax-reanalysis-of-the-softmax-bottleneck.pdf) [try](https://papers.nips.cc/paper/7130-svd-softmax-fast-softmax-approximation-on-large-vocabulary-neural-networks.pdf) [to](https://arxiv.org/pdf/1609.04309.pdf) [limit](https://arxiv.org/pdf/1805.02867.pdf) [the](https://arxiv.org/pdf/1901.10668.pdf) [performance](https://arxiv.org/pdf/1812.05737.pdf) issue generated by them. (TODO add more references; I remember a Google paper that cuts the number of trainable parameters in an NLP network by repeating all the transformer blocks and adding a technique at the input and output level to make the input smaller)

Small softmax layers are much faster to process than big ones, although this remains to be measured as future work, depending on the actual size of the softmax (too small vectors might not take real advantage of GPGPUs). For really small vectors it might be good enough to do some clipping and binarization, or to just set the biggest element to 1 and the rest to 0, which would have the issue of not being differentiable.
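
A minimal sketch of the Multi-Softmax idea on the co-prime part of the 2 segment code follows. It rests on assumptions that are not confirmed by the text: the co-prime part is taken to be a concatenation of one-hot blocks of `index % p`, and the symbol index is recovered from the residues by a brute-force Chinese remainder search, which is an addition for illustration. All names are hypothetical.

```python
from math import exp

PRIMES = (3, 5, 11, 13)  # co-prime block sizes of the 2 segment code


def softmax(xs):
    """Numerically stable softmax over a short list."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def split_parts(vector, sizes):
    """Cut a flat vector into consecutive blocks of the given sizes."""
    parts, start = [], 0
    for size in sizes:
        parts.append(vector[start:start + size])
        start += size
    return parts


def multi_softmax_decode(logits, primes=PRIMES):
    """One small softmax per block, then recover the index from the residues."""
    residues = []
    for part in split_parts(logits, primes):
        probs = softmax(part)
        residues.append(probs.index(max(probs)))  # argmax, inference only
    modulus = 1
    for p in primes:
        modulus *= p
    # Brute-force Chinese remainder search; the moduli are tiny.
    for candidate in range(modulus):
        if all(candidate % p == r for p, r in zip(primes, residues)):
            return candidate
    return None


def fake_logits(index, primes=PRIMES):
    """Build example logits that favour the blocks of `index` (test data only)."""
    logits = []
    for p in primes:
        block = [-2.0] * p
        block[index % p] = 3.0
        logits.extend(block)
    return logits


decoded = multi_softmax_decode(fake_logits(1000))
print(decoded)
```

The largest softmax here has 13 entries instead of 1984, which is the source of the expected speed-up; the argmax step is only meant for inference, since it is not differentiable.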

Ideas: TODO Some different techniques should be tested, as a mixture of:

* Last layer as Sigmoid -> this can be used by all the codebooks
* Last layer as Multi-Softmax per part of code (dependent on the input codebook but pre-defined); this can only be done for the co-prime method.

With: 

* Vector Clipping
* Binarization
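
The clipping and binarization steps listed above could look like the following sketch. The function names, the 0.5 threshold and the top-k variant (which uses the known number of ones per code) are illustrative choices, not settings taken from the experiments.

```python
from math import exp


def sigmoid(x):
    """Logistic function for moderate inputs."""
    return 1.0 / (1.0 + exp(-x))


def clip(xs, low=0.0, high=1.0):
    """Clamp every element to the interval [low, high]."""
    return [min(max(x, low), high) for x in xs]


def binarize_threshold(xs, threshold=0.5):
    """Set elements at or above the threshold to 1 and the rest to 0."""
    return [1 if x >= threshold else 0 for x in xs]


def binarize_top_k(xs, k):
    """Set the k largest elements to 1, using the known number of ones per code."""
    top = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]
    out = [0] * len(xs)
    for i in top:
        out[i] = 1
    return out


raw = [2.0, -1.0, 0.3, -3.0, 1.5, -0.2]      # example last-layer outputs
probs = [sigmoid(x) for x in raw]
by_threshold = binarize_threshold(probs)
by_top_k = binarize_top_k(probs, k=2)
clipped = clip(raw)
print(by_threshold, by_top_k, clipped)
```

Both binarization variants are non-differentiable, so in this sketch they are post-processing applied to the network output before the similarity lookup rather than layers to train through.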

Note: Before the last decoding layer we can use Transformer blocks with as many heads as there are ones in the vector; each should be able to target attention to a different important vector dimension.

### Decoding on Dedicated Hardware

The other advantage of small softmax layers is that they can be implemented in hardware (ASIC or [FPGA](https://www.researchgate.net/publication/332525912_Efficient_FPGA_Implementation_of_Softmax_Function_for_DNN_Applications), for which there is [available](https://dl.acm.org/doi/10.1145/3299874.3317988) [prior](https://arxiv.org/pdf/1808.09945.pdf) [work](https://arxiv.org/pdf/1711.05860.pdf)), so a specific text encoder and decoder can be implemented for the given encoding techniques and configurations.

There are also [available](https://ieeexplore.ieee.org/document/8794567) [techniques](https://link.springer.com/chapter/10.1007/978-3-319-32055-7_28) to do similarity search on hardware (which is a need especially in [biological applications](https://www.researchgate.net/publication/266178553_MASS-SIMILARITY_SEARCH_OF_BIOLOGICAL_SEQUENCES_USING_FPGA)). The fact that the vector sizes and the number of elements are fixed makes it feasible to create and maintain the vectors in memory, and it should be studied whether applying error correcting codes adds any advantage.

## Method Validation

To validate this method it is sufficient to show that the performance of a network does not decay compared to one-hot encoding in several tasks. This article deals with this in a restricted environment.

There are a few key points to measure:

* Pre-processing time (dataset) for each method
* Network Performance 
* Network Size (total and trainable parameters)
* Network Memory Consumption
* Network Training Time

To this end, simple enough NLP tasks will be tackled, such that the training and testing time is not excessive (running on a single consumer GPU card, in this case an RTX 2080 Ti).

The tasks will be evaluated on the same networks (except for the first embedding layer) with different encodings; to be able to compare networks, everything will be done at character level. For completeness some other methods will also be evaluated, mainly BPE, which is currently used in most SoTA papers.

## Results and Discussion

The TODO here

## Conclusion

This work shows a different take on the current approach to input coding for Natural Language Processing tasks on Deep Learning Networks.

This work shows that it is possible to reduce computational complexity *and* add representational capability to a deep neural network without loss of performance while making pre-processing easier .... TODO here.

## Future Work:

This study is the first part of a deeper study on how to make networks train faster and be able to run on consumer grade GPUs in a competitive way (even if they are not SoTA). ..... TODO here