# Flexible Universal Character Level Encoding method for Text in NLP

    Leonardo M. Rocha
    leo <dot> m <dot> rocha <at> gmail <dot> com
    

## Abstract

Current NLP methods deal with *Out of Vocabulary* (OOV) symbols in different ways, which leads to several, not necessarily efficient, ways of pre-processing NLP datasets to handle them.
In this work we present **Segment-Multihot-Encoding** (SME), a computationally efficient technique that deals with OOV by encoding all (or part) of the UTF-8 domain in a fixed multi-hot encoding, which can be further compressed by overfitting. This technique eliminates the need for complex and compute-intensive pre-processing, replacing it with a much simpler one that works for *any* dataset. This work focuses on being able to encode a symbol; once a symbol can be encoded, the network can be fine-tuned later to handle previously unseen ones.

This work presents the SME technique for UTF-8, which we call **SME-UTF8**. The source code and encoding vectors are also provided, and examples are shown in other notebooks.

## Introduction and Related Work

Currently, NLP tasks require first analyzing the input domain and encoding it in a way that deals with *Out of Vocabulary* (OOV) words or symbols and with polysemy.

It is important to separate (and we do so in this work) the encoding part (being able to represent a symbol) from learning to use those symbols (the network being able to do something useful with them), as this work focuses solely on being able to encode all feasible symbols in a defined text encoding domain. In this case the work is done for UTF-8, which is the most used text encoding on the web ([94.6% according to w3techs](https://w3techs.com/technologies/details/en-utf8)).



Different techniques deal differently with OOV. They range from techniques that cannot deal with them, like [GloVe - Pennington et al. 2014](https://nlp.stanford.edu/pubs/glove.pdf) or [Word2Vec - Mikolov et al. 2013](https://arxiv.org/abs/1301.3781), to others, such as [Universal Sentence Encoder - Cer et al. 2018](https://arxiv.org/abs/1803.11175) or FastText, that can encode OOV words with subword embeddings.
One of the most used techniques is [Byte Pair Encoding, from Neural Machine Translation of Rare Words with Subword Units - Sennrich et al. 2015](https://arxiv.org/abs/1508.07909), which has the advantage of compressing the input size and hence accelerating training compared to a full character-level input on the current SoTA.


[ELMo - Peters et al. 2018](https://arxiv.org/abs/1802.05365)

[Transformer - ]()
[ULM-FiT - ]()
[BERT - ]()
[AlBERT - ]()
[CamemBERT]()


... TODO more papers and references here



All current methods deal with subdomains of the possible inputs, which is enough for most tasks. Nevertheless, their weakness is that they cannot deal with **all** possible input symbols, which for the current study means all of UTF-8.

In the case of continual learning, the need to add new symbols is a given, be it due to adding a new domain in the same language or adding new languages.

The goal of this work is to try to set a more standard way to deal with all possible symbols in a defined encoding standard.

This work analyzes the UTF-8 encoding and presents a technique to encode part or all of the UTF-8 domain in a computationally efficient way. The same technique can be used for other text encodings, and as UTF-8 is backward compatible with ASCII and covers the whole Unicode code space, the same encoding matrix can be applied without any modification to ASCII datasets and to any Unicode text converted to UTF-8.

There is another point to make about computational complexity: all the current SoTA methods are trained on clusters (and at prices) that are unavailable to most users. The current work is part of a larger effort to get enough performance on commercially available (and relatively accessible) single GPUs for end users (at the time of writing, the NVIDIA RTX 2080 Ti is one of the most computationally powerful cards on the market).


## UTF-8 Analysis

### One-Hot encoding

[One-hot encoding](https://en.wikipedia.org/wiki/One-hot) is one of the most used ways to encode categorical variables. In State of the Art NLP tasks it is used to encode the input symbols. This is computationally expensive, and the goal here is to reduce this complexity, leaving memory and compute for other, more complex tasks in the network.

### Number of code-points

As this work tries to encode all the characters representable in UTF-8, we first have to check how many there are.

From [Wikipedia UTF-8](https://en.wikipedia.org/wiki/UTF-8):

UTF-8 is a variable-width character encoding capable of encoding all **1,112,064** valid Unicode code points.

$$17 \times 2^{16} = 1{,}114{,}112$$ code points, minus the 2,048 technically-invalid surrogate code points, gives $1{,}114{,}112 - 2{,}048 = 1{,}112{,}064$.
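The count above can be checked directly. The following is a minimal sketch (the constant and function names are illustrative, not part of the released source code) that redoes the arithmetic and then confirms it by asking Python's standard UTF-8 codec which code points it accepts.

```python
# Arithmetic from the text: 17 planes of 2**16 code points, minus the surrogates.
PLANES = 17
TOTAL_CODE_POINTS = PLANES * 2**16            # 1,114,112
SURROGATES = 0xDFFF - 0xD800 + 1              # 2,048 (U+D800..U+DFFF)
VALID_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATES


def count_encodable_code_points() -> int:
    """Count the code points that Python's UTF-8 codec can encode."""
    count = 0
    for code_point in range(TOTAL_CODE_POINTS):
        try:
            chr(code_point).encode("utf-8")
        except UnicodeEncodeError:
            continue  # surrogates cannot be encoded as UTF-8
        count += 1
    return count


print(VALID_CODE_POINTS)  # 1112064
```

Calling `count_encodable_code_points()` walks the whole code space once and should return the same value as `VALID_CODE_POINTS`.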

That is, with one-hot encoding we would need about 1.1M parameters per neuron in the input layer, which is expensive. The goal is to reduce this complexity (which we argue is unnecessary) by orders of magnitude.

## UTF-8 structure and Encoding Details

The UTF-8 universe does not cover the entire $2^{32}$ domain; its restrictions are explained in [the UTF-8 description](https://en.wikipedia.org/wiki/UTF-8) and summarized in the following table:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|----------------|--------------------|-----------------|----------------|----------|----------|----------|----------|
| 1              | 7                  | U+0000          | U+007F         | 0xxxxxxx |          |          |          |
| 2              | 11                 | U+0080          | U+07FF         | 110xxxxx | 10xxxxxx |          |          |
| 3              | 16                 | U+0800          | U+FFFF         | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 4              | 21                 | U+10000         | U+10FFFF       | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
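The table can be read as a small decision rule on the code point value. As an illustrative sketch (the helper names `utf8_length` and `utf8_bits` are hypothetical), the following returns the number of bytes for a code point and prints the bit pattern of an encoded character, so the prefixes `0`, `110`, `1110`, `11110` and `10` can be seen directly.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes for a code point, following the table rows."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf8_bits(ch: str) -> str:
    """Bit pattern of each UTF-8 byte of a character, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


print(utf8_bits("A"))        # 01000001
print(utf8_bits("\u00e9"))   # 11000011 10101001
```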

UTF-8 codes fall into 4 segments, one per row of the table (that is, one per encoded length in bytes). We will refer to these segments often in the current work.

The number of elements in the table is therefore at most $2^{21}$, since the longest code carries 21 payload bits. All that is needed is an index that can handle the 4 cases, which can be done with 4 different conversion tables.


In fact, it is possible to just cut the UTF-8 value into chunks and apply one-hot encoding to each part:
- There are only 4 segment ranges, which can be coded as a one-hot vector of size 4. This adds redundancy, and Hamming or another ECC could also be added there.
- The largest chunk holds 7 bits -> 128 values.
- The other chunks hold 6 bits -> 64 values.

The prefix of each byte can be dropped, since the initial one-hot segment vector replaces it.

So a complete code would have dimension $4 + 128 + 64 + 64 + 64 = 324$.

Instead of having a vector of dimension 1,112,064 to encode any utf-8 value, one with dimension 324 (even 320) would be able to encode everything.
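The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the block order (4 segment flags, then the 128-wide first-byte block, then three 64-wide continuation blocks) and the names `sme_encode`, `BLOCK_SIZES` and `FIRST_BYTE_MASK` are choices made here for the example.

```python
SEGMENT_FLAGS = 4                      # one-hot over the 4 UTF-8 segments
BLOCK_SIZES = (128, 64, 64, 64)        # first-byte payload, then continuation payloads
CODE_DIM = SEGMENT_FLAGS + sum(BLOCK_SIZES)   # 324

# Masks that strip the UTF-8 prefix from the first byte, indexed by byte count.
FIRST_BYTE_MASK = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}


def sme_encode(ch: str) -> list:
    """Multi-hot code of a single character (assumed layout, see lead-in)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    raw = ch.encode("utf-8")
    n_bytes = len(raw)
    # Payload of every byte with its prefix bits removed.
    payloads = [raw[0] & FIRST_BYTE_MASK[n_bytes]]
    payloads += [byte & 0x3F for byte in raw[1:]]

    vector = [0] * CODE_DIM
    vector[n_bytes - 1] = 1            # segment flag
    offset = SEGMENT_FLAGS
    for payload, size in zip(payloads, BLOCK_SIZES):
        vector[offset + payload] = 1   # one-hot inside this block
        offset += size
    return vector


code = sme_encode("a")
print([i for i, bit in enumerate(code) if bit])  # [0, 101]
```

Each character sets between 2 and 5 positions: one segment flag plus one position per UTF-8 byte.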

This embedding can still be reduced, but it should already be sparse enough to make a good input. The goal here is to have a sparse vector in which each code is far enough from the others, at least by one dimension. Adding the redundancy code (the first 4 dimensions) makes the distance even bigger for vectors that should be further apart, taking into account the locality of the UTF-8 encoding (each character set is close to the ones used with it; segment 3, for example, encodes mostly CJK, that is, Chinese-Japanese-Korean).
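The distance argument can be made concrete with the Hamming distance between codes. The sketch below assumes the same block layout as the dimension count above (4 flags, then blocks of 128, 64, 64 and 64) and stores each code sparsely as the set of its active positions; the names `active_indices` and `hamming` are illustrative.

```python
FIRST_BYTE_MASK = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}
BLOCK_OFFSETS = (4, 132, 196, 260)   # start of each payload block (assumed layout)


def active_indices(ch: str, with_flags: bool = True) -> frozenset:
    """Positions set to 1 in the multi-hot code of a single character."""
    raw = ch.encode("utf-8")
    n_bytes = len(raw)
    payloads = [raw[0] & FIRST_BYTE_MASK[n_bytes]]
    payloads += [byte & 0x3F for byte in raw[1:]]
    indices = {offset + p for offset, p in zip(BLOCK_OFFSETS, payloads)}
    if with_flags:
        indices.add(n_bytes - 1)     # the redundant segment flag
    return frozenset(indices)


def hamming(a: str, b: str, with_flags: bool = True) -> int:
    """Number of positions in which the codes of two characters differ."""
    return len(active_indices(a, with_flags) ^ active_indices(b, with_flags))


print(hamming("a", "b"))                                  # 2
print(hamming("\x03", "\u00e9", with_flags=False))        # 1
print(hamming("\x03", "\u00e9"))                          # 3
```

In this sketch two neighbouring characters of the same segment differ in two positions, while the pair U+0003 and U+00E9, which share the same first-byte payload, goes from distance 1 to distance 3 once the segment flags are included.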

#### Notes
It is worth noting here that the first author also has experience in communications, which allowed, during the course of this research, the analysis of multiple Error Correcting Codes (ECCs) and different kinds of encoding (for example, encoding as a Fourier series). The conclusion is that even if one-hot is the best in distance, other codes can be used, and a sparse multi-hot code is the simplest to implement (and the fastest to encode). One pending task is to analyze ECCs in an end-to-end manner for a neural network.

## Encoding details

### UTF-8 Segments
To cut memory consumption even further, the table can be generated for 1 to 4 segments of the UTF-8 code, taking into account that the 4th segment is mostly composed of:
* Supplementary Multilingual Plane (SMP) of historic scripts
* Supplementary Special-purpose Plane (SSP)
* Supplementary Private Use Areas (SPUA-A and SPUA-B)
* Invalid Codes

We can safely ignore this 4th segment (for the purpose of this article and most usages), even though it accounts for most of the code points.

If CJK, Indic and some Miscellaneous Symbols are not (and will not be) needed, then the 3rd segment can be safely ignored too, reducing the memory consumption of the application even further.
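To see how much each segment contributes, the valid code points per segment can be counted from the ranges in the UTF-8 table. This is a sketch with illustrative names; it counts valid Unicode code points by range (excluding surrogates), so the totals need not match counts obtained by enumerating byte patterns with a particular codec.

```python
# First and last code point of each segment, from the UTF-8 table.
SEGMENT_RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def code_points_in_segment(segment: int) -> int:
    """Valid code points whose UTF-8 form has exactly `segment` bytes."""
    first, last = SEGMENT_RANGES[segment]
    count = last - first + 1
    if first <= SURROGATE_FIRST and SURROGATE_LAST <= last:
        count -= SURROGATE_LAST - SURROGATE_FIRST + 1   # surrogates are invalid
    return count


def code_points_up_to(segments_used: int) -> int:
    """Valid code points covered when using segments 1..segments_used."""
    return sum(code_points_in_segment(s) for s in range(1, segments_used + 1))


for used in range(1, 5):
    print(used, code_points_in_segment(used), code_points_up_to(used))
```

Under this count the 4th segment holds 1,048,576 of the 1,112,064 valid code points, and dropping the 3rd segment as well leaves 2,048.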

So the result would be:

**TODO do this again with a more clean source code and measure correctly everything giving an output HERE**

| Segments Used | Number of code points | Vector Size | First code point | Last code point | # code exceptions | Size (MB in Disk) |Sparse Size (MB in Disk) |
|---------------|-----------------------|-------------|------------------|-----------------|-------------------|-------------------|-----------|
| 4             | 1,112,064             | 452          |          | 4160 |    |
| 3             | 59328                 | 388        |          | 4224 |      |
| 2             |                  | 324          |          |  |    |
| 1             |                  |          |        |  |    |

Where:
* Segments Used: number of segments used from utf-8 to generate the code
* Number of Code Points: The total encoded code points generated
* Vector Size: The embedding vector size
* First - Last Code Point: corresponds to the segment first and last code-point in the embedding index
* \# Code Exceptions: Number of exceptions raised during encoding with Python; note that we use the standard library for this.
* Size (MB in Disk): Size of the embedding matrix and conversion dictionaries (from-to code) once saved on disk in dense mode
* Sparse Size (MB in Disk): Size of the embedding matrix and conversion dictionaries (from-to code) once saved on disk in sparse mode


Notice that the code for 1 segment corresponds to a one-hot encoding of ASCII (plus the vector of size 4, which we do not modify in any case).
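For the 1-segment case, the embedding matrix and the two conversion dictionaries can be built in a few lines. This is a hedged sketch: it assumes a 128-wide one-hot block preceded by the 4 segment flags, and the names `build_ascii_table`, `char_to_index` and `index_to_char` are hypothetical rather than taken from the released code.

```python
SEGMENT_FLAGS = 4
ASCII_SIZE = 128


def build_ascii_table():
    """Build the 1-segment matrix plus char<->index conversion dictionaries."""
    matrix = []
    char_to_index = {}
    index_to_char = {}
    for code_point in range(ASCII_SIZE):
        row = [0] * (SEGMENT_FLAGS + ASCII_SIZE)
        row[0] = 1                              # flag of the 1-byte segment
        row[SEGMENT_FLAGS + code_point] = 1     # one-hot over the 128 ASCII values
        matrix.append(row)
        char_to_index[chr(code_point)] = code_point
        index_to_char[code_point] = chr(code_point)
    return matrix, char_to_index, index_to_char


matrix, char_to_index, index_to_char = build_ascii_table()

# Encoding a text is then a plain row lookup per character.
encoded = [matrix[char_to_index[ch]] for ch in "Hi"]
print(len(encoded), len(encoded[0]))  # 2 132
```

Because the segment flag is the same for every row, only the 128-wide block distinguishes the characters, which is exactly a one-hot encoding of ASCII.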

## Overfitting Compression

In the literature overfitting is an evil creature, but in this case, as we know the entire domain, we use it to our advantage by overfitting the sparse input (the multi-hot encoded vector) into an embedding vector smaller than the input. The goal here is lossless compression.

This is done to be able to make a more informed decision at the end of the study and show the comparative results.

Once the network is trained (basically an overfitted autoencoder), a new encoding matrix is generated by passing each element of the input domain through the autoencoder and taking its latent vector. The resulting matrix can be given directly to the network as an Embedding.

The training is done with the following configuration:

    Batch Size: 
    Network Configuration:
    Loss:
    
And we measure:

    Output Vector Embedding Size
    Execution Time
    Matrix Size on Disk (here only the dense matrix is taken into account as there should be close to no sparsity)
    

The experiments on overfitting were run on different vector sizes, from 32 to 128, for encodings using 3 and 4 segments.

**TODO do the runs again (with a better and more clean script) and put here the results**



## Method Validation

To validate this method it is sufficient to show that the performance of a network does not decay compared to one-hot encoding on several tasks. This article deals with this in a restricted environment.

There are a few key points to measure:

* Pre-processing time (dataset) for each method
* Network Performance 
* Network Size (total and trainable parameters)
* Network Memory Consumption
* Network Training Time

To this end, we will tackle NLP tasks that are simple enough that the training and testing time is not excessive (running on a single end-user GPU card, in this case an NVIDIA RTX 2080 Ti).

The tasks will be evaluated on the same networks (except for the first embedding layer) with different encodings. To be able to compare networks, everything will be done at character level. For completeness, some other methods will also be evaluated, mainly BPE, which is currently used in most SoTA papers.

## Results and Discussion

The TODO here

## Conclusion

This work shows a different take on the current approach to input coding for Natural Language Processing tasks on Deep Learning Networks.

This work shows that it is possible to reduce computational complexity *and* add representational capability to a deep neural network without loss of performance, while making pre-processing easier .... TODO here.

## Future Work

This study is the first part of a deeper study on how to make networks train faster and run on consumer-grade GPUs in a competitive way (even if they are not SoTA). ..... TODO here