# Error Correcting Codes Encoding Study

The goal of this study is to understand the alternatives to the popular one-hot encoding. There are many sides to each story (no, not only two); on my side of this one are:

- I never liked one-hot encoding (and it has been more than a decade since I first used it, so the disgust might never go away);
- I don't like how neural networks are treated as something that should always be learned end to end (no, they should not; they should be more complex architectures, many of which are already in the research literature);
- There are priors;
- Each type of input should have (and in nature HAS) its own priors, which are adapted to *facilitate* the learning. No, we should not do everything inside a NN; we should give as input something with priors that facilitate learning (and that might or might not later save processing power during operation).


On the priors, many have already shown good results. The most remarkable priors are Convolutional Neural Networks, MAC Networks and LSTMs; others are more subtle, like (remember citation here ...) adding a coordinate system to the input image as one (or many) extra channel(s). There are many more that I think are worth exploring and adding to the literature, even if they don't give good results.
Among those priors there are many that we not only know, but for which we also have specialized hardware that is perfectly adapted:
* time and space -> this we can encode and add it as extra channels
* Different transforms (Fourier, Laplace, Wavelets, ...)
* spikes (borders in images)
* ....


The idea behind this is that I don't agree with one-hot encoding, not because it does not work, but because it relies on a few assumptions that I don't want to deal with at first:

* We know the actual number of values to encode (with words this is not necessarily true)
* We have sample data to train the encoding

This limits us in several ways; for example, when training on one domain, the encoder will depend on that domain only. If there are under-represented values (such as words that don't appear in the sample, words that are new later, or a change of domain), this limits the encoding possibilities. A better idea is to be able to encode everything, even if the internal representations have not yet learned to use those symbols.

I want to be able to separate the *possibility* of representing a value from the learning of that concept.

The first and biggest limitation of one-hot encoding is that it cannot represent values that were not part of the vocabulary it was built from.

As other parts of this study have already focused on integer value representations and arbitrary function representation (although with limited success on the Fourier-inspired encodings), this part is focused on being able to correctly represent all the values of UTF-8, basically building a first binary representation that will be given as input to an OVERFITTED encoder.

The reasoning behind this is:


* The origin domain is all text
* UTF-8 can represent all text in all languages including some extra elements
* let's use UTF-8 as the origin domain
* Create an encoder that can deal with ANY and ALL input in the origin domain
* the encoded values can later be used

As text should always be correctly reconstructed in the presence of noise, I want to imagine a Neural Network as a REALLY NOISY channel. Using (Forward) ECCs is one way of thinking about this medium.
The test that I intend to do is:

* Create an autoencoder that OVERFITS to all the data


One idea that I have been turning over in my head for the past 3-4 years is that we are thinking about overfitting the wrong way, and we can actually use it well, but we have to learn how.

I think that this is the first time I have actually found a way of using it usefully.

The idea is to overfit so that the autoencoder generates an encoding vector smaller than the input. Autoencoders are good for this test.

The other idea is that if the autoencoder can NOT do this, then the encoding ideas that I will try are BAAAD and I should feel BAAAD. In this case ... just go back to the drawing board and think of other things.

On the other hand, if this works, it means that FINALLY I can go on to the next stage, which is building the predictors: first basic ones (LSTMs, HMMs, Time Convolutions), then with meta-learning, and later with my still too fresh idea on neural databases.

One interesting thing I want to find out about Error Correcting Codes (ECCs) is whether they are actually useful in the output decoding, as they should be adding *explicit* redundancy to the input and also to the output.

The other thing about ECCs is that we can stack them: for example, one code (or many) to represent a symbol (for example the value *'€'*), and then convolutional or turbo codes for the *temporal* encoding/decoding part. This means that we add priors not only to the instantaneous input but also to the temporal dimension, which is something I really want to explore (and this should make it easier to fix and correct "channel errors").

I don't deal here with *erasure* error types, but that is a possibility for later.


In [44]:
import numpy as np
import commpy
# import bitarray as ba
# import struct
import sys
# import binascii
from bitstring import BitArray, BitStream


In [28]:
sys.byteorder

'little'

In [12]:
c = '€'.encode()

In [54]:
c

b'\xe2\x82\xac'

In [82]:
zero = BitArray(b'\x00\x00\x00\x00')
b = BitArray(c)

In [84]:
b

BitArray('0xe282ac')

In [85]:
b.tobytes()

b'\xe2\x82\xac'

In [86]:
int.from_bytes(c, byteorder='big')

14844588

In [87]:
32 - b.len

8

In [88]:
int.from_bytes(c, byteorder='big') >> 1

7422294

In [99]:
# left-pad with zero bytes until the value fills 32 bits
for i in range((32 - b.len) // 8):
    b.prepend(b'\x00')

In [100]:
b.len

32

In [101]:
b

BitArray('0x00e282ac')

In [102]:
32 - b.len

0
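The padding steps in the cells above can be collected into one helper that goes straight from a character to a fixed-width bit vector. This is only a sketch using numpy instead of `bitstring`; the names `char_to_bits` and `bits_to_char` are hypothetical, and the 4-byte width is the same assumption the cells above make (a UTF-8 character never needs more than 4 bytes).

```python
import numpy as np

def char_to_bits(ch, width_bytes=4):
    """UTF-8 bytes of `ch`, left-padded with zero bytes, as a 0/1 uint8 array."""
    raw = ch.encode("utf-8")
    if len(raw) > width_bytes:
        raise ValueError("input needs more than width_bytes bytes")
    padded = raw.rjust(width_bytes, b"\x00")
    # unpackbits emits the most significant bit of each byte first
    return np.unpackbits(np.frombuffer(padded, dtype=np.uint8))

def bits_to_char(bits):
    """Inverse of char_to_bits: pack the bits, drop the zero padding, decode."""
    raw = np.packbits(bits).tobytes()
    # keep one zero byte so that the NUL character itself survives the strip
    stripped = raw.lstrip(b"\x00") or b"\x00"
    return stripped.decode("utf-8")

bits = char_to_bits("€")          # 32 entries, the first 8 are padding zeros
restored = bits_to_char(bits)     # gives back '€'
```

The round trip only works while no bits are flipped; once the vector goes through a noisy stage, the decode can fail with a `UnicodeDecodeError`, which is the situation the ECC layer is meant to help with.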


I tried a few things for the first part of the code, turning bytes into a numpy array, but it seems that the most efficient way would be a table: a numpy 2D array indexed by the int value of the input, where the row at that position is the binary code. This table can already include the first pass that makes a one-hot of every N bits (maybe every 4, so there are not so many initial values), and it could also hold the pre-computed ECC ...
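A minimal sketch of the table idea, under one loud assumption: the table is indexed per byte (256 rows) rather than per full 32-bit value, since a table with one row per 32-bit value would not fit in memory. `BYTE_TABLE` and `encode_bytes` are hypothetical names.

```python
import numpy as np

# One row per possible byte value; row i holds the 8 bits of i,
# most significant bit first. Shape: (256, 8).
BYTE_TABLE = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1)

def encode_bytes(text):
    """Look up the bit pattern of every UTF-8 byte of `text` in the table."""
    idx = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return BYTE_TABLE[idx]        # shape: (number of bytes, 8)

rows = encode_bytes("€")          # three rows, one per UTF-8 byte
```

Because encoding is a single fancy-indexing operation, the rows could just as well hold a chunked one-hot or an ECC codeword instead of the raw bits; only the table construction would change.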

For the ECC, I still haven't decided whether to compute it over chunks of input bits or over all the values. I guess that over all of them should do, but it may be easier to compute by reshaping the input arrays to the block size of the code in use (for example, for Golay [24,12,8], one codeword for every 12 input bits).

The idea is not to get rid of one-hot encoding completely, but to limit it to parts of the input vector code, restricting the size of the domain.

In [108]:
# number of parameters for a one-hot by chunks encoding:
chunk_sizes = [4, 5, 6, 8, 12]
n_params = []
for c in chunk_sizes:
    n_params.append((c, (32 // c) * 2**c))

In [109]:
n_params

[(4, 128), (5, 192), (6, 320), (8, 1024), (12, 8192)]

Maybe for my tests up to chunks of size 6 should be acceptable (I still need to add the next ECC)
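To make the sizes above concrete, here is a sketch of the one-hot by chunks encoding itself. The function name `chunked_one_hot` is hypothetical, and the sketch assumes the chunk size divides the number of bits. Note that `32 // c` in the count above silently leaves out the leftover bits for sizes that do not divide 32 (for example, 6 chunks of 5 bits cover only 30 bits), so those sizes would need padding first.

```python
import numpy as np

def chunked_one_hot(bits, chunk_size=4):
    """One-hot encode each group of `chunk_size` bits and concatenate the results."""
    bits = np.asarray(bits, dtype=np.int64)
    if bits.size % chunk_size:
        raise ValueError("chunk_size must divide the number of bits")
    chunks = bits.reshape(-1, chunk_size)
    # value of each chunk read as a binary number, most significant bit first
    weights = 1 << np.arange(chunk_size - 1, -1, -1)
    values = chunks @ weights
    one_hot = np.zeros((chunks.shape[0], 2 ** chunk_size), dtype=np.uint8)
    one_hot[np.arange(chunks.shape[0]), values] = 1
    return one_hot.ravel()

# 0x00e282ac, the padded '€' from the cells above
word = np.unpackbits(np.frombuffer(b"\x00\xe2\x82\xac", dtype=np.uint8))
vec = chunked_one_hot(word, chunk_size=4)   # 8 chunks x 16 slots = 128 entries
```

Each chunk contributes exactly one active entry, so the 128-entry vector for a 32-bit word always has 8 ones.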

The next code can be:
- Repetition (x3)
- Hamming
- Golay
- Reed Solomon
- Latin Square 
- AN Correcting

Here are some thoughts about the codes:

Repetition: this has the disadvantage of giving a redundancy that is quite obvious. Besides its low reconstruction power and its catastrophic errors, simply repeating the input does not necessarily give a neural network another perspective on it. It might be worth trying, but for the moment I'm not interested in it.

Hamming: Hamming(7,4) can correct one error per block, adding 3 parity bits for every 4 data bits. With one extra overall parity bit (the extended [8,4] code) it can also detect, but not correct, up to 2 errors.
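As an illustration, a small Hamming(7,4) encoder and syndrome decoder in numpy. This is a sketch, not a tuned implementation: the parity matrix `P` below is one common systematic choice (an assumption, other equivalent layouts exist), and the function names are hypothetical.

```python
import numpy as np

# Systematic Hamming(7,4): codeword = [4 data bits | 3 parity bits]
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=np.uint8)
G = np.hstack([np.eye(4, dtype=np.uint8), P])      # generator matrix, 4 x 7
H = np.hstack([P.T, np.eye(3, dtype=np.uint8)])    # parity-check matrix, 3 x 7

def hamming74_encode(data):
    """Encode bits in blocks of 4; returns one 7-bit codeword per row."""
    data = np.asarray(data, dtype=np.uint8).reshape(-1, 4)
    return (data @ G) % 2

def hamming74_decode(code):
    """Correct up to one flipped bit per 7-bit block and return the data bits."""
    code = np.array(code, dtype=np.uint8).reshape(-1, 7)   # works on a copy
    syndromes = (code @ H.T) % 2
    for row, s in zip(code, syndromes):
        if s.any():
            # a nonzero syndrome equals the column of H at the error position
            pos = np.flatnonzero((H.T == s).all(axis=1))[0]
            row[pos] ^= 1
    return code[:, :4]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[0, 2] ^= 1                      # simulate one "channel error"
recovered = hamming74_decode(codeword)   # the original 4 data bits again
```

A 32-bit input reshapes into 8 blocks of 4 bits, giving 56 coded bits. Two errors in the same block are miscorrected by this decoder, which is where the extra parity bit of the extended code would help.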

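Golay: might serve well enough for my first tests, as it does not add too much overhead (it doubles the number of elements) for an interesting amount of error correction: the extended [24,12,8] code corrects up to 3 bit errors in each 24-bit codeword, which carries 12 data bits (so one eighth of the transmitted bits).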
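A sketch of how the extended Golay [24,12,8] codewords could be generated: systematic encoding with the cyclic [23,12] generator polynomial x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1, plus one overall parity bit. The function names are hypothetical, and the decoder is a brute-force nearest-codeword search chosen for clarity, not a practical Golay decoder.

```python
GOLAY_POLY = 0xC75   # x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1

def golay24_encode(data12):
    """Encode a 12-bit integer into a 24-bit extended Golay codeword (an int)."""
    if not 0 <= data12 < 4096:
        raise ValueError("data12 must fit in 12 bits")
    rem = data12 << 11
    # polynomial division over GF(2): reduce the terms of degree 22 down to 11
    for shift in range(11, -1, -1):
        if rem & (1 << (shift + 11)):
            rem ^= GOLAY_POLY << shift
    word23 = (data12 << 11) | rem            # 12 data bits, then 11 check bits
    parity = bin(word23).count("1") & 1      # overall parity gives the 24th bit
    return (word23 << 1) | parity

# All 4096 codewords; the list index is the data value.
CODEBOOK = [golay24_encode(d) for d in range(4096)]

def golay24_decode(word24):
    """Return the data value of the closest codeword in Hamming distance."""
    return min(range(4096), key=lambda d: bin(CODEBOOK[d] ^ word24).count("1"))

codeword = golay24_encode(0x2AC)
noisy = codeword ^ (1 << 23) ^ (1 << 10) ^ 1   # flip three bits
decoded = golay24_decode(noisy)                # recovers 0x2AC
```

With minimum distance 8, any pattern of up to 3 flipped bits has a unique nearest codeword; with 4 or more, the search still returns something, but it may be the wrong value. The codebook could also be stored directly in the encoding table, so that encoding stays a lookup.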


There is one difference in focus between this analysis and telecommunications (or any other domain with a noisy channel): here I'm interested not in the code rate (amount of information sent vs. amount of actual bits sent) but in giving the NN some form of not necessarily evident redundancy as input that it could use, and in having more ways to correct the output if a single mistake is made during the output extrapolation. I want to check this part.

Thinking a bit more about autoencoders, it might not be the best idea to start there, as it might not give any useful information ... I still have to try some things; I might give it a try if it is quick enough to build once I have the input code.


For efficiency, what I will do is build the encoding table from the beginning; for the decoding, I still need to think it through.
