# Encoding tests for UTF-8 text 

This file explores how to encode every character representable in UTF-8 with a single universal scheme.
The idea is to create a basic encoding that is not tied to one domain, as current word encodings are, but can be reused everywhere and can handle all current languages and symbols.

The plan is to move to a context-based encoding. This will need a lot of training, but if it works it could be a game changer.

Since this file tries to encode every character representable in UTF-8, we first have to check how many there are:

From [Wikipedia utf-8](https://en.wikipedia.org/wiki/UTF-8)

UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid Unicode code points.

$$17 \times 2^{16} = 1114112$$ code points, minus 2,048 technically invalid surrogate code points, leaves 1,112,064.
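The arithmetic above can be checked in a few lines. This is only a sanity check; the constant names are my own, and the surrogate range U+D800 to U+DFFF is the standard Unicode one.

```python
import sys

PLANES = 17                       # Unicode planes 0..16
CODE_POINTS_PER_PLANE = 2 ** 16   # 65,536 code points per plane
SURROGATES = 0xDFFF - 0xD800 + 1  # surrogate block U+D800..U+DFFF

total = PLANES * CODE_POINTS_PER_PLANE
valid = total - SURROGATES

print(total, SURROGATES, valid)
print(sys.maxunicode + 1 == total)  # Python's own upper bound on code points
```

The encoder range used below, 0 to 1114112, corresponds to `total`, so it also covers the surrogate block.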

As for the number of characters in a single text, we can check a few sources:

- [The Bible](https://wordcounter.net/blog/2015/12/08/10975_how-many-words-bible.html) 783,137 words, 3,116,480 characters

- [10 longest novels](http://mentalfloss.com/article/18661/quick-10-10-longest-novels-ever)  The Blah Story by Nigel Tomm. 3,277,227 words

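The longest book has about 4 times as many words as the Bible (3,277,227 / 783,137 ≈ 4.2), which at the same characters-per-word ratio gives roughly 13 million characters. A budget of about 15 million characters should therefore cover almost all English text. I assume the longest texts in most other languages stay within a similar bound.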
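The estimate can be reproduced from the figures quoted in the two links. This is a back-of-the-envelope sketch that assumes the longest novel has the same characters-per-word ratio as the Bible; the variable names are my own.

```python
bible_words = 783_137
bible_chars = 3_116_480
longest_words = 3_277_227   # The Blah Story, per the list above

chars_per_word = bible_chars / bible_words
word_ratio = longest_words / bible_words
estimated_chars = longest_words * chars_per_word

print(f"characters per word: {chars_per_word:.2f}")
print(f"word ratio:          {word_ratio:.2f}")
print(f"estimated chars:     {estimated_chars:,.0f}")
```

The estimate stays below the 15,000,000 used as the upper bound for the position encoder.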




In [1]:
import numpy as np

In [2]:
from encoders import *
from helpers import *

In [3]:
utf_enc = Lin2WaveEncoder(0,1114112, neg_allowed=False)

In [4]:
utf_enc.coefficients

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]

In [5]:
len(utf_enc.coefficients)

20

In [6]:
2**utf_enc.n_high

2097152

This leaves room for almost twice as many values as needed (2,097,152 versus 1,114,112), which is fine: better safe than sorry.

In [7]:
char_pos_enc= Lin2WaveEncoder(0,15000000, neg_allowed=False)
len(char_pos_enc.coefficients)

23

In [8]:
2**char_pos_enc.n_high

16777216
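Both powers of two printed so far can be cross-checked with plain integer arithmetic. The sketch assumes that `n_high` is the smallest exponent `k` with `2**k` at least as large as the range; that is a guess consistent with the two outputs, not something read from the `Lin2WaveEncoder` source. The helper `min_exponent` is hypothetical.

```python
def min_exponent(n_values: int) -> int:
    """Smallest k such that 2**k >= n_values (for n_values >= 1)."""
    return (n_values - 1).bit_length()

for n_values in (1_114_112, 15_000_000):
    k = min_exponent(n_values)
    print(n_values, k, 2 ** k)
```

In both outputs above, the number of coefficients (20 and 23) is one less than this exponent (21 and 24).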

It seems that with about 43 elements (20 for the code point plus 23 for the character position) we can encode everything we want, so for the moment I'll add a few tests to check that every encoding is correct.
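One possible shape for such a test is a generic round-trip check that takes any `encode`/`decode` pair. The sketch uses a plain binary bit encoder as a stand-in, because the internals of `Lin2WaveEncoder` are not shown here; `to_bits`, `from_bits` and `max_roundtrip_error` are hypothetical helpers, not part of the `encoders` module.

```python
import numpy as np

def to_bits(values, n_bits):
    """Stand-in encoder: one row per value, least significant bit first."""
    values = np.asarray(values, dtype=np.int64)
    shifts = np.arange(n_bits, dtype=np.int64)
    return (values[:, None] >> shifts) & 1

def from_bits(bits):
    """Stand-in decoder: inverse of to_bits."""
    weights = np.int64(1) << np.arange(bits.shape[1], dtype=np.int64)
    return (bits * weights).sum(axis=1)

def max_roundtrip_error(encode, decode, values):
    """Largest absolute difference between decode(encode(v)) and v."""
    values = np.asarray(values)
    return np.abs(decode(encode(values)) - values).max()

sample = np.arange(10_000, 10_010)
err = max_roundtrip_error(lambda v: to_bits(v, 21), from_bits, sample)
print(err)
```

Assuming `utf_enc.encode` and `utf_enc.decode` accept and return NumPy arrays, they could be passed to `max_roundtrip_error` in place of the stand-ins.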

In [9]:
utf_enc.coefficients

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]

In [10]:
utf_enc.periods

[2,
 4,
 8,
 16,
 32,
 64,
 128,
 256,
 512,
 1024,
 2048,
 4096,
 8192,
 16384,
 32768,
 65536,
 131072,
 262144,
 524288,
 1048576]

In [11]:
# nums = np.array(list(range(1114113)))
# enc_nums = utf_enc.encode(nums)
# dec_nums = utf_enc.decode(enc_nums)

In [12]:
# diffs = dec_nums - nums

In [13]:
# diffs.max()

In [14]:
# diffs.min()

In [15]:
for i in range(10000, 10010):  # range(1114113):
    ei = utf_enc.encode(np.array([i]))
    di = utf_enc.decode(ei)
    print(di - np.array([i]), di, [i], ei.T)
    

[-7614.99628525] [2385.00371475] [10000] [[0.00601678 0.17493624 0.32673182 0.58800814 0.00195542 0.13115829
  0.70151093 0.98928396 0.8150547  0.33285724 0.00724349 0.82218012
  0.96967038 0.78657779 0.65023042 0.57599823 0.53810998 0.51906886
  0.50953616 0.5047683 ]]
[-7615.37501024] [2385.62498976] [10001] [[1.03564933e-01 2.79033554e-01 3.86558448e-01 5.57094223e-01
  8.18281119e-04 1.36477680e-01 6.97929851e-01 9.89682423e-01
  8.15812401e-01 3.32397127e-01 7.28495169e-03 8.22086763e-01
  9.69691306e-01 7.86602798e-01 6.50244973e-01 5.76005771e-01
  5.38113780e-01 5.19070767e-01 5.09537118e-01 5.04768776e-01]]
[-7615.52474358] [2386.47525642] [10002] [[2.98174216e-01 3.96869493e-01 4.48155289e-01 5.25957358e-01
  1.68588183e-04 1.41885822e-01 6.94336695e-01 9.90073413e-01
  8.16568902e-01 3.31937175e-01 7.32653409e-03 8.21993385e-01
  9.69712229e-01 7.86627804e-01 6.50259527e-01 5.76013312e-01
  5.38117583e-01 5.19072673e-01 5.09538072e-01 5.04769253e-01]]
[-7615.55751564] [2387.

In [16]:
l2f = Lin2Fourier(0, 1114112, terms=5)

In [17]:
c0 = l2f.coefficients[0]

In [18]:
l2f.factors[0](10)

5.639635246590041e-05

In [19]:
ten = np.array(range(10))

In [20]:
eten = l2f.encode(ten)

In [21]:
l2f.fs.subs(abc.x,10)

-1114112*sin(5*pi/278528)/pi - 557056*sin(5*pi/139264)/pi - 1114112*sin(15*pi/278528)/(3*pi) - 278528*sin(5*pi/69632)/pi - 1114112*sin(25*pi/278528)/(5*pi)

In [22]:
l2f.fs.subs(abc.x,4).evalf()

-39.9999999626816

In [23]:
lf = lambdify(abc.x, l2f.fs, "numpy")

In [24]:
lf(4)

-39.999999962681564

In [25]:
l2f.fs

-1114112*sin(pi*x/557056)/pi - 557056*sin(pi*x/278528)/pi - 1114112*sin(3*pi*x/557056)/(3*pi) - 278528*sin(pi*x/139264)/pi - 1114112*sin(5*pi*x/557056)/(5*pi)
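The expression printed above is a five-term sine sum with coefficients -L/(nπ) and L = 1114112. As a cross-check that does not depend on `Lin2Fourier` or SymPy, the sketch below evaluates the same five terms with NumPy; the helper name `sine_part` is my own. For reference, the textbook Fourier series of f(x) = x on (0, L) also contains the constant term L/2, which does not appear in the printed expression, so the sketch reports both values.

```python
import numpy as np

L = 1_114_112   # length of the interval (0, L), as in the cells above
TERMS = 5       # same number of terms as Lin2Fourier(..., terms=5)

def sine_part(x, length=L, terms=TERMS):
    """Sum of -(length / (n*pi)) * sin(2*pi*n*x / length) for n = 1..terms."""
    n = np.arange(1, terms + 1)
    return float(np.sum(-(length / (n * np.pi)) * np.sin(2 * np.pi * n * x / length)))

x = 4
print(sine_part(x))          # sine terms only
print(L / 2 + sine_part(x))  # with the textbook constant term L/2 added
```

For x = 4 the sine terms alone come out at about -40, in line with the value printed by the notebook. Neither number is close to 4, which is what a five-term truncation gives near the edge of the interval, where the periodic extension of x jumps.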

In [26]:
# l2f.decode(eten)

In [27]:
c = [c for c in l2f.factors][3]

In [28]:
c

<function _lambdifygenerated(x)>

In [29]:
# for i in range(10):  # range(1114113):
#     ei = l2f.encode(np.array([i]))
#     di = l2f.decode(ei)
#     print(di - np.array([i]), di, [i], ei.T)
    

In [30]:
import numpy as np
numpy = np  # FIXME, somewhere the reference is missing
import sympy
from sympy import abc
from sympy import lambdify
from sympy import fourier_series, pi

In [31]:
min_val = 0
max_val = 1114112
terms = 20
fs = fourier_series(abc.x, (abc.x, min_val, max_val))  # .truncate(terms)

In [32]:
fs

FourierSeries(x, (x, 0, 1114112), (0, SeqFormula(0, (_k, 1, oo)), SeqFormula(Piecewise((-620622774272*cos(2*_n*pi)/(_n*pi) + 310311387136*sin(2*_n*pi)/(_n**2*pi**2), (_n > -oo) & (_n < oo) & Ne(_n, 0)), (0, True))*sin(_n*pi*x/557056)/557056, (_n, 1, oo))))

In [33]:
fs_args = fs.args 

In [34]:
# coef_fact = [ (lambdify(abc.x, (np.product(f.args[:-1]), "numpy")[0]),  # evaluate  sympy.pi to numeric
#                       lambdify(abc.x, f.args[-1], "numpy")) for f in fs_args  # create lambda functions
#                      ]

In [35]:
# coefficients, factors = zip(*coef_fact)

In [36]:
# coefficients = np.stack(coefficients)

In [37]:
# coefficients

In [38]:
a = 'a'
b = 'b'
s = "this is a text"
s1 = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHI'

In [39]:
ea = a.encode('utf-8')

In [40]:
int.from_bytes(ea, byteorder='big')

97
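For `'a'` the integer built from the UTF-8 bytes equals the Unicode code point, but that only holds for ASCII. The sketch below compares `ord` with the byte-based integer for a two-byte and a three-byte character. `utf8_as_int` is a hypothetical helper, and treating `ord` as the value to feed an encoder with range 0 to 1114112 is my assumption about the intent.

```python
def utf8_as_int(ch: str) -> int:
    """Integer formed by reading the UTF-8 bytes of ch as a big-endian number."""
    return int.from_bytes(ch.encode("utf-8"), byteorder="big")

# 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
for ch in ("a", "\u00e9", "\u20ac"):
    print(repr(ch), ord(ch), utf8_as_int(ch))
```

For multi-byte characters the byte-based integer can exceed 1114112, while `ord` always stays inside the code point range.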

In [41]:
import commpy

In [None]:
commpy.