# New Compositional Codebook Preparation

For this codebook I'll start from the previous idea, where each codepoint was independent and had only a numeric id.

For the new code the idea is a bit more elaborate:

- each first-iteration codepoint depends on the index
- from this first iteration, new codes are derived in the following way:
  * for the single-char codes:
    - normalize the char with NFKD (so it is decomposed)
    - check whether it is numeric, is uppercase, is special, and has a diacritic (and which one)
    - convert the char to lowercase and to ASCII (or its simplest representation)
    - the code is the concatenation of (to_ascii, lowercase_code, is uppercase|lowercase, is numeric|not numeric, is special|not special, diacritic)
  * for the multiple-character codes:
    - normalize the sequence with NFKD
    - encode each character
    - conv(to_ascii ..) cat sum(to_ascii) cat conv(to_lower) cat sum(to_lower) cat conv(diacritics) cat sum(diacritics) cat charcount cat hasnum cat isnum ... (TODO: finish deciding which kind of code to use and what it contains)
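The single-char steps above can be sketched with the standard library alone. This is only an illustration under my own assumptions: the function name `char_features` and the dictionary keys are hypothetical, "special" is taken to mean "not alphanumeric", and the closest-ASCII step is approximated by dropping combining marks (a real transliteration, for example with `unidecode`, would cover more cases).

```python
import unicodedata


def char_features(ch: str) -> dict:
    """Describe one character after NFKD decomposition (illustrative only)."""
    decomposed = unicodedata.normalize("NFKD", ch)
    # Characters with a nonzero canonical combining class are the marks
    # (accents, cedillas, ...) that NFKD splits off from the base letter.
    marks = [c for c in decomposed if unicodedata.combining(c)]
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return {
        "base": base,                      # assumption: stands in for to_ascii
        "lower": base.lower(),
        "is_upper": ch.isupper(),
        "is_numeric": ch.isnumeric(),
        "is_special": not ch.isalnum(),    # assumption: special = not alphanumeric
        "diacritics": [unicodedata.name(c, "UNKNOWN") for c in marks],
        "is_composed": len(decomposed) > 1,
    }


features = char_features("É")
print(features["base"], features["lower"], features["diacritics"])
```

Each field would then be turned into its own sub-vector before concatenation; how that is done is up to the encoders used in the notebook.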


The idea is:

Each character representation contains more information than a simple index. This should make the network's learning easier and give a way to convert between upper and lower case, with and without diacritics.

The composed code gives information about the presence or absence of a character (the sums) and the order (the convolutions). This should give subspaces in which similarity and proximity analysis are easier.

The issue here is that maybe each subspace part should be processed in parallel, while getting some information from the other subspaces, instead of doing everything in one big neural network.


The current assumptions are the following:

- The origin language is given by name, not detected
- The destination language is given by name, not detected
- For training, a destination vector will be checked either with similarity search (FAISS) or as a one-hot encoding, depending on resource usage
- The input embeddings mapping will be pre-computed (as in the previous iteration), but the number of input elements will be bigger
- The tokenization will be greedy, meaning it will try to span the longest sequences first
- Unknown input tokens should be tested with the following two encoding protocols:
  * only span the longest tokens possible
  * encode the entire symbol as per the compositional encoding protocol and let the network treat it as an unknown, but tag it semantically and grammatically
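The greedy assumption in the list above can be illustrated with a small longest-match tokenizer. This is a sketch, not the tokenizer used in this project: the name `greedy_tokenize` and the toy vocabulary are hypothetical, and unknown characters are simply passed through one at a time so that they could later be encoded compositionally.

```python
def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary entry that matches."""
    max_len = max((len(token) for token in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matches: emit the single unknown character.
            tokens.append(text[i])
            i += 1
    return tokens


toy_vocab = {"the", "th", "re", "e", "r", " "}
print(greedy_tokenize("there", toy_vocab))
```

With this toy vocabulary, `"there"` is split as `"the"` followed by `"re"`, because the three-character match is tried before the shorter ones.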
  

For the initial code it would be nice to have a redundant code that keeps close elements close in the subspace, plus something that pulls them apart enough that the sum of the subspaces stays clear in the compositionally encoded values.

Something like the multihot code for the separation part and the single-cycle code for the proximity part.

Now let's compute the number of codepoints for the base generator code.

## Encoding Steps

1. Base Generator Code -> index-based, from a redundant single-cycle-code + multihot-prime-code
2. Single Char Basic Code
   - computed after NFKD normalization
   - includes whether it is uppercase/lowercase
   - whether it contains a diacritic/accidental (check whether it is better to tell which one, or just use a binary element)
   - whether it is a composed symbol (more than one char in the NFKD normalization)
   - whether it is a numeric element
   - it contains the basic code for the letter (closest ASCII, for example ... TODO clarify this)
3. Composed Code:
   - circ conv of Single Char Codes (dim*2)
   - sum of previous codes (dim)
   - circ conv of ASCII representations (dim*2)
   - sum of ASCII representations (dim)
   - is numeric | is alphanumeric | is all text (dim=3)
   - has diacritic/accidental (a position for each, with the vector size being the max length of the token, for example 5 or 10; or count the number of accidentals instead) (dim=2)
   - is all caps (instead of having each  (dim=2)
   - starts with upper (dim=2)
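The two building blocks named in the Composed Code list, the sum and the circular convolution, can be shown on toy vectors. This is a minimal sketch in plain Python; the function names are hypothetical, and how the convolution is applied across a whole token (and how it reaches dim*2) is not specified in these notes, so only the basic operations are shown.

```python
def circular_convolution(a, b):
    """Circular convolution of two equal-length vectors."""
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def vector_sum(vectors):
    """Element-wise sum: records which codes are present, not their order."""
    return [sum(column) for column in zip(*vectors)]


code_a = [0, 1, 0, 0, 0]   # toy one-hot code, hot at position 1
code_b = [0, 0, 1, 0, 0]   # toy one-hot code, hot at position 2

bag = vector_sum([code_a, code_b])
bound = circular_convolution(code_a, code_b)
print(bag)    # positions 1 and 2 are set
print(bound)  # hot at position (1 + 2) % 5 = 3
```

One caveat worth keeping in mind: plain circular convolution is commutative, so by itself it gives the same result for "ab" and "ba". An order-sensitive variant needs an extra step, for example a fixed permutation of one operand, and that choice is outside what these notes define.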
  
This schema is not the simplest one and takes work to put in place, but it might (and this is what I hope) reduce the number of parameters and the training time.
  

The desired vector size for the embedding codes also has to be selected. I chose to work in the following ranges: the single-char code might be 48 dimensions; composed codes should have no more than 192 dimensions, with 128 preferred.

Let's look at the following code.

We need to represent at least 1619 characters for one of the selected character codes (I'm trying to cut the number of dimensions to fit the current resources while keeping maximum flexibility; more work on this could give more benefits, but I won't spend TOO MUCH more time on it).

Let's say we use the following code:

    multihot-code (3,5,11,13) -> max 2145 codepoints
    single-cycle-code (4,6,10,12) -> max 2880 codepoints
    is upper|is_lower (dim=2)
    contains_diacritic (dim=2)
    composed_symbol (dim=2)
    is_numeric|is_text|is_symbol (dim=3)
    ascii_converted_codepoint (transliteration + normalization + taking diacritics out) -> to reduce to maximum the lang 
    
    total_dimension = 3+5+11+13 + 4+6+10+12 + 2 + 2 + 2 + 3 +
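The capacities quoted above can be checked with a few lines. The sketch below also builds a multihot code under one plausible reading, which is my assumption and not confirmed by these notes: one one-hot block per coprime, hot at `index % modulus`. The name `multihot_code` is hypothetical. The `ascii_converted_codepoint` term is left open in the notes, so only the partial dimension is computed.

```python
from math import prod

COPRIMES = (3, 5, 11, 13)   # multihot-code block sizes from the notes
CYCLES = (4, 6, 10, 12)     # single-cycle-code block sizes from the notes


def multihot_code(index, moduli=COPRIMES):
    """Assumed construction: one one-hot block per modulus, hot at index % modulus."""
    code = []
    for modulus in moduli:
        block = [0] * modulus
        block[index % modulus] = 1
        code.extend(block)
    return code


multihot_capacity = prod(COPRIMES)   # 3 * 5 * 11 * 13
cycle_capacity = prod(CYCLES)        # 4 * 6 * 10 * 12
flags_dim = 2 + 2 + 2 + 3            # casing, diacritic, composed, numeric|text|symbol
partial_dim = sum(COPRIMES) + sum(CYCLES) + flags_dim  # ascii part not included

print(multihot_capacity, cycle_capacity, partial_dim)
```

Because the moduli are pairwise coprime, every index below 2145 gets a distinct multihot code, which leaves room for the 1619 characters that need to be represented.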

In [1]:
import unidecode
import unicodedata
import numpy as np


from constants import RESERVED_CODE_SPACE
from sparse_encoders import *

## Process 

Streamlining the process and testing it here.

In [2]:
charcodes = compositional_code_main()

chars len =  1564
1 first_symbols len =  0
1 all_chars len =  3128
2 all_chars len =  1519
2 first_symbols len =  3128
3 first_symbols len =  842
all_base_chars len =  870
all_base_chars len =  870
charcodes len =  1519


In [3]:
charcodes[0]

{'token': '\n',
 'complete_conv': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.], dtype=float16),
 'non_accent_conv': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       

In [4]:
len(charcodes)

1519

In [5]:
import pickle

In [29]:
fname = "codes/compositional_charcode_1519_raw_dicts.pkl"

with open(fname, 'wb') as f:
    pickle.dump(charcodes, f, pickle.HIGHEST_PROTOCOL)

In [7]:
ls -lh codes/

total 15M
-rw-r--r-- 1 leo leo 205K févr. 12 22:30  adhoc-codebook-1871.pkl
-rw-r--r-- 1 leo leo  20K févr.  7 12:12  all_chars_byline.chars
-rw-r--r-- 1 leo leo  15K févr.  7 12:12  all_chars.chars
-rw-r--r-- 1 leo leo 1,4M avril 11 14:43  compositional_charcode_1564_raw_dicts
-rw-r--r-- 1 leo leo 1,3M avril 11 15:56  compositional_charcode_1564_raw_dicts.pkl
-rw-r--r-- 1 leo leo 176K févr. 12 22:30 'utf8_2-seg_1871-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_dense.npy'
-rw-r--r-- 1 leo leo 128K févr. 12 22:30 'utf8_2-seg_1871-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_sparse.npy'
-rw-r--r-- 1 leo leo 7,3M janv. 20 12:41 'utf8_3-seg_59328-codepoints_128-dim_N-37-k4_coprimes-(11, 13, 19, 23)_cycles-(11, 7, 4, 3)_dense.npy'
-rw-r--r-- 1 leo leo 4,5M janv. 20 12:41 'utf8_3-seg_59328-codepoints_128-dim_N-37-k4_coprimes-(11, 13, 19, 23)_cycles-(11, 7, 4, 3)_sparse.npy'


## Code creation 

The goal here is to select a few elements from the given dictionary, sort all the characters, and build a more elaborate codebook than the base one, with the same format:

    (codes, symbol2int, int2symbol)

The special codes also need to be checked to make sure they are present, just in case.
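A minimal sketch of what the two lookup tables and the special-code check could look like. The helper names and the small stand-in list of special characters are hypothetical; the real list is `SPECIAL_CODES_CHARS`, and whether `int2symbol` is a dict or a list in the actual codebook is an assumption here.

```python
def build_mappings(symbols):
    """Build symbol->index and index->symbol tables from an ordered list."""
    symbol2int = {symbol: index for index, symbol in enumerate(symbols)}
    int2symbol = {index: symbol for symbol, index in symbol2int.items()}
    return symbol2int, int2symbol


def missing_special_codes(symbol2int, special_chars):
    """Return the special characters that have no entry in the mapping."""
    return [c for c in special_chars if c not in symbol2int]


# Stand-in subset of the special characters, for illustration only.
special_subset = ["◌", "◍", "\n", " "]
toy_symbols = special_subset + list("abc")

toy_symbol2int, toy_int2symbol = build_mappings(toy_symbols)
print(missing_special_codes(toy_symbol2int, special_subset))          # nothing missing
print(missing_special_codes(toy_symbol2int, special_subset + ["\t"]))  # tab is missing
```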


In [8]:
SPECIAL_CODES_CHARS

['◌',
 '◍',
 '◀',
 '◂',
 '▸',
 '▶',
 '▒',
 '\x00',
 '\x01',
 '\x02',
 '\x03',
 '\x04',
 '\x05',
 '\x06',
 '\x07',
 '\x08',
 '\t',
 '\n',
 '\x0b',
 '\x0c',
 '\r',
 '\x0e',
 '\x0f',
 '\x10',
 '\x11',
 '\x12',
 '\x13',
 '\x14',
 '\x15',
 '\x16',
 '\x17',
 '\x18',
 '\x19',
 '\x1a',
 '\x1b',
 '\x1c',
 '\x1d',
 '\x1e',
 '\x1f',
 ' ']

In [9]:
# charcodes = sorted(charcodes, key=lambda k: k['token'])
chars = [k['token'] for k in charcodes]

In [10]:
len(chars), len(set(chars))

(1519, 1519)

In [11]:
counter = []
for c in chars:
    counter.append((c, chars.count(c)))

In [22]:
# check that there are no duplicates
for c in counter:
    if c[1]>1:
        print(c)
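The same duplicate check can be done in a single pass with `collections.Counter`, which avoids calling `list.count` once per element. A small sketch, with a hypothetical helper name:

```python
from collections import Counter


def find_duplicates(items):
    """Return a dict of the items that appear more than once, with their counts."""
    return {item: count for item, count in Counter(items).items() if count > 1}


print(find_duplicates(["a", "b", "a", "c"]))
print(find_duplicates(list("abc")))
```

Applied to `chars`, an empty result would agree with the `(1519, 1519)` length check above.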

In [30]:
codes = charcodes_dict2codebook(charcodes)
codebook, symbol2int, int2symbol = codes

In [31]:
# check dimensions match
len(codebook), len(symbol2int), len(int2symbol)

(1519, 1519, 1519)

In [32]:
# check shape and datatype
codebook.shape, codebook.dtype

((1519, 187), dtype('int8'))

In [33]:
# check dimensions for each part of the code
for i in charcodes[0].values():
    try:
        print(len(i))
    except TypeError:
        # skip values that have no length (e.g. scalars)
        pass

1
120
120
60
60
4
3
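The printed lengths line up with the codebook width reported earlier: 120 + 60 + 4 + 3 = 187, which fits the four parts named in the output filename (`complete_conv`, `non_accent_sum`, `casing`, `alnum`). The sketch below shows that concatenation on a toy entry. Only `token`, `complete_conv` and `non_accent_conv` are field names seen in the output above; the other keys, and which 60-wide field is which, are guesses taken from the filename.

```python
# Assumed field selection, inferred from the output filename.
SELECTED_FIELDS = ("complete_conv", "non_accent_sum", "casing", "alnum")


def flatten_entry(entry, fields=SELECTED_FIELDS):
    """Concatenate the selected parts of one raw dict into a single flat row."""
    row = []
    for name in fields:
        row.extend(entry[name])
    return row


# Toy entry with the same part sizes as the lengths printed above.
toy_entry = {
    "token": "a",
    "complete_conv": [0] * 120,
    "non_accent_conv": [0] * 120,
    "complete_sum": [0] * 60,      # hypothetical key name
    "non_accent_sum": [0] * 60,    # hypothetical key name
    "casing": [0] * 4,             # hypothetical key name
    "alnum": [0] * 3,              # hypothetical key name
}

print(len(flatten_entry(toy_entry)))
```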


In [34]:
fname = "codes/compositional_charcode_1519_codebook_complete_conv-non_accent_sum-casing-alnum.pkl"

with open(fname, 'wb') as f:
    pickle.dump(codes, f, pickle.HIGHEST_PROTOCOL)

In [35]:
ls -lh codes/

total 14M
-rw-r--r-- 1 leo leo 205K févr. 12 22:30  adhoc-codebook-1871.pkl
-rw-r--r-- 1 leo leo  20K févr.  7 12:12  all_chars_byline.chars
-rw-r--r-- 1 leo leo  15K févr.  7 12:12  all_chars.chars
-rw-r--r-- 1 leo leo 302K avril 11 16:01  compositional_charcode_1519_codebook_complete_conv-non_accent_sum-casing-alnum.pkl
-rw-r--r-- 1 leo leo 1,3M avril 11 15:59  compositional_charcode_1519_raw_dicts.pkl
-rw-r--r-- 1 leo leo 176K févr. 12 22:30 'utf8_2-seg_1871-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_dense.npy'
-rw-r--r-- 1 leo leo 128K févr. 12 22:30 'utf8_2-seg_1871-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_sparse.npy'
-rw-r--r-- 1 leo leo 7,3M janv. 20 12:41 'utf8_3-seg_59328-codepoints_128-dim_N-37-k4_coprimes-(11, 13, 19, 23)_cycles-(11, 7, 4, 3)_dense.npy'
-rw-r--r-- 1 leo leo 4,5M janv. 20 12:41 'utf8_3-seg_59328-codepoints_128-dim_N-37-k4_coprimes-(11, 13, 19, 23)_cycles-(11, 7, 4, 3)_sparse.npy'
