# New Compositional Codebook Preparation

For this codebook I'll start from the previous idea, where each codepoint was independent and only had a numeric id.

The idea for the new code is a bit more elaborate:

- each first-iteration codepoint depends on the index
- from this first iteration, new codes are derived in the following way:
 * for the single char codes:
   - normalize the char with NFKD (so it is decomposed)
   - check if it is numeric, is uppercase, is special, has a diacritic (and which)
   - convert the char to lowercase and to ascii (or its simplest representation)
   - the code is the concatenation of (to_ascii, lowercase_code, is uppercase|lowercase, is numeric|not numeric, is special|not special, diacritic)
 * for the multiple character codes:
   - normalize the sequence with NFKD
   - encode each character
   - conv(to_ascii ..) cat sum(to_ascii) cat conv(to_lower) cat sum(to_lower) cat conv(diacritics) cat sum(diacritics) cat charcount cat hasnum cat isnum ... (TODO: finish deciding which kind of code to use and what it contains)
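
The single-char checks listed above can be sketched with the standard `unicodedata` module. This is only an illustration: `char_features` is a hypothetical helper, and treating "special" as "not alphanumeric" is an assumption, since the notes do not define it yet.

```python
import unicodedata

def char_features(ch):
    """Describe one character with the flags listed above (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", ch)
    # combining marks have a non-zero canonical combining class
    marks = [c for c in decomposed if unicodedata.combining(c)]
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return {
        "base": base.lower(),
        "is_upper": ch.isupper(),
        "is_numeric": ch.isnumeric(),
        "is_special": not ch.isalnum(),  # assumption: special means not alphanumeric
        "diacritics": [unicodedata.name(m) for m in marks],
    }

features = char_features("É")
```

Here `features["base"]` is the lowercase base letter and `features["diacritics"]` names the combining marks, which is the information needed to decide between a binary diacritic flag and one position per diacritic.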


The idea is:

Each character representation contains more information than a simple index. This should make the network's learning easier and give a way to convert between upper and lower case, with and without diacritics.

The composed code gives information about the presence or absence of a character (the sums) and about the order (the convolutions). This should give subspaces in which similarity and proximity analysis is easier.

The issue here is that maybe each subspace part should be processed in parallel, while getting some information from the other subspaces, instead of doing everything in one big neural network ....


The current assumptions are the following:

- The origin language is given by name, not detected
- The destination language is given by name, not detected
- For training, a destination vector will be checked either with similarity search (FAISS) or as a one-hot encoding, depending on the resource usage
- The input embeddings mapping will be pre-computed (as in the previous iteration) but the number of input elements will be bigger
- The tokenization will be greedy, meaning it will try to span the longest sequences first
- unknown input tokens should be tested with the following two encoding protocols:
  * only span the longest tokens possible
  * encode the entire symbol as per the compositional encoding protocol and let the network treat it as an unknown, but tag it as something semantically and grammatically
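
A minimal sketch of the greedy tokenization assumption, to make "span the longest sequences first" concrete. The function name, the toy vocabulary and the single-character fallback for unknown input are assumptions for illustration only.

```python
def greedy_tokenize(text, vocab, max_len):
    """Greedy longest-match split; unknown characters become single-char tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        # try the longest candidate first and shrink until one is in the vocabulary
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                pos += size
                break
    return tokens

vocab = {"un", "break", "able", "breakable"}
tokens = greedy_tokenize("unbreakable", vocab, max_len=9)
```

With this toy vocabulary the word is split as `un` + `breakable`, not `un` + `break` + `able`, because the longest match wins at each position.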
  

For the initial code, it would be nice to have a redundant code that keeps close elements close in the subspace, and also something to pull them apart enough that the sum of the subspaces stays clear in the compositionally encoded values.

Something like the multihot code for the separation part and the single-cycle-code for the proximity part.

Now let's compute the number of codepoints for the base generator code.

## Encoding Steps

1. Base Generator Code -> index based on a redundant single-cycle-code + multihot-prime-code
2. Single Char Basic Code
  - after NFKD normalization
  - includes whether it is uppercase/lowercase
  - whether it contains a diacritic/accidental (check if it is better to tell which one or just use a binary element)
  - whether it is a composed symbol (more than one char after the NFKD normalization)
  - whether it is a numeric element
  - it contains the basic code for the letter (closest ascii for example ... TODO clarify this)
3. Composed Code:
  - circ conv of Single Char Codes (dim*2)
  - sum of previous codes (dim)
  - circ conv of ascii representations (dim*2)
  - sum of ascii representations (dim)
  - is numeric | is alphanumeric | is all text (dim=3)
  - has diacritic/accidental (a position for each, with the vector size being the max length of the token, for example 5 or 10; or count the number of accidentals instead) (dim=2)
  - is all caps (instead of having a flag for each character) (dim=2)
  - starts with upper (dim=2)
  
This schema is not the simplest one and takes work to put in place, but it might (and this is what I hope) reduce the number of parameters and the training time.
  

The vector size for the embedding codes still has to be selected. I chose to work in the following ranges:
the single char code might be 48; composed codes should have a dimension of no more than 192, but 128 would be preferred.
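
As a quick check of these ranges, here is the arithmetic for the composed code, reading the `dim*2` and `dim` entries of step 3 literally and assuming the same `dim` for the single char codes and the ascii representations. The helper names are only for illustration.

```python
FLAG_DIMS = 3 + 2 + 2 + 2  # numeric class, diacritic, all caps, starts with upper

def composed_dim(dim):
    # two convolutions of size 2*dim plus two sums of size dim, plus the flags
    return 2 * (2 * dim) + 2 * dim + FLAG_DIMS

def max_char_dim(budget):
    # largest single char dimension whose composed code fits in the budget
    return (budget - FLAG_DIMS) // 6

sizes = {dim: composed_dim(dim) for dim in (19, 30, 48)}
limits = {budget: max_char_dim(budget) for budget in (128, 192)}
```

Read this way, a 48-dimensional character code gives a 297-dimensional composed code, and staying within 128 or 192 would need a character code of at most 19 or 30 dimensions, unless the convolutions are truncated or projected to a smaller size.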

Let's see the following code:

We need to represent at least 1619 characters for one of the selected character codes. (I'm trying to cut the number of dimensions to fit the current resources while keeping as much flexibility as possible; more work on this could bring more benefits, but I won't spend TOO MUCH more time on it.)

let's say we use the following code:

    multihot-code (3,5,11,13) -> max 2145 codepoints
    single-cycle-code (4,6,10,12) -> max 2880 codepoints
    is upper|is_lower (dim=2)
    contains_diacritic (dim=2)
    composed_symbol (dim=2)
    is_numeric|is_text|is_symbol (dim=3)
    ascii_converted_codepoint (transliteration + normalization + taking diacritics out) -> to reduce to maximum the lang 
    
    total_dimension = 3+5+11+13 + 4+6+10+12 + 2 + 2 + 2 + 3 + dim(ascii_converted_codepoint) = 73 + dim(ascii_converted_codepoint)
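
To check the capacity figures, here is a sketch of a multihot code built from residues. The layout (one one-hot segment per modulus, with the bit at `index % modulus` set) is an assumption about what `generate_multihot_prime_code` does, not its actual implementation.

```python
from math import prod

def multihot_code(index, moduli):
    """One-hot segment per modulus with the bit at index % modulus set (assumed layout)."""
    code = []
    for m in moduli:
        segment = [0] * m
        segment[index % m] = 1
        code.extend(segment)
    return tuple(code)

moduli = (3, 5, 11, 13)
capacity = prod(moduli)
codes = {multihot_code(i, moduli) for i in range(capacity)}

# the same layout with sizes that share factors repeats much earlier
cycle_codes = {multihot_code(i, (4, 6, 10, 12)) for i in range(2880)}
```

With pairwise coprime sizes every index below the product gets a distinct code, which gives the 2145 codepoints above. Under this same residue layout, sizes that share factors such as (4, 6, 10, 12) only produce 60 distinct codes (their least common multiple), so the 2880 figure for the single-cycle-code only holds if that code uses a different construction.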

In [27]:
import unidecode
import unicodedata
# import transliterate -> no, I need to know the language code for this .... not useful for character level

In [28]:
import transliterate  # imported only to list the supported language codes
transliterate.get_available_language_codes()

['el', 'l1', 'bg', 'sr', 'mk', 'ka', 'ru', 'uk', 'hy', 'mn']

In [70]:
# read the file with all the characters and do the transliteration, normalization
# and diacritic elimination to see the number of codes that remain at the end

fpath = "/home/leo/projects/Datasets/text/wiki-unicode/selected_sources_small/selected_chars.chars"
with open(fpath, "r") as f:
    chars = f.read()


In [71]:
chars = sorted(chars)

In [72]:
''.join(chars)

'\n!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳǴǵǶǷǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏəɼʒʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁ˂˃˄˅ˆˇˈˉˊˋˌˍˎˏːˑ˒˓˔˕˖˗˘˙˚˛˜˝˞˟ˠˡˢˣˤ˥˦˧˨˩˪˫ˬ˭ˮ˯˰˱˲˳˴˵˶˷˸˹˺˻˼˽˾˿ͰͱͲͳʹ͵Ͷͷͺͻͼͽ;Ϳ΄΅Ά·ΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ

In [73]:
len(chars)

1564

In [74]:
# from 
# https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

In [75]:
norm_chars = []
for c in chars:
    tc = unidecode.unidecode_expect_nonascii(c)
    nc = unicodedata.normalize('NFKD', c)
    ac = remove_accents(c)
    dec = unicodedata.decomposition(c)
    norm_chars.append((c, tc, nc, ac, dec))


In [76]:
len(norm_chars)

1564

In [77]:
norm_chars

[('\n', '\n', '\n', '\n', ''),
 ('!', '!', '!', '!', ''),
 ('"', '"', '"', '"', ''),
 ('#', '#', '#', '#', ''),
 ('$', '$', '$', '$', ''),
 ('%', '%', '%', '%', ''),
 ('&', '&', '&', '&', ''),
 ("'", "'", "'", "'", ''),
 ('(', '(', '(', '(', ''),
 (')', ')', ')', ')', ''),
 ('*', '*', '*', '*', ''),
 ('+', '+', '+', '+', ''),
 (',', ',', ',', ',', ''),
 ('-', '-', '-', '-', ''),
 ('.', '.', '.', '.', ''),
 ('/', '/', '/', '/', ''),
 ('0', '0', '0', '0', ''),
 ('1', '1', '1', '1', ''),
 ('2', '2', '2', '2', ''),
 ('3', '3', '3', '3', ''),
 ('4', '4', '4', '4', ''),
 ('5', '5', '5', '5', ''),
 ('6', '6', '6', '6', ''),
 ('7', '7', '7', '7', ''),
 ('8', '8', '8', '8', ''),
 ('9', '9', '9', '9', ''),
 (':', ':', ':', ':', ''),
 (';', ';', ';', ';', ''),
 ('<', '<', '<', '<', ''),
 ('=', '=', '=', '=', ''),
 ('>', '>', '>', '>', ''),
 ('?', '?', '?', '?', ''),
 ('@', '@', '@', '@', ''),
 ('A', 'A', 'A', 'A', ''),
 ('B', 'B', 'B', 'B', ''),
 ('C', 'C', 'C', 'C', ''),
 ('D', 'D', 'D', 'D', ''

In [78]:
norm_chars[-200:]

[('ΰ', 'u', 'ΰ', 'υ', '03B0'),
 ('ῤ', 'R', 'ῤ', 'ρ', '03C1 0313'),
 ('ῥ', 'R', 'ῥ', 'ρ', '03C1 0314'),
 ('ῦ', 'u', 'ῦ', 'υ', '03C5 0342'),
 ('ῧ', 'u', 'ῧ', 'υ', '03CB 0342'),
 ('Ῠ', 'U', 'Ῠ', 'Υ', '03A5 0306'),
 ('Ῡ', 'U', 'Ῡ', 'Υ', '03A5 0304'),
 ('Ὺ', 'U', 'Ὺ', 'Υ', '03A5 0300'),
 ('Ύ', 'U', 'Ύ', 'Υ', '038E'),
 ('Ῥ', 'R', 'Ῥ', 'Ρ', '03A1 0314'),
 ('῭', '"`', ' ̈̀', ' ', '00A8 0300'),
 ('΅', '"\'', ' ̈́', ' ', '0385'),
 ('`', '`', '`', '`', '0060'),
 ('ῲ', 'o', 'ῲ', 'ω', '1F7C 0345'),
 ('ῳ', 'o', 'ῳ', 'ω', '03C9 0345'),
 ('ῴ', 'o', 'ῴ', 'ω', '03CE 0345'),
 ('ῶ', 'o', 'ῶ', 'ω', '03C9 0342'),
 ('ῷ', 'o', 'ῷ', 'ω', '1FF6 0345'),
 ('Ὸ', 'O', 'Ὸ', 'Ο', '039F 0300'),
 ('Ό', 'O', 'Ό', 'Ο', '038C'),
 ('Ὼ', 'O', 'Ὼ', 'Ω', '03A9 0300'),
 ('Ώ', 'O', 'Ώ', 'Ω', '038F'),
 ('ῼ', 'O', 'ῼ', 'Ω', '03A9 0345'),
 ('´', "'", ' ́', ' ', '00B4'),
 ('῾', '`', ' ̔', ' ', '<compat> 0020 0314'),
 ('‘', "'", '‘', '‘', ''),
 ('’', "'", '’', '’', ''),
 ('‚', ',', '‚', '‚', ''),
 ('‛', "'",

I want to reduce the number of codes because that's important for my project (I have LIMITED hardware resources) and something smaller makes real sense (for example, cutting computation by 30-50%). I want composed codes of dimension 128 at most, so I need to cut a lot of things.

Smileys/emoticons should be replaced by their ascii equivalents instead (this already cuts a lot from the general set). Composed characters (a base plus accidentals/diacritics) should be built as a composition, which should already cut many points and reduce the dimension.
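
A toy illustration of why composition cuts the number of symbols: after NFKD a handful of accented letters share the same bases and combining marks. The sample string is an arbitrary choice, not taken from the character file.

```python
import unicodedata

accented = "àáâãäåèéêëñÀÉ"
symbols = set()
for ch in accented:
    # each precomposed letter contributes its base letter and its combining marks
    symbols.update(unicodedata.normalize("NFKD", ch))

bases = sorted(s for s in symbols if not unicodedata.combining(s))
marks = sorted(s for s in symbols if unicodedata.combining(s))
```

The thirteen letters reduce to five base letters plus six combining marks, and the saving grows as more letters reuse the same marks.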

In [79]:
single_char_codes = []

for c in chars:
    nc = unicodedata.normalize('NFKD', c)
    for i in nc:
        single_char_codes.append(i)
   
    
single_char_codes = sorted(list(set(single_char_codes).difference(set([ '҈',  '҉']))))
# added the explicit elimination of  '҈',  '҉', as I couldn't find where they came from in the file ... 

In [80]:
len(single_char_codes)

842

Now, this dimension is MUCH more acceptable, so every element must be based on this code size, which can cover MANY languages :).
I also took out characters from old scripts like old_cyrillic. This seems like a lot of progress ... BUT the difference is only 1 or 2 dimensions, so I don't know if it is worth the effort of trying to cut it even more: dim 73 to dim 70 for the basic part, without the ascii code or the sums and convolutions. Should I use only one type of code then? The single cycle one, with a big cycle so that the convolutions and sums are as sparse as possible?

The gain does not seem so great with respect to a more complete code ... what to do here?

The important thing, though, is that it does increase the sparsity of the code.

REMEMBER that the first 33 elements are reserved for special codes and replaced with those, so the encoding must make them available.

So we need to represent 816 + 33 = 849 basic symbols.


    multihot-code (7,11,13) -> max 1001 codepoints
    single-cycle-code (8,10,12) -> max 960 codepoints
    is upper|is_lower (dim=2)
    contains_diacritic (dim=2)
    composed_symbol (dim=2)
    is_numeric|is_text|is_symbol (dim=3)
    
    total_dimension = 7+11+13 + 8+10+12 + 2 + 2 + 2 + 3 = 70


I take out the ascii conversion, as it will be a pain due to the failures that appear with the library (or I'd have to deal with it manually).

In [427]:
first_symbols = []

for c in chars:
    nc = unicodedata.normalize('NFKD', c)
    for i in nc:
        first_symbols.append(i)
        first_symbols.append(i.lower())
        break
   
    
first_symbols = sorted(list(set(first_symbols).difference(set([ '҈',  '҉']))))

In [428]:
len(first_symbols)

836

In [429]:
all_chars = sorted(list(set(first_symbols + list(unicodedata.normalize('NFKD', ''.join(chars))))))

In [430]:
len(all_chars)

864

In [431]:
'  '.join(first_symbols)

'\n     !  "  #  $  %  &  \'  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z  [  \\  ]  ^  _  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  ¡  ¢  £  ¤  ¥  ¦  §  ©  «  ¬  ®  °  ±  ¶  ·  »  ¿  Æ  Ð  ×  Ø  Þ  ß  æ  ð  ÷  ø  þ  Đ  đ  Ħ  ħ  ı  ĸ  Ł  ł  Ŋ  ŋ  Œ  œ  Ŧ  ŧ  ƀ  Ɓ  Ƃ  ƃ  Ƅ  ƅ  Ɔ  Ƈ  ƈ  Ɖ  Ɗ  Ƌ  ƌ  ƍ  Ǝ  Ə  Ɛ  Ƒ  ƒ  Ɠ  Ɣ  ƕ  Ɩ  Ɨ  Ƙ  ƙ  ƚ  ƛ  Ɯ  Ɲ  ƞ  Ɵ  Ƣ  ƣ  Ƥ  ƥ  Ʀ  Ƨ  ƨ  Ʃ  ƪ  ƫ  Ƭ  ƭ  Ʈ  Ʊ  Ʋ  Ƴ  ƴ  Ƶ  ƶ  Ʒ  Ƹ  ƹ  ƺ  ƻ  Ƽ  ƽ  ƾ  ƿ  ǀ  ǁ  ǂ  ǃ  ǝ  Ǥ  ǥ  Ƕ  Ƿ  Ȝ  ȝ  Ƞ  ȡ  Ȣ  ȣ  Ȥ  ȥ  ȴ  ȵ  ȶ  ȷ  ȸ  ȹ  Ⱥ  Ȼ  ȼ  Ƚ  Ⱦ  ȿ  ɀ  Ɂ  ɂ  Ƀ  Ʉ  Ʌ  Ɇ  ɇ  Ɉ  ɉ  Ɋ  ɋ  Ɍ  ɍ  Ɏ  ɏ  ɓ  ɔ  ɖ  ɗ  ə  ɛ  ɠ  ɣ  ɦ  ɨ  ɩ  ɯ  ɲ  ɵ  ɹ  ɻ  ɼ  ʀ  ʁ  ʃ  ʈ  ʉ  ʊ  ʋ  ʌ  ʒ  ʕ  ʹ  ʺ  ʻ  ʼ  ʽ  ʾ  ʿ  ˀ  ˁ  ˂  ˃  ˄  ˅  ˆ  ˇ  ˈ  ˉ  ˊ  ˋ  ˌ  ˍ  ˎ  ˏ  ː  ˑ  ˒  ˓  ˔  ˕  ˖  ˗  ˞  ˟  ˥  ˦  ˧  ˨  ˩  ˪  ˫  ˬ  ˭  ˮ  ˯  ˰  ˱  ˲  ˳  ˴  ˵  ˶  ˷  ˸  ˹  ˺  ˻  ˼  ˽  ˾  ˿  Ͱ  ͱ  

In [432]:
# saving the file with the basic codepoints

fpath = "/home/leo/projects/Datasets/text/wiki-unicode/selected_sources_small/selected_chars_base.chars"
with open(fpath, "w") as f:
    f.write(''.join(first_symbols))
    

In [433]:
ascii_singlechar_codes = [unidecode.unidecode_expect_nonascii(c) for c in single_char_codes]

In [434]:
ascii_singlechar_codes

['\n',
 ' ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '{',
 '|',
 '}',
 '~',
 '!',
 'C/',
 'PS',
 '$?',
 'Y=',
 '|',
 'SS',
 '(c)',
 '<<',
 '!',
 '(r)',
 'deg',
 '+-',
 'P',
 '*',
 '>>',
 '?',
 'AE',
 'D',
 'x',
 'O',
 'Th',
 'ss',
 'ae',
 'd',
 '/',
 'o',
 'th',
 'D',
 'd',
 'H',
 'h',
 'i',
 'k',
 'L',
 'l',
 'ng',
 'NG',
 'OE',
 'oe',
 'T',
 't',
 'b',
 'B',
 'B',
 'b',
 '6',
 '6',
 'O',
 'C',
 'c',
 'D',
 'D',
 'D',
 'd',
 'd',
 '3',
 '@',
 'E',
 'F',
 'f',
 'G',
 'G',
 'hv',
 'I',
 'I',
 '

In [435]:
failed = [c for c in ascii_singlechar_codes if '[?]' in c]

In [436]:
len(failed)

105

The conversion to a single character seems to fail a lot: 105/842, ~12.5% of the cases, which might be a problem for the learning system (although I plan to set those to none so that the system needs to check the other characters ...).

In [437]:
3*5*11*13, 4*6*10*12, 2024

(2145, 2880, 2024)

In [438]:
def prime_factors(n):
    i = 2
    factors = []
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
            factors.append(i)
    if n > 1:
        factors.append(n)
    return factors

In [439]:

prime_factors(len(single_char_codes)+33)
# codes, char2int, int2char = create_base_codebook(single_char_codes, code_size=len(single_char_codes)+33)

[5, 5, 5, 7]

In [440]:
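import numpy as np

s = 3
code_size = 100
idx = np.arange(1, code_size + 1)
idx = idx % s
sc = np.zeros([code_size, s], dtype=bool)
# np.put indexes the flattened array, so this only sets flat positions 0..s-1 (the first row)
np.put(sc, idx, 1)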

The cycle code generator is WRONG and MUST be corrected; I'll just not use it for the moment.
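
One possible correction, assuming the intent is a one-hot row per index whose active position cycles with period `s`. The function name is hypothetical and the cycle starts at column 0, unlike the cell above, which starts at 1.

```python
import numpy as np

def single_cycle_code(code_size, period):
    """One-hot rows where the active column is the row index modulo `period`."""
    active = np.arange(code_size) % period
    code = np.zeros((code_size, period), dtype=bool)
    # index (row, column) pairs directly instead of writing into the flattened array
    code[np.arange(code_size), active] = True
    return code

sc = single_cycle_code(100, 3)
```

Each row now has exactly one active element and rows repeat every `period` indices. Several periods can be combined by concatenating the resulting matrices along the column axis.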



In [441]:
from collections import OrderedDict

from constants import *
from sparse_encoders import *

def create_base_codebook(charset, special_codes=SPECIAL_CODES, code_size=2145+33,
                    N=24,k=3,
                    subcode_list=(2,3,5,11,13), 
#                     cycle_list=(2, 3),  # (4,6,10,12),  # WARNING< DO NOT USE < bug in the cycle code generator
                    nul_row_is_zero=True, reserved_spaces=RESERVED_CODE_SPACE
                    ):
    """
    Build the base code matrices and the char <-> index mappings.

    :param charset: iterable with the characters to encode, placed right after the reserved indices
    :param special_codes: (char, index, alt_char) tuples that remap reserved indices to special symbols
    :param code_size: number of rows (code-points) to generate, reserved ones included
    :param N, k: parameters forwarded to sparse_code_Nk
    :param subcode_list: segment sizes forwarded to generate_multihot_prime_code
    :param nul_row_is_zero: if the first row (the NUL one) should be zeros or the given code
    :param reserved_spaces: the reserved spaces at the beginning of the codebook, 32 is the default as is the number of
    control codes in utf-8. This later is used for remapping reserved SPECIAL_CODES, IS 32
    :return: tuple (codes, char2int, int2char), where codes is the list of code matrices
    """
    # TODO this code is ugly but works with the right configuration, for the moment
    # TODO make the configuration selection automatic from some config points and the charset
    codes = [
        sparse_code_Nk(code_size, N, k),
        generate_multihot_prime_code(code_size, subcode_list),
#         create_single_cycle_code(code_size, cycle_list),  #this code generator is only for redundancy
        
    ]
    
    if nul_row_is_zero:
        # assume nul row is the first one
        for code in codes:
            code[0, :] = 0
    # create dict
    char2int = OrderedDict()
    int2char = OrderedDict()
    # add the number of reserved chars at the beginning
    for i in range(reserved_spaces):  # Warning, must be <128
        # use utf-8 codepoints
        c = str(bytes([i]), 'utf-8')
        char2int[c] = i
        # for the reverse mapping, to avoid issues on decoding, leave them unassigned UNASSIGNED='◁???▷'
        # could use UNK but I'd rather have it be obviously different, leaving unassigned is an issue
        int2char[i] = c   # UNASSIGNED
    # overwrite the indices of the reverse mapping for the special codes
    for c, i, c_alt in special_codes:
        # Take into account this will duplicate the char2int mapping having 2 chars and the alternative code
        # mapping to the same int
        char2int[c] = i
        # char2int[c_alt] = i
        # but the int reverse index will be overwritten
        int2char[i] = c

    for i, c in enumerate(list(charset)):
        # forward the index
        j = i + reserved_spaces
        char2int[c] = j
        int2char[j] = c

    # pickle all together
    codebook = (codes, char2int, int2char)
#     with open(ofname, 'wb') as f:
#         print("saving file {} with codes.shape {} | char2int {} | int2char {}".format(
#             ofname, codes.shape, len(char2int), len(int2char)))
#         pickle.dump(codebook, f, pickle.HIGHEST_PROTOCOL)
    return codebook


In [442]:
# codes, char2int, int2char = create_base_codebook(first_symbols, code_size=2880)
# codes, char2int, int2char = create_base_codebook(single_char_codes, code_size=len(single_char_codes)+33)
codes, char2int, int2char = create_base_codebook(all_chars, code_size=len(all_chars)+33)

In [443]:
for c in codes:
    print(c.shape)

(897, 24)
(897, 32)


In [444]:
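from convolutions import *
import torch
import torch.nn.functional as F
# on recent PyTorch versions these functions live in the torch.fft module (torch.fft.fft, torch.fft.ifft)
from torch import fft, ifft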


In [603]:
def get_code_item(c, codebook, padded_codebook, circ_padded_codebook, char2int):
    # convert to lowercase for the symbol representation
    # 
    c_len = len(c)
    c = unicodedata.normalize('NFKD', c)
    nc = c.lower()
    ac = remove_accents(nc)
    
    nc_vecs = [codebook[char2int[i]] for i in nc]
    ac_vecs = [codebook[char2int[i]] for i in ac]
    
    # padded version to be able to convolve later 
    nc_padded = [padded_codebook[char2int[i]] for i in nc]
    ac_padded = [padded_codebook[char2int[i]] for i in ac]
    
    nc_cpadded = [circ_padded_codebook[char2int[i]] for i in nc]
    ac_cpadded = [circ_padded_codebook[char2int[i]] for i in ac]
    
    # circular convolution -> keeps order of elements in token
    nc_conv = nc_padded[0] if len(nc_padded) > 0 else codebook[0]
    if len(nc_padded) > 1:
        for padded in nc_cpadded[1:]:
    #         print(padded.shape, padded.view((1,1,-1)).shape, nc_conv.shape)
            nc_conv = F.conv1d(padded.view((1,1,-1)), nc_conv.view((1,1,-1)))  # .view(padded.shape[0])
        
    ac_conv = ac_padded[0] if len(ac_padded) > 0 else codebook[0]
    if len(ac_padded) > 1:
        for padded in ac_cpadded[1:]:
            ac_conv = F.conv1d(padded.view((1,1,-1)), ac_conv.view((1,1,-1)))  # .view(padded.shape[0])
    
    # vector sum, keeps the values only but don't keep order
    
    nc_sum = nc_vecs[0]
    for v in nc_vecs[1:]:
        nc_sum = np.add(nc_sum, v)
        
    ac_sum = ac_vecs[0] if len(ac_vecs) > 0 else codebook[0]
    if len(ac_vecs) > 1:
        for v in ac_vecs[1:]:
            ac_sum = np.add(ac_sum, v)
    
    # case representation -> dim = 3
    islower_case = c.islower()
    isupper_case = c.isupper()
    notcase = not (islower_case or isupper_case)  # only true if is not all upper or lower
    # starts with uppercase or not -> dim = 2 10|01
    istitle = c.istitle()
    # if all elements are numeric (does not understand decimals) -> dim = 3
    isnum = c.isnumeric()  # takes into account other things like exponentials, japanese and chinese numeric characters
    isalnum = c.isalnum()  
    isalpha = c.isalpha()
    
    code_dict = {
        'token': c,  # Normalized NFKD token
        'complete_conv': nc_conv.view(-1),
        'non_accent_conv': ac_conv.view(-1),
        'complete_sum': nc_sum.view(-1),
        'non_accent_sum': ac_sum.view(-1),
        'casing': [isupper_case, islower_case, notcase, istitle, not istitle],
        'alnum': [isnum, isalnum, isalpha],
        'len': c_len,  # length 
    }
    
    return code_dict

In [604]:
len(char2int.values())

902

In [605]:
codematrix = np.concatenate(codes, axis=1)

codematrix = torch.from_numpy(codematrix).float()
# padded_codematrix = torch.zeros((codematrix.shape[0],codematrix.shape[1]*2))
# pad_dim = codematrix.shape[1] // 2
# padded_codematrix[:, pad_dim:-pad_dim] = codematrix
padded_codematrix = codematrix  # this is so the dimension does not explode
# circ_padded_codematrix = torch.cat([padded_codematrix, padded_codematrix], dim=1)
circ_padded_codematrix = torch.cat([codematrix, codematrix], dim=1)

In [606]:
codematrix.shape, padded_codematrix.shape, circ_padded_codematrix.shape

(torch.Size([897, 56]), torch.Size([897, 56]), torch.Size([897, 112]))

In [607]:
padded_codematrix.shape

torch.Size([897, 56])

In [608]:
len(char2int.keys())

902

In [609]:
'ɓ' in char2int, 'ɓ' in first_symbols, 'ɓ' in chars

(True, True, False)

In [610]:
char2int['ⱥ'], padded_codematrix.shape, circ_padded_codematrix.shape

(894, torch.Size([897, 56]), torch.Size([897, 112]))

In [611]:
chars[122:130]

['À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç']

In [612]:
charcodes = [get_code_item(c, codematrix, padded_codematrix, circ_padded_codematrix, char2int) for c in chars]
# charcodes = [get_code_item(c, codematrix, padded_codematrix, circ_padded_codematrix, char2int) for c in first_symbols]

For the moment I have issues with the circular convolution. I don't want to redo all the math to implement it again, so I tried different methods, but:

- the numpy fft api does not work as expected
- the old pytorch_fft API is deprecated and does not work on new pytorch
- the new torch fft API does not work as expected either
- the torch padding api works only for 3D, 4D and 5D, not for 1D or 2D
- the manual implementation does not work either


NotImplementedError: Only 3D, 4D, 5D padding with non-constant padding are supported for now
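
For reference, a small sketch of circular convolution for two equal-length real vectors, once through the FFT and once from the definition, so the two can be compared. This is a standalone illustration in numpy, not a drop-in replacement for the code in `get_code_item`.

```python
import numpy as np

def circular_convolution(a, b):
    """Circular convolution of two equal-length real vectors via the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def circular_convolution_direct(a, b):
    """Same operation from the definition: out[k] = sum_j a[j] * b[(k - j) % n]."""
    n = len(a)
    return np.array([sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)])

a = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0, 1.0])
fast = circular_convolution(a, b)
slow = circular_convolution_direct(a, b)
```

Two things are worth keeping in mind when comparing with `get_code_item`. First, `F.conv1d` computes a cross-correlation (the kernel is not flipped), and with a length-112 input and a length-56 kernel it returns 57 values, which appears to be why `complete_conv` has one more element than `non_accent_conv` in the output of cell 613. Second, circular convolution is commutative, so on its own it gives the same result for "ab" and "ba"; the order information has to come from something else, such as a position-dependent shift or the non-commutative correlation.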


In [613]:
charcodes[122:130]

[{'token': 'À',
  'complete_conv': tensor([3., 3., 1., 0., 1., 0., 2., 2., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
          1., 1., 0., 0., 0., 2., 2., 2., 1., 1., 1., 1., 1., 2., 3., 1., 1., 1.,
          0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 2., 1., 1., 0., 1., 0.,
          1., 3., 3.]),
  'non_accent_conv': tensor([1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 0.]),
  'complete_sum': tensor([1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 2., 2., 0., 0., 0., 0., 0., 1., 1., 2., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 1.]),
  'non_accent_sum': tensor([1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0.,

Well, now it seems to be working for this. What I need to do is some normalization to put the ranges between 0 and 1 ... how to normalize it?

Should I leave the "normalization" to the first network layers directly?

Should I do a transformation like tanh? Like what?
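
Two candidate transformations, sketched on plain lists to compare their behaviour. Neither is a decision; they are only the two options mentioned above written out, under the assumption that all entries are non-negative counts.

```python
import math

def scale_by_max(values):
    """Divide by the largest entry so the result lies in [0, 1] (inputs are non-negative)."""
    peak = max(values)
    return [v / peak for v in values] if peak > 0 else list(values)

def squash_tanh(values):
    """Saturating alternative: 0 stays 0 and large counts approach 1."""
    return [math.tanh(v) for v in values]

counts = [3.0, 3.0, 1.0, 0.0, 2.0]
scaled = scale_by_max(counts)
squashed = squash_tanh(counts)
```

Scaling by the maximum keeps the ratios between entries but makes the result depend on the token, while `tanh` is independent of the token but compresses the difference between counts of 2 and 3.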

In [601]:
cc122 = charcodes[122]

In [602]:
cc122

{'token': 'À',
 'complete_conv': tensor([[[3., 3., 1., 0., 1., 0., 2., 2., 1., 1., 0., 0., 1., 0., 0., 0., 0.,
           0., 1., 1., 0., 0., 0., 2., 2., 2., 1., 1., 1., 1., 1., 2., 3., 1.,
           1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 2., 1., 1.,
           0., 1., 0., 1., 3., 3.]]]),
 'non_accent_conv': tensor([1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
         0., 0.]),
 'complete_sum': tensor([1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 2., 2., 0., 0., 0., 0., 0., 1., 1., 2., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
         0., 1.]),
 'non_accent_sum': tensor([1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 1., 

In [555]:
for c in cc122.values():
    print(len(c))

1
56
56
56
5
3


In [142]:
[char2int[i] for i in nc]

[748, 760]

In [146]:
len(codes)

3

In [170]:
missing = '̈'

In [171]:
missing in char2int

False

In [173]:
missing in single_char_codes

True
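
A small helper to list every symbol that a set of tokens needs after NFKD normalization and lowercasing but that has no index in the mapping, like the combining mark checked above. `missing_symbols` and the toy mapping are hypothetical; with the real data it would be called with `chars` and `char2int`.

```python
import unicodedata

def missing_symbols(tokens, char2int):
    """Return the NFKD characters of `tokens` that have no index in `char2int`."""
    missing = set()
    for token in tokens:
        # same preprocessing as get_code_item: normalize first, then lowercase
        for ch in unicodedata.normalize("NFKD", token).lower():
            if ch not in char2int:
                missing.add(ch)
    return sorted(missing)

toy_char2int = {"a": 0, "e": 1, "\u0301": 2}
gaps = missing_symbols(["Á", "é", "ä"], toy_char2int)
```

In the toy case only the combining diaeresis (U+0308) is reported, since the acute accent and both base letters are already mapped.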