# Multihot study fitting in 64 bits

This study tries to reduce the number of embedding dimensions and continues from the points covered in multihot_study_simple.

The idea is to cut down from the embedding dimension of 324 to something much more manageable.

So I want UTF8ed (the UTF-8 embedding dimension) to be no more than 64 bits. Why? Just because.

The lower limit of the embedding would be 32 bits (the maximum length of a UTF-8 code).

$ 32 \le UTF8ed \le 64 $


For this I basically want to use the following number of codes: ${N\choose k}$

Where $ 32 \le N \le 64 $
and $ k $ should be kept as small as possible, to make the vector as sparse as possible.

I would also like to add some verification or checking elements, which should also be given more importance; for example, the first 4 elements should indicate which UTF-8 segment is being used. This implies $ 32 \le N \le 60 $



The value $k$ should be around the $k^{th}$ root of the product of the first $k$ parts of $N!$

So I will try some values for k

In [1]:
import numpy as np
import itertools
from itertools import combinations

In [2]:
# as a first experiment I would like to see how many 

# the number of items that need to be included in the coding scheme:
ncodes = 1112064  # number of valid codes in UTF-8 per Wikipedia page


In [3]:
list(range(32,32-4,-1))

[32, 31, 30, 29]

In [4]:
list(range(1,4+1))

[1, 2, 3, 4]

In [5]:
# find the values of N for which C(N, k) exceeds ncodes, trying k=4 first and then k=5
for N in range(32,64):
    for k in [4,5]:
        # C(N, k) = N*(N-1)*...*(N-k+1) / k!
        v = np.prod(list(range(N,N-k,-1))) / np.prod(list(range(1,k+1)))
        if v > ncodes:
            print("N={},k={}".format(N,k))
            break
    

N=45,k=5
N=46,k=5
N=47,k=5
N=48,k=5
N=49,k=5
N=50,k=5
N=51,k=5
N=52,k=5
N=53,k=5
N=54,k=5
N=55,k=5
N=56,k=5
N=57,k=5
N=58,k=5
N=59,k=5
N=60,k=5
N=61,k=5
N=62,k=5
N=63,k=5


so the values are $ N \ge 45 ; k \ge 5 $
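The search above can be cross-checked with exact integer arithmetic. This is only a minimal sketch, assuming Python 3.8+ for `math.comb`; the helper name `min_n_for_k` and the `n_max` parameter are illustrative and not part of the original notebook.

```python
from math import comb

NCODES = 1112064  # number of valid UTF-8 codes, same value as ncodes above

def min_n_for_k(k, ncodes=NCODES, n_max=64):
    """Return the smallest N <= n_max with C(N, k) >= ncodes, or None."""
    for n in range(k, n_max + 1):
        if comb(n, k) >= ncodes:
            return n
    return None

for k in (4, 5, 6):
    n = min_n_for_k(k)
    # prints k, the smallest usable N (or None) and the resulting codebook size
    print(k, n, comb(n, k) if n is not None else None)
```

With `k=4` no value of N up to 64 is large enough, which is why the loop above only ever reports `k=5`.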

This means that for a code of dimension 64 I can use a one-hot over the first 4 elements to indicate the UTF-8 plane segment, and there are still 15 elements left to signal other things (such as a positional embedding or an error correction code, for example).

So I decide to create a code of dimension $ N=49 $ and leave the rest of the space for a positional embedding or something else (64 would be great for grouped convolution features, while 49 is only divisible by 7).

Of these 49 elements, the only available values will be $0$ and $1$. The first 4 elements will be selected according to the plane segment used in UTF-8, and the remaining 45 should identify the complete selection (this adds redundancy but also makes things clearer).

In [6]:
list(combinations(list(range(5)), 2))

[(0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 3),
 (2, 4),
 (3, 4)]

So, basically, I have to do something like the following:

- generate all combinations of ${45\choose 5}$
- assign an index to each one
- convert all of that to numpy vectors of size 45
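The steps above can be sketched on a small case. This is an illustrative variant that uses NumPy fancy indexing instead of flattened indices; `multihot_codebook` is a hypothetical helper name, and the sizes (5 choose 2) are chosen only to keep the output readable.

```python
from itertools import combinations
import numpy as np

def multihot_codebook(n, k):
    """Build the C(n, k) x n binary matrix with one row per k-subset."""
    idx = np.array(list(combinations(range(n), k)))  # shape (C(n, k), k)
    book = np.zeros((idx.shape[0], n), dtype=np.uint8)
    rows = np.arange(idx.shape[0])[:, None]          # broadcasts against idx
    book[rows, idx] = 1                              # set k ones per row
    return book

small = multihot_codebook(5, 2)
print(small.shape)
print(small[:3])
```

The row number is the index assigned to each combination, so the ordering follows `itertools.combinations`.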

In [7]:
def get_all_combinations(N,k):
    ret = combinations(list(range(N)),k)  #iterator
    return ret



In [8]:
all_combs = get_all_combinations(45,5)

In [9]:
indices = np.array(list(all_combs))

In [10]:
indices.shape

(1221759, 5)

In [11]:
indices[:5]

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 5],
       [0, 1, 2, 3, 6],
       [0, 1, 2, 3, 7],
       [0, 1, 2, 3, 8]])

In [12]:
embeds = np.zeros([indices.shape[0], 45])

In [13]:
embeds.shape

(1221759, 45)

In [14]:
# numpy.put works with indices as if the array is flattened so I have to work on that
lin_indices = np.array(list(range(embeds.shape[0])))

In [15]:
lin_indices = lin_indices.reshape([-1,1])

In [16]:
lin_indices.shape

(1221759, 1)

In [17]:
lin_indices[:20]

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16],
       [17],
       [18],
       [19]])

In [18]:
flat_indices =  (lin_indices*45)+indices

In [19]:
flat_indices[:10]

array([[  0,   1,   2,   3,   4],
       [ 45,  46,  47,  48,  50],
       [ 90,  91,  92,  93,  96],
       [135, 136, 137, 138, 142],
       [180, 181, 182, 183, 188],
       [225, 226, 227, 228, 234],
       [270, 271, 272, 273, 280],
       [315, 316, 317, 318, 326],
       [360, 361, 362, 363, 372],
       [405, 406, 407, 408, 418]])

In [20]:
embeds.put(flat_indices,[1])

In [21]:
embeds[-4:]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.]])

This covers a complete codebook. Now the issue might be the distance between two elements of the code. In this case the distance is quite small, so I can add some extra dimensions that increase the distance between vectors.
Maybe what I can do is actually use the next 15 dimensions (to fill up to 64 dimensions), so something might come out of it.

I thought about several methods, especially Forward Error Correction codes like Turbo codes, LDPC and Reed-Solomon. Other error detection codes (those that use parity) are not necessarily useful, as the parity will always be the same across the codebook by construction (which is another nice property). Another issue is that many codes (like Golay or Hamming) have fixed message sizes, which do not match the needs of the codes here.

So basically what needs to be done is to increase the distance between two elements, which can be done easily.

In this case I can do that with an easy trick that increases the distance between the points: use several one-hot codes, like the ones used in the previous codebook I worked on.
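The two properties mentioned above (constant parity and a small minimum distance) can be checked on a toy codebook. This is a sketch under the assumption that an (8 choose 3) codebook behaves like the (45 choose 5) one in these respects; the variable names are illustrative.

```python
from itertools import combinations
import numpy as np

n, k = 8, 3
idx = np.array(list(combinations(range(n), k)))
book = np.zeros((len(idx), n), dtype=bool)
book[np.arange(len(idx))[:, None], idx] = True

# every codeword has exactly k ones, so the parity is the same for all rows
parities = book.sum(axis=1) % 2

# pairwise Hamming distances in bits, computed with XOR
dists = (book[:, None, :] ^ book[None, :, :]).sum(axis=2)
min_nonzero = int(dists[dists > 0].min())

print(set(parities.tolist()), min_nonzero)
```

Two codewords with the same number of ones can only differ in an even number of positions, so the smallest possible distance between distinct codewords is 2 bits.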

In [22]:
# arr = [3,5,7,11,13,17,19,23]
arr = [5,7,11,13,17,19,23]

In [23]:
np.prod(arr), np.sum(arr)

(37182145, 95)

The issue with this is that the dimensionality grows beyond the limit I fixed for myself.

So I can use the same technique, but with the 15 elements I have left within the maximum dimension that I fixed (just because I wanted to).

Note that all these decisions on dimensionality are completely arbitrary; I am adding constraints just for the sake of cutting down the number of operations and trainable parameters.

The idea of having a fixed codebook is to get free of it later.

So this extra code of 15 elements will be created so that all the cycle lengths are pairwise coprime (this increases the distance between vectors over the cycles). The easiest way of selecting coprimes is selecting prime numbers, and the sequence $[3,5,7]$ has the nice property that it sums to 15, which is exactly the space I allowed myself for this.
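The effect of choosing pairwise coprime cycle lengths can be illustrated directly: the concatenated one-hot pattern only repeats every $3 \cdot 5 \cdot 7 = 105$ indices. This is a minimal sketch; `residue_code` and `MODULI` are hypothetical names, and the code builds each row explicitly rather than tiling identity matrices.

```python
import numpy as np

MODULI = (3, 5, 7)  # pairwise coprime cycle lengths, summing to 15

def residue_code(i, moduli=MODULI):
    """Concatenate one-hot encodings of i modulo each cycle length."""
    parts = []
    for m in moduli:
        one_hot = np.zeros(m, dtype=np.uint8)
        one_hot[i % m] = 1
        parts.append(one_hot)
    return np.concatenate(parts)

codes = np.array([residue_code(i) for i in range(210)])
period = int(np.prod(MODULI))

print(codes.shape, period)
# number of distinct rows within one full period
print(len(np.unique(codes[:period], axis=0)))
# extra Hamming distance (in bits) added between consecutive indices
print(int((codes[0] != codes[1]).sum()))
```

Consecutive indices differ in all three residues, so this part of the code adds distance exactly where the combination-based part changes the least.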

In [24]:
eyes = np.eye(3), np.eye(5), np.eye(7)

In [25]:
eyes[0].repeat(4)

array([1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 1.])

In [26]:
np.tile(eyes[0],(3,1))

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [27]:
rep3, rep5, rep7 = int(np.ceil(embeds.shape[0]/3.)), int(np.ceil(embeds.shape[0]/5.)), int(np.ceil(embeds.shape[0]/7.))

In [28]:
reps = [rep3,rep5,rep7]
reps

[407253, 244352, 174537]

And now I build the codebook

In [29]:
tiles = []
for e,r in zip(eyes, reps):
    t = np.tile(e, [r,1])[:embeds.shape[0],:]
    tiles.append(t)

In [30]:
[t.shape for t in tiles]

[(1221759, 3), (1221759, 5), (1221759, 7)]

In [31]:
code15 = np.concatenate(tiles,axis=1)

In [32]:
code15.shape

(1221759, 15)

In [33]:
embeds45 = np.concatenate([embeds,code15],axis=1)

In [34]:
embeds45.shape

(1221759, 60)

In [35]:
embeds45bool = np.array(embeds45, dtype=bool)

Now I want to compute the distances between vectors, just to know about them, but the size of the data makes the full distance matrix too big and out-of-memory errors appear, so I'll split the codebook into chunks to get it done.
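The chunking idea can be sketched with NumPy alone on a toy codebook. This is an assumption-laden illustration: it counts differing bits with XOR instead of calling `cdist`, and `min_nonzero_hamming` and `n_chunks` are hypothetical names.

```python
from itertools import combinations
import numpy as np

def min_nonzero_hamming(book, n_chunks=4):
    """Smallest non-zero Hamming distance (in bits), computed chunk by chunk."""
    best = book.shape[1] + 1  # larger than any possible distance
    for chunk in np.array_split(book, n_chunks):
        # (rows, chunk_rows) matrix of differing-bit counts
        d = (book[:, None, :] ^ chunk[None, :, :]).sum(axis=2)
        nz = d[d > 0]  # drop the zero distance of each vector to itself
        if nz.size:
            best = min(best, int(nz.min()))
    return best

idx = np.array(list(combinations(range(10), 3)))
toy = np.zeros((len(idx), 10), dtype=bool)
toy[np.arange(len(idx))[:, None], idx] = True

print(min_nonzero_hamming(toy))
```

Only one chunk of the distance matrix is held in memory at a time, which is the same trade-off the splits below make.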

In [36]:
splits = np.array_split(embeds45bool, 1000)

In [37]:
splits[0][:2]

array([[ True,  True,  True,  True,  True, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
         True, False, False,  True, False, False, False, False,  True,
        False, False, False, False, False, False],
       [ True,  True,  True,  True, False,  True, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False,  True, False, False, False, False,
         True, False, False, False, False, False]])

In [38]:
from scipy.spatial.distance import cdist,pdist, hamming
# from scipy.spatial import distance
dd = cdist(embeds45bool,splits[0][:2], metric='hamming')
# pp = pdist(embeds45bool[:10,:],splits[0][:2])
hh = hamming(embeds45bool[0], splits[0][3])

In [39]:
splits[0].shape

(1222, 60)

In [40]:
embeds45bool[0],embeds45[0],

(array([ True,  True,  True,  True,  True, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
         True, False, False,  True, False, False, False, False,  True,
        False, False, False, False, False, False]),
 array([1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0.]))

In [41]:
dd.shape, hh.shape

((1221759, 2), ())

In [42]:
hh

0.1

In [43]:
ddf = dd.flat

In [44]:
ddf[ddf>0]

array([0.13333333, 0.13333333, 0.13333333, ..., 0.23333333, 0.26666667,
       0.26666667])

In [45]:
np.min(ddf[ddf>0])

0.03333333333333333

In [46]:
np.min(dd)

0.0

The next experiment should not be run lightly, as it is heavy and time consuming: one run takes about 138 seconds of wall time, so at about 140 s each I estimate about 39 hours (between one and a half and two days) of runtime on my computer. I cannot parallelize more due to memory limits, as I only have 64GB.

In [47]:
# %%time
# isplits = splits[:5]
# # # from scipy.spatial.distance import cdist
# # # # cdist(XA, XB, metric='euclidean', p=2, V=None, VI=None, w=None)
# # # maxd, mind, 
# diststats = []
# for s in isplits: 
#     fdist = cdist(embeds45bool,s, metric='hamming').flat
#     nzfdist = fdist[fdist>0]  # eliminate from the elements the zero distances (distance to itself)
#     # save stat values
#     diststats.append( (np.min(nzfdist), np.max(nzfdist), np.median(nzfdist), np.std(nzfdist) ))
    


In [48]:
# diststats

In [49]:
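# note: the vectors compared above have 60 dimensions, so multiplying these Hamming fractions by 60 (not 45) would give bit counts
np.array([0.03333333333333333 , 0.26666666666666666 , 0.23333333333333334]) * 45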

array([ 1.5, 12. , 10.5])

In [50]:
# %%time
# splits2 = np.array_split(embeds45, 1000)
# isplits = splits2[:5]
# # # from scipy.spatial.distance import cdist
# # # # cdist(XA, XB, metric='euclidean', p=2, V=None, VI=None, w=None)
# # # maxd, mind, 
# diststats2 = []
# for s in isplits: 
#     fdist = cdist(embeds45,s, metric='hamming').flat
#     nzfdist = fdist[fdist>0]  # eliminate from the elements the zero distances (distance to itself)
#     # save stat values
#     diststats2.append( (np.min(nzfdist), np.max(nzfdist), np.median(nzfdist), np.std(nzfdist) ))
    


In [51]:
# diststats2

Now I'll do the segment coding. This adds an extra 4 elements that encode each segment and the special tokens.

It will be a one-hot encoding for the segments and all ones when the code is a special token, and for those I can use the **UTF-8 private area**.

From the previous study, the indices for the codes are:

In [52]:
print("indices for the segments: ", 0, 128, (128 + 2**5 * 2**6), (128 + 2**4 * (2**6)**2), (128 + 2**3 * (2**6)**3) )

indices for the segments:  0 128 2176 65664 2097280


So what I need are a few elements at some point to use as private values.

In the case of UTF-8 there are unused values that I can use for this purpose, or I can add some extra values at the beginning.



In [53]:
# the segment indicator vector will be  shape (embeds.shape[0], 4)
segind = np.zeros((embeds.shape[0], 4))


In [54]:
segind.shape

(1221759, 4)

I'll use the last 5 codes as special codes; these will be set for the following elements:
* \<error>  $last$
* \<start> $last-1$
* \<stop> $last-2$
* \<unknown> $last-3$
* \<null> $last-4$


Other elements might be needed, but as the encoding is much bigger than the complete UTF-8 space I'll be able to add them later if the need arises.


Special codes have the segment indicator part set to *1111*

In [55]:
segind[-5:] = 1

In [56]:
segind[-6:]

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [57]:
# here is where the pre-computed indices are of use
# 0 128 2176 65664 2097280
segind[:128] = np.array([0,0,0,1])
segind[128:2176] = np.array([0,0,1,0])
segind[2176:65664] = np.array([0,1,0,0])
# segind[65664:] = np.array([1,0,0,0])
segind[65664:-113854] = np.array([1,0,0,0])  # where 113855 is the number of special codes that fit in this coding but I leave one for margin
segind[-6:] = 1

In [58]:
segind[120:130]

array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])

In [59]:
segind[2170:2180]

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]])

In [60]:
segind[65660:65670]

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])

In [61]:
segind[-10:]

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
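The slice assignments above can also be expressed as a lookup from code index to segment. This is a sketch only: `BOUNDS`, `SEG_ONE_HOT` and `segment_of` are hypothetical names, the boundaries are the ones printed earlier, and the special-token rows (all ones) are not handled here.

```python
import numpy as np

# exclusive upper bounds of the first three segments
BOUNDS = np.array([128, 2176, 65664])
# one-hot rows in the same bit order as the segind assignments above
SEG_ONE_HOT = np.array([[0, 0, 0, 1],
                        [0, 0, 1, 0],
                        [0, 1, 0, 0],
                        [1, 0, 0, 0]])

def segment_of(indices):
    """Map code indices to segment numbers 0..3."""
    return np.searchsorted(BOUNDS, indices, side="right")

sample = np.array([0, 127, 128, 2175, 2176, 65663, 65664])
segs = segment_of(sample)
print(segs)
print(SEG_ONE_HOT[segs])
```

Using `side="right"` makes each boundary value belong to the next segment, matching the half-open slices `[:128]`, `[128:2176]` and `[2176:65664]`.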

Now I can create the complete codebook:

In [62]:
embeds64 = np.concatenate([embeds45,segind],axis=1)

In [63]:
embeds64.shape

(1221759, 64)

Now I have all the codes; this should be enough for many things. Nevertheless, even if this encoding can capture everything, the decoding part as well as the learning might prove problematic.
One-hot is quite nice for learning and decoding, while this encoding will need some other techniques for decoding and measuring loss (cosine similarity, for example; using faiss might be an option).
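As an illustration of what similarity-based decoding could look like, here is a brute-force sketch on a toy (8 choose 3) codebook. It is not the method used later in this study; `decode` is a hypothetical helper, and a library such as faiss would replace the exhaustive search at full scale.

```python
from itertools import combinations
import numpy as np

idx = np.array(list(combinations(range(8), 3)))
book = np.zeros((len(idx), 8), dtype=np.float32)
book[np.arange(len(idx))[:, None], idx] = 1.0

def decode(vec, codebook):
    """Return the row index with the highest cosine similarity to vec."""
    norms = np.linalg.norm(codebook, axis=1) * np.linalg.norm(vec)
    sims = (codebook @ vec) / norms
    return int(np.argmax(sims))

target = 17
# deterministic small perturbation standing in for a noisy network output
noisy = book[target] + np.linspace(-0.2, 0.2, 8).astype(np.float32)
print(decode(noisy, book) == target)
```

Because every codeword has the same norm, ranking by cosine similarity is equivalent to ranking by dot product with the codebook.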


There is another way of encoding this: try to maximize the distance between elements in the *SAME* UTF-8 code segment. This could be more beneficial, as most of the content of one text or language should (mostly) be in the same segment while (maybe) having a few words or codes from the other ones (exceptions would be the punctuation and emoticon codes). For the moment, though, I'll just create my codes as they are and be done with it.

In [64]:
from utf8_encoder import *

In [65]:
tables = create_tables(segments=4)

number of codes =  1107904
number of code_exceptions =  790656


In [66]:
len(tables)

5

In [67]:
_, _, _, char2idx, idx2char = tables

In [68]:
type(char2idx)

collections.OrderedDict

In [69]:
# if we check the number of codes generated is
len(char2idx), len(idx2char)

(1107904, 1107904)

which is less than: 1221759

In [70]:
1221759 - 1107904

113855

In [71]:
# what I want to do now is to save the coding but for that I need to add the special characters, 
# <err> (error) last = 1221758
# <start> last-1 = 1221757
# <stop>  last-2 = 1221756
# <unk> (unknown) last-3 = 1221755
# <null> last-4 = 1221754

# char2idx["<err>"] = 1221758
# char2idx["<start>"] = 1221757
# char2idx["<stop>"] = 1221756
# char2idx["<unk>"] = 1221755
# char2idx["<null>"] = 1221754

# idx2char[1221758] = "<err>"
# idx2char[1221757] = "<start>"
# idx2char[1221756] = "<stop>"
# idx2char[1221755] = "<unk>"
# idx2char[1221754] = "<null>"

# eslen = len(embeds64)
# char2idx["<err>"] = eslen-1
# char2idx["<start>"] = eslen-2
# char2idx["<stop>"] = eslen-3
# char2idx["<unk>"] = eslen-4
# char2idx["<null>"] = eslen-5

# idx2char[eslen-1] = "<err>"
# idx2char[eslen-2] = "<start>"
# idx2char[eslen-3] = "<stop>"
# idx2char[eslen-4] = "<unk>"
# idx2char[eslen-5] = "<null>"



In [72]:
embeds64bool = np.array(embeds64, dtype=bool)

In [73]:
# list(char2idx.items())[:100]

In [74]:
# list(idx2char.items())[:100]

In [75]:
# list(embeds64[[0,120,240,360,480,600,720,840,960,1080,1200,1320]])

In [76]:
# and now SAVE all the codes
# save_obj(char2idx, "multihot64-char2idx")
# save_obj(idx2char, "multihot64-idx2char")
# save_obj(embeds64, "multihot64-embeds")
# save_obj(embeds64bool, "multihot64-embeds-bool")

In [77]:
ls -lh

total 280K
drwxr-xr-x 5 leo leo 4,0K sept. 27 14:22 adaptive-span/
-rw-r--r-- 1 leo leo  42K oct.   3 17:57 ecc_study_simple.ipynb
-rw-rw-r-- 1 leo leo 7,9K sept. 21 21:16 ilustrated_transformer.py
-rw-r--r-- 1 leo leo    0 juil. 30 11:41 __init__.py
drwxrwxr-x 4 leo leo 4,0K oct.  31 16:37 langmodels/
-rw-r--r-- 1 leo leo  42K oct.  31 16:46 multihot-64_study.ipynb
-rw-r--r-- 1 leo leo 5,1K oct.  31 16:46 multihot_faiss-test.ipynb
-rw-r--r-- 1 leo leo  49K oct.  30 19:31 multihot-small_study.ipynb
-rw-r--r-- 1 leo leo  39K sept. 10 22:11 multihot_study_simple.ipynb
-rw-r--r-- 1 leo leo 2,8K sept. 20 02:50 nextchar.py
-rw-rw-r-- 1 leo leo 1,8K sept. 21 21:58 position_coding.py
drwxr-xr-x 2 leo leo 4,0K oct.   4 18:11 __pycache__/
-rw-r--r-- 1 leo leo  13K oct.   3 15:35 tcn.py
drwxr-xr-x 2 leo leo 4,0K oct.  30 19:06 utf8-codes/
-rw-r--r-- 1 leo leo 9,5K oct.   3 16:34 utf8_encoder.py
-rw-rw-r-- 1 leo leo    0 sept. 10 22:06 utf8_mult

The codebook is a bit big, so I'll cut out the part that is NOT used and leave just a few places for special codes; the rest can be dropped.

In [78]:
#eliminate the values that we'll not use and keep the most distanced objects for special use
embeds64short = np.concatenate([embeds64[:-113855], embeds64[-6:]], axis=0)
# char2idxshort = np.concatenate([char2idx[:-113854], char2idx[-6:]], axis=0)
# idx2charshort = np.concatenate([idx2char[:-113854], idx2char[-6:]], axis=0)

In [79]:
embeds64short.shape

(1107910, 64)
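The same trimming can be written with explicit counts instead of negative offsets, which may be easier to audit. This is a sketch on a toy array; `trim_codebook`, `n_used` and `n_special` are hypothetical names.

```python
import numpy as np

def trim_codebook(book, n_used, n_special):
    """Keep the first n_used rows and the last n_special rows."""
    return np.concatenate([book[:n_used], book[-n_special:]], axis=0)

toy = np.arange(20).reshape(10, 2)
short = trim_codebook(toy, n_used=4, n_special=2)
print(short.shape)
print(short)
```

For the codebook above this would correspond to `n_used=1107904` and `n_special=6`, which gives the 1107910 rows shown in the output.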

In [80]:
eslen = len(embeds64short)
char2idx["<err>"] = eslen-1
char2idx["<start>"] = eslen-2
char2idx["<stop>"] = eslen-3
char2idx["<unk>"] = eslen-4
char2idx["<null>"] = eslen-5

idx2char[eslen-1] = "<err>"
idx2char[eslen-2] = "<start>"
idx2char[eslen-3] = "<stop>"
idx2char[eslen-4] = "<unk>"
idx2char[eslen-5] = "<null>"


In [95]:
embeds64short = np.array(embeds64short, dtype='float32')
embeds64shortbool = np.array(embeds64short, dtype=bool)

In [82]:
# del idx2char[1221758]
# del idx2char[1221757]
# del idx2char[1221756]
# del idx2char[1221755]
# del idx2char[1221754]


In [83]:
# del(char2idx)
# del(idx2char)

In [96]:
embeds64short.dtype

dtype('float32')

In [84]:
len(char2idx), len(idx2char), embeds64short.shape

(1107909, 1107909, (1107910, 64))

Now I do some verification of the elements to be sure that all goes OK

In [85]:
aidx = set(range(embeds64short.shape[0]))

In [86]:
cidx = set(char2idx.values())

In [87]:
idxc = set(idx2char.keys())

In [88]:
len(idxc.intersection(cidx))  # intersection OK

1107909

In [89]:
idxc.difference(cidx), cidx.difference(idxc)

(set(), set())

This set should contain 1 unused value (a spare special-token slot). This is by construction, to leave some room in case I need it without having to change the codebook; I would just add it to the dictionary assignment.

In [90]:
idxc.difference(aidx), aidx.difference(idxc)

(set(), {1107904})
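The set-based checks above can be bundled into a single function that also verifies that the two tables are inverses of each other. This is a sketch on toy dictionaries; `check_tables` is a hypothetical helper, not part of `utf8_encoder`.

```python
def check_tables(char2idx, idx2char, n_rows):
    """Return (is_inverse, unused_rows) for a pair of lookup tables."""
    is_inverse = (
        len(char2idx) == len(idx2char)
        and all(idx2char.get(i) == c for c, i in char2idx.items())
    )
    # codebook rows that no character or special token points to
    unused = sorted(set(range(n_rows)) - set(idx2char))
    return is_inverse, unused

toy_c2i = {"a": 0, "b": 1, "<unk>": 3}
toy_i2c = {0: "a", 1: "b", 3: "<unk>"}
print(check_tables(toy_c2i, toy_i2c, n_rows=4))
```

On the toy tables the result reports one spare row (index 2), mirroring the single unused slot found above.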

In [91]:
embeds64short[[1107904]]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0.,
        0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1.]])

In [97]:
# and now SAVE all the codes
save_obj(char2idx, "multihot64short-char2idx")
save_obj(idx2char, "multihot64short-idx2char")
save_obj(embeds64short, "multihot64short-embeds")
save_obj(embeds64shortbool, "multihot64short-embeds-bool")

Checking this change only, the complete numpy pickled embedding codebook changes from:
* 625540774 bytes multihot64-embeds.pkl to
* 567250086 bytes multihot64short-embeds.pkl

so, about 58 MB (55.6 MiB) difference
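The difference can be sanity-checked from the array shapes. This sketch assumes 8-byte (float64) elements, which is an assumption on my part that happens to match the listed file sizes up to a small constant; the constant names are illustrative.

```python
ROWS_FULL, ROWS_SHORT, DIMS = 1221759, 1107910, 64
BYTES_PER_ELEMENT = 8  # assumption: float64 storage

FILE_FULL, FILE_SHORT = 625540774, 567250086  # sizes listed above

full_payload = ROWS_FULL * DIMS * BYTES_PER_ELEMENT
short_payload = ROWS_SHORT * DIMS * BYTES_PER_ELEMENT
saved = full_payload - short_payload

print(full_payload, short_payload)
# difference in bytes, MiB and MB
print(saved, round(saved / 2**20, 1), round(saved / 10**6, 1))
# per-file overhead on top of the raw array payload
print(FILE_FULL - full_payload, FILE_SHORT - short_payload)
```

Under that assumption the raw payload difference equals the difference between the two file sizes, since the per-file overhead is the same in both.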
