# Data Preparation for Multi-Language Supervised Part of Speech training.

This notebook prepares the data pre-processed from the conllu ud-treebank v2.4.

First load all datasets in memory (just because I can)

Count the number of samples for each and do some statistics. 

Here I need to take into account the different sampling strategies. 

There is the need for each batch to contain samples from all the languages, if not the training will not be optimal. The sampling strategies impact the performance in over-sampled and sub-sampled languages.

Mainly (TODO look back to the source and note it here) the different sampling strategies and their conditions are:

- Sampling the same amount from each language: Benefits the languages with less training samples and they take advantage from the learning of the languages with most samples. Penalises the languages with more samples.
- Rate based: benefits transfer from languages with more samples to less ones and does not penalises too much the languages with more data. 
- Proportional: penalises the most the languages with less data.


Data on the datasets with the least should not be repeated (at least too much) to avoid overfitting during this stage. The issue with overfitting now (I will be working on that later, testing what I can accomplish with overfitting)

A supposition that I have is that more complex languages (syntactically and semantically) might be better for training and transferring knowledge to less complex and/or languages with less training data. I won't be trying to test this hypothesis (for the moment at least)

In [4]:
import numpy as np
import pandas as pd
from langmodels.utils.preprocess_conllu import *
from utf8_encoder import *
import pickle
import random

In [5]:
base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5"

In [13]:
allfiles = get_all_files_recurse(base_dir)
charseq_files = [f for f in allfiles if f.endswith(".pkl") and "charseq_code" in f]  # typo in the file saving (fixed now in the .py file)

In [14]:
len(charseq_files)

298

In [15]:
charseq_train = [ f for f in charseq_files if "-train-" in f]
charseq_test = [ f for f in charseq_files if "-test-" in f]
charseq_dev = [ f for f in charseq_files if "-dev-" in f]

In [16]:
len(charseq_files), len(charseq_train), len(charseq_test), len(charseq_dev)

(298, 88, 130, 80)

Now load all data and start counting

In [17]:
def load_data(file_list):
    data = []
    for fname in file_list:
#         name = path_leaf(fname)
        name = fname
        with open(fname, "rb") as f:
            d = pickle.load(f)
            data.append((name, d))
    return data

In [18]:
%%time
data_train = load_data(charseq_train)
data_test = load_data(charseq_test)
data_dev = load_data(charseq_dev)

CPU times: user 1.19 s, sys: 437 ms, total: 1.63 s
Wall time: 1.63 s


In [19]:
len(data_train), len(data_test), len(data_dev)

(88, 130, 80)

In [20]:
data_count_train = [(n,len(d)) for n,d in data_train]
data_count_test = [(n,len(d)) for n,d in data_test]
data_count_dev = [(n,len(d)) for n,d in data_dev]

In [21]:
df_train = pd.DataFrame(data_count_train)
df_train.columns = ["name", "count"]
df_train = df_train.sort_values("count")

df_test = pd.DataFrame(data_count_test)
df_test.columns = ["name", "count"]
df_test = df_test.sort_values("count")

df_dev = pd.DataFrame(data_count_dev)
df_dev.columns = ["name", "count"]
df_dev = df_dev.sort_values("count")

In [22]:
df_train

Unnamed: 0,name,count
57,/home/leo/projects/Datasets/text/UniversalDepe...,19
35,/home/leo/projects/Datasets/text/UniversalDepe...,19
42,/home/leo/projects/Datasets/text/UniversalDepe...,23
37,/home/leo/projects/Datasets/text/UniversalDepe...,87
16,/home/leo/projects/Datasets/text/UniversalDepe...,153
...,...,...
44,/home/leo/projects/Datasets/text/UniversalDepe...,24633
38,/home/leo/projects/Datasets/text/UniversalDepe...,40801
83,/home/leo/projects/Datasets/text/UniversalDepe...,48814
68,/home/leo/projects/Datasets/text/UniversalDepe...,68495


In [23]:
df_test

Unnamed: 0,name,count
51,/home/leo/projects/Datasets/text/UniversalDepe...,34
126,/home/leo/projects/Datasets/text/UniversalDepe...,36
12,/home/leo/projects/Datasets/text/UniversalDepe...,49
24,/home/leo/projects/Datasets/text/UniversalDepe...,55
37,/home/leo/projects/Datasets/text/UniversalDepe...,57
...,...,...
59,/home/leo/projects/Datasets/text/UniversalDepe...,3214
124,/home/leo/projects/Datasets/text/UniversalDepe...,6491
52,/home/leo/projects/Datasets/text/UniversalDepe...,7881
96,/home/leo/projects/Datasets/text/UniversalDepe...,10148


In [24]:
df_dev

Unnamed: 0,name,count
15,/home/leo/projects/Datasets/text/UniversalDepe...,55
20,/home/leo/projects/Datasets/text/UniversalDepe...,65
34,/home/leo/projects/Datasets/text/UniversalDepe...,82
58,/home/leo/projects/Datasets/text/UniversalDepe...,107
10,/home/leo/projects/Datasets/text/UniversalDepe...,129
...,...,...
40,/home/leo/projects/Datasets/text/UniversalDepe...,3125
75,/home/leo/projects/Datasets/text/UniversalDepe...,6584
35,/home/leo/projects/Datasets/text/UniversalDepe...,8427
61,/home/leo/projects/Datasets/text/UniversalDepe...,9270


In [25]:
df_train.describe()

Unnamed: 0,count
count,88.0
mean,10177.272727
std,18609.989946
min,19.0
25%,1781.0
50%,5382.0
75%,13436.75
max,153035.0


In [26]:
df_test.describe()

Unnamed: 0,count
count,130.0
mean,1201.1
std,1990.188413
min,34.0
25%,471.75
50%,923.5
75%,1131.0
max,18459.0


In [27]:
df_dev.describe()

Unnamed: 0,count
count,80.0
mean,1443.7875
std,2447.523588
min,55.0
25%,529.75
50%,894.0
75%,1419.0
max,18434.0


In [28]:
70123/19

3690.684210526316

Proportional sampling might be difficult as the difference between number of samples is too much. 
19 samples for the dataset with least data and 70123 for the most complete dataset. 

This means that I must do some sampling strategy that is proportional somehow but a rate of 3690 times is too much for it.

Also the training order might be important, so better use the ones with least data at the end of the training, so they benefit from the previous training instead of initializing. 

Maybe repeating there a few times will not necessarilly overfit?

So I will do that, I will order the trainig in a way that the last batches contain the samples from the languages with the least training data and the first batches will not contain them.

Also all batches will be of a length that can contain at least a sample from each training (language) dataset.

Also, for the languages with the lest number of samples might be good to merge the training with the dev datasets.

There are also many datasets that don't contain train, test and or dev dataset. Some have only one of those, these are good for testing the generalization of the network on languages that are not trained on.

In [29]:
df_train.head(20)

Unnamed: 0,name,count
57,/home/leo/projects/Datasets/text/UniversalDepe...,19
35,/home/leo/projects/Datasets/text/UniversalDepe...,19
42,/home/leo/projects/Datasets/text/UniversalDepe...,23
37,/home/leo/projects/Datasets/text/UniversalDepe...,87
16,/home/leo/projects/Datasets/text/UniversalDepe...,153
21,/home/leo/projects/Datasets/text/UniversalDepe...,319
2,/home/leo/projects/Datasets/text/UniversalDepe...,600
65,/home/leo/projects/Datasets/text/UniversalDepe...,803
4,/home/leo/projects/Datasets/text/UniversalDepe...,858
11,/home/leo/projects/Datasets/text/UniversalDepe...,860


In [30]:
df_dev.tail(20)

Unnamed: 0,name,count
28,/home/leo/projects/Datasets/text/UniversalDepe...,1476
77,/home/leo/projects/Datasets/text/UniversalDepe...,1654
5,/home/leo/projects/Datasets/text/UniversalDepe...,1664
30,/home/leo/projects/Datasets/text/UniversalDepe...,1709
18,/home/leo/projects/Datasets/text/UniversalDepe...,1745
0,/home/leo/projects/Datasets/text/UniversalDepe...,1798
62,/home/leo/projects/Datasets/text/UniversalDepe...,1842
64,/home/leo/projects/Datasets/text/UniversalDepe...,1852
37,/home/leo/projects/Datasets/text/UniversalDepe...,1875
9,/home/leo/projects/Datasets/text/UniversalDepe...,1890


In [31]:
def get_root_name(fname):
    rt = path_leaf(fname).split("-ud-")[0]
    return rt

In [32]:
df_train['name_root'] = df_train.name.apply(get_root_name)
df_test['name_root'] = df_test.name.apply(get_root_name)
df_dev['name_root'] = df_dev.name.apply(get_root_name)

In [33]:
df_train.head()

Unnamed: 0,name,count,name_root
57,/home/leo/projects/Datasets/text/UniversalDepe...,19,bxr_bdt
35,/home/leo/projects/Datasets/text/UniversalDepe...,19,olo_kkpp
42,/home/leo/projects/Datasets/text/UniversalDepe...,23,hsb_ufal
37,/home/leo/projects/Datasets/text/UniversalDepe...,87,swl_sslc
16,/home/leo/projects/Datasets/text/UniversalDepe...,153,lt_hse


In [34]:
smax_train = max([i[1] for i in data_count_train])

In [35]:
smax_train

153035

So the issue is to create a sampling strategy. What should be done I don't know, but I'll create a strategy anyways with my supositions.

So I'll do the following:
- First I'll merge the language files per language, train and dev datasets will be merged to benefit the smaller langs
- The batch size will be the number of languages that I'll be training the network
- batches might not (will not) contain the same number of languages, but batches will be saved with the number of languages as reference
- for training I'll first send the batches with the least language variability and at the end of each epoch the ones with the most variability (meaning the ones containing more languages)
- The number of batches will be defined by .... ???

Also for the languages that have more data I'll clean the datasets by filtering the language files according to the rating on the UD-treebank page. I'll start by the languages with more data by order.

The datasets that I'm cleaning are the onces from German and Czeck:
* Czech-CAC
* Czech-CLTT
* Czech-PUD

For the German datasets I'll avoid merging the dev datasets and the following datasets are not used for the training:
* German-GSD
* German-PUD
* German-LIT

For Russian, German and Czech I won't merge the dev sets

For the German HDT dataset there are two sets, a and b, dataset b is taken out because this dataset contains a lot of data.

Merging by language, also merging train and dev datasets

In [36]:
%%time
max_sentence_len = 1024

# the pre-processing of the test data is much simpler, making it per file, not needing to merge and only filtering strings that are longer than the max_sentence data
test_lang_dict = {}

# filter data that is too big and padding data
def filter_pad_data(data, max_len):
    dta = [d for d in data if d.shape[1] <=max_sentence_len]
    pad_dta = [np.pad(d, [(0,0),(0,1024 - d.shape[1])], mode='constant') for d in dta]
    pad_ta = np.stack(pad_dta, axis=0).astype("uint16")
    return pad_dta

for fname,data in data_test:
    test_lang_dict[fname] = filter_pad_data(data, max_sentence_len)



CPU times: user 5.55 s, sys: 305 ms, total: 5.85 s
Wall time: 5.85 s


In [37]:
len(test_lang_dict.keys())

130

In [38]:
%%time
# save all data for testing in vectorized format already

for oname, data in test_lang_dict.items():
    fname = oname.replace(".pkl", "_test-pad-3x1024_uint16.npy")
    np.save(fname, data)

CPU times: user 242 ms, sys: 830 ms, total: 1.07 s
Wall time: 1.07 s


In [39]:
lang_dict = {}

for fname,data in data_train + data_dev:
    lang = fname.split("_")[0]  # language is as the first 2 or 3 chars of the filename
    # avoid merging dev sets for the languages with more data
    if lang in ["de", "cs", "ru"]:
        print("lang ", lang, " fname: ", fname)
        if "-dev-" in fname or "de_hdt-ud-train-b" in fname:
            print("skipping: ", fname)
            continue
    if not lang in lang_dict:
        lang_dict[lang] = data
    else:
        lang_dict[lang] += data

In [40]:
len(lang_dict.keys())

1

In [41]:
data_count_traindev = [(k,len(d)) for k,d in lang_dict.items()]
df_traindev = pd.DataFrame(data_count_traindev)
df_traindev.columns = ["name", "count"]
df_traindev = df_traindev.sort_values("count")

In [42]:
df_traindev

Unnamed: 0,name,count
0,/home/leo/projects/Datasets/text/UniversalDepe...,1011103


In [43]:
df_traindev["count"].sum()

1011103

The threshold for the number of batches will be chosen as to maximize the number of complete batches, this will also give the number of languages in each batch set.
~~Just because (arbitrary number) I decide that the batches with the least languages will be 10, this might give enough language variability for the training. ~~

As there are 50 languages, the batches will contain 50 samples (from REVISITING SMALL BATCH TRAINING FOR DEEP NEURAL NETWORKS https://arxiv.org/pdf/1804.07612.pdf small batches are better)~~, so the batches with the least n# of languages will contain 5 samples per lang, and the ones with 50 languages one sample per batch.~~

To complete the batch for the ones that are not with all the languages I intend to randomly choose from the ~~10~~ N languages with the most samples, this is done by randomly shuffling the remaining data from the threshold sampling chosen.

In [44]:
max_sentence_len = 1024

# creating a list to fill the missing data in the batches, data will be then randomly shuffled
# I explore here the minimum number of languages that will give me the maximum number of batches.
# I'm looking only in the number of samples of the languages, not other point.

fi_len = 30437  # with this threshold I only keep up to 17 languages and 17524 batches
ar_len = 24759  # with this threshold I only keep up to 14 languages and 20953 batches
ro_len = 19991  # with this threshold I only keep up to 15 languages and 19991 batches 
# is the SWEET SPOT for the max number of batches
ca_len = 14832 # with this threshold I only keep up to 20 languages and 14832 batches  and there are 749 extra batches with up to 19 languages
threshold_len = ca_len

fill_data =[]
# create batches, later
for lang, data in lang_dict.items():
    if len(data)> threshold_len:
        dta = [d for d in data[threshold_len:] if d.shape[1] <= max_sentence_len]
        fill_data.extend(dta)
        
# ensure that the data chosen to fill the missing elements is random 
# (and will statistically be proportional to the number of datapoints of each language)
random.shuffle(fill_data)

In [45]:
len(fill_data)

996211

In [46]:
start_index = 0
batches = []

# create batches, later
for i in range(threshold_len):  
    bid=0
    langs=[]
    datas=[]
    for lang, data in lang_dict.items():
        if len(data)> i and data[i].shape[1] <= max_sentence_len:
            langs.append(lang)
            datas.append(data[i])
    while len(datas) < len(lang_dict.keys()):
        datas.append(fill_data.pop())
    batch = (bid, langs, datas)
    batches.append(batch)

In [47]:
len(fill_data)  # there is data still left

996209

In [48]:
len_samples_10 = len(fill_data) //50

In [49]:
len_samples_10

19924

In [50]:
# from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


In [51]:
batches_10_langs = list(chunks(fill_data, 50))

In [52]:
# check that all batches are 50 samples long
batch_lens = list(map(lambda x: len(x[2]), batches))

In [53]:
len(batch_lens), max(batch_lens), min(batch_lens), len(batches_10_langs)

(14832, 1, 1, 19925)

In [54]:
14832 * 50

741600

In [55]:
# there are then a maximum
(14832 + 749) * 50
# datapoints

779050

Checking that the data is OK $ 779078  > 779050$ 
So the number of original samples is bigger than the number of samples on the prepared dataset. This is OK

In [56]:
# see how many languages and how many batches with each amount of languages
b_stats = {}
for bid, blangs, bdatas in batches:
    l = len(blangs)
    if l in b_stats:
        b_stats[l] +=1
    else:
        b_stats[l] = 1

In [57]:
b_stats

{1: 14830, 0: 2}

Now I need to make something for the data shape such as all batches are the same shape.

First I need now to create the list of batches, reverse the order as the training order will be from the least diverse to the most diverse batches up to 50 languages in the last 19 batches (the number of samples the language with least training data has).

I take out the last array from the batches_10_langs at it is not complete with 50 samples


In [58]:
batches_data = list(map(lambda x: x[2], batches)) + batches_10_langs[:-1]
batches_data.reverse()  # have to train first with the least diverse, and at the end with the most diverse ones

In [59]:
len(batches_data) * 50

1737800

Now find the maximum lenght of all the data samples, this will be the dimension needed for the numpy array

In [60]:
seq_len = []
for b in batches_data:
    for s in b:
        seq_len.append(s.shape[1])

min_seq_len, max_seq_len = min(seq_len), max(seq_len)

In [61]:
min_seq_len, max_seq_len

(1, 1019)

In [62]:
# see how many languages and how many batches with each amount of languages
len_stats = {}
for l in seq_len:
    if l in len_stats:
        len_stats[l] +=1
    else:
        len_stats[l] = 1

In [63]:
len_stats = SortedDict(sorted(len_stats.items(), key=lambda kv: kv[0]))

In [64]:
len_stats

SortedDict({1: 934, 2: 1277, 3: 1850, 4: 2850, 5: 3802, 6: 3973, 7: 4733, 8: 4689, 9: 4922, 10: 5179, 11: 5331, 12: 5453, 13: 6166, 14: 5840, 15: 6603, 16: 6124, 17: 6754, 18: 6346, 19: 7926, 20: 6398, 21: 7076, 22: 6639, 23: 7291, 24: 6958, 25: 7502, 26: 7123, 27: 7618, 28: 7212, 29: 7694, 30: 7050, 31: 7915, 32: 7237, 33: 7734, 34: 7167, 35: 7764, 36: 7326, 37: 7865, 38: 6955, 39: 7491, 40: 7084, 41: 7628, 42: 6967, 43: 7656, 44: 6860, 45: 7439, 46: 6806, 47: 7341, 48: 6670, 49: 7229, 50: 6697, 51: 7020, 52: 6463, 53: 7247, 54: 6628, 55: 7020, 56: 6376, 57: 6922, 58: 6465, 59: 6744, 60: 6400, 61: 6779, 62: 6251, 63: 6478, 64: 6158, 65: 6530, 66: 6171, 67: 6537, 68: 6209, 69: 6511, 70: 6074, 71: 6409, 72: 6022, 73: 6333, 74: 6082, 75: 6237, 76: 5852, 77: 6312, 78: 5956, 79: 6197, 80: 5882, 81: 6160, 82: 5721, 83: 6173, 84: 5678, 85: 6044, 86: 5719, 87: 5940, 88: 5642, 89: 5857, 90: 5482, 91: 5585, 92: 5514, 93: 5583, 94: 5356, 95: 5387, 96: 5341, 97: 5397, 98: 5149, 99: 5378, 100: 501

In [65]:
sum(len_stats.values())

1011032

In [66]:
sum([v for k,v in len_stats.items() if k > 1024])

0

In [67]:
sum([v for k,v in len_stats.items() if k <= 5])

10713

Data still seems quite bad, there are many samples with lenght that is suspiciously low (85 len 1, 500 len 2, etc) so I should go back to the data generation and will filter out everything that is below a threshold.

Also I'll need to manually check the files that contain data that seems bad. But maybe the number of languages (and/or samples) will still be cut down in order to improve data quality.

I will need to filter data by a minimum and maximum length, for this I need to calculate and decide on the (minimum maybe) and maximum size of the input data for the network.

Nevertheless, for the moment I'll have then to work on the network to define the input and output shapes.

Input will be maximum 1024 char in length, this is to avoid overcomplications and big networks on my setup, but can be changed later.

I might leave the sentences with small length (1,2,...) just to have some "noisy" input ... ?


In [68]:
%%time
padded_data = []
for batch in batches_data:
    for s in batch:
        padded_data.append(np.pad(s, [(0,0),(0,1024 - s.shape[1])], mode='constant'))

CPU times: user 34.7 s, sys: 2.74 s, total: 37.5 s
Wall time: 37.5 s


In [69]:
all_train_data_np = np.stack(padded_data, axis=0)

In [70]:
all_train_data_np.dtype

dtype('int32')

In [71]:
#reduce precision (everything is below the maximum utf-8 encoding for 2 segments index < 2^11)
np_data = all_train_data_np.astype(np.uint16)

In [72]:
np_data.dtype

dtype('uint16')

In [73]:
all_train_data_np.shape

(1011032, 3, 1024)

In [74]:
# base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4"
# train_fname = os.path.join(base_dir, "traindev_np_batches_779000x3x1024_int32.npy")
# np.save(train_fname, all_train_data_np)

In [75]:
# this file is 4.8GB on disk ... can't fit in my GPU and also have the models there, nice I do have 
train_fname = os.path.join(base_dir, "traindev_np_batches_779000x3x1024_uint16.npy")
np.save(train_fname, np_data)

In [76]:
np_data.shape

(1011032, 3, 1024)

In [77]:
np_data[-1,:,:]

array([[ 71, 101, 114, ...,   0,   0,   0],
       [  2,   2,   2, ...,   0,   0,   0],
       [ 21,  21,  21, ...,   0,   0,   0]], dtype=uint16)

In [78]:
np.count_nonzero(np_data[0,:,:]), np.count_nonzero(np_data[-1,:,:])

(248, 115)