# Data Preparation for Multi-Language Supervised Part-of-Speech Training

This notebook prepares the data that was pre-processed from the CoNLL-U UD treebanks v2.4.

First, load all datasets into memory (just because I can).

Count the number of samples in each and compute some statistics.

Here I need to take the different sampling strategies into account.

Each batch should contain samples from all the languages; otherwise the training will not be optimal. The sampling strategy affects performance on both over-sampled and under-sampled languages.

The main sampling strategies and their trade-offs (TODO: look back to the source and note it here) are:

- Sampling the same amount from each language: benefits the languages with fewer training samples, which take advantage of what is learned from the languages with more samples. Penalises the languages with more samples.
- Rate based: benefits transfer from languages with more samples to those with fewer, and does not penalise the languages with more data too much.
- Proportional: penalises the languages with less data the most.
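
As a rough illustration of the three strategies above, the sketch below computes per-language sampling probabilities from sample counts. The function name `sampling_probs`, the exponent `alpha` and the counts are assumptions made only for this example, not taken from the source: `alpha = 0` gives equal sampling, `alpha = 1` gives proportional sampling, and an intermediate value is one common way to get a dampened, rate-based compromise.

```python
def sampling_probs(counts, alpha):
    """Return per-language sampling probabilities p_i proportional to n_i ** alpha.

    alpha = 0.0   -> same amount from each language
    alpha = 1.0   -> proportional to dataset size
    0 < alpha < 1 -> dampened compromise (assumed formulation of "rate based")
    """
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# hypothetical sample counts, only for illustration
toy_counts = {"small": 100, "large": 10000}

uniform = sampling_probs(toy_counts, alpha=0.0)
dampened = sampling_probs(toy_counts, alpha=0.5)
proportional = sampling_probs(toy_counts, alpha=1.0)
```

With these toy counts the small language is drawn half of the time under equal sampling, about 9% of the time with `alpha = 0.5`, and about 1% of the time under proportional sampling.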


Data from the smallest datasets should not be repeated (at least not too much), to avoid overfitting during this stage. Overfitting itself is an issue I will work on later, testing what I can accomplish with it.

One supposition I have is that more complex languages (syntactically and semantically) might be better for training and for transferring knowledge to less complex languages and/or languages with less training data. I won't be testing this hypothesis (for the moment at least).

In [1]:
import numpy as np
import pandas as pd
from langmodels.utils.preprocess_conllu import *
from utf8_encoder import *
import pickle
import random
import os  # needed further down for os.path.join

In [2]:
base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4"

In [3]:
allfiles = get_all_files_recurse(base_dir)
charseq_files = [f for f in allfiles if f.endswith(".pkl") and "charsec_code" in f]  # typo in the file saving (fixed now in the .py file)

In [4]:
len(charseq_files)

271

In [5]:
charseq_train = [ f for f in charseq_files if "-train-" in f]
charseq_test = [ f for f in charseq_files if "-test-" in f]
charseq_dev = [ f for f in charseq_files if "-dev-" in f]

In [6]:
len(charseq_files), len(charseq_train), len(charseq_test), len(charseq_dev)

(271, 81, 117, 73)

Now load all data and start counting

In [7]:
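def load_data(file_list):
    """Load every pickle in file_list and return a list of (file name, data) tuples."""
    data = []
    for fname in file_list:
        # keep the full path as the name; path_leaf(fname) would give only the file name
        with open(fname, "rb") as f:
            data.append((fname, pickle.load(f)))
    return data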

In [8]:
%%time
data_train = load_data(charseq_train)
data_test = load_data(charseq_test)
data_dev = load_data(charseq_dev)

CPU times: user 1.09 s, sys: 309 ms, total: 1.4 s
Wall time: 1.4 s


In [9]:
len(data_train), len(data_test), len(data_dev)

(81, 117, 73)

In [10]:
data_count_train = [(n,len(d)) for n,d in data_train]
data_count_test = [(n,len(d)) for n,d in data_test]
data_count_dev = [(n,len(d)) for n,d in data_dev]

In [11]:
df_train = pd.DataFrame(data_count_train)
df_train.columns = ["name", "count"]
df_train = df_train.sort_values("count")

df_test = pd.DataFrame(data_count_test)
df_test.columns = ["name", "count"]
df_test = df_test.sort_values("count")

df_dev = pd.DataFrame(data_count_dev)
df_dev.columns = ["name", "count"]
df_dev = df_dev.sort_values("count")

In [12]:
df_train

Unnamed: 0,name,count
51,/home/leo/projects/Datasets/text/UniversalDepe...,19
37,/home/leo/projects/Datasets/text/UniversalDepe...,23
33,/home/leo/projects/Datasets/text/UniversalDepe...,87
15,/home/leo/projects/Datasets/text/UniversalDepe...,153
20,/home/leo/projects/Datasets/text/UniversalDepe...,319
4,/home/leo/projects/Datasets/text/UniversalDepe...,566
2,/home/leo/projects/Datasets/text/UniversalDepe...,600
41,/home/leo/projects/Datasets/text/UniversalDepe...,672
59,/home/leo/projects/Datasets/text/UniversalDepe...,803
74,/home/leo/projects/Datasets/text/UniversalDepe...,910


In [13]:
df_test

Unnamed: 0,name,count
46,/home/leo/projects/Datasets/text/UniversalDepe...,34
22,/home/leo/projects/Datasets/text/UniversalDepe...,55
34,/home/leo/projects/Datasets/text/UniversalDepe...,57
92,/home/leo/projects/Datasets/text/UniversalDepe...,98
64,/home/leo/projects/Datasets/text/UniversalDepe...,100
77,/home/leo/projects/Datasets/text/UniversalDepe...,101
86,/home/leo/projects/Datasets/text/UniversalDepe...,110
67,/home/leo/projects/Datasets/text/UniversalDepe...,117
107,/home/leo/projects/Datasets/text/UniversalDepe...,153
20,/home/leo/projects/Datasets/text/UniversalDepe...,153


In [14]:
df_dev

Unnamed: 0,name,count
13,/home/leo/projects/Datasets/text/UniversalDepe...,55
18,/home/leo/projects/Datasets/text/UniversalDepe...,65
31,/home/leo/projects/Datasets/text/UniversalDepe...,82
53,/home/leo/projects/Datasets/text/UniversalDepe...,107
11,/home/leo/projects/Datasets/text/UniversalDepe...,156
65,/home/leo/projects/Datasets/text/UniversalDepe...,156
29,/home/leo/projects/Datasets/text/UniversalDepe...,194
63,/home/leo/projects/Datasets/text/UniversalDepe...,403
17,/home/leo/projects/Datasets/text/UniversalDepe...,412
2,/home/leo/projects/Datasets/text/UniversalDepe...,433


In [15]:
df_train.describe()

Unnamed: 0,count
count,81.0
mean,9653.876543
std,13826.426142
min,19.0
25%,1781.0
50%,5396.0
75%,13123.0
max,70123.0


In [16]:
df_test.describe()

Unnamed: 0,count
count,117.0
mean,1225.042735
std,1874.395771
min,34.0
25%,518.0
50%,957.0
75%,1204.0
max,17028.0


In [17]:
df_dev.describe()

Unnamed: 0,count
count,73.0
mean,1413.945205
std,2298.39349
min,55.0
25%,564.0
50%,912.0
75%,1476.0
max,17294.0


In [18]:
70123/19

3690.684210526316

Proportional sampling might be difficult because the difference in the number of samples is too large: 19 samples for the smallest dataset versus 70123 for the largest.

This means I need a sampling strategy that is somehow proportional, but a ratio of about 3690 to 1 is too much for plain proportional sampling.

The training order might also matter, so it is better to use the languages with the least data at the end of the training, where they benefit from the previous training instead of being used to initialize the network.

Maybe repeating them a few times there will not necessarily overfit?

So I will do that: I will order the training so that the last batches contain the samples from the languages with the least training data, and the first batches do not contain them.

Also, every batch will be large enough to hold at least one sample from each training (language) dataset.

For the languages with the fewest samples it might also be good to merge the train and dev datasets.

There are also many treebanks that do not have all of the train, test and dev splits. Some have only one of them; these are good for testing how well the network generalizes to languages it was not trained on.
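
To find the treebanks that only ship some of the splits, one option is to compare the sets of treebank roots per split. This is only a sketch: the file names are hypothetical and `root_of` is a stand-in helper written for this example.

```python
import os


def root_of(fname):
    """Treebank root, e.g. 'xx_a' from '.../xx_a-ud-train-charsec_code.pkl'."""
    return os.path.basename(fname).split("-ud-")[0]


# hypothetical file names, only for illustration
toy_files = [
    "xx_a-ud-train-charsec_code.pkl",
    "xx_a-ud-dev-charsec_code.pkl",
    "xx_a-ud-test-charsec_code.pkl",
    "yy_b-ud-test-charsec_code.pkl",
]

train_roots = {root_of(f) for f in toy_files if "-train-" in f}
test_roots = {root_of(f) for f in toy_files if "-test-" in f}

# treebanks with a test set but no training data:
# candidates for evaluating on languages the network never saw
test_only = sorted(test_roots - train_roots)
```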

In [19]:
df_train.head(20)

Unnamed: 0,name,count
51,/home/leo/projects/Datasets/text/UniversalDepe...,19
37,/home/leo/projects/Datasets/text/UniversalDepe...,23
33,/home/leo/projects/Datasets/text/UniversalDepe...,87
15,/home/leo/projects/Datasets/text/UniversalDepe...,153
20,/home/leo/projects/Datasets/text/UniversalDepe...,319
4,/home/leo/projects/Datasets/text/UniversalDepe...,566
2,/home/leo/projects/Datasets/text/UniversalDepe...,600
41,/home/leo/projects/Datasets/text/UniversalDepe...,672
59,/home/leo/projects/Datasets/text/UniversalDepe...,803
74,/home/leo/projects/Datasets/text/UniversalDepe...,910


In [20]:
df_dev.tail(20)

Unnamed: 0,name,count
9,/home/leo/projects/Datasets/text/UniversalDepe...,1400
26,/home/leo/projects/Datasets/text/UniversalDepe...,1476
4,/home/leo/projects/Datasets/text/UniversalDepe...,1622
70,/home/leo/projects/Datasets/text/UniversalDepe...,1654
28,/home/leo/projects/Datasets/text/UniversalDepe...,1709
16,/home/leo/projects/Datasets/text/UniversalDepe...,1745
0,/home/leo/projects/Datasets/text/UniversalDepe...,1798
57,/home/leo/projects/Datasets/text/UniversalDepe...,1842
59,/home/leo/projects/Datasets/text/UniversalDepe...,1852
33,/home/leo/projects/Datasets/text/UniversalDepe...,1875


In [21]:
def get_root_name(fname):
    rt = path_leaf(fname).split("-ud-")[0]
    return rt
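
A quick usage sketch of the root-name extraction, using one of the file names that appears in the logs of this notebook. `path_leaf` comes from the star imports, so `path_leaf_standin` below is only an assumed equivalent that returns the last path component; the directory is made up.

```python
import os


def path_leaf_standin(path):
    # stand-in for path_leaf, assumed to return the last path component
    return os.path.basename(path)


def get_root_name_sketch(fname):
    return path_leaf_standin(fname).split("-ud-")[0]


root = get_root_name_sketch("/some/dir/ru_gsd-ud-train-charsec_code.pkl")
lang_code = root.split("_")[0]  # language code is the part before the first underscore
```

Here `root` is `"ru_gsd"` and `lang_code` is `"ru"`.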

In [22]:
df_train['name_root'] = df_train.name.apply(get_root_name)
df_test['name_root'] = df_test.name.apply(get_root_name)
df_dev['name_root'] = df_dev.name.apply(get_root_name)

In [23]:
df_train.head()

Unnamed: 0,name,count,name_root
51,/home/leo/projects/Datasets/text/UniversalDepe...,19,bxr_bdt
37,/home/leo/projects/Datasets/text/UniversalDepe...,23,hsb_ufal
33,/home/leo/projects/Datasets/text/UniversalDepe...,87,swl_sslc
15,/home/leo/projects/Datasets/text/UniversalDepe...,153,lt_hse
20,/home/leo/projects/Datasets/text/UniversalDepe...,319,be_hse


In [24]:
smax_train = max([i[1] for i in data_count_train])

In [25]:
smax_train

70123

So the issue is to create a sampling strategy. I don't know what the right approach is, but I'll create one anyway based on my suppositions.

So I'll do the following:
- First I'll merge the files per language; train and dev datasets will be merged to benefit the smaller languages
- The batch size will be the number of languages that I'll be training the network on
- Batches might not (will not) all contain the same number of languages, but each batch will be saved with its number of languages as reference
- For training I'll first send the batches with the least language variability, and at the end of each epoch the ones with the most variability (meaning the ones containing more languages)
- The number of batches will be defined by .... ???

For the languages that have more data I'll also clean the datasets by filtering the language files according to their rating on the UD treebank page, starting with the languages with the most data.

The datasets that I'm cleaning are the German and Czech ones. For Czech:
* Czech-CAC
* Czech-CLTT
* Czech-PUD

For German I'll avoid merging the dev datasets, and the following datasets are not used for training:
* German-GSD
* German-PUD
* German-LIT

For Russian, German and Czech I won't merge the dev sets.

The German HDT training set comes in two parts, a and b; part b is left out because this dataset already contains a lot of data.

Merging by language, also merging the train and dev datasets.

In [32]:
%%time
max_sentence_len = 1024

# Pre-processing the test data is much simpler: it is done per file, with no merging,
# and only filters out sequences that are longer than max_sentence_len.
test_lang_dict = {}

# filter data that is too big and pad the rest
def filter_pad_data(data, max_len):
    """Drop samples longer than max_len, zero-pad the rest and stack them as uint16."""
    dta = [d for d in data if d.shape[1] <= max_len]
    pad_dta = [np.pad(d, [(0, 0), (0, max_len - d.shape[1])], mode='constant') for d in dta]
    return np.stack(pad_dta, axis=0).astype("uint16")

for fname, data in data_test:
    test_lang_dict[fname] = filter_pad_data(data, max_sentence_len)



CPU times: user 4.97 s, sys: 27.3 ms, total: 5 s
Wall time: 4.99 s
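
A minimal toy run of the filter-and-pad step, assuming each sample is a 2-D array of shape `(3, length)` as the saved file names suggest. The arrays, the helper name `filter_pad` and the small `max_len` are made up for illustration.

```python
import numpy as np


def filter_pad(samples, max_len):
    """Drop samples longer than max_len and right-pad the rest with zeros."""
    kept = [s for s in samples if s.shape[1] <= max_len]
    padded = [
        np.pad(s, [(0, 0), (0, max_len - s.shape[1])], mode="constant")
        for s in kept
    ]
    return np.stack(padded, axis=0).astype("uint16")


# hypothetical samples: 3 rows each, with lengths 2, 5 and 9
toy_samples = [np.ones((3, n), dtype=np.int32) for n in (2, 5, 9)]
toy_padded = filter_pad(toy_samples, max_len=8)
```

The sample of length 9 is dropped and the other two are padded to length 8, so `toy_padded` has shape `(2, 3, 8)`.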


In [33]:
len(test_lang_dict.keys())

117

In [35]:
%%time
# save all data for testing in vectorized format already

for oname, data in test_lang_dict.items():
    fname = oname.replace(".pkl", "_test-pad-3x1024_uint16.npy")
    np.save(fname, data)

CPU times: user 272 ms, sys: 709 ms, total: 981 ms
Wall time: 980 ms


In [26]:
lang_dict = {}

for fname, data in data_train + data_dev:
    name = path_leaf(fname)  # works whether fname is a full path or just the file name
    lang = name.split("_")[0]  # the language code is the first 2 or 3 chars of the file name
    # avoid merging dev sets for the languages with more data
    if lang in ["de", "cs", "ru"]:
        print("lang ", lang, " fname: ", name)
        if "-dev-" in name or "de_hdt-ud-train-b" in name:
            print("skipping: ", name)
            continue
    # extend a fresh list so the lists held in data_train / data_dev are not modified in place
    lang_dict.setdefault(lang, []).extend(data)

lang  ru  fname:  ru_taiga-ud-train-charsec_code.pkl
lang  ru  fname:  ru_gsd-ud-train-charsec_code.pkl
lang  de  fname:  de_hdt-ud-train-a-charsec_code.pkl
lang  de  fname:  de_hdt-ud-train-b-charsec_code.pkl
skipping:  de_hdt-ud-train-b-charsec_code.pkl
lang  cs  fname:  cs_pdt-ud-train-charsec_code.pkl
lang  ru  fname:  ru_syntagrus-ud-train-charsec_code.pkl
lang  ru  fname:  ru_taiga-ud-dev-charsec_code.pkl
skipping:  ru_taiga-ud-dev-charsec_code.pkl
lang  cs  fname:  cs_fictree-ud-dev-charsec_code.pkl
skipping:  cs_fictree-ud-dev-charsec_code.pkl
lang  ru  fname:  ru_gsd-ud-dev-charsec_code.pkl
skipping:  ru_gsd-ud-dev-charsec_code.pkl
lang  de  fname:  de_hdt-ud-dev-charsec_code.pkl
skipping:  de_hdt-ud-dev-charsec_code.pkl
lang  cs  fname:  cs_pdt-ud-dev-charsec_code.pkl
skipping:  cs_pdt-ud-dev-charsec_code.pkl
lang  ru  fname:  ru_syntagrus-ud-dev-charsec_code.pkl
skipping:  ru_syntagrus-ud-dev-charsec_code.pkl


In [27]:
len(lang_dict.keys())

50
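
The per-language merge can be illustrated on toy data. This is a sketch under assumptions: the file names, the sample placeholders and the set `big_langs` are hypothetical and only mirror the rule of keeping dev sets out of training for the largest languages.

```python
from collections import defaultdict

# hypothetical (file name, samples) pairs standing in for data_train + data_dev
toy_named_data = [
    ("xx_a-ud-train-charsec_code.pkl", ["s1", "s2"]),
    ("xx_b-ud-train-charsec_code.pkl", ["s3"]),
    ("xx_a-ud-dev-charsec_code.pkl", ["s4"]),
    ("yy_c-ud-train-charsec_code.pkl", ["s5"]),
    ("yy_c-ud-dev-charsec_code.pkl", ["s6"]),
]
big_langs = {"yy"}  # languages whose dev sets are kept out of training

merged = defaultdict(list)
for fname, samples in toy_named_data:
    lang = fname.split("_")[0]
    if lang in big_langs and "-dev-" in fname:
        continue
    merged[lang].extend(samples)  # extend copies the items, the inputs stay untouched
```

After the loop, `merged["xx"]` holds the four samples of both `xx` treebanks (train and dev), while `merged["yy"]` holds only the training sample.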

In [28]:
data_count_traindev = [(k,len(d)) for k,d in lang_dict.items()]
df_traindev = pd.DataFrame(data_count_traindev)
df_traindev.columns = ["name", "count"]
df_traindev = df_traindev.sort_values("count")

In [29]:
df_traindev

Unnamed: 0,name,count
39,bxr,19
31,hsb,23
28,swl,169
17,be,384
4,ga,566
34,hy,1246
49,hu,1351
26,af,1509
3,mt,1556
12,wo,1637


In [30]:
df_traindev["count"].sum()

779078

The threshold for the number of batches is chosen so as to maximize the number of complete batches; this also determines the number of languages in each batch set.
~~Just because (arbitrary number) I decide that the batches with the least languages will be 10, this might give enough language variability for the training.~~

As there are 50 languages, the batches will contain 50 samples (according to "Revisiting Small Batch Training for Deep Neural Networks", https://arxiv.org/pdf/1804.07612.pdf, small batches are better)~~, so the batches with the least n# of languages will contain 5 samples per lang, and the ones with 50 languages one sample per batch.~~

To complete the batches that do not contain all the languages, I intend to choose randomly from the ~~10~~ N languages with the most samples; this is done by randomly shuffling the data that remains beyond the chosen sampling threshold.
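
The effect of a candidate threshold can be estimated from the per-language sample counts alone. The sketch below is an assumption-laden simplification: `batch_plan` and the toy counts are hypothetical, and it ignores the maximum sentence length filter. Batch `i` takes sample `i` from every language that has one, and the free slots are filled from a pool made of the samples at index `threshold` and beyond.

```python
def batch_plan(lang_sizes, threshold):
    """Summarise what a given threshold implies for index-wise batching."""
    n_langs = len(lang_sizes)
    # samples beyond the threshold form the pool used to fill incomplete batches
    pool = sum(max(0, n - threshold) for n in lang_sizes.values())
    # slots covered directly by the languages in the first `threshold` batches
    taken = sum(min(n, threshold) for n in lang_sizes.values())
    needed = n_langs * threshold - taken
    # languages still present in the last (least diverse) batch
    min_langs = sum(1 for n in lang_sizes.values() if n >= threshold)
    return {
        "pool": pool,
        "needed": needed,
        "min_langs": min_langs,
        "feasible": pool >= needed,
    }


# hypothetical sample counts, only for illustration
toy_sizes = {"a": 2, "b": 5, "c": 12}
plan_low = batch_plan(toy_sizes, threshold=4)
plan_high = batch_plan(toy_sizes, threshold=8)
```

With the toy counts, a threshold of 4 leaves a pool of 9 samples for 2 missing slots, while a threshold of 8 would need 9 fill samples but only has 4, so not every batch could be completed.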

In [31]:
max_sentence_len = 1024

# Create a list used to fill the missing data in the batches; the data will then be randomly shuffled.
# Here I explore the minimum number of languages that gives me the maximum number of batches.
# I only look at the number of samples per language, nothing else.

fi_len = 30437  # with this threshold I only keep up to 17 languages and 17524 batches
ar_len = 24759  # with this threshold I only keep up to 14 languages and 20953 batches
ro_len = 19991  # with this threshold I only keep up to 15 languages and 19991 batches 
# is the SWEET SPOT for the max number of batches
ca_len = 14832 # with this threshold I only keep up to 20 languages and 14832 batches  and there are 749 extra batches with up to 19 languages
threshold_len = ca_len

fill_data =[]
# collect the samples beyond the threshold from every language that has more than threshold_len samples
for lang, data in lang_dict.items():
    if len(data)> threshold_len:
        dta = [d for d in data[threshold_len:] if d.shape[1] <= max_sentence_len]
        fill_data.extend(dta)
        
# ensure that the data chosen to fill the missing elements is random 
# (and will statistically be proportional to the number of datapoints of each language)
random.shuffle(fill_data)

In [32]:
len(fill_data)

352403

In [33]:
batches = []

# build one batch per sample index, up to the threshold
for i in range(threshold_len):
    bid = i  # batch id
    langs = []
    datas = []
    for lang, data in lang_dict.items():
        if len(data) > i and data[i].shape[1] <= max_sentence_len:
            langs.append(lang)
            datas.append(data[i])
    # fill the remaining slots with shuffled samples from the larger languages
    while len(datas) < len(lang_dict.keys()):
        datas.append(fill_data.pop())
    batch = (bid, langs, datas)
    batches.append(batch)

In [34]:
len(fill_data)  # there is data still left

37441

In [35]:
len_samples_10 = len(fill_data) //50

In [36]:
len_samples_10

748

In [37]:
# from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
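
A small usage sketch of the chunking helper on a toy list. The last chunk is shorter whenever the list length is not a multiple of the chunk size, which is why an incomplete final batch has to be handled separately. The names `toy_parts` and `toy_complete` are only for this example.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


toy_parts = list(chunks(list(range(7)), 3))
# keep only the chunks that have the full size
toy_complete = [p for p in toy_parts if len(p) == 3]
```

Here `toy_parts` is `[[0, 1, 2], [3, 4, 5], [6]]`. In the same way, the 37441 leftover samples give 748 complete chunks of 50 plus one chunk with the remaining 41 samples.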


In [38]:
batches_10_langs = list(chunks(fill_data, 50))

In [39]:
# check that all batches are 50 samples long
batch_lens = list(map(lambda x: len(x[2]), batches))

In [40]:
len(batch_lens), max(batch_lens), min(batch_lens), len(batches_10_langs)

(14832, 50, 50, 749)

In [41]:
14832 * 50

741600

In [42]:
# there are then a maximum
(14832 + 749) * 50
# datapoints

779050

Checking that the data is OK: $779078 > 779050$.
So the number of original samples is larger than the number of samples in the prepared dataset. This is OK.

In [43]:
# see how many languages and how many batches with each amount of languages
b_stats = {}
for bid, blangs, bdatas in batches:
    l = len(blangs)
    if l in b_stats:
        b_stats[l] +=1
    else:
        b_stats[l] = 1

In [44]:
b_stats

{50: 18,
 49: 5,
 48: 144,
 47: 217,
 46: 182,
 45: 678,
 44: 107,
 43: 158,
 42: 47,
 41: 81,
 40: 244,
 39: 184,
 38: 192,
 37: 1465,
 36: 142,
 35: 507,
 34: 281,
 33: 295,
 32: 249,
 31: 201,
 30: 328,
 29: 443,
 28: 1025,
 27: 443,
 26: 239,
 25: 582,
 24: 832,
 23: 254,
 22: 477,
 21: 1265,
 20: 3547}

Now I need to bring the data into a common shape, so that all batches have the same shape.

First I need to create the list of batches and reverse its order, since the training order goes from the least diverse to the most diverse batches, with up to 50 languages in the last 19 batches (19 is the number of samples of the language with the least training data).

I drop the last array from batches_10_langs, as it is not complete with 50 samples.


In [45]:
batches_data = list(map(lambda x: x[2], batches)) + batches_10_langs[:-1]
batches_data.reverse()  # have to train first with the least diverse, and at the end with the most diverse ones

In [46]:
len(batches_data) * 50

779000

Now find the maximum length of all the data samples; this will be the dimension needed for the numpy array.

In [47]:
seq_len = []
for b in batches_data:
    for s in b:
        seq_len.append(s.shape[1])

min_seq_len, max_seq_len = min(seq_len), max(seq_len)

In [48]:
min_seq_len, max_seq_len

(1, 1017)

In [49]:
# count how many samples there are for each sequence length
len_stats = {}
for l in seq_len:
    if l in len_stats:
        len_stats[l] +=1
    else:
        len_stats[l] = 1
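
The same kind of length histogram can be sketched with `collections.Counter` from the standard library. The lengths below are hypothetical and the variable names are chosen so they do not clash with the notebook's own.

```python
from collections import Counter

# hypothetical sequence lengths, only for illustration
toy_lengths = [3, 1, 3, 2, 3, 1]

toy_stats = Counter(toy_lengths)
# sort by length so the histogram reads from shortest to longest
toy_stats_sorted = dict(sorted(toy_stats.items()))
# number of samples at or below a given length
toy_short = sum(v for k, v in toy_stats_sorted.items() if k <= 2)
```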

In [50]:
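# SortedDict is not imported explicitly in this notebook; it is assumed to come from
# one of the star imports above (or from the sortedcontainers package)
len_stats = SortedDict(sorted(len_stats.items(), key=lambda kv: kv[0]))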

In [51]:
len_stats

SortedDict({1: 85, 2: 500, 3: 608, 4: 1372, 5: 1980, 6: 2047, 7: 2623, 8: 2452, 9: 2700, 10: 2864, 11: 3295, 12: 3374, 13: 4058, 14: 3783, 15: 4669, 16: 4208, 17: 4946, 18: 4523, 19: 5896, 20: 4667, 21: 5325, 22: 4988, 23: 5638, 24: 5325, 25: 5913, 26: 5500, 27: 6093, 28: 5716, 29: 6235, 30: 5605, 31: 6440, 32: 5688, 33: 6369, 34: 5779, 35: 6453, 36: 5937, 37: 6481, 38: 5594, 39: 6213, 40: 5816, 41: 6416, 42: 5673, 43: 6420, 44: 5583, 45: 6149, 46: 5515, 47: 6120, 48: 5444, 49: 5946, 50: 5509, 51: 5850, 52: 5252, 53: 6008, 54: 5469, 55: 5829, 56: 5297, 57: 5752, 58: 5313, 59: 5581, 60: 5279, 61: 5651, 62: 5142, 63: 5393, 64: 5049, 65: 5419, 66: 5016, 67: 5361, 68: 5049, 69: 5264, 70: 4927, 71: 5304, 72: 4869, 73: 5132, 74: 4873, 75: 5118, 76: 4771, 77: 5101, 78: 4787, 79: 4975, 80: 4706, 81: 5003, 82: 4520, 83: 4894, 84: 4550, 85: 4854, 86: 4509, 87: 4747, 88: 4440, 89: 4612, 90: 4360, 91: 4389, 92: 4291, 93: 4397, 94: 4189, 95: 4259, 96: 4130, 97: 4253, 98: 3896, 99: 4220, 100: 3862, 

In [52]:
sum(len_stats.values())

779000

In [53]:
sum([v for k,v in len_stats.items() if k > 1024])

0

In [54]:
sum([v for k,v in len_stats.items() if k <= 5])

4545

The data still seems quite bad: there are many samples whose length is suspiciously low (85 of length 1, 500 of length 2, etc.), so I should go back to the data generation and filter out everything below a threshold.

I'll also need to manually check the files that contain data that seems bad. The number of languages (and/or samples) may still have to be cut down in order to improve data quality.

I will need to filter the data by a minimum and a maximum length; for this I need to work out and decide on the (maybe minimum and) maximum size of the input data for the network.
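
A minimal sketch of such a length filter, assuming samples are arrays of shape `(3, length)`. The helper name `filter_by_length` and the minimum length of 6 are hypothetical; the actual threshold still has to be decided.

```python
import numpy as np


def filter_by_length(samples, min_len, max_len):
    """Keep samples whose sequence length (axis 1) lies in [min_len, max_len]."""
    return [s for s in samples if min_len <= s.shape[1] <= max_len]


# hypothetical samples with lengths 1, 2, 6, 40 and 2000
toy_seqs = [np.zeros((3, n), dtype=np.uint16) for n in (1, 2, 6, 40, 2000)]
toy_kept = filter_by_length(toy_seqs, min_len=6, max_len=1024)
kept_lengths = [s.shape[1] for s in toy_kept]
```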

Nevertheless, for the moment I'll work on the network to define the input and output shapes.

Input will be at most 1024 chars long; this is to avoid overcomplication and big networks on my setup, but it can be changed later.

I might leave in the sentences with small length (1, 2, ...) just to have some "noisy" input ... ?


In [57]:
%%time
padded_data = []
for batch in batches_data:
    for s in batch:
        padded_data.append(np.pad(s, [(0,0),(0,1024 - s.shape[1])], mode='constant'))

CPU times: user 26 s, sys: 2.03 s, total: 28 s
Wall time: 28 s
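
An alternative worth trying is to write each sample straight into one preallocated array instead of padding sample by sample and stacking afterwards. This is only a sketch: `pad_into_array` is a hypothetical helper, it assumes every sample has the same number of rows, that no sample is longer than `max_len`, and that all values fit in the target dtype. Whether it is actually faster on this data has not been measured here.

```python
import numpy as np


def pad_into_array(samples, max_len, dtype=np.uint16):
    """Copy variable-length (rows, length) samples into one zero-filled array."""
    rows = samples[0].shape[0]
    out = np.zeros((len(samples), rows, max_len), dtype=dtype)
    for i, s in enumerate(samples):
        out[i, :, : s.shape[1]] = s  # the rest of each row stays zero (padding)
    return out


# hypothetical samples of lengths 2 and 4, filled with the value 7
toy_arrays = [np.full((3, n), 7, dtype=np.int32) for n in (2, 4)]
toy_packed = pad_into_array(toy_arrays, max_len=5)
```

This also produces the `uint16` array directly, so the intermediate `int32` copy of the full dataset is not needed.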


In [58]:
all_train_data_np = np.stack(padded_data, axis=0)

In [59]:
all_train_data_np.dtype

dtype('int32')

In [60]:
# reduce precision: every index is below 2^11 (the maximum for a 2-segment utf-8 encoding), so uint16 is enough
np_data = all_train_data_np.astype(np.uint16)

In [61]:
np_data.dtype

dtype('uint16')
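
Since `astype` does not complain when values fall outside the target range, a guard before the cast can be useful. The helper `to_uint16_checked` below is a hypothetical sketch, not part of the original pipeline.

```python
import numpy as np


def to_uint16_checked(arr):
    """Cast to uint16 only if every value is representable, else raise."""
    info = np.iinfo(np.uint16)
    if arr.min() < info.min or arr.max() > info.max:
        raise ValueError("values do not fit in uint16")
    return arr.astype(np.uint16)


# toy values below 2^11, as expected for this dataset
toy_checked = to_uint16_checked(np.array([[0, 126, 2047]], dtype=np.int32))
```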

In [62]:
all_train_data_np.shape

(779000, 3, 1024)

In [63]:
# base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4"
# train_fname = os.path.join(base_dir, "traindev_np_batches_779000x3x1024_int32.npy")
# np.save(train_fname, all_train_data_np)

In [64]:
# this file is 4.8GB on disk ... can't fit in my GPU and also have the models there, nice I do have 
train_fname = os.path.join(base_dir, "traindev_np_batches_779000x3x1024_uint16.npy")
np.save(train_fname, np_data)
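
For a file of this size it may help to memory-map it when reading it back, so that only the slices actually used are loaded. The sketch below uses a small stand-in array and a temporary file; the file name and shapes are made up for illustration.

```python
import os
import tempfile

import numpy as np

# small stand-in array; the real file has shape (779000, 3, 1024)
toy_array = np.arange(24, dtype=np.uint16).reshape(4, 3, 2)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_uint16.npy")
    np.save(toy_path, toy_array)
    # mmap_mode="r" maps the file instead of reading it all into memory
    mapped = np.load(toy_path, mmap_mode="r")
    first_batch = np.array(mapped[:2])  # copy only the slice that is needed
    del mapped  # release the mapping before the temporary directory is removed
```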

In [65]:
np_data.shape

(779000, 3, 1024)

In [66]:
np_data[-1,:,:]

array([[ 65,  32, 118, ...,   0,   0,   0],
       [  5,  17,   7, ...,   0,   0,   0],
       [126,   0, 194, ...,   0,   0,   0]], dtype=uint16)

In [67]:
np.count_nonzero(np_data[0,:,:]), np.count_nonzero(np_data[-1,:,:])

(172, 280)