# Data Preparation for Multi-Language Supervised Part-of-Speech Training

This notebook prepares the data that was pre-processed from the CoNLL-U files of the UD treebanks v2.4.

First, load all datasets into memory (just because I can).

Then count the number of samples in each dataset and compute some statistics.

Here I need to take the different sampling strategies into account.

Each batch should contain samples from all the languages; otherwise the training will not be optimal. The sampling strategy affects performance on both over-sampled and under-sampled languages.

The main sampling strategies and their trade-offs (TODO: look back at the source and note it here) are:

- Sampling the same amount from each language: benefits the languages with fewer training samples, which take advantage of what is learned from the languages with the most samples. Penalises the languages with more samples.
- Rate based: benefits transfer from languages with more samples to those with fewer, and does not penalise the languages with more data too much.
- Proportional: penalises the languages with less data the most.
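
The three strategies above can be sketched as one family of sampling probabilities. This is only an illustration under an assumption of mine: "rate based" is read here as exponent smoothing, where language `i` with `n_i` samples is drawn with probability proportional to `n_i ** alpha`. The function name `sampling_probs`, the exponent `0.5` and the toy counts are hypothetical and not taken from the source I still need to look up.

```python
def sampling_probs(counts, alpha):
    """Return p_i proportional to n_i ** alpha for each language count n_i."""
    weights = [n ** alpha for n in counts]
    total = sum(weights)
    return [w / total for w in weights]

counts = [100, 1000, 10000]  # hypothetical per-language sample counts

uniform = sampling_probs(counts, alpha=0.0)       # same amount from each language
rate_based = sampling_probs(counts, alpha=0.5)    # assumed smoothing exponent
proportional = sampling_probs(counts, alpha=1.0)  # follows the raw counts

for name, probs in [("uniform", uniform),
                    ("rate based", rate_based),
                    ("proportional", proportional)]:
    print(name, [round(p, 3) for p in probs])
```

With `alpha` between 0 and 1 the smallest language is sampled more often than under proportional sampling, and the largest one more often than under uniform sampling, which matches the trade-off described in the list.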


Data from the smallest datasets should not be repeated (at least not too much) to avoid overfitting during this stage. Overfitting is an issue for now; I will work on that later, testing what I can accomplish with overfitting.

One supposition I have is that more complex languages (syntactically and semantically) might be better for training and for transferring knowledge to less complex languages and/or languages with less training data. I won't try to test this hypothesis (for the moment at least).

In [1]:
import numpy as np
import pandas as pd
from langmodels.utils.preprocess_conllu import *
from utf8_encoder import *
import pickle
import random

In [2]:
base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.4"

In [3]:
allfiles = get_all_files_recurse(base_dir)
charseq_files = [f for f in allfiles if f.endswith(".pkl") and "charsec_code" in f]  # "charsec" is a typo from when the files were saved (fixed now in the .py file)

In [4]:
len(charseq_files)

271

In [5]:
charseq_train = [ f for f in charseq_files if "-train-" in f]
charseq_test = [ f for f in charseq_files if "-test-" in f]
charseq_dev = [ f for f in charseq_files if "-dev-" in f]

In [6]:
len(charseq_files), len(charseq_train), len(charseq_test), len(charseq_dev)

(271, 81, 117, 73)

Now load all the data and start counting.

In [7]:
def load_data(file_list):
    data = []
    for fname in file_list:
        name = path_leaf(fname)
        with open(fname, "rb") as f:
            d = pickle.load(f)
            data.append((name, d))
    return data

In [8]:
%%time
data_train = load_data(charseq_train)
data_test = load_data(charseq_test)
data_dev = load_data(charseq_dev)

CPU times: user 1.02 s, sys: 382 ms, total: 1.4 s
Wall time: 1.4 s


In [9]:
len(data_train), len(data_test), len(data_dev)

(81, 117, 73)

In [10]:
data_count_train = [(n,len(d)) for n,d in data_train]
data_count_test = [(n,len(d)) for n,d in data_test]
data_count_dev = [(n,len(d)) for n,d in data_dev]

In [11]:
df_train = pd.DataFrame(data_count_train)
df_train.columns = ["name", "count"]
df_train = df_train.sort_values("count")

df_test = pd.DataFrame(data_count_test)
df_test.columns = ["name", "count"]
df_test = df_test.sort_values("count")

df_dev = pd.DataFrame(data_count_dev)
df_dev.columns = ["name", "count"]
df_dev = df_dev.sort_values("count")

In [12]:
df_train

Unnamed: 0,name,count
51,bxr_bdt-ud-train-charsec_code.pkl,19
37,hsb_ufal-ud-train-charsec_code.pkl,23
33,swl_sslc-ud-train-charsec_code.pkl,87
15,lt_hse-ud-train-charsec_code.pkl,153
20,be_hse-ud-train-charsec_code.pkl,319
4,ga_idt-ud-train-charsec_code.pkl,566
2,gl_treegal-ud-train-charsec_code.pkl,600
41,hy_armtdp-ud-train-charsec_code.pkl,672
59,fr_partut-ud-train-charsec_code.pkl,803
74,hu_szeged-ud-train-charsec_code.pkl,910


In [13]:
df_test

Unnamed: 0,name,count
46,swl_sslc-ud-test-charsec_code.pkl,34
22,lt_hse-ud-test-charsec_code.pkl,55
34,aii_as-ud-test-charsec_code.pkl,57
92,gun_thomas-ud-test-charsec_code.pkl,98
64,yo_ytb-ud-test-charsec_code.pkl,100
77,akk_pisandub-ud-test-charsec_code.pkl,101
86,fr_partut-ud-test-charsec_code.pkl,110
67,kpv_ikdp-ud-test-charsec_code.pkl,117
107,en_partut-ud-test-charsec_code.pkl,153
20,it_partut-ud-test-charsec_code.pkl,153


In [14]:
df_dev

Unnamed: 0,name,count
13,lt_hse-ud-dev-charsec_code.pkl,55
18,be_hse-ud-dev-charsec_code.pkl,65
31,swl_sslc-ud-dev-charsec_code.pkl,82
53,fr_partut-ud-dev-charsec_code.pkl,107
11,it_partut-ud-dev-charsec_code.pkl,156
65,en_partut-ud-dev-charsec_code.pkl,156
29,af_afribooms-ud-dev-charsec_code.pkl,194
63,el_gdt-ud-dev-charsec_code.pkl,403
17,fr_sequoia-ud-dev-charsec_code.pkl,412
2,mt_mudt-ud-dev-charsec_code.pkl,433


In [15]:
df_train.describe()

Unnamed: 0,count
count,81.0
mean,9653.876543
std,13826.426142
min,19.0
25%,1781.0
50%,5396.0
75%,13123.0
max,70123.0


In [16]:
df_test.describe()

Unnamed: 0,count
count,117.0
mean,1225.042735
std,1874.395771
min,34.0
25%,518.0
50%,957.0
75%,1204.0
max,17028.0


In [17]:
df_dev.describe()

Unnamed: 0,count
count,73.0
mean,1413.945205
std,2298.39349
min,55.0
25%,564.0
50%,912.0
75%,1476.0
max,17294.0


In [18]:
70123/19

3690.684210526316

Proportional sampling might be difficult because the difference in the number of samples is too large: 19 samples for the smallest dataset and 70123 for the largest.

This means that I need a sampling strategy that is somewhat proportional, but a ratio of about 3690 is too much for plain proportional sampling.

The training order might also be important, so it is better to use the languages with the least data at the end of the training, so that they benefit from the previous training instead of being used for initialization.

Maybe repeating them a few times there will not necessarily overfit?

So I will order the training so that the last batches contain the samples from the languages with the least training data, and the first batches do not contain them.

All batches will also be long enough to contain at least one sample from each training (language) dataset.

For the languages with the fewest samples it might also be good to merge the training and dev datasets.

Many datasets also lack one or more of the train, test and dev splits. Some have only one of them; these are good for testing how well the network generalizes to languages it was not trained on.
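
A minimal sketch of how the languages that only have a test split could be listed, assuming (as in the merge cell further down) that the language code is the part of the file name before the first underscore. The helper names `lang_code` and `unseen_test_langs` are hypothetical, and the lists are a toy subset of file names, so the result says nothing about the full dataset.

```python
def lang_code(fname):
    """Language code, e.g. 'fr' from 'fr_partut-ud-test-charsec_code.pkl'."""
    return fname.split("_")[0]

def unseen_test_langs(train_names, dev_names, test_names):
    """Languages that appear in the test files but in no train or dev file."""
    seen = {lang_code(n) for n in train_names} | {lang_code(n) for n in dev_names}
    return sorted({lang_code(n) for n in test_names} - seen)

# Toy subset of file names, for illustration only
train = ["lt_hse-ud-train-charsec_code.pkl", "fr_partut-ud-train-charsec_code.pkl"]
dev = ["lt_hse-ud-dev-charsec_code.pkl"]
test = ["lt_hse-ud-test-charsec_code.pkl",
        "yo_ytb-ud-test-charsec_code.pkl",
        "aii_as-ud-test-charsec_code.pkl"]

print(unseen_test_langs(train, dev, test))
```

Working at the language level rather than the treebank level matters here: a treebank without a train split can still belong to a language that is trained on through another treebank.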

In [19]:
df_train.head(20)

Unnamed: 0,name,count
51,bxr_bdt-ud-train-charsec_code.pkl,19
37,hsb_ufal-ud-train-charsec_code.pkl,23
33,swl_sslc-ud-train-charsec_code.pkl,87
15,lt_hse-ud-train-charsec_code.pkl,153
20,be_hse-ud-train-charsec_code.pkl,319
4,ga_idt-ud-train-charsec_code.pkl,566
2,gl_treegal-ud-train-charsec_code.pkl,600
41,hy_armtdp-ud-train-charsec_code.pkl,672
59,fr_partut-ud-train-charsec_code.pkl,803
74,hu_szeged-ud-train-charsec_code.pkl,910


In [20]:
df_dev.tail(20)

Unnamed: 0,name,count
9,es_gsd-ud-dev-charsec_code.pkl,1400
26,fr_gsd-ud-dev-charsec_code.pkl,1476
4,lv_lvtb-ud-dev-charsec_code.pkl,1622
70,es_ancora-ud-dev-charsec_code.pkl,1654
28,ca_ancora-ud-dev-charsec_code.pkl,1709
16,pl_lfg-ud-dev-charsec_code.pkl,1745
0,eu_bdt-ud-dev-charsec_code.pkl,1798
57,fro_srcmf-ud-dev-charsec_code.pkl,1842
59,orv_torot-ud-dev-charsec_code.pkl,1852
33,fi_ftb-ud-dev-charsec_code.pkl,1875


In [21]:
def get_root_name(fname):
    rt = path_leaf(fname).split("-ud-")[0]
    return rt

In [22]:
df_train['name_root'] = df_train.name.apply(get_root_name)
df_test['name_root'] = df_test.name.apply(get_root_name)
df_dev['name_root'] = df_dev.name.apply(get_root_name)

In [23]:
df_train.head()

Unnamed: 0,name,count,name_root
51,bxr_bdt-ud-train-charsec_code.pkl,19,bxr_bdt
37,hsb_ufal-ud-train-charsec_code.pkl,23,hsb_ufal
33,swl_sslc-ud-train-charsec_code.pkl,87,swl_sslc
15,lt_hse-ud-train-charsec_code.pkl,153,lt_hse
20,be_hse-ud-train-charsec_code.pkl,319,be_hse


In [24]:
smax_train = max([i[1] for i in data_count_train])

In [25]:
smax_train

70123

So the issue is to create a sampling strategy. I don't know what should be done, but I'll create a strategy anyway based on my suppositions.

So I'll do the following:
- First I'll merge the files per language; the train and dev datasets will be merged to benefit the smaller languages
- The batch size will be the number of languages on which I'll be training the network
- Batches might not (will not) all contain the same number of languages, but each batch will be saved with its number of languages as a reference
- For training I'll first send the batches with the least language variability, and at the end of each epoch the ones with the most variability (meaning the ones containing more languages)
- The number of batches will be defined by .... ???
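
The ordering in the list above can be sketched as a sort on the number of languages per batch. This assumes each batch is stored as a tuple `(batch_id, langs, samples)`, which is the layout used when the batches are built later in this notebook; the function name and the toy batches are hypothetical.

```python
def order_by_language_count(batches):
    """Sort batches so that those covering the fewest languages come first."""
    return sorted(batches, key=lambda batch: len(batch[1]))

# Toy batches: (batch_id, languages present, samples)
toy_batches = [
    (0, ["bxr", "cs", "de"], ["s0", "s1", "s2"]),
    (1, ["cs"], ["s3", "s4", "s5"]),
    (2, ["cs", "de"], ["s6", "s7", "s8"]),
]

ordered = order_by_language_count(toy_batches)
print([len(langs) for _, langs, _ in ordered])
```

`sorted` is stable, so batches with the same number of languages keep their original relative order; shuffling within each group would be a separate step.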

Also, for the languages that have more data I'll clean the datasets by filtering the language files according to the rating on the UD treebank page, starting with the languages that have the most data.

The datasets that I'm cleaning are the ones from German and Czech:
* Czech-CAC
* Czech-CLTT
* Czech-PUD

For the German datasets I'll avoid merging the dev datasets, and the following datasets are not used for training:
* German-GSD
* German-PUD
* German-LIT

For Russian, German and Czech I won't merge the dev sets.

The German HDT training set comes in two parts, a and b; part b is left out because this dataset contains a lot of data.

Merge by language, also merging the train and dev datasets:

In [26]:
lang_dict = {}

for fname, data in data_train + data_dev:
    lang = fname.split("_")[0]  # the language code is the first 2 or 3 chars of the filename
    # avoid merging dev sets for the languages with more data
    if lang in ["de", "cs", "ru"]:
        print("lang ", lang, " fname: ", fname)
        if "-dev-" in fname or "de_hdt-ud-train-b" in fname:
            print("skipping: ", fname)
            continue
    if lang not in lang_dict:
        # copy (assuming each dataset is a list) so that += below does not also extend the list held in data_train
        lang_dict[lang] = list(data)
    else:
        lang_dict[lang] += data

lang  ru  fname:  ru_taiga-ud-train-charsec_code.pkl
lang  ru  fname:  ru_gsd-ud-train-charsec_code.pkl
lang  de  fname:  de_hdt-ud-train-a-charsec_code.pkl
lang  de  fname:  de_hdt-ud-train-b-charsec_code.pkl
skipping:  de_hdt-ud-train-b-charsec_code.pkl
lang  cs  fname:  cs_pdt-ud-train-charsec_code.pkl
lang  ru  fname:  ru_syntagrus-ud-train-charsec_code.pkl
lang  ru  fname:  ru_taiga-ud-dev-charsec_code.pkl
skipping:  ru_taiga-ud-dev-charsec_code.pkl
lang  cs  fname:  cs_fictree-ud-dev-charsec_code.pkl
skipping:  cs_fictree-ud-dev-charsec_code.pkl
lang  ru  fname:  ru_gsd-ud-dev-charsec_code.pkl
skipping:  ru_gsd-ud-dev-charsec_code.pkl
lang  de  fname:  de_hdt-ud-dev-charsec_code.pkl
skipping:  de_hdt-ud-dev-charsec_code.pkl
lang  cs  fname:  cs_pdt-ud-dev-charsec_code.pkl
skipping:  cs_pdt-ud-dev-charsec_code.pkl
lang  ru  fname:  ru_syntagrus-ud-dev-charsec_code.pkl
skipping:  ru_syntagrus-ud-dev-charsec_code.pkl


In [27]:
len(lang_dict.keys())

50

In [28]:
data_count_traindev = [(k,len(d)) for k,d in lang_dict.items()]
df_traindev = pd.DataFrame(data_count_traindev)
df_traindev.columns = ["name", "count"]
df_traindev = df_traindev.sort_values("count")

In [29]:
df_traindev

Unnamed: 0,name,count
39,bxr,19
31,hsb,23
28,swl,169
17,be,384
4,ga,566
34,hy,1246
49,hu,1351
26,af,1509
3,mt,1556
12,wo,1637


In [30]:
df_traindev["count"].sum()

779078

The threshold for the number of batches will be chosen so as to maximize the number of complete batches; this also determines N, the number of languages that provide the fill data.
~~Just because (arbitrary number) I decide that the batches with the least languages will be 10, this might give enough language variability for the training.~~

As there are 50 languages, the batches will contain 50 samples (according to "Revisiting Small Batch Training for Deep Neural Networks", https://arxiv.org/pdf/1804.07612.pdf, small batches are better)~~, so the batches with the least n# of languages will contain 5 samples per lang, and the ones with 50 languages one sample per batch.~~

To complete the batches that do not contain all the languages, I intend to choose randomly from the ~~10~~ N languages with the most samples.

In [31]:
# creating a list to fill the missing data in the batches; the data will then be randomly shuffled
# I explore here the minimum number of languages that will give me the maximum number of batches.
# I'm only looking at the number of samples per language, nothing else.

fi_len = 30437  # with this threshold I only keep up to 17 languages and 17524 batches
ar_len = 24759  # with this threshold I only keep up to 14 languages and 20953 batches
ro_len = 19991  # with this threshold I only keep up to 15 languages and 19991 batches 
# is the SWEET SPOT for the max number of batches
ca_len = 14832 # with this threshold I only keep up to 20 languages and 14832 batches  and there are 749 extra batches with up to 19 languages
threshold_len = ca_len

fill_data = []
# collect the samples beyond the threshold; they are used to fill the incomplete batches
for lang, data in lang_dict.items():
    if len(data) > threshold_len:
        fill_data.extend(data[threshold_len:])
        
# ensure that the data chosen to fill the missing elements is random 
# (and will statistically be proportional to the number of datapoints of each language)
random.shuffle(fill_data)
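
Candidate thresholds can also be compared from the per-language counts alone, before building any batch. A minimal sketch, assuming one batch per index below the threshold and one slot per language in each batch, as in the cells that follow; `threshold_stats` and the toy counts are hypothetical.

```python
def threshold_stats(lang_counts, threshold):
    """Summarise what a threshold implies for batch construction.

    lang_counts maps language -> number of samples.
    Returns (n_fill_langs, fill_available, fill_needed).
    """
    counts = list(lang_counts.values())
    n_langs = len(counts)
    # languages with samples left over beyond the threshold
    n_fill_langs = sum(1 for c in counts if c > threshold)
    # size of the fill pool: everything beyond the threshold
    fill_available = sum(max(c - threshold, 0) for c in counts)
    # slots filled directly by each language's own samples
    direct = sum(min(c, threshold) for c in counts)
    # remaining slots that must come from the fill pool
    fill_needed = threshold * n_langs - direct
    return n_fill_langs, fill_available, fill_needed

toy_counts = {"a": 2, "b": 5, "c": 10}  # toy numbers, not the real dataset
print(threshold_stats(toy_counts, threshold=4))
```

A threshold is only usable when `fill_available >= fill_needed`; the difference between the two is what remains in the fill pool after all batches are complete.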

In [32]:
len(fill_data)

352417

In [33]:
start_index = 0
batches = []

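# build one batch per sample index i, up to the threshold
for i in range(threshold_len):
    bid = 0  # batch id placeholder, always 0 here
    langs = []
    datas = []
    # take the i-th sample of every language that still has one
    for lang, data in lang_dict.items():
        if len(data) > i:
            langs.append(lang)
            datas.append(data[i])
    # top the batch up to one slot per language with shuffled fill data
    while len(datas) < len(lang_dict):
        datas.append(fill_data.pop())
    batch = (bid, langs, datas)
    batches.append(batch)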

In [34]:
len(fill_data)  # there is data still left

37478

In [35]:
len_samples_10 = len(fill_data) //50

In [36]:
len_samples_10

749

In [37]:
# from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


In [38]:
batches_10_langs = list(chunks(fill_data, 50))

In [39]:
# check that all batches are 50 samples long
batch_lens = list(map(lambda x: len(x[2]), batches))

In [40]:
len(batch_lens), max(batch_lens), min(batch_lens), len(batches_10_langs)

(14832, 50, 50, 750)
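
Note that `chunks` yields a final, shorter chunk when the list length is not a multiple of the chunk size, which is why `batches_10_langs` has 750 entries while `37478 // 50` is 749. A minimal sketch of a variant that separates the complete chunks from the leftover; `full_chunks` is a hypothetical helper and the list of integers only stands in for the real samples.

```python
def full_chunks(items, size):
    """Split items into chunks of exactly `size`; return (chunks, leftover)."""
    n_full = len(items) // size
    complete = [items[i * size:(i + 1) * size] for i in range(n_full)]
    leftover = items[n_full * size:]
    return complete, leftover

# Stand-in for the 37478 samples left in fill_data
full, rest = full_chunks(list(range(37478)), 50)
print(len(full), len(rest))
```

Whether the leftover samples are dropped or kept as a short final batch is a design choice; keeping them means that not every batch has 50 samples.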

In [51]:
14832 * 50

741600

In [53]:
# there are then 
(14832 + 749) * 50
# datapoints

779050

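Checking that the data is OK: $779078 > 779050$.
The number of original samples is larger than the number of samples in the prepared dataset. This is OK: the difference of 28 samples is the last, incomplete chunk of `fill_data` ($37478 - 749 \times 50 = 28$).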

In [43]:
# see how many languages and how many batches with each amount of languages
b_stats = {}
for bid, blangs, bdatas in batches:
    l = len(blangs)
    if l in b_stats:
        b_stats[l] +=1
    else:
        b_stats[l] = 1

In [44]:
b_stats

{50: 19,
 49: 4,
 48: 146,
 47: 215,
 46: 182,
 45: 680,
 44: 105,
 43: 158,
 42: 47,
 41: 81,
 40: 245,
 39: 183,
 38: 192,
 37: 1475,
 36: 132,
 35: 508,
 34: 280,
 33: 295,
 32: 250,
 31: 200,
 30: 328,
 29: 443,
 28: 1026,
 27: 442,
 26: 239,
 25: 582,
 24: 833,
 23: 253,
 22: 479,
 21: 1264,
 20: 3546}
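
The same histogram can be obtained with `collections.Counter` from the standard library. A small sketch on toy batches, assuming the `(batch_id, langs, samples)` tuple layout used above; the function name is hypothetical.

```python
from collections import Counter

def language_count_histogram(batches):
    """Map number of languages in a batch -> number of such batches."""
    return Counter(len(langs) for _, langs, _ in batches)

toy_batches = [
    (0, ["a", "b", "c"], []),
    (0, ["a", "b"], []),
    (0, ["a", "b"], []),
]
hist = language_count_histogram(toy_batches)
print(dict(hist))
```

The values of the histogram should add up to the number of batches, which gives a quick consistency check.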

Now I need to do something about the data shape so that all batches have the same shape.

In [45]:
len(batches[0])

3

In [46]:
b0 = batches[0][2]

In [47]:
len(b0)

50

In [48]:
b0[10].shape

(3, 168)

In [49]:
# longest sequence in the first batch (the sequence length is the second dimension)
max_seq_len = max(s.shape[1] for s in b0)

In [50]:
max_seq_len

316
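
One possible way to give all samples in a batch the same shape is to right-pad every sequence to the longest one and stack the result. A minimal sketch, assuming that each sample is a 2-D NumPy array of shape `(3, seq_len)` as seen above and that `0` is free to be used as the padding value (I have not checked that against the encoding); `pad_batch` is a hypothetical helper.

```python
import numpy as np

def pad_batch(samples, pad_value=0):
    """Right-pad 2-D arrays of shape (rows, seq_len) and stack them."""
    max_len = max(s.shape[1] for s in samples)
    padded = [
        # pad only the second axis, on the right
        np.pad(s, ((0, 0), (0, max_len - s.shape[1])), constant_values=pad_value)
        for s in samples
    ]
    return np.stack(padded)

# Toy samples standing in for two encoded sentences
toy = [np.ones((3, 4), dtype=np.int64), np.ones((3, 7), dtype=np.int64)]
batch = pad_batch(toy)
print(batch.shape)
```

Padding per batch keeps the tensors small; padding everything to a single global maximum length would make all batches the same shape at the cost of more padding.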