# A statistical Conllu file Exploration of  Universal Dependencies

## Introduction

While much work is being done in the current days on NLP and NLU, there is little work on describing why a certain length of transformer (or other as LSTM time steps) architecture has been chosen for the training, it is mostly arbitrary and depends on the goal of the work and resources available (mainly hardware). These decisions are hard once the model has been trained and there is nothing that can be done to extend the length of a transformer (for example) without having to retrain the entire network. There are however some works that tackle variable length sequences. 

This work presents a first complete analysis of the Universal Dependencies v2.6 dataset and presents the globan and individual results of each language present in the dataset.

This work does not intend to be a conference level paper (that is why there are no references to all the papers on each subject), but an informational technical report that might help to better select the most effective compromise text or token length for your particular application.

This notebook is dedicated to explore the basic text statistics (number of tokens, number of character) in the samples covered in the Universal Dependencies v2.6.

## Observations

While doing this work I found 

## Conclusion

The best compromise for choosing a sequence length on the NLP architecture for training will depend mostly on the requirements of the applicatino, nevertheless with the numbers here you should be able to make an informed guess on what might be better for your case.

We can see that having a multi-lingual approach will necessary make the needed sequences longer as there is a large variability on sequence length, but appliying to single language might allow you to optimize your neural architectures

#### Note: EU Official Languages

Just for information (as it happens I'm targetting mostly EU languages in my own research work)

The European Union has 23 official languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenia, Slovene, Spanish and Swedish.

In [1]:
from multiprocessing import Pool, cpu_count

import math
import os, sys
import orjson as json
import pyconll
import pyconll.util
from pycountry import languages

try:
    from utf8.utils import *
except:
    # to solve issue with ipython executing this import
    from utils import *

from preprocessors.preprocess_conllu import *

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [3]:
UD_VERSION = "2.6"
BASEPATH = "~/projects/Datasets/text"
CONLLU_BASEPATH = os.path.join(BASEPATH, 'UniversalDependencies/ud-treebanks-v{}'.format(UD_VERSION))

In [4]:
rootdir=CONLLU_BASEPATH
blacklist=BLACKLIST
allconll = get_all_files_recurse(rootdir)
train, test, dev = filter_conllu_files(allconll, blacklist)

In [5]:

def conllu_get_fields(fname):
    """
    Processes one conllu file
    :param fname: absolute path to the conllu file
    :return:
    """
    conll = pyconll.load_from_file(fname)
    upos = []
    xpos = []
    deprel = []
    sentences = []
    forms = []

    src_lang = path_leaf(fname).split('_')[0]
    for sen in conll:
        sentences.append((src_lang, sen.text))
        try:
            forms.extend([t.form for t in sen._tokens])
        except:
            pass
        try:
            sen_upos = [t.upos for t in sen._tokens]
            upos.append((src_lang, sen.text, tuple(sen_upos)))
        except:
            pass
        try:
            sen_xpos = [t.xpos for t in sen._tokens]
            xpos.append((src_lang, sen.text, tuple(sen_xpos)))
        except:
            pass
        try:
            sen_deprel = [t.deprel for t in sen._tokens]
            deprel.append((src_lang, sen.text, tuple(sen_deprel)))
        except:
            pass
    
    return (set(upos), len(upos)), (set(xpos), len(xpos)), (set(deprel), len(deprel)), (set(sentences), len(sentences)), (set(forms), len(forms))


In [23]:

def _try_get_2list(fname):
    try:
        return conllu_get_fields(fname)
    except Exception as e:
        print("Error processing file: {} \nWith error: {}".format(fname, e))


def conllu_process_get_2list(rootdir=CONLLU_BASEPATH, blacklist=BLACKLIST):
    allconll = get_all_files_recurse(rootdir)
    train, test, dev = filter_conllu_files(allconll, blacklist)
    all_files = train + test + dev
    print(all_files)

    with Pool(processes=cpu_count()) as pool:
        res = pool.map(_try_get_2list, all_files)
        return res


In [24]:
%%time
res = conllu_process_get_2list(blacklist=[])

[]
CPU times: user 13.8 ms, sys: 20.7 ms, total: 34.5 ms
Wall time: 81.8 ms


Finding now the shortest and longest sequences, checking the length and plotting those to see what's happening with the dataset.

In [22]:
res

[]

In [8]:
%%time

upos_data = []
xpos_data = []
deprel_data = []
sentences_data = []
forms_data = []

for r in res:
    upos_val, xpos_val, deprel_val, sentences_val, forms_val = r
#     print("lala 1")
    forms_data.extend(forms_val[0])
    for val in upos_val[0]:
#         print(val)
        lang1, txt1, upos  = val
        upos_data.append((lang1, txt1, upos, len(upos)))
    for lang2, txt2, xpos in xpos_val[0]:
        xpos_data.append((lang2, txt2, xpos, len(xpos)))
    for lang3, txt3, deprel in deprel_val[0]:
        deprel_data.append((lang3, txt3, deprel, len(deprel)))
    for lang4, txt4 in sentences_val[0]:
        sentences_data.append((lang4, txt4, len(txt4)))

# upos_data = sorted(upos_data)
# xpos_data = sorted(xpos_data)
# deprel_data = sorted(deprel_data)

    

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 5.96 µs


In [9]:
df_upos = pd.DataFrame(upos_data, columns=["lang", "text", "upos", "upos_len"])
df_xpos = pd.DataFrame(xpos_data, columns=["lang", "text", "xpos", "xpos_len"])
df_deprel = pd.DataFrame(deprel_data, columns=["lang", "text", "deprel", "deprel_len"])
df_txt = pd.DataFrame(sentences_data, columns=["lang", "text", "text_len"])

In [10]:
df_upos.columns

Index(['lang', 'text', 'upos', 'upos_len'], dtype='object')

In [11]:
df_upos['lang'].describe()

count       0
unique      0
top       NaN
freq      NaN
Name: lang, dtype: object

In [12]:
langs = sorted(df_upos['lang'].unique())

## Token (by UPOS) length analysis and histogram plot

In [13]:
df_upos.describe()

Unnamed: 0,lang,text,upos,upos_len
count,0.0,0.0,0.0,0.0
unique,0.0,0.0,0.0,0.0
top,,,,
freq,,,,


In [14]:
for lang in langs:
    fig, ax = plt.subplots()
    dest_lang = languages.get(alpha_2=lang) if len(lang) == 2 else languages.get(alpha_3=lang)
    dest_lang = dest_lang.name
    ax.set_title(lang +":" +dest_lang )
    df_upos.loc[df_upos['lang'] == lang]['upos_len'].hist(bins=100, ax=ax, label=lang)

A token is a word or a punctuation mark. Punctuation is counted in the sentence length.

Is interesting to see that most languages have few sentences longer than 100 tokens (including the punctuation)

Also, each language has a different centroid for the sentence length.

Also it's an interesting that different languages show different strip patterns, each having a different interval of lengths that are empty (like basque for example) while the count next to it has high counts (length 10 for basque has a count of more than 600 while the left and right are just empty). 

These patterns are really interesting.



In [15]:
df_deprel.describe()

Unnamed: 0,lang,text,deprel,deprel_len
count,0.0,0.0,0.0,0.0
unique,0.0,0.0,0.0,0.0
top,,,,
freq,,,,


In [21]:
langs

[]

In [16]:
for lang in langs:
    fig, ax = plt.subplots()
    dest_lang = languages.get(alpha_2=lang) if len(lang) == 2 else languages.get(alpha_3=lang)
    dest_lang = dest_lang.name
    ax.set_title(lang +":" +dest_lang )
    df_deprel.loc[df_deprel['lang'] == lang]['deprel_len'].hist(bins=100, ax=ax, label=lang)

## Character length analysis and histogram plot

In [17]:
df_txt.describe()

Unnamed: 0,lang,text,text_len
count,0.0,0.0,0.0
unique,0.0,0.0,0.0
top,,,
freq,,,


Checking the values we can see, even though we have a  max of 3471 characters, if we want to see how much of the spectrum we capture is as follows (if it were a normal distribution which I don't want to test yet but it should work well enough):

| char_len | mean+X*std | % captured |
|:--------:|:----------:|:----------:|
|    90    |            |     50%    |
|    138   |            |     75%    |
|    173   |  103+1*70  |    84.2%   |
|    243   |  103+2*70  |    98.8%   |
|    313   |  103+3*70  |    99.9%   |

So going for a maximum sequence length of at least 313 should capture most of the sequences in the training and testing datasets (and as they should also be significant of each language ... it should be enough). This is done for each language later in this notebook.

In [20]:
langs

[]

In [18]:

len_lang_list = []

_99p = []
_98p = []
_84p = []

for lang in langs:
    dest_lang = languages.get(alpha_2=lang) if len(lang) == 2 else languages.get(alpha_3=lang)
    dest_lang = dest_lang.name
    lng_txt = df_txt.loc[df_txt['lang'] == lang]
    d = lng_txt.describe()
    mean = d.loc['mean']['text_len']
    std = d.loc['std']['text_len']
    ef,ne,nn = math.ceil(mean+std), math.ceil(mean+2*std), math.ceil(mean+3*std)
    _99p.append(nn)
    _98p.append(ne)
    _84p.append(ef)
    len_lang_list.append((lang, dest_lang, ef, ne, nn))
    print(dest_lang)
    print("""
            |    {}   |  84.2%   |
            |    {}   |  98.8%   |
            |    {}   |  99.9%   |""".format(ef, ne, nn)
    )
    print(lng_txt.describe())
    print("_"*50)
#     lng_txt.describe()

In [19]:
max(_99p), max(_98p), max(_84p)

ValueError: max() arg is an empty sequence

When checking individually each language, the maximum length would be much higher, this is because some languages contain longer sentences, so we have to deal with this selecting a bigger sentence length to be able to capture most of it.

The longest being Belarusian

The complete list is sorted and printed here:

In [None]:
list(reversed(sorted(len_lang_list, key=lambda x: x[4])))

In [None]:
for lang in langs:
    fig, ax = plt.subplots()
    dest_lang = languages.get(alpha_2=lang) if len(lang) == 2 else languages.get(alpha_3=lang)
    dest_lang = dest_lang.name
    ax.set_title(lang +":" +dest_lang )
    df_txt.loc[df_txt['lang'] == lang]['text_len'].hist(bins=100, ax=ax, label=lang)

In [None]:
MAYBE_BLACKLIST_LANGS = ['ceb', 'jv', 'ce', 'cv', 'dv', 'ht', 'hy', 'ku', 'mh', 'mi', 'ps', 'su', 'tk', 'ba', 'tg',
                         'tt', 'ug'
                         ]
# blacklisting due to lack of samples, extinct language or other issue
EXTRA_BLACKLIST = ["olo", "swl", "bxr", "fa", "sme", "aii", "gun", "yo", "akk", "fo", "mdf",
                   "krl", "pcm", "bho", "sms", "am", "bm", "got", "cu", "hsb", "wo",
                  ]

# blacklists to reduce even more the number of languages, latin is left because will be used FIRST to train to set a learning baseline ...
# base, greek, old french and ancient greek will be nice too if I manage to transliterate it
# extra blacklisting to reduce the number of alphabets used and other low resource and other non-official languages
MORE_EXTRA_BLACKLIST = ["af", "gsw", "he", 
                        "ca", "cy", "eu", "ga", "gd", "gl", "cr", "hy", "tr",
                       ]

ANCIENT_LANGS = ["la", "grc", "fro",]  # latin, ancient greek, old french -> base for MANY languages
GREEK_BLACKLIST = ["el", "grc"]

# this means mainly taking out cyrillic scripts ... but bulgarian IS in the EU and uses 
# cyrillic so ... there it is, won't take them out
CYRILLIC_BLACKLIST = ["be", "bg", "he", "ru", "sr", "uk"]


BLACKLIST_LANGS = ['ar', 'as', 'arz', 'azb', 'bn', 'bp', 'ckb', 'eo', 'ew', 'fa', 'fo', 'gom', 'gu', 'hi', 'hu', 'id',
                   'ilo', 'ja', 'ka', 'kk', 'ko', 'lmo', 'ml', 'mr', 'mwl', 'ne', 'pa', 'py', 'sh', 'si', 'ta', 'te',
                   'th', 'tl', 'ur', 'vi',
                   'wuu', 'yi', 'zb', 'zh'
                   ] + MAYBE_BLACKLIST_LANGS + EXTRA_BLACKLIST

BLACKLIST_LANGS = sorted(list(set(BLACKLIST_LANGS)))

len(BLACKLIST_LANGS)

In [None]:
len(langs)

From the results, sample count and some other observations, I'm now cutting more languages (the EXTRA_BLACKLIST) such as to cut the complexity of the training dataset while trying to keep as much as possible to make a multi lingual 

In [None]:
# @interact
# def show_len_by_lang(column='upos_len', x=10):
# #     return df_upos[df_upos[column == x]]  # ['upos_len'].hist(bins=100, log=True)
#     return df_upos.loc[df_upos[column] > x]

In [None]:
df_deprel['deprel'].head(n=10)

In [None]:
# train[0]

In [None]:
# %%time

# all_upos = set([])
# all_upos_count = 0
# all_deprel = set([])
# all_deprel_count = 0

# for r in res:
#     (upos, upos_count), (xpos, xpos_count), (deprel, deprel_count) = r
#     all_upos = all_upos.union(upos)
#     all_upos_count += upos_count
#     all_deprel = all_deprel.union(deprel)
#     all_deprel_count += deprel_count
    



In [None]:
# len(all_upos), all_upos_count, len(all_deprel), all_deprel_count

It seems kind of big for putting it directly in a memory, some compression should be done if this method is to work ... (it should be more data efficient than the current methods)

For a single language that might be feasible, but as more languages are added this leads to a volume problem ... 

In [None]:
lall_upos = list(all_upos)
lup = [len(u) for u in lall_upos]

In [None]:
max(lup), min(lup)

In [None]:
lup.index(max(lup)), lup.index(min(lup))

In [None]:
lall_upos[83026]

In [None]:
lall_upos[371226]

The issue here seems to be the length of the sentences, putting a max length might work but will leave longer sentences out of the training.

The other idea would be to use a more thourough diccionary that contains more elements this will make smaller sentences, this means, for the training set the max length is 515 words, this is still too much.

The counterpoint when using a bigger dictionary is that even if the length of the input is smaller, the bigger the dictionary  the bigger the memory impact for the encoding and decoding.




In [None]:
# now I should get all the words and check the number and length there are

len(forms_data)