# A Statistical Exploration of the Universal Dependencies CoNLL-U Files

## Introduction

While much current work focuses on NLP and NLU, little of it explains why a given sequence length (for a transformer, or the number of time steps for an LSTM) was chosen for training. The choice is mostly arbitrary and depends on the goal of the work and the resources available (mainly hardware). These decisions are hard to revisit once the model has been trained: the sequence length of a transformer, for example, cannot be extended without retraining the entire network. There are, however, some works that tackle variable-length sequences.

This work presents a first complete analysis of the Universal Dependencies v2.6 dataset, with global results and individual results for each language present in the dataset.

This work is not intended to be a conference-level paper (which is why it does not reference the papers on each subject), but an informational technical report that might help you select the most effective compromise on text or token length for your particular NLP application.

The number of languages analyzed is 92. Token length is measured as the number of UPOS tags per sample in the dataset, while character length is just that. There is no analysis of what constitutes a word: a token includes the punctuation and other symbols present in the text samples. For linguistic analysis purposes more de
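
To make the two length measures concrete, here is a minimal sketch that computes both for a single sentence. The sample sentence and the helper `sentence_lengths` are made up for illustration, and the rule of counting only lines with an integer ID is an assumption, not necessarily what `preprocessors.ud_conllu_stats` does.

```python
# Hypothetical three-token sample in CoNLL-U layout (10 tab-separated columns).
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def sentence_lengths(block):
    """Return (token_length, character_length) for one CoNLL-U sentence block."""
    text = ""
    upos = []
    for line in block.splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # Assumption: count only plain word lines (integer ID), skipping
            # multiword-token ranges such as "1-2" and empty nodes such as "1.1".
            if cols[0].isdigit():
                upos.append(cols[3])
    return len(upos), len(text)


token_len, char_len = sentence_lengths(SAMPLE)
print(token_len, char_len)
```

Note that the final period counts as a token (UPOS `PUNCT`), which matches the remark above that punctuation is included.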




## Observations

The histograms show a skewed distribution, which could take the form of a skewed Gaussian, a generalized Gaussian or a beta distribution. For this reason, I test several distribution fits with the Kolmogorov-Smirnov test.
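
A minimal sketch of such a comparison, assuming SciPy is available. The function `best_fit_by_ks`, the `CANDIDATES` dict and the synthetic gamma-distributed sample are illustrative assumptions, not the code in `preprocessors.ud_conllu_stats`.

```python
import numpy as np
from scipy import stats

# Candidate families named in the text; this dict is an assumption.
CANDIDATES = {
    "norm": stats.norm,
    "skewnorm": stats.skewnorm,
    "gennorm": stats.gennorm,
    "beta": stats.beta,
}


def best_fit_by_ks(lengths, candidates=CANDIDATES):
    """Return (name, p_value, params) for the candidate with the highest KS p-value."""
    best = None
    for name, dist in candidates.items():
        try:
            params = dist.fit(lengths)  # maximum-likelihood fit
            _, p_value = stats.kstest(lengths, dist.cdf, args=params)
        except Exception:
            continue  # a failed fit simply drops that candidate
        if not np.isfinite(p_value):
            continue
        if best is None or p_value > best[1]:
            best = (name, float(p_value), tuple(float(p) for p in params))
    return best


# Synthetic right-skewed "sentence lengths", for illustration only.
rng = np.random.default_rng(0)
sample_lengths = rng.gamma(shape=4.0, scale=6.0, size=500)

name, p_value, params = best_fit_by_ks(sample_lengths)
print(name, p_value)
```

Because the parameters are estimated from the same data that is then tested, the resulting p-values are optimistic. They are better read as a way of ranking candidates than as a strict goodness-of-fit test.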

Many languages do not have enough samples, so the distribution fit will be poor and the errors large.
This is not an issue from the code point of view. The important thing, if this data is used, is to take into account the number of samples available.


While doing this work I found it quite interesting that there are languages whose token or character counts avoid certain bins in the histogram (Bulgarian, Breton, Welsh, Danish, Slovak, Tamil and Thai are a few examples). This can mean either that the language structure supports only those lengths, or that the analyzed dataset only contains samples that avoid some sentence lengths.
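
One quick way to check which lengths never occur for a language is to look at the raw integer lengths directly. This is only a sketch: `missing_lengths` is a hypothetical helper and the sample list is made up.

```python
from collections import Counter


def missing_lengths(lengths):
    """Return the integer lengths between min and max that never occur."""
    counts = Counter(lengths)
    lo, hi = min(counts), max(counts)
    return [n for n in range(lo, hi + 1) if n not in counts]


# Made-up token lengths for illustration.
example = [3, 4, 4, 6, 9, 9, 9]
print(missing_lengths(example))
```

Working on the raw integer lengths sidesteps histogram bin edges, which can leave empty bins by themselves when there are more bins than distinct integer values.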

For some languages the number of samples is too small to make any good assumption from the data.


## Conclusion

This work presents a sample length analysis by language on the Universal Dependencies v2.6 dataset, with statistics for all 92 represented languages. The analysis then shows the length histograms by character and token length.

The best compromise when choosing a sequence length for training an NLP architecture will depend mostly on the requirements of the application. Nevertheless, with the numbers here you should be able to make an informed guess about what might work better for your case.
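
As a sketch of how such a guess could be made from raw lengths, the following picks the smallest length that covers a chosen fraction of the samples. The helper `length_for_coverage`, the 95% default and the sample values are assumptions for illustration.

```python
import math


def length_for_coverage(lengths, coverage=0.95):
    """Return the smallest observed length L such that at least
    `coverage` of the samples have length <= L."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(lengths)
    k = math.ceil(coverage * len(ordered))  # number of samples that must fit
    return ordered[k - 1]


# Made-up token lengths for illustration.
token_lengths = [5, 8, 12, 13, 14, 21, 22, 25, 29, 63]
print(length_for_coverage(token_lengths, 0.95))
print(length_for_coverage(token_lengths, 0.5))
```

A single long outlier can dominate the result at high coverage. Comparing a few coverage levels shows how much sequence length is spent on the tail.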

We can see that a multilingual approach will necessarily make the needed sequences longer, as there is a large variability in sequence length, while targeting a single language might allow you to optimize your neural architectures.

## Future Work

I am currently working on a more in-depth analysis of the complete Project Gutenberg dataset (~60K books in several languages) that will discriminate several other text characteristics.

I also have started to work on a complete parsing of a few of the Wiktionary datasets.

Stay tuned for those results ;)

In [119]:
from preprocessors.ud_conllu_stats import *
import json

In [2]:
import matplotlib.pyplot as plt
import bokeh
from bokeh.plotting import figure, output_file, show

%matplotlib inline

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [3]:
%%time
res = conllu_process_get_2list(blacklist=blacklist)

CPU times: user 11.1 s, sys: 2.31 s, total: 13.4 s
Wall time: 1min 50s


In [4]:
%%time
upos_data, deprel_data, sentences_data, forms_data = extract_data_from_fields(res)

CPU times: user 964 ms, sys: 52.7 ms, total: 1.02 s
Wall time: 1.01 s


In [5]:
%%time
# langs = ['es', 'fr', 'de', 'en']
# langs_data = compute_distributions(upos_data, deprel_data, sentences_data, langs)
langs_data = compute_distributions(upos_data, deprel_data, sentences_data)

  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
  improvement from the last ten iterations.
  return log(self._pdf(x, *args))


Error processing lang qhe with Exception 'NoneType' object has no attribute 'name'
CPU times: user 20min 1s, sys: 2.8 s, total: 20min 4s
Wall time: 20min 36s


In [59]:

def _get_stats(distrib, distrib_params, data):
    ret = { 
        'mvsk': distrib.stats(*distrib_params),  # mean, variance, skew, kurtosis
        'median': distrib.median(*distrib_params), 
        'std': distrib.std(*distrib_params),
        'cdf': distrib.cdf(data, *distrib_params), 
        'pdf': distrib.pdf(data, *distrib_params), 
    }
    return ret

def _get_lang_stats(lang_data, distributions=DISTRIBUTIONS):
#     'lang': dest_lang,
#     'upos_len': lng_upos_len,
#     'upos_distrib': get_best_distribution(lng_upos_len),
#     'deprel_len': lng_deprel_len,
#     'deprel_distrib': get_best_distribution(lng_deprel_len),
#     'text_len': lng_text_len,
#     'text_distrib': get_best_distribution(lng_text_len),
    upos_distrib = distributions[lang_data['upos_distrib'][0]]
    upos_distrib_params = lang_data['upos_distrib'][2]
#     print('upos', upos_distrib, upos_distrib_params)
    upos_data = lang_data['upos_len']
    upos_stats = _get_stats(upos_distrib, upos_distrib_params, upos_data)
    # 
    deprel_distrib = distributions[lang_data['deprel_distrib'][0]]
    deprel_distrib_params = lang_data['deprel_distrib'][2]
#     print('deprel', deprel_distrib, deprel_distrib_params)
    deprel_data = lang_data['deprel_len']
    deprel_stats = _get_stats(deprel_distrib, deprel_distrib_params, deprel_data)
    #
    text_distrib = distributions[lang_data['text_distrib'][0]]
    text_distrib_params = lang_data['text_distrib'][2]
#     print('text', text_distrib, text_distrib_params)
    text_data = lang_data['text_len']
    text_stats = _get_stats(text_distrib, text_distrib_params, text_data)
    
    lang_data['upos_stats'] = upos_stats
    lang_data['deprel_stats'] = deprel_stats
    lang_data['text_stats'] = text_stats
    
    return lang_data
    #best_dist, best_p, params[best_dist]

In [8]:
for lang in langs_data.keys():
    print(langs_data[lang]['upos_distrib'][0])

skewnorm
beta
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
gennorm
beta
beta
beta
beta
skewnorm
skewnorm
beta
beta
beta
beta
beta
beta
beta
beta
beta
gennorm
beta
beta
beta
beta
beta
beta
beta
norm
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
skewnorm
beta
beta
skewnorm
beta
skewnorm
beta
beta
beta
beta
beta
beta
gennorm
skewnorm
beta
beta
beta
skewnorm


In [23]:
langs_data['af']

{'lang': 'Afrikaans',
 'upos_len': 273202     13
 273203     25
 273204     14
 273205     12
 273206     21
            ..
 1196310    29
 1196311    34
 1196312    40
 1196313    22
 1196314    63
 Name: upos_len, Length: 1900, dtype: int64,
 'upos_distrib': ('skewnorm',
  0.0003660293737984097,
  (7.707064774227691, 8.816775190213058, 21.62498642522687)),
 'deprel_len': 273399      8
 273400     15
 273401     21
 273402     22
 273403     32
            ..
 1196603    44
 1196604    29
 1196605    45
 1196606    11
 1196607    29
 Name: deprel_len, Length: 1912, dtype: int64,
 'deprel_distrib': ('beta',
  0.05071891013122111,
  (3.0120100810462294,
   2853314.865394746,
   2.463986896784445,
   21996622.30480557)),
 'text_len': 271404     141
 271405      75
 271406      99
 271407      80
 271408      78
           ... 
 1129857    189
 1129858    245
 1129859    112
 1129860     78
 1129861    115
 Name: text_len, Length: 1899, dtype: int64,
 'text_distrib': ('beta',
  0.30651034

In [60]:
%%time
all_stats = {}

for lang, lang_data in langs_data.items():
    print('processing {}'.format(lang) )
    all_stats[lang] = _get_lang_stats(lang_data)

processing af
processing aii
processing akk
processing am
processing ar
processing be
processing bg
processing bho
processing bm
processing br
processing bxr
processing ca
processing cop
processing cs
processing cu
processing cy
processing da
processing de
processing el
processing en
processing es
processing et
processing eu
processing fa
processing fi
processing fo
processing fr
processing fro
processing ga
processing gd
processing gl
processing got
processing grc
processing gsw
processing gun
processing he
processing hi
processing hr
processing hsb
processing hu
processing hy
processing id
processing is
processing it
processing ja
processing kk
processing kmr
processing ko
processing koi
processing kpv
processing krl
processing la
processing lt
processing lv
processing lzh
processing mdf
processing mr
processing mt
processing myv
processing nl
processing no
processing olo
processing orv
processing pcm
processing pl
processing pt
processing ro
processing ru
processing sa
processing sk

In [61]:
all_stats

{'af': {'lang': 'Afrikaans',
  'upos_len': 273202     13
  273203     25
  273204     14
  273205     12
  273206     21
             ..
  1196310    29
  1196311    34
  1196312    40
  1196313    22
  1196314    63
  Name: upos_len, Length: 1900, dtype: int64,
  'upos_distrib': ('skewnorm',
   0.0003660293737984097,
   (7.707064774227691, 8.816775190213058, 21.62498642522687)),
  'deprel_len': 273399      8
  273400     15
  273401     21
  273402     22
  273403     32
             ..
  1196603    44
  1196604    29
  1196605    45
  1196606    11
  1196607    29
  Name: deprel_len, Length: 1912, dtype: int64,
  'deprel_distrib': ('beta',
   0.05071891013122111,
   (3.0120100810462294,
    2853314.865394746,
    2.463986896784445,
    21996622.30480557)),
  'text_len': 271404     141
  271405      75
  271406      99
  271407      80
  271408      78
            ... 
  1129857    189
  1129858    245
  1129859    112
  1129860     78
  1129861    115
  Name: text_len, Length: 1899, 

In [120]:
import json

import numpy as np
import pandas as pd

# This solution is modified from:
# https://stackoverflow.com/questions/26646362/numpy-array-is-not-json-serializable
# https://github.com/mpld3/mpld3/issues/434

class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        # np.integer / np.floating cover every sized numpy scalar type
        # (np.float_ was removed in NumPy 2.0)
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):  #### This is the fix
            return obj.tolist()
        elif isinstance(obj, pd.Series):
            return obj.to_list()  # must be returned, not just assigned
        return json.JSONEncoder.default(self, obj)
    
# but there is yet something missing that I kind of fix with the _recursive_jsonify which still misses some things

# TODO cleanup this mess

In [124]:
def _recursive_jsonify(dict_data):
    """Return a copy of dict_data with numpy/pandas values turned into plain Python types."""
    new_dict = {}
    for k, v in dict_data.items():
        k = str(k)  # always convert,
        if isinstance(v, tuple):
            ov = []
            for t in v:
                if isinstance(t, str):
                    ov.append(t)
                elif isinstance(t, tuple):
                    ov.append([float(i) for i in t])
                else:
                    ov.append(float(t))
            v = ov  # keep the converted list (it was built but never used before)
        elif isinstance(v, pd.Series):
            v = v.to_list()
        elif isinstance(v, np.ndarray):
            v = v.tolist()
        elif isinstance(v, np.generic):
            v = v.item()  # numpy scalar -> Python scalar
        if isinstance(v, dict):
            new_dict[k] = _recursive_jsonify(v)
        else:
            new_dict[k] = v
    return new_dict


In [115]:
import copy

In [122]:
all_stats_copy = copy.deepcopy(all_stats)

In [125]:
# it seems that there is yet another thing I missed in the 
all_stats_copy = _recursive_jsonify(all_stats_copy)

In [126]:
dumped = json.dumps(all_stats_copy, cls=NumpyEncoder)
# jsn = json.dumps(all_stats_copy)

In [127]:
with open('conllu_stats.json', 'w') as f:
    f.write(dumped)
    f.flush()

In [130]:
ls -alh |grep stats

-rw-rw-r--  1 leo leo 166M juin   4 22:59 conllu_stats.json


Well the stats file is quite big now, at 166MB.

This is due to the full arrays (data, CDF, PDF) being stored for every language, so an intermediate step would be to process this, plot the right graphs and then only save the graphs and the

In [132]:
import gzip

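# NOTE: gzip.open writes a gzip stream, so '.json.gz' would be the accurate
# extension; the '.zip' name is kept here to match the output below.
saveto = 'conllu_stats.json.zip'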

with gzip.open(saveto, 'wb') as f:
    print("Saving to {}".format(saveto))
    f.write(dumped.encode('utf-8'))
    f.flush()


Saving to conllu_stats.json.zip


In [133]:
ls -alh |grep stats

-rw-rw-r--  1 leo leo 166M juin   4 22:59 conllu_stats.json
-rw-rw-r--  1 leo leo  18M juin   4 23:05 conllu_stats.json.zip


This should have taken care of much of the size issue, but for a website that will still be too much :p
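
One possible way to shrink the file further is to drop the per-sample arrays before dumping. This sketch assumes that a website only needs the fitted distribution and the scalar statistics, since the PDF and CDF can be recomputed from the fit parameters. `strip_heavy` and `HEAVY_KEYS` are hypothetical names, and the toy dictionary only mimics the structure shown above.

```python
import json

# Keys holding one value per sample; taken from the structure printed above.
HEAVY_KEYS = {"upos_len", "deprel_len", "text_len", "cdf", "pdf"}


def strip_heavy(node):
    """Return a copy of a nested dict without the per-sample arrays."""
    if isinstance(node, dict):
        return {k: strip_heavy(v) for k, v in node.items() if k not in HEAVY_KEYS}
    return node


# Toy stand-in for the already jsonified stats of one language.
toy_stats = {
    "af": {
        "lang": "Afrikaans",
        "upos_len": [13, 25, 14],
        "upos_distrib": ["skewnorm", 0.0003, [7.7, 8.8, 21.6]],
        "upos_stats": {"median": 21.0, "cdf": [0.1, 0.6, 0.2], "pdf": [0.02, 0.03, 0.02]},
    }
}

slim = strip_heavy(toy_stats)
print(json.dumps(slim))
```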