# A Statistical Exploration of the Universal Dependencies CoNLL-U Files

## Introduction

While much current work targets NLP and NLU, little of it explains why a particular sequence length was chosen for a transformer (or the number of time steps for another architecture such as an LSTM). The choice is mostly arbitrary and depends on the goal of the work and the available resources, mainly hardware. Such decisions are hard to revisit once the model has been trained: the length of a transformer, for example, cannot be extended without retraining the entire network. There are, however, some works that tackle variable-length sequences.

This work presents a first complete analysis of the Universal Dependencies v2.6 dataset, with global results and individual results for each language present in the dataset.

This work is not intended to be a conference-level paper (which is why it does not reference the papers on each subject), but an informational technical report that might help you select the most effective compromise on text or token length for your particular NLP application.

The number of analyzed languages is 92. Token length is measured as the number of UPOS tags in each sample, while character length is simply the number of characters. There is no analysis of what constitutes a word, which means that the token count includes punctuation and other symbols present in the text samples. For linguistic analysis purposes more de
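To make the two length measures concrete, here is a minimal sketch that counts tokens and characters per sentence directly from CoNLL-U text. It is not the code used for this report (that lives in `preprocessors.ud_conllu_stats`, which is not shown here); the function name `sentence_lengths` and the rule "count one token per row that carries a UPOS tag" are assumptions made for illustration.

```python
def sentence_lengths(conllu_text):
    """Yield (n_tokens, n_chars) for every sentence in a CoNLL-U string."""
    n_tokens, text = 0, None
    # The extra "" guarantees that the last sentence is flushed.
    for line in conllu_text.splitlines() + [""]:
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Assumption: one token per row with a UPOS tag (4th column).
            # Multiword-token ranges such as "1-2" carry "_" in that column.
            if len(fields) == 10 and fields[3] != "_":
                n_tokens += 1
        elif not line and (n_tokens or text is not None):
            yield n_tokens, len(text or "")
            n_tokens, text = 0, None


SAMPLE = (
    "# sent_id = 1\n"
    "# text = Hello, world!\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No\n"
    "4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)

lengths = list(sentence_lengths(SAMPLE))
print(lengths)
```

Note that the punctuation rows count as tokens here, in line with the choice described above.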




## Observations

The histograms show a skewed distribution, which could take the form of a skewed Gaussian, a generalized Gaussian or a beta distribution. For this reason I test different distribution fits with the Kolmogorov-Smirnov test.
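As an illustration of this selection step, the sketch below fits a few candidate distributions with SciPy and keeps the one with the highest Kolmogorov-Smirnov p-value. It is an assumption about how such a selection can be done, not the report's own `get_best_distribution`; the candidate set is taken from the distribution names that appear in the outputs further down, and the sample data is synthetic.

```python
import warnings

import numpy as np
from scipy import stats

# Candidate names as they appear in the notebook outputs below.
CANDIDATES = {
    "norm": stats.norm,
    "skewnorm": stats.skewnorm,
    "gennorm": stats.gennorm,
    "beta": stats.beta,
}


def best_fit(samples, candidates=CANDIDATES):
    """Return (name, p_value, params) of the candidate with the highest KS p-value."""
    best = None
    for name, dist in candidates.items():
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")   # fits on skewed data can be noisy
                params = dist.fit(samples)        # maximum-likelihood estimate
        except Exception:
            continue                              # skip candidates whose fit fails
        _, p_value = stats.kstest(samples, dist.cdf, args=params)
        if best is None or p_value > best[1]:
            best = (name, p_value, params)
    return best


# Synthetic, right-skewed "sentence lengths" used only for this example.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=3.0, scale=8.0, size=300)

name, p_value, params = best_fit(samples)
print(name, p_value)
```

Because the parameters are estimated from the same data that is then tested, the p-values are optimistic; they are more useful for ranking candidates against each other than as a formal goodness-of-fit verdict.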

Many languages do not have enough samples, so the distribution fit will be poor and the errors large.
This is not an issue from the code's point of view, but if you use this data, take the number of available samples into account.


While doing this work I found it quite interesting that there are languages whose token or character counts avoid certain bins in the histogram (Bulgarian, Breton, Welsh, Danish, Slovak, Tamil and Thai are a few examples). This can mean either that the language structure supports only those lengths, or that the analyzed dataset only contains samples that avoid some sentence lengths.
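One way to check such gaps without relying on the plot is to list the integer lengths that never occur between the shortest and the longest sample. The helper name `missing_lengths` and the toy data are assumptions for illustration only.

```python
from collections import Counter


def missing_lengths(lengths):
    """Return the integer lengths between min and max that never occur."""
    counts = Counter(lengths)
    return [n for n in range(min(lengths), max(lengths) + 1) if counts[n] == 0]


# Toy data, not taken from the dataset.
toy_lengths = [3, 4, 4, 6, 6, 6, 9, 10]
gaps = missing_lengths(toy_lengths)
print(gaps)
```

If this list is empty for a language whose histogram still shows holes, the holes may come from the binning itself: lengths are integers, so bins narrower than one unit can stay empty regardless of the data.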

For some languages the number of samples is too small to draw any reliable conclusion from the data.


## Conclusion

This work presents a sample length analysis by language of the Universal Dependencies v2.6 dataset, with statistics for all 92 represented languages. The analysis then shows the length histograms by character and token length.

The best compromise when choosing a training sequence length for an NLP architecture will depend mostly on the requirements of the application. Nevertheless, with the numbers here you should be able to make an informed guess about what might work better for your case.
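A simple way to turn the length data into such a guess is to pick the smallest length that covers a chosen fraction of the samples. The sketch below works on raw lengths rather than on the fitted distributions; the function name `length_for_coverage` and the toy numbers are assumptions.

```python
import math


def length_for_coverage(lengths, coverage=0.95):
    """Smallest length L such that at least `coverage` of the samples have length <= L."""
    ordered = sorted(lengths)
    k = math.ceil(coverage * len(ordered))   # number of samples that must fit
    return ordered[max(k, 1) - 1]


# Toy token counts, not taken from the dataset.
toy_lengths = [5, 7, 8, 8, 9, 12, 15, 21, 30, 64]

half = length_for_coverage(toy_lengths, coverage=0.5)
full = length_for_coverage(toy_lengths, coverage=1.0)
print(half, full)
```

The gap between the two results shows the trade-off: a handful of very long samples can multiply the required sequence length, so accepting some truncation is often much cheaper than covering everything.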

We can see that a multilingual approach will necessarily require longer sequences, as there is large variability in sequence length across languages, whereas targeting a single language might allow you to optimize your neural architecture.

## Future Work

I am currently working on a more in-depth analysis of the complete Project Gutenberg dataset (~60K books in several languages) that will discriminate several other text characteristics.

I have also started working on a complete parsing of a few of the Wiktionary datasets.

Stay tuned for those results ;)

In [172]:
from preprocessors.ud_conllu_stats import *
import json
import gzip

In [2]:
import matplotlib.pyplot as plt
import bokeh
from bokeh.plotting import figure, output_file, show

%matplotlib inline

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [3]:
%%time
res = conllu_process_get_2list(blacklist=blacklist)

CPU times: user 11.1 s, sys: 2.31 s, total: 13.4 s
Wall time: 1min 50s


In [4]:
%%time
upos_data, deprel_data, sentences_data, forms_data = extract_data_from_fields(res)

CPU times: user 964 ms, sys: 52.7 ms, total: 1.02 s
Wall time: 1.01 s


In [5]:
%%time
# langs = ['es', 'fr', 'de', 'en']
# langs_data = compute_distributions(upos_data, deprel_data, sentences_data, langs)
langs_data = compute_distributions(upos_data, deprel_data, sentences_data)

  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
  improvement from the last ten iterations.
  return log(self._pdf(x, *args))


Error processing lang qhe with Exception 'NoneType' object has no attribute 'name'
CPU times: user 20min 1s, sys: 2.8 s, total: 20min 4s
Wall time: 20min 36s


In [191]:
def _get_stats(distrib, distrib_params, data):
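    # scipy's stats() defaults to moments='mv', so only mean and variance are
    # returned here; skew and kurtosis stay None unless moments='mvsk' is passed.
    mskv = [None, None, None, None]
    t_mskv = distrib.stats(*distrib_params)
    for i in range(len(t_mskv)):
        mskv[i] = t_mskv[i]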
    ret_stats = { 
        'mean': mskv[0],  # mean, variance, skew, kurtosis -> variable length
        'variance': mskv[1],
        'skew': mskv[2],
        'kurtosis': mskv[3],
        'median': distrib.median(*distrib_params), 
        'std': distrib.std(*distrib_params),
        'intervals': {'99': distrib.interval(0.99,*distrib_params),
                      '98': distrib.interval(0.98,*distrib_params),
                      '95': distrib.interval(0.95,*distrib_params),
                      '90': distrib.interval(0.90,*distrib_params),
                      '85': distrib.interval(0.85,*distrib_params),
                      '80': distrib.interval(0.8,*distrib_params),
                     }
    }
    ret_foo = {'cdf': distrib.cdf(data, *distrib_params), 
               'pdf': distrib.pdf(data, *distrib_params)
              }
    return ret_stats, ret_foo
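The number of values returned by `stats()` depends on its `moments` argument, which explains the variable-length handling above. A minimal sketch with hypothetical shape parameters (not fitted to any language) shows the difference between the default call and an explicit request for all four moments:

```python
from scipy import stats

# Hypothetical shape parameters, chosen only for illustration.
a, b = 2.0, 5.0

default_moments = stats.beta.stats(a, b)               # moments='mv' by default
all_moments = stats.beta.stats(a, b, moments='mvsk')   # mean, variance, skew, kurtosis

mean, variance, skew, kurtosis = (float(m) for m in all_moments)
print(len(default_moments), len(all_moments))
print(mean, variance, skew, kurtosis)
```

The kurtosis reported by SciPy here is the excess (Fisher) kurtosis, so a normal distribution gives 0 rather than 3.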

In [192]:
def _get_lang_stats(lang_data, distributions=DISTRIBUTIONS):
    """Add fitted-distribution statistics and CDF/PDF values to one language entry.

    Expected keys in `lang_data`:
        'lang': dest_lang,
        'upos_len': lng_upos_len,
        'upos_distrib': get_best_distribution(lng_upos_len),
        'deprel_len': lng_deprel_len,
        'deprel_distrib': get_best_distribution(lng_deprel_len),
        'text_len': lng_text_len,
        'text_distrib': get_best_distribution(lng_text_len),
    Each '*_distrib' entry is (best_dist, best_p, params[best_dist]).
    """
    fields = ('upos', 'deprel', 'text')
    results = {}
    for field in fields:
        distrib = distributions[lang_data[field + '_distrib'][0]]
        distrib_params = lang_data[field + '_distrib'][2]
        results[field] = _get_stats(distrib, distrib_params, lang_data[field + '_len'])

    # Store the stats first and the CDF/PDF arrays afterwards (same key order as before).
    for field in fields:
        lang_data[field + '_stats'] = results[field][0]
    for field in fields:
        lang_data[field + '_functions'] = results[field][1]

    return lang_data

In [8]:
for lang in langs_data.keys():
    print(langs_data[lang]['upos_distrib'][0])

skewnorm
beta
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
skewnorm
beta
beta
beta
beta
gennorm
beta
beta
beta
beta
skewnorm
skewnorm
beta
beta
beta
beta
beta
beta
beta
beta
beta
gennorm
beta
beta
beta
beta
beta
beta
beta
norm
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
beta
skewnorm
beta
beta
skewnorm
beta
skewnorm
beta
beta
beta
beta
beta
beta
gennorm
skewnorm
beta
beta
beta
skewnorm


In [184]:
langs_data['fr']

{'lang': 'French',
 'upos_len': 46483      37
 46484      15
 46485      22
 46486       3
 46487      12
            ..
 1287652    27
 1287653    28
 1287654     8
 1287655    55
 1287656    25
 Name: upos_len, Length: 44744, dtype: int64,
 'upos_distrib': ('beta',
  1.0539453859211309e-12,
  (2.328372827493223,
   888069.8279996342,
   0.24182457861422196,
   10057137.131721482)),
 'deprel_len': 46505      10
 46506       9
 46507      19
 46508      14
 46509      13
            ..
 1287847    25
 1287848    55
 1287849    57
 1287850    46
 1287851    10
 Name: deprel_len, Length: 44752, dtype: int64,
 'deprel_distrib': ('beta',
  3.27876194922621e-12,
  (2.355821104156877,
   1340698.694634203,
   0.18655412650324638,
   15017052.551892862)),
 'text_len': 44870       60
 44871       78
 44872      184
 44873      274
 44874       38
           ... 
 1211677     66
 1211678     72
 1211679     36
 1211680     53
 1211681     80
 Name: text_len, Length: 43830, dtype: int64,
 'text_

In [193]:
%%time
all_stats = {}

for lang, lang_data in langs_data.items():
    print('processing {}'.format(lang) )
    all_stats[lang] = _get_lang_stats(lang_data)

processing af
processing aii
processing akk
processing am
processing ar
processing be
processing bg
processing bho
processing bm
processing br
processing bxr
processing ca
processing cop
processing cs
processing cu
processing cy
processing da
processing de
processing el
processing en
processing es
processing et
processing eu
processing fa
processing fi
processing fo
processing fr
processing fro
processing ga
processing gd
processing gl
processing got
processing grc
processing gsw
processing gun
processing he
processing hi
processing hr
processing hsb
processing hu
processing hy
processing id
processing is
processing it
processing ja
processing kk
processing kmr
processing ko
processing koi
processing kpv
processing krl
processing la
processing lt
processing lv
processing lzh
processing mdf
processing mr
processing mt
processing myv
processing nl
processing no
processing olo
processing orv
processing pcm
processing pl
processing pt
processing ro
processing ru
processing sa
processing sk

In [194]:
all_stats['fr']

{'lang': 'French',
 'upos_len': 46483      37
 46484      15
 46485      22
 46486       3
 46487      12
            ..
 1287652    27
 1287653    28
 1287654     8
 1287655    55
 1287656    25
 Name: upos_len, Length: 44744, dtype: int64,
 'upos_distrib': ('beta',
  1.0539453859211309e-12,
  (2.328372827493223,
   888069.8279996342,
   0.24182457861422196,
   10057137.131721482)),
 'deprel_len': 46505      10
 46506       9
 46507      19
 46508      14
 46509      13
            ..
 1287847    25
 1287848    55
 1287849    57
 1287850    46
 1287851    10
 Name: deprel_len, Length: 44752, dtype: int64,
 'deprel_distrib': ('beta',
  3.27876194922621e-12,
  (2.355821104156877,
   1340698.694634203,
   0.18655412650324638,
   15017052.551892862)),
 'text_len': 44870       60
 44871       78
 44872      184
 44873      274
 44874       38
           ... 
 1211677     66
 1211678     72
 1211679     36
 1211680     53
 1211681     80
 Name: text_len, Length: 43830, dtype: int64,
 'text_

In [195]:
import json

import numpy as np
import pandas as pd

# This solution is modified from:
# https://stackoverflow.com/questions/26646362/numpy-array-is-not-json-serializable
# https://github.com/mpld3/mpld3/issues/434

class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        # default() is only called for objects json cannot encode natively,
        # so tuples never reach it (json already writes them as lists).
        if isinstance(obj, set):
            return list(obj)
        elif isinstance(obj, np.integer):   # any numpy signed or unsigned integer
            return int(obj)
        elif isinstance(obj, np.floating):  # any numpy float (np.float_ is gone in NumPy 2.0)
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        elif isinstance(obj, pd.Series):
            # Return the list; falling through to the base class would raise TypeError.
            return obj.to_list()
        return json.JSONEncoder.default(self, obj)

# but there is yet something missing that I kind of fix with the _recursive_jsonify which still misses some things

# TODO cleanup this mess

In [196]:
def _recursive_jsonify(dict_data):
    new_dict = {}
    for k, v in dict_data.items():
        k = str(k)  # always convert,
        if isinstance(v, (tuple, set)):
            ov = []
            for t in v:
                if isinstance(t, str):
                    ov.append(t)
                elif isinstance(t, tuple):
                    ov.append([float(i) for i in t])
                else:
                    ov.append(float(t))
            v = ov
        if isinstance(v, pd.Series):
            v = v.to_list()
        if isinstance(v, np.ndarray):
            v = v.tolist()
        if np.issubdtype(type(v), np.number):
            v = float(v)
        if isinstance(v, dict):
            new_dict[k] = _recursive_jsonify(v)
        else:
            new_dict[k] = v
    return new_dict


In [197]:
import copy

In [198]:
all_stats_copy = copy.deepcopy(all_stats)

In [199]:
# it seems that there is yet another thing I missed in the 
all_stats_copy = _recursive_jsonify(all_stats_copy)

In [200]:
jsn = json.dumps(all_stats_copy)

In [201]:
dumped = json.dumps(all_stats_copy, cls=NumpyEncoder)
# jsn = json.dumps(all_stats_copy)

In [202]:
with open('conllu_stats.json', 'w') as f:
    f.write(dumped)
    f.flush()

In [203]:
ls -alh |grep stats

-rw-rw-r--  1 leo leo 166M juin   5 15:53 conllu_stats.json
-rw-rw-r--  1 leo leo  18M juin   5 15:26 conllu_stats.json.zip


In [204]:
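# NOTE: gzip.open writes gzip data, so despite the '.zip' extension this file
# must be read back with gzip (a '.json.gz' name would describe it better).
saveto = 'conllu_stats.json.zip'

with gzip.open(saveto, 'wb') as f:
    print("Saving to {}".format(saveto))
    f.write(dumped.encode('utf-8'))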


Saving to conllu_stats.json.zip
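For reference, a gzip-compressed JSON file can be written and read back with the standard library alone. The sketch below uses a temporary path and a tiny hypothetical payload instead of the real statistics, and also checks the file signature to show that the result is a gzip stream and not a zip archive.

```python
import gzip
import json
import os
import tempfile

# Tiny hypothetical payload standing in for the real statistics.
payload = {"fr": {"upos_stats": {"mean": 25.0, "intervals": {"95": [7.8, 57.3]}}}}

path = os.path.join(tempfile.mkdtemp(), "stats.json.gz")

# Text mode ("wt"/"rt") lets json write and read str directly.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = json.load(f)

with open(path, "rb") as f:
    magic = f.read(2)   # gzip streams start with 1f 8b, zip archives with b"PK"

print(restored == payload, magic)
```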


In [205]:
ls -alh |grep stats

-rw-rw-r--  1 leo leo 166M juin   5 15:53 conllu_stats.json
-rw-rw-r--  1 leo leo  18M juin   5 15:53 conllu_stats.json.zip


Well, the stats file is quite big now, at 166 MB.

This is because all of the per-sample arrays (data, CDF, PDF) are stored in it. An intermediate step would be to process these, plot the right graphs, save only the graphs, and cut down the number of elements in the output.

This should take care of much of the size issue, but for a website it will still be too much.

There are also other approaches; the idea is to think about what the user would be looking for when reading the reports:
Reading by language: each language can have its own file, which gives 166 MB / 92 ≈ 1.8 MB, i.e. under 2 MB per file on average.



Also have a file with only the table of the stats, with no graphs, which allows an easy comparison.
The statistics should be computed and displayed for upos, deprel and text.
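The per-language split could look roughly like the sketch below, which writes one JSON file per language and leaves out the bulky per-sample arrays. The function name `split_by_language`, the dropped key suffixes and the toy data are assumptions; the real entries would first need the JSON conversion done above.

```python
import json
import os
import tempfile


def split_by_language(all_stats, out_dir, drop_suffixes=("_functions", "_len")):
    """Write one compact JSON file per language, omitting per-sample arrays."""
    paths = {}
    for lang, lang_data in all_stats.items():
        # Keep only the small entries (names, fitted parameters, statistics).
        slim = {k: v for k, v in lang_data.items() if not k.endswith(drop_suffixes)}
        paths[lang] = os.path.join(out_dir, "{}.json".format(lang))
        with open(paths[lang], "w", encoding="utf-8") as f:
            json.dump(slim, f)
    return paths


# Toy entries with made-up values, shaped like the dictionaries above.
toy_stats = {
    "fr": {"lang": "French", "upos_len": [37, 15, 22],
           "upos_stats": {"mean": 25.0},
           "upos_functions": {"cdf": [0.8, 0.2, 0.5]}},
    "es": {"lang": "Spanish", "upos_len": [12, 30],
           "upos_stats": {"mean": 21.0},
           "upos_functions": {"cdf": [0.3, 0.7]}},
}

paths = split_by_language(toy_stats, tempfile.mkdtemp())
with open(paths["fr"], encoding="utf-8") as f:
    fr_slim = json.load(f)
print(sorted(paths), fr_slim)
```

Dropping the arrays removes most of the size, so the per-language files end up far smaller than the plain division of the total by the number of languages suggests.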

In [227]:
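import collections


def flatten(lang, d, sep="_"):
    """Flatten a nested dict/list into one ordered row keyed by joined paths."""
    obj = collections.OrderedDict()
    obj['lang_code'] = lang
    # `languages` is assumed to be provided by the star import at the top of
    # the notebook; 2-letter and 3-letter codes need different lookups.
    lang_name = languages.get(alpha_2=lang) if len(lang) == 2 else languages.get(alpha_3=lang)
    # Store the plain name string, falling back to the code if the lookup fails.
    obj['lang_name'] = lang_name.name if lang_name is not None else lang

    def recurse(t, parent_key=""):
        if isinstance(t, list):
            for i, item in enumerate(t):
                recurse(item, parent_key + sep + str(i) if parent_key else str(i))
        elif isinstance(t, dict):
            for k, v in t.items():
                recurse(v, parent_key + sep + k if parent_key else k)
        else:
            # Leaves, including tuples such as the intervals, are stored as-is.
            obj[parent_key] = t

    recurse(d)

    return obj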

In [228]:
titles = ['lang code', 'lang name', 'distribution', 'distrib_params', 'mean', 'variance', 'skew', 'kurtosis', 'median', 'std'
 , 'interval-99%', 'interval-98%', 'interval-95%', 'interval-90%', 'interval-85%', 'interval-80%']
            
def stats_dict2table(all_lang_stats):
    upos_stats = []
    deprel_stats = []
    text_stats = []
    for lang, lang_data in all_lang_stats.items():
        upos_row, deprel_row, text_row = stats_dict2rows(lang, lang_data)
        upos_stats.append(upos_row)
        deprel_stats.append(deprel_row)
        text_stats.append(text_row)
        
    upos_df = pd.DataFrame(upos_stats)
    deprel_df = pd.DataFrame(deprel_stats)
    text_df = pd.DataFrame(text_stats)
    
    return upos_df, deprel_df, text_df

def stats_dict2rows(lang, lang_data):
    """Return the flattened upos, deprel and text stats rows for one language."""
    upos_data = flatten(lang, lang_data['upos_stats'])
    deprel_data = flatten(lang, lang_data['deprel_stats'])
    text_data = flatten(lang, lang_data['text_stats'])
    return upos_data, deprel_data, text_data

In [232]:
%%time
upos_table, deprel_table, text_table = stats_dict2table(all_stats)

CPU times: user 21 ms, sys: 8.06 ms, total: 29.1 ms
Wall time: 28.3 ms


In [233]:
upos_table.columns

Index(['lang_code', 'lang_name', 'mean', 'variance', 'skew', 'kurtosis',
       'median', 'std', 'intervals_99', 'intervals_98', 'intervals_95',
       'intervals_90', 'intervals_85', 'intervals_80'],
      dtype='object')

In [234]:
upos_table

Unnamed: 0,lang_code,lang_name,mean,variance,skew,kurtosis,median,std,intervals_99,intervals_98,intervals_95,intervals_90,intervals_85,intervals_80
0,af,Afrikaans,25.927585960267578,174.8601926826326,,,23.402607,13.223471,"(5.280599100675655, 69.51884232580102)","(6.258407390253987, 64.5190489131594)","(7.799864509611233, 57.28707874813642)","(9.271453942046739, 51.200969749825205)","(10.336822972785601, 47.319292409903596)","(11.234934556648827, 44.3867125445238)"
1,aii,Assyrian Neo-Aramaic,7.95223896040547,21.34761490022813,,,6.619369,4.620348,"(3.015273767066714, 27.303123572333313)","(3.0552021095354553, 24.2539734751741)","(3.1634295006467537, 20.205803486512092)","(3.329923569732691, 17.128747805791164)","(3.4900776805231972, 15.314884874290584)","(3.648060803903675, 14.025646913746883)"
2,akk,Akkadian,18.232974129754655,187.56453881939268,,,14.632412,13.695420,"(2.1075303348856482, 73.55718077193345)","(2.32830241070646, 65.11653177698678)","(2.8546980739819627, 53.84269439084248)","(3.5756805985198374, 45.19368606407743)","(4.219109685968786, 40.067298719109836)","(4.824902173094636, 36.39087252092397)"
3,am,Amharic,11.84974287136193,21.53030807390967,,,11.265240,4.640076,"(3.288783881421643, 26.783777555278974)","(3.7540203919439934, 24.98589412760773)","(4.544613674793823, 22.44666179431439)","(5.333469849508624, 20.36623891562515)","(5.904487521937005, 19.065232718305346)","(6.375465607714158, 18.094651303954873)"
4,ar,Arabic,42.76667058120489,833.9226924542479,,,36.434776,28.877720,"(2.8959676960809393, 152.33346695611672)","(3.947309099855296, 136.5272781572294)","(6.075875480377076, 115.18866440410845)","(8.586130108863706, 98.58502135946628)","(10.619399161667658, 88.61786340070931)","(12.42165975420181, 81.39775985911106)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,wbp,Warlpiri,5.867332486184738,1.9702164565250406,,,5.623213,1.403644,"(3.46168232753963, 10.432757591084219)","(3.6045487998377173, 9.911324102280801)","(3.8250062391988218, 9.157094652612399)","(4.028459397864105, 8.522368217674941)","(4.17047263636114, 8.117544179815944)","(4.286297082999924, 7.811702473129676)"
87,wo,Wolof,21.620856362111084,149.3275802188701,,,19.271482,12.219966,"(3.1738953141561144, 65.70642754061119)","(3.7973214009203167, 59.712355699659206)","(4.987094628441902, 51.48498807400781)","(6.314453495557852, 44.96428204676087)","(7.350661507566832, 40.993947600712026)","(8.247850522398474, 38.08896127953269)"
88,yo,Yoruba,26.051000804473222,162.2299289568923,,,24.386838,12.736951,"(3.4961057204070083, 65.71811138754812)","(4.497889946486232, 61.375730537735606)","(6.31469202858458, 55.000823201226694)","(8.239387093341865, 49.5807371934747)","(9.68674864658017, 46.10901393417395)","(10.908304930763906, 43.48157763994962)"
89,yue,Yue Chinese,13.978903102269172,132.93264671541684,,,10.469358,11.529642,"(2.0394085897024166, 59.396217938428045)","(2.085096224417145, 53.308344113873176)","(2.236120034785811, 44.597427886174266)","(2.513704756866302, 37.47712537196305)","(2.8130258482169213, 33.09179901474099)","(3.1298801785203874, 29.880286534809095)"


In [224]:
# from bokeh.models.widgets import DataTable, DateFormatter, TableColumn
from bokeh.models import ColumnDataSource, DataTable, DateFormatter, TableColumn

df_tables = [upos_table, deprel_table, text_table]
bk_tables = []

for table in df_tables:
    Columns = [TableColumn(field=Ci, title=Ci) for Ci in table.columns] # bokeh columns
    data_table = DataTable(columns=Columns, source=ColumnDataSource(table)) # bokeh table
    bk_tables.append(data_table)


In [225]:
show(bk_tables[0])

TypeError: Object of type Language is not JSON serializable
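The traceback is cut off here, so it does not show which column triggers the error. One plausible cause, given that this cell's execution counter is lower than the one of the `flatten` cell, is that the table still held language record objects instead of their `.name` strings. A minimal sketch of a defensive conversion applied to each row before building the table follows; `FakeLanguage` is a hypothetical stand-in for such a record, and `plain_value` is an illustrative helper, not part of the notebook.

```python
import json


class FakeLanguage:
    """Hypothetical stand-in for a language record exposing a `.name` attribute."""
    def __init__(self, name):
        self.name = name


def plain_value(value):
    """Return a JSON-friendly stand-in for `value`."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, (list, tuple)):
        return [plain_value(v) for v in value]
    # Anything else: prefer a `.name` attribute, otherwise the string form.
    return getattr(value, "name", str(value))


# Made-up row shaped like the output of flatten().
row = {"lang_code": "fr", "lang_name": FakeLanguage("French"),
       "mean": 25.0, "intervals_95": (7.8, 57.3)}

clean_row = {key: plain_value(value) for key, value in row.items()}
encoded = json.dumps(clean_row)
print(encoded)
```

Rows cleaned this way contain only strings, numbers and lists, which any JSON-based serializer should accept.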