# A Statistical Exploration of the Universal Dependencies CoNLL-U Files

## Introduction

While much work is currently being done on NLP and NLU, little of it explains why a given sequence length was chosen for a transformer (or a given number of time steps for other architectures such as LSTMs). The choice is mostly arbitrary and depends on the goal of the work and on the available resources, mainly hardware. It is also hard to revisit: once a model has been trained, the length of a transformer (for example) cannot be extended without retraining the entire network. There are, however, some works that tackle variable-length sequences.

This work presents a first complete analysis of the Universal Dependencies v2.6 dataset, reporting both the global results and the individual results for each language in the dataset.

This work is not intended to be a conference-level paper (which is why it does not reference all the papers on each subject), but an informational technical report that might help you select the most effective compromise on text or token length for your particular NLP application.

The number of analyzed languages is 92. Token length is measured as the number of UPOS tags per sample in the dataset, while character length is simply the number of characters. No analysis is made of what constitutes a word, so a token includes the punctuation and other symbols present in the text samples. For linguistic analysis purposes more de
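
As a rough illustration of the two measures, the sketch below counts tokens and characters per sentence from CoNLL-U text. The sample sentences and the `sentence_lengths` helper are hypothetical and are not the code in `preprocessors.ud_conllu_stats`. In particular, skipping multiword-token ranges and empty nodes is an assumption, since this report does not say how they are handled.

```python
# Hypothetical two-sentence CoNLL-U sample (not taken from the dataset).
SAMPLE = """\
# sent_id = 1
# text = Hello, world!
1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No
2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_
3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No
4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_

# sent_id = 2
# text = It works.
1\tIt\tit\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tworks\twork\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

TEXT_PREFIX = "# text = "


def sentence_lengths(conllu_text):
    """Return one (token_count, char_count) pair per sentence."""
    lengths = []
    tokens, chars = 0, None
    # The extra "" guarantees that the last sentence is flushed.
    for line in conllu_text.splitlines() + [""]:
        if not line.strip():
            if tokens:
                lengths.append((tokens, chars))
            tokens, chars = 0, None
        elif line.startswith(TEXT_PREFIX):
            chars = len(line[len(TEXT_PREFIX):])
        elif not line.startswith("#"):
            token_id = line.split("\t")[0]
            # Assumption: skip multiword ranges ("1-2") and empty nodes ("1.1").
            if token_id.isdigit():
                tokens += 1  # punctuation rows count as tokens too
    return lengths


print(sentence_lengths(SAMPLE))  # → [(4, 13), (3, 9)]
```

Note that the punctuation rows are counted, which matches the statement above that tokens include punctuation and other symbols.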




## Observations

The histograms show a skewed distribution, which could take the form of a skewed Gaussian, a generalized Gaussian, or a beta distribution. For this reason, I test different distribution fits with the Kolmogorov-Smirnov test.
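
A minimal sketch of this kind of comparison with `scipy.stats` is shown below. It is not the code used to produce the results in this report: the data is synthetic, and fixing the support of the beta distribution to `(0, 1.01 * max)` is an assumption made only to keep that fit stable.

```python
import warnings

import numpy as np
from scipy import stats

# Synthetic right-skewed "lengths" standing in for real data (assumption).
rng = np.random.default_rng(0)
lengths = rng.gamma(shape=4.0, scale=6.0, size=2000)

# Candidate distributions with their fit keyword arguments.
candidates = {
    "skewnorm": (stats.skewnorm, {}),
    "gennorm": (stats.gennorm, {}),
    "beta": (stats.beta, {"floc": 0.0, "fscale": 1.01 * lengths.max()}),
}

results = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # the optimizers may emit RuntimeWarnings
    for name, (dist, fit_kwargs) in candidates.items():
        params = dist.fit(lengths, **fit_kwargs)  # maximum-likelihood fit
        ks_stat, p_value = stats.kstest(lengths, dist.cdf, args=params)
        results[name] = (ks_stat, p_value)

# A smaller KS statistic means the fitted CDF is closer to the empirical CDF.
best = min(results, key=lambda name: results[name][0])
for name, (ks_stat, p_value) in results.items():
    print(f"{name:9s} KS={ks_stat:.4f} p={p_value:.4f}")
print("closest fit:", best)
```

Because the parameters are estimated from the same sample that is then tested, the p-values are optimistic. The KS statistic is therefore more useful here for ranking the candidate fits than for formally accepting one of them.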

Many languages do not have enough samples, so the distribution fit will be poor and the errors large.
This is not an issue from the code's point of view, but if you use this data, take the number of available samples into account.


While doing this work I found it quite interesting that there are languages whose token or character counts avoid certain bins in the histogram (Bulgarian, Breton, Welsh, Danish, Slovak, Tamil and Thai are a few examples). This can mean either that the language structure only supports those lengths, or that the analyzed dataset only contains samples that avoid some sentence lengths.
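
One way to check such gaps directly on the raw counts, rather than reading them off a plot, is sketched below. The `missing_lengths` helper and the sample values are hypothetical.

```python
from collections import Counter


def missing_lengths(lengths):
    """Return the integer lengths between min and max that never occur."""
    counts = Counter(lengths)
    return [n for n in range(min(lengths), max(lengths) + 1) if counts[n] == 0]


# Hypothetical token counts for a small treebank (not real data).
sample = [3, 4, 4, 6, 6, 6, 7, 10, 10, 12]
print(missing_lengths(sample))  # → [5, 8, 9, 11]
```

Checking the raw integer counts also helps rule out a plotting effect: lengths are integers, so histogram bins narrower than one unit can stay empty regardless of the language.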

For some languages the number of samples is too small to make any good assumption from the data.


## Conclusion

This work presents a per-language sample length analysis of the Universal Dependencies v2.6 dataset, with statistics for all 92 represented languages. The analysis also shows the length histograms by character and by token count.

The best compromise when choosing a sequence length for training an NLP architecture will depend mostly on the requirements of the application. Nevertheless, with the numbers here you should be able to make an informed guess about what might work better for your case.
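
As an illustration of how such a guess could be made, the sketch below picks the smallest sequence length that fits a chosen fraction of the samples and reports how many samples would be truncated. The `length_for_coverage` helper and the token counts are hypothetical. The coverage levels are a subset of the interval levels (99, 98, 95, 90, 85, 80) printed later in this notebook.

```python
import numpy as np


def length_for_coverage(lengths, coverage):
    """Smallest observed length that fits at least `coverage` of the samples."""
    ordered = np.sort(np.asarray(lengths))
    # Number of samples that must fit entirely within the chosen length.
    k = int(np.ceil(coverage * ordered.size))
    return int(ordered[max(k, 1) - 1])


# Hypothetical per-sentence token counts (not real data).
token_counts = [5, 8, 9, 12, 12, 15, 18, 22, 30, 64]

for level in (0.99, 0.95, 0.90, 0.80):
    max_len = length_for_coverage(token_counts, level)
    truncated = np.mean(np.asarray(token_counts) > max_len)
    print(f"coverage {level:.0%}: max_len={max_len}, truncated={truncated:.0%}")
```

With a long-tailed distribution, a small drop in coverage can shorten the required sequence considerably, which is the compromise discussed above.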

We can see that a multilingual approach will necessarily require longer sequences, as there is large variability in sequence length across languages, whereas targeting a single language might allow you to optimize your neural architecture.

## Future Work

I am currently working on a more in-depth analysis of the complete Project Gutenberg dataset (~60K books in several languages) that will discriminate several other text characteristics.

I have also started working on a complete parsing of a few of the Wiktionary datasets.

Stay tuned for those results ;)

In [1]:
from preprocessors.ud_conllu_stats import *
import json
import gzip

In [237]:
import matplotlib.pyplot as plt
import bokeh
from bokeh.plotting import figure, output_file, show
from bokeh.palettes import Spectral4
from bokeh.io import output_notebook
from bokeh.models import LinearAxis, Range1d, HoverTool, ColumnDataSource, DataTable, TableColumn
from bokeh.models.layouts import Column
from bokeh.layouts import gridplot, column, row, Spacer

%matplotlib inline

# import ipywidgets as widgets
# from ipywidgets import interact, interact_manual

In [3]:
output_notebook()

In [7]:
%%time
all_stats = generate_files(blacklist=[], saveto='conllu_stats.json.zip')



Error processing lang qhe with Exception 'NoneType' object has no attribute 'name'
processing af
processing aii
processing akk
processing am
processing ar
processing be
processing bg
processing bho
processing bm
processing br
processing bxr
processing ca
processing cop
processing cs
processing cu
processing cy
processing da
processing de
processing el
processing en
processing es
processing et
processing eu
processing fa
processing fi
processing fo
processing fr
processing fro
processing ga
processing gd
processing gl
processing got
processing grc
processing gsw
processing gun
processing he
processing hi
processing hr
processing hsb
processing hu
processing hy
processing id
processing is
processing it
processing ja
processing kk
processing kmr
processing ko
processing koi
processing kpv
processing krl
processing la
processing lt
processing lv
processing lzh
processing mdf
processing mr
processing mt
processing myv
processing nl
processing no
processing olo
processing orv
processing pcm


In [149]:
upos_plt_info, txt_plt_info = _make_data_sources(all_stats['fr'])

In [208]:
upos_plt_info, txt_plt_info = _make_data_sources(all_stats['fr'])

In [209]:
upos_table, interval_table = _make_stats_tables(upos_plt_info[2])

['99', '98', '95', '90', '85', '80']


In [154]:
# def _make_grid_plot(lang_data):
#     upos_plt_info, txt_plt_info = _make_data_sources(lang_data)
#     upos_plot = make_plot(*upos_plt_info[:2])
#     text_plot = make_plot(*txt_plt_info[:2])
    
    
#     gp = gridplot([upos_plot,text_plot], ncols=1, sizing_mode="stretch_width", plot_height=350)
#     return gp

In [264]:
grid = _make_grid_plot(all_stats['fr'])

In [265]:
show(grid)

In [286]:
import pandas as pd

def _make_stat_tables(all_lang_stats):
    # stats_dict2table is expected to return one DataFrame for UPOS and one for text
    df_tables = (upos_df, text_df) = stats_dict2table(all_lang_stats)
    intervals = ['intervals_99', 'intervals_98', 'intervals_95', 'intervals_90', 'intervals_85', 'intervals_80']
    # drop the raw interval pairs, their lower bounds, and the shape statistics
    cols_to_drop = intervals + [interval + '_low' for interval in intervals] + ['skew', 'kurtosis']
    # round precision (not implemented yet)

    # separate and clean the data
    for df in df_tables:
        for interval in intervals:
            # split each (low, high) pair into two columns; pandas raises
            # "ValueError: Columns must be same length as key" here when the cells
            # of df[interval] do not all expand to exactly two values
            df[[interval+'_low', interval+'_high']] = pd.DataFrame(df[interval].tolist(), index=df.index)
        # drop() returns a new DataFrame by default, so modify df in place
        df.drop(columns=cols_to_drop, inplace=True)

    bk_tables = []
    for table in df_tables:
        columns = [TableColumn(field=Ci, title=Ci, width=40) for Ci in table.columns] # bokeh columns
        data_table = DataTable(columns=columns, source=ColumnDataSource(table), sizing_mode='stretch_width', fit_columns=False ) # bokeh table
        bk_tables.append(data_table)

    return bk_tables

In [287]:
all_stats_copy = copy.deepcopy(all_stats)
upos_table, text_table = _make_stat_tables(all_stats)

ValueError: Columns must be same length as key

In [288]:
show(upos_table)

In [13]:
# df_tables = [upos_table, text_table]

# intervals = ['intervals_99', 'intervals_98', 'intervals_95', 'intervals_90', 'intervals_85', 'intervals_80']
# for df in df_tables:
#     for interval in intervals:
#         df[[interval+'_low', interval+'_high']] = pd.DataFrame(df[interval].tolist(), index=df.index)
#     df.drop(columns=intervals)

In [15]:

# df_tables = [upos_table, text_table]
# bk_tables = []


# for table in df_tables:
#     columns = [TableColumn(field=Ci, title=Ci) for Ci in table.columns] # bokeh columns
#     data_table = DataTable(columns=columns, source=ColumnDataSource(table)) # bokeh table
#     bk_tables.append(data_table)
