# A statistical Conllu file Exploration of  Universal Dependencies

## Introduction

While much work is being done in the current days on NLP and NLU, there is little work on describing why a certain length of transformer (or other as LSTM time steps) architecture has been chosen for the training, it is mostly arbitrary and depends on the goal of the work and resources available (mainly hardware). These decisions are hard once the model has been trained and there is nothing that can be done to extend the length of a transformer (for example) without having to retrain the entire network. There are however some works that tackle variable length sequences. 

This work presents a first complete analysis of the Universal Dependencies v2.6 dataset and presents the globan and individual results of each language present in the dataset.

This work does not intend to be a conference level paper (that is why there are no references to all the papers on each subject), but an informational technical report that might help to better select the most effective compromise text or token length for your particular NLP application.

The number of analyzed languages is 92, the token length is measured as the named UPOS tag in the dataset, while the character length is just that. There is no analysis on what constitutes a word or not, this means that a token includes the punctuiation and other symbols presents in the text samples. For lingüstic analysis purposes more de




## Observations

The histograms show a skew on the distribution, this can be a skewed gaussian, a generalized gaussian or a beta distribution form. Due to this, I will be testing different distribution fits with the Kolmogorov-Smirnov test.

There are many languages that do not have enough samples so the dsitribution fit will not be good  and errors will be big.
This is not an issue  from the code point of view. The important thing is if this data is used, take into account the number of samples available.


While doing this work I found quite interesting that are languages whose number of tokens or characters avoid certain bins in the histogram (Bulgarian, Breton Welsh, Danish, Slovak, Tamil and Thai are a few examples of this). This can mean that, either the language structure supports only those lengths, or that the analyzed dataset only contains samples that avoid some sentence lengths.

For some languages the number of samples is too small to make any good assumption from the data.


## Conclusion

This work presents a sample length analysis by language on the UniversalDependencies v2.6 dataset presenting the statistics for all 92 represented languages. The analysis then shows the length histograms by character and token length.

The best compromise for choosing a sequence length on the NLP architecture for training will depend mostly on the requirements of the applicatino, nevertheless with the numbers here you should be able to make an informed guess on what might be better for your case.

We can see that having a multi-lingual approach will necessary make the needed sequences longer as there is a large variability on sequence length, but appliying to single language might allow you to optimize your neural architectures

## Future Work

I am currently working on a more in depth analysis of the complete Gutenberg project dataset ( ~60K books in several languages) that will discriminate several other text characteristics.

I also have started to work on a complete parsing of a few of the Wiktionary datasets.

Stay tuned for those results ;)

In [172]:
from preprocessors.ud_conllu_stats import *
import json
import gzip

In [265]:
import matplotlib.pyplot as plt
import bokeh
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook

%matplotlib inline

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

In [266]:
output_notebook()

In [3]:
%%time
res = conllu_process_get_2list(blacklist=blacklist)

CPU times: user 11.1 s, sys: 2.31 s, total: 13.4 s
Wall time: 1min 50s


In [4]:
%%time
upos_data, deprel_data, sentences_data, forms_data = extract_data_from_fields(res)

CPU times: user 964 ms, sys: 52.7 ms, total: 1.02 s
Wall time: 1.01 s


In [5]:
%%time
# langs = ['es', 'fr', 'de', 'en']
# langs_data = compute_distributions(upos_data, deprel_data, sentences_data, langs)
langs_data = compute_distributions(upos_data, deprel_data, sentences_data)

  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
  improvement from the last ten iterations.
  return log(self._pdf(x, *args))


Error processing lang qhe with Exception 'NoneType' object has no attribute 'name'
CPU times: user 20min 1s, sys: 2.8 s, total: 20min 4s
Wall time: 20min 36s


Well the stats file is quite big now, at 166MB.

This is due to all the entire data functions (data, CDF, PDF) there, so an intermediate step would be to process this, plot the right graphs and then only save the graphs and then cut down the number of elements in the output.

This should have taken care of much of the size issue, but for a website that will still be too much

Also there are some other approaches and the idea is to think what the user would be looking for when reading the reports so:
Reading by language: each language can have it's own file, this means that 166MB/92 ~< 2MB per file. 



Also have a file with the table of the stats only, no need to have the graphs there. In this way there is an easy comparison.
The statistics should be computed and displayed for upos, deprel and text 

In [None]:
all_stats = generate_files(blacklist=[], saveto='conllu_stats.json.zip')

In [232]:
%%time
upos_table, deprel_table, text_table = stats_dict2table(all_stats)

CPU times: user 21 ms, sys: 8.06 ms, total: 29.1 ms
Wall time: 28.3 ms


In [233]:
upos_table.columns

Index(['lang_code', 'lang_name', 'mean', 'variance', 'skew', 'kurtosis',
       'median', 'std', 'intervals_99', 'intervals_98', 'intervals_95',
       'intervals_90', 'intervals_85', 'intervals_80'],
      dtype='object')

In [262]:
df_tables = [upos_table, deprel_table, text_table]

intervals = ['intervals_99', 'intervals_98', 'intervals_95', 'intervals_90', 'intervals_85', 'intervals_80']
for df in df_tables:
    for interval in intervals:
        df[[interval+'_low', interval+'_high']] = pd.DataFrame(df[interval].tolist(), index=df.index)
    df.drop(columns=intervals)

In [263]:
# from bokeh.models.widgets import DataTable, DateFormatter, TableColumn
from bokeh.models import ColumnDataSource, DataTable, DateFormatter, TableColumn

df_tables = [upos_table, deprel_table, text_table]
bk_tables = []

for table in df_tables:
    Columns = [TableColumn(field=Ci, title=Ci) for Ci in table.columns] # bokeh columns
    data_table = DataTable(columns=Columns, source=ColumnDataSource(table)) # bokeh table
    bk_tables.append(data_table)


In [264]:
upos_table.columns

Index(['lang_code', 'lang_name', 'mean', 'variance', 'skew', 'kurtosis',
       'median', 'std', 'intervals_99', 'intervals_98', 'intervals_95',
       'intervals_90', 'intervals_85', 'intervals_80', 'intervals_99_low',
       'intervals_99high', 'intervals_98_low', 'intervals_98high',
       'intervals_95_low', 'intervals_95high', 'intervals_90_low',
       'intervals_90high', 'intervals_85_low', 'intervals_85high',
       'intervals_80_low', 'intervals_80high', 'intervals_99_high',
       'intervals_98_high', 'intervals_95_high', 'intervals_90_high',
       'intervals_85_high', 'intervals_80_high'],
      dtype='object')

In [274]:
# upos_table

In [273]:
show(bk_tables[2])

In [275]:
frstats = all_stats['fr']

In [276]:
frstats.keys()

dict_keys(['lang', 'upos_len', 'upos_distrib', 'deprel_len', 'deprel_distrib', 'text_len', 'text_distrib', 'upos_stats', 'deprel_stats', 'text_stats', 'upos_functions', 'deprel_functions', 'text_functions'])

In [278]:
from bokeh.palettes import Spectral4
from bokeh.plotting import figure, output_file, show
# from bokeh.sampledata.stocks import AAPL, GOOG, IBM, MSFT

In [306]:
cdf, pdf = frstats['text_functions']['cdf'], frstats['text_functions']['pdf']

In [375]:
cdf100 = resample(cdf, 100)
pdf100 = resample(pdf, 100)

In [390]:
hist, bin_edges = np.histogram(frstats['text_len'], bins=100)

In [291]:
from scipy.signal import resample

In [292]:
cdf = resample(cdf, 100)

In [293]:
pdf = resample(pdf, 100)

In [362]:
import numpy as np
import scipy.special

from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show
from bokeh.models import LinearAxis, Range1d, HoverTool


In [386]:
max_len = max(frstats['text_len'])
x = np.linspace(0, max_len, 100)


In [409]:
# TODO make it even better being able to change the Sizing mode from a dropdown menu ?

hover = HoverTool(
#     names=["hist"],
    tooltips=[
#         ("index", "$index"),
        ("Count", "@hist"),
        ("pdf", "@pdf"),
        ("cdf", "@cdf"),
    ],

#     formatters={
#         '@date'        : 'datetime', # use 'datetime' formatter for '@date' field
#         '@{adj close}' : 'printf',   # use 'printf' formatter for '@{adj close}' field
#         '@numeral': '(.00)'                             # use default 'numeral' formatter for other fields
#     },
    # display a tooltip whenever the cursor is vertically in line with a glyph
    mode='vline'
)

In [414]:
def make_plot(title, data_source):

    p = figure(title=title, background_fill_color="#fafafa", 
               plot_height=500,  sizing_mode="stretch_width",
               tools="crosshair,pan,wheel_zoom,box_zoom,zoom_in,zoom_out,undo,redo,reset",
                toolbar_location="left",
               output_backend="webgl")
    p.add_tools(hover)
    # p = figure(title=title, tools='', background_fill_color="#fafafa")
    p.xaxis.axis_label = 'Length'
    p.yaxis.axis_label = 'Count'
    # second axe, probability
    p.extra_y_ranges = {"Pr(x)": Range1d(start=0., end=1.)}
    p.add_layout(LinearAxis(y_range_name="Pr(x)", axis_label='Pr(x)'), 'right')
    p.quad(name='hist', top='hist', bottom=0, left='bin_edges_left', right='bin_edges_right',
           fill_color="blue", line_color="white", alpha=0.5, legend_label="Freq.", source=data_source)
    p.line(name='PDF', x='x', y='pdf', line_color="green", line_width=4, alpha=0.7, legend_label="PDF", y_range_name="Pr(x)", source=data_source)
    p.line(name='CDF', x='x', y='cdf', line_color="red", line_width=2, alpha=0.7, legend_label="CDF", y_range_name="Pr(x)", source=data_source)


    p.y_range.start = 0

    p.title.align='center'
    p.legend.location = "center_right"
    #     p.legend.location = "bottom_right"
    p.legend.background_fill_color = "#fefefe"
    p.grid.grid_line_color="grey"
    #     p.legend.click_policy="mute"
    p.legend.click_policy="hide"

    show(p)

In [404]:
data_source = ColumnDataSource({'hist':hist,
                                'bin_edges_left': bin_edges[:-1],
                                'bin_edges_right': bin_edges[1:],
                                'x': x,
                                'pdf': pdf100,
                                'cdf': cdf100})

In [415]:
p_txt = make_plot('frech, txt-len', data_source)
show(p_txt)

ValueError: "Invalid object to show. The object to passed to show must be one of:

* a LayoutDOM (e.g. a Plot or Widget or Layout)
* a Bokeh Application
* a callable suitable to an application FunctionHandler


In [280]:
# Normal Distribution

bins = 100

mu, sigma = 0, 0.5

measured = np.random.normal(mu, sigma, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)

x = np.linspace(-2, 2, 1000)
pdf = 1/(sigma * np.sqrt(2*np.pi)) * np.exp(-(x-mu)**2 / (2*sigma**2))
cdf = (1+scipy.special.erf((x-mu)/np.sqrt(2*sigma**2)))/2

p1 = make_plot("Normal Distribution (μ=0, σ=0.5)", hist, edges, x, pdf, cdf)

# Log-Normal Distribution

mu, sigma = 0, 0.5

measured = np.random.lognormal(mu, sigma, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)

x = np.linspace(0.0001, 8.0, 1000)
pdf = 1/(x* sigma * np.sqrt(2*np.pi)) * np.exp(-(np.log(x)-mu)**2 / (2*sigma**2))
cdf = (1+scipy.special.erf((np.log(x)-mu)/(np.sqrt(2)*sigma)))/2

p2 = make_plot("Log Normal Distribution (μ=0, σ=0.5)", hist, edges, x, pdf, cdf)

# Gamma Distribution

k, theta = 7.5, 1.0

measured = np.random.gamma(k, theta, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)

x = np.linspace(0.0001, 20.0, 1000)
pdf = x**(k-1) * np.exp(-x/theta) / (theta**k * scipy.special.gamma(k))
cdf = scipy.special.gammainc(k, x/theta)

p3 = make_plot("Gamma Distribution (k=7.5, θ=1)", hist, edges, x, pdf, cdf)

# Weibull Distribution

lam, k = 1, 1.25
measured = lam*(-np.log(np.random.uniform(0, 1, 1000)))**(1/k)
hist, edges = np.histogram(measured, density=True, bins=50)

x = np.linspace(0.0001, 8, 1000)
pdf = (k/lam)*(x/lam)**(k-1) * np.exp(-(x/lam)**k)
cdf = 1 - np.exp(-(x/lam)**k)

p4 = make_plot("Weibull Distribution (λ=1, k=1.25)", hist, edges, x, pdf, cdf)

# output_file('histogram.html', title="histogram.py example")

show(gridplot([p1,p2,p3,p4], ncols=2, plot_width=400, plot_height=400, toolbar_location=None))

In [384]:
from bokeh.io import show, output_file, save, show
from bokeh.layouts import column
from bokeh.plotting import figure
from bokeh.models.sources import ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.models.callbacks import CustomJS

output_file("cross_hair.html")

x = list(range(12))
y = [v**2 for v in x]

NUM_PLOTS = 3

# Define a DataSource
data = dict(x=[0]*NUM_PLOTS)

line_source = ColumnDataSource(data=data)

js = '''
var geometry = cb_data['geometry'];
console.log(geometry);
var data = line_source.data;
var x = data['x'];
console.log(x);
if (isFinite(geometry.x)) {
  for (i = 0; i < x.length; i++) {
    x[i] = geometry.x;
  }
  line_source.change.emit();
}
'''


plots = []
for i in range(NUM_PLOTS):
    plot = figure(plot_width=250, plot_height=250, title=None)
    plot.segment(x0='x', y0=0, x1='x', y1=200, color='red', line_width=1, source=line_source)
    plot.circle(x, y, size=10, color="navy", alpha=0.5)
    hover = HoverTool(tooltips=None, 
                      point_policy='follow_mouse', 
                      callback=CustomJS(code=js, args={'line_source': line_source}))
    plot.add_tools(hover)
    plots.append(plot)

show(column(*plots))