# Universal Dependencies Treebank Pre-Processing

This notebook explores how to preprocess the Universal Dependencies CoNLL-U files so that they can be used to train character-level neural networks.

The languages and files are filtered to avoid languages whose characters need 3 or 4 UTF-8 segments (bytes), because the tests use only the first 2 segments. This is for performance reasons, and I consider it enough to show that the methodology works.
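
As a minimal sketch of the 2-segment restriction, the check below counts how many bytes each character takes in UTF-8. The helper names `utf8_segments` and `fits_in_two_segments` are illustrative assumptions and are not part of this notebook's code.

```python
def utf8_segments(ch):
    """Number of bytes (segments) that one character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_two_segments(text):
    """True when every character of `text` is encoded with at most 2 bytes."""
    return all(utf8_segments(ch) <= 2 for ch in text)


# Latin letters take 1 byte, accented Latin letters 2, CJK ideographs 3
print(utf8_segments("a"), utf8_segments("ó"), utf8_segments("中"))
print(fits_in_two_segments("órgano"), fits_in_two_segments("中文"))
```

A text passes the check only if all of its characters stay within the 2-segment range, which is the property the language filtering below tries to approximate at the treebank level.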



In [1]:
import os
import sys
from sortedcontainers import SortedDict
import numpy as np
import pandas as pd
import conllu
import pyconll
import pyconll.util
import ntpath

In [2]:
# from https://stackoverflow.com/questions/8384737/extract-file-name-from-path-no-matter-what-the-os-path-format
def path_leaf(path):
    head, tail = ntpath.split(path)
    return tail or ntpath.basename(head)

In [3]:
base_dir = "/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5"

In [4]:
def get_all_files_recurse(rootdir):
    allfiles = []
    for root, directories, filenames in os.walk(rootdir):
        for filename in filenames: 
            allfiles.append(os.path.join(root,filename) )
    return allfiles

In [5]:
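def get_conllu(fname):
    # parse a CoNLL-U file; the context manager closes the file handle when done
    with open(fname, "r", encoding="utf-8") as f:
        cnl = conllu.parse(f.read())
    return path_leaf(fname), cnl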

In [6]:
allfiles = get_all_files_recurse(base_dir)

In [7]:
conllufiles = [f for f in allfiles if f.endswith(".conllu")]

In [8]:
len(conllufiles)

357

In [9]:
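# List of languages that won't be used:
# only the first 2 segments of UTF-8 are encoded, so the idea is to leave out
# languages whose text needs characters beyond that range.
# Maybe blacklist (currently excluded as well); "Hebrew" was considered but is kept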
maybe_blacklist = ["Kurmanji", "Urdu", "Indonesian", "Coptic-Scriptorium", "Kazakh", "Marathi", "Tamil", "Thai", "Warlpiri"]
lang_tokens_blacklist = ["Hindi", "Chinese", "Korean", "Tagalog", "Vietnamese", "Telugu", "Uyghur", "Cantonese" ]

Filter out the languages that I won't be training on.

Because of the encoding I'm using, nothing above 2 segments (code points up to U+07FF in UTF-8) is encoded. This is due to resource limits on my local machine.

For more extensive, possibly future networks I'll use 3 segments (up to U+FFFF). A complete encoding should use all 4 UTF-8 segments (up to U+10FFFF).
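
To make the ranges concrete, here is a small sketch that verifies the upper code point of each segment count and measures how much of a text fits in the first 2 segments. `LIMITS`, `segment_histogram` and `two_segment_coverage` are hypothetical names introduced only for this illustration.

```python
from collections import Counter

# upper code point for each UTF-8 segment count (U+007F is added for completeness)
LIMITS = {1: 0x7F, 2: 0x7FF, 3: 0xFFFF, 4: 0x10FFFF}


def segment_histogram(text):
    """Count the characters of `text` by the number of UTF-8 bytes they need."""
    return Counter(len(ch.encode("utf-8")) for ch in text)


def two_segment_coverage(text):
    """Fraction of characters that fit in 1 or 2 UTF-8 bytes (up to U+07FF)."""
    if not text:
        return 1.0
    hist = segment_histogram(text)
    return (hist[1] + hist[2]) / len(text)


for n, limit in LIMITS.items():
    # the last code point of each range still needs exactly n bytes
    assert len(chr(limit).encode("utf-8")) == n

sample = "El órgano 中"
print(segment_histogram(sample))
print(two_segment_coverage(sample))
```

A coverage value computed over a whole treebank could be one way to decide whether a language belongs in the blacklist, instead of listing languages by hand.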



In [10]:
blacklist = maybe_blacklist + lang_tokens_blacklist

prefiltered_conllu = []
for f in conllufiles:
    todel = list(filter(lambda bl: bl in f, blacklist))
#     print(f, todel)
    if len(todel)==0:
        prefiltered_conllu.append(f)
#     else:
#         print("todel>0", todel)
    

In [11]:
len(prefiltered_conllu)

298
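
The substring test above matches a blacklist entry anywhere in the full path, so a parent directory that happens to contain a language name would also be filtered. A slightly stricter variant is sketched here, assuming every treebank lives in a directory whose name starts with `UD_` (as in the paths used in this notebook). `treebank_dir` and `is_blacklisted` are hypothetical helpers, and the `/data/...` paths are made up for the example.

```python
def treebank_dir(path):
    """Return the first path component that starts with 'UD_', or '' if none."""
    for part in path.replace("\\", "/").split("/"):
        if part.startswith("UD_"):
            return part
    return ""


def is_blacklisted(path, blacklist):
    """True when a blacklist entry occurs in the treebank directory name."""
    name = treebank_dir(path)
    return any(bl in name for bl in blacklist)


blacklist = ["Hindi", "Chinese", "Thai"]
paths = [
    "/data/ud/UD_Hindi-HDTB/hi_hdtb-ud-train.conllu",
    "/data/ud/UD_Basque-BDT/eu_bdt-ud-train.conllu",
    "/data/Thai/ud/UD_Basque-BDT/eu_bdt-ud-train.conllu",
]
kept = [p for p in paths if not is_blacklisted(p, blacklist)]
print(kept)
```

Only the Hindi file is dropped here. The second Basque path survives even though a parent directory contains "Thai", which the plain substring test on the full path would have filtered out.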

In [12]:

conllu_train = [f for f in prefiltered_conllu if "-train" in f]
conllu_test = [f for f in prefiltered_conllu if "-test" in f]
conllu_dev = [f for f in prefiltered_conllu if "-dev" in f]

In [13]:
len(conllu_train), len(conllu_test), len(conllu_dev)

(88, 130, 80)

In [14]:
conllu_train

['/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Basque-BDT/eu_bdt-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Danish-DDT/da_ddt-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Galician-TreeGal/gl_treegal-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Maltese-MUDT/mt_mudt-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Irish-IDT/ga_idt-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Galician-CTG/gl_ctg-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Latvian-LVTB/lv_lvtb-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Bulgarian-BTB/bg_btb-ud-train.conllu',
 '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebank

In [15]:
# check that all are files
# ff = [os.path.isfile(f) for f in cleanfnames]

In [16]:
# fields according to conllu parser lib
#fields = ['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']
# fields = ['upostag', 'deprel', 'feats']
# fields = ['upostag', 'deprel']  # these are the only fields that I will take into account for training
fields = ['upostag']  # I'll start with UPOS only, as it is the smallest and simplest field to check

In [17]:
# upos (and the other fields) was extracted by analysing all the files in the ud-treebank dataset v2.4;
# the same analysis can be done for v2.5, which was released on Nov 15th 2019.
upos = {'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 
        'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', '_'}

upos = sorted(list(upos))

In [18]:
# deprel analysis of ud-treebank v2.4 on all the language files
deprel={'_', 'acl', 'acl:adv', 'acl:appos', 'acl:cleft', 'acl:focus', 'acl:inf', 'acl:part', 'acl:poss', 'acl:relcl', 'advcl', 'advcl:appos', 'advcl:arg', 'advcl:cleft', 'advcl:cond', 'advcl:coverb', 'advcl:periph', 'advcl:relcl', 'advcl:sp', 'advcl:svc', 'advcl:tcl', 'advmod', 'advmod:appos', 'advmod:arg', 'advmod:cc', 'advmod:det', 'advmod:df', 'advmod:discourse', 'advmod:emph', 'advmod:locy', 'advmod:mode', 'advmod:neg', 'advmod:obl', 'advmod:periph', 'advmod:que', 'advmod:sentcon', 'advmod:tfrom', 'advmod:tlocy', 'advmod:tmod', 'advmod:to', 'advmod:tto', 'amod', 'amod:advmod', 'amod:att', 'amod:attlvc', 'amod:flat', 'amod:mode', 'amod:obl', 'appos', 'appos:conj', 'appos:nmod', 'aux', 'aux:aglt', 'aux:caus', 'aux:clitic', 'aux:cnd', 'aux:imp', 'aux:mood', 'aux:neg', 'aux:part', 'aux:pass', 'aux:poss', 'aux:q', 'case', 'case:acc', 'case:aspect', 'case:circ', 'case:dec', 'case:det', 'case:gen', 'case:loc', 'case:pred', 'case:pref', 'case:suff', 'case:voc', 'cc', 'cc:nc', 'cc:preconj', 'ccomp', 'ccomp:cleft', 'ccomp:obj', 'ccomp:obl', 'ccomp:pmod', 'ccomp:pred', 'clf', 'compound', 'compound:a', 'compound:affix', 'compound:coll', 'compound:conjv', 'compound:dir', 'compound:ext', 'compound:lvc', 'compound:n', 'compound:nn', 'compound:plur', 'compound:preverb', 'compound:prt', 'compound:quant', 'compound:redup', 'compound:smixut', 'compound:svc', 'compound:v', 'compound:vo', 'compound:vv', 'conj', 'conj:appos', 'conj:coord', 'conj:dicto', 'conj:extend', 'conj:redup', 'conj:svc', 'cop', 'cop:expl', 'cop:locat', 'cop:own', 'csubj', 'csubj:cleft', 'csubj:cop', 'csubj:pass', 'csubj:quasi', 'dep', 'dep:alt', 'dep:iobj', 'dep:obj', 'dep:prt', 'det', 'det:def', 'det:numgov', 'det:nummod', 'det:poss', 'det:predet', 'det:rel', 'discourse', 'discourse:emo', 'discourse:filler', 'discourse:intj', 'discourse:q', 'discourse:sp', 'dislocated', 'dislocated:cleft', 'expl', 'expl:impers', 'expl:pass', 'expl:poss', 'expl:pv', 'fixed', 'fixed:name', 'flat', 'flat:abs', 'flat:foreign', 
'flat:name', 'flat:range', 'flat:repeat', 'flat:sibl', 'flat:title', 'flat:vv', 'goeswith', 'iobj', 'iobj:agent', 'iobj:appl', 'iobj:caus', 'list', 'mark', 'mark:adv', 'mark:advb', 'mark:advmod', 'mark:comp', 'mark:obj', 'mark:obl', 'mark:prt', 'mark:q', 'mark:rel', 'mark:relcl', 'nmod', 'nmod:abl', 'nmod:advmod', 'nmod:agent', 'nmod:appos', 'nmod:arg', 'nmod:att', 'nmod:attlvc', 'nmod:cau', 'nmod:clas', 'nmod:cmp', 'nmod:comp', 'nmod:dat', 'nmod:flat', 'nmod:gen', 'nmod:gmod', 'nmod:gobj', 'nmod:gsubj', 'nmod:ins', 'nmod:npmod', 'nmod:obl', 'nmod:obllvc', 'nmod:own', 'nmod:part', 'nmod:pmod', 'nmod:poss', 'nmod:pred', 'nmod:ref', 'nmod:tmod', 'nsubj', 'nsubj:advmod', 'nsubj:appos', 'nsubj:caus', 'nsubj:cop', 'nsubj:expl', 'nsubj:lvc', 'nsubj:nc', 'nsubj:obj', 'nsubj:own', 'nsubj:pass', 'nsubj:periph', 'nsubj:quasi', 'nummod', 'nummod:entity', 'nummod:gov', 'obj', 'obj:advmod', 'obj:advneg', 'obj:agent', 'obj:appl', 'obj:cau', 'obj:caus', 'obj:lvc', 'obj:obl', 'obj:periph', 'obl', 'obl:advmod', 'obl:agent', 'obl:appl', 'obl:arg', 'obl:cau', 'obl:cmpr', 'obl:comp', 'obl:lmod', 'obl:loc', 'obl:mod', 'obl:npmod', 'obl:own', 'obl:patient', 'obl:periph', 'obl:poss', 'obl:prep', 'obl:sentcon', 'obl:tmod', 'obl:x', 'orphan', 'parataxis', 'parataxis:appos', 'parataxis:conj', 'parataxis:deletion', 'parataxis:discourse', 'parataxis:dislocated', 'parataxis:hashtag', 'parataxis:insert', 'parataxis:newsent', 'parataxis:nsubj', 'parataxis:obj', 'parataxis:parenth', 'parataxis:rel', 'parataxis:rep', 'parataxis:restart', 'punct', 'reparandum', 'root', 'vocative', 'vocative:cl', 'vocative:mention', 'xcomp', 'xcomp:adj', 'xcomp:ds', 'xcomp:obj', 'xcomp:pred', 'xcomp:sp', 'xcomp:subj'}
deprel=sorted(list(deprel))

In [19]:
len(upos), len(deprel)

(18, 278)

In [20]:

UPOS = {'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON',
        'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', '_'}
UPOS_LIST = sorted(list(UPOS))
UPOS_IDX2CHAR = SortedDict(enumerate(UPOS_LIST))
UPOS_CHAR2IDX = SortedDict(zip(UPOS_LIST, range(len(UPOS_LIST))))

In [21]:
UPOS_CHAR2IDX

SortedDict({'ADJ': 0, 'ADP': 1, 'ADV': 2, 'AUX': 3, 'CCONJ': 4, 'DET': 5, 'INTJ': 6, 'NOUN': 7, 'NUM': 8, 'PART': 9, 'PRON': 10, 'PROPN': 11, 'PUNCT': 12, 'SCONJ': 13, 'SYM': 14, 'VERB': 15, 'X': 16, '_': 17})

In [22]:
UPOS_IDX2CHAR

SortedDict({0: 'ADJ', 1: 'ADP', 2: 'ADV', 3: 'AUX', 4: 'CCONJ', 5: 'DET', 6: 'INTJ', 7: 'NOUN', 8: 'NUM', 9: 'PART', 10: 'PRON', 11: 'PROPN', 12: 'PUNCT', 13: 'SCONJ', 14: 'SYM', 15: 'VERB', 16: 'X', 17: '_'})
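
The two tables are inverses of each other, and the index of each tag comes only from the sort order. The sketch below uses plain dicts and a toy tag set (both are simplifications for the example, since the notebook uses `SortedDict` and the full tag list) to show the round trip and why `'_'` ends up last among the uppercase UPOS tags.

```python
# toy tag set: two uppercase tags, the "_" placeholder and one lowercase label
tags = sorted({"NOUN", "DET", "_", "det"})
tag2idx = {tag: i for i, tag in enumerate(tags)}
idx2tag = {i: tag for tag, i in tag2idx.items()}

# string sorting follows code points: uppercase letters < "_" < lowercase letters
print(tags)

# round trip: every tag maps to an index and back to itself
assert all(idx2tag[tag2idx[tag]] == tag for tag in tags)
```

This ordering is why `'_'` has index 17 in the UPOS table, while in a table of lowercase labels such as the deprel one it comes first.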

In [23]:
# deprel analysis of ud-treebank v2.4 on all the language files
DEPREL = {'_', 'acl', 'acl:adv', 'acl:appos', 'acl:cleft', 'acl:focus', 'acl:inf', 'acl:part', 'acl:poss', 'acl:relcl', 'advcl', 'advcl:appos', 'advcl:arg', 'advcl:cleft', 'advcl:cond', 'advcl:coverb', 'advcl:periph', 'advcl:relcl', 'advcl:sp', 'advcl:svc', 'advcl:tcl', 'advmod', 'advmod:appos', 'advmod:arg', 'advmod:cc', 'advmod:det', 'advmod:df', 'advmod:discourse', 'advmod:emph', 'advmod:locy', 'advmod:mode', 'advmod:neg', 'advmod:obl', 'advmod:periph', 'advmod:que', 'advmod:sentcon', 'advmod:tfrom', 'advmod:tlocy', 'advmod:tmod', 'advmod:to', 'advmod:tto', 'amod', 'amod:advmod', 'amod:att', 'amod:attlvc', 'amod:flat', 'amod:mode', 'amod:obl', 'appos', 'appos:conj', 'appos:nmod', 'aux', 'aux:aglt', 'aux:caus', 'aux:clitic', 'aux:cnd', 'aux:imp', 'aux:mood', 'aux:neg', 'aux:part', 'aux:pass', 'aux:poss', 'aux:q', 'case', 'case:acc', 'case:aspect', 'case:circ', 'case:dec', 'case:det', 'case:gen', 'case:loc', 'case:pred', 'case:pref', 'case:suff', 'case:voc', 'cc', 'cc:nc', 'cc:preconj', 'ccomp', 'ccomp:cleft', 'ccomp:obj', 'ccomp:obl', 'ccomp:pmod', 'ccomp:pred', 'clf', 'compound', 'compound:a', 'compound:affix', 'compound:coll', 'compound:conjv', 'compound:dir', 'compound:ext', 'compound:lvc', 'compound:n', 'compound:nn', 'compound:plur', 'compound:preverb', 'compound:prt', 'compound:quant', 'compound:redup', 'compound:smixut', 'compound:svc', 'compound:v', 'compound:vo', 'compound:vv', 'conj', 'conj:appos', 'conj:coord', 'conj:dicto', 'conj:extend', 'conj:redup', 'conj:svc', 'cop', 'cop:expl', 'cop:locat', 'cop:own', 'csubj', 'csubj:cleft', 'csubj:cop', 'csubj:pass', 'csubj:quasi', 'dep', 'dep:alt', 'dep:iobj', 'dep:obj', 'dep:prt', 'det', 'det:def', 'det:numgov', 'det:nummod', 'det:poss', 'det:predet', 'det:rel', 'discourse', 'discourse:emo', 'discourse:filler', 'discourse:intj', 'discourse:q', 'discourse:sp', 'dislocated', 'dislocated:cleft', 'expl', 'expl:impers', 'expl:pass', 'expl:poss', 'expl:pv', 'fixed', 'fixed:name', 'flat', 'flat:abs', 'flat:foreign', 
'flat:name', 'flat:range', 'flat:repeat', 'flat:sibl', 'flat:title', 'flat:vv', 'goeswith', 'iobj', 'iobj:agent', 'iobj:appl', 'iobj:caus', 'list', 'mark', 'mark:adv', 'mark:advb', 'mark:advmod', 'mark:comp', 'mark:obj', 'mark:obl', 'mark:prt', 'mark:q', 'mark:rel', 'mark:relcl', 'nmod', 'nmod:abl', 'nmod:advmod', 'nmod:agent', 'nmod:appos', 'nmod:arg', 'nmod:att', 'nmod:attlvc', 'nmod:cau', 'nmod:clas', 'nmod:cmp', 'nmod:comp', 'nmod:dat', 'nmod:flat', 'nmod:gen', 'nmod:gmod', 'nmod:gobj', 'nmod:gsubj', 'nmod:ins', 'nmod:npmod', 'nmod:obl', 'nmod:obllvc', 'nmod:own', 'nmod:part', 'nmod:pmod', 'nmod:poss', 'nmod:pred', 'nmod:ref', 'nmod:tmod', 'nsubj', 'nsubj:advmod', 'nsubj:appos', 'nsubj:caus', 'nsubj:cop', 'nsubj:expl', 'nsubj:lvc', 'nsubj:nc', 'nsubj:obj', 'nsubj:own', 'nsubj:pass', 'nsubj:periph', 'nsubj:quasi', 'nummod', 'nummod:entity', 'nummod:gov', 'obj', 'obj:advmod', 'obj:advneg', 'obj:agent', 'obj:appl', 'obj:cau', 'obj:caus', 'obj:lvc', 'obj:obl', 'obj:periph', 'obl', 'obl:advmod', 'obl:agent', 'obl:appl', 'obl:arg', 'obl:cau', 'obl:cmpr', 'obl:comp', 'obl:lmod', 'obl:loc', 'obl:mod', 'obl:npmod', 'obl:own', 'obl:patient', 'obl:periph', 'obl:poss', 'obl:prep', 'obl:sentcon', 'obl:tmod', 'obl:x', 'orphan', 'parataxis', 'parataxis:appos', 'parataxis:conj', 'parataxis:deletion', 'parataxis:discourse', 'parataxis:dislocated', 'parataxis:hashtag', 'parataxis:insert', 'parataxis:newsent', 'parataxis:nsubj', 'parataxis:obj', 'parataxis:parenth', 'parataxis:rel', 'parataxis:rep', 'parataxis:restart', 'punct', 'reparandum', 'root', 'vocative', 'vocative:cl', 'vocative:mention', 'xcomp', 'xcomp:adj', 'xcomp:ds', 'xcomp:obj', 'xcomp:pred', 'xcomp:sp', 'xcomp:subj'}
DEPREL_LIST = sorted(list(DEPREL))
DEPREL_IDX2CHAR = SortedDict(enumerate(DEPREL_LIST))
DEPREL_CHAR2IDX = SortedDict(zip(DEPREL_LIST, range(len(DEPREL_LIST))))


In [27]:
# manually doing one processing so I understand how it works and whether the process goes OK.
# note: despite the *_test names, these paths point to the train split files

es_test = '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_Spanish-AnCora/es_ancora-ud-train.conllu'
fr_test = '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_French-ParTUT/fr_partut-ud-train.conllu'
en_test = '/home/leo/projects/Datasets/text/UniversalDependencies/ud-treebanks-v2.5/UD_English-EWT/en_ewt-ud-train.conllu'

# should later verify that things are OK in every processed file 

In [28]:
%%time
es_leaf, es_cnl = get_conllu(es_test)

CPU times: user 6.73 s, sys: 243 ms, total: 6.97 s
Wall time: 6.98 s


In [29]:
escnl0 = es_cnl[0]

In [30]:
%%time
# the conllu library seems too basic for the usage I need, so now I'll try pyconll
es_pcnl = pyconll.load_from_file(es_test)

CPU times: user 4.33 s, sys: 224 ms, total: 4.56 s
Wall time: 4.55 s


In [31]:
ess0 = es_pcnl[0]

In [32]:
len( es_pcnl)

14305

In [33]:
# ess0._ids_to_indexes

In [34]:
est0 = ess0._tokens[0]

In [35]:
ess0.text

'El presidente del órgano regulador de las Telecomunicaciones se mostró partidario de completar esta liberalización de las telecomunicaciones con otras medidas que incentiven la competencia como puede ser abrir el acceso a la información de los clientes de Telefónica a otros operadores.'

In [36]:
np.array(list(ess0.text)).dtype

dtype('<U1')

In [37]:
l_txt = np.array(list(ess0.text))
l_tags = np.empty_like(l_txt)

In [38]:
np.stack([l_txt, l_tags]).shape

(2, 286)

In [39]:
# list(ess0.text)

In [40]:
# for t in es_pcnl:
#     print(t.text)

In [41]:
est0.form

'El'

In [42]:
ess0.text[0:].find(est0.form)

0

In [43]:
est0.upos

'DET'

In [44]:
ess0.text[0+0+2:]

' presidente del órgano regulador de las Telecomunicaciones se mostró partidario de completar esta liberalización de las telecomunicaciones con otras medidas que incentiven la competencia como puede ser abrir el acceso a la información de los clientes de Telefónica a otros operadores.'
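
The manual steps above (find the token form in the text, then continue searching from the end of the match) can be sketched with plain strings. `align_tokens` is a hypothetical helper that takes a list of forms instead of pyconll tokens, and the short sentence is just the beginning of the example text.

```python
def align_tokens(text, forms):
    """Return a (start, end) character span per form, searching left to right."""
    spans = []
    index = 0
    for form in forms:
        start = text.find(form, index)
        if start < 0:
            # form not found after the previous token: record a gap, keep position
            spans.append(None)
            continue
        end = start + len(form)
        spans.append((start, end))
        index = end
    return spans


text = "El presidente del órgano"
spans = align_tokens(text, ["El", "presidente", "del", "órgano"])
for form, span in zip(["El", "presidente", "del", "órgano"], spans):
    print(form, span)
```

Searching from the end of the previous match keeps repeated forms (for example a second "de") from being aligned to an earlier occurrence, and the `start < 0` guard avoids building a wrong span when `find` returns -1.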

In [45]:
def seq2charlevel(sequence, upos=True, deprel=False):
    """
    :param sequence: conllu (pyconll) sequence
    :param upos: currently unused, the upos row is always generated
    :param deprel: if True, add a third row with the deprel tag of each char
    :return: a numpy array of strings with shape (2, N) or (3, N) where the first row is the text
    (char by char) and the remaining rows are the assigned tags, upos and deprel (in that order)
    """
    # convert text to an array of single characters
    txt = sequence.text
    l_txt = np.array(list(txt))
    # "_" marks chars that belong to no token (e.g. spaces); 5 chars is the max length of a upos tag
    l_tags_upos = np.full(l_txt.shape, fill_value="_", dtype="U5")
    l_tags_deprel = None
    if deprel:
        # 20 chars is the max length of a deprel tag in the analysis
        l_tags_deprel = np.full(l_txt.shape, fill_value="_", dtype="U20")
    # walk the tokens in order, searching each form from the end of the previous match
    index = 0
    for t in sequence._tokens:
        if t.form is None:
            continue  # a token without a surface form cannot be aligned
        idx_start = txt.find(t.form, index)
        if idx_start < 0:
            continue  # form not found in the remaining text: leave the tags untouched
        idx_end = idx_start + len(t.form)
        # set the tags for each char of the token
        l_tags_upos[idx_start:idx_end] = t.upos if t.upos is not None else "_"
        if deprel:
            l_tags_deprel[idx_start:idx_end] = t.deprel if t.deprel is not None else "_"
        # set index to the new absolute text position
        index = idx_end
    if deprel:
        ret = np.stack([l_txt, l_tags_upos, l_tags_deprel])
    else:
        ret = np.stack([l_txt, l_tags_upos])
    return ret



In [46]:
%%time
seq2charlevel(ess0, deprel=True).transpose()

CPU times: user 354 µs, sys: 23 µs, total: 377 µs
Wall time: 245 µs


array([['E', 'DET', 'det'],
       ['l', 'DET', 'det'],
       [' ', '_', '_'],
       ['p', 'NOUN', 'nsubj'],
       ['r', 'NOUN', 'nsubj'],
       ['e', 'NOUN', 'nsubj'],
       ['s', 'NOUN', 'nsubj'],
       ['i', 'NOUN', 'nsubj'],
       ['d', 'NOUN', 'nsubj'],
       ['e', 'NOUN', 'nsubj'],
       ['n', 'NOUN', 'nsubj'],
       ['t', 'NOUN', 'nsubj'],
       ['e', 'NOUN', 'nsubj'],
       [' ', '_', '_'],
       ['d', 'ADP', 'case'],
       ['e', 'ADP', 'case'],
       ['l', 'ADP', 'case'],
       [' ', '_', '_'],
       ['ó', 'NOUN', 'nmod'],
       ['r', 'NOUN', 'nmod'],
       ['g', 'NOUN', 'nmod'],
       ['a', 'NOUN', 'nmod'],
       ['n', 'NOUN', 'nmod'],
       ['o', 'NOUN', 'nmod'],
       [' ', '_', '_'],
       ['r', 'ADJ', 'amod'],
       ['e', 'ADJ', 'amod'],
       ['g', 'ADJ', 'amod'],
       ['u', 'ADJ', 'amod'],
       ['l', 'ADJ', 'amod'],
       ['a', 'ADJ', 'amod'],
       ['d', 'ADJ', 'amod'],
       ['o', 'ADJ', 'amod'],
       ['r', 'ADJ', 'amod'],
       [' 

In [47]:
def charseq2int(charseq, char2int_codebook, upos2int, deprel2int):
    """

    :param charseq: character sequence in a numpy matrix form where shape = (2,N) or (3,N)
            charseq[0] is the character sequence
            charseq[1] is the upos tag
            charseq[2] is the deprel tag
    :param char2int_codebook: dictionary codebook encoding the chars to int indices
    :param upos2int: dictionary encoding the upos tags to int
    :param deprel2int: dictionary encoding the deprel tags to int
    :return: an output matrix of type int of the same dimensions as the input with the index coding for each row
    """
    assert 2 <= charseq.shape[0] <= 3
    ret = np.empty(shape=charseq.shape, dtype=np.int32)
    ret[0, :] = np.vectorize(char2int_codebook.get)(charseq[0])
    ret[1, :] = np.vectorize(upos2int.get)(charseq[1], upos2int["_"])
    if charseq.shape[0] == 3:
        ret[2, :] = np.vectorize(deprel2int.get)(charseq[2], deprel2int["_"])
    return ret
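
One detail of `charseq2int` worth noting: `dict.get` without a default returns `None` for a character that is missing from the codebook, and `None` cannot be stored in the `int32` result. The sketch below makes that case explicit with plain lists and toy codebooks. `encode_row` and both codebooks are hypothetical stand-ins, not the notebook's encoders.

```python
def encode_row(values, codebook, default=None):
    """Map each value to its index; unknown values use `default` or raise KeyError."""
    out = []
    for v in values:
        if v in codebook:
            out.append(codebook[v])
        elif default is not None:
            out.append(default)
        else:
            raise KeyError(f"value not in codebook: {v!r}")
    return out


toy_char2idx = {"E": 0, "l": 1, " ": 2}    # toy character codebook
toy_upos2idx = {"DET": 0, "NOUN": 1, "_": 2}  # toy tag codebook

chars = encode_row(list("El "), toy_char2idx)
tags = encode_row(["DET", "DET", "None"], toy_upos2idx, default=toy_upos2idx["_"])
print(chars, tags)
```

Unknown tags fall back to the `"_"` index, as in `charseq2int`, while an unknown character fails loudly instead of producing a value that does not fit the integer matrix.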


Now I test these ideas; first I need to load the encoders.

In [48]:
from utf8_encoder import *
utf8codebook = np.load("utf8-codes/utf8_code_matrix_2seg.npy")
idx2char = load_obj("utf8-codes/num2txt_2seg.pkl")
char2idx = load_obj("utf8-codes/txt2num_2seg.pkl")

In [49]:
charseq = seq2charlevel(ess0, deprel=True)

In [50]:
ret = np.zeros(shape=charseq.shape, dtype=np.int32)

In [51]:
ret[0, :] = np.vectorize(char2idx.get)(charseq[0])

In [52]:
ret[1, :] = np.vectorize(UPOS_CHAR2IDX.get)(charseq[1])

In [53]:
ret[2, :] = np.vectorize(DEPREL_CHAR2IDX.get)(charseq[2])

In [54]:
ret[2]

array([126, 126,   0, 203, 203, 203, 203, 203, 203, 203, 203, 203, 203,
         0,  63,  63,  63,   0, 174, 174, 174, 174, 174, 174,   0,  41,
        41,  41,  41,  41,  41,  41,  41,  41,   0,  63,  63,   0, 126,
       126, 126,   0, 174, 174, 174, 174, 174, 174, 174, 174, 174, 174,
       174, 174, 174, 174, 174, 174, 174, 174,   0, 219, 219,   0, 267,
       267, 267, 267, 267, 267,   0, 219, 219, 219, 219, 219, 219, 219,
       219, 219, 219,   0, 163, 163,   0,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   0, 126, 126, 126, 126,   0, 219, 219, 219, 219,
       219, 219, 219, 219, 219, 219, 219, 219, 219, 219,   0,  63,  63,
         0, 126, 126, 126,   0, 174, 174, 174, 174, 174, 174, 174, 174,
       174, 174, 174, 174, 174, 174, 174, 174, 174, 174,   0,  63,  63,
        63,   0, 126, 126, 126, 126, 126,   0, 229, 229, 229, 229, 229,
       229, 229,   0, 203, 203, 203,   0,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   0, 126, 126,   0, 219, 219, 219, 21

In [55]:
ind_charseq = charseq2int(charseq, char2idx, UPOS_CHAR2IDX, DEPREL_CHAR2IDX)
# ind_charseq = charseq2int(charseq, char2idx)  #, UPOS_CHAR2IDX, DEPREL_CHAR2IDX)

In [56]:
max(DEPREL_CHAR2IDX.values())

277

In [57]:
DEPREL_IDX2CHAR[26]

'advmod:df'

In [58]:
DEPREL_CHAR2IDX["nsubj"]

203

In [59]:
charseq[:,:20], ind_charseq[:,:20]

(array([['E', 'l', ' ', 'p', 'r', 'e', 's', 'i', 'd', 'e', 'n', 't', 'e',
         ' ', 'd', 'e', 'l', ' ', 'ó', 'r'],
        ['DET', 'DET', '_', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN',
         'NOUN', 'NOUN', 'NOUN', 'NOUN', 'NOUN', '_', 'ADP', 'ADP', 'ADP',
         '_', 'NOUN', 'NOUN'],
        ['det', 'det', '_', 'nsubj', 'nsubj', 'nsubj', 'nsubj', 'nsubj',
         'nsubj', 'nsubj', 'nsubj', 'nsubj', 'nsubj', '_', 'case', 'case',
         'case', '_', 'nmod', 'nmod']], dtype='<U20'),
 array([[ 69, 108,  32, 112, 114, 101, 115, 105, 100, 101, 110, 116, 101,
          32, 100, 101, 108,  32, 243, 114],
        [  5,   5,  17,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
          17,   1,   1,   1,  17,   7,   7],
        [126, 126,   0, 203, 203, 203, 203, 203, 203, 203, 203, 203, 203,
           0,  63,  63,  63,   0, 174, 174]], dtype=int32))

In [60]:
ind_charseq

array([[ 69, 108,  32, 112, 114, 101, 115, 105, 100, 101, 110, 116, 101,
         32, 100, 101, 108,  32, 243, 114, 103,  97, 110, 111,  32, 114,
        101, 103, 117, 108,  97, 100, 111, 114,  32, 100, 101,  32, 108,
         97, 115,  32,  84, 101, 108, 101,  99, 111, 109, 117, 110, 105,
         99,  97,  99, 105, 111, 110, 101, 115,  32, 115, 101,  32, 109,
        111, 115, 116, 114, 243,  32, 112,  97, 114, 116, 105, 100,  97,
        114, 105, 111,  32, 100, 101,  32,  99, 111, 109, 112, 108, 101,
        116,  97, 114,  32, 101, 115, 116,  97,  32, 108, 105,  98, 101,
        114,  97, 108, 105, 122,  97,  99, 105, 243, 110,  32, 100, 101,
         32, 108,  97, 115,  32, 116, 101, 108, 101,  99, 111, 109, 117,
        110, 105,  99,  97,  99, 105, 111, 110, 101, 115,  32,  99, 111,
        110,  32, 111, 116, 114,  97, 115,  32, 109, 101, 100, 105, 100,
         97, 115,  32, 113, 117, 101,  32, 105, 110,  99, 101, 110, 116,
        105, 118, 101, 110,  32, 108,  97,  32,  99

In [61]:
# check max values to see if there is something weird
max(ind_charseq[0]),max(ind_charseq[1]), max(ind_charseq[2])

(243, 17, 267)
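
The same check can be written as an explicit range test per row. In this sketch `check_index_ranges` is a hypothetical helper; the tag table sizes (18 and 278) come from the lengths computed earlier, while the character codebook size of 2048 is only an assumption (one entry per code point up to U+07FF), since the real size of the loaded codebook is not shown here.

```python
def check_index_ranges(rows, sizes):
    """Return the positions of the rows that hold an index outside [0, size)."""
    bad = []
    for pos, (row, size) in enumerate(zip(rows, sizes)):
        if any(i < 0 or i >= size for i in row):
            bad.append(pos)
    return bad


# first three columns of the encoded example sentence
rows = [[69, 108, 32], [5, 5, 17], [126, 126, 0]]
sizes = (2048, 18, 278)  # 2048 is an assumed character codebook size
print(check_index_ranges(rows, sizes))
```

An empty list means every index is a valid position in its table. A non-empty result points at the row (text, upos or deprel) that needs a closer look.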

In [62]:
%%time
list_charseq = []
# es_pcnl = pyconll.load_from_file(es_test)
for seq in es_pcnl:
    charseq = seq2charlevel(seq, deprel=True)
#     ind_charseq = charseq2int(charseq, char2idx, UPOS_CHAR2IDX, DEPREL_CHAR2IDX)
    list_charseq.append(charseq)
#     list_ind_charseq.append(ind_charseq)

CPU times: user 1.02 s, sys: 135 ms, total: 1.15 s
Wall time: 1.17 s


In [63]:
len(list_charseq)

14305

In [64]:
%%time
list_ind_charseq = []
count = 0
for charseq in list_charseq:
    count+=1
    ind_charseq = charseq2int(charseq, char2idx, UPOS_CHAR2IDX, DEPREL_CHAR2IDX)
    list_ind_charseq.append(ind_charseq)

CPU times: user 1.44 s, sys: 20.1 ms, total: 1.46 s
Wall time: 1.44 s


In [65]:
len(list_ind_charseq)

14305

In [66]:
from langmodels.utils.preprocess_conllu import *

In [67]:
%%time
# len(conllu_train), len(conllu_test), len(conllu_dev)
train_data = process_all(conllu_train, char2idx, return_data=True)

Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, n

In [68]:
%%time
test_data = process_all(conllu_test, char2idx, return_data=True)

Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, n

In [69]:
%%time
dev_data = process_all(conllu_dev, char2idx, return_data=True)

Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, not NoneType
Token exception:  must be str, n
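
The body of `process_all` is not shown in this notebook, so the following is only a guess at the cause of these messages: `str.find` raises a `TypeError` when its argument is `None`, which would happen for any token whose form is `None`. The sketch uses a stand-in `Token` tuple (not the pyconll class) and a hypothetical `split_by_form` helper to separate such tokens so they can be inspected or skipped.

```python
from collections import namedtuple

# stand-in for a parsed token; only the fields needed for the example
Token = namedtuple("Token", ["id", "form"])


def split_by_form(tokens):
    """Separate tokens that have a surface form from those whose form is None."""
    usable = [t for t in tokens if t.form is not None]
    missing = [t for t in tokens if t.form is None]
    return usable, missing


tokens = [Token("1", "El"), Token("2", None), Token("3", "órgano")]
usable, missing = split_by_form(tokens)
print(len(usable), len(missing))

# searching for a None form is what fails
try:
    "El órgano".find(None)
except TypeError as err:
    print("TypeError:", err)
```

If this is indeed the cause, counting the `missing` tokens per file would show which treebanks are affected and whether skipping those tokens loses a meaningful amount of data.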