UD parser

Mika Hämäläinen edited this page Mar 9, 2019 · 2 revisions

Parsing CoNLL-U formatted data

First you need to parse the CoNLL-U formatted file into a UD collection.

import codecs
from uralicNLP.ud_tools import UD_collection
# Open the file as UTF-8 and parse it into a UD collection
ud = UD_collection(codecs.open("file.conllu", encoding="utf-8"))

You can loop over the sentences in a UD collection and over the words in each sentence:

for sentence in ud:
	for word in sentence:
		print(word.pos, word.lemma, word.get_attribute("deprel"))

To parse an individual sentence from a string, use parse_sentence:

from uralicNLP.ud_tools import parse_sentence
conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_"
sentence = parse_sentence(conl)
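
To make the structure of such a string concrete, here is a minimal standard-library sketch that splits one CoNLL-U sentence into its comment lines and its ten tab-separated columns. It does not use uralicNLP and is not how parse_sentence is implemented. The helper name split_conllu_sentence and the lowercase field names (borrowed from the column names in the CoNLL-U format description) are assumptions made for illustration; the attribute names uralicNLP exposes may differ.

```python
# Column names from the CoNLL-U format description, lowercased (assumed naming).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def split_conllu_sentence(text):
    """Hypothetical helper: return (comments, tokens) for one CoNLL-U sentence."""
    comments, tokens = [], []
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            comments.append(line[1:].strip())  # e.g. "text = Toinen palkinto"
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        # Pair each column with its field name
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    return comments, tokens


conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_"
comments, tokens = split_conllu_sentence(conl)
for token in tokens:
    print(token["form"], token["upos"], token["deprel"])
```

Each token ends up as a plain dictionary keyed by field name, which is roughly the shape of data that a query such as {"lemma": "kissa"} is matched against.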

Search functionality

Both a UD collection and a UD sentence can be searched for matching entries.

# Find all sentences that contain the lemma "kissa"
sentences = ud.find_sentences(query={"lemma": "kissa"})
for sentence in sentences:
    words = sentence.find(query={"lemma": "kissa"})
    # Print the form of the first matching word in each sentence
    print(words[0].get_attribute("form"))

If find and find_sentences are called without arguments, they return everything (all words or all sentences, respectively). A query can contain any of the fields specified in the CoNLL-U format description, and its values can be either strings or regular expression patterns.
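
The matching behaviour described here can be sketched with the standard library alone. This is not uralicNLP's implementation: the function name matches_query, the toy token dictionaries, the requirement that every query entry must match, and the choice of exact equality for strings and `search` for compiled patterns are all assumptions made for illustration.

```python
import re


def matches_query(token, query=None):
    """Hypothetical helper: True if the token satisfies every entry in query."""
    if not query:
        return True  # an empty or missing query matches everything
    for field, wanted in query.items():
        value = token.get(field, "")
        if hasattr(wanted, "search"):  # compiled regex pattern
            if wanted.search(value) is None:
                return False
        elif value != wanted:  # plain string: exact comparison
            return False
    return True


# Toy tokens, invented for this example
tokens = [
    {"form": "kissat", "lemma": "kissa", "upos": "NOUN"},
    {"form": "juoksevat", "lemma": "juosta", "upos": "VERB"},
]

by_lemma = [t for t in tokens if matches_query(t, {"lemma": "kissa"})]
k_forms = [t["form"] for t in tokens
           if matches_query(t, {"form": re.compile(r"^k")})]
everything = [t for t in tokens if matches_query(t)]
```

With the real library, the equivalent queries would be passed through the query argument of find or find_sentences.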