
UD parser

Mika Hämäläinen edited this page Jan 26, 2021 · 3 revisions

Parsing CoNLL-U formatted data

First, parse the CoNLL-U formatted file into a UD collection:

import codecs
from uralicNLP.ud_tools import UD_collection
# open the file as UTF-8 and read it into a UD collection
ud = UD_collection(codecs.open("file.conllu", encoding="utf-8"))

You can loop over the sentences in a UD collection and over the words in each sentence:

for sentence in ud:
	for word in sentence:
		print(word.pos, word.lemma, word.get_attribute("deprel"))

An individual sentence can be parsed directly from a CoNLL-U string:

from uralicNLP.ud_tools import parse_sentence
conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_"
sentence = parse_sentence(conl)
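
For reference, each token line in a CoNLL-U string like the one above has ten tab-separated columns. The sketch below uses only the Python standard library to show which column each value belongs to. It is an illustration, not how uralicNLP parses the data: the helper name `split_conllu_token` is hypothetical, and the lowercase keys are simply the CoNLL-U column names in lowercase, which may differ from the attribute names the library accepts.

```python
# The ten CoNLL-U columns, in the order given by the format description.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def split_conllu_token(line):
    """Map one tab-separated CoNLL-U token line to a dict keyed by column name.

    Hypothetical helper for illustration; not part of uralicNLP.
    """
    values = line.split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 tab-separated columns, got %d" % len(values))
    return dict(zip(CONLLU_FIELDS, values))

# First token line of the example sentence "Toinen palkinto".
token = split_conllu_token("1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_")
print(token["form"], token["lemma"], token["deprel"])
```

Comment lines starting with `#` (such as `# text = ...`) are sentence-level metadata and are not token lines, so they should not be passed to this helper.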

Search functionality

Both a UD collection and a UD sentence can be searched for matching entries.

# find all sentences that contain a word with the lemma kissa
sentences = ud.find_sentences(query={"lemma": "kissa"})
for sentence in sentences:
    # find all words in this sentence with the lemma kissa
    word = sentence.find(query={"lemma": "kissa"})
    # print the form of the first matching word in each sentence
    print(word[0].get_attribute("form"))

If find or find_sentences is called without arguments, it returns everything (all words or all sentences, respectively). A query can contain any of the fields specified in the CoNLL-U format description. Query values can be string objects or matchable regex patterns.
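
To make the query semantics above more concrete, here is a minimal sketch of how a query mixing plain strings and regex patterns could be evaluated against a word. It is not uralicNLP's implementation: `word_matches` and the plain-dict word representation are hypothetical, and the choice of `match` (anchored at the start of the value) for patterns is an assumption.

```python
import re

def word_matches(word, query):
    """Return True if every field in the query matches the word.

    Hypothetical helper: `word` is a plain dict here, not a uralicNLP object.
    """
    for field, expected in query.items():
        value = word.get(field)
        if value is None:
            return False
        if isinstance(expected, str):
            # Plain strings must be equal to the field value.
            if value != expected:
                return False
        elif expected.match(value) is None:
            # Assumption: anything else is a compiled pattern, matched from the start.
            return False
    return True

words = [
    {"form": "Toinen", "lemma": "toinen", "deprel": "nummod"},
    {"form": "palkinto", "lemma": "palkinto", "deprel": "root"},
]

# An empty query places no constraints, so every word matches.
everything = [w for w in words if word_matches(w, {})]
# A string value matches exactly; a compiled pattern matches by regex.
roots = [w for w in words if word_matches(w, {"deprel": "root"})]
p_lemmas = [w for w in words if word_matches(w, {"lemma": re.compile(r"p")})]
print(len(everything), roots[0]["form"], p_lemmas[0]["form"])
```

With several fields in one query, this sketch requires all of them to match, which is one reasonable reading of the description above.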
