# Working with CoNLL-Us

CoNLL-Us and CoNLL-Xs are two highly similar types of files used for storing annotated linguistic data. CoNLL-U files in particular are explicitly designed to express their data in the [Universal Dependencies](https://universaldependencies.org/) framework, as will be seen below.

Below, we will explore three ways of working with CoNLL files, using a sample sentence (with lots of non-punctuation periods!). We will work through the structure of CoNLL files as we go!

## The conllu package

One of the most efficient ways to work with CoNLL files in Python is the [`conllu` package](https://pypi.org/project/conllu/). As always, you need to install it with pip first by running `pip install conllu` in the Anaconda Prompt. Then import it with the command below:

In [23]:
from conllu import parse

For our purposes we will use a series of annotated sentences made with Stanza. You may uncomment the line below to work with a set generated by Trankit, but this has less metadata.

In [24]:
annotated_file = 'test_stanza.conllu'
#annotated_file = 'test_eng_trankit.conllu'

with open(annotated_file, encoding='utf-8') as f:
    sentences = parse(f.read())

`sentences` is a list containing each sentence in the document. In the file, sentences are separated by a blank line (that is, two line breaks instead of one). Let's take a closer look:

In [25]:
print(f'Found {len(sentences)} sentences.')
sample_sen = sentences[0]
sample_sen

Found 3 sentences.


TokenList<", Hello, Mr., Black, ,, ", J., J., said, ., metadata={text: ""Hello Mr. Black," J. J. said.", sent_id: "0"}>

That looks ugly! Let's take a closer look at what's going on.

Every sentence in the `conllu` library is rendered as a `TokenList` that includes `metadata`, indicated in the file with lines beginning with `#` before the tokens of the sentence. `metadata` behaves like a dict, and in this case (thanks to Stanza) includes `text`, which is the unedited, untokenized sentence. Let's see what we're dealing with:

In [26]:
print(sample_sen.metadata.get('text'))

"Hello Mr. Black," J. J. said.

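To make the file layout concrete, here is a minimal sketch that does by hand what `parse` handles for us: it splits raw CoNLL-U text into sentence blocks at blank lines and collects the `# key = value` comment lines into a dict. The `raw` string and the helper name `read_metadata` are made up for illustration, and comment lines without an `=` are simply skipped here.

```python
# Hypothetical two-sentence file; a blank line closes each sentence block.
raw = (
    "# text = Hello there.\n"
    "# sent_id = 0\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\tthere\tthere\tADV\tRB\t_\t1\tadvmod\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# text = Bye.\n"
    "# sent_id = 1\n"
    "1\tBye\tbye\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
    "\n"
)


def read_metadata(block):
    """Collect the '# key = value' comment lines of one sentence block."""
    metadata = {}
    for line in block.splitlines():
        if line.startswith("#") and "=" in line:
            key, value = line[1:].split("=", 1)
            metadata[key.strip()] = value.strip()
    return metadata


# Blank lines separate sentences, so split on them and drop empty leftovers.
blocks = [b for b in raw.split("\n\n") if b.strip()]
all_metadata = [read_metadata(b) for b in blocks]

print(f"Found {len(blocks)} sentences.")
for meta in all_metadata:
    print(meta["sent_id"], meta["text"])
```

In practice you would still let `conllu` do this; the sketch is only meant to show where the metadata comes from.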

The main elements of the sentence are split into _tokens_ and accessed as entries in the `TokenList`, which behaves like a list: you can iterate over it and index into it. Note that printing a token and displaying it as cell output lead to different results! Printing shows only the token's form:

In [27]:
for t in sample_sen:
    print(t)

"
Hello
Mr.
Black
,
"
J.
J.
said
.


but displaying a token as cell output shows all of its fields:

In [28]:
said = sample_sen[-2]
said

{'id': 9,
 'form': 'said',
 'lemma': 'say',
 'upos': 'VERB',
 'xpos': 'VBD',
 'feats': {'Mood': 'Ind',
  'Number': 'Sing',
  'Person': '3',
  'Tense': 'Past',
  'VerbForm': 'Fin'},
 'head': 0,
 'deprel': 'root',
 'deps': None,
 'misc': {'start_char': '25', 'end_char': '29'}}

Note that tokens are really encoded as dictionaries. To see what's fully going on, let's look at each of these elements. They correspond to the standard columns in a CoNLL file:

* `id`: the place of the word in the sentence
* `form`: the exact form the token takes on in the sentence
* `lemma`: the dictionary or generalized form of the word
* `upos`: the POS in accordance with Universal Dependencies POS tags
* `xpos` (optional): a language-specific POS tag from an alternative schema. The tags Stanza's English pipeline produces here (e.g. `VBD`, `NNP`, `UH`) are Penn Treebank tags.
* `feats`: Encoded as a dictionary. Specifies the grammatical or morphological features the word takes on, e.g. `Tense: Past` means the verb is a past-tense form.
* `head`: the `id` of the token on which this word depends syntactically. 0 is reserved for the `root`, which indicates this token is the main element of the sentence.
* `deprel`: how this word relates to the head
* `deps`: Advanced. The enhanced dependency graph, an alternative way of representing dependencies.
* `misc`: Additional information. Stanza helpfully specifies the [start_char, end_char) of the token in the sentence
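To see how these columns line up with a raw file, here is a minimal sketch that turns three token lines from the sample sentence into dictionaries and then follows each `head` to the token it points at. The helper names `parse_pairs` and `parse_token_line` are made up for this example. It assumes plain integer ids (no multiword ranges) and that every `feats` or `misc` entry has the form `Key=Value`, which holds for this sample but not for every file.

```python
# The ten column names, in file order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_pairs(value):
    """Turn 'A=1|B=2' into {'A': '1', 'B': '2'}; '_' marks an empty column."""
    if value == "_":
        return None
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_token_line(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    token = dict(zip(FIELDS, line.split("\t")))
    token["id"] = int(token["id"])      # assumes plain integer ids
    token["head"] = int(token["head"])
    token["feats"] = parse_pairs(token["feats"])
    token["misc"] = parse_pairs(token["misc"])
    return token


# Tokens 7 to 9 of the sample sentence.
lines = [
    "7\tJ.\tJ.\tPROPN\tNNP\tNumber=Sing\t9\tnsubj\t_\tstart_char=19|end_char=21",
    "8\tJ.\tJ.\tPROPN\tNNP\tNumber=Sing\t7\tflat\t_\tstart_char=22|end_char=24",
    "9\tsaid\tsay\tVERB\tVBD\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"
    "\t0\troot\t_\tstart_char=25|end_char=29",
]
tokens = [parse_token_line(line) for line in lines]
by_id = {t["id"]: t for t in tokens}

for t in tokens:
    # head 0 is the artificial root, so there is no token to look up.
    head_form = "ROOT" if t["head"] == 0 else by_id[t["head"]]["form"]
    print(f'{t["form"]} --{t["deprel"]}--> {head_form}')
```

Looking tokens up by `id` rather than by list position matters because ids start at 1 while Python lists start at 0.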

You can iterate through the `TokenList` to modify or identify words. For instance, if we want to find all verbs, we might do the following:

In [29]:
verbs = []
for t in sample_sen:
    if t.get('upos') == 'VERB':
        verbs.append(t)

print(verbs)

[{'id': 9, 'form': 'said', 'lemma': 'say', 'upos': 'VERB', 'xpos': 'VBD', 'feats': {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Past', 'VerbForm': 'Fin'}, 'head': 0, 'deprel': 'root', 'deps': None, 'misc': {'start_char': '25', 'end_char': '29'}}]


If we wanted to, say, delete all periods that do _not_ serve a punctuation function in a sentence, we could do this:

In [30]:
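def replace_nonpunct_periods(sen):
    """Strip periods from every non-punctuation token and rebuild the sentence text."""
    punct = ['"', "'", '.', ',']
    for t in sen:
        # Abbreviations such as 'Mr.' are not tagged PUNCT, so they lose their periods
        if t.get('upos') != 'PUNCT':
            t['form'] = t['form'].replace('.', '')
            t['lemma'] = t['lemma'].replace('.', '')

    # Rebuild the text by joining the forms with spaces. Not perfect but it'll do
    text = " ".join(t.get('form') for t in sen)
    # Remove the space that the join put in front of each punctuation mark
    for p in punct:
        text = text.replace(f' {p}', p)
    sen.metadata['text'] = text

replace_nonpunct_periods(sample_sen)
print(sample_sen.metadata.get('text'))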

" Hello Mr Black," J J said.

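The stray space after the opening quotation mark comes from joining every token with a space. One possible alternative, sketched below, uses the `start_char` and `end_char` values that Stanza stores in `misc`: a space is added only where the original text had a gap between two tokens. The tuples are copied by hand from the sample sentence, and `rebuild_text` is a made-up helper name. With `conllu` tokens the offsets are strings, so you would need to convert them with `int()` first.

```python
# (form, start_char, end_char) for each token of the sample sentence,
# with the non-punctuation periods already removed from the forms.
tokens = [
    ('"', 0, 1), ("Hello", 1, 6), ("Mr", 7, 10), ("Black", 11, 16),
    (",", 16, 17), ('"', 17, 18), ("J", 19, 21), ("J", 22, 24),
    ("said", 25, 29), (".", 29, 30),
]


def rebuild_text(tokens):
    """Join the forms, adding a space only where the original text had a gap."""
    pieces = []
    previous_end = None
    for form, start, end in tokens:
        # A token starting after the previous one ended was preceded by whitespace.
        if previous_end is not None and start > previous_end:
            pieces.append(" ")
        pieces.append(form)
        previous_end = end
    return "".join(pieces)


print(rebuild_text(tokens))
```

Note that the offsets still describe the original text, so after editing the forms they are only useful for deciding where the spaces go.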

What if we want to return this to a .conllu file? We can use the `serialize()` method to turn the sentence back into CoNLL-U text and then write it to a file!

**Hint**: Remember to serialize *individual sentences*, not the whole list of sentences!

In [32]:
new_conllu = sample_sen.serialize()
print(new_conllu)

with open('modified_conllu.conllu', 'w', encoding='utf-8') as f:
    f.write(new_conllu)

# text = " Hello Mr Black," J J said.
# sent_id = 0
1	"	"	PUNCT	``	_	2	punct	_	start_char=0|end_char=1
2	Hello	hello	INTJ	UH	_	9	ccomp	_	start_char=1|end_char=6
3	Mr	Mr	PROPN	NNP	Number=Sing	2	vocative	_	start_char=7|end_char=10
4	Black	Black	PROPN	NNP	Number=Sing	3	flat	_	start_char=11|end_char=16
5	,	,	PUNCT	,	_	9	punct	_	start_char=16|end_char=17
6	"	"	PUNCT	''	_	9	punct	_	start_char=17|end_char=18
7	J	J	PROPN	NNP	Number=Sing	9	nsubj	_	start_char=19|end_char=21
8	J	J	PROPN	NNP	Number=Sing	7	flat	_	start_char=22|end_char=24
9	said	say	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	_	start_char=25|end_char=29
10	.	.	PUNCT	.	_	9	punct	_	start_char=29|end_char=30




## Visualizing CoNLL-Us: Palmyra

You can also view and edit CoNLL files in a graphical tool. This is especially useful for manually correcting the syntax so that you can train a model or carry out a quantitative assessment later.

I recommend using [Palmyra](https://camel-lab.github.io/palmyra/viewtree.html) by the CAMeL Lab. For this, choose `ud.config` and then upload the file. We'll go over this visually in class.

## CoNLL-Us as .TSVs

CoNLL-Us are, at their core, tab-separated spreadsheets. It is therefore possible to work with CoNLL-Us using Python's built-in `csv` module. This is better for *writing* CoNLL files than *reading* them, and we will not cover it in depth, but we will show how to load a file this way.

In [33]:
import csv

In [None]:
with open(annotated_file, encoding='utf8') as f:
    tsv = list(csv.reader(f, delimiter='\t'))

for l in tsv:
    print(l)

['# text = "Hello Mr. Black," J. J. said.']
['# sent_id = 0']
['1', '\t', 'PUNCT', '``', '_', '2', 'punct', '_', 'start_char=0|end_char=1']
['2', 'Hello', 'hello', 'INTJ', 'UH', '_', '9', 'ccomp', '_', 'start_char=1|end_char=6']
['3', 'Mr.', 'Mr.', 'PROPN', 'NNP', 'Number=Sing', '2', 'vocative', '_', 'start_char=7|end_char=10']
['4', 'Black', 'Black', 'PROPN', 'NNP', 'Number=Sing', '3', 'flat', '_', 'start_char=11|end_char=16']
['5', ',', ',', 'PUNCT', ',', '_', '9', 'punct', '_', 'start_char=16|end_char=17']
['6', '\t', 'PUNCT', "''", '_', '9', 'punct', '_', 'start_char=17|end_char=18']
['7', 'J.', 'J.', 'PROPN', 'NNP', 'Number=Sing', '9', 'nsubj', '_', 'start_char=19|end_char=21']
['8', 'J.', 'J.', 'PROPN', 'NNP', 'Number=Sing', '7', 'flat', '_', 'start_char=22|end_char=24']
['9', 'said', 'say', 'VERB', 'VBD', 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin', '0', 'root', '_', 'start_char=25|end_char=29']
['10', '.', '.', 'PUNCT', '.', '_', '9', 'punct', '_', 'start_char=29|en

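Note that it breaks around the quotation marks! By default, `csv.reader` treats `"` as a quoting character, so in the rows for the quotation-mark tokens the form and lemma columns collapse into a single `'\t'` field and every later column shifts one place to the left. As an exercise, let's get a set of all UPOS in this corpus: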

In [37]:
upos_list = []

for r in tsv:
    # Comment lines and blank lines have fewer than five fields, so this keeps only token rows
    if len(r) > 4:
        upos_list.append(r[3])

print(list(set(upos_list)))

['ADV', "''", 'VERB', '``', 'INTJ', 'PROPN', 'PUNCT']


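Well, it's better than nothing, but notice that two XPOS tags for the quotation marks slipped into the list because those rows lost a column. Lesson learned: don't use the `csv` module with its default settings to read CoNLL-Us, just to write them.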
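If you do want to stay with `csv`, one possible workaround is to switch off quote handling. The sketch below reads two rows of the sample sentence with `quoting=csv.QUOTE_NONE`, so `"` is treated as an ordinary character, and then writes them back unchanged. The in-memory `io.StringIO` objects stand in for real files. Passing `quotechar=None` to the writer is my assumption about the simplest way to stop it from trying to escape the `"` fields.

```python
import csv
import io

# Two token rows from the sample sentence, including a quotation-mark token.
raw = (
    '1\t"\t"\tPUNCT\t``\t_\t2\tpunct\t_\tstart_char=0|end_char=1\n'
    "2\tHello\thello\tINTJ\tUH\t_\t9\tccomp\t_\tstart_char=1|end_char=6\n"
)

# QUOTE_NONE tells the reader not to give '"' any special meaning.
reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Every token row now has all ten columns, so index 3 really is UPOS.
upos = sorted({row[3] for row in rows if len(row) == 10})
print(upos)

# For writing, also unset quotechar so '"' is written as-is.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE,
                    quotechar=None, lineterminator="\n")
writer.writerows(rows)
round_trip_ok = out.getvalue() == raw
print(round_trip_ok)
```

Comment lines and blank lines still come back as short rows, which is why the sketch checks for exactly ten fields before reading a column.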