# Discourse-level annotations

See the learning materials associated with this exercise <a href="https://applied-language-technology.mooc.fi/html/notebooks/part_iii/06_text_linguistics.html" target="blank_">here</a>.

For instructions on how to use TestMyCode (TMC) to test your code and submit it to the server, see <a href="https://applied-language-technology.mooc.fi/html/tmc.html" target="blank_">here</a>.

Remember to save this Notebook before testing your code. Press <kbd>Control</kbd>+<kbd>s</kbd> or select the *File* menu and click *Save*.

**The maximum number of points for this exercise is 25.**

## 1. Import the *conllu* library (1 point)

Import the *conllu* library (`conllu`) into Python.

In [1]:
# Write your answer below this line
import conllu


## 2. Load CoNLL-U annotations from file (3 points)

A directory named `data` in the exercise directory contains a file named `GUM_interview_gaming.conllu`. The file contains CoNLL-U-compliant annotations.

Load this file into Python and read its contents.

Assign the resulting string object under the variable `ann`.

In [2]:
# Write your answer below this line
with open('data/GUM_interview_gaming.conllu', mode="r", encoding="utf-8") as data:
    
    # Read the file contents and assign under 'ann'
    ann = data.read()

## 3. Parse CoNLL-U annotations (3 points)

Use the `conllu` library to parse the annotations in the string object `ann`.

Store the result under the variable `c_ann`.

*Tip*: If you run into unexpected error messages, try restarting the kernel.

In [3]:
# Write your answer below this line
c_ann = conllu.parse(ann)
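
To make the structure of the parsed annotations easier to picture, the following sketch splits a tiny, made-up CoNLL-U sentence by hand using only the standard library. It is not how `conllu` is implemented, and the sample sentence and the names `FIELDS`, `sample`, `metadata` and `tokens` are assumptions for illustration. It only shows that comment lines carry sentence-level metadata and that each token line has ten tab-separated columns.

```python
# The ten CoNLL-U columns, in the order defined by the format
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A made-up two-token sentence; columns are separated by tabs
sample = "\n".join([
    "# s_type = decl",
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_",
])

metadata, tokens = {}, []

for line in sample.splitlines():

    if line.startswith("#"):

        # Comment lines hold sentence-level metadata as 'key = value'
        key, _, value = line.lstrip("# ").partition(" = ")
        metadata[key] = value

    elif line:

        # Token lines hold ten tab-separated columns
        tokens.append(dict(zip(FIELDS, line.split("\t"))))

print(metadata)
print(tokens[1]["form"], tokens[1]["upos"])
```

The `conllu` library goes further than this sketch: it also converts values, for example turning the `misc` column into a dictionary or `None`, which the code in section 6 relies on.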


## 4. Examine CoNLL-U annotations (3 points)

Get the third *TokenList* object in the list `c_ann`. Then retrieve the universal part-of-speech tag for the *Token* at index 5 of the *TokenList* object.

Assign the tag under the variable `tag`.

In [4]:
# Write your answer below this line
tag = c_ann[2][5]['upos']

In [5]:
tag

'PUNCT'

## 5. Collect information on sentence mood (5 points)

Each *TokenList* object in the list `c_ann` contains information on the grammatical mood of the sentence.

This information is stored under the key `s_type` of the `metadata` attribute.

Collect this information for every *TokenList* object into a list named `moods`.

In [6]:
# Write your answer below this line
# Initialize an empty list to store the moods
moods = []

# Iterate through each TokenList object in the list c_ann
for token_list in c_ann:
    # Check if the 's_type' key exists in the metadata attribute
    if 's_type' in token_list.metadata:
        # Append the value of 's_type' to the moods list
        moods.append(token_list.metadata['s_type'])
    else:
        # Handle the case where 's_type' is not present (e.g., set it to None)
        moods.append(None)
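
One way to inspect a list like `moods` is to count how often each value occurs. The sketch below uses `collections.Counter` on a stand-in list named `sample_moods`; the values in it are made up for illustration and are not taken from the GUM file.

```python
from collections import Counter

# Stand-in for the list 'moods'; these values are illustrative only
sample_moods = ["decl", "wh", "decl", None, "imp", "decl"]

# Count the occurrences of each mood
mood_counts = Counter(sample_moods)

# Print the moods from the most to the least frequent
for mood, count in mood_counts.most_common():
    print(mood, count)
```

Sentences without an `s_type` entry show up under `None`, which makes missing metadata easy to spot.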

## 6. Collect information from the CoNLL-U annotations (5 points)

The following code collects information from the CoNLL-U annotations stored under the list `c_ann`.

Finish the code by collecting the form of each *Token* into the list `words` and its universal part-of-speech tag into the list `pos`. 

In [13]:
# Create placeholder lists
words, pos, spaces, sents = [], [], [], []

# Collect words, spaces and sentence starts from the TokenList
# objects stored in the list 'c_ann'.
for sentence in c_ann:
    
    # Track sentence starts
    start_sent = True
    
    # Loop over tokens
    for token in sentence:
        
        # Check if this Token starts a sentence
        if start_sent:
            
            sents.append(True)
            start_sent = False
        
        # Otherwise append False
        else:
            
            sents.append(False)
            
        # Check if the Token is not followed by a space
        if token['misc'] is not None and 'SpaceAfter' in token['misc'] and token['misc']['SpaceAfter'] == 'No':
            
            spaces.append(False)
        
        # Otherwise append True
        else:
            
            spaces.append(True)
        
        # Write your answer below this line. Remember to indent your code!
        words.append(token['form'])
        pos.append(token['upos'])
        
        

In [18]:
print(len(spaces))
print(len(words))
print(len(pos))
len(sents)

717
717
717


717
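
The role of the Boolean flags in `spaces` can be illustrated by rebuilding plain text from word forms. This is a minimal sketch using stand-in lists named `sample_words` and `sample_spaces`, whose values are made up for illustration: a space is added after a word only when its flag is `True`.

```python
# Stand-in lists shaped like 'words' and 'spaces'; the values are illustrative
sample_words = ["Hello", ",", "world", "!"]
sample_spaces = [False, True, False, False]

# Append a space after each word only when its flag is True
text = "".join(
    word + (" " if space else "")
    for word, space in zip(sample_words, sample_spaces)
)

print(text)
```

Because `zip()` pairs the lists item by item, the lists must have the same length, which is what the length check above confirms.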

## 7. Create a spaCy *Doc* object manually (5 points)

Use the information collected into the lists `words`, `pos`, `spaces` and `sents` to create a spaCy *Doc* object manually.

Assign the resulting *Doc* object under the variable `doc`.

*Tip*: You need to provide inputs to the following variables: `vocab`, `words`, `pos`, `spaces` and `sent_starts`.

In [19]:
# Import the spaCy library and the Doc class
from spacy.tokens import Doc
import spacy

# Load a small language model for English
nlp = spacy.load('en_core_web_sm')

# Write your answer below this line
doc = Doc(vocab=nlp.vocab,
          words=words,
          pos=pos,
          spaces=spaces,
          sent_starts=sents)
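
A quick way to check a manually created *Doc* is to look at its text, part-of-speech tags and sentence boundaries. The sketch below assumes spaCy 3 and builds a small *Doc* from made-up lists (`sample_words`, `sample_spaces`, `sample_pos` and `sample_sents` are illustrative names and values). It uses an empty `Vocab` instead of a language model so that no model download is needed.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Made-up inputs shaped like 'words', 'spaces', 'pos' and 'sents'
sample_words = ["Hello", ",", "world", "!", "Bye", "."]
sample_spaces = [False, True, False, True, False, False]
sample_pos = ["INTJ", "PUNCT", "NOUN", "PUNCT", "INTJ", "PUNCT"]
sample_sents = [True, False, False, False, True, False]

# Create the Doc using an empty vocabulary
sample_doc = Doc(Vocab(),
                 words=sample_words,
                 spaces=sample_spaces,
                 pos=sample_pos,
                 sent_starts=sample_sents)

# Inspect the text, the part-of-speech tags and the sentences
print(sample_doc.text)
print([token.pos_ for token in sample_doc])
print([sent.text for sent in sample_doc.sents])
```

The same three checks can be run on `doc` to confirm that the text, tags and sentence boundaries match the CoNLL-U annotations.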
