# Final Examination (Notebook III)

For instructions on how to use TestMyCode (TMC) to test your code and submit it to the server, see <a href="https://applied-language-technology.mooc.fi/html/tmc.html" target="_blank">here</a>.

Remember to save this Notebook before testing your code. Press <kbd>Control</kbd>+<kbd>s</kbd> or select the *File* menu and click *Save*.

**The maximum number of points for this Notebook is 45.**

⚠️ Set the variable `grade` in the cell below to `True` to enable testing. ⚠️

The commands `tmc test` and `tmc submit` work only when the `grade` variable has been set to `True`. 

You can disable testing for some Notebooks to speed up the process before submitting.

In [15]:
# Set the value of the variable 'grade' to True to enable testing and submitting
grade = True

[TMC] This Notebook will be graded.

## 1. Load CoNLL-U annotated corpora into Python and parse the annotations (15 points)

**Prerequisites for this exercise**: None.

The directory `data` contains 10 files from the Georgetown University Multilayer Corpus (GUM), which contain annotations in the CoNLL-U format. These files can be identified by the suffix `conllu`.

Open each file for reading, read its contents and store the resulting string objects into a list named `annotations`.

Use the `conllu` library to parse the CoNLL-U compliant annotations stored in the list `annotations`. Store the resulting lists of *TokenList* objects into a list named `doc_lists`.
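To make the input concrete: a CoNLL-U file is plain text in which comment lines starting with `#` carry sentence-level metadata and each token line has ten tab-separated fields. The sketch below splits one sentence by hand to show that structure. It is only an illustration and not how the `conllu` library is implemented; the sample sentence, its metadata values and the helper name `split_sentence` are made up for this example.

```python
# Field names of the ten CoNLL-U columns, in order
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hypothetical two-token sentence; real GUM files are much longer
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\t_\n"
)


def split_sentence(block):
    """Split one CoNLL-U sentence block into metadata and token rows."""
    metadata, tokens = {}, []
    for line in block.splitlines():
        if line.startswith("#"):
            # Comment lines look like '# key = value'
            key, _, value = line[1:].partition("=")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            # Token lines have ten tab-separated columns
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    return metadata, tokens


metadata, tokens = split_sentence(SAMPLE)
print(metadata["sent_id"])            # example-1
print([t["form"] for t in tokens])    # ['Hello', 'world']
```

The `conllu` library additionally converts the fields into richer Python values, which is why the exercise asks you to use it instead of splitting lines by hand.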

In [16]:
# Write your answer below this line. Please enter your entire solution in this cell.
import os
from conllu import parse

# Specify the directory containing the CoNLL-U files
data_directory = "data"

# Initialize a list to store the contents of each CoNLL-U file
annotations = []

# Initialize a list to store the parsed TokenList objects
doc_lists = []

# List all files in the data directory with the ".conllu" suffix, sorted so
# that the documents are processed in the same order on every run
conllu_files = sorted(file for file in os.listdir(data_directory) if file.endswith(".conllu"))

# Loop over each CoNLL-U file
for conllu_file in conllu_files:
    # Construct the full path to the file
    file_path = os.path.join(data_directory, conllu_file)

    # Read the contents of the CoNLL-U file
    with open(file_path, "r", encoding="utf-8") as file:
        conllu_contents = file.read()
    
    # Append the contents to the 'annotations' list
    annotations.append(conllu_contents)

    # Parse the CoNLL-U compliant annotations and store the TokenList objects
    doc_list = parse(conllu_contents)
    doc_lists.append(doc_list)


[TMC] The variable "doc_lists" was defined successfully! 1 point.

[TMC] The variable "doc_lists" contains a list! 1 point.

[TMC] The list "doc_lists" contains TokenList objects! 13 points.

## 2. Collect information on discourse unit boundaries (15 points)

**Prerequisites for this exercise**: You must have completed exercise 1 in this Notebook. You can use variables defined in exercise 1.

Collect information on discourse unit boundaries from the documents stored in `doc_lists`. This information can be found under the `misc` attribute of individual *Token* objects.

Create a dictionary named `edu_indices` and populate this dictionary with keys and values. 

The keys should correspond to the unique identifiers for each document in `doc_lists`, whereas their corresponding values should consist of a list with numbers that give the indices for the *Tokens* that start a new discourse unit. 

The unique identifiers can be found in the metadata for each *TokenList* object contained in a document.

*Tips*: Think about how the different objects are organised in `doc_lists` and define nested `for` loops accordingly. Use a counter to track the indices, and think carefully about where and when you reset and update the counter. Pay attention to the criteria used to evaluate whether the key `misc` of a *Token* object contains any values.
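The last tip concerns how an empty MISC column is represented. In CoNLL-U an underscore marks an empty field, and the `is not None` check used in the answer cell suggests that the parsed value is then `None` and not an empty dictionary. A minimal sketch of that distinction, using plain dictionaries as stand-ins for *Token* objects; the `Discourse` value is a made-up placeholder and `starts_discourse_unit` is a hypothetical helper name:

```python
def starts_discourse_unit(token):
    """Return True if the token's MISC field has a 'Discourse' key."""
    misc = token["misc"]
    # An empty MISC column is None, so test for None before the membership test
    return misc is not None and "Discourse" in misc


# Plain dicts standing in for Token objects; values are illustrative only
tokens = [
    {"form": "Hello", "misc": {"Discourse": "placeholder"}},
    {"form": "world", "misc": None},
    {"form": "!", "misc": {"SpaceAfter": "No"}},
]

flags = [starts_discourse_unit(t) for t in tokens]
print(flags)  # [True, False, False]
```

Evaluating `"Discourse" in None` directly would raise a `TypeError`, which is why the `None` check comes first.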

In [18]:
# Write your answer below this line. Please enter your entire solution in this cell.
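# Create an empty dictionary to store the discourse unit information
edu_indices = {}

# Running count of Tokens whose 'misc' field carries a 'Discourse' annotation
counter = 0

# Loop over each document, each sentence (TokenList) and each Token
for doc_list in doc_lists:
    for sentence in doc_list:
        for token in sentence:
            if token['misc'] is not None and 'Discourse' in token['misc']:
                counter += 1
                # The key is the sentence type ('s_type'), so each value is
                # overwritten with the latest running count
                edu_indices[sentence.metadata['s_type']] = counter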








[TMC] The variable "edu_indices" was defined successfully! 1 point.

[TMC] The variable "edu_indices" contains a dictionary! 1 point.

[TMC] The dictionary "edu_indices" does not contain the expected items.

In [5]:
edu_indices

{'decl': 689, 'frag': 669, 'other': 517, 'sub': 671, 'multiple': 592, 'q': 670}
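The dictionary above is keyed on sentence types and holds a single running count per key, whereas the exercise asks for document identifiers mapped to lists of token indices. One possible shape for that is sketched below under several assumptions: the document identifier is stored under a `newdoc id` key in the metadata of a document's first sentence, indices count every token from zero across the whole document, and the `Sentence` class and sample data are stand-ins so that the example runs without the corpus files.

```python
class Sentence(list):
    """Stand-in for a TokenList: a list of tokens with a metadata dict."""

    def __init__(self, tokens, metadata):
        super().__init__(tokens)
        self.metadata = metadata


def collect_edu_indices(doc_lists, id_key="newdoc id"):
    """Map each document id to the indices of tokens that start a discourse unit."""
    edu_indices = {}
    for doc_list in doc_lists:
        # Assumption: the first sentence carries the document identifier
        doc_id = doc_list[0].metadata[id_key]
        indices = []
        counter = 0  # reset for every document
        for sentence in doc_list:
            for token in sentence:
                misc = token["misc"]
                if misc is not None and "Discourse" in misc:
                    indices.append(counter)
                counter += 1  # advance for every token, not only matches
        edu_indices[doc_id] = indices
    return edu_indices


# Made-up data: one document with two sentences
unit = {"Discourse": "placeholder"}
doc = [
    Sentence([{"misc": unit}, {"misc": None}, {"misc": unit}],
             {"newdoc id": "doc_a"}),
    Sentence([{"misc": None}, {"misc": unit}], {}),
]

result = collect_edu_indices([doc])
print(result)  # {'doc_a': [0, 2, 4]}
```

With the real data the function could be called as `collect_edu_indices(doc_lists)`, provided the metadata key matches the one used in the corpus files. Whether the grader expects zero-based indices or a different treatment of multiword tokens is not stated in the exercise, so treat those choices as open.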



## 3. Convert the CoNLL-U annotations into spaCy *Doc* objects (15 points)

**Prerequisites for this exercise**: You must have completed exercise 1 in this Notebook. You can use variables defined in exercise 1.

Use the `conllu_to_docs` function in spaCy to convert the string objects in the list `annotations` into spaCy *Doc* objects.

Because this function creates a new *Doc* object for every 10 sentences defined in the CoNLL-U annotations, create a new *Doc* object and use the `from_docs` method to join the *Doc* objects. 

Store the resulting *Doc* objects into the list `conllu_docs`.

*Tip*: You need to import the spaCy *Doc* class as well.

In [32]:
# Write your answer below this line. Please enter your entire solution in this cell.
from spacy.tokens import Doc
from spacy.training.converters import conllu_to_docs

# Initialize a list to store the resulting Doc objects
conllu_docs = []

# Loop over each CoNLL-U annotation string in the 'annotations' list
for annotation in annotations:
    # conllu_to_docs yields one Doc per 10 sentences by default; collect them all
    docs = list(conllu_to_docs(annotation, no_print=True))

    # Join the partial Docs into a single Doc for the whole document
    doc = Doc.from_docs(docs)

    # Append the combined Doc object to the list
    conllu_docs.append(doc)



[TMC] The variable "conllu_docs" was defined successfully! 1 point.

[TMC] The variable "conllu_docs" contains a list! 1 point.

[TMC] The list "conllu_docs" contains the expected items! 13 points.