# Final Examination (Notebook II)

For instructions on how to use TestMyCode (TMC) to test your code and submit it to the server, see <a href="https://applied-language-technology.mooc.fi/html/tmc.html" target="_blank">the TMC instructions</a>.

Remember to save this Notebook before testing your code. Press <kbd>Control</kbd>+<kbd>s</kbd> or select the *File* menu and click *Save*.

**The maximum number of points for this Notebook is 45.**

⚠️ Set the variable `grade` in the cell below to `True` to enable testing. ⚠️

The commands `tmc test` and `tmc submit` work only when the `grade` variable has been set to `True`. 

To speed up testing before you submit, you can disable grading for individual Notebooks by setting `grade` to `False`.

In [53]:
# Set the value of the variable 'grade' to True to enable testing and submitting
grade = True

[TMC] This Notebook will be graded.

## 1. Load a text file and process it using spaCy (15 points)

**Prerequisites for this exercise**: None.

Import the spaCy library and load a small language model for English. Assign the resulting *Language* object under the variable `nlp_en`.

The directory `data` contains a file named `en_wiki.txt`. Open this file for reading, read the contents and store the resulting string object under the variable `text`. 

Then feed the string object under `text` to the spaCy *Language* object `nlp_en`. Store the resulting *Doc* object under the variable `wikidoc`.

In [54]:
# Write your answer below this line. Please enter your entire solution in this cell.
import spacy

# Load a small language model for English
nlp_en = spacy.load("en_core_web_sm")

# Specify the path to the text file
file_path = "data/en_wiki.txt"

# Read the contents of the text file
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

# Process the text using spaCy
wikidoc = nlp_en(text)

[TMC] The variable "nlp_en" was defined successfully! 1 point.

[TMC] The variable "wikidoc" was defined successfully! 1 point.

[TMC] The variable "wikidoc" contains the expected values! 13 points.
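
The read-then-process steps in this exercise can also be tried without downloading a model. The sketch below is only an illustration under two assumptions: a hypothetical one-sentence file in a temporary directory stands in for `data/en_wiki.txt`, and `spacy.blank("en")` (a tokenizer-only pipeline) stands in for `en_core_web_sm`. The resulting *Doc* therefore carries no part-of-speech tags or dependencies.

```python
import tempfile
from pathlib import Path

import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm
nlp_blank = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file standing in for data/en_wiki.txt
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Helium is a chemical element.", encoding="utf-8")

    # Read the whole file into a single string
    sample_text = sample_path.read_text(encoding="utf-8")

# Calling the Language object on a string returns a Doc
sample_doc = nlp_blank(sample_text)

# A Doc behaves like a sequence of Token objects
tokens = [token.text for token in sample_doc]
print(type(sample_doc).__name__, tokens)
```

The *Doc* keeps the original string intact, so `sample_doc.text` equals the text that was read from the file.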

## 2. Match patterns based on part-of-speech tags (15 points)

**Prerequisites for this exercise**: You must have completed exercise 1 in this Notebook. You can use variables defined in exercise 1.

Define a pattern rule for matching sequences of *Tokens* that have a **noun** as their coarse part-of-speech tag and **compound** as their syntactic dependency. In addition to individual *Tokens*, instruct spaCy to return matching sequences of *Tokens* that occur one or more times in the text.

Import the *Matcher* class and initialise a *Matcher* object using the *Vocabulary* of the *Language* object under `nlp_en`. Store the *Matcher* under the variable `n_matcher`.

Add the pattern rule to the *Matcher* `n_matcher`. Instruct spaCy to return only the longest matching sequences, and to return them as spaCy *Span* objects.

Apply the *Matcher* to the *Doc* object under the variable `wikidoc` and store the resulting matches under the variable `n_results`.

In [61]:
# Write your answer below this line. Please enter your entire solution in this cell.
from spacy.matcher import Matcher


# Initialize a Matcher object using the vocabulary of the Language object nlp_en
n_matcher = Matcher(nlp_en.vocab)
pattern = [
    {"POS": "NOUN", "DEP": "compound", "OP": "+"}
]

n_matcher.add('noun_compound', [pattern], greedy='LONGEST')

n_results = n_matcher(wikidoc, as_spans=True)

[TMC] The variable "n_matcher" was defined successfully! 1 point.

[TMC] The variable "n_results" was defined successfully! 1 point.

[TMC] The variable "n_results" contains the expected values! 13 points.
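
To see what `greedy='LONGEST'` changes, it may help to run the same pattern on a small hand-annotated *Doc*. The sketch below is an illustration, not part of the graded solution. The toy sentence, its tags, and the names `toy_doc`, `all_matcher` and `longest_matcher` are hypothetical, and a blank pipeline supplies the vocabulary so that no trained model is required.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank pipeline supplies the vocabulary; no trained model is needed
vocab = spacy.blank("en").vocab

# Hand-annotated toy sentence (hypothetical, not taken from en_wiki.txt)
words = ["The", "world", "helium", "supply", "fell"]
pos = ["DET", "NOUN", "NOUN", "NOUN", "VERB"]
deps = ["det", "compound", "compound", "nsubj", "ROOT"]
heads = [3, 3, 3, 4, 4]  # absolute index of each Token's head
toy_doc = Doc(vocab, words=words, pos=pos, deps=deps, heads=heads)

pattern = [{"POS": "NOUN", "DEP": "compound", "OP": "+"}]

# Without a greedy filter, every overlapping sub-sequence is returned
all_matcher = Matcher(vocab)
all_matcher.add("noun_compound", [pattern])
all_spans = all_matcher(toy_doc, as_spans=True)

# With greedy="LONGEST", overlapping matches are reduced to the longest one
longest_matcher = Matcher(vocab)
longest_matcher.add("noun_compound", [pattern], greedy="LONGEST")
longest_spans = longest_matcher(toy_doc, as_spans=True)

all_texts = sorted(span.text for span in all_spans)
longest_texts = [span.text for span in longest_spans]
print(all_texts)
print(longest_texts)
```

A compound noun with no neighbouring compound noun still matches on its own, because the `+` operator also accepts a single *Token*.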

In [61]:
n_results

[Van der Waals,
 temperature separation,
 integer transition,
 oil drilling,
 class blimp,
 temperature gas,
 hydrogen rocket,
 peak wartime,
 helium electron,
 electron cloud,
 der Waals,
 carbon cage,
 diamond anvil,
 proton chain,
 helium tube,
 world helium,
 gas well,
 air distillation,
 world helium,
 heat capacity,
 ground support,
 heat capacity,
 girl singing,
 emergency press,
 Helium,
 chemical,
 gas,
 boiling,
 line,
 helium,
 uranium,
 gas,
 silicon,
 helium,
 quantum,
 helium,
 alpha,
 helium-4,
 gas,
 astronomer,
 line,
 chemist,
 physicist,
 mineral,
 earth,
 mineral,
 physicist,
 mineral,
 alpha,
 glass,
 physicist,
 ζ,
 hydrogen,
 helium,
 integer,
 quantum,
 physicist,
 physicists,
 helium-3,
 gas,
 state,
 %,
 %,
 %,
 gas,
 helium,
 barrage,
 %,
 C,
 extraction,
 arc,
 mass,
 production,
 lift,
 helium,
 conservation,
 gas,
 helium,
 nitrogen,
 gas,
 %,
 %,
 %,
 extraction,
 helium,
 helium,
 gas,
 helium,
 helium,
 helium,
 helium,
 helium,
 supply,
 quantum,
 hydr

## 3. Match patterns based on syntactic dependencies (15 points)

**Prerequisites for this exercise**: You must have completed exercise 1 in this Notebook. You can use variables defined in exercise 1.

Define a pattern rule for matching Tokens that have a **verb** as their coarse part-of-speech tag, and their **subjects** and **objects**.

Import the *DependencyMatcher* class and initialise a *DependencyMatcher* object using the *Vocabulary* of the *Language* object under `nlp_en`. Store the *DependencyMatcher* under the variable `d_matcher`.

Add the pattern rule to the *DependencyMatcher* `d_matcher`.

Apply the *DependencyMatcher* to the *Doc* object `wikidoc` and store the resulting matches under the variable `d_results`.

*Tip*: Use the verb as the anchor pattern.

In [20]:
# Write your answer below this line. Please enter your entire solution in this cell.
from spacy.matcher import DependencyMatcher

# Initialize a DependencyMatcher object using the vocabulary of the Language object nlp_en
d_matcher = DependencyMatcher(nlp_en.vocab)
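pattern = [
    # Anchor: any Token tagged as a verb
    {
        'RIGHT_ID': 'verb',
        'RIGHT_ATTRS': {'POS': 'VERB'}
    },
    # A direct child of the verb that acts as its nominal subject
    {
        'LEFT_ID': 'verb',
        'REL_OP': '>',
        'RIGHT_ID': 'subject',
        'RIGHT_ATTRS': {'DEP': {'IN': ['nsubj']}}
    },
    # A direct child of the verb that acts as its direct object
    {
        'LEFT_ID': 'verb',
        'REL_OP': '>',
        'RIGHT_ID': 'object',
        'RIGHT_ATTRS': {'DEP': {'IN': ['dobj']}}
    }
]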

d_matcher.add('verb_subject_object', [pattern])

d_results = d_matcher(wikidoc)


[TMC] The variable "d_matcher" was defined successfully! 1 point.

[TMC] The variable "d_results" was defined successfully! 1 point.

[TMC] The variable "d_results" contains the expected values! 13 points.
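
The *DependencyMatcher* returns each match as a tuple of a match ID and a list of *Token* indices, ordered like the dictionaries in the pattern (here: verb, subject, object). The sketch below shows one way to turn such tuples into readable text. It is an illustration only: the three-word sentence, its annotations, and the names `toy_doc`, `toy_pattern` and `toy_matcher` are hypothetical, and a blank pipeline supplies the vocabulary in place of `nlp_en`.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Assumption: a blank pipeline supplies the vocabulary; no trained model is needed
vocab = spacy.blank("en").vocab

# Hand-annotated toy sentence (hypothetical, not taken from en_wiki.txt)
words = ["Scientists", "discovered", "helium"]
pos = ["NOUN", "VERB", "NOUN"]
deps = ["nsubj", "ROOT", "dobj"]
heads = [1, 1, 1]  # both nouns attach to the verb, which is its own head
toy_doc = Doc(vocab, words=words, pos=pos, deps=deps, heads=heads)

toy_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]

toy_matcher = DependencyMatcher(vocab)
toy_matcher.add("verb_subject_object", [toy_pattern])
toy_matches = toy_matcher(toy_doc)

# Each match is (match_id, token_ids); token_ids follow the order of the pattern
readable = []
for match_id, token_ids in toy_matches:
    rule_name = vocab.strings[match_id]  # map the hash back to the rule name
    verb, subject, obj = (toy_doc[i] for i in token_ids)
    readable.append((rule_name, verb.text, subject.text, obj.text))
print(readable)
```

The same loop should work on `d_results` if `toy_doc` is replaced with `wikidoc` and `vocab` with `nlp_en.vocab`.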

In [12]:
d_results

[(18120831537121313479, [302, 301, 306]),
 (18120831537121313479, [316, 315, 317]),
 (18120831537121313479, [339, 338, 337]),
 (18120831537121313479, [365, 364, 367]),
 (18120831537121313479, [826, 821, 829]),
 (18120831537121313479, [838, 837, 835]),
 (18120831537121313479, [838, 837, 840]),
 (18120831537121313479, [878, 877, 880]),
 (18120831537121313479, [902, 901, 903]),
 (18120831537121313479, [918, 917, 920]),
 (18120831537121313479, [996, 995, 1000]),
 (18120831537121313479, [1002, 1001, 1005]),
 (18120831537121313479, [1058, 1057, 1059]),
 (18120831537121313479, [1091, 1090, 1094]),
 (18120831537121313479, [1108, 1104, 1110]),
 (18120831537121313479, [1120, 1115, 1123]),
 (18120831537121313479, [1156, 1154, 1161]),
 (18120831537121313479, [1260, 1259, 1263]),
 (18120831537121313479, [1269, 1268, 1271]),
 (18120831537121313479, [1302, 1299, 1303]),
 (18120831537121313479, [1334, 1333, 1336]),
 (18120831537121313479, [1446, 1444, 1452]),
 (18120831537121313479, [1472, 1466, 1474]