# Pattern matching

See the learning materials associated with this exercise <a href="https://applied-language-technology.mooc.fi/html/notebooks/part_iii/03_pattern_matching.html" target="blank_">here</a>.

For instructions on how to use TestMyCode (TMC) to test your code and submit it to the server, see <a href="https://applied-language-technology.mooc.fi/html/tmc.html" target="blank_">here</a>.

Remember to save this Notebook before testing your code. Press <kbd>Control</kbd>+<kbd>s</kbd> or select the *File* menu and click *Save*.

**The maximum number of points for this exercise is 35.**

## 1. Import the *Matcher* class (2 points)

Import the *Matcher* class from the `spacy.matcher` submodule into Python.

In [1]:
# Write your answer below this line
from spacy.matcher import Matcher


## 2. Create a *Matcher* object (3 points)

Use the *Matcher* class to create a *Matcher* object.

The variable `nlp` contains a spaCy *Language* object with a small language model for English. 

Provide the *Vocabulary* of the language model as input to the `vocab` argument of the *Matcher* object.

Store the *Matcher* object under the variable `en_matcher`.

In [2]:
# Import spacy
import spacy

# Load a small language model for English; assign result under 'nlp'
nlp = spacy.load('en_core_web_sm')

# Write your answer below this line
en_matcher = Matcher(vocab=nlp.vocab)

In [3]:
en_matcher

<spacy.matcher.matcher.Matcher at 0x7fc685e0d1c0>

## 3. Define a pattern rule (5 points)

Define a pattern rule for matching three-token sequences that consist of a determiner (`DET`), followed by an adjective (`ADJ`) and a noun (`NOUN`).

 1. Use the coarse part-of-speech tags (`POS`) provided above for matching.
 2. Define the pattern rules using Python dictionaries.
 3. Store the dictionaries into a list named `pattern_rule`.
 
Remember that list items are separated by commas, whereas the keys and values of a dictionary are separated by colons. In this case, both keys and values must be string objects.

In [4]:
# Write your answer below this line
pattern_rule = [{'POS': 'DET'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}]

## 4. Add the pattern rule to the *Matcher* object (5 points)

Add the pattern rule defined in the `pattern_rule` list to the *Matcher* object stored under the variable `en_matcher`.

 1. Use the `add()` method of the *Matcher* object.
 2. Name the pattern rule `det-adj-noun`.
 3. Provide the pattern rule stored under `pattern_rule` to the `patterns` argument.
 
Remember that a single pattern name can hold multiple pattern rules; hence the input to the `patterns` argument must be a list of lists.
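The list-of-lists requirement can be illustrated with a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline, so no trained model is needed. The names `toy_matcher`, `rule_a`, `rule_b` and the `greeting` label are hypothetical and not part of the exercise.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline provides a tokenizer and a vocabulary only
toy_nlp = spacy.blank("en")
toy_matcher = Matcher(vocab=toy_nlp.vocab)

# Two hypothetical rules; each rule is a list of token dictionaries
rule_a = [{"LOWER": "hello"}, {"LOWER": "world"}]
rule_b = [{"LOWER": "goodbye"}]

# Both rules are registered under one name, so 'patterns' is a list of lists
toy_matcher.add("greeting", patterns=[rule_a, rule_b])

# Apply the matcher to a short example text and collect the matched texts
toy_doc = toy_nlp("Hello world and goodbye")
texts = [span.text for span in toy_matcher(toy_doc, as_spans=True)]
print(texts)
```

Even when only one rule is defined, as in this exercise, it must still be wrapped in an outer list.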

In [5]:
# Write your answer below this line
en_matcher.add("det-adj-noun", patterns=[pattern_rule])

In [6]:
en_matcher

<spacy.matcher.matcher.Matcher at 0x7fc685e0d1c0>

## 5. Apply the *Matcher* to a spaCy *Doc* object (5 points)

The variable `doc` contains a spaCy *Doc* object with some text.

 1. Apply the *Matcher* object under `en_matcher` to the *Doc* object to find matching patterns in the text.
 2. Instruct spaCy to return the matches as *Span* objects.
 3. Store the matches under the variable `en_matches`.
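The difference between the default output and the *Span* output can be sketched as follows. This is an illustration under stated assumptions rather than the exercise answer: it assumes spaCy v3 and builds a hand-tagged toy *Doc* (the names `toy_doc`, `toy_matcher` and the example words are hypothetical), so that no trained model is required.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

toy_nlp = spacy.blank("en")

# Hand-tagged toy Doc: the part-of-speech tags are assigned manually
words = ["The", "first", "protest", "began"]
pos = ["DET", "ADJ", "NOUN", "VERB"]
toy_doc = Doc(toy_nlp.vocab, words=words, pos=pos)

toy_matcher = Matcher(vocab=toy_nlp.vocab)
toy_matcher.add("det-adj-noun",
                patterns=[[{"POS": "DET"}, {"POS": "ADJ"}, {"POS": "NOUN"}]])

# Default behaviour: a list of (match_id, start, end) tuples
raw_matches = toy_matcher(toy_doc)
match_id, start, end = raw_matches[0]

# The pattern name can be looked up from the vocabulary's string store,
# and the matched tokens recovered by slicing the Doc
name = toy_nlp.vocab.strings[match_id]
span_from_tuple = toy_doc[start:end]

# With as_spans=True: a list of Span objects labelled with the pattern name
spans = toy_matcher(toy_doc, as_spans=True)
print(name, span_from_tuple.text, spans[0].label_)
```

Requesting *Span* objects saves the manual slicing step and keeps the pattern name available through the `label_` attribute.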

In [7]:

doc = nlp(open(file='data/data.txt', encoding='utf-8', mode='r').read())

# Write your answer below this line

en_matches = en_matcher(doc, as_spans=True)


In [8]:
en_matches

[every inhabited continent,
 the first month,
 the major camps,
 a few demands,
 the Great Recession,
 the same kind,
 the global occupy,
 a worldwide protest,
 a peaceful occupation,
 the OccupyWallStreet.org web,
 the senior editor,
 The first protest,
 the tenth anniversary,
 a political slogan,
 the same time,
 the average income,
 a minimum income,
 the American population,
 the Great Recession,
 The Great Recession,
 the economic expansion,
 a larger share,
 a widespread ignorance,
 the true income,
 the early weeks,
 the early stages,
 a global march,
 the corrupting effect,
 a Presidential commission,
 the very power,
 a clear demand,
 a global collaboration,
 The global movement,
 The progressive provider,
 the illegal practices,
 a progressive stack,
 The progressive stack,
 A subsequent film,
 a clear objective,
 the economic system,
 a particular location,
 the earliest days,
 the vast majority,
 the global movement,
 an immense strength,
 The social media,
 the social medi

## 6. Create a *DependencyMatcher* object (3 points)

Create a spaCy *DependencyMatcher* object. 

Use the *Vocabulary* of the *Language* object stored under the variable `nlp`.

Store the *DependencyMatcher* object under the variable `en_dep_matcher`.

In [9]:
# Import the DependencyMatcher class from spacy.matcher submodule
from spacy.matcher import DependencyMatcher

# Write your answer below this line
en_dep_matcher = DependencyMatcher(vocab=nlp.vocab)

## 7. Define a rule for matching syntactic dependencies (5 points)

Define a pattern rule for matching the nominal subjects (`nsubj`) of the verb **'be'**. Use the key `LEMMA` to match the lemmas of this verb.

 1. Start by defining the anchor pattern. Use a dictionary with the keys `RIGHT_ID` and `RIGHT_ATTRS` to define the pattern. 
 2. Continue by defining the second pattern. Use a dictionary, and connect this pattern to the anchor pattern using the key `LEFT_ID`. Define the relationship between this pattern and the anchor using the key `REL_OP`.
 3. Store the two dictionaries into a list named `dep_rule`. 
 
Tip: You can define the names used for the `RIGHT_ID` and `LEFT_ID` attributes yourself, as long as each `LEFT_ID` refers to a `RIGHT_ID` that has already been defined.

In [10]:
# Write your answer below this line
dep_rule = [
    {'RIGHT_ID': 'verb', 'RIGHT_ATTRS': {'LEMMA': 'be'}},
    {'LEFT_ID': 'verb', 'REL_OP': '>', 'RIGHT_ID': 'subject', 'RIGHT_ATTRS': {'DEP': 'nsubj'}}
]

## 8. Add the pattern rule to the *DependencyMatcher* object and apply it to a *Doc* (7 points)

Add the pattern rule defined in the `dep_rule` list to the *DependencyMatcher* object stored under the variable `en_dep_matcher`.

 1. Use the `add()` method of the *DependencyMatcher* object.
 2. Name the pattern rule `is_nsubj`.
 3. Provide the pattern rule stored under `dep_rule` to the `patterns` argument.
 4. Apply the *DependencyMatcher* object to the *Doc* object under `doc`.
 5. Store the resulting matches under the variable `en_dep_matches`.
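The *DependencyMatcher* returns tuples of a match ID and a list of token indices, ordered like the dictionaries in the pattern rule. The sketch below shows one way to turn such a tuple back into readable tokens. It assumes spaCy v3 and a toy *Doc* whose lemmas, heads and dependency labels are assigned by hand; `toy_doc`, `toy_rule`, `toy_dep_matcher` and the example sentence are hypothetical.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

toy_nlp = spacy.blank("en")

# Hand-annotated toy parse; 'heads' holds the absolute index of each head
words = ["The", "movement", "is", "global"]
lemmas = ["the", "movement", "be", "global"]
heads = [1, 2, 2, 2]
deps = ["det", "nsubj", "ROOT", "acomp"]
toy_doc = Doc(toy_nlp.vocab, words=words, lemmas=lemmas, heads=heads, deps=deps)

# Anchor on the verb 'be', then require an nsubj as its immediate dependent
toy_rule = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]

toy_dep_matcher = DependencyMatcher(vocab=toy_nlp.vocab)
toy_dep_matcher.add("is_nsubj", patterns=[toy_rule])

# Each match is (match_id, [token index for each dictionary in the rule])
toy_matches = toy_dep_matcher(toy_doc)
match_id, token_ids = toy_matches[0]

# Pair each RIGHT_ID name with the text of the token it matched
named = {toy_rule[i]["RIGHT_ID"]: toy_doc[token_ids[i]].text
         for i in range(len(token_ids))}
print(toy_nlp.vocab.strings[match_id], named)
```

In each match, the first index therefore points to the anchor verb and the second to its nominal subject.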

In [11]:
# Write your answer below this line
# Add the pattern to the matcher under the name 'is_nsubj'
en_dep_matcher.add('is_nsubj', patterns=[dep_rule])

# Apply the DependencyMatcher to the Doc object under 'doc'; store the result
# under the variable 'en_dep_matches'.
en_dep_matches = en_dep_matcher(doc)

# Call the variable to examine the output
en_dep_matches


[(10072227686199776841, [6, 5]),
 (10072227686199776841, [316, 315]),
 (10072227686199776841, [417, 416]),
 (10072227686199776841, [481, 480]),
 (10072227686199776841, [692, 691]),
 (10072227686199776841, [877, 870]),
 (10072227686199776841, [1021, 1015]),
 (10072227686199776841, [1021, 1019]),
 (10072227686199776841, [1252, 1246]),
 (10072227686199776841, [1343, 1330]),
 (10072227686199776841, [1532, 1531]),
 (10072227686199776841, [1549, 1548]),
 (10072227686199776841, [1611, 1608]),
 (10072227686199776841, [1685, 1670]),
 (10072227686199776841, [1728, 1720]),
 (10072227686199776841, [1836, 1835]),
 (10072227686199776841, [2126, 2119]),
 (10072227686199776841, [2278, 2277]),
 (10072227686199776841, [2626, 2618]),
 (10072227686199776841, [2677, 2676]),
 (10072227686199776841, [2726, 2725]),
 (10072227686199776841, [2764, 2762]),
 (10072227686199776841, [3012, 3007]),
 (10072227686199776841, [3175, 3174]),
 (10072227686199776841, [3301, 3295]),
 (10072227686199776841, [3698, 3693]),
 (