# 7. Pattern matching in dependency trees

Today we will practice using dependency parsing to extract complex syntactic structures from text. spaCy's *pattern matching* engine lets us separate the description of the structure of interest from the code that finds it; the search itself is mostly handled by spaCy.
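To make this separation concrete, here is a minimal sketch that needs neither the news corpus nor a trained model. The parse of a four-word sentence is supplied by hand (the words, tags and heads are assumptions made up for illustration), and a declarative pattern describes "a verb with a proper-noun subject".

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated toy sentence (an assumption for illustration; normally
# these annotations come from a trained pipeline).
words = ["Nokia", "currently", "employs", "people"]
pos = ["PROPN", "ADV", "VERB", "NOUN"]
heads = [2, 2, 2, 2]  # index of each token's head; the root points to itself
deps = ["nsubj", "advmod", "ROOT", "dobj"]
toy_vocab = Vocab()
toy_doc = Doc(toy_vocab, words=words, pos=pos, heads=heads, deps=deps)

# The *description* of the structure: a verb that directly governs
# a proper-noun subject.
toy_pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"POS": "PROPN", "DEP": "nsubj"}},
]

# The *search* is left to spaCy.
toy_matcher = DependencyMatcher(toy_vocab)
toy_matcher.add("verb_with_propn_subject", [toy_pattern])
matches = toy_matcher(toy_doc)

# Each match is (match_id, token_indices); indices follow the pattern order.
pairs = [(toy_doc[ids[0]].text, toy_doc[ids[1]].text) for _, ids in matches]
print(pairs)
```

Changing what is searched for only requires editing `toy_pattern`; the matching code stays the same.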

We will use a new dataset consisting of YLE English news from recent weeks.

In [1]:
import csv
import spacy
import spacy.matcher

In [2]:
def read_articles(filename):
    result = []
    with open(filename, encoding='utf-8') as fp:
        reader = csv.DictReader(fp)
        for row in reader:
            result.append(row['text'])
    return result

In [3]:
nlp = spacy.load('en_core_web_sm')
articles = read_articles('yle_en.csv')      # TODO supply the file path if needed
docs = [nlp(a) for a in articles]

In [4]:
docs[30][:50]

The construction of a new passenger ferry terminal in Kotka which will enable service to St. Petersburg, Russia is expected to begin this spring.
The terminal's final design is still in progress, but building work is expected to start as early as May this year

In [5]:
articles[5]

"When the phone lines opened for a new age group to book coronavirus vaccination appointments in Oulu, more than 3,500 calls flooded in a matter of hours.\nIt's a similar situation nationwide.\nIn Helsinki people often try to make vaccination appointments ahead of time, while the majority of calls to the appointment booking service in Kainuu come from individuals not yet eligible for the jabs.\nYle gathered some of the most common questions people ask when calling the vaccination appointment lines and put them to two experts, Sirkku Kaltakari, service manager at Oulu Social and Health Services, and infectious diseases doctor Jarkko Huusko.\nHere are their answers:\n1. How do I know if it's my turn to get the vaccine?\nVaccinations are given according to a nationally-agreed schedule. The website of the Finnish Institute for Health and Welfare (THL) contains information on the order of vaccinations by age and risk groups.\nIf you can't find the information you need there, you can call yo

In [6]:
docs[5][0].head

opened

# In-class exercises

## Ex 1

Write a function `verb_dep(nlp, docs, tag, dep)` that finds all **verbs** which take a dependent of type `dep` tagged as `tag`. The function should return a list of lists of pairs of `Token` objects (the verb and its dependent): one list of pairs per article (see example).

Use the function to find all instances of verbs that have a proper noun (`PROPN`) as subject (`nsubj`).

Use `spacy.matcher.DependencyMatcher` for implementation. You will find the necessary documentation here:
* https://spacy.io/api/dependencymatcher
* https://spacy.io/api/matcher

In [15]:
# solution without matchers
def verb_dep(nlp, docs, tag, dep):
    # find words tagged with 'tag' whose head is a verb and whose dependency type is 'dep'
    results = []
    for d in docs:
        results_doc = []
        for tok in d:
            if tok.pos_ == tag and tok.head.pos_ == 'VERB' and tok.dep_ == dep:
                results_doc.append((tok.head, tok))
        results.append(results_doc)
    return results

In [34]:
# this is the proper solution to the exercise
def verb_dep(nlp, docs, tag, dep):
    pattern = [
        {'RIGHT_ID': 'verb', 
         'RIGHT_ATTRS': {'POS': 'VERB'}},
        # verb > subject
        {'LEFT_ID': 'verb',
         'RIGHT_ID': 'subject',
         'REL_OP': '>',
         'RIGHT_ATTRS': {'POS': tag, 'DEP': dep}}
    ]
    matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    matcher.add('verb_with_its_dep', [pattern])
    results = []
    for d in docs:
        results_doc = []
        for m_id, toks in matcher(d):
            # toks holds one token index per pattern node: [verb, subject]
            results_doc.append((d[toks[0]], d[toks[1]]))
        results.append(results_doc)
    return results

In [7]:
# what if we also wanted the verb to be modified by an adverb?
# -> not much changes: we add one more token to the pattern and to the returned tuple
def verb_dep(nlp, docs, tag, dep):
    pattern = [
        {'RIGHT_ID': 'verb', 
         'RIGHT_ATTRS': {'POS': 'VERB'}},
        # verb > subject
        {'LEFT_ID': 'verb',
         'RIGHT_ID': 'subject',
         'REL_OP': '>',
         'RIGHT_ATTRS': {'POS': tag, 'DEP': dep}},
        # verb > adverb
        {'LEFT_ID': 'verb',
         'RIGHT_ID': 'adverb',
         'REL_OP': '>',
         'RIGHT_ATTRS': {'POS': 'ADV'}}
    ]
    matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    matcher.add('verb_with_its_dep', [pattern])
    results = []
    for d in docs:
        results_doc = []
        for m_id, toks in matcher(d):
            results_doc.append((d[toks[0]], d[toks[1]], d[toks[2]]))
        results.append(results_doc)
    return results

In [8]:
verb_dep(nlp, docs, 'PROPN', 'nsubj')[:2]

[[(employs, Nokia, currently), (employs, Nokia, worldwide)],
 [(pick, Schoolchildren, also)]]
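The matches can also be aggregated, for example to see which verbs most often take a proper-noun subject. The sketch below counts verb lemmas; `count_verb_lemmas` is a hypothetical helper, and the toy tokens with hand-set lemmas merely stand in for the real output of `verb_dep`.

```python
from collections import Counter

from spacy.tokens import Doc
from spacy.vocab import Vocab


def count_verb_lemmas(results):
    """Count the lemma of the verb (first element) of every match."""
    return Counter(
        match[0].lemma_
        for doc_matches in results
        for match in doc_matches
    )


# Toy stand-in for the output of verb_dep: tokens with hand-set lemmas.
toy = Doc(
    Vocab(),
    words=["Nokia", "said", "Supo", "says", "Yle", "told"],
    lemmas=["Nokia", "say", "Supo", "say", "Yle", "tell"],
)
toy_results = [[(toy[1], toy[0]), (toy[3], toy[2])], [(toy[5], toy[4])]]
verb_counts = count_verb_lemmas(toy_results)
print(verb_counts.most_common(2))
```

On the real corpus, `count_verb_lemmas(verb_dep(nlp, docs, 'PROPN', 'nsubj'))` would be the corresponding call.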

## Ex 2

Use `spacy.matcher.PhraseMatcher` to find the following multi-word phrases in the corpus:
* `"municipal elections"`
* `"coronavirus restrictions"`

In [28]:
# create the `matcher` object
# the matcher entry must have a name (here 'COVID'), otherwise it will not work
matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
matcher.add('COVID', [nlp('municipal elections'), nlp('coronavirus restrictions')])

In [30]:
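for d in docs:
    # each match is a tuple (match_id, start, end) of token offsets
    for m in matcher(d):
        print(m)
        print(d[m[1]:m[2]])   # the matched phrase
        print(d[m[1]].sent)   # the sentence containing it
        print()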

(2428025407850408099, 251, 253)
coronavirus restrictions
Katri Mannonen / Yle
Municipalities scrambled to provide food to remote learning school children as most parts of the country entered a three-week period of heightened coronavirus restrictions, starting 8 March.

(2428025407850408099, 631, 633)
municipal elections
In addition, it was decided to postpone the municipal elections due to the difficult public health situation.

(2428025407850408099, 312, 314)
coronavirus restrictions
Rallies against coronavirus restrictions are taking place in some 40 countries around the world on Saturday.

(2428025407850408099, 76, 78)
municipal elections
Jussi Halla-aho, chair of the Finns Party, and the party's parliamentary group leader, Ville Tavio,on Saturday claimed Justice Minister Anna-Maja Henriksson had misled citizens and instructed municipalities to break the law in the lead-up to the decision to postpone this spring's municipal elections.

(2428025407850408099, 171, 173)
municipal elect
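By default `PhraseMatcher` compares the exact token text, so a sentence-initial "Municipal elections" would not match the lowercase pattern. A possible variant, sketched here on a toy sentence with a blank pipeline (the sentence and the name `TOPIC` are made up for illustration), matches on the lowercased form instead:

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline only tokenizes, which is all this matcher needs.
blank_nlp = spacy.blank("en")

phrases = ["municipal elections", "coronavirus restrictions"]
lower_matcher = PhraseMatcher(blank_nlp.vocab, attr="LOWER")
# make_doc runs only the tokenizer, avoiding the cost of a full pipeline.
lower_matcher.add("TOPIC", [blank_nlp.make_doc(p) for p in phrases])

text = "Municipal elections were postponed. Coronavirus restrictions remain."
toy_doc = blank_nlp(text)
found = [toy_doc[start:end].text for _, start, end in lower_matcher(toy_doc)]
print(found)
```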

## Ex 3

Extract all syntactic phrases headed by the word `"situation"`.

In a dependency tree, such a phrase corresponds to the subtree containing the token of interest and all of its descendants. You can get it using the attribute `token.subtree`.
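As a small illustration, the sketch below builds a hand-annotated toy sentence (the heads and labels are assumptions, not parser output) and reads off the subtree of "situation". It also shows one way to obtain the phrase as a `Span` via `left_edge` and `right_edge`, which works when the subtree is contiguous.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["The", "situation", "in", "Estonia", "is", "bad"]
heads = [1, 4, 1, 2, 4, 4]  # index of each token's head; the root points to itself
deps = ["det", "nsubj", "prep", "pobj", "ROOT", "acomp"]
toy_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

head = toy_doc[1]  # the token "situation"

# token.subtree yields the token itself and all of its descendants, in order.
subtree_tokens = list(head.subtree)

# The same phrase as a Span, from the leftmost to the rightmost descendant.
phrase = toy_doc[head.left_edge.i : head.right_edge.i + 1]

print([t.text for t in subtree_tokens])
print(phrase.text)
```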

In [31]:
for d in docs:
    for tok in d:
        if tok.lemma_ == 'situation':
            print(list(tok.subtree))

[the, situation]
[certain, situations, we, have, to, go, to, so, -, called, virgin, areas]
[a, similar, situation]
[the, prevailing, pandemic, situation, and, the, necessary, safety, arrangements]
[the, examination, situation]
[the, disease, situation]
[the, situation, in, Estonia, ,, where, the, infection, rate, is, now, the, second, -, highest, in, the, continent, and, around, nine, times, higher, than, it, is, in, Finland]
[the, worst, infection, situation, outside, of, the, Helsinki, region]
[The, coronavirus, situation]
[the, pandemic, situation]
[the, situation, in, the, greater, Helsinki, and, Turku, regions]
[Finland, 's, deteriorating, Covid, situation]
[the, situation]
[the, coronavirus, situation]
[an, embarrassing, situation]
[a, completely, everyday, situation]
[certain, situations]
[the, current, situation]
[situations, where, a, passenger, does, not, provide, proof, of, illness, or, a, recent, coronavirus, test]
[the, situation]
[The, situation]
[the, situation]
[the, de

# Homework

## Ex 4 (5p.)

Write a function `find_quotes(nlp, docs)` that identifies and attributes direct quotes in the text. A direct quote looks like this:

"*There is no simple answer,*" **Matti Sarvimäki**, Assistant Professor of Economics at Aalto University **told** the paper.

We recognize the following three identifying elements:
* **proposition** - the text that is being quoted (e.g. "There is no simple answer"),
* **author** - the person that is being quoted ("Matti Sarvimäki"),
* **cue** - a verb indicating a speech act, e.g. "told".

The function should output a list of triplets: `(author, cue, proposition)`, each being a `Token` or `Span` object. You may use information gained from exercise 1 to manually create a list of speech act verbs (it does not need to be exhaustive, a couple of common examples are enough).

3 points are given for a basic working version of the function. For full points, pay attention to the following details:
* If a name is given in the form `FirstName LastName`, output a `Span` object containing them both as author. (1p.)
* If the proposition ends with a quote character (`"`), look for the matching character to find where it starts. It might be a different sentence! (1p.)
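For the second point, one possible approach is to walk backwards from the closing quote token until another quote token is found. The sketch below is only an illustration under stated assumptions: `quoted_span` and `QUOTE_CHARS` are hypothetical names, quote characters are assumed to be separate tokens, and nested quotes are not handled.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Straight and curly double quotes (an assumed, non-exhaustive set).
QUOTE_CHARS = {'"', "“", "”"}


def quoted_span(doc, close_i):
    """Return the span from the nearest preceding quote token through
    doc[close_i], or None if no earlier quote token exists."""
    for i in range(close_i - 1, -1, -1):
        if doc[i].text in QUOTE_CHARS:
            return doc[i : close_i + 1]
    return None


# Toy two-sentence quote, tokenized by hand for illustration.
words = ['"', "It", "was", "a", "wolf", ".", "It", "was", "rare", ",", '"',
         "Simola", "said", "."]
toy_doc = Doc(Vocab(), words=words)
span = quoted_span(toy_doc, 10)  # index 10 is the closing quote token
print(span.text)
```

Because the search operates on token indices rather than on sentences, the returned span can cover several sentences.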

You can restrict the solution to propositions enclosed in quotes and the speaker being named by name (rather than a pronoun).

In [89]:
def find_quotes(nlp, docs):
    pattern = [
        # verb indicating speech
        {"RIGHT_ID": "speech_verb",
         "RIGHT_ATTRS": {"POS": "VERB", "LEMMA": {"IN": ["tell", "say", "ask", "talk"]}}},
        # who is saying
        {"LEFT_ID": "speech_verb",
         "REL_OP": ">",
         "RIGHT_ID": "who",
         "RIGHT_ATTRS": {"DEP": "nsubj", "ENT_TYPE": "PERSON"}},
        # what is being said
        {"LEFT_ID": "speech_verb",
         "REL_OP": ">",
         "RIGHT_ID": "proposition",
         "RIGHT_ATTRS": {"DEP": "ccomp"}}
    ]
    matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    matcher.add('pattern', [pattern])
    
    results = []
    for d in docs:
        for m_id, toks in matcher(d):
            # toks follows the pattern order: [speech_verb, who, proposition]
            proposition = ' '.join(t.text for t in d[toks[2]].subtree)
            results.append((d[toks[0]], d[toks[1]], proposition))
    return results

# I tried printing the name with this: ' '.join(str(n) for n in list(d[toks[1]].subtree) if n.dep_ == "compound")
# But in some sentences, it gave me an empty string (q[100]) or printed more than I wanted (q[1]), so I left
# only d[toks[1]] in the code
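An alternative to collecting `compound` dependents, sketched here as a suggestion rather than as the reference solution, is to return the named entity that contains the matched token. `author_span` is a hypothetical helper, and the entity in the toy document is set by hand; on the real data the entities would come from the pipeline's entity recognizer.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def author_span(token):
    """Return the named entity containing `token`, else a one-token span."""
    for ent in token.doc.ents:
        if ent.start <= token.i < ent.end:
            return ent
    return token.doc[token.i : token.i + 1]


words = ["Hanna", "Nohynek", "told", "Yle"]
toy_doc = Doc(Vocab(), words=words)
# Hand-set entity; a trained pipeline would normally provide doc.ents.
toy_doc.ents = [Span(toy_doc, 0, 2, label="PERSON")]

print(author_span(toy_doc[1]).text)  # token inside the entity
print(author_span(toy_doc[2]).text)  # token outside any entity
```

Since the pattern already requires `ENT_TYPE` to be `PERSON` for the `who` token, the matched token should normally lie inside such an entity.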

In [90]:
q = find_quotes(nlp, docs)

In [91]:
print(q[1])
print(q[100])
print(q[213])
print(q[87])

(said, Turkia, 'Some said it was great that we reacted so quickly')
(said, Mäkijärvi, '" Each municipality will decide on the matter independently , but I would believe that this will happen')
(said, Callamard, 'With the pandemic at this stage , even the most deluded leaders have difficulty denying the fact that our social , economic and political systems are broken')
(said, Supo, 'The most significant terrorist threat on a global scale is posed by radical Islamists')


In [253]:
q = find_quotes(nlp, docs)

In [254]:
# This might not show *all* quotes. My solution is not perfect either.
print(q[3])

[(told, Hanna Nohynek, "Thinking about it, it could probably be a way to increase the number of people protected, if the quantity of vaccines available in Finland is much less than promised,), (said, Puhakka, "Oulu is a growing city and in certain situations we have to go to so-called virgin areas), (told, Puhakka, "We believe that Oulu's forest nature conservation decisions compensate for the decisions that have to be made in different parts of the city, due to the city's natural growth)]


In [255]:
print(q[207])

[(said, Simola, "My husband immediately said that it was a wolf), (told, Simola, "It was a really touching, amazing moment. Seeing such a wild animal inside the Ring I beltway really made you rub your eyes. It was a rare situation that doesn't happen every day), (said, Simola, "We've talked today about our relationship with wildlife and how even though the wolf may look nice, it is still an unpredictable wild animal that needs to be given its own space), (said, Hallamaa, "The wolf has not growled or shown its teeth. On the contrary, it has apparently tried to escape from humans with its tail between its legs)]


# Further reading

* https://spacy.io/usage/rule-based-matching
* https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/02_pattern_matching.html