# 7. Pattern matching in dependency trees

Today we will practice the ways in which dependency parsing can be used to extract complex syntactic structures from texts. SpaCy's *pattern matching* engine allows us to separate the description of the structure of interest from the code that finds it. The latter is mostly done by spaCy.

We will use a new dataset consisting of YLE English news from the recent weeks.

In [1]:
import csv
import spacy
import spacy.matcher

In [2]:
def read_articles(filename):
    result = []
    with open(filename, encoding='utf-8') as fp:
        reader = csv.DictReader(fp)
        for row in reader:
            result.append(row['text'])
    return result

In [3]:
nlp = spacy.load('en_core_web_sm')
articles = read_articles('yle_en.csv')      # TODO supply the file path if needed
docs = [nlp(a) for a in articles]

In [4]:
docs[30][:50]

The construction of a new passenger ferry terminal in Kotka which will enable service to St. Petersburg, Russia is expected to begin this spring.
The terminal's final design is still in progress, but building work is expected to start as early as May this year

In [5]:
articles[5]

"When the phone lines opened for a new age group to book coronavirus vaccination appointments in Oulu, more than 3,500 calls flooded in a matter of hours.\nIt's a similar situation nationwide.\nIn Helsinki people often try to make vaccination appointments ahead of time, while the majority of calls to the appointment booking service in Kainuu come from individuals not yet eligible for the jabs.\nYle gathered some of the most common questions people ask when calling the vaccination appointment lines and put them to two experts, Sirkku Kaltakari, service manager at Oulu Social and Health Services, and infectious diseases doctor Jarkko Huusko.\nHere are their answers:\n1. How do I know if it's my turn to get the vaccine?\nVaccinations are given according to a nationally-agreed schedule. The website of the Finnish Institute for Health and Welfare (THL) contains information on the order of vaccinations by age and risk groups.\nIf you can't find the information you need there, you can call yo

In [6]:
docs[5][0].head

opened

# In-class exercises

# Ex 1

Write a function `verb_dep(nlp, docs, tag, dep)`, that finds all **verbs** which take a dependent of type `dep` tagged as `tag`. The function should return a list of lists of pairs of `Token` objects: the verb and its dependent. We return a list of such pairs for every article (see example).

Use the function to find all instances of verbs that have a proper noun (`PROPN`) as subject (`nsubj`).

Use `spacy.matcher.DependencyMatcher` for implementation. You will find the necessary documentation here:
* https://spacy.io/api/dependencymatcher
* https://spacy.io/api/matcher

In [15]:
# solution without matchers
def verb_dep(nlp, docs, tag, dep):
    # find words tagged with 'tag' whose head is a verb and whose dependency type is 'dep'
    results = []
    for d in docs:
        results_doc = []
        for tok in d:
            if tok.pos_ == tag and tok.head.pos_ == 'VERB' and tok.dep_ == 'nsubj':
                results_doc.append((tok.head, tok))
        results.append(results_doc)
    return results

In [34]:
# this is the proper solution to the exercise
def verb_dep(nlp, docs, tag, dep):
    pattern = [
        {'RIGHT_ID': 'verb', 
         'RIGHT_ATTRS': {'POS': 'VERB'}},
        # verb > subject
        {'LEFT_ID': 'verb',
         'RIGHT_ID': 'subject',
         'REL_OP': '>',
         'RIGHT_ATTRS': {'POS': tag, 'DEP': dep}}
    ]
    matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    matcher.add('verb_with_its_dep', [pattern])
    results = []
    for d in docs:
        results_doc = []
        for m_id, toks in matcher(d):
            results_doc.append((d[toks[0]], d[toks[1]], d[toks[2]]))
        results.append(results_doc)
    return results

In [72]:
# what if we wanted the verb to be additionally specified by an adverb?
# -> we don't need to change much, just add another token to the pattern
def verb_dep(nlp, docs, tag, dep):
    pattern = [
        {'RIGHT_ID': 'verb', 
         'RIGHT_ATTRS': {'POS': 'VERB'}},
        # verb > subject
        {'LEFT_ID': 'verb',
         'RIGHT_ID': 'subject',
         'REL_OP': '>',
         'RIGHT_ATTRS': {'POS': tag, 'DEP': dep}},
        # verb > adverb
        {'LEFT_ID': 'verb',
         'RIGHT_ID': 'adverb',
         'REL_OP': '>',
         'RIGHT_ATTRS': {'POS': 'ADV'}}
    ]
    matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    matcher.add('verb_with_its_dep', [pattern])
    results = []
    for d in docs:
        results_doc = []
        for m_id, toks in matcher(d):
            results_doc.append((d[toks[0]], d[toks[1]], d[toks[2]]))
        results.append(results_doc)
    return results

In [73]:
verb_dep(nlp, docs, 'PROPN', 'nsubj')[:2]

[[(employs, Nokia, currently), (employs, Nokia, worldwide)],
 [(pick, Schoolchildren, also)]]

## Ex 2

Use `spacy.matcher.PhraseMatcher` to find the following multi-word phrases in the corpus:
* `"municipal elections"`
* `"coronavirus restrictions"`

In [28]:
# create the `matcher` object
# matcher musí mít jméno (tady 'COVID'), jinak nebude fungovat
matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
matcher.add('COVID', [nlp('municipal elections'), nlp('coronavirus restrictions')])

In [30]:
for i, d in enumerate(docs):
    for m in matcher(d):
        print(m)
        print(d[m[1]:m[2]])
        print(d[m[1]].sent)
        print()

(2428025407850408099, 251, 253)
coronavirus restrictions
Katri Mannonen / Yle
Municipalities scrambled to provide food to remote learning school children as most parts of the country entered a three-week period of heightened coronavirus restrictions, starting 8 March.

(2428025407850408099, 631, 633)
municipal elections
In addition, it was decided to postpone the municipal elections due to the difficult public health situation.

(2428025407850408099, 312, 314)
coronavirus restrictions
Rallies against coronavirus restrictions are taking place in some 40 countries around the world on Saturday.

(2428025407850408099, 76, 78)
municipal elections
Jussi Halla-aho, chair of the Finns Party, and the party's parliamentary group leader, Ville Tavio,on Saturday claimed Justice Minister Anna-Maja Henriksson had misled citizens and instructed municipalities to break the law in the lead-up to the decision to postpone this spring's municipal elections.

(2428025407850408099, 171, 173)
municipal elect

## Ex 3

Extract all syntactic phrases headed by the word `"situation"`.

In a dependency tree, such phrase corresponds to the subtree containing all descendents of the token of interest. You can get it using the field `token.subtree`.

In [31]:
for d in docs:
    for tok in d:
        if tok.lemma_ == 'situation':
            print(list(tok.subtree))

[the, situation]
[certain, situations, we, have, to, go, to, so, -, called, virgin, areas]
[a, similar, situation]
[the, prevailing, pandemic, situation, and, the, necessary, safety, arrangements]
[the, examination, situation]
[the, disease, situation]
[the, situation, in, Estonia, ,, where, the, infection, rate, is, now, the, second, -, highest, in, the, continent, and, around, nine, times, higher, than, it, is, in, Finland]
[the, worst, infection, situation, outside, of, the, Helsinki, region]
[The, coronavirus, situation]
[the, pandemic, situation]
[the, situation, in, the, greater, Helsinki, and, Turku, regions]
[Finland, 's, deteriorating, Covid, situation]
[the, situation]
[the, coronavirus, situation]
[an, embarrassing, situation]
[a, completely, everyday, situation]
[certain, situations]
[the, current, situation]
[situations, where, a, passenger, does, not, provide, proof, of, illness, or, a, recent, coronavirus, test]
[the, situation]
[The, situation]
[the, situation]
[the, de

# Homework

## Ex 4 (5p.)

Write a function `find_quotes(nlp, docs)` that identifies and attributes direct quotes in the text. A direct quote looks like this:

"*There is no simple answer,*" **Matti Sarvimäki**, Assistant Professor of Economics at Aalto University **told** the paper.

We recognize the following three identifying elements:
* **proposition** - the text that is being quoted (e.g. "There is no simple answer"),
* **author** - the person that is being quoted ("Matti Sarvimäki"),
* **cue** - a verb indicating a speech act, e.g. "told".

The function should output a list of triplets: `(author, cue, proposition)`, each being a `Tok` or `Span` object. You may use information gained from exercise 1 to manually create a list of speech act verbs (it does not need to be exhaustive, a couple of common examples are enough).

3 points are given for a basic working version of the function. For full points, pay attention to the following details:
* If a name is given in the form `FirstName LastName`, output a `Span` object containing them both as author. (1p.)
* If the proposition ends with a quote character (`"`), look for the matching character to find where it starts. It might be a different sentence! (1p.)

You can restrict the solution to propositions enclosed in quotes and the speaker being named by name (rather than a pronoun).

In [54]:
from spacy.matcher import Matcher
from collections import defaultdict
def find_quotes(nlp, docs):
    verbs_pattern = [
            {"RIGHT_ID": 'speech_verb',
             "RIGHT_ATTRS": {"LEMMA": {"IN": ["tell", "say"]}}}
             ]

    verbs_matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    verbs_matcher.add('speech_verbs', [verbs_pattern])
    
    speech_verbs = [d[m[1][0]] for i, d in enumerate(docs) for m in verbs_matcher(d)]
    
    indexes = defaultdict(list)
    for i, par in enumerate(docs):
        for tok in par:
            if tok.norm_ == '"':
                #print(tok, tok.i, tok.head)
                indexes[i].append(tok.i)
                #print(tok.orth_, tok.i, tok.head, tok.head.i)
            #if tok.dep_ == 'nsubj' and tok.head in speech_verbs:
                #print(tok.orth_, tok.head, tok.head.i)
    
    for key, val in indexes.items():
        start = 0
        end = 0
        count = 0
        for tok in indexes[key]:
            if val.index(tok) % 2 == 0:
                start = tok
                count += 1
            else:
                end = tok
                count += 1
            if tok.i in indexes[key]:
                print(tok, tok.i, tok.head, tok.pos_)
        
        
    quotes = []
    for par in docs:
        for tok in par:
            if tok.dep_ == 'nsubj' and tok.head in speech_verbs:
                quotes.append((tok, tok.head))

    """
    quotes = []
    for verb in speech_verbs:
        quote_index = 0
        who_said = []
        quote = ()
        for tok in verb.sent:
            if tok.norm_ == '"' and tok.head.dep_ == 'ROOT':
                quote_index = tok.i
            if tok.dep_ == 'nsubj' and tok.head == verb:
                who_said.append(tok) 
                who_said.append(tok.head)
        quote = (who_said[0], who_said[1], verb.sent[:quote_index])
        quotes.append(quote)
        print(quote_indexes)
        if len(quote_indexes) == 1:
            print(quote_indexes)
            quote = (who_said[0], who_said[1], verb.sent[:quote_indexes[0]])
            quotes.append(quote)
        if len(quote_indexes) == 2:
            quote = (who_said[0], who_said[1], verb.sent[quote_indexes[0]:quote_indexes[1]])
            quotes.append(quote)
        print('Quotes', quotes)"""
    #print(quotes)

In [33]:
print(docs[0][32:39], docs[0][164].head, docs[1][87].head, docs[18][68].head)

"reset its cost base," said group said


In [55]:
find_quotes(nlp, docs)

" 32 reset PUNCT
" 38 said PUNCT
" 114 taken PUNCT
" 164 said PUNCT
" 84 group PUNCT
" 87 group PUNCT
" 93 praised PUNCT
" 113 said PUNCT
" 272 see PUNCT
" 291 commented PUNCT
" 321 had PUNCT
" 343 Director PUNCT
" 117 is PUNCT
" 162 said PUNCT
" 198 is PUNCT
" 234 emphasised PUNCT
" 294 fighting PUNCT
" 324 said PUNCT
" 29 are PUNCT
" 47 writes PUNCT
" 52 took PUNCT
" 64 continues PUNCT
" 245 level PUNCT
" 247 level PUNCT
" 264 be PUNCT
" 281 be PUNCT
" 391 Thinking PUNCT
" 424 told PUNCT
" 488 stated PUNCT
" 512 chair PUNCT
" 571 said PUNCT
" 592 said PUNCT
" 623 believe PUNCT
" 657 told PUNCT
" 47 transition PUNCT
" 50 transition PUNCT
" 254 aims PUNCT
" 290 said PUNCT
" 634 enable PUNCT
" 650 are PUNCT
" 688 changed PUNCT
" 733 said PUNCT
" 42 drawn PUNCT
" 58 said PUNCT
" 60 estimated PUNCT
" 108 said PUNCT
" 266 ensure PUNCT
" 312 said PUNCT
" 440 are PUNCT
" 492 said PUNCT
" 61 continue PUNCT
" 91 said PUNCT
" 157 be PUNCT
" 180 said PUNCT
" 212 waste PUNCT
" 215 waste PUNCT
" 3

" 211 exceeded PUNCT
" 233 said PUNCT
" 65 tells PUNCT
" 70 tells PUNCT
" 80 ordered PUNCT
" 99 ordered PUNCT
" 216 tells PUNCT
" 231 tells PUNCT
" 298 be PUNCT
" 340 is PUNCT
" 643 been PUNCT
" 665 told PUNCT
" 62 is PUNCT
" 113 said PUNCT
" 45 section PUNCT
" 48 section PUNCT
" 193 get PUNCT
" 233 said PUNCT
" 288 said PUNCT
" 308 said PUNCT
" 349 be PUNCT
" 370 said PUNCT
" 376 be PUNCT
" 405 is PUNCT
" 167 relationships PUNCT
" 170 relationships PUNCT
" 54 need PUNCT
" 78 said PUNCT
" 88 are PUNCT
" 123 said PUNCT
" 154 been PUNCT
" 171 added PUNCT
" 93 is PUNCT
" 134 said PUNCT
" 144 have PUNCT
" 165 added PUNCT
" 305 based PUNCT
" 351 said PUNCT
" 59 weakened PUNCT
" 62 weakened PUNCT
" 52 seems PUNCT
" 71 told PUNCT
" 122 is PUNCT
" 148 said PUNCT
" 204 is PUNCT
" 217 said PUNCT
" 381 got PUNCT
" 400 told PUNCT
" 136 be PUNCT
" 152 said PUNCT
" 254 happen PUNCT
" 264 said PUNCT
" 57 Cool PUNCT
" 65 made PUNCT
" 120 is PUNCT
" 184 writes PUNCT
" 192 cool PUNCT
" 195 cool PUNCT
" 

" 379 said PUNCT
" 437 number PUNCT
” 471 told PUNCT
" 659 agreed PUNCT
" 667 study PUNCT
" 687 on PUNCT
" 690 carried PUNCT
" 701 are PUNCT
” 711 documents PUNCT
“ 713 correspondence PUNCT
" 719 correspondence PUNCT
" 546 cautious PUNCT
" 548 as PUNCT
" 867 passport PUNCT
" 870 passport PUNCT
" 986 consider PUNCT
" 1012 consider PUNCT
" 1129 described PUNCT
" 1134 fades PUNCT
" 1462 hold PUNCT
" 1465 events PUNCT
" 114 thrilled PUNCT
" 185 said PUNCT
" 114 was PUNCT
" 121 was PUNCT
" 166 stated PUNCT
" 185 stated PUNCT
" 205 are PUNCT
" 220 said PUNCT
" 255 was PUNCT
" 283 said PUNCT
" 380 welcome PUNCT
" 420 said PUNCT
" 204 are PUNCT
" 218 economics PUNCT
" 329 is PUNCT
" 360 said PUNCT
" 61 were PUNCT
" 89 said PUNCT
" 23 level PUNCT
" 25 level PUNCT
" 60 developing PUNCT
" 81 told PUNCT
" 111 reopen PUNCT
" 136 told PUNCT
" 276 gap PUNCT
" 305 nurse PUNCT
" 346 said PUNCT
" 368 said PUNCT
" 402 transcended PUNCT
" 409 speaking PUNCT
" 475 Come PUNCT
" 511 told PUNCT
" 129 went PUN

In [25]:
docs[0][164].head

said

In [20]:
spacy.displacy.render(docs[0][165].sent, style='dep', options={'distance': 100, 'compact':True})

In [95]:
q = find_quotes(nlp, docs)

0 28
0 57
0 165
1 103
1 114
1 311
1 352
2 33
2 62
2 173
2 257
2 326
3 111
3 173
3 207
3 228
3 360
3 429
3 527
3 543
3 594
3 659
4 175
4 237
4 292
4 333
4 431
4 516
4 592
4 735
5 83
6 37
6 157
6 176
7 35
7 109
7 160
7 236
7 263
7 314
7 358
7 493
8 38
8 93
8 182
8 293
8 703
8 757
8 921
8 1122
8 1206
9 22
9 83
9 161
9 208
10 199
11 197
12 13
12 42
12 130
12 222
12 253
12 361
12 384
12 442
12 505
12 605
12 612
12 667
12 703
12 725
13 102
13 128
14 71
14 165
14 238
14 308
15 93
15 263
15 301
15 338
15 366
15 377
15 494
15 578
15 626
15 673
15 779
15 885
16 202
16 237
16 268
16 313
16 366
16 435
16 460
16 576
16 633
16 673
16 677
16 711
16 787
16 854
16 883
17 240
17 246
17 291
18 37
18 71
18 108
18 124
18 170
18 234
18 245
18 288
18 343
19 51
19 250
19 258
19 327
20 79
21 85
21 176
22 83
22 156
22 194
22 353
23 1
23 65
23 118
23 147
23 198
23 228
24 102
25 4
25 34
25 47
25 84
25 211
25 308
25 428
25 454
25 468
26 44
26 214
26 233
26 256
26 295
26 337
27 77
27 83
27 104
27 160
27 210
27 275


In [253]:
q = find_quotes(nlp, docs)

In [254]:
# This might not show *all* quotes. My solution is not perfect either.
print(q[3])

[(told, Hanna Nohynek, "Thinking about it, it could probably be a way to increase the number of people protected, if the quantity of vaccines available in Finland is much less than promised,), (said, Puhakka, "Oulu is a growing city and in certain situations we have to go to so-called virgin areas), (told, Puhakka, "We believe that Oulu's forest nature conservation decisions compensate for the decisions that have to be made in different parts of the city, due to the city's natural growth)]


In [255]:
print(q[207])

[(said, Simola, "My husband immediately said that it was a wolf), (told, Simola, "It was a really touching, amazing moment. Seeing such a wild animal inside the Ring I beltway really made you rub your eyes. It was a rare situation that doesn't happen every day), (said, Simola, "We've talked today about our relationship with wildlife and how even though the wolf may look nice, it is still an unpredictable wild animal that needs to be given its own space), (said, Hallamaa, "The wolf has not growled or shown its teeth. On the contrary, it has apparently tried to escape from humans with its tail between its legs)]


# Further reading

* https://spacy.io/usage/rule-based-matching
* https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/02_pattern_matching.html