# DIGI405 Lab 4.2: spaCy Matcher for Parts of Speech and Dependency Matching

In [None]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher, DependencyMatcher
from IPython.display import display, HTML

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "textcat", "lemmatizer"])

### spaCy Matcher

We are going to start by using part of speech tags for corpus analysis. For this, we'll use the [spaCy Matcher](https://spacy.io/api/matcher). This can be very useful for finding simple patterns. Below, we'll try to match all numbers with a percentage symbol.

In [None]:
text_list1 = [
    "The onshore processing sector makes the largest contribution to employment with about 65% of total employment related to tuna fisheries coming from this sector.",
    "Dr Norman says a 3% levy on income above $70,00 and a business tax rate of 30% for five-and-a-half years could have already raised nearly one-fifth of the total cost of the rebuild of $5.5 billion",
    "The proportion of people agreeing that climate change is a serious issue fell to 36%, from almost 43% last year."
]

In [None]:
docs1 = list(nlp.pipe(text_list1))

In [None]:
matcher1 = Matcher(nlp.vocab)

# match a number followed by a %
pattern1 = [{"POS": "NUM"}, {"ORTH": "%"}]
matcher1.add("symbol", [pattern1])

In [None]:
for idx, doc in enumerate(docs1):
    matches = matcher1(doc)

    # print the results
    print("=== Doc {} ===".format(idx))
    print("Number of matches: ", len(matches))
    if matches:
        print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
    print('\n')
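Each item the matcher returns is a `(match_id, start, end)` tuple: `match_id` is the hash of the rule name, and `start` and `end` are token offsets into the doc. The sketch below unpacks one match and looks the rule name up again. So that it runs without a trained model, it builds a tiny `Doc` by hand; the sentence fragment, the POS tags and the names `nlp_blank`, `toy_doc` and `toy_matcher` are assumptions made for illustration, not part of the lab data.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank English pipeline gives us a vocab without loading a trained model.
nlp_blank = spacy.blank("en")

# Hand-built doc: the POS tags are assumed here, not predicted by a tagger.
words = ["about", "65", "%", "of", "total", "employment"]
spaces = [True, False, True, True, True, False]
pos = ["ADV", "NUM", "NOUN", "ADP", "ADJ", "NOUN"]
toy_doc = Doc(nlp_blank.vocab, words=words, spaces=spaces, pos=pos)

# Same rule as pattern1: a number followed by a % sign.
toy_matcher = Matcher(nlp_blank.vocab)
toy_matcher.add("symbol", [[{"POS": "NUM"}, {"ORTH": "%"}]])

results = []
for match_id, start, end in toy_matcher(toy_doc):
    # the vocab's string store maps the hash back to the rule name
    rule_name = nlp_blank.vocab.strings[match_id]
    results.append((rule_name, start, end, toy_doc[start:end].text))

print(results)
```

Looking up the rule name becomes useful once a matcher holds several rules and you want to report which one fired.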

### Matching Parts of Speech

Say we want to find every VERB that is related to a query word in a list of texts. We can try to use the Matcher for this.

In [1]:
nlp = spacy.load("en_core_web_sm")

In [None]:
# Run through the notebook using the query word 'economists' as shown below before experimenting with your own query
query_word = "economists"

In [None]:
# example news texts
text_list2 = [
    "New Zealand's Sydney correspondent says some economists are now wondering whether reform is possible in the current climate",
    "The anticipated recession and is growing faster than economists expected, inflation hit 40-year highs in 2022",
    "Economists are warning of increased risk of a 'hard landing' - the RBNZ tightening of monetary conditions bringing the economy to a screeching halt",
    "An economist is calling for a new law requiring the next government to keep track of what economists are studying, in case any of it turns out to be useful."
]

In [None]:
docs2 = list(nlp.pipe(text_list2))

In [None]:
# create a matcher using the vocab from spaCy's language model
matcher2 = Matcher(nlp.vocab)

# match a verb directly following query_word
pattern2 = [{"LOWER": query_word}, {"POS": "VERB"}]
matcher2.add("economist", [pattern2])

# match query_word + auxiliary verb + verb (uncomment the two lines below to enable)
#pattern3 = [{"LOWER": query_word}, {"POS": "AUX"}, {"POS": "VERB"}]
#matcher2.add("economist", [pattern3])

In [None]:
for idx, doc in enumerate(docs2):
    matches = matcher2(doc)

    # print the results
    print("=== Doc {} ===".format(idx))
    print("Number of matches: ", len(matches))
    if matches:
        print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
    print('\n')

### Dependency matching

Our basic matcher above initially found only one instance of the query word 'economists' with a related verb. This is because it matches only a verb that directly follows the word 'economists'.

You can uncomment the lines for pattern3 and see how this matches 'economists' + auxiliary verb ('are' in this case) + verb.

This is nice but, as you can see, it doesn't capture our first example: "some economists are now wondering whether...". The problem here is that there are many ways sentences can be put together, and it is possible for the position of the verb to vary a lot. We will have a hard time predicting all these possible variants.
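One partial workaround is to make tokens optional with the Matcher's `OP` key. The sketch below is only an illustration of that idea: it uses a hand-built `Doc` with assumed POS tags, and the names `flex_matcher` and `flex_pattern` are hypothetical. It allows an optional auxiliary and any number of adverbs between the query word and the verb.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")

# Hand-built doc with assumed POS tags (no tagger is run here).
words = ["some", "economists", "are", "now", "wondering", "whether"]
pos = ["DET", "NOUN", "AUX", "ADV", "VERB", "SCONJ"]
toy_doc = Doc(nlp_blank.vocab, words=words, pos=pos)

flex_matcher = Matcher(nlp_blank.vocab)
flex_pattern = [
    {"LOWER": "economists"},
    {"POS": "AUX", "OP": "?"},   # optional auxiliary, e.g. "are"
    {"POS": "ADV", "OP": "*"},   # zero or more adverbs, e.g. "now"
    {"POS": "VERB"},
]
flex_matcher.add("economists_verb", [flex_pattern])

spans = [toy_doc[start:end].text for _, start, end in flex_matcher(toy_doc)]
print(spans)
```

This picks up 'wondering' across the intervening tokens, but the pattern is still tied to word order: every new construction (a verb before the noun, a longer intervening phrase) needs another hand-written variant.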

Instead, we can use a more powerful approach, building on the grammatical dependencies that spaCy can extract. We will also try to match 'economist' singular and 'economists' plural.

### Visualisation of grammatical dependencies

displaCy provides a visualiser for dependencies, allowing you to see the grammatical structure of a sentence.

In [None]:
# display an example - paste in your own text below.
text_to_parse = nlp(
    "New Zealand's Sydney correspondent says some economists are now wondering whether reform is possible in the current climate"
)

html = displacy.render(text_to_parse, options={"fine_grained": True}, jupyter=False)
display(HTML(html))

spaCy's [dependency matcher documentation](https://spacy.io/usage/rule-based-matching#dependencymatcher) explains how to match on a specific relation within the dependency structure.

In the example above, the word 'economists' is a dependent of the verb 'wondering'. We can see this because displaCy draws the arrow from 'wondering' (the head) to 'economists' (the dependent).
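The same relation can be read off the tokens directly with `token.head` and `token.dep_`. The sketch below uses a hand-built parse of part of the sentence, so the heads and dependency labels are assumptions for illustration rather than model output; with the lab's `nlp` pipeline you would iterate over the parsed doc in the same way.

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")

# Hand-built parse (assumed for illustration): heads are absolute token
# indices, and the root token points at itself.
words = ["some", "economists", "are", "now", "wondering"]
heads = [1, 4, 4, 4, 4]
deps = ["det", "nsubj", "aux", "advmod", "ROOT"]
toy_doc = Doc(nlp_blank.vocab, words=words, heads=heads, deps=deps)

# One row per token: the token, its dependency label, and its head.
for token in toy_doc:
    print(f"{token.text:<12}{token.dep_:<8}head: {token.head.text}")
```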

To match this pattern, we start by creating a pattern for the word 'economists' (lowercased, so it will also match if the first letter is a capital E). 

We have also added a regular expression here to capture both singular and plural instances of "economist". The `(s)?` indicates an optional group that will be matched if it is present. In the second part of `pattern4` we then match any verb which is the immediate 'head' node in relation to the word 'economists'.
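One caveat, assuming spaCy's `REGEX` operator searches within the token text rather than requiring a full match (our understanding of its behaviour): an unanchored expression also accepts longer tokens that merely contain "economist". The sketch below shows the difference using Python's `re` module on its own; the candidate words and the anchored variant are illustrative, not part of the lab.

```python
import re

loose = re.compile(r"economist(s)?")    # the expression used in pattern4
strict = re.compile(r"^economists?$")   # anchored to the whole token

candidates = ["economist", "economists", "macroeconomists", "economistic"]

# re.search succeeds if the expression matches anywhere in the string
loose_hits = [w for w in candidates if loose.search(w)]
strict_hits = [w for w in candidates if strict.search(w)]

print(loose_hits)   # all four candidates
print(strict_hits)  # ['economist', 'economists']
```

If the extra tokens are unwanted in your own queries, anchoring the expression with `^` and `$` is one way to exclude them.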

In [None]:
pattern4 = [
  # anchor token: economists (lowercased)
  {
    "RIGHT_ID": query_word,
    "RIGHT_ATTRS": {"LOWER": {"REGEX": "economist(s)?"}}
  },
  # find the verb that is the immediate head of query_word
  # (the verb could appear before or after query_word in the sentence)
  {
    "LEFT_ID": query_word,
    "REL_OP": "<",
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"POS": "VERB"}
  }
]

In [None]:
# create dependency matcher and add patterns

dep_matcher = DependencyMatcher(nlp.vocab)
dep_matcher.add(query_word, [pattern4])

# to add further patterns, define them above,
# then extend the list, as in the next line:
#dep_matcher.add(query_word, [pattern4, pattern5])

In [None]:
for idx, doc in enumerate(docs2):
    matches = dep_matcher(doc)

    # print the results
    print("=== Doc {} ===".format(idx))
    print("Number of matches: ", len(matches))

    for m_index, match in enumerate(matches):
        match_id, token_ids = match
        for i in range(len(token_ids)):
            print(pattern4[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
    print('\n')

### Dependency match operators (REL_OP)

To get started all you need are the four simplest operators below. The full list can be found in the [DependencyMatcher documentation](https://spacy.io/usage/rule-based-matching#dependencymatcher).

| Symbol | Description |
| --- | --- |
| A < B | A is the immediate dependent of B. |
| A > B | A is the immediate head of B. | 
| A << B | A is the dependent in a chain to B following dep → head paths. |
| A >> B | A is the head in a chain to B following head → dep paths. |
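To see how the immediate and chain operators differ, the sketch below runs the same two-node pattern with `<` and then `<<`. It uses a hand-built parse of "economists say growth slowed", so the POS tags, heads and labels are assumed rather than predicted, and the helper `verbs_related_to` is a hypothetical name introduced only for this example.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")

# Hand-built parse of "economists say growth slowed" (assumed, not predicted).
words = ["economists", "say", "growth", "slowed"]
pos = ["NOUN", "VERB", "NOUN", "VERB"]
heads = [1, 1, 3, 1]  # absolute indices; the root points at itself
deps = ["nsubj", "ROOT", "nsubj", "ccomp"]
toy_doc = Doc(nlp_blank.vocab, words=words, pos=pos, heads=heads, deps=deps)


def verbs_related_to(anchor_word, rel_op):
    """Return the verbs linked to anchor_word by the given REL_OP."""
    pattern = [
        {"RIGHT_ID": "anchor", "RIGHT_ATTRS": {"LOWER": anchor_word}},
        {"LEFT_ID": "anchor", "REL_OP": rel_op,
         "RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    ]
    matcher = DependencyMatcher(nlp_blank.vocab)
    matcher.add("demo", [pattern])
    # token_ids follow pattern order: [anchor, verb]
    return sorted(toy_doc[token_ids[1]].text for _, token_ids in matcher(toy_doc))


immediate = verbs_related_to("growth", "<")   # only the direct head
chain = verbs_related_to("growth", "<<")      # every verb up the chain
print(immediate)
print(chain)
```

With `<`, 'growth' is linked only to its direct head 'slowed'; with `<<`, it is also linked to 'say', which sits further up the tree.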


### Create a pipeline for multiple texts

You may have noticed that above we used the `nlp.pipe` method to process a collection of texts. It processes texts in batches, so we can pass it a much longer list of texts.
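When working with a larger collection, it can help to keep each processed doc linked to its source record. A minimal sketch, assuming a blank English pipeline (tokenisation only) in place of the full model and made-up record ids: passing `as_tuples=True` lets `nlp.pipe` carry a context object alongside each text.

```python
import spacy

# A blank pipeline only tokenises; it stands in for the full model here.
nlp_blank = spacy.blank("en")

# Hypothetical (text, metadata) pairs; the ids are made up for illustration.
records = [
    ("Economists are warning of a hard landing.", {"id": "a1"}),
    ("Inflation hit 40-year highs in 2022.", {"id": "a2"}),
    ("Reform may be possible.", {"id": "a3"}),
]

# as_tuples=True makes pipe() yield (doc, context) pairs in input order.
processed = [
    (context["id"], len(doc))
    for doc, context in nlp_blank.pipe(records, as_tuples=True, batch_size=2)
]
print(processed)
```

The same approach should work with the lab's `nlp` object; only the processing time changes.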

Here we will experiment with the Assignment 1 corpus.

In [None]:
import pandas as pd
from collections import Counter

In [None]:
# read in the csv version of the corpus
df = pd.read_csv('cc_all_v2.csv')

In [None]:
df.head()

In [None]:
#texts = df['fulltext']

# sample a smaller corpus
texts = df['fulltext'].sample(frac=0.3, random_state=9)

# NB This step will take a while with lots of texts
# You can comment/uncomment the lines above to use the full dataset
corpus = list(nlp.pipe(texts, batch_size=50))

In [None]:
match_counts = Counter()

for idx, doc in enumerate(corpus):

    matches = dep_matcher(doc)

    for match_id, token_ids in matches:
        # token_ids follow the order of pattern4: index 0 is query_word
        # and index 1 is the verb, so start at 1 and step by 2 to skip
        # the query_word token and count only the verb
        for i in range(len(token_ids))[1::2]:
            match_counts.update([doc[token_ids[i]].text])
            print(pattern4[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)


In [None]:
match_counts.most_common(20)
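A small detail in the counting step: the verb text is wrapped in a list before it is passed to `Counter.update`. The sketch below, using a made-up list of verbs, shows why. A bare string is treated as a sequence of characters, so the counter would tally letters instead of words.

```python
from collections import Counter

verbs = ["say", "warn", "say"]  # made-up verbs for illustration

by_item = Counter()
for verb in verbs:
    by_item.update([verb])   # list of one string: counts the whole word

by_char = Counter()
for verb in verbs:
    by_char.update(verb)     # bare string: counts individual characters

print(by_item.most_common(2))   # [('say', 2), ('warn', 1)]
print(by_char.most_common(1))   # [('a', 3)]
```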

### Tasks

Now try some of the following:

* Use the methods introduced in the first two weeks of the course to explore words / phrases of interest in the Assignment 1 corpus. 
* Select some of these and visualise their dependencies using displaCy.
* Adapt / extend either the Matcher or DependencyMatcher to match different patterns of interest. You should consult the spaCy [documentation on linguistic features](https://spacy.io/usage/linguistic-features) for ideas.
