# Rule-based systems

**Statistical models**: Statistical models are useful if your application needs to be able to generalize based on a few examples. For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.
- Use cases: spaCy's entity recognizer, dependency parser or part-of-speech tagger

**Rule-based systems**: Rule-based approaches, on the other hand, come in handy when there is a more or less finite number of instances you want to find, e.g. drug names or country names.
- In spaCy: custom tokenization rules (tokenizer), `Matcher`, `PhraseMatcher`
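
As a small illustration of the first tool in that list, custom tokenization rules, the sketch below adds a special case to the tokenizer. It assumes spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`), so no trained model is required. The word "gimme" and the way it is split are only an example.

```python
import spacy
from spacy.symbols import ORTH

# A blank English pipeline contains only the tokenizer (assumption: no trained model needed)
nlp = spacy.blank("en")

# Before adding a rule, "gimme" is kept as a single token
tokens_before = [token.text for token in nlp("gimme that")]

# Add a special-case rule that always splits "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
tokens_after = [token.text for token in nlp("gimme that")]

print(tokens_before)
print(tokens_after)
```

A special case like this applies to an exact string, which makes it a good fit for a short, closed list of exceptions.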

## Matcher

- You can add patterns to the matcher using the `matcher.add()` method. The matcher itself is initialized with the shared vocab.
- Use operators (`'OP'`) to specify how often to match a token.

In [24]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

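# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
# spaCy v2.2.2+ / v3 signature: add(name, list_of_patterns)
matcher.add('LOVE_CATS', [pattern])

# Operators can specify how often a token should be matched
# The `+` operator matches 1 or more times
# Note: this second pattern is only defined, not added to the matcher,
# so it does not contribute to the matches below
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]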

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

In [25]:
print(matches)

[(9137535031263442622, 1, 3)]


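The output is a list of `(match_id, start, end)` tuples. The match ID is the hash of the pattern name, and `start` and `end` are token indices, so `doc[1:3]` is the matched span "love cats".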
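
The `+` pattern in the cell above is defined but never passed to `matcher.add()`. The following sketch shows operators in a matcher that does use them, and maps each match ID back to its pattern name. It assumes spaCy v2.2.2 or later (where `add()` takes a list of patterns) and a blank English pipeline. The pattern names and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, which is enough for TEXT/LOWER patterns
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "+" requires the token one or more times, "?" makes it optional
very_happy = [{"LOWER": "very", "OP": "+"}, {"LOWER": "happy"}]
maybe_so = [{"LOWER": "so", "OP": "?"}, {"LOWER": "tired"}]

# Assumes spaCy v2.2.2 or later, where add() takes a list of patterns
matcher.add("VERY_HAPPY", [very_happy])
matcher.add("MAYBE_SO_TIRED", [maybe_so])

doc = nlp("I'm very very happy but so tired")
results = set()
for match_id, start, end in matcher(doc):
    # The match ID is a hash; the vocab's string store maps it back to the name
    name = nlp.vocab.strings[match_id]
    results.add((name, doc[start:end].text))

for name, text in sorted(results):
    print(name, "->", text)
```

Because operators allow matches of different lengths, the matcher can report overlapping spans for the same pattern, for example a shorter and a longer run of "very".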

In [28]:
# Example 2

matcher = Matcher(nlp.vocab)
matcher.add('DOG', [[{'LOWER': 'golden'}, {'LOWER': 'retriever'}]])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


## PhraseMatcher

Use `PhraseMatcher` to find sequences of words. It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

- More efficient and faster than the `Matcher`
- Great for matching large word lists!

In [30]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', [pattern])
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)


Matched span: Golden Retriever
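
By default the `PhraseMatcher` compares the exact token text, so "golden retriever" in lowercase would not match the pattern above. The sketch below shows one way to make the match case-insensitive using the `attr` argument. It assumes spaCy v2.2.2 or later and a blank English pipeline, and the example sentence is invented.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer-only pipeline (assumption: no trained model needed)

# attr="LOWER" compares lowercase forms instead of the exact token text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DOG", [nlp.make_doc("Golden Retriever")])

doc = nlp("I have a golden retriever and she has a GOLDEN RETRIEVER")
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```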


## Matcher Example

In [32]:
[token.text for token in nlp("ad-free viewing")]

['ad', '-', 'free', 'viewing']

In [33]:
doc = nlp("""Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, 
          is ditching one of its best features: ad-free viewing. According to an email sent out to Amazon Prime members
          today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on
          September 14. However, members with existing annual subscriptions will be able to continue to enjoy ad-free
          viewing until their subscription comes up for renewal. Those with monthly subscriptions will have access to
          ad-free viewing until October 15.""")

for token in doc[:10]:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

Twitch      ADJ       compound  
Prime       PROPN     nsubj     
,           PUNCT     punct     
the         DET       det       
perks       NOUN      compound  
program     NOUN      appos     
for         ADP       prep      
Amazon      PROPN     compound  
Prime       PROPN     compound  
members     NOUN      nsubj     


**Match the following:**

1. "Amazon" plus a title-cased proper noun
2. case-insensitive mentions of "ad-free", plus the following noun

In [34]:
# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', [pattern1])
matcher.add('PATTERN2', [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
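
When several patterns are registered, it can help to summarize the matches per pattern name. The sketch below is a simplified variant of the example above: it runs on a blank English pipeline, so the `POS` constraints are dropped (an assumption made only to avoid needing a trained model), and the text is a shortened, invented sentence.

```python
from collections import Counter

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # no tagger, so the patterns below avoid POS

# Simplified variants of the two patterns (POS constraints dropped as an assumption)
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}]

matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

doc = nlp("Amazon Prime members lose ad-free viewing. Amazon Prime keeps other perks.")

# Count how often each named pattern matched
counts = Counter(nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc))
print(counts)
```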


## PhraseMatcher Example

In [35]:
import pycountry

[country.name for country in pycountry.countries][:6]

['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Åland Islands', 'Albania']

In [36]:
COUNTRIES = [country.name for country in pycountry.countries]

In [37]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', patterns)

In [40]:
doc = nlp("Czechia may help Slovakia protect its airspace")

# Call the matcher on the doc and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czechia, Slovakia]
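
The `PhraseMatcher` only needs tokenized patterns, so the pattern `Doc` objects can also be built with `nlp.make_doc()`, which runs the tokenizer alone. The sketch below illustrates this with a hypothetical three-entry list standing in for the `pycountry` data, including a multi-word name. It assumes spaCy v2.2.2 or later.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Hypothetical stand-in for the pycountry list
COUNTRIES = ["Czechia", "Slovakia", "United Kingdom"]

matcher = PhraseMatcher(nlp.vocab)
# make_doc() only runs the tokenizer, which is all the PhraseMatcher needs
matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])

doc = nlp("Czechia and the United Kingdom may help Slovakia protect its airspace")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```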


## PhraseMatcher to add "GPE" label to country matches

In [62]:
from spacy.tokens import Span

# Create a doc and find matches in it
doc = nlp("""Narnia
Hyderabad 
Terabithia""")

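# Import the PhraseMatcher, initialize it and add the country patterns
# (a matcher without patterns never matches anything)
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
matcher.add('COUNTRY', patterns)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    # (raises a ValueError if the span overlaps an existing entity)
    doc.ents = list(doc.ents) + [span]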

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

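[('Narnia', 'GPE'), ('Hyderabad', 'GPE'), ('Terabithia', 'GPE')]

None of these three names appears in the country list, so the matcher contributes no spans for this text. The GPE labels printed above come from the model's own entity recognizer.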
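
To see the matcher itself produce entities, the sketch below uses a text that does contain country names and a pipeline without an entity recognizer, so every entity comes from the rules. It assumes spaCy v2.2.2 or later. The two-entry country list is a hypothetical stand-in for the full list. `filter_spans` from `spacy.util` is used as one way to avoid the error that overlapping entities would raise.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # no entity recognizer, so doc.ents starts out empty

matcher = PhraseMatcher(nlp.vocab)
# Hypothetical two-entry country list standing in for the full pycountry list
matcher.add("COUNTRY", [nlp.make_doc(name) for name in ["Czechia", "Slovakia"]])

doc = nlp("Czechia may help Slovakia protect its airspace")

# Turn every match into a Span labelled "GPE"
spans = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]

# filter_spans() drops overlapping spans (preferring longer ones), because
# assigning overlapping entities to doc.ents raises an error
doc.ents = filter_spans(list(doc.ents) + spans)

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, the same pattern could be combined with the model's predictions, at which point the overlap filtering starts to matter.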
