# Combining Models and Rules

Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox. In this chapter we will look at how to do it with spaCy.

![](../imgs/stat-models-vs-rules-v01.png)

Statistical models are useful if your application needs to be able to generalize based on a few examples. For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person's name. Similarly, you can predict dependency labels to find subject-object relationships. To do this, you will use spaCy's entity recognizer, dependency parser or part-of-speech tagger.

![](../imgs/stat-models-vs-rules-v02.png)

Rule-based approaches, on the other hand, come in handy if there is a more or less finite number of instances you want to find, for example all countries or cities of the world, drug names or even dog breeds. In spaCy you can achieve this with custom tokenization rules as well as the `Matcher` and `PhraseMatcher`.

## Recap: Rule-based Matching

In the last chapter you learned how to use spaCy's rule-based matcher to find complex patterns in your text. Here is a quick recap.

The `Matcher` is initialized with the shared vocabulary, usually `nlp.vocab`.

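```python
import spacy
from spacy.matcher import Matcher

# Load a model to get the shared vocab (the same model is used later in this chapter)
nlp = spacy.load('en_core_web_lg')

# Initialize with the shared vocab
matcher = Matcher(nlp.vocab)
```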

Patterns are lists of dictionaries. Each dictionary describes one token and its attributes.

Patterns can be added to the matcher using the `matcher.add` method:

```python
# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)
```

Operators can specify how often to match a token. For example `+` will match one or more times:

```python
# Operators can specify how often a token should be matched
pattern = [{'ORTH': 'very', 'OP': '+'}, {'ORTH': 'happy'}]
```

Calling the `matcher` on a `doc` object returns a list of the matches. Each match is a tuple consisting of a match ID and the start and end token index of the match in the document:

```python
# Calling matcher on `doc` returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)
```
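To make the recap concrete, here is a minimal plain-Python sketch of the idea behind token patterns: one dictionary per token, compared against a sliding window of tokens. It is not spaCy's implementation, the helper names `token_matches` and `find_matches` are made up for illustration, and it ignores operators such as `+`.

```python
def token_matches(token, spec):
    """Return True if every attribute in the spec equals the token's attribute."""
    return all(token.get(attr) == value for attr, value in spec.items())


def find_matches(name, pattern, tokens):
    """Slide the pattern over the tokens and collect (name, start, end) tuples."""
    size = len(pattern)
    results = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            results.append((name, start, start + size))
    return results


# Toy "tokens": dictionaries carrying two of the attributes a pattern can test
words = "I love cats".split()
tokens = [{'TEXT': w, 'LOWER': w.lower()} for w in words]

pattern = [{'LOWER': 'love'}, {'LOWER': 'cats'}]
matches = find_matches('LOVE_CATS', pattern, tokens)
print(matches)  # [('LOVE_CATS', 1, 3)]
```

As with the real matcher, `end` is exclusive, so `words[1:3]` gives back the matched tokens.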

## Adding statistical predictions

Here is an example of a matcher rule for "Golden Retriever":

In [53]:
import spacy
from spacy.matcher import Matcher

# Load a larger model with vectors
nlp = spacy.load('en_core_web_lg')

matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'lower': 'golden'}, {'lower': 'retriever'}])

doc = nlp("I have a Golden Retriever")

If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span:

In [54]:
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text, span.start, span.end)

Matched span: Golden Retriever 3 5


We can then find out more about it. `Span` objects give us access to the original document and all other token attributes and linguistic features predicted by the model. For example, we can get the span's root token.

If the span consists of more than one token, this will be the token that decides the category of the phrase. For example, the root of "Golden Retriever" is "Retriever".

In [55]:
# Get the span's root token
print('Root token:', span.root.text)

Root token: Retriever


We can also find the head token of the root. This is the syntactic "parent" that governs the phrase, in this case the verb "have":

In [56]:
print('Root head token:', span.root.head.text)

Root head token: have


Finally, we can look at the previous token and its attributes. In this case, it is a determiner, the article "a":

In [57]:
# Get the previous token and its POS tag
print('Previous token:', doc[start - 1].text, doc[start - 1].pos_,'\n')

Previous token: a DET 
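
The three lookups above (root, root head, previous token) can be illustrated without a model. The sketch below uses a hand-written parse of the example sentence, which is an assumption for illustration; in spaCy the heads are predicted by the parser. The helper `span_root` applies a simplified rule (the token whose head lies outside the span) that agrees with the outputs above but may differ from spaCy's own definition in edge cases.

```python
# Hand-written parse for "I have a Golden Retriever" (assumed for illustration;
# in spaCy these heads come from the statistical model).
words = ['I', 'have', 'a', 'Golden', 'Retriever']
heads = [1, 1, 4, 4, 1]  # heads[i] is the index of token i's syntactic parent


def span_root(start, end):
    """Index of the token in [start, end) whose head lies outside the span
    (or is the token itself, as for the sentence root)."""
    for i in range(start, end):
        if not (start <= heads[i] < end) or heads[i] == i:
            return i
    raise ValueError('no root found in span')


start, end = 3, 5  # the matched span "Golden Retriever"
root = span_root(start, end)
root_word = words[root]
root_head_word = words[heads[root]]
previous_word = words[start - 1]
print(root_word, root_head_word, previous_word)  # Retriever have a
```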



## Efficient phrase matching (1)

The phrase matcher is another helpful tool for finding sequences of words in your data. It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

- `PhraseMatcher` is like regular expressions or keyword search, but with access to the tokens!

- It takes `Doc` objects as patterns

- It is also really fast: more efficient and faster than the `Matcher`

- Great for matching large word lists

This makes it very useful for matching large dictionaries and word lists on large volumes of text.

## Efficient phrase matching (2)

Here is an example. The `PhraseMatcher` can be imported from `spacy.matcher` and follows the same API as the regular matcher.

In [58]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

Instead of a list of dictionaries we pass in a `Doc` object as the pattern:

In [59]:
pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)

We can then iterate over the matches in a text, which gives us the match ID and the start and end of each match. This lets us create a `Span` object for the matched tokens "Golden Retriever" to analyze them in context:

In [60]:
doc = nlp("I have a Golden Retriever")
# iterate over the matches
for match_id, start, end in matcher(doc):
    # get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever


Let's try out these new techniques for combining rules with statistical models.

## Debugging patterns (1)
Why does this pattern not match the tokens "Silicon Valley" in the `doc`?
```
pattern = [{'LOWER': 'silicon'}, {'TEXT': ' '}, {'LOWER': 'valley'}]
```
```
doc = nlp("Can Silicon Valley workers rein in big tech from within?")
```

You can try it out in your IPython shell. The `matcher` with the added pattern and the `doc` are already created.



Answer: The tokenizer doesn't create tokens for single spaces, so there's no token with the value ' ' in between.

Correct! The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token.
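
A quick way to see this is to look at the tokens themselves. The sketch below uses a plain whitespace split as a crude stand-in for tokenization (spaCy's tokenizer would additionally split off the final question mark), and checks a two-token version of the pattern by hand. The variable names are made up for illustration.

```python
text = "Can Silicon Valley workers rein in big tech from within?"

# Crude stand-in for tokenization: whitespace separates tokens, it is not a token
tokens = text.split()
has_space_token = ' ' in tokens
print(has_space_token)  # False

# Without the whitespace entry, each dictionary lines up with one token
fixed_pattern = [{'LOWER': 'silicon'}, {'LOWER': 'valley'}]
wanted = [spec['LOWER'] for spec in fixed_pattern]
lowered = [tok.lower() for tok in tokens]
starts = [i for i in range(len(lowered) - 1) if lowered[i:i + 2] == wanted]
print(starts)  # [1]
```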

## Debugging patterns (2)

Both patterns in this exercise contain mistakes and won't match as expected. Can you fix them?

The `nlp` and a `doc` have already been created for you. If you get stuck, try printing the tokens in the `doc` to see how the text will be split and adjust the pattern so that each dictionary represents one token.

- Edit `pattern1` so that it correctly matches all case-insensitive mentions of `"Amazon"` plus a title-cased proper noun.
- Edit `pattern2` so that it correctly matches all case-insensitive mentions of `"ad-free"`, plus the following noun.

In [61]:
with open('../data/match_exercise.txt') as myfile:
    content = myfile.read()

In [62]:
doc = nlp(content)

In [63]:
# Create the match patterns
pattern1 = [{'LOWER': 'Amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad-free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', None, pattern1)
matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

Solution:

In [64]:
# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', None, pattern1)
#matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN1 Amazon Prime


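With `PATTERN2` also added to the matcher (its `matcher.add` call is commented out in the cell above), the expected output is: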

```
<script.py> output:
    PATTERN1 Amazon Prime
    PATTERN2 ad-free viewing
    PATTERN1 Amazon Prime
    PATTERN2 ad-free viewing
    PATTERN2 ad-free viewing
    PATTERN2 ad-free viewing
```

Well done! For the token `'-'`, you can match on the attribute `TEXT`, `LOWER` or even `SHAPE`. All of those are correct. As you can see, paying close attention to the tokenization is very important when working with the token-based `Matcher`. Sometimes it's much easier to just match exact strings instead and use the `PhraseMatcher`, which we'll get to in the next exercise.
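
To see why the hyphenated word needs three dictionaries, it helps to print how the text is split. The sketch below uses a rough regex tokenizer as a stand-in; it only approximates spaCy's tokenizer, but it splits this example the same way the solution pattern assumes. `rough_tokenize` is a hypothetical helper, not a spaCy function.

```python
import re


def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = rough_tokenize("ad-free viewing")
print(tokens)  # ['ad', '-', 'free', 'viewing']

# The broken pattern describes two tokens, the fixed one describes four
broken = [{'LOWER': 'ad-free'}, {'POS': 'NOUN'}]
fixed = [{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]
print(len(broken), len(fixed), len(tokens))  # 2 4 4
```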

## Efficient phrase matching
Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world.

We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable `COUNTRIES`. The `nlp` object and a test doc have already been created and the `doc.text` has been printed to the shell.

- Import the `PhraseMatcher` and initialize it with the shared vocab as the variable `matcher`.
- Use the `nlp` object to create one phrase pattern per country in `COUNTRIES`.
- Call the matcher on the `doc`.

In [68]:
nlp = spacy.load('en')

doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
patterns = [nlp(country) for country in COUNTRIES]
matcher.add('COUNTRY', None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

NameError: name 'COUNTRIES' is not defined

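The cell fails because the `COUNTRIES` list is provided by the course environment and is not defined in this notebook. With the list available, the expected output is: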

```
<script.py> output:
    [Czech Republic, Slovakia]
```
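For readers without the course environment, the following plain-Python sketch shows the idea of exact phrase matching over tokens. The `COUNTRIES` list here is a short hypothetical stand-in, not the course's list, and `phrase_matches` is an illustrative helper that looks up token n-grams in a set; it is not how spaCy's `PhraseMatcher` is implemented.

```python
# Hypothetical stand-in for the course's COUNTRIES list
COUNTRIES = ['Czech Republic', 'Slovakia', 'Germany']


def phrase_matches(tokens, phrases):
    """Return (start, end) token offsets of every exact phrase occurrence."""
    lookup = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(phrase) for phrase in lookup})
    found = []
    for start in range(len(tokens)):
        for size in lengths:
            end = start + size
            if end <= len(tokens) and tuple(tokens[start:end]) in lookup:
                found.append((start, end))
    return found


tokens = "Czech Republic may help Slovakia protect its airspace".split()
matches = phrase_matches(tokens, COUNTRIES)
spans = [' '.join(tokens[start:end]) for start, end in matches]
print(spans)  # ['Czech Republic', 'Slovakia']
```

Because each candidate window is checked with a set lookup, the cost per window does not grow with the number of phrases, which hints at why exact phrase matching scales well to large word lists.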

## Extracting countries and relationships
In the previous exercise, you wrote a script using spaCy's `PhraseMatcher` to find country names in text. Let's use that country matcher on a longer text, analyze the syntax and update the document's entities with the matched countries. The `nlp` object has already been created.

The text is available as the variable `text`, the `PhraseMatcher` with the country patterns as the variable `matcher`. The `Span` class has already been imported.

- Iterate over the matches and create a `Span` with the label `"GPE"` (geopolitical entity).
- Overwrite the entities in `doc.ents` and add the matched span.

In [72]:
text = "In Germany the trees are green. In U.S.A the cars are black. In Spain the"
# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label='GPE')

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    
# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

[('Germany', 'GPE'), ('U.S.A', 'GPE'), ('Spain', 'GPE')]
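
One thing to watch when appending to `doc.ents` is that entity spans must not overlap, so a matched span that collides with an entity the model already predicted needs to be resolved first. The sketch below shows one possible policy in plain Python, with entities as `(start, end, label)` tuples: the new span wins and any overlapping existing entity is dropped. The data and the helper `add_entity` are hypothetical.

```python
def add_entity(ents, new):
    """Return ents plus new, dropping existing entities that overlap new."""
    start, end, _label = new
    kept = [ent for ent in ents if ent[1] <= start or ent[0] >= end]
    return sorted(kept + [new])


# Hypothetical existing entities as (start, end, label), end exclusive
ents = [(1, 2, 'GPE'), (8, 10, 'ORG')]
new = (9, 10, 'GPE')  # overlaps the ORG entity at tokens 8-9

merged = add_entity(ents, new)
print(merged)  # [(1, 2, 'GPE'), (9, 10, 'GPE')]
```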


- Update the script and get the matched span's root head token.
- Print the text of the head token and the span.

In [81]:
text = "In Germany the trees are green. In U.S.A the cars are black. In Spain the"
# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE" and overwrite the doc.ents
    span = Span(doc, start, end, label='GPE')
    doc.ents = list(doc.ents) + [span]
    
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, '-->', span.text)

Expected output (from the course's longer text, not the short `text` defined above):
```
Haiti --> Haiti
Mozambique --> Mozambique
Somalia --> Somalia
```

Well done! Now that you've practiced combining predictions with rule-based information extraction, you're ready for chapter 3, which will teach you everything about spaCy's processing pipelines.

You have finished the chapter "Large-scale data analysis with spaCy"!