## The Strengths of RegEx

There are several strengths to RegEx.

- Its compact syntax lets programmers express robust rules in very little space.
- It lets the researcher capture many kinds of variation in strings.
- It can perform remarkably quickly when compared to other methods.
- It is supported in nearly every programming language.


## The Weaknesses of RegEx

Despite these strengths, there are a few weaknesses to RegEx.

- Its syntax is quite difficult for beginners. (I still find myself looking up how to do certain things.)
- In order to work well, it requires a domain expert to work alongside the programmer to think of all the ways a pattern may vary in texts.
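
To make the second point concrete, here is a minimal sketch of how easily a pattern misses variation it was not written for. The sample sentence and the names `MONTHS`, `narrow`, and `broader` are illustrative assumptions, not part of the original notebook.

```python
import re

# Illustrative month list; a real project might also need abbreviations
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"

# Only matches the form "2 February"
narrow = r"\d{1,2} (?:" + MONTHS + ")"

# Also accepts an ordinal suffix and an optional "of", e.g. "3rd of March"
broader = r"\d{1,2}(?:st|nd|rd|th)?(?: of)? (?:" + MONTHS + ")"

sample = "We met on 2 February, again on the 3rd of March, and on 14th August."

narrow_hits = re.findall(narrow, sample)
broader_hits = re.findall(broader, sample)

print(narrow_hits)   # ['2 February']
print(broader_hits)  # ['2 February', '3rd of March', '14th August']
```

Even the broader pattern would still miss forms such as "Feb. 2", which is exactly where a domain expert's knowledge of the sources helps.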

In [1]:
import re

In [11]:
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

text = "This is a date 2 February. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)

[('2 February', '2', 'February'), ('14 August', '4', 'August')]


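Notice, however, that we have a lot of superfluous information for each match. Because the pattern contains capture groups (the parentheses), `findall` returns a tuple for every match: the full match from the outermost group, followed by the content of each inner group. A repeated group such as `(\d){1,2}` keeps only its last repetition, which is why we see '4' rather than '14'. There are several ways we can remove these extra components. One way is to use `finditer` rather than `findall`.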

In [4]:
iter_matches = re.finditer(pattern, text)
iter_matches

<callable_iterator at 0x7f45b8ad0bb0>

In [9]:
# This is an iterator object. We can loop over it, however, and get our results.
iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    print(hit)

<re.Match object; span=(15, 25), match='2 February'>
<re.Match object; span=(49, 58), match='14 August'>


In [12]:
iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print(text[start:end])

2 February
14 August
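
Another way to drop the extra components, sketched here as an alternative to slicing the text, is to avoid capturing in the first place. The non-capturing group syntax `(?:...)` and `Match.group(0)` are both standard features of Python's `re` module; the variable names below are chosen for this example only.

```python
import re

date_text = "This is a date 2 February. Another date would be 14 August."
months = "January|February|March|April|May|June|July|August|September|October|November|December"

# (?:...) groups the alternatives without capturing them,
# so findall returns the whole match as a plain string
flat_pattern = r"\d{1,2} (?:" + months + ")"
flat_matches = re.findall(flat_pattern, date_text)
print(flat_matches)  # ['2 February', '14 August']

# With the original capturing pattern, group(0) is always the full match
capturing_pattern = r"((\d){1,2} (" + months + "))"
full_matches = [hit.group(0) for hit in re.finditer(capturing_pattern, date_text)]
print(full_matches)  # ['2 February', '14 August']
```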


# How to Use RegEx in spaCy

Things like dates, times, and IP addresses, which have consistent or fairly consistent structures, are excellent candidates for RegEx. Fortunately, spaCy has easy ways to implement RegEx in three pipes: Matcher, PhraseMatcher, and EntityRuler. One of the major drawbacks of the Matcher and PhraseMatcher is that they do not store their matches in `doc.ents`. Because this textbook is about NER and our goal is to store the entities in `doc.ents`, we will focus on using RegEx with the EntityRuler. In the next notebook, we will examine other methods.

In the previous notebook, we saw how the code below allowed us to capture the phone number in the string. I have modified it a bit here for reasons that will become clearer below.

In [13]:
#Import the requisite library
import spacy

#Sample text
text = "This is a sample number 555-5555."

#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [
                    {"SHAPE": "ddd"},
                    {"ORTH": "-", "OP": "?"},
                    {"SHAPE": "dddd"}
                ]}
            ]
#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)



555-5555 PHONE_NUMBER


In [14]:
# let’s write some RegEx to capture 555-5555.
pattern = r"((\d){3}-(\d){4})"
text = "This is a sample number 555-5555."
matches = re.findall(pattern, text)
print(matches)

[('555-5555', '5', '5')]


Okay. So, now we know that we have a RegEx pattern that works. Let’s try to implement it in the spaCy EntityRuler with the code below. When we execute it, we get no output.

In [15]:
#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

This is for one very important reason. SpaCy’s EntityRuler applies a RegEx to the text of one token at a time, so it cannot pattern match across tokens. The tokenizer splits 555-5555 into three tokens (555, -, and 5555), and none of them matches the full pattern on its own. So, what are we to do in this scenario? Well, we have a few different options that we will explore in the next notebook. But before we get to that, let’s try to use RegEx to capture the phone number with no hyphen.

In [19]:
#Sample text
text = "This is a sample number 5555555."
#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){7})"}}
                                                        ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

5555555 PHONE_NUMBER
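
Returning to the hyphenated number, the sketch below first prints the tokens to show where the tokenizer splits the text, and then tries one possible workaround: giving each token its own RegEx. This is only one option under the assumption that the tokenizer produces three tokens for the number; the `demo_` variable names are made up for this example.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
demo_nlp = spacy.blank("en")
demo_text = "This is a sample number 555-5555."

# Inspect how the tokenizer splits the phone number
demo_tokens = [token.text for token in demo_nlp(demo_text)]
print(demo_tokens)

# One dictionary per token: each REGEX is tested against a single token's text
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([
    {"label": "PHONE_NUMBER", "pattern": [
        {"TEXT": {"REGEX": r"^\d{3}$"}},
        {"ORTH": "-"},
        {"TEXT": {"REGEX": r"^\d{4}$"}},
    ]}
])

demo_doc = demo_nlp(demo_text)
phone_ents = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(phone_ents)
```

The trade-off is that the pattern now depends on the tokenizer's behavior, so it is worth printing the tokens whenever a pattern unexpectedly fails to match.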


---
# Working with Multi-Word Token Entities and RegEx in spaCy

We can use spaCy’s Matcher to grab multi-word tokens, that is, matches that span multiple tokens. The main problem with this, however, is that these multi-word tokens are not placed into `doc.ents`. This means that we cannot access them the same way we would other entities.
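
As a minimal sketch of that limitation, the example below runs a Matcher over a sample sentence and then checks `doc.ents`. The rule name `PAUL_NAME`, the token pattern, and the `matcher_` variable names are assumptions made for illustration.

```python
import spacy
from spacy.matcher import Matcher

matcher_nlp = spacy.blank("en")
matcher_doc = matcher_nlp(
    "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
)

matcher = Matcher(matcher_nlp.vocab)
# The token "Paul" followed by any title-cased token
matcher.add("PAUL_NAME", [[{"TEXT": "Paul"}, {"IS_TITLE": True}]])

# Each match is a (match_id, start_token, end_token) tuple
found = [matcher_doc[start:end].text for _, start, end in matcher(matcher_doc)]
print(found)

# The matches exist, but nothing has been written to doc.ents
print(matcher_doc.ents)
```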


## Extract Multi-Word Tokens

First, we need to grab the multi-word tokens. In this notebook, we are going to try to grab one kind of multi-word token: a person whose first name is Paul. In the RegEx below, we specify that we are looking for any string that starts with “Paul”, followed by a space and a capital letter. We then tell it to grab the rest of that second word.

In [20]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

pattern = r"Paul [A-Z]\w+"

matches = re.finditer(pattern, text)

for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


Note that we have not grabbed the final “Paul” which is not followed by a last name. In this case, we are not interested in that Paul. Now that we know how to grab the multi-word tokens, we need to have a way to parse them in spaCy.

## Reconstruct Spans

This next stage is a bit more complicated, but works quite well once you understand the process. First, we need to import the libraries we will need. Note that we are also adding Span from spacy.tokens.

In [21]:
from spacy.tokens import Span

In [22]:
nlp = spacy.blank("en")
doc = nlp(text)

In [23]:
# Even though this step is unnecessary here (a blank pipeline finds no entities),
# it is good practice because in other situations you will already have entities.
# If you do, you need to store them in a separate list to which we will append things.
original_ents = list(doc.ents)

In [27]:
mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

mwt_ents

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
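
The `if span is not None` check above matters because `doc.char_span` returns `None` when the character offsets do not line up with token boundaries. The sketch below shows this with a deliberately misaligned range; the sample sentence and the `span_` variable names are illustrative, and `alignment_mode` is an optional `char_span` argument in spaCy 3.

```python
import spacy

span_nlp = spacy.blank("en")
span_doc = span_nlp("Paul Newman was an American actor.")

# Characters 0-8 cover "Paul New", which ends in the middle of the token "Newman"
misaligned = span_doc.char_span(0, 8)
print(misaligned)  # None

# "expand" widens the range to the nearest token boundaries instead of failing
expanded = span_doc.char_span(0, 8, alignment_mode="expand")
print(expanded.text)  # Paul Newman
```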

## Inject the Spans into `doc.ents`

With that data, we can iterate over each entity and identify where it begins and ends in spaCy. Note, we are using the spaCy Span class. This allows us to create a span object and assign it a custom label. With this data, we can append each Span to `original_ents`.

In [28]:
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

original_ents

[Paul Newman, Paul Hollywood]

In [29]:
doc.ents = original_ents

In [32]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


## Give Priority to Longer Spans

Sometimes, the situation is not so neat. Sometimes our custom RegEx entities will overlap with the entities that spaCy’s model finds.

In [33]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
pattern = r"Hollywood"

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP


Let’s say that we create a new entity type, perhaps for words associated with cinema, so we want to give Hollywood the label “CINEMA”. Now, in the above text, Hollywood is clearly part of the name Paul Hollywood, but let’s imagine for a moment that it is not. Let’s try to run the same code as above. If we do, we get an error.

In [34]:
mwt_ents = []
original_ents = list(doc.ents)
for match in re.finditer(pattern, doc.text):
    print(match)
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
        
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

doc.ents = original_ents

<re.Match object; span=(44, 53), match='Hollywood'>


ValueError: ignored

This error tells us that one of the spans from `finditer()` overlapped with one that the “ner” component found, and `doc.ents` cannot contain overlapping spans. This problem can be rectified with spaCy’s `filter_spans`, which gives primacy to longer spans. Notice below how the Paul Hollywood entity remains a PERSON rather than becoming CINEMA. This is because Hollywood is shorter than Paul Hollywood.

In [35]:
from spacy.util import filter_spans

In [37]:
filtered = filter_spans(original_ents)
doc.ents = filtered

for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
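
The steps in this notebook can be gathered into one function. The sketch below is a hypothetical helper, `add_regex_ents`, not part of spaCy. To stay runnable without a trained model, it uses a blank pipeline and sets the PERSON entity by hand as a stand-in for what the “ner” component found.

```python
import re

import spacy
from spacy.tokens import Span
from spacy.util import filter_spans


def add_regex_ents(doc, pattern, label):
    """Hypothetical helper: label RegEx matches and merge them into doc.ents."""
    new_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        # char_span returns None when the match does not line up with tokens
        span = doc.char_span(start, end, label=label)
        if span is not None:
            new_ents.append(span)
    # Where spans overlap, filter_spans keeps the longest one
    doc.ents = filter_spans(list(doc.ents) + new_ents)
    return doc


helper_nlp = spacy.blank("en")
helper_doc = helper_nlp(
    "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."
)
# Stand-in for a model prediction: tokens 8 and 9 are "Paul Hollywood"
helper_doc.ents = [Span(helper_doc, 8, 10, label="PERSON")]

add_regex_ents(helper_doc, r"Hollywood|actor", "CINEMA")
helper_ents = [(ent.text, ent.label_) for ent in helper_doc.ents]
print(helper_ents)  # [('actor', 'CINEMA'), ('Paul Hollywood', 'PERSON')]
```

Here “actor” is added because it overlaps nothing, while the shorter “Hollywood” span loses to the existing “Paul Hollywood” entity.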
