-------------
#### Rule-based information extraction
--------------

In [1]:
import re

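def check_phone_number(number):  # avoid shadowing the built-in `str`
    # Expect exactly 3 digits, 3 digits, 4 digits, separated by hyphens
    if re.match(r'^\d{3}-\d{3}-\d{4}$', number):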
        print("Valid phone number")
    else:
        print("Not valid")

In [2]:
check_phone_number("555-555-5555")  # Valid phone number
check_phone_number("123-456-7890")  # Valid phone number
check_phone_number("123-4256-7890") # Not valid

Valid phone number
Valid phone number
Not valid


In [3]:
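def check_phone_number(number):  # avoid shadowing the built-in `str`
    # Relaxed rule: the middle group may now have 3 or 4 digits
    if re.match(r'^\d{3}-\d{3,4}-\d{4}$', number):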
        print("Valid phone number")
    else:
        print("Not valid")

In [4]:
check_phone_number("555-555-5555")  # Valid phone number
check_phone_number("123-456-7890")  # Valid phone number
check_phone_number("123-4256-7890") # Valid phone number

Valid phone number
Valid phone number
Valid phone number
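
Both versions above only print a message. A variant that returns a boolean is easier to reuse, and the same rule can also pull numbers out of free text. The sketch below is one way to do that; the names `PHONE_RE`, `is_phone_number`, and `find_phone_numbers` are illustrative and not part of the original notebook.

```python
import re

# Same shape as the relaxed rule: 3 digits, 3 or 4 digits, 4 digits
PHONE_RE = re.compile(r"\d{3}-\d{3,4}-\d{4}")


def is_phone_number(candidate):
    """Return True if the whole string is shaped like a phone number."""
    return PHONE_RE.fullmatch(candidate) is not None


def find_phone_numbers(text):
    """Return every phone-number-shaped substring found in free text."""
    # The lookarounds stop a match from starting or ending inside a longer digit run
    return re.findall(r"(?<!\d)\d{3}-\d{3,4}-\d{4}(?!\d)", text)


print(is_phone_number("555-555-5555"))
print(find_phone_numbers("Call 555-555-5555 or 123-4256-7890 today"))
```

`fullmatch` is used instead of `^...$` because `$` also matches just before a trailing newline, so `fullmatch` is slightly stricter.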


Example ...

In [5]:
text = """
Name: Bhupen
Address: Doddaballapur Road, Yelahanka

Hi, my name is Bhupen, and I'd like you to extract my name and address from this block of text!
"""

In [6]:
# re.search scans the whole string, so no leading '.*?' is needed
name_match    = re.search(r'Name:\s*(.*?)\n',    text)
address_match = re.search(r'Address:\s*(.*?)\n', text)

name    = None
address = None

if name_match:
    name = name_match.group(1)

if address_match:
    address = address_match.group(1)

In [7]:
print('{} lives at : {}'.format(name, address))

Bhupen lives at : Doddaballapur Road, Yelahanka
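
The two searches above differ only in the label, so they can be folded into one helper. This is a minimal sketch, assuming each field sits on its own line as `Label: value`; the function name `extract_field` and the `sample` string are hypothetical, introduced only for illustration.

```python
import re


def extract_field(label, text):
    """Return the value that follows 'label:' at the start of a line, or None."""
    # [ \t]* (rather than \s*) keeps an empty value from swallowing the next line
    pattern = rf"^{re.escape(label)}:[ \t]*(.*)$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


sample = (
    "\nName: Bhupen\n"
    "Address: Doddaballapur Road, Yelahanka\n"
    "\nHi, my name is Bhupen, and I'd like you to extract my name and address.\n"
)

print(extract_field("Name", sample))
print(extract_field("Address", sample))
print(extract_field("Phone", sample))
```

With `re.MULTILINE`, `^` and `$` anchor to line boundaries, so the sentence containing the lowercase word "name" is never mistaken for a field.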


#### Using spaCy

**token matcher**

In [8]:
import spacy
from spacy.util import filter_spans
from spacy.matcher import Matcher

In [9]:
nlp = spacy.load("en_core_web_sm")

In [10]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

In [11]:
date_rule = {
            "date_rule" : [
                            {"LOWER": "march"}, 
                            {"IS_DIGIT": True}, 
                            {"IS_PUNCT": True}, 
                            {"IS_DIGIT": True}
            ]
}

In [14]:
for rule_name, rule_tags in date_rule.items(): # register rules in matcher
    matcher.add(rule_name, [rule_tags])

In [15]:
# Process some text
doc = nlp("SpaceX's Starlink 17 mission lifts off on a Falcon 9 rocket from Launch \
          Complex 39A at NASA's Kennedy Space Center in Florida, on March 4, 2021")

In [16]:
# Call the matcher on the doc
matches = matcher(doc)

In [17]:
for match_id, start, end in matches:
    print(doc[start:end])

March 4, 2021
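
The rule above is tied to the month `march`. As a hedged sketch of how it might be generalized, the pattern below uses the `IN` set operator to accept any month name and makes the punctuation token optional. It uses `spacy.blank("en")`, which is an assumption made here so that no trained model is needed; that is enough because `LOWER`, `IS_DIGIT`, and `IS_PUNCT` are lexical attributes. The `MONTHS` list and the sample sentence are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline required
matcher = Matcher(nlp.vocab)

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

pattern = [
    {"LOWER": {"IN": MONTHS}},       # any month name, case-insensitive
    {"IS_DIGIT": True},              # day
    {"IS_PUNCT": True, "OP": "?"},   # optional comma
    {"IS_DIGIT": True},              # year
]
matcher.add("date_rule", [pattern])  # add() takes a list of patterns

doc = nlp("Launch on March 4, 2021 and landing on April 12 2021")
dates = [doc[start:end].text for _, start, end in matcher(doc)]
print(dates)
```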


**phrase matching**

The `PhraseMatcher` is useful when a large terminology list has to be matched. 

It is used the same way as the token matcher, but instead of writing token-level rules and patterns, we `input the strings to match`, each converted to a `Doc` object.

In [18]:
import spacy

#import the phrase matcher
from spacy.matcher import PhraseMatcher

#load a model and create nlp object
nlp = spacy.load("en_core_web_sm")

In [19]:
#initialize the matcher with a shared vocab
matcher = PhraseMatcher(nlp.vocab)

In [20]:
#create the list of words to match
fruit_list = ['apple','orange','banana',]

In [21]:
#obtain doc object for each word in the list and store it in a list
patterns = [nlp(fruit) for fruit in fruit_list]

In [22]:
patterns

[apple, orange, banana]

In [23]:
#add the pattern to the matcher
matcher.add("FRUIT_PATTERN", patterns)

In [24]:
#process some text
doc = nlp("An orange contains citric acid and an apple contains banana oxalic acid")

In [25]:
matches = matcher(doc)

In [26]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

orange
apple
banana
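
The fruit list above contains single words in lowercase. The sketch below shows, under stated assumptions, how the same matcher might handle multi-word terms and ignore case by comparing on the `LOWER` attribute. It uses `spacy.blank("en")` and `nlp.make_doc` so that only the tokenizer runs; the term list and the sentence are illustrative.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline required

# attr="LOWER" compares lowercased token text, so matching ignores case
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["citric acid", "oxalic acid"]
# make_doc only tokenizes, which is all the PhraseMatcher needs here
matcher.add("ACID", [nlp.make_doc(term) for term in terms])

doc = nlp("An orange contains Citric Acid and an apple contains oxalic acid")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```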


#### Example - Rule 1

`Comment`: Great smartphone. I love the screen size.

`Important attributes`: `smartphone` and `screen size`.

We can create the rules:
- Smartphone  = Noun
- Screen Size = Compound + Noun

The pattern is a token with the part-of-speech (POS) tag `NOUN`, optionally preceded by a token with the dependency label (DEP) `compound`. 

To make the `compound` token optional, we can use the operator (OP) `?`. 

Let’s call that rule `Noun and compound`

#### Example - Rule 2

- `Comment`: This phone is water resistant?

- `Important attributes`: `phone` and `water resistant`.

We can create the rules:

- `Phone` = Noun
- `Water Resistant` = Noun + Adjective

In [27]:
rules = {
        "Noun and compound": [
            { "DEP": "compound",  "OP": "?" },
            { "POS": "NOUN" }
        ],
        "Noun and adjective": [
            { "POS": "NOUN" },
            { "POS": "ADJ"  }
        ]
}

#### Model

After defining all the rules for our attributes, we need to build the `matcher` that applies them to a text. 

We can create a spaCy `Matcher` object and add all the rules defined previously. 

Now the model is ready for extraction when we input a text.

In [28]:
rule_matcher = Matcher(nlp.vocab)

In [29]:
for rule_name, rule_tags in rules.items(): # register rules in matcher
    rule_matcher.add(rule_name, [rule_tags])

#### Extraction
The model is ready and we’re able to extract attributes using the code listed below.

In [30]:
def extract(text):
    doc = nlp(text)  # Convert string to spacy 'doc' type
    matches = rule_matcher(doc)  # Run matcher

    result = []
    for match_id, start, end in matches:  # For each attribute detected, save it in a list
        attribute = doc[start:end]
        result.append(attribute.text)

    return result

In [31]:
print(extract("Great smartphone. I love the screen size."))

['smartphone', 'screen', 'screen size', 'size']


In [32]:
print(extract("This phone is water resistant?"))

['phone', 'water', 'water resistant']


In [33]:
print(extract("Sound and battery are great"))

['Sound', 'battery']


#### Another example

In [34]:
rules = {
        "Noun and adjective": [
                {'POS': 'DET', 'OP': '?'},
                {'POS': 'ADJ', 'OP': '*'},
                {'POS': 'NOUN'}
        ]
}

In [35]:
def noun_chunks(text):
    doc = nlp(text)
    pattern = [
                {'POS': 'DET', 'OP': '?'},
                {'POS': 'ADJ', 'OP': '*'},
                {'POS': 'NOUN'}
    ]
    matcher = Matcher(nlp.vocab)
    matcher.add('NOUN_PHRASE', [pattern])  # add() expects a list of patterns
    matches = matcher(doc)

    spans = [doc[start:end] for match_id, start, end in matches]

    return filter_spans(spans)

If you have a phrase like `the yellow dog`, you'll get `the yellow dog`, `yellow dog`, and `dog` as matches. 

What we want here is `only the largest matching spans`, which is what `filter_spans` returns.

In [36]:
text = """
It was a rimy morning, and very damp. I had seen the damp lying on the
outside of my little window, as if some goblin had been crying there all
night, and using the window for a pocket-handkerchief. Now, I saw the
damp lying on the bare hedges and spare grass, like a coarser sort of
spiders' webs; hanging itself from twig to twig and blade to blade. On
every rail and gate, wet lay clammy, and the marsh mist was so thick,
that the wooden finger on the post directing people to our village--a
direction which they never accepted, for they never came there--was
invisible to me until I was quite close under it. Then, as I looked up
at it, while it dripped, it seemed to my oppressed conscience like a
phantom devoting me to the Hulks.
""".replace("\n", " ")

In [37]:
for chunk in noun_chunks(text):
    print(chunk)

Note: `Matcher.add` takes a list of patterns. Passing the bare `pattern` instead of `[pattern]` fails with:

ValueError: [E178] Each pattern should be a list of dicts, but got: {'POS': 'DET', 'OP': '?'}. Maybe you accidentally passed a single pattern to Matcher.add instead of a list of patterns? If you only want to add one pattern, make sure to wrap it in a list. For example: `matcher.add('NOUN_PHRASE', [pattern])`
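
To make the "largest matching spans" idea concrete, here is a plain-Python sketch that works on `(start, end)` token offsets instead of spaCy `Span` objects. It follows the documented behavior of `filter_spans` (longer spans win, and on a tie the earlier span wins), but it is an illustration and not spaCy's implementation; the function name `keep_longest_spans` is hypothetical.

```python
def keep_longest_spans(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest first; on equal length, prefer the span that starts earlier
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)

    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))

    return sorted(kept)  # back in document order


# "the yellow dog" -> matches for "the yellow dog", "yellow dog", "dog"
print(keep_longest_spans([(0, 3), (1, 3), (2, 3)]))
```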