Phrase Matching and Vocabulary

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

# spaCy provides a rule-based matching tool called Matcher. It lets you build a library of token patterns
# and then match those patterns against a Doc object, returning a list of the matches found.
# A pattern can test any part of a token, including its text and annotations, and you can add
# multiple patterns to the same matcher.


In [3]:
# Import the Matcher class
from spacy.matcher import Matcher

In [4]:
# Create the Matcher object; it shares the vocabulary of the nlp pipeline
matcher = Matcher(nlp.vocab)

In [5]:
# Next, define the patterns to match. Each pattern is a list of dictionaries, one dictionary per token.
# The goal is to detect all three spellings: SolarPower, Solar-power and Solar power.
# pattern1 matches a single token whose lowercase form is 'solarpower'.
# pattern2 matches 'solar', then a punctuation token such as a dash (IS_PUNCT set to True), then 'power'.
# pattern3 matches a token whose lowercase form is 'solar' immediately followed by one whose lowercase form is 'power'.
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]
# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]
# Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]
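
To make the pattern semantics concrete, here is a minimal pure-Python sketch of what the three patterns above check. It is a toy illustration, not spaCy's implementation: the token list is split by hand, `token_matches` and `find_matches` are hypothetical helper names, and IS_PUNCT is approximated with ASCII `string.punctuation` only.

```python
import string

def token_matches(token, spec):
    """Return True if one token string satisfies every key in one pattern dict."""
    for key, value in spec.items():
        if key == 'LOWER' and token.lower() != value:
            return False
        if key == 'IS_PUNCT':
            # Assumption: a token counts as punctuation if all its characters are ASCII punctuation
            is_punct = all(ch in string.punctuation for ch in token)
            if is_punct != value:
                return False
    return True

def find_matches(tokens, patterns):
    """Slide each pattern over the token list and collect (start, end) spans."""
    found = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)  # one dict per token, so the span length is fixed
            window = tokens[start:end]
            if end <= len(tokens) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                found.append((start, end))
    return found

# Hand-tokenized example text (an assumption; a real Doc is produced by nlp(...))
tokens = ['Solar', 'Power', 'and', 'solarpower', 'and', 'Solar', '-', 'power']
patterns = [
    [{'LOWER': 'solarpower'}],
    [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}],
    [{'LOWER': 'solar'}, {'LOWER': 'power'}],
]

spans = find_matches(tokens, patterns)
print(spans)  # [(0, 2), (3, 4), (5, 8)]
```

Each dictionary tests exactly one token, which is why the two-word and the hyphenated spellings need separate patterns here.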


In [6]:
# Add the three patterns to the matcher under the name 'SolarPower'.
# The second argument is the callback; None means no callback is run on a match.
# This is the spaCy v2 call signature; spaCy v3 expects the patterns as a single list instead:
# matcher.add('SolarPower', [pattern1, pattern2, pattern3])
matcher.add('SolarPower',None,pattern1,pattern2,pattern3)

In [7]:
# With the 'SolarPower' patterns in place, create a document and check whether the matcher
# finds the various spellings.
doc = nlp(u"The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing")

In [8]:
# Call the matcher on the Doc object and store the result in found_matches.
found_matches = matcher(doc)

In [9]:
# Printing found_matches shows a list of tuples, each holding three pieces of information.
# The first value (8656102463236116519) is the match ID, which is the hash of the pattern name 'SolarPower'.
# The second and third values are the start and end of the match, given as token positions (for example 1 and 3),
# not character offsets. The end position is exclusive.
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]


In [10]:
# Loop over the matches and print a readable representation of each one
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power
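
The start and end values behave like ordinary Python slice bounds, which is why `doc[start:end]` returns exactly the matched tokens. A minimal sketch, using a plain list of hand-split tokens as a stand-in for the Doc object (an assumption for illustration):

```python
# Stand-in for the first four tokens of the document (split by hand)
tokens = ['The', 'Solar', 'Power', 'industry']

# First match reported above: start=1, end=3
start, end = 1, 3

span_tokens = tokens[start:end]   # the end index is exclusive, as in any Python slice
print(span_tokens)                # ['Solar', 'Power']
print(end - start)                # 2, the number of tokens in the match
```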


In [11]:
# To discard a set of patterns, for example when you want to replace the 'SolarPower' patterns entirely
# rather than add more patterns under the same name, remove them from the matcher by name:
matcher.remove('SolarPower')

In [13]:
# A token rule can be made optional or repeatable by adding an 'OP' key, which lets us streamline the pattern list.
# With the old 'SolarPower' patterns removed, create a new set.
# Setting 'OP' to '*' (an asterisk) allows that token rule to match zero or more times.
# pattern1 still finds 'solarpower' written as one word, in any casing.
pattern1 = [{'LOWER':'solarpower'}]
# pattern2 matches 'solar', then any number of punctuation tokens (a single dash, a double dash, or none at all),
# and then 'power'.
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]
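
The effect of `'OP': '*'` can be sketched in plain Python. This is a toy model of the "zero or more" semantics, not spaCy's matching algorithm: `match_ends` and `find_spans` are hypothetical helpers, the token lists are split by hand, and IS_PUNCT is approximated with ASCII `string.punctuation`.

```python
import string

def token_matches(token, spec):
    """Check one token string against the LOWER and IS_PUNCT keys of one pattern dict."""
    if 'LOWER' in spec and token.lower() != spec['LOWER']:
        return False
    if 'IS_PUNCT' in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec['IS_PUNCT']:
            return False
    return True

def match_ends(tokens, pattern, pos):
    """Return every end position at which pattern can finish when it starts at pos."""
    if not pattern:
        return {pos}
    spec, rest = pattern[0], pattern[1:]
    ends = set()
    if spec.get('OP') == '*':
        # Zero occurrences: skip this rule entirely
        ends |= match_ends(tokens, rest, pos)
        # One more occurrence: consume a token and stay on the same rule
        if pos < len(tokens) and token_matches(tokens[pos], spec):
            ends |= match_ends(tokens, pattern, pos + 1)
    elif pos < len(tokens) and token_matches(tokens[pos], spec):
        ends |= match_ends(tokens, rest, pos + 1)
    return ends

def find_spans(tokens, pattern):
    """Collect (start, end) spans for every start position in the token list."""
    return [(start, end)
            for start in range(len(tokens))
            for end in sorted(match_ends(tokens, pattern, start))]

pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]

print(find_spans(['solar', 'power'], pattern))            # [(0, 2)]  zero punctuation tokens
print(find_spans(['Solar', '--', 'power'], pattern))      # [(0, 3)]  one punctuation token
print(find_spans(['Solar', '-', '-', 'power'], pattern))  # [(0, 4)]  two punctuation tokens
```

Because the punctuation rule may match zero times, this single pattern covers both the hyphenated and the two-word spelling, so a separate 'solar power' pattern is no longer needed.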


In [15]:
matcher.add('SolarPower',None,pattern1,pattern2)

In [16]:
# With the new 'SolarPower' patterns added, create a new document, doc2
doc2 = nlp(u"Solar--power is solarpower yay!")

In [17]:
found_matches = matcher(doc2)

In [18]:
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]


This is how I created my own token patterns and matched on them using spaCy.