## Advanced NLP with spaCy

DataCamp: 6/16 & 6/17/2022

KPR

In [1]:
import spacy
# spacy.cli.download("en_core_web_sm")

In [2]:
from spacy.lang.en import English
nlp = English()

In [3]:
# Read the whole text; the context manager closes the file automatically
with open('Datasets/hound.txt', 'r', encoding="utf-8") as hound:
    hound_text = hound.read()

In [4]:
doc = nlp(hound_text)
token1 = doc[3]
token1.text

'Holmes'

In [5]:
span1 = doc[3:34]
span1

Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table.

In [6]:
subdoc = nlp(span1.text)
print("Indexes: ", [token.i for token in subdoc])
print("Text: ", [token.text for token in subdoc])
print("Is Alpha?: ", [token.is_alpha for token in subdoc])
print("Is Punctuation?: ", [token.is_punct for token in subdoc])
print("Like Num?: ", [token.like_num for token in subdoc])

Indexes:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
Text:  ['Holmes', ',', 'who', 'was', 'usually', 'very', 'late', 'in', 'the', 'mornings', ',', 'save', 'upon', 'those', 'not', 'infrequent', 'occasions', 'when', 'he', 'was', 'up', 'all', 'night', ',', 'was', 'seated', 'at', 'the', 'breakfast', 'table', '.']
Is Alpha?:  [True, False, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, False]
Is Punctuation?:  [False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, True]
Like Num?:  [False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, F

### Modeling

In [7]:
from spacy.lang.en.examples import sentences 

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)


Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


In [8]:
# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


Subsetting a missing term: iPhone X

Running the model on the sample sentence below fails to pick up the term 'iPhone X' as an entity; one way to get at this term is to slice it out of the doc manually.

In [9]:
doc2 = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")
print(doc2.text)
for token in doc2:
    print(token.text, token.pos_, token.dep_)

New iPhone X release date leaked as Apple reveals pre-orders by mistake
New PROPN amod
iPhone PROPN compound
X NOUN compound
release NOUN compound
date NOUN ROOT
leaked VERB acl
as SCONJ mark
Apple PROPN nsubj
reveals VERB advcl
pre ADJ dobj
- NOUN dobj
orders NOUN dobj
by ADP prep
mistake NOUN pobj


In [10]:
for ent in doc2.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc2[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

Apple ORG
Missing entity: iPhone X
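
If the goal is to make the missed term show up in doc.ents, one option is to wrap the slice in a labelled Span and append it to the existing entities. The sketch below is illustrative only: it uses a blank English pipeline (tokenizer only, so no model download is needed), and the 'PRODUCT' label and the `demo_` names are my own assumptions, not something from the course.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so doc.ents starts out empty
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Tokens 1 and 2 are "iPhone" and "X"; the label is an assumed choice
iphone_span = Span(demo_doc, 1, 3, label="PRODUCT")

# Keep whatever entities already exist and add the manual one
demo_doc.ents = list(demo_doc.ents) + [iphone_span]

print([(ent.text, ent.label_) for ent in demo_doc.ents])
```

With a trained pipeline the same assignment should work as long as the new span does not overlap an entity the model already predicted, since doc.ents does not accept overlapping spans.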


### Rule-based Matching

- Matching on document objects, not only strings as with regular expressions
- Match on tokens or token attributes
- Use the model's predictions
- Match on a word only if it is a desired part of speech; e.g., 'duck' as a verb, not a noun

### Match Patterns

- A list of dictionaries, one per token
- Matching the exact token text: e.g. [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
- Matching the lexical attributes: e.g. [{'LOWER': 'iphone'}, {'LOWER': 'x'}]
- Matching token attributes: e.g. [{'LEMMA': 'buy'}, {'POS': 'NOUN'}] <- would match 'buying milk' or 'bought flowers'
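
To see the difference between exact-text and lexical-attribute patterns, here is a small sketch on a blank English pipeline (no trained model needed, because ORTH and LOWER only depend on the tokenizer). The sentence, the pattern names and the `demo_` variables are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

demo_nlp = spacy.blank("en")
case_matcher = Matcher(demo_nlp.vocab)

# ORTH compares the exact token text; LOWER compares the lowercased text
case_matcher.add("EXACT", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])
case_matcher.add("ANY_CASE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

demo_doc = demo_nlp("The iPhone X and the IPHONE x are the same phone")

# Collect (pattern name, matched text) pairs
case_results = sorted(
    (demo_doc.vocab.strings[match_id], demo_doc[start:end].text)
    for match_id, start, end in case_matcher(demo_doc)
)
for name, text in case_results:
    print(name, "->", text)
```

Only the LOWER pattern should pick up 'IPHONE x', while both patterns fire on 'iPhone X'.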


In [11]:
from spacy.matcher import Matcher

match_model = spacy.load("en_core_web_sm")

# init matcher with shared vocabulary
matcher = Matcher(match_model.vocab)

# add a test pattern (Note: this is different syntax from the DataCamp course, which is based on an old version of the library.)
pattern = [[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]]
matcher.add('IPHONE_PATTERN', pattern)
on_match, patterns = matcher.get("IPHONE_PATTERN")

print(on_match)
print(patterns)

None
[[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]]


In [12]:
print(doc2.text)
print()
matches = matcher(doc2)

print('Matches:', [doc2[start:end].text for match_id, start, end in matches])

New iPhone X release date leaked as Apple reveals pre-orders by mistake

Matches: ['iPhone X']


### Operators and quantifiers

- OP: ! // match NOT the token
- OP: ? // match 0 or 1 times
- OP: + // match 1 or more times
- OP: * // match 0 or more times
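
The '!' operator is the least intuitive of the four, so here is a minimal sketch of it. It assumes a blank English pipeline and a made-up sentence; as far as I can tell, a negated token pattern still consumes one token, it just requires that token NOT to match.

```python
import spacy
from spacy.matcher import Matcher

demo_nlp = spacy.blank("en")
op_matcher = Matcher(demo_nlp.vocab)

# One token that is NOT "not", followed by the token "good"
op_matcher.add("NOT_NEGATED", [[{"LOWER": "not", "OP": "!"}, {"LOWER": "good"}]])

demo_doc = demo_nlp("This is very good but that is not good")
negation_matches = [demo_doc[start:end].text for _, start, end in op_matcher(demo_doc)]
print(negation_matches)
```

'very good' is matched, while 'not good' is skipped because the token before 'good' is exactly the one the pattern excludes.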


In [13]:
p2 = [[{'LEMMA': 'buy'},
       {'POS': 'DET', 'OP': '?'}, # match 0 or 1 times
       {'POS': 'NOUN'}]]

matcher.add('PURCHASES', p2)
_, patterns = matcher.get("PURCHASES")

print(patterns)

example = nlp("I bought a smartphone, now I'm buying apps.")
m2 = matcher(example)

# print matches using list-comprehension
print('Matches:', [example[start:end].text for match_id, start, end in m2])

[[{'LEMMA': 'buy'}, {'POS': 'DET', 'OP': '?'}, {'POS': 'NOUN'}]]
Matches: ['bought a smartphone', 'buying apps']


Matching digits

In [15]:
ios_text = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
ios_pattern = [[{'TEXT': 'iOS'}, {'IS_DIGIT': True}]]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', ios_pattern)
ios_matches = matcher(ios_text)
print('Total matches found:', len(ios_matches))

# Iterate over the matches and print the span text
for match_id, start, end in ios_matches:
    print('Match found:', ios_text[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


Matching an "adjective, noun, optional noun" pattern

In [16]:
anydoc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
ann_pattern = [[{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', ann_pattern)
ms = matcher(anydoc)
print('Total matches found:', len(ms))

# Iterate over the matches and print the span text
for match_id, start, end in ms:
    print('Match found:', anydoc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


### Vocabulary, Lexemes and StringStore 

A "lexeme" is an entry in the vocabulary; the example below shows a lexeme's text, its orth (hash), and some other attributes.

In [19]:
weird = nlp("I love coffee and chickens.")
coffee = nlp.vocab['coffee']
chickens = nlp.vocab['chickens']

print(coffee.text, coffee.orth, coffee.is_alpha, coffee.norm, coffee.shape, coffee.lower, coffee.is_title, coffee.is_currency)
print(chickens.text, chickens.orth, chickens.is_alpha, chickens.norm, chickens.shape, chickens.lower, chickens.is_title, chickens.is_currency)

coffee 3197928453018144401 True 3197928453018144401 13110060611322374290 3197928453018144401 False False
chickens 8787911050361940795 True 8787911050361940795 13110060611322374290 8787911050361940795 False False
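
Most of the numbers printed above are hashes, and the StringStore (nlp.vocab.strings) is what maps them back to text. A small sketch of the round trip, assuming a blank English pipeline; the `demo_` and `coffee_` variable names are just for illustration.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("I love coffee and chickens.")

# String -> hash and hash -> string both go through the StringStore
coffee_hash = demo_nlp.vocab.strings["coffee"]
coffee_string = demo_nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)

# Attributes without a trailing underscore are hashes;
# the underscore variants return the readable text
coffee_lex = demo_nlp.vocab["coffee"]
print(coffee_lex.norm_, coffee_lex.shape_, coffee_lex.lower_)
```

This also explains why 'coffee' and 'chickens' share the same shape value in the output above: both have the shape string 'xxxx', so both map to the same hash.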


In [24]:
spacy.cli.download("en_core_web_md")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [30]:
md = spacy.load("en_core_web_md")
gloomy = md("I am so angry; this is horrible.")
happy = md("I love you; it is wonderful to see you.")

horrible = md.vocab['horrible']
wonderful = md.vocab['wonderful']

print(horrible.similarity(wonderful))
print(gloomy.similarity(happy))

0.45024433732032776
0.9204466307952975


While it appears that the model doesn't view the adjectives 'horrible' and 'wonderful' as similar, it DOES seem to view the two sentences as similar overall.  Perhaps because they both express strong emotions?
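
Another possible explanation (my interpretation, not something the course states): by default a Doc's vector is the average of its token vectors, and similarity is the cosine between two vectors, so common words that both sentences share can dominate the average. The sketch below uses plain NumPy with hypothetical 2-D vectors chosen only to make the effect visible; they are not real model vectors.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "word vectors"
function_word = np.array([1.0, 0.0])   # stands in for words like "I", "is", "you"
horrible_vec = np.array([0.0, -1.0])
wonderful_vec = np.array([0.0, 1.0])

# A "document vector" here is the plain average of its token vectors
gloomy_vec = np.mean([function_word, function_word, function_word, horrible_vec], axis=0)
happy_vec = np.mean([function_word, function_word, function_word, wonderful_vec], axis=0)

word_similarity = cosine(horrible_vec, wonderful_vec)
doc_similarity = cosine(gloomy_vec, happy_vec)
print(word_similarity, doc_similarity)
```

In this toy setup the two adjectives are as dissimilar as possible (about -1.0), yet the two averaged "documents" still come out at roughly 0.8 because three of their four tokens are identical.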

In [31]:
scientific = md("The fit of many models can be improved with regularization techniques.")
slang = md("Yo, dude! This is the gnarliest trail I've been on in a while!")

print(scientific.similarity(slang))

0.6588226006975084


I'm surprised these two sentences were deemed that similar.

### Manual creation of Docs and Spans

In [40]:
# Probably not something to use too often

en = English()
from spacy.tokens import Doc, Span

words = ['No', 'Worries', '!']
spaces = [True, False, False]
aussie_ex = Doc(md.vocab, words=words, spaces=spaces)
span = Span(aussie_ex, 0, 2, label="EXCLAMATION")

print(aussie_ex)
print(span)
print(span.label_)
print(aussie_ex[1].orth)

No Worries!
No Worries
EXCLAMATION
15559262463737172580


### More Word and Phrase Matching

Use pattern1 to match all case-insensitive mentions of "Amazon" plus a title-cased proper noun.

Use pattern2 to match all case-insensitive mentions of "ad-free", plus the following noun.

In [66]:
ad_free_amazon = 'Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, is ditching one of its best features: ad-free viewing. According to an email sent out to Amazon Prime members today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on September 14. However, members with existing annual subscriptions will be able to continue to enjoy ad-free viewing until their subscription comes up for renewal. Those with monthly subscriptions will have access to ad-free viewing until October 15.'
phrase = nlp(ad_free_amazon)

# Create the match patterns
pattern1 = [[{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]]
pattern2 = [[{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', pattern1)
matcher.add('PATTERN2', pattern2)

# Iterate over the matches
for match_id, start, end in matcher(phrase):
    # Print pattern string name and text of matched span
    print(phrase.vocab.strings[match_id], phrase[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


This was hard because, at first, I did not understand that you need a FULL DICT entry for each ITEM (token) you want to match.  I kept getting confused, thinking 'Oh, this is a phrase, so the whole phrase should be within the same set of curly braces'...

NO!  {'LOWER': 'each'}, {'LOWER': 'item'}, {'LOWER': 'needs'}, {'TEXT': 'a'}, {'LOWER': 'match'}
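
A quick way to avoid this confusion is to print the tokens first and then write one dict per token. The sketch below assumes a blank English pipeline and a made-up sentence; the `AD_FREE` name is arbitrary.

```python
import spacy
from spacy.matcher import Matcher

demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Ad-free viewing is going away")

# The tokenizer splits the hyphenated word, so the phrase spans four tokens
phrase_tokens = [token.text for token in demo_doc[:4]]
print(phrase_tokens)

# One dict per token, in the same order as the tokens
token_matcher = Matcher(demo_nlp.vocab)
token_matcher.add("AD_FREE", [[{"LOWER": "ad"}, {"TEXT": "-"},
                               {"LOWER": "free"}, {"LOWER": "viewing"}]])

ad_free_matches = [demo_doc[start:end].text for _, start, end in token_matcher(demo_doc)]
print(ad_free_matches)
```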

In [73]:
short = nlp("This thing is very very very silly.")
apattern = [[{'TEXT': 'very', 'OP': '+'}, {'LOWER': 'silly'}]]

m = Matcher(nlp.vocab)
m.add('APATTERN', apattern)

for match_id, start, end in m(short):
    # Print pattern string name and text of matched span
    print(short.vocab.strings[match_id], short[start:end].text)

APATTERN very silly
APATTERN very very silly
APATTERN very very very silly


This is not easy to wrap my head around.  In a regular expression, this would match ONCE - 'very very very silly', not three separate times...
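
If only the longest match is wanted (closer to how a greedy regular expression behaves), Matcher.add accepts a greedy argument. A minimal sketch, assuming a blank English pipeline; the pattern names are arbitrary.

```python
import spacy
from spacy.matcher import Matcher

demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("This thing is very very very silly.")
very_pattern = [[{"TEXT": "very", "OP": "+"}, {"LOWER": "silly"}]]

# Default behaviour: every span that satisfies the pattern is returned
all_matcher = Matcher(demo_nlp.vocab)
all_matcher.add("ALL", very_pattern)
all_spans = [demo_doc[s:e].text for _, s, e in all_matcher(demo_doc)]
print(len(all_spans), "matches without greedy")

# greedy="LONGEST" keeps only the longest of the overlapping candidates
longest_matcher = Matcher(demo_nlp.vocab)
longest_matcher.add("LONGEST_ONLY", very_pattern, greedy="LONGEST")
longest_spans = [demo_doc[s:e].text for _, s, e in longest_matcher(demo_doc)]
print(longest_spans)
```

The default makes sense once you think of the matcher as answering "which spans fit this pattern?" rather than "consume as much as possible": all three spans ending in 'silly' fit.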

### Using the Phrase Matcher

In [77]:
from spacy.matcher import PhraseMatcher

s = "Russia is denying its role in the war in Ukraine"
sdoc = nlp(s)

pm = PhraseMatcher(nlp.vocab)
pm.add("WAR", [nlp("war in Ukraine")])

# Call the matcher on the test document and print the result
matches = pm(sdoc)
print([sdoc[start:end] for match_id, start, end in matches])

[war in Ukraine]
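
By default the PhraseMatcher compares exact token text. A variation that may be useful is matching case-insensitively via attr="LOWER" and building the patterns with nlp.make_doc, which only runs the tokenizer. This is a sketch on a blank English pipeline; the term list and variable names are assumptions for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

demo_nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text instead of the exact text
lower_matcher = PhraseMatcher(demo_nlp.vocab, attr="LOWER")

# make_doc only tokenizes, which is all a phrase pattern needs here
terms = ["war in Ukraine", "Russia"]
lower_matcher.add("TERMS", [demo_nlp.make_doc(term) for term in terms])

demo_doc = demo_nlp("RUSSIA is denying its role in the War in Ukraine")
phrase_hits = sorted(demo_doc[start:end].text for _, start, end in lower_matcher(demo_doc))
print(phrase_hits)
```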


### Custom Component Example

NOTES:

This doesn't work (the output is an empty list) because I never created an animal phrase matcher; the 'matcher' used inside the component is still the Amazon/ad-free Matcher from the earlier cell (TO DO).

The DataCamp code had to be modified using the 'Language Processing Pipelines' reference in the spaCy documentation. A lot has changed in the new version of spaCy, and the DataCamp course has not kept up with these changes.

In [87]:
from spacy.language import Language

@Language.component("animal_component")
def animal_component(doc):
    # Create a Span for each match and assign the label 'ANIMAL'
    # and overwrite the doc.ents with the matched spans
    doc.ents = [Span(doc, start, end, label='ANIMAL')
                for match_id, start, end in matcher(doc)]
    return doc
    
# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe("animal_component", after="ner")

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

[]
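
For the TO DO above, here is one way the missing animal matcher could be wired in. This is a sketch under stated assumptions, not the course solution: it uses a blank English pipeline (so there is no 'ner' component to position against), a tiny hand-written animal list, and a hypothetical component name 'animal_matcher_component' to avoid clashing with the one registered above.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

animal_nlp = spacy.blank("en")

# Assumed term list, just enough for the example sentence
ANIMALS = ["cat", "Golden Retriever"]
animal_matcher = PhraseMatcher(animal_nlp.vocab)
animal_matcher.add("ANIMAL", [animal_nlp.make_doc(name) for name in ANIMALS])

@Language.component("animal_matcher_component")
def animal_matcher_component(doc):
    # Overwrite doc.ents with one 'ANIMAL' span per phrase match
    doc.ents = [Span(doc, start, end, label="ANIMAL")
                for _, start, end in animal_matcher(doc)]
    return doc

animal_nlp.add_pipe("animal_matcher_component")

animal_doc = animal_nlp("I have a cat and a Golden Retriever")
animal_ents = [(ent.text, ent.label_) for ent in animal_doc.ents]
print(animal_ents)
```

In a pipeline that also has 'ner', overwriting doc.ents like this discards the model's entities, so appending to the existing entities may be preferable there.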


### Examples using extensions

Extensions allow additions of custom metadata to documents, tokens and spans.

Accessible via the ._ property - 
- doc._.title = 'My document'
- token._.is_color = True
- span._.has_color = False

Registered using the set_extension method -

Doc.set_extension('title', default=None)

Types 
- Attribute: set a default value that can be overwritten
- Property: allow definition of a getter and setter, getter is called when value is retrieved
- Method: assign a function that becomes available as an object method, lets you pass args to the extension function
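
A small sketch of the first two types (attribute and property), using the extension names from the bullets above. It assumes a blank English pipeline and a made-up sentence; force=True is only there so the cell can be re-run without a "already registered" error.

```python
import spacy
from spacy.tokens import Doc, Token, Span

demo_nlp = spacy.blank("en")

# Attribute extensions: a default value that can be overwritten per object
Doc.set_extension("title", default=None, force=True)
Token.set_extension("is_color", default=False, force=True)

# Property extension: the getter runs every time the value is read
def get_has_color(span):
    return any(token._.is_color for token in span)

Span.set_extension("has_color", getter=get_has_color, force=True)

demo_doc = demo_nlp("The sky is blue.")
demo_doc._.title = "My document"
demo_doc[3]._.is_color = True   # token 3 is "blue"

print(demo_doc._.title)
print(demo_doc[1:4]._.has_color, demo_doc[0:2]._.has_color)
```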

Method example:

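from spacy.tokens import Doc

# Return True if any token in the doc has exactly this text
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Register the function as a method extension: doc._.has_token('blue')
Doc.set_extension('has_token', method=has_token)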

In [89]:
from spacy.tokens import Token

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]
  
# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed:', token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


Wikipedia URL Extension

In [92]:
def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url, force=True)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

Example of a Countries Component - doesn't work because no country matcher (or 'capitals' lookup) is defined

In [97]:

@Language.component("countries_component")
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label='GPE')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline
nlp.add_pipe('countries_component')

# Register capital and getter that looks up the span text in country capitals
Span.set_extension('capital', getter=lambda span: capitals.get(span.text))

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

[]
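
A working version of the idea might look like the sketch below. Everything not in the cell above is an assumption: the blank English pipeline, the two-entry capitals dict (just enough for this sentence), the PhraseMatcher built from its keys, and the hypothetical component name 'country_matcher_component'.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

country_nlp = spacy.blank("en")

# Assumed lookup table, just enough for the example sentence
CAPITALS = {"Czech Republic": "Prague", "Slovakia": "Bratislava"}

country_matcher = PhraseMatcher(country_nlp.vocab)
country_matcher.add("COUNTRY", [country_nlp.make_doc(name) for name in CAPITALS])

@Language.component("country_matcher_component")
def country_matcher_component(doc):
    # Create an entity Span with the label 'GPE' for every match
    doc.ents = [Span(doc, start, end, label="GPE")
                for _, start, end in country_matcher(doc)]
    return doc

country_nlp.add_pipe("country_matcher_component")

# Property extension that looks up the span text in the capitals dict
Span.set_extension("capital", getter=lambda span: CAPITALS.get(span.text), force=True)

country_doc = country_nlp("Czech Republic may help Slovakia protect its airspace")
country_ents = [(ent.text, ent.label_, ent._.capital) for ent in country_doc.ents]
print(country_ents)
```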


### Optimization

Using nlp.pipe to process many texts at once

In [103]:
TEXTS = ["N-nothing important. That is, I heard a good deal about a ring, and a dark lord, and something about the end of the world, but please, Mr. Gandalf, sir, don't hurt me. Don't turn me into anything... unnatural.",
         "There's a dirty great root sticking into my back!", 
         "If I take one more step, it'll be the farthest from home I've ever been.",
         "I wonder if we'll ever be put into exciting stories or tales?",
         "It'll be spring soon, and the orchards will be in blossom. The birds will be nesting in the hazel thicket.",
         "Rosie Cotton, dancing...she had beautiful flowers in her hair."]

docs = list(nlp.pipe(TEXTS))
for doc in docs:
    print([token.text for token in doc if token.pos_ == 'ADJ'])

['important', 'good', 'dark', 'unnatural']
['dirty', 'great']
['more', 'farthest']
['exciting']
['hazel']
['beautiful']
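
For comparison, here is a sketch of calling nlp on each text versus streaming them through nlp.pipe. It assumes a blank English pipeline and an arbitrary batch_size of 2; the speed benefit only really shows with trained components and many texts, so this just checks that both routes give the same tokens.

```python
import spacy

demo_nlp = spacy.blank("en")
texts = ["There's a dirty great root sticking into my back!",
         "I wonder if we'll ever be put into exciting stories or tales?"]

# One call per text
looped_docs = [demo_nlp(text) for text in texts]

# nlp.pipe yields Doc objects lazily and processes the texts in batches
piped_docs = list(demo_nlp.pipe(texts, batch_size=2))

same_tokens = [[t.text for t in a] == [t.text for t in b]
               for a, b in zip(looped_docs, piped_docs)]
print(same_tokens)
```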


### Customization

Loading custom document attribute values

In [106]:
DATA = [('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}),
 ("I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}),
 ('It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}),
 ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}),
 ('It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}),
 ('Nowadays people know the price of everything and the value of nothing.',
  {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'})]

from spacy.tokens import Doc
Doc.set_extension('book', default=None, force=True)
Doc.set_extension('author', default=None, force=True)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 

