# Chapter 1. Regular Expressions

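## Table of regular expression patterns

| pattern | matches | example |
|---|---|---|
| `\w+` | word | 'Magic' |
| `\d` | digit | 9 |
| `\s` | space | ' ' |
| `\S` | non-space | 'no_space' |
| `.*` | wildcard | 'username74' |
| `+` or `*` | greedy match | 'aaaaa' |
| `[a-z]` | lowercase group | 'abced' |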
     

### About re.match and re.search

re.match: matches only at the beginning of the string;

re.search: scans the whole string and returns the first match found anywhere in it
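
The difference between the two functions can be sketched with the standard `re` module. The sample string and variable names here are made up for illustration.

```python
import re

sentence = "hello world 42"

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"\d+", sentence)       # no digits at the start
word_at_start = re.match(r"\w+", sentence)  # a word is at the start

# re.search scans forward and returns the first match anywhere in the string.
anywhere = re.search(r"\d+", sentence)

print(at_start)
print(word_at_start.group())
print(anywhere.group(), anywhere.start())
```

Both functions return `None` when nothing matches, so check the result before calling `.group()`.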

In [1]:
import re
import nltk

In [4]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Introduction to tokenization

#### Definition: Turning a string into smaller chunks (tokens)

#### Usually use nltk library



In [10]:
#Example
from nltk.tokenize import word_tokenize

a = word_tokenize('Hi there!')
a

['Hi', 'there', '!']

In [None]:
sentence_endings = r"[.?!]"

In [12]:
from nltk.tokenize import regexp_tokenize,TweetTokenizer
tweets = ['This is the best #nlp exercise ive found online! #python']
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1) #Note: the argument order (text, pattern) is the reverse of re.match, re.search and re.findall (pattern, text)
print(hashtags)

['#nlp', '#python']


In [None]:
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer() #the constructor takes options only; the text is passed to tokenize()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)


# Chapter 2. Word Counts with bag-of-words

## Bag of words:

1. Basic method of finding topics in a text

2. Need to first create tokens using tokenization

3. then count up all the tokens -> the more frequent a word, the more important it might be


In [19]:
from nltk.tokenize import word_tokenize

from collections import Counter

counter = Counter(word_tokenize('The cat is the box. The cat likes the box. The box is over the cat.'))

counter


Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'the': 3,
         'box': 3,
         '.': 3,
         'likes': 1,
         'over': 1})

In [20]:
counter.most_common(2)

[('The', 3), ('cat', 3)]

## Simple text preprocessing

### Examples:

1. Tokenization to create a bag of words

2. Lowercasing words

3. Lemmatization/Stemming: shorten words to their root stems
    
4. Removing stop words, punctuation, or unwanted tokens

In [24]:
from nltk.corpus import stopwords

text  = 'The cat is the box. The cat likes the box. The box is over the cat.'

#Tokenization
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()] #.isalpha will return True if the string only has alphabetical characters

tokens

['the',
 'cat',
 'is',
 'the',
 'box',
 'the',
 'cat',
 'likes',
 'the',
 'box',
 'the',
 'box',
 'is',
 'over',
 'the',
 'cat']

In [26]:
#Remove stop words

no_stop = [t for t in tokens if t not in stopwords.words('english')]

no_stop

['cat', 'box', 'cat', 'likes', 'box', 'box', 'cat']

In [28]:
#Create a Counter

counter = Counter(no_stop).most_common(2)

counter

[('cat', 3), ('box', 3)]

In [None]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stop]

### Introduction to Gensim

In [2]:
from gensim.corpora.dictionary import Dictionary

from nltk.tokenize import word_tokenize

In [3]:
my_document = ['The movis is good', 'I really like the movie!','Awesome action scenes, but boring characters']

tokenize_doc = [word_tokenize(doc.lower()) for doc in my_document]

dictionary = Dictionary(tokenize_doc)

dictionary.token2id

{'!': 4,
 ',': 9,
 'action': 10,
 'awesome': 11,
 'boring': 12,
 'but': 13,
 'characters': 14,
 'good': 0,
 'i': 5,
 'is': 1,
 'like': 6,
 'movie': 7,
 'movis': 2,
 'really': 8,
 'scenes': 15,
 'the': 3}

In [6]:
corpus = [dictionary.doc2bow(doc) for doc in tokenize_doc]

corpus  #Each inner list represents one document; in each tuple, the first element is the token id and the second is the
        #token's frequency in that document

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]

In [None]:
# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

### defaultdict:

#### allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we are able to ensure that any non-existent keys are automatically assigned a default value of 0.

### itertools.chain.from_iterable():

#### allows us to iterate through a list of lists: it flattens the nested lists into one sequence and iterates through it
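
A minimal sketch of how the two helpers work together, using a made-up two-document corpus in the same `(token_id, count)` format that `doc2bow` produces. The name `toy_corpus` and its numbers are hypothetical.

```python
from collections import defaultdict
import itertools

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
]

# defaultdict(int) returns 0 for a missing key, so += works the first time a key is seen.
totals = defaultdict(int)

# chain.from_iterable flattens the list of lists into one stream of pairs.
for token_id, count in itertools.chain.from_iterable(toy_corpus):
    totals[token_id] += count

print(sorted(totals.items()))
print(totals[99])  # a key that was never added falls back to the default
```

Note that reading a missing key from a `defaultdict` also inserts it, which is why the totals are printed before the lookup of `99`.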

In [4]:
import itertools

In [None]:
from collections import defaultdict
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id,word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

### Tf-idf with gensim

### Definition: 

#### Term-frequency - inverse document frequency

1. Allows you to determine the most important words in each document

2. Each corpus may have shared words beyond just stopwords, and those words should be down-weighted in importance, which tf-idf can do

3. Ensures the most common words don't show up as keywords

4. Keeps document-specific frequent words weighted high
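
The weighting can be sketched by hand. This is a rough illustration, not gensim's implementation: it assumes the common form tf * log2(N / df) followed by L2 normalization of each document, which is what gensim's default settings appear to use. The three token lists are the movie sentences from the Gensim example, tokenized by hand.

```python
import math

# Toy corpus as token lists (lowercased, punctuation kept as tokens).
docs = [
    ['the', 'movis', 'is', 'good'],
    ['i', 'really', 'like', 'the', 'movie', '!'],
    ['awesome', 'action', 'scenes', ',', 'but', 'boring', 'characters'],
]

n_docs = len(docs)

# Document frequency: number of documents that contain each token.
doc_freq = {}
for doc in docs:
    for token in set(doc):
        doc_freq[token] = doc_freq.get(token, 0) + 1

def tfidf(doc):
    """Return {token: weight} using tf * log2(N / df), then L2-normalize."""
    raw = {t: doc.count(t) * math.log2(n_docs / doc_freq[t]) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw

weights = tfidf(docs[1])
print(round(weights['the'], 4), round(weights['movie'], 4))
```

'the' appears in two of the three documents, so it gets a much lower weight than 'movie', which appears in only one.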

<img src = "Tf-idf_formula.PNG">

In [7]:
from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)

tfidf[corpus[1]] #Weights for the second document in the corpus, as (token_id, token_weight) pairs

[(3, 0.16284991207632715),
 (4, 0.44124367556640004),
 (5, 0.44124367556640004),
 (6, 0.44124367556640004),
 (7, 0.44124367556640004),
 (8, 0.44124367556640004)]

# Chapter 3. Named Entity Recognition

### Def:

1. Identify important named entities in the text


### Stanford CoreNLP Library

1. Integrated into Python via nltk
2. Java based

In [8]:
#Example
import nltk

sentence = 'In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reivhl.'

tokenized_sent = word_tokenize(sentence)

tagged_sent = nltk.pos_tag(tokenized_sent) #tag each token with its part of speech (e.g. NNP = proper noun, PRP = pronoun)

tagged_sent

[('In', 'IN'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 (',', ','),
 ('I', 'PRP'),
 ('like', 'VBP'),
 ('to', 'TO'),
 ('ride', 'VB'),
 ('the', 'DT'),
 ('Metro', 'NNP'),
 ('to', 'TO'),
 ('visit', 'VB'),
 ('MOMA', 'NNP'),
 ('and', 'CC'),
 ('some', 'DT'),
 ('restaurants', 'NNS'),
 ('rated', 'VBN'),
 ('well', 'RB'),
 ('by', 'IN'),
 ('Ruth', 'NNP'),
 ('Reivhl', 'NNP'),
 ('.', '.')]

In [9]:
print(nltk.ne_chunk(tagged_sent))

(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reivhl/NNP)
  ./.)


In [None]:
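from collections import defaultdict
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize

# Tokenize the article into sentences: sentences
sentences = sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]

# Create the named entity chunks: chunked_sentences
# list() makes sure the result can be iterated by both loops below, in case a lazy iterator is returned
# binary=True labels every entity 'NE' (use binary=False for PERSON, GPE, ... categories)
chunked_sentences = list(nltk.ne_chunk_sents(pos_sentences, binary=True))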

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)
            
# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()            


### Introduction to SpaCy

1. Similar to gensim, focusing on creating NLP pipelines to generate models and corpora


In [23]:
import spacy 

nlp = spacy.load('en_core_web_sm') #load the small English model; tagger, parser and entity recognizer are enabled by default

nlp.entity

doc = nlp('Berlin is the capital of Germany; and the residence of Chancellor Angela Merkel.')

doc.ents

print(doc.ents[0],doc.ents[0].label_)

Berlin GPE


### Multilingual NER with polyglot


In [37]:
from polyglot.text import Text

ptext = Text(text)

ModuleNotFoundError: No module named 'polyglot'

In [None]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
    
# Print the type of ent
print(type(ent))

# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print entities
print(entities)


# Chapter 4. Classifying fake news using supervised learning with NLP


In [None]:
#Example

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

df = ...

y = df['Sci-Fi']

X_train,X_test,y_train,y_test = train_test_split(df['plot'],y,random_state = 53,test_size = .33)

count_vectorizer = CountVectorizer(stop_words = 'english') #remove stop words in English

count_train = count_vectorizer.fit_transform(X_train.values)

count_test = count_vectorizer.transform(X_test.values)


In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english',max_df = .7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

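# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10]) #scikit-learn >= 1.0; older versions use get_feature_names()

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.toarray()[:5]) #.toarray() is the explicit form of the .A shorthand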

In [None]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns) #Set difference: columns that appear in count_df but not in tfidf_df
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

### Naive Bayes Model is commonly used for testing NLP classification problems; basis in probability
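
To make the "basis in probability" concrete, here is a hand-rolled sketch of a multinomial Naive Bayes classifier with add-one smoothing. The training data is invented for illustration. In practice, scikit-learn's `MultinomialNB` would typically be fit on the vectorized features instead.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label) pairs.
train = [
    (['aliens', 'invade', 'space', 'station'], 'Sci-Fi'),
    (['robot', 'space', 'battle'], 'Sci-Fi'),
    (['family', 'wedding', 'comedy'], 'Other'),
    (['love', 'story', 'wedding'], 'Other'),
]

label_counts = Counter(label for _, label in train)
token_counts = defaultdict(Counter)  # label -> token -> count
for tokens, label in train:
    token_counts[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}

def predict(tokens):
    """Pick the label with the highest log P(label) + sum of log P(token | label)."""
    scores = {}
    for label, n_label in label_counts.items():
        total = sum(token_counts[label].values())
        score = math.log(n_label / len(train))
        for t in tokens:
            if t not in vocab:
                continue  # ignore tokens never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((token_counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(['space', 'robot']))
print(predict(['wedding', 'story']))
```

The "naive" part is the assumption that tokens are independent given the label, which is why the per-token log probabilities can simply be added.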

#  Course 2:  Advanced NLP with spaCy

## 1.1. Introduction to spaCy

In [1]:
from spacy.lang.en import English

In [15]:
nlp = English()

#It contains the processing pipeline
# Includes language-specific rules used for tokenizing the text into words and punctuation

In [16]:
#The Doc object: created by processing a string of text with the nlp object
doc = nlp('Hello world!')

for token in doc:
    print(token.text)

Hello
world
!


In [6]:
token = doc[1]
token

world

In [7]:
#The Span object: a slice of the document consisting of one or more tokens

#Note: it's only a view of the Doc and doesn't contain any data itself


span = doc[1:4]

print(span.text)

world!


In [11]:
#Lexical attributes

doc = nlp("It costs $5. IV")

print('Index:  ', [token.i for token in doc]) # "i" is the index of the token within the parent document

print('Text:   ', [token.text for token in doc]) #.text returns the text

print('is_alpha:  ', [token.is_alpha for token in doc]) #whether the token consists of alphabetic characters

print('is_punct:  ', [token.is_punct for token in doc]) #whether the token is punctuation

print('like_num:  ', [token.like_num for token in doc]) #whether it resembles a number

Index:   [0, 1, 2, 3, 4, 5]
Text:    ['It', 'costs', '$', '5', '.', 'IV']
is_alpha:   [True, True, False, False, False, True]
is_punct:   [False, False, False, False, True, False]
like_num:   [False, False, False, True, False, False]


In [None]:
#Example

# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i+1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

## 1.2. Statistical Models

### What are statistical models?

#### They enable spaCy to predict linguistic attributes in context
    (1) Part-of-speech tags
    (2) Syntactic dependencies
    (3) Named entities
    
#### Models are trained on large datasets of labeled example texts, and can be updated with more examples to fine-tune predictions


 ### Model packages

import spacy

nlp = spacy.load('en_core_web_sm')

The model package contains:

(1) Binary weights
(2) Vocabulary
(3) Meta information (language, pipeline)


### Example: Predicting part-of-speech tags

In [20]:
import spacy

nlp = spacy.load('en_core_web_sm')

#Process a text
doc = nlp('She ate the pizza')

#Iterate over the tokens
for token in doc:
    print(token.text, token.pos_) #print the text and the predicted part-of-speech tag

She PRON
ate VERB
the DET
pizza NOUN


### Example: Predicting Syntactic Dependencies

In [21]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text) #.dep_: returns the predicted dependency label; "head" returns the 
                                                               #syntactic head token (i.e., the parent token this word is attached to)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


### Common syntactic dependency labels used by spaCy

| Label | Description | Example |
|---|---|---|
| nsubj | nominal subject | She |
| dobj | direct object | pizza |
| det | determiner (article) | the |
     
     
     
#### Example (head ---label---> dependent):
ate(VERB) -------(nsubj)---> She(PRON);

ate(VERB) -------(dobj) ---> pizza(NOUN);

pizza(NOUN) -----(det) ----> the(DET);

### Example: Predicting Named Entities

In [26]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents: #.ents lets you access the named entities predicted by the model
    print(ent.text, ent.label_)
    

Apple ORG
U.K. GPE
$1 billion MONEY


In [30]:
#Tip: the explain method can help you get quick definitions of the most common tags and labels

spacy.explain('GPE')

spacy.explain('NNP')

'noun, proper singular'

## 1.3. Rule-based Matching

## Why not just regular expression?

1. The Matcher works on Doc objects, not just strings

2. It is more flexible: you can search for texts, but also for other lexical attributes

3. It can use the model's predictions: for example, find the word "duck" only if it's a verb, not a noun

## Match patterns

1. Match patterns are lists of dictionaries, each dictionary describes one token;

2. keys are the names of token attributes, mapped to their expected values;

Example:

[{'ORTH': 'iPhone'}, {'ORTH': 'X'}] #match exact token texts

[{'LOWER': 'iphone'}, {'LOWER': 'x'}] #match lexical attributes

[{'LEMMA': 'buy'}, {'POS': 'NOUN'}] #match any token attributes


In [32]:
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

In [33]:
#Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

In [38]:
#Add the pattern to the matcher
pattern = [{'ORTH': 'iPhone'},{'ORTH':'X'}]

matcher.add('IPHONE_PATTERN',None, pattern) #first arg: unique ID to identify which pattern was matched;
                                            #second arg: optional callback;
                                            #third arg: the pattern itself
        
#Process some text
doc = nlp('New iPhone X release date leaked')

matches = matcher(doc)

matches #(match ID, start index, end index)

[(9528407286733565721, 1, 3)]

In [39]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


### Matching lexical attributes

In [44]:
pattern = [
            {'IS_DIGIT': True},
            {'LOWER': 'fifa'},
            {'LOWER': 'world'},
            {'LOWER': 'cup'},
            {'IS_PUNCT': True}]

doc = nlp("2018 FIFA World Cup: France won!")

matcher.add('LEXICAL_PATTERN',None,pattern)

matches = matcher(doc)

for match_id, start,end in matches:
    print(doc[start:end])

2018 FIFA World Cup:


### Matching other token attributes

In [None]:
pattern = [
        {'LEMMA': 'love', 'POS': 'VERB'}, # one token: a verb whose lemma is "love"
        {'POS': 'NOUN'}                   # followed by a noun
]

doc = nlp('I loved dogs but now I love cats more.')

### Using operators and quantifiers

In [None]:
pattern = [
            {'LEMMA': 'buy'},
            {'POS':'DET', 'OP': '?'}, # optional: match 0 or 1 times
            {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")


| Operator | Description |
|---|---|
| `{'OP': '!'}` | Negation: match 0 times |
| `{'OP': '?'}` | Optional: match 0 or 1 times |
| `{'OP': '+'}` | Match 1 or more times |
| `{'OP': '*'}` | Match 0 or more times |

## 2.1 Data Structures: Vocab, Lexemes and StringStore

#### Vocab: stores data shared across multiple documents

#### To save memory, spaCy encodes all strings to hash values, so if a word occurs more than once, we don't need to save it every time

#### Strings are only stored once in the StringStore via nlp.vocab.strings

#### Example:

coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]

#### Hash ID can't be reversed - that's why we need to provide the shared vocab
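
Why a hash cannot be turned back into a string without the store can be illustrated with a toy lookup table. This is only a sketch: `ToyStringStore` and `hash_string` are hypothetical names, and spaCy's real hash function and storage differ.

```python
import hashlib

def hash_string(text):
    """Hypothetical 64-bit hash (not spaCy's actual hash function)."""
    digest = hashlib.sha256(text.encode('utf8')).digest()
    return int.from_bytes(digest[:8], 'big')

class ToyStringStore:
    """Toy two-way lookup: string -> hash is computed, hash -> string needs the table."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # A hash alone carries no text; without the stored entry this raises KeyError.
        return self._by_hash[key]

shared = ToyStringStore()
coffee_hash = shared.add('coffee')
print(shared[coffee_hash])

empty = ToyStringStore()
try:
    empty[coffee_hash]
except KeyError:
    print('hash not found: this store never saw the string')
```

The same idea explains why objects that only hold hashes must share one vocab: the text lives in the table, not in the hash.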

In [47]:
# Look up the string and hash in nlp.vocab.strings

doc = nlp('I love coffee')

print('hash value: ', nlp.vocab.strings['coffee'])

print('string value: ', nlp.vocab.strings[3197928453018144401])

print('hash value: ', doc.vocab.strings['coffee'])

hash value:  3197928453018144401
string value:  coffee
hash value:  3197928453018144401


### Lexemes: context-independent entries in the vocabulary

### Lexemes don't have context-dependent part-of-speech tags, dependencies or entity labels

In [48]:
doc = nlp('I love coffee')

lexeme = nlp.vocab['coffee']

#print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha) #.orth is the hash id of the text

coffee 3197928453018144401 True


## 2.2 Data Structures: Doc, Span and Token

In [12]:
#Create an nlp object

from spacy.lang.en import English

nlp = English()
    
#Import the Doc class
from spacy.tokens import Doc, Span

#The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False] #Indicates whether each word is followed by a space

#Create a doc manually
doc = Doc(nlp.vocab, words = words, spaces = spaces)

print("doc: ", doc)

#Create a span manually
span = Span(doc, 0, 2) #The doc, start and end index

print("span: ", span)

#Create a span with a label
span_with_label = Span(doc, 0, 2, label = 'GREETING') #usually write label names in upper case

print("span_with_label: ", span_with_label) #or span_with_label.text

#Add the span to doc.ents, which is writable
doc.ents = [span_with_label]

print(doc.ents[0], doc.ents[0].label_)
print([(ent.text, ent.label_) for ent in doc.ents])

doc:  Hello world!
span:  Hello world
span_with_label:  Hello world
Hello world GREETING
[('Hello world', 'GREETING')]


### Tips:

#### Doc and Span are very powerful and hold references and relationships of words and sentences

#### If your application needs to output strings, make sure to convert the doc as late as possible; otherwise, you'll lose all relationships between the tokens

#### To keep things consistent, try to use built-in token attributes wherever possible, for example, token.i for the token index

#### Don't forget to always pass in the shared vocab

## 2.3 Word vectors and semantic similarity

### spaCy can compare two objects and predict how similar they are

### Doc.similarity(), Span.similarity(), Token.similarity()

### Each takes another object and returns a similarity score (0 to 1)

### Note: in order to use similarity, you need a larger spaCy model that has word vectors included, for example:
1. en_core_web_md (medium model)
2. en_core_web_lg (large model)



In [14]:
#Load a larger model with vectors
import spacy
nlp = spacy.load('en_core_web_md')

#Compare two documents
doc1 = nlp('I like fast food')
doc2 = nlp('I like pizza')

print(doc1.similarity(doc2))

#Compare two tokens
doc = nlp('I like pizza and pasta')
token1 = doc[2]
token2 = doc[4]

print(token1.similarity(token2))


#Compare a document with a token
doc = nlp('I like pizza')
token = nlp('soap')[0]

print(doc.similarity(token))


#Compare a span with a document
span = nlp('I like pizza and pasta')[2:5]
doc = nlp('McDonalds sells burgers')

print(span.similarity(doc))

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

### How does spaCy predict similarity?

#### 1. Similarity is determined using word vectors
#### 2. Multi-dimensional meaning representations of words
#### 3. Generated using an algorithm like Word2Vec and lots of text
#### 4. Can be added to spaCy's statistical models
#### 5. Default: cosine similarity, but can be adjusted
#### 6. Doc and Span vectors default to average of token vectors, that's also why you usually get more value out of shorter phrases with fewer irrelevant words
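
Points 5 and 6 can be sketched without a model. The three-dimensional vectors here are invented for illustration, since real spaCy vectors are much longer and come from the loaded model.

```python
import math

# Hypothetical 3-dimensional toy vectors, made up for this example.
toy_vectors = {
    'pizza': [0.9, 0.1, 0.0],
    'pasta': [0.8, 0.2, 0.0],
    'soap':  [0.0, 0.1, 0.9],
}

def average(vectors):
    """Element-wise mean: how a Doc or Span vector is built from token vectors by default."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

food_phrase = average([toy_vectors['pizza'], toy_vectors['pasta']])

print(cosine(toy_vectors['pizza'], toy_vectors['pasta']))  # close to 1: similar direction
print(cosine(food_phrase, toy_vectors['soap']))            # close to 0: unrelated direction
```

Because a phrase vector is an average, every extra unrelated token pulls it away from the words that matter, which is the reason shorter, focused phrases tend to compare more meaningfully.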

In [None]:
nlp = spacy.load('en_core_web_md')
doc = nlp('I have a banana')

print(doc[3].vector) #300 dimensional vector of the word 'banana'

### Useful for many applications: recommendation systems, flagging duplicates etc.
### Keep in mind that there's no objective definition of "similarity". It always depends on the context and on what the application needs to do

## 2.4 Combining models and rules

In [21]:
#Example
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
matcher.add('DOG',None, [{'LOWER': 'golden'},{'LOWER': 'retriever'}])
doc = nlp('I have a Golden Retriever')

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span: ', span.text)
    
    #Get the span's root token and root head token
    print('Root token: ', span.root.text)
    print('Root head token: ', span.root.head.text)
    
    #Get the previous token and its POS tag
    print('Previous token: ', doc[start-1].text, doc[start-1].pos_)

Matched span:  Golden Retriever
Root token:  Retriever
Root head token:  have
Previous token:  a DET


### PhraseMatcher 
#### 1. Like regex or keyword search, but with access to the tokens;
#### 2. It takes Doc objects as patterns;
#### 3. More efficient than Matcher

In [24]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern  = nlp('Golden Retriever')

matcher.add('DOG', None, pattern)

doc = nlp('I have a Golden Retriever')

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span: ', span.text)

Matched span:  Golden Retriever
