# Advanced NLP with spaCy - Part 1

## Getting Started

In [1]:
#conda install -c conda-forge spacy=2.1.0
import spacy
#Import the English Language class
from spacy.lang.en import English
print(spacy.__version__)

2.1.4


In [2]:
# Create the nlp object
nlp = English()

In [3]:
# Process the text
doc = nlp("This is a sentence.")

In [4]:
# Print the document text
print(doc.text)

This is a sentence.


In [5]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [6]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


## Documents, spans and tokens

When you call nlp on a string, spaCy first tokenizes the text and creates a document object.

In [7]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

In [8]:
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

In [9]:
# Select the first token
first_token = doc[0]

In [10]:
# Print the first token's text
print(first_token.text)

I


In [11]:
# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

tree kangaroos


In [12]:
# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos and narwhals
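
As a rough illustration of why `doc[2:4]` selects tokens rather than characters, the sketch below mimics a tokenized document with a plain Python list. The regex-based `toy_tokenize` helper is an assumption made for this example only; spaCy's real tokenizer is considerably more careful.

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens (a crude stand-in for a tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("I like tree kangaroos and narwhals.")
print(tokens)

# Slices index tokens, not characters; the end index is exclusive
print(" ".join(tokens[2:4]))
# A negative end index counts from the end, which drops the final "."
print(" ".join(tokens[2:-1]))
```

The slicing rules are the same ones Python lists use, which is why `doc[2:-1]` leaves out the trailing punctuation token.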


### Lexical attributes

In [13]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

In [14]:
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the token following the current token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4


## Statistical Models
Statistical models allow you to generalize based on a set of training examples. Once they're trained, they use binary weights to make predictions. That's why it's not necessary to ship them with their training data.

### Loading models

In [2]:
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

In [3]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
doc = nlp(text)
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


In [19]:
#!python -m spacy download de_core_news_sm
nlp = spacy.load('de_core_news_sm')

In [20]:
text = "Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht"
doc = nlp(text)
print(doc.text)

Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht


### Predicting linguistic annotations
Use `spacy.explain` to find out what a tag or label means, e.g. `spacy.explain('GPE')` or `spacy.explain('PROPN')`.

In [4]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

In [6]:
# The Doc class lives in spacy.tokens. Writing `spacy.Doc` raises
# AttributeError: module 'spacy' has no attribute 'Doc'
from spacy.tokens import Doc

def get_entities(doc: Doc):
    pass

In [5]:
for token in doc:
    token_text = token.text
    token_pos = token.pos_ #part-of-speech tag
    token_dep = token.dep_ #dependency label
    print('{:<12}{:<10}{:<10} {:<10}'.format(token_text, token_pos, token_dep, spacy.explain(token_pos)))

It          PRON      nsubj      pronoun   
’s          PROPN     ROOT       proper noun
official    NOUN      acomp      noun      
:           PUNCT     punct      punctuation
Apple       PROPN     nsubj      proper noun
is          VERB      ROOT       verb      
the         DET       det        determiner
first       ADJ       amod       adjective 
U.S.        PROPN     nmod       proper noun
public      ADJ       amod       adjective 
company     NOUN      attr       noun      
to          PART      aux        particle  
reach       VERB      relcl      verb      
a           DET       det        determiner
$           SYM       quantmod   symbol    
1           NUM       compound   numeral   
trillion    NUM       nummod     numeral   
market      NOUN      compound   noun      
value       NOUN      dobj       noun      


In [23]:
# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_, spacy.explain(ent.label_))

Apple ORG Companies, agencies, institutions, etc.
first ORDINAL "first", "second", etc.
U.S. GPE Countries, cities, states
$1 trillion MONEY Monetary values, including unit


## Predicting named entities in context
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing.

In [24]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG


Here the model didn't predict "iPhone X" as an entity, so the span has to be selected manually.

In [25]:
# Get the span for "iPhone X"
iphone_x = doc[1:3]
#Print the span text
print('Missing entity:', iphone_x.text)

Missing entity: iPhone X


## Rule-based matching
spaCy's rule-based matcher can help you find certain words and phrases in text. Regular expressions are not used here because we want to match on `Doc` objects and not just strings. Patterns can also match on tokens and token attributes and use the model's predictions, e.g. "duck" (verb) vs. "duck" (noun).

### Matcher

In [26]:
from spacy.matcher import Matcher

#Load a model and create nlp object
nlp = spacy.load('en_core_web_sm')

#Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher. The second argument is the optional on_match callback (None = no callback)
matcher.add('IPHONE_X_PATTERN', None, pattern)

doc = nlp("New iPhone X release date leaked")

# Use the matcher on the doc. It returns a list of `(key, start, end)` tuples,
# where each tuple describes the span `doc[start:end]`.
matches = matcher(doc)
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Match found: iPhone X
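
The pattern above is a list of dictionaries, one per token. The following is a minimal pure-Python sketch of that idea, not spaCy's implementation: `token_matches` and `toy_match` are hypothetical helpers, the tokens are hand-written attribute dictionaries standing in for a processed `Doc`, and operators such as `OP` are not supported.

```python
def token_matches(token, spec):
    """Return True if every attribute in the pattern dict equals the token's value."""
    return all(token.get(attr) == value for attr, value in spec.items())

def toy_match(tokens, pattern):
    """Yield (start, end) for every position where the pattern matches consecutive tokens."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            yield start, start + n

# Hand-written token attributes standing in for a processed Doc (an assumption)
tokens = [
    {"TEXT": "New", "LOWER": "new"},
    {"TEXT": "iPhone", "LOWER": "iphone"},
    {"TEXT": "X", "LOWER": "x"},
    {"TEXT": "release", "LOWER": "release"},
]
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

matches = list(toy_match(tokens, pattern))
for start, end in matches:
    print("Match found:", " ".join(t["TEXT"] for t in tokens[start:end]))
```

As with the real matcher, each match is reported as a `(start, end)` pair of token indices, so the matched text is recovered by slicing.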


### Writing match patterns

In [27]:
doc = nlp('''After making the iOS update you won't notice a radical system-wide redesign: 
nothing like the aesthetic upheaval we got with iOS 7. 
Most of iOS 11's furniture remains the same as in iOS 10. 
But you will discover some tweaks once you delve a little deeper.''')

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', None, pattern)

matches = matcher(doc)

print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)


Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [28]:
doc = nlp('''i downloaded Fortnite on my laptop and can't open the game at all. 
Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... 
do I also need to download Winzip?''')

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [29]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for an adjective followed by one or two nouns (one noun plus one optional noun)
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


## Data Structures

### Vocab, Lexemes and StringStore

In [4]:
from spacy.lang.en import English
nlp = English()

In [5]:
# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(cat_hash)
# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811


KeyError: "[E018] Can't retrieve string for hash '5439657043933447811'."

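**The reverse lookup above fails with a `KeyError`. I first took this for a bug in spaCy 2.1.3, because spaCy 2.1.0 returned `cat`. The more likely explanation is that looking up a string only computes its hash and does not store the string, so a blank `English()` vocab that has never seen "cat" cannot resolve the hash. Processing a text that contains the word first, e.g. `nlp("I have a cat")`, or calling `nlp.vocab.strings.add('cat')`, should make the lookup work.**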

In [32]:
# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)
# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


In [33]:
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)

# Look up the ID for 'Bowie' in the German vocab (fails: this vocab has never seen the string)
print(nlp_de.vocab.strings[bowie_id])

2644858412616767388


KeyError: "[E018] Can't retrieve string for hash '2644858412616767388'."
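
The two `KeyError`s above are easier to reason about with a toy model of a string store. The sketch below is an illustration under assumptions, not spaCy's code: `ToyStringStore` is a hypothetical class, it uses SHA-256 where spaCy uses a different hash function, and it assumes that a string lookup only computes the hash without storing anything.

```python
import hashlib

class ToyStringStore:
    """Maps strings to integer hashes and back, but only for strings it has stored."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(string):
        # Any deterministic hash works for the illustration
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash: computed on the fly, nothing is stored
            return self.hash_string(key)
        # Hash -> string: only possible if the string was added earlier
        return self._by_hash[key]

english_strings = ToyStringStore()
german_strings = ToyStringStore()

cat_hash = english_strings["cat"]
try:
    english_strings[cat_hash]
except KeyError:
    print("'cat' was never added, so its hash cannot be resolved")

english_strings.add("cat")
print(english_strings[cat_hash])  # resolves now that the string is stored

try:
    german_strings[cat_hash]
except KeyError:
    print("the other store has not seen 'cat' either")
```

The hash is the same in every store, but only a store that holds the string can turn the hash back into text. That is the reason for sharing one `vocab` between objects.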

### Doc, Span and Tokens

`Doc` and `Span` objects hold references to words and sentences and the relationships between them.

**Best practices**
- Convert results to strings as late as possible.
- Use token attributes where available, e.g. `token.i` for the token index.
- Don't forget to pass in the shared `vocab`.

In [34]:
#Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [35]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started! 


In [36]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ['Oh', ',', 'really', '?', '!']
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
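
To check a `words`/`spaces` pair before building a `Doc`, the text can be reconstructed by hand. This is a small sketch of the rule the three cells above rely on (each flag says whether the token is followed by a space); `join_words` is a hypothetical helper, not part of spaCy.

```python
def join_words(words, spaces):
    """Rebuild a text from tokens and per-token 'followed by a space' flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))

print(join_words(["spaCy", "is", "cool", "!"], [True, True, False, False]))
print(join_words(["Go", ",", "get", "started", "!"], [False, True, True, False, False]))
print(join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False]))
```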


In [43]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ['I', 'like', 'David', 'Bowie']
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])


I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


Creating spaCy's objects manually and modifying the entities will come in handy when you're writing your own information extraction pipelines.

### Data structure best practices

The code below analyzes a text and looks for proper nouns. If the token following a proper noun is a verb, it should be reported as well.

In [49]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('Berlin is a nice city')
# Get all tokens and part-of-speech tags
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')

Found a verb after a proper noun!


The above code is bad because it only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.

Rewriting the code to use the native token attributes instead of a list of `pos_tags`:

In [51]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('Berlin is a nice city')
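# Iterate over the tokens and use their native attributes
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == 'PROPN':
        # Check that a next token exists and that it is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == 'VERB':
            print('Found a verb after a proper noun!')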

Found a verb after a proper noun!


### Word vectors and similarity

spaCy can compare two objects and predict a similarity score, usually between 0 and 1, via `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`. **Make sure to use a medium or large statistical model, since those include word vectors.** There is no objective definition of `similarity`: it depends on the application context. It is useful for many applications, such as recommendation systems and flagging duplicates.
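
As far as I understand, spaCy's default similarity is the cosine similarity of the word vectors, and the vector of a `Doc` or `Span` defaults to the average of its token vectors. The sketch below shows both computations with made-up 3-dimensional vectors; the numbers and helper names are assumptions for illustration and do not come from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_vector(vectors):
    """Element-wise mean of several vectors, e.g. token vectors -> document vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up "word vectors"; real models use hundreds of dimensions
toy_vectors = {
    "warm":  [0.9, 0.1, 0.0],
    "sunny": [0.8, 0.2, 0.1],
    "book":  [0.0, 0.1, 0.9],
}

# Vectors pointing in a similar direction score close to 1
print(cosine_similarity(toy_vectors["warm"], toy_vectors["sunny"]))
# Vectors pointing in different directions score close to 0
print(cosine_similarity(toy_vectors["warm"], toy_vectors["book"]))

doc_vector = average_vector([toy_vectors["warm"], toy_vectors["sunny"]])
print(doc_vector)
```

Because the score depends only on the vectors, two texts can be rated as similar even when a human reader would separate them, which is one reason the result has to be judged per application.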

In [58]:
#!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')


In [59]:
# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

#### Comparing similarities

In [60]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


The higher the score, the more similar the objects. These two documents are quite similar.

In [61]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books" 
similarity = token1.similarity(token2)
print(similarity)

0.2232533


TV and books are not very similar.

In [63]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

In [64]:
# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.7517392


"great restaurant" and "really nice bar" are quite similar.

### Combining models and rules

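Statistical predictions vs. rules (see the Evernote note): statistical models help when the application needs to generalize from examples, while rules suit a finite, known set of patterns.
- **Tricks:** Use the [Rule-based Matcher Explorer tool](https://explosion.ai/demos/matcher)
- [Docs - Rule-based matching](https://spacy.io/usage/rule-based-matching)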

In [67]:
from spacy.matcher import Matcher
#Initialize with the shared vocab
matcher = Matcher(nlp.vocab)

doc = nlp('Can Silicon Valley workers rein in big tech from within?')

# Patterns are lists of dictionaries describing the tokens.
# A first attempt that does not match, because no token consists of a single space:
# pattern = [{'LOWER': 'silicon'}, {'TEXT': ' '}, {'LOWER': 'valley'}]
# The tokenizer already takes care of splitting off whitespace, and each dictionary in the pattern describes one token.
pattern = [{'LOWER': 'silicon'}, {'LOWER': 'valley'}]

matcher.add('SILICON_VALLEY', None, pattern)                                                
# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)
                                                 

SILICON_VALLEY Silicon Valley


In [75]:
from spacy.matcher import Matcher
#Initialize with the shared vocab
matcher = Matcher(nlp.vocab)

doc = nlp('''Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, is ditching one of its best features: ad-free viewing. 
According to an email sent out to Amazon Prime members today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on September 14. 
However, members with existing annual subscriptions will be able to continue to enjoy ad-free viewing until their subscription comes up for renewal. 
Those with monthly subscriptions will have access to ad-free viewing until October 15.''')
#Create the match patterns
# Matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
# Matches all case-insensitive mentions of "ad-free", plus the following noun
pattern2 = [{'LOWER': 'ad'},{'LOWER': '-'},{'LOWER': 'free'}, {'POS': 'NOUN'}]

# Add the patterns to the matcher (initialized above)
matcher.add('PATTERN1', None, pattern1)
matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


#### Efficient phrase matching

Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This works well for finite categories of things, like all countries of the world.

In [9]:
COUNTRIES = ['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'United States Minor Outlying Islands', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cabo Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', "Côte d'Ivoire", 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', "Korea (Democratic People's Republic of)", 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Kosovo', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Korea (Republic of)', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']

In [90]:
#Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
doc = nlp('Czech Republic may help Slovakia protect its airspace')
matcher = PhraseMatcher(nlp.vocab) #Needs to be an exact phrase match

# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
# Unpack with *patterns: add() expects each pattern Doc as a separate argument, not a list
matcher.add('COUNTRY', None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]
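
The idea behind exact phrase matching can be sketched without spaCy: store every phrase as a tuple of tokens and check each window of the text against that set. This is only an illustration with whitespace tokenization; `toy_phrase_match` is a hypothetical helper and says nothing about how `PhraseMatcher` is implemented internally.

```python
def toy_phrase_match(tokens, phrases):
    """Return (start, end) spans where any phrase matches the tokens exactly."""
    # Store each phrase as a tuple of its tokens for fast set lookups
    phrase_set = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(p) for p in phrase_set})
    spans = []
    for start in range(len(tokens)):
        for n in lengths:
            end = start + n
            if end <= len(tokens) and tuple(tokens[start:end]) in phrase_set:
                spans.append((start, end))
    return spans

countries = ["Czech Republic", "Slovakia", "United Arab Emirates"]
tokens = "Czech Republic may help Slovakia protect its airspace".split()

spans = toy_phrase_match(tokens, countries)
print([" ".join(tokens[start:end]) for start, end in spans])
```

Each window costs one set lookup, so the work per document grows with the number of distinct phrase lengths rather than with the number of phrases. That is what makes this approach attractive for long lists such as `COUNTRIES`.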


#### Extracting countries and relationships

In [7]:
text = '''After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations such as Somalia, Haiti, Mozambique, and the former Yugoslavia. The UN mission in Somalia was widely viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission to Bosnia faced "worldwide ridicule" for its indecisive and confused mission in the face of ethnic cleansing. In 1994, the UN Assistance Mission for Rwanda failed to intervene in the Rwandan genocide amid indecision in the Security Council. Beginning in the last decades of the Cold War, American and European critics of the UN condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, withdrew his nation's funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, founded 1946) over allegations of mismanagement, followed by Britain and Singapore. 
Boutros Boutros-Ghali, Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of the organization somewhat. His successor, Kofi Annan (1997–2006), initiated further management reforms in the face of threats from the United States to withhold its UN dues. In the late 1990s and 2000s, international interventions authorized by the UN took a wider variety of forms. The UN mission in the Sierra Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 was overseen by NATO. In 2003, the United States invaded Iraq despite failing to pass a UN Security Council resolution for authorization, prompting a new round of questioning of the organization's effectiveness. Under the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the War in Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons inspectors to the Syrian Civil War. In 2013, an internal review of UN actions in the final battles of the Sri Lankan Civil War in 2009 concluded that the organization had suffered "systemic failure". One hundred and one UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization's history. The Millennium Summit was held in 2000 to discuss the UN's role in the 21st century. The three day meeting was the largest gathering of world leaders in history, and culminated in the adoption by all member states of the Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty reduction, gender equality, and public health. Progress towards these goals, which were to be met by 2015, was ultimately uneven. The 2005 World Summit reaffirmed the UN's focus on promoting development, peacekeeping, human rights, and global security. The Sustainable Development Goals were launched in 2015 to succeed the Millennium Development Goals. 
In addition to addressing global challenges, the UN has sought to improve its accountability and democratic legitimacy by engaging more with civil society and fostering a global constituency. In an effort to enhance transparency, in 2016 the organization held its first public debate between candidates for Secretary-General. On 1 January 2017, Portuguese diplomat António Guterres, who previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. Guterres has highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, more effective peacekeeping efforts, and streamlining the organization to be more responsive and versatile to global needs.'''

In [11]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
doc = nlp(text)
matcher = PhraseMatcher(nlp.vocab)

patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    #Create a Span with the label for "GPE"
    span = Span(doc, start, end, label='GPE')
    
    #Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    
    #Get the span's root head token
    span_root_head = span.root.head
     
    
# Print the entities in a document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]


**Important note: be careful when you upgrade spaCy. Behaviour can change between versions (as with the string lookup above), so make sure you have unit tests with good coverage when you adopt spaCy in a project.**