<a href="https://colab.research.google.com/github/just-joseph/NLP-basics/blob/main/Day4_Spacy_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

spaCy is an advanced modern library for Natural Language Processing developed by Matthew Honnibal and Ines Montani. It is designed to be industrial grade but open source.

In [1]:
!pip install spacy



spaCy comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), lemmatization, and transforming text to word vectors.

If you are dealing with a particular language, you can load the spaCy model specific to that language using the spacy.load() function.

In [2]:
import spacy
# Load the small English model: https://spacy.io/models
# Models that are not already installed must be downloaded first (python -m spacy download <model_name>)
nlp = spacy.load("en_core_web_sm")

This returns a Language object that comes ready with multiple built-in capabilities.

Now, let us say you have your text data in a string. What can be done to understand the structure of the text?

First, call the loaded nlp object on the text. It should return a processed Doc object.

In [3]:
my_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost"""

my_doc = nlp(my_text)
type(my_doc)

spacy.tokens.doc.Doc

But what exactly is a Doc object?

It is a sequence of tokens that contains not just the original text but all the results produced by the spaCy model after processing the text. Useful information such as the lemma of each token, whether it is a stop word, the named entities, and the word vectors is pre-computed and readily stored in the Doc object.

The good thing is that you have complete control over what information is pre-computed and how it is customized. We will see all of that shortly.

Also, though the text gets split into tokens, no information from the original text is actually lost.

What is a Token?

Tokens are the individual units that make up the text. Typically a token is a word, a punctuation mark, or a run of whitespace.

**Tokenization** is the process of splitting a text into smaller units based on predefined rules. For example, sentences are tokenized into words (and, optionally, punctuation), and paragraphs into sentences, depending on the context.

This is typically the first step for NLP tasks like text classification, sentiment analysis, etc.
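To make the idea of rule-based tokenization concrete, here is a minimal sketch in plain Python. It is not spaCy's tokenizer: the `toy_tokenize` helper and its two rules are assumptions made for illustration. It also stores the whitespace that follows each token, which is how the original text can be rebuilt exactly from the tokens.

```python
import re

def toy_tokenize(text):
    """Split text into (token, trailing_whitespace) pairs using two simple rules."""
    # Rule 1: a run of word characters is one token.
    # Rule 2: any other non-space character is a token of its own.
    return re.findall(r"(\w+|[^\w\s])(\s*)", text)

text = "the share-market crashed."
pairs = toy_tokenize(text)

tokens = [tok for tok, _ in pairs]
rebuilt = "".join(tok + ws for tok, ws in pairs)

print(tokens)           # ['the', 'share', '-', 'market', 'crashed', '.']
print(rebuilt == text)  # True
```

A real tokenizer has many more rules (abbreviations, contractions, URLs and so on), but the principle of splitting by rules while keeping the whitespace is the same.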

Each token in spaCy has attributes that tell us a great deal about it, such as whether the token is punctuation, what its part of speech (POS) is, and what the lemma of the word is. This article covers the most useful of them.

Let’s see the token texts on my_doc. The string which the token represents can be accessed through the token.text attribute.



In [4]:
# Printing the tokens of a doc
for token in my_doc:
  print(token.text)

The
economic
situation
of
the
country
is
on
edge
,
as
the
stock


market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment


in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off


thousands
of
people
to
reduce
labor
cost


The tokens above include punctuation and very common words like “a”, “the”, and “was”. Such common words usually add little to the meaning of your text. They are called stop words.

Let’s clean it up.

Besides stop words, you have punctuation like commas, brackets, and full stops, and some extra whitespace too. All of this is ‘noise’ in the tokens. The process of removing noise from the doc is called Text Cleaning or Preprocessing.

What is the need for Text Preprocessing?

Whatever NLP task you perform, be it classification, sentiment analysis, or topic modeling, the quality of the output depends heavily on the quality of the input text.

Stop words and punctuation usually (not always) don’t add value to the meaning of the text and can potentially distort the outcome. To avoid this, it often makes sense to remove them. Cleaning the text of unwanted characters also reduces the size of the corpus.

How to identify and remove the stop words and punctuation?

The tokens in spaCy have attributes that help you identify whether a token is a stop word.

The token.is_stop attribute tells you that. Likewise, token.is_punct and token.is_space tell you whether a token is punctuation or whitespace, respectively.

In [5]:
# Printing tokens and boolean values stored in different attributes
for token in my_doc:
  print(token.text,'--',token.is_stop,'---',token.is_punct)

The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False

 -- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False

 -- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -

In [6]:
# Removing StopWords and punctuations
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct]

for token in my_doc_cleaned:
  print(token.text)

economic
situation
country
edge
stock


market
crashed
causing
loss
millions
Citizens
main
investment


share
market
facing
great
loss
companies
lay


thousands
people
reduce
labor
cost


You can now see that the cleaned doc has only tokens that contribute to meaning in some way.
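The output above still contains blank lines, because whitespace tokens are neither stop words nor punctuation. A minimal sketch of a filter that also checks is_space follows. `FakeToken` is a hypothetical stand-in for a spaCy Token so that the example runs without a model, and `is_content` is an illustrative helper name, not part of spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token: only the attributes the filter reads.
FakeToken = namedtuple("FakeToken", "text is_stop is_punct is_space")

def is_content(token):
    """True for tokens that are not stop words, punctuation, or whitespace."""
    return not (token.is_stop or token.is_punct or token.is_space)

sample = [
    FakeToken("the", True, False, False),
    FakeToken("stock", False, False, False),
    FakeToken("\n", False, False, True),
    FakeToken("market", False, False, False),
    FakeToken(".", False, True, False),
]

cleaned = [token.text for token in sample if is_content(token)]
print(cleaned)  # ['stock', 'market']
```

With a real Doc, the same helper can be used as `[token for token in my_doc if is_content(token)]`.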

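Also, the computational cost decreases considerably because there are fewer tokens to process, an effect that becomes more noticeable on large text data.

Another common preprocessing step is lemmatization, which reduces each token to its base form. The lemma is available through the token.lemma_ attribute.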

In [7]:
# Lemmatizing the tokens of a doc
text='she played chess against rita she likes playing chess.'
doc=nlp(text)
for token in doc:
  print(token.lemma_)

-PRON-
play
chess
against
rita
-PRON-
like
play
chess
.


Recall that we used the is_punct and is_space attributes in Text Preprocessing. They are called **‘lexical attributes’**.

In this section, you will learn about a few more significant lexical attributes.

The spaCy model provides many useful lexical attributes. These are attributes of the Token object that give you information about the type of token.

For example, you can use the like_num attribute of a token to check whether it resembles a number, or like_email to check whether it resembles an email address. Let’s print all the numbers in a text.

In [7]:
# Printing the tokens which are like numbers
text=' 2020 is far worse than 2009'
doc=nlp(text)
for token in doc:
  if token.like_num:
    print(token)

2020
2009


Likewise, spaCy provides a variety of token attributes. Below is a list of some of those attributes and what they check.
1. token.is_alpha : Returns True if the token consists of alphabetic characters
2. token.is_ascii : Returns True if the token consists of ASCII characters
3. token.is_digit : Returns True if the token consists of digits (0-9)
4. token.is_upper : Returns True if the token is in upper case
5. token.is_lower : Returns True if the token is in lower case
6. token.is_space : Returns True if the token consists of whitespace
7. token.is_bracket : Returns True if the token is a bracket
8. token.is_quote : Returns True if the token is a quotation mark
9. token.like_url : Returns True if the token resembles a URL (link to a website)
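Several of these flags behave like Python's built-in string checks applied to the token text; the spaCy documentation describes is_alpha, is_digit, is_upper, is_lower and is_space in those terms. The sketch below mimics them on plain strings so you can see what each one tests. The `describe` helper is an illustrative name, not a spaCy function.

```python
def describe(text):
    """Plain-string analogues of a few spaCy token flags."""
    return {
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
        "is_upper": text.isupper(),
        "is_lower": text.islower(),
        "is_space": text.isspace(),
        "is_ascii": all(ord(ch) < 128 for ch in text),
    }

for sample in ["Chess", "2020", "USD", "\n"]:
    # Print only the flags that are True for this string
    true_flags = [name for name, value in describe(sample).items() if value]
    print(repr(sample), true_flags)
```

Note that a mixed-case word such as "Chess" is neither is_upper nor is_lower.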

**Part of Speech analysis with spaCy**
Consider the sentence “Emily likes playing football”.

Here, Emily is a NOUN and playing is a VERB. Likewise, each word of a text is a noun, pronoun, verb, conjunction, etc. These tags are called Part of Speech (POS) tags.

How to identify the part of speech of the words in a text document?

It is available in the pos_ attribute of each token.

In [9]:
# POS tagging using spaCy
my_text='John plays basketball,if time permits. He played in high school too.'
my_doc=nlp(my_text)
for token in my_doc:
  print(token.text,'---- ',token.pos_)

John ----  PROPN
plays ----  VERB
basketball ----  NOUN
, ----  PUNCT
if ----  SCONJ
time ----  NOUN
permits ----  VERB
. ----  PUNCT
He ----  PRON
played ----  VERB
in ----  ADP
high ----  ADJ
school ----  NOUN
too ----  ADV
. ----  PUNCT


From the above output, you can see the POS tag against each word, like VERB, ADJ, etc.

What if you don’t know what the tag SCONJ means?

Using the spacy.explain() function, you can get the explanation, or the full form in this case.

In [8]:
spacy.explain('SCONJ')

'subordinating conjunction'

**Example application of POS tagging**

In [9]:
# Raw text document
raw_text="""I liked the movies etc The movie had good direction  The movie was amazing i.e.
            The movie was average direction was not bad The cinematography was nice. i.e.
            The movie was a bit lengthy  otherwise fantastic  etc etc"""

# Creating a spacy object
raw_doc=nlp(raw_text)

# Checking if POS tag is X and printing them
print('The junk values are..')
for token in raw_doc:
  if token.pos_=='X':
    print(token.text)

print('After removing junk')
# Removing the tokens whose POS tag is junk.
clean_doc=[token for token in raw_doc if not token.pos_=='X']
print(clean_doc)

The junk values are..
etc
i.e.
i.e.
etc
etc
After removing junk
[I, liked, the, movies, The, movie, had, good, direction,  , The, movie, was, amazing, 
            , The, movie, was, average, direction, was, not, bad, The, cinematography, was, nice, ., 
            , The, movie, was, a, bit, lengthy,  , otherwise, fantastic,  ]


In [12]:
# You can also find out which types of tokens are present in your text by creating the dictionary shown below.
# Creating a dictionary that maps each part-of-speech ID (token.pos) to its tag name (token.pos_).

all_tags = {token.pos: token.pos_ for token in raw_doc}
print(all_tags)

{95: 'PRON', 100: 'VERB', 90: 'DET', 92: 'NOUN', 101: 'X', 87: 'AUX', 84: 'ADJ', 103: 'SPACE', 94: 'PART', 97: 'PUNCT', 86: 'ADV'}
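The dictionary above tells you which tags occur, but not how often. A small sketch with collections.Counter fills that gap. To keep it runnable without a model, it uses (text, tag) pairs copied from the tagger output shown earlier for the sentence about John; the variable names are illustrative.

```python
from collections import Counter

# (token text, POS tag) pairs copied from the earlier POS tagging output
tagged = [
    ("John", "PROPN"), ("plays", "VERB"), ("basketball", "NOUN"),
    (",", "PUNCT"), ("if", "SCONJ"), ("time", "NOUN"),
    ("permits", "VERB"), (".", "PUNCT"), ("He", "PRON"),
    ("played", "VERB"), ("in", "ADP"), ("high", "ADJ"),
    ("school", "NOUN"), ("too", "ADV"), (".", "PUNCT"),
]

# Count how many tokens carry each tag
tag_counts = Counter(tag for _, tag in tagged)

print(tag_counts["VERB"])   # 3
print(tag_counts["NOUN"])   # 3
print(tag_counts["PROPN"])  # 1
```

On a real Doc the same count can be written as `Counter(token.pos_ for token in raw_doc)`.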


In [11]:
# For better understanding of various POS of a sentence, you can use the visualization function displacy of spacy.
# Importing displacy
from spacy import displacy
my_text='She never likes playing , reading was her hobby'
my_doc=nlp(my_text)

# displaying tokens with their POS tags
displacy.render(my_doc,style='dep',jupyter=True)

Have a look at this text: “John works at Google”. Here, “John” and “Google” are the names of a person and a company. Such words are referred to as named entities. They are real-world objects like the name of a company, a place, etc.

How can you find all the named entities in a text?

Using the ents attribute of a spaCy document, you can access all the named entities present in the text.

In [12]:
# Preparing the spaCy document
text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)

# Printing the named entities
print(doc.ents)

(Tony Stark, StarkEnterprises, Emily Clark, Microsoft, Manchester, Bible, French)


You can see all the named entities printed.

But is this complete information? No.

Each named entity belongs to a category, like the name of a person, an organization, or a city. The common named entity categories supported by spaCy are:

PERSON : Denotes names of people
GPE : Denotes places like countries, cities, states
ORG : Denotes organizations or companies
WORK_OF_ART : Denotes titles of books, films, songs and other works of art
PRODUCT : Denotes products such as vehicles, food items, furniture and so on
EVENT : Denotes historical events like wars, disasters, etc.
LANGUAGE : Denotes any named language

How can you find out which named entity category a given entity belongs to?

You can access it through the .label_ attribute of the entity. The code below prints the label of each named entity.

In [15]:
# Printing labels of entities.
for entity in doc.ents:
  print(entity.text,'--- ',entity.label_)

Tony Stark ---  PERSON
StarkEnterprises ---  ORG
Emily Clark ---  PERSON
Microsoft ---  ORG
Manchester ---  GPE
Bible ---  WORK_OF_ART
French ---  LANGUAGE
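Once every entity has a label, it is often handy to collect the entities per category. Here is a minimal sketch that groups (text, label) pairs with a defaultdict. The pairs are copied from the output above, and `group_by_label` is an illustrative helper, not part of spaCy.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Map each entity label to the entity texts that carry it, in order."""
    groups = defaultdict(list)
    for text, label in pairs:
        groups[label].append(text)
    return dict(groups)

# (entity text, entity label) pairs copied from the output above
pairs = [
    ("Tony Stark", "PERSON"),
    ("StarkEnterprises", "ORG"),
    ("Emily Clark", "PERSON"),
    ("Microsoft", "ORG"),
    ("Manchester", "GPE"),
    ("Bible", "WORK_OF_ART"),
    ("French", "LANGUAGE"),
]

by_label = group_by_label(pairs)
print(by_label["PERSON"])  # ['Tony Stark', 'Emily Clark']
print(by_label["ORG"])     # ['StarkEnterprises', 'Microsoft']
```

With a real Doc you could call it as `group_by_label((ent.text, ent.label_) for ent in doc.ents)`.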


spaCy also provides a special visualization for NER through displacy. Using the displacy.render() function, you can set style='ent' to visualize the entities.

In [13]:
# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

Now that you have got a grasp on basic terms and process, let’s move on to see how named entity recognition is useful for us.

Consider this article about competition in the mobile industry.

In [18]:
mobile_industry_article=""" 30 Major mobile phone brands Compete in India – A Case Study of Success and Failures
Is the Indian mobile market a terrible War Zone? We have more than 30 brands competing with each other. Let’s find out some insights about the world second-largest mobile bazaar.There is a massive invasion by Chinese mobile brands in India in the last four years. Some of the brands have been able to make a mark while others like Meizu, Coolpad, ZTE, and LeEco are a failure.On one side, there are brands like Sony or HTC that have quit from the Indian market on the other side we have new brands like Realme or iQOO entering the marketing in recent months.The mobile market is so competitive that some of the brands like Micromax, which had over 18% share back in 2014, now have less than 5%. Even the market leader Samsung with a 34% market share in 2014, now has a 21% share whereas Xiaomi has become a market leader. The battle is fierce and to sustain and scale-up is going to be very difficult for any new entrant.new comers in Indian Mobile MarketiQOO –They have recently (March 2020) launched the iQOO 3 in India with its first 5G phone – iQOO 3. The new brand is part of the Vivo or the BBK electronics group that also owns several other brands like Oppo, Oneplus and Realme.Realme – Realme launched the first-ever phone – Realme 1 in November 2018 and has quickly became a popular brand in India. The brand is one of the highest sellers in online space and even reached a 16% market share threatening Xiaomi’s dominance.iVoomi – In 2017, we have seen the entry of some new Chinese mobile brands likeiVoomi which focuses on the sub 10k price range, and is a popular online player. They have an association with Flipkart.Techno &amp; Infinix – Transsion Group’s Tecno and Infinix brands debuted in India in mid-2017 and are focusing on the low end and mid-range phones in the price range of Rs. 5000 to Rs. 12000.10.OR &amp; Lephone – 10.OR has a partnership with Amazon India and is an exclusive online brand with phones like 10.OR D, G and E. 
However, the brand is not very aggressive currently.Kult – Kult is another player who launched a very aggressively priced Kult Beyond mobile in 2017 and followed up by launching 2-3 more models.However, most of these new brands are finding it difficult to strengthen their footing in India. As big brands like Xiaomi leave no stone unturned to make things difficult.Also, it is worth noting that there is less Chinese players coming to India now. As either all the big brands have already set shop or burnt their hands and retreated to the homeland China.Chinese/ Global  Brands Which failed or are at the Verge of Failing in India?
There are a lot more failures in the market than the success stories. Let’s first look at the failures and then we will also discuss why some brands were able to succeed in India.HTC – The biggest surprise this year for me was the failure of HTC in India. The brand has been in the country for many years, in fact, they were the first brand to launch Android mobiles. Finally HTC decided to call it a day in July 2018.LeEco – LeEco looked promising and even threatening to Xiaomi when it came to India. The company launched a series of new phones and smart TVs at affordable rates. Unfortunately, poor financial planning back home caused the brand to fail in India too.LG – The company seems to have lost focus and are doing poorly in all segments. While the budget and mid-range offering are uncompetitive, the high-end models are not preferred by buyers.Sony – Absurd pricing and lack of ability to understand the Indian buyers have caused Sony to shrink mobile operations in India. In the last 2 years, there are far fewer launches and hardly any promotions or hype around the new products.Meizu – Meizu is also a struggling brand in India and is going nowhere with the current strategy. There are hardly any popular mobiles nor a retail presence.ZTE – The company was aggressive till last year with several new phones launching under the Nubia banner, but with recent issues in the US, they have even lost the plot in India.Coolpad – I still remember the first meeting with Coolpad CEO in Mumbai when the brand started operations. There were big dreams and ambitions, but the company has not been able to deliver and keep up with the rivals in the last 1 year.Gionee – Gionee was doing well in the retail, but the infighting in the company and loss of focus from the Chinese parent company has made it a failure. The company is planning a comeback. However, we will have to wait and see when that happens."""

What if you want to know all the companies that are mentioned in this article?

This is where Named Entity Recognition helps. You can check which entities are organizations using the label_ attribute, as shown in the code below.

In [19]:
# creating spacy doc
mobile_doc=nlp(mobile_industry_article)

# List to store name of mobile companies
list_of_org=[]

# Appending entities which have the label 'ORG' to the list
for entity in mobile_doc.ents:
  if entity.label_=='ORG':
    list_of_org.append(entity.text)

print(list_of_org)

['Meizu', 'ZTE', 'LeEco', 'Sony', 'HTC', 'Xiaomi', 'Xiaomi', 'Flipkart', 'Techno &amp', 'Infinix – Transsion Group', 'Infinix', '12000.10.OR &amp', 'Amazon India', 'Kult', 'Kult', 'Kult Beyond', 'HTC', 'Android', 'Sony', 'Sony', 'Meizu', 'Meizu', 'ZTE', 'Nubia']


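You have successfully extracted a list of the companies mentioned in the article. The list is not perfect, though: entries such as 'Kult Beyond' (a phone) and 'Android' are not companies, and several names appear more than once.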
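If you only need each name once, the repeats can be dropped while keeping the order of first appearance. This sketch uses dict.fromkeys for that. To stay short it works on the first seven entries of the output above rather than the full list.

```python
# First seven entries of the ORG list printed above
list_of_org = ['Meizu', 'ZTE', 'LeEco', 'Sony', 'HTC', 'Xiaomi', 'Xiaomi']

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order
unique_orgs = list(dict.fromkeys(list_of_org))

print(unique_orgs)  # ['Meizu', 'ZTE', 'LeEco', 'Sony', 'HTC', 'Xiaomi']
```

A set would also remove duplicates, but it would not keep the names in the order they appear in the article.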

Let us also discuss another application. You come across many articles about theft and other crimes.

In [20]:
# Creating a doc on news articles
news_text="""Indian man has allegedly duped nearly 50 businessmen in the UAE of USD 1.6 million and fled the country in the most unlikely way -- on a repatriation flight to Hyderabad, according to a media report on Saturday.Yogesh Ashok Yariava, the prime accused in the fraud, flew from Abu Dhabi to Hyderabad on a Vande Bharat repatriation flight on May 11 with around 170 evacuees, the Gulf News reported.Yariava, the 36-year-old owner of the fraudulent Royal Luck Foodstuff Trading, made bulk purchases worth 6 million dirhams (USD 1.6 million) against post-dated cheques from unsuspecting traders before fleeing to India, the daily said.
The bought goods included facemasks, hand sanitisers, medical gloves (worth nearly 5,00,000 dirhams), rice and nuts (3,93,000 dirhams), tuna, pistachios and saffron (3,00,725 dirhams), French fries and mozzarella cheese (2,29,000 dirhams), frozen Indian beef (2,07,000 dirhams) and halwa and tahina (52,812 dirhams).
The list of items and defrauded persons keeps getting longer as more and more victims come forward, the report said.
The aggrieved traders have filed a case with the Bur Dubai police station.
The traders said when the dud cheques started bouncing they rushed to the Royal Luck's office in Dubai but the shutters were down, even the fraudulent company's warehouses were empty."""

news_doc=nlp(news_text)

While using this for a case study, you might need to avoid using the original names of people, companies and places. How can you do it?

Write a function that scans the text for named entities with the labels PERSON, ORG and GPE. These tokens can then be replaced by “UNKNOWN”.

In [21]:
# Function to identify if tokens are named entities and replace them with UNKNOWN
def remove_details(word):
  if word.ent_type_ in ('PERSON', 'ORG', 'GPE'):
    return ' UNKNOWN '
  # text_with_ws replaces token.string, which was removed in spaCy v3
  return word.text_with_ws


# Function where each token of spacy doc is passed through remove_details()
def update_article(doc):
  # Merging each multi-token entity into a single token
  # (doc.retokenize() replaces Span.merge(), which was removed in spaCy v3)
  with doc.retokenize() as retokenizer:
    for ent in doc.ents:
      retokenizer.merge(ent)
  # Passing each token through remove_details() function.
  tokens = map(remove_details, doc)
  return ''.join(tokens)

# Passing our news_doc to the function update_article()
update_article(news_doc)

"Indian man has allegedly duped nearly 50 businessmen in the  UNKNOWN of USD 1.6 million and fled the country in the most unlikely way -- on a repatriation flight to  UNKNOWN , according to a media report on Saturday. UNKNOWN , the prime accused in the fraud, flew from  UNKNOWN to  UNKNOWN on a Vande Bharat repatriation flight on May 11 with around 170 evacuees,  UNKNOWN reported. UNKNOWN , the 36-year-old owner of the fraudulent  UNKNOWN , made bulk purchases worth 6 million dirhams (USD 1.6 million) against post-dated cheques from unsuspecting traders before fleeing to  UNKNOWN , the daily said.\nThe bought goods included facemasks, hand sanitisers, medical gloves (worth nearly 5,00,000 dirhams), rice and nuts (3,93,000 dirhams), tuna, pistachios and saffron (3,00,725 dirhams), French fries and mozzarella cheese (2,29,000 dirhams), frozen Indian beef (2,07,000 dirhams) and halwa and  UNKNOWN (52,812 dirhams).\nThe list of items and defrauded persons keeps getting longer as more and m

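An alternative that leaves the Doc untouched is to work on the raw string using the character offsets of each entity. The sketch below assumes you have (start, end, label) triples, which with spaCy would come from ent.start_char, ent.end_char and ent.label_. The `redact` helper is illustrative, and the offsets are computed here with str.find on a sentence from the earlier NER example so that it runs without a model.

```python
def redact(text, spans, labels=("PERSON", "ORG", "GPE"), placeholder="UNKNOWN"):
    """Replace every non-overlapping (start, end, label) span whose label is sensitive."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels:
            continue
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(placeholder)         # the entity itself is dropped
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)

text = "Emily Clark works at Microsoft and lives in Manchester."
entities = [("Emily Clark", "PERSON"), ("Microsoft", "ORG"), ("Manchester", "GPE")]

# Build (start, end, label) triples; with spaCy these come from the entity spans
spans = [(text.find(name), text.find(name) + len(name), label)
         for name, label in entities]

redacted = redact(text, spans)
print(redacted)  # UNKNOWN works at UNKNOWN and lives in UNKNOWN.
```

With a real Doc the call would look like `redact(news_doc.text, [(e.start_char, e.end_char, e.label_) for e in news_doc.ents])`.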
Using spaCy’s Matcher, you can identify token patterns. But when you have whole phrases to match, writing Matcher patterns takes a lot of time and is not efficient.

spaCy provides PhraseMatcher, which can be used when you have a large number of terms (single or multi-token) to match in a text document. Writing patterns for Matcher is very difficult in this case. PhraseMatcher solves this problem, as you can pass Doc patterns rather than Token patterns.

The procedure to use PhraseMatcher is very similar to Matcher:

1. Initialize a PhraseMatcher object with a vocab.
2. Define the terms you want to match.
3. Add the patterns to the matcher.
4. Run the text through the matcher to extract the matching positions.
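To see what the last step produces, here is a plain-Python sketch of phrase matching over a list of tokens. It is a toy stand-in for illustration, not how PhraseMatcher is implemented: `find_phrases` is a hypothetical helper and the sample sentence is made up. Like a spaCy matcher, it reports each match as a start and end token index.

```python
def find_phrases(tokens, terms):
    """Return (term, start, end) for every exact occurrence of a term in tokens."""
    # Represent each term as a tuple of its tokens
    patterns = {tuple(term.split()): term for term in terms}
    matches = []
    for start in range(len(tokens)):
        for pattern, term in patterns.items():
            end = start + len(pattern)
            if tuple(tokens[start:end]) == pattern:
                matches.append((term, start, end))
    return matches

terms_list = ['Bruce Wayne', 'Tony Stark', 'Batman', 'Harry Potter', 'Severus Snape']
sample_tokens = "Batman was created by Bill Finger ; Tony Stark by Stan Lee".split()

matches = find_phrases(sample_tokens, terms_list)
print(matches)  # [('Batman', 0, 1), ('Tony Stark', 7, 9)]
```

With spaCy, calling the matcher on a Doc returns (match_id, start, end) tuples, and `doc[start:end]` gives the matched span.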


In [22]:
from spacy.matcher import PhraseMatcher
# PhraseMatcher
# After importing , first you need to initialize the PhraseMatcher with vocab through below command
matcher = PhraseMatcher(nlp.vocab)

In [23]:
# Terms to match
terms_list = ['Bruce Wayne', 'Tony Stark', 'Batman', 'Harry Potter', 'Severus Snape']

In [24]:
# Make a list of docs
patterns = [nlp.make_doc(text) for text in terms_list]

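You can add the patterns to your matcher through the matcher.add() method.

The inputs for the function are a custom ID for your matcher and the list of patterns. An optional callback function can be passed through the on_match keyword argument.

In [25]:
# spaCy v3 signature; spaCy v2 used matcher.add("phrase_matcher", None, *patterns)
matcher.add("phrase_matcher", patterns)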

Now you can apply your matcher to your spacy text document. Below, you have a text article on prominent fictional characters and their creators.



In [26]:
# Matcher Object
fictional_char_doc = nlp("""Superman (first appearance: 1938)  Created by Jerry Siegal and Joe Shuster for Action Comics #1 (DC Comics).Mickey Mouse (1928)  Created by Walt Disney and Ub Iworks for Steamboat Willie.Bugs Bunny (1940)  Created by Warner Bros and originally voiced by Mel Blanc.Batman (1939) Created by Bill Finger and Bob Kane for Detective Comics #27 (DC Comics).
Dorothy Gale (1900)  Created by L. Frank Baum for novel The Wonderful Wizard of Oz. Later portrayed by Judy Garland in the 1939 film adaptation.Darth Vader (1977) Created by George Lucas for Star Wars IV: A New Hope.The Tramp (1914)  Created and portrayed by Charlie Chaplin for Kid Auto Races at Venice.Peter Pan (1902)  Created by J.M. Barrie for novel The Little White Bird.
Indiana Jones (1981)  Created by George Lucas for Raiders of the Lost Ark. Portrayed by Harrison Ford.Rocky Balboa (1976)  Created and portrayed by Sylvester Stallone for Rocky.Vito Corleone (1969) Created by Mario Puzo for novel The Godfather. Later portrayed by Marlon Brando and Robert DeNiro in Coppola’s film adaptation.Han Solo (1977) Created by George Lucas for Star Wars IV: A New Hope. 
Portrayed most famously by Harrison Ford.Homer Simpson (1987)  Created by Matt Groening for The Tracey Ullman Show, later The Simpsons as voiced by Dan Castellaneta.Archie Bunker (1971) Created by Norman Lear for All in the Family. Portrayed by Carroll O’Connor.Norman Bates (1959) Created by Robert Bloch for novel Psycho.  Later portrayed by Anthony Perkins in Hitchcock’s film adaptation.King Kong (1933) 
Created by Edgar Wallace and Merian C Cooper for the film King Kong.Lucy Ricardo (1951) Portrayed by Lucille Ball for I Love Lucy.Spiderman (1962)  Created by Stan Lee and Steve Ditko for Amazing Fantasy #15 (Marvel Comics).Barbie (1959)  Created by Ruth Handler for the toy company Mattel Spock (1964)  Created by Gene Roddenberry for Star Trek. Portrayed most famously by Leonard Nimoy.
Godzilla (1954) Created by Tomoyuki Tanaka, Ishiro Honda, and Eiji Tsubaraya for the film Godzilla.The Joker (1940)  Created by Jerry Robinson, Bill Finger, and Bob Kane for Batman #1 (DC Comics)Winnie-the-Pooh (1924)  Created by A.A. Milne for verse book When We Were Young.Popeye (1929)  Created by E.C. Segar for comic strip Thimble Theater (King Features).Tarzan (1912) Created by Edgar Rice Burroughs for the novel Tarzan of the Apes.Forrest Gump (1986)  Created by Winston Groom for novel Forrest Gump.  Later portrayed by Tom Hanks in Zemeckis’ film adaptation.Hannibal Lector (1981)  Created by Thomas Harris for the novel Red Dragon. Portrayed most famously by Anthony Hopkins in the 1991 Jonathan Demme film The Silence of the Lambs.
Big Bird (1969) Created by Jim Henson and portrayed by Carroll Spinney for Sesame Street.Holden Caulfield (1945) Created by J.D. Salinger for the Collier’s story “I’m Crazy.”  Reworked into the novel The Catcher in the Rye in 1951.Tony Montana (1983)  Created by Oliver Stone for film Scarface.  Portrayed by Al Pacino.Tony Soprano (1999)  Created by David Chase for The Sopranos. Portrayed by James Gandolfini.
The Terminator (1984)  Created by James Cameron and Gale Anne Hurd for The Terminator. Portrayed by Arnold Schwarzenegger.Jon Snow (1996)  Created by George RR Martin for the novel The Game of Thrones.  Portrayed by Kit Harrington.Charles Foster Kane (1941)  Created and portrayed by Orson Welles for Citizen Kane.Scarlett O’Hara (1936)  Created by Margaret Mitchell for the novel Gone With the Wind. Portrayed most famously by Vivien Leigh 
for the 1939 Victor Fleming film adaptation.Marty McFly (1985) Created by Robert Zemeckis and Bob Gale for Back to the Future. Portrayed by Michael J. Fox.Rick Blaine (1940)  Created by Murray Burnett and Joan Alison for the unproduced stage play Everybody Comes to Rick’s. Later portrayed by Humphrey Bogart in Michael Curtiz’s film adaptation Casablanca.Man With No Name (1964)  Created by Sergio Leone for A Fistful of Dollars, which was adapted from a ronin character in Kurosawa’s Yojimbo (1961).  Portrayed by Clint Eastwood.Charlie Brown (1948)  Created by Charles M. Shultz for the comic strip L’il Folks; popularized two years later in Peanuts.E.T. (1982)  Created by Melissa Mathison for the film E.T.: the Extra-Terrestrial.Arthur Fonzarelli (1974)  Created by Bob Brunner for the show Happy Days. Portrayed by Henry Winkler.)Phillip Marlowe (1939)  Created by Raymond Chandler for the novel The Big Sleep.Jay Gatsby (1925)  Created by F. Scott Fitzgerald for the novel The Great Gatsby.Lassie (1938) Created by Eric Knight for a Saturday Evening Post story, later turned into the novel Lassie Come-Home in 1940, film adaptation in 1943, and long-running television show in 1954.  Most famously portrayed by the dog Pal.
Fred Flintstone (1959)  Created by William Hanna and Joseph Barbera for The Flintstones. Voiced most notably by Alan Reed. Rooster Cogburn (1968)  Created by Charles Portis for the novel True Grit. Most famously portrayed by John Wayne in the 1969 film adaptation. Atticus Finch (1960)  Created by Harper Lee for the novel To Kill a Mockingbird.  (Appeared in the earlier work Go Set A Watchman, though this was not published until 2015)  Portrayed most famously by Gregory Peck in the Robert Mulligan film adaptation. Kermit the Frog (1955)  Created and performed by Jim Henson for the show Sam and Friends. Later popularized in Sesame Street (1969) and The Muppet Show (1976) George Bailey (1943)  Created by Phillip Van Doren Stern (then as George Pratt) for the short story The Greatest Gift. Later adapted into Capra’s It’s A Wonderful Life, starring James Stewart as the renamed George Bailey. Yoda (1980) Created by George Lucas for The Empire Strikes Back. Sam Malone (1982)  Created by Glen and Les Charles for the show Cheers.  Portrayed by Ted Danson. Zorro (1919)  Created by Johnston McCulley for the All-Story Weekly pulp magazine story The Curse of Capistrano.Later adapted to the Douglas Fairbanks’ film The Mark of Zorro (1920).Moe, Larry, and Curly (1928)  Created by Ted Healy for the vaudeville act Ted Healy and his Stooges. Mary Poppins (1934)  Created by P.L. Travers for the children’s book Mary Poppins. Ron Burgundy (2004)  Created by Will Ferrell and Adam McKay for the film Anchorman: The Legend of Ron Burgundy.  Portrayed by Will Ferrell. Mario (1981)  Created by Shigeru Miyamoto for the video game Donkey Kong. Harry Potter (1997)  Created by J.K. Rowling for the novel Harry Potter and the Philosopher’s Stone. The Dude (1998)  Created by Ethan and Joel Coen for the film The Big Lebowski. Portrayed by Jeff Bridges.
Gandalf (1937)  Created by J.R.R. Tolkien for the novel The Hobbit. The Grinch (1957)  Created by Dr. Seuss for the story How the Grinch Stole Christmas! Willy Wonka (1964)  Created by Roald Dahl for the children’s novel Charlie and the Chocolate Factory. The Hulk (1962)  Created by Stan Lee and Jack Kirby for The Incredible Hulk #1 (Marvel Comics) Scooby-Doo (1969)  Created by Joe Ruby and Ken Spears for the show Scooby-Doo, Where Are You! George Costanza (1989)  Created by Larry David and Jerry Seinfeld for the show Seinfeld.  Portrayed by Jason Alexander.Jules Winfield (1994)  Created by Quentin Tarantino for the film Pulp Fiction. Portrayed by Samuel L. Jackson. John McClane (1988)  Based on the character Detective Joe Leland, who was created by Roderick Thorp for the novel Nothing Lasts Forever. Later adapted into the John McTernan film Die Hard, starring Bruce Willis as McClane. Ellen Ripley (1979)  Created by Don O’cannon and Ronald Shusett for the film Alien.  Portrayed by Sigourney Weaver. Ralph Kramden (1951)  Created and portrayed by Jackie Gleason for “The Honeymooners,” which became its own show in 1955.Edward Scissorhands (1990)  Created by Tim Burton for the film Edward Scissorhands.  Portrayed by Johnny Depp.Eric Cartman (1992)  Created by Trey Parker and Matt Stone for the animated short Jesus vs Frosty.  Later developed into the show South Park, which premiered in 1997.  Voiced by Trey Parker.
Walter White (2008)  Created by Vince Gilligan for Breaking Bad.  Portrayed by Bryan Cranston. Cosmo Kramer (1989)  Created by Larry David and Jerry Seinfeld for Seinfeld.  Portrayed by Michael Richards.Pikachu (1996)  Created by Atsuko Nishida and Ken Sugimori for the Pokemon video game and anime franchise.Michael Scott (2005)  Based on a character from the British series The Office, created by Ricky Gervais and Steven Merchant.  Portrayed by Steve Carell.Freddy Krueger (1984)  Created by Wes Craven for the film A Nightmare on Elm Street. Most famously portrayed by Robert Englund.
Captain America (1941)  Created by Joe Simon and Jack Kirby for Captain America Comics #1 (Marvel Comics)Goku (1984)  Created by Akira Toriyama for the manga series Dragon Ball Z.Bambi (1923)  Created by Felix Salten for the children’s book Bambi, a Life in the Woods. Later adapted into the Disney film Bambi in 1942.Ronald McDonald (1963) Created by Williard Scott for a series of television spots.Waldo/Wally (1987) Created by Martin Hanford for the children’s book Where’s Wally? (Waldo in US edition) Frasier Crane (1984)  Created by Glen and Les Charles for Cheers.  Portrayed by Kelsey Grammar.Omar Little (2002)  Created by David Simon for The Wire.Portrayed by Michael K. Williams.
Wolverine (1974)  Created by Roy Thomas, Len Wein, and John Romita Sr for The Incredible Hulk #180 (Marvel Comics) Jason Voorhees (1980)  Created by Victor Miller for the film Friday the 13th. Betty Boop (1930)  Created by Max Fleischer and the Grim Network for the cartoon Dizzy Dishes. Bilbo Baggins (1937)  Created by J.R.R. Tolkien for the novel The Hobbit. Tom Joad (1939)  Created by John Steinbeck for the novel The Grapes of Wrath. Later adapted into the 1940 John Ford film and portrayed by Henry Fonda.Tony Stark (Iron Man) (1963)  Created by Stan Lee, Larry Lieber, Don Heck and Jack Kirby for Tales of Suspense #39 (Marvel Comics)Porky Pig (1935)  Created by Friz Freleng for the animated short film I Haven’t Got a Hat. Voiced most famously by Mel Blanc.Travis Bickle (1976)  Created by Paul Schrader for the film Taxi Driver. Portrayed by Robert De Niro.
Hawkeye Pierce (1968)  Created by Richard Hooker for the novel MASH: A Novel About Three Army Doctors.  Famously portrayed by both Alan Alda and Donald Sutherland. Don Draper (2007)  Created by Matthew Weiner for the show Mad Men.  Portrayed by Jon Hamm. Cliff Huxtable (1984)  Created and portrayed by Bill Cosby for The Cosby Show. Jack Torrance (1977)  Created by Stephen King for the novel The Shining. Later adapted into the 1980 Stanley Kubrick film and portrayed by Jack Nicholson. Holly Golightly (1958)  Created by Truman Capote for the novella Breakfast at Tiffany’s.  Later adapted into the 1961 Blake Edwards films starring Audrey Hepburn as Holly. Shrek (1990)  Created by William Steig for the children’s book Shrek! Later adapted into the 2001 film starring Mike Myers as the titular character. Optimus Prime (1984)  Created by Dennis O’Neil for the Transformers toy line.Sonic the Hedgehog (1991)  Created by Naoto Ohshima and Yuji Uekawa for the Sega Genesis game of the same name.Harry Callahan (1971)  Created by Harry Julian Fink and R.M. Fink for the movie Dirty Harry.  Portrayed by Clint Eastwood.Bubble: Hercule Poirot, Tyrion Lannister, Ron Swanson, Cercei Lannister, J.R. Ewing, Tyler Durden, Spongebob Squarepants, The Genie from Aladdin, Pac-Man, Axel Foley, Terry Malloy, Patrick Bateman
Pre-20th Century: Santa Claus, Dracula, Robin Hood, Cinderella, Huckleberry Finn, Odysseus, Sherlock Holmes, Romeo and Juliet, Frankenstein, Prince Hamlet, Uncle Sam, Paul Bunyan, Tom Sawyer, Pinocchio, Oliver Twist, Snow White, Don Quixote, Rip Van Winkle, Ebenezer Scrooge, Anna Karenina, Ichabod Crane, John Henry, The Tooth Fairy,
Br’er Rabbit, Long John Silver, The Mad Hatter, Quasimodo """)


character_matches = matcher(fictional_char_doc)

The PhraseMatcher returns a list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end].

The match_id is the hash value of the string ID under which the pattern was added.

In [27]:
# Matching positions
character_matches

[(520014689628841516, 56, 57),
 (520014689628841516, 449, 450),
 (520014689628841516, 1352, 1354),
 (520014689628841516, 1365, 1367),
 (520014689628841516, 2084, 2086)]

The matcher found five matches, but the tuples alone don’t tell you which terms matched. To see them, extract the Span using start and end as shown below.

In [28]:
# Matched items
for match_id, start, end in character_matches:
    span = fictional_char_doc[start:end]
    print(span.text)

Batman
Batman
Harry Potter
Harry Potter
Tony Stark


You can see that ‘Harry Potter’ and ‘Batman’ were mentioned twice and ‘Tony Stark’ once, but the other terms didn’t match.

Another useful feature of PhraseMatcher is that, while initializing the matcher, you can use the attr parameter to set rules for how the matching has to happen.

How to use attr?

Setting attr changes the token attribute that is compared to determine a match. For example, if you use attr='LOWER', matching becomes case-insensitive.

The example below demonstrates this.

In [29]:
# Using the attr parameter as 'LOWER'
case_insensitive_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Creating doc & pattern
my_doc=nlp('I wish to visit new york city')
terms=['New York']
pattern=[nlp(term) for term in terms]

# adding pattern to the matcher
case_insensitive_matcher.add("matcher",None,*pattern)

# applying matcher to the doc
my_matches=case_insensitive_matcher(my_doc)

for match_id,start,end in my_matches:
  span=my_doc[start:end]
  print(span.text)

new york


You can observe that the phrase was matched successfully despite the difference in case.
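To see what matching on LOWER amounts to, here is a minimal plain-Python sketch. It does not call spaCy: whitespace splitting stands in for the tokenizer, and find_phrase_lower is a hypothetical helper written only for this illustration. The idea it shows is that both the text and the pattern are lowercased before the token sequences are compared.

```python
def find_phrase_lower(tokens, phrase_tokens):
    """Return (start, end) token spans where the phrase matches, ignoring case."""
    target = [t.lower() for t in phrase_tokens]
    lowered = [t.lower() for t in tokens]
    n = len(target)
    spans = []
    for start in range(len(lowered) - n + 1):
        # Compare a window of lowercased tokens with the lowercased phrase
        if lowered[start:start + n] == target:
            spans.append((start, start + n))
    return spans

# Whitespace splitting is only a stand-in for spaCy's tokenizer (an assumption).
tokens = "I wish to visit new york city".split()
spans = find_phrase_lower(tokens, ["New", "York"])
matched = [" ".join(tokens[start:end]) for start, end in spans]
print(spans)
print(matched)
```

As with the matcher output, each span is a pair of token positions where end is exclusive, so tokens[start:end] gives back the matched words in their original case.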

Let’s see a more useful case.

If you set attr='SHAPE', then matching will be based on the shape of the terms in the pattern.

This can be used to match URLs, dates of a specific format, time formats and other items whose shape is always the same. Let us consider a text containing information about various radio channels.

You want to extract the channels (in the form ddd.d).

In [14]:
my_doc = nlp('From 8 am , Mr.X will be speaking on your favorite chanel 191.1. Afterward there shall be an exclusive interview with actor Vijay on channel 194.1 . Hope you are having a great day. Call us on 666666')

In [15]:
# Let us create the pattern. You need to pass an example radio channel of the desired shape as pattern to the matcher.
pattern=nlp('154.6')

Your pattern is ready. Now initialize the PhraseMatcher with attr set to "SHAPE", then add the pattern to the matcher.

In [17]:
# Initializing the matcher and adding pattern
from spacy.matcher import PhraseMatcher
pincode_matcher= PhraseMatcher(nlp.vocab,attr="SHAPE")
pincode_matcher.add("pincode_matching", None, pattern)

You can apply the matcher to your doc as usual and print the matching phrases.

In [18]:
# Applying matcher on doc
matches = pincode_matcher(my_doc)

# Printing the matched phrases
for match_id, start, end in matches:
  span = my_doc[start:end]
  print(span.text)


191.1
194.1
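It may help to see roughly how a shape is derived from a token. The function below is a simplified approximation written for this illustration (simple_shape is a hypothetical name, not a spaCy function): letters become x or X, digits become d, other characters are kept, and runs of the same symbol are cut off after four, which is my understanding of how spaCy abbreviates long shapes.

```python
def simple_shape(text, max_run=4):
    """Approximate a token's shape: X/x for letters, d for digits, others kept."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        # Count how long the current run of identical symbols is
        if symbol == last:
            run += 1
        else:
            run = 1
            last = symbol
        # Keep at most max_run symbols from any one run (assumed truncation rule)
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)

for token in ["154.6", "191.1", "194.1", "666666"]:
    print(token, simple_shape(token))
```

Because 154.6, 191.1 and 194.1 all reduce to the same shape, a single example pattern is enough to match both channels, while 666666 reduces to a different shape and is ignored.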


Entity Ruler is interesting and very useful.

While trying to detect entities, sometimes certain names or organizations are not recognized by default. It might be because they are small scale or rare. Wouldn’t it be better to improve the accuracy of our doc.ents attribute?

spaCy provides a more advanced component, EntityRuler, that lets you match named entities based on pattern dictionaries. Overall, it makes Named Entity Recognition more accurate for the entities you care about.

It is a pipeline component and can be imported as shown below.

In [19]:
from spacy.pipeline import EntityRuler
# Initialize
ruler = EntityRuler(nlp)

What type of patterns do you pass to the EntityRuler?

Basically, you need to pass a list of dictionaries, where each dictionary represents a pattern to be matched.

Each dictionary has two keys, "label" and "pattern".

label : Holds the entity type as its value, e.g. PERSON, GPE, etc.
pattern : Holds the matcher pattern as its value, e.g. John, Calcutta, etc.

For example, let us consider a situation where you want to add certain book names under the entity label WORK_OF_ART.

What will be your pattern?

The label will be WORK_OF_ART and the pattern will contain the book name you wish to add. The code below demonstrates this.

In [20]:
pattern=[{"label": "WORK_OF_ART", "pattern": "My guide to statistics"}]
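If you have several titles to register, the same structure can be built with a list comprehension. This is a small sketch in plain Python; the second title is a made-up placeholder and book_titles is a hypothetical variable name.

```python
# Hypothetical list of titles; only the first one appears in this tutorial.
book_titles = ["My guide to statistics", "An example second title"]

# One dictionary per title, each with the two required keys
patterns = [{"label": "WORK_OF_ART", "pattern": title} for title in book_titles]

for entry in patterns:
    print(entry)
```

A list built this way has the same structure as the single-entry list above, so it can be handed to the ruler in the same way.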

You can add the pattern to the ruler through the add_patterns() function.

In [21]:
ruler.add_patterns(pattern)

How can you apply the EntityRuler to your text?

You can add it to the nlp model through the add_pipe() function. It adds the ruler component to the processing pipeline.

In [22]:
# Add entity ruler to the NLP pipeline. 
# NLP pipeline is a sequence of NLP tasks that spaCy performs for a given text
# More on pipelines coming in future section in this post.
nlp.add_pipe(ruler)

Now the EntityRuler is incorporated into nlp. You can pass the text document to nlp to create a spaCy Doc. As the ruler has already been added, “My guide to statistics” will be recognized as a named entity under the category WORK_OF_ART.

You can verify it with the code below.

In [23]:
# Extract the custom entity type 
doc = nlp(" I recently published my work fanfiction by Dr.X . Right now I'm studying the book of my friend .You should try My guide to statistics for clear concepts.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('My guide to statistics', 'WORK_OF_ART')]


You have successfully enhanced the named entity recognition. It is also possible to train spaCy to detect new entities it has not seen.

EntityRuler has many more features; you’ll run into them later in this article.

Word vectors are numerical vector representations of words and documents. The numeric form captures the semantics of a word and can be used for NLP tasks such as classification.

This works because the vector representations of words that are similar in meaning and context lie closer together.

spaCy models come with built-in vectors that can be accessed directly through the attributes of Token and Doc. How can you check if the model supports tokens with vectors?

First, load a spaCy model of your choice. Here, I am using the medium English model, en_core_web_md. Next, tokenize your text document with the nlp object of the spaCy model.

You can check if a token has a built-in vector through the Token.has_vector attribute.

In [39]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.4MB/s
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp37-none-any.whl size=98051305 sha256=9c62d4770cb0a9bea1f42ab94dbf15508593afeee21d748443d9911864c0c51e
  Stored in directory: /tmp/pip-ephem-wheel-cache-7xb4s1c5/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


**When running this in Colab, if the cell block below throws an error, go to Runtime > Restart runtime. Then run the below cell block again and it will work.**

In [24]:
# Check if word vector is available
import spacy

# Loading a spacy model
nlp = spacy.load("en_core_web_md")
tokens = nlp("I am an excellent cook")

for token in tokens:
  print(token.text ,' ',token.has_vector)

I   True
am   True
an   True
excellent   True
cook   True


You can see that all tokens in the above text have a vector, because these words are part of the model’s vocabulary. Let’s see what happens when the text contains a non-existent or made-up word.

In [25]:
# Check if word vector is available
tokens=nlp("I wish to go to hogwarts lolXD ")
for token in tokens:
  print(token.text,' ',token.has_vector)

I   True
wish   True
to   True
go   True
to   True
hogwarts   True
lolXD   False


The word “lolXD” is not part of the model’s vocabulary, hence it does not have a vector. (Remember the same issue from tfidf, when the corpus used to fit the tfidf vectorizer did not contain many of the words?)

How to access the vector of the tokens?

You can access it through the token.vector attribute. Also, the token.vector_norm attribute stores the L2 norm of the token’s vector representation.

In [26]:
# Print the L2 norm of each token's vector
tokens=nlp("I wish to go to hogwarts lolXD ")
for token in tokens:
  print(token.text,' ',token.vector_norm)

I   6.4231944
wish   5.1652417
to   4.74484
go   5.05723
to   4.74484
hogwarts   7.4110312
lolXD   0.0


In [29]:
# actual vector
for token in tokens:
  print(token.vector)

[ 1.8733e-01  4.0595e-01 -5.1174e-01 -5.5482e-01  3.9716e-02  1.2887e-01
  4.5137e-01 -5.9149e-01  1.5591e-01  1.5137e+00 -8.7020e-01  5.0672e-02
  1.5211e-01 -1.9183e-01  1.1181e-01  1.2131e-01 -2.7212e-01  1.6203e+00
 -2.4884e-01  1.4060e-01  3.3099e-01 -1.8061e-02  1.5244e-01 -2.6943e-01
 -2.7833e-01 -5.2123e-02 -4.8149e-01 -5.1839e-01  8.6262e-02  3.0818e-02
 -2.1253e-01 -1.1378e-01 -2.2384e-01  1.8262e-01 -3.4541e-01  8.2611e-02
  1.0024e-01 -7.9550e-02 -8.1721e-01  6.5621e-03  8.0134e-02 -3.9976e-01
 -6.3131e-02  3.2260e-01 -3.1625e-02  4.3056e-01 -2.7270e-01 -7.6020e-02
  1.0293e-01 -8.8653e-02 -2.9087e-01 -4.7214e-02  4.6036e-02 -1.7788e-02
  6.4990e-02  8.8451e-02 -3.1574e-01 -5.8522e-01  2.2295e-01 -5.2785e-02
 -5.5981e-01 -3.9580e-01 -7.9849e-02 -1.0933e-02 -4.1722e-02 -5.5576e-01
  8.8707e-02  1.3710e-01 -2.9873e-03 -2.6256e-02  7.7330e-02  3.9199e-01
  3.4507e-01 -8.0130e-02  3.3451e-01  2.7063e-01 -2.4544e-02  7.2576e-02
 -1.8120e-01  2.3693e-01  3.9977e-01  4.5012e-01  2

Notice that when a token has no vector, its vector_norm is 0.
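For reference, the L2 norm is the square root of the sum of the squared components. The sketch below computes it in plain Python on small made-up vectors; l2_norm is a hypothetical helper, and it assumes that a token without a vector is represented by all zeros, which is consistent with the 0.0 printed for lolXD.

```python
import math

def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))

small_vector = [3.0, 4.0]        # made-up 2-dimensional example
empty_vector = [0.0, 0.0, 0.0]   # assumed stand-in for a token without a vector

print(l2_norm(small_vector))
print(l2_norm(empty_vector))
```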

Identifying the similarity of two words or tokens is crucial. It is the basis of many everyday NLP tasks like text classification, recommendation systems, etc. You need to know how similar two sentences are so that they can be grouped into the same or opposite categories.

How to find similarity of two tokens?

Every Doc or Token object has a similarity() method, using which you can compare it with another Doc or Token.

By default, the score is the cosine similarity of the two vectors.

It returns a float value. The higher the value, the more similar the two tokens or documents are.

In [32]:
# Compute Similarity
token_1=nlp("I am a software engineer.")
token_2=nlp("I work with software.")

similarity_score= token_1.similarity(token_2)
print(similarity_score)

0.8923226154433829


That is how you use the similarity function.
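If you are curious what such a score measures, the sketch below computes cosine similarity by hand on tiny made-up vectors. It is plain Python rather than spaCy's own implementation; cosine_similarity is a hypothetical helper, and returning 0.0 for an all-zero vector is a choice made for this example.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while vectors at right angles score 0. This is why the measure suits word vectors, where direction carries the meaning.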

Let me show you an example of how similarity() function on docs can help in text categorization.

In [8]:
review_1=nlp(' The food was amazing')
review_2=nlp('The food was excellent')
review_3=nlp('I did not like the food')
review_4=nlp('It was very bad experience')

score_1=review_1.similarity(review_2)
print('Similarity between review 1 and 2',score_1)

score_2=review_3.similarity(review_4)
print('Similarity between review 3 and 4',score_2)

Similarity between review 1 and 2 0.9566212627033192
Similarity between review 3 and 4 0.8461898618188776


You can see that the first two reviews have a high similarity score and hence belong in the same category (positive).

You can also check whether two tokens or docs are related (this covers both similar and opposite meanings) or completely unrelated.

In [33]:
# Compute Similarity between texts 
pizza=nlp('pizza')
burger=nlp('burger')
chair=nlp('chair')

print('Pizza and burger  ',pizza.similarity(burger))
print('Pizza and chair  ',pizza.similarity(chair))

Pizza and burger   0.7269758865234512
Pizza and chair   0.1917966191121549


You can observe that pizza and burger are both food items and have a good similarity score.

Pizza and chair, on the other hand, are unrelated, and their score is very low.

You have used tokens and docs in many ways till now. In this section, let’s dive deeper and understand the basic pipeline behind this.

When you call the nlp object on a text, the text is first segmented into tokens to create a Doc object. Following this, various processes are carried out on the Doc to add attributes like POS tags, lemmas, dependency tags, etc.

This is referred to as the processing pipeline.

The processing pipeline consists of components, where each component performs its task and passes the processed Doc to the next component. These are called pipeline components.
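The flow described above can be sketched in a few lines of plain Python. This is only a toy model, not spaCy's implementation: the "Doc" is a dictionary, the two components use made-up rules, and run_pipeline is a hypothetical helper. What it illustrates is that tokenization happens first and that each component receives the Doc, adds something to it and returns it for the next component.

```python
def toy_tagger(doc):
    # Made-up rule: label each token as NUM or WORD
    doc["tags"] = ["NUM" if tok.isdigit() else "WORD" for tok in doc["tokens"]]
    return doc

def toy_ner(doc):
    # Made-up rule: treat capitalized tokens as entities
    doc["ents"] = [tok for tok in doc["tokens"] if tok.istitle()]
    return doc

pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]

def run_pipeline(text, components):
    # Tokenization comes first and creates the initial Doc
    doc = {"text": text, "tokens": text.split()}
    # Each component processes the Doc and hands it to the next one
    for name, component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("Walmart opened 5 stores", pipeline)
print(doc["tags"])
print(doc["ents"])
```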

spaCy provides certain in-built pipeline components. Let’s look at them.

The built-in pipeline components of spaCy are:

Tokenizer : It is responsible for segmenting the text into tokens and returning a Doc object. This is the first and compulsory step in a pipeline.

Tagger : This component is called tagger. It is responsible for assigning part-of-speech tags. It takes a Doc as input and sets Doc[i].tag.

DependencyParser : This component is called parser. It is responsible for assigning the dependency tags to each token. It takes a Doc as input and returns the processed Doc.

EntityRecognizer : This component is called ner. It is responsible for identifying named entities and assigning labels to them.

TextCategorizer : This component is called textcat. It assigns categories to Docs.

EntityRuler : This component is called entity_ruler. It is responsible for assigning named entities based on pattern rules. Revisit Rule Based Matching to know more.

Sentencizer : This component is called sentencizer and can perform rule-based sentence segmentation.

merge_noun_chunks : This component is called merge_noun_chunks. It is responsible for merging each noun chunk into a single token. It has to be added to the pipeline after the tagger and parser.

merge_entities : This component is called merge_entities. It can merge each entity into a single token. It has to be added after the ner.

merge_subtokens : This component is called merge_subtokens. It can merge the subtokens into a single token.

These are the built-in pipeline components. A given spaCy model does not necessarily include all of them.

After loading a spaCy model and creating the Language object nlp, you can view the list of pipeline components present by default through the nlp.pipe_names attribute.

In [34]:
# Inspect a pipeline
import spacy
nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


You can also check if a particular component is present in the pipeline through nlp.has_pipe(). You have to pass the name of the component, like tagger, ner or textcat, as input.

In [11]:
# Check if pipeline component present
nlp.has_pipe('textcat')

False

## **Customize Spacy**

You can add a component to the processing pipeline through the nlp.add_pipe() method. You have to pass the component to be added as input.

The component can also be written by you, i.e., a custom-made pipeline component (we will come to this later). But what if you want to add a built-in component like textcat?

You can use nlp.create_pipe() and pass the component name to get any built-in pipeline component.

In [35]:
# Add new pipeline component
nlp.add_pipe(nlp.create_pipe('textcat'))

Now you can verify that the component was added using the nlp.pipe_names attribute.

In [13]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'textcat']

Observe that textcat has been added at the end. The order of the components determines the order in which the Doc will be processed.

How to specify where you want to add the new component?

The nlp.add_pipe() method provides arguments for this: before and after take the name of an existing component, while first and last are set to True.

By default, last=True is used.

If you want textcat before ner, you can set before='ner'. If you want it to run first, you can set first=True. Just remember that you should not pass more than one of these arguments, as they would contradict each other.
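To make the positioning rules concrete, here is a small sketch that applies them to a plain list of component names. It does not touch a real pipeline; add_component is a hypothetical helper whose arguments merely mirror the ones described above.

```python
def add_component(pipe_names, name, before=None, after=None, first=False, last=False):
    """Return a new list of names with `name` inserted at the requested position."""
    chosen = sum([before is not None, after is not None, bool(first), bool(last)])
    if chosen > 1:
        raise ValueError("Pass only one of before, after, first or last")
    names = list(pipe_names)
    if before is not None:
        names.insert(names.index(before), name)
    elif after is not None:
        names.insert(names.index(after) + 1, name)
    elif first:
        names.insert(0, name)
    else:
        names.append(name)  # nothing specified behaves like last=True
    return names

base = ["tagger", "parser", "ner"]
before_ner = add_component(base, "textcat", before="ner")
at_start = add_component(base, "textcat", first=True)
at_end = add_component(base, "textcat")
print(before_ner)
print(at_start)
print(at_end)
```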

In [15]:
# Adding a pipeline component
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe('textcat'),before='ner')
nlp.pipe_names

['tagger', 'parser', 'textcat', 'ner']

In [16]:
# Removing a pipeline component and printing 
nlp.remove_pipe("textcat")
print('After removing the textcat pipeline')
print(nlp.pipe_names)

After removing the textcat pipeline
['tagger', 'parser', 'ner']


In [17]:
# Renaming pipeline components
nlp.rename_pipe(old_name='ner',new_name='my_custom_ner')
nlp.pipe_names

['tagger', 'parser', 'my_custom_ner']

# Increasing efficiency of Spacy pipelines

While dealing with a huge amount of text data, the process of converting the text into a processed Doc (passing it through the pipeline components) is often time consuming.

In this section, you’ll learn various methods for different situations to help you reduce computational expense.

Let’s say you have a list of texts and you want to process them into Doc objects. The traditional method is to call the nlp object on each text. Below is the given list.

In [19]:
list_of_text_data = [
    'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.',
    'Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.',
    'Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"',
    'As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.',
    'The term military simulation can cover a wide spectrum of activities, ranging from full-scale field-exercises,[2] to abstract computerized models that can proceed with little or no human involvement',
    'As a general scientific principle, the most reliable data comes from actual observation and the most reliable theories depend on it.[4] This also holds true in military analysis',
    'Any form of training can be regarded as a "simulation" in the strictest sense of the word (inasmuch as it simulates an operational environment); however, many if not most exercises take place not to test new ideas or models, but to provide the participants with the skills to operate within existing ones.',
    'Full-scale military exercises, or even smaller-scale ones, are not always feasible or even desirable. Availability of resources, including money, is a significant factor—it costs a lot to release troops and materiel from any standing commitments, to transport them to a suitable location, and then to cover additional expenses such as petroleum, oil and lubricants (POL) usage, equipment maintenance, supplies and consumables replenishment and other items',
    'Moving away from the field exercise, it is often more convenient to test a theory by reducing the level of personnel involvement. Map exercises can be conducted involving senior officers and planners, but without the need to physically move around any troops. These retain some human input, and thus can still reflect to some extent the human imponderables that make warfare so challenging to model, with the advantage of reduced costs and increased accessibility. A map exercise can also be conducted with far less forward planning than a full-scale deployment, making it an attractive option for more minor simulations that would not merit anything larger, as well as for very major operations where cost, or secrecy, is an issue',
]

First, create the docs the usual way by calling nlp() on each individual text. You can use %%timeit to measure the time taken.



In [20]:
%%timeit
docs = [nlp(text) for text in list_of_text_data]

1 loop, best of 5: 477 ms per loop


You can observe the time taken. A more efficient way of creating the docs is the nlp.pipe() method. You can pass the list as input to it. This method usually takes less time, as it processes the texts as a stream, in batches, rather than one at a time.

In [21]:
%%timeit
docs = list(nlp.pipe(list_of_text_data))

1 loop, best of 5: 321 ms per loop


From the above output, you can observe that nlp.pipe() takes less time. When the amount of data is very large, this difference becomes significant.
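The general idea of stream processing can be sketched without spaCy. In the plain-Python example below, texts are pulled from an iterable in small batches and results are yielded one by one, so nothing has to be held in memory all at once. The names batched and process_stream are hypothetical, and str.split is only a stand-in for the real per-batch work.

```python
from itertools import islice

def batched(texts, batch_size):
    """Yield successive lists of at most batch_size texts from any iterable."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process_stream(texts, batch_size=2):
    """Lazily yield one processed result per input text, working batch by batch."""
    for batch in batched(texts, batch_size):
        for text in batch:
            # str.split stands in for the actual processing (an assumption)
            yield text.split()

sample_texts = ["first short text", "second text", "third"]
batches = list(batched(sample_texts, 2))
results = list(process_stream(sample_texts))
print(batches)
print(results)
```

Because process_stream is a generator, wrapping it in list() is what forces all texts to be processed, much like the list(...) call around nlp.pipe() in the timing cell above.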

Another way to keep the process efficient is to use only the pipeline components you need. For example, if your problem does not use POS tags, then the tagger is not necessary.

Unnecessary pipeline components can be disabled to improve loading speed and efficiency.

There are two common cases where you will need to disable pipeline components.

The first case is when you don’t need the component anywhere in your project. In this case, you can disable the component while loading the spaCy model itself, through the disable argument of the spacy.load() function. This will save you a great deal of time.

The second case is when you want to skip components only while processing a particular set of texts. The code below does this by passing disable to nlp.pipe(): ner and parser are skipped while the tagger still runs, which is why doc.is_tagged is True for every Doc.



In [22]:
nlp=spacy.load('en_core_web_sm')
for doc in nlp.pipe(list_of_text_data, disable=["ner", "parser"]):
  print(doc.is_tagged)

True
True
True
True
True
True
True
True
True


# How to Train spaCy to Autodetect New Entities (NER)

In [23]:
# Performing NER on E-commerce article

article_text="""India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
In the wake of so many players coming together on one platform, the Indian e-commerce market is envisioned to reach USD 84 billion in 2021 from USD 24 billion in 2017. Further, with the rate at which internet penetration is increasing, we can expect more and more international retailers coming to India in addition to a large pool of new startups. This, in turn, will provide a major Philip to the organized retail market and boost its share from 12% in 2017 to 22-25% by 2021. 
Here’s a view to the e-commerce giants that are dominating India’s online shopping space:
Amazon – One of the uncontested global leaders, Amazon started its journey as a simple online bookstore that gradually expanded its reach to provide a large suite of diversified products including media, furniture, food, and electronics, among others. And now with the launch of Amazon Prime and Amazon Music Limited, it has taken customer experience to a godly level, which will remain undefeatable for a very long time. 
Flipkart – Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon. 
Snapdeal – Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce. 
ShopClues – Another renowned name in the Indian e-commerce industry, ShopClues was founded in July 2011. It’s a Gurugram based company having a current valuation of INR 1.1 billion and is backed by prominent names including Nexus Venture Partners, Tiger Global, and Helion Ventures as its major investors. Presently, the platform comprises more than 5 lac sellers selling products in nine different categories such as computers, cameras, mobiles, etc. 
Paytm Mall – To compete with the existing e-commerce giants, Paytm, an online payment system has also launched its online marketplace – Paytm Mall, which offers a wide array of products ranging from men and women fashion to groceries and cosmetics, electronics and home products, and many more. The unique thing about this platform is that it serves as a medium for third parties to sell their products directly through the widely-known app – Paytm. 
Reliance Retail – Given Reliance Jio’s disruptive venture in the Indian telecom space along with a solid market presence of Reliance, it is no wonder that Reliance will soon be foraying into retail space. As of now, it has plans to build an e-commerce space that will be established on online-to-offline market program and aim to bring local merchants on board to help them boost their sales and compete with the existing industry leaders. 
Big Basket – India’s biggest online supermarket, Big Basket provides a wide variety of imported and gourmet products through two types of delivery services – express delivery and slotted delivery. It also offers pre-cut fruits along with a long list of beverages including fresh juices, cold drinks, hot teas, etc. Moreover, it not only provides farm-fresh products but also ensures that the farmer gets better prices. 
Grofers – One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes. 
Digital Mall of Asia – Going live in 2020, Digital Mall of Asia is a very unique concept coined by the founders of Yokeasia Malls. It is designed to provide an immersive digital space equipped with multiple visual and sensory elements to sellers and shoppers. It will also give retailers exclusive rights to sell a particular product category or brand in their respective cities. What makes it unique is its zero-commission model enabling retailers to pay only a fixed amount of monthly rental instead of paying commissions. With its one-of-a-kind features, DMA is expected to bring
never-seen transformation to the current e-commerce ecosystem while addressing all the existing e-commerce worries such as counterfeiting. """

doc=nlp(article_text)
for ent in doc.ents:
  print(ent.text,ent.label_)

India GPE
one CARDINAL
Indian NORP
USD 84 billion MONEY
2021 DATE
USD 24 billion MONEY
2017 DATE
India GPE
Philip PERSON
12% PERCENT
2017 DATE
22-25% PERCENT
2021 DATE
India GPE
Amazon ORG
One CARDINAL
Amazon ORG
Amazon ORG
Amazon Music Limited ORG
Flipkart PERSON
2007 DATE
Flipkart PERSON
Indian NORP
Amazon ORG
Walmart LOC
one CARDINAL
US GPE
Amazon ORG
daily DATE
2010 DATE
2011 DATE
more than 3 CARDINAL
India GPE
over 30 million CARDINAL
over 125,000 CARDINAL
Indian NORP
recent years DATE
Freecharge PERSON
Unicommerce GPE
ShopClues PERSON
Indian NORP
ShopClues ORG
July 2011 DATE
Gurugram ORG
INR ORG
1.1 billion CARDINAL
Nexus Venture Partners ORG
Helion Ventures ORG
more than 5 CARDINAL
nine CARDINAL
Paytm Mall PERSON
Paytm ORG
Paytm Mall FAC
third ORDINAL
Paytm GPE
Indian NORP
Reliance ORG
Reliance ORG
India GPE
Big Basket ORG
two CARDINAL
One CARDINAL
2013 DATE
the last 5 years DATE
daily DATE
India GPE
Indian NORP
Digital Mall FAC
Asia LOC
2020 DATE
Digital Mall ORG
Asia LOC
Yokea

As you saw, spaCy has in-built pipeline ner for Named recogniyion. Though it performs well, it’s not always completely accurate for your text .Sometimes , a word can be categorized as PERSON or a ORG depending upon the context. Also , sometimes the category you want may not be buit-in in spacy.

Observe the above output. Notice that Flipkart has been identified as PERSON; it should have been ORG. Walmart has been wrongly categorized as LOC; in this context it should have been ORG. The same goes for Freecharge, ShopClues, etc.
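One lightweight way to keep track of such errors is to compare the predicted labels with a small hand-made table of expected ones. The sketch below hard-codes a few (text, label) pairs copied from the output above. The `expected` dictionary and the variable names are illustrative assumptions; in practice `predicted` would be built from `doc.ents`.

```python
# A few (text, label) pairs copied from the NER output above
predicted = [
    ("Amazon", "ORG"),
    ("Flipkart", "PERSON"),
    ("Walmart", "LOC"),
    ("Freecharge", "PERSON"),
    ("ShopClues", "PERSON"),
    ("ShopClues", "ORG"),
]

# Hand-made gold labels for the entities we care about (assumed for illustration)
expected = {
    "Amazon": "ORG",
    "Flipkart": "ORG",
    "Walmart": "ORG",
    "Freecharge": "ORG",
    "ShopClues": "ORG",
}

# Keep only the predictions whose label differs from the expected one
mismatches = [
    (text, label, expected[text])
    for text, label in predicted
    if text in expected and label != expected[text]
]

for text, got, want in mismatches:
    print(f"{text}: predicted {got}, expected {want}")
```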



In [24]:
# Load pre-existing spacy model
import spacy
nlp=spacy.load('en_core_web_sm')

# Getting the pipeline component
ner=nlp.get_pipe("ner")

spaCy accepts training data as a list of tuples.

Each tuple should contain the text and a dictionary. The dictionary should hold the start and end character indices of each named entity in the text, and the category or label of the named entity.

For example, ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})

To do this, you’ll need example texts and the character offsets and labels of each entity contained in the texts.
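Counting character offsets by hand is error-prone, so it can help to compute them from the text. The sketch below uses only plain Python string methods. `make_example` and `check_offsets` are hypothetical helper names, not part of spaCy, and `make_example` assumes the entity string occurs in the text (it takes the first occurrence).

```python
def make_example(text, entity, label):
    """Build one (text, annotations) tuple, locating `entity` with str.find."""
    start = text.find(entity)
    if start == -1:
        raise ValueError(f"{entity!r} not found in {text!r}")
    end = start + len(entity)  # end offset is exclusive, like a Python slice
    return (text, {"entities": [(start, end, label)]})


def check_offsets(example):
    """Return the substring that each (start, end, label) actually covers."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


example = make_example("I reached Chennai yesterday.", "Chennai", "GPE")
print(example)
print(check_offsets(example))
```

Running `check_offsets` over every tuple before training is a quick way to catch an offset that covers only part of a word.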

In [25]:
# training data: (text, {"entities": [(start, end, label)]})
# start/end are character offsets into the text; end is exclusive
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
              ("Fridge can be ordered in Amazon", {"entities": [(0, 6, "PRODUCT")]}),
              ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
              ("I bought an old table", {"entities": [(16, 21, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("Flipkart started its journey from zero", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
              ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]})
              ]

The above code shows the training format. Next, add every label that appears in the training data to the ner component using its ner.add_label() method. The code below does this.

In [26]:
# Adding labels to the `ner`

for _, annotations in TRAIN_DATA:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])

Now it’s time to train the NER over these examples. But before you train, remember that apart from ner, the model has other pipeline components. These components should not be affected by training.

So, disable the other pipeline components through the nlp.disable_pipes() method.

In [27]:
# Disable pipeline components you don't need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

You have to perform the training with unaffected_pipes disabled.
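To see what that list comprehension selects, here is a small stand-alone sketch. The `pipe_names` list is an assumption standing in for `nlp.pipe_names`; the actual names depend on the model and the spaCy version you have loaded.

```python
# Hypothetical stand-in for nlp.pipe_names (actual names depend on the model)
pipe_names = ["tagger", "parser", "ner"]

# Components that should stay active during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# Everything else is collected so it can be disabled while ner is updated
unaffected_pipes = [pipe for pipe in pipe_names if pipe not in pipe_exceptions]

print(unaffected_pipes)  # ['tagger', 'parser']
```

Names in `pipe_exceptions` that are not present in the pipeline are simply never matched, so listing them does no harm.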

### Training the NER model

First, let’s understand the ideas involved before going to the code.

(a) To train an ner model, the model has to loop over the examples for a sufficient number of iterations. If you train it for just 5 or 6 iterations, it may not be effective.

(b) Before every iteration, it’s good practice to shuffle the examples randomly with the random.shuffle() function.

This ensures the model does not make generalizations based on the order of the examples.

(c) The training data is usually passed in batches.

You can call spaCy's minibatch() function over the training data, and it will return the data in batches. The minibatch function takes a size parameter to denote the batch size. You can use the utility function compounding() to generate an infinite series of compounding values for it.
The compounding() function takes three inputs: start (the first value), stop (the maximum value that can be generated) and compound, the compounding factor by which each value is multiplied to get the next one. If this is not clear, see the spaCy documentation for spacy.util.compounding.
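To make points (b) and (c) concrete, the sketch below mimics the behaviour in plain Python. `compounding_sketch` and `minibatch_sketch` are simplified, hypothetical re-implementations written for illustration only; in real code you would import `minibatch` and `compounding` from `spacy.util`. The toy data and the compounding factor of 1.5 are assumptions chosen so that the growth in batch size is visible.

```python
import itertools
import random


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, size):
    """Split items into batches whose sizes are drawn from the `size` iterator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch


data = list(range(20))   # toy stand-in for TRAIN_DATA
random.seed(0)           # fixed seed only to make the run repeatable
random.shuffle(data)     # new order before each pass over the data

sizes = compounding_sketch(2.0, 8.0, 1.5)
batch_sizes = [len(batch) for batch in minibatch_sketch(data, sizes)]
print(batch_sizes)  # [2, 3, 4, 6, 5]
```

With the arguments used in this notebook, compounding(4.0, 32.0, 1.001), the size grows by only 0.1% per batch, so the early batches all hold 4 examples. With 19 training examples that gives five batches per pass, which is consistent with the loss output printing in groups of five.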

For each iteration, the model or ner is updated through the nlp.update() command. The parameters of nlp.update(), as used in this notebook (which follows the spaCy v2 call style), are:

docs: This expects a batch of texts as input. You can pass each batch to the zip method, which will return batches of texts and annotations.

golds: Pass the annotations obtained from the zip method here.

drop: This represents the dropout rate.

losses: A dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.
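The zip(*batch) idiom used to split a batch into texts and annotations can be tried on its own. The two examples below follow this notebook's training format, and nothing here calls spaCy.

```python
batch = [
    ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
    ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
]

# zip(*batch) transposes a list of (text, annotation) pairs
# into one tuple of texts and one tuple of annotation dictionaries.
texts, annotations = zip(*batch)

print(texts)
print(annotations)
```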

At each word, update() makes a prediction. It then consults the annotations to check whether the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.

Finally, all of the training is done within the context of the nlp model with the other pipeline components disabled, to prevent them from being involved.

In [28]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
# Note: passing texts and annotations directly to nlp.update() is the spaCy v2
# call style; spaCy v3 expects a list of Example objects instead.
with nlp.disable_pipes(*unaffected_pipes):

  # Training for 30 iterations
  for iteration in range(30):

    # shuffling examples before every iteration
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
      texts, annotations = zip(*batch)
      nlp.update(
          texts,  # batch of texts
          annotations,  # batch of annotations
          drop=0.5,  # dropout - make it harder to memorise data
          losses=losses,
      )
      print("Losses", losses)

Losses {'ner': 4.972815847482707}
Losses {'ner': 7.060777689512179}
Losses {'ner': 7.397154785874591}
Losses {'ner': 12.679692163284926}
Losses {'ner': 15.525220436124073}
Losses {'ner': 1.896032368647866}
Losses {'ner': 4.921380922939392}
Losses {'ner': 7.635120868754569}
Losses {'ner': 11.701976171184327}
Losses {'ner': 15.463391020928356}
Losses {'ner': 2.0279084882036784}
Losses {'ner': 3.9485471565654677}
Losses {'ner': 3.955853801391271}
Losses {'ner': 7.297129265698089}
Losses {'ner': 13.672626542068969}
Losses {'ner': 6.45540950720207}
Losses {'ner': 10.455944284516967}
Losses {'ner': 12.8789729087433}
Losses {'ner': 14.87061946092522}
Losses {'ner': 18.268121289103284}
Losses {'ner': 4.522875945782289}
Losses {'ner': 5.918226479333725}
Losses {'ner': 9.883843705744141}
Losses {'ner': 11.329770056516054}
Losses {'ner': 11.33256825987496}
Losses {'ner': 3.57964714434587}
Losses {'ner': 3.6586533139629296}
Losses {'ner': 3.6934490205757697}
Losses {'ner': 3.762100345063061}
Losse

In [36]:
# Testing the model
doc = nlp("I was driving a Alto")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

ValueError: ignored

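The recorded run of this cell ended in a ValueError, so no entities were printed. When the cell runs successfully, the expectation is that “Alto” is predicted as a PRODUCT from the similarity of context (compare the training example “I was driving a BMW”), even though the model was never directly trained to recognize “Alto” as a vehicle name.
This is the awesome part of the NER model.

The model should not just memorize the training examples. It should learn from them and be able to generalize to new examples.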

Once you find the performance of the model satisfactory, save the updated model.

You can save it to your desired directory with the to_disk command.

After saving, you can load the model from the directory at any point of time by passing the directory path to spacy.load() function.

In [None]:
# Save the model to a directory
output_dir = Path('/content/')
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("Fridge can be ordered in FlipKart")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])