In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

## Finding Named Entities in the Text

In [5]:
doc = nlp("Tesla Inc is an American electric vehicle and clean energy company founded in 2003 by Martin Eberhard and Marc Tarpenning.")

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Tesla Inc | ORG | Companies, agencies, institutions, etc.
American | NORP | Nationalities or religious or political groups
2003 | DATE | Absolute or relative dates or periods
Martin Eberhard | PERSON | People, including fictional
Marc Tarpenning | PERSON | People, including fictional
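
When a text yields many entities, it can help to group them by label rather than read them line by line. The following is a minimal sketch of that idea. It hard-codes the (text, label) pairs printed above so that it runs without loading the model, and `group_by_label` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output above
pairs = [
    ("Tesla Inc", "ORG"),
    ("American", "NORP"),
    ("2003", "DATE"),
    ("Martin Eberhard", "PERSON"),
    ("Marc Tarpenning", "PERSON"),
]


def group_by_label(pairs):
    """Hypothetical helper: map each label to the entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(pairs)
for label, texts in grouped.items():
    print(label, "->", texts)
```

With a live `doc`, the same pairs could presumably be built with `[(ent.text, ent.label_) for ent in doc.ents]`.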


## Displaying named entities in a document

In [4]:
from spacy import displacy

displacy.render(doc, style="ent")

In [6]:
text = """"Microsoft Corporation, an American multinational technology company, was founded by Bill Gates and Paul Allen on April 4, 1975. 
Its headquarters is located in Redmond, Washington. 
Microsoft is best known for its software products like the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. 
In 2020, Microsoft was ranked No. 21 in the Fortune 500 rankings of the largest United States corporations by total revenue. As of 2020, Bill Gates's net worth is estimated to be $113.7 billion."
"""

text

'"Microsoft Corporation, an American multinational technology company, was founded by Bill Gates and Paul Allen on April 4, 1975. \nIts headquarters is located in Redmond, Washington. \nMicrosoft is best known for its software products like the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. \nIn 2020, Microsoft was ranked No. 21 in the Fortune 500 rankings of the largest United States corporations by total revenue. As of 2020, Bill Gates\'s net worth is estimated to be $113.7 billion."\n'

In [8]:
doc = nlp(text)
doc

"Microsoft Corporation, an American multinational technology company, was founded by Bill Gates and Paul Allen on April 4, 1975. 
Its headquarters is located in Redmond, Washington. 
Microsoft is best known for its software products like the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. 
In 2020, Microsoft was ranked No. 21 in the Fortune 500 rankings of the largest United States corporations by total revenue. As of 2020, Bill Gates's net worth is estimated to be $113.7 billion."

The result: organizations (Microsoft Corporation), persons (Bill Gates, Paul Allen), dates (April 4, 1975; 2020), \
a nationality (American), locations (Redmond, Washington, United States), and monetary values ($113.7 billion).

In [9]:
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Microsoft Corporation | ORG | Companies, agencies, institutions, etc.
American | NORP | Nationalities or religious or political groups
Bill Gates | PERSON | People, including fictional
Paul Allen | PERSON | People, including fictional
April 4, 1975 | DATE | Absolute or relative dates or periods
Redmond | GPE | Countries, cities, states
Washington | GPE | Countries, cities, states
Microsoft | ORG | Companies, agencies, institutions, etc.
Microsoft | ORG | Companies, agencies, institutions, etc.
Microsoft Office | ORG | Companies, agencies, institutions, etc.
Edge | ORG | Companies, agencies, institutions, etc.
2020 | DATE | Absolute or relative dates or periods
Microsoft | ORG | Companies, agencies, institutions, etc.
21 | CARDINAL | Numerals that do not fall under another type
Fortune 500 | LAW | Named documents made into laws.
United States | GPE | Countries, cities, states
2020 | DATE | Absolute or relative dates or periods
Bill Gates's | PERSON | People, including fictional
$113.7 b

In [12]:
displacy.render(doc, style="ent")

## Listing All the Entity Labels

In [13]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

The list of entity labels is also documented on this page: https://spacy.io/models/en

In [15]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_), "|", ent.start_char, '|', ent.end_char)

Tesla Inc | ORG | Companies, agencies, institutions, etc. | 0 | 9
Twitter Inc | ORG | Companies, agencies, institutions, etc. | 30 | 41
$45 billion | MONEY | Monetary values, including unit | 46 | 57
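
The last two columns are character offsets into the original string, with the end offset exclusive. A small sketch that checks this using only the numbers printed above (no model needed); the `spans` list is copied by hand from that output.

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 9, "ORG"), (30, 41, "ORG"), (46, 57, "MONEY")]

# end_char is exclusive, so text[start:end] gives back the entity text
recovered = [(text[start:end], label) for start, end, label in spans]

for piece, label in recovered:
    print(piece, "|", label)
```

This is why the offsets are handy for highlighting or replacing entities in the raw string without re-tokenizing it.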


In [16]:
text = """"Microsoft Corporation, an American multinational technology company, was founded by Bill Gates and Paul Allen on April 4, 1975. 
Its headquarters is located in Redmond, Washington. 
Microsoft is best known for its software products like the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. 
In 2020, Microsoft was ranked No. 21 in the Fortune 500 rankings of the largest United States corporations by total revenue. As of 2020, Bill Gates's net worth is estimated to be $113.7 billion."
"""

doc = nlp(text)
doc

"Microsoft Corporation, an American multinational technology company, was founded by Bill Gates and Paul Allen on April 4, 1975. 
Its headquarters is located in Redmond, Washington. 
Microsoft is best known for its software products like the Microsoft Windows line of operating systems, the Microsoft Office suite, and the Internet Explorer and Edge web browsers. 
In 2020, Microsoft was ranked No. 21 in the Fortune 500 rankings of the largest United States corporations by total revenue. As of 2020, Bill Gates's net worth is estimated to be $113.7 billion."

In [17]:
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_), "|", ent.start_char, '|', ent.end_char)

Microsoft Corporation | ORG | Companies, agencies, institutions, etc. | 1 | 22
American | NORP | Nationalities or religious or political groups | 27 | 35
Bill Gates | PERSON | People, including fictional | 85 | 95
Paul Allen | PERSON | People, including fictional | 100 | 110
April 4, 1975 | DATE | Absolute or relative dates or periods | 114 | 127
Redmond | GPE | Countries, cities, states | 161 | 168
Washington | GPE | Countries, cities, states | 170 | 180
Microsoft | ORG | Companies, agencies, institutions, etc. | 183 | 192
Microsoft | ORG | Companies, agencies, institutions, etc. | 242 | 251
Microsoft Office | ORG | Companies, agencies, institutions, etc. | 291 | 307
Edge | ORG | Companies, agencies, institutions, etc. | 345 | 349
2020 | DATE | Absolute or relative dates or periods | 368 | 372
Microsoft | ORG | Companies, agencies, institutions, etc. | 374 | 383
21 | CARDINAL | Numerals that do not fall under another type | 399 | 401
Fortune 500 | LAW | Named documents made into laws.

In [18]:
doc1 = nlp("""Apple Inc., a multinational technology company, was established by Steve Jobs, Steve Wozniak, and Ronald Wayne on April 1, 1976. 
           The company's headquarters is in Cupertino, California. 
           Apple is renowned for its hardware products, which include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, and the Apple TV digital media player. 
           In the Fortune 500 rankings of 2020, Apple was ranked No. 4 among the largest United States corporations by total revenue. 
           As of 2020, Tim Cook, the CEO of Apple, had a net worth of $1.3 billion.
           """)
doc1

Apple Inc., a multinational technology company, was established by Steve Jobs, Steve Wozniak, and Ronald Wayne on April 1, 1976. 
           The company's headquarters is in Cupertino, California. 
           Apple is renowned for its hardware products, which include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, and the Apple TV digital media player. 
           In the Fortune 500 rankings of 2020, Apple was ranked No. 4 among the largest United States corporations by total revenue. 
           As of 2020, Tim Cook, the CEO of Apple, had a net worth of $1.3 billion.
           

In [19]:
for ent in doc1.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Apple Inc. | ORG | Companies, agencies, institutions, etc.
Steve Jobs | PERSON | People, including fictional
Steve Wozniak | PERSON | People, including fictional
Ronald Wayne | PERSON | People, including fictional
April 1, 1976 | DATE | Absolute or relative dates or periods
Cupertino | GPE | Countries, cities, states
California | GPE | Countries, cities, states
Apple | ORG | Companies, agencies, institutions, etc.
iPhone | ORG | Companies, agencies, institutions, etc.
iPad | ORG | Companies, agencies, institutions, etc.
Mac | PERSON | People, including fictional
iPod | ORG | Companies, agencies, institutions, etc.
Apple Watch | ORG | Companies, agencies, institutions, etc.
Apple TV | ORG | Companies, agencies, institutions, etc.
2020 | DATE | Absolute or relative dates or periods
Apple | ORG | Companies, agencies, institutions, etc.
4 | CARDINAL | Numerals that do not fall under another type
United States | GPE | Countries, cities, states
2020 | DATE | Absolute or relative dates or per

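Here is the issue with the NER tagging in this output: product names such as iPhone, iPad, iPod, Apple Watch, and Apple TV are tagged as ORG, and Mac is tagged as PERSON.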

In [20]:
displacy.render(doc1, style="ent")
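
In the Apple output, product names such as iPhone and iPad come back as ORG and Mac comes back as PERSON. One simple way to patch known mistakes is a post-processing override table. The sketch below is plain Python under stated assumptions: `OVERRIDES` and `relabel` are hypothetical names, and the `predicted` pairs are a hand-copied subset of the printed output. spaCy also ships a rule-based `entity_ruler` pipeline component aimed at this kind of correction, which is not shown here.

```python
# Hypothetical override table: entity text -> label we want instead
OVERRIDES = {
    "iPhone": "PRODUCT",
    "iPad": "PRODUCT",
    "Mac": "PRODUCT",
    "iPod": "PRODUCT",
}


def relabel(pairs, overrides):
    """Replace the predicted label wherever the entity text has an override."""
    return [(text, overrides.get(text, label)) for text, label in pairs]


# Subset of the (text, label) pairs from the Apple output, copied by hand
predicted = [
    ("Apple", "ORG"),
    ("iPhone", "ORG"),
    ("iPad", "ORG"),
    ("Mac", "PERSON"),
    ("iPod", "ORG"),
]

corrected = relabel(predicted, OVERRIDES)
for text, label in corrected:
    print(text, "|", label)
```

An exact-match table like this only covers strings listed in advance, so it suits a small, known product vocabulary rather than open-ended text.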

## Issues with NER

In [21]:
doc = nlp("Michael Bloomberg is foudned Bloomberg in 1982")
for ent in doc.ents: 
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | ORG | Companies, agencies, institutions, etc.
1982 | DATE | Absolute or relative dates or periods


Here the model tags "Michael Bloomberg" as a person and the standalone "Bloomberg" as a company (ORG), even though the two share a name. This is one of spaCy's strengths.

# Named Entity Recognition Exercise

In [22]:
import spacy 
nlp = spacy.load("en_core_web_sm")

Exercise 1: Extract all the geographical names (cities, countries, states) from the given text.\
    text = """Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states"""

In [26]:
text = """Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states"""

In [27]:
doc = nlp(text)
doc

Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states

In [28]:
#List for storing all the names 

all_gpe_names = []

for ent in doc.ents:
    if ent.label_ == 'GPE': #Checking whether the entity is a GPE
        all_gpe_names.append(ent.text)
        
print("Geographical Location Names:", all_gpe_names)
print("Count:", len(all_gpe_names))

Geographical Location Names: ['India', 'Delhi', 'Gujarat', 'Tamilnadu', 'Pongal', 'Andhrapradesh', 'Assam', 'Bihar']
Count: 8
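
Note that 'Pongal' appears in the list even though the text describes it as a food, so the raw GPE filter is not fully reliable. The sketch below wraps the same filter in a reusable function with an optional exclusion set. `names_with_labels` and the `Ent` stand-in are hypothetical and exist only so the example runs without the model; the sample entries are written by hand, and the ORG label on "Google" is an assumption rather than model output.

```python
from collections import namedtuple

# Stand-in for a spaCy entity: only the two attributes used below
Ent = namedtuple("Ent", ["text", "label_"])


def names_with_labels(ents, labels, exclude=()):
    """Texts of entities whose label is in `labels`, minus known false positives."""
    return [
        ent.text
        for ent in ents
        if ent.label_ in labels and ent.text not in exclude
    ]


# Hand-written sample; the ORG label for "Google" is assumed for illustration
sample = [
    Ent("India", "GPE"),
    Ent("Google", "ORG"),
    Ent("Delhi", "GPE"),
    Ent("Pongal", "GPE"),
    Ent("Bihar", "GPE"),
]

gpe_names = names_with_labels(sample, {"GPE"}, exclude={"Pongal"})
print("Geographical Location Names:", gpe_names)
print("Count:", len(gpe_names))
```

Because the function only reads `.text` and `.label_`, passing `doc.ents` in place of `sample` should work the same way.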


Exercise 2:\
Extract all the birth dates of the cricketers in the given text.

In [29]:

text = """Sachin Tendulkar was born on 24 April 1973, Virat Kholi was born on 5 November 1988, Dhoni was born on 7 July 1981
and finally Ricky ponting was born on 19 December 1974."""

doc = nlp(text)

In [30]:
all_birth_dates = [] #List for storing all the birth dates

for ent in doc.ents:
    if ent.label_ == "DATE": #Checking whether the entity is a DATE
        all_birth_dates.append(ent.text)

print("All Birth Dates of the Cricket Players:", all_birth_dates)
print("Count:", len(all_birth_dates))

All Birth Dates of the Cricket Players: ['24 April 1973', '5 November 1988', '7 July 1981', '19 December 1974']
Count: 4
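
The extracted dates are still plain strings. If they need to be compared or sorted, they can be parsed with the standard library. A minimal sketch, assuming every string follows the "day month-name year" pattern seen above and that the process runs with English month names (the default C locale):

```python
from datetime import datetime

# Strings copied from the output above
all_birth_dates = ["24 April 1973", "5 November 1988", "7 July 1981", "19 December 1974"]

# "%d %B %Y" = day of month, full month name, four-digit year
parsed = [datetime.strptime(s, "%d %B %Y").date() for s in all_birth_dates]

for d in parsed:
    print(d.isoformat())

# With real date objects, ordering questions become one-liners
earliest = min(parsed)
print("Earliest birth date:", earliest.isoformat())
```

Strings in any other format would raise `ValueError`, so real text would probably need a more forgiving parser.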


In [31]:
import nltk

In [33]:
from nltk import word_tokenize
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
text

['They',
 'refuse',
 'to',
 'permit',
 'us',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit']

In [34]:
from nltk import pos_tag
pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
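
The sentence was chosen because "refuse" and "permit" each occur once as a verb and once as a noun, and the tagger separates the two uses. A short sketch that makes this explicit by collecting the tags per word; the `tagged` list is copied from the output above so the example runs without NLTK installed.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output above
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Collect every tag each (lowercased) word received, in order of appearance
tags_by_word = defaultdict(list)
for word, tag in tagged:
    tags_by_word[word.lower()].append(tag)

# Keep only the words that were given more than one distinct tag
ambiguous = {w: tags for w, tags in tags_by_word.items() if len(set(tags)) > 1}
print(ambiguous)
```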