## NLTK and the Stanford CoreNLP Library

NLTK lets you run named entity recognition with its own model, or with the aforementioned Stanford library.

The Stanford library integration requires you to perform a few steps before you can use it, including installing the required Java files and setting system environment variables. 

You can also use the Stanford library on its own, without integrating it with NLTK, or run it as an API server.

The Stanford CoreNLP library has strong support for named entity recognition as well as related NLP tasks such as coreference resolution (linking pronouns and entities together) and dependency parsing, which helps recover meaning and relationships among words or phrases in a sentence.

For our simple use case, we will use the built-in named entity recognition with NLTK.

## Using NLTK for Named Entity Recognition

In [1]:
import nltk
# take a normal sentence
sentence = '''In Chennai, I like to visit Marina beach, Mahabalipuram, Sri Parthasarathy Temple, Kapaleeshwar Temple and some restaurants rated well by TripAdvisor.'''
sentence

'In Chennai, I like to visit Marina beach, Mahabalipuram, Sri Parthasarathy Temple, Kapaleeshwar Temple and some restaurants rated well by TripAdvisor.'

In [2]:
# preprocess it via tokenization
tokenized_sent = nltk.word_tokenize(sentence)
tokenized_sent

['In',
 'Chennai',
 ',',
 'I',
 'like',
 'to',
 'visit',
 'Marina',
 'beach',
 ',',
 'Mahabalipuram',
 ',',
 'Sri',
 'Parthasarathy',
 'Temple',
 ',',
 'Kapaleeshwar',
 'Temple',
 'and',
 'some',
 'restaurants',
 'rated',
 'well',
 'by',
 'TripAdvisor',
 '.']

In [3]:
# tag the sentence for parts of speech
# this adds tags for proper nouns, pronouns, adjectives, verbs and the other
# parts of speech that NLTK uses, based on English grammar
# NNP is the tag for a singular proper noun
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent

[('In', 'IN'),
 ('Chennai', 'NNP'),
 (',', ','),
 ('I', 'PRP'),
 ('like', 'VBP'),
 ('to', 'TO'),
 ('visit', 'VB'),
 ('Marina', 'NNP'),
 ('beach', 'NN'),
 (',', ','),
 ('Mahabalipuram', 'NNP'),
 (',', ','),
 ('Sri', 'NNP'),
 ('Parthasarathy', 'NNP'),
 ('Temple', 'NNP'),
 (',', ','),
 ('Kapaleeshwar', 'NNP'),
 ('Temple', 'NNP'),
 ('and', 'CC'),
 ('some', 'DT'),
 ('restaurants', 'NNS'),
 ('rated', 'VBN'),
 ('well', 'RB'),
 ('by', 'IN'),
 ('TripAdvisor', 'NNP'),
 ('.', '.')]
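
The `NNP` tags already hint at where named entities sit. As a rough sketch (not what `ne_chunk` does internally), the snippet below joins each run of consecutive `NNP` tokens into one candidate name. The tagged pairs are copied from the output above; the helper name `proper_noun_runs` is an assumption made for illustration.

```python
from itertools import groupby

# (token, tag) pairs copied from the pos_tag output above.
tagged_sent = [
    ('In', 'IN'), ('Chennai', 'NNP'), (',', ','), ('I', 'PRP'), ('like', 'VBP'),
    ('to', 'TO'), ('visit', 'VB'), ('Marina', 'NNP'), ('beach', 'NN'), (',', ','),
    ('Mahabalipuram', 'NNP'), (',', ','), ('Sri', 'NNP'), ('Parthasarathy', 'NNP'),
    ('Temple', 'NNP'), (',', ','), ('Kapaleeshwar', 'NNP'), ('Temple', 'NNP'),
    ('and', 'CC'), ('some', 'DT'), ('restaurants', 'NNS'), ('rated', 'VBN'),
    ('well', 'RB'), ('by', 'IN'), ('TripAdvisor', 'NNP'), ('.', '.'),
]


def proper_noun_runs(tagged):
    """Join each run of consecutive NNP tokens into one candidate name."""
    return [
        " ".join(word for word, _tag in run)
        for is_nnp, run in groupby(tagged, key=lambda pair: pair[1] == "NNP")
        if is_nnp
    ]


candidates = proper_noun_runs(tagged_sent)
print(candidates)
```

This only groups tokens; it assigns no entity type, which is the part the chunker adds in the next step.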

In [4]:
# pass this tagged sentence into the ne_chunk function
# it returns the sentence as a tree
# the tree has leaves (tagged tokens) and subtrees (named entity chunks)

print(nltk.ne_chunk(tagged_sent))

(S
  In/IN
  (GPE Chennai/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  visit/VB
  (GPE Marina/NNP)
  beach/NN
  ,/,
  (GPE Mahabalipuram/NNP)
  ,/,
  (PERSON Sri/NNP Parthasarathy/NNP Temple/NNP)
  ,/,
  (PERSON Kapaleeshwar/NNP Temple/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (ORGANIZATION TripAdvisor/NNP)
  ./.)


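Chennai is identified as a GPE (geopolitical entity) and TripAdvisor as an organization, while "visit" keeps the verb tag (VB) it received in the part-of-speech step. The result is not perfect: both temples are labelled PERSON, and only "Marina" is chunked out of "Marina beach". NLTK produces these labels without consulting a knowledge base such as Wikipedia; it relies on trained statistical and grammatical parsers.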
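
Because the result is an `nltk.Tree`, the entities can be pulled out programmatically instead of read off the printout. The sketch below rebuilds a shortened copy of the tree by hand, so it runs without downloading any NLTK models; the function name `extract_entities` is hypothetical.

```python
from nltk.tree import Tree

# A hand-built, shortened copy of the tree printed above (no model needed).
tree = Tree("S", [
    ("In", "IN"),
    Tree("GPE", [("Chennai", "NNP")]),
    (",", ","),
    Tree("PERSON", [("Sri", "NNP"), ("Parthasarathy", "NNP"), ("Temple", "NNP")]),
    ("by", "IN"),
    Tree("ORGANIZATION", [("TripAdvisor", "NNP")]),
    (".", "."),
])


def extract_entities(chunk_tree):
    """Return (label, text) for every entity subtree directly under the root."""
    return [
        (node.label(), " ".join(token for token, _tag in node.leaves()))
        for node in chunk_tree
        if isinstance(node, Tree)  # plain (token, tag) tuples are not entities
    ]


entities = extract_entities(tree)
print(entities)
```

The same function should work on the real output of `nltk.ne_chunk(tagged_sent)`, since that is also a `Tree` whose leaves are `(token, tag)` tuples.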

## More named entity recognition using NLTK

In [5]:
with open("/Users/brindhamanivannan/Desktop/KaggleX/DataCamp/NLP/new_article.txt", "r") as file:
    new_article = file.read()

In [6]:
print(new_article)

The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.


Uber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire

In [7]:
type(new_article)

str

In [8]:
# Tokenize the article into sentences: sentences
sentences_new = nltk.sent_tokenize(new_article)
sentences_new

['The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character.',
 'If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic.',
 'Uber wanted to know as much as possible about the people who use its service, and those who don’t.',
 'It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies.',
 'Even if their email was notionally anonymised, this use of it was not something the users had bargained for.',
 'Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.',
 'Uber has also tweaked its software so that regulatory agencies that the company regarded as hostile would,

In [9]:
len(sentences_new)

20

In [10]:
# Tokenize each sentence into words: token_sentences
token_sentences_1 = [nltk.word_tokenize(sent) for sent in sentences_new]
token_sentences_1[:2]


[['The',
  'taxi-hailing',
  'company',
  'Uber',
  'brings',
  'into',
  'very',
  'sharp',
  'focus',
  'the',
  'question',
  'of',
  'whether',
  'corporations',
  'can',
  'be',
  'said',
  'to',
  'have',
  'a',
  'moral',
  'character',
  '.'],
 ['If',
  'any',
  'human',
  'being',
  'were',
  'to',
  'behave',
  'with',
  'the',
  'single-minded',
  'and',
  'ruthless',
  'greed',
  'of',
  'the',
  'company',
  ',',
  'we',
  'would',
  'consider',
  'them',
  'sociopathic',
  '.']]

In [11]:
# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences_1] 
pos_sentences[:10]

[[('The', 'DT'),
  ('taxi-hailing', 'JJ'),
  ('company', 'NN'),
  ('Uber', 'NNP'),
  ('brings', 'VBZ'),
  ('into', 'IN'),
  ('very', 'RB'),
  ('sharp', 'JJ'),
  ('focus', 'VB'),
  ('the', 'DT'),
  ('question', 'NN'),
  ('of', 'IN'),
  ('whether', 'IN'),
  ('corporations', 'NNS'),
  ('can', 'MD'),
  ('be', 'VB'),
  ('said', 'VBD'),
  ('to', 'TO'),
  ('have', 'VB'),
  ('a', 'DT'),
  ('moral', 'JJ'),
  ('character', 'NN'),
  ('.', '.')],
 [('If', 'IN'),
  ('any', 'DT'),
  ('human', 'JJ'),
  ('being', 'VBG'),
  ('were', 'VBD'),
  ('to', 'TO'),
  ('behave', 'VB'),
  ('with', 'IN'),
  ('the', 'DT'),
  ('single-minded', 'JJ'),
  ('and', 'CC'),
  ('ruthless', 'JJ'),
  ('greed', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('company', 'NN'),
  (',', ','),
  ('we', 'PRP'),
  ('would', 'MD'),
  ('consider', 'VB'),
  ('them', 'PRP'),
  ('sociopathic', 'JJ'),
  ('.', '.')],
 [('Uber', 'NNP'),
  ('wanted', 'VBD'),
  ('to', 'TO'),
  ('know', 'VB'),
  ('as', 'RB'),
  ('much', 'JJ'),
  ('as', 'IN'),
  ('p

In [12]:
# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)
chunked_sentences

<generator object ParserI.parse_sents.<locals>.<genexpr> at 0x7ff26a04e200>

In [13]:
# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)
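
Two practical points follow from the cells above: `ne_chunk_sents` returns a generator, so it can be walked only once, and the printed chunks are easier to use once tallied. The sketch below counts `NE` chunks with a `Counter`. The two sample sentences are hand-built stand-ins for the chunker output (not real output), and `count_entities` is a hypothetical helper.

```python
from collections import Counter

from nltk.tree import Tree


def count_entities(chunked_sentences):
    """Count the surface text of every 'NE' chunk across all sentences."""
    counts = Counter()
    for sent in chunked_sentences:
        for chunk in sent:
            if isinstance(chunk, Tree) and chunk.label() == "NE":
                counts[" ".join(token for token, _tag in chunk.leaves())] += 1
    return counts


# Stand-in for the ne_chunk_sents output: a generator of hand-built trees.
sample = (tree for tree in [
    Tree("S", [Tree("NE", [("Uber", "NNP")]), ("wanted", "VBD")]),
    Tree("S", [Tree("NE", [("Tim", "NNP"), ("Cook", "NNP")]), ("met", "VBD"),
               Tree("NE", [("Uber", "NNP")])]),
])

counts = count_entities(sample)
print(counts.most_common())

# The generator has been consumed, so a second pass finds nothing.
leftover = list(sample)
print(leftover)
```

If the chunks are needed more than once, wrapping the call in `list(...)` before iterating is one way to keep them around.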


## NER with NLTK - Full code

Use nltk to find the named entities in this article.


In [None]:
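import nltk

# `article` is assumed to hold the article text as a string
# (for example, new_article from the cells above)

# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)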

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

## SpaCy

SpaCy is a free, open-source library for natural language processing (NLP) in Python. It is designed to help developers build applications that can process and understand large volumes of text efficiently.

SpaCy offers a range of NLP tasks, including tokenization (splitting text into individual words and punctuation), part-of-speech tagging (identifying the part of speech of each word), dependency parsing (analyzing the grammatical structure of a sentence), and entity recognition (identifying and classifying named entities). It also includes tools for creating and training custom models for specific NLP tasks.

SpaCy is fast and efficient, and has a simple and intuitive API. It is widely used in industry and academia, and is a popular choice for building NLP-based applications.

spaCy is an NLP library similar to Gensim, but with a different implementation and a particular focus on building NLP pipelines to generate models and corpora.

spaCy has several extra libraries and tools, including displaCy, a visualization tool for viewing parse trees which uses Node.js to create interactive text. Recent spaCy versions also expose it from Python as `spacy.displacy`.

In [19]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
U.K. GPE
$1 billion MONEY
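
To see these three entities highlighted in the text, displaCy can render them as HTML. The sketch below uses displaCy's manual mode, where the entity spans are supplied by hand from the output above rather than predicted, so no trained model is loaded. The helper `char_span` is hypothetical and assumes each phrase occurs once in the text.

```python
from spacy import displacy

text = "Apple is looking at buying U.K. startup for $1 billion"


def char_span(label, phrase):
    """Build a displaCy entity dict from the first occurrence of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Entities copied from the output above, in order of appearance.
parsed = {
    "text": text,
    "ents": [
        char_span("ORG", "Apple"),
        char_span("GPE", "U.K."),
        char_span("MONEY", "$1 billion"),
    ],
    "title": None,
}

# jupyter=False returns the markup as a string instead of displaying it.
html = displacy.render(parsed, style="ent", manual=True, jupyter=False)
print(html[:200])
```

With a loaded model, passing the `doc` itself (`displacy.render(doc, style="ent")`) should produce the same kind of markup without the manual dictionary.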


In [21]:
doc = nlp("""The White House is in the United States and it is the official residence and workplace of the president Joe Biden.""")
doc.ents

(The White House, the United States, Joe Biden)

In [24]:
print(doc.ents[0], doc.ents[0].label_)

The White House ORG


In [25]:
print(doc.ents[1], doc.ents[1].label_)

the United States GPE


In [26]:
print(doc.ents[2], doc.ents[2].label_)

Joe Biden PERSON


## Why use SpaCy for NER?

There are several reasons why SpaCy is a good choice for named entity recognition (NER):

Speed: SpaCy is designed to be fast and efficient, and is able to process large volumes of text quickly. This makes it well-suited for tasks like NER, where you may need to extract information from a large dataset.

Accuracy: SpaCy is known for its high accuracy in NER tasks, and includes pre-trained models that have been trained on a large dataset of annotated text. These models can often achieve high performance on NER tasks with minimal fine-tuning.

Simplicity: SpaCy has a simple and intuitive API, making it easy to use and learn. This can be especially useful if you are new to NLP or are working on a time-sensitive project.

Customization: SpaCy allows you to customize and train your own models for specific NER tasks, giving you the flexibility to extract the specific types of entities that you are interested in.

Overall, SpaCy is a powerful and efficient tool for NER tasks, and is widely used in industry and academia for this purpose.

Beyond integrating with other spaCy features such as easy pipeline creation, spaCy has a different set of entity types and often labels entities differently than NLTK. It also comes with informal-language corpora, which makes it easier to find entities in documents like tweets and chat messages. It is a quickly growing library, so it may support even more languages by the time you read this.
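
As a small illustration of the customization point, the sketch below adds one hand-written rule with spaCy's `entity_ruler` component. This is rule-based matching, not model training, and it assumes the spaCy v3 API (`nlp.add_pipe` with a string name). The sentence is shortened from the Uber article, and labelling unroll.me as `ORG` is a choice made here for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components to download.
nlp = spacy.blank("en")

# Add a rule-based entity component and teach it one pattern.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "unroll.me"}])

doc = nlp("It has an arrangement with unroll.me to buy contact data.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

The same component can be added to a trained pipeline such as `en_core_web_sm` to combine hand-written rules with the statistical entity recognizer.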

## Comparing NLTK with spaCy NER

In [28]:
# Import spacy
import spacy

# Instantiate the English model: nlp
# To minimize execution time, pass disable=['tagger', 'parser', 'matcher']
# when loading the spaCy model, because only the entities matter in this exercise
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'matcher'])

In [29]:
nlp

<spacy.lang.en.English at 0x7ff24ee97ee0>

In [31]:
# Create a new document: doc
doc = nlp(new_article)
doc





The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.


Uber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire

In [32]:
# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)

ORG Uber
ORG Apple’s
ORG Uber
PERSON Travis Kalanick
ORG Uber
PERSON Tim Cook
ORG Apple
CARDINAL Millions
ORG Uber
LOC Silicon Valley’s
ORG Yahoo
PERSON Marissa Mayer
MONEY $186m
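
To make the comparison concrete, the two result lists can be treated as sets. The sketch below copies the entity strings from the NLTK and spaCy outputs in this document and strips a trailing possessive before comparing; the `normalize` helper is an assumption added for this comparison, not part of either library.

```python
# Entity strings copied from the NLTK output (binary NE chunks).
nltk_entities = {
    "Uber", "Beyond", "Apple", "Travis Kalanick", "Tim Cook",
    "Silicon Valley", "CEO", "Yahoo", "Marissa Mayer",
}

# Entity strings copied from the spaCy output.
spacy_entities = {
    "Uber", "Apple’s", "Travis Kalanick", "Tim Cook", "Apple", "Millions",
    "Silicon Valley’s", "Yahoo", "Marissa Mayer", "$186m",
}


def normalize(name):
    """Strip a trailing possessive so 'Apple’s' compares equal to 'Apple'."""
    for suffix in ("’s", "'s"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


spacy_normalized = {normalize(name) for name in spacy_entities}

both = nltk_entities & spacy_normalized
only_nltk = nltk_entities - spacy_normalized
only_spacy = spacy_normalized - nltk_entities

print("both:      ", sorted(both))
print("NLTK only: ", sorted(only_nltk))
print("spaCy only:", sorted(only_spacy))
```

On these outputs the libraries agree on the companies and people; NLTK alone picks up "Beyond" and "CEO", while spaCy alone reports the numeric entities.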


## spaCy NER Categories

Extra categories that spaCy uses in its named entity recognition, compared to NLTK:

- NORP
- CARDINAL
- MONEY
- WORK_OF_ART
- LANGUAGE
- EVENT
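
The meaning of each label can be looked up with `spacy.explain`, which returns a short description from spaCy's built-in glossary. A minimal sketch, assuming spaCy is installed (no trained model is required for this call):

```python
import spacy

labels = ["NORP", "CARDINAL", "MONEY", "WORK_OF_ART", "LANGUAGE", "EVENT"]

# spacy.explain returns a description string, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:12} {description}")
```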

## Multilingual NER with polyglot library

- Polyglot is yet another natural language processing library which uses word vectors to perform simple tasks such as entity recognition
- Polyglot has word embeddings for more than 130 languages!


### Use the polyglot library to identify French entities

In [34]:
with open("/Users/brindhamanivannan/Desktop/KaggleX/DataCamp/NLP/french_article.txt", "r") as file:
    french_article = file.read()

In [35]:
print(french_article)

édition abonné


Dans une tribune au « Monde », l’universitaire Charles Cuvelliez estime que le fantasme d’un remplacement de l’homme par l’algorithme et le robot repose sur un malentendu.


Le Monde | 10.05.2017 à 06h44 • Mis à jour le 10.05.2017 à 09h47 | Par Charles Cuvelliez (Professeur à l’Ecole polytechnique de l'université libre de Bruxelles)


TRIBUNE. L’usage morbide, par certains, de Facebook Live a amené son fondateur à annoncer précipitamment le recrutement de 3 000 modérateurs supplémentaires. Il est vrai que l’intelligence artificielle (IA) est bien en peine de reconnaître des contenus violents, surtout diffusés en direct.


Le quotidien affreux de ces modérateurs, contraints de visionner des horreurs à longueur de journée, mériterait pourtant qu’on les remplace vite par des machines !


L’IA ne peut pas tout, mais là où elle peut beaucoup, on la maudit, accusée de détruire nos emplois, de remplacer la convivialité humaine. Ce débat repose sur un malentendu.


Il vient d’

In [None]:
# French NER with polyglot
# Import polyglot's Text class
from polyglot.text import Text

# Create a new text object using Polyglot's Text class: txt
txt = Text(french_article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)

# Print the type of ent (the last entity left over from the loop)
print(type(ent))

# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print entities
print(entities)

### Spanish NER with polyglot

In [53]:
with open("/Users/brindhamanivannan/Desktop/KaggleX/DataCamp/NLP/spanish_article.txt", "r") as file:
    spanish_article = file.read()

In [54]:
print(spanish_article)

Lina del Castillo es profesora en el Instituto de Estudios Latinoamericanos Teresa Lozano Long (LLILAS) y el Departamento de Historia de la Universidad de Texas en Austin. Ella será la moderadora del panel “Los Mundos Políticos de Gabriel García Márquez” este viernes, Oct. 30, en el simposio Gabriel García Márquez: Vida y Legado.


LIna del Castillo


Actualmente, sus investigaciones abarcan la intersección de cartografía, disputas a las demandas de tierra y recursos, y la formación del n...el tren de medianoche que lleva a miles y miles de cadáveres uno encima del otro como tantos racimos del banano que acabarán tirados al mar. Ningún recuento periodístico podría provocar nuestra imaginación y nuestra memoria como este relato de García Márquez.


Contenido Relacionado


Lea más artículos sobre el archivo de Gabriel García Márquez


Reciba mensualmente las últimas noticias e información del Harry Ransom Center con eNews, nuestro correo electrónico mensual. ¡Suscríbase hoy!


In [None]:
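# Create the Text object for the Spanish article: txt
# (the earlier txt was built from the French article)
from polyglot.text import Text
txt = Text(spanish_article)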

# Determine how many of the entities contain the words "Márquez" or "Gabo" - these refer to the same person in different ways!

# Initialize the count variable: count
count = 0

# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Márquez' or 'Gabo'
    if "Márquez" in ent or "Gabo" in ent:
        # Increment count
        count = count + 1

# Print count
print(count)

# Calculate the share of entities that refer to "Gabo": percentage
# (a fraction between 0 and 1; multiply by 100 for a percentage)
percentage = count / len(txt.entities)
print(percentage)
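
The counting logic can be tried without polyglot installed. In the sketch below each entity is a plain list of words, on the assumption that polyglot's entity chunks behave like word lists (the `' '.join(ent)` call earlier relies on the same behaviour). The entity lists are made-up stand-ins built from names in the article, not real polyglot output.

```python
# Hypothetical stand-ins for txt.entities: each entity is a list of words.
entities = [
    ["Lina", "del", "Castillo"],
    ["Gabriel", "García", "Márquez"],
    ["Texas"],
    ["García", "Márquez"],
]

# `in` tests whole-word membership in each entity, not substring matching.
count = sum(1 for ent in entities if "Márquez" in ent or "Gabo" in ent)

# Guard against an empty entity list before dividing.
share = count / len(entities) if entities else 0.0

print(count)
print(f"{share:.0%}")
```

Because membership is word-level, a partial string such as "Márq" would not match any of these entities.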