## Named Entity Recognition

Named Entity Recognition, or NER for short, is a natural language processing task that identifies important named entities in a text, such as people, places, and organizations. Entities can also be dates, states, works of art, and other categories, depending on the library and annotation scheme you use.

NER can be used alongside topic identification, or on its own, to determine the important items in a text or to answer basic natural language understanding questions such as who, what, when, and where.

More formally, NER is the process of identifying and classifying entities such as persons, locations, and organizations in full text, often in order to make documents easier to search.

## What kind of problems can NER solve?

Because NER extracts structured information from unstructured text, it is used in a wide range of applications. Some examples of problems that NER can help solve include:

- Information extraction: NER can be used to extract specific pieces of information from a text, such as names, dates, and locations, and organize them in a structured format. This can be useful for tasks such as building databases or creating summaries of text documents.

- Question answering: NER can be used to identify named entities in a question and use them to search for relevant information in a database or text corpus.

- Text summarization: NER can be used to extract key named entities from a text and use them to generate a concise summary of the main points of the text.

- Entity disambiguation: NER finds the mentions that need to be disambiguated, that is, names that may have multiple meanings or refer to different entities in different contexts. A later step then resolves each mention to the intended entity.

- Text classification: NER can be used as a feature in text classification tasks, such as identifying the topic or genre of a text.

Overall, NER can be used to extract structured information from unstructured text data and solve a variety of problems in natural language processing and other related fields.
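
To make the "structured format" idea from the information extraction item concrete, here is a minimal sketch that groups entities by label. It assumes the entities have already been extracted as (label, text) pairs; the `group_entities` helper and the sample pairs are hypothetical and written by hand for illustration.

```python
import json
from collections import defaultdict

def group_entities(pairs):
    """Group (label, text) pairs into {label: [unique texts in first-seen order]}."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# Hypothetical, hand-written NER output used only for illustration
pairs = [
    ("PERSON", "Barack Obama"),
    ("GPE", "Hawaii"),
    ("GPE", "United States"),
    ("GPE", "Hawaii"),
]

record = group_entities(pairs)
print(json.dumps(record, indent=2))
```

A dictionary like this can be stored as one row in a database or serialized to JSON, which is usually the point of extracting entities in the first place.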

## Simple example of Named Entity Recognition (NER)

In [1]:
import nltk
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/brindhamanivannan/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [2]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     /Users/brindhamanivannan/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [3]:
# Input text
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
text

'Barack Obama was born in Hawaii. He was the 44th President of the United States.'

In [4]:
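# Tokenize the text (word_tokenize needs the 'punkt' tokenizer data;
# run nltk.download('punkt') first if it is missing)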
tokens = nltk.word_tokenize(text)
tokens


['Barack',
 'Obama',
 'was',
 'born',
 'in',
 'Hawaii',
 '.',
 'He',
 'was',
 'the',
 '44th',
 'President',
 'of',
 'the',
 'United',
 'States',
 '.']

In [5]:
print(len(tokens))
print(type(tokens))

17
<class 'list'>


In [6]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/brindhamanivannan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [7]:
tagged_tokens = nltk.pos_tag(tokens)
tagged_tokens

[('Barack', 'NNP'),
 ('Obama', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('Hawaii', 'NNP'),
 ('.', '.'),
 ('He', 'PRP'),
 ('was', 'VBD'),
 ('the', 'DT'),
 ('44th', 'JJ'),
 ('President', 'NNP'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 ('.', '.')]

In [8]:
print(len(tagged_tokens))
print(type(tagged_tokens))

17
<class 'list'>


In [9]:
# Perform named entity recognition

entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]
for entity in entities:
    print(entity)

(PERSON Barack/NNP)
(PERSON Obama/NNP)
(GPE Hawaii/NNP)
(GPE United/NNP States/NNPS)


## Simple NER code for information extraction - full code

In [10]:
# Input text
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Tag the tokens
tagged_tokens = nltk.pos_tag(tokens)

# Perform named entity recognition 
entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]
for entity in entities:
    print(entity)

(PERSON Barack/NNP)
(PERSON Obama/NNP)
(GPE Hawaii/NNP)
(GPE United/NNP States/NNPS)


## Code Explanation

This code uses the Natural Language Toolkit (nltk) library to perform Named Entity Recognition (NER) on a given piece of text. Here is a brief explanation of each step:

- The first statement defines a string called text that contains the input text for the NER process.
- The second statement tokenizes the text string into a list of individual words and punctuation marks, called tokens.
- The third statement tags each token with a part-of-speech (POS) tag, which indicates the word's grammatical role in the sentence. The result is a list of tuples called tagged_tokens, where each tuple consists of a word and its POS tag.
- The fourth statement uses nltk.ne_chunk() to identify named entities in the tagged_tokens list. The result is a tree: each child of the root is either a (word, tag) tuple for an ordinary token or a subtree (an nltk.Tree) that groups the tokens of one named entity under a label such as PERSON or GPE. The chunk variable in the list comprehension iterates over these children, and isinstance(chunk, nltk.Tree) checks whether a child is a named entity (a subtree) or a single token. Named entities are added to the entities list.
- The final for loop iterates over each named entity in the entities list and prints it to the console.
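
The tree structure described above can be inspected without running the chunker. The sketch below builds a small nltk.Tree by hand as a stand-in for the result of nltk.ne_chunk() (the labels are chosen for illustration, not produced by a model) and applies the same isinstance() test to each child.

```python
from nltk import Tree

# Hand-built stand-in for the result of nltk.ne_chunk(); no model download needed
sentence = Tree("S", [
    Tree("PERSON", [("Barack", "NNP")]),
    ("was", "VBD"),
    ("born", "VBN"),
    ("in", "IN"),
    Tree("GPE", [("Hawaii", "NNP")]),
    (".", "."),
])

for child in sentence:
    if isinstance(child, Tree):
        # A subtree is a named entity: it has a label and (word, tag) leaves
        words = [word for word, tag in child.leaves()]
        print("entity:", child.label(), words)
    else:
        # An ordinary token is a plain (word, tag) tuple
        word, tag = child
        print("token: ", word, tag)

entity_labels = [child.label() for child in sentence if isinstance(child, Tree)]
```

Only the two subtrees pass the isinstance() check, which is why the list comprehension in the original code keeps the entities and drops everything else.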



## Another example

Here is another simple example of Named Entity Recognition (NER) code that extracts information from a given piece of text using the Natural Language Toolkit (nltk) library:

In [11]:
import nltk

def extract_entities(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Tag each token with its part-of-speech (POS)
    tagged_tokens = nltk.pos_tag(tokens)
    
    # Use nltk's named entity chunker to extract named entities
    entities = nltk.ne_chunk(tagged_tokens)
    
    # Iterate over the entities and extract the information we want
    for entity in entities:
        # Check if the entity is a named entity
        if isinstance(entity, nltk.Tree):
            # Get the label for the entity (e.g. "PERSON")
            label = entity.label()
            # Get the string representation of the entity
            entity_string = " ".join([word for word, tag in entity])
            # Print the entity information
            print(f"{label}: {entity_string}")

# Test the extract_entities() function
text = "Narendra Modi was born in Vadnagar, India. He is the 14th and current Prime Minister of India since 2014."
extract_entities(text)


PERSON: Narendra
PERSON: Modi
GPE: Vadnagar
GPE: India
GPE: India


This code defines a function called extract_entities() that takes a string of text as input and uses the nltk library to identify named entities in the text. It then extracts and prints the label (e.g. "PERSON") and string representation of each named entity.
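
In the output above, "Narendra" and "Modi" appear as two separate PERSON chunks. One possible workaround, sketched here under the assumption that neighbouring chunks with the same label belong to one name, is to merge them. The `merge_adjacent_entities` helper is hypothetical, and the tree is built by hand to mirror the output shown rather than produced by the chunker.

```python
from nltk import Tree

def merge_adjacent_entities(chunked):
    """Return (label, text) pairs, joining neighbouring chunks that share a label."""
    merged = []
    previous_was_entity = False
    for child in chunked:
        if isinstance(child, Tree):
            text = " ".join(word for word, tag in child.leaves())
            if previous_was_entity and merged[-1][0] == child.label():
                # Same label directly after another entity: extend the last one
                merged[-1] = (child.label(), merged[-1][1] + " " + text)
            else:
                merged.append((child.label(), text))
            previous_was_entity = True
        else:
            # An ordinary token breaks the run of adjacent entities
            previous_was_entity = False
    return merged

# Hand-built tree that mirrors the chunks printed above
chunked = Tree("S", [
    Tree("PERSON", [("Narendra", "NNP")]),
    Tree("PERSON", [("Modi", "NNP")]),
    ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
    Tree("GPE", [("Vadnagar", "NNP")]),
    (",", ","),
    Tree("GPE", [("India", "NNP")]),
    (".", "."),
])

merged = merge_adjacent_entities(chunked)
print(merged)
```

"Vadnagar" and "India" stay separate because a comma token sits between them. The heuristic can be wrong when two different entities of the same type really do appear side by side, so treat it as a convenience rather than a fix for the chunker.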

In [12]:
text_1 = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
extract_entities(text_1)

PERSON: Barack
PERSON: Obama
GPE: Hawaii
GPE: United States


In [13]:
text_2 = "Prime Minister Narendra Modi’s government is pushing to overhaul the country’s heavily-regulated education sector to enable Indian students to obtain foreign qualifications at an affordable cost and make India an attractive global study destination. The move will also help overseas institutions to tap the nation’s young population."
extract_entities(text_2)

PERSON: Narendra Modi
GPE: Indian
GPE: India


In [14]:
text_3 = "Even as India’s universities and colleges have produced chief executive officers at companies from Microsoft Corp. to Alphabet Inc., many fare poorly in global rankings. The country needs to boost its education sector to become more competitive and close the growing gap between college curricula and market demand. It’s currently ranked 101 among 133 nations in the Global Talent Competitiveness Index of 2022 that measures a nation’s ability to grow, attract and retain talent.Some universities have already set up partnerships with Indian institutions, allowing students to partially study in India and complete their degrees on the main campus abroad. The current move will encourage these overseas institutions to set up campuses without local partners. The University Grants Commission’s final draft will be presented to the parliament for its approval before becoming law."

extract_entities(text_3)

GPE: India
ORGANIZATION: Microsoft
ORGANIZATION: Alphabet Inc.
ORGANIZATION: Global
ORGANIZATION: Competitiveness Index
GPE: Indian
GPE: India
ORGANIZATION: University Grants Commission


## Simple NER code for question answering - full code

In [15]:
import nltk

def answer_question(question, text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Tag each token with its part-of-speech (POS)
    tagged_tokens = nltk.pos_tag(tokens)
    
    # Use nltk's named entity chunker to extract named entities
    entities = nltk.ne_chunk(tagged_tokens)
    
    # Initialize variables to store the person and location
    person = None
    location = None
    
    # Iterate over the entities and extract the information we want
    for entity in entities:
        # Check if the entity is a named entity
        if isinstance(entity, nltk.Tree):
            # Get the label for the entity (e.g. "PERSON")
            label = entity.label()
            # Get the string representation of the entity
            entity_string = " ".join([word for word, tag in entity])
            
            # Store the person and location
            if label == "PERSON" and person is None:
                person = entity_string
            elif label == "GPE" and location is None:
                location = entity_string
    
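    # Extract the answer to the question, guarding against missing entities
    if "born" in question.lower() and person and location:
        print(f"{person} was born in {location}")
    elif "president" in question.lower() and person:
        print(f"{person} was the president")
    else:
        print("No answer found")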

# Test the answer_question() function
new_text = "Narendra Modi was born in Vadnagar, India. He is the 14th and current Prime Minister of India since 2014."
new_question = "Where was Narendra Modi born?"

answer_question(new_question, new_text)

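Narendra was born in Vadnagar

The answer contains only "Narendra" because the chunker returned "Narendra" and "Modi" as two separate PERSON chunks, and answer_question() keeps only the first PERSON it finds.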


In [16]:
# Test the answer_question() function
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
question = "Where was Barack Obama born?"
answer_question(question, text)

Barack was born in Hawaii


In [17]:
# Test the answer_question() function
new_text_1 = "Manivannan was born in Madurai."
new_question_1 = "Where was Manivannan born?"
answer_question(new_question_1, new_text_1)

Manivannan was born in Madurai
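
The function above matches on keywords such as "born". A slightly more general idea, sketched below, is to let the question word decide which entity type to return. The `QUESTION_TO_LABEL` mapping and the `pick_answer` helper are assumptions made for illustration, and the entity list is written by hand instead of coming from the chunker.

```python
# Assumed mapping from question word to the entity label that usually answers it
QUESTION_TO_LABEL = {"who": "PERSON", "where": "GPE"}

def pick_answer(question, entities):
    """Return the first entity whose label matches the question word, or None."""
    words = question.lower().split()
    first_word = words[0] if words else ""
    wanted = QUESTION_TO_LABEL.get(first_word)
    if wanted is None:
        return None
    for label, text in entities:
        if label == wanted:
            return text
    return None

# Hand-written (label, text) pairs standing in for chunker output
entities = [("PERSON", "Manivannan"), ("GPE", "Madurai")]

print(pick_answer("Where was Manivannan born?", entities))  # Madurai
print(pick_answer("Who was born in Madurai?", entities))    # Manivannan
print(pick_answer("When was Manivannan born?", entities))   # None
```

Returning None for unsupported questions keeps the caller in charge of what to print when no answer is available.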


## Simple NER code for Text summarization - full code

In [18]:
import nltk

def summarize(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Tag each token with its part-of-speech (POS)
    tagged_tokens = nltk.pos_tag(tokens)
    
    # Use nltk's named entity chunker to extract named entities
    entities = nltk.ne_chunk(tagged_tokens)
    
    # Initialize a list to store the named entities
    named_entities = []
    
    # Iterate over the entities and extract the named entities
    for entity in entities:
        # Check if the entity is a named entity
        if isinstance(entity, nltk.Tree):
            # Get the label for the entity (e.g. "PERSON")
            label = entity.label()
            # Get the string representation of the entity
            entity_string = " ".join([word for word, tag in entity])
            
            # Add the entity to the list of named entities
            named_entities.append(entity_string)
    
    # Join the named entities into a single string
    summary = " ".join(named_entities)
    
    return summary

# Test the summarize() function
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
summary = summarize(text)
print(summary)


Barack Obama Hawaii United States


This code defines a function called summarize() that takes a string of text as input and returns a summary of the text as a string. The summary is generated by extracting the named entities from the text using the nltk library and then joining them into a single string.

In [20]:
# Test the summarize() function
text_mod = "Narendra Modi was born in Vadnagar, India. He is the 14th and current Prime Minister of India since 2014."
summary = summarize(text_mod)
print(summary)

Narendra Modi Vadnagar India India
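
The summary above repeats "India" because every mention is appended to the list. A minimal sketch of removing repeats while keeping the original order follows; the `unique_in_order` helper is hypothetical and the entity list is copied by hand from the output shown.

```python
def unique_in_order(items):
    """Drop repeated items while preserving first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(items))

# Entity strings as collected by summarize() for the text above
named_entities = ["Narendra", "Modi", "Vadnagar", "India", "India"]

summary = ", ".join(unique_in_order(named_entities))
print(summary)  # Narendra, Modi, Vadnagar, India
```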


## Simple NER code for Entity disambiguation

### What is Entity disambiguation?

Entity disambiguation is the process of identifying the correct meaning or interpretation of a named entity in a text. A named entity is a real-world object with a proper name, such as a person, organization, or location.

For example, consider the following sentence: "Barack Obama was the President of the United States from 2009 to 2017." In this sentence, "Barack Obama" is a named entity that refers to a specific person, while "United States" is a named entity that refers to a specific location. However, there may be multiple people or locations with the same name, so it's important to disambiguate the entities in order to correctly understand the meaning of the text.

Entity disambiguation is an important task in natural language processing and information retrieval, as it helps to correctly interpret the meaning of a text and identify the relevant information. There are various approaches to entity disambiguation, including using external knowledge sources such as Wikipedia or using machine learning techniques to predict the correct meaning of an entity based on its context in the text.

In [23]:
import spacy
import wikipedia

# Load the model
nlp = spacy.load("en_core_web_md")

# Define a function to disambiguate named entities
def disambiguate_entities(text):
    # Process the text with the model
    doc = nlp(text)
    
    # Iterate over the named entities
    for ent in doc.ents:
        # Print the entity text and label
        print(ent.text, ent.label_)
        
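        # If the entity is a person, look up the Wikipedia page
        if ent.label_ == "PERSON":
            try:
                wikipedia_page = wikipedia.page(ent.text)
                print(wikipedia_page.url)
            except wikipedia.exceptions.DisambiguationError as err:
                # Several pages share this name; show a few candidates
                print("Ambiguous:", err.options[:5])
            except wikipedia.exceptions.PageError:
                print("No Wikipedia page found for", ent.text)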
            
# Disambiguate the entities in a sample text
text = "Barack Obama was the President of the United States from 2009 to 2017."
disambiguate_entities(text)


Barack Obama PERSON
https://en.wikipedia.org/wiki/Barack_Obama
the United States GPE
2009 to 2017 DATE


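This simple code processes the input text with the spaCy model, which identifies and labels the named entities in the text. It then iterates over the named entities and, for each one labeled "PERSON", uses the wikipedia library to look up the matching Wikipedia page and print its URL. Note that this resolves each name by title lookup only. It does not use the surrounding context, so a name shared by several pages is not truly disambiguated (the wikipedia library raises a DisambiguationError in that case).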
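
The context-based approach mentioned earlier can be sketched without any external service. The toy example below scores each candidate by how many content words its description shares with the sentence. The candidate descriptions, the stop-word list, and the `disambiguate` helper are all hand-written assumptions for illustration; a real system would draw candidates from a knowledge base and use a learned model.

```python
import re

# Small, hand-picked stop-word list (an assumption, not a standard resource)
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "to", "and"}

def content_words(text):
    """Lowercase the text and return its set of alphabetic words minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS

def disambiguate(context, candidates):
    """Pick the candidate whose description overlaps most with the context."""
    context_set = content_words(context)
    scores = {
        name: len(context_set & content_words(description))
        for name, description in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy candidate descriptions written by hand for this example
candidates = {
    "Paris, France": "capital city of France on the river Seine",
    "Paris, Texas": "city in Lamar County, Texas, United States",
}
context = "She flew to Paris to visit the Eiffel Tower near the Seine in France."

best, scores = disambiguate(context, candidates)
print(best, scores)
```

Word overlap is a crude signal and ties go to whichever candidate is listed first, but it shows why context matters: the same mention "Paris" resolves differently depending on the words around it.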

## Simple NER code for Text classification

Text classification is the process of assigning predefined categories or labels to a piece of text. It is a common task in natural language processing, and it is useful for a wide range of applications such as sentiment analysis, spam filtering, and topic classification.

For example, given a dataset of customer reviews for a product, a text classification model might be trained to predict whether a given review is positive or negative. Similarly, a text classification model might be trained to classify news articles into different categories such as sports, politics, or entertainment.

Text classification typically involves preprocessing the text data and converting it into a numerical form that can be used by a machine learning model. Common techniques for preprocessing text data include tokenization, stemming, and removing stop words. Machine learning algorithms such as support vector machines, decision trees, and Naive Bayes can then be used to train a model on the preprocessed data. The trained model can then be used to predict the class label for new, unseen text data.

## Kaggle Text Classification Datasets

https://www.kaggle.com/datasets?search=text+classification

Here is a simple example of text classification using the sklearn library in Python. Note that it uses bag-of-words counts as features rather than NER output:

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the data (data.csv is a placeholder; the file needs "text" and "label" columns)
df = pd.read_csv("data.csv")

# Split the data into features and labels
X = df["text"]
y = df["label"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

# Train a classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectors, y_train)

# Evaluate the classifier on the test data
accuracy = classifier.score(X_test_vectors, y_test)
print("Accuracy: {:.2f}".format(accuracy))


This code will load a dataset from a CSV file, split it into features (the text data) and labels (the class labels), and then split it into training and testing sets. It will then vectorize the text data using the CountVectorizer class, which converts the text into numerical data that can be used by a machine learning model. Finally, it will train a classifier using a Naive Bayes model, and evaluate the classifier on the test data by printing the accuracy.
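
The classifier above does not use NER at all. To connect it back to the idea of using named entities as features, here is a minimal sketch that turns the entities found in one document into a feature dictionary of label counts. The `entity_features` helper and the `n_` prefix are hypothetical, and the entity list is copied by hand from an earlier output instead of being computed here.

```python
from collections import Counter

def entity_features(entities):
    """Count how often each entity label occurs, e.g. {'n_GPE': 2}."""
    counts = Counter(label for label, text in entities)
    return {f"n_{label}": n for label, n in counts.items()}

# Hand-written (label, text) pairs standing in for chunker output
doc_entities = [
    ("GPE", "India"),
    ("ORGANIZATION", "Microsoft"),
    ("ORGANIZATION", "Alphabet Inc."),
    ("GPE", "India"),
]

features = entity_features(doc_entities)
print(features)
```

Dictionaries like this can be converted to a numeric matrix with sklearn's DictVectorizer and used alongside, or instead of, the word counts from CountVectorizer. Whether they improve accuracy depends on the dataset.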