<a href="https://colab.research.google.com/github/johnowusuduah/named_entity_recognition_application/blob/main/named_entity_recognition_application.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Named Entity Recognition (NER)

**How Do You Identify Named Entities in Textual Data?**


**What is a Named Entity?**

A named entity is a word or phrase that refers to something by a proper name (e.g. a person, organization, or location). Extracting named entities means identifying which words in a given text are named entities and what type each one is; the task is also called entity identification or entity chunking.


**Why is it important?**

Before meaning can be extracted from a given text (e.g. a tweet, document, or Facebook post), it is important to identify who and what the text is about. Named Entity Recognition is therefore an integral step in NLP pipelines that aim to understand text.

**Approach**

This notebook uses the spaCy library because it is widely used for NER and its pretrained pipelines include an entity recognizer built on a neural network model.

In [1]:
import nltk
import pandas as pd
import spacy
from spacy import displacy
# Check which version of spaCy is installed
spacy.__version__

'2.2.4'

**Caveat**

No algorithm can correctly identify all named entities 100% of the time.
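One way to quantify this caveat is to compare a model's predicted entities against a hand-labelled reference and compute precision and recall. The sketch below is illustrative only: the function name `score_entities` and the example spans are hypothetical and are not part of this notebook's pipeline, and it counts an entity as correct only when its start, end, and label all match exactly.

```python
def score_entities(predicted, gold):
    """Return (precision, recall) for exact-match entity spans.

    Each entity is a (start_char, end_char, label) tuple.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # spans where offsets AND label agree
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-labelled spans and hypothetical model output.
gold = [(0, 5, "PERSON"), (10, 17, "ORG"), (25, 29, "DATE")]
predicted = [(0, 5, "ORG"), (10, 17, "ORG"), (25, 29, "DATE")]

precision, recall = score_entities(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the first span has the right boundaries but the wrong label, so it counts as an error for both precision and recall.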

In [3]:
#Download spacy models
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [5]:
#Load spacy model
# small size model
nlp = spacy.load("en_core_web_sm")
# medium size model
#nlp = spacy.load("en_core_web_md")
# large size model
#nlp = spacy.load("en_core_web_lg")

In [6]:
# Input data is an extract from a news article about Twitter
input_data = "The original lawsuit, filed in 2016 by a Twitter shareholder, alleged Dorsey and others including former \
CEO Dick Costolo and board member Evan Williams hid facts about Twitter’s slowing user growth while they sold their personal \
stock holdings “for hundreds of millions of dollars in insider profits.” The complaint alleged the company was tracking daily \
active users (DAUs) as the primary indicator of Twitter’s user engagement by early 2015 but didn’t reveal that to investors at \
the time (when it was reporting monthly active user figures). According to the lawsuit, Twitter’s DAU figures showed that user \
engagement growth was either flat or declining."

In [10]:
document = nlp(input_data)

entities = []
labels = []
position_start = []
position_end = []

for entity in document.ents:
  # Store the entity text (a plain string) rather than the spaCy Span object
  entities.append(entity.text)
  labels.append(entity.label_)
  # Character offsets of the entity within input_data
  position_start.append(entity.start_char)
  position_end.append(entity.end_char)


df_NER = pd.DataFrame({"Entities":entities, "Labels":labels, "Position_Start":position_start, "Position_End":position_end})

df_NER


| | Entities | Labels | Position_Start | Position_End |
|---|---|---|---|---|
| 0 | 2016 | DATE | 31 | 35 |
| 1 | Dorsey | ORG | 70 | 76 |
| 2 | Dick Costolo | PERSON | 109 | 121 |
| 3 | Evan Williams | PERSON | 139 | 152 |
| 4 | Twitter | PERSON | 169 | 176 |
| 5 | hundreds of millions of dollars | MONEY | 250 | 281 |
| 6 | daily | DATE | 350 | 355 |
| 7 | Twitter | PERSON | 404 | 411 |
| 8 | early 2015 | DATE | 433 | 443 |
| 9 | monthly | DATE | 515 | 522 |


The output shows that Dorsey was labeled as an organization instead of a person, and that Twitter was labeled as a person instead of an organization.
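Mislabels like these can be patched after the fact with a small lookup of known entities. The following is a minimal sketch under stated assumptions: `KNOWN_LABELS` and `correct_labels` are hypothetical names introduced here for illustration, and the lookup would have to be curated by hand for a given domain. spaCy also provides a rule-based `EntityRuler` pipeline component that may be a better fit when the corrections should happen inside the pipeline itself.

```python
# Hypothetical, hand-curated overrides for entities the model is known to mislabel.
KNOWN_LABELS = {"Dorsey": "PERSON", "Twitter": "ORG"}


def correct_labels(rows, known_labels):
    """Replace the label of any entity whose text appears in known_labels."""
    return [(text, known_labels.get(text, label)) for text, label in rows]


# (entity text, predicted label) pairs taken from the output above.
rows = [("Dorsey", "ORG"), ("Dick Costolo", "PERSON"), ("Twitter", "PERSON")]

corrected = correct_labels(rows, KNOWN_LABELS)
for text, label in corrected:
    print(text, label)
```

Entities that are not in the lookup, such as Dick Costolo, keep the label the model assigned.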

**Application**

One application is a bot that analyzes financial news and extracts information about the entities (e.g. locations, dates, and numeric information) mentioned in a public document. Algorithmic trading bots could in turn be built on top of this extracted information.
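As a rough illustration of the first step such a bot might take, the sketch below groups extracted entities by label so that dates and monetary amounts can be pulled out of a document. The helper name `group_by_label` is hypothetical, and the rows are a subset of the entities found in the Twitter extract, written as plain strings.

```python
from collections import defaultdict


def group_by_label(rows, wanted):
    """Collect unique entity texts for each label in `wanted`, in order of appearance."""
    grouped = defaultdict(list)
    for text, label in rows:
        if label in wanted and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# (entity text, label) pairs from the extract analyzed in this notebook.
rows = [
    ("2016", "DATE"),
    ("Dick Costolo", "PERSON"),
    ("hundreds of millions of dollars", "MONEY"),
    ("early 2015", "DATE"),
]

facts = group_by_label(rows, wanted={"DATE", "MONEY"})
print(facts)
```

A real system would still need to normalize these strings (for example, turning "early 2015" into a date range) before acting on them.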