## Named Entity Recognition (NER)
**Named entity recognition (NER)** is a subtask of natural language processing (NLP) that identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, and dates. It is an important step in many NLP applications, including information extraction, question answering, and machine translation. The goal is to automatically extract relevant information from unstructured text so that it can feed downstream tasks.

**Some real-life use cases of NER are listed below:**
   1. **Search:** Almost every website has a search box for finding information quickly. NER helps here, for example by identifying the specific companies mentioned in a large body of text.
   2. **Recommendation:** This is related to search: based on the entities in what you have read or watched previously, related documents and new content can be suggested.
   3. **Customer Care:** When we receive documents from customers, we can classify them by the topics they discuss and route them to the relevant support team.

In [2]:
# Let's first load the spaCy library and its pre-trained English model:
import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
# Here we want to explore NER support in spacy.
# To see the components in the model:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [9]:
# The pipeline includes the 'ner' component. Let's see how this component works.
# We take a simple text and recognize the entities in it.
doc = nlp("Tesla Inc is going to acquire Twitter Co for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter Co  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [10]:
# We can use 'displacy' to render the same entities visually.
from spacy import displacy
displacy.render(doc, style="ent")

In [11]:
# To see which entity labels the loaded pre-trained model supports:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [14]:
# Let's have another example:
doc = nlp("Michael Bloomberg founded Bloomberg Inc in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg Inc | ORG | Companies, agencies, institutions, etc.
1982 | DATE | Absolute or relative dates or periods


In [15]:
# For better visualization:
from spacy import displacy
displacy.render(doc, style="ent")

* **Hugging Face** is a popular machine learning library for NLP. It also hosts BERT-based named entity recognition models, for example:
https://huggingface.co/dslim/bert-base-NER

In [20]:
# Sometimes we want to set custom entities. In the following text the pre-trained model already recognizes
# the companies, but suppose it had not: we can then assign the labels manually in spaCy.
doc = nlp("Tesla is going to acquire Twitter Inc. for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter Inc.  |  ORG
$45 billion  |  MONEY


In [22]:
# Slicing a Doc returns a Span object:
type(doc[0:4])

spacy.tokens.span.Span

In [25]:
# So we can import the 'Span' class and manually assign labels to tokens.
from spacy.tokens import Span

s1 = Span(doc, 0, 1, label="Com")              # covers token 0 ('Tesla'); the end index is exclusive.
s2 = Span(doc, 5, 6, label="Com")              # covers token 5 ('Twitter'); the end index is exclusive.
doc.set_ents([s1, s2], default="unmodified")

In [26]:
# Now to print the labels:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  Com
Twitter  |  Com
Inc.  |  ORG
$45 billion  |  MONEY


### How can I build my own NER?

The first approach to building our own NER is **Simple Lookup.** You maintain a database of locations, companies, persons, and so on, and as new entities come up, an operator manually adds them to the database. To tag a text, you simply compare its words against the database. This approach looks naive, but it works and is widely used. Depending on your use case, it may even be the best option.
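As a rough illustration of the lookup idea, here is a minimal sketch in plain Python. The `GAZETTEER` dictionary and the `lookup_entities` function are hypothetical names invented for this example; a real system would typically read the entries from a database.

```python
import re

# Hypothetical lookup table: entity text (lowercase) -> label.
GAZETTEER = {
    "tesla": "ORG",
    "twitter": "ORG",
    "india": "GPE",
    "michael bloomberg": "PERSON",
}

def lookup_entities(text, gazetteer=GAZETTEER):
    """Return (matched_text, label) pairs for every gazetteer entry found in text."""
    found = []
    for name, label in gazetteer.items():
        # Word boundaries keep "india" from matching inside "indian".
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), match.group(0), label))
    found.sort()  # order the hits by their position in the text
    return [(matched, label) for _, matched, label in found]

entities = lookup_entities("Tesla is going to acquire Twitter for $45 billion")
print(entities)
```

The lookup only finds what is already in the table, which is why an operator has to keep the table up to date.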

<img src = "img.jpg" width = "400px" height = "400px"></img>

The second approach is **Rule-based NER.** It relies on a set of predefined rules. We already saw a similar pattern in the spaCy examples, where a word followed by Inc or Co was recognized as a company. spaCy provides a class called **EntityRuler** that can be used to define such rules.
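To make the "followed by Inc or Co" rule concrete, here is a small sketch that uses Python's `re` module rather than spaCy's EntityRuler, so that it runs without any model. The rule itself, the suffix list, and the function name `rule_based_orgs` are assumptions made for this illustration.

```python
import re

# Rule: one or more capitalized words followed by a company suffix.
# The suffix list (Inc, Co, Ltd) is an illustrative assumption.
COMPANY_RULE = re.compile(r"\b(?:[A-Z][A-Za-z]* )+(?:Inc|Co|Ltd)\b")

def rule_based_orgs(text):
    """Return (matched_text, 'ORG') for every span that satisfies the rule."""
    return [(match.group(0), "ORG") for match in COMPANY_RULE.finditer(text)]

orgs = rule_based_orgs("Tesla Inc is going to acquire Twitter Co for $45 billion")
print(orgs)
```

A rule like this is easy to read and debug, but it misses companies written without a suffix, such as a bare "Tesla".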

<img src = "img1.jpg" width = "600px" height = "600px"></img>

The **third** approach is **Machine Learning**, where you can use techniques such as Conditional Random Fields (CRF) or models such as BERT.
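A CRF-style tagger usually works on a feature description of each token and its neighbours. The sketch below shows only that feature-extraction step; the feature names and the `token_features` function are assumptions for illustration, and actually training a model would need a CRF library (for example sklearn-crfsuite), which is not shown here.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the word itself and its neighbours."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalized words are often names
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),   # digits hint at dates and quantities
        "suffix3": word[-3:],
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True            # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True            # end of sentence
    return features

tokens = "Michael Bloomberg founded Bloomberg Inc in 1982".split()
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
print(sentence_features[4])
```

Each feature dict would be paired with a label for its token during training, so the model can learn, for instance, that a title-cased word before "inc" tends to belong to an organization.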

### Named Entity Recognition (NER): Exercises

* **Exercise 1**

    Extract all the geographical names (cities, countries, states) from the given text

In [37]:
text = """Kiran want to know the famous foods in each state of India. So, he opened Google and search for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhrapradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states"""

doc = nlp(text)

* **Expected Output:**

   Geographical location Names: [India, Delhi, Gujarat, Tamilnadu, Andhrapradesh, Assam, Bihar]

   Count: 7

In [38]:
# To find out which label covers geographical locations, we can list the labels supported by the model:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [40]:
# To find the geographical locations, we first define a list to store them, then loop over the
# entities in the text and keep those labeled 'GPE'.
geo_names = []
for entity in doc.ents:
    if entity.label_ == 'GPE':     
        geo_names.append(entity)

In [41]:
# Next we can print the geographical location names:
geo_names   

[India, Delhi, Gujarat, Tamilnadu, Pongal, Andhrapradesh, Assam, Bihar]

In [42]:
# Next, to count them, we simply take the length of the list. The count is 8 rather than the
# expected 7 because the model also tagged the dish 'Pongal' as GPE.
len(geo_names)

8
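One possible way to drop a false positive such as 'Pongal' is to post-filter the model's output against a small lookup list. This is only a sketch: the `known_dishes` set is a hypothetical block list built from the dish names in the exercise text, and the `predicted` list is copied from the output above so that the snippet runs on its own.

```python
# Names as printed above; in the notebook they could be taken from
# [ent.text for ent in geo_names].
predicted = ["India", "Delhi", "Gujarat", "Tamilnadu", "Pongal",
             "Andhrapradesh", "Assam", "Bihar"]

# Hypothetical block list of terms known not to be places.
known_dishes = {"Chaat", "Dal Dhokli", "Pongal", "Biryani",
                "Papaya Khar", "Litti Chowkha"}

# Keep only the names that are not on the block list.
filtered = [name for name in predicted if name not in known_dishes]
print(filtered)
print(len(filtered))
```

This combines the machine learning output with the simple lookup approach described earlier.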

* **Exercise 2**

  Extract all the birth dates of the cricketers in the given text

In [43]:
text = """Sachin Tendulkar was born on 24 April 1973, Virat Kholi was born on 5 November 1988, Dhoni was born on 7 July 1981
and finally Ricky ponting was born on 19 December 1974."""

doc = nlp(text)

* **Expected Output:**

    All Birth Dates: [24 April 1973, 5 November 1988, 7 July 1981, 19 December 1974]
   
    Count: 4

In [44]:
# As with the geographical locations, we compare each entity's label with 'DATE' and store the matches in a list.
player_birth_dates = []
for entity in doc.ents:
    if entity.label_ == 'DATE':     
        player_birth_dates.append(entity)

In [45]:
# Next we simply print the birth dates:
player_birth_dates

[24 April 1973, 5 November 1988, 7 July 1981, 19 December 1974]

In [46]:
# Again to count them, we simply take their lengths:
len(player_birth_dates)

4

* **That is all for this session.**