# Named Entity Recognition (NER)

Named Entity Recognition (NER) is a technique for locating "entities" in a text and assigning each one a category.

## What are Named Entities?
Every text contains various entities. A few general entities found in almost any text are "Name of a Person", "Location", etc. But there can also be specialised entities specific to certain kinds of text. For example, court judgements might have entities like "Court", "Petitioner", etc.

## What are the advantages of building an NER system?

### Increased search relevance:
Search is mostly keyword based today: if a particular keyword is present in a text, that text is selected. But the keyword often does not refer to the entity you want. For example, suppose you want to read all articles about "Tesla", the "COMPANY" entity. A keyword search would also show you articles on "Nikola Tesla". NER-based search can filter on the entity type and thus increase search relevance.
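To make the difference concrete, here is a minimal sketch that contrasts keyword search with entity-aware search. The two documents and their entity annotations are hypothetical and hand-written for illustration; in practice the annotations would come from an NER model.

```python
# Hypothetical mini-corpus: each document carries (entity text, entity label)
# annotations of the kind an NER system would produce.
documents = [
    {"title": "Tesla reports record quarterly deliveries",
     "entities": [("Tesla", "COMPANY")]},
    {"title": "How Nikola Tesla shaped alternating current",
     "entities": [("Nikola Tesla", "PERSON")]},
]

def keyword_search(docs, keyword):
    """Return titles that contain the keyword anywhere in the text."""
    return [d["title"] for d in docs if keyword.lower() in d["title"].lower()]

def entity_search(docs, name, label):
    """Return titles in which `name` was tagged with the requested label."""
    return [d["title"] for d in docs if (name, label) in d["entities"]]

keyword_hits = keyword_search(documents, "Tesla")
entity_hits = entity_search(documents, "Tesla", "COMPANY")
print(keyword_hits)  # both articles match the keyword
print(entity_hits)   # only the article about the company
```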

### Information Extraction and accelerated pace of pattern recognition:
The naive way to extract information from text is to read it. But what if there are thousands of similar texts? In a pre-AI world, expertise in a topic comes from spending a lot of time with a particular kind of text. An expert property lawyer must have read hundreds of texts (court judgements etc.) relevant to property disputes; an expert cardiologist must have read hundreds of articles on cardiology. As a result of the time they spend with these texts, such experts develop highly valuable mental patterns: an expert property lawyer or cardiologist can quickly diagnose your predicament and offer a range of solutions. NER enables this kind of pattern recognition across a large number of texts. If I'm a budding property lawyer, I can now download all the property dispute cases, use a customised NER system developed to extract information from such cases, and develop my own mental patterns quickly.

# Table of Contents
1. [SpaCy](#spacy)

2. [Building your own NERs](#customner) <br>
    2.1 [Rule based NER](#rulener) <br>
    2.2 [Machine learning based NER](#mlner)
    
3. [OpenNyai](#opennyai): A Custom NER built to recognise specialised entities from Indian court judgements.

## SpaCy <a class="anchor" id="spacy"></a>

In [1]:
import spacy

In [2]:
#Load Spacy's pre-trained Small English NLP Model
nlp_spacy = spacy.load("en_core_web_sm")

# en_core_web_md -- Medium English
# en_core_web_lg -- Large English

In [3]:
# All components in the Spacy's English NLP Model
nlp_spacy.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [4]:
#Sample court judgement
sample_text = open('SampleTexts/sample_judgment1.txt').read()

#Create the doc object by passing it through nlp pipeline.
doc = nlp_spacy(sample_text)

In [5]:
# NER already ran when the Doc was created; list the entities it found
for ent in doc.ents:
    print(ent.text, "|", ent.start_char, "|", ent.end_char, "|", ent.label_, "|", spacy.explain(ent.label_))

THE SUPREME COURT OF INDIA | 32 | 58 | ORG | Companies, agencies, institutions, etc.
2017 | 140 | 144 | DATE | Absolute or relative dates or periods
APPELLANT(S | 224 | 235 | ORG | Companies, agencies, institutions, etc.
KANUNGO | 286 | 293 | PERSON | People, including fictional
RESPONDENT(S | 349 | 361 | PERSON | People, including fictional
B.R. GAVAI | 398 | 408 | PERSON | People, including fictional
18th April 2012 | 467 | 482 | DATE | Absolute or relative dates or periods
the High Court of Orissa in Arbitration Appeal No. | 521 | 571 | ORG | Companies, agencies, institutions, etc.
25 | 572 | 574 | CARDINAL | Numerals that do not fall under another type
2007 | 578 | 582 | DATE | Absolute or relative dates or periods
2 | 640 | 641 | CARDINAL | Numerals that do not fall under another type
3 kilometers | 771 | 783 | QUANTITY | Measurements, as of weight or distance
Kanjipani | 810 | 819 | GPE | Countries, cities, states
Kuntala | 823 | 830 | GPE | Countries, cities, states
16th Decembe

In [6]:
type(doc.ents)

tuple

In [7]:
type(doc.ents[23]) #Span class in SpaCy

spacy.tokens.span.Span

In [8]:
#Visualising Named Entities with SpaCy
from spacy import displacy
displacy.render(doc, style='ent') #displacy.serve -- if not on Jupyter Notebook.

In [9]:
# Entities recognised by the spaCy NER
nlp_spacy.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

As the output above shows, spaCy's NER is not perfect and may not work well with all kinds of text: for example, it tags "APPELLANT(S" as ORG and "RESPONDENT(S" as PERSON. That is why people develop custom NERs for specialised topics. For instance, [Opennyai](https://github.com/OpenNyAI/Opennyai) developed an [NER for Indian Court Judgements](https://github.com/Legal-NLP-EkStep/legal_NER) with customised entities that the spaCy NER doesn't recognise (PRECEDENT, STATUTE, etc.).

#  Building your own NERs <a class="anchor" id="customner"></a>

## Rule-based NERs <a class="anchor" id="rulener"></a>
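spaCy's `en_core_web_sm` NER is a statistical model rather than a rule-based one, but its predictions often look as if they follow simple surface rules. Let's see a few of these apparent rules: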

In [10]:
doc = nlp_spacy("The share price of Tata Motors is INR 412.90")
displacy.render(doc, style='ent') #displacy.serve -- if not on Jupyter Notebook.

In [11]:
doc = nlp_spacy("The share price of tata motors is $ 412.90")
displacy.render(doc, style='ent') #displacy.serve -- if not on Jupyter Notebook.

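Judging from these two examples, spaCy's NER behaves as if it follows these rules (they are patterns inferred from its output, not documented rules):
1. An ORG entity has to start with a capital letter.
2. A MONEY entity starts with the $ character.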

Similarly, we can build our own rule-based NERs. The rules can be based on parts of speech (POS), regular expressions, etc. spaCy's EntityRuler can be used to build a rule-based NER. [Link](https://ner.pythonhumanities.com/02_01_spaCy_Entity_Ruler.html#introducing-complex-rules-and-variance-to-the-entityruler-advanced)
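As a rough illustration of the regex route, the sketch below implements a tiny rule-based NER with only the standard library. The two patterns and the `rule_based_ner` helper are assumptions made up for this example (they are not spaCy's or OpenNyAI's rules); a production system would more likely express such rules through spaCy's EntityRuler.

```python
import re

# Hypothetical rules: one (label, compiled pattern) pair per entity type.
RULES = [
    # Amounts introduced by "$", "INR" or "Rs."
    ("MONEY", re.compile(r"(?:\$|INR|Rs\.?)\s?\d[\d,]*(?:\.\d+)?")),
    # Case numbers such as "Arbitration Appeal No. 25 of 2007"
    ("CASE_NUMBER", re.compile(
        r"(?:Civil Appeal|Arbitration Appeal) No\. ?\d+ of \d{4}",
        re.IGNORECASE)),
]

def rule_based_ner(text):
    """Return (entity text, start, end, label) tuples sorted by position."""
    entities = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append((match.group(0), match.start(), match.end(), label))
    return sorted(entities, key=lambda ent: ent[1])

money = rule_based_ner("The share price of Tata Motors is INR 412.90")
cases = rule_based_ner("passed in Arbitration Appeal No. 25 of 2007")
print(money)
print(cases)
```

Unlike a statistical model, these rules recognise exactly what the patterns describe and nothing else, so they are predictable but need to be extended by hand for every new surface form.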

## Machine Learning based NERs <a class="anchor" id="mlner"></a>

### BERT - Hugging Face NER

The `dslim/bert-base-NER` model used below has been trained to recognise four types of entities: locations (LOC), organisations (ORG), persons (PER) and miscellaneous (MISC).

In [12]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

In [13]:
nlp_bert = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp_bert(sample_text)

In [14]:
print(ner_results)

[{'entity': 'B-MISC', 'score': 0.6289397, 'index': 9, 'word': 'S', 'start': 36, 'end': 37}, {'entity': 'B-LOC', 'score': 0.99272704, 'index': 17, 'word': 'IN', 'start': 53, 'end': 55}, {'entity': 'I-LOC', 'score': 0.773564, 'index': 18, 'word': '##DI', 'start': 55, 'end': 57}, {'entity': 'I-LOC', 'score': 0.93382025, 'index': 19, 'word': '##A', 'start': 57, 'end': 58}, {'entity': 'I-ORG', 'score': 0.45966476, 'index': 88, 'word': 'CH', 'start': 278, 'end': 280}, {'entity': 'I-ORG', 'score': 0.5811653, 'index': 89, 'word': '##AN', 'start': 280, 'end': 282}, {'entity': 'I-ORG', 'score': 0.5139487, 'index': 90, 'word': '##DR', 'start': 282, 'end': 284}, {'entity': 'I-ORG', 'score': 0.7135396, 'index': 92, 'word': 'K', 'start': 286, 'end': 287}, {'entity': 'I-ORG', 'score': 0.464004, 'index': 93, 'word': '##AN', 'start': 287, 'end': 289}, {'entity': 'B-PER', 'score': 0.96993154, 'index': 128, 'word': 'B', 'start': 398, 'end': 399}, {'entity': 'I-PER', 'score': 0.75982845, 'index': 130, 'wo
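The raw output is one dictionary per word piece (`IN`, `##DI`, `##A`), tagged with `B-` (begin) and `I-` (inside) prefixes. The sketch below shows one way to merge adjacent pieces into whole entity spans using their character offsets. The `merge_word_pieces` helper and the short sample text are assumptions for illustration; the tokens only mimic the shape of the output above. The Hugging Face pipeline can also do this grouping itself through its `aggregation_strategy` argument.

```python
def merge_word_pieces(text, tokens):
    """Merge B-/I- tagged word pieces into entity spans.

    A piece continues the previous span when it is tagged I-, carries the
    same label, and starts at most one character after the span ends.
    """
    spans = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        last = spans[-1] if spans else None
        continues = (
            last is not None
            and prefix == "I"
            and last["label"] == label
            and tok["start"] - last["end"] <= 1
        )
        if continues:
            last["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans

# Hypothetical text and tokens, shaped like the pipeline output above.
text = "COURT OF INDIA, B GAVAI"
tokens = [
    {"entity": "B-LOC", "word": "IN", "start": 9, "end": 11},
    {"entity": "I-LOC", "word": "##DI", "start": 11, "end": 13},
    {"entity": "I-LOC", "word": "##A", "start": 13, "end": 14},
    {"entity": "B-PER", "word": "B", "start": 16, "end": 17},
    {"entity": "I-PER", "word": "GA", "start": 18, "end": 20},
    {"entity": "I-PER", "word": "##VA", "start": 20, "end": 22},
    {"entity": "I-PER", "word": "##I", "start": 22, "end": 23},
]
merged = merge_word_pieces(text, tokens)
print([(span["text"], span["label"]) for span in merged])
```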

spaCy also has a transformer-based NER model (`en_core_web_trf`).

A combination of ML-based and rule-based NER models can also be built.
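One simple way to combine the two, sketched below, is to let rule matches win wherever they overlap a model prediction and keep the remaining model predictions. The `combine` helper and the rule span are hypothetical; the model spans reuse offsets from the spaCy output shown earlier. Other policies (for example, keeping the longer span) are equally possible.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def combine(ml_entities, rule_entities):
    """Keep every rule entity, plus the ML entities that overlap none of them."""
    merged = list(rule_entities)
    for ent in ml_entities:
        if not any(overlaps(ent, rule_ent) for rule_ent in rule_entities):
            merged.append(ent)
    return sorted(merged)

# Model predictions as (start, end, label), taken from the output above.
ml_entities = [
    (467, 482, "DATE"),      # "18th April 2012"
    (572, 574, "CARDINAL"),  # "25"
    (578, 582, "DATE"),      # "2007"
]
# Hypothetical rule match covering "Arbitration Appeal No. 25 of 2007".
rule_entities = [(549, 582, "CASE_NUMBER")]

combined = combine(ml_entities, rule_entities)
print(combined)  # [(467, 482, 'DATE'), (549, 582, 'CASE_NUMBER')]
```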

# OpenNyai <a class="anchor" id="opennyai"></a>

This is a custom NER developed by Ek-STEP to identify specialised entities in Indian court judgements. It is a transformer-based NER combined with custom rules.

In [15]:
from opennyai import Pipeline
from opennyai.utils import Data

In [16]:
#Sample court judgements
text1 = open('SampleTexts/sample_judgment1.txt').read()
text2 = open('SampleTexts/sample_judgment2.txt').read()

# you can also load your text files directly into this
texts_to_process = [text1, text2]

In [17]:
# create Data object for data  preprocessing before running ML models
data = Data(texts_to_process, preprocessing_nlp_model='en_core_web_trf')

#Other pre-processing models available are: en_core_web_md, en_core_web_sm (fastest but less accurate)

ℹ Pre-processing will happen on CPU!


The following pre-processing steps take place with the above code:
1. Separating preamble from judgment text
2. Sentence splitting of judgment text
3. Convert upper case words in preamble to title case
4. Replace newline characters within a sentence with space in judgment text

More on pre-processing: [Link](https://opennyai.readthedocs.io/en/latest/preprocessing/preprocessing.html) (chunks etc)
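To give a feel for steps 3 and 4, here is a minimal sketch written with regular expressions. Both helper functions are hypothetical illustrations of the idea and are not OpenNyAI's actual implementation.

```python
import re

def title_case_upper_words(preamble):
    """Step 3 (sketch): turn fully upper-case words into title case."""
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).title(), preamble)

def join_broken_lines(sentence):
    """Step 4 (sketch): replace newlines inside a sentence with one space."""
    return re.sub(r"\s*\n\s*", " ", sentence).strip()

heading = title_case_upper_words("IN THE SUPREME COURT OF INDIA")
party = title_case_upper_words("... APPELLANT(S)")
sentence = join_broken_lines("the High Court\nof Orissa")
print(heading)   # In The Supreme Court Of India
print(party)     # ... Appellant(S)
print(sentence)  # the High Court of Orissa
```

Title-casing the preamble matters because many NER models rely on capitalisation cues, which an all-caps heading hides.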

In [18]:
# If you have access to GPU then set this to True else False
use_gpu = False

In [19]:
# Load NER model
pipeline = Pipeline(components=['NER'], use_gpu=use_gpu, verbose=True,
                   ner_model_name='en_legal_ner_trf',
                   ner_mini_batch_size=40000,
                   ner_do_sentence_level=True,
                   ner_do_postprocess=True,
                   ner_statute_shortforms_path='')


#'en_legal_ner_sm' is fast but less accurate.

ℹ Loading NER...
ℹ NER will run on CPU!


In [20]:
#Apply NER pipeline over the data
results = pipeline(data)

ℹ Processing documents with Legal NER!!!


 50%|██████████████████████▌                      | 1/2 [00:12<00:12, 12.94s/it]

⚠ There was some issue while performing postprocessing for doc id
38c4e8ad5395d9a57af900e8bd801d146c84a09d0a91703ca69a3ebeb5f5bece.
Some of postprocessing info may be absent because of this in doc.


100%|█████████████████████████████████████████████| 2/2 [00:35<00:00, 17.56s/it]


In [None]:
# Export the results to a JSON file
import json
with open('output.json', 'w') as f:
    json.dump(results, f, indent=4)  # json.dump writes to a file; json.dumps only returns a string

OpenNyai's NER operates at the sentence level, so all entities found in a given sentence are stored in one dictionary. By default, the first "sentence" of every judgement is the preamble.

For more information on parameters: [Link](https://opennyai.readthedocs.io/en/latest/ner/legal_named_entities.html)

Since the NER is done at the sentence level, post-processing helps capture document-level context. Post-processing is done using rules.
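As an example of what such a document-level rule might look like, the sketch below expands a short form like "1940 Act" to the full statute name found elsewhere in the same document. This is a hypothetical rule written for illustration only and should not be read as OpenNyAI's actual post-processing logic.

```python
import re

def resolve_short_statutes(statutes):
    """Expand '<year> Act' to a full statute name seen in the same document."""
    # Index full names such as "Arbitration Act, 1940" by their year.
    full_by_year = {}
    for name in statutes:
        match = re.fullmatch(r".+ Act, (\d{4})", name)
        if match:
            full_by_year.setdefault(match.group(1), name)

    resolved = []
    for name in statutes:
        match = re.fullmatch(r"(\d{4}) Act", name)
        if match:
            # Fall back to the short form when no full name is known.
            resolved.append(full_by_year.get(match.group(1), name))
        else:
            resolved.append(name)
    return resolved

mentions = [
    "Arbitration Act, 1940",
    "1940 Act",
    "Arbitration and Conciliation Act, 1996",
    "1996 Act",
    "2005 Act",  # no full form in the document, so it stays unresolved
]
resolved = resolve_short_statutes(mentions)
print(resolved)
```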

In [21]:
# Entities identified in Preamble of 1st judgement
results[0]['annotations'][0]

{'id': '20f13480b7804b28a397c336e753f6ef',
 'start': 0,
 'end': 405,
 'text': '     Reportable\n\n            In The Supreme Court Of India\n             Civil Appellate Jurisdiction \n\n             Civil Appeal No. 8990 Of 2017 \n\nExecutive Engineer (R And B)\nAnd Others                                    ... Appellant(S) \n\n                           Versus\nGokul Chandra Kanungo (Dead) \nThr. His Lrs.                                ... Respondent(S) \n\n                         Judgment\n',
 'entities': [{'id': '204e0f2415924ef1affad2ac69f094f1',
   'normalized_name': 'Supreme Court Of India',
   'start': 36,
   'end': 58,
   'text': 'Supreme Court Of India',
   'labels': ['COURT']},
  {'id': 'dbb2d97e62204879a7dd726053503824',
   'normalized_name': 'Executive Engineer',
   'start': 148,
   'end': 166,
   'text': 'Executive Engineer',
   'labels': ['PETITIONER']},
  {'id': '61e0c69831ce476384c6c4bc6ea5dd7f',
   'normalized_name': 'Gokul Chandra Kanungo',
   'start': 276,
   'end'

In [22]:
results[0]['annotations'][76]

{'id': '1a629c5b51a04d2eb695f4f24fa887bb',
 'start': 10799,
 'end': 10926,
 'text': 'It will be apposite to refer to the following observations of this Court in the case of Rajendra Construction Co. (supra):\n\n30.',
 'entities': [{'id': '74bd7e9df8ad45e2b5b04331da3e9aa1',
   'normalized_name': 'Rajendra Construction Co.',
   'start': 10887,
   'end': 10912,
   'text': 'Rajendra Construction Co.',
   'labels': ['ORG']}]}
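Because each sentence carries its own `entities` list, a small loop is enough to flatten the nested structure into a label-to-texts mapping. The sketch below runs on a hypothetical, trimmed-down stand-in that only mirrors the `annotations`, `entities`, `text` and `labels` keys visible in the output above.

```python
from collections import defaultdict

# Hypothetical stand-in with the same shape as results[0] above.
sample_result = {
    "annotations": [
        {"text": "In The Supreme Court Of India ...",
         "entities": [
             {"text": "Supreme Court Of India", "labels": ["COURT"]},
             {"text": "Executive Engineer", "labels": ["PETITIONER"]},
         ]},
        {"text": "... in the case of Rajendra Construction Co. (supra)",
         "entities": [
             {"text": "Rajendra Construction Co.", "labels": ["ORG"]},
         ]},
    ]
}

def entities_by_label(result):
    """Group entity texts by label across all sentences of one document."""
    grouped = defaultdict(list)
    for sentence in result["annotations"]:
        for entity in sentence["entities"]:
            for label in entity["labels"]:
                grouped[label].append(entity["text"])
    return dict(grouped)

grouped = entities_by_label(sample_result)
print(grouped)
```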

In [23]:
ner_doc_1 = pipeline._ner_model_output[0]
identified_entities = [(ent, ent.label_) for ent in ner_doc_1.ents]
identified_entities

[(Supreme Court Of India, 'COURT'),
 (Executive Engineer, 'PETITIONER'),
 (Gokul Chandra Kanungo, 'RESPONDENT'),
 (B.R. GAVAI, 'JUDGE'),
 (18th April 2012, 'DATE'),
 (High Court of Orissa, 'COURT'),
 (Arbitration Appeal No. 25 of 2007, 'CASE_NUMBER'),
 (Kanjipani, 'GPE'),
 (Kuntala, 'GPE'),
 (16th December 1971, 'DATE'),
 (15 th December 1972, 'DATE'),
 (30th August 1977, 'DATE'),
 (25th July 1989, 'DATE'),
 (10 th August 1989, 'DATE'),
 (O.S. No. 206 of 1989, 'CASE_NUMBER'),
 (Court of Civil Judge (Senior Division), Bhubaneswar, 'COURT'),
 (Section 20, 'PROVISION'),
 (Arbitration Act, 1940, 'STATUTE'),
 (14th February 1990, 'DATE'),
 (1940 Act, 'STATUTE'),
 (Arbitration and Conciliation Act, 1996, 'STATUTE'),
 (4th February 2000, 'DATE'),
 (MJC No. 36 of 2000, 'CASE_NUMBER'),
 (Section 11, 'PROVISION'),
 (15 th October 2001, 'DATE'),
 (S.K. Mohanty, 'JUDGE'),
 (15th March 2002, 'DATE'),
 (1st April 1976, 'DATE'),
 (15th March 2002, 'DATE'),
 (24 th August 2004, 'DATE'),
 (1st April 19

Entities that are identified by Opennyai's NER: COURT, PETITIONER, RESPONDENT, JUDGE, LAWYER, DATE, ORG, GPE, STATUTE, PROVISION, PRECEDENT, CASE_NUMBER, WITNESS, OTHER_PERSON. These are better explained here: [Link](https://docs.google.com/presentation/d/e/2PACX-1vSpWE_Qk9X_wBh7xJWPyYcWcME3ZBh_HmqeZOx58oMLyJSi0Tn0-JMWKI-HsQIRuUTbQHPql6MlU7OS/pub?start=false&loop=false&delayms=3000#slide=id.g13cb659750e_0_76)

Hence, it is much more specialised than general purpose NERs. 

In [24]:
type(ner_doc_1.ents[0])

spacy.tokens.span.Span

In [25]:
#Visualising NER
from spacy import displacy
from opennyai.ner.ner_utils import ner_displacy_option
#displacy.serve(ner_doc_1, style='ent',port=8080,options=ner_displacy_option)
displacy.render(ner_doc_1, style='ent', options=ner_displacy_option)

In [30]:
# Frequency count of all precedents in a judgement
#from opennyai.ner import get_unique_precedent_count
#import opennyai.ner as InLegalNER

#precedents=InLegalNER.get_unique_precedent_count(ner_doc_1)
#precedents

In [27]:
# Frequency count of all provisions in a judgement
from opennyai.ner import get_unique_provision_count
provisions=get_unique_provision_count(ner_doc_1)
provisions

{'Section 20 of Arbitration Act, 1940': 1,
 'Article 142 of Constitution of India': 4,
 'Section 11 of 1996 Act': 3,
 'Section 34 of 1996 Act': 1,
 'Section 37 of 1996 Act': 1,
 'subsection (7) of Section 31 of 1996 Act': 1,
 'Section 31(7)(a) of 1996 Act': 3,
 'clause (a) of sub section (7) of Section 31 of 1996 Act': 1}

In [28]:
# Frequency count of all statutes in a judgement
from opennyai.ner import get_unique_statute_count
statutes=get_unique_statute_count(ner_doc_1)
statutes

{'Arbitration Act, 1940': 1, 'Constitution of India': 4, '1996 Act': 10}
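The same kind of frequency count can be sketched for any other label with `collections.Counter`. The `count_entities` helper is an assumption for illustration, and the pairs below are a small hand-copied subset of the entity list shown earlier; with the real document they could be built as `[(ent.text, ent.label_) for ent in ner_doc_1.ents]`.

```python
from collections import Counter

# (entity text, label) pairs, a subset of the entities listed earlier.
entity_pairs = [
    ("18th April 2012", "DATE"),
    ("15th March 2002", "DATE"),
    ("1st April 1976", "DATE"),
    ("15th March 2002", "DATE"),
    ("S.K. Mohanty", "JUDGE"),
]

def count_entities(pairs, label):
    """Count how often each entity text occurs for the given label."""
    return Counter(text for text, ent_label in pairs if ent_label == label)

date_counts = count_entities(entity_pairs, "DATE")
print(date_counts.most_common(1))  # [('15th March 2002', 2)]
```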

In [29]:
# Saving NER Results
from opennyai.ner import get_csv
get_csv(ner_doc_1,'sample_judgement1_NER.csv', save_path=r"/home/krishna/NLP/NER/sample_judgement1_NER.csv")

# References
1. [NLP Tutorial by Codebasics](https://www.youtube.com/watch?v=2XUhKpH0p4M)
2. [Huggingface NER](https://huggingface.co/dslim/bert-base-NER)
3. [Creating your own rule based NER using SpaCy](https://ner.pythonhumanities.com/02_01_spaCy_Entity_Ruler.html#introducing-complex-rules-and-variance-to-the-entityruler-advanced)
4. [Opennyai - NER](https://github.com/Legal-NLP-EkStep/legal_NER)