
Named entity recognition (NER) is an NLP technique that identifies mentions of rigid designators in text and assigns each one to a semantic type such as person, location, or organisation.

## **NER with NLTK**

NLTK is a leading Python library for NLP tasks such as text preprocessing, modelling, part-of-speech tagging, and model evaluation. It runs on all major operating systems and needs little extra configuration. Now, let's import `nltk`, download the data packages it needs, and perform NER on a simple sentence.

In [1]:
# Step One: Import nltk and download necessary packages
 
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
# Step Two: Load Data
 
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

# Step Three: Tokenise, find parts of speech and chunk words 

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Named-entity chunks are subtrees with a label; other tokens are plain (word, tag) tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))



GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
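
The loop above prints entities one by one. As a minimal sketch, the same traversal can collect them into a dictionary keyed by label. The `Chunk` class below is a hypothetical stand-in for the labelled subtrees that `nltk.ne_chunk` returns (so the example runs without downloading any data), and the part-of-speech tags in the sample are illustrative assumptions rather than real tagger output.

```python
from collections import defaultdict


class Chunk(list):
    """Hypothetical stand-in for a labelled subtree: a list of (word, tag) leaves."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def group_entities(chunks):
    """Collect entity strings by label, skipping plain (word, tag) tuples."""
    grouped = defaultdict(list)
    for chunk in chunks:
        if hasattr(chunk, "label"):
            grouped[chunk.label()].append(" ".join(word for word, _tag in chunk))
    return dict(grouped)


# Sample input shaped like the chunker output; the POS tags are assumed.
chunks = [
    Chunk("GPE", [("WASHINGTON", "NNP")]),
    ("--", ":"),
    Chunk("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
    Chunk("PERSON", [("Loretta", "NNP"), ("E.", "NNP"), ("Lynch", "NNP")]),
    Chunk("GPE", [("Brooklyn", "NNP")]),
]

print(group_entities(chunks))
```

Because the function only relies on `label()` and iteration over leaves, it should also accept the real tree returned by `nltk.ne_chunk`.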


## **NER with spaCy**

spaCy is an open-source library for advanced natural language processing, written in Python and Cython. It is well maintained and has over 20K stars on GitHub. spaCy ships several pre-trained models that you can apply directly to your data for tasks such as NER and information extraction.

In [2]:
# !python -m spacy download en_core_web_sm

In [3]:
# import spacy
import spacy
 
# load spacy model
nlp = spacy.load('en_core_web_sm')
 
# load data
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
 
# print entities
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
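
The two numbers printed for each entity are character offsets into the input string, with the end offset exclusive. The sketch below uses them to mark up the sentence inline; `annotate` is a hypothetical helper, and the entity tuples are copied from the output above rather than produced by running a model.

```python
sentence = "Apple is looking at buying U.K. startup for $1 billion"

# (text, start_char, end_char, label), copied from the printed output.
entities = [
    ("Apple", 0, 5, "ORG"),
    ("U.K.", 27, 31, "GPE"),
    ("$1 billion", 44, 54, "MONEY"),
]


def annotate(text, ents):
    """Wrap each entity span as [span](LABEL), leaving other text untouched."""
    pieces, cursor = [], 0
    for _ent_text, start, end, label in sorted(ents, key=lambda e: e[1]):
        pieces.append(text[cursor:start])              # text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the entity itself
        cursor = end
    pieces.append(text[cursor:])                        # trailing text
    return "".join(pieces)


print(annotate(sentence, entities))
# [Apple](ORG) is looking at buying [U.K.](GPE) startup for [$1 billion](MONEY)
```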


In [4]:
sentence1 = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
doc = nlp(sentence1)

# print entities
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

WASHINGTON 0 10 GPE
New York 51 59 GPE
the 1990s 79 88 DATE
Loretta E. Lynch 90 106 PERSON
Brooklyn 138 146 GPE
African-Americans 203 220 NORP
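
Since the same sentence was run through both libraries, the two result sets can be compared directly. The snippet below is a small sketch that hard-codes the (text, label) pairs from the NLTK and spaCy outputs above and uses set operations to show what each tool found; it does not rerun either model.

```python
# (text, label) pairs copied from the NLTK output above.
nltk_entities = {
    ("WASHINGTON", "GPE"),
    ("New York", "GPE"),
    ("Loretta E. Lynch", "PERSON"),
    ("Brooklyn", "GPE"),
}

# (text, label) pairs copied from the spaCy output above.
spacy_entities = {
    ("WASHINGTON", "GPE"),
    ("New York", "GPE"),
    ("the 1990s", "DATE"),
    ("Loretta E. Lynch", "PERSON"),
    ("Brooklyn", "GPE"),
    ("African-Americans", "NORP"),
}

shared = sorted(nltk_entities & spacy_entities)
only_spacy = sorted(spacy_entities - nltk_entities)
only_nltk = sorted(nltk_entities - spacy_entities)

print("Found by both:", shared)
print("Only spaCy:", only_spacy)
print("Only NLTK:", only_nltk)
```

On this sentence the two tools agree on the places and the person, and spaCy additionally tags a date and a nationality or group (`DATE`, `NORP`).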


## **Transformers BERT pipeline for NER**

In [7]:
# !pip install transformers

In [8]:
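from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Note: this rebinds `nlp`, replacing the spaCy model loaded earlier
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

# Each result is one token-level prediction with its tag, score and character offsets
ner_results = nlp(example)
print(ner_results)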


[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]


In [10]:
sentence2 = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

ner_results = nlp(sentence2)
print(ner_results)

[{'entity': 'B-LOC', 'score': 0.9406784, 'index': 1, 'word': 'WA', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.6661039, 'index': 2, 'word': '##S', 'start': 2, 'end': 3}, {'entity': 'I-MISC', 'score': 0.54796904, 'index': 3, 'word': '##H', 'start': 3, 'end': 4}, {'entity': 'I-ORG', 'score': 0.345704, 'index': 4, 'word': '##ING', 'start': 4, 'end': 7}, {'entity': 'I-ORG', 'score': 0.5425898, 'index': 5, 'word': '##TO', 'start': 7, 'end': 9}, {'entity': 'I-ORG', 'score': 0.3733317, 'index': 6, 'word': '##N', 'start': 9, 'end': 10}, {'entity': 'B-LOC', 'score': 0.999529, 'index': 18, 'word': 'New', 'start': 51, 'end': 54}, {'entity': 'I-LOC', 'score': 0.99932694, 'index': 19, 'word': 'York', 'start': 55, 'end': 59}, {'entity': 'B-PER', 'score': 0.99936014, 'index': 26, 'word': 'Lo', 'start': 90, 'end': 92}, {'entity': 'B-PER', 'score': 0.9971523, 'index': 27, 'word': '##retta', 'start': 92, 'end': 97}, {'entity': 'I-PER', 'score': 0.99329716, 'index': 28, 'word': 'E', 'start': 98
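
The output above is reported per word piece, so `WASHINGTON` arrives as `WA`, `##S`, `##H` and so on, each with its own tag. The following is a simplified sketch of one way to stitch the pieces back into entity spans. `merge_word_pieces` is a hypothetical helper, and its rules are assumptions: a piece starting with `##` continues the previous word, an `I-` tag of the same type on the next index continues the previous entity, and the label of the first piece wins. The token list is copied from the output above with the scores omitted, and the text is cut to the prefix those tokens cover.

```python
text = ("WASHINGTON -- In the wake of a string of abuses by New York "
        "police officers in the 1990s, Loretta E. Lynch")

tokens = [
    {"entity": "B-LOC", "index": 1, "word": "WA", "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "word": "##S", "start": 2, "end": 3},
    {"entity": "I-MISC", "index": 3, "word": "##H", "start": 3, "end": 4},
    {"entity": "I-ORG", "index": 4, "word": "##ING", "start": 4, "end": 7},
    {"entity": "I-ORG", "index": 5, "word": "##TO", "start": 7, "end": 9},
    {"entity": "I-ORG", "index": 6, "word": "##N", "start": 9, "end": 10},
    {"entity": "B-LOC", "index": 18, "word": "New", "start": 51, "end": 54},
    {"entity": "I-LOC", "index": 19, "word": "York", "start": 55, "end": 59},
    {"entity": "B-PER", "index": 26, "word": "Lo", "start": 90, "end": 92},
    {"entity": "B-PER", "index": 27, "word": "##retta", "start": 92, "end": 97},
]


def merge_word_pieces(tokens):
    """Group token-level predictions into entity spans (simplified rules)."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "B-LOC" -> "LOC"
        prev = spans[-1] if spans else None
        adjacent = prev is not None and tok["index"] == prev["last_index"] + 1
        continues_word = adjacent and tok["word"].startswith("##")
        continues_entity = (adjacent and tok["entity"].startswith("I-")
                            and label == prev["label"])
        if continues_word or continues_entity:
            # Extend the current span; keep the label of its first piece.
            prev["end"] = tok["end"]
            prev["last_index"] = tok["index"]
        else:
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "last_index": tok["index"]})
    return spans


for span in merge_word_pieces(tokens):
    print(span["label"], text[span["start"]:span["end"]])
```

The `transformers` pipeline also accepts an `aggregation_strategy` argument that groups pieces for you; the manual version here is only meant to make the grouping logic visible.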

In [12]:
ner_results

[{'entity': 'B-LOC',
  'score': 0.9406784,
  'index': 1,
  'word': 'WA',
  'start': 0,
  'end': 2},
 {'entity': 'I-ORG',
  'score': 0.6661039,
  'index': 2,
  'word': '##S',
  'start': 2,
  'end': 3},
 {'entity': 'I-MISC',
  'score': 0.54796904,
  'index': 3,
  'word': '##H',
  'start': 3,
  'end': 4},
 {'entity': 'I-ORG',
  'score': 0.345704,
  'index': 4,
  'word': '##ING',
  'start': 4,
  'end': 7},
 {'entity': 'I-ORG',
  'score': 0.5425898,
  'index': 5,
  'word': '##TO',
  'start': 7,
  'end': 9},
 {'entity': 'I-ORG',
  'score': 0.3733317,
  'index': 6,
  'word': '##N',
  'start': 9,
  'end': 10},
 {'entity': 'B-LOC',
  'score': 0.999529,
  'index': 18,
  'word': 'New',
  'start': 51,
  'end': 54},
 {'entity': 'I-LOC',
  'score': 0.99932694,
  'index': 19,
  'word': 'York',
  'start': 55,
  'end': 59},
 {'entity': 'B-PER',
  'score': 0.99936014,
  'index': 26,
  'word': 'Lo',
  'start': 90,
  'end': 92},
 {'entity': 'B-PER',
  'score': 0.9971523,
  'index': 27,
  'word': '##retta'