# Named Entity Recognition (NER)

NER locates mentions of named entities in text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

In [16]:
import nltk
from nltk.tokenize import sent_tokenize

The following sample text is extracted from Wikipedia: https://en.wikipedia.org/wiki/Steve_Jobs

In [17]:
text = "Steven Paul Jobs was an American business magnate and investor. He was the chairman, chief executive officer, and co-founder of Apple Inc.; chairman and majority shareholder of Pixar."

In [18]:
tokenizer = sent_tokenize(text)  # despite the name, this is a list of sentence strings

Loop over every sentence in the document, split it into words with `word_tokenize`, and assign each word a part-of-speech (POS) tag with `pos_tag`.

Example: 
    
Sentence 1
    
    Word 1 -> POS tag assigned
    Word 2 -> POS tag assigned
    Word 3 -> POS tag assigned
    Word 4 -> POS tag assigned
    Word 5 -> POS tag assigned
    
Sentence 2

    Word 1 -> POS tag assigned
    Word 2 -> POS tag assigned
    Word 3 -> POS tag assigned
    Word 4 -> POS tag assigned
    Word 5 -> POS tag assigned
    
...

(the loop continues until every word in every sentence of the document is tagged)

Each tagged sentence is then passed to `ne_chunk`, which groups the tokens that make up a named entity into a labelled chunk.

In [20]:
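for sentence in tokenizer:  # `tokenizer` holds the sentences returned by sent_tokenize
    words = nltk.word_tokenize(sentence)  # split the sentence into word tokens
    tagged = nltk.pos_tag(words)  # list of (word, POS tag) pairs
    print(tagged)

    # binary=True labels every entity chunk as NE instead of a specific type
    named_entity = nltk.ne_chunk(tagged, binary=True)
    print(named_entity)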

[('Steven', 'NNP'), ('Paul', 'NNP'), ('Jobs', 'NNP'), ('was', 'VBD'), ('an', 'DT'), ('American', 'JJ'), ('business', 'NN'), ('magnate', 'NN'), ('and', 'CC'), ('investor', 'NN'), ('.', '.')]
(S
  (NE Steven/NNP Paul/NNP Jobs/NNP)
  was/VBD
  an/DT
  (NE American/JJ)
  business/NN
  magnate/NN
  and/CC
  investor/NN
  ./.)
[('He', 'PRP'), ('was', 'VBD'), ('the', 'DT'), ('chairman', 'NN'), (',', ','), ('chief', 'JJ'), ('executive', 'NN'), ('officer', 'NN'), (',', ','), ('and', 'CC'), ('co-founder', 'NN'), ('of', 'IN'), ('Apple', 'NNP'), ('Inc.', 'NNP'), (';', ':'), ('chairman', 'NN'), ('and', 'CC'), ('majority', 'NN'), ('shareholder', 'NN'), ('of', 'IN'), ('Pixar', 'NNP'), ('.', '.')]
(S
  He/PRP
  was/VBD
  the/DT
  chairman/NN
  ,/,
  chief/JJ
  executive/NN
  officer/NN
  ,/,
  and/CC
  co-founder/NN
  of/IN
  (NE Apple/NNP Inc./NNP)
  ;/:
  chairman/NN
  and/CC
  majority/NN
  shareholder/NN
  of/IN
  (NE Pixar/NNP)
  ./.)
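
The printed trees can also be walked programmatically. The following is a minimal sketch, not part of the original notebook: it rebuilds the first tree shown above by hand, so that no NLTK models are needed, and collects the text of each chunk. The helper name `extract_entities` is hypothetical and chosen for illustration only.

```python
from nltk import Tree

# Hand-built copy of the first chunk tree printed above (binary=True output).
# It is typed in manually here rather than produced by running ne_chunk.
tree = Tree("S", [
    Tree("NE", [("Steven", "NNP"), ("Paul", "NNP"), ("Jobs", "NNP")]),
    ("was", "VBD"),
    ("an", "DT"),
    Tree("NE", [("American", "JJ")]),
    ("business", "NN"),
    ("magnate", "NN"),
    ("and", "CC"),
    ("investor", "NN"),
    (".", "."),
])


def extract_entities(chunk_tree):
    """Return a (label, text) pair for each chunk directly under the root."""
    entities = []
    for child in chunk_tree:
        # Chunks are Tree objects; ordinary tokens are plain (word, tag) tuples.
        if isinstance(child, Tree):
            text = " ".join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities


print(extract_entities(tree))
# [('NE', 'Steven Paul Jobs'), ('NE', 'American')]
```

The same helper should work on the `named_entity` value from the loop above, since `ne_chunk` returns a `Tree` of this shape.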


### Note:
    
With `binary=True`, `ne_chunk` only marks whether a span is a named entity: every chunk gets the generic label `NE`, as in the output above, where Steven Paul Jobs forms a single `NE` chunk.

With the default call, `named_entity = nltk.ne_chunk(tagged)`, each chunk carries a specific entity type instead. In that mode Steven and Paul Jobs are identified as two separate PERSON chunks.

The same default run identifies American as a GPE (geo-political entity) and Apple Inc. as an ORGANIZATION.
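
To make the typed labels easier to inspect, a chunk tree can be flattened into per-token IOB tags with `nltk.chunk.tree2conlltags`. The sketch below is illustrative only: the tree is typed in by hand to mirror the labels described in this note (it is an assumption about the shape of the default output, not a captured run), which also means it runs without any NLTK model downloads.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built tree mirroring the typed labels described in the note above.
# Assumption: this is written out manually, not produced by calling ne_chunk.
typed_tree = Tree("S", [
    Tree("PERSON", [("Steven", "NNP")]),
    Tree("PERSON", [("Paul", "NNP"), ("Jobs", "NNP")]),
    ("was", "VBD"),
    ("an", "DT"),
    Tree("GPE", [("American", "JJ")]),
    ("business", "NN"),
    ("magnate", "NN"),
    ("and", "CC"),
    ("investor", "NN"),
    (".", "."),
])

# Each token becomes (word, POS tag, IOB tag). B- opens a chunk,
# I- continues it, and O marks tokens outside any chunk.
iob_tags = tree2conlltags(typed_tree)

for word, pos, iob in iob_tags[:6]:
    print(word, pos, iob)
# Steven NNP B-PERSON
# Paul NNP B-PERSON
# Jobs NNP I-PERSON
# was VBD O
# an DT O
# American JJ B-GPE
```

The two consecutive `B-PERSON` tags show that Steven and Paul Jobs sit in separate chunks, whereas a single chunk would read `B-PERSON`, `I-PERSON`, `I-PERSON`.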