### NER is used in many fields in Natural Language Processing (NLP)

Named entity recognition (NER) is probably the first step toward information extraction. It seeks to locate and classify named entities in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages. NER is used in many areas of Natural Language Processing (NLP), and it can help answer many real-world questions, such as:

* Which companies were mentioned in the news article?
* Were specified products mentioned in complaints or reviews?
* Does the tweet contain the name of a person? Does the tweet contain this person’s location?

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

#### Information Extraction

Our example sentence comes from The New York Times: “European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices.”

In [2]:
ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

Then we apply word tokenization and part-of-speech tagging to the sentence.

In [7]:
def preprocess(sent):
    # Split the raw sentence into word tokens
    sent = nltk.word_tokenize(sent)
    # Attach a part-of-speech tag to each token
    sent = nltk.pos_tag(sent)
    return sent

You may first need to download the NLTK averaged_perceptron_tagger resource (word_tokenize similarly relies on the punkt tokenizer models).
To do this:

```Python

import nltk
nltk.download('averaged_perceptron_tagger')

```

In [9]:
sent = preprocess(ex)
sent

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

We get a list of tuples, each containing a word from the sentence and its part-of-speech tag.

Next we’ll implement noun phrase chunking, a first step toward identifying named entities, using a regular-expression grammar whose rules indicate how sentences should be chunked.

Our chunk pattern consists of one rule: a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

In [10]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

In [11]:
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
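
Note that the rule above matches the tag NN exactly, so plural nouns (NNS) and proper nouns (NNP) such as "authorities" and "Google" are left outside every chunk. As a minimal sketch of one way to widen the rule, the variant below accepts one or more tags that start with NN. The names `tagged`, `broad_pattern`, `broad_cp`, and `broad_cs` are illustrative and not part of the original notebook, and the tagged tokens are copied by hand from the tagger output so that no tagger download is needed.

```python
import nltk

# POS-tagged tokens copied from the first part of the tagger output shown earlier.
tagged = [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
          ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN'), ('$', '$'),
          ('5.1', 'CD'), ('billion', 'CD'), ('on', 'IN'), ('Wednesday', 'NNP')]

# Assumed variant of the rule: <NN.*>+ accepts one or more tags starting with NN
# (NN, NNS, NNP, NNPS), so plural and proper nouns can also end up in a chunk.
broad_pattern = 'NP: {<DT>?<JJ>*<NN.*>+}'
broad_cp = nltk.RegexpParser(broad_pattern)
broad_cs = broad_cp.parse(tagged)

# Collect the text of every NP chunk, in sentence order.
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in broad_cs.subtrees(lambda t: t.label() == 'NP')]
print(noun_phrases)
```

Whether the broader rule is preferable depends on the task: it captures more candidate entities, but it also groups consecutive nouns into a single chunk.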


IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format.

In [12]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]


In this representation, there is one token per line, each with its part-of-speech tag and its chunk tag in IOB format: ```B-NP``` marks the first token of a noun phrase chunk, ```I-NP``` marks a token inside a chunk, and ```O``` marks a token outside any chunk. A corpus annotated in this way can be used to train a tagger that labels new sentences, and the ```nltk.chunk.conlltags2tree()``` function converts such tag sequences back into a chunk tree.
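
To make the IOB encoding concrete, here is a minimal hand-written sketch that groups IOB-tagged tokens back into chunks. The helper `iob_to_chunks` is hypothetical and written only for illustration. It is not an NLTK function and is not how ```nltk.chunk.conlltags2tree()``` is implemented. The sample triples are copied from the output above.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (label, [words]) chunks."""
    chunks = []
    current = None  # (label, words) for the chunk being built, if any
    for word, _pos, iob in tagged:
        if iob.startswith('B-'):
            # A B- tag always opens a new chunk, even right after another chunk.
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith('I-') and current is not None and current[0] == iob[2:]:
            # An I- tag with a matching label extends the open chunk.
            current[1].append(word)
        else:
            # An O tag (or a stray I- tag) closes any open chunk; the token is skipped.
            current = None
    return chunks


sample = [('power', 'NN', 'B-NP'),
          ('in', 'IN', 'O'),
          ('the', 'DT', 'B-NP'),
          ('mobile', 'JJ', 'I-NP'),
          ('phone', 'NN', 'I-NP'),
          ('market', 'NN', 'B-NP')]

chunks = iob_to_chunks(sample)
print(chunks)
```

Because "market" carries a B-NP tag rather than I-NP, it starts a chunk of its own instead of joining "the mobile phone", which matches the chunk tree printed earlier.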

With the function ```nltk.ne_chunk()```, we can recognize named entities using a classifier. The classifier adds category labels such as ```PERSON```, ```ORGANIZATION```, and ```GPE```.

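You may first need to download the NLTK maxent_ne_chunker resource, along with the words corpus that the chunker relies on.
To do this:

```Python

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')  # word list used by the named entity chunker

```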

In [16]:


ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


Google is recognized as a person, and the adjective “European” is labeled as a geopolitical entity (GPE). It’s quite disappointing, don’t you think?
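
To inspect the assigned labels programmatically rather than reading the printed tree, the labeled subtrees can be pulled out of the chunk tree. The following is a minimal sketch: `extract_entities` is a hypothetical helper, and the tree is built by hand to mirror the first few tokens of the output above so that the example runs without the classifier data.

```python
from nltk import Tree


def extract_entities(tree):
    """Return (label, text) pairs for every labeled chunk directly under the root."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; entity chunks are nested Tree objects.
        if isinstance(node, Tree):
            text = ' '.join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built stand-in for the start of the ne_chunk output shown above.
sample_tree = Tree('S', [
    Tree('GPE', [('European', 'JJ')]),
    ('authorities', 'NNS'),
    ('fined', 'VBD'),
    Tree('PERSON', [('Google', 'NNP')]),
])

entities = extract_entities(sample_tree)
print(entities)
```

The same helper should work on the tree returned by ```nltk.ne_chunk()```, since that function also returns a tree whose entity chunks are labeled subtrees.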