# What is Named Entity Recognition?

In QS's use case, named entities can include:
1. Drug names (e.g. Adalimumab)
2. Condition names (e.g. Non-infectious uveitis)
3. Treatments (e.g. Dexamethasone intravitreal implant)
4. Side effects, possibly paired with drug names (e.g. nasopharyngitis; the entity usually follows a tagged line indicating the drug)
5. Procedures (e.g. X drug costs Y amount for Z treatment)

This is achieved using one or more machine learning sequence models that label entities. The pipeline can also call special rule-based components that use regular expressions to interpret numerical data such as prices (e.g. converting all currencies to USD).
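As a rough illustration of such a rule-based component, here is a minimal sketch that finds price mentions with a regular expression and normalizes them to USD. The exchange rates, the `PRICE_RE` pattern, and the function name `extract_prices_usd` are all assumptions made up for this example; a real pipeline would look rates up and handle many more formats.

```python
import re

# Hypothetical fixed exchange rates to USD (illustration only).
RATES_TO_USD = {"$": 1.0, "USD": 1.0, "€": 1.10, "EUR": 1.10, "£": 1.25, "GBP": 1.25}

# A currency symbol or code, an optional space, then an amount like 1,200 or 35.50.
PRICE_RE = re.compile(r"(?P<cur>[$€£]|USD|EUR|GBP)\s?(?P<amt>\d+(?:,\d{3})*(?:\.\d+)?)")


def extract_prices_usd(text):
    """Return (matched_text, amount_in_usd) pairs for each price mention."""
    results = []
    for m in PRICE_RE.finditer(text):
        amount = float(m.group("amt").replace(",", ""))
        usd = round(amount * RATES_TO_USD[m.group("cur")], 2)
        results.append((m.group(0), usd))
    return results


print(extract_prices_usd("Drug X costs €1,200 per course; the generic is $35.50."))
```

The regular expression only does the locating; the conversion is a plain dictionary lookup, which keeps the rule easy to audit.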

NER is a method of building knowledge from semi-structured and unstructured text sources. In QS's use case, it can be used to locate, classify, and extract named entities, such as the ones listed above, from a text into pre-defined categories. It can be used to answer queries like:
* What conditions were mentioned in the study?
* Does this study cover treatments?
* What side effects come with those treatments?
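Once entities have been extracted, queries like these reduce to simple lookups. The sketch below assumes the NER step has already produced `(text, label)` pairs; the label names (`DRUG`, `CONDITION`, and so on) and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical NER output for one study: (entity text, label) pairs.
entities = [
    ("Adalimumab", "DRUG"),
    ("Non-infectious uveitis", "CONDITION"),
    ("Dexamethasone intravitreal implant", "TREATMENT"),
    ("nasopharyngitis", "SIDE_EFFECT"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)

# What conditions were mentioned in the study?
print(by_label.get("CONDITION", []))
# Does this study cover treatments?
print(bool(by_label.get("TREATMENT")))
```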

### NER Pipeline Overview (Based on the Stanford Model) 
1. Statistical Models
2. Numeric Sequences and SUTime
3. Fine Grained NER
4. RegexNER Rules Format
5. Customizing the Fine-Grained NER
6. Additional TokensRegexNER Rules
7. Additional TokensRegex Rules
8. Entity Mention Detection 
9. API Creation 
10. Accessing Entity Confidences 

### Basic Example: Building an NER using SpaCy and NLTK 
__Source:__ [Susan Li on Towards Data Science](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)<br>
__Summary:__


### Foundational Paper: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling - Stanford 
__Source:__ [White Paper from Stanford University](https://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf)<br>
__Summary:__ 

### Continued Improvements: Fine-Grained Entity Recognition (FIGER) 
__Source:__ [White Paper from the University of Washington](http://xiaoling.github.io/pubs/ling-aaai12.pdf)<br>
__Summary:__<br>
__Notes:__ This implementation is in Java only.<br><br>
There are three main challenges in developing a medically specific, fine-grained entity recognizer that can be trained to recognize uncommon entities:
1. Selection of the tag set (i.e. the labels)
2. Creation of training data
3. Development of a fast and accurate multi-class labeling algorithm.<br>


##### Selection of the Tag Set 
[FIGER](http://xiaoling.github.io/pubs/ling-aaai12.pdf)'s authors propose the curation of 112 unique tags based on [Freebase](https://developers.google.com/freebase/guide/basic_concepts) types. 

##### Training Data Creation
FIGER's authors propose exploiting anchor links in Wikipedia text to automatically label entity segments with appropriate tags. 
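To make the anchor-link idea concrete, here is a minimal sketch that pulls links out of wiki markup and labels the anchor text with the tags of the linked page. The `TITLE_TO_TAGS` mapping and its tag names are invented for illustration; FIGER derives its mapping from Freebase types, which this sketch does not reproduce.

```python
import re

# Hypothetical lookup from a Wikipedia page title to fine-grained tags.
TITLE_TO_TAGS = {
    "Adalimumab": ["/medicine/drug"],
    "Uveitis": ["/medicine/disease"],
}

# Wiki markup for a link: [[Page title]] or [[Page title|anchor text]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def label_anchor_links(wikitext):
    """Return (anchor_text, tags) for each link whose target has known tags."""
    labeled = []
    for m in LINK_RE.finditer(wikitext):
        title = m.group(1).strip()
        # When no explicit anchor text is given, the title doubles as the anchor.
        anchor = (m.group(2) or title).strip()
        if title in TITLE_TO_TAGS:
            labeled.append((anchor, TITLE_TO_TAGS[title]))
    return labeled


text = "[[Adalimumab]] is used to treat [[Uveitis|non-infectious uveitis]] and [[Psoriasis]]."
print(label_anchor_links(text))
```

Links whose target is not in the lookup (here, `Psoriasis`) are simply left unlabeled.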

##### Fast/Accurate Model 
The heuristically labeled training data (that is, data labeled automatically from the Wikipedia anchor links rather than by hand) is used to train a conditional random field (CRF) model for segmentation, which identifies the boundaries of text spans that mention an entity.

The final step is assigning tags to the segmented mentions using an adapted perceptron algorithm for multi-class, multi-label classification. (A perceptron is a single-layer linear classifier, the simplest building block of a neural network.)
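For intuition, here is a toy multi-label perceptron: one weight vector per tag, and every tag with a positive score is predicted. This is one common way to adapt a perceptron to multi-label output and is only a sketch; it is not taken from the FIGER code, and the feature strings and tag names are made up.

```python
from collections import defaultdict


class MultiLabelPerceptron:
    """One weight vector per tag; predicts every tag whose score is positive."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = {t: defaultdict(float) for t in self.tags}

    def score(self, tag, features):
        return sum(self.weights[tag][f] for f in features)

    def predict(self, features):
        return {t for t in self.tags if self.score(t, features) > 0}

    def update(self, features, gold):
        predicted = self.predict(features)
        for t in self.tags:
            if t in gold and t not in predicted:
                delta = 1.0   # missed tag: push its score up
            elif t not in gold and t in predicted:
                delta = -1.0  # spurious tag: push its score down
            else:
                continue
            for f in features:
                self.weights[t][f] += delta

    def train(self, examples, epochs=5):
        for _ in range(epochs):
            for features, gold in examples:
                self.update(features, gold)


# Hypothetical training mentions: (feature set, gold tag set).
examples = [
    ({"word=adalimumab", "suffix=mab"}, {"drug"}),
    ({"word=uveitis", "suffix=tis"}, {"condition"}),
    ({"word=dexamethasone", "suffix=one"}, {"drug", "treatment"}),
]

model = MultiLabelPerceptron(["drug", "condition", "treatment"])
model.train(examples)
print(model.predict({"word=dexamethasone", "suffix=one"}))
```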

##### Evaluation
Evaluation of the model has two stages:
1. Precision/accuracy of the tag assignment
2. Whether the tags are actually useful beyond their assignment
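The first stage can be sketched with micro-averaged precision, recall, and F1 over the sets of tags assigned to each mention. This is a generic multi-label metric offered as an assumption about how one might score tag assignment; it is not necessarily the exact measure used in the paper, and the gold and predicted sets below are invented.

```python
def micro_prf(gold_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over per-mention tag sets."""
    true_pos = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tag sets for three mentions.
gold = [{"drug"}, {"condition"}, {"drug", "treatment"}]
pred = [{"drug"}, {"drug"}, {"drug"}]

precision, recall, f1 = micro_prf(gold, pred)
print("precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))
```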

### Python Implementation Example
__Source:__ [Depends on the Definition Blog Post](https://www.depends-on-the-definition.com/introduction-named-entity-recognition-python/)<br>
__Summary:__

## A Basic Example - NER using NLTK and SpaCy

In this example, I'll cover how to build a simple NER to recognize names - of persons, organizations, and locations - using two standard Python packages. 

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

### Information Extraction 
Here, I'll take a short string from a NY Times headline today:

In [4]:
example = 'A call to arms for Sri Lankan monks. Ethnic cleansing of the Rohingya in Myanmar. A Buddhist faith known for pacifism is taking its place in a new age of nationalism.'

Then, I'll apply word tokenization and POS tagging to the sentence. This will return a __list of tuples containing the individual words in the sentence and their associated part-of-speech__. 

In [5]:
def preprocess(sentence):
    s = nltk.word_tokenize(sentence)
    s = nltk.pos_tag(s)
    return s

In [6]:
s = preprocess(example)
s

[('A', 'DT'),
 ('call', 'NN'),
 ('to', 'TO'),
 ('arms', 'NNS'),
 ('for', 'IN'),
 ('Sri', 'NNP'),
 ('Lankan', 'NNP'),
 ('monks', 'NNS'),
 ('.', '.'),
 ('Ethnic', 'JJ'),
 ('cleansing', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Rohingya', 'NNP'),
 ('in', 'IN'),
 ('Myanmar', 'NNP'),
 ('.', '.'),
 ('A', 'NNP'),
 ('Buddhist', 'NNP'),
 ('faith', 'NN'),
 ('known', 'VBN'),
 ('for', 'IN'),
 ('pacifism', 'NN'),
 ('is', 'VBZ'),
 ('taking', 'VBG'),
 ('its', 'PRP$'),
 ('place', 'NN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('age', 'NN'),
 ('of', 'IN'),
 ('nationalism', 'NN'),
 ('.', '.')]

Now - I want to implement __noun phrase chunking__ to identify named entities using a RegEx consisting of rules that indicate how sentences should be chunked. 

Rule:

My chunk pattern rule declares that a __noun phrase, NP, should be formed whenever the chunker finds an _optional_ (`?`) _determiner_, DT, followed by any number of (`*`) adjectives, JJ, and then a noun, NN.__

In [7]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'
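To see what the `?` and `*` quantifiers in this rule do, here is a simplified stand-in that applies the same pattern with Python's `re` module to a string of tags. It is only a sketch of the idea, not how NLTK implements its chunker, and the helper name `np_chunks` is made up.

```python
import re


def np_chunks(tagged):
    """Find spans matching <DT>?<JJ>*<NN> over a list of (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)

    # Character offset at which each token's tag starts in tag_string.
    starts = []
    pos = 0
    for _, tag in tagged:
        starts.append(pos)
        pos += len(tag) + 2  # +2 for the angle brackets

    chunks = []
    for m in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        first = starts.index(m.start())
        last = first + m.group(0).count("<")
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


tagged = [("a", "DT"), ("new", "JJ"), ("age", "NN"),
          ("of", "IN"), ("nationalism", "NN"), ("monks", "NNS")]
print(np_chunks(tagged))
```

Note that `NNS` (plural noun) does not match `<NN>`, which is why "monks" is not chunked here.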

### Chunking

Using the pattern variable I just declared above, I can create a chunk parser and test it on my sentence.

In [9]:
cp = nltk.RegexpParser(pattern)
print(cp)

chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<DT>?<JJ>*<NN>'>


In [10]:
cs = cp.parse(s)
print(cs)

(S
  (NP A/DT call/NN)
  to/TO
  arms/NNS
  for/IN
  Sri/NNP
  Lankan/NNP
  monks/NNS
  ./.
  (NP Ethnic/JJ cleansing/NN)
  of/IN
  the/DT
  Rohingya/NNP
  in/IN
  Myanmar/NNP
  ./.
  A/NNP
  Buddhist/NNP
  (NP faith/NN)
  known/VBN
  for/IN
  (NP pacifism/NN)
  is/VBZ
  taking/VBG
  its/PRP$
  (NP place/NN)
  in/IN
  (NP a/DT new/JJ age/NN)
  of/IN
  (NP nationalism/NN)
  ./.)


__IOB format__ (short for inside, outside, beginning) is a common format for tagging tokens in a chunking task in NER.

`tree2conlltags(t)` = returns a list of 3-tuples containing `(word, tag, IOB-tag)`. Arg1, t = the tree to be converted. In this case, it is the output of the parser from the previous cell. 

`pprint()` = data pretty printer, provides the capability to "pretty-print" arbitrary Python data structures in a form that can be used as input to the interpreter. 

Below, I use both print and pprint to display the contrast:

In [14]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
print(iob_tagged)
pprint(iob_tagged)

[('A', 'DT', 'B-NP'), ('call', 'NN', 'I-NP'), ('to', 'TO', 'O'), ('arms', 'NNS', 'O'), ('for', 'IN', 'O'), ('Sri', 'NNP', 'O'), ('Lankan', 'NNP', 'O'), ('monks', 'NNS', 'O'), ('.', '.', 'O'), ('Ethnic', 'JJ', 'B-NP'), ('cleansing', 'NN', 'I-NP'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Rohingya', 'NNP', 'O'), ('in', 'IN', 'O'), ('Myanmar', 'NNP', 'O'), ('.', '.', 'O'), ('A', 'NNP', 'O'), ('Buddhist', 'NNP', 'O'), ('faith', 'NN', 'B-NP'), ('known', 'VBN', 'O'), ('for', 'IN', 'O'), ('pacifism', 'NN', 'B-NP'), ('is', 'VBZ', 'O'), ('taking', 'VBG', 'O'), ('its', 'PRP$', 'O'), ('place', 'NN', 'B-NP'), ('in', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('new', 'JJ', 'I-NP'), ('age', 'NN', 'I-NP'), ('of', 'IN', 'O'), ('nationalism', 'NN', 'B-NP'), ('.', '.', 'O')]
[('A', 'DT', 'B-NP'),
 ('call', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('arms', 'NNS', 'O'),
 ('for', 'IN', 'O'),
 ('Sri', 'NNP', 'O'),
 ('Lankan', 'NNP', 'O'),
 ('monks', 'NNS', 'O'),
 ('.', '.', 'O'),
 ('Ethnic', 'JJ', 'B-NP'),
 ('cleansing', 

In this representation, there is one token per line, each with its part-of-speech tag and its IOB chunk tag.

Based on this training sentence, I can construct a tagger that can be used to label new sentences; and use the `nltk.chunk.conlltags2tree()` function to convert the tag sequences into a chunk tree. 
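To show what reading IOB tags back into chunks involves, here is a small pure-Python sketch that groups `(word, pos, iob)` triples into chunks. It returns plain tuples rather than an NLTK tree, and the function name `iob_to_chunks` is my own; it is not part of NLTK.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) chunks."""
    chunks = []
    current = None
    for word, _pos, iob in tagged:
        if iob.startswith("B-"):
            # A B- tag always opens a new chunk.
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            # An I- tag extends the open chunk of the same type.
            current[1].append(word)
        else:
            # 'O' (or a stray I- tag with no open chunk) closes any open chunk.
            current = None
    return chunks


sample = [("A", "DT", "B-NP"), ("call", "NN", "I-NP"), ("to", "TO", "O"),
          ("arms", "NNS", "O"), ("Ethnic", "JJ", "B-NP"), ("cleansing", "NN", "I-NP")]
print(iob_to_chunks(sample))
```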

The function `nltk.ne_chunk()` recognizes named entities using a pre-trained classifier, which adds category labels such as PERSON, ORGANIZATION, and GPE.

In [17]:
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(example)))
print(ne_tree)

(S
  A/DT
  call/NN
  to/TO
  arms/NNS
  for/IN
  (PERSON Sri/NNP Lankan/NNP)
  monks/NNS
  ./.
  Ethnic/JJ
  cleansing/NN
  of/IN
  the/DT
  (ORGANIZATION Rohingya/NNP)
  in/IN
  (GPE Myanmar/NNP)
  ./.
  A/NNP
  Buddhist/NNP
  faith/NN
  known/VBN
  for/IN
  pacifism/NN
  is/VBZ
  taking/VBG
  its/PRP$
  place/NN
  in/IN
  a/DT
  new/JJ
  age/NN
  of/IN
  nationalism/NN
  ./.)


Notice how now there are identifiers for several words in the sentence (i.e. Sri Lankan is tagged as PERSON, Rohingya is tagged as ORGANIZATION, Myanmar as Geopolitical Entity or GPE, etc.) However, the quality is a bit poor - obviously Sri Lankans are a people, not a singular person. 

Now I'll use SpaCy to demonstrate its efficiency compared to NLTK. 

## SpaCy

SpaCy's named entity recognition has been trained on the OntoNotes 5 corpus and supports several entity types beyond the three (PERSON, ORGANIZATION, and GPE) shown above.

I'll use the same sentence as before: "A call to arms for Sri Lankan monks. Ethnic cleansing of the Rohingya in Myanmar. A Buddhist faith known for pacifism is taking its place in a new age of nationalism."

In [33]:
import spacy
from spacy import displacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3+

In [34]:
doc = nlp('A call to arms for Sri Lankan monks. Ethnic cleansing of the Rohingya in Myanmar. A Buddhist faith known for pacifism is taking its place in a new age of nationalism.')
pprint(
    [
        (X.text, X.label_) for X in doc.ents
    ]
)

[('Sri Lankan', 'GPE'),
 ('Rohingya', 'PERSON'),
 ('Myanmar', 'GPE'),
 ('Buddhist', 'NORP')]


As I can see, even SpaCy has trouble identifying everything here. Sri Lankan and Rohingya should both be identified as 'NORP' instead of GPE and PERSON. Interestingly, NLTK's tagger, even though it has poorer performance overall, produced arguably more accurate tags here: Sri Lankan is more likely to be a PERSON and Rohingya an ORGANIZATION than what SpaCy put down.

What if I want to further categorize the dataset? I can go a step further and demonstrate token-level entity annotation. SpaCy describes entity boundaries with the BILUO (Beginning, In, Last, Unit, and Out) tagging scheme; note that the `ent_iob_` attribute used in the cell below reports the simpler IOB variant, so only B, I, and O appear in the output.

What is BILUO?
1. B = the first token of a multi-token entity
2. I = the inner token of a multi-token entity
3. L = the final token of a multi-token entity
4. U = a single-token entity
5. O = a non-entity token 

In [35]:
pprint(
    [
        (X, X.ent_iob_, X.ent_type_) for X in doc
    ]
)

[(A, 'O', ''),
 (call, 'O', ''),
 (to, 'O', ''),
 (arms, 'O', ''),
 (for, 'O', ''),
 (Sri, 'B', 'GPE'),
 (Lankan, 'I', 'GPE'),
 (monks, 'O', ''),
 (., 'O', ''),
 (Ethnic, 'O', ''),
 (cleansing, 'O', ''),
 (of, 'O', ''),
 (the, 'O', ''),
 (Rohingya, 'B', 'PERSON'),
 (in, 'O', ''),
 (Myanmar, 'B', 'GPE'),
 (., 'O', ''),
 (A, 'O', ''),
 (Buddhist, 'B', 'NORP'),
 (faith, 'O', ''),
 (known, 'O', ''),
 (for, 'O', ''),
 (pacifism, 'O', ''),
 (is, 'O', ''),
 (taking, 'O', ''),
 (its, 'O', ''),
 (place, 'O', ''),
 (in, 'O', ''),
 (a, 'O', ''),
 (new, 'O', ''),
 (age, 'O', ''),
 (of, 'O', ''),
 (nationalism, 'O', ''),
 (., 'O', '')]
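Since the output above uses plain IOB tags, here is a minimal sketch of how they map onto BILUO. It assumes, as in the output above, that every `I` continues the entity opened by the nearest preceding `B`; the helper name `iob_to_biluo` is my own and not a SpaCy function.

```python
def iob_to_biluo(tags):
    """Convert a list of IOB tags ('B', 'I', 'O') to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I"  # does the entity run on past this token?
        if tag == "B":
            biluo.append("B" if continues else "U")
        elif tag == "I":
            biluo.append("I" if continues else "L")
        else:
            biluo.append("O")
    return biluo


# IOB tags for: for, Sri, Lankan, monks, Rohingya, in, Myanmar
iob = ["O", "B", "I", "O", "B", "O", "B"]
print(iob_to_biluo(iob))
```

Multi-token entities such as "Sri Lankan" become `B` followed by `L`, while single-token entities such as "Myanmar" become `U`.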


## Extracting Named Entities from Web Articles using Beautiful Soup

So far, this has all been pretty simplistic. I want to actually get some decent information from an online article, so I'll use Beautiful Soup and requests to get ahold of the data for an NYT article and analyze it. 

__Quick overview of helper modules__:
`requests` is a module that allows me to deal with HTTP requests in a more elegant manner than using Python's built-in `urllib`

__Notes on Formatting Python Print statements__:
* `%s` = tells the command to convert to string
* `%f` = converts to floating point in fixed-decimal notation
* `%c` = converts to a single character
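A quick sketch of the three format codes in action; the values are arbitrary and chosen only for illustration.

```python
count = 70
ratio = 0.5
letter = 'N'

as_string = 'There are %s entities' % count   # %s calls str() on the value
as_float = 'Ratio: %f' % ratio                # %f defaults to six decimal places
as_char = 'Initial: %c' % letter              # %c takes a single character (or its integer code)

print(as_string)
print(as_float)
print(as_char)
```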

In [36]:
from bs4 import BeautifulSoup
import requests
import re

In [39]:
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'lxml')
    for script in soup(
        [
            "script", "style", "aside"
        ]
    ):
        script.extract()
    return " ".join(re.split(r'[\n\t]+',
                             soup.get_text()))
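To try the same clean-up idea without a network call or third-party packages, here is a standard-library sketch using `html.parser`. It is a swapped-in technique rather than Beautiful Soup, the class and function names are my own, and it additionally strips leading and trailing whitespace, which `url_to_string` does not.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text while skipping the contents of script, style, and aside tags."""

    SKIP = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many skipped tags are currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def html_to_string(html):
    """Return the visible text of an HTML string with newlines and tabs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(re.split(r"[\n\t]+", "".join(parser.parts))).strip()


html = ("<style>p {color: red}</style>\n<p>Senator\n\tMcConnell</p>\n"
        "<script>var x = 1;</script>\n<aside>Ad</aside>\n<p>of Kentucky</p>")
print(html_to_string(html))
```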

In [49]:
ny_bb = url_to_string('https://www.nytimes.com/2019/07/09/us/politics/amy-mcgrath-mitch-mcconnell.html?action=click&module=Top%20Stories&pgtype=Homepage')
article = nlp(ny_bb)
print('There are %s entities in the article' % len(article.ents))



There are 70 entities in the article


In [50]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'PERSON': 20,
         'GPE': 5,
         'ORG': 16,
         'TIME': 1,
         'DATE': 12,
         'NORP': 8,
         'ORDINAL': 1,
         'LOC': 1,
         'FAC': 1,
         'CARDINAL': 4,
         'WORK_OF_ART': 1})

In [51]:
# this tells me the three most frequent entities
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Mitch McConnell', 5), ('McConnell', 4), ('Kentucky', 3)]

In [60]:
# I'll randomly select a sentence to learn more
sentences = [x for x in article.sents]
len(sentences)

31

In [61]:
import random
rint = random.randrange(len(sentences))  # valid index whatever the article length
print(sentences[rint])

She lost her bid for a House seat last year and is now seeking to challenge Senator Mitch McConnell of Kentucky.


One of the cooler aspects of SpaCy is its ability to render an informative visualization of what's going on. Notice that the `jupyter` parameter is set to `True`, which renders the illustration inline in the notebook; it might only display properly in a web format.

In [79]:
# now I'll run displacy.render to draw the entities inline in the notebook
displacy.render(nlp(str(sentences[rint])), 
                jupyter=True, 
                style='ent')

In [80]:
displacy.render(nlp(str(sentences[rint])),
                # indicates that I want to visualize dependencies
                style='dep',
                # indicates that I want displacy to generate HTML code to fit in the notebook
                jupyter=True, 
                # stylistic - sets the distance between words in pixels
                options = {'distance':120})

The next step I'll complete is called _lemmatization_, which basically means removing inflectional endings and returning the base or dictionary form of a word, known as a _lemma_. I'll also take the chance to extract the part-of-speech.

In [82]:
[(x.orth_,
  x.pos_,
  x.lemma_) for x in [
    y
    for y
    in nlp(str(sentences[rint]))
    if not y.is_stop and y.pos_ != 'PUNCT']]

[('lost', 'VERB', 'lose'),
 ('bid', 'NOUN', 'bid'),
 ('House', 'PROPN', 'House'),
 ('seat', 'NOUN', 'seat'),
 ('year', 'NOUN', 'year'),
 ('seeking', 'VERB', 'seek'),
 ('challenge', 'VERB', 'challenge'),
 ('Senator', 'PROPN', 'Senator'),
 ('Mitch', 'PROPN', 'Mitch'),
 ('McConnell', 'PROPN', 'McConnell'),
 ('Kentucky', 'PROPN', 'Kentucky')]

In [83]:
dict([(str(x), x.label_) for x in nlp(str(sentences[rint])).ents])

{'House': 'ORG',
 'last year': 'DATE',
 'Mitch McConnell': 'PERSON',
 'Kentucky': 'GPE'}

In [84]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[rint]])

[(She, 'O', ''), (lost, 'O', ''), (her, 'O', ''), (bid, 'O', ''), (for, 'O', ''), (a, 'O', ''), (House, 'B', 'ORG'), (seat, 'O', ''), (last, 'B', 'DATE'), (year, 'I', 'DATE'), (and, 'O', ''), (is, 'O', ''), (now, 'O', ''), (seeking, 'O', ''), (to, 'O', ''), (challenge, 'O', ''), (Senator, 'O', ''), (Mitch, 'B', 'PERSON'), (McConnell, 'I', 'PERSON'), (of, 'O', ''), (Kentucky, 'B', 'GPE'), (., 'O', '')]


Now, just for kicks, I'll use displacy to visualize a larger chunk of the article (sentences 1 through 9).

In [97]:
displacy.render(nlp(' '.join(sent.text for sent in sentences[1:10])), 
                jupyter=True, 
                style='ent')