# What is Named Entity Recognition?

Named Entity Recognition (NER) finds and labels the spans of text that mention entities of interest. In QS's use case, named entities can include:
1. Drug Names (e.g. Adalimumab)
2. Condition Names (e.g. Non-infectious uveitis)
3. Treatments (e.g. Dexamethasone intravitreal implant)
4. Side-Effects paired with Drug Names? (e.g. nasopharyngitis; the entity usually follows a tagged line indicating the drug)
5. Procedures (e.g. X Drug costs Y amount for Z treatment)

Recognition is achieved by labeling entities with one or more machine learning sequence models. The pipeline can also call special rule-based components that use regular expressions to interpret numerical data such as prices (e.g. converting all currencies to USD).
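
As a rough illustration of the rule-based side, the sketch below uses a regular expression to find prices and normalize them to USD. The exchange rates, the `extract_prices_usd` helper, and the example sentence are all hypothetical and only show the idea; this is not how the Stanford components are implemented.

```python
import re

# Hypothetical exchange rates to USD; illustrative values only, not market data.
RATES_TO_USD = {"$": 1.0, "€": 1.1, "£": 1.25}

# A currency symbol followed by an amount such as 1,200 or 99.50.
PRICE_PATTERN = re.compile(r"([$€£])\s?(\d[\d,]*(?:\.\d+)?)")


def extract_prices_usd(text):
    """Return (matched text, amount in USD) pairs for each price found in text."""
    results = []
    for match in PRICE_PATTERN.finditer(text):
        symbol, raw_amount = match.groups()
        amount = float(raw_amount.replace(",", ""))  # drop thousands separators
        results.append((match.group(0), round(amount * RATES_TO_USD[symbol], 2)))
    return results


# Made-up sentence in the "X Drug costs Y amount for Z treatment" shape.
prices = extract_prices_usd("Drug X costs £1,000 for treatment Z, or $1,250.50 elsewhere.")
```

A real pipeline would attach these normalized values to the tagged tokens rather than return them separately.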

### NER Pipeline Overview (Based on the Stanford Model) 
1. Statistical Models
2. Numeric Sequences and SUTime
3. Fine Grained NER
4. RegexNER Rules Format
5. Customizing the Fine-Grained NER
6. Additional TokensRegexNER Rules
7. Additional TokensRegex Rules
8. Entity Mention Detection 
9. API Creation 
10. Accessing Entity Confidences 

### Foundational Paper: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling - Stanford 
Source: [White Paper](https://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf)
Summary:
- d

## Challenges 

There are three main challenges in developing a medical-domain, fine-grained entity recognizer that can be trained to recognize uncommon entities:
1. Selection of the tag set (i.e. the labels)
2. Creation of training data
3. Development of a fast and accurate multi-class labeling algorithm

### Selection of the Tag Set 
[FIGER](http://xiaoling.github.io/pubs/ling-aaai12.pdf)'s authors propose the curation of 112 unique tags based on [Freebase](https://developers.google.com/freebase/guide/basic_concepts) types. 

### Training Data Creation
FIGER's authors propose exploiting anchor links in Wikipedia text to automatically label entity segments with appropriate tags. 
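
The anchor-link idea can be sketched as follows, assuming wikitext-style links of the form `[[Title]]` or `[[Title|surface text]]`. The `TITLE_TO_TAGS` mapping, its tag names, and the `label_from_anchors` helper are hypothetical stand-ins for the Freebase-derived types. This is a minimal sketch, not FIGER's actual pipeline.

```python
import re

# Hypothetical mapping from Wikipedia page titles to tags (stand-in for Freebase types).
TITLE_TO_TAGS = {
    "Adalimumab": ["drug"],
    "Uveitis": ["condition"],
}

# Matches [[Title]] or [[Title|surface text]].
ANCHOR_PATTERN = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def label_from_anchors(wikitext):
    """Return (plain_text, mentions); each mention is (start, end, tags) into plain_text."""
    plain_parts = []
    mentions = []
    cursor = 0     # position in the original wikitext
    plain_len = 0  # length of the plain text built so far
    for match in ANCHOR_PATTERN.finditer(wikitext):
        before = wikitext[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        title = match.group(1)
        surface = match.group(2) or title
        tags = TITLE_TO_TAGS.get(title)
        if tags:  # only anchors with a known type become labeled mentions
            mentions.append((plain_len, plain_len + len(surface), tags))
        plain_parts.append(surface)
        plain_len += len(surface)
        cursor = match.end()
    plain_parts.append(wikitext[cursor:])
    return "".join(plain_parts), mentions


text, mentions = label_from_anchors(
    "[[Adalimumab]] is used to treat [[Uveitis|non-infectious uveitis]]."
)
```

Each mention's character offsets index into the cleaned text, so `text[start:end]` recovers the labeled segment.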

### Fast/Accurate Model 
The heuristically labeled training data (i.e. data labeled automatically from the Wikipedia anchor links rather than by hand) is used to train a conditional random field (CRF) model for segmentation, which identifies the boundaries of the text spans that mention an entity.
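
To make the segmentation output concrete, the sketch below turns per-token boundary labels into mention strings. It assumes a B/I/O labeling scheme (begin, inside, outside) and hand-written labels in place of real CRF predictions. The `bio_to_spans` helper is hypothetical.

```python
def bio_to_spans(tokens, labels):
    """Collect the token spans marked by B (begin) / I (inside) labels."""
    spans = []
    current = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:  # close the previous span before starting a new one
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["Dexamethasone", "intravitreal", "implant", "treats", "uveitis"]
labels = ["B", "I", "I", "O", "B"]  # hand-written stand-in for CRF output
spans = bio_to_spans(tokens, labels)
```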

The final step assigns tags to the segmented mentions using an adapted perceptron algorithm for multi-class, multi-label classification. (A perceptron is a single-layer linear classifier; it is the building block of neural networks rather than a full multi-layer network.)
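
A minimal sketch of a multi-label perceptron follows, assuming that every tag whose linear score is positive gets assigned to the mention. The class name, the toy feature vectors, and the two tags are hypothetical, and this is not FIGER's actual implementation.

```python
import numpy as np


class MultiLabelPerceptron:
    """One weight vector per tag; a tag is assigned when its score is positive."""

    def __init__(self, n_features, n_tags):
        self.weights = np.zeros((n_tags, n_features))

    def predict(self, x):
        """Return a boolean vector with one entry per tag."""
        return self.weights @ x > 0

    def fit(self, X, Y, epochs=10):
        for _ in range(epochs):
            for x, y in zip(X, Y):
                predicted = self.predict(x)
                # +1 for missed tags, -1 for wrongly assigned tags, 0 when correct
                error = y.astype(int) - predicted.astype(int)
                self.weights += np.outer(error, x)
        return self


# Toy data: columns are [feature_a, feature_b, bias]; tags are [tag_a, tag_b].
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1]])
Y = np.array([[True, False], [False, True], [True, True]])
model = MultiLabelPerceptron(n_features=3, n_tags=2).fit(X, Y)
predictions = np.array([model.predict(x) for x in X])
```

Unlike a single-label classifier that picks only the top-scoring class, this variant can assign several tags to one mention, as the third toy example shows.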

### Evaluation
Evaluation of the model has two stages:
1. Measuring the precision/accuracy of the tag assignment
2. Checking whether the tags are useful beyond the assignment itself (i.e. whether they help a downstream task)
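
For the first stage, one possible measure is micro-averaged precision and recall over the predicted tag sets. The sketch below assumes each mention has a set of gold tags and a set of predicted tags. The function name and the example tags are hypothetical, and other averaging schemes are equally valid.

```python
def micro_precision_recall(gold, predicted):
    """gold and predicted are lists of tag sets, one set per mention."""
    correct = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Made-up gold and predicted tag sets for three mentions.
gold = [{"drug"}, {"condition"}, {"drug", "treatment"}]
predicted = [{"drug"}, {"drug"}, {"drug"}]
precision, recall = micro_precision_recall(gold, predicted)
```

Here two of the three predicted tags are correct, and two of the four gold tags are recovered.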

import pandas as pd
import numpy as np
import sklearn.preprocessing as skp