### POS tagging and NER

This section covers the following:

* Goals
    - learn the basics of POS tagging and of constituency and dependency relations
    - use NLTK to automatically POS-tag words and parse dependency relations
    - use POS tags to identify key phrases (noun phrases, verb phrases)
    - use NLTK to perform named entity recognition
    - survey other tools available for POS tagging and named entity recognition


#### Part of Speech Tagging Basics
Part-of-speech (POS) tagging is one of the most fundamental tasks in NLP. It identifies the grammatical category of each word in a sentence, such as noun, verb, or adjective.  
For example, suppose we want to tag the following sentence:  
*The field of health IT is diverse, rapidly changing, and covers numerous areas of scholarship* -- CHIP website

|Word|POS tag|Meaning|Word|POS tag|Meaning|
|------|------|------|------|------|------|
|The|DT|Determiner|changing|VBG|Verb, gerund or present participle|
|field|NN|Noun, singular or mass|and|CC|Coordinating conjunction|
|of|IN|Preposition or subordinating conjunction|covers|VBZ|Verb, 3rd person singular present|
|health|NN|Noun, singular or mass|numerous|JJ|Adjective|
|IT|NNP|Proper noun, singular|areas|NNS|Noun, plural|
|is|VBZ|Verb, 3rd person singular present|of|IN|Preposition or subordinating conjunction|
|diverse|JJ|Adjective|scholarship|NN|Noun, singular or mass|
|rapidly|RB|Adverb||||

POS tags look neat, but why do we need them? Tokenization already gives us the words in a text, so why do we also need information about grammatical constructs? In fact, POS tagging helps with many tasks, such as:

1) Word sense disambiguation
A word can have multiple senses. For example, in "I like playing video games" and "Peter looks like his grandfather", the word "like" is used as a verb and as a preposition, respectively. POS tags let us identify this difference easily, rather than inferring it from the word's meaning, which is often more computationally complex.  

2) Identifying key components in a text
Noun phrases are key components that typically represent the subject or object of a sentence and thus indicate the informative content of a text. Another example is sentiment analysis, which makes intensive use of adjectives and adverbs as indicators of a sentence's polarity (positive, neutral, or negative). POS tags help us identify such syntactic structures easily.
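
Both uses can be sketched in a few lines of plain Python. The tags for the two "like" sentences are assigned by hand for illustration (they are not tagger output), the tags for the CHIP sentence are copied from the table above, and the helper names `tags_of` and `words_with_tag_prefix` are our own rather than part of NLTK.

```python
# Hand-assigned Penn Treebank tags for the two "like" sentences
# (illustrative only, not the output of a tagger).
sentence_a = [("I", "PRP"), ("like", "VBP"), ("playing", "VBG"),
              ("video", "NN"), ("games", "NNS")]
sentence_b = [("Peter", "NNP"), ("looks", "VBZ"), ("like", "IN"),
              ("his", "PRP$"), ("grandfather", "NN")]

# Tags from the table above for the CHIP sentence (punctuation omitted).
chip_sentence = [
    ("The", "DT"), ("field", "NN"), ("of", "IN"), ("health", "NN"),
    ("IT", "NNP"), ("is", "VBZ"), ("diverse", "JJ"), ("rapidly", "RB"),
    ("changing", "VBG"), ("and", "CC"), ("covers", "VBZ"),
    ("numerous", "JJ"), ("areas", "NNS"), ("of", "IN"), ("scholarship", "NN"),
]


def tags_of(word, tagged_sentence):
    """Return every tag the given word carries in a tagged sentence."""
    return [tag for token, tag in tagged_sentence
            if token.lower() == word.lower()]


def words_with_tag_prefix(tagged_sentence, prefixes):
    """Return the words whose tag starts with any of the given prefixes."""
    return [token for token, tag in tagged_sentence
            if tag.startswith(prefixes)]


# 1) Same word, different grammatical role.
like_in_a = tags_of("like", sentence_a)
like_in_b = tags_of("like", sentence_b)

# 2) Adjectives (JJ*) and adverbs (RB*), and all noun tags (NN*).
modifiers = words_with_tag_prefix(chip_sentence, ("JJ", "RB"))
nouns = words_with_tag_prefix(chip_sentence, ("NN",))

print(like_in_a, like_in_b)
print(modifiers)
print(nouns)
```

Matching on tag prefixes keeps the filter short, because the Penn Treebank tagset groups related categories under a shared prefix (for example `NN`, `NNS`, and `NNP` for nouns).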

Next, we get a taste of parsing a text with NLTK by completing the following tasks:
* Tag each word with its syntactic category
* Extract noun, verb, and prepositional phrases
* Parse a sentence to explore the dependency relations among its words


In [1]:
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser
from nltk.parse import CoreNLPParser
from IPython.display import display
from Util.pos_ner_helper import getNP

TEXT_SAMPLE = """ Health informatics is information engineering applied to the field of health care, 
                essentially the management and use of patient health care information """


# Tokenize the text
text_tokens = word_tokenize(TEXT_SAMPLE)

# Tag each token with its part of speech
tags = pos_tag(text_tokens)
print("The pos tags of the given text: {}".format(tags))

# Noun phrase chunk grammar:
#   rule 1: optional determiner, one or more adjectives, then a noun
#   rule 2: two or more consecutive nouns
np_phrase = r"""
            NP: {<DT>?<JJ>+<NN.*>}
                {<NN.*>{2,}}
            """
# Build a RegexpParser from the grammar, chunk the tagged tokens,
# and draw the resulting tree
parser = RegexpParser(np_phrase)
parser_res = parser.parse(tags)
print("The parse result is drawn here")
parser_res.draw()

# Collect the noun phrases from the parse result
NP_list = getNP(parser_res)
print("The extracted noun phrases are {}".format(NP_list))

The pos tags of the given text: [('Health', 'NNP'), ('informatics', 'NNS'), ('is', 'VBZ'), ('information', 'NN'), ('engineering', 'NN'), ('applied', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('field', 'NN'), ('of', 'IN'), ('health', 'NN'), ('care', 'NN'), (',', ','), ('essentially', 'RB'), ('the', 'DT'), ('management', 'NN'), ('and', 'CC'), ('use', 'NN'), ('of', 'IN'), ('patient', 'NN'), ('health', 'NN'), ('care', 'NN'), ('information', 'NN')]
The parse result is drawn here
The extracted noun phrases are ['Health informatics', 'information engineering', 'health care', 'patient health care information']
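
The `getNP` helper is imported from `Util.pos_ner_helper`, whose source is not shown here. The sketch below shows what such a helper might look like, assuming it simply joins the words of every `NP` subtree. The function name `get_np` is hypothetical, and the tags are copied from the output above so that no tagger model is needed to run it.

```python
from nltk import RegexpParser

# POS tags copied from the printed output above.
tags = [
    ("Health", "NNP"), ("informatics", "NNS"), ("is", "VBZ"),
    ("information", "NN"), ("engineering", "NN"), ("applied", "VBN"),
    ("to", "TO"), ("the", "DT"), ("field", "NN"), ("of", "IN"),
    ("health", "NN"), ("care", "NN"), (",", ","), ("essentially", "RB"),
    ("the", "DT"), ("management", "NN"), ("and", "CC"), ("use", "NN"),
    ("of", "IN"), ("patient", "NN"), ("health", "NN"), ("care", "NN"),
    ("information", "NN"),
]

# Same chunk grammar as in the cell above.
np_grammar = r"""
    NP: {<DT>?<JJ>+<NN.*>}
        {<NN.*>{2,}}
"""


def get_np(tree):
    """Hypothetical stand-in for getNP: join the words of each NP chunk."""
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]


chunk_tree = RegexpParser(np_grammar).parse(tags)
noun_phrases = get_np(chunk_tree)
print(noun_phrases)
```

Note that a single noun such as "field" is not extracted, because the second rule requires at least two consecutive nouns and the first rule requires an adjective.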


#### Constituency and Dependency Relations
A text can be analyzed at various levels (e.g., phrase or sentence). 
Phrase-level analysis groups words into constituents, such as the noun phrases extracted above. 
Sentence-level interpretation requires analyzing the lexical relationships among words, known as ***dependency relations***.

When we look at a relationship, we tend to care about which party is dominant and which is subordinate. The same applies to the relations among words in a sentence. We call the word that plays the central role and organizes the other words (e.g., the main noun of a noun phrase, or a verb) the ***head***, and we call the remaining words, which depend on the head by modifying or complementing its meaning, ***modifiers***. 

In [2]:
# Code in C&D relationship
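
As a minimal sketch of head and modifier relations, the code below loads a hand-written dependency parse of "Peter looks like his grandfather" into NLTK's `DependencyGraph` and lists each (head, relation, dependent) triple. The parse and its relation labels are our own annotation for illustration, not the output of a parser. The CoreNLP-based parsers in `nltk.parse` can produce such parses automatically, but they need a running CoreNLP server.

```python
from nltk.parse import DependencyGraph

# Hand-annotated parse, one token per line: word, POS tag, head index, relation.
# Head index 0 marks the root of the sentence; indices start at 1.
# The annotation is an assumption made for this example.
conll_parse = "\n".join([
    "Peter NNP 2 nsubj",
    "looks VBZ 0 ROOT",
    "like IN 5 case",
    "his PRP$ 5 nmod:poss",
    "grandfather NN 2 obl",
])

graph = DependencyGraph(conll_parse)

# The root is the head that organizes the whole sentence.
root_word = graph.root["word"]
print("Root:", root_word)

# Each triple pairs a head with one of its dependents.
triples = list(graph.triples())
for (head, _head_tag), relation, (dependent, _dep_tag) in triples:
    print("{:12} --{}--> {}".format(head, relation, dependent))
```

Here the verb "looks" is the head of the sentence, while "grandfather" is both a dependent of "looks" and the head of "like" and "his".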



### Named entity recognition
In the previous section, we introduced POS tagging and experimented with it using NLTK. We also learned the basics of constituency and dependency relations.

In this section, we introduce an important application of POS tagging: named entity recognition (NER). 
Named entities are references to particular types of objects, such as organizations, persons, and locations. 
The following table shows some examples.

|NE Type|Examples|
|------|------|
|ORGANIZATION|Georgia-Pacific Corp., WHO|
|PERSON|Eddy Bonte, President Obama|
|LOCATION|Murray River, Mount Everest|
|DATE|June, 2008-06-29|
|TIME|two fifty a m, 1:30 p.m.|
|MONEY|175 million Canadian Dollars, GBP 10.40|
|PERCENT|twenty pct, 18.75 %|
|FACILITY|Washington Monument, Stonehenge|
|GPE|South East Asia, Midlothian|

Source: https://www.nltk.org/book/ch07.html

See here for code examples

In [2]:
# Code for NER
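
A minimal sketch of reading entities out of an NER result is given below. In NLTK, `nltk.ne_chunk` takes POS-tagged tokens and returns a tree in which each recognized entity is a subtree labeled with its type; running it requires downloaded NLTK models, so this example builds such a tree by hand instead. The sentence, its tags, its entity labels, and the helper name `extract_entities` are all assumptions made for illustration.

```python
from nltk import Tree


def extract_entities(ne_tree):
    """Return (entity type, entity text) pairs from a chunked NE tree."""
    entities = []
    for node in ne_tree:
        # Entities are subtrees; ordinary tokens are plain (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built stand-in for the kind of tree that
# nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))) returns.
ne_tree = Tree("S", [
    Tree("PERSON", [("Eddy", "NNP"), ("Bonte", "NNP")]),
    ("visited", "VBD"),
    ("the", "DT"),
    Tree("FACILITY", [("Washington", "NNP"), ("Monument", "NNP")]),
    ("yesterday", "NN"),
])

entities = extract_entities(ne_tree)
print(entities)
```

Because the entity subtrees wrap POS-tagged tokens, the same tree-walking approach used for noun phrase chunks carries over to NER output.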

