REMEMBER TO TAKE:
1. remember to standardize terminologies
2. add reference in correct formats

### POS tagging and NER
This section covers the following topics.

#### Goal
* Learn the basics of POS tagging
    1. Syntactic structure
    2. Penn Treebank and universal tagsets
    3. Why we need POS tagging
* Constituency and dependency relations
    1. Constituency
    2. Dependency
* Named entity recognition (NER)
* Code examples
    1. Use POS tags to identify key phrases (noun phrases, verb phrases)
    2. Use NLTK to automatically POS-tag words and parse dependency relations
    3. Use NLTK to perform named entity recognition
* Implementation
    1. Simple QA system




#### 1. Part of Speech Tagging Basics
Part-of-speech (POS) tagging is one of the most fundamental tasks in NLP. It aims to identify the grammatical function that each word plays in a sentence, such as noun, verb, or adjective.  
For example, suppose we want to tag the following sentence.  
*The field of health IT is diverse, rapidly changing, and covers numerous areas of scholarship* -- CHIP website

|Word|POS tags|Meaning|Word|POS tags|Meaning|
|------|------|------|------|------|------|
|The|DT|Determiner|changing|VBG|Verb, gerund or present participle|
|field|NN|Noun, singular or mass|and|CC|Coordinating conjunction|
|of|IN|Preposition or subordinating conjunction|covers|VBZ|Verb, 3rd person singular present|
|health|NN|Noun, singular or mass|numerous|JJ|Adjective|
|IT|NNP|Proper noun, singular|areas|NNS|Noun, plural|
|is|VBZ|Verb, 3rd person singular present|of|IN|Preposition or subordinating conjunction|
|diverse|JJ|Adjective|scholarship|NN|Noun, singular or mass|
|rapidly|RB|Adverb||||
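
The Penn Treebank tags in the table are fine-grained. As a minimal sketch, the snippet below collapses them into coarse universal categories. The mapping dictionary is hand-written for this example and covers only the tags in the table; it follows the convention of NLTK's `universal` tagset, which groups proper nouns under `NOUN`. The function name `to_universal` and the `"X"` fallback for unlisted tags are assumptions made for illustration.

```python
# Hand-written mapping for the tags used in the table above (not a complete tagset).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "IN": "ADP",
    "VBZ": "VERB", "VBG": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "CC": "CONJ",
}


def to_universal(tagged_tokens):
    """Map (word, Penn tag) pairs to (word, coarse tag) pairs.

    Tags missing from the dictionary fall back to "X" (illustrative choice).
    """
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]


# Tags copied by hand from the table above.
tagged = [("The", "DT"), ("field", "NN"), ("of", "IN"), ("health", "NN"),
          ("IT", "NNP"), ("is", "VBZ"), ("diverse", "JJ")]
print(to_universal(tagged))
```

With NLTK installed, `pos_tag(tokens, tagset="universal")` returns coarse tags directly, so a hand-written mapping like this is only needed for illustration.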

This looks neat, but why do we need POS tagging? Tokenization already gives us the words in a text, so why do we also need information about their grammatical roles? In fact, POS tagging helps with many tasks, such as:

1) Word sense disambiguation
A word can have multiple senses. For example, in "I like playing video games" and "Peter looks like his grandfather", the word "like" is used as a verb and as a preposition, respectively. POS tags let us identify this difference directly, rather than inferring it from the meaning of the surrounding text, which is often more computationally expensive.  

2) Identifying key components in a text
A noun phrase often represents the subject or object of a sentence and thus points to the informative content of a text. Another example is sentiment analysis, which relies heavily on adjectives and adverbs as indicators of a sentence's polarity (positive, neutral, or negative). POS tags help us identify such syntactic structures easily.
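
To make the word sense disambiguation point concrete, here is a minimal sketch that picks a sense for "like" purely from its POS tag. The two sentences are tagged by hand (an assumption; in practice the tags would come from a tagger), and the `SENSES` table and `sense_of` function are hypothetical names introduced only for this illustration.

```python
# Toy sense inventory keyed by (word, POS tag); the glosses are illustrative only.
SENSES = {
    ("like", "VBP"): "to enjoy",
    ("like", "IN"): "similar to",
}

# Hand-tagged versions of the two example sentences.
sentence_1 = [("I", "PRP"), ("like", "VBP"), ("playing", "VBG"),
              ("video", "NN"), ("games", "NNS")]
sentence_2 = [("Peter", "NNP"), ("looks", "VBZ"), ("like", "IN"),
              ("his", "PRP$"), ("grandfather", "NN")]


def sense_of(word, tagged_sentence):
    """Return the gloss for the first occurrence of `word`, chosen by its tag."""
    for token, tag in tagged_sentence:
        if token.lower() == word:
            return SENSES.get((word, tag))
    return None


print(sense_of("like", sentence_1))  # to enjoy
print(sense_of("like", sentence_2))  # similar to
```

The lookup never inspects the neighboring words: the tag alone separates the two uses, which is the point made above.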

Next, we get a first taste of parsing text with NLTK by completing the following tasks:
* Tag each word with its syntactic category
* Extract noun, verb, and prepositional phrases
* Parse a sentence to explore the dependency relations among its words

#### Examples

In [3]:
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser
from nltk.parse import CoreNLPParser
from Util.pos_ner_helper import getNP

TEXT_SAMPLE = """ Health informatics is information engineering applied to the field of health care, 
                essentially the management and use of patient health care information """


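# Tokenize the text (word_tokenize and pos_tag need NLTK data; see nltk.download())
text_tokens = word_tokenize(TEXT_SAMPLE)

# Tag each token with its part of speech
tags = pos_tag(text_tokens)
print("The POS tags of the given text: {}".format(tags))

# Noun phrase chunk grammar:
#   rule 1: optional determiner, one or more adjectives, then a noun
#   rule 2: two or more consecutive nouns
np_phrase = r"""
            NP: {<DT>?<JJ>+<NN.*>}
                {<NN.*>{2,}}
            """
# RegexpParser chunks the tagged tokens according to the grammar
parser = RegexpParser(np_phrase)
parser_res = parser.parse(tags)
print("The parse result is drawn here")
parser_res.draw()  # opens a separate window showing the chunk tree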

# Collect noun phrase from the parsing results
NP_list = getNP(parser_res)
print("The extracted noun phrases are {}".format(NP_list))

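
The chunk grammar in the cell above is a regular expression over tag sequences. The following sketch reproduces that idea with only the standard library, so it runs without NLTK. It is an approximation, not NLTK's implementation and not the `getNP` helper (whose code is not shown here): the two rules are merged into one alternation and matched left to right, and the input tags are assigned by hand. The names `NP_PATTERN` and `chunk_noun_phrases` are hypothetical.

```python
import re

# Both rules of the chunk grammar, merged into one pattern over "<TAG>" strings:
#   optional determiner + one or more adjectives + a noun, OR two or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)+<NN[^>]*>|(?:<NN[^>]*>){2,}")


def chunk_noun_phrases(tagged_tokens):
    """Return noun phrases found by matching NP_PATTERN against the tag sequence."""
    tag_string = ""
    offset_to_index = {}  # character offset of each "<TAG>" -> token index
    for index, (_, tag) in enumerate(tagged_tokens):
        offset_to_index[len(tag_string)] = index
        tag_string += "<{}>".format(tag)

    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        start = offset_to_index[match.start()]
        length = match.group(0).count("<")  # number of tags in the match
        words = [word for word, _ in tagged_tokens[start:start + length]]
        phrases.append(" ".join(words))
    return phrases


# Hand-assigned tags for part of the sample text (assumed, not tagger output).
tagged = [("Health", "NN"), ("informatics", "NNS"), ("is", "VBZ"),
          ("information", "NN"), ("engineering", "NN"), ("applied", "VBN"),
          ("to", "TO"), ("the", "DT"), ("field", "NN"), ("of", "IN"),
          ("health", "NN"), ("care", "NN")]
print(chunk_noun_phrases(tagged))
# ['Health informatics', 'information engineering', 'health care']
```

Note that "the field" is not extracted: the first rule requires at least one adjective, and a single noun does not satisfy the second rule.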

#### 2. Constituency and Dependency relation
Syntax denotes the arrangement of words and phrases to create well-formed sentences in a language. 
We need to be clear about the sentence's structure before conducting analysis. 
To make it concrete, let's take the following sentence as an example.

"I shot an elephant in my pajamas."

This sentence can be interpreted in two ways, although one of the meanings may sound unreasonable. Each meaning corresponds to its own syntactic structure, as shown below.

a. I shot an elephant while I wore my pajamas.

<img src="http://www.nltk.org/book/tree_images/ch08-tree-1.png" style="height:300px">

b. The elephant was in my pajamas when I shot it.

<img src="http://www.nltk.org/book/tree_images/ch08-tree-2.png" style="height:300px">

The trees above show the ***constituent structure***, which is based on the observation that words combine with other words to form units. In meaning ***a***, the prepositional phrase (PP) "in my pajamas" and the verb phrase (VP) "shot an elephant" are separate units that together form a larger unit, which then combines with the noun phrase "I". We can interpret the sentence as saying that the action "shot an elephant" was performed by the subject "I" under the condition "in my pajamas". Similarly, in meaning ***b***, "an elephant in my pajamas" is a single unit, which evokes the image of an elephant wearing my pajamas.
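
The difference between the two readings can be checked mechanically. The sketch below writes each reading as nested tuples of the form `(label, child, ...)` and lists every constituent together with the words it spans. The bracketings are transcribed by hand from the description above, and the tuple encoding and function names are assumptions for illustration, not NLTK's `Tree` API.

```python
# Reading a: "in my pajamas" attaches to the verb phrase.
TREE_A = ("S", ("NP", "I"),
          ("VP",
           ("VP", ("V", "shot"), ("NP", ("Det", "an"), ("N", "elephant"))),
           ("PP", ("P", "in"), ("NP", ("Det", "my"), ("N", "pajamas")))))

# Reading b: "in my pajamas" attaches to the noun phrase "an elephant".
TREE_B = ("S", ("NP", "I"),
          ("VP", ("V", "shot"),
           ("NP", ("Det", "an"), ("N", "elephant"),
            ("PP", ("P", "in"), ("NP", ("Det", "my"), ("N", "pajamas"))))))


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def constituents(node):
    """Return (label, phrase) pairs for every labeled node in the tree."""
    if isinstance(node, str):
        return []
    found = [(node[0], " ".join(leaves(node)))]
    for child in node[1:]:
        found.extend(constituents(child))
    return found


print(("VP", "shot an elephant") in constituents(TREE_A))            # True
print(("NP", "an elephant in my pajamas") in constituents(TREE_B))   # True
print(("NP", "an elephant in my pajamas") in constituents(TREE_A))   # False
```

Both trees cover the same seven words; only the grouping differs, and that grouping is what carries the difference in meaning.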

The following code uses a recursive descent parser to parse a sentence into a constituent structure.

In [2]:
import nltk

# Define your own grammar (you may expand it by adding more elements)
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked" | "shot"
  NP -> "I" | "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park" | "elephant" | "pajamas"
  P -> "in" | "on" | "by" | "with"
  """)

# Build the parser from the grammar above
rd_parser = nltk.RecursiveDescentParser(grammar1)

# Parse an ambiguous sentence. The parser expects a list of tokens, and every
# token must be covered by the grammar (hence "I" in the NP rule above).
sent1 = "I shot an elephant in my pajamas".split()
for tree in rd_parser.parse(sent1):
    print(tree)

#parsing the following sentence
sent2 = 'Mary saw a cat'.split()
for tree in rd_parser.parse(sent2):
    print(tree)
    
#the following sentence can also be parsed by the grammar1
sent3 = 'Bob walked a dog in a park'.split()
for tree in rd_parser.parse(sent3):
    print(tree)




*P.S. To learn more about context-free grammars and other constituency parsers, please refer to http://www.nltk.org/book/ch08.html*

Phrase structure grammar is concerned with how words and sequences of words combine to form constituents. A distinct and complementary approach, ***dependency grammar***, focusses instead on how words relate to other words. Dependency is a binary asymmetric relation that holds between a head and its dependents. The head of a sentence is usually taken to be the tensed verb, and every other word is either dependent on the sentence head, or connects to it through a path of dependencies.

A dependency representation is a labeled directed graph, where the nodes are the lexical items and the labeled arcs represent dependency relations from heads to dependents. The following figure illustrates a dependency graph, where arrows point from heads to their dependents.

<img src="http://www.nltk.org/images/depgraph0.png" style="height:100px">
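
As a minimal sketch of this representation, the snippet below stores a dependency analysis as a list of (head, dependent) arcs and walks from any word up to the sentence head. The arcs are written by hand for the reading in which "in my pajamas" modifies "shot" (an assumption; arc labels are omitted), and the function names are hypothetical. Using words as node identifiers only works here because no word repeats in the sentence.

```python
# Unlabeled (head, dependent) arcs for "I shot an elephant in my pajamas".
ARCS = [
    ("shot", "I"),
    ("shot", "elephant"),
    ("elephant", "an"),
    ("shot", "in"),
    ("in", "pajamas"),
    ("pajamas", "my"),
]


def find_root(arcs):
    """Return the single word that is a head but never a dependent."""
    heads = {head for head, _ in arcs}
    dependents = {dependent for _, dependent in arcs}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found {}".format(sorted(roots)))
    return roots.pop()


def path_to_root(word, arcs):
    """Follow head links from `word` up to the root of the sentence."""
    head_of = {dependent: head for head, dependent in arcs}
    path = [word]
    while path[-1] in head_of:
        path.append(head_of[path[-1]])
    return path


print(find_root(ARCS))            # shot
print(path_to_root("my", ARCS))   # ['my', 'pajamas', 'in', 'shot']
```

Every word other than the root has exactly one head, so the path from any word to the tensed verb is unique.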


In [None]:
import nltk

#composing the grammar
groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
 'shot' -> 'I' | 'elephant' | 'in'
 'elephant' -> 'an' | 'in'
 'in' -> 'pajamas'
 'pajamas' -> 'my'
 """)

#building the parser
pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)

#parsing the following sentence
sent = 'I shot an elephant in my pajamas'.split()
trees = pdp.parse(sent)
for tree in trees:
    print(tree)

Dependency and constituency trees each have their own pros and cons. For example, a dependency tree captures the relationships between words in a clear and lightweight manner, but it can miss the structural and phrasal information that a constituency tree represents. A constituency tree, in contrast, gives a fuller picture of the sentence but can take more time and space to parse and store in practice (for the same sentence, it usually has several times as many nodes and edges as a dependency tree). It is therefore important to be clear about your goal and to choose between these structures accordingly.
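
The size difference can be illustrated on the example sentence. The sketch below counts nodes and edges in one hand-written constituency tree (nested tuples of the form `(label, child, ...)`) and in one hand-written set of dependency arcs for "I shot an elephant in my pajamas". Both structures and the helper name `count_nodes` are assumptions for illustration; the exact ratio depends on the grammar and the sentence.

```python
# Constituency tree: every phrase and every word is a node.
CONSTITUENCY = ("S", ("NP", "I"),
                ("VP",
                 ("VP", ("V", "shot"), ("NP", ("Det", "an"), ("N", "elephant"))),
                 ("PP", ("P", "in"), ("NP", ("Det", "my"), ("N", "pajamas")))))

# Dependency tree: only the words are nodes, linked by (head, dependent) arcs.
DEPENDENCY_ARCS = [("shot", "I"), ("shot", "elephant"), ("elephant", "an"),
                   ("shot", "in"), ("in", "pajamas"), ("pajamas", "my")]


def count_nodes(node):
    """Count labeled nodes plus word leaves in a nested-tuple tree."""
    if isinstance(node, str):
        return 1
    return 1 + sum(count_nodes(child) for child in node[1:])


constituency_nodes = count_nodes(CONSTITUENCY)
constituency_edges = constituency_nodes - 1  # a tree has one edge fewer than nodes
dependency_edges = len(DEPENDENCY_ARCS)
dependency_nodes = dependency_edges + 1

print(constituency_nodes, constituency_edges)  # 20 19
print(dependency_nodes, dependency_edges)      # 7 6
```

For this sentence the constituency tree is roughly three times the size of the dependency tree.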

*the knowledge about treebanks could be included here*

#### 3. Named entity recognition
In the previous sections, we introduced POS tagging and experimented with it using NLTK. We also learned the basics of constituency and dependency relations.

In this section, we introduce an important application of POS tagging: named entity recognition (NER). 
Named entities are mentions of particular types of objects, such as organizations, persons, locations, and so on. 
The following table shows some examples.

|NE Type|Examples|
|------|------|
|ORGANIZATION|Georgia-Pacific Corp., WHO|
|PERSON|Eddy Bonte, President Obama|
|LOCATION|Murray River, Mount Everest|
|DATE|June, 2008-06-29|
|TIME|two fifty a m, 1:30 p.m.|
|MONEY|175 million Canadian Dollars, GBP 10.40|
|PERCENT|twenty pct, 18.75 %|
|FACILITY|Washington Monument, Stonehenge|
|GPE|South East Asia, Midlothian|

source: https://www.nltk.org/book/ch07.html
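
NER output is commonly represented as one label per token in the BIO scheme: `B-TYPE` begins an entity, `I-TYPE` continues it, and `O` marks tokens outside any entity. The sketch below groups such labels back into entity spans. It does not perform recognition itself: the sentence and its labels are written by hand using entity types from the table, and the function name `bio_to_entities` is hypothetical.

```python
def bio_to_entities(tokens, labels):
    """Group BIO-labeled tokens into (entity type, entity text) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_words:
            entities.append((current_type, " ".join(current_words)))

    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            flush()
            current_words, current_type = [token], label[2:]
        elif label.startswith("I-"):
            current_words.append(token)
        else:  # "O": outside any entity
            flush()
            current_words, current_type = [], None
    flush()
    return entities


tokens = ["President", "Obama", "visited", "Mount", "Everest", "in", "June"]
labels = ["B-PERSON", "I-PERSON", "O", "B-LOCATION", "I-LOCATION", "O", "B-DATE"]
print(bio_to_entities(tokens, labels))
# [('PERSON', 'President Obama'), ('LOCATION', 'Mount Everest'), ('DATE', 'June')]
```

An `I-` label whose type differs from the entity in progress is treated as the start of a new entity, which is one common way to tolerate slightly inconsistent label sequences.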

See here for code examples

In [2]:
# Code for NER using NLTK and spaCy

