# Syntactic processing

Syntactic processing covers three tasks:
- POS tagging (Part of Speech): Tag each word with its grammatical category, e.g. verb, adverb, etc.
- NER (Named Entity Recognition): Recognize names such as corporations or people.
- Text parsing: Analyze a string of words (e.g. a sentence), resulting in a parse tree that shows their syntactic relation to each other.

In [2]:
review = """I purchased this Dell monitor because of budgetary concerns. This item was the most inexpensive 17 inch Apple monitor 
available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the 
screen  wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different 
monitor models since I 'm a college student at UPM in Madrid and this particular monitor had as poor of picture quality as 
any I 've seen."""

tweet = """@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to 
        her music!!!! WOW!!! #ladygaga #britney"""

### POS tagging
To tag the words, a tagger trained on an annotated corpus, such as the Penn Treebank, is used. The tag set depends on the annotated corpus.

In [3]:
from nltk import pos_tag, word_tokenize

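We can also use the WordNetLemmatizer, which accepts a part of speech for four categories: ADJ, ADV, NOUN and VERB.

In the following example, we POS-tag the review with the universal tagset and lemmatize every token whose tag appears in pos_mapping, passing the mapped WordNet category to the lemmatizer. Tags that WordNet does not cover (ADP, CONJ, PRON, NUM, X) fall back to 'n', and tokens with any other tag (e.g. DET, PRT or punctuation) are dropped.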

In [4]:
from nltk.stem import WordNetLemmatizer

review_postagged = pos_tag(word_tokenize(review), tagset='universal')
pos_mapping = {'NOUN': 'n', 'ADJ': 'a', 'VERB': 'v', 'ADV': 'r', 'ADP': 'n', 'CONJ': 'n', 
               'PRON': 'n', 'NUM': 'n', 'X': 'n' }

wordnet = WordNetLemmatizer()
lemmas = [wordnet.lemmatize(w, pos=pos_mapping[tag]) for (w, tag) in review_postagged if tag in pos_mapping]
print(lemmas)

['I', 'purchase', 'Dell', 'monitor', 'because', 'of', 'budgetary', 'concern', 'item', 'be', 'most', 'inexpensive', '17', 'inch', 'Apple', 'monitor', 'available', 'me', 'at', 'time', 'I', 'make', 'purchase', 'My', 'overall', 'experience', 'with', 'monitor', 'be', 'very', 'poor', 'When', 'screen', 'be', "n't", 'contract', 'or', 'glitching', 'overall', 'picture', 'quality', 'be', 'poor', 'fair', 'I', "'ve", 'view', 'numerous', 'different', 'monitor', 'model', 'since', 'I', "'m", 'college', 'student', 'at', 'UPM', 'in', 'Madrid', 'and', 'particular', 'monitor', 'have', 'a', 'poor', 'of', 'picture', 'quality', 'a', 'I', "'ve", 'see']
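
The cell above starts from universal tags. When the tagger returns Penn Treebank tags instead (the default for pos_tag), a common idiom is to map each tag to a WordNet category by its prefix. The sketch below is only an illustration: the helper name penn_to_wordnet is hypothetical, and falling back to 'n' is an assumption that mirrors the lemmatizer's default part of speech.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):    # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):    # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS -> adverb (RP, a particle, is excluded)
        return "r"
    return "n"                 # nouns and everything else fall back to noun


for tag in ["VBD", "JJ", "RBS", "NNS", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can be passed as the pos argument of the lemmatizer's lemmatize method, in the same way pos_mapping[tag] is used above.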


### NER
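Identify entities such as places, organizations and people. The chunker works on top of the POS tags produced by pos_tag, which by default are Penn Treebank tags (PRP, VBD, NNP, ...), as the output below shows.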

In [5]:
from nltk import ne_chunk, pos_tag, word_tokenize
ne_tagged = ne_chunk(pos_tag(word_tokenize(review)), binary=False)
print(ne_tagged)

(S
  I/PRP
  purchased/VBD
  this/DT
  (ORGANIZATION Dell/NNP)
  monitor/NN
  because/IN
  of/IN
  budgetary/JJ
  concerns/NNS
  ./.
  This/DT
  item/NN
  was/VBD
  the/DT
  most/RBS
  inexpensive/JJ
  17/CD
  inch/NN
  Apple/NNP
  monitor/NN
  available/JJ
  to/TO
  me/PRP
  at/IN
  the/DT
  time/NN
  I/PRP
  made/VBD
  the/DT
  purchase/NN
  ./.
  My/PRP$
  overall/JJ
  experience/NN
  with/IN
  this/DT
  monitor/NN
  was/VBD
  very/RB
  poor/JJ
  ./.
  When/WRB
  the/DT
  screen/NN
  was/VBD
  n't/RB
  contracting/VBG
  or/CC
  glitching/VBG
  the/DT
  overall/JJ
  picture/NN
  quality/NN
  was/VBD
  poor/JJ
  to/TO
  fair/VB
  ./.
  I/PRP
  've/VBP
  viewed/VBN
  numerous/JJ
  different/JJ
  monitor/NN
  models/NNS
  since/IN
  I/PRP
  'm/VBP
  a/DT
  college/NN
  student/NN
  at/IN
  (ORGANIZATION UPM/NNP)
  in/IN
  (GPE Madrid/NNP)
  and/CC
  this/DT
  particular/JJ
  monitor/NN
  had/VBD
  as/IN
  poor/JJ
  of/IN
  picture/NN
  quality/NN
  as/IN
  any/DT
  I/PRP
  've/VBP
  see
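
To work with the recognized entities rather than the printed tree, one option is to collect the labelled subtrees. The sketch below assumes the tree has the shape shown above (entity chunks directly under the root, plain tokens as (word, tag) tuples). The function name extract_entities is hypothetical, and the sample tree is built by hand from a fragment of the output so that no models are needed.

```python
from nltk import Tree


def extract_entities(ne_tree):
    """Return (label, text) pairs for the entity chunks directly under the root."""
    entities = []
    for node in ne_tree:
        # Entity chunks are Tree objects; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built fragment mirroring part of the output above
sample_tree = Tree("S", [
    ("student", "NN"), ("at", "IN"),
    Tree("ORGANIZATION", [("UPM", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("Madrid", "NNP")]),
])
print(extract_entities(sample_tree))
# [('ORGANIZATION', 'UPM'), ('GPE', 'Madrid')]
```

The same function can be applied to ne_tagged from the cell above.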

## Parsing and Chunking
- Obtain a parse tree for a sentence given a grammar. Useful for understanding the relationships among words.
- The result is either a full parse tree or a shallow parse (chunking).


### Shift-reduce parser
The parser moves the words of the text one by one onto a stack (shift). Whenever the items at the top of the stack match the right-hand side of a grammar rule, they are replaced by a single item, the left-hand side of that rule (reduce). Reduction only applies to the top of the stack, so items lower in the stack must be reduced before a new word is pushed. The parser finishes when all the input has been consumed and everything rests under a single item, the S item.

In [6]:
from nltk.app import srparser_app
# srparser_app.app()
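
The shift and reduce steps described above can be sketched in plain Python. This is a minimal, greedy illustration and not NLTK's implementation: the grammar TOY_RULES and the function shift_reduce are hypothetical, the parser works on POS tags only, and it never backtracks, so it can fail on sentences that a full parser would accept.

```python
# Hypothetical toy grammar: right-hand side (tuple of labels) -> left-hand side.
TOY_RULES = {
    ("NP", "VP"): "S",
    ("DT", "NN"): "NP",
    ("VBD", "NP"): "VP",
}


def shift_reduce(tagged_words, rules=TOY_RULES):
    """Greedy shift-reduce over POS tags; returns the final stack and the action trace."""
    stack, trace = [], []
    for word, tag in tagged_words:
        stack.append(tag)                        # shift: push the next item
        trace.append(("shift", list(stack)))
        reduced = True
        while reduced:                           # reduce while the top of the stack matches a rule
            reduced = False
            for rhs, lhs in rules.items():
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]        # pop the matched items
                    stack.append(lhs)            # push the rule's left-hand side
                    trace.append(("reduce", list(stack)))
                    reduced = True
                    break
    return stack, trace


dog_sentence = [("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN")]
final_stack, actions = shift_reduce(dog_sentence)
for action, snapshot in actions:
    print(action, snapshot)
print(final_stack)  # ['S']
```

The parse succeeds when the final stack holds the single item S.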

### Chunking
Chunking is the process of grouping the tokens of POS-tagged sentences (a list of lists of (word, tag) tuples) into larger units, which exposes the relationships among words in a text.
To understand it, we begin with Noun Phrase chunking.

#### Noun Phrase chunking
Also known as NP-chunking, it searches for chunks corresponding to individual noun phrases. An NP chunk can be as small as a single noun. NP chunks do not contain other NP chunks, so prepositional phrases or subordinate clauses that modify the noun are left out.
To use an NP chunker (or to create one), a chunk grammar must be defined first, consisting of the rules that will chunk the sentence:

In [7]:
import nltk
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
# Create chunk parser
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


In [8]:
# result.draw()

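In the example above, the grammar is a tag pattern, which describes sequences of tagged words. In a tag pattern each tag is delimited by angle brackets, and the usual regular expression operators apply: <DT>?<JJ>*<NN> matches an optional determiner, followed by any number of adjectives, followed by a noun.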
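
To see how the optional and repeated parts of the tag pattern behave, the sketch below applies the same grammar to a few short, hand-tagged inputs. The helper np_words and the example phrases are made up for illustration.

```python
import nltk

pattern_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")


def np_words(tagged):
    """Return the words of every NP chunk found in a tagged sequence (hypothetical helper)."""
    tree = pattern_parser.parse(tagged)
    return [[word for word, tag in st.leaves()]
            for st in tree.subtrees() if st.label() == "NP"]


# A bare noun: <DT>? and <JJ>* both match nothing
print(np_words([("dog", "NN")]))
# Adjective plus noun, no determiner
print(np_words([("yellow", "JJ"), ("dog", "NN")]))
# Determiner, two adjectives and a noun
print(np_words([("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN")]))
# <NN> does not match the plural tag NNS, so nothing is chunked
print(np_words([("the", "DT"), ("dogs", "NNS")]))
```

To also cover plural and proper nouns, the last element of the pattern can be written as <NN.*>.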

#### Chunking with Regular Expressions
To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

In [9]:
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))


(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
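
One consequence of applying a rule from left to right is that, when a pattern can match at overlapping positions, the leftmost match wins. The short sketch below illustrates this with three consecutive nouns; the variable names are made up for the example.

```python
import nltk

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

pair_parser = nltk.RegexpParser("NP: {<NN><NN>}")  # exactly two consecutive nouns
run_parser = nltk.RegexpParser("NP: {<NN>+}")      # one or more consecutive nouns

pair_tree = pair_parser.parse(nouns)
run_tree = run_parser.parse(nouns)

print(pair_tree)  # the first two nouns are chunked, the third is left outside
print(run_tree)   # all three nouns end up in a single chunk
```

A more permissive rule such as <NN>+ avoids leaving the last noun outside the chunk.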


#### Chinking
Sometimes it is easier to define what we want to exclude from a chunk. A chink is a sequence of tokens that is not included in a chunk, and chinking is the process of removing such a sequence from an existing chunk.

In [10]:
# Example:
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
# Output
print(cp.parse(sentence))

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


Observe that barked/VBD and at/IN (the chink that was defined) no longer hang from an NP: chinking has excluded them.

#### Tags vs Trees (representing Chunks)
- Chunk structures can be represented using tags or trees
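- For tags, the IOB representation is used. IOB stands for Inside, Outside and Begin
- Each token gets a special chunk tag: B if it begins a chunk, I if it is inside a chunk, O if it is outside any chunk. B and I carry the chunk type as a suffix, e.g. B-NP and I-NP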
- IOB is used as the standard way to represent chunk structures in files
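
As a small illustration of the two representations, the sketch below chunks the dog sentence used earlier, converts the tree to IOB tags with tree2conlltags, and converts the tags back with conlltags2tree. The variable names are made up for the example.

```python
import nltk
from nltk.chunk import tree2conlltags, conlltags2tree

tagged_dog = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
              ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Tree representation
chunk_tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged_dog)

# Tag representation: one (word, POS tag, IOB chunk tag) triple per token
iob_tags = tree2conlltags(chunk_tree)
for word, pos, chunk_tag in iob_tags:
    print(word, pos, chunk_tag)

# Back to a tree
restored_tree = conlltags2tree(iob_tags)
print(restored_tree)
```

The first token of each noun phrase is tagged B-NP, the remaining tokens of the phrase I-NP, and barked and at are tagged O.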

#### Quick recap of chunking or shallow parsing
Chunking aims at extracting relevant parts of the sentence. There are two ways of approaching it:
- Regular expressions based on POS tags
- Training a chunk parser

The approach based on POS tags is shown below.

In [11]:
from nltk.chunk.regexp import *
pattern = """NP: {<PRON><ADJ><NOUN>+} 
                 {<DET>?<ADV>?<ADJ|NUM>*?<NOUN>+}
                 """
NPChunker = RegexpParser(pattern)
reviews_pos = (pos_tag(word_tokenize(review), tagset='universal'))
chunks_np = NPChunker.parse(reviews_pos)
print(chunks_np)

(S
  I/PRON
  purchased/VERB
  (NP this/DET Dell/NOUN monitor/NOUN)
  because/ADP
  of/ADP
  (NP budgetary/ADJ concerns/NOUN)
  ./.
  (NP This/DET item/NOUN)
  was/VERB
  (NP
    the/DET
    most/ADV
    inexpensive/ADJ
    17/NUM
    inch/NOUN
    Apple/NOUN
    monitor/NOUN)
  available/ADJ
  to/PRT
  me/PRON
  at/ADP
  (NP the/DET time/NOUN)
  I/PRON
  made/VERB
  (NP the/DET purchase/NOUN)
  ./.
  (NP My/PRON overall/ADJ experience/NOUN)
  with/ADP
  (NP this/DET monitor/NOUN)
  was/VERB
  very/ADV
  poor/ADJ
  ./.
  When/ADV
  (NP the/DET screen/NOUN)
  was/VERB
  n't/ADV
  contracting/VERB
  or/CONJ
  glitching/VERB
  (NP the/DET overall/ADJ picture/NOUN quality/NOUN)
  was/VERB
  poor/ADJ
  to/PRT
  fair/VERB
  ./.
  I/PRON
  've/VERB
  viewed/VERB
  (NP numerous/ADJ different/ADJ monitor/NOUN models/NOUN)
  since/ADP
  I/PRON
  'm/VERB
  (NP a/DET college/NOUN student/NOUN)
  at/ADP
  (NP UPM/NOUN)
  in/ADP
  (NP Madrid/NOUN)
  and/CONJ
  (NP this/DET particular/ADJ monitor/NOU

In [12]:
#chunks_np.draw()

In [13]:
# Obtain the subtrees labelled with the given category:
def extractTrees(parsed_tree, category = 'NP'):
    return list(parsed_tree.subtrees(filter = lambda x: x.label() == category))
extractTrees(chunks_np, 'NP')

[Tree('NP', [('this', 'DET'), ('Dell', 'NOUN'), ('monitor', 'NOUN')]),
 Tree('NP', [('budgetary', 'ADJ'), ('concerns', 'NOUN')]),
 Tree('NP', [('This', 'DET'), ('item', 'NOUN')]),
 Tree('NP', [('the', 'DET'), ('most', 'ADV'), ('inexpensive', 'ADJ'), ('17', 'NUM'), ('inch', 'NOUN'), ('Apple', 'NOUN'), ('monitor', 'NOUN')]),
 Tree('NP', [('the', 'DET'), ('time', 'NOUN')]),
 Tree('NP', [('the', 'DET'), ('purchase', 'NOUN')]),
 Tree('NP', [('My', 'PRON'), ('overall', 'ADJ'), ('experience', 'NOUN')]),
 Tree('NP', [('this', 'DET'), ('monitor', 'NOUN')]),
 Tree('NP', [('the', 'DET'), ('screen', 'NOUN')]),
 Tree('NP', [('the', 'DET'), ('overall', 'ADJ'), ('picture', 'NOUN'), ('quality', 'NOUN')]),
 Tree('NP', [('numerous', 'ADJ'), ('different', 'ADJ'), ('monitor', 'NOUN'), ('models', 'NOUN')]),
 Tree('NP', [('a', 'DET'), ('college', 'NOUN'), ('student', 'NOUN')]),
 Tree('NP', [('UPM', 'NOUN')]),
 Tree('NP', [('Madrid', 'NOUN')]),
 Tree('NP', [('this', 'DET'), ('particular', 'ADJ'), ('monitor',

In [14]:
def extractStrings(parsed_tree, category = 'NP'):
    # Join the words of each extracted subtree into a single string
    return [" ".join(word for word, pos in subtree.leaves()) for subtree in extractTrees(parsed_tree, category)]
extractStrings(chunks_np, 'NP')

['this Dell monitor',
 'budgetary concerns',
 'This item',
 'the most inexpensive 17 inch Apple monitor',
 'the time',
 'the purchase',
 'My overall experience',
 'this monitor',
 'the screen',
 'the overall picture quality',
 'numerous different monitor models',
 'a college student',
 'UPM',
 'Madrid',
 'this particular monitor',
 'picture quality']