## Part-of-Speech (POS) Tagging

### Recall high school grammar: nouns, verbs, adjectives, ...
### There are many more tags (word classes) than just these


| Tag | Word class |
|:-: | :-: |
| CC | Conjunction |
| CD | Cardinal |
| DT | Determiner |
| IN | Preposition |
| JJ | Adjective |
| MD | Modal |
| NN | Noun |
| POS | Possessive |
| PRP | Pronoun |
| RB | Adverb |
| SYM | Symbol |
| VB | Verb |
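
Taggers that use the Penn Treebank tagset emit finer-grained tags such as `NNP`, `NNS`, `VBG`, and `VBP`, which extend the base tags in the table with suffixes. As a rough sketch, a fine-grained tag can be mapped back to its coarse word class by prefix. The helper name `coarse_class` and the prefix-matching rule are illustrative assumptions, not part of NLTK, and the rule is a simplification that sends any tag outside the table to "Other".

```python
# Coarse word classes from the table, keyed by Penn Treebank tag prefix.
COARSE = {
    "CC": "Conjunction", "CD": "Cardinal", "DT": "Determiner",
    "IN": "Preposition", "JJ": "Adjective", "MD": "Modal",
    "NN": "Noun", "POS": "Possessive", "PRP": "Pronoun",
    "RB": "Adverb", "SYM": "Symbol", "VB": "Verb",
}


def coarse_class(tag):
    """Map a fine-grained tag (e.g. 'NNS') to a coarse class (e.g. 'Noun')."""
    # Try longer prefixes first so three-letter keys win over two-letter ones.
    for prefix in sorted(COARSE, key=len, reverse=True):
        if tag.startswith(prefix):
            return COARSE[prefix]
    return "Other"  # punctuation and tags not listed in the table


for tag in ["NNP", "NNS", "VBG", "VBP", "JJ", "."]:
    print(tag, "->", coarse_class(tag))
```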


In [2]:
import nltk

In [7]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


### Recall splitting a sentence into words/tokens

In [8]:
text11 = "Children shouldn't drink a sugary drink before bed."

In [9]:
text13 = nltk.word_tokenize(text11)

### NLTK's POS tagger

In [10]:
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]

## Ambiguity in POS Tagging

- Ambiguity is common in English

In [11]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")

In [12]:
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

- An alternative POS tagging: "Visiting" as an adjective (JJ) rather than a gerund (VBG)

In [13]:
[('Visiting', 'JJ'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

[('Visiting', 'JJ'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]
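
The two readings can be compared token by token. The snippet below is a small illustration in plain Python; the variable names are made up for this example, and the second tagging is the hand-written alternative above, not tagger output.

```python
# Tagging returned by nltk.pos_tag: "Visiting" is a gerund (the act of visiting).
gerund_reading = [
    ("Visiting", "VBG"), ("aunts", "NNS"), ("can", "MD"),
    ("be", "VB"), ("a", "DT"), ("nuisance", "NN"),
]

# Hand-written alternative: "Visiting" is an adjective (aunts who visit).
adjective_reading = [
    ("Visiting", "JJ"), ("aunts", "NNS"), ("can", "MD"),
    ("be", "VB"), ("a", "DT"), ("nuisance", "NN"),
]

# Collect the tokens whose tags differ between the two readings.
differences = [
    (word, tag_a, tag_b)
    for (word, tag_a), (_, tag_b) in zip(gerund_reading, adjective_reading)
    if tag_a != tag_b
]
print(differences)
```

Only one tag changes, yet the sentence means something different in each reading.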

## Parsing Sentence Structure

- Making sense of sentences is easy if they follow a well-defined grammatical structure

In [15]:
text15 = nltk.word_tokenize("Alice loves Bob")

In [29]:
grammar = nltk.CFG.fromstring(""" 
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves' 
""")


In [31]:
parser = nltk.ChartParser(grammar)

In [32]:
trees = parser.parse_all(text15)

In [33]:
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


## Ambiguity in Parsing

### Ambiguity may exist even if sentences are grammatically correct

### I saw the man with a telescope

In [34]:
text16 = nltk.word_tokenize("I saw the man with a telescope")

In [37]:
grammar1 = nltk.data.load("mygrammar1.cfg")

In [38]:
grammar1

<Grammar with 13 productions>

In [39]:
parser = nltk.ChartParser(grammar1)

In [40]:
trees = parser.parse_all(text16)

In [42]:
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (DT the) (N man)))
    (PP (P with) (NP (DT a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (DT the) (N man) (PP (P with) (NP (DT a) (N telescope))))))
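
The file `mygrammar1.cfg` is not shown here. As an assumption, a 13-production grammar like the following would yield the same two trees; it is a reconstruction for illustration, not necessarily the file's actual contents. The sentence is split with `str.split()` so the sketch needs no tokenizer data.

```python
import nltk

# Assumed grammar: the PP can attach to either the VP or the NP.
assumed_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | DT N PP | 'I'
DT -> 'a' | 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

tokens = "I saw the man with a telescope".split()
trees = list(nltk.ChartParser(assumed_grammar).parse(tokens))

print(len(assumed_grammar.productions()), "productions")
for tree in trees:
    print(tree)
```

One tree attaches the prepositional phrase to the verb phrase (the seeing was done with a telescope); the other attaches it to the noun phrase (the man has the telescope).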


## NLTK and Parse Tree Collection

In [44]:
from nltk.corpus import treebank

In [45]:
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]

In [47]:
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
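
A parsed sentence is an `nltk.Tree`, so it can be inspected programmatically. The sketch below rebuilds the tree from the bracketed string printed above with `Tree.fromstring`, which avoids needing the `treebank` corpus download; the variable names are illustrative.

```python
from nltk import Tree

# Bracketed parse copied from the printed output above.
bracketed = """
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
"""

tree = Tree.fromstring(bracketed)

print(tree.label())        # label of the root node
print(tree.leaves()[:2])   # first two tokens
print(tree.pos()[:2])      # first two (token, POS tag) pairs

# Collect the text of every subtree labelled exactly 'NP'.
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(lambda st: st.label() == "NP")
]
print(noun_phrases)
```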


## POS Tagging & Parsing Complexity

### Uncommon usages of words
- The old man the boat

In [48]:
text18 = nltk.word_tokenize("The old man the boat")

In [50]:
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]
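
The tagger reads "old" as an adjective and "man" as a noun, which leaves the sentence without a verb. In the intended reading, "the old" is the subject and "man" is the verb. As a sketch, a tiny hand-written grammar (hypothetical, written only for this sentence) encodes that reading and parses the sentence:

```python
import nltk

# Hypothetical grammar for the intended reading: 'old' is a noun, 'man' a verb.
garden_path_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N
VP -> V NP
DT -> 'The' | 'the'
N -> 'old' | 'boat'
V -> 'man'
""")

tokens = "The old man the boat".split()
trees = list(nltk.ChartParser(garden_path_grammar).parse(tokens))
for tree in trees:
    print(tree)
    print(tree.pos())  # (token, category) pairs under this grammar
```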

### Well-formed sentences may still be meaningless!
- Colorless green ideas sleep furiously

In [51]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")

In [65]:
text19

['Colorless', 'green', 'ideas', 'sleep', 'furiously']

In [52]:
nltk.pos_tag(text19)

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]

In [53]:
string = 'string1'

In [60]:
string.upper()

'STRING1'

In [61]:
string2 = ' white space   '