## DS 7337 - Natural Language Processing

### Author: Brandon Croom

### Homework: 4

In [7]:
# import nltk and other items
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.tag import RegexpTagger
import numpy as np

1.	Run one of the part-of-speech (POS) taggers available in Python. 
 * Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
 * Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. 
 
 Explain your conjecture as to why the tagger might have been less than perfect with this sentence.



In [12]:
# Define a random long and short sentence.
long_sent_input = "Every manager should be able to recite at least ten nursery rhymes backward."
short_sent_input = "Out of the park."

# Tokenize the sentences so the taggers will work
short_sent_tokens = nltk.word_tokenize(short_sent_input)
long_sent_tokens = nltk.word_tokenize(long_sent_input)

# We'll use the regular expression tagger first.
# Define the patterns required for this tagger. Use the initial patterns defined in the NLTK book.
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (escaped '.' so only a literal decimal point matches)
        (r'.*', 'NN')                     # nouns (default)
       ]

reg_expr_tagger = nltk.RegexpTagger(patterns)

print("Long Sentence Parts of Speech: ", reg_expr_tagger.tag(long_sent_tokens))

print("Short Sentence Parts of Speech: ", reg_expr_tagger.tag(short_sent_tokens))

Long Sentence Parts of Speech:  [('Every', 'NN'), ('manager', 'NN'), ('should', 'MD'), ('be', 'NN'), ('able', 'NN'), ('to', 'NN'), ('recite', 'NN'), ('at', 'NN'), ('least', 'NN'), ('ten', 'NN'), ('nursery', 'NN'), ('rhymes', 'VBZ'), ('backward', 'NN'), ('.', 'NN')]
Short Sentence Parts of Speech:  [('Out', 'NN'), ('of', 'NN'), ('the', 'NN'), ('park', 'NN'), ('.', 'NN')]


The regular expression tagger was less than perfect on the short sentence because it does not have enough patterns to capture the structure of the sentence. Any token that matches none of the earlier regular expressions falls through to the final catch-all pattern and is tagged NN (noun). In the short sentence every token, including the final period, was tagged NN, so the entire sentence fell through to the default pattern; only "park" is actually a noun. The long sentence shows the same weakness: most of its tokens also defaulted to NN, and "rhymes" was tagged VBZ only because it ends in "es".
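
As a minimal sketch of how additional patterns change the result, the tagger below adds a few closed-class word patterns ahead of the catch-all. The word lists and the name `closed_class_tagger` are illustrative assumptions, not part of the original homework, and the tokens are written out by hand so that no tokenizer data is needed.

```python
import nltk

# Hypothetical extra patterns for illustration: a handful of determiners,
# prepositions, and sentence-final punctuation, checked before the default.
extended_patterns = [
    (r'(?i)^(the|a|an|every)$', 'DT'),                # a few determiners
    (r'(?i)^(of|in|on|at|by|for|out|toward)$', 'IN'), # a few prepositions
    (r'^[.!?]$', '.'),                                # sentence-final punctuation
    (r'.*', 'NN'),                                    # nouns (default)
]

closed_class_tagger = nltk.RegexpTagger(extended_patterns)

# Tokens of the short sentence, listed by hand.
short_tokens = ['Out', 'of', 'the', 'park', '.']
result = closed_class_tagger.tag(short_tokens)
print(result)
```

Patterns are tried in order and the first match wins, so the catch-all must stay last. Hand-written word lists like these do not scale, which is one reason trained taggers are usually preferred.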

2.	Run a different POS tagger in Python. Process the same two sentences from question 1.
 * Does it produce the same or different output?
 * Explain any differences as best you can.


In [13]:
# Now we'll run the same two sentences through the default tagger that is provided by NLTK

print("Long Sentence Parts of Speech: ", nltk.pos_tag(long_sent_tokens))

print("Short Sentence Parts of Speech: ", nltk.pos_tag(short_sent_tokens))

Long Sentence Parts of Speech:  [('Every', 'DT'), ('manager', 'NN'), ('should', 'MD'), ('be', 'VB'), ('able', 'JJ'), ('to', 'TO'), ('recite', 'VB'), ('at', 'IN'), ('least', 'JJS'), ('ten', 'JJ'), ('nursery', 'JJ'), ('rhymes', 'RB'), ('backward', 'RB'), ('.', '.')]
Short Sentence Parts of Speech:  [('Out', 'IN'), ('of', 'IN'), ('the', 'DT'), ('park', 'NN'), ('.', '.')]


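The default tagger produces different output from the regular expression tagger, and its tagging is much more accurate for both sentences. For the short sentence, the default tagger assigned preposition, determiner, noun, and punctuation tags, whereas the regular expression tagger tagged every token as a noun. The long sentence shows a similar result: the default tagger identified more tokens correctly and did not fall back to the noun tag nearly as often. It is still not perfect on the long sentence, since "ten" was tagged JJ rather than CD, and "nursery" and "rhymes" were tagged JJ and RB where NN and NNS would be expected. The most likely explanation for the difference is that nltk.pos_tag is a pretrained statistical tagger that considers each word and its context, while the regular expression tagger looks only at the spelling of each token in isolation.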
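
To make the comparison concrete, the sketch below lists the tokens on which the two taggers disagree. The helper `tag_disagreements` is a hypothetical name introduced here for illustration; the two tagged lists are copied from the short sentence outputs shown above rather than recomputed.

```python
def tag_disagreements(tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    return [
        (token, tag_a, tag_b)
        for (token, tag_a), (_, tag_b) in zip(tags_a, tags_b)
        if tag_a != tag_b
    ]

# Outputs copied from the two cells above (short sentence only).
regexp_short = [('Out', 'NN'), ('of', 'NN'), ('the', 'NN'), ('park', 'NN'), ('.', 'NN')]
default_short = [('Out', 'IN'), ('of', 'IN'), ('the', 'DT'), ('park', 'NN'), ('.', '.')]

diffs = tag_disagreements(regexp_short, default_short)
for token, regexp_tag, default_tag in diffs:
    print(f"{token!r}: regexp={regexp_tag}, default={default_tag}")
```

The two taggers agree only on "park", the one token for which the catch-all NN tag happens to be right.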

3.	In a news article from this week’s news, find a random sentence of at least 10 words.
 * Looking at the Penn tag set, manually POS tag the sentence yourself.
 * Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
 * Explain any differences between the two taggers and your manual tagging as much as you can.


In [15]:
news_sentence = "By the time President Donald Trump was gliding in his helicopter toward Joint Base Andrews on Saturday, destined for what he'd once hoped would be a triumphant packed-to-the-rafters return to the campaign trail, things were already looking bad."

news_sent_tokens = nltk.word_tokenize(news_sentence)

print("Regular Expression Tagger Results: ", reg_expr_tagger.tag(news_sent_tokens))
print("Default Tagger Results: ", nltk.pos_tag(news_sent_tokens))

Regular Expression Tagger Results:  [('By', 'NN'), ('the', 'NN'), ('time', 'NN'), ('President', 'NN'), ('Donald', 'NN'), ('Trump', 'NN'), ('was', 'NNS'), ('gliding', 'VBG'), ('in', 'NN'), ('his', 'NNS'), ('helicopter', 'NN'), ('toward', 'NN'), ('Joint', 'NN'), ('Base', 'NN'), ('Andrews', 'NNS'), ('on', 'NN'), ('Saturday', 'NN'), (',', 'NN'), ('destined', 'VBD'), ('for', 'NN'), ('what', 'NN'), ('he', 'NN'), ("'d", 'NN'), ('once', 'NN'), ('hoped', 'VBD'), ('would', 'MD'), ('be', 'NN'), ('a', 'NN'), ('triumphant', 'NN'), ('packed-to-the-rafters', 'NNS'), ('return', 'NN'), ('to', 'NN'), ('the', 'NN'), ('campaign', 'NN'), ('trail', 'NN'), (',', 'NN'), ('things', 'NNS'), ('were', 'NN'), ('already', 'NN'), ('looking', 'VBG'), ('bad', 'NN'), ('.', 'NN')]
Default Tagger Results:  [('By', 'IN'), ('the', 'DT'), ('time', 'NN'), ('President', 'NNP'), ('Donald', 'NNP'), ('Trump', 'NNP'), ('was', 'VBD'), ('gliding', 'VBG'), ('in', 'IN'), ('his', 'PRP$'), ('helicopter', 'NN'), ('toward', 'IN'), ('Join

### Manual tagging:
By - IN
the - DT
time - NN
President - NNP
Donald - NNP
Trump - NNP
was - VBD
gliding - VBG
in - IN
his - PRP$
helicopter - NN
toward - IN
Joint - NNP
Base - NNP
Andrews - NNP
on - IN
Saturday - NNP
, - ,
destined - VBN
for - IN
what - WP
he - PRP
'd - VBD
once - RB
hoped - VBN
would - MD
be - VB
a - DT
triumphant - JJ
packed-to-the-rafters - NNS
return - VBP
to - TO
the - DT
campaign - NN
trail - NN
, - ,
things - NNS
were - VBD
already - RB
looking - VBG
bad - JJ
. - .


### Analysis:
Comparing the manual tagging with the regular expression tagger and the NLTK default tagger shows that the default tagger and the manual tagging were almost identical. This was not the case for the regular expression tagger. For example, the first word in the sentence, "By", was manually tagged as a preposition (IN). The regular expression tagger tagged it as a noun (NN) because no pattern matched it, so it fell through to the default, while the default tagger matched the manual tag. Prepositions and determiners account for the largest differences between the taggers. The regular expression tagger has no patterns for these closed-class words and handles them poorly, whereas the default tagger and the manual tagging both account for them.
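
The comparison can also be expressed as a simple accuracy score against the manual tags. The sketch below is illustrative only: `tagging_accuracy` is a hypothetical helper, and it covers just the first 12 tokens ("By" through "toward"), with the tagger outputs copied from the cell above and the manual tag for "his" written as the Penn tag PRP$.

```python
def tagging_accuracy(predicted_tags, gold_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    matches = sum(1 for p, g in zip(predicted_tags, gold_tags) if p == g)
    return matches / len(gold_tags)

# First 12 tokens of the news sentence: By ... toward
manual = ['IN', 'DT', 'NN', 'NNP', 'NNP', 'NNP', 'VBD', 'VBG', 'IN', 'PRP$', 'NN', 'IN']
regexp = ['NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NNS', 'VBG', 'NN', 'NNS', 'NN', 'NN']
default = ['IN', 'DT', 'NN', 'NNP', 'NNP', 'NNP', 'VBD', 'VBG', 'IN', 'PRP$', 'NN', 'IN']

regexp_accuracy = tagging_accuracy(regexp, manual)
default_accuracy = tagging_accuracy(default, manual)
print(f"regexp tagger:  {regexp_accuracy:.2f}")
print(f"default tagger: {default_accuracy:.2f}")
```

On this 12-token prefix the regular expression tagger agrees with the manual tags only on "time", "gliding", and "helicopter", while the default tagger agrees on all 12.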