# MSDS 7337 Homework 4

Author: Nathan Wall

Date: 6/25/2019

The notebook below works through the questions for this homework. It uses articles from MIT Technology Review for academic purposes.

Notebook Sections:
- [Data Preparation](#pre)
- [Q1: POS Tagger 1](#q1)
- [Q2: POS Tagger 2](#q2)
- [Q3: POS Tagger Comparison](#q3)

In [24]:
import random
import pandas as pd
import numpy as np
import requests
import nltk
import spacy
from bs4 import BeautifulSoup

## Data Preparation
<a id='pre'></a>

In this section we scrape a recent news article and clean up some of the text for use in the rest of this assignment.

In [2]:
#read from the artificial intelligence section of MIT Technology Review
page = requests.get('https://www.technologyreview.com/artificial-intelligence') 
soup = BeautifulSoup(page.content, 'html.parser')
#find the article links
webLinks = soup.find_all('h3', {"class": "grid-tz__title"})

#select the first article in that list
articleLink = 'https://www.technologyreview.com'+webLinks[0].find('a').get('href')+'amp/' 
#Technology Review article text requires the 'amp/' suffix

print(articleLink)

https://www.technologyreview.com/s/613738/artificial-intelligence-sees-construction-site-accidents-before-they-happen/amp/


To ensure the interpretations below make sense, we use only the following link for the rest of this homework, although the scraper could be run at any time to pull the most recent article.

In [2]:
articleLink = 'https://www.technologyreview.com/s/613738/artificial-intelligence-sees-construction-site-accidents-before-they-happen/amp/'

In [3]:
#now lets parse that article
page = requests.get(articleLink)
# parse with BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

pTags = soup.find_all('p')

pList = []
for p in pTags:
    pList.append(p.get_text())

article = ' '.join(pList)
article[0:100]

'A construction site is a dangerous place to work, with a fatal accident rate five times higher than '

From the above, we are able to pull in one of the latest articles on the MIT Technology Review website to begin testing different POS taggers.
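The paragraph extraction above can be tried without network access. The sketch below swaps in the standard library's `html.parser` in place of BeautifulSoup and runs on a small made-up HTML string; `ParagraphCollector` and `sample_html` are hypothetical names used only for illustration, not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element, including nested inline tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one string per <p> element
        self._in_p = False     # True while the parser is inside a <p>
        self._buffer = []      # text chunks of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


# Hypothetical stand-in for page.text; the real page is not reproduced here.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <a href='#'>linked</a> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

# Same joining step as the notebook: one string with paragraphs separated by spaces.
sample_article = " ".join(collector.paragraphs)
print(sample_article)
```

As in the notebook, text outside `<p>` tags (here the `<h1>` title) is dropped, which is why headings and navigation do not end up in `article`.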

## Q1: Run one of the part-of-speech (POS) taggers available in Python. 
<a id='q1'></a>

For this we begin with NLTK's default pre-trained tagger, which uses the Penn Treebank tagset.

In [4]:
def nltk_pos(sent):
    #use the NLTK default POS tagger (Penn Treebank tagset)
    #:sent: sentence to be tagged
    #returns a dict with the tagged tokens ('pos') and the token count ('length')
    tokens = nltk.word_tokenize(sent)
    d = dict()
    d['pos'] = nltk.pos_tag(tokens)
    d['length'] = len(tokens)
    return d

In [5]:
#split the article into sentences and tag each one so we can sort by length
sent = nltk.sent_tokenize(article)
sent_list = []
for s in sent:
    sent_list.append(nltk_pos(s))
    
sent_df = pd.DataFrame(sent_list)

### a) Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.

In [6]:
sent_df.sort_values('length', ascending = False).head()

| index | length | pos |
|---|---|---|
| 16 | 46 | [(“, NNS), (Most, RBS), (companies, NNS), (don... |
| 6 | 41 | [(It, PRP), (can, MD), (then, RB), (be, VB), (... |
| 20 | 35 | [(Mary, NNP), (Gray, NNP), (,, ,), (an, DT), (... |
| 4 | 34 | [(Jit, NNP), (Kee, NNP), (Chin, NNP), (,, ,), ... |
| 2 | 32 | [(Suffolk, NNP), (,, ,), (a, DT), (constructio... |


The longest sentence in the article was tagged rather poorly, so below are the input and output for the second-longest sentence.

In [16]:
sent[6]

'It can then be put to work monitoring a new construction site and flagging situations that seem likely to lead to an accident, such as worker not wearing gloves or working too close to a dangerous piece of machinery.'

In [18]:
#apply the tokenizer and display results
print("The longest sentence has {} words".format(sent_df['length'][6]))
print("The tokens for each word are shown below")
for t in sent_df['pos'][6]:
    print(t)

The longest sentence has 41 words
The tokens for each word are shown below
('It', 'PRP')
('can', 'MD')
('then', 'RB')
('be', 'VB')
('put', 'VBN')
('to', 'TO')
('work', 'VB')
('monitoring', 'VBG')
('a', 'DT')
('new', 'JJ')
('construction', 'NN')
('site', 'NN')
('and', 'CC')
('flagging', 'JJ')
('situations', 'NNS')
('that', 'WDT')
('seem', 'VBP')
('likely', 'JJ')
('to', 'TO')
('lead', 'VB')
('to', 'TO')
('an', 'DT')
('accident', 'NN')
(',', ',')
('such', 'JJ')
('as', 'IN')
('worker', 'NN')
('not', 'RB')
('wearing', 'VBG')
('gloves', 'NNS')
('or', 'CC')
('working', 'VBG')
('too', 'RB')
('close', 'RB')
('to', 'TO')
('a', 'DT')
('dangerous', 'JJ')
('piece', 'NN')
('of', 'IN')
('machinery', 'NN')
('.', '.')


The NLTK POS tagger (Penn Treebank tagset) seems to tag most of the 41 tokens in this sentence correctly. All of the punctuation, determiners (DT), conjunctions (CC), and prepositions (IN) are tagged correctly.

All of the nouns seem to be appropriately tagged according to whether they are plural (NNS) or singular (NN).

Most of the adjectives (JJ) seem to be tagged correctly, with the one potential exception of the word "flagging". It is tagged as an adjective but appears to be used as a verb in this sentence: it is coordinated with "monitoring", so VBG would be the expected tag. The rest of the verbs appear to be tagged correctly.

Overall the POS tagging seems good.

### b) Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. 

In [21]:
sent_df.sort_values('length').head()

| index | length | pos |
|---|---|---|
| 22 | 5 | [(MIT, NNP), (Technology, NNP), (Review, NNP),... |
| 14 | 8 | [(Improving, VBG), (safety, NN), (is, VBZ), (a... |
| 15 | 13 | [(“, JJ), (Safety, NNP), (was, VBD), (a, DT), ... |
| 13 | 13 | [(Deep-learning, JJ), (algorithms, NN), (typic... |
| 19 | 16 | [(And, CC), (it, PRP), (is, VBZ), (unlikely, J... |


The shortest sentence in the article is just the footer naming the website source. The only other sentence with fewer than 10 tokens is sentence 14.

In [23]:
sent[14]

'Improving safety is an incentive as well.'

In [24]:
#apply the tokenizer and display results
print("The shortest sentence has {} words".format(sent_df['length'][14]))
print("The tokens for each word are shown below")
for t in sent_df['pos'][14]:
    print(t)

The shortest sentence has 8 words
The tokens for each word are shown below
('Improving', 'VBG')
('safety', 'NN')
('is', 'VBZ')
('an', 'DT')
('incentive', 'NN')
('as', 'RB')
('well', 'RB')
('.', '.')


The tagging performs well with the exception of the tag applied to 'as'. NLTK tags 'as' as 'RB', indicating an adverb. However, it appears to be used as a preposition before the adverb 'well', which is tagged correctly.

Aside from that, the rest of the sentence appears to be tagged correctly.
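Note that the `length` column counts every token returned by `nltk.word_tokenize`, so the 8 reported above includes the final period. A minimal sketch of counting only word tokens follows; `count_words` is a hypothetical helper, and treating a token as punctuation when all of its characters fall in a Unicode punctuation category is an assumption made for this illustration.

```python
import unicodedata

# Tagged output for sent[14], copied from the NLTK result above.
tagged = [('Improving', 'VBG'), ('safety', 'NN'), ('is', 'VBZ'), ('an', 'DT'),
          ('incentive', 'NN'), ('as', 'RB'), ('well', 'RB'), ('.', '.')]


def is_punctuation(token):
    """True when every character is in a Unicode punctuation category (P*)."""
    return all(unicodedata.category(ch).startswith('P') for ch in token)


def count_words(tagged_tokens):
    """Count the tokens that are not pure punctuation."""
    return sum(1 for token, _ in tagged_tokens if not is_punctuation(token))


print(len(tagged))          # all tokens, punctuation included
print(count_words(tagged))  # word tokens only
```

Hyphenated tokens such as "Deep-learning" still count as words under this rule, because only tokens made up entirely of punctuation are excluded.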

## Q2: Run a different POS tagger in Python.
<a id='q2'></a>

For the next tagger we use spaCy's small core English model (`en_core_web_sm`). This is a pre-trained model that includes POS tags from the universal tagset. It was trained primarily on web text, including news articles.

In [28]:
nlp = spacy.load("en_core_web_sm")

### a) Does it produce the same or different output?

##### Long Sentence

In [30]:
long_sent = nlp(sent[6])
print(long_sent.text)

It can then be put to work monitoring a new construction site and flagging situations that seem likely to lead to an accident, such as worker not wearing gloves or working too close to a dangerous piece of machinery.


In [31]:
for token in long_sent:
    print(token.text, token.pos_)

It PRON
can VERB
then ADV
be VERB
put VERB
to ADP
work NOUN
monitoring VERB
a DET
new ADJ
construction NOUN
site NOUN
and CCONJ
flagging ADJ
situations NOUN
that DET
seem VERB
likely ADJ
to PART
lead VERB
to ADP
an DET
accident NOUN
, PUNCT
such ADJ
as ADP
worker NOUN
not ADV
wearing VERB
gloves NOUN
or CCONJ
working VERB
too ADV
close ADV
to ADP
a DET
dangerous ADJ
piece NOUN
of ADP
machinery NOUN
. PUNCT


##### Short Sentence

In [32]:
short_sent = nlp(sent[14])
print(short_sent.text)

Improving safety is an incentive as well.


In [33]:
for token in short_sent:
    print(token.text, token.pos_)

Improving VERB
safety NOUN
is VERB
an DET
incentive NOUN
as ADV
well ADV
. PUNCT


### b) Explain any differences as best you can.

Based on the output, the two POS taggers produce roughly the same results, with the biggest difference between the two being the tagsets. The spaCy tags shown here are based on a "universal" tagset, whereas NLTK defaults to the Penn Treebank tagset.

While the Penn Treebank tags are much more fine-grained than the universal tags, the general groups to which the two taggers assigned the words appear to be very similar. Even the two tags that seemed questionable in the NLTK output are tagged the same way by spaCy: "flagging" from the longer sentence is still shown as an adjective, and "as" from the shorter sentence is still shown as an adverb.

Overall, the only real difference is the tagset, and the results are comparable.
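One way to make this comparison mechanical is to collapse the Penn Treebank tags to coarse labels and check them token by token against the spaCy output. The sketch below does this for the short sentence. `PENN_TO_COARSE` is a hypothetical hand-written mapping that covers only the tags in that sentence and targets the labels printed by spaCy above; it is not an official conversion table.

```python
# Hypothetical, partial mapping from Penn Treebank tags to the coarse labels
# printed by spaCy in this notebook. It covers only the tags used below.
PENN_TO_COARSE = {
    'VBG': 'VERB', 'VBZ': 'VERB',
    'NN': 'NOUN',
    'DT': 'DET',
    'RB': 'ADV',
    '.': 'PUNCT',
}

# Outputs for sent[14], copied from the NLTK and spaCy cells above.
nltk_tags = [('Improving', 'VBG'), ('safety', 'NN'), ('is', 'VBZ'), ('an', 'DT'),
             ('incentive', 'NN'), ('as', 'RB'), ('well', 'RB'), ('.', '.')]
spacy_tags = [('Improving', 'VERB'), ('safety', 'NOUN'), ('is', 'VERB'), ('an', 'DET'),
              ('incentive', 'NOUN'), ('as', 'ADV'), ('well', 'ADV'), ('.', 'PUNCT')]


def coarse_disagreements(penn_tagged, coarse_tagged, mapping):
    """Return (token, penn_tag, mapped_tag, coarse_tag) where the taggers differ."""
    if len(penn_tagged) != len(coarse_tagged):
        raise ValueError("the two taggings must cover the same tokens")
    rows = []
    for (token, penn), (_, coarse) in zip(penn_tagged, coarse_tagged):
        mapped = mapping[penn]  # KeyError signals a tag missing from the mapping
        if mapped != coarse:
            rows.append((token, penn, mapped, coarse))
    return rows


print(coarse_disagreements(nltk_tags, spacy_tags, PENN_TO_COARSE))
```

An empty list means the two taggers agree on every token once the tagsets are aligned. The same function could be applied to the longer sentence after extending the mapping to the additional tags that appear there.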

## Q3: In a news article from this week’s news, find a random sentence of at least 10 words.
<a id='q3'></a>

In [35]:
random_sent = random.choice(sent)
random_sent

'And it is unlikely to stop there—we may all find ourselves working for algorithms eventually.'

This is a very interesting sentence, and it was actually the first random sentence returned!

### a) Looking at the Penn tag set, manually POS tag the sentence yourself.

In [43]:
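# tokens = nltk.word_tokenize(random_sent)
# tokens are entered by hand instead; note 'unlikeley' is a typo for 'unlikely' and '-' stands in for the em dash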
tokens = ['And', 'it', 'is', 'unlikeley', 'to', 'stop', 'there', '-', 'we', 'may', 'all', 'find', 'ourselves', 'working', 'for', 'algorithms', 'eventually','.']
man_tags = ['CC','PRP', 'VBZ', 'JJ', 'TO', 'VB', 'RB', '-','PRP', 'VB', 'JJ', 'VBD', 'NNS', 'VBG', 'IN', 'NNS', 'RB','.']
hand_tag = dict(zip(tokens,man_tags))
hand_tag

{'And': 'CC',
 'it': 'PRP',
 'is': 'VBZ',
 'unlikeley': 'JJ',
 'to': 'TO',
 'stop': 'VB',
 'there': 'RB',
 '-': '-',
 'we': 'PRP',
 'may': 'VB',
 'all': 'JJ',
 'find': 'VBD',
 'ourselves': 'NNS',
 'working': 'VBG',
 'for': 'IN',
 'algorithms': 'NNS',
 'eventually': 'RB',
 '.': '.'}
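Storing the manual tags with `dict(zip(tokens, man_tags))` works here because all 18 tokens are distinct. In a sentence with repeated tokens, later entries would overwrite earlier ones. The sketch below shows the difference on a made-up token list and a list-of-pairs alternative; `pair_tags` is a hypothetical helper name.

```python
def pair_tags(tokens, tags):
    """Pair tokens with tags in order, keeping repeated tokens."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))


# Made-up example with repeated tokens (not from the article).
example_tokens = ['to', 'be', 'or', 'not', 'to', 'be']
example_tags = ['TO', 'VB', 'CC', 'RB', 'TO', 'VB']

as_dict = dict(zip(example_tokens, example_tags))
as_pairs = pair_tags(example_tokens, example_tags)

print(len(as_dict))   # repeated tokens collapse into one key each
print(len(as_pairs))  # one pair per token, in sentence order
```

The list of pairs also matches the shape returned by `nltk.pos_tag`, which makes a token-by-token comparison straightforward.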

### b) Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?

NLTK POS Tag

In [40]:
nltk_tag = nltk.pos_tag(tokens)
for t in nltk_tag:
    print(t)

('And', 'CC')
('it', 'PRP')
('is', 'VBZ')
('unlikeley', 'JJ')
('to', 'TO')
('stop', 'VB')
('there', 'EX')
('-', ':')
('we', 'PRP')
('may', 'MD')
('all', 'DT')
('find', 'VB')
('ourselves', 'NNS')
('working', 'VBG')
('for', 'IN')
('algorithms', 'JJ')
('eventually', 'RB')
('.', '.')


spaCy POS Tags

In [41]:
spacy_tag = nlp(random_sent)
for t in spacy_tag:
    print(t.text, t.pos_)

And CCONJ
it PRON
is VERB
unlikely ADJ
to PART
stop VERB
there ADV
— PUNCT
we PRON
may VERB
all ADV
find VERB
ourselves PRON
working VERB
for ADP
algorithms NOUN
eventually ADV
. PUNCT


It looks like all three methods produced different results, which we explore in part c.

### c) Explain any differences between the two taggers and your manual tagging as much as you can.

All three sets of tags seem to be in line with each other until 'there'. Both my manual tagging and the spaCy tagging labeled it as an adverb, whereas NLTK labeled it as an existential 'there' (EX). I did not tag it that way manually because I missed that option, and spaCy's output uses the universal tagset, which has no such tag.

At the word 'may', my tags and the spaCy tags again disagree with NLTK: NLTK tags it as a modal (MD), while the other two tag it as a verb. A modal is a type of verb, so these are pretty close.

All three methods disagree on the word 'all'. This seems like a difficult word to tag correctly: I tagged it as an adjective, NLTK tagged it as a determiner, and spaCy tagged it as an adverb.

The last term with interesting disagreement is 'algorithms'. While spaCy and I tagged it as a noun, the NLTK tagger tagged it as an adjective. This seems like a rather simple case, which makes me wonder about the age of the corpus used to train the NLTK tagger and whether the term was as prominent then as it is today.
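Because the manual tags and the NLTK tags both use Penn Treebank labels, they can be compared position by position. The sketch below lists every mismatch and a simple agreement rate. The lists are copied from the cells above (hand-typed tokens kept as-is), and `disagreements` is a hypothetical helper name.

```python
# Tokens and tags copied from the cells above.
tokens = ['And', 'it', 'is', 'unlikeley', 'to', 'stop', 'there', '-', 'we', 'may',
          'all', 'find', 'ourselves', 'working', 'for', 'algorithms', 'eventually', '.']
manual = ['CC', 'PRP', 'VBZ', 'JJ', 'TO', 'VB', 'RB', '-', 'PRP', 'VB',
          'JJ', 'VBD', 'NNS', 'VBG', 'IN', 'NNS', 'RB', '.']
nltk_out = ['CC', 'PRP', 'VBZ', 'JJ', 'TO', 'VB', 'EX', ':', 'PRP', 'MD',
            'DT', 'VB', 'NNS', 'VBG', 'IN', 'JJ', 'RB', '.']


def disagreements(tokens, tags_a, tags_b):
    """Return (token, tag_a, tag_b) for each position where the tags differ."""
    return [(tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b]


diffs = disagreements(tokens, manual, nltk_out)
agreement = 1 - len(diffs) / len(tokens)

for tok, mine, theirs in diffs:
    print(f"{tok!r}: manual={mine}, nltk={theirs}")
print(f"agreement: {agreement:.1%}")
```

A mechanical listing like this also surfaces mismatches that are easy to overlook by eye, such as the dash and the tag on 'find'. The spaCy output would need a tagset mapping before it could be compared in the same way.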