# Assignment 3. Analyze Data from Web Scraping
### Due Date: April 19th (Friday) Midnight

This week, we will obtain and analyze data from web scraping. We will also cover the classic NLP metric “TF-IDF” and other useful analysis metrics.

## Problem 1: Text Data from RSS Feed

The first problem is about how to obtain content from a news feed or blog. You may have heard of some RSS reading tools,
such as Google Reader (the most famous one, now shut down), the Reeder app, Feedly, and Inoreader. They all allow users to add RSS feeds; the tools then
periodically fetch data from the source and show the content to users after parsing it for a better reading experience.

In this problem, we will develop a toy RSS reader. Given an RSS feed URL, our reader should extract content from the feed and provide some linguistic "metadata" on the content.

Feel free to use the preprocessing techniques you learned in Assignment 2!

### Q1: Store content from an RSS Feed (20 pts)

The first function of our toy RSS reader is to download data from an RSS feed and make it serializable. We provide a sample RSS feed from Ben Thompson's personal blog **stratechery** (https://stratechery.com/), which is a subscription-based newsletter about tech news.

You need to do the following:

- Use `feedparser` to parse the feed
- Print out how many entries you fetch from the feed
- Discard posts whose title includes "Exponent Podcast"; those are short intros to podcast episodes rather than articles
- Compose a dict that has keys `title`, `content` and `link` and store corresponding fields of entries in it
- Save the dict to a json file called *"feed.json"* (serialization)

Packages you may use: `feedparser`, `BeautifulSoup`, `nltk`

In [1]:
import json
import feedparser
from bs4 import BeautifulSoup
import nltk

NEWS_FEED = "http://stratechery.com/feed/"

# write your code here
fp = feedparser.parse(NEWS_FEED)
print('There are {0} entries in the feed.'.format(len(fp.entries)))

# drop the podcast intros and keep only the articles
entries = [e for e in fp.entries if 'Exponent Podcast' not in e['title']]

# one record per remaining post, keyed by its position in the feed
dict_rss = {}
for i, entry in enumerate(entries):
    dict_rss[i] = {
        'title': entry['title'],
        'content': entry['content'][0]['value'],
        'link': entry['link'],
    }

with open('feed.json', 'w') as file:
    json.dump(dict_rss, file)

There are 10 entries in the feed.
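
One detail of the serialization step is worth spelling out: JSON object keys are always strings, so the integer keys of `dict_rss` come back as strings once `feed.json` is reloaded. The sketch below illustrates this with a made-up two-post dictionary (the titles, contents and links are placeholders, not real feed entries).

```python
import json

# Hypothetical stand-in for dict_rss; all values are placeholders.
posts = {
    0: {"title": "Post A", "content": "<p>Hello</p>", "link": "https://example.com/a"},
    1: {"title": "Post B", "content": "<p>World</p>", "link": "https://example.com/b"},
}

serialized = json.dumps(posts)      # same key conversion json.dump applies when writing a file
restored = json.loads(serialized)

print(list(posts.keys()))     # integer keys before serialization
print(list(restored.keys()))  # string keys after the round trip
```

This is why code that reloads the file has to index the data with `str(i)` rather than `i`.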


### Q2: Using NLP tools to process blog data (20 pts)

Now we have content data from the blog. The next step is to use NLP tools to analyze the corpus. 

Please load the json file we stored previously and do the following preprocessing:

- Remove stop words and punctuation (you can use a custom set of punctuation, if necessary)
- Tokenize the data with `nltk.tokenize` (that is, split the data into sentences, words...)

For each post, you need to print out:

- The number of sentences 
- The number of words
- The number of unique words
- The number of words that appear only once (hapax legomena)
- Ten most frequent words
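
To make the requested statistics concrete, here is a minimal sketch on a made-up three-sentence text. It assumes naive regex splitting instead of `nltk` and skips stop-word removal, so it only illustrates what each number means rather than how the assignment should be solved.

```python
import re
from collections import Counter

text = "The cat sat. The dog sat down. A bird flew away!"

# Assumption: naive regex splitting instead of nltk, to keep the sketch dependency-free.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]

counts = Counter(words)
hapaxes = [w for w, c in counts.items() if c == 1]  # words that occur exactly once

print(len(sentences))         # number of sentences
print(len(words))             # number of words
print(len(counts))            # number of unique words
print(len(hapaxes))           # number of hapax legomena
print(counts.most_common(2))  # most frequent words with their counts
```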

In [2]:
import json
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from collections import Counter
import re
from bs4 import BeautifulSoup

BLOG_DATA = "feed.json"

# write your code here
with open(BLOG_DATA) as json_file:
    data = json.load(json_file)

n_posts = len(data)
token_word = [None] * n_posts  # one slot per post instead of a hard-coded length of 5
sentences = [None] * n_posts
punctuation = list(string.punctuation) + ['’']
stop_words = set(stopwords.words('english') + punctuation + ['—', '’', '“', '”'])

for i in range(n_posts):
    text = BeautifulSoup(data[str(i)]['content'], 'html.parser').get_text()  # strip HTML tags
    text = text.replace('\n', ' ')
    data[str(i)]['content'] = text
    sentences[i] = nltk.sent_tokenize(text)
    tokens = word_tokenize(text)
    # note: stop words are matched before lowercasing, so capitalized forms such as 'The' are kept
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [re.sub(r'\+', '', w) for w in tokens]
    tokens = [w for w in tokens if not w.isdigit()]  # get rid of digits
    token_word[i] = [w.lower() for w in tokens]  # lowercase so unique words are not double counted

count_word = [Counter(tokens) for tokens in token_word]
once = [[word for word, count in counts.items() if count == 1] for counts in count_word]

for i in range(len(data)):
    print('Post'+str(i+1)+'\n------------------')
    print('The number of sentences:',len(sentences[i]))
    print('The number of words',len(token_word[i]))
    print('The number of unique words',len(set(token_word[i])))
    print('The number of words that only appears only once (hapax legomenon)',len(once[i]))
    print('Ten most frequent words',count_word[i].most_common(10))

Post1
------------------
The number of sentences: 62
The number of words 1278
The number of unique words 697
The number of words that only appears only once (hapax legomenon) 490
Ten most frequent words [('uber', 60), ('ride-sharing', 17), ('company', 13), ('lyft', 12), ('autonomous', 12), ('the', 11), ('eats', 11), ('revenue', 11), ('i', 11), ('we', 10)]
Post2
------------------
The number of sentences: 87
The number of words 1607
The number of unique words 786
The number of words that only appears only once (hapax legomenon) 514
Ten most frequent words [('disney', 47), ('tv', 32), ('content', 26), ('netflix', 21), ('cable', 20), ('business', 19), ('model', 18), ('traditional', 18), ('the', 16), ('sports', 13)]
Post3
------------------
The number of sentences: 64
The number of words 1124
The number of unique words 661
The number of words that only appears only once (hapax legomenon) 482
Ten most frequent words [('content', 31), ('free', 20), ('users', 16), ('the', 14), ('youtube', 11)
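
In the output above, words such as 'the', 'i' and 'we' still appear among the most frequent words. One likely explanation is that the stop-word filter runs before lowercasing, so capitalized forms like 'The' do not match the lowercase stop-word list. A minimal sketch of the effect, using a tiny hand-written stop-word set as a stand-in for nltk's list:

```python
STOP_WORDS = {"the", "i", "we", "a"}  # hypothetical stand-in for stopwords.words('english')
tokens = ["The", "driver", "said", "the", "ride", "We", "took", "was", "cheap"]

# Filter first, lowercase second: capitalized stop words slip through.
filter_then_lower = [w.lower() for w in tokens if w not in STOP_WORDS]

# Lowercase first, filter second: every stop word in the set is removed.
lower_then_filter = [w for w in (t.lower() for t in tokens) if w not in STOP_WORDS]

print(filter_then_lower)
print(lower_then_filter)
```

Lowercasing before filtering would change the counts reported above, so the two orderings should not be mixed when comparing results.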

## Problem 2: Text Data from Article

Not all data sources provide an RSS feed. In those cases, we have to make a call to the target URL manually and parse the response ourselves.

For example, Dartmouth News (https://news.dartmouth.edu/): though it claims to have many feeds (https://www.dartmouth.edu/~dartlife/rss/index.html), none of them seems to work.
    
In this problem, we will use the `boilerpipe` package as the extractor to extract text data directly from an article webpage (https://news.dartmouth.edu/news/2019/01/dartmouth-kicks-its-250th-year-celebration), and run a TF-IDF analysis on the result. `boilerpipe` has built-in rules to handle possible tags in HTML. It uses supervised machine learning to separate the boilerplate from the content of the page, which means that if you provide `boilerpipe` with the response to the call, it can give back the article content automatically!

If you are interested in what happens under the hood, check the paper (https://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf).

You need to install `boilerpipe` with `pip install boilerpipe3`. Since it is just a Python wrapper around a Java library, you have to install a Java environment first.

### Q1: Extract Text from an Article (10 pts)

Extract the first three sentences in the article (https://news.dartmouth.edu/news/2019/01/dartmouth-kicks-its-250th-year-celebration) content.

Load them into a dictionary named "corpus". The keys should be 'a', 'b' and 'c' (simple labels for the corpus), and the values should be the three sentences.

It should be exactly like:
    
```
{'a': 'With speeches, music, and refreshments, Dartmouth launched its yearlong '
      '250th anniversary celebration on campus with nine simultaneous kickoffs '
      'across the institution on Jan. 10.',
 'b': 'Starting at 4 p.m., members of the Dartmouth community gathered at '
      'locations across campus to share in the celebration.',
 'c': 'Livestreamed on screens across campus from the Dartmouth Library’s '
      'Baker-Berry hall, President Philip J. Hanlon ’77 joined 250th co-chairs '
      'Cheryl Bascomb ’82 , vice president for alumni relations, and Donald '
      'Pease , the Ted and Helen Geisel Third Century Professor in the '
      'Humanities, to ring in the sestercentennial.'}
```

Packages you may use: `boilerpipe`, `nltk`

In [3]:
import nltk
import boilerpipe
import pandas as pd

URL = "https://news.dartmouth.edu/news/2019/01/dartmouth-kicks-its-250th-year-celebration"

# write your code here
from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor',url=URL)

extracted_text = extractor.getText()

sentences = nltk.sent_tokenize(extracted_text)

corpus = {'a':sentences[9],
         'b':sentences[10],
         'c':sentences[11]}
print(corpus)

#convert into dataframe to make life easier for me
first_sent = sentences[9]
sec_sent = sentences[10]
third_sent = sentences[11]
corpus_df = pd.DataFrame([first_sent,sec_sent,third_sent],index=['a','b','c'],columns=['column'])
corpus_df

{'a': 'With speeches, music, and refreshments, Dartmouth launched its yearlong 250th anniversary celebration on campus with nine simultaneous kickoffs across the institution on Jan. 10.', 'b': 'Starting at 4 p.m., members of the Dartmouth community gathered at locations across campus to share in the celebration.', 'c': 'Livestreamed on screens across campus from the Dartmouth Library’s Baker-Berry hall, President Philip J. Hanlon ’77 joined 250th co-chairs Cheryl Bascomb ’82 , vice president for alumni relations, and Donald Pease , the Ted and Helen Geisel Third Century Professor in the Humanities, to ring in the sestercentennial.'}


Unnamed: 0,column
a,"With speeches, music, and refreshments, Dartmo..."
b,"Starting at 4 p.m., members of the Dartmouth c..."
c,Livestreamed on screens across campus from the...
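
The solution above hard-codes indices 9 to 11, which presumably skips sentences that the extractor returns before the article body. An alternative that does not depend on that offset is to search for the sentence that opens the article. The sketch below assumes a hypothetical `sentences` list with abbreviated placeholder sentences and a known opening phrase; `first_three_from` is an illustrative helper, not part of `boilerpipe` or `nltk`.

```python
def first_three_from(sentences, opening):
    """Return the three sentences starting at the first one that begins with `opening`."""
    for i, sentence in enumerate(sentences):
        if sentence.startswith(opening):
            return sentences[i:i + 3]
    raise ValueError("opening phrase not found")

# Hypothetical extractor output: two lines of page residue, then abbreviated article sentences.
sentences = [
    "Menu",
    "Share this article.",
    "With speeches, music ...",
    "Starting at 4 p.m. ...",
    "Livestreamed on screens ...",
    "More text.",
]

corpus = dict(zip("abc", first_three_from(sentences, "With speeches")))
print(corpus)
```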


### Q2: TF-IDF analysis on text data (30 pts)

TF-IDF stands for term frequency–inverse document frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents.

Mathematically, TF-IDF is expressed as the product of the term frequency and the inverse document frequency, $\text{tf-idf} = tf \times idf$, where the term $tf = \frac{num\ of\ certain\ word\ in\ a\ sentence}{num\ of\ words\ in\ a\ sentence}$ represents the importance of a term in a specific document and $idf = 1 + \log\left(\frac{num\ of\ sentences}{num\ of\ sentences\ that\ have\ certain\ word}\right)$ represents the importance of a term relative to the entire corpus. Multiplying these terms together produces a score that accounts for both factors and has been an integral part of every major search engine at some point in its existence. Here each sentence plays the role of a document, and the set of sentences is the corpus. You can extend this definition to paragraphs or even entire documents.

<p align="center">
  <img align="center" src="https://s3.amazonaws.com/agwarbliu/TF-IDF.png" style="width: 400px;" />
</p>
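
As a rough illustration of the two formulas, the sketch below computes them on a made-up three-sentence corpus. Whitespace tokenization and the toy sentences are assumptions chosen to keep the example short; the natural logarithm is used for `log`.

```python
from math import log

def tf(term, doc_tokens):
    # share of the tokens in the sentence that are equal to the term
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # 1 + log(number of sentences / number of sentences containing the term)
    containing = sum(1 for tokens in corpus_tokens if term in tokens)
    return 1 + log(len(corpus_tokens) / containing)

# Toy corpus (made up); each sentence is split on whitespace.
corpus = ["green apple pie", "green tea", "black tea with milk"]
corpus_tokens = [s.split() for s in corpus]

for term in ["green", "milk"]:
    scores = [round(tf(term, t) * idf(term, corpus_tokens), 4) for t in corpus_tokens]
    print(term, round(idf(term, corpus_tokens), 4), scores)
```

A term that appears in fewer sentences gets a larger IDF, so "milk" is weighted more heavily than "green". Note that this `idf` raises `ZeroDivisionError` for a term that appears in no sentence.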

In this question, you are required to compute TF-IDF on the `corpus` given the `QUERY_WORDS = ['dartmouth', '250th']`.  

Specifically, you need to print out these results:

- TF score of each sentence in the corpus for each `QUERY_WORDS`
- IDF score of the entire corpus for each `QUERY_WORDS`
- TF-IDF score of each sentence in the corpus for each `QUERY_WORDS`
- Viewing the two `QUERY_WORDS` as a vector, sum up their TF-IDF scores for each sentence in the corpus. Compare the results and comment on the similarity among the sentences.

Packages you are allowed to use: `math`, `numpy`. **Please do not use built-in TF-IDF functions in any other packages.**

In [4]:
from math import log
import numpy as np

QUERY_WORDS = ['dartmouth', '250th']

def tf(term, doc, normalize=True):
    # write your code here
    # tokenize, then drop punctuation (same `punctuation` list as in Problem 1, Q2)
    tokens = [w.lower() for w in nltk.word_tokenize(doc) if w not in punctuation]
    count = tokens.count(term)
    # raw count if normalize is False, otherwise the count divided by the sentence length
    return np.true_divide(count, len(tokens)) if normalize else count

def idf(term, corpus):
    # write your code here
    # accept either the `corpus` dict or the `corpus_df` DataFrame built in Q1
    docs = list(corpus['column']) if hasattr(corpus, 'columns') else list(corpus.values())
    # count the sentences that contain the term, not the total number of occurrences
    num_with_term = 0
    for doc in docs:
        tokens = [w.lower() for w in nltk.word_tokenize(doc) if w not in punctuation]
        if term in tokens:
            num_with_term += 1
    return 1 + log(np.true_divide(len(docs), num_with_term))

def tf_idf(term, doc, corpus):
    # write your code here
    return tf(term, doc) * idf(term, corpus)

print('TF score of each sentence in the corpus for each QUERY_WORDS:')
print('First sentence with dartmouth:',round(tf(QUERY_WORDS[0],corpus_df['column'][0]),5))
print('Second sentence with dartmouth:',round(tf(QUERY_WORDS[0],corpus_df['column'][1]),5))
print('Third sentence with dartmouth:',round(tf(QUERY_WORDS[0],corpus_df['column'][2]),5))
print('First sentence with 250th:',round(tf(QUERY_WORDS[1],corpus_df['column'][0]),5))
print('Second sentence with 250th:',round(tf(QUERY_WORDS[1],corpus_df['column'][1]),5))
print('Third sentence with 250th:',round(tf(QUERY_WORDS[1],corpus_df['column'][2]),5))
print('--------------------')
print('IDF score of entire corpus for each QUERY_WORDS:')
print('IDF score for dartmouth:',round(idf(QUERY_WORDS[0],corpus_df),5))
print('IDF score for 250th:',round(idf(QUERY_WORDS[1],corpus_df),5))
print('--------------------')
print('TF-IDF score of each sentence in the corpus for each QUERY_WORDS:')
print('First sentence with dartmouth:',round(tf_idf(QUERY_WORDS[0], corpus_df['column'][0], corpus),5))
print('Second sentence with dartmouth:',round(tf_idf(QUERY_WORDS[0], corpus_df['column'][1], corpus),5))
print('Third sentence with dartmouth:',round(tf_idf(QUERY_WORDS[0], corpus_df['column'][2], corpus),5))
print('First sentence with 250th:',round(tf_idf(QUERY_WORDS[1], corpus_df['column'][0], corpus),5))
print('Second sentence with 250th:',round(tf_idf(QUERY_WORDS[1], corpus_df['column'][1], corpus),5))
print('Third sentence with 250th:',round(tf_idf(QUERY_WORDS[1], corpus_df['column'][2], corpus),5))
print('-------------------')
print('Sum of TF-IDF score for each sentence in entire corpus:')
print('First sentence:',round(tf_idf(QUERY_WORDS[0], corpus_df['column'][0], corpus_df)+tf_idf(QUERY_WORDS[1], corpus_df['column'][0], corpus_df),5))
print('Second sentence:',round(tf_idf(QUERY_WORDS[0], corpus_df['column'][1], corpus_df)+tf_idf(QUERY_WORDS[1], corpus_df['column'][1], corpus_df),5))
print('Third sentence:',round(tf_idf(QUERY_WORDS[0], corpus_df['column'][2], corpus_df)+tf_idf(QUERY_WORDS[1], corpus_df['column'][2], corpus_df),5)) 
print('-------------------')

TF score of each sentence in the corpus for each QUERY_WORDS:
First sentence with dartmouth: 0.04167
Second sentence with dartmouth: 0.05263
Third sentence with dartmouth: 0.02128
First sentence with 250th: 0.04167
Second sentence with 250th: 0.0
Third sentence with 250th: 0.02128
--------------------
IDF score of entire corpus for each QUERY_WORDS:
IDF score for dartmouth: 1.0
IDF score for 250th: 1.40547
--------------------
TF-IDF score of each sentence in the corpus for each QUERY_WORDS:
First sentence with dartmouth: 0.04167
Second sentence with dartmouth: 0.05263
Third sentence with dartmouth: 0.02128
First sentence with 250th: 0.05856
Second sentence with 250th: 0.0
Third sentence with 250th: 0.0299
-------------------
Sum of TF-IDF score for each sentence in entire corpus:
First sentence: 0.10023
Second sentence: 0.05263
Third sentence: 0.05118
-------------------


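From the results, the first sentence has the highest summed TF-IDF score (0.10023): it contains both QUERY_WORDS and is fairly short. The second and third sentences score about the same (0.05263 and 0.05118) for different reasons: the second contains only 'dartmouth' but is the shortest sentence, while the third contains both words but is more than twice as long, which lowers its term frequencies.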
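
The summed scores can be cross-checked by hand. The TF values above are consistent with sentence lengths of 24, 19 and 47 tokens (an inference from 1/24 ≈ 0.04167, 1/19 ≈ 0.05263 and 1/47 ≈ 0.02128, not a count taken from the text), with 'dartmouth' occurring once in every sentence and '250th' once in the first and third. Under those assumptions:

```python
from math import log

lengths = {"a": 24, "b": 19, "c": 47}  # inferred token counts per sentence
counts = {
    "dartmouth": {"a": 1, "b": 1, "c": 1},
    "250th": {"a": 1, "b": 0, "c": 1},
}

def idf(term):
    # 1 + log(number of sentences / number of sentences containing the term)
    containing = sum(1 for c in counts[term].values() if c > 0)
    return 1 + log(len(lengths) / containing)

# summed TF-IDF of both query words for each sentence
totals = {}
for key, n in lengths.items():
    totals[key] = round(sum(counts[t][key] / n * idf(t) for t in counts), 5)

print(totals)
```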

## Problem 3: Chart Data from Webpage (20 pts)

Sometimes you may come across data in charts. In this problem, you need to write code to download the weekend box office chart from [Rotten Tomatoes](https://www.rottentomatoes.com/browse/box-office/), parse the web page, load the chart into a pandas dataframe, and print out the chart.

Packages you may use: `requests`, `pandas`, `BeautifulSoup`

In [5]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# write your code here
url = "https://www.rottentomatoes.com/browse/box-office/"
response=requests.get(url)
html=response.content
tables = pd.read_html(html)
tables[0]

Unnamed: 0,THIS WEEK,LAST WEEK,T-METER,TITLE,WEEKS RELEASED,WEEKEND GROSS,TOTAL GROSS,THEATER AVERAGE,# OF THEATERS
0,1,--,33%,The Curse of La Llorona,1,$26.5M,$26.5M,--,--
1,2,1,90%,Shazam!,3,$17.3M,$17.3M,--,--
2,3,--,64%,Breakthrough,1,$11.1M,$11.1M,--,--
3,4,6,78%,Captain Marvel,7,$9.1M,$9.1M,--,--
4,5,2,46%,Little,2,$8.5M,$8.5M,--,--
5,6,5,47%,Dumbo,4,$6.8M,$6.8M,--,--
6,7,4,58%,Pet Sematary,3,$4.9M,$4.9M,--,--
7,8,9,89%,Missing Link,2,$4.4M,$4.4M,--,--
8,9,7,94%,Us,5,$4.3M,$4.3M,--,--
9,10,3,14%,Hellboy,2,$3.9M,$3.9M,--,--
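
The chart values are loaded as strings such as `$26.5M`, `33%` and `--`. If numeric analysis is wanted afterwards, they need to be converted first. The helpers below are a hypothetical sketch: `parse_gross` and `parse_percent` are illustrative names, and the `K` and `B` suffixes are assumptions, since only `M` appears in the chart above.

```python
def parse_gross(value):
    """Convert strings such as '$26.5M' to a number of dollars; return None for '--'."""
    if value == "--":
        return None
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}  # K and B are assumed, only M is observed
    number, suffix = value.lstrip("$"), ""
    if number[-1] in multipliers:
        number, suffix = number[:-1], number[-1]
    return float(number) * multipliers.get(suffix, 1)

def parse_percent(value):
    """Convert strings such as '33%' to an integer; return None for '--'."""
    return None if value == "--" else int(value.rstrip("%"))

print(parse_gross("$26.5M"), parse_percent("33%"), parse_gross("--"))
```

With the dataframe from the cell above, one possible usage would be `tables[0]['WEEKEND GROSS'].map(parse_gross)`.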
