# Assignment 3. Web Scraping

### Due: Feb 20th (Saturday) 11:59 pm

## Introduction

In this assignment, you will collect data from several sources using the scraping techniques covered in class, and then run some preliminary analysis on the extracted data. A bare code skeleton is provided; you may use it, but you don't have to.

Please populate this Jupyter notebook with your code and embedded results (outputs, figures, etc.) and submit it on Canvas.

### Data / Libraries

You will be using the feedparser, requests, and beautifulsoup libraries for your scraping.

### Useful libraries for this assignment

- requests
- beautifulsoup
- feedparser
- NLTK


### Q1 (20 points)



Use the feedparser library to extract the text of the top 3 stories from each of the CNN and Fox News RSS links provided below. Capture only the text; no HTML content should be captured. Print the first 10 words of each story. Save these 6 stories, as you will be using them later in this assignment.
<br>
CNN: http://rss.cnn.com/rss/cnn_latest.rss
<br>
Fox: http://feeds.foxnews.com/foxnews/latest
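
The "text only, no HTML" requirement and the "first 10 words" step can be sketched on a made-up fragment. This is only an illustration: it uses the standard library's `HTMLParser` as a stand-in for BeautifulSoup's `get_text` so that it runs without extra installs, and the names `TextOnly`, `strip_html`, `first_words`, and `sample_html` are hypothetical, not part of the assignment skeleton.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def strip_html(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


def first_words(text, k=10):
    return " ".join(text.split()[:k])


# Made-up fragment standing in for one scraped story.
sample_html = (
    "<div><p>Officials said <b>on Monday</b> that the storm had passed.</p>"
    "<p>Power is back for most homes in the region.</p></div>"
)
story = strip_html(sample_html)
print(first_words(story))
```

On a real page this simple parser would also keep the contents of `<script>` and `<style>` elements, which is one reason to select only the article's paragraph elements before extracting text.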

In [150]:
import feedparser
import pprint
import requests
from bs4 import BeautifulSoup
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jialing/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [60]:
def get_articles(rss_feeds):
    
    def get_article(rss_feed, div_class, n=3):
        # Collect the text of the first n feed entries whose page contains
        # at least one element with the given CSS class.
        fp = feedparser.parse(rss_feed)
        entries = fp['entries']
        articles = []
        i = 0
        while n > 0 and i < len(entries):
            url_link = entries[i]["link"]
            print(url_link)
            page = requests.get(url_link)
            content = page.content
            soup = BeautifulSoup(content, 'html5lib')
            text_divs = soup.find_all(class_=div_class)
            print(len(text_divs))
            if text_divs:
                # get_text drops the tags, so only the story text is kept
                articles.append(' '.join([d.get_text(" ") for d in text_divs]))
                n -= 1
            i += 1
        return articles
    
    cnn_articles = get_article(rss_feeds[0], "zn-body__paragraph")
    fox_articles = get_article(rss_feeds[1], "speakable")
    return cnn_articles,fox_articles

# write your code here
cnn_articles, fox_articles = get_articles(["http://rss.cnn.com/rss/cnn_latest.rss", "http://feeds.foxnews.com/foxnews/latest"])
articles = cnn_articles + fox_articles

http://rss.cnn.com/~r/rss/cnn_latest/~3/u_3RFRprNek/index.html
44
http://rss.cnn.com/~r/rss/cnn_latest/~3/BZxNyEn6k4k/index.html
35
http://rss.cnn.com/~r/rss/cnn_latest/~3/ETLhksVqyzw/index.html
11
https://www.foxnews.com/politics/burgess-owens-biden-immigration-push-defeat-middle-class-american-dream
3
https://www.foxnews.com/media/chuck-todd-comment-texas-winter-storm-wear-mask
3
https://www.foxnews.com/opinion/tucker-carlson-mainstream-media-love-affair-with-joe-biden
3


In [61]:
articles

[' (CNN) Concerns are growing over the potential for violence in Myanmar as  anti-coup protesters  urge supporters to take to the streets en masse, in defiance of a  growing military presence .  Protests  and civil disobedience movements  against the military coup  on February 1 have been the biggest in decades, with thousands calling for  ousted  civilian leader Aung San Suu Kyi to be released from detention and  power be handed  back to civilian control.  A mas rally in downtown Yangon was called for Wednesday, after a second charge was  filed against  Suu Kyi Tuesday. Her lawyer Khin Maung Zaw said that she had been charged in relation to a national disaster law, in addition to an earlier count under the country\'s import and export act.  "Let\'s gather in millions to take down the dictators," activist Khin Sandar wrote on Facebook, according to Reuters.  Kyi Toe, a senior member of Suu Kyi\'s National League for Democracy (NLD) party who has not been arrested, said: "Let\'s march e

### Q2 (20 points)
Take the 6 news articles from Q1 and do the following to each one:
(1) Clean them by lowercasing, removing stopwords, removing punctuation, and stemming with PorterStemmer
(2) Save each of these 6 cleaned articles, as you will be using them later in the assignment.

Next, combine the 6 cleaned articles to create one document. Use NLTK to extract the top 20 unigrams, top 7 bigrams and top 3 trigrams from this joint document. Print these 30 ngrams.

Save these ngrams as you will be using them later.
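
As a rough illustration of what "top K n-grams" means, the sketch below counts sliding windows of tokens with the standard library's `Counter` instead of NLTK. The token list is made up and `top_ngrams` is a hypothetical helper name.

```python
from collections import Counter


def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams (as tuples) in a token list."""
    # Zipping n staggered views of the list yields each window of n tokens.
    windows = zip(*(tokens[i:] for i in range(n)))
    return [gram for gram, _ in Counter(windows).most_common(k)]


# Made-up, already-cleaned tokens (lowercased, stopwords removed, stemmed).
tokens = "human right group said human right watch report".split()

print(top_ngrams(tokens, 1, 2))  # the two most frequent unigrams
print(top_ngrams(tokens, 2, 1))  # the most frequent bigram
```

Representing each n-gram as a tuple, including 1-tuples for unigrams, matches the shape that `nltk.ngrams` produces, so the results of both approaches can be compared directly.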

In [69]:
import nltk
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk import PorterStemmer, word_tokenize

porter = PorterStemmer()
sws = stopwords.words('english')


def clean_article(article):
    # write your code here
    cleaned_text = []
    words = word_tokenize(article)
    
    for w in words:
        if w.isalnum():
            w = w.lower()
            if w not in sws:
                cleaned_text.append(porter.stem(w))
    return ' '.join(cleaned_text)

# save 6 cleaned articles
cleaned_articles = []
for a in articles:
    cleaned_articles.append(clean_article(a))
    
# combine into 1 article
combined_article = ' '.join(cleaned_articles)



from nltk import ngrams

def extract_top_K_ngram(article, N, top_K):
    # write your code here
    n_grams = list(ngrams(article.split(' '), N))
    freq=nltk.FreqDist(n_grams)
    top_K_ngrams = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:top_K]
    top_K_ngrams = [t[0] for t in top_K_ngrams]    
    return top_K_ngrams

top_grams = extract_top_K_ngram(combined_article, 1, 20) + extract_top_K_ngram(combined_article, 2, 7) + extract_top_K_ngram(combined_article, 3, 3)
top_grams



[nltk_data] Downloading package punkt to /Users/jialing/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('said',),
 ('biden',),
 ('peopl',),
 ('militari',),
 ('vaccin',),
 ('protest',),
 ('right',),
 ('would',),
 ('myanmar',),
 ('state',),
 ('countri',),
 ('fight',),
 ('one',),
 ('human',),
 ('report',),
 ('week',),
 ('first',),
 ('hous',),
 ('go',),
 ('get',),
 ('human', 'right'),
 ('biden', 'said'),
 ('white', 'hous'),
 ('suu', 'kyi'),
 ('potenti', 'violenc'),
 ('myanmar', 'militari'),
 ('light', 'infantri'),
 ('light', 'infantri', 'divis'),
 ('human', 'right', 'watch'),
 ('nation', 'leagu', 'democraci')]

### Q3 (20 points)

Use NLTK to extract the unigrams, bigrams and trigrams from all six CLEANED articles from Q2. Next, create a dataframe where the rows correspond to the six articles (you can use the first 5 words of each article as its name) and the columns correspond to the 30 ngrams (20 unigrams, 7 bigrams, and 3 trigrams) extracted in Q2. Each cell should hold the count of that ngram in that article. Print the dataframe.
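
The row-per-article, column-per-ngram layout described above can be sketched without pandas. The documents, column list, and the helper name `ngram_counts` below are made up for illustration; in the assignment the columns would be the 30 ngrams from Q2.

```python
from collections import Counter


def ngram_counts(tokens, max_n=3):
    """Count every 1..max_n-gram in one tokenized document."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical cleaned documents (rows) and target n-grams (columns).
docs = {
    "doc_a": "white hous said biden said".split(),
    "doc_b": "protest myanmar militari".split(),
}
columns = [("said",), ("white", "hous"), ("myanmar", "militari")]

# One row per document; a Counter returns 0 for n-grams it never saw.
matrix = {
    name: [ngram_counts(toks)[c] for c in columns]
    for name, toks in docs.items()
}
for name, row in matrix.items():
    print(name, row)
```

A nested list of rows like this is the kind of input that can be handed to a dataframe constructor, with the document names as the index and the n-gram tuples as the column labels.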

In [115]:
# write your code here
import pandas as pd

ngram_count_matrix = []
for a in cleaned_articles:
    n_grams = list(ngrams(a.split(' '), 1)) + list(ngrams(a.split(' '), 2)) + list(ngrams(a.split(' '), 3))
    freq = nltk.FreqDist(n_grams)
    ngram_count_matrix.append([freq[g] if g in freq else 0 for g in top_grams])

        
df = pd.DataFrame(data=ngram_count_matrix)
df.index = [' '.join(a.strip().split()[:5]) for a in articles]
df.columns = top_grams
df

| | (said,) | (biden,) | (peopl,) | (militari,) | (vaccin,) | (protest,) | (right,) | (would,) | (myanmar,) | (state,) | ... | (human, right) | (biden, said) | (white, hous) | (suu, kyi) | (potenti, violenc) | (myanmar, militari) | (light, infantri) | (light, infantri, divis) | (human, right, watch) | (nation, leagu, democraci) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (CNN) Concerns are growing over | 16 | 0 | 10 | 15 | 0 | 12 | 12 | 1 | 11 | 3 | ... | 9 | 0 | 0 | 5 | 3 | 3 | 3 | 3 | 3 | 2 |
| (CNN) President Joe Biden would | 23 | 16 | 4 | 0 | 8 | 0 | 0 | 9 | 0 | 2 | ... | 0 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| (CNN) President Joe Biden at | 6 | 5 | 2 | 0 | 7 | 0 | 0 | 2 | 0 | 4 | ... | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Utah Republican warns left is | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| At least 10 people have | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The most dishonest institution in | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Q4 (10 points)

Each row in the dataframe generated in Q3 is basically a vector representation of that article. Find the similarity between the articles by measuring the distance between the vectors representing each article. Please use cosine similarity as your similarity metric. You can use the scikit-learn implementation here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html . We will go over different metrics later in the class; for now, just know that cosine similarity is one way to measure distance/similarity between vectors (another being Euclidean distance, which everyone should be familiar with).

Create a 6x6 dataframe where the rows and columns correspond to the articles and the cells show the similarity between each article.

HINT:
HERE IS HOW YOU USE THE COSINE_SIMILARITY FUNCTION:
<br>
x=[1,2,3]
<br>
y=[2,1,1]
<br>
cosine_similarity([x],[y])
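
For intuition about what the function computes, the same quantity can be worked out by hand for the hint's vectors: the dot product divided by the product of the two vector lengths. The sketch below uses only the standard library; `cosine` is a hypothetical helper, and returning 0.0 for an all-zero vector is a convention chosen here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


x = [1, 2, 3]
y = [2, 1, 1]
# dot = 1*2 + 2*1 + 3*1 = 7, |x| = sqrt(14), |y| = sqrt(6)
print(round(cosine(x, y), 4))  # 7 / sqrt(84) -> 0.7638
```

Because the value depends only on the angle between the vectors, two articles with the same mix of ngrams score close to 1 even when one article is much longer than the other.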




In [129]:
from sklearn.metrics.pairwise import cosine_similarity
# write your code here

similarities = [[cosine_similarity([ngram_count_matrix[i]], [ngram_count_matrix[j]])[0, 0] for j in range(6)] for i in range(6)]
df = pd.DataFrame(data=similarities)
df.index = [' '.join(a.strip().split()[:5]) for a in articles]
df.columns = [' '.join(a.strip().split()[:5]) for a in articles]
df

| | (CNN) Concerns are growing over | (CNN) President Joe Biden would | (CNN) President Joe Biden at | Utah Republican warns left is | At least 10 people have | The most dishonest institution in |
|---|---|---|---|---|---|---|
| (CNN) Concerns are growing over | 1.0 | 0.371771 | 0.303363 | 0.092748 | 0.238302 | 0.311625 |
| (CNN) President Joe Biden would | 0.371771 | 1.0 | 0.753885 | 0.417135 | 0.123665 | 0.103054 |
| (CNN) President Joe Biden at | 0.303363 | 0.753885 | 1.0 | 0.466667 | 0.316228 | 0.158114 |
| Utah Republican warns left is | 0.092748 | 0.417135 | 0.466667 | 1.0 | 0.0 | 0.0 |
| At least 10 people have | 0.238302 | 0.123665 | 0.316228 | 0.0 | 1.0 | 0.5 |
| The most dishonest institution in | 0.311625 | 0.103054 | 0.158114 | 0.0 | 0.5 | 1.0 |


### Q5 (30 points + 10 points bonus)

Use your web scraping knowledge to extract the top 5 comments from the following forum page: "https://www.vgr.com/forum/topic/8467-racing-wheels/"
<br>
Next, use VADER to extract the sentiment of each of the comments. Create a dataframe with one column corresponding to the "text" and 4 other columns corresponding to the output of VADER (neutral, positive, negative, and compound).

For bonus points, also extract the username and time of the posts.

In [197]:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

url="https://www.vgr.com/forum/topic/8467-racing-wheels/"

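sid = SentimentIntensityAnalyzer()
# keys of the dict returned by sid.polarity_scores, in output column order
pola_keys = ['neg', 'neu', 'pos', 'compound']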


#write your code here
page=requests.get(url)
content=page.content
soup=BeautifulSoup(content,'html5lib')
article_tags=soup.find_all('article')

data = []
for article in article_tags[:5]:
    post_text = article.find('div', {'data-role': 'commentContent'}).get_text(" ").strip()
    username = article.find('aside').find(class_= "ipsType_break").get_text(" ").strip()
    post_time = article.find('time')['datetime']
    senti = sid.polarity_scores(post_text)
    data.append([username, post_time] + [senti[k] for k in pola_keys])
    

df = pd.DataFrame(data=data)
df.index = ["comment1", "comment2", "comment3", "comment4", "comment5"]
df.columns = ['username', 'post_time'] + pola_keys
df

| | username | post_time | neg | neu | pos | compound |
|---|---|---|---|---|---|---|
| comment1 | Kane99 | 2020-10-16T02:37:16Z | 0.017 | 0.907 | 0.076 | 0.7278 |
| comment2 | m76 | 2020-10-16T09:19:35Z | 0.0 | 0.818 | 0.182 | 0.8316 |
| comment3 | Patrik | 2021-02-10T18:49:04Z | 0.0 | 0.832 | 0.168 | 0.6432 |
| comment4 | Kane99 | 2021-02-11T18:52:55Z | 0.019 | 0.776 | 0.205 | 0.9633 |
| comment5 | Patrik | 2021-02-11T19:34:03Z | 0.047 | 0.742 | 0.211 | 0.9545 |
