# CS 315 Project Data Collection

## Marisa Papagelis and Natalie Reid

In this notebook, we will collect data for our project. At the most basic level, we are scraping Facebook pages for post content (text, likes, and comments). Additional information will be added to this section as our project progresses.

**All data was scraped from Reuters, MSNBC, and FoxNews Facebook pages on October 6, 2020 between 11am and 3pm EDT.** 

We will complement this notebook with Exploring_Data_CS315_Project_MarisaPapagelis_NatalieReid, a notebook for our data exploration.

## Part I: Initial Data Collection

For our data collection, we modified code created by Junita Sirait in order to scrape Facebook news source pages. We have separated our data collection into three parts, one for each news source, below. We used Selenium and a few other Python libraries (BeautifulSoup, Pandas, etc.): Selenium to automate scrolling through our Facebook pages, and BeautifulSoup to parse each page and collect our data.

We have noted the modifications we made to Junita's code as we run through it. Her code was a helpful starting point: she created it as a teaching opportunity for our CS315 class, and her process aligned with our goal.

In [90]:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import json

## Part 1: Scraping Reuters Facebook page


In Part 1, we scrape the Reuters Facebook page. We will repeat this process for each of our two additional news sources. The process is identical except for the amount of scrolling Selenium does, which we adjust to account for post length.

### Scraping data with Selenium

First, we needed to use chromedriver to open a Chrome browser, and then a Facebook page, that we could manipulate and scroll through using Selenium. After reaching Facebook in our Chrome browser, we manually logged into Facebook and navigated to the Reuters Facebook page. We understand that this could be done through our Python code, but for simplicity in our trial exploration, we navigated manually.

In [19]:
driver = webdriver.Chrome(executable_path='Downloads/chromedriver') # this will open a new page of Google Chrome

In [20]:
driver.get("https://facebook.com") #manually logged into FB and navigated to Reuters FB page

Next, we used Selenium to scroll down the Reuters Facebook page and collect posts.
We modified the amount of scrolling (the loop bound on `i` in the cell below) to give Selenium enough time to load our ideal number of posts (~750).

In [21]:
i = 0
while i < 300: 
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            i+=1
            break
        last_height = new_height
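Since the same scrolling cell is reused for every page with only the loop bound changed, it can be wrapped in a function. The following is a minimal sketch, not the code we ran: the name `scroll_until_stable` and its parameters `max_rounds` and `pause` are hypothetical, and it assumes `driver` is any object with Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, max_rounds, pause=0.5):
    """Scroll to the bottom repeatedly.

    A "round" ends each time the page height stops growing after a scroll,
    which mirrors the `i += 1` step in the cell above.
    """
    rounds = 0
    while rounds < max_rounds:
        # Height before this round of scrolling
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            # Scroll down to the bottom and wait for new posts to load
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                rounds += 1
                break
            last_height = new_height
    return rounds
```

With this helper, the Reuters cell would reduce to `scroll_until_stable(driver, 300)`, and the other pages would only change the second argument.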

Finally, we saved the page as an HTML file to be used for future parsing. We note that the encoding parameter is important because the HTML page contains characters that are encoded as Unicode (e.g., emoji).

In [22]:
with open("Reuters.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

## Parsing the file

In this section, we use BeautifulSoup along with some modified helper functions to parse the data from our saved HTML file. We inspected our Facebook page in our browser to find the class names for the post text, the likes, and the comments. We then used BeautifulSoup's `.find` method to pull the appropriate content from each post. We further modified Junita's original getLikesCommentsShares() function into our own getLikesComments() function to pull only the information that we needed (likes and comments). We also split the comments into a list of two strings for easier manipulation in the future, and we got rid of the links because we didn't need them for our exploration.

In [23]:
soup= BeautifulSoup(driver.page_source, "html.parser")

The two helper functions that we modified, as described above, are below. 

In [24]:
def getPostText(content): 
    """A helper function to extract the text of each post.
    It's possible that class names might change over time.
    """
    post_el = content.find("div", 
                           class_="_5pbx userContent _3576")
    if post_el: 
        return post_el.text
    else: 
        return np.nan

In [25]:
def getLikesComments(article):
    """
    A helper function to extract the statistics of engagement for each post.
    """
    likes = 0 
    comments = 0
    
    # gather likes 
    likes_el = article.find("span", class_="_81hb")
    if likes_el: 
        likes = likes_el.text 
        
    # gather comments
    comments_el = article.find("a", class_="_3hg- _42ft")
    if comments_el:
        comments = comments_el.text
            
    return [likes, comments.split()]
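The likes and comment counts come back as display strings such as `1.2K`, so they need converting before any arithmetic. Below is a minimal sketch of one way to do that. The helper name `parse_count` is hypothetical, and support for an `M` suffix is an assumption, since only plain numbers and `K` values appear in our scraped output.

```python
def parse_count(label):
    """Convert a scraped count such as '576', '1.2K', or '2M' to an int."""
    multipliers = {"K": 1_000, "M": 1_000_000}  # 'M' is assumed, not observed
    text = str(label).strip().replace(",", "")
    if not text:
        return 0
    suffix = text[-1].upper()
    if suffix in multipliers:
        # round() guards against float artifacts such as 1199.9999
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(float(text))
```

For a comments value like `["106", "Comments"]`, the count would be `parse_count(comments[0])`.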

### Extracting the information from each post

Here, we inspected our Reuters Facebook page to find the class of the div containing an entire post, so we could get information from each post on the page (rather than a singular post). We took some of this code from Junita's example, but we modified it to only collect post text, likes, and comments. 

In [26]:
post_data = []
for article in soup.find_all("div", class_="_5pcr userContentWrapper"):
    try:
        post_text = getPostText(article)
        likes, comments = getLikesComments(article)
        post_data.append((post_text, likes, comments))
    except AttributeError: 
        pass
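One side effect of this loop is worth noting: when a post has no comment link, `comments` keeps its default integer value `0`, and calling `.split()` on an integer raises `AttributeError`, so the `except` branch silently skips that post. The sketch below illustrates the behavior with two hypothetical helpers (`would_be_skipped` and `split_comments`, neither part of our notebook); whether such posts should be kept is a design choice we leave open.

```python
def would_be_skipped(comments):
    """Mirror the loop above: True when comments.split() raises AttributeError."""
    try:
        comments.split()
    except AttributeError:
        return True
    return False


def split_comments(comments):
    """Alternative that never raises: convert to str before splitting."""
    return str(comments).split()


print(would_be_skipped(0))               # default value when no comment link is found
print(would_be_skipped("106 Comments"))  # a normal comment label
print(split_comments(0))                 # keeps the post, with a count of "0"
```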

### Save results into a pandas DataFrame

Next, we save our results as a data frame so it can be used later in exploration.

In [27]:
reuters_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [28]:
reuters_df # view data frame

Unnamed: 0,text,likes,comments
0,Southwest Airline <LUV.N> can avoid furloughs ...,6,"[2, Comments]"
1,The U.S. Securities and Exchange Commission (S...,34,"[3, Comments]"
2,The U.S. Senate Judiciary Committee announced ...,576,"[106, Comments]"
3,U.S. commercial bankruptcy filings are up 33% ...,152,"[54, Comments]"
4,Democratic presidential candidate Joe Biden ap...,1.2K,"[222, Comments]"
...,...,...,...
1086,"How murder, kidnappings and miscalculation set...",83,"[12, Comments]"
1087,Australia's competition regulator said the fed...,65,"[3, Comments]"
1088,Semiconductor company Mellanox Technologies ha...,58,"[1, Comment]"
1089,"The U.S. Senate, rushing to meet a looming dea...",95,"[39, Comments]"


### Save results into a JSON

Finally, we save our results as a JSON file to be used later in exploration. This is so we can easily access our data in another notebook used solely for exploration. 

In [29]:
with open('reuters-posts.json', 'w') as f:
    json.dump(post_data, f)
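When the exploration notebook reads this file back, two details of the JSON round trip matter: tuples come back as lists, and missing post text (`np.nan`) is written as `NaN` and loads as a float. The following is a small sketch of that behavior. The rows in `sample_posts` are made-up stand-ins shaped like `post_data`, and the temporary file path is only for illustration.

```python
import json
import math
import os
import tempfile

# Stand-in rows shaped like post_data; the values are made up for illustration.
sample_posts = [
    ("Example post text", "1.2K", ["222", "Comments"]),
    (float("nan"), "6", ["2", "Comments"]),  # a post with no text (np.nan upstream)
]

path = os.path.join(tempfile.mkdtemp(), "sample-posts.json")

# Write with a context manager so the file is closed (and flushed) promptly
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_posts, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(type(loaded[0]))           # rows load as lists, not tuples
print(math.isnan(loaded[1][0]))  # missing text loads as float nan
```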

In order to view the JSON, uncomment the cell below. It is left commented for an easier to navigate notebook.

In [24]:
# reuters

## Part 2: Scraping MSNBC Facebook page


In Part 2, we will scrape the MSNBC Facebook page. We will perform the same procedure as we did for Reuters, except we will adjust the scroll time to account for MSNBC's page. To avoid redundancy, we shorten our explanations unless they differ from those in Part 1. Refer back to Part 1 for motivation and code adjustments.

### Scraping data with Selenium


First, we open a Chrome browser and manually navigate to the MSNBC Facebook page.

In [15]:
driver = webdriver.Chrome(executable_path='Downloads/chromedriver') # this will open a new page of Google Chrome

In [16]:
driver.get("https://facebook.com") # manually logged into FB and navigated to MSNBC FB page

Next, we use Selenium to scroll through the page. Since MSNBC has longer posts and a larger following, the page takes more time to load than the Reuters page did. We will increase the scroll time to account for this, so we can retrieve our ideal number of posts (~750).

In [33]:
i = 0
while i < 350: 
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            i+=1
            break
        last_height = new_height

Finally, we saved the file as an HTML page.

In [34]:
with open("MSNBC.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

## Parsing the file

In this section, we use BeautifulSoup and some modified helper functions to extract information from posts on MSNBC.

In [35]:
soup= BeautifulSoup(driver.page_source, "html.parser")

In [36]:
def getPostText(content): 
    """A helper function to extract the text of each post.
    It's possible that class names might change over time.
    """
    post_el = content.find("div", 
                           class_="_5pbx userContent _3576")
    if post_el: 
        return post_el.text
    else: 
        return np.nan

In [37]:
def getLikesComments(article):
    """
    A helper function to extract the statistics of engagement for each post.
    """
    likes = 0 
    comments = 0
    
    # gather likes 
    likes_el = article.find("span", class_="_81hb")
    if likes_el: 
        likes = likes_el.text 
        
    # gather comments
    comments_el = article.find("a", class_="_3hg- _42ft")
    if comments_el:
        comments = comments_el.text
            
    return [likes, comments.split()]

### Extracting the information from each post

Here, we extract the same information (likes and comments) from each post on MSNBC.

In [38]:
post_data = []
for article in soup.find_all("div", class_="_5pcr userContentWrapper"):
    try:
        post_text = getPostText(article)
        likes, comments = getLikesComments(article)
        post_data.append((post_text, likes, comments))
    except AttributeError: 
        pass

### Save results into a pandas DataFrame

Next, we save our results as a data frame so it can be used later in exploration.

In [39]:
MSNBC_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"]) 

In [40]:
MSNBC_df # view data frame

Unnamed: 0,text,likes,comments
0,"""I still don't even know how to fully characte...",299,"[135, Comments]"
1,Joy Reid on President Trump removing his mask ...,1.3K,"[522, Comments]"
2,"PLAN YOUR VOTE: Multiple states, such as Flori...",10,"[11, Comments]"
3,BREAKING: President Trump returns to the White...,5.4K,"[2.4K, Comments]"
4,WATCH: President Trump arrives at White House ...,914,"[881, Comments]"
...,...,...,...
1393,What do you know about the first ever presiden...,21,"[13, Comments]"
1394,Rep. Ocasio-Cortez stands by her use of the te...,5.3K,"[2K, Comments]"
1395,LIVE: Senate confirmation hearing for UN ambas...,651,"[1.5K, Comments]"
1396,BREAKING: An international team of investigato...,268,"[65, Comments]"


### Save results into a JSON

Finally, we save our results as a JSON file to be used later in exploration.

In [41]:
with open('MSNBC-posts.json', 'w') as f:
    json.dump(post_data, f)

In order to view the JSON, uncomment the cell below.

In [10]:
# MSNBC

## Part 3: Scraping FOX News Facebook page

In Part 3, we will scrape the Fox News Facebook page. We will perform the same procedure as we did for Reuters and MSNBC, except we will adjust the scroll time to account for Fox News' page. To avoid redundancy, we shorten our explanations unless they differ from those in Part 1. Refer back to Part 1 for motivation and code adjustments.

### Scraping data with Selenium

First, we open a Chrome browser and manually navigate to the Fox News Facebook page.

In [16]:
driver = webdriver.Chrome(executable_path='Downloads/chromedriver') # this will open a new page of Google Chrome

In [17]:
driver.get("https://facebook.com") #manually logged into FB and navigated to Fox News FB page

Next, we use Selenium to scroll through the page. Since Fox News has longer posts and a larger following than both Reuters and MSNBC, the page takes more time to load than the Reuters and MSNBC pages did. We will increase the scroll time to account for this.

In [19]:
i = 0
while i < 750: 
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            i+=1
            break
        last_height = new_height

Finally, we saved the file as an HTML page.

In [8]:
with open("FoxNews.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

## Parsing the file

In this section, we use BeautifulSoup and some modified helper functions to extract information from posts on Fox News. 

In [9]:
soup= BeautifulSoup(driver.page_source, "html.parser")

In [10]:
def getPostText(content): 
    """A helper function to extract the text of each post.
    It's possible that class names might change over time.
    """
    post_el = content.find("div", 
                           class_="_5pbx userContent _3576")
    if post_el: 
        return post_el.text
    else: 
        return np.nan

In [11]:
def getLikesComments(article):
    """
    A helper function to extract the statistics of engagement for each post.
    """
    likes = 0
    comments = 0
    
    # gather likes 
    likes_el = article.find("span", class_="_81hb")
    if likes_el: 
        likes = likes_el.text 
        
    # gather comments
    comments_el = article.find("a", class_="_3hg- _42ft")
    if comments_el:
        comments = comments_el.text
            
    return [likes, comments.split()]

### Extracting the information from each post

Here, we extract the same information (likes and comments) from each post on Fox News.

In [12]:
post_data = []
for article in soup.find_all("div", class_="_5pcr userContentWrapper"):
    try:
        post_text = getPostText(article)
        likes, comments = getLikesComments(article)
        post_data.append((post_text, likes, comments))
    except AttributeError: 
        pass

### Save results into a pandas DataFrame

Next, we save our results as a data frame so it can be used later in exploration.

In [50]:
FoxNews_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [51]:
FoxNews_df # view data frame

Unnamed: 0,text,likes,comments
0,JUST IN: President Donald J. Trump plans to ta...,28K,"[5.1K, Comments]"
1,A new report reveals that despite some states ...,10K,"[1.4K, Comments]"
2,President Donald J. Trump arrives at the White...,62K,"[17K, Comments]"
3,President Donald J. Trump departs from Walter ...,30K,"[6.7K, Comments]"
4,President Donald J. Trump leaves Walter Reed M...,30K,"[11K, Comments]"
...,...,...,...
746,House Speaker Nancy Pelosi has a bill enrollme...,12K,"[12K, Comments]"
747,President Donald J. Trump pays his respects at...,127K,"[49K, Comments]"
748,President Donald J. Trump delivers remarks in ...,110K,"[42K, Comments]"
749,Oregon Governor Kate Brown holds a press confe...,2.2K,"[4.6K, Comments]"


### Save results into a JSON

Finally, we save our results as a JSON file to be used later in exploration.

In [15]:
with open('FoxNews-posts.json', 'w') as f:
    json.dump(post_data, f)

In order to view the JSON, uncomment the cell below.

In [14]:
# FoxNews

## Part II: Classifier Data Collection

In order to see how accurate our classifier would be on news sources other than the three we trained it on, we chose six more news sources to run it on: two historically neutral sources (AP and Bloomberg), two left-wing sources (Wonkette and CNN), and two right-wing sources (InfoWars and The Washington Times). *Both InfoWars and Wonkette are historically less reliable and contain propagated information, so we decided to include these sources to analyze what happens.* We made these selections using the Interactive Media Bias Chart (https://www.adfontesmedia.com/interactive-media-bias-chart-2/), which we also referred to when choosing our initial three sources and during our Literature Review.

We decided to scrape 200 of the most recent posts from each of our six news sources and run them through the classifier to test its accuracy. Since our classifier was trained to work for any post, time does not need to be controlled in this step.

**Below, we scrape 6 news sources for ~200 posts. Since the scraping process was thoroughly described in Part I, there isn't much explanation done below. We run the same bits of code on each appropriate news source. Refer to Part I for full documentation.**

### Scraping data with Selenium

First, we open a Chrome browser and manually navigate to the appropriate Facebook page.

In [91]:
driver = webdriver.Chrome(executable_path='Downloads/chromedriver') # this will open a new page of Google Chrome

In [92]:
driver.get("https://facebook.com") #manually logged into FB and navigated to appropriate FB page

Next, we use Selenium to scroll through the page. We adjust the scroll time appropriately, taking into account post length and page popularity, for each particular page.

In [93]:
i = 0
while i < 80: 
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            i+=1
            break
        last_height = new_height

Finally, we save the file to its appropriate HTML page.

In [17]:
with open("AP.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

In [26]:
with open("Bloomberg.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

In [35]:
with open("Wonkette.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

In [53]:
with open("CNN.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

In [82]:
with open("DailyCaller.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

In [94]:
with open("WashTimes.html", "w", encoding='utf-8') as file:
    file.write(driver.page_source)

## Parsing the file

In this section, we use BeautifulSoup and some modified helper functions to extract information from each appropriate page. 

In [95]:
soup= BeautifulSoup(driver.page_source, "html.parser")

In [96]:
def getPostText(content): 
    """A helper function to extract the text of each post.
    It's possible that class names might change over time.
    """
    post_el = content.find("div", 
                           class_="_5pbx userContent _3576")
    if post_el: 
        return post_el.text
    else: 
        return np.nan

In [97]:
def getLikesComments(article):
    """
    A helper function to extract the statistics of engagement for each post.
    """
    likes = 0
    comments = 0
    
    # gather likes 
    likes_el = article.find("span", class_="_81hb")
    if likes_el: 
        likes = likes_el.text 
        
    # gather comments
    comments_el = article.find("a", class_="_3hg- _42ft")
    if comments_el:
        comments = comments_el.text
            
    return [likes, comments.split()]

### Extracting the information from each post

Here, we extract the same information (likes and comments) from each post on the appropriate page.

In [98]:
post_data = []
for article in soup.find_all("div", class_="_5pcr userContentWrapper"):
    try:
        post_text = getPostText(article)
        likes, comments = getLikesComments(article)
        post_data.append((post_text, likes, comments))
    except AttributeError: 
        pass

### Save results into a pandas DataFrame

Next, we save our results as a data frame so it can be used later in exploration.

In [22]:
AP_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [31]:
Bloomberg_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [40]:
Wonkette_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [58]:
CNN_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [87]:
DailyCaller_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

In [99]:
WashTimes_df = pd.DataFrame(post_data, columns=["text", "likes", "comments"])

We can use the cell below to view each appropriate data frame to ensure the post count is ~200. 

In [100]:
WashTimes_df # view data frame

Unnamed: 0,text,likes,comments
0,“Given the president’s refusal to participate ...,51,"[31, Comments]"
1,Americans got a dose of politics and debate as...,0,"[4, Comments]"
2,“Just as PETA sent humane bug catchers to cand...,42,"[14, Comments]"
3,The sheriff’s department stuck an undercover o...,298,"[58, Comments]"
4,The debate reportedly cut out for nearly three...,403,"[107, Comments]"
...,...,...,...
667,Tulsa Athletic announced it will no longer pla...,231,"[103, Comments]"
668,President Trump on Wednesday said top Obama-er...,1.5K,"[301, Comments]"
669,"In an appearance on CNN, NASCAR’s only black d...",391,"[294, Comments]"
670,A 24-year-old real estate investment CEO has w...,243,"[60, Comments]"
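
If a scrape returns more posts than the ~200 target, the list can be trimmed before saving. The following is a minimal sketch under one assumption: the scraped order is newest first, which pinned posts on a page could violate. The helper name `trim_posts` and its `target` parameter are hypothetical.

```python
def trim_posts(posts, target=200):
    """Keep at most `target` posts from the front of the list.

    Returns the kept posts and the number of posts dropped.
    """
    kept = posts[:target]
    return kept, len(posts) - len(kept)
```

For example, `kept, dropped = trim_posts(post_data)` leaves `post_data` itself unchanged, so the full scrape can still be saved separately if needed.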


### Save results into a JSON

Finally, we save our results as a JSON file to be used later in exploration.

In [24]:
with open('AP-posts.json', 'w') as f:
    json.dump(post_data, f)

In [33]:
with open('Bloomberg-posts.json', 'w') as f:
    json.dump(post_data, f)

In [42]:
with open('Wonkette-posts.json', 'w') as f:
    json.dump(post_data, f)

In [60]:
with open('CNN-posts.json', 'w') as f:
    json.dump(post_data, f)

In [89]:
with open('DailyCaller-posts.json', 'w') as f:
    json.dump(post_data, f)

In [101]:
with open('WashTimes-posts.json', 'w') as f:
    json.dump(post_data, f)
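
Because the six save cells differ only in the file name, they could be replaced by one small helper. This is a sketch rather than the code we ran: `save_posts` and its `out_dir` parameter are hypothetical names, and the file naming simply follows the `<source>-posts.json` pattern used above.

```python
import json
import os


def save_posts(post_data, source, out_dir="."):
    """Write post_data to '<source>-posts.json' in out_dir and return the path."""
    path = os.path.join(out_dir, f"{source}-posts.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(post_data, f)
    return path
```

After scraping and parsing a page, a single call such as `save_posts(post_data, "WashTimes")` would then stand in for the matching cell.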