# DATS 6103 - Graham Hulsey - Project 2 

# Part 1 - Web Scraping

In this project, I will use BeautifulSoup (bs4) to scrape news headlines from several major news outlets. Then, I will analyze the data for trends in which stories get covered, which outlets cover which stories, how coverage changes over time, and how the media react to certain events.

In this part, I will scrape headline text from news organizations, create a dataframe, and then save that dataframe for future analysis.

Here are the news outlets I will analyze, and their URLs.

1. New York Times https://www.nytimes.com/
2. Washington Post https://www.washingtonpost.com/
3. CNBC https://www.cnbc.com
4. Al-Jazeera https://www.aljazeera.com/
5. BBC https://bbc.com/news
6. China Daily https://global.chinadaily.com.cn/
7. Fox News https://www.foxnews.com/
8. Mehr News https://en.mehrnews.com/
9. The Atlantic https://www.theatlantic.com/
10. Buzzfeed https://www.buzzfeed.com/
11. New Yorker https://www.newyorker.com/
12. Mother Jones https://www.motherjones.com/

In [1]:
# Get imports 
from bs4 import BeautifulSoup
import urllib.request as url
import pandas as pd
import requests
from datetime import date

In [2]:
# Create dictionary of outlets and their urls
sources = {"NYT":"https://www.nytimes.com/","WaPo":"https://www.washingtonpost.com/",
          "CNBC":"https://www.cnbc.com","Al-Jazeera":"https://www.aljazeera.com/",
          "BBC":"https://bbc.com/news","China Daily":"https://global.chinadaily.com.cn/",
          "Fox":"https://www.foxnews.com/",
          "Mehr":"https://en.mehrnews.com/","The Atlantic":"https://www.theatlantic.com/",
          "Buzzfeed":"https://www.buzzfeed.com/","New Yorker":"https://www.newyorker.com/",
          "Mother Jones":"https://www.motherjones.com/"}

In [3]:
# Create a dictionary of empty lists to store text data
scrapes = {"NYT":[],"WaPo":[],
          "CNBC":[],"Al-Jazeera":[],
          "BBC":[],"China Daily":[],
          "Fox":[],
          "Mehr":[],"The Atlantic":[],
          "Buzzfeed":[],"New Yorker":[],
          "Mother Jones":[]}

Using BeautifulSoup, it's fairly easy to read each URL and collect the text of every HTML hyperlink ("a") tag.

In [4]:
for key in sources:
    source_url = sources[key]
    # Download the raw HTML of the outlet's front page
    sauce = url.urlopen(source_url).read()
    soup = BeautifulSoup(sauce, 'html5lib')
    # Store the visible text of every hyperlink ("a") tag
    for link in soup.find_all("a"):
        scrapes[key].append(link.text)
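
The loop above stops at the first outlet that refuses the request or times out. As a minimal sketch of a more forgiving download step, the helper below wraps the same urllib call in error handling. The function name `fetch_html`, the 10-second timeout, and the User-Agent value are illustrative assumptions, not part of the original notebook.

```python
import urllib.request


def fetch_html(page_url, timeout=10):
    """Return the raw HTML bytes for page_url, or None if the request fails."""
    try:
        # Some sites reject urllib's default user agent, so send a generic one.
        # The header value is an arbitrary placeholder (an assumption).
        request = urllib.request.Request(
            page_url, headers={"User-Agent": "Mozilla/5.0"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as err:
        # URLError, HTTPError, and socket timeouts are OSError subclasses;
        # ValueError is raised for malformed URLs.
        print("Could not fetch {0}: {1}".format(page_url, err))
        return None
```

Inside the scraping loop, an outlet whose download returns None could simply be skipped with `continue`, leaving its list in the scrapes dictionary empty for that run.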
        

To check that the scraper has worked, let's look at how many links were scraped from each outlet.

In [5]:
for j in sources.keys():
    print("{0}: {1} links found".format(j,len(scrapes[j])))

NYT: 173 links found
WaPo: 449 links found
CNBC: 458 links found
Al-Jazeera: 128 links found
BBC: 289 links found
China Daily: 268 links found
Fox: 842 links found
Mehr: 244 links found
The Atlantic: 306 links found
Buzzfeed: 301 links found
New Yorker: 276 links found
Mother Jones: 222 links found


In [6]:
# Concatenate all link text into one body of text for each news source
separator = " "
for j in sources.keys():
    text = []
    for link_text in scrapes[j]:
        # Note: because of the "or", an entry is skipped only when it contains
        # both a newline and a double space
        if "\n" not in link_text or "  " not in link_text:
            text.append(link_text)
    body = separator.join(text)
    scrapes[j] = body # replace the list of links with the joined string
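
The filter above keeps empty strings and entries with stray whitespace, which is why the joined text can contain runs of spaces. One possible alternative, shown here only as a sketch, normalizes whitespace in every entry and drops the ones that end up empty. The helper name `clean_link_texts` and the sample list are hypothetical, and this is a different cleanup from the one the notebook applies.

```python
import re


def clean_link_texts(link_texts):
    """Collapse whitespace in each string and drop entries that end up empty."""
    cleaned = []
    for text in link_texts:
        # Replace any run of whitespace (spaces, tabs, newlines) with one space
        collapsed = re.sub(r"\s+", " ", text).strip()
        if collapsed:
            cleaned.append(collapsed)
    return cleaned


# Made-up sample imitating raw link text from a front page
sample = ["Skip Navigation", "", "\n  Markets \n", "Pre-Markets", "   "]
body = " ".join(clean_link_texts(sample))
print(body)  # Skip Navigation Markets Pre-Markets
```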

Now let's take a look at the text data and see what it looks like.

In [7]:
scrapes["CNBC"]

"Skip Navigation   Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Fintech Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress 2020 Elections CNBC TV Live TV Live Audio Business Day Shows The News with Shepard Smith Entertainment Shows Full Episodes Latest Video Top Video CEO Interviews CNBC Documentaries CNBC World Digital Originals Live TV Schedule Watchlist PRO PRO News PRO Live Make It Select USA INTL SIGN IN  Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Business Economy Finance Health & Science Media Real Estate Energy Transportati

Looks good. Now we just need to build a dataframe from the dictionary so that it can be combined with other dataframes for analysis.

In [8]:
# Create and display dataframe from dictionary
scrapes_df = pd.DataFrame.from_dict(scrapes, orient="index")
scrapes_df

Unnamed: 0,0
NYT,Continue reading the main story Skip to conten...
WaPo,Skip to main content Election 2020 Coronavirus...
CNBC,Skip Navigation Markets Pre-Markets U.S. Mar...
Al-Jazeera,Live play News Middle East Africa Asia US & C...
BBC,Homepage Skip to content Accessibility Help BB...
China Daily,Global Edition China Edition ASIA 中文 双语 Franç...
Fox,Fox News U.S. Politics Opinion Business Entert...
Mehr,Instagram Twitter facebook RSS Archive Me...
The Atlantic,Skip to content Sign in My Account Subscrib...
Buzzfeed,Skip To Content Homepage Quizzes TV & Movies S...


In [12]:
# Get AM/PM to keep track of time
time_of_day = input("Is it AM or PM? ")

Is it AM or PM? PM
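
Typing AM or PM by hand works for a manual run, but the same label could be derived from the system clock. The sketch below assumes that noon is the cutoff and uses a hypothetical helper name, `scrape_label`. It is an alternative to the `input` prompt, not what the notebook does.

```python
from datetime import datetime


def scrape_label(moment=None):
    """Build a label such as '2020-11-09 PM' from a datetime (default: now)."""
    if moment is None:
        moment = datetime.now()
    # Hours 0-11 count as AM, hours 12-23 as PM
    half = "AM" if moment.hour < 12 else "PM"
    return "{0} {1}".format(moment.date(), half)


print(scrape_label(datetime(2020, 11, 9, 15, 30)))  # 2020-11-09 PM
```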


In [13]:
# Rename the data column to the scrape date and time of day
scrapes_df = scrapes_df.rename(columns={0:str(date.today())+" " + str(time_of_day)})

Let's take a look at the final data frame.

In [14]:
scrapes_df

Unnamed: 0,2020-11-09 PM
NYT,Continue reading the main story Skip to conten...
WaPo,Skip to main content Election 2020 Coronavirus...
CNBC,Skip Navigation Markets Pre-Markets U.S. Mar...
Al-Jazeera,Live play News Middle East Africa Asia US & C...
BBC,Homepage Skip to content Accessibility Help BB...
China Daily,Global Edition China Edition ASIA 中文 双语 Franç...
Fox,Fox News U.S. Politics Opinion Business Entert...
Mehr,Instagram Twitter facebook RSS Archive Me...
The Atlantic,Skip to content Sign in My Account Subscrib...
Buzzfeed,Skip To Content Homepage Quizzes TV & Movies S...
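
The notebook does not show the step that writes the dataframe to disk. A minimal sketch of a CSV round trip with pandas follows, using a small made-up dataframe and an in-memory buffer in place of a file. In practice the buffer would be replaced by a path such as "scrapes_2020-11-09_PM.csv", which is a hypothetical name.

```python
import io

import pandas as pd

# Small stand-in for scrapes_df; the text values are invented
demo_df = pd.DataFrame(
    {"2020-11-09 PM": ["headline text one", "headline text two"]},
    index=["NYT", "BBC"],
)

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
demo_df.to_csv(buffer)

# Read it back, using the first CSV column as the index
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)
print(loaded)
```

Passing `index_col=0` when reading restores the outlet names as the index. Without it, pandas loads them as an ordinary column labelled "Unnamed: 0".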


Great. Time to move on to the analysis!