### **Scraping**

This is not my scraper. I used the code from https://holwech.github.io/blog/Automatic-news-scraper/ because I wanted to add more recent articles to my dataset, and I modified parts of it so that it scrapes more articles than the original did.

In [3]:
!pip3 install feedparser

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [4]:
!pip3 install newspaper3k

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [6]:
import os
# from google.colab import drive

# Base directory for output (the Google Drive mount point on Colab; the mount itself is commented out here)
DRIVE_MOUNT='./'
# drive.mount(DRIVE_MOUNT)

# create folder to write data to
DLNLP_FOLDER=os.path.join(DRIVE_MOUNT, 'My Drive', 'DLNLP_2020')
HOMEWORK_FOLDER=os.path.join(DLNLP_FOLDER, 'Project')
os.makedirs(HOMEWORK_FOLDER, exist_ok=True)

In [4]:
import json

In [2]:
dictionary = {
  "cnn": {
    "link": "http://edition.cnn.com/"
  },
  "bbc": {
    "rss": "http://feeds.bbci.co.uk/news/rss.xml",
    "link": "http://www.bbc.com/"
  },
  "theguardian": {
    "rss": "https://www.theguardian.com/uk/rss",
    "link": "https://www.theguardian.com/international"
  },
  "breitbart": {
    "link": "http://www.breitbart.com/"
  },
  "infowars": {
    "link": "https://www.infowars.com/"
  },
  "foxnews": {
    "link": "http://www.foxnews.com/"
  },
  "nbcnews": {
    "link": "http://www.nbcnews.com/"
  },
  "washingtonpost": {
    "rss": "http://feeds.washingtonpost.com/rss/world",
    "link": "https://www.washingtonpost.com/"
  },
  "theonion": {
      "rss": "https://www.theonion.com/rss",
    "link": "http://www.theonion.com/"
  }
}

In [5]:
json_object = json.dumps(dictionary, indent = 4) 

In [7]:
with open("NewsPapers.json", "w") as outfile: 
    outfile.write(json_object) 
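
The scraper below takes the RSS path only when a site's entry contains an `rss` key, and otherwise falls back to `newspaper.build`. Before a long run it can help to see which sites go down which path. The following is a minimal sketch of that check; the helper name `split_sources` and the two-site sample are illustrative and not part of the original scraper.

```python
def split_sources(sources):
    """Split site names into those with an RSS feed and those without.

    Mirrors the `'rss' in value` check used by the scraper: a site counts
    as an RSS source only if the key is present in its entry.
    """
    rss_sites = [name for name, cfg in sources.items() if 'rss' in cfg]
    build_sites = [name for name, cfg in sources.items() if 'rss' not in cfg]
    return rss_sites, build_sites


# Small illustrative subset of the configuration above.
sample = {
    "cnn": {"link": "http://edition.cnn.com/"},
    "bbc": {
        "rss": "http://feeds.bbci.co.uk/news/rss.xml",
        "link": "http://www.bbc.com/",
    },
}
rss_sites, build_sites = split_sources(sample)
print("RSS:", rss_sites)
print("newspaper.build:", build_sites)
```

Passing the full `dictionary` instead of `sample` gives the split for the whole configuration.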

In [8]:
import feedparser as fp
import json
import newspaper
from newspaper import Article
from time import mktime
from datetime import datetime

# Set the limit for the number of articles to download per news site (count is reset for each site)
LIMIT = 14500

data = {}
data['newspapers'] = {}

# Loads the JSON files with news sites
with open('NewsPapers.json') as data_file:
    companies = json.load(data_file)

count = 1

# Iterate through each news company
for company, value in companies.items():
    # If an RSS link is provided in the JSON file, it is the first choice,
    # because RSS feeds often give more consistent and correct data.
    # If you do not want to scrape from the RSS feed, omit the "rss" key for that site in the JSON file.
    if 'rss' in value:
        d = fp.parse(value['rss'])
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        for entry in d.entries:
            # Check whether a parsed publish date is provided; if not, the article is skipped.
            # This is done to keep consistency in the data and to keep the script from crashing.
            if entry.get('published_parsed') is not None:
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    # If the download fails for some reason (e.g. a 404), the script continues
                    # with the next article.
                    print(e)
                    print("continuing...")
                    continue
                article['title'] = content.title
                article['text'] = content.text
                newsPaper['articles'].append(article)
                print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1
    else:
        # This is the fallback method if an RSS feed link is not provided.
        # It uses the Python newspaper library to extract articles.
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False)
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
        noneTypeCount = 0
        for content in paper.articles:
            if count > LIMIT:
                break
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
            # Again, for consistency, if no publish date is found the article is skipped.
            # After more than 100 articles in a row without a publish date from the same newspaper, the company is skipped.
            if content.publish_date is None:
                print(count, " Article has date of type None...")
                noneTypeCount = noneTypeCount + 1
                if noneTypeCount > 100:
                    print("Too many noneType dates, aborting...")
                    noneTypeCount = 0
                    break
                count = count + 1
                continue
            article = {}
            article['title'] = content.title
            article['text'] = content.text
            article['link'] = content.url
            article['published'] = content.publish_date.isoformat()
            newsPaper['articles'].append(article)
            print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            count = count + 1
            noneTypeCount = 0
    count = 1
    data['newspapers'][company] = newsPaper

# Finally, save the articles as a JSON file.
try:
    with open('scraped_articles.json', 'w') as outfile:
        json.dump(data, outfile)
except Exception as e:
    print(e)

Downloading articles from  theonion
1 articles downloaded from theonion , url:  https://local.theonion.com/spatter-analyst-finally-working-with-blood-after-years-1846070424
2 articles downloaded from theonion , url:  https://www.theonion.com/god-blindsided-after-illegitimate-son-from-andromeda-ga-1846069193
3 articles downloaded from theonion , url:  https://www.theonion.com/seth-rich-conspiracy-theorists-publicly-apologize-as-pa-1846069164
4 articles downloaded from theonion , url:  https://entertainment.theonion.com/vince-gilligan-reunites-with-bryan-cranston-for-new-bre-1846071138
5 articles downloaded from theonion , url:  https://www.theonion.com/mlb-beginning-to-suspect-pirates-just-a-mob-front-1846067072
6 articles downloaded from theonion , url:  https://www.theonion.com/lady-gaga-j-lo-to-perform-at-biden-inauguration-1846069106
7 articles downloaded from theonion , url:  https://politics.theonion.com/she-s-now-eating-a-muffin-in-the-commissary-posts-co-1846067955
8 articles do
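
A note on the `published` field produced by the RSS branch: feedparser documents its `*_parsed` dates as UTC `time.struct_time` values, whereas `time.mktime` interprets its argument as local time, so the stored value is a naive timestamp with no time-zone marker. If an explicit UTC timestamp is preferred, `calendar.timegm` is one possible alternative. This is a sketch of that alternative, not what the scraper above does; `struct_time_to_utc_iso` is a hypothetical helper name and the example struct is built by hand.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_utc_iso(parsed):
    """Convert a UTC time.struct_time to a timezone-aware ISO 8601 string."""
    # calendar.timegm treats the struct as UTC (time.mktime would assume local time).
    epoch_seconds = calendar.timegm(parsed)
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Hand-built struct for 2021-01-18 14:02:00 UTC, for illustration only.
example = time.struct_time((2021, 1, 18, 14, 2, 0, 0, 18, 0))
print(struct_time_to_utc_iso(example))
```

In the scraper, the call would take `entry.published_parsed` in place of `example`.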

In [11]:
with open('scraped_articles.json') as json_data:
    d = json.load(json_data)

In [12]:
for i, site in enumerate(d['newspapers']):
    print(i, site)

0 theonion


In [13]:
import pandas as pd

# Build one DataFrame per site, tag it with the site name, then concatenate them all.
frames = []
for site, paper in d['newspapers'].items():
    site_df = pd.DataFrame(paper['articles'])
    site_df["site"] = site
    frames.append(site_df)
df = pd.concat(frames, ignore_index=True)

In [14]:
df.shape

(25, 5)

In [15]:
df

| | link | published | title | text | site |
|---|---|---|---|---|---|
| 0 | https://local.theonion.com/spatter-analyst-fin... | 2021-01-18T14:02:00 | Spatter Analyst Finally Working With Blood Aft... | NEW YORK—Happy to move on to the next phase of... | theonion |
| 1 | https://www.theonion.com/god-blindsided-after-... | 2021-01-18T14:00:00 | God Blindsided After Illegitimate Son From And... | THE HEAVENS—Expressing uncertainty about how t... | theonion |
| 2 | https://www.theonion.com/seth-rich-conspiracy-... | 2021-01-18T14:00:00 | Seth Rich Conspiracy Theorists Publicly Apolog... | Ed Butowsky and Matt Couch, two conspiracy the... | theonion |
| 3 | https://entertainment.theonion.com/vince-gilli... | 2021-01-18T14:00:00 | Vince Gilligan Reunites With Bryan Cranston Fo... | LOS ANGELES—Finally announcing the joint ventu... | theonion |
| 4 | https://www.theonion.com/mlb-beginning-to-susp... | 2021-01-15T22:00:00 | MLB Beginning To Suspect Pirates Just A Mob Front | PITTSBURGH—Speculating as to how the listless ... | theonion |
| 5 | https://www.theonion.com/lady-gaga-j-lo-to-per... | 2021-01-15T19:43:00 | Lady Gaga, J. Lo To Perform At Biden Inauguration | Lady Gaga will sing the national anthem and J.... | theonion |
| 6 | https://politics.theonion.com/she-s-now-eating... | 2021-01-15T19:20:00 | ‘She’s Now Eating A Muffin In The Commissary,’... | WASHINGTON—Following her brief suspension from... | theonion |
| 7 | https://www.theonion.com/u-s-mint-introduces-n... | 2021-01-15T18:00:00 | U.S. Mint Introduces New Seven-Cent Coin To Bo... | WASHINGTON—Explaining they were excited to “ki... | theonion |
| 8 | https://www.theonion.com/nation-enters-new-pha... | 2021-01-15T17:55:00 | Nation Enters New Phase Of Vaccine Distributio... | ATLANTA—Reviewing changes to the priorities fo... | theonion |
| 9 | https://www.theonion.com/wikipedia-turns-20-18... | 2021-01-15T15:50:00 | Wikipedia Turns 20 | Wikipedia was launched Jan. 15, 2001, and the ... | theonion |


In [18]:
!cp scraped_articles.json "./My Drive/DLNLP_2020/Project/"

In [25]:
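# Optional: merge with a previously saved CSV (written without an index column) before saving.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# dfm = pd.read_csv("./My Drive/DLNLP_2020/Project/scraped_articles.csv")
# df = pd.concat([dfm, df], ignore_index=True)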

In [26]:
df.to_csv("./My Drive/DLNLP_2020/Project/scraped_articles.csv", index=False)
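
The commented-out cell above hints at appending each new scrape to a previously saved CSV. Repeated runs against the same RSS feed tend to return overlapping entries, so dropping duplicates by `link` is a reasonable extra step. Below is a minimal sketch on plain lists of dictionaries; `merge_articles` is a hypothetical helper and the rows are made up for illustration.

```python
def merge_articles(old_rows, new_rows, key="link"):
    """Concatenate two lists of article dicts, keeping the first row seen per key."""
    seen = set()
    merged = []
    for row in old_rows + new_rows:
        if row[key] in seen:
            continue  # this article was already collected in an earlier scrape
        seen.add(row[key])
        merged.append(row)
    return merged


# Made-up rows: the first new row repeats an article from the old scrape.
old_rows = [{"link": "https://example.com/a", "title": "A"}]
new_rows = [
    {"link": "https://example.com/a", "title": "A"},
    {"link": "https://example.com/b", "title": "B"},
]
merged = merge_articles(old_rows, new_rows)
print(len(merged), [row["title"] for row in merged])
```

With pandas, the same idea can be expressed as `pd.concat([dfm, df], ignore_index=True).drop_duplicates(subset="link")`.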