# Data Read-In and Cleaning

Here we read in our real news from The Guardian API and our fake news by web-scraping the articles linked from the all-time top posts on Reddit's /r/TheOnion.

In [0]:
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize

resp = requests.get("https://content.guardianapis.com/search?show-fields=bodyText&page-size=197&page=1&api-key=49ac985b-ee8a-42be-b9db-6381c8bb3ce1")
news_dict = {"title" : [], "url" : [], "content" : [], "fake" : [],
             "title_len" : [], "content_len" : []}

# Cleaning and Reading Data - Real News

To clean the data from The Guardian API, we only had to check that each article actually had body text. Results with an empty body are skipped.

In [0]:
x = resp.json()
for result in x['response']['results']:
  title = result['webTitle']
  content = result['fields']['bodyText']
  # Skip articles with no body text
  if len(content) == 0:
    continue
  news_dict['content'].append(content)
  news_dict['content_len'].append(len(content.split()))  # word count
  news_dict['url'].append(result['webUrl'])
  news_dict['title'].append(title)
  news_dict['title_len'].append(len(title.split()))  # word count
  news_dict['fake'].append(0)  # 0 = real news
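
The filter above can also be tried offline as a small function. This is only a sketch under stated assumptions: `extract_articles` and the hand-made `sample` payload are hypothetical names that do not appear in the notebook, and the payload merely mimics the `response` / `results` / `fields.bodyText` nesting used in the cell above.

```python
def extract_articles(payload):
    """Return one row per result whose bodyText is non-empty."""
    rows = []
    for result in payload["response"]["results"]:
        # Assumption: a result may lack 'fields'; .get() treats that as empty.
        # The notebook cell indexes the keys directly instead.
        content = result.get("fields", {}).get("bodyText", "")
        if not content:
            continue  # no body text, so skip this result
        title = result["webTitle"]
        rows.append({
            "title": title,
            "url": result["webUrl"],
            "content": content,
            "fake": 0,  # 0 = real news
            "title_len": len(title.split()),      # word count
            "content_len": len(content.split()),  # word count
        })
    return rows


# Hand-made payload (hypothetical data, not a real API response)
sample = {
    "response": {
        "results": [
            {"webTitle": "Example headline here",
             "webUrl": "https://example.com/a",
             "fields": {"bodyText": "Some body text."}},
            {"webTitle": "Gallery only",
             "webUrl": "https://example.com/b",
             "fields": {"bodyText": ""}},
        ]
    }
}

rows = extract_articles(sample)
```

Building a list of row dictionaries rather than parallel lists keeps the columns aligned by construction, and `pd.DataFrame(rows)` would accept the result directly.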

In [3]:
pd.DataFrame(news_dict)

Unnamed: 0,title,url,content,fake,title_len,content_len
0,Australia coronavirus live updates: Scott Morr...,https://www.theguardian.com/world/live/2020/ma...,The next stage in the government stimulus plan...,0,15,18082
1,Coronavirus live news: some tube stations clos...,https://www.theguardian.com/world/live/2020/ma...,"There are close to 220,000 confirmed coronavir...",0,17,9718
2,The coronavirus pandemic threatens a crisis fo...,https://www.theguardian.com/commentisfree/2020...,You can learn a lot about someone’s perspectiv...,0,13,1042
3,Samsung Galaxy S20 Ultra review: the superphon...,https://www.theguardian.com/technology/2020/ma...,Samsung’s new Galaxy S20 Ultra superphone is p...,0,12,2079
4,"The Truth review – mothers, memory and a haugh...",https://www.theguardian.com/film/2020/mar/19/t...,"The title is a deadpan challenge, and it is up...",0,18,750
...,...,...,...,...,...,...
190,"We won’t stop until she's free, says sister of...",https://www.theguardian.com/world/2020/mar/18/...,Loujain al-Hathloul first realised that speaki...,0,12,796
191,Pupils are joking that they're 'dying to learn...,https://www.theguardian.com/commentisfree/2020...,Schools are to remain open in England. So says...,0,16,773
192,Eurovision Song Contest cancelled due to coron...,https://www.theguardian.com/tv-and-radio/2020/...,The 2020 Eurovision Song Contest has become th...,0,8,350
193,Joyce Rimmer obituary,https://www.theguardian.com/society/2020/mar/1...,"My friend Joyce Rimmer, who has died aged 87, ...",0,3,444


In [4]:
!pip install praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/25/c0/b9714b4fb164368843b41482a3cac11938021871adf99bf5aaa3980b0182/praw-6.5.1-py3-none-any.whl (134kB)

In [0]:
import praw
import pandas as pd
import datetime as dt
import requests
from bs4 import BeautifulSoup
import time

reddit = praw.Reddit(client_id='XqbMk5vI3mntdA',
                     client_secret='2KYXTDQYB7hopncHh4vklQJ0mnM',
                     user_agent='DATA301',
                     username='data301_project',
                     password='DATA301Project')

# Cleaning and Reading Data - Fake News

Cleaning the data from the web-scraped Onion articles was a bit more complicated. We first excluded any submissions that did not link to The Onion, which simplified the scraping. Similar to the Guardian articles, we also had to verify that each article had content, since some consisted only of a title and an image. In addition, a large majority of the articles began with a dateline: the city and state the article was relevant to, followed by an em dash. So that our machine learning models weren't thrown off by this, we removed the dateline whenever the dash appeared within the first 25 characters.
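
The dateline rule described above can be sketched as a standalone function. This is a minimal illustration, not the notebook's own code: `strip_dateline`, `max_prefix`, and `dash` are hypothetical names, the 25-character cutoff is taken from the scraping cell, and the trailing `lstrip()` is an extra step added here.

```python
def strip_dateline(text, max_prefix=25, dash="\u2014"):
    """Drop a leading 'CITY, ST' dateline that ends in an em dash.

    The dateline is removed only if the dash occurs within the first
    `max_prefix` characters; otherwise the text is returned unchanged.
    """
    idx = text.find(dash)  # -1 when there is no dash at all
    if 0 <= idx < max_prefix:
        # lstrip() is an addition in this sketch; the notebook keeps
        # whatever follows the dash as-is.
        return text[idx + 1:].lstrip()
    return text


with_dateline = strip_dateline("WASHINGTON\u2014Calling the vote a sham")
without_dateline = strip_dateline("No dateline in this paragraph")
late_dash = strip_dateline("A" * 30 + "\u2014tail")
```

The position check matters because an em dash later in the paragraph is ordinary punctuation, and cutting at it would discard the start of the article.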

In [0]:
top_submissions = reddit.subreddit('TheOnion').top(limit=250)
for submission in top_submissions:
    # Only include links to Onion articles
    if 'onion' not in submission.url:
        continue
    response = requests.get(submission.url)
    soup = BeautifulSoup(response.content, "html.parser")
    content = soup.find_all('p')
    # Only add articles that have content
    if len(content) == 0:
        continue
    # The first <p> element is used as the article body
    content_body = content[0].text
    # Remove city and state at beginning of some articles
    if ('—' in content_body) and (content_body.index('—') < 25):
        city_end_index = content_body.index('—')
        content_body = content_body[city_end_index+1:]
    news_dict['content'].append(content_body)
    news_dict['content_len'].append(len(content_body.split()))
    news_dict["title"].append(submission.title)
    news_dict['title_len'].append(len(submission.title.split()))
    news_dict["url"].append(submission.url)
    news_dict["fake"].append(1)
    # Pause briefly between requests
    time.sleep(0.3)

In [7]:
df_news = pd.DataFrame(news_dict)
df_news = df_news.sample(frac=1).reset_index(drop=True).copy()
df_news

Unnamed: 0,title,url,content,fake,title_len,content_len
0,'No way to prevent this' says only nation wher...,https://www.theonion.com/no-way-to-prevent-thi...,In the hours following a violent rampage in Fl...,1,12,196
1,Fox News Condemns 2020 Election As Partisan Wi...,https://politics.theonion.com/fox-news-condemn...,Calling the running and nomination of a candid...,1,15,168
2,'A generation has died': Italian province stru...,https://www.theguardian.com/world/2020/mar/19/...,Coffins awaiting burial are lining up in churc...,0,12,716
3,Around the world from your sofa: British Libra...,https://www.theguardian.com/books/2020/mar/18/...,What better way to see the world in these trav...,0,13,612
4,Man Playing ‘Battlefield V’ Has Now Spent More...,https://www.theonion.com/man-playing-battlefie...,After dedicating an immense portion of his spa...,1,15,258
...,...,...,...,...,...,...
361,Picky Refugee Just Expects To Be Reunited With...,https://local.theonion.com/picky-refugee-just-...,Expressing frustration with the migrant child’...,1,13,184
362,"‘Nothing Is More Attractive Than Confidence,’ ...",https://local.theonion.com/nothing-is-more-att...,Naively insisting that we seek partners with t...,1,17,203
363,Months not weeks before rugby resumes in Wales...,https://www.theguardian.com/sport/2020/mar/18/...,The Welsh Rugby Union has warned its clubs it ...,0,10,651
364,"‘No Way To Prevent This,’ Says Only Nation Whe...",https://www.theonion.com/no-way-to-prevent-thi...,In the hours following a violent rampage in Fl...,1,12,196
