# Fake _Fake_ News?: The Scraper
Kevin Chang, Hariz Hisham, Thuy Le, Tam Nguyen, Qing Zhang<br>
Santa Clara University

The goal of this exercise is to determine whether fake news and factual news sites can be told apart by contemporary algorithms.

This analysis is an extension of prior [research](https://arxiv.org/abs/1810.01765) conducted at MIT. The data for this research can be retrieved [here](https://github.com/ramybaly/News-Media-Reliability/). Additionally, this project was inspired by [prior work](http://web.stanford.edu/~mattm401/docs/2018-Golbeck-WebSci-FakeNewsVsSatire.pdf) done by Golbeck et al (2018).

Note: In this analysis, we are relying 100% on the output of the algorithm to determine if a news site is 'fake news' or 'reall'. We provide a caveat and explanation as to why this approach may not be completely sound (or in fact, safe) IRL.

This project was done in part as a collective effort with the Markkula Center for Applied Ethics at Santa Clara University. For more information on the great work covered by the Markkula Center, click [here](https://www.scu.edu/ethics/).

Special thanks to Sanjiv Das and Subbu Vincent for their support and guidance on this project.

Keywords: Fake news, classification, support vector machine

In [0]:
import pandas as pd
import numpy as np

In [2]:
### DON'T RUN UNLESS ON GOOGLE COLAB ###
from google.colab import drive
drive.mount('/content/drive')
### DON'T RUN ###

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


### Read in list of news sources

List of sources retrieved from and generated by MIT

In [0]:
# Read in MIT corpus
corp = pd.read_csv('/content/drive/Shared drives/Machine Learning Project/Final Project/data/Master/data/corpus.csv',
                   index_col = False)

### Perform feature renaming and merging to corpus in one block

In [4]:
# Define list of filenames we want loaded in
ls = ['alexa', 'counts', 'created_at', 'description', 'handcrafted_url',\
     'has_location', 'has_twitter', 'has_wiki', 'title', 'url_match', 'verified',\
     'wikicategories', 'wikicontent', 'wikisummary', 'wikitoc']

for x in ls:
    f = str(x)
    fp = '/content/drive/Shared drives/Machine Learning Project/Final Project/data/Master/data/features/' + f + '.csv'
    feat = pd.read_csv(fp, index_col = False).drop(columns = ['fact', 'bias'])
    feat.columns = [f + '_' + str(col) for col in feat.columns]
    meep = f + '_source_url_processed'
    feat = feat.rename(columns={meep:'source_url_processed'})
    corp = pd.merge(corp, feat, how = 'left', on = 'source_url_processed')

# Quick visual check
corp.sample(3)

Unnamed: 0,source_url,source_url_processed,URL,fact,bias,alexa_f0,counts_f0,counts_f1,counts_f2,counts_f3,counts_f4,created_at_f0,description_f0,description_f1,description_f2,description_f3,description_f4,description_f5,description_f6,description_f7,description_f8,description_f9,description_f10,description_f11,description_f12,description_f13,description_f14,description_f15,description_f16,description_f17,description_f18,description_f19,description_f20,description_f21,description_f22,description_f23,description_f24,description_f25,description_f26,description_f27,...,wikitoc_f260,wikitoc_f261,wikitoc_f262,wikitoc_f263,wikitoc_f264,wikitoc_f265,wikitoc_f266,wikitoc_f267,wikitoc_f268,wikitoc_f269,wikitoc_f270,wikitoc_f271,wikitoc_f272,wikitoc_f273,wikitoc_f274,wikitoc_f275,wikitoc_f276,wikitoc_f277,wikitoc_f278,wikitoc_f279,wikitoc_f280,wikitoc_f281,wikitoc_f282,wikitoc_f283,wikitoc_f284,wikitoc_f285,wikitoc_f286,wikitoc_f287,wikitoc_f288,wikitoc_f289,wikitoc_f290,wikitoc_f291,wikitoc_f292,wikitoc_f293,wikitoc_f294,wikitoc_f295,wikitoc_f296,wikitoc_f297,wikitoc_f298,wikitoc_f299
159,http://www.peninsuladailynews.com/,peninsuladailynews.com,http://mediabiasfactcheck.com/peninsula-daily-...,HIGH,right-center,5e-06,8.451908,6.240276,5.420535,5.860786,10.017486,2026,-0.034197,-0.065416,0.121596,0.00708,-0.059611,0.059881,-0.122328,0.027388,-0.166097,0.030397,0.008653,-0.047187,-0.168471,0.059842,-0.049479,0.098931,0.038005,0.09884,-0.134806,-0.092912,-0.022542,0.024075,0.046232,-0.050388,-0.13878,0.021756,-0.044964,0.278456,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56,https://www.propublica.org/,propublica.org,http://mediabiasfactcheck.com/propublica/,HIGH,left-center,6.1e-05,13.52607,5.771441,9.716073,8.207129,10.940685,2030,-0.019782,-0.054253,0.072668,0.121853,-0.034119,0.086934,-0.171305,0.036815,-0.137532,0.120609,-0.022726,-0.029175,-0.148004,0.221381,0.003496,0.087019,0.118571,0.110203,-0.150363,-0.014584,-0.078942,0.14013,0.265517,-0.148147,-0.154948,-0.044461,-0.010101,0.277371,...,-0.04042,-0.008635,0.087201,-0.15228,-0.053571,0.056277,-0.108702,0.104099,-0.074515,-0.164893,-0.063673,0.252967,0.051372,0.052342,0.083598,0.021875,-0.081275,-0.012975,-0.066245,0.126062,-0.046536,0.149689,-0.117991,-0.002434,0.054107,0.048196,0.075887,0.033921,0.149048,-0.096848,0.006477,-0.031229,-0.103906,0.111296,-0.146298,-0.176601,-0.130016,-0.001229,0.063312,0.009946
417,http://www.fairus.org/,fairus.org,http://mediabiasfactcheck.com/the-federation-f...,LOW,extreme-right,7e-06,12.142962,7.202661,6.816736,3.912023,10.341968,2013,-0.066208,-0.083611,0.107766,0.12104,-0.036537,-0.063286,-0.230003,0.027153,-0.220139,0.090591,-0.041321,-0.061371,-0.233955,0.152763,-0.032534,0.126167,0.090378,0.131039,-0.153938,0.012552,-0.030304,0.09055,0.2853,-0.04567,-0.234558,-0.046753,0.020004,0.236057,...,-0.063143,-0.012548,0.086362,-0.158195,-0.031694,-0.043052,-0.094833,0.025088,-0.078662,-0.198867,-0.071567,0.150032,0.049907,-0.041872,0.06552,0.060421,-0.10535,-0.026968,-0.056044,0.128713,0.032991,0.190365,-0.093369,-0.035249,0.014463,0.007031,0.067416,0.052398,0.078855,-0.126432,0.027388,-0.02828,-0.077288,0.153625,-0.127578,-0.17917,-0.07937,0.020726,0.034165,0.05898


### Newspaper

We utilize the [newspaper3k package](https://buildmedia.readthedocs.org/media/pdf/newspaper/latest/newspaper.pdf) to scrape news articles listed on our corpus.

In [5]:
!pip install newspaper3k

Collecting newspaper3k
[?25l  Downloading https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211kB)
[K     |█▌                              | 10kB 19.3MB/s eta 0:00:01[K     |███                             | 20kB 3.3MB/s eta 0:00:01[K     |████▋                           | 30kB 4.6MB/s eta 0:00:01[K     |██████▏                         | 40kB 3.0MB/s eta 0:00:01[K     |███████▊                        | 51kB 3.7MB/s eta 0:00:01[K     |█████████▎                      | 61kB 4.4MB/s eta 0:00:01[K     |██████████▉                     | 71kB 5.1MB/s eta 0:00:01[K     |████████████▍                   | 81kB 5.7MB/s eta 0:00:01[K     |██████████████                  | 92kB 6.3MB/s eta 0:00:01[K     |███████████████▌                | 102kB 5.0MB/s eta 0:00:01[K     |█████████████████               | 112kB 5.0MB/s eta 0:00:01[K     |██████████████████▋             | 122kB 5.0MB/

In [0]:
from newspaper import Article
import newspaper

In [0]:
# Instantiate function to scrape news articles
# The scraper skips the first article (which is often empty or promotional)
# and scrapes up to 5 articles
def download_text(url):
  
  site = newspaper.build(url)
  
  ls = []

  try:
    print(url)
    
    for i in range(1, 6):

        article = site.articles[i]

        article.download()
        article.html
        article.parse()

        p = article.text
        
        ls.append(p)
        
    return np.array(ls)
        
  except:
    print(url + ' cannot be scraped.')
      

In [9]:
corp['article_raw'] = corp.source_url.apply(download_text)

http://www.villagevoice.com/
https://insideclimatenews.org/
http://www.fury.news/
http://www.fury.news/ cannot be scraped.
http://now8news.com/
http://constitution.com/
http://freebeacon.com/
http://brexitcentral.com
http://foreignpolicynews.org
https://patriotpost.us/
http://loser.com
http://www.empiresports.co/
http://www.emirates247.com/
http://samuel-warde.com/
http://www.trueactivist.com/
https://popularresistance.org/
https://www.thebeaverton.com/
http://nationalreport.net/
https://politicalmayhem.news
http://www.itv.com/news/
http://www.breitbart.com/
http://www.forwardprogressives.com/
http://www.forwardprogressives.com/ cannot be scraped.
https://triggerreset.net
https://www.theatlantic.com/
http://www.dcclothesline.com/
http://liberaldarkness.com/
http://leftoverrights.com/
http://inthesetimes.com/
http://www.tampabay.com/
http://france24-tv.com/
http://france24-tv.com/ cannot be scraped.
http://freewestmedia.com/
https://theguardiansofdemocracy.com/
http://www.darientimes.co

TypeError: ignored

In [0]:
res = corp.to_csv('corpus_scraped_text.csv', index = False)

### Further refinements
- scrape websites for sample $(n = 5)$ articles per news source and vectorize text to create our own body of corpus
- define a pipeline where we utilize logistic --> PCA/SVM --> PCA/Naive Bayes and compare results
- introduce k-fold cross-validation

## Tests

In [0]:
test_df = corp.sample(10)

In [0]:
test_df = test_df.dropna().reset_index(drop = True)

In [0]:
test_df = test_df.dropna().reset_index(drop = True)

test_df.article_raw

cols = ['article_1','article_2','article_3','article_4','article_5']

articles = pd.DataFrame(test_df.article_raw.tolist(), columns = cols)

pd.concat([test_df, articles], axis = 1)

In [0]:
dtest_df['articles'] = test_df.source_url.apply(download_text)

http://www.voanews.com/
http://grist.org/
http://lexingtoninstitute.org/
http://thedcgazette.com/
https://photographyisnotacrime.com/
https://photographyisnotacrime.com/ cannot be scraped.
http://liberaldarkness.com/
https://www.cpj.org/
https://theredshtick.com/
http://vidmax.com/
http://politichicks.com/


In [0]:
articles = test_df[['source_url', 'articles']].reset_index(drop = True)
test_df2 = test_df.drop(columns = ('articles')).reset_index(drop = True)

In [0]:
articles.articles

0    [The Game Room\n\nWinners and almost winners t...
1    [a sign of relief Trump signs a relief bill co...
2    [Should an enemy submarine surface well beyond...
3    [On Wednesday night Speaker Nancy Pelosi attac...
4                                                 None
5    [Brock Turner has it coming to him for the wro...
6    [CPJ joins call for UN Security Council to act...
7    [Radio Stations Pull “All I Want for Christmas...
8                                             [, , , ]
9    [If you were running for President and you had...
Name: articles, dtype: object

## Extract articles, merge, and save

In [0]:
corp = corp.dropna().reset_index(drop = True)

corp.article_raw

cols = ['article_1','article_2','article_3','article_4','article_5']

articles = pd.DataFrame(corp.article_raw.tolist(), columns = cols)

test_corpus = pd.concat([corp, articles], axis = 1)

In [0]:
fs = test_corpus.to_csv('corpus_full_text_v2.csv', index = False)