### Abstract

This analysis explores fake news using the Politifact portion of the FakeNewsNet dataset. In particular, I determine the most common title words and websites in fake vs real news. The results have implications for both journalists and readers. This work can be improved by using a larger dataset for richer analysis.

### Introduction

For my final project, I want to dig deeper into fake news. This topic has gained a lot of traction recently, and I would like to explore what constitutes fake news as well as how it spreads on social media. From a human-centered perspective, fake news has dire consequences. With the growing dependence on smartphones, more individuals read their news on social media and, hence, viral fake news can mislead many people. Through this analysis, I hope to learn the characteristics or features that identify news as fake and to discover the websites that share fake news.

### Data

For this analysis, I will be using FakeNewsNet, a data repository with news content, social context, and spatiotemporal information for studying fake news on social media. The repository can be found at https://github.com/KaiDMML/FakeNewsNet and the paper at https://arxiv.org/abs/1809.01286. FakeNewsNet contains 2 datasets collected using ground truths from Politifact and Gossipcop. Each source contributes two CSV files, one for fake news and one for real news. I will focus on the two Politifact CSVs. Each file is comma-separated and has the following columns:
1. id - Unique identifier for each news article
2. news_url - URL of the web page that published the news
3. title - Title of the news article
4. tweet_ids - IDs of the tweets sharing the news. This field is a list of tweet IDs separated by tabs.

With this dataset, I will be able to compare fake news and real news.
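
As a small illustration of the `tweet_ids` format described above, the sketch below splits one field into individual IDs. The helper name `parse_tweet_ids` is hypothetical, and the handling of missing values assumes the file is read with pandas, where an empty field becomes NaN.

```python
def parse_tweet_ids(field):
    """Split a tab-separated tweet_ids field into a list of ID strings."""
    # Assumption: a missing field arrives as NaN (a float) when read with
    # pandas, so anything that is not a non-empty string means "no tweets".
    if not isinstance(field, str) or not field.strip():
        return []
    # Drop empty pieces in case of a leading or trailing tab.
    return [tweet_id for tweet_id in field.split("\t") if tweet_id]


# The two IDs below are the first two shown for politifact15014.
example_field = "937349434668498944\t937379378006282240"
example_ids = parse_tweet_ids(example_field)
print(len(example_ids))
print(parse_tweet_ids(float("nan")))
```

The IDs are kept as strings so that long values are not altered by numeric conversion.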

### Related Work

There has been a lot of work on distinguishing fake news from real news. For instance, Eliza Shoemaker created a tool for automating the detection of fake news by identifying which features are more useful for different classifiers (Using data science to detect fake news, 2019). Similarly, there have been many projects and repositories that build such classifiers. TowardsDataScience published an article that uses Recurrent Neural Networks to detect fake news (https://towardsdatascience.com/building-a-fake-news-classifier-using-natural-language-processing-83d911b237e1).

### Research Questions

1. Which terms are most strongly associated with fake vs real news?
2. Which websites are most strongly associated with fake vs real news?

### Methodology

Import the packages used for this analysis.

In [9]:
import pandas as pd
from collections import Counter
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 

We start by tackling research question 1 and the Politifact fake news dataset. 

In [4]:
#Read dataset
politic_fake = pd.read_csv('./data/dataset/politifact_fake.csv')

In [5]:
#Explore data
print(politic_fake.shape)
politic_fake.head()

(432, 4)


Unnamed: 0,id,news_url,title,tweet_ids
0,politifact15014,speedtalk.com/forum/viewtopic.php?t=51650,BREAKING: First NFL Team Declares Bankruptcy O...,937349434668498944\t937379378006282240\t937380...
1,politifact15156,politics2020.info/index.php/2018/03/13/court-o...,Court Orders Obama To Pay $400 Million In Rest...,972666281441878016\t972678396575559680\t972827...
2,politifact14745,www.nscdscamps.org/blog/category/parenting/467...,UPDATE: Second Roy Moore Accuser Works For Mic...,929405740732870656\t929439450400264192\t929439...
3,politifact14355,https://howafrica.com/oscar-pistorius-attempts...,Oscar Pistorius Attempts To Commit Suicide,886941526458347521\t887011300278194176\t887023...
4,politifact15371,http://washingtonsources.org/trump-votes-for-d...,Trump Votes For Death Penalty For Being Gay,915205698212040704\t915242076681506816\t915249...


The first step is to combine all the titles into one string and clean it. This includes replacing non-alphabetic characters with spaces and converting all words to lower case.

In [7]:
#Remove non alphabetic characters
sms = re.sub('[^A-Za-z]', ' ', str(politic_fake['title']))

In [8]:
#Make the words lower case
sms = sms.lower()
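
One caveat with the cleaning cells above: `str(politic_fake['title'])` produces the text pandas would print for the column, which abbreviates long titles and, for a long column, omits the middle rows. That would explain fragments such as 'rest' and 'mic' in the word counts. A minimal sketch of cleaning every full title instead, assuming the titles are passed in as an iterable (the helper name `clean_titles` is illustrative and not part of the original notebook; the two sample titles are taken from the preview above):

```python
import re


def clean_titles(titles):
    """Join all titles into one lower-case string of letters and spaces."""
    # Keep only real strings so that missing titles (NaN) are skipped.
    text = " ".join(t for t in titles if isinstance(t, str))
    # Same cleaning rule as the notebook: non-letters become spaces.
    return re.sub(r"[^A-Za-z]", " ", text).lower()


sample_titles = [
    "Oscar Pistorius Attempts To Commit Suicide",
    "Trump Votes For Death Penalty For Being Gay",
]
cleaned = clean_titles(sample_titles)
print(cleaned)
```

With the notebook's data, the corresponding call would be `clean_titles(politic_fake['title'])`; counts computed from it would differ from the output shown in this notebook.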

The next step is to remove any stopwords. To do this, we first tokenize the words. 

In [10]:
#Tokenize words
nltk.download('punkt')
tokenized_sms = word_tokenize(sms)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kusht\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
#Remove stop words
nltk.download('stopwords')

for word in tokenized_sms:
    if word in stopwords.words('english'):
        tokenized_sms.remove(word)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kusht\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
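
Removing items from a list while looping over that same list makes Python skip the element that follows each removal, which is a likely reason stop words such as 'the' and 'a' still appear in the counts. A hedged alternative is to build a new list; in this sketch the helper name `remove_stop_words` and the tiny stop-word list are stand-ins, the latter for `stopwords.words('english')`.

```python
def remove_stop_words(tokens, stop_words):
    """Return a new list containing only the tokens that are not stop words."""
    # A set gives constant-time membership checks and is built only once.
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]


# Consecutive stop words are the case the in-place loop tends to miss.
demo_tokens = ["the", "the", "court", "orders", "a", "a", "payment"]
demo_stop_words = ["the", "a"]  # stand-in for stopwords.words('english')
filtered = remove_stop_words(demo_tokens, demo_stop_words)
print(filtered)
```

In the notebook this would be called as `remove_stop_words(tokenized_sms, stopwords.words('english'))`.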


Now that we have our titles cleaned and tokenized, we can count the occurrences of each word and store the counts in a dictionary.

In [12]:
counts = {}
for i in tokenized_sms:
    if i not in counts:
        counts[i] = 0
    counts[i] = counts[i] + 1
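
The manual tally above can also be written with `collections.Counter`, which accepts the token list directly. A minimal sketch with made-up tokens:

```python
from collections import Counter

# Illustrative tokens, not taken from the dataset.
demo_tokens = ["trump", "breaking", "trump", "obama", "breaking", "trump"]

# Counter tallies each distinct token in a single pass.
word_counts = Counter(demo_tokens)

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = word_counts.most_common(2)
print(top_two)  # [('trump', 3), ('breaking', 2)]
```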

Finally, we use Counter to determine the most common words across all the titles.

In [13]:
d = Counter(counts)
d.most_common()

[('trump', 10),
 ('breaking', 8),
 ('obama', 7),
 ('clinton', 4),
 ('donald', 4),
 ('court', 3),
 ('new', 3),
 ('george', 3),
 ('arrested', 3),
 ('the', 3),
 ('suicide', 2),
 ('death', 2),
 ('pope', 2),
 ('francis', 2),
 ('man', 2),
 ('york', 2),
 ('men', 2),
 ('hillary', 2),
 ('president', 2),
 ('us', 2),
 ('issued', 2),
 ('snapchat', 2),
 ('t', 2),
 ('official', 2),
 ('c', 2),
 ('arrest', 2),
 ('america', 2),
 ('account', 2),
 ('suspended', 2),
 ('police', 2),
 ('a', 2),
 ('first', 1),
 ('nfl', 1),
 ('team', 1),
 ('declares', 1),
 ('bankruptcy', 1),
 ('orders', 1),
 ('pay', 1),
 ('million', 1),
 ('rest', 1),
 ('update', 1),
 ('second', 1),
 ('roy', 1),
 ('moore', 1),
 ('accuser', 1),
 ('works', 1),
 ('mic', 1),
 ('oscar', 1),
 ('pistorius', 1),
 ('attempts', 1),
 ('commit', 1),
 ('votes', 1),
 ('penalty', 1),
 ('gay', 1),
 ('putin', 1),
 ('says', 1),
 ('not', 1),
 ('god', 1),
 ('wanted', 1),
 ('infecting', 1),
 ('saudi', 1),
 ('arabia', 1),
 ('behead', 1),
 ('school', 1),
 ('girls', 

We repeat this process for the Politifact real news dataset.

In [14]:
politic_real = pd.read_csv('./data/dataset/politifact_real.csv')

In [15]:
print(politic_real.shape)
politic_real.head()

(624, 4)


Unnamed: 0,id,news_url,title,tweet_ids
0,politifact14984,http://www.nfib-sbet.org/,National Federation of Independent Business,967132259869487105\t967164368768196609\t967215...
1,politifact12944,http://www.cq.com/doc/newsmakertranscripts-494...,comments in Fayetteville NC,942953459\t8980098198\t16253717352\t1668513250...
2,politifact333,https://web.archive.org/web/20080204072132/htt...,"Romney makes pitch, hoping to close deal : Ele...",
3,politifact4358,https://web.archive.org/web/20110811143753/htt...,Democratic Leaders Say House Democrats Are Uni...,
4,politifact779,https://web.archive.org/web/20070820164107/htt...,"Budget of the United States Government, FY 2008",89804710374154240\t91270460595109888\t96039619...


In [16]:
#Remove non alphabetic characters
sms = re.sub('[^A-Za-z]', ' ', str(politic_real['title']))

In [17]:
#Make the word lower case
sms = sms.lower()

In [18]:
#Tokenize words
nltk.download('punkt')
tokenized_sms = word_tokenize(sms)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kusht\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [19]:
#Remove stop words
nltk.download('stopwords')

for word in tokenized_sms:
    if word in stopwords.words('english'):
        tokenized_sms.remove(word)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kusht\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
counts = {}
for i in tokenized_sms:
    if i not in counts:
        counts[i] = 0
    counts[i] = counts[i] + 1

In [21]:
d = Counter(counts)
d.most_common()

[('the', 7),
 ('says', 5),
 ('democratic', 4),
 ('transcript', 4),
 ('house', 3),
 ('trump', 3),
 ('political', 3),
 ('health', 3),
 ('hillary', 3),
 ('clinton', 3),
 ('sanders', 3),
 ('debate', 3),
 ('national', 2),
 ('comments', 2),
 ('budget', 2),
 ('affordable', 2),
 ('gop', 2),
 ('stop', 2),
 ('statistics', 2),
 ('care', 2),
 ('week', 2),
 ('impeachment', 2),
 ('rand', 2),
 ('paul', 2),
 ('u', 2),
 ('senate', 2),
 ('show', 2),
 ('report', 2),
 ('bernie', 2),
 ('tax', 2),
 ('senator', 2),
 ('group', 2),
 ('federation', 1),
 ('independent', 1),
 ('business', 1),
 ('fayetteville', 1),
 ('nc', 1),
 ('romney', 1),
 ('makes', 1),
 ('pitch', 1),
 ('hoping', 1),
 ('close', 1),
 ('deal', 1),
 ('ele', 1),
 ('leaders', 1),
 ('say', 1),
 ('democrats', 1),
 ('uni', 1),
 ('united', 1),
 ('states', 1),
 ('government', 1),
 ('fy', 1),
 ('donald', 1),
 ('exaggerates', 1),
 ('china', 1),
 ('ha', 1),
 ('th', 1),
 ('amendment', 1),
 ('briefing', 1),
 ('white', 1),
 ('press', 1),
 ('secretary', 1),
 (

To tackle research question 2, we analyze the news_url column. Again, we start with the Politifact fake news dataset. We extract the website name by splitting each news_url string on '/'. An if statement handles the two patterns in which URLs are stored: with a leading 'http:' or 'https:' the website is the third piece, and without one it is the first. Finally, we store these websites in a dictionary along with their frequency.

In [27]:
counts = {}
for i in politic_fake['news_url']:
    if i == i:  # NaN is not equal to itself, so this skips missing news_url values
        split = i.split('/')
        if split[0] == 'http:' or split[0] == 'https:':
            if split[2] not in counts:
                counts[split[2]] = 0
            counts[split[2]] = counts[split[2]] + 1
        else:
            if split[0] not in counts:
                counts[split[0]] = 0
            counts[split[0]] = counts[split[0]] + 1
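
The same extraction can be sketched with the standard library's `urllib.parse.urlparse`. The helper name `extract_domain` and the sample URL paths are hypothetical, and stripping a leading `www.` is an added assumption (it merges entries such as `www.cq.com` and `cq.com`) that the original cell does not make.

```python
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the host of a URL, or None for a missing or empty value."""
    if not isinstance(url, str) or not url.strip():
        return None  # NaN and empty strings carry no website
    url = url.strip()
    # urlparse only recognizes the host when a scheme or '//' comes first,
    # so scheme-less entries get '//' prepended.
    if not url.lower().startswith(("http://", "https://")):
        url = "//" + url
    host = urlparse(url).hostname or ""
    # Assumption: 'www.example.com' and 'example.com' are the same site.
    if host.startswith("www."):
        host = host[4:]
    return host or None


# Hosts appear in the dataset; the paths are made up for illustration.
demo_urls = [
    "https://howafrica.com/some-article",
    "speedtalk.com/forum/some-thread",
    "http://www.cq.com/doc/some-transcript",
    "cq.com/another-page",
    float("nan"),
]
domain_counts = Counter(d for d in map(extract_domain, demo_urls) if d)
print(domain_counts.most_common())
```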

Finally, we use Counter to determine the most common websites. 

In [29]:
d = Counter(counts)
d.most_common()

[('web.archive.org', 69),
 ('yournewswire.com', 15),
 ('www.facebook.com', 6),
 ('www.thegatewaypundit.com', 5),
 ('dailyfeed.news', 4),
 ('thehill.com', 4),
 ('worldnewsdailyreport.com', 4),
 ('www.react365.com', 4),
 ('www.washingtonpost.com', 4),
 ('me.me', 3),
 ('www.breitbart.com', 3),
 ('www.breakingnews365.net', 3),
 ('www.cnn.com', 3),
 ('www.newsweek.com', 3),
 ('www.usacarry.com', 3),
 ('politicot.com', 3),
 ('www.neonnettle.com', 3),
 ('dailyworldupdate.us', 3),
 ('www.trendolizer.com', 3),
 ('observeronline.news', 3),
 ('www.usatoday.com', 3),
 ('our.news', 3),
 ('therightists.com', 2),
 ('www.politico.com', 2),
 ('nrtonline.info', 2),
 ('www.puppetstringnews.com', 2),
 ('newsbreakshere.com', 2),
 ('nyeveningnews.com', 2),
 ('www.newslo.com', 2),
 ('www.nbcnews.com', 2),
 ('obama.trendolizer.com', 2),
 ('flashnewss.club', 2),
 ('eveningw.com', 2),
 ('theglobalheadlines.net', 2),
 ('americannews.com', 2),
 ('mainerepublicemailalert.com', 2),
 ('www.bbc.com', 2),
 ('notallowe

We repeat this process for the Politifact real news dataset.

In [30]:
counts = {}
for i in politic_real['news_url']:
    if i == i:  # NaN is not equal to itself, so this skips missing news_url values
        split = i.split('/')
        if split[0] == 'http:' or split[0] == 'https:':
            if split[2] not in counts:
                counts[split[2]] = 0
            counts[split[2]] = counts[split[2]] + 1
        else:
            if split[0] not in counts:
                counts[split[0]] = 0
            counts[split[0]] = counts[split[0]] + 1

In [31]:
d = Counter(counts)
d.most_common()

[('web.archive.org', 127),
 ('www.youtube.com', 45),
 ('abcnews.go.com', 24),
 ('www.nytimes.com', 22),
 ('www.politifact.com', 21),
 ('www.whitehouse.gov', 17),
 ('www.cq.com', 13),
 ('www.msnbc.msn.com', 13),
 ('www.washingtonpost.com', 12),
 ('www.foxnews.com', 11),
 ('www.cnn.com', 10),
 ('medium.com', 9),
 ('twitter.com', 8),
 ('www.senate.gov', 7),
 ('transcripts.cnn.com', 7),
 ('www.cbsnews.com', 6),
 ('www.cbo.gov', 6),
 ('www.politico.com', 5),
 ('frwebgate.access.gpo.gov', 5),
 ('thomas.loc.gov', 5),
 ('politicaladarchive.org', 4),
 ('data.bls.gov', 4),
 ('www.nbcnews.com', 4),
 ('www.cdc.gov', 4),
 ('www.bls.gov', 4),
 ('query.nytimes.com', 4),
 ('time.com', 4),
 ('cq.com', 4),
 ('www.kff.org', 3),
 ('www.bea.gov', 3),
 ('my.barackobama.com', 3),
 ('politifact.com', 3),
 ('www.c-span.org', 3),
 ('youtu.be', 2),
 ('www.desmoinesregister.com', 2),
 ('video.foxnews.com', 2),
 ('www.taxpolicycenter.org', 2),
 ('www.eia.gov', 2),
 ('www.ilga.gov', 2),
 ('www.facebook.com', 2),
 (

### Results

*Research question 1*

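Most common fake news words: Trump (10), Breaking (8), Obama (7), Clinton (4). Sensational words such as Suicide, Death, and Arrested also appear, two to three times each.

Most common real news words (ignoring the stop word 'the'): Says (5), Democratic (4), Transcript (4), House (3)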

*Research question 2*

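Most common fake news websites (after web.archive.org, 69): yournewswire.com (15), www.facebook.com (6), www.thegatewaypundit.com (5)

Most common real news websites (after web.archive.org, 127): www.youtube.com (45), abcnews.go.com (24), www.nytimes.com (22), www.politifact.com (21)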
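
The lists above rank each class separately, so a term such as 'trump' or 'clinton' can rank highly in both. One hedged way to look at association rather than raw frequency is to compare each term's share of its class. The helper name `share_difference` is hypothetical, and the demo uses only a handful of counts from the listings above, with totals taken over those few terms, so the numbers are purely illustrative.

```python
from collections import Counter


def share_difference(fake_counts, real_counts):
    """Map each term to (share among fake terms) - (share among real terms)."""
    fake_total = sum(fake_counts.values())
    real_total = sum(real_counts.values())
    if not fake_total or not real_total:
        raise ValueError("both classes need at least one counted term")
    terms = set(fake_counts) | set(real_counts)
    # A Counter returns 0 for a term it has not seen.
    return {
        term: fake_counts[term] / fake_total - real_counts[term] / real_total
        for term in terms
    }


# Illustrative subset only; shares are relative to these few terms.
fake_demo = Counter({"trump": 10, "breaking": 8, "clinton": 4})
real_demo = Counter({"says": 5, "trump": 3, "clinton": 3})

diffs = share_difference(fake_demo, real_demo)
# Positive values lean fake, negative values lean real.
ranked = sorted(diffs, key=diffs.get, reverse=True)
print(ranked)
```

The same function applies to website counts, since it only assumes two `Counter` objects.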

### Conclusion

From the above results, we see that fake news titles use many proper nouns (names such as Trump, Obama, and Clinton) and attention-grabbing words such as 'breaking' for clickbait. Such words lure users scrolling through Twitter to click on these articles. Real news titles use duller words in comparison, such as 'says', 'transcript', and 'debate'; their single most common token is 'the'. One implication is that this information can persuade journalists to use more exciting titles to compete with those of fake news.

Coming to research question 2, we see a stark difference between the websites sharing fake vs real news articles. This gives users good reason to check the source of an article when trying to distinguish a fake article from a real one.

### Limitations

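One limitation of this analysis is that the dataset is small: 432 fake and 624 real Politifact articles. In addition, converting the title column with `str()` yields pandas' abbreviated display of the column rather than every full title, which is why fragments such as 'rest', 'mic', and 'ele' appear among the tokens. Hence, the word counts for research question 1 are relatively low.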