# Data Exploration

Exploring publicly available datasets

In [1]:
import pandas as pd
import numpy as np

## NewsDataset

Source: https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets

In [2]:
df1_fake = pd.read_csv("raw_data/NewsDataSet/Fake.csv")
df1_real = pd.read_csv("raw_data/NewsDataSet/True.csv")

In [3]:
df1_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [4]:
df1_fake.iloc[2000]["text"]

'No matter what Donald Trump does or where he goes, he s never actually doing the right thing and fulfilling his duties as POTUS.One might think that after suffering such a monstrous fail with his American Health Care Act, the president might double down and get right back to work on his quest to somehow  improve  Obamacare, which he insists is awful despite the fact the Americans overwhelmingly approve of it. But no   instead, Trump decided to spend his Saturday the way he s spent pretty much every Saturday since he became president   by playing golf.In the short 9 weeks of his presidency, Trump has already gone golfing 12 times   which is far more than any of his predecessors and former President Barack Obama, whom Trump once criticized for taking any downtime to play golf. Trump has been getting blasted for his weekend golfing getaways, and the White House has gone to great lengths to hide it:However, Trump s cover was blown when some Instagram photographs of Trump surfaced, reveali

In [5]:
df1_real.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [6]:
df1_real.iloc[2000]["text"]

'WASHINGTON (Reuters) - U.S. Secretary of State Rex Tillerson said on Sunday the firing of three ballistic missiles by North Korea this week was a provocative act but that the United States will continue to seek a peaceful resolution.  “We do view it as a provocative act against the United States and our allies,” Tillerson said in an interview on Fox News Sunday. “We’re going to continue our peaceful pressure campaign as I have described it, working with allies, working with China as well to see if we can bring the regime in Pyongyang to the negotiating table.” '

In [7]:
df1_fake.groupby('subject').count()

Unnamed: 0_level_0,title,text,date
subject,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Government News,1570,1570,1570
Middle-east,778,778,778
News,9050,9050,9050
US_News,783,783,783
left-news,4459,4459,4459
politics,6841,6841,6841


In [8]:
df1_real.groupby('subject').count()

Unnamed: 0_level_0,title,text,date
subject,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
politicsNews,11272,11272,11272
worldnews,10145,10145,10145


In [9]:
print("Length of real: ", len(df1_real))
print("Length of fake: ", len(df1_fake))

Length of real:  21417
Length of fake:  23481


The `subject` column is probably not useful. The fake and real files use completely disjoint sets of subject labels, so the column would give away the class on its own.
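One way to check whether the fake and real files share any `subject` values is to intersect the two label sets. The sketch below hard-codes the values from the two `groupby` outputs above. With the DataFrames loaded, `set(df1_fake["subject"])` and `set(df1_real["subject"])` should give the same sets.

```python
# Subject labels copied from the two groupby outputs above.
fake_subjects = {"Government News", "Middle-east", "News", "US_News", "left-news", "politics"}
real_subjects = {"politicsNews", "worldnews"}

# An empty intersection means `subject` alone separates fake from real,
# so a model given this column could ignore the article text entirely.
shared = fake_subjects & real_subjects
print(sorted(shared))  # prints []
```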

This dataset is very politics-heavy. From skimming the rows, it is not obvious that the fake and real labels correspond to false and true content. The split looks closer to opinion versus factual reporting than to false versus true.

A lot of cleaning is also needed. For example, the word `Reuters` shows up repeatedly in the real articles but not in the fake ones, so we need to make sure it is removed, or else the model will just predict based on that.
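A minimal sketch of that cleaning step, assuming the marker always appears as a leading dateline of the form `CITY (Reuters) - `, as in the samples above. The helper name `strip_dateline` and the 60-character cap are our own choices, not part of the dataset.

```python
import re

# Matches a leading dateline such as "WASHINGTON (Reuters) - " or
# "SEATTLE/WASHINGTON (Reuters) - ". The 60-character cap on the location
# part is an arbitrary guard against matching deep into the article.
DATELINE = re.compile(r"^\s*[^()]{0,60}\(Reuters\)\s*-\s*")

def strip_dateline(text: str) -> str:
    """Remove a leading Reuters dateline; leave any other text unchanged."""
    return DATELINE.sub("", text, count=1)

sample = "WASHINGTON (Reuters) - U.S. Secretary of State Rex Tillerson said on Sunday"
print(strip_dateline(sample))
```

It could be applied with `df1_real["text"].map(strip_dateline)`. Mentions of Reuters inside the article body are not touched and would need a separate rule.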

## FakeNewsNet

Source: https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset

In [10]:
df2_real_gossip = pd.read_csv("raw_data/FakeNewsNet/gossipcop_real.csv")
df2_real_politifact = pd.read_csv("raw_data/FakeNewsNet/politifact_real.csv")
df2_fake_gossip = pd.read_csv("raw_data/FakeNewsNet/gossipcop_fake.csv")
df2_fake_politifact = pd.read_csv("raw_data/FakeNewsNet/politifact_fake.csv")

In [11]:
df2_real_gossip.head()

Unnamed: 0,id,news_url,title,tweet_ids
0,gossipcop-882573,https://www.brides.com/story/teen-mom-jenelle-...,Teen Mom Star Jenelle Evans' Wedding Dress Is ...,912371411146149888\t912371528343408641\t912372...
1,gossipcop-875924,https://www.dailymail.co.uk/tvshowbiz/article-...,Kylie Jenner refusing to discuss Tyga on Life ...,901989917546426369\t901989992074969089\t901990...
2,gossipcop-894416,https://en.wikipedia.org/wiki/Quinn_Perkins,Quinn Perkins,931263637246881792\t931265332022579201\t931265...
3,gossipcop-857248,https://www.refinery29.com/en-us/2018/03/19192...,I Tried Kim Kardashian's Butt Workout & Am For...,868114761723936769\t868122567910936576\t868128...
4,gossipcop-884684,https://www.cnn.com/2017/10/04/entertainment/c...,Celine Dion donates concert proceeds to Vegas ...,915528047004209152\t915529285171122176\t915530...


In [12]:
df2_fake_gossip.head()

Unnamed: 0,id,news_url,title,tweet_ids
0,gossipcop-2493749932,www.dailymail.co.uk/tvshowbiz/article-5874213/...,Did Miley Cyrus and Liam Hemsworth secretly ge...,284329075902926848\t284332744559968256\t284335...
1,gossipcop-4580247171,hollywoodlife.com/2018/05/05/paris-jackson-car...,Paris Jackson & Cara Delevingne Enjoy Night Ou...,992895508267130880\t992897935418503169\t992899...
2,gossipcop-941805037,variety.com/2017/biz/news/tax-march-donald-tru...,Celebrities Join Tax March in Protest of Donal...,853359353532829696\t853359576543920128\t853359...
3,gossipcop-2547891536,www.dailymail.co.uk/femail/article-3499192/Do-...,Cindy Crawford's daughter Kaia Gerber wears a ...,988821905196158981\t988824206556172288\t988825...
4,gossipcop-5476631226,variety.com/2018/film/news/list-2018-oscar-nom...,Full List of 2018 Oscar Nominations – Variety,955792793632432131\t955795063925301249\t955798...


This dataset is very comprehensive, but retrieving the tweets requires the Twitter API, which is unfortunately not free if we want to use more than 500 posts.

This is probably not realistic.
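To gauge how far a 500-post limit would go, we can count the tweet IDs per article. The sketch below assumes each `tweet_ids` cell is a tab-separated string, as the `\t` separators in the output above suggest. `count_tweet_ids` is a hypothetical helper.

```python
def count_tweet_ids(cell: object) -> int:
    """Count the tab-separated tweet IDs in one `tweet_ids` cell."""
    if not isinstance(cell, str):  # missing cells are read as NaN (a float)
        return 0
    return len([t for t in cell.split("\t") if t.strip()])

# Illustrative cell: only the first two IDs of the first gossipcop_real row,
# since the rest of that cell is truncated in the output above.
cell = "912371411146149888\t912371528343408641"
print(count_tweet_ids(cell))  # prints 2
```

Summing `df2_real_gossip["tweet_ids"].map(count_tweet_ids)` over the four files would give the total number of posts to fetch.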

## COVID Misinfo

Source1: https://zenodo.org/records/4557828 (misinfo vid)

Source2: https://esoc.princeton.edu/publications/esoc-covid-19-misinformation-dataset

In [13]:
df3_vid = pd.read_csv("raw_data/CovidMisInfo/covid-misinfo-videos.csv")

In [14]:
df3_vid.head()

Unnamed: 0,youtube_link,video_title,video_description,view_count,channel_id,subscriber_count,removal_timestamp,published_timestamp,archive_url,facebook_graph_reactions,facebook_graph_comments,facebook_graph_shares,twitter_post_ids,facebook_post_ids
0,https://youtube.com/watch?v=-0ERhEl3n4U,DA NON PERDERE! DR RASHID BUTTAR VS GATES E FAUCI,"N. B. Per il RISPETTO di tutti, non saranno ap...",,,,,,,700,183,1300,"[1254689102546558976, 1253642806704386048, 125...","[3396214160413490, 368262960761073, 5298753012..."
1,https://youtube.com/watch?v=-0FFXqMkwLM,Coronavirus clinicamente non esiste più! Alber...,Coronavirus clinicamente non esiste più! Alber...,,,,,,,8921,6204,2310,"[1267576569213698048, 1267550551459409920, 126...","[868056943670381, 655605928364978, 22862780483..."
2,https://youtube.com/watch?v=-0HCr9Y4qiQ,Dr Erickson COVID 19,#COV #Corona #COVID,,,,,,,82,38,124,"[1255642644807639040, 1255637802710114304, 125...","[228854528344019, 3223708224340040, 3390196817..."
3,https://youtube.com/watch?v=-1PJjn0Z6rw,Doctors Speak Out About COVID 19 & The Violent...,Created & Edited By Ryan Cristián The only que...,1896.0,UC_ClYrAtDNAGy5J0N-AwBNw,8.0,2020-05-16 23:31:30,2020-05-11 0:00:00,http://web.archive.org/web/20200511183250/http...,1500,540,1167,"[1260591220230885376, 1260557512350334976, 126...","[3222231977789105, 2607261019494178, 297814419..."
4,https://youtube.com/watch?v=-1g-Gta858E,Noticia - Descubierto el primer fármaco eficaz...,Descubierto el primer fármaco eficaz contra el...,,,,,,,2,0,1,[1273255975911329792],[3124940617566423]


This dataset is not labeled. It is also very limited, for three reasons:

1. The Facebook post API is required.
2. The X API is required.
3. No text from the videos is available, only the video title and description.
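If the post IDs were ever fetched, the `twitter_post_ids` and `facebook_post_ids` cells would first need parsing. The sketch below assumes the CSV stores each cell as a bracketed, comma-separated list, which is how it appears in the output above. `parse_id_list` is a hypothetical helper.

```python
import ast

def parse_id_list(cell: object) -> list:
    """Parse a cell such as "[1273255975911329792]" into a list of ID strings."""
    if not isinstance(cell, str):  # missing cells are read as NaN (a float)
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, (list, tuple)):
        return []
    # Keep IDs as strings so they are never rounded as floats downstream.
    return [str(v) for v in value]

print(parse_id_list("[1273255975911329792]"))  # prints ['1273255975911329792']
```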

In [15]:
df3_jns = pd.read_excel("raw_data/CovidMisInfo/jns-covid_misinfo_2021-03-06_Final_Clean.xlsx")



In [16]:
df3_jns.head()

Unnamed: 0,s_no,Reported_On,Additional_Reporting,Retrieve_from_1,Retrieve_from_2,Retrieve_from_3,Twitter_Reference,Direct_Post_1,Direct_Post_2,Direct_Post_3,...,Motive_Description,Source,Source_Description,Distrib_Channel,Misinfo_Type,Key_Words,Summary,Coder,Notes,Region
0,1,https://www.buzzfeednews.com/article/ryanhates...,,https://www.buzzfeednews.com/article/ryanhates...,,,0.0,,,,...,Efforts to spread false claims on the origins ...,Individual actor,General public,Youtube,Conspiracy,"Coronavirus, India, bat soup",Hindi language Youtube account suggesting COVI...,Jan,,
1,2,https://twitter.com/Rangoli_A/status/122779241...,,,,,1.0,https://twitter.com/Rangoli_A/status/122779241...,,,...,"Twitter user stoking fear among other users, s...",Individual actor,General public,Twitter,False reporting,"Coronavirus, China, shooting","Tweet with video showing ""people getting shot ...",Jan,,
2,3,https://twitter.com/Woppa1Woppa/status/1220068...,,,,,1.0,https://twitter.com/Woppa1Woppa/status/1220068...,,,...,Twitter user discrediting Chinese-American pop...,Individual actor,General public,Twitter,False reporting,"Coronavirus, Chinese food, bat soup","Video of an individual eating a delicacy, and ...",Jan,,
3,4,https://twitter.com/FreddiGoldstein/status/123...,,,,,1.0,https://twitter.com/FreddiGoldstein/status/123...,,,...,Chain message spread to stoke fear among Ameri...,Individual actor,General public,"Media, SMS",False reporting,"Coronavirus, NYPD, containment zone",Tweet with a screenshot of chain message sugge...,Jan,,
4,5,https://www.boomlive.in/health/hoax-alert-vira...,,https://www.boomlive.in/health/hoax-alert-vira...,,,0.0,,,,...,Chain message spread to stoke fear among India...,Individual actor,General public,"Facebook, WhatsApp",False reporting,"Coronavirus, India, travel advisory",WhatsApp chain message circulating among India...,Jan,,


In [17]:
df3_jns[["Source","Misinfo_Type"]].groupby("Misinfo_Type").count()

Unnamed: 0_level_0,Source
Misinfo_Type,Unnamed: 1_level_1
Conspiracy,966
"Conspiracy, Fake remedy",1
"Conspiracy, False reporting",5
Fake remedy,502
"Fake remedy, False reporting",1
"Fake remedy, conspiracy",1
"Fake remedy, false reporting",2
False Reporting,3
False reporting,4123
"False reporting, Conspiracy",1


In [18]:
df3_jns["Summary"].iloc[0]

'Hindi language Youtube account suggesting COVID-19 came from "Chinese people eating bat soup."'

In [19]:
df3_jns["Motive_Description"].iloc[1]

'Twitter user stoking fear among other users, suggesting violent reactions to COVID-19.'

In [20]:
df3_jns["Reported_On"].iloc[4]

'https://www.boomlive.in/health/hoax-alert-viral-emergency-notification-on-coronavirus-is-fake-6682'
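The `Misinfo_Type` counts above show the same category under different capitalisation and orderings (for example `False Reporting` vs `False reporting`, and `Fake remedy, conspiracy` vs `Conspiracy, Fake remedy`). A minimal normalisation sketch, assuming that order and case carry no meaning:

```python
def normalise_misinfo_type(label: str) -> str:
    """Lower-case each comma-separated part, sort the parts, and re-join them."""
    parts = {p.strip().lower() for p in label.split(",") if p.strip()}
    return ", ".join(sorted(parts))

# Label variants copied from the groupby output above.
raw = ["False Reporting", "False reporting", "Fake remedy, conspiracy", "Conspiracy, Fake remedy"]
print(sorted({normalise_misinfo_type(r) for r in raw}))
# prints ['conspiracy, fake remedy', 'false reporting']
```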

This dataset catalogs COVID-19 misinformation events. This means two things.

1. The raw text itself is not included, only URLs, and some of those may not be accessible, e.g. Twitter (X).
2. The text describes reported misinformation rather than being the misinformation text itself.

There might still be a use for this dataset: for example, matching a COVID-19 text against it to check for common misconceptions and misinformation.
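As a rough illustration of such a lookup, the sketch below scores an input text against the `Key_Words` field by token overlap. This is a crude baseline of our own, not an established method, and the query string is made up.

```python
import re

def tokens(text: str) -> set:
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(text: str, key_words: str) -> float:
    """Fraction of an event's keyword tokens that also appear in `text`."""
    wanted = tokens(key_words)
    if not wanted:
        return 0.0
    return len(wanted & tokens(text)) / len(wanted)

# Key_Words values copied from the first rows of df3_jns above.
events = ["Coronavirus, India, bat soup", "Coronavirus, NYPD, containment zone"]
query = "They say the coronavirus started with bat soup"  # made-up input text
scores = {kw: keyword_overlap(query, kw) for kw in events}
best = max(scores, key=scores.get)
print(best, scores[best])  # prints Coronavirus, India, bat soup 0.75
```

Since "Coronavirus" appears in the sampled keyword rows, it inflates every score. Dropping such common tokens would be a sensible first refinement.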

In [21]:
df3_urls = pd.read_excel("raw_data/CovidMisInfo/COVID_Disinformation_URLs_2020-04-10.xlsx")

In [22]:
df3_urls.head()

Unnamed: 0,Serial Number,Claim,URL,Date Entered,Source,Coder,Unnamed: 6
0,1,COVID-19 as a bio-weapon created by the Commun...,https://www.rushlimbaugh.com/daily/2020/02/24/...,2020-03-23,https://www.rushlimbaugh.com/daily/2020/02/24/...,Luca,
1,2,"Two companies claim they can ""eradicate COVID-...",https://www.protectedrestoration.com/virus-mit...,2020-03-23,https://www.click2houston.com/health/2020/03/1...,Luca,
2,3,"Individuals pretended to be officials of WHO, ...",https://securityboulevard.com/2020/03/coronavi...,2020-03-23,https://securityboulevard.com/2020/03/coronavi...,Luca,
3,4,COVID-19 as a bio-weapon created by the US in ...,https://www.vipnoviny.cz/kdo-stoji-za-vznikem-...,2020-03-23,https://cesti-elfove.cz/mimoradny-report-dezin...,Luca,
4,5,"A ""nano-silver"" toothpaste as a remedy for Cov...",https://www.infowarsstore.com/super-blue-fluor...,2020-03-24,https://qz.com/1818606/alex-jones-ordered-to-s...,Luca,


In [23]:
df3_urls["Claim"].iloc[10]

'Joe Biden tested positive for COVID'

In [24]:
df3_urls["URL"].iloc[10]

'http://ucrtv.com/usa-presidential-candidate-joe-biden-tests-positive-to-coronavirus/'

In [25]:
len(df3_urls)

213

This dataset is also interesting because it contains the misinformation claims themselves. However, it does not contain both real and fake data, i.e. it is unlabeled. It is also very small (213 rows). The dataset only links to the content, which we would need to scrape, and many of the sites have already taken the pages down.
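Before scraping, it would help to know which links still respond. A minimal sketch using only the standard library follows. The `opener` parameter is our own addition so the function can be exercised without network access, and the 10-second timeout is an arbitrary choice.

```python
import http.client
import urllib.request

def is_reachable(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Return True if `url` answers with a 2xx/3xx status, False on any error."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False
```

For example, `df3_urls["URL"].map(is_reachable).mean()` would estimate the share of live links. A successful response does not guarantee that the original article is still there, since a domain may now serve a placeholder page.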

## MisInfo79

This is similar to the first dataset (NewsDataset).

Source: https://www.kaggle.com/datasets/stevenpeutz/misinformation-fake-news-text-dataset-79k/data

In [26]:
df4_misinfo79_fake = pd.read_csv("raw_data/MisInfo79k/DataSet_Misinfo_FAKE.csv")
df4_misinfo79_real = pd.read_csv("raw_data/MisInfo79k/DataSet_Misinfo_TRUE.csv")

In [27]:
print("length of fake: ", len(df4_misinfo79_fake))
print("length of real: ", len(df4_misinfo79_real))

length of fake:  43642
length of real:  34975


In [28]:
df4_misinfo79_fake.head()

Unnamed: 0.1,Unnamed: 0,text
0,0,Donald Trump just couldn t wish all Americans ...
1,1,House Intelligence Committee Chairman Devin Nu...
2,2,"On Friday, it was revealed that former Milwauk..."
3,3,"On Christmas day, Donald Trump announced that ..."
4,4,Pope Francis used his annual Christmas Day mes...


In [29]:
df4_misinfo79_fake["text"].iloc[4]

'Pope Francis used his annual Christmas Day message to rebuke Donald Trump without even mentioning his name. The Pope delivered his message just days after members of the United Nations condemned Trump s move to recognize Jerusalem as the capital of Israel. The Pontiff prayed on Monday for the  peaceful coexistence of two states within mutually agreed and internationally recognized borders. We see Jesus in the children of the Middle East who continue to suffer because of growing tensions between Israelis and Palestinians,  Francis said.  On this festive day, let us ask the Lord for peace for Jerusalem and for all the Holy Land. Let us pray that the will to resume dialogue may prevail between the parties and that a negotiated solution can finally be reached. The Pope went on to plead for acceptance of refugees who have been forced from their homes, and that is an issue Trump continues to fight against. Francis used Jesus for which there was  no place in the inn  as an analogy. Today, as

In [30]:
df4_misinfo79_real.head()

Unnamed: 0.1,Unnamed: 0,text
0,0,The head of a conservative Republican faction ...
1,1,Transgender people will be allowed for the fir...
2,2,The special counsel investigation of links bet...
3,3,Trump campaign adviser George Papadopoulos tol...
4,4,President Donald Trump called on the U.S. Post...


In [None]:
len(df4_misinfo79_real["text"].iloc[4])

5172
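Because this dataset looks similar to NewsDataset, it is worth measuring how many texts the two share before combining them. The sketch below compares texts by a hash of their normalised form. The helper names and sample strings are made up for illustration.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Hash of the text after lower-casing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def overlap_count(texts_a, texts_b) -> int:
    """Number of distinct texts in `texts_a` that also occur in `texts_b`."""
    seen_b = {fingerprint(t) for t in texts_b}
    return len({fingerprint(t) for t in texts_a} & seen_b)

# Made-up strings standing in for two `text` columns.
a = ["Pope Francis used his  annual message", "Something else"]
b = ["pope francis used his annual message"]
print(overlap_count(a, b))  # prints 1
```

The real rows here appear without the `(Reuters) -` dateline that NewsDataset has, so the NewsDataset texts would likely need that prefix stripped before their fingerprints could match.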

## WelFake

Source: https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification

1 means real and 0 means fake

In [32]:
df5_welfake = pd.read_csv("raw_data/WelFake/WELFake_Dataset.csv")

In [33]:
df5_welfake.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [35]:
df5_welfake.isna().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

In [37]:
len(df5_welfake)

72134
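Since the label convention decides everything downstream, it is worth sanity-checking it against the data. One option, assuming Reuters wire stories are present and belong to the real class, is to count how many texts per label contain the `(Reuters)` marker. `marker_counts` is a hypothetical helper.

```python
from collections import Counter

def marker_counts(rows, marker: str = "(Reuters)") -> Counter:
    """Count, per label, how many texts contain `marker`."""
    counts = Counter()
    for text, label in rows:
        # Missing cells (NaN in pandas, None here) are skipped.
        if isinstance(text, str) and marker in text:
            counts[label] += 1
    return counts

# Made-up rows for illustration only; they say nothing about the real data.
rows = [
    ("LONDON (Reuters) - Example wire story", 1),
    ("Example opinion piece", 0),
    (None, 0),
]
print(marker_counts(rows))  # prints Counter({1: 1})
```

On the real data this would be `marker_counts(zip(df5_welfake["text"], df5_welfake["label"]))`. If the marker shows up mostly under the label assumed to be fake, the convention should be re-checked.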

In [38]:
df5_welfake["text"].iloc[0]

'No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others on a radio show Tuesday night to  turn the tide  and kill white people and cops to send a message about the killing of black people in America.One of the F***YoFlag organizers is called  Sunshine.  She has a radio blog show hosted from Texas called,  Sunshine s F***ing Opinion Radio Show. A snapshot of her #FYF911 @LOLatWhiteFear Twitter page at 9:53 p.m. shows that she was urging supporters to  Call now!! #fyf911 tonight we continue to dismantle the illusion of white Below is a SNAPSHOT Twitter Radio Call Invite   #FYF911The radio show aired at 10:00 p.m. eastern standard time.During the show, callers clearly call for  lynching  and  killing  of white people.A 2:39 minute clip from the radio show can be heard here. It was provided to Breitbart Texas by someone who would like to be referred to