 
# Introduction
In today's world, fake news has become a significant issue, especially in the context of elections and politics. The divisiveness caused by fake news can have far-reaching consequences on public opinion and societal harmony. Our project aims to analyze fake news and determine if the words and language structure of an article can indicate its reliability. By leveraging Natural Language Processing (NLP) and sentiment analysis, we hope to uncover patterns that distinguish fake news from true news.


# Data Curation



Source: PolitiFact

Description: Contains political quotes with their truthfulness rating (e.g., True, Mostly True, Half True, False, Pants on Fire). Also includes contextual information about the speaker like who they are, the audience, the speaker's past statements and their political party.

We have two datasets from this website, which we will use to help determine whether a political statement made by an individual with given qualities is likely to be reliable or false. The two datasets have different features, and the purpose of this data curation section is to make them consistent with each other.


In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [56]:
statements = pd.read_csv('statements.tsv', sep='\t')
print(len(statements))
statements.head()

10239


Unnamed: 0,2635.json,false,Says the Annies List political group supports third-trimester abortions on demand.,abortion,dwayne-bohac,State representative,Texas,republican,0,1,0.1,0.2,0.3,a mailer
0,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
1,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
2,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
3,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN
4,12465.json,true,The Chicago Bears have had more starting quart...,education,robin-vos,Wisconsin Assembly speaker,Wisconsin,republican,0.0,3.0,2.0,5.0,1.0,a an online opinion-piece


Since the dataset doesn't have headers, I set headers based on the description given on the PolitiFact website. I then filled all NaN values in the count columns with 0s and dropped the columns that are irrelevant to our analysis.

In [57]:
statements = pd.read_csv('statements.tsv', sep='\t', header=None)
statements.columns = ["id", "reliability", "statement", "subject", "speaker", "job_title", "state", "party", "barely_true_counts", "false_counts", "half_true_counts", "mostly_true_counts", "pants_fire_counts", "audience"]
count_columns = ['barely_true_counts', 'false_counts', 'half_true_counts', 'mostly_true_counts', 'pants_fire_counts']
statements[count_columns] = statements[count_columns].fillna(0).astype(int)
statements.drop(columns = ["state", "job_title", "id"], inplace = True)
display(statements.head())

Unnamed: 0,reliability,statement,subject,speaker,party,barely_true_counts,false_counts,half_true_counts,mostly_true_counts,pants_fire_counts,audience
0,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,republican,0,1,0,0,0,a mailer
1,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,democrat,0,0,1,1,0,a floor speech.
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,democrat,70,71,160,163,9,Denver
3,false,Health care reform legislation is likely to ma...,health-care,blog-posting,none,7,19,3,5,44,a news release
4,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,democrat,15,9,20,19,2,an interview on CNN


The PolitiFact dataset had NaNs instead of zeros in the count columns, so we replaced them with 0s and converted those columns to integers. We also split each subject string into a list of subjects so it's easier to categorize.

In [58]:
statements['subject'].unique()
statements['subject'] = statements['subject'].apply(lambda x: x.split(',') if isinstance(x, str) else x)
statements['subject'] = statements['subject'].apply(lambda x: [item.strip() for item in x] if isinstance(x, list) else x)
statements.head()


Unnamed: 0,reliability,statement,subject,speaker,party,barely_true_counts,false_counts,half_true_counts,mostly_true_counts,pants_fire_counts,audience
0,false,Says the Annies List political group supports ...,[abortion],dwayne-bohac,republican,0,1,0,0,0,a mailer
1,half-true,When did the decline of coal start? It started...,"[energy, history, job-accomplishments]",scott-surovell,democrat,0,0,1,1,0,a floor speech.
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",[foreign-policy],barack-obama,democrat,70,71,160,163,9,Denver
3,false,Health care reform legislation is likely to ma...,[health-care],blog-posting,none,7,19,3,5,44,a news release
4,half-true,The economic turnaround started at the end of ...,"[economy, jobs]",charlie-crist,democrat,15,9,20,19,2,an interview on CNN
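With subjects stored as lists, one way to categorize statements by topic is to give each subject its own row and then count. The following is a minimal sketch on a toy frame, not part of the original analysis; `toy`, `exploded`, and `subject_counts` are hypothetical names and the rows are illustrative only.

```python
import pandas as pd

# Toy stand-in for the `statements` frame (assumed, illustrative rows only).
toy = pd.DataFrame({
    "reliability": ["false", "half-true", "half-true"],
    "subject": [["abortion"], ["energy", "history"], ["economy", "energy"]],
})

# explode() gives each list element its own row, repeating the other columns.
exploded = toy.explode("subject")

# Count how often each subject appears across all statements.
subject_counts = exploded["subject"].value_counts()
print(subject_counts)
```

The same two calls should work on `statements` itself, provided every entry in `subject` is a list or NaN.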


I then changed the format of the speaker column so that it matches the speaker names in the second dataset.

In [59]:
statements['speaker'] = statements['speaker'].astype(str).str.lower().str.replace('-', ' ')
statements.head()

Unnamed: 0,reliability,statement,subject,speaker,party,barely_true_counts,false_counts,half_true_counts,mostly_true_counts,pants_fire_counts,audience
0,false,Says the Annies List political group supports ...,[abortion],dwayne bohac,republican,0,1,0,0,0,a mailer
1,half-true,When did the decline of coal start? It started...,"[energy, history, job-accomplishments]",scott surovell,democrat,0,0,1,1,0,a floor speech.
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",[foreign-policy],barack obama,democrat,70,71,160,163,9,Denver
3,false,Health care reform legislation is likely to ma...,[health-care],blog posting,none,7,19,3,5,44,a news release
4,half-true,The economic turnaround started at the end of ...,"[economy, jobs]",charlie crist,democrat,15,9,20,19,2,an interview on CNN


In [60]:
# Rating labels are hyphenated in the data (e.g. 'half-true', 'pants-fire')
true = ['half-true', 'mostly-true', 'true']
false = ['false', 'pants-fire', 'barely-true']
# Collapse the six ratings into a binary verdict
statements['verdict'] = statements['reliability'].apply(lambda x: 'true' if x in true else 'false')
display(statements.head())


Unnamed: 0,reliability,statement,subject,speaker,party,barely_true_counts,false_counts,half_true_counts,mostly_true_counts,pants_fire_counts,audience,verdict
0,false,Says the Annies List political group supports ...,[abortion],dwayne bohac,republican,0,1,0,0,0,a mailer,false
1,half-true,When did the decline of coal start? It started...,"[energy, history, job-accomplishments]",scott surovell,democrat,0,0,1,1,0,a floor speech.,true
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",[foreign-policy],barack obama,democrat,70,71,160,163,9,Denver,true
3,false,Health care reform legislation is likely to ma...,[health-care],blog posting,none,7,19,3,5,44,a news release,false
4,half-true,The economic turnaround started at the end of ...,"[economy, jobs]",charlie crist,democrat,15,9,20,19,2,an interview on CNN,true


Now we process the second PolitiFact file, a JSON file. We will read the JSON into a pandas dataframe, rename the columns to fit the conventions of the first dataframe, and drop the columns we don't need.

In [61]:
statements2 = pd.read_json("politifact_factcheck_data.json", lines=True)
statements2.head()


Unnamed: 0,verdict,statement_originator,statement,statement_date,statement_source,factchecker,factcheck_date,factcheck_analysis_link
0,true,Barack Obama,John McCain opposed bankruptcy protections for...,6/11/2008,speech,Adriel Bettelheim,6/16/2008,https://www.politifact.com/factchecks/2008/jun...
1,false,Matt Gaetz,"""Bennie Thompson actively cheer-led riots in t...",6/7/2022,television,Yacob Reyes,6/13/2022,https://www.politifact.com/factchecks/2022/jun...
2,mostly-true,Kelly Ayotte,"Says Maggie Hassan was ""out of state on 30 day...",5/18/2016,news,Clay Wirestone,5/27/2016,https://www.politifact.com/factchecks/2016/may...
3,false,Bloggers,"""BUSTED: CDC Inflated COVID Numbers, Accused o...",2/1/2021,blog,Madison Czopek,2/5/2021,https://www.politifact.com/factchecks/2021/feb...
4,half-true,Bobby Jindal,"""I'm the only (Republican) candidate that has ...",8/30/2015,television,Linda Qiu,8/30/2015,https://www.politifact.com/factchecks/2015/aug...


In [62]:
statements2['statement_originator'] = statements2['statement_originator'].astype(str).str.lower().str.replace('-', ' ')
statements2.rename(columns={'statement_originator': 'speaker', 'verdict': 'reliability', 'statement_source': 'audience'}, inplace=True)
statements2.drop(columns=["factchecker", "factcheck_date", "factcheck_analysis_link"], inplace=True)

statements2.head()






Unnamed: 0,reliability,speaker,statement,statement_date,audience
0,true,barack obama,John McCain opposed bankruptcy protections for...,6/11/2008,speech
1,false,matt gaetz,"""Bennie Thompson actively cheer-led riots in t...",6/7/2022,television
2,mostly-true,kelly ayotte,"Says Maggie Hassan was ""out of state on 30 day...",5/18/2016,news
3,false,bloggers,"""BUSTED: CDC Inflated COVID Numbers, Accused o...",2/1/2021,blog
4,half-true,bobby jindal,"""I'm the only (Republican) candidate that has ...",8/30/2015,television


As shown above, we're still missing several key elements from the first dataframe, such as political party and statement history. We can infer these from the speaker names, and we will determine the subject of each statement in the next step.
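One way to carry party and statement history over by speaker name is a left merge against a one-row-per-speaker lookup. The following is a minimal sketch on toy data, not necessarily the approach the project ends up using; `first` and `second` are hypothetical stand-ins for `statements` and `statements2`, and `speaker_info` is an illustrative helper name.

```python
import pandas as pd

# Toy stand-ins for `statements` and `statements2` (assumed, illustrative values).
first = pd.DataFrame({
    "speaker": ["barack obama", "barack obama", "charlie crist"],
    "party": ["democrat", "democrat", "democrat"],
    "false_counts": [71, 71, 9],
})
second = pd.DataFrame({
    "speaker": ["barack obama", "matt gaetz"],
    "statement": ["statement A", "statement B"],
})

# One row per speaker: keep the first party and history seen for each name.
speaker_info = first.drop_duplicates(subset="speaker")[["speaker", "party", "false_counts"]]

# A left merge keeps every row of `second`; speakers absent from `first` get NaN.
enriched = second.merge(speaker_info, on="speaker", how="left")
print(enriched)
```

Speakers that appear only in the second dataset end up with missing party and counts, so those rows would still need another source or a default value.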

# Exploratory Data Analysis

In this stage of the data science life cycle, we graph the data to gain a better understanding of our datasets.

In [63]:
display(statements.describe())




Unnamed: 0,barely_true_counts,false_counts,half_true_counts,mostly_true_counts,pants_fire_counts
count,10240.0,10240.0,10240.0,10240.0,10240.0
mean,11.530957,13.283887,17.130371,16.431055,6.200195
std,18.972596,24.111296,35.84381,36.148887,16.127585
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0
50%,2.0,2.0,3.0,3.0,1.0
75%,12.0,12.0,13.0,11.0,5.0
max,70.0,114.0,160.0,163.0,105.0




In [65]:
statements['party'] = statements['party'].replace({'none': 'independent', 'Moderate': 'independent'})
statements['party'].value_counts()






party
republican                      4497
democrat                        3336
independent                     1892
organization                     219
newsmaker                         56
libertarian                       40
activist                          39
journalist                        38
columnist                         35
talk-show-host                    26
state-official                    20
labor-leader                      11
tea-party-member                  10
business-leader                    9
green                              3
education-official                 2
liberal-party-canada               1
government-body                    1
democratic-farmer-labor            1
ocean-state-tea-party-action       1
constitution-party                 1
Name: count, dtype: int64
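Many of the party labels above cover only a handful of statements. The chi-squared approximation is generally less reliable when expected cell counts are small, so one option is to fold rare labels into a single bucket before testing. The sketch below uses toy data; the cutoff `MIN_COUNT` and the label `"other"` are assumptions chosen for illustration, not values from the original analysis.

```python
import pandas as pd

# Toy party column (assumed counts, not the real distribution).
party = pd.Series(["republican"] * 6 + ["democrat"] * 5 + ["green"] * 2 + ["activist"])

MIN_COUNT = 5  # hypothetical cutoff, chosen only for this example

# Find the labels that occur fewer than MIN_COUNT times.
counts = party.value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep common labels as-is and replace every rare label with "other".
party_grouped = party.where(~party.isin(rare), "other")
print(party_grouped.value_counts())
```

Applied to `statements['party']`, this would leave the large groups intact and merge the long tail into one category.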

In [66]:
import seaborn as sns
from scipy.stats import chi2_contingency

# Number of statements per reliability rating
sns.countplot(data=statements, x='reliability')
plt.title('Distribution of Reliability Ratings')
plt.show()

# Speaker's historical false counts, grouped by the rating of the current statement
sns.boxplot(data=statements, x='reliability', y='false_counts')
plt.title('Distribution of False Counts by Reliability')
plt.show()

display(statements.head())
# Rating labels are hyphenated in the data (e.g. 'half-true', 'pants-fire')
true = ['half-true', 'mostly-true', 'true']
false = ['false', 'pants-fire', 'barely-true']

# Flag only statements rated exactly 'false' as fake
statements['Fake'] = statements['reliability'].apply(lambda x: 1 if x == 'false' else 0)

# Chi-squared test of independence between party and the Fake flag
contingency_table = pd.crosstab(statements['party'], statements['Fake'])
chi2_party, p_value_party, dof_party, expected_party = chi2_contingency(contingency_table)
print(f"Chi-squared statistic (Party): {chi2_party}")
print(f"P-value (Party): {p_value_party}")
print(f"Degrees of Freedom (Party): {dof_party}")
print(f"Expected Frequencies (Party):\n{expected_party}")

# Heatmap of the observed counts for each party and Fake value
plt.figure(figsize=(10, 7))
heatmap = sns.heatmap(contingency_table, annot=True, fmt="d", cmap="Blues", cbar=True, linewidths=0.5)
plt.title('Heatmap of Party vs Fake Flag')
plt.xlabel('Fake (1 = rated false)')
plt.ylabel('Party')
plt.show()