# Fake News Detection

## Inspiration
With every news outlet competing to attract more readers, it is becoming increasingly hard to tell which articles are real and which are chasing clicks with false headlines and false information. I do not want to be misled into believing something that is not true, so I wanted to create an app that tells me whether a given piece of news is true or not.

We see the consequences of fake news in our culture every day. A good example is fake news about celebrities, which can ruin their image.

In [22]:
# Import the libraries needed to work with the data and the NLP models

import tensorflow
import torch
from transformers import pipeline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [9]:
# read in the data
all_data_news = pd.read_csv('WELFake_Dataset.csv', index_col=0)
# label: 0 = fake, 1 = real
all_data_news.head(2)

| | title | text | label |
|---|---|---|---|
| 0 | LAW ENFORCEMENT ON HIGH ALERT Following Threat... | No comment is expected from Barack Obama Membe... | 1 |
| 1 | NaN | Did they post their votes for Hillary already? | 1 |


## Plan
We will use Hugging Face, an AI company that offers pretrained models (including NLP models), to detect fake news.

**Why not train our own model?** Training an NLP model from scratch requires a large amount of data, often takes days, and takes up a lot of storage. It is also arguably wasteful to train your own model when suitable pretrained models already exist.

#### Split Data

In [34]:
X = all_data_news[['title', 'text']]
y = all_data_news[['label']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
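
The split above draws rows purely at random. As a possible variant, `train_test_split` accepts a `stratify` argument that keeps the fake/real ratio the same in the train and test sets. The sketch below uses a made-up toy frame (`toy` and its rows are illustrative, not taken from WELFake) so that it runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset (hypothetical rows, not from WELFake)
toy = pd.DataFrame({
    'title': [f'title {i}' for i in range(10)],
    'text': [f'text {i}' for i in range(10)],
    'label': [0, 1] * 5,  # 0 = fake, 1 = real, five of each
})

X_toy = toy[['title', 'text']]
y_toy = toy['label']

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print('train label counts:', y_tr.value_counts().to_dict())
print('test label counts:', y_te.value_counts().to_dict())
```

With a balanced toy frame, each split ends up with the same number of fake and real rows. On the real data the same argument would be `stratify=y`.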


We'll first use a fake news model hosted on Hugging Face. Then we'll fine-tune our own fake news model.

In [105]:
model = 'jy46604790/Fake-News-Bert-Detect'
# 'sentiment-analysis' is the pipeline alias for the text-classification task
classifier = pipeline('sentiment-analysis', model=model)
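
The classifier returns generic names such as `LABEL_0` and `LABEL_1`, so it is worth checking which names the model exposes before mapping them to fake/real. In `transformers`, a pipeline's model config carries an `id2label` dictionary. The sketch below reads it through a small helper (`label_names` is a hypothetical name) and uses a stand-in object instead of the real pipeline so that it runs without downloading a model. Which label means "fake" still has to come from the model card.

```python
from types import SimpleNamespace


def label_names(clf):
    """Return the id -> label-name mapping stored in the model config."""
    return dict(clf.model.config.id2label)


# Stand-in with the same attribute layout as a transformers pipeline
# (hypothetical values; the real mapping comes from the downloaded model)
stub_classifier = SimpleNamespace(
    model=SimpleNamespace(
        config=SimpleNamespace(id2label={0: 'LABEL_0', 1: 'LABEL_1'})
    )
)

names = label_names(stub_classifier)
print(names)
```

With the real pipeline, the call would be `label_names(classifier)`.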

In [102]:
news_titles = X_test['title']
news_text = X_test['text']
print('Number of NaN titles:', news_titles.isna().sum())
print('Number of NaN texts:', news_text.isna().sum())
print('Rows w/ NaN title or NaN text:', X_test.isna().any(axis=1).sum())
print('Rows w/ NaN title and NaN text:', X_test.isna().all(axis=1).sum())

Number of NaN titles: 0
Number of NaN texts: 7
Rows w/ NaN title or NaN text: 7
Rows w/ NaN title and NaN text: 0


We see from above that no row is missing both its title and its text, so any row with a NaN title still has a text value that is NOT NaN. We can therefore replace every NaN title with its corresponding text value when checking for fake news.

In [104]:
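# Work on an explicit copy and assign the result back: fillna(inplace=True)
# on a column selection can modify a temporary copy and leave X_test unchanged
X_test = X_test.copy()
X_test['title'] = X_test['title'].fillna(X_test['text'])
filled_news_titles = X_test['title']
print('New Number of NaN titles:', filled_news_titles.isna().sum())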

New Number of NaN titles: 0
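
As a small illustration of the fallback used above, `fillna` with another Series fills each missing value from the row with the same index and leaves existing values alone. The toy frame below is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): one missing title, one missing text
toy = pd.DataFrame({
    'title': ['Headline A', np.nan, 'Headline C'],
    'text': ['Body A', 'Body B', np.nan],
})

# Missing titles are replaced by the text from the same row;
# titles that are present are kept as they are
filled = toy['title'].fillna(toy['text'])
print(filled.tolist())  # ['Headline A', 'Body B', 'Headline C']
```

A row missing both fields would stay NaN after this step, which is why the "title and text" count above matters.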


In [107]:
filled_news_titles = list(filled_news_titles)

In [115]:
# Truncate each input to at most 512 tokens before it reaches the model
tokenizer_kwargs = {'truncation': True, 'max_length': 512}
class_dict = classifier(filled_news_titles, **tokenizer_kwargs)

In [109]:
class_dict

[{'label': 'LABEL_0', 'score': 0.9990565180778503},
 {'label': 'LABEL_0', 'score': 0.9979349374771118}]

In [112]:
# Convert the pipeline's string labels to integers:
# 'LABEL_0' becomes 0, anything else becomes 1
temp = [0 if val['label'] == 'LABEL_0' else 1 for val in class_dict]
print(temp)

[0, 0]
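
Once the predictions are integers, they can be compared with the true labels. The sketch below is one way to do that with scikit-learn, assuming `LABEL_0` corresponds to the dataset's 0 (fake) and `LABEL_1` to 1 (real). That correspondence should be checked against the model card. The outputs and labels here are made-up toy values, not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy pipeline output and ground truth (hypothetical values)
toy_output = [
    {'label': 'LABEL_0', 'score': 0.99},
    {'label': 'LABEL_1', 'score': 0.98},
    {'label': 'LABEL_0', 'score': 0.84},
    {'label': 'LABEL_1', 'score': 0.97},
]
toy_truth = [0, 1, 1, 1]

# 'LABEL_0' -> 0, 'LABEL_1' -> 1 (assumed to match 0 = fake, 1 = real)
preds = [int(item['label'].split('_')[-1]) for item in toy_output]

acc = accuracy_score(toy_truth, preds)
# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(toy_truth, preds, labels=[0, 1])

print('accuracy:', acc)  # accuracy: 0.75
print(cm)
```

On the real data, `toy_output` would be the classifier output and `toy_truth` the `label` column of `y_test`.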


In [116]:
class_dict

[{'label': 'LABEL_0', 'score': 0.9990565180778503},
 {'label': 'LABEL_0', 'score': 0.9979349374771118},
 {'label': 'LABEL_0', 'score': 0.9901407361030579},
 {'label': 'LABEL_0', 'score': 0.998322069644928},
 {'label': 'LABEL_0', 'score': 0.9987269043922424},
 {'label': 'LABEL_0', 'score': 0.9985826015472412},
 {'label': 'LABEL_0', 'score': 0.9979164004325867},
 {'label': 'LABEL_0', 'score': 0.9986995458602905},
 {'label': 'LABEL_0', 'score': 0.8433653116226196},
 {'label': 'LABEL_1', 'score': 0.9998961687088013},
 {'label': 'LABEL_1', 'score': 0.9998873472213745},
 {'label': 'LABEL_0', 'score': 0.99727863073349},
 {'label': 'LABEL_0', 'score': 0.9986685514450073},
 {'label': 'LABEL_0', 'score': 0.9986502528190613},
 {'label': 'LABEL_1', 'score': 0.9993463158607483},
 {'label': 'LABEL_0', 'score': 0.9985575079917908},
 {'label': 'LABEL_1', 'score': 0.9995670914649963},
 {'label': 'LABEL_0', 'score': 0.9985716342926025},
 {'label': 'LABEL_0', 'score': 0.9972461462020874},
 ...

In [119]:
scores = pd.DataFrame({'real':class_dict})

In [122]:
scores

| | real |
|---|---|
| 0 | {'label': 'LABEL_0', 'score': 0.9990565180778503} |
| 1 | {'label': 'LABEL_0', 'score': 0.9979349374771118} |
| 2 | {'label': 'LABEL_0', 'score': 0.9901407361030579} |
| 3 | {'label': 'LABEL_0', 'score': 0.998322069644928} |
| 4 | {'label': 'LABEL_0', 'score': 0.9987269043922424} |
| ... | ... |
| 14422 | {'label': 'LABEL_0', 'score': 0.9987441301345825} |
| 14423 | {'label': 'LABEL_1', 'score': 0.9988206028938293} |
| 14424 | {'label': 'LABEL_1', 'score': 0.9998724460601807} |
| 14425 | {'label': 'LABEL_0', 'score': 0.9985507130622864} |


In [121]:
scores.to_csv('scores.csv')
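
The frame saved above stores one dictionary per cell, which is written to the CSV as a string and is awkward to read back. A possible alternative is to pass the list of dictionaries straight to `pd.DataFrame`, which gives separate `label` and `score` columns. The sketch below uses made-up toy values, and the `prediction` column name and the label mapping are assumptions.

```python
import pandas as pd

# Toy pipeline output (hypothetical values)
toy_output = [
    {'label': 'LABEL_0', 'score': 0.9991},
    {'label': 'LABEL_1', 'score': 0.9999},
]

# A list of dicts becomes one column per key
flat = pd.DataFrame(toy_output)

# Assumed mapping: LABEL_0 -> 0 (fake), LABEL_1 -> 1 (real)
flat['prediction'] = flat['label'].map({'LABEL_0': 0, 'LABEL_1': 1})

print(flat)
```

Saving `flat` with `to_csv` then yields plain columns that `pd.read_csv` can load without parsing dictionary strings.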