# Fake News Detection with Machine Learning
## Overview

### What You'll Learn
In this section, you'll learn:
1. How to use various scikit-learn machine learning algorithms
2. How to select features for a real-world machine learning problem
3. How to design a neural network that makes predictions based on our selected features

### Prerequisites
Before starting this section, you should have an understanding of:
1. [scikit-learn and TensorFlow](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/)
2. [Basic Python (functions, loops, lists)](https://github.com/HackBinghamton/PythonWorkshop)
3. [Numpy and Pandas](https://github.com/HackBinghamton/DataScienceWorkshop)

### Introduction
We've all heard about fake news over the past few years. This workshop will guide you through designing a relatively primitive fake news detector based on a modified version of the [FakeNewsNet dataset](https://github.com/KaiDMML/FakeNewsNet).

### Setup
#### Package Installations

In [1]:
!pip3 install tensorflow
!pip3 install scikit-learn
!pip3 install python-whois
!pip3 install pandas
!pip3 install textstat
!pip3 install -U textblob
!pip3 install requests





In [2]:
import pandas as pd
import datetime
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
import tensorflow as tf
import textblob
from textstat.textstat import textstat
import requests

## Step 1: Gathering data and selecting features
### Selecting a dataset
For the purpose of this workshop, we'll be using a modified version of the FakeNewsNet dataset. The data provided for you has been cleaned and a few features have been added. Namely, the dataset did not originally include information from ICANN WHOIS or article text.

When starting a machine learning project, it is very important to select a good dataset. Your dataset should contain diverse information, be well-formed (no missing data), and be free of incorrect data. It should also have many data points: the 351 articles used for this exercise (280 for training, 71 for testing) are not a sufficiently sized dataset.
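
The qualities listed above can be checked before any training happens. The sketch below uses a tiny, made-up list of records (not the workshop data) and counts missing or empty fields and the number of rows per label. The helper `summarize_records` is hypothetical, and the field names simply mirror the kind of columns a news dataset might have.

```python
from collections import Counter

# Hypothetical records for illustration only; not taken from the workshop dataset.
records = [
    {"title": "Example A", "country": "US", "article_text": "Some text.", "is_fake": 0},
    {"title": "Example B", "country": None, "article_text": "More text.", "is_fake": 1},
    {"title": "Example C", "country": "MK", "article_text": "", "is_fake": 1},
]


def summarize_records(records, label_key="is_fake"):
    """Count missing or empty values per field and the number of rows per label."""
    missing = Counter()
    for record in records:
        for key, value in record.items():
            if value is None or value == "":
                missing[key] += 1
    labels = Counter(record[label_key] for record in records)
    return dict(missing), dict(labels)


missing_counts, label_counts = summarize_records(records)
print("Missing values per field:", missing_counts)
print("Rows per label:", label_counts)
```

A heavily skewed label count or many missing fields is a hint that accuracy numbers from such a dataset should be read with caution.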

### Loading the data
Methods that load training and testing data have been provided for you:

In [3]:
import io


def load_fake_news_data(file_name):
    url = "https://raw.githubusercontent.com/HackBinghamton/MachineLearningWorkshopWeek2/master/fake_news_detection/" + file_name
    json_data = requests.get(url).text
    # Wrap the text in StringIO: newer pandas versions deprecate passing a raw JSON string
    fake_news_data = pd.read_json(io.StringIO(json_data))

    fake_news_features = fake_news_data.drop(columns=["is_fake"])
    fake_news_labels = fake_news_data["is_fake"]

    return fake_news_features, fake_news_labels


def load_fake_news_training_data():
    return load_fake_news_data("fakenewsnet_modified_training_set.json")


def load_fake_news_testing_data():
    return load_fake_news_data("fakenewsnet_modified_testing_set.json")

Let's take a look at what we're working with.

In [4]:
print(load_fake_news_training_data()[0].shape)
print(load_fake_news_training_data()[0].columns)

print(load_fake_news_testing_data()[0].shape)
print(load_fake_news_testing_data()[0].columns)

(280, 6)
Index(['id', 'news_url', 'title', 'creation_date', 'country', 'article_text'], dtype='object')
(71, 6)
Index(['id', 'news_url', 'title', 'creation_date', 'country', 'article_text'], dtype='object')


### Adding new features
Although there's a good amount of information in this dataset, not all of it is terribly useful (yet). Let's make some functions to create new features from the data we have.

### New Feature: ICANN WHOIS registered country
A lot of fake news comes from Macedonia, Panama, or from websites whose owners hide behind domain privacy services. We can add a new column that contains 1 if a given news article's host website was registered in Macedonia (`MK`) or Panama (`PA`), or if the location is hidden by a privacy service, and 0 otherwise.

In [5]:
def is_suspicious_country(country):
    sus_countries = ["MK", "PA"]

    return int(country in sus_countries or "REDACTED" in country)


def add_suspicious_country_column(fake_news_df):
    fake_news_df["is_suspicious_country"] = fake_news_df["country"].apply(lambda x: is_suspicious_country(x))

    return fake_news_df
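
To see how the rule behaves on individual values, here is a small standalone check. The sample strings are made up, and `is_suspicious_country_safe` is a hypothetical variant that returns 0 for a missing value instead of raising a `TypeError`. Whether a missing country should count as suspicious is a judgment call this sketch does not settle.

```python
def is_suspicious_country(country):
    # Same rule as the workshop function: exact match on a country code,
    # or a WHOIS value that contains "REDACTED".
    sus_countries = ["MK", "PA"]
    return int(country in sus_countries or "REDACTED" in country)


def is_suspicious_country_safe(country):
    # Hypothetical variant: treat a missing (non-string) value as 0
    # rather than letting the substring check raise TypeError.
    if not isinstance(country, str):
        return 0
    return is_suspicious_country(country)


samples = ["MK", "US", "REDACTED FOR PRIVACY", None]
flags = [is_suspicious_country_safe(c) for c in samples]
print(flags)
```

Note that the country code check is an exact match, while the privacy check is a substring match, so any WHOIS value containing "REDACTED" is flagged.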

### New Features: Text complexity
Professional journalists are generally much better writers than those who create fake news. If an article is easy to read, it may have been written by a professional journalist rather than a propagandist.

Let's start by writing a function that measures an article's Flesch reading ease score, where higher scores mean easier text.

In [6]:
def add_flesch_reading_ease_column(fake_news_df):
    fake_news_df["flesch_reading_ease"] = fake_news_df["article_text"].apply(
        lambda x: (textstat.flesch_reading_ease(x))
    )

    return fake_news_df
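
For intuition about what the score measures, the Flesch reading ease formula is 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). The sketch below implements it with a deliberately naive syllable counter (groups of consecutive vowels), so its numbers will not match `textstat` exactly. The function names and sample sentences are made up for illustration.

```python
import re


def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels. Real syllable
    # counters are more careful, so treat the result as an approximation.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def approx_flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


simple = "The cat sat on the mat. It was a good day."
dense = (
    "Notwithstanding considerable institutional opposition, "
    "the administration promulgated comprehensive regulations."
)

print(round(approx_flesch_reading_ease(simple), 1))
print(round(approx_flesch_reading_ease(dense), 1))
```

Short sentences made of short words score high, while long sentences with many-syllable words score low and can even go negative.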

It might also be helpful to determine how many difficult words the author used. Usage of more difficult words may imply higher proficiency with the English language, which may indicate the writing was done by a professional.

In [7]:
def percent_difficult_words(article):
    if textstat.lexicon_count(article) == 0:
        return 0

    return textstat.difficult_words(article) / textstat.lexicon_count(article)


def add_percent_difficult_words_column(fake_news_df):
    fake_news_df["percent_difficult_words"] = fake_news_df["article_text"].apply(lambda x: percent_difficult_words(x))

    return fake_news_df

### New Feature: Text sentiment
Professional journalists are expected to be objective and calm in their writing. Fake news, on the other hand, is usually opinion-heavy and designed to provoke anger from its readers. Let's add two columns which calculate the article's polarity and subjectivity.

In [8]:
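def analyze_sentiment(article):
    article = textblob.TextBlob(article)

    # Polarity lies in [-1, 1], so (p + 1) / 2 rescales it to [0, 1].
    # Subjectivity already lies in [0, 1]; the same formula maps it to [0.5, 1].
    return (article.sentiment.polarity + 1) / 2, (article.sentiment.subjectivity + 1) / 2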

def add_sentiment_columns(fake_news_df):
    fake_news_df["article_polarity"], fake_news_df["article_subjectivity"] = zip(
        *fake_news_df["article_text"].map(analyze_sentiment)
    )

    return fake_news_df
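
The `(value + 1) / 2` step in `analyze_sentiment` is a linear rescaling. TextBlob reports polarity in [-1, 1] and subjectivity in [0, 1], so the same formula has a different effect on each. The sketch below uses a hypothetical `rescale` helper to make that visible.

```python
def rescale(value, old_min, old_max):
    """Linearly map value from [old_min, old_max] to [0, 1]."""
    return (value - old_min) / (old_max - old_min)


# Polarity ranges over [-1, 1], so (p + 1) / 2 is min-max scaling to [0, 1].
for polarity in (-1.0, 0.0, 1.0):
    assert rescale(polarity, -1.0, 1.0) == (polarity + 1) / 2

# Subjectivity already ranges over [0, 1]; (s + 1) / 2 squeezes it into [0.5, 1].
for subjectivity in (0.0, 0.5, 1.0):
    print(subjectivity, "->", (subjectivity + 1) / 2)
```

The compressed subjectivity range is still a valid feature, but it uses only half of the [0, 1] interval. Leaving subjectivity unscaled would be one alternative to experiment with.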

### Adding the features
Now that we have functionality to create new features, let's create a function that applies all of this functionality to our datasets.

In [9]:
def add_features(fake_news_df):
    # Comment or uncomment these features as you see fit. More features isn't always better -
    # sometimes, features you think are helpful might actually harm your accuracy!
    fake_news_df = add_suspicious_country_column(fake_news_df)
    fake_news_df = add_flesch_reading_ease_column(fake_news_df)
    fake_news_df = add_percent_difficult_words_column(fake_news_df)
    fake_news_df = add_sentiment_columns(fake_news_df)

    return fake_news_df

### Dropping unused features
In their raw form, columns such as the article ID, URL, title, country code, and article text don't make much sense to ML algorithms. Let's make a function that drops this information from our dataframe.

In [10]:
def drop_features(fake_news_df):
    # Drop features we're not using for our machine learning algorithm
    fake_news_df = fake_news_df.drop(columns=["id", "article_text", "country", "title", "news_url"])
    fake_news_df = fake_news_df.reset_index(drop=True)

    return fake_news_df

### Scaling existing features
As we learned last week, we want to make sure our features are scaled properly. Let's scale the creation date timestamp to lie between 0 and 1 by dividing it by the current timestamp.

In [11]:
def scale_creation_dates(fake_news_df):
    now_timestamp = datetime.datetime.now().timestamp()
    fake_news_df["creation_date"] = fake_news_df["creation_date"].apply(lambda x: x / now_timestamp)

    return fake_news_df


def scale_features(fake_news_df):
    fake_news_df = scale_creation_dates(fake_news_df)

    return fake_news_df
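
Dividing by the current timestamp keeps past dates between 0 and 1, but the values depend on when the code runs and tend to sit in a narrow band. The sketch below compares that approach with min-max scaling on a few made-up dates, assuming creation dates are Unix timestamps in seconds. Min-max scaling is shown only as an alternative to try and is not what the workshop code does; `min_max_scale` is a hypothetical helper.

```python
import datetime


def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using the smallest and largest value."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


utc = datetime.timezone.utc

# Hypothetical domain creation dates as Unix timestamps (seconds since 1970).
creation_dates = [
    datetime.datetime(1995, 1, 1, tzinfo=utc).timestamp(),
    datetime.datetime(2010, 6, 15, tzinfo=utc).timestamp(),
    datetime.datetime(2017, 3, 1, tzinfo=utc).timestamp(),
]

now_timestamp = datetime.datetime.now(utc).timestamp()
divided_by_now = [t / now_timestamp for t in creation_dates]
min_max = min_max_scale(creation_dates)

print([round(v, 3) for v in divided_by_now])
print([round(v, 3) for v in min_max])
```

If you try min-max scaling on the real data, compute the minimum and maximum from the training set and reuse them for the testing set so both are scaled the same way.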

### Pulling EVERYTHING together
Finally, let's write a function that does all the feature creation, deletion, and scaling for us.

In [12]:
def refine_fake_news_data(fake_news_df):
    fake_news_df = add_features(fake_news_df)
    fake_news_df = drop_features(fake_news_df)
    fake_news_df = scale_features(fake_news_df)

    return fake_news_df

## Step 2: Training our algorithms
### The scikit-learn approach
Let's begin by first testing out some `scikit-learn` algorithms and observing how they perform.

In the first few lines of the function below, we see "Logistic Regression," "Linear Discriminant Analysis," "K-Nearest Neighbors," and several other complicated-sounding terms. These are all different types of classifiers, and previous pages in the workshop have discussed several of them. For any you aren't familiar with, the key point is that they all work differently but aim at the same goal. Some may be extremely good classifiers for this dataset, while others may perform poorly, so it's good to check several and compare how accurately they perform. Each model is scored with 10-fold cross-validation on the training set.

In [13]:
def evaluate_sklearn_models(training_features, training_labels):
    models = [
        ("Logistic Regression", LogisticRegression(solver="lbfgs")),
        ("Linear Discriminant Analysis", LinearDiscriminantAnalysis()),
        ("K-Nearest Neighbors", KNeighborsClassifier()),
        ("Decision Tree", DecisionTreeClassifier()),
        ("Gaussian Naive Bayes", GaussianNB()),
        ("Support Vector Machine", SVC(gamma="scale")),
        ("Bagging Classifier", BaggingClassifier()),
        ("Random Forest Classifier", RandomForestClassifier(n_estimators=100))
    ]

    for name, model in models:
        kfold = model_selection.KFold(n_splits=10)

        cv_results = model_selection.cross_val_score(
            model, training_features, training_labels, cv=kfold, scoring="accuracy"
        )

        msg = "%s: \n\tAverage accuracy: %f \n\tStandard deviation: %f" % (
            name, cv_results.mean() * 100, cv_results.std() * 100
        )

        print(msg)
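
To make the cross-validation step concrete, here is a plain-Python sketch of how 10 contiguous, unshuffled folds divide the 280 training rows. It is meant to mirror the behavior of `KFold(n_splits=10)` without shuffling, but `kfold_indices` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread any remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(kfold_indices(280, 10))
for train, validation in folds[:2]:
    print(len(train), len(validation), validation[0], validation[-1])
```

Each model is trained on 252 rows and scored on the remaining 28, ten times over, and the mean and standard deviation of those ten scores are what the function prints. Because the folds are not shuffled, a dataset sorted by label could produce lopsided folds; scikit-learn's `KFold(shuffle=True)` or `StratifiedKFold` are options to consider in that case.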

### Designing and training a neural network
Let's now try designing a neural network.

In the first few lines of `create_neural_network`, we create a dense ReLU layer, a dropout layer, and a dense softmax layer. ReLU and softmax are both __activation functions__ that transform the output of each node in the given layer. A dropout layer tells the neural network to randomly ignore a fraction of the nodes (20% here) on each training step. Without this, the "neurons" can develop a sort of interdependency on each other, which can lead to overfitting. Because the network is forced to drop some of the nodes each time it trains, it has to learn features using only the nodes that remain.

In [14]:
def create_neural_network():
    # This is the same design as last week's neural network, with the exception that:
    #     1. There is no input to flatten
    #     2. The dense softmax layer has been reduced from 10 units to 2 units, since our labels 
    #        can either be true or false (2 options) as opposed to a digit between 0 and 9 (10 options)
    dense_relu_layer = tf.keras.layers.Dense(1024, activation="relu")
    dropout_layer = tf.keras.layers.Dropout(0.2)
    dense_softmax_layer = tf.keras.layers.Dense(2, activation="softmax")

    neural_network_model = tf.keras.models.Sequential([
        dense_relu_layer,
        dropout_layer,
        dense_softmax_layer
    ])

    neural_network_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    return neural_network_model


def train_neural_network(neural_network_model, training_features, training_labels):
    neural_network_model.fit(training_features.values, training_labels.values, epochs=400)

    return neural_network_model


def evaluate_neural_network(neural_network_model, testing_features, testing_labels):
    test_loss, test_acc = neural_network_model.evaluate(testing_features.values, testing_labels.values)

    return test_acc
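
The three building blocks used in the network can be sketched in a few lines of plain Python. This is an illustration of the ideas rather than how Keras implements them. The dropout shown is the "inverted" form, which rescales the surviving values during training; the input numbers are made up.

```python
import math
import random


def relu(values):
    # ReLU keeps positive values and replaces negative ones with zero.
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(values, rate, rng):
    # Inverted dropout: zero each value with probability `rate` and scale the
    # survivors by 1 / (1 - rate) so the expected sum is unchanged.
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)
hidden = relu([-1.5, 0.0, 2.0, 3.5])
print(hidden)
print(dropout(hidden, rate=0.2, rng=rng))
print(softmax([1.0, 3.0]))
```

The softmax output always sums to 1, which is why the final layer's two outputs can be read as the probabilities of the two labels. Dropout is only applied while training; at evaluation time all nodes are used.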

## Step 3: Evaluating our algorithms
Now that we've designed our approach to the problem, let's execute!

In [15]:
def main():
    fake_news_training_features, fake_news_training_labels = load_fake_news_training_data()
    fake_news_testing_features, fake_news_testing_labels = load_fake_news_testing_data()

    fake_news_training_features = refine_fake_news_data(fake_news_training_features)
    fake_news_testing_features = refine_fake_news_data(fake_news_testing_features)

    evaluate_sklearn_models(fake_news_training_features, fake_news_training_labels)

    neural_network_model = create_neural_network()
    neural_network_model = train_neural_network(
        neural_network_model, fake_news_training_features, fake_news_training_labels
    )

    print(evaluate_neural_network(neural_network_model, fake_news_testing_features, fake_news_testing_labels))


main()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression: 
	Average accuracy: 79.285714 
	Standard deviation: 9.819805
Linear Discriminant Analysis: 
	Average accuracy: 81.428571 
	Standard deviation: 9.556492
K-Nearest Neighbors: 
	Average accuracy: 70.714286 
	Standard deviation: 9.006800
Decision Tree: 
	Average accuracy: 85.714286 
	Standard deviation: 8.601139
Gaussian Naive Bayes: 
	Average accuracy: 80.000000 
	Standard deviation: 8.630747
Support Vector Machine: 
	Average accuracy: 58.214286 
	Standard deviation: 8.459085
Bagging Classifier: 
	Average accuracy: 85.357143 
	Standard deviation: 8.214286
Random Forest Classifier: 
	Average accuracy: 85.000000 
	Standard deviation: 8.571429
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Epoch 1/400
...
Epoch 400/400
0.7183099
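
With only 71 test articles, a single accuracy number carries a lot of uncertainty. The back-of-the-envelope sketch below uses the normal approximation for a proportion to estimate a 95% interval around the printed test accuracy. It is a rough check under that approximation, not part of the workshop code.

```python
import math

test_accuracy = 0.7183099  # value printed above
n_test = 71                # rows in the testing set

# Number of correctly classified articles implied by the accuracy.
correct = round(test_accuracy * n_test)

# Standard error of a proportion, and a normal-approximation 95% margin.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)
margin = 1.96 * standard_error

print(correct, "of", n_test, "test articles classified correctly")
print("approximate 95%% interval: %.1f%% to %.1f%%" % (
    100 * (test_accuracy - margin), 100 * (test_accuracy + margin)
))
```

The margin comes out at roughly ten percentage points in either direction, so small differences between models on this test set should not be over-interpreted.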


There are many ways we can do better. Try the following on your own:
1. Play around with TensorFlow parameters or add different layers
2. Add new features
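
As a starting point for the second suggestion, here is a sketch of one possible title-based feature: the fraction of words written in all capitals, plus the number of exclamation marks. The function `title_shout_features` and the sample titles are hypothetical, and there is no guarantee this feature improves accuracy on this dataset.

```python
def title_shout_features(title):
    """Return (fraction of all-caps words, number of exclamation marks) for a title."""
    words = [w for w in title.split() if any(ch.isalpha() for ch in w)]
    if not words:
        return 0.0, title.count("!")
    # Count words with two or more letters that are written entirely in capitals.
    shouted = [
        w for w in words
        if w.isupper() and sum(ch.isalpha() for ch in w) >= 2
    ]
    return len(shouted) / len(words), title.count("!")


print(title_shout_features("SHOCKING: You Won't BELIEVE What Happened!"))
print(title_shout_features("City council approves new budget"))
```

A function like this could be applied to the `title` column in the same way the other feature functions use `.apply`, as long as it runs before the title is dropped.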

Possibly helpful further reading:
1. [Types of Keras layers](https://keras.io/layers/core/)
2. [Types of Keras Activations](https://keras.io/activations/)
3. [Fake News Detector from HackBU 2018](https://github.com/cfiutak1/HackBU2018-Fake-News-Detector/)