<a target="_blank" href="https://colab.research.google.com/github/gerakys/PyhtonProject_DMTA/blob/main/Neural_Net_FND.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Fake News Detection using Neural Networks
---
In this notebook, our mission is to construct a robust neural network for discerning the authenticity of news articles. The approach will be as follows:

* Data Exploration and Cleaning
* Data Selection and Encoding
* Text Preprocessing
* Neural Network Architecture
* Model Training and Evalutation
* Real-time Prediction

An example of a real and fake news are shown below.

<img src='notebook_ims/real_fake_example.png' width=50% height=80%/>

Our dataset comprises a collection of news articles labeled as either real or fake. We meticulously explore and preprocess this data to ensure our neural network receives high-quality inputs for training and evaluation.

Let's Begin!

In [48]:
# We import the libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import random as python_random
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import Precision, Recall 


In [49]:
# Set random seeds for reproducibility
np.random.seed(123)
python_random.seed(123)
tf.random.set_seed(1234)

## Load in and Visualize the Data

In [50]:
# Load the dataset
data = pd.read_csv("data/fake_or_real_news.csv")
data.head(7)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE


In [51]:
# We check for any missing values
data.isnull().sum() # there are not
# We see if our dataset is balanced
data['label'].value_counts() # it is!

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

## Data Preprocessing

The initial step in constructing our neural network is preparing the data for effective input. As we intend to incorporate a word-embedding layer, the following steps are undertaken:

1. **Encoding Words and Labels:**
   - Encode each word in the news articles as an integer.
   - Encode news labels, representing real as 1 and false as 0.

2. **Tokenization:**
   - Utilize the `Tokenizer` class to create a vocabulary of unique words.
   - Fit the tokenizer on the dataset to associate each word with a unique integer.

3. **Text to Sequences:**
   - Transform the textual data into sequences of integers using the fitted tokenizer.
   - This step helps in numerical representation of the news headlines.

4. **Padding Sequences:**
   - Apply padding to the sequences to ensure uniform length across all data points.
   - Pad sequences with zeros to match the desired length, enhancing model compatibility.

In [52]:
# Extract features and labels
x = np.array(data["title"])
y = np.array(data["label"])

In [53]:
# Convert labels to numerical format
le = LabelEncoder()
y = le.fit_transform(y)

In [54]:
# Tokenize the text data
max_words = 5000
tokenizer = Tokenizer(num_words=max_words, split=' ')
tokenizer.fit_on_texts(x)
x = tokenizer.texts_to_sequences(x)
x = pad_sequences(x)

In [55]:
# we take a look at our vocabulary
indice_parole = tokenizer.word_index
for word, index in list(indice_parole.items())[:20]:
    print(f"{word}: {index}")

the: 1
to: 2
in: 3
of: 4
trump: 5
for: 6
on: 7
a: 8
and: 9
is: 10
clinton: 11
hillary: 12
with: 13
obama: 14
new: 15
by: 16
as: 17
donald: 18
from: 19
at: 20


## Let's Build our Model


First step will be splitting the data into training and testing sets.
This is a crucial step in preparing our dataset for training and evaluating the model.
Using the train_test_split function from sklearn, we divide our data into training and testing sets.
We allocate 80% of the data for training and 20% for testing.


In [56]:
# Split the data into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

The layers are as follows:
* An embedding layer that converts our word tokens (integers) into embeddings of a specific size.
* A dropout layer to prevent overfitting by deactivating 20% of the previous embeddings.
* An LSTM layer defined by a hidden_state size and number of layers
* A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
* A sigmoid activation layer which turns all outputs into a value 0-1; return only the last sigmoid output as the output of this network.
* Compile the model with binary crossentropy loss, adam optimizer, and accuracy metric


In [57]:
# Build the neural network model
model = Sequential()
model.add(Embedding(max_words, 128, input_length=x.shape[1])) 
model.add(SpatialDropout1D(0.2)) 
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', Precision(), Recall()])

In [58]:
# Implement Early Stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='min')

In [59]:
# Train the model
batch_size = 32
epochs = 10
model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=[early_stopping])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 4: early stopping


<keras.src.callbacks.History at 0x16311d8d0>

# Model Evaluation on Test Set

Here, we assess the performance of our trained model on the test set. The evaluation includes metrics such as loss, accuracy, precision and recall.

In [60]:
# Evaluate the model on the test set
loss, accuracy, precision, recall = model.evaluate(xtest, ytest)
print(f"Model Accuracy: {accuracy:.2f}")
print(f"Model Loss: {loss}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

Model Accuracy: 0.78
Model Loss: 0.641485869884491
Precision: 0.76
Recall: 0.82


The output shows the accuracy, loss, precision and recall achieved by the model on the test set.

# Real-time Prediction
We also demonstrate the real-time prediction capability of the model using a sample news headline entered by the user.

In [61]:
# Make predictions on a sample news headline
news_headline = input("Type News title here: ")
headline_seq = tokenizer.texts_to_sequences(news_headline)
headline_padded = pad_sequences(headline_seq, maxlen=x.shape[1])
result = model.predict(headline_padded)[0][0]
predicted_label = "Real" if result < 0.5 else "Fake"
print(f"Predicted Label: {predicted_label} (Probability: {result:.2f})")

Type News title here:  COVID-19 vaccinations contain microchips for global tracking. 


Predicted Label: Fake (Probability: 0.55)


In this example, the user inputs a news headline, and the model predicts whether it is real or fake, along with the associated probability.