## FAKE NEWS DETECTOR - LSTM

### Imports

In [7]:
! pip3 install tensorflow
! pip3 install keras
! pip3 install nltk
! pip3 install gensim
! pip3 install scikit-learn

#### Others
import itertools
import pandas as pd
import numpy as np
import tensorflow
import re
import nltk
import warnings
warnings.filterwarnings("ignore")

#### Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB,GaussianNB
from sklearn.linear_model import PassiveAggressiveClassifier,LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

#### NLTK
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

#### Tensorflow
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Constants

In [8]:
DATA_BASE_PATH = "./"
TRAIN_RATIO = 0.70
TEST_RATIO = 0.30
MAX_FEATURES_VECTORIZER= 500
TRAINING_EPOCHS = 20 # Number of training epochs
BATCH_SIZE = 64

VOCABULARY_SIZE = 5000 # the size of the vocabulary, indicating the maximum number of unique words that will be considered during the text embedding process.
SENTENCE_LENGTH = 20 # the desired length of each sentence or text sequence after preprocessing. It indicates the number of words that will be included in each sequence.
EMBEDDING_VECTOR_FEATURES = 40 # the number of dimensions in which each word will be represented in the embedding space. It determines the size of the word vectors generated, where words are mapped to continuous vector representations for machine learning tasks.

### Load data

In [9]:
# Load Dataset
train_data = pd.read_csv(DATA_BASE_PATH + 'train.csv')
test_data = pd.read_csv(DATA_BASE_PATH + 'test.csv')

### Preprocessing

This section prepares the training data.

First, we remove the rows with missing values (NaN) from the train_data dataset.

Dropping rows leaves gaps in the index, so we reset it to obtain a contiguous, sequential index starting at 0. Because reset_index() is called without drop=True, the previous index values are kept in a new column named index.

In [10]:
# Remove NaN
train_data = train_data.dropna()
train_data.reset_index(inplace = True)
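As a minimal sketch of what these two calls do, the toy DataFrame below (hypothetical data, not the real train.csv) shows the gap that dropna() leaves in the index and the extra column that reset_index() adds unless drop=True is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({"title": ["a", np.nan, "c"], "label": [0, 1, 1]})

cleaned = toy.dropna()                    # row 1 is removed, the index is now [0, 2]
kept = cleaned.reset_index()              # old index is kept in a new "index" column
dropped = cleaned.reset_index(drop=True)  # old index is discarded

print(list(cleaned.index))
print(list(kept.columns))
print(list(dropped.columns))
```

If the old index is not needed later, drop=True avoids carrying the extra column through the rest of the pipeline.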

Next, we separate the input features from the target labels: x_train keeps every column except 'label', and y_train holds the 'label' column, that is, the target for each row of x_train.

In [11]:
# Get target column (Y) and input features (X)
x_train = train_data.drop('label',axis =1)
y_train = train_data['label']

The following code initializes a PorterStemmer from the Natural Language Toolkit (NLTK), which reduces words to their root form, and creates the empty lists corpus and words. It then iterates over each title in the DataFrame, replaces non-alphanumeric characters with spaces, converts the text to lowercase, and splits it into words. Stopwords are removed and the remaining words are stemmed. The preprocessed sentence is appended to corpus, and the list of its stemmed words is appended to words. This prepares the text data for model training.

In [12]:
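# Stemming and preprocessing
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; set lookup avoids re-reading the list for every word
corpus = []
words = []
for i in range(len(train_data)):
    # Keep letters and digits only, then lowercase and tokenize
    review = re.sub('[^a-zA-Z0-9]', ' ', train_data['title'][i])
    review = review.lower()
    review = review.split()
    # Drop stopwords, then stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    statements = ' '.join(review)
    corpus.append(statements)
    words.append(review)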
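The same cleaning steps can be sketched as a small reusable function. This is an illustration only: clean_title and TOY_STOPWORDS are hypothetical names, and the tiny hand-written stopword set is an assumption that lets the sketch run without downloading the NLTK corpus.

```python
import re
from nltk.stem.porter import PorterStemmer

# Assumption: a tiny hand-written stopword set; the notebook itself uses
# stopwords.words('english') from NLTK.
TOY_STOPWORDS = {"the", "are", "in", "a", "of"}

stemmer = PorterStemmer()

def clean_title(title, stop_words=TOY_STOPWORDS):
    """Return the stemmed, lowercased, non-stopword tokens of a title."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", title).lower()
    return [stemmer.stem(tok) for tok in text.split() if tok not in stop_words]

tokens = clean_title("The cats are running in 2016!")
print(tokens)
```

Wrapping the steps in a function makes it easier to apply exactly the same cleaning to new titles at prediction time.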

Next, a copy of the training features x_train is assigned to the variable messages, and reset_index() is applied to it. This method replaces the current index with the default integer index starting from 0.

Calling reset_index() with inplace=True modifies messages directly instead of returning a new DataFrame.

Because train_data was already re-indexed after dropping the missing rows, the row order and numbering do not change here; the call only stores the old index values in an additional column. In general, resetting the index is useful when the original index carries no meaningful information, but it is worth understanding its effect before applying it.

In [13]:
messages = x_train.copy()
messages.reset_index(inplace=True)

One-hot encoding is often used as a preprocessing step when working with natural language processing (NLP) tasks like text classification, including the detection of fake news using techniques like LSTM (Long Short-Term Memory) networks. LSTM networks are a type of recurrent neural network (RNN) that can effectively model sequences and patterns in sequential data like text.

One-hot encoding is used with LSTM for the following reasons:

- Input Representation: LSTM networks require input data to be in a numerical format. In a strict one-hot encoding, each word is represented by a vector with all zeros except for a single "1" at the index corresponding to the word's position in the vocabulary. In this notebook the words are stored more compactly as integer indices, which the embedding layer treats as implicit one-hot vectors when it looks up each word's embedding.

- Sparse Data Handling: NLP datasets typically have a large vocabulary, so explicit one-hot vectors are very sparse and memory-hungry. Storing only the integer index of each word, and capping the index range at a fixed vocabulary size, avoids materializing those vectors and keeps memory requirements low.

- Word Relationships: One-hot encoding treats each word as independent, which may not capture the semantic relationships between words. However, LSTM networks can learn contextual information from sequences of one-hot encoded vectors, allowing them to capture word relationships and dependencies within a text.

- Sequence Modeling: LSTM networks excel at modeling sequences, and one-hot encoded vectors provide a suitable input format for sequential data. LSTM cells can maintain and update internal states that help capture longer-range dependencies in text.

- Embedding Layer: In many cases, one-hot encoded vectors are further transformed using an embedding layer within the LSTM network. This layer learns dense representations (word embeddings) that capture semantic relationships between words. These learned embeddings can enhance the model's ability to understand the meaning and context of words.

In the context of detecting fake news, one-hot encoding followed by LSTM modeling allows the network to learn patterns and relationships within the textual data, enabling the model to identify relevant features and make accurate predictions about the authenticity of news articles.

Despite its name, the Keras one_hot() function does not return one-hot vectors. It takes a text and a vocabulary size n, splits the text into words, and maps each word to an integer between 1 and n - 1 using a hash function. The mapping is not guaranteed to be unique: two different words can collide on the same integer, and collisions become more likely as the number of distinct words approaches n.

The resulting onehot_repr list therefore contains, for each document in the corpus, a list of integer word indices. This integer encoding is the input expected by the embedding layer used later.
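To make the hashing idea concrete, here is a minimal sketch in plain Python. hash_encode is a hypothetical helper, not the Keras implementation: it uses MD5 as a stable hash, so it will not reproduce the indices shown in this notebook, but it follows the same principle of mapping each word to a bounded, nonzero integer.

```python
import hashlib

def hash_encode(text, n):
    """Map each word of `text` to an integer in [1, n - 1] with a stable hash.

    Hypothetical helper that mirrors the hashing idea; it is not the Keras
    implementation.
    """
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)  # 0 stays free for padding
    return indices

# The same word always receives the same index
encoded = hash_encode("trump trump clinton email", 5000)
print(encoded)

# With a tiny index range, six distinct words must share at most three indices
small = hash_encode("one two three four five six", 4)
print(small)
```

A stable hash matters when a trained model is saved and reused: Python's built-in hash() for strings changes between interpreter sessions unless PYTHONHASHSEED is fixed, so an encoding based on it cannot be reproduced later.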

In [14]:
onehot_repr = [one_hot(words, VOCABULARY_SIZE) for words in corpus]
onehot_repr

[[4993, 4287, 2167, 4360, 4316, 414, 2404, 1438, 1568, 2540],
 [3420, 4394, 1459, 499, 1390, 1778, 4835],
 [2649, 771, 3265, 1701],
 [439, 2563, 3264, 65, 2590, 625, 399],
 [780, 1390, 4310, 2542, 2998, 2831, 1390, 1966, 3657, 4081],
 [1289,
  2939,
  4175,
  1445,
  1596,
  1558,
  1044,
  3064,
  3388,
  60,
  3442,
  1276,
  3845,
  4535,
  4835],
 [2871, 928, 4002, 2148, 2876, 2527, 1326, 2430, 1854, 1611, 2827],
 [2980, 57, 2165, 3251, 2136, 3725, 1558, 4382, 1854, 1611, 2827],
 [4039, 4461, 551, 788, 1741, 803, 821, 4692, 1558, 3949],
 [1663, 4566, 3955, 3036, 3149, 1182, 117, 4358],
 [3033, 2068, 2272, 2825, 3755, 3057, 717, 2869, 1814, 287, 1410],
 [2590, 2668, 4316, 803, 1558, 2136],
 [44, 4240, 2091, 3784, 2611, 343, 3811, 1923, 4169],
 [4978, 632, 1969, 4561, 4906, 4777, 2966, 1854, 1611, 2827],
 [1023, 1471, 4033, 3754, 4250, 1854, 1611, 2827],
 [3283, 2115, 2809, 3665, 4849, 3118, 1762, 3709, 1438, 1959, 4989, 4430],
 [3432, 53, 4394],
 [2181, 4392, 326, 503, 1558, 4051, 2

In the following lines, once the words in the corpus have been encoded as integers, the variable embedded_docs is created using the pad_sequences function from Keras. This step is essential when preparing text data for training neural network models like LSTM for natural language processing tasks.

The onehot_repr list contains the integer-encoded titles, and each inner list corresponds to one sequence. The padding='post' parameter adds zeros at the end of sequences shorter than maxlen, and maxlen sets the final sequence length; sequences longer than maxlen are truncated, by default from the beginning. The resulting embedded_docs array contains the padded sequences that form the input data for the LSTM model.

This preparation process is vital to maintain consistent sequence lengths required for neural network training and allows the data to be effectively fed into the LSTM model for further analysis and prediction.

In [15]:
embedded_docs = pad_sequences(onehot_repr, padding='post' ,maxlen=SENTENCE_LENGTH)
print(embedded_docs)

[[4993 4287 2167 ...    0    0    0]
 [3420 4394 1459 ...    0    0    0]
 [2649  771 3265 ...    0    0    0]
 ...
 [ 187  901  640 ...    0    0    0]
 [4665 2136  158 ...    0    0    0]
 [4038  301 1870 ...    0    0    0]]


In the following code and output, embedded_docs[0] is the first sequence of word indices after integer encoding and padding. Each nonzero number is the index assigned to one word of the title, in order of appearance, and the trailing zeros are the padding added to reach the fixed length (maxlen). This uniform-length format is what the embedding and LSTM layers expect as input.

In [16]:
embedded_docs[0]

array([4993, 4287, 2167, 4360, 4316,  414, 2404, 1438, 1568, 2540,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)
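The padding behaviour described above can be sketched in plain NumPy. pad_post is a hypothetical helper written for illustration; it is meant to mimic pad_sequences with padding='post' and the default truncation from the beginning, not to replace it.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Zero-pad at the end; keep the last `maxlen` items of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]            # truncate from the beginning
        out[row, :len(trimmed)] = trimmed  # remaining cells stay 0
    return out

padded = pad_post([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4)
print(padded)
# [[1 2 3 0]
#  [5 6 7 8]]
```

Note that the second sequence loses its first element: with SENTENCE_LENGTH = 20, any title longer than 20 tokens would lose its opening words in the same way.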

### Neural Network Design

The design is a sequential neural network for text classification: the input sequences are transformed by an embedding layer, passed through an LSTM layer that captures sequence information, and reduced to a binary classification prediction. Dropout layers mitigate overfitting during training.

- An embedding layer transforms integer-encoded words into fixed-size vectors, followed by a dropout layer.
- An LSTM layer with 100 units captures sequential information, and another dropout layer is employed for regularization.
- A dense layer with a sigmoid activation produces the final binary classification output.
- The model is compiled with binary cross-entropy loss, the Adam optimizer, and accuracy as the evaluation metric.

The model's architecture summary is then printed, showing the layer configurations and parameter counts.

In [17]:
model=Sequential()
model.add(Embedding(VOCABULARY_SIZE,EMBEDDING_VECTOR_FEATURES,input_length=SENTENCE_LENGTH))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 dropout (Dropout)           (None, 20, 40)            0         
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
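The parameter counts in the summary can be checked by hand. The short calculation below uses the constants defined earlier in the notebook; LSTM_UNITS is a name introduced here for the value 100 passed to the LSTM layer.

```python
VOCABULARY_SIZE = 5000
EMBEDDING_VECTOR_FEATURES = 40
LSTM_UNITS = 100  # name introduced for this sketch

# Embedding: one vector of 40 weights per vocabulary index
embedding_params = VOCABULARY_SIZE * EMBEDDING_VECTOR_FEATURES

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((EMBEDDING_VECTOR_FEATURES + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Dense: one weight per LSTM unit plus one bias
dense_params = LSTM_UNITS * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 200000 56400 101 256501
```

Almost four fifths of the parameters sit in the embedding layer, so VOCABULARY_SIZE and EMBEDDING_VECTOR_FEATURES dominate the model size.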


### Training

Before training, we must prepare the input for the neural network.

X_final is the final NumPy array of padded documents produced by the integer encoding and padding steps. Each row is the sequence of word indices for one title.

Y_final contains the labels associated with the training dataset.

This processed data is then split into one subset for training the LSTM model and another for evaluating its performance on detecting fake news.

In [18]:
X_final = np.array(embedded_docs)
Y_final = np.array(y_train)

In [19]:
# Data split
X_train_embed, X_test_embed, Y_train_embed, Y_test_embed = train_test_split(X_final, Y_final, test_size=TEST_RATIO, random_state=27)

The next cell trains the sequential model with TensorFlow.

As described above, the architecture consists of an embedding layer followed by a dropout layer, a Long Short-Term Memory (LSTM) layer with 100 units, another dropout layer, and a dense layer with a sigmoid activation function.

The model is compiled with the binary cross-entropy loss function and the Adam optimizer. The training data X_train_embed and Y_train_embed are used for training, and the held-out data X_test_embed and Y_test_embed are used for validation.

The model is trained for 20 epochs, with each batch containing 64 samples.

The choice of the number of epochs and batch size in training a neural network depends on various factors and needs to be fine-tuned based on the specific problem and dataset.

In the given code, the model is trained using the fit() function with 20 epochs and a batch size of 64.

- 20 Epochs: Using 20 epochs means that the entire training set is passed through the neural network 20 times during training. More epochs can allow the model to learn more complex patterns from the data. However, increasing the number of epochs may also lead to overfitting, where the model becomes too specialized to the training data and performs poorly on new, unseen data. The choice of 20 epochs could be based on empirical observations that the model's validation performance tends to stabilize or converge within this range.

- Batch Size of 64: The batch size determines the number of training samples that are propagated through the network before updating the model's weights. Smaller batch sizes (e.g., 32, 64) lead to more frequent updates and potentially faster convergence, but they can also produce noisy gradients and make poor use of hardware with high parallelism (like GPUs). Larger batch sizes (e.g., 128, 256) give more stable gradient estimates and use parallel hardware better, but perform fewer weight updates per epoch.

In practice, the optimal values for epochs and batch size can vary depending on factors like the complexity of the dataset, the architecture of the model, available computing resources, and the presence of regularization techniques. It's common to try different values and monitor the model's performance on validation data to determine the best combination.
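The trade-off between batch size and number of weight updates can be made concrete with a little arithmetic. The sketch below assumes a hypothetical training set of 12,800 rows; the real size depends on train.csv and the split.

```python
import math

BATCH_SIZE = 64
TRAINING_EPOCHS = 20
n_train = 12_800  # hypothetical number of training rows, not the real dataset size

steps_per_epoch = math.ceil(n_train / BATCH_SIZE)  # weight updates in one epoch
total_updates = steps_per_epoch * TRAINING_EPOCHS  # weight updates over the whole run
print(steps_per_epoch, total_updates)

# Doubling the batch size halves the number of updates per epoch
for batch in (32, 64, 128, 256):
    print(batch, math.ceil(n_train / batch))
```

Under this assumption, a batch size of 64 gives 200 updates per epoch and 4,000 over the 20 epochs.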

In [20]:
# Model trained with Tensorflow
model.fit(X_train_embed, Y_train_embed, validation_data = (X_test_embed,Y_test_embed), epochs=TRAINING_EPOCHS, batch_size=BATCH_SIZE)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7ce72a1e1030>

In [21]:
# Save the model to a file
model.save('lstm_model_fake_news.h5')

### Testing

In [22]:
X_test_embed

array([[2655, 2553, 2282, ...,    0,    0,    0],
       [4039, 2754, 2384, ...,    0,    0,    0],
       [4039, 4058,  517, ...,    0,    0,    0],
       ...,
       [   5, 1459, 4021, ...,    0,    0,    0],
       [1854, 1631,  673, ...,    0,    0,    0],
       [3112, 4394, 4768, ...,    0,    0,    0]], dtype=int32)

In [24]:
predictions = (model.predict(X_test_embed) > 0.5).astype("int32")



In [25]:
predictions

array([[1],
       [1],
       [1],
       ...,
       [1],
       [1],
       [1]], dtype=int32)
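The thresholding step used above can be illustrated on a handful of made-up probabilities. The values below are hypothetical and only mimic the (n_samples, 1) shape that model.predict returns for a single sigmoid output.

```python
import numpy as np

# Hypothetical sigmoid outputs, one probability per sample
probabilities = np.array([[0.91], [0.08], [0.50], [0.73]])

# Strictly greater than 0.5: a probability of exactly 0.50 maps to class 0
labels = (probabilities > 0.5).astype("int32")
print(labels.ravel())
# [1 0 0 1]
```

The 0.5 cut-off is a convention rather than a requirement; raising or lowering it trades precision against recall.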

### Visualizations and metrics

In [27]:
accuracy_score(Y_test_embed,predictions)

0.9094057601166606

In [28]:
import tensorflow as tf
# Re-compile to add precision and recall metrics (the trained weights are kept)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# Evaluate the model on test data
results = model.evaluate(X_test_embed, Y_test_embed)

loss = results[0]
accuracy = results[1]
precision = results[2]
recall = results[3]

print("Loss:", loss)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Loss: 0.5935298204421997
Accuracy: 0.9094057679176331
Precision: 0.9006316065788269
Recall: 0.8912500143051147
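For reference, accuracy, precision and recall can all be derived from the four confusion-matrix counts. The sketch below uses a small set of hypothetical labels and predictions, not the notebook's data, to show the arithmetic.

```python
import numpy as np

# Hypothetical labels and predictions, not the notebook's real data
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # class 1 correctly predicted
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # class 0 correctly predicted
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # class 0 predicted as class 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # class 1 predicted as class 0

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
print(accuracy, precision, recall)
# 0.75 0.75 0.75
```

The confusion_matrix and classification_report functions imported from scikit-learn at the top of the notebook report the same quantities directly from the true and predicted labels.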


### Conclusions