<a href="https://colab.research.google.com/github/pimverschuuren/FakeNews/blob/main/Simple_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook implements a recurrent neural network (RNN) as a binary classifier that labels news text as fake or true.



Create the field objects that handle tokenization of the text data. Only the columns that contain the text and the label are loaded; the other columns are skipped.

In [2]:
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm')
LABEL = data.LabelField(dtype = torch.float)

fields = [(None, None), (None, None), (None, None), ('text', TEXT),('label',LABEL)]
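
The `(None, None)` entries tell the loader to skip a column. As a rough illustration of that idea using only the standard library (the column names and rows below are invented for the example and are not taken from FakeNews.csv):

```python
import csv
import io

# Hypothetical CSV with five columns; only the last two are kept.
raw = io.StringIO(
    "id,title,author,text,label\n"
    "0,Some title,Someone,First article body,1\n"
    "1,Other title,Someone else,Second article body,0\n"
)

# Same layout idea as the `fields` list: None means "skip this column".
field_names = [None, None, None, "text", "label"]

reader = csv.reader(raw)
next(reader)  # skip the header row, like skip_header=True

examples = [
    {name: value for name, value in zip(field_names, row) if name is not None}
    for row in reader
]
print(examples[0])
```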

Load the CSV file into a torchtext dataset object.

In [3]:
tab_data = data.TabularDataset(path = 'drive/MyDrive/Colab Notebooks/FakeNews/FakeNews.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

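Split the dataset into training, validation and testing datasets: 30% of the rows are held out for testing, and the remaining 70% is split again into 70% training and 30% validation.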

In [6]:
import random

SEED = 1234

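# random.seed() returns None, so seed first and pass the RNG state explicitly.
random.seed(SEED)
train_data, test_data = tab_data.split(split_ratio=0.7, random_state=random.getstate())

random.seed(SEED)
train_data, validation_data = train_data.split(split_ratio=0.7, random_state=random.getstate())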
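
Applying a 0.7 split twice leaves roughly 49% of the rows for training, 21% for validation and 30% for testing. A small sketch of that arithmetic, assuming the training share is rounded to the nearest integer and that the file holds 20800 rows (the sum of the three dataset lengths printed in the next cell):

```python
def nested_split_sizes(n_rows, ratio=0.7):
    """Sizes after splitting off a test set, then a validation set."""
    n_train_full = round(n_rows * ratio)
    n_test = n_rows - n_train_full
    n_train = round(n_train_full * ratio)
    n_valid = n_train_full - n_train
    return n_train, n_valid, n_test

# 20800 is inferred from the printed dataset lengths, not read from the file.
print(nested_split_sizes(20800))
```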

Print the length of each dataset and an example data point.

In [10]:
print("Number of training examples: "+str(len(train_data)))
print("Number of validation examples: "+str(len(validation_data)))
print("Number of testing examples: "+str(len(test_data)))
print("Example data point:")
print(vars(validation_data[0]))

Number of training examples: 10192
Number of validation examples: 4368
Number of testing examples: 6240
Example data point:
{'text': ['In', 'the', 'battle', 'against', 'the', 'Islamic', 'State', 'for', 'Mosul', ',', 'Iraq', '’s', '  ', 'city', ',', 'there', 'is', 'no', 'continuous', 'front', 'line', ',', 'but', 'a', 'patchwork', 'of', 'battlegrounds', 'in', 'the', 'city', 'and', 'all', 'around', 'its', 'edge', '.', 'When', 'the', 'Islamic', 'State', ',', 'also', 'known', 'as', 'ISIS', 'or', 'ISIL', ',', 'retreats', 'from', 'a', 'position', ',', 'it', 'tries', 'to', 'leave', 'as', 'much', 'damage', 'behind', 'as', 'possible', ',', 'including', 'burning', 'oil', 'field', 'wells', 'to', 'provide', 'concealment', 'ahead', 'of', 'a', 'government', 'advance', '.', 'Snipers', 'are', 'also', 'in', 'place', ',', 'ready', 'to', 'strike', '.', 'A', 'village', 'or', 'neighborhood', 'that', 'is', 'retaken', 'by', 'government', 'troops', 'could', 'soon', 'be', 'flooded', 'again', 'with', 'extremist'

Use the training data to build a vocabulary capped at a maximum number of words. The larger the vocabulary, the larger the embedding layer in the RNN, and hence the more learnable parameters.

In [9]:
max_vocab_size = 25000

TEXT.build_vocab(train_data, max_size=max_vocab_size)
LABEL.build_vocab(train_data)

Print the size of the vocabulary. It equals the maximum number of words plus two special tokens: `<unk>` for words outside the vocabulary and `<pad>` for padding.

In [None]:
print("Unique tokens in TEXT vocabulary: "+str(len(TEXT.vocab)))
print("Unique tokens in LABEL vocabulary: "+str(len(LABEL.vocab)))

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
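
The count of 25002 is the 25000 most frequent training tokens plus the two special tokens. A minimal sketch of a frequency-capped vocabulary in plain Python; the toy corpus and the `build_vocab` helper are illustrative and only approximate what torchtext does internally:

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Map the max_size most frequent tokens (plus specials) to indices."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
itos, stoi = build_vocab(corpus, max_size=2)

print(len(itos))  # 2 words + 2 special tokens
# Words outside the vocabulary fall back to the <unk> index.
print([stoi.get(tok, stoi["<unk>"]) for tok in ["the", "cat", "sat"]])
```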


Create iterators that yield batches of tokenized text for the training, validation and testing later in the notebook. When running on Google Colab, enable the GPU in the notebook settings.

In [11]:
import torch

batch_size = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data), 
    batch_sizes = (batch_size, batch_size, batch_size),
    sort = False,
    device = device)
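
Each batch is a single tensor, so shorter texts are padded up to the length of the longest text in the batch. A plain-Python sketch of that padding step; `pad_batch` is a hypothetical helper, not a torchtext function:

```python
def pad_batch(token_lists, pad_token="<pad>"):
    """Pad every token list to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in token_lists]

batch = pad_batch([["fake", "news"], ["a", "much", "longer", "text"]])
for row in batch:
    print(row)
```

With `sort = False` the texts in a batch can differ a lot in length, so short texts may end in a long run of padding tokens.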

Let's introduce our first model. We start off with a simple RNN.

In [12]:
class Simple_RNN(torch.nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        
        # The embedding layer maps each token index (one of input_dim
        # vocabulary entries) to a dense vector of size embedding_dim.
        self.embedding_layer = torch.nn.Embedding(input_dim, embedding_dim)
        
        # Pass the embedded sequence to a single-layer RNN.
        self.rnn_layer = torch.nn.RNN(embedding_dim, hidden_dim)
        
        # Pass the RNN output to a linear layer.
        self.fc_layer = torch.nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        
        embedded = self.embedding_layer(text)
        
        output, hidden = self.rnn_layer(embedded)
        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc_layer(hidden.squeeze(0))
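
The assertion in `forward` holds because, for a single-layer unidirectional RNN, the final hidden state is the output at the last time step. A scalar sketch of the recurrence h_t = tanh(w_ih * x_t + w_hh * h_{t-1} + b), with made-up weights, a hidden size of one, and the two bias terms of `torch.nn.RNN` merged into a single `bias`:

```python
import math

def simple_rnn(inputs, w_ih=0.5, w_hh=0.8, bias=0.1):
    """Run a one-unit Elman RNN and return all outputs plus the final state."""
    hidden = 0.0  # initial hidden state
    outputs = []
    for x in inputs:
        hidden = math.tanh(w_ih * x + w_hh * hidden + bias)
        outputs.append(hidden)
    return outputs, hidden

outputs, hidden = simple_rnn([1.0, -2.0, 0.5])
print(outputs[-1] == hidden)  # the last output is the final hidden state
```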

Now let's instantiate a model object.


*   input_dim = the size of the vocabulary
*   embedding_dim = the size of the vector that the embedding layer outputs.
*   hidden_dim = the size of the hidden states
*   output_dim = one, a single logit that quantifies the class probability, i.e. being fake news or not.





In [13]:
input_dim = len(TEXT.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1

model = Simple_RNN(input_dim, embedding_dim, hidden_dim, output_dim)
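
With these settings most learnable parameters sit in the embedding layer. A quick count by hand, assuming the standard layout of `torch.nn.RNN` (input-to-hidden and hidden-to-hidden weight matrices plus two bias vectors) and the 25002-token vocabulary printed earlier:

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 25002, 100, 256, 1

embedding_params = vocab_size * embedding_dim
rnn_params = (
    hidden_dim * embedding_dim   # input-to-hidden weights
    + hidden_dim * hidden_dim    # hidden-to-hidden weights
    + 2 * hidden_dim             # two bias vectors
)
fc_params = hidden_dim * output_dim + output_dim  # weights plus bias

total_params = embedding_params + rnn_params + fc_params
print(embedding_params, rnn_params, fc_params, total_params)
```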

Define the optimizer.

In [14]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)
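
Plain SGD without momentum moves each parameter a small step against its gradient. A toy version of that update on a one-dimensional quadratic, using the same learning rate as above (the loss function is invented for illustration):

```python
def sgd_step(param, grad, lr=1e-3):
    """One plain SGD update: param <- param - lr * grad."""
    return param - lr * grad

# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for _ in range(1000):
    w = sgd_step(w, 2 * (w - 3))
print(w)
```

After 1000 steps `w` has covered most, but not all, of the distance to the minimum at 3.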

Define the loss criterion that will be minimized.

In [15]:
criterion = torch.nn.BCEWithLogitsLoss()
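
`BCEWithLogitsLoss` applies a sigmoid to the raw model output and then computes the binary cross-entropy. A per-example sketch in plain Python, written in a numerically stable form, max(x, 0) - x*y + log(1 + exp(-|x|)):

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy of sigmoid(logit) against a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

print(bce_with_logits(0.0, 1.0))  # logit 0 means p = 0.5, loss = ln 2
print(bce_with_logits(4.0, 1.0))  # confident and correct: small loss
print(bce_with_logits(4.0, 0.0))  # confident and wrong: large loss
```

A logit of zero gives ln 2 ≈ 0.693, the loss of a model that always predicts a probability of 0.5.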

Place the model and loss function on the selected device (the GPU, if available).

In [16]:
model = model.to(device)
criterion = criterion.to(device)

Define a function that computes the accuracy of the model's predictions.

In [17]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
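
Rounding the sigmoid output amounts to thresholding the predicted probability at 0.5, i.e. predicting class 1 when the logit is positive. The same computation on plain lists, with made-up logits and labels:

```python
import math

def binary_accuracy_lists(logits, labels):
    """Fraction of examples where the thresholded sigmoid matches the label."""
    preds = [1.0 if 1.0 / (1.0 + math.exp(-x)) > 0.5 else 0.0 for x in logits]
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

# Predictions are [1, 0, 1, 0]; two of the four match the labels.
print(binary_accuracy_lists([2.0, -1.0, 0.3, -0.2], [1.0, 0.0, 0.0, 1.0]))
```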

Define the training loop for a single epoch.

In [18]:
def train(model, iterator, optimizer, criterion):
    
    # Define placeholders for the loss and accuracy.
    epoch_loss = 0
    epoch_acc = 0
    
    # Set the model in train mode. This only affects 
    # dropout and batch norm layers.
    model.train()
    
    # Loop over the batches.
    for batch in iterator:
        
        # Reset the gradient.
        optimizer.zero_grad()
            
        # Make the predictions.
        predictions = model(batch.text).squeeze(1)
        
        # Calculate the loss function.
        loss = criterion(predictions, batch.label)
        
        # Calculate the accuracy.
        acc = binary_accuracy(predictions, batch.label)
        
        # Do backpropagation.
        loss.backward()
        
        # Take a step in parameter space.
        optimizer.step()
        
        # Add the loss and accuracy to a total sum.
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    # Return the loss and accuracy averaged over the batches in this epoch.
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
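
`train` returns the mean of the per-batch means. When the last batch is smaller than the others, this differs slightly from the mean over all examples. A small sketch with invented numbers:

```python
# (number of correct predictions, batch size) for three hypothetical batches;
# the last batch is smaller, as happens when the dataset size is not a
# multiple of the batch size.
batches = [(48, 64), (32, 64), (16, 16)]

mean_of_batch_means = sum(c / n for c, n in batches) / len(batches)
per_example_mean = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(mean_of_batch_means)  # weights every batch equally
print(per_example_mean)     # weights every example equally
```

For long epochs with a single short batch the gap is small, so the batch-averaged figure is usually a reasonable summary.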

Define a loop to evaluate the model on the validation or testing data.

In [19]:
def evaluate(model, iterator, criterion):
    
    # Define placeholders for the loss and accuracy.
    epoch_loss = 0
    epoch_acc = 0
    
    # Set the model in evaluation mode. This only affects 
    # dropout and batch norm layers.
    model.eval()
    
    # Disable gradient tracking during evaluation.
    with torch.no_grad():
    
        # Loop over the batches.
        for batch in iterator:

            # Make predictions.
            predictions = model(batch.text).squeeze(1)
            
            # Calculate the loss.
            loss = criterion(predictions, batch.label)
            
            # Calculate the accuracy.
            acc = binary_accuracy(predictions, batch.label)

            # Add the loss and accuracy to a total sum.
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    # Return the loss and accuracy averaged over the batches.
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define a function that converts the elapsed time of an epoch into minutes and seconds.

In [20]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
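
As a quick check, 148 elapsed seconds should come out as 2 minutes and 28 seconds. The variant below uses `divmod` and is intended to behave like `epoch_time` for non-negative durations; the name `epoch_time_divmod` is made up for this example:

```python
def epoch_time_divmod(start_time, end_time):
    """Split an elapsed time in seconds into whole minutes and seconds."""
    elapsed_mins, elapsed_secs = divmod(int(end_time - start_time), 60)
    return elapsed_mins, elapsed_secs

print(epoch_time_divmod(0.0, 148.0))
```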

Run the training.

In [21]:
n_epochs = 5

# Loop over the epochs.
for epoch in range(n_epochs):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 28s
	Train Loss: 0.696 | Train Acc: 50.02%
	 Val. Loss: 0.694 |  Val. Acc: 49.46%
Epoch: 02 | Epoch Time: 2m 28s
	Train Loss: 0.693 | Train Acc: 49.79%
	 Val. Loss: 0.694 |  Val. Acc: 49.46%
Epoch: 03 | Epoch Time: 2m 27s
	Train Loss: 0.693 | Train Acc: 49.93%
	 Val. Loss: 0.694 |  Val. Acc: 49.48%
Epoch: 04 | Epoch Time: 2m 27s
	Train Loss: 0.693 | Train Acc: 50.19%
	 Val. Loss: 0.694 |  Val. Acc: 49.48%
Epoch: 05 | Epoch Time: 2m 27s
	Train Loss: 0.693 | Train Acc: 50.22%
	 Val. Loss: 0.694 |  Val. Acc: 49.50%


Test the model.

In [22]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.693 | Test Acc: 49.71%


The model shows no clear decrease in training or validation loss and ends at a chance-level testing accuracy of ~50%.