<a href="https://colab.research.google.com/github/pimverschuuren/FakeNews/blob/main/FastTex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook implements a FastText model as a binary classifier that labels news text as fake or true.

Define a function that appends every bigram, i.e. every pair of adjacent tokens, to a list of tokens.

In [9]:
def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x
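
As a quick check of what this preprocessing step does, the sketch below applies the same function to a short, made-up token list. Because the bigrams pass through a set, a repeated bigram is added only once and the order of the appended items is not guaranteed. The input list is also modified in place.

```python
def generate_bigrams(x):
    # Same logic as the cell above: pair each token with its successor.
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

# Hypothetical example sentence, already tokenized.
tokens = ['this', 'story', 'is', 'fake']
result = generate_bigrams(tokens)

unigrams = result[:4]
bigrams = sorted(result[4:])  # sorted because set order is arbitrary
print(unigrams)
print(bigrams)
```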

Create the Field objects that handle the tokenization of the text data. Only the columns that contain the text and the label are loaded; the other columns are mapped to None and skipped.




In [10]:
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  preprocessing = generate_bigrams)

LABEL = data.LabelField(dtype = torch.float)

#fields = [(None, None), ('text', TEXT), (None, None), (None, None),('label',LABEL)]
fields = [(None, None), (None, None), (None, None), ('text', TEXT),('label',LABEL)]

Load the CSV file into a torchtext dataset object.

In [11]:
tab_data = data.TabularDataset(path = 'drive/MyDrive/Colab Notebooks/FakeNews/FakeNews.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

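Split the dataset into training, validation and testing datasets. First hold out 30% of the data for testing, then split the remaining 70% again 70/30 into training and validation data.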

In [12]:
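import random

SEED = 1234

# random.seed() returns None, so seed the generator first and then
# pass the seeded state explicitly via random.getstate().
random.seed(SEED)
train_data, test_data = tab_data.split(split_ratio=0.7, random_state=random.getstate())
random.seed(SEED)
train_data, validation_data = train_data.split(split_ratio=0.7, random_state=random.getstate())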

Print the lengths of each dataset.

In [13]:
print("Number of training examples: "+str(len(train_data)))
print("Number of validation examples: "+str(len(validation_data)))
print("Number of testing examples: "+str(len(test_data)))

Number of training examples: 10134
Number of validation examples: 4343
Number of testing examples: 6205
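
The printed sizes are consistent with two successive 70/30 splits. The sketch below redoes the arithmetic in plain Python. The total is simply the sum of the three printed counts, and the rounding rule in `split_sizes` (a hypothetical helper) is an assumption that happens to reproduce the numbers, not a statement about how torchtext computes them.

```python
# Total number of rows, taken as the sum of the three printed counts.
total = 10134 + 4343 + 6205

# Assumption: each split puts round(ratio * n) examples in the first part
# and the remainder in the second part.
def split_sizes(n, ratio):
    first = int(round(ratio * n))
    return first, n - first

train_val, test = split_sizes(total, 0.7)
train, validation = split_sizes(train_val, 0.7)

print(train, validation, test)
print(round(train / total, 2), round(validation / total, 2), round(test / total, 2))
```

In other words, roughly 49% of the rows are used for training, 21% for validation and 30% for testing.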


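Build the vocabulary from the training data only, keeping at most max_vocab_size of the most frequent tokens, and load the pretrained GloVe vectors. Tokens without a pretrained vector are initialized from a normal distribution.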



In [14]:
max_vocab_size = 25000

TEXT.build_vocab(train_data, max_size=max_vocab_size, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)
LABEL.build_vocab(train_data)

100%|█████████▉| 399999/400000 [00:17<00:00, 22928.91it/s]


Print the size of the vocabulary. The size of the TEXT vocabulary is the maximum number of words plus two: the unk token, which stands in for words outside the vocabulary, and the pad token, which is used for padding.

In [15]:
print("Unique tokens in TEXT vocabulary: "+str(len(TEXT.vocab)))
print("Unique tokens in LABEL vocabulary: "+str(len(LABEL.vocab)))

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
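
To illustrate where the "plus two" comes from, here is a minimal sketch of a frequency-capped vocabulary in plain Python. It is not the torchtext implementation; the helper name `build_vocab`, the toy corpus and the order of the special tokens are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    # Count token frequencies over the whole corpus.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens come first, then the max_size most frequent tokens.
    itos = ['<unk>', '<pad>'] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [['a', 'b', 'a', 'c'], ['a', 'b', 'd']]
itos, stoi = build_vocab(corpus, max_size=2)

# Vocabulary size is max_size plus the two special tokens.
print(len(itos))
# Tokens outside the vocabulary fall back to the <unk> index.
print(stoi.get('d', stoi['<unk>']))
```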


Create iterators that yield batches of text for the training loop later in the notebook. When running on Google Colab, enable the GPU in the notebook settings.

In [16]:
import torch

batch_size = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data), 
    batch_sizes = (batch_size, batch_size, batch_size),
    sort=False,
    device = device)
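
Texts in a batch differ in length, so shorter texts are filled up with the pad token until all texts in the batch are equally long. The sketch below shows the idea in plain Python. It is not the torchtext implementation: the helper `pad_batch`, the pad index of 1 and the [sequence length, batch size] layout are assumptions made for illustration.

```python
def pad_batch(sequences, pad_idx):
    # Pad every sequence of token indices to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # Transpose to [sequence length, batch size].
    return [list(column) for column in zip(*padded)]

PAD_IDX = 1  # assumed index of the pad token
batch = pad_batch([[5, 8, 2], [7, 3], [9]], PAD_IDX)
print(batch)
```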

Here we define our FastText model: an embedding layer, an average over all token embeddings of a text, and a single linear layer.

In [18]:
import torch.nn as nn
import torch.nn.functional as F

class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
        
        super().__init__()
        
        # The embedding layer maps each token index (one of vocab_size
        # values) to a dense vector of the smaller dimension embedding_dim.
        # padding_idx marks the pad token, which is used to make the
        # sentences equal in length; its vector is not updated in training.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        # Define the linear layer.
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, text):
  
        # Get the embedding layer output
        embedded = self.embedding(text)
        
        # Reorder from [seq len, batch size, emb dim] to [batch size, seq len, emb dim].
        embedded = embedded.permute(1, 0, 2)
        
        # Average the embeddings over the sequence dimension.
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
                
        return self.fc(pooled)
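
The pooling step in `forward` can look opaque, so here is a small check, using made-up tensor sizes, that `F.avg_pool2d` with a kernel spanning the whole sequence gives the same result as a plain mean over the sequence dimension. Note that the average runs over all positions of the padded batch, including padding.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up sizes for illustration.
seq_len, batch_size, emb_dim = 7, 3, 4
embedded = torch.randn(seq_len, batch_size, emb_dim)

# Same steps as in forward(): [seq len, batch, emb dim] -> [batch, seq len, emb dim].
embedded = embedded.permute(1, 0, 2)
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)

# Equivalent formulation: a plain mean over the sequence dimension.
mean_pooled = embedded.mean(dim=1)

print(pooled.shape)
print(torch.allclose(pooled, mean_pooled, atol=1e-6))
```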

Now let's instantiate a model object.


*   input_dim = the size of the vocabulary.
*   embedding_dim = the size of the vector that the embedding layer outputs; 100 to match the GloVe vectors.
*   output_dim = 1, a single value that quantifies the class probability, i.e. being fake news or not.
*   pad_idx = the vocabulary index of the pad token.





In [19]:
input_dim = len(TEXT.vocab)
embedding_dim = 100
output_dim = 1
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

model = FastText(input_dim, embedding_dim, output_dim, pad_idx)
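
With these settings the size of the model follows directly from the layer shapes. The short calculation below counts the trainable parameters by hand, using the vocabulary size printed earlier in the notebook.

```python
vocab_size = 25002   # size of the TEXT vocabulary printed earlier
embedding_dim = 100
output_dim = 1

# One embedding vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Linear layer: weight matrix plus bias.
linear_params = embedding_dim * output_dim + output_dim

total_params = embedding_params + linear_params
print(f'{total_params:,} parameters')
```

Almost all parameters sit in the embedding layer; the linear layer contributes only 101.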

Get the pretrained GloVe word vectors from the vocabulary and copy them into the embedding layer of the model.

In [20]:
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.1077,  0.1105,  0.5981,  ..., -0.8316,  0.4529,  0.0826],
        ...,
        [ 0.3149, -0.5569,  1.7605,  ...,  1.1534, -0.2848,  2.5636],
        [-0.5501, -0.9610, -0.9494,  ...,  1.4609, -1.2607,  0.5446],
        [ 1.5952,  0.9065, -1.5196,  ..., -0.6795,  0.7539, -1.8911]])

In [21]:

# Zero the initial embeddings of the unk and pad tokens.
unk_idx = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)
model.embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)

Define the optimizer.

In [22]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

Define the loss criterion that will be minimized.

In [23]:
criterion = torch.nn.BCEWithLogitsLoss()
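
This criterion applies a sigmoid to the raw model output (the logit) and then computes the binary cross-entropy, averaged over the batch by default. The sketch below writes out the per-example loss in plain Python, once in a numerically stable form and once naively, to show that both agree. The function names are hypothetical, and the stable form is one common way to write the loss, not necessarily how PyTorch computes it internally.

```python
import math

def bce_with_logits(logit, target):
    # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    # Direct translation of the definition; can fail for very large |logit|.
    s = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))

for logit, target in [(2.0, 1.0), (2.0, 0.0), (-0.5, 1.0)]:
    print(logit, target, round(bce_with_logits(logit, target), 4))
```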

Place the model and the loss function on the selected device (the GPU if one is available).

In [24]:
model = model.to(device)
criterion = criterion.to(device)

Define the accuracy of the model: the fraction of examples whose rounded sigmoid output matches the label.

In [25]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
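
Rounding the sigmoid output at 0.5 amounts to thresholding the logit at 0: positive logits are predicted as class 1, negative logits as class 0. The plain-Python sketch below mirrors the function above on made-up logits and labels.

```python
import math

def binary_accuracy(logits, labels):
    # Sigmoid maps a logit to (0, 1); rounding gives the predicted class.
    preds = [round(1.0 / (1.0 + math.exp(-z))) for z in logits]
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

logits = [2.0, -1.0, 0.3, -0.2]   # made-up model outputs
labels = [1, 0, 0, 0]
acc = binary_accuracy(logits, labels)
print(acc)
```

Three of the four predictions match their labels here; only the third example (logit 0.3, label 0) is misclassified.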

Define the training loop for one epoch.

In [27]:
def train(model, iterator, optimizer, criterion):
    
    # Define placeholders for the loss and accuracy.
    epoch_loss = 0
    epoch_acc = 0
    
    # Set the model in train mode. This only affects 
    # dropout and batch norm layers.
    model.train()

    # Loop over the batches.
    for batch in iterator:
        
        # Reset the gradient for each batch.
        optimizer.zero_grad()

        # Get the predictions.
        predictions = model(batch.text).squeeze(1)
        
        # Calculate the loss.
        loss = criterion(predictions, batch.label)
        
        # Calculate the accuracy.
        acc = binary_accuracy(predictions, batch.label)
        
        # Do back propagation.
        loss.backward()
        
        # Take a step in parameter space.
        optimizer.step()
        
        # Sum the loss and accuracy.
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    
    # Return the average loss and accuracy.
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
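
The returned values are averages over batches, not over examples. If the last batch is smaller than the others, it carries the same weight as a full batch. With 10134 training examples and a batch size of 64, the last batch would hold only 22 examples, assuming the iterator keeps the incomplete batch. The sketch below compares both ways of averaging on made-up numbers.

```python
# Made-up per-batch results: (batch size, accuracy on that batch).
batches = [(64, 1.0), (64, 0.5), (22, 0.0)]

# What the function above computes: the plain mean of the batch accuracies.
mean_of_batches = sum(acc for _, acc in batches) / len(batches)

# Alternative: weight each batch by its number of examples.
n_examples = sum(size for size, _ in batches)
weighted = sum(size * acc for size, acc in batches) / n_examples

print(mean_of_batches, weighted)
```

With many batches per epoch the difference is usually small, so the simpler batch average is a reasonable choice here.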

Define a loop that evaluates the model on held-out data (validation or testing) without updating the parameters.

In [28]:
def evaluate(model, iterator, criterion):
    
    # Define placeholders for the loss and accuracy.
    epoch_loss = 0
    epoch_acc = 0
    
    # Set the model in eval mode. This only affects 
    # dropout and batch norm layers.
    model.eval()
    
    # Disable gradient tracking; the parameters are not updated.
    with torch.no_grad():
    
        # Loop over the batches.
        for batch in iterator:

            # Get the predictions.
            predictions = model(batch.text).squeeze(1)
            
            # Calculate the loss.
            loss = criterion(predictions, batch.label)
            
            # Calculate the accuracy.
            acc = binary_accuracy(predictions, batch.label)

            # Sum the loss and accuracy.
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    
    # Return the average loss and accuracy.
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define a function that converts the elapsed time of an epoch into whole minutes and seconds.

In [29]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
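
A short usage example with hypothetical timestamps shows the expected behavior: fractions of a second are dropped.

```python
def epoch_time(start_time, end_time):
    # Same function as in the cell above.
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Hypothetical timestamps in seconds, 125.7 s apart.
mins, secs = epoch_time(1000.0, 1125.7)
print(f'{mins}m {secs}s')
```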

Run the training.

In [30]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 32s
	Train Loss: 0.682 | Train Acc: 62.45%
	 Val. Loss: 0.664 |  Val. Acc: 68.83%
Epoch: 02 | Epoch Time: 0m 31s
	Train Loss: 0.643 | Train Acc: 72.50%
	 Val. Loss: 0.611 |  Val. Acc: 82.50%
Epoch: 03 | Epoch Time: 0m 31s
	Train Loss: 0.572 | Train Acc: 85.03%
	 Val. Loss: 0.527 |  Val. Acc: 89.73%
Epoch: 04 | Epoch Time: 0m 31s
	Train Loss: 0.494 | Train Acc: 90.39%
	 Val. Loss: 0.447 |  Val. Acc: 92.02%
Epoch: 05 | Epoch Time: 0m 30s
	Train Loss: 0.413 | Train Acc: 92.47%
	 Val. Loss: 0.382 |  Val. Acc: 93.79%
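
The training cell initializes `best_valid_loss` but does not use it afterwards. One common use for such a variable is to remember the epoch with the lowest validation loss and keep the model weights from that epoch. The sketch below shows that bookkeeping on the validation losses printed above. It is an illustration, not part of the original run, and the file name in the comment is a hypothetical choice.

```python
# Validation losses as printed in the training log above.
valid_losses = [0.664, 0.611, 0.527, 0.447, 0.382]

best_valid_loss = float('inf')
best_epoch = None

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        # In the notebook, this is where the model weights could be saved,
        # e.g. with torch.save(model.state_dict(), 'fasttext-model.pt').
        best_valid_loss = valid_loss
        best_epoch = epoch

print(best_epoch, best_valid_loss)
```

In this run the validation loss decreases in every epoch, so the last epoch is also the best one.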


Test the model.

In [31]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.394 | Test Acc: 92.99%


The trained FastText model reaches an accuracy of about 93% on the testing data. This is lower than the previously trained LSTM model (99%), but training is considerably faster.