<a href="https://colab.research.google.com/github/pimverschuuren/Deduplication/blob/master/FastTex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset used for training, validation and testing is train.csv from the Kaggle Fake News competition (https://www.kaggle.com/c/fake-news). Here it is renamed FakeNews.csv.

Define a function that appends bigrams, i.e. pairs of adjacent words, to a list of words.

In [1]:
def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x
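As a quick check of the function above, the following sketch applies it to a made-up four-token sentence. The function body is repeated so the snippet runs on its own. Because the bigrams pass through a set, their order after the original tokens is arbitrary, so they are sorted before printing.

```python
def generate_bigrams(x):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

tokens = ['this', 'film', 'is', 'terrible']  # invented example sentence
result = generate_bigrams(list(tokens))      # pass a copy: the function appends in place

print(result[:4])          # ['this', 'film', 'is', 'terrible']
print(sorted(result[4:]))  # ['film is', 'is terrible', 'this film']
```

Note that the function modifies its argument in place, which is why a copy is passed here.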

Create the Field objects that define how the dataset is preprocessed. Only the columns that contain the text and the label are included; all other columns are skipped.

In [2]:
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  preprocessing = generate_bigrams)

LABEL = data.LabelField(dtype = torch.float)

#fields = [(None, None), ('text', TEXT), (None, None), (None, None),('label',LABEL)]
fields = [(None, None), (None, None), (None, None), ('text', TEXT),('label',LABEL)]
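The fields list is matched to the CSV columns by position, and a (None, None) entry tells torchtext to skip that column. The sketch below mimics this positional mapping with the standard csv module. It assumes the Kaggle file has the columns id, title, author, text, label in that order, and the sample row is invented.

```python
import csv
import io

# Assumed column order of the Kaggle file; the row itself is invented.
sample = io.StringIO(
    "id,title,author,text,label\n"
    "0,Some title,Some author,Body of the article,1\n"
)

# Mirrors the notebook's fields list: None means "ignore this column".
field_names = [None, None, None, "text", "label"]

reader = csv.reader(sample)
next(reader)  # skip the header row, like skip_header = True
for row in reader:
    example = {name: value for name, value in zip(field_names, row) if name is not None}
    print(example)  # {'text': 'Body of the article', 'label': '1'}
```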

Load the CSV file into a torchtext dataset object.

In [29]:
tab_data = data.TabularDataset(path = 'drive/MyDrive/Colab Notebooks/FakeNews/FakeNews.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

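Split the dataset into training, validation and testing datasets. The first split holds out 30% of the examples for testing; the second divides the remaining 70% into training and validation sets, again 70/30.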

In [30]:
import random

SEED = 1234

# random.seed() returns None, so seed first and pass the generator state explicitly.
random.seed(SEED)
train_data, test_data = tab_data.split(split_ratio=0.7, random_state = random.getstate())
train_data, validation_data = train_data.split(split_ratio=0.7, random_state = random.getstate())

Print the number of examples in each dataset.

In [26]:
print("Number of training examples: "+str(len(train_data)))
print("Number of validation examples: "+str(len(validation_data)))
print("Number of testing examples: "+str(len(test_data)))

Number of training examples: 12740
Number of validation examples: 5460
Number of testing examples: 7800
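Two successive 70/30 splits give overall shares of 0.7 × 0.7 = 49% for training, 0.7 × 0.3 = 21% for validation, and 30% for testing. The short check below uses only the counts printed above to confirm those proportions.

```python
# Counts copied from the printed output above.
counts = {"training": 12740, "validation": 5460, "testing": 7800}
total = sum(counts.values())

shares = {name: n / total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
# training: 0.49
# validation: 0.21
# testing: 0.30
```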


Use only the training data to build the vocabulary, capped at the max_vocab_size most frequent tokens, and attach the pretrained glove.6B.300d vectors to it.

In [31]:
max_vocab_size = 25000

TEXT.build_vocab(train_data, max_size=max_vocab_size, 
                 vectors = "glove.6B.300d", 
                 unk_init = torch.Tensor.normal_)
LABEL.build_vocab(train_data)

Print the size of each vocabulary. The TEXT vocabulary holds max_vocab_size tokens plus two special tokens: unk for words outside the vocabulary and pad for padding. The LABEL vocabulary holds the two classes.

In [7]:
print("Unique tokens in TEXT vocabulary: "+str(len(TEXT.vocab)))
print("Unique tokens in LABEL vocabulary: "+str(len(LABEL.vocab)))

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
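To make the "plus two" concrete, here is a simplified stand-in for a capped vocabulary, written in plain Python. It is not torchtext's implementation: build_capped_vocab is a hypothetical helper, the corpus is invented, and placing the unk and pad tokens at indices 0 and 1 is an assumption made for illustration.

```python
from collections import Counter

def build_capped_vocab(token_lists, max_size):
    """Hypothetical helper: keep the max_size most frequent tokens plus two specials."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = ['<unk>', '<pad>'] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

corpus = [['a', 'b', 'a', 'c'], ['a', 'b', 'd']]  # invented toy corpus
itos, stoi = build_capped_vocab(corpus, max_size=2)

print(len(itos))  # 4, i.e. max_size + 2
# Tokens that did not make the cut map to the unk index.
print([stoi.get(tok, stoi['<unk>']) for tok in ['a', 'd']])  # [2, 0]
```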


Create iterators that yield batches of examples for the training and evaluation loops later in the notebook. If running on Google Colab, turn on the GPU in Notebook settings.

In [28]:
import torch

batch_size = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data), 
    batch_sizes = (batch_size, batch_size, batch_size),
    sort=False,
    device = device)
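The iterators pad every example in a batch to the length of the longest one and, with the default Field settings, return the text with shape [sent len, batch size]. The pure-Python sketch below imitates this on invented index sequences; PAD_IDX = 1 is an assumption for illustration.

```python
PAD_IDX = 1  # assumed padding index, for illustration only

# Three invented examples, already converted to vocabulary indices.
batch = [[5, 8, 2], [7, 3], [9]]

# Pad each sequence on the right up to the longest length in the batch.
max_len = max(len(seq) for seq in batch)
padded = [seq + [PAD_IDX] * (max_len - len(seq)) for seq in batch]

# Transpose from [batch size, sent len] to [sent len, batch size].
sent_first = [list(column) for column in zip(*padded)]

print(sent_first)  # [[5, 7, 9], [8, 3, 1], [2, 1, 1]]
```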

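Let's introduce the model. It is a FastText-style classifier rather than an RNN: it embeds each token, averages the embeddings over the sentence, and passes the result through a single linear layer.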

In [9]:
import torch.nn as nn
import torch.nn.functional as F

class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        #self.embedding.weight.requires_grad = False
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
    def forward(self, text):
        
        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
                
        #embedded = [sent len, batch size, emb dim]
        
        embedded = embedded.permute(1, 0, 2)
        
        #embedded = [batch size, sent len, emb dim]
        
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
        
        #pooled = [batch size, embedding_dim]
                
        return self.fc(pooled)
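The avg_pool2d call in forward slides a window of size (sent len, 1) over the [batch size, sent len, emb dim] tensor, which amounts to averaging the token embeddings of each sentence. The sketch below checks this against a plain mean on a random tensor; the tensor sizes are invented.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Invented sizes: 4 sentences, 7 tokens each, 5-dimensional embeddings.
embedded = torch.randn(4, 7, 5)  # [batch size, sent len, emb dim]

# Pooling as written in the model's forward method.
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)

# Equivalent formulation: mean over the sentence dimension.
mean = embedded.mean(dim=1)

print(pooled.shape)                             # torch.Size([4, 5])
print(torch.allclose(pooled, mean, atol=1e-6))  # True
```

One consequence of this formulation is that padded positions are counted in the average along with real tokens.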

Now let's instantiate a model object.


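*   input_dim = the size of the vocabulary (passed to the model as vocab_size).
*   embedding_dim = the size of the vector that the embedding layer outputs; 300 matches the glove.6B.300d vectors.
*   output_dim = one, a single logit that quantifies the class probability, i.e. being fake news or not.
*   pad_idx = the vocabulary index of the padding token, whose embedding receives no gradient updates.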

In [10]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 300
OUTPUT_DIM = 1
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)

In [11]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 7,500,901 trainable parameters
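The parameter count can be reproduced by hand: the embedding matrix contributes vocab_size × embedding_dim weights, and the linear layer contributes embedding_dim × output_dim weights plus one bias per output. A short check using the values from this notebook:

```python
vocab_size = 25002   # len(TEXT.vocab) printed earlier
embedding_dim = 300
output_dim = 1

embedding_params = vocab_size * embedding_dim            # embedding matrix
linear_params = embedding_dim * output_dim + output_dim  # weights + bias

total_params = embedding_params + linear_params
print(f"{total_params:,}")  # 7,500,901
```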


In [12]:
# Overwrite the randomly initialized embedding weights with the pretrained GloVe vectors.
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ..., -1.4447,  0.8402, -0.8668],
        [ 0.1032, -1.6268,  0.5729,  ...,  0.3180, -0.1626, -0.0417],
        [ 0.0466,  0.2132, -0.0074,  ...,  0.0091, -0.2099,  0.0539],
        ...,
        [ 0.1720, -0.5647,  0.6393,  ..., -0.3045, -1.0363, -1.9498],
        [ 0.1345, -0.6839, -0.3999,  ..., -0.4588,  1.2334,  0.8383],
        [-0.1763, -1.0858, -0.1501,  ...,  1.4413, -1.2809, -0.2875]])

In [13]:

# Zero the embedding rows of the unknown and padding tokens.
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

Define the optimizer. Adam is used with its default hyperparameters.

In [14]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

Define the loss criterion that will be minimized.

In [15]:
criterion = torch.nn.BCEWithLogitsLoss()
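BCEWithLogitsLoss applies a sigmoid to the raw model output and then computes the binary cross-entropy, averaged over the batch by default. The pure-Python sketch below writes out that formula for invented logits and labels. It is meant to show the arithmetic only and lacks the numerical safeguards of the PyTorch implementation; bce_with_logits is a hypothetical helper name.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits (illustrative, not numerically hardened)."""
    losses = []
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))  # sigmoid turns the logit into a probability
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# A logit of 0 corresponds to p = 0.5, so the loss is ln(2) whatever the label.
print(round(bce_with_logits([0.0, 0.0], [1.0, 0.0]), 4))  # 0.6931
```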

Place the model and loss function on the GPU if one is available, otherwise on the CPU.

In [16]:
model = model.to(device)
criterion = criterion.to(device)

Define a function that computes the accuracy of the model on a batch.

In [17]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
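Rounding the sigmoid output amounts to thresholding the raw logit at zero: a positive logit is counted as class 1, anything else as class 0. A plain-Python sketch of the same computation on invented values, with binary_accuracy_from_logits as a hypothetical helper name:

```python
def binary_accuracy_from_logits(logits, labels):
    # sigmoid(x) > 0.5 exactly when x > 0, so no sigmoid is needed here.
    predictions = [1.0 if x > 0 else 0.0 for x in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Predictions are 1, 0, 1, 0; three of the four match the labels.
print(binary_accuracy_from_logits([2.3, -0.7, 0.4, -1.9], [1.0, 0.0, 0.0, 0.0]))  # 0.75
```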

Define the training loop for a single epoch. It returns the loss and accuracy averaged over the batches.

In [18]:

def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()

        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define a loop to evaluate the model without updating its weights. It is used on both the validation and the testing data.

In [19]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define a function that converts the elapsed time of an epoch into whole minutes and seconds.

In [20]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
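A small usage example with invented timestamps, repeating the function so the snippet runs on its own:

```python
def epoch_time(start_time, end_time):
    # Repeated from the cell above so the snippet is self-contained.
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 125.7 seconds is 2 whole minutes plus 5 whole seconds.
print(epoch_time(0.0, 125.7))  # (2, 5)
```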

Run the training for up to N_EPOCHS epochs, saving the model weights whenever the validation loss improves.

In [22]:
N_EPOCHS = 100

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

KeyboardInterrupt: ignored
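The loop keeps only the weights from the epoch with the lowest validation loss, so the file on disk reflects the best epoch seen so far even when training is interrupted. The sketch below isolates that bookkeeping with invented validation losses and a dictionary standing in for the file written by torch.save.

```python
# Invented validation losses for five epochs.
valid_losses = [0.52, 0.41, 0.44, 0.39, 0.47]

best_valid_loss = float("inf")
checkpoint = None  # stands in for the file written by torch.save

for epoch, valid_loss in enumerate(valid_losses, start=1):
    # Overwrite the checkpoint only when the validation loss improves.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        checkpoint = {"epoch": epoch, "valid_loss": valid_loss}

print(checkpoint)  # {'epoch': 4, 'valid_loss': 0.39}
```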

In [23]:
model.load_state_dict(torch.load('tut3-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.158 | Test Acc: 96.44%
