# Pooled GRU + GloVe in PyTorch

This notebook replicates the methods presented in https://www.kaggle.com/yekenot/pooled-gru-fasttext in PyTorch. I recently started learning PyTorch after I saw it mentioned in the fast.ai lectures. There are some features I could not clearly figure out; I state them as questions along the way and at the end.

Import the required modules.

In [2]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import time

# Pytorch
import torch
import torch.nn as nn
import torch.legacy.nn as legacy
import torch.nn.functional as F
from torch.autograd import Variable

# Text Preprocessing
import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

# Tokenizer
import spacy
from spacy.symbols import ORTH

# Progress Bar
from tqdm import tqdm

# Evaluation
from sklearn.metrics import roc_auc_score

# Preprocess the text

Create a tokenizer using spaCy.

In [3]:
NLP = spacy.load('en')

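def tokenizer(comment):
    # Replace quotes, brackets, newlines and similar symbols (including "!") with spaces
    comment = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(comment))
    # Collapse runs of spaces
    comment = re.sub(r"[ ]+", " ", comment)
    # Collapse repeated punctuation; the "!" rule never fires because "!" was removed above
    comment = re.sub(r"\!+", "!", comment)
    comment = re.sub(r"\,+", ",", comment)
    comment = re.sub(r"\?+", "?", comment)
    # Tokenize with spaCy and drop single-space tokens
    return [x.text for x in NLP.tokenizer(comment) if x.text != " "]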
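
To see what the regex cleanup does before spaCy gets the text, here is a small stand-alone sketch. It reuses the same substitutions but swaps spaCy's tokenizer for a plain `str.split` (an assumption, made only so the snippet runs without a language model), so the tokens will differ from the real pipeline wherever spaCy would split punctuation off a word.

```python
import re

def clean(comment):
    # Same substitutions as in the tokenizer above.
    comment = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(comment))
    comment = re.sub(r"[ ]+", " ", comment)
    comment = re.sub(r"\!+", "!", comment)
    comment = re.sub(r"\,+", ",", comment)
    comment = re.sub(r"\?+", "?", comment)
    return comment

def whitespace_tokenizer(comment):
    # Stand-in for NLP.tokenizer (hypothetical helper): split on whitespace only.
    return clean(comment).split()

sample = 'Why??? "Stop" (now)!!!'
tokens = whitespace_tokenizer(sample)
print(tokens)
```

Note that the exclamation marks disappear entirely: the first substitution already turns every `!` into a space, so the later `!+` rule has nothing left to collapse.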

Create a torchtext data.Field defining how to treat the text.

In [4]:
COMMENT = data.Field(
        sequential=True,
        fix_length=200,
        tokenize=tokenizer,
        pad_first=True,
        lower=True
    )
LABEL = data.Field(
        sequential=False,
        use_vocab=False
    )

Load the CSV files and tokenize the comments. Change the paths to wherever your datasets are stored.

In [5]:
train = data.TabularDataset(
        path='Dataset/train.csv', format='csv', skip_header=True,
        fields=[
            ('id', None),
            ('comment_text', COMMENT),
            ('toxic', LABEL),
            ('severe_toxic', LABEL),
            ('obscene', LABEL),
            ('threat', LABEL),
            ('insult', LABEL),
            ('identity_hate', LABEL),
        ])
test = data.TabularDataset(
        path='Dataset/test.csv', format='csv', skip_header=True,
        fields=[
            ('id', None),
            ('comment_text', COMMENT)
        ])

Build a vocabulary with the 50000 most common words.

In [8]:
COMMENT.build_vocab(
        train, test,
        max_size=50000,
        min_freq=0,
        vectors=None
    )
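
What a frequency-capped vocabulary amounts to can be sketched in plain Python. This is only an illustration, not torchtext's code: the helper `build_vocab` is hypothetical, and putting the special tokens `<unk>` and `<pad>` first and breaking ties alphabetically are assumptions that mirror how I understand torchtext to behave.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    # Count every token across all texts.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by descending frequency, then alphabetically for deterministic ties.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep the max_size most frequent tokens after the special tokens.
    itos = list(specials) + [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

texts = [["you", "are", "great"], ["you", "are", "not"], ["you", "rock"]]
itos, stoi = build_vocab(texts, max_size=2)
print(itos)
```

Tokens that fall outside the cap (here "great", "not" and "rock") have no index of their own and would be mapped to `<unk>` at lookup time.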

Load pretrained GloVe word embeddings with 300 dimensions. I tried to use fastText, but there was an error message regarding the dimensionality of some words.

In [9]:
COMMENT.vocab.load_vectors('glove.6B.300d')
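
Not every word in the vocabulary will have a pretrained vector. As far as I know, torchtext fills the rows of such words with zeros by default. The sketch below shows the idea with toy three-dimensional vectors; `build_embedding_matrix` and the toy data are hypothetical, and plain lists stand in for tensors.

```python
def build_embedding_matrix(itos, pretrained, dim):
    matrix, missing = [], []
    for tok in itos:
        vec = pretrained.get(tok)
        if vec is None:
            # No pretrained vector: remember the token and use a zero row.
            missing.append(tok)
            vec = [0.0] * dim
        matrix.append(list(vec))
    return matrix, missing

pretrained = {"you": [0.1, 0.2, 0.3], "are": [0.4, 0.5, 0.6]}  # toy vectors
itos = ["<unk>", "<pad>", "you", "are", "wikipedian"]
matrix, missing = build_embedding_matrix(itos, pretrained, dim=3)
coverage = 1 - len(missing) / len(itos)
print(missing, round(coverage, 2))
```

Checking the coverage like this is a quick way to see how much of the vocabulary actually benefits from the pretrained embeddings, which matters here because the embedding layer is frozen.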

Create batches for the train and test sets.

In [23]:
train_iter = data.Iterator(train, batch_size=64, sort_within_batch=False, 
                           device=-1, sort_key=lambda x: len(x.comment_text), repeat=False, shuffle=False)
test_iter = data.Iterator(test, batch_size=64, device=-1, sort=False, 
                          sort_within_batch=False, repeat=False, shuffle=False)

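I could not figure out exactly where the text actually gets converted to numbers. (Strictly speaking it is not one-hot-encoded: each token is replaced by its integer index in the vocabulary, and the embedding layer looks up rows by that index.) It seems that the vocabulary is not applied until you create iterators. I would like to know whether it is possible to skip creating iterators.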
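
On the question above: as far as I can tell, the conversion happens when a batch is built, where the field first pads the token lists and then maps tokens to indices. The following is a simplified plain-Python sketch of that step, not torchtext's implementation. The helper `pad_and_numericalize` is hypothetical, and keeping the first `fix_length` tokens when truncating is an assumption.

```python
def pad_and_numericalize(tokens, stoi, fix_length, pad_first=True,
                         pad_token="<pad>", unk_token="<unk>"):
    # Keep at most fix_length tokens (assumption: truncate at the end).
    tokens = tokens[:fix_length]
    padding = [pad_token] * (fix_length - len(tokens))
    # pad_first=True puts the padding in front of the text.
    padded = padding + tokens if pad_first else tokens + padding
    # Unknown tokens fall back to the index of <unk>.
    return [stoi.get(tok, stoi[unk_token]) for tok in padded]

stoi = {"<unk>": 0, "<pad>": 1, "you": 2, "are": 3}
ids = pad_and_numericalize(["you", "are", "awesome"], stoi, fix_length=5)
print(ids)
```

The result is a list of integers per comment, which is exactly what `nn.Embedding` expects as input.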

# Train the model

Set Hyperparameters

In [17]:
vocab_size = len(COMMENT.vocab)
hidden_size = 80
batch_size = 64
embedding_size = 300
label_size = 6

Create a model

In [18]:
class ToxicCommentClassifier(nn.Module):

    def __init__(self, embedding_size, hidden_size, vocab_size, label_size, batch_size):
        super(ToxicCommentClassifier, self).__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        # Embedding layer initialized from the pretrained vectors and kept frozen
        self.word_embeddings = nn.Embedding(vocab_size, embedding_size)
        self.word_embeddings.weight.data.copy_(COMMENT.vocab.vectors)
        self.word_embeddings.weight.requires_grad = False
        self.spatialdropout = nn.Dropout2d(p=0.2)
        self.gru = nn.GRU(embedding_size, hidden_size, bidirectional=True)
        # Input is max-pool + mean-pool over a bidirectional GRU: hidden_size * 2 * 2
        self.linear = nn.Linear(hidden_size*2*2, label_size)

    def forward(self, x):
        # x: (batch, seq_len) token indices
        self.hidden = self.init_hidden()
        x = self.word_embeddings(x)                             # (batch, seq_len, emb)
        x = self.spatialdropout(x.transpose(1,0).contiguous())  # (seq_len, batch, emb)
        output, self.hidden = self.gru(x, self.hidden)          # (seq_len, batch, 2*hidden)
        maxpool, _ = torch.max(output, dim=0)                   # (batch, 2*hidden)
        meanpool = torch.mean(output, dim=0)                    # (batch, 2*hidden)
        concat = torch.cat((maxpool, meanpool), dim=1)          # (batch, 4*hidden)
        y_pred = self.linear(concat)                            # (batch, label_size) logits
        return y_pred

    def init_hidden(self):
        # Zero initial state: (num_directions, batch, hidden)
        return Variable(torch.zeros(2, self.batch_size, self.hidden_size))

Apparently SpatialDropout exists only in the legacy modules in PyTorch. However, if I am not mistaken, Dropout2d is the equivalent, as it zeroes out whole channels. I am still uncertain about this point, though!
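
On that uncertainty: `nn.Dropout2d` zeroes whole channels, where the channel axis is dim 1 of an `(N, C, H, W)` input, so what gets dropped depends on the layout of the tensor passed in. The Keras layer used in the original kernel (SpatialDropout1D, as far as I know) drops entire embedding dimensions across all time steps. Below is a sketch of one way to get that behavior, assuming a batch-first input of shape `(batch, seq_len, emb)`. The class name `SpatialDropout1d` is hypothetical, and the tensor is made explicitly 4D so the result does not depend on how `Dropout2d` treats 3D input.

```python
import torch
import torch.nn as nn

class SpatialDropout1d(nn.Module):
    """Drops whole embedding channels (hypothetical helper, not a PyTorch class)."""

    def __init__(self, p=0.2):
        super().__init__()
        self.dropout = nn.Dropout2d(p=p)

    def forward(self, x):
        # x: (batch, seq_len, emb) -> (batch, emb, seq_len, 1) so that emb is dim 1
        x = x.permute(0, 2, 1).unsqueeze(-1)
        x = self.dropout(x)
        # back to (batch, seq_len, emb)
        return x.squeeze(-1).permute(0, 2, 1)

torch.manual_seed(0)
layer = SpatialDropout1d(p=0.5)
layer.train()
out = layer(torch.ones(2, 4, 6))
# A dropped channel is zero at every time step; a kept one is scaled by 1 / (1 - p).
channel_is_dropped = (out == 0).all(dim=1)   # shape (batch, emb)
channel_is_kept = (out == 2).all(dim=1)
print(channel_is_dropped | channel_is_kept)
```

In the model above, the tensor handed to `Dropout2d` has shape `(seq_len, batch, emb)`, so dim 1 is the batch axis. As far as I can tell, the mask then covers (time step, sample) pairs, which drops whole word vectors rather than embedding dimensions. That would be a different regularizer from the one in the Keras kernel.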

Initialize the model, optimizer and loss function.

In [19]:
model = ToxicCommentClassifier(embedding_size, hidden_size, vocab_size, label_size, batch_size)
parameters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.001)
criterion = nn.BCEWithLogitsLoss()

In [20]:
def return_inputs_labels(batch):
    x = batch.comment_text
    y = torch.stack([
        batch.toxic, batch.severe_toxic, 
        batch.obscene,
        batch.threat, batch.insult, 
        batch.identity_hate
    ], dim=1)
    y = y.type(torch.FloatTensor)
    x = x.transpose(1,0).contiguous()
    return x, y

Train the model.

In [24]:
EPOCH = 2

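for i in range(EPOCH):
    train_loss = 0.0
    model.train()
    for batch in tqdm(train_iter):
        inputs, labels = return_inputs_labels(batch)
        # The last batch can be smaller, so resize the initial hidden state to match
        model.batch_size = len(inputs.data)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        # loss is already averaged over the batch (use loss.item() on PyTorch >= 0.4)
        train_loss += loss.data[0]

        model.zero_grad()
        loss.backward()
        optimizer.step()

    # Sum of batch-mean losses divided by the number of samples, so the value
    # printed is smaller than the mean per-batch loss by roughly the batch size
    train_loss = train_loss / len(train)

    print ('Epoch: %d, Training Loss: %g' % ((i+1), train_loss))
    time.sleep(1)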

100%|██████████| 2494/2494 [13:47<00:00,  3.01it/s]


Epoch: 1, Training Loss: 0.00104232


100%|██████████| 2494/2494 [13:51<00:00,  3.00it/s]


Epoch: 2, Training Loss: 0.000816634
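
A note on reading these numbers: `criterion` returns a loss that is already averaged over the batch, and the loop then divides the sum of those means by the number of samples. With 2494 batches of about 64 comments each, the printed value is therefore roughly 64 times smaller than the average batch loss (so about 0.067 for the first epoch, if that reasoning holds). The toy arithmetic below, with made-up numbers, shows the three ways of normalizing.

```python
batch_sizes = [64, 64, 32]                 # toy batches; the last one is smaller
batch_mean_losses = [0.07, 0.06, 0.05]     # made-up per-batch mean losses

n_samples = sum(batch_sizes)
summed = sum(batch_mean_losses)

# What the training loop computes: sum of batch means over the sample count.
per_sample_as_in_loop = summed / n_samples
# Plain mean of the batch means.
per_batch = summed / len(batch_mean_losses)
# Mean over samples, weighting each batch by its size.
weighted = sum(l * n for l, n in zip(batch_mean_losses, batch_sizes)) / n_samples

print(per_sample_as_in_loop, per_batch, weighted)
```

The size-weighted version is the one that stays comparable between datasets of different sizes, for example between a training and a validation set.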


Make predictions.

In [25]:
prediction_list = []
# Note: model.eval() is never called, so the dropout layer is still active here
for batch in tqdm(test_iter):
    x = batch.comment_text
    inputs = x.transpose(1,0).contiguous()        # (batch, seq_len)
    model.batch_size = len(inputs.data)
    predictions = model(inputs)                   # raw logits
    predictions = predictions.data.numpy()
    predictions = 1 / (1 + np.exp(-predictions))  # sigmoid -> probabilities
    prediction_list.append(predictions)
predictions = np.vstack(prediction_list)          # (n_test, 6)

100%|██████████| 2394/2394 [08:00<00:00,  4.98it/s]
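
One thing worth checking in the prediction loop: the model is still in the mode set by `model.train()`, so dropout keeps zeroing activations while predicting. The sketch below shows the difference between the two modes on a bare `nn.Dropout` layer. It assumes PyTorch 0.4 or later, where `torch.no_grad()` is available.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()                      # training mode: random zeroing, survivors scaled
train_out = dropout(x)

dropout.eval()                       # evaluation mode: dropout is the identity
with torch.no_grad():                # also skip gradient tracking during inference
    eval_out = dropout(x)

print(train_out, eval_out)
```

For the notebook this would mean calling `model.eval()` once before iterating over `test_iter`. I have not re-run the submission with that change, so I cannot say how much it moves the score.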


Create the submission file.

In [26]:
submission = pd.read_csv("Dataset/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    submission[col] = predictions[:, i]
submission.drop("comment_text", axis=1).to_csv("submission.csv", index=False)

The score on the public leaderboard with this notebook is 0.9670. 

If I use no dropout I can get up to 0.9783, which leads me to believe Dropout2d is not doing exactly what I think it is doing.

I also find it really difficult to implement proper validation, as I can only deal with the data in batches and I get lost somewhere in the for-loops.
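
One way to keep the loops manageable is to put the whole validation pass into a single function that collects results batch by batch and only computes metrics at the end. The sketch below does that with a toy linear model and random data. The function `evaluate` and the toy batches are hypothetical, and it assumes PyTorch 0.4 or later.

```python
import torch
import torch.nn as nn

def evaluate(model, batches, criterion):
    """Return the per-sample mean loss and all probabilities for an iterable of (x, y)."""
    model.eval()                                   # disable dropout
    total_loss, n_samples, probs = 0.0, 0, []
    with torch.no_grad():
        for x, y in batches:
            logits = model(x)
            # Weight each batch-mean loss by the batch size.
            total_loss += criterion(logits, y).item() * x.size(0)
            n_samples += x.size(0)
            probs.append(torch.sigmoid(logits))
    return total_loss / n_samples, torch.cat(probs)

# Toy stand-ins: a linear model and random "validation" batches with 6 labels.
torch.manual_seed(0)
toy_model = nn.Linear(10, 6)
valid_batches = [(torch.randn(4, 10), torch.randint(0, 2, (4, 6)).float())
                 for _ in range(3)]
val_loss, val_probs = evaluate(toy_model, valid_batches, nn.BCEWithLogitsLoss())
print(val_loss, val_probs.shape)
```

With the notebook's classes, the batches would come from a held-out iterator built like `train_iter` and passed through `return_inputs_labels`, and `model.batch_size` would still need to be set per batch as in the training loop. Passing the stacked labels and probabilities to `roc_auc_score` would then give a column-averaged AUC, and `model.train()` has to be called again before the next epoch.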

Again, I am very new to this and thought I would play around with PyTorch a little, as most of the kernels are written with Keras. I think it is a great framework, but the documentation is lacking in some specific places, so I am glad for any suggestions on how to improve my code. Thank you!