### Based on github code at:
https://raw.githubusercontent.com/keitakurita/practical-torchtext/master/Lesson%201%20intro%20to%20torchtext%20with%20text%20classification.ipynb

### For more (like really detailed) readup, check out the author's blog at:
http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

In [1]:
import pandas as pd
import numpy as np
import torch

In [199]:
# create train and validation set
ori_train = pd.read_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/train/train.csv").values

# load test set
test_df = pd.read_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/test/test.csv")

In [200]:
# split dependent variable and independent variable
y=ori_train[:,2:]
x=ori_train[:,0:2]

# split training set and testing set
from sklearn.cross_validation import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25)

In [201]:
# turn into dataframes
new_train = np.concatenate((xtrain, ytrain), axis=1)
new_valid = np.concatenate((xtest, ytest), axis=1)

train_df = pd.DataFrame(new_train,columns=["id","comment_text","toxic","severe_toxic","obscene","threat","insult","identity_hate"])
valid_df = pd.DataFrame(new_valid, columns=["id","comment_text","toxic","severe_toxic","obscene","threat","insult","identity_hate"])

In [202]:
# a bit of massaging to remove newlines
train_df['comment_text'] = train_df['comment_text'].str.replace('\n',' ')
valid_df['comment_text'] = valid_df['comment_text'].str.replace('\n',' ')
test_df['comment_text'] = test_df['comment_text'].str.replace('\n',' ')

In [203]:
# we'll export these out to keita_data folder since traditionally torchtext loads data via csv, not dataframes
train_df.to_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/train.csv", index = False)
valid_df.to_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/valid.csv", index = False)
test_df.to_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/test.csv", index = False)

# Loading the data

First, let's examine what the data looks like.

In [204]:
pd.read_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/train.csv").head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0bc240e647fd49e7,"That's your POV talking, which isn't really re...",0,0,0,0,0,0
1,c75a1622fe2d21ae,Obviously long term population predictions are...,0,0,0,0,0,0
2,f664d0a69b9d5f33,"I have limited experience with this guy, but t...",0,0,0,0,0,0
3,92dbcf851942462f,""" Again, RCOO-M+ is archaic. Please read my r...",0,0,0,0,0,0
4,f096887295b234f3,FYI: Please see this: User talk:Central. If th...,0,0,0,0,0,0
5,57b5093689ca3fe8,""" We know she was charged with shooting at the...",0,0,0,0,0,0
6,6d0ef1f2fe1a34db,It occurred to me that I regard the EU as inde...,0,0,0,0,0,0
7,512782c73bf30b53,"If you want to file a report go ahead, just be...",0,0,0,0,0,0
8,ec2072c145eb2546,HTML Coding Damaged @ Bottom The tables near...,0,0,0,0,0,0
9,a3a8d4b22acaa208,""" Hi Wally, not sure if you are picking t...",0,0,0,0,0,0


Apparently we have to predict 6 labels

In [205]:
pd.read_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/test.csv").head(10)

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,"== From RfC == The title is fine as it is, ..."
2,00013b17ad220c46,""" == Sources == * Zawe Ashton on Lapland..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.
5,0001ea8717f6de06,Thank you for understanding. I think very high...
6,00024115d4cbde0f,Please do not add nonsense to Wikipedia. Such ...
7,000247e83dcc1211,:Dear god this site is horrible.
8,00025358d4737918,""" Only a fool can believe in such numbers. ..."
9,00026d1092fe71cc,== Double Redirects == When fixing double r...


### Declaring Fields

The Field class determines how the data is preprocessed and converted into a numeric format

In [206]:
from torchtext.data import Field

We want comment_text field to be converted to lowercase, tokenized on whitespace, and preprocessed. So we tell that to the Field

In [207]:
tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

That was simple. The preprocessing of the labels is even easier, since they are already converted into a binary encoding.
All we need to do is to tell the Field class that the labels are already processed. We do this by passing the use_vocab=False keyword to the constructor

In [208]:
LABEL = Field(sequential=False, use_vocab=False)

### Creating the Dataset

We'll use the TabularDataset class to read our data, since it is in csv format (TabularDataset handles csv, tsv, and json files as of now)

In [209]:
from torchtext.data import TabularDataset

For the train and validation data, we need to process the labels. The fields we pass in must be in the same order as the columns. For fields we don't use, we pass in a tuple where the second element is None

In [410]:
PATH = "C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/"
TRAIN_FN = "train.csv"
VALID_FN = "valid.csv"
TEST_FN = "test.csv"
TEST_PATH = PATH + TEST_FN

In [211]:
TEST_PATH

'C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/data/keita_data/test.csv'

In [212]:
%%time
tv_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT), ("toxic", LABEL),
                 ("severe_toxic", LABEL), ("threat", LABEL),
                 ("obscene", LABEL), ("insult", LABEL),
                 ("identity_hate", LABEL)]

trn, vld = TabularDataset.splits(
        path=PATH, # the root directory where the data lies
        train=TRAIN_FN, validation=VALID_FN,
        format='csv',
        skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
        fields=tv_datafields)

Wall time: 6.31 s


In [213]:
tv_datafields

[('id', None),
 ('comment_text', <torchtext.data.field.Field at 0x1814508d208>),
 ('toxic', <torchtext.data.field.Field at 0x1814508d198>),
 ('severe_toxic', <torchtext.data.field.Field at 0x1814508d198>),
 ('threat', <torchtext.data.field.Field at 0x1814508d198>),
 ('obscene', <torchtext.data.field.Field at 0x1814508d198>),
 ('insult', <torchtext.data.field.Field at 0x1814508d198>),
 ('identity_hate', <torchtext.data.field.Field at 0x1814508d198>)]

For the test data, we don't have any labels

In [214]:
%%time
tst_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT)
]

tst = TabularDataset(
        path=TEST_PATH, # the file path
        format='csv',
        skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
        fields=tst_datafields)

Wall time: 6.13 s


For the TEXT field to convert words into integers, it needs to be told what the entire vocabulary is. To do this, we run TEXT.build_vocab, passing in the dataset to build the vocabulary on.

In [218]:
%%time
TEXT.build_vocab(trn)

Wall time: 2.62 s


Let's take a look at what the vocab looks like.

The vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.

In [219]:
TEXT.vocab.freqs.most_common(10)

[('the', 368752),
 ('to', 220799),
 ('of', 167202),
 ('and', 163975),
 ('a', 159533),
 ('i', 148251),
 ('you', 139392),
 ('is', 128545),
 ('that', 110350),
 ('in', 105814)]

It is also instructive to take a look inside the Dataset. Datasets can be indexed like normal lists, so we'll look at the first element.

In [220]:
trn[0]

<torchtext.data.example.Example at 0x18146776860>

Each element of the dataset is an Example object that bundles the attributes of a single data point together.

In [221]:
trn[0].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

We see that the comment text is already tokenized for us.

In [222]:
trn[0].comment_text[:3]

["that's", 'your', 'pov']

Looking good. Now, let's build the Iterator which will allow us to load the data into our model.

### Creating the Iterator

In [223]:
from torchtext.data import Iterator, BucketIterator

During training, we'll be using a special kind of Iterator, called the **BucketIterator**.

When we pass data into a neural network, we want the data to be padded to be the same length so that we can process them in batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding will consume a lot of wasteful memory and time.

The BucketIterator groups sequences of similar lengths together for each batch to minimize padding. Handy, right?

In [224]:
train_iter, val_iter = BucketIterator.splits(
        (trn, vld), # we pass in the datasets we want the iterator to draw data from
        batch_sizes=(64, 64),
        device=0, # if you want to use the GPU, specify the GPU number here
        sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
        sort_within_batch=False,
        repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

Let's take a look at what the output of the BucketIterator looks like

In [225]:
batch = next(train_iter.__iter__()); batch

<torchtext.data.batch.Batch at 0x18145202550>

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [226]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'train', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

For the test set, we don't want the data to be shuffled. This is why we'll be using a standard Iterator.

In [421]:
test_iter = Iterator(tst, batch_size=64, device=0, train=False, sort=False, sort_within_batch=False, repeat=False)

### Wrapping the Iterator

Currently, the iterator returns a custom datatype called torchtext.data.Batch. This makes code reuse difficult (since each time the column names change, we need to modify the code), and makes torchtext hard to use with other libraries for some use cases (like torchsample and fastai). 

I hope this will be dealt with in the future (I'm considering filing a PR if I can decide what the API should look like), but in the meantime, we'll hack on a simple wrapper to make the batches easy to use. 

Concretely, we'll convert the batch to a tuple in the form (x, y) where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data).

In [422]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)

We'll use this to wrap the BucketIterator

In [423]:
train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)

In [368]:
next(train_dl.__iter__())

(Variable containing:
 
 Columns 0 to 5 
  3.5585e+05  7.3000e+01  4.9970e+04  1.8400e+02  1.2501e+05  4.7290e+03
  2.5000e+01  1.5000e+01  4.7697e+04  5.0000e+00  8.0000e+00  4.8000e+01
  2.5458e+04  1.0600e+03  5.1580e+03  6.0000e+00  2.3720e+03  2.3061e+05
  1.5300e+02  5.1000e+01  1.5200e+02  6.2170e+03  6.4970e+03  4.8000e+01
  6.0000e+00  1.2300e+02  1.6000e+01  2.5100e+02  8.0000e+00  3.0000e+00
  8.3910e+03  9.0000e+00  2.0000e+00  1.1000e+01  9.3000e+01  5.8760e+03
  1.5742e+05  6.0000e+00  1.0666e+04  6.6000e+01  4.9000e+01  1.6000e+01
  2.7400e+02  5.1500e+03  4.7417e+04  7.2220e+03  1.2246e+04  3.4121e+05
  1.0000e+00  1.0000e+00  1.0000e+00  1.0000e+00  1.0000e+00  1.0000e+00
 
 Columns 6 to 11 
  6.9000e+01  4.9700e+02  1.7174e+04  2.3787e+05  7.3500e+02  3.6567e+05
  2.0000e+00  3.0000e+00  7.7000e+01  7.6000e+01  2.8034e+05  8.0000e+00
  8.7000e+02  5.1000e+02  2.9600e+02  3.4000e+01  2.0564e+05  9.2200e+02
  8.8696e+04  4.0000e+00  8.0000e+00  8.0000e+00  2.5000e+01  2

Now we're ready to start training a model!

# Training a Text Classifier

We'll use a simple LSTM as a baseline example.

In [369]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

In [370]:
class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__() # don't forget to call this!
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)
    
    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds

In [371]:
em_sz = 100
nh = 500
nl = 3
model = SimpleBiLSTMBaseline(nh, emb_dim=em_sz); model

SimpleBiLSTMBaseline(
  (embedding): Embedding(386045, 100)
  (encoder): LSTM(100, 500, dropout=0.1)
  (linear_layers): ModuleList(
  )
  (predictor): Linear(in_features=500, out_features=6)
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [372]:
model.cuda()

SimpleBiLSTMBaseline(
  (embedding): Embedding(386045, 100)
  (encoder): LSTM(100, 500, dropout=0.1)
  (linear_layers): ModuleList(
  )
  (predictor): Linear(in_features=500, out_features=6)
)

### The training loop

In [373]:
import tqdm

In [418]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()

In [419]:
epochs = 2

In [420]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        running_loss += loss.data[0] * x.size(0)
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)
        val_loss += loss.data[0] * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

  
100%|██████████████████████████████████████████████████████████████████████████████| 1870/1870 [00:52<00:00, 35.88it/s]


Epoch: 1, Training Loss: 0.0445, Validation Loss: 0.0663


100%|██████████████████████████████████████████████████████████████████████████████| 1870/1870 [00:51<00:00, 36.23it/s]


Epoch: 2, Training Loss: 0.0402, Validation Loss: 0.0627
Wall time: 1min 54s


### Writing Predictions

In [424]:
test_dl

<__main__.BatchWrapper at 0x1816cb26a90>

In [425]:
test_preds = []
for x, y in tqdm.tqdm(test_dl):
    preds = model(x)
    # if you're data is on the GPU, you need to move the data back to the cpu
    preds = preds.data.cpu().numpy()
#     preds = preds.data.numpy()
    # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
    preds = 1 / (1 + np.exp(-preds))
    test_preds.append(preds)
test_preds = np.vstack(test_preds)

  
100%|██████████████████████████████████████████████████████████████████████████████| 2394/2394 [01:46<00:00, 22.51it/s]


In [429]:
test_dl

TypeError: 'BatchWrapper' object does not support indexing

In [426]:
len(test_preds)

153164

In [427]:
len(df)

153164

In [428]:
df = pd.read_csv(TEST_PATH)
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    df[col] = test_preds[:, i]

# if you want to write the submission file to disk, uncomment and run the below code
df.drop("comment_text", axis=1).to_csv("C://Users/hafid/Dropbox/3.SelfStudy/kaggle/ai6-kaggle-toxic/submission/multi-label-bilstm-2.csv", index=None)

In [408]:
df.head(3)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,0.023659,5.4e-05,1.008396e-05,0.007243,0.014948,0.000581
1,0000247867823ef7,"== From RfC == The title is fine as it is, ...",0.001512,1.6e-05,5.643992e-06,0.00055,0.000892,4.9e-05
2,00013b17ad220c46,""" == Sources == * Zawe Ashton on Lapland...",0.001571,3e-06,4.825711e-07,0.000925,0.000823,4.6e-05
