Using the torchtext tutorial here as a guide:  http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

Preprocessing Steps for NLP:
1. **Read the data** from disk
2. **Tokenize** the text
3. Create a **mapping from word to a unique integer**
4. **Convert the text into lists of integers**
5. Load the data in whatever **format** your deep learning framework requires
6. **Pad the text so that all the sequences are the same length**, so you can process them in batch
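
As a rough illustration of what these steps amount to, here is a minimal pure-Python sketch on a made-up three-sentence corpus. The corpus, the `<pad>`/`<unk>` token names, and the choice of 0 as the padding id are assumptions for the example only; torchtext does the equivalent work for you.

```python
# Step 1 stand-in: a toy corpus instead of text read from disk.
texts = ["The cat sat", "the dog", "A cat and a dog sat"]

# Step 2: tokenize (lowercase, then split on whitespace).
tokenized = [t.lower().split() for t in texts]

# Step 3: map each word to a unique integer; ids 0 and 1 are reserved here.
PAD, UNK = "<pad>", "<unk>"
stoi = {PAD: 0, UNK: 1}
for tokens in tokenized:
    for tok in tokens:
        stoi.setdefault(tok, len(stoi))

# Step 4: convert each text into a list of integers.
numericalized = [[stoi.get(tok, stoi[UNK]) for tok in tokens] for tokens in tokenized]

# Steps 5-6: pad every sequence to the longest one so the batch is rectangular.
max_len = max(len(seq) for seq in numericalized)
padded = [seq + [stoi[PAD]] * (max_len - len(seq)) for seq in numericalized]

for row in padded:
    print(row)
```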

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pdb
import os

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from pandas_summary import DataFrameSummary

import torch
import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import spacy
from spacy.lang.en.stop_words import STOP_WORDS as spacy_STOPWORDS
spacy_en = spacy.load('en')

from wordcloud import WordCloud, STOPWORDS

# pandas and plotting config
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', -1)

## 1. Torchtext Overview

![image.png](attachment:image.png)

Torchtext takes in raw data in the form of *text files, csv/tsv files, json files, and directories* (as of now) and converts them to Datasets. **Datasets** are simply blocks of data read into memory that have been preprocessed according to how the user has specified various fields to be processed. Datasets are basically a way to standardize raw data in various forms into one canonical form that other data structures can use.

Having created a Dataset, torchtext then passes the Dataset to an Iterator. **Iterators** handle numericalizing, batching, packaging, and moving the data to the GPU. In short, they do all the heavy lifting needed to pass the data to a neural network.

In [3]:
PATH = 'data'

os.makedirs(f'{PATH}/models', exist_ok=True)
os.makedirs(f'{PATH}/tmp', exist_ok=True)

In [4]:
raw_train_df = pd.read_csv(f'{PATH}/train.csv')
test_df = pd.read_csv(f'{PATH}/test.csv')
sample_subm_df = pd.read_csv(f'{PATH}/sample_submission.csv')

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
raw_train_df['none'] = 1 - raw_train_df[label_cols].max(axis=1)

## 2. Declaring Fields

Tell torchtext what you want the data to look like, and it will handle the rest. You do this by defining a **Field**, which specifies how a given column (*field*) of the data should be processed.

In [5]:
tokenize = lambda x: x.split()
TEXT_fld = data.Field(sequential=True, tokenize=tokenize, lower=True)
LABEL_fld = data.Field(sequential=False, use_vocab=False)

By default, a Field expects a sequence of words, from which it builds a mapping from words to integers (this is the **vocab**).

If you are passing in a field that is *already* numericalized and *not* sequential, you should pass `sequential=False, use_vocab=False`.

|       Name      | Description | UseCase |
|:---------------:|:-----------:|:-------:|
| Field           |A regular field that defines preprocessing and postprocessing|Non-text fields and text fields where you don’t need to map integers back to words|
| ReversibleField |An extension of the field that allows reverse mapping of word ids to words|Text fields if you want to map the integers back to natural language (such as in the case of language modeling)|
| NestedField     |A field that takes non-tokenized text and processes it into a set of smaller fields|Char-based models|

## 3. Constructing the Dataset

First we need to make sure the data is cleaned up for torchtext to use by doing the following:
1. Replace `\n` characters with `" "` in the training and test .csv files
2. Create training, validation, and test .csv files

In [6]:
# torchtext cannot read the .csv files correctly if there are newline characters, so replace with " "
raw_train_df.comment_text = raw_train_df.comment_text.str.replace("\n", " ")
test_df.comment_text = test_df.comment_text.str.replace("\n", " ")

In [7]:
# split the training data into train and validation datasets
trn, val = train_test_split(raw_train_df, test_size=0.30, stratify=raw_train_df['none'], random_state=42)
print(len(trn), len(val), len(trn[trn.none != 1]), len(val[val.none != 1]))

# save train, val, and test datasets for torchtext
trn.to_csv(f'{PATH}/train_ds.csv', index=None)
val.to_csv(f'{PATH}/valid_ds.csv', index=None)
test_df.to_csv(f'{PATH}/test_ds.csv', index=None)

111699 47872 11357 4868
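
A quick arithmetic check on the four counts printed above shows what `stratify` did: the share of rows with at least one toxic label should be nearly identical in both splits. This is just a sketch using the printed numbers; the variable names are made up for the example.

```python
# Counts copied from the printed output of the split above.
n_trn, n_val, n_trn_flagged, n_val_flagged = 111699, 47872, 11357, 4868

val_fraction = n_val / (n_trn + n_val)    # should be close to test_size=0.30
trn_flagged_rate = n_trn_flagged / n_trn  # share of train rows with any label set
val_flagged_rate = n_val_flagged / n_val  # share of validation rows with any label set

print(f"validation fraction:  {val_fraction:.4f}")
print(f"flagged rate (train): {trn_flagged_rate:.4f}")
print(f"flagged rate (valid): {val_flagged_rate:.4f}")
```

Both rates come out at roughly 10%, so the class balance of the `none` column is preserved across the two files.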


In [8]:
display(pd.read_csv("data/train_ds.csv").head(2))
display(pd.read_csv("data/valid_ds.csv").head(2))
display(pd.read_csv("data/test_ds.csv").head(2))

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,none
0,28e4f9d2f129d1b8,"""::My class went quite well today. A student tried to teach me something about basketball and his struggles to teach little uninformed me helped demonstrate how hard it is to teach. The whole class then tried to help him out, so they started working together. That feeling of """"let's work on this problem together"""" transferred over into the rest of the class - it was great. One of those rare days. | talk Take all the time you need. I am wrapped in many things right now, so there is no urgency. | talk """,0,0,0,0,0,0,1
1,f54b1d9a5d75ee75,"I truly don't see the issue with the breakdown of Street Walker's chronology. Jackson's solo albums are very much tied in with his appearance and personal life. If anything, perhaps you can lump Off the Wall and Thriller together, and then place Bad and Dangerous tour, and after that the rest. The original layout by Steet Walker seems fine, it doesn't come off like a glorified shrine in any sense of the word. 16:18, 02 February 2006",0,0,0,0,0,0,1


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,none
0,345481186836b351,Why have I been blocked. I haven't done anything to warrant it. It seems an abuse of power by the blocker. No wasn't any warnings at all. Please rescind the block. 192.148.117.79,0,0,0,0,0,0,1
1,d65f3e209e35b44c,"In 1987 film Planes Trains and Automobiles a character complains that a punch to the stomach could have killed him, as it did Houdini. According to snopes.com: http://www.snopes.com/horrors/freakish/houdini.asp",0,0,0,0,0,0,1


Unnamed: 0,id,comment_text
0,00001cee341fdb12,"Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,"
1,0000247867823ef7,"== From RfC == The title is fine as it is, IMO."


There are various built-in Datasets in torchtext that handle common use cases. **For csv/tsv files, the TabularDataset class** is convenient. Here’s how we would read data from a csv file using the TabularDataset:

In [9]:
# train/validation
train_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                    ("comment_text", TEXT_fld), ("toxic", LABEL_fld),
                    ("severe_toxic", LABEL_fld), ("obscene", LABEL_fld),
                    ("threat", LABEL_fld), ("insult", LABEL_fld),
                    ("identity_hate", LABEL_fld), ("none", LABEL_fld)]

train_ds, valid_ds = data.TabularDataset.splits(PATH, train='train_ds.csv', validation='valid_ds.csv',
                                          format='csv', skip_header=True, fields=train_datafields)

# test
test_datafields = [("id", None), ("comment_text", TEXT_fld)]

test_ds = data.TabularDataset(f'{PATH}/test_ds.csv', format='csv', skip_header=True, fields=test_datafields)

For the TabularDataset, we pass in a list of (name, field) pairs as the fields argument. **The fields we pass in must be in the same order as the columns**. For the columns we don’t use, we pass in a tuple where the field element is None.

*Note: The next release of torchtext (and the current version on GitHub) will be able to take a dictionary mapping each column by name to its corresponding field instead of a list.*
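
Because the list is positional, one way to guard against ordering mistakes is to derive it from the CSV header instead of typing it by hand. The sketch below assumes the header shown earlier and uses placeholder strings in place of the real `TEXT_fld` and `LABEL_fld` objects; `field_for_column` and `default_field` are hypothetical names introduced for illustration.

```python
import csv
import io

# Stand-in for the first line of train_ds.csv; in practice, open the real file instead.
csv_text = "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,none\n"

# Placeholder strings stand in for the Field objects defined earlier in the notebook.
field_for_column = {"id": None, "comment_text": "TEXT_fld"}
default_field = "LABEL_fld"  # every remaining column is treated as a label

# Read the header row and pair each column name with its field, in file order.
header = next(csv.reader(io.StringIO(csv_text)))
datafields = [(name, field_for_column.get(name, default_field)) for name in header]

for pair in datafields:
    print(pair)
```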

The **splits** method creates a dataset for the train and validation data by applying the same processing. It can also handle the test data, but since our test data has a different format from the train and validation data, we create a separate dataset for it.

**Datasets** can be treated like lists of preprocessed, bundled data.

Each one contains a list of **Example** objects, and its text fields have already been tokenized, as you can see below:

In [10]:
train_ds[0]

<torchtext.data.example.Example at 0x7f1f6dc63780>

In [11]:
train_ds[0].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'none'])

In [12]:
train_ds[1].comment_text[:5]

['i', 'truly', "don't", 'see', 'the']

Build your **vocab** (i.e., your mapping of words to integers).  *Note: you probably want to build your vocabulary on the training set only.*

Torchtext has its own class called Vocab for handling the vocabulary. The **Vocab class** holds a mapping from word to id in its `stoi` attribute and a reverse mapping in its `itos` attribute. In addition to this, it [can automatically build an embedding matrix for you using various pretrained embeddings like word2vec](http://mlexplained.com/2018/02/15/language-modeling-tutorial-in-torchtext-practical-torchtext-part-2/). 

The Vocab class can also take options like `max_size` and `min_freq` that dictate how many words are in the vocabulary or how many times a word has to appear to be registered in the vocabulary. 

Words that are not included in the vocabulary will be converted into `<unk>`, a token standing for “unknown”.
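
To make the roles of `stoi`, `itos`, `min_freq`, `max_size`, and the unknown token concrete, here is a small pure-Python sketch. `build_vocab` is a hypothetical helper written for this example, not torchtext's implementation, and the toy corpus is made up.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, max_size=None, unk="<unk>"):
    """Toy vocabulary builder; returns (itos, stoi, freqs)."""
    freqs = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent words first; ties broken alphabetically for determinism.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, count in ranked if count >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    itos = [unk] + kept                                  # id -> word
    stoi = {word: i for i, word in enumerate(itos)}      # word -> id
    return itos, stoi, freqs

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
itos, stoi, freqs = build_vocab(corpus, min_freq=2)

# Words that fell below min_freq are looked up as the unknown token.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "bird", "sat"]]
print(itos)
print(ids)
```

Here only `the` and `sat` appear at least twice, so `bird` is mapped to the id of `<unk>`.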

In [26]:
TEXT_fld.build_vocab(train_ds, min_freq=10)

In [27]:
# The vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.
TEXT_fld.vocab.freqs.most_common(10)

[('the', 343496),
 ('to', 206144),
 ('of', 155888),
 ('and', 152384),
 ('a', 148039),
 ('i', 136933),
 ('you', 132362),
 ('is', 119338),
 ('that', 102607),
 ('in', 98348)]

Here is a list of the datasets currently available and the format of data they take in:

| Name | Description | Use Case |
| --- | --- | --- |
|  TabularDataset | Takes paths to csv/tsv files and json files or Python dictionaries as inputs. | Any problem that involves a label (or labels) for each piece of text |
|  LanguageModelingDataset | Takes the path to a text file as input. | Language modeling |
|  TranslationDataset |  Takes a path and extensions to a file for each language, e.g. if the files are English: “hoge.en”, French: “hoge.fr”, then path=”hoge”, exts=(“en”,”fr”) |  Translation |
|  SequenceTaggingDataset | Takes a path to a file with the input sequence and output sequence separated by tabs. | Sequence tagging |


## 4. Constructing the iterator

In torchvision and PyTorch, **the processing and batching of data is handled by DataLoaders**. For some reason, **torchtext has renamed the objects that do the exact same thing to Iterators**. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP.

Below is code to initialize iterators for the train, validation, and test datasets:

In [35]:
train_iter, val_iter = data.BucketIterator.splits(
    (train_ds, valid_ds), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(32, 32),
    device=-1, # if you want to use the GPU, specify the GPU number here
    sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False,
    repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

**BucketIterator** automatically shuffles and buckets the input sequences into sequences of similar length.  

This is powerful because we need to pad the input sequences to be of the same length to enable batch processing. The amount of padding necessary is determined by the longest sequence in the batch. Therefore, padding is most efficient when the sequences are of similar lengths. The BucketIterator does all this behind the scenes. 

When we pass data into a neural network, every sequence in a batch must be padded to the same length so that the batch can be processed at once:

e.g. [ [3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1] ] -> [ [3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1] ]

If the sequences differ greatly in length, the padding wastes a lot of memory and time.
The BucketIterator groups sequences of similar lengths together for each batch to minimize padding. Handy, right?

You need to tell the BucketIterator what attribute you want to bucket the data on. In our case, we want to bucket based on the lengths of the comment_text field, so we pass that in as a keyword argument.
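
The saving from bucketing can be sketched in plain Python by counting padding tokens. This is a simplification and not how torchtext implements it: it sorts made-up sequence lengths, cuts them into batches, and compares the padding needed against batching in arbitrary order. `pad_cost` and `chunks` are hypothetical helpers.

```python
import random

def pad_cost(batches):
    """Padding tokens needed when each batch is padded to its longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

def chunks(items, size):
    """Split a list into consecutive batches of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Lengths 4, 2, 5 from the padding example above: 1 + 3 + 0 padding tokens.
example_cost = pad_cost([[4, 2, 5]])

random.seed(0)
lengths = [random.randint(1, 100) for _ in range(320)]  # made-up sequence lengths
batch_size = 32

naive_batches = chunks(lengths, batch_size)             # arbitrary order
bucketed_batches = chunks(sorted(lengths), batch_size)  # similar lengths together
random.shuffle(bucketed_batches)  # shuffle batch order so lengths do not arrive sorted

print("padding tokens, arbitrary order:   ", pad_cost(naive_batches))
print("padding tokens, bucketed by length:", pad_cost(bucketed_batches))
```

The function passed as `sort_key` plays the role of the sort criterion here: it tells the iterator which length to group on.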

Output from BucketIterator

In [36]:
batch = next(train_iter.__iter__()); batch

<torchtext.data.batch.Batch at 0x7f1f13470438>

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [37]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'train', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'none'])

For the test set, we don't want the data to be shuffled. This is why we'll be using a standard Iterator.

In [38]:
# train=False so that the iterator does not shuffle the test data
test_iter = data.Iterator(test_ds, batch_size=64, device=-1, train=False, sort=False, sort_within_batch=False, repeat=False)

Here’s a list of the Iterators that torchtext currently implements:

| Name | Description | Use Case |
| --- | --- | --- |
|  Iterator | Iterates over the data in the order of the dataset. | Test data, or any other data where the order is important. |
|  BucketIterator | Buckets sequences of similar lengths together. | Text classification, sequence tagging, etc. (use cases where the input is of variable length) |
|  BPTTIterator | An iterator built especially for language modeling that also generates the input sequence delayed by one timestep. It also varies the BPTT (backpropagation through time) length. This iterator deserves its own post, so I’ll omit the details here. |  Language modeling |

For more on the BPTTIterator, see: http://mlexplained.com/2018/02/15/language-modeling-tutorial-in-torchtext-practical-torchtext-part-2/
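
The "input sequence delayed by one timestep" idea can be shown with a short sketch. It is not the BPTTIterator itself: it uses a fixed window length and a made-up stream of token ids, and `bptt_batches` is a hypothetical helper.

```python
def bptt_batches(token_ids, bptt_len):
    """Yield (inputs, targets) windows where targets are the inputs shifted by one step."""
    for start in range(0, len(token_ids) - 1, bptt_len):
        end = min(start + bptt_len, len(token_ids) - 1)
        yield token_ids[start:end], token_ids[start + 1:end + 1]

stream = [10, 11, 12, 13, 14, 15, 16]  # made-up token ids for one long text
pairs = list(bptt_batches(stream, bptt_len=3))

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

At every position the target is simply the next token, which is what a language model is trained to predict.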

## 5. Wrapping the Iterator

Currently, the iterator returns a custom datatype called **torchtext.data.Batch**. 

The **Batch type** has a similar API to the Example type, with a batch of data from each field as attributes. Unfortunately, this custom datatype makes code reuse difficult (since each time the column names change, we need to modify the code), and makes torchtext hard to use with other libraries for some use cases (like torchsample and *fastai*).

I hope this will be dealt with in the future (I’m considering filing a PR if I can decide what the API should look like), but in the meantime, we’ll hack on a simple wrapper to make the batches easy to use.

Concretely, we’ll convert the batch to a tuple in the form (x, y) where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data). Here’s the code:

In [39]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([ getattr(batch, feat).unsqueeze(1) for feat in self.y_vars ], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)

In [40]:
train_dl = BatchWrapper(train_iter, "comment_text", label_cols + ['none'])
valid_dl = BatchWrapper(val_iter, "comment_text", label_cols + ['none'])
test_dl = BatchWrapper(test_iter, "comment_text", None)

In [41]:
next(train_dl.__iter__())

(Variable containing:
 
 Columns 0 to 10 
      0      0  28697    179      8      0     57      0     15     18   8394
    997      0    850   1570    206   1697    175      0      9   1348     31
    278      9   4570    425   1715   9740     13   1763      6  10572     48
      0      0   5878      0     22   1076   2541    938   6690      0     19
      1      1      1    425     42    190      0    680      0     18  11947
 
 Columns 11 to 21 
    646   1006    216      7    467     44   9475   1054  15965    467    467
      0      4  14411     20      0    796  16717     49      0      0      0
     71      8      5    992  12070     48   1671      6  15108   4156   8941
    168      5   1112      3    666   1942      0   1830    440   1989      0
   3584   1731    160    422   1132  19167   1733      0      0      0  18462
 
 Columns 22 to 31 
   7974  28318      8     40    467   2640    250      0   1711    467
   4788      9     21     14      0   7227     15  11882     12  

## 6. Training the model

Define a simple LSTM

In [43]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

In [68]:
class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, emb_sz, n_factors=300, n_hidden=256, n_linear=1, out_sz=1,
                 spatial_dropout=0.05, recurrent_dropout=0.1):
        super().__init__() # don't forget to call this!
        
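        self.emb = nn.Embedding(emb_sz, n_factors)
        # note: despite the class name, this LSTM is unidirectional (bidirectional defaults to False),
        # and dropout only acts between stacked layers, so it has no effect with num_layers=1
        self.encoder = nn.LSTM(n_factors, n_hidden, num_layers=1, dropout=recurrent_dropout)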
        
        self.linear_layers = []
        for _ in range(n_linear - 1):
            self.linear_layers.append(nn.Linear(n_hidden, n_hidden))
            
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(n_hidden, out_sz)
    
    def forward(self, seq):
        hdn, _ = self.encoder(self.emb(seq))
        
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
            
        preds = self.predictor(feature)
        return preds

In [69]:
emb_sz = len(TEXT_fld.vocab)
n_factors = 200
nh = 500
nl = 1 #3

model = SimpleBiLSTMBaseline(emb_sz, n_factors, n_hidden=nh, n_linear=nl, out_sz=7); model

SimpleBiLSTMBaseline(
  (emb): Embedding(28819, 200)
  (encoder): LSTM(200, 500, dropout=0.1)
  (linear_layers): ModuleList(
  )
  (predictor): Linear(in_features=500, out_features=7, bias=True)
)

In [70]:
# if you're using a GPU, remember to call model.cuda() to move your model to the GPU.
# model.cuda()

### Training Loop

In [71]:
import tqdm

In [72]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()

In [73]:
epochs = 2

In [76]:
%%time

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    
    model.train() # turn on training mode
    
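    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        # BCEWithLogitsLoss takes (logits, targets); with the arguments swapped the loss
        # can go negative, which is what the run logged below shows
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        # x is (seq_len, batch), so take the batch size from y
        # (loss.data[0] is the pre-0.4 PyTorch idiom; newer versions use loss.item())
        running_loss += loss.data[0] * y.size(0)
        
    epoch_loss = running_loss / len(train_ds)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)
        val_loss += loss.data[0] * y.size(0)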

    val_loss /= len(valid_ds)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))



Epoch: 1, Training Loss: -7385.2417, Validation Loss: -9113.6593


100%|██████████| 3491/3491 [1:12:20<00:00,  1.24s/it]


Epoch: 2, Training Loss: -12390.2384, Validation Loss: -13713.4056
CPU times: user 7h 2min 27s, sys: 2h 34min 58s, total: 9h 37min 25s
Wall time: 2h 36min 13s
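
Binary cross-entropy is never negative when the targets lie in [0, 1], so the negative losses in the log above are what one would expect if the logits and targets reached the loss in swapped order (`nn.BCEWithLogitsLoss` takes the logits first, then the targets). The pure-Python sketch below uses the usual numerically stable formula on one made-up logit and label to show the effect; `bce_with_logits` is a hypothetical helper, not the PyTorch function.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logit, target = 3.0, 1.0  # made-up model output and 0/1 label

correct = bce_with_logits(logit, target)  # (prediction, label) order
swapped = bce_with_logits(target, logit)  # label treated as the logit and vice versa

print(f"correct order: {correct:.4f}")
print(f"swapped order: {swapped:.4f}")
```

With the swapped order the "target" is an unbounded logit, so the `- logit * target` term can pull the loss arbitrarily far below zero as the model's outputs grow.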


### Predictions

In [85]:
test_preds = []

for x, y in tqdm.tqdm(test_dl):
    preds = model(x)
    
    # if your data is on the GPU, you need to move it back to the CPU
    # preds = preds.data.cpu().numpy()
    preds = preds.data.numpy()
    
    # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
    preds = 1 / (1 + np.exp(-preds))
    
    test_preds.append(preds)
    
# test_preds = np.vstack(test_preds)  # stacks the per-batch (batch_size, 7) arrays row-wise




In [84]:
test_preds[0].shape

(64, 7)

### Prepare submission

In [None]:
subm_df = pd.read_csv("data/test.csv")

# test_preds is a list of (batch_size, 7) arrays; stack them into one (n_rows, 7) array
all_preds = np.vstack(test_preds)

for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    subm_df[col] = all_preds[:, i]

# write the submission file to disk (the target directory must exist first)
os.makedirs(f'{PATH}/submisions', exist_ok=True)
subm_df.drop("comment_text", axis=1).to_csv(f'{PATH}/submisions/subm1.csv', index=False)