# Chapter 5. Text Classification

In [1]:
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from torchtext import data 
import torchtext
from pathlib import Path
import pandas as pd
import spacy

A famous example of embedding vectors is word2vec, which was released by Google in 2013. This was a set of word embeddings trained using a shallow neural network, and it revealed that the transformation into vector space seemed to capture something about the concepts underpinning the words. In its most commonly cited finding, if you pulled the vectors for King, Man, and Woman and then subtracted the vector for Man from King and added the vector for Woman, you would get a result that was close to the vector representation for Queen. Since word2vec, other pretrained embeddings have become available, such as ELMo, GloVe, and fastText.
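To make the King/Man/Woman arithmetic concrete, here is a toy sketch. The three-dimensional vectors are hand-made for illustration (they are not real word2vec values), and the nearest word is picked by cosine similarity, which is an assumption about how you would compare vectors rather than something the text above prescribes.

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative values only).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "mat":   [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the words that were not part of the query by similarity to the result.
candidates = [word for word in vectors if word not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda word: cosine(vectors[word], target))
print(nearest)  # queen
```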

As for using embeddings in PyTorch, it’s really simple:

In [2]:
vocab_size=5
dimension_size=10
embed = nn.Embedding(vocab_size, dimension_size)

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to the CPU when no GPU is present

The embed layer will contain a tensor of vocab_size x dimension_size initialized randomly. I prefer to think that it’s just a giant array or lookup table.

Each word in your vocabulary indexes into an entry that is a vector of dimension_size, so if we go back to our cat and its epic adventures on the mat, we’d have something like this:

In [4]:
cat_mat_embed = nn.Embedding(5, 2)
cat_tensor = torch.LongTensor([0]) # max equal to 4 because we have a vocab of size 5
cat_mat_embed.forward(cat_tensor)

tensor([[-1.0167, -0.2432]], grad_fn=<EmbeddingBackward>)

We create our embedding, build a tensor that contains the position of cat in our vocabulary, and pass it through the layer’s forward() method. That gives us our random embedding. The result also points out that we have a gradient function that we can use for updating the parameters after we combine it with a loss function.
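If the lookup-table picture helps, the same idea can be sketched without PyTorch at all. The vocabulary and its index order below are hypothetical, and the table is a plain list of lists rather than a trainable tensor, so this shows only the indexing, not the gradient machinery.

```python
import random

random.seed(0)  # only to make the sketch repeatable

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}  # hypothetical vocabulary
embedding_dim = 2

# One randomly initialized row per word, like a freshly created embedding layer.
table = [[random.gauss(0, 1) for _ in range(embedding_dim)] for _ in vocab]

def lookup(word):
    # An embedding "forward pass" is just row selection by index.
    return table[vocab[word]]

print(lookup("cat"))
```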

## Loading & Data Cleaning

In [5]:
path_data = "data/tweets/training.1600000.processed.noemoticon.csv"

In [6]:
tweetsDF = pd.read_csv(path_data, engine="python", header=None)

In [7]:
tweetsDF.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


Annoyingly, we don’t have a header field in this CSV (again, welcome to the world of a data scientist!), but by looking at the website and using our intuition, we can see that what we’re interested in is the last column (the tweet text) and the first column (our labeling). However, the labels aren’t great, so let’s do a little feature engineering to work around that. Let’s see what counts we have in our training set:

In [8]:
tweetsDF[0].value_counts()

4    800000
0    800000
Name: 0, dtype: int64

Curiously, there are no neutral values in the training dataset. This means that we could formulate the problem as a binary choice between 0 and 1 and work out our predictions from there, but for now we stick to the original plan that we may possibly have neutral tweets in the future. To encode the classes as numbers starting from 0, we first create a column of type category from the label column:

In [9]:
tweetsDF["sentiment_cat"] = tweetsDF[0].astype('category')

Then we encode those classes as numerical information in another column:

In [10]:
tweetsDF["sentiment"] = tweetsDF["sentiment_cat"].cat.codes
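In case the two-step category conversion feels like magic, here is roughly what it amounts to, sketched in plain Python. This is not how pandas implements it internally; it only mirrors the result for our labels, where the sorted unique values become the categories and each label is replaced by its category’s position.

```python
raw_labels = [0, 4, 4, 0, 4]  # labels as they appear in the CSV: 0 and 4

# Sorted unique values play the role of the categories...
categories = sorted(set(raw_labels))
# ...and each label is replaced by the position of its category.
code_of = {label: code for code, label in enumerate(categories)}
codes = [code_of[label] for label in raw_labels]

print(categories)  # [0, 4]
print(codes)       # [0, 1, 1, 0, 1]
```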

We save the processed training data, along with a random sample of 10,000 rows:

In [11]:
tweetsDF.to_csv("data/tweets/train-processed.csv", header=None, index=None)      
tweetsDF.sample(10000).to_csv("data/tweets/train-processed-sample.csv", header=None, index=None)

## Defining Fields

torchtext takes a straightforward approach to generating datasets: you tell it what you want, and it’ll process the raw CSV (or JSON) for you. You do this by first defining fields. The Field class has a considerable number of parameters that can be assigned to it.

As we noted, we’re interested in only the labels and the tweet text. We define these by using the Field datatype:

In [12]:
LABEL = data.LabelField()
TWEET = data.Field(tokenize='spacy', lower=True)

We’re defining LABEL as a LabelField, which is a subclass of Field that sets sequential to False (as it’s our numerical category class). TWEET is a standard Field object, where we have decided to use the spaCy tokenizer and convert all the text to lowercase, but otherwise we’re using the defaults. 

If, when running through this example, the step of building the vocabulary is taking a very long time, try removing the tokenize parameter and rerunning. This will use the default of simply splitting on whitespace, which will speed up the tokenization step considerably, though the created vocabulary will not be as good as the one spaCy creates.

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

In [13]:
fields = [('score',None), ('id',None),('date',None),('query',None),
      ('name',None),
      ('tweet', TWEET),('category',None),('label',LABEL)]

Armed with our declared fields, we now use TabularDataset to apply that definition to the CSV:

In [14]:
twitterDataset = torchtext.data.TabularDataset(
        path="data/tweets/train-processed.csv", 
        format="CSV", 
        fields=fields,
        skip_header=False)

Finally, we can split into training, testing, and validation sets by using the split() method:

In [15]:
(train, test, valid)=twitterDataset.split(split_ratio=[0.6,0.2,0.2],stratified=True, strata_field='label')

(len(train),len(test),len(valid))

(960000, 320000, 320000)
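To see what a stratified split buys us, here is a minimal sketch of the idea in plain Python. The function name, the toy data, and the rounding rule are my own assumptions; torchtext’s implementation differs in its details, but the goal is the same: every split keeps the label balance of the full dataset.

```python
import random
from collections import defaultdict

def stratified_split(examples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split (text, label) pairs so each part keeps the overall label balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    parts = ([], [], [])
    for group in by_label.values():
        rng.shuffle(group)
        first_cut = round(len(group) * ratios[0])
        second_cut = first_cut + round(len(group) * ratios[1])
        parts[0].extend(group[:first_cut])
        parts[1].extend(group[first_cut:second_cut])
        parts[2].extend(group[second_cut:])
    return parts

# Toy data: 10 negative (label 0) and 10 positive (label 1) "tweets".
examples = [(f"tweet {i}", i % 2) for i in range(20)]
first, second, third = stratified_split(examples)
print(len(first), len(second), len(third))  # 12 4 4
```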

Here’s an example pulled from the dataset:

In [16]:
vars(train.examples[7])

{'tweet': ['monday', 'morning', 'blues'], 'label': '0'}

## Building a Vocabulary

Traditionally, at this point we would build a one-hot encoding of each word that is present in the dataset—a rather tedious process. 
Thankfully, torchtext will do this for us, and will also allow a max_size parameter to be passed in to limit the vocabulary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. 

We don’t want our GPUs too overwhelmed, after all. Let’s limit the vocabulary to a maximum of 20,000 words in our training set:

In [17]:
vocab_size = 20000
TWEET.build_vocab(train, max_size = vocab_size)
LABEL.build_vocab(train)
TWEET.vocab.freqs.most_common(10)

[('i', 598952),
 ('!', 542705),
 ('.', 485482),
 (' ', 352487),
 ('to', 338680),
 ('the', 313109),
 (',', 289756),
 ('a', 228502),
 ('my', 189686),
 ('and', 181814)]

We can then interrogate the vocab class instance object to make some discoveries about our dataset. First, we ask the traditional “How big is our vocabulary?”:

In [18]:
len(TWEET.vocab)

20002

Wait, wait, what? Yes, we specified 20,000, but by default, torchtext will add two more special tokens, <unk> for unknown words (e.g., those that get cut off by the 20,000 max_size we specified), and <pad>, a padding token that will be used to pad all our text to roughly the same size to help with efficient batching on the GPU (remember that a GPU gets its speed from operating on regular batches). 

You can also specify eos_token or init_token symbols when you declare a field, but they’re not included by default.
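The arithmetic behind that 20,002 is easy to reproduce on a toy scale. The sketch below is a simplified stand-in for vocabulary building (the function name and the tiny corpus are made up): it keeps the max_size most frequent tokens and puts the two special tokens in front of them.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    counts = Counter(token for text in tokenized_texts for token in text)
    # Special tokens come first, then the max_size most frequent tokens.
    itos = ["<unk>", "<pad>"] + [token for token, _ in counts.most_common(max_size)]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

texts = [
    ["monday", "morning", "blues"],
    ["monday", "again"],
    ["morning", "coffee", "monday"],
]
itos, stoi = build_vocab(texts, max_size=3)

print(len(itos))                          # 5: three words plus two special tokens
print(stoi.get("coffee", stoi["<unk>"]))  # 0: words that were cut off map to <unk>
```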
    
The most common tokens listed earlier are pretty much what you’d expect, as we’re not removing stop-words with our spaCy tokenizer. (Because tweets are limited to 140 characters, we’d be in danger of losing too much information from our model if we did.)

We are almost finished with our datasets. We just need to create a data loader to feed into our training loop.

torchtext provides the BucketIterator class, which produces what it calls a Batch, which is almost, but not quite, like the data loader we used on images. (You’ll see shortly that we have to update our training loop to deal with some of the oddities of the Batch interface.)

In [19]:
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test),
    batch_size=32,
    device=device,
    sort_key=lambda x: len(x.tweet),
    sort_within_batch=False)
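The reason for passing a sort_key is bucketing: examples of similar length are batched together so that little padding is needed. Here is a deliberately simplified sketch of that idea (it is not BucketIterator’s actual algorithm, and make_batches is a made-up helper):

```python
def make_batches(token_lists, batch_size, pad_token="<pad>"):
    # Sorting by length puts similar-length examples in the same batch,
    # which keeps the amount of padding small.
    ordered = sorted(token_lists, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        longest = max(len(tokens) for tokens in batch)
        batches.append(
            [tokens + [pad_token] * (longest - len(tokens)) for tokens in batch]
        )
    return batches

tweets = [
    ["monday", "morning", "blues"],
    ["ugh"],
    ["so", "tired"],
    ["best", "day", "ever", "!"],
]
for batch in make_batches(tweets, batch_size=2):
    print(batch)
```

With this toy input, only two padding tokens are needed in total; batching the tweets in their original order would need four.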

With our data processing sorted, we can move on to defining our model.

## Creating Our Model

In [20]:
class OurFirstLSTM(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(OurFirstLSTM, self).__init__()
    
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.encoder = nn.LSTM(input_size=embedding_dim,  
                hidden_size=hidden_size, num_layers=1)
        self.predictor = nn.Linear(hidden_size, 2)

    def forward(self, seq):
        output, (hidden,_) = self.encoder(self.embedding(seq))
        preds = self.predictor(hidden.squeeze(0))
        return preds

model = OurFirstLSTM(100,300, 20002)
model.to(device)

OurFirstLSTM(
  (embedding): Embedding(20002, 300)
  (encoder): LSTM(300, 100)
  (predictor): Linear(in_features=100, out_features=2, bias=True)
)

All we do in this model is create three layers. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding. 

That’s then fed into an LSTM with 100 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). 

Finally, the output of the LSTM (the final hidden state after processing the incoming tweet) is pushed through a standard fully connected layer with two outputs, one for each of our two possible classes.

In [21]:
torch.Tensor([[[1, 2], [2, 3]]]).squeeze(0).shape # remove the first dimension that is one and so useless

torch.Size([2, 2])

## Updating the Training Loop

Because of some of torchtext’s quirks, we need to write a slightly modified training loop. First, we create an optimizer (we use Adam as usual) and a loss function. Because we were given three potential classes for each tweet, we use CrossEntropyLoss() as our loss function. 

However, it turns out that only two classes are present in the dataset; if we assumed there would be only two classes, we could in fact change the output of the model to produce a single number between 0 and 1 and then use binary cross-entropy (BCE) loss (and we can combine the sigmoid layer that squashes output between 0 and 1 plus the BCE layer into a single PyTorch loss function, BCEWithLogitsLoss()). 

I mention this because if you’re writing a classifier that must always be one state or the other, it’s a better fit than the standard cross-entropy loss that we’re about to use.
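To show why the sigmoid and the BCE loss can be folded into one function, here is a scalar sketch in plain Python. It is not PyTorch’s implementation, just the underlying algebra: the combined form gives the same value as applying sigmoid and then binary cross-entropy, but it only ever calls exp() on a non-positive number, which is kinder numerically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(probability, target):
    # Binary cross-entropy on a probability strictly between 0 and 1.
    return -(target * math.log(probability)
             + (1 - target) * math.log(1 - probability))

def bce_with_logits(logit, target):
    # Algebraically identical to bce(sigmoid(logit), target), rearranged so
    # that exp() never receives a positive argument.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1), (2.0, 0), (-1.5, 1)]:
    two_step = bce(sigmoid(logit), target)
    one_step = bce_with_logits(logit, target)
    print(round(two_step, 6), round(one_step, 6))
```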

In [22]:
optimizer = optim.Adam(model.parameters(), lr=2e-2)
criterion = nn.CrossEntropyLoss()

def train(epochs, model, optimizer, criterion, train_iterator, valid_iterator):
    for epoch in range(1, epochs + 1):
        training_loss = 0.0
        valid_loss = 0.0

        model.train()
        for batch in train_iterator:
            optimizer.zero_grad()
            predict = model(batch.tweet)
            loss = criterion(predict, batch.label)
            loss.backward()
            optimizer.step()
            # batch.tweet is [sequence length, batch size], so size(0) is the sequence length
            training_loss += loss.item() * batch.tweet.size(0)
        training_loss /= len(train_iterator)  # len() here is the number of batches

        model.eval()
        with torch.no_grad():  # no gradients are needed for validation
            for batch in valid_iterator:
                predict = model(batch.tweet)
                loss = criterion(predict, batch.label)
                valid_loss += loss.item() * batch.tweet.size(0)
        valid_loss /= len(valid_iterator)

        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}'.format(epoch, training_loss, valid_loss))

The main thing to be aware of in this new training loop is that we have to reference batch.tweet and batch.label to get the particular fields we’re interested in; they don’t fall out quite as nicely from the enumerator as they do in torchvision.
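One more detail worth knowing when you log losses: a per-example average is the easiest number to compare across runs. With torchtext’s default layout, batch.tweet has the shape [sequence length, batch size], so the batch size is size(1), not size(0). The sketch below uses made-up batch losses to show the bookkeeping; it assumes the criterion returns the mean loss of each batch, which is the default for CrossEntropyLoss.

```python
# (mean loss, number of examples) for three hypothetical batches; values are made up.
batch_results = [(0.70, 32), (0.65, 32), (0.90, 8)]

total_loss = 0.0
total_examples = 0
for mean_batch_loss, batch_size in batch_results:
    # Multiplying the batch mean by the batch size recovers the summed loss,
    # so a smaller final batch is not over-weighted.
    total_loss += mean_batch_loss * batch_size
    total_examples += batch_size

epoch_loss = total_loss / total_examples
print(round(epoch_loss, 4))  # 0.7
```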

Once we’ve trained our model by using this function, we can use it to classify some tweets to do simple sentiment analysis.

In [23]:
train(2, model, optimizer, criterion, train_iterator, valid_iterator)

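Epoch: 1, Training Loss: 22.93, Validation Loss: 13.43
Epoch: 2, Training Loss: 22.42, Validation Loss: 13.33

Note that these numbers are not per-example averages. Each batch loss is multiplied by batch.tweet.size(0), which is the sequence length rather than the batch size (torchtext lays batches out as [sequence length, batch size] by default), and the sum is then divided by the number of batches. The values are therefore inflated by roughly the typical padded tweet length and are useful only for comparing one epoch with the next.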


## Classifying Tweets

Another hassle of torchtext is that it’s a bit of a pain to get it to predict things. What you can do is emulate the processing pipeline that happens internally and make the required prediction on the output of that pipeline, as shown in this small function:

In [24]:
def classify_tweet(tweet):
    categories = {0: "Negative", 1:"Positive"}
    processed = TWEET.process([TWEET.preprocess(tweet)])
    processed = processed.to(device)
    return categories[model(processed).argmax().item()]

We have to call preprocess(), which performs our spaCy-based tokenization. After that, we can call process() to turn the tokens into a tensor based on our already-built vocabulary. The only thing we have to be careful about is that torchtext is expecting a batch of tokenized strings, so we have to wrap the token list in another list before handing it off to the processing function. Then we feed it into the model. This will produce a tensor that looks like this:

tensor([[ 0.7828, -0.0024]])

The tensor element with the highest value corresponds to the model’s chosen class, so we use argmax() to get the index of that, and then item() to turn that zero-dimension tensor into a Python integer that we index into our categories dictionary.
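The same selection can be traced by hand with the example scores shown earlier. This plain-Python sketch also computes a softmax, which is an extra step of mine for turning the raw scores into probabilities; it never changes which score is the largest, so argmax on the raw output is enough for classification.

```python
import math

logits = [0.7828, -0.0024]  # the example model output shown earlier
categories = {0: "Negative", 1: "Positive"}

# Softmax: exponentiate each score and normalize so the results sum to 1.
exps = [math.exp(value) for value in logits]
probabilities = [value / sum(exps) for value in exps]

# argmax: the index of the largest raw score.
predicted_index = max(range(len(logits)), key=lambda i: logits[i])
print(categories[predicted_index], [round(p, 3) for p in probabilities])
```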

## Data Augmentation

You might wonder exactly how you can augment text data. After all, you can’t really flip it horizontally as you can an image! But you can use some techniques with text that will provide the model with a little more information for training. First, you could replace words in the sentence with synonyms.
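As a rough sketch of synonym replacement, assuming a hand-written synonym table (the SYNONYMS dictionary and the function name are made up for illustration; a real table might come from a thesaurus such as WordNet):

```python
import random

# Hypothetical hand-written synonym table.
SYNONYMS = {"happy": ["glad", "cheerful"], "sad": ["unhappy", "gloomy"]}

def synonym_replacement(words, n, rng=random):
    """Return a copy of words with up to n replaceable words swapped for a synonym."""
    result = list(words)
    replaceable = [i for i, word in enumerate(result) if word in SYNONYMS]
    for i in rng.sample(replaceable, min(n, len(replaceable))):
        result[i] = rng.choice(SYNONYMS[result[i]])
    return result

print(synonym_replacement(["so", "happy", "today"], n=1, rng=random.Random(0)))
```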

In early 2019, the paper “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks” suggested three other augmentation strategies: random insertion, random swap, and random deletion. Let’s take a look at each of them.

### Random Insertion

A random insertion technique looks at a sentence and then randomly inserts synonyms of existing non-stop-words into the sentence n times. Assuming you have a way of getting a synonym of a word and a way of eliminating stop-words (common words such as and, it, the, etc.), represented, but not implemented, in this function by get_synonyms() and remove_stopwords(), an implementation of this would be as follows:

In [28]:
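import random

# Note: you'll have to define remove_stopwords() and get_synonyms() elsewhere.
# sentence is a list of words; get_synonyms() is assumed to return one synonym for a word.

def random_insertion(sentence, n):
    words = remove_stopwords(sentence)
    for _ in range(n):
        new_synonym = get_synonyms(random.choice(words))
        # Insert at a random position, including the very end of the sentence
        sentence.insert(random.randrange(len(sentence) + 1), new_synonym)
    return sentence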
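If you want something you can run straight away, here is one possible way to fill in the missing helpers. The stop-word list and synonym table are tiny hypothetical stand-ins, get_synonym() is my own helper returning a single synonym, and this variant works on a copy instead of modifying the list it is given.

```python
import random

STOPWORDS = {"the", "on", "and", "it", "a"}  # hypothetical stop-word list
SYNONYMS = {"cat": ["kitty", "feline"], "mat": ["rug"], "sat": ["perched"]}  # hypothetical

def remove_stopwords(words):
    return [word for word in words if word not in STOPWORDS]

def get_synonym(word, rng):
    # Fall back to the word itself when the table has no entry for it.
    return rng.choice(SYNONYMS.get(word, [word]))

def random_insertion(words, n, rng=random):
    augmented = list(words)  # leave the caller's list untouched
    candidates = remove_stopwords(words)
    if not candidates:
        return augmented
    for _ in range(n):
        synonym = get_synonym(rng.choice(candidates), rng)
        augmented.insert(rng.randrange(len(augmented) + 1), synonym)
    return augmented

print(random_insertion("the cat sat on the mat".split(), n=2, rng=random.Random(1)))
```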

### Random Deletion

As the name suggests, random deletion deletes words from a sentence. Given a probability parameter p, it will go through the sentence and decide whether to delete a word or not based on that random probability:

In [27]:
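def random_deletion(words, p=0.5):
    # Nothing sensible to delete from an empty or one-word sentence
    if len(words) <= 1:
        return words
    # Keep each word with probability 1 - p
    remaining = list(filter(lambda x: random.uniform(0,1) > p,words))
    if len(remaining) == 0:
        # Everything was deleted, so fall back to one random original word
        return [random.choice(words)]
    else:
        return remaining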

The implementation deals with the edge cases—if there’s only one word, the technique returns it; and if we end up deleting all the words in the sentence, the technique samples a random word from the original set.

### Random Swap

The random swap augmentation takes a sentence and then swaps words within it n times, with each iteration working on the previously swapped sentence. Here’s an implementation:

In [29]:
def random_swap(sentence, n=5):
    length = range(len(sentence))
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1]
    return sentence

We sample two random numbers based on the length of the sentence, and then just keep swapping until we hit n.
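A slightly more defensive variant is sketched below. It is my own variation rather than anything from the EDA paper: it works on a copy, accepts an optional random generator so results can be reproduced, and returns early for sentences with fewer than two words, where sampling two distinct positions would raise an error.

```python
import random

def random_swap(words, n=5, rng=random):
    swapped = list(words)  # work on a copy
    if len(swapped) < 2:
        return swapped     # nothing to swap
    for _ in range(n):
        i, j = rng.sample(range(len(swapped)), 2)
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

print(random_swap("the cat sat on the mat".split(), n=3, rng=random.Random(0)))
```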

The techniques in the EDA paper average about a 3% improvement in accuracy when used with small amounts of labeled examples (roughly 500). 

If you have more than 5,000 examples in your dataset, the paper suggests that this improvement may fall to 0.8% or lower, because with that much data the model already generalizes well and EDA has less to add.

### Back Translation

Another popular approach for augmenting datasets is back translation. This involves translating a sentence from our target language into one or more other languages and then translating all of them back to the original language. 

We can use the Python library googletrans for this purpose. Install it with pip (pip install googletrans), and then translate our sentence from English to French and back again:

In [33]:
import googletrans
import random

translator = googletrans.Translator()

sentences = ['The cat sat on the mat']

translations_fr = translator.translate(sentences, dest='fr')
fr_text = [t.text for t in translations_fr] 
translations_en = translator.translate(fr_text, dest='en')
en_text = [t.text for t in translations_en]
print(en_text)   

['The cat sat on the mat']


That gives us an augmented sentence from English to French and back again, but let’s go a step further and select a language at random:

In [39]:
available_langs = list(googletrans.LANGUAGES.keys())
tr_lang = random.choice(available_langs)
print(f"Translating to {googletrans.LANGUAGES[tr_lang]}")

translations = translator.translate(sentences, dest=tr_lang)
t_text = [t.text for t in translations]
print(t_text)

translations_en_random = translator.translate(t_text, src=tr_lang, dest='en')
en_text = [t.text for t in translations_en_random]
print(en_text)

Translating to yiddish
['די קאַץ איז געזעסן אויף די ראָגאָזשע']
['The cat sat on the mat']


In this case, we use random.choice to grab a random language, translate to that language, and then translate back as before. We also pass in the language to the src parameter just to help the language detection of Google Translate along. Try it out and see how much it resembles the old game of Telephone.

You need to be aware of a few limits. First, you can translate only up to 15,000 characters at a time, though that shouldn’t be too much of a problem if you’re just translating sentences. Second, if you are going to use this on a large dataset, you want to do your data augmentation on a cloud instance rather than your home computer, because if Google bans your IP, you won’t be able to use Google Translate for normal use! Make sure that you send a few batches at a time rather than the entire dataset at once. This should also allow you to restart translation batches if there’s an error on the Google Translate backend as well.
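One way to stay under the character limit and keep batches restartable is to group sentences before sending them. The helper below is a sketch under that assumption (batch_by_characters is a made-up name, and it does no translating itself); each batch it returns could then be passed to the translator separately.

```python
def batch_by_characters(sentences, max_chars=15000):
    """Group sentences into batches whose combined length stays within max_chars.

    A single sentence longer than max_chars still gets a batch of its own.
    """
    batches, current, current_chars = [], [], 0
    for sentence in sentences:
        if current and current_chars + len(sentence) > max_chars:
            batches.append(current)
            current, current_chars = [], 0
        current.append(sentence)
        current_chars += len(sentence)
    if current:
        batches.append(current)
    return batches

# A tiny limit makes the grouping easy to see.
print(batch_by_characters(["aaaa", "bbbb", "cc", "dddddd"], max_chars=8))
```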

## Augmentation and torchtext

You might have noticed that everything I’ve said so far about augmentation hasn’t involved torchtext. Sadly, there’s a reason for that. Unlike torchvision or torchaudio, torchtext doesn’t offer a transform pipeline, which is a little annoying. It does offer a way of performing pre- and post-processing, but this operates only on the token (word) level, which is perhaps enough for synonym replacement, but doesn’t provide enough control for something like back translation. 

And if you do try to hijack the pipelines for augmentation, you should probably do it in the preprocessing pipeline rather than the post-processing one, because all you’ll see in the post-processing pipeline is a tensor of integers, which you’d have to map back to words via the vocabulary.

For these reasons, I suggest not even bothering with spending your time trying to twist torchtext into knots to do data augmentation. Instead, do the augmentation outside PyTorch using techniques such as back translation to generate new data and feed that into the model as if it were real data.
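In practice that can be as simple as writing an enlarged CSV and pointing the dataset loader at it. The sketch below assumes (text, label) rows and uses a trivial placeholder in place of a real augmentation such as back translation; the names are hypothetical, and the in-memory buffer stands in for a real file.

```python
import csv
import io

def placeholder_augment(text):
    # Stand-in for a real augmentation such as back translation.
    return " ".join(reversed(text.split()))

def augment_rows(rows, augment):
    """Yield each (text, label) row followed by an augmented copy with the same label."""
    for text, label in rows:
        yield text, label
        yield augment(text), label

rows = [("the cat sat on the mat", 1), ("monday morning blues", 0)]

buffer = io.StringIO()  # stands in for open(path, "w", newline="")
writer = csv.writer(buffer)
writer.writerows(augment_rows(rows, placeholder_augment))
print(buffer.getvalue())
```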