# PyTorch CNN for NLP

### Introduction

In the last lesson, we saw how we could apply convolutional layers and pooling to an NLP problem.  The first step is translating the sequence of words in a document to the corresponding word vectors. 

<img src="./assets/sentiment9.png" width="40%">

From there, we apply kernels that span the full length of a word vector.  The number of consecutive words a kernel should capture determines the number of rows of the kernel.

<img src="./assets/sentiment12.png" width="32%"> <img src="./assets/sentiment13.png" width="32%"> <img src="./assets/sentiment14.png" width="32%">


Each kernel that we apply results in a different one-dimensional vector.

<img src="./activation-heat.png" width="20%">
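
As a rough illustration of the idea, the pure-Python sketch below slides a two-row kernel down a tiny document of three-dimensional word vectors. The vectors, the kernel values, and the helper name `apply_kernel` are all made up for this example; a real model would use `nn.Conv2d` with learned weights.

```python
# Toy document: 4 words, each a 3-dimensional word vector (made-up values).
doc = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]

# A kernel covering 2 words at a time: 2 rows, and as wide as a word vector.
kernel = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

def apply_kernel(doc, kernel):
    """Hypothetical helper: slide the kernel down the document, one word at a time."""
    n_rows = len(kernel)
    activations = []
    for start in range(len(doc) - n_rows + 1):
        window = doc[start:start + n_rows]
        # Multiply matching entries of the window and the kernel, then sum.
        total = sum(w * k
                    for word_vec, kernel_row in zip(window, kernel)
                    for w, k in zip(word_vec, kernel_row))
        activations.append(total)
    return activations

activation_map = apply_kernel(doc, kernel)
print(activation_map)       # one number per 2-word window
print(len(activation_map))  # number of words - kernel rows + 1
```

Because the kernel is as wide as a word vector, it can only slide in one direction, which is why the result is a one-dimensional vector.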

And we can summarize the entire activation map across the document with average or max pooling.

<img src="./assets/sentiment15.png" width="50%">
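
To make the two pooling options concrete, here is a minimal sketch on a made-up activation map. The values are illustrative only.

```python
# Made-up activation map: one value per window of words in a document.
activations = [0.25, 1.5, 0.0, 0.75]

# Max pooling keeps the strongest response found anywhere in the document.
max_pooled = max(activations)

# Average pooling keeps the mean response across the document.
avg_pooled = sum(activations) / len(activations)

print(max_pooled, avg_pooled)
```

Either way, a document of any length is reduced to a single number per kernel.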

### Loading our Data

> **Before running any cells**, go to `Runtime`, and then `Change Runtime Type`, to change Colab to use a GPU, and high memory if available.

Ok, let's begin by loading up our data from IMDB.

In [2]:
import torch
from torchtext import data
from torchtext import datasets

As we know, we need to initialize a `Field` and a `LabelField` to perform some initial preprocessing upon download.

> Initialize `TEXT` with the spacy tokenizer.  Set **batch_first** to `True`.

In [6]:
TEXT = data.Field(tokenize = 'spacy', batch_first = True)



In [5]:
TEXT.tokenize
# functools.partial(<function _spacy_tokenize at 0x128a00a70>, spacy=<spacy.lang.en.English object at 0x12b57ecd0>)

functools.partial(<function _spacy_tokenize at 0x128a00a70>, spacy=<spacy.lang.en.English object at 0x12b57ecd0>)

Initialize the `LABEL` field and set the datatype as float.

In [7]:
LABEL = data.LabelField(dtype = torch.float)

In [9]:
LABEL.dtype

# torch.float32

torch.float32

Then download the IMDB dataset, passing through the `TEXT` and `LABEL`.

In [11]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

Then, we can numericalize our data by calling the `build_vocab` method to associate each word in our corpus with both an index and a related vector.  If a word is not found in the pretrained GloVe vectors, it is represented with a random vector, as specified in `unk_init`.

In [7]:
import torch
# device = torch.device('cuda')
TEXT.build_vocab(train_data, 
                 max_size = 25000, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

.vector_cache/glove.6B.zip: 862MB [13:28, 1.07MB/s]                               
100%|█████████▉| 399999/400000 [00:24<00:00, 16462.77it/s]
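
To see roughly what building a vocabulary involves, here is a pure-Python sketch on a toy corpus. The helpers `build_toy_vocab` and `numericalize` are hypothetical stand-ins and not torchtext's actual implementation; the sketch only assumes, as the indices later in this lesson show, that `<unk>` and `<pad>` occupy the first two positions.

```python
from collections import Counter

# Toy corpus standing in for the tokenized IMDB reviews.
corpus = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["the", "movie"],
    ["the"],
]

def build_toy_vocab(corpus, max_size):
    """Hypothetical helper: keep the max_size most frequent words, plus two special tokens."""
    counts = Counter(word for doc in corpus for word in doc)
    itos = ["<unk>", "<pad>"] + [word for word, _ in counts.most_common(max_size)]
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Words outside the vocabulary fall back to the index of <unk>.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

itos, stoi = build_toy_vocab(corpus, max_size=3)
print(len(itos))                                      # max_size + 2 special tokens
print(numericalize(["the", "acting", "was"], stoi))   # "acting" maps to <unk>
```

The same arithmetic explains why a `max_size` of 25000 gives a vocabulary of 25002 entries.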


Then call `build_vocab` on `LABEL` to convert the labels into integers.

In [33]:
LABEL.build_vocab(train_data)

Then, we use our `BucketIterator` to batch our data.

> In doing so, initialize `device = torch.device('cuda')` and pass it through as the `device` keyword argument.

In [197]:
device = torch.device('cuda')
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = 64, device = device)

Now, let's take a look at a batch of our data.

In [233]:
for batch in train_iterator:
    text = batch.text
    labels = batch.label
    break



From here, we can access our text and labels.

In [219]:
text.shape

torch.Size([64, 725])

In [220]:
labels.shape

torch.Size([64])
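
Every document in a batch has the same length because shorter documents are filled out with the padding index. The sketch below shows the idea in pure Python; `pad_batch` is a hypothetical helper, the indices are made up, and it assumes padding is added on the right.

```python
PAD_IDX = 1  # index of '<pad>' in this lesson's vocabulary

def pad_batch(docs, pad_idx=PAD_IDX):
    """Hypothetical helper: right-pad every document to the longest one in the batch."""
    max_len = max(len(doc) for doc in docs)
    return [doc + [pad_idx] * (max_len - len(doc)) for doc in docs]

# Three numericalized documents of different lengths (made-up indices).
docs = [[5, 9, 2], [7, 3], [4, 8, 6, 2, 11]]
padded = pad_batch(docs)
for row in padded:
    print(row)
```

This is why the second number in `text.shape` changes from batch to batch: it is the length of the longest document in that batch.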

### Translating to PyTorch

Ok, now let's begin to build our neural network in PyTorch.  The first step is to initialize our embedding.  

1. Define the  **Embedding**
* As we know, the number of embeddings should equal the number of words in our vocabulary.  And the `embedding_dim` should equal the number of columns for each word vector.  

First use the vocab object to find the number of words in the vocabulary.  Assign it to `num_embeddings`.

In [221]:
num_embeddings = len(TEXT.vocab.itos)

Ok, now define a neural network with only an embedding.  The `forward` method should take in numericalized text and return the corresponding word vectors from our randomly initialized embedding.

> We can add in the word vectors from GloVe later on.

In [225]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(25002, 
                                      100)
    def forward(self, text):        
        #text = [batch size, sent len]
        embedded = self.embedding(text)
        return embedded


In [226]:
net = Net()
net

# Net(
#   (embedding): Embedding(25002, 100)
# )

Net(
  (embedding): Embedding(25002, 100)
)

Now let's test this out with our first batch of input data.

In [229]:
net(text).shape

# torch.Size([64, 725, 100])
# middle number may be different

torch.Size([64, 725, 100])

Ok, so we can see that our batch has 64 observations, a certain number of words per observation, and that each word is represented by a vector of length 100.
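
An embedding layer is essentially a lookup table: each index selects one row of its weight matrix. The toy sizes and names below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny embedding: 6 words in the vocabulary, 4 numbers per word vector.
toy_embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)

# A batch of 2 documents, each 3 word indices long.
toy_text = torch.tensor([[0, 2, 5],
                         [1, 1, 3]])

toy_embedded = toy_embedding(toy_text)
print(toy_embedded.shape)  # [batch size, sent len, embedding dim]

# The lookup gives the same result as indexing rows of the weight matrix.
print(torch.equal(toy_embedded, toy_embedding.weight[toy_text]))
```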

2. Define the convolutions

* Next up is to define our first convolutional layer.  

For the first convolutional layer, the number of input channels should be one, and we can have 100 different filters.  The width of the kernel should be equal to the length of the word vector, and its height should be three, so that the kernel is applied to three words at a time.  Let's assign the convolution as `conv_0`.

In [70]:
net(text).shape

# torch.Size([64, 1146, 100])

torch.Size([64, 1, 1146, 100])

> Copy the code from the earlier neural network to get you started.

In [237]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(25002, 
                                      100)
        self.conv_0 = nn.Conv2d(in_channels = 1, 
                                out_channels = 100, 
                                kernel_size = (3, 100))
    def forward(self, text):        
        #text = [batch size, sent len]
        embedded = self.embedding(text)
        embedded_with_channel = embedded.unsqueeze(1)
        conved_0 = F.relu(self.conv_0(embedded_with_channel).squeeze(3))
        #conved_0 = [batch size, n_filters, sent len - 3 + 1]
        #e.g. [64, 100, 723] when sent len is 725
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[-1]).squeeze(2)
        return pooled_0

In [238]:
net = Net()

In [236]:
net(text).shape

torch.Size([64, 100])

Ok, now let's fill in the forward method in the code above.  

* First, after the data is returned from the embedding, we'll have to add a dimension for the channel.  This way, before being passed to a convolution, the data is in the shape of `torch.Size([batch size, channel, document length, word vectors])`.  The channel size is 1.

* Then apply the sequence of `conv > relu > maxpool1d`.
    * Before maxpooling the shape will be `[64, 100, document size - 3 + 1]`
    * We should maxpool across the entire length of the vector.
    * After maxpooling, shape will be `[64, 100]`
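
The steps above can be traced on toy sizes to watch the shape change at each stage. This is a minimal sketch with made-up dimensions (2 documents, 7 words, 4-dimensional word vectors, 6 filters), not the lesson's actual network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the embedding output: [batch size, sent len, embedding dim].
toy_embedded = torch.randn(2, 7, 4)

# 6 filters, each covering 3 words and the full width of a word vector.
toy_conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=(3, 4))

with_channel = toy_embedded.unsqueeze(1)           # [2, 1, 7, 4]
conved = F.relu(toy_conv(with_channel))            # [2, 6, 5, 1]
conved = conved.squeeze(3)                         # [2, 6, 5], since 7 - 3 + 1 = 5
pooled = F.max_pool1d(conved, conved.shape[-1])    # [2, 6, 1]
pooled = pooled.squeeze(2)                         # [2, 6]

for name, t in [("with_channel", with_channel), ("conved", conved), ("pooled", pooled)]:
    print(name, tuple(t.shape))
```

Passing the full length of the activation map as the pooling window is what collapses each filter's output to a single number.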

> Check the shape of your data, returning the output from max pooling.

In [241]:
net_max_pool = Net()

In [242]:
net_max_pool(text).shape

# torch.Size([64, 100])

torch.Size([64, 100])

So now for each document, we have a single summary number from each filter.

### Adding more Convolutional Layers

Let's keep going with our neural network.

Currently, we have a single convolutional layer that extracts features from a sequence of three words in a review at a time.  We can add four-word and five-word sequences to the mix by adding another convolutional layer for each, and also passing them the outputs from the embedding layer.

<img src="./cnndiag.png" width="40%">

We'll then concatenate all of these outputs together, and pass the concatenated vectors (one for each observation) to a neural network.

So we'll have two additional convolutional layers, assigned to `conv_1` and `conv_2`.  One takes in n-grams of length 4, the other of length 5, with the sequences of: 

* `embedding > conv_1 > relu > max pool`
* `embedding > conv_2 > relu > max pool`

Before passing the data to the linear layer concatenate the outputs from the convolutional layers together with:

```python
concat_pooled = torch.cat((pooled_0, pooled_1, pooled_2), dim = 1)
```
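
As a quick check of what `dim = 1` does, the sketch below concatenates three stand-in tensors with toy sizes (2 documents, 4 filters per convolution). The names and values are made up for illustration.

```python
import torch

# Stand-ins for the pooled outputs of three convolutions.
toy_pooled_0 = torch.zeros(2, 4)
toy_pooled_1 = torch.ones(2, 4)
toy_pooled_2 = torch.full((2, 4), 2.0)

# dim=1 joins along the feature axis, so each document keeps its own row.
toy_concat = torch.cat((toy_pooled_0, toy_pooled_1, toy_pooled_2), dim=1)
print(toy_concat.shape)
print(toy_concat[0])
```

With 100 filters per convolution instead of 4, each row would have `3*100` entries.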

Finally, let's finish up with a linear layer that takes in `3*100` inputs: one input for each of the output channels in each of the convolutional layers.  It has a single output, a raw score that the sigmoid function would map to a number between 0 and 1.  
> Do not worry about a final activation layer.  That can be taken care of by the cost function.



However, we can add a dropout layer that we apply before passing our data to the linear layer.

In [248]:
class MultiConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(25002, 
                                      100,
                              padding_idx = 1)
        self.conv_0 = nn.Conv2d(in_channels = 1, 
                                out_channels = 100, 
                                kernel_size = (3, 100))
        self.conv_1 = nn.Conv2d(in_channels = 1, 
                                out_channels = 100, 
                                kernel_size = (4, 100))
        self.conv_2 = nn.Conv2d(in_channels = 1, 
                                out_channels = 100, 
                                kernel_size = (5, 100))
        
        self.dropout = nn.Dropout(.5)
        self.linear = nn.Linear(3*100, 1)
        
    def forward(self, text):        
        #text = [batch size, sent len]
        embedded = self.embedding(text)
        embedded_with_channel = embedded.unsqueeze(1)
        conved_0 = F.relu(self.conv_0(embedded_with_channel).squeeze(3))
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[-1]).squeeze(2)
        
        conved_1 = F.relu(self.conv_1(embedded_with_channel).squeeze(3))
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[-1]).squeeze(2)
        #pooled_1 = [batch size, n_filters]
        conved_2 = F.relu(self.conv_2(embedded_with_channel).squeeze(3))
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[-1]).squeeze(2)
        concat_pooled = torch.cat((pooled_0, pooled_1, pooled_2), dim = 1)
        dropout = self.dropout(concat_pooled)
        L1 = self.linear(dropout)
        return L1
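
Dropout behaves differently in training and evaluation mode, which is worth seeing once on a toy input. The names below are made up for this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_dropout = nn.Dropout(0.5)
features = torch.ones(1, 8)

# In training mode each entry is either zeroed or scaled by 1 / (1 - p) = 2.
toy_dropout.train()
train_out = toy_dropout(features)
print(train_out)

# In evaluation mode dropout passes the input through unchanged.
toy_dropout.eval()
eval_out = toy_dropout(features)
print(eval_out)
```

This is why a model containing dropout is usually switched with `.eval()` before making predictions on test data.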

Check that the neural network structure matches what's specified below.

In [249]:
multi_conv = MultiConvNet()
multi_conv

# MultiConvNet(
#   (embedding): Embedding(25002, 100, padding_idx=1)
#   (conv_0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
#   (conv_1): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
#   (conv_2): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
#   (dropout): Dropout(p=0.5, inplace=False)
#   (linear): Linear(in_features=300, out_features=1, bias=True)
# )

MultiConvNet(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (conv_0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
  (conv_1): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  (conv_2): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
  (dropout): Dropout(p=0.5, inplace=False)
  (linear): Linear(in_features=300, out_features=1, bias=True)
)

To perform the calculations in the `multi_conv` net on cuda, run the following line.

In [None]:
multi_conv = multi_conv.to(device)

### Using the Pretrained embeddings

Ok, now let's move over our pretrained embeddings to the neural network.  Get the pretrained embeddings from the vocab object on our `TEXT` field.  Assign them to `pretrained_embeddings`.

In [None]:
pretrained_embeddings = TEXT.vocab.vectors

In [250]:
pretrained_embeddings.shape
# torch.Size([25002, 100])

torch.Size([25002, 100])

Then use `copy_` to copy the embeddings into the `weight.data` attribute of `multi_conv`'s embedding layer.

In [180]:
multi_conv.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.3092, -0.4611,  0.3395,  ..., -0.2275, -0.7580, -0.2616],
        [ 0.2219, -1.2999,  0.4127,  ...,  1.4192, -0.3857,  1.1795],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.3177, -0.4069, -0.4231,  ...,  0.2272, -0.6000,  0.3436],
        [ 0.6385,  1.1301,  0.1818,  ...,  0.4541, -0.7086, -0.4312],
        [ 0.6339,  0.0847, -1.0509,  ...,  0.2354, -1.3884, -0.3012]])

Then, we'll zero the initial weights of the unknown and padding tokens.  To do so, first get the `unk_token` and `pad_token` from the `TEXT` field.

In [253]:
unknown_token = TEXT.unk_token
unknown_token

'<unk>'

Then get the padding token.

In [255]:
padding_token = TEXT.pad_token
padding_token

# '<pad>'

'<pad>'

Now these are both strings.  So next we find the corresponding index of the unknown and the padding token in the vocabulary.

In [256]:
unknown_idx = TEXT.vocab.stoi[TEXT.unk_token]
unknown_idx

# 0

0

In [258]:
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
pad_idx

# 1

1

Now select the vectors at those indices in the embedding and replace them with a vector of zeros.

In [None]:
multi_conv.embedding.weight.data[unknown_idx] = torch.zeros(100)
multi_conv.embedding.weight.data[pad_idx] = torch.zeros(100)
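
The same copy-then-zero procedure can be checked on a tiny embedding. In this sketch `toy_vectors` is a random stand-in for `TEXT.vocab.vectors`, and the sizes and names are made up.

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained vectors: 5 words, 3 numbers per vector.
toy_vectors = torch.randn(5, 3)
toy_unk_idx, toy_pad_idx = 0, 1

toy_embedding = nn.Embedding(5, 3, padding_idx=toy_pad_idx)

# copy_ overwrites the weights in place, so the shapes must match.
toy_embedding.weight.data.copy_(toy_vectors)

# Zero the rows for the unknown and padding tokens.
toy_embedding.weight.data[toy_unk_idx] = torch.zeros(3)
toy_embedding.weight.data[toy_pad_idx] = torch.zeros(3)

print(toy_embedding.weight.data[:2])                                # two rows of zeros
print(torch.equal(toy_embedding.weight.data[2:], toy_vectors[2:]))  # other rows keep the copied values
```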

## Train the Model

It's time to move on to training the model.  First let's define our optimizer by passing through the parameters and setting a learning rate of `.0005`.

In [198]:
import torch.optim as optim
optimizer = optim.Adam(multi_conv.parameters(), lr = .0005)

In [259]:
optimizer

# Adam (
# Parameter Group 0
#     amsgrad: False
#     betas: (0.9, 0.999)
#     eps: 1e-08
#     lr: 0.0005
#     weight_decay: 0
# )

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.0005
    weight_decay: 0
)

Next up is to define the loss function.  For the loss function initialize an instance of `BCEWithLogitsLoss`.

> We'll have both the neural network and the loss function run on the cuda device.

In [None]:
bce_loss = nn.BCEWithLogitsLoss()

bce_loss = bce_loss.to(device)

> `BCEWithLogitsLoss` stands for binary cross entropy with logits loss.  Now binary cross entropy we've seen before.  It's simply cross entropy when the label is either 1 or 0.  In other words, it's another name for log loss.  Remember that log loss rewards a hypothesis function for predicting a number close to 0 when the true label is 0, and close to 1 when the true label is 1.


In [265]:
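import numpy as np
# With a predicted probability of p = .01:
# first value: the label is 1, so the loss is -log(p)
# second value: the label is 0, so the loss is -log(1 - p)
-np.log(.01), -np.log(1 - .01)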


(4.605170185988091, 0.01005033585350145)

> So that takes care of the BCE part.  The "with logits" part refers to the output of our neural network.  Remember, we did not pass the output through an activation function (in the binary case it would be the sigmoid function), so the network returns raw scores, or logits, and the loss function applies the sigmoid itself.     
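
A plain-Python sketch of the calculation for a single prediction may help here. The helper names `sigmoid` and `bce_with_logits` are hypothetical, and this is only the textbook formula; PyTorch's `BCEWithLogitsLoss` combines the two steps in a more numerically stable way.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def bce_with_logits(logit, label):
    """Hypothetical helper: log loss computed from a raw score rather than a probability."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# A confident, correct prediction gives a small loss ...
print(bce_with_logits(4.0, 1.0))
# ... and the same score with the opposite label gives a large one.
print(bce_with_logits(4.0, 0.0))
# A logit of 0 means p = 0.5, so the loss is log(2) for either label.
print(bce_with_logits(0.0, 1.0))
```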

Now for the loss function, we'll pass through a batch of predictions.

In [266]:
predictions = multi_conv(batch.text)

In [267]:
predictions[:2]

tensor([[-1.2708],
        [-0.3760]], grad_fn=<SliceBackward>)

In [269]:
batch.label[:3]

tensor([1., 0., 1.])

> But the predictions have an extra dimension of size one, which we need to remove so that they are the same shape as the labels.  So we squeeze the predictions and then pass them, and the labels, into our `bce_loss` function.

In [270]:
loss = bce_loss(predictions.squeeze(1), batch.label)
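
The shape difference is easy to see on toy tensors. The values below are made up; only the shapes matter.

```python
import torch

toy_predictions = torch.tensor([[-1.2], [0.4], [2.0]])  # shape [3, 1], like the model output
toy_labels = torch.tensor([1., 0., 1.])                 # shape [3], like batch.label

squeezed = toy_predictions.squeeze(1)
print(toy_predictions.shape, toy_labels.shape)
print(squeezed.shape)  # now matches the labels

# squeeze(1) only removes dimension 1, so a batch of one keeps its batch dimension,
# whereas a bare squeeze() would remove every dimension of size one.
single = torch.ones(1, 1)
print(single.squeeze(1).shape, single.squeeze().shape)
```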

Ok, now let's move through our training loop.  We can go through six epochs, and see how we do.  

> This may take a while, so feel free to move on to the next lesson after things are running properly.

In [202]:
for epoch in range(6):
    for batch in train_iterator:
        optimizer.zero_grad()
        predictions = multi_conv(batch.text)
        
        loss = bce_loss(predictions.squeeze(1), batch.label)
        loss.backward()
        optimizer.step()
    print(loss)  # loss on the last batch of the epoch



tensor(0.4136, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
tensor(0.3570, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
tensor(0.2350, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
tensor(0.1163, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
tensor(0.1894, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
tensor(0.0493, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)


When this is complete, let's move on to testing our model.  To do this, we move through the batches of our test data, counting the correct predictions and the number of samples for each class.  Summing these counts across batches, rather than averaging per-batch scores, gives us a properly weighted accuracy at the end.

In [None]:
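# class names by label index (assumed source for `classes`, which is not defined elsewhere)
classes = LABEL.vocab.itos
class_correct = list(0. for i in range(2))
class_total = list(0. for i in range(2))
multi_conv.eval()  # turn off dropout while evaluating
with torch.no_grad():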
    for batch in test_iterator:
        outputs = multi_conv(batch.text)
        labels = batch.label
        hard_outputs = torch.round(torch.sigmoid(outputs.reshape(-1))).int()
        is_corrects = (labels == hard_outputs).int()
        for label, is_correct in zip(labels, is_corrects):
            label_int = label.int().item()
            class_correct[label_int] += is_correct.item()
            class_total[label_int] += 1

In [None]:
for i in range(2):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))
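
The reason for keeping counts rather than per-batch scores shows up when batches differ in size. This sketch uses made-up batch results to compare the two ways of averaging.

```python
# Made-up per-batch results: (number correct, number of samples).
batch_results = [(50, 64), (60, 64), (10, 16)]  # the last batch is smaller

# Summing counts weights every sample equally.
total_correct = sum(correct for correct, _ in batch_results)
total_samples = sum(size for _, size in batch_results)
overall_accuracy = total_correct / total_samples

# Averaging per-batch accuracies gives the small batch as much say as a full one.
naive_average = sum(correct / size for correct, size in batch_results) / len(batch_results)

print(overall_accuracy, naive_average)
```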

### Sanity Check

Finally, we can use the following function to try out how our model handles some random text.  The function tokenizes a sentence, and then has our model predict the sentiment.  Try it out below.

In [203]:
import spacy
nlp = spacy.load('en')  # on spaCy 3 and later, load 'en_core_web_sm' instead

def predict_sentiment(model, sentence, min_len = 5):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)  # same device as the model
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

An example negative review...

In [213]:
predict_sentiment(multi_conv, "not good")

0.5142894387245178

An example positive review...

In [216]:
predict_sentiment(multi_conv, "it's good")

0.9988067150115967
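
The `min_len = 5` default in `predict_sentiment` is there because the largest kernel in the network covers five words, so a shorter input would leave it with no valid window. The sketch below isolates that padding step; `pad_tokens` is a hypothetical helper that mirrors it.

```python
def pad_tokens(tokens, min_len=5, pad_token="<pad>"):
    """Hypothetical helper mirroring the padding step inside predict_sentiment."""
    if len(tokens) < min_len:
        tokens = tokens + [pad_token] * (min_len - len(tokens))
    return tokens

short = pad_tokens(["not", "good"])
print(short)

# Windows available to a kernel covering 5 words: length - 5 + 1.
windows_padded = len(short) - 5 + 1
windows_unpadded = len(["not", "good"]) - 5 + 1
print(windows_padded)    # exactly one window after padding
print(windows_unpadded)  # negative: no valid window without padding
```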

### Resources

This lab is based on the excellent material by [Bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis).