# TOC

__Chapter 6 - Deep learning with sequence data and text__

1. [Import](#Import)
1. [Word embedding](#Word-embedding)
    1. [Training word embedding by building a sentiment classifier](#Training-word-embedding-by-building-a-sentiment-classifier)
    1. [torchtext.datasets](#torchtextdatasets)
    1. [Building vocabulary](#Building-vocabulary)
    1. [Generate batches of vectors](#Generate-batches-of-vectors)
1. [Creating a network model with embedding](#Creating-a-network-model-with-embedding)
    1. [Training the model](#Training-the-model)
    1. [Using pretrained word embeddings](#Using-pretrained-word-embeddings)
    1. [Loading the embeddings in the model](#Loading-the-embeddings-in-the-model)
    1. [Freeze the embedding layer weights](#Freeze-the-embedding-layer-weights)
1. [Recurrent neural networks](#Recursive-neural-networks)
    1. [Understanding how RNN works with an example](#Understanding-how-RNN-works-with-an-example)
1. [LSTM](#LSTM)
    1. [Preparing the data](#Preparing-the-data)
    1. [Creating batches](#Creating-batches)
    1. [Creating the network](#Creating-the-network)
    1. [Training the model](#Training-the-model2)
1. [Convolutional network on sequence data](#Convolutional-network-on-sequence-data)
    1. [Creating the network](#Creating-the-network2)
    1. [Training the model](#Training-the-model3)

# Import

<a id = 'Import'></a>

In [None]:
# standard library and settings
import os
import sys
import importlib
import itertools
from PIL import Image
from glob import glob
import warnings

warnings.simplefilter("ignore")
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.options.display.float_format = "{:,.6f}".format

# pytorch tools
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch.autograd import Variable
from torchvision import datasets, models, transforms

# visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
sns.set_style("whitegrid")

# Word embedding

Word embedding is a popular way of representing text data in problems solved by deep learning algorithms. The technique represents each word as a dense vector of floating-point numbers. The vector dimension varies based on the vocabulary size; it is common to use word embeddings of dimension 50, 100, 256, 300 and occasionally 1,000. This size is a hyperparameter.

In contrast, with one-hot encoding a vocabulary of 20,000 words requires 20,000 x 20,000 numbers, the vast majority of which are zero. The same vocabulary can be represented as a word embedding of size 20,000 x (dimension size).
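To make the size difference concrete, the sketch below counts the numbers stored by each representation. The embedding dimension of 300 is an assumption picked from the common sizes listed above, and the five-word vocabulary is purely illustrative.

```python
import torch
import torch.nn.functional as F

vocab_size = 20_000
emb_dim = 300  # assumed dimension, one of the common choices

# number of values needed to represent every word in the vocabulary
one_hot_values = vocab_size * vocab_size
embedding_values = vocab_size * emb_dim
print(one_hot_values, embedding_values)

# tiny illustration with a hypothetical five-word vocabulary
toy_vocab = ["the", "action", "scenes", "were", "great"]
ids = torch.arange(len(toy_vocab))
one_hot = F.one_hot(ids, num_classes=len(toy_vocab))  # 5 x 5, mostly zeros
dense = torch.nn.Embedding(len(toy_vocab), 3)(ids)    # 5 x 3, dense floats
print(one_hot.shape, dense.shape)
```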

One method for creating word embeddings is to start with dense vectors of random numbers for each token, then train a model (such as a document classifier or sentiment classifier). The floating-point numbers in the vectors, which collectively represent the tokens, are adjusted during training so that semantically 'close' words end up with similar representations.

Training word embeddings may not be feasible if there isn't enough data. In these cases, embeddings trained by some other machine learning algorithm can be used.

<a id = 'Word-embedding'></a>

## Training word embedding by building a sentiment classifier

Using a dataset called IMDB (which contains movie reviews), we will build a sentiment classifier. In the process of training the model, we will also train word embeddings for the words in the IMDB dataset. This will be done using a library called torchtext.

The torchtext.data module has a class called Field, which defines how the data needs to be read and tokenized. Below, we define two Field objects, one for the text itself and a second for the labels. The Field constructor also accepts a tokenize argument, which by default uses the str.split function. We can override this by passing in a tokenizer of our choice.
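As a rough illustration of what a replacement tokenizer might look like, the sketch below defines a hypothetical function, simple_tokenizer (not part of torchtext), that lowercases the text and drops punctuation instead of splitting on whitespace only.

```python
import re


def simple_tokenizer(sentence):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


print(simple_tokenizer("The action scenes were top notch!"))
```

A function like this could then be passed to the Field constructor through its tokenize argument.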

<a id = 'Training-word-embedding-by-building-a-sentiment-classifier'></a>

In [None]:
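# define how the review text and the labels are read and tokenized
# (Field belongs to the legacy torchtext API; in torchtext 0.9 to 0.11 it lives under torchtext.legacy)

from torchtext import data

text = data.Field(lower=True, batch_first=True, fix_length=20)
label = data.Field(sequential=False)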

## torchtext.datasets

torchtext.datasets provides wrappers for several different datasets, such as IMDB. This utility abstracts away the process of downloading, tokenizing and splitting the datasets.

<a id = 'torchtextdatasets'></a>

In [None]:
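# download IMDB
# torchtext.datasets replaces the torchvision datasets module imported earlier
from torchtext import datasets

train, test = datasets.IMDB.splits(text, label)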

In [None]:
# inspect the fields attached to the dataset
print("train.fields", train.fields)

# inspect the first example
print(vars(train[0]))

## Building vocabulary

We can use the build_vocab method to build a vocabulary from an object. Below, we pass in the train object and, using the vectors argument, initialize vectors with pretrained embeddings of dimension 300. The max_size argument limits the number of words in the vocabulary, and min_freq removes any word that occurs fewer than 10 times.

Once the vocabulary is built we can obtain different values such as frequency, word index and the vector representation of each word.
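The effect of max_size and min_freq can be sketched with the standard library alone. The helper below is a hypothetical stand-in, not torchtext's implementation, and the special token names are assumptions; it only shows how frequency counts, a frequency cutoff and a size limit combine into word-to-index and index-to-word mappings.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_size=None, min_freq=1):
    """Return (freqs, itos, stoi) for a list of tokenized documents."""
    freqs = Counter(token for doc in tokenized_docs for token in doc)
    # most frequent first; ties broken alphabetically for a stable order
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, count in ranked if count >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    itos = ["<unk>", "<pad>"] + kept                 # index -> word
    stoi = {word: i for i, word in enumerate(itos)}  # word -> index
    return freqs, itos, stoi


docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
freqs, itos, stoi = build_vocab(docs, max_size=2, min_freq=2)
print(freqs["good"], itos, stoi["movie"])
```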

<a id = 'Building-vocabulary'></a>

In [None]:
# build the vocabulary
from torchtext.vocab import GloVe

text.build_vocab(train, vectors=GloVe(name="6B", dim=300), max_size=10000, min_freq=10)
label.build_vocab(train)

In [None]:
# print word frequencies
print(text.vocab.freqs)

In [None]:
# print word vectors, which displays the 300 dimension vector for each word
print(text.vocab.vectors)

In [None]:
# print words and their indexes
print(text.vocab.stoi)

## Generate batches of vectors

BucketIterator is a tool that helps to batch the text and replace the words with their index numbers. The following code creates iterators that generate batches for the train and test objects.
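The replacement of words by indices can be sketched by hand. The mapping and the numericalize helper below are hypothetical and only mimic the idea: look up each token, fall back to an unknown index, and pad or truncate every review to a fixed length so the reviews stack into one tensor.

```python
import torch

# hypothetical word-to-index mapping; index 0 is unknown, index 1 is padding
stoi = {"<unk>": 0, "<pad>": 1, "good": 2, "movie": 3, "bad": 4}


def numericalize(tokens, stoi, fix_length):
    """Map tokens to indices, then truncate or pad to fix_length."""
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens[:fix_length]]
    ids += [stoi["<pad>"]] * (fix_length - len(ids))
    return ids


reviews = [["good", "movie"], ["bad", "bad", "movie", "ending", "though"]]
batch = torch.tensor([numericalize(r, stoi, fix_length=4) for r in reviews])
print(batch)
```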

<a id = 'Generate-batches-of-vectors'></a>

In [None]:
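# create iterators that yield batches of word indices and labels
# (device=-1 selects the CPU in older torchtext releases)
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=18, device=-1, shuffle=True
)

# inspect one batch
batch = next(iter(train_iter))
print(batch.text)

print(batch.label)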

# Creating a network model with embedding

In this section we will create word embeddings in our network architecture, and then train the entire model to predict the sentiment of each review. Once training is complete, we will have a sentiment classifier model as well as the word embeddings for the IMDB dataset.

In the following code, the init function initializes an object of the nn.Embedding class, which takes two arguments: emb_size is the size of the vocabulary, and hidden_size1 is the dimension we want to create for each word. We will set the vocabulary size at 10,000 and the embedding size (hidden_size1) at 10. As a side note, small embeddings are great for speed, but production systems typically use much larger embeddings. The last item in the init function is a linear layer that maps the word embeddings to the sentiment categories: positive, negative, unknown.

The forward function determines how the input is processed. When the batch size is 32 and the sentences have a max length of 20 words, the inputs have a shape of 32 by 20. The first embedding layer acts as a lookup table that replaces each word with the corresponding embedding vector. When the embedding dimension is 10, the output becomes 32 by 20 by 10 after each word is replaced with its embedding. The view() function flattens the result from the embedding layer. The first argument given to view() keeps the batch dimension intact. Since we're not interested in combining data from different reviews in the batch, we have view() preserve the first dimension and flatten the rest of the values in the tensor. Following view(), the tensor shape is 32 by 200. Lastly, a dense layer maps the flattened embeddings to the output categories.
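The shape changes described above can be checked with a short sketch. It uses random word indices in place of real reviews, and the layer names are chosen here for illustration only.

```python
import torch
import torch.nn as nn

batch_size, seq_len = 32, 20      # sizes used in the description above
vocab_size, emb_dim = 10_000, 10

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # random word indices
embedding = nn.Embedding(vocab_size, emb_dim)
fc = nn.Linear(seq_len * emb_dim, 3)

embedded = embedding(x)                 # 32 x 20 x 10
flat = embedded.view(x.size(0), -1)     # 32 x 200
out = fc(flat)                          # 32 x 3
print(embedded.shape, flat.shape, out.shape)
```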

<a id = 'Creating-a-network-model-with-embedding'></a>

In [None]:
# create a network architecture to predict sentiment using word embeddings
class EmbNet(nn.Module):
    def __init__(self, emb_size, hidden_size1, hidden_size2=400):
        super().__init__()
        self.embedding = nn.Embedding(emb_size, hidden_size1)
        self.fc = nn.Linear(hidden_size2, 3)

    def forward(self, x):
        embeds = self.embedding(x).view(x.size(0), -1)
        out = self.fc(embeds)
        return F.log_softmax(out, dim=-1)

## Training the model

<a id = 'Training-the-model'></a>

In [None]:
# function for training the model
def fit(epoch, model, data_loader, phase="training", volatile=False):
    if phase == "training":
        model.train()
    if phase == "validation":
        model.eval()
        volatile = True
    running_loss = 0.0
    running_correct = 0

    for batch_idx, batch in enumerate(data_loader):
        text, target = batch.text, batch.label
        if torch.cuda.is_available():
            text, target = text.cuda(), target.cuda()

        if phase == "training":
            optimizer.zero_grad()
        # forward pass on the batch of word indices
        output = model(text)
        loss = F.nll_loss(output, target)

        # accumulate the summed loss and the number of correct predictions
        running_loss += F.nll_loss(output, target, reduction="sum").item()
        preds = output.max(dim=1, keepdim=True)[1]
        running_correct += preds.eq(target.view_as(preds)).sum().item()
        if phase == "training":
            loss.backward()
            optimizer.step()

    loss = running_loss / len(data_loader.dataset)
    accuracy = 100.0 * running_correct / len(data_loader.dataset)

    print(
        "{0} loss is {1} and {0} accuracy is {2}/{3} {4}".format(
            phase, loss, running_correct, len(data_loader.dataset), accuracy
        )
    )
    return loss, accuracy

In [None]:
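# instantiate the model: vocabulary size, embedding dimension of 10 and a
# flattened input of 20 words x 10 dimensions = 200 for the linear layer
model = EmbNet(len(text.vocab.stoi), 10, 200)
if torch.cuda.is_available():
    model.cuda()

# optimizer used inside fit(); Adam with lr=0.001 is an assumed choice
optimizer = optim.Adam(model.parameters(), lr=0.001)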

train_losses, train_accuracy = [], []
val_losses, val_accuracy = [], []

# by default the batch iterator does not stop generating batches, so the repeat
# attribute needs to be set to False. Otherwise the training process will run indefinitely.
train_iter.repeat = False
test_iter.repeat = False

# about 10 epochs gives a validation accuracy of around 70% (range(1, 10) runs 9 epochs)
for epoch in range(1, 10):
    epoch_loss, epoch_accuracy = fit(epoch, model, train_iter, phase="training")
    val_epoch_loss, val_epoch_accuracy = fit(
        epoch, model, test_iter, phase="validation"
    )
    train_losses.append(epoch_loss)
    train_accuracy.append(epoch_accuracy)
    val_losses.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)

# Using pretrained word embeddings

Pretrained word embeddings can be particularly helpful when working within a specific domain, such as medicine, where there may not be enough data to train embeddings from scratch. There are pretrained embeddings that have been trained on massive corpora, such as Wikipedia, Google News and Twitter tweets. We can use torchtext to easily access these resources. This process works similarly to transfer learning in the context of image classification, and involves three steps:

- Download the embeddings
- Load the embeddings into the model
- Freeze the embedding layer weights

torchtext provides three classes in the vocab module, GloVe, FastText and CharNGram, which facilitate downloading the embeddings and mapping them to our vocabulary.

The vectors argument denotes which embedding class to use, and the name and dim arguments of that class determine which embeddings to load.

<a id = 'Using-pretrained-word-embeddings'></a>

In [None]:
# the build_vocab method of the Field object takes an argument specifying the embedding
from torchtext.vocab import GloVe

# TEXT and LABEL refer to the Field objects defined earlier as text and label
TEXT, LABEL = text, label
TEXT.build_vocab(train, vectors=GloVe(name="6B", dim=300), max_size=10000, min_freq=10)
LABEL.build_vocab(train)

In [None]:
# access the embeddings from the vocab object
TEXT.vocab.vectors

## Loading the embeddings in the model

The vectors attribute returns a torch tensor of shape vocab_size by dimensions containing the pretrained embeddings. We need to store these embeddings as the weights of our embedding layer.
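The idea can be tried out without downloading anything. In the sketch below a random tensor stands in for the pretrained vectors (an assumption made only to keep the example self-contained), and the values are copied into the embedding layer so that each index lookup returns the corresponding pretrained row.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 6, 4
# stand-in for the pretrained vectors; real pretrained vectors would go here
pretrained = torch.randn(vocab_size, emb_dim)

embedding = nn.Embedding(vocab_size, emb_dim)
with torch.no_grad():
    embedding.weight.copy_(pretrained)   # overwrite the random initial weights

# the layer now returns the pretrained vector for each index
idx = torch.tensor([2])
print(torch.equal(embedding(idx)[0], pretrained[2]))
```

PyTorch also offers nn.Embedding.from_pretrained, which builds the layer directly from such a tensor.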



<a id = 'Loading-the-embeddings-in-the-model'></a>

In [None]:
# store the pretrained embeddings as the weights of the embedding layer
# model represents the model object, embedding represents the embedding layer
# (the model must use an embedding dimension of 300, as in the next cell)

model.embedding.weight.data = TEXT.vocab.vectors

In [None]:
# word embeddings architecture
class EmbNet(nn.Module):
    def __init__(self, emb_size, hidden_size1, hidden_size2=400):
        super().__init__()
        self.embedding = nn.Embedding(emb_size, hidden_size1)
        # map the flattened embeddings to the three label categories
        self.fc1 = nn.Linear(hidden_size2, 3)

    def forward(self, x):
        embeds = self.embedding(x).view(x.size(0), -1)
        out = self.fc1(embeds)
        return F.log_softmax(out, dim=-1)


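# the linear layer input is the embedding dimension (300) times the fixed review length
model = EmbNet(len(TEXT.vocab.stoi), 300, 300 * TEXT.fix_length)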

## Freeze the embedding layer weights

Freezing the embedding layer is a two-step process:

1. Set requires_grad to False.
2. Exclude the embedding layer parameters from the optimizer.
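The two steps can be verified on a toy model. The layer sizes, learning rate and data below are arbitrary assumptions; the point is only that after one optimizer step the frozen embedding weights are unchanged while the linear layer has moved.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
embedding = nn.Embedding(10, 4)
fc = nn.Linear(4, 3)

# step 1: stop gradient computation for the embedding weights
embedding.weight.requires_grad = False
# step 2: give the optimizer only the parameters that still require gradients
all_params = list(embedding.parameters()) + list(fc.parameters())
optimizer = optim.SGD([p for p in all_params if p.requires_grad], lr=0.1)

emb_before = embedding.weight.clone()
fc_before = fc.weight.clone()

x = torch.tensor([1, 2, 3])
target = torch.tensor([0, 1, 2])
loss = nn.functional.cross_entropy(fc(embedding(x)), target)
loss.backward()
optimizer.step()

print(torch.equal(embedding.weight, emb_before))  # compares embedding weights before and after
print(torch.equal(fc.weight, fc_before))          # compares linear weights before and after
```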

Up to this point, this architecture doesn't take advantage of the sequential nature of text data. The following sections will explore RNN and Conv1D, which do take advantage of text data structure.

<a id = 'Freeze-the-embedding-layer-weights'></a>

In [None]:
# turn off gradients and create optimizer object
model.embedding.weight.requires_grad = False
optimizer = optim.SGD(
    [param for param in model.parameters() if param.requires_grad], lr=0.001
)


# Recurrent neural networks

Feedforward networks are designed to look at all features at once and map them to the output. RNNs, by contrast, evaluate elements one at a time, retaining information evaluated up to that point in the sequence.


<a id = 'Recursive-neural-networks'></a>

## Understanding how RNN works with an example

Let's use this string of text as an example: "the action scenes were top notch in this movie."

We start by passing the word 'the' into the model, and the model generates two different things:

- State vector - This is passed to the model when it processes the next word in the input string.
- Output vector - The output of the model is reviewed once the last item of the sequence is evaluated.

In other words, the RNN repeatedly passes the state vector back to itself as it moves from item to item in the data sequence.

In the implementation below, the init function initializes two linear layers, one for calculating the output and another for calculating the state/hidden vector. The forward function combines the input and hidden vectors and passes the result through the two linear layers, which generate an output and a state/hidden vector. The log_softmax function is applied to the output. The initHidden function is a helper function that creates a zero-filled hidden vector, which is needed when calling the RNN for the first time, when no state exists yet.

<a id = 'Understanding-how-RNN-works-with-an-example'></a>

In [None]:
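# the 'hidden' variable represents the state vector
# (the RNN class is defined in the next cell; thor_review is a sequence of
# input tensors, each of shape 1 by input_size)
rnn = RNN(input_size, hidden_size, output_size)
hidden = rnn.initHidden()
for i in range(len(thor_review)):
    output, hidden = rnn(thor_review[i], hidden)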

In [None]:
# RNN implementation
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
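For a version of the word-by-word loop that runs on its own, the sketch below swaps the hand-written class for PyTorch's built-in nn.RNNCell. The one-hot encoding of the example sentence and the hidden size of 8 are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sentence = "the action scenes were top notch in this movie".split()
vocab = {word: i for i, word in enumerate(sorted(set(sentence)))}

input_size, hidden_size = len(vocab), 8   # hidden_size is an arbitrary choice
cell = nn.RNNCell(input_size, hidden_size)

hidden = torch.zeros(1, hidden_size)      # initial state: all zeros
for word in sentence:
    x = torch.zeros(1, input_size)
    x[0, vocab[word]] = 1.0               # one-hot vector for the current word
    hidden = cell(x, hidden)              # new state depends on the word and the old state

print(hidden.shape)
```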

# LSTM

The vanilla implementation of an RNN above is rarely used in practice due to issues with vanishing and exploding gradients. Instead, LSTM or GRU are used to address these issues, which arise when dealing with long sequences of data. Generally speaking, LSTMs and other variants of RNN more successfully capture meaning in long sequences by adding different neural networks inside the LSTM that decide which data gets remembered and which data is forgotten.


<a id = 'LSTM'></a>

## Preparing the data

RNNs expect data to be in the form sequence_length by batch_size by features. In the preparation step below, batch_first therefore needs to be set to False.
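The difference between the two layouts is just the order of the first two axes, as the sketch below shows with random word indices. The features axis only appears once the indices have passed through an embedding layer.

```python
import torch

batch_size, seq_len = 32, 200
batch_first = torch.randint(0, 10_000, (batch_size, seq_len))  # 32 x 200

# swap the two axes so the sequence dimension comes first
seq_first = batch_first.t().contiguous()                       # 200 x 32
print(batch_first.shape, seq_first.shape)
```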

<a id = 'Preparing-the-data'></a>

In [None]:
# prepare data
from torchtext.datasets import IMDB
from torchtext.vocab import GloVe

TEXT = data.Field(lower=True, fix_length=200, batch_first=False)
LABEL = data.Field(sequential=False)
train, test = IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train, vectors=GloVe(name="6B", dim=300), max_size=10000, min_freq=10)
LABEL.build_vocab(train)

## Creating batches

BucketIterator is used for creating batches, and each batch has a shape of sequence length by batch size. In this case, the shape will be 200 by 32, where 200 is the sequence length and 32 is the batch size.


<a id = 'Creating-batches'></a>

In [None]:
# create batch iterator object
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=32, device=-1
)
train_iter.repeat = False
test_iter.repeat = False

## Creating the network

In the implementation below, the init method creates an embedding layer with the size of n_vocab by hidden_size. It also creates an LSTM layer and a linear layer. The last layer is a LogSoftmax layer that converts the results from the linear layer to log-probabilities.

The forward function receives an input batch of size 200 by 32, which gets passed through the embedding layer. Each token in the batch gets replaced by its embedding and the size transforms to 200 by 32 by 100, where the dimension of size 100 represents the embeddings. The LSTM layer takes the output of the embedding layer along with two hidden variables. The hidden variables are of the same type as the embedding output and are of size num_layers by batch_size by hidden_size. The LSTM layer processes the data in sequence and generates an output of shape sequence_length by batch_size by hidden_size, where each sequence index holds the output at that step. In this implementation we are only interested in the output of the last step, which has the shape batch_size by hidden_size, and this is passed on to a linear layer, where it is mapped to the output categories. The dropout layer is included to fend off overfitting.
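The shapes named above can be traced with standalone layers. The sketch uses random word indices and the sizes from the description (sequence length 200, batch size 32, hidden size 100, two LSTM layers, three categories); the variable names are chosen here for illustration.

```python
import torch
import torch.nn as nn

seq_len, batch_size = 200, 32
n_vocab, hidden_size, n_layers, n_cat = 10_000, 100, 2, 3

inp = torch.randint(0, n_vocab, (seq_len, batch_size))   # 200 x 32 word indices
embedding = nn.Embedding(n_vocab, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size, n_layers)
fc = nn.Linear(hidden_size, n_cat)

e_out = embedding(inp)                                   # 200 x 32 x 100
h0 = torch.zeros(n_layers, batch_size, hidden_size)      # initial hidden state
c0 = torch.zeros(n_layers, batch_size, hidden_size)      # initial cell state
rnn_out, (h_n, c_n) = lstm(e_out, (h0, c0))              # 200 x 32 x 100
last = rnn_out[-1]                                       # 32 x 100
scores = fc(last)                                        # 32 x 3
print(e_out.shape, rnn_out.shape, last.shape, scores.shape)
```

For a unidirectional LSTM, the last time step of the output matches the final hidden state of the top layer, so h_n[-1] could be used in place of rnn_out[-1].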


<a id = 'Creating-the-network'></a>

In [None]:
# RNN for IMDB dataset
class IMDBRnn(nn.Module):
    def __init__(self, n_vocab, hidden_size, n_cat, bs=1, n1=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.bs = bs
        self.n1 = n1
        self.e = nn.Embedding(n_vocab, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, n1)
        self.fc2 = nn.Linear(hidden_size, n_cat)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, inp):
        bs = inp.size()[1]
        if bs != self.bs:
            self.bs = bs
        e_out = self.e(inp)
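        # zero-initialized hidden and cell states: num_layers x batch_size x hidden_size
        h0 = c0 = e_out.new_zeros((self.n1, self.bs, self.hidden_size))
        rnn_o, _ = self.rnn(e_out, (h0, c0))
        # keep only the output of the last time step
        rnn_o = rnn_o[-1]
        # apply dropout only in training mode
        fc = F.dropout(self.fc2(rnn_o), p=0.8, training=self.training)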
        return self.softmax(fc)

## Training the model



<a id = 'Training-the-model2'></a>

In [None]:
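# instantiate model: vocabulary size, hidden size and three label categories
n_vocab = len(TEXT.vocab)
n_hidden = 100
model = IMDBRnn(n_vocab, n_hidden, 3, bs=32)
if torch.cuda.is_available():
    model = model.cuda()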

# create optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# model fit function
def fit(epoch, model, data_loader, phase="training", volatile=False):
    if phase == "training":
        model.train()
    if phase == "validation":
        model.eval()
        volatile = True
    running_loss = 0.0
    running_correct = 0

    for batch_idx, batch in enumerate(data_loader):
        text, target = batch.text, batch.label
        if torch.cuda.is_available():
            text, target = text.cuda(), target.cuda()

        if phase == "training":
            optimizer.zero_grad()
        output = model(text)
        loss = F.nll_loss(output, target)

        running_loss += F.nll_loss(output, target, reduction="sum").item()
        preds = output.max(dim=1, keepdim=True)[1]
        running_correct += preds.eq(target.view_as(preds)).sum().item()
        if phase == "training":
            loss.backward()
            optimizer.step()

    loss = running_loss / len(data_loader.dataset)
    accuracy = 100.0 * running_correct / len(data_loader.dataset)

    print(
        "{0} loss is {1} and {0} accuracy is {2}/{3} {4}".format(
            phase, loss, running_correct, len(data_loader.dataset), accuracy
        )
    )
    return loss, accuracy


# execute training loop
train_losses, train_accuracy = [], []
val_losses, val_accuracy = [], []
for epoch in range(1, 5):
    epoch_loss, epoch_accuracy = fit(epoch, model, train_iter, phase="training")
    val_epoch_loss, val_epoch_accuracy = fit(
        epoch, model, test_iter, phase="validation"
    )
    train_losses.append(epoch_loss)
    train_accuracy.append(epoch_accuracy)
    val_losses.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)

# Convolutional network on sequence data

Just as CNNs can be used in computer vision problems for images, convolutions can also be helpful in modeling sequential data. One-dimensional convolutions sometimes perform better than RNNs and are computationally cheaper.

The convolution operation works much as it does for images. A kernel with weights of a set length slides along the sequence of data, and at each position returns a value computed by multiplying the kernel weights with the values it covers and summing the results. The original sequence can be padded just as we did with images.
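The sliding-kernel idea can be written out in plain Python. The conv1d helper and the kernel weights below are illustrative assumptions (a real layer learns its weights); the two calls show that the output is shorter than the input without padding and keeps the input length with one zero on each side.

```python
def conv1d(sequence, kernel, padding=0):
    """Slide the kernel along the sequence and return the dot product at each position."""
    padded = [0.0] * padding + list(sequence) + [0.0] * padding
    k = len(kernel)
    return [
        sum(w * v for w, v in zip(kernel, padded[i:i + k]))
        for i in range(len(padded) - k + 1)
    ]


sequence = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]           # illustrative weights; a real layer learns these

print(conv1d(sequence, kernel))             # no padding: 3 output values
print(conv1d(sequence, kernel, padding=1))  # padding of 1: 5 output values
```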

<a id = 'Convolutional-network-on-sequence-data'></a>

## Creating the network

In the implementation below, we replace the LSTM layer with a Conv1d layer and an AdaptiveAvgPool1d layer. The convolution layer takes the sequence length as its number of input channels and the hidden layer size as its number of output channels; the kernel size defaults to 3. AdaptiveAvgPool1d is used to ensure that the input into the linear layer is of a fixed size. AdaptiveAvgPool1d takes an input of any size and generates an output of a given size.
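The sketch below traces the shapes through the two layers, with a random tensor standing in for the embedding output. It assumes batch-first input, a maximum length of 200 and a hidden size of 100, which is what makes the flattened size come out at 1,000.

```python
import torch
import torch.nn as nn

batch_size, max_len, hidden_size = 32, 200, 100

e_out = torch.randn(batch_size, max_len, hidden_size)  # stand-in for embedding output
cnn = nn.Conv1d(max_len, hidden_size, kernel_size=3)
avg = nn.AdaptiveAvgPool1d(10)

cnn_o = cnn(e_out)                        # 32 x 100 x 98
pooled = avg(cnn_o)                       # 32 x 100 x 10
flat = pooled.view(batch_size, -1)        # 32 x 1000
print(cnn_o.shape, pooled.shape, flat.shape)

# the pooling layer yields length 10 regardless of the input length
other = avg(torch.randn(batch_size, hidden_size, 37))
print(other.shape)
```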



<a id = 'Creating-the-network2'></a>

In [None]:
# IMDB CNN; expects batch-first input of shape batch_size by max_len
class IMDBCnn(nn.Module):
    def __init__(self, n_vocab, hidden_size, n_cat, bs=1, kernel_size=3, max_len=200):
        super().__init__()
        self.hidden_size = hidden_size
        self.bs = bs

        self.e = nn.Embedding(n_vocab, hidden_size)
        self.cnn = nn.Conv1d(max_len, hidden_size, kernel_size)
        self.avg = nn.AdaptiveAvgPool1d(10)
        self.fc = nn.Linear(1000, n_cat)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, inp):
        bs = inp.size()[0]
        if bs != self.bs:
            self.bs = bs
        e_out = self.e(inp)
        cnn_o = self.cnn(e_out)
        cnn_avg = self.avg(cnn_o)
        cnn_avg = cnn_avg.view(self.bs, -1)
        # apply dropout only in training mode
        fc = F.dropout(self.fc(cnn_avg), p=0.5, training=self.training)
        return self.softmax(fc)

## Training the model



<a id = 'Training-the-model3'></a>

In [None]:
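# instantiate model: vocabulary size, hidden size and three label categories
n_vocab = len(TEXT.vocab)
n_hidden = 100
model = IMDBCnn(n_vocab, n_hidden, 3, bs=32)
if torch.cuda.is_available():
    model = model.cuda()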

# create optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# model fit function
def fit(epoch, model, data_loader, phase="training", volatile=False):
    if phase == "training":
        model.train()
    if phase == "validation":
        model.eval()
        volatile = True
    running_loss = 0.0
    running_correct = 0

    for batch_idx, batch in enumerate(data_loader):
        text, target = batch.text, batch.label
        if torch.cuda.is_available():
            text, target = text.cuda(), target.cuda()

        if phase == "training":
            optimizer.zero_grad()
        output = model(text)
        loss = F.nll_loss(output, target)

        running_loss += F.nll_loss(output, target, reduction="sum").item()
        preds = output.max(dim=1, keepdim=True)[1]
        running_correct += preds.eq(target.view_as(preds)).sum().item()
        if phase == "training":
            loss.backward()
            optimizer.step()

    loss = running_loss / len(data_loader.dataset)
    accuracy = 100.0 * running_correct / len(data_loader.dataset)

    print(
        "{0} loss is {1} and {0} accuracy is {2}/{3} {4}".format(
            phase, loss, running_correct, len(data_loader.dataset), accuracy
        )
    )
    return loss, accuracy


# execute training loop
train_losses, train_accuracy = [], []
val_losses, val_accuracy = [], []
for epoch in range(1, 5):
    epoch_loss, epoch_accuracy = fit(epoch, model, train_iter, phase="training")
    val_epoch_loss, val_epoch_accuracy = fit(
        epoch, model, test_iter, phase="validation"
    )
    train_losses.append(epoch_loss)
    train_accuracy.append(epoch_accuracy)
    val_losses.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)