# Sequences

In this lab we introduce sequence data, where each data point depends on the points that precede it, and show how to model such data.

Examples of data with a sequence dimension are stock prices, weather data, protein sequences, speech, text, and many more.
In previous labs we mainly considered data $x \in \mathrm{R}^d$, where $d$ is the dimensionality of the feature space.
With sequences our data can be represented as $x \in \mathrm{R}^{t \, \times \, d}$, where $t$ is the sequence length. This emphasises that the samples along the sequence depend on each other and are therefore not independent and identically distributed (i.i.d.).

For a more thorough introduction to sequences within deep learning, read:
**TODO**


In the following we will exemplify methods on text, addressing the same challenges as presented in [Learning when to skim and when to read](https://einstein.ai/research/learning-when-to-skim-and-when-to-read).

# Text classification: Sentiment analysis

In our first work on sequences we will classify sequences of text.
We will model functions $\mathrm{R}^{t \, \times \, d} \rightarrow \mathrm{R}^c$, where $c$ is the number of classes in the output.

With text, the challenge is how to represent each word as a $d$-dimensional feature vector, since the model requires numerical input.
Currently, two popular approaches exist: one-hot encoding and embeddings.

## One-hot encoding over vocabulary

One way to represent a fixed number of words is to assign each word a one-hot encoded vector, which consists of 0s in all cells except for a single 1 in the cell that uniquely identifies the word.

| vocabulary    | one-hot encoded vector   |
| ------------- |--------------------------|
| Copenhagen    | $= [0, 0, 1, \ldots, 0]$ |
| Paris         | $= [1, 0, 0, \ldots, 0]$ |
| Rome          | $= [0, 1, 0, \ldots, 0]$ |

Representing a large vocabulary with one-hot encodings often becomes inefficient because of the size of each sparse vector.
To overcome this challenge it is common practice to truncate the vocabulary to the $k$ most frequent words and represent the rest with a special symbol, $\mathtt{UNK}$, which stands for unknown/unimportant words.
This often causes entities such as names to be represented with $\mathtt{UNK}$.
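
The truncation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the preprocessing used later in this lab: the toy corpus, the helper names `build_vocab` and `one_hot`, and the choice of reserving index 0 for $\mathtt{UNK}$ are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(sentences, k):
    """Keep the k most frequent words; index 0 is reserved for UNK."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {"UNK": 0}
    for word, _ in counts.most_common(k):
        vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a list of 0s with a single 1 at the word's index (UNK if unseen)."""
    vec = [0] * len(vocab)
    vec[vocab.get(word, vocab["UNK"])] = 1
    return vec

# toy corpus (hypothetical), lower-cased and whitespace-tokenized
corpus = ["i love the movie", "i love the jokes", "i love the new movie"]
vocab = build_vocab(corpus, k=3)

# encode a sentence containing words outside the truncated vocabulary
encoded = [one_hot(w, vocab) for w in "i love the spielberg movie".split()]
```

With $k = 3$ only `i`, `love` and `the` are kept in this toy corpus, so both `movie` and the name `spielberg` are mapped to the same $\mathtt{UNK}$ vector.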

Consider the following text
> I love the corny jokes in Spielberg's new movie.

where an example result would be similar to
> I love the corny jokes in $\mathtt{UNK}$'s new movie.

## Embeddings

Word embeddings try to tackle the inefficiency of one-hot encoded vectors, whose length equals the vocabulary size $k$, which is often in the range of 50k to 100k elements.
Furthermore, one-hot encoding makes all word vectors orthogonal, so it cannot capture relationships between words: e.g. `ran` and `run` should be close together, whereas `awkward` and `space` should be far apart in the vector space.
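
To make the orthogonality argument concrete, the sketch below compares cosine similarities. The dense vectors are made-up numbers chosen purely for illustration (they are not trained embeddings), and `cosine` is a helper defined here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# one-hot vectors of two different words never share a non-zero cell
one_hot_ran = [1, 0, 0, 0]
one_hot_run = [0, 1, 0, 0]
sim_one_hot = cosine(one_hot_ran, one_hot_run)

# hypothetical 2-dimensional dense vectors (illustrative values only)
dense = {"ran": [0.9, 0.1], "run": [0.8, 0.2], "space": [-0.2, 0.9]}
sim_related = cosine(dense["ran"], dense["run"])
sim_unrelated = cosine(dense["ran"], dense["space"])
```

Any two distinct one-hot vectors have similarity zero, whereas dense vectors can place related words closer together than unrelated ones.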

An embedding is defined as a mapping $\mathrm{R}^d \rightarrow \mathrm{R}^{d'}$, where $d' \ll d$.
In practice this is often achieved with a lookup table of $d'$-dimensional embeddings, which is equivalent to multiplying the one-hot vector by an embedding matrix, $\mathrm{R}^d \cdot \mathrm{R}^{d \, \times \, d'}$.
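
The equivalence between a table lookup and a matrix product with a one-hot vector can be checked on a tiny example. The table values below are arbitrary placeholders ($d = 4$, $d' = 2$), and `one_hot` and `vec_mat` are helpers written for this sketch.

```python
# hypothetical embedding table with d = 4 rows (words) and d' = 2 columns
embedding_table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector (length d) by a d x d' matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

index = 2
via_matmul = vec_mat(one_hot(index, len(embedding_table)), embedding_table)
via_lookup = embedding_table[index]
```

Both routes select the same row of the table; the lookup simply avoids multiplying by all the zeros.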

For visualizations and more intuition check out https://einstein.ai/research/learning-when-to-skim-and-when-to-read

## Bag of Words

A simple way to model sequences of words is by averaging the word embeddings across the sequence dimension.
This gives us a vector which contains a little information about each word, but completely disregards the order of the words. Even though this might seem like a lossy way to condense information, it works surprisingly well.

A bag of words model is represented as $\mathrm{R}^{t \, \times \, d'} \rightarrow \mathrm{R}^{d'}$. Afterwards, the representation can be used for classification, $\mathrm{R}^{d'} \rightarrow \mathrm{R}^{c}$.
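
The two mappings can be sketched without any framework. The embedded sentence and the weight matrix below are made-up values ($t = 3$, $d' = 2$, $c = 2$), and `bag_of_words` and `linear` are illustrative helper names.

```python
def bag_of_words(embedded):
    """Average a t x d' list of word vectors into one d'-dimensional vector."""
    t = len(embedded)
    return [sum(column) / t for column in zip(*embedded)]

def linear(x, weights):
    """Map a d'-dimensional vector to c scores using a c x d' weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# hypothetical embedded sentence: t = 3 words, d' = 2
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# hypothetical classifier weights: c = 2 classes
weights = [[0.5, -1.0], [-0.5, 1.0]]

pooled = bag_of_words(sentence)
scores = linear(pooled, weights)

# reversing the word order leaves the pooled vector unchanged
pooled_reversed = bag_of_words(sentence[::-1])
```

The last line illustrates the loss of word order: any permutation of the sentence produces the same pooled representation.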

# Stanford sentiment treebank

A great public dataset for sentiment analysis is the Stanford sentiment treebank (SST).
The SST provides the class (positive, negative) not only for each sentence, but also for each of its grammatical subphrases.
We will not utilize any tree information.
The original SST consists of five classes: *very positive*, *positive*, *neutral*, *negative* and *very negative*.
We consider the simpler task of binary classification, where *very positive* is combined with *positive*, *very negative* is combined with *negative*, and all *neutrals* are removed.
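
As a rough sketch of this relabelling (the code cells below rely on `torchtext` to do it), the mapping can be written as follows. The name `to_binary` and the placeholder examples are hypothetical and not taken from the dataset.

```python
# Map the five fine-grained classes to binary labels; None means "drop".
BINARY_LABEL = {
    "very positive": "positive",
    "positive": "positive",
    "neutral": None,
    "negative": "negative",
    "very negative": "negative",
}

def to_binary(examples):
    """Relabel (text, label) pairs and remove the neutral ones."""
    relabelled = [(text, BINARY_LABEL[label]) for text, label in examples]
    return [(text, label) for text, label in relabelled if label is not None]

# placeholder examples (hypothetical, not real SST entries)
fine_grained = [
    ("sentence a", "very positive"),
    ("sentence b", "neutral"),
    ("sentence c", "negative"),
    ("sentence d", "very negative"),
]
binary = to_binary(fine_grained)
```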

## Positive examples

> The actors are fantastic.

> A smart, witty follow-up.

> You'll probably love it.

## Negative examples

> Unflinchingly bleak and desperate.

> An absurdist spider web.

> Who cares?

In [None]:
from torchtext import data
from torchtext import datasets
from torchtext.vocab import Vectors, GloVe, CharNGram, FastText

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
from torch.nn import Linear
from torch.nn.functional import softmax

In [None]:
# Approach 1:
# set up fields
TEXT = data.Field()
LABEL = data.Field(sequential=False)

# make splits for data
train, val, test = datasets.SST.splits(
    TEXT, LABEL, fine_grained=False, train_subtrees=True,
    filter_pred=lambda ex: ex.label != 'neutral')

# print information about the data
print('train.fields', train.fields)
print('len(train)', len(train))
print('vars(train[0])', vars(train[0]))

# build the vocabulary
# URL from which the pretrained vectors are downloaded if they are not cached in '.'
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'
TEXT.build_vocab(train, vectors=Vectors('wiki.simple.vec', cache='.', url=url))
LABEL.build_vocab(train)

# print vocab information
print('len(TEXT.vocab)', len(TEXT.vocab))
print('TEXT.vocab.vectors.size()', TEXT.vocab.vectors.size())

# make iterator for splits
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=128, device=-1)

# print batch information
batch = next(iter(train_iter))
print(batch.text)
print(batch.label)

In [None]:
# label vocabulary: maps class index to class name
LABEL.vocab.itos

In [None]:
batch.text

# Build the model

In [None]:
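# pretrained embedding matrix has shape (vocabulary size, embedding dimension)
num_embeddings, embedding_dim = TEXT.vocab.vectors.size()
# number of outputs, taken from the label vocabulary printed above
# (which may contain an '<unk>' entry besides the sentiment classes)
num_classes = len(LABEL.vocab.itos)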

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.embeddings = nn.Embedding(num_embeddings, embedding_dim)
        # use pretrained embeddings
        self.embeddings.weight.data.copy_(TEXT.vocab.vectors)
        
        self.l_out = Linear(in_features=embedding_dim,
                            out_features=num_classes,
                            bias=False)
        
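    def forward(self, x):
        # x holds word indices with shape (sequence length, batch size)
        x = self.embeddings(x)
        # bag of words: average the embeddings over the sequence dimension
        # (padding tokens are included in the average)
        x = torch.mean(x, dim=0)
        # return unnormalized class scores (logits): nn.CrossEntropyLoss
        # applies log-softmax itself, so no softmax is applied here
        return self.l_out(x)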

net = Net()
print(net)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

def accuracy(ys, ts):
    # boolean vector marking correct (1) and incorrect (0) predictions
    correct_prediction = torch.eq(torch.max(ys, 1)[1], ts)
    # fraction of correct predictions
    return torch.mean(correct_prediction.float())

In [None]:
max_iter = 1000
eval_every = 100
net.train()
for i, batch in enumerate(train_iter):
    if i % eval_every == 0:
        # evaluate on the full validation set
        net.eval()
        val_loss = 0.0
        val_acc = 0.0
        val_length = 0
        # no gradients are needed during evaluation
        with torch.no_grad():
            for val_batch in val_iter:
                output = net(val_batch.text)
                # weight by batch size, since the last batch can be smaller
                val_loss += criterion(output, val_batch.label).item() * val_batch.batch_size
                val_acc += accuracy(output, val_batch.label).item() * val_batch.batch_size
                val_length += val_batch.batch_size
        val_loss /= val_length
        val_acc /= val_length
        print(" loss: {:.2f} accs: {:.2f}".format(val_loss, val_acc))
        net.train()

    # one training step on the current batch
    output = net(batch.text)
    batch_loss = criterion(output, batch.label)

    optimizer.zero_grad()
    batch_loss.backward()
    optimizer.step()

    # stop after max_iter training steps
    if i + 1 >= max_iter:
        break