<h1>Simple Sentiment Analysis<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocess-the-data" data-toc-modified-id="Preprocess-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocess the data</a></span></li><li><span><a href="#Create-the-model" data-toc-modified-id="Create-the-model-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Create the model</a></span></li><li><span><a href="#Train-the-model" data-toc-modified-id="Train-the-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train the model</a></span></li></ul></div>

---

In [1]:
import torch
import torchtext

# For reproducibility
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True

## Preprocess the data

The parameters of a [`torchtext.data.Field`](https://torchtext.readthedocs.io/en/latest/data.html) specify how the data should be processed.

The `Field` class models common text processing datatypes that can be represented
by tensors. It holds a `Vocab` object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The `Field` object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.

* Our `TEXT` field has `tokenize='spacy'` as an argument, so tokenization (splitting a string into discrete tokens) is done with the spaCy tokenizer. If no `tokenize` argument is passed, the default is to split the string on whitespace.
* `LABEL` is defined by a `LabelField`, a subclass of `Field` used specifically for handling labels.
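
To see why the choice of tokenizer matters, here is a minimal sketch comparing whitespace splitting with a rule-based tokenizer. The regex tokenizer is only a hypothetical stand-in for illustration; it is not spaCy and does not reproduce spaCy's rules.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays glued to the words.
    return text.split()

def simple_rule_tokenize(text):
    # Hypothetical stand-in for a linguistic tokenizer (NOT spaCy):
    # runs of word characters and single punctuation marks become tokens.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Joe Piscopo is actually funny!"
print(whitespace_tokenize(sentence))   # ['Joe', 'Piscopo', 'is', 'actually', 'funny!']
print(simple_rule_tokenize(sentence))  # ['Joe', 'Piscopo', 'is', 'actually', 'funny', '!']
```

With whitespace splitting, `funny!` and `funny` would become two unrelated vocabulary entries, which is one reason to prefer a tokenizer that separates punctuation.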

In [3]:
TEXT = torchtext.data.Field(tokenize="spacy")         # tokenize text with spacy
LABEL = torchtext.data.LabelField(dtype=torch.float)  # convert type of label to torch.float



Load IMDB data with `torchtext`.

In [7]:
trainset, testset = torchtext.datasets.IMDB.splits(TEXT, LABEL, root="/Users/PSH/Downloads/data")

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [01:38<00:00, 856kB/s] 


In [8]:
len(trainset), len(testset)

(25000, 25000)

In [9]:
print(trainset.examples[0].text)
print(trainset.examples[0].label)

['For', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'Imagine', 'a', 'movie', 'where', 'Joe', 'Piscopo', 'is', 'actually', 'funny', '!', 'Maureen', 'Stapleton', 'is', 'a', 'scene', 'stealer', '.', 'The', 'Moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'Watch', 'for', 'Alan', '"', 'The', 'Skipper', '"', 'Hale', 'jr', '.', 'as', 'a', 'police', 'Sgt', '.']
pos


Train-validation split

In [10]:
import random
random.seed(0)  # random.seed() returns None, so seed first, then pass the RNG state
trainset, validset = trainset.split(random_state=random.getstate())

In [11]:
len(trainset), len(validset), len(testset)

(17500, 7500, 25000)

**Create vocabulary**

There are two ways to effectively cut down our vocabulary: we can either keep only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, keeping only the top 25,000 words.

What do we do with words that appear in examples but have been cut from the vocabulary? We replace them with a special unknown token, `<unk>`.
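
The idea can be sketched in plain Python. The helpers `build_vocab` and `numericalize` below are hypothetical names for illustration only; they mimic the behaviour described above but are not torchtext's implementation (for instance, tie-breaking between equally frequent words may differ).

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    # Count every token, keep the max_size most frequent ones, and
    # reserve indices 0 and 1 for the special tokens.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens that were cut from the vocabulary map to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was"],
    ["the", "movie"],
    ["the"],
]
itos, stoi = build_vocab(corpus, max_size=3)
print(itos)                                           # ['<unk>', '<pad>', 'the', 'movie', 'was']
print(numericalize(["the", "great", "movie"], stoi))  # [2, 0, 3]
```

Here `great` falls outside the top 3 words, so it is mapped to index 0, the `<unk>` token. The vocabulary has `max_size + 2` entries because of the two special tokens.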

In [12]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(trainset, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(trainset)

In [13]:
len(TEXT.vocab), len(LABEL.vocab)  # 25002, because of <unk> and <pad>

(25002, 2)

To ensure every sentence in a batch has the same length, sentences shorter than the longest one in the batch are padded with `<pad>` tokens.

![pad.png](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/3cea8e83b83166845e5fb08ad753571437c683ed/assets/sentiment6.png)
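
As a rough illustration of padding, the sketch below pads a small batch by hand. The function `pad_batch` is a hypothetical helper, not part of torchtext, which performs this step internally.

```python
def pad_batch(sequences, pad_token="<pad>"):
    # Pad every sequence on the right up to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]

batch = [
    ["I", "loved", "it"],
    ["What", "a", "waste", "of", "time"],
    ["Great"],
]
padded = pad_batch(batch)
for seq in padded:
    print(seq)
```

After padding, every sequence has length 5, so the batch can be stacked into a single tensor.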

In [14]:
TEXT.vocab.freqs.most_common(10)

[('the', 201836),
 (',', 192304),
 ('.', 165424),
 ('and', 109492),
 ('a', 109202),
 ('of', 100450),
 ('to', 93564),
 ('is', 76411),
 ('in', 61395),
 ('I', 53909)]

The vocabulary exposes two lookups: `stoi` (string to int) maps a token to its index, and `itos` (int to string) maps an index back to its token.

In [15]:
TEXT.vocab.itos[:5]

['<unk>', '<pad>', 'the', ',', '.']

In [16]:
TEXT.vocab.stoi["nature"]

890

In [17]:
LABEL.vocab.stoi

defaultdict(None, {'pos': 0, 'neg': 1})

We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

The tensors returned by the iterator should be placed on the GPU if one is available. PyTorch handles this with `torch.device`, which we then pass to the iterator.
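
To see why grouping examples of similar length reduces padding, here is a simplified sketch in plain Python. It is not how `BucketIterator` is implemented (the real iterator also shuffles, for example); the review lengths are made-up numbers and the helper names are hypothetical.

```python
import random

def make_batches(lengths, batch_size):
    # Cut the list of example lengths into consecutive batches.
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def count_padding(batches):
    # Number of <pad> slots needed when each batch is padded to its longest example.
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

rng = random.Random(0)
lengths = [rng.randint(5, 200) for _ in range(1000)]  # hypothetical review lengths

random_batches = make_batches(lengths, batch_size=64)            # arbitrary order
bucketed_batches = make_batches(sorted(lengths), batch_size=64)  # similar lengths together

print("padding with arbitrary batches:", count_padding(random_batches))
print("padding with bucketed batches: ", count_padding(bucketed_batches))
```

In arbitrary order, nearly every batch contains at least one long review, so all shorter reviews in it are padded up to that length. Grouping by length keeps the longest and shortest example in each batch close together.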

In [18]:
BATCH_SIZE = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [19]:
trainiter, validiter, testiter = torchtext.data.BucketIterator.splits(
    (trainset, validset, testset),
    batch_size=BATCH_SIZE,
    device=device
)



## Create the model

In [20]:
import torch.nn as nn

In [22]:
class SimpleSentimentCLassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SimpleSentimentCLassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, output_dim)
        
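    def forward(self, x):
        # x: [seq_len, batch_size] token indices
        x = self.embedding(x)  # [seq_len, batch_size, embedding_dim]
        _, x = self.rnn(x)     # many-to-one: keep only the final hidden state [1, batch_size, hidden_dim]
        
        x = x.squeeze(0)       # [batch_size, hidden_dim]
        x = self.linear(x)     # [batch_size, output_dim]
        
        return x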
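
The "many-to-one" comment refers to the two values returned by `nn.RNN`: the hidden state at every time step and the final hidden state. The following sketch checks their shapes on random input; the sequence length and batch size are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

seq_len, batch_size, embedding_dim, hidden_dim = 7, 4, 100, 256

rnn = nn.RNN(embedding_dim, hidden_dim)              # batch_first=False by default
x = torch.randn(seq_len, batch_size, embedding_dim)  # stand-in for embedded tokens

output, hidden = rnn(x)
print(output.shape)  # torch.Size([7, 4, 256]): hidden state at every time step
print(hidden.shape)  # torch.Size([1, 4, 256]): hidden state after the last step

# For a single-layer, unidirectional RNN, the last time step of `output`
# holds the same values as the final hidden state.
print(torch.allclose(output[-1], hidden.squeeze(0)))  # True
```

Because only the final hidden state is needed for classification, the model discards `output` and squeezes the leading dimension of `hidden` before the linear layer.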

Create an instance of the model.

In [23]:
VOCAB_SIZE = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

In [24]:
model = SimpleSentimentCLassifier(
    VOCAB_SIZE,
    EMBEDDING_DIM,
    HIDDEN_DIM,
    OUTPUT_DIM
)

In [25]:
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")

2,592,105 trainable parameters
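
The parameter count can be verified by hand from the layer sizes. The breakdown below assumes the default single-layer `nn.RNN`, which has one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors.

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 25_002, 100, 256, 1

embedding = vocab_size * embedding_dim         # one vector per vocabulary entry
rnn = (embedding_dim * hidden_dim              # input-to-hidden weights
       + hidden_dim * hidden_dim               # hidden-to-hidden weights
       + 2 * hidden_dim)                       # two bias vectors
linear = hidden_dim * output_dim + output_dim  # weights plus bias

total = embedding + rnn + linear
print(f"{embedding:,} + {rnn:,} + {linear:,} = {total:,}")
# 2,500,200 + 91,648 + 257 = 2,592,105
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.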


## Train the model

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.
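
To make the two steps concrete, the sketch below computes the loss for a single example in plain Python, once as sigmoid followed by binary cross entropy and once in a fused form. The fused expression is one common numerically stable formulation, shown here as an illustration rather than as PyTorch's exact internals.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(prob, target):
    # Binary cross entropy on a probability in (0, 1).
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def bce_with_logits(logit, target):
    # Fused, numerically stable form: max(x, 0) - x*y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    two_step = bce(sigmoid(logit), target)
    fused = bce_with_logits(logit, target)
    print(f"logit={logit:+.1f} target={target:.0f} two-step={two_step:.6f} fused={fused:.6f}")
```

Both computations agree for moderate logits. The fused form also stays finite for extreme logits, where the two-step version would take the logarithm of a probability that has rounded to exactly 0.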

In [26]:
import torch.optim as optim

In [27]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

In [28]:
model = model.to(device)
criterion = criterion.to(device)

In [37]:
def binary_accuracy(pred, target):
    pred = torch.round(torch.sigmoid(pred))
    correct = (pred == target).float()
    return correct.sum() / len(correct)

In [30]:
def train(model, iterator, optimizer, criterion):
    loss_epoch = 0.
    acc_epoch = 0.
    
    # set the model to train mode.
    # dropout, batchnorm, ... will be turned on.
    model.train()
    
    # per batch training
    for batch in iterator:
        model.zero_grad()
        
        out = model(batch.text).squeeze(1)
        loss = criterion(out, batch.label)
        loss.backward()
        optimizer.step()
        
        loss_epoch += loss.item()
        acc_epoch += binary_accuracy(out, batch.label)
    
    return loss_epoch / len(iterator),\
           acc_epoch / len(iterator)

In [31]:
def evaluate(model, iterator, criterion):
    loss_epoch = 0.
    acc_epoch = 0.
    
    # set the model to evaluation mode.
    # dropout, batchnorm, ... will be turned off.
    model.eval()
    
    with torch.no_grad():  # disable gradient tracking during evaluation
        for batch in iterator:
            out = model(batch.text).squeeze(1)
            loss = criterion(out, batch.label)
            # no backward pass or optimizer step: evaluation must not update the weights
            
            loss_epoch += loss.item()
            acc_epoch += binary_accuracy(out, batch.label)
    
    return loss_epoch / len(iterator),\
           acc_epoch / len(iterator)

In [38]:
N_EPOCH = 10

for i in range(1, N_EPOCH+1):
    train_loss, train_acc = train(model, trainiter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validiter, criterion)
    print(f"epoch: {i:2d} | train loss: {train_loss: .3f} | train acc: {train_acc: .3f} | val. loss: {valid_loss: .3f} | val. acc: {valid_acc: .3f}")
