# Lab 04 - Upgrading to an LSTM and pre-trained word embeddings
In this lab we will experiment with an LSTM, as an improvement over the RNN.

The lab is adapted from the [popular PyTorch sentiment analysis tutorial by bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb).

In [1]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a target="_blank" href="https://colab.research.google.com/github/surrey-nlp/NLP-2026/blob/main/lab04/lab04_1b_LSTM.ipynb">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>'
)
display(colab_button)

In [None]:
# Install dependencies
%pip -q install torch datasets spacy tqdm
!python -m spacy download en_core_web_sm

Note: you may need to restart the kernel to use updated packages.


In [None]:
import torch

SEED = 1234
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

print("PyTorch Version: ", torch.__version__)
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'}.")


PyTorch Version:  2.10.0+cu128
Using GPU.


## Preparing the Data

In [None]:
from datasets import load_dataset
from torch.utils.data import random_split

# Load IMDb via Hugging Face Datasets
ds = load_dataset("imdb")

# Convert to a simple list of (label, text) tuples to keep the rest of the lab code familiar
train_full = list(zip(ds["train"]["label"], ds["train"]["text"]))
test_data = list(zip(ds["test"]["label"], ds["test"]["text"]))

# Validation split from the training set (70/30 split)
split_ratio = 0.7
train_samples = int(split_ratio * len(train_full))
valid_samples = len(train_full) - train_samples

# random_split expects a Dataset-like object; we can wrap the list with a simple indexable object
class ListDataset:
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

train_full = ListDataset(train_full)
test_data = ListDataset(test_data)

train_data, valid_data = random_split(train_full, [train_samples, valid_samples])
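
`random_split` draws from PyTorch's global random state, so the split above depends on the earlier `torch.manual_seed(SEED)` call and on any random numbers consumed in between. One possible alternative, sketched here on a toy list rather than the IMDb data (`toy_data` and `split_70_30` are hypothetical names), is to pass a dedicated `torch.Generator`:

```python
import torch
from torch.utils.data import random_split

SEED = 1234
# Hypothetical stand-in for the list of (label, text) tuples
toy_data = [(i % 2, f"review {i}") for i in range(10)]

def split_70_30(data, seed):
    n_train = int(0.7 * len(data))
    n_valid = len(data) - n_train
    # A dedicated generator makes the split independent of other random calls
    generator = torch.Generator().manual_seed(seed)
    return random_split(data, [n_train, n_valid], generator=generator)

train_a, valid_a = split_70_30(toy_data, SEED)
train_b, valid_b = split_70_30(toy_data, SEED)

print(len(train_a), len(valid_a))
print(list(train_a.indices) == list(train_b.indices))
```

Calling the helper twice with the same seed should select the same examples, which makes the validation set reproducible across runs.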



Just like last time, we'll create some utilities for the processing pipeline so that we can tokenize with spaCy and get the post-tokenization lengths needed for packed padded sequences.

In [None]:
import spacy

# Load spaCy English model
_nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

class SpacyTokenizer(torch.nn.Module):
    """A small wrapper so we can reuse the same tokenizer in preprocessing steps."""
    def __init__(self):
        super().__init__()
    def forward(self, input):
        if isinstance(input, list):
            return [self._tokenize(x) for x in input]
        elif isinstance(input, str):
            return self._tokenize(input)
        raise ValueError(f"Type {type(input)} is not supported.")
    def _tokenize(self, text: str):
        return [tok.text.lower() for tok in _nlp.tokenizer(text)]

class ToLengths(torch.nn.Module):
    def forward(self, input):
        if isinstance(input, list) and len(input) > 0 and isinstance(input[0], list):
            return [len(x) for x in input]
        elif isinstance(input, list):
            return len(input)
        raise ValueError(f"Type {type(input)} is not supported.")


Next is the use of **pre-trained word embeddings**. Instead of initialising an embedding matrix randomly and learning everything from scratch, we start from vectors that were trained on a very large corpus.

In this updated notebook, we download the Stanford **GloVe 6B** vectors directly and load them ourselves. We use the **100-dimensional** version and keep only the first `MAX_VOCAB_SIZE` vectors to keep memory and loading time manageable.

The intuition is that pre-trained vectors already place semantically related words close together in vector space (e.g., *terrible*, *awful*, *dreadful*), giving the model a much better starting point.
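
"Close together" is usually measured with cosine similarity. The sketch below uses three hand-made 3-dimensional vectors (hypothetical values, not real GloVe entries) purely to show the computation:

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-d vectors chosen by hand for illustration;
# the GloVe vectors used in this lab have 100 dimensions
toy_vectors = {
    "terrible": torch.tensor([0.9, 0.1, 0.0]),
    "awful": torch.tensor([0.8, 0.2, 0.1]),
    "great": torch.tensor([-0.7, 0.3, 0.2]),
}

def cosine(a, b):
    # Cosine similarity is 1 for vectors pointing the same way, -1 for opposite directions
    return F.cosine_similarity(toy_vectors[a], toy_vectors[b], dim=0).item()

sim_related = cosine("terrible", "awful")
sim_unrelated = cosine("terrible", "great")
print(f"terrible vs awful: {sim_related:.3f}")
print(f"terrible vs great: {sim_unrelated:.3f}")
```

Once the real vectors are loaded, the same function could be applied to rows of the embedding matrix to compare actual words.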

**Note:** The full GloVe download is large (hundreds of MB). The first run may take a while, but the file is cached locally after it is downloaded.


In [None]:
import os, urllib.request, zipfile
import torch

MAX_VOCAB_SIZE = 25_000
GLOVE_NAME = "glove.6B.100d.txt"
GLOVE_ZIP = "glove.6B.zip"
GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

def download_glove(dest_dir="./.vector_cache"):
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, GLOVE_ZIP)
    txt_path = os.path.join(dest_dir, GLOVE_NAME)

    if not os.path.exists(txt_path):
        if not os.path.exists(zip_path):
            print("Downloading GloVe vectors (this can take a while the first time)...")
            urllib.request.urlretrieve(GLOVE_URL, zip_path)
        print("Extracting GloVe...")
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extract(GLOVE_NAME, path=dest_dir)

    return txt_path

glove_path = download_glove()
print("Using GloVe file:", glove_path)


Downloading GloVe vectors (this can take a while the first time)...
Extracting GloVe...
Using GloVe file: ./.vector_cache/glove.6B.100d.txt


So now that we have the vectors downloaded, how do we *actually* use them?

We build a vocabulary directly from the GloVe file:
- `stoi` maps a token (string) → an integer ID  
- `itos` maps an integer ID → a token  

We also build an embedding matrix `pretrained_embeddings` where row *i* contains the vector for the token with ID *i*.  
We add two special tokens at the start of the vocabulary:
- `<unk>` for unknown words (given a random vector)
- `<pad>` for padding (given an all-zero vector)

Later, we create the model’s embedding layer using `nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True, padding_idx=PAD_IDX)` so that:
- the embedding weights start from GloVe
- the `<pad>` embedding is treated as padding
- the embeddings are kept fixed (not updated) during training, matching the original tutorial behaviour
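
The steps above can be sketched end to end on a tiny in-memory file. Everything here is a toy assumption: `toy_glove` is a hypothetical three-word file in GloVe's text format with 2-dimensional vectors, not the real download.

```python
import io
import torch
import torch.nn as nn

# Hypothetical three-word file in GloVe's "word v1 v2 ..." text format (2-d for brevity)
toy_glove = io.StringIO("the 0.1 0.2\nfilm 0.3 0.4\ngreat 0.5 0.6\n")

specials = ["<unk>", "<pad>"]
itos = list(specials)
stoi = {tok: i for i, tok in enumerate(specials)}
rows = []
for line in toy_glove:
    word, *vals = line.rstrip().split(" ")
    stoi[word] = len(itos)
    itos.append(word)
    rows.append(torch.tensor([float(v) for v in vals]))

dim = rows[0].shape[0]
# Rows 0 and 1 belong to <unk> (random) and <pad> (zeros), matching their IDs in stoi
matrix = torch.cat([torch.randn(1, dim), torch.zeros(1, dim), torch.stack(rows)], dim=0)

embedding = nn.Embedding.from_pretrained(matrix, freeze=True, padding_idx=stoi["<pad>"])

# "masterpiece" is not in the toy vocabulary, so it falls back to <unk>
ids = torch.tensor([stoi["film"], stoi.get("masterpiece", stoi["<unk>"]), stoi["<pad>"]])
looked_up = embedding(ids)
print(looked_up)
print("trainable:", embedding.weight.requires_grad)
```

The lookup returns row `stoi[token]` of the matrix for each ID, and `freeze=True` leaves the weight with `requires_grad=False`.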


In [None]:
import torch

# Build a vocab directly from the GloVe file (limited to MAX_VOCAB_SIZE vectors)
specials = ["<unk>", "<pad>"]
stoi = {tok: i for i, tok in enumerate(specials)}
itos = list(specials)
vectors = []  # will store only the non-special GloVe vectors

with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        if len(itos) - len(specials) >= MAX_VOCAB_SIZE:
            break
        parts = line.rstrip().split(" ")
        word = parts[0]
        vals = parts[1:]
        if word in stoi:
            continue
        vec = torch.tensor([float(x) for x in vals], dtype=torch.float)
        stoi[word] = len(itos)
        itos.append(word)
        vectors.append(vec)

embedding_dim = vectors[0].shape[0]
pretrained_embeddings = torch.stack(vectors, dim=0)  # [MAX_VOCAB_SIZE, dim]

# Add special token vectors at the start to match stoi indices
unk_vec = torch.empty(1, embedding_dim).normal_()
pad_vec = torch.zeros(1, embedding_dim)
pretrained_embeddings = torch.cat([unk_vec, pad_vec, pretrained_embeddings], dim=0)

PAD_IDX = stoi["<pad>"]
UNK_IDX = stoi["<unk>"]

print("Vocab size:", len(itos))
print("Embedding matrix shape:", pretrained_embeddings.shape)


Vocab size: 25002
Embedding matrix shape: torch.Size([25002, 100])


In [None]:
print("Vocab size: ", len(itos))
print("Pretrained vectors shape: ", pretrained_embeddings.shape)
print("<unk> vector (index {}): ".format(UNK_IDX), pretrained_embeddings[UNK_IDX])
print("<pad> vector (index {}): ".format(PAD_IDX), pretrained_embeddings[PAD_IDX])


Vocab size:  25002
Pretrained vectors shape:  torch.Size([25002, 100])
<unk> vector (index 0):  tensor([-8.0321e-01, -2.0709e-01, -1.4458e+00,  6.3159e-02, -1.8717e-02,
         1.9214e-01,  8.2137e-01, -1.4905e+00, -1.5952e+00,  1.9304e+00,
        -1.1135e+00,  1.6696e+00, -1.4213e+00, -9.4966e-01,  1.1405e+00,
         2.1162e+00,  6.2582e-02, -4.5980e-01, -3.8454e-02, -9.7214e-01,
        -6.4611e-01, -2.6423e-01, -2.1345e+00, -7.7647e-01, -9.4903e-01,
        -6.4907e-01, -7.6878e-01,  5.2265e-01,  4.5545e-02, -1.2079e+00,
         1.3286e+00,  1.4876e-01,  2.2356e-01,  1.4247e+00,  1.6752e+00,
        -2.2740e-01,  1.9271e+00,  1.2892e+00,  1.1147e+00,  7.8226e-01,
        -1.0090e+00,  1.4562e+00, -8.6125e-01, -1.0863e+00,  1.5205e+00,
         1.0192e+00,  1.6661e+00,  3.0074e-01,  1.8615e-01, -1.0638e+00,
         1.7487e+00, -9.1451e-01, -1.9220e+00, -3.8310e-01, -1.2836e+00,
         3.9363e-01,  4.8144e-01,  1.5693e+00,  4.8916e-01, -3.5462e-01,
         5.5390e-01, -5.6849

The IMDb labels from Hugging Face are already integers:

- `0` = negative review  
- `1` = positive review  

We keep them as `0/1` and convert them to float tensors in the `collate_fn` so they work directly with `BCEWithLogitsLoss`.
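
As a small illustration of why the conversion to float matters, the sketch below (with made-up logits) feeds float targets to `BCEWithLogitsLoss` and compares the result with applying `sigmoid` followed by `BCELoss`:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (hypothetical values)
labels = torch.tensor([1, 0, 1])          # integer 0/1 labels, as stored in the dataset

targets = labels.float()                  # the loss expects floating-point targets
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)

# Equivalent two-step version: sigmoid first, then plain binary cross-entropy
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_fused.item(), loss_two_step.item())
```

The two losses should agree up to floating-point error; the fused version is generally preferred because it is more numerically stable.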


In [None]:
# Labels in Hugging Face IMDb are already integers:
# 0 = negative, 1 = positive
# We'll keep them as 0/1 floats for BCEWithLogitsLoss.
NEG_LABEL = 0
POS_LABEL = 1


We can now define the rest of our preprocessing steps like last time.

In the older version of this lab, transforms were used to build a reusable “pipeline”. In this version, we implement the same idea with a small set of Python functions plus a custom `collate_fn` in the `DataLoader` that handles:

- tokenisation (spaCy)  
- numericalisation (token → ID via `stoi`)  
- padding within a batch (using `<pad>`)  
- returning sequence lengths (needed for packing in the LSTM)
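
The four steps can be sketched on a two-review batch. This is a simplified stand-in, not the lab's implementation: `toy_stoi` is a hypothetical vocabulary and a whitespace split replaces the spaCy tokenizer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy vocabulary; the lab builds the real one from GloVe
toy_stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "is": 4, "great": 5}

def toy_collate(batch):
    labels, texts = zip(*batch)
    # Whitespace split stands in for the spaCy tokenizer
    token_lists = [text.lower().split() for text in texts]
    # Numericalise: unknown tokens map to <unk>
    id_seqs = [torch.tensor([toy_stoi.get(t, toy_stoi["<unk>"]) for t in toks]) for toks in token_lists]
    lengths = torch.tensor([len(seq) for seq in id_seqs])
    # Pad every sequence to the longest one in the batch
    padded = pad_sequence(id_seqs, batch_first=True, padding_value=toy_stoi["<pad>"])
    return torch.tensor(labels, dtype=torch.float), padded, lengths

labels, padded, lengths = toy_collate([(1, "This film is great"), (0, "Dreadful film")])
print(padded)   # tensor([[2, 3, 4, 5], [0, 3, 1, 1]])
print(lengths)  # tensor([4, 2])
```

The second review is shorter, so it is padded with the `<pad>` ID (1), and "dreadful" maps to `<unk>` (0) because it is missing from the toy vocabulary.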


In [None]:
from torch.nn.utils.rnn import pad_sequence

tokenizer = SpacyTokenizer()

def numericalize(tokens):
    return [stoi.get(t, UNK_IDX) for t in tokens]

def text_transform(texts):
    # texts: List[str] -> padded tensor [batch, seq_len]
    token_lists = tokenizer(list(texts))
    id_seqs = [torch.tensor(numericalize(toks), dtype=torch.long) for toks in token_lists]
    lengths = torch.tensor([len(x) for x in id_seqs], dtype=torch.long)
    padded = pad_sequence(id_seqs, batch_first=True, padding_value=PAD_IDX)
    return padded, lengths


In [None]:
from torch.utils.data import DataLoader

BATCH_SIZE = 64

def collate_batch(batch):
    labels, texts = zip(*batch)  # labels are 0/1 ints, texts are strings
    text_tensor, lengths = text_transform(texts)
    labels = torch.tensor(labels, dtype=torch.float)

    return labels.to(DEVICE), text_tensor.to(DEVICE), lengths.cpu()

def _get_dataloader(data, shuffle=False):
    return DataLoader(data, batch_size=BATCH_SIZE, shuffle=shuffle, collate_fn=collate_batch)

# Only the training data needs shuffling; evaluation order does not matter
train_dataloader = _get_dataloader(train_data, shuffle=True)
valid_dataloader = _get_dataloader(valid_data)
test_dataloader = _get_dataloader(test_data)


### Different RNN Architecture

We'll be using an RNN architecture called a Long Short-Term Memory (LSTM). Why is an LSTM better than a standard RNN? Standard RNNs suffer from the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). LSTMs overcome this by having an extra recurrent state called a _cell_, $c$ (which can be thought of as the "memory" of the LSTM), and by using multiple _gates_ which control the flow of information into and out of the memory. For more information, go [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). We can simply think of the LSTM as a function of $x_t$, $h_{t-1}$ and $c_{t-1}$, instead of just $x_t$ and $h_{t-1}$.

$$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$$

Thus, the model using an LSTM looks something like (with the embedding layers omitted):

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment2.png?raw=1)

The initial cell state, $c_0$, like the initial hidden state, is initialized to a tensor of all zeros. The sentiment prediction is still, however, only made using the final hidden state, not the final cell state, i.e. $\hat{y}=f(h_T)$.
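
To make the recurrence concrete, here is a minimal sketch that steps through a sequence one token at a time with `nn.LSTMCell`. The sizes are arbitrary toy values, and the lab itself uses `nn.LSTM`, which runs this loop internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, batch_size, seq_len = 4, 3, 2, 5   # hypothetical toy sizes

cell = nn.LSTMCell(emb_dim, hid_dim)
x = torch.randn(seq_len, batch_size, emb_dim)        # stand-in for embedded tokens

# h_0 and c_0 both start as all zeros
h = torch.zeros(batch_size, hid_dim)
c = torch.zeros(batch_size, hid_dim)

for t in range(seq_len):
    # One step of the recurrence: the new (h, c) depend on x_t and the previous (h, c)
    h, c = cell(x[t], (h, c))

h_T = h  # only the final hidden state feeds the classifier; c is not used for the prediction
print(h_T.shape, c.shape)
```

Both states have shape `[batch size, hid dim]`; the cell state is carried along at every step but discarded at the end.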

### Bidirectional RNN

The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). At time step $t$, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$.

In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other in a single tensor.

We make our sentiment prediction using a concatenation of the last hidden state from the forward RNN (obtained from the final word of the sentence), $h_T^\rightarrow$, and the last hidden state from the backward RNN (obtained from the first word of the sentence), $h_T^\leftarrow$, i.e. $\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$.

The image below shows a bi-directional RNN, with the forward RNN in orange, the backward RNN in green and the linear layer in silver.  

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment3.png?raw=1)
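
A short sketch with toy sizes shows how PyTorch stacks the two directions and where each final state comes from. It uses `nn.RNN` for brevity; `nn.LSTM` follows the same layout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, batch_size, seq_len = 4, 3, 2, 5   # hypothetical toy sizes

rnn = nn.RNN(emb_dim, hid_dim, bidirectional=True)
x = torch.randn(seq_len, batch_size, emb_dim)

output, hidden = rnn(x)   # output: [seq_len, batch, 2 * hid_dim], hidden: [2, batch, hid_dim]

forward_final = hidden[0]    # forward RNN after reading the last word
backward_final = hidden[1]   # backward RNN after reading the first word

# The same states appear in `output`: forward half at the last step, backward half at the first step
same_forward = torch.allclose(forward_final, output[-1, :, :hid_dim])
same_backward = torch.allclose(backward_final, output[0, :, hid_dim:])

features = torch.cat((forward_final, backward_final), dim=1)   # [batch, 2 * hid_dim]
print(same_forward, same_backward, features.shape)
```

The concatenated `features` tensor is what a linear layer would receive, which is why its input size doubles for a bidirectional model.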

### Multi-layer RNN

Multi-layer RNNs (also called *deep RNNs*) are another simple concept. The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time-step $t$ will be the input to the RNN above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.

The image below shows a multi-layer unidirectional RNN, where the layer number is given as a superscript. Also note that each layer needs its own initial hidden state, $h_0^L$.

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment4.png?raw=1)
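
The stacking can be checked with a small sketch (toy sizes, three layers): each layer gets its own initial hidden state, and the prediction would be made from the top layer's final state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, batch_size, seq_len, n_layers = 4, 3, 2, 5, 3   # hypothetical toy sizes

rnn = nn.RNN(emb_dim, hid_dim, num_layers=n_layers)
x = torch.randn(seq_len, batch_size, emb_dim)

# One initial hidden state per layer, all zeros (also PyTorch's default when h_0 is omitted)
h_0 = torch.zeros(n_layers, batch_size, hid_dim)
output, hidden = rnn(x, h_0)

# `output` holds the top layer's hidden state at every time step,
# so its last step matches the top layer's final hidden state
top_matches = torch.allclose(output[-1], hidden[-1])
print(hidden.shape, top_matches)
```

`hidden` has one slice per layer, ordered from bottom to top, so `hidden[-1]` is the final hidden state of the highest layer.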

### Regularization

Although we've added improvements to our model, each one adds additional parameters. Without going into too much detail about overfitting, the more parameters you have in your model, the higher the probability that your model will overfit (memorize the training data, causing a low training error but high validation/testing error, i.e. poor generalization to new, unseen examples). To combat this, we use regularization. More specifically, we use a method of regularization called *dropout*. Dropout works by randomly *dropping out* (setting to 0) neurons in a layer during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter, and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a "weaker" model (one with fewer parameters). The predictions from all these "weaker" models (one for each forward pass) get averaged together within the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.
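
A quick sketch of what `nn.Dropout` does to a tensor of ones, in training and in evaluation mode. Note that PyTorch also rescales the surviving values by `1 / (1 - p)` during training so that the expected activation stays the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
y_train = dropout(x)   # each element is zeroed with probability p; survivors are scaled by 1 / (1 - p)

dropout.eval()
y_eval = dropout(x)    # dropout is a no-op in evaluation mode

zero_fraction = (y_train == 0).float().mean().item()
print("fraction zeroed in train mode:", zero_fraction)
print("surviving values:", y_train[y_train != 0].unique().tolist())
print("unchanged in eval mode:", torch.equal(y_eval, x))
```

Roughly half of the elements should be zeroed in training mode, with the rest equal to 2.0, while evaluation mode returns the input untouched.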

### Implementation Details

Another addition to this model is that we are not going to learn the embedding for the `<pad>` token. This is because we want to explicitly tell our model that padding tokens are irrelevant to determining the sentiment of a sentence. This means the embedding for the pad token will remain at what it is initialized to (we initialized it to all zeros when building the embedding matrix). We do this by passing the index of our pad token as the `padding_idx` argument to the `nn.Embedding` layer.

We also initialize the `nn.Embedding` layer with the `from_pretrained` function, as we'll be passing our downloaded pre-trained vectors directly to the model when initialising it. We also want the embeddings to not be trained further, so we set `freeze=True`.

To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN`. Also, note that the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state.

As the final hidden state of our LSTM has both a forward and a backward component, which will be concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.

Implementing bidirectionality and adding additional layers is done by passing values for the `bidirectional` and `num_layers` arguments of the RNN/LSTM.

Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropping out each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers (`text` or `fc` in this case), you only ever want to use dropout on intermediate layers. The LSTM has a `dropout` argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer.

As we are passing the lengths of our sentences to be able to use packed padded sequences, we have to add a second argument, `lengths`, to our model's `forward` function.

Before we pass our embeddings to the RNN, we need to pack them like last time, which we do with `nn.utils.rnn.pack_padded_sequence`. This will cause our RNN to only process the non-padded elements of our sequence. The RNN will then return `packed_output` (a packed sequence) as well as the `hidden` and `cell` states (both of which are tensors). Without packed padded sequences, `hidden` and `cell` are tensors from the last element in the sequence, which will most probably be a pad token. When using packed padded sequences, they are both from the last non-padded element in the sequence. Note that the `lengths` argument of `pack_padded_sequence` must be a CPU tensor.
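
The effect of packing can be checked on a toy batch. This sketch, with arbitrary sizes and random "embeddings", compares the final hidden state of a padded short sequence against running that sequence on its own:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
emb_dim, hid_dim = 4, 3                                # hypothetical toy sizes
lstm = nn.LSTM(emb_dim, hid_dim)

# Two already-embedded sequences, batch-first; the second has 2 real steps followed by 3 padding steps
embedded = torch.randn(2, 5, emb_dim)
embedded[1, 2:] = 0.0
lengths = torch.tensor([5, 2])                         # kept on the CPU

with torch.no_grad():
    packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
    _, (hidden_packed, _) = lstm(packed)

    # Reference: run the short sequence alone, without any padding
    _, (hidden_alone, _) = lstm(embedded[1, :2].unsqueeze(1))   # [seq_len=2, batch=1, emb_dim]

    # Without packing, the LSTM also steps through the padding positions
    _, (hidden_padded, _) = lstm(embedded.transpose(0, 1))

packed_matches = torch.allclose(hidden_packed[0, 1], hidden_alone[0, 0], atol=1e-6)
padded_matches = torch.allclose(hidden_padded[0, 1], hidden_alone[0, 0], atol=1e-6)
print(packed_matches, padded_matches)
```

With packing, the short sequence's final hidden state should match the unpadded run. Without it, the state keeps changing as the LSTM reads the padding steps.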

The final hidden state, `hidden`, has a shape of _**[num layers * num directions, batch size, hid dim]**_. These are ordered: **[forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1, ..., forward_layer_n, backward_layer_n]**. As we want the final (top) layer forward and backward hidden states, we get the top two hidden layers from the first dimension, `hidden[-2,:,:]` and `hidden[-1,:,:]`, and concatenate them together before passing them to the linear layer (after applying dropout).
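
The ordering can be made explicit by splitting the first dimension into separate layer and direction axes. A sketch with toy sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, batch_size, seq_len = 4, 3, 2, 5     # hypothetical toy sizes
n_layers, n_directions = 2, 2

lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True)
_, (hidden, cell) = lstm(torch.randn(seq_len, batch_size, emb_dim))

# Split the first dimension into [layer, direction]
by_layer = hidden.reshape(n_layers, n_directions, batch_size, hid_dim)
top_forward, top_backward = by_layer[-1, 0], by_layer[-1, 1]

features = torch.cat((hidden[-2], hidden[-1]), dim=1)   # what the model feeds to the linear layer
print(hidden.shape, features.shape)
print(torch.equal(top_forward, hidden[-2]), torch.equal(top_backward, hidden[-1]))
```

Indexing with `hidden[-2]` and `hidden[-1]` is therefore a shortcut for "top layer, forward" and "top layer, backward".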

In [None]:
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        super().__init__()

        self.num_directions = 2 if bidirectional else 1

        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True, padding_idx=pad_idx)
        self.rnn = nn.LSTM(pretrained_embeddings.shape[1],
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout)
        self.fc = nn.Linear(hidden_dim * self.num_directions, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, text, lengths):
        embedded = self.dropout(self.embedding(text))                   # VV note that lengths need to be on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)

        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        if self.num_directions == 2:  # if bidirectional
            # Concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
            # and apply dropout
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])

        return self.fc(hidden)

Like before, we'll create an instance of our model class (now `LSTM`), with the new arguments for the number of layers, bidirectionality and dropout probability.

To ensure the pre-trained vectors are loaded into the model, we pass the embedding matrix (`pretrained_embeddings`) we created earlier.

Finally, we pass `PAD_IDX` so the embedding layer knows which index corresponds to padding.


In [None]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = LSTM(
    pretrained_embeddings,
    HIDDEN_DIM,
    OUTPUT_DIM,
    N_LAYERS,
    BIDIRECTIONAL,
    DROPOUT,
    PAD_IDX
)


We'll print out the number of parameters in our model.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,310,657 trainable parameters
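
The figure above can be reproduced by hand. The frozen embedding matrix (25,002 × 100 weights) is excluded because its `requires_grad` is `False`, so only the LSTM and the linear layer count. The sketch below assumes the standard `nn.LSTM` parameter layout of four gates with two bias vectors each; `lstm_direction_params` is a hypothetical helper name.

```python
def lstm_direction_params(input_dim, hidden_dim):
    # Four gates, each with input weights, recurrent weights and two bias vectors (b_ih and b_hh)
    return 4 * hidden_dim * (input_dim + hidden_dim) + 2 * 4 * hidden_dim

EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM = 100, 256, 1

layer_0 = 2 * lstm_direction_params(EMBEDDING_DIM, HIDDEN_DIM)    # forward + backward
layer_1 = 2 * lstm_direction_params(2 * HIDDEN_DIM, HIDDEN_DIM)   # input is the concatenated layer-0 output
linear = 2 * HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM                 # weights + bias

total = layer_0 + layer_1 + linear
print(f"{layer_0:,} + {layer_1:,} + {linear:,} = {total:,}")
# 733,184 + 1,576,960 + 513 = 2,310,657
```

Most of the trainable parameters sit in the second LSTM layer, because its input is twice the hidden size.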


## Train the Model

Now to training the model.

The only change we'll make here is changing the optimizer from `SGD` to `Adam`. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about `Adam` (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).

To change `SGD` to `Adam`, we simply change `optim.SGD` to `optim.Adam`. Also note that we do not have to provide an initial learning rate for Adam, as PyTorch specifies a sensible default (`lr=1e-3`).
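
As a quick check of the default, the following sketch inspects the optimizer's parameter groups. `toy_model` is a hypothetical stand-in for the LSTM model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

toy_model = nn.Linear(4, 1)   # hypothetical stand-in for the LSTM model

adam_default = optim.Adam(toy_model.parameters())
adam_custom = optim.Adam(toy_model.parameters(), lr=5e-4)   # overriding the default is a one-argument change

default_lr = adam_default.param_groups[0]["lr"]
custom_lr = adam_custom.param_groups[0]["lr"]
print(default_lr, custom_lr)
```

If training turns out to be unstable or slow, passing an explicit `lr` is the first thing to experiment with.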

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

The rest of the steps for training the model are unchanged.

We define the criterion and place the model and criterion on the GPU (if available)...

In [None]:
criterion = nn.BCEWithLogitsLoss()

model = model.to(DEVICE)
criterion = criterion.to(DEVICE)

We implement the function to calculate accuracy...

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum() / len(correct)
    return acc
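
A worked example of the accuracy computation on four made-up logits (the function is repeated so that the snippet runs on its own):

```python
import torch

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

logits = torch.tensor([2.0, -1.5, 0.3, -0.2])   # hypothetical raw model outputs
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

# sigmoid(logit) > 0.5 exactly when logit > 0, so the rounded predictions are [1, 0, 1, 0]
acc = binary_accuracy(logits, labels)
print(acc.item())  # 0.75
```

Three of the four predictions match the labels, giving 0.75. The function takes raw logits, not probabilities, which is why it applies `sigmoid` itself.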

We define a function for training our model.

**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

In [None]:
from tqdm import tqdm

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in tqdm(iterator, desc="\tTraining"):
        optimizer.zero_grad()

        labels, texts, lengths = batch  # Note that this has to match the order in collate_batch
        predictions = model(texts, lengths).squeeze(1)
        loss = criterion(predictions, labels)
        acc = binary_accuracy(predictions, labels)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Then we define a function for testing our model.

**Note**: as we are now using dropout, we must remember to use `model.eval()` to ensure the dropout is "turned off" while evaluating.

In [None]:
from tqdm import tqdm

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in tqdm(iterator, desc="\tEvaluation"):
            labels, texts, lengths = batch  # Note that this has to match the order in collate_batch
            predictions = model(texts, lengths).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

And also create a nice function to tell us how long our epochs are taking.

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
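
A small usage sketch with a made-up duration (the function is repeated so that the snippet is self-contained):

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 125.7 seconds is 2 whole minutes plus 5.7 seconds, truncated to 5
mins, secs = epoch_time(0.0, 125.7)
print(f"Epoch Time: {mins}m {secs}s")  # Epoch Time: 2m 5s
```

The returned values are plain integers, so they can be dropped straight into an f-string when reporting progress.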

Finally, we train our model...

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'} for training.")

for epoch in range(N_EPOCHS):
    print(f'Epoch: {epoch+1:02}')
    start_time = time.time()

    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')

    valid_loss, valid_acc = evaluate(model, valid_dataloader, criterion)
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')

Using GPU for training.
Epoch: 01


	Training: 100%|██████████| 274/274 [01:00<00:00,  4.51it/s]


	Train Loss: 0.671 | Train Acc: 58.72%


	Evaluation: 100%|██████████| 118/118 [00:14<00:00,  7.97it/s]


	 Val. Loss: 0.667 |  Val. Acc: 56.71%


...and evaluate the checkpoint with the best validation loss on the test set.

In [None]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_dataloader, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

	Evaluation: 100%|██████████| 391/391 [00:48<00:00,  8.01it/s]

Test Loss: 0.670 | Test Acc: 56.50%





## User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

When using a model for inference, it should always be in evaluation mode. At this point the model is already in evaluation mode (from running `evaluate` on the test set), but we set it explicitly to avoid any risk.

Our `predict_sentiment` function does a few things:
- sets the model to evaluation mode
- puts the user input through the text processing pipeline
- squashes the output prediction from an unbounded real number to a value between 0 and 1 with the `sigmoid` function
- converts the tensor holding a single value into a Python float with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [None]:
def predict_sentiment(model, sentence):
    model.eval()
    # Reuse the training pipeline: tokenise, numericalise and pad a batch of one
    text_tensor, lengths = text_transform([sentence])
    text_tensor = text_tensor.to(DEVICE)
    with torch.no_grad():  # gradients are not needed for inference
        prediction = torch.sigmoid(model(text_tensor, lengths))
    return prediction.item()


An example negative review...

In [None]:
predict_sentiment(model, "This film is terrible")

0.42760223150253296

An example positive review...

In [None]:
predict_sentiment(model, "This film is great")

0.6321732401847839