STAT 453: Deep Learning (Spring 2021)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  

Course website: http://pages.stat.wisc.edu/~sraschka/teaching/stat453-ss2021/  
GitHub repository: https://github.com/rasbt/stat453-deep-learning-ss21

---

# Same as 1_lstm.ipynb but with packed sequences

Explanation of packing: https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
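
As a rough illustration of what packing does, the following sketch pads three toy sequences and then packs them. The token ids and the pad index `0` are made up for this example and are not taken from the IMDB data; the sequences are already sorted by decreasing length, which `pack_padded_sequence` expects by default.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three toy sequences of token ids (hypothetical values), longest first
seqs = [torch.tensor([4, 5, 6, 7]), torch.tensor([8, 9]), torch.tensor([3])]
lengths = torch.tensor([len(s) for s in seqs])

# Time-major padding: shape [max_len, batch_size]; 0 is the pad index here
padded = pad_sequence(seqs, padding_value=0)

# Packing keeps only the valid (non-padding) entries, one time step after another
packed = pack_padded_sequence(padded, lengths)

print(padded.shape)        # torch.Size([4, 3])
print(packed.data)         # tensor([4, 8, 3, 5, 9, 6, 7])
print(packed.batch_sizes)  # tensor([3, 2, 1, 1])
```

The `batch_sizes` tensor records how many sequences are still active at each time step, which lets the RNN skip the padding positions.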

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p torch,torchtext,datasets

import torch
import torch.nn.functional as F
import torchtext
import time
import random
import pandas as pd

torch.backends.cudnn.deterministic = True

Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.11
IPython version      : 7.21.0

torch    : 2.3.0
torchtext: 0.18.0
datasets : 3.0.1



## General Settings

In [2]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)

LEARNING_RATE = 0.005
BATCH_SIZE = 128
NUM_EPOCHS = 15
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 2

## Download Dataset

The following cells will download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file:

In [3]:
!wget https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz

--2021-04-12 22:05:05--  https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz [following]
--2021-04-12 22:05:05--  https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26521894 (25M) [application/octet-stream]
Saving to: ‘movie_data.csv.gz’


2021-04-12 22:05:07 (15.8 MB/s) - ‘movie_data.csv.gz’ saved [26521894/26521894]



In [4]:
!gunzip -f movie_data.csv.gz

Check that the dataset looks okay:

In [5]:
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [6]:
del df

## Prepare Dataset with Torchtext

In [7]:
%%capture --no-stderr

!pip install -U spacy==3.7.3
!python -m spacy download en_core_web_sm

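The cell above installs spaCy and downloads its small English pipeline (`en_core_web_sm`), whose tokenizer is used below. Outside a notebook, the pipeline can be downloaded via:

- `python -m spacy download en_core_web_sm`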

## Split Dataset into Train/Validation/Test

In [8]:
from datasets import load_dataset

dataset = load_dataset("csv", data_files="movie_data.csv")
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 50000
    })
})

Split the dataset into training, validation, and test partitions:

In [9]:
trainvalid_test_dataset = dataset['train'].train_test_split(test_size=0.2)
train_valid_dataset = trainvalid_test_dataset['train'].train_test_split(test_size=0.15)

train_data, valid_data = train_valid_dataset['train'], train_valid_dataset['test']
test_data = trainvalid_test_dataset['test']

train_data, valid_data, test_data

(Dataset({
     features: ['review', 'sentiment'],
     num_rows: 34000
 }),
 Dataset({
     features: ['review', 'sentiment'],
     num_rows: 6000
 }),
 Dataset({
     features: ['review', 'sentiment'],
     num_rows: 10000
 }))

In [10]:
print(f'Num Train: {len(train_data)}')
print(f'Num Valid: {len(valid_data)}')
print(f'Num Test: {len(test_data)}')

Num Train: 34000
Num Valid: 6000
Num Test: 10000
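
The partition sizes above follow from the two nested splits. A small arithmetic sketch, assuming the fractions divide the dataset evenly (as they do for 50,000 examples):

```python
n_total = 50_000

# First split: hold out 20% of all examples for testing
n_test = n_total * 20 // 100
n_train_valid = n_total - n_test

# Second split: hold out 15% of the remaining examples for validation
n_valid = n_train_valid * 15 // 100
n_train = n_train_valid - n_valid

print(n_train, n_valid, n_test)  # 34000 6000 10000
```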


In [11]:
train_data[0]

{'review': 'I am a member of a canoeing club and I can tell you the truth that Deliverance is synonomous with the peacefulness and tranquility of the experience. As we put our boats into the water, banjoes echo in the back of the conscious mind. This movie is timeless because it waxes philosophical of human\'s place in nature and technology\'s effect upon man\'s relationship with nature. We see it in the bow fishing. We see it in the home made tent. There is also city man\'s disdain and feeling of superiority to the rural woodsman "cracker". The fact that the Banker from Atlanta (Ned Beatty) has "bad teeth" is meant to put him on the same level with the woodsmen who also have bad teeth. Ultimately, the struggle of life and death supersedes "civilized man\'s" suppositives about "The Law". This canoe trip ends too soon for the viewer, but alas Not Soon Enough for the characters.',
 'sentiment': 1}

**Process the dataset:**

In [12]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [13]:
def tokenize_fn(row, nlp, lower=True):
    tokens = [token.text for token in nlp.tokenizer(row['review'])]
    if lower:
        tokens = [e.lower() for e in tokens]
    return {'tokens': tokens, 'TEXT_LENGTH': len(tokens)}

In [14]:
fn_kwargs={
    'nlp': nlp,
    'lower': True
}

train_data = train_data.map(tokenize_fn, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(tokenize_fn, fn_kwargs=fn_kwargs)
test_data = test_data.map(tokenize_fn, fn_kwargs=fn_kwargs)

Map: 100%|██████████| 34000/34000 [00:31<00:00, 1081.32 examples/s]
Map: 100%|██████████| 6000/6000 [00:04<00:00, 1248.14 examples/s]
Map: 100%|██████████| 10000/10000 [00:07<00:00, 1268.91 examples/s]


In [15]:
train_data[0]

{'review': 'I am a member of a canoeing club and I can tell you the truth that Deliverance is synonomous with the peacefulness and tranquility of the experience. As we put our boats into the water, banjoes echo in the back of the conscious mind. This movie is timeless because it waxes philosophical of human\'s place in nature and technology\'s effect upon man\'s relationship with nature. We see it in the bow fishing. We see it in the home made tent. There is also city man\'s disdain and feeling of superiority to the rural woodsman "cracker". The fact that the Banker from Atlanta (Ned Beatty) has "bad teeth" is meant to put him on the same level with the woodsmen who also have bad teeth. Ultimately, the struggle of life and death supersedes "civilized man\'s" suppositives about "The Law". This canoe trip ends too soon for the viewer, but alas Not Soon Enough for the characters.',
 'sentiment': 1,
 'tokens': ['i',
  'am',
  'a',
  'member',
  'of',
  'a',
  'canoeing',
  'club',
  'and',

## Build Vocabulary

Build the vocabulary:

In [16]:
import torchtext
torchtext.disable_torchtext_deprecation_warning()

from torchtext.vocab import build_vocab_from_iterator


# Do not create an index for tokens that appear fewer than min_freq times in our training set.
min_freq = 5

unk_token = "<unk>"
pad_token = "<pad>"

special_tokens = [
    unk_token,
    pad_token
]

en_vocab = build_vocab_from_iterator(
    train_data["tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

print(f'Vocabulary size: {len(en_vocab)}')

Vocabulary size: 36500


- The special tokens (`<unk>` and `<pad>`) are added to the vocabulary
- PyTorch RNNs can handle sequences of arbitrary length thanks to dynamic graphs, but the sequences in a given minibatch must be padded to the same length so that they can be stored in a single tensor

**Tokens corresponding to the first 10 indices (0, 1, ..., 9):**

In [17]:
print(en_vocab.get_itos()[:10]) # itos = integer-to-string

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


**Converting a string to an integer:**

In [18]:
print(en_vocab.get_stoi()['the']) # stoi = string-to-integer

2


**Using the `in` keyword to check whether a token is in the vocabulary:**

In [19]:
'the' in en_vocab

True

In [20]:
'The' in en_vocab

False

Using the `set_default_index` method, we can set the value that is returned when we look up the index of a token outside our vocabulary. Here, we use the index of the unknown token, `<unk>`.

In [21]:
unk_index = en_vocab[unk_token]
en_vocab.set_default_index(unk_index)

Now, we can look up the indices of out-of-vocabulary tokens:

In [22]:
en_vocab["The"]

0
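
To make the roles of `min_freq`, the special tokens, and the default index more concrete, here is a plain-Python sketch of the same idea. The helper names `build_vocab` and `lookup` and the toy documents are hypothetical and only serve this illustration; the tie-breaking order among equally frequent tokens is an assumption and may differ from what `torchtext` does internally.

```python
from collections import Counter


def build_vocab(token_lists, min_freq, specials):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by frequency (descending), then alphabetically to break ties
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, count in ordered if count >= min_freq]
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


docs = [["the", "movie", "is", "good"],
        ["the", "movie", "is", "bad"],
        ["the", "plot"]]

itos, stoi = build_vocab(docs, min_freq=2, specials=["<unk>", "<pad>"])
unk_index = stoi["<unk>"]


def lookup(tokens):
    # Tokens that were filtered out (or never seen) fall back to the <unk> index
    return [stoi.get(tok, unk_index) for tok in tokens]


print(itos)                                    # ['<unk>', '<pad>', 'the', 'is', 'movie']
print(lookup(["the", "plot", "is", "good"]))   # [2, 0, 3, 0]
```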

**Convert tokens into indices:**

In [23]:
tokens = ['they', 'are', 'very', 'unusual', 'in', 'a', 'period', 'drama', '.']

In [24]:
en_vocab.lookup_indices(tokens)

[40, 32, 63, 1756, 11, 6, 861, 466, 4]

Conversely, we can use the `lookup_tokens` method to convert a list of indices back into tokens using the vocabulary.

In [25]:
en_vocab.lookup_tokens(en_vocab.lookup_indices(tokens))

['they', 'are', 'very', 'unusual', 'in', 'a', 'period', 'drama', '.']

In [26]:
def numericalize_fn(row, vocab):
    token_ids = vocab.lookup_indices(row["tokens"])
    return {"token_ids": token_ids}

In [27]:
fn_kwargs = {"vocab": en_vocab}

train_data = train_data.map(numericalize_fn, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(numericalize_fn, fn_kwargs=fn_kwargs)
test_data = test_data.map(numericalize_fn, fn_kwargs=fn_kwargs)

Map: 100%|██████████| 34000/34000 [00:14<00:00, 2304.65 examples/s]
Map: 100%|██████████| 6000/6000 [00:02<00:00, 2412.62 examples/s]
Map: 100%|██████████| 10000/10000 [00:04<00:00, 2351.91 examples/s]


In [28]:
train_data[0]

{'review': 'I am a member of a canoeing club and I can tell you the truth that Deliverance is synonomous with the peacefulness and tranquility of the experience. As we put our boats into the water, banjoes echo in the back of the conscious mind. This movie is timeless because it waxes philosophical of human\'s place in nature and technology\'s effect upon man\'s relationship with nature. We see it in the bow fishing. We see it in the home made tent. There is also city man\'s disdain and feeling of superiority to the rural woodsman "cracker". The fact that the Banker from Atlanta (Ned Beatty) has "bad teeth" is meant to put him on the same level with the woodsmen who also have bad teeth. Ultimately, the struggle of life and death supersedes "civilized man\'s" suppositives about "The Law". This canoe trip ends too soon for the viewer, but alas Not Soon Enough for the characters.',
 'sentiment': 1,
 'tokens': ['i',
  'am',
  'a',
  'member',
  'of',
  'a',
  'canoeing',
  'club',
  'and',

In [29]:
en_vocab.lookup_tokens(train_data[0]["token_ids"])

['i',
 'am',
 'a',
 'member',
 'of',
 'a',
 '<unk>',
 'club',
 'and',
 'i',
 'can',
 'tell',
 'you',
 'the',
 'truth',
 'that',
 'deliverance',
 'is',
 '<unk>',
 'with',
 'the',
 '<unk>',
 'and',
 'tranquility',
 'of',
 'the',
 'experience',
 '.',
 'as',
 'we',
 'put',
 'our',
 'boats',
 'into',
 'the',
 'water',
 ',',
 '<unk>',
 'echo',
 'in',
 'the',
 'back',
 'of',
 'the',
 'conscious',
 'mind',
 '.',
 'this',
 'movie',
 'is',
 'timeless',
 'because',
 'it',
 'waxes',
 'philosophical',
 'of',
 'human',
 "'s",
 'place',
 'in',
 'nature',
 'and',
 'technology',
 "'s",
 'effect',
 'upon',
 'man',
 "'s",
 'relationship',
 'with',
 'nature',
 '.',
 'we',
 'see',
 'it',
 'in',
 'the',
 'bow',
 'fishing',
 '.',
 'we',
 'see',
 'it',
 'in',
 'the',
 'home',
 'made',
 'tent',
 '.',
 'there',
 'is',
 'also',
 'city',
 'man',
 "'s",
 'disdain',
 'and',
 'feeling',
 'of',
 'superiority',
 'to',
 'the',
 'rural',
 '<unk>',
 '"',
 'cracker',
 '"',
 '.',
 'the',
 'fact',
 'that',
 'the',
 'banker'

In [30]:
column_mapping = {
    'token_ids': 'TEXT_COLUMN_NAME',
    'sentiment': 'LABEL_COLUMN_NAME'
}

train_data = train_data.rename_columns(column_mapping)
valid_data = valid_data.rename_columns(column_mapping)
test_data = test_data.rename_columns(column_mapping)

In [31]:
train_data[0]

{'review': 'I am a member of a canoeing club and I can tell you the truth that Deliverance is synonomous with the peacefulness and tranquility of the experience. As we put our boats into the water, banjoes echo in the back of the conscious mind. This movie is timeless because it waxes philosophical of human\'s place in nature and technology\'s effect upon man\'s relationship with nature. We see it in the bow fishing. We see it in the home made tent. There is also city man\'s disdain and feeling of superiority to the rural woodsman "cracker". The fact that the Banker from Atlanta (Ned Beatty) has "bad teeth" is meant to put him on the same level with the woodsmen who also have bad teeth. Ultimately, the struggle of life and death supersedes "civilized man\'s" suppositives about "The Law". This canoe trip ends too soon for the viewer, but alas Not Soon Enough for the characters.',
 'LABEL_COLUMN_NAME': 1,
 'tokens': ['i',
  'am',
  'a',
  'member',
  'of',
  'a',
  'canoeing',
  'club',


The `datasets` library can also convert features to the correct type for us. The token indices in each example are currently plain Python integers, but they must be PyTorch tensors before the model can consume them. We could convert them just before passing them to the model; however, it is more convenient to do it now.

The `with_format` method converts the features listed in the `columns` argument to a given type. Here, we specify the type as "torch" (for PyTorch) and the columns as "TEXT_COLUMN_NAME", "LABEL_COLUMN_NAME", and "TEXT_LENGTH" (the features we want as PyTorch tensors; the lengths are needed later for packing). By default, `with_format` drops any features not listed in `columns`. To keep them, pass `output_all_columns=True`.

In [32]:
data_type = "torch"
format_columns = ["TEXT_COLUMN_NAME", "LABEL_COLUMN_NAME", "TEXT_LENGTH"]

train_data = train_data.with_format(
    type=data_type,
    columns=format_columns,
    # output_all_columns=True
)

valid_data = valid_data.with_format(
    type=data_type,
    columns=format_columns,
    #output_all_columns=True,
)

test_data = test_data.with_format(
    type=data_type,
    columns=format_columns,
    #output_all_columns=True,
)

In [33]:
train_data[0]

{'LABEL_COLUMN_NAME': tensor(1),
 'TEXT_LENGTH': tensor(182),
 'TEXT_COLUMN_NAME': tensor([   12,   244,     6,  1727,     7,     6,     0,  1384,     5,    12,
            72,   393,    26,     2,   884,    14,  7581,     9,     0,    21,
             2,     0,     5, 26161,     7,     2,   599,     4,    20,    82,
           286,   273,  9402,    97,     2,  1003,     3,     0,  9157,    11,
             2,   160,     7,     2,  5528,   351,     4,    13,    23,     9,
          3914,    98,    10, 28001,  4486,     7,   405,    16,   294,    11,
           919,     5,  2192,    16,   942,   695,   141,    16,   659,    21,
           919,     4,    82,    79,    10,    11,     2,  5840,  7377,     4,
            82,    79,    10,    11,     2,   361,   107, 10487,     4,    48,
             9,   100,   546,   141,    16, 10043,     5,   582,     7, 16912,
             8,     2,  3998,     0,    15, 12243,    15,     4,     2,   210,
            14,     2, 17041,    45, 10679,    29

In [34]:
train_data = train_data.sort(column_names=['TEXT_LENGTH'], reverse=True)
valid_data = valid_data.sort(column_names=['TEXT_LENGTH'], reverse=True)
test_data = test_data.sort(column_names=['TEXT_LENGTH'], reverse=True)

In [35]:
train_data[0]

{'LABEL_COLUMN_NAME': tensor(1),
 'TEXT_LENGTH': tensor(2789),
 'TEXT_COLUMN_NAME': tensor([1091,  521,   98,  ...,   18,  161,  533])}


## Define Data Loaders

In [36]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_text_ids = [data["TEXT_COLUMN_NAME"] for data in batch]
        batch_text_ids = torch.nn.utils.rnn.pad_sequence(batch_text_ids, padding_value=pad_index)
        batch = {
            "TEXT_COLUMN_NAME": batch_text_ids,
            "LABEL_COLUMN_NAME": torch.tensor([data["LABEL_COLUMN_NAME"] for data in batch]),
            "TEXT_LENGTH": torch.tensor([data["TEXT_LENGTH"] for data in batch])
        }
        return batch

    return collate_fn

In [37]:
def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

In [38]:
pad_index = en_vocab[pad_token]

train_loader = get_data_loader(train_data, BATCH_SIZE, pad_index) # NEW
valid_loader = get_data_loader(valid_data, BATCH_SIZE, pad_index)
test_loader = get_data_loader(test_data, BATCH_SIZE, pad_index)

Testing the iterators (note that the number of rows depends on the longest document in the respective batch):

In [39]:
print('Train')
for batch in train_loader:
    print(f'Text matrix size: {batch["TEXT_COLUMN_NAME"].size()}')
    print(f'Target vector size: {batch["LABEL_COLUMN_NAME"].size()}')
    print(f'Text Length vector size: {batch["TEXT_LENGTH"].size()}')
    break

print('\nValid:')
for batch in valid_loader:
    print(f'Text matrix size: {batch["TEXT_COLUMN_NAME"].size()}')
    print(f'Target vector size: {batch["LABEL_COLUMN_NAME"].size()}')
    print(f'Text Length vector size: {batch["TEXT_LENGTH"].size()}')
    break

print('\nTest:')
for batch in test_loader:
    print(f'Text matrix size: {batch["TEXT_COLUMN_NAME"].size()}')
    print(f'Target vector size: {batch["LABEL_COLUMN_NAME"].size()}')
    print(f'Text Length vector size: {batch["TEXT_LENGTH"].size()}')
    break

Train
Text matrix size: torch.Size([1136, 128])
Target vector size: torch.Size([128])
Text Length vector size: torch.Size([128])

Valid:
Text matrix size: torch.Size([55, 128])
Target vector size: torch.Size([128])
Text Length vector size: torch.Size([128])

Test:
Text matrix size: torch.Size([52, 128])
Target vector size: torch.Size([128])
Text Length vector size: torch.Size([128])
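
The differing first dimensions above come from `pad_sequence`, which pads each batch only up to its own longest sequence. A small sketch with made-up token ids, assuming a pad index of `1` as in the vocabulary above:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_index = 1  # assumed to match the position of '<pad>' in the vocabulary

# Two toy batches (hypothetical token ids) with different maximum lengths
batch_a = [torch.tensor([5, 6, 7, 8, 9]), torch.tensor([5, 6])]
batch_b = [torch.tensor([5, 6, 7]), torch.tensor([5])]

padded_a = pad_sequence(batch_a, padding_value=pad_index)
padded_b = pad_sequence(batch_b, padding_value=pad_index)

print(padded_a.shape)   # torch.Size([5, 2])
print(padded_b.shape)   # torch.Size([3, 2])
print(padded_b[:, 1])   # tensor([5, 1, 1])
```

Because the datasets are sorted by length, neighboring examples have similar lengths, so each batch needs comparatively little padding.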


## Model

In [40]:
class RNN(torch.nn.Module):

    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        #self.rnn = torch.nn.RNN(embedding_dim,
        #                        hidden_dim,
        #                        nonlinearity='relu')
        self.rnn = torch.nn.LSTM(embedding_dim,
                                 hidden_dim)

        self.fc = torch.nn.Linear(hidden_dim, output_dim)


    def forward(self, text, text_length):
        # text dim: [sentence length, batch size]

        embedded = self.embedding(text)
        # embedded dim: [sentence length, batch size, embedding dim]

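        ## NEW: pack the padded batch so that the LSTM skips the padding positions
        # (lengths must live on the CPU and, by default, be sorted in decreasing order)
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, text_length.to('cpu'))

        output, (hidden, cell) = self.rnn(packed)
        # output: a PackedSequence (not a padded tensor), because the input was packed
        # hidden dim: [1, batch size, hidden dim]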

        hidden.squeeze_(0)
        # hidden dim: [batch size, hidden dim]

        output = self.fc(hidden)
        return output
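
One way to see why packing matters for the final hidden state that is passed to `self.fc`: with a packed input, the hidden state of a short sequence is taken at its last valid step rather than after the padding. The sketch below checks this on a tiny, randomly initialized LSTM; the dimensions and inputs are arbitrary choices for illustration only.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=3, hidden_size=4)

# Time-major toy batch: [max_len=5, batch_size=2, features=3]
x = torch.randn(5, 2, 3)
lengths = torch.tensor([5, 2])  # the second sequence has only 2 valid steps
x[2:, 1] = 0.0                  # emulate padding in the second sequence

with torch.no_grad():
    # Packed input: padding steps are skipped
    _, (h_packed, _) = lstm(pack_padded_sequence(x, lengths))
    # Reference: run the second sequence on its own, without any padding
    _, (h_short, _) = lstm(x[:2, 1:2])

# The packed result matches the unpadded reference for the short sequence
print(torch.allclose(h_packed[0, 1], h_short[0, 0], atol=1e-5))  # True
```

Without packing, the LSTM would keep updating the hidden state on the padding steps, so the final hidden state of shorter sequences would generally differ from this reference.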

In [41]:
torch.manual_seed(RANDOM_SEED)
model = RNN(input_dim=len(en_vocab),
            embedding_dim=EMBEDDING_DIM,
            hidden_dim=HIDDEN_DIM,
            output_dim=NUM_CLASSES # could use 1 for binary classification
)

model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

## Training

In [42]:
def compute_accuracy(model, data_loader, device):

    model.eval()  # evaluation mode; the training loop calls model.train() again each epoch
    with torch.inference_mode():

        correct_pred, num_examples = 0, 0

        for batch_data in data_loader:
            features = batch_data["TEXT_COLUMN_NAME"]
            text_length = batch_data["TEXT_LENGTH"] # NEW
            targets = batch_data["LABEL_COLUMN_NAME"]

            features = features.to(device)
            targets = targets.float().to(device)

            logits = model(features, text_length)
            _, predicted_labels = torch.max(logits, 1)

            num_examples += targets.size(0)
            correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100
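
The core of `compute_accuracy` can be tried on a handful of made-up logits. In this sketch, the values of `logits` and `targets` are hypothetical; `torch.max(logits, 1)` returns the maximum value and its index per row, and the index serves as the predicted class label.

```python
import torch

# Four examples, two classes (hypothetical values)
logits = torch.tensor([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 0.2],
                       [1.5, 1.0]])
targets = torch.tensor([0, 1, 0, 0])

_, predicted = torch.max(logits, 1)  # index of the largest logit in each row
accuracy = (predicted == targets).sum().float() / targets.size(0) * 100

print(predicted)        # tensor([0, 1, 1, 0])
print(accuracy.item())  # 75.0
```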

In [43]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, batch_data in enumerate(train_loader):

        text = batch_data["TEXT_COLUMN_NAME"].to(DEVICE)
        text_length = batch_data["TEXT_LENGTH"] # NEW
        labels = batch_data["LABEL_COLUMN_NAME"].to(DEVICE)

        ### FORWARD AND BACK PROP
        logits = model(text, text_length)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()

        loss.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()

        ### LOGGING
        if not batch_idx % 50:
            print (f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                   f'Batch {batch_idx:03d}/{len(train_loader):03d} | '
                   f'Loss: {loss:.4f}')

    with torch.inference_mode():
        print(f'training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')

    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 001/015 | Batch 000/266 | Loss: 0.7058
Epoch: 001/015 | Batch 050/266 | Loss: 0.6915
Epoch: 001/015 | Batch 100/266 | Loss: 0.6958
Epoch: 001/015 | Batch 150/266 | Loss: 0.6933
Epoch: 001/015 | Batch 200/266 | Loss: 0.6998
Epoch: 001/015 | Batch 250/266 | Loss: 0.6887
training accuracy: 50.08%
valid accuracy: 49.23%
Time elapsed: 0.79 min
Epoch: 002/015 | Batch 000/266 | Loss: 0.6926
Epoch: 002/015 | Batch 050/266 | Loss: 0.6938
Epoch: 002/015 | Batch 100/266 | Loss: 0.6937
Epoch: 002/015 | Batch 150/266 | Loss: 0.6915
Epoch: 002/015 | Batch 200/266 | Loss: 0.6900
Epoch: 002/015 | Batch 250/266 | Loss: 0.6904
training accuracy: 50.16%
valid accuracy: 51.15%
Time elapsed: 1.60 min
Epoch: 003/015 | Batch 000/266 | Loss: 0.6902
Epoch: 003/015 | Batch 050/266 | Loss: 0.6932
Epoch: 003/015 | Batch 100/266 | Loss: 0.7039
Epoch: 003/015 | Batch 150/266 | Loss: 0.6932
Epoch: 003/015 | Batch 200/266 | Loss: 0.6928
Epoch: 003/015 | Batch 250/266 | Loss: 0.6915
training accuracy: 50.20%
va
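
As a reference point for reading the log above: a two-class model that assigns equal probability to both classes has a cross-entropy loss of ln 2. The short calculation below shows that value; the logged losses and the roughly 50% accuracies in these first epochs are close to this chance level.

```python
import math

num_classes = 2

# Cross-entropy of a uniform prediction: -log(1 / num_classes)
chance_loss = -math.log(1.0 / num_classes)

print(f"{chance_loss:.4f}")  # 0.6931
```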

In [44]:
import spacy


def predict(model, sentence):
    model.eval()
    with torch.inference_mode():
        tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
        indexed = [en_vocab[t.lower()] for t in tokenized]
        length = [len(indexed)]
        tensor = torch.LongTensor(indexed).to(DEVICE)
        tensor = tensor.unsqueeze(1)
        length_tensor = torch.LongTensor(length)
        predict_probas = torch.nn.functional.softmax(model(tensor, length_tensor), dim=1)
        predicted_label_index = torch.argmax(predict_probas)
        predicted_label_proba = torch.max(predict_probas)
        # classes: {0: Negative, 1: Positive}
        return predicted_label_index.item(), predicted_label_proba.item()
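
The last steps of `predict` turn logits into a label and a probability. The sketch below uses made-up logits for a single sentence. Note that `predict` calls `torch.argmax` without a `dim` argument, which indexes into the flattened tensor; that coincides with the class index only because the batch contains one sentence. Passing `dim=1`, as assumed here, would also work for larger batches.

```python
import torch

logits = torch.tensor([[1.0, 3.0]])  # shape [1, 2]: one sentence, two classes
probas = torch.nn.functional.softmax(logits, dim=1)

label = torch.argmax(probas, dim=1).item()  # explicit dim instead of the flattened index
proba = probas[0, label].item()

print(label)            # 1
print(round(proba, 4))  # 0.8808
```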

In [45]:
print('Probability positive:')

predicted_label, predicted_label_proba = \
    predict(model, "This is such an awesome movie, I really love it!")

print(f'Predicted label index: {predicted_label}'
      f' | Probability: {predicted_label_proba}')

Probability positive:
Predicted label index: 0 | Predicted label: 1 | Probability: 0.9999784231185913


In [46]:
print('Probability negative:')

predicted_label, predicted_label_proba = \
    predict(model, "I really hate this movie. It is really bad and sucks!")

print(f'Predicted label index: {predicted_label}'
      f' | Probability: {predicted_label_proba}')

Probability negative:
Predicted label index: 1 | Predicted label: 0 | Probability: 0.9999545812606812


In [47]:
%watermark -iv

pandas   : 2.2.2
spacy    : 3.7.3
torch    : 2.3.0
torchtext: 0.18.0
datasets : 3.0.1

