**Note:** this notebook can run without a GPU at little cost in running time.

## Overview

The purpose is to use PyTorch and PyTorch Lightning to compare the accuracy and methodology of fine-tuning a pre-trained BERT model versus Word2Vec for sentiment analysis.


## Goals

This is an NLP text classification task: predict whether a sentence has positive or negative sentiment. This notebook shows how to:
1. Train a deep learning model in PyTorch.
1. Understand pretrained features and foundation models.
1. Make predictions on new examples using a trained model.
1. Evaluate the performance of a trained model.
1. Get familiar with PyTorch and PyTorch Lightning.
1. Compare the performance with plain vanilla Word2Vec.


# Predictions with a fine-tuned pre-trained BERT model



In [11]:
from IPython.display import clear_output

In [12]:
!pip install datasets
!pip install transformers
!pip install pytorch-lightning
!pip install dotmap
!pip install jsonargparse[signatures]
!pip install --upgrade --no-cache-dir gdown
!pip install torchmetrics

import nltk                           # for NLP utilities
import random
import torch                          # deep learning utilities
import numpy as np                    # for array manipulation
import pandas as pd                   # for datasets
from tqdm import tqdm                 # for iteration counters
import matplotlib.pyplot as plt       # for plotting and visualization
from dotmap import DotMap             # for configs
import torchmetrics                   # for metrics
plt.style.use('ggplot')

from datasets import load_dataset     # to download datasets

clear_output()  # clears the output of the cell

### Dataset

We use the Stanford Sentiment TreeBank dataset, or SST. It contains a little under 12k sentences (11,855 across the three splits), each annotated with a score between 0 and 1, with 1 indicating positive sentiment. For example, an entry in the dataset is:

> "This was the worst restaurant I have ever had the misfortune of eating at."

Our goal is to train a deep learning model to infer the sentiment for sentences in the test set. A well-trained model will be able to correctly predict the sentiment for new unseen sentences.

In [None]:
# loading the dataset
train_dataset = load_dataset('sst', split='train')
dev_dataset = load_dataset('sst', split='validation')
test_dataset = load_dataset('sst', split='test')

The dataset is pre-split for us into three portions:
- **training**: as the name suggests, this is used to train the model.
- **dev**: this is typically used to tune any hyperparameters, and to detect overfitting or underfitting.
- **test**: this is used to measure final performance. It should not be used to decide model parameters or hyperparameters.

In [None]:
print(f'{train_dataset.num_rows:,} training examples')
print(f'{dev_dataset.num_rows:,} dev examples')
print(f'{test_dataset.num_rows:,} test examples')

8,544 training examples
1,101 dev examples
2,210 test examples


To get a sense of the data, we can inspect the first example in the training dataset.

In [None]:
print(f"Sentence: {train_dataset[0]['sentence']}")
print(f"Label: {train_dataset[0]['label']}")

Sentence: The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
Label: 0.6944400072097778


Notice the label is a floating point number. We will be simplifying this to just two labels (by rounding to the closest integer). This way, our model will only need to predict positive or negative.
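
As a quick check of the rounding rule, here is a minimal sketch. The helper name `binarize` is hypothetical and not part of the notebook. For scores in [0, 1] it is meant to agree with Python's built-in `round`, which sends the tie 0.5 to 0.

```python
def binarize(score: float, threshold: float = 0.5) -> int:
    """Map a sentiment score in [0, 1] to a binary label (hypothetical helper)."""
    return int(score > threshold)


example_score = 0.6944400072097778   # label of the first training sentence
print(binarize(example_score))       # 1, i.e. positive
print(round(0.5), binarize(0.5))     # 0 0: the tie goes to the negative class
```

Binarizing throws away how strongly positive or negative a sentence is, which is the price paid for a simpler prediction target.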

### Tokenization

Natural language is not the easiest form of input for neural networks to digest: the vocabulary (or number of unique words) is very high, most words do not appear in every sentence, and sentences can be of very different lengths. What is commonly done in practice is *tokenization* where every unique word (or word piece) is mapped to a unique integer. For example, cat --> 0, dog --> 1, ... Then a sentence is converted to a vector of integers, rather than a vector of strings. Since proper tokenization can be tricky, we use the Huggingface toolkit.
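
To make the word-to-integer mapping concrete, here is a toy word-level tokenizer. It is only a sketch: `toy_vocab`, `toy_encode` and the `unk_id` convention are made up for illustration, and real tokenizers (including the one used below) split text into subword pieces instead.

```python
def toy_vocab(sentences):
    """Assign each unique lower-cased word the next free integer id."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def toy_encode(sentence, vocab, unk_id=-1):
    """Turn a sentence into a list of ids; unknown words get unk_id."""
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]


corpus = ["the cat sat", "the dog sat down"]
vocab = toy_vocab(corpus)
print(vocab)                              # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print(toy_encode("the dog sat", vocab))   # [0, 3, 2]
print(toy_encode("the bird sat", vocab))  # [0, -1, 2]
```

The unknown word in the last line shows why word-level vocabularies are brittle, and why subword tokenization is usually preferred.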

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base') # download a few files

In [None]:
tokenized = tokenizer(
  'hello world!',
  truncation=True,
  padding='max_length',  # pad to the model's maximum length; replaces the deprecated pad_to_max_length
  return_attention_mask=True,
  return_tensors='pt',
)

In [None]:
tokenized.keys()

dict_keys(['input_ids', 'attention_mask'])

There are two keys in the `tokenized` object:

1. `input_ids` contains a torch tensor of token ids (vocab ids).
2. `attention_mask` marks which positions hold real tokens (1) and which hold padding only (0). The maximum sequence length for `roberta-base` is 512 tokens, so with `padding='max_length'` a shorter sentence is padded to 512.
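
The relationship between padding and the attention mask can be sketched in plain Python. `pad_with_mask` is a hypothetical helper, and `pad_id=1` is an assumption about the padding id; the real tokenizer handles all of this internally.

```python
def pad_with_mask(token_ids, max_length, pad_id=1):
    """Pad (or naively cut) a list of token ids and build the matching mask.

    Note: a real tokenizer keeps the end-of-sentence token when truncating;
    this sketch simply cuts the list.
    """
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        'input_ids': ids + [pad_id] * n_pad,
        'attention_mask': [1] * len(ids) + [0] * n_pad,
    }


out = pad_with_mask([0, 42891, 232, 328, 2], max_length=8)
print(out['input_ids'])       # [0, 42891, 232, 328, 2, 1, 1, 1]
print(out['attention_mask'])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask lets the model ignore the padded positions, so padding changes the tensor shape but not the meaning of the sentence.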

In [None]:
print(tokenized['input_ids'][0][:5])

tensor([    0, 42891,   232,   328,     2])


There are 5 non-padding tokens. The middle three correspond to "hello", " world", and "!". The first one is a special token that marks the start of the sentence, and the fifth is a special token that marks its end.

### Data I/O

We build a `Dataset` object to serve data to a deep learning model. PyTorch offers a `Dataset` class that helps us do this. In particular, this class has three important methods:
1. `__init__`: often, the initialization function is used to preload data and set instance variables.
2. `__getitem__`: this function returns the example at position `index`. Its outputs are what get grouped into minibatches.
3. `__len__`: this computes the total number of examples and is used when looping through the dataset.
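
Before the real class, here is a dependency-free sketch of the same three-method protocol. `ToyDataset` and `minibatches` are illustrative names only; in the notebook, PyTorch's `Dataset` and `DataLoader` play these roles.

```python
class ToyDataset:
    """A map-style dataset: indexable, with a known length."""

    def __init__(self, sentences, scores):
        # preload the data and keep it on the instance
        assert len(sentences) == len(scores)
        self.sentences = sentences
        self.scores = scores

    def __getitem__(self, index):
        # return one example, with the score binarized to 0 or 1
        return {'sentence': self.sentences[index],
                'label': int(self.scores[index] > 0.5)}

    def __len__(self):
        return len(self.sentences)


def minibatches(dataset, batch_size):
    """Yield consecutive lists of examples, the way a data loader would."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy = ToyDataset(['a gorgeous film', 'a dull mess', 'fine'], [0.9, 0.1, 0.6])
print(len(toy))                                  # 3
print(toy[0])                                    # {'sentence': 'a gorgeous film', 'label': 1}
print([len(b) for b in minibatches(toy, 2)])     # [2, 1]
```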

In [None]:
from torch.utils.data import Dataset


class SST(Dataset):
  """
  The Stanford Sentiment TreeBank dataset.

  Argument
  --------
  split: (str) the dataset portion
    Options - train | dev | test
  """

  def __init__(self, split = 'train'):
    super().__init__()
    assert split in ['train', 'dev', 'test'], f"Split {split} not supported."
    if split == 'dev':
      split = 'validation'
    self.split = split
    self.data = load_dataset('sst', split = split)
    self.tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

  def __getitem__(self, index):
    sentence = self.data[index]['sentence']
    label = round(self.data[index]['label']) # rounding automatically creates a label 0-1

    tokenized = self.tokenizer(
      sentence,
      truncation=True,
      padding='max_length',  # replaces the deprecated pad_to_max_length
      return_attention_mask=True,
      return_tensors='pt',
    )
    output = {
      'input_ids': tokenized['input_ids'].squeeze(0),
      'attention_mask': tokenized['attention_mask'].squeeze(0),
      'label': label,
    }
    return output

  def __len__(self):
    return self.data.num_rows

Let's create a training dataset and look at the first item.

In [None]:
dataset = SST(split = 'train')
row = dataset.__getitem__(0)
print(row.keys())

dict_keys(['input_ids', 'attention_mask', 'label'])


### Pretrained Features

Training high-quality NLP models usually requires billions of data points crawled from all over the internet, which only a few companies can afford to do.
What we can do instead is leverage *pre-trained models*: large models specialized for a modality (e.g. natural language) that convert a sentence into a fixed-size vector representation.

A popular pretrained model for text is BERT (Bidirectional Encoder Representations from Transformers), which has been shown in research and practice to perform well on text classification tasks like sentiment analysis. We will use Hugging Face's pretrained `roberta-base` model, a BERT variant, to compute pretrained features.

In [None]:
from transformers import RobertaModel

pretrained = RobertaModel.from_pretrained("roberta-base")  # downloads a large model file
# Ignore the warning about weights not being initialized

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's try feeding a dataset entry into the model.

In [None]:
dataset = SST(split = 'train')
row = dataset.__getitem__(0)

output = pretrained(
  input_ids = row['input_ids'].unsqueeze(0),
  attention_mask = row['attention_mask'].unsqueeze(0),
)

The output contains two fields:
- `last_hidden_state` is a tensor of shape (1, 512, 768). The first dimension is 1 because we only passed in 1 example. The second dimension is 512 because it is the maximum length allowed. The third dimension is 768, the pretrained model feature size.
- `pooler_output` is a tensor of shape (1, 768). It represents a "pooled" version of the `last_hidden_state` tensor. We will be treating this as our pretrained features that will be fed to the fine-tuning code!

In [None]:
output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

This BERT model is too big to be trained on Colab, but the features it produces for this dataset are available for download. These features become the input of the fine-tuning part.

In [9]:
# from my google drive PORTFOLIO/NLP/BERT-SST
!gdown 1-8PVSg0SKlfaJh4w7H5M6Shpum9dH87f
!gdown 1-8y5zOsYfC2MOzoQUT5k-IFqVHFZlvJY
!gdown 1eJpUgd1VOeb2lZkV7zcWFUlcNPQdWnFE

Downloading...
From: https://drive.google.com/uc?id=1-8PVSg0SKlfaJh4w7H5M6Shpum9dH87f
To: /content/sst-roberta-train.pt
100% 26.2M/26.2M [00:00<00:00, 60.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-8y5zOsYfC2MOzoQUT5k-IFqVHFZlvJY
To: /content/sst-roberta-test.pt
100% 6.79M/6.79M [00:00<00:00, 59.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1eJpUgd1VOeb2lZkV7zcWFUlcNPQdWnFE
To: /content/sst-roberta-dev.pt
100% 3.38M/3.38M [00:00<00:00, 134MB/s]


In [None]:
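# map_location='cpu' lets the tensors load on a machine without a GPU
train_features = torch.load('sst-roberta-train.pt', map_location='cpu')
dev_features   = torch.load('sst-roberta-dev.pt', map_location='cpu')
test_features  = torch.load('sst-roberta-test.pt', map_location='cpu')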

> Let's treat these features as the new data, and create a Dataset object to feed PyTorch.

In [None]:
from torch.utils.data import Dataset


class SSTBERT(Dataset):
  """
  The Stanford Sentiment TreeBank dataset with BERT features.

  Argument
  --------
  split: (str) the dataset portion
    Options - train | dev | test
  """

  def __init__(self, split = 'train'):
    super().__init__()
    assert split in ['train', 'dev', 'test'], f"Split {split} not supported."
    # map_location='cpu' avoids a load error when no GPU is available
    self.features = torch.load(f'sst-roberta-{split}.pt', map_location='cpu')
    if split == 'dev': split = 'validation'
    self.split = split
    self.data = load_dataset('sst', split = split)

  def __getitem__(self, index):
    # no tokenization here: the features are already the output of the
    # pretrained model, so we only look up the vector and its label
    features = self.features[index]      # torch.FloatTensor of size 768
    label = self.data[index]['label']    # float score in [0, 1]
    return features, round(label)        # binarize the score to 0 or 1

  def __len__(self):
    return self.data.num_rows

In [None]:
SSTBERT('train').data[0].keys()

dict_keys(['sentence', 'label', 'tokens', 'tree'])

In [None]:
ex_feat, ex_label = SSTBERT('train').__getitem__(0)
ex_feat.size()   # this will be our input dimension

torch.Size([768])

## Fine-tuning

Let's train a multi-layer perceptron (MLP) on top of the pretrained features to predict sentiment.

This is a demo project, so I will use an easy metric: accuracy. Scikit-learn provides many [metrics for binary classifiers](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [14]:
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
clear_output()

Let's try a simple MLP

In [None]:
from collections import OrderedDict

class MLP(nn.Module):
  """
  A multi-layer perceptron.
  """

  def __init__(self, input_dim, output_dim, hidden_dim=64):
    super().__init__()

    # https://medium.com/writeasilearn/using-sequential-module-to-build-a-neural-network-a34ca3f37203

    self.hidden_dim = hidden_dim

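    # NOTE: fc1 is never called by forward(), but its parameters are still
    # registered and counted in the model summary. Despite the 'conv' names its
    # layers are Linear, and the ReLU right before the Sigmoid would keep its
    # outputs at 0.5 or above.
    self.fc1 = nn.Sequential(OrderedDict([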
          ('conv1', nn.Linear(input_dim, hidden_dim)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Linear(hidden_dim, output_dim)),
          ('relu2', nn.ReLU()),
          ('sigm', nn.Sigmoid())]))

    # works best with normalization (convergence) and dropout (variance)
    # a sequential model does all the forwarding for us
    self.fc = nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(hidden_dim, output_dim),
        nn.Sigmoid())


  def forward(self, x):

    probs = self.fc(x) # prob that x has a positive sentiment

    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # ================================
    return probs

In [15]:
# I like to look at the data at times to keep a mental picture of how it is transformed
input = torch.randn(5)
nn.Sigmoid()(input)

tensor([0.6751, 0.5322, 0.8084, 0.5032, 0.4480])

Let's use [PyTorch Lightning](https://www.pytorchlightning.ai/tutorials) which offers an easy-to-use framework that makes the many moving pieces of deep learning training feel more manageable.

In [None]:
import pytorch_lightning as pl

In [None]:
class SSTSystem(pl.LightningModule):

  def __init__(self):
    super().__init__()

    input_dim = 768  # size of our features (see before)
    output_dim = 1   # sentiment = binary

    self.model = MLP(input_dim, output_dim)
    # self.f1 = torchmetrics.F1Score('binary', num_classes=1)
    self.accuracy = accuracy_score
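    # NOTE: self.model already ends in a Sigmoid, so nn.BCELoss() is the usual
    # loss for 0/1 targets. CrossEntropyLoss on a 1-D batch of probabilities
    # treats the batch as the class dimension and applies a softmax across it.
    self.loss = torch.nn.CrossEntropyLoss()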

  def forward(self, features):
    # transforms the features into output probability

    probs = self.model(features)

    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # ================================
    return probs

  def configure_optimizers(self, lr: float =1e-3):

    optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    return optimizer

  def _common_step(self, batch, batch_idx):

    features, labels = batch

    probs = self.forward(features)
    loss = self.loss(probs.view(-1), labels.float())

    # probs: torch.FloatTensor
    #   shape: batch_size x 1  <<< the 2nd dimension will have to be squeezed out
    # loss: torch.FloatTensor
    #   shape: 1


    with torch.no_grad():

      preds = (probs > 0.5).squeeze(1) # removes dimensions=1 ie shape [123,1] -> [123]

      accuracy = self.accuracy(preds.to('cpu'), labels.to('cpu')) # to('cpu') is necessary with sklearn

      # preds: torch.FloatTensor or np.array
      # accuracy: float
      # ================================

    return loss, accuracy

  def training_step(self, train_batch, batch_idx):
    loss, acc = self._common_step(train_batch, batch_idx)
    self.log('train_loss', loss)
    self.log('train_acc', acc, prog_bar=True)
    return loss

  def validation_step(self, dev_batch, batch_idx):
    loss, acc = self._common_step(dev_batch, batch_idx)
    self.log('dev_loss', loss)
    self.log('dev_acc', acc, prog_bar=True)

  def test_step(self, test_batch, batch_idx):
    loss, acc = self._common_step(test_batch, batch_idx)
    self.log('test_loss', loss)
    self.log('test_acc', acc)

  def predict_step(self, batch, batch_idx):
    return self.forward(batch[0])
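
Since the MLP ends in a sigmoid, its output is already a probability, and the loss most commonly paired with that is binary cross-entropy (`torch.nn.BCELoss` in PyTorch). The plain-Python sketch below only illustrates that formula. It is not the loss used to produce the results in this notebook, and `binary_cross_entropy` and the numbers are made up for the example.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


probs = [0.9, 0.2, 0.8, 0.4]   # made-up sigmoid outputs
labels = [1, 0, 1, 0]
print(round(binary_cross_entropy(probs, labels), 4))   # 0.2656
print(binary_cross_entropy([0.1], [1]) > binary_cross_entropy([0.9], [1]))  # True
```

Each example contributes independently, so the loss of one sentence does not depend on which other sentences share its minibatch.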

In [None]:
# some tests I want to keep
input = torch.randn(5)
probs = nn.Sigmoid()(input)
preds = (probs > 0.5)
print(preds)
#labels = torch.Tensor([ True,  True, False,  True,  False]).to('cuda')
labels = torch.Tensor([1,1,0,1,0]).to('cpu')
taccuracy = torchmetrics.Accuracy(task="binary")(preds, labels).item()
print(taccuracy)
accuracy = accuracy_score(preds, labels)
print(accuracy)
assert abs(accuracy-taccuracy) < min(accuracy,taccuracy)/100, 'accuracy mismatch between pytorch and sklearn'
print('Both methods give the same result')
# despite the tests I couldn't make torchmetrics.Accuracy work in the code

tensor([ True, False, False,  True,  True])
0.6000000238418579
0.6
Both methods give the same result
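
For reference, thresholded accuracy is simple enough to compute by hand. The sketch below uses made-up probabilities chosen to reproduce the predictions printed above; `accuracy_from_probs` is a hypothetical helper, not a library function.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Threshold probabilities into 0/1 predictions and count matches."""
    preds = [int(p > threshold) for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)


probs = [0.9, 0.3, 0.2, 0.7, 0.6]   # thresholds to [1, 0, 0, 1, 1]
labels = [1, 1, 0, 1, 0]
print(accuracy_from_probs(probs, labels))   # 0.6
```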


## DataModule
Let's create a [DataModule](https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html) to easily handle minibatching our SST datasets.

In [None]:
class SSTDataModule(pl.LightningDataModule):
  """
  Data module wrapper around SST datasets.

  Arguments
  ---------
  batch_size: (int) minibatch size
    default = 32
  """
  def __init__(self, batch_size: int = 32):
    super().__init__()

    self.sst_train = SSTBERT('train')
    self.sst_dev = SSTBERT('dev')
    self.sst_test = SSTBERT('test')

    self.batch_size = batch_size

  def train_dataloader(self):
    # dataloader for train dataset
    return DataLoader(self.sst_train, batch_size=self.batch_size, shuffle=True)

  def val_dataloader(self):
    # dataloader for dev dataset
    return DataLoader(self.sst_dev, batch_size=self.batch_size)

  def test_dataloader(self):
    # dataloader for test dataset
    return DataLoader(self.sst_test, batch_size=self.batch_size)

  def predict_dataloader(self):
    # we also use the test dataset here
    return DataLoader(self.sst_test, batch_size=self.batch_size)

## Training
Let's put everything together with `pytorch_lightning.Trainer`.

In [None]:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

In [None]:
def seed_everything(seed, use_cuda=True):
  """
  Important to standardize seeds!
  """
  random.seed(seed)
  torch.manual_seed(seed)
  if use_cuda:
    torch.cuda.manual_seed_all(seed)
  np.random.seed(seed)
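
Why seeding matters can be seen with the standard library alone. This is only an illustration: `draw` is a made-up helper, and it uses a private `random.Random` generator rather than the global seeds set above.

```python
import random


def draw(seed, n=3):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)   # independent generator, global state untouched
    return [rng.random() for _ in range(n)]


print(draw(42) == draw(42))   # True: same seed, same sequence
print(draw(42) == draw(43))   # False: different seed, different sequence
```

The same idea applies to weight initialization, dropout and data shuffling, which is why all the relevant generators are seeded before training.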

In [None]:
# with GPU
dm = SSTDataModule(batch_size=32)
model = SSTSystem()

seed_everything(42, use_cuda=True)

checkpoint_callback = ModelCheckpoint(monitor='dev_loss')

trainer = Trainer(
  # you can add lots more custom config here for more advanced
  # functionality like early stopping, learning rate decay, etc.
  max_epochs=20, devices=1, accelerator="gpu",
  callbacks=[checkpoint_callback],  # for tracking best checkpoint
)

trainer.fit(model, dm)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type             | Params
-------------------------------------------
0 | model | MLP              | 102 K 
1 | f1    | BinaryF1Score    | 0     
2 | loss  | CrossEntropyLoss | 0     
-------------------------------------------
102 K     Trainable params
0         Non-trainable params
102 K     Total params
0.412     Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.
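
`ModelCheckpoint(monitor='dev_loss')` keeps the checkpoint from the epoch with the lowest logged `dev_loss` (to my understanding its default mode is `'min'`), and that is the checkpoint `ckpt_path="best"` reloads. The selection rule can be sketched as follows; `best_epoch` and the loss values are made up for illustration.

```python
def best_epoch(monitored, mode="min"):
    """Return the index of the epoch with the best monitored value."""
    pick = min if mode == "min" else max
    return pick(range(len(monitored)), key=lambda epoch: monitored[epoch])


dev_losses = [0.71, 0.64, 0.60, 0.62, 0.66]   # made-up values, one per epoch
print(best_epoch(dev_losses))                 # 2

dev_accs = [0.61, 0.70, 0.73, 0.72, 0.71]     # made-up values
print(best_epoch(dev_accs, mode="max"))       # 2
```

The best epoch is often not the last one, so testing the final weights instead of the best checkpoint can understate what the model achieved on the dev set.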


In [None]:
results = trainer.test(model, dm, ckpt_path="best")

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_3/checkpoints/epoch=2-step=801.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from the checkpoint at /content/lightning_logs/version_3/checkpoints/epoch=2-step=801.ckpt



This model can also be trained on a CPU without losing much time.

In [None]:
# without GPU
dm = SSTDataModule(batch_size=32)
model = SSTSystem()

seed_everything(42, use_cuda=False)  # no CUDA seeding needed on CPU

checkpoint_callback = ModelCheckpoint(monitor='dev_loss')

trainer = Trainer(
  # you can add lots more custom config here for more advanced
  # functionality like early stopping, learning rate decay, etc.
  max_epochs=20, devices=1, accelerator="cpu",
  callbacks=[checkpoint_callback],  # for tracking best checkpoint
)

trainer.fit(model, dm)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type             | Params
-------------------------------------------
0 | model | MLP              | 102 K 
1 | f1    | BinaryF1Score    | 0     
2 | loss  | CrossEntropyLoss | 0     
-------------------------------------------
102 K     Trainable params
0         Non-trainable params
102 K     Total params
0.412     Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [None]:
results = trainer.test(model, dm, ckpt_path="best")

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_3/checkpoints/epoch=11-step=3204.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from the checkpoint at /content/lightning_logs/version_3/checkpoints/epoch=11-step=3204.ckpt



## Results

As a baseline, the MLP trained on the pretrained features reaches around 73% accuracy after 20 epochs of training. It didn't improve with more epochs.

> **Test accuracy**: 0.73 with a simple MLP. The loss is quite high, so there is certainly room for improvement, but the purpose here is to compare two ways of doing sentiment analysis using the same classifier.

---

# Predictions with Word2vec embeddings

Can we avoid using pretrained features and just derive them directly from text using (simpler) Word2Vec embeddings?

A pretrained network such as BERT generates features directly from raw text. Without one, we need to preprocess the sentences ourselves to pull out normalized words.

Here are some common standardization techniques for NLP:

- Lower case (My name -> my name)
- Remove whitespace ('   hi' -> 'hi')
- Replace multiple white spaces with one
- Remove punctuation (hi! -> hi)
- Stop word removal (remove words not useful to the semantic meaning)
- Tokenization
- Lemmatization (programming -> program)
- Rare word removal

In [None]:
import string
from collections import defaultdict
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def to_lower_case(x):
  # Return lower case of string.
  return x.lower()

def remove_whitespace(x):
  # Remove any outer whitespace (tabs, newlines, spaces) around text.
  return x.strip()

def remove_extra_whitespace(x):
  # Remove any extra space within a sentence.
  return ' '.join(x.split())

def remove_punctuation(x):
  # Remove all punctuation characters.
  return ''.join([ch for ch in x if ch not in string.punctuation])

def tokenize(x):
  # Convert a string into tokens.
  return x.split()

def remove_stop_words(tokens):
  # Remove tokens from list if token is a stop word.
  return [tok for tok in tokens if tok not in stop_words]

def lemmatize(tokens):
  # Lemmatize every token in list.
  return [lemmatizer.lemmatize(tok) for tok in tokens]

def build_vocab(list_of_tokens):
  vocab = defaultdict(lambda: 0)
  # Build a dictionary from token to count.
  # Input is a list of list of strings.

  for tokens in list_of_tokens:
    for token in tokens:
      vocab[token] += 1
  return vocab

def remove_rare_words(tokens, vocab, min_count = 3):
  # Only keep tokens that appear more than min_count
  # number of times in vocab
  return [tok for tok in tokens if vocab[tok] > min_count]


In [None]:
# let's take a look
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let's use the functions defined above to build a `preprocess` function that returns the processed dataset.

In [None]:
def preprocess(dataset):
  """
   Let's apply all the preprocessing steps in the following order:

  - to_lower_case
  - remove_whitespace
  - remove_extra_whitespace
  - remove_punctuation
  - tokenize
  - remove_stop_words
  - lemmatize
  - build_vocab
  - remove_rare_words
  """
  all_tokens = []
  pbar = tqdm(total=len(dataset), leave=True, position=0)
  for entry in dataset:
    text = entry['sentence']
    text = to_lower_case(text)
    text = remove_whitespace(text)
    text = remove_extra_whitespace(text)
    text = remove_punctuation(text)
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    tokens = lemmatize(tokens)
    all_tokens.append(tokens)
    pbar.update()
  pbar.close()

  vocab = build_vocab(all_tokens)

  new_dataset = []
  for i in tqdm(range(len(dataset)), leave=True, position=0):
    tokens = all_tokens[i]
    tokens = remove_rare_words(tokens, vocab, 3)
    row_i = dataset[i]
    # add postprocessed sentence as a string
    # so we can get the tokens later by splitting it again
    # without redoing the preprocessing
    row_i['processed'] = ' '.join(tokens)
    new_dataset.append(row_i)

  return new_dataset

In [None]:
train_dataset = preprocess(load_dataset('sst', split='train'))
dev_dataset = preprocess(load_dataset('sst', split='validation'))
test_dataset = preprocess(load_dataset('sst', split='test'))

100%|██████████| 8544/8544 [00:19<00:00, 429.09it/s] 
100%|██████████| 8544/8544 [00:02<00:00, 3168.85it/s]
100%|██████████| 1101/1101 [00:01<00:00, 969.47it/s] 
100%|██████████| 1101/1101 [00:00<00:00, 2344.55it/s]
100%|██████████| 2210/2210 [00:01<00:00, 1719.41it/s]
100%|██████████| 2210/2210 [00:00<00:00, 3469.37it/s]


Let's take a look at the first row in the dataset object.  It has a `processed` entry, which is not a complete sentence, only the words that the preprocessing deems useful.

In [None]:
train_dataset[0]

{'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 'label': 0.6944400072097778,
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0',
 'processed': 'rock destined 21st century new going make even greater arnold schwarzenegger van steven'}

### Word2Vec

Word2Vec maps individual words to high-dimensional vector representations such that words with similar meanings are close to each other in vector space. A famous example is that the vector for `king` minus the vector for `man` plus the vector for `woman` lands close to the vector for `queen`. Word2Vec-style embeddings are trained on large text corpora, ranging from Twitter to collections of books.
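
The analogy can be reproduced on a toy scale. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions and are learned from text), and `cosine` is a small helper written for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# made-up embeddings; dimensions loosely read as (royalty, male, female)
toy = {
    'king':  [0.9, 0.8, 0.1],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'queen': [0.9, 0.1, 0.8],
    'apple': [0.0, 0.2, 0.2],
}

target = [k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman'])]
# as is customary, the query words themselves are excluded from the search
candidates = {w: v for w, v in toy.items() if w not in {'king', 'man', 'woman'}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)   # queen
```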

In [None]:
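import gensim.downloader
# this takes several minutes to download
# NOTE: these are 300-dimensional fastText vectors served through gensim;
# they are used here as a drop-in for Word2Vec embeddings
word2vec = gensim.downloader.load('fasttext-wiki-news-subwords-300')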



Let's create a `SSTWord2Vec` dataset to serve Word2Vec features for training.

In [None]:
from torch.utils.data import Dataset

class SSTWord2Vec(Dataset):
  """
  The Stanford Sentiment TreeBank dataset with Word2Vec features.

  Argument
  --------
  split: (str) the dataset portion
    Options - train | dev | test
  """

  def __init__(self, word2vec, split = 'train'):
    super().__init__()
    assert split in ['train', 'dev', 'test'], f"Split {split} not supported."
    if split == 'dev': split = 'validation'  # match their expectations
    self.data = preprocess(load_dataset('sst', split=split))
    self.word2vec = word2vec
    self.split = split

  def __getitem__(self, index):
    sentence = self.data[index]['processed']
    tokens = sentence.split()

    feature = []

    for token in tokens:
      try:
        # skip if word2vec doesn't have an embedding for that token
        feat = self.word2vec.get_vector(token)
        feature.append(feat)
      except KeyError:
        pass

    if len(feature) == 0:
      # all zeros if none of the words are in the dictionary
      feature = np.zeros(300)
    else:
      feature = np.stack(feature)
      # We treat the average of word embeddings as the sentence embedding!
      feature = np.mean(feature, axis=0)

    feature = torch.from_numpy(feature).float()
    label = round(self.data[index]['label'])
    return feature, label

  def __len__(self):
    return len(self.data)
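
The averaging step in `__getitem__` can be isolated in a few lines of plain Python. This is a sketch under simplifying assumptions: `sentence_embedding` is a hypothetical helper, and the embedding table is a small made-up dictionary rather than the gensim model.

```python
def sentence_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; all zeros if none is known."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates over dimensions, so each column is averaged
    return [sum(column) / len(vectors) for column in zip(*vectors)]


embeddings = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}   # made-up 2-d vectors
print(sentence_embedding(['good', 'movie', 'zzz'], embeddings, dim=2))  # [0.5, 0.5]
print(sentence_embedding(['zzz'], embeddings, dim=2))                   # [0.0, 0.0]
```

Averaging ignores word order, so "not good" and "good not" get the same embedding. That is one reason to expect contextual features to do better on sentiment.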

Let's set up a PyTorch Lightning data module and train the same model as before, now using Word2Vec features.

In [None]:
import pytorch_lightning as pl

class SSTWord2VecDataModule(pl.LightningDataModule):
  """
  Data module wrapper around SST datasets with Word2Vec.

  Arguments
  ---------
  batch_size: (int) minibatch size
    default = 32
  """
  def __init__(self, word2vec, batch_size: int = 32):
    super().__init__()

    self.sst_train = SSTWord2Vec(word2vec, split='train')
    self.sst_dev = SSTWord2Vec(word2vec, split='dev')
    self.sst_test = SSTWord2Vec(word2vec, split='test')

    self.batch_size = batch_size

  def train_dataloader(self):
    return DataLoader(self.sst_train, batch_size=self.batch_size, shuffle=True)

  def val_dataloader(self):
    return DataLoader(self.sst_dev, batch_size=self.batch_size)

  def test_dataloader(self):
    return DataLoader(self.sst_test, batch_size=self.batch_size)

  def predict_dataloader(self):
    return DataLoader(self.sst_test, batch_size=self.batch_size)

In [None]:
word2vec.vector_size # must be 300

300

In [None]:
class SSTWord2VecSystem(pl.LightningModule):

  def __init__(self):
    super().__init__()

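    input_dim = 300   # word2vec.vector_size
    output_dim = 1    # same as before

    self.model = MLP(input_dim, output_dim)
    self.accuracy = accuracy_score
    # NOTE: `_common_step` flattens the outputs to shape (batch_size,), so
    # CrossEntropyLoss sees the whole batch as a single sample with soft targets.
    # If the MLP ends in a sigmoid, torch.nn.BCELoss() is the conventional
    # per-example loss for a single binary output.
    self.loss = torch.nn.CrossEntropyLoss()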

  def forward(self, features):

    probs = self.model(features)

    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # ================================
    return probs

  def configure_optimizers(self, lr=1e-3):

    # trainable parameters can be accessed through `self.parameters`
    optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    return optimizer

  def _common_step(self, batch, batch_idx):
    features, labels = batch

    probs = self.forward(features)
    loss = self.loss(probs.view(-1), labels.float())

    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # loss: torch.FloatTensor
    #   shape: 1
    # ================================

    with torch.no_grad():

      # threshold at 0.5 and drop the trailing dimension: [batch_size, 1] -> [batch_size]
      preds = (probs > 0.5).squeeze(1)

      # sklearn expects CPU data, in (y_true, y_pred) order
      accuracy = self.accuracy(labels.to('cpu'), preds.to('cpu'))

      # preds: torch.Tensor
      #   shape: batch_size
      # accuracy: float
      # ================================

    return loss, accuracy

  def training_step(self, train_batch, batch_idx):
    loss, acc = self._common_step(train_batch, batch_idx)
    self.log('train_loss', loss)
    self.log('train_acc', acc, prog_bar=True)
    return loss

  def validation_step(self, dev_batch, batch_idx):
    loss, acc = self._common_step(dev_batch, batch_idx)
    self.log('dev_loss', loss)
    self.log('dev_acc', acc, prog_bar=True)

  def test_step(self, test_batch, batch_idx):
    loss, acc = self._common_step(test_batch, batch_idx)
    self.log('test_loss', loss)
    self.log('test_acc', acc)

  def predict_step(self, batch, batch_idx):
    return self.forward(batch[0])
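
The choice of loss in `SSTWord2VecSystem` deserves a closer look. My understanding of the PyTorch documentation is that `CrossEntropyLoss`, given a 1-D input and a float target of the same shape, treats the tensor as one unbatched sample with class probabilities, so the softmax runs across the batch rather than per example. The sketch below re-implements both formulas in NumPy to make the difference visible. The values in `probs` and `labels` are hypothetical, and it assumes the MLP ends in a sigmoid, which the notebook does not show here.

```python
import numpy as np

# Hypothetical sigmoid outputs for a batch of 4 sentences, and their labels.
probs = np.array([0.9, 0.2, 0.6, 0.1])
labels = np.array([1.0, 0.0, 1.0, 0.0])

def binary_cross_entropy(p, y):
    """Mean over the batch of -[y*log(p) + (1-y)*log(1-p)]."""
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def softmax_cross_entropy_over_batch(x, y):
    """-sum_i y_i * log softmax(x)_i, with the softmax taken across the batch."""
    log_softmax = x - np.log(np.sum(np.exp(x)))
    return float(-np.sum(y * log_softmax))

bce = binary_cross_entropy(probs, labels)
ce = softmax_cross_entropy_over_batch(probs, labels)
ce_shifted = softmax_cross_entropy_over_batch(probs + 5.0, labels)

print(f"binary cross-entropy:       {bce:.4f}")
print(f"softmax CE over the batch:  {ce:.4f}")
print(f"same, all outputs shifted:  {ce_shifted:.4f}")
```

The batch-wise softmax version is unchanged when every output is shifted by the same constant, so it only rewards ranking positives above negatives within a batch. Binary cross-entropy scores each prediction against its own label, which matches the `probs > 0.5` threshold used for accuracy.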

In [None]:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

dm = SSTWord2VecDataModule(word2vec, batch_size=32)

100%|██████████| 8544/8544 [00:02<00:00, 3381.94it/s]
100%|██████████| 8544/8544 [00:01<00:00, 4576.41it/s]
100%|██████████| 1101/1101 [00:00<00:00, 5522.19it/s]
100%|██████████| 1101/1101 [00:00<00:00, 11770.25it/s]
100%|██████████| 2210/2210 [00:00<00:00, 5938.38it/s]
100%|██████████| 2210/2210 [00:00<00:00, 12230.61it/s]


In [None]:
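seed_everything(42, use_cuda=False) # <<< cpu because ran out of gpu time
model = SSTWord2VecSystem()  # created after seeding so the weight init is reproducible

checkpoint_callback = ModelCheckpoint(monitor='dev_loss')  # keeps the epoch with the lowest dev loss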

trainer = Trainer(max_epochs=20,
  callbacks=[checkpoint_callback])
trainer.fit(model, dm)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type             | Params
-------------------------------------------
0 | model | MLP              | 43.1 K
1 | loss  | CrossEntropyLoss | 0     
-------------------------------------------
43.1 K    Trainable params
0         Non-trainable params
43.1 K    Total params
0.172     Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [None]:
results = trainer.test(model, dm, ckpt_path="best")

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_1/checkpoints/epoch=2-step=801.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from the checkpoint at /content/lightning_logs/version_1/checkpoints/epoch=2-step=801.ckpt


## Comparison

As a baseline, the classifier trained on averaged Word2Vec features reaches about 67% test accuracy after 20 epochs, while the same model on BERT features reaches 73%. That is a gap of six percentage points, which is not huge, I'd say. I won't go further: I just wanted to demonstrate the two approaches, fine-tuning a pre-trained model vs. training straight on Word2Vec embeddings.

> **Word2Vec test accuracy**: 67%

> **BERT test accuracy**: 73%
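
To put the gap in context, here is a rough back-of-the-envelope check. It takes the two accuracies quoted above and the test-set size of 2210 from the progress bars earlier in this section, and computes a binomial standard error for each. Treating the two runs as independent is an assumption (both models are scored on the same sentences), so read the result as a rough guide rather than a proper significance test.

```python
import math

n_test = 2210                    # test-set size, from the progress bars above
acc_w2v, acc_bert = 0.67, 0.73   # accuracies quoted above

def std_err(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)

gap = acc_bert - acc_w2v
# Assumes the two accuracies are independent, which is only an approximation.
se_gap = math.sqrt(std_err(acc_w2v, n_test) ** 2 + std_err(acc_bert, n_test) ** 2)
# Share of the Word2Vec model's errors that the BERT model removes.
rel_error_reduction = gap / (1 - acc_w2v)

print(f"gap: {gap:.3f}")
print(f"standard error of the gap: {se_gap:.4f}")
print(f"relative error reduction: {rel_error_reduction:.1%}")
```

Under these assumptions the gap is several standard errors wide, so it is unlikely to be test-set noise alone, even if it is modest in absolute terms.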