<a href="https://colab.research.google.com/github/jaroorhmodi/word2vec-and-BERT/blob/main/Word2Vec_and_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Word2Vec and BERT

The goal of this notebook is to build a clearer understanding of BERT by first showing how **CBOW Word2Vec** works and then using it as a jumping-off point for **BERT**. To those who know both methods, it's fairly clear how the two are conceptually related. I want to use this notebook to explore and explain that relationship, since BERT's training objective is similar to CBOW Word2Vec's but differs in some key ways.

To this end, I will be following, commenting on, and reimplementing the relevant models from the papers in which they were introduced: [**Efficient Estimation of Word Representations in Vector Space**](https://arxiv.org/pdf/1301.3781.pdf) (**Word2Vec**) and [**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**](https://arxiv.org/pdf/1810.04805.pdf) (**BERT**).

In [1]:
!pip install 'portalocker>=2.0.0'

Collecting portalocker>=2.0.0
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [2]:
import os

import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader

import numpy as np

from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2, WikiText103 #our datasets for this project

DATASET_small = "WikiText2"
DATASET_large = "WikiText103"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DATASET = DATASET_small
TOKENIZER="basic_english"
DATA_DIRECTORY = "/content/data"

RUN_TRAINING_EXAMPLES = True

##The Dataset

We will be using the **WikiText2** and **WikiText103** datasets available from `torchtext.datasets`. Both are compiled from Wikipedia articles and contain roughly 2M and over 100M tokens, respectively.

In [3]:
MIN_WORD_FREQUENCY = 20 #min frequency to appear in vocab
#HAD TO INCREASE FOR WT103 BECAUSE T4s ran out of memory on the very large onehot vectors

In [4]:
def fetch_dataset(dataset = DATASET, split=None):
  #load the requested WikiText split and convert it to a map-style (indexable) dataset
  if dataset == DATASET_small:
    data_iter = WikiText2(root=DATA_DIRECTORY, split=split)
  elif dataset == DATASET_large:
    data_iter = WikiText103(root=DATA_DIRECTORY, split=split)
  else:
    raise ValueError(f"{dataset} is not a valid dataset")
  return to_map_style_dataset(data_iter)

def build_vocab(data_iter, tokenizer=None):
  if tokenizer is None:
    tokenizer = get_tokenizer(TOKENIZER)
  vocab = build_vocab_from_iterator(
      map(tokenizer, data_iter),
      specials=["<unk>"],
      min_freq=MIN_WORD_FREQUENCY
  )
  vocab.set_default_index(vocab["<unk>"])
  return vocab
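
To make the effect of `MIN_WORD_FREQUENCY` concrete, here is a minimal pure-Python sketch of frequency-filtered vocabulary building. The helpers `build_toy_vocab` and `lookup` are hypothetical names I made up for illustration; they mimic the idea behind `build_vocab_from_iterator` with `min_freq` and a default `<unk>` index, not torchtext's actual implementation or id ordering.

```python
from collections import Counter

def build_toy_vocab(token_lists, min_freq, unk="<unk>"):
    """Map tokens seen at least `min_freq` times to integer ids; `unk` gets id 0."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    kept = sorted(
        (tok for tok, count in counts.items() if count >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate([unk] + kept)}

def lookup(stoi, tokens, unk="<unk>"):
    """Convert tokens to ids; rare or unseen tokens fall back to the <unk> id."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
stoi = build_toy_vocab(corpus, min_freq=2)
ids = lookup(stoi, ["the", "dog", "sat"])
print(stoi)
print(ids)
```

Here "dog" and "ran" each appear only once, so they are dropped from the vocabulary and map to the `<unk>` id. Raising the threshold shrinks the vocabulary, which in turn shrinks both the embedding table and the output layer.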


##Word2Vec

###Motivation:
The major motivation for Word2Vec was that existing methods for representing the words in a vocabulary often failed to capture similarity between words. Word2Vec aims to fix that.

The methods used in Word2Vec build on [**Linguistic Regularities in Continuous Word Representations**](https://aclanthology.org/N13-1090.pdf). A key feature of that work was that similar words have vectors that are close to one another, and that words can have **many degrees of similarity** in their vector representations. Word2Vec takes this approach and aims to make these vectors as accurate as possible.

###CBOW Architecture

The architecture described in the paper uses a simple feed-forward network with an embedding layer and no hidden layer. The input and output layers are both the size of the vocabulary, given by `vocab_size`, and the dimension of the embedding layer is given by `embedding_dim`.
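
As a rough illustration of the shapes involved, here is a NumPy sketch of a CBOW-style forward pass: look up the context embeddings, average them, and project to one score per vocabulary word. The sizes and the names `E`, `W`, and `b` are toy assumptions of mine, and the projection is written as `x @ W + b` for readability rather than matching how `nn.Linear` stores its weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4   # toy sizes, far smaller than the real model
batch, n_context = 3, 4             # 4 context words = 2 on each side of the target

E = rng.normal(size=(vocab_size, embedding_dim))   # embedding table
W = rng.normal(size=(embedding_dim, vocab_size))   # output projection
b = np.zeros(vocab_size)                           # output bias

context_ids = rng.integers(0, vocab_size, size=(batch, n_context))

embedded = E[context_ids]          # (batch, n_context, embedding_dim)
averaged = embedded.mean(axis=1)   # (batch, embedding_dim); word order is lost here
logits = averaged @ W + b          # (batch, vocab_size), one score per vocab word

print(embedded.shape, averaged.shape, logits.shape)
```

Because the context vectors are averaged, shuffling the context words leaves the prediction unchanged, which is the "bag of words" part of the name.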

In [5]:
class CBOW(nn.Module):
  def __init__(self, vocab_size, embedding_dim, embedding_max_norm):
    super(CBOW, self).__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_dim, max_norm = embedding_max_norm)
    self.output = nn.Linear(embedding_dim, vocab_size)

  def forward(self, x):
    #embed, take mean, then predict
    return self.output(self.embedding(x).mean(axis=1))

###CBOW Data Preparation

We need to generate training examples for the CBOW model. CBOW is trained by passing in a fixed number of words on either side of a target word in the text (without regard to their order) and training the model to predict the target.
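
A small pure-Python sketch of the windowing idea may help here. The helper `context_target_pairs` is a hypothetical name, and it works on strings for readability, whereas the real pipeline works on integer token ids.

```python
def context_target_pairs(tokens, ctx_words):
    """Return (context, target) pairs using `ctx_words` tokens on each side."""
    window = 2 * ctx_words + 1
    pairs = []
    # A sequence of length n contains n - window + 1 full windows.
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        context = chunk[:ctx_words] + chunk[ctx_words + 1:]  # drop the middle token
        target = chunk[ctx_words]                            # the middle token
        pairs.append((context, target))
    return pairs

tokens = ["the", "quick", "brown", "fox", "jumps", "over"]
pairs = context_target_pairs(tokens, ctx_words=2)
for context, target in pairs:
    print(context, "->", target)
```

With two context words on each side, a six-token sentence yields two training examples, with "brown" and "fox" as targets. Sequences shorter than one full window yield none.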

In [6]:
#need to define some constants for data
MAX_PARAGRAPH_LEN = 256
CTX_WORDS = 2

In [7]:
#Create function to collate data into inputs and outputs
#batches will be paragraphs from the dataset

from functools import partial

def collate(batch, text_pipeline):
  """
  Collate paragraphs into inputs and outputs for CBOW.
  Inputs are word contexts and outputs are target words.
    Context lengths are determined by CTX_WORDS
  """
  batch_input, batch_output = [], []
  for text in batch:
    text = text_pipeline(text)
    text = text[:MAX_PARAGRAPH_LEN] #truncate too-long paragraphs
    if len(text)<(2*CTX_WORDS+1):
      continue

    #slide a window of 2*CTX_WORDS+1 tokens over the paragraph, including the final window
    for i in range(len(text)-(CTX_WORDS*2+1)+1):
      window = text[i:i+CTX_WORDS*2+1]
      inputs = window[:CTX_WORDS]+window[CTX_WORDS+1:]
      outputs = window[CTX_WORDS]
      batch_input.append(inputs)
      batch_output.append(outputs)

  batch_input = torch.tensor(batch_input, dtype=torch.long)
  batch_output = torch.tensor(batch_output, dtype=torch.long)

  return batch_input, batch_output


def get_dataloader_and_vocab(
    dataset, dataset_split, collate_fn, batch_size, shuffle, vocab=None
):
  data_iter = fetch_dataset(dataset, dataset_split)
  tokenizer = get_tokenizer(TOKENIZER)
  if vocab is None:
    vocab = build_vocab(data_iter, tokenizer)

  text_pipeline = lambda x: vocab(tokenizer(x))

  dataloader = DataLoader(
      dataset=data_iter,
      batch_size = batch_size,
      shuffle = shuffle,
      collate_fn = partial(collate_fn, text_pipeline=text_pipeline)
  )

  return dataloader, vocab

###CBOW Training

####Training utilities and constants

This section defines a trainer class for the CBOW model, along with some constants used when setting up the run.

In [8]:
class CBOWTrainer:
  def __init__(
      self,
      model,
      train_dataloader,
      val_dataloader,
      loss_fn,
      optimizer,
      lr_scheduler,
      num_epochs=1,
      report_epochs=1):

    self.model = model
    self.train_data = train_dataloader
    self.val_data = val_dataloader
    self.loss_fn = loss_fn
    self.optimizer = optimizer
    self.lr_scheduler = lr_scheduler
    self.train_loss = []
    self.val_loss = []
    self.num_epochs = num_epochs
    self.report_epochs = report_epochs

  def train(self):
    for epoch in range(self.num_epochs):
      # print(f"Training epoch:{epoch}")
      self.train_epoch()
      self.val_epoch()

      if epoch%self.report_epochs==0 or epoch==self.num_epochs-1:
        print(f"=====EPOCH:{epoch+1}/{self.num_epochs}=====")
        print(f"Train Loss: {self.train_loss[-1]:.5f}")
        print(f"Valid Loss: {self.val_loss[-1]:.5f}\n")

      self.lr_scheduler.step()

  def train_epoch(self):
    self.model.train()
    running_loss = []
    for iter, batch in enumerate(self.train_data):
      X, y = batch
      X = X.to(DEVICE)
      y = y.to(DEVICE)

      #zero out gradient on optimizer
      self.optimizer.zero_grad()

      #forward pass
      prediction = self.model(X)

      #get loss
      batch_loss = self.loss_fn(prediction, y)

      #backpropagate
      batch_loss.backward()
      self.optimizer.step()

      running_loss.append(batch_loss.item())

    epoch_loss = np.mean(running_loss)
    self.train_loss.append(epoch_loss)

  def val_epoch(self):
    self.model.eval()
    running_loss = []

    with torch.no_grad():
      for iter, batch in enumerate(self.val_data):
        X, y = batch
        X = X.to(DEVICE)
        y = y.to(DEVICE)

        #inference
        prediction = self.model(X)

        #loss
        batch_loss = self.loss_fn(prediction, y)

        running_loss.append(batch_loss.item())

    epoch_loss = np.mean(running_loss)
    self.val_loss.append(epoch_loss)

  def print_losses(self):
    for epoch, losses in enumerate(zip(self.train_loss, self.val_loss)):
      tloss, vloss = losses
      print(f"=====EPOCH:{epoch+1}/{self.num_epochs}=====")
      print(f"Train Loss: {tloss}")
      print(f"Valid loss: {vloss}\n")

####Training a CBOW model

In [9]:
#flag to not train a new model if existing one is available:
TRAIN=False

In [10]:
EMBEDDING_DIM = 300 #in the paper the recommended dimension is 300
EMBEDDING_MAX_NORM = 1 #make sure embedding vectors are no longer than unit vectors

NUM_EPOCHS = 50 #number of epochs in training
REPORT_EPOCHS = 5 #print losses every REPORT_EPOCHS epochs
TRAIN_BATCH_SIZE = 128
VAL_BATCH_SIZE = 128

LR = 1

torch.manual_seed(2118)

SHUFFLE = True

In [11]:
#Get Data
train_dataloader, vocab = get_dataloader_and_vocab(DATASET, "train", collate, TRAIN_BATCH_SIZE, SHUFFLE)
val_dataloader, _ = get_dataloader_and_vocab(DATASET, "valid", collate, VAL_BATCH_SIZE, SHUFFLE, vocab)

VOCAB_SIZE = len(vocab.get_stoi())
VOCAB_SIZE

8130

In [12]:
#Make our model
cbow_model = CBOW(VOCAB_SIZE, EMBEDDING_DIM, EMBEDDING_MAX_NORM).to(DEVICE)

In [13]:
#Make loss function, optimizer, and lr_scheduler
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cbow_model.parameters(), lr=LR)
lr_scheduler = LambdaLR(
    optimizer,
    lr_lambda= lambda epoch: 1-epoch/NUM_EPOCHS
)
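
To see what the `lr_lambda` above does, here is a plain-Python sketch of the resulting schedule. `LambdaLR` multiplies the optimizer's base learning rate by the value the lambda returns for the current epoch; `linear_decay` is a hypothetical stand-in for that lambda, and no PyTorch is needed to inspect the numbers.

```python
def linear_decay(epoch, num_epochs):
    """Multiplier applied to the base learning rate at a given epoch."""
    return 1 - epoch / num_epochs

base_lr, num_epochs = 1.0, 50   # same values as LR and NUM_EPOCHS
schedule = [base_lr * linear_decay(epoch, num_epochs) for epoch in range(num_epochs)]

for epoch in (0, 10, 25, 49):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.2f}")
```

The rate starts at the full base value and falls linearly, reaching half the base rate at the midpoint and 2% of it in the final epoch.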

In [14]:
#Make trainer
trainer = CBOWTrainer(
    cbow_model,
    train_dataloader,
    val_dataloader,
    loss_fn,
    optimizer,
    lr_scheduler,
    NUM_EPOCHS,
    REPORT_EPOCHS
)

In [15]:
if TRAIN:
  trainer.train()
  CBOW_MODEL = trainer.model

In [19]:
CBOW_VERSION = 3 #V3 was the best model on WikiText2, V30 will be WikiText103 model
PATH = '/content/models/'
MODEL_PATH = os.path.join(PATH, f'CBOW_{DATASET}_v{CBOW_VERSION}.pth')
VOCAB_PATH = os.path.join(PATH, f'vocab/vocabCBOW_{DATASET}_v{CBOW_VERSION}.pth')

if TRAIN:
  os.makedirs(os.path.dirname(VOCAB_PATH), exist_ok=True) #make sure PATH and PATH/vocab exist
  torch.save(trainer.model.state_dict(), MODEL_PATH)
  torch.save(vocab, VOCAB_PATH)
  trainer.print_losses() #print per-epoch train and validation losses

In [20]:
if TRAIN:
  CBOW_MODEL = trainer.model.state_dict()
  CBOW_VOCAB = vocab
else:
  CBOW_MODEL = torch.load(MODEL_PATH, map_location = DEVICE)
  CBOW_VOCAB = torch.load(VOCAB_PATH)

###Inference with CBOW

With CBOW our goal is to use the embedding parameters we learned as part of training the model to embed words as vectors.
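
One way to see why the embedding weights are the word vectors: looking up a row of the embedding table gives the same result as multiplying a one-hot vector by that table. The NumPy sketch below checks this on a random toy table; the sizes and the name `E` are illustrative assumptions, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3                   # toy sizes
E = rng.normal(size=(vocab_size, embedding_dim))   # stand-in for embedding.weight

word_id = 4
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ E   # what a one-hot input layer would compute
via_lookup = E[word_id]    # what an embedding lookup does instead

print(np.allclose(via_matmul, via_lookup))
```

The lookup avoids materializing the one-hot vector at all, which matters once the vocabulary has thousands of entries.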

####Getting word embeddings

In [21]:
CBOW_MODEL['embedding.weight']

tensor([[ 5.6096e-02,  4.9031e-04, -4.6071e-03,  ..., -1.9411e-02,
          4.0804e-02,  7.5908e-02],
        [ 8.7804e-04,  2.3060e-02,  6.1047e-01,  ...,  2.0349e-01,
          9.1078e-02,  5.6157e-02],
        [ 1.8184e-01, -1.8171e-02, -2.1015e-01,  ...,  1.2175e-01,
          3.3651e-02,  1.1468e-01],
        ...,
        [ 7.1910e-02,  1.9874e-02,  9.2585e-02,  ...,  7.4775e-02,
          6.1344e-03,  1.2275e-01],
        [ 2.1920e-02,  1.5548e-01,  1.6240e-01,  ...,  1.3780e-01,
          5.3199e-02,  9.0991e-02],
        [ 5.7362e-02, -1.9979e-01, -1.8871e-01,  ..., -2.3270e-01,
          2.1576e-01,  2.0968e-02]])

In [22]:
list(trainer.model.parameters())[0]

Parameter containing:
tensor([[-0.2605,  0.7215, -0.5726,  ..., -0.7684,  0.9044, -0.6204],
        [ 1.2230, -0.2633,  0.9279,  ..., -0.7177,  0.3129,  0.8207],
        [-1.2038,  0.3434,  0.0552,  ...,  0.1105, -0.5110,  0.4044],
        ...,
        [ 1.1076, -0.4373, -0.3206,  ..., -0.8111, -0.9718,  0.0356],
        [ 0.5355,  1.2681,  0.2831,  ..., -0.7057, -0.0936,  1.4982],
        [ 0.0691,  0.5220,  0.1944,  ...,  1.0291,  1.0935,  0.9966]],
       requires_grad=True)

In [23]:
trainer.model.state_dict()['embedding.weight']

tensor([[-0.2605,  0.7215, -0.5726,  ..., -0.7684,  0.9044, -0.6204],
        [ 1.2230, -0.2633,  0.9279,  ..., -0.7177,  0.3129,  0.8207],
        [-1.2038,  0.3434,  0.0552,  ...,  0.1105, -0.5110,  0.4044],
        ...,
        [ 1.1076, -0.4373, -0.3206,  ..., -0.8111, -0.9718,  0.0356],
        [ 0.5355,  1.2681,  0.2831,  ..., -0.7057, -0.0936,  1.4982],
        [ 0.0691,  0.5220,  0.1944,  ...,  1.0291,  1.0935,  0.9966]])

In [24]:
#EMBEDDINGS ARE IN THE FIRST LAYER OF THE MODEL
embeddings = CBOW_MODEL['embedding.weight']
embeddings = embeddings.cpu().detach().numpy()

#NORMALIZE EMBEDDINGS BASED ON NORMS
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms
embeddings_norm.shape

(8130, 300)

####Testing word vectors

In [25]:
def get_N_best_matches(
    word,
    N,
    embeddings_norm = embeddings_norm,
    vocab = CBOW_VOCAB
):
  word_idx = vocab[word]
  if word_idx == 0:
    print("OOV word")
    return None
  word_vec = embeddings_norm[word_idx]
  word_vec = np.reshape(word_vec, (len(word_vec), 1))
  dists = np.matmul(embeddings_norm, word_vec).flatten() #dot prods
  topN_ids = np.argsort(-dists)[1 : N + 1]
  topN_dict = {}
  for sim_word_id in topN_ids:
      sim_word = vocab.lookup_token(sim_word_id)
      topN_dict[sim_word] = dists[sim_word_id]
  return topN_dict

In [31]:
for word, sim in get_N_best_matches("brother", 10).items():
    print("{}: {:.3f}".format(word, sim))

son: 0.455
nephew: 0.446
father: 0.430
birthday: 0.399
biographer: 0.399
cousin: 0.398
sons: 0.377
reign: 0.372
favorite: 0.371
resignation: 0.367


We can see above that the best matches have some semantic similarity to the word "brother". We see common male family relations like "son" and "nephew", and some words that might commonly appear near "son", like "favorite" in phrases like "favorite son".
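
The ranking above relies on the fact that, for unit-length vectors, a dot product is the cosine similarity. Here is a toy NumPy sketch of the same lookup. The words and 3-D vectors are made up by hand for illustration, and the hypothetical helper `nearest` filters out the query word by name rather than assuming it always ranks first.

```python
import numpy as np

words = ["brother", "son", "father", "treaty"]
vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.8, 0.7, 0.2],
    [0.0, 0.1, 1.0],
])
# Scale every row to unit length so dot products become cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def nearest(word, n):
    """Return the n most similar other words with their cosine similarity."""
    sims = unit @ unit[words.index(word)]
    order = [i for i in np.argsort(-sims) if words[i] != word][:n]
    return [(words[i], float(sims[i])) for i in order]

matches = nearest("brother", 2)
for match, sim in matches:
    print(f"{match}: {sim:.3f}")
```

In this toy table "son" and "father" point in nearly the same direction as "brother", while "treaty" is almost orthogonal to it and scores close to zero.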

In [32]:
emb1 = embeddings[vocab["animal"]]
emb2 = embeddings[vocab["animal"]]
emb3 = embeddings[vocab["dog"]]

emb4 = emb1 - emb2 + emb3
emb4_norm = (emb4 ** 2).sum() ** (1 / 2)
emb4 = emb4 / emb4_norm

emb4 = np.reshape(emb4, (len(emb4), 1))
dists = np.matmul(embeddings_norm, emb4).flatten()

top5 = np.argsort(-dists)[:10]

for word_id in top5:
    print("{}: {:.3f}".format(vocab.lookup_token(word_id), dists[word_id]))

dog: 1.000
dodo: 0.285
obsolete: 0.282
tests: 0.276
boys: 0.265
shop: 0.254
fernandez: 0.254
sponsored: 0.247
communities: 0.246
twins: 0.239


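While this implementation is far from ideal, the arithmetic behaves as expected: subtracting "animal" from "animal" leaves the zero vector, so the result is just the "dog" vector, and "dog" comes back as its own nearest neighbor with similarity 1.000. The remaining matches are all weak (below 0.3).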

In [46]:
emb1 = embeddings[vocab["king"]]
emb2 = embeddings[vocab["male"]]
emb3 = np.zeros(emb1.shape)

emb4 = emb1 - emb2 + emb3
emb4_norm = (emb4 ** 2).sum() ** (1 / 2)
emb4 = emb4 / emb4_norm

emb4 = np.reshape(emb4, (len(emb4), 1))
dists = np.matmul(embeddings_norm, emb4).flatten()

top5 = np.argsort(-dists)[:10]

for word_id in top5:
    print("{}: {:.3f}".format(vocab.lookup_token(word_id), dists[word_id]))

king: 0.712
fastra: 0.286
provisions: 0.262
scott: 0.261
ahk: 0.250
queen: 0.249
newport: 0.245
terror: 0.241
clement: 0.238
valkyria: 0.231


While it is not as good as the example referred to in the paper, we see that when we subtract "male" from "king", "queen" appears among the ten closest vectors (sixth, with similarity 0.249). The closest vector is still "king" itself.
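
For reference, the analogy test in the Word2Vec paper has the form "king" - "man" + "woman" ≈ "queen". The sketch below runs that arithmetic on hand-made 2-D vectors, so the clean result is by construction and says nothing about the trained model above. The `toy` table and the `analogy` helper are hypothetical, and unlike the cell above the helper skips the query words when ranking.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 loosely means "royalty", axis 1 "gender".
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy(a, b, c, table):
    """Return the word closest (by cosine) to a - b + c, skipping a, b and c."""
    query = unit(table[a] - table[b] + table[c])
    candidates = {
        word: float(unit(vec) @ query)
        for word, vec in table.items()
        if word not in (a, b, c)
    }
    return max(candidates, key=candidates.get)

answer = analogy("king", "man", "woman", toy)
print(answer)
```

Excluding the query words is a common convention in analogy evaluation; without it, one of the inputs (here "king") tends to rank first, as it does in the output above.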

##BERT

After some thought, I have decided to move the BERT portion of this to another notebook. This one is getting too long and too messy, but they will be in the same repo and comparisons of the two models will be done in that notebook.