<a href="https://colab.research.google.com/github/jaroorhmodi/word2vec-and-BERT/blob/main/Word2Vec_and_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Word2Vec and BERT

The goal of this notebook is to build a clearer understanding of **BERT** by first showing how **CBOW Word2Vec** works and then using it as a jumping-off point. To those who know both methods, the conceptual link is fairly clear: BERT's training objective resembles CBOW W2V's, since both predict a word from the words around it, but the two differ in some key ways. I want to use this notebook to explore and explain that relationship.

To this end, I will be following, commenting on, and reimplementing the relevant models from the papers in which they were introduced: [**Efficient Estimation of Word Representations in Vector Space**](https://arxiv.org/pdf/1301.3781.pdf) (**Word2Vec**) and [**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**](https://arxiv.org/pdf/1810.04805.pdf) (**BERT**).

In [None]:
import os
import tqdm
import torch
from torch import nn
import torchtext #needed for torchtext.data.to_map_style_dataset
from torch.utils.data import DataLoader

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2, WikiText103 #our datasets for this project

DATASET_small = "WikiText2"
DATASET_large = "WikiText103"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DATASET = DATASET_small
TOKENIZER="basic_english"

RUN_TRAINING_EXAMPLES = True

##The Dataset

We will be using the **WikiText2** and **WikiText103** datasets available from `torchtext.datasets`. Both are compiled from Wikipedia articles and contain roughly 2M and 100M+ tokens, respectively.

In [None]:
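def fetch_dataset(dataset = DATASET, split="train"):
  #split should be a single split name such as "train", "valid" or "test"
  if dataset == DATASET_small:
    data_iter = WikiText2(split=split)
  elif dataset == DATASET_large:
    data_iter = WikiText103(split=split)
  else:
    raise ValueError(f"{dataset} is not a valid dataset")
  #convert the iterable dataset to a map-style one so it can be indexed and shuffled
  return torchtext.data.to_map_style_dataset(data_iter)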

def build_vocab(data_iter, tokenizer=None):
  if tokenizer is None:
    tokenizer = get_tokenizer(TOKENIZER)
  #build the vocab whether or not a tokenizer was passed in
  vocab = build_vocab_from_iterator(
      map(tokenizer, data_iter),
      specials=["<unk>"],
  )
  vocab.set_default_index(vocab["<unk>"]) #out-of-vocabulary tokens map to <unk>
  return vocab
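
To make the role of the `<unk>` special token and the default index concrete, here is a small torchtext-free sketch of the same idea. The names `build_toy_vocab` and `lookup` are hypothetical helpers written for illustration only; they mimic the behavior relied on above and are not torchtext's implementation.

```python
from collections import Counter

def build_toy_vocab(tokenized_texts, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # ties in frequency are broken alphabetically so the result is deterministic
    ordered = sorted(
        (tok for tok in counts if tok not in specials),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def lookup(vocab, tokens, unk="<unk>"):
    """Convert tokens to ids, sending anything unseen to the <unk> id."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

toy_vocab = build_toy_vocab([["the", "cat", "sat"], ["the", "dog"]])
ids = lookup(toy_vocab, ["the", "zebra", "cat"])
print(ids)  # [1, 0, 2]: "zebra" was never seen, so it maps to <unk> at index 0
```

Setting a default index matters because the validation and test splits will contain words that never appear in the training text.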


##Word2Vec

###Motivation:
Word2Vec was motivated by the observation that many existing methods treated words as atomic units, so the representations they produced carried no notion of similarity between words. Word2Vec aims to fix that.

The methods used in Word2Vec were inspired by [**Linguistic Regularities in Continuous Space Word Representations**](https://aclanthology.org/N13-1090.pdf). A key finding of that work was that similar words have vectors that lie close to one another, and that words can have **many degrees of similarity** in their vector representations. Word2Vec builds on this approach and aims to make these vectors as accurate as possible.

###CBOW Architecture

The architecture described in the paper is a simple feed-forward network with an embedding (projection) layer and no non-linear hidden layer. The input and output layers both have the size of the vocabulary, given by `vocab_size`, and the dimension of the embedding layer is given by `embedding_dim`.
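
As a rough illustration of the arithmetic such a network performs, here is a dependency-free sketch. It assumes the context vectors are averaged before the output layer (as the paper describes for the shared projection layer) and that a plain softmax is taken over the vocabulary. The function `cbow_forward` and all the toy numbers are made up for this example.

```python
import math

def cbow_forward(context_ids, embedding, out_weight, out_bias):
    """Return a probability for every vocabulary word given the context word ids."""
    dim = len(embedding[0])
    # 1) look up each context word's vector and average them (word order is ignored)
    hidden = [
        sum(embedding[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    # 2) a single linear layer maps the averaged vector to one logit per vocabulary word
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(out_weight, out_bias)
    ]
    # 3) softmax (shifted by the max logit for numerical stability)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters: a vocabulary of 3 words and an embedding dimension of 2.
toy_embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # vocab_size x embedding_dim
toy_out_weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # vocab_size x embedding_dim
toy_out_bias = [0.0, 0.0, 0.0]

probs = cbow_forward([0, 1], toy_embedding, toy_out_weight, toy_out_bias)
probs_swapped = cbow_forward([1, 0], toy_embedding, toy_out_weight, toy_out_bias)
```

Because the context vectors are averaged, `probs` and `probs_swapped` agree: swapping the order of the context words does not change the prediction, which is where the "bag of words" in CBOW comes from.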

In [None]:
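class CBOW(nn.Module):
  def __init__(self, vocab_size, embedding_dim=300): #in the paper the embedding dimension suggested is 300
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.output = nn.Linear(embedding_dim, vocab_size)

  def forward(self, x):
    #x: (batch, num_context_words) token indices of the context words
    embedded = self.embedding(x) #(batch, num_context_words, embedding_dim)
    averaged = embedded.mean(dim=1) #average the context vectors; word order is ignored
    return self.output(averaged) #(batch, vocab_size) logits over the vocabulary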

###Data Preparation

We need to generate training examples for the CBOW model. CBOW is trained by passing in a fixed number of words on either side of a target word in the text (ignoring their order) and training the model to predict the target.
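
A minimal sketch of this windowing on a toy sentence, using plain Python lists. The helper `context_target_pairs` and its `ctx_words` parameter are hypothetical names chosen for this example; `ctx_words` is the number of context words taken on each side of the target.

```python
def context_target_pairs(tokens, ctx_words=2):
    """Return (context, target) pairs using ctx_words tokens on each side of the target."""
    window_size = 2 * ctx_words + 1
    pairs = []
    # slide a full window over the text; the middle token is the target
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        context = window[:ctx_words] + window[ctx_words + 1:]
        pairs.append((context, window[ctx_words]))
    return pairs

sentence = "the quick brown fox jumps over".split()
pairs = context_target_pairs(sentence, ctx_words=2)
for context, target in pairs:
    print(context, "->", target)
# ['the', 'quick', 'fox', 'jumps'] -> brown
# ['quick', 'brown', 'jumps', 'over'] -> fox
```

Texts shorter than one full window produce no pairs at all, which is why very short paragraphs contribute nothing to training.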

In [None]:
#need to define some constants for data
MAX_PARAGRAPH_LEN = 256
CTX_WORDS = 2

In [None]:
#Create function to collate data into inputs and outputs
#batches will be paragraphs from the dataset

from functools import partial

def collate(batch, text_pipeline):
  """
  Collate paragraphs into inputs and outputs for CBOW.
  Inputs are word contexts and outputs are target words.
    Context lengths are determined by CTX_WORDS
  """
  batch_input, batch_output = [], []
  for text in batch:
    text = text_pipeline(text)
    text = text[:MAX_PARAGRAPH_LEN] #truncate too-long paragraphs
    if len(text)<(2*CTX_WORDS+1):
      continue

    for i in range(len(text)-CTX_WORDS*2): #one iteration per full window, including the last one
      window = text[i:i+CTX_WORDS*2+1]
      inputs = window[:CTX_WORDS]+window[CTX_WORDS+1:]
      outputs = window[CTX_WORDS]
      batch_input.append(inputs)
      batch_output.append(outputs)

  #torch.tensor (lowercase) accepts a dtype argument; the torch.Tensor constructor does not
  batch_input = torch.tensor(batch_input, dtype=torch.long)
  batch_output = torch.tensor(batch_output, dtype=torch.long)

  return batch_input, batch_output


def get_dataloader_and_vocab(
    dataset, dataset_split, batch_size, shuffle, vocab=None
):
  data_iter = fetch_dataset(dataset, dataset_split)
  tokenizer = get_tokenizer(TOKENIZER)
  if vocab is None: #only build a vocab when one was not passed in
    vocab = build_vocab(data_iter, tokenizer)

  #raw text -> tokens -> vocabulary indices
  text_pipeline = lambda x: vocab(tokenizer(x))

  dataloader = DataLoader(
      data_iter,
      batch_size = batch_size,
      shuffle = shuffle,
      collate_fn = partial(collate, text_pipeline=text_pipeline)
  )

  return dataloader, vocab

##Model Training