
#BERT (Bidirectional Encoder Representations from Transformers)

In this notebook I will be replicating the model in the paper [**BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding**](https://arxiv.org/pdf/1810.04805.pdf).

While I will be creating the model from (mostly) scratch in PyTorch, I will not go into too much detail about why Multi-Head Attention is designed the way it is and how exactly the original [Transformer](https://arxiv.org/abs/1706.03762) architecture works. I have made another (*albeit messy*) [notebook that covers that paper](https://github.com/jaroorhmodi/transformer-from-scratch).

The model will be trained on the [**WikiText-2**](https://paperswithcode.com/dataset/wikitext-2) and [**WikiText-103**](https://paperswithcode.com/dataset/wikitext-103) datasets.

In [2]:
!pip install portalocker transformers datasets tokenizers


In [25]:
import os

import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader

import nltk
import numpy as np
import pandas as pd
import spacy

from tokenizers import BertWordPieceTokenizer

from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2, WikiText103 #our datasets for this project

DATASET_small = "WikiText2"
DATASET_large = "WikiText103"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DATASET = DATASET_small
TOKENIZER="basic_english"
DATA_DIRECTORY = "/content/data"

##Model Objective and Data

###Who (What) is BERT?


BERT is a much more complex model than Word2Vec, and what it accomplishes is not exactly equivalent, but the intuition behind both is similar. We pass in sentences and train a model to represent text in a way that captures not only the identity of the tokens but also something about their *meaning*.

Word2Vec does this by training a model on words and their context in sentences. It learns how words relate to one another either by predicting the context from a word (*Skip-Gram*) or by predicting a word from its context (*CBOW*). The embeddings it produces are static: each word gets a single vector, regardless of the sentence it appears in.
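To make the difference between the two Word2Vec setups concrete, here is a minimal sketch of how training examples could be built from one sentence. The helper names (`context_windows`, `skipgram_pairs`, `cbow_pairs`) and the `window` parameter are my own, chosen for illustration, and are not part of any library.

```python
def context_windows(tokens, window=2):
    """Yield (center, context) for each position, with up to `window` tokens per side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def skipgram_pairs(tokens, window=2):
    # Skip-Gram: one (input=center word, target=context word) example per context word.
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]


def cbow_pairs(tokens, window=2):
    # CBOW: one (input=all context words, target=center word) example per position.
    return [(context, center) for center, context in context_windows(tokens, window)]


sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cbow_pairs(sentence, window=1)[:2])
# [(['cat'], 'the'), (['the', 'sat'], 'cat')]
```

In both cases the word "the" ends up with one vector, even though it appears twice in different surroundings.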

BERT trains a Transformer encoder on two objectives, *Masked Language Modeling* (MLM) and *Next Sentence Prediction* (NSP), so that it learns a wealth of information about tokens in their context. Note that BERT is not simply learning static embeddings but rather representations that change based on context. Input text is first split into *WordPiece* subword tokens, and those tokens are what BERT embeds.
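As a rough illustration of the Masked Language Modeling objective, here is a minimal sketch of the masking rule described in the BERT paper: about 15% of tokens are selected as prediction targets, and of those 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_tokens`, its parameters, and the use of `-100` as the "ignore this position" label (picked because it is the default `ignore_index` of PyTorch's `nn.CrossEntropyLoss`) are my assumptions. The sketch also selects each token independently instead of picking exactly 15% of positions.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mask_prob=0.15, seed=None):
    """Return (inputs, labels) for MLM. labels is -100 wherever no prediction is made."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens such as [CLS], [SEP], or [PAD].
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok                  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id          # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random token id (simplification: may hit a special id)
            inputs[i] = rng.randrange(vocab_size)
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Hypothetical ids: 1 = [CLS], 2 = [SEP], 3 = [MASK]
inputs, labels = mask_tokens([1, 17, 42, 99, 7, 2], mask_id=3, vocab_size=1000,
                             special_ids={1, 2}, seed=0)
print(inputs)
print(labels)
```

The loss would then be computed only at positions where the label is not `-100`.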

The goal of the BERT paper was to introduce a way to represent text with a pre-trained Transformer, rather than to build a model for one specific predictive task. To this end it is pre-trained in an unsupervised manner on unlabeled text, using the aforementioned MLM and NSP objectives (both are explained later in the notebook).
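For the Next Sentence Prediction objective, the paper pairs each sentence A with its true successor B half of the time and with a random sentence from the corpus the other half. Here is a minimal sketch of that pairing. The function name `make_nsp_pairs` and the label convention (1 = IsNext, 0 = NotNext) are my own assumptions, and for simplicity the random sentence is drawn from the whole corpus, so it can occasionally coincide with the true next sentence.

```python
import random


def make_nsp_pairs(documents, seed=None):
    """documents: list of documents, each given as a list of sentence strings.

    Returns a list of (sentence_a, sentence_b, label) triples.
    """
    rng = random.Random(seed)
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        # Walk over consecutive sentences inside one document only.
        for first, true_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                pairs.append((first, true_next, 1))                  # IsNext
            else:
                pairs.append((first, rng.choice(all_sentences), 0))  # NotNext
    return pairs


docs = [
    ["The risk is highest after chemotherapy .", "Other agents have also been associated ."],
    ["Casualties are hard to calculate .", "Records are incomplete .", "Most sources agree ."],
]
for a, b, label in make_nsp_pairs(docs, seed=0):
    print(label, "|", a, "->", b)
```

Each pair would later be fed to the model as `[CLS] A [SEP] B [SEP]`, with the label predicted from the `[CLS]` position.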

###Data Processing

In [4]:
#We need to pull in the dataset and break it into sentence pairs for the NSP objective,
#and we need to mask random tokens and create the matching labels for the MLM objective.
dataset_class = WikiText2 if DATASET == DATASET_small else WikiText103
data_train = dataset_class(DATA_DIRECTORY, split = "train")

In [7]:
train_loader = DataLoader(data_train, batch_size=10, shuffle=True)

In [12]:
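#Grab 10 batches to experiment with. The iterator is created once so that each
#call to next() advances it instead of restarting the loader.
batches = iter(train_loader)
x = []
for i in range(10):
  x.append(next(batches))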




In [28]:
nlp = spacy.load("en_core_web_sm")

In [33]:
def split_sentences(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

In [40]:
#Split every text of every batch into sentences (previously only x[0] was split, once per batch)
x_splits = [[[split_sentences(text)] for text in batch] for batch in x]


In [42]:
x_splits[0]

[[[' Length : 16 @.@ 45 m ( 53 ft 11 ½ in ) \n']],
 [[' \n']],
 [[' Exposure to <unk> chemotherapy , in particular <unk> agents , can increase the risk of subsequently developing AML .',
   'The risk is highest about three to five years after chemotherapy .',
   'Other chemotherapy agents , specifically <unk> and <unk> , have also been associated with treatment @-@ related leukemias , which are often associated with specific chromosomal abnormalities in the leukemic cells . \n']],
 [[' Casualties in the battle are notoriously hard to calculate exactly .',
   'With only one exception ( Scipion ) , records made by the French captains of their losses at the time are incomplete .',
   'The only immediately available casualty counts are the sketchy reports of Saint @-@ André and the records made by British officers aboard the captured ships , neither of which can be treated as completely reliable .',
   'Most sources accept that French casualties in the campaign numbered approximately 7 @,@

In [43]:
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=True
)
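For intuition about what a WordPiece tokenizer does once it has a vocabulary, here is a minimal sketch of greedy longest-match-first splitting of a single word, the strategy used by BERT's reference tokenizer. The function `wordpiece_tokenize` and the toy vocabulary are hypothetical and only for illustration. This is not how the `tokenizers` library is implemented internally, and it does not cover how the vocabulary itself is learned.

```python
def wordpiece_tokenize(word, vocab, prefix="##", unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"play", "un", "##play", "##ing", "##able"}
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```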

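In [46]:
#train() expects paths to text files; our sentences are in memory, so flatten
#the nested lists and use train_from_iterator() instead.
train_sentences = [s for batch in x_splits for splits in batch for sents in splits for s in sents]
tokenizer.train_from_iterator(
    train_sentences,
    vocab_size=30_000,
    min_frequency=5,
    limit_alphabet=1000,
    wordpieces_prefix='##',
    special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']
    )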

In [31]:
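#encode_batch() expects a flat list of strings, so flatten the nested sentence lists first.
batch_sentences = [s for splits in x_splits[0] for sents in splits for s in sents]
x_tokenized = tokenizer.encode_batch(batch_sentences)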

##Model Architecture

##Training