<a href="https://colab.research.google.com/github/lizhieffe/language_model/blob/main/Bert_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Good tutorial: https://coaxsoft.com/blog/building-bert-with-pytorch-from-scratch

Another tutorial: https://neptune.ai/blog/how-to-code-bert-using-pytorch-tutorial


In [122]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

import re

from typing import Dict, List

import numpy as np

!pip install tqdm
from tqdm import tqdm



In [28]:
USE_GPU = True

BLOCK_SIZE = 96 # Context length: maximum number of tokens in one input sequence.

# number of workers in .map() call
# good number to use is ~order number of cpu cores // 2
NUM_PROC = 24

# Tokenizer

- **TODO**: the tokenizer in IMDBBertDataset doesn't convert words to ids. It only splits a sentence into words. Integrate a more advanced one.
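
One possible direction for this TODO, as a minimal sketch rather than the final tokenizer: split text with a regex, build a token-to-id table, and map unknown words to `[UNK]`. The names `simple_tokenize`, `build_vocab`, `encode` and `SPECIAL_TOKENS` are hypothetical helpers introduced only for this illustration; the special-token order mirrors `_fill_vocab` further down.

```python
import re
from typing import Dict, List

# Hypothetical helper names; special tokens in the same order as _fill_vocab.
SPECIAL_TOKENS = ['[CLS]', '[PAD]', '[MASK]', '[SEP]', '[UNK]']

def simple_tokenize(text: str) -> List[str]:
    """Lowercase, then split into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(sentences: List[str], min_freq: int = 1) -> Dict[str, int]:
    """Assign ids to special tokens first, then to tokens seen >= min_freq times."""
    counts: Dict[str, int] = {}
    for s in sentences:
        for t in simple_tokenize(s):
            counts[t] = counts.get(t, 0) + 1
    ttoi = {t: i for i, t in enumerate(SPECIAL_TOKENS)}
    for t, c in counts.items():
        if c >= min_freq and t not in ttoi:
            ttoi[t] = len(ttoi)
    return ttoi

def encode(text: str, ttoi: Dict[str, int]) -> List[int]:
    """Convert text to ids, falling back to [UNK] for out-of-vocabulary tokens."""
    unk = ttoi['[UNK]']
    return [ttoi.get(t, unk) for t in simple_tokenize(text)]

toy_vocab = build_vocab(["the cat sat.", "the dog sat."])
toy_ids = encode("the bird sat.", toy_vocab)  # "bird" is not in the vocabulary
print(toy_ids)
```

A subword tokenizer (e.g. WordPiece, as in the original BERT) would avoid most `[UNK]` tokens; this sketch stays at word level to match the rest of the notebook.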

# Download Dataset

In [26]:
# Download data - IMDB reviews

!pip install datasets # Since we are running in colab docker image, install it here.

from datasets import load_dataset # huggingface datasets

Collecting datasets
  Using cached datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Using cached dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Using cached multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.0/295.0 kB 5.0 MB/s eta 0:00:00
Installing collected packages: xxhash, dill, multiprocess, huggingface-hub, datasets
Successfully installed datasets-2.14.5 dill-0.3.7 huggingface-hub-0.17.3 multiprocess-0.70.15 xxhash-3.4.1


In [100]:
dataset = load_dataset("imdb", num_proc=NUM_PROC)

In [90]:
train_ds = dataset['train']

In [95]:
i = 0
for it  in train_ds:
  print(it)
  i += 1
  if i > 4:
    break

{'review': "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is du

In [92]:
train_ds

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})

# Prepare DS

- The original BERT uses BooksCorpus (800M words) and English Wikipedia (2,500M words) for pre-training.
- We use IMDB reviews data with ~72k words.

In [123]:
class Counter():
  """Store the counts for individual tokens."""
  def __init__(self):
    self.token_to_counts = {}

  def update(self, tokens: List[str]):
    """Update the counts with new tokens"""
    for t in tokens:
      if t in self.token_to_counts:
        self.token_to_counts[t] += 1
      else:
        self.token_to_counts[t] = 1

  def get(self) -> Dict[str, int]:
    return self.token_to_counts.copy()

  def __str__(self):
    s = sorted(self.token_to_counts.items(), key=lambda x:x[1], reverse=True)
    return s.__str__()

In [130]:
class Vocab:
  def __init__(self):
    self.ttoi = {}
    self.itot = {}

  def insert_token(self, t: str):
    assert t not in self.ttoi
    i = len(self.ttoi)
    self.ttoi[t] = i
    self.itot[i] = t

  def lookup_indices(self, tokens: List[str]):
    return [self.ttoi[t] for t in tokens]

In [145]:
from torch.utils.data import Dataset
import pandas as pd
from torchtext.data import get_tokenizer
import random

class IMDBBertDataset(Dataset):
  # Define special tokens as attributes of class
  CLS = '[CLS]'
  PAD = '[PAD]'
  SEP = '[SEP]'
  MASK = '[MASK]'
  UNK = '[UNK]'

  MASK_PERCENTAGE = 0.15

  MASKED_INDICES_COLUMN = 'masked_indices'
  TARGET_COLUMN = 'indices'
  NSP_TARGET_COLUMN = 'is_next'
  TOKEN_MASK_COLUMN = 'token_mask'

  OPTIMAL_LENGTH_PERCENTILE = 70

  def __init__(self,
               ds_from=None,
               ds_to=None,
               should_include_text: bool=False):
    """
    Args:
      should_include_text: if true, include the raw text in the dataset. This
        should only be used for debugging purpose.
    """
    super().__init__()

    self.ds = []
    for it in dataset['train']:
      self.ds.append(it['text'])

    self.tokenizer = get_tokenizer('basic_english')
    self.counter = Counter()
    self.vocab = Vocab()

    self.optimal_sentence_length = None
    self.should_include_text = should_include_text

    if self.should_include_text:
      self.columns = [
          'masked_sentence',
          self.MASKED_INDICES_COLUMN,
          'sentence',
          self.TARGET_COLUMN,
          self.TOKEN_MASK_COLUMN,
          self.NSP_TARGET_COLUMN,
      ]
    else:
      self.columns = [
          self.MASKED_INDICES_COLUMN,
          self.TARGET_COLUMN,
          self.TOKEN_MASK_COLUMN,
          self.NSP_TARGET_COLUMN,
      ]

    self.df = self._prepare_dataset()

  def __len__(self):
      return len(self.df)

  def __getitem__(self, idx):
    return None

  def _update_length(self,
                     review_sentences: List[str],
                     sentence_lens: List[int]):
    for s in review_sentences:
      sentence_lens.append(len(s.split()))

  def _find_optimal_sentence_length(self, sentence_lens: List[int]):
    arr = np.array(sentence_lens)
    ret = int(np.percentile(arr, self.OPTIMAL_LENGTH_PERCENTILE))
    return ret

  def _fill_vocab(self, min_freq=2):
    self.vocab.insert_token(self.CLS)
    self.vocab.insert_token(self.PAD)
    self.vocab.insert_token(self.MASK)
    self.vocab.insert_token(self.SEP)
    self.vocab.insert_token(self.UNK)

    token_to_counts = self.counter.get()
    for t, counts in tqdm(token_to_counts.items()):
      if counts >= min_freq:
        self.vocab.insert_token(t)

  def _create_item(self, first: List[int], second: List[int], target: int):
    return None

  def _select_false_nsp_sentences(self, sentences: List[str]):
    sentences_len = len(sentences)
    i1 = random.randint(0, sentences_len-1)
    i2 = random.randint(0, sentences_len-1)

    # Make sure they are really not NSP
    while i1 == i2 - 1:
      i2 = random.randint(0, sentences_len-1)

    return sentences[i1], sentences[i2]

  def _prepare_dataset(self) -> pd.DataFrame:
    sentences = []
    nsp = []
    sentence_lens = []

    # split ds on sentences
    for review in self.ds:
      review_sentences = review.split('. ')
      sentences += review_sentences
      self._update_length(review_sentences, sentence_lens)

    self.optimal_sentence_length = self._find_optimal_sentence_length(sentence_lens)
    print(f'{self.optimal_sentence_length=}')

    # Create vocab
    print("Create vocabulary")
    for s in tqdm(sentences):
      self.counter.update(self.tokenizer(s))
    self._fill_vocab()
    print(f'\nvocab size = {len(self.vocab.ttoi)}')

    assert len(sentence_lens) == len(sentences)
    # print(self.counter)

    for review in self.ds:
      review_sentences = review.split('. ')
      if len(review_sentences) > 1:
        for i in range(len(review_sentences) - 1):
          # True NSP item
          first, second = self.tokenizer(review_sentences[i]), self.tokenizer(review_sentences[i+1])
          print(f'{first=}, {second=}')
          nsp.append(self._create_item(first, second, target=1))

          # False NSP item
          first, second = self._select_false_nsp_sentences(sentences)
          first, second = self.tokenizer(first), self.tokenizer(second)
          print(f'{first=}, {second=}')
          nsp.append(self._create_item(first, second, target=0))

          # break
      # break

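    # TODO: build and return the DataFrame once _create_item is implemented.
    # Until then this method returns None, so self.df is None and len(ds) fails.
    # df = pd.DataFrame(nsp, columns=self.columns)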

ds = IMDBBertDataset()

self.optimal_sentence_length=27
Create vocabulary


100%|██████████| 247731/247731 [00:09<00:00, 27400.76it/s]
100%|██████████| 100682/100682 [00:00<00:00, 1162720.86it/s]


vocab size = 51721
first=['i', 'rented', 'i', 'am', 'curious-yellow', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was', 'first', 'released', 'in', '1967'], second=['i', 'also', 'heard', 'that', 'at', 'first', 'it', 'was', 'seized', 'by', 'u', '.', 's']
first=['this', 'show', 'is', 'highly', 'overrated', ',', 'and', 'less', 'worthy', 'of', 'your', 'channel', 'surfing', 'time', 'than', 'saturday', 'night', 'live', ',', 'another', 'horrible', 'show'], second=['all', 'of', 'fred', 'and', 'ginger', "'", 's', 'movies', 'had', 'sub', 'plots', 'that', 'depended', 'on', 'other', 'actors', 'to', 'fill', 'in', 'the', 'space', 'between', 'the', 'musical', 'numbers', ',', 'otherwise', 'the', 'movie', 'would', 'have', 'to', 'be', 'shortened', 'by', 'about', 'a', 'half', 'hour']
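
The "optimal sentence length" above is the 70th percentile of the sentence lengths, and sentences are later expected to be cut or padded to that length. A small pure-Python sketch of both steps follows. `percentile_linear` and `pad_or_truncate` are hypothetical names, and the interpolation is intended to mirror the default linear method of `np.percentile`, which is an assumption about how the class above behaves rather than a copy of it.

```python
from typing import List

def percentile_linear(values: List[int], q: float) -> float:
    """q-th percentile using linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])

def pad_or_truncate(tokens: List[str], length: int, pad: str = '[PAD]') -> List[str]:
    """Cut tokens to `length`, or append `pad` until `length` is reached."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))

toy_lens = [3, 5, 8, 12, 40]                       # toy sentence lengths
toy_length = int(percentile_linear(toy_lens, 70))  # one long outlier barely moves it
print(toy_length)
print(pad_or_truncate(['a', 'b'], 4))
```

Choosing a percentile instead of the maximum keeps a few very long sentences from forcing heavy padding on every other example.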





In [136]:
ds.vocab.lookup_indices(["[CLS]", "this"])

[0, 17]
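
`_create_item` is still a stub. As a rough sketch of the masking step it could perform, the snippet below replaces about `MASK_PERCENTAGE` of the non-special tokens with `[MASK]` and records which positions were masked. `mask_tokens` is a hypothetical helper, not the dataset's actual method, and it always substitutes `[MASK]`; the original BERT recipe also swaps in random tokens or keeps the original token part of the time, which is omitted here.

```python
import random
from typing import List, Optional, Tuple

CLS, PAD, MASK, SEP = '[CLS]', '[PAD]', '[MASK]', '[SEP]'
SPECIAL = {CLS, PAD, MASK, SEP}

def mask_tokens(tokens: List[str],
                mask_fraction: float = 0.15,
                rng: Optional[random.Random] = None) -> Tuple[List[str], List[bool]]:
    """Return (masked tokens, per-position flag that is True where masked)."""
    rng = rng or random.Random()
    # Only ordinary tokens are eligible; special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    n_mask = max(1, round(len(candidates) * mask_fraction)) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    is_masked = [i in chosen for i in range(len(tokens))]
    return masked, is_masked

toy_tokens = ([CLS] + 'the cat is walking'.split() + [SEP]
              + 'the dog is barking at the tree'.split() + [SEP])
toy_masked, toy_is_masked = mask_tokens(toy_tokens, rng=random.Random(0))
print(toy_masked)
```

The boolean list plays the role of the `token_mask` column: the loss would only be computed at positions where it is `True`.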

# Preprocessing

In [10]:
text = (
       'Hello, how are you? I am Romeo.\n'
       'Hello, Romeo My name is Juliet. Nice to meet you.\n'
       'Nice meet you too. How are you today?\n'
       'Great. My baseball team won the competition.\n'
       'Oh Congratulations, Juliet\n'
       'Thanks you Romeo'
   )
len(text)

208

In [11]:
text = text.lower()
text

'hello, how are you? i am romeo.\nhello, romeo my name is juliet. nice to meet you.\nnice meet you too. how are you today?\ngreat. my baseball team won the competition.\noh congratulations, juliet\nthanks you romeo'

In [12]:
# 1) remove '.', ',', '!', '?' and '-'
# 2) split into sentences at each newline
sentences = re.sub("[.,!?-]", '', text.lower()).split('\n')
sentences

['hello how are you i am romeo',
 'hello romeo my name is juliet nice to meet you',
 'nice meet you too how are you today',
 'great my baseball team won the competition',
 'oh congratulations juliet',
 'thanks you romeo']

In [13]:
# Sort the unique words so that word ids are reproducible across runs.
word_list = sorted(set(" ".join(sentences).split()))
word_list[:8]

['am', 'are', 'baseball', 'competition', 'congratulations', 'great', 'hello', 'how']

In [14]:
wtoi = {
    '[PAD]': 0,
    '[CLS]': 1,
    '[SEP]': 2,
    '[MASK]': 3
}

# Ordinary words get the ids that follow the special tokens.
for w in word_list:
  wtoi[w] = len(wtoi)

# Reverse mapping: id -> word
itow = {i: w for w, i in wtoi.items()}

vocab_size = len(wtoi)

print(f'{vocab_size=}')

vocab_size=29


In [19]:
# 1. Masking: BERT randomly selects 15% of the tokens in a sequence for
#   masking with [MASK]. Special tokens are never selected.
#
# 2. Padding: [PAD] is added so that all sentences have equal length.
#
#   For instance, if we take the sentences:
#       “The cat is walking. The dog is barking at the tree.”
#   then with padding they look like this:
#       “[CLS] The cat is walking [PAD] [PAD] [PAD]. [CLS] The dog is barking at the tree.”
#
#   The 1st sentence now has the same length as the 2nd sentence.

def make_batch(sentences, batch_size: int, sentence_length: int):
  """Make a batch.

  Args:
    sentences: array of str
    batch_size: batch size
    sentence_length: the length of a sentence. Note that in each example there are
    two sentences.
  """
  batch = []
  positive = negative = 0



  while positive != batch_size / 2 or negative != batch_size / 2:
    tokens_a_index = torch.randint(0, len(sentences), (batch_size,))
    tokens_b_index = torch.randint(0, len(sentences), (batch_size,))

    tokens_a = sentences

In [24]:
torch.randint(0, len(sentences), (1,)).item()

1
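
`make_batch` above is unfinished. The sampling loop it starts could look roughly like the following sketch, which draws random sentence pairs until half the batch is "next sentence" pairs and half is not. `sample_nsp_pairs` is a hypothetical helper; it treats sentence `i + 1` as the true successor of sentence `i`, which is an assumption taken from the index comparison style of the tutorials linked at the top, and it leaves out tokenization, masking and padding.

```python
import random
from typing import List, Tuple

def sample_nsp_pairs(sentences: List[str], batch_size: int,
                     rng: random.Random) -> List[Tuple[str, str, int]]:
    """Return batch_size (first, second, is_next) triples, half of them positive."""
    assert batch_size % 2 == 0 and len(sentences) >= 2
    half = batch_size // 2
    pairs = []
    positive = negative = 0
    while positive < half or negative < half:
        a = rng.randrange(len(sentences))
        b = rng.randrange(len(sentences))
        if b == a + 1 and positive < half:
            pairs.append((sentences[a], sentences[b], 1))  # true next sentence
            positive += 1
        elif b != a + 1 and negative < half:
            pairs.append((sentences[a], sentences[b], 0))  # random pair
            negative += 1
    return pairs

toy_sentences = [
    'hello how are you i am romeo',
    'hello romeo my name is juliet nice to meet you',
    'nice meet you too how are you today',
    'great my baseball team won the competition',
    'oh congratulations juliet',
    'thanks you romeo',
]
toy_pairs = sample_nsp_pairs(toy_sentences, batch_size=6, rng=random.Random(0))
for first, second, is_next in toy_pairs:
    print(is_next, '|', first, '->', second)
```

Counting with `<` instead of comparing against `batch_size / 2` with `!=` keeps the loop from running forever when one class is already full.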

# Model

# Loss and Optimization

# Training