<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/bert/tf_bert_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT from scratch

## Intro

In [54]:
#@title ## Import packages
import os
import re
import shutil
from dataclasses import dataclass
from google.colab import auth, data_table
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.preprocessing import text_dataset_from_directory 
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization 

tf.get_logger().setLevel("ERROR")

print("tensorflow", tf.__version__)

tensorflow 2.4.0


## 1. Configuration

In [42]:
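@dataclass
class Config:
    # Type annotations turn these into real dataclass fields; unannotated
    # names would be plain class attributes that @dataclass ignores.
    MAX_LEN: int = 256
    BATCH_SIZE: int = 32
    LR: float = 0.001
    VOCAB_SIZE: int = 30000
    EMBED_DIM: int = 128
    NUM_HEAD: int = 8
    FF_DIM: int = 128
    NUM_LAYERS: int = 1
    DS_URL: str = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    TRAIN_DIR: str = None  # set once the dataset has been downloaded
    TEST_DIR: str = None  # set once the dataset has been downloaded
    VALIDATION_SPLIT: float = 0.2
    SEED: int = 42
    MASK_TOKEN: str = "[MASK]"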

config = Config()

## 2. Download Dataset - Large Movie Review

We use [Stanford’s Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis. It is split into a training set and a test set, each containing 25,000 movie reviews from IMDb. Within each set, the number of reviews labeled “positive” equals the number labeled “negative”.


In [None]:
dataset = tf.keras.utils.get_file(
    "aclImdb.tar.gz", config.DS_URL, untar=True, cache_dir=".", cache_subdir=""
)
dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
config.TRAIN_DIR = os.path.join(dataset_dir, "train")
config.TEST_DIR = os.path.join(dataset_dir, "test")
print("Train dataset:", config.TRAIN_DIR)
print("Test dataset:", config.TEST_DIR)

# remove the unlabeled "unsup" folder so only the "neg" and "pos" classes remain
remove_dir = os.path.join(config.TRAIN_DIR, "unsup")
shutil.rmtree(remove_dir, ignore_errors=True)  # safe when the cell is re-run

Train dataset: ./aclImdb/train
Test dataset: ./aclImdb/test


The IMDB dataset is already divided into train and test sets, but it lacks a validation set. Let's create one by holding out 20% of the training data (an 80:20 split).

> When `validation_split` is used, `text_dataset_from_directory` requires either a fixed random seed or `shuffle=False`, so that the training and validation subsets do not overlap. **We will pass a random seed**.
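
To see why the seed matters, here is a minimal pure-Python sketch of a seeded shuffle-then-slice split. The helper `split_indices` is hypothetical and only mimics the idea; it is not how Keras implements `validation_split` internally.

```python
import random


def split_indices(num_samples, validation_split, seed, subset):
    """Hypothetical helper: shuffle indices with `seed`, then slice off one subset."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffled order
    num_val = int(num_samples * validation_split)
    if subset == "training":
        return indices[:-num_val]
    return indices[-num_val:]


# Both calls share the seed, so they slice the same shuffled order.
train_idx = split_indices(25000, 0.2, seed=42, subset="training")
val_idx = split_indices(25000, 0.2, seed=42, subset="validation")
print(len(train_idx), len(val_idx))
print("overlap with shared seed:", len(set(train_idx) & set(val_idx)))

# A different seed for the second call reshuffles the data, so samples leak.
leaky_val_idx = split_indices(25000, 0.2, seed=7, subset="validation")
print("overlap with different seeds:", len(set(train_idx) & set(leaky_val_idx)))
```

With a shared seed the two subsets partition the data; with different seeds some validation samples would also appear in the training subset.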

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

def optimize_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

train_ds = text_dataset_from_directory(
    config.TRAIN_DIR,
    subset="training",
    batch_size=config.BATCH_SIZE,
    validation_split=config.VALIDATION_SPLIT,
    seed=config.SEED,
)
class_names = train_ds.class_names
train_ds = optimize_dataset(train_ds)

val_ds = text_dataset_from_directory(
    config.TRAIN_DIR,
    subset="validation",
    batch_size=config.BATCH_SIZE,
    validation_split=config.VALIDATION_SPLIT,
    seed=config.SEED,
)
val_ds = optimize_dataset(val_ds)

test_ds = text_dataset_from_directory(
    config.TEST_DIR,
    batch_size=config.BATCH_SIZE
)
test_ds = optimize_dataset(test_ds)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
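
The file counts above translate into batch counts as follows. This is a small arithmetic sketch with a hypothetical helper `num_batches`; it assumes the final, smaller batch is kept rather than dropped.

```python
import math


def num_batches(num_samples, batch_size):
    """Hypothetical helper: batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


total_files, validation_split, batch_size = 25000, 0.2, 32
num_val = int(total_files * validation_split)
num_train = total_files - num_val

print("train files:", num_train, "batches:", num_batches(num_train, batch_size))
print("val files:", num_val, "batches:", num_batches(num_val, batch_size))
print("test files:", total_files, "batches:", num_batches(total_files, batch_size))
```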


In [None]:
print("Loaded classes:", class_names)

Loaded classes: ['neg', 'pos']


In [None]:
text_batch, label_batch = next(train_ds.as_numpy_iterator())
df = pd.DataFrame({"text": text_batch, "label": label_batch})
data_table.DataTable(df, include_index=False, num_rows_per_page=3)

| | text | label |
|---|---|---|
| 0 | `b"Having seen most of Ringo Lam's films, I can...` | 1 |
| 1 | `b'Caution: May contain spoilers...<br /><br />...` | 1 |
| 2 | `b"from the view of a NASCAR Maniac like I am, ...` | 1 |
| 3 | `b'When it first came out, this work by the Mey...` | 1 |
| 4 | `b"I thought that this was an absolutely charmi...` | 1 |
| 5 | `b"The filming is pleasant and the environment ...` | 1 |
| 6 | `b"This is one of those cheaply made TV Movies ...` | 0 |
| 7 | `b'Great movie - especially the music - Etta Ja...` | 0 |
| 8 | `b'This film is not even worth walking to the m...` | 0 |
| 9 | `b"For me, this movie just seemed to fall on it...` | 0 |


## 3. Preprocessing dataset

Standardization lowercases the samples, replaces `<br />` tags with spaces, and strips punctuation.

In [36]:
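def custom_standardization(input_data):
    # Lowercase, replace HTML line breaks with spaces, then strip punctuation.
    # "[" and "]" are not in the punctuation list, so a bracketed token
    # such as "[mask]" survives standardization.
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        # "\\^" is an explicit backslash; the bare "\^" was an invalid escape sequence
        stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@\\^_`{|}~"), ""
    )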

In [41]:
custom_standardization(tf.constant("Simple<br /> text~!")).numpy()

b'simple  text'
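
For readers who want to experiment without TensorFlow, the same three steps can be sketched with Python's `re` module. The name `standardize_py` is hypothetical, and the sketch assumes that Python's regex engine treats this particular character class the same way `tf.strings.regex_replace` does.

```python
import re

# Same punctuation list as custom_standardization; "[" and "]" are absent.
PUNCTUATION = "!#$%&'()*+,-./:;<=>?@\\^_`{|}~"


def standardize_py(text):
    """Hypothetical pure-Python counterpart of custom_standardization."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(PUNCTUATION), "", stripped_html)


print(repr(standardize_py("Simple<br /> text~!")))
print(repr(standardize_py("A [MASK] token, isn't it?")))
```

Note that the second example comes out as `a [mask] token isnt it`: the brackets are preserved, but the token itself is lowercased along with the rest of the text.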

Build the preprocessing layer that will be used to tokenize raw text. The original BERT uses WordPiece subword tokenization; we simplify that approach by splitting on whitespace, so every token is a whole word.

In [70]:
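def build_vectorize_layer(
    texts,
    vocab_size=config.VOCAB_SIZE,
    max_seq=config.MAX_LEN,
    special_tokens=(config.MASK_TOKEN,),  # tuple avoids a mutable default argument
):
    """Build a TextVectorization layer adapted to `texts`, with the special
    tokens appended to the end of the vocabulary."""
    special_tokens = list(special_tokens)
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=custom_standardization,
        output_sequence_length=max_seq,
    )
    # adapt() only needs the raw text, so drop the labels
    text_without_labels = texts.map(lambda text, label: text)
    vectorize_layer.adapt(text_without_labels)

    # Insert the mask token into the vocabulary. The first two entries
    # (padding and out-of-vocabulary) are skipped because set_vocabulary()
    # adds them back; the tail is truncated to make room for the special tokens.
    # Note: the layer lowercases its input, so a special token is only matched
    # in raw text if it is stored in lowercase.
    vocab = vectorize_layer.get_vocabulary()
    vocab = vocab[2 : vocab_size - len(special_tokens)] + special_tokens
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer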
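
The vocabulary bookkeeping above can be illustrated without TensorFlow. The following is a rough sketch under stated assumptions: `build_vocab` and `encode` are hypothetical helpers, index 0 is assumed to be padding and index 1 the out-of-vocabulary token, and the toy corpus is made up for illustration.

```python
from collections import Counter


def build_vocab(texts, vocab_size, special_tokens=("[mask]",)):
    """Hypothetical helper: most frequent words first, special tokens last."""
    counts = Counter(word for text in texts for word in text.split())
    # Two slots are reserved (padding, OOV), plus one per special token.
    num_words = vocab_size - 2 - len(special_tokens)
    words = [word for word, _ in counts.most_common(num_words)]
    return ["", "[UNK]"] + words + list(special_tokens)


def encode(text, vocab, max_len):
    """Hypothetical helper: map words to ids, then truncate or pad to max_len."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:max_len]  # 1 = OOV
    return ids + [0] * (max_len - len(ids))  # 0 = padding


toy_corpus = ["the movie was good", "the movie was", "the movie", "the"]
vocab = build_vocab(toy_corpus, vocab_size=6)
print(vocab)
print(encode("the movie was good [mask]", vocab, max_len=8))
```

With `vocab_size=6` only three ordinary words fit, so "good" falls out of the vocabulary and is encoded as the OOV id, while the mask token takes the last index.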


In [71]:
vectorize_layer = build_vectorize_layer(train_ds)

Let's test our tokenization layer.

In [79]:
list(vectorize_layer([tf.constant("this a simple test")]).numpy()[0][:10])

[10, 4, 583, 2216, 0, 0, 0, 0, 0, 0]

