In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.6.0 py

Load transformer model and its tokenizer from HuggingFace

In [2]:
from transformers import DistilBertTokenizerFast, TFDistilBertModel
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [3]:
text = "This was an absolutely terrible movie!"
encoded_input = tokenizer(text, return_tensors='tf')
output = bert(encoded_input)
print(encoded_input)
print(output)

{'input_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=
array([[ 101, 2023, 2001, 2019, 7078, 6659, 3185,  999,  102]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(1, 9, 768), dtype=float32, numpy=
array([[[ 0.10105765,  0.08145499,  0.14580603, ..., -0.17318317,
          0.41397157,  0.3861206 ],
        [-0.08768288, -0.30625236, -0.00278953, ..., -0.51468325,
          1.0388579 ,  0.3485065 ],
        [ 0.05506657, -0.41134804,  0.00242786, ..., -0.21425876,
          0.50868523,  0.6276908 ],
        ...,
        [ 0.5169393 , -0.19129173, -0.18188265, ..., -0.13861606,
          0.32506967, -0.1977762 ],
        [ 0.2934267 ,  0.08480907,  0.11095915, ...,  0.05719552,
          0.54716593, -0.01751508],
        [ 0.9839468 ,  0.33199084, -0.25651062, ...,  0.10096597,
         -0.20585994, -0.22742088]]], dtype=float32)>, hidden_

# kjp Notes:
- Note the output shape of the `print(output)` statement
  - a latent state (of length 768) for **each** of the 9 tokens in the input (batch size 1)
  - the classifier will need to reduce the sequence to a single vector
- In the steps below, we can see the `call` method for the newly created class (`TextClassificationModel`) being overridden
  - in particular:

      > `x = x['last_hidden_state'][:, 0, :]`

    takes element 0 of the sequence (corresponding to the `[CLS]` special token) as the *single* vector representing the sequence
    - Since BERT's self-attention is bi-directional, every position (including the first, `[CLS]`) can attend to the entire sequence, so the `[CLS]` state can serve as a summary of it
- We can also see that the type of `output` is `TFBaseModelOutput`
  - It behaves like a `dict` (like most HuggingFace model outputs)
    - keys include `last_hidden_state`, which has shape `(1, 9, 768)`
      - 1 element in the batch
      - sequence length 9
      - a latent state of size 768 for each element of the sequence
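
To make the slicing concrete, here is a minimal sketch in NumPy. A random array stands in for `last_hidden_state` (the values are made up for illustration; only the shape mirrors the output above), and the same index expression used in `call` is applied to it.

```python
import numpy as np

# Stand-in for output['last_hidden_state'] with shape (batch, sequence, hidden).
# The random values are an assumption for illustration; only the shape matters.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(1, 9, 768)).astype(np.float32)

# Same indexing as in `call`: every batch element, position 0 ([CLS]), all features.
cls_state = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (1, 9, 768)
print(cls_state.shape)          # (1, 768)
```

The sequence axis disappears, leaving one 768-dimensional vector per example, which is what the classification head consumes.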

Build the text classification model on top of pretrained BERT.

In [4]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np

In [5]:
class TextClassificationModel(keras.Model):
  def __init__(self, encoder, train_encoder=True):
    super(TextClassificationModel, self).__init__()
    self.encoder = encoder
    self.encoder.trainable = train_encoder
    self.dropout1 = layers.Dropout(0.1)
    self.dropout2 = layers.Dropout(0.1)
    self.dense1 = layers.Dense(20, activation="relu")
    self.dense2 = layers.Dense(2, activation='softmax')
  
  def call(self, inputs):
    x = self.encoder(inputs)
    # Keep only position 0 (the [CLS] token) as the representation of the whole sequence
    x = x['last_hidden_state'][:, 0, :]
    x = self.dropout1(x)
    x = self.dense1(x)
    x = self.dropout2(x)
    x = self.dense2(x)
    return x

## kjp added:
- optional argument to `TextClassificationModel`:
  - `train_encoder`: Boolean indicating whether to train the BERT (here DistilBERT) encoder's parameters (about 66 million!)
    - if `False`, we are only training the new classification head rather than "fine-tuning" BERT
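
As a rough sketch of what is at stake, the arithmetic below uses the parameter counts reported by `text_classification_model.summary()` further down in this notebook. The variable names are only illustrative, and the frozen-encoder figure assumes that setting `trainable = False` on the encoder removes all of its weights from training.

```python
# Parameter counts as reported by text_classification_model.summary().
encoder_params = 66_362_880   # tf_distil_bert_model
dense1_params = 15_380        # Dense(20)
dense2_params = 42            # Dense(2)

head_params = dense1_params + dense2_params
total_params = encoder_params + head_params

# train_encoder=True: everything is trainable (matches "Trainable params" in the summary).
trainable_if_finetuning = total_params
# train_encoder=False (assumed): only the head is trainable; Dropout has no parameters.
trainable_if_frozen = head_params

print(trainable_if_finetuning)  # 66378302
print(trainable_if_frozen)      # 15422
```

With the encoder frozen, only a tiny fraction of the model's parameters would be updated, which is why that setting amounts to training a new head rather than fine-tuning.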

In [6]:
text_classification_model = TextClassificationModel(bert, train_encoder=True)

Load the IMDB review dataset and convert it into a TensorFlow dataset.

In [7]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2022-05-19 17:21:35--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-05-19 17:21:38 (30.3 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [8]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==, not `is` (identity comparison of literals is unreliable)
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

In [9]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [10]:
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding="max_length", max_length=512)
test_encodings = tokenizer(test_texts, truncation=True, padding="max_length", max_length=512)

In [11]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

In [12]:
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


## kjp: Let's get an idea of the shape of the model's layers
- We can confirm that
  - the model reduces the sequence of latent representations produced by BERT to a single latent representation
  - the latent representation size is 768
    - It is input to the `Dense` layer with 20 units
    - With 768 inputs per unit, plus one bias per unit, the parameter count we calculate for this `Dense` layer matches the count reported by `summary()`
      - 20*768 + 20 = 15380
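
The same calculation can be written as a small helper and applied to both `Dense` layers. `dense_param_count` is a hypothetical name introduced here for illustration, not part of Keras.

```python
def dense_param_count(num_inputs: int, num_units: int) -> int:
    """Weights (num_inputs per unit) plus one bias per unit."""
    return num_inputs * num_units + num_units

print(dense_param_count(768, 20))  # 15380, the Dense layer with 20 units
print(dense_param_count(20, 2))    # 42, the 2-unit output layer
```

Both values agree with the `Param #` column that `summary()` prints for `dense` and `dense_1`.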

In [13]:
text_classification_model(next(iter(train_dataset.batch(4))))

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.4083037 , 0.59169626],
       [0.40228027, 0.5977197 ],
       [0.4191736 , 0.58082646],
       [0.39214107, 0.6078589 ]], dtype=float32)>

In [14]:
text_classification_model.summary()

Model: "text_classification_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 tf_distil_bert_model (TFDis  multiple                 66362880  
 tilBertModel)                                                   
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
 dropout_20 (Dropout)        multiple                  0         
                                                                 
 dense (Dense)               multiple                  15380     
                                                                 
 dense_1 (Dense)             multiple                  42        
                                                                 
Total params: 66,378,302
Trainable params: 66,378,302
Non-trainable params: 0
_____________________________

In [15]:
text_classification_model.compile(
    tf.keras.optimizers.Adam(learning_rate=5e-5), 
    "sparse_categorical_crossentropy", 
    metrics=["accuracy"])


# kjp: added
- reduce the size of the training and validation datasets (training is very time-consuming) by adding `take(...)`
- remove the `tensorboard` callback:
  - at the end of each epoch, it tries to `malloc` 5.6G of memory and causes the Colab runtime to crash
- free up memory
  - CPU: runs out of memory on Colab without freeing up
    - we can definitely delete the non-tokenized data (`train_texts`, `test_texts`)
    - deleting the tokenized data (`train_encodings`, `test_encodings`) also seems to work, since it was already copied (via the `dict`) into the `dataset` when the `dataset` was created

In [16]:
num_train = len(train_texts)
del train_texts
del test_texts

In [17]:
del test_encodings

In [18]:
del train_encodings


# kjp added
- the original `train_dataset.take(7500).shuffle(1000).batch(16)`
  - uses the *same* 7500 examples each epoch (the `take(7500)` is my addition, to reduce training time)
- changed to several passes, each over a different chunk, so that together they cover the entire training data
  - by *not* resetting the dataset to its start between "epochs"
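
A minimal sketch of which examples each pass sees, using plain Python ranges in place of `tf.data`. `chunk_indices` is a hypothetical helper; it mirrors `skip(chunk_num * chunk_size).take(chunk_size)` under the assumption that the dataset is read in a fixed order before shuffling.

```python
num_train = 20_000   # size of the training split in this notebook
num_chunks = 4
chunk_size = num_train // num_chunks  # any remainder would be left out

def chunk_indices(chunk_num: int) -> range:
    """Indices that skip(chunk_num * chunk_size).take(chunk_size) would select."""
    start = chunk_num * chunk_size
    return range(start, start + chunk_size)

chunks = [chunk_indices(i) for i in range(num_chunks)]
for i, c in enumerate(chunks):
    print(f"chunk {i}: examples {c.start}..{c.stop - 1}")

# The chunks are disjoint and together cover every example exactly once.
covered = [idx for c in chunks for idx in c]
print(covered == list(range(num_train)))  # True

# By contrast, take(7500) selects the same indices 0..7499 on every epoch.
fixed_subset = range(7500)
```

Each `fit` call therefore trains on examples the model has not yet seen, whereas the fixed `take(7500)` subset is revisited every epoch.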

In [19]:
num_chunks = 4
num_epochs = 1

chunk_size = num_train // num_chunks

print(f"training on {num_train} examples in chunks of size {chunk_size}")

for epoch_num in range(num_epochs):
  for chunk_num in range(num_chunks):
    print(f"Epoch {epoch_num}, chunk {chunk_num}:")
    history = text_classification_model.fit(
      train_dataset.skip(chunk_num * chunk_size).take(chunk_size).shuffle(1000).batch(16), 
      epochs=1, 
      validation_data=val_dataset.take(500).batch(16),
      #callbacks=[tensorboard_callback]
    )

training on 20000 examples in chunks of size 5000
Epoch 0, chunk 0:
Epoch 0, chunk 1:
Epoch 0, chunk 2:
Epoch 0, chunk 3:


## kjp
Compare the above to the "original" below
- The run below has the advantage of starting from the weights already trained above, so it starts out with a low loss
- But you can clearly see the training accuracy going to almost 100%
  - it is overfitting to the *same* examples each epoch
- Contrast that with the above, which doesn't show overfitting across consecutive "epochs" (chunks)

In [20]:
history = text_classification_model.fit(
    train_dataset.take(7500).shuffle(1000).batch(16), 
    epochs=5, 
    validation_data=val_dataset.take(500).batch(16),
    #callbacks=[tensorboard_callback]
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [21]:
%load_ext tensorboard

In [22]:
%tensorboard --logdir logs/fit
