<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/session-12/bert-embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti107/blob/master/session-8/bert-embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>
<br/>

# Using BERT as Feature Extractor

Other than fine-tuning BERT for downstream task such as text classification, we can use pretrained BERT model as a feature extractor, very much the same as we are using pretrained CNN such as ResNet as feature extractors for downstream task such as image classification and object detection.  

In this lab, we will see how we use a pretrained DistilBert Model to extract features (or embedding) from text and use the extracted features (embeddings) to train a classifier to classify text. You can contrast this with the other lab where we train the DistilBert end to end for the classification, and compare the performance of both. 

At the end of this session, you will be able to:
- prepare data and use model-specific Tokenizer to format data suitable for use by the model
- extract text embeddings from the bert model 
- use the extracted features for text classification


In [1]:
!pip install transformers



In [2]:
import numpy as np
import tensorflow as tf
import pandas as pd
import os 
import shutil

from transformers import (
    AutoTokenizer,
    TFAutoModel,
)
from transformers.utils import logging as hf_logging

# We enable logging level to info and use default log handler and log formatting
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()

In [3]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

['README', 'imdbEr.txt', 'train', 'test', 'imdb.vocab']

In [4]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['pos',
 'urls_unsup.txt',
 'labeledBow.feat',
 'urls_pos.txt',
 'unsupBow.feat',
 'neg',
 'urls_neg.txt',
 'unsup']

In [5]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

In [6]:
batch_size = 128
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='validation', seed=seed)
test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', batch_size=batch_size, seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [7]:
batches = list(train_ds.as_numpy_iterator())
train_texts = []
train_labels = []

for batch in batches:
    texts, labels = batch[0], batch[1]
    texts_labels = zip(texts, labels)
    for text, label in texts_labels:
        train_texts.append(str(text))
        train_labels.append(label)

In [8]:
len(train_labels), len(train_texts)

(20000, 20000)

In [9]:
batches = list(val_ds.as_numpy_iterator())

val_texts = []
val_labels = []
for batch in batches:
    texts, labels = batch[0], batch[1]
    texts_labels = zip(texts, labels)
    for text, label in texts_labels:
        val_texts.append(str(text))
        val_labels.append(label)

In [10]:
len(val_labels), len(val_texts)

(5000, 5000)

In [11]:
batches = list(test_ds.as_numpy_iterator())

test_texts = []
test_labels = []
for batch in batches:
    texts, labels = batch[0], batch[1]
    texts_labels = zip(texts, labels)
    for text, label in texts_labels:
        test_texts.append(str(text))
        test_labels.append(label)

In [12]:
len(test_labels), len(test_texts)

(25000, 25000)

In [13]:
type(train_texts[0])

str

## Data Preparation

We will just use a small subset of the training and test data for this lab, as the feature extraction can take a long time, even with GPU. 

In [14]:
TRAIN_SIZE = 2000
TEST_SIZE = 200 

train_texts = train_texts[:2000]
train_labels = train_labels[:2000]
test_texts = test_texts[:200]
test_labels = test_labels[:200]

In [15]:
labels = np.array(train_labels)

In [16]:
np.unique(test_labels, return_counts=True)

(array([0, 1], dtype=int32), array([ 98, 102]))

## Tokenization

We will now load the DistilBert tokenizer for the pretrained model "distillbert-base-cased".  This is the same as the other lab exercise.

In [17]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
#tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

[INFO|configuration_utils.py:517] 2021-06-16 06:14:10,486 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:553] 2021-06-16 06:14:10,488 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.6.1",
  "vocab_size": 28996
}

[INFO|tokenization_utils_base.py:1717] 2021-06-16 06:14:11,981 >> loading file https://huggingface.co/distilbert-base-case

The pretrained DistilBERT [tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer) expects a string or list of string, so we need to convert the data frame (or series) into list. 

Here we will tokenize the text string, and pad the text string to the longest sequence in the batch, and also to truncate the sequence if it exceeds the maximum length allowed by the model (in BERT's case, it is 512).

In [18]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)

We will create a tensorflow dataset and use it's efficient batching later to obtain the embeddings.

In [19]:
BATCH_SIZE = 16

In [20]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    train_encodings['input_ids'],
    train_labels
)).batch(BATCH_SIZE)


test_dataset = tf.data.Dataset.from_tensor_slices((
    test_encodings['input_ids'],
    test_labels
)).batch(BATCH_SIZE)

In [21]:
train_data = train_dataset.as_numpy_iterator()
test_data = test_dataset.as_numpy_iterator()

In [22]:
len(train_encodings['input_ids'])

2000

Here we instantiate a pretrained model from 'distilbert-base-cased' and specify output_hidden_state=True so that we get the output from each of the attention layers. 

## Feature Extraction using (Distil)BERT. 

Here we will load the pretrained model for distibert-based-uncased and use it to extract features from the text (i.e. emeddings). 

In [23]:
model = TFAutoModel.from_pretrained("distilbert-base-cased",output_hidden_states=True)
#model = TFAutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

[INFO|configuration_utils.py:517] 2021-06-16 06:14:15,967 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:553] 2021-06-16 06:14:15,971 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.6.1",
  "vocab_size": 28996
}

[INFO|modeling_tf_utils.py:1261] 2021-06-16 06:14:16,266 >> loading weights file https://h

The model will produce two outputs: the 1st output `output[0]` is of shape `(16, 512, 768)` which corresponds to the output of the last hidden layer and the second output `output[1]` is a list of 7 outputs of shape `(16, 512, 768)`, corresponding to the output of each of the 7 attention layers. 768 refers to the hidden size.

In [24]:
train_embeddings = None

for batch in train_data:
    output = model.predict(batch[0])
    hidden_states = output[1]
    # here we take the output of the second last attention layer as our embeddings. 
    # We take the average of the embedding value of 512 tokens (at axis=1) to generate sentence embedding  
    sentence_embeddings = tf.reduce_mean(hidden_states[-2], axis=1).numpy()
    if train_embeddings is None:
        train_embeddings = sentence_embeddings
    else:
        train_embeddings = np.vstack([train_embeddings, sentence_embeddings])

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


In [25]:
test_embeddings = None

for batch in test_data:
    output = model.predict(batch[0])
    hidden_states = output[1]
    # here we take the output of the second last attention layer as our embeddings. 
    # We take the average of the embedding value of 512 tokens (at axis=1) to generate sentence embedding  
    sentence_embeddings = tf.reduce_mean(hidden_states[-2], axis=1).numpy()
    if test_embeddings is None:
        test_embeddings = sentence_embeddings
    else:
        test_embeddings = np.vstack([test_embeddings, sentence_embeddings])

## Train a classifier using the extracted features (embeddings)

In [26]:
X_train = train_embeddings
y_train = train_labels

In [27]:
X_test = test_embeddings
y_test = test_labels

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

In [29]:
clf = LinearSVC()

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.84      0.84        98
           1       0.84      0.85      0.85       102

    accuracy                           0.84       200
   macro avg       0.85      0.84      0.84       200
weighted avg       0.85      0.84      0.84       200





We should be getting an accuracy score of around 80% which is quite good, considering we are training with only 2000 samples!

**Exercise**

1. Modify the code to use the output from a different attention layer as input features (embeddings) to the classifier. 
2. Try to generate sentence embeddings using different strategy, e.g. take a average of the output from multiple layers instead of only a single layer.
2. Modify the code to use BERT model and see if it performs better than the DistilBERT. For BERT Model, the output of different layers are in `output[2]`