#**`This notebook illustrates how to build a BERT model for text classification with Hugging Face, using Python and TensorFlow 🚀`**

For more information, check out [BERT](https://huggingface.co/transformers/model_doc/bert.html)

First, let's check whether a Google Colab GPU is enabled in our notebook.

In [None]:
import tensorflow as tf
device_name = tf.test.gpu_device_name() 
if device_name == '/device:GPU:0':
  print('Found GPU at: {}'.format(device_name))
else:
  raise SystemError('GPU not found!')


Found GPU at: /device:GPU:0


Let's see which GPU is allocated to us.



In [None]:
!nvidia-smi

Thu Aug 13 09:25:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    56W / 149W |    134MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's install the required packages.

In [None]:
!pip install -q transformers tensorflow_datasets 



We will use the IMDB dataset which we can simply download with `tensorflow_datasets`.

In [None]:
import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
                                        split = (tfds.Split.TRAIN, tfds.Split.TEST),
                                        as_supervised=True,
                                        with_info=True)

print('info', ds_info)   

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteL1BITW/imdb_reviews-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteL1BITW/imdb_reviews-test.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteL1BITW/imdb_reviews-unsupervised.tfrecord
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
info tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    =

Now let's explore the examples we will use for fine-tuning. We can take the first 5 examples using `ds_train.take(5)`.

In [None]:
for review, label in tfds.as_numpy(ds_train.take(5)):
  print('Review:', review.decode()[0:120], label)   

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actor 0
Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tire 0
Review: Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable pe 0
Review: This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as y 1
Review: As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably sho 1


#**Tokenization**

Let's load the pretrained [BERT Tokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer).

Note: The tokenizer must match the pretrained model that we intend to use (e.g., cased or uncased).

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)    





The BERT tokenizer uses a WordPiece vocabulary. For `bert-base-uncased` it contains roughly 30,000 entries (30,522 to be exact), and the pretrained model provides an embedding for each of them.

In [None]:
vocabulary = tokenizer.get_vocab()
print(list(vocabulary.keys())[5000:5020])

['knight', 'lap', 'survey', 'ma', '##ow', 'noise', 'billy', '##ium', 'shooting', 'guide', 'bedroom', 'priest', 'resistance', 'motor', 'homes', 'sounded', 'giant', '##mer', '150', 'scenes']


We'll use the following example sentence to understand the tokenization process.

In [None]:
max_length_test = 20
test_sentance = 'Test tokenization sentance, followed by another sentance'  

Some basic operations can convert the text to tokens and tokens to unique integers (ids):

In [None]:
tokens = tokenizer.tokenize(test_sentance)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f' Sentence: {test_sentance}')
print(f'   Tokens: {tokens}')
print(f'Token IDs: {token_ids}')  

 Sentence: Test tokenization sentance, followed by another sentance
   Tokens: ['test', 'token', '##ization', 'sent', '##ance', ',', 'followed', 'by', 'another', 'sent', '##ance']
Token IDs: [3231, 19204, 3989, 2741, 6651, 1010, 2628, 2011, 2178, 2741, 6651]
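
To make the `##` pieces less mysterious, here is a minimal sketch of the greedy longest-match-first idea behind WordPiece. It is not the tokenizer's actual implementation: `TOY_VOCAB`, `wordpiece_word` and `toy_tokenize` are hypothetical names, and the tiny vocabulary is an assumption chosen so that it covers our test sentence.

```python
import re

# Toy vocabulary (an assumption for illustration; the real one has ~30k entries).
TOY_VOCAB = {"test", "token", "##ization", "sent", "##ance", ",",
             "followed", "by", "another"}


def wordpiece_word(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace and punctuation, then apply WordPiece per word."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = []
    for word in words:
        tokens.extend(wordpiece_word(word, vocab))
    return tokens


print(toy_tokenize("Test tokenization sentance, followed by another sentance"))
```

With this toy vocabulary the sketch yields the same token list as the real tokenizer output above, because the misspelled `sentance` is not in the vocabulary and falls back to `sent` + `##ance`.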


All the heavy lifting is done by the [encode_plus](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.encode_plus) method.

In [None]:
encoding = tokenizer.encode_plus(
  test_sentance,
  max_length=max_length_test,
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=True,
  padding='max_length',  # Replaces the deprecated pad_to_max_length=True
  truncation=True,
  return_attention_mask=True,
  return_tensors='tf',  # Return TensorFlow tensors
)

encoding.keys() 

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

The token ids are now stored in a Tensor and padded to a length of 20:

In [None]:
print(len(encoding['input_ids'][0]))
encoding['input_ids'][0] 

20


<tf.Tensor: shape=(20,), dtype=int32, numpy=
array([  101,  3231, 19204,  3989,  2741,  6651,  1010,  2628,  2011,
        2178,  2741,  6651,   102,     0,     0,     0,     0,     0,
           0,     0], dtype=int32)>

In [None]:
print(len(encoding['token_type_ids'][0]))
encoding['token_type_ids']  

20


<tf.Tensor: shape=(1, 20), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>
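
The token type ids above are all zero because we encoded a single sentence. When BERT receives a sentence pair, the second segment is marked with ones. The following sketch shows that layout for unpadded input; `build_token_type_ids` is a hypothetical helper, not part of the `transformers` API.

```python
def build_token_type_ids(len_a, len_b=0):
    """Segment ids for '[CLS] A [SEP]' or '[CLS] A [SEP] B [SEP]' (no padding)."""
    ids = [0] * (len_a + 2)        # [CLS] + tokens of sentence A + [SEP]
    if len_b > 0:
        ids += [1] * (len_b + 1)   # tokens of sentence B + final [SEP]
    return ids


# A single sentence with 11 tokens, like our test sentence: all zeros.
print(build_token_type_ids(11))
# A pair with 3 and 2 tokens: the second segment is marked with ones.
print(build_token_type_ids(3, 2))
```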


The attention mask has the same length:

In [None]:
print(len(encoding['attention_mask'][0]))
encoding['attention_mask']

20


<tf.Tensor: shape=(1, 20), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>

We can invert the tokenization to have a look at the special tokens:

In [None]:
tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])   

['[CLS]',
 'test',
 'token',
 '##ization',
 'sent',
 '##ance',
 ',',
 'followed',
 'by',
 'another',
 'sent',
 '##ance',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']
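
Going one step further, the word pieces can be merged back into words by dropping the special tokens and gluing every `##` piece onto the previous token. This is only a sketch of the idea: `merge_wordpieces` and `SPECIAL_TOKENS` are hypothetical names. In practice, `tokenizer.decode(..., skip_special_tokens=True)` does this job.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens):
    """Drop special tokens and re-attach '##' continuation pieces."""
    words = []
    for token in tokens:
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return words


tokens = ['[CLS]', 'test', 'token', '##ization', 'sent', '##ance', ',',
          'followed', 'by', 'another', 'sent', '##ance', '[SEP]'] + ['[PAD]'] * 7
print(merge_wordpieces(tokens))
```

Note that the original casing is not recovered, since the uncased tokenizer lowercases the text before splitting it.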

**`INPUT IDS`** - The input ids are often the only required parameter to pass to the model as input. They are token indices: numerical representations of the tokens that make up the sequence.

**`ATTENTION MASK`** - A mask that prevents the model from performing attention on padding token indices. Its value is 1 for tokens that are NOT MASKED (real tokens) and 0 for tokens that are MASKED (padding).

**`TOKEN TYPE ID`** - A binary mask identifying which sequence each token belongs to when the input consists of two sequences (0 for the first, 1 for the second).
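
The three fields can be reproduced by hand for our test sentence. The sketch below is a simplified stand-in for what `encode_plus` does with a single sequence, not its real implementation; `toy_encode` is a hypothetical helper, and the special token ids (101, 102, 0) are read off the outputs shown earlier.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids of [CLS], [SEP], [PAD] seen in the outputs


def toy_encode(token_ids, max_length):
    """Add special tokens, truncate, and pad a single sequence to max_length."""
    body = token_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad           # 0 = padding, ignored by attention
    token_type_ids = [0] * max_length       # single sequence: all zeros
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


token_ids = [3231, 19204, 3989, 2741, 6651, 1010, 2628, 2011, 2178, 2741, 6651]
encoded = toy_encode(token_ids, max_length=20)
for name, values in encoded.items():
    print(name, values)
```

The printed lists should match the tensors returned by `encode_plus` above: 13 real positions followed by 7 padded ones.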

#**Hyperparameter choice**
We need to be aware that BERT is trained to consume sequences with a maximum of 512 tokens. Let's define the `max_length` and `batch_size` we will use for our reviews.

In [None]:
# Up to 512 tokens for BERT
max_length = 512
# Recommended batch sizes for BERT are 16 or 32. On this dataset, however, the model
# overfits quite often, and smaller batches act as a form of regularization.
batch_size = 6
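
The batch size also determines how many optimizer steps one epoch takes. As a quick back-of-the-envelope check, using the 25,000 training examples reported by `ds_info` (the helper name `steps_per_epoch` is made up for this sketch):

```python
import math

TRAIN_EXAMPLES = 25000  # size of the IMDB training split reported by ds_info


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


for candidate in (6, 16, 32):
    print(f"batch_size={candidate}: {steps_per_epoch(TRAIN_EXAMPLES, candidate)} steps")
```

A batch size of 6 therefore means several thousand weight updates per epoch, compared with well under a thousand for a batch size of 32.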

#**Encoding train and test dataset**

Now let's combine the whole encoding process into one function so we can map it over our training and test sets.

In [None]:
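def convert_example_to_feature(review):
  # Tokenize one review and pad/truncate it to exactly max_length tokens
  return tokenizer.encode_plus(review,
                               add_special_tokens = True,
                               max_length = max_length,
                               padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                               return_token_type_ids=True,
                               truncation=True,
                               return_attention_mask=True,
                              )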

We will now iterate over the dataset and apply the `convert_example_to_feature` function to each item.

In [None]:
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  # Pack the features into the dictionary format the model expects
  return {
      "input_ids" : input_ids,
      "token_type_ids" : token_type_ids,
      "attention_mask" : attention_masks,
  }, label


def encode_examples(ds, limit=-1):
  # Lists that collect the encoded features of every review
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []

  # Optionally use only the first `limit` examples (handy for quick experiments)
  if limit > 0:
    ds = ds.take(limit)

  for review, label in tfds.as_numpy(ds):
    bert_input = convert_example_to_feature(review.decode())

    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append([label])

  # The tuple order must match the argument order of map_example_to_dict
  return tf.data.Dataset.from_tensor_slices(
      (input_ids_list, attention_mask_list, token_type_ids_list, label_list)
  ).map(map_example_to_dict)

We will convert our train and test datasets:

In [None]:
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)  

In [None]:
ds_test_encoded = encode_examples(ds_test).batch(batch_size) 
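
It may help to see what `.shuffle(10000).batch(batch_size)` does conceptually. `tf.data` shuffles through a fixed-size buffer rather than permuting the whole dataset, so with 25,000 examples and a buffer of 10,000 the shuffle is only approximately uniform. The pure-Python sketch below imitates that behaviour on a small list; `buffered_shuffle` and `batched` are hypothetical helpers, not TensorFlow functions.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in a random order drawn from a sliding buffer of fixed size."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element...
        buffer[idx] = item           # ...and replace it with the new item
    rng.shuffle(buffer)              # drain what is left in random order
    yield from buffer


def batched(items, batch_size):
    """Group items into lists of batch_size (the last one may be shorter)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


data = list(range(25))
shuffled = list(buffered_shuffle(data, buffer_size=10, rng=random.Random(0)))
batches = list(batched(shuffled, batch_size=6))
print(batches)
```

Because the first emitted element always comes from the first `buffer_size` items, a buffer at least as large as the dataset is needed for a fully uniform shuffle.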

#**Model Initialization**

In [None]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

# Recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5

# Multiple epochs can improve model performance, unless we overfit the model
no_of_epochs = 1

# Model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

# The Adam optimizer is recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

# Our labels are integers rather than one-hot vectors, so we use sparse
# categorical cross-entropy and sparse categorical accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
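
To see what the loss and the metric compute, here is a small pure-Python sketch for integer labels and raw logits. It mirrors the idea behind `SparseCategoricalCrossentropy(from_logits=True)` and `SparseCategoricalAccuracy`, not the Keras implementation; both function names and the example logits are made up for illustration.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Negative log-softmax of the true class, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose highest logit is at the index of the true label."""
    correct = 0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


batch_logits = [[2.0, -1.0], [0.1, 0.3], [1.0, 0.0]]  # two classes: negative, positive
labels = [0, 1, 1]                                    # integer labels, not one-hot

for logits, label in zip(batch_logits, labels):
    print(logits, label, round(sparse_categorical_crossentropy(logits, label), 4))
print("accuracy:", sparse_categorical_accuracy(batch_logits, labels))
```

Since the model outputs raw logits, `from_logits=True` tells the loss to apply the softmax itself, which is what the log-sum-exp term does here.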

#**Training**

In [None]:
bert_history = model.fit(ds_train_encoded, epochs = no_of_epochs, validation_data=ds_test_encoded)  



If you get a resource-exhausted (out-of-memory) error, you will need to run this on a GPU/TPU; at least 12 GB of memory is recommended. You can also reduce the input length (`max_length`) or the batch size.
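
A rough estimate shows why reducing the input length helps so much. Self-attention builds one score matrix of size `seq_len × seq_len` per head and per layer, so this part of the memory grows quadratically with the sequence length. The sketch below counts only these score matrices in float32 for a BERT-base model (12 layers, 12 heads); it ignores all other activations, weights and optimizer state, so treat the numbers as a lower bound rather than a measurement. `attention_score_bytes` is a hypothetical helper.

```python
def attention_score_bytes(batch_size, seq_len, num_heads=12, num_layers=12,
                          bytes_per_value=4):
    """Bytes used by the attention score matrices alone (float32, forward pass)."""
    values_per_layer = batch_size * num_heads * seq_len * seq_len
    return values_per_layer * bytes_per_value * num_layers


for seq_len in (512, 256, 128):
    mib = attention_score_bytes(batch_size=6, seq_len=seq_len) / 2**20
    print(f"max_length={seq_len}: ~{mib:.0f} MiB for attention scores")
```

Halving `max_length` cuts this term by a factor of four, whereas halving the batch size only halves it.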

#**Summary**

- You learned how to use various functions of the BERT tokenizer and model
- You built a text classifier using Hugging Face
- You can improve the model further by using a larger BERT model, trying other transformers such as XLNet, and much more!

#**Reference**
- [BERT](https://huggingface.co/transformers/model_doc/bert.html)
- [TensorFlow Datasets](https://www.tensorflow.org/datasets)
- [Transformers](https://huggingface.co/transformers/)
- [BERTViz](https://github.com/jessevig/bertviz)