<a href="https://colab.research.google.com/github/oscar-defelice/TextClassifierModels/blob/model-bert/BERT/TextClassifierBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

Google Colab offers free GPUs and even TPUs. To keep the setup simple, we will stick to a GPU. BERT models are quite big, so keep in mind that we are constrained by 12 GB of VRAM in Google Colab, where a Tesla K80 is used (as of 15/04/2020).

First, let's check if you have GPU enabled in your session here in Colab. You can do it by running the following code.

In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:    
    raise SystemError('GPU device not found')


Found GPU at: /device:GPU:0


If you do not have the GPU enabled, just go to:

`Edit -> Notebook Settings -> Hardware accelerator -> Set to GPU`

To fine-tune our model, we first need to install a couple of libraries. 
TensorFlow 2 is already preinstalled, so the missing ones are [transformers](https://github.com/huggingface/transformers) and [TensorFlow datasets](https://github.com/tensorflow/datasets). Together they let us easily import pre-trained models for TensorFlow 2 and fine-tune them with the Keras API.


In [2]:
!pip install -q transformers tensorflow_datasets==4.0.1


# Loading AG News dataset

We will use [ag_news dataset](https://www.tensorflow.org/datasets/catalog/ag_news_subset). We can load it very quickly just with `tensorflow_datasets` library.

The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the complete dataset of more than 1 million news articles. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

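The dataset has the following features (they are printed as part of `ds_info` by the cell below). Since we load it with `as_supervised=True`, each example is returned as a `(description, label)` pair.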
```python
FeaturesDict({
  'description': Text(shape=(), dtype=tf.string),
  'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
  'title': Text(shape=(), dtype=tf.string),
})
```

In [4]:
import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load('ag_news_subset', 
          split = (tfds.Split.TRAIN, tfds.Split.TEST),
          as_supervised=True,
          with_info=True
          )

print('info', ds_info)

Downloading and preparing dataset ag_news_subset/1.0.0 (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /root/tensorflow_datasets/ag_news_subset/1.0.0...
Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.
info tfds.core.DatasetInfo(
    name='ag_news_subset',
    version=1.0.0,
    description='AG is a collection of more than 1 million news articles.
News articles have been gathered from more than 2000  news sources by ComeToMyHead in more than 1 year of activity.
ComeToMyHead is an academic news search engine which has been running since July, 2004.
The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc),
information retrieval (ranking, search, etc), xml, data compression, data streaming,
and any other non-commercial activity.
For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above.
It is used as a text class

## Exploring dataset

Now let's explore the examples for fine-tuning. We can take the first 5 examples and labels with `ds_train.take(5)`, so that we can explore the dataset without iterating over all 120,000 examples in the training set.

In [None]:
for text, label in tfds.as_numpy(ds_train.take(5)):
    print('text:', text.decode()[0:150], label)

text: AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transac 3
text: Reuters - Major League Baseball\Monday announced a decision on the appeal filed by Chicago Cubs\pitcher Kerry Wood regarding a suspension stemming fro 1
text: President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state a 2
text: Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger. 3
text: London, England (Sports Network) - England midfielder Steven Gerrard injured his groin late in Thursday #39;s training session, but is hopeful he will 1


# Preprocessing

Here we have to prepare our data with some text preprocessing: adding special tokens, removing punctuation and stopwords if necessary, and so on.

## Import Libraries

In [None]:
import json 
import pickle
import numpy as np

import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from transformers import BertTokenizer
from transformers import TFBertForSequenceClassification, TFBertMainLayer
from sklearn.metrics import classification_report

## Rename labels

In [None]:
# rename labels (the dataset uses class indices 0 to 3)

labels = {0:'World News', 1:'Sports News', 2:'Business News', 3:'Science-Technology News'}

n_labels = len(labels)

# Tokenization

Now we need to tokenize the text with a pre-trained BERT tokenizer, see [HuggingFace Tokenizers](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer). The tokenizer should match the pre-trained model we want to use, e.g. the cased or uncased version. Loading the tokenizer is as simple as:


In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True,
                                          )

The BERT tokenizer uses a WordPiece vocabulary. It has over 30,000 tokens, each with its own id and pre-trained embedding, so we need to map the tokens to those ids.

**An example**:

In [None]:
vocabulary = tokenizer.get_vocab()

print(list(vocabulary.keys())[5000:5020])

['knight', 'lap', 'survey', 'ma', '##ow', 'noise', 'billy', '##ium', 'shooting', 'guide', 'bedroom', 'priest', 'resistance', 'motor', 'homes', 'sounded', 'giant', '##mer', '150', 'scenes']
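
Entries starting with `##` are word-continuation pieces: WordPiece splits a word that is not in the vocabulary into the longest pieces that are. The sketch below illustrates that greedy longest-match-first idea on a toy vocabulary. `toy_vocab` and `wordpiece_split` are hypothetical names used only for illustration, not part of the `transformers` API, and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# toy vocabulary (assumption), far smaller than the real one
toy_vocab = {"de", "##duction", "quo", "##t", "state"}

print(wordpiece_split("deduction", toy_vocab))  # ['de', '##duction']
print(wordpiece_split("state", toy_vocab))      # ['state']
print(wordpiece_split("bush", toy_vocab))       # ['[UNK]']
```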


We will need the tokeniser exported as a `json` file since this will be useful for the web-api in tensorflow.js.

In [None]:
# Export the dictionary word-to-index to a json file
with open( 'tokeniser.json' , 'w' ) as file:    
    json.dump(tokenizer.get_vocab() , file )
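
On the consuming side, the exported file is just a token-to-id dictionary. Here is a minimal sketch of reading it back and building the inverse id-to-token lookup. It uses a tiny stand-in vocabulary and a temporary file (both assumptions for illustration) instead of the real `tokeniser.json`.

```python
import json
import os
import tempfile

# stand-in for tokenizer.get_vocab() (assumption): a handful of entries only
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "president": 2343, "bush": 5747}

# write the dictionary the same way the export cell does
path = os.path.join(tempfile.mkdtemp(), "tokeniser.json")
with open(path, "w") as file:
    json.dump(toy_vocab, file)

# read it back and invert it, so that ids can be turned into tokens again
with open(path) as file:
    token_to_id = json.load(file)
id_to_token = {index: token for token, index in token_to_id.items()}

ids = [token_to_id[token] for token in ["[CLS]", "president", "bush", "[SEP]"]]
print(ids)                            # [101, 2343, 5747, 102]
print([id_to_token[i] for i in ids])  # ['[CLS]', 'president', 'bush', '[SEP]']
```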

In [None]:
max_length_test = 100
test_sentence = 'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state.'

# add special tokens

test_sentence_with_special_tokens = '[CLS]' + test_sentence + '[SEP]'

tokenized = tokenizer.tokenize(test_sentence_with_special_tokens)

print('tokenized', tokenized)

# convert tokens to ids in WordPiece
input_ids = tokenizer.convert_tokens_to_ids(tokenized)
  
# precalculation of pad length, so that we can reuse it later on
padding_length = max_length_test - len(input_ids)

# add the pad token id (0) for texts shorter than our max length
input_ids = input_ids + ([0] * padding_length)

# attention should focus only on the real (non-padded) tokens
attention_mask = [1] * len(tokenized)

# do not focus attention on padded tokens
attention_mask = attention_mask + ([0] * padding_length)

# token types, needed for example for question answering, for our purpose we will just set 0 as we have just one sequence
token_type_ids = [0] * max_length_test

bert_input = {
    "token_ids": input_ids,
    "token_type_ids": token_type_ids,
    "attention_mask": attention_mask
}
print(bert_input)

tokenized ['[CLS]', 'president', 'bush', '#', '39', ';', 's', 'quo', '##t', ';', 'revenue', '-', 'neutral', 'quo', '##t', ';', 'tax', 'reform', 'needs', 'losers', 'to', 'balance', 'its', 'winners', ',', 'and', 'people', 'claiming', 'the', 'federal', 'de', '##duction', 'for', 'state', '.', '[SEP]']
{'token_ids': [101, 2343, 5747, 1001, 4464, 1025, 1055, 22035, 2102, 1025, 6599, 1011, 8699, 22035, 2102, 1025, 4171, 5290, 3791, 23160, 2000, 5703, 2049, 4791, 1010, 1998, 2111, 6815, 1996, 2976, 2139, 16256, 2005, 2110, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
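
The manual steps above generalise to a small helper. This is a sketch under the assumption that the pad id is 0 (as it is for BERT's `[PAD]` token) and that over-long inputs are simply cut at `max_length`. `pad_and_mask` is a hypothetical name; unlike this naive cut, `encode_plus` keeps the final `[SEP]` when it truncates.

```python
def pad_and_mask(input_ids, max_length, pad_id=0):
    """Pad (or naively truncate) token ids and build the matching attention mask."""
    ids = list(input_ids)[:max_length]
    n_real = len(ids)
    padding_length = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * padding_length,
        "token_type_ids": [0] * max_length,  # single sequence: one segment only
        "attention_mask": [1] * n_real + [0] * padding_length,
    }

example = pad_and_mask([101, 2343, 5747, 102], max_length=6)
print(example["input_ids"])       # [101, 2343, 5747, 102, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

All three lists always have length `max_length`, which is what lets us stack the examples into fixed-size batches.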

The two methods `tokenize` and `convert_tokens_to_ids` can be replaced by a single method, `encode_plus`, which also adds special tokens like `[CLS]` and `[SEP]` for us.

In [None]:
bert_input = tokenizer.encode_plus(
                        test_sentence,                      
                        add_special_tokens = True, # add [CLS], [SEP]
                        max_length = max_length_test, # max length of the text that can go to BERT
                        truncation = True,
                        padding = 'max_length', # add [PAD] tokens
                        return_attention_mask = True,
              )

print('encoded', bert_input)

encoded {'input_ids': [101, 2343, 5747, 1001, 4464, 1025, 1055, 22035, 2102, 1025, 6599, 1011, 8699, 22035, 2102, 1025, 4171, 5290, 3791, 23160, 2000, 5703, 2049, 4791, 1010, 1998, 2111, 6815, 1996, 2976, 2139, 16256, 2005, 2110, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

This is taken from [HuggingFace glossary](https://huggingface.co/transformers/glossary.html):

**Input IDs** - The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

**Attention mask** - Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.



**Token type ids** - Some models’ purpose is to do sequence classification or question answering. These require two different sequences to be encoded in the same input IDs. They are usually separated by special tokens, such as the classifier and separator tokens. For example, the BERT model builds its two sequence input as such: `[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`
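
For a pair of sequences, the token type ids are 0 for every position of `[CLS] A [SEP]` and 1 for every position of `B [SEP]`. Below is a minimal sketch on already-tokenized input; the helper name `build_pair_inputs` and the example tokens are assumptions for illustration.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 covers [CLS] + A + [SEP], segment 1 covers B + [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair_inputs(["who", "won", "?"], ["england", "won"])
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```

For single-sequence classification, as in this notebook, all token type ids stay 0.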



Before we can use such input for BERT fine-tuning, we need to be aware that BERT is trained to consume sequences of at most 512 tokens. Let's define the **max length** allowed for our news texts; the **batch size** is set later, together with the other model parameters.

## Tokenising train and test dataset

Here, we need to define some parameters in order to configure our model.

In [None]:
#@title Tokeniser configuration

max_length = 512#@param {type:"integer"} # can be up to 512 for BERT

Now let's combine the whole encoding process to one function so that we can map over our train and test dataset.

In [None]:
def convert_example_to_feature(text):
  
  # combined step: tokenization, WordPiece id mapping, adding special tokens and truncating texts longer than the max length
  
  return tokenizer.encode_plus(text,                      
                        add_special_tokens = True, # add [CLS], [SEP]
                        max_length = max_length, # max length of the text that can go to BERT
                        truncation = True, # truncate sequences longer than max_length
                        padding = 'max_length', # add [PAD] tokens
                        return_attention_mask = True,
              )


Now, when we iterate over the dataset, we can apply `convert_example_to_feature` to each item.

In [None]:
# map to the input format expected by TFBertForSequenceClassification
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

def encode_examples(ds, limit=-1):

  # prepare list, so that we can build up final TensorFlow dataset from slices.
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []

  if (limit > 0):
      ds = ds.take(limit)
    
  for text, label in tfds.as_numpy(ds):

    bert_input = convert_example_to_feature(text.decode())
  
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append(to_categorical(label, num_classes=n_labels))

  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)
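
`to_categorical` turns each integer label into a one-hot vector, which is the target format a categorical cross-entropy loss expects. Here is a plain-Python sketch of the same transformation; the helper `one_hot` is hypothetical and only illustrates the idea.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if index == label else 0.0 for index in range(num_classes)]

print(one_hot(3, 4))  # [0.0, 0.0, 0.0, 1.0]
print(one_hot(1, 4))  # [0.0, 1.0, 0.0, 0.0]
```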


Now use our `encode_examples` function to convert our train and test datasets.

In [None]:
# train dataset
ds_train_encoded = encode_examples(ds_train)

# test dataset
ds_test_encoded = encode_examples(ds_test)


# Model initialization

We will use the ready-made TensorFlow models from the `transformers` library. You can import them and call `from_pretrained` to load the pre-trained weights.

In the following cell, we configure model and the _training task_.



In [None]:
#@title Model Parameters
#@markdown Here we give a minimal set of parameters for model configuration.

# recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5#@param [5e-5, 3e-5, 2e-5] {type:"raw", allow-input: true}
# small constant added to Adam's denominator for numerical stability
epsilon=1e-08#@param {type: "number"}
number_of_epochs = 2#@param {type: "integer"}
batch_size = 10#@param ["2", "8", "16", "32", "64", "128", "256", "512"] {type:"raw", allow-input: true}

loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.CategoricalAccuracy('accuracy')
opt = tf.keras.optimizers.Adam(learning_rate = learning_rate, epsilon=epsilon)
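
`from_logits=True` tells the loss that the model outputs raw scores, so the softmax is applied inside the loss. The plain-Python sketch below walks through that computation for one example. The function names are illustrative and the logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy_from_logits(one_hot_target, logits):
    """Cross-entropy between a one-hot target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

logits = [0.5, 2.0, -1.0, 0.1]  # made-up scores for the 4 classes
target = [0.0, 1.0, 0.0, 0.0]   # one-hot vector for class 1

probs = softmax(logits)
loss_value = categorical_crossentropy_from_logits(target, logits)
print([round(p, 3) for p in probs])
print(round(loss_value, 3))
```

With a one-hot target the loss reduces to the negative log-probability of the true class, so it approaches 0 as the model grows confident in the right label.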

In [None]:
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = n_labels)

model.compile(optimizer=opt, loss=loss, metrics=[metric])

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 109,485,316
Non-trainable params: 0
_________________________________________________________________
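
The classifier row can be checked by hand: a dense layer from BERT-base's 768-dimensional pooled output to 4 classes has 768 × 4 weights plus 4 biases. The hidden size 768 is the standard BERT-base value and is an assumption here, since the summary does not print it.

```python
hidden_size = 768  # BERT-base hidden size (assumption: not shown in the summary)
n_classes = 4      # AG News classes

classifier_params = hidden_size * n_classes + n_classes  # weights + biases
bert_params = 109_482_240                                # from the summary above
total_params = bert_params + classifier_params

print(classifier_params)  # 3076
print(total_params)       # 109485316
```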


# Fine-tuning

## Training

Now we can start the fine-tuning process. We will use the Keras API `model.fit` and just pass the model configuration that we have already defined.

We shuffle the data, split it into batches and launch the training.

In [None]:
train_dataset = ds_train_encoded.shuffle(10000).batch(batch_size)
test_dataset = ds_test_encoded.batch(batch_size)

In [None]:
bert_history = model.fit(train_dataset, epochs=number_of_epochs, validation_data=test_dataset)

Epoch 1/2
Epoch 2/2


If you get a `ResourceExhaustedError` (out of memory), you need to run this on a GPU/TPU with more memory; at least 12 GB is recommended. Alternatively, decrease the input size (`max_length`) or the batch size, or train in half precision.

## Evaluation

With our simple fine-tuning process, we have already achieved over 93% accuracy on the test dataset.

```
12000/12000 [==============================] - 3921s 327ms/step 
- loss: 0.1410 - accuracy: 0.9516 
- val_loss: 0.1856 - val_accuracy: 0.9395
```
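
The log is consistent with the configuration used earlier: 120,000 training examples in batches of 10 give 12,000 steps per epoch, and 3921 s spread over those steps is about 327 ms each. As a quick check:

```python
import math

n_train = 120_000     # training examples in AG News
batch_size = 10       # batch size used in this notebook
epoch_seconds = 3921  # from the log above

steps_per_epoch = math.ceil(n_train / batch_size)
ms_per_step = 1000 * epoch_seconds / steps_per_epoch

print(steps_per_epoch)     # 12000
print(round(ms_per_step))  # 327
```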

This is not bad compared with [the state-of-the-art classifiers](https://nlpprogress.com/english/text_classification.html). There is still room for improvement, e.g. a larger BERT model, regularization, more epochs, or further in-task pre-training.

Now, let's have a look at the classification report.

Below is a function whose output is a classifier we can use for predictions:


In [None]:
def create_predictor():
  def predict_probs(text):

      encodings = convert_example_to_feature(text)
      # wrap each feature in a list to add the batch dimension the model expects
      inputs = {key: tf.constant([value]) for key, value in encodings.items()}

      # the first model output holds the logits, shape (1, n_labels)
      logits = model(inputs, training=False)[0]
      # softmax turns the logits into class probabilities
      return tf.nn.softmax(logits, axis=-1).numpy()[0]
    
  return predict_probs

classifier = create_predictor()
print(classifier(test_sentence))

In [None]:
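# true and predicted class indices over the whole test set
y_test = np.concatenate([np.argmax(y, axis=1) for _, y in tfds.as_numpy(test_dataset)])
y_pred = np.argmax(model.predict(test_dataset)[0], axis=1)
print(classification_report(y_test, y_pred, target_names=list(labels.values()), digits=4))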
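
For reference, the per-class numbers in such a report are precision, recall and F1. The sketch below computes them by hand on made-up labels. The values of `y_true_toy` and `y_pred_toy` are invented for illustration and are not results from this model; `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, n_classes):
    """Return {class: (precision, recall, f1)} computed from two label lists."""
    scores = {}
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores

# made-up labels for 8 examples over the 4 classes
y_true_toy = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_toy = [0, 1, 1, 1, 2, 2, 3, 0]

scores = per_class_scores(y_true_toy, y_pred_toy, 4)
for c, (precision, recall, f1) in scores.items():
    print(c, round(precision, 3), round(recall, 3), round(f1, 3))
```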

This is quite good for a first attempt in comparison with the [current state of the art](https://nlpprogress.com/english/text_classification.html).

## Tips for fine-tuning

* Use larger BERT model
* Further in-task pretraining
* Multi-task fine-tuning has lower effect than further pre-training, but can help as well, see (Sun et al. 2019)
* Do not constrain yourself just to BERT. Other transformers such as XLNet can work even better.


# Save model and export in JavaScript

In order to convert our model using Tensorflow.js, we have to save the trained model.

In [None]:
#save Keras model
saved_model_path = "./model/modelBERT"

model.save_pretrained(saved_model_path)
with open('./model/info.pkl', 'wb') as f:
    pickle.dump(('bert-fine-tuned', max_length), f)

INFO:tensorflow:Assets written to: modelBERT/assets


In [None]:
%%bash
zip -r model.zip ./model

  adding: modelBERT/ (stored 0%)
  adding: modelBERT/variables/ (stored 0%)
  adding: modelBERT/variables/variables.data-00000-of-00001 (deflated 12%)
  adding: modelBERT/variables/variables.index (deflated 80%)
  adding: modelBERT/saved_model.pb (deflated 92%)
  adding: modelBERT/assets/ (stored 0%)


Hence, we are ready to convert the saved model.

In [None]:
%%bash
tensorflowjs_converter --input_format=keras modelCNN.h5 ./model/

2020-10-26 11:53:42.024115: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1


Since we have not only the model, but also weights files, we zip everything to make it ready to download.

In [None]:
%%bash
zip -r model.zip ./model

  adding: model/ (stored 0%)
  adding: model/group1-shard4of4.bin (deflated 8%)
  adding: model/group1-shard2of4.bin (deflated 7%)
  adding: model/model.json (deflated 82%)
  adding: model/group1-shard1of4.bin (deflated 7%)
  adding: model/group1-shard3of4.bin (deflated 7%)


# References

* [Stanford CS224N: NLP with Deep Learning](https://www.youtube.com/watch?v=8rXD5-xhemo)

* [Stanford CS224U Natural Language Understanding](https://www.youtube.com/watch?v=tZ_Jrc_nRJY&list=PLoROMvodv4rObpMCir6rNNUlFAn56Js20)

* [HuggingFace transformers library](https://huggingface.co/)

* [BERT, ULMFit, ELMO](https://jalammar.github.io/illustrated-bert/)

* [BERTViz - Visualization of Attention Heads](https://github.com/jessevig/bertviz)

* [Illustrated transformer](https://jalammar.github.io/illustrated-transformer/)

* [Illustrated GPT-2](https://jalammar.github.io/illustrated-gpt2)

* http://nlpprogress.com/

* [GLUE benchmark](https://gluebenchmark.com/tasks)

* Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

* Scott Gray, Alec Radford, and Diederik P. Kingma, GPU kernels for block-sparse weights, 2017.

* Jeremy Howard and Sebastian Ruder, Universal language model fine-tuning for text classification, 2018.

* Sepp Hochreiter and Jürgen Schmidhuber, Long short-term memory, Neural computation (1997), no. 8, 1735–1780.

* Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, Language models are unsupervised multitask learners.

* Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang, How to fine-tune BERT for text classification?, 2019.

* Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention is all you need, 2017.

* Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018.

* Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le, XLNet: Generalized autoregressive pretraining for language understanding, 2019.


