# Practical 6

## Transformers


Let us first use the Hugging Face Transformers library to perform some common NLP tasks.

We will use the `pipeline()` function, which supports several NLP tasks such as classification, summarization and machine translation. For a list of supported tasks, see the documentation [here](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.pipeline). A pipeline connects a task-specific model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

Install the Transformers and Datasets libraries to run this notebook.

For this practical, it is best to create a new conda environment. In this new environment, run the following commands:

pip install tf-keras 

pip install datasets transformers[sentencepiece]

pip install jupyter notebook

pip install ipykernel

python -m ipykernel install --user

For the section on fine-tuning BERT, you also need to install:

pip install transformers

pip install scikit-learn

Additionally, in your Jupyter notebook, you have to run the following cell:

import os
os.environ['TF_USE_LEGACY_KERAS'] = '1' 

after which you should restart the kernel before running the cells.

In [None]:
# install the extra package "sentencepiece" required for machine translation tasks
!pip install datasets transformers[sentencepiece]

### Text Generation

Let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model auto-completes it by generating the remaining text. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below. The default model used is GPT-2.

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.



In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("Harry Potter whipped out his wand and", max_length=50, num_return_sequences=2)
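The randomness mentioned above comes from sampling the next token from a probability distribution instead of always picking the most likely one. The toy sketch below illustrates only that idea: the four-word distribution and the `sample_continuations` helper are hypothetical and are not how GPT-2 or the pipeline is implemented.

```python
import random

# Hypothetical next-token probabilities after the prompt (illustration only).
next_token_probs = {"shouted": 0.4, "pointed": 0.3, "ran": 0.2, "smiled": 0.1}

def sample_continuations(probs, num_return_sequences, seed=None):
    """Draw one next token per returned sequence, weighted by probability."""
    rng = random.Random(seed)
    tokens = list(probs.keys())
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=num_return_sequences)

# Two independent draws, analogous to num_return_sequences=2.
samples = sample_continuations(next_token_probs, num_return_sequences=2, seed=0)
print(samples)
```

Fixing the seed makes the draws repeatable, while leaving it as `None` can give different tokens on each run. The Transformers library offers a `set_seed` helper for the same purpose.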

### Mask filling

The next pipeline you’ll try is fill-mask. The idea of this task is to fill in the blanks in a given text. 

The `top_k` argument controls how many possibilities are displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models.

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("The tech giant has been accused of trademark <mask> by other companies.", top_k=2)

### Question Answering

The question-answering pipeline answers questions using information from a given context. Note that the answer is extracted from the given context and not generated. The `start` and `end` values in the output of the example below give the span of text in the context that provides the answer.

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="What am I specializing in?",
    context="I am currently taking Diploma in Information Technology in NYP, with specialization in Artificial Intelligence.",
)
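To see what the `start` and `end` values mean, the sketch below builds a hypothetical result dictionary by hand and slices the context with it. It assumes that `start` and `end` are character offsets and that `end` is exclusive. The offsets are computed with `str.find` rather than copied from a real pipeline run, and the score field is left out.

```python
context = (
    "I am currently taking Diploma in Information Technology in NYP, "
    "with specialization in Artificial Intelligence."
)

# Hypothetical pipeline result: offsets are computed here, not taken from a real run.
answer_text = "Artificial Intelligence"
start = context.find(answer_text)
result = {"answer": answer_text, "start": start, "end": start + len(answer_text)}

# Slicing the context with start/end recovers the extracted answer.
extracted = context[result["start"]:result["end"]]
print(result["start"], result["end"], extracted)
```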

### Summarization

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Like with text generation, you can specify a `max_length` or a `min_length` for the result.

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

# Fine-tuning BERT for Text Classification

One approach to using BERT for a downstream task such as text classification is to fine-tune the pretrained model.

In this lab, we will see how to take a pretrained DistilBERT model and fine-tune it with custom training data for a text classification task.

### Install Hugging Face Transformers library

If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [None]:
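# Set this before TensorFlow or Transformers is imported so that the legacy Keras (tf_keras) is used
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

!pip install --upgrade tf_keras
!pip install --upgrade transformers
import transformers
print(transformers.__version__)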

In [None]:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

### Data Preparation

Let's download the dataset into a pandas dataframe.

In [None]:
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

train_df = pd.read_csv(train_data_url)
test_df = pd.read_csv(test_data_url)

train_df.head()


The train set has 40000 samples. We will use only a small subset (e.g. 2000 samples) for fine-tuning our pretrained model. Similarly, we will use a smaller test set for evaluating our model. We use the dataframe's `sample()` method to randomly select a subset of samples.

In [None]:
TRAIN_SIZE = 2000
TEST_SIZE = 200 

train_df = train_df.sample(n=TRAIN_SIZE)
test_df = test_df.sample(n=TEST_SIZE)

We now convert the text labels into numeric values of 0 (negative) and 1 (positive), separate the labels from the text, and split the training data into training and validation sets.

In [None]:
train_df['sentiment'] =  train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] =  test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [None]:
train_df.sentiment.value_counts()

In [None]:
train_texts = train_df['review']
train_labels = train_df['sentiment']
test_texts = test_df['review']
test_labels = test_df['sentiment']

In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
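As a rough illustration of what the split above does to the data sizes, here is a minimal pure-Python sketch of a shuffled 80/20 split. The `simple_train_val_split` helper and the placeholder reviews are hypothetical. This is not scikit-learn's implementation, only the same idea applied to 2000 items.

```python
import random

def simple_train_val_split(texts, labels, val_fraction=0.2, seed=42):
    """Shuffle indices once, then carve off a validation slice."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    pick = lambda items, idx: [items[i] for i in idx]
    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Placeholder data standing in for the 2000 sampled reviews.
texts = [f"review {i}" for i in range(2000)]
labels = [i % 2 for i in range(2000)]

tr_texts, va_texts, tr_labels, va_labels = simple_train_val_split(texts, labels)
print(len(tr_texts), len(va_texts))
```

With 2000 samples and a validation fraction of 0.2, this leaves 1600 samples for training and 400 for validation.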

### Tokenization

We will be using the DistilBERT tokenizer for the pretrained model "distilbert-base-uncased" in this practical. The tokenizer produces input tokens in the form the DistilBERT model expects: for example, it automatically adds the \[CLS\] token at the front of the token sequence and the \[SEP\] token at the end, and it also produces the attention mask that marks the padded positions in the input sequence.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

The DistilBERT tokenizer (identical to the BERT tokenizer) uses a WordPiece vocabulary with close to 30000 entries. Each entry has its own id, and the tokenizer maps tokens to those ids.

In [20]:
print(f"Tokenizer vocab size = {tokenizer.vocab_size}")
print(list(tokenizer.vocab.keys())[6000:6020])

Tokenizer vocab size = 30522
['batsman', 'transportation', '##hak', 'fortification', 'rr', 'discourage', 'fourier', 'improvisation', 'favour', 'plural', 'tanker', 'world', '1839', 'prague', 'summer', 'sworn', 'boost', 'doctor', '294', '##nostic']
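To make the token-to-id mapping more concrete, the sketch below splits a single word with the greedy longest-match-first strategy that WordPiece tokenizers use, then looks up the ids. The tiny vocabulary and its ids are made up for illustration. The real tokenizer also handles lowercasing, punctuation and special tokens, which are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real one has 30522 entries).
toy_vocab = {"[UNK]": 0, "transform": 1, "##er": 2, "good": 3, "##s": 4}

pieces = wordpiece_tokenize("transformer", toy_vocab)
ids = [toy_vocab[piece] for piece in pieces]
print(pieces, ids)
```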


We can now apply the tokenizer to a test sentence as below.

In [21]:
test_sentence = "Transformer is really good for Natural Language Processing."

encoding = tokenizer(test_sentence, padding=True, truncation=True)
print(f"Encoding keys:  {encoding.keys()}\n")

print(f"token ids: {encoding['input_ids']}\n")
print(f"attention_mask: {encoding['attention_mask']}\n")
print(f"tokens: {tokenizer.convert_ids_to_tokens(encoding['input_ids'])}")

Encoding keys:  dict_keys(['input_ids', 'attention_mask'])

token ids: [101, 10938, 2121, 2003, 2428, 2204, 2005, 3019, 2653, 6364, 1012, 102]

attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

tokens: ['[CLS]', 'transform', '##er', 'is', 'really', 'good', 'for', 'natural', 'language', 'processing', '.', '[SEP]']


Let us take a closer look at the output of the tokenization process. 

We notice that the tokenizer returns a dictionary with two items, 'input_ids' and 'attention_mask'. The 'input_ids' item contains the ids of the tokens, while 'attention_mask' marks real tokens with 1 and padded positions with 0. Here the mask is all 1s because a single sentence needs no padding.

We also notice that for the example sentence, the word 'Transformer' is lowercased and broken up into two tokens, 'transform' and '##er'. The '##' prefix means that the token should be attached to the previous one.

We also see that the tokenizer added \[CLS\] to the beginning of the token sequence and \[SEP\] to the end.
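The '##' convention can be reversed with a few lines of plain Python. The sketch below uses the token list printed in the output above. The `merge_wordpieces` helper is a hypothetical name for illustration and is not part of the Transformers API.

```python
tokens = ['[CLS]', 'transform', '##er', 'is', 'really', 'good', 'for',
          'natural', 'language', 'processing', '.', '[SEP]']

def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Drop special tokens and glue '##' pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # strip the '##' prefix and attach
        else:
            words.append(token)
    return words

words = merge_wordpieces(tokens)
print(words)
```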

Now let's go ahead and tokenize our texts. Before we do so, we need to convert each pandas Series to a list, as the tokenizer cannot work with a pandas Series or DataFrame directly.

In [22]:
train_texts = train_texts.to_list()
train_labels = train_labels.to_list()
val_texts = val_texts.to_list()
val_labels = val_labels.to_list()
test_texts = test_texts.to_list()
test_labels = test_labels.to_list()

In [23]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
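The effect of `padding=True` and `truncation=True` can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the pad id is taken to be 0, the `pad_and_truncate` helper is hypothetical, and truncation here simply cuts the list, whereas the real tokenizer keeps the \[SEP\] token at the end.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut sequences to max_length, then pad all to the longest one left."""
    longest = min(max(len(ids) for ids in batch), max_length)
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:longest]                     # truncation
        n_pad = longest - len(ids)              # padding needed
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two token-id sequences of different lengths (101 = [CLS], 102 = [SEP]).
batch = [[101, 2204, 102], [101, 10938, 2121, 2003, 2204, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, so every sequence in a batch can share one length without the padding affecting the result.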

We then create a TensorFlow dataset using the encodings and the labels.

In [24]:
batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

### Fine-tuning the model
Now let us fine-tune the pre-trained model by training it with our custom dataset.  

We will instantiate a pretrained model 'distilbert-base-uncased', using `TFAutoModelForSequenceClassification`, and passing `num_labels=2` to indicate we want to train a 2-class (binary) classifier.

In [25]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",num_labels=2)

Downloading tf_model.h5:   0%|          | 0.00/363M [00:00<?, ?B/s]

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_projector', 'vocab_transform', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_157']
You should probably TRAIN this model on a down-stream task to be able to use 

The model is a `tf.keras.Model` subclass, so you can train it using Keras API methods such as `fit()`.

Transformer models benefit from a much lower learning rate than the default used by Adam, which is 1e-3. In this training, we will start the training with 5e-5 (0.00005) and slowly reduce the learning rate over the course of training. In the literature, you will sometimes see this referred to as decaying or annealing the learning rate. In Keras, the best way to do this is to use a learning rate scheduler. A good one to use is PolynomialDecay. Despite the name, with default settings it simply linearly decays the learning rate from the initial value to the final value over the course of training, which is exactly what we want. In order to use a scheduler correctly, though, we need to tell it how long training is going to be. We compute that as `num_train_steps` below.

In [26]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay

num_epochs = 1

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Since our dataset is already batched, we can simply take the len.
num_train_steps = len(train_dataset) * num_epochs

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)
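The schedule defined above can be reproduced by hand to see which learning rate each step receives. The sketch below follows the formula documented for Keras's `PolynomialDecay` with `power=1.0`. The sample count of 1600 is an assumption based on sampling 2000 reviews and holding out 20% for validation.

```python
import math

def polynomial_decay(step, initial_lr, end_lr, decay_steps, power=1.0):
    """Learning rate at a given step; linear when power is 1.0."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr

num_samples = 1600   # assumed: 2000 sampled reviews minus the 20% validation split
batch_size = 16
num_epochs = 1

# Number of batches per epoch times the number of epochs.
num_train_steps = math.ceil(num_samples / batch_size) * num_epochs

for step in (0, 25, 50, 100):
    print(step, polynomial_decay(step, 5e-5, 0.0, num_train_steps))
```

Under these assumptions there are 100 training steps, and the learning rate falls in a straight line from 5e-5 at step 0 to 0 at step 100.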

Now we will just compile the model with the learning rate scheduler and the loss function and train our model for 1 epoch. 

Note that the transformer model outputs logits directly instead of passing them through a softmax layer, so you need to set `from_logits=True` in your loss function.


In [27]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

opt = Adam(learning_rate=lr_scheduler)

loss = SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

model.fit(train_dataset, validation_data=val_dataset, epochs=num_epochs)



<keras.callbacks.History at 0x20819f0a640>
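To illustrate what `from_logits=True` means in the loss used above, the following sketch computes a sparse categorical cross-entropy by hand for one sample: the softmax is applied to the raw logits inside the loss, and the negative log-probability of the true class is returned. The logit values are hypothetical, and this is a plain-Python illustration rather than the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log-probability of the integer class label."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 0.0]  # hypothetical model output for one review
probs = softmax(logits)

loss_if_negative = sparse_categorical_crossentropy_from_logits(logits, 0)
loss_if_positive = sparse_categorical_crossentropy_from_logits(logits, 1)
print(probs, loss_if_negative, loss_if_positive)
```

The loss is small when the true label matches the class with the larger logit and large otherwise. If the logits were passed to a loss that expects probabilities, the result would be wrong, which is why the flag matters.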

You will notice that validation accuracy reaches around 89%.  Let's evaluate on our test set. We should see around the same accuracy. 

In [28]:
model.evaluate(test_dataset)




Let's go ahead and save our model for inference later. Note that we use the Transformers library's own save method, `save_pretrained()`, instead of the normal Keras model save.

In [29]:
model.save_pretrained('finetuned_model')

### Try out the model
Now let's try out our model with our own sentence. We first load our saved fine-tuned model using the `from_pretrained()` method, specifying the folder where we saved the model.

In [30]:
my_model = TFAutoModelForSequenceClassification.from_pretrained("finetuned_model")
text = input('Write your review here:')
inputs = tokenizer(text, truncation=True, return_tensors="tf")  # truncate reviews longer than the model's maximum length
output = my_model(inputs)
pred_prob = tf.nn.softmax(output.logits, axis=-1)
print(pred_prob)
pred = np.argmax(pred_prob)
print(pred)
if pred == 1:
    print('positive')
else:
    print('negative')

Some layers from the model checkpoint at finetuned_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_157']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at finetuned_model and are newly initialized: ['dropout_177']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tf.Tensor([[0.5937765 0.4062235]], shape=(1, 2), dtype=float32)
0
negative
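The last step of the cell above, turning probabilities into a label, can be reproduced with the probabilities printed in the output. The `id2label` dictionary is an assumption that mirrors the 0 (negative) and 1 (positive) encoding used during data preparation.

```python
# Label encoding used earlier: 0 = negative, 1 = positive.
id2label = {0: "negative", 1: "positive"}

# Probabilities copied from the printed tensor above.
pred_prob = [0.5937765, 0.4062235]

# Index of the largest probability (what np.argmax returns for this row).
pred = max(range(len(pred_prob)), key=lambda i: pred_prob[i])
label = id2label[pred]
confidence = pred_prob[pred]
print(f"{label} ({confidence:.1%})")
```

A probability close to 0.5, as here, suggests the model is not very certain about this particular review.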


Other than fine-tuning BERT for a downstream task such as text classification, we can use a pretrained BERT model as a feature extractor, in much the same way that a pretrained CNN such as ResNet is used as a feature extractor for downstream tasks such as image classification and object detection.

In this lab, we will see how to use a pretrained DistilBERT model to extract features (embeddings) from text and use those features to train a text classifier. You can contrast this with the earlier section, where we trained DistilBERT end to end for classification.

Let us load the pretrained distilbert-base-uncased model and use it to extract features (i.e. embeddings) from the text.

In [31]:
from transformers import TFAutoModel

# The bare, pre-trained DistilBERT transformer model outputting raw hidden-states and without any specific head on top.
distilBERT = TFAutoModel.from_pretrained("distilbert-base-uncased",output_hidden_states=True)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_projector', 'vocab_transform', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Now we will freeze the layers and add a classification head.

In [32]:
# Make DistilBERT layers untrainable
for layer in distilBERT.layers:
    layer.trainable = False

In [42]:
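from tensorflow.keras.layers import Dense, Input

# The inputs are named after the tokenizer's output keys so the dataset dictionaries map onto them.
# A fixed length of 512 assumes the encodings were padded or truncated to DistilBERT's maximum length.
inps = Input(shape=(512,), dtype='int64', name='input_ids')
masks = Input(shape=(512,), dtype='int64', name='attention_mask')

# [0] selects last_hidden_state; [:, 0, :] keeps only the [CLS] token vector of each sequence
dbert_layer = distilBERT(inps, attention_mask=masks)[0][:, 0, :]
dense = Dense(64, activation='relu')(dbert_layer)
pred = Dense(1, activation='sigmoid')(dense)
model = tf.keras.Model(inputs=[inps, masks], outputs=pred)
model.summary()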

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 512)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 512)]        0           []                               
                                                                                                  
 tf_distil_bert_model (TFDistil  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 BertModel)                     ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, 512, 768),                                                   
                                 hidden_states=((No                                         
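The indexing `[0][:,0,:]` in the model definition above picks the vector of the first token (\[CLS\]) for every sequence in the batch. The sketch below mimics that with nested lists. The numbers and the small shape of (batch=2, seq_len=3, hidden=4) are made up, whereas the real tensor has shape (batch, 512, 768).

```python
# Toy stand-in for last_hidden_state with shape (batch=2, seq_len=3, hidden=4).
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.1, 3.2, 3.3], [4.0, 4.1, 4.2, 4.3]],
]

# Equivalent of last_hidden_state[:, 0, :]: the first token's vector per sequence.
cls_vectors = [sequence[0] for sequence in last_hidden_state]

print(len(cls_vectors), len(cls_vectors[0]))
print(cls_vectors)
```

Each sequence is thereby reduced to a single fixed-size vector, which is what the dense classification head takes as input.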

Finally, let's train the model.

In [43]:
opt = Adam()


model.compile(optimizer=opt, loss='binary_crossentropy', metrics=["accuracy"])

model.fit(train_dataset, validation_data=val_dataset, epochs=10)

Epoch 1/10
Epoch 2/10
