# Sentence Equivalence Task

## **STEP ZERO**: Import Data

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
!pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-1.12.1-py3-none-any.whl (270 kB)
[K     |████████████████████████████████| 270 kB 5.2 MB/s 
[?25hCollecting transformers[sentencepiece]
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
[K     |████████████████████████████████| 2.9 MB 39.6 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.7.4.post0-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 35.0 MB/s 
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 42.0 MB/s 
[?25hCollecting huggingface-hub<0.1.0,>=0.0.14
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
[K     |████████████████████████████████| 56 kB 4.3 MB/s 
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.10.0-py3-none-any.whl (125 kB)
[K     |████████████████████████████████| 125 kB 38.7 MB/s 
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.7.0-cp37-cp37m-many

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Downloading:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


  0%|          | 0/3 [00:00<?, ?it/s]

Downloading: 0.00B [00:00, ?B/s]

Downloading: 0.00B [00:00, ?B/s]

Downloading: 0.00B [00:00, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

## **STEP ONE:** Preprocessing + Feature Engineering
Tasks done here:
- tokenise by subwords
- add [CLS], [SEP]
- generate sequence IDs

In [3]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_dataset(dataset):
    encoded = tokenizer(
        dataset["sentence1"],
        dataset["sentence2"],
        padding=True,
        truncation=True,
        return_tensors='np',
    )
    return encoded.data

tokenized_datasets = {
    split: tokenize_dataset(raw_datasets[split]) for split in raw_datasets.keys()
}## prepares a dictionary that has train, test and val's tokenised tensors

tokenized_datasets

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

{'test': {'attention_mask': array([[1, 1, 1, ..., 0, 0, 0],
         [1, 1, 1, ..., 0, 0, 0],
         [1, 1, 1, ..., 0, 0, 0],
         ...,
         [1, 1, 1, ..., 0, 0, 0],
         [1, 1, 1, ..., 0, 0, 0],
         [1, 1, 1, ..., 0, 0, 0]]),
  'input_ids': array([[  101,  7473,  2278, ...,     0,     0,     0],
         [  101,  1996,  2088, ...,     0,     0,     0],
         [  101,  2429,  2000, ...,     0,     0,     0],
         ...,
         [  101, 16559,  5226, ...,     0,     0,     0],
         [  101,  2197,  2733, ...,     0,     0,     0],
         [  101, 17540,  8004, ...,     0,     0,     0]]),
  'token_type_ids': array([[0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         ...,
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0],
         [0, 0, 0, ..., 0, 0, 0]])},
 'train': {'attention_mask': array([[1, 1, 1, ..., 0, 0, 0],
         [1, 1, 1, ..., 0, 0, 0],
         [1, 1, 1, ..., 0, 0, 0],
         

## **STEP TWO**: BRAIN BUILDING
Tasks done here
- Import BERT transformer (replace its head with sequence classifier head)
- Set stuff for model
- Train

In [4]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Downloading:   0%|          | 0.00/511M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [6]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(
    optimizer='adam',
    loss=SparseCategoricalCrossentropy(from_logits=True), ## VERY IMP
    metrics=['accuracy'],
)
model.fit(
    tokenized_datasets['train'],
    np.array(raw_datasets['train']['label']), 
    validation_data=(
        tokenized_datasets['validation'],
        np.array(raw_datasets['validation']['label']),
    ),
    batch_size=8
)



<keras.callbacks.History at 0x7f28102cac10>

In [9]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
batch_size = 8
num_epochs = 1
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs
num_train_steps = (len(tokenized_datasets['train']['input_ids']) // batch_size) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
from tensorflow.keras.optimizers import Adam
opt = Adam(learning_rate=lr_scheduler)

In [11]:
import tensorflow as tf

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss)
model.fit(
    tokenized_datasets['train'],
    np.array(raw_datasets['train']['label']), 
    validation_data=(
        tokenized_datasets['validation'],
        np.array(raw_datasets['validation']['label']),
    ),
    batch_size=8
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




<keras.callbacks.History at 0x7f27a09ddf50>

In [17]:
class F1_metric(tf.keras.metrics.Metric):
    def __init__(self, name='f1_score', **kwargs):
        super().__init__(name=name, **kwargs)
        # Initialize our metric by initializing the two metrics it's based on:
        # Precision and Recall
        self.precision = tf.keras.metrics.Precision()
        self.recall = tf.keras.metrics.Recall()

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Update our metric by updating the two metrics it's based on
        class_preds = tf.math.argmax(y_pred, axis=1)
        self.precision.update_state(y_true, class_preds, sample_weight)
        self.recall.update_state(y_true, class_preds, sample_weight)

    def reset_state(self):
        self.precision.reset_state()
        self.recall.reset_state()

    def result(self):
        # To get the F1 result, we compute the harmonic mean of the current
        # precision and recall
        return 2 / ((1 / self.precision.result()) + (1 / self.recall.result())) 

In [None]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
opt = Adam(learning_rate=lr_scheduler)
model.compile(optimizer=opt, loss=loss, metrics=['accuracy', F1_metric()])
model.fit(
    tokenized_datasets['train'],
    np.array(raw_datasets['train']['label']),
    validation_data=(tokenized_datasets['validation'], np.array(raw_datasets['validation']['label'])),
    batch_size=8,
    epochs=3
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3

## **STEP FOUR**: Prediction and evaluation

Tasks done here:
- Pass validation dataset to model to get `logits`
- convert `logits` to `probabilities` and/or then to `class_predictions`

In [14]:
preds = model.predict(tokenized_datasets['validation'])['logits']

model.predict(tokenized_datasets['validation'])

TFSequenceClassifierOutput([('logits',
                             array([[-1.94257033e+00,  1.57106125e+00],
                                    [ 9.39822614e-01,  9.91308168e-02],
                                    [ 6.83423162e-01,  1.00472592e-01],
                                    [-1.11408901e+00,  1.00788760e+00],
                                    [ 9.71872568e-01,  1.00507990e-01],
                                    [-1.48447943e+00,  1.31783235e+00],
                                    [ 1.03325948e-01,  5.24246633e-01],
                                    [-1.70441830e+00,  1.40830112e+00],
                                    [-5.65391064e-01,  8.85207355e-01],
                                    [-1.81744504e+00,  1.46936810e+00],
                                    [-1.70544302e+00,  1.32569206e+00],
                                    [ 5.21274269e-01,  1.07376412e-01],
                                    [ 7.59158552e-01, -7.50378847e-01],
                         

In [15]:
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)

(408, 2) (408,)


In [16]:
from datasets import load_metric

metric = load_metric("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets['validation']['label'])

Downloading:   0%|          | 0.00/1.86k [00:00<?, ?B/s]

{'accuracy': 0.8480392156862745, 'f1': 0.8927335640138409}