<a href="https://colab.research.google.com/github/hamednasr/transformers/blob/main/fine_tuning_textual_entailment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers datasets

In [2]:
from datasets import load_dataset  # load_metric is unused below and was removed from recent datasets releases
from transformers import (AutoTokenizer,
                          TrainingArguments,
                          Trainer,
                          AutoModelForSequenceClassification,
                          pipeline)
import pandas as pd
import numpy as np

In [3]:
raw_dataset = load_dataset('glue','rte')

In [4]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

In [5]:
raw_dataset['validation'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['entailment', 'not_entailment'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [6]:
raw_dataset['validation'].data

MemoryMappedTable
sentence1: string
sentence2: string
label: int64
idx: int32
----
sentence1: [["Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.","Yet, we now are discovering that antibiotics are losing their effectiveness against illness. Disease-causing bacteria are mutating faster than we can come up with new antibiotics to fight the new variations.","Cairo is now home to some 15 million people - a burgeoning population that produces approximately 10,000 tonnes of rubbish per day, putting an enormous strain on public services. In the past 10 years, the government has tried hard to encourage private investment in the refuse sector, but some estimate 4,000 tonnes of waste is left behind every day, festering in the heat as it waits for someone to clear it up. It is often the people in the poorest neighbourhoods that are worst affected. But in some areas they are fighting back. In Shubra, one of the 

In [7]:
len(raw_dataset['validation']['sentence1']), len(raw_dataset['validation']['sentence2'])

(277, 277)

In [8]:
raw_dataset['validation']['sentence1'][11]

'In a bowl, whisk together the eggs and sugar until completely blended and frothy.'

In [9]:
raw_dataset['validation']['sentence2'][11],  raw_dataset['validation']['label'][11]

('In a bowl, whisk together the egg, sugar and vanilla until light in color.',
 1)
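The label printed above is an integer index into the `ClassLabel` names shown earlier, so `1` for this pair means `not_entailment`. A minimal sketch of that lookup, with the names list copied by hand from the features output (the helper `label_to_name` is hypothetical and not part of `datasets`):

```python
# Names copied from the ClassLabel feature printed earlier in this notebook.
LABEL_NAMES = ['entailment', 'not_entailment']

def label_to_name(label_id):
    """Hypothetical helper: map an integer RTE label to its string name."""
    return LABEL_NAMES[label_id]

# Label of validation example 11, as printed above.
print(label_to_name(1))
```

With the real dataset loaded, `raw_dataset['validation'].features['label'].int2str(1)` should give the same answer.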

In [26]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [27]:
tokenizer(raw_dataset['validation']['sentence1'][11],
          raw_dataset['validation']['sentence2'][11])

{'input_ids': [101, 1999, 1037, 4605, 1010, 1059, 24158, 2243, 2362, 1996, 6763, 1998, 5699, 2127, 3294, 19803, 1998, 10424, 14573, 2100, 1012, 102, 1999, 1037, 4605, 1010, 1059, 24158, 2243, 2362, 1996, 8288, 1010, 5699, 1998, 21161, 2127, 2422, 1999, 3609, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
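In the encoding above the two sentences are packed into one sequence: `101` is `[CLS]`, each `102` is a `[SEP]`, and `token_type_ids` marks which sentence a token belongs to. The sketch below rebuilds `token_type_ids` and `attention_mask` from the printed `input_ids` alone, only to make that layout explicit; the variable names are illustrative and this is not how the tokenizer computes them internally.

```python
# input_ids copied from the tokenizer output above.
input_ids = [101, 1999, 1037, 4605, 1010, 1059, 24158, 2243, 2362, 1996, 6763,
             1998, 5699, 2127, 3294, 19803, 1998, 10424, 14573, 2100, 1012, 102,
             1999, 1037, 4605, 1010, 1059, 24158, 2243, 2362, 1996, 8288, 1010,
             5699, 1998, 21161, 2127, 2422, 1999, 3609, 1012, 102]

SEP_ID = 102  # [SEP] in the bert-base-uncased vocabulary

# Everything up to and including the first [SEP] is segment 0; the rest is segment 1.
first_sep = input_ids.index(SEP_ID)
token_type_ids = [0] * (first_sep + 1) + [1] * (len(input_ids) - first_sep - 1)

# No padding was applied to this single example, so every position is attended to.
attention_mask = [1] * len(input_ids)

print(len(input_ids), token_type_ids.count(0), token_type_ids.count(1))
```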

In [28]:
def tokenize_func(batch):
    # Encode each sentence1/sentence2 pair as one sequence; padding is left to batch time
    return tokenizer(batch['sentence1'], batch['sentence2'], truncation=True)

In [29]:
tokenized_datasets = raw_dataset.map(tokenize_func, batched=True)

In [44]:
tokenized_datasets['test']

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3000
})
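The tokenized rows above were truncated but not padded, so they have different lengths. As far as I know, when a tokenizer is handed to `Trainer` and no `data_collator` is given, batches are padded at collation time to the longest sequence in each batch. A dependency-free sketch of that idea follows; `pad_batch` and the toy batch are hypothetical, and the real collator also pads `token_type_ids` and returns tensors.

```python
PAD_ID = 0  # [PAD] is id 0 for bert-base-uncased (matches padding_idx=0 in the model printout)

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Hypothetical helper: right-pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Toy batch of two already-tokenized sequences of different lengths.
batch = [[101, 1999, 102, 1037, 102],
         [101, 1999, 1037, 4605, 102, 1059, 2243, 102]]
padded, masks = pad_batch(batch)
print(padded[0])
print(masks[0])
```

Padding per batch rather than to `model_max_length` keeps short batches short, which tends to matter on a dataset like this where sentence lengths vary a lot.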

In [14]:
training_args = TrainingArguments('my_trainer',
                                  evaluation_strategy = 'epoch',
                                  save_strategy = 'epoch',
                                  num_train_epochs=2,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=64,
                                  logging_steps=150)
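A quick sanity check on these arguments: with 2,490 training examples, a batch size of 16 and 2 epochs, the number of optimizer steps can be worked out by hand. The sketch assumes a single device and no gradient accumulation; under those assumptions it agrees with the `global_step=312` reported by `trainer.train()` further down, and `logging_steps=150` gives two logged training-loss values.

```python
import math

num_train_examples = 2490   # size of the train split in the DatasetDict above
batch_size = 16             # per_device_train_batch_size
num_epochs = 2              # num_train_epochs
logging_steps = 150

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss gets logged.
log_points = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, log_points)
```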

In [15]:
# pip install accelerate -U

In [16]:
# metric = load_metric('glue','rte')

In [17]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [19]:
!pip -q install torchinfo

In [20]:
from torchinfo import summary

In [21]:
summary(model)

Layer (type:depth-idx)                                  Param #
BertForSequenceClassification                           --
├─BertModel: 1-1                                        --
│    └─BertEmbeddings: 2-1                              --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─Embedding: 3-3                              1,536
│    │    └─LayerNorm: 3-4                              1,536
│    │    └─Dropout: 3-5                                --
│    └─BertEncoder: 2-2                                 --
│    │    └─ModuleList: 3-6                             85,054,464
│    └─BertPooler: 2-3                                  --
│    │    └─Linear: 3-7                                 590,592
│    │    └─Tanh: 3-8                                   --
├─Dropout: 1-2                                          --
├─Linear: 1-3                                           1,538
Total params: 10
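The `summary` output above appears to be cut off at the `Total params` line. Summing the per-layer counts that are shown recovers the total; the dictionary keys below are labels chosen here for the rows, matched against the layers in the model printout.

```python
# Parameter counts copied from the torchinfo rows above.
param_counts = {
    'word_embeddings': 23_440_896,
    'position_embeddings': 393_216,
    'token_type_embeddings': 1_536,
    'embeddings_layernorm': 1_536,
    'encoder_layers': 85_054_464,
    'pooler_dense': 590_592,
    'classifier': 1_538,
}

total = sum(param_counts.values())
print(f'{total:,}')

# Two rows can be cross-checked against the shapes in the model printout.
assert param_counts['word_embeddings'] == 30522 * 768
assert param_counts['classifier'] == 768 * 2 + 2  # weights plus biases for 2 labels
```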

In [22]:
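# `metric` was never defined (the load_metric cell above is commented out), so this
# call raised NameError; the compute_metrics function below is used instead.
# metric.compute(predictions=[1,1,1], references=[1,0,0])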

In [23]:
from sklearn.metrics import f1_score

def compute_metrics(logits_labels):
  logits, labels = logits_labels
  predictions = np.argmax(logits, axis = -1)
  accuracy = np.mean(predictions == labels)
  f1 = f1_score(labels, predictions)
  return {'accuracy':accuracy, 'f1-score':f1}
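To see what `compute_metrics` returns, here is a dependency-free sketch of the same two numbers on a toy input. It reuses the predictions `[1, 1, 1]` and references `[1, 0, 0]` from the failed `metric.compute` cell; the logits are made up so that their argmax gives those predictions. Note that `f1_score` treats class `1` as the positive class by default, which for RTE is `not_entailment`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_and_f1(logits, labels, positive=1):
    """Hand-rolled accuracy and binary F1, mirroring compute_metrics above."""
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    accuracy = sum(p == y for p, y in pairs) / len(labels)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {'accuracy': accuracy, 'f1-score': f1}

# Made-up logits whose argmax is [1, 1, 1]; labels are [1, 0, 0].
logits = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
labels = [1, 0, 0]
result = accuracy_and_f1(logits, labels)
print(result)
```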

In [30]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['validation'],
    compute_metrics = compute_metrics,
    tokenizer = tokenizer
)

In [31]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1-score
1,0.653,0.617405,0.68231,0.62069
2,0.452,0.707528,0.66065,0.576577



TrainOutput(global_step=312, training_loss=0.5459301104912391, metrics={'train_runtime': 158.5848, 'train_samples_per_second': 31.403, 'train_steps_per_second': 1.967, 'total_flos': 418221189160080.0, 'train_loss': 0.5459301104912391, 'epoch': 2.0})
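In the results table, training loss falls from epoch 1 to epoch 2 while validation loss rises and both accuracy and F1 drop, a pattern that often points to overfitting on a small training set. The sketch below just picks the better epoch from the logged numbers. To keep that checkpoint automatically, `TrainingArguments` offers `load_best_model_at_end` and `metric_for_best_model`, which this notebook does not set.

```python
# Values copied from the training results table above.
eval_log = [
    {'epoch': 1, 'train_loss': 0.653, 'eval_loss': 0.617405,
     'accuracy': 0.68231, 'f1': 0.62069},
    {'epoch': 2, 'train_loss': 0.452, 'eval_loss': 0.707528,
     'accuracy': 0.66065, 'f1': 0.576577},
]

# Choose the epoch with the highest validation accuracy.
best = max(eval_log, key=lambda row: row['accuracy'])

# Training loss went down while validation loss went up between the two epochs.
train_loss_fell = eval_log[1]['train_loss'] < eval_log[0]['train_loss']
eval_loss_rose = eval_log[1]['eval_loss'] > eval_log[0]['eval_loss']

print(best['epoch'], train_loss_fell and eval_loss_rose)
```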

In [52]:
tokenized_datasets['test'].data['sentence1']

<pyarrow.lib.ChunkedArray object at 0x7e0eb087b420>
[
  [
    "Mangla was summoned after Madhumita's sister Nidhi Shukla, who was the first witness in the case.",
    "Authorities in Brazil say that more than 200 people are being held hostage in a prison in the country's remote, Amazonian-jungle state of Rondonia.",
    "A mercenary group faithful to the warmongering policy of former Somozist colonel Enrique Bermudez attacked an IFA truck belonging to the interior ministry at 0900 on 26 March in El Jicote, wounded and killed an interior ministry worker and wounded five others.",
    "The British ambassador to Egypt, Derek Plumbly, told Reuters on Monday that authorities had compiled the list of 10 based on lists from tour companies and from families whose relatives have not been in contact since the bombings.",
    "Tibone estimated diamond production at four mines operated by Debswana -- Botswana's 50-50 joint venture with De Beers -- could reach 33 million carats this year.",
    ...

In [None]:
# trainer.evaluate() takes a tokenized Dataset, not raw text columns.
# GLUE test labels are hidden (stored as -1), so metrics are computed on validation.
trainer.evaluate(tokenized_datasets['validation'])

In [70]:
trainer.save_model('my_model')

In [75]:
# newmodel = pipeline('text-classification', model = 'my_model', device=0)

In [None]:
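# RTE is a sentence-pair task, so pass both sentences together as text/text_pair:
# newmodel({'text': raw_dataset['validation']['sentence1'][11], 'text_pair': raw_dataset['validation']['sentence2'][11]})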
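Because the model was created with only `num_labels=2`, its saved config should carry the default label names `LABEL_0` and `LABEL_1`, so a pipeline built from `my_model` would likely report those rather than the RTE class names. A minimal sketch of mapping them back follows; the helper `readable` and the example output dict are hypothetical, with made-up values.

```python
# Names copied from the ClassLabel feature printed earlier in this notebook.
LABEL_NAMES = ['entailment', 'not_entailment']

def readable(pipeline_output):
    """Hypothetical helper: turn a 'LABEL_<i>' result into the RTE class name."""
    idx = int(pipeline_output['label'].split('_')[-1])
    return {'label': LABEL_NAMES[idx], 'score': pipeline_output['score']}

# Made-up pipeline result, for illustration only.
example_output = {'label': 'LABEL_1', 'score': 0.9}
print(readable(example_output))
```

An alternative is to pass `id2label` and `label2id` when calling `from_pretrained`, so the names are stored in the model config from the start.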