<a href="https://colab.research.google.com/github/hamednasr/transformers/blob/main/fine_tuning_textual_entailment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers datasets

In [15]:
from datasets import load_dataset, load_metric
from transformers import (AutoTokenizer,
                          TrainingArguments,
                          Trainer,
                          AutoModelForSequenceClassification,
                          pipeline)
import pandas as pd

In [3]:
# RTE (Recognizing Textual Entailment) task from the GLUE benchmark
raw_dataset = load_dataset('glue', 'rte')

In [4]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

In [5]:
raw_dataset['validation'].data

MemoryMappedTable
sentence1: string
sentence2: string
label: int64
idx: int32
----
sentence1: [["Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.","Yet, we now are discovering that antibiotics are losing their effectiveness against illness. Disease-causing bacteria are mutating faster than we can come up with new antibiotics to fight the new variations.","Cairo is now home to some 15 million people - a burgeoning population that produces approximately 10,000 tonnes of rubbish per day, putting an enormous strain on public services. In the past 10 years, the government has tried hard to encourage private investment in the refuse sector, but some estimate 4,000 tonnes of waste is left behind every day, festering in the heat as it waits for someone to clear it up. It is often the people in the poorest neighbourhoods that are worst affected. But in some areas they are fighting back. In Shubra, one of the 

In [6]:
len(raw_dataset['validation']['sentence1']), len(raw_dataset['validation']['sentence2'])

(277, 277)

In [7]:
raw_dataset['validation']['sentence1'][11]

'In a bowl, whisk together the eggs and sugar until completely blended and frothy.'

In [8]:
raw_dataset['validation']['sentence2'][11], raw_dataset['validation']['label'][11]

('In a bowl, whisk together the egg, sugar and vanilla until light in color.',
 1)
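
To make the integer label readable: in GLUE RTE, label 0 stands for `entailment` and label 1 for `not_entailment`, so the pair at index 11 is marked as not entailed (the hypothesis adds vanilla, which the premise never mentions). A minimal sketch of that mapping follows; `id2label`, `label2id` and `describe` are illustrative names, not part of the notebook.

```python
# Label ids used by GLUE RTE: 0 = entailment, 1 = not_entailment.
# The names below are illustrative helpers, not part of the original notebook.
id2label = {0: 'entailment', 1: 'not_entailment'}
label2id = {name: idx for idx, name in id2label.items()}

def describe(premise, hypothesis, label):
    """Return a one-line, human-readable summary of a labelled sentence pair."""
    return f'{id2label[label]}: {premise!r} -> {hypothesis!r}'

print(describe('In a bowl, whisk together the eggs and sugar until completely blended and frothy.',
               'In a bowl, whisk together the egg, sugar and vanilla until light in color.',
               1))
```

With the dataset loaded, `raw_dataset['train'].features['label'].names` should list the same two names in the same order.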

In [9]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [10]:
def tokenize_func(batch):
    # Encode each (sentence1, sentence2) pair as one sequence, truncated to the model's max length
    return tokenizer(batch['sentence1'], batch['sentence2'], truncation=True)
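
Passing two texts to the tokenizer packs them into one sequence of the form `[CLS] sentence1 [SEP] sentence2 [SEP]`, and `truncation=True` trims the pair when it exceeds the length limit by removing tokens from the longer side first. The toy sketch below imitates that behaviour with whitespace-split words instead of WordPiece ids; `pack_pair` is a hypothetical helper, and the real tokenizer returns integer `input_ids` plus an `attention_mask`.

```python
# Toy illustration only: whitespace "tokens" stand in for WordPiece ids.
# pack_pair is a hypothetical helper, not a transformers function.
def pack_pair(tokens_a, tokens_b, max_length=None):
    """Build [CLS] a [SEP] b [SEP], trimming the longer side first if too long."""
    a, b = list(tokens_a), list(tokens_b)
    if max_length is not None:
        budget = max_length - 3  # room left after [CLS] and the two [SEP] tokens
        while len(a) + len(b) > budget and (a or b):
            # Drop one token from the end of whichever sequence is currently longer
            if len(a) > len(b):
                a.pop()
            else:
                b.pop()
    return ['[CLS]'] + a + ['[SEP]'] + b + ['[SEP]']

packed = pack_pair('the eggs and sugar'.split(),
                   'egg sugar vanilla'.split(),
                   max_length=8)
print(packed)
```

Since no padding is requested at this stage, each encoded example keeps its own length, and padding is left to batching time.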

In [11]:
tokenized_dataset = raw_dataset.map(tokenize_func, batched=True)

In [12]:
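training_args = TrainingArguments('my_trainer',                 # output directory for checkpoints and logs
                                  evaluation_strategy='epoch',  # evaluate at the end of every epoch
                                  save_strategy='epoch',        # save a checkpoint at the end of every epoch
                                  num_train_epochs=1)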
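
As a rough check on what one epoch means here, the number of optimizer steps can be worked out from the split size shown earlier. The sketch assumes the default per-device batch size of 8, a single device, and no gradient accumulation; none of these are set explicitly in the notebook, so the figures are estimates under those assumptions.

```python
import math

train_rows = 2490   # size of the RTE train split shown above
eval_rows = 277     # size of the validation split shown above
batch_size = 8      # assumed: the TrainingArguments default per-device batch size
num_devices = 1     # assumed: a single GPU or CPU
num_epochs = 1      # matches num_train_epochs above

# The last batch may be smaller than batch_size, hence the ceiling
train_steps = math.ceil(train_rows / (batch_size * num_devices)) * num_epochs
eval_batches = math.ceil(eval_rows / (batch_size * num_devices))
print(train_steps, eval_batches)
```

Under those assumptions, evaluation and checkpointing each happen once, after the final training step.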

In [13]:
# pip install accelerate -U

In [17]:
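# Note: load_metric is deprecated in newer releases of datasets;
# the separate evaluate package (evaluate.load('glue', 'rte')) replaces it.
metric = load_metric('glue', 'rte')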
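
The GLUE metric for RTE reports plain accuracy. A function of the following shape could turn the logits and labels produced during evaluation into that number; `compute_metrics` and the made-up arrays are illustrative, and the sketch computes accuracy directly with NumPy rather than through the `metric` object.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) into an accuracy dict, as the GLUE RTE metric reports."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # higher-scoring class for each row
    accuracy = float((predictions == np.asarray(labels)).mean())
    return {'accuracy': accuracy}

# Made-up logits for four examples, only to exercise the function
fake_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
fake_labels = np.array([0, 1, 0, 0])
print(compute_metrics((fake_logits, fake_labels)))
```

With the loaded metric, the body could instead end with `metric.compute(predictions=predictions, references=labels)`, and the function could be handed to `Trainer` through its `compute_metrics` argument.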

In [19]:
# num_labels=2 is the default; stated explicitly because RTE has two classes
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
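
The warning is expected: the checkpoint contains only the pretrained encoder, so the two layers of the classification head (`pre_classifier` and `classifier`) start from random values and are learned during fine-tuning. A back-of-the-envelope count of those new parameters, assuming a 768-to-768 `pre_classifier` and a 768-to-2 `classifier` (the usual layout of this head, not shown in full in the printout):

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

hidden = 768      # DistilBERT hidden size
num_labels = 2    # assumed: default head size, matching RTE's two classes

pre_classifier = linear_params(hidden, hidden)   # assumed shape: 768 -> 768
classifier = linear_params(hidden, num_labels)   # assumed shape: 768 -> 2
new_params = pre_classifier + classifier
print(pre_classifier, classifier, new_params)
```

This is a small fraction of the full model, but it is the part that maps the encoder output to the two RTE classes.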


In [20]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 