# Previous example (flowchart)

1. Load the tokenizer ("bert-base-uncased" in this case)

2. Load the pretrained model from the same checkpoint

3. Create the batch (a dictionary of input_ids, attention_mask, and token_type_ids)

4. Compile the model

5. Convert the labels to tensors

In [3]:
pip install -q transformers

In [4]:
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for this movie my whole life.",
    "This course is amazing!",
]
batch = dict(tokenizer(sequences, padding=True, truncation=True, return_tensors="tf"))

# This is new
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
labels = tf.convert_to_tensor([1, 1])
model.train_on_batch(batch, labels)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.6931471824645996
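
The loss printed above is numerically equal to ln 2, which is the cross-entropy a two-class classifier gets when it assigns probability 0.5 to the correct class. The sketch below only reproduces that arithmetic; the uniform prediction is an assumption for illustration, not something read out of the model. Keep in mind that the Keras loss string expects probabilities by default while the model returns logits, so the value should be read with some care.

```python
import math

def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[label])

# Hypothetical prediction: both classes equally likely (an assumption, not model output)
uniform_prediction = [0.5, 0.5]
loss = cross_entropy(uniform_prediction, label=1)
print(loss)  # equals math.log(2)
```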

In [5]:
batch

{'attention_mask': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
 array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 3185, 2026, 2878,
         2166, 1012,  102],
        [ 101, 2023, 2607, 2003, 6429,  999,  102,    0,    0,    0,    0,
            0,    0,    0]], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
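
The zeros at the end of the second row come from `padding=True`: the shorter sequence is filled up to the length of the longest one, and the attention mask marks which positions are real tokens. A minimal, dependency-free sketch of that idea follows; `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's actual implementation. The token ids are copied from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded token ids of the two sentences, taken from the batch shown above
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 3185, 2026, 2878, 2166, 1012, 102]
second = [101, 2023, 2607, 2003, 6429, 999, 102]

padded = pad_batch([first, second])
print(padded["attention_mask"][1])
```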

# Preprocess using Datasets

## 1. Import raw dataset

In [1]:
pip install -q datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

In [2]:
from datasets import load_dataset
raw = load_dataset("glue", "mrpc")

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.

In [3]:
raw["test"].features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}

In [4]:
# The DatasetDict has "train", "validation", and "test" keys
raw.keys()

dict_keys(['train', 'validation', 'test'])

In [8]:
type(raw["train"]["label"])

list

In [9]:
raw["train"].features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}

In [10]:
raw["train"][0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
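
The integer in `label` is an index into the `names` list of the `ClassLabel` feature shown earlier, so `1` means the two sentences are paraphrases. A small sketch of that lookup, using a plain list as a stand-in for the feature (`label_to_name` is a hypothetical helper):

```python
# Class names in the order listed by the ClassLabel feature of GLUE MRPC
label_names = ["not_equivalent", "equivalent"]

def label_to_name(label):
    """Map an integer label to its class name (hypothetical helper)."""
    return label_names[label]

# Only the fields needed here, copied from raw["train"][0]
example = {"idx": 0, "label": 1}
name = label_to_name(example["label"])
print(name)

# With the datasets library loaded, the equivalent lookup is:
# raw["train"].features["label"].int2str(1)
```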

## 2. Split into train, validation, and test

In [11]:
raw_train = raw["train"]
raw_val = raw["validation"]
raw_test = raw["test"]

In [12]:
# Each column comes back as a plain list, which can be passed straight to the tokenizer
from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [15]:
#1. Extract each sentence column separately (these are still raw strings, not yet tokenized)
sentences_1 = raw_train["sentence1"]
sentences_2 = raw_train["sentence2"]

#2. PREVIOUS APPROACH: tokenize both columns as sentence pairs in a single call
# Disadvantage: it returns a plain dictionary (keys input_ids, attention_mask, and token_type_ids;
# values are lists of lists), so the whole tokenized dataset has to fit in memory
"""
tokenized_dataset = tokenizer(
    raw["train"]["sentence1"],
    raw["train"]["sentence2"],
    padding=True,
    truncation=True,
)
"""

#3. Use map for extra flexibility: the result stays a Dataset and all splits are processed at once
def tokenize_function(example):
    # With batched=True, example holds lists of sentences rather than single strings
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_entire = raw.map(tokenize_function, batched=True)

In [16]:
tokenized_entire

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
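
With `batched=True`, `map` hands the tokenize function chunks of rows instead of single rows; the default chunk size is 1000. The arithmetic below is a rough sketch of how many calls that implies for the row counts shown above, assuming the default `batch_size` was left unchanged.

```python
import math

DEFAULT_BATCH_SIZE = 1000  # default batch_size of Dataset.map when batched=True

# Row counts taken from the DatasetDict printed above
split_sizes = {"train": 3668, "validation": 408, "test": 1725}

# Number of times the tokenize function is called per split
batches_per_split = {
    name: math.ceil(rows / DEFAULT_BATCH_SIZE) for name, rows in split_sizes.items()
}
print(batches_per_split)
```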

Final preprocessing: drop the columns the model does not need and rename `label` to `labels`. At this point every example is still padded to a fixed length of 128.

In [27]:
def tokenize_function_2(e):
  return tokenizer(e["sentence1"],e["sentence2"],padding="max_length",truncation=True,max_length=128)

tokenized_entire = raw.map(tokenize_function_2, batched=True)
tokenized_entire = tokenized_entire.with_format("torch")
tokenized_entire = tokenized_entire.remove_columns(['idx',"sentence1","sentence2"])
tokenized_entire = tokenized_entire.rename_column("label","labels")
tokenized_entire

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

Because every example was padded to `max_length=128`, each batch has the same shape when we check the tensor sizes.

In [28]:
from torch.utils.data import DataLoader
train_dataloader = DataLoader(tokenized_entire["train"],batch_size=8,shuffle=True)


In [29]:
print(enumerate(train_dataloader))

<enumerate object at 0x7f6b400e41e0>


In [30]:
for step,batch in enumerate(train_dataloader):
  print(batch["input_ids"].shape)
  if step > 5:
    break

torch.Size([8, 128])
torch.Size([8, 128])
torch.Size([8, 128])
torch.Size([8, 128])
torch.Size([8, 128])
torch.Size([8, 128])
torch.Size([8, 128])


## 3. Add dynamic padding

In [17]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
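
Dynamic padding means each batch is padded only to the length of its own longest sequence, rather than to a fixed length such as 128. The sketch below contrasts the two strategies on a toy batch; `pad_to` and `dynamic_pad` are hypothetical helpers and the token ids are made up, so this only illustrates the idea behind the collator rather than its actual code.

```python
def pad_to(sequences, length, pad_id=0):
    """Pad every sequence to a fixed length."""
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]

def dynamic_pad(sequences, pad_id=0):
    """Pad every sequence to the longest sequence in this batch only."""
    return pad_to(sequences, max(len(seq) for seq in sequences), pad_id)

# Hypothetical token-id lists of different lengths
toy_batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]

fixed = pad_to(toy_batch, 128)    # what padding="max_length", max_length=128 produces
dynamic = dynamic_pad(toy_batch)  # what per-batch padding produces

print(len(fixed[0]), len(dynamic[0]))
```

On short sentences the dynamically padded batch is far narrower, which saves computation on padding tokens.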

In [None]:
def tokenize_function_3(e):
  # No padding here: the collator pads each batch on the fly
  return tokenizer(e["sentence1"],e["sentence2"],truncation=True)

tokenized_entire = raw.map(tokenize_function_3, batched=True)
tokenized_entire = tokenized_entire.with_format("torch")
tokenized_entire = tokenized_entire.remove_columns(['idx',"sentence1","sentence2"])
tokenized_entire = tokenized_entire.rename_column("label","labels")


In [37]:
collator = DataCollatorWithPadding(tokenizer) #tokenizer = AutoTokenizer.from_pretrained(checkpoint)

#collate_fn (callable, optional) – merges a list of samples to form a mini-batch of Tensor(s)
#Rebuild the dataloader so that it uses the unpadded dataset and the collator
train_dataloader = DataLoader(tokenized_entire["train"],batch_size=8,shuffle=True,collate_fn=collator)
for step,batch in enumerate(train_dataloader):
  print(batch["input_ids"].shape) #Sequence length now varies from batch to batch
  if step > 5: #Stop after a few batches
    break


**Example: building `tf.data.Dataset` objects with the collator**

In [38]:
tf_train_dataset = tokenized_entire["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_entire["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

In [39]:
tf_train_dataset

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(8, None), dtype=tf.int64, name=None), 'token_type_ids': TensorSpec(shape=(8, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(8, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(8,), dtype=tf.int64, name=None))>