This is a basic example to explain the concepts.

In [1]:
from transformers import BertModel, BertTokenizer
import torch


Let's download and load the `bert-base-uncased` model.

In [2]:
model = BertModel.from_pretrained("bert-base-uncased",  output_hidden_states=True)

Download and load the tokenizer that was used to pre-train the `bert-base-uncased` model.

In [3]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Preprocessing the input 

In [4]:
example = "i love kathmandu"

In [5]:
tokens = tokenizer.tokenize(example)
tokens

['i', 'love', 'kathmandu']

Let's add the special tokens: `[CLS]` at the start and `[SEP]` at the end.

In [6]:
tokens = ["[CLS]"] + tokens + ["[SEP]"]
tokens

['[CLS]', 'i', 'love', 'kathmandu', '[SEP]']

Suppose the required sequence length is 7, but we only have 5 actual tokens. We need to add two `[PAD]` tokens to fill the remaining positions.

In [7]:
tokens = tokens + ["[PAD]", "[PAD]"]
tokens

['[CLS]', 'i', 'love', 'kathmandu', '[SEP]', '[PAD]', '[PAD]']

Now create an attention mask: 1 for each actual token and 0 for each padding token.

In [8]:
attn_mask = [0 if token == "[PAD]" else 1 for token in tokens]
attn_mask

[1, 1, 1, 1, 1, 0, 0]
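
The padding and masking steps above can be combined into one small helper. This is a minimal sketch in plain Python; the function name `pad_and_mask` and its parameters are assumptions made for illustration and are not part of the `transformers` API.

```python
def pad_and_mask(tokens, max_length, pad_token="[PAD]"):
    """Pad `tokens` to `max_length` and build the matching attention mask."""
    if len(tokens) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(tokens)
    # Append the padding tokens after the real tokens.
    padded = tokens + [pad_token] * n_pad
    # 1 marks a real token, 0 marks a padding position.
    mask = [1] * len(tokens) + [0] * n_pad
    return padded, mask


padded, mask = pad_and_mask(["[CLS]", "i", "love", "kathmandu", "[SEP]"], max_length=7)
print(padded)
print(mask)
```

In practice the tokenizer does this for us, as shown later in this notebook, but writing it by hand makes the relationship between padding and the mask explicit.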

Let's convert the tokens to their IDs.

In [9]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[101, 1045, 2293, 28045, 102, 0, 0]

In [10]:
token_ids = torch.tensor(token_ids).unsqueeze(0)
attn_mask = torch.tensor(attn_mask).unsqueeze(0)
print(f"token IDs: {token_ids.shape}, attention mask: {attn_mask.shape}")

token IDs: torch.Size([1, 7]), attention mask: torch.Size([1, 7])


The Embedding

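`last_hidden_state`: the representation of all the tokens obtained from the final encoder layer (layer 12).

`pooler_output`: the representation of the `[CLS]` token from the final encoder layer, passed through a linear layer and a tanh activation.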

In [11]:
output = model(token_ids, attention_mask= attn_mask)

In [12]:
last_hidden_state= output.last_hidden_state
pooler_output = output.pooler_output
hidden_states = output.hidden_states

In [13]:
print(f"last hidden:{last_hidden_state.shape} hidden layers:{len(hidden_states)}, CLS: {pooler_output.shape}")

last hidden:torch.Size([1, 7, 768]) hidden layers:13, CLS: torch.Size([1, 768])


The size `[1,7,768]` indicates `[batch_size, sequence_length, hidden_size]`. The size `[1,768]` indicates `[batch_size, hidden_size]`.

In [14]:
# Index 0 is [CLS], index 1 is 'i', index 2 is 'love'.
print(f"last layer, representation of 'i'\n{last_hidden_state[0][1]}")

last layer, representation of 'i'
tensor([ 3.1862e-01,  5.0867e-01,  1.4940e-01, -2.8184e-01,  1.0760e-02,
         4.1292e-01, -1.1322e-01,  1.3862e+00, -2.2813e-01, -4.7011e-01,
        -2.5818e-01, -6.8372e-01,  1.7890e-01,  5.9333e-01,  2.3110e-01,
         6.0013e-01,  6.1079e-01,  1.1835e-01,  3.0312e-01,  4.7956e-01,
        -9.0474e-02, -4.7122e-01, -1.1157e+00,  2.8013e-01,  5.1831e-01,
         3.5709e-01, -4.4087e-01, -1.6675e-01,  2.4361e-01, -1.4690e-01,
         2.2097e-01, -1.4364e-01, -1.5772e-01,  2.3723e-01, -7.6925e-01,
        -1.7788e-01, -8.3212e-02, -5.7498e-02, -6.6200e-01, -8.1028e-02,
         6.1719e-02, -7.3170e-01, -1.1594e-01, -6.9784e-01, -2.7962e-02,
        -9.3233e-01, -7.4641e-02,  3.9156e-02,  6.4064e-01, -1.0957e+00,
         1.2172e-02,  1.2453e-01,  2.2704e-01,  3.5148e-01, -2.6128e-01,
         1.3611e+00, -3.4164e-02, -6.2567e-01, -1.7320e-01,  4.1086e-01,
        -2.2863e-01,  2.2597e-01,  8.0072e-01, -8.1334e-01, -4.6740e-01,
         8.928

`pooler_output` is a summary representation of the entire input sequence. Specifically, it is the output of the "pooler" layer, which takes the final hidden state of the `[CLS]` token and applies a fully connected layer followed by a tanh activation.
In general, it holds the aggregate representation of the sentence, so we can
use `pooler_output` as the representation of the sentence `i love kathmandu`.
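
To make the pooler computation concrete, here is a minimal sketch in plain Python of a dense layer followed by tanh. The weights, bias, and the hidden size of 3 are toy values chosen for illustration (the real model uses learned weights and a hidden size of 768), and the function name `pooler` is an assumption.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Apply a dense layer followed by tanh to the [CLS] hidden state."""
    out = []
    for row, b in zip(weight, bias):
        # Dot product of one weight row with the hidden state, plus bias.
        z = sum(w * h for w, h in zip(row, cls_hidden)) + b
        out.append(math.tanh(z))
    return out


# Toy values (hypothetical): hidden size 3 instead of 768.
cls_hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity matrix
bias = [0.0, 0.0, 0.0]

pooled = pooler(cls_hidden, weight, bias)
print(pooled)
```

Because of the tanh, every component lies between -1 and 1, which is why the `pooler_output` values are all in that range.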

In [15]:
print(f"pooler output, i.e. the representation of 'i love kathmandu':\n{pooler_output}")

pooler output, i.e. the representation of 'i love kathmandu':
tensor([[-0.8357, -0.1822,  0.4747,  0.6436, -0.2796, -0.1092,  0.8457,  0.1583,
          0.1907, -0.9996,  0.1853,  0.1747,  0.9669, -0.1852,  0.9110, -0.4437,
         -0.0987, -0.4950,  0.3837, -0.7624,  0.5009,  0.7871,  0.5610,  0.1600,
          0.3615,  0.3753, -0.4926,  0.8958,  0.9266,  0.6399, -0.6181,  0.0950,
         -0.9717, -0.1272,  0.3411, -0.9715,  0.1475, -0.7362,  0.0306,  0.0822,
         -0.8589,  0.2027,  0.9935, -0.0741, -0.0652, -0.2863, -0.9996,  0.1983,
         -0.8285, -0.3280, -0.2946, -0.5097,  0.1277,  0.3611,  0.3224,  0.2838,
         -0.1187,  0.1056, -0.0781, -0.4620, -0.4998,  0.2000,  0.0673, -0.8527,
         -0.2595, -0.5242, -0.0076, -0.1295, -0.0038, -0.0472,  0.7797,  0.1425,
          0.3412, -0.7356, -0.3774,  0.1124, -0.3790,  1.0000, -0.3603, -0.9575,
         -0.4016, -0.3286,  0.2985,  0.5956, -0.4848, -1.0000,  0.2178, -0.0367,
         -0.9782,  0.1734,  0.2324, -0.1238, -0.6653

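These embeddings are taken from the topmost encoder layer of BERT, i.e. layer 12. Can we extract the embeddings from all the encoder layers? Yes: because we loaded the model with `output_hidden_states=True`, the output also contains `hidden_states`.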

In [16]:
output.hidden_states

(tensor([[[ 1.6855e-01, -2.8577e-01, -3.2613e-01,  ..., -2.7571e-02,
            3.8253e-02,  1.6400e-01],
          [-3.4027e-04,  5.3974e-01, -2.8805e-01,  ...,  7.5731e-01,
            8.9008e-01,  1.6575e-01],
          [ 1.1558e+00,  8.5331e-02, -1.1208e-01,  ...,  4.3965e-01,
            8.5903e-01, -3.2685e-01],
          ...,
          [-3.6430e-01, -1.6172e-01,  9.0174e-02,  ..., -1.7849e-01,
            1.2818e-01, -4.5116e-02],
          [ 1.6776e-01, -8.9038e-01, -3.1798e-01,  ...,  3.1737e-02,
            6.4863e-02,  1.8418e-01],
          [ 3.5675e-01, -8.7076e-01, -3.7621e-01,  ..., -7.9391e-02,
           -1.0115e-01,  1.9312e-01]]], grad_fn=<NativeLayerNormBackward0>),
 tensor([[[ 0.1616,  0.0459, -0.0789,  ..., -0.0204,  0.1398,  0.0676],
          [ 0.7336,  0.9462, -0.2094,  ...,  0.4195,  0.7503, -0.0092],
          [ 1.4193,  0.6091,  0.3613,  ...,  0.4949,  0.7861, -0.0039],
          ...,
          [-0.0169,  0.2284,  0.0774,  ..., -0.2815,  0.5997,  0.1019],
 

In [17]:
for i, layer_hidden_states in enumerate(hidden_states):
    print(f"Layer {i + 1} hidden states shape: {layer_hidden_states.shape}")

Layer 1 hidden states shape: torch.Size([1, 7, 768])
Layer 2 hidden states shape: torch.Size([1, 7, 768])
Layer 3 hidden states shape: torch.Size([1, 7, 768])
Layer 4 hidden states shape: torch.Size([1, 7, 768])
Layer 5 hidden states shape: torch.Size([1, 7, 768])
Layer 6 hidden states shape: torch.Size([1, 7, 768])
Layer 7 hidden states shape: torch.Size([1, 7, 768])
Layer 8 hidden states shape: torch.Size([1, 7, 768])
Layer 9 hidden states shape: torch.Size([1, 7, 768])
Layer 10 hidden states shape: torch.Size([1, 7, 768])
Layer 11 hidden states shape: torch.Size([1, 7, 768])
Layer 12 hidden states shape: torch.Size([1, 7, 768])
Layer 13 hidden states shape: torch.Size([1, 7, 768])
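
Note that the tuple has 13 entries even though `bert-base-uncased` has 12 encoder layers. In the Hugging Face implementation, the first entry is the output of the embedding layer and the remaining 12 are the outputs of the encoder layers, so "Layer 1" in the printout above is the embedding output and "Layer 13" is encoder layer 12. The sketch below labels each index accordingly; the helper name `describe_hidden_states` is an assumption made for illustration.

```python
def describe_hidden_states(num_entries):
    """Label each index of the `hidden_states` tuple."""
    labels = []
    for i in range(num_entries):
        if i == 0:
            # Index 0 holds the embedding layer output, before any encoder layer.
            labels.append("embedding layer output")
        else:
            labels.append(f"encoder layer {i} output")
    return labels


labels = describe_hidden_states(13)
for i, label in enumerate(labels):
    print(f"hidden_states[{i}]: {label}")
```

With this indexing, `hidden_states[-1]` corresponds to `last_hidden_state`.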


Fine-tuning BERT for a downstream task: text classification

During fine-tuning, we can adjust the weights of the model in the following two ways.
- Update the weights of the pre-trained BERT model along with the classification layer.
- Update only the weights of the classification layer, not the pre-trained BERT model. In this case the model acts as a feature extractor.
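
The second option (feature extractor) can be sketched by freezing the encoder parameters. To keep the example runnable without downloading a checkpoint, it builds a tiny, randomly initialized model from a `BertConfig` with made-up sizes; with the real model you would apply the same loop to a `BertForSequenceClassification` loaded via `from_pretrained`.

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialized configuration (hypothetical sizes) so the sketch
# runs without downloading a checkpoint.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
model = BertForSequenceClassification(config)

# Freeze the BERT encoder so only the classification layer is updated.
for param in model.bert.parameters():
    param.requires_grad = False

# Collect the names of the parameters that will still receive gradients.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

The first option needs no extra code: all parameters are trainable by default.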

In [18]:
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np


In [19]:
ds = load_dataset("imdb")
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [20]:
train = ds["train"]
test = ds["test"]
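# Note: the "unsupervised" split is unlabeled (its labels are -1),
# so it cannot serve as a real validation set.
val = ds["unsupervised"]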

print(f"Train: {len(train)}, Test: {len(test)}, Validation: {len(val)}")

Train: 25000, Test: 25000, Validation: 50000


Let's initialize the pre-trained BERT model with a classification head.

In [21]:
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [23]:
input_ids = tokenizer.convert_tokens_to_ids(tokens[:-2])
input_ids

[101, 1045, 2293, 28045, 102]

Segment IDs (token type IDs):

Suppose we have two sentences in the input. In that case, segment IDs are used to distinguish one sentence from the other. All the tokens from the first sentence will be mapped to 0 and all the tokens from the second sentence will be mapped to 1. Since we have only one sentence here, all the tokens will be mapped to 0, as shown below.

In [24]:
token_type_ids = np.zeros(5)
token_type_ids

array([0., 0., 0., 0., 0.])
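
For the two-sentence case described above, the segment IDs could be built as in the following sketch. It assumes the usual BERT layout `[CLS] A [SEP] B [SEP]`, where `[CLS]`, sentence A, and the first `[SEP]` belong to segment 0, and sentence B with its trailing `[SEP]` belongs to segment 1. The helper name `segment_ids` is an assumption made for illustration.

```python
def segment_ids(tokens_a, tokens_b=None):
    """Build BERT-style token type IDs for one sentence or a sentence pair."""
    # [CLS] + sentence A + [SEP] all belong to segment 0.
    ids = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        # Sentence B and its trailing [SEP] belong to segment 1.
        ids += [1] * (len(tokens_b) + 1)
    return ids


single = segment_ids(["i", "love", "kathmandu"])
pair = segment_ids(["i", "love", "kathmandu"], ["beautiful", "nepal"])
print(single)
print(pair)
```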

Attention mask: 1 for every actual token and 0 for `[PAD]` tokens.

In [25]:
attention_mask = np.ones(5)
attention_mask

array([1., 1., 1., 1., 1.])

In [26]:
tokenizer(example)

{'input_ids': [101, 1045, 2293, 28045, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

Let's pass three sentences, set the maximum sequence length to 5, and enable padding.

In [27]:
tokenizer(["I love Kathmandu", "Beautiful Nepal", "Mount Everest"], padding=True, max_length=5)



{'input_ids': [[101, 1045, 2293, 28045, 102], [101, 3376, 8222, 102, 0], [101, 4057, 23914, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}

With all this in mind, let's create a preprocessing function.

In [28]:
def preprocess(data):
    return tokenizer(data["text"], padding=True, truncation=True)

Let's preprocess each dataset split using `map`.

In [29]:
train = train.map(preprocess, batched=True, batch_size=len(train))
test = test.map(preprocess, batched=True, batch_size=len(test))
val = val.map(preprocess, batched=True, batch_size=len(val))

Set the dataset format to PyTorch tensors.

In [30]:
train.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val.set_format("torch", columns=["input_ids", "attention_mask", "label"])

Training the model

In [36]:
batch_size = 4
epochs = 3

warmup_steps = 500
weight_decay = 0.01

In [37]:
training_args = TrainingArguments(
    num_train_epochs=epochs,
    output_dir="../../results",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    logging_dir="../../logs"
)

In [38]:
trainer = Trainer(
    model=classifier,
    args=training_args,
    train_dataset=train,
    eval_dataset=test
)

In [40]:
trainer.train()

  0%|          | 0/18750 [00:04<?, ?it/s]
  3%|▎         | 500/18750 [01:50<1:07:59,  4.47it/s]

{'loss': 0.5396, 'grad_norm': 57.49171447753906, 'learning_rate': 5e-05, 'epoch': 0.08}


  5%|▌         | 1000/18750 [03:40<1:03:58,  4.62it/s]

{'loss': 0.5416, 'grad_norm': 9.111489295959473, 'learning_rate': 4.863013698630137e-05, 'epoch': 0.16}


  8%|▊         | 1500/18750 [05:30<1:03:36,  4.52it/s]

{'loss': 0.4923, 'grad_norm': 0.3694137632846832, 'learning_rate': 4.726027397260274e-05, 'epoch': 0.24}


 11%|█         | 2000/18750 [07:16<56:13,  4.97it/s]  

{'loss': 0.4673, 'grad_norm': 25.703596115112305, 'learning_rate': 4.589041095890411e-05, 'epoch': 0.32}


 13%|█▎        | 2500/18750 [08:58<54:22,  4.98it/s]  

{'loss': 0.4554, 'grad_norm': 8.654304504394531, 'learning_rate': 4.452054794520548e-05, 'epoch': 0.4}


 16%|█▌        | 3000/18750 [10:39<52:57,  4.96it/s]  

{'loss': 0.4689, 'grad_norm': 17.73918342590332, 'learning_rate': 4.3150684931506855e-05, 'epoch': 0.48}


 19%|█▊        | 3500/18750 [12:21<51:02,  4.98it/s]  

{'loss': 0.4482, 'grad_norm': 0.19607552886009216, 'learning_rate': 4.1780821917808224e-05, 'epoch': 0.56}


 21%|██▏       | 4000/18750 [14:02<49:28,  4.97it/s]  

{'loss': 0.4312, 'grad_norm': 9.35656452178955, 'learning_rate': 4.041095890410959e-05, 'epoch': 0.64}


 24%|██▍       | 4500/18750 [15:44<47:57,  4.95it/s]  

{'loss': 0.4376, 'grad_norm': 1.4899685382843018, 'learning_rate': 3.904109589041096e-05, 'epoch': 0.72}


 27%|██▋       | 5000/18750 [17:25<46:03,  4.98it/s]  

{'loss': 0.4632, 'grad_norm': 0.2304084300994873, 'learning_rate': 3.767123287671233e-05, 'epoch': 0.8}


 29%|██▉       | 5500/18750 [19:07<44:23,  4.97it/s]  

{'loss': 0.3996, 'grad_norm': 25.21976089477539, 'learning_rate': 3.63013698630137e-05, 'epoch': 0.88}


 32%|███▏      | 6000/18750 [20:51<46:02,  4.62it/s]  

{'loss': 0.4196, 'grad_norm': 10.193493843078613, 'learning_rate': 3.493150684931507e-05, 'epoch': 0.96}


 35%|███▍      | 6500/18750 [22:42<44:30,  4.59it/s]  

{'loss': 0.3272, 'grad_norm': 0.05335897579789162, 'learning_rate': 3.356164383561644e-05, 'epoch': 1.04}


 37%|███▋      | 7000/18750 [24:32<42:49,  4.57it/s]  

{'loss': 0.2588, 'grad_norm': 0.7967578768730164, 'learning_rate': 3.219178082191781e-05, 'epoch': 1.12}


 40%|████      | 7500/18750 [26:22<41:17,  4.54it/s]  

{'loss': 0.2978, 'grad_norm': 48.885040283203125, 'learning_rate': 3.082191780821918e-05, 'epoch': 1.2}


 43%|████▎     | 8000/18750 [28:13<38:53,  4.61it/s]  

{'loss': 0.2567, 'grad_norm': 0.3560728430747986, 'learning_rate': 2.945205479452055e-05, 'epoch': 1.28}


 45%|████▌     | 8500/18750 [29:57<34:15,  4.99it/s]  

{'loss': 0.2406, 'grad_norm': 0.06602785736322403, 'learning_rate': 2.808219178082192e-05, 'epoch': 1.36}


 48%|████▊     | 9000/18750 [31:38<32:40,  4.97it/s]  

{'loss': 0.2412, 'grad_norm': 29.276275634765625, 'learning_rate': 2.671232876712329e-05, 'epoch': 1.44}


 51%|█████     | 9500/18750 [33:20<30:54,  4.99it/s]  

{'loss': 0.2416, 'grad_norm': 0.06330031901597977, 'learning_rate': 2.534246575342466e-05, 'epoch': 1.52}


 53%|█████▎    | 10000/18750 [35:01<29:14,  4.99it/s] 

{'loss': 0.2592, 'grad_norm': 0.05070902034640312, 'learning_rate': 2.3972602739726026e-05, 'epoch': 1.6}


 56%|█████▌    | 10500/18750 [36:42<27:36,  4.98it/s]  

{'loss': 0.2729, 'grad_norm': 81.3860855102539, 'learning_rate': 2.2602739726027396e-05, 'epoch': 1.68}


 59%|█████▊    | 11000/18750 [38:24<25:58,  4.97it/s]  

{'loss': 0.2288, 'grad_norm': 0.10501725971698761, 'learning_rate': 2.1232876712328768e-05, 'epoch': 1.76}


 61%|██████▏   | 11500/18750 [40:05<24:20,  4.97it/s]  

{'loss': 0.2385, 'grad_norm': 0.019870679825544357, 'learning_rate': 1.9863013698630137e-05, 'epoch': 1.84}


 64%|██████▍   | 12000/18750 [41:47<22:35,  4.98it/s]

{'loss': 0.272, 'grad_norm': 0.07312991470098495, 'learning_rate': 1.8493150684931506e-05, 'epoch': 1.92}


 67%|██████▋   | 12500/18750 [43:28<21:03,  4.95it/s]

{'loss': 0.2471, 'grad_norm': 0.13753291964530945, 'learning_rate': 1.7123287671232875e-05, 'epoch': 2.0}


 69%|██████▉   | 13000/18750 [45:10<19:16,  4.97it/s]

{'loss': 0.084, 'grad_norm': 1083.43701171875, 'learning_rate': 1.5753424657534248e-05, 'epoch': 2.08}


 72%|███████▏  | 13500/18750 [46:51<17:37,  4.97it/s]

{'loss': 0.123, 'grad_norm': 0.01554066687822342, 'learning_rate': 1.4383561643835617e-05, 'epoch': 2.16}


 75%|███████▍  | 14000/18750 [48:32<15:54,  4.97it/s]

{'loss': 0.1462, 'grad_norm': 0.028229428455233574, 'learning_rate': 1.3013698630136986e-05, 'epoch': 2.24}


 77%|███████▋  | 14500/18750 [50:14<14:17,  4.95it/s]

{'loss': 0.1217, 'grad_norm': 1.8434722423553467, 'learning_rate': 1.1643835616438355e-05, 'epoch': 2.32}


 80%|████████  | 15000/18750 [51:55<12:34,  4.97it/s]

{'loss': 0.0791, 'grad_norm': 0.011949531733989716, 'learning_rate': 1.0273972602739726e-05, 'epoch': 2.4}


 83%|████████▎ | 15500/18750 [53:37<10:54,  4.97it/s]

{'loss': 0.1024, 'grad_norm': 0.1625107228755951, 'learning_rate': 8.904109589041095e-06, 'epoch': 2.48}


 85%|████████▌ | 16000/18750 [55:18<09:11,  4.98it/s]

{'loss': 0.1172, 'grad_norm': 0.01604519970715046, 'learning_rate': 7.5342465753424655e-06, 'epoch': 2.56}


 88%|████████▊ | 16500/18750 [57:00<07:32,  4.97it/s]

{'loss': 0.0917, 'grad_norm': 0.09570044279098511, 'learning_rate': 6.1643835616438354e-06, 'epoch': 2.64}


 91%|█████████ | 17000/18750 [58:41<05:52,  4.96it/s]

{'loss': 0.0847, 'grad_norm': 0.99578857421875, 'learning_rate': 4.7945205479452054e-06, 'epoch': 2.72}


 93%|█████████▎| 17500/18750 [1:00:22<04:11,  4.98it/s]

{'loss': 0.0987, 'grad_norm': 0.01294002402573824, 'learning_rate': 3.4246575342465754e-06, 'epoch': 2.8}


 96%|█████████▌| 18000/18750 [1:02:04<02:30,  4.98it/s]

{'loss': 0.1108, 'grad_norm': 0.026291929185390472, 'learning_rate': 2.054794520547945e-06, 'epoch': 2.88}


 99%|█████████▊| 18500/18750 [1:03:45<00:50,  4.98it/s]

{'loss': 0.102, 'grad_norm': 0.02042091079056263, 'learning_rate': 6.849315068493151e-07, 'epoch': 2.96}


100%|██████████| 18750/18750 [1:04:36<00:00,  4.84it/s]

{'train_runtime': 3876.8226, 'train_samples_per_second': 19.346, 'train_steps_per_second': 4.836, 'train_loss': 0.2728967039489746, 'epoch': 3.0}





TrainOutput(global_step=18750, training_loss=0.2728967039489746, metrics={'train_runtime': 3876.8226, 'train_samples_per_second': 19.346, 'train_steps_per_second': 4.836, 'train_loss': 0.2728967039489746, 'epoch': 3.0})
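
The `learning_rate` values in the log above are consistent with a linear warmup over the first 500 steps followed by a linear decay to zero at step 18750 (25000 training samples / batch size 4 × 3 epochs). The sketch below reproduces that schedule in plain Python. It assumes the peak learning rate of 5e-5 seen at step 500 in the log; the function name `linear_schedule_lr` is an assumption made for illustration.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=18750):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


for step in (500, 1000, 18500):
    print(step, linear_schedule_lr(step))
```

Comparing the printed values with the logged `learning_rate` at the same steps is a quick way to confirm which schedule the trainer used.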

In [41]:
trainer.evaluate()

100%|██████████| 6250/6250 [06:44<00:00, 15.44it/s]


{'eval_loss': 0.4087878167629242,
 'eval_runtime': 404.6947,
 'eval_samples_per_second': 61.775,
 'eval_steps_per_second': 15.444,
 'epoch': 3.0}
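
The evaluation above reports only the loss, because no metric function was passed to the `Trainer`. A minimal sketch of an accuracy metric is shown below. It assumes the evaluation predictions unpack into `(logits, labels)`, and it is written in plain Python so it can be checked on toy data; the logits and labels used here are hypothetical values, not results from this model.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels), as Trainer passes them."""
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        # Predicted class is the index of the largest logit.
        predicted = max(range(len(row)), key=lambda j: row[j])
        correct += int(predicted == int(label))
    return {"accuracy": correct / len(labels)}


# Toy check with hand-written logits (hypothetical values).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]]
toy_labels = [0, 1, 1, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`. With NumPy arrays, `np.argmax(logits, axis=-1)` would be the more common way to get the predictions.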