## Fine-Tuning BERT for Downstream Tasks

### Text classification

In [1]:
!pip install nlp
!pip install transformers

Collecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
Collecting dill (from nlp)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Installing collected packages: dill, nlp
Successfully installed dill-0.3.7 nlp-0.4.0


In [1]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset
import torch
import numpy as np

In [2]:
!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
dataset = load_dataset('csv', data_files='./imdbs.csv', split='train')

Downloading...
From: https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
To: /content/imdbs.csv
100% 132k/132k [00:00<00:00, 88.3MB/s]




In [3]:
type(dataset)

nlp.arrow_dataset.Dataset

In [4]:
dataset = dataset.train_test_split(test_size=0.2)

In [5]:
dataset

{'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 80),
 'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 20)}

In [6]:
train_set = dataset['train']
test_set = dataset['test']

In [7]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

#### Preprocess the dataset

We can preprocess the dataset quickly using our tokenizer. To see what the tokenizer does, consider the sentence 'I love Paris'.

First, we tokenize the sentence and add the [CLS] token at the beginning and [SEP] token at the end as shown below:

tokens = [ [CLS], I, love, Paris, [SEP] ]

Next, we map each token to its unique input id (token id). With bert-base-uncased, which lowercases the text before the vocabulary lookup, the input ids are:

input_ids = [101, 1045, 2293, 3000, 102]

Then, we need to add the segment ids (token type ids). Wait, what are segment ids? Suppose we have two sentences in the input. In that case, segment ids are used to distinguish one sentence from the other. All the tokens from the first sentence will be mapped to 0 and all the tokens from the second sentence will be mapped to 1. Since here we have only one sentence, all the tokens will be mapped to 0 as shown below:

token_type_ids = [0, 0, 0, 0, 0]

Now, we need to create the attention mask. The attention mask differentiates the actual tokens from the [PAD] tokens: it maps every actual token to 1 and every [PAD] token to 0. Suppose the required sequence length is 5. Our token list already has 5 tokens, so we do not need to add any [PAD] tokens, and the attention mask becomes:

attention_mask = [1, 1, 1, 1, 1]
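The manual steps above can be sketched in plain Python. This is only an illustration, not how the tokenizer is implemented: `TOY_VOCAB` and `encode` are hypothetical names, the vocabulary holds just the ids quoted in this section, and whitespace splitting stands in for real WordPiece tokenization.

```python
# Toy vocabulary (assumption): only the ids quoted in this section.
# A real BERT tokenizer holds the full WordPiece vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "i": 1045, "love": 2293, "paris": 3000}


def encode(sentence, max_length):
    """Hypothetical helper: encode one sentence following the steps above."""
    # Lowercase and split on whitespace; real WordPiece tokenization is more involved.
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    n_pad = max_length - len(tokens)
    if n_pad < 0:
        raise ValueError("sentence is longer than max_length")
    return {
        # Token ids, followed by [PAD] ids up to max_length.
        "input_ids": [TOY_VOCAB[t] for t in tokens] + [TOY_VOCAB["[PAD]"]] * n_pad,
        # A single sentence, so every position belongs to segment 0.
        "token_type_ids": [0] * max_length,
        # 1 for actual tokens, 0 for [PAD] tokens.
        "attention_mask": [1] * len(tokens) + [0] * n_pad,
    }


encoded = encode("I love Paris", max_length=5)
padded = encode("I love", max_length=5)
print(encoded)
print(padded)
```

The second call shows the padding case: the shorter sentence gets one [PAD] id at the end, and the last position of its attention mask is 0.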

That's it. Instead of doing all of these steps manually, we can let the tokenizer do them for us. We just need to pass a sentence to the tokenizer as shown below:

In [9]:
tokenizer('I love Seoul')

{'input_ids': [101, 1045, 2293, 10884, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [10]:
tokenizer(['I love Paris', 'birds fly', 'snow fall'], padding=True, max_length=5)



{'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}

With the tokenizer, we can easily preprocess our dataset. We define a function called preprocess that tokenizes the text column, as shown below:

In [11]:
def preprocess(data):
    return tokenizer(data['text'], padding=True, truncation=True)

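Now, we preprocess the train and test sets using the preprocess function. The notebook first downgrades dill to 0.2.8.2, apparently as a compatibility workaround for nlp 0.4.0, and then applies the function with map: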

In [22]:
!pip uninstall dill
!pip install dill==0.2.8.2

Found existing installation: dill 0.3.7
Uninstalling dill-0.3.7:
  Would remove:
    /usr/local/bin/get_gprof
    /usr/local/bin/get_objgraph
    /usr/local/bin/undill
    /usr/local/lib/python3.10/dist-packages/dill-0.3.7.dist-info/*
    /usr/local/lib/python3.10/dist-packages/dill/*
Proceed (Y/n)? y
  Successfully uninstalled dill-0.3.7
Collecting dill==0.2.8.2
  Downloading dill-0.2.8.2.tar.gz (150 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: dill
  Building wheel for dill (setup.py) ... done
  Created wheel for dill: filename=dill-0.2.8.2-py3-none-any.whl size=76466 sha256=a7a91cf7b28f3a219590f5ae63ea7bc9c414c38339abc032febd4fa25caf5d80
  Stored in directory: /root/.cache/pip/wheels/41/7b/c2/449a7de961d41b03ff714fd80b35603435776c08ebce576ab1
Successfully built dill
Installing collected package

In [12]:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))


Next, we use the set_format method to select the columns we need from the dataset and the format in which they should be returned (here, PyTorch tensors):

In [13]:
train_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

#### Training the model

In [14]:
batch_size=8
epochs=2
warmup_steps=500
weight_decay=0.01

In [17]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.24.1


In [17]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    logging_dir='./logs',
)

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=test_set
)


In [19]:
trainer.train()


TrainOutput(global_step=20, training_loss=0.6922211170196533, metrics={'train_runtime': 17.2659, 'train_samples_per_second': 9.267, 'train_steps_per_second': 1.158, 'total_flos': 42097768857600.0, 'train_loss': 0.6922211170196533, 'epoch': 2.0})
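One thing worth checking in this run is how the warmup setting relates to the total number of optimizer steps. The arithmetic below is a rough sketch under stated assumptions: a single device, no gradient accumulation, the default base learning rate of 5e-5, and a linear warmup. `warmup_lr` is a hypothetical helper that models only the warmup phase, not the decay that follows it.

```python
import math

train_rows = 80      # size of the training split shown earlier
batch_size = 8
epochs = 2
warmup_steps = 500
base_lr = 5e-5       # assumption: the default learning rate, since none was set

# Optimizer steps for the whole run (single device, no gradient accumulation).
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs


def warmup_lr(step):
    """Hypothetical helper: learning rate during a linear warmup phase."""
    return base_lr * min(1.0, step / warmup_steps)


final_lr = warmup_lr(total_steps)
print(f"total steps: {total_steps}")
print(f"learning rate at the last step: {final_lr:.2e}")
```

With these numbers the run finishes after 20 steps, matching `global_step=20` above, so training ends while the learning rate is still at a small fraction of its base value. That may be one reason the training loss stays close to ln 2 ≈ 0.693, the loss of a two-class classifier that has not yet learned anything. A smaller warmup_steps value, or more data and epochs, would probably be a better fit for a dataset of this size.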

In [20]:
trainer.evaluate()

{'eval_loss': 0.7181774377822876,
 'eval_runtime': 0.6873,
 'eval_samples_per_second': 29.098,
 'eval_steps_per_second': 4.365,
 'epoch': 2.0}

In this way, we can finetune the pre-trained BERT model for a text classification task.
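The evaluation output above contains only the loss, because no metric function was given to the Trainer. A minimal sketch of an accuracy metric follows, assuming the function receives a (logits, labels) pair of NumPy arrays; the function would be passed through the Trainer's `compute_metrics` argument. The demo arrays are made-up stand-ins, not results from this notebook.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Sketch of an accuracy metric for a classification head."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == labels))}


# Stand-in data (made up) so the function can be tried without a model.
demo_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]])
demo_labels = np.array([1, 0, 0, 0])
demo_metrics = compute_metrics((demo_logits, demo_labels))
print(demo_metrics)

# Intended use (not run here):
# trainer = Trainer(model=model, args=training_args, train_dataset=train_set,
#                   eval_dataset=test_set, compute_metrics=compute_metrics)
```

With a function like this in place, the evaluation output would include an accuracy entry alongside the loss.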

### Q&A with finetuned BERT

In this section, let's learn how to perform question answering with a finetuned Q&A BERT. First, let us import the necessary modules:

In [43]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

Now, we download and load the model. We use the bert-large-uncased-whole-word-masking-finetuned-squad model, which is finetuned on SQuAD (the Stanford Question Answering Dataset).

In [44]:
model_qa = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [45]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


#### Preprocessing the input

First, we define the input to BERT, which consists of the question and the paragraph text:

In [46]:
question = "What is the immune system?"
paragraph = "The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue."

Add the [CLS] token to the beginning of the question and the [SEP] token to the end of both the question and the paragraph:

In [47]:
question = '[CLS]' + question + '[SEP]'
paragraph += '[SEP]'

In [48]:
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

Combine the question and paragraph tokens and convert them to input_ids:

In [49]:
tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)

Next, we define the segment_ids. The segment_ids are 0 for all tokens of the question (including [CLS] and the first [SEP]) and 1 for all tokens of the paragraph:

In [50]:
segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)
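The same rule can be written against a combined token list: everything up to and including the first [SEP] is segment 0, and the rest is segment 1. The sketch below uses a hypothetical helper, `segment_ids_from_tokens`, and made-up toy tokens. In practice, calling the tokenizer on the original question and paragraph as a text pair (before the special tokens were appended by hand) would produce the token type ids directly.

```python
def segment_ids_from_tokens(tokens):
    """Hypothetical helper: 0 up to and including the first [SEP], 1 afterwards."""
    first_sep = tokens.index("[SEP]")
    question_len = first_sep + 1
    return [0] * question_len + [1] * (len(tokens) - question_len)


# Made-up toy example: a short question followed by a short paragraph.
toy_tokens = ["[CLS]", "what", "is", "bert", "?", "[SEP]",
              "bert", "is", "a", "model", ".", "[SEP]"]
toy_segment_ids = segment_ids_from_tokens(toy_tokens)
print(toy_segment_ids)
```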

In [51]:
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])
print(input_ids)
print(segment_ids)

tensor([[  101,  2054,  2003,  1996, 11311,  2291,  1029,   102,  1996, 11311,
          2291,  2003,  1037,  2291,  1997,  2116,  6897,  5090,  1998,  6194,
          2306,  2019, 15923,  2008, 18227,  2114,  4295,  1012,  2000,  3853,
          7919,  1010,  2019, 11311,  2291,  2442, 11487,  1037,  2898,  3528,
          1997,  6074,  1010,  2124,  2004, 26835,  2015,  1010,  2013, 18191,
          2000, 26045, 16253,  1010,  1998, 10782,  2068,  2013,  1996, 15923,
          1005,  1055,  2219,  7965,  8153,  1012,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


#### Getting the answer

We feed the input_ids and segment_ids to the model, which returns a start score and an end score for every token:

In [54]:
result = model_qa(input_ids, token_type_ids=segment_ids)

In [63]:
model_qa

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), ep

In [59]:
print(input_ids.shape)

torch.Size([1, 67])


In [56]:
print(result.keys())

odict_keys(['start_logits', 'end_logits'])


In [57]:
print(result[0].shape)

torch.Size([1, 67])


In [64]:
print(result[0])

tensor([[-6.2588, -4.6880, -6.7744, -6.3712, -5.8096, -8.4909, -9.0369, -6.2588,
          2.3760, -0.8670, -4.0859,  2.1112,  7.0353,  3.1633, -2.0016,  1.8844,
          2.4239, -0.8321, -4.7245, -0.6628, -0.9607, -1.5406, -0.9789, -1.5246,
          1.5805, -3.6135, -1.7062, -6.2587, -4.3460, -5.7781, -6.2772, -7.2236,
         -2.5216, -2.8306, -5.5702, -4.4567, -3.9796, -6.1513, -5.8940, -6.4212,
         -7.3876, -5.6694, -7.7685, -4.6375, -6.5613, -3.7148, -7.0651, -8.1083,
         -5.4551, -4.3829, -7.9004, -4.8883, -5.8361, -7.9597, -6.8583, -4.6028,
         -7.3392, -7.3848, -6.5887, -5.8965, -5.8692, -7.9263, -6.7758, -5.4052,
         -5.2147, -7.6892, -6.2588]], grad_fn=<CloneBackward0>)


In [60]:
start_index = torch.argmax(result.start_logits)
end_index = torch.argmax(result.end_logits)

Now, we print the text span between the start and end index as our answer:

In [61]:
print(' '.join(tokens[start_index:end_index+1]))

a system of many biological structures and processes within an organism that protects against disease
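Taking the two argmax values independently works for this example, but nothing guarantees that the end index comes after the start index, and joining tokens with spaces leaves WordPiece fragments such as `pathogen ##s` split apart. The sketch below shows one possible way to handle both cases. `best_span`, `join_wordpieces`, and `max_answer_len` are hypothetical names, and the scores are made-up stand-ins for the model's logits (with the real model they could come from `result.start_logits[0].tolist()` and `result.end_logits[0].tolist()`).

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Hypothetical helper: highest-scoring (start, end) pair with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


def join_wordpieces(tokens):
    """Hypothetical helper: merge '##' continuation pieces into the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


# Made-up scores in which the independent argmax would put the end (index 1)
# before the start (index 6) and therefore select an empty span.
toy_tokens = ["[CLS]", "what", "?", "[SEP]", "known", "as", "pathogen", "##s", "[SEP]"]
toy_start = [0.1, 0.0, 0.0, 0.0, 0.5, 0.2, 3.0, 0.1, 0.0]
toy_end = [0.0, 2.5, 0.0, 0.0, 0.1, 0.2, 0.3, 2.0, 0.0]

start, end = best_span(toy_start, toy_end)
answer = join_wordpieces(toy_tokens[start:end + 1])
print(start, end, answer)
```

The exhaustive search is quadratic in the sequence length, which is acceptable for a single short input like the one used here.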


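To summarize: loading a model from Hugging Face and training it on your own dataset is fine-tuning, whereas simply feeding your dataset into the loaded model without training it can be viewed as using the model as a feature extractor.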