We need to fine-tune the model. Every transformer model is different, and fine-tuning for different use cases naturally differs as well.
https://towardsdatascience.com/how-to-train-bert-aaad00533168

## Fine-Tuning the Core
Two objectives are used: next sentence prediction (NSP) + masked-language modeling (MLM).
### Next Sentence Prediction
The model input consists of a pair of sentences, and the model classifies whether the pair is a true pair or not.
- A + B = true pair
- A + C = non-true pair
- B + C = non-true pair
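
The pairing above can be sketched as follows. This is only an illustration: the three sentences are made up, and it assumes B directly follows A in the source text while C comes from somewhere else. The label convention (0 for a true pair, 1 for a non-true pair) is the one used later in this notebook.

```python
# Hypothetical sentences: B directly follows A in the source text, C is unrelated.
A = "I went to the store."
B = "I bought some milk there."
C = "Penguins live in the Southern Hemisphere."

IS_NEXT, NOT_NEXT = 0, 1  # same label convention as the rest of this notebook

pairs = [
    (A, B, IS_NEXT),   # true pair
    (A, C, NOT_NEXT),  # non-true pair
    (B, C, NOT_NEXT),  # non-true pair
]

for first, second, label in pairs:
    print(label, "|", first, "->", second)
```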

NSP uses a dedicated head. The output for the '[CLS]' token (768 dim) is passed through a dense NN that produces two output nodes (IsNext, NotNext). The head is discarded after training.

### Masked-Language Modeling
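The input is a chunk of text in which a given number of tokens are masked, and BERT is asked to predict what the masked words were.    
A head is added to the model and each token is fed through an FFN. The output dimension for each token equals the vocab size, and the token with the highest probability is used as the prediction.    
During training, tokens that were not masked are ignored when computing the loss. As with NSP, the head is discarded after training.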
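
A minimal pure-Python sketch of that loss rule, under stated assumptions: a toy 4-word vocabulary, made-up logits, and -100 as the "ignore" sentinel (the value PyTorch's cross-entropy ignores by default). `masked_cross_entropy` is a hypothetical helper written for this illustration, not a library function.

```python
import math

IGNORE = -100  # label value marking positions that do not contribute to the loss

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == IGNORE:
            continue  # unmasked position: skipped entirely
        log_norm = math.log(sum(math.exp(x) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    if not losses:
        return 0.0  # nothing was masked in this sequence
    return sum(losses) / len(losses)

# Toy 4-word vocabulary and a 3-token sequence; only position 1 was masked.
logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.0],
    [0.1, 0.1, 0.1, 2.0],
]
labels = [IGNORE, 2, IGNORE]

loss = masked_cross_entropy(logits, labels)
predicted = max(range(4), key=lambda i: logits[1][i])  # highest-scoring vocab id
print(predicted, round(loss, 4))
```

Changing the logits at positions 0 or 2 leaves `loss` unchanged, which is the point: only the masked position is scored.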

### Special Tokens
- CLS   ( 101 )
- SEP   ( 102 )
- MASK  ( 103 )
- PAD   ( 0   )

In [3]:
!wget https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt

--2021-08-22 10:59:06--  https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 241387 (236K) [text/plain]
Saving to: ‘clean.txt’


2021-08-22 10:59:06 (3.19 MB/s) - ‘clean.txt’ saved [241387/241387]



In [5]:
from transformers import BertTokenizer, BertForPreTraining
import torch 

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

# wget above saved the file to the working directory as clean.txt
with open('clean.txt', 'r', encoding='utf-8') as f:
    text = f.read().split('\n')

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
text[:3]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.']

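## Preparing for NSP
We need to build both pairs of sentences that genuinely follow each other and pairs made by combining a sentence with a random one.    
To create the ```NotNextSentence``` pairs, we build a bag of sentences.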

In [9]:
bag = [item for sentence in text for item in sentence.split('.') if item != '']
bag_size = len(bag)

In [10]:
text[14]

'From Maximus I learned self-government, and not to be led aside by anything; and cheerfulness in all circumstances, as well as in illness; and a just admixture in the moral character of sweetness and dignity, and to do what was set before me without complaining. I observed that everybody believed that he thought as he spoke, and that in all that he did he never had any bad intention; and he never showed amazement and surprise, and was never in a hurry, and never put off doing a thing, nor was perplexed nor dejected, nor did he ever laugh to disguise his vexation, nor, on the other hand, was he ever passionate or suspicious. He was accustomed to do acts of beneficence, and was ready to forgive, and was free from all falsehood; and he presented the appearance of a man who could not be diverted from right rather than of a man who had been improved. I observed, too, that no man could ever think that he was despised by Maximus, or ever venture to think himself a better man. He had also the

In [11]:
bag[14:19]

['From Maximus I learned self-government, and not to be led aside by anything; and cheerfulness in all circumstances, as well as in illness; and a just admixture in the moral character of sweetness and dignity, and to do what was set before me without complaining',
 ' I observed that everybody believed that he thought as he spoke, and that in all that he did he never had any bad intention; and he never showed amazement and surprise, and was never in a hurry, and never put off doing a thing, nor was perplexed nor dejected, nor did he ever laugh to disguise his vexation, nor, on the other hand, was he ever passionate or suspicious',
 ' He was accustomed to do acts of beneficence, and was ready to forgive, and was free from all falsehood; and he presented the appearance of a man who could not be diverted from right rather than of a man who had been improved',
 ' I observed, too, that no man could ever think that he was despised by Maximus, or ever venture to think himself a better man',
 

In [16]:
import random 

sentence_a = [] 
sentence_b = [] 
label = [] 

for paragraph in text:
    sentences = [
        sentence for sentence in paragraph.split('.') if sentence != ''
    ]
    num_sentences = len(sentences)
    if num_sentences > 1:
        start = random.randint(0, num_sentences-2)
        if random.random() >= 0.5:  
            # IsNextSentence label 0 
            sentence_a.append(sentences[start])
            sentence_b.append(sentences[start+1])
            label.append(0)
        else:
            # NotNextSentence label 1
            index = random.randint(0, bag_size -1 )
            sentence_a.append(sentences[start])
            sentence_b.append(bag[index])
            label.append(1)
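
One caveat with the random draw above: the sentence picked from `bag` can occasionally be the true next sentence (or the first sentence itself), which would give a mislabeled pair. With a large bag this is rare, but a guard could look like the sketch below. `sample_not_next` is a hypothetical helper and the toy lists only stand in for the notebook's `bag` and `sentences`.

```python
import random

def sample_not_next(bag, sentences, start, rng=random):
    """Draw a sentence from bag that is neither sentences[start] nor its true successor."""
    excluded = {sentences[start], sentences[start + 1]}
    candidates = [s for s in bag if s not in excluded]
    if not candidates:
        raise ValueError("bag has no sentence usable as a negative example")
    return rng.choice(candidates)

# Toy data standing in for the notebook's `bag` and one paragraph's `sentences`.
toy_bag = ["a", " b", " c", "d", " e"]
toy_sentences = ["a", " b", " c"]
negative = sample_not_next(toy_bag, toy_sentences, start=0)
print(repr(negative))
```

Filtering the whole bag on every call is linear in the bag size; for a corpus this small that is unlikely to matter, and re-drawing on collision would be a cheaper alternative.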

In [15]:
print(label[0], sentence_a[0], sentence_b[0])

0  He was accustomed to do acts of beneficence, and was ready to forgive, and was free from all falsehood; and he presented the appearance of a man who could not be diverted from right rather than of a man who had been improved  I observed, too, that no man could ever think that he was despised by Maximus, or ever venture to think himself a better man


## Tokenizer 
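Tokenize the data. Sequences are truncated/padded to a length of 512 tokens.   
Both sentences are tokenized together; in the ```token_type_ids``` tensor the tokenizer assigns 0 to the ```sentence_a``` tokens and 1 to the ```sentence_b``` tokens.     
In the ```input_ids``` tensor, the tokenizer automatically places a SEP (102) token between the two sentences to separate them.    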
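
To make the three tensors concrete, here is a pure-Python sketch of the layout for one sentence pair. It does not call the real tokenizer: `layout_pair` is a hypothetical helper, the token ids for the two sentences are made up, and truncation is left out.

```python
CLS, SEP, PAD = 101, 102, 0  # special-token ids listed earlier in this notebook

def layout_pair(ids_a, ids_b, max_length):
    """Mimic the layout [CLS] A [SEP] B [SEP] PAD... for already-tokenized ids."""
    input_ids = [CLS] + ids_a + [SEP] + ids_b + [SEP]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    if pad < 0:
        raise ValueError("pair is longer than max_length; truncation is not sketched here")
    return {
        "input_ids": input_ids + [PAD] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

# Made-up token ids for two short sentences, padded to length 10.
encoded = layout_pair([2023, 2003], [2008, 2205, 999], max_length=10)
for key, value in encoded.items():
    print(key, value)
```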

In [17]:
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt',
                    max_length=512, truncation=True, padding='max_length')

In [18]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [19]:
inputs

{'input_ids': tensor([[  101,  1045,  5159,  ...,     0,     0,     0],
        [  101,  2045,  2001,  ...,     0,     0,     0],
        [  101,  2000,  1996,  ...,     0,     0,     0],
        ...,
        [  101,  3459,  2185,  ...,     0,     0,     0],
        [  101,  2043, 15223,  ...,     0,     0,     0],
        [  101,  7887,  3288,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

## NSP Labels
The NSP labels must be placed under the ```next_sentence_label``` key.

In [26]:
inputs['next_sentence_label'] = torch.LongTensor([label]).T
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'next_sentence_label'])

In [27]:
inputs.next_sentence_label[:10]

tensor([[0],
        [0],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1],
        [1]])

## Masking For MLM
To create the MLM labels tensor, we clone the ```input_ids``` of inputs and then mask ~15% of the tokens in ```input_ids```.

In [28]:
inputs['labels'] = inputs.input_ids.detach().clone()
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'next_sentence_label', 'labels'])

In [34]:
# Create the mask array: ~15% of positions, excluding CLS (101), SEP (102) and PAD (0).
rand = torch.rand(inputs.input_ids.shape)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)

# Collect the indices holding True in each row.
selection = [] 

for i in range(inputs.input_ids.shape[0]):
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )

# In each row of input_ids, set the selected indices to 103 (MASK).
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103

In [35]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'next_sentence_label', 'labels'])

In [36]:
inputs.input_ids

tensor([[  101,  1045,  5159,  ...,     0,     0,     0],
        [  101,  2045,  2001,  ...,     0,     0,     0],
        [  101,  2000,   103,  ...,     0,     0,     0],
        ...,
        [  101,  3459,  2185,  ...,     0,     0,     0],
        [  101,  2043, 15223,  ...,     0,     0,     0],
        [  101,  7887,  3288,  ...,     0,     0,     0]])
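
Note that the labels created above are a full copy of the unmasked `input_ids`, so the loss is computed at every position, padding included. Hugging Face's BERT heads skip label positions equal to -100, so a variant that scores only the masked positions could look like the sketch below. It uses small made-up tensors rather than the notebook's `inputs`, and it is an optional variation, not what the run shown in this notebook did.

```python
import torch

CLS, SEP, MASK, PAD = 101, 102, 103, 0

torch.manual_seed(0)
# Two made-up, already padded sequences.
input_ids = torch.tensor([
    [101, 2023, 2003, 1037, 102, 3231, 6251, 102, 0, 0],
    [101, 2178, 2742, 102, 2007, 2062, 19204, 2015, 102, 0],
])

# Choose ~15% of the positions, never a special token.
rand = torch.rand(input_ids.shape)
special = (input_ids == CLS) | (input_ids == SEP) | (input_ids == PAD)
mask_arr = (rand < 0.15) & ~special

# Labels keep the original id where a token was masked and -100 elsewhere.
labels = torch.where(mask_arr, input_ids, torch.full_like(input_ids, -100))
# Inputs get the MASK id at the same positions, without a Python loop.
masked_ids = input_ids.masked_fill(mask_arr, MASK)

print(mask_arr.sum().item(), "positions masked")
```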

## Dataloader
Create a PyTorch dataset object, which the dataloader uses to build the batches fed to the model during training.

In [40]:
class OurDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings 
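    def __getitem__(self, idx):  # returns a single sample
        # values are already tensors, so index and copy instead of calling torch.tensor()
        return {key: val[idx].detach().clone() for key, val in self.encodings.items()}
    def __len__(self):  # returns the total number of samples in the data
        return len(self.encodings.input_ids)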

In [41]:
dataset = OurDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

## Training

In [43]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine

In [45]:
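# transformers' own AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW

model.train()
# weight_decay=0.0 matches the default of the deprecated transformers AdamW
optim = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)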

In [46]:
from tqdm import tqdm 

epochs = 2 

for epoch in range(epochs):
    loop = tqdm(loader, leave=True)
    for batch in loop:
        optim.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        next_sentence_label = batch['next_sentence_label'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, 
                        attention_mask = attention_mask,
                        token_type_ids = token_type_ids,
                        next_sentence_label = next_sentence_label,
                        labels = labels)


        loss = outputs.loss
        loss.backward()
        optim.step()

        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

Epoch 0:   5%|▌         | 1/20 [02:05<39:35, 125.01s/it, loss=20.3]


KeyboardInterrupt: 
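
The single `outputs.loss` used in the loop is, in Hugging Face's `BertForPreTraining`, the sum of the MLM cross-entropy and the NSP cross-entropy. A rough sketch with random stand-in logits follows; no real model is involved, and the batch size, sequence length and label values are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 2, 6, 30522  # vocab size as in the word_embeddings layer above

# Random stand-ins for the outputs of the two heads.
prediction_logits = torch.randn(batch, seq_len, vocab)  # MLM head: one score per vocab entry
seq_relationship_logits = torch.randn(batch, 2)         # NSP head: IsNext vs NotNext

labels = torch.randint(0, vocab, (batch, seq_len))      # MLM targets
next_sentence_label = torch.tensor([[0], [1]])          # shape (batch, 1), as in this notebook

# Flatten so each token (MLM) and each pair (NSP) is one classification example.
mlm_loss = F.cross_entropy(prediction_logits.view(-1, vocab), labels.view(-1))
nsp_loss = F.cross_entropy(seq_relationship_logits, next_sentence_label.view(-1))
total_loss = mlm_loss + nsp_loss
print(mlm_loss.item(), nsp_loss.item(), total_loss.item())
```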

In [47]:
from transformers import TrainingArguments 
from transformers import Trainer 

args = TrainingArguments(
    output_dir='./out',
    per_device_train_batch_size=16,
    num_train_epochs=2
)

trainer = Trainer(
    model = model,
    args = args,
    train_dataset = dataset
)

In [48]:
trainer.train()

100%|██████████| 40/40 [1:43:51<00:00, 155.78s/it]


{'train_runtime': 6231.5907, 'train_samples_per_second': 0.006, 'epoch': 2.0}


TrainOutput(global_step=40, training_loss=3.4587635040283202, metrics={'train_runtime': 6231.5907, 'train_samples_per_second': 0.006, 'epoch': 2.0, 'init_mem_cpu_alloc_delta': 293988, 'init_mem_cpu_peaked_delta': 17090, 'train_mem_cpu_alloc_delta': 361175, 'train_mem_cpu_peaked_delta': 142274})