# End-to-end Masked Language Modeling with BERT

**Author:** [Ankur Singh](https://twitter.com/ankur310794)<br>
**Date created:** 2020/09/18<br>
**Last modified:** 2020/09/18<br>
**Description:** Implement a Masked Language Model (MLM) with BERT and fine-tune it on the IMDB Reviews dataset.

## Introduction (Edited)

Masked Language Modeling is a fill-in-the-blank task,
where a model uses the context words surrounding a mask token to try to predict what the
masked word should be.

For an input that contains one or more mask tokens,
the model will generate the most likely substitution for each.

Example:

- Input: "I have watched this [MASK] and it was awesome."
- Output: "I have watched this movie and it was awesome."

Masked language modeling is a great way to train a language
model in a self-supervised setting (without human-annotated labels).
Such a model can then be fine-tuned to accomplish various supervised
NLP tasks.

This example teaches you how to build a BERT model from scratch,
train it with the masked language modeling task,
and then fine-tune this model on a sentiment classification task.

## Setup

In [1]:
import tensorflow as tf
import torch

from dataclasses import dataclass
import pandas as pd
import numpy as np
import glob
import re
from pprint import pprint

## Set-up Configuration

In [2]:
# Python decorator 설명 : https://choice-life.tistory.com/42
@dataclass
class Config:
    MAX_LEN = 256
    BATCH_SIZE = 32
    LR = 1e-04
    VOCAB_SIZE = 30000
    EMBED_DIM = 128
    NUM_HEAD = 8  # used in bert model
    FF_DIM = 128  # used in bert model
    NUM_LAYERS = 1
    NUM_CLASS = 2


config = Config()

## Load the data

We will first download the IMDB data and load into a Pandas dataframe.

In [3]:
# !curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# !tar -xf aclImdb_v1.tar.gz

* `<br />  <br>` : html tag

In [4]:
!cat aclImdb/train/pos/2203_8.txt

i, too, loved this series when i was a kid. In 1952 i was 5 and my family always watched this show. My favorite character was the one played by Marion Lorne as a rather stuttering, bumbling and very lovable "aunt" type person. i can still recall her "ubba bubba um um" type comments as she would try and say something important. And then when she came back and played Aunt Clara in Bewitched it was great casting! <br /><br />It was the first time that i can remember seeing Walter Matthau whose career i followed as a fan for many many years.<br /><br />i have a question if anyone can verify: was the title or end credits music the "Swedish Rhapsody" by Hugo Alfven? Every time i hear it played on my classical radio station here in Southern California it brings back memories of the image of Mr. Peepers walking away with his back to the camera. i'm not even certain if this image in my mind's eye is correct.

In [5]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [6]:
def get_text_list_from_files(files):
    text_list = []
    for name in files:
        with open(name) as f:
            for line in f:
                text_list.append(line)
    return text_list

# label -> pos : 1, neg : 0 
def get_data_from_text_files(folder_name):

    pos_files = glob.glob("aclImdb/" + folder_name + "/pos/*.txt")
    pos_texts = get_text_list_from_files(pos_files)
    neg_files = glob.glob("aclImdb/" + folder_name + "/neg/*.txt")
    neg_texts = get_text_list_from_files(neg_files)
    df = pd.DataFrame(
        {
            "review": pos_texts + neg_texts,
            "sentiment": [1] * len(pos_texts) + [0] * len(neg_texts),
        }
    )
    # sampling 후 reset_index : index 초기화 (https://yganalyst.github.io/data_handling/Pd_2/)
    # 두 데이터 프레임을 합치면서 index를 0부터 초기화, drop=True : 기존 index를 버림
    df = df.sample(len(df)).reset_index(drop=True)
    return df


train_df = get_data_from_text_files("train")
test_df = get_data_from_text_files("test")

all_data = train_df.append(test_df)

In [7]:
all_data.head()

Unnamed: 0,review,sentiment
0,The major fault in this film is that it is imp...,0
1,This had a good story...it had a nice pace and...,1
2,"Don't say I didn't warn you, but your gonna la...",1
3,I'm not sure why this little film has been ban...,1
4,Serious HOME ALONE/KARATE KID knock off with e...,0


In [8]:
all_data.iloc[0,]['review']

'The major fault in this film is that it is impossible to believe any of these people would ever be cast in a professional production of Macbeth. Hearing David Lansbury\'s soft voice struggling laboriously with the famous "Tomorrow, Tomorrow, and Tomorrow" speech made it impossible to believe anyone would ever consider him for the role. I kept believing therefore that he didn\'t get the part because he was a lousy actor; not because a bigger name was available. Then when we see portions of the play in rehearsal it is difficult to believe the director is not parodying things with a hopelessly miscast, misdirected travesty of actors who are unable to articulate or even understand the verse and directors who see the play through their own screwball interpretations. Sometimes directors are so anxious to have their films done (and writers think they have the ability to direct their own works)that they settle for less. This appears to be such an example.'

In [9]:
print("# of train : ", len(train_df))
print("# of test : ", len(test_df))

# of train :  25000
# of test :  25000


* keras에서는 train, test를 각각 25000개씩 사용했지만, 여기서는 일단 구현 및 디버깅 용이하게 하기 위해 5000개만 사용 (추후 25000개 학습 예정)

In [10]:
# train_df_sample = train_df.sample(5000)
# test_df_sample = test_df.sample(5000)

train_df_sample = train_df
test_df_sample = test_df

## Dataset preparation

**1. Keras Implementation**

   [Keras `TextVectorization layer`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization#used-in-the-notebooks_1) : vectorize the text into integer token ids.

    It transforms a batch of strings into either
    a sequence of token indices (one sample = 1D array of integer token indices, in order)
    or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).

        vectorize_layer = TextVectorization(
            max_tokens=vocab_size,
            output_mode="int",    -> more option : "binary", "count" or "tf-idf"
            standardize=custom_standardization,
            output_sequence_length=max_seq,
        )

    Below, we define 3 preprocessing functions.

    1.  The `get_vectorize_layer` function builds the `TextVectorization` layer.
    2.  The `encode` function encodes raw text into integer token ids.
    3.  The `get_masked_input_and_labels` function will mask input token ids.
    It masks 15% of all input tokens in each sequence at random.



**2. PyTorch Implementation**
  Huggingface 제공 `BertTokenizer`
 1) [`BertTokenizer` source code](https://github.com/huggingface/transformers/blob/86d5fb0b360e68de46d40265e7c707fe68c8015b/src/transformers/models/bert/tokenization_bert.py#L117)
    
 2) [`BertTokenizer` document description](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer
)

In [11]:
import transformers

In [12]:
#pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: [32mOK[0m


In [13]:
# https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # Pretrained model on English language using a masked language modeling (MLM) objective.

In [14]:
text = " [CLS] [MASK] [SEP] The capital of France, paris, contains the Eiffel Tower."

In [15]:
tokenizer.tokenize(text) # list

['[CLS]',
 '[MASK]',
 '[SEP]',
 'the',
 'capital',
 'of',
 'france',
 ',',
 'paris',
 ',',
 'contains',
 'the',
 'e',
 '##iff',
 '##el',
 'tower',
 '.']

In [16]:
encoding = tokenizer.encode_plus(text, add_special_tokens = True, max_length=256, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")


* BERT can only accept/take as input only `fixed length` tokens at a time, we must specify the truncation parameter to True. 
* The add special tokens parameter is just for BERT to add tokens like the start, end, [SEP], and [CLS] tokens.
      - start token :101
      - end token : 102
      - [CLS]: 101, [SEP] : 102, [MASK] : 103
      
* Return_tensors = “pt” is just for the tokenizer to return PyTorch tensors. 
     If you don’t want this to happen(maybe you want it to return a list), then you can remove the parameter and it will return lists.
* max_length : default값은 512이나

* tokenizer.encode는 단순 encoding 결과 tensor 만 제공
* tokenizer.encode_plus는 encoding 결과 tensor, token_type_ids, attention_mask 까지 dictionary로 제공

    cf. token_type_ids : https://huggingface.co/transformers/glossary.html#token-type-ids
        -> token_type_ids는 next sentence prediction 같은 task에서 문장 구분할때 사용


In [17]:
"""
{'input_ids': tensor([[  101,   101,   103,   102,  1996,  3007,  1997,  2605,  1010,  3000,
          1010,  3397,  1996,  1041, 13355,  2884,  3578,  1012,   102,     0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                                          ...
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 
            'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                          ...
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), 
         'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                          ...
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])}
"""

print("input_ids : ", encoding['input_ids'].shape)
print("token_type_ids : ", encoding['token_type_ids'].shape)
print("attention_mask : ", encoding['attention_mask'].shape)

input_ids :  torch.Size([1, 256])
token_type_ids :  torch.Size([1, 256])
attention_mask :  torch.Size([1, 256])


In [18]:
print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.mask_token, tokenizer.mask_token_id)

[CLS] 101
[MASK] 103


In [19]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # tf.strings.regex_replace("Text with tags.<br /><b>contains html</b>", "<[^>]+>", " ").numpy()
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape("!#$%&'()*+,-./:;<=>?@\^_`{|}~"), ""
    ).numpy().decode('UTF-8') # Tensor -> bytes(from numpy) -> string

# 아래 데이터셋 구축 단계에서 쓰임
def encode(text):
    # tokenizer.encode 사용하여 Tensor 반환
    encoded_text = tokenizer.encode(text, add_special_tokens = True, max_length=256, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
    encoded_text = tf.reshape(encoded_text, [-1]) # 2D->1D
    return encoded_text.numpy()

In [20]:
print(tokenizer.encode('this'))
print(tokenizer.decode(torch.tensor([101,2023,102]))) # list도 가능

[101, 2023, 102]
[CLS] this [SEP]


In [21]:
encode('this')

array([ 101, 2023,  102,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

## DataLoader

In [22]:
from torch.utils.data import Dataset, DataLoader
import tqdm
import torch

class BERTMLMDataset(Dataset):
    """
    torch.utils.data.Dataset은 데이터셋을 나타내는 추상클래스
    Custom Dataset은 Dataset에 상속하고 아래와 같이 override.

    DataLoader에서,
      __len__ 은 데이터셋의 크기를 반환 --> iterable 
      __getitem__ 은 i번째 샘플을 찾는데 사용
      
    * encoded_input_texts : encoded texts with random masking.
    * encoded_labels : ground truth encoded labels.
    """
    
    def __init__(self, encoded_input_texts, encoded_labels):
        self.encoded_input_texts = encoded_input_texts
        self.encoded_labels = encoded_labels

    def __len__(self):
        return len(self.encoded_input_texts)

    def __getitem__(self, idx):
        
        bert_input = self.encoded_input_texts[idx]
        bert_label = self.encoded_labels[idx]
        
        output = {"bert_input": bert_input,
                  "bert_label": bert_label}

        return output

### Dataset for Fine-tuning

In [23]:
##########################
## Get Training Dataset ##
##########################

# We have 5000 examples for training

# standardization & encoding
# x_train -> array (num_samples, dim=256)

x_train=[]
for i in range(len(train_df_sample.review.values)):
    if (i+1) % (len(train_df_sample.review.values)/10) == 0:
        print(f'{(i+1)/250} % Done')
    train_df_sample.review.values[i] = custom_standardization((train_df_sample.review.values[i]))
    x_train.append(encode(train_df_sample.review.values[i]))  # encode reviews with vectorizer
x_train = np.array(x_train)
print("x_train shape :", x_train.shape)

y_train = train_df_sample.sentiment.values

10.0 % Done
20.0 % Done
30.0 % Done
40.0 % Done
50.0 % Done
60.0 % Done
70.0 % Done
80.0 % Done
90.0 % Done
100.0 % Done
x_train shape : (25000, 256)


In [24]:
train_dataset = BERTMLMDataset(x_train, y_train)
train_loader = DataLoader(dataset=train_dataset, batch_size=config.BATCH_SIZE, shuffle=True)

In [25]:
print(iter(train_loader).next())

{'bert_input': tensor([[  101, 11519,  7284,  ...,     0,     0,     0],
        [  101,  2057,  2113,  ...,     0,     0,     0],
        [  101,  2019, 20998,  ...,     0,     0,     0],
        ...,
        [  101,  8923, 11531,  ...,     0,     0,     0],
        [  101,  2625,  2969,  ...,     0,     0,     0],
        [  101,  1000,  7367,  ...,     0,     0,     0]]), 'bert_label': tensor([0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
        0, 0, 0, 1, 1, 0, 0, 0])}


In [26]:
##########################
##   Get Test Dataset   ##
##########################


# We have 5000 examples for testing

x_test=[]
for i in range(len(test_df_sample.review.values)):
    if (i+1) % (len(test_df_sample.review.values)/10) == 0:
        print(f'{(i+1)/250} % Done')
    train_df_sample.review.values[i] = custom_standardization((test_df_sample.review.values[i]))
    x_test.append(encode(test_df_sample.review.values[i]))  # encode reviews with vectorizer
x_test = np.array(x_test)
print("x_test shape :", x_test.shape)

y_test = test_df_sample.sentiment.values

10.0 % Done
20.0 % Done
30.0 % Done
40.0 % Done
50.0 % Done
60.0 % Done
70.0 % Done
80.0 % Done
90.0 % Done
100.0 % Done
x_test shape : (25000, 256)


In [27]:
test_dataset = BERTMLMDataset(x_test, y_test)
test_loader = DataLoader(dataset=train_dataset, batch_size=config.BATCH_SIZE, shuffle=False)

### Dataset for Pre-train MLM & End-to-end MLM

In [28]:
def get_masked_input_and_labels(encoded_texts, mask_token_id):
    
    ####################
    # 15% BERT masking #
    ####################
    inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
    # Do not mask special tokens
    special_tokens = [tokenizer.unk_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id,
                      tokenizer.cls_token_id, tokenizer.mask_token_id]
    # Get boolean array where original array contains the elements in values list above
    masking_condition = np.isin(encoded_texts, special_tokens) 
    inp_mask[masking_condition] = False
    # Set targets to -1 by default, it means ignore
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    # Set labels for masked tokens
    labels[inp_mask] = encoded_texts[inp_mask]

    # Prepare input
    encoded_texts_masked = np.copy(encoded_texts)

    ####################
    #   10% Unchanged  #
    ####################
    # Set input to [MASK] which is the last token for the 90% of tokens
    # This means leaving 10% unchanged
    inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
    encoded_texts_masked[
        inp_mask_2mask
    ] = mask_token_id # mask_token_id : 103

    
    ####################
    #    10% Random    #
    ####################
    # Set 10% to a random token
    inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
    encoded_texts_masked[inp_mask_2random] = np.random.randint(
        3, mask_token_id, inp_mask_2random.sum()
    )

    # y_labels would be same as encoded_texts i.e input tokens
    y_labels = np.copy(encoded_texts)

    return encoded_texts_masked, y_labels

In [29]:
######################################
##   Get End-tp-ennd Test Dataset   ##
######################################

# Build dataset for end to end model input (will be used at the end)
# x_test 가 encoding된 벡터가 아니라 string 원문이 들어감

test_raw_dataset = BERTMLMDataset(test_df_sample.review.values, y_test)
test_raw_loader = DataLoader(dataset=test_raw_dataset, batch_size=config.BATCH_SIZE, shuffle=False)

In [30]:
###########################################
##   Get Masked language model Dataset   ##
###########################################

# Prepare data for masked language model
# 기존 Train + Test 데이터 합쳐서 사용 -> 총 10000개
all_data_sample = train_df_sample.append(test_df_sample)

x_all_review=[]
for i in range(len(all_data_sample.review.values)):
    if (i+1) % (len(all_data_sample.review.values)/10) == 0:
        print(f'{(i+1)/500} % Done')
    x_all_review.append(encode(all_data_sample.review.values[i]))
x_all_review = np.array(x_all_review)
print("x_all_review shape :", x_all_review.shape)

10.0 % Done
20.0 % Done
30.0 % Done
40.0 % Done
50.0 % Done
60.0 % Done
70.0 % Done
80.0 % Done
90.0 % Done
100.0 % Done
x_all_review shape : (50000, 256)


In [31]:
x_masked_train, y_masked_labels = get_masked_input_and_labels(
    x_all_review, tokenizer.mask_token_id
)

mlm_dataset = BERTMLMDataset(x_masked_train, y_masked_labels)
mlm_loader = DataLoader(dataset=mlm_dataset, batch_size=config.BATCH_SIZE, shuffle=True)

In [32]:
print("mask_token_id : ", tokenizer.mask_token_id)
print("\n")
print("x_masked_train")
print(x_masked_train)
print("\ny_masked_labels")
print(y_masked_labels)

mask_token_id :  103


x_masked_train
[[  101  1045   103 ...   103  2216   102]
 [  101  5292   103 ...     0     0     0]
 [  101  3575   103 ...     0     0     0]
 ...
 [  101  1045  2941 ... 22889 18163   102]
 [  101  1045  2293 ...     0     0     0]
 [  101  2045  1005 ...  2018  1037   102]]

y_masked_labels
[[  101  1045  2387 ...  2011  2216   102]
 [  101  5292 27172 ...     0     0     0]
 [  101  3575  3849 ...     0     0     0]
 ...
 [  101  1045  2941 ... 22889 18163   102]
 [  101  1045  2293 ...     0     0     0]
 [  101  2045  1005 ...  2018  1037   102]]


## Save variables using pickle

In [33]:
# https://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
import pickle

with open('pickle_data/train.pickle', 'wb') as f:
    pickle.dump([x_train, y_train], f)

with open('pickle_data/test.pickle', 'wb') as f:
    pickle.dump([x_test, y_test], f)


with open('pickle_data/x_all_review.pickle', 'wb') as f:
    pickle.dump(x_all_review, f)

with open('pickle_data/masked_train.pickle', 'wb') as f:
    pickle.dump([x_masked_train, y_masked_labels], f)

* 참고 : nn.Embedding
----

## Create BERT model (Pretraining Model) for masked language modeling

We will create a BERT-like pretraining model architecture
using the `MultiHeadAttention` layer.
It will take token ids as inputs (including masked tokens)
and it will predict the correct ids for the masked input tokens.

PyTorch Implementation Reference Code : https://github.com/codertimo/BERT-pytorch/tree/master/bert_pytorch

In [34]:
import torch.nn as nn
import torch.nn.functional as F
import torch

import math

### Embedding layer

In [35]:
# NOT TRAINED
class PositionalEmbedding(nn.Module):

    def __init__(self, max_len, d_emb):
        super().__init__()

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_emb).float()
        pe.require_grad = False

        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_emb, 2).float() * -(math.log(10000.0) / d_emb)).exp()

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return self.pe[:, :x.size(1)]


class BERTEmbedding(nn.Module):
    """
    BERT Embedding which is consisted with under features
        1. TokenEmbedding : normal embedding matrix
        2. PositionalEmbedding : adding positional information using sin, cos
        sum of all these features are output of BERTEmbedding
    """

    def __init__(self, vocab_size, max_len, embed_size):
        """
        :param vocab_size: total vocab size
        :param embed_size: embedding size of token embedding
        """
        super().__init__()
        self.token = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_size, padding_idx=0) # tokenizer.pad_token_id = 0
        self.position = PositionalEmbedding(max_len=max_len, d_emb= embed_size)

    def forward(self, sequence):
        x = self.token(sequence) + self.position(sequence)
        return x

### Multi-headed Attention & Transformer Encoder Block

In [36]:
# Single Attention
class Attention(nn.Module):
    """
    Compute 'Scaled Dot Product Attention
    """

    def forward(self, query, key, value, mask=None, dropout=None):
        scores = torch.matmul(query, key.transpose(-2, -1)) \
                 / math.sqrt(query.size(-1))

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9) # masking value : negative infinity

        p_attn = F.softmax(scores, dim=-1)

        if dropout is not None:
            p_attn = dropout(p_attn)

        return torch.matmul(p_attn, value), p_attn

    
# Multi-Headed Attention
class MultiHeadedAttention(nn.Module):
    """
    Take in model size and number of heads.
    """
    
    def __init__(self, num_heads, embed_dim, dropout=0.1): # d_model : embed_dim
        super().__init__()
        assert embed_dim % num_heads == 0  # 나머지

        # We assume d_v always equals d_k
        self.d_k = embed_dim // num_heads  # 몫
        self.h = num_heads

        self.linear_layers = nn.ModuleList([nn.Linear(embed_dim, embed_dim) for _ in range(3)])
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        self.attention = Attention()
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # 1) Do all the linear projections in batch from embed_dim => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, attn = self.attention(query, key, value, mask=mask, dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        # contiguous() 설명 
        # contiguous한 텐서는 storage 상에서 점핑없이 순서대로 효율적이게 방문할 수 있기 때문에 메모리 접근 성능을 향상 시킬 수 있음
        
        # Blog Post : https://subinium.github.io/pytorch-Tensor-Variable/
        # Stack Overflow : https://bit.ly/3uNnv2B
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)

        return self.output_linear(x)
        
        
class GELU(nn.Module):
    """
    Paper Section 3.4, last paragraph notice that BERT used the GELU instead of RELU
    """

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))

    
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = GELU()

    def forward(self, x):
        return self.w_2(self.dropout(self.activation(self.w_1(x))))
    

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
    

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

    
class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """
 
    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: head sizes of multi-head attention
        :param feed_forward_hidden: feed_forward_hidden, usually 4*hidden_size -> Keras Example에서는 4배 안 함
        :param dropout: dropout rate
        """

        super().__init__()
        self.attention = MultiHeadedAttention(num_heads=attn_heads, embed_dim=hidden, dropout=dropout)
        self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
        self.input_sublayer = SublayerConnection(size=hidden, dropout=0.0) # multi-head self-attention block의 layer normalization
        self.output_sublayer = SublayerConnection(size=hidden, dropout=0.0) # FFN block의 layer normalization
        self.dropout = nn.Dropout(p=0.0)

    def forward(self, x, mask):
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
        x = self.output_sublayer(x, self.feed_forward)
        return self.dropout(x)
    

### BERT & MLMBERT Model

In [37]:
class BERT(nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """

    def __init__(self, vocab_size, max_len=256, hidden=128, n_layers=1, attn_heads=8, dropout=0.1):
        """
        :param vocab_size: vocab_size of total words
        :param hidden: BERT model hidden size
        :param n_layers: numbers of Transformer blocks(layers)
        :param attn_heads: number of attention heads
        :param dropout: dropout rate
        """

        super().__init__()
        self.max_len= max_len
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads

        # Keras example used hidden_size=128 for ff_network_hidden_size
        self.feed_forward_hidden = hidden

        # embedding for BERT, sum of positional, segment, token embeddings
        self.embedding = BERTEmbedding(vocab_size=vocab_size, max_len=max_len, embed_size=hidden)

        # multi-layers transformer blocks, deep network
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, hidden, dropout) for _ in range(n_layers)]) 

    def forward(self, x):
        # attention masking for padded token
        # torch.ByteTensor([batch_size, seq_len]) -> torch.ByteTensor([batch_size, 1, seq_len, seq_len])
        # TODO : attention 계산 시에 padding 부분에 대한 mask 생성
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1) 
        # embedding the indexed sequence to sequence of vectors
        # https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
        x = self.embedding(x)

        # running over multiple transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)

        return x


class MaskedLanguageModel(nn.Module):
    """
    predicting origin token from masked input sequence
    n-class classification problem, n-class = vocab_size
    """

    def __init__(self, hidden, vocab_size):
        """
        :param hidden: output size of BERT model
        :param vocab_size: total vocab size
        """
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)  # 128 X 30000
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # x : torch.Size([32, 256, 128])
        # return output : # torch.Size([32, 256, 30000])
        return self.softmax(self.linear(x)) 
    
    
class BERTMLM(nn.Module):
    """
    BERT Masked Language Model
    """

    def __init__(self, bert: BERT, vocab_size):
        """
        :param bert: BERT model which should be trained
        :param vocab_size: total vocab size for masked_lm
        """

        super().__init__()
        self.bert = bert
        self.mask_lm = MaskedLanguageModel(self.bert.hidden, vocab_size)

    def forward(self, x):
        x = self.bert(x)
        return self.mask_lm(x)

## Train and Save

In [38]:
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader

import tqdm


class ScheduledOptim():
    '''A simple wrapper class for learning rate scheduling'''

    def __init__(self, optimizer, d_model, n_warmup_steps):
        self._optimizer = optimizer
        self.n_warmup_steps = n_warmup_steps
        self.n_current_steps = 0
        self.init_lr = np.power(d_model, -0.5)

    def step_and_update_lr(self):
        "Step with the inner optimizer"
        self._update_learning_rate()
        self._optimizer.step()

    def zero_grad(self):
        "Zero out the gradients by the inner optimizer"
        self._optimizer.zero_grad()

    def _get_lr_scale(self):
        return np.min([
            np.power(self.n_current_steps, -0.5),
            np.power(self.n_warmup_steps, -1.5) * self.n_current_steps])

    def _update_learning_rate(self):
        ''' Learning rate scheduling per step '''

        self.n_current_steps += 1
        lr = self.init_lr * self._get_lr_scale()

        for param_group in self._optimizer.param_groups:
            param_group['lr'] = lr
            

class BERTMLMTrainer:
    """
    BERTMLMTrainer make the pretrained BERT model with Masked Language Model.
    """

    def __init__(self, bert: BERT, vocab_size: int,
                 train_dataloader: DataLoader, test_dataloader: DataLoader = None,
                 lr: float = 1e-4, betas=(0.9, 0.999), weight_decay: float = 0.01, warmup_steps=10000,
                 with_cuda: bool = True, log_freq: int = 100):
        """
        :param bert: BERT model which you want to train
        :param vocab_size: total word vocab size
        :param train_dataloader: train dataset data loader
        :param test_dataloader: test dataset data loader [can be None]
        :param lr: learning rate of optimizer
        :param betas: Adam optimizer betas
        :param weight_decay: Adam optimizer weight decay param
        :param with_cuda: traning with cuda
        :param log_freq: logging frequency of the batch iteration
        """

        # Setup cuda device for BERT training, argument -c, --cuda should be true
        # CUDA out of memory.로 False 처리
        cuda_condition = False # torch.cuda.is_available() and with_cuda
        self.device = torch.device("cuda:0" if cuda_condition else "cpu")

        # This BERT model will be saved every epoch
        self.bert = bert
        # Initialize the BERT Masked Language Model, with BERT model
        self.model = BERTMLM(bert, vocab_size).to(self.device)

        # Setting the train and test data loader
        self.train_data = train_dataloader
        self.test_data = test_dataloader

        # Setting the Adam optimizer with hyper-param
        self.optim = Adam(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
        self.optim_schedule = ScheduledOptim(self.optim, self.bert.hidden, n_warmup_steps=warmup_steps)

        # Using Negative Log Likelihood Loss function for predicting the masked_token
        self.criterion = nn.NLLLoss(ignore_index=0)

        self.log_freq = log_freq

        print("Total Parameters:", sum([p.nelement() for p in self.model.parameters()]))

    def train(self, epoch):
        self.iteration(epoch, self.train_data)

    def test(self, epoch):
        self.iteration(epoch, self.test_data, train=False)

    def iteration(self, epoch, data_loader, train=True):
        """
        loop over the data_loader for training or testing
        if on train status, backward operation is activated
        and also auto save the model every peoch
        :param epoch: current epoch index
        :param data_loader: torch.utils.data.DataLoader for iteration
        :param train: boolean value of is train or test
        :return: None
        """
        str_code = "train" if train else "test"

        # Setting the tqdm progress bar
        data_iter = tqdm.tqdm(enumerate(data_loader),
                              desc="EP_%s:%d" % (str_code, epoch),
                              total=len(data_loader),
                              bar_format="{l_bar}{r_bar}")

        avg_loss = 0.0

        for i, data in data_iter:
            # 0. batch_data will be sent into the device(GPU or cpu)
            data = {key: value.to(self.device) for key, value in data.items()}

            # 1. forward the masked_lm model
            mask_lm_output = self.model.forward(data["bert_input"]) #  torch.Size([32, 256, 30000])
        
            # NLLLoss of predicting masked token word
            loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"]) # torch.Size([32, 30000, 256]), torch.Size([32, 256])

            # 2. backward and optimization only in train
            if train:
                self.optim_schedule.zero_grad()
                loss.backward()
                self.optim_schedule.step_and_update_lr()
    
            avg_loss += loss.item()

            post_fix = {
                "epoch": epoch,
                "iter": i,
                "avg_loss": avg_loss / (i + 1),
                "loss": loss.item()
            }

            if i % self.log_freq == 0:
                data_iter.write(str(post_fix))

        print("EP%d_%s, avg_loss=" % (epoch, str_code), avg_loss / len(data_iter))


    def save(self, epoch, file_path, mlm = True):
        """
        Saving the current BERT model on file_path
        :param epoch: current epoch number
        :param file_path: model output path which gonna be file_path+"ep%d" % epoch
        :param mlm: If True, save the full MLM BERT model. Otherwise, save only the BERT model.
        :return: final_output_path
        """
        if mlm:
            output_path = file_path
            # save full `mlm bert`
            torch.save(self.model.cpu(), output_path)
            self.model.to(self.device)
            
        else:
            output_path = file_path 
            # save only `bert`
            torch.save(self.bert.cpu(), output_path)
            self.bert.to(self.device)

        print("EP:%d Model Saved on:" % epoch, output_path)
        return output_path

In [39]:
print("Building BERT model")
bert = BERT(config.VOCAB_SIZE, max_len=config.MAX_LEN, hidden=config.EMBED_DIM, n_layers=config.NUM_LAYERS, attn_heads=config.NUM_HEAD)

print("Creating BERTMLM Pre-Trainer")
trainer = BERTMLMTrainer(bert, config.VOCAB_SIZE, train_dataloader=mlm_loader, test_dataloader=None, lr=config.LR)

print("Training Start")
# 17m / epoch (# of data samples : 5000), 1h 30m / epoch (# of data samples : 25000)
for epoch in range(5): 
    trainer.train(epoch)
    # Save BERT
    trainer.save(epoch, file_path="models/pretrained_bert_imdb" + "_ep%d" % epoch + ".pt", mlm = False)
    # Save BERT MLM
    trainer.save(epoch, file_path="models/pretrained_bert_mlm_imdb" + "_ep%d" % epoch + ".pt", mlm = True)

EP_train:0:   0%|| 0/1563 [00:00<?, ?it/s]

Building BERT model
Creating BERTMLM Pre-Trainer
Total Parameters: 7809584
Training Start


EP_train:0:   0%|| 1/1563 [00:03<1:23:36,  3.21s/it]

{'epoch': 0, 'iter': 0, 'avg_loss': 10.638824462890625, 'loss': 10.638824462890625}


EP_train:0:   6%|| 101/1563 [05:26<1:18:48,  3.23s/it]

{'epoch': 0, 'iter': 100, 'avg_loss': 10.605322082443992, 'loss': 10.55669116973877}


EP_train:0:  13%|| 201/1563 [10:51<1:13:33,  3.24s/it]

{'epoch': 0, 'iter': 200, 'avg_loss': 10.551837024404042, 'loss': 10.425453186035156}


EP_train:0:  19%|| 301/1563 [16:17<1:09:08,  3.29s/it]

{'epoch': 0, 'iter': 300, 'avg_loss': 10.463132956495317, 'loss': 10.140094757080078}


EP_train:0:  26%|| 401/1563 [21:49<1:04:58,  3.35s/it]

{'epoch': 0, 'iter': 400, 'avg_loss': 10.32018718338964, 'loss': 9.600696563720703}


EP_train:0:  32%|| 501/1563 [27:22<59:29,  3.36s/it]  

{'epoch': 0, 'iter': 500, 'avg_loss': 10.024912465832191, 'loss': 7.700746059417725}


EP_train:0:  38%|| 601/1563 [32:55<52:36,  3.28s/it]

{'epoch': 0, 'iter': 600, 'avg_loss': 9.527322860406759, 'loss': 6.503056526184082}


EP_train:0:  45%|| 701/1563 [38:28<48:04,  3.35s/it]

{'epoch': 0, 'iter': 700, 'avg_loss': 9.066506063376956, 'loss': 6.2775750160217285}


EP_train:0:  51%|| 801/1563 [44:03<42:35,  3.35s/it]

{'epoch': 0, 'iter': 800, 'avg_loss': 8.662292575717121, 'loss': 5.406287670135498}


EP_train:0:  58%|| 901/1563 [49:38<36:58,  3.35s/it]

{'epoch': 0, 'iter': 900, 'avg_loss': 8.30162570373332, 'loss': 5.428715705871582}


EP_train:0:  64%|| 1001/1563 [55:18<32:15,  3.44s/it]

{'epoch': 0, 'iter': 1000, 'avg_loss': 7.979950142192555, 'loss': 4.873656749725342}


EP_train:0:  70%|| 1101/1563 [1:00:48<26:18,  3.42s/it]

{'epoch': 0, 'iter': 1100, 'avg_loss': 7.690601850400504, 'loss': 4.651508808135986}


EP_train:0:  77%|| 1201/1563 [1:06:48<21:01,  3.48s/it]

{'epoch': 0, 'iter': 1200, 'avg_loss': 7.4260401975900105, 'loss': 4.370693683624268}


EP_train:0:  83%|| 1301/1563 [1:12:23<14:38,  3.35s/it]

{'epoch': 0, 'iter': 1300, 'avg_loss': 7.187353624736777, 'loss': 4.4637041091918945}


EP_train:0:  90%|| 1401/1563 [1:17:58<09:03,  3.36s/it]

{'epoch': 0, 'iter': 1400, 'avg_loss': 6.9712318570847005, 'loss': 4.081660270690918}


EP_train:0:  96%|| 1501/1563 [1:25:00<03:25,  3.32s/it]

{'epoch': 0, 'iter': 1500, 'avg_loss': 6.771714019743623, 'loss': 3.798304557800293}


EP_train:0: 100%|| 1563/1563 [1:28:25<00:00,  3.39s/it]
EP_train:1:   0%|| 0/1563 [00:00<?, ?it/s]

EP0_train, avg_loss= 6.658745258341061
EP:0 Model Saved on: models/pretrained_bert_imdb_ep0.pt
EP:0 Model Saved on: models/pretrained_bert_mlm_imdb_ep0.pt


EP_train:1:   0%|| 1/1563 [00:03<1:25:29,  3.28s/it]

{'epoch': 1, 'iter': 0, 'avg_loss': 4.054785251617432, 'loss': 4.054785251617432}


EP_train:1:   6%|| 101/1563 [05:35<1:20:53,  3.32s/it]

{'epoch': 1, 'iter': 100, 'avg_loss': 3.8205379259468306, 'loss': 3.733412742614746}


EP_train:1:  13%|| 201/1563 [11:07<1:15:25,  3.32s/it]

{'epoch': 1, 'iter': 200, 'avg_loss': 3.770584896429261, 'loss': 3.403268337249756}


EP_train:1:  19%|| 301/1563 [16:39<1:09:49,  3.32s/it]

{'epoch': 1, 'iter': 300, 'avg_loss': 3.725418607261886, 'loss': 3.6450743675231934}


EP_train:1:  26%|| 401/1563 [22:11<1:04:32,  3.33s/it]

{'epoch': 1, 'iter': 400, 'avg_loss': 3.701398490967596, 'loss': 3.31532621383667}


EP_train:1:  32%|| 501/1563 [27:45<58:51,  3.33s/it]  

{'epoch': 1, 'iter': 500, 'avg_loss': 3.674220739962336, 'loss': 3.4641294479370117}


EP_train:1:  38%|| 601/1563 [33:18<53:24,  3.33s/it]

{'epoch': 1, 'iter': 600, 'avg_loss': 3.6508583955082443, 'loss': 3.413344144821167}


EP_train:1:  45%|| 701/1563 [38:51<47:57,  3.34s/it]

{'epoch': 1, 'iter': 700, 'avg_loss': 3.6336541502349897, 'loss': 3.295443058013916}


EP_train:1:  51%|| 801/1563 [44:25<42:15,  3.33s/it]

{'epoch': 1, 'iter': 800, 'avg_loss': 3.6191450838143755, 'loss': 3.532745361328125}


EP_train:1:  58%|| 901/1563 [49:57<36:34,  3.32s/it]

{'epoch': 1, 'iter': 900, 'avg_loss': 3.6067620522438752, 'loss': 3.8155910968780518}


EP_train:1:  64%|| 1001/1563 [55:30<31:00,  3.31s/it]

{'epoch': 1, 'iter': 1000, 'avg_loss': 3.595968320057704, 'loss': 3.428986072540283}


EP_train:1:  70%|| 1101/1563 [1:01:04<25:43,  3.34s/it]

{'epoch': 1, 'iter': 1100, 'avg_loss': 3.587662027923764, 'loss': 3.443204641342163}


EP_train:1:  77%|| 1201/1563 [1:06:37<20:09,  3.34s/it]

{'epoch': 1, 'iter': 1200, 'avg_loss': 3.5814723394792543, 'loss': 3.3196585178375244}


EP_train:1:  83%|| 1301/1563 [1:12:10<14:30,  3.32s/it]

{'epoch': 1, 'iter': 1300, 'avg_loss': 3.5773279954615598, 'loss': 3.7282352447509766}


EP_train:1:  90%|| 1401/1563 [1:17:42<08:58,  3.32s/it]

{'epoch': 1, 'iter': 1400, 'avg_loss': 3.57540436202845, 'loss': 3.5696499347686768}


EP_train:1:  96%|| 1501/1563 [1:23:15<03:27,  3.34s/it]

{'epoch': 1, 'iter': 1500, 'avg_loss': 3.5746220460659184, 'loss': 3.5754637718200684}


EP_train:1: 100%|| 1563/1563 [1:26:40<00:00,  3.33s/it]
EP_train:2:   0%|| 0/1563 [00:00<?, ?it/s]

EP1_train, avg_loss= 3.574163303417955
EP:1 Model Saved on: models/pretrained_bert_imdb_ep1.pt
EP:1 Model Saved on: models/pretrained_bert_mlm_imdb_ep1.pt


EP_train:2:   0%|| 1/1563 [00:03<1:25:04,  3.27s/it]

{'epoch': 2, 'iter': 0, 'avg_loss': 3.79193115234375, 'loss': 3.79193115234375}


EP_train:2:   6%|| 101/1563 [05:36<1:21:20,  3.34s/it]

{'epoch': 2, 'iter': 100, 'avg_loss': 3.575910513943965, 'loss': 3.5780179500579834}


EP_train:2:  13%|| 201/1563 [11:09<1:15:47,  3.34s/it]

{'epoch': 2, 'iter': 200, 'avg_loss': 3.5715106269020347, 'loss': 3.27047061920166}


EP_train:2:  19%|| 301/1563 [16:42<1:09:38,  3.31s/it]

{'epoch': 2, 'iter': 300, 'avg_loss': 3.588355082610121, 'loss': 3.561704635620117}


EP_train:2:  26%|| 401/1563 [22:13<1:03:56,  3.30s/it]

{'epoch': 2, 'iter': 400, 'avg_loss': 3.5986989561161793, 'loss': 3.598663568496704}


EP_train:2:  32%|| 501/1563 [27:47<59:26,  3.36s/it]  

{'epoch': 2, 'iter': 500, 'avg_loss': 3.603646917971308, 'loss': 3.487985372543335}


EP_train:2:  38%|| 601/1563 [33:19<53:27,  3.33s/it]

{'epoch': 2, 'iter': 600, 'avg_loss': 3.60939435792247, 'loss': 3.5795154571533203}


EP_train:2:  45%|| 701/1563 [38:53<47:45,  3.32s/it]

{'epoch': 2, 'iter': 700, 'avg_loss': 3.617397413444247, 'loss': 3.7290163040161133}


EP_train:2:  51%|| 801/1563 [44:25<42:19,  3.33s/it]

{'epoch': 2, 'iter': 800, 'avg_loss': 3.6240241834138067, 'loss': 3.7559638023376465}


EP_train:2:  58%|| 901/1563 [49:59<36:59,  3.35s/it]

{'epoch': 2, 'iter': 900, 'avg_loss': 3.6326325685943535, 'loss': 3.7899270057678223}


EP_train:2:  64%|| 1001/1563 [55:33<31:16,  3.34s/it]

{'epoch': 2, 'iter': 1000, 'avg_loss': 3.6404307605503323, 'loss': 3.6170263290405273}


EP_train:2:  70%|| 1101/1563 [1:01:08<25:52,  3.36s/it]

{'epoch': 2, 'iter': 1100, 'avg_loss': 3.6470727645517154, 'loss': 3.709679126739502}


EP_train:2:  77%|| 1201/1563 [1:06:45<20:21,  3.37s/it]

{'epoch': 2, 'iter': 1200, 'avg_loss': 3.654342211255622, 'loss': 3.7558393478393555}


EP_train:2:  83%|| 1301/1563 [1:12:26<14:59,  3.43s/it]

{'epoch': 2, 'iter': 1300, 'avg_loss': 3.658344217120089, 'loss': 3.7786026000976562}


EP_train:2:  90%|| 1401/1563 [1:18:11<09:18,  3.45s/it]

{'epoch': 2, 'iter': 1400, 'avg_loss': 3.6630661807512914, 'loss': 3.6158828735351562}


EP_train:2:  96%|| 1501/1563 [1:23:53<03:31,  3.41s/it]

{'epoch': 2, 'iter': 1500, 'avg_loss': 3.668805219426622, 'loss': 3.9724879264831543}


EP_train:2: 100%|| 1563/1563 [1:27:22<00:00,  3.35s/it]
EP_train:3:   0%|| 0/1563 [00:00<?, ?it/s]

EP2_train, avg_loss= 3.6704552826481756
EP:2 Model Saved on: models/pretrained_bert_imdb_ep2.pt
EP:2 Model Saved on: models/pretrained_bert_mlm_imdb_ep2.pt


EP_train:3:   0%|| 1/1563 [00:03<1:28:55,  3.42s/it]

{'epoch': 3, 'iter': 0, 'avg_loss': 3.6901683807373047, 'loss': 3.6901683807373047}


EP_train:3:   6%|| 101/1563 [06:00<1:30:10,  3.70s/it]

{'epoch': 3, 'iter': 100, 'avg_loss': 3.7375717989288932, 'loss': 3.8158910274505615}


EP_train:3:  13%|| 201/1563 [12:22<1:27:32,  3.86s/it]

{'epoch': 3, 'iter': 200, 'avg_loss': 3.7339510383890637, 'loss': 3.6607964038848877}


EP_train:3:  19%|| 301/1563 [19:02<1:25:20,  4.06s/it]

{'epoch': 3, 'iter': 300, 'avg_loss': 3.7349421867104464, 'loss': 3.6039233207702637}


EP_train:3:  26%|| 401/1563 [25:45<1:17:22,  4.00s/it]

{'epoch': 3, 'iter': 400, 'avg_loss': 3.7326884121074344, 'loss': 3.850945472717285}


EP_train:3:  32%|| 501/1563 [32:17<1:09:24,  3.92s/it]

{'epoch': 3, 'iter': 500, 'avg_loss': 3.7295575432196824, 'loss': 3.5935637950897217}


EP_train:3:  38%|| 601/1563 [38:43<1:01:28,  3.83s/it]

{'epoch': 3, 'iter': 600, 'avg_loss': 3.7282499699743337, 'loss': 3.4811928272247314}


EP_train:3:  45%|| 701/1563 [45:07<55:02,  3.83s/it]  

{'epoch': 3, 'iter': 700, 'avg_loss': 3.729441330878439, 'loss': 3.6450796127319336}


EP_train:3:  51%|| 801/1563 [51:28<48:20,  3.81s/it]

{'epoch': 3, 'iter': 800, 'avg_loss': 3.7265280227684947, 'loss': 3.7261996269226074}


EP_train:3:  58%|| 901/1563 [57:50<41:48,  3.79s/it]

{'epoch': 3, 'iter': 900, 'avg_loss': 3.7251465026863935, 'loss': 3.803724527359009}


EP_train:3:  64%|| 1001/1563 [1:04:10<35:39,  3.81s/it]

{'epoch': 3, 'iter': 1000, 'avg_loss': 3.7214676802689497, 'loss': 3.7794032096862793}


EP_train:3:  70%|| 1101/1563 [1:10:30<29:06,  3.78s/it]

{'epoch': 3, 'iter': 1100, 'avg_loss': 3.7200956060927526, 'loss': 3.486332416534424}


EP_train:3:  77%|| 1201/1563 [1:16:51<22:59,  3.81s/it]

{'epoch': 3, 'iter': 1200, 'avg_loss': 3.7183928668350106, 'loss': 3.7254111766815186}


EP_train:3:  83%|| 1301/1563 [1:23:10<16:37,  3.81s/it]

{'epoch': 3, 'iter': 1300, 'avg_loss': 3.7149983456646085, 'loss': 3.789834499359131}


EP_train:3:  90%|| 1401/1563 [1:29:29<10:17,  3.81s/it]

{'epoch': 3, 'iter': 1400, 'avg_loss': 3.7125929964175826, 'loss': 3.6774072647094727}


EP_train:3:  96%|| 1501/1563 [1:35:49<03:55,  3.80s/it]

{'epoch': 3, 'iter': 1500, 'avg_loss': 3.7087008386036304, 'loss': 3.657022714614868}


EP_train:3: 100%|| 1563/1563 [1:39:44<00:00,  3.83s/it]
EP_train:4:   0%|| 0/1563 [00:00<?, ?it/s]

EP3_train, avg_loss= 3.7056290476427427
EP:3 Model Saved on: models/pretrained_bert_imdb_ep3.pt
EP:3 Model Saved on: models/pretrained_bert_mlm_imdb_ep3.pt


EP_train:4:   0%|| 1/1563 [00:03<1:37:12,  3.73s/it]

{'epoch': 4, 'iter': 0, 'avg_loss': 3.7270054817199707, 'loss': 3.7270054817199707}


EP_train:4:   6%|| 101/1563 [06:25<1:33:33,  3.84s/it]

{'epoch': 4, 'iter': 100, 'avg_loss': 3.67496123644385, 'loss': 3.745459794998169}


EP_train:4:  13%|| 201/1563 [12:48<1:27:12,  3.84s/it]

{'epoch': 4, 'iter': 200, 'avg_loss': 3.6575916204879535, 'loss': 3.4743785858154297}


EP_train:4:  19%|| 301/1563 [19:14<1:20:59,  3.85s/it]

{'epoch': 4, 'iter': 300, 'avg_loss': 3.6503711301226947, 'loss': 3.515561580657959}


EP_train:4:  26%|| 401/1563 [25:40<1:15:12,  3.88s/it]

{'epoch': 4, 'iter': 400, 'avg_loss': 3.6413358066444683, 'loss': 3.736457586288452}


EP_train:4:  32%|| 501/1563 [32:06<1:07:59,  3.84s/it]

{'epoch': 4, 'iter': 500, 'avg_loss': 3.63291634818513, 'loss': 3.459174633026123}


EP_train:4:  38%|| 601/1563 [38:32<1:01:49,  3.86s/it]

{'epoch': 4, 'iter': 600, 'avg_loss': 3.623168648181858, 'loss': 3.5275418758392334}


EP_train:4:  45%|| 701/1563 [44:51<54:04,  3.76s/it]  

{'epoch': 4, 'iter': 700, 'avg_loss': 3.6101035782682063, 'loss': 3.5620720386505127}


EP_train:4:  51%|| 801/1563 [51:05<47:41,  3.76s/it]

{'epoch': 4, 'iter': 800, 'avg_loss': 3.595455558410149, 'loss': 3.4742703437805176}


EP_train:4:  58%|| 901/1563 [57:24<42:02,  3.81s/it]

{'epoch': 4, 'iter': 900, 'avg_loss': 3.580404379788567, 'loss': 3.4566409587860107}


EP_train:4:  64%|| 1001/1563 [1:03:47<35:49,  3.83s/it]

{'epoch': 4, 'iter': 1000, 'avg_loss': 3.5649189351202843, 'loss': 3.416919708251953}


EP_train:4:  70%|| 1101/1563 [1:10:09<29:16,  3.80s/it]

{'epoch': 4, 'iter': 1100, 'avg_loss': 3.5474761389906466, 'loss': 3.3122475147247314}


EP_train:4:  77%|| 1201/1563 [1:16:27<22:35,  3.74s/it]

{'epoch': 4, 'iter': 1200, 'avg_loss': 3.529168447388102, 'loss': 3.480825424194336}


EP_train:4:  83%|| 1301/1563 [1:22:33<15:52,  3.63s/it]

{'epoch': 4, 'iter': 1300, 'avg_loss': 3.511002968678925, 'loss': 3.2279672622680664}


EP_train:4:  90%|| 1401/1563 [1:28:30<09:45,  3.61s/it]

{'epoch': 4, 'iter': 1400, 'avg_loss': 3.4938600099061237, 'loss': 3.1182541847229004}


EP_train:4:  96%|| 1501/1563 [1:34:13<03:29,  3.37s/it]

{'epoch': 4, 'iter': 1500, 'avg_loss': 3.475259596311911, 'loss': 3.1342809200286865}


EP_train:4: 100%|| 1563/1563 [1:37:41<00:00,  3.75s/it]

EP4_train, avg_loss= 3.465032256724967
EP:4 Model Saved on: models/pretrained_bert_imdb_ep4.pt
EP:4 Model Saved on: models/pretrained_bert_mlm_imdb_ep4.pt





## Inference

In [42]:
"""
    - id2token  : tokenizer.decode
    - token2id  : tokenizer.encode
"""

class MaskedTextGenerator():
    def __init__(self, model_path, sample_tokens, top_k=5):
        self.path = model_path
        self.sample_tokens = sample_tokens
        self.k = top_k

    def decode(self, tokens):
        return tokenizer.decode(tokens)
    
    def get_prediction(self):
        # load model and set to eval mode
        model = torch.load(self.path)
        model.eval()
        
        prediction = model.forward(self.sample_tokens)
        
        masked_index = np.where(self.sample_tokens == tokenizer.mask_token_id)
        masked_index = masked_index[1]
        mask_prediction = prediction[0][masked_index]
        top_indices = mask_prediction[0].argsort()[-self.k :]
        values = mask_prediction[0][top_indices]

        for i in range(len(top_indices)):
            p = top_indices[i]
            v = torch.exp(values[i]) #log softmax에서 log 상쇄
            tokens = np.copy(self.sample_tokens[0])
            tokens[masked_index[0]] = p
            
            result = {
                "input_text": self.decode(self.sample_tokens[0]),
                "prediction": self.decode(tokens),
                "probability": v,
                "predicted mask token": self.decode(p),
            }
            pprint(result)
            print("\n")

In [60]:
sample_tokens = tokenizer.encode("I have watched this [MASK] and it was awesome") # list
MaskedTextGenerator('models/pretrained_bert_mlm_imdb_ep3.pt', torch.tensor(sample_tokens).reshape(1,-1)).get_prediction()

{'input_text': '[CLS] i have watched this [MASK] and it was awesome [SEP]',
 'predicted mask token': 't h e',
 'prediction': '[CLS] i have watched this the and it was awesome [SEP]',
 'probability': tensor(0.0196, grad_fn=<ExpBackward>)}


{'input_text': '[CLS] i have watched this [MASK] and it was awesome [SEP]',
 'predicted mask token': 'b u t',
 'prediction': '[CLS] i have watched this but and it was awesome [SEP]',
 'probability': tensor(0.0227, grad_fn=<ExpBackward>)}


{'input_text': '[CLS] i have watched this [MASK] and it was awesome [SEP]',
 'predicted mask token': 'a n d',
 'prediction': '[CLS] i have watched this and and it was awesome [SEP]',
 'probability': tensor(0.0296, grad_fn=<ExpBackward>)}


{'input_text': '[CLS] i have watched this [MASK] and it was awesome [SEP]',
 'predicted mask token': 'h a v e',
 'prediction': '[CLS] i have watched this have and it was awesome [SEP]',
 'probability': tensor(0.0348, grad_fn=<ExpBackward>)}


{'input_text': '[CLS] i have watched 

___

## Fine-tune a sentiment classification model

We will fine-tune our self-supervised model on a downstream task of sentiment classification.
To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the
pretrained BERT features.

* Reference 1 : https://www.kaggle.com/kernelk/imdb-sa-bert-fine-tuning-gpu-92-acc
* Reference 2 : https://github.com/rahulbhalley/sentiment-analysis-bert.pytorch/blob/master/main.py

**ISSUE 1** : `requires_grad` vs `no_grad` 차이
* Reference 1 [(링크)](https://stackoverflow.com/questions/63785319/pytorch-torch-no-grad-versus-requires-grad-false)
> torch.no_grad()는 오차 역전파에 사용하는 계산량을 줄여서 처리 속도를 높임
* Reference 2 [(링크)](https://stackoverflow.com/questions/51748138/pytorch-how-to-set-requires-grad-false)
> torch.no_grad()로 감싸는 이유는, 가중치들이 requires_grad=True 지만
> autograd에서는 이를 추적하지 않음
* Reference 3 [(링크1)](https://statisticsplaybook.tistory.com/12) [(링크2)](https://codlingual.tistory.com/72)
> autograd / detach / requires_grad / no_grad 설명
> - 해당 텐서에 대한 계산 모두 tracking해서 기울기 구해주기 : requires_grad=True 
> - 중간에 requires_grad 넣어주고 싶다면(in-place) : .requires_grad_(True)
> - 중간에 requires_grad 그만하고 싶다면 : with torch.no_grad() 
* Reference 4 [(링크)](https://discuss.pytorch.org/t/no-grad-vs-requires-grad/21272)
> no_grad()는 backward() 호출하지 않음 (= gradient 계산하지 않음) <br>
requires_grad 는 tensor 생성시와 관련, 
기본적으로 nn.Modules 패키지에서는 gradient 가 필요한 layer의 parameter들이 정의될 때, requires_grad가 True로 설정됨 

**[정리]** : no_grad 는 gradient를 계산할 것인가 말 것인가 <br>
&emsp; &emsp; &emsp; requires_grad는 학습 가능한 parameter로서 메모리에 올릴 것인가 말 것인가

&emsp; &emsp; &emsp; no_grad() 로 감싸더라도, 정의된 parameter의 tensor 속성에서 reuqires_grad = True일 수 있음


In [61]:
class SentimentClassifierBERT(nn.Module):
    """
    BERT based Sentiment Classification Model
    """
    def __init__(self, bert: BERT, max_len, hidden, num_class, n_layers, freeze):
        super().__init__()
        self.bert = bert # pre-trained BERT, output size : (32, 256, 128)
        self.max_len = max_len
        self.hidden = hidden
        self.num_class = num_class
        self.n_layers = n_layers
        
        self.linear_1 = nn.Linear(self.hidden, int(self.hidden/2))        # 128 -> 64
        self.linear_2 = nn.Linear(int(self.hidden/2), self.num_class-1)   # 64 -> 1
        
        # True for freezing the BERT model when train the classifier 
        # False for unfreezing the BERT model for fine-tuning
        self.freeze = freeze
        
        # ISSUE
        if self.freeze:
            for param in self.bert.parameters():
                param.requires_grad = False
        else:
            for param in self.bert.parameters():
                param.requires_grad = True
    # ISSUE      
    def forward(self, x):
#         if self.freeze:
#             with torch.no_grad():
#                 out = self.bert(x)
#         else:
#             out = self.bert(x)
        out = self.bert(x)
    
        # CLS representation 부분만 추출 (32, 256, 128) -> (32, 128)
        # https://www.kaggle.com/kernelk/imdb-sa-bert-fine-tuning-gpu-92-acc
        out = out[:,:1,:].reshape(-1, self.hidden)
        out = self.linear_1(out)
        out = F.relu(out)
        out = self.linear_2(out)
        out = F.sigmoid(out)

        return out
        
# Train the classifier
class SentimentClassifierBERTTrainer:
    """
    BERTMLMTrainer make the pretrained BERT model with Masked Language Model.
    """

    def __init__(self, model: SentimentClassifierBERT,
                 train_dataloader: DataLoader, test_dataloader: DataLoader = None,
                 lr: float = 1e-4, betas=(0.9, 0.999), weight_decay: float = 0.01, warmup_steps=10000,
                 with_cuda: bool = True, log_freq: int = 10):
        """
        :param model: BERT based Sentiment Classification Model which you want to train
        :param train_dataloader: train dataset data loader
        :param test_dataloader: test dataset data loader [can be None]
        :param lr: learning rate of optimizer
        :param betas: Adam optimizer betas
        :param weight_decay: Adam optimizer weight decay param
        :param with_cuda: traning with cuda
        :param log_freq: logging frequency of the batch iteration
        """

        # Setup cuda device for BERT training, argument -c, --cuda should be true
        # CUDA out of memory.로 False 처리
        cuda_condition = False # torch.cuda.is_available() and with_cuda
        self.device = torch.device("cuda:0" if cuda_condition else "cpu")

        # Initialize the BERT based Sentiment Classification Model
        self.model = model.to(self.device)

        # Setting the train and test data loader
        self.train_data = train_dataloader
        self.test_data = test_dataloader

        # Setting the Adam optimizer with hyper-param
        self.optim = Adam(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
        self.optim_schedule = ScheduledOptim(self.optim, self.model.hidden, n_warmup_steps=warmup_steps)

        # Using Negative Log Likelihood Loss function for predicting the masked_token
        self.criterion = nn.BCELoss()

        self.log_freq = log_freq
    
        
    def train(self, epoch):
        self.iteration(epoch, self.train_data)

    def test(self, epoch):
        self.iteration(epoch, self.test_data, train=False)
        
    # computes accuracy
    def binary_accuracy(self, preds, y):
        # rounded_preds = torch.round(torch.sigmoid(preds))
        rounded_preds = torch.round(preds)
        correct = (rounded_preds == y).float()
        acc = correct.sum() / len(correct)
        return acc

    def iteration(self, epoch, data_loader, train=True):
        """
        loop over the data_loader for training or testing
        if on train status, backward operation is activated
        and also auto save the model every peoch
        :param epoch: current epoch index
        :param data_loader: torch.utils.data.DataLoader for iteration
        :param train: boolean value of is train or test
        :return: None
        """
        str_code = "train" if train else "test"

        # Setting the tqdm progress bar
        data_iter = tqdm.tqdm(enumerate(data_loader),
                              desc="EP_%s:%d" % (str_code, epoch),
                              total=len(data_loader),
                              bar_format="{l_bar}{r_bar}")

        avg_loss = 0.0
        avg_acc = 0.0
        
        for i, data in data_iter:

            # 0. batch_data will be sent into the device(GPU or cpu)
            data = {key: value.to(self.device) for key, value in data.items()}

            # 1. forward the bert classifier model
            classifier_output = self.model.forward(data["bert_input"]) # input : torch.Size([32, 256, 30000])

            # Binary Cross Entropy Loss
            target = data["bert_label"].reshape(classifier_output.shape[0], -1)
            loss = self.criterion(classifier_output, target.to(torch.float32)) # torch.Size([32, 1]), torch.Size([32, 1])

            # 2. backward and optimization only in train
            if train:
                self.optim_schedule.zero_grad()
                loss.backward()
                self.optim_schedule.step_and_update_lr()
    
            avg_loss += loss.item()

            # 3. Calculate Accuracy during Test
            if not train:
                acc = self.binary_accuracy(torch.tensor(classifier_output.reshape(-1)), data["bert_label"]) # torch.Size([32])
                avg_acc += acc.item()
                
            post_fix = {
                "epoch": epoch,
                "iter": i,
                "avg_loss": avg_loss / (i + 1),
                "loss": loss.item(),
                "avg_acc": avg_acc / (i + 1)
           }
                            
            if i % self.log_freq == 0:
                data_iter.write(str(post_fix))

        print("EP%d_%s, avg_loss=" % (epoch, str_code), avg_loss / len(data_iter))
        
        if not train:
            print("** Average Accuracy=", avg_acc / len(data_iter))
            print("\n")


    def save(self, epoch, file_path):
        """
        Saving the current BERT Classification Model on file_path
        :param epoch: current epoch number
        :param file_path: model output path which gonna be file_path+"ep%d" % epoch
        :return: final_output_path
        """
        
        output_path = file_path
        torch.save(self.model.cpu(), output_path)
        self.model.to(self.device)

        print("EP:%d Model Saved on:" % epoch, output_path)
        return output_path

### Train the classifier with frozen BERT stage

In [64]:
# mlm_model = torch.load('models/pretrained_bert_mlm_imdb_ep4.pt')
# mlm_model.train()

# pretrained_bert_model = mlm_model.bert

pretrained_bert_model = torch.load('models/pretrained_bert_imdb_ep4.pt')
pretrained_bert_model.train()

# Set freeze=True
print("Building BERT based Sentiment Classification model")
sentiment_classifer_bert = SentimentClassifierBERT(pretrained_bert_model, max_len=config.MAX_LEN,
                                                   hidden=config.EMBED_DIM, num_class = 2, 
                                                   n_layers=config.NUM_LAYERS, freeze=True)


# Method 1
# pretrained_bert_model.eval() 
# #질문 : eval 사용시 dropout 이나 batchnorm 수행 안 되는데, 
# #이렇게 freeze하고 학습시키는 것이 일반적인가? 아니면, 직접 gradient 계산 안 되게 하고 dropout이나 batchnor 수행되게 해야하는지

# Method 2
# # 직접 requires_grad False처리 -> 아래 기능을 SentimentClassifierBERT 내에서 option 으로 정의해둠
# for _ in pretrained_bert_model.parameters():
#     _.requires_grad_(False)

# sentiment_classifer_bert = SentimentClassifierBERT(pretrained_bert_model, max_len=config.MAX_LEN,
#                                                    hidden=config.EMBED_DIM, num_class = 2, 
#                                                    n_layers=config.NUM_LAYERS, freeze=True)

trainable_para = sum([p.nelement() for p in sentiment_classifer_bert.parameters() if p.requires_grad])
non_trainable_para = sum([p.nelement() for p in sentiment_classifer_bert.bert.parameters() if not p.requires_grad])

print("===================================")
print("Total Parameters:", trainable_para + non_trainable_para)
print("Trainable Parameters:", trainable_para)
print("Non-trainable Parameters:", non_trainable_para)
print("===================================")

print("Creating SentimentClassifierBERT Trainer")
trainer = SentimentClassifierBERTTrainer(sentiment_classifer_bert, train_dataloader=train_loader,
                         test_dataloader=test_loader, lr=config.LR, log_freq=100)

print("Training Start")
print("> Train the classifier with frozen BERT stage")
for epoch in range(5):
    trainer.train(epoch)

    # Save trained classifier
    trainer.save(epoch, file_path="models/frozen_bert_sent_classifier_imdb" + "_ep%d" % epoch + ".pt")
    
    if test_loader is not None:
        trainer.test(epoch)

EP_train:0:   0%|| 1/782 [00:00<02:29,  5.22it/s]

Building BERT based Sentiment Classification model
Total Parameters: 3947905
Trainable Parameters: 8321
Non-trainable Parameters: 3939584
Creating SentimentClassifierBERT Trainer
Training Start
> Train the classifier with frozen BERT stage
{'epoch': 0, 'iter': 0, 'avg_loss': 0.8372408747673035, 'loss': 0.8372408747673035, 'avg_acc': 0.0}


EP_train:0:  13%|| 102/782 [00:19<02:10,  5.21it/s]

{'epoch': 0, 'iter': 100, 'avg_loss': 0.7326829687203511, 'loss': 0.6723736524581909, 'avg_acc': 0.0}


EP_train:0:  26%|| 202/782 [00:38<01:50,  5.26it/s]

{'epoch': 0, 'iter': 200, 'avg_loss': 0.7206003381245172, 'loss': 0.7028611898422241, 'avg_acc': 0.0}


EP_train:0:  39%|| 302/782 [00:57<01:33,  5.16it/s]

{'epoch': 0, 'iter': 300, 'avg_loss': 0.714116308380203, 'loss': 0.7097827792167664, 'avg_acc': 0.0}


EP_train:0:  51%|| 402/782 [01:17<01:14,  5.13it/s]

{'epoch': 0, 'iter': 400, 'avg_loss': 0.7098862756220182, 'loss': 0.6920859217643738, 'avg_acc': 0.0}


EP_train:0:  64%|| 501/782 [01:35<00:52,  5.33it/s]

{'epoch': 0, 'iter': 500, 'avg_loss': 0.7072109860812357, 'loss': 0.6763255596160889, 'avg_acc': 0.0}


EP_train:0:  77%|| 602/782 [01:55<00:34,  5.16it/s]

{'epoch': 0, 'iter': 600, 'avg_loss': 0.7054477796181664, 'loss': 0.6936476826667786, 'avg_acc': 0.0}


EP_train:0:  90%|| 702/782 [02:14<00:15,  5.27it/s]

{'epoch': 0, 'iter': 700, 'avg_loss': 0.7036543609242296, 'loss': 0.6812425255775452, 'avg_acc': 0.0}


EP_train:0: 100%|| 782/782 [02:29<00:00,  5.23it/s]
EP_test:0:   0%|| 0/782 [00:00<?, ?it/s]

EP0_train, avg_loss= 0.7027720026957714
EP:0 Model Saved on: models/frozen_bert_sent_classifier_imdb_ep0.pt


EP_test:0:   0%|| 1/782 [00:00<02:32,  5.13it/s]

{'epoch': 0, 'iter': 0, 'avg_loss': 0.6935998201370239, 'loss': 0.6935998201370239, 'avg_acc': 0.5625}


EP_test:0:  13%|| 102/782 [00:19<02:09,  5.26it/s]

{'epoch': 0, 'iter': 100, 'avg_loss': 0.6941214451695433, 'loss': 0.691312849521637, 'avg_acc': 0.5077351485148515}


EP_test:0:  26%|| 202/782 [00:38<01:53,  5.13it/s]

{'epoch': 0, 'iter': 200, 'avg_loss': 0.6944519402968943, 'loss': 0.692236065864563, 'avg_acc': 0.5009328358208955}


EP_test:0:  39%|| 302/782 [00:57<01:31,  5.24it/s]

{'epoch': 0, 'iter': 300, 'avg_loss': 0.6938447065131609, 'loss': 0.6936025023460388, 'avg_acc': 0.5060215946843853}


EP_test:0:  51%|| 402/782 [01:16<01:11,  5.32it/s]

{'epoch': 0, 'iter': 400, 'avg_loss': 0.6941319333942156, 'loss': 0.6993346810340881, 'avg_acc': 0.5041302992518704}


EP_test:0:  64%|| 502/782 [01:35<00:53,  5.22it/s]

{'epoch': 0, 'iter': 500, 'avg_loss': 0.6938351574772132, 'loss': 0.6733691096305847, 'avg_acc': 0.5062375249500998}


EP_test:0:  77%|| 602/782 [01:54<00:34,  5.18it/s]

{'epoch': 0, 'iter': 600, 'avg_loss': 0.6932642514416064, 'loss': 0.6702039837837219, 'avg_acc': 0.5097233777038269}


EP_test:0:  90%|| 702/782 [02:14<00:15,  5.26it/s]

{'epoch': 0, 'iter': 700, 'avg_loss': 0.6933398903181481, 'loss': 0.7041476964950562, 'avg_acc': 0.5086929386590585}


EP_test:0: 100%|| 782/782 [02:29<00:00,  5.25it/s]
EP_train:1:   0%|| 1/782 [00:00<02:30,  5.20it/s]

EP0_test, avg_loss= 0.6933572802244855
** Average Accuracy= 0.5076726342710998


{'epoch': 1, 'iter': 0, 'avg_loss': 0.6975748538970947, 'loss': 0.6975748538970947, 'avg_acc': 0.0}


EP_train:1:  13%|| 102/782 [00:19<02:09,  5.24it/s]

{'epoch': 1, 'iter': 100, 'avg_loss': 0.6921120038126954, 'loss': 0.6841808557510376, 'avg_acc': 0.0}


EP_train:1:  26%|| 202/782 [00:38<01:48,  5.35it/s]

{'epoch': 1, 'iter': 200, 'avg_loss': 0.6935400933175537, 'loss': 0.6848986744880676, 'avg_acc': 0.0}


EP_train:1:  39%|| 302/782 [00:57<01:31,  5.27it/s]

{'epoch': 1, 'iter': 300, 'avg_loss': 0.6944372281679679, 'loss': 0.6909135580062866, 'avg_acc': 0.0}


EP_train:1:  51%|| 402/782 [01:17<01:13,  5.18it/s]

{'epoch': 1, 'iter': 400, 'avg_loss': 0.6943746244818195, 'loss': 0.7187724113464355, 'avg_acc': 0.0}


EP_train:1:  64%|| 502/782 [01:36<00:53,  5.20it/s]

{'epoch': 1, 'iter': 500, 'avg_loss': 0.6945163479108296, 'loss': 0.6970822215080261, 'avg_acc': 0.0}


EP_train:1:  77%|| 602/782 [01:55<00:34,  5.23it/s]

{'epoch': 1, 'iter': 600, 'avg_loss': 0.6951956585322363, 'loss': 0.6985996961593628, 'avg_acc': 0.0}


EP_train:1:  90%|| 702/782 [02:14<00:15,  5.12it/s]

{'epoch': 1, 'iter': 700, 'avg_loss': 0.6949623694773578, 'loss': 0.6778969764709473, 'avg_acc': 0.0}


EP_train:1: 100%|| 782/782 [02:29<00:00,  5.23it/s]
EP_test:1:   0%|| 0/782 [00:00<?, ?it/s]

EP1_train, avg_loss= 0.6949799536439158
EP:1 Model Saved on: models/frozen_bert_sent_classifier_imdb_ep1.pt


EP_test:1:   0%|| 2/782 [00:00<02:29,  5.23it/s]

{'epoch': 1, 'iter': 0, 'avg_loss': 0.6907983422279358, 'loss': 0.6907983422279358, 'avg_acc': 0.53125}


EP_test:1:  13%|| 102/782 [00:19<02:08,  5.31it/s]

{'epoch': 1, 'iter': 100, 'avg_loss': 0.6952131957110792, 'loss': 0.6886323690414429, 'avg_acc': 0.5}


EP_test:1:  26%|| 202/782 [00:38<01:49,  5.31it/s]

{'epoch': 1, 'iter': 200, 'avg_loss': 0.6934152844533399, 'loss': 0.7092522978782654, 'avg_acc': 0.509639303482587}


EP_test:1:  38%|| 301/782 [00:57<01:37,  4.91it/s]

{'epoch': 1, 'iter': 300, 'avg_loss': 0.6939746608765814, 'loss': 0.7141644954681396, 'avg_acc': 0.5057101328903655}


EP_test:1:  51%|| 401/782 [01:16<01:19,  4.79it/s]

{'epoch': 1, 'iter': 400, 'avg_loss': 0.694219509562352, 'loss': 0.6918331384658813, 'avg_acc': 0.5038965087281796}


EP_test:1:  64%|| 502/782 [01:35<00:53,  5.23it/s]

{'epoch': 1, 'iter': 500, 'avg_loss': 0.6939664992267738, 'loss': 0.6868457794189453, 'avg_acc': 0.5051771457085829}


EP_test:1:  77%|| 602/782 [01:54<00:34,  5.19it/s]

{'epoch': 1, 'iter': 600, 'avg_loss': 0.6945508972380602, 'loss': 0.712501585483551, 'avg_acc': 0.5011439267886856}


EP_test:1:  90%|| 702/782 [02:13<00:15,  5.31it/s]

{'epoch': 1, 'iter': 700, 'avg_loss': 0.6944295147516248, 'loss': 0.7003517746925354, 'avg_acc': 0.5025410128388017}


EP_test:1: 100%|| 782/782 [02:28<00:00,  5.25it/s]
EP_train:2:   0%|| 1/782 [00:00<02:28,  5.24it/s]

EP1_test, avg_loss= 0.6947336402695502
** Average Accuracy= 0.501238810741688


{'epoch': 2, 'iter': 0, 'avg_loss': 0.6847983002662659, 'loss': 0.6847983002662659, 'avg_acc': 0.0}


EP_train:2:  13%|| 102/782 [00:19<02:08,  5.28it/s]

{'epoch': 2, 'iter': 100, 'avg_loss': 0.6984588214666536, 'loss': 0.6993970274925232, 'avg_acc': 0.0}


EP_train:2:  26%|| 202/782 [00:38<01:50,  5.23it/s]

{'epoch': 2, 'iter': 200, 'avg_loss': 0.6970613865710017, 'loss': 0.6904000639915466, 'avg_acc': 0.0}


EP_train:2:  38%|| 301/782 [00:57<01:30,  5.29it/s]

{'epoch': 2, 'iter': 300, 'avg_loss': 0.6962155079920822, 'loss': 0.6946625709533691, 'avg_acc': 0.0}


EP_train:2:  51%|| 401/782 [01:17<01:11,  5.29it/s]

{'epoch': 2, 'iter': 400, 'avg_loss': 0.6963142725892197, 'loss': 0.7180339097976685, 'avg_acc': 0.0}


EP_train:2:  64%|| 502/782 [01:36<00:54,  5.14it/s]

{'epoch': 2, 'iter': 500, 'avg_loss': 0.6958215035602242, 'loss': 0.6912544965744019, 'avg_acc': 0.0}


EP_train:2:  77%|| 602/782 [01:55<00:35,  5.04it/s]

{'epoch': 2, 'iter': 600, 'avg_loss': 0.6959143745125629, 'loss': 0.7070531845092773, 'avg_acc': 0.0}


EP_train:2:  90%|| 702/782 [02:15<00:15,  5.25it/s]

{'epoch': 2, 'iter': 700, 'avg_loss': 0.6958309144164289, 'loss': 0.6940892338752747, 'avg_acc': 0.0}


EP_train:2: 100%|| 782/782 [02:30<00:00,  5.20it/s]
EP_test:2:   0%|| 0/782 [00:00<?, ?it/s]

EP2_train, avg_loss= 0.6959463668906171
EP:2 Model Saved on: models/frozen_bert_sent_classifier_imdb_ep2.pt


EP_test:2:   0%|| 2/782 [00:00<02:25,  5.35it/s]

{'epoch': 2, 'iter': 0, 'avg_loss': 0.7022498846054077, 'loss': 0.7022498846054077, 'avg_acc': 0.53125}


EP_test:2:  13%|| 102/782 [00:19<02:10,  5.19it/s]

{'epoch': 2, 'iter': 100, 'avg_loss': 0.705781898286083, 'loss': 0.7104378938674927, 'avg_acc': 0.4969059405940594}


EP_test:2:  26%|| 202/782 [00:38<01:49,  5.30it/s]

{'epoch': 2, 'iter': 200, 'avg_loss': 0.7013423383532472, 'loss': 0.7205120325088501, 'avg_acc': 0.507773631840796}


EP_test:2:  39%|| 302/782 [00:57<01:31,  5.23it/s]

{'epoch': 2, 'iter': 300, 'avg_loss': 0.702906066991166, 'loss': 0.7458125948905945, 'avg_acc': 0.5037375415282392}


EP_test:2:  51%|| 401/782 [01:16<01:14,  5.12it/s]

{'epoch': 2, 'iter': 400, 'avg_loss': 0.703130965369598, 'loss': 0.6988612413406372, 'avg_acc': 0.5028834164588528}


EP_test:2:  64%|| 502/782 [01:35<00:54,  5.18it/s]

{'epoch': 2, 'iter': 500, 'avg_loss': 0.7024619638800859, 'loss': 0.6833321452140808, 'avg_acc': 0.5043662674650699}


EP_test:2:  77%|| 602/782 [01:54<00:34,  5.22it/s]

{'epoch': 2, 'iter': 600, 'avg_loss': 0.7035155834850177, 'loss': 0.7579454779624939, 'avg_acc': 0.5007799500831946}


EP_test:2:  90%|| 702/782 [02:13<00:15,  5.21it/s]

{'epoch': 2, 'iter': 700, 'avg_loss': 0.7034609396855603, 'loss': 0.704017698764801, 'avg_acc': 0.50160485021398}


EP_test:2: 100%|| 782/782 [02:28<00:00,  5.26it/s]
EP_train:3:   0%|| 1/782 [00:00<02:30,  5.19it/s]

EP2_test, avg_loss= 0.7037517400958654
** Average Accuracy= 0.5002797314578005


{'epoch': 3, 'iter': 0, 'avg_loss': 0.7362059354782104, 'loss': 0.7362059354782104, 'avg_acc': 0.0}


EP_train:3:  13%|| 102/782 [00:19<02:09,  5.24it/s]

{'epoch': 3, 'iter': 100, 'avg_loss': 0.696674394135428, 'loss': 0.6924814581871033, 'avg_acc': 0.0}


EP_train:3:  26%|| 201/782 [00:38<01:51,  5.21it/s]

{'epoch': 3, 'iter': 200, 'avg_loss': 0.6972500228170139, 'loss': 0.6878581047058105, 'avg_acc': 0.0}


EP_train:3:  39%|| 302/782 [00:57<01:29,  5.35it/s]

{'epoch': 3, 'iter': 300, 'avg_loss': 0.696761871691162, 'loss': 0.6838343739509583, 'avg_acc': 0.0}


EP_train:3:  51%|| 402/782 [01:16<01:13,  5.18it/s]

{'epoch': 3, 'iter': 400, 'avg_loss': 0.6962556630893241, 'loss': 0.6683324575424194, 'avg_acc': 0.0}


EP_train:3:  64%|| 502/782 [01:35<00:53,  5.24it/s]

{'epoch': 3, 'iter': 500, 'avg_loss': 0.6963780452153402, 'loss': 0.6903771758079529, 'avg_acc': 0.0}


EP_train:3:  77%|| 601/782 [01:54<00:39,  4.57it/s]

{'epoch': 3, 'iter': 600, 'avg_loss': 0.6968694054544865, 'loss': 0.7062279582023621, 'avg_acc': 0.0}


EP_train:3:  90%|| 702/782 [02:13<00:15,  5.18it/s]

{'epoch': 3, 'iter': 700, 'avg_loss': 0.6974458719116134, 'loss': 0.6994295120239258, 'avg_acc': 0.0}


EP_train:3: 100%|| 782/782 [02:29<00:00,  5.25it/s]
EP_test:3:   0%|| 0/782 [00:00<?, ?it/s]

EP3_train, avg_loss= 0.6971708307485751
EP:3 Model Saved on: models/frozen_bert_sent_classifier_imdb_ep3.pt


EP_test:3:   0%|| 2/782 [00:00<02:28,  5.26it/s]

{'epoch': 3, 'iter': 0, 'avg_loss': 0.6864240169525146, 'loss': 0.6864240169525146, 'avg_acc': 0.5625}


EP_test:3:  13%|| 102/782 [00:19<02:09,  5.27it/s]

{'epoch': 3, 'iter': 100, 'avg_loss': 0.693004323114263, 'loss': 0.6865842342376709, 'avg_acc': 0.500309405940594}


EP_test:3:  26%|| 202/782 [00:38<01:51,  5.21it/s]

{'epoch': 3, 'iter': 200, 'avg_loss': 0.6926989353711333, 'loss': 0.6963832378387451, 'avg_acc': 0.5046641791044776}


EP_test:3:  39%|| 302/782 [00:57<01:31,  5.22it/s]

{'epoch': 3, 'iter': 300, 'avg_loss': 0.692268161282587, 'loss': 0.6860608458518982, 'avg_acc': 0.5118355481727574}


EP_test:3:  51%|| 402/782 [01:16<01:13,  5.16it/s]

{'epoch': 3, 'iter': 400, 'avg_loss': 0.6924411713928356, 'loss': 0.6955655217170715, 'avg_acc': 0.5116895261845387}


EP_test:3:  64%|| 502/782 [01:35<00:53,  5.22it/s]

{'epoch': 3, 'iter': 500, 'avg_loss': 0.692532124752532, 'loss': 0.6922560930252075, 'avg_acc': 0.5111027944111777}


EP_test:3:  77%|| 602/782 [01:54<00:33,  5.30it/s]

{'epoch': 3, 'iter': 600, 'avg_loss': 0.6923321419071635, 'loss': 0.6813623905181885, 'avg_acc': 0.5142470881863561}


EP_test:3:  90%|| 702/782 [02:13<00:15,  5.29it/s]

{'epoch': 3, 'iter': 700, 'avg_loss': 0.6925283943365372, 'loss': 0.7004382014274597, 'avg_acc': 0.5121255349500713}


EP_test:3: 100%|| 782/782 [02:29<00:00,  5.25it/s]
EP_train:4:   0%|| 1/782 [00:00<02:27,  5.30it/s]

EP3_test, avg_loss= 0.69251077452584
** Average Accuracy= 0.5116687979539642


{'epoch': 4, 'iter': 0, 'avg_loss': 0.6978344917297363, 'loss': 0.6978344917297363, 'avg_acc': 0.0}


EP_train:4:  13%|| 102/782 [00:19<02:10,  5.21it/s]

{'epoch': 4, 'iter': 100, 'avg_loss': 0.6980472814918744, 'loss': 0.7014200091362, 'avg_acc': 0.0}


EP_train:4:  26%|| 202/782 [00:38<01:49,  5.30it/s]

{'epoch': 4, 'iter': 200, 'avg_loss': 0.6973533037290052, 'loss': 0.7180112600326538, 'avg_acc': 0.0}


EP_train:4:  39%|| 302/782 [00:57<01:30,  5.28it/s]

{'epoch': 4, 'iter': 300, 'avg_loss': 0.6965466226850238, 'loss': 0.6993234157562256, 'avg_acc': 0.0}


EP_train:4:  51%|| 402/782 [01:16<01:12,  5.23it/s]

{'epoch': 4, 'iter': 400, 'avg_loss': 0.6963900740901728, 'loss': 0.723945677280426, 'avg_acc': 0.0}


EP_train:4:  64%|| 502/782 [01:35<00:53,  5.26it/s]

{'epoch': 4, 'iter': 500, 'avg_loss': 0.6964390457509283, 'loss': 0.6874792575836182, 'avg_acc': 0.0}


EP_train:4:  77%|| 602/782 [01:54<00:34,  5.15it/s]

{'epoch': 4, 'iter': 600, 'avg_loss': 0.6965688869679431, 'loss': 0.7117697596549988, 'avg_acc': 0.0}


EP_train:4:  90%|| 702/782 [02:13<00:15,  5.28it/s]

{'epoch': 4, 'iter': 700, 'avg_loss': 0.6962871555765073, 'loss': 0.6867028474807739, 'avg_acc': 0.0}


EP_train:4: 100%|| 782/782 [02:29<00:00,  5.25it/s]
EP_test:4:   0%|| 0/782 [00:00<?, ?it/s]

EP4_train, avg_loss= 0.6965518442871016
EP:4 Model Saved on: models/frozen_bert_sent_classifier_imdb_ep4.pt


EP_test:4:   0%|| 2/782 [00:00<02:33,  5.10it/s]

{'epoch': 4, 'iter': 0, 'avg_loss': 0.689827561378479, 'loss': 0.689827561378479, 'avg_acc': 0.46875}


EP_test:4:  13%|| 102/782 [00:19<02:08,  5.29it/s]

{'epoch': 4, 'iter': 100, 'avg_loss': 0.6937542799675819, 'loss': 0.6952723860740662, 'avg_acc': 0.5071163366336634}


EP_test:4:  26%|| 202/782 [00:38<01:49,  5.30it/s]

{'epoch': 4, 'iter': 200, 'avg_loss': 0.6941451258327237, 'loss': 0.6864056587219238, 'avg_acc': 0.5007773631840796}


EP_test:4:  39%|| 302/782 [00:57<01:31,  5.27it/s]

{'epoch': 4, 'iter': 300, 'avg_loss': 0.6935451810938179, 'loss': 0.6767478585243225, 'avg_acc': 0.5070598006644518}


EP_test:4:  51%|| 402/782 [01:16<01:11,  5.31it/s]

{'epoch': 4, 'iter': 400, 'avg_loss': 0.6937835805136664, 'loss': 0.7023389935493469, 'avg_acc': 0.5066240648379052}


EP_test:4:  64%|| 502/782 [01:35<00:52,  5.37it/s]

{'epoch': 4, 'iter': 500, 'avg_loss': 0.6937460877938185, 'loss': 0.7005965113639832, 'avg_acc': 0.5066117764471058}


EP_test:4:  77%|| 602/782 [01:54<00:33,  5.42it/s]

{'epoch': 4, 'iter': 600, 'avg_loss': 0.6934431920234058, 'loss': 0.6802788972854614, 'avg_acc': 0.5091514143094842}


EP_test:4:  90%|| 702/782 [02:13<00:15,  5.27it/s]

{'epoch': 4, 'iter': 700, 'avg_loss': 0.6936264697053123, 'loss': 0.7100619673728943, 'avg_acc': 0.5080242510699001}


EP_test:4: 100%|| 782/782 [02:28<00:00,  5.28it/s]

EP4_test, avg_loss= 0.6935905889629403
** Average Accuracy= 0.5083120204603581







### Train the classifier with unfrozen BERT stage
* Unfreeze the BERT model for fine-tuning

**ISSUE 2** : ckpt 불러와서 모든 parameter trainable하게 만들기 

   * 모델 정의 : args로 분기 태워서 freeze / unfreeze 하도록 함 (자체적으로 모델 학습 시에는 적용 ok)
   * freeze = True로 학습하고 저장 후 모델 불러왔을 때에는 freeze=True가 그대로 유지.
     아래처럼 ckpt를 load하고 freeze=False해도 반영이 되지 않음
    
```Python
# ISSUE
# trained_sentiment_classifier.freeze = False 
```

   * transfer learning시에 모델 받아서 직접 특정 layer를 `param.requires_grad=True/False`로 명시해야 함


**[참고] `model.eval()`과 `no_grad()`의 차이** [(링크)](https://coffeedjimmy.github.io/pytorch/2019/11/05/pytorch_nograd_vs_train_eval/)
- `model.eval()` : `torch.no_grad()`의 주된 목적은 autograd를 끔으로써 메모리 사용량을 줄이고 연산 속도를 높히기 위함이다. 사실상 어짜피 안쓸 gradient인데 inference시에 굳이 계산할 필요가 없지 않은가?
- `torch.no_grad()`만 쓰면 되지 않나? gradient 계산 안하고 이제 됐잖아 라고 생각할 수 있다. 맞는 말이지만, `model.eval()`의 역할은 약간 다르다. 현재(2019년) 시점에서는 모델링 시 training과 inference시에 다르게 동작하는 layer들이 존재한다. 예를 들면, Dropout layer는 학습시에는 동작해야하지만, inference시에는 동작하지 않는 것과 같은 예시를 들 수 있다. BatchNorm같은 경우도 마찬가지다.

- 사실상 `model.eval()`는 이런 layer들의 동작을 inference(eval) mode로 바꿔준다는 목적으로 사용된다. 따라서, 우리가 보통 원하는 모델의 동작을 위해서는 위의 두 가지를 모두 사용해야하는 것이 맞다.

#### Transfer learning with freezing
* [freeze selected layers of a model in Pytorch?](https://stackoverflow.com/questions/62523912/how-to-freeze-selected-layers-of-a-model-in-pytorch)
* [Transfer Learning tutorial(Py-Doc)](http://seba1511.net/tutorials/beginner/transfer_learning_tutorial.html)
* [Blog Post - freeze a speific layer](https://m.blog.naver.com/PostView.nhn?isHttpsRedirect=true&blogId=laowaibang&logNo=222155729655&proxyReferer=)
* [freeze example code](https://gist.github.com/L0SG/2f6d81e4ad119c4f798ab81fa8d62d3f)

In [70]:
# load trained BERT based sentiment classification model
print("Loading trained BERT based Sentiment Classifier")
trained_sentiment_classifier = torch.load('models/frozen_bert_sent_classifier_imdb_ep4.pt')
finetuned_frozen_bert_model = trained_sentiment_classifier.bert

# for param in trained_sentiment_classifier.parameters():
#     param.requires_grad = True

# Set freeze=True
print("Building BERT based Sentiment Classification model")
finetuned_sentiment_classifer_bert = SentimentClassifierBERT(finetuned_frozen_bert_model, max_len=config.MAX_LEN,
                                                   hidden=config.EMBED_DIM, num_class = 2, 
                                                   n_layers=config.NUM_LAYERS, freeze=False)

    
trainable_para = sum([p.nelement() for p in finetuned_sentiment_classifer_bert.parameters() if p.requires_grad])
non_trainable_para = sum([p.nelement() for p in finetuned_sentiment_classifer_bert.bert.parameters() if not p.requires_grad])

print("===================================")
print("Total Parameters:", trainable_para + non_trainable_para)
print("Trainable Parameters:", trainable_para)
print("Non-trainable Parameters:", non_trainable_para)
print("===================================")

print("Creating SentimentClassifierBERT Trainer")
trainer = SentimentClassifierBERTTrainer(finetuned_sentiment_classifer_bert, train_dataloader=train_loader,
                         test_dataloader=test_loader, lr=config.LR, log_freq=100)

print("Training Start")
print("> Train the classifier with unfrozen BERT stage")
for epoch in range(5):
    trainer.train(epoch)

    # Save fine-tuned classifier
    trainer.save(epoch, file_path="models/ft_bert_sent_classifier_imdb" + "_ep%d" % epoch + ".pt")
    
    if test_loader is not None:
        trainer.test(epoch)

EP_train:0:   0%|| 0/782 [00:00<?, ?it/s]

Loading trained BERT based Sentiment Classifier
Building BERT based Sentiment Classification model
Total Parameters: 3947905
Trainable Parameters: 3947905
Non-trainable Parameters: 0
Creating SentimentClassifierBERT Trainer
Training Start
> Train the classifier with unfrozen BERT stage


EP_train:0:   0%|| 1/782 [00:00<04:24,  2.95it/s]

{'epoch': 0, 'iter': 0, 'avg_loss': 1.3834377527236938, 'loss': 1.3834377527236938, 'avg_acc': 0.0}


EP_train:0:  13%|| 101/782 [00:31<03:37,  3.13it/s]

{'epoch': 0, 'iter': 100, 'avg_loss': 0.9261812302145628, 'loss': 0.6959831118583679, 'avg_acc': 0.0}


EP_train:0:  26%|| 201/782 [01:03<03:04,  3.14it/s]

{'epoch': 0, 'iter': 200, 'avg_loss': 0.8141611319276231, 'loss': 0.722547709941864, 'avg_acc': 0.0}


EP_train:0:  38%|| 301/782 [01:35<02:34,  3.12it/s]

{'epoch': 0, 'iter': 300, 'avg_loss': 0.7764555782178708, 'loss': 0.6736135482788086, 'avg_acc': 0.0}


EP_train:0:  51%|| 401/782 [02:07<02:00,  3.17it/s]

{'epoch': 0, 'iter': 400, 'avg_loss': 0.7574033540977801, 'loss': 0.7137646079063416, 'avg_acc': 0.0}


EP_train:0:  64%|| 501/782 [02:39<01:27,  3.21it/s]

{'epoch': 0, 'iter': 500, 'avg_loss': 0.7453714367634284, 'loss': 0.6914842128753662, 'avg_acc': 0.0}


EP_train:0:  77%|| 601/782 [03:10<00:56,  3.21it/s]

{'epoch': 0, 'iter': 600, 'avg_loss': 0.7371070376053428, 'loss': 0.6872071027755737, 'avg_acc': 0.0}


EP_train:0:  90%|| 701/782 [03:41<00:24,  3.25it/s]

{'epoch': 0, 'iter': 700, 'avg_loss': 0.7309767861509119, 'loss': 0.7067776918411255, 'avg_acc': 0.0}


EP_train:0: 100%|| 782/782 [04:06<00:00,  3.17it/s]
EP_test:0:   0%|| 0/782 [00:00<?, ?it/s]

EP0_train, avg_loss= 0.7272990787273172
EP:0 Model Saved on: models/ft_bert_sent_classifier_imdb_ep0.pt


EP_test:0:   0%|| 2/782 [00:00<02:15,  5.75it/s]

{'epoch': 0, 'iter': 0, 'avg_loss': 0.6924170851707458, 'loss': 0.6924170851707458, 'avg_acc': 0.53125}


EP_test:0:  13%|| 102/782 [00:16<01:44,  6.52it/s]

{'epoch': 0, 'iter': 100, 'avg_loss': 0.6929010436086371, 'loss': 0.6869609355926514, 'avg_acc': 0.5064975247524752}


EP_test:0:  26%|| 202/782 [00:32<01:33,  6.19it/s]

{'epoch': 0, 'iter': 200, 'avg_loss': 0.6923891774457486, 'loss': 0.7045437097549438, 'avg_acc': 0.5127487562189055}


EP_test:0:  39%|| 302/782 [00:48<01:21,  5.86it/s]

{'epoch': 0, 'iter': 300, 'avg_loss': 0.6924717305506582, 'loss': 0.6889855861663818, 'avg_acc': 0.5115240863787376}


EP_test:0:  51%|| 402/782 [01:03<00:59,  6.43it/s]

{'epoch': 0, 'iter': 400, 'avg_loss': 0.6929001103612847, 'loss': 0.6858383417129517, 'avg_acc': 0.5087281795511222}


EP_test:0:  64%|| 502/782 [01:20<00:47,  5.93it/s]

{'epoch': 0, 'iter': 500, 'avg_loss': 0.6924132626213714, 'loss': 0.6761914491653442, 'avg_acc': 0.5121631736526946}


EP_test:0:  77%|| 602/782 [01:36<00:27,  6.51it/s]

{'epoch': 0, 'iter': 600, 'avg_loss': 0.692743931653694, 'loss': 0.7078056335449219, 'avg_acc': 0.5077475041597338}


EP_test:0:  90%|| 702/782 [01:52<00:12,  6.19it/s]

{'epoch': 0, 'iter': 700, 'avg_loss': 0.6929490799230448, 'loss': 0.709393322467804, 'avg_acc': 0.5071772467902995}


EP_test:0: 100%|| 782/782 [02:05<00:00,  6.22it/s]
EP_train:1:   0%|| 0/782 [00:00<?, ?it/s]

EP0_test, avg_loss= 0.6931911559056139
** Average Accuracy= 0.5059942455242967




EP_train:1:   0%|| 1/782 [00:00<04:03,  3.21it/s]

{'epoch': 1, 'iter': 0, 'avg_loss': 0.6913052797317505, 'loss': 0.6913052797317505, 'avg_acc': 0.0}


EP_train:1:  13%|| 101/782 [00:31<03:34,  3.18it/s]

{'epoch': 1, 'iter': 100, 'avg_loss': 0.6948315920216022, 'loss': 0.6909817457199097, 'avg_acc': 0.0}


EP_train:1:  26%|| 201/782 [01:03<02:58,  3.26it/s]

{'epoch': 1, 'iter': 200, 'avg_loss': 0.6954520344734192, 'loss': 0.6913866996765137, 'avg_acc': 0.0}


EP_train:1:  38%|| 301/782 [01:34<02:28,  3.24it/s]

{'epoch': 1, 'iter': 300, 'avg_loss': 0.6953620140338657, 'loss': 0.6954710483551025, 'avg_acc': 0.0}


EP_train:1:  51%|| 401/782 [02:05<02:01,  3.13it/s]

{'epoch': 1, 'iter': 400, 'avg_loss': 0.6963499403951174, 'loss': 0.695523738861084, 'avg_acc': 0.0}


EP_train:1:  64%|| 501/782 [02:37<01:28,  3.19it/s]

{'epoch': 1, 'iter': 500, 'avg_loss': 0.6963460391866947, 'loss': 0.6861171126365662, 'avg_acc': 0.0}


EP_train:1:  77%|| 601/782 [03:07<00:53,  3.36it/s]

{'epoch': 1, 'iter': 600, 'avg_loss': 0.696193938247376, 'loss': 0.6829378008842468, 'avg_acc': 0.0}


EP_train:1:  90%|| 701/782 [03:37<00:23,  3.38it/s]

{'epoch': 1, 'iter': 700, 'avg_loss': 0.696231978425966, 'loss': 0.6964172720909119, 'avg_acc': 0.0}


EP_train:1: 100%|| 782/782 [04:00<00:00,  3.25it/s]
EP_test:1:   0%|| 1/782 [00:00<01:40,  7.78it/s]

EP1_train, avg_loss= 0.696256763108856
EP:1 Model Saved on: models/ft_bert_sent_classifier_imdb_ep1.pt
{'epoch': 1, 'iter': 0, 'avg_loss': 0.6964401602745056, 'loss': 0.6964401602745056, 'avg_acc': 0.46875}


EP_test:1:  13%|| 102/782 [00:14<01:43,  6.57it/s]

{'epoch': 1, 'iter': 100, 'avg_loss': 0.6931739313767689, 'loss': 0.6872131824493408, 'avg_acc': 0.5058787128712872}


EP_test:1:  26%|| 202/782 [00:28<01:21,  7.15it/s]

{'epoch': 1, 'iter': 200, 'avg_loss': 0.6927910055687179, 'loss': 0.6982260346412659, 'avg_acc': 0.5107276119402985}


EP_test:1:  39%|| 302/782 [00:41<01:07,  7.12it/s]

{'epoch': 1, 'iter': 300, 'avg_loss': 0.6928680693588384, 'loss': 0.7081049084663391, 'avg_acc': 0.5105897009966778}


EP_test:1:  51%|| 402/782 [00:55<00:55,  6.90it/s]

{'epoch': 1, 'iter': 400, 'avg_loss': 0.6933055870848106, 'loss': 0.6865959763526917, 'avg_acc': 0.5074033665835411}


EP_test:1:  64%|| 502/782 [01:09<00:38,  7.21it/s]

{'epoch': 1, 'iter': 500, 'avg_loss': 0.6929929522935026, 'loss': 0.691059410572052, 'avg_acc': 0.5092315369261478}


EP_test:1:  77%|| 602/782 [01:23<00:23,  7.57it/s]

{'epoch': 1, 'iter': 600, 'avg_loss': 0.6933316999981288, 'loss': 0.712873637676239, 'avg_acc': 0.5040557404326124}


EP_test:1:  90%|| 702/782 [01:37<00:11,  7.22it/s]

{'epoch': 1, 'iter': 700, 'avg_loss': 0.693117199202577, 'loss': 0.6922761797904968, 'avg_acc': 0.5067760342368046}


EP_test:1: 100%|| 782/782 [01:48<00:00,  7.19it/s]
EP_train:2:   0%|| 0/782 [00:00<?, ?it/s]

EP1_test, avg_loss= 0.69321975470199
** Average Accuracy= 0.5067934782608695




EP_train:2:   0%|| 1/782 [00:00<03:37,  3.60it/s]

{'epoch': 2, 'iter': 0, 'avg_loss': 0.6986697912216187, 'loss': 0.6986697912216187, 'avg_acc': 0.0}


EP_train:2:  13%|| 101/782 [00:29<03:21,  3.39it/s]

{'epoch': 2, 'iter': 100, 'avg_loss': 0.692786709506913, 'loss': 0.6711193919181824, 'avg_acc': 0.0}


EP_train:2:  26%|| 201/782 [00:59<02:51,  3.40it/s]

{'epoch': 2, 'iter': 200, 'avg_loss': 0.6937022182478834, 'loss': 0.6913369297981262, 'avg_acc': 0.0}


EP_train:2:  38%|| 301/782 [01:29<02:22,  3.38it/s]

{'epoch': 2, 'iter': 300, 'avg_loss': 0.6938880672090474, 'loss': 0.6852484941482544, 'avg_acc': 0.0}


EP_train:2:  51%|| 401/782 [01:58<01:51,  3.42it/s]

{'epoch': 2, 'iter': 400, 'avg_loss': 0.6942996107729296, 'loss': 0.7008100748062134, 'avg_acc': 0.0}


EP_train:2:  64%|| 501/782 [02:27<01:22,  3.39it/s]

{'epoch': 2, 'iter': 500, 'avg_loss': 0.6942756639983125, 'loss': 0.7151300311088562, 'avg_acc': 0.0}


EP_train:2:  77%|| 601/782 [02:57<00:56,  3.22it/s]

{'epoch': 2, 'iter': 600, 'avg_loss': 0.6943774626973068, 'loss': 0.7007508873939514, 'avg_acc': 0.0}


EP_train:2:  90%|| 701/782 [03:26<00:24,  3.36it/s]

{'epoch': 2, 'iter': 700, 'avg_loss': 0.6943266344138457, 'loss': 0.6859517693519592, 'avg_acc': 0.0}


EP_train:2: 100%|| 782/782 [03:49<00:00,  3.40it/s]
EP_test:2:   0%|| 1/782 [00:00<01:40,  7.77it/s]

EP2_train, avg_loss= 0.6948247574022054
EP:2 Model Saved on: models/ft_bert_sent_classifier_imdb_ep2.pt
{'epoch': 2, 'iter': 0, 'avg_loss': 0.6918513178825378, 'loss': 0.6918513178825378, 'avg_acc': 0.53125}


EP_test:2:  13%|| 102/782 [00:13<01:27,  7.73it/s]

{'epoch': 2, 'iter': 100, 'avg_loss': 0.7021178278592554, 'loss': 0.7000889182090759, 'avg_acc': 0.4962871287128713}


EP_test:2:  26%|| 202/782 [00:27<01:20,  7.21it/s]

{'epoch': 2, 'iter': 200, 'avg_loss': 0.6989722023555889, 'loss': 0.7300931811332703, 'avg_acc': 0.5071517412935324}


EP_test:2:  39%|| 302/782 [00:41<01:05,  7.38it/s]

{'epoch': 2, 'iter': 300, 'avg_loss': 0.6999345396048207, 'loss': 0.745032012462616, 'avg_acc': 0.5037375415282392}


EP_test:2:  51%|| 402/782 [00:55<00:52,  7.18it/s]

{'epoch': 2, 'iter': 400, 'avg_loss': 0.7002452030740771, 'loss': 0.6863038539886475, 'avg_acc': 0.5028054862842892}


EP_test:2:  64%|| 502/782 [01:09<00:38,  7.25it/s]

{'epoch': 2, 'iter': 500, 'avg_loss': 0.6998619434838286, 'loss': 0.6905147433280945, 'avg_acc': 0.5045533932135728}


EP_test:2:  77%|| 602/782 [01:23<00:24,  7.28it/s]

{'epoch': 2, 'iter': 600, 'avg_loss': 0.7010028832367375, 'loss': 0.7553405165672302, 'avg_acc': 0.5002079866888519}


EP_test:2:  90%|| 702/782 [01:37<00:11,  7.21it/s]

{'epoch': 2, 'iter': 700, 'avg_loss': 0.7007754353245723, 'loss': 0.6958279609680176, 'avg_acc': 0.5013373751783167}


EP_test:2: 100%|| 782/782 [01:48<00:00,  7.23it/s]
EP_train:3:   0%|| 0/782 [00:00<?, ?it/s]

EP2_test, avg_loss= 0.7010932782726824
** Average Accuracy= 0.500119884910486




EP_train:3:   0%|| 1/782 [00:00<03:46,  3.46it/s]

{'epoch': 3, 'iter': 0, 'avg_loss': 0.6752429008483887, 'loss': 0.6752429008483887, 'avg_acc': 0.0}


EP_train:3:  13%|| 101/782 [00:29<03:21,  3.38it/s]

{'epoch': 3, 'iter': 100, 'avg_loss': 0.6916707618401783, 'loss': 0.687860906124115, 'avg_acc': 0.0}


EP_train:3:  26%|| 201/782 [00:59<02:46,  3.50it/s]

{'epoch': 3, 'iter': 200, 'avg_loss': 0.694198287245053, 'loss': 0.6988617777824402, 'avg_acc': 0.0}


EP_train:3:  38%|| 301/782 [01:28<02:19,  3.46it/s]

{'epoch': 3, 'iter': 300, 'avg_loss': 0.6941491610980113, 'loss': 0.6854762434959412, 'avg_acc': 0.0}


EP_train:3:  51%|| 401/782 [01:57<01:53,  3.35it/s]

{'epoch': 3, 'iter': 400, 'avg_loss': 0.6939656223144912, 'loss': 0.7001341581344604, 'avg_acc': 0.0}


EP_train:3:  64%|| 501/782 [02:26<01:23,  3.35it/s]

{'epoch': 3, 'iter': 500, 'avg_loss': 0.6940438484479329, 'loss': 0.6931577324867249, 'avg_acc': 0.0}


EP_train:3:  77%|| 601/782 [02:56<00:54,  3.34it/s]

{'epoch': 3, 'iter': 600, 'avg_loss': 0.6940604682373326, 'loss': 0.6873683929443359, 'avg_acc': 0.0}


EP_train:3:  90%|| 701/782 [03:26<00:24,  3.32it/s]

{'epoch': 3, 'iter': 700, 'avg_loss': 0.6939111973521713, 'loss': 0.694148063659668, 'avg_acc': 0.0}


EP_train:3: 100%|| 782/782 [03:50<00:00,  3.39it/s]
EP_test:3:   0%|| 1/782 [00:00<01:40,  7.79it/s]

EP3_train, avg_loss= 0.6937951920434947
EP:3 Model Saved on: models/ft_bert_sent_classifier_imdb_ep3.pt
{'epoch': 3, 'iter': 0, 'avg_loss': 0.6933778524398804, 'loss': 0.6933778524398804, 'avg_acc': 0.53125}


EP_test:3:  13%|| 102/782 [00:13<01:32,  7.35it/s]

{'epoch': 3, 'iter': 100, 'avg_loss': 0.6923752226451836, 'loss': 0.6924172639846802, 'avg_acc': 0.5167079207920792}


EP_test:3:  26%|| 202/782 [00:27<01:20,  7.21it/s]

{'epoch': 3, 'iter': 200, 'avg_loss': 0.6919139433856034, 'loss': 0.6977490186691284, 'avg_acc': 0.5242537313432836}


EP_test:3:  39%|| 302/782 [00:41<01:05,  7.28it/s]

{'epoch': 3, 'iter': 300, 'avg_loss': 0.6920265939544602, 'loss': 0.6929969787597656, 'avg_acc': 0.520141196013289}


EP_test:3:  51%|| 402/782 [00:55<00:51,  7.34it/s]

{'epoch': 3, 'iter': 400, 'avg_loss': 0.6923623896596438, 'loss': 0.6904222369194031, 'avg_acc': 0.5136377805486284}


EP_test:3:  64%|| 502/782 [01:09<00:38,  7.26it/s]

{'epoch': 3, 'iter': 500, 'avg_loss': 0.6923004498262844, 'loss': 0.6878274083137512, 'avg_acc': 0.5157809381237525}


EP_test:3:  77%|| 602/782 [01:23<00:25,  7.07it/s]

{'epoch': 3, 'iter': 600, 'avg_loss': 0.6924160990659488, 'loss': 0.6969225406646729, 'avg_acc': 0.5137271214642263}


EP_test:3:  90%|| 702/782 [01:37<00:11,  7.10it/s]

{'epoch': 3, 'iter': 700, 'avg_loss': 0.6924194225911237, 'loss': 0.6972069144248962, 'avg_acc': 0.5141315977175464}


EP_test:3: 100%|| 782/782 [01:48<00:00,  7.20it/s]
EP_train:4:   0%|| 0/782 [00:00<?, ?it/s]

EP3_test, avg_loss= 0.6924209343960218
** Average Accuracy= 0.5138666879795396




EP_train:4:   0%|| 1/782 [00:00<03:46,  3.44it/s]

{'epoch': 4, 'iter': 0, 'avg_loss': 0.6906400322914124, 'loss': 0.6906400322914124, 'avg_acc': 0.0}


EP_train:4:  13%|| 101/782 [00:29<03:22,  3.36it/s]

{'epoch': 4, 'iter': 100, 'avg_loss': 0.6916895862853173, 'loss': 0.6939745545387268, 'avg_acc': 0.0}


EP_train:4:  26%|| 201/782 [00:59<02:48,  3.46it/s]

{'epoch': 4, 'iter': 200, 'avg_loss': 0.6920957004846032, 'loss': 0.6944126486778259, 'avg_acc': 0.0}


EP_train:4:  38%|| 301/782 [01:28<02:23,  3.34it/s]

{'epoch': 4, 'iter': 300, 'avg_loss': 0.6923433962058387, 'loss': 0.6908562183380127, 'avg_acc': 0.0}


EP_train:4:  51%|| 401/782 [01:57<01:48,  3.50it/s]

{'epoch': 4, 'iter': 400, 'avg_loss': 0.6926245597235282, 'loss': 0.6911389827728271, 'avg_acc': 0.0}


EP_train:4:  64%|| 501/782 [02:27<01:25,  3.29it/s]

{'epoch': 4, 'iter': 500, 'avg_loss': 0.6928256336086525, 'loss': 0.6996395587921143, 'avg_acc': 0.0}


EP_train:4:  77%|| 601/782 [02:56<00:53,  3.41it/s]

{'epoch': 4, 'iter': 600, 'avg_loss': 0.6928431889578427, 'loss': 0.6934700012207031, 'avg_acc': 0.0}


EP_train:4:  90%|| 701/782 [03:26<00:23,  3.38it/s]

{'epoch': 4, 'iter': 700, 'avg_loss': 0.6929204963923521, 'loss': 0.6940878629684448, 'avg_acc': 0.0}


EP_train:4: 100%|| 782/782 [03:50<00:00,  3.40it/s]
EP_test:4:   0%|| 1/782 [00:00<01:41,  7.67it/s]

EP4_train, avg_loss= 0.6930206586485324
EP:4 Model Saved on: models/ft_bert_sent_classifier_imdb_ep4.pt
{'epoch': 4, 'iter': 0, 'avg_loss': 0.6925820112228394, 'loss': 0.6925820112228394, 'avg_acc': 0.53125}


EP_test:4:  13%|| 102/782 [00:14<01:32,  7.33it/s]

{'epoch': 4, 'iter': 100, 'avg_loss': 0.6935104261530508, 'loss': 0.6932932734489441, 'avg_acc': 0.4962871287128713}


EP_test:4:  26%|| 202/782 [00:28<01:21,  7.15it/s]

{'epoch': 4, 'iter': 200, 'avg_loss': 0.6930005719412619, 'loss': 0.6989334225654602, 'avg_acc': 0.5071517412935324}


EP_test:4:  39%|| 302/782 [00:42<01:06,  7.19it/s]

{'epoch': 4, 'iter': 300, 'avg_loss': 0.6931919560479959, 'loss': 0.7003202438354492, 'avg_acc': 0.5037375415282392}


EP_test:4:  51%|| 402/782 [00:56<00:51,  7.33it/s]

{'epoch': 4, 'iter': 400, 'avg_loss': 0.693274387249031, 'loss': 0.6906512975692749, 'avg_acc': 0.5028054862842892}


EP_test:4:  64%|| 502/782 [01:10<00:39,  7.08it/s]

{'epoch': 4, 'iter': 500, 'avg_loss': 0.6931793862236236, 'loss': 0.6906706094741821, 'avg_acc': 0.5045533932135728}


EP_test:4:  77%|| 602/782 [01:24<00:25,  7.14it/s]

{'epoch': 4, 'iter': 600, 'avg_loss': 0.6933886758896356, 'loss': 0.7041993737220764, 'avg_acc': 0.5002079866888519}


EP_test:4:  90%|| 702/782 [01:38<00:11,  6.77it/s]

{'epoch': 4, 'iter': 700, 'avg_loss': 0.6933266299427321, 'loss': 0.6922998428344727, 'avg_acc': 0.5013373751783167}


EP_test:4: 100%|| 782/782 [01:49<00:00,  7.12it/s]

EP4_test, avg_loss= 0.6933846326587755
** Average Accuracy= 0.500119884910486







## Create an end-to-end model and evaluate it

When you want to deploy a model, it's best if it already includes its preprocessing
pipeline, so that you don't have to reimplement the preprocessing logic in your
production environment. Let's create an end-to-end model that incorporates
the `Input Representation` layer, and let's evaluate. Our model will accept raw strings
as input.

In [71]:
# For End to End Training in details
class E2ESentimentClassifier(nn.Module):
    """
    End to end BERT-based Sentiment Classification model
    """

    def __init__(self, vocab_size, max_len=256, hidden=128, n_layers=1, attn_heads=8, num_class=2):
        """
        :param vocab_size: vocab_size of total words
        :param hidden: BERT model hidden size
        :param n_layers: numbers of Transformer blocks(layers)
        :param attn_heads: number of attention heads
        :param num_class: number of classes
        """

        super().__init__()
        self.max_len= max_len
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads

        # Keras example used hidden_size=128 for ff_network_hidden_size
        self.feed_forward_hidden = hidden

        # embedding for BERT, sum of positional, segment, token embeddings
        self.embedding = BERTEmbedding(vocab_size=vocab_size, max_len=max_len, embed_size=hidden)

        # multi-layers transformer blocks, deep network
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, hidden, dropout) for _ in range(n_layers)]) 

        # classification layer
        self.num_class = num_class
        self.linear_1 = nn.Linear(self.hidden, int(self.hidden/2))        # 128 -> 64
        self.linear_2 = nn.Linear(int(self.hidden/2), self.num_class-1)   # 64 -> 1
        
    def forward(self, x):
        # 1. Encoding layer 
        
        x = torch.tensor([encode(x[i]) for i in range(len(x))])
        
        # 2. BERT Representation layer 
        
        # attention masking for padded token
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1) 
        # embedding the indexed sequence to sequence of vectors
        x = self.embedding(x)

        # running over multiple transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)
        
        # 3. Classifcation layer   
        
        # Extract CLS representation 
        out = out[:,:1,:].reshape(-1, self.hidden)
        out = self.linear_1(out)
        out = F.relu(out)
        out = self.linear_2(out)
        out = F.sigmoid(out)
        
        return x

In [72]:
# Simplified Ver.
class E2ESentimentClassifierSimple(nn.Module):
    """
    BERT based Sentiment Classification Model
    """
    def __init__(self, model):
        super().__init__()
        self.bert_classifier = model
        
    def forward(self, x):
        # 1. Encoding layer      
        encoded_x = torch.tensor([encode(x[i]) for i in range(len(x))])

        # 2. bert-based sentiment classifier
        out = self.bert_classifier(encoded_x)
        
        return out

In [73]:
# Train the classifier
class E2ESentimentClassifierTrainer:
    """
    E2ESentimentClassifierTrainer make the End to end BERT based Sentiment Classification Model.
    """

    def __init__(self, model,
                 train_dataloader: DataLoader, test_dataloader: DataLoader = None,
                 lr: float = 1e-4, betas=(0.9, 0.999), weight_decay: float = 0.01, warmup_steps=10000,
                 with_cuda: bool = True, log_freq: int = 10):
        """
        :param model: BERT based Sentiment Classification Model which you want to train
        :param train_dataloader: train dataset data loader
        :param test_dataloader: test dataset data loader [can be None]
        :param lr: learning rate of optimizer
        :param betas: Adam optimizer betas
        :param weight_decay: Adam optimizer weight decay param
        :param with_cuda: traning with cuda
        :param log_freq: logging frequency of the batch iteration
        """

        # Setup cuda device for BERT training, argument -c, --cuda should be true
        # CUDA out of memory.로 False 처리
        cuda_condition = False # torch.cuda.is_available() and with_cuda
        self.device = torch.device("cuda:0" if cuda_condition else "cpu")

        # Initialize the BERT based Sentiment Classification Model
        # End to End Model (sentence to sentiment label)
        self.model = model.to(self.device)
        # Fine-tuned Model (token ids to label)
        self.bert_classifier = model.bert_classifier

        # Setting the train and test data loader
        self.train_data = train_dataloader
        self.test_data = test_dataloader

        # Setting the Adam optimizer with hyper-param
        self.optim = Adam(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
        self.optim_schedule = ScheduledOptim(self.optim,  self.bert_classifier.hidden, n_warmup_steps=warmup_steps)

        # Using Negative Log Likelihood Loss function for predicting the masked_token
        self.criterion = nn.BCELoss()

        self.log_freq = log_freq

    def train(self, epoch):
        self.iteration(epoch, self.train_data)

    def test(self, epoch):
        self.iteration(epoch, self.test_data, train=False)
        
    # computes accuracy
    def binary_accuracy(self, preds, y):
        # rounded_preds = torch.round(torch.sigmoid(preds))
        rounded_preds = torch.round(preds)
        correct = (rounded_preds == y).float()
        acc = correct.sum() / len(correct)
        return acc

    def iteration(self, epoch, data_loader, train=True):
        """
        loop over the data_loader for training or testing
        if on train status, backward operation is activated
        and also auto save the model every peoch
        :param epoch: current epoch index
        :param data_loader: torch.utils.data.DataLoader for iteration
        :param train: boolean value of is train or test
        :return: None
        """
        str_code = "train" if train else "test"

        # Setting the tqdm progress bar
        data_iter = tqdm.tqdm(enumerate(data_loader),
                              desc="EP_%s:%d" % (str_code, epoch),
                              total=len(data_loader),
                              bar_format="{l_bar}{r_bar}")

        avg_loss = 0.0
        avg_acc = 0.0
        
        for i, data in data_iter:

            # 0. batch_data will be sent into the device(GPU or cpu)
            # data = {key: value.to(self.device) for key, value in data.items()}

            # 1. forward the bert classifier model
            classifier_output = self.model.forward(data["bert_input"]) # raw sentence (list)

            # Binary Cross Entropy Loss
            target = data["bert_label"].reshape(classifier_output.shape[0], -1)
            loss = self.criterion(classifier_output, target.to(torch.float32)) # torch.Size([32, 1]), torch.Size([32, 1])

            # 2. backward and optimization only in train
            if train:
                self.optim_schedule.zero_grad()
                loss.backward()
                self.optim_schedule.step_and_update_lr()
    
            avg_loss += loss.item()

            # 3. Calculate Accuracy during Test
            if not train:
                acc = self.binary_accuracy(torch.tensor(classifier_output.reshape(-1)), data["bert_label"]) # torch.Size([32])
                avg_acc += acc.item()
                
            post_fix = {
                "epoch": epoch,
                "iter": i,
                "avg_loss": avg_loss / (i + 1),
                "loss": loss.item(),
                "avg_acc": avg_acc / (i + 1)
           }
                            
            if i % self.log_freq == 0:
                data_iter.write(str(post_fix))

        print("EP%d_%s, avg_loss=" % (epoch, str_code), avg_loss / len(data_iter))
        
        if not train:
            print("** Average Accuracy=", avg_acc / len(data_iter))
            print("\n")


    def save(self, epoch, file_path):
        """
        Saving the current BERT Classification Model on file_path
        :param epoch: current epoch number
        :param file_path: model output path which gonna be file_path+"ep%d" % epoch
        :return: final_output_path
        """
        
        output_path = file_path
        torch.save(self.model.cpu(), output_path)
        self.model.to(self.device)

        print("EP:%d Model Saved on:" % epoch, output_path)
        return output_path

In [74]:
# E2ESentimentClassifierSimple(trained_sentiment_classifier).bert_classifier.hidden

In [75]:
# test_raw_dataset = BERTMLMDataset(test_df_sample.review.values, y_test)
# test_raw_loader = DataLoader(dataset=test_raw_dataset, batch_size=config.BATCH_SIZE, shuffle=False)
train_raw_loader = test_raw_loader

In [76]:
trained_sentiment_classifier = torch.load('models/ft_bert_sent_classifier_imdb_ep4.pt')
trained_sentiment_classifier.train()

print("Building End to end BERT based Sentiment Classification model")

e2e_classifier = E2ESentimentClassifierSimple(trained_sentiment_classifier)

print("Creating End to end BERT based Sentiment Classification Model Pre-Trainer")
trainer = E2ESentimentClassifierTrainer(e2e_classifier, train_dataloader=train_raw_loader, test_dataloader=None, lr=config.LR, log_freq=100)

print("Training Start")
for epoch in range(5):
    trainer.train(epoch)
    # Save BERT
    trainer.save(epoch, file_path="models/e2e_bert_imdb" + "_ep%d" % epoch + ".pt")

EP_train:0:   0%|| 0/782 [00:00<?, ?it/s]

Building End to end BERT based Sentiment Classification model
Creating End to end BERT based Sentiment Classification Model Pre-Trainer
Training Start


EP_train:0:   0%|| 1/782 [00:00<06:45,  1.93it/s]

{'epoch': 0, 'iter': 0, 'avg_loss': 0.6973283290863037, 'loss': 0.6973283290863037, 'avg_acc': 0.0}


EP_train:0:  13%|| 101/782 [00:52<05:40,  2.00it/s]

{'epoch': 0, 'iter': 100, 'avg_loss': 0.6933576234496466, 'loss': 0.6934188604354858, 'avg_acc': 0.0}


EP_train:0:  26%|| 201/782 [01:44<05:17,  1.83it/s]

{'epoch': 0, 'iter': 200, 'avg_loss': 0.6934716419794074, 'loss': 0.6949549317359924, 'avg_acc': 0.0}


EP_train:0:  38%|| 301/782 [02:37<04:25,  1.81it/s]

{'epoch': 0, 'iter': 300, 'avg_loss': 0.6933421005442293, 'loss': 0.6918231248855591, 'avg_acc': 0.0}


EP_train:0:  51%|| 401/782 [03:32<03:36,  1.76it/s]

{'epoch': 0, 'iter': 400, 'avg_loss': 0.6932702498542995, 'loss': 0.6936849355697632, 'avg_acc': 0.0}


EP_train:0:  64%|| 501/782 [04:25<02:29,  1.88it/s]

{'epoch': 0, 'iter': 500, 'avg_loss': 0.6931426146787084, 'loss': 0.6955307722091675, 'avg_acc': 0.0}


EP_train:0:  77%|| 601/782 [05:17<01:39,  1.83it/s]

{'epoch': 0, 'iter': 600, 'avg_loss': 0.6932627850284989, 'loss': 0.6920150518417358, 'avg_acc': 0.0}


EP_train:0:  90%|| 701/782 [06:09<00:43,  1.88it/s]

{'epoch': 0, 'iter': 700, 'avg_loss': 0.6932474714883214, 'loss': 0.6917769908905029, 'avg_acc': 0.0}


EP_train:0: 100%|| 782/782 [06:51<00:00,  1.90it/s]
EP_train:1:   0%|| 0/782 [00:00<?, ?it/s]

EP0_train, avg_loss= 0.6932618060837621
EP:0 Model Saved on: models/e2e_bert_imdb_ep0.pt


EP_train:1:   0%|| 1/782 [00:00<06:33,  1.98it/s]

{'epoch': 1, 'iter': 0, 'avg_loss': 0.6959201693534851, 'loss': 0.6959201693534851, 'avg_acc': 0.0}


EP_train:1:  13%|| 101/782 [00:52<05:45,  1.97it/s]

{'epoch': 1, 'iter': 100, 'avg_loss': 0.6932929677538352, 'loss': 0.6931483149528503, 'avg_acc': 0.0}


EP_train:1:  26%|| 201/782 [01:43<05:05,  1.90it/s]

{'epoch': 1, 'iter': 200, 'avg_loss': 0.6932669803870851, 'loss': 0.6870962977409363, 'avg_acc': 0.0}


EP_train:1:  38%|| 301/782 [02:34<04:00,  2.00it/s]

{'epoch': 1, 'iter': 300, 'avg_loss': 0.6932580809656568, 'loss': 0.6914897561073303, 'avg_acc': 0.0}


EP_train:1:  51%|| 401/782 [03:26<03:21,  1.89it/s]

{'epoch': 1, 'iter': 400, 'avg_loss': 0.6932282367549335, 'loss': 0.6932811141014099, 'avg_acc': 0.0}


EP_train:1:  64%|| 501/782 [04:18<02:29,  1.88it/s]

{'epoch': 1, 'iter': 500, 'avg_loss': 0.6931224634309491, 'loss': 0.6957904100418091, 'avg_acc': 0.0}


EP_train:1:  77%|| 601/782 [05:09<01:35,  1.90it/s]

{'epoch': 1, 'iter': 600, 'avg_loss': 0.6932356594604581, 'loss': 0.6913183331489563, 'avg_acc': 0.0}


EP_train:1:  90%|| 701/782 [06:01<00:43,  1.85it/s]

{'epoch': 1, 'iter': 700, 'avg_loss': 0.6932121258489415, 'loss': 0.691987156867981, 'avg_acc': 0.0}


EP_train:1: 100%|| 782/782 [06:43<00:00,  1.94it/s]
EP_train:2:   0%|| 0/782 [00:00<?, ?it/s]

EP1_train, avg_loss= 0.6932347942038876
EP:1 Model Saved on: models/e2e_bert_imdb_ep1.pt


EP_train:2:   0%|| 1/782 [00:00<06:36,  1.97it/s]

{'epoch': 2, 'iter': 0, 'avg_loss': 0.6956122517585754, 'loss': 0.6956122517585754, 'avg_acc': 0.0}


EP_train:2:  13%|| 101/782 [00:51<05:42,  1.99it/s]

{'epoch': 2, 'iter': 100, 'avg_loss': 0.6932624861745551, 'loss': 0.6931557655334473, 'avg_acc': 0.0}


EP_train:2:  26%|| 201/782 [01:43<05:10,  1.87it/s]

{'epoch': 2, 'iter': 200, 'avg_loss': 0.6932270010905479, 'loss': 0.6870681047439575, 'avg_acc': 0.0}


EP_train:2:  38%|| 301/782 [02:35<03:59,  2.01it/s]

{'epoch': 2, 'iter': 300, 'avg_loss': 0.6932449847756826, 'loss': 0.6917574405670166, 'avg_acc': 0.0}


EP_train:2:  51%|| 401/782 [03:27<03:25,  1.86it/s]

{'epoch': 2, 'iter': 400, 'avg_loss': 0.6932139059254654, 'loss': 0.6932945847511292, 'avg_acc': 0.0}


EP_train:2:  64%|| 501/782 [04:19<02:25,  1.93it/s]

{'epoch': 2, 'iter': 500, 'avg_loss': 0.6931057840050338, 'loss': 0.695604681968689, 'avg_acc': 0.0}


EP_train:2:  77%|| 601/782 [05:10<01:36,  1.88it/s]

{'epoch': 2, 'iter': 600, 'avg_loss': 0.6932328969190601, 'loss': 0.6919404864311218, 'avg_acc': 0.0}


EP_train:2:  90%|| 701/782 [06:03<00:43,  1.87it/s]

{'epoch': 2, 'iter': 700, 'avg_loss': 0.6932118936374082, 'loss': 0.6920902132987976, 'avg_acc': 0.0}


EP_train:2: 100%|| 782/782 [06:45<00:00,  1.93it/s]
EP_train:3:   0%|| 0/782 [00:00<?, ?it/s]

EP2_train, avg_loss= 0.693235467004654
EP:2 Model Saved on: models/e2e_bert_imdb_ep2.pt


EP_train:3:   0%|| 1/782 [00:00<06:53,  1.89it/s]

{'epoch': 3, 'iter': 0, 'avg_loss': 0.6951404809951782, 'loss': 0.6951404809951782, 'avg_acc': 0.0}


EP_train:3:  13%|| 101/782 [00:52<05:46,  1.97it/s]

{'epoch': 3, 'iter': 100, 'avg_loss': 0.6932473713808721, 'loss': 0.6931476593017578, 'avg_acc': 0.0}


EP_train:3:  26%|| 201/782 [01:43<05:16,  1.84it/s]

{'epoch': 3, 'iter': 200, 'avg_loss': 0.6932194416795797, 'loss': 0.6880196928977966, 'avg_acc': 0.0}


EP_train:3:  38%|| 301/782 [02:35<04:03,  1.97it/s]

{'epoch': 3, 'iter': 300, 'avg_loss': 0.6932446762572887, 'loss': 0.6922767758369446, 'avg_acc': 0.0}


EP_train:3:  51%|| 401/782 [03:27<03:24,  1.86it/s]

{'epoch': 3, 'iter': 400, 'avg_loss': 0.6932042709312534, 'loss': 0.6932785511016846, 'avg_acc': 0.0}


EP_train:3:  64%|| 501/782 [04:19<02:24,  1.95it/s]

{'epoch': 3, 'iter': 500, 'avg_loss': 0.6931051017043596, 'loss': 0.6953201293945312, 'avg_acc': 0.0}


EP_train:3:  77%|| 601/782 [05:11<01:35,  1.89it/s]

{'epoch': 3, 'iter': 600, 'avg_loss': 0.6932421795738716, 'loss': 0.6928403973579407, 'avg_acc': 0.0}


EP_train:3:  90%|| 701/782 [06:03<00:43,  1.85it/s]

{'epoch': 3, 'iter': 700, 'avg_loss': 0.6932322359119094, 'loss': 0.69202721118927, 'avg_acc': 0.0}


EP_train:3: 100%|| 782/782 [06:45<00:00,  1.93it/s]
EP_train:4:   0%|| 0/782 [00:00<?, ?it/s]

EP3_train, avg_loss= 0.6932421984422542
EP:3 Model Saved on: models/e2e_bert_imdb_ep3.pt


EP_train:4:   0%|| 1/782 [00:00<06:42,  1.94it/s]

{'epoch': 4, 'iter': 0, 'avg_loss': 0.6951477527618408, 'loss': 0.6951477527618408, 'avg_acc': 0.0}


EP_train:4:  13%|| 101/782 [00:52<05:50,  1.94it/s]

{'epoch': 4, 'iter': 100, 'avg_loss': 0.6932463486595909, 'loss': 0.6931620836257935, 'avg_acc': 0.0}


EP_train:4:  26%|| 201/782 [01:44<05:08,  1.89it/s]

{'epoch': 4, 'iter': 200, 'avg_loss': 0.6932324461675995, 'loss': 0.6894408464431763, 'avg_acc': 0.0}


EP_train:4:  38%|| 301/782 [02:35<03:59,  2.01it/s]

{'epoch': 4, 'iter': 300, 'avg_loss': 0.6932513070264924, 'loss': 0.6928225159645081, 'avg_acc': 0.0}


EP_train:4:  51%|| 401/782 [03:27<03:25,  1.85it/s]

{'epoch': 4, 'iter': 400, 'avg_loss': 0.6932288937437862, 'loss': 0.6931719779968262, 'avg_acc': 0.0}


EP_train:4:  64%|| 501/782 [04:18<02:25,  1.93it/s]

{'epoch': 4, 'iter': 500, 'avg_loss': 0.6931476004109411, 'loss': 0.6944177746772766, 'avg_acc': 0.0}


EP_train:4:  77%|| 601/782 [05:10<01:34,  1.91it/s]

{'epoch': 4, 'iter': 600, 'avg_loss': 0.6932224418081578, 'loss': 0.6934870481491089, 'avg_acc': 0.0}


EP_train:4:  90%|| 701/782 [06:06<00:43,  1.84it/s]

{'epoch': 4, 'iter': 700, 'avg_loss': 0.6932213333635969, 'loss': 0.6932365894317627, 'avg_acc': 0.0}


EP_train:4: 100%|| 782/782 [06:49<00:00,  1.91it/s]

EP4_train, avg_loss= 0.693203994608901
EP:4 Model Saved on: models/e2e_bert_imdb_ep4.pt





____

## Inference

In [77]:
e2e_bert_sentiment_classifier = torch.load('models/e2e_bert_imdb_ep0.pt')
e2e_bert_sentiment_classifier.train()

# function to make sentiment prediction during inference
def predict_sentiment(model, sentence):
    """
    :model : end-to-end bert-based classification model
    :sentence : text sentence, input of the end-to-end model
    """
    model.eval()
    prediction = model.forward(sentence)
    return prediction.item()

In [78]:
# print("inputs")
# print(iter(train_raw_loader).next()['bert_input'][:5])
# print("labels")
# print(iter(train_raw_loader).next()['bert_label'][:5])

In [83]:
predict_sentiment(e2e_bert_sentiment_classifier, ['hate this movie'])

0.506757378578186

In [80]:
predict_sentiment(e2e_bert_sentiment_classifier, ['I like this movie.'])

0.5079782009124756

## Practice Code

In [81]:
# computes accuracy
def binary_accuracy(preds, y):
    rounded_preds = torch.round(preds) # preds : sigmoid output
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

binary_accuracy(torch.tensor([0.5, 0.6, 0.3]), torch.tensor([1.0,1.0, 0.0])).item()
# torch.tensor([[0.5, 0.6, 0.3]]).reshape(-1)

0.6666666865348816

In [82]:
x = torch.rand(10, 16)
print(x.shape)
mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

torch.Size([10, 16])


In [None]:
mask.shape

In [None]:
x

In [None]:
x.unsqueeze(1)

In [None]:
x.unsqueeze(1).shape

In [None]:
# https://seducinghyeok.tistory.com/9
x.unsqueeze(1).repeat(1, x.size(1), 1).shape # 특정 차원의 텐서 반복 ex. dim 0으로 1번, dim 1로 16번, dim 2로 1번

In [None]:
x.unsqueeze(1).repeat(1, x.size(1), 1)

In [None]:
# Returns a new tensor with a dimension of size one inserted at the specified position. 
# torch.unsqueeze(input, dim)
x.unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1).shape

In [None]:
x.unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)