## HW10: BERT fine-tuning

In this exercise, you will learn how to fine-tune a transformer-based model. First, we provide a tutorial on fine-tuning DistilBERT (https://arxiv.org/abs/1910.01108) on the Large Movie Review Dataset (IMDB dataset). After that, you have to complete the exercise by fine-tuning on the TRUE call-center dataset (HW5). This homework is based on the Hugging Face tutorial (https://huggingface.co/transformers/custom_datasets.html).

### 1. Install the transformers library from Hugging Face

In [None]:
# !pip install torch==1.4.0
!pip install transformers
!pip install pythainlp
!pip install sentencepiece



### 2. Download Large Movie Review Dataset 

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2021-04-04 08:28:01--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-04-04 08:28:04 (22.0 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### 3. Preprocess the dataset  
The Large Movie Review Dataset is a dataset for binary sentiment classification. Each input is a movie review, and its sentiment (positive or negative) is the ground-truth label.

In [None]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity, not equality.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [None]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [0 1], nb. of train data = 20000, test_data = 25000
Data = In addition to his "Tarzan" series, the prolific Edgar Rice Burroughs did write many other books, although, aside from the popular "At the Earth's Core", few of these have been filmed. One exception is the novel entitled "The Lad and the Lion", brought to the screen as "The Lion Man" (1936), an over-talkative, static, old-hat, slow-moving and rather dull movie, despite being filmed on real desert locations. Actually "movie" is the wrong word. The narrative doesn't move but proceeds at a snail's pace in an abrupt series of jerks. For instance, at least five characters are given elaborate opening scenes and then just disappear. Even more frustrating for the keen movie fan, are the characters who make an impression of sorts (like the lass who plies Hall with drugged wine) but are enacted by players who are not credited! The credited thespians generally come off worse than the unknowns. One exception is Australian a

After the dataset is processed, we tokenize each input sentence. The tokenizer adds a start token '[CLS]' (id 101) at the beginning and a separator token '[SEP]' (id 102) at the end of each sentence. An out-of-vocabulary (OOV) word is mapped to token id 100. The tokenized output has the following format:

```python
{
  'input_ids': List[List[Int]]. Token ids of each tokenized input sentence.
  'attention_mask': List[List[Int]]. 1 for a real token, 0 for padding. See the tokenizer examples below.
}
```
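
The relationship between `input_ids` and `attention_mask` can be illustrated without downloading a tokenizer. The following is a minimal sketch, not the Hugging Face implementation: `pad_batch` is a hypothetical helper written for this illustration, and it only mimics the output layout. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and padding ids that appear in the tokenizer outputs shown in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in this notebook's tokenizer outputs


def pad_batch(token_id_lists):
    """Wrap each sequence in [CLS]/[SEP], then pad to the longest sequence.

    Hypothetical helper for illustration only; it mimics the layout of
    the dictionary returned by the tokenizer when padding is enabled.
    """
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in token_id_lists]
    max_len = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Word ids copied from the outputs in this notebook:
# 'pine apple apple pen' -> 7222 6207 6207 7279, 'a b' -> 1037 1038
batch = pad_batch([[7222, 6207, 6207, 7279], [1037, 1038]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is padded with zeros on the right, and its attention mask is zero at exactly those positions.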

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [None]:
tokenizer([ '[CLS] a' ], truncation=True, padding=True)

{'input_ids': [[101, 101, 1037, 102]], 'attention_mask': [[1, 1, 1, 1]]}

In [None]:
tokenizer( ['Pine apple apple pen  หมา ไก่', 'a b'], truncation=True, padding=True)

{'input_ids': [[101, 7222, 6207, 6207, 7279, 100, 100, 102], [101, 1037, 1038, 102, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]}

In [None]:
a = tokenizer(train_texts[:2], truncation=True, padding=True)
print(a)

{'input_ids': [[101, 1999, 2804, 2000, 2010, 1000, 24566, 1000, 2186, 1010, 1996, 12807, 9586, 5785, 25991, 2106, 4339, 2116, 2060, 2808, 1010, 2348, 1010, 4998, 2013, 1996, 2759, 1000, 2012, 1996, 3011, 1005, 1055, 4563, 1000, 1010, 2261, 1997, 2122, 2031, 2042, 6361, 1012, 2028, 6453, 2003, 1996, 3117, 4709, 1000, 1996, 14804, 1998, 1996, 7006, 1000, 1010, 2716, 2000, 1996, 3898, 2004, 1000, 1996, 7006, 2158, 1000, 1006, 4266, 1007, 1010, 2019, 2058, 1011, 2831, 8082, 1010, 10763, 1010, 2214, 1011, 6045, 1010, 4030, 1011, 3048, 1998, 2738, 10634, 3185, 1010, 2750, 2108, 6361, 2006, 2613, 5532, 5269, 1012, 2941, 1000, 3185, 1000, 2003, 1996, 3308, 2773, 1012, 1996, 7984, 2987, 1005, 1056, 2693, 2021, 10951, 2012, 1037, 10879, 1005, 1055, 6393, 1999, 2019, 18772, 2186, 1997, 12181, 2015, 1012, 2005, 6013, 1010, 2012, 2560, 2274, 3494, 2024, 2445, 9603, 3098, 5019, 1998, 2059, 2074, 10436, 1012, 2130, 2062, 25198, 2005, 1996, 10326, 3185, 5470, 1010, 2024, 1996, 3494, 2040, 2191, 2019, 

In [None]:
# padding='max_length' replaces the deprecated pad_to_max_length=True.
train_encodings = tokenizer(train_texts, add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,truncation=True
        )
val_encodings = tokenizer(val_texts, add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,truncation=True
        )
test_encodings = tokenizer(test_texts, add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,truncation=True
        )



Convert the dataset into the training format. The input format expected by DistilBERT is described at https://huggingface.co/transformers/model_doc/distilbert.html.

In [None]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

### 4. Model fine-tuning
The model we fine-tune is DistilBERT (https://arxiv.org/abs/1910.01108), a smaller model distilled from the original BERT. Knowledge distillation is a well-known technique for improving the performance of a small model by training it to match the output distribution (soft labels) of a larger model instead of using only hard labels. If you want to know more about knowledge distillation, read https://arxiv.org/abs/1503.02531.
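
To make the difference between soft and hard labels concrete, here is a minimal sketch of a soft-target loss in the spirit of https://arxiv.org/abs/1503.02531. It is a simplified illustration under stated assumptions, not DistilBERT's actual training objective: the function names, the example logits, and the temperature value are all chosen for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's soft outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


def hard_label_loss(student_logits, label):
    """Ordinary cross-entropy against a single correct class index."""
    return -math.log(softmax(student_logits)[label])


# Hypothetical logits for a two-class problem (negative, positive).
teacher_logits = [4.0, 1.0]
student_logits = [2.0, 1.5]

print("teacher soft labels:", softmax(teacher_logits, temperature=2.0))
print("soft-target loss:", soft_target_loss(student_logits, teacher_logits))
print("hard-label loss:", hard_label_loss(student_logits, label=0))
```

The soft targets carry the teacher's relative confidence in every class, which a single hard label cannot express.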

#### Model Initialization

In [None]:
from transformers import DistilBertForSequenceClassification
import torch

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels= 2)
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

#### Set up training generator

In the previous lab you trained with model.fit. A more common way to feed the data is to use a generator. This can be more memory-efficient because each batch is only loaded when the iterator asks for it. For example, you can set the generator to load images from a folder when called instead of storing all of them in RAM. The example below is a simple generator that aggregates the data points into batches. Both PyTorch and TensorFlow also have utility modules for creating a generator (torch.utils.data.DataLoader for PyTorch and tf.data.Dataset for TensorFlow).

In [None]:
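from sklearn.utils import shuffle

def batch_data_generator(data, label, bs = 8, training = True):
  # Yields ([input_ids, attention_mask], labels) batches of size bs.
  # training=True: loops over the data forever, shuffling with a fixed seed.
  # training=False: makes a single pass, then yields None as an end marker.
  while(True):
    X1= []
    X2 = []
    Y = []
    # Shuffle into new local names so `label` is never overwritten; reusing
    # `label` would misalign ids and labels from the second pass onward.
    ids, masks, labels = data[0], data[1], label
    if(training):
      ids, masks, labels = shuffle(ids, masks, labels, random_state = 42)
    for a, b, c in zip(ids, masks, labels):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1= []
        X2 = []
        Y = []
    if(len(X1) > 0):
      # Last, smaller batch of this pass.
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      yield None
      break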
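
The generator above still receives every tokenized example as an in-memory array. The lazy-loading idea described earlier, where data is read from disk only when a batch is requested, can be sketched as follows. This is an illustrative sketch only: `lazy_text_batches` is a hypothetical helper, and the demo writes three small temporary files rather than using the IMDB folders.

```python
import tempfile
from pathlib import Path


def lazy_text_batches(file_paths, bs=8):
    """Yield lists of up to `bs` file contents, reading each file on demand."""
    batch = []
    for path in file_paths:
        # The file is opened only when the consumer reaches this point.
        batch.append(Path(path).read_text(encoding="utf-8"))
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:
        # Last, smaller batch.
        yield batch


# Small demo with temporary files standing in for a dataset folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in range(3):
        (Path(tmp_dir) / f"review_{i}.txt").write_text(f"review {i}", encoding="utf-8")
    demo_paths = sorted(Path(tmp_dir).iterdir())
    for texts in lazy_text_batches(demo_paths, bs=2):
        print(len(texts), texts)
```

Only one batch of text is held in memory at a time, at the cost of repeating the disk reads on every pass over the data.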


In [None]:
# np.int was removed from recent NumPy releases; use an explicit integer dtype.
train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)

In [None]:
dummy_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y_dummy.shape)

(8, 512) (8, 512) (8,)


#### Start Fine-tuning

In [None]:
from collections import deque

from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm

device = "cuda:0"

# Running statistics over the last 100 training steps.
train_acc_stat =  deque(maxlen = 100)
train_loss_stat =  deque(maxlen = 100)

for step in tqdm(range(1000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      # Validation step: one full pass over the validation set.
      with torch.no_grad():
        val_generator = batch_data_generator(val_data, np.array(val_labels, dtype = np.int64), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))

iter = 99 train_acc = 0.73
iter = 99 train_loss = 0.570081889629364
iter = 199 train_acc = 0.88
iter = 199 train_loss = 0.326997309923172
iter = 299 train_acc = 0.89375
iter = 299 train_loss = 0.28449925780296326
iter = 399 train_acc = 0.9
iter = 399 train_loss = 0.25818705558776855
iter = 499 train_acc = 0.895
iter = 499 train_loss = 0.2770956754684448
val acc 0.8846
iter = 599 train_acc = 0.91125
iter = 599 train_loss = 0.23292399942874908
iter = 699 train_acc = 0.88375
iter = 699 train_loss = 0.2749811112880707
iter = 799 train_acc = 0.87375
iter = 799 train_loss = 0.28948721289634705
iter = 899 train_acc = 0.9025
iter = 899 train_loss = 0.2369956523180008
iter = 999 train_acc = 0.9
iter = 999 train_loss = 0.2622757852077484
val acc 0.8996



## TODO 
Compare the classification performance of the non-transformer model with that of a model fine-tuned from pretrained WangchanBERTa on the TRUE call-center dataset (HW5). WangchanBERTa (https://arxiv.org/abs/2101.09635) is a RoBERTa (https://arxiv.org/abs/1907.11692) model trained on Thai texts. RoBERTa is also supported in Hugging Face (https://huggingface.co/transformers/model_doc/roberta.html).

To successfully fine-tune WangchanBERTa on the TRUE call-center dataset, you should:

1. Preprocess the dataset into the same format as the tutorial.
2. Tokenize the input from 1. See (https://colab.research.google.com/drive/1Kbk6sBspZLwcnOE61adAQo30xxqOQ9ko?usp=sharing&fbclid=IwAR23b8ZEoP6YxlUx7wWEu7dRCrVcyTFrZb3YSgI-nsxe_t4gy-bh8Rv5R9E#scrollTo=kAcpAdkddVQ8) for more details.
3. Process the tokenized input from step 2 into a format that can be fed to the model.
4. Initialize WangchanBERTa (<b>you should choose the pretrained weights that match the tokenizer from step 2</b>).
5. Fine-tune the pretrained model.
6. (Optional) Before fine-tuning (that is, before step 5), domain adaptation is often performed first by training a masked language model (maskLM) on the target-domain text. You can train a maskLM by following this guideline (https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm).
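
For the optional masked language model step, the core idea is to hide some input tokens and ask the model to recover them. The following is a minimal sketch of the masking scheme described in the BERT paper (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). It is an illustration, not the Hugging Face implementation: `mask_tokens`, the token ids, `mask_id=4`, and `vocab_size=100` are hypothetical values chosen for this example. In practice these values would come from the tokenizer, and the library's `DataCollatorForLanguageModeling` can do the masking for you.

```python
import random

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids.

    labels holds the original id at every selected position and
    IGNORE_INDEX elsewhere, so only selected positions contribute to the loss.
    """
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select other tokens with prob. mask_prob.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Hypothetical sequence: 5 and 6 play the role of the start and end tokens.
masked_ids, labels = mask_tokens(
    [5, 10, 11, 12, 13, 6], mask_id=4, vocab_size=100, special_ids={5, 6}, mask_prob=0.5
)
print(masked_ids)
print(labels)
```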

In [None]:
!wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
!pip install torch torchtext torchvision


--2021-04-04 08:51:12--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv [following]
--2021-04-04 08:51:12--  https://www.dropbox.com/s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc3213c0286f2cde5ef1e168714b.dl.dropboxusercontent.com/cd/0/inline/BL8QTctRfAYC-0csQRo7P5TBYyryfCTba9OO4Y_V9nnNGaQqPCjOIA7k2FMi9_5aK90jMf0j1H_60OHuQ7x9O9S4j2jQSMHGMuCrPg2a3Tff4PWEcg7z8p3SVJ8gkTw7IEIiFPofCqOpNscEXnjXhTch/file# [following]
--2021-04-04 08:51:12--  https://uc3213c0286f2cde5ef1e168714b.dl.dropboxusercontent.com/cd/0/inline/BL8QTctRfAYC-0csQRo7P5TBYyr

In [None]:
import numpy as np
import pandas as pd
import torch

from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

#transformers
from transformers import (
    CamembertTokenizer,
    AutoTokenizer,
    AutoModel,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)

In [None]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')

In [None]:
# Copy the selected columns so later assignments do not touch a view of data_df.
data_df2 = data_df[["Sentence Utterance", "Action", "Object"]].copy()
data_df2.columns = ['input', 'raw_action','raw_object']

# Normalize label casing, then drop the raw label columns.
data_df2['clean_action']=data_df2['raw_action'].str.lower()
data_df2['clean_object']=data_df2['raw_object'].str.lower()

data_df2 = data_df2.drop(columns=['raw_action', 'raw_object'])

data_df2 = data_df2.drop_duplicates("input", keep="first")
display(data_df2.describe())

data = data_df2.to_numpy()

def strip_str(string):
    return string.strip()

data[:,0] = np.vectorize(strip_str)(data[:,0])

# Map each object class name to an integer id (and back).
unique_object = data_df2.clean_object.unique()
object_2_num_map = dict(zip(unique_object, range(len(unique_object))))
num_2_object_map = dict(zip(range(len(unique_object)), unique_object))

# Column 2 (clean_object) now holds integer class ids.
data[:,2] = np.vectorize(object_2_num_map.get)(data[:,2])

| | input | clean_action | clean_object |
|---|---|---|---|
| count | 13389 | 13389 | 13389 |
| unique | 13389 | 8 | 26 |
| top | ผมอยากจะถามว่าถ้าเราโทรไปต่างประเทศนาทีละกี่บา... | enquire | service |
| freq | 1 | 8658 | 2111 |


In [None]:
data[10:,2] 

array([4, 6, 3, ..., 7, 7, 1], dtype=object)

In [None]:
train_text, test_text , train_object, test_object = train_test_split(data[:,0], data[:,2], test_size=.2 , random_state = 42)
train_val_split = len(train_text)*7//10
train_text, val_text = train_text[:train_val_split],train_text[train_val_split:]
train_object, val_object = train_object[:train_val_split],train_object[train_val_split:]

In [None]:
train_object[10:]

array([3, 16, 3, ..., 9, 7, 0], dtype=object)

In [None]:
print(train_text.shape, val_text.shape,test_text.shape)
print(train_object.shape, val_object.shape,test_object.shape)

(7497,) (3214,) (2678,)
(7497,) (3214,) (2678,)


In [None]:
tokenizer = AutoTokenizer.from_pretrained('airesearch/wangchanberta-base-att-spm-uncased')

In [None]:
# padding='max_length' replaces the deprecated pad_to_max_length=True.
train_encodings = tokenizer(train_text.tolist(), add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,truncation=True
        )
val_encodings = tokenizer(val_text.tolist(), add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,truncation=True
        )
test_encodings = tokenizer(test_text.tolist(), add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True,truncation=True
        )



In [None]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('airesearch/wangchanberta-base-att-spm-uncased', num_labels= len(unique_object))
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)


Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wa

In [None]:
import copy
def batch_data_generator(data, label, bs = 8, training = True):
  while(True):
    X1= []
    X2 = []
    Y = []
    from sklearn.utils import shuffle
    ids, masks = copy.deepcopy(data[0]), copy.deepcopy(data[1])
    labels = copy.deepcopy(label)
    if(training):
      ids, masks, labels = shuffle(ids, masks, labels, random_state = 42)
    for a, b, c in zip(ids, masks, labels):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1= []
        X2 = []
        Y = []
    if(len(X1) > 0):
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      yield None
      break


In [None]:
# np.int was removed from recent NumPy releases; use an explicit integer dtype.
train_generator = batch_data_generator(train_data, np.array(train_object, dtype = np.int64), training = True)

In [None]:
from collections import deque

from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm

device = "cuda:0"

# Running statistics over the last 100 training steps.
train_acc_stat =  deque(maxlen = 100)
train_loss_stat =  deque(maxlen = 100)

for step in tqdm(range(4000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      # Validation step: one full pass over the validation set.
      with torch.no_grad():
        val_generator = batch_data_generator(val_data, np.array(val_object, dtype = np.int64), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))

iter = 99 train_acc = 0.2025
iter = 99 train_loss = 2.8779850006103516
iter = 199 train_acc = 0.2325
iter = 199 train_loss = 2.6149468421936035
iter = 299 train_acc = 0.35875
iter = 299 train_loss = 2.360844135284424
iter = 399 train_acc = 0.4475
iter = 399 train_loss = 2.006945848464966
iter = 499 train_acc = 0.525
iter = 499 train_loss = 1.6820623874664307
val acc 0.544181705040448
iter = 599 train_acc = 0.565
iter = 599 train_loss = 1.5172691345214844
iter = 699 train_acc = 0.52625
iter = 699 train_loss = 1.6550440788269043
iter = 799 train_acc = 0.53625
iter = 799 train_loss = 1.6548097133636475
iter = 899 train_acc = 0.5825
iter = 899 train_loss = 1.4302165508270264
iter = 999 train_acc = 0.62375
iter = 999 train_loss = 1.3428040742874146
val acc 0.5382700684505289
iter = 1099 train_acc = 0.56
iter = 1099 train_loss = 1.487375020980835
iter = 1199 train_acc = 0.56625
iter = 1199 train_loss = 1.4734954833984375
iter = 1299 train_acc = 0.545
iter = 1299 train_loss = 1.50838685035705

In [None]:
test_generator = batch_data_generator(test_data, np.array(test_object, dtype = np.int64), training = False)
y_true = []
y_pred = []
# Gradients are not needed at test time; no_grad avoids storing them.
with torch.no_grad():
  while(True):
    d = next(test_generator)
    if(d is None): break
    X, Y = d
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
    y_true.append(Y)
    y_pred.append(outputs_cls)
y_true = np.concatenate(y_true)
y_pred = np.concatenate(y_pred)
print("Test acc", accuracy_score(y_true, y_pred))

Test acc 0.719193427931292


In HW6, object accuracy was 61.67% and training for 10 epochs took 120.874 s. With this BERT model, object accuracy on the test set is 71.92% and training for about 4 epochs took 3838.921 s.
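
The trade-off in this comparison can be worked out from the reported numbers. The short calculation below is a sketch that uses only figures from this notebook (accuracies rounded to two decimals, 4000 training steps, the generator's default batch size of 8, and 7497 training examples). Whether the two timings were measured on the same hardware is not stated, so the time ratio is only indicative.

```python
# Reported results (accuracies rounded to two decimals).
hw6_acc, hw6_seconds = 61.67, 120.874
bert_acc, bert_seconds = 71.92, 3838.921

acc_gain = bert_acc - hw6_acc          # difference in percentage points
slowdown = bert_seconds / hw6_seconds  # ratio of wall-clock training times

# 4000 steps with batch size 8 over 7497 training examples.
epochs = 4000 * 8 / 7497

print(f"accuracy gain: {acc_gain:.2f} points")
print(f"training time ratio: {slowdown:.1f}x")
print(f"epochs covered: {epochs:.2f}")
```

The fine-tuned model gains roughly 10 percentage points of accuracy at the cost of a training run that is about 32 times longer.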