# **Homework 7 - Bert (Question Answering)**

If you have any questions, feel free to email us at mlta-2022-spring@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1H5ZONrb2LMOCixLY7D5_5-7LkIaXO6AGEaV2mRdTOMY/edit?usp=sharing)　Kaggle: [Link](https://www.kaggle.com/c/ml2022spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine tune a pretrained model on downstream task using transformers

- Todo
    - Fine tune a pretrained chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (tesla t4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2.5hrs
  

## Download Dataset

In [1]:
# Download link 1
# !gdown --id '1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb' --output hw7_data.zip

# Download Link 2 (if the above link fails)
# !gdown --id '1qwjbRjq481lHsnTrrF4OjKQnxzgoLEFR' --output hw7_data.zip

# Download Link 3 (if the above link fails)
# !gdown --id '1QXuWjNRZH6DscSd6QcRER0cnxmpZvijn' --output hw7_data.zip

# !unzip -o hw7_data.zip

# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
# import torch
# torch.cuda.empty_cache()

!nvidia-smi
!kill -9 762102 

Sun Dec  3 17:56:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8    37W / 370W |    687MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [2]:
# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0

[33mDEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m[33mDEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m

## Import Packages

In [3]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
	  torch.manual_seed(seed)
	  if torch.cuda.is_available():
		    torch.cuda.manual_seed(seed)
		    torch.cuda.manual_seed_all(seed)
	  np.random.seed(seed)
	  random.seed(seed)
	  torch.backends.cudnn.benchmark = False
	  torch.backends.cudnn.deterministic = True
same_seeds(0)

In [4]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)
fp16_training = False

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

[33mDEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m[33mDEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m

## Load Model and Tokenizer






In [5]:
# model = BertForQuestionAnswering.from_pretrained("uer/roberta-base-chinese-extractive-qa").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("uer/roberta-base-chinese-extractive-qa")
from transformers import AutoModelForQuestionAnswering

# model_name = "albert_chinese_large"
# model_name = "chinese_roberta_large"
model_name = "NchuNLPChinese-QA"
# model_name = "chinese_roberta_extra"
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_name)


# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

## Read Data

- Training set: 31690 QA pairs
- Dev set: 4131  QA pairs
- Test set: 4957  QA pairs

- {train/dev/test}_questions:
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs:
  - List of strings
  - paragraph_ids in questions correspond to indexs in paragraphs
  - A paragraph may be used by several questions

In [6]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

## Tokenize Data

In [7]:
# Tokenize questions and paragraphs separately
# 「add_special_tokens」 is set to False since special tokens will be added when tokenized questions and paragraphs are combined in datset __getitem__

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False)

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

# You can safely ignore the warning message as tokenized sequences will be futher processed in datset __getitem__ before passing to model

Token indices sequence length is longer than the specified maximum sequence length for this model (570 > 512). Running this sequence through the model will result in indexing errors


## Dataset and Dataloader

In [8]:
doc_stride = 16
train_batch_size = 32 # According to the size of the model
max_paragraph_len = 150 # According to the ability of model

class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs, doc_stride = doc_stride, max_paragraph_len = max_paragraph_len):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 40
        self.max_paragraph_len = max_paragraph_len

        ##### TODO: Change value of doc_stride #####
        self.doc_stride = doc_stride

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            # A single window is obtained by slicing the portion of paragraph containing the answer
            start_min = max(0, answer_end_token - self.max_paragraph_len + 1)
            start_max = min(answer_start_token, len(tokenized_paragraph) - self.max_paragraph_len)
            start_max = max(start_min, start_max)
            paragraph_start = random.randint(start_min, start_max + 1)
            paragraph_end = paragraph_start + self.max_paragraph_len

            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]

            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start

            # Pad sequence and obtain inputs to model
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []

            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):

                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]

                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)

                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)

            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len

        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 32

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)

In [9]:
for batch in test_loader:
    break

print(batch[0].shape)

torch.Size([1, 31, 193])


## Function for Evaluation

In [10]:
# def evaluate(data, output):
#     ##### TODO: Postprocessing #####
#     # There is a bug and room for improvement in postprocessing
#     # Hint: Open your prediction file to see what is wrong

#     answer = ''
#     final_start_index = 0
#     final_end_index = 0
#     max_prob = float('-inf')
#     num_of_windows = data[0].shape[1]

#     for k in range(num_of_windows):
#         # Obtain answer by choosing the most probable start position / end position
#         start_prob, start_index = torch.max(output.start_logits[k], dim=0)
#         end_prob, end_index = torch.max(output.end_logits[k], dim=0)

#         # Probability of answer is calculated as sum of start_prob and end_prob
#         prob = start_prob + end_prob

#         # Replace answer if calculated probability is larger than previous windows
#         if prob > max_prob:
#             max_prob = prob
#             # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
#             answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
#             final_start_index = start_index
#             final_end_index = end_index

#     # Remove spaces in answer (e.g. "大 金" --> "大金")
#     return answer.replace(' ',''), final_start_index, final_end_index

def evaluate(data, output, doc_stride = doc_stride,  token_type_ids = None, paragraph = None, paragraph_tokenized = None):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing
    # Hint: Open your prediction file to see what is wrong

    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1] 
    # Because the batch's shape is [1, num_of_windows, max_seq_len]
    # batch include: input_token, token_type (question:0, paragraph:1), attention_mask

    MAX_ANSWER_LENGTH = 60  # This should be set according to the model's training data

    def is_valid_answer(start_index, end_index, max_answer_length):
        """Check if the answer length is within the allowable range."""
        return end_index >= start_index and (end_index - start_index + 1) <= max_answer_length
    
    for k in range(num_of_windows):
        # Obtain answer by choosing the most probable start position / end position
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)

        # First we need to carry out postprocessing
        token_type_id = data[1][0][k].detach().cpu().numpy()
        #Because batchsize = 1, and we can't call .numpy() method on GPU stored tensor
        paragraph_start = token_type_id.argmax() # It returns the first 1, i.e. the start of the paragraph
        paragraph_end = len(token_type_id) - 1 - token_type_id[::-1].argmax() -1 
        # [::-1] is used to inverse the numpy array

        if(start_index > end_index or start_index < paragraph_start or end_index > paragraph_end):
            continue

        # Ensure the position is correct, we process the special tokens
        prob = start_prob + end_prob

        # Replace answer if calculated probability is larger than previous windows
        if prob > max_prob:
            max_prob = prob
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])

            original_start = start_index - paragraph_start + k * doc_stride
            original_end = end_index - paragraph_start + k * doc_stride

    # Remove spaces in answer (e.g. "大 金" --> "大金")
    answer = answer.replace(' ','')
    if '[UNK]' in answer:
        print("Detect [UNK] in answer, we use original context instead")
        print(f"The original answer is:{answer}")
        # .token_to_char() returns the ctoken's corresponding character's index in the context, the return is an interval corresponding to the original interval
        raw_start = paragraph_tokenized.token_to_chars(original_start)[0]
        raw_end = paragraph_tokenized.token_to_chars(original_end)[1]
        # Then we don't need to +1 on the end position
        answer = paragraph[raw_start:raw_end]
        print("The original context's answer is:", answer)
        print('--'.center(80,'-'))
    
    return answer


## Training

In [11]:
import transformers
num_epoch = 4
acc_steps = 1 # Used as Gradient Accumulation
validation = True
logging_step = 100
learning_rate = 1e-5
num_warmup_steps = 100
num_training_steps = num_epoch * len(train_loader) * 0.75
# num_training_steps = 1000
optimizer = AdamW(model.parameters(), lr=learning_rate)
schedular = transformers.get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps//acc_steps)

if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

train_loss_list = []
train_acc_list = []
if validation:
    dev_acc_list = []
  

model.train()

print("Start Training ...")
step = 1

for epoch in range(num_epoch):
   
    train_loss = train_acc = 0
    optimizer.zero_grad()
    for data in tqdm(train_loader):
        # Load all data into GPU
        data = [i.to(device) for i in data]
        # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
        # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)
        output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

        # Choose the most probable start position / end position
        start_index = torch.argmax(output.start_logits, dim=1)
        end_index = torch.argmax(output.end_logits, dim=1)

        # Prediction is correct only if both start_index and end_index are correct
        train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
        train_loss += output.loss

        if fp16_training:
            accelerator.backward(output.loss)
        else:
            output.loss.backward()

        step += 1
        # Apply Gradient Accumulation
        if step % acc_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            schedular.step()

        # Print training loss and accuracy over past logging step
        if step % logging_step == 0:
            lr = optimizer.state_dict()['param_groups'][0]['lr']
            print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}, lr = {lr:.2e}")
            train_loss_list.append(train_loss)
            train_acc_list.append(train_acc)
            train_loss = train_acc = 0


    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
                # prediction is correct only if answer text exactly matches
                answer = evaluate(data, output, doc_stride=doc_stride,paragraph=dev_paragraphs[dev_questions[i]["paragraph_id"]],
                                 paragraph_tokenized=dev_paragraphs_tokenized[dev_questions[i]["paragraph_id"]])
                
                dev_acc +=  answer == dev_questions[i]["answer_text"]
                if  answer != dev_questions[i]["answer_text"]:
                    print('***********Wrong Answer for question:', i, "********************")
                    print('Paragraphs:', dev_paragraphs[dev_questions[i]["paragraph_id"]])
                    print('Question:', dev_questions[i]["question_text"])
                    print('Ground Truth:', dev_questions[i]["answer_text"])
                    print('Prediction:', answer)
                    print('\n\n')

            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
            dev_acc_list.append(100 * dev_acc / len(dev_loader))
        model.train()

# Save a model and its configuration file to the directory 「saved_model」
# i.e. there are two files under the direcory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
# Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
print("Saving Model ...")
model_save_dir = "ALBERRT_epoch4"
model.save_pretrained(model_save_dir)

Start Training ...


  0%|          | 0/991 [00:00<?, ?it/s]



Epoch 1 | Step 100 | loss = 2.095, acc = 0.783, lr = 9.90e-06
Epoch 1 | Step 200 | loss = 0.515, acc = 0.837, lr = 9.66e-06
Epoch 1 | Step 300 | loss = 0.439, acc = 0.856, lr = 9.31e-06
Epoch 1 | Step 400 | loss = 0.411, acc = 0.857, lr = 8.96e-06
Epoch 1 | Step 500 | loss = 0.399, acc = 0.852, lr = 8.61e-06
Epoch 1 | Step 600 | loss = 0.411, acc = 0.851, lr = 8.26e-06
Epoch 1 | Step 700 | loss = 0.387, acc = 0.861, lr = 7.92e-06
Epoch 1 | Step 800 | loss = 0.359, acc = 0.867, lr = 7.57e-06
Epoch 1 | Step 900 | loss = 0.364, acc = 0.872, lr = 7.22e-06
Evaluating Dev Set ...


  0%|          | 0/4131 [00:00<?, ?it/s]

***********Wrong Answer for question: 20 ********************
Paragraphs: 中華民國國民政府是中華民國在訓政時期的中央政府與最高行政機關，由原中華民國陸海軍大元帥大本營改組而來。1925年3月孫中山逝世後，於7月1日將原孫中山陸海空軍大元帥府改組而成，通稱「廣東革命政府」:2007。國民政府結束於1948年5月20日。在1925年成立後至1928年之間，其與北京的北洋政府相互對峙。1927年1月隨著北伐戰爭勝利，國民政府遷至武漢，稱「武漢政府」:2007。同年4月12日蔣介石發動清黨，4月18日在南京另立國民政府；7月15日汪精衛武漢政府宣布反共，與南京政府合流:2007。1928年北伐統一全國後，成為唯一代表中國的合法政府。1937年至1945年領導中國進行抗日戰爭，而於1941年珍珠港事件之後開始與同盟國共同對抗軸心國。1948年5月20日，依循《中華民國憲法》選出的第一任總統、副總統正式就職，國民政府即改組為中華民國政府，國民政府主席一職也改為總統，與訓政時期一起走入歷史。現今的中華民國總統府為其機關法人的延續。國民政府是中國國民黨依據孫文所著《國民政府建國大綱》建立之政府機構，由中國國民黨一黨專政，主要職位均由中國國民黨黨員擔任，但亦接納中國國民黨以外之人士參與。其存在期間，中國國民黨內部發生多次衝突與分裂，導致部分出走黨員自行成立不同的「國民政府」。此外在行憲後，中國國民黨執政的中華民國政府，也經常在眾多場合稱為「國民政府」。
Question: 7月15日汪精衛武漢政府宣布反共，與哪個政府合流?
Ground Truth: 南京
Prediction: 南京政府



***********Wrong Answer for question: 155 ********************
Paragraphs: 混合動力車輛是使用兩種或以上能量來源驅動的車輛，而驅動系統可以有一套或多套。常用的能量來源有燃油、電池、燃料電池、太陽能電池、壓縮氣體等，而常用的驅動系統包含內燃機、電動機、渦輪機等技術。使用燃油驅動內燃機加上電池驅動電動機的混合動力車稱為油電混合動力車，簡稱HEV，目前市面上的混合動力車多屬此種。油電混合動力車普遍比同型純內燃機車輛有更好的燃油效率及加速表現，被視為較環

  0%|          | 0/991 [00:00<?, ?it/s]

Epoch 2 | Step 1000 | loss = 0.023, acc = 0.071, lr = 6.87e-06
Epoch 2 | Step 1100 | loss = 0.311, acc = 0.878, lr = 6.52e-06
Epoch 2 | Step 1200 | loss = 0.297, acc = 0.883, lr = 6.17e-06
Epoch 2 | Step 1300 | loss = 0.307, acc = 0.876, lr = 5.83e-06
Epoch 2 | Step 1400 | loss = 0.297, acc = 0.887, lr = 5.48e-06
Epoch 2 | Step 1500 | loss = 0.296, acc = 0.886, lr = 5.13e-06
Epoch 2 | Step 1600 | loss = 0.298, acc = 0.886, lr = 4.78e-06
Epoch 2 | Step 1700 | loss = 0.292, acc = 0.885, lr = 4.43e-06
Epoch 2 | Step 1800 | loss = 0.277, acc = 0.887, lr = 4.09e-06
Epoch 2 | Step 1900 | loss = 0.252, acc = 0.898, lr = 3.74e-06
Evaluating Dev Set ...


  0%|          | 0/4131 [00:00<?, ?it/s]

***********Wrong Answer for question: 20 ********************
Paragraphs: 中華民國國民政府是中華民國在訓政時期的中央政府與最高行政機關，由原中華民國陸海軍大元帥大本營改組而來。1925年3月孫中山逝世後，於7月1日將原孫中山陸海空軍大元帥府改組而成，通稱「廣東革命政府」:2007。國民政府結束於1948年5月20日。在1925年成立後至1928年之間，其與北京的北洋政府相互對峙。1927年1月隨著北伐戰爭勝利，國民政府遷至武漢，稱「武漢政府」:2007。同年4月12日蔣介石發動清黨，4月18日在南京另立國民政府；7月15日汪精衛武漢政府宣布反共，與南京政府合流:2007。1928年北伐統一全國後，成為唯一代表中國的合法政府。1937年至1945年領導中國進行抗日戰爭，而於1941年珍珠港事件之後開始與同盟國共同對抗軸心國。1948年5月20日，依循《中華民國憲法》選出的第一任總統、副總統正式就職，國民政府即改組為中華民國政府，國民政府主席一職也改為總統，與訓政時期一起走入歷史。現今的中華民國總統府為其機關法人的延續。國民政府是中國國民黨依據孫文所著《國民政府建國大綱》建立之政府機構，由中國國民黨一黨專政，主要職位均由中國國民黨黨員擔任，但亦接納中國國民黨以外之人士參與。其存在期間，中國國民黨內部發生多次衝突與分裂，導致部分出走黨員自行成立不同的「國民政府」。此外在行憲後，中國國民黨執政的中華民國政府，也經常在眾多場合稱為「國民政府」。
Question: 7月15日汪精衛武漢政府宣布反共，與哪個政府合流?
Ground Truth: 南京
Prediction: 南京政府



***********Wrong Answer for question: 163 ********************
Paragraphs: 與此密切相關的問題是，什麼才算是一個好的科學解釋。除了提供對未來事件的預測，社會往往需要科學理論為經常發生或已經發生的事件提供解釋。哲學家們對「一個科學理論成功地解釋了一個現象」以及「一個科學理論具有解釋力」之說法所憑依的標準進行了調查研究。演繹-律則模型是一個早期的，有影響力的科學解釋的理論。它說，一個成功的科學解釋必須能從一個科學定律推斷出某個現象的發生。這種觀點受到

  0%|          | 0/991 [00:00<?, ?it/s]

Epoch 3 | Step 2000 | loss = 0.043, acc = 0.152, lr = 3.39e-06
Epoch 3 | Step 2100 | loss = 0.250, acc = 0.895, lr = 3.04e-06
Epoch 3 | Step 2200 | loss = 0.256, acc = 0.890, lr = 2.69e-06
Epoch 3 | Step 2300 | loss = 0.221, acc = 0.907, lr = 2.35e-06
Epoch 3 | Step 2400 | loss = 0.241, acc = 0.901, lr = 2.00e-06
Epoch 3 | Step 2500 | loss = 0.246, acc = 0.894, lr = 1.65e-06
Epoch 3 | Step 2600 | loss = 0.253, acc = 0.891, lr = 1.30e-06
Epoch 3 | Step 2700 | loss = 0.255, acc = 0.891, lr = 9.54e-07
Epoch 3 | Step 2800 | loss = 0.251, acc = 0.902, lr = 6.06e-07
Epoch 3 | Step 2900 | loss = 0.241, acc = 0.890, lr = 2.58e-07
Evaluating Dev Set ...


  0%|          | 0/4131 [00:00<?, ?it/s]

***********Wrong Answer for question: 12 ********************
Paragraphs: 康有為提出舉兵勤王計畫得到梁啟超、孫中山合作與支持，康試圖通過此舉令光緒執政，但孫想建立共和。孫中山堅持推翻滿清，試圖說服李鴻章據兩廣宣布獨立，進行和平改革。梁啟超為了調和康、孫二人矛盾，提出推舉光緒為共和國首任總統，以求兩者兼全。光緒二十五年冬，梁啟超的學生唐才常、林錫圭等人從日本歸國。翌年春在上海成立自立會，接受康有為、梁啟超、孫中山的指導，聯絡哥老會與農民入會。梁啟超將會黨口號「扶清滅洋」改為「救國自立」。光緒二十六年七月初一，唐才常籌劃中國議會在上海愚園成立，推選容閎出任議長。為執行合作勤王計畫，梁啟超自任總指揮，唐才常策劃自立軍定於七月十五起兵。七月廿六，梁啟超由日本急往上海，得知仍未收到康有為的軍餉，推遲於七月廿九起兵，對康有為極為不滿。但秦力山、沈藎不知起兵日期推遲，仍於七月十五在安徽大通、湖北新堤起事，因此暴露秘密，張之洞於七月廿七破獲自立軍在漢口英租界的總部，逮捕唐才常等二十名重要首領，8月23日於武昌滋陽湖畔處決。起義完全失敗，梁啟超留上海十天，南下香港前往新加坡。梁認為康有為故意不發軍餉造成，因此去檳榔嶼找康對質，遭到康的駁斥，指責梁與孫中山合作是叛逆行為；在檀香山談情說愛，無心募款；擅作主張分散兵力，導致勤王事敗。
Question: 梁啟超認為合作勤王計畫失敗的原因是?
Ground Truth: 康有為故意不發軍餉造成
Prediction: 主張分散兵力



***********Wrong Answer for question: 20 ********************
Paragraphs: 中華民國國民政府是中華民國在訓政時期的中央政府與最高行政機關，由原中華民國陸海軍大元帥大本營改組而來。1925年3月孫中山逝世後，於7月1日將原孫中山陸海空軍大元帥府改組而成，通稱「廣東革命政府」:2007。國民政府結束於1948年5月20日。在1925年成立後至1928年之間，其與北京的北洋政府相互對峙。1927年1月隨著北伐戰爭勝利，國民政府遷至武漢，稱「武漢政府」:2007。同年4月12日蔣介石發動清黨，4月18日在南京另立國民政府；7月15日汪精衛武漢政府宣布反共，與南京政府合流:

  0%|          | 0/991 [00:00<?, ?it/s]

Epoch 4 | Step 3000 | loss = 0.046, acc = 0.240, lr = 0.00e+00
Epoch 4 | Step 3100 | loss = 0.227, acc = 0.903, lr = 0.00e+00
Epoch 4 | Step 3200 | loss = 0.210, acc = 0.911, lr = 0.00e+00
Epoch 4 | Step 3300 | loss = 0.237, acc = 0.899, lr = 0.00e+00
Epoch 4 | Step 3400 | loss = 0.227, acc = 0.902, lr = 0.00e+00
Epoch 4 | Step 3500 | loss = 0.224, acc = 0.906, lr = 0.00e+00
Epoch 4 | Step 3600 | loss = 0.239, acc = 0.895, lr = 0.00e+00
Epoch 4 | Step 3700 | loss = 0.231, acc = 0.906, lr = 0.00e+00
Epoch 4 | Step 3800 | loss = 0.250, acc = 0.893, lr = 0.00e+00
Epoch 4 | Step 3900 | loss = 0.231, acc = 0.902, lr = 0.00e+00
Evaluating Dev Set ...


  0%|          | 0/4131 [00:00<?, ?it/s]

***********Wrong Answer for question: 12 ********************
Paragraphs: 康有為提出舉兵勤王計畫得到梁啟超、孫中山合作與支持，康試圖通過此舉令光緒執政，但孫想建立共和。孫中山堅持推翻滿清，試圖說服李鴻章據兩廣宣布獨立，進行和平改革。梁啟超為了調和康、孫二人矛盾，提出推舉光緒為共和國首任總統，以求兩者兼全。光緒二十五年冬，梁啟超的學生唐才常、林錫圭等人從日本歸國。翌年春在上海成立自立會，接受康有為、梁啟超、孫中山的指導，聯絡哥老會與農民入會。梁啟超將會黨口號「扶清滅洋」改為「救國自立」。光緒二十六年七月初一，唐才常籌劃中國議會在上海愚園成立，推選容閎出任議長。為執行合作勤王計畫，梁啟超自任總指揮，唐才常策劃自立軍定於七月十五起兵。七月廿六，梁啟超由日本急往上海，得知仍未收到康有為的軍餉，推遲於七月廿九起兵，對康有為極為不滿。但秦力山、沈藎不知起兵日期推遲，仍於七月十五在安徽大通、湖北新堤起事，因此暴露秘密，張之洞於七月廿七破獲自立軍在漢口英租界的總部，逮捕唐才常等二十名重要首領，8月23日於武昌滋陽湖畔處決。起義完全失敗，梁啟超留上海十天，南下香港前往新加坡。梁認為康有為故意不發軍餉造成，因此去檳榔嶼找康對質，遭到康的駁斥，指責梁與孫中山合作是叛逆行為；在檀香山談情說愛，無心募款；擅作主張分散兵力，導致勤王事敗。
Question: 梁啟超認為合作勤王計畫失敗的原因是?
Ground Truth: 康有為故意不發軍餉造成
Prediction: 主張分散兵力



***********Wrong Answer for question: 20 ********************
Paragraphs: 中華民國國民政府是中華民國在訓政時期的中央政府與最高行政機關，由原中華民國陸海軍大元帥大本營改組而來。1925年3月孫中山逝世後，於7月1日將原孫中山陸海空軍大元帥府改組而成，通稱「廣東革命政府」:2007。國民政府結束於1948年5月20日。在1925年成立後至1928年之間，其與北京的北洋政府相互對峙。1927年1月隨著北伐戰爭勝利，國民政府遷至武漢，稱「武漢政府」:2007。同年4月12日蔣介石發動清黨，4月18日在南京另立國民政府；7月15日汪精衛武漢政府宣布反共，與南京政府合流:

KeyboardInterrupt: 

## Save the training trajectory

In [None]:
import pickle

# Save the list to a file
def save_list(list_to_save, file_name):
    with open(file_name, 'wb') as f:
        pickle.dump(list_to_save, f)

# Load the list from a file
def load_list(file_name):
    with open(file_name, 'rb') as f:
        return pickle.load(f)

# Example usage:
save_list(train_acc_list, model_name + 'train_acc_list.pkl')
save_list(train_loss_list, model_name + 'train_loss_list.pkl')
save_list(dev_acc_list, model_name + 'dev_acc_list.pkl')
loaded_train_acc_list = load_list(model_name + 'train_acc_list.pkl')
print(loaded_train_acc_list)


[tensor(51., device='cuda:0'), tensor(67.9062, device='cuda:0'), tensor(70.1562, device='cuda:0'), tensor(70.3438, device='cuda:0'), tensor(72.5312, device='cuda:0'), tensor(74.5938, device='cuda:0'), tensor(74.0625, device='cuda:0'), tensor(75.3750, device='cuda:0'), tensor(73.6875, device='cuda:0'), tensor(6.0625, device='cuda:0'), tensor(77.3438, device='cuda:0'), tensor(78.0938, device='cuda:0'), tensor(78.5312, device='cuda:0'), tensor(78.0625, device='cuda:0'), tensor(78.0312, device='cuda:0'), tensor(78.0625, device='cuda:0'), tensor(78.8438, device='cuda:0'), tensor(78.0938, device='cuda:0'), tensor(78., device='cuda:0'), tensor(13.8750, device='cuda:0'), tensor(82.5000, device='cuda:0'), tensor(80.3750, device='cuda:0'), tensor(80.0312, device='cuda:0'), tensor(81.1250, device='cuda:0'), tensor(81.4062, device='cuda:0'), tensor(81.1250, device='cuda:0'), tensor(80.9688, device='cuda:0'), tensor(81.8438, device='cuda:0'), tensor(81.0625, device='cuda:0'), tensor(22.1562, device

## Plotting

In [None]:
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
import numpy as np
import os

def plot_metrics(epochs, train_loss_list_dict, train_acc_list_dict, dev_acc_list, lr, batchsize,save_dir = 'plots'):
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    regular_font_path = 'times.ttf'
    bold_font_path = "timesbd.ttf"
    regular_font_prop = font_manager.FontProperties(fname = regular_font_path, size = 20)
    bold_font_prop = font_manager.FontProperties(fname = bold_font_path, size = 20)

    total_step = num_epoch * len(train_loader)
    train_x_axis = np.linspace(0, total_step, logging_step)
    dev_x_axis = np.linspace(0, total_step, len(train_loader))

    plt.figure(figsize = (8,6))
    for model_name, train_acc_list in train_acc_list_dict:
        plt.plot(train_x_axis, train_acc_list, marker = 'x', label = f'{model_name} Train Acc')
    plt.plot(dev_x_axis, dev_acc_list, marker = '+', label = 'Test Acc')

    plt.xlabel('Steps', fontproperties=regular_font_prop)
    plt.ylabel('Accuracy(EM)', fontproperties=regular_font_prop)

    plt.legend(prop = bold_font_prop)
    plt.grid(True, linstyle = '--', linewidth = 0.5)

    plt.show()
    acc_filepath = os.path.join(save_dir,f'acc_plot_lr{lr}_epoch{epochs}_bs{batchsize}.png')
    plt.savefig(acc_filepath)


    # Plot the loss graph
    plt.figure(figsize = (8,6))
    for model_name, train_loss_list in train_loss_list_dict:
        plt.plot(train_x_axis, train_loss_list, marker = 'x', label = 'Train Acc')

    plt.xlabel('Steps', fontproperties=regular_font_prop)
    plt.ylabel('Loss', fontproperties=regular_font_prop)

    plt.legend(prop = bold_font_prop)
    plt.grid(True, linstyle = '--', linewidth = 0.5)

    plt.show()
    loss_filepath = os.path.join(save_dir,f'loss_plot_lr{lr}_epoch{epochs}_bs{batchsize}.png')
    plt.savefig(loss_filepath)      


train_loss_list

[tensor(163.3258, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(86.4376, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(76.0713, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(76.9452, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(69.5240, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(63.0207, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(62.9945, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(61.7826, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(61.2451, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(4.3286, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(53.2177, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(51.3471, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(50.4149, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(51.6739, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(51.6222, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(53.4528, device='cuda:0', grad_fn=<AddBackward0>),
 tensor(50.5624, device='cuda:0', grad_fn=<AddBackward0>

## Testing

In [14]:
# print("Evaluating Test Set ...")

# result = []
# start_position_list = []
# end_position_list = []

# model.eval()
# with torch.no_grad():
#     for data in tqdm(test_loader):
#         output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
#                        attention_mask=data[2].squeeze(dim=0).to(device))
#         test_result, start_position, end_position = evaluate(data, output)
#         result.append(test_result)
#         start_position_list.append(start_position)
#         end_position_list.append(end_position)

# result_file = "result.csv"
# with open(result_file, 'w') as f:
# 	  f.write("ID,Answer\n")
# 	  for i, test_question in enumerate(test_questions):
#         # Replace commas in answers with empty strings (since csv is separated by comma)
#         # Answers in kaggle are processed in the same way
# 		    f.write(f"{test_question['id']},{result[i].replace(',','')},{start_position_list[i]},{end_position_list[i]}\n")

# print(f"Completed! Result is in {result_file}")

from transformers import AutoModelForQuestionAnswering
print("Evaluating Test Set ...")

result = []
# model = AutoModelForQuestionAnswering.from_pretrained("saved_model").to(device)
model.eval()
with torch.no_grad():
    for i,data in enumerate(tqdm(test_loader)):
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        result.append(evaluate(data, output, doc_stride=doc_stride, paragraph=test_paragraphs[test_questions[i]["paragraph_id"]],
                               paragraph_tokenized=test_paragraphs_tokenized[test_questions[i]["paragraph_id"]]))


result_file = "result_revised_13_Chinese_ALBERT_stride32_epoch4_postprocess.csv"
with open(result_file, 'w') as f:
	  f.write("ID,Answer\n")
	  for i, test_question in enumerate(test_questions):
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
		    f.write(f"{test_question['id']},{result[i].replace(',','')}\n")


print(f"Completed! Result is in {result_file}")

Evaluating Test Set ...


  0%|          | 0/4957 [00:00<?, ?it/s]

Detect [UNK] in answer, we use original context instead
The original answer is:拉丁文[UNK]
The original context's answer is: 拉丁文Civilis
--------------------------------------------------------------------------------
Detect [UNK] in answer, we use original context instead
The original answer is:大型購物中心[UNK]開幕
The original context's answer is: 大型購物中心MegaBox開幕
--------------------------------------------------------------------------------
Detect [UNK] in answer, we use original context instead
The original answer is:溥[UNK]
The original context's answer is: 溥儁
--------------------------------------------------------------------------------
Detect [UNK] in answer, we use original context instead
The original answer is:目前沒有觀察到任何語言純[UNK]以力道來區分不同輔音
The original context's answer is: 目前沒有觀察到任何語言純綷以力道來區分不同輔音
--------------------------------------------------------------------------------
Detect [UNK] in answer, we use original context instead
The original answer is:[UNK]人國
The original context's an

In [15]:
# !kill -9 744261
# !kill -9 747445
# !kill -9 748137

# !nvidia-smi
# torch.cuda.empty_cache()
