<a href="https://colab.research.google.com/github/jeremy-feng/deep-learning-coursework/blob/main/2-BERT/BERT-QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

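# Assignment 2: Fine-tune BERT for a QA task

**If you are not familiar with BERT, please first watch the video [BERT 论文逐段精读【论文精读】](https://www.bilibili.com/video/BV1PL411M7eQ) (a paragraph-by-paragraph reading of the BERT paper).**

Note: this assignment does not require any prior knowledge of the Transformer. If you are interested, you can preview [Transformer论文逐段精读【论文精读】](https://www.bilibili.com/video/BV1pu411o7BE) (a paragraph-by-paragraph reading of the Transformer paper) before watching the BERT video; the Transformer will be covered in later lectures.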



In [7]:
# from google.colab import drive
# drive.mount('/content/drive')

In [8]:
!pwd

/kaggle/working


In [9]:
import pandas as pd
import json
from tqdm import tqdm
import torch
import numpy as np
import random

device = "cuda" if torch.cuda.is_available() else "cpu"


# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


same_seeds(0)

In [10]:
with open('/kaggle/input/cmrc2018/train.json') as f:
    train = json.load(f)

with open('/kaggle/input/cmrc2018/dev.json') as f:
    dev = json.load(f)


Let's have a glance at the data.

In [11]:
!pip install transformers
from transformers import BertTokenizerFast, BertForQuestionAnswering

# You can explore more pretrained models from https://huggingface.co/models
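# Load the tokenizer of the pretrained model 'bert-base-chinese' with Hugging Face's BertTokenizerFast.
# 'bert-base-chinese' is a Chinese BERT model; its tokenizer converts Chinese text into token IDs for NLP tasks such as text classification, sequence labeling, and QA.
# BertTokenizerFast is the Rust-backed counterpart of BertTokenizer: it is faster and also provides character/token alignment methods such as char_to_token.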
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = BertForQuestionAnswering.from_pretrained('bert-base-chinese').to(device)

# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-chinese a

## PreProcessing

### Prepare training data

In [12]:
paragraphs = []
questions = []
start_positions = []
end_positions = []
for paragraph in train['data']:
    for qa in paragraph['paragraphs'][0]['qas']:
        
        ### START CODE HERE ### 
        # For each question, add its paragraph, question, start_position and end_position(after calculation) to its corresponding list.
        paragraphs.append(paragraph['paragraphs'][0]['context'])
        questions.append(qa['question'])
        start_position = qa['answers'][0]['answer_start']
        start_positions.append(start_position)
        answer_length = len(qa['answers'][0]['text'])
        end_positions.append(start_position + answer_length)
        ### END CODE HERE ###
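
As a quick sanity check on the offsets collected above, the following minimal sketch verifies that slicing the context with `[start, end)` reproduces the answer text. The `toy` dictionary is a made-up record that only mirrors the nesting read by the loop above (`data`, then `paragraphs`, then `qas`, then `answers`), and `find_misaligned` is a hypothetical helper, not part of the assignment. The second answer is misaligned on purpose so that the check has something to report.

```python
# Toy record imitating the layout read by the loop above (contents are invented).
toy = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "BERT was released in 2018 by Google.",
                    "qas": [
                        {"question": "When was BERT released?",
                         "answers": [{"text": "2018", "answer_start": 21}]},
                        # Deliberately off by one: "Google" really starts at index 29.
                        {"question": "Who released BERT?",
                         "answers": [{"text": "Google", "answer_start": 30}]},
                    ],
                }
            ]
        }
    ]
}


def find_misaligned(dataset):
    """Return (question, expected, actual) for answers whose span does not match the context."""
    problems = []
    for article in dataset["data"]:
        para = article["paragraphs"][0]
        context = para["context"]
        for qa in para["qas"]:
            answer = qa["answers"][0]
            start = answer["answer_start"]
            end = start + len(answer["text"])  # exclusive end, as in the loop above
            if context[start:end] != answer["text"]:
                problems.append((qa["question"], answer["text"], context[start:end]))
    return problems


problems = find_misaligned(toy)
print(problems)
```

Running the same kind of check on the real `train` dictionary would be one way to confirm that `start_positions` and `end_positions` line up with the answers before tokenization.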

In [13]:
questions[:5]

['范廷颂是什么时候被任为主教的？',
 '1990年，范廷颂担任什么职务？',
 '范廷颂是于何时何地出生的？',
 '1994年3月，范廷颂担任什么职务？',
 '范廷颂是何时去世的？']

In [14]:
start_positions[:5]

[30, 41, 97, 548, 759]

In [15]:
end_positions[:5]

[35, 62, 126, 598, 780]

Check that the answer span for the third question is correct.

In [16]:
paragraphs[0][start_positions[2]: end_positions[2]]

'范廷颂于1919年6月15日在越南宁平省天主教发艳教区出生'

Encode `paragraphs` and `questions`.
Reference: [https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__)

In [17]:
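# The code below calls the Hugging Face tokenizer to convert paragraphs and questions into tokens.
# It returns a dict-like object (train_encodings) holding input_ids, token_type_ids, and attention_mask.
# return_tensors='pt' returns PyTorch tensors.
# padding=True pads shorter sequences up to the longest sequence in the batch (at most max_length).
# truncation=True truncates sequences that exceed max_length.
# As a result, every encoded sequence has at most 512 tokens.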
train_encodings = tokenizer(
    paragraphs,
    questions,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)
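
With `truncation=True`, tokens beyond position 512 are dropped, so an answer located late in a long paragraph can be lost. One common workaround, sketched here in plain Python on token lists, is to split the paragraph into overlapping windows. `make_windows`, `max_len`, and `overlap` are illustrative names chosen for this sketch, and the window layout (`[CLS] paragraph chunk [SEP] question [SEP]`) follows the pair order used in this notebook.

```python
def make_windows(paragraph_tokens, question_tokens, max_len=512, overlap=128):
    """Split a long paragraph into overlapping chunks that each fit next to the question.

    Each window is laid out as [CLS] chunk [SEP] question [SEP], so the chunk may
    use at most max_len - len(question_tokens) - 3 positions.
    Returns a list of (offset, window_tokens); offset is the chunk's start index
    in the original paragraph.
    """
    budget = max_len - len(question_tokens) - 3
    if budget <= overlap:
        raise ValueError("max_len is too small for this question and overlap")
    windows = []
    start = 0
    while True:
        chunk = paragraph_tokens[start:start + budget]
        window = ["[CLS]"] + chunk + ["[SEP]"] + question_tokens + ["[SEP]"]
        windows.append((start, window))
        if start + budget >= len(paragraph_tokens):
            break
        start += budget - overlap  # step forward, keeping `overlap` tokens of context
    return windows


paragraph = [f"t{i}" for i in range(20)]
question = ["q0", "q1"]
windows = make_windows(paragraph, question, max_len=13, overlap=3)
for offset, window in windows:
    print(offset, len(window))
```

During training, one option is to keep the windows that contain the answer and shift the answer's token position by the window offset; at prediction time, every window can be scored and the highest-scoring span kept. Fast tokenizers expose similar behavior through the `return_overflowing_tokens` and `stride` arguments.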


In [18]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

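- In a QA task, `input_ids` is the input text converted into a sequence of integers: each character or subword is mapped to a unique integer ID. The special tokens [CLS] and [SEP] each have their own ID (101: [CLS], 102: [SEP]). See the example below.

- `token_type_ids` takes the value 0 or 1: 0 marks tokens of the first text sequence and 1 marks tokens of the second. Because this notebook passes the paragraph first and the question second, 0 corresponds to the paragraph and 1 to the question. Padding tokens also get 0.

- In `attention_mask`, 0 means the token should be ignored and 1 means it should be attended to. When a sequence is shorter than the padded length, meaningless padding tokens are appended to its end, and the `tokenizer` sets their attention mask to 0 so that the model does not attend to them.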
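
To make the three fields concrete, here is a minimal plain-Python sketch that builds them for one paragraph/question pair, with the paragraph first and the question second as in the tokenizer call above. It is not the tokenizer's actual implementation: `encode_pair` and the tiny `vocab` are invented for illustration, and only the IDs 0 (padding), 101 ([CLS]), and 102 ([SEP]) match values seen in this notebook's outputs.

```python
PAD, CLS, SEP = 0, 101, 102


def encode_pair(first_ids, second_ids, pad_to):
    """Lay out [CLS] first [SEP] second [SEP] and pad to `pad_to` positions."""
    input_ids = [CLS] + first_ids + [SEP] + second_ids + [SEP]
    # Segment 0 covers [CLS], the first sequence and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = pad_to - len(input_ids)
    if n_pad < 0:
        raise ValueError("pad_to is shorter than the encoded pair")
    return {
        "input_ids": input_ids + [PAD] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,  # padding uses segment 0
        "attention_mask": attention_mask + [0] * n_pad,  # padding is masked out
    }


vocab = {"北": 1, "京": 2, "在": 3, "哪": 4, "？": 5}  # made-up IDs
paragraph_ids = [vocab[c] for c in "北京"]
question_ids = [vocab[c] for c in "在哪？"]
enc = encode_pair(paragraph_ids, question_ids, pad_to=10)
for name, values in enc.items():
    print(name, values)
```

Summing `token_type_ids` in this toy gives the number of question tokens plus the final [SEP], which is the same counting argument the notebook applies to the real encoding.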

In [19]:
a = tokenizer(
    paragraphs[0],
    questions[0],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)['input_ids'][0]

In [20]:
b = tokenizer(
    paragraphs[0],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)['input_ids'][0]

In [21]:
tokenizer.decode([711, 1921,  712, 3136, 3777, 1079, 2600, 3136, 1277,
        2134, 2429, 5392,  102, 5745, 2455, 7563, 3221,  784,  720, 3198,  952,
        6158,  818,  711,  712, 3136, 4638, 8043,  102])

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


'为 天 主 教 河 内 总 教 区 宗 座 署 [SEP] 范 廷 颂 是 什 么 时 候 被 任 为 主 教 的 ？ [SEP]'

In [22]:
len([5745, 2455, 7563, 3221,  784,  720, 3198,  952,
        6158,  818,  711,  712, 3136, 4638, 8043,  102])

16

The last 16 elements are 1, which shows that the last 16 tokens (the 15 question tokens plus the final [SEP]) belong to the question.

In [23]:
train_encodings['token_type_ids'][0]

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [24]:
train_encodings['token_type_ids'][0].sum()

tensor(16)

In [25]:
tokenizer.decode([711, 1921,  712, 3136, 3777, 1079, 2600, 3136, 1277,
        2134, 2429, 5392, 4415,  809, 1856, 6133, 6421, 3136, 1277, 2600,  712,
        3136, 4638, 4958, 5375,  511, 8447, 2399,  102])

'为 天 主 教 河 内 总 教 区 宗 座 署 理 以 填 补 该 教 区 总 主 教 的 空 缺 。 1994 年 [SEP]'

In [26]:
train_encodings['token_type_ids']

tensor([[0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])

In [27]:
print(train_encodings['input_ids'].shape)
print(train_encodings['token_type_ids'].shape)
print(train_encodings['attention_mask'].shape)

torch.Size([10142, 512])
torch.Size([10142, 512])
torch.Size([10142, 512])


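The code below converts each answer's start and end character indices in the original paragraph into the corresponding start and end token indices in the tokenized `input_ids`.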

In [28]:
# `char_to_token` converts the answer's start/end character positions in the paragraph text to token positions in the tokenized input.
# It returns None when no token corresponds to the character (for example, because it was truncated away); such positions are stored as -1.
# end_positions is exclusive, so x-1 is the last character of the answer.
train_encodings['start_positions'] = torch.tensor([train_encodings.char_to_token(idx, x) if train_encodings.char_to_token(idx, x) is not None else -1
                                      for idx, x in enumerate(start_positions)])
train_encodings['end_positions'] = torch.tensor([train_encodings.char_to_token(idx, x-1) if train_encodings.char_to_token(idx, x-1) is not None else -1
                                    for idx, x in enumerate(end_positions)])
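
The idea behind `char_to_token` can be illustrated without the library: each token covers a half-open character range, and the lookup returns the index of the token whose range contains the character. The sketch below is a toy under stated assumptions, not the library's implementation. `char_to_token_toy` and the hand-written `offsets` list are hypothetical, the digits are grouped into one token to mirror how "1919" appears as a single token in this notebook's decoded output, and for simplicity the token indices ignore the leading [CLS].

```python
def char_to_token_toy(offsets, char_index):
    """Return the index of the token whose [start, end) range covers char_index, else None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


text = "生于1919年6月"
# Hand-written character ranges, one per token.
offsets = [(0, 1), (1, 2), (2, 6), (6, 7), (7, 8), (8, 9)]
tokens = [text[s:e] for s, e in offsets]
print(tokens)

answer_start, answer_end = 2, 7  # character span of "1919年", exclusive end
start_token = char_to_token_toy(offsets, answer_start)
end_token = char_to_token_toy(offsets, answer_end - 1)  # last character of the answer
print(start_token, end_token, "".join(tokens[start_token:end_token + 1]))

# A character beyond the (possibly truncated) text maps to no token.
missing = char_to_token_toy(offsets, 20)
print(-1 if missing is None else missing)
```

The token span is shorter than the character span here because four digit characters collapse into one token, which is why character offsets cannot be used directly as training labels.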

In [29]:
train_encodings['start_positions']

tensor([ 31,  39,  86,  ..., 142, 225,  17])

In [30]:
train_encodings['end_positions']

tensor([ 32,  56, 110,  ..., 143, 244,  19])

Take the third question as an example: using 86 and 110 as the start and end token indices, check which string they correspond to.

In [31]:
questions[2]

'范廷颂是于何时何地出生的？'

In [32]:
paragraphs[0][start_positions[2]: end_positions[2]]

'范廷颂于1919年6月15日在越南宁平省天主教发艳教区出生'

In [33]:
tokenizer.decode(train_encodings['input_ids'][0][86: 110+1])

'范 廷 颂 于 1919 年 6 月 15 日 在 越 南 宁 平 省 天 主 教 发 艳 教 区 出 生'

### Prepare Dataset

In [34]:
import torch
from torch.utils.data import Dataset, DataLoader

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {k: v[idx].to(device) for k, v in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)

Automatic Mixed Precision (AMP) gives the largest speedups on NVIDIA GPUs that have Tensor Cores, which are specialized hardware units for fast matrix multiplication and convolution in deep learning. Tensor Cores are available on the NVIDIA Volta, Turing, and Ampere architectures, which include the following GPU series:

- Volta: Tesla V100, Titan V
- Turing: Quadro RTX, GeForce RTX 20-series, Titan RTX
- Ampere: A100, GeForce RTX 30-series, RTX A6000

In [35]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)
fp16_training = True

if fp16_training:
    !pip install accelerate
    from accelerate import Accelerator

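    # Accelerator() takes its mixed-precision setting from the accelerate config or environment;
    # pass mixed_precision="fp16" to request fp16 explicitly.
    accelerator = Accelerator()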
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/


In [36]:
device

device(type='cuda')

In [37]:
next(model.parameters())

Parameter containing:
tensor([[ 0.0262,  0.0109, -0.0187,  ...,  0.0903,  0.0028,  0.0064],
        [ 0.0021,  0.0216,  0.0011,  ...,  0.0809,  0.0018,  0.0249],
        [ 0.0147,  0.0005,  0.0028,  ...,  0.0836,  0.0121,  0.0282],
        ...,
        [ 0.0346,  0.0021,  0.0085,  ...,  0.0085,  0.0337,  0.0099],
        [ 0.0541,  0.0289,  0.0263,  ...,  0.0526,  0.0651,  0.0353],
        [ 0.0200,  0.0023, -0.0089,  ...,  0.0799, -0.0562,  0.0247]],
       device='cuda:0', requires_grad=True)

In [38]:
next(model.parameters()).shape

torch.Size([21128, 768])

In [39]:
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm


train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

### START CODE HERE ### 
# Use AdamW as the optimizer, and learning rate 5e-5.
# https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
### END CODE HERE ### 


In [40]:
model, optim, train_loader = accelerator.prepare(model, optim, train_loader)

In [41]:
model.train()
loss_sum = 0.0
acc_start_sum = 0.0
acc_end_sum = 0.0

In [42]:
for batch_idx, batch in enumerate(tqdm(train_loader)):
    print(batch_idx, batch)
    break

  0%|          | 0/1268 [00:00<?, ?it/s]

0 {'input_ids': tensor([[ 101, 1102, 3777,  ...,    0,    0,    0],
        [ 101, 6437, 1923,  ...,    0,    0,    0],
        [ 101, 4549, 4565,  ..., 4638, 8043,  102],
        ...,
        [ 101, 7032, 7305,  ...,  763, 8043,  102],
        [ 101, 7942, 6586,  ...,    0,    0,    0],
        [ 101, 4482, 3419,  ...,    0,    0,    0]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'), 'start_positions': tensor([ 38, 108,  19,  81,  10, 330, 204,  12], device='cuda:0'), 'end_positions': tensor([ 47, 131,  20,  86,  




In [43]:
batch_idx

0

Because train_loader was created with batch_size=8, each batch here contains 8 examples.



In [44]:
batch

{'input_ids': tensor([[ 101, 1102, 3777,  ...,    0,    0,    0],
         [ 101, 6437, 1923,  ...,    0,    0,    0],
         [ 101, 4549, 4565,  ..., 4638, 8043,  102],
         ...,
         [ 101, 7032, 7305,  ...,  763, 8043,  102],
         [ 101, 7942, 6586,  ...,    0,    0,    0],
         [ 101, 4482, 3419,  ...,    0,    0,    0]], device='cuda:0'),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 1, 1, 1],
         ...,
         [0, 0, 0,  ..., 1, 1, 1],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'),
 'start_positions': tensor([ 38, 108,  19,  81,  10, 330, 204,  12], device='cuda:0'),
 'end_positions': tensor([ 

In [45]:
optim.zero_grad()

In [46]:
input_ids = batch['input_ids']
attention_mask = batch['attention_mask']
start_positions = batch['start_positions']
end_positions = batch['end_positions']

outputs = model(input_ids, attention_mask=attention_mask, 
                start_positions=start_positions, 
                end_positions=end_positions)

In [47]:
outputs

QuestionAnsweringModelOutput(loss=tensor(6.3567, device='cuda:0', grad_fn=<DivBackward0>), start_logits=tensor([[-0.0901, -0.0449,  0.6937,  ...,  0.1636,  0.2241,  0.0975],
        [-0.1821, -0.4386,  0.1097,  ..., -0.1325, -0.3646, -0.2178],
        [-0.1493, -0.1064,  0.2222,  ...,  0.2401,  0.1956,  0.3059],
        ...,
        [ 0.2721,  0.1159,  0.1111,  ...,  0.5787,  0.5692,  0.2100],
        [-0.1677,  0.0131,  0.3072,  ..., -0.1343, -0.0099, -0.1005],
        [ 0.0869, -0.1847,  0.7930,  ..., -0.1804, -0.2160, -0.2182]],
       device='cuda:0', grad_fn=<CloneBackward0>), end_logits=tensor([[ 0.6159,  0.2972, -0.0280,  ...,  0.1095, -0.1247,  0.1841],
        [ 0.6317,  0.4158, -0.6848,  ..., -0.0872, -0.0054,  0.0493],
        [ 0.6075,  0.3280, -0.6157,  ..., -0.5438,  0.6205,  0.2260],
        ...,
        [ 0.6398,  0.2011, -0.3014,  ..., -0.2863,  0.5134,  0.3975],
        [ 0.1596,  0.6876, -0.4636,  ..., -0.1172, -0.1384, -0.1386],
        [ 0.5356,  0.5155, -0.3263,  

In [48]:
loss = outputs.loss

In [49]:
loss

tensor(6.3567, device='cuda:0', grad_fn=<DivBackward0>)

In [50]:
accelerator.backward(loss)

In [51]:
optim.step()

In [52]:
loss_sum += loss.item()

For the first QA pair, print the logit of every token being the start position; the larger the value, the more likely that token is the start.

In [53]:
outputs.start_logits[0]

tensor([-0.0901, -0.0449,  0.6937,  0.2657,  0.5751,  0.2879,  0.5472,  0.5595,
         0.3849,  0.4474,  0.0366,  0.2149,  0.3718,  0.5434,  0.4492,  0.8463,
         0.5935,  0.1225,  0.0525,  0.2413,  0.2997,  0.3549,  1.1205,  0.2727,
        -0.0344,  0.4527,  0.2076, -0.0130,  0.3765, -0.1353, -0.5382,  0.1187,
        -0.3140,  0.6122,  0.0115, -0.0217,  0.0154,  0.0155,  0.2311,  0.1784,
        -0.1717,  0.7615,  0.0560,  0.2829, -0.6343, -0.8912,  0.0342, -0.0572,
         0.2255,  0.3893,  0.1140,  0.4778,  0.3645,  0.4438,  0.4550,  0.2165,
        -0.5609, -0.3129, -0.0089,  0.2255,  0.3166,  0.2603, -0.8078, -0.2189,
         0.2226, -0.1790, -0.1002,  0.4215, -0.1709,  0.3791,  0.0687, -0.4009,
         0.6916,  0.4439,  0.4447,  0.2035,  0.3343,  0.3576,  0.5502,  0.7318,
         0.3264, -0.3437, -0.5790,  0.0566, -0.3126, -0.2780, -0.2532,  0.4423,
         0.4679,  0.5663,  0.1061, -0.1086,  0.1776,  0.6165,  0.1101,  0.2595,
         0.5955,  0.2147,  0.2232, -0.15

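The result below shows that for the first QA pair, the token at index 22 is the most likely start (for the fourth pair it is the token at index 306).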

In [54]:
torch.argmax(outputs.start_logits, dim=1)

tensor([ 22, 296, 332, 306, 349, 365, 340,  66], device='cuda:0')

In [55]:
start_pred = torch.argmax(outputs.start_logits, dim=1)
end_pred = torch.argmax(outputs.end_logits, dim=1)

In [56]:
start_pred

tensor([ 22, 296, 332, 306, 349, 365, 340,  66], device='cuda:0')

In [57]:
end_pred

tensor([424, 357, 270, 124,  39, 380,   1,  21], device='cuda:0')
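
Because the start and end positions are chosen by two independent argmax operations, the predicted end can come before the predicted start, as happens for several examples in the batch above (for instance start 332 and end 270). One common remedy, sketched here in plain Python on toy logits, is to score all pairs with `start <= end` jointly and keep the best one. `best_span` and `max_answer_len` are names invented for this sketch.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e < s + max_answer_len."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


start_logits = [0.1, 0.2, 2.0, 0.3, 1.5]
end_logits = [0.0, 2.5, 0.1, 1.0, 0.2]

# Independent argmax: the end index lands before the start index.
naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print(naive)

# Joint search restricted to valid spans.
print(best_span(start_logits, end_logits))
```

The double loop is quadratic in sequence length, which the `max_answer_len` cap keeps manageable; restricting candidates to the top few start and end logits is another frequently used shortcut.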

In [58]:
# model, optim, and train_loader were already passed through accelerator.prepare() above,
# so they are not prepared a second time here.

model.train()
for epoch in range(3):
    loss_sum = 0.0
    acc_start_sum = 0.0
    acc_end_sum = 0.0
    pbar = tqdm(train_loader, desc=f"Epoch {epoch}")
    for batch_idx, batch in enumerate(pbar):
        optim.zero_grad()
        
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        start_positions = batch['start_positions']
        end_positions = batch['end_positions']
        
        outputs = model(input_ids, attention_mask=attention_mask, 
                        start_positions=start_positions, 
                        end_positions=end_positions)
        
        loss = outputs.loss
        if fp16_training:
            accelerator.backward(loss)
        else:
            loss.backward()
        optim.step()
        
        loss_sum += loss.item()
        
        ### START CODE HERE ### 
        # Obtain the answer by choosing the most probable start position / end position.
        # Use `torch.argmax` with its `dim` parameter to extract predictions for the start and end positions.
        start_pred = torch.argmax(outputs.start_logits, dim=1)
        end_pred = torch.argmax(outputs.end_logits, dim=1)
        
        # Calculate accuracy for start and end positions, e.g. compare start_pred with start_positions to get acc_start.
        acc_start = (start_pred == start_positions).float().mean()
        acc_end = (end_pred == end_positions).float().mean()
        ### END CODE HERE ### 
        
        # Accumulate plain Python floats rather than GPU tensors.
        acc_start_sum += acc_start.item()
        acc_end_sum += acc_end.item()
        
        # Update progress bar
        postfix = {
            "loss": f"{loss_sum/(batch_idx+1):.4f}",
            "acc_start": f"{acc_start_sum/(batch_idx+1):.4f}",
            "acc_end": f"{acc_end_sum/(batch_idx+1):.4f}"
        }

        # Add batch accuracy to progress bar
        batch_desc = f"Epoch {epoch}, train loss: {postfix['loss']}"
        pbar.set_postfix_str(f"{batch_desc}, acc start: {postfix['acc_start']}, acc end: {postfix['acc_end']}")


Epoch 0: 100%|██████████| 1268/1268 [09:27<00:00,  2.23it/s, Epoch 0, train loss: 2.0183, acc start: 0.4461, acc end: 0.4412]
Epoch 1: 100%|██████████| 1268/1268 [09:26<00:00,  2.24it/s, Epoch 1, train loss: 1.2921, acc start: 0.5909, acc end: 0.5992]
Epoch 2: 100%|██████████| 1268/1268 [09:26<00:00,  2.24it/s, Epoch 2, train loss: 0.9709, acc start: 0.6605, acc end: 0.6678]


In [63]:
def predict(doc, query):
    print(doc)
    print('提问：', query)
    item = tokenizer([doc, query], max_length=512, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        input_ids = item['input_ids'].to(device).reshape(1,-1)
        attention_mask = item['attention_mask'].to(device).reshape(1,-1)
        
        outputs = model(input_ids[:, :512], attention_mask[:, :512])
        
        ### START CODE HERE ### 
        # Use `torch.argmax` with its `dim` parameter to extract predictions for the start and end positions.
        start_pred = torch.argmax(outputs.start_logits, dim=1)
        end_pred = torch.argmax(outputs.end_logits, dim=1)
        ### END CODE HERE ### 
    
    try:
        start_pred = item.token_to_chars(0, start_pred)
        end_pred = item.token_to_chars(0, end_pred)
    except Exception:
        return ''
    
    if start_pred.start > end_pred.end:
        return ''
    else:
        return doc[start_pred.start:end_pred.end]
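
Note that `predict` calls `tokenizer([doc, query], ...)`, which encodes the two strings as a batch of two separate sequences, while training encoded them as a pair with `tokenizer(paragraphs, questions, ...)`. If the pair form were used at inference, predicted token indices could fall inside the question, so the mapping back to characters would need to check which sequence a token belongs to. The plain-Python sketch below illustrates that check. `span_to_text` is a hypothetical helper, and `offsets` and `sequence_ids` are hand-written stand-ins for what a fast tokenizer can report via `return_offsets_mapping=True` and `BatchEncoding.sequence_ids()`; the toy tokenization is an assumption, not the real tokenizer's output.

```python
def span_to_text(doc, offsets, sequence_ids, start_token, end_token):
    """Map a predicted token span back to a substring of doc.

    offsets[i] is the (start, end) character range of token i within its own sequence;
    sequence_ids[i] is 0 for paragraph tokens, 1 for question tokens, None for special tokens.
    Returns '' when the span is reversed or touches a non-paragraph token.
    """
    if start_token > end_token:
        return ""
    if sequence_ids[start_token] != 0 or sequence_ids[end_token] != 0:
        return ""
    return doc[offsets[start_token][0]:offsets[end_token][1]]


doc = "身高193厘米"
# Assumed layout: [CLS] 身 高 193 厘 米 [SEP] 多 高 ？ [SEP]
offsets = [(0, 0), (0, 1), (1, 2), (2, 5), (5, 6), (6, 7), (0, 0),
           (0, 1), (1, 2), (2, 3), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, 0, None, 1, 1, 1, None]

print(span_to_text(doc, offsets, sequence_ids, 3, 5))        # span inside the paragraph
print(repr(span_to_text(doc, offsets, sequence_ids, 5, 3)))  # reversed span
print(repr(span_to_text(doc, offsets, sequence_ids, 3, 8)))  # end lies in the question
```

Returning an empty string for invalid spans matches the fallback behavior of `predict`, which also returns `''` when the mapping fails or the start comes after the end.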

In [60]:
dev['data'][100]

{'paragraphs': [{'id': 'DEV_109',
   'context': '岑朗天（），笔名朗天、霍惊觉。香港作家、影评人、文化活动策划、大学兼职讲师。香港新亚研究所硕士，师从牟宗三，父亲为香港专栏作家昆南。曾在香港多家报社从事繙译、编辑、采访工作。1995年加入香港电影评论学会，并于2003-2007年出任该会会长，2016年退出。1995年参与创立新研哲学会，后易名香港人文哲学会，再易名香港人文学会。1998年加入树宁．现在式单位，出任该剧团董事及编剧。2003年担任牛棚书展（2003-6）统筹，协助开拓主流以外的书展文化（牛棚书展精神后为九龙城书节继承）。2004年6月至2011年加入商业电台光明顶，担任嘉宾主持。2004年至2014年于香港中文大学新闻与传播学院兼职教授媒体创意写作。2012年始兼任香港浸会大学电影学院讲师，教授文学与影视相关课程。',
   'qas': [{'question': '岑朗天笔名叫什么？',
     'id': 'DEV_109_QUERY_0',
     'answers': [{'text': '朗天、霍惊觉', 'answer_start': 8},
      {'text': '朗天、霍惊觉', 'answer_start': 8},
      {'text': '朗天、霍惊觉', 'answer_start': 8}]},
    {'question': '岑朗天的职业都有哪些？',
     'id': 'DEV_109_QUERY_1',
     'answers': [{'text': '作家、影评人、文化活动策划、大学兼职讲师', 'answer_start': 17},
      {'text': '作家、影评人、文化活动策划、大学兼职讲师', 'answer_start': 17},
      {'text': '作家、影评人、文化活动策划、大学兼职讲师', 'answer_start': 17}]},
    {'question': '岑朗天哪年加入香港电影评论学会？',
     'id': 'DEV_109_QUERY_2',
     'answers': [{'text': '1995年', 'answer_start': 87},
      {'te

In [64]:
model.eval()
predict(dev['data'][100]['paragraphs'][0]['context'],
       dev['data'][100]['paragraphs'][0]['qas'][0]['question'])

岑朗天（），笔名朗天、霍惊觉。香港作家、影评人、文化活动策划、大学兼职讲师。香港新亚研究所硕士，师从牟宗三，父亲为香港专栏作家昆南。曾在香港多家报社从事繙译、编辑、采访工作。1995年加入香港电影评论学会，并于2003-2007年出任该会会长，2016年退出。1995年参与创立新研哲学会，后易名香港人文哲学会，再易名香港人文学会。1998年加入树宁．现在式单位，出任该剧团董事及编剧。2003年担任牛棚书展（2003-6）统筹，协助开拓主流以外的书展文化（牛棚书展精神后为九龙城书节继承）。2004年6月至2011年加入商业电台光明顶，担任嘉宾主持。2004年至2014年于香港中文大学新闻与传播学院兼职教授媒体创意写作。2012年始兼任香港浸会大学电影学院讲师，教授文学与影视相关课程。
提问： 岑朗天笔名叫什么？


'朗天、霍惊觉'

In [85]:
doc='温开宇在美国波士顿大学读数理金融和金融科技硕士，他的身高是193厘米。'

In [88]:
predict(
    doc=doc,
    query='温开宇在哪个大学？'
)

温开宇在美国波士顿大学读数理金融和金融科技硕士，他的身高是193厘米。
提问： 温开宇在哪个大学？


'美国波士顿大学'

In [89]:
predict(
    doc=doc,
    query='温开宇读什么专业？'
)

温开宇在美国波士顿大学读数理金融和金融科技硕士，他的身高是193厘米。
提问： 温开宇读什么专业？


'数理金融和金融科技硕士'

In [90]:
predict(
    doc=doc,
    query='温开宇多高？'
)

温开宇在美国波士顿大学读数理金融和金融科技硕士，他的身高是193厘米。
提问： 温开宇多高？


'193厘米'

In [91]:
model.save_pretrained("./fengchao-bert-qa")

In [94]:
model_load_from_file = BertForQuestionAnswering.from_pretrained('./fengchao-bert-qa').to(device)

In [None]:
model_load_from_file

In [96]:
def predict_local(doc, query):
    print(doc)
    print('提问：', query)
    item = tokenizer([doc, query], max_length=512, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        input_ids = item['input_ids'].to(device).reshape(1,-1)
        attention_mask = item['attention_mask'].to(device).reshape(1,-1)
        
        outputs = model_load_from_file(input_ids[:, :512], attention_mask[:, :512])
        
        ### START CODE HERE ### 
        # Use `torch.argmax` with its `dim` parameter to extract predictions for the start and end positions.
        start_pred = torch.argmax(outputs.start_logits, dim=1)
        end_pred = torch.argmax(outputs.end_logits, dim=1)
        ### END CODE HERE ### 
    
    try:
        start_pred = item.token_to_chars(0, start_pred)
        end_pred = item.token_to_chars(0, end_pred)
    except Exception:
        return ''
    
    if start_pred.start > end_pred.end:
        return ''
    else:
        return doc[start_pred.start:end_pred.end]

In [97]:
predict_local(
    doc=doc,
    query='温开宇多高？'
)

温开宇在美国波士顿大学读数理金融和金融科技硕士，他的身高是193厘米。
提问： 温开宇多高？


'193厘米'

In [98]:
!du -h fengchao-bert-qa

388M	fengchao-bert-qa


## Open Questions
You may consult relevant materials to answer the following open-ended questions.


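- We used BERT with a maximum length of 512, but in real applications the input may be longer than 512 tokens. How would you solve this problem? Describe your algorithm and the approach you would take during training and during prediction, respectively. (Take the QA task as an example, and assume that every question is shorter than 512 tokens while a paragraph may be longer than 512 tokens.)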


Your Answer:

- In the output, we predict the positions start_pred and end_pred separately. If end_pred < start_pred, how can this be handled?

Your Answer:

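- What tokenization method does BERT use? What problems do you think it causes for Chinese? What kind of tokenization suits Chinese? Besides changing the tokenization, what other methods can improve model performance on Chinese text?

Reading material: https://github.com/ymcui/Chinese-BERT-wwm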

Your Answer: