# Fine-tuning BERT on a Multiple-Choice Task

### Introduction to the SWAG dataset

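Given a partial description such as "she opened the hood of the car", humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). The SWAG paper introduces the task of grounded commonsense inference, which unifies natural language inference and commonsense reasoning.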


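The dataset consists of 113k multiple-choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. The correct answer is the (real) video caption for the next event in the video; the three incorrect answers are adversarially generated and human verified, so that they fool machines but not humans. The authors aim for SWAG to serve as a benchmark for evaluating grounded commonsense NLI and for learning representations.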

In [44]:
# !pip install transformers datasets -i https://pypi.tuna.tsinghua.edu.cn/simple # Tsinghua mirror
!pip install -i https://mirrors.cloud.tencent.com/pypi/simple transformers datasets # Tencent mirror
# https://mirrors.cloud.tencent.com/help/pypi.html

Looking in indexes: https://mirrors.cloud.tencent.com/pypi/simple


Paper: [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](https://www.aclweb.org/anthology/D18-1009/)

Dataset overview: https://paperswithcode.com/dataset/swag

Hugging Face dataset: https://huggingface.co/datasets/swag

In [45]:
model_checkpoint = "bert-base-uncased" # pretrained checkpoint to fine-tune
batch_size = 16

### Loading the dataset

In [47]:
from datasets import load_dataset

`load_dataset` caches the dataset, so it is not downloaded again the next time you run this cell.

In [48]:
# datasets = load_dataset("swag", "regular")

In [49]:
import os

data_path = '/home/mw/input/task063578' # directory that contains the SWAG CSV files
cache_dir = './cache'
data_files = {
    'train': os.path.join(data_path, 'train.csv'), # training set
    'validation': os.path.join(data_path, 'validation.csv'), # validation set
    'test': os.path.join(data_path, 'test.csv') # test set
}
datasets = load_dataset(data_path, 'regular', data_files=data_files, cache_dir=cache_dir)

Using custom data configuration task063578-a70cb52678a611ac
Reusing dataset csv (./cache/csv/task063578-a70cb52678a611ac/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)



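The `datasets` object is itself a `DatasetDict`, which contains one key per split: training, validation, and test. (MNLI is a special case that also has keys for the mismatched validation and test sets.)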

In [50]:
datasets

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['Unnamed: 0', 'video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

To access a single element, select a split and then index into it. For example, to read the first sample of the training set:
- split name: `train`
- index: `0`

In [51]:
datasets["train"][0]

{'Unnamed: 0': 0,
 'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': 3416,
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

In [52]:
datasets["train"][1]

{'Unnamed: 0': 1,
 'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': 3417,
 'startphrase': 'A drum line passes by walking down the street playing their instruments. Members of the procession',
 'sent1': 'A drum line passes by walking down the street playing their instruments.',
 'sent2': 'Members of the procession',
 'gold-source': 'gen',
 'ending0': 'are playing ping pong and celebrating one left each in quick.',
 'ending1': 'wait slowly towards the cadets.',
 'ending2': 'continues to play as well along the crowd along with the band being interviewed.',
 'ending3': 'continue to play marching, interspersed.',
 'label': 3}

In [10]:
# datasets["train"].to_pandas().to_csv('data/06/train.csv')
# datasets["validation"].to_pandas().to_csv('data/06/validation.csv')
# datasets["test"].to_pandas().to_csv('data/06/test.csv')

To get a sense of what the data looks like, the following function displays a few randomly selected examples from the dataset.

In [53]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    """
    Display `num_examples` randomly selected samples (10 by default).
    """
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [54]:
show_random_elements(datasets["train"])

Unnamed: 0.1,Unnamed: 0,video-id,fold-ind,startphrase,sent1,sent2,gold-source,ending0,ending1,ending2,ending3,label
0,11514,anetv_TjDlEonao3s,545,She continues the process till the entire head...,She continues the process till the entire head...,The model then,gold,loose a hair around the mattress.,uses a different rag and rendition the use of ...,ties a thick part on her head.,smiles and takes a selfie while looking in the...,3
1,33024,lsmdc1023_Horrible_Bosses-81983,16102,They see someone pounding someone's chest. Som...,They see someone pounding someone's chest.,Someone,gold,unwittingly drops someone's mobile as they hur...,can not hear what he is saying to himself in t...,nears someone's ring.,glares at the panel.,0
2,51949,anetv_Nx4rK_jvvR4,11044,A ballet is fun graphic appears across the scr...,A ballet is fun graphic appears across the scr...,A man in a black leotard and a woman in a blac...,gold,is doing balance and dancing with the high gir...,is demonstrating how to throw her hula hoop.,is shown standing up even ballet and dressed i...,begin to dance in a well lit dance studio.,3
3,65127,anetv_zgdT41KjjrE,7921,A person leads a horse. The guy,A person leads a horse.,The guy,gold,tries one at a time.,leads the horse and runs towards that.,shifts the rocking horse in the horse balance ...,gets on top of the horse.,3
4,63691,lsmdc3064_SPARKLE_2012-4368,13738,"Someone waves, then rests her head on the wind...","Someone waves, then rests her head on the wind...","Now in prison, someone",gold,looks at someone through a pane of glass.,"arrives at the house, then peers toward a ceil...",walks toward the house and starts across the s...,heads through a courtyard.,0
5,10376,lsmdc3018_CINDERELLA_MAN-7734,16245,Someone and someone watch her exit quickly. In...,Someone and someone watch her exit quickly.,"In her sweater and scarf, someone",gold,crosses the snow to an alcove between buildings.,looks back to herself.,"sits by her mother's building, sitting in a di...",sits on the porch.,0
6,64835,anetv_Fu46pdVz4qY,17053,We see a lady in a pink shirt talking to the c...,We see a lady in a pink shirt talking to the c...,We,gold,see this store house.,credits appears on the screen.,cheer on the lady then an image appears in a s...,see the lady pick up a basket of laundry and p...,3
7,14159,lsmdc3016_CHASING_MAVERICKS-6937,14167,Someone comes out of the barrel near the bone ...,Someone comes out of the barrel near the bone ...,Someone,gold,gives someone the letter and leaves.,"sleeps in a lifeboat with his arms crossed, so...",nibbles across a puddle of hayseeds.,slides on top of the large pinata.,0
8,14863,lsmdc0021_Rear_Window-58244,9562,"She is quite flat - chested, and the dress han...","She is quite flat - chested, and the dress han...",As if she,gold,"were protesting, she holds in silence.",is just meeting 'new coach.,is preparing to meet someone.,knows she must go.,2
9,65566,anetv_wj0D-wiqEb0,3771,"When he is done, he lowers his instrument and ...","When he is done, he lowers his instrument and ...",They,gold,appear to be riding inside a train.,peer coolly at them.,are playing with a drum roll.,"play the harmonica, and throw the ice in the a...",0


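Each example in the dataset consists of:
- a context sentence (the `sent1` field);
- the beginning of a second sentence (the `sent2` field);
- four possible endings for that second sentence (the `ending0`, `ending1`, `ending2`, and `ending3` fields);
- the index of the correct ending (the `label` field), which the model must predict.

The `show_one` function defined further down prints a given example in a more readable form.

For reference, `datasets["train"][0]` is: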
```

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

```

The same example translated into Chinese:
```
{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': '游行的成员拿着小号角铜管乐器走在街上。 鼓线',
 'sent1': '游行的成员拿着小号角铜管乐器走在街上。',
 'sent2': '鼓线',
 'gold-source': 'gold',
 'ending0': '路过演奏他们的乐器的街道。',
 'ending1': '听说正在接近他们。',
 'ending2': "到了，他们在外面跳舞睡着了。",
 'ending3': '轮流主唱观看演出。',
 'label': 0}
```

In [55]:
def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f"  A - {example['sent2']} {example['ending0']}")
    print(f"  B - {example['sent2']} {example['ending1']}")
    print(f"  C - {example['sent2']} {example['ending2']}")
    print(f"  D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")

In [9]:
show_one(datasets["train"][0])

Context: Members of the procession walk down the street holding small horn brass instruments.
  A - A drum line passes by walking down the street playing their instruments.
  B - A drum line has heard approaching them.
  C - A drum line arrives and they're outside dancing and asleep.
  D - A drum line turns the lead singer watches the performance.

Ground truth: option A


In [57]:
show_one(datasets["train"][17])

Context: Someone looks up at her hero and sees someone closing in again.
  A - Someone offers his wand to find the entirely under world most obvious.
  B - Someone leans back to her surroundings, his eyes rimmed with tears.
  C - Someone gathers her belongings in a suitcase and walks out into the next room.
  D - Someone lights his walking stick as he aims his pistol at someone's wand.

Ground truth: option B


### Preprocessing the data

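Before we can feed these texts to the model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs the model requires.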

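To do this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which gives us a tokenizer that matches the chosen checkpoint. In this notebook the tokenizer is responsible for:

- tokenizing the text;
- producing encodings that can be arranged into the format expected by `AutoModelForMultipleChoice`;
- `AutoModelForMultipleChoice` itself resolves to a model-specific class, for example `BertForMultipleChoice` or `XLNetForMultipleChoice`.

The vocabulary is cached, so it is not downloaded again the next time we run the cell.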


![Image Name](https://cdn.kesci.com/upload/image/r9x57uzad7.png?imageView2/0/w/960/h/960)


In [19]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


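We pass `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Fast tokenizers are available for almost all models, but if the call above raises an error, remove the argument.

You can call this tokenizer directly on one sentence or on a pair of sentences: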

In [20]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
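The output above follows BERT's sentence-pair layout, `[CLS] A [SEP] B [SEP]`, and `token_type_ids` marks which segment each token belongs to. As a minimal sketch (plain Python, no tokenizer involved; the helper name `segment_ids` is made up for illustration), the segment IDs can be re-derived from the position of the first `[SEP]` token, whose ID is 102 in the `bert-base-uncased` vocabulary:

```python
# Token IDs copied from the tokenizer output above.
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary


def segment_ids(ids, sep_id=SEP_ID):
    """Hypothetical helper: 0 up to and including the first [SEP], 1 afterwards."""
    first_sep = ids.index(sep_id)
    return [0 if pos <= first_sep else 1 for pos in range(len(ids))]


token_type_ids = segment_ids(input_ids)
print(token_type_ids)
```

The result matches the `token_type_ids` entry returned by the tokenizer: eight zeros for the first sentence (including `[CLS]` and the first `[SEP]`) followed by eight ones for the second.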

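What does the model input look like?

For a multiple-choice task, the context is paired with each candidate answer in turn, so one sample becomes a list of sentence pairs with as many pairs as there are candidates: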

```
[("Members of the procession walk down the street holding small horn brass instruments.","A drum line passes by walking down the street playing their instruments."),
("Members of the procession walk down the street holding small horn brass instruments.","A drum line has heard approaching them."),
("Members of the procession walk down the street holding small horn brass instruments.","A drum line arrives and they're outside dancing and asleep."),
("Members of the procession walk down the street holding small horn brass instruments.","A drum line turns the lead singer watches the performance.")]
```
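The list above can be built from a single SWAG record with a few lines of Python. This is a sketch for illustration only: `build_pairs` is a hypothetical helper, not part of any library, and the record is the first training example shown earlier.

```python
def build_pairs(example, ending_names=("ending0", "ending1", "ending2", "ending3")):
    """Return one (context, candidate continuation) pair per ending."""
    return [
        (example["sent1"], f"{example['sent2']} {example[name]}")
        for name in ending_names
    ]


example = {
    "sent1": "Members of the procession walk down the street holding small horn brass instruments.",
    "sent2": "A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "label": 0,
}

pairs = build_pairs(example)
for context, candidate in pairs:
    print(candidate)
```

Every pair shares the same first element (the context), and the label is simply the index of the correct pair in the list.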


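A Chinese reading-comprehension question follows the same pattern (a passage, a question, and candidate choices):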
```
[{"ID": 1,    
    "Content": "奉和袭美抱疾杜门见寄次韵  陆龟蒙虽失春城醉上期，下帷裁遍未裁诗。因吟郢岸百亩蕙，欲采商崖三秀芝。栖野鹤笼宽使织，施山僧饭别教炊。但医沈约重瞳健，不怕江花不满枝。",   
    "Questions": [{"Question": "下列对这首诗的理解和赏析，不正确的一项是",        
    "Choices": ["A．作者写作此诗之时，皮日休正患病居家，闭门谢客，与外界不通音讯。",
                "B．由于友人患病，原有的约会被暂时搁置，作者游春的诗篇也未能写出。", 
                "C．作者虽然身在书斋从事教学，但心中盼望能走进自然，领略美好春光。",
                "D．尾联使用了关于沈约的典故，可以由此推测皮日休所患的疾病是目疾。"],
    "Answer": "A",
    "Q_id": "000101"
}]}]
```
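A record of this shape can be turned into the same sentence-pair layout. The sketch below assumes the field names shown in the JSON above (`Content`, `Questions`, `Question`, `Choices`, `Answer`, `Q_id`) and that answers are single letters from A to D; the passage and choices are short placeholders, and `to_multiple_choice` is a hypothetical helper.

```python
# Placeholder item that mirrors the field names of the JSON example above.
item = {
    "ID": 1,
    "Content": "A short passage would go here.",
    "Questions": [{
        "Question": "Which statement is incorrect?",
        "Choices": ["A. first option", "B. second option",
                    "C. third option", "D. fourth option"],
        "Answer": "A",
        "Q_id": "000101",
    }],
}


def to_multiple_choice(item):
    """Build (passage, question + choice) pairs and a label index per question."""
    rows = []
    for question in item["Questions"]:
        pairs = [
            (item["Content"], f"{question['Question']} {choice}")
            for choice in question["Choices"]
        ]
        label = "ABCD".index(question["Answer"])  # assumes answers are letters A-D
        rows.append({"q_id": question["Q_id"], "pairs": pairs, "label": label})
    return rows


rows = to_multiple_choice(item)
print(rows[0]["label"], len(rows[0]["pairs"]))
```

One passage can carry several questions, so the helper returns one row per question rather than one row per passage.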

In [58]:
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    # Build the two text inputs for the tokenizer.
    # Repeat each first sentence four times, once per candidate ending:
    # [sent1, sent1, sent1, sent1]
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # The start of each second sentence; it is joined with every ending below.
    question_headers = examples["sent2"]
    # Second input: [sent2 + ending0, sent2 + ending1, sent2 + ending2, sent2 + ending3]
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]

    # Flatten everything so the tokenizer can process it in one call:
    # [[e1_sent1] * 4, [e2_sent1] * 4, [e3_sent1] * 4] ->
    # [e1_sent1, e1_sent1, e1_sent1, e1_sent1, e2_sent1, ..., e3_sent1]
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten: regroup the flat outputs into one list of four encodings per example:
    # [e1_c0, e1_c1, e1_c2, e1_c3, e2_c0, ...] -> [[e1_c0, e1_c1, e1_c2, e1_c3], [e2_c0, ...], ...]
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}

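This function works with one or several examples. With several examples, it returns a list of lists for each key: a list over all examples (here 5), then a list over all choices (4), and then a list of input IDs (of varying length here, since we did not apply any padding):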

In [59]:
examples = datasets["train"][:5] # five samples from the training set
features = preprocess_function(examples) # tokenized inputs for the five samples
print(
    len(features["input_ids"]), # number of examples
    len(features["input_ids"][0]), # number of sentence pairs for the first example:
    # [[sent1, sent2 + ending0],
    #  [sent1, sent2 + ending1],
    #  [sent1, sent2 + ending2],
    #  [sent1, sent2 + ending3]]
    [len(x) for x in features["input_ids"][0]] # number of token IDs in each sentence pair
)

5 4 [30, 25, 30, 28]
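The flatten and un-flatten steps in `preprocess_function` are easier to follow on plain lists. The sketch below uses placeholder strings instead of token IDs and `itertools.chain.from_iterable` for flattening, which is a linear-time alternative to `sum(list_of_lists, [])`; it illustrates the idea and is not the notebook's exact code.

```python
from itertools import chain

NUM_CHOICES = 4

# Two examples, each with four placeholder "encodings".
nested = [
    ["e1_c0", "e1_c1", "e1_c2", "e1_c3"],
    ["e2_c0", "e2_c1", "e2_c2", "e2_c3"],
]

# Flatten: one long list that a tokenizer could process in a single call.
flat = list(chain.from_iterable(nested))

# Un-flatten: slice the flat list back into groups of NUM_CHOICES.
regrouped = [flat[i:i + NUM_CHOICES] for i in range(0, len(flat), NUM_CHOICES)]

print(len(flat), len(regrouped), len(regrouped[0]))
```

Because the order is preserved, the group at position `i` always belongs to example `i`, which is what lets the `label` column line up with the regrouped encodings.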


In [62]:
features.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [60]:
examples

{'Unnamed: 0': [0, 1, 2, 3, 4],
 'video-id': ['anetv_jkn6uvmqwh4',
  'anetv_jkn6uvmqwh4',
  'anetv_jkn6uvmqwh4',
  'anetv_jkn6uvmqwh4',
  'anetv_Bri_myFFu4A'],
 'fold-ind': [3416, 3417, 3415, 3417, 2408],
 'startphrase': ['Members of the procession walk down the street holding small horn brass instruments. A drum line',
  'A drum line passes by walking down the street playing their instruments. Members of the procession',
  'A group of members in green uniforms walks waving flags. Members of the procession',
  'A drum line passes by walking down the street playing their instruments. Members of the procession',
  'The person plays a song on the violin. The man'],
 'sent1': ['Members of the procession walk down the street holding small horn brass instruments.',
  'A drum line passes by walking down the street playing their instruments.',
  'A group of members in green uniforms walks waving flags.',
  'A drum line passes by walking down the street playing their instruments.',
  'The perso

To check that the encoding is correct, we can decode the fourth sample (index 3):

In [15]:
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]

['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']

and compare it with the original example:

In [63]:
show_one(datasets["train"][3])

Context: A drum line passes by walking down the street playing their instruments.
  A - Members of the procession are playing ping pong and celebrating one left each in quick.
  B - Members of the procession wait slowly towards the cadets.
  C - Members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions.
  D - Members of the procession play and go back and forth hitting the drums while the audience claps for them.

Ground truth: option D


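This looks fine, so we can apply the function to all the examples in our dataset using the `map` method of the `datasets` object we created earlier. It applies the function to every element of every split in `datasets`, so our training, validation, and test data are all preprocessed with a single command.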

In [64]:
encoded_datasets = datasets.map(preprocess_function, batched=True)

Loading cached processed dataset at ./cache/csv/task063578-a70cb52678a611ac/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-c2ec216ab18d6a01.arrow
Loading cached processed dataset at ./cache/csv/task063578-a70cb52678a611ac/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-4a0d824c5f087eb0.arrow
Loading cached processed dataset at ./cache/csv/task063578-a70cb52678a611ac/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-4620c53b2807eee1.arrow


In [65]:
encoded_datasets

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['Unnamed: 0', 'video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20005
    })
})

In [66]:
encoded_datasets['train'][0]

{'Unnamed: 0': 0,
 'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': 3416,
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0,
 'input_ids': [[101,
   2372,
   1997,
   1996,
   14385,
   3328,
   2091,
   1996,
   2395,
   3173,
   2235,
   7109,
   8782,
   5693,
   1012,
   102,
   1037,
   6943,
   2240,
   5235,
   2011,
   3788,
   2091,
   1996,
   2395,
   2652,
   2037,
   5693,
   1012,
   102],
  [101,
   2372,
   1997,
   1996,
   14385,
   3328,
   2091,
   1996,
   2395,
   3173,
   2235,
   7109,
   8782,
   5693,
 

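Even better, the 🤗 Datasets library caches the results automatically, so this step costs no time the next time you run the notebook. The library is usually smart enough to detect when the function you pass to `map` has changed (and that the cached data should therefore not be used). For example, it detects a change to `model_checkpoint` near the top of the notebook when you rerun it. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to skip the cache and force the preprocessing to run again.

Note that we passed `batched=True` to encode the texts in batches. This takes full advantage of the fast tokenizer we loaded earlier, which uses multithreading to process the texts in a batch concurrently.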

### Fine-tuning the model

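Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is multiple choice, we use the `AutoModelForMultipleChoice` class. As with the tokenizer, the `from_pretrained` method downloads and caches the model for us.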

In [32]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint) # load the pretrained bert-base-uncased weights

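Loading the model prints a warning that some weights are being discarded (the pretraining heads; for a BERT checkpoint these are the layers under `cls.predictions` and `cls.seq_relationship`) and that others are randomly initialized (the new `classifier` layer). This is expected here: we are removing the heads used to pretrain the model, such as the masked language modeling head, and replacing them with a new multiple-choice head for which there are no pretrained weights. The library is therefore warning us to fine-tune this model before using it for inference, which is exactly what we are going to do.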

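To instantiate a `Trainer`, we need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes used to customize training. It requires a folder name, which is used to save the model checkpoints; all other arguments are optional: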

In [19]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-swag",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

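Here we set evaluation to run at the end of each epoch, adjust the learning rate, use the `batch_size` defined at the top of the notebook, and customize the number of training epochs and the weight decay.

The last argument, `push_to_hub=True`, sets everything up so that the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you have not set up authentication with the Hub. If you want to save the model locally under a name different from the repository it will be pushed to, or if you want to push it under an organization rather than your own namespace, use the `hub_model_id` argument to set the repository name (it must be the full name, including your namespace: for example `"sgugger/bert-finetuned-swag"` or `"huggingface/bert-finetuned-swag"`).

Then we need to tell our `Trainer` how to form batches from the preprocessed inputs. We have not done any padding yet, because we will pad each batch to the maximum length inside that batch (instead of padding to the maximum length of the whole dataset). This is the job of the *data collator*. A data collator takes a list of examples and converts them into a batch (in our case, by applying padding). Since the library version used here has no data collator for our specific problem, we write one, adapted from `DataCollatorWithPadding`: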

In [67]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

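When called on a list of examples, the collator flattens all the inputs, attention masks, and so on into one long list and passes it to the `tokenizer.pad` method. This returns a dictionary of large tensors (of shape `(batch_size * 4) x seq_length`), which we then un-flatten back to `batch_size x 4 x seq_length`.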
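The flatten, pad, and reshape sequence can be sketched without PyTorch. The example below works on nested Python lists and pads to the longest sequence in the batch; `pad_and_group` is a hypothetical helper, and the pad ID of 0 is the `[PAD]` ID of `bert-base-uncased` (an assumption for any other vocabulary). It illustrates the idea behind the collator and is not a replacement for `tokenizer.pad`.

```python
PAD_ID = 0  # [PAD] in bert-base-uncased; an assumption for other vocabularies


def group(rows, size):
    """Split a flat list into consecutive groups of `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def pad_and_group(features, num_choices, pad_id=PAD_ID):
    """Pad every choice to the batch maximum, then regroup per example."""
    flat = [seq for feature in features for seq in feature]  # flatten
    max_len = max(len(seq) for seq in flat)                  # longest in this batch
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in flat]
    return group(input_ids, num_choices), group(attention_mask, num_choices)


# Two examples with two choices each (short made-up token ID lists).
features = [
    [[101, 7, 102], [101, 7, 8, 102]],
    [[101, 9, 10, 11, 102], [101, 102]],
]
input_ids, attention_mask = pad_and_group(features, num_choices=2)
print(input_ids[0][0], attention_mask[1][1])
```

Padding to the batch maximum rather than a dataset-wide maximum keeps batches of short sentences small, which is the reason the notebook defers padding to the collator.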

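We can check that this data collator works on a list of features. We only need to make sure to remove all the input features our model does not accept (something the `Trainer` does for us automatically later on):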

In [74]:
accepted_keys = ["input_ids", "attention_mask", "label"] # inputs to keep
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)] # first 10 training examples
batch = DataCollatorForMultipleChoice(tokenizer)(features)

In [75]:
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [76]:
batch['input_ids'].shape

torch.Size([10, 4, 40])

To check that the collated batch is correct, we can decode one example and compare it with the original:

In [71]:
[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(4)]

['[CLS] someone walks over to the radio. [SEP] someone hands her another phone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] someone walks over to the radio. [SEP] someone takes the drink, then holds it. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] someone walks over to the radio. [SEP] someone looks off then looks at someone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] someone walks over to the radio. [SEP] someone stares blearily down at the floor. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']

In [72]:
show_one(datasets["train"][8])

Context: Someone walks over to the radio.
  A - Someone hands her another phone.
  B - Someone takes the drink, then holds it.
  C - Someone looks off then looks at someone.
  D - Someone stares blearily down at the floor.

Ground truth: option D


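This looks fine.

The last thing to define for our `Trainer` is how to compute metrics from the predictions. The function below computes accuracy directly with NumPy; the only preprocessing needed is to take the argmax of the predicted logits: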

In [24]:
import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
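To make the metric concrete, here is a small worked example in plain Python with made-up logits for four questions. The `argmax` and `accuracy` helpers are illustrative stand-ins for the NumPy calls in `compute_metrics`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring choice equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return correct / len(labels)


# One row of four scores per question (made-up values).
logits = [
    [2.1, 0.3, -1.0, 0.5],  # highest score at index 0
    [0.2, 0.1, 0.4, 1.9],   # highest score at index 3
    [0.0, 1.5, 0.2, 0.1],   # highest score at index 1
    [1.2, 0.4, 0.3, 0.9],   # highest score at index 0
]
labels = [0, 3, 2, 0]

print(accuracy(logits, labels))  # 0.75
```

Three of the four predictions match their labels (the third question is predicted as 1 but labeled 2), so the accuracy is 0.75.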

Then we just need to pass all of this, along with our datasets, to the `Trainer`:

In [25]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/quincyqiang/bert-base-uncased-finetuned-swag into local empty directory.


We can now fine-tune our model by calling the `train` method:

In [26]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: sent1, video-id, ending3, gold-source, startphrase, ending0, ending1, sent2, fold-ind, ending2. If sent1, video-id, ending3, gold-source, startphrase, ending0, ending1, sent2, fold-ind, ending2 are not expected by `BertForMultipleChoice.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 73546
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 13791
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mquincyqiang[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.12.11 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install w

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.756         | 0.602119        | 0.764571 |
| 2     | 0.3978        | 0.661687        | 0.778267 |
| 3     | 0.1468        | 1.039732        | 0.789213 |


Saving model checkpoint to bert-base-uncased-finetuned-swag\checkpoint-500
Configuration saved in bert-base-uncased-finetuned-swag\checkpoint-500\config.json
Model weights saved in bert-base-uncased-finetuned-swag\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-swag\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-swag\checkpoint-500\special_tokens_map.json
tokenizer config file saved in bert-base-uncased-finetuned-swag\tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-swag\special_tokens_map.json
Saving model checkpoint to bert-base-uncased-finetuned-swag\checkpoint-1000
Configuration saved in bert-base-uncased-finetuned-swag\checkpoint-1000\config.json
Model weights saved in bert-base-uncased-finetuned-swag\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-swag\checkpoint-1000\tokenizer_config.json
Special tokens file saved i

TrainOutput(global_step=13791, training_loss=0.4617224831533055, metrics={'train_runtime': 2468.151, 'train_samples_per_second': 89.394, 'train_steps_per_second': 5.588, 'total_flos': 2.466167112293952e+16, 'train_loss': 0.4617224831533055, 'epoch': 3.0})
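The step count and throughput figures in the log can be reproduced from the run settings. This is a small arithmetic check, assuming a single device, no gradient accumulation (as the log reports), and that the last partial batch of each epoch is kept:

```python
import math

num_examples = 73546     # "Num examples" in the training log
batch_size = 16          # per-device batch size, one device
num_epochs = 3
train_runtime = 2468.151  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The result is 4597 steps per epoch and 13791 steps in total, matching `Total optimization steps` and `global_step`; the two throughput values agree with `train_samples_per_second` and `train_steps_per_second` in `TrainOutput` to three decimal places.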

Upload the model to the Hugging Face Hub:

In [27]:
trainer.push_to_hub()

Saving model checkpoint to bert-base-uncased-finetuned-swag
Configuration saved in bert-base-uncased-finetuned-swag\config.json
Model weights saved in bert-base-uncased-finetuned-swag\pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-swag\tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-swag\special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.
Everything up-to-date

Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'swag', 'type': 'swag', 'args': 'regular'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.789213240146637}]}
To https://huggingface.co/quincyqiang/bert-base-uncased-finetuned-swag
   22cea59..32a6065  main -> main



'https://huggingface.co/quincyqiang/bert-base-uncased-finetuned-swag/commit/22cea5901e2cfee5212caa5d751a681d99796f7e'

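## iFLYTEK Chinese Idiom Cloze Challenge

Competition name: Chinese Idiom Cloze Challenge (algorithm challenge)

Competition link: https://challenge.xfyun.cn/topic/info?type=chinese-idioms

To get the complete baseline, follow the WeChat official account "ChallengeHub" and reply with "成语填空".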

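### 1. Background
Chinese culture is broad, profound, and long-standing, and idioms (chengyu) are among its finest products. Most idioms consist of four characters and usually have an allusion or a source behind them. Some can be understood literally, such as "小题大做" (making a fuss over a trifle) and "后来居上" (latecomers come out ahead). Others can only be understood by knowing their origin or allusion, such as "朝三暮四" (fickle; literally "three in the morning, four in the evening") and "杯弓蛇影" (needless suspicion; literally "mistaking the reflection of a bow in a cup for a snake").

Learning idioms is an important part of the Chinese curriculum in primary and junior high school. How do we choose the right idiom for a sentence? This competition asks participants to build models that understand Chinese idioms.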

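### 2. Task
Given a Chinese sentence, participants must choose the most suitable idiom from the candidates according to the context, that is, fill the appropriate idiom into the blanked position.

A sample from the training set:
![](https://ai-contest-static.xfyun.cn/2021/158.jpg)

* Treat the competition data as an NLP reading-comprehension (multiple-choice) problem; the SWAG format can serve as a reference
* Build the description text (`text`), the choice fields, and the candidate answers: four candidate idioms
* Feed the result to an `AutoModelForMultipleChoice` model for training and prediction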
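As a rough sketch of the mapping described in the bullets, the snippet below converts one made-up cloze row into SWAG-style fields. The column names `text`, `candidate`, and `label` follow the competition code in the next section; the row contents and the helper `to_swag_record` are hypothetical, and the `candidate` column is assumed to hold a Python list literal stored as a string.

```python
import ast
import re

MASK = "[MASK]" * 4  # the blank left where the four-character idiom was removed


def to_swag_record(text, candidate, answer):
    """Map one cloze row to SWAG-style fields (sent1, sent2, ending0-3, label)."""
    content = re.sub(" +", " ", text).strip()
    # Use the first sentence that contains the blank as the "question";
    # fall back to the whole text if no sentence contains it.
    sentences = re.split(r"(。|！|!|\.|？|\?)", content)
    question = next((s for s in sentences if MASK in s), content)
    choices = ast.literal_eval(candidate)  # parse the list literal safely
    record = {"sent1": content, "sent2": question, "label": choices.index(answer)}
    record.update({f"ending{i}": choice for i, choice in enumerate(choices)})
    return record


# Made-up row for illustration only.
row = {
    "text": "他做事总是[MASK][MASK][MASK][MASK]，从不拖延。大家都很信任他。",
    "candidate": "['雷厉风行', '拖泥带水', '画蛇添足', '守株待兔']",
    "label": "雷厉风行",
}
record = to_swag_record(row["text"], row["candidate"], row["label"])
print(record["sent2"], record["label"])
```

Once the rows are in this shape, the SWAG preprocessing and data collator from the first half of the notebook can be reused unchanged.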

### 3. Building the training and test sets

In [None]:
import ast
import re
import pandas as pd
from tqdm import tqdm

train = pd.read_csv('data/train.csv', sep='\t')
test = pd.read_csv('data/test.csv', sep='\t')

print(train)
print(test)


def process_text(text):
    # Collapse runs of spaces and trim both ends
    return re.sub(' +', ' ', text).strip()


def get_question(text):
    """
    Return the sentence that contains [MASK][MASK][MASK][MASK] as the question
    :param text: the full passage
    :return: the first sentence containing the mask, or the full text if none does
    """
    sentences = re.split(r'(。|！|!|\.|？|\?)', text)  # the capture group keeps the delimiters
    for sent in sentences:
        if '[MASK][MASK][MASK][MASK]' in sent:
            return sent
    return text


cols = [
    "Unnamed: 0",
    "video-id",
    "fold-ind",  # q_id
    "startphrase",
    "sent1",  # content
    "sent2",  # question
    "gold-source",
    "ending0", "ending1", "ending2", "ending3",  # choice
    "label"]

# ======================================================
# Build the training set
# ======================================================
res = []

for idx, row in tqdm(train.iterrows(), total=len(train)):
    q_id = f'train_{idx}'
    content = row['text']
    content = process_text(content)
    question = get_question(content)
    # The candidate column stores a list literal; parse it without executing code
    modified_choices = ast.literal_eval(row['candidate'])
    label = modified_choices.index(row['label'])
    ## Hard-code for swag format!
    res.append(("",
                "",
                q_id,
                "",
                content,
                question,
                "",
                modified_choices[0],
                modified_choices[1],
                modified_choices[2],
                modified_choices[3],
                label))
df = pd.DataFrame(res, columns=cols)

### 4. Model training

In [None]:
# Excerpt from a larger training script: `model`, `tokenizer`, `data_collator`,
# `tokenized_datasets`, `training_args`, `model_args`, `last_checkpoint`, and
# `logger` are assumed to be defined earlier in that script.
import os

import numpy as np
from transformers import Trainer


# Metric
def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}


# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"] if training_args.do_train else None,
    eval_dataset=tokenized_datasets["validation"] if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Training
if training_args.do_train:
    # Resume from the last checkpoint if there is one, else from a local model directory
    if last_checkpoint is not None:
        checkpoint = last_checkpoint
    elif os.path.isdir(model_args.model_name_or_path):
        checkpoint = model_args.model_name_or_path
    else:
        checkpoint = None
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()  # Saves the tokenizer too for easy upload

    output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
    if trainer.is_world_process_zero():
        with open(output_train_file, "w") as writer:
            logger.info("***** Train results *****")
            for key, value in sorted(train_result.metrics.items()):
                logger.info(f"  {key} = {value}")
                writer.write(f"{key} = {value}\n")

        # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
        trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))

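## 2021 Haihua AI Challenge: Chinese Reading Comprehension

### Background
Text is the most basic tool humans use to record and express, and an important medium for spreading information. Through text and symbols we can trace the origins of human civilization and pass on knowledge and experience; being able to read is the first step towards knowing and understanding. For artificial intelligence, one of the core problems is cognition, and the core of cognition is semantic understanding.

Machine reading comprehension (MRC) is a frontier topic in natural language processing and artificial intelligence. It matters for giving machines cognitive abilities and raising the level of machine intelligence, and it has broad application prospects. In machine reading comprehension, a machine reads a text and then answers questions about it, which reflects the ability of AI to acquire, understand, and mine textual information. Its practical value is increasingly visible in dialogue, search, question answering, simultaneous interpretation, and similar areas, and the long-term goal is to provide solutions across industries.

The "2021 Haihua AI Challenge: Chinese Reading Comprehension" is co-hosted by the Zhongguancun Haihua Institute for Frontier Information Technology and the Institute for Interdisciplinary Information Sciences at Tsinghua University, with Tencent Cloud as co-organizer. The question bank contains 16,000 items, the total prize pool is 300,000 yuan, and Tencent Cloud provides exclusive computing resources for the secondary-school track.

The competition data comes from question banks of primary school, high school entrance exam (zhongkao), and college entrance exam (gaokao) Chinese reading-comprehension questions: the technical track mainly uses zhongkao and gaokao questions, and the secondary-school track mainly uses primary school questions. Compared with English, Chinese reading comprehension involves more ambiguity and polysemy. Yet Chinese civilization has endured for thousands of years thanks to the people of every era who studied hard and kept its traditions alive, and that is both the appeal and the challenge of this competition: letting machines read text and learn about civilization. In keeping with the original goal of cultivating talent, the organizers retain two parallel tracks, one for secondary-school students and one for technical participants.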
 

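### Task
The technical-track data comes from the zhongkao and gaokao Chinese reading-comprehension question bank. Each item includes a passage, at least one question, and several candidate choices. Participants need to build a model that selects the correct choice from the candidates.

Competition page: https://www.biendata.xyz/competition/haihua_2021/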