
200 examples in the public dataset leaves very little room for training!

Using `gpt-3.5-turbo` I created another 500 high quality examples at my expense [that I share freely here](https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam). They are what (as of now) pushes this notebook to the highest achieved score across the public notebooks (`0.723`)!

If you find the additional training examples useful, please upvote the dataset! 😊

👉 [additional train data for LLM Science Exam 🥳](https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam)

Thank you! Appreciate your help! 🙏🙏🙏

This notebook builds on a [notebook](https://www.kaggle.com/code/leonidkulyk/lb-0-709-llm-se-deberta-v3-large-i-1k-wiki) by LEONID KULYK. Among some of the changes are:

* use of a new high quality dataset for training
* modified training procedure carried out in the notebook
* general streamlining of code/training for readability

**A couple of related resources you might find useful:**

* [📊 15k high-quality train examples 🏆🔥🚀](https://www.kaggle.com/datasets/radek1/15k-high-quality-examples) - another 15 000 examples I created to help you grow that train/validation set of yours and improve results
* [📊 Best Open Source LLM Starter Pack 🧙🚀](https://www.kaggle.com/datasets/radek1/best-llm-starter-pack) -- the largest (and best) open source model I managed to run on Kaggle!
* [Science Exam Trained Model Weights 🚀](https://www.kaggle.com/datasets/radek1/science-exam-trained-model-weights)

# Preprocessing the dataset

In [152]:
from typing import Optional, Union
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from dataclasses import dataclass  # simplifies defining data-holding classes: the decorator auto-generates __init__ and other special methods from the declared fields
from transformers import AutoTokenizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel

deberta_v3_large = '/home/krisfeng/code/llm/kaggle/input/deberta-v3-large-hf-weights'

In [153]:
torch.cuda.empty_cache()

We begin by loading and processing the train data.

In [154]:
df_train = pd.read_csv('./kaggle/input/extra_train_set.csv')
#df_train = df_train.drop(columns="id")
df_train.shape

(500, 7)

In [155]:
df_valid = pd.read_csv('kaggle/input/train.csv')
df_valid = df_valid.drop(columns="id")

Let's swap in a much larger train set by combining three files of additional examples (26,900 rows in total)!

In [156]:
df_train = pd.concat([
    pd.read_csv('./kaggle/input/6000_train_examples.csv'),
    pd.read_csv('./kaggle/input/15k_gpt3.5-turbo.csv'),
    pd.read_csv('./kaggle/input/5900_examples.csv'),
])
df_train.reset_index(inplace=True, drop=True)
df_train.shape

(26900, 7)

In [157]:
df_train.head()

Unnamed: 0,prompt,A,B,C,D,E,answer
0,What is the primary role of Robin Juhkental in...,Robin Juhkental is the bassist of Malcolm Linc...,Robin Juhkental is the keyboardist of Malcolm ...,Robin Juhkental is the drummer of Malcolm Linc...,Robin Juhkental is the lead singer of Malcolm ...,Robin Juhkental is the lead guitarist of Malco...,D
1,Which of the following statements is true rega...,The theory of relativity only encompasses one ...,Special relativity explains the law of gravita...,The theory of relativity does not encompass an...,Special relativity applies to the cosmological...,General relativity only applies to the motion ...,D
2,In which country was the 1920 collection of co...,United States,Germany,Australia,France,England,E
3,What is one of the areas that Shimon Dovid Cow...,"Environmental conservation, opposing deforesta...","Homosexuality, looser abortion laws and volunt...","Freedom of speech, advocating for increased li...","Gun control, advocating for stricter regulatio...","Animal rights, opposing the use of animals for...",B
4,When did the Dirt Road Diaries Tour begin and ...,"February 17, 2014 - November 26, 2014","January 17, 2014 - October 26, 2014","February 17, 2013 - November 26, 2013","March 17, 2013 - December 26, 2013","January 17, 2013 - October 26, 2013",E


In [158]:
df_train.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
26895    False
26896    False
26897    False
26898    False
26899    False
Length: 26900, dtype: bool

Now that we have grown the train set to 26,900 examples (the 200 public examples are kept for validation), let us preprocess the data and begin training.

In [159]:
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k,v in option_to_index.items()}
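# Pair the prompt with each of the five options (effectively a concatenation), then tokenize each pair into
# input_ids and token_type_ids. token_type_ids marks where the second sequence (the option) starts by
# switching from 0 to 1. Both lists are longer than the raw text because special tokens are added.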
def preprocess(example):
    first_sentence = [str(example['prompt'])] * 5
    second_sentences = [str(example[option]) for option in 'ABCDE']
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation=False)
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example
# This class collates multiple-choice inputs into the format the model expects.
'''
class MyClass:
    def __init__(self, var_a, var_b):
        self.var_a = var_a
        self.var_b = var_b

becomes
@dataclass
class MyClass:
    var_a: str
    var_b: str
'''
@dataclass  # the decorator generates __init__ and other special methods, so no hand-written __init__ is needed
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    # Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
    padding: Union[bool, str, PaddingStrategy] = True 
    # Pad to a maximum length specified with the argument max_length or to the maximum
    # acceptable input length for the model if that argument is not provided.
    max_length: Optional[int] = None 
    # If set will pad the sequence to a multiple of the provided value
    pad_to_multiple_of: Optional[int] = None
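    # __call__ makes the collator callable: it receives a list of features and returns one padded batch
    def __call__(self, features):
        # at this point each feature still contains its label
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        # after the pop, the features no longer contain the label
        labels = [feature.pop(label_name) for feature in features]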
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])  # remove one level of list nesting; nothing is numerically added
        # pad to equal length and merge the values that share a key
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',  # return the padded data as PyTorch tensors
        )
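        # convert the padded output to a plain dict, reshaping every tensor to (batch_size, num_choices, seq_len)
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # add the labels back
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        # the result is a dict whose values are all torch.Tensor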
        return batch
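
To make the flatten, pad, and reshape steps of the collator easier to follow, here is a minimal pure-Python sketch of the same idea. It is not the collator itself: the `collate` helper, the toy token ids, and `pad_id=0` are assumptions made for illustration, and nested lists stand in for tensors.

```python
# Toy stand-in for two tokenized examples, each with 3 choices (the notebook uses 5).
# The token ids are hypothetical; only the shapes matter here.
features = [
    {"input_ids": [[1, 5, 2], [1, 5, 6, 2], [1, 2]], "label": 0},
    {"input_ids": [[1, 7, 8, 9, 2], [1, 7, 2], [1, 7, 8, 2]], "label": 2},
]


def collate(features, pad_id=0):
    """Flatten, pad to the longest row, then regroup as batch x choices x length."""
    labels = [f["label"] for f in features]
    num_choices = len(features[0]["input_ids"])
    # Flatten: one row per (example, choice) pair.
    rows = [row for f in features for row in f["input_ids"]]
    max_len = max(len(row) for row in rows)
    padded = [row + [pad_id] * (max_len - len(row)) for row in rows]
    # Regroup the flat rows into chunks of num_choices.
    input_ids = [padded[i:i + num_choices] for i in range(0, len(padded), num_choices)]
    return {"input_ids": input_ids, "labels": labels}


batch = collate(features)
shape = (len(batch["input_ids"]), len(batch["input_ids"][0]), len(batch["input_ids"][0][0]))
print(shape)  # (2, 3, 5)
```

Padding happens across the whole flattened batch, so every choice of every example ends up with the same length before the rows are regrouped.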

We first create a HuggingFace `Dataset`.

In [160]:
# Load the tokenizer of a pretrained DeBERTa v3 Large model with AutoTokenizer from Hugging Face Transformers.
# A tokenizer splits text into a sequence of words, subwords, or other units so that a model can process it:
# raw text is a continuous character sequence, and the tokenizer cuts it into discrete word-level or subword-level pieces.
# Note to self: the attention mask is still unclear to me.

tokenizer = AutoTokenizer.from_pretrained(deberta_v3_large)
# Convert the pandas DataFrame df_train into a Hugging Face `datasets` Dataset object.
dataset = Dataset.from_pandas(df_train)
dataset

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dataset({
    features: ['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'],
    num_rows: 26900
})

In [161]:
dataset_valid = Dataset.from_pandas(df_valid)
dataset_valid

Dataset({
    features: ['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'],
    num_rows: 200
})

In [162]:
# mapping without remove_columns keeps the original columns and adds the new ones
tokenized_dataset1 = dataset.map(preprocess)
tokenized_dataset_pandas1 = tokenized_dataset1.to_pandas()

tokenized_dataset_pandas1.head()

Map: 100%|██████████| 26900/26900 [00:07<00:00, 3383.55 examples/s]


Unnamed: 0,prompt,A,B,C,D,E,answer,input_ids,token_type_ids,attention_mask,label
0,What is the primary role of Robin Juhkental in...,Robin Juhkental is the bassist of Malcolm Linc...,Robin Juhkental is the keyboardist of Malcolm ...,Robin Juhkental is the drummer of Malcolm Linc...,Robin Juhkental is the lead singer of Malcolm ...,Robin Juhkental is the lead guitarist of Malco...,D,"[[1, 458, 269, 262, 1862, 985, 265, 8175, 1433...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",3
1,Which of the following statements is true rega...,The theory of relativity only encompasses one ...,Special relativity explains the law of gravita...,The theory of relativity does not encompass an...,Special relativity applies to the cosmological...,General relativity only applies to the motion ...,D,"[[1, 2597, 265, 262, 776, 3741, 269, 980, 1712...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",3
2,In which country was the 1920 collection of co...,United States,Germany,Australia,France,England,E,"[[1, 344, 319, 658, 284, 262, 8547, 1283, 265,...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",4
3,What is one of the areas that Shimon Dovid Cow...,"Environmental conservation, opposing deforesta...","Homosexuality, looser abortion laws and volunt...","Freedom of speech, advocating for increased li...","Gun control, advocating for stricter regulatio...","Animal rights, opposing the use of animals for...",B,"[[1, 458, 269, 311, 265, 262, 893, 272, 81398,...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",1
4,When did the Dirt Road Diaries Tour begin and ...,"February 17, 2014 - November 26, 2014","January 17, 2014 - October 26, 2014","February 17, 2013 - November 26, 2013","March 17, 2013 - December 26, 2013","January 17, 2013 - October 26, 2013",E,"[[1, 486, 464, 262, 27469, 2055, 40512, 4590, ...","[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...",4


And let us now preprocess the examples for training.

In [163]:
tokenized_dataset1[0].keys()

dict_keys(['prompt', 'A', 'B', 'C', 'D', 'E', 'answer', 'input_ids', 'token_type_ids', 'attention_mask', 'label'])

In [164]:
type(tokenized_dataset1[0])

dict

In [165]:
# input_ids: the token ids produced by the DeBERTa tokenizer.
# token_type_ids: a mask that separates the two sequences of a pair in a sequence-pair classification task (0 for the first, 1 for the second).
tokenized_dataset = dataset.map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_dataset_pandas = tokenized_dataset.to_pandas()
tokenized_dataset

Map: 100%|██████████| 26900/26900 [00:07<00:00, 3467.38 examples/s]


Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 26900
})

In [166]:
print(df_train.prompt[0])
print(df_train.A[0])
print(df_train.B[0])
print(df_train.C[0])
print(df_train.D[0])
print(df_train.E[0])

What is the primary role of Robin Juhkental in the band Malcolm Lincoln?
Robin Juhkental is the bassist of Malcolm Lincoln and is responsible for laying down the foundation of their music.
Robin Juhkental is the keyboardist of Malcolm Lincoln and adds atmospheric sounds to their music.
Robin Juhkental is the drummer of Malcolm Lincoln and keeps the beat for the band's songs.
Robin Juhkental is the lead singer of Malcolm Lincoln and provides vocals for the band's songs.
Robin Juhkental is the lead guitarist of Malcolm Lincoln and is responsible for creating unique guitar melodies and solos.


In [167]:
# input_ids — List of token ids to be fed to a model.They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.
tokenized_dataset_pandas.input_ids[0]

array([array([    1,   458,   269,   262,  1862,   985,   265,  8175, 14334,
              41672, 91707,   267,   262,  1704, 15084,  6175,   302,     2,
               8175, 14334, 41672, 91707,   269,   262, 26623,   265, 15084,
               6175,   263,   269,  1744,   270,  9022,   444,   262,  3332,
                265,   308,   755,   260,     2], dtype=int32)              ,
       array([    1,   458,   269,   262,  1862,   985,   265,  8175, 14334,
              41672, 91707,   267,   262,  1704, 15084,  6175,   302,     2,
               8175, 14334, 41672, 91707,   269,   262, 67509,   265, 15084,
               6175,   263,  3814, 13123,  2163,   264,   308,   755,   260,
                  2], dtype=int32)                                          ,
       array([    1,   458,   269,   262,  1862,   985,   265,  8175, 14334,
              41672, 91707,   267,   262,  1704, 15084,  6175,   302,     2,
               8175, 14334, 41672, 91707,   269,   262, 16090,   265, 1508

In [170]:
# List of token type ids to be fed to a model 
'''
token
A part of a sentence, usually a word, but can also be a subword (non-common
 words are often split in subwords) or a punctuation symbol.
token Type IDs
Some models’ purpose is to do classification on pairs of sentences or question answering.
We can use our tokenizer to automatically generate such a sentence by passing the two
 sequences to tokenizer as two arguments (and not a list, like before) like this
 
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

which will return:

print(decoded)
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]

The first sequence, the “context” used for the question, 
has all its tokens represented by a 0, whereas the second sequence, 
corresponding to the “question”, has all its tokens represented by a 1.
'''
tokenized_dataset_pandas.token_type_ids[0]

array([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
             dtype=int8)                                                       ,
       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)        ,
       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8),
       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8),
       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
             dtype=int8)                                                       ],
      dtype=object)
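
As a rough illustration of where the 0s and 1s above come from, the sketch below rebuilds the segment ids for the pair layout `[CLS] first [SEP] second [SEP]`. The helper name `pair_token_type_ids` is hypothetical, and the layout is inferred from the printed `input_ids` (which start with `1` and contain `2` as a separator), not taken from the tokenizer's source.

```python
def pair_token_type_ids(first_ids, second_ids):
    """Segment ids for the layout [CLS] first [SEP] second [SEP]."""
    # [CLS] + first + [SEP] belong to segment 0; second + [SEP] belong to segment 1.
    return [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)


# Lengths chosen to match the first row printed above:
# 16 prompt tokens and 22 option tokens (special tokens excluded).
type_ids = pair_token_type_ids(list(range(16)), list(range(22)))
print(type_ids.count(0), type_ids.count(1))  # 18 23
```

The counts agree with the first array in the output: 18 zeros for the prompt segment and 23 ones for option A, 41 positions in total.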

In [171]:
# The attention mask is a binary tensor indicating the position of the padded indices,
# so that the model does not attend to them.
# For the BertTokenizer, 1 indicates a value that should be attended to, while 0 indicates a padded value.
'''
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

sentences = ["It will rain in the",
            "I want to eat a big bowl of",
            "My dog is"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

output_sequences = gpt2.generate(**inputs)

for seq in output_sequences:
    print(tokenizer.decode(seq))
    
{'input_ids': tensor([
    [50256, 50256, 50256,  1026,   481,  6290,   287,   262],
    [   40,   765,   284,  4483,   257,  1263,  9396,   286],
    [50256, 50256, 50256, 50256, 50256,  3666,  3290,   318]
  ]),
'attention_mask': tensor([
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1]
  ])}
'''
tokenized_dataset_pandas.attention_mask[0]

array([array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
             dtype=int8)                                                       ,
       array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)        ,
       array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8),
       array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8),
       array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
             dtype=int8)                                                       ],
      dtype=object)
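
Every mask above consists only of 1s because `preprocess` tokenizes without padding; zeros appear only once sequences of different lengths are padded together. A minimal sketch of that step follows, assuming right padding and a hypothetical `pad_id=0` (the helper `pad_with_mask` is illustrative, not a library function).

```python
def pad_with_mask(sequences, pad_id=0):
    """Right-pad sequences to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_with_mask([[1, 11, 12, 2], [1, 11, 2]])
print(ids)   # [[1, 11, 12, 2], [1, 11, 2, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```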

In [207]:
features = [tokenized_dataset[0]]
label_name = 'label' if 'label' in features[0].keys() else 'labels'
labels = [feature.pop(label_name) for feature in features]
print(labels)
batch_size = len(features)
print(batch_size)
num_choices = len(features[0]['input_ids'])
print(num_choices)
flattened_features = [
        [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
    ]
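print(flattened_features)  # still has one extra level of list nesting
flattened_features = sum(flattened_features, [])  # remove one level of list nesting; nothing is numerically added
print(flattened_features)
# pad to equal length and merge the values that share a key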
batch = tokenizer.pad(
    flattened_features,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    return_tensors='pt',  # return the padded data as PyTorch tensors
)
print(batch.input_ids)
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
#print(batch)
batch['labels'] = torch.tensor(labels, dtype=torch.int64)

[3]
1
5
[[{'input_ids': [1, 458, 269, 262, 1862, 985, 265, 8175, 14334, 41672, 91707, 267, 262, 1704, 15084, 6175, 302, 2, 8175, 14334, 41672, 91707, 269, 262, 26623, 265, 15084, 6175, 263, 269, 1744, 270, 9022, 444, 262, 3332, 265, 308, 755, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}, {'input_ids': [1, 458, 269, 262, 1862, 985, 265, 8175, 14334, 41672, 91707, 267, 262, 1704, 15084, 6175, 302, 2, 8175, 14334, 41672, 91707, 269, 262, 67509, 265, 15084, 6175, 263, 3814, 13123, 2163, 264, 308, 755, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [None]:

tokenized_dataset_valid = dataset_valid.map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])

# Training

In [None]:
import numpy as np
def map_at_3(predictions, labels):
    map_sum = 0
    pred = np.argsort(-1*np.array(predictions),axis=1)[:,:3]
    for x,y in zip(pred,labels):
        z = [1/i if y==j else 0 for i,j in zip([1,2,3],x)]
        map_sum += np.sum(z)
    return map_sum / len(predictions)

# Define your custom evaluation function
def compute_metrics(p):
    predictions = p.predictions.tolist()
    labels = p.label_ids.tolist()
    return {"map@3": map_at_3(predictions, labels)}
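
To check the metric by hand, here is a small worked example of MAP@3 written in pure Python. It mirrors the logic of `map_at_3` above (1/rank if the correct label is among the top three scores, otherwise 0) but is a separate sketch, and the logits are made up for illustration.

```python
def map_at_3(predictions, labels):
    """Mean of 1/rank of the correct label within the top 3 scores (0 if absent)."""
    total = 0.0
    for scores, label in zip(predictions, labels):
        # Indices of the three highest scores, best first.
        top3 = sorted(range(len(scores)), key=lambda i: -scores[i])[:3]
        if label in top3:
            total += 1.0 / (top3.index(label) + 1)
    return total / len(predictions)


# Hypothetical logits for three questions with five options each.
predictions = [
    [0.1, 0.9, 0.3, 0.2, 0.0],  # label 1 is ranked 1st -> contributes 1
    [0.5, 0.4, 0.9, 0.1, 0.0],  # label 0 is ranked 2nd -> contributes 1/2
    [0.9, 0.8, 0.7, 0.6, 0.1],  # label 4 is outside the top 3 -> contributes 0
]
labels = [1, 0, 4]
print(map_at_3(predictions, labels))  # 0.5
```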


In [None]:
training_args = TrainingArguments(
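    # Learning-rate warmup gradually increases the learning rate at the start of training. Early on the
    # model does not yet fit the data well, so a small learning rate keeps the first updates stable and
    # helps the weights settle into a reasonable state. The rate then rises toward its configured value.
    # With warmup_ratio=0.8, the warmup covers the first 80% of the training steps.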
    warmup_ratio=0.8,
    learning_rate=5e-6,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    logging_steps=500,
    metric_for_best_model='map@3',  # metric used to pick the best checkpoint: it is tracked on the validation set during training
    per_device_train_batch_size=1,  # training batch size per GPU; with several GPUs, each device processes a batch of this size
    eval_steps=500,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    #report_to='none',
    output_dir='.',
    save_total_limit = 1,
)

model = AutoModelForMultipleChoice.from_pretrained(deberta_v3_large)
'''
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    train_dataset=tokenized_dataset,
    compute_metrics = compute_metrics,
)
'''
# Trainer is the Hugging Face Transformers API for training and evaluating models with PyTorch
trainer = Trainer(
    model=model,  # can be any pretrained transformer model
    args=training_args,  # training configuration: number of epochs, learning rate, batch size, and other hyperparameters
    tokenizer=tokenizer,  # the tokenizer that turns input text into the form the model accepts
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),  # batch collation for multiple-choice data: the tokenized sequences differ in length, so they must be padded
    train_dataset=tokenized_dataset,  # training set, already tokenized and preprocessed
    eval_dataset=tokenized_dataset_valid,  # validation set, also tokenized and preprocessed
    compute_metrics=compute_metrics,  # function that computes the evaluation metrics; customize it for your task (accuracy, F1, etc.)
)

trainer.train()
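
For intuition about `warmup_ratio=0.8` and `learning_rate=5e-6`, the sketch below computes the learning rate at a given step, assuming the `Trainer` default linear schedule (linear warmup from 0, then linear decay to 0). The function `linear_schedule_lr` and the value of `total_steps` are illustrative assumptions; the real step count depends on dataset size, batch size, number of devices, and epochs.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-6, warmup_ratio=0.8):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: rise linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0.
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 1000  # hypothetical value chosen for readability
for step in (0, 400, 800, 900, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With these settings the peak rate of `5e-6` is reached only after 80% of the run, so most of training happens below the configured learning rate.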

# Predicting on the test set

Now that we have trained our model, let us predict on the test set.

In [None]:
test_df = pd.read_csv('./kaggle/input/train.csv')
test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset just like we preprocessed the train dataset

tokenized_test_dataset = Dataset.from_pandas(test_df.drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E'])

In [None]:
test_predictions = trainer.predict(tokenized_test_dataset).predictions
test_predictions[:4]

The predictions are values output from the last layer of our neural network.

Let's obtain the predicted answer ids by sorting them from largest to the smallest.

In [None]:
predictions_as_ids = np.argsort(-test_predictions, 1)
predictions_as_ids[:3]

Let us now assign a letter corresponding to each predicted id (0 -> 'A', 1 -> 'B', etc). 

In [None]:
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predictions_as_answer_letters[:3]

And let us now go from this representation to a string with the 3 highest-rated answers separated by spaces.

In [None]:
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
predictions_as_string[:3]
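
The three steps above (sort by score, map ids to letters, join the top three) can also be sketched for a single question in pure Python. The helper `top3_letters` and the logits are hypothetical and only illustrate the transformation; the notebook itself does this in a vectorized way with NumPy.

```python
def top3_letters(logits, options="ABCDE"):
    """Return the three best-scoring option letters, best first, joined by spaces."""
    # Option indices sorted from the highest score to the lowest.
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return " ".join(options[i] for i in order[:3])


# Hypothetical logits for two questions.
print(top3_letters([0.2, 1.5, -0.3, 0.9, 0.1]))  # B D A
print(top3_letters([2.0, -1.0, 0.5, 0.4, 3.1]))  # E A C
```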

And we are done! 🥳

Let us now output our submission.

In [None]:
submission = test_df[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)

pd.read_csv('submission.csv').head()

I hope you enjoyed this notebook!

If you found it useful, please upvote 👉 [the corresponding dataset with 500 new training examples](https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam) 👈

Thank you, appreciate your help! 🙏😊

Thank you for reading and happy Kaggling!