# T5 Text2Text (sentence-level generation)
Trained on the MCQ dataset; distractor label format: ```<s>_ of distractors are d1, d2, d3</s>``` <br>
**The train dataset is split into train and valid sets at a 9:1 ratio.**<br>

### GPU

In [1]:
!nvidia-smi

Thu Aug 24 13:51:15 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA TITAN RTX               On  | 00000000:09:00.0 Off |                  N/A |
| 41%   42C    P8              36W / 280W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX               On  | 00000000:0A:00.0 Off |  

### Weights & Biases (metrics logging, optional)

In [2]:
project_name = "test on MCQ with T5"
import os

os.environ["WANDB_PROJECT"] = project_name

### Imports

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch



### Loading the dataset

In [4]:
import json
import os, sys
import fnmatch

In [5]:
def read_data(item):
    path = '/user_data/CTG/data/MCQ/total_new_cleaned_{}.json'.format(item)
    with open(path) as f:
        data = json.load(f)
    return data

In [6]:
train = read_data('train')
test = read_data('test')

In [7]:
len(train), len(test)

(2321, 259)

In [8]:
train[0]

{'answer': 'gravity',
 'distractors': ['friction', 'erosion', 'magnetism'],
 'sentence': '**blank** causes rocks to roll downhill'}

### Prepare data

In [9]:
from sklearn.model_selection import train_test_split

train, valid = train_test_split(train, random_state=777, train_size=0.9)
len(train), len(valid)

(2088, 233)
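The split sizes above can be checked by hand. This is a minimal sketch, assuming that a float `train_size` is rounded down and the remainder goes to the validation set; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, train_fraction):
    """Return (n_train, n_valid) for a float train fraction (hypothetical helper)."""
    n_train = math.floor(train_fraction * n_samples)  # train count is rounded down
    n_valid = n_samples - n_train                      # everything else is validation
    return n_train, n_valid

# 2321 records in the original train file, 90% kept for training
print(split_sizes(2321, 0.9))
```

With 2321 records this gives 2088 and 233, matching the cell output.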

In [10]:
def processData(data, task_prefix):
    
    sentences = []
    labels = []
    answers = []
    for d in data:
        sentence = d['sentence']
        distractors = d['distractors']
        
        sentence = sentence.replace('**blank**', '_')
        # strip stray whitespace from the dataset's distractor labels
        distractors = [dis.strip() for dis in distractors]
        
        sentences.append(task_prefix + sentence)
        labels.append('_ of distractors are ' + ', '.join(distractors))
        answers.append(d['answer'])
        
    return sentences, answers, labels

In [11]:
task_prefix = 'distractor generation: '
train_sent, train_answer, train_label = processData(train, task_prefix)
valid_sent, valid_answer, valid_label = processData(valid, task_prefix)
test_sent, test_answer, test_label = processData(test, task_prefix)

In [12]:
train[314]

{'answer': 'abrasion',
 'distractors': ['absorption', 'filtration', 'decomposition'],
 'sentence': 'the main way that wind causes erosion is **blank**'}

In [13]:
train_label[314]

'_ of distractors are absorption, filtration, decomposition'
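To make the input and label format concrete, here is a single-record sketch of what `processData` produces. `build_example` and the two constants are names introduced only for this illustration; the record is `train[0]` as printed earlier in the notebook.

```python
TASK_PREFIX = 'distractor generation: '
LABEL_PREFIX = '_ of distractors are '

def build_example(record, task_prefix=TASK_PREFIX):
    """Turn one MCQ record into (source text, answer, target text)."""
    # the blank marker becomes a single underscore
    sentence = record['sentence'].replace('**blank**', '_')
    # strip stray whitespace around each distractor
    distractors = [d.strip() for d in record['distractors']]
    target = LABEL_PREFIX + ', '.join(distractors)
    return task_prefix + sentence, record['answer'], target

record = {
    'answer': 'gravity',
    'distractors': ['friction', 'erosion', 'magnetism'],
    'sentence': '**blank** causes rocks to roll downhill',
}
source, answer, target = build_example(record)
print(source)  # distractor generation: _ causes rocks to roll downhill
print(answer)  # gravity
print(target)  # _ of distractors are friction, erosion, magnetism
```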

In [14]:
for idx in range(5):
    print(train_sent[idx])
    print(train_answer[idx])
    print(train_label[idx])
    print()

distractor generation: In _ reinforcement, the reinforcer follows every correct response.
continuous
_ of distractors are intermittent, partial, negative

distractor generation: _ of glucose units is found in plants and serves a structural purpose
cellulose
_ of distractors are frucose, carbonate, sucrose

distractor generation: _ is associated with both public goods as well as common resources
nonexcludable
_ of distractors are rival, nonrival, excludable

distractor generation: _ can refer to a rope in a particular shape and a genetic structure involved in splicing
lariat
_ of distractors are braid, tourniquet, noose

distractor generation: the outer layer of cells in a root is called _
epidermis
_ of distractors are igneous, skeletal, muscles



In [15]:
len(train_sent), len(train_answer), len(train_label)

(2088, 2088, 2088)

In [16]:
len(valid_sent), len(valid_answer), len(valid_label)

(233, 233, 233)

In [17]:
len(test_sent), len(test_answer), len(test_label)

(259, 259, 259)

In [18]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')

In [19]:
train_encodings = tokenizer(train_sent, train_answer, truncation=True, padding=True)
valid_encodings = tokenizer(valid_sent, valid_answer, truncation=True, padding=True)
test_encodings = tokenizer(test_sent, test_answer, truncation=True, padding=True)

In [20]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [21]:
print(train_encodings.input_ids[0])

[15980, 127, 3381, 10, 86, 3, 834, 28050, 6, 8, 19452, 52, 6963, 334, 2024, 1773, 5, 1, 7558, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [22]:
len(train_encodings.input_ids[0])

130

In [23]:
tokenizer.decode(train_encodings.input_ids[0])

'distractor generation: In _ reinforcement, the reinforcer follows every correct response.</s> continuous</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>'
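Because `padding=True` is applied to the whole list at once, every row is padded to the longest input in the split (130 tokens here), so a short row such as the one above is mostly `<pad>`. The toy sketch below shows the idea on made-up token ids; `pad_to_longest` is a hypothetical helper and not the tokenizer's actual implementation.

```python
def pad_to_longest(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# toy ids, not real T5 vocabulary entries
toy = [[7, 8, 9, 1], [5, 1]]
ids, mask = pad_to_longest(toy)
print(ids)   # [[7, 8, 9, 1], [5, 1, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding per batch instead of per split would shorten most batches, at the cost of tokenizing without `padding=True` and letting the data collator pad.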

In [24]:
len(train_encodings.input_ids)

2088

In [25]:
def add_labels(encodings, distractors):
    
    distractors_encodings = tokenizer(distractors, padding=True)
    labels = []
    for i in range(len(distractors_encodings.input_ids)):
        labels.append(distractors_encodings.input_ids[i])
    
    encodings["labels"] = labels
    return encodings

In [26]:
train_encodings = add_labels(train_encodings, train_label)
valid_encodings = add_labels(valid_encodings, valid_label)
test_encodings = add_labels(test_encodings, test_label)

In [27]:
len(train_encodings.input_ids)

2088

In [28]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [29]:
print(train_encodings.labels[0])

[3, 834, 13, 1028, 29676, 33, 27688, 6, 11807, 6, 2841, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [30]:
tokenizer.decode(train_encodings.labels[0])

'_ of distractors are intermittent, partial, negative</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>'
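Note that the labels are padded with the pad token id (0), so the padded positions appear to count toward the training and evaluation loss. A common alternative is to replace label padding with -100, the index the T5 loss ignores. The sketch below is an optional variant, not what this notebook ran; the reported numbers come from the unmasked labels. `mask_label_padding` is a hypothetical helper.

```python
PAD_TOKEN_ID = 0     # T5 pad id, as seen in the encodings above
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_label_padding(label_ids, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace padding positions in each label row with the ignore index."""
    return [[tok if tok != pad_id else ignore_index for tok in row]
            for row in label_ids]

# first label row from the notebook, shortened to three pad positions
padded = [[3, 834, 13, 1028, 29676, 33, 27688, 6, 11807, 6, 2841, 1, 0, 0, 0]]
masked = mask_label_padding(padded)
print(masked[0][-4:])  # [1, -100, -100, -100]
```

`compute_metrics` already maps -100 back to the pad id before decoding, so it would keep working with masked labels.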

In [31]:
class MCQDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = MCQDataset(train_encodings)
valid_dataset = MCQDataset(valid_encodings)
test_dataset = MCQDataset(test_encodings)

In [32]:
len(train_dataset), len(valid_dataset), len(test_dataset)

(2088, 233, 259)

In [33]:
type(train_dataset)

__main__.MCQDataset

In [34]:
train_dataset[0]

{'input_ids': tensor([15980,   127,  3381,    10,    86,     3,   834, 28050,     6,     8,
         19452,    52,  6963,   334,  2024,  1773,     5,     1,  7558,     1,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

In [35]:
train_dataset

<__main__.MCQDataset at 0x7f49994af320>

In [36]:
len(train_dataset)

2088

### Fine-tuning

In [37]:
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
import torch

model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.resize_token_embeddings(len(tokenizer))

Embedding(32100, 768)

In [38]:
model_dict = torch.nn.ModuleDict({
    'model': model,
})
checkpoint = torch.load('/user_data/Cloze/dtt_mask_lm_model/t5/parameter_analysis/sciq_all_passage_level_k=12/checkpoints/epoch=03-dev_loss=0.14.ckpt')
model_dict.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>
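Wrapping the model in a `ModuleDict` under the key `'model'` makes every parameter name start with `model.`, which is presumably how the keys in the checkpoint's `state_dict` are named, given that all keys matched. An equivalent route, sketched here under that assumption, is to strip the prefix and load into the model directly. `strip_prefix` is a hypothetical helper and the toy values stand in for tensors.

```python
def strip_prefix(state_dict, prefix='model.'):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# toy stand-in for checkpoint['state_dict']; real values would be tensors
toy_state = {'model.shared.weight': 'W0', 'model.lm_head.weight': 'W1'}
cleaned = strip_prefix(toy_state)
print(sorted(cleaned))  # ['lm_head.weight', 'shared.weight']

# With the real checkpoint this would be used as (assumption, not run here):
# model.load_state_dict(strip_prefix(checkpoint['state_dict']))
```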

In [39]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir = "./results-3",
    save_strategy = "epoch",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=30,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="P@1",
    weight_decay=0.01,
    predict_with_generate=True,
    eval_accumulation_steps = 1,
    report_to="wandb" if os.getenv("WANDB_PROJECT") else "none"
)
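With `load_best_model_at_end=True` and `metric_for_best_model="P@1"`, the trainer looks up the metric under an `eval_` prefix and keeps the checkpoint with the highest value. The sketch below imitates that selection in plain Python; it is a rough illustration, not the Trainer's code. The P@1 values are the first ten epochs from this run's training log (the log output is truncated, so later epochs are not included), and `best_epoch` is a hypothetical helper.

```python
# validation P@1 per epoch, copied from the first ten rows of the training log
P_AT_1_BY_EPOCH = {
    1: 0.184549, 2: 0.184549, 3: 0.188841, 4: 0.197425, 5: 0.214592,
    6: 0.23176, 7: 0.236052, 8: 0.23176, 9: 0.257511, 10: 0.283262,
}

def best_epoch(metric_by_epoch, greater_is_better=True):
    """Pick the epoch whose metric is best (highest by default)."""
    pick = max if greater_is_better else min
    return pick(metric_by_epoch, key=metric_by_epoch.get)

# the metric name is looked up with an "eval_" prefix
metric_key = "P@1"
if not metric_key.startswith("eval_"):
    metric_key = "eval_" + metric_key

print(metric_key)                   # eval_P@1
print(best_epoch(P_AT_1_BY_EPOCH))  # 10
```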

In [40]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [41]:
import numpy as np
def compute_metrics(p):
    predictions, labels = p
    
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # collect the parsed distractor lists for every example
    predicted = []
    true_label = []
    
    for k in range(len(decoded_labels)):
        pred = decoded_preds[k]
        label = decoded_labels[k]

        pred_list = pred.split(', ')
        label_list = label.split(', ')
        
        pred_list[0] = pred_list[0].split(' ')[-1]
        label_list[0] = label_list[0].split(' ')[-1]

        predicted.append(pred_list)
        true_label.append(label_list)

    # evaluation metrics
    p1 = 0
    p3 = 0
    r3 = 0
    f3 = 0
    for idx in range(len(true_label)):
        distractors = predicted[idx]
        labels = true_label[idx]

        act_set = set(labels)
        pred1_set = set(distractors[:1])
        pred3_set = set(distractors[:3])

        p_1 = len(act_set & pred1_set) / float(1)
        p_3 = len(act_set & pred3_set) / float(3)
        r_3 = len(act_set & pred3_set) / float(len(act_set))

        if p_3 == 0 and r_3 == 0:
            f1_3 = 0
        else:
            f1_3 = 2 * (p_3 * r_3 / (p_3 + r_3))

        p1+=p_1
        p3+=p_3
        r3+=r_3
        f3+=f1_3

    avg_p1 = p1 / len(true_label)
    avg_p3 = p3 / len(true_label)
    avg_r3 = r3 / len(true_label)
    avg_f3 = f3 / len(true_label)

    result = {'P@1': avg_p1,
              'P@3': avg_p3,
              'R@3': avg_r3,
              'F1@3': avg_f3}
    
    return result
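As a worked example of the per-example metrics, the sketch below recomputes P@1, P@3, R@3 and F1@3 for two items taken from this notebook's result printout. `metrics_at_k` is a name introduced for illustration; it follows the same set-based logic as `compute_metrics`, so the order within the top 3 does not affect P@3 or R@3.

```python
def metrics_at_k(predicted, gold):
    """Set-overlap P@1, P@3, R@3 and F1@3 for one example."""
    gold_set = set(gold)
    hits_1 = len(gold_set & set(predicted[:1]))
    hits_3 = len(gold_set & set(predicted[:3]))
    p1 = hits_1 / 1
    p3 = hits_3 / 3
    r3 = hits_3 / len(gold_set)
    # harmonic mean of P@3 and R@3, defined as 0 when both are 0
    f1 = 0 if p3 + r3 == 0 else 2 * p3 * r3 / (p3 + r3)
    return {'P@1': p1, 'P@3': p3, 'R@3': r3, 'F1@3': f1}

# one of three predictions is correct, and it is not ranked first
partial = metrics_at_k(['protazoa', 'proteins', 'bacteria'],
                       ['proteins', 'carbohydrates', 'plasma'])
print(partial)  # P@1 is 0.0; P@3, R@3 and F1@3 are each about 0.333

# all three predictions are correct, in a different order
perfect = metrics_at_k(['roots', 'stems', 'leaves'],
                       ['leaves', 'stems', 'roots'])
print(perfect)  # every metric is 1.0
```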

In [42]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [43]:
trainer.train()

***** Running training *****
  Num examples = 2088
  Num Epochs = 30
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1980
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mms0004284[0m. Use [1m`wandb login --relogin`[0m to force relogin




| Epoch | Training Loss | Validation Loss | P@1 | P@3 | R@3 | F1@3 |
|---|---|---|---|---|---|---|
| 1 | No log | 0.555626 | 0.184549 | 0.095851 | 0.095494 | 0.094727 |
| 2 | No log | 0.532051 | 0.184549 | 0.113019 | 0.108727 | 0.110566 |
| 3 | No log | 0.521581 | 0.188841 | 0.108727 | 0.10515 | 0.106683 |
| 4 | No log | 0.516727 | 0.197425 | 0.123033 | 0.118383 | 0.120376 |
| 5 | No log | 0.513797 | 0.214592 | 0.123033 | 0.119456 | 0.120989 |
| 6 | No log | 0.512386 | 0.23176 | 0.135908 | 0.131617 | 0.133333 |
| 7 | No log | 0.517496 | 0.236052 | 0.143062 | 0.137339 | 0.139669 |
| 8 | 0.513300 | 0.516346 | 0.23176 | 0.153076 | 0.147353 | 0.149683 |
| 9 | 0.513300 | 0.519845 | 0.257511 | 0.16166 | 0.155222 | 0.157858 |
| 10 | 0.513300 | 0.520362 | 0.283262 | 0.16166 | 0.155222 | 0.157858 |


***** Running Evaluation *****
  Num examples = 233
  Batch size = 16
Saving model checkpoint to ./results-3/checkpoint-66
Configuration saved in ./results-3/checkpoint-66/config.json
Model weights saved in ./results-3/checkpoint-66/pytorch_model.bin
tokenizer config file saved in ./results-3/checkpoint-66/tokenizer_config.json
Special tokens file saved in ./results-3/checkpoint-66/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 233
  Batch size = 16
Saving model checkpoint to ./results-3/checkpoint-132
Configuration saved in ./results-3/checkpoint-132/config.json
Model weights saved in ./results-3/checkpoint-132/pytorch_model.bin
tokenizer config file saved in ./results-3/checkpoint-132/tokenizer_config.json
Special tokens file saved in ./results-3/checkpoint-132/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 233
  Batch size = 16
Saving model checkpoint to ./results-3/checkpoint-198
Configuration saved in ./results-3/checkpoint-198/con

TrainOutput(global_step=1980, training_loss=0.34437720558860085, metrics={'train_runtime': 1427.014, 'train_samples_per_second': 43.896, 'train_steps_per_second': 1.388, 'total_flos': 9685284950016000.0, 'train_loss': 0.34437720558860085, 'epoch': 30.0})
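The step counts in the log can be reproduced with a little arithmetic. The sketch assumes two GPUs (inferred from the total batch size of 32 against 16 per device) and that the last partial batch of each epoch is kept; the 500-step logging interval is also an assumption, taken to be the default since `logging_steps` is not set in the arguments.

```python
import math

num_examples = 2088
per_device_batch = 16
num_gpus = 2            # assumption: inferred from total batch size 32
num_epochs = 30
logging_steps = 500     # assumption: default interval, not set in the notebook

total_batch = per_device_batch * num_gpus
steps_per_epoch = math.ceil(num_examples / total_batch)  # last partial batch kept
total_steps = steps_per_epoch * num_epochs
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(total_batch)         # 32
print(steps_per_epoch)     # 66
print(total_steps)         # 1980
print(first_logged_epoch)  # 8
```

This lines up with the checkpoint names (`checkpoint-66`, `checkpoint-132`, `checkpoint-198`) and would explain why the training loss column reads "No log" until epoch 8.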

In [44]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 233
  Batch size = 16


{'eval_loss': 0.5379454493522644,
 'eval_P@1': 0.30472103004291845,
 'eval_P@3': 0.1831187410586552,
 'eval_R@3': 0.17417739628040058,
 'eval_F1@3': 0.1778867770284079,
 'eval_runtime': 9.9453,
 'eval_samples_per_second': 23.428,
 'eval_steps_per_second': 0.804,
 'epoch': 30.0}

In [45]:
trainer.save_model('/user_data/CTG/train/DG/parameter_analysis/sciq_all_retrieve_k=12/mcq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3')

Saving model checkpoint to /user_data/CTG/train/DG/parameter_analysis/sciq_all_retrieve_k=3/mcq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3
Configuration saved in /user_data/CTG/train/DG/parameter_analysis/sciq_all_retrieve_k=3/mcq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3/config.json
Model weights saved in /user_data/CTG/train/DG/parameter_analysis/sciq_all_retrieve_k=3/mcq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3/pytorch_model.bin
tokenizer config file saved in /user_data/CTG/train/DG/parameter_analysis/sciq_all_retrieve_k=3/mcq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3/tokenizer_config.json
Special tokens file saved in /user_data/CTG/train/DG/parameter_analysis/sciq_all_retrieve_k=3/mcq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3/special_tokens_map.json


In [46]:
predictions, labels, metrics = trainer.predict(valid_dataset)
print('valid: ')
metrics

***** Running Prediction *****
  Num examples = 233
  Batch size = 16


valid: 


{'test_loss': 0.5379454493522644,
 'test_P@1': 0.30472103004291845,
 'test_P@3': 0.1831187410586552,
 'test_R@3': 0.17417739628040058,
 'test_F1@3': 0.1778867770284079,
 'test_runtime': 8.1275,
 'test_samples_per_second': 28.668,
 'test_steps_per_second': 0.984}

In [47]:
predictions, labels, metrics = trainer.predict(test_dataset)
print('test: ')
metrics

***** Running Prediction *****
  Num examples = 259
  Batch size = 16


test: 


{'test_loss': 0.846320390701294,
 'test_P@1': 0.25096525096525096,
 'test_P@3': 0.1467181467181467,
 'test_R@3': 0.1467181467181467,
 'test_F1@3': 0.1467181467181467,
 'test_runtime': 6.294,
 'test_samples_per_second': 41.15,
 'test_steps_per_second': 1.43}

In [48]:
stop

NameError: name 'stop' is not defined

### Save Distractor Data

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
import torch

tokenizer = T5Tokenizer.from_pretrained("/user_data/CTG/train/DG/MCQ/T5_sciq_all_passage_level/sciq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3")
model = T5ForConditionalGeneration.from_pretrained("/user_data/CTG/train/DG/MCQ/T5_sciq_all_passage_level/sciq/t5-base-text2text-mcq-pretrain-on-sciq-all-passage-level-e3")

Didn't find file /user_data/CTG/model/t5-base-finetuned-text2text-sentence-mcq(pretrain)/added_tokens.json. We won't load it.
loading file /user_data/CTG/model/t5-base-finetuned-text2text-sentence-mcq(pretrain)/spiece.model
loading file None
loading file /user_data/CTG/model/t5-base-finetuned-text2text-sentence-mcq(pretrain)/special_tokens_map.json
loading file /user_data/CTG/model/t5-base-finetuned-text2text-sentence-mcq(pretrain)/tokenizer_config.json
loading configuration file /user_data/CTG/model/t5-base-finetuned-text2text-sentence-mcq(pretrain)/config.json
Model config T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  

In [None]:
batch_size = 64
args = Seq2SeqTrainingArguments(
    output_dir = "./results",
    save_strategy = "epoch",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=50,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="P@1",
    weight_decay=0.01,
    predict_with_generate=True,
    eval_accumulation_steps = 1,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
import numpy as np
def compute_metrics(p):
    predictions, labels = p
    
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # collect the parsed distractor lists for every example
    predicted = []
    true_label = []
    
    for k in range(len(decoded_labels)):
        pred = decoded_preds[k]
        label = decoded_labels[k]

        pred_list = pred.split(', ')
        label_list = label.split(', ')
        
        pred_list[0] = pred_list[0].split(' ')[-1]
        label_list[0] = label_list[0].split(' ')[-1]

        predicted.append(pred_list)
        true_label.append(label_list)

    # evaluation metrics
    p1 = 0
    p3 = 0
    r3 = 0
    f3 = 0
    for idx in range(len(true_label)):
        distractors = predicted[idx]
        labels = true_label[idx]

        act_set = set(labels)
        pred1_set = set(distractors[:1])
        pred3_set = set(distractors[:3])

        p_1 = len(act_set & pred1_set) / float(1)
        p_3 = len(act_set & pred3_set) / float(3)
        r_3 = len(act_set & pred3_set) / float(len(act_set))

        if p_3 == 0 and r_3 == 0:
            f1_3 = 0
        else:
            f1_3 = 2 * (p_3 * r_3 / (p_3 + r_3))

        p1+=p_1
        p3+=p_3
        r3+=r_3
        f3+=f1_3

    avg_p1 = p1 / len(true_label)
    avg_p3 = p3 / len(true_label)
    avg_r3 = r3 / len(true_label)
    avg_f3 = f3 / len(true_label)

    result = {'P@1': avg_p1,
              'P@3': avg_p3,
              'R@3': avg_r3,
              'F1@3': avg_f3}
    
    return result
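The parsing step inside `compute_metrics` keeps only the last word of the first comma-separated item, which removes the `_ of distractors are` prefix but would also shorten a multi-word first distractor. The sketch below contrasts that behaviour with a prefix-stripping variant. Both function names are introduced for illustration, the second input string is a made-up example, and the variant is not what produced the scores reported in this notebook.

```python
LABEL_PREFIX = '_ of distractors are '

def parse_like_notebook(text):
    """Same steps as compute_metrics: split on ', ', keep last word of item 0."""
    items = text.split(', ')
    items[0] = items[0].split(' ')[-1]
    return items

def parse_with_prefix(text, prefix=LABEL_PREFIX):
    """Alternative: drop the known prefix, then split on ', '."""
    if text.startswith(prefix):
        text = text[len(prefix):]
    return [item.strip() for item in text.split(', ')]

single = '_ of distractors are intermittent, partial, negative'
print(parse_like_notebook(single))  # ['intermittent', 'partial', 'negative']
print(parse_with_prefix(single))    # ['intermittent', 'partial', 'negative']

# hypothetical label whose first distractor has two words
multi = '_ of distractors are carbon dioxide, oxygen, nitrogen'
print(parse_like_notebook(multi))   # ['dioxide', 'oxygen', 'nitrogen']
print(parse_with_prefix(multi))     # ['carbon dioxide', 'oxygen', 'nitrogen']
```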

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
valid_predictions, valid_labels, valid_metrics = trainer.predict(valid_dataset)
valid_metrics

***** Running Prediction *****
  Num examples = 233
  Batch size = 64


{'test_loss': 0.24040304124355316,
 'test_P@1': 0.4034334763948498,
 'test_P@3': 0.33476394849785396,
 'test_R@3': 0.31795422031473525,
 'test_F1@3': 0.32503576537911283,
 'test_runtime': 7.2694,
 'test_samples_per_second': 32.052,
 'test_steps_per_second': 0.275}

In [None]:
test_predictions, test_labels, test_metrics = trainer.predict(test_dataset)
test_metrics

***** Running Prediction *****
  Num examples = 259
  Batch size = 64


{'test_loss': 2.043412208557129,
 'test_P@1': 0.15057915057915058,
 'test_P@3': 0.1351351351351351,
 'test_R@3': 0.1351351351351351,
 'test_F1@3': 0.1351351351351351,
 'test_runtime': 5.0922,
 'test_samples_per_second': 50.863,
 'test_steps_per_second': 0.589}

In [None]:
import json
def write_json(data, path):
    # Serialize `data` as JSON; the context manager closes the file even on error.
    with open(path, "w") as json_file:
        json.dump(data, json_file)

In [None]:
def save_data(data, predictions, labels, file_name):
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # collect the parsed distractor lists for every example
    predicted = []
    true_label = []
    
    for k in range(len(decoded_labels)):
        pred = decoded_preds[k]
        label = decoded_labels[k]

        pred_list = pred.split(', ')
        label_list = label.split(', ')
        
        pred_list[0] = pred_list[0].split(' ')[-1]
        label_list[0] = label_list[0].split(' ')[-1]

        predicted.append(pred_list)
        true_label.append(label_list)
    
    
    # evaluation metrics
    for idx in range(len(true_label)):
        distractors = predicted[idx]
        labels = true_label[idx]
        
        data[idx]['pred_distractors'] = distractors

        act_set = set(labels)
        pred1_set = set(distractors[:1])
        pred3_set = set(distractors[:3])

        p_1 = len(act_set & pred1_set) / float(1)
        p_3 = len(act_set & pred3_set) / float(3)
        r_3 = len(act_set & pred3_set) / float(len(act_set))

        if p_3 == 0 and r_3 == 0:
            f1_3 = 0
        else:
            f1_3 = 2 * (p_3 * r_3 / (p_3 + r_3))
            
        data[idx]['metric'] = {'P@1': p_1, 'P@3': p_3, 'R@3': r_3, 'F1@3': f1_3}
        
    write_json(data, file_name)
    print(file_name + ' is saved :)')

In [None]:
save_data(test, test_predictions, test_labels, '/user_data/CTG/train/DG/MCQ/T5_sciq_all_passage_level/sciq_test_result/mcq_test_t5_pretrain_on_sciq_all_passage_level.json')

/user_data/CTG/test_result/mcq_test_t5_text2text_sentence_pretrain_on_mcq_passage_level.json is saved :)


### Result

In [None]:
import json
def read_data(path):
    with open(path) as f:
        data = json.load(f)
    return data

In [None]:
test = read_data('/user_data/CTG/test_result/mcq_test_t5_text2text_sentence_pretrain_on_mcq_passage_level.json')

In [None]:
for i in range(0, 100, 7):
    example = test[i]
    sentence = example['sentence']
    answer = example['answer']
    distractors = example['distractors']
    pred_distractors = example['pred_distractors']
    metric = example['metric']
    
    print('sentence:', sentence.replace('**blank**', '_'))
    print('answer:', answer)
    print('distractors:', distractors)
    print('predict:', pred_distractors)
    print('metric:', metric)
    print()

sentence: _ is used to describe a chemical released by an animal that affects the behavior or physiology of animals of the same species
answer: pheromone
distractors: ['enzyme', 'isolate', 'amino']
predict: ['ribosomes', 'sperm', 'eggs']
metric: {'P@1': 0.0, 'P@3': 0.0, 'R@3': 0.0, 'F1@3': 0}

sentence: Gymnosperms have seeds but do not have _ 
answer: flowers
distractors: ['leaves', 'stems', 'roots']
predict: ['roots', 'stems', 'leaves']
metric: {'P@1': 1.0, 'P@3': 1.0, 'R@3': 1.0, 'F1@3': 1.0}

sentence: In recent years , however , researchers have discovered that microsporidia actually have tiny organelles derived from _ 
answer: mitochondria
distractors: ['proteins', 'carbohydrates', 'plasma']
predict: ['protazoa', 'proteins', 'bacteria']
metric: {'P@1': 0.0, 'P@3': 0.3333333333333333, 'R@3': 0.3333333333333333, 'F1@3': 0.3333333333333333}

sentence: _ can reproduce by infecting the cell of a living host
answer: virus
distractors: ['bacteria', 'mucus', 'carcinogens']
predict: ['par