<a href="https://colab.research.google.com/github/azizbarank/distilroberta-base-sst-2-distilled/blob/main/knowledge_distillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing necessary packages

In [1]:
!pip install transformers datasets tensorboard
!sudo apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.


## Chhosing our "teacher" and "student" models

In [1]:
# student = "distilroberta-base" # this is just placeholder ignore
teacher = "klue/roberta-base"

In [2]:
import torch
import torch.nn as nn
import sklearn.metrics
import numpy as np

from tqdm import tqdm
from datasets import load_dataset, load_metric
from datasets.arrow_dataset import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW, Trainer
from torch.utils.data import DataLoader
from transformers import TrainingArguments

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Loading our SST-2 part of the GLUE dataset

In [4]:
dataset = load_dataset("klue", "re")

Found cached dataset klue (/home/hanjuncho/.cache/huggingface/datasets/klue/re/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)


  0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(teacher)

In [6]:
def add_entity_tokens(sentence, object_entity, subject_entity):
    obj_start_idx, obj_end_idx = object_entity['start_idx'], object_entity['end_idx']
    subj_start_idx, subj_end_idx = subject_entity['start_idx'], subject_entity['end_idx']
    
    if obj_start_idx < subj_start_idx:
        new_sentence = sentence[:obj_start_idx] + '<obj>' + sentence[obj_start_idx:obj_end_idx+1] + '</obj>' + \
                       sentence[obj_end_idx+1:subj_start_idx] + '<subj>' + sentence[subj_start_idx:subj_end_idx+1] + \
                       '</subj>' + sentence[subj_end_idx+1:]
    else:
        new_sentence = sentence[:subj_start_idx] + '<subj>' + sentence[subj_start_idx:subj_end_idx+1] + '</subj>' + \
                       sentence[subj_end_idx+1:obj_start_idx] + '<obj>' + sentence[obj_start_idx:obj_end_idx+1] + \
                       '</obj>' + sentence[obj_end_idx+1:]
    
    return new_sentence


def read_klue_re(dataset):
    sentences = []
    labels = []
    
    if isinstance(dataset, Dataset):
        for data in dataset:
            sentence = add_entity_tokens(data['sentence'], data['object_entity'], data['subject_entity'])
            sentences.append(sentence)
            labels.append(data['label'])
    
    return sentences, labels

In [7]:
# train, validation데이터셋에서 sentence와 label만 저장.
train_sentences, train_labels = read_klue_re(dataset['train'])
val_sentences, val_labels = read_klue_re(dataset['validation'])

In [8]:
# 개체 토큰이 정상적으로 잘 추가됐는지 확인하기 위해 train 문장 5개만 출력.
for i, sentence in enumerate(train_sentences[:5]):
    print(sentence, '\n')

〈Something〉는 <obj>조지 해리슨</obj>이 쓰고 <subj>비틀즈</subj>가 1969년 앨범 《Abbey Road》에 담은 노래다. 

호남이 기반인 바른미래당·<obj>대안신당</obj>·<subj>민주평화당</subj>이 우여곡절 끝에 합당해 민생당(가칭)으로 재탄생한다. 

K리그2에서 성적 1위를 달리고 있는 <subj>광주FC</subj>는 지난 26일 <obj>한국프로축구연맹</obj>으로부터 관중 유치 성과와 마케팅 성과를 인정받아 ‘풀 스타디움상’과 ‘플러스 스타디움상’을 수상했다. 

균일가 생활용품점 (주)<subj>아성다이소</subj>(대표 <obj>박정부</obj>)는 코로나19 바이러스로 어려움을 겪고 있는 대구광역시에 행복박스를 전달했다고 10일 밝혔다. 

<obj>1967</obj>년 프로 야구 드래프트 1순위로 <subj>요미우리 자이언츠</subj>에게 입단하면서 등번호는 8번으로 배정되었다. 



In [9]:
tokenizer = AutoTokenizer.from_pretrained(teacher)

In [10]:
ex_sentence = dataset['train'][0]['sentence']

In [11]:
ex_sentence

'〈Something〉는 조지 해리슨이 쓰고 비틀즈가 1969년 앨범 《Abbey Road》에 담은 노래다.'

In [12]:
ex_encoding = tokenizer(ex_sentence,
                        max_length=128,
                        padding='max_length',
                        truncation=True)

## Special Token 추가

In [13]:
entity_special_tokens = {'additional_special_tokens': ['<obj>', '</obj>', '<subj>', '</subj>']}
num_additional_special_tokens = tokenizer.add_special_tokens(entity_special_tokens)

In [15]:
# For Dataloader
batch_size = 32

# For model
num_labels = 30

# For train
learning_rate = 2e-5
weight_decay = 0.01
epochs = 5

In [14]:
class KlueReDataset(torch.utils.data.Dataset):
    def __init__(self, tokenizer, sentences, labels, max_length=128):
        self.encodings = tokenizer(sentences,
                                   max_length=max_length,
                                   padding='max_length',
                                   truncation=True)
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = self.labels[idx]
        
        return item
    
    def __len__(self):
        return len(self.labels)


In [16]:
train_dataset = KlueReDataset(tokenizer, train_sentences, train_labels)
val_dataset = KlueReDataset(tokenizer, val_sentences, val_labels)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(teacher, num_labels=num_labels).to(device)

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifi

In [None]:
model

In [18]:
model.resize_token_embeddings(len(tokenizer))

Embedding(32004, 1024)

In [19]:
args = TrainingArguments(
    # checkpoint
    output_dir='./models/',
    # overwrite_output_dir=True,
    # Model Save & Load
    save_strategy = "epoch", # 'steps'
    load_best_model_at_end=True,
    # save_steps = 500,
    # Dataset
    num_train_epochs=30,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    # Optimizer
    learning_rate=2e-5, # 5e-5
    weight_decay=0.01,  # 0
    # warmup_steps=200,
    # Resularization
    # max_grad_norm = 1.0,
    # label_smoothing_factor=0.1,
    # Evaluation
    metric_for_best_model='eval_f1',
    evaluation_strategy = "epoch",
    # Randomness
    seed=33,
)

In [20]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score, average_precision_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)
    #auprc = average_precision_score(labels, preds, average='micro')
    #roc_auc = roc_auc_score(labels, preds)
    return {
        'acc' : acc,
        'f1': f1,
        #'roc_auc' : roc_auc
    }

In [21]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 32470
  Num Epochs = 30
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 30450
  Number of trainable parameters = 336691230
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


In [None]:
trainer.evaluate()

In [29]:
tokenizer.save_pretrained('./klue-roberta-base-re')
model.save_pretrained('./klue-roberta-base-re')

tokenizer config file saved in ./klue-roberta-base-re/tokenizer_config.json
Special tokens file saved in ./klue-roberta-base-re/special_tokens_map.json
Configuration saved in ./klue-roberta-base-re/config.json
Model weights saved in ./klue-roberta-base-re/pytorch_model.bin


In [38]:
def calc_f1_score(preds, labels):

    preds_relation = []
    labels_relation = []
    
    for pred, label in zip(preds, labels):
        if label != 0:
            preds_relation.append(pred)
            labels_relation.append(label)
    
    f1_score = sklearn.metrics.f1_score(labels_relation, preds_relation, average='micro', zero_division=1)
    
    return f1_score * 100

In [39]:
with torch.no_grad():
    model.eval()
    
    label_all = []
    pred_all = []
    for batch in tqdm(val_loader):
        inputs = {
            'input_ids': batch['input_ids'].to(device),
            'token_type_ids': batch['token_type_ids'].to(device),
            'attention_mask': batch['attention_mask'].to(device),
        }
        labels = batch['labels'].to(device)
        
        outputs = model(**inputs)
        logits = outputs['logits']
        
        preds = torch.argmax(logits, dim=1)
        
        label_all.extend(labels.detach().cpu().numpy().tolist())
        pred_all.extend(preds.detach().cpu().numpy().tolist())
    
    f1_score = calc_f1_score(label_all, pred_all)
f1_score

100%|██████████| 61/61 [00:08<00:00,  7.20it/s]


57.75982167734745

In [32]:
teacher_model = model

In [17]:
tokenizer = AutoTokenizer.from_pretrained('./klue-roberta-base-re')
teacher_model  = AutoModelForSequenceClassification.from_pretrained('./klue-roberta-base-re', num_labels=30)
#teacher_model.resize_token_embeddings(len(tokenizer))

In [18]:
student = "distilroberta-base" 

## Creating our Knowledge Distillation Trainer

In [19]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)

        self.alpha = alpha
        self.temperature = temperature

In [112]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()
        self.step = 0

    def compute_loss(self, model, inputs, return_outputs=False):
        self.step += 1

        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # assert size
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # compute distillation loss and soften probabilities
        loss_function = nn.KLDivLoss(reduction="batchmean")
#         cos_loss_function = nn.CosineEmbeddingLoss(reduction="mean")
        
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
        # return weighted student loss
        loss = (1. - 1.*(2540-self.step)/2540) * student_loss + 1.*(2540-self.step)/2540 * loss_logits
        return (loss, outputs_student) if return_outputs else loss

## Defining the Metric

In [20]:
from datasets import load_metric
import numpy as np

accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    return {
        "accuracy": acc["accuracy"],
        "f1": f1["f1"]
    }

  accuracy_metric = load_metric("accuracy")


## Defining the Training Arguments

In [21]:
from transformers import DistilBertConfig
from transformers import AutoConfig, AutoModel
from torch.utils.data import DataLoader

In [22]:
# For model
num_labels = 30
batch_size = 128

train_dataset = KlueReDataset(tokenizer, train_sentences, train_labels)
val_dataset = KlueReDataset(tokenizer, val_sentences, val_labels)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

In [128]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# my_config = DistilBertConfig(activation="gelu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
#                              hidden_dim=768, label2id=label2id, id2label=id2label)
# my_config.save_pretrained(save_directory='./models/distilkoroberta')    

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# teacher model
# teacher_model = AutoModelForSequenceClassification.from_pretrained(
#     teacher,
#     num_labels=num_labels)

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=30)
student_model.resize_token_embeddings(32000)

PyTorch: setting up devices
loading configuration file config.json from cache at /home/hanjuncho/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "L

Embedding(32000, 768)

In [129]:
def get_n_params(model):
    pp=0
    for p in list(model.parameters()):
        nn=1
        for s in list(p.size()):
            nn = nn*s
        pp += nn
    return pp

In [124]:
get_n_params(student_model)

68117022

In [125]:
get_n_params(teacher_model)

110641182

## Training

In [126]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=10, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=1e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [127]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

RuntimeError: CUDA error: device-side assert triggered

In [54]:
trainer.train()

***** Running training *****
  Num examples = 32470
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 2540
  Number of trainable parameters = 68113950


RuntimeError: CUDA error: device-side assert triggered

In [None]:
def calc_f1_score(preds, labels):

    preds_relation = []
    labels_relation = []
    
    for pred, label in zip(preds, labels):
        if label != 0:
            preds_relation.append(pred)
            labels_relation.append(label)
    
    f1_score = sklearn.metrics.f1_score(labels_relation, preds_relation, average='micro', zero_division=1)
    
    return f1_score * 100

In [60]:
with torch.no_grad():
    student_model.eval()
    
    label_all = []
    pred_all = []
    for batch in tqdm(val_loader):
        inputs = {
            'input_ids': batch['input_ids'].to(device),
            'token_type_ids': batch['token_type_ids'].to(device),
            'attention_mask': batch['attention_mask'].to(device),
        }
        labels = batch['labels'].to(device)
        
        outputs = model(**inputs)
        logits = outputs['logits']
        
        preds = torch.argmax(logits, dim=1)
        
        label_all.extend(labels.detach().cpu().numpy().tolist())
        pred_all.extend(preds.detach().cpu().numpy().tolist())
    
    f1_score = calc_f1_score(label_all, pred_all)
f1_score

100%|██████████| 61/61 [00:08<00:00,  7.17it/s]


59.18762088974855

In [None]:
trainer.predict(sst2_enc["validation"])

In [None]:
teacher_trainer = Trainer(teacher_model,
                          training_args,
                          train_dataset=sst2_enc["train"],
                          eval_dataset=sst2_enc["validation"],
                          data_collator=data_collator,
                          tokenizer=tokenizer,
                          compute_metrics=compute_metrics,
)

In [None]:
teacher_trainer.predict(sst2_enc["validation"])

In [None]:
torch.save(student_model.state_dict(), './models/distilkoroberta_first_7epochs.pt')

In [34]:
from copy import deepcopy

# Fine Tuning on Downstream Tasks

## RE

In [125]:
baseline = deepcopy(student_model)

In [1]:
teacher = "klue/roberta-base"

In [2]:
datasets = load_dataset("klue", 're')

NameError: name 'load_dataset' is not defined

In [39]:
metric = load_metric("glue", "qnli")

Downloading builder script: 5.76kB [00:00, 2.14MB/s]                   


In [26]:
tokenizer

PreTrainedTokenizerFast(name_or_path='klue/roberta-base', vocab_size=32000, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]', 'additional_special_tokens': ['', '', '', '']})

In [55]:
sentence1_key, sentence2_key = ("subject_entity", "object_entity")
print(f"Sentence 1: {datasets['train'][0][sentence1_key]['word']}")
print(f"Sentence 2: {datasets['train'][0][sentence2_key]['word']}")

Sentence 1: 비틀즈
Sentence 2: 조지 해리슨


In [54]:
datasets[sentence1_key]

KeyError: 'subject_entity'

In [49]:
def preprocess_function(examples):
    return tokenizer(
        examples[sentence1_key],
        examples[sentence2_key],
        truncation=True,
        return_token_type_ids=False,
    )

encoded_datasets = datasets.map(preprocess_function, batched=True)

 96%|█████████▌| 24/25 [00:01<00:00, 21.61ba/s]
 67%|██████▋   | 2/3 [00:00<00:00, 15.01ba/s]


In [117]:
my_config

DistilBertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.23.1",
  "vocab_size": 32000
}

In [126]:
num_labels = 3
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=num_labels)
model = AutoModelForSequenceClassification.from_config(my_config)
model_dict = model.state_dict()
pretrained_dict = torch.load("/home/seungjoonpark/DistilKoBERT/models/distilkoroberta.pt")
del pretrained_dict[next(reversed(pretrained_dict))]
del pretrained_dict[next(reversed(pretrained_dict))]
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
model_dict.update(pretrained_dict) 
model.load_state_dict(pretrained_dict, strict=False)

_IncompatibleKeys(missing_keys=['classifier.weight', 'classifier.bias'], unexpected_keys=[])

In [127]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [128]:
batch_size=256

In [None]:
metric_name = "accuracy"

args = TrainingArguments(
    "test-nli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [132]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256


{'eval_loss': 1.1056615114212036,
 'eval_accuracy': 0.37133333333333335,
 'eval_runtime': 2.453,
 'eval_samples_per_second': 1222.99,
 'eval_steps_per_second': 4.892,
 'epoch': 5.0}

## Installing Optuna for Hyperparameter Tuning

## Defining the Hyperparamater Space to be optimized over

In [137]:
def hp_space(trial):
    return {
      "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
      "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3 ,log=True),
      "alpha": trial.suggest_float("alpha", 0, 1),
      "temperature": trial.suggest_int("temperature", 2, 30),
      }

## Running the Hyperparameter Search

In [138]:
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
                            label2id=label2id, id2label=id2label)


def student_init():
    return AutoModelForSequenceClassification.from_config(
    my_config)

trainer = DistillationTrainer(
    model_init=student_init,
    args=training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    n_trials=2,
    direction="maximize",
    hp_space=hp_space
)

print(best_run)

You passed along `num_labels=6` with an incompatible id to label map: {'0': 'IT과학', '1': '경제', '2': '사회', '3': '생활문화', '4': '세계', '5': '스포츠', '6': '정치'}. The number of labels wil be overwritten to 7.
Using cuda_amp half precision backend
[32m[I 2022-10-27 23:39:03,618][0m A new study created in memory with name: no-name-72755f35-ffe2-455f-abeb-7ef48083cfc8[0m
Trial: {'num_train_epochs': 4, 'learning_rate': 0.0003356216196363318, 'alpha': 0.0038843531441111745, 'temperature': 28}
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulatio

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0099,0.007726,0.148018
2,0.0079,0.007775,0.148018
3,0.0078,0.007747,0.148018
4,0.0078,0.007662,0.148018


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-0/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceC

Epoch,Training Loss,Validation Loss,Accuracy
1,0.3879,0.372254,0.706929
2,0.3427,0.365087,0.71758
3,0.333,0.354984,0.77962
4,0.3272,0.358589,0.764577
5,0.3232,0.36462,0.752718
6,0.3206,0.358994,0.775338
7,0.3183,0.357679,0.784122
8,0.3168,0.363774,0.765565
9,0.3157,0.36286,0.767541
10,0.3147,0.361569,0.772812


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceC

***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-3213
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/run-1/checkpoint-2856] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore th

BestRun(run_id='1', objective=0.7728121225430987, hyperparameters={'num_train_epochs': 10, 'learning_rate': 4.354784416636035e-05, 'alpha': 0.22918617625637505, 'temperature': 9})


## Updating the training arguments

In [139]:
# overwriting the previous hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(training_args, k, v)

# new repository
best_model_ckpt = "distilroberta-best"
training_args.output_dir = best_model_ckpt

## Final Training

In [140]:
# New Trainer with the updated parameters
optimal_trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

optimal_trainer.train()

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 3570


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3303,0.359125,0.763918
2,0.3238,0.362589,0.753596
3,0.3236,0.36197,0.765345
4,0.3202,0.366088,0.753926
5,0.3181,0.361465,0.772263
6,0.3164,0.365798,0.762271
7,0.315,0.367955,0.753267
8,0.3141,0.363063,0.773032
9,0.3132,0.364926,0.767761
10,0.3128,0.364964,0.767212


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-best/checkpoint-357
Configuration saved in distilroberta-best/checkpoint-357/config.json
Model weights saved in distilroberta-best/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, tok

Configuration saved in distilroberta-best/checkpoint-3570/config.json
Model weights saved in distilroberta-best/checkpoint-3570/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-3570/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-3570/special_tokens_map.json
Deleting older checkpoint [distilroberta-best/checkpoint-3213] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilroberta-best/checkpoint-2856 (score: 0.7730317338311189).


TrainOutput(global_step=3570, training_loss=0.3187510471717984, metrics={'train_runtime': 482.5227, 'train_samples_per_second': 946.65, 'train_steps_per_second': 7.399, 'total_flos': 2774037043248840.0, 'train_loss': 0.3187510471717984, 'epoch': 10.0})