<a href="https://colab.research.google.com/github/azizbarank/distilroberta-base-sst-2-distilled/blob/main/knowledge_distillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing necessary packages

In [1]:
!pip install transformers datasets tensorboard
!sudo apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.


## Choosing our "teacher" and "student" models

In [1]:
student = "distilroberta-base"  # used mainly for its 6-layer architecture; teacher weights are copied in later
teacher = "klue/roberta-base"

## Loading the YNAT (topic classification) subset of the KLUE dataset

In [2]:
from datasets import load_dataset

dataset = load_dataset("klue", "ynat")

Found cached dataset klue (/home/hanjuncho/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)


  0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
# config, unused_kwargs = AutoConfig.from_pretrained(student, output_attention=True,
#                                                    foo=False, return_unused_kwargs=True)
# config.num_hidden_layers = 6
# config.num_labels = num_labels

## Tokenization

### Initializing the tokenizer (loaded from the teacher checkpoint and shared with the student)

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(teacher)

In [4]:
from IPython.core.debugger import set_trace

In [5]:
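def process(examples):
    # YNAT stores each news headline in the "title" column
    tokenized_inputs = tokenizer(
        examples["title"], truncation=True, max_length=512
    )
    return tokenized_inputs

# the name `sst2_enc` is a leftover from an SST-2 setup; it holds the encoded YNAT splits
sst2_enc = dataset.map(process, batched=True)
# the Trainer expects the target column to be called "labels"
sst2_enc = sst2_enc.rename_column("label","labels")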

sst2_enc["validation"].features

Loading cached processed dataset at /home/hanjuncho/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-fccd25ea59f26505.arrow
Loading cached processed dataset at /home/hanjuncho/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-95865b9e92d66c7c.arrow


{'guid': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치'], id=None),
 'url': Value(dtype='string', id=None),
 'date': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

## Creating our Knowledge Distillation Trainer

In [6]:
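from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    """TrainingArguments extended with the two distillation hyperparameters."""
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)

        # alpha: weight of the hard-label loss; (1 - alpha) weights the soft-target loss
        self.alpha = alpha
        # temperature: softens both logit distributions before the KL term
        self.temperature = temperature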

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        # place the teacher on the same device as the student and switch off dropout
        self._move_model_to_device(self.teacher, self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # num_items_in_batch is accepted only for compatibility with newer transformers releases; unused here

        # student forward pass; `inputs` contains "labels", so .loss is the hard-label cross-entropy
        outputs_student = model(**inputs)
        student_loss = outputs_student.loss

        # teacher forward pass, without gradients
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # both models must predict over the same label set
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # soft-target loss: KL divergence between the temperature-softened distributions,
        # scaled by T^2 so its gradients stay comparable to the hard-label loss
        loss_function = nn.KLDivLoss(reduction="batchmean")
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))

        # weighted sum of the hard-label loss and the soft-target loss
        loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
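
To make the loss in `compute_loss` concrete, here is a minimal pure-Python sketch of the same formula for a single example. The logits, the helper names `softmax` and `distillation_loss`, and the choice of class 0 as the correct label are all made up for illustration; the trainer itself uses `nn.KLDivLoss` on batched tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_loss, alpha=0.5, temperature=2.0):
    """alpha * hard_loss + (1 - alpha) * T^2 * KL(teacher || student) for one example."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_teacher, p_student))
    soft_loss = kl * temperature ** 2
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Hypothetical logits for a 3-class problem.
teacher_logits = [4.0, 1.0, -2.0]
student_logits = [2.5, 0.5, -1.0]
# Hard-label cross-entropy of the student, assuming class 0 is the correct label.
hard_loss = -math.log(softmax(student_logits)[0])

loss = distillation_loss(student_logits, teacher_logits, hard_loss, alpha=0.5, temperature=4.0)
print(round(loss, 4))
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the wrong classes. When the student's logits equal the teacher's, the KL term is zero.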

## Defining the Metric

In [8]:
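# note: `load_metric` is deprecated in newer `datasets` releases in favour of the `evaluate` library
from datasets import load_metric
import numpy as np

accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")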

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    return {
        "accuracy": acc["accuracy"],
        "f1": f1["f1"]
    }
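
For reference, the two numbers returned by `compute_metrics` can be reproduced by hand. The sketch below is a pure-Python illustration with made-up predictions; `accuracy` and `macro_f1` are hypothetical helper names, not part of `datasets`.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over the classes seen in either list."""
    classes = sorted(set(references) | set(predictions))
    scores = []
    for c in classes:
        tp = sum(p == c and r == c for p, r in zip(predictions, references))
        fp = sum(p == c and r != c for p, r in zip(predictions, references))
        fn = sum(p != c and r == c for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical predictions for six examples and three classes.
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

print(round(accuracy(predictions, references), 4))
print(round(macro_f1(predictions, references), 4))
```

Macro averaging gives each of the seven YNAT topics the same weight regardless of how many headlines it has, which is why it is reported next to accuracy.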



## Teacher training

In [9]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# id2label, label2id dicts for the outputs for the model
labels = sst2_enc["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

teacher_model = AutoModelForSequenceClassification.from_pretrained(
    teacher,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'clas

In [10]:
batch_size = 128

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

args = TrainingArguments(
    # checkpoint
    output_dir='./models/',
    # overwrite_output_dir=True,

    # Model Save & Load
    save_strategy = "epoch", # 'steps'
    load_best_model_at_end=True,
    # save_steps = 500,


    # Dataset
    num_train_epochs=5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    
    # Optimizer
    learning_rate=2e-5, # 5e-5
    weight_decay=0.01,  # 0
    # warmup_steps=200,

    # Regularization
    # max_grad_norm = 1.0,
    # label_smoothing_factor=0.1,


    # Evaluation 
    metric_for_best_model='eval_f1',
    evaluation_strategy = "epoch",

    # Randomness
    seed=33,
)

In [11]:
trainer = Trainer(
    teacher_model,
    args,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [13]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: title, guid, url, date. If title, guid, url, date are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 5
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 1785
  Number of trainable parameters = 110623495
wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)


Epoch,Training Loss,Validation Loss


wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)


KeyboardInterrupt: 

In [81]:
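# load the fine-tuned teacher from the checkpoint saved at step 1071 (end of epoch 3 at 357 steps per epoch)
teacher_model = AutoModelForSequenceClassification.from_pretrained('./models/checkpoint-1071')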

loading configuration file ./models/checkpoint-1071/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "IT\uacfc\ud559",
    "1": "\uacbd\uc81c",
    "2": "\uc0ac\ud68c",
    "3": "\uc0dd\ud65c\ubb38\ud654",
    "4": "\uc138\uacc4",
    "5": "\uc2a4\ud3ec\uce20",
    "6": "\uc815\uce58"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "IT\uacfc\ud559": "0",
    "\uacbd\uc81c": "1",
    "\uc0ac\ud68c": "2",
    "\uc0dd\ud65c\ubb38\ud654": "3",
    "\uc138\uacc4": "4",
    "\uc2a4\ud3ec\uce20": "5",
    "\uc815\uce58": "6"
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "n

## Defining the Training Arguments

In [82]:
from transformers import DistilBertConfig
from transformers import AutoConfig, AutoModel

In [95]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# id2label, label2id dicts for the outputs for the model
labels = sst2_enc["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# my_config = DistilBertConfig(activation="gelu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
#                              hidden_dim=768, label2id=label2id, id2label=id2label)
# my_config.save_pretrained(save_directory='./models/distilkoroberta')    

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# # teacher model
# teacher_model = AutoModelForSequenceClassification.from_pretrained(
#     teacher,
#     num_labels=num_labels,
#     id2label=id2label,
#     label2id=label2id,
# )

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)
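# match the teacher tokenizer's vocabulary size (32000) so the teacher's embeddings can be copied in
student_model.resize_token_embeddings(32000)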

PyTorch: setting up devices
loading configuration file config.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--distilroberta-base/snapshots/d5411c3ee9e1793fd9ef58390b40a80a4c10df32/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "IT\uacfc\ud559",
    "1": "\uacbd\uc81c",
    "2": "\uc0ac\ud68c",
    "3": "\uc0dd\ud65c\ubb38\ud654",
    "4": "\uc138\uacc4",
    "5": "\uc2a4\ud3ec\uce20",
    "6": "\uc815\uce58"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "IT\uacfc\ud559": "0",
    "\uacbd\uc81c": "1",
    "\uc0ac\ud68c": "2",
    "\uc0dd\ud65c\ubb38\ud654": "3",
    "\uc138\uacc4": "4",
    "\uc2a4\ud3ec\uce20": "5",
    "\uc815\uce58":

Embedding(32000, 768)

## Student Initialization

In [205]:
# flat list of the student's parameter tensors, in registration order
student_weights = list(student_model.parameters())

In [206]:
# same for the fine-tuned teacher
teacher_weights = list(teacher_model.parameters())

In [207]:
# Embedding tables (word, position, token type) and the final classifier projection (bias, weight)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

tensor([[ 0.0064, -0.0096, -0.0269,  ...,  0.0058, -0.0119,  0.0324],
        [-0.0134, -0.0048, -0.0003,  ...,  0.0148, -0.0441, -0.0296],
        [ 0.0297, -0.0207, -0.0306,  ...,  0.0084, -0.0024,  0.0020],
        ...,
        [ 0.0204, -0.0178, -0.0371,  ...,  0.0082, -0.0252, -0.0052],
        [ 0.0049, -0.0012, -0.0135,  ...,  0.0450, -0.0214,  0.0027],
        [-0.0240,  0.0046, -0.0129,  ...,  0.0347,  0.0082,  0.0038]])

In [208]:
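# Initialize the student from one teacher layer out of two:
# teacher encoder layers 1, 3, 5, 7, 9, 11 go to student layers 0-5 (16 parameter tensors per layer).
# NOTE: the embeddings appear to contribute 5 tensors (three tables plus LayerNorm weight and bias),
# so base = 3 starts each 16-tensor window two tensors before the layer boundary; base = 5 would
# align the windows with whole layers. The runs in this notebook used base = 3.
base = 3
for i in range(12):
    if i % 2 == 1:
        std_idx = i // 2
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)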
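
The index arithmetic in this loop is easier to check by name than by position. The sketch below rebuilds the flat parameter order as a list of strings and prints which tensors the first copy touches for two values of `base`. The name lists are an assumption about how the Hugging Face RoBERTa encoder registers its parameters (5 embedding tensors, then 16 tensors per layer); verify them against `model.named_parameters()` before relying on them. `parameter_names` and `copy_plan` are hypothetical helpers.

```python
# Assumed registration order of the RoBERTa encoder parameters (check with named_parameters()).
EMBEDDING_PARAMS = [
    "embeddings.word_embeddings.weight",
    "embeddings.position_embeddings.weight",
    "embeddings.token_type_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
]
LAYER_PARAMS = [
    "attention.self.query.weight", "attention.self.query.bias",
    "attention.self.key.weight", "attention.self.key.bias",
    "attention.self.value.weight", "attention.self.value.bias",
    "attention.output.dense.weight", "attention.output.dense.bias",
    "attention.output.LayerNorm.weight", "attention.output.LayerNorm.bias",
    "intermediate.dense.weight", "intermediate.dense.bias",
    "output.dense.weight", "output.dense.bias",
    "output.LayerNorm.weight", "output.LayerNorm.bias",
]

def parameter_names(num_layers):
    """Flat list of encoder parameter names for a model with num_layers layers."""
    names = list(EMBEDDING_PARAMS)
    for layer in range(num_layers):
        names += [f"layer.{layer}.{name}" for name in LAYER_PARAMS]
    return names

def copy_plan(base, teacher_layers=12):
    """(student name, teacher name) pairs touched by the index-based copy loop."""
    student = parameter_names(teacher_layers // 2)
    teacher = parameter_names(teacher_layers)
    plan = []
    for i in range(teacher_layers):
        if i % 2 == 1:
            std_idx = i // 2
            for j in range(16):
                plan.append((student[base + std_idx * 16 + j], teacher[base + i * 16 + j]))
    return plan

for base in (3, 5):
    first_student, first_teacher = copy_plan(base)[0]
    print(base, first_student, "<-", first_teacher)
```

Under these assumptions, `base = 5` pairs every student tensor with the same-named tensor of an odd teacher layer, while `base = 3` shifts each window by two tensors. Copying by name through `state_dict()` would avoid the positional bookkeeping altogether.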

In [209]:
def get_n_params(model):
    # count all parameters except the last two tensors (the classifier's output projection weight and bias)
    return sum(p.numel() for p in list(model.parameters())[:-2])

In [210]:
get_n_params(student_model)

68090880

In [211]:
get_n_params(teacher_model)

110618112

## Training

In [94]:
# re-create the training arguments (same settings as above) before building the trainer

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [91]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [92]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, title. If url, guid, date, title are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 2499


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.5135 | 0.449238 | 0.831668 | 0.835222 |
| 2 | 0.2625 | 0.417876 | 0.841441 | 0.843777 |
| 3 | 0.1961 | 0.433857 | 0.840123 | 0.841995 |
| 4 | 0.1558 | 0.455981 | 0.835182 | 0.837008 |
| 5 | 0.1301 | 0.442305 | 0.839684 | 0.842268 |
| 6 | 0.1118 | 0.439861 | 0.842209 | 0.842439 |
| 7 | 0.0995 | 0.438866 | 0.845174 | 0.84545 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, title. If url, guid, date, title are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-357/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/checkpoint-714] due to args.save_total_limit
The following columns in the evaluation set don't have a correspondi

TrainOutput(global_step=2499, training_loss=0.2099120425147598, metrics={'train_runtime': 268.1034, 'train_samples_per_second': 1192.622, 'train_steps_per_second': 9.321, 'total_flos': 1941721913394072.0, 'train_loss': 0.2099120425147598, 'epoch': 7.0})

## Linearly decaying the hard-label loss weight

In [198]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        # place the teacher on the same device as the student and switch off dropout
        self._move_model_to_device(self.teacher, self.model.device)
        self.teacher.eval()
        self.step = 0

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # num_items_in_batch is accepted only for compatibility with newer transformers releases; unused here
        # counts every compute_loss call (the Trainer also calls it during evaluation when labels are present)
        self.step += 1

        # student forward pass; .loss is the hard-label cross-entropy
        outputs_student = model(**inputs)
        student_loss = outputs_student.loss

        # teacher forward pass, without gradients
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # both models must predict over the same label set
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # soft-target loss: KL divergence between the temperature-softened distributions, scaled by T^2
        loss_function = nn.KLDivLoss(reduction="batchmean")
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))

        # the hard-label weight decays linearly from 1 towards 0 over the 2499 optimization
        # steps of this run (357 steps per epoch x 7 epochs); the soft-target weight grows accordingly
        hard_weight = (2499 - self.step) / 2499
        loss = hard_weight * student_loss + (1. - hard_weight) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
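
The constant 2499 and the shape of the schedule can be checked in isolation. The following is a small sketch, not the trainer's code: `steps_per_run` and `hard_loss_weight` are hypothetical helpers, and the clamp to [0, 1] is an addition that the cell above does not have.

```python
import math

def steps_per_run(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def hard_loss_weight(step, total_steps):
    """Weight of the hard-label loss: linear decay from 1 to 0, clamped to [0, 1]."""
    return min(1.0, max(0.0, (total_steps - step) / total_steps))

# 45678 training examples, batch size 128, 7 epochs (values from the training log).
total_steps = steps_per_run(45678, 128, 7)
print(total_steps)  # 2499

for step in (0, total_steps // 2, total_steps, total_steps + 500):
    w = hard_loss_weight(step, total_steps)
    print(step, round(w, 3), round(1.0 - w, 3))
```

The clamp matters whenever the counter can run past `total_steps`. In the trainer above, `self.step` advances on every `compute_loss` call, including those made during evaluation, so without a clamp the hard-label weight can become negative late in the run. Reading the step from `self.state.global_step` and the total from `self.state.max_steps` would be one way to avoid the hard-coded constant.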

In [199]:
# reload the fine-tuned teacher from the checkpoint saved at step 1071 (end of epoch 3)
teacher_model = AutoModelForSequenceClassification.from_pretrained('./models/checkpoint-1071')

loading configuration file ./models/checkpoint-1071/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "IT\uacfc\ud559",
    "1": "\uacbd\uc81c",
    "2": "\uc0ac\ud68c",
    "3": "\uc0dd\ud65c\ubb38\ud654",
    "4": "\uc138\uacc4",
    "5": "\uc2a4\ud3ec\uce20",
    "6": "\uc815\uce58"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "IT\uacfc\ud559": "0",
    "\uacbd\uc81c": "1",
    "\uc0ac\ud68c": "2",
    "\uc0dd\ud65c\ubb38\ud654": "3",
    "\uc138\uacc4": "4",
    "\uc2a4\ud3ec\uce20": "5",
    "\uc815\uce58": "6"
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "n

In [204]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# id2label, label2id dicts for the outputs for the model
labels = sst2_enc["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# my_config = DistilBertConfig(activation="gelu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
#                              hidden_dim=768, label2id=label2id, id2label=id2label)
# my_config.save_pretrained(save_directory='./models/distilkoroberta')    

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# # teacher model
# teacher_model = AutoModelForSequenceClassification.from_pretrained(
#     teacher,
#     num_labels=num_labels,
#     id2label=id2label,
#     label2id=label2id,
# )

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)
student_model.resize_token_embeddings(32000)

PyTorch: setting up devices
loading configuration file config.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--distilroberta-base/snapshots/d5411c3ee9e1793fd9ef58390b40a80a4c10df32/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "IT\uacfc\ud559",
    "1": "\uacbd\uc81c",
    "2": "\uc0ac\ud68c",
    "3": "\uc0dd\ud65c\ubb38\ud654",
    "4": "\uc138\uacc4",
    "5": "\uc2a4\ud3ec\uce20",
    "6": "\uc815\uce58"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "IT\uacfc\ud559": "0",
    "\uacbd\uc81c": "1",
    "\uc0ac\ud68c": "2",
    "\uc0dd\ud65c\ubb38\ud654": "3",
    "\uc138\uacc4": "4",
    "\uc2a4\ud3ec\uce20": "5",
    "\uc815\uce58":

Embedding(32000, 768)

In [201]:
### Do the weight init here!!! Run the code above (the "Student Initialization" cells)

In [212]:
# re-create the training arguments (same settings as above) before building the trainer

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [213]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [214]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, title. If url, guid, date, title are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 2499


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.4704 | 0.455302 | 0.835072 | 0.837995 |
| 2 | 0.2585 | 0.412495 | 0.839903 | 0.841657 |
| 3 | 0.176 | 0.347741 | 0.846602 | 0.847644 |
| 4 | 0.1296 | 0.293335 | 0.8477 | 0.849718 |
| 5 | 0.0961 | 0.214862 | 0.856923 | 0.857054 |
| 6 | 0.066 | 0.161548 | 0.858021 | 0.857885 |
| 7 | 0.0357 | 0.111098 | 0.856813 | 0.857078 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, title. If url, guid, date, title are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-357/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/checkpoint-2142] due to args.save_total_limit
The following columns in the evaluation set don't have a correspond

TrainOutput(global_step=2499, training_loss=0.17604998580547943, metrics={'train_runtime': 278.4762, 'train_samples_per_second': 1148.198, 'train_steps_per_second': 8.974, 'total_flos': 1941721913394072.0, 'train_loss': 0.17604998580547943, 'epoch': 7.0})

## Further shrinking the model

In [215]:
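# the distilled 6-layer student becomes the new teacher; checkpoint-2142 is the end of epoch 6,
# the epoch with the highest validation accuracy in the run above
new_teacher = AutoModelForSequenceClassification.from_pretrained('distilroberta-base-sst2-distilled/checkpoint-2142')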

loading configuration file distilroberta-base-sst2-distilled/checkpoint-2142/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "IT\uacfc\ud559",
    "1": "\uacbd\uc81c",
    "2": "\uc0ac\ud68c",
    "3": "\uc0dd\ud65c\ubb38\ud654",
    "4": "\uc138\uacc4",
    "5": "\uc2a4\ud3ec\uce20",
    "6": "\uc815\uce58"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "IT\uacfc\ud559": "0",
    "\uacbd\uc81c": "1",
    "\uc0ac\ud68c": "2",
    "\uc0dd\ud65c\ubb38\ud654": "3",
    "\uc138\uacc4": "4",
    "\uc2a4\ud3ec\uce20": "5",
    "\uc815\uce58": "6"
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attent

In [216]:
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

loading configuration file config.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--distilroberta-base/snapshots/d5411c3ee9e1793fd9ef58390b40a80a4c10df32/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "IT\uacfc\ud559",
    "1": "\uacbd\uc81c",
    "2": "\uc0ac\ud68c",
    "3": "\uc0dd\ud65c\ubb38\ud654",
    "4": "\uc138\uacc4",
    "5": "\uc2a4\ud3ec\uce20",
    "6": "\uc815\uce58"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "IT\uacfc\ud559": "0",
    "\uacbd\uc81c": "1",
    "\uc0ac\ud68c": "2",
    "\uc0dd\ud65c\ubb38\ud654": "3",
    "\uc138\uacc4": "4",
    "\uc2a4\ud3ec\uce20": "5",
    "\uc815\uce58": "6"
  },
  "layer_norm_eps"

In [217]:
new_config = student_model.config

In [218]:
new_config.num_hidden_layers = 3  # halve the 6-layer student again
new_config.num_labels = num_labels

In [219]:
student_model = AutoModelForSequenceClassification.from_config(new_config)
student_model.resize_token_embeddings(32000)

Embedding(32000, 768)

In [220]:
# flat parameter lists; the previous 6-layer student now acts as the teacher
student_weights = list(student_model.parameters())
teacher_weights = list(new_teacher.parameters())

In [221]:
# Embedding tables (word, position, token type) and the final classifier projection (bias, weight)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

tensor([[ 0.0043, -0.0110, -0.0264,  ...,  0.0062, -0.0133,  0.0303],
        [-0.0104, -0.0055,  0.0002,  ...,  0.0144, -0.0434, -0.0253],
        [ 0.0284, -0.0199, -0.0313,  ...,  0.0100, -0.0019,  0.0012],
        ...,
        [ 0.0211, -0.0181, -0.0386,  ...,  0.0067, -0.0246, -0.0070],
        [ 0.0043,  0.0008, -0.0120,  ...,  0.0450, -0.0220,  0.0026],
        [-0.0228,  0.0047, -0.0144,  ...,  0.0356,  0.0095,  0.0033]])

In [222]:
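# Same scheme as before: layers 1, 3, 5 of the 6-layer teacher go to student layers 0-2
# (16 parameter tensors per layer, same base offset as in the first stage)
base = 3
for i in range(6):
    if i % 2 == 1:
        std_idx = i // 2
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)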

In [224]:
# training arguments for the second-stage run (same settings, new output directory)

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled2",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled2/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [225]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=new_teacher, # changed for comparison
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [226]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, title. If url, guid, date, title are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 2499
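
The step count in this log can be reproduced by hand, which also explains the constant 2499 used in `compute_loss`. The following is a small sketch assuming a single device and no dropped last batch; the function name `total_optimization_steps` is hypothetical and only approximates how the `Trainer` derives the figure.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum=1):
    """Estimate the number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * num_epochs


# Values taken from the training log above
print(total_optimization_steps(45678, 128, 7))
```

With 45678 examples and a batch size of 128 there are 357 updates per epoch, which is also why the first checkpoint is saved as `checkpoint-357`.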


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.3482 | 0.471432 | 0.832986 | 0.832685 |
| 2 | 0.1941 | 0.48873 | 0.81377 | 0.815619 |
| 3 | 0.1393 | 0.38022 | 0.825958 | 0.827142 |
| 4 | 0.1076 | 0.286902 | 0.835511 | 0.833949 |
| 5 | 0.0829 | 0.203673 | 0.838696 | 0.836164 |
| 6 | 0.0574 | 0.126996 | 0.83628 | 0.834681 |
| 7 | 0.0316 | 0.057677 | 0.838037 | 0.836389 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, title. If url, guid, date, title are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled2/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled2/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled2/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled2/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled2/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: url, guid, date, tit

TrainOutput(global_step=2499, training_loss=0.13729658962584057, metrics={'train_runtime': 187.6145, 'train_samples_per_second': 1704.271, 'train_steps_per_second': 13.32, 'total_flos': 984312633408408.0, 'train_loss': 0.13729658962584057, 'epoch': 7.0})

In [94]:
torch.save(student_model.state_dict(), './models/distilkoroberta_first_7epochs.pt')

In [34]:
from copy import deepcopy

# Fine-Tuning on Downstream Tasks

## NLI

In [125]:
baseline = deepcopy(student_model)

In [38]:
datasets = load_dataset("klue", 'nli')

Downloading and preparing dataset klue/nli (download: 1.20 MiB, generated: 6.10 MiB, post-processed: Unknown size, total: 7.30 MiB) to /home/seungjoonpark/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...



Dataset klue downloaded and prepared to /home/seungjoonpark/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.




In [39]:
metric = load_metric("glue", "qnli")

Downloading builder script: 5.76kB [00:00, 2.14MB/s]                   


In [45]:
tokenizer

PreTrainedTokenizerFast(name_or_path='klue/roberta-large', vocab_size=32000, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [48]:
sentence1_key, sentence2_key = ("premise", "hypothesis")
print(f"Sentence 1: {datasets['train'][0][sentence1_key]}")
print(f"Sentence 2: {datasets['train'][0][sentence2_key]}")

Sentence 1: 힛걸 진심 최고다 그 어떤 히어로보다 멋지다
Sentence 2: 힛걸 진심 최고로 멋지다.


In [49]:
def preprocess_function(examples):
    return tokenizer(
        examples[sentence1_key],
        examples[sentence2_key],
        truncation=True,
        return_token_type_ids=False,
    )

encoded_datasets = datasets.map(preprocess_function, batched=True)



In [117]:
my_config

DistilBertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.23.1",
  "vocab_size": 32000
}

In [126]:
num_labels = 3
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=num_labels)
model = AutoModelForSequenceClassification.from_config(my_config)
model_dict = model.state_dict()
pretrained_dict = torch.load("/home/seungjoonpark/DistilKoBERT/models/distilkoroberta.pt")
# Drop the last two checkpoint entries (the old classifier weight and bias),
# because the NLI head has a different number of labels
del pretrained_dict[next(reversed(pretrained_dict))]
del pretrained_dict[next(reversed(pretrained_dict))]
# Keep only the weights whose names exist in the new model
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
# strict=False leaves the freshly initialized classifier untouched
model.load_state_dict(pretrained_dict, strict=False)

_IncompatibleKeys(missing_keys=['classifier.weight', 'classifier.bias'], unexpected_keys=[])
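
The cell above drops the old classification head by position (the last two entries of the checkpoint). A name-based filter is a possible alternative; the sketch below uses plain dictionaries as stand-ins for state dicts, and `select_transferable` is a hypothetical helper, not something the notebook or PyTorch provides.

```python
def select_transferable(pretrained, target_keys, skip_prefixes=("classifier.",)):
    """Return (weights to copy, target keys left at their fresh initialization)."""
    kept = {
        name: value
        for name, value in pretrained.items()
        if name in target_keys and not name.startswith(skip_prefixes)
    }
    missing = [name for name in target_keys if name not in kept]
    return kept, missing


# Toy stand-ins: real state dicts map parameter names to tensors
pretrained = {
    "embeddings.weight": 1,
    "layer.0.weight": 2,
    "classifier.weight": 3,
    "classifier.bias": 4,
}
target_keys = ["embeddings.weight", "layer.0.weight", "classifier.weight", "classifier.bias"]

kept, missing = select_transferable(pretrained, target_keys)
print(sorted(kept), missing)
```

Filtering by name does not depend on the order in which the checkpoint stores its entries, and the returned `missing` list plays the same role as `missing_keys` in the output above.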

In [127]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
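
For reference, the accuracy that `compute_metrics` reports amounts to an argmax over the logits followed by a comparison with the labels. This is a dependency-free sketch of that idea; the `argmax` and `accuracy` helpers and the sample values are made up for illustration and do not replace the `metric` object used above.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for three examples and three NLI classes
sample_logits = [[0.1, 0.9, 0.0], [2.0, 1.0, 0.0], [0.0, 0.1, 0.3]]
sample_labels = [1, 0, 1]
print(accuracy(sample_logits, sample_labels))
```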

In [128]:
batch_size = 256

In [133]:
metric_name = "accuracy"

args = TrainingArguments(
    "test-nli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [134]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [135]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 24998
  Num Epochs = 5
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 490


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.09759 | 0.380333 |
| 2 | No log | 1.113917 | 0.390667 |
| 3 | No log | 1.146158 | 0.391 |
| 4 | No log | 1.126838 | 0.397333 |
| 5 | No log | 1.137593 | 0.393 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256
Saving model checkpoint to test-nli/checkpoint-98
Configuration saved in test-nli/checkpoint-98/config.json
Model weights saved in test-nli/checkpoint-98/pytorch_model.bin
tokenizer config file saved in test-nli/checkpoint-98/tokenizer_config.json
Special tokens file saved in test-nli/checkpoint-98/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassifica

TrainOutput(global_step=490, training_loss=0.9852519132653061, metrics={'train_runtime': 288.5789, 'train_samples_per_second': 433.122, 'train_steps_per_second': 1.698, 'total_flos': 2619227819706804.0, 'train_loss': 0.9852519132653061, 'epoch': 5.0})

In [132]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256


{'eval_loss': 1.1056615114212036,
 'eval_accuracy': 0.37133333333333335,
 'eval_runtime': 2.453,
 'eval_samples_per_second': 1222.99,
 'eval_steps_per_second': 4.892,
 'epoch': 5.0}

## Installing Optuna for Hyperparameter Tuning

## Defining the Hyperparameter Space to be optimized over

In [137]:
def hp_space(trial):
    return {
      "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
      "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3 ,log=True),
      "alpha": trial.suggest_float("alpha", 0, 1),
      "temperature": trial.suggest_int("temperature", 2, 30),
      }
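
To illustrate what this search space describes, the sketch below draws one configuration with Python's `random` module. It is plain random sampling written for illustration, not Optuna's sampler, and `sample_log_uniform` and `sample_trial` are hypothetical names. The point of `log=True` is that the learning rate is drawn uniformly in log space, so each decade between 1e-5 and 1e-3 is equally likely.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_trial(rng):
    """Draw one configuration from the same ranges as hp_space."""
    return {
        "num_train_epochs": rng.randint(2, 10),  # inclusive on both ends
        "learning_rate": sample_log_uniform(rng, 1e-5, 1e-3),
        "alpha": rng.uniform(0.0, 1.0),
        "temperature": rng.randint(2, 30),
    }


rng = random.Random(0)  # fixed seed so the draw is repeatable
params = sample_trial(rng)
print(params)
```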

## Running the Hyperparameter Search

In [138]:
# NOTE: id2label has 7 entries, so num_labels=6 is overridden to 7 (see the warning in the output)
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
                            label2id=label2id, id2label=id2label)


def student_init():
    return AutoModelForSequenceClassification.from_config(
    my_config)

trainer = DistillationTrainer(
    model_init=student_init,
    args=training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    n_trials=2,
    direction="maximize",
    hp_space=hp_space
)

print(best_run)

You passed along `num_labels=6` with an incompatible id to label map: {'0': 'IT과학', '1': '경제', '2': '사회', '3': '생활문화', '4': '세계', '5': '스포츠', '6': '정치'}. The number of labels wil be overwritten to 7.
Using cuda_amp half precision backend
[I 2022-10-27 23:39:03,618] A new study created in memory with name: no-name-72755f35-ffe2-455f-abeb-7ef48083cfc8
Trial: {'num_train_epochs': 4, 'learning_rate': 0.0003356216196363318, 'alpha': 0.0038843531441111745, 'temperature': 28}
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulatio

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0099 | 0.007726 | 0.148018 |
| 2 | 0.0079 | 0.007775 | 0.148018 |
| 3 | 0.0078 | 0.007747 | 0.148018 |
| 4 | 0.0078 | 0.007662 | 0.148018 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-0/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceC

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3879 | 0.372254 | 0.706929 |
| 2 | 0.3427 | 0.365087 | 0.71758 |
| 3 | 0.333 | 0.354984 | 0.77962 |
| 4 | 0.3272 | 0.358589 | 0.764577 |
| 5 | 0.3232 | 0.36462 | 0.752718 |
| 6 | 0.3206 | 0.358994 | 0.775338 |
| 7 | 0.3183 | 0.357679 | 0.784122 |
| 8 | 0.3168 | 0.363774 | 0.765565 |
| 9 | 0.3157 | 0.36286 | 0.767541 |
| 10 | 0.3147 | 0.361569 | 0.772812 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceC

***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-3213
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/run-1/checkpoint-2856] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore th

BestRun(run_id='1', objective=0.7728121225430987, hyperparameters={'num_train_epochs': 10, 'learning_rate': 4.354784416636035e-05, 'alpha': 0.22918617625637505, 'temperature': 9})
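
The reported objective (0.7728...) agrees with the accuracy of the last evaluation of run 1 in the table above, not with that run's best epoch (0.784122 at epoch 7). Assuming that each run is scored by its final evaluation, picking the winner reduces to the following sketch; `pick_best_run` is a hypothetical helper and the dictionary simply restates the last-epoch accuracies of the two trials.

```python
def pick_best_run(objectives, direction="maximize"):
    """Return the run id with the best objective value."""
    chooser = max if direction == "maximize" else min
    return chooser(objectives, key=objectives.get)


# Final-epoch accuracy of each trial, copied from the tables above
final_accuracy = {"0": 0.148018, "1": 0.772812}
best = pick_best_run(final_accuracy)
print(best)
```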


## Updating the training arguments

In [139]:
# overwriting the previous hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(training_args, k, v)

# new output directory
best_model_ckpt = "distilroberta-best"
training_args.output_dir = best_model_ckpt
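
Because `setattr` silently creates an attribute when the name is misspelled, a guarded variant of the loop above may be worth considering. In this sketch a `SimpleNamespace` stands in for the training arguments, its starting values are illustrative, and `apply_overrides` is a hypothetical helper; the hyperparameters are the ones reported by `best_run`.

```python
from types import SimpleNamespace


def apply_overrides(target, overrides):
    """Overwrite existing attributes only; fail loudly on unknown names."""
    for name, value in overrides.items():
        if not hasattr(target, name):
            raise AttributeError(f"unknown argument: {name}")
        setattr(target, name, value)
    return target


# Stand-in for the training arguments (illustrative starting values)
args = SimpleNamespace(num_train_epochs=7, learning_rate=6e-5, alpha=0.5, temperature=4.0)

best_hyperparameters = {
    "num_train_epochs": 10,
    "learning_rate": 4.354784416636035e-05,
    "alpha": 0.22918617625637505,
    "temperature": 9,
}
apply_overrides(args, best_hyperparameters)
print(args)
```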

## Final Training

In [140]:
# New Trainer with the updated parameters
optimal_trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

optimal_trainer.train()

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 3570


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3303 | 0.359125 | 0.763918 |
| 2 | 0.3238 | 0.362589 | 0.753596 |
| 3 | 0.3236 | 0.36197 | 0.765345 |
| 4 | 0.3202 | 0.366088 | 0.753926 |
| 5 | 0.3181 | 0.361465 | 0.772263 |
| 6 | 0.3164 | 0.365798 | 0.762271 |
| 7 | 0.315 | 0.367955 | 0.753267 |
| 8 | 0.3141 | 0.363063 | 0.773032 |
| 9 | 0.3132 | 0.364926 | 0.767761 |
| 10 | 0.3128 | 0.364964 | 0.767212 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-best/checkpoint-357
Configuration saved in distilroberta-best/checkpoint-357/config.json
Model weights saved in distilroberta-best/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, tok

Configuration saved in distilroberta-best/checkpoint-3570/config.json
Model weights saved in distilroberta-best/checkpoint-3570/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-3570/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-3570/special_tokens_map.json
Deleting older checkpoint [distilroberta-best/checkpoint-3213] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilroberta-best/checkpoint-2856 (score: 0.7730317338311189).
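
The checkpoint chosen here can be traced back to the accuracy table of this run: with 357 optimization steps per epoch, `checkpoint-2856` corresponds to epoch 8, which has the highest validation accuracy. A short sketch of that lookup follows; `best_checkpoint` is a hypothetical helper and the list restates the Accuracy column above.

```python
def best_checkpoint(metric_by_epoch, steps_per_epoch):
    """Return (epoch, global step) of the epoch with the highest metric."""
    best_index = max(range(len(metric_by_epoch)), key=metric_by_epoch.__getitem__)
    epoch = best_index + 1  # epochs are 1-indexed in the log
    return epoch, epoch * steps_per_epoch


# Validation accuracy per epoch from the final training run
accuracy_by_epoch = [
    0.763918, 0.753596, 0.765345, 0.753926, 0.772263,
    0.762271, 0.753267, 0.773032, 0.767761, 0.767212,
]
print(best_checkpoint(accuracy_by_epoch, steps_per_epoch=357))
```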


TrainOutput(global_step=3570, training_loss=0.3187510471717984, metrics={'train_runtime': 482.5227, 'train_samples_per_second': 946.65, 'train_steps_per_second': 7.399, 'total_flos': 2774037043248840.0, 'train_loss': 0.3187510471717984, 'epoch': 10.0})