<a href="https://colab.research.google.com/github/azizbarank/distilroberta-base-sst-2-distilled/blob/main/knowledge_distillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing necessary packages

In [None]:
!pip install transformers datasets tensorboard
!sudo apt-get install git-lfs

## Choosing our "teacher" and "student" models

In [139]:
student = "distilroberta-base"  # placeholder student checkpoint
teacher = "klue/roberta-base"

## Loading the STS part of the KLUE dataset

In [140]:
from datasets import load_dataset

dataset = load_dataset("klue", "sts")

Found cached dataset klue (/home/seungjoonpark/.cache/huggingface/datasets/klue/sts/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)
100%|██████████| 2/2 [00:00<00:00, 157.04it/s]


## Tokenization

### Initializing the tokenizer of our teacher model

In [141]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(teacher)

loading file vocab.txt from cache at /home/seungjoonpark/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/vocab.txt
loading file tokenizer.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/special_tokens_map.json
loading file tokenizer_config.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/tokenizer_config.json


In [6]:
from IPython.core.debugger import set_trace

In [7]:
def process(examples):
    # Tokenize each sentence pair; padding is left to the data collator.
    tokenized_inputs = tokenizer(
        examples["sentence1"], examples["sentence2"], truncation=True, max_length=512
    )
    # Each KLUE STS label is a dict; keep only its 0/1 "binary-label" entry.
    tokenized_inputs["labels"] = [label["binary-label"] for label in examples["labels"]]
    return tokenized_inputs

# Despite its name, sst2_enc holds the encoded KLUE STS splits.
sst2_enc = dataset.map(process, batched=True, remove_columns=dataset["train"].column_names)

sst2_enc["validation"].features

 92%|█████████▏| 11/12 [00:00<00:00, 25.58ba/s]
Loading cached processed dataset at /home/seungjoonpark/.cache/huggingface/datasets/klue/sts/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-546f76030e6fb252.arrow


{'labels': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
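
The label handling in `process` can be looked at in isolation on a toy batch. This is a minimal sketch, assuming each label is a dict with a `binary-label` key, as the cell above implies; the sentences, the `label` scores, and the helper name `extract_binary_labels` are made up for illustration.

```python
def extract_binary_labels(batch):
    """Pull the 0/1 `binary-label` out of each nested label dict."""
    return [label["binary-label"] for label in batch["labels"]]


# Hypothetical batch, shaped like the input of a batched `datasets.map` call.
toy_batch = {
    "sentence1": ["a", "b"],
    "sentence2": ["c", "d"],
    "labels": [
        {"label": 3.7, "binary-label": 1},
        {"label": 1.2, "binary-label": 0},
    ],
}

binary_labels = extract_binary_labels(toy_batch)
print(binary_labels)
```

The result is a flat list of integers, which is what the `labels` feature of type `int64` in the output above reflects.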

## Creating our Knowledge Distillation Trainer

In [165]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)

        self.alpha = alpha
        self.temperature = temperature

In [166]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False):

        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # student and teacher must produce logits of the same shape
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # distillation loss: KL divergence between the temperature-softened
        # teacher and student distributions, scaled by temperature ** 2
        loss_function = nn.KLDivLoss(reduction="batchmean")

        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
        # return weighted student loss
        loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
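
To make the loss in `compute_loss` concrete, here is a minimal pure-Python sketch for a single example. It is an illustration under stated assumptions, not the trainer's code: the logits and the hard-label loss value `0.7` are made up, and the helper names are hypothetical. It follows the same formula as above: KL divergence from the softened teacher distribution to the softened student distribution, multiplied by `temperature ** 2`, then mixed with the student's own loss through `alpha`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, times temperature ** 2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_teacher, p_student))
    return kl * temperature ** 2


def combined_loss(student_ce, kd, alpha):
    """alpha weights the hard-label loss, (1 - alpha) the distillation loss."""
    return alpha * student_ce + (1.0 - alpha) * kd


# Made-up logits for one two-class example.
teacher_logits = [2.0, -1.0]
student_logits = [0.5, 0.3]

kd = distillation_loss(student_logits, teacher_logits, temperature=4.0)
same = distillation_loss(teacher_logits, teacher_logits, temperature=4.0)
total = combined_loss(student_ce=0.7, kd=kd, alpha=0.5)
print(kd, same, total)
```

When the student reproduces the teacher's logits exactly, the distillation term is zero and only the hard-label loss remains.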

## Defining the Metric

In [10]:
from datasets import load_metric
import numpy as np

accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
    return {
        "accuracy": acc["accuracy"],
        "f1": f1["f1"]
    }
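
For readers who want to see what the two metrics measure, the following is a small pure-Python sketch of accuracy and macro-averaged F1 on made-up logits. It is not the `datasets` implementation; the function names and the toy data are assumptions for illustration only.

```python
def accuracy(preds, refs):
    """Fraction of predictions that equal the reference label."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def macro_f1(preds, refs):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(refs) | set(preds)):
        tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
        fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
        fn = sum(p != cls and r == cls for p, r in zip(preds, refs))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up logits for four examples and their reference labels.
logits = [[0.2, 0.8], [1.5, -0.3], [0.1, 0.4], [2.0, 0.0]]
refs = [1, 0, 0, 0]

# argmax over the class axis, as in compute_metrics
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
print(preds, accuracy(preds, refs), macro_f1(preds, refs))
```

Macro averaging gives each class the same weight regardless of how many examples it has, which is why the two numbers can differ on imbalanced data.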


## Teacher Training

In [64]:
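from transformers import AutoModelForSequenceClassification

# binary classification on the STS "binary-label" target
num_labels = 2

# start the teacher from the pretrained checkpoint
teacher_model = AutoModelForSequenceClassification.from_pretrained(
    teacher,
    num_labels=num_labels
)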

loading configuration file config.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "BertTokenizer",
  "transformers_version": "4.23.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file pytorch_model.bin from cache at /home/seung

In [65]:
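from transformers import DataCollatorWithPadding

batch_size = 128

# pad every batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)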

args = TrainingArguments(
    # checkpoint
    output_dir='./models/',
    # overwrite_output_dir=True,

    # Model Save & Load
    save_strategy = "epoch", # 'steps'
    load_best_model_at_end=True,
    # save_steps = 500,


    # Dataset
    num_train_epochs=5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    
    # Optimizer
    learning_rate=2e-5, # 5e-5
    weight_decay=0.01,  # 0
    # warmup_steps=200,

    # Regularization
    # max_grad_norm = 1.0,
    # label_smoothing_factor=0.1,


    # Evaluation 
    metric_for_best_model='eval_f1',
    evaluation_strategy = "epoch",

    # Randomness
    seed=33,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [66]:
trainer = Trainer(
    teacher_model,
    args,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [67]:
trainer.train()

***** Running training *****
  Num examples = 11668
  Num Epochs = 5
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 460


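| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | No log | 0.367865 | 0.845857 | 0.843245 |
| 2 | No log | 0.580463 | 0.83237 | 0.832367 |
| 3 | No log | 0.474899 | 0.840077 | 0.837448 |
| 4 | No log | 0.70209 | 0.847784 | 0.847401 |
| 5 | No log | 0.727276 | 0.845857 | 0.84507 |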


***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to ./models/checkpoint-92
Configuration saved in ./models/checkpoint-92/config.json
Model weights saved in ./models/checkpoint-92/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-92/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-92/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to ./models/checkpoint-184
Configuration saved in ./models/checkpoint-184/config.json
Model weights saved in ./models/checkpoint-184/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-184/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-184/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to ./models/checkpoint-276
Configuration saved in ./models/checkpoint-276/config.json
Model weights saved in .

TrainOutput(global_step=460, training_loss=0.08918315224025561, metrics={'train_runtime': 166.2645, 'train_samples_per_second': 350.887, 'train_steps_per_second': 2.767, 'total_flos': 2967119815735680.0, 'train_loss': 0.08918315224025561, 'epoch': 5.0})
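
The step counts and checkpoint names in the log above can be reproduced by hand. This is a minimal sketch, assuming a single device, no gradient accumulation, and a final partial batch that is kept; the F1 values are copied from the results table of this run, and the variable names are illustrative.

```python
import math

num_examples, batch_size, num_epochs = 11668, 128, 5

# one optimizer step per batch; the last, smaller batch still counts
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# with save_strategy="epoch", one checkpoint is written at the end of each epoch
checkpoints = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

# validation F1 per epoch, copied from the results table
f1_per_epoch = [0.843245, 0.832367, 0.837448, 0.847401, 0.84507]
best_epoch = max(range(num_epochs), key=f1_per_epoch.__getitem__) + 1
best_checkpoint = f"checkpoint-{checkpoints[best_epoch - 1]}"

print(steps_per_epoch, total_steps, checkpoints, best_checkpoint)
```

Under these assumptions, the epoch with the highest validation F1 corresponds to the checkpoint directory that is reloaded as the teacher.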

In [68]:
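# reload the fine-tuned teacher from the epoch-4 checkpoint (step 368), which has the highest validation F1
teacher_model = AutoModelForSequenceClassification.from_pretrained('./models/checkpoint-368')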

loading configuration file ./models/checkpoint-368/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "tokenizer_class": "BertTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.23.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file ./models/checkpoint-368/pytorch_model.bin
All model checkpoint weights were

## Defining the Training Arguments

In [37]:
from transformers import DistilBertConfig
from transformers import AutoConfig, AutoModel

In [142]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# number of output classes (binary-label is 0 or 1)
num_labels = 2

# my_config = DistilBertConfig(activation="gelu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
#                              hidden_dim=768, label2id=label2id, id2label=id2label)
# my_config.save_pretrained(save_directory='./models/distilkoroberta')    

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# # teacher model
# teacher_model = AutoModelForSequenceClassification.from_pretrained(
#     teacher,
#     num_labels=num_labels)

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels)
student_model.resize_token_embeddings(32000)

PyTorch: setting up devices
loading configuration file config.json from cache at /home/seungjoonpark/.cache/huggingface/hub/models--distilroberta-base/snapshots/d5411c3ee9e1793fd9ef58390b40a80a4c10df32/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file pytorch_model.bin from cache at /home/seungjoonpark/.cache/huggingface/hub/models--disti

Embedding(32000, 768)

In [34]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [35]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [36]:
trainer.train()

***** Running training *****
  Num examples = 11668
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 644


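| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.6041 | 1.611867 | 0.566474 | 0.562734 |
| 2 | 0.2618 | 1.225722 | 0.620424 | 0.618156 |
| 3 | 0.1914 | 1.149848 | 0.649326 | 0.648636 |
| 4 | 0.1492 | 1.089765 | 0.647399 | 0.647315 |
| 5 | 0.1175 | 1.235284 | 0.630058 | 0.629749 |
| 6 | 0.0984 | 1.098776 | 0.668593 | 0.667407 |
| 7 | 0.0855 | 1.094123 | 0.66474 | 0.663539 |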


***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-92
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-92/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-92/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-92/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-92/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/checkpoint-460] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-184
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-184/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-184/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst

TrainOutput(global_step=644, training_loss=0.2154033028561136, metrics={'train_runtime': 138.9617, 'train_samples_per_second': 587.759, 'train_steps_per_second': 4.634, 'total_flos': 2094145023404592.0, 'train_loss': 0.2154033028561136, 'epoch': 7.0})

In [92]:
teacher_trainer = Trainer(teacher_model,
                          training_args,
                          train_dataset=sst2_enc["train"],
                          eval_dataset=sst2_enc["validation"],
                          data_collator=data_collator,
                          tokenizer=tokenizer,
                          compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


## Student Initialization

In [157]:
# flatten the student's parameters into an index-addressable list
student_weights = list(student_model.parameters())

In [158]:
# same for the teacher; one teacher layer out of two will initialize the student
teacher_weights = list(teacher_model.parameters())

In [159]:
# Embedding matrices (word, position, token type) and the final classifier projection (weight, bias)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

tensor([[ 0.0304,  0.0341, -0.0020,  ...,  0.0046,  0.0323, -0.0413],
        [-0.0053,  0.0412, -0.0305,  ...,  0.0160, -0.0263,  0.0135]],
       device='cuda:0')

In [160]:
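# RobertaEmbeddings holds 5 parameter tensors (word, position, token-type,
# LayerNorm weight, LayerNorm bias), so the encoder stack starts at index 5;
# each encoder layer contributes 16 tensors.
base = 5
for i in range(12):
    if i % 2 == 1:  # teacher layers 1, 3, ..., 11
        std_idx = i // 2  # student layers 0, 1, ..., 5
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)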
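
The index arithmetic above depends on how many parameter tensors precede the encoder stack. A name-based mapping avoids that dependency. The following is a hedged sketch, assuming state-dict keys of the form `...encoder.layer.<N>....` as used by Hugging Face RoBERTa models; `student_key_for` is a hypothetical helper, and unlike the cells above it passes every non-encoder key through unchanged rather than copying only selected tensors.

```python
import re

# Matches keys such as "roberta.encoder.layer.3.attention.self.query.weight".
LAYER_RE = re.compile(r"^(?P<prefix>.*encoder\.layer\.)(?P<idx>\d+)(?P<rest>\..*)$")


def student_key_for(teacher_key, stride=2):
    """Return the student key that should receive this teacher tensor, or None.

    With stride=2, teacher layers 1, 3, 5, ... map to student layers 0, 1, 2, ...
    Keys outside the encoder stack keep their name.
    """
    match = LAYER_RE.match(teacher_key)
    if match is None:
        return teacher_key
    idx = int(match.group("idx"))
    if idx % stride != stride - 1:
        return None  # this teacher layer is skipped
    return f"{match.group('prefix')}{idx // stride}{match.group('rest')}"


for key in [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.encoder.layer.1.attention.self.query.weight",
    "roberta.encoder.layer.11.output.LayerNorm.bias",
]:
    print(key, "->", student_key_for(key))
```

A mapping like this could be applied to `teacher_model.state_dict()` to build a dict for `student_model.load_state_dict(..., strict=False)`; that last step is left out here because it needs the actual models.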

In [27]:
def get_n_params(model):
    # count parameters, excluding the last two tensors (classifier projection weight and bias)
    return sum(p.numel() for p in list(model.parameters())[:-2])

## Training

In [43]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [44]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [45]:
trainer.train()

***** Running training *****
  Num examples = 11668
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 644


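| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.924 | 0.827755 | 0.739884 | 0.73478 |
| 2 | 0.2762 | 1.300967 | 0.695568 | 0.690409 |
| 3 | 0.1552 | 0.807425 | 0.732177 | 0.730734 |
| 4 | 0.1193 | 0.607502 | 0.805395 | 0.805383 |
| 5 | 0.0914 | 0.524736 | 0.789981 | 0.789728 |
| 6 | 0.0751 | 0.630178 | 0.786127 | 0.786114 |
| 7 | 0.0669 | 0.632832 | 0.774566 | 0.774513 |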


***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-92
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-92/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-92/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-92/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-92/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/checkpoint-552] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-184
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-184/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-184/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst

TrainOutput(global_step=644, training_loss=0.24401173828551487, metrics={'train_runtime': 138.8455, 'train_samples_per_second': 588.251, 'train_steps_per_second': 4.638, 'total_flos': 2094145023404592.0, 'train_loss': 0.24401173828551487, 'epoch': 7.0})

In [94]:
torch.save(student_model.state_dict(), './models/distilkoroberta_first_7epochs.pt')

## Gradual Decrease

In [161]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer


class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()
        self.step = 0

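    def compute_loss(self, model, inputs, return_outputs=False):
        # Here alpha weights the distillation term (the reverse of the earlier
        # trainer): 1.0 for the first 80% of the 644 steps, 0.5 until 90%, then 0.0.
        # Trainer also calls compute_loss on evaluation batches, so self.step
        # advances slightly faster than the optimizer step.
        self.step += 1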
        if self.step < 644 * 0.8:
            alpha = 1.
        elif self.step < 644 * 0.9:
            alpha = 0.5
        else:
            alpha = 0.
        
        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # assert size
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # compute distillation loss and soften probabilities
        loss_function = nn.KLDivLoss(reduction="batchmean")
#         cos_loss_function = nn.CosineEmbeddingLoss(reduction="mean")
        
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
        # weighted sum: alpha scales the distillation loss, (1 - alpha) the hard-label loss
        loss = (1. - alpha) * student_loss + alpha * loss_logits
        return (loss, outputs_student) if return_outputs else loss
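
The step-dependent weighting in this trainer can be written as a small standalone function. This is a sketch under the assumption that the counter advances exactly once per training step; `distillation_weight` is a hypothetical name, and the thresholds (80% and 90% of 644 steps) are taken from the class above.

```python
def distillation_weight(step, total_steps):
    """Weight of the distillation loss at a given (1-based) training step."""
    if step < total_steps * 0.8:
        return 1.0  # pure distillation
    if step < total_steps * 0.9:
        return 0.5  # equal mix of distillation and hard-label loss
    return 0.0      # hard-label loss only


total_steps = 644
weights = [distillation_weight(s, total_steps) for s in range(1, total_steps + 1)]

# number of steps spent in each phase
print(weights.count(1.0), weights.count(0.5), weights.count(0.0))
```

Passing the total step count as an argument, rather than hard-coding 644, would let the same schedule be reused when the number of epochs or the batch size changes.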


In [144]:
# reload the fine-tuned teacher and start again from a fresh student
teacher_model = AutoModelForSequenceClassification.from_pretrained('./models/checkpoint-368')
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels)
student_model.resize_token_embeddings(32000)

loading configuration file ./models/checkpoint-368/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "tokenizer_class": "BertTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.23.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file ./models/checkpoint-368/pytorch_model.bin
All model checkpoint weights were

Embedding(32000, 768)

In [162]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [163]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [164]:
trainer.train()

***** Running training *****
  Num examples = 11668
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 644


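| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 1.9956 | 1.568988 | 0.766859 | 0.765013 |
| 2 | 0.6203 | 1.61766 | 0.739884 | 0.738763 |
| 3 | 0.3793 | 1.455085 | 0.743738 | 0.74276 |
| 4 | 0.2521 | 1.279669 | 0.786127 | 0.786099 |
| 5 | 0.1713 | 1.20877 | 0.788054 | 0.787921 |
| 6 | 0.0988 | 1.074637 | 0.793834 | 0.793316 |
| 7 | 0.0442 | 0.96136 | 0.782274 | 0.782274 |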


***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-92
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-92/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-92/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-92/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-92/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 519
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-184
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-184/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-184/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-184/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-dist

TrainOutput(global_step=644, training_loss=0.5088177630619973, metrics={'train_runtime': 139.3844, 'train_samples_per_second': 585.977, 'train_steps_per_second': 4.62, 'total_flos': 2094145023404592.0, 'train_loss': 0.5088177630619973, 'epoch': 7.0})

# Fine Tuning on Downstream Tasks

## NLI

In [125]:
from copy import deepcopy

# keep an independent copy of the distilled student as a baseline
baseline = deepcopy(student_model)

In [38]:
datasets = load_dataset("klue", 'nli')

Downloading and preparing dataset klue/nli (download: 1.20 MiB, generated: 6.10 MiB, post-processed: Unknown size, total: 7.30 MiB) to /home/seungjoonpark/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...


Downloading data: 100%|██████████| 1.26M/1.26M [00:00<00:00, 6.19MB/s]
                                                                                          

Dataset klue downloaded and prepared to /home/seungjoonpark/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.


100%|██████████| 2/2 [00:00<00:00, 422.51it/s]


In [39]:
metric = load_metric("glue", "qnli")

Downloading builder script: 5.76kB [00:00, 2.14MB/s]                   


In [45]:
tokenizer

PreTrainedTokenizerFast(name_or_path='klue/roberta-large', vocab_size=32000, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [48]:
sentence1_key, sentence2_key = ("premise", "hypothesis")
print(f"Sentence 1: {datasets['train'][0][sentence1_key]}")
print(f"Sentence 2: {datasets['train'][0][sentence2_key]}")

Sentence 1: 힛걸 진심 최고다 그 어떤 히어로보다 멋지다
Sentence 2: 힛걸 진심 최고로 멋지다.


In [49]:
def preprocess_function(examples):
    return tokenizer(
        examples[sentence1_key],
        examples[sentence2_key],
        truncation=True,
        return_token_type_ids=False,
    )

encoded_datasets = datasets.map(preprocess_function, batched=True)

 96%|█████████▌| 24/25 [00:01<00:00, 21.61ba/s]
 67%|██████▋   | 2/3 [00:00<00:00, 15.01ba/s]


In [117]:
my_config

DistilBertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.23.1",
  "vocab_size": 32000
}

In [126]:
num_labels = 3  # KLUE NLI has three classes
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=num_labels)
model = AutoModelForSequenceClassification.from_config(my_config)
model_dict = model.state_dict()
pretrained_dict = torch.load("/home/seungjoonpark/DistilKoBERT/models/distilkoroberta.pt")
# drop the last two saved entries (the classifier weight and bias)
del pretrained_dict[next(reversed(pretrained_dict))]
del pretrained_dict[next(reversed(pretrained_dict))]
# keep only tensors whose names exist in the new model
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
# strict=False leaves the missing classifier tensors at their fresh initialization
model.load_state_dict(pretrained_dict, strict=False)

_IncompatibleKeys(missing_keys=['classifier.weight', 'classifier.bias'], unexpected_keys=[])
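
The filtering done before `load_state_dict` can be illustrated with plain dictionaries. This is a minimal sketch with made-up keys and integer stand-ins for tensors; `filter_pretrained` is a hypothetical helper, not part of PyTorch or transformers.

```python
def filter_pretrained(pretrained, model_keys, drop_last=2):
    """Drop the last `drop_last` saved entries, then keep only keys the model has.

    Returns the loadable subset and the model keys left without a saved value.
    """
    kept = list(pretrained.items())
    if drop_last:
        kept = kept[:-drop_last]
    loadable = {k: v for k, v in kept if k in model_keys}
    missing = [k for k in model_keys if k not in loadable]
    return loadable, missing


# Made-up checkpoint: integers stand in for weight tensors.
pretrained = {
    "embeddings.weight": 1,
    "layer.0.weight": 2,
    "extra.weight": 9,        # not present in the new model
    "classifier.weight": 3,   # saved head, dropped on purpose
    "classifier.bias": 4,
}
model_keys = ["embeddings.weight", "layer.0.weight", "classifier.weight", "classifier.bias"]

loadable, missing = filter_pretrained(pretrained, model_keys)
print(loadable, missing)
```

The `missing` list plays the role of `missing_keys` in the output above: those tensors are not loaded and keep the values they were initialized with.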

In [127]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [128]:
batch_size=256

In [133]:
metric_name = "accuracy"

args = TrainingArguments(
    "test-nli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [134]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [135]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 24998
  Num Epochs = 5
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 490


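| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 1.09759 | 0.380333 |
| 2 | No log | 1.113917 | 0.390667 |
| 3 | No log | 1.146158 | 0.391 |
| 4 | No log | 1.126838 | 0.397333 |
| 5 | No log | 1.137593 | 0.393 |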


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256
Saving model checkpoint to test-nli/checkpoint-98
Configuration saved in test-nli/checkpoint-98/config.json
Model weights saved in test-nli/checkpoint-98/pytorch_model.bin
tokenizer config file saved in test-nli/checkpoint-98/tokenizer_config.json
Special tokens file saved in test-nli/checkpoint-98/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassifica

TrainOutput(global_step=490, training_loss=0.9852519132653061, metrics={'train_runtime': 288.5789, 'train_samples_per_second': 433.122, 'train_steps_per_second': 1.698, 'total_flos': 2619227819706804.0, 'train_loss': 0.9852519132653061, 'epoch': 5.0})

In [132]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256


{'eval_loss': 1.1056615114212036,
 'eval_accuracy': 0.37133333333333335,
 'eval_runtime': 2.453,
 'eval_samples_per_second': 1222.99,
 'eval_steps_per_second': 4.892,
 'epoch': 5.0}