
## Installing necessary packages

In [1]:
!pip install transformers datasets tensorboard
!sudo apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.


## Choosing our "teacher" and "student" models

In [1]:
student = "distilroberta-base"  # this is just a placeholder; ignore
teacher = "klue/roberta-base"

In [16]:
import torch.nn.utils.prune as prune

## Loading the KLUE relation extraction (RE) dataset

In [2]:
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
import torch

dataset = load_dataset("klue", "re")

Found cached dataset klue (/home/hanjuncho/.cache/huggingface/datasets/klue/re/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)


  0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
# name = dataset["train"].features["label"].names

In [4]:
# dic = {}

In [5]:
# for i in range(len(name)):
#     dic[name[i]] = 0
# dic

{'no_relation': 0,
 'org:dissolved': 0,
 'org:founded': 0,
 'org:place_of_headquarters': 0,
 'org:alternate_names': 0,
 'org:member_of': 0,
 'org:members': 0,
 'org:political/religious_affiliation': 0,
 'org:product': 0,
 'org:founded_by': 0,
 'org:top_members/employees': 0,
 'org:number_of_employees/members': 0,
 'per:date_of_birth': 0,
 'per:date_of_death': 0,
 'per:place_of_birth': 0,
 'per:place_of_death': 0,
 'per:place_of_residence': 0,
 'per:origin': 0,
 'per:employee_of': 0,
 'per:schools_attended': 0,
 'per:alternate_names': 0,
 'per:parents': 0,
 'per:children': 0,
 'per:siblings': 0,
 'per:spouse': 0,
 'per:other_family': 0,
 'per:colleagues': 0,
 'per:product': 0,
 'per:religion': 0,
 'per:title': 0}

In [21]:
# for i in range(len(dataset["train"])):
#     dic[name[dataset["train"]["label"][i]]] += 1
    
# dic

{'no_relation': 9534,
 'org:dissolved': 66,
 'org:founded': 450,
 'org:place_of_headquarters': 1195,
 'org:alternate_names': 1320,
 'org:member_of': 1866,
 'org:members': 420,
 'org:political/religious_affiliation': 98,
 'org:product': 380,
 'org:founded_by': 155,
 'org:top_members/employees': 4284,
 'org:number_of_employees/members': 48,
 'per:date_of_birth': 1130,
 'per:date_of_death': 418,
 'per:place_of_birth': 166,
 'per:place_of_death': 40,
 'per:place_of_residence': 193,
 'per:origin': 1234,
 'per:employee_of': 3573,
 'per:schools_attended': 82,
 'per:alternate_names': 1001,
 'per:parents': 520,
 'per:children': 304,
 'per:siblings': 136,
 'per:spouse': 795,
 'per:other_family': 190,
 'per:colleagues': 534,
 'per:product': 139,
 'per:religion': 96,
 'per:title': 2103}
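
The commented-out loop above tallies labels one example at a time. A minimal sketch of the same tally with `collections.Counter` follows; `label_names` and `label_ids` are made-up stand-ins for the real label names and for `dataset["train"]["label"]`, not values from the notebook.

```python
from collections import Counter

# Hypothetical stand-ins: three of the 30 KLUE RE label names and a toy
# list of label ids (the real ids would come from dataset["train"]["label"]).
label_names = ["no_relation", "org:dissolved", "org:founded"]
label_ids = [0, 0, 2, 0, 1, 2, 0]

# Count each id in a single pass, then map ids back to names.
# Labels that never occur stay at 0.
id_counts = Counter(label_ids)
label_counts = {name: id_counts.get(i, 0) for i, name in enumerate(label_names)}

print(label_counts)
```

Reading the label column once and counting it in a single pass avoids re-indexing the dataset inside the loop.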

## Tokenization

### Initializing the tokenizer of our teacher model

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(teacher)

In [4]:
from IPython.core.debugger import set_trace

In [5]:
def add_entity_tokens(sentence, object_entity, subject_entity):
    obj_start_idx, obj_end_idx = object_entity['start_idx'], object_entity['end_idx']
    subj_start_idx, subj_end_idx = subject_entity['start_idx'], subject_entity['end_idx']
    
    if obj_start_idx < subj_start_idx:
        new_sentence = sentence[:obj_start_idx] + '<obj>' + sentence[obj_start_idx:obj_end_idx+1] + '</obj>' + \
                       sentence[obj_end_idx+1:subj_start_idx] + '<subj>' + sentence[subj_start_idx:subj_end_idx+1] + \
                       '</subj>' + sentence[subj_end_idx+1:]
    else:
        new_sentence = sentence[:subj_start_idx] + '<subj>' + sentence[subj_start_idx:subj_end_idx+1] + '</subj>' + \
                       sentence[subj_end_idx+1:obj_start_idx] + '<obj>' + sentence[obj_start_idx:obj_end_idx+1] + \
                       '</obj>' + sentence[obj_end_idx+1:]
    
    return new_sentence


def read_klue_re(dataset):
    sentences = []
    labels = []
    
    if isinstance(dataset, Dataset):
        for data in dataset:
            sentence = add_entity_tokens(data['sentence'], data['object_entity'], data['subject_entity'])
            sentences.append(sentence)
            labels.append(data['label'])
    
    return sentences, labels
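
To make the effect of the entity markers concrete, here is a small self-contained sketch of the same idea on an English toy sentence. `mark_entities` is a hypothetical helper written for this illustration (it is not the notebook's `add_entity_tokens`), and it assumes inclusive end indices, as the `end_idx+1` slices above imply.

```python
def mark_entities(sentence, subj, obj):
    """Wrap two non-overlapping character spans in <subj>/<obj> markers.

    Each span is (start_idx, end_idx) with an inclusive end index.
    """
    spans = [(subj[0], subj[1], "subj"), (obj[0], obj[1], "obj")]
    # Insert markers for the rightmost span first so that the character
    # offsets of the other span are still valid.
    for start, end, tag in sorted(spans, reverse=True):
        sentence = (
            sentence[:start]
            + f"<{tag}>" + sentence[start:end + 1] + f"</{tag}>"
            + sentence[end + 1:]
        )
    return sentence


example = "Alice works at Acme."
marked = mark_entities(example, subj=(0, 4), obj=(15, 18))
print(marked)  # <subj>Alice</subj> works at <obj>Acme</obj>.
```

Processing the spans from right to left removes the need for separate branches depending on which entity comes first.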

In [6]:
# Keep only the sentence and label from the train and validation datasets.
train_sentences, train_labels = read_klue_re(dataset['train'])
val_sentences, val_labels = read_klue_re(dataset['validation'])

In [7]:
ex_encoding = tokenizer(dataset['train'][0]['sentence'],
                        max_length=128,
                        padding='max_length',
                        truncation=True)

# Adding special tokens

In [8]:
entity_special_tokens = {'additional_special_tokens': ['<obj>', '</obj>', '<subj>', '</subj>']}
num_additional_special_tokens = tokenizer.add_special_tokens(entity_special_tokens)

In [9]:
class KlueReDataset(torch.utils.data.Dataset):
    def __init__(self, tokenizer, sentences, labels, max_length=128):
        self.encodings = tokenizer(sentences,
                                   max_length=max_length,
                                   padding='max_length',
                                   truncation=True)
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = self.labels[idx]
        
        return item
    
    def __len__(self):
        return len(self.labels)

## Creating our Knowledge Distillation Trainer

In [10]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)

        self.alpha = alpha
        self.temperature = temperature

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False):

        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # assert size
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # compute distillation loss and soften probabilities
        loss_function = nn.KLDivLoss(reduction="batchmean")
#         cos_loss_function = nn.CosineEmbeddingLoss(reduction="mean")
        
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
        # return weighted student loss
        loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
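
The loss in `compute_loss` can be checked by hand on a single example. The sketch below recomputes it in plain Python for one pair of logit vectors. The function names are illustrative, and it assumes a batch of size one, so `reduction="batchmean"` reduces to a sum over classes.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) on the softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_teacher, p_student))
    # The soft term is rescaled by T^2, as in the trainer above.
    return alpha * hard + (1.0 - alpha) * soft * temperature ** 2


loss = distillation_loss([0.2, 1.5, -0.3], [0.1, 2.0, -1.0], label=1)
print(loss)
```

When student and teacher logits are identical the KL term vanishes, so the loss reduces to `alpha` times the ordinary cross-entropy.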

## Defining the Metric

In [12]:
from datasets import load_metric
import numpy as np

accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="micro")
    return {
        "accuracy": acc["accuracy"],
        "f1": f1["f1"]
    }
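
Note that with `average="micro"` on a single-label task, every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class). Micro precision and micro recall therefore both equal accuracy, which is why the `eval_accuracy` and `eval_f1` values in the logs of this notebook agree up to floating-point rounding. A minimal plain-Python sketch on toy labels, with illustrative helper names:

```python
def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def micro_f1(predictions, references):
    """Micro-averaged F1: pool TP/FP/FN over all classes before dividing."""
    labels = set(predictions) | set(references)
    tp = fp = fn = 0
    for label in labels:
        for pred, ref in zip(predictions, references):
            if pred == label and ref == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif ref == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy predictions and references over three classes.
preds = [0, 2, 1, 1, 0, 2]
refs = [0, 1, 1, 2, 0, 2]
print(accuracy(preds, refs), micro_f1(preds, refs))
```

A macro-averaged F1 would weight the rare relation classes more heavily and would generally differ from accuracy.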



In [13]:
from transformers import DistilBertConfig
from transformers import AutoConfig, AutoModel
from torch.utils.data import DataLoader

In [14]:
# For model
num_labels = 30
batch_size = 128

train_dataset = KlueReDataset(tokenizer, train_sentences, train_labels)
val_dataset = KlueReDataset(tokenizer, val_sentences, val_labels)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)

## Teacher training

In [42]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

teacher_model = AutoModelForSequenceClassification.from_pretrained(
    teacher,
    num_labels=num_labels, #30
)
teacher_model.resize_token_embeddings(len(tokenizer))

loading configuration file config.json from cache at /home/hanjuncho/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_2

Embedding(32004, 768)

In [38]:
from transformers import AutoModel, DataCollatorWithPadding
from huggingface_hub import HfFolder

teacher_model = AutoModel.from_pretrained(
    teacher,
    num_labels=num_labels, #30
)
teacher_model.resize_token_embeddings(len(tokenizer))

Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for

Embedding(32004, 768)

In [154]:
batch_size = 128

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

args = TrainingArguments(
    # checkpoint
    output_dir='./models/',
    # overwrite_output_dir=True,

    # Model Save & Load
    save_strategy = "epoch", # 'steps'
    load_best_model_at_end=True,
    # save_steps = 500,


    # Dataset
    num_train_epochs=5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    
    # Optimizer
    learning_rate=2e-5, # 5e-5
    weight_decay=0.01,  # 0
    # warmup_steps=200,

    # Regularization
    # max_grad_norm = 1.0,
    # label_smoothing_factor=0.1,


    # Evaluation 
    metric_for_best_model='eval_f1',
    evaluation_strategy = "epoch",

    # Randomness
    seed=33,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [163]:
teacher_model = teacher_model.from_pretrained('./klue-roberta-base-re')

loading configuration file ./klue-roberta-base-re/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23": "LABEL_23",
    "24": "LABEL_24",
    "25": "LABEL_25",
    "26": "LAB

In [164]:
# Note: this is an alias, not a copy, so the pruning below also modifies teacher_model.
prune_model = teacher_model

parameters_to_prune = ()
for i in range(12):
    parameters_to_prune += (
        (prune_model.roberta.encoder.layer[i].attention.self.key, 'weight'),
        (prune_model.roberta.encoder.layer[i].attention.self.query, 'weight'),
        (prune_model.roberta.encoder.layer[i].attention.self.value, 'weight'),
    )

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.4,
)
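
Global pruning pools all listed tensors and removes the smallest-magnitude entries overall, so the requested 40% applies to the pool and not to each tensor. This is why the per-layer sparsities reported in this notebook differ between key, query and value weights. The plain-Python sketch below illustrates only the selection rule on toy lists. It is not how `torch.nn.utils.prune` is implemented (that works through masks on reparametrized weights), and the names are illustrative.

```python
def global_l1_prune(tensors, amount):
    """Zero the `amount` fraction of entries with the smallest absolute value,
    ranked across all tensors together."""
    pooled = sorted(
        (abs(v), name, i)
        for name, values in tensors.items()
        for i, v in enumerate(values)
    )
    n_prune = round(amount * len(pooled))
    pruned = {name: list(values) for name, values in tensors.items()}
    for _, name, i in pooled[:n_prune]:
        pruned[name][i] = 0.0
    return pruned


def sparsity(values):
    return sum(v == 0.0 for v in values) / len(values)


# Toy "weights": the value tensor holds most of the small-magnitude entries.
weights = {
    "key": [0.9, -0.8, 0.7, -0.6, 0.5],
    "value": [0.05, -0.04, 0.03, -0.02, 0.6],
}
pruned = global_l1_prune(weights, amount=0.4)

for name, values in pruned.items():
    print(name, sparsity(values))
```

In this toy case the overall sparsity is exactly 40%, yet all pruned entries fall in `value` and none in `key`.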

In [157]:
prune_model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32004, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [160]:
float(torch.sum(prune_model.roberta.encoder.layer[0].attention.self.key.weight==0))

477912.0

In [165]:
float(torch.sum(teacher_model.roberta.encoder.layer[0].attention.self.key.weight==0))

196133.0

In [166]:
for i in range(12):
    print(
        "Sparsity in Layer {}-th key weight: {:.2f}%".format(
            i+1,
            100. * float(torch.sum(prune_model.roberta.encoder.layer[i].attention.self.key.weight == 0))
            / float(prune_model.roberta.encoder.layer[i].attention.self.key.weight.nelement())
        )
    )
    print(
        "Sparsity in Layer {}-th query weight: {:.2f}%".format(
            i+1,
            100. * float(torch.sum(prune_model.roberta.encoder.layer[i].attention.self.query.weight == 0))
            / float(prune_model.roberta.encoder.layer[i].attention.self.query.weight.nelement())
        )
    )
    print(
        "Sparsity in Layer {}-th value weight: {:.2f}%".format(
            i+1,
            100. * float(torch.sum(prune_model.roberta.encoder.layer[i].attention.self.value.weight == 0))
            / float(prune_model.roberta.encoder.layer[i].attention.self.value.weight.nelement())
        )
    )
    print()

numerator, denominator = 0, 0
for i in range(12):
    numerator += torch.sum(prune_model.roberta.encoder.layer[i].attention.self.key.weight == 0)
    numerator += torch.sum(prune_model.roberta.encoder.layer[i].attention.self.query.weight == 0)
    numerator += torch.sum(prune_model.roberta.encoder.layer[i].attention.self.value.weight == 0)

    denominator += prune_model.roberta.encoder.layer[i].attention.self.key.weight.nelement()
    denominator += prune_model.roberta.encoder.layer[i].attention.self.query.weight.nelement()
    denominator += prune_model.roberta.encoder.layer[i].attention.self.value.weight.nelement()
    
print("Global sparsity: {:.2f}%".format(100. * float(numerator) / float(denominator)))

Sparsity in Layer 1-th key weight: 33.25%
Sparsity in Layer 1-th query weightt: 34.30%
Sparsity in Layer 1-th value weight: 52.34%

Sparsity in Layer 2-th key weight: 37.52%
Sparsity in Layer 2-th query weightt: 37.97%
Sparsity in Layer 2-th value weight: 51.71%

Sparsity in Layer 3-th key weight: 38.17%
Sparsity in Layer 3-th query weightt: 38.28%
Sparsity in Layer 3-th value weight: 50.36%

Sparsity in Layer 4-th key weight: 36.78%
Sparsity in Layer 4-th query weightt: 36.83%
Sparsity in Layer 4-th value weight: 49.73%

Sparsity in Layer 5-th key weight: 37.39%
Sparsity in Layer 5-th query weightt: 36.89%
Sparsity in Layer 5-th value weight: 47.56%

Sparsity in Layer 6-th key weight: 36.52%
Sparsity in Layer 6-th query weightt: 35.75%
Sparsity in Layer 6-th value weight: 50.10%

Sparsity in Layer 7-th key weight: 36.01%
Sparsity in Layer 7-th query weightt: 35.88%
Sparsity in Layer 7-th value weight: 47.20%

Sparsity in Layer 8-th key weight: 36.28%
Sparsity in Layer 8-th query weigh

In [167]:
trainer = Trainer(
    teacher_model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [168]:
trainer2 = Trainer(
    prune_model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [169]:
trainer2.evaluate()

***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


{'eval_loss': 0.8514652848243713,
 'eval_accuracy': 0.7384417256922087,
 'eval_f1': 0.7384417256922087,
 'eval_runtime': 419.1337,
 'eval_samples_per_second': 18.526,
 'eval_steps_per_second': 0.146}

In [135]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


{'eval_loss': 0.915097177028656,
 'eval_accuracy': 0.719768190598841,
 'eval_f1': 0.7197681905988411,
 'eval_runtime': 410.6185,
 'eval_samples_per_second': 18.91,
 'eval_steps_per_second': 0.149}

In [45]:
trainer.train()

***** Running training *****
  Num examples = 32470
  Num Epochs = 5
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 1270
  Number of trainable parameters = 110644254
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [42]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128


{'eval_loss': 0.9150362610816956,
 'eval_accuracy': 0.719768190598841,
 'eval_f1': 0.7197681905988411,
 'eval_runtime': 9.534,
 'eval_samples_per_second': 814.456,
 'eval_steps_per_second': 6.398,
 'epoch': 5.0}

In [40]:
tokenizer.save_pretrained('./klue-roberta-base-re')
teacher_model.save_pretrained('./klue-roberta-base-re')

tokenizer config file saved in ./klue-roberta-base-re/tokenizer_config.json
Special tokens file saved in ./klue-roberta-base-re/special_tokens_map.json
Configuration saved in ./klue-roberta-base-re/config.json
Model weights saved in ./klue-roberta-base-re/pytorch_model.bin


In [144]:
# load pretrained model
teacher_model = teacher_model.from_pretrained('./klue-roberta-base-re')

loading configuration file ./klue-roberta-base-re/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23": "LABEL_23",
    "24": "LABEL_24",
    "25": "LABEL_25",
    "26": "LAB

## Defining the Training Arguments

In [22]:
from transformers import DistilBertConfig
from transformers import AutoConfig, AutoModel

In [30]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="eval_f1", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# # teacher model
teacher_model = teacher_model.from_pretrained(
    teacher,
    num_labels=num_labels,
)

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
)
student_model.resize_token_embeddings(len(tokenizer))
#teacher_model.resize_token_embeddings(len(tokenizer))

PyTorch: setting up devices
loading configuration file config.json from cache at /home/hanjuncho/.cache/huggingface/hub/models--klue--roberta-base/snapshots/67dd433d36ebc66a42c9aaa85abcf8d2620e41d9/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22"

Embedding(32004, 768)

## Student Initialization

In [31]:
student_weights = []
for i, p in enumerate(student_model.parameters()):
    student_weights.append(p)

In [32]:
# collect the teacher's parameters in order (every other teacher layer is copied into the student below)
teacher_weights = []
for i, p in enumerate(teacher_model.parameters()):
    teacher_weights.append(p)

In [33]:
# First three tensors (embedding tables) and last two tensors (classifier output projection)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

tensor([[ 0.0032,  0.0120,  0.0437,  ..., -0.0044,  0.0152,  0.0163],
        [ 0.0298,  0.0116,  0.0083,  ...,  0.0177, -0.0170,  0.0026],
        [-0.0386,  0.0028,  0.0212,  ..., -0.0239, -0.0213,  0.0151],
        ...,
        [ 0.0191, -0.0081, -0.0129,  ...,  0.0195, -0.0090, -0.0066],
        [-0.0281,  0.0092, -0.0060,  ..., -0.0333,  0.0045,  0.0329],
        [-0.0148, -0.0076, -0.0034,  ..., -0.0141, -0.0082,  0.0278]])

In [34]:
student_weights = []
for i, p in enumerate(student_model.parameters()):
    student_weights.append(p)

# collect the teacher's parameters in order (every other teacher layer is copied into the student below)
teacher_weights = []
for i, p in enumerate(teacher_model.parameters()):
    teacher_weights.append(p)

# First three tensors (embedding tables) and last two tensors (classifier output projection)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

base = 3
for i in range(12):
    if i % 2 == 1:
        std_idx = i // 2
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)
            
def get_n_params(model):
    pp=0
    for p in list(model.parameters())[:-2]:
        nn=1
        for s in list(p.size()):
            nn = nn*s
        pp += nn
    return pp

get_n_params(student_model)
get_n_params(teacher_model)

110621184

In [35]:
base = 3
for i in range(12):
    if i % 2 == 1:
        std_idx = i // 2
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)
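
Because the copy loop addresses parameters by position, it helps to see which tensors a given `base` actually pairs up. The sketch below builds a list of parameter names and replays the index arithmetic. The layout (five embedding tensors followed by 16 tensors per layer) is an assumption based on the usual Hugging Face RoBERTa parameter order, the shortened names are illustrative, and it is worth confirming against `model.named_parameters()` before relying on it.

```python
# Assumed layout (not taken from the notebook): five embedding tensors,
# then 16 tensors per transformer layer.
EMBEDDING_PARAMS = [
    "word_embeddings.weight",
    "position_embeddings.weight",
    "token_type_embeddings.weight",
    "LayerNorm.weight",
    "LayerNorm.bias",
]
LAYER_PARAMS = [
    f"{module}.{kind}"
    for module in (
        "attention.self.query", "attention.self.key", "attention.self.value",
        "attention.output.dense", "attention.output.LayerNorm",
        "intermediate.dense", "output.dense", "output.LayerNorm",
    )
    for kind in ("weight", "bias")
]


def param_names(num_layers):
    names = [f"embeddings.{p}" for p in EMBEDDING_PARAMS]
    for layer in range(num_layers):
        names += [f"layer.{layer}.{p}" for p in LAYER_PARAMS]
    return names


def copy_plan(base, teacher_layers=12, per_layer=16):
    """Return the (student_name, teacher_name) pairs touched by the index loop."""
    teacher = param_names(teacher_layers)
    student = param_names(teacher_layers // 2)
    plan = []
    for i in range(teacher_layers):
        if i % 2 == 1:
            std_idx = i // 2
            for j in range(per_layer):
                plan.append((student[base + std_idx * per_layer + j],
                             teacher[base + i * per_layer + j]))
    return plan


print(copy_plan(base=3)[:3])
print(copy_plan(base=5)[:3])
```

Under this assumed layout, `base = 3` starts two tensors early: the student's embedding LayerNorm is paired with a teacher layer's output LayerNorm, and the last student layer's output LayerNorm is never copied. With `base = 5` each student layer lines up with a whole teacher layer. The shapes match in both cases, so the shift would not raise an error.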

In [36]:
def get_n_params(model):
    pp=0
    for p in list(model.parameters())[:-2]:
        nn=1
        for s in list(p.size()):
            nn = nn*s
        pp += nn
    return pp

In [37]:
get_n_params(student_model)

68093952

In [38]:
get_n_params(teacher_model)

110621184

## Training

In [39]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="eval_f1", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [40]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [41]:
trainer.train()

***** Running training *****
  Num examples = 32470
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 1778
  Number of trainable parameters = 68117022


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,1.1226,1.125819,0.505473,0.505473
2,0.7851,0.99427,0.578364,0.578364
3,0.6777,1.033804,0.572054,0.572054
4,0.6042,1.082392,0.541017,0.541017
5,0.553,1.089619,0.560335,0.560335
6,0.5135,1.062945,0.584932,0.584932
7,0.4885,1.115968,0.565357,0.565357


***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-254
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-254/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-254/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-254/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-254/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-508
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-508/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-508/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-508/tokenizer_config.json
Special tokens file saved in distilroberta-base-ss

TrainOutput(global_step=1778, training_loss=0.6777900635726809, metrics={'train_runtime': 594.9921, 'train_samples_per_second': 382.005, 'train_steps_per_second': 2.988, 'total_flos': 7530887358489600.0, 'train_loss': 0.6777900635726809, 'epoch': 7.0})

In [27]:
# w/o init

In [28]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128


{'eval_loss': 1.54940664768219,
 'eval_accuracy': 0.5344494526722473,
 'eval_f1': 0.5344494526722473,
 'eval_runtime': 10.2213,
 'eval_samples_per_second': 759.69,
 'eval_steps_per_second': 5.968,
 'epoch': 7.0}

# Linearly decaying loss weight

In [92]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()
        self.step = 0

    def compute_loss(self, model, inputs, return_outputs=False):
        self.step += 1
        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # assert size
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # compute distillation loss and soften probabilities
        loss_function = nn.KLDivLoss(reduction="batchmean")
#         cos_loss_function = nn.CosineEmbeddingLoss(reduction="mean")
        
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
        # return weighted student loss
        loss = 1.*(2499-self.step)/2499 * student_loss + (1. - 1.*(2499-self.step)/2499) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
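
The weighting in this version of `compute_loss` moves linearly from the hard-label loss to the distillation loss as `self.step` grows, with the horizon hard-coded as 2499. A minimal sketch of the same schedule as a standalone function is shown below. The names `linear_alpha` and `mix_losses` are illustrative, and the clamp at zero is an addition that is not in the notebook.

```python
def linear_alpha(step, total_steps):
    """Weight on the hard-label loss: 1.0 at step 0, decaying linearly to 0.0.

    Clamped so that steps beyond total_steps do not produce a negative weight.
    """
    remaining = max(total_steps - step, 0)
    return remaining / total_steps


def mix_losses(student_loss, distill_loss, step, total_steps):
    alpha = linear_alpha(step, total_steps)
    return alpha * student_loss + (1.0 - alpha) * distill_loss


# Weight on the hard-label loss at the start, middle and end of a 100-step run.
for step in (0, 50, 100):
    print(step, linear_alpha(step, total_steps=100))
```

Passing the horizon as a parameter keeps the schedule in step with the actual number of training steps. Without the clamp, a step counter that runs past the horizon would put a negative weight on the student loss.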

In [93]:
# load pretrained model
teacher_model = teacher_model.from_pretrained('./klue-roberta-base-re')
#teacher_model = teacher_model.from_pretrained("klue/roberta-base")

loading configuration file ./klue-roberta-base-re/config.json
Model config RobertaConfig {
  "_name_or_path": "klue/roberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23": "LABEL_23",
    "24": "LABEL_24",
    "25": "LABEL_25",
    "26": "LAB

In [94]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from huggingface_hub import HfFolder

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="eval_f1", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
)
student_model.resize_token_embeddings(len(tokenizer))

PyTorch: setting up devices
loading configuration file config.json from cache at /home/hanjuncho/.cache/huggingface/hub/models--distilroberta-base/snapshots/d5411c3ee9e1793fd9ef58390b40a80a4c10df32/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
  

Embedding(32004, 768)

In [95]:
### Do weight init!!! Run the code above first.

In [96]:
student_weights = []
for i, p in enumerate(student_model.parameters()):
    student_weights.append(p)

# collect teacher parameters; the student is initialized from every second teacher layer below
teacher_weights = []
for i, p in enumerate(teacher_model.parameters()):
    teacher_weights.append(p)

# Copy the first three parameter tensors (embeddings) and the last two (final classifier weight and bias)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

base = 3
for i in range(12):
    if i % 2 == 1:
        std_idx = i // 2
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)
            
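def get_n_params(model):
    """Count parameters, skipping the last two parameter tensors."""
    total = 0
    for p in list(model.parameters())[:-2]:
        n = 1  # renamed from `nn` to avoid confusion with the torch.nn alias
        for s in p.size():
            n *= s
        total += n
    return total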

get_n_params(student_model)
get_n_params(teacher_model)

110621184
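
The index arithmetic in the cell above relies on every encoder layer contributing 16 parameter tensors and on `base` matching the number of embedding tensors. A name-based mapping is one way to express the same rule (odd teacher layer `i` goes to student layer `i // 2`) without depending on tensor order. This is a sketch, not the notebook's method: `map_teacher_key` is a hypothetical helper, and the `roberta.encoder.layer.N.` key prefix is assumed from the Hugging Face RoBERTa state-dict naming.

```python
import re

LAYER_RE = re.compile(r"^(roberta\.encoder\.layer\.)(\d+)(\..+)$")

def map_teacher_key(key):
    """Map a teacher state-dict key to the student key, or None if the layer is skipped."""
    m = LAYER_RE.match(key)
    if m is None:
        return key                      # embeddings / classifier: same name in both models
    i = int(m.group(2))
    if i % 2 == 0:
        return None                     # even teacher layers are not copied
    return f"{m.group(1)}{i // 2}{m.group(3)}"

print(map_teacher_key("roberta.encoder.layer.11.output.dense.weight"))
print(map_teacher_key("roberta.encoder.layer.4.attention.self.query.bias"))
print(map_teacher_key("roberta.embeddings.word_embeddings.weight"))
```

The mapped keys could then be used to build a dictionary for `student_model.load_state_dict(..., strict=False)`. It may also be worth printing `named_parameters()` to confirm that `base = 3` is the right offset: if the embedding block also carries LayerNorm weight and bias tensors, the encoder layers would start at index 5 instead.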

In [97]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="eval_f1", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [98]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [99]:
trainer.train()

***** Running training *****
  Num examples = 32470
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 1778
  Number of trainable parameters = 68117022


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,1.3747,1.375127,0.511526,0.511526
2,0.6261,0.912751,0.637991,0.637991
3,0.4081,0.87804,0.649195,0.649195
4,0.2762,0.832446,0.639665,0.639665
5,0.1997,0.680994,0.65718,0.65718
6,0.1524,0.561346,0.651771,0.651771
7,0.1174,0.457319,0.642756,0.642756


***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-254
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-254/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-254/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-254/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-254/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/checkpoint-1524] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-508
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-508/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-508/pytorch_model.bin
tokenizer config file saved in distilroberta-

TrainOutput(global_step=1778, training_loss=0.4506485824241681, metrics={'train_runtime': 592.6514, 'train_samples_per_second': 383.514, 'train_steps_per_second': 3.0, 'total_flos': 7530887358489600.0, 'train_loss': 0.4506485824241681, 'epoch': 7.0})

In [100]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128


{'eval_loss': 0.4472186863422394,
 'eval_accuracy': 0.6571796522858983,
 'eval_f1': 0.6571796522858983,
 'eval_runtime': 10.1921,
 'eval_samples_per_second': 761.864,
 'eval_steps_per_second': 5.985,
 'epoch': 7.0}
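
Accuracy and F1 are identical in every row above. That is what one would expect if `compute_metrics` (defined outside this section) uses micro-averaged F1 on a single-label task, though that is an assumption. A small pure-Python check of the identity, with made-up labels:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP / FP / FN over all classes before computing F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels, for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1, 0]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision, recall and F1 all collapse to the accuracy. A macro-averaged F1 would usually differ and may be more informative when classes are imbalanced.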

## Further decreasing the model size

In [101]:
new_teacher = student_model
#new_teacher = teacher_model.from_pretrained('./klue-roberta-base-re')

In [102]:
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
)

loading configuration file config.json from cache at /home/hanjuncho/.cache/huggingface/hub/models--distilroberta-base/snapshots/d5411c3ee9e1793fd9ef58390b40a80a4c10df32/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "21": "LABEL_21",
    "22": "LABEL_22",
    "23"

In [103]:
new_config = student_model.config

In [104]:
new_config.num_hidden_layers = 3
new_config.num_labels = num_labels

In [105]:
student_model = AutoModelForSequenceClassification.from_config(new_config)
student_model.resize_token_embeddings(len(tokenizer))

Embedding(32004, 768)
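
As a sanity check on the 3-layer student, the size of one encoder layer can be worked out from the `hidden_size` of 768 shown in the printed config plus an assumed feed-forward width of 3072. That width is the usual RoBERTa-base value, but it is cut off in the printed config, so treat it as an assumption. The two trainable-parameter counts are copied from the training logs in this notebook.

```python
def encoder_layer_params(hidden=768, ffn=3072):
    """Parameter count of one transformer encoder layer (weights + biases)."""
    attention = 4 * (hidden * hidden + hidden)   # query, key, value, output projection
    ffn_in = hidden * ffn + ffn                  # hidden -> feed-forward
    ffn_out = ffn * hidden + hidden              # feed-forward -> hidden
    layer_norms = 2 * (2 * hidden)               # two LayerNorms, weight + bias each
    return attention + ffn_in + ffn_out + layer_norms

per_layer = encoder_layer_params()
six_layer_student = 68117022    # "Number of trainable parameters" logged for the 6-layer run
three_layer_student = 46853406  # same log line for the 3-layer run

print(per_layer)
print(six_layer_student - three_layer_student == 3 * per_layer)
```

If the check prints `True`, the drop in parameters is fully explained by the three removed encoder layers, with embeddings and classifier head unchanged.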

In [106]:
student_weights = []
for i, p in enumerate(student_model.parameters()):
    student_weights.append(p)
# collect the new teacher's parameters; the student is initialized from every second teacher layer below
teacher_weights = []
for i, p in enumerate(new_teacher.parameters()):
    teacher_weights.append(p)

In [107]:
# Copy the first three parameter tensors (embeddings) and the last two (final classifier weight and bias)
student_weights[0].data.copy_(teacher_weights[0].data)
student_weights[1].data.copy_(teacher_weights[1].data)
student_weights[2].data.copy_(teacher_weights[2].data)
student_weights[-1].data.copy_(teacher_weights[-1].data)
student_weights[-2].data.copy_(teacher_weights[-2].data)

tensor([[-0.0037, -0.0081, -0.0167,  ...,  0.0285,  0.0038,  0.0092],
        [ 0.0271, -0.0415, -0.0143,  ..., -0.0109,  0.0231, -0.0177],
        [-0.0189, -0.0133,  0.0235,  ...,  0.0072,  0.0147, -0.0036],
        ...,
        [-0.0117,  0.0232, -0.0339,  ..., -0.0257,  0.0165, -0.0316],
        [-0.0233,  0.0075,  0.0217,  ...,  0.0090,  0.0020,  0.0254],
        [-0.0194,  0.0162, -0.0178,  ..., -0.0044,  0.0181, -0.0265]])

In [108]:
base = 3
for i in range(6):
    if i % 2 == 1:
        std_idx = i // 2
        for j in range(16):
            student_weights[base+std_idx*16+j].data.copy_(teacher_weights[base+i*16+j].data)

In [109]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()
        self.step = 0

    def compute_loss(self, model, inputs, return_outputs=False):
        self.step += 1
        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # assert size
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # compute distillation loss and soften probabilities
        loss_function = nn.KLDivLoss(reduction="batchmean")
#         cos_loss_function = nn.CosineEmbeddingLoss(reduction="mean")
        
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
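        # blend the two losses: the weight shifts linearly from the hard-label student loss
        # to the distillation loss over 1778 steps (hard-coded to this run's total optimization steps)
        loss = 1.*(1778-self.step)/1778 * student_loss + (1. - 1.*(1778-self.step)/1778) * loss_logits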
        return (loss, outputs_student) if return_outputs else loss
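
To make the `loss_logits` term concrete, here is the same temperature-scaled KL divergence for a single example in plain Python. It is a sketch for intuition only: the logits are invented and `distillation_loss` is a hypothetical name. The formula mirrors `KLDivLoss` applied to `log_softmax(student / T)` and `softmax(teacher / T)`, multiplied by `T ** 2`.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_teacher, p_student))
    return kl * temperature ** 2

# Invented logits for one example with three classes.
teacher = [4.0, 1.0, -2.0]
student = [2.5, 0.5, -1.0]
print(softmax(teacher, 1.0))   # sharp distribution
print(softmax(teacher, 4.0))   # softened distribution at T = 4
print(distillation_loss(student, teacher, 4.0))
```

A higher temperature flattens the teacher distribution, so the student also gets signal from the non-argmax classes. The `T ** 2` factor compensates for the gradients shrinking as the logits are divided by `T`.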

In [110]:
### to continue learning

training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled2",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir=f"distilroberta-base-sst2-distilled2/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="eval_f1", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

PyTorch: setting up devices


In [111]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=new_teacher, # changed for comparison
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [112]:
trainer.train()

***** Running training *****
  Num examples = 32470
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 1778
  Number of trainable parameters = 46853406


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,1.2606,1.467542,0.518995,0.518995
2,0.5932,1.267018,0.549646,0.549646
3,0.3815,1.230793,0.560077,0.560077
4,0.2705,1.208514,0.523503,0.523503
5,0.1992,0.962329,0.561494,0.561494
6,0.1566,0.823961,0.537798,0.537798
7,0.1284,0.617718,0.552737,0.552737


***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled2/checkpoint-254
Configuration saved in distilroberta-base-sst2-distilled2/checkpoint-254/config.json
Model weights saved in distilroberta-base-sst2-distilled2/checkpoint-254/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled2/checkpoint-254/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled2/checkpoint-254/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled2/checkpoint-762] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled2/checkpoint-508
Configuration saved in distilroberta-base-sst2-distilled2/checkpoint-508/config.json
Model weights saved in distilroberta-base-sst2-distilled2/checkpoint-508/pytorch_model.bin
tokenizer config file saved in distil

TrainOutput(global_step=1778, training_loss=0.4271300158162755, metrics={'train_runtime': 320.8228, 'train_samples_per_second': 708.46, 'train_steps_per_second': 5.542, 'total_flos': 3819137766958080.0, 'train_loss': 0.4271300158162755, 'epoch': 7.0})

In [77]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 7765
  Batch size = 128


{'eval_loss': 0.7206025719642639,
 'eval_accuracy': 0.592530585962653,
 'eval_f1': 0.592530585962653,
 'eval_runtime': 8.9517,
 'eval_samples_per_second': 867.435,
 'eval_steps_per_second': 6.814,
 'epoch': 7.0}

In [98]:
torch.save(student_model.state_dict(), './models/distilkoroberta_first_7epochs.pt')

In [34]:
from copy import deepcopy

# Fine Tuning on Downstream Tasks

## NLI

In [125]:
baseline = deepcopy(student_model)

In [38]:
datasets = load_dataset("klue", 'nli')

Downloading and preparing dataset klue/nli (download: 1.20 MiB, generated: 6.10 MiB, post-processed: Unknown size, total: 7.30 MiB) to /home/seungjoonpark/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...


Downloading data: 100%|██████████| 1.26M/1.26M [00:00<00:00, 6.19MB/s]
                                                                                          

Dataset klue downloaded and prepared to /home/seungjoonpark/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.


100%|██████████| 2/2 [00:00<00:00, 422.51it/s]


In [39]:
metric = load_metric("glue", "qnli")

Downloading builder script: 5.76kB [00:00, 2.14MB/s]                   


In [45]:
tokenizer

PreTrainedTokenizerFast(name_or_path='klue/roberta-large', vocab_size=32000, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [48]:
sentence1_key, sentence2_key = ("premise", "hypothesis")
print(f"Sentence 1: {datasets['train'][0][sentence1_key]}")
print(f"Sentence 2: {datasets['train'][0][sentence2_key]}")

Sentence 1: 힛걸 진심 최고다 그 어떤 히어로보다 멋지다
Sentence 2: 힛걸 진심 최고로 멋지다.


In [49]:
def preprocess_function(examples):
    return tokenizer(
        examples[sentence1_key],
        examples[sentence2_key],
        truncation=True,
        return_token_type_ids=False,
    )

encoded_datasets = datasets.map(preprocess_function, batched=True)

 96%|█████████▌| 24/25 [00:01<00:00, 21.61ba/s]
 67%|██████▋   | 2/3 [00:00<00:00, 15.01ba/s]


In [117]:
my_config

DistilBertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.23.1",
  "vocab_size": 32000
}

In [126]:
num_labels = 3
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=num_labels)
model = AutoModelForSequenceClassification.from_config(my_config)
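model_dict = model.state_dict()
pretrained_dict = torch.load("/home/seungjoonpark/DistilKoBERT/models/distilkoroberta.pt")
# drop the last two checkpoint entries before loading
del pretrained_dict[next(reversed(pretrained_dict))]
del pretrained_dict[next(reversed(pretrained_dict))]
# keep only tensors whose names exist in the new model
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
model_dict.update(pretrained_dict) 
# strict=False leaves keys that are absent from the checkpoint at their fresh initialization
model.load_state_dict(pretrained_dict, strict=False)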

_IncompatibleKeys(missing_keys=['classifier.weight', 'classifier.bias'], unexpected_keys=[])
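
The loading cell keeps only checkpoint entries whose names exist in the new model, after dropping the last two entries. The same filtering logic on plain dictionaries, as a sketch with invented keys (`filter_checkpoint` is a hypothetical helper, not part of the notebook):

```python
def filter_checkpoint(checkpoint, model_keys, drop_last=2):
    """Drop the last `drop_last` checkpoint entries, keep keys the model knows,
    and report which model keys would stay at their fresh initialization."""
    items = list(checkpoint.items())
    if drop_last:
        items = items[:-drop_last]
    kept = {k: v for k, v in items if k in model_keys}
    missing = [k for k in model_keys if k not in kept]
    return kept, missing

# Invented keys and values, for illustration only.
checkpoint = {"encoder.w": 1, "encoder.b": 2, "head.w": 3, "head.b": 4}
model_keys = ["encoder.w", "encoder.b", "classifier.weight", "classifier.bias"]
kept, missing = filter_checkpoint(checkpoint, model_keys)
print(kept)
print(missing)
```

The `missing` list plays the role of `missing_keys` in the `_IncompatibleKeys` result above: those tensors are not loaded and keep their random initialization until fine-tuning.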

In [127]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [128]:
batch_size=256

In [133]:
metric_name = "accuracy"

args = TrainingArguments(
    "test-nli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [134]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [135]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 24998
  Num Epochs = 5
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 490


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.09759,0.380333
2,No log,1.113917,0.390667
3,No log,1.146158,0.391
4,No log,1.126838,0.397333
5,No log,1.137593,0.393


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256
Saving model checkpoint to test-nli/checkpoint-98
Configuration saved in test-nli/checkpoint-98/config.json
Model weights saved in test-nli/checkpoint-98/pytorch_model.bin
tokenizer config file saved in test-nli/checkpoint-98/tokenizer_config.json
Special tokens file saved in test-nli/checkpoint-98/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassifica

TrainOutput(global_step=490, training_loss=0.9852519132653061, metrics={'train_runtime': 288.5789, 'train_samples_per_second': 433.122, 'train_steps_per_second': 1.698, 'total_flos': 2619227819706804.0, 'train_loss': 0.9852519132653061, 'epoch': 5.0})

In [132]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: source, hypothesis, premise, guid. If source, hypothesis, premise, guid are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 256


{'eval_loss': 1.1056615114212036,
 'eval_accuracy': 0.37133333333333335,
 'eval_runtime': 2.453,
 'eval_samples_per_second': 1222.99,
 'eval_steps_per_second': 4.892,
 'epoch': 5.0}

## Installing Optuna for Hyperparameter Tuning

## Defining the Hyperparameter Space to be optimized over

In [137]:
def hp_space(trial):
    return {
      "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
      "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3 ,log=True),
      "alpha": trial.suggest_float("alpha", 0, 1),
      "temperature": trial.suggest_int("temperature", 2, 30),
      }
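
To see what one draw from this search space looks like without installing Optuna, the sketch below calls `hp_space` with a stand-in object. `FakeTrial` is hypothetical and only imitates the two `suggest_*` methods used here; it is not Optuna's sampler.

```python
import math
import random

class FakeTrial:
    """Minimal stand-in for the `trial` argument; uniform / log-uniform draws only."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def suggest_int(self, name, low, high):
        return self.rng.randint(low, high)          # inclusive on both ends

    def suggest_float(self, name, low, high, log=False):
        if log:
            # sample uniformly in log space, then map back
            return math.exp(self.rng.uniform(math.log(low), math.log(high)))
        return self.rng.uniform(low, high)

def hp_space(trial):
    return {
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "alpha": trial.suggest_float("alpha", 0, 1),
        "temperature": trial.suggest_int("temperature", 2, 30),
    }

params = hp_space(FakeTrial(seed=0))
print(params)
```

`alpha` and `temperature` only influence training if the trainer's loss reads them from `self.args`. The `DistillationTrainer` shown in cell `In [109]` uses `temperature` but replaces `alpha` with a step-based schedule, so with that class the `alpha` dimension of the search would have no effect.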

## Running the Hyperparameter Search

In [138]:
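# NOTE: num_labels=6 conflicts with the 7-entry id2label map; transformers overrides it to 7 (see the warning in the output)
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4, vocab_size=32000, n_layers=6, num_labels=6,
                            label2id=label2id, id2label=id2label)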


def student_init():
    return AutoModelForSequenceClassification.from_config(
    my_config)

trainer = DistillationTrainer(
    model_init=student_init,
    args=training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    n_trials=2,
    direction="maximize",
    hp_space=hp_space
)

print(best_run)

You passed along `num_labels=6` with an incompatible id to label map: {'0': 'IT과학', '1': '경제', '2': '사회', '3': '생활문화', '4': '세계', '5': '스포츠', '6': '정치'}. The number of labels wil be overwritten to 7.
Using cuda_amp half precision backend
[I 2022-10-27 23:39:03,618] A new study created in memory with name: no-name-72755f35-ffe2-455f-abeb-7ef48083cfc8
Trial: {'num_train_epochs': 4, 'learning_rate': 0.0003356216196363318, 'alpha': 0.0038843531441111745, 'temperature': 28}
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulatio

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0099,0.007726,0.148018
2,0.0079,0.007775,0.148018
3,0.0078,0.007747,0.148018
4,0.0078,0.007662,0.148018


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-0/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceC

Epoch,Training Loss,Validation Loss,Accuracy
1,0.3879,0.372254,0.706929
2,0.3427,0.365087,0.71758
3,0.333,0.354984,0.77962
4,0.3272,0.358589,0.764577
5,0.3232,0.36462,0.752718
6,0.3206,0.358994,0.775338
7,0.3183,0.357679,0.784122
8,0.3168,0.363774,0.765565
9,0.3157,0.36286,0.767541
10,0.3147,0.361569,0.772812


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-357
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceC

***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-3213
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-3213/special_tokens_map.json
Deleting older checkpoint [distilroberta-base-sst2-distilled/run-1/checkpoint-2856] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore th

BestRun(run_id='1', objective=0.7728121225430987, hyperparameters={'num_train_epochs': 10, 'learning_rate': 4.354784416636035e-05, 'alpha': 0.22918617625637505, 'temperature': 9})


## Updating the training arguments

In [139]:
# overwriting the previous hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(training_args, k, v)

# new repository
best_model_ckpt = "distilroberta-best"
training_args.output_dir = best_model_ckpt
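
The `setattr` loop works on any object with matching attribute names. A small self-contained illustration, with a hypothetical `Args` dataclass standing in for the training arguments and the hyperparameters copied from the `BestRun` output above. The `hasattr` guard is an addition in this sketch, meant to catch misspelled keys.

```python
from dataclasses import dataclass

@dataclass
class Args:
    """Hypothetical stand-in for the notebook's training arguments."""
    num_train_epochs: int = 7
    learning_rate: float = 6e-5
    alpha: float = 0.5
    temperature: float = 4.0
    output_dir: str = "distilroberta-base-sst2-distilled"

best_hyperparameters = {
    "num_train_epochs": 10,
    "learning_rate": 4.354784416636035e-05,
    "alpha": 0.22918617625637505,
    "temperature": 9,
}

args = Args()
for k, v in best_hyperparameters.items():
    if not hasattr(args, k):            # guard added in this sketch
        raise AttributeError(f"unknown training argument: {k}")
    setattr(args, k, v)
args.output_dir = "distilroberta-best"
print(args)
```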

## Final Training

In [140]:
# New Trainer with the updated parameters
optimal_trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

optimal_trainer.train()

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 45678
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 3570


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3303,0.359125,0.763918
2,0.3238,0.362589,0.753596
3,0.3236,0.36197,0.765345
4,0.3202,0.366088,0.753926
5,0.3181,0.361465,0.772263
6,0.3164,0.365798,0.762271
7,0.315,0.367955,0.753267
8,0.3141,0.363063,0.773032
9,0.3132,0.364926,0.767761
10,0.3128,0.364964,0.767212


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, token_type_ids, guid, url are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 128
Saving model checkpoint to distilroberta-best/checkpoint-357
Configuration saved in distilroberta-best/checkpoint-357/config.json
Model weights saved in distilroberta-best/checkpoint-357/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-357/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-357/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, date, token_type_ids, guid, url. If title, date, tok

Configuration saved in distilroberta-best/checkpoint-3570/config.json
Model weights saved in distilroberta-best/checkpoint-3570/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-3570/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-3570/special_tokens_map.json
Deleting older checkpoint [distilroberta-best/checkpoint-3213] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilroberta-best/checkpoint-2856 (score: 0.7730317338311189).


TrainOutput(global_step=3570, training_loss=0.3187510471717984, metrics={'train_runtime': 482.5227, 'train_samples_per_second': 946.65, 'train_steps_per_second': 7.399, 'total_flos': 2774037043248840.0, 'train_loss': 0.3187510471717984, 'epoch': 10.0})