# Task-specific knowledge distillation for BERT using Hugging Face Transformers
### Text Classification Example using `BERT-Base` as Teacher and `BERT-Tiny` as Student

Welcome to our end-to-end task-specific knowledge distillation text-classification example using Transformers, PyTorch & Amazon SageMaker. Distillation is the process of training a small "student" model to mimic a larger "teacher" model. In this example, we will use [BERT-base](https://huggingface.co/textattack/bert-base-uncased-SST-2) as the Teacher and [BERT-Tiny](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) as the Student. We will use [Text-Classification](https://huggingface.co/tasks/text-classification) as the task for task-specific knowledge distillation and the [Stanford Sentiment Treebank v2 (SST-2)](https://paperswithcode.com/dataset/sst) dataset for training.


There are two different types of knowledge distillation: task-agnostic knowledge distillation (right) and task-specific knowledge distillation (left). In this example we are going to use task-specific knowledge distillation.

![knowledge-distillation](./imgs/knowledge-distillation.png)
_Task-specific distillation (left) versus task-agnostic distillation (right). Figure from FastFormers by Y. Kim and H. Awadalla [arXiv:2010.13382]._


In task-specific knowledge distillation, a "second step of distillation" is used to "fine-tune" the model on a given dataset. This idea comes from the [DistilBERT paper](https://arxiv.org/pdf/1910.01108.pdf), which showed that a student trained with this additional distillation step performed better than one obtained by simply fine-tuning the distilled language model:

> We also studied whether we could add another step of distillation during the adaptation phase by fine-tuning DistilBERT on SQuAD using a BERT model previously fine-tuned on SQuAD as a teacher for an additional term in the loss (knowledge distillation). In this setting, there are thus two successive steps of distillation, one during the pre-training phase and one during the adaptation phase. In this case, we were able to reach interesting performances given the size of the model:79.8 F1 and 70.4 EM, i.e. within 3 points of the full model.

If you are interested in these topics, you should definitely read:
* [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108)
* [FastFormers: Highly Efficient Transformer Models for Natural Language Understanding](https://arxiv.org/abs/2010.13382)

The [FastFormers paper](https://arxiv.org/abs/2010.13382) in particular contains great research on what works and what doesn't work when using knowledge distillation.

---

Huge thanks to [Lewis Tunstall](https://www.linkedin.com/in/lewis-tunstall/) and his great [Weeknotes: Distilling distilled transformers](https://lewtun.github.io/blog/weeknotes/nlp/huggingface/transformers/2021/01/17/wknotes-distillation-and-generation.html#fn-1)


## Installation

In [1]:
#%pip install "pytorch==1.10.1"
%pip install transformers datasets tensorboard --upgrade

!sudo apt-get install git-lfs

Note: you may need to restart the kernel to use updated packages.


'sudo' is not recognized as an internal or external command,
operable program or batch file.


In [2]:
%pip install accelerate==0.15.0 -U
%pip install transformers[torch]==4.28.1
%pip install prettytable

Note: you may need to restart the kernel to use updated packages.
Collecting transformers==4.28.1 (from transformers[torch]==4.28.1)
  Using cached transformers-4.28.1-py3-none-any.whl.metadata (109 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.1->transformers[torch]==4.28.1)
  Using cached tokenizers-0.13.3-cp311-cp311-win_amd64.whl.metadata (6.9 kB)
Using cached transformers-4.28.1-py3-none-any.whl (7.0 MB)
Using cached tokenizers-0.13.3-cp311-cp311-win_amd64.whl (3.5 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.40.1
    Uninstalling transformers-4.40.1:
      Successfully uninstalled transformers-4.40.1
Successfully installed tokenizers-0.13.3 transformers-4.28.1

This example will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To be able to push our model to the Hub, you need to register on [Hugging Face](https://huggingface.co/join).
If you already have an account, you can skip this step.
Once you have an account, we will use the `notebook_login` util from the `huggingface_hub` package to log into our account and store our token (access key) on disk.

In [1]:
from huggingface_hub import notebook_login

notebook_login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Setup & Configuration

In this step we will define global configurations and parameters, which are used across the whole end-to-end fine-tuning process, e.g. the `teacher` and `student` we will use.

In this example, we will use [BERT-base](https://huggingface.co/textattack/bert-base-uncased-SST-2) as the Teacher and [BERT-Tiny](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) as the Student. Our Teacher is already fine-tuned on our dataset, which lets us start the distillation training job directly rather than fine-tuning the teacher first and distilling it afterwards.

_**IMPORTANT**: This example will only work with a `Teacher` & `Student` combination where the Tokenizer is creating the same output._

Additionally, the [FastFormers: Highly Efficient Transformer Models for Natural Language Understanding](https://arxiv.org/abs/2010.13382) paper describes a further phenomenon:
> In our experiments, we have observed that distilled models do not work well when distilled to a different model type. Therefore, we restricted our setup to avoid distilling RoBERTa model to BERT or vice versa. The major difference between the two model groups is the input token (sub-word) embedding. We think that different input embedding spaces result in different output embedding spaces, and knowledge transfer with different spaces does not work well

In [2]:
import collections
from typing import Union, List
import numpy as np
from tqdm.auto import tqdm
from transformers.trainer_utils import PredictionOutput
from transformers.tokenization_utils import PreTrainedTokenizer
from transformers import TrainingArguments, Trainer, EvalPrediction, default_data_collator
from datasets import load_metric
student_id = "google/electra-small-discriminator"
teacher_id = "ahotrod/electra_large_discriminator_squad2_512"

# name for our repository on the hub
repo_name = "electra-distilled-qa"

Below are some checks to make sure the `Teacher` & `Student` are creating the same output.

In [3]:
from transformers import AutoTokenizer

# init tokenizer
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_id)
student_tokenizer = AutoTokenizer.from_pretrained(student_id)

# sample input
sample = "This is a basic example, with different words to test."

# assert results
assert teacher_tokenizer(sample) == student_tokenizer(sample), "Tokenizers haven't created the same output"
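
The check above compares a single sentence. A slightly broader sketch is shown below, assuming a hypothetical helper `tokenizers_agree` (not part of Transformers) that accepts any two callables returning a mapping with an `input_ids` entry. The toy whitespace tokenizers are stand-ins that only exist to keep the snippet self-contained.

```python
def tokenizers_agree(tokenizer_a, tokenizer_b, samples):
    """Return the samples for which the two tokenizers yield different input_ids.

    Hypothetical helper: an empty list means the tokenizers agreed on every sample.
    """
    mismatches = []
    for text in samples:
        ids_a = list(tokenizer_a(text)["input_ids"])
        ids_b = list(tokenizer_b(text)["input_ids"])
        if ids_a != ids_b:
            mismatches.append(text)
    return mismatches


def make_toy_tokenizer(vocab):
    """Build a stand-in whitespace tokenizer; unknown words map to id 0."""
    def tokenize(text):
        return {"input_ids": [vocab.get(word, 0) for word in text.lower().split()]}
    return tokenize


# Two tokenizers with the same vocabulary and one with a different id for "test".
shared_vocab = {"this": 1, "is": 2, "a": 3, "test": 4}
other_vocab = {"this": 1, "is": 2, "a": 3, "test": 9}
tok_a = make_toy_tokenizer(shared_vocab)
tok_b = make_toy_tokenizer(shared_vocab)
tok_c = make_toy_tokenizer(other_vocab)

samples = ["This is a test", "this is"]
print(tokenizers_agree(tok_a, tok_b, samples))  # no mismatches expected
print(tokenizers_agree(tok_a, tok_c, samples))  # only the sample containing "test" differs
```

With the real tokenizers, `tokenizers_agree(teacher_tokenizer, student_tokenizer, samples)` would be expected to return an empty list for a compatible Teacher and Student pair.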


## Dataset & Pre-processing

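For background on the pre-processing used here, see the question-answering chapter of the Hugging Face NLP course: https://huggingface.co/learn/nlp-course/en/chapter7/7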

In [4]:
def prepare_train_features(examples: Union[str, List[str], List[List[str]]], tokenizer: PreTrainedTokenizer, 
                           pad_on_right: bool, max_length: int=512, doc_stride: int=128):
    "Tokenize and encode training examples in the SQuAD format"
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # label impossible answers with the index of the CLS token
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

def prepare_validation_features(examples, tokenizer, pad_on_right, max_length, doc_stride):
    "Tokenize and encode validation examples in the SQuAD format"
    tokenized_examples = tokenizer(
        examples['question' if pad_on_right else 'context'],
        examples['context' if pad_on_right else 'question'],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

def convert_examples_to_features(dataset, tokenizer, num_train_examples, num_eval_examples, 
                                 max_length=512, doc_stride=128, seed=42):
    "Tokenize and encode the training and validation examples in the SQuAD format"
    pad_on_right = tokenizer.padding_side == "right"
    fn_kwargs = {
        "tokenizer": tokenizer,
        "max_length": max_length,
        "doc_stride": doc_stride,
        "pad_on_right": pad_on_right
    }
    train_enc = (dataset['train']
                 .shuffle(seed=seed)
                 .select(range(num_train_examples))
                 .map(prepare_train_features, fn_kwargs=fn_kwargs, batched=True, remove_columns=dataset["train"].column_names)
                )
    eval_enc = (dataset['validation']
                .shuffle(seed=seed)
                .select(range(num_eval_examples))
                .map(prepare_validation_features, fn_kwargs=fn_kwargs, batched=True, remove_columns=dataset["validation"].column_names)
               )
    eval_examples = dataset['validation'].shuffle(seed=seed).select(range(num_eval_examples))

    return train_enc, eval_enc, eval_examples
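
The central step in `prepare_train_features` is converting a character-level answer span into token indices with the help of the offset mapping. The following is a simplified sketch of that idea only: the helper name `char_span_to_token_span` and the hand-written offsets are hypothetical, and special tokens, the question segment and overflow windows are ignored.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to an inclusive (start_token, end_token) pair.

    offsets: list of (char_start, char_end) tuples, one per context token.
    Returns None when the answer is not fully contained in the tokens,
    which is the case the notebook labels with the CLS index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Walk forward past every token that starts at or before the answer start.
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1

    # Walk backward past every token that ends at or after the answer end.
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1

    return start_token, end_token


# Toy context "The cat sat on the mat" split on whitespace.
toy_offsets = [(0, 3), (4, 7), (8, 11), (12, 14), (15, 18), (19, 22)]

# The answer "sat on" covers characters 8..14, i.e. tokens 2 and 3.
print(char_span_to_token_span(toy_offsets, 8, 14))
# A span beyond the end of the window cannot be mapped.
print(char_span_to_token_span(toy_offsets, 30, 35))
```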

In [5]:
dataset_id="squad_v2"
dataset_config="sst2"

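To load the dataset selected by `dataset_id` (`squad_v2`), we use the `load_dataset()` method from the 🤗 Datasets library. Note that `dataset_config` is not passed to `load_dataset()` in the next cell.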


In [6]:
from datasets import load_dataset

dataset = load_dataset(dataset_id)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})
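
SQuAD v2 contains questions that have no answer in the context, and `prepare_train_features` detects them with `len(answers["answer_start"]) == 0`. The sketch below applies the same test to two hand-written records that mimic the `answers` feature; the records and the helper name `is_unanswerable` are illustrative assumptions and are not taken from the dataset.

```python
def is_unanswerable(example):
    """Return True when the example carries no answer span."""
    return len(example["answers"]["answer_start"]) == 0


# Hand-written toy records in the same layout as the dataset features.
answerable = {
    "id": "toy-1",
    "question": "Where did the cat sit?",
    "context": "The cat sat on the mat.",
    "answers": {"text": ["on the mat"], "answer_start": [12]},
}
unanswerable = {
    "id": "toy-2",
    "question": "What colour was the dog?",
    "context": "The cat sat on the mat.",
    "answers": {"text": [], "answer_start": []},
}

for record in (answerable, unanswerable):
    print(record["id"], is_unanswerable(record))

# The answer_start offset indexes into the context string.
start = answerable["answers"]["answer_start"][0]
text = answerable["answers"]["text"][0]
print(answerable["context"][start:start + len(text)])
```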

### Pre-processing & Tokenization

To distill our model we need to convert our "Natural Language" to token IDs. This is done by a 🤗 Transformers Tokenizer, which tokenizes the inputs and converts the tokens to their corresponding IDs in the pretrained vocabulary. If you are not sure what this means, check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=tf) of the Hugging Face Course.

We are going to use the tokenizer of the `Teacher`, but since both tokenizers produce the same output you could also use the `Student` tokenizer.


In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(teacher_id)

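Additionally, the pre-processing functions defined above truncate only the context (`truncation="only_second"` when the tokenizer pads on the right) to `max_length=512` tokens, so texts that are longer than the maximum size allowed by the model are split into overlapping windows.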
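
The pre-processing functions pass `stride=doc_stride` and `return_overflowing_tokens=True` to the tokenizer, so a long context yields several overlapping features. The sketch below illustrates the windowing idea on a plain list. It assumes that `stride` is the number of tokens shared by consecutive windows, and the helper name `sliding_windows` is hypothetical. Special tokens, the question segment and padding are left out.

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into windows of at most `window` items that overlap by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")

    step = window - stride  # how far the window advances each time
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows


toy_tokens = list(range(10))
for chunk in sliding_windows(toy_tokens, window=4, stride=2):
    print(chunk)
```

Each window repeats the last `stride` tokens of the previous one, which keeps an answer that straddles a boundary fully visible in at least one window, as long as it is not longer than the overlap.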

In [10]:
num_train_examples = 3200
num_eval_examples = 320
train_ds, eval_ds, eval_examples = convert_examples_to_features(dataset, tokenizer, num_train_examples, num_eval_examples)
assert eval_examples.num_rows == num_eval_examples

In [11]:
metric = load_metric("squad_v2")

def squad_metrics(p: EvalPrediction):
    "Compute the Exact Match and F1-score metrics on SQuAD"
    return metric.compute(predictions=p.predictions, references=p.label_ids)

  metric = load_metric("squad_v2")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
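
`squad_metrics` reports Exact Match and F1. To make the two numbers concrete, here is a simplified sketch of both for a single prediction. It only lower-cases and splits on whitespace, which is an assumption made for brevity: the official SQuAD scoring also strips punctuation and articles and takes the maximum over several reference answers. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the lower-cased, stripped strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match (the no-answer case); otherwise a miss.
        return float(pred_tokens == ref_tokens)

    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris "))
print(token_f1("the cat sat", "cat sat down"))
print(token_f1("", ""))
```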


## Distilling the model using `PyTorch` and `DistillationTrainer`


Now that our `dataset` is processed, we can distill the model. Normally, when fine-tuning a transformer model using PyTorch you should go with the `Trainer-API`. The [Trainer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.Trainer) class provides an API for feature-complete training in PyTorch for most standard use cases.

In our example we cannot use the `Trainer` out-of-the-box, since we need to pass in two models, the `Teacher` and the `Student`, and combine the student's task loss with a distillation loss computed against the teacher's outputs. But we can subclass the `Trainer` to create a `DistillationTrainer` (named `QuestionAnsweringTrainer` in the code below) that takes care of this by overriding the [compute_loss](https://github.com/huggingface/transformers/blob/c4ad38e5ac69e6d96116f39df789a2369dd33c21/src/transformers/trainer.py#L1962) method and the `__init__` method. In addition, we also need to subclass `TrainingArguments` to include our distillation hyperparameters, `alpha` and `temperature`.
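
For reference, the loss that such a `compute_loss` override combines has the form `alpha * hard_loss + (1 - alpha) * T**2 * KL(teacher || student)`, where both distributions are softened by the temperature `T`. The following is a minimal pure-Python sketch for a single example with one logit vector. The helper names are hypothetical, and the trainer in this notebook instead uses `nn.KLDivLoss` on batched start and end logits and averages the two KL terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_teacher, p_student):
    """KL(teacher || student) for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)


def distillation_loss(student_logits, teacher_logits, hard_loss, alpha=0.5, temperature=4.0):
    """Weighted sum of the task loss and the softened teacher-student KL term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T**2 factor keeps the soft-target term on a comparable scale across temperatures.
    soft_loss = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * hard_loss + (1.0 - alpha) * soft_loss


teacher = [1.0, 2.0, 3.0]
# A student that matches the teacher only pays the weighted task loss.
print(distillation_loss(teacher, teacher, hard_loss=2.0))
# A student that disagrees pays an additional KL penalty.
print(distillation_loss([3.0, 2.0, 1.0], teacher, hard_loss=2.0))
# Softening: the largest probability shrinks as the temperature grows.
print(max(softmax(teacher, 1.0)), max(softmax(teacher, 4.0)))
```

With `alpha=0.5` the two terms are weighted equally, which matches the default values used in the training arguments of this notebook.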


In [13]:
from transformers import TrainingArguments, Trainer
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAnsweringTrainingArguments(TrainingArguments):
    def __init__(self, *args, max_length=384, doc_stride=128, version_2_with_negative=True, 
                 null_score_diff_threshold=0., n_best_size=20, max_answer_length=30,  alpha=.5, temperature=4, **kwargs):
        super().__init__(*args, **kwargs)
        
        self.max_length = max_length
        self.doc_stride = doc_stride
        self.version_2_with_negative = version_2_with_negative
        self.null_score_diff_threshold = null_score_diff_threshold
        self.n_best_size = n_best_size
        self.max_answer_length = max_answer_length
        self.disable_tqdm = False
        self.alpha = alpha
        self.temperature = temperature
     

#export
class QuestionAnsweringTrainer(Trainer):
    def __init__(self, *args, eval_examples=None, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.eval_examples = eval_examples
        self.data_collator = default_data_collator
        self.compute_metrics = squad_metrics
        self.teacher = teacher_model
        # place teacher on same device as student
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False):
        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        
        
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)
        assert outputs_student.start_logits.size() == outputs_teacher.start_logits.size()
        assert outputs_student.end_logits.size() == outputs_teacher.end_logits.size()
        # Soften probabilities and compute distillation loss
        loss_function = nn.KLDivLoss(reduction="batchmean")
        kl_loss = dict()
        kl_loss['start_logits'] = loss_function(
            F.log_softmax(outputs_student.start_logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.start_logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2)
        kl_loss['end_logits'] = loss_function(
            F.log_softmax(outputs_student.end_logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.end_logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2)
        
        loss = student_loss * self.args.alpha + ((kl_loss['start_logits'] + kl_loss["end_logits"]) / 2) * (1. - self.args.alpha)

        return (loss, outputs_student) if return_outputs else loss
        
    def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None):
        eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
        eval_dataloader = self.get_eval_dataloader(eval_dataset)
        eval_examples = self.eval_examples if eval_examples is None else eval_examples

        compute_metrics = self.compute_metrics
        self.compute_metrics = None
        try:
            output = self.prediction_loop(
                eval_dataloader,
                description="Evaluation",
                prediction_loss_only=True if compute_metrics is None else None,
                ignore_keys=ignore_keys,
            )
        finally:
            self.compute_metrics = compute_metrics
        eval_dataset.set_format(type=eval_dataset.format["type"], columns=list(eval_dataset.features.keys()))

        if self.compute_metrics is not None:
            eval_preds = self._post_process_function(eval_examples, eval_dataset, output.predictions)
            metrics = self.compute_metrics(eval_preds)
            # For some reason the eval_loss is not returned in output's metrics
            # Work around since NotebookProgressCallback assumes eval_loss key exists
            metrics['eval_loss'] = 'No log'

            self.log(metrics)
        else:
            metrics = {}
            
        for key in list(metrics.keys()):
            if not key.startswith(f"eval_"):
                metrics[f"eval_{key}"] = metrics.pop(key)

        self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics)
        return metrics

    def predict(self, test_dataset, test_examples, ignore_keys=None):
        test_dataloader = self.get_test_dataloader(test_dataset)
        compute_metrics = self.compute_metrics
        self.compute_metrics = None
        try:
            output = self.prediction_loop(
                test_dataloader,
                description="Evaluation",
                prediction_loss_only=True if compute_metrics is None else None,
                ignore_keys=ignore_keys,
            )
        finally:
            self.compute_metrics = compute_metrics

        if self.compute_metrics is None:
            return output

        test_dataset.set_format(type=test_dataset.format["type"], columns=list(test_dataset.features.keys()))
        eval_preds = self._post_process_function(test_examples, test_dataset, output.predictions)
        metrics = self.compute_metrics(eval_preds)

        return PredictionOutput(predictions=eval_preds.predictions, label_ids=eval_preds.label_ids, metrics=metrics)
    
    
    def _post_process_function(self, examples, features, predictions):
        predictions = self._postprocess_qa_predictions(
            examples=examples,
            features=features,
            predictions=predictions,
            version_2_with_negative=self.args.version_2_with_negative,
            n_best_size=self.args.n_best_size,
            max_answer_length=self.args.max_answer_length,
            null_score_diff_threshold=self.args.null_score_diff_threshold,
            output_dir=self.args.output_dir,
            is_world_process_zero=self.is_world_process_zero(),
        )
        if self.args.version_2_with_negative:
            formatted_predictions = [
                {"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items()
            ]
        else:
            formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]
        # build references from the examples passed in, so predict() scores against test_examples
        references = [{"id": ex["id"], "answers": ex['answers']} for ex in examples]
        return EvalPrediction(predictions=formatted_predictions, label_ids=references)
    
    
    def _postprocess_qa_predictions(
        self,
        examples,
        features,
        predictions,
        version_2_with_negative= False,
        n_best_size = None,
        max_answer_length = None,
        null_score_diff_threshold = None,
        output_dir = None,
        prefix = None,
        is_world_process_zero = True,
    ):
        assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)."
        all_start_logits, all_end_logits = predictions
        assert len(predictions[0]) == len(features), f"Got {len(predictions[0])} predictions and {len(features)} features."

        example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
        features_per_example = collections.defaultdict(list)
        for i, feature in enumerate(features):
            features_per_example[example_id_to_index[feature["example_id"]]].append(i)

        all_predictions = collections.OrderedDict()

        for example_index, example in enumerate(tqdm(examples)):
            feature_indices = features_per_example[example_index]
            min_null_prediction = None
            prelim_predictions = []

            for feature_index in feature_indices:
                start_logits = all_start_logits[feature_index]
                end_logits = all_end_logits[feature_index]
                offset_mapping = features[feature_index]["offset_mapping"]
                token_is_max_context = features[feature_index].get("token_is_max_context", None)
                feature_null_score = start_logits[0] + end_logits[0]
                if min_null_prediction is None or min_null_prediction["score"] > feature_null_score:
                    min_null_prediction = {
                        "offsets": (0, 0),
                        "score": feature_null_score,
                        "start_logit": start_logits[0],
                        "end_logit": end_logits[0],
                    }

                start_indexes = np.argsort(start_logits)[-1 : -self.args.n_best_size - 1 : -1].tolist()
                end_indexes = np.argsort(end_logits)[-1 : -self.args.n_best_size - 1 : -1].tolist()
                for start_index in start_indexes:
                    for end_index in end_indexes:
                        if (
                            start_index >= len(offset_mapping)
                            or end_index >= len(offset_mapping)
                            or offset_mapping[start_index] is None
                            or offset_mapping[end_index] is None
                        ):
                            continue
                        if end_index < start_index or end_index - start_index + 1 > self.args.max_answer_length:
                            continue
                        if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False):
                            continue
                        prelim_predictions.append(
                            {
                                "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]),
                                "score": start_logits[start_index] + end_logits[end_index],
                                "start_logit": start_logits[start_index],
                                "end_logit": end_logits[end_index],
                            }
                        )
            if self.args.version_2_with_negative:
                prelim_predictions.append(min_null_prediction)
                null_score = min_null_prediction["score"]

            predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:self.args.n_best_size]
            if self.args.version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions):
                predictions.append(min_null_prediction)

            context = example["context"]
            for pred in predictions:
                offsets = pred["offsets"]
                pred["text"] = context[offsets[0] : offsets[1]]

            if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""):
                predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0})

            scores = np.array([pred.pop("score") for pred in predictions])
            exp_scores = np.exp(scores - np.max(scores))
            probs = exp_scores / exp_scores.sum()

            for prob, pred in zip(probs, predictions):
                pred["probability"] = prob

            if not self.args.version_2_with_negative:
                all_predictions[example["id"]] = predictions[0]["text"]
            else:
                i = 0
                while predictions[i]["text"] == "":
                    i += 1
                best_non_null_pred = predictions[i]

                score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"]
                if score_diff > self.args.null_score_diff_threshold:
                    all_predictions[example["id"]] = ""
                else:
                    all_predictions[example["id"]] = best_non_null_pred["text"]
        return all_predictions
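
The post-processing in `_postprocess_qa_predictions` boils down to two decisions: pick the best-scoring valid `(start, end)` pair among the top candidates, then compare it against the null (no answer) score. The sketch below isolates both decisions for one feature, using plain lists instead of NumPy arrays. The function names are hypothetical, and offset handling, multiple features per example and probability computation are omitted.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (start, end, score) of the best valid span, or None if there is none."""
    def top_indices(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or exceed the length limit.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


def predicts_no_answer(null_score, best_non_null_score, threshold=0.0):
    """Predict the empty answer when the null score wins by more than the threshold."""
    return null_score - best_non_null_score > threshold


toy_start = [0.1, 5.0, 0.2, 0.3]
toy_end = [0.0, 0.1, 4.0, 6.0]

print(best_span(toy_start, toy_end))                       # longest span allowed
print(best_span(toy_start, toy_end, max_answer_length=2))  # span limited to two tokens

# Index 0 plays the role of the CLS token, whose logits give the null score.
null_score = toy_start[0] + toy_end[0]
print(predicts_no_answer(null_score, best_span(toy_start, toy_end)[2]))
```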

### Hyperparameter Definition, Model Loading

In [14]:
from huggingface_hub import HfFolder
import torch
from transformers import AutoModelForQuestionAnswering

def model_init():
    return AutoModelForQuestionAnswering.from_pretrained(student_id)

args = QuestionAnsweringTrainingArguments(
    output_dir=repo_name,
    num_train_epochs=3,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    fp16=True,
    learning_rate=6e-5,
    seed=33,
    # logging & evaluation strategies
    logging_dir=f"{repo_name}/logs",
    logging_strategy="epoch", # to get more information to TB
    evaluation_strategy='epoch',
    save_strategy="epoch",
    report_to="tensorboard",
    # push to hub parameters
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repo_name,
    hub_token=HfFolder.get_token(),
    # distilation parameters
    alpha=0.5,
    temperature=4.0
)

# define model
teacher_model = AutoModelForQuestionAnswering.from_pretrained(
    teacher_id,
)

trainer = QuestionAnsweringTrainer(
    args=args,
    model_init=model_init,
    teacher_model=teacher_model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    eval_examples=eval_examples,
    tokenizer=tokenizer
)

Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['qa_outputs.weight', 'qa_output
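
As a back-of-the-envelope check on the training arguments, the number of optimisation steps can be estimated from the batch size and the number of epochs. The sketch assumes a single device and no gradient accumulation. The feature count of 3300 is a hypothetical value, chosen because windowing long contexts can yield somewhat more features than the 3200 selected examples; the actual count is not shown in this notebook.

```python
import math


def total_training_steps(num_features, batch_size, epochs):
    """Estimate optimiser steps, assuming one device and no gradient accumulation."""
    batches_per_epoch = math.ceil(num_features / batch_size)
    return batches_per_epoch * epochs


# Hypothetical feature count, with the batch size and epochs from the arguments above.
steps = total_training_steps(num_features=3300, batch_size=128, epochs=3)
print(steps)
```

Under these assumptions the estimate is 26 batches per epoch, which is consistent with the total of 78 steps shown by the training progress bar.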

In [15]:
trainer.train()


  0%|          | 0/78 [00:00<?, ?it/s]

{'loss': 4.4486, 'learning_rate': 4.0769230769230773e-05, 'epoch': 1.0}




  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/320 [00:00<?, ?it/s]

Trainer is attempting to log a value of "No log" of type <class 'str'> for key "eval/loss" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


[{'id': '5733ea04d058e614000b6595', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad26a5fd7d075001a42931b', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a664447c2b11c001a425eef', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad3ff1b604f3c001a3ffc74', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad25d39d7d075001a428eda', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad265d2d7d075001a4291c6', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a1c8ea7b4fb5d00187146d1', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a57c667770dc0001aeefd69', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5705fd8475f01819005e7841', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5726431d271a42140099d7f9', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a8929e43b2508001a72a4de', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5706143575f0




### Evaluation metric

We can create a `compute_metrics` function to evaluate our model on the test set. This function is called during training to compute the `accuracy` & `f1` of our model.
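As a rough illustration of what such a function has to do, here is a minimal sketch in plain Python. It assumes that the evaluation input unpacks into `(logits, labels)` and that the task is binary classification with label `1` as the positive class. The helper name `accuracy_and_f1` and the example values are hypothetical and not taken from this notebook.

```python
def accuracy_and_f1(predictions, labels, positive_label=1):
    """Return accuracy and binary F1 for two equal-length label sequences."""
    pairs = list(zip(predictions, labels))
    correct = sum(1 for p, y in pairs if p == y)
    tp = sum(1 for p, y in pairs if p == positive_label and y == positive_label)
    fp = sum(1 for p, y in pairs if p == positive_label and y != positive_label)
    fn = sum(1 for p, y in pairs if p != positive_label and y == positive_label)
    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when that denominator is zero.
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(pairs),
        "f1": 2 * tp / denom if denom else 0.0,
    }


def compute_metrics(eval_pred):
    """Assumes `eval_pred` unpacks into (logits, labels)."""
    logits, labels = eval_pred
    # Arg-max over the class dimension, one row at a time.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    return accuracy_and_f1(predictions, list(labels))


# Hypothetical logits for four examples with two classes each.
example_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.1]]
example_labels = [0, 1, 0, 1]
print(compute_metrics((example_logits, example_labels)))
```

In practice the same numbers are usually computed with a metrics library; the sketch only spells out the arithmetic behind `accuracy` and `f1`.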

In [None]:
from datasets import load_metric
import numpy as np

def count_parameters(model):
    """Return the total number of parameters in `model`, trainable or not."""
    return sum(p.numel() for p in model.parameters())

print(count_parameters(teacher_model))
print(count_parameters(trainer.model))

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


334094338
13483522
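To put the two counts into perspective, the short calculation below compares them. It assumes the first number belongs to `teacher_model` and the second to the student held by `trainer.model`, following the order of the two `print` calls.

```python
teacher_params = 334_094_338  # first printed count (assumed: teacher_model)
student_params = 13_483_522   # second printed count (assumed: student in trainer.model)

ratio = teacher_params / student_params
print(f"The student has about {ratio:.1f}x fewer parameters "
      f"({student_params / teacher_params:.1%} of the teacher).")
```

Under that assumption the student is roughly 25 times smaller than the teacher.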


In [110]:
trainer.evaluate()



  0%|          | 0/3 [00:00<?, ?it/s]

<function squad_metrics at 0x000001D8B0AC14E0>


  0%|          | 0/320 [00:00<?, ?it/s]

<transformers.trainer_utils.EvalPrediction object at 0x000001DB25C6ED90>
[{'id': '5733ea04d058e614000b6595', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad26a5fd7d075001a42931b', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a664447c2b11c001a425eef', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad3ff1b604f3c001a3ffc74', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad25d39d7d075001a428eda', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5ad265d2d7d075001a4291c6', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a1c8ea7b4fb5d00187146d1', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a57c667770dc0001aeefd69', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5705fd8475f01819005e7841', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5726431d271a42140099d7f9', 'prediction_text': '', 'no_answer_probability': 0.0}, {'id': '5a8929e43b2508001a72a4de', '

Trainer is attempting to log a value of "No log" of type <class 'str'> for key "eval/loss" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 'No log',
 'eval_exact': 52.1875,
 'eval_f1': 52.1875,
 'eval_total': 320,
 'eval_HasAns_exact': 0.0,
 'eval_HasAns_f1': 0.0,
 'eval_HasAns_total': 150,
 'eval_NoAns_exact': 98.23529411764706,
 'eval_NoAns_f1': 98.23529411764706,
 'eval_NoAns_total': 170,
 'eval_best_exact': 53.125,
 'eval_best_exact_thresh': 0.0,
 'eval_best_f1': 53.125,
 'eval_best_f1_thresh': 0.0}

## Training

Start training by calling `trainer.train()`.

In [14]:
"""
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
"""


Start training using the `DistillationTrainer`.

## Hyperparameter Search for Distillation Parameters `alpha` & `temperature` with Optuna

The `alpha` & `temperature` parameters of the `DistillationTrainer` can also be tuned with a hyperparameter search to maximize our "knowledge extraction". As the hyperparameter optimization framework we are using [Optuna](https://optuna.org/), which has an integration into the `Trainer` API. Since the `DistillationTrainer` is a subclass of the `Trainer`, we can use `hyperparameter_search` without any code changes.
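To see what these two parameters control, here is a minimal sketch of a commonly used distillation loss for a single example, written in plain Python. The `DistillationTrainer` is not defined in this section, so the exact formula is an assumption: `alpha` weights the hard-label cross-entropy, `1 - alpha` weights the KL divergence between the temperature-softened teacher and student distributions, and the KL term is scaled by `temperature ** 2`. All logits below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha, temperature):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label cross-entropy on the student's unsoftened distribution.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[label])
    # KL divergence between the temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Hypothetical logits: the teacher is confident, the student less so.
student, teacher = [0.2, 0.8], [-2.0, 3.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, distillation_loss(student, teacher, label=1, alpha=alpha, temperature=4.0))
```

With `alpha=1.0` only the hard labels matter and the temperature has no effect; with `alpha=0.0` the student learns purely from the teacher's softened outputs.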


In [16]:
%pip install optuna

Note: you may need to restart the kernel to use updated packages.


To do hyperparameter optimization using `optuna` we need to define our hyperparameter space. In this example we are searching over `num_train_epochs`, `learning_rate`, `alpha` & `temperature` for our `student_model`.

In [17]:
def hp_space(trial):
    return {
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "alpha": trial.suggest_float("alpha", 0, 1),
        "temperature": trial.suggest_int("temperature", 2, 30),
    }
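The `log=True` flag on `learning_rate` asks for values that are spread evenly across orders of magnitude instead of evenly across the raw interval. The sketch below imitates that idea with the standard library only. It is not Optuna's implementation (Optuna's default sampler is more sophisticated than independent random draws), and the helper name `sample_log_uniform` is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]

# 1e-4 is the geometric midpoint of [1e-5, 1e-3], so about half of the
# draws should fall below it. With plain uniform sampling, only about 9%
# of the interval lies below 1e-4.
below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share of samples below 1e-4: {below_1e4:.2f}")
```

This is why a log scale is a common choice for learning rates: small values such as `2e-5` get as much attention as large ones such as `5e-4`.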

To start our hyperparameter search we just need to call `hyperparameter_search` and provide our `hp_space` and the number of trials to run.
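The search compares trials by a single number per trial. As far as we understand the `Trainer` default (used when no `compute_objective` is passed), that number is the sum of all evaluation metrics except the loss, the epoch counter, and the speed statistics, falling back to the loss if nothing else is left. The function below is a sketch of that behaviour written for illustration, not the library's code. The metrics dictionary is copied from the last evaluation of Trial 0 in the search log.

```python
def default_objective(metrics):
    """Sum every metric except the loss, the epoch counter, and speed stats."""
    metrics = dict(metrics)  # work on a copy
    loss = metrics.pop("eval_loss", None)
    metrics.pop("epoch", None)
    speed_suffixes = ("_runtime", "_per_second")
    remaining = {k: v for k, v in metrics.items() if not k.endswith(speed_suffixes)}
    # With no other metric left, fall back to the loss itself.
    return loss if not remaining else sum(remaining.values())


# Final evaluation metrics of Trial 0, copied from the search log.
trial_0_metrics = {
    "eval_loss": 1.0067952871322632,
    "eval_accuracy": 0.9151376146788991,
    "eval_runtime": 1.4498,
    "eval_samples_per_second": 601.446,
    "eval_steps_per_second": 4.828,
    "epoch": 3.0,
}
print(default_objective(trial_0_metrics))
```

Only `eval_accuracy` survives the filtering here, which is consistent with the log line `Trial 0 finished with value: 0.9151376146788991` and explains why the search is run with `direction="maximize"`.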

In [18]:
def student_init():
    return AutoModelForQuestionAnswering.from_pretrained(
        student_id,
    )

trainer = DistillationTrainer(
    model_init=student_init,
    args=training_args,
    teacher_model=teacher_model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    n_trials=50,
    direction="maximize",
    hp_space=hp_space
)

print(best_run)

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2024-04-19 01:25:51,439] A new study created in memory with name: no-name-11c6ea15-9913-4e54-8f02-6ebaaab6fbbb
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1581 [00:00<?, ?it/s]

{'loss': 1.4461, 'grad_norm': 32.83808898925781, 'learning_rate': 5.9109468466799715e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8842347860336304, 'eval_accuracy': 0.9151376146788991, 'eval_runtime': 1.4618, 'eval_samples_per_second': 596.511, 'eval_steps_per_second': 4.789, 'epoch': 1.0}
{'loss': 0.7, 'grad_norm': 52.87685775756836, 'learning_rate': 2.96107091088419e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.0896263122558594, 'eval_accuracy': 0.8967889908256881, 'eval_runtime': 1.4481, 'eval_samples_per_second': 602.175, 'eval_steps_per_second': 4.834, 'epoch': 2.0}
{'loss': 0.494, 'grad_norm': 13.138901710510254, 'learning_rate': 1.1194975088409036e-07, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.0067952871322632, 'eval_accuracy': 0.9151376146788991, 'eval_runtime': 1.4498, 'eval_samples_per_second': 601.446, 'eval_steps_per_second': 4.828, 'epoch': 3.0}
{'train_runtime': 337.9113, 'train_samples_per_second': 597.929, 'train_steps_per_second': 4.679, 'train_loss': 0.8800290111949518, 'epoch': 3.0}


[I 2024-04-19 01:32:06,884] Trial 0 finished with value: 0.9151376146788991 and parameters: {'num_train_epochs': 3, 'learning_rate': 8.849627807387343e-05, 'alpha': 0.1559830125797934, 'temperature': 3}. Best is trial 0 with value: 0.9151376146788991.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2635 [00:00<?, ?it/s]

{'loss': 2.5183, 'grad_norm': 66.74252319335938, 'learning_rate': 0.0002000029100592468, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 2.3820443153381348, 'eval_accuracy': 0.875, 'eval_runtime': 1.4493, 'eval_samples_per_second': 601.689, 'eval_steps_per_second': 4.83, 'epoch': 1.0}
{'loss': 1.237, 'grad_norm': 67.3247299194336, 'learning_rate': 0.00015007323994971434, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 2.0578901767730713, 'eval_accuracy': 0.8910550458715596, 'eval_runtime': 1.4579, 'eval_samples_per_second': 598.133, 'eval_steps_per_second': 4.802, 'epoch': 2.0}
{'loss': 0.8517, 'grad_norm': 9.862482070922852, 'learning_rate': 0.00010014356984018184, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 2.112179756164551, 'eval_accuracy': 0.8841743119266054, 'eval_runtime': 1.4398, 'eval_samples_per_second': 605.639, 'eval_steps_per_second': 4.862, 'epoch': 3.0}
{'loss': 0.62, 'grad_norm': 9.713326454162598, 'learning_rate': 5.021389973064936e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 2.1044843196868896, 'eval_accuracy': 0.8784403669724771, 'eval_runtime': 1.4493, 'eval_samples_per_second': 601.685, 'eval_steps_per_second': 4.83, 'epoch': 4.0}
{'loss': 0.4638, 'grad_norm': 34.83477020263672, 'learning_rate': 2.842296211168832e-07, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 2.0922484397888184, 'eval_accuracy': 0.8910550458715596, 'eval_runtime': 1.4483, 'eval_samples_per_second': 602.09, 'eval_steps_per_second': 4.833, 'epoch': 5.0}
{'train_runtime': 563.9714, 'train_samples_per_second': 597.096, 'train_steps_per_second': 4.672, 'train_loss': 1.1381755618940494, 'epoch': 5.0}


[I 2024-04-19 01:42:09,194] Trial 1 finished with value: 0.8910550458715596 and parameters: {'num_train_epochs': 5, 'learning_rate': 0.0002496483505476624, 'alpha': 0.15523880635662113, 'temperature': 16}. Best is trial 0 with value: 0.9151376146788991.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/3162 [00:00<?, ?it/s]

{'loss': 4.9041, 'grad_norm': 3.510751485824585, 'learning_rate': 0.0007235764985208607, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 4.841418266296387, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.4537, 'eval_samples_per_second': 599.861, 'eval_steps_per_second': 4.815, 'epoch': 1.0}
{'loss': 4.9002, 'grad_norm': 0.5188323259353638, 'learning_rate': 0.0005789160983992775, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 4.816441059112549, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.4412, 'eval_samples_per_second': 605.042, 'eval_steps_per_second': 4.857, 'epoch': 2.0}
{'loss': 4.9, 'grad_norm': 1.3247452974319458, 'learning_rate': 0.00043425569827769417, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 4.82835054397583, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.4366, 'eval_samples_per_second': 606.994, 'eval_steps_per_second': 4.873, 'epoch': 3.0}
{'loss': 4.893, 'grad_norm': 0.6336215734481812, 'learning_rate': 0.0002895952981561108, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 4.868070602416992, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.4415, 'eval_samples_per_second': 604.928, 'eval_steps_per_second': 4.856, 'epoch': 4.0}
{'loss': 4.8991, 'grad_norm': 1.2643113136291504, 'learning_rate': 0.0001449348980345275, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 4.832038879394531, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.448, 'eval_samples_per_second': 602.191, 'eval_steps_per_second': 4.834, 'epoch': 5.0}
{'loss': 4.8938, 'grad_norm': 1.2564785480499268, 'learning_rate': 2.744979129441809e-07, 'epoch': 6.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 4.896666526794434, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.4438, 'eval_samples_per_second': 603.96, 'eval_steps_per_second': 4.848, 'epoch': 6.0}
{'train_runtime': 676.7939, 'train_samples_per_second': 597.071, 'train_steps_per_second': 4.672, 'train_loss': 4.898371948010555, 'epoch': 6.0}


[I 2024-04-19 01:54:05,044] Trial 2 finished with value: 0.5091743119266054 and parameters: {'num_train_epochs': 6, 'learning_rate': 0.0008679624007294999, 'alpha': 0.2011698337552854, 'temperature': 5}. Best is trial 0 with value: 0.9151376146788991.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1054 [00:00<?, ?it/s]

{'loss': 1.4419, 'grad_norm': 20.69542694091797, 'learning_rate': 1.4084477147380312e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8999464511871338, 'eval_accuracy': 0.8704128440366973, 'eval_runtime': 1.4417, 'eval_samples_per_second': 604.854, 'eval_steps_per_second': 4.855, 'epoch': 1.0}
{'loss': 0.8235, 'grad_norm': inf, 'learning_rate': 5.335029222492542e-08, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7633256912231445, 'eval_accuracy': 0.9013761467889908, 'eval_runtime': 1.4453, 'eval_samples_per_second': 603.342, 'eval_steps_per_second': 4.843, 'epoch': 2.0}
{'train_runtime': 225.2115, 'train_samples_per_second': 598.095, 'train_steps_per_second': 4.68, 'train_loss': 1.132694425347634, 'epoch': 2.0}


[I 2024-04-19 01:58:30,262] Trial 3 finished with value: 0.9013761467889908 and parameters: {'num_train_epochs': 2, 'learning_rate': 2.8115604002535698e-05, 'alpha': 0.7273510251657795, 'temperature': 7}. Best is trial 0 with value: 0.9151376146788991.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1054 [00:00<?, ?it/s]

{'loss': 1.7548, 'grad_norm': 26.105838775634766, 'learning_rate': 2.1019516058113692e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.9411500692367554, 'eval_accuracy': 0.8922018348623854, 'eval_runtime': 1.4396, 'eval_samples_per_second': 605.724, 'eval_steps_per_second': 4.862, 'epoch': 1.0}
{'loss': 0.93, 'grad_norm': inf, 'learning_rate': 1.1942906851200962e-07, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.9139881730079651, 'eval_accuracy': 0.9071100917431193, 'eval_runtime': 1.4435, 'eval_samples_per_second': 604.075, 'eval_steps_per_second': 4.849, 'epoch': 2.0}
{'train_runtime': 225.0715, 'train_samples_per_second': 598.468, 'train_steps_per_second': 4.683, 'train_loss': 1.3423719315646496, 'epoch': 2.0}


[I 2024-04-19 02:02:55,823] Trial 4 finished with value: 0.9071100917431193 and parameters: {'num_train_epochs': 2, 'learning_rate': 4.195941273721938e-05, 'alpha': 0.4067254850025479, 'temperature': 4}. Best is trial 0 with value: 0.9151376146788991.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/3689 [00:00<?, ?it/s]

{'loss': 1.2835, 'grad_norm': 13.014893531799316, 'learning_rate': 0.00020745409608256818, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 02:04:48,981] Trial 5 pruned. 


{'eval_loss': 1.5194565057754517, 'eval_accuracy': 0.8555045871559633, 'eval_runtime': 1.4482, 'eval_samples_per_second': 602.145, 'eval_steps_per_second': 4.834, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4743 [00:00<?, ?it/s]

{'loss': 1.6952, 'grad_norm': 17.92670440673828, 'learning_rate': 2.010330074098376e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.9234904050827026, 'eval_accuracy': 0.8910550458715596, 'eval_runtime': 1.446, 'eval_samples_per_second': 603.052, 'eval_steps_per_second': 4.841, 'epoch': 1.0}
{'loss': 0.8731, 'grad_norm': 43.87211990356445, 'learning_rate': 1.759575125325375e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.770187258720398, 'eval_accuracy': 0.9059633027522935, 'eval_runtime': 1.4489, 'eval_samples_per_second': 601.851, 'eval_steps_per_second': 4.831, 'epoch': 2.0}
{'loss': 0.6622, 'grad_norm': 13.310898780822754, 'learning_rate': 1.508343456117444e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7737818956375122, 'eval_accuracy': 0.9139908256880734, 'eval_runtime': 1.4497, 'eval_samples_per_second': 601.516, 'eval_steps_per_second': 4.829, 'epoch': 3.0}
{'loss': 0.5443, 'grad_norm': 12.57313060760498, 'learning_rate': 1.2571117869095134e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7234447598457336, 'eval_accuracy': 0.926605504587156, 'eval_runtime': 1.451, 'eval_samples_per_second': 600.977, 'eval_steps_per_second': 4.824, 'epoch': 4.0}
{'loss': 0.48, 'grad_norm': 28.64960289001465, 'learning_rate': 1.0058801177015824e-05, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7743226289749146, 'eval_accuracy': 0.9174311926605505, 'eval_runtime': 1.4478, 'eval_samples_per_second': 602.308, 'eval_steps_per_second': 4.835, 'epoch': 5.0}
{'loss': 0.4283, 'grad_norm': 10.863966941833496, 'learning_rate': 7.546484484936517e-06, 'epoch': 6.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7562010288238525, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4492, 'eval_samples_per_second': 601.691, 'eval_steps_per_second': 4.83, 'epoch': 6.0}
{'loss': 0.3886, 'grad_norm': 25.16184425354004, 'learning_rate': 5.038934997206505e-06, 'epoch': 7.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7708072662353516, 'eval_accuracy': 0.9197247706422018, 'eval_runtime': 1.4413, 'eval_samples_per_second': 604.999, 'eval_steps_per_second': 4.857, 'epoch': 7.0}
{'loss': 0.373, 'grad_norm': 19.518695831298828, 'learning_rate': 2.5266183051271976e-06, 'epoch': 8.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8256669640541077, 'eval_accuracy': 0.9185779816513762, 'eval_runtime': 1.4454, 'eval_samples_per_second': 603.301, 'eval_steps_per_second': 4.843, 'epoch': 8.0}
{'loss': 0.3533, 'grad_norm': 73.0574951171875, 'learning_rate': 1.4301613047889797e-08, 'epoch': 9.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8148481845855713, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4457, 'eval_samples_per_second': 603.149, 'eval_steps_per_second': 4.842, 'epoch': 9.0}
{'train_runtime': 1014.8133, 'train_samples_per_second': 597.293, 'train_steps_per_second': 4.674, 'train_loss': 0.6442346221003733, 'epoch': 9.0}


[I 2024-04-19 02:22:30,284] Trial 6 finished with value: 0.9162844036697247 and parameters: {'num_train_epochs': 9, 'learning_rate': 2.261085022871377e-05, 'alpha': 0.7277023096419001, 'temperature': 29}. Best is trial 6 with value: 0.9162844036697247.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2108 [00:00<?, ?it/s]

{'loss': 2.4433, 'grad_norm': 58.061771392822266, 'learning_rate': 0.00017850057789948383, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 02:24:23,497] Trial 7 pruned. 


{'eval_loss': 2.149305820465088, 'eval_accuracy': 0.8784403669724771, 'eval_runtime': 1.4541, 'eval_samples_per_second': 599.695, 'eval_steps_per_second': 4.814, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/5270 [00:00<?, ?it/s]

{'loss': 0.94, 'grad_norm': 25.336427688598633, 'learning_rate': 8.271664431386432e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.6703569889068604, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4486, 'eval_samples_per_second': 601.968, 'eval_steps_per_second': 4.832, 'epoch': 1.0}
{'loss': 0.4737, 'grad_norm': 11.467299461364746, 'learning_rate': 7.352784339619853e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8164154887199402, 'eval_accuracy': 0.8990825688073395, 'eval_runtime': 1.4496, 'eval_samples_per_second': 601.528, 'eval_steps_per_second': 4.829, 'epoch': 2.0}
{'loss': 0.3435, 'grad_norm': 44.32056427001953, 'learning_rate': 6.433904247853274e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8911434412002563, 'eval_accuracy': 0.8979357798165137, 'eval_runtime': 1.4479, 'eval_samples_per_second': 602.254, 'eval_steps_per_second': 4.835, 'epoch': 3.0}
{'loss': 0.2599, 'grad_norm': 6.458914756774902, 'learning_rate': 5.515024156086696e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7757946252822876, 'eval_accuracy': 0.9151376146788991, 'eval_runtime': 1.446, 'eval_samples_per_second': 603.064, 'eval_steps_per_second': 4.841, 'epoch': 4.0}
{'loss': 0.2156, 'grad_norm': inf, 'learning_rate': 4.5978876698073397e-05, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7059925198554993, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4472, 'eval_samples_per_second': 602.529, 'eval_steps_per_second': 4.837, 'epoch': 5.0}
{'loss': 0.1789, 'grad_norm': 26.35287857055664, 'learning_rate': 3.679007578040761e-05, 'epoch': 6.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.6810566186904907, 'eval_accuracy': 0.9185779816513762, 'eval_runtime': 1.4505, 'eval_samples_per_second': 601.155, 'eval_steps_per_second': 4.826, 'epoch': 6.0}
{'loss': 0.1465, 'grad_norm': 32.874847412109375, 'learning_rate': 2.7601274862741822e-05, 'epoch': 7.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 02:37:33,896] Trial 8 pruned. 


{'eval_loss': 0.8203569650650024, 'eval_accuracy': 0.911697247706422, 'eval_runtime': 1.4518, 'eval_samples_per_second': 600.629, 'eval_steps_per_second': 4.822, 'epoch': 7.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2108 [00:00<?, ?it/s]

{'loss': 3.928, 'grad_norm': 7.340507507324219, 'learning_rate': 0.00037036253158777396, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 02:39:27,032] Trial 9 pruned. 


{'eval_loss': 5.1592206954956055, 'eval_accuracy': 0.6628440366972477, 'eval_runtime': 1.4517, 'eval_samples_per_second': 600.659, 'eval_steps_per_second': 4.822, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/5270 [00:00<?, ?it/s]

{'loss': 0.5695, 'grad_norm': 7.445520401000977, 'learning_rate': 1.0787054225555717e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 02:41:19,719] Trial 10 pruned. 


{'eval_loss': 0.4069703221321106, 'eval_accuracy': 0.8727064220183486, 'eval_runtime': 1.4502, 'eval_samples_per_second': 601.316, 'eval_steps_per_second': 4.827, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4216 [00:00<?, ?it/s]

{'loss': 0.5621, 'grad_norm': 14.338537216186523, 'learning_rate': 5.0174854029835415e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.39338329434394836, 'eval_accuracy': 0.9071100917431193, 'eval_runtime': 1.4401, 'eval_samples_per_second': 605.519, 'eval_steps_per_second': 4.861, 'epoch': 1.0}
{'loss': 0.3067, 'grad_norm': 21.374542236328125, 'learning_rate': 4.302061894724029e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.45042580366134644, 'eval_accuracy': 0.9185779816513762, 'eval_runtime': 1.461, 'eval_samples_per_second': 596.848, 'eval_steps_per_second': 4.791, 'epoch': 2.0}
{'loss': 0.2271, 'grad_norm': 5.032501697540283, 'learning_rate': 3.5852782657263795e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.42918458580970764, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4441, 'eval_samples_per_second': 603.85, 'eval_steps_per_second': 4.847, 'epoch': 3.0}
{'loss': 0.176, 'grad_norm': 5.360136985778809, 'learning_rate': 2.868494636728731e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.4744165539741516, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4509, 'eval_samples_per_second': 601.02, 'eval_steps_per_second': 4.825, 'epoch': 4.0}
{'loss': 0.1486, 'grad_norm': 11.912830352783203, 'learning_rate': 2.1517110077310823e-05, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.4420042037963867, 'eval_accuracy': 0.9185779816513762, 'eval_runtime': 1.4442, 'eval_samples_per_second': 603.815, 'eval_steps_per_second': 4.847, 'epoch': 5.0}
{'loss': 0.1268, 'grad_norm': 6.889047622680664, 'learning_rate': 1.4349273787334336e-05, 'epoch': 6.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.4492184817790985, 'eval_accuracy': 0.9277522935779816, 'eval_runtime': 1.4504, 'eval_samples_per_second': 601.225, 'eval_steps_per_second': 4.826, 'epoch': 6.0}
{'loss': 0.1059, 'grad_norm': 15.968618392944336, 'learning_rate': 7.195038704739208e-06, 'epoch': 7.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.47401389479637146, 'eval_accuracy': 0.9254587155963303, 'eval_runtime': 1.4461, 'eval_samples_per_second': 602.992, 'eval_steps_per_second': 4.841, 'epoch': 7.0}
{'loss': 0.0972, 'grad_norm': 5.614753246307373, 'learning_rate': 2.720241476271912e-08, 'epoch': 8.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.4893525540828705, 'eval_accuracy': 0.9185779816513762, 'eval_runtime': 1.4503, 'eval_samples_per_second': 601.271, 'eval_steps_per_second': 4.827, 'epoch': 8.0}
{'train_runtime': 902.0643, 'train_samples_per_second': 597.288, 'train_steps_per_second': 4.674, 'train_loss': 0.2188198389997971, 'epoch': 8.0}


[I 2024-04-19 02:57:53,375] Trial 11 finished with value: 0.9185779816513762 and parameters: {'num_train_epochs': 8, 'learning_rate': 5.73426903198119e-05, 'alpha': 0.9346338538228292, 'temperature': 23}. Best is trial 11 with value: 0.9185779816513762.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4216 [00:00<?, ?it/s]

{'loss': 0.397, 'grad_norm': 7.509405136108398, 'learning_rate': 2.0402991030650117e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.27017587423324585, 'eval_accuracy': 0.8990825688073395, 'eval_runtime': 1.4468, 'eval_samples_per_second': 602.727, 'eval_steps_per_second': 4.838, 'epoch': 1.0}
{'loss': 0.2429, 'grad_norm': 13.322591781616211, 'learning_rate': 1.748827802627153e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 03:01:39,211] Trial 12 pruned. 


{'eval_loss': 0.2570970952510834, 'eval_accuracy': 0.9002293577981652, 'eval_runtime': 1.4468, 'eval_samples_per_second': 602.705, 'eval_steps_per_second': 4.838, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4743 [00:00<?, ?it/s]

{'loss': 0.8573, 'grad_norm': 22.835023880004883, 'learning_rate': 4.3163581221778226e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5491362810134888, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4473, 'eval_samples_per_second': 602.517, 'eval_steps_per_second': 4.837, 'epoch': 1.0}
{'loss': 0.4496, 'grad_norm': 38.72591018676758, 'learning_rate': 3.7769413020716535e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5399537682533264, 'eval_accuracy': 0.9105504587155964, 'eval_runtime': 1.4499, 'eval_samples_per_second': 601.429, 'eval_steps_per_second': 4.828, 'epoch': 2.0}
{'loss': 0.3331, 'grad_norm': 6.218717575073242, 'learning_rate': 3.237524481965485e-05, 'epoch': 3.0}


{'eval_loss': 0.5268839001655579, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4424, 'eval_samples_per_second': 604.54, 'eval_steps_per_second': 4.853, 'epoch': 3.0}
{'loss': 0.2593, 'grad_norm': 4.019476890563965, 'learning_rate': 2.6981076618593168e-05, 'epoch': 4.0}


{'eval_loss': 0.5509608387947083, 'eval_accuracy': 0.9174311926605505, 'eval_runtime': 1.453, 'eval_samples_per_second': 600.123, 'eval_steps_per_second': 4.818, 'epoch': 4.0}
{'loss': 0.2191, 'grad_norm': 19.7722110748291, 'learning_rate': 2.1586908417531485e-05, 'epoch': 5.0}


{'eval_loss': 0.5370484590530396, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4546, 'eval_samples_per_second': 599.496, 'eval_steps_per_second': 4.812, 'epoch': 5.0}
{'loss': 0.1874, 'grad_norm': 12.621796607971191, 'learning_rate': 1.6202975829754548e-05, 'epoch': 6.0}


{'eval_loss': 0.572212815284729, 'eval_accuracy': 0.926605504587156, 'eval_runtime': 1.4456, 'eval_samples_per_second': 603.228, 'eval_steps_per_second': 4.842, 'epoch': 6.0}
{'loss': 0.1614, 'grad_norm': 12.125116348266602, 'learning_rate': 1.0808807628692863e-05, 'epoch': 7.0}


{'eval_loss': 0.5944486260414124, 'eval_accuracy': 0.9197247706422018, 'eval_runtime': 1.4613, 'eval_samples_per_second': 596.712, 'eval_steps_per_second': 4.79, 'epoch': 7.0}
{'loss': 0.1463, 'grad_norm': 7.69290828704834, 'learning_rate': 5.414639427631178e-06, 'epoch': 8.0}


{'eval_loss': 0.652703046798706, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4499, 'eval_samples_per_second': 601.423, 'eval_steps_per_second': 4.828, 'epoch': 8.0}
{'loss': 0.1339, 'grad_norm': 29.435096740722656, 'learning_rate': 2.0471226569494057e-08, 'epoch': 9.0}


{'eval_loss': 0.6081395149230957, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4498, 'eval_samples_per_second': 601.454, 'eval_steps_per_second': 4.828, 'epoch': 9.0}
{'train_runtime': 1015.1206, 'train_samples_per_second': 597.112, 'train_steps_per_second': 4.672, 'train_loss': 0.3052620611285299, 'epoch': 9.0}


[I 2024-04-19 03:19:40,546] Trial 13 finished with value: 0.9220183486238532 and parameters: {'num_train_epochs': 9, 'learning_rate': 4.854751380955516e-05, 'alpha': 0.8585651118406669, 'temperature': 23}. Best is trial 13 with value: 0.9220183486238532.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.722, 'grad_norm': 17.319332122802734, 'learning_rate': 4.8198886816999536e-05, 'epoch': 1.0}


{'eval_loss': 0.4791809618473053, 'eval_accuracy': 0.9071100917431193, 'eval_runtime': 1.4471, 'eval_samples_per_second': 602.582, 'eval_steps_per_second': 4.837, 'epoch': 1.0}
{'loss': 0.3851, 'grad_norm': 32.41178894042969, 'learning_rate': 4.131519756156356e-05, 'epoch': 2.0}


{'eval_loss': 0.5533952116966248, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4377, 'eval_samples_per_second': 606.531, 'eval_steps_per_second': 4.869, 'epoch': 2.0}
{'loss': 0.2834, 'grad_norm': 11.841134071350098, 'learning_rate': 3.443150830612758e-05, 'epoch': 3.0}


{'eval_loss': 0.513361394405365, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.446, 'eval_samples_per_second': 603.045, 'eval_steps_per_second': 4.841, 'epoch': 3.0}
{'loss': 0.2197, 'grad_norm': 5.49588680267334, 'learning_rate': 2.7547819050691604e-05, 'epoch': 4.0}


{'eval_loss': 0.5235697031021118, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4444, 'eval_samples_per_second': 603.71, 'eval_steps_per_second': 4.846, 'epoch': 4.0}
{'loss': 0.1838, 'grad_norm': 17.881845474243164, 'learning_rate': 2.0677191824203323e-05, 'epoch': 5.0}


{'eval_loss': 0.5187216997146606, 'eval_accuracy': 0.9174311926605505, 'eval_runtime': 1.4478, 'eval_samples_per_second': 602.286, 'eval_steps_per_second': 4.835, 'epoch': 5.0}
{'loss': 0.1587, 'grad_norm': 9.661787033081055, 'learning_rate': 1.3793502568767347e-05, 'epoch': 6.0}


[I 2024-04-19 03:30:58,583] Trial 14 pruned. 


{'eval_loss': 0.5268951058387756, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4468, 'eval_samples_per_second': 602.727, 'eval_steps_per_second': 4.838, 'epoch': 6.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.8727, 'grad_norm': 19.587562561035156, 'learning_rate': 4.871658725828052e-05, 'epoch': 1.0}


{'eval_loss': 0.579243004322052, 'eval_accuracy': 0.908256880733945, 'eval_runtime': 1.4486, 'eval_samples_per_second': 601.96, 'eval_steps_per_second': 4.832, 'epoch': 1.0}
{'loss': 0.4502, 'grad_norm': 29.514617919921875, 'learning_rate': 4.175896083955048e-05, 'epoch': 2.0}


{'eval_loss': 0.6629878878593445, 'eval_accuracy': 0.9013761467889908, 'eval_runtime': 1.4564, 'eval_samples_per_second': 598.751, 'eval_steps_per_second': 4.806, 'epoch': 2.0}
{'loss': 0.3289, 'grad_norm': 11.078173637390137, 'learning_rate': 3.4801334420820444e-05, 'epoch': 3.0}


[I 2024-04-19 03:36:37,629] Trial 15 pruned. 


{'eval_loss': 0.6023737788200378, 'eval_accuracy': 0.9128440366972477, 'eval_runtime': 1.4423, 'eval_samples_per_second': 604.591, 'eval_steps_per_second': 4.853, 'epoch': 3.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.4683, 'grad_norm': 43.93117141723633, 'learning_rate': 0.0001210915308814829, 'epoch': 1.0}


{'eval_loss': 1.163783073425293, 'eval_accuracy': 0.893348623853211, 'eval_runtime': 1.4455, 'eval_samples_per_second': 603.251, 'eval_steps_per_second': 4.843, 'epoch': 1.0}
{'loss': 0.7178, 'grad_norm': 19.654346466064453, 'learning_rate': 0.0001009223662877593, 'epoch': 2.0}


[I 2024-04-19 03:40:23,653] Trial 16 pruned. 


{'eval_loss': 1.1105067729949951, 'eval_accuracy': 0.8887614678899083, 'eval_runtime': 1.4481, 'eval_samples_per_second': 602.163, 'eval_steps_per_second': 4.834, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.1526, 'grad_norm': 12.6882963180542, 'learning_rate': 1.161133741291204e-05, 'epoch': 1.0}


[I 2024-04-19 03:42:16,550] Trial 17 pruned. 


{'eval_loss': 0.8052427172660828, 'eval_accuracy': 0.8727064220183486, 'eval_runtime': 1.447, 'eval_samples_per_second': 602.639, 'eval_steps_per_second': 4.838, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.6823, 'grad_norm': 30.691499710083008, 'learning_rate': 0.00012903895469071425, 'epoch': 1.0}


{'eval_loss': 1.4031481742858887, 'eval_accuracy': 0.8922018348623854, 'eval_runtime': 1.462, 'eval_samples_per_second': 596.433, 'eval_steps_per_second': 4.788, 'epoch': 1.0}
{'loss': 0.8262, 'grad_norm': 63.848548889160156, 'learning_rate': 0.00011291673346690998, 'epoch': 2.0}


[I 2024-04-19 03:46:02,453] Trial 18 pruned. 


{'eval_loss': 1.4074316024780273, 'eval_accuracy': 0.8910550458715596, 'eval_runtime': 1.4452, 'eval_samples_per_second': 603.386, 'eval_steps_per_second': 4.844, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.0934, 'grad_norm': 20.130027770996094, 'learning_rate': 3.554017689638086e-05, 'epoch': 1.0}


{'eval_loss': 0.6624947786331177, 'eval_accuracy': 0.9048165137614679, 'eval_runtime': 1.4531, 'eval_samples_per_second': 600.115, 'eval_steps_per_second': 4.817, 'epoch': 1.0}
{'loss': 0.5628, 'grad_norm': 33.143531799316406, 'learning_rate': 2.8434838040389697e-05, 'epoch': 2.0}


{'eval_loss': 0.6784337162971497, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4531, 'eval_samples_per_second': 600.1, 'eval_steps_per_second': 4.817, 'epoch': 2.0}
{'loss': 0.4182, 'grad_norm': 15.539627075195312, 'learning_rate': 2.132949918439853e-05, 'epoch': 3.0}


[I 2024-04-19 03:51:41,055] Trial 19 pruned. 


{'eval_loss': 0.6361051797866821, 'eval_accuracy': 0.9139908256880734, 'eval_runtime': 1.4428, 'eval_samples_per_second': 604.366, 'eval_steps_per_second': 4.852, 'epoch': 3.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.6431, 'grad_norm': 44.26585006713867, 'learning_rate': 6.064402229159703e-05, 'epoch': 1.0}


{'eval_loss': 1.0662537813186646, 'eval_accuracy': 0.8944954128440367, 'eval_runtime': 1.4489, 'eval_samples_per_second': 601.833, 'eval_steps_per_second': 4.831, 'epoch': 1.0}
{'loss': 0.778, 'grad_norm': 44.72014617919922, 'learning_rate': 5.054307420446946e-05, 'epoch': 2.0}


{'eval_loss': 1.0714185237884521, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4519, 'eval_samples_per_second': 600.589, 'eval_steps_per_second': 4.821, 'epoch': 2.0}
{'loss': 0.5608, 'grad_norm': 59.29368591308594, 'learning_rate': 4.044212611734189e-05, 'epoch': 3.0}


[I 2024-04-19 03:57:20,480] Trial 20 pruned. 


{'eval_loss': 0.9973759651184082, 'eval_accuracy': 0.911697247706422, 'eval_runtime': 1.4463, 'eval_samples_per_second': 602.932, 'eval_steps_per_second': 4.84, 'epoch': 3.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.9242, 'grad_norm': 27.675573348999023, 'learning_rate': 1.689625603488882e-05, 'epoch': 1.0}


[I 2024-04-19 03:59:13,772] Trial 21 pruned. 


{'eval_loss': 1.1285181045532227, 'eval_accuracy': 0.8795871559633027, 'eval_runtime': 1.4541, 'eval_samples_per_second': 599.682, 'eval_steps_per_second': 4.814, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.6671, 'grad_norm': 13.01644229888916, 'learning_rate': 2.7002954057123424e-05, 'epoch': 1.0}


{'eval_loss': 0.428321897983551, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.451, 'eval_samples_per_second': 600.984, 'eval_steps_per_second': 4.824, 'epoch': 1.0}
{'loss': 0.3779, 'grad_norm': 25.425676345825195, 'learning_rate': 2.4002625828554156e-05, 'epoch': 2.0}


{'eval_loss': 0.40753328800201416, 'eval_accuracy': 0.9048165137614679, 'eval_runtime': 1.4531, 'eval_samples_per_second': 600.091, 'eval_steps_per_second': 4.817, 'epoch': 2.0}
{'loss': 0.2908, 'grad_norm': 7.495176792144775, 'learning_rate': 2.1002297599984884e-05, 'epoch': 3.0}


{'eval_loss': 0.4330578148365021, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4523, 'eval_samples_per_second': 600.414, 'eval_steps_per_second': 4.82, 'epoch': 3.0}
{'loss': 0.2367, 'grad_norm': 5.248865127563477, 'learning_rate': 1.8007662593860718e-05, 'epoch': 4.0}


[I 2024-04-19 04:06:46,520] Trial 22 pruned. 


{'eval_loss': 0.4434203505516052, 'eval_accuracy': 0.9151376146788991, 'eval_runtime': 1.4564, 'eval_samples_per_second': 598.722, 'eval_steps_per_second': 4.806, 'epoch': 4.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.6479, 'grad_norm': 23.717376708984375, 'learning_rate': 1.5081753611293434e-05, 'epoch': 1.0}


[I 2024-04-19 04:08:39,963] Trial 23 pruned. 


{'eval_loss': 1.023799180984497, 'eval_accuracy': 0.8795871559633027, 'eval_runtime': 1.4511, 'eval_samples_per_second': 600.906, 'eval_steps_per_second': 4.824, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.7353, 'grad_norm': 29.21723175048828, 'learning_rate': 2.957608242808066e-05, 'epoch': 1.0}


{'eval_loss': 0.9055101871490479, 'eval_accuracy': 0.8967889908256881, 'eval_runtime': 1.4428, 'eval_samples_per_second': 604.362, 'eval_steps_per_second': 4.852, 'epoch': 1.0}
{'loss': 0.855, 'grad_norm': 32.18894958496094, 'learning_rate': 2.5360088022343416e-05, 'epoch': 2.0}


{'eval_loss': 0.8889522552490234, 'eval_accuracy': 0.9059633027522935, 'eval_runtime': 1.4504, 'eval_samples_per_second': 601.197, 'eval_steps_per_second': 4.826, 'epoch': 2.0}
{'loss': 0.6404, 'grad_norm': 18.654333114624023, 'learning_rate': 2.113607841811618e-05, 'epoch': 3.0}


[I 2024-04-19 04:14:19,333] Trial 24 pruned. 


{'eval_loss': 0.8922274112701416, 'eval_accuracy': 0.9139908256880734, 'eval_runtime': 1.4556, 'eval_samples_per_second': 599.055, 'eval_steps_per_second': 4.809, 'epoch': 3.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.7433, 'grad_norm': 26.669029235839844, 'learning_rate': 0.0001315277347504608, 'epoch': 1.0}


{'eval_loss': 0.6285983920097351, 'eval_accuracy': 0.8944954128440367, 'eval_runtime': 1.4456, 'eval_samples_per_second': 603.222, 'eval_steps_per_second': 4.842, 'epoch': 1.0}
{'loss': 0.384, 'grad_norm': 12.228119850158691, 'learning_rate': 0.00011509066664197304, 'epoch': 2.0}


[I 2024-04-19 04:18:05,696] Trial 25 pruned. 


{'eval_loss': 0.6967461109161377, 'eval_accuracy': 0.8956422018348624, 'eval_runtime': 1.4516, 'eval_samples_per_second': 600.703, 'eval_steps_per_second': 4.822, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.9686, 'grad_norm': 25.633333206176758, 'learning_rate': 5.249775471708889e-05, 'epoch': 1.0}


{'eval_loss': 0.6331567168235779, 'eval_accuracy': 0.9105504587155964, 'eval_runtime': 1.4527, 'eval_samples_per_second': 600.271, 'eval_steps_per_second': 4.819, 'epoch': 1.0}
{'loss': 0.4917, 'grad_norm': 32.01615905761719, 'learning_rate': 4.666590043043083e-05, 'epoch': 2.0}


{'eval_loss': 0.6638227105140686, 'eval_accuracy': 0.911697247706422, 'eval_runtime': 1.4467, 'eval_samples_per_second': 602.739, 'eval_steps_per_second': 4.838, 'epoch': 2.0}
{'loss': 0.3597, 'grad_norm': 19.456010818481445, 'learning_rate': 4.0834046143772765e-05, 'epoch': 3.0}


{'eval_loss': 0.6752657294273376, 'eval_accuracy': 0.9174311926605505, 'eval_runtime': 1.4552, 'eval_samples_per_second': 599.222, 'eval_steps_per_second': 4.81, 'epoch': 3.0}
{'loss': 0.2792, 'grad_norm': 7.038837432861328, 'learning_rate': 3.500219185711471e-05, 'epoch': 4.0}


{'eval_loss': 0.6751970052719116, 'eval_accuracy': 0.9128440366972477, 'eval_runtime': 1.4453, 'eval_samples_per_second': 603.35, 'eval_steps_per_second': 4.843, 'epoch': 4.0}
{'loss': 0.2339, 'grad_norm': 17.21753692626953, 'learning_rate': 2.918140370762298e-05, 'epoch': 5.0}


{'eval_loss': 0.6162137389183044, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4517, 'eval_samples_per_second': 600.678, 'eval_steps_per_second': 4.822, 'epoch': 5.0}
{'loss': 0.1981, 'grad_norm': 28.996667861938477, 'learning_rate': 2.3349549420964917e-05, 'epoch': 6.0}


[I 2024-04-19 04:29:24,562] Trial 26 pruned. 


{'eval_loss': 0.6343953609466553, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4524, 'eval_samples_per_second': 600.374, 'eval_steps_per_second': 4.82, 'epoch': 6.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.7426, 'grad_norm': 9.74049186706543, 'learning_rate': 1.698970254360641e-05, 'epoch': 1.0}


[I 2024-04-19 04:31:17,784] Trial 27 pruned. 


{'eval_loss': 0.47681862115859985, 'eval_accuracy': 0.8876146788990825, 'eval_runtime': 1.4523, 'eval_samples_per_second': 600.43, 'eval_steps_per_second': 4.82, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 2.1443, 'grad_norm': 33.01731872558594, 'learning_rate': 3.176988118911755e-05, 'epoch': 1.0}


{'eval_loss': 1.0754104852676392, 'eval_accuracy': 0.8967889908256881, 'eval_runtime': 1.4504, 'eval_samples_per_second': 601.214, 'eval_steps_per_second': 4.826, 'epoch': 1.0}
{'loss': 1.0288, 'grad_norm': 39.856449127197266, 'learning_rate': 2.7233785988178796e-05, 'epoch': 2.0}


[I 2024-04-19 04:35:03,734] Trial 28 pruned. 


{'eval_loss': 1.06399405002594, 'eval_accuracy': 0.8967889908256881, 'eval_runtime': 1.451, 'eval_samples_per_second': 600.975, 'eval_steps_per_second': 4.824, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 1.8023, 'grad_norm': 83.0543212890625, 'learning_rate': 9.041775344733817e-05, 'epoch': 1.0}


{'eval_loss': 1.2870945930480957, 'eval_accuracy': 0.9013761467889908, 'eval_runtime': 1.4502, 'eval_samples_per_second': 601.284, 'eval_steps_per_second': 4.827, 'epoch': 1.0}
{'loss': 0.8676, 'grad_norm': 34.712501525878906, 'learning_rate': 7.535765355266459e-05, 'epoch': 2.0}


[I 2024-04-19 04:38:49,674] Trial 29 pruned. 


{'eval_loss': 1.2858450412750244, 'eval_accuracy': 0.8979357798165137, 'eval_runtime': 1.4558, 'eval_samples_per_second': 598.996, 'eval_steps_per_second': 4.808, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.6075, 'grad_norm': 26.225385665893555, 'learning_rate': 6.561713559027068e-05, 'epoch': 1.0}


{'eval_loss': 0.47316521406173706, 'eval_accuracy': 0.9139908256880734, 'eval_runtime': 1.4499, 'eval_samples_per_second': 601.43, 'eval_steps_per_second': 4.828, 'epoch': 1.0}
{'loss': 0.3214, 'grad_norm': 13.330994606018066, 'learning_rate': 5.741499364148685e-05, 'epoch': 2.0}


{'eval_loss': 0.6051578521728516, 'eval_accuracy': 0.8956422018348624, 'eval_runtime': 1.4493, 'eval_samples_per_second': 601.653, 'eval_steps_per_second': 4.83, 'epoch': 2.0}
{'loss': 0.2348, 'grad_norm': 10.15234661102295, 'learning_rate': 4.9212851692703006e-05, 'epoch': 3.0}


{'eval_loss': 0.4945127069950104, 'eval_accuracy': 0.9197247706422018, 'eval_runtime': 1.4486, 'eval_samples_per_second': 601.961, 'eval_steps_per_second': 4.832, 'epoch': 3.0}
{'loss': 0.1824, 'grad_norm': inf, 'learning_rate': 4.1026273580634136e-05, 'epoch': 4.0}


{'eval_loss': 0.5778486728668213, 'eval_accuracy': 0.9071100917431193, 'eval_runtime': 1.4513, 'eval_samples_per_second': 600.822, 'eval_steps_per_second': 4.823, 'epoch': 4.0}
{'loss': 0.1498, 'grad_norm': 21.25727653503418, 'learning_rate': 3.28241316318503e-05, 'epoch': 5.0}


{'eval_loss': 0.515268087387085, 'eval_accuracy': 0.911697247706422, 'eval_runtime': 1.4401, 'eval_samples_per_second': 605.529, 'eval_steps_per_second': 4.861, 'epoch': 5.0}
{'loss': 0.1248, 'grad_norm': 11.034281730651855, 'learning_rate': 2.4621989683066463e-05, 'epoch': 6.0}


[I 2024-04-19 04:50:08,275] Trial 30 pruned. 


{'eval_loss': 0.5297892689704895, 'eval_accuracy': 0.9151376146788991, 'eval_runtime': 1.4497, 'eval_samples_per_second': 601.487, 'eval_steps_per_second': 4.828, 'epoch': 6.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.9669, 'grad_norm': 29.794572830200195, 'learning_rate': 6.431256527502283e-05, 'epoch': 1.0}


{'eval_loss': 0.6668250560760498, 'eval_accuracy': 0.8990825688073395, 'eval_runtime': 1.4508, 'eval_samples_per_second': 601.063, 'eval_steps_per_second': 4.825, 'epoch': 1.0}
{'loss': 0.4833, 'grad_norm': 28.9000301361084, 'learning_rate': 3.218676252626735e-05, 'epoch': 2.0}


[I 2024-04-19 04:53:54,849] Trial 31 pruned. 


{'eval_loss': 0.7747304439544678, 'eval_accuracy': 0.8979357798165137, 'eval_runtime': 1.451, 'eval_samples_per_second': 600.968, 'eval_steps_per_second': 4.824, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 2.7545, 'grad_norm': 47.12026596069336, 'learning_rate': 4.2569562606597884e-05, 'epoch': 1.0}


{'eval_loss': 1.4578064680099487, 'eval_accuracy': 0.8990825688073395, 'eval_runtime': 1.4519, 'eval_samples_per_second': 600.58, 'eval_steps_per_second': 4.821, 'epoch': 1.0}
{'loss': 1.283, 'grad_norm': 78.47111511230469, 'learning_rate': 3.1942296148200405e-05, 'epoch': 2.0}


{'eval_loss': 1.441101312637329, 'eval_accuracy': 0.9071100917431193, 'eval_runtime': 1.4585, 'eval_samples_per_second': 597.892, 'eval_steps_per_second': 4.8, 'epoch': 2.0}
{'loss': 0.9314, 'grad_norm': 58.518775939941406, 'learning_rate': 2.1315029689802918e-05, 'epoch': 3.0}


{'eval_loss': 1.2692312002182007, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4446, 'eval_samples_per_second': 603.643, 'eval_steps_per_second': 4.846, 'epoch': 3.0}
{'loss': 0.7341, 'grad_norm': 19.554431915283203, 'learning_rate': 1.0687763231405437e-05, 'epoch': 4.0}


{'eval_loss': 1.3720850944519043, 'eval_accuracy': 0.9151376146788991, 'eval_runtime': 1.4488, 'eval_samples_per_second': 601.895, 'eval_steps_per_second': 4.832, 'epoch': 4.0}
{'loss': 0.6406, 'grad_norm': 43.02271270751953, 'learning_rate': 8.066236401060707e-08, 'epoch': 5.0}


{'eval_loss': 1.3230271339416504, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4499, 'eval_samples_per_second': 601.441, 'eval_steps_per_second': 4.828, 'epoch': 5.0}
{'train_runtime': 565.5789, 'train_samples_per_second': 595.399, 'train_steps_per_second': 4.659, 'train_loss': 1.2687144559972428, 'epoch': 5.0}


[I 2024-04-19 05:04:12,562] Trial 32 finished with value: 0.9208715596330275 and parameters: {'num_train_epochs': 5, 'learning_rate': 5.313633229198741e-05, 'alpha': 0.2744072380603728, 'temperature': 16}. Best is trial 13 with value: 0.9220183486238532.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 2.7994, 'grad_norm': 53.124794006347656, 'learning_rate': 3.756448770526551e-05, 'epoch': 1.0}


[I 2024-04-19 05:06:05,689] Trial 33 pruned. 


{'eval_loss': 1.4240434169769287, 'eval_accuracy': 0.8899082568807339, 'eval_runtime': 1.445, 'eval_samples_per_second': 603.469, 'eval_steps_per_second': 4.844, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 3.6984, 'grad_norm': 32.842525482177734, 'learning_rate': 1.2892322696027265e-05, 'epoch': 1.0}


[I 2024-04-19 05:07:58,573] Trial 34 pruned. 


{'eval_loss': 2.2395052909851074, 'eval_accuracy': 0.8681192660550459, 'eval_runtime': 1.4404, 'eval_samples_per_second': 605.39, 'eval_steps_per_second': 4.86, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 3.9518, 'grad_norm': 61.12446212768555, 'learning_rate': 1.9903664448669858e-05, 'epoch': 1.0}


[I 2024-04-19 05:09:51,465] Trial 35 pruned. 


{'eval_loss': 1.9436157941818237, 'eval_accuracy': 0.8818807339449541, 'eval_runtime': 1.4472, 'eval_samples_per_second': 602.56, 'eval_steps_per_second': 4.837, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/3162 [00:00<?, ?it/s]

{'loss': 4.839, 'grad_norm': 35.701171875, 'learning_rate': 8.6485284152374e-06, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:11:44,416] Trial 36 pruned. 


{'eval_loss': 3.408554792404175, 'eval_accuracy': 0.8497706422018348, 'eval_runtime': 1.4494, 'eval_samples_per_second': 601.625, 'eval_steps_per_second': 4.83, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2635 [00:00<?, ?it/s]

{'loss': 1.3149, 'grad_norm': 29.725248336791992, 'learning_rate': 5.633928647343514e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8187136054039001, 'eval_accuracy': 0.9094036697247706, 'eval_runtime': 1.4446, 'eval_samples_per_second': 603.645, 'eval_steps_per_second': 4.846, 'epoch': 1.0}
{'loss': 0.6381, 'grad_norm': 28.089006423950195, 'learning_rate': 4.226114329112109e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8704111576080322, 'eval_accuracy': 0.9013761467889908, 'eval_runtime': 1.4438, 'eval_samples_per_second': 603.964, 'eval_steps_per_second': 4.848, 'epoch': 2.0}
{'loss': 0.4568, 'grad_norm': 12.265154838562012, 'learning_rate': 2.818300010880705e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8463584184646606, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4494, 'eval_samples_per_second': 601.628, 'eval_steps_per_second': 4.83, 'epoch': 3.0}
{'loss': 0.3524, 'grad_norm': 7.110481262207031, 'learning_rate': 1.413157067067197e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:19:17,124] Trial 37 pruned. 


{'eval_loss': 0.8207807540893555, 'eval_accuracy': 0.9128440366972477, 'eval_runtime': 1.4555, 'eval_samples_per_second': 599.109, 'eval_steps_per_second': 4.809, 'epoch': 4.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/3689 [00:00<?, ?it/s]

{'loss': 1.2267, 'grad_norm': 28.914613723754883, 'learning_rate': 3.73229259051389e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.7176969051361084, 'eval_accuracy': 0.9013761467889908, 'eval_runtime': 1.4578, 'eval_samples_per_second': 598.142, 'eval_steps_per_second': 4.802, 'epoch': 1.0}
{'loss': 0.6211, 'grad_norm': 30.996606826782227, 'learning_rate': 3.110440489596779e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:23:03,425] Trial 38 pruned. 


{'eval_loss': 0.7066077589988708, 'eval_accuracy': 0.9048165137614679, 'eval_runtime': 1.4514, 'eval_samples_per_second': 600.805, 'eval_steps_per_second': 4.823, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1581 [00:00<?, ?it/s]

{'loss': 1.791, 'grad_norm': 42.082271575927734, 'learning_rate': 8.05410721209954e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.2406145334243774, 'eval_accuracy': 0.8967889908256881, 'eval_runtime': 1.4497, 'eval_samples_per_second': 601.515, 'eval_steps_per_second': 4.829, 'epoch': 1.0}
{'loss': 0.8304, 'grad_norm': 37.11238098144531, 'learning_rate': 4.034680601515773e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:26:49,392] Trial 39 pruned. 


{'eval_loss': 1.3271125555038452, 'eval_accuracy': 0.8899082568807339, 'eval_runtime': 1.4503, 'eval_samples_per_second': 601.264, 'eval_steps_per_second': 4.827, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4216 [00:00<?, ?it/s]

{'loss': 0.5183, 'grad_norm': 10.066198348999023, 'learning_rate': 2.8944833614944396e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.3556877672672272, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4517, 'eval_samples_per_second': 600.659, 'eval_steps_per_second': 4.822, 'epoch': 1.0}
{'loss': 0.3017, 'grad_norm': 20.411041259765625, 'learning_rate': 2.4809857384238055e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:30:35,473] Trial 40 pruned. 


{'eval_loss': 0.3567243814468384, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4499, 'eval_samples_per_second': 601.416, 'eval_steps_per_second': 4.828, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1581 [00:00<?, ?it/s]

{'loss': 2.5123, 'grad_norm': 43.335304260253906, 'learning_rate': 5.1826206877294654e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.4461896419525146, 'eval_accuracy': 0.9059633027522935, 'eval_runtime': 1.4486, 'eval_samples_per_second': 601.959, 'eval_steps_per_second': 4.832, 'epoch': 1.0}
{'loss': 1.1875, 'grad_norm': 75.06561279296875, 'learning_rate': 2.5962181286069005e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.6673099994659424, 'eval_accuracy': 0.8944954128440367, 'eval_runtime': 1.452, 'eval_samples_per_second': 600.556, 'eval_steps_per_second': 4.821, 'epoch': 2.0}
{'loss': 0.8542, 'grad_norm': 39.55039596557617, 'learning_rate': 9.815569484336108e-08, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:36:14,558] Trial 41 pruned. 


{'eval_loss': 1.4095581769943237, 'eval_accuracy': 0.9128440366972477, 'eval_runtime': 1.4584, 'eval_samples_per_second': 597.936, 'eval_steps_per_second': 4.8, 'epoch': 3.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1054 [00:00<?, ?it/s]

{'loss': 1.439, 'grad_norm': 38.791255950927734, 'learning_rate': 0.00017028113410210688, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:38:07,811] Trial 42 pruned. 


{'eval_loss': 1.6293396949768066, 'eval_accuracy': 0.8532110091743119, 'eval_runtime': 1.4627, 'eval_samples_per_second': 596.162, 'eval_steps_per_second': 4.786, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2108 [00:00<?, ?it/s]

{'loss': 1.7178, 'grad_norm': 53.056884765625, 'learning_rate': 0.00014887206977648684, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.3425756692886353, 'eval_accuracy': 0.8956422018348624, 'eval_runtime': 1.449, 'eval_samples_per_second': 601.801, 'eval_steps_per_second': 4.831, 'epoch': 1.0}
{'loss': 0.8117, 'grad_norm': 54.11060333251953, 'learning_rate': 9.931074269360082e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:41:54,167] Trial 43 pruned. 


{'eval_loss': 1.2254834175109863, 'eval_accuracy': 0.9013761467889908, 'eval_runtime': 1.4593, 'eval_samples_per_second': 597.55, 'eval_steps_per_second': 4.797, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1054 [00:00<?, ?it/s]

{'loss': 3.4781, 'grad_norm': 47.874210357666016, 'learning_rate': 2.4172863288960017e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.7304165363311768, 'eval_accuracy': 0.8922018348623854, 'eval_runtime': 1.4497, 'eval_samples_per_second': 601.489, 'eval_steps_per_second': 4.828, 'epoch': 1.0}
{'loss': 1.7274, 'grad_norm': 82.40401458740234, 'learning_rate': 1.3682752805071707e-07, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.6832726001739502, 'eval_accuracy': 0.9059633027522935, 'eval_runtime': 1.4506, 'eval_samples_per_second': 601.151, 'eval_steps_per_second': 4.826, 'epoch': 2.0}
{'train_runtime': 225.816, 'train_samples_per_second': 596.494, 'train_steps_per_second': 4.668, 'train_loss': 2.602757839369367, 'epoch': 2.0}


[I 2024-04-19 05:46:41,115] Trial 44 finished with value: 0.9059633027522935 and parameters: {'num_train_epochs': 2, 'learning_rate': 4.80720715218186e-05, 'alpha': 0.12166614132948814, 'temperature': 24}. Best is trial 13 with value: 0.9220183486238532.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/5270 [00:00<?, ?it/s]

{'loss': 3.1459, 'grad_norm': 40.29032516479492, 'learning_rate': 2.1517146214571464e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:48:34,413] Trial 45 pruned. 


{'eval_loss': 1.5859317779541016, 'eval_accuracy': 0.8830275229357798, 'eval_runtime': 1.4459, 'eval_samples_per_second': 603.083, 'eval_steps_per_second': 4.841, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/3162 [00:00<?, ?it/s]

{'loss': 7.6286, 'grad_norm': inf, 'learning_rate': 0.0007840391316725399, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:50:27,462] Trial 46 pruned. 


{'eval_loss': 7.621352672576904, 'eval_accuracy': 0.5091743119266054, 'eval_runtime': 1.4505, 'eval_samples_per_second': 601.161, 'eval_steps_per_second': 4.826, 'epoch': 1.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4743 [00:00<?, ?it/s]

{'loss': 2.4694, 'grad_norm': 79.72722625732422, 'learning_rate': 7.544826842265819e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 1.513045310974121, 'eval_accuracy': 0.9025229357798165, 'eval_runtime': 1.4476, 'eval_samples_per_second': 602.382, 'eval_steps_per_second': 4.836, 'epoch': 1.0}
{'loss': 1.1582, 'grad_norm': 60.853755950927734, 'learning_rate': 6.602394098517517e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-04-19 05:54:14,028] Trial 47 pruned. 


{'eval_loss': 1.5530725717544556, 'eval_accuracy': 0.9048165137614679, 'eval_runtime': 1.4519, 'eval_samples_per_second': 600.593, 'eval_steps_per_second': 4.821, 'epoch': 2.0}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2635 [00:00<?, ?it/s]

{'loss': 0.3265, 'grad_norm': 8.276034355163574, 'learning_rate': 4.888681242124731e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.2659453749656677, 'eval_accuracy': 0.9036697247706422, 'eval_runtime': 1.4511, 'eval_samples_per_second': 600.924, 'eval_steps_per_second': 4.824, 'epoch': 1.0}
{'loss': 0.1866, 'grad_norm': 7.019628524780273, 'learning_rate': 3.6665109315935477e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.29601141810417175, 'eval_accuracy': 0.9059633027522935, 'eval_runtime': 1.4476, 'eval_samples_per_second': 602.36, 'eval_steps_per_second': 4.835, 'epoch': 2.0}
{'loss': 0.1398, 'grad_norm': 2.9231150150299072, 'learning_rate': 2.4443406210623653e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.27852392196655273, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4489, 'eval_samples_per_second': 601.827, 'eval_steps_per_second': 4.831, 'epoch': 3.0}
{'loss': 0.1103, 'grad_norm': 6.364706039428711, 'learning_rate': 1.2221703105311827e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.31632736325263977, 'eval_accuracy': 0.9162844036697247, 'eval_runtime': 1.4507, 'eval_samples_per_second': 601.084, 'eval_steps_per_second': 4.825, 'epoch': 4.0}
{'loss': 0.094, 'grad_norm': 5.227040767669678, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.3277532458305359, 'eval_accuracy': 0.9197247706422018, 'eval_runtime': 1.4468, 'eval_samples_per_second': 602.714, 'eval_steps_per_second': 4.838, 'epoch': 5.0}
{'train_runtime': 567.7945, 'train_samples_per_second': 593.075, 'train_steps_per_second': 4.641, 'train_loss': 0.1714439189411431, 'epoch': 5.0}


[I 2024-04-19 06:04:38,646] Trial 48 finished with value: 0.9197247706422018 and parameters: {'num_train_epochs': 5, 'learning_rate': 6.110851552655913e-05, 'alpha': 0.9947666589577905, 'temperature': 22}. Best is trial 13 with value: 0.9220183486238532.
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2635 [00:00<?, ?it/s]

{'loss': 0.3762, 'grad_norm': 8.770561218261719, 'learning_rate': 4.952771397331368e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.28651162981987, 'eval_accuracy': 0.908256880733945, 'eval_runtime': 1.4488, 'eval_samples_per_second': 601.892, 'eval_steps_per_second': 4.832, 'epoch': 1.0}
{'loss': 0.2105, 'grad_norm': 11.216989517211914, 'learning_rate': 3.714578547998526e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.3472459316253662, 'eval_accuracy': 0.9048165137614679, 'eval_runtime': 1.4467, 'eval_samples_per_second': 602.769, 'eval_steps_per_second': 4.839, 'epoch': 2.0}
{'loss': 0.1566, 'grad_norm': 3.215942621231079, 'learning_rate': 2.476385698665684e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.30982208251953125, 'eval_accuracy': 0.9243119266055045, 'eval_runtime': 1.448, 'eval_samples_per_second': 602.203, 'eval_steps_per_second': 4.834, 'epoch': 3.0}
{'loss': 0.1221, 'grad_norm': 7.901008129119873, 'learning_rate': 1.238192849332842e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.3680877685546875, 'eval_accuracy': 0.9174311926605505, 'eval_runtime': 1.4488, 'eval_samples_per_second': 601.891, 'eval_steps_per_second': 4.832, 'epoch': 4.0}
{'loss': 0.1029, 'grad_norm': 4.315232753753662, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.37717682123184204, 'eval_accuracy': 0.9174311926605505, 'eval_runtime': 1.4394, 'eval_samples_per_second': 605.794, 'eval_steps_per_second': 4.863, 'epoch': 5.0}
{'train_runtime': 567.0988, 'train_samples_per_second': 593.803, 'train_steps_per_second': 4.646, 'train_loss': 0.19365764668589525, 'epoch': 5.0}


[I 2024-04-19 06:15:03,963] Trial 49 finished with value: 0.9174311926605505 and parameters: {'num_train_epochs': 5, 'learning_rate': 6.19096424666421e-05, 'alpha': 0.982936133975336, 'temperature': 22}. Best is trial 13 with value: 0.9220183486238532.


BestRun(run_id='13', objective=0.9220183486238532, hyperparameters={'num_train_epochs': 9, 'learning_rate': 4.854751380955516e-05, 'alpha': 0.8585651118406669, 'temperature': 23}, run_summary=None)


Since Optuna only searches for the best hyperparameters, we need to fine-tune our model again using the best hyperparameters from the `best_run`.

In [19]:
# overwrite the initial hyperparameters with the ones from the best_run
for k,v in best_run.hyperparameters.items():
    setattr(training_args, k, v)

# Define a new repository to store our distilled model
best_model_ckpt = "electra-distilled-best-sst"
training_args.output_dir = best_model_ckpt

# Data:

In [20]:
print(count_parameters(student_model))

13549314


We have overwritten the default hyperparameters with the ones from our `best_run` and can now start the training.

In [21]:
# Create a new Trainer with optimal parameters
optimal_trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

optimal_trainer.train()


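# save best model, metrics and create model card
# (use optimal_trainer, which holds the model trained with the best hyperparameters)
optimal_trainer.create_model_card(model_name=training_args.hub_model_id)
optimal_trainer.push_to_hub()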

  0%|          | 0/4743 [00:00<?, ?it/s]

{'loss': 0.2625, 'grad_norm': 11.34455394744873, 'learning_rate': 4.3163581221778226e-05, 'epoch': 1.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5787972807884216, 'eval_accuracy': 0.911697247706422, 'eval_runtime': 1.4545, 'eval_samples_per_second': 599.512, 'eval_steps_per_second': 4.813, 'epoch': 1.0}
{'loss': 0.214, 'grad_norm': 29.352754592895508, 'learning_rate': 3.7769413020716535e-05, 'epoch': 2.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.8082430958747864, 'eval_accuracy': 0.8899082568807339, 'eval_runtime': 1.4519, 'eval_samples_per_second': 600.605, 'eval_steps_per_second': 4.821, 'epoch': 2.0}
{'loss': 0.1816, 'grad_norm': 2.591447353363037, 'learning_rate': 3.237524481965485e-05, 'epoch': 3.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5843978524208069, 'eval_accuracy': 0.9128440366972477, 'eval_runtime': 1.4556, 'eval_samples_per_second': 599.068, 'eval_steps_per_second': 4.809, 'epoch': 3.0}
{'loss': 0.1578, 'grad_norm': 16.45197105407715, 'learning_rate': 2.6981076618593168e-05, 'epoch': 4.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5983200073242188, 'eval_accuracy': 0.9197247706422018, 'eval_runtime': 1.4409, 'eval_samples_per_second': 605.194, 'eval_steps_per_second': 4.858, 'epoch': 4.0}
{'loss': 0.1352, 'grad_norm': 1.9843107461929321, 'learning_rate': 2.159714403081623e-05, 'epoch': 5.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.6065379977226257, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4411, 'eval_samples_per_second': 605.113, 'eval_steps_per_second': 4.858, 'epoch': 5.0}
{'loss': 0.1197, 'grad_norm': 21.910747528076172, 'learning_rate': 1.6202975829754548e-05, 'epoch': 6.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5545653700828552, 'eval_accuracy': 0.9208715596330275, 'eval_runtime': 1.4598, 'eval_samples_per_second': 597.337, 'eval_steps_per_second': 4.795, 'epoch': 6.0}
{'loss': 0.1034, 'grad_norm': 5.341484069824219, 'learning_rate': 1.0808807628692863e-05, 'epoch': 7.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.619475781917572, 'eval_accuracy': 0.9128440366972477, 'eval_runtime': 1.451, 'eval_samples_per_second': 600.96, 'eval_steps_per_second': 4.824, 'epoch': 7.0}
{'loss': 0.096, 'grad_norm': 0.40521571040153503, 'learning_rate': 5.414639427631178e-06, 'epoch': 8.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.5945138335227966, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.448, 'eval_samples_per_second': 602.228, 'eval_steps_per_second': 4.834, 'epoch': 8.0}
{'loss': 0.0889, 'grad_norm': 14.889159202575684, 'learning_rate': 3.070683985424109e-08, 'epoch': 9.0}


  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.6007155776023865, 'eval_accuracy': 0.9220183486238532, 'eval_runtime': 1.4604, 'eval_samples_per_second': 597.088, 'eval_steps_per_second': 4.793, 'epoch': 9.0}
{'train_runtime': 1017.9606, 'train_samples_per_second': 595.446, 'train_steps_per_second': 4.659, 'train_loss': 0.1510109045875995, 'epoch': 9.0}


CommitInfo(commit_url='https://huggingface.co/kasohrab/electra-distilled-sst/commit/593f17c76d2ba59af9f3f6046449f7c743fef377', commit_message='End of training', commit_description='', oid='593f17c76d2ba59af9f3f6046449f7c743fef377', pr_url=None, pr_revision=None, pr_num=None)

In [22]:
from huggingface_hub import HfApi

whoami = HfApi().whoami()
username = whoami['name']

print(f"https://huggingface.co/{username}/{repo_name}")

https://huggingface.co/kasohrab/electra-distilled-sst


## Results

We were able to achieve an `accuracy` of 0.8337, which is a very good result for a model of this size. Our distilled `Tiny-Bert` has 96% fewer parameters than the teacher `bert-base` and runs ~46.5x faster while preserving over 90% of BERT’s performance as measured on the SST-2 dataset.

| Model      | Parameters | Speed-up | Accuracy |
|------------|------------|----------|----------|
| BERT-base  | 109M       | 1x       | 93%      |
| tiny-BERT  | 4M         | 46.5x    | 83%      |
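
The 96% figure follows directly from the parameter counts in the table. Here is a minimal sketch of that arithmetic, using the rounded counts from the table (109M and 4M); the helper name `relative_reduction` is only illustrative and not part of the notebook:

```python
def relative_reduction(teacher_params: int, student_params: int) -> float:
    """Fraction of the teacher's parameters that the student no longer has."""
    return 1.0 - student_params / teacher_params


# Rounded parameter counts taken from the table above (assumed, not exact counts)
teacher_params = 109_000_000  # BERT-base
student_params = 4_000_000    # tiny-BERT

reduction = relative_reduction(teacher_params, student_params)
print(f"{reduction:.1%} fewer parameters")
```

With exact counts (for example from the `count_parameters` helper used earlier), the same function gives the precise reduction for any teacher/student pair.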

_Note: The [FastFormers paper](https://arxiv.org/abs/2010.13382) found that the biggest boost in performance is observed when the student has 6 or more layers. The [google/bert_uncased_L-2_H-128_A-2](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) model we used has only 2, which means that switching our student to, e.g., `distilbert-base-uncased` should give better accuracy._

If you are now planning to add task-specific knowledge distillation to your own models, I suggest taking a look at the [sagemaker-distillation](https://github.com/philschmid/knowledge-distillation-transformers-pytorch-sagemaker/blob/master/sagemaker-distillation.ipynb) notebook, which shows how to run task-specific knowledge distillation on Amazon SageMaker. For that example I created a script derived from this notebook to make it as easy as possible to use. You only need to define your `teacher_id`, `student_id`, and `dataset` config to run task-specific knowledge distillation for `text-classification`.

```python
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'teacher_id':'textattack/bert-base-uncased-SST-2',           
    'student_id':'google/bert_uncased_L-2_H-128_A-2',           
    'dataset_id':'glue',           
    'dataset_config':'sst2',             
    # distillation parameter
    'alpha': 0.5,
    'temperature': 4,
    # hpo parameter
    "run_hpo": True,
    "n_trials": 100,            
}

# create the Estimator
huggingface_estimator = HuggingFace(..., hyperparameters=hyperparameters)

# start knowledge distillation training
huggingface_estimator.fit()
```
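
To make the role of the `alpha` and `temperature` hyperparameters more concrete, here is a minimal sketch of a commonly used distillation loss, written in plain Python so the arithmetic is visible. It assumes that `alpha` weights the hard-label cross-entropy and `1 - alpha` weights the temperature-softened KL term (scaled by `temperature ** 2`); the function names are illustrative, and the training script's own loss may differ in detail.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    """Blend hard-label cross-entropy with a soft-target KL term.

    Assumption: alpha weights the hard-label loss, (1 - alpha) the soft loss.
    """
    # Hard-label cross-entropy of the student at temperature 1
    hard_loss = -math.log(softmax(student_logits)[label])

    # KL(teacher || student) on temperature-softened distributions
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = sum(t * math.log(t / s)
                    for t, s in zip(p_teacher, p_student) if t > 0)

    # temperature ** 2 keeps the soft-term gradients on a comparable scale
    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss


# Example: a confident teacher and a less confident student on a 2-class task
loss = distillation_loss([1.0, 0.2], [3.0, -1.0], label=0,
                         alpha=0.5, temperature=4.0)
print(f"distillation loss: {loss:.4f}")
```

With `alpha` close to 1 the student is trained almost entirely on the ground-truth labels, while smaller values put more weight on matching the teacher's softened predictions.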