1. Use a different optimizer strategy:
    a. Stochastic Weight Averaging (SWA)
    b. Reinitializing Transformer layers (pooler and the last few layers)
    c. Pooler block representation on the intermediate layers

2. Train a bigger DeBERTa model, using optimization techniques (quantization etc.)
3. Train a RoBERTa model, because it is smaller and much faster
4. Train a DistilBERT model and use a fast tokenizer, for the efficiency track

Difficult
5. Knowledge Distillation (multiple teacher models) - requires pseudo labels, a KD loss function...
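For the knowledge distillation idea, one common way to combine several teachers in a regression setting is to average their predictions into pseudo labels and blend a hard-label loss with a teacher-matching loss. The sketch below is framework-free arithmetic only; the helper names, the `alpha` weight, and the use of MSE for both terms are assumptions, not something this notebook defines.

```python
from statistics import fmean

def mse(pred, target):
    """Mean squared error between two equal-length score vectors."""
    return fmean((p - t) ** 2 for p, t in zip(pred, target))

def average_teachers(teacher_preds):
    """Average per-target predictions from several teachers into pseudo labels."""
    return [fmean(col) for col in zip(*teacher_preds)]

def kd_loss(student_pred, hard_label, teacher_preds, alpha=0.5):
    """Blend the ground-truth loss with the teacher-matching loss.

    alpha=1.0 ignores the teachers; alpha=0.0 trains on pseudo labels only.
    """
    pseudo_label = average_teachers(teacher_preds)
    hard_term = mse(student_pred, hard_label)
    soft_term = mse(student_pred, pseudo_label)
    return alpha * hard_term + (1 - alpha) * soft_term

# Hypothetical scores for two targets from a student and two teachers.
student = [3.0, 2.5]
label = [3.5, 2.5]
teachers = [[3.2, 2.4], [3.4, 2.6]]
print(kd_loss(student, label, teachers, alpha=0.5))
```

With tensors, the same structure would apply per batch; for unlabeled text only the teacher term is available, which corresponds to `alpha=0.0`.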


Plan
1. Do SWA (can be done with Composer, but that is a different library and might not be consistent)
2. Train DeBERTa-Large (using an optimization method), fine-tuning with the best combination of techniques found so far for DeBERTa.
3. Do knowledge distillation from DeBERTa-Base and DeBERTa-Large to DeBERTa-XSmall/DeBERTa-Small.
4. Hyperparameter tuning on DeBERTa-XSmall/DeBERTa-Small on a smaller sample size.
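The core of SWA, the first step of the plan, is a running average of the weights collected at several points late in training. The sketch below shows only that averaging rule on hypothetical scalar "weights"; the function name and checkpoint values are made up for illustration. In PyTorch, `torch.optim.swa_utils.AveragedModel` maintains the same kind of running average over model parameters.

```python
def update_swa(swa_weights, new_weights, n_averaged):
    """Fold one more checkpoint into the running average of weights.

    Implements w_swa <- w_swa + (w - w_swa) / (n_averaged + 1),
    which equals the plain mean of all checkpoints seen so far.
    """
    if n_averaged == 0:
        return dict(new_weights)
    return {
        name: swa_weights[name] + (new_weights[name] - swa_weights[name]) / (n_averaged + 1)
        for name in new_weights
    }

# Hypothetical scalar "weights" captured at the end of three epochs.
checkpoints = [
    {"w": 1.0, "b": 0.0},
    {"w": 2.0, "b": 0.5},
    {"w": 3.0, "b": 1.0},
]

swa = {}
for n, ckpt in enumerate(checkpoints):
    swa = update_swa(swa, ckpt, n_averaged=n)

print(swa)  # {'w': 2.0, 'b': 0.5}
```

The incremental form avoids keeping every checkpoint in memory, which matters once the "weights" are full transformer state dicts.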

In [1]:
!pip install iterative-stratification
!pip install coolname

Collecting iterative-stratification
  Downloading iterative_stratification-0.1.7-py3-none-any.whl (8.5 kB)
Installing collected packages: iterative-stratification
Successfully installed iterative-stratification-0.1.7
Collecting coolname
  Downloading coolname-2.0.0-py2.py3-none-any.whl (37 kB)
Installing collected packages: coolname
Successfully installed coolname-2.0.0

In [2]:
import os
import gc
import copy
import time
import random
import string
import joblib

# For data manipulation
import numpy as np
import pandas as pd

# Pytorch Imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader

# Utils
from tqdm import tqdm
from collections import defaultdict
from feedback_custom_funtions import loss_fn, optimizer_setup, FeedBackDataset, RMSELoss, compute_metrics
from model_building import MeanPooling, MaxPooling, MinPooling, AttentionPooling, FeedBackModel
from coolname import generate_slug

# For splitting data
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# For Transformer Models
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import AdamW, get_linear_schedule_with_warmup
from transformers import DataCollatorWithPadding
from transformers import Trainer, TrainingArguments
from transformers.modeling_outputs import SequenceClassifierOutput

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# For descriptive error messages
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

## Training Config

In [3]:
def set_seed(seed=42):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    # Seed Python's own RNG as well, since `random` is imported above
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Seed every visible GPU, not only the current one
    torch.cuda.manual_seed_all(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

hash_name = generate_slug(3)

config = {"seed": 42,
          "epochs": 5,
          "debug" : False,
          "model_name": "roberta-base",
          "PoolingLayer": AttentionPooling(768),
          "group" : "roberta-base-AP-LLRD" ,
          "loss_type": "smooth_l1", # ['mse', 'rmse', 'smooth_l1']
          "train_batch_size": 8,
          "valid_batch_size": 16,
          "fp16_enable": False,
          "max_length": 512,
          "layerwise" : True,
          "learning_rate": 1e-5,
          "decoder_lr": 1e-4,
          "weight_decay": 1e-6,
          "n_fold": 4,
          "n_accumulate": 1,
          "max_grad_norm": 1000,
          "num_classes": 6,
          "target_cols": ["cohesion", "syntax", "vocabulary", 
                          "phraseology", "grammar", "conventions"],
          "device": torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
          "hash_name": hash_name,
          "competition": "FeedBack3",
          "_wandb_kernel": "hazrul"
          }

set_seed(config['seed'])
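The config enables `"layerwise"` and the group name mentions LLRD (layer-wise learning rate decay), but `optimizer_setup` lives in a module that is not shown here. As a rough sketch of what such a scheme usually computes, the snippet below assigns the base learning rate to the top encoder layer and shrinks it for each layer below. The function name and the `decay=0.9` factor are assumptions for illustration, not values taken from this notebook.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.9):
    """Return one learning rate per encoder layer, index 0 = closest to the input.

    The top layer keeps base_lr; each layer below it is scaled by `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]

# roberta-base has 12 hidden layers; 1e-5 matches config["learning_rate"].
rates = layerwise_learning_rates(base_lr=1e-5, num_layers=12, decay=0.9)
for layer, lr in enumerate(rates):
    print(f"layer {layer:2d}: lr = {lr:.2e}")
```

In a setup like this, the regression head would presumably get its own, larger rate (here `config["decoder_lr"]`), since it is trained from scratch.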

In [4]:
if not config["debug"]:    
    import wandb

    try:
        from kaggle_secrets import UserSecretsClient
        user_secrets = UserSecretsClient()
        api_key = user_secrets.get_secret("WANDB_API_KEY")
        wandb.login(key=api_key)
        anony = None
        print("wandb Logged in Successfully")
    except Exception:
        anony = "must"
        print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name WANDB_API_KEY. \nGet your W&B access token from here: https://wandb.ai/authorize')
else:
    os.environ["WANDB_DISABLED"] = "true"
    print("Debugging...")

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


wandb Logged in Successfully


## Data Loading

In [5]:
df = pd.read_csv("../input/feedback-prize-english-language-learning/train.csv")
mskf = MultilabelStratifiedKFold(n_splits=config['n_fold'], shuffle=True, random_state=config['seed'])

for fold, (train_idx, val_idx) in enumerate(mskf.split(X=df, y=df[config['target_cols']])):
    df.loc[val_idx , "kfold"] = int(fold)
    
df["kfold"] = df["kfold"].astype(int)
df.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions,kfold
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0,2
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5,0
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5,1
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0,3
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5,3


In [6]:
tokenizer = AutoTokenizer.from_pretrained(config["model_name"])
config["tokenizer"] = tokenizer

collate_fn = DataCollatorWithPadding(tokenizer=config['tokenizer'])


## Training Setup

In [7]:
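class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers versions pass to compute_loss
        # The custom model takes input_ids and attention_mask positionally
        outputs = model(inputs['input_ids'], inputs['attention_mask'])
        # Targets arrive under the 'target' key (see label_names in TrainingArguments)
        loss = loss_fn(outputs.logits, inputs['target'], loss_type=config['loss_type'])
        return (loss, outputs) if return_outputs else loss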
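`loss_fn` is imported from a module that is not shown, so its exact behaviour is unknown here. Assuming `"smooth_l1"` maps to the standard Smooth L1 loss (the definition used by `torch.nn.SmoothL1Loss` with `beta=1.0` and mean reduction), the per-element rule can be sketched in plain Python as follows; the function name and the sample values are illustrative only.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            # Small error: behaves like half the squared error
            total += 0.5 * diff ** 2 / beta
        else:
            # Large error: grows linearly, so outliers dominate less than with MSE
            total += diff - 0.5 * beta
    return total / len(pred)

# Hypothetical predictions and labels for three targets.
print(smooth_l1([3.0, 2.5, 4.0], [3.5, 2.5, 2.0]))
```

Because the essay scores lie on a narrow scale, most errors fall in the quadratic region, where this loss is simply half the squared error.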

In [8]:
for fold in range(0, config['n_fold']):
    print(f"========== Fold: {fold} ==========")
    
    if not config["debug"]:
        run = wandb.init(project=config['competition'], 
                         config=config,
                         job_type='Train',
                         group=config['group'],
                         tags=[config['model_name'], config['loss_type']],
                         name=f'{config["hash_name"]}-fold-{fold}',
                         anonymous=anony)

    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)

    train_dataset = FeedBackDataset(df_train, tokenizer=config['tokenizer'], max_length=config['max_length'], target_label = config["target_cols"])
    valid_dataset = FeedBackDataset(df_valid, tokenizer=config['tokenizer'], max_length=config['max_length'], target_label = config["target_cols"])

    model = FeedBackModel(config['model_name'], config["num_classes"], PoolingLayer = config["PoolingLayer"]).to(config['device'])

    # Define Optimizer and Scheduler
    optimizer, scheduler = optimizer_setup(model=model, 
                                           config=config, 
                                           train_dataset_size =len(train_dataset),
                                           layerwise = config["layerwise"]
                                          )

    training_args = TrainingArguments(
        output_dir=f"outputs-{fold}/",
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=config['train_batch_size'],
        per_device_eval_batch_size=config['valid_batch_size'],
        num_train_epochs= config['epochs'],
        learning_rate= config['learning_rate'],
        weight_decay= config['weight_decay'],
        gradient_accumulation_steps=config['n_accumulate'],
        max_grad_norm=config['max_grad_norm'],
        seed= config['seed'],
        fp16  = config["fp16_enable"],
        fp16_full_eval  = config["fp16_enable"],
        half_precision_backend = "cuda_amp",
        group_by_length = True,
        metric_for_best_model= 'eval_mcrmse',
        load_best_model_at_end=True,
        greater_is_better=False,
        save_strategy="epoch",
        save_total_limit=1,
        report_to = "wandb",
        label_names = ["target"]
    )


    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        data_collator=collate_fn,
        optimizers=(optimizer, scheduler),
        compute_metrics=compute_metrics
    )

    trainer.train()

    #evaluation = trainer.evaluate()
    #run.log({"score_mcrmse": evaluation["eval_mcrmse"], "eval_runtime": evaluation["eval_runtime"]})
    
    if not config["debug"]:
        run.finish()

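    # The trainer and optimizer also hold references to the model, so drop them too
    del model, trainer, optimizer, scheduler, train_dataset, valid_dataset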

    torch.cuda.empty_cache()
    gc.collect()

wandb: Currently logged in as: hazrulakmal. Use `wandb login --relogin` to force relogin

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
***** Running training *****
  Num examples = 2933
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1835
Automatic Weights & Biases logging enabled, to d
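The step count in the log follows directly from the dataset size and batch size. The short check below reproduces it, assuming a single device and no gradient accumulation (both consistent with the log lines above); the variable names are chosen for illustration.

```python
import math

num_examples = 2933   # "Num examples" reported by the Trainer
batch_size = 8        # config["train_batch_size"]
epochs = 5            # config["epochs"]

# The last, partially filled batch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 367 1835
```

The per-epoch value also explains the checkpoint names in the log, which advance in multiples of 367.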

Epoch,Training Loss,Validation Loss,Mcrmse,Cohesion Rmse,Syntax Rmse,Vocabulary Rmse,Phraseology Rmse,Grammar Rmse,Conventions Rmse
1,0.539,0.193958,0.630112,0.62705,0.599265,0.550955,0.655828,0.688224,0.659351
2,0.1312,0.107456,0.46414,0.495524,0.451732,0.417282,0.470941,0.493352,0.456009
3,0.1159,0.113683,0.477408,0.510497,0.463868,0.42319,0.496818,0.517918,0.452157
4,0.1089,0.111055,0.472168,0.501292,0.459783,0.42332,0.473591,0.505691,0.46933
5,0.1013,0.104489,0.457654,0.492714,0.449945,0.413291,0.464335,0.47921,0.446429
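The Mcrmse column is the mean of the six per-target RMSE columns. `compute_metrics` itself is defined in a module not shown here, so the helpers below are a sketch of the metric's standard definition rather than the notebook's code; the check uses the epoch 5 row of the table above.

```python
import math
from statistics import fmean

def rmse(pred, target):
    """Root mean squared error for one target column."""
    return math.sqrt(fmean((p - t) ** 2 for p, t in zip(pred, target)))

def mcrmse(column_rmses):
    """Mean columnwise RMSE: the plain average of the per-target RMSE values."""
    return fmean(column_rmses)

# Epoch 5 per-target RMSE values, in column order:
# cohesion, syntax, vocabulary, phraseology, grammar, conventions.
epoch5 = [0.492714, 0.449945, 0.413291, 0.464335, 0.47921, 0.446429]
print(round(mcrmse(epoch5), 6))  # 0.457654
```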


***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-0/checkpoint-367
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-0/checkpoint-734
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-0/checkpoint-367] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-0/checkpoint-1101
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-0/checkpoint-1468
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-0/checkpoint-1101] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 978
  Batch size 

Run history:
eval/cohesion_rmse,█▁▂▁▁
eval/conventions_rmse,█▁▁▂▁
eval/grammar_rmse,█▁▂▂▁
eval/loss,█▁▂▂▁
eval/mcrmse,█▁▂▂▁
eval/phraseology_rmse,█▁▂▁▁
eval/runtime,▄█▁▁▆
eval/samples_per_second,▅▁██▂
eval/steps_per_second,▅▁▇█▃
eval/syntax_rmse,█▁▂▁▁

Run summary:
eval/cohesion_rmse,0.49271
eval/conventions_rmse,0.44643
eval/grammar_rmse,0.47921
eval/loss,0.10449
eval/mcrmse,0.45765
eval/phraseology_rmse,0.46434
eval/runtime,19.0151
eval/samples_per_second,51.433
eval/steps_per_second,3.261
eval/syntax_rmse,0.44995


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file https://huggingface.co/roberta-base/r

Epoch,Training Loss,Validation Loss,Mcrmse,Cohesion Rmse,Syntax Rmse,Vocabulary Rmse,Phraseology Rmse,Grammar Rmse,Conventions Rmse
1,0.5211,0.171043,0.590558,0.612549,0.547185,0.540441,0.569973,0.673359,0.59984
2,0.1353,0.117078,0.485391,0.522343,0.457527,0.449494,0.489818,0.5175,0.475665
3,0.1172,0.117507,0.486475,0.521845,0.468223,0.459043,0.497443,0.50447,0.467827
4,0.1058,0.110089,0.470265,0.503883,0.452467,0.432339,0.469959,0.494831,0.468109
5,0.0971,0.109505,0.468875,0.502636,0.45106,0.426544,0.469324,0.494685,0.469003


***** Running Evaluation *****
  Num examples = 977
  Batch size = 16
Saving model checkpoint to outputs-1/checkpoint-367
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 977
  Batch size = 16
Saving model checkpoint to outputs-1/checkpoint-734
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-1/checkpoint-367] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 977
  Batch size = 16
Saving model checkpoint to outputs-1/checkpoint-1101
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 977
  Batch size = 16
Saving model checkpoint to outputs-1/checkpoint-1468
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-1/checkpoint-734] due to args.save_total_limit
Deleting older checkpoint [outputs-1/checkpoint-1101] due to args.

Run history:
eval/cohesion_rmse,█▂▂▁▁
eval/conventions_rmse,█▁▁▁▁
eval/grammar_rmse,█▂▁▁▁
eval/loss,█▂▂▁▁
eval/mcrmse,█▂▂▁▁
eval/phraseology_rmse,█▂▃▁▁
eval/runtime,▁▂▅▄█
eval/samples_per_second,█▇▄▅▁
eval/steps_per_second,█▇▄▅▁
eval/syntax_rmse,█▁▂▁▁

Run summary:
eval/cohesion_rmse,0.50264
eval/conventions_rmse,0.469
eval/grammar_rmse,0.49469
eval/loss,0.10951
eval/mcrmse,0.46888
eval/phraseology_rmse,0.46932
eval/runtime,19.2761
eval/samples_per_second,50.685
eval/steps_per_second,3.216
eval/syntax_rmse,0.45106


Epoch,Training Loss,Validation Loss,Mcrmse,Cohesion Rmse,Syntax Rmse,Vocabulary Rmse,Phraseology Rmse,Grammar Rmse,Conventions Rmse
1,0.5191,0.224094,0.675413,0.648953,0.634917,0.607193,0.615324,0.909814,0.636277
2,0.1278,0.137639,0.527508,0.53285,0.508603,0.477475,0.526476,0.565082,0.554563
3,0.1135,0.110038,0.470045,0.502494,0.455169,0.42596,0.473161,0.498543,0.464943
4,0.1033,0.111958,0.474695,0.495847,0.465354,0.437505,0.477524,0.494207,0.477732
5,0.0944,0.109416,0.468937,0.494689,0.45508,0.426312,0.474038,0.494648,0.468854


***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-2/checkpoint-367
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-2/checkpoint-734
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-2/checkpoint-367] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-2/checkpoint-1101
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-2/checkpoint-734] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-2/checkpoint-1468
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 978
  Batch size =

Run history:
eval/cohesion_rmse,█▃▁▁▁
eval/conventions_rmse,█▅▁▂▁
eval/grammar_rmse,█▂▁▁▁
eval/loss,█▃▁▁▁
eval/mcrmse,█▃▁▁▁
eval/phraseology_rmse,█▄▁▁▁
eval/runtime,▁▃█▆▁
eval/samples_per_second,█▆▁▃█
eval/steps_per_second,█▆▁▃█
eval/syntax_rmse,█▃▁▁▁

Run summary:
eval/cohesion_rmse,0.49469
eval/conventions_rmse,0.46885
eval/grammar_rmse,0.49465
eval/loss,0.10942
eval/mcrmse,0.46894
eval/phraseology_rmse,0.47404
eval/runtime,18.932
eval/samples_per_second,51.659
eval/steps_per_second,3.275
eval/syntax_rmse,0.45508


Epoch,Training Loss,Validation Loss,Mcrmse,Cohesion Rmse,Syntax Rmse,Vocabulary Rmse,Phraseology Rmse,Grammar Rmse,Conventions Rmse
1,0.5112,0.129203,0.510959,0.558011,0.488387,0.507209,0.479431,0.550225,0.482488
2,0.1283,0.129257,0.511139,0.549146,0.540371,0.483972,0.48757,0.507198,0.498576
3,0.1111,0.112411,0.475573,0.508389,0.463075,0.430437,0.459601,0.514446,0.47749
4,0.1013,0.108855,0.467917,0.495785,0.465173,0.430227,0.459563,0.500413,0.456339
5,0.0919,0.109735,0.469843,0.498295,0.467552,0.431566,0.461638,0.50205,0.457959


***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-3/checkpoint-367
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-3/checkpoint-734
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-3/checkpoint-1101
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-3/checkpoint-367] due to args.save_total_limit
Deleting older checkpoint [outputs-3/checkpoint-734] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs-3/checkpoint-1468
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Deleting older checkpoint [outputs-3/checkpoint-1101] due to args.

Run history:
eval/cohesion_rmse,█▇▂▁▁
eval/conventions_rmse,▅█▅▁▁
eval/grammar_rmse,█▂▃▁▁
eval/loss,██▂▁▁
eval/mcrmse,██▂▁▁
eval/phraseology_rmse,▆█▁▁▂
eval/runtime,▁▅▅█▇
eval/samples_per_second,█▄▄▁▂
eval/steps_per_second,█▃▃▁▂
eval/syntax_rmse,▃█▁▁▁

Run summary:
eval/cohesion_rmse,0.49829
eval/conventions_rmse,0.45796
eval/grammar_rmse,0.50205
eval/loss,0.10973
eval/mcrmse,0.46984
eval/phraseology_rmse,0.46164
eval/runtime,19.161
eval/samples_per_second,51.041
eval/steps_per_second,3.236
eval/syntax_rmse,0.46755
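To compare this run with other configurations, the four folds can be collapsed into a single cross-validation score. The sketch below takes the per-epoch validation MCRMSE values from the four result tables above, picks the best epoch per fold, and averages them. Selecting the minimum per fold is an assumption that mirrors `load_best_model_at_end=True` with `greater_is_better=False`; the variable names are illustrative.

```python
from statistics import fmean

# Validation MCRMSE per epoch, copied from the four result tables.
fold_history = {
    0: [0.630112, 0.46414, 0.477408, 0.472168, 0.457654],
    1: [0.590558, 0.485391, 0.486475, 0.470265, 0.468875],
    2: [0.675413, 0.527508, 0.470045, 0.474695, 0.468937],
    3: [0.510959, 0.511139, 0.475573, 0.467917, 0.469843],
}

# Lower is better, so the best epoch is the one with the minimum score.
best_score = {fold: min(scores) for fold, scores in fold_history.items()}
best_epoch = {fold: scores.index(min(scores)) + 1 for fold, scores in fold_history.items()}
cv_score = fmean(best_score.values())

print(best_epoch)  # {0: 5, 1: 5, 2: 5, 3: 4}
print(round(cv_score, 4))
```

Note that for the last fold the run summary reports the final-epoch value (0.46984), while the best epoch by this criterion is epoch 4 (0.467917).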
