# English to Twi Translation Model

This notebook demonstrates fine-tuning a pretrained Hugging Face model to translate English sentences into Twi, a language native to Ghana. By fine-tuning a transformer-based model, I aim to understand and recognize the value of machine translation, a task that presents distinct linguistic challenges because of the differences in grammatical structure and vocabulary between the two languages.



In [1]:
# import libraries

import pandas as pd
from sklearn.model_selection import train_test_split


## Data Sourcing

The data used in this task comes from Zenodo (record 4432117). It is a verified dataset of more than 25,000 English sentences paired with their Akuapem Twi translations.

In [2]:
# Get data

!wget -O data.csv "https://zenodo.org/records/4432117/files/verified_data.csv?download=1"

--2024-10-05 12:12:10--  https://zenodo.org/records/4432117/files/verified_data.csv?download=1
Resolving zenodo.org (zenodo.org)... 188.185.79.172, 188.184.103.159, 188.184.98.238, ...
Connecting to zenodo.org (zenodo.org)|188.185.79.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1795088 (1.7M) [text/plain]
Saving to: ‘data.csv’


2024-10-05 12:12:12 (2.31 MB/s) - ‘data.csv’ saved [1795088/1795088]



In [3]:
# load csv data

df = pd.read_csv('data.csv')  # same path that wget wrote to above
df = df.rename(columns={"English": "input", "Akuapem Twi": "target"})

df

| Unnamed: 0 | input | target |
|---|---|---|
| 0 | What she lacks in charisma she makes up for wi... | Nea onni ho adwempa no de adwumaden na ɛba. |
| 1 | There was nothing I could do about it. | Na biribiara nni hɛ a metumi ayɔ |
| 2 | Kwaku saw John and Abena holding hands. | Kwaku hui se John ne Abena kurakura wɛn nsa. |
| 3 | Can you stay till 2:30? | So wubetumi atena ha akosi nnɛnmienu npaamu ad... |
| 4 | You haven't got much time. | Wonni mmre |
| ... | ... | ... |
| 25415 | I'm not a killer | menyɛ owudini |
| 25416 | what are you searching for? | dɛn na worehwehwɛ? |
| 25417 | kwabena went out | Kwabena fii adi. |
| 25418 | kwabena stepped outside | Kwabena sii aduo |
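
The preview shows a leftover `Unnamed: 0` column, which is the row index saved in the CSV rather than a feature. Below is a minimal cleanup sketch, assuming the columns have already been renamed to `input` and `target` as in the cell above. The helper name `clean_pairs` and the toy frame are hypothetical and only illustrate the idea; whether the real file contains missing or empty rows is not something I have checked here.

```python
import pandas as pd

def clean_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only usable (input, target) sentence pairs."""
    # "Unnamed: 0" is the saved row index, not part of the sentence pair.
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Drop rows where either side of the pair is missing.
    df = df.dropna(subset=["input", "target"]).copy()
    # Strip surrounding whitespace, then drop rows that end up empty.
    for col in ("input", "target"):
        df[col] = df[col].astype(str).str.strip()
    df = df[(df["input"] != "") & (df["target"] != "")]
    return df.reset_index(drop=True)

# Toy frame for illustration only (the None entry is made up).
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "input": ["You haven't got much time.", None, "  I'm not a killer "],
    "target": ["Wonni mmre", "dɛn na worehwehwɛ?", "menyɛ owudini"],
})
cleaned = clean_pairs(sample)
print(cleaned)
```

Running a step like this before the split keeps missing values from reaching the tokenizer, which expects plain strings.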


### Train/validation/test split using sklearn

In [4]:
# Split the data into train, validation, and test sets (80%, 10%, 10%)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)

print(f"Train size: {len(train_df)}, Validation size: {len(val_df)}, Test size: {len(test_df)}")

Train size: 20336, Validation size: 2542, Test size: 2542
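
The printed sizes can be reproduced by hand. The sketch below assumes that each split rounds the held-out part up and gives the remainder to the other part, which is my understanding of how `train_test_split` handles a fractional `test_size`. The total of 25,420 rows is simply the sum of the three printed sizes, and `two_stage_split_sizes` is a hypothetical helper name.

```python
import math
from fractions import Fraction

def two_stage_split_sizes(n_rows, holdout=Fraction(1, 5), test_share=Fraction(1, 2)):
    """Sizes produced by holding out 20% and then halving the held-out part."""
    # Stage 1: hold out a fraction of all rows (rounded up); the rest is train.
    n_holdout = math.ceil(n_rows * holdout)
    n_train = n_rows - n_holdout
    # Stage 2: split the held-out rows into validation and test.
    n_test = math.ceil(n_holdout * test_share)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

sizes = two_stage_split_sizes(25420)
print(sizes)  # → (20336, 2542, 2542)
```

Exact fractions are used instead of floats so the rounding step cannot be affected by floating-point error.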


## Using a Pre-trained Model - MarianMTModel

I am using a pretrained model because it allows me to build on the extensive language knowledge the model has already acquired from large-scale multilingual datasets. Pretrained models like those from Hugging Face's library have learned general language patterns, structures, and representations from diverse text corpora. By starting with a model that already understands the fundamentals of language translation, I can focus on fine-tuning it for the specific task of translating English to Twi, which significantly reduces the data requirements and training time compared to training a model from scratch.

### Why I Chose MarianMTModel

I chose MarianMTModel because it is a specialized transformer-based model designed specifically for machine translation tasks. Here’s why it is ideal for this project:

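- Multilingual Translation: MarianMT checkpoints cover a wide range of language pairs. The checkpoint loaded below, Helsinki-NLP/opus-mt-en-ROMANCE, was trained to translate English into Romance languages rather than Twi, so fine-tuning has to teach it the target language from this dataset. Its English encoder and translation-oriented architecture are what make it a useful starting point when training data is limited.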

- Pretrained for Translation: MarianMTModel is already optimized for translation tasks, unlike general models such as BERT or GPT. This makes it more effective for sentence-level translations, capturing the linguistic differences between English and Twi.

- Efficiency: MarianMTModel is lightweight and efficient, making it suitable for fine-tuning even on a smaller subset of the dataset. Given my choice to work with 5,000 data points, its efficiency is crucial for handling the task without excessive computational overhead.

Using MarianMTModel enables me to take advantage of its pre-existing translation capabilities while fine-tuning it for the specific needs of English to Twi translation.

In [5]:
from transformers import MarianMTModel, MarianTokenizer

# Load a pre-trained MarianMT model and tokenizer
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)



In [6]:
def tokenize_sentences(df, tokenizer, max_len=128):
    '''
    Tokenize the input and target columns

    Args:
        df -> pandas dataframe containing the input and the target
        tokenizer -> tokenizer instance used to tokenize the corpus
        max_len -> maximum number of tokens

    Returns:
        input encodings
        target encodings
    '''
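    # Source sentences use the tokenizer's default (source-language) settings.
    input_encodings = tokenizer(list(df['input']), padding=True, truncation=True, max_length=max_len, return_tensors='pt')
    # text_target= makes the Marian tokenizer encode the targets with its target-language settings.
    target_encodings = tokenizer(text_target=list(df['target']), padding=True, truncation=True, max_length=max_len, return_tensors='pt')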

    return input_encodings, target_encodings

train_input_encodings, train_target_encodings = tokenize_sentences(train_df, tokenizer)
val_input_encodings, val_target_encodings = tokenize_sentences(val_df, tokenizer)
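
With `padding=True`, every sentence in a call is padded to the longest sentence in that call, and `truncation=True` cuts anything longer than `max_len`. The attention mask returned by the tokenizer records which positions are real tokens, so it can be used to see how much of each tensor is padding and how many sentences reached the length cap. The helper `padding_report` below is a hypothetical sketch that runs on a toy mask; on the real data it would be called on something like `train_input_encodings['attention_mask']`.

```python
import torch

def padding_report(attention_mask: torch.Tensor, max_len: int = 128) -> dict:
    """Summarise real-token lengths from a 0/1 attention mask (hypothetical helper)."""
    lengths = attention_mask.sum(dim=1)  # number of real tokens per sentence
    return {
        "padded_width": attention_mask.shape[1],
        "mean_length": lengths.float().mean().item(),
        "padding_fraction": 1.0 - lengths.sum().item() / attention_mask.numel(),
        # Sentences that fill max_len may have been truncated.
        "at_max_len": int((lengths >= max_len).sum().item()),
    }

# Toy mask: three sentences padded to width 6.
toy_mask = torch.tensor([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
])
report = padding_report(toy_mask, max_len=6)
print(report)
```

A high padding fraction suggests that padding per batch (for example with a data collator) could save compute, though I have not measured that for this dataset.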

In [7]:
import torch
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    """
    A custom Dataset class for handling translation tasks.

    This class inherits from torch.utils.data.Dataset and is used to prepare
    tokenized input and target data for a translation model
    """

    def __init__(self, input_encodings, target_encodings):
        '''
        class constructor

        Args:
            input_encodings (dict): A dictionary containing the tokenized input (source language) data.
            target_encodings (dict): A dictionary containing the tokenized target (translated language) data.
        '''
        self.input_encodings = input_encodings
        self.target_encodings = target_encodings

    def __len__(self):
        """ Returns the number of samples in the dataset (based on input encodings length) """
        return len(self.input_encodings['input_ids'])

    def __getitem__(self, idx):
        """ Retrieves a single sample of input and target encodings as tensors, for use by the model. """
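        # The encodings are already tensors (return_tensors='pt'), so copy them with
        # clone().detach() instead of torch.tensor(), which emits a UserWarning.
        item = {key: val[idx].clone().detach() for key, val in self.input_encodings.items()}
        item['labels'] = self.target_encodings['input_ids'][idx].clone().detach()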
        return item

train_dataset = TranslationDataset(train_input_encodings, train_target_encodings)
val_dataset = TranslationDataset(val_input_encodings, val_target_encodings)
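
One detail the dataset above does not handle is that the `labels` still contain padding token ids, so the loss is also computed on padded positions. A common remedy is to replace padding ids in the labels with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below is an assumption about how this could be added, not part of the original notebook. `mask_label_padding` is a hypothetical helper and `TOY_PAD_ID` is a made-up value; in the notebook the real id would come from `tokenizer.pad_token_id`.

```python
import torch

def mask_label_padding(label_ids: torch.Tensor, pad_token_id: int,
                       ignore_index: int = -100) -> torch.Tensor:
    """Return a copy of label_ids with padding positions set to ignore_index."""
    labels = label_ids.clone()  # leave the original encodings untouched
    labels[labels == pad_token_id] = ignore_index
    return labels

TOY_PAD_ID = 0  # illustrative only, not the real Marian padding id
toy_labels = torch.tensor([[5, 8, 2, TOY_PAD_ID, TOY_PAD_ID]])
masked = mask_label_padding(toy_labels, TOY_PAD_ID)
print(masked)
```

Applied to the notebook, this would be called on `target_encodings['input_ids']` before building the datasets. Doing so changes the loss values, so the numbers would no longer be directly comparable with the training log shown in this notebook.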

In [8]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=1000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()
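
The step total and learning rates in the training log can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` defaults that the cell above leaves unset: a base learning rate of 5e-5 with linear decay and no warmup. The function names are hypothetical.

```python
import math

def total_steps(n_train: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for the whole run: batches per epoch times epochs."""
    return math.ceil(n_train / batch_size) * epochs

def linear_lr(step: int, total: int, base_lr: float = 5e-5) -> float:
    """Linearly decayed learning rate with no warmup (assumed default schedule)."""
    return base_lr * (total - step) / total

steps = total_steps(n_train=20336, batch_size=8, epochs=2)
print(steps)  # → 5084

lr_at_10 = linear_lr(10, steps)
lr_at_1000 = linear_lr(1000, steps)
print(lr_at_10, lr_at_1000)
```

The 5084 steps match the progress bar total, and the two computed rates agree with the logged `learning_rate` values of about 4.99e-05 at step 10 and 4.02e-05 at step 1000.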

  0%|          | 10/5084 [00:24<3:04:43,  2.18s/it]

{'loss': 2.0769, 'grad_norm': 2.7659943103790283, 'learning_rate': 4.9901652242328875e-05, 'epoch': 0.0}


  0%|          | 20/5084 [01:52<18:40:35, 13.28s/it]

{'loss': 1.4154, 'grad_norm': 2.5821735858917236, 'learning_rate': 4.9803304484657754e-05, 'epoch': 0.01}


  1%|          | 30/5084 [02:23<4:22:02,  3.11s/it] 

{'loss': 1.263, 'grad_norm': 2.795109748840332, 'learning_rate': 4.9704956726986626e-05, 'epoch': 0.01}


  1%|          | 40/5084 [03:02<4:46:56,  3.41s/it]

{'loss': 1.2266, 'grad_norm': 3.268669366836548, 'learning_rate': 4.9606608969315505e-05, 'epoch': 0.02}


  1%|          | 50/5084 [03:30<3:41:59,  2.65s/it]

{'loss': 1.1673, 'grad_norm': 2.8205726146698, 'learning_rate': 4.950826121164438e-05, 'epoch': 0.02}


  1%|          | 60/5084 [03:57<3:41:32,  2.65s/it]

{'loss': 0.9976, 'grad_norm': 3.3726956844329834, 'learning_rate': 4.940991345397325e-05, 'epoch': 0.02}


  1%|▏         | 70/5084 [04:23<3:32:09,  2.54s/it]

{'loss': 0.9445, 'grad_norm': 2.43869948387146, 'learning_rate': 4.931156569630213e-05, 'epoch': 0.03}


  2%|▏         | 80/5084 [04:50<3:36:29,  2.60s/it]

{'loss': 0.8657, 'grad_norm': 1.9044607877731323, 'learning_rate': 4.9213217938631e-05, 'epoch': 0.03}


  2%|▏         | 90/5084 [05:15<3:29:29,  2.52s/it]

{'loss': 0.8368, 'grad_norm': 2.039057970046997, 'learning_rate': 4.911487018095988e-05, 'epoch': 0.04}


  2%|▏         | 100/5084 [05:40<3:35:02,  2.59s/it]

{'loss': 0.9375, 'grad_norm': 3.101625919342041, 'learning_rate': 4.901652242328875e-05, 'epoch': 0.04}


  2%|▏         | 110/5084 [06:05<3:13:59,  2.34s/it]

{'loss': 0.8452, 'grad_norm': 3.0154526233673096, 'learning_rate': 4.891817466561763e-05, 'epoch': 0.04}


  2%|▏         | 120/5084 [06:29<3:19:58,  2.42s/it]

{'loss': 0.8958, 'grad_norm': 2.413696050643921, 'learning_rate': 4.88198269079465e-05, 'epoch': 0.05}


  3%|▎         | 130/5084 [06:55<3:29:05,  2.53s/it]

{'loss': 0.884, 'grad_norm': 2.152616500854492, 'learning_rate': 4.872147915027538e-05, 'epoch': 0.05}


  3%|▎         | 140/5084 [07:20<3:21:49,  2.45s/it]

{'loss': 0.8045, 'grad_norm': 2.7182273864746094, 'learning_rate': 4.862313139260425e-05, 'epoch': 0.06}


  3%|▎         | 150/5084 [07:44<3:30:34,  2.56s/it]

{'loss': 0.8015, 'grad_norm': 2.554842233657837, 'learning_rate': 4.8524783634933126e-05, 'epoch': 0.06}


  3%|▎         | 160/5084 [08:11<3:27:03,  2.52s/it]

{'loss': 0.7353, 'grad_norm': 2.2928402423858643, 'learning_rate': 4.8426435877262e-05, 'epoch': 0.06}


  3%|▎         | 170/5084 [08:34<3:15:05,  2.38s/it]

{'loss': 0.7724, 'grad_norm': 2.789350748062134, 'learning_rate': 4.832808811959088e-05, 'epoch': 0.07}


  4%|▎         | 180/5084 [08:59<3:26:21,  2.52s/it]

{'loss': 0.8064, 'grad_norm': 3.0411972999572754, 'learning_rate': 4.822974036191975e-05, 'epoch': 0.07}


  4%|▎         | 190/5084 [09:23<3:24:17,  2.50s/it]

{'loss': 0.7316, 'grad_norm': 2.0450236797332764, 'learning_rate': 4.813139260424863e-05, 'epoch': 0.07}


  4%|▍         | 200/5084 [09:47<3:21:30,  2.48s/it]

{'loss': 0.7331, 'grad_norm': 2.691394090652466, 'learning_rate': 4.80330448465775e-05, 'epoch': 0.08}


  4%|▍         | 210/5084 [10:12<3:16:03,  2.41s/it]

{'loss': 0.6745, 'grad_norm': 2.0936107635498047, 'learning_rate': 4.793469708890638e-05, 'epoch': 0.08}


  4%|▍         | 220/5084 [10:36<3:19:50,  2.47s/it]

{'loss': 0.7477, 'grad_norm': 2.3867506980895996, 'learning_rate': 4.783634933123525e-05, 'epoch': 0.09}


  5%|▍         | 230/5084 [11:01<3:09:31,  2.34s/it]

{'loss': 0.664, 'grad_norm': 2.1282670497894287, 'learning_rate': 4.7738001573564124e-05, 'epoch': 0.09}


  5%|▍         | 240/5084 [11:27<3:22:23,  2.51s/it]

{'loss': 0.7623, 'grad_norm': 2.2391905784606934, 'learning_rate': 4.7639653815892996e-05, 'epoch': 0.09}


  5%|▍         | 250/5084 [11:52<3:25:05,  2.55s/it]

{'loss': 0.7522, 'grad_norm': 2.5509066581726074, 'learning_rate': 4.7541306058221875e-05, 'epoch': 0.1}


  5%|▌         | 260/5084 [12:15<3:05:25,  2.31s/it]

{'loss': 0.7, 'grad_norm': 2.253870725631714, 'learning_rate': 4.744295830055075e-05, 'epoch': 0.1}


  5%|▌         | 270/5084 [12:39<3:17:32,  2.46s/it]

{'loss': 0.644, 'grad_norm': 2.3276069164276123, 'learning_rate': 4.7344610542879626e-05, 'epoch': 0.11}


  6%|▌         | 280/5084 [13:04<3:07:31,  2.34s/it]

{'loss': 0.6791, 'grad_norm': 1.7913743257522583, 'learning_rate': 4.72462627852085e-05, 'epoch': 0.11}


  6%|▌         | 290/5084 [13:28<3:18:08,  2.48s/it]

{'loss': 0.7741, 'grad_norm': 2.2301251888275146, 'learning_rate': 4.714791502753738e-05, 'epoch': 0.11}


  6%|▌         | 300/5084 [13:53<3:22:55,  2.55s/it]

{'loss': 0.6249, 'grad_norm': 2.2548835277557373, 'learning_rate': 4.704956726986625e-05, 'epoch': 0.12}


  6%|▌         | 310/5084 [14:18<3:23:31,  2.56s/it]

{'loss': 0.6597, 'grad_norm': 2.600221633911133, 'learning_rate': 4.695121951219512e-05, 'epoch': 0.12}


  6%|▋         | 320/5084 [14:41<3:01:49,  2.29s/it]

{'loss': 0.6144, 'grad_norm': 2.543311595916748, 'learning_rate': 4.6852871754523994e-05, 'epoch': 0.13}


  6%|▋         | 330/5084 [15:05<3:17:02,  2.49s/it]

{'loss': 0.7224, 'grad_norm': 2.675039768218994, 'learning_rate': 4.675452399685287e-05, 'epoch': 0.13}


  7%|▋         | 340/5084 [15:29<3:03:09,  2.32s/it]

{'loss': 0.7064, 'grad_norm': 2.9483652114868164, 'learning_rate': 4.6656176239181745e-05, 'epoch': 0.13}


  7%|▋         | 350/5084 [15:55<3:20:28,  2.54s/it]

{'loss': 0.6197, 'grad_norm': 2.4635422229766846, 'learning_rate': 4.6557828481510624e-05, 'epoch': 0.14}


  7%|▋         | 360/5084 [16:18<2:59:09,  2.28s/it]

{'loss': 0.7371, 'grad_norm': 2.2609305381774902, 'learning_rate': 4.6459480723839496e-05, 'epoch': 0.14}


  7%|▋         | 370/5084 [16:41<2:57:59,  2.27s/it]

{'loss': 0.6653, 'grad_norm': 2.30869722366333, 'learning_rate': 4.6361132966168375e-05, 'epoch': 0.15}


  7%|▋         | 380/5084 [17:07<3:22:10,  2.58s/it]

{'loss': 0.6589, 'grad_norm': 2.271439790725708, 'learning_rate': 4.626278520849725e-05, 'epoch': 0.15}


  8%|▊         | 390/5084 [17:31<3:06:53,  2.39s/it]

{'loss': 0.6572, 'grad_norm': 2.609375, 'learning_rate': 4.6164437450826126e-05, 'epoch': 0.15}


  8%|▊         | 400/5084 [17:56<3:19:29,  2.56s/it]

{'loss': 0.6433, 'grad_norm': 2.585216522216797, 'learning_rate': 4.6066089693155e-05, 'epoch': 0.16}


  8%|▊         | 410/5084 [18:19<2:57:15,  2.28s/it]

{'loss': 0.5586, 'grad_norm': 1.9950850009918213, 'learning_rate': 4.596774193548387e-05, 'epoch': 0.16}


  8%|▊         | 420/5084 [18:45<3:23:46,  2.62s/it]

{'loss': 0.6325, 'grad_norm': 2.2129452228546143, 'learning_rate': 4.586939417781275e-05, 'epoch': 0.17}


  8%|▊         | 430/5084 [19:09<3:12:21,  2.48s/it]

{'loss': 0.5613, 'grad_norm': 2.1260993480682373, 'learning_rate': 4.577104642014162e-05, 'epoch': 0.17}


  9%|▊         | 440/5084 [19:34<3:05:10,  2.39s/it]

{'loss': 0.6275, 'grad_norm': 2.2587502002716064, 'learning_rate': 4.56726986624705e-05, 'epoch': 0.17}


  9%|▉         | 450/5084 [19:59<3:16:33,  2.54s/it]

{'loss': 0.5664, 'grad_norm': 1.9509440660476685, 'learning_rate': 4.557435090479937e-05, 'epoch': 0.18}


  9%|▉         | 460/5084 [20:25<3:20:29,  2.60s/it]

{'loss': 0.5383, 'grad_norm': 1.997070074081421, 'learning_rate': 4.547600314712825e-05, 'epoch': 0.18}


  9%|▉         | 470/5084 [20:51<3:28:12,  2.71s/it]

{'loss': 0.6162, 'grad_norm': 2.345435857772827, 'learning_rate': 4.5377655389457124e-05, 'epoch': 0.18}


  9%|▉         | 480/5084 [21:16<3:13:37,  2.52s/it]

{'loss': 0.5669, 'grad_norm': 2.1728298664093018, 'learning_rate': 4.5279307631786e-05, 'epoch': 0.19}


 10%|▉         | 490/5084 [21:41<3:18:16,  2.59s/it]

{'loss': 0.5759, 'grad_norm': 2.7141809463500977, 'learning_rate': 4.518095987411487e-05, 'epoch': 0.19}


 10%|▉         | 500/5084 [22:06<3:12:45,  2.52s/it]

{'loss': 0.5877, 'grad_norm': 1.9842489957809448, 'learning_rate': 4.508261211644375e-05, 'epoch': 0.2}


 10%|█         | 510/5084 [22:31<3:13:49,  2.54s/it]

{'loss': 0.6388, 'grad_norm': 2.2480967044830322, 'learning_rate': 4.498426435877262e-05, 'epoch': 0.2}


 10%|█         | 520/5084 [22:57<3:22:33,  2.66s/it]

{'loss': 0.5317, 'grad_norm': 2.285118579864502, 'learning_rate': 4.48859166011015e-05, 'epoch': 0.2}


 10%|█         | 530/5084 [23:22<3:06:25,  2.46s/it]

{'loss': 0.5147, 'grad_norm': 2.1255338191986084, 'learning_rate': 4.478756884343037e-05, 'epoch': 0.21}


 11%|█         | 540/5084 [23:47<3:12:03,  2.54s/it]

{'loss': 0.5863, 'grad_norm': 2.3551535606384277, 'learning_rate': 4.468922108575925e-05, 'epoch': 0.21}


 11%|█         | 550/5084 [24:11<2:57:13,  2.35s/it]

{'loss': 0.5223, 'grad_norm': 1.8286410570144653, 'learning_rate': 4.459087332808812e-05, 'epoch': 0.22}


 11%|█         | 560/5084 [24:35<3:00:04,  2.39s/it]

{'loss': 0.5759, 'grad_norm': 2.434539556503296, 'learning_rate': 4.4492525570417e-05, 'epoch': 0.22}


 11%|█         | 570/5084 [25:01<3:02:37,  2.43s/it]

{'loss': 0.5936, 'grad_norm': 2.282048225402832, 'learning_rate': 4.439417781274587e-05, 'epoch': 0.22}


 11%|█▏        | 580/5084 [25:25<3:05:04,  2.47s/it]

{'loss': 0.543, 'grad_norm': 2.00604248046875, 'learning_rate': 4.4295830055074745e-05, 'epoch': 0.23}


 12%|█▏        | 590/5084 [25:50<3:09:38,  2.53s/it]

{'loss': 0.5298, 'grad_norm': 2.048185110092163, 'learning_rate': 4.419748229740362e-05, 'epoch': 0.23}


 12%|█▏        | 600/5084 [26:13<2:46:25,  2.23s/it]

{'loss': 0.5507, 'grad_norm': 2.2725133895874023, 'learning_rate': 4.4099134539732497e-05, 'epoch': 0.24}


 12%|█▏        | 610/5084 [26:38<3:07:54,  2.52s/it]

{'loss': 0.4808, 'grad_norm': 2.144101142883301, 'learning_rate': 4.400078678206137e-05, 'epoch': 0.24}


 12%|█▏        | 620/5084 [27:02<2:50:40,  2.29s/it]

{'loss': 0.5477, 'grad_norm': 2.5945687294006348, 'learning_rate': 4.390243902439025e-05, 'epoch': 0.24}


 12%|█▏        | 630/5084 [27:25<2:49:51,  2.29s/it]

{'loss': 0.5649, 'grad_norm': 1.905728816986084, 'learning_rate': 4.380409126671912e-05, 'epoch': 0.25}


 13%|█▎        | 640/5084 [27:50<3:09:57,  2.56s/it]

{'loss': 0.4902, 'grad_norm': 2.1836774349212646, 'learning_rate': 4.3705743509048e-05, 'epoch': 0.25}


 13%|█▎        | 650/5084 [28:15<2:58:30,  2.42s/it]

{'loss': 0.5771, 'grad_norm': 2.1521739959716797, 'learning_rate': 4.360739575137687e-05, 'epoch': 0.26}


 13%|█▎        | 660/5084 [28:40<3:03:36,  2.49s/it]

{'loss': 0.5092, 'grad_norm': 2.3527302742004395, 'learning_rate': 4.350904799370575e-05, 'epoch': 0.26}


 13%|█▎        | 670/5084 [29:05<3:05:12,  2.52s/it]

{'loss': 0.589, 'grad_norm': 2.5521936416625977, 'learning_rate': 4.3410700236034615e-05, 'epoch': 0.26}


 13%|█▎        | 680/5084 [29:28<2:54:50,  2.38s/it]

{'loss': 0.5208, 'grad_norm': 1.9687378406524658, 'learning_rate': 4.3312352478363494e-05, 'epoch': 0.27}


 14%|█▎        | 690/5084 [29:53<3:00:47,  2.47s/it]

{'loss': 0.5615, 'grad_norm': 1.6916520595550537, 'learning_rate': 4.3214004720692367e-05, 'epoch': 0.27}


 14%|█▍        | 700/5084 [30:18<3:00:51,  2.48s/it]

{'loss': 0.5233, 'grad_norm': 2.1639373302459717, 'learning_rate': 4.3115656963021246e-05, 'epoch': 0.28}


 14%|█▍        | 710/5084 [30:43<2:55:33,  2.41s/it]

{'loss': 0.4875, 'grad_norm': 2.509899377822876, 'learning_rate': 4.301730920535012e-05, 'epoch': 0.28}


 14%|█▍        | 720/5084 [31:07<2:55:36,  2.41s/it]

{'loss': 0.4926, 'grad_norm': 2.35917329788208, 'learning_rate': 4.2918961447679e-05, 'epoch': 0.28}


 14%|█▍        | 730/5084 [31:31<2:55:38,  2.42s/it]

{'loss': 0.5125, 'grad_norm': 1.8433493375778198, 'learning_rate': 4.282061369000787e-05, 'epoch': 0.29}


 15%|█▍        | 740/5084 [31:56<3:02:57,  2.53s/it]

{'loss': 0.5165, 'grad_norm': 1.6833597421646118, 'learning_rate': 4.272226593233675e-05, 'epoch': 0.29}


 15%|█▍        | 750/5084 [32:18<2:46:42,  2.31s/it]

{'loss': 0.5562, 'grad_norm': 2.4643208980560303, 'learning_rate': 4.262391817466562e-05, 'epoch': 0.3}


 15%|█▍        | 760/5084 [32:45<3:02:47,  2.54s/it]

{'loss': 0.4816, 'grad_norm': 2.4819626808166504, 'learning_rate': 4.252557041699449e-05, 'epoch': 0.3}


 15%|█▌        | 770/5084 [33:09<2:46:56,  2.32s/it]

{'loss': 0.4644, 'grad_norm': 2.348090887069702, 'learning_rate': 4.242722265932337e-05, 'epoch': 0.3}


 15%|█▌        | 780/5084 [33:33<3:03:27,  2.56s/it]

{'loss': 0.5028, 'grad_norm': 2.8537583351135254, 'learning_rate': 4.232887490165224e-05, 'epoch': 0.31}


 16%|█▌        | 790/5084 [33:58<3:00:36,  2.52s/it]

{'loss': 0.5537, 'grad_norm': 2.051326036453247, 'learning_rate': 4.223052714398112e-05, 'epoch': 0.31}


 16%|█▌        | 800/5084 [34:23<2:58:37,  2.50s/it]

{'loss': 0.4798, 'grad_norm': 1.8790966272354126, 'learning_rate': 4.2132179386309995e-05, 'epoch': 0.31}


 16%|█▌        | 810/5084 [34:52<4:02:04,  3.40s/it]

{'loss': 0.4088, 'grad_norm': 1.7658058404922485, 'learning_rate': 4.203383162863887e-05, 'epoch': 0.32}


 16%|█▌        | 820/5084 [35:20<3:09:27,  2.67s/it]

{'loss': 0.5285, 'grad_norm': 2.478180408477783, 'learning_rate': 4.1935483870967746e-05, 'epoch': 0.32}


 16%|█▋        | 830/5084 [35:46<3:09:55,  2.68s/it]

{'loss': 0.4341, 'grad_norm': 1.7770479917526245, 'learning_rate': 4.183713611329662e-05, 'epoch': 0.33}


 17%|█▋        | 840/5084 [36:11<2:54:55,  2.47s/it]

{'loss': 0.5314, 'grad_norm': 2.733259677886963, 'learning_rate': 4.173878835562549e-05, 'epoch': 0.33}


 17%|█▋        | 850/5084 [36:35<3:05:33,  2.63s/it]

{'loss': 0.4508, 'grad_norm': 2.223259210586548, 'learning_rate': 4.164044059795437e-05, 'epoch': 0.33}


 17%|█▋        | 860/5084 [37:01<2:46:35,  2.37s/it]

{'loss': 0.4748, 'grad_norm': 2.2725889682769775, 'learning_rate': 4.154209284028324e-05, 'epoch': 0.34}


 17%|█▋        | 870/5084 [37:25<2:42:53,  2.32s/it]

{'loss': 0.5162, 'grad_norm': 2.0591681003570557, 'learning_rate': 4.144374508261212e-05, 'epoch': 0.34}


 17%|█▋        | 880/5084 [37:51<3:08:35,  2.69s/it]

{'loss': 0.4741, 'grad_norm': 2.1931920051574707, 'learning_rate': 4.134539732494099e-05, 'epoch': 0.35}


 18%|█▊        | 890/5084 [38:14<2:45:50,  2.37s/it]

{'loss': 0.4536, 'grad_norm': 2.398189067840576, 'learning_rate': 4.124704956726987e-05, 'epoch': 0.35}


 18%|█▊        | 900/5084 [38:40<2:53:32,  2.49s/it]

{'loss': 0.4733, 'grad_norm': 2.3946235179901123, 'learning_rate': 4.1148701809598744e-05, 'epoch': 0.35}


 18%|█▊        | 910/5084 [39:05<2:51:18,  2.46s/it]

{'loss': 0.4532, 'grad_norm': 2.4543092250823975, 'learning_rate': 4.105035405192762e-05, 'epoch': 0.36}


 18%|█▊        | 920/5084 [39:30<2:50:29,  2.46s/it]

{'loss': 0.4353, 'grad_norm': 1.8009175062179565, 'learning_rate': 4.0952006294256495e-05, 'epoch': 0.36}


 18%|█▊        | 930/5084 [39:56<2:54:08,  2.52s/it]

{'loss': 0.4794, 'grad_norm': 2.022146224975586, 'learning_rate': 4.085365853658537e-05, 'epoch': 0.37}


 18%|█▊        | 940/5084 [40:21<2:55:36,  2.54s/it]

{'loss': 0.4064, 'grad_norm': 2.109271764755249, 'learning_rate': 4.075531077891424e-05, 'epoch': 0.37}


 19%|█▊        | 950/5084 [40:47<2:56:36,  2.56s/it]

{'loss': 0.4756, 'grad_norm': 2.1384832859039307, 'learning_rate': 4.065696302124312e-05, 'epoch': 0.37}


 19%|█▉        | 960/5084 [41:12<2:54:32,  2.54s/it]

{'loss': 0.4471, 'grad_norm': 2.8994855880737305, 'learning_rate': 4.055861526357199e-05, 'epoch': 0.38}


 19%|█▉        | 970/5084 [41:38<2:56:34,  2.58s/it]

{'loss': 0.448, 'grad_norm': 2.5694942474365234, 'learning_rate': 4.046026750590087e-05, 'epoch': 0.38}


 19%|█▉        | 980/5084 [42:05<2:58:11,  2.61s/it]

{'loss': 0.4914, 'grad_norm': 2.020062208175659, 'learning_rate': 4.036191974822974e-05, 'epoch': 0.39}


 19%|█▉        | 990/5084 [42:31<2:57:26,  2.60s/it]

{'loss': 0.4791, 'grad_norm': 2.207606792449951, 'learning_rate': 4.026357199055862e-05, 'epoch': 0.39}




{'loss': 0.4235, 'grad_norm': 2.665410041809082, 'learning_rate': 4.016522423288749e-05, 'epoch': 0.39}


 20%|█▉        | 1010/5084 [43:32<3:01:57,  2.68s/it]

{'loss': 0.4479, 'grad_norm': 2.069385290145874, 'learning_rate': 4.006687647521637e-05, 'epoch': 0.4}


 20%|██        | 1020/5084 [43:58<2:59:16,  2.65s/it]

{'loss': 0.4527, 'grad_norm': 1.9970909357070923, 'learning_rate': 3.996852871754524e-05, 'epoch': 0.4}


 20%|██        | 1030/5084 [44:25<3:09:15,  2.80s/it]

{'loss': 0.4781, 'grad_norm': 2.3947055339813232, 'learning_rate': 3.9870180959874116e-05, 'epoch': 0.41}


 20%|██        | 1040/5084 [44:51<2:51:23,  2.54s/it]

{'loss': 0.4446, 'grad_norm': 2.543951988220215, 'learning_rate': 3.977183320220299e-05, 'epoch': 0.41}


 21%|██        | 1050/5084 [45:16<2:45:41,  2.46s/it]

{'loss': 0.4426, 'grad_norm': 2.1284077167510986, 'learning_rate': 3.967348544453187e-05, 'epoch': 0.41}


 21%|██        | 1060/5084 [45:43<2:53:08,  2.58s/it]

{'loss': 0.4882, 'grad_norm': 2.0559868812561035, 'learning_rate': 3.957513768686074e-05, 'epoch': 0.42}


 21%|██        | 1070/5084 [46:09<2:48:02,  2.51s/it]

{'loss': 0.4536, 'grad_norm': 2.9063913822174072, 'learning_rate': 3.947678992918962e-05, 'epoch': 0.42}


 21%|██        | 1080/5084 [46:34<2:52:18,  2.58s/it]

{'loss': 0.4242, 'grad_norm': 1.970786690711975, 'learning_rate': 3.937844217151849e-05, 'epoch': 0.42}


 21%|██▏       | 1090/5084 [47:00<2:59:23,  2.69s/it]

{'loss': 0.4787, 'grad_norm': 2.155827760696411, 'learning_rate': 3.928009441384737e-05, 'epoch': 0.43}


 22%|██▏       | 1100/5084 [47:23<2:35:30,  2.34s/it]

{'loss': 0.4499, 'grad_norm': 2.0933029651641846, 'learning_rate': 3.918174665617624e-05, 'epoch': 0.43}


 22%|██▏       | 1110/5084 [47:50<2:51:44,  2.59s/it]

{'loss': 0.4215, 'grad_norm': 2.1876652240753174, 'learning_rate': 3.9083398898505114e-05, 'epoch': 0.44}


 22%|██▏       | 1120/5084 [48:15<2:42:02,  2.45s/it]

{'loss': 0.4082, 'grad_norm': 2.1782729625701904, 'learning_rate': 3.8985051140833986e-05, 'epoch': 0.44}


 22%|██▏       | 1130/5084 [48:41<2:50:28,  2.59s/it]

{'loss': 0.3994, 'grad_norm': 1.903780460357666, 'learning_rate': 3.8886703383162865e-05, 'epoch': 0.44}


 22%|██▏       | 1140/5084 [49:06<2:43:10,  2.48s/it]

{'loss': 0.4139, 'grad_norm': 2.2456953525543213, 'learning_rate': 3.878835562549174e-05, 'epoch': 0.45}


 23%|██▎       | 1150/5084 [49:30<2:44:01,  2.50s/it]

{'loss': 0.4135, 'grad_norm': 2.531148672103882, 'learning_rate': 3.8690007867820616e-05, 'epoch': 0.45}


 23%|██▎       | 1160/5084 [49:55<2:51:33,  2.62s/it]

{'loss': 0.4132, 'grad_norm': 2.1181681156158447, 'learning_rate': 3.859166011014949e-05, 'epoch': 0.46}


 23%|██▎       | 1170/5084 [50:21<2:41:16,  2.47s/it]

{'loss': 0.449, 'grad_norm': 2.21722412109375, 'learning_rate': 3.849331235247837e-05, 'epoch': 0.46}


 23%|██▎       | 1180/5084 [50:46<2:40:08,  2.46s/it]

{'loss': 0.4558, 'grad_norm': 1.9797836542129517, 'learning_rate': 3.839496459480724e-05, 'epoch': 0.46}


 23%|██▎       | 1190/5084 [51:11<2:39:14,  2.45s/it]

{'loss': 0.4294, 'grad_norm': 2.7888567447662354, 'learning_rate': 3.829661683713612e-05, 'epoch': 0.47}


 24%|██▎       | 1200/5084 [51:35<2:42:48,  2.51s/it]

{'loss': 0.4056, 'grad_norm': 1.976777195930481, 'learning_rate': 3.819826907946499e-05, 'epoch': 0.47}


 24%|██▍       | 1210/5084 [52:01<2:51:57,  2.66s/it]

{'loss': 0.4617, 'grad_norm': 2.325056791305542, 'learning_rate': 3.809992132179386e-05, 'epoch': 0.48}


 24%|██▍       | 1220/5084 [52:49<6:50:48,  6.38s/it]

{'loss': 0.4029, 'grad_norm': 2.1828651428222656, 'learning_rate': 3.800157356412274e-05, 'epoch': 0.48}


 24%|██▍       | 1230/5084 [53:19<3:12:05,  2.99s/it]

{'loss': 0.4381, 'grad_norm': 2.5482242107391357, 'learning_rate': 3.7903225806451614e-05, 'epoch': 0.48}


 24%|██▍       | 1240/5084 [53:52<3:25:01,  3.20s/it]

{'loss': 0.4374, 'grad_norm': 2.371147632598877, 'learning_rate': 3.780487804878049e-05, 'epoch': 0.49}


 25%|██▍       | 1250/5084 [54:20<3:00:10,  2.82s/it]

{'loss': 0.4555, 'grad_norm': 1.8770601749420166, 'learning_rate': 3.7706530291109365e-05, 'epoch': 0.49}


 25%|██▍       | 1260/5084 [54:52<3:17:54,  3.11s/it]

{'loss': 0.4459, 'grad_norm': 2.2372071743011475, 'learning_rate': 3.7608182533438244e-05, 'epoch': 0.5}


 25%|██▍       | 1270/5084 [55:19<2:59:17,  2.82s/it]

{'loss': 0.4077, 'grad_norm': 1.8837246894836426, 'learning_rate': 3.7509834775767116e-05, 'epoch': 0.5}


 25%|██▌       | 1280/5084 [55:50<3:18:56,  3.14s/it]

{'loss': 0.4183, 'grad_norm': 1.9027249813079834, 'learning_rate': 3.741148701809599e-05, 'epoch': 0.5}


 25%|██▌       | 1290/5084 [56:20<3:17:56,  3.13s/it]

{'loss': 0.4096, 'grad_norm': 2.428196430206299, 'learning_rate': 3.731313926042486e-05, 'epoch': 0.51}


 26%|██▌       | 1300/5084 [56:49<3:09:23,  3.00s/it]

{'loss': 0.387, 'grad_norm': 2.136370897293091, 'learning_rate': 3.721479150275374e-05, 'epoch': 0.51}


 26%|██▌       | 1310/5084 [57:18<3:01:16,  2.88s/it]

{'loss': 0.4343, 'grad_norm': 2.5930328369140625, 'learning_rate': 3.711644374508261e-05, 'epoch': 0.52}


 26%|██▌       | 1320/5084 [57:49<3:11:43,  3.06s/it]

{'loss': 0.3931, 'grad_norm': 2.2396867275238037, 'learning_rate': 3.701809598741149e-05, 'epoch': 0.52}


 26%|██▌       | 1330/5084 [58:18<2:56:30,  2.82s/it]

{'loss': 0.3951, 'grad_norm': 2.24545955657959, 'learning_rate': 3.691974822974036e-05, 'epoch': 0.52}


 26%|██▋       | 1340/5084 [58:47<3:10:40,  3.06s/it]

{'loss': 0.3827, 'grad_norm': 2.192636251449585, 'learning_rate': 3.682140047206924e-05, 'epoch': 0.53}


 27%|██▋       | 1350/5084 [59:16<3:01:10,  2.91s/it]

{'loss': 0.3579, 'grad_norm': 2.4252867698669434, 'learning_rate': 3.6723052714398114e-05, 'epoch': 0.53}


 27%|██▋       | 1360/5084 [59:45<3:09:53,  3.06s/it]

{'loss': 0.3537, 'grad_norm': 1.5067641735076904, 'learning_rate': 3.662470495672699e-05, 'epoch': 0.54}


 27%|██▋       | 1370/5084 [1:00:13<2:51:06,  2.76s/it]

{'loss': 0.355, 'grad_norm': 2.12907075881958, 'learning_rate': 3.652635719905586e-05, 'epoch': 0.54}


 27%|██▋       | 1380/5084 [1:00:42<3:07:58,  3.04s/it]

{'loss': 0.3871, 'grad_norm': 2.3671209812164307, 'learning_rate': 3.642800944138474e-05, 'epoch': 0.54}


 27%|██▋       | 1390/5084 [1:01:10<2:52:12,  2.80s/it]

{'loss': 0.3707, 'grad_norm': 1.9506582021713257, 'learning_rate': 3.632966168371361e-05, 'epoch': 0.55}


 28%|██▊       | 1400/5084 [1:01:39<3:04:09,  3.00s/it]

{'loss': 0.4121, 'grad_norm': 2.1000661849975586, 'learning_rate': 3.623131392604249e-05, 'epoch': 0.55}


 28%|██▊       | 1410/5084 [1:02:08<2:49:03,  2.76s/it]

{'loss': 0.4258, 'grad_norm': 2.17693829536438, 'learning_rate': 3.613296616837136e-05, 'epoch': 0.55}


 28%|██▊       | 1420/5084 [1:02:36<2:59:54,  2.95s/it]

{'loss': 0.3988, 'grad_norm': 2.213618040084839, 'learning_rate': 3.603461841070024e-05, 'epoch': 0.56}


 28%|██▊       | 1430/5084 [1:03:05<2:47:29,  2.75s/it]

{'loss': 0.4178, 'grad_norm': 1.904998779296875, 'learning_rate': 3.593627065302911e-05, 'epoch': 0.56}


 28%|██▊       | 1440/5084 [1:03:34<2:53:05,  2.85s/it]

{'loss': 0.3778, 'grad_norm': 2.2829549312591553, 'learning_rate': 3.583792289535799e-05, 'epoch': 0.57}


 29%|██▊       | 1450/5084 [1:04:02<2:48:52,  2.79s/it]

{'loss': 0.4608, 'grad_norm': 2.514676809310913, 'learning_rate': 3.573957513768686e-05, 'epoch': 0.57}


 29%|██▊       | 1460/5084 [1:04:33<3:21:23,  3.33s/it]

{'loss': 0.4289, 'grad_norm': 2.157899856567383, 'learning_rate': 3.5641227380015735e-05, 'epoch': 0.57}


 29%|██▉       | 1470/5084 [1:05:04<3:11:45,  3.18s/it]

{'loss': 0.3436, 'grad_norm': 1.9143567085266113, 'learning_rate': 3.554287962234461e-05, 'epoch': 0.58}


 29%|██▉       | 1480/5084 [1:05:36<3:00:52,  3.01s/it]

{'loss': 0.3737, 'grad_norm': 2.0965993404388428, 'learning_rate': 3.5444531864673486e-05, 'epoch': 0.58}


 29%|██▉       | 1490/5084 [1:06:05<2:46:35,  2.78s/it]

{'loss': 0.393, 'grad_norm': 2.535547971725464, 'learning_rate': 3.534618410700236e-05, 'epoch': 0.59}


 30%|██▉       | 1500/5084 [1:06:45<4:22:26,  4.39s/it]

{'loss': 0.3906, 'grad_norm': 2.348639726638794, 'learning_rate': 3.524783634933124e-05, 'epoch': 0.59}


 30%|██▉       | 1510/5084 [1:07:30<3:38:08,  3.66s/it]

{'loss': 0.4755, 'grad_norm': 1.841005802154541, 'learning_rate': 3.514948859166011e-05, 'epoch': 0.59}


 30%|██▉       | 1520/5084 [1:08:28<6:26:38,  6.51s/it]

{'loss': 0.4338, 'grad_norm': 2.292283296585083, 'learning_rate': 3.505114083398899e-05, 'epoch': 0.6}


 30%|███       | 1530/5084 [1:09:00<3:12:34,  3.25s/it]

{'loss': 0.3933, 'grad_norm': 2.085921287536621, 'learning_rate': 3.495279307631786e-05, 'epoch': 0.6}


 30%|███       | 1540/5084 [1:09:35<4:09:42,  4.23s/it]

{'loss': 0.4352, 'grad_norm': 2.334312915802002, 'learning_rate': 3.485444531864674e-05, 'epoch': 0.61}


 30%|███       | 1550/5084 [1:10:21<5:56:30,  6.05s/it]

{'loss': 0.3829, 'grad_norm': 2.891214609146118, 'learning_rate': 3.475609756097561e-05, 'epoch': 0.61}


 31%|███       | 1560/5084 [1:10:55<3:12:01,  3.27s/it]

{'loss': 0.3645, 'grad_norm': 2.088085651397705, 'learning_rate': 3.4657749803304484e-05, 'epoch': 0.61}


 31%|███       | 1570/5084 [1:11:32<3:23:40,  3.48s/it]

{'loss': 0.3205, 'grad_norm': 1.6506779193878174, 'learning_rate': 3.455940204563336e-05, 'epoch': 0.62}


 31%|███       | 1580/5084 [1:12:06<3:11:33,  3.28s/it]

{'loss': 0.4393, 'grad_norm': 2.2183239459991455, 'learning_rate': 3.4461054287962236e-05, 'epoch': 0.62}


 31%|███▏      | 1590/5084 [1:12:41<3:24:16,  3.51s/it]

{'loss': 0.3666, 'grad_norm': 2.0031092166900635, 'learning_rate': 3.436270653029111e-05, 'epoch': 0.63}


 31%|███▏      | 1600/5084 [1:13:15<3:22:02,  3.48s/it]

{'loss': 0.3581, 'grad_norm': 2.044268846511841, 'learning_rate': 3.426435877261999e-05, 'epoch': 0.63}


 32%|███▏      | 1610/5084 [1:13:55<3:30:55,  3.64s/it]

{'loss': 0.3993, 'grad_norm': 2.201831102371216, 'learning_rate': 3.416601101494886e-05, 'epoch': 0.63}


 32%|███▏      | 1620/5084 [1:14:29<3:23:08,  3.52s/it]

{'loss': 0.4237, 'grad_norm': 2.0320959091186523, 'learning_rate': 3.406766325727774e-05, 'epoch': 0.64}


 32%|███▏      | 1630/5084 [1:15:06<3:20:24,  3.48s/it]

{'loss': 0.4132, 'grad_norm': 2.767683506011963, 'learning_rate': 3.396931549960661e-05, 'epoch': 0.64}


 32%|███▏      | 1640/5084 [1:15:40<3:09:41,  3.30s/it]

{'loss': 0.4211, 'grad_norm': 2.5174431800842285, 'learning_rate': 3.387096774193548e-05, 'epoch': 0.65}


 32%|███▏      | 1650/5084 [1:16:22<3:35:54,  3.77s/it]

{'loss': 0.4014, 'grad_norm': 2.551724910736084, 'learning_rate': 3.377261998426436e-05, 'epoch': 0.65}


 33%|███▎      | 1660/5084 [1:17:01<3:37:48,  3.82s/it]

{'loss': 0.3662, 'grad_norm': 1.970682144165039, 'learning_rate': 3.367427222659323e-05, 'epoch': 0.65}


 33%|███▎      | 1670/5084 [1:17:31<2:52:35,  3.03s/it]

{'loss': 0.3741, 'grad_norm': 2.265073776245117, 'learning_rate': 3.357592446892211e-05, 'epoch': 0.66}


 33%|███▎      | 1680/5084 [1:18:02<2:56:52,  3.12s/it]

{'loss': 0.3464, 'grad_norm': 2.3808741569519043, 'learning_rate': 3.3477576711250985e-05, 'epoch': 0.66}


 33%|███▎      | 1690/5084 [1:18:34<3:05:24,  3.28s/it]

{'loss': 0.4206, 'grad_norm': 2.5819034576416016, 'learning_rate': 3.3379228953579864e-05, 'epoch': 0.66}


 33%|███▎      | 1700/5084 [1:19:05<2:44:50,  2.92s/it]

{'loss': 0.3896, 'grad_norm': 2.3084065914154053, 'learning_rate': 3.3280881195908736e-05, 'epoch': 0.67}


 34%|███▎      | 1710/5084 [1:19:34<2:47:28,  2.98s/it]

{'loss': 0.368, 'grad_norm': 2.1382339000701904, 'learning_rate': 3.3182533438237615e-05, 'epoch': 0.67}


 34%|███▍      | 1720/5084 [1:20:06<2:52:40,  3.08s/it]

{'loss': 0.3839, 'grad_norm': 2.2144322395324707, 'learning_rate': 3.308418568056649e-05, 'epoch': 0.68}


 34%|███▍      | 1730/5084 [1:21:42<17:16:27, 18.54s/it]

{'loss': 0.3798, 'grad_norm': 1.829183578491211, 'learning_rate': 3.298583792289536e-05, 'epoch': 0.68}


 34%|███▍      | 1740/5084 [1:25:52<14:31:55, 15.64s/it]

{'loss': 0.3413, 'grad_norm': 2.281175374984741, 'learning_rate': 3.288749016522423e-05, 'epoch': 0.68}


 34%|███▍      | 1750/5084 [1:27:02<5:19:48,  5.76s/it] 

{'loss': 0.3305, 'grad_norm': 1.6497708559036255, 'learning_rate': 3.278914240755311e-05, 'epoch': 0.69}


 35%|███▍      | 1760/5084 [1:27:57<5:27:00,  5.90s/it]

{'loss': 0.3342, 'grad_norm': 2.526543617248535, 'learning_rate': 3.269079464988198e-05, 'epoch': 0.69}


 35%|███▍      | 1770/5084 [1:28:54<5:28:44,  5.95s/it]

{'loss': 0.3389, 'grad_norm': 1.8503831624984741, 'learning_rate': 3.259244689221086e-05, 'epoch': 0.7}


 35%|███▌      | 1780/5084 [1:29:50<4:32:52,  4.96s/it]

{'loss': 0.3326, 'grad_norm': 1.9436488151550293, 'learning_rate': 3.2494099134539734e-05, 'epoch': 0.7}


 35%|███▌      | 1790/5084 [1:30:36<3:51:26,  4.22s/it]

{'loss': 0.385, 'grad_norm': 2.310248613357544, 'learning_rate': 3.239575137686861e-05, 'epoch': 0.7}


 35%|███▌      | 1800/5084 [1:31:09<2:53:15,  3.17s/it]

{'loss': 0.3466, 'grad_norm': 2.3845882415771484, 'learning_rate': 3.2297403619197485e-05, 'epoch': 0.71}


 36%|███▌      | 1810/5084 [1:31:59<4:46:52,  5.26s/it]

{'loss': 0.3648, 'grad_norm': 1.7778468132019043, 'learning_rate': 3.219905586152636e-05, 'epoch': 0.71}


 36%|███▌      | 1820/5084 [1:32:58<4:12:04,  4.63s/it]

{'loss': 0.3656, 'grad_norm': 2.149019241333008, 'learning_rate': 3.210070810385523e-05, 'epoch': 0.72}


 36%|███▌      | 1830/5084 [1:33:31<2:53:05,  3.19s/it]

{'loss': 0.356, 'grad_norm': 2.244225025177002, 'learning_rate': 3.200236034618411e-05, 'epoch': 0.72}


 36%|███▌      | 1840/5084 [1:33:58<2:33:46,  2.84s/it]

{'loss': 0.3265, 'grad_norm': 1.9187195301055908, 'learning_rate': 3.190401258851298e-05, 'epoch': 0.72}


 36%|███▋      | 1850/5084 [1:34:29<2:51:12,  3.18s/it]

{'loss': 0.4186, 'grad_norm': 1.8809574842453003, 'learning_rate': 3.180566483084186e-05, 'epoch': 0.73}


 37%|███▋      | 1860/5084 [1:35:01<2:31:39,  2.82s/it]

{'loss': 0.2993, 'grad_norm': 2.700181245803833, 'learning_rate': 3.170731707317073e-05, 'epoch': 0.73}


 37%|███▋      | 1870/5084 [1:35:27<2:24:16,  2.69s/it]

{'loss': 0.4111, 'grad_norm': 3.002934455871582, 'learning_rate': 3.160896931549961e-05, 'epoch': 0.74}


 37%|███▋      | 1880/5084 [1:35:54<2:22:01,  2.66s/it]

{'loss': 0.4031, 'grad_norm': 2.0122954845428467, 'learning_rate': 3.151062155782848e-05, 'epoch': 0.74}


 37%|███▋      | 1890/5084 [1:36:18<2:10:13,  2.45s/it]

{'loss': 0.3374, 'grad_norm': 1.8792555332183838, 'learning_rate': 3.141227380015736e-05, 'epoch': 0.74}


 37%|███▋      | 1900/5084 [1:36:44<2:14:40,  2.54s/it]

{'loss': 0.3455, 'grad_norm': 2.0068130493164062, 'learning_rate': 3.131392604248623e-05, 'epoch': 0.75}


 38%|███▊      | 1910/5084 [1:37:11<2:19:33,  2.64s/it]

{'loss': 0.3422, 'grad_norm': 2.425273895263672, 'learning_rate': 3.1215578284815106e-05, 'epoch': 0.75}


 38%|███▊      | 1920/5084 [1:37:36<2:14:18,  2.55s/it]

{'loss': 0.3955, 'grad_norm': 2.462610960006714, 'learning_rate': 3.111723052714398e-05, 'epoch': 0.76}


 38%|███▊      | 1930/5084 [1:38:02<2:19:24,  2.65s/it]

{'loss': 0.3098, 'grad_norm': 1.6710566282272339, 'learning_rate': 3.101888276947286e-05, 'epoch': 0.76}


 38%|███▊      | 1940/5084 [1:38:26<2:05:20,  2.39s/it]

{'loss': 0.3502, 'grad_norm': 1.881818175315857, 'learning_rate': 3.092053501180173e-05, 'epoch': 0.76}


 38%|███▊      | 1950/5084 [1:38:52<2:12:32,  2.54s/it]

{'loss': 0.3584, 'grad_norm': 1.9300827980041504, 'learning_rate': 3.082218725413061e-05, 'epoch': 0.77}


 39%|███▊      | 1960/5084 [1:39:17<2:10:54,  2.51s/it]

{'loss': 0.3688, 'grad_norm': 1.9810919761657715, 'learning_rate': 3.072383949645948e-05, 'epoch': 0.77}


 39%|███▊      | 1970/5084 [1:39:43<2:17:39,  2.65s/it]

{'loss': 0.3128, 'grad_norm': 2.2061803340911865, 'learning_rate': 3.062549173878836e-05, 'epoch': 0.77}


 39%|███▉      | 1980/5084 [1:40:08<2:00:54,  2.34s/it]

{'loss': 0.4387, 'grad_norm': 2.061764717102051, 'learning_rate': 3.052714398111723e-05, 'epoch': 0.78}


 39%|███▉      | 1990/5084 [1:40:34<2:15:17,  2.62s/it]

{'loss': 0.3909, 'grad_norm': 2.3757972717285156, 'learning_rate': 3.0428796223446104e-05, 'epoch': 0.78}


 39%|███▉      | 2000/5084 [1:41:01<2:17:15,  2.67s/it]

{'loss': 0.3455, 'grad_norm': 1.8916473388671875, 'learning_rate': 3.033044846577498e-05, 'epoch': 0.79}


 40%|███▉      | 2010/5084 [1:41:29<2:13:47,  2.61s/it]

{'loss': 0.3753, 'grad_norm': 1.5962990522384644, 'learning_rate': 3.0232100708103855e-05, 'epoch': 0.79}


 40%|███▉      | 2020/5084 [1:41:54<2:11:50,  2.58s/it]

{'loss': 0.3596, 'grad_norm': 2.0653178691864014, 'learning_rate': 3.013375295043273e-05, 'epoch': 0.79}


 40%|███▉      | 2030/5084 [1:42:20<2:10:40,  2.57s/it]

{'loss': 0.373, 'grad_norm': 2.399937391281128, 'learning_rate': 3.0035405192761606e-05, 'epoch': 0.8}


 40%|████      | 2040/5084 [1:42:46<2:10:06,  2.56s/it]

{'loss': 0.309, 'grad_norm': 2.0602822303771973, 'learning_rate': 2.993705743509048e-05, 'epoch': 0.8}


 40%|████      | 2050/5084 [1:43:11<2:09:28,  2.56s/it]

{'loss': 0.3557, 'grad_norm': 1.7976281642913818, 'learning_rate': 2.9838709677419357e-05, 'epoch': 0.81}


 41%|████      | 2060/5084 [1:43:37<2:09:21,  2.57s/it]

{'loss': 0.3724, 'grad_norm': 2.032834768295288, 'learning_rate': 2.9740361919748233e-05, 'epoch': 0.81}


 41%|████      | 2070/5084 [1:44:02<2:13:36,  2.66s/it]

{'loss': 0.3557, 'grad_norm': 2.1586174964904785, 'learning_rate': 2.964201416207711e-05, 'epoch': 0.81}


 41%|████      | 2080/5084 [1:44:27<2:05:50,  2.51s/it]

{'loss': 0.3411, 'grad_norm': 1.9574440717697144, 'learning_rate': 2.9543666404405977e-05, 'epoch': 0.82}


 41%|████      | 2090/5084 [1:44:52<2:02:32,  2.46s/it]

{'loss': 0.3458, 'grad_norm': 1.688568115234375, 'learning_rate': 2.9445318646734853e-05, 'epoch': 0.82}


 41%|████▏     | 2100/5084 [1:45:17<2:01:46,  2.45s/it]

{'loss': 0.3888, 'grad_norm': 2.512572765350342, 'learning_rate': 2.934697088906373e-05, 'epoch': 0.83}


 42%|████▏     | 2110/5084 [1:45:42<2:05:31,  2.53s/it]

{'loss': 0.3287, 'grad_norm': 2.3911266326904297, 'learning_rate': 2.9248623131392604e-05, 'epoch': 0.83}


 42%|████▏     | 2120/5084 [1:46:07<2:06:57,  2.57s/it]

{'loss': 0.3595, 'grad_norm': 1.912984013557434, 'learning_rate': 2.915027537372148e-05, 'epoch': 0.83}


 42%|████▏     | 2130/5084 [1:46:33<2:17:38,  2.80s/it]

{'loss': 0.327, 'grad_norm': 2.25665545463562, 'learning_rate': 2.9051927616050355e-05, 'epoch': 0.84}


 42%|████▏     | 2140/5084 [1:47:08<3:46:27,  4.62s/it]

{'loss': 0.3218, 'grad_norm': 2.3084983825683594, 'learning_rate': 2.895357985837923e-05, 'epoch': 0.84}


 42%|████▏     | 2150/5084 [1:47:44<2:24:28,  2.95s/it]

{'loss': 0.3393, 'grad_norm': 3.026265859603882, 'learning_rate': 2.8855232100708106e-05, 'epoch': 0.85}


 42%|████▏     | 2160/5084 [1:48:10<2:05:15,  2.57s/it]

{'loss': 0.3972, 'grad_norm': 1.9314240217208862, 'learning_rate': 2.8756884343036982e-05, 'epoch': 0.85}


 43%|████▎     | 2170/5084 [1:48:35<2:01:47,  2.51s/it]

{'loss': 0.3183, 'grad_norm': 2.390587568283081, 'learning_rate': 2.8658536585365854e-05, 'epoch': 0.85}


 43%|████▎     | 2180/5084 [1:49:01<2:02:16,  2.53s/it]

{'loss': 0.3345, 'grad_norm': 2.203577756881714, 'learning_rate': 2.856018882769473e-05, 'epoch': 0.86}


 43%|████▎     | 2190/5084 [1:49:26<1:55:19,  2.39s/it]

{'loss': 0.3188, 'grad_norm': 1.9868173599243164, 'learning_rate': 2.8461841070023605e-05, 'epoch': 0.86}


 43%|████▎     | 2200/5084 [1:49:49<1:56:20,  2.42s/it]

{'loss': 0.3865, 'grad_norm': 1.6590014696121216, 'learning_rate': 2.836349331235248e-05, 'epoch': 0.87}


 43%|████▎     | 2210/5084 [1:50:14<2:01:15,  2.53s/it]

{'loss': 0.3178, 'grad_norm': 1.8595807552337646, 'learning_rate': 2.8265145554681356e-05, 'epoch': 0.87}


 44%|████▎     | 2220/5084 [1:50:39<2:01:11,  2.54s/it]

{'loss': 0.3664, 'grad_norm': 2.296440601348877, 'learning_rate': 2.8166797797010232e-05, 'epoch': 0.87}


 44%|████▍     | 2230/5084 [1:51:04<1:59:32,  2.51s/it]

{'loss': 0.3687, 'grad_norm': 2.348506212234497, 'learning_rate': 2.8068450039339108e-05, 'epoch': 0.88}


 44%|████▍     | 2240/5084 [1:51:31<2:00:03,  2.53s/it]

{'loss': 0.3034, 'grad_norm': 2.4523799419403076, 'learning_rate': 2.7970102281667983e-05, 'epoch': 0.88}


 44%|████▍     | 2250/5084 [1:51:57<1:59:48,  2.54s/it]

{'loss': 0.37, 'grad_norm': 1.8897404670715332, 'learning_rate': 2.7871754523996855e-05, 'epoch': 0.89}


 44%|████▍     | 2260/5084 [1:52:24<2:19:27,  2.96s/it]

{'loss': 0.3593, 'grad_norm': 1.8314992189407349, 'learning_rate': 2.7773406766325727e-05, 'epoch': 0.89}


 45%|████▍     | 2270/5084 [1:53:03<4:10:53,  5.35s/it]

{'loss': 0.3224, 'grad_norm': 2.089735507965088, 'learning_rate': 2.7675059008654603e-05, 'epoch': 0.89}


 45%|████▍     | 2280/5084 [1:53:51<4:19:54,  5.56s/it]

{'loss': 0.3338, 'grad_norm': 1.890104055404663, 'learning_rate': 2.757671125098348e-05, 'epoch': 0.9}


 45%|████▌     | 2290/5084 [1:56:29<8:23:01, 10.80s/it] 

{'loss': 0.3176, 'grad_norm': 2.1826305389404297, 'learning_rate': 2.7478363493312354e-05, 'epoch': 0.9}


 45%|████▌     | 2300/5084 [1:57:21<4:55:19,  6.36s/it]

{'loss': 0.3547, 'grad_norm': 1.711499571800232, 'learning_rate': 2.738001573564123e-05, 'epoch': 0.9}


 45%|████▌     | 2310/5084 [1:57:53<2:19:06,  3.01s/it]

{'loss': 0.3724, 'grad_norm': 2.4270589351654053, 'learning_rate': 2.7281667977970105e-05, 'epoch': 0.91}


 46%|████▌     | 2320/5084 [1:58:27<2:33:15,  3.33s/it]

{'loss': 0.347, 'grad_norm': 1.9313631057739258, 'learning_rate': 2.718332022029898e-05, 'epoch': 0.91}


 46%|████▌     | 2330/5084 [1:59:08<2:51:03,  3.73s/it]

{'loss': 0.3204, 'grad_norm': 1.9236878156661987, 'learning_rate': 2.7084972462627857e-05, 'epoch': 0.92}


 46%|████▌     | 2340/5084 [1:59:39<2:25:39,  3.19s/it]

{'loss': 0.3688, 'grad_norm': 2.3789846897125244, 'learning_rate': 2.6986624704956725e-05, 'epoch': 0.92}


 46%|████▌     | 2350/5084 [2:00:07<2:01:00,  2.66s/it]

{'loss': 0.2904, 'grad_norm': 1.5948731899261475, 'learning_rate': 2.68882769472856e-05, 'epoch': 0.92}


 46%|████▋     | 2360/5084 [2:00:37<2:20:15,  3.09s/it]

{'loss': 0.2787, 'grad_norm': 2.056997299194336, 'learning_rate': 2.6789929189614476e-05, 'epoch': 0.93}


 47%|████▋     | 2370/5084 [2:01:06<2:06:46,  2.80s/it]

{'loss': 0.4109, 'grad_norm': 2.4142651557922363, 'learning_rate': 2.6691581431943352e-05, 'epoch': 0.93}


 47%|████▋     | 2380/5084 [2:01:34<2:01:08,  2.69s/it]

{'loss': 0.3019, 'grad_norm': 1.739406943321228, 'learning_rate': 2.6593233674272228e-05, 'epoch': 0.94}


 47%|████▋     | 2390/5084 [2:01:59<1:50:11,  2.45s/it]

{'loss': 0.3548, 'grad_norm': 1.776234745979309, 'learning_rate': 2.6494885916601103e-05, 'epoch': 0.94}


 47%|████▋     | 2400/5084 [2:02:23<1:50:13,  2.46s/it]

{'loss': 0.3251, 'grad_norm': 1.8967773914337158, 'learning_rate': 2.639653815892998e-05, 'epoch': 0.94}


 47%|████▋     | 2410/5084 [2:02:50<1:59:05,  2.67s/it]

{'loss': 0.3229, 'grad_norm': 2.5306735038757324, 'learning_rate': 2.6298190401258854e-05, 'epoch': 0.95}


 48%|████▊     | 2420/5084 [2:03:17<1:51:39,  2.51s/it]

{'loss': 0.3152, 'grad_norm': 1.8206312656402588, 'learning_rate': 2.619984264358773e-05, 'epoch': 0.95}


 48%|████▊     | 2430/5084 [2:03:41<1:50:31,  2.50s/it]

{'loss': 0.3102, 'grad_norm': 1.7799429893493652, 'learning_rate': 2.61014948859166e-05, 'epoch': 0.96}


 48%|████▊     | 2440/5084 [2:04:06<1:52:58,  2.56s/it]

{'loss': 0.3906, 'grad_norm': 2.391744613647461, 'learning_rate': 2.6003147128245474e-05, 'epoch': 0.96}


 48%|████▊     | 2450/5084 [2:04:30<1:48:09,  2.46s/it]

{'loss': 0.3101, 'grad_norm': 2.411252021789551, 'learning_rate': 2.590479937057435e-05, 'epoch': 0.96}


 48%|████▊     | 2460/5084 [2:04:56<1:51:00,  2.54s/it]

{'loss': 0.3206, 'grad_norm': 1.6086562871932983, 'learning_rate': 2.5806451612903226e-05, 'epoch': 0.97}


 49%|████▊     | 2470/5084 [2:05:21<1:45:41,  2.43s/it]

{'loss': 0.3472, 'grad_norm': 2.700281858444214, 'learning_rate': 2.57081038552321e-05, 'epoch': 0.97}


 49%|████▉     | 2480/5084 [2:05:47<1:54:12,  2.63s/it]

{'loss': 0.2971, 'grad_norm': 1.7722951173782349, 'learning_rate': 2.5609756097560977e-05, 'epoch': 0.98}


 49%|████▉     | 2490/5084 [2:06:12<1:47:59,  2.50s/it]

{'loss': 0.2999, 'grad_norm': 2.206698179244995, 'learning_rate': 2.5511408339889852e-05, 'epoch': 0.98}


 49%|████▉     | 2500/5084 [2:06:39<1:52:09,  2.60s/it]

{'loss': 0.3685, 'grad_norm': 2.4915435314178467, 'learning_rate': 2.5413060582218728e-05, 'epoch': 0.98}


 49%|████▉     | 2510/5084 [2:07:04<1:52:17,  2.62s/it]

{'loss': 0.323, 'grad_norm': 1.9800753593444824, 'learning_rate': 2.5314712824547603e-05, 'epoch': 0.99}


 50%|████▉     | 2520/5084 [2:07:29<1:47:04,  2.51s/it]

{'loss': 0.3363, 'grad_norm': 2.0640950202941895, 'learning_rate': 2.5216365066876476e-05, 'epoch': 0.99}


 50%|████▉     | 2530/5084 [2:07:55<1:46:46,  2.51s/it]

{'loss': 0.3836, 'grad_norm': 1.8844527006149292, 'learning_rate': 2.5118017309205348e-05, 'epoch': 1.0}


 50%|████▉     | 2540/5084 [2:08:20<1:40:02,  2.36s/it]

{'loss': 0.3326, 'grad_norm': 2.011941432952881, 'learning_rate': 2.5019669551534223e-05, 'epoch': 1.0}


                                                       
 50%|█████     | 2542/5084 [2:11:35<1:37:34,  2.30s/it]

{'eval_loss': 0.32972368597984314, 'eval_runtime': 190.0205, 'eval_samples_per_second': 13.378, 'eval_steps_per_second': 1.674, 'epoch': 1.0}


 50%|█████     | 2550/5084 [2:11:55<5:00:58,  7.13s/it] 

{'loss': 0.2985, 'grad_norm': 2.3024096488952637, 'learning_rate': 2.49213217938631e-05, 'epoch': 1.0}


 50%|█████     | 2560/5084 [2:12:20<1:51:38,  2.65s/it]

{'loss': 0.3156, 'grad_norm': 2.371612310409546, 'learning_rate': 2.4822974036191975e-05, 'epoch': 1.01}


 51%|█████     | 2570/5084 [2:12:46<1:45:06,  2.51s/it]

{'loss': 0.3212, 'grad_norm': 2.2485151290893555, 'learning_rate': 2.472462627852085e-05, 'epoch': 1.01}


 51%|█████     | 2580/5084 [2:13:10<1:43:15,  2.47s/it]

{'loss': 0.3311, 'grad_norm': 2.09279727935791, 'learning_rate': 2.4626278520849726e-05, 'epoch': 1.01}


 51%|█████     | 2590/5084 [2:13:35<1:43:36,  2.49s/it]

{'loss': 0.2632, 'grad_norm': 2.0497279167175293, 'learning_rate': 2.45279307631786e-05, 'epoch': 1.02}


 51%|█████     | 2600/5084 [2:14:00<1:43:59,  2.51s/it]

{'loss': 0.2819, 'grad_norm': 1.9119271039962769, 'learning_rate': 2.4429583005507477e-05, 'epoch': 1.02}


 51%|█████▏    | 2610/5084 [2:14:24<1:37:05,  2.35s/it]

{'loss': 0.2741, 'grad_norm': 2.1506993770599365, 'learning_rate': 2.4331235247836352e-05, 'epoch': 1.03}


 52%|█████▏    | 2620/5084 [2:14:48<1:43:30,  2.52s/it]

{'loss': 0.2408, 'grad_norm': 1.6729098558425903, 'learning_rate': 2.4232887490165228e-05, 'epoch': 1.03}


 52%|█████▏    | 2630/5084 [2:15:13<1:43:55,  2.54s/it]

{'loss': 0.3233, 'grad_norm': 1.9123562574386597, 'learning_rate': 2.41345397324941e-05, 'epoch': 1.03}


 52%|█████▏    | 2640/5084 [2:15:37<1:39:22,  2.44s/it]

{'loss': 0.321, 'grad_norm': 2.1452534198760986, 'learning_rate': 2.4036191974822976e-05, 'epoch': 1.04}


 52%|█████▏    | 2650/5084 [2:16:02<1:37:27,  2.40s/it]

{'loss': 0.2832, 'grad_norm': 1.8271397352218628, 'learning_rate': 2.393784421715185e-05, 'epoch': 1.04}


 52%|█████▏    | 2660/5084 [2:16:26<1:39:03,  2.45s/it]

{'loss': 0.2954, 'grad_norm': 2.4132704734802246, 'learning_rate': 2.3839496459480727e-05, 'epoch': 1.05}


 53%|█████▎    | 2670/5084 [2:16:54<1:55:25,  2.87s/it]

{'loss': 0.3028, 'grad_norm': 2.0473086833953857, 'learning_rate': 2.37411487018096e-05, 'epoch': 1.05}


 53%|█████▎    | 2680/5084 [2:17:45<3:10:13,  4.75s/it]

{'loss': 0.3113, 'grad_norm': 1.8964481353759766, 'learning_rate': 2.3642800944138475e-05, 'epoch': 1.05}


 53%|█████▎    | 2690/5084 [2:18:20<2:08:31,  3.22s/it]

{'loss': 0.2798, 'grad_norm': 2.124534845352173, 'learning_rate': 2.354445318646735e-05, 'epoch': 1.06}


 53%|█████▎    | 2700/5084 [2:18:52<2:09:04,  3.25s/it]

{'loss': 0.3077, 'grad_norm': 2.013749361038208, 'learning_rate': 2.3446105428796226e-05, 'epoch': 1.06}


 53%|█████▎    | 2710/5084 [2:19:24<2:27:55,  3.74s/it]

{'loss': 0.2555, 'grad_norm': 1.594275712966919, 'learning_rate': 2.33477576711251e-05, 'epoch': 1.07}


 54%|█████▎    | 2720/5084 [2:19:57<2:13:53,  3.40s/it]

{'loss': 0.3153, 'grad_norm': 2.015822172164917, 'learning_rate': 2.3249409913453974e-05, 'epoch': 1.07}


 54%|█████▎    | 2730/5084 [2:20:26<1:53:39,  2.90s/it]

{'loss': 0.289, 'grad_norm': 1.7218459844589233, 'learning_rate': 2.315106215578285e-05, 'epoch': 1.07}


 54%|█████▍    | 2740/5084 [2:20:59<2:23:29,  3.67s/it]

{'loss': 0.2671, 'grad_norm': 1.830124855041504, 'learning_rate': 2.3052714398111725e-05, 'epoch': 1.08}


 54%|█████▍    | 2750/5084 [2:21:33<1:51:46,  2.87s/it]

{'loss': 0.3521, 'grad_norm': 2.0907723903656006, 'learning_rate': 2.29543666404406e-05, 'epoch': 1.08}


 54%|█████▍    | 2760/5084 [2:22:01<2:08:53,  3.33s/it]

{'loss': 0.2942, 'grad_norm': 2.200529098510742, 'learning_rate': 2.2856018882769473e-05, 'epoch': 1.09}


 54%|█████▍    | 2770/5084 [2:22:30<1:49:29,  2.84s/it]

{'loss': 0.2988, 'grad_norm': 1.9564491510391235, 'learning_rate': 2.2757671125098348e-05, 'epoch': 1.09}


 55%|█████▍    | 2780/5084 [2:23:03<2:19:02,  3.62s/it]

{'loss': 0.265, 'grad_norm': 1.9350104331970215, 'learning_rate': 2.2659323367427224e-05, 'epoch': 1.09}


 55%|█████▍    | 2790/5084 [2:23:33<1:51:11,  2.91s/it]

{'loss': 0.3012, 'grad_norm': 2.2998437881469727, 'learning_rate': 2.25609756097561e-05, 'epoch': 1.1}


 55%|█████▌    | 2800/5084 [2:24:08<2:22:23,  3.74s/it]

{'loss': 0.3086, 'grad_norm': 2.095350980758667, 'learning_rate': 2.2462627852084975e-05, 'epoch': 1.1}


 55%|█████▌    | 2810/5084 [2:24:39<1:58:22,  3.12s/it]

{'loss': 0.3129, 'grad_norm': 2.1937639713287354, 'learning_rate': 2.2364280094413847e-05, 'epoch': 1.11}


 55%|█████▌    | 2820/5084 [2:25:11<1:57:33,  3.12s/it]

{'loss': 0.3066, 'grad_norm': 2.297017812728882, 'learning_rate': 2.2265932336742723e-05, 'epoch': 1.11}


 56%|█████▌    | 2830/5084 [2:25:38<1:44:49,  2.79s/it]

{'loss': 0.2853, 'grad_norm': 2.179156541824341, 'learning_rate': 2.2167584579071598e-05, 'epoch': 1.11}


 56%|█████▌    | 2840/5084 [2:26:05<1:37:29,  2.61s/it]

{'loss': 0.2711, 'grad_norm': 2.470963954925537, 'learning_rate': 2.2069236821400474e-05, 'epoch': 1.12}


 56%|█████▌    | 2850/5084 [2:26:34<1:57:30,  3.16s/it]

{'loss': 0.2946, 'grad_norm': 2.2159767150878906, 'learning_rate': 2.1970889063729346e-05, 'epoch': 1.12}


 56%|█████▋    | 2860/5084 [2:27:02<1:47:20,  2.90s/it]

{'loss': 0.2634, 'grad_norm': 1.6441774368286133, 'learning_rate': 2.187254130605822e-05, 'epoch': 1.13}


 56%|█████▋    | 2870/5084 [2:27:30<1:41:32,  2.75s/it]

{'loss': 0.3042, 'grad_norm': 2.189192533493042, 'learning_rate': 2.1774193548387097e-05, 'epoch': 1.13}


 57%|█████▋    | 2880/5084 [2:27:58<1:44:21,  2.84s/it]

{'loss': 0.2635, 'grad_norm': 2.252898693084717, 'learning_rate': 2.1675845790715973e-05, 'epoch': 1.13}


 57%|█████▋    | 2890/5084 [2:28:28<1:43:30,  2.83s/it]

{'loss': 0.2645, 'grad_norm': 1.7103097438812256, 'learning_rate': 2.157749803304485e-05, 'epoch': 1.14}


 57%|█████▋    | 2900/5084 [2:29:13<2:50:35,  4.69s/it]

{'loss': 0.2823, 'grad_norm': 2.025022268295288, 'learning_rate': 2.147915027537372e-05, 'epoch': 1.14}


 57%|█████▋    | 2910/5084 [2:29:42<1:46:30,  2.94s/it]

{'loss': 0.2657, 'grad_norm': 2.1499223709106445, 'learning_rate': 2.1380802517702596e-05, 'epoch': 1.14}


 57%|█████▋    | 2920/5084 [2:30:12<1:38:52,  2.74s/it]

{'loss': 0.2695, 'grad_norm': 1.5874598026275635, 'learning_rate': 2.128245476003147e-05, 'epoch': 1.15}


 58%|█████▊    | 2930/5084 [2:30:37<1:29:22,  2.49s/it]

{'loss': 0.3078, 'grad_norm': 1.9868533611297607, 'learning_rate': 2.1184107002360347e-05, 'epoch': 1.15}


 58%|█████▊    | 2940/5084 [2:31:03<1:36:21,  2.70s/it]

{'loss': 0.2923, 'grad_norm': 1.8659998178482056, 'learning_rate': 2.108575924468922e-05, 'epoch': 1.16}


 58%|█████▊    | 2950/5084 [2:31:28<1:25:46,  2.41s/it]

{'loss': 0.2665, 'grad_norm': 1.4244941473007202, 'learning_rate': 2.0987411487018095e-05, 'epoch': 1.16}


 58%|█████▊    | 2960/5084 [2:31:53<1:29:39,  2.53s/it]

{'loss': 0.2944, 'grad_norm': 1.500717043876648, 'learning_rate': 2.088906372934697e-05, 'epoch': 1.16}


 58%|█████▊    | 2970/5084 [2:32:18<1:28:29,  2.51s/it]

{'loss': 0.265, 'grad_norm': 1.837401032447815, 'learning_rate': 2.0790715971675846e-05, 'epoch': 1.17}


 59%|█████▊    | 2980/5084 [2:32:44<1:29:09,  2.54s/it]

{'loss': 0.2414, 'grad_norm': 1.5459365844726562, 'learning_rate': 2.0692368214004722e-05, 'epoch': 1.17}


 59%|█████▉    | 2990/5084 [2:33:11<1:32:28,  2.65s/it]

{'loss': 0.2808, 'grad_norm': 2.398409366607666, 'learning_rate': 2.0594020456333597e-05, 'epoch': 1.18}


 59%|█████▉    | 3000/5084 [2:33:36<1:31:43,  2.64s/it]

{'loss': 0.2524, 'grad_norm': 2.358090877532959, 'learning_rate': 2.0495672698662473e-05, 'epoch': 1.18}


 59%|█████▉    | 3010/5084 [2:34:05<1:32:32,  2.68s/it]

{'loss': 0.2988, 'grad_norm': 2.2572903633117676, 'learning_rate': 2.039732494099135e-05, 'epoch': 1.18}


 59%|█████▉    | 3020/5084 [2:34:30<1:26:38,  2.52s/it]

{'loss': 0.284, 'grad_norm': 1.7258412837982178, 'learning_rate': 2.0298977183320224e-05, 'epoch': 1.19}


 60%|█████▉    | 3030/5084 [2:34:55<1:27:16,  2.55s/it]

{'loss': 0.3134, 'grad_norm': 2.162698745727539, 'learning_rate': 2.0200629425649096e-05, 'epoch': 1.19}


 60%|█████▉    | 3040/5084 [2:35:21<1:26:30,  2.54s/it]

{'loss': 0.2864, 'grad_norm': 1.8657525777816772, 'learning_rate': 2.0102281667977972e-05, 'epoch': 1.2}


 60%|█████▉    | 3050/5084 [2:35:47<1:24:55,  2.51s/it]

{'loss': 0.2627, 'grad_norm': 1.9713717699050903, 'learning_rate': 2.0003933910306847e-05, 'epoch': 1.2}


 60%|██████    | 3060/5084 [2:36:11<1:21:35,  2.42s/it]

{'loss': 0.2178, 'grad_norm': 1.6509865522384644, 'learning_rate': 1.9905586152635723e-05, 'epoch': 1.2}


 60%|██████    | 3070/5084 [2:36:36<1:22:42,  2.46s/it]

{'loss': 0.3159, 'grad_norm': 1.9980956315994263, 'learning_rate': 1.9807238394964595e-05, 'epoch': 1.21}


 61%|██████    | 3080/5084 [2:37:04<1:45:22,  3.16s/it]

{'loss': 0.2867, 'grad_norm': 2.5985162258148193, 'learning_rate': 1.970889063729347e-05, 'epoch': 1.21}


 61%|██████    | 3090/5084 [2:37:35<2:14:43,  4.05s/it]

{'loss': 0.2775, 'grad_norm': 1.988423228263855, 'learning_rate': 1.9610542879622346e-05, 'epoch': 1.22}


 61%|██████    | 3100/5084 [2:38:08<1:36:32,  2.92s/it]

{'loss': 0.2754, 'grad_norm': 2.265227794647217, 'learning_rate': 1.9512195121951222e-05, 'epoch': 1.22}


 61%|██████    | 3110/5084 [2:38:34<1:24:46,  2.58s/it]

{'loss': 0.2789, 'grad_norm': 1.847072720527649, 'learning_rate': 1.9413847364280098e-05, 'epoch': 1.22}


 61%|██████▏   | 3120/5084 [2:38:59<1:24:27,  2.58s/it]

{'loss': 0.2549, 'grad_norm': 1.9155058860778809, 'learning_rate': 1.931549960660897e-05, 'epoch': 1.23}


 62%|██████▏   | 3130/5084 [2:39:25<1:25:27,  2.62s/it]

{'loss': 0.3302, 'grad_norm': 2.501765012741089, 'learning_rate': 1.9217151848937845e-05, 'epoch': 1.23}


 62%|██████▏   | 3140/5084 [2:39:51<1:26:31,  2.67s/it]

{'loss': 0.2733, 'grad_norm': 1.8268687725067139, 'learning_rate': 1.911880409126672e-05, 'epoch': 1.24}


 62%|██████▏   | 3150/5084 [2:40:15<1:13:12,  2.27s/it]

{'loss': 0.2509, 'grad_norm': 1.4439303874969482, 'learning_rate': 1.9020456333595596e-05, 'epoch': 1.24}


 62%|██████▏   | 3160/5084 [2:40:40<1:21:24,  2.54s/it]

{'loss': 0.242, 'grad_norm': 1.1350570917129517, 'learning_rate': 1.892210857592447e-05, 'epoch': 1.24}


 62%|██████▏   | 3170/5084 [2:41:05<1:20:52,  2.54s/it]

{'loss': 0.3026, 'grad_norm': 1.9049760103225708, 'learning_rate': 1.8823760818253344e-05, 'epoch': 1.25}


 63%|██████▎   | 3180/5084 [2:41:31<1:20:47,  2.55s/it]

{'loss': 0.2658, 'grad_norm': 2.567274808883667, 'learning_rate': 1.872541306058222e-05, 'epoch': 1.25}


 63%|██████▎   | 3190/5084 [2:41:56<1:20:44,  2.56s/it]

{'loss': 0.282, 'grad_norm': 1.9695749282836914, 'learning_rate': 1.8627065302911095e-05, 'epoch': 1.25}


 63%|██████▎   | 3200/5084 [2:42:22<1:20:48,  2.57s/it]

{'loss': 0.2607, 'grad_norm': 1.5427402257919312, 'learning_rate': 1.8528717545239968e-05, 'epoch': 1.26}


 63%|██████▎   | 3210/5084 [2:42:47<1:22:23,  2.64s/it]

{'loss': 0.2949, 'grad_norm': 1.7696866989135742, 'learning_rate': 1.8430369787568843e-05, 'epoch': 1.26}


 63%|██████▎   | 3220/5084 [2:43:12<1:12:39,  2.34s/it]

{'loss': 0.2732, 'grad_norm': 1.7360079288482666, 'learning_rate': 1.833202202989772e-05, 'epoch': 1.27}


 64%|██████▎   | 3230/5084 [2:43:36<1:16:25,  2.47s/it]

{'loss': 0.2573, 'grad_norm': 1.7950782775878906, 'learning_rate': 1.8233674272226594e-05, 'epoch': 1.27}


 64%|██████▎   | 3240/5084 [2:44:01<1:17:38,  2.53s/it]

{'loss': 0.2949, 'grad_norm': 2.0095434188842773, 'learning_rate': 1.813532651455547e-05, 'epoch': 1.27}


 64%|██████▍   | 3250/5084 [2:44:26<1:19:09,  2.59s/it]

{'loss': 0.2816, 'grad_norm': 2.0446324348449707, 'learning_rate': 1.8036978756884342e-05, 'epoch': 1.28}


 64%|██████▍   | 3260/5084 [2:44:52<1:17:35,  2.55s/it]

{'loss': 0.3502, 'grad_norm': 1.9464221000671387, 'learning_rate': 1.7938630999213218e-05, 'epoch': 1.28}


 64%|██████▍   | 3270/5084 [2:45:17<1:15:22,  2.49s/it]

{'loss': 0.2554, 'grad_norm': 1.9756873846054077, 'learning_rate': 1.7840283241542093e-05, 'epoch': 1.29}


 65%|██████▍   | 3280/5084 [2:45:43<1:14:36,  2.48s/it]

{'loss': 0.2635, 'grad_norm': 2.3958091735839844, 'learning_rate': 1.774193548387097e-05, 'epoch': 1.29}


 65%|██████▍   | 3290/5084 [2:46:08<1:14:56,  2.51s/it]

{'loss': 0.2869, 'grad_norm': 2.16192889213562, 'learning_rate': 1.764358772619984e-05, 'epoch': 1.29}


 65%|██████▍   | 3300/5084 [2:46:31<1:08:39,  2.31s/it]

{'loss': 0.3053, 'grad_norm': 2.4396777153015137, 'learning_rate': 1.7545239968528717e-05, 'epoch': 1.3}


 65%|██████▌   | 3310/5084 [2:46:56<1:13:08,  2.47s/it]

{'loss': 0.2983, 'grad_norm': 2.2082083225250244, 'learning_rate': 1.7446892210857592e-05, 'epoch': 1.3}


 65%|██████▌   | 3320/5084 [2:47:21<1:13:35,  2.50s/it]

{'loss': 0.2673, 'grad_norm': 1.5464398860931396, 'learning_rate': 1.7348544453186468e-05, 'epoch': 1.31}


 65%|██████▌   | 3330/5084 [2:47:45<1:13:36,  2.52s/it]

{'loss': 0.3257, 'grad_norm': 1.9716143608093262, 'learning_rate': 1.7250196695515343e-05, 'epoch': 1.31}


 66%|██████▌   | 3340/5084 [2:48:09<1:07:01,  2.31s/it]

{'loss': 0.2636, 'grad_norm': 2.178924322128296, 'learning_rate': 1.7151848937844216e-05, 'epoch': 1.31}


 66%|██████▌   | 3350/5084 [2:48:33<1:10:25,  2.44s/it]

{'loss': 0.2671, 'grad_norm': 1.8578269481658936, 'learning_rate': 1.705350118017309e-05, 'epoch': 1.32}


 66%|██████▌   | 3360/5084 [2:48:58<1:12:18,  2.52s/it]

{'loss': 0.2779, 'grad_norm': 2.0788583755493164, 'learning_rate': 1.6955153422501967e-05, 'epoch': 1.32}


 66%|██████▋   | 3370/5084 [2:49:22<1:06:20,  2.32s/it]

{'loss': 0.2938, 'grad_norm': 1.7940914630889893, 'learning_rate': 1.6856805664830842e-05, 'epoch': 1.33}


 66%|██████▋   | 3380/5084 [2:49:46<1:09:23,  2.44s/it]

{'loss': 0.2641, 'grad_norm': 2.21386981010437, 'learning_rate': 1.6758457907159718e-05, 'epoch': 1.33}


 67%|██████▋   | 3390/5084 [2:50:12<1:09:40,  2.47s/it]

{'loss': 0.3128, 'grad_norm': 2.210641384124756, 'learning_rate': 1.6660110149488593e-05, 'epoch': 1.33}


 67%|██████▋   | 3400/5084 [2:50:36<1:05:08,  2.32s/it]

{'loss': 0.2655, 'grad_norm': 1.9678937196731567, 'learning_rate': 1.656176239181747e-05, 'epoch': 1.34}


 67%|██████▋   | 3410/5084 [2:51:00<1:09:53,  2.51s/it]

{'loss': 0.252, 'grad_norm': 1.673354983329773, 'learning_rate': 1.6463414634146345e-05, 'epoch': 1.34}


 67%|██████▋   | 3420/5084 [2:51:24<1:07:30,  2.43s/it]

{'loss': 0.3015, 'grad_norm': 1.691093921661377, 'learning_rate': 1.636506687647522e-05, 'epoch': 1.35}


 67%|██████▋   | 3430/5084 [2:51:48<1:09:05,  2.51s/it]

{'loss': 0.2546, 'grad_norm': 1.635137915611267, 'learning_rate': 1.6266719118804092e-05, 'epoch': 1.35}


 68%|██████▊   | 3440/5084 [2:52:13<1:08:49,  2.51s/it]

{'loss': 0.2959, 'grad_norm': 1.9205152988433838, 'learning_rate': 1.6168371361132968e-05, 'epoch': 1.35}


 68%|██████▊   | 3450/5084 [2:52:38<1:09:01,  2.53s/it]

{'loss': 0.3021, 'grad_norm': 1.7127803564071655, 'learning_rate': 1.6070023603461843e-05, 'epoch': 1.36}


 68%|██████▊   | 3460/5084 [2:53:04<1:09:38,  2.57s/it]

{'loss': 0.2806, 'grad_norm': 2.136791229248047, 'learning_rate': 1.597167584579072e-05, 'epoch': 1.36}


 68%|██████▊   | 3470/5084 [2:53:27<1:02:25,  2.32s/it]

{'loss': 0.2539, 'grad_norm': 2.0652551651000977, 'learning_rate': 1.587332808811959e-05, 'epoch': 1.37}


 68%|██████▊   | 3480/5084 [2:53:52<1:04:52,  2.43s/it]

{'loss': 0.2615, 'grad_norm': 1.8133177757263184, 'learning_rate': 1.5774980330448467e-05, 'epoch': 1.37}


 69%|██████▊   | 3490/5084 [2:54:16<1:01:54,  2.33s/it]

{'loss': 0.3217, 'grad_norm': 2.277010440826416, 'learning_rate': 1.5676632572777342e-05, 'epoch': 1.37}


 69%|██████▉   | 3500/5084 [2:54:41<1:06:54,  2.53s/it]

{'loss': 0.2305, 'grad_norm': 1.7112361192703247, 'learning_rate': 1.5578284815106218e-05, 'epoch': 1.38}


 69%|██████▉   | 3510/5084 [2:55:06<1:06:45,  2.54s/it]

{'loss': 0.2744, 'grad_norm': 1.6047022342681885, 'learning_rate': 1.547993705743509e-05, 'epoch': 1.38}


 69%|██████▉   | 3520/5084 [2:55:32<1:08:09,  2.61s/it]

{'loss': 0.3332, 'grad_norm': 2.5216188430786133, 'learning_rate': 1.5381589299763966e-05, 'epoch': 1.38}


 69%|██████▉   | 3530/5084 [2:55:56<1:02:52,  2.43s/it]

{'loss': 0.2423, 'grad_norm': 1.7409777641296387, 'learning_rate': 1.528324154209284e-05, 'epoch': 1.39}


 70%|██████▉   | 3540/5084 [2:56:19<1:01:50,  2.40s/it]

{'loss': 0.2497, 'grad_norm': 2.4398117065429688, 'learning_rate': 1.5184893784421717e-05, 'epoch': 1.39}


 70%|██████▉   | 3550/5084 [2:56:44<1:03:14,  2.47s/it]

{'loss': 0.2691, 'grad_norm': 1.8460029363632202, 'learning_rate': 1.5086546026750593e-05, 'epoch': 1.4}


 70%|███████   | 3560/5084 [2:57:09<1:01:34,  2.42s/it]

{'loss': 0.2579, 'grad_norm': 2.5685815811157227, 'learning_rate': 1.4988198269079465e-05, 'epoch': 1.4}


 70%|███████   | 3570/5084 [2:57:34<1:02:18,  2.47s/it]

{'loss': 0.2535, 'grad_norm': 1.6851623058319092, 'learning_rate': 1.488985051140834e-05, 'epoch': 1.4}


 70%|███████   | 3580/5084 [2:58:00<1:03:41,  2.54s/it]

{'loss': 0.2445, 'grad_norm': 2.055748701095581, 'learning_rate': 1.4791502753737216e-05, 'epoch': 1.41}


 71%|███████   | 3590/5084 [2:58:23<1:02:09,  2.50s/it]

{'loss': 0.2424, 'grad_norm': 2.2148807048797607, 'learning_rate': 1.4693154996066091e-05, 'epoch': 1.41}


 71%|███████   | 3600/5084 [2:58:46<56:41,  2.29s/it]  

{'loss': 0.272, 'grad_norm': 2.1826529502868652, 'learning_rate': 1.4594807238394964e-05, 'epoch': 1.42}


 71%|███████   | 3610/5084 [2:59:11<59:46,  2.43s/it]  

{'loss': 0.2928, 'grad_norm': 2.039512872695923, 'learning_rate': 1.449645948072384e-05, 'epoch': 1.42}


 71%|███████   | 3620/5084 [2:59:37<1:01:46,  2.53s/it]

{'loss': 0.2834, 'grad_norm': 2.83719801902771, 'learning_rate': 1.4398111723052715e-05, 'epoch': 1.42}


 71%|███████▏  | 3630/5084 [3:00:01<1:01:01,  2.52s/it]

{'loss': 0.2627, 'grad_norm': 1.929729700088501, 'learning_rate': 1.429976396538159e-05, 'epoch': 1.43}


 72%|███████▏  | 3640/5084 [3:00:27<1:01:38,  2.56s/it]

{'loss': 0.3409, 'grad_norm': 2.279449939727783, 'learning_rate': 1.4201416207710466e-05, 'epoch': 1.43}


 72%|███████▏  | 3650/5084 [3:00:52<59:29,  2.49s/it]  

{'loss': 0.2433, 'grad_norm': 1.8607475757598877, 'learning_rate': 1.4103068450039338e-05, 'epoch': 1.44}


 72%|███████▏  | 3660/5084 [3:01:15<57:48,  2.44s/it]

{'loss': 0.2801, 'grad_norm': 1.752498745918274, 'learning_rate': 1.4004720692368214e-05, 'epoch': 1.44}


 72%|███████▏  | 3670/5084 [3:01:39<55:27,  2.35s/it]

{'loss': 0.2566, 'grad_norm': 2.633777379989624, 'learning_rate': 1.390637293469709e-05, 'epoch': 1.44}


 72%|███████▏  | 3680/5084 [3:02:04<56:17,  2.41s/it]

{'loss': 0.2595, 'grad_norm': 2.081773042678833, 'learning_rate': 1.3808025177025965e-05, 'epoch': 1.45}


 73%|███████▎  | 3690/5084 [3:02:28<56:54,  2.45s/it]

{'loss': 0.2753, 'grad_norm': 1.926292896270752, 'learning_rate': 1.3709677419354839e-05, 'epoch': 1.45}


 73%|███████▎  | 3700/5084 [3:02:53<58:13,  2.52s/it]

{'loss': 0.2358, 'grad_norm': 1.9045648574829102, 'learning_rate': 1.3611329661683714e-05, 'epoch': 1.46}


 73%|███████▎  | 3710/5084 [3:03:17<54:50,  2.40s/it]

{'loss': 0.2743, 'grad_norm': 1.9806106090545654, 'learning_rate': 1.351298190401259e-05, 'epoch': 1.46}


 73%|███████▎  | 3720/5084 [3:03:41<55:13,  2.43s/it]

{'loss': 0.2356, 'grad_norm': 1.3253220319747925, 'learning_rate': 1.3414634146341466e-05, 'epoch': 1.46}


 73%|███████▎  | 3730/5084 [3:04:06<55:02,  2.44s/it]

{'loss': 0.3012, 'grad_norm': 2.3170995712280273, 'learning_rate': 1.3316286388670338e-05, 'epoch': 1.47}


 74%|███████▎  | 3740/5084 [3:04:32<55:46,  2.49s/it]

{'loss': 0.2346, 'grad_norm': 1.7830171585083008, 'learning_rate': 1.3217938630999213e-05, 'epoch': 1.47}


 74%|███████▍  | 3750/5084 [3:04:57<56:26,  2.54s/it]

{'loss': 0.2508, 'grad_norm': 2.279491662979126, 'learning_rate': 1.3119590873328089e-05, 'epoch': 1.48}


 74%|███████▍  | 3760/5084 [3:05:20<51:24,  2.33s/it]

{'loss': 0.2414, 'grad_norm': 1.7641425132751465, 'learning_rate': 1.3021243115656964e-05, 'epoch': 1.48}


 74%|███████▍  | 3770/5084 [3:05:45<52:56,  2.42s/it]

{'loss': 0.2443, 'grad_norm': 2.2965357303619385, 'learning_rate': 1.292289535798584e-05, 'epoch': 1.48}


 74%|███████▍  | 3780/5084 [3:06:10<53:21,  2.45s/it]

{'loss': 0.2325, 'grad_norm': 1.7450472116470337, 'learning_rate': 1.2824547600314712e-05, 'epoch': 1.49}


 75%|███████▍  | 3790/5084 [3:06:34<53:14,  2.47s/it]

{'loss': 0.2462, 'grad_norm': 1.892984390258789, 'learning_rate': 1.2726199842643588e-05, 'epoch': 1.49}


 75%|███████▍  | 3800/5084 [3:06:58<51:06,  2.39s/it]

{'loss': 0.256, 'grad_norm': 1.9075087308883667, 'learning_rate': 1.2627852084972463e-05, 'epoch': 1.49}


 75%|███████▍  | 3810/5084 [3:07:22<52:17,  2.46s/it]

{'loss': 0.2703, 'grad_norm': 1.8991667032241821, 'learning_rate': 1.2529504327301339e-05, 'epoch': 1.5}


 75%|███████▌  | 3820/5084 [3:07:47<51:49,  2.46s/it]

{'loss': 0.2639, 'grad_norm': 2.2035622596740723, 'learning_rate': 1.2431156569630213e-05, 'epoch': 1.5}


 75%|███████▌  | 3830/5084 [3:08:11<50:16,  2.41s/it]

{'loss': 0.2553, 'grad_norm': 1.893618106842041, 'learning_rate': 1.2332808811959087e-05, 'epoch': 1.51}


 76%|███████▌  | 3840/5084 [3:08:35<48:50,  2.36s/it]

{'loss': 0.2518, 'grad_norm': 2.2175052165985107, 'learning_rate': 1.2234461054287962e-05, 'epoch': 1.51}


 76%|███████▌  | 3850/5084 [3:08:59<51:39,  2.51s/it]

{'loss': 0.3279, 'grad_norm': 1.7811108827590942, 'learning_rate': 1.2136113296616838e-05, 'epoch': 1.51}


 76%|███████▌  | 3860/5084 [3:09:23<49:49,  2.44s/it]

{'loss': 0.3035, 'grad_norm': 2.3334872722625732, 'learning_rate': 1.2037765538945712e-05, 'epoch': 1.52}


 76%|███████▌  | 3870/5084 [3:09:47<49:42,  2.46s/it]

{'loss': 0.275, 'grad_norm': 2.0306246280670166, 'learning_rate': 1.1939417781274587e-05, 'epoch': 1.52}


 76%|███████▋  | 3880/5084 [3:10:11<45:52,  2.29s/it]

{'loss': 0.195, 'grad_norm': 1.8377827405929565, 'learning_rate': 1.1841070023603463e-05, 'epoch': 1.53}


 77%|███████▋  | 3890/5084 [3:10:36<48:41,  2.45s/it]

{'loss': 0.22, 'grad_norm': 2.187980890274048, 'learning_rate': 1.1742722265932338e-05, 'epoch': 1.53}


 77%|███████▋  | 3900/5084 [3:11:01<48:34,  2.46s/it]

{'loss': 0.29, 'grad_norm': 2.015291213989258, 'learning_rate': 1.1644374508261212e-05, 'epoch': 1.53}


 77%|███████▋  | 3910/5084 [3:11:26<48:40,  2.49s/it]

{'loss': 0.3048, 'grad_norm': 1.9123845100402832, 'learning_rate': 1.1546026750590088e-05, 'epoch': 1.54}


 77%|███████▋  | 3920/5084 [3:11:51<48:19,  2.49s/it]

{'loss': 0.226, 'grad_norm': 1.7340284585952759, 'learning_rate': 1.1447678992918962e-05, 'epoch': 1.54}


 77%|███████▋  | 3930/5084 [3:12:15<47:19,  2.46s/it]

{'loss': 0.2649, 'grad_norm': 1.5940676927566528, 'learning_rate': 1.1349331235247837e-05, 'epoch': 1.55}


 77%|███████▋  | 3940/5084 [3:12:40<47:24,  2.49s/it]

{'loss': 0.2419, 'grad_norm': 1.5763441324234009, 'learning_rate': 1.1250983477576711e-05, 'epoch': 1.55}


 78%|███████▊  | 3950/5084 [3:13:05<43:49,  2.32s/it]

{'loss': 0.2654, 'grad_norm': 1.7617326974868774, 'learning_rate': 1.1152635719905587e-05, 'epoch': 1.55}


 78%|███████▊  | 3960/5084 [3:13:29<46:23,  2.48s/it]

{'loss': 0.26, 'grad_norm': 2.3345322608947754, 'learning_rate': 1.1054287962234462e-05, 'epoch': 1.56}


 78%|███████▊  | 3970/5084 [3:13:53<46:04,  2.48s/it]

{'loss': 0.2501, 'grad_norm': 1.8381558656692505, 'learning_rate': 1.0955940204563336e-05, 'epoch': 1.56}


 78%|███████▊  | 3980/5084 [3:14:18<46:01,  2.50s/it]

{'loss': 0.2971, 'grad_norm': 2.3852686882019043, 'learning_rate': 1.0857592446892212e-05, 'epoch': 1.57}


 78%|███████▊  | 3990/5084 [3:14:43<44:15,  2.43s/it]

{'loss': 0.2547, 'grad_norm': 2.0186057090759277, 'learning_rate': 1.0759244689221086e-05, 'epoch': 1.57}


 79%|███████▊  | 4000/5084 [3:15:07<42:25,  2.35s/it]

{'loss': 0.289, 'grad_norm': 1.7518765926361084, 'learning_rate': 1.0660896931549961e-05, 'epoch': 1.57}


 79%|███████▉  | 4010/5084 [3:15:37<47:04,  2.63s/it]  

{'loss': 0.2388, 'grad_norm': 2.0884292125701904, 'learning_rate': 1.0562549173878835e-05, 'epoch': 1.58}


 79%|███████▉  | 4020/5084 [3:16:02<44:12,  2.49s/it]

{'loss': 0.2573, 'grad_norm': 2.1447737216949463, 'learning_rate': 1.0464201416207711e-05, 'epoch': 1.58}


 79%|███████▉  | 4030/5084 [3:16:25<41:42,  2.37s/it]

{'loss': 0.2408, 'grad_norm': 1.8713099956512451, 'learning_rate': 1.0365853658536585e-05, 'epoch': 1.59}


 79%|███████▉  | 4040/5084 [3:16:50<43:45,  2.51s/it]

{'loss': 0.2878, 'grad_norm': 2.692110300064087, 'learning_rate': 1.026750590086546e-05, 'epoch': 1.59}


 80%|███████▉  | 4050/5084 [3:17:14<38:54,  2.26s/it]

{'loss': 0.2538, 'grad_norm': 2.1237430572509766, 'learning_rate': 1.0169158143194334e-05, 'epoch': 1.59}


 80%|███████▉  | 4060/5084 [3:17:37<38:27,  2.25s/it]

{'loss': 0.2476, 'grad_norm': 1.6866978406906128, 'learning_rate': 1.007081038552321e-05, 'epoch': 1.6}


 80%|████████  | 4070/5084 [3:18:02<42:19,  2.50s/it]

{'loss': 0.2599, 'grad_norm': 2.5825612545013428, 'learning_rate': 9.972462627852085e-06, 'epoch': 1.6}


 80%|████████  | 4080/5084 [3:18:27<41:41,  2.49s/it]

{'loss': 0.2773, 'grad_norm': 1.6391690969467163, 'learning_rate': 9.874114870180961e-06, 'epoch': 1.61}


 80%|████████  | 4090/5084 [3:18:52<41:16,  2.49s/it]

{'loss': 0.2385, 'grad_norm': 1.4507896900177002, 'learning_rate': 9.775767112509837e-06, 'epoch': 1.61}


 81%|████████  | 4100/5084 [3:19:16<40:55,  2.50s/it]

{'loss': 0.2497, 'grad_norm': 1.8485666513442993, 'learning_rate': 9.67741935483871e-06, 'epoch': 1.61}


 81%|████████  | 4110/5084 [3:19:41<39:27,  2.43s/it]

{'loss': 0.2567, 'grad_norm': 1.8791230916976929, 'learning_rate': 9.579071597167586e-06, 'epoch': 1.62}


 81%|████████  | 4120/5084 [3:20:06<39:32,  2.46s/it]

{'loss': 0.254, 'grad_norm': 1.9696664810180664, 'learning_rate': 9.48072383949646e-06, 'epoch': 1.62}


 81%|████████  | 4130/5084 [3:20:31<39:09,  2.46s/it]

{'loss': 0.2398, 'grad_norm': 1.5825971364974976, 'learning_rate': 9.382376081825335e-06, 'epoch': 1.62}


 81%|████████▏ | 4140/5084 [3:20:55<39:02,  2.48s/it]

{'loss': 0.2579, 'grad_norm': 2.1098477840423584, 'learning_rate': 9.28402832415421e-06, 'epoch': 1.63}


 82%|████████▏ | 4150/5084 [3:21:20<38:00,  2.44s/it]

{'loss': 0.3211, 'grad_norm': 2.2047111988067627, 'learning_rate': 9.185680566483085e-06, 'epoch': 1.63}


 82%|████████▏ | 4160/5084 [3:21:45<37:52,  2.46s/it]

{'loss': 0.2989, 'grad_norm': 2.9667272567749023, 'learning_rate': 9.087332808811959e-06, 'epoch': 1.64}


 82%|████████▏ | 4170/5084 [3:22:09<36:42,  2.41s/it]

{'loss': 0.3244, 'grad_norm': 2.267660140991211, 'learning_rate': 8.988985051140834e-06, 'epoch': 1.64}


 82%|████████▏ | 4180/5084 [3:22:33<37:19,  2.48s/it]

{'loss': 0.2855, 'grad_norm': 2.4118778705596924, 'learning_rate': 8.89063729346971e-06, 'epoch': 1.64}


 82%|████████▏ | 4190/5084 [3:22:59<37:35,  2.52s/it]

{'loss': 0.2697, 'grad_norm': 2.001352310180664, 'learning_rate': 8.792289535798584e-06, 'epoch': 1.65}


 83%|████████▎ | 4200/5084 [3:23:23<34:44,  2.36s/it]

{'loss': 0.2747, 'grad_norm': 2.719326972961426, 'learning_rate': 8.69394177812746e-06, 'epoch': 1.65}


 83%|████████▎ | 4210/5084 [3:23:47<36:07,  2.48s/it]

{'loss': 0.2675, 'grad_norm': 2.1248128414154053, 'learning_rate': 8.595594020456333e-06, 'epoch': 1.66}


 83%|████████▎ | 4220/5084 [3:24:11<35:29,  2.46s/it]

{'loss': 0.271, 'grad_norm': 3.3378589153289795, 'learning_rate': 8.497246262785209e-06, 'epoch': 1.66}


 83%|████████▎ | 4230/5084 [3:24:35<32:45,  2.30s/it]

{'loss': 0.2572, 'grad_norm': 1.8874417543411255, 'learning_rate': 8.398898505114083e-06, 'epoch': 1.66}


 83%|████████▎ | 4240/5084 [3:25:00<35:37,  2.53s/it]

{'loss': 0.2775, 'grad_norm': 1.988243818283081, 'learning_rate': 8.300550747442958e-06, 'epoch': 1.67}


 84%|████████▎ | 4250/5084 [3:25:23<34:11,  2.46s/it]

{'loss': 0.3045, 'grad_norm': 1.7366280555725098, 'learning_rate': 8.202202989771832e-06, 'epoch': 1.67}


 84%|████████▍ | 4260/5084 [3:25:48<33:16,  2.42s/it]

{'loss': 0.2719, 'grad_norm': 2.5577147006988525, 'learning_rate': 8.103855232100708e-06, 'epoch': 1.68}


 84%|████████▍ | 4270/5084 [3:26:13<33:26,  2.47s/it]

{'loss': 0.2625, 'grad_norm': 1.8092533349990845, 'learning_rate': 8.005507474429583e-06, 'epoch': 1.68}


 84%|████████▍ | 4280/5084 [3:26:37<31:17,  2.34s/it]

{'loss': 0.2573, 'grad_norm': 2.2488787174224854, 'learning_rate': 7.907159716758459e-06, 'epoch': 1.68}


 84%|████████▍ | 4290/5084 [3:27:01<32:56,  2.49s/it]

{'loss': 0.2695, 'grad_norm': 2.0513052940368652, 'learning_rate': 7.808811959087335e-06, 'epoch': 1.69}


 85%|████████▍ | 4300/5084 [3:27:25<30:33,  2.34s/it]

{'loss': 0.2774, 'grad_norm': 1.9076682329177856, 'learning_rate': 7.710464201416208e-06, 'epoch': 1.69}


 85%|████████▍ | 4310/5084 [3:27:50<31:44,  2.46s/it]

{'loss': 0.2604, 'grad_norm': 1.9516905546188354, 'learning_rate': 7.612116443745083e-06, 'epoch': 1.7}


 85%|████████▍ | 4320/5084 [3:28:14<31:33,  2.48s/it]

{'loss': 0.2651, 'grad_norm': 1.891755223274231, 'learning_rate': 7.513768686073958e-06, 'epoch': 1.7}


 85%|████████▌ | 4330/5084 [3:28:39<31:32,  2.51s/it]

{'loss': 0.231, 'grad_norm': 2.1997382640838623, 'learning_rate': 7.4154209284028335e-06, 'epoch': 1.7}


 85%|████████▌ | 4340/5084 [3:29:04<31:07,  2.51s/it]

{'loss': 0.2512, 'grad_norm': 1.6746488809585571, 'learning_rate': 7.317073170731707e-06, 'epoch': 1.71}


 86%|████████▌ | 4350/5084 [3:29:29<30:46,  2.52s/it]

{'loss': 0.2535, 'grad_norm': 1.8023916482925415, 'learning_rate': 7.218725413060583e-06, 'epoch': 1.71}


 86%|████████▌ | 4360/5084 [3:29:53<29:56,  2.48s/it]

{'loss': 0.2995, 'grad_norm': 2.067643642425537, 'learning_rate': 7.120377655389457e-06, 'epoch': 1.72}


 86%|████████▌ | 4370/5084 [3:30:18<29:24,  2.47s/it]

{'loss': 0.2411, 'grad_norm': 2.0820209980010986, 'learning_rate': 7.022029897718332e-06, 'epoch': 1.72}


 86%|████████▌ | 4380/5084 [3:30:43<28:11,  2.40s/it]

{'loss': 0.261, 'grad_norm': 1.7579225301742554, 'learning_rate': 6.923682140047208e-06, 'epoch': 1.72}


 86%|████████▋ | 4390/5084 [3:31:06<27:12,  2.35s/it]

{'loss': 0.28, 'grad_norm': 1.9827847480773926, 'learning_rate': 6.825334382376082e-06, 'epoch': 1.73}


 87%|████████▋ | 4400/5084 [3:31:32<28:14,  2.48s/it]

{'loss': 0.2502, 'grad_norm': 1.4250965118408203, 'learning_rate': 6.7269866247049575e-06, 'epoch': 1.73}


 87%|████████▋ | 4410/5084 [3:31:57<27:58,  2.49s/it]

{'loss': 0.2416, 'grad_norm': 1.9041013717651367, 'learning_rate': 6.628638867033831e-06, 'epoch': 1.73}


 87%|████████▋ | 4420/5084 [3:32:21<26:02,  2.35s/it]

{'loss': 0.229, 'grad_norm': 2.2449166774749756, 'learning_rate': 6.530291109362707e-06, 'epoch': 1.74}


 87%|████████▋ | 4430/5084 [3:32:46<27:38,  2.54s/it]

{'loss': 0.2726, 'grad_norm': 2.240827798843384, 'learning_rate': 6.431943351691582e-06, 'epoch': 1.74}


 87%|████████▋ | 4440/5084 [3:33:10<26:16,  2.45s/it]

{'loss': 0.1977, 'grad_norm': 1.3634850978851318, 'learning_rate': 6.333595594020457e-06, 'epoch': 1.75}


 88%|████████▊ | 4450/5084 [3:33:35<26:26,  2.50s/it]

{'loss': 0.289, 'grad_norm': 1.2084989547729492, 'learning_rate': 6.235247836349332e-06, 'epoch': 1.75}


 88%|████████▊ | 4460/5084 [3:33:59<24:38,  2.37s/it]

{'loss': 0.2306, 'grad_norm': 1.8843235969543457, 'learning_rate': 6.136900078678207e-06, 'epoch': 1.75}


 88%|████████▊ | 4470/5084 [3:34:22<24:55,  2.44s/it]

{'loss': 0.255, 'grad_norm': 2.571117639541626, 'learning_rate': 6.0385523210070814e-06, 'epoch': 1.76}


 88%|████████▊ | 4480/5084 [3:34:47<24:54,  2.48s/it]

{'loss': 0.2563, 'grad_norm': 2.15142822265625, 'learning_rate': 5.940204563335956e-06, 'epoch': 1.76}


 88%|████████▊ | 4490/5084 [3:35:11<23:29,  2.37s/it]

{'loss': 0.242, 'grad_norm': 1.9287103414535522, 'learning_rate': 5.841856805664831e-06, 'epoch': 1.77}


 89%|████████▊ | 4500/5084 [3:35:35<23:39,  2.43s/it]

{'loss': 0.2641, 'grad_norm': 1.7068898677825928, 'learning_rate': 5.743509047993706e-06, 'epoch': 1.77}


 89%|████████▊ | 4510/5084 [3:36:00<23:16,  2.43s/it]

{'loss': 0.2653, 'grad_norm': 2.1848208904266357, 'learning_rate': 5.64516129032258e-06, 'epoch': 1.77}


 89%|████████▉ | 4520/5084 [3:36:24<21:22,  2.27s/it]

{'loss': 0.2589, 'grad_norm': 1.7884624004364014, 'learning_rate': 5.546813532651456e-06, 'epoch': 1.78}


 89%|████████▉ | 4530/5084 [3:36:48<22:24,  2.43s/it]

{'loss': 0.2297, 'grad_norm': 1.8214404582977295, 'learning_rate': 5.448465774980331e-06, 'epoch': 1.78}


 89%|████████▉ | 4540/5084 [3:37:13<22:45,  2.51s/it]

{'loss': 0.2713, 'grad_norm': 2.2584404945373535, 'learning_rate': 5.350118017309206e-06, 'epoch': 1.79}


 89%|████████▉ | 4550/5084 [3:37:38<22:35,  2.54s/it]

{'loss': 0.2314, 'grad_norm': 1.7631043195724487, 'learning_rate': 5.251770259638081e-06, 'epoch': 1.79}


 90%|████████▉ | 4560/5084 [3:38:03<21:55,  2.51s/it]

{'loss': 0.2558, 'grad_norm': 2.6291966438293457, 'learning_rate': 5.153422501966956e-06, 'epoch': 1.79}


 90%|████████▉ | 4570/5084 [3:38:27<21:20,  2.49s/it]

{'loss': 0.2361, 'grad_norm': 2.1925342082977295, 'learning_rate': 5.0550747442958305e-06, 'epoch': 1.8}


 90%|█████████ | 4580/5084 [3:38:53<21:07,  2.51s/it]

{'loss': 0.2712, 'grad_norm': 1.8099064826965332, 'learning_rate': 4.956726986624705e-06, 'epoch': 1.8}


 90%|█████████ | 4590/5084 [3:39:17<18:43,  2.27s/it]

{'loss': 0.2618, 'grad_norm': 2.0281476974487305, 'learning_rate': 4.85837922895358e-06, 'epoch': 1.81}


 90%|█████████ | 4600/5084 [3:39:42<20:15,  2.51s/it]

{'loss': 0.2665, 'grad_norm': 1.899626612663269, 'learning_rate': 4.760031471282455e-06, 'epoch': 1.81}


 91%|█████████ | 4610/5084 [3:40:06<19:03,  2.41s/it]

{'loss': 0.2269, 'grad_norm': 1.0316717624664307, 'learning_rate': 4.661683713611329e-06, 'epoch': 1.81}


 91%|█████████ | 4620/5084 [3:40:31<18:16,  2.36s/it]

{'loss': 0.2635, 'grad_norm': 1.8593467473983765, 'learning_rate': 4.563335955940205e-06, 'epoch': 1.82}


 91%|█████████ | 4630/5084 [3:40:55<18:16,  2.42s/it]

{'loss': 0.2339, 'grad_norm': 1.7061381340026855, 'learning_rate': 4.46498819826908e-06, 'epoch': 1.82}


 91%|█████████▏| 4640/5084 [3:41:20<18:19,  2.48s/it]

{'loss': 0.2434, 'grad_norm': 2.524134635925293, 'learning_rate': 4.3666404405979544e-06, 'epoch': 1.83}


 91%|█████████▏| 4650/5084 [3:41:45<18:23,  2.54s/it]

{'loss': 0.2495, 'grad_norm': 1.6489242315292358, 'learning_rate': 4.26829268292683e-06, 'epoch': 1.83}


 92%|█████████▏| 4660/5084 [3:42:10<18:02,  2.55s/it]

{'loss': 0.3105, 'grad_norm': 2.518150806427002, 'learning_rate': 4.169944925255705e-06, 'epoch': 1.83}


 92%|█████████▏| 4670/5084 [3:42:36<17:26,  2.53s/it]

{'loss': 0.2806, 'grad_norm': 1.6523194313049316, 'learning_rate': 4.0715971675845795e-06, 'epoch': 1.84}


 92%|█████████▏| 4680/5084 [3:43:02<17:37,  2.62s/it]

{'loss': 0.2829, 'grad_norm': 1.69918692111969, 'learning_rate': 3.973249409913454e-06, 'epoch': 1.84}


 92%|█████████▏| 4690/5084 [3:43:27<16:39,  2.54s/it]

{'loss': 0.2665, 'grad_norm': 2.1919262409210205, 'learning_rate': 3.874901652242329e-06, 'epoch': 1.85}


 92%|█████████▏| 4700/5084 [3:43:52<15:48,  2.47s/it]

{'loss': 0.2958, 'grad_norm': 1.2956589460372925, 'learning_rate': 3.776553894571204e-06, 'epoch': 1.85}


 93%|█████████▎| 4710/5084 [3:44:15<14:23,  2.31s/it]

{'loss': 0.2823, 'grad_norm': 2.149148941040039, 'learning_rate': 3.678206136900079e-06, 'epoch': 1.85}


 93%|█████████▎| 4720/5084 [3:44:40<14:49,  2.44s/it]

{'loss': 0.2403, 'grad_norm': 2.19661021232605, 'learning_rate': 3.5798583792289536e-06, 'epoch': 1.86}


 93%|█████████▎| 4730/5084 [3:45:05<14:23,  2.44s/it]

{'loss': 0.2047, 'grad_norm': 2.084477663040161, 'learning_rate': 3.4815106215578283e-06, 'epoch': 1.86}


 93%|█████████▎| 4740/5084 [3:45:29<14:03,  2.45s/it]

{'loss': 0.2609, 'grad_norm': 2.389845132827759, 'learning_rate': 3.3831628638867034e-06, 'epoch': 1.86}


 93%|█████████▎| 4750/5084 [3:45:53<13:25,  2.41s/it]

{'loss': 0.2288, 'grad_norm': 2.6322107315063477, 'learning_rate': 3.284815106215578e-06, 'epoch': 1.87}


 94%|█████████▎| 4760/5084 [3:46:16<12:37,  2.34s/it]

{'loss': 0.2487, 'grad_norm': 1.4338098764419556, 'learning_rate': 3.1864673485444538e-06, 'epoch': 1.87}


 94%|█████████▍| 4770/5084 [3:46:39<12:35,  2.41s/it]

{'loss': 0.2286, 'grad_norm': 2.198336362838745, 'learning_rate': 3.088119590873328e-06, 'epoch': 1.88}


 94%|█████████▍| 4780/5084 [3:47:04<12:30,  2.47s/it]

{'loss': 0.2667, 'grad_norm': 2.2158308029174805, 'learning_rate': 2.9897718332022032e-06, 'epoch': 1.88}


 94%|█████████▍| 4790/5084 [3:47:29<11:51,  2.42s/it]

{'loss': 0.2483, 'grad_norm': 2.032911539077759, 'learning_rate': 2.891424075531078e-06, 'epoch': 1.88}


 94%|█████████▍| 4800/5084 [3:47:54<11:30,  2.43s/it]

{'loss': 0.2605, 'grad_norm': 1.8219243288040161, 'learning_rate': 2.793076317859953e-06, 'epoch': 1.89}


 95%|█████████▍| 4810/5084 [3:48:19<11:04,  2.42s/it]

{'loss': 0.2984, 'grad_norm': 2.712902545928955, 'learning_rate': 2.694728560188828e-06, 'epoch': 1.89}


 95%|█████████▍| 4820/5084 [3:48:42<09:48,  2.23s/it]

{'loss': 0.2683, 'grad_norm': 3.235875368118286, 'learning_rate': 2.5963808025177026e-06, 'epoch': 1.9}


 95%|█████████▌| 4830/5084 [3:49:06<10:25,  2.46s/it]

{'loss': 0.2501, 'grad_norm': 1.9281812906265259, 'learning_rate': 2.4980330448465773e-06, 'epoch': 1.9}


 95%|█████████▌| 4840/5084 [3:49:31<10:00,  2.46s/it]

{'loss': 0.236, 'grad_norm': 1.8334052562713623, 'learning_rate': 2.3996852871754525e-06, 'epoch': 1.9}


 95%|█████████▌| 4850/5084 [3:49:55<09:05,  2.33s/it]

{'loss': 0.2231, 'grad_norm': 1.862549901008606, 'learning_rate': 2.3013375295043276e-06, 'epoch': 1.91}


 96%|█████████▌| 4860/5084 [3:50:18<08:53,  2.38s/it]

{'loss': 0.2209, 'grad_norm': 2.1395986080169678, 'learning_rate': 2.2029897718332024e-06, 'epoch': 1.91}


 96%|█████████▌| 4870/5084 [3:50:42<08:12,  2.30s/it]

{'loss': 0.2386, 'grad_norm': 1.4732085466384888, 'learning_rate': 2.104642014162077e-06, 'epoch': 1.92}


 96%|█████████▌| 4880/5084 [3:51:06<07:54,  2.33s/it]

{'loss': 0.2618, 'grad_norm': 1.714317798614502, 'learning_rate': 2.006294256490952e-06, 'epoch': 1.92}


 96%|█████████▌| 4890/5084 [3:51:29<07:12,  2.23s/it]

{'loss': 0.3011, 'grad_norm': 1.7727887630462646, 'learning_rate': 1.907946498819827e-06, 'epoch': 1.92}


 96%|█████████▋| 4900/5084 [3:51:54<07:40,  2.51s/it]

{'loss': 0.3005, 'grad_norm': 1.8729088306427002, 'learning_rate': 1.809598741148702e-06, 'epoch': 1.93}


 97%|█████████▋| 4910/5084 [3:52:17<06:37,  2.29s/it]

{'loss': 0.2632, 'grad_norm': 2.126549243927002, 'learning_rate': 1.7112509834775769e-06, 'epoch': 1.93}


 97%|█████████▋| 4920/5084 [3:52:41<06:49,  2.49s/it]

{'loss': 0.2198, 'grad_norm': 1.8154655694961548, 'learning_rate': 1.6129032258064516e-06, 'epoch': 1.94}


 97%|█████████▋| 4930/5084 [3:53:05<05:55,  2.31s/it]

{'loss': 0.2475, 'grad_norm': 1.896937370300293, 'learning_rate': 1.5145554681353268e-06, 'epoch': 1.94}


 97%|█████████▋| 4940/5084 [3:53:29<05:43,  2.38s/it]

{'loss': 0.2422, 'grad_norm': 1.6217029094696045, 'learning_rate': 1.4162077104642015e-06, 'epoch': 1.94}


 97%|█████████▋| 4950/5084 [3:53:54<05:16,  2.36s/it]

{'loss': 0.2588, 'grad_norm': 1.8896340131759644, 'learning_rate': 1.3178599527930762e-06, 'epoch': 1.95}


 98%|█████████▊| 4960/5084 [3:54:18<05:03,  2.45s/it]

{'loss': 0.2454, 'grad_norm': 1.5790094137191772, 'learning_rate': 1.2195121951219514e-06, 'epoch': 1.95}


 98%|█████████▊| 4970/5084 [3:54:41<04:36,  2.42s/it]

{'loss': 0.2583, 'grad_norm': 1.9909378290176392, 'learning_rate': 1.121164437450826e-06, 'epoch': 1.96}


 98%|█████████▊| 4980/5084 [3:55:05<04:12,  2.43s/it]

{'loss': 0.2846, 'grad_norm': 1.872663974761963, 'learning_rate': 1.0228166797797013e-06, 'epoch': 1.96}


 98%|█████████▊| 4990/5084 [3:55:29<03:48,  2.43s/it]

{'loss': 0.2734, 'grad_norm': 2.3364458084106445, 'learning_rate': 9.24468922108576e-07, 'epoch': 1.96}


 98%|█████████▊| 5000/5084 [3:55:54<03:27,  2.47s/it]

{'loss': 0.2248, 'grad_norm': 1.210141897201538, 'learning_rate': 8.261211644374508e-07, 'epoch': 1.97}


 99%|█████████▊| 5010/5084 [3:56:20<03:04,  2.49s/it]

{'loss': 0.2724, 'grad_norm': 1.9326716661453247, 'learning_rate': 7.277734067663258e-07, 'epoch': 1.97}


 99%|█████████▊| 5020/5084 [3:56:44<02:39,  2.50s/it]

{'loss': 0.246, 'grad_norm': 1.9432209730148315, 'learning_rate': 6.294256490952006e-07, 'epoch': 1.97}


 99%|█████████▉| 5030/5084 [3:57:08<02:08,  2.38s/it]

{'loss': 0.2854, 'grad_norm': 2.348257064819336, 'learning_rate': 5.310778914240756e-07, 'epoch': 1.98}


 99%|█████████▉| 5040/5084 [3:57:33<01:49,  2.48s/it]

{'loss': 0.1918, 'grad_norm': 1.7644274234771729, 'learning_rate': 4.327301337529505e-07, 'epoch': 1.98}


 99%|█████████▉| 5050/5084 [3:57:58<01:25,  2.51s/it]

{'loss': 0.2421, 'grad_norm': 1.7030493021011353, 'learning_rate': 3.343823760818254e-07, 'epoch': 1.99}


100%|█████████▉| 5060/5084 [3:58:23<00:59,  2.47s/it]

{'loss': 0.2896, 'grad_norm': 2.085404634475708, 'learning_rate': 2.3603461841070026e-07, 'epoch': 1.99}


100%|█████████▉| 5070/5084 [3:58:48<00:34,  2.46s/it]

{'loss': 0.2451, 'grad_norm': 2.180187940597534, 'learning_rate': 1.3768686073957515e-07, 'epoch': 1.99}


100%|█████████▉| 5080/5084 [3:59:12<00:09,  2.37s/it]

{'loss': 0.2386, 'grad_norm': 2.2725095748901367, 'learning_rate': 3.933910306845004e-08, 'epoch': 2.0}


100%|██████████| 5084/5084 [4:02:34<00:00,  2.86s/it]

{'eval_loss': 0.27893632650375366, 'eval_runtime': 188.5866, 'eval_samples_per_second': 13.479, 'eval_steps_per_second': 1.686, 'epoch': 2.0}
{'train_runtime': 14554.1039, 'train_samples_per_second': 2.795, 'train_steps_per_second': 0.349, 'train_loss': 0.37589631002780305, 'epoch': 2.0}





TrainOutput(global_step=5084, training_loss=0.37589631002780305, metrics={'train_runtime': 14554.1039, 'train_samples_per_second': 2.795, 'train_steps_per_second': 0.349, 'total_flos': 463161834602496.0, 'train_loss': 0.37589631002780305, 'epoch': 2.0})

In [9]:
test_input_encodings, test_target_encodings = tokenize_sentences(test_df, tokenizer)
test_dataset = TranslationDataset(test_input_encodings, test_target_encodings)

# Evaluate the model
eval_results = trainer.evaluate(eval_dataset=test_dataset)
print(f"Evaluation Results: {eval_results}")

100%|██████████| 318/318 [03:31<00:00,  1.50it/s]

Evaluation Results: {'eval_loss': 0.3047913610935211, 'eval_runtime': 212.4013, 'eval_samples_per_second': 11.968, 'eval_steps_per_second': 1.497, 'epoch': 2.0}
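
The evaluation loss is easier to interpret as a perplexity, which is the exponential of the mean cross-entropy. The snippet below is a minimal sketch that converts the test loss reported above. It assumes the loss is a per-token average in natural-log units; if padding tokens are counted in the loss, the result understates the true per-token perplexity.

```python
import math

# Test-set loss reported by trainer.evaluate above
eval_loss = 0.3047913610935211

# Perplexity is the exponential of the mean cross-entropy (natural log assumed)
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.3f}")
```

A perplexity close to 1 means the model assigns high probability to the reference tokens, but it does not measure translation quality directly, so the sample translations later in the notebook remain a useful check.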





## Text Translation

The following code defines the translate function, which takes one or more English sentences and returns their Twi translations. The function tokenizes the input text, uses the fine-tuned MarianMTModel held by the trainer to generate output tokens, and decodes those tokens back into readable Twi text.

In [10]:
def translate(text, trainer, tokenizer):
    '''
    Translate English text to Twi.

    Args:
        text -> a sentence or list of sentences to translate
        trainer -> trainer instance that contains the model
        tokenizer -> tokenizer instance to tokenize the text

    Returns:
        List of translated sentences
    '''
    # Extract the model from the trainer
    model = trainer.model

    # Tokenize the text and move the tensors to the same device as the model
    input_encodings = tokenizer(text, return_tensors='pt', padding=True).to(model.device)

    # Generate translation
    translated_tokens = model.generate(**input_encodings)

    # Decode the output
    translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

    return translated_text

# Test
text_to_translate = ["Kwaku saw John and Abena holding hands."]
translated_text = translate(text_to_translate, trainer, tokenizer)
print(f"Translation: {translated_text}")

Translation: ['Kwaku huu John ne Abena wɔn nsam.']
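
The test sentence also appears in the dataset preview (row 2), so the model output can be compared with the reference translation. The sketch below computes a simple clipped unigram precision over whitespace tokens. It is a rough illustration rather than a standard metric such as BLEU, and `unigram_precision` is a hypothetical helper introduced here.

```python
from collections import Counter


def unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that also occur in the reference (clipped counts)."""
    hyp_tokens = hypothesis.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not hyp_tokens:
        return 0.0
    matched = 0
    for token, count in Counter(hyp_tokens).items():
        # A token is credited at most as often as it appears in the reference
        matched += min(count, ref_counts[token])
    return matched / len(hyp_tokens)


reference = "Kwaku hui se John ne Abena kurakura wɛn nsa."  # dataset row 2
hypothesis = "Kwaku huu John ne Abena wɔn nsam."             # model output above
print(f"Unigram precision: {unigram_precision(hypothesis, reference):.2f}")
```

Exact token matching is strict for Twi: spelling variants such as "hui" and "huu" count as mismatches, so a low score does not necessarily mean a wrong translation.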


In [11]:
trainer.save_model('model/')
tokenizer.save_pretrained('model/')

('model/tokenizer_config.json',
 'model/special_tokens_map.json',
 'model/vocab.json',
 'model/source.spm',
 'model/target.spm',
 'model/added_tokens.json')

## Load and Translate

In [13]:
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and tokenizer saved to 'model/'
model_name = "model/"
tokenizer_name = "model/"

model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(tokenizer_name)

def translate(text, model, tokenizer):
    '''
    Translate English text to Twi.

    Args:
        text -> a sentence or list of sentences to translate
        model -> fine-tuned MarianMTModel used to generate the translation
        tokenizer -> tokenizer instance to tokenize the text

    Returns:
        List of translated sentences
    '''
    input_encodings = tokenizer(text, return_tensors='pt', padding=True)

    # Generate translation
    translated_tokens = model.generate(**input_encodings)

    # Decode the output
    translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

    return translated_text

# Example translation
text_to_translate = ["I am quite hungry"]
translated_text = translate(text_to_translate, model, tokenizer)
print(f"Translation: {translated_text}")




Translation: ['Ɔkɔm de me paa.']
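
Because the tokenizer is called with `padding=True`, every sentence in one call is padded to the longest sentence in that call. For longer lists it may help to translate in small batches. The sketch below shows one way to do that. `chunks` and `translate_in_batches` are hypothetical helpers, the batch size of 8 is an arbitrary choice, and `translate_fn` stands for any callable that maps a list of sentences to a list of translations.

```python
from typing import Callable, Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(sentences: List[str],
                         translate_fn: Callable[[List[str]], List[str]],
                         batch_size: int = 8) -> List[str]:
    """Apply `translate_fn` batch by batch and keep the original sentence order."""
    results: List[str] = []
    for batch in chunks(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results
```

With the objects loaded above, a call could look like `translate_in_batches(sentences, lambda batch: translate(batch, model, tokenizer))`, where `sentences` is a list of English strings.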
