# Llama 3.2

The Meta Llama 3.2 collection of multilingual LLMs comprises pretrained and instruction-tuned generative models; the text-only models come in 1B and 3B sizes. The Llama 3.2 instruction-tuned text-only models are optimised for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks.

Llama 3.2 is an auto-regressive language model that uses an optimised transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

### Transformer-Based Architecture
- Llama uses a decoder-only transformer architecture, which is similar in structure to models like GPT. In a decoder-only architecture, the model generates text by predicting the next token in a sequence, making it suitable for tasks like text generation, question answering and summarisation.

- Llama uses self-attention with multi-headed attention layers, which allows the model to capture relationships between words and understand context over varying distances.
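
The two points above can be illustrated with a minimal single-head sketch of causal self-attention in plain Python. It is a toy under stated assumptions (no learned projections, no multiple heads, hypothetical function names), not the model's actual implementation; it only shows that each token attends to itself and earlier tokens.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of vectors (one per token). Token i may only
    attend to tokens 0..i, which is what lets a decoder-only model
    predict the next token without seeing the future.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(tokens, tokens, tokens)
print(attended[0])  # the first token can only see itself
```

Because of the mask, changing the last token leaves the outputs for earlier positions untouched.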

### Pre-Normalization and RMSNorm
- **Layer Normalization:** Llama uses RMSNorm (Root Mean Square Normalization) instead of LayerNorm, which is the normalization technique typically used in transformers. RMSNorm rescales each token's activation vector by its root mean square and skips the mean subtraction and variance computation that LayerNorm performs.

- **Pre-Normalization:** The model applies normalization layers before the attention and feed-forward layers instead of after. This change has been shown to stabilise training and reduce issues that can arise when training deep transformers.
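
As a rough illustration of the difference, here is a plain-Python sketch of RMSNorm next to LayerNorm. The function names, the `eps` value and the all-ones gain are assumptions made for the example, not values read from the model.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def layer_norm(x, weight, bias, eps=1e-6):
    """Standard LayerNorm for comparison: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [w * (v - mean) / math.sqrt(var + eps) + b
            for v, w, b in zip(x, weight, bias)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

rms_out = rms_norm(x, ones)
ln_out = layer_norm(x, ones, zeros)
print("RMSNorm  :", [round(v, 4) for v in rms_out])
print("LayerNorm:", [round(v, 4) for v in ln_out])
```

RMSNorm leaves the mean of the vector in place (all outputs stay positive here), while LayerNorm recentres it around zero.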

### Rotary Positional Embeddings (RoPE)
- **Positional Encoding:** Llama uses Rotary Positional Embeddings (RoPE), which were also used in models like GPT-NeoX and GPT-J. Unlike the learned absolute position embeddings used in BERT, RoPE rotates the query and key vectors by an angle that depends on each token's position, so the attention score between two tokens depends on their relative offset.

- **Extended Context Length:** RoPE also allows for efficient scaling of the context window, meaning Llama can handle longer sequences more effectively than traditional transformers with fixed positional encodings. This is useful for tasks that require understanding longer documents or paragraphs.
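
The relative-position property can be checked with a small sketch. This version rotates consecutive pairs of dimensions and uses `base=10000.0`; both are assumptions for illustration (library implementations may pair dimensions differently and use a different base).

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

The two scores agree up to floating-point error, which is the sense in which RoPE behaves as a relative encoding.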

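### SwiGLU Activation
- Llama uses the SwiGLU activation in its feed-forward layers rather than GeLU: the output of one linear projection passes through SiLU (Swish) and gates a second linear projection elementwise. SiLU is smooth like GeLU, and gated feed-forward layers of this kind have been reported to improve quality over plain ReLU or GeLU feed-forward layers.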
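
A small sketch comparing ReLU with two smooth activations, GELU and SiLU, plus a scalar toy version of a SiLU-gated unit of the kind used in Llama-style feed-forward layers. The function names and the scalar weights are illustrative and not taken from any library.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU (Swish): x times the logistic sigmoid of x."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, gate_w, up_w):
    """Scalar toy of a gated unit: SiLU(gate projection) times the 'up' projection."""
    return silu(gate_w * x) * (up_w * x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  silu={silu(x):.4f}")
```

Unlike ReLU, both smooth activations pass a small negative signal for slightly negative inputs instead of clipping it to zero.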


In [1]:
# %pip install  --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"

In [None]:
# %pip install --upgrade transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.




In [1]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split



In [2]:
data = pd.read_csv('final_df.csv')

In [3]:
data.head()

Unnamed: 0,year,month,sentiment,processed_full_review
0,2024,3,Neutral,ok use airlin go singapor london heathrow issu...
1,2024,3,Negative,don give money book paid receiv email confirm ...
2,2024,3,Positive,best airlin world best airlin world seat food ...
3,2024,3,Negative,premium economi seat singapor airlin not worth...
4,2024,3,Negative,imposs get promis refund book flight full mont...


In [4]:
data['sentiment'].value_counts()

sentiment
Positive    7913
Negative    2441
Neutral     1164
Name: count, dtype: int64

In [5]:
data = data.sample(frac=1, random_state=85).reset_index(drop=True)

# Split the DataFrame
train_size = 0.8
eval_size = 0.1

# Calculate sizes
train_end = int(train_size * len(data))
eval_end = train_end + int(eval_size * len(data))

# Split the data
X_train = data[:train_end]
X_eval = data[train_end:eval_end]
X_test = data[eval_end:]

# Define the prompt generation functions
def generate_prompt(data_point):
    return f"""
            Classify the text into Positive, Neutral, Negative, and return the answer as the corresponding airline sentiment label.
text: {data_point["processed_full_review"]}
label: {data_point["sentiment"]}""".strip()

def generate_test_prompt(data_point):
    return f"""
            Classify the text into Positive, Neutral, Negative, and return the answer as the corresponding airline sentiment label.
text: {data_point["processed_full_review"]}
label: """.strip()

# Generate prompts for training and evaluation data
# (copy the slices first so adding a column does not trigger SettingWithCopyWarning)
X_train = X_train.copy()
X_eval = X_eval.copy()
X_train['text'] = X_train.apply(generate_prompt, axis=1)
X_eval['text'] = X_eval.apply(generate_prompt, axis=1)

# Generate test prompts and extract true labels
y_true = X_test.loc[:,'sentiment']
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])


In [8]:
X_train.sentiment.value_counts()

sentiment
Positive    6316
Negative    1975
Neutral      923
Name: count, dtype: int64

In [9]:
y_true.value_counts()

sentiment
Positive    803
Negative    234
Neutral     116
Name: count, dtype: int64

In [10]:
# Convert to datasets
train_data = Dataset.from_pandas(X_train[["text"]])
eval_data = Dataset.from_pandas(X_eval[["text"]])

In [11]:
train_data['text'][3]

'Classify the text into Positive, Neutral, Negative, and return the answer as the corresponding airline sentiment label.\ntext: great fligth flew jakarta transfer singapor san francisco flight jakarta hour late but staff provid everyon snack realli good chicken pasta flight hour layov singapor hour flight\nlabel: Positive'

In [12]:
# Initialize tokenizer and model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

In [13]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id

In [14]:
def predict(test, model, tokenizer):
    y_pred = []
    categories = ["Positive", "Neutral", "Negative"]
    
    # Build the generation pipeline once instead of once per row
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens=2, 
                    temperature=0.1)
    
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        result = pipe(prompt)
        # The output contains the prompt plus the completion; keep the text after the final "label:"
        answer = result[0]['generated_text'].split("label:")[-1].strip()
        
        # Determine the predicted category
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            # No known label found in the completion
            y_pred.append("none")
    
    return y_pred

In [15]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 1153/1153 [02:18<00:00,  8.31it/s]


In [16]:
def evaluate(y_true, y_pred):
    labels = ["Positive", "Neutral", "Negative"]
    mapping = {label: idx for idx, label in enumerate(labels)}
    
    def map_func(x):
        return mapping.get(x, -1)  # Unparsed predictions ("none") map to -1 and count as errors
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Per-label accuracy (equivalent to per-class recall)
    unique_labels = set(y_true_mapped)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))), digits=4)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=list(range(len(labels))))
    print('\nConfusion Matrix:')
    print(conf_matrix)

In [17]:
evaluate(y_true, y_pred)

Accuracy: 0.739
Accuracy for label Positive: 0.903
Accuracy for label Neutral: 0.138
Accuracy for label Negative: 0.474

Classification Report:
              precision    recall  f1-score   support

    Positive     0.8065    0.9029    0.8519       803
     Neutral     0.1860    0.1379    0.1584       116
    Negative     0.6607    0.4744    0.5522       234

    accuracy                         0.7389      1153
   macro avg     0.5511    0.5051    0.5209      1153
weighted avg     0.7145    0.7389    0.7213      1153


Confusion Matrix:
[[725  51  27]
 [ 70  16  30]
 [104  19 111]]
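
The figures above can be re-derived from the confusion matrix alone. The sketch below copies the matrix from the output and recomputes accuracy, per-class precision and recall, and the accuracy of a hypothetical baseline that always predicts the largest class; the variable names are illustrative.

```python
labels = ["Positive", "Neutral", "Negative"]
# Rows are true labels, columns are predicted labels (copied from the output above).
conf_matrix = [
    [725, 51, 27],
    [70, 16, 30],
    [104, 19, 111],
]

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

recall = {}
precision = {}
for i, name in enumerate(labels):
    row_total = sum(conf_matrix[i])                  # examples whose true label is `name`
    col_total = sum(row[i] for row in conf_matrix)   # examples predicted as `name`
    recall[name] = conf_matrix[i][i] / row_total
    precision[name] = conf_matrix[i][i] / col_total

# Always predicting the largest class would reach this accuracy.
majority_baseline = max(sum(row) for row in conf_matrix) / total

print(f"accuracy={accuracy:.4f}  majority baseline={majority_baseline:.4f}")
for name in labels:
    print(f"{name:<9} precision={precision[name]:.4f}  recall={recall[name]:.4f}")
```

Two things stand out: the "Accuracy for label" values are the per-class recalls, and the zero-shot accuracy of 0.739 is only modestly above the 0.696 that an always-Positive baseline would reach on this imbalanced test set.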


In [None]:
import bitsandbytes as bnb

# Find all linear layer names
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

In [19]:
modules = find_all_linear_names(model)
modules

['gate_proj', 'k_proj', 'v_proj', 'down_proj', 'q_proj', 'up_proj', 'o_proj']
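
These module names are the linear projections that the LoRA adapters get attached to. As a rough sketch of what one adapter does, the plain-Python example below adds a scaled low-rank update to a frozen weight matrix. The toy sizes, the helper names and the 2048 x 2048 layer used for the parameter count are assumptions for illustration, not values read from the model; `r=64` and `alpha=16` mirror the `LoraConfig` used in this notebook.

```python
import random

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 16  # toy sizes for illustration only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is initially a no-op
x = [1.0] * d_in

print(lora_forward(x, W, A, B, alpha, r) == matvec(W, x))
print(lora_param_count(2048, 2048, 64), "adapter parameters vs", 2048 * 2048, "frozen weights")
```

With rank 64 and alpha 16 the update is scaled by 16 / 64 = 0.25, and the adapter for a square 2048-wide layer trains about 6% as many parameters as the layer itself holds.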

In [None]:
output_dir="llama-3.2-fine-tuned-model"

# Fine tune using QLoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=1,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # accumulate gradients over 8 batches before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    logging_steps=1,                          # log training metrics at every step
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="wandb",                        # report metrics to w&b
    eval_strategy="steps",                    # evaluate on a step schedule
    eval_steps=0.2                            # a float below 1 is a fraction of total training steps
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=512,
    packing=False,
    dataset_kwargs={
    "add_special_tokens": False,
    "append_concat_token": False,
    }
)

Map: 100%|██████████| 9214/9214 [00:00<00:00, 14352.02 examples/s]
Map: 100%|██████████| 1151/1151 [00:00<00:00, 14387.56 examples/s]
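
The training arguments fix both the number of optimizer updates and the shape of the learning-rate curve. The sketch below re-derives them in plain Python under two assumptions: training runs on a single device, and the schedule is linear warmup followed by a single cosine decay to zero. The helper names are hypothetical; the inputs (9214 examples, batch size 1, 8 accumulation steps, peak rate 2e-4, warmup ratio 0.03) come from this notebook.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1):
    """Number of parameter updates on a single device (assumption: one GPU)."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    return max(batches_per_epoch // grad_accum, 1) * epochs

def lr_at(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate reported in the log line for update number `step`."""
    warmup = math.ceil(total_steps * warmup_ratio)
    if step <= warmup:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup
    # Cosine decay from the peak towards 0 over the remaining steps.
    progress = (step - warmup) / (total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = optimizer_steps(num_examples=9214, per_device_batch=1, grad_accum=8)
print("optimizer steps:", total)
for step in (1, 35, 36, 130):
    print(step, lr_at(step, total, peak_lr=2e-4, warmup_ratio=0.03))
```

Under these assumptions the run has 1151 updates with 35 warmup steps, and the computed rates line up with the `learning_rate` values in the training log (for example about 5.71e-06 at step 1 and 2e-4 at step 35).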


In [21]:
# Train model
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mshaun-zhou-2022[0m ([33mshaun-zhou-2022-singapore-management-university[0m). Use [1m`wandb login --relogin`[0m to force relogin


  return fn(*args, **kwargs)
  0%|          | 1/1151 [00:03<1:07:51,  3.54s/it]

{'loss': 6.2077, 'grad_norm': 0.810721218585968, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.0}


  0%|          | 2/1151 [00:06<1:01:10,  3.19s/it]

{'loss': 6.2041, 'grad_norm': 0.7819814085960388, 'learning_rate': 1.1428571428571429e-05, 'epoch': 0.0}


  0%|          | 3/1151 [00:09<59:06,  3.09s/it]  

{'loss': 6.5482, 'grad_norm': 0.8143168091773987, 'learning_rate': 1.7142857142857145e-05, 'epoch': 0.0}


  0%|          | 4/1151 [00:12<58:03,  3.04s/it]

{'loss': 6.4786, 'grad_norm': 0.7878596782684326, 'learning_rate': 2.2857142857142858e-05, 'epoch': 0.0}


  0%|          | 5/1151 [00:15<57:42,  3.02s/it]

{'loss': 6.2973, 'grad_norm': 0.848309338092804, 'learning_rate': 2.857142857142857e-05, 'epoch': 0.0}


  1%|          | 6/1151 [00:18<58:28,  3.06s/it]

{'loss': 6.3675, 'grad_norm': 0.7664622664451599, 'learning_rate': 3.428571428571429e-05, 'epoch': 0.01}


  1%|          | 7/1151 [00:21<56:04,  2.94s/it]

{'loss': 6.2006, 'grad_norm': 0.9196398258209229, 'learning_rate': 4e-05, 'epoch': 0.01}


  1%|          | 8/1151 [00:24<56:32,  2.97s/it]

{'loss': 6.2093, 'grad_norm': 0.8126217126846313, 'learning_rate': 4.5714285714285716e-05, 'epoch': 0.01}


  1%|          | 9/1151 [00:27<57:38,  3.03s/it]

{'loss': 6.0943, 'grad_norm': 0.833158016204834, 'learning_rate': 5.142857142857143e-05, 'epoch': 0.01}


  1%|          | 10/1151 [00:30<58:09,  3.06s/it]

{'loss': 6.0089, 'grad_norm': 0.8808699250221252, 'learning_rate': 5.714285714285714e-05, 'epoch': 0.01}


  1%|          | 11/1151 [00:33<58:37,  3.09s/it]

{'loss': 5.6177, 'grad_norm': 0.9608938097953796, 'learning_rate': 6.285714285714286e-05, 'epoch': 0.01}


  1%|          | 12/1151 [00:36<58:47,  3.10s/it]

{'loss': 5.8723, 'grad_norm': 0.9478917121887207, 'learning_rate': 6.857142857142858e-05, 'epoch': 0.01}


  1%|          | 13/1151 [00:40<1:02:16,  3.28s/it]

{'loss': 5.9237, 'grad_norm': 0.8142584562301636, 'learning_rate': 7.428571428571429e-05, 'epoch': 0.01}


  1%|          | 14/1151 [00:43<58:51,  3.11s/it]  

{'loss': 5.1181, 'grad_norm': 1.354085087776184, 'learning_rate': 8e-05, 'epoch': 0.01}


  1%|▏         | 15/1151 [00:46<59:48,  3.16s/it]

{'loss': 5.6052, 'grad_norm': 0.9683445692062378, 'learning_rate': 8.571428571428571e-05, 'epoch': 0.01}


  1%|▏         | 16/1151 [00:49<57:10,  3.02s/it]

{'loss': 4.5523, 'grad_norm': 1.3957226276397705, 'learning_rate': 9.142857142857143e-05, 'epoch': 0.01}


  1%|▏         | 17/1151 [00:52<1:01:13,  3.24s/it]

{'loss': 5.5913, 'grad_norm': 1.09844970703125, 'learning_rate': 9.714285714285715e-05, 'epoch': 0.01}


  2%|▏         | 18/1151 [00:55<59:54,  3.17s/it]  

{'loss': 4.9205, 'grad_norm': 1.8128786087036133, 'learning_rate': 0.00010285714285714286, 'epoch': 0.02}


  2%|▏         | 19/1151 [00:58<58:42,  3.11s/it]

{'loss': 4.6657, 'grad_norm': 1.278287410736084, 'learning_rate': 0.00010857142857142856, 'epoch': 0.02}


  2%|▏         | 20/1151 [01:02<59:20,  3.15s/it]

{'loss': 4.8897, 'grad_norm': 0.9836164712905884, 'learning_rate': 0.00011428571428571428, 'epoch': 0.02}


  2%|▏         | 21/1151 [01:05<58:37,  3.11s/it]

{'loss': 4.4511, 'grad_norm': 0.9999714493751526, 'learning_rate': 0.00012, 'epoch': 0.02}


  2%|▏         | 22/1151 [01:08<1:01:45,  3.28s/it]

{'loss': 5.0446, 'grad_norm': 0.797953724861145, 'learning_rate': 0.00012571428571428572, 'epoch': 0.02}


  2%|▏         | 23/1151 [01:11<1:00:40,  3.23s/it]

{'loss': 4.6547, 'grad_norm': 0.8605740070343018, 'learning_rate': 0.00013142857142857143, 'epoch': 0.02}


  2%|▏         | 24/1151 [01:15<1:01:55,  3.30s/it]

{'loss': 4.489, 'grad_norm': 0.8306737542152405, 'learning_rate': 0.00013714285714285716, 'epoch': 0.02}


  2%|▏         | 25/1151 [01:18<59:43,  3.18s/it]  

{'loss': 4.0796, 'grad_norm': 0.8703657388687134, 'learning_rate': 0.00014285714285714287, 'epoch': 0.02}


  2%|▏         | 26/1151 [01:21<59:43,  3.19s/it]

{'loss': 4.3559, 'grad_norm': 0.9454200863838196, 'learning_rate': 0.00014857142857142857, 'epoch': 0.02}


  2%|▏         | 27/1151 [01:24<1:00:38,  3.24s/it]

{'loss': 4.5155, 'grad_norm': 0.9083006978034973, 'learning_rate': 0.0001542857142857143, 'epoch': 0.02}


  2%|▏         | 28/1151 [01:28<1:00:04,  3.21s/it]

{'loss': 4.2518, 'grad_norm': 0.8879137635231018, 'learning_rate': 0.00016, 'epoch': 0.02}


  3%|▎         | 29/1151 [01:31<1:00:10,  3.22s/it]

{'loss': 4.1876, 'grad_norm': 0.9516821503639221, 'learning_rate': 0.00016571428571428575, 'epoch': 0.03}


  3%|▎         | 30/1151 [01:34<1:01:11,  3.28s/it]

{'loss': 4.3506, 'grad_norm': 0.984786868095398, 'learning_rate': 0.00017142857142857143, 'epoch': 0.03}


  3%|▎         | 31/1151 [01:37<1:00:59,  3.27s/it]

{'loss': 4.2311, 'grad_norm': 0.9049524664878845, 'learning_rate': 0.00017714285714285713, 'epoch': 0.03}


  3%|▎         | 32/1151 [01:41<59:43,  3.20s/it]  

{'loss': 3.8984, 'grad_norm': 0.9273548722267151, 'learning_rate': 0.00018285714285714286, 'epoch': 0.03}


  3%|▎         | 33/1151 [01:44<1:03:25,  3.40s/it]

{'loss': 4.8188, 'grad_norm': 0.8201109170913696, 'learning_rate': 0.00018857142857142857, 'epoch': 0.03}


  3%|▎         | 34/1151 [01:48<1:04:23,  3.46s/it]

{'loss': 4.192, 'grad_norm': 0.9194192290306091, 'learning_rate': 0.0001942857142857143, 'epoch': 0.03}


  3%|▎         | 35/1151 [01:51<1:04:24,  3.46s/it]

{'loss': 4.5513, 'grad_norm': 0.8575212359428406, 'learning_rate': 0.0002, 'epoch': 0.03}


  3%|▎         | 36/1151 [01:55<1:03:54,  3.44s/it]

{'loss': 4.4535, 'grad_norm': 0.8406976461410522, 'learning_rate': 0.00019999960377573023, 'epoch': 0.03}


  3%|▎         | 37/1151 [01:58<1:02:08,  3.35s/it]

{'loss': 3.9005, 'grad_norm': 0.8443418741226196, 'learning_rate': 0.00019999841510606067, 'epoch': 0.03}


  3%|▎         | 38/1151 [02:01<59:45,  3.22s/it]  

{'loss': 3.592, 'grad_norm': 0.8771318793296814, 'learning_rate': 0.000199996434000411, 'epoch': 0.03}


  3%|▎         | 39/1151 [02:04<59:46,  3.23s/it]

{'loss': 4.297, 'grad_norm': 0.891704261302948, 'learning_rate': 0.0001999936604744804, 'epoch': 0.03}


  3%|▎         | 40/1151 [02:07<59:39,  3.22s/it]

{'loss': 4.0233, 'grad_norm': 0.804443359375, 'learning_rate': 0.00019999009455024772, 'epoch': 0.03}


  4%|▎         | 41/1151 [02:10<56:26,  3.05s/it]

{'loss': 3.2827, 'grad_norm': 0.9857575297355652, 'learning_rate': 0.000199985736255971, 'epoch': 0.04}


  4%|▎         | 42/1151 [02:13<54:58,  2.97s/it]

{'loss': 3.6708, 'grad_norm': 0.9023380875587463, 'learning_rate': 0.0001999805856261875, 'epoch': 0.04}


  4%|▎         | 43/1151 [02:16<55:46,  3.02s/it]

{'loss': 4.1494, 'grad_norm': 0.8018292784690857, 'learning_rate': 0.0001999746427017133, 'epoch': 0.04}


  4%|▍         | 44/1151 [02:19<54:24,  2.95s/it]

{'loss': 3.7451, 'grad_norm': 0.9386932849884033, 'learning_rate': 0.00019996790752964305, 'epoch': 0.04}


  4%|▍         | 45/1151 [02:21<53:04,  2.88s/it]

{'loss': 3.3625, 'grad_norm': 1.035421371459961, 'learning_rate': 0.00019996038016334953, 'epoch': 0.04}


  4%|▍         | 46/1151 [02:24<54:06,  2.94s/it]

{'loss': 3.6802, 'grad_norm': 0.9926106929779053, 'learning_rate': 0.0001999520606624832, 'epoch': 0.04}


  4%|▍         | 47/1151 [02:28<55:26,  3.01s/it]

{'loss': 3.7845, 'grad_norm': 0.8671299815177917, 'learning_rate': 0.0001999429490929718, 'epoch': 0.04}


  4%|▍         | 48/1151 [02:31<55:58,  3.05s/it]

{'loss': 3.6339, 'grad_norm': 0.9820798635482788, 'learning_rate': 0.00019993304552701994, 'epoch': 0.04}


  4%|▍         | 49/1151 [02:34<54:17,  2.96s/it]

{'loss': 3.0798, 'grad_norm': 1.1323381662368774, 'learning_rate': 0.0001999223500431082, 'epoch': 0.04}


  4%|▍         | 50/1151 [02:37<56:37,  3.09s/it]

{'loss': 3.7569, 'grad_norm': 0.962151050567627, 'learning_rate': 0.00019991086272599273, 'epoch': 0.04}


  4%|▍         | 51/1151 [02:40<55:54,  3.05s/it]

{'loss': 3.7107, 'grad_norm': 1.1913173198699951, 'learning_rate': 0.00019989858366670476, 'epoch': 0.04}


  5%|▍         | 52/1151 [02:43<57:05,  3.12s/it]

{'loss': 3.9566, 'grad_norm': 1.0666351318359375, 'learning_rate': 0.00019988551296254938, 'epoch': 0.05}


  5%|▍         | 53/1151 [02:46<54:31,  2.98s/it]

{'loss': 3.089, 'grad_norm': 1.2769169807434082, 'learning_rate': 0.00019987165071710527, 'epoch': 0.05}


  5%|▍         | 54/1151 [02:49<55:45,  3.05s/it]

{'loss': 3.9877, 'grad_norm': 1.057003378868103, 'learning_rate': 0.00019985699704022357, 'epoch': 0.05}


  5%|▍         | 55/1151 [02:52<55:28,  3.04s/it]

{'loss': 3.6775, 'grad_norm': 0.9536727666854858, 'learning_rate': 0.00019984155204802714, 'epoch': 0.05}


  5%|▍         | 56/1151 [02:55<56:05,  3.07s/it]

{'loss': 3.7731, 'grad_norm': 0.9008031487464905, 'learning_rate': 0.00019982531586290958, 'epoch': 0.05}


  5%|▍         | 57/1151 [02:58<55:48,  3.06s/it]

{'loss': 3.9023, 'grad_norm': 0.9540742039680481, 'learning_rate': 0.0001998082886135343, 'epoch': 0.05}


  5%|▌         | 58/1151 [03:02<59:31,  3.27s/it]

{'loss': 3.9375, 'grad_norm': 0.7752512097358704, 'learning_rate': 0.00019979047043483349, 'epoch': 0.05}


  5%|▌         | 59/1151 [03:05<56:16,  3.09s/it]

{'loss': 3.2768, 'grad_norm': 0.8907281160354614, 'learning_rate': 0.00019977186146800707, 'epoch': 0.05}


  5%|▌         | 60/1151 [03:08<55:57,  3.08s/it]

{'loss': 3.5552, 'grad_norm': 0.8262165784835815, 'learning_rate': 0.0001997524618605215, 'epoch': 0.05}


  5%|▌         | 61/1151 [03:11<54:54,  3.02s/it]

{'loss': 3.6354, 'grad_norm': 0.9152100086212158, 'learning_rate': 0.00019973227176610869, 'epoch': 0.05}


  5%|▌         | 62/1151 [03:14<54:58,  3.03s/it]

{'loss': 3.572, 'grad_norm': 0.8837032914161682, 'learning_rate': 0.00019971129134476473, 'epoch': 0.05}


  5%|▌         | 63/1151 [03:17<55:02,  3.04s/it]

{'loss': 3.2496, 'grad_norm': 0.9268021583557129, 'learning_rate': 0.00019968952076274872, 'epoch': 0.05}


  6%|▌         | 64/1151 [03:20<53:48,  2.97s/it]

{'loss': 2.8481, 'grad_norm': 0.9076836109161377, 'learning_rate': 0.00019966696019258127, 'epoch': 0.06}


  6%|▌         | 65/1151 [03:22<52:42,  2.91s/it]

{'loss': 3.2411, 'grad_norm': 0.9334708452224731, 'learning_rate': 0.0001996436098130433, 'epoch': 0.06}


  6%|▌         | 66/1151 [03:26<57:13,  3.16s/it]

{'loss': 4.1811, 'grad_norm': 0.8444430232048035, 'learning_rate': 0.00019961946980917456, 'epoch': 0.06}


  6%|▌         | 67/1151 [03:29<55:39,  3.08s/it]

{'loss': 3.4486, 'grad_norm': 0.9254947304725647, 'learning_rate': 0.00019959454037227214, 'epoch': 0.06}


  6%|▌         | 68/1151 [03:32<56:02,  3.10s/it]

{'loss': 3.7405, 'grad_norm': 0.8842200636863708, 'learning_rate': 0.00019956882169988905, 'epoch': 0.06}


  6%|▌         | 69/1151 [03:35<56:22,  3.13s/it]

{'loss': 3.6546, 'grad_norm': 0.8223341703414917, 'learning_rate': 0.00019954231399583244, 'epoch': 0.06}


  6%|▌         | 70/1151 [03:38<55:51,  3.10s/it]

{'loss': 3.4556, 'grad_norm': 0.8018720746040344, 'learning_rate': 0.0001995150174701623, 'epoch': 0.06}


  6%|▌         | 71/1151 [03:42<57:48,  3.21s/it]

{'loss': 3.9225, 'grad_norm': 0.8827023506164551, 'learning_rate': 0.00019948693233918952, 'epoch': 0.06}


  6%|▋         | 72/1151 [03:45<55:18,  3.08s/it]

{'loss': 2.9896, 'grad_norm': 0.9002492427825928, 'learning_rate': 0.00019945805882547432, 'epoch': 0.06}


  6%|▋         | 73/1151 [03:47<54:21,  3.03s/it]

{'loss': 3.4013, 'grad_norm': 1.0088567733764648, 'learning_rate': 0.00019942839715782445, 'epoch': 0.06}


  6%|▋         | 74/1151 [03:50<53:43,  2.99s/it]

{'loss': 3.3205, 'grad_norm': 0.8973625898361206, 'learning_rate': 0.00019939794757129332, 'epoch': 0.06}


  7%|▋         | 75/1151 [03:53<52:11,  2.91s/it]

{'loss': 3.0064, 'grad_norm': 0.9555339813232422, 'learning_rate': 0.0001993667103071783, 'epoch': 0.07}


  7%|▋         | 76/1151 [03:56<51:36,  2.88s/it]

{'loss': 2.944, 'grad_norm': 0.8538064360618591, 'learning_rate': 0.00019933468561301857, 'epoch': 0.07}


  7%|▋         | 77/1151 [03:59<52:08,  2.91s/it]

{'loss': 3.5129, 'grad_norm': 0.9685425758361816, 'learning_rate': 0.00019930187374259337, 'epoch': 0.07}


  7%|▋         | 78/1151 [04:02<53:33,  2.99s/it]

{'loss': 3.311, 'grad_norm': 0.9429247379302979, 'learning_rate': 0.0001992682749559199, 'epoch': 0.07}


  7%|▋         | 79/1151 [04:05<52:16,  2.93s/it]

{'loss': 3.0129, 'grad_norm': 0.9907921552658081, 'learning_rate': 0.00019923388951925125, 'epoch': 0.07}


  7%|▋         | 80/1151 [04:08<52:19,  2.93s/it]

{'loss': 3.4488, 'grad_norm': 0.8917087912559509, 'learning_rate': 0.0001991987177050743, 'epoch': 0.07}


  7%|▋         | 81/1151 [04:10<50:29,  2.83s/it]

{'loss': 2.5485, 'grad_norm': 0.9273638129234314, 'learning_rate': 0.0001991627597921076, 'epoch': 0.07}


  7%|▋         | 82/1151 [04:13<50:56,  2.86s/it]

{'loss': 3.5106, 'grad_norm': 0.9840286374092102, 'learning_rate': 0.0001991260160652991, 'epoch': 0.07}


  7%|▋         | 83/1151 [04:16<50:22,  2.83s/it]

{'loss': 2.8282, 'grad_norm': 0.9241983294487, 'learning_rate': 0.00019908848681582391, 'epoch': 0.07}


  7%|▋         | 84/1151 [04:20<54:11,  3.05s/it]

{'loss': 4.1667, 'grad_norm': 0.9139185547828674, 'learning_rate': 0.000199050172341082, 'epoch': 0.07}


  7%|▋         | 85/1151 [04:23<56:57,  3.21s/it]

{'loss': 3.8473, 'grad_norm': 0.8012476563453674, 'learning_rate': 0.00019901107294469593, 'epoch': 0.07}


  7%|▋         | 86/1151 [04:26<54:53,  3.09s/it]

{'loss': 3.2656, 'grad_norm': 0.8870707750320435, 'learning_rate': 0.00019897118893650825, 'epoch': 0.07}


  8%|▊         | 87/1151 [04:30<58:07,  3.28s/it]

{'loss': 4.1634, 'grad_norm': 0.8222299814224243, 'learning_rate': 0.0001989305206325792, 'epoch': 0.08}


  8%|▊         | 88/1151 [04:33<58:33,  3.31s/it]

{'loss': 3.6577, 'grad_norm': 0.829655647277832, 'learning_rate': 0.0001988890683551842, 'epoch': 0.08}


  8%|▊         | 89/1151 [04:37<1:01:48,  3.49s/it]

{'loss': 4.2238, 'grad_norm': 0.7815793752670288, 'learning_rate': 0.00019884683243281116, 'epoch': 0.08}


  8%|▊         | 90/1151 [04:40<58:58,  3.33s/it]  

{'loss': 3.0882, 'grad_norm': 0.8467623591423035, 'learning_rate': 0.00019880381320015804, 'epoch': 0.08}


  8%|▊         | 91/1151 [04:43<55:34,  3.15s/it]

{'loss': 3.3061, 'grad_norm': 1.0017750263214111, 'learning_rate': 0.00019876001099813017, 'epoch': 0.08}


  8%|▊         | 92/1151 [04:45<52:13,  2.96s/it]

{'loss': 2.562, 'grad_norm': 0.9782940149307251, 'learning_rate': 0.00019871542617383743, 'epoch': 0.08}


  8%|▊         | 93/1151 [04:48<52:11,  2.96s/it]

{'loss': 3.6323, 'grad_norm': 1.04583740234375, 'learning_rate': 0.0001986700590805916, 'epoch': 0.08}


  8%|▊         | 94/1151 [04:51<51:41,  2.93s/it]

{'loss': 3.1604, 'grad_norm': 0.9099863767623901, 'learning_rate': 0.00019862391007790354, 'epoch': 0.08}


  8%|▊         | 95/1151 [04:55<55:27,  3.15s/it]

{'loss': 3.9904, 'grad_norm': 0.921252429485321, 'learning_rate': 0.00019857697953148037, 'epoch': 0.08}


  8%|▊         | 96/1151 [04:58<55:24,  3.15s/it]

{'loss': 3.5482, 'grad_norm': 0.8337631225585938, 'learning_rate': 0.00019852926781322255, 'epoch': 0.08}


  8%|▊         | 97/1151 [05:01<54:09,  3.08s/it]

{'loss': 3.515, 'grad_norm': 0.9392951726913452, 'learning_rate': 0.00019848077530122083, 'epoch': 0.08}


  9%|▊         | 98/1151 [05:04<53:15,  3.03s/it]

{'loss': 3.5847, 'grad_norm': 0.8964962959289551, 'learning_rate': 0.00019843150237975344, 'epoch': 0.09}


  9%|▊         | 99/1151 [05:07<54:16,  3.10s/it]

{'loss': 3.9526, 'grad_norm': 0.8998064994812012, 'learning_rate': 0.0001983814494392829, 'epoch': 0.09}


  9%|▊         | 100/1151 [05:10<53:15,  3.04s/it]

{'loss': 3.0642, 'grad_norm': 0.8201503753662109, 'learning_rate': 0.00019833061687645306, 'epoch': 0.09}


  9%|▉         | 101/1151 [05:12<50:47,  2.90s/it]

{'loss': 2.8076, 'grad_norm': 1.113187313079834, 'learning_rate': 0.00019827900509408581, 'epoch': 0.09}


  9%|▉         | 102/1151 [05:15<51:02,  2.92s/it]

{'loss': 3.5433, 'grad_norm': 0.9664319157600403, 'learning_rate': 0.0001982266145011779, 'epoch': 0.09}


  9%|▉         | 103/1151 [05:18<51:48,  2.97s/it]

{'loss': 3.2341, 'grad_norm': 0.8199257254600525, 'learning_rate': 0.00019817344551289795, 'epoch': 0.09}


  9%|▉         | 104/1151 [05:22<55:47,  3.20s/it]

{'loss': 3.9206, 'grad_norm': 0.7799148559570312, 'learning_rate': 0.0001981194985505827, 'epoch': 0.09}


  9%|▉         | 105/1151 [05:25<53:10,  3.05s/it]

{'loss': 3.0355, 'grad_norm': 0.9122828841209412, 'learning_rate': 0.0001980647740417341, 'epoch': 0.09}


  9%|▉         | 106/1151 [05:28<54:32,  3.13s/it]

{'loss': 4.1824, 'grad_norm': 0.8463876247406006, 'learning_rate': 0.00019800927242001577, 'epoch': 0.09}


  9%|▉         | 107/1151 [05:31<54:28,  3.13s/it]

{'loss': 3.871, 'grad_norm': 0.8897174000740051, 'learning_rate': 0.00019795299412524945, 'epoch': 0.09}


  9%|▉         | 108/1151 [05:34<53:33,  3.08s/it]

{'loss': 3.372, 'grad_norm': 0.8594722747802734, 'learning_rate': 0.0001978959396034117, 'epoch': 0.09}


  9%|▉         | 109/1151 [05:37<52:04,  3.00s/it]

{'loss': 3.0727, 'grad_norm': 0.8452082276344299, 'learning_rate': 0.0001978381093066302, 'epoch': 0.09}


 10%|▉         | 110/1151 [05:41<57:35,  3.32s/it]

{'loss': 3.8224, 'grad_norm': 0.7494063973426819, 'learning_rate': 0.00019777950369318035, 'epoch': 0.1}


 10%|▉         | 111/1151 [05:44<55:43,  3.21s/it]

{'loss': 3.2965, 'grad_norm': 0.939784049987793, 'learning_rate': 0.0001977201232274814, 'epoch': 0.1}


 10%|▉         | 112/1151 [05:48<56:37,  3.27s/it]

{'loss': 3.9264, 'grad_norm': 0.9740484356880188, 'learning_rate': 0.00019765996838009309, 'epoch': 0.1}


 10%|▉         | 113/1151 [05:51<55:34,  3.21s/it]

{'loss': 3.5237, 'grad_norm': 0.8800094127655029, 'learning_rate': 0.00019759903962771156, 'epoch': 0.1}


 10%|▉         | 114/1151 [05:54<55:09,  3.19s/it]

{'loss': 3.4299, 'grad_norm': 0.8502700328826904, 'learning_rate': 0.0001975373374531658, 'epoch': 0.1}


 10%|▉         | 115/1151 [05:57<53:37,  3.11s/it]

{'loss': 3.486, 'grad_norm': 0.9310592412948608, 'learning_rate': 0.00019747486234541383, 'epoch': 0.1}


 10%|█         | 116/1151 [06:00<52:55,  3.07s/it]

{'loss': 3.347, 'grad_norm': 0.8643791675567627, 'learning_rate': 0.0001974116147995387, 'epoch': 0.1}


 10%|█         | 117/1151 [06:03<53:54,  3.13s/it]

{'loss': 3.4425, 'grad_norm': 0.8139472007751465, 'learning_rate': 0.00019734759531674472, 'epoch': 0.1}


 10%|█         | 118/1151 [06:06<53:34,  3.11s/it]

{'loss': 3.4968, 'grad_norm': 0.8772261142730713, 'learning_rate': 0.0001972828044043533, 'epoch': 0.1}


 10%|█         | 119/1151 [06:09<51:37,  3.00s/it]

{'loss': 2.8641, 'grad_norm': 1.003015398979187, 'learning_rate': 0.00019721724257579907, 'epoch': 0.1}


 10%|█         | 120/1151 [06:12<52:22,  3.05s/it]

{'loss': 3.3947, 'grad_norm': 0.8838045597076416, 'learning_rate': 0.0001971509103506258, 'epoch': 0.1}


 11%|█         | 121/1151 [06:15<53:05,  3.09s/it]

{'loss': 3.7075, 'grad_norm': 0.9103016257286072, 'learning_rate': 0.00019708380825448227, 'epoch': 0.11}


 11%|█         | 122/1151 [06:18<53:30,  3.12s/it]

{'loss': 3.6114, 'grad_norm': 0.8242766261100769, 'learning_rate': 0.000197015936819118, 'epoch': 0.11}


 11%|█         | 123/1151 [06:21<52:05,  3.04s/it]

{'loss': 3.32, 'grad_norm': 0.8291691541671753, 'learning_rate': 0.00019694729658237926, 'epoch': 0.11}


 11%|█         | 124/1151 [06:24<51:05,  2.98s/it]

{'loss': 2.9702, 'grad_norm': 0.9064611792564392, 'learning_rate': 0.00019687788808820452, 'epoch': 0.11}


 11%|█         | 125/1151 [06:27<52:04,  3.05s/it]

{'loss': 3.5346, 'grad_norm': 0.8171778917312622, 'learning_rate': 0.00019680771188662044, 'epoch': 0.11}


 11%|█         | 126/1151 [06:30<50:16,  2.94s/it]

{'loss': 3.0892, 'grad_norm': 0.9846205115318298, 'learning_rate': 0.00019673676853373728, 'epoch': 0.11}


 11%|█         | 127/1151 [06:33<49:25,  2.90s/it]

{'loss': 3.0979, 'grad_norm': 0.9832815527915955, 'learning_rate': 0.00019666505859174463, 'epoch': 0.11}


 11%|█         | 128/1151 [06:36<50:45,  2.98s/it]

{'loss': 3.4488, 'grad_norm': 0.8531048893928528, 'learning_rate': 0.00019659258262890683, 'epoch': 0.11}


 11%|█         | 129/1151 [06:39<50:37,  2.97s/it]

{'loss': 3.1843, 'grad_norm': 0.8779845237731934, 'learning_rate': 0.00019651934121955864, 'epoch': 0.11}


 11%|█▏        | 130/1151 [06:41<48:30,  2.85s/it]

{'loss': 2.7595, 'grad_norm': 0.9418877363204956, 'learning_rate': 0.0001964453349441005, 'epoch': 0.11}


 11%|█▏        | 131/1151 [06:45<52:12,  3.07s/it]

{'loss': 4.2655, 'grad_norm': 0.8772928714752197, 'learning_rate': 0.0001963705643889941, 'epoch': 0.11}


 11%|█▏        | 132/1151 [06:48<51:35,  3.04s/it]

{'loss': 3.3389, 'grad_norm': 0.9772258400917053, 'learning_rate': 0.00019629503014675757, 'epoch': 0.11}


 12%|█▏        | 133/1151 [06:51<50:03,  2.95s/it]

{'loss': 3.162, 'grad_norm': 0.9433245658874512, 'learning_rate': 0.00019621873281596092, 'epoch': 0.12}


 12%|█▏        | 134/1151 [06:54<49:28,  2.92s/it]

{'loss': 3.0563, 'grad_norm': 0.900702714920044, 'learning_rate': 0.00019614167300122126, 'epoch': 0.12}


 12%|█▏        | 135/1151 [06:56<49:22,  2.92s/it]

{'loss': 3.3203, 'grad_norm': 0.8200231194496155, 'learning_rate': 0.00019606385131319792, 'epoch': 0.12}


 12%|█▏        | 136/1151 [06:59<48:35,  2.87s/it]

{'loss': 2.9221, 'grad_norm': 0.8634967803955078, 'learning_rate': 0.0001959852683685878, 'epoch': 0.12}


 12%|█▏        | 137/1151 [07:02<48:01,  2.84s/it]

{'loss': 3.1991, 'grad_norm': 0.8132777214050293, 'learning_rate': 0.00019590592479012023, 'epoch': 0.12}


 12%|█▏        | 138/1151 [07:05<46:55,  2.78s/it]

{'loss': 2.7275, 'grad_norm': 0.9919542074203491, 'learning_rate': 0.00019582582120655225, 'epoch': 0.12}


 12%|█▏        | 139/1151 [07:08<48:15,  2.86s/it]

{'loss': 3.4499, 'grad_norm': 0.8921993970870972, 'learning_rate': 0.00019574495825266358, 'epoch': 0.12}


 12%|█▏        | 140/1151 [07:11<48:30,  2.88s/it]

{'loss': 3.1783, 'grad_norm': 0.8777045607566833, 'learning_rate': 0.00019566333656925147, 'epoch': 0.12}


 12%|█▏        | 141/1151 [07:14<49:05,  2.92s/it]

{'loss': 3.4685, 'grad_norm': 0.8795403838157654, 'learning_rate': 0.00019558095680312576, 'epoch': 0.12}


 12%|█▏        | 142/1151 [07:17<49:35,  2.95s/it]

{'loss': 3.1649, 'grad_norm': 0.8192107081413269, 'learning_rate': 0.00019549781960710375, 'epoch': 0.12}


 12%|█▏        | 143/1151 [07:20<52:01,  3.10s/it]

{'loss': 3.6112, 'grad_norm': 0.8123399615287781, 'learning_rate': 0.00019541392564000488, 'epoch': 0.12}


 13%|█▎        | 144/1151 [07:23<51:52,  3.09s/it]

{'loss': 3.713, 'grad_norm': 0.9276314973831177, 'learning_rate': 0.00019532927556664573, 'epoch': 0.13}


 13%|█▎        | 145/1151 [07:26<50:14,  3.00s/it]

{'loss': 3.2459, 'grad_norm': 0.992774248123169, 'learning_rate': 0.0001952438700578345, 'epoch': 0.13}


 13%|█▎        | 146/1151 [07:29<48:56,  2.92s/it]

{'loss': 2.8792, 'grad_norm': 0.8694949746131897, 'learning_rate': 0.00019515770979036594, 'epoch': 0.13}


 13%|█▎        | 147/1151 [07:31<48:23,  2.89s/it]

{'loss': 2.9196, 'grad_norm': 0.8371928930282593, 'learning_rate': 0.00019507079544701583, 'epoch': 0.13}


 13%|█▎        | 148/1151 [07:35<50:00,  2.99s/it]

{'loss': 3.7259, 'grad_norm': 0.8007487654685974, 'learning_rate': 0.00019498312771653562, 'epoch': 0.13}


 13%|█▎        | 149/1151 [07:38<52:33,  3.15s/it]

{'loss': 3.8999, 'grad_norm': 0.8242142200469971, 'learning_rate': 0.00019489470729364692, 'epoch': 0.13}


 13%|█▎        | 150/1151 [07:41<49:43,  2.98s/it]

{'loss': 2.6182, 'grad_norm': 0.940188467502594, 'learning_rate': 0.00019480553487903613, 'epoch': 0.13}


 13%|█▎        | 151/1151 [07:44<49:38,  2.98s/it]

{'loss': 3.4862, 'grad_norm': 0.8686851859092712, 'learning_rate': 0.00019471561117934868, 'epoch': 0.13}


 13%|█▎        | 152/1151 [07:47<49:19,  2.96s/it]

{'loss': 2.9946, 'grad_norm': 0.9238370060920715, 'learning_rate': 0.0001946249369071837, 'epoch': 0.13}


 13%|█▎        | 153/1151 [07:49<48:22,  2.91s/it]

{'loss': 2.977, 'grad_norm': 0.853705644607544, 'learning_rate': 0.00019453351278108806, 'epoch': 0.13}


 13%|█▎        | 154/1151 [07:52<47:26,  2.86s/it]

{'loss': 3.2061, 'grad_norm': 0.8771235346794128, 'learning_rate': 0.00019444133952555096, 'epoch': 0.13}


 13%|█▎        | 155/1151 [07:56<49:36,  2.99s/it]

{'loss': 3.7227, 'grad_norm': 0.8246408700942993, 'learning_rate': 0.00019434841787099803, 'epoch': 0.13}


 14%|█▎        | 156/1151 [07:58<49:17,  2.97s/it]

{'loss': 3.2644, 'grad_norm': 0.9124945402145386, 'learning_rate': 0.0001942547485537855, 'epoch': 0.14}


 14%|█▎        | 157/1151 [08:01<48:43,  2.94s/it]

{'loss': 3.1347, 'grad_norm': 0.8780034184455872, 'learning_rate': 0.00019416033231619458, 'epoch': 0.14}


 14%|█▎        | 158/1151 [08:05<50:13,  3.04s/it]

{'loss': 3.5523, 'grad_norm': 0.8763325214385986, 'learning_rate': 0.00019406516990642532, 'epoch': 0.14}


 14%|█▍        | 159/1151 [08:08<50:08,  3.03s/it]

{'loss': 3.4984, 'grad_norm': 0.9045635461807251, 'learning_rate': 0.00019396926207859084, 'epoch': 0.14}


 14%|█▍        | 160/1151 [08:10<48:26,  2.93s/it]

{'loss': 2.9119, 'grad_norm': 0.957935631275177, 'learning_rate': 0.00019387260959271134, 'epoch': 0.14}


 14%|█▍        | 161/1151 [08:13<47:54,  2.90s/it]

{'loss': 2.9966, 'grad_norm': 0.8234552145004272, 'learning_rate': 0.00019377521321470805, 'epoch': 0.14}


 14%|█▍        | 162/1151 [08:17<50:46,  3.08s/it]

{'loss': 3.586, 'grad_norm': 0.8355332612991333, 'learning_rate': 0.00019367707371639712, 'epoch': 0.14}


 14%|█▍        | 163/1151 [08:19<49:16,  2.99s/it]

{'loss': 2.7187, 'grad_norm': 0.8502704501152039, 'learning_rate': 0.0001935781918754836, 'epoch': 0.14}


 14%|█▍        | 164/1151 [08:23<51:49,  3.15s/it]

{'loss': 3.8951, 'grad_norm': 0.7772170901298523, 'learning_rate': 0.00019347856847555512, 'epoch': 0.14}


 14%|█▍        | 165/1151 [08:26<51:36,  3.14s/it]

{'loss': 3.2966, 'grad_norm': 0.956703782081604, 'learning_rate': 0.00019337820430607593, 'epoch': 0.14}


 14%|█▍        | 166/1151 [08:29<49:26,  3.01s/it]

{'loss': 2.8413, 'grad_norm': 0.866827130317688, 'learning_rate': 0.0001932771001623804, 'epoch': 0.14}


 15%|█▍        | 167/1151 [08:32<50:46,  3.10s/it]

{'loss': 3.6913, 'grad_norm': 0.8099998235702515, 'learning_rate': 0.00019317525684566685, 'epoch': 0.14}


 15%|█▍        | 168/1151 [08:35<49:44,  3.04s/it]

{'loss': 3.3032, 'grad_norm': 0.8849478363990784, 'learning_rate': 0.00019307267516299116, 'epoch': 0.15}


 15%|█▍        | 169/1151 [08:38<48:44,  2.98s/it]

{'loss': 3.0428, 'grad_norm': 0.8083210587501526, 'learning_rate': 0.00019296935592726035, 'epoch': 0.15}


 15%|█▍        | 170/1151 [08:41<47:57,  2.93s/it]

{'loss': 3.2116, 'grad_norm': 0.8678802847862244, 'learning_rate': 0.00019286529995722623, 'epoch': 0.15}


 15%|█▍        | 171/1151 [08:45<52:37,  3.22s/it]

{'loss': 4.0339, 'grad_norm': 0.8394266366958618, 'learning_rate': 0.0001927605080774788, 'epoch': 0.15}


 15%|█▍        | 172/1151 [08:48<53:38,  3.29s/it]

{'loss': 3.7419, 'grad_norm': 0.8511570692062378, 'learning_rate': 0.00019265498111843975, 'epoch': 0.15}


 15%|█▌        | 173/1151 [08:51<52:30,  3.22s/it]

{'loss': 3.2587, 'grad_norm': 0.8501942753791809, 'learning_rate': 0.00019254871991635598, 'epoch': 0.15}


 15%|█▌        | 174/1151 [08:54<52:32,  3.23s/it]

{'loss': 3.7273, 'grad_norm': 0.7814270853996277, 'learning_rate': 0.00019244172531329275, 'epoch': 0.15}


 15%|█▌        | 175/1151 [08:57<51:56,  3.19s/it]

{'loss': 3.5535, 'grad_norm': 0.7982233166694641, 'learning_rate': 0.00019233399815712736, 'epoch': 0.15}


 15%|█▌        | 176/1151 [09:00<48:52,  3.01s/it]

{'loss': 2.5151, 'grad_norm': 0.9314026832580566, 'learning_rate': 0.00019222553930154198, 'epoch': 0.15}


 15%|█▌        | 177/1151 [09:03<49:53,  3.07s/it]

{'loss': 3.6671, 'grad_norm': 0.8813192248344421, 'learning_rate': 0.00019211634960601725, 'epoch': 0.15}


 15%|█▌        | 178/1151 [09:06<48:13,  2.97s/it]

{'loss': 2.6265, 'grad_norm': 0.8289639353752136, 'learning_rate': 0.00019200642993582535, 'epoch': 0.15}


 16%|█▌        | 179/1151 [09:08<45:38,  2.82s/it]

{'loss': 2.6921, 'grad_norm': 0.8700432777404785, 'learning_rate': 0.00019189578116202307, 'epoch': 0.16}


 16%|█▌        | 180/1151 [09:11<47:04,  2.91s/it]

{'loss': 3.2577, 'grad_norm': 0.8479534387588501, 'learning_rate': 0.000191784404161445, 'epoch': 0.16}


 16%|█▌        | 181/1151 [09:14<46:56,  2.90s/it]

{'loss': 3.2275, 'grad_norm': 0.9585872292518616, 'learning_rate': 0.00019167229981669655, 'epoch': 0.16}


 16%|█▌        | 182/1151 [09:18<48:39,  3.01s/it]

{'loss': 3.6435, 'grad_norm': 0.8528705835342407, 'learning_rate': 0.00019155946901614702, 'epoch': 0.16}


 16%|█▌        | 183/1151 [09:21<48:56,  3.03s/it]

{'loss': 3.468, 'grad_norm': 0.8812336325645447, 'learning_rate': 0.0001914459126539224, 'epoch': 0.16}


 16%|█▌        | 184/1151 [09:24<47:58,  2.98s/it]

{'loss': 3.0457, 'grad_norm': 0.9055640697479248, 'learning_rate': 0.0001913316316298984, 'epoch': 0.16}


 16%|█▌        | 185/1151 [09:27<48:29,  3.01s/it]

{'loss': 3.5967, 'grad_norm': 0.9366147518157959, 'learning_rate': 0.00019121662684969335, 'epoch': 0.16}


 16%|█▌        | 186/1151 [09:30<49:57,  3.11s/it]

{'loss': 3.6365, 'grad_norm': 0.955224335193634, 'learning_rate': 0.00019110089922466094, 'epoch': 0.16}


 16%|█▌        | 187/1151 [09:33<49:37,  3.09s/it]

{'loss': 3.3451, 'grad_norm': 0.8732817769050598, 'learning_rate': 0.00019098444967188306, 'epoch': 0.16}


 16%|█▋        | 188/1151 [09:37<51:22,  3.20s/it]

{'loss': 3.5306, 'grad_norm': 0.8752968907356262, 'learning_rate': 0.0001908672791141625, 'epoch': 0.16}


 16%|█▋        | 189/1151 [09:40<51:34,  3.22s/it]

{'loss': 3.6304, 'grad_norm': 0.861113965511322, 'learning_rate': 0.0001907493884800156, 'epoch': 0.16}


 17%|█▋        | 190/1151 [09:42<48:40,  3.04s/it]

{'loss': 2.4154, 'grad_norm': 0.938539981842041, 'learning_rate': 0.000190630778703665, 'epoch': 0.16}


 17%|█▋        | 191/1151 [09:46<50:05,  3.13s/it]

{'loss': 3.604, 'grad_norm': 0.8441646695137024, 'learning_rate': 0.00019051145072503215, 'epoch': 0.17}


 17%|█▋        | 192/1151 [09:49<49:31,  3.10s/it]

{'loss': 3.4016, 'grad_norm': 0.8208858966827393, 'learning_rate': 0.0001903914054897298, 'epoch': 0.17}


 17%|█▋        | 193/1151 [09:53<52:40,  3.30s/it]

{'loss': 3.933, 'grad_norm': 0.7917013168334961, 'learning_rate': 0.00019027064394905473, 'epoch': 0.17}


 17%|█▋        | 194/1151 [09:55<50:29,  3.17s/it]

{'loss': 3.3346, 'grad_norm': 1.0009939670562744, 'learning_rate': 0.00019014916705998002, 'epoch': 0.17}


 17%|█▋        | 195/1151 [09:58<49:15,  3.09s/it]

{'loss': 3.1553, 'grad_norm': 0.822516918182373, 'learning_rate': 0.00019002697578514747, 'epoch': 0.17}


 17%|█▋        | 196/1151 [10:01<48:32,  3.05s/it]

{'loss': 3.2038, 'grad_norm': 0.8460519313812256, 'learning_rate': 0.00018990407109286002, 'epoch': 0.17}


 17%|█▋        | 197/1151 [10:04<48:15,  3.04s/it]

{'loss': 3.3683, 'grad_norm': 0.7757458090782166, 'learning_rate': 0.00018978045395707418, 'epoch': 0.17}


 17%|█▋        | 198/1151 [10:08<50:37,  3.19s/it]

{'loss': 3.6902, 'grad_norm': 0.805929958820343, 'learning_rate': 0.00018965612535739207, 'epoch': 0.17}


 17%|█▋        | 199/1151 [10:11<48:40,  3.07s/it]

{'loss': 3.2844, 'grad_norm': 0.9061823487281799, 'learning_rate': 0.00018953108627905394, 'epoch': 0.17}


 17%|█▋        | 200/1151 [10:14<48:58,  3.09s/it]

{'loss': 3.5527, 'grad_norm': 0.8415430188179016, 'learning_rate': 0.00018940533771293007, 'epoch': 0.17}


 17%|█▋        | 201/1151 [10:17<49:02,  3.10s/it]

{'loss': 3.4363, 'grad_norm': 0.8787170648574829, 'learning_rate': 0.00018927888065551317, 'epoch': 0.17}


 18%|█▊        | 202/1151 [10:20<47:17,  2.99s/it]

{'loss': 3.213, 'grad_norm': 0.8238075971603394, 'learning_rate': 0.00018915171610891035, 'epoch': 0.18}


 18%|█▊        | 203/1151 [10:23<47:47,  3.02s/it]

{'loss': 3.6368, 'grad_norm': 0.837006688117981, 'learning_rate': 0.00018902384508083517, 'epoch': 0.18}


 18%|█▊        | 204/1151 [10:25<46:00,  2.91s/it]

{'loss': 3.0635, 'grad_norm': 1.0068773031234741, 'learning_rate': 0.00018889526858459975, 'epoch': 0.18}


 18%|█▊        | 205/1151 [10:28<45:57,  2.91s/it]

{'loss': 3.3436, 'grad_norm': 0.9123944044113159, 'learning_rate': 0.00018876598763910665, 'epoch': 0.18}


 18%|█▊        | 206/1151 [10:31<45:47,  2.91s/it]

{'loss': 3.1653, 'grad_norm': 0.9049444794654846, 'learning_rate': 0.00018863600326884082, 'epoch': 0.18}


 18%|█▊        | 207/1151 [10:35<49:03,  3.12s/it]

{'loss': 3.7438, 'grad_norm': 0.7482491731643677, 'learning_rate': 0.00018850531650386153, 'epoch': 0.18}


 18%|█▊        | 208/1151 [10:38<48:17,  3.07s/it]

{'loss': 2.995, 'grad_norm': 0.849029004573822, 'learning_rate': 0.00018837392837979416, 'epoch': 0.18}


 18%|█▊        | 209/1151 [10:41<50:45,  3.23s/it]

{'loss': 4.0194, 'grad_norm': 0.7909015417098999, 'learning_rate': 0.00018824183993782192, 'epoch': 0.18}


 18%|█▊        | 210/1151 [10:44<50:11,  3.20s/it]

{'loss': 3.7142, 'grad_norm': 0.809528112411499, 'learning_rate': 0.00018810905222467779, 'epoch': 0.18}


 18%|█▊        | 211/1151 [10:47<48:51,  3.12s/it]

{'loss': 3.1035, 'grad_norm': 0.9556522965431213, 'learning_rate': 0.00018797556629263602, 'epoch': 0.18}


 18%|█▊        | 212/1151 [10:50<47:52,  3.06s/it]

{'loss': 3.237, 'grad_norm': 0.8353976607322693, 'learning_rate': 0.00018784138319950398, 'epoch': 0.18}


 19%|█▊        | 213/1151 [10:53<47:42,  3.05s/it]

{'loss': 3.2939, 'grad_norm': 0.8530200123786926, 'learning_rate': 0.00018770650400861357, 'epoch': 0.18}


 19%|█▊        | 214/1151 [10:56<46:01,  2.95s/it]

{'loss': 3.0352, 'grad_norm': 0.8592647910118103, 'learning_rate': 0.00018757092978881302, 'epoch': 0.19}


 19%|█▊        | 215/1151 [10:59<44:20,  2.84s/it]

{'loss': 2.872, 'grad_norm': 0.9833694696426392, 'learning_rate': 0.00018743466161445823, 'epoch': 0.19}


 19%|█▉        | 216/1151 [11:02<44:59,  2.89s/it]

{'loss': 3.6021, 'grad_norm': 0.842170000076294, 'learning_rate': 0.00018729770056540436, 'epoch': 0.19}


 19%|█▉        | 217/1151 [11:04<44:47,  2.88s/it]

{'loss': 3.3027, 'grad_norm': 0.8420687317848206, 'learning_rate': 0.00018716004772699726, 'epoch': 0.19}


 19%|█▉        | 218/1151 [11:07<44:45,  2.88s/it]

{'loss': 3.0145, 'grad_norm': 0.8492799401283264, 'learning_rate': 0.00018702170419006482, 'epoch': 0.19}


 19%|█▉        | 219/1151 [11:10<44:22,  2.86s/it]

{'loss': 2.8346, 'grad_norm': 0.900541365146637, 'learning_rate': 0.0001868826710509084, 'epoch': 0.19}


 19%|█▉        | 220/1151 [11:13<44:16,  2.85s/it]

{'loss': 3.1575, 'grad_norm': 0.8332586884498596, 'learning_rate': 0.00018674294941129404, 'epoch': 0.19}


 19%|█▉        | 221/1151 [11:16<45:19,  2.92s/it]

{'loss': 3.2467, 'grad_norm': 0.9181031584739685, 'learning_rate': 0.00018660254037844388, 'epoch': 0.19}


 19%|█▉        | 222/1151 [11:19<45:57,  2.97s/it]

{'loss': 3.3922, 'grad_norm': 0.8871733546257019, 'learning_rate': 0.00018646144506502722, 'epoch': 0.19}


 19%|█▉        | 223/1151 [11:22<44:06,  2.85s/it]

{'loss': 2.4412, 'grad_norm': 0.8781335949897766, 'learning_rate': 0.0001863196645891518, 'epoch': 0.19}


 19%|█▉        | 224/1151 [11:25<44:13,  2.86s/it]

{'loss': 2.7691, 'grad_norm': 0.8358235955238342, 'learning_rate': 0.00018617720007435497, 'epoch': 0.19}


 20%|█▉        | 225/1151 [11:28<45:52,  2.97s/it]

{'loss': 3.3612, 'grad_norm': 0.8503965139389038, 'learning_rate': 0.0001860340526495947, 'epoch': 0.2}


 20%|█▉        | 226/1151 [11:31<48:32,  3.15s/it]

{'loss': 3.5687, 'grad_norm': 0.7487027645111084, 'learning_rate': 0.00018589022344924058, 'epoch': 0.2}


 20%|█▉        | 227/1151 [11:34<45:38,  2.96s/it]

{'loss': 2.5316, 'grad_norm': 0.8885727524757385, 'learning_rate': 0.0001857457136130651, 'epoch': 0.2}


 20%|█▉        | 228/1151 [11:37<44:57,  2.92s/it]

{'loss': 2.842, 'grad_norm': 0.8105471730232239, 'learning_rate': 0.00018560052428623434, 'epoch': 0.2}


 20%|█▉        | 229/1151 [11:39<43:02,  2.80s/it]

{'loss': 2.1436, 'grad_norm': 0.913627564907074, 'learning_rate': 0.00018545465661929896, 'epoch': 0.2}


 20%|█▉        | 230/1151 [11:42<42:34,  2.77s/it]

{'loss': 2.7632, 'grad_norm': 0.9372695088386536, 'learning_rate': 0.00018530811176818514, 'epoch': 0.2}


 20%|██        | 231/1151 [11:45<42:53,  2.80s/it]

{'loss': 3.2394, 'grad_norm': 1.0023356676101685, 'learning_rate': 0.00018516089089418549, 'epoch': 0.2}


 20%|██        | 231/1151 [14:18<42:53,  2.80s/it]

{'eval_loss': 3.1962335109710693, 'eval_runtime': 153.5428, 'eval_samples_per_second': 7.496, 'eval_steps_per_second': 0.938, 'epoch': 0.2}


 20%|██        | 232/1151 [14:22<12:30:52, 49.02s/it]

{'loss': 3.6545, 'grad_norm': 0.8340851068496704, 'learning_rate': 0.00018501299516394964, 'epoch': 0.2}


 20%|██        | 233/1151 [14:25<8:58:31, 35.20s/it] 

{'loss': 3.1241, 'grad_norm': 0.8040930032730103, 'learning_rate': 0.00018486442574947511, 'epoch': 0.2}


 20%|██        | 234/1151 [14:28<6:30:41, 25.56s/it]

{'loss': 3.3714, 'grad_norm': 0.8059254884719849, 'learning_rate': 0.0001847151838280981, 'epoch': 0.2}


 20%|██        | 235/1151 [14:31<4:47:14, 18.81s/it]

{'loss': 3.2946, 'grad_norm': 0.8557555079460144, 'learning_rate': 0.000184565270582484, 'epoch': 0.2}


 21%|██        | 236/1151 [14:34<3:33:50, 14.02s/it]

{'loss': 2.8296, 'grad_norm': 0.8036426901817322, 'learning_rate': 0.00018441468720061815, 'epoch': 0.2}


 21%|██        | 237/1151 [14:37<2:46:03, 10.90s/it]

{'loss': 3.7891, 'grad_norm': 0.7816339731216431, 'learning_rate': 0.0001842634348757964, 'epoch': 0.21}


 21%|██        | 238/1151 [14:40<2:10:29,  8.58s/it]

{'loss': 3.245, 'grad_norm': 0.7828128933906555, 'learning_rate': 0.0001841115148066155, 'epoch': 0.21}


 21%|██        | 239/1151 [14:44<1:45:14,  6.92s/it]

{'loss': 3.2261, 'grad_norm': 0.9080841541290283, 'learning_rate': 0.00018395892819696389, 'epoch': 0.21}


 21%|██        | 240/1151 [14:47<1:28:09,  5.81s/it]

{'loss': 3.1559, 'grad_norm': 0.7643868923187256, 'learning_rate': 0.00018380567625601193, 'epoch': 0.21}


 21%|██        | 241/1151 [14:50<1:14:50,  4.93s/it]

{'loss': 2.871, 'grad_norm': 0.8517594933509827, 'learning_rate': 0.00018365176019820235, 'epoch': 0.21}


 21%|██        | 242/1151 [14:53<1:06:59,  4.42s/it]

{'loss': 3.5156, 'grad_norm': 0.8594346046447754, 'learning_rate': 0.00018349718124324076, 'epoch': 0.21}


 21%|██        | 243/1151 [14:56<1:00:03,  3.97s/it]

{'loss': 2.9869, 'grad_norm': 0.8242287635803223, 'learning_rate': 0.00018334194061608576, 'epoch': 0.21}


 21%|██        | 244/1151 [14:59<57:12,  3.78s/it]  

{'loss': 3.7541, 'grad_norm': 0.870486319065094, 'learning_rate': 0.00018318603954693948, 'epoch': 0.21}


 21%|██▏       | 245/1151 [15:03<55:39,  3.69s/it]

{'loss': 3.8289, 'grad_norm': 0.8010494112968445, 'learning_rate': 0.00018302947927123766, 'epoch': 0.21}


 21%|██▏       | 246/1151 [15:05<51:08,  3.39s/it]

{'loss': 2.8778, 'grad_norm': 0.9300327897071838, 'learning_rate': 0.00018287226102963987, 'epoch': 0.21}


 21%|██▏       | 247/1151 [15:08<49:11,  3.26s/it]

{'loss': 3.2534, 'grad_norm': 0.8109272122383118, 'learning_rate': 0.00018271438606801986, 'epoch': 0.21}


 22%|██▏       | 248/1151 [15:11<47:10,  3.13s/it]

{'loss': 3.0127, 'grad_norm': 0.7478317618370056, 'learning_rate': 0.00018255585563745538, 'epoch': 0.22}


 22%|██▏       | 249/1151 [15:14<44:42,  2.97s/it]

{'loss': 2.7701, 'grad_norm': 0.8122656345367432, 'learning_rate': 0.00018239667099421856, 'epoch': 0.22}


 22%|██▏       | 250/1151 [15:16<43:30,  2.90s/it]

{'loss': 3.0336, 'grad_norm': 0.935612142086029, 'learning_rate': 0.00018223683339976576, 'epoch': 0.22}


 22%|██▏       | 251/1151 [15:19<43:28,  2.90s/it]

{'loss': 3.303, 'grad_norm': 0.8569813966751099, 'learning_rate': 0.00018207634412072764, 'epoch': 0.22}


 22%|██▏       | 252/1151 [15:22<43:26,  2.90s/it]

{'loss': 3.1677, 'grad_norm': 0.833319902420044, 'learning_rate': 0.0001819152044288992, 'epoch': 0.22}


 22%|██▏       | 253/1151 [15:26<45:20,  3.03s/it]

{'loss': 3.2491, 'grad_norm': 0.8867670297622681, 'learning_rate': 0.0001817534156012295, 'epoch': 0.22}


 22%|██▏       | 254/1151 [15:28<44:21,  2.97s/it]

{'loss': 3.087, 'grad_norm': 0.8801069259643555, 'learning_rate': 0.00018159097891981186, 'epoch': 0.22}


 22%|██▏       | 255/1151 [15:31<44:43,  3.00s/it]

{'loss': 3.4277, 'grad_norm': 0.8196504712104797, 'learning_rate': 0.00018142789567187327, 'epoch': 0.22}


 22%|██▏       | 256/1151 [15:35<45:52,  3.07s/it]

{'loss': 3.5486, 'grad_norm': 0.8792059421539307, 'learning_rate': 0.0001812641671497646, 'epoch': 0.22}


 22%|██▏       | 257/1151 [15:38<44:52,  3.01s/it]

{'loss': 3.1799, 'grad_norm': 0.9142435789108276, 'learning_rate': 0.00018109979465095013, 'epoch': 0.22}


 22%|██▏       | 258/1151 [15:41<44:50,  3.01s/it]

{'loss': 3.7021, 'grad_norm': 0.8694690465927124, 'learning_rate': 0.00018093477947799734, 'epoch': 0.22}


 23%|██▎       | 259/1151 [15:44<45:43,  3.08s/it]

{'loss': 3.6744, 'grad_norm': 0.7946017980575562, 'learning_rate': 0.0001807691229385665, 'epoch': 0.22}


 23%|██▎       | 260/1151 [15:47<45:59,  3.10s/it]

{'loss': 3.3361, 'grad_norm': 0.7615281343460083, 'learning_rate': 0.00018060282634540053, 'epoch': 0.23}


 23%|██▎       | 261/1151 [15:50<46:28,  3.13s/it]

{'loss': 3.5178, 'grad_norm': 0.8659364581108093, 'learning_rate': 0.00018043589101631428, 'epoch': 0.23}


 23%|██▎       | 262/1151 [15:53<45:42,  3.08s/it]

{'loss': 3.1426, 'grad_norm': 0.8222209811210632, 'learning_rate': 0.00018026831827418433, 'epoch': 0.23}


 23%|██▎       | 263/1151 [15:56<45:55,  3.10s/it]

{'loss': 3.1228, 'grad_norm': 0.8038232922554016, 'learning_rate': 0.00018010010944693848, 'epoch': 0.23}


 23%|██▎       | 264/1151 [15:59<44:05,  2.98s/it]

{'loss': 2.8836, 'grad_norm': 0.8418593406677246, 'learning_rate': 0.00017993126586754508, 'epoch': 0.23}


 23%|██▎       | 265/1151 [16:02<44:42,  3.03s/it]

{'loss': 3.1804, 'grad_norm': 0.8217462301254272, 'learning_rate': 0.00017976178887400262, 'epoch': 0.23}


 23%|██▎       | 266/1151 [16:06<48:00,  3.26s/it]

{'loss': 3.9337, 'grad_norm': 0.7875433564186096, 'learning_rate': 0.00017959167980932908, 'epoch': 0.23}


 23%|██▎       | 267/1151 [16:09<46:32,  3.16s/it]

{'loss': 2.8596, 'grad_norm': 0.775007963180542, 'learning_rate': 0.0001794209400215512, 'epoch': 0.23}


 23%|██▎       | 268/1151 [16:12<45:59,  3.13s/it]

{'loss': 3.3942, 'grad_norm': 0.75957190990448, 'learning_rate': 0.00017924957086369402, 'epoch': 0.23}


 23%|██▎       | 269/1151 [16:15<47:35,  3.24s/it]

{'loss': 3.4609, 'grad_norm': 0.7396590113639832, 'learning_rate': 0.00017907757369376985, 'epoch': 0.23}


 23%|██▎       | 270/1151 [16:18<46:43,  3.18s/it]

{'loss': 3.479, 'grad_norm': 0.7503277659416199, 'learning_rate': 0.00017890494987476784, 'epoch': 0.23}


 24%|██▎       | 271/1151 [16:22<47:00,  3.21s/it]

{'loss': 3.402, 'grad_norm': 0.7368023991584778, 'learning_rate': 0.00017873170077464283, 'epoch': 0.24}


 24%|██▎       | 272/1151 [16:25<47:02,  3.21s/it]

{'loss': 3.2419, 'grad_norm': 0.7729032039642334, 'learning_rate': 0.00017855782776630483, 'epoch': 0.24}


 24%|██▎       | 273/1151 [16:28<45:13,  3.09s/it]

{'loss': 3.0329, 'grad_norm': 0.7779523134231567, 'learning_rate': 0.00017838333222760792, 'epoch': 0.24}


 24%|██▍       | 274/1151 [16:30<43:19,  2.96s/it]

{'loss': 2.7603, 'grad_norm': 0.8056163787841797, 'learning_rate': 0.0001782082155413395, 'epoch': 0.24}


 24%|██▍       | 275/1151 [16:33<43:26,  2.98s/it]

{'loss': 3.1236, 'grad_norm': 0.836113691329956, 'learning_rate': 0.0001780324790952092, 'epoch': 0.24}


 24%|██▍       | 276/1151 [16:36<42:34,  2.92s/it]

{'loss': 3.2229, 'grad_norm': 0.9502529501914978, 'learning_rate': 0.00017785612428183786, 'epoch': 0.24}


 24%|██▍       | 277/1151 [16:39<42:20,  2.91s/it]

{'loss': 2.8471, 'grad_norm': 0.8497734069824219, 'learning_rate': 0.00017767915249874665, 'epoch': 0.24}


 24%|██▍       | 278/1151 [16:42<43:25,  2.98s/it]

{'loss': 3.6422, 'grad_norm': 0.863925039768219, 'learning_rate': 0.0001775015651483459, 'epoch': 0.24}


 24%|██▍       | 279/1151 [16:45<41:02,  2.82s/it]

{'loss': 2.3919, 'grad_norm': 0.9253017902374268, 'learning_rate': 0.00017732336363792395, 'epoch': 0.24}


 24%|██▍       | 280/1151 [16:47<40:13,  2.77s/it]

{'loss': 2.8834, 'grad_norm': 0.8765822649002075, 'learning_rate': 0.00017714454937963606, 'epoch': 0.24}


 24%|██▍       | 281/1151 [16:50<40:01,  2.76s/it]

{'loss': 2.513, 'grad_norm': 0.8505395650863647, 'learning_rate': 0.00017696512379049325, 'epoch': 0.24}


 25%|██▍       | 282/1151 [16:53<39:52,  2.75s/it]

{'loss': 2.7926, 'grad_norm': 0.8106358647346497, 'learning_rate': 0.0001767850882923509, 'epoch': 0.24}


 25%|██▍       | 283/1151 [16:56<41:18,  2.86s/it]

{'loss': 2.9689, 'grad_norm': 0.8427779078483582, 'learning_rate': 0.0001766044443118978, 'epoch': 0.25}


 25%|██▍       | 284/1151 [16:59<42:37,  2.95s/it]

{'loss': 3.4481, 'grad_norm': 0.8782094120979309, 'learning_rate': 0.00017642319328064446, 'epoch': 0.25}


 25%|██▍       | 285/1151 [17:02<43:19,  3.00s/it]

{'loss': 3.7941, 'grad_norm': 0.8198152780532837, 'learning_rate': 0.00017624133663491205, 'epoch': 0.25}


 25%|██▍       | 286/1151 [17:05<42:22,  2.94s/it]

{'loss': 2.9081, 'grad_norm': 0.8482154607772827, 'learning_rate': 0.0001760588758158209, 'epoch': 0.25}


 25%|██▍       | 287/1151 [17:08<44:42,  3.11s/it]

{'loss': 3.8775, 'grad_norm': 0.8200011849403381, 'learning_rate': 0.0001758758122692791, 'epoch': 0.25}


 25%|██▌       | 288/1151 [17:11<43:13,  3.00s/it]

{'loss': 2.7854, 'grad_norm': 0.7845606207847595, 'learning_rate': 0.00017569214744597104, 'epoch': 0.25}


 25%|██▌       | 289/1151 [17:14<42:58,  2.99s/it]

{'loss': 3.2232, 'grad_norm': 0.7676264047622681, 'learning_rate': 0.00017550788280134598, 'epoch': 0.25}


 25%|██▌       | 290/1151 [17:17<42:40,  2.97s/it]

{'loss': 3.0421, 'grad_norm': 0.7811620831489563, 'learning_rate': 0.00017532301979560636, 'epoch': 0.25}


 25%|██▌       | 291/1151 [17:20<43:37,  3.04s/it]

{'loss': 3.5298, 'grad_norm': 0.8523257970809937, 'learning_rate': 0.00017513755989369636, 'epoch': 0.25}


 25%|██▌       | 292/1151 [17:24<44:21,  3.10s/it]

{'loss': 3.5557, 'grad_norm': 0.7984145879745483, 'learning_rate': 0.0001749515045652903, 'epoch': 0.25}


 25%|██▌       | 293/1151 [17:27<45:32,  3.18s/it]

{'loss': 3.5589, 'grad_norm': 0.7347412705421448, 'learning_rate': 0.00017476485528478093, 'epoch': 0.25}


 26%|██▌       | 294/1151 [17:30<44:28,  3.11s/it]

{'loss': 3.147, 'grad_norm': 0.7875076532363892, 'learning_rate': 0.00017457761353126768, 'epoch': 0.26}


 26%|██▌       | 295/1151 [17:33<43:42,  3.06s/it]

{'loss': 3.1489, 'grad_norm': 0.800813615322113, 'learning_rate': 0.00017438978078854512, 'epoch': 0.26}


 26%|██▌       | 296/1151 [17:36<43:51,  3.08s/it]

{'loss': 3.2766, 'grad_norm': 0.7687361240386963, 'learning_rate': 0.0001742013585450911, 'epoch': 0.26}


 26%|██▌       | 297/1151 [17:39<45:48,  3.22s/it]

{'loss': 3.6224, 'grad_norm': 0.7208420634269714, 'learning_rate': 0.0001740123482940549, 'epoch': 0.26}


 26%|██▌       | 298/1151 [17:42<44:14,  3.11s/it]

{'loss': 3.3714, 'grad_norm': 0.8375754952430725, 'learning_rate': 0.0001738227515332455, 'epoch': 0.26}


 26%|██▌       | 299/1151 [17:45<44:23,  3.13s/it]

{'loss': 3.2685, 'grad_norm': 0.7150170207023621, 'learning_rate': 0.00017363256976511972, 'epoch': 0.26}


 26%|██▌       | 300/1151 [17:48<42:45,  3.02s/it]

{'loss': 3.1724, 'grad_norm': 0.8425174951553345, 'learning_rate': 0.00017344180449677015, 'epoch': 0.26}


 26%|██▌       | 301/1151 [17:51<43:22,  3.06s/it]

{'loss': 3.2075, 'grad_norm': 0.7173529863357544, 'learning_rate': 0.0001732504572399134, 'epoch': 0.26}


 26%|██▌       | 302/1151 [17:55<47:17,  3.34s/it]

{'loss': 4.0839, 'grad_norm': 0.7686241865158081, 'learning_rate': 0.00017305852951087798, 'epoch': 0.26}


 26%|██▋       | 303/1151 [17:58<46:06,  3.26s/it]

{'loss': 3.3369, 'grad_norm': 0.8357962369918823, 'learning_rate': 0.00017286602283059238, 'epoch': 0.26}


 26%|██▋       | 304/1151 [18:02<45:54,  3.25s/it]

{'loss': 3.2166, 'grad_norm': 0.7771978974342346, 'learning_rate': 0.000172672938724573, 'epoch': 0.26}


 26%|██▋       | 305/1151 [18:05<44:29,  3.16s/it]

{'loss': 3.0374, 'grad_norm': 0.8369947671890259, 'learning_rate': 0.000172479278722912, 'epoch': 0.26}


 27%|██▋       | 306/1151 [18:08<44:25,  3.15s/it]

{'loss': 3.2501, 'grad_norm': 0.797243595123291, 'learning_rate': 0.00017228504436026527, 'epoch': 0.27}


 27%|██▋       | 307/1151 [18:11<44:27,  3.16s/it]

{'loss': 3.3194, 'grad_norm': 0.8421855568885803, 'learning_rate': 0.00017209023717584013, 'epoch': 0.27}


 27%|██▋       | 308/1151 [18:14<44:09,  3.14s/it]

{'loss': 3.0569, 'grad_norm': 0.7643089890480042, 'learning_rate': 0.00017189485871338327, 'epoch': 0.27}


 27%|██▋       | 309/1151 [18:17<44:27,  3.17s/it]

{'loss': 3.274, 'grad_norm': 0.798774003982544, 'learning_rate': 0.00017169891052116852, 'epoch': 0.27}


 27%|██▋       | 310/1151 [18:20<44:06,  3.15s/it]

{'loss': 3.2615, 'grad_norm': 0.7416277527809143, 'learning_rate': 0.00017150239415198438, 'epoch': 0.27}


 27%|██▋       | 311/1151 [18:23<43:36,  3.12s/it]

{'loss': 3.2915, 'grad_norm': 0.7319460511207581, 'learning_rate': 0.00017130531116312203, 'epoch': 0.27}


 27%|██▋       | 312/1151 [18:27<43:39,  3.12s/it]

{'loss': 3.2403, 'grad_norm': 0.8550884127616882, 'learning_rate': 0.0001711076631163627, 'epoch': 0.27}


 27%|██▋       | 313/1151 [18:30<43:49,  3.14s/it]

{'loss': 3.1972, 'grad_norm': 0.8767945170402527, 'learning_rate': 0.00017090945157796546, 'epoch': 0.27}


 27%|██▋       | 314/1151 [18:32<42:01,  3.01s/it]

{'loss': 2.6861, 'grad_norm': 0.9070021510124207, 'learning_rate': 0.00017071067811865476, 'epoch': 0.27}


 27%|██▋       | 315/1151 [18:36<42:42,  3.07s/it]

{'loss': 3.2105, 'grad_norm': 0.7999142408370972, 'learning_rate': 0.00017051134431360796, 'epoch': 0.27}


 27%|██▋       | 316/1151 [18:39<44:12,  3.18s/it]

{'loss': 3.5255, 'grad_norm': 0.7855270504951477, 'learning_rate': 0.00017031145174244285, 'epoch': 0.27}


 28%|██▊       | 317/1151 [18:42<43:18,  3.12s/it]

{'loss': 2.9915, 'grad_norm': 0.8197962045669556, 'learning_rate': 0.0001701110019892053, 'epoch': 0.28}


 28%|██▊       | 318/1151 [18:45<41:58,  3.02s/it]

{'loss': 2.7068, 'grad_norm': 0.8377986550331116, 'learning_rate': 0.0001699099966423563, 'epoch': 0.28}


 28%|██▊       | 319/1151 [18:48<42:09,  3.04s/it]

{'loss': 3.4241, 'grad_norm': 0.8583183884620667, 'learning_rate': 0.00016970843729475991, 'epoch': 0.28}


 28%|██▊       | 320/1151 [18:51<41:05,  2.97s/it]

{'loss': 2.4762, 'grad_norm': 0.8845911622047424, 'learning_rate': 0.00016950632554367019, 'epoch': 0.28}


 28%|██▊       | 321/1151 [18:54<41:36,  3.01s/it]

{'loss': 3.2783, 'grad_norm': 0.7653000354766846, 'learning_rate': 0.0001693036629907188, 'epoch': 0.28}


 28%|██▊       | 322/1151 [18:57<40:43,  2.95s/it]

{'loss': 3.3332, 'grad_norm': 0.8832595348358154, 'learning_rate': 0.00016910045124190207, 'epoch': 0.28}


 28%|██▊       | 323/1151 [19:00<40:13,  2.91s/it]

{'loss': 2.8209, 'grad_norm': 0.8303378224372864, 'learning_rate': 0.00016889669190756868, 'epoch': 0.28}


 28%|██▊       | 324/1151 [19:03<42:14,  3.06s/it]

{'loss': 3.7146, 'grad_norm': 0.797549843788147, 'learning_rate': 0.00016869238660240638, 'epoch': 0.28}


 28%|██▊       | 325/1151 [19:07<44:46,  3.25s/it]

{'loss': 3.5844, 'grad_norm': 0.7327180504798889, 'learning_rate': 0.00016848753694542965, 'epoch': 0.28}


 28%|██▊       | 326/1151 [19:10<43:22,  3.15s/it]

{'loss': 3.1291, 'grad_norm': 0.8368344306945801, 'learning_rate': 0.00016828214455996658, 'epoch': 0.28}


 28%|██▊       | 327/1151 [19:12<41:54,  3.05s/it]

{'loss': 3.0536, 'grad_norm': 0.7926154732704163, 'learning_rate': 0.00016807621107364613, 'epoch': 0.28}


 28%|██▊       | 328/1151 [19:15<42:00,  3.06s/it]

{'loss': 3.6934, 'grad_norm': 0.8327116370201111, 'learning_rate': 0.00016786973811838522, 'epoch': 0.28}


 29%|██▊       | 329/1151 [19:18<41:03,  3.00s/it]

{'loss': 3.0009, 'grad_norm': 0.7994216084480286, 'learning_rate': 0.00016766272733037576, 'epoch': 0.29}


 29%|██▊       | 330/1151 [19:21<39:28,  2.89s/it]

{'loss': 2.55, 'grad_norm': 0.855826735496521, 'learning_rate': 0.00016745518035007168, 'epoch': 0.29}


 29%|██▉       | 331/1151 [19:24<39:03,  2.86s/it]

{'loss': 2.796, 'grad_norm': 0.7876389026641846, 'learning_rate': 0.00016724709882217603, 'epoch': 0.29}


 29%|██▉       | 332/1151 [19:26<38:36,  2.83s/it]

{'loss': 2.7291, 'grad_norm': 0.7686324715614319, 'learning_rate': 0.00016703848439562785, 'epoch': 0.29}


 29%|██▉       | 333/1151 [19:29<39:10,  2.87s/it]

{'loss': 2.9399, 'grad_norm': 0.8502402305603027, 'learning_rate': 0.0001668293387235891, 'epoch': 0.29}


 29%|██▉       | 334/1151 [19:32<39:00,  2.86s/it]

{'loss': 3.3154, 'grad_norm': 0.7915471792221069, 'learning_rate': 0.0001666196634634316, 'epoch': 0.29}


 29%|██▉       | 335/1151 [19:35<39:51,  2.93s/it]

{'loss': 3.4726, 'grad_norm': 0.7620712518692017, 'learning_rate': 0.00016640946027672392, 'epoch': 0.29}


 29%|██▉       | 336/1151 [19:38<40:18,  2.97s/it]

{'loss': 3.5125, 'grad_norm': 0.7641885876655579, 'learning_rate': 0.00016619873082921808, 'epoch': 0.29}


 29%|██▉       | 337/1151 [19:41<40:16,  2.97s/it]

{'loss': 3.1729, 'grad_norm': 0.8451053500175476, 'learning_rate': 0.00016598747679083658, 'epoch': 0.29}


 29%|██▉       | 338/1151 [19:44<39:40,  2.93s/it]

{'loss': 3.2075, 'grad_norm': 0.9124047756195068, 'learning_rate': 0.0001657756998356589, 'epoch': 0.29}


 29%|██▉       | 339/1151 [19:47<40:05,  2.96s/it]

{'loss': 3.4485, 'grad_norm': 0.762852668762207, 'learning_rate': 0.00016556340164190845, 'epoch': 0.29}


 30%|██▉       | 340/1151 [19:50<39:30,  2.92s/it]

{'loss': 3.2855, 'grad_norm': 0.7767807841300964, 'learning_rate': 0.00016535058389193917, 'epoch': 0.3}


 30%|██▉       | 341/1151 [19:53<38:56,  2.88s/it]

{'loss': 2.8578, 'grad_norm': 0.8088005185127258, 'learning_rate': 0.00016513724827222227, 'epoch': 0.3}


 30%|██▉       | 342/1151 [19:56<38:37,  2.86s/it]

{'loss': 3.0876, 'grad_norm': 0.7944011688232422, 'learning_rate': 0.0001649233964733326, 'epoch': 0.3}


 30%|██▉       | 343/1151 [19:58<38:03,  2.83s/it]

{'loss': 2.8475, 'grad_norm': 0.8164867162704468, 'learning_rate': 0.00016470903018993578, 'epoch': 0.3}


 30%|██▉       | 344/1151 [20:01<38:30,  2.86s/it]

{'loss': 3.4628, 'grad_norm': 0.7862573862075806, 'learning_rate': 0.0001644941511207742, 'epoch': 0.3}


 30%|██▉       | 345/1151 [20:05<42:12,  3.14s/it]

{'loss': 3.9839, 'grad_norm': 0.788928747177124, 'learning_rate': 0.00016427876096865394, 'epoch': 0.3}


 30%|███       | 346/1151 [20:08<41:44,  3.11s/it]

{'loss': 3.1639, 'grad_norm': 0.8191173076629639, 'learning_rate': 0.00016406286144043112, 'epoch': 0.3}


 30%|███       | 347/1151 [20:11<40:56,  3.06s/it]

{'loss': 3.0219, 'grad_norm': 0.7380196452140808, 'learning_rate': 0.00016384645424699835, 'epoch': 0.3}


 30%|███       | 348/1151 [20:14<39:43,  2.97s/it]

{'loss': 2.9873, 'grad_norm': 0.9082412719726562, 'learning_rate': 0.00016362954110327134, 'epoch': 0.3}


 30%|███       | 349/1151 [20:17<39:20,  2.94s/it]

{'loss': 3.2919, 'grad_norm': 0.8933247923851013, 'learning_rate': 0.0001634121237281751, 'epoch': 0.3}


 30%|███       | 350/1151 [20:20<41:16,  3.09s/it]

{'loss': 3.9477, 'grad_norm': 0.7474888563156128, 'learning_rate': 0.0001631942038446304, 'epoch': 0.3}


 30%|███       | 351/1151 [20:24<43:55,  3.29s/it]

{'loss': 3.8355, 'grad_norm': 0.7447279691696167, 'learning_rate': 0.00016297578317954025, 'epoch': 0.3}


 31%|███       | 352/1151 [20:27<43:06,  3.24s/it]

{'loss': 3.6308, 'grad_norm': 0.7959639430046082, 'learning_rate': 0.000162756863463776, 'epoch': 0.31}


 31%|███       | 353/1151 [20:30<40:56,  3.08s/it]

{'loss': 2.9951, 'grad_norm': 0.8199824690818787, 'learning_rate': 0.00016253744643216368, 'epoch': 0.31}


 31%|███       | 354/1151 [20:33<42:00,  3.16s/it]

{'loss': 3.4, 'grad_norm': 0.7089518904685974, 'learning_rate': 0.00016231753382347047, 'epoch': 0.31}


 31%|███       | 355/1151 [20:36<41:27,  3.13s/it]

{'loss': 3.1422, 'grad_norm': 0.7253245115280151, 'learning_rate': 0.00016209712738039049, 'epoch': 0.31}


 31%|███       | 356/1151 [20:39<40:10,  3.03s/it]

{'loss': 2.8483, 'grad_norm': 0.7606346607208252, 'learning_rate': 0.00016187622884953145, 'epoch': 0.31}


 31%|███       | 357/1151 [20:42<38:46,  2.93s/it]

{'loss': 2.7759, 'grad_norm': 0.8018185496330261, 'learning_rate': 0.00016165483998140058, 'epoch': 0.31}


 31%|███       | 358/1151 [20:45<38:41,  2.93s/it]

{'loss': 3.0954, 'grad_norm': 0.7587693333625793, 'learning_rate': 0.00016143296253039063, 'epoch': 0.31}


 31%|███       | 359/1151 [20:48<38:44,  2.94s/it]

{'loss': 3.3573, 'grad_norm': 0.8425959944725037, 'learning_rate': 0.0001612105982547663, 'epoch': 0.31}


 31%|███▏      | 360/1151 [20:50<37:46,  2.87s/it]

{'loss': 3.0794, 'grad_norm': 0.8206770420074463, 'learning_rate': 0.00016098774891665, 'epoch': 0.31}


 31%|███▏      | 361/1151 [20:54<39:23,  2.99s/it]

{'loss': 3.2437, 'grad_norm': 0.8108681440353394, 'learning_rate': 0.00016076441628200806, 'epoch': 0.31}


 31%|███▏      | 362/1151 [20:57<39:23,  3.00s/it]

{'loss': 2.9943, 'grad_norm': 0.7303858995437622, 'learning_rate': 0.00016054060212063672, 'epoch': 0.31}


 32%|███▏      | 363/1151 [20:59<38:44,  2.95s/it]

{'loss': 3.1706, 'grad_norm': 0.9135777354240417, 'learning_rate': 0.00016031630820614797, 'epoch': 0.32}


 32%|███▏      | 364/1151 [21:02<37:04,  2.83s/it]

{'loss': 2.9279, 'grad_norm': 0.8634827733039856, 'learning_rate': 0.0001600915363159557, 'epoch': 0.32}


 32%|███▏      | 365/1151 [21:05<36:35,  2.79s/it]

{'loss': 3.056, 'grad_norm': 0.8110910058021545, 'learning_rate': 0.0001598662882312615, 'epoch': 0.32}


 32%|███▏      | 366/1151 [21:07<35:59,  2.75s/it]

{'loss': 3.1296, 'grad_norm': 0.861370861530304, 'learning_rate': 0.0001596405657370405, 'epoch': 0.32}


 32%|███▏      | 367/1151 [21:10<37:07,  2.84s/it]

{'loss': 3.4363, 'grad_norm': 0.7891243696212769, 'learning_rate': 0.0001594143706220273, 'epoch': 0.32}


 32%|███▏      | 368/1151 [21:13<36:45,  2.82s/it]

{'loss': 3.0754, 'grad_norm': 0.7930466532707214, 'learning_rate': 0.0001591877046787017, 'epoch': 0.32}


 32%|███▏      | 369/1151 [21:16<37:25,  2.87s/it]

{'loss': 3.0848, 'grad_norm': 0.7844529747962952, 'learning_rate': 0.00015896056970327485, 'epoch': 0.32}


 32%|███▏      | 370/1151 [21:19<38:14,  2.94s/it]

{'loss': 3.5551, 'grad_norm': 0.766892671585083, 'learning_rate': 0.00015873296749567442, 'epoch': 0.32}


 32%|███▏      | 371/1151 [21:22<36:47,  2.83s/it]

{'loss': 2.4979, 'grad_norm': 0.8541720509529114, 'learning_rate': 0.00015850489985953076, 'epoch': 0.32}


 32%|███▏      | 372/1151 [21:25<39:19,  3.03s/it]

{'loss': 3.5579, 'grad_norm': 0.747495710849762, 'learning_rate': 0.00015827636860216263, 'epoch': 0.32}


 32%|███▏      | 373/1151 [21:29<39:56,  3.08s/it]

{'loss': 3.2117, 'grad_norm': 0.7196103930473328, 'learning_rate': 0.0001580473755345625, 'epoch': 0.32}


 32%|███▏      | 374/1151 [21:32<39:50,  3.08s/it]

{'loss': 3.6899, 'grad_norm': 0.7815132737159729, 'learning_rate': 0.0001578179224713827, 'epoch': 0.32}


 33%|███▎      | 375/1151 [21:34<39:03,  3.02s/it]

{'loss': 3.074, 'grad_norm': 0.8933296203613281, 'learning_rate': 0.00015758801123092066, 'epoch': 0.33}


 33%|███▎      | 376/1151 [21:37<37:11,  2.88s/it]

{'loss': 2.4421, 'grad_norm': 0.7745075225830078, 'learning_rate': 0.0001573576436351046, 'epoch': 0.33}


 33%|███▎      | 377/1151 [21:40<38:04,  2.95s/it]

{'loss': 3.585, 'grad_norm': 0.7274974584579468, 'learning_rate': 0.00015712682150947923, 'epoch': 0.33}


 33%|███▎      | 378/1151 [21:43<36:32,  2.84s/it]

{'loss': 2.6256, 'grad_norm': 0.8443260788917542, 'learning_rate': 0.0001568955466831911, 'epoch': 0.33}


 33%|███▎      | 379/1151 [21:45<36:13,  2.82s/it]

{'loss': 3.0052, 'grad_norm': 0.7725199460983276, 'learning_rate': 0.00015666382098897412, 'epoch': 0.33}


 33%|███▎      | 380/1151 [21:48<36:26,  2.84s/it]

{'loss': 3.3676, 'grad_norm': 0.7531906366348267, 'learning_rate': 0.00015643164626313527, 'epoch': 0.33}


 33%|███▎      | 381/1151 [21:51<37:24,  2.92s/it]

{'loss': 3.4653, 'grad_norm': 0.8085964322090149, 'learning_rate': 0.0001561990243455397, 'epoch': 0.33}


 33%|███▎      | 382/1151 [21:54<37:05,  2.89s/it]

{'loss': 3.2096, 'grad_norm': 0.8053687214851379, 'learning_rate': 0.00015596595707959647, 'epoch': 0.33}


 33%|███▎      | 383/1151 [21:57<36:37,  2.86s/it]

{'loss': 3.0762, 'grad_norm': 0.8374385237693787, 'learning_rate': 0.00015573244631224365, 'epoch': 0.33}


 33%|███▎      | 384/1151 [22:00<36:08,  2.83s/it]

{'loss': 2.9907, 'grad_norm': 0.7979113459587097, 'learning_rate': 0.00015549849389393395, 'epoch': 0.33}


 33%|███▎      | 385/1151 [22:02<35:13,  2.76s/it]

{'loss': 3.1068, 'grad_norm': 0.8913275599479675, 'learning_rate': 0.00015526410167861988, 'epoch': 0.33}


 34%|███▎      | 386/1151 [22:05<35:42,  2.80s/it]

{'loss': 3.108, 'grad_norm': 0.8293577432632446, 'learning_rate': 0.00015502927152373914, 'epoch': 0.34}


 34%|███▎      | 387/1151 [22:08<35:28,  2.79s/it]

{'loss': 2.7637, 'grad_norm': 0.802361249923706, 'learning_rate': 0.00015479400529019985, 'epoch': 0.34}


 34%|███▎      | 388/1151 [22:11<36:20,  2.86s/it]

{'loss': 3.259, 'grad_norm': 0.7426996827125549, 'learning_rate': 0.00015455830484236585, 'epoch': 0.34}


 34%|███▍      | 389/1151 [22:14<36:16,  2.86s/it]

{'loss': 3.0963, 'grad_norm': 0.9129507541656494, 'learning_rate': 0.0001543221720480419, 'epoch': 0.34}


 34%|███▍      | 390/1151 [22:17<37:02,  2.92s/it]

{'loss': 3.2707, 'grad_norm': 0.73543381690979, 'learning_rate': 0.00015408560877845886, 'epoch': 0.34}


 34%|███▍      | 391/1151 [22:20<36:40,  2.90s/it]

{'loss': 3.1069, 'grad_norm': 0.8000308871269226, 'learning_rate': 0.0001538486169082589, 'epoch': 0.34}


 34%|███▍      | 392/1151 [22:23<36:43,  2.90s/it]

{'loss': 3.114, 'grad_norm': 0.8052908778190613, 'learning_rate': 0.00015361119831548069, 'epoch': 0.34}


 34%|███▍      | 393/1151 [22:25<35:40,  2.82s/it]

{'loss': 2.8057, 'grad_norm': 0.8477689623832703, 'learning_rate': 0.00015337335488154431, 'epoch': 0.34}


 34%|███▍      | 394/1151 [22:28<35:43,  2.83s/it]

{'loss': 2.9902, 'grad_norm': 0.7771901488304138, 'learning_rate': 0.00015313508849123668, 'epoch': 0.34}


 34%|███▍      | 395/1151 [22:32<37:19,  2.96s/it]

{'loss': 3.4976, 'grad_norm': 0.7555193305015564, 'learning_rate': 0.00015289640103269625, 'epoch': 0.34}


 34%|███▍      | 396/1151 [22:35<38:40,  3.07s/it]

{'loss': 3.5155, 'grad_norm': 0.8355767130851746, 'learning_rate': 0.00015265729439739833, 'epoch': 0.34}


 34%|███▍      | 397/1151 [22:38<37:47,  3.01s/it]

{'loss': 3.0885, 'grad_norm': 0.7700128555297852, 'learning_rate': 0.00015241777048014, 'epoch': 0.34}


 35%|███▍      | 398/1151 [22:41<37:10,  2.96s/it]

{'loss': 2.8111, 'grad_norm': 0.7606942057609558, 'learning_rate': 0.00015217783117902497, 'epoch': 0.35}


 35%|███▍      | 399/1151 [22:44<36:57,  2.95s/it]

{'loss': 3.2197, 'grad_norm': 0.7455344200134277, 'learning_rate': 0.00015193747839544876, 'epoch': 0.35}


 35%|███▍      | 400/1151 [22:46<36:54,  2.95s/it]

{'loss': 3.1837, 'grad_norm': 0.8597807884216309, 'learning_rate': 0.0001516967140340835, 'epoch': 0.35}


 35%|███▍      | 401/1151 [22:49<35:39,  2.85s/it]

{'loss': 2.6294, 'grad_norm': 0.8601344227790833, 'learning_rate': 0.0001514555400028629, 'epoch': 0.35}


 35%|███▍      | 402/1151 [22:52<36:37,  2.93s/it]

{'loss': 3.3556, 'grad_norm': 0.8305476903915405, 'learning_rate': 0.00015121395821296693, 'epoch': 0.35}


 35%|███▌      | 403/1151 [22:55<37:41,  3.02s/it]

{'loss': 3.5093, 'grad_norm': 0.7507681846618652, 'learning_rate': 0.00015097197057880706, 'epoch': 0.35}


 35%|███▌      | 404/1151 [22:58<37:30,  3.01s/it]

{'loss': 3.1779, 'grad_norm': 0.8003259897232056, 'learning_rate': 0.00015072957901801076, 'epoch': 0.35}


 35%|███▌      | 405/1151 [23:02<39:14,  3.16s/it]

{'loss': 3.6895, 'grad_norm': 0.7171081304550171, 'learning_rate': 0.00015048678545140633, 'epoch': 0.35}


 35%|███▌      | 406/1151 [23:05<38:42,  3.12s/it]

{'loss': 3.523, 'grad_norm': 0.8496529459953308, 'learning_rate': 0.0001502435918030079, 'epoch': 0.35}


 35%|███▌      | 407/1151 [23:08<38:34,  3.11s/it]

{'loss': 3.4473, 'grad_norm': 0.7551546692848206, 'learning_rate': 0.00015000000000000001, 'epoch': 0.35}


 35%|███▌      | 408/1151 [23:12<41:24,  3.34s/it]

{'loss': 3.8827, 'grad_norm': 0.6494044661521912, 'learning_rate': 0.0001497560119727223, 'epoch': 0.35}


 36%|███▌      | 409/1151 [23:15<39:58,  3.23s/it]

{'loss': 3.3508, 'grad_norm': 0.8343879580497742, 'learning_rate': 0.00014951162965465433, 'epoch': 0.36}


 36%|███▌      | 410/1151 [23:18<39:00,  3.16s/it]

{'loss': 3.5991, 'grad_norm': 0.8080812692642212, 'learning_rate': 0.00014926685498240028, 'epoch': 0.36}


 36%|███▌      | 411/1151 [23:21<38:15,  3.10s/it]

{'loss': 3.2166, 'grad_norm': 0.8206456899642944, 'learning_rate': 0.00014902168989567335, 'epoch': 0.36}


 36%|███▌      | 412/1151 [23:24<36:57,  3.00s/it]

{'loss': 3.0437, 'grad_norm': 0.8150476217269897, 'learning_rate': 0.00014877613633728078, 'epoch': 0.36}


 36%|███▌      | 413/1151 [23:27<36:52,  3.00s/it]

{'loss': 3.4285, 'grad_norm': 0.851311206817627, 'learning_rate': 0.00014853019625310813, 'epoch': 0.36}


 36%|███▌      | 414/1151 [23:29<36:11,  2.95s/it]

{'loss': 3.0845, 'grad_norm': 0.752656877040863, 'learning_rate': 0.00014828387159210397, 'epoch': 0.36}


 36%|███▌      | 415/1151 [23:33<37:42,  3.07s/it]

{'loss': 3.6773, 'grad_norm': 0.7770140171051025, 'learning_rate': 0.00014803716430626456, 'epoch': 0.36}


 36%|███▌      | 416/1151 [23:36<37:26,  3.06s/it]

{'loss': 3.4671, 'grad_norm': 0.7550210356712341, 'learning_rate': 0.0001477900763506181, 'epoch': 0.36}


 36%|███▌      | 417/1151 [23:39<36:40,  3.00s/it]

{'loss': 3.2982, 'grad_norm': 0.7403396368026733, 'learning_rate': 0.0001475426096832095, 'epoch': 0.36}


 36%|███▋      | 418/1151 [23:42<37:08,  3.04s/it]

{'loss': 3.3369, 'grad_norm': 0.7278348803520203, 'learning_rate': 0.00014729476626508485, 'epoch': 0.36}


 36%|███▋      | 419/1151 [23:45<37:26,  3.07s/it]

{'loss': 3.5634, 'grad_norm': 0.7267985343933105, 'learning_rate': 0.0001470465480602756, 'epoch': 0.36}


 36%|███▋      | 420/1151 [23:48<38:51,  3.19s/it]

{'loss': 3.7141, 'grad_norm': 0.7191134095191956, 'learning_rate': 0.00014679795703578325, 'epoch': 0.36}


 37%|███▋      | 421/1151 [23:51<38:12,  3.14s/it]

{'loss': 3.0336, 'grad_norm': 0.7279073596000671, 'learning_rate': 0.00014654899516156387, 'epoch': 0.37}


 37%|███▋      | 422/1151 [23:54<37:38,  3.10s/it]

{'loss': 3.1893, 'grad_norm': 0.8920530080795288, 'learning_rate': 0.00014629966441051208, 'epoch': 0.37}


 37%|███▋      | 423/1151 [23:57<36:36,  3.02s/it]

{'loss': 3.0795, 'grad_norm': 0.832920491695404, 'learning_rate': 0.00014604996675844585, 'epoch': 0.37}


 37%|███▋      | 424/1151 [24:00<36:25,  3.01s/it]

{'loss': 3.1511, 'grad_norm': 0.821115255355835, 'learning_rate': 0.0001457999041840906, 'epoch': 0.37}


 37%|███▋      | 425/1151 [24:03<35:45,  2.96s/it]

{'loss': 3.2449, 'grad_norm': 0.8656453490257263, 'learning_rate': 0.0001455494786690634, 'epoch': 0.37}


 37%|███▋      | 426/1151 [24:07<37:29,  3.10s/it]

{'loss': 3.3219, 'grad_norm': 0.7478826642036438, 'learning_rate': 0.00014529869219785777, 'epoch': 0.37}


 37%|███▋      | 427/1151 [24:10<37:39,  3.12s/it]

{'loss': 3.2349, 'grad_norm': 0.8255872130393982, 'learning_rate': 0.0001450475467578273, 'epoch': 0.37}


 37%|███▋      | 428/1151 [24:12<35:58,  2.99s/it]

{'loss': 2.9211, 'grad_norm': 0.8794426918029785, 'learning_rate': 0.00014479604433917045, 'epoch': 0.37}


 37%|███▋      | 429/1151 [24:15<34:47,  2.89s/it]

{'loss': 2.8667, 'grad_norm': 0.8522724509239197, 'learning_rate': 0.00014454418693491444, 'epoch': 0.37}


 37%|███▋      | 430/1151 [24:18<36:04,  3.00s/it]

{'loss': 3.4078, 'grad_norm': 0.6929606199264526, 'learning_rate': 0.00014429197654089955, 'epoch': 0.37}


 37%|███▋      | 431/1151 [24:21<36:01,  3.00s/it]

{'loss': 3.401, 'grad_norm': 0.8151595592498779, 'learning_rate': 0.00014403941515576344, 'epoch': 0.37}


 38%|███▊      | 432/1151 [24:24<36:22,  3.04s/it]

{'loss': 3.4278, 'grad_norm': 0.7914422750473022, 'learning_rate': 0.0001437865047809251, 'epoch': 0.38}


 38%|███▊      | 433/1151 [24:28<37:25,  3.13s/it]

{'loss': 3.7516, 'grad_norm': 0.7321418523788452, 'learning_rate': 0.000143533247420569, 'epoch': 0.38}


 38%|███▊      | 434/1151 [24:31<38:14,  3.20s/it]

{'loss': 3.2837, 'grad_norm': 0.7499338984489441, 'learning_rate': 0.0001432796450816295, 'epoch': 0.38}


 38%|███▊      | 435/1151 [24:34<36:59,  3.10s/it]

{'loss': 3.0654, 'grad_norm': 0.8212071061134338, 'learning_rate': 0.0001430256997737746, 'epoch': 0.38}


 38%|███▊      | 436/1151 [24:38<38:29,  3.23s/it]

{'loss': 3.7237, 'grad_norm': 0.7490758895874023, 'learning_rate': 0.00014277141350939023, 'epoch': 0.38}


 38%|███▊      | 437/1151 [24:41<37:27,  3.15s/it]

{'loss': 3.3713, 'grad_norm': 0.765770673751831, 'learning_rate': 0.00014251678830356408, 'epoch': 0.38}


 38%|███▊      | 438/1151 [24:44<37:35,  3.16s/it]

{'loss': 3.6908, 'grad_norm': 0.7035236358642578, 'learning_rate': 0.00014226182617406996, 'epoch': 0.38}


 38%|███▊      | 439/1151 [24:46<35:15,  2.97s/it]

{'loss': 2.4504, 'grad_norm': 0.8328005075454712, 'learning_rate': 0.0001420065291413515, 'epoch': 0.38}


 38%|███▊      | 440/1151 [24:49<35:01,  2.96s/it]

{'loss': 2.9261, 'grad_norm': 0.7677218317985535, 'learning_rate': 0.00014175089922850633, 'epoch': 0.38}


 38%|███▊      | 441/1151 [24:52<33:50,  2.86s/it]

{'loss': 2.7993, 'grad_norm': 0.7704971432685852, 'learning_rate': 0.00014149493846126994, 'epoch': 0.38}


 38%|███▊      | 442/1151 [24:55<33:32,  2.84s/it]

{'loss': 3.0041, 'grad_norm': 0.8464207053184509, 'learning_rate': 0.0001412386488679997, 'epoch': 0.38}


 38%|███▊      | 443/1151 [24:57<33:34,  2.84s/it]

{'loss': 3.0996, 'grad_norm': 0.8358824849128723, 'learning_rate': 0.00014098203247965875, 'epoch': 0.38}


 39%|███▊      | 444/1151 [25:01<34:48,  2.95s/it]

{'loss': 3.41, 'grad_norm': 0.7570542693138123, 'learning_rate': 0.00014072509132979994, 'epoch': 0.39}


 39%|███▊      | 445/1151 [25:04<35:35,  3.02s/it]

{'loss': 3.2116, 'grad_norm': 0.7207232713699341, 'learning_rate': 0.00014046782745454962, 'epoch': 0.39}


 39%|███▊      | 446/1151 [25:07<36:50,  3.14s/it]

{'loss': 3.5061, 'grad_norm': 0.7735158205032349, 'learning_rate': 0.00014021024289259158, 'epoch': 0.39}


 39%|███▉      | 447/1151 [25:10<36:49,  3.14s/it]

{'loss': 3.3095, 'grad_norm': 0.7719752788543701, 'learning_rate': 0.00013995233968515104, 'epoch': 0.39}


 39%|███▉      | 448/1151 [25:13<35:00,  2.99s/it]

{'loss': 2.6785, 'grad_norm': 0.8202794790267944, 'learning_rate': 0.0001396941198759781, 'epoch': 0.39}


 39%|███▉      | 449/1151 [25:16<35:03,  3.00s/it]

{'loss': 3.3598, 'grad_norm': 0.7658846974372864, 'learning_rate': 0.00013943558551133186, 'epoch': 0.39}


 39%|███▉      | 450/1151 [25:19<34:01,  2.91s/it]

{'loss': 2.7102, 'grad_norm': 0.7782614827156067, 'learning_rate': 0.00013917673863996419, 'epoch': 0.39}


 39%|███▉      | 451/1151 [25:22<34:05,  2.92s/it]

{'loss': 3.2151, 'grad_norm': 0.9369587898254395, 'learning_rate': 0.0001389175813131033, 'epoch': 0.39}


 39%|███▉      | 452/1151 [25:25<34:10,  2.93s/it]

{'loss': 2.9662, 'grad_norm': 0.714581310749054, 'learning_rate': 0.0001386581155844376, 'epoch': 0.39}


 39%|███▉      | 453/1151 [25:28<34:01,  2.92s/it]

{'loss': 2.8132, 'grad_norm': 0.7459930181503296, 'learning_rate': 0.00013839834351009954, 'epoch': 0.39}


 39%|███▉      | 454/1151 [25:30<33:32,  2.89s/it]

{'loss': 2.9987, 'grad_norm': 0.7857157588005066, 'learning_rate': 0.0001381382671486491, 'epoch': 0.39}


 40%|███▉      | 455/1151 [25:33<32:19,  2.79s/it]

{'loss': 2.6422, 'grad_norm': 0.8616517186164856, 'learning_rate': 0.0001378778885610576, 'epoch': 0.4}


 40%|███▉      | 456/1151 [25:36<32:52,  2.84s/it]

{'loss': 3.3729, 'grad_norm': 0.7961634397506714, 'learning_rate': 0.00013761720981069137, 'epoch': 0.4}


 40%|███▉      | 457/1151 [25:39<33:08,  2.87s/it]

{'loss': 3.0526, 'grad_norm': 0.8856217861175537, 'learning_rate': 0.00013735623296329537, 'epoch': 0.4}


 40%|███▉      | 458/1151 [25:43<36:36,  3.17s/it]

{'loss': 3.8344, 'grad_norm': 0.6670805215835571, 'learning_rate': 0.0001370949600869768, 'epoch': 0.4}


 40%|███▉      | 459/1151 [25:46<36:54,  3.20s/it]

{'loss': 3.1185, 'grad_norm': 0.7862786650657654, 'learning_rate': 0.00013683339325218873, 'epoch': 0.4}


 40%|███▉      | 460/1151 [25:49<34:37,  3.01s/it]

{'loss': 2.5596, 'grad_norm': 0.9056366086006165, 'learning_rate': 0.00013657153453171375, 'epoch': 0.4}


 40%|████      | 461/1151 [25:52<36:24,  3.17s/it]

{'loss': 3.5461, 'grad_norm': 0.7275100350379944, 'learning_rate': 0.00013630938600064747, 'epoch': 0.4}


 40%|████      | 462/1151 [25:55<35:01,  3.05s/it]

{'loss': 2.9224, 'grad_norm': 0.8655734658241272, 'learning_rate': 0.0001360469497363821, 'epoch': 0.4}


 40%|████      | 462/1151 [28:11<35:01,  3.05s/it]

{'eval_loss': 3.0916013717651367, 'eval_runtime': 136.4916, 'eval_samples_per_second': 8.433, 'eval_steps_per_second': 1.055, 'epoch': 0.4}


 40%|████      | 463/1151 [28:14<8:24:44, 44.02s/it]

{'loss': 3.0634, 'grad_norm': 0.7451162338256836, 'learning_rate': 0.00013578422781858993, 'epoch': 0.4}


 40%|████      | 464/1151 [28:18<6:04:01, 31.79s/it]

{'loss': 3.3748, 'grad_norm': 0.7292166948318481, 'learning_rate': 0.00013552122232920707, 'epoch': 0.4}


 40%|████      | 465/1151 [28:21<4:25:59, 23.26s/it]

{'loss': 3.6896, 'grad_norm': 0.7724366188049316, 'learning_rate': 0.00013525793535241654, 'epoch': 0.4}


 40%|████      | 466/1151 [28:24<3:14:48, 17.06s/it]

{'loss': 2.8009, 'grad_norm': 0.91996830701828, 'learning_rate': 0.00013499436897463222, 'epoch': 0.4}


 41%|████      | 467/1151 [28:27<2:26:50, 12.88s/it]

{'loss': 3.2539, 'grad_norm': 0.7736149430274963, 'learning_rate': 0.00013473052528448201, 'epoch': 0.41}


 41%|████      | 468/1151 [28:30<1:53:18,  9.95s/it]

{'loss': 3.4158, 'grad_norm': 0.7360520362854004, 'learning_rate': 0.00013446640637279145, 'epoch': 0.41}


 41%|████      | 469/1151 [28:33<1:29:41,  7.89s/it]

{'loss': 3.1465, 'grad_norm': 0.747039258480072, 'learning_rate': 0.00013420201433256689, 'epoch': 0.41}


 41%|████      | 470/1151 [28:36<1:13:22,  6.47s/it]

{'loss': 3.2504, 'grad_norm': 0.7540971636772156, 'learning_rate': 0.00013393735125897925, 'epoch': 0.41}


 41%|████      | 471/1151 [28:39<1:01:16,  5.41s/it]

{'loss': 3.0928, 'grad_norm': 0.8219526410102844, 'learning_rate': 0.00013367241924934714, 'epoch': 0.41}


 41%|████      | 472/1151 [28:42<52:02,  4.60s/it]

{'loss': 2.7953, 'grad_norm': 0.8453661799430847, 'learning_rate': 0.00013340722040312047, 'epoch': 0.41}


 41%|████      | 473/1151 [28:45<48:19,  4.28s/it]

{'loss': 3.8872, 'grad_norm': 0.7172425389289856, 'learning_rate': 0.0001331417568218636, 'epoch': 0.41}


 41%|████      | 474/1151 [28:49<44:58,  3.99s/it]

{'loss': 3.6073, 'grad_norm': 0.7195571660995483, 'learning_rate': 0.00013287603060923876, 'epoch': 0.41}


 41%|████▏     | 475/1151 [28:52<42:30,  3.77s/it]

{'loss': 3.479, 'grad_norm': 0.6866098046302795, 'learning_rate': 0.0001326100438709895, 'epoch': 0.41}


 41%|████▏     | 476/1151 [28:55<41:30,  3.69s/it]

{'loss': 3.6564, 'grad_norm': 0.7307948470115662, 'learning_rate': 0.0001323437987149238, 'epoch': 0.41}


 41%|████▏     | 477/1151 [28:59<40:26,  3.60s/it]

{'loss': 3.6374, 'grad_norm': 0.7327909469604492, 'learning_rate': 0.00013207729725089756, 'epoch': 0.41}


 42%|████▏     | 478/1151 [29:02<39:55,  3.56s/it]

{'loss': 3.5496, 'grad_norm': 0.7839037775993347, 'learning_rate': 0.00013181054159079768, 'epoch': 0.42}


 42%|████▏     | 479/1151 [29:05<38:18,  3.42s/it]

{'loss': 3.341, 'grad_norm': 0.8757696747779846, 'learning_rate': 0.00013154353384852558, 'epoch': 0.42}


 42%|████▏     | 480/1151 [29:08<36:40,  3.28s/it]

{'loss': 3.28, 'grad_norm': 0.7690321207046509, 'learning_rate': 0.0001312762761399801, 'epoch': 0.42}


 42%|████▏     | 481/1151 [29:11<35:46,  3.20s/it]

{'loss': 3.3761, 'grad_norm': 0.7724848985671997, 'learning_rate': 0.00013100877058304112, 'epoch': 0.42}


 42%|████▏     | 482/1151 [29:14<33:23,  2.99s/it]

{'loss': 2.5751, 'grad_norm': 0.8551284074783325, 'learning_rate': 0.00013074101929755252, 'epoch': 0.42}


 42%|████▏     | 483/1151 [29:17<32:26,  2.91s/it]

{'loss': 2.8714, 'grad_norm': 0.8013625741004944, 'learning_rate': 0.00013047302440530537, 'epoch': 0.42}


 42%|████▏     | 484/1151 [29:19<31:15,  2.81s/it]

{'loss': 2.7246, 'grad_norm': 0.8588359355926514, 'learning_rate': 0.00013020478803002142, 'epoch': 0.42}


 42%|████▏     | 485/1151 [29:22<30:37,  2.76s/it]

{'loss': 2.5359, 'grad_norm': 0.7672243118286133, 'learning_rate': 0.00012993631229733582, 'epoch': 0.42}


 42%|████▏     | 486/1151 [29:24<30:23,  2.74s/it]

{'loss': 2.8715, 'grad_norm': 0.8386601805686951, 'learning_rate': 0.00012966759933478057, 'epoch': 0.42}


 42%|████▏     | 487/1151 [29:27<30:50,  2.79s/it]

{'loss': 3.2613, 'grad_norm': 0.7384830713272095, 'learning_rate': 0.0001293986512717677, 'epoch': 0.42}


 42%|████▏     | 488/1151 [29:30<30:57,  2.80s/it]

{'loss': 3.0066, 'grad_norm': 0.8184902667999268, 'learning_rate': 0.00012912947023957212, 'epoch': 0.42}


 42%|████▏     | 489/1151 [29:33<30:21,  2.75s/it]

{'loss': 2.4571, 'grad_norm': 0.8006085753440857, 'learning_rate': 0.00012886005837131506, 'epoch': 0.42}


 43%|████▎     | 490/1151 [29:36<30:45,  2.79s/it]

{'loss': 2.9836, 'grad_norm': 0.7515769600868225, 'learning_rate': 0.00012859041780194692, 'epoch': 0.43}


 43%|████▎     | 491/1151 [29:39<30:54,  2.81s/it]

{'loss': 3.0604, 'grad_norm': 0.8185213804244995, 'learning_rate': 0.00012832055066823038, 'epoch': 0.43}


 43%|████▎     | 492/1151 [29:42<33:06,  3.01s/it]

{'loss': 3.5528, 'grad_norm': 0.735402524471283, 'learning_rate': 0.00012805045910872372, 'epoch': 0.43}


 43%|████▎     | 493/1151 [29:45<33:13,  3.03s/it]

{'loss': 3.5357, 'grad_norm': 0.790659487247467, 'learning_rate': 0.00012778014526376353, 'epoch': 0.43}


 43%|████▎     | 494/1151 [29:48<31:33,  2.88s/it]

{'loss': 2.7897, 'grad_norm': 0.8684771060943604, 'learning_rate': 0.0001275096112754478, 'epoch': 0.43}


 43%|████▎     | 495/1151 [29:50<31:24,  2.87s/it]

{'loss': 3.254, 'grad_norm': 0.7992540597915649, 'learning_rate': 0.00012723885928761933, 'epoch': 0.43}


 43%|████▎     | 496/1151 [29:54<32:48,  3.01s/it]

{'loss': 3.7457, 'grad_norm': 0.7334216237068176, 'learning_rate': 0.00012696789144584823, 'epoch': 0.43}


 43%|████▎     | 497/1151 [29:57<33:25,  3.07s/it]

{'loss': 3.7581, 'grad_norm': 0.7241879105567932, 'learning_rate': 0.00012669670989741517, 'epoch': 0.43}


 43%|████▎     | 498/1151 [30:00<33:03,  3.04s/it]

{'loss': 3.1705, 'grad_norm': 0.7223631143569946, 'learning_rate': 0.00012642531679129445, 'epoch': 0.43}


 43%|████▎     | 499/1151 [30:03<33:51,  3.12s/it]

{'loss': 3.3804, 'grad_norm': 0.7616962194442749, 'learning_rate': 0.0001261537142781367, 'epoch': 0.43}


 43%|████▎     | 500/1151 [30:06<32:51,  3.03s/it]

{'loss': 2.9094, 'grad_norm': 0.8278334736824036, 'learning_rate': 0.00012588190451025207, 'epoch': 0.43}


 44%|████▎     | 501/1151 [30:53<2:55:11, 16.17s/it]

{'loss': 3.2408, 'grad_norm': 0.7824459075927734, 'learning_rate': 0.0001256098896415932, 'epoch': 0.43}


 44%|████▎     | 502/1151 [30:56<2:12:02, 12.21s/it]

{'loss': 3.5122, 'grad_norm': 0.728834867477417, 'learning_rate': 0.0001253376718277378, 'epoch': 0.44}


 44%|████▎     | 503/1151 [30:58<1:40:32,  9.31s/it]

{'loss': 2.9236, 'grad_norm': 0.8152850866317749, 'learning_rate': 0.00012506525322587207, 'epoch': 0.44}


 44%|████▍     | 504/1151 [31:01<1:19:17,  7.35s/it]

{'loss': 3.2133, 'grad_norm': 0.8240436315536499, 'learning_rate': 0.00012479263599477318, 'epoch': 0.44}


 44%|████▍     | 505/1151 [31:04<1:04:43,  6.01s/it]

{'loss': 3.275, 'grad_norm': 0.842441737651825, 'learning_rate': 0.00012451982229479244, 'epoch': 0.44}


 44%|████▍     | 506/1151 [31:07<53:32,  4.98s/it]

{'loss': 2.9014, 'grad_norm': 0.8186134696006775, 'learning_rate': 0.000124246814287838, 'epoch': 0.44}


 44%|████▍     | 507/1151 [31:09<46:06,  4.30s/it]

{'loss': 3.155, 'grad_norm': 0.7589480876922607, 'learning_rate': 0.00012397361413735784, 'epoch': 0.44}


 44%|████▍     | 508/1151 [31:13<42:18,  3.95s/it]

{'loss': 3.4111, 'grad_norm': 0.715238630771637, 'learning_rate': 0.00012370022400832255, 'epoch': 0.44}


 44%|████▍     | 509/1151 [31:16<40:09,  3.75s/it]

{'loss': 3.4883, 'grad_norm': 0.729439377784729, 'learning_rate': 0.00012342664606720822, 'epoch': 0.44}


 44%|████▍     | 510/1151 [31:19<37:36,  3.52s/it]

{'loss': 3.1203, 'grad_norm': 0.7952986359596252, 'learning_rate': 0.00012315288248197925, 'epoch': 0.44}


 44%|████▍     | 511/1151 [31:21<34:43,  3.26s/it]

{'loss': 2.8859, 'grad_norm': 0.8384867310523987, 'learning_rate': 0.0001228789354220712, 'epoch': 0.44}


 44%|████▍     | 512/1151 [31:24<32:16,  3.03s/it]

{'loss': 2.3786, 'grad_norm': 0.8055822849273682, 'learning_rate': 0.0001226048070583735, 'epoch': 0.44}


 45%|████▍     | 513/1151 [31:27<33:17,  3.13s/it]

{'loss': 3.8821, 'grad_norm': 0.7944011688232422, 'learning_rate': 0.0001223304995632124, 'epoch': 0.45}


 45%|████▍     | 514/1151 [31:30<31:50,  3.00s/it]

{'loss': 2.9974, 'grad_norm': 0.8160977363586426, 'learning_rate': 0.0001220560151103336, 'epoch': 0.45}


 45%|████▍     | 515/1151 [31:33<30:53,  2.91s/it]

{'loss': 2.8455, 'grad_norm': 0.770142138004303, 'learning_rate': 0.00012178135587488515, 'epoch': 0.45}


 45%|████▍     | 516/1151 [31:36<30:26,  2.88s/it]

{'loss': 2.9903, 'grad_norm': 0.8145630955696106, 'learning_rate': 0.00012150652403340022, 'epoch': 0.45}


 45%|████▍     | 517/1151 [31:38<29:26,  2.79s/it]

{'loss': 2.5964, 'grad_norm': 0.7720887064933777, 'learning_rate': 0.00012123152176377961, 'epoch': 0.45}


 45%|████▌     | 518/1151 [31:41<29:22,  2.78s/it]

{'loss': 3.0108, 'grad_norm': 0.8234233856201172, 'learning_rate': 0.00012095635124527486, 'epoch': 0.45}


 45%|████▌     | 519/1151 [31:44<29:09,  2.77s/it]

{'loss': 2.9076, 'grad_norm': 0.8522347211837769, 'learning_rate': 0.00012068101465847075, 'epoch': 0.45}


 45%|████▌     | 520/1151 [31:47<29:45,  2.83s/it]

{'loss': 3.5195, 'grad_norm': 0.7510653138160706, 'learning_rate': 0.00012040551418526796, 'epoch': 0.45}


 45%|████▌     | 521/1151 [31:49<29:27,  2.81s/it]

{'loss': 3.1008, 'grad_norm': 0.7738161683082581, 'learning_rate': 0.00012012985200886602, 'epoch': 0.45}


 45%|████▌     | 522/1151 [31:52<28:37,  2.73s/it]

{'loss': 2.4084, 'grad_norm': 0.8161024451255798, 'learning_rate': 0.00011985403031374583, 'epoch': 0.45}


 45%|████▌     | 523/1151 [31:55<30:30,  2.91s/it]

{'loss': 3.5305, 'grad_norm': 0.767839789390564, 'learning_rate': 0.00011957805128565232, 'epoch': 0.45}


 46%|████▌     | 524/1151 [31:58<29:57,  2.87s/it]

{'loss': 3.1428, 'grad_norm': 0.7916854023933411, 'learning_rate': 0.00011930191711157737, 'epoch': 0.45}


 46%|████▌     | 525/1151 [32:01<29:01,  2.78s/it]

{'loss': 2.4145, 'grad_norm': 0.7758712768554688, 'learning_rate': 0.00011902562997974211, 'epoch': 0.46}


 46%|████▌     | 526/1151 [32:03<29:22,  2.82s/it]

{'loss': 3.0174, 'grad_norm': 0.8145041465759277, 'learning_rate': 0.00011874919207957993, 'epoch': 0.46}


 46%|████▌     | 527/1151 [32:06<29:05,  2.80s/it]

{'loss': 3.2641, 'grad_norm': 0.8353741765022278, 'learning_rate': 0.00011847260560171896, 'epoch': 0.46}


 46%|████▌     | 528/1151 [32:09<29:21,  2.83s/it]

{'loss': 2.887, 'grad_norm': 0.8225101828575134, 'learning_rate': 0.00011819587273796462, 'epoch': 0.46}


 46%|████▌     | 529/1151 [32:12<29:27,  2.84s/it]

{'loss': 3.0935, 'grad_norm': 0.7876485586166382, 'learning_rate': 0.00011791899568128253, 'epoch': 0.46}


 46%|████▌     | 530/1151 [32:15<28:48,  2.78s/it]

{'loss': 2.7166, 'grad_norm': 0.7375379800796509, 'learning_rate': 0.00011764197662578086, 'epoch': 0.46}


 46%|████▌     | 531/1151 [32:17<28:50,  2.79s/it]

{'loss': 2.9551, 'grad_norm': 0.8012605905532837, 'learning_rate': 0.00011736481776669306, 'epoch': 0.46}


 46%|████▌     | 532/1151 [32:20<28:30,  2.76s/it]

{'loss': 2.9112, 'grad_norm': 0.8451550602912903, 'learning_rate': 0.00011708752130036042, 'epoch': 0.46}


 46%|████▋     | 533/1151 [32:23<28:36,  2.78s/it]

{'loss': 3.1371, 'grad_norm': 0.8065146207809448, 'learning_rate': 0.00011681008942421483, 'epoch': 0.46}


 46%|████▋     | 534/1151 [32:26<29:50,  2.90s/it]

{'loss': 3.5044, 'grad_norm': 0.7409862875938416, 'learning_rate': 0.00011653252433676108, 'epoch': 0.46}


 46%|████▋     | 535/1151 [32:29<29:45,  2.90s/it]

{'loss': 3.2117, 'grad_norm': 0.7706514596939087, 'learning_rate': 0.00011625482823755965, 'epoch': 0.46}


 47%|████▋     | 536/1151 [32:32<29:42,  2.90s/it]

{'loss': 3.1256, 'grad_norm': 0.7480859756469727, 'learning_rate': 0.00011597700332720923, 'epoch': 0.47}


 47%|████▋     | 537/1151 [32:35<30:06,  2.94s/it]

{'loss': 3.5741, 'grad_norm': 0.7543289661407471, 'learning_rate': 0.00011569905180732928, 'epoch': 0.47}


 47%|████▋     | 538/1151 [32:38<29:58,  2.93s/it]

{'loss': 3.0902, 'grad_norm': 0.8048223853111267, 'learning_rate': 0.00011542097588054257, 'epoch': 0.47}


 47%|████▋     | 539/1151 [32:41<30:09,  2.96s/it]

{'loss': 3.2273, 'grad_norm': 0.7618952989578247, 'learning_rate': 0.00011514277775045768, 'epoch': 0.47}


 47%|████▋     | 540/1151 [32:44<29:28,  2.90s/it]

{'loss': 3.0549, 'grad_norm': 0.7821393013000488, 'learning_rate': 0.00011486445962165164, 'epoch': 0.47}


 47%|████▋     | 541/1151 [32:47<30:53,  3.04s/it]

{'loss': 3.4857, 'grad_norm': 0.7493245005607605, 'learning_rate': 0.00011458602369965243, 'epoch': 0.47}


 47%|████▋     | 542/1151 [32:49<29:03,  2.86s/it]

{'loss': 2.3412, 'grad_norm': 0.8051238059997559, 'learning_rate': 0.00011430747219092142, 'epoch': 0.47}


 47%|████▋     | 543/1151 [32:52<29:09,  2.88s/it]

{'loss': 3.6283, 'grad_norm': 0.7970044612884521, 'learning_rate': 0.00011402880730283598, 'epoch': 0.47}


 47%|████▋     | 544/1151 [32:55<28:31,  2.82s/it]

{'loss': 2.8598, 'grad_norm': 0.7814434170722961, 'learning_rate': 0.00011375003124367192, 'epoch': 0.47}


 47%|████▋     | 545/1151 [32:58<27:30,  2.72s/it]

{'loss': 2.6224, 'grad_norm': 0.8238996267318726, 'learning_rate': 0.00011347114622258612, 'epoch': 0.47}


 47%|████▋     | 546/1151 [33:01<28:43,  2.85s/it]

{'loss': 3.3406, 'grad_norm': 0.8368936777114868, 'learning_rate': 0.00011319215444959876, 'epoch': 0.47}


 48%|████▊     | 547/1151 [33:04<29:50,  2.96s/it]

{'loss': 3.5393, 'grad_norm': 0.8476670980453491, 'learning_rate': 0.00011291305813557615, 'epoch': 0.47}


 48%|████▊     | 548/1151 [33:07<29:46,  2.96s/it]

{'loss': 3.3337, 'grad_norm': 0.7180793881416321, 'learning_rate': 0.00011263385949221295, 'epoch': 0.48}


 48%|████▊     | 549/1151 [33:10<30:14,  3.01s/it]

{'loss': 3.2972, 'grad_norm': 0.8752597570419312, 'learning_rate': 0.00011235456073201467, 'epoch': 0.48}


 48%|████▊     | 550/1151 [33:13<29:57,  2.99s/it]

{'loss': 3.3985, 'grad_norm': 0.8235746026039124, 'learning_rate': 0.0001120751640682803, 'epoch': 0.48}


 48%|████▊     | 551/1151 [33:16<29:39,  2.97s/it]

{'loss': 3.1307, 'grad_norm': 0.8281412124633789, 'learning_rate': 0.00011179567171508463, 'epoch': 0.48}


 48%|████▊     | 552/1151 [33:18<28:28,  2.85s/it]

{'loss': 2.7362, 'grad_norm': 0.8190372586250305, 'learning_rate': 0.00011151608588726068, 'epoch': 0.48}


 48%|████▊     | 553/1151 [33:21<28:46,  2.89s/it]

{'loss': 3.5059, 'grad_norm': 0.8506483435630798, 'learning_rate': 0.00011123640880038233, 'epoch': 0.48}


 48%|████▊     | 554/1151 [33:25<29:20,  2.95s/it]

{'loss': 3.3772, 'grad_norm': 0.8739099502563477, 'learning_rate': 0.00011095664267074655, 'epoch': 0.48}


 48%|████▊     | 555/1151 [33:28<29:58,  3.02s/it]

{'loss': 3.7653, 'grad_norm': 0.8363804221153259, 'learning_rate': 0.00011067678971535589, 'epoch': 0.48}


 48%|████▊     | 556/1151 [33:30<28:44,  2.90s/it]

{'loss': 3.0082, 'grad_norm': 0.827339768409729, 'learning_rate': 0.0001103968521519011, 'epoch': 0.48}


 48%|████▊     | 557/1151 [33:33<27:40,  2.79s/it]

{'loss': 2.491, 'grad_norm': 0.8073518872261047, 'learning_rate': 0.00011011683219874323, 'epoch': 0.48}


 48%|████▊     | 558/1151 [33:36<28:21,  2.87s/it]

{'loss': 3.5, 'grad_norm': 0.7837657332420349, 'learning_rate': 0.00010983673207489636, 'epoch': 0.48}


 49%|████▊     | 559/1151 [33:39<28:34,  2.90s/it]

{'loss': 3.3545, 'grad_norm': 0.7280610799789429, 'learning_rate': 0.00010955655400000984, 'epoch': 0.49}


 49%|████▊     | 560/1151 [33:42<28:57,  2.94s/it]

{'loss': 3.1779, 'grad_norm': 0.7797881960868835, 'learning_rate': 0.00010927630019435066, 'epoch': 0.49}


 49%|████▊     | 561/1151 [33:45<28:05,  2.86s/it]

{'loss': 2.9667, 'grad_norm': 0.772923469543457, 'learning_rate': 0.00010899597287878609, 'epoch': 0.49}


 49%|████▉     | 562/1151 [33:47<27:50,  2.84s/it]

{'loss': 3.2211, 'grad_norm': 0.8004952669143677, 'learning_rate': 0.00010871557427476583, 'epoch': 0.49}


 49%|████▉     | 563/1151 [33:50<28:26,  2.90s/it]

{'loss': 3.2075, 'grad_norm': 0.7260450720787048, 'learning_rate': 0.00010843510660430447, 'epoch': 0.49}


 49%|████▉     | 564/1151 [33:54<29:29,  3.02s/it]

{'loss': 3.3476, 'grad_norm': 0.8173670172691345, 'learning_rate': 0.00010815457208996407, 'epoch': 0.49}


 49%|████▉     | 565/1151 [33:57<28:48,  2.95s/it]

{'loss': 2.9837, 'grad_norm': 0.765708327293396, 'learning_rate': 0.0001078739729548362, 'epoch': 0.49}


 49%|████▉     | 566/1151 [34:00<30:05,  3.09s/it]

{'loss': 3.5517, 'grad_norm': 0.712260365486145, 'learning_rate': 0.00010759331142252462, 'epoch': 0.49}


 49%|████▉     | 567/1151 [34:03<29:39,  3.05s/it]

{'loss': 3.1391, 'grad_norm': 0.8849897384643555, 'learning_rate': 0.00010731258971712761, 'epoch': 0.49}


 49%|████▉     | 568/1151 [34:06<30:09,  3.10s/it]

{'loss': 3.1771, 'grad_norm': 0.7164983153343201, 'learning_rate': 0.00010703181006322014, 'epoch': 0.49}


 49%|████▉     | 569/1151 [34:09<29:11,  3.01s/it]

{'loss': 3.0606, 'grad_norm': 0.7632298469543457, 'learning_rate': 0.00010675097468583652, 'epoch': 0.49}


 50%|████▉     | 570/1151 [34:12<29:02,  3.00s/it]

{'loss': 3.1643, 'grad_norm': 0.758009135723114, 'learning_rate': 0.00010647008581045263, 'epoch': 0.49}


 50%|████▉     | 571/1151 [34:15<30:28,  3.15s/it]

{'loss': 3.3112, 'grad_norm': 0.6517876386642456, 'learning_rate': 0.0001061891456629682, 'epoch': 0.5}


 50%|████▉     | 572/1151 [34:18<29:55,  3.10s/it]

{'loss': 3.1444, 'grad_norm': 0.8257942795753479, 'learning_rate': 0.00010590815646968934, 'epoch': 0.5}


 50%|████▉     | 573/1151 [34:21<28:44,  2.98s/it]

{'loss': 3.0664, 'grad_norm': 0.8098421096801758, 'learning_rate': 0.00010562712045731084, 'epoch': 0.5}


 50%|████▉     | 574/1151 [34:24<28:48,  3.00s/it]

{'loss': 3.1871, 'grad_norm': 0.7630062699317932, 'learning_rate': 0.00010534603985289844, 'epoch': 0.5}


 50%|████▉     | 575/1151 [34:27<28:19,  2.95s/it]

{'loss': 2.7696, 'grad_norm': 0.7240926027297974, 'learning_rate': 0.00010506491688387127, 'epoch': 0.5}


 50%|█████     | 576/1151 [34:30<29:26,  3.07s/it]

{'loss': 3.2044, 'grad_norm': 0.7018404603004456, 'learning_rate': 0.00010478375377798426, 'epoch': 0.5}


 50%|█████     | 577/1151 [34:33<29:35,  3.09s/it]

{'loss': 3.3403, 'grad_norm': 0.7806302309036255, 'learning_rate': 0.00010450255276331029, 'epoch': 0.5}


 50%|█████     | 578/1151 [34:36<28:39,  3.00s/it]

{'loss': 2.8468, 'grad_norm': 0.7387629151344299, 'learning_rate': 0.00010422131606822269, 'epoch': 0.5}


 50%|█████     | 579/1151 [34:39<27:12,  2.85s/it]

{'loss': 2.357, 'grad_norm': 0.8213130235671997, 'learning_rate': 0.00010394004592137757, 'epoch': 0.5}


 50%|█████     | 580/1151 [34:42<27:30,  2.89s/it]

{'loss': 3.1671, 'grad_norm': 0.8057137727737427, 'learning_rate': 0.00010365874455169611, 'epoch': 0.5}


 50%|█████     | 581/1151 [34:45<28:32,  3.00s/it]

{'loss': 3.4983, 'grad_norm': 0.7974985241889954, 'learning_rate': 0.00010337741418834684, 'epoch': 0.5}


 51%|█████     | 582/1151 [34:48<29:09,  3.08s/it]

{'loss': 3.3312, 'grad_norm': 0.7487960457801819, 'learning_rate': 0.00010309605706072816, 'epoch': 0.51}


 51%|█████     | 583/1151 [34:51<28:40,  3.03s/it]

{'loss': 2.9663, 'grad_norm': 0.8092122673988342, 'learning_rate': 0.00010281467539845051, 'epoch': 0.51}


 51%|█████     | 584/1151 [34:54<28:13,  2.99s/it]

{'loss': 3.0884, 'grad_norm': 0.7804121375083923, 'learning_rate': 0.00010253327143131879, 'epoch': 0.51}


 51%|█████     | 585/1151 [34:57<28:15,  3.00s/it]

{'loss': 3.0741, 'grad_norm': 0.8052957057952881, 'learning_rate': 0.00010225184738931461, 'epoch': 0.51}


 51%|█████     | 586/1151 [35:00<28:56,  3.07s/it]

{'loss': 3.111, 'grad_norm': 0.706733763217926, 'learning_rate': 0.00010197040550257869, 'epoch': 0.51}


 51%|█████     | 587/1151 [35:03<29:08,  3.10s/it]

{'loss': 3.2212, 'grad_norm': 0.7464808821678162, 'learning_rate': 0.0001016889480013931, 'epoch': 0.51}


 51%|█████     | 588/1151 [35:07<29:57,  3.19s/it]

{'loss': 3.6384, 'grad_norm': 0.7650639414787292, 'learning_rate': 0.00010140747711616378, 'epoch': 0.51}


 51%|█████     | 589/1151 [35:10<29:08,  3.11s/it]

{'loss': 3.1069, 'grad_norm': 0.7972902655601501, 'learning_rate': 0.00010112599507740259, 'epoch': 0.51}


 51%|█████▏    | 590/1151 [35:13<28:19,  3.03s/it]

{'loss': 3.1807, 'grad_norm': 0.9141247868537903, 'learning_rate': 0.00010084450411570985, 'epoch': 0.51}


 51%|█████▏    | 591/1151 [35:16<28:42,  3.08s/it]

{'loss': 3.2165, 'grad_norm': 0.792368471622467, 'learning_rate': 0.0001005630064617566, 'epoch': 0.51}


 51%|█████▏    | 592/1151 [35:19<28:57,  3.11s/it]

{'loss': 3.5017, 'grad_norm': 0.7724869251251221, 'learning_rate': 0.00010028150434626681, 'epoch': 0.51}


 52%|█████▏    | 593/1151 [35:22<28:24,  3.05s/it]

{'loss': 3.1477, 'grad_norm': 0.7942788600921631, 'learning_rate': 0.0001, 'epoch': 0.51}


 52%|█████▏    | 594/1151 [35:25<28:11,  3.04s/it]

{'loss': 3.143, 'grad_norm': 0.7316566109657288, 'learning_rate': 9.971849565373317e-05, 'epoch': 0.52}


 52%|█████▏    | 595/1151 [35:28<27:25,  2.96s/it]

{'loss': 3.1624, 'grad_norm': 0.8834524154663086, 'learning_rate': 9.943699353824345e-05, 'epoch': 0.52}


 52%|█████▏    | 596/1151 [35:31<27:01,  2.92s/it]

{'loss': 2.9087, 'grad_norm': 0.7509718537330627, 'learning_rate': 9.915549588429015e-05, 'epoch': 0.52}


 52%|█████▏    | 597/1151 [35:34<28:15,  3.06s/it]

{'loss': 3.6568, 'grad_norm': 0.7114935517311096, 'learning_rate': 9.887400492259742e-05, 'epoch': 0.52}


 52%|█████▏    | 598/1151 [35:37<27:20,  2.97s/it]

{'loss': 3.1257, 'grad_norm': 0.767936646938324, 'learning_rate': 9.859252288383625e-05, 'epoch': 0.52}


 52%|█████▏    | 599/1151 [35:41<30:26,  3.31s/it]

{'loss': 4.1903, 'grad_norm': 0.6568235754966736, 'learning_rate': 9.83110519986069e-05, 'epoch': 0.52}


 52%|█████▏    | 600/1151 [35:44<29:48,  3.25s/it]

{'loss': 3.2216, 'grad_norm': 0.7626500725746155, 'learning_rate': 9.802959449742132e-05, 'epoch': 0.52}


 52%|█████▏    | 601/1151 [35:47<30:05,  3.28s/it]

{'loss': 3.553, 'grad_norm': 0.73641037940979, 'learning_rate': 9.774815261068541e-05, 'epoch': 0.52}


 52%|█████▏    | 602/1151 [35:50<28:24,  3.10s/it]

{'loss': 2.6839, 'grad_norm': 0.749792754650116, 'learning_rate': 9.746672856868123e-05, 'epoch': 0.52}


 52%|█████▏    | 603/1151 [35:53<27:46,  3.04s/it]

{'loss': 2.8956, 'grad_norm': 0.7301895618438721, 'learning_rate': 9.718532460154948e-05, 'epoch': 0.52}


 52%|█████▏    | 604/1151 [35:56<29:09,  3.20s/it]

{'loss': 3.7082, 'grad_norm': 0.7646921873092651, 'learning_rate': 9.690394293927189e-05, 'epoch': 0.52}


 53%|█████▎    | 605/1151 [35:59<27:48,  3.06s/it]

{'loss': 2.6827, 'grad_norm': 0.7890782952308655, 'learning_rate': 9.662258581165319e-05, 'epoch': 0.53}


 53%|█████▎    | 606/1151 [36:02<27:00,  2.97s/it]

{'loss': 2.7815, 'grad_norm': 0.7931520938873291, 'learning_rate': 9.634125544830393e-05, 'epoch': 0.53}


 53%|█████▎    | 607/1151 [36:05<26:03,  2.87s/it]

{'loss': 2.5607, 'grad_norm': 0.7753874063491821, 'learning_rate': 9.605995407862247e-05, 'epoch': 0.53}


 53%|█████▎    | 608/1151 [36:08<26:29,  2.93s/it]

{'loss': 3.6786, 'grad_norm': 0.7701229453086853, 'learning_rate': 9.577868393177732e-05, 'epoch': 0.53}


 53%|█████▎    | 609/1151 [36:10<26:23,  2.92s/it]

{'loss': 3.1822, 'grad_norm': 0.8061325550079346, 'learning_rate': 9.549744723668972e-05, 'epoch': 0.53}


 53%|█████▎    | 610/1151 [36:13<25:32,  2.83s/it]

{'loss': 2.9264, 'grad_norm': 0.8551055192947388, 'learning_rate': 9.521624622201578e-05, 'epoch': 0.53}


 53%|█████▎    | 611/1151 [36:16<25:30,  2.83s/it]

{'loss': 3.0541, 'grad_norm': 0.7788719534873962, 'learning_rate': 9.493508311612874e-05, 'epoch': 0.53}


 53%|█████▎    | 612/1151 [36:19<24:46,  2.76s/it]

{'loss': 2.5662, 'grad_norm': 0.7522552013397217, 'learning_rate': 9.46539601471016e-05, 'epoch': 0.53}


 53%|█████▎    | 613/1151 [36:21<24:51,  2.77s/it]

{'loss': 3.0247, 'grad_norm': 0.7857813835144043, 'learning_rate': 9.43728795426892e-05, 'epoch': 0.53}


 53%|█████▎    | 614/1151 [36:25<26:12,  2.93s/it]

{'loss': 3.1841, 'grad_norm': 0.7457066178321838, 'learning_rate': 9.409184353031068e-05, 'epoch': 0.53}


 53%|█████▎    | 615/1151 [36:28<26:19,  2.95s/it]

{'loss': 3.2641, 'grad_norm': 0.7850664854049683, 'learning_rate': 9.381085433703182e-05, 'epoch': 0.53}


 54%|█████▎    | 616/1151 [36:31<26:05,  2.93s/it]

{'loss': 2.9441, 'grad_norm': 0.7903634905815125, 'learning_rate': 9.35299141895474e-05, 'epoch': 0.53}


 54%|█████▎    | 617/1151 [36:34<26:33,  2.98s/it]

{'loss': 3.5021, 'grad_norm': 0.7596079111099243, 'learning_rate': 9.324902531416349e-05, 'epoch': 0.54}


 54%|█████▎    | 618/1151 [36:37<27:04,  3.05s/it]

{'loss': 3.3966, 'grad_norm': 0.7203525304794312, 'learning_rate': 9.296818993677987e-05, 'epoch': 0.54}


 54%|█████▍    | 619/1151 [36:40<26:15,  2.96s/it]

{'loss': 2.7258, 'grad_norm': 0.7755522131919861, 'learning_rate': 9.268741028287239e-05, 'epoch': 0.54}


 54%|█████▍    | 620/1151 [36:43<26:34,  3.00s/it]

{'loss': 3.2367, 'grad_norm': 0.7137657999992371, 'learning_rate': 9.24066885774754e-05, 'epoch': 0.54}


 54%|█████▍    | 621/1151 [36:46<26:04,  2.95s/it]

{'loss': 3.083, 'grad_norm': 0.7269426584243774, 'learning_rate': 9.212602704516383e-05, 'epoch': 0.54}


 54%|█████▍    | 622/1151 [36:49<26:47,  3.04s/it]

{'loss': 3.7866, 'grad_norm': 0.7693426012992859, 'learning_rate': 9.184542791003594e-05, 'epoch': 0.54}


 54%|█████▍    | 623/1151 [36:52<26:37,  3.03s/it]

{'loss': 3.0205, 'grad_norm': 0.7491158843040466, 'learning_rate': 9.156489339569554e-05, 'epoch': 0.54}


 54%|█████▍    | 624/1151 [36:55<25:52,  2.95s/it]

{'loss': 3.0482, 'grad_norm': 0.8117009997367859, 'learning_rate': 9.128442572523417e-05, 'epoch': 0.54}


 54%|█████▍    | 625/1151 [36:58<26:03,  2.97s/it]

{'loss': 2.9327, 'grad_norm': 0.7539356350898743, 'learning_rate': 9.10040271212139e-05, 'epoch': 0.54}


 54%|█████▍    | 626/1151 [37:00<25:30,  2.92s/it]

{'loss': 2.9732, 'grad_norm': 0.8242088556289673, 'learning_rate': 9.072369980564935e-05, 'epoch': 0.54}


 54%|█████▍    | 627/1151 [37:03<24:33,  2.81s/it]

{'loss': 2.4843, 'grad_norm': 0.7894916534423828, 'learning_rate': 9.04434459999902e-05, 'epoch': 0.54}


 55%|█████▍    | 628/1151 [37:06<26:33,  3.05s/it]

{'loss': 3.5394, 'grad_norm': 0.7012103199958801, 'learning_rate': 9.016326792510365e-05, 'epoch': 0.55}


 55%|█████▍    | 629/1151 [37:09<25:52,  2.97s/it]

{'loss': 2.9573, 'grad_norm': 0.7938840389251709, 'learning_rate': 8.98831678012568e-05, 'epoch': 0.55}


 55%|█████▍    | 630/1151 [37:12<26:08,  3.01s/it]

{'loss': 3.3482, 'grad_norm': 0.7961865067481995, 'learning_rate': 8.960314784809893e-05, 'epoch': 0.55}


 55%|█████▍    | 631/1151 [37:15<26:03,  3.01s/it]

{'loss': 3.3866, 'grad_norm': 0.733866274356842, 'learning_rate': 8.932321028464412e-05, 'epoch': 0.55}


 55%|█████▍    | 632/1151 [37:18<25:54,  3.00s/it]

{'loss': 2.991, 'grad_norm': 0.8289429545402527, 'learning_rate': 8.90433573292535e-05, 'epoch': 0.55}


 55%|█████▍    | 633/1151 [37:21<25:09,  2.92s/it]

{'loss': 2.9141, 'grad_norm': 0.7817811965942383, 'learning_rate': 8.87635911996177e-05, 'epoch': 0.55}


 55%|█████▌    | 634/1151 [37:24<24:25,  2.83s/it]

{'loss': 2.7425, 'grad_norm': 0.8020218014717102, 'learning_rate': 8.848391411273933e-05, 'epoch': 0.55}


 55%|█████▌    | 635/1151 [37:27<24:46,  2.88s/it]

{'loss': 3.0992, 'grad_norm': 0.7254729866981506, 'learning_rate': 8.820432828491542e-05, 'epoch': 0.55}


 55%|█████▌    | 636/1151 [37:30<26:55,  3.14s/it]

{'loss': 3.5047, 'grad_norm': 0.6589810252189636, 'learning_rate': 8.792483593171973e-05, 'epoch': 0.55}


 55%|█████▌    | 637/1151 [37:33<26:15,  3.06s/it]

{'loss': 2.8327, 'grad_norm': 0.7781524062156677, 'learning_rate': 8.764543926798537e-05, 'epoch': 0.55}


 55%|█████▌    | 638/1151 [37:36<25:32,  2.99s/it]

{'loss': 2.6452, 'grad_norm': 0.7964915037155151, 'learning_rate': 8.73661405077871e-05, 'epoch': 0.55}


 56%|█████▌    | 639/1151 [37:39<25:00,  2.93s/it]

{'loss': 3.2094, 'grad_norm': 0.8111984729766846, 'learning_rate': 8.708694186442388e-05, 'epoch': 0.55}


 56%|█████▌    | 640/1151 [37:42<24:14,  2.85s/it]

{'loss': 2.6908, 'grad_norm': 0.8184029459953308, 'learning_rate': 8.680784555040122e-05, 'epoch': 0.56}


 56%|█████▌    | 641/1151 [37:45<24:18,  2.86s/it]

{'loss': 2.9375, 'grad_norm': 0.7775160074234009, 'learning_rate': 8.652885377741393e-05, 'epoch': 0.56}


 56%|█████▌    | 642/1151 [37:47<23:30,  2.77s/it]

{'loss': 2.3243, 'grad_norm': 0.8382955193519592, 'learning_rate': 8.62499687563281e-05, 'epoch': 0.56}


 56%|█████▌    | 643/1151 [37:50<23:15,  2.75s/it]

{'loss': 2.8768, 'grad_norm': 0.8638615012168884, 'learning_rate': 8.597119269716403e-05, 'epoch': 0.56}


 56%|█████▌    | 644/1151 [37:53<24:36,  2.91s/it]

{'loss': 3.2907, 'grad_norm': 0.6965726613998413, 'learning_rate': 8.569252780907862e-05, 'epoch': 0.56}


 56%|█████▌    | 645/1151 [37:56<24:41,  2.93s/it]

{'loss': 3.2404, 'grad_norm': 0.7409859895706177, 'learning_rate': 8.541397630034758e-05, 'epoch': 0.56}


 56%|█████▌    | 646/1151 [37:59<23:47,  2.83s/it]

{'loss': 2.3867, 'grad_norm': 0.761443555355072, 'learning_rate': 8.513554037834835e-05, 'epoch': 0.56}


 56%|█████▌    | 647/1151 [38:02<24:23,  2.90s/it]

{'loss': 3.1579, 'grad_norm': 0.7608804702758789, 'learning_rate': 8.485722224954237e-05, 'epoch': 0.56}


 56%|█████▋    | 648/1151 [38:04<23:58,  2.86s/it]

{'loss': 2.9457, 'grad_norm': 0.8566887974739075, 'learning_rate': 8.457902411945745e-05, 'epoch': 0.56}


 56%|█████▋    | 649/1151 [38:08<24:44,  2.96s/it]

{'loss': 3.1416, 'grad_norm': 0.7573408484458923, 'learning_rate': 8.430094819267072e-05, 'epoch': 0.56}


 56%|█████▋    | 650/1151 [38:11<26:39,  3.19s/it]

{'loss': 3.5892, 'grad_norm': 0.6599993109703064, 'learning_rate': 8.402299667279078e-05, 'epoch': 0.56}


 57%|█████▋    | 651/1151 [38:14<25:39,  3.08s/it]

{'loss': 3.2993, 'grad_norm': 0.8489956855773926, 'learning_rate': 8.374517176244038e-05, 'epoch': 0.57}


 57%|█████▋    | 652/1151 [38:17<24:26,  2.94s/it]

{'loss': 2.3521, 'grad_norm': 0.7414608001708984, 'learning_rate': 8.346747566323895e-05, 'epoch': 0.57}


 57%|█████▋    | 653/1151 [38:20<24:55,  3.00s/it]

{'loss': 3.2998, 'grad_norm': 0.8076521754264832, 'learning_rate': 8.31899105757852e-05, 'epoch': 0.57}


 57%|█████▋    | 654/1151 [38:23<25:06,  3.03s/it]

{'loss': 2.9149, 'grad_norm': 0.794824481010437, 'learning_rate': 8.291247869963959e-05, 'epoch': 0.57}


 57%|█████▋    | 655/1151 [38:26<23:58,  2.90s/it]

{'loss': 2.6913, 'grad_norm': 0.8531906008720398, 'learning_rate': 8.263518223330697e-05, 'epoch': 0.57}


 57%|█████▋    | 656/1151 [38:29<24:26,  2.96s/it]

{'loss': 3.1166, 'grad_norm': 0.7706714272499084, 'learning_rate': 8.235802337421919e-05, 'epoch': 0.57}


 57%|█████▋    | 657/1151 [38:32<24:03,  2.92s/it]

{'loss': 3.0743, 'grad_norm': 0.8272606730461121, 'learning_rate': 8.208100431871749e-05, 'epoch': 0.57}


 57%|█████▋    | 658/1151 [38:35<24:15,  2.95s/it]

{'loss': 3.2872, 'grad_norm': 0.7278018593788147, 'learning_rate': 8.180412726203539e-05, 'epoch': 0.57}


 57%|█████▋    | 659/1151 [38:38<24:24,  2.98s/it]

{'loss': 3.15, 'grad_norm': 0.7571255564689636, 'learning_rate': 8.15273943982811e-05, 'epoch': 0.57}


 57%|█████▋    | 660/1151 [38:40<23:48,  2.91s/it]

{'loss': 3.1143, 'grad_norm': 0.825310468673706, 'learning_rate': 8.12508079204201e-05, 'epoch': 0.57}


 57%|█████▋    | 661/1151 [38:44<25:11,  3.08s/it]

{'loss': 3.6144, 'grad_norm': 0.7313706874847412, 'learning_rate': 8.09743700202579e-05, 'epoch': 0.57}


 58%|█████▊    | 662/1151 [38:48<26:55,  3.30s/it]

{'loss': 3.774, 'grad_norm': 0.6553947925567627, 'learning_rate': 8.06980828884227e-05, 'epoch': 0.57}


 58%|█████▊    | 663/1151 [38:51<25:55,  3.19s/it]

{'loss': 3.3761, 'grad_norm': 0.8378012776374817, 'learning_rate': 8.04219487143477e-05, 'epoch': 0.58}


 58%|█████▊    | 664/1151 [38:54<25:26,  3.14s/it]

{'loss': 3.2643, 'grad_norm': 0.8288149833679199, 'learning_rate': 8.01459696862542e-05, 'epoch': 0.58}


 58%|█████▊    | 665/1151 [38:57<25:08,  3.10s/it]

{'loss': 3.2544, 'grad_norm': 0.7637948989868164, 'learning_rate': 7.987014799113397e-05, 'epoch': 0.58}


 58%|█████▊    | 666/1151 [39:00<25:08,  3.11s/it]

{'loss': 3.1882, 'grad_norm': 0.7879970669746399, 'learning_rate': 7.959448581473205e-05, 'epoch': 0.58}


 58%|█████▊    | 667/1151 [39:03<24:24,  3.03s/it]

{'loss': 3.0459, 'grad_norm': 0.844001293182373, 'learning_rate': 7.931898534152928e-05, 'epoch': 0.58}


 58%|█████▊    | 668/1151 [39:06<24:31,  3.05s/it]

{'loss': 3.3069, 'grad_norm': 0.7551218867301941, 'learning_rate': 7.904364875472513e-05, 'epoch': 0.58}


 58%|█████▊    | 669/1151 [39:08<23:43,  2.95s/it]

{'loss': 2.9882, 'grad_norm': 0.8388872742652893, 'learning_rate': 7.876847823622042e-05, 'epoch': 0.58}


 58%|█████▊    | 670/1151 [39:12<24:59,  3.12s/it]

{'loss': 3.711, 'grad_norm': 0.7364319562911987, 'learning_rate': 7.849347596659981e-05, 'epoch': 0.58}


 58%|█████▊    | 671/1151 [39:15<24:45,  3.09s/it]

{'loss': 3.3017, 'grad_norm': 0.7604413628578186, 'learning_rate': 7.821864412511485e-05, 'epoch': 0.58}


 58%|█████▊    | 672/1151 [39:18<24:27,  3.06s/it]

{'loss': 3.1231, 'grad_norm': 0.7280453443527222, 'learning_rate': 7.794398488966643e-05, 'epoch': 0.58}


 58%|█████▊    | 673/1151 [39:21<24:46,  3.11s/it]

{'loss': 3.2918, 'grad_norm': 0.735737144947052, 'learning_rate': 7.766950043678764e-05, 'epoch': 0.58}


 59%|█████▊    | 674/1151 [39:24<24:43,  3.11s/it]

{'loss': 3.3832, 'grad_norm': 0.7397370934486389, 'learning_rate': 7.739519294162652e-05, 'epoch': 0.59}


 59%|█████▊    | 675/1151 [39:27<24:23,  3.07s/it]

{'loss': 3.0834, 'grad_norm': 0.7561460137367249, 'learning_rate': 7.712106457792884e-05, 'epoch': 0.59}


 59%|█████▊    | 676/1151 [39:30<24:34,  3.10s/it]

{'loss': 3.0153, 'grad_norm': 0.7463605999946594, 'learning_rate': 7.684711751802076e-05, 'epoch': 0.59}


 59%|█████▉    | 677/1151 [39:33<24:18,  3.08s/it]

{'loss': 3.2771, 'grad_norm': 0.7134917378425598, 'learning_rate': 7.65733539327918e-05, 'epoch': 0.59}


 59%|█████▉    | 678/1151 [39:36<23:31,  2.98s/it]

{'loss': 2.8922, 'grad_norm': 0.8219555616378784, 'learning_rate': 7.629977599167749e-05, 'epoch': 0.59}


 59%|█████▉    | 679/1151 [39:39<23:05,  2.94s/it]

{'loss': 2.9733, 'grad_norm': 0.7595130205154419, 'learning_rate': 7.602638586264219e-05, 'epoch': 0.59}


 59%|█████▉    | 680/1151 [39:42<23:54,  3.04s/it]

{'loss': 3.2328, 'grad_norm': 0.7245994806289673, 'learning_rate': 7.5753185712162e-05, 'epoch': 0.59}


 59%|█████▉    | 681/1151 [39:45<23:50,  3.04s/it]

{'loss': 3.4877, 'grad_norm': 0.7593740820884705, 'learning_rate': 7.54801777052076e-05, 'epoch': 0.59}


 59%|█████▉    | 682/1151 [39:48<23:03,  2.95s/it]

{'loss': 3.2344, 'grad_norm': 0.8982473015785217, 'learning_rate': 7.520736400522684e-05, 'epoch': 0.59}


 59%|█████▉    | 683/1151 [39:51<23:10,  2.97s/it]

{'loss': 3.5003, 'grad_norm': 0.8188939094543457, 'learning_rate': 7.493474677412794e-05, 'epoch': 0.59}


 59%|█████▉    | 684/1151 [39:54<22:49,  2.93s/it]

{'loss': 2.9457, 'grad_norm': 0.8181058764457703, 'learning_rate': 7.466232817226224e-05, 'epoch': 0.59}


 60%|█████▉    | 685/1151 [39:57<23:11,  2.99s/it]

{'loss': 3.2474, 'grad_norm': 0.7836766242980957, 'learning_rate': 7.439011035840685e-05, 'epoch': 0.59}


 60%|█████▉    | 686/1151 [40:00<23:21,  3.01s/it]

{'loss': 3.4201, 'grad_norm': 0.7535693049430847, 'learning_rate': 7.411809548974792e-05, 'epoch': 0.6}


 60%|█████▉    | 687/1151 [40:03<22:52,  2.96s/it]

{'loss': 3.012, 'grad_norm': 0.7635622024536133, 'learning_rate': 7.384628572186333e-05, 'epoch': 0.6}


 60%|█████▉    | 688/1151 [40:06<22:58,  2.98s/it]

{'loss': 3.0875, 'grad_norm': 0.7428261637687683, 'learning_rate': 7.357468320870559e-05, 'epoch': 0.6}


 60%|█████▉    | 689/1151 [40:09<22:59,  2.99s/it]

{'loss': 3.1495, 'grad_norm': 0.8044684529304504, 'learning_rate': 7.330329010258483e-05, 'epoch': 0.6}


 60%|█████▉    | 690/1151 [40:12<23:31,  3.06s/it]

{'loss': 3.5594, 'grad_norm': 0.7353101968765259, 'learning_rate': 7.303210855415181e-05, 'epoch': 0.6}


 60%|██████    | 691/1151 [40:15<23:41,  3.09s/it]

{'loss': 3.0971, 'grad_norm': 0.7460084557533264, 'learning_rate': 7.276114071238069e-05, 'epoch': 0.6}


 60%|██████    | 692/1151 [40:18<23:11,  3.03s/it]

{'loss': 3.0284, 'grad_norm': 0.7470837235450745, 'learning_rate': 7.24903887245522e-05, 'epoch': 0.6}


 60%|██████    | 693/1151 [40:22<23:42,  3.11s/it]

{'loss': 3.2872, 'grad_norm': 0.7314863204956055, 'learning_rate': 7.221985473623654e-05, 'epoch': 0.6}


                                                  
{'eval_loss': 3.036646604537964, 'eval_runtime': 136.5581, 'eval_samples_per_second': 8.429, 'eval_steps_per_second': 1.054, 'epoch': 0.6}


 60%|██████    | 694/1151 [42:41<5:34:43, 43.95s/it]

{'loss': 2.749, 'grad_norm': 0.8176440000534058, 'learning_rate': 7.194954089127628e-05, 'epoch': 0.6}


 60%|██████    | 695/1151 [42:44<4:00:41, 31.67s/it]

{'loss': 3.1701, 'grad_norm': 0.7904983758926392, 'learning_rate': 7.16794493317696e-05, 'epoch': 0.6}


 60%|██████    | 696/1151 [42:47<2:54:35, 23.02s/it]

{'loss': 2.9776, 'grad_norm': 0.789238691329956, 'learning_rate': 7.140958219805312e-05, 'epoch': 0.6}


 61%|██████    | 697/1151 [42:49<2:07:58, 16.91s/it]

{'loss': 2.309, 'grad_norm': 0.8123688101768494, 'learning_rate': 7.113994162868496e-05, 'epoch': 0.61}


 61%|██████    | 698/1151 [42:53<1:37:40, 12.94s/it]

{'loss': 3.4717, 'grad_norm': 0.7379223108291626, 'learning_rate': 7.087052976042789e-05, 'epoch': 0.61}


 61%|██████    | 699/1151 [42:56<1:15:29, 10.02s/it]

{'loss': 3.2674, 'grad_norm': 0.7696315050125122, 'learning_rate': 7.060134872823234e-05, 'epoch': 0.61}


 61%|██████    | 700/1151 [42:59<59:58,  7.98s/it]  

{'loss': 3.6069, 'grad_norm': 0.8332427740097046, 'learning_rate': 7.033240066521944e-05, 'epoch': 0.61}


 61%|██████    | 701/1151 [43:03<48:42,  6.49s/it]

{'loss': 2.9704, 'grad_norm': 0.7556244730949402, 'learning_rate': 7.006368770266421e-05, 'epoch': 0.61}


 61%|██████    | 702/1151 [43:06<42:23,  5.66s/it]

{'loss': 3.6535, 'grad_norm': 0.7534502744674683, 'learning_rate': 6.979521196997863e-05, 'epoch': 0.61}


 61%|██████    | 703/1151 [43:09<36:02,  4.83s/it]

{'loss': 3.2211, 'grad_norm': 0.8651977777481079, 'learning_rate': 6.952697559469464e-05, 'epoch': 0.61}


 61%|██████    | 704/1151 [43:12<31:34,  4.24s/it]

{'loss': 3.087, 'grad_norm': 0.8372740149497986, 'learning_rate': 6.925898070244752e-05, 'epoch': 0.61}


 61%|██████▏   | 705/1151 [43:15<27:41,  3.73s/it]

{'loss': 2.7905, 'grad_norm': 0.8629733920097351, 'learning_rate': 6.899122941695893e-05, 'epoch': 0.61}


 61%|██████▏   | 706/1151 [43:18<26:33,  3.58s/it]

{'loss': 3.2869, 'grad_norm': 0.7923964262008667, 'learning_rate': 6.872372386001992e-05, 'epoch': 0.61}


 61%|██████▏   | 707/1151 [43:20<24:33,  3.32s/it]

{'loss': 2.8732, 'grad_norm': 0.8386919498443604, 'learning_rate': 6.845646615147445e-05, 'epoch': 0.61}


 62%|██████▏   | 708/1151 [43:23<23:06,  3.13s/it]

{'loss': 2.8049, 'grad_norm': 0.8936967253684998, 'learning_rate': 6.818945840920234e-05, 'epoch': 0.61}


 62%|██████▏   | 709/1151 [43:26<22:25,  3.04s/it]

{'loss': 3.0858, 'grad_norm': 0.8319125175476074, 'learning_rate': 6.792270274910246e-05, 'epoch': 0.62}


 62%|██████▏   | 710/1151 [43:29<21:44,  2.96s/it]

{'loss': 2.9163, 'grad_norm': 0.7804619669914246, 'learning_rate': 6.765620128507619e-05, 'epoch': 0.62}


 62%|██████▏   | 711/1151 [43:32<21:59,  3.00s/it]

{'loss': 3.4196, 'grad_norm': 0.7931954860687256, 'learning_rate': 6.738995612901051e-05, 'epoch': 0.62}


 62%|██████▏   | 712/1151 [43:35<22:22,  3.06s/it]

{'loss': 3.4932, 'grad_norm': 0.7804611921310425, 'learning_rate': 6.712396939076125e-05, 'epoch': 0.62}


 62%|██████▏   | 713/1151 [43:38<21:35,  2.96s/it]

{'loss': 2.5932, 'grad_norm': 0.7970877289772034, 'learning_rate': 6.685824317813643e-05, 'epoch': 0.62}


 62%|██████▏   | 714/1151 [43:41<21:07,  2.90s/it]

{'loss': 2.9315, 'grad_norm': 0.7738475799560547, 'learning_rate': 6.659277959687954e-05, 'epoch': 0.62}


 62%|██████▏   | 715/1151 [43:44<21:14,  2.92s/it]

{'loss': 3.2733, 'grad_norm': 0.8314353227615356, 'learning_rate': 6.632758075065288e-05, 'epoch': 0.62}


 62%|██████▏   | 716/1151 [43:46<20:59,  2.90s/it]

{'loss': 2.9802, 'grad_norm': 0.794183075428009, 'learning_rate': 6.606264874102079e-05, 'epoch': 0.62}


 62%|██████▏   | 717/1151 [43:49<20:31,  2.84s/it]

{'loss': 2.9241, 'grad_norm': 0.8931584358215332, 'learning_rate': 6.579798566743314e-05, 'epoch': 0.62}


 62%|██████▏   | 718/1151 [43:52<20:37,  2.86s/it]

{'loss': 3.2755, 'grad_norm': 0.797688901424408, 'learning_rate': 6.553359362720858e-05, 'epoch': 0.62}


 62%|██████▏   | 719/1151 [43:55<21:01,  2.92s/it]

{'loss': 3.4722, 'grad_norm': 0.8738051652908325, 'learning_rate': 6.526947471551798e-05, 'epoch': 0.62}


 63%|██████▎   | 720/1151 [43:58<20:32,  2.86s/it]

{'loss': 2.9724, 'grad_norm': 0.8363655209541321, 'learning_rate': 6.500563102536777e-05, 'epoch': 0.63}


 63%|██████▎   | 721/1151 [44:01<21:01,  2.93s/it]

{'loss': 3.0065, 'grad_norm': 0.7819164395332336, 'learning_rate': 6.474206464758351e-05, 'epoch': 0.63}


 63%|██████▎   | 722/1151 [44:04<20:24,  2.86s/it]

{'loss': 2.9588, 'grad_norm': 0.8158928155899048, 'learning_rate': 6.447877767079298e-05, 'epoch': 0.63}


 63%|██████▎   | 723/1151 [44:06<20:25,  2.86s/it]

{'loss': 3.0337, 'grad_norm': 0.8173909187316895, 'learning_rate': 6.421577218141008e-05, 'epoch': 0.63}


 63%|██████▎   | 724/1151 [44:09<20:28,  2.88s/it]

{'loss': 3.1005, 'grad_norm': 0.7475317120552063, 'learning_rate': 6.395305026361795e-05, 'epoch': 0.63}


 63%|██████▎   | 725/1151 [44:12<19:54,  2.80s/it]

{'loss': 2.6484, 'grad_norm': 0.8159957528114319, 'learning_rate': 6.369061399935255e-05, 'epoch': 0.63}


 63%|██████▎   | 726/1151 [44:15<20:35,  2.91s/it]

{'loss': 3.486, 'grad_norm': 0.7323564291000366, 'learning_rate': 6.342846546828625e-05, 'epoch': 0.63}


 63%|██████▎   | 727/1151 [44:18<20:06,  2.85s/it]

{'loss': 2.4837, 'grad_norm': 0.7542575001716614, 'learning_rate': 6.31666067478113e-05, 'epoch': 0.63}


 63%|██████▎   | 728/1151 [44:21<20:19,  2.88s/it]

{'loss': 3.2628, 'grad_norm': 0.7926567196846008, 'learning_rate': 6.290503991302324e-05, 'epoch': 0.63}


 63%|██████▎   | 729/1151 [44:24<20:52,  2.97s/it]

{'loss': 3.5768, 'grad_norm': 0.7346655130386353, 'learning_rate': 6.264376703670464e-05, 'epoch': 0.63}


 63%|██████▎   | 730/1151 [44:27<20:16,  2.89s/it]

{'loss': 2.8334, 'grad_norm': 0.7885376811027527, 'learning_rate': 6.238279018930864e-05, 'epoch': 0.63}


 64%|██████▎   | 731/1151 [44:30<20:29,  2.93s/it]

{'loss': 3.6725, 'grad_norm': 0.8692892789840698, 'learning_rate': 6.21221114389424e-05, 'epoch': 0.63}


 64%|██████▎   | 732/1151 [44:32<20:00,  2.86s/it]

{'loss': 2.8694, 'grad_norm': 0.7966857552528381, 'learning_rate': 6.18617328513509e-05, 'epoch': 0.64}


 64%|██████▎   | 733/1151 [44:35<20:29,  2.94s/it]

{'loss': 3.2426, 'grad_norm': 0.7115288376808167, 'learning_rate': 6.160165648990048e-05, 'epoch': 0.64}


 64%|██████▍   | 734/1151 [44:38<20:12,  2.91s/it]

{'loss': 2.5965, 'grad_norm': 0.8084887862205505, 'learning_rate': 6.134188441556241e-05, 'epoch': 0.64}


 64%|██████▍   | 735/1151 [44:41<19:44,  2.85s/it]

{'loss': 2.9511, 'grad_norm': 0.8377265930175781, 'learning_rate': 6.108241868689675e-05, 'epoch': 0.64}


 64%|██████▍   | 736/1151 [44:43<18:49,  2.72s/it]

{'loss': 2.0205, 'grad_norm': 0.8886970281600952, 'learning_rate': 6.0823261360035844e-05, 'epoch': 0.64}


 64%|██████▍   | 737/1151 [44:46<18:57,  2.75s/it]

{'loss': 2.8853, 'grad_norm': 0.9136629700660706, 'learning_rate': 6.0564414488668165e-05, 'epoch': 0.64}


 64%|██████▍   | 738/1151 [44:49<19:04,  2.77s/it]

{'loss': 3.1866, 'grad_norm': 0.8447416424751282, 'learning_rate': 6.030588012402194e-05, 'epoch': 0.64}


 64%|██████▍   | 739/1151 [44:52<19:34,  2.85s/it]

{'loss': 3.2348, 'grad_norm': 0.7490932941436768, 'learning_rate': 6.0047660314849006e-05, 'epoch': 0.64}


 64%|██████▍   | 740/1151 [44:55<18:44,  2.74s/it]

{'loss': 2.1959, 'grad_norm': 0.8232767581939697, 'learning_rate': 5.9789757107408416e-05, 'epoch': 0.64}


 64%|██████▍   | 741/1151 [44:57<18:04,  2.65s/it]

{'loss': 2.2771, 'grad_norm': 0.8443483710289001, 'learning_rate': 5.953217254545039e-05, 'epoch': 0.64}


 64%|██████▍   | 742/1151 [45:00<18:40,  2.74s/it]

{'loss': 3.0303, 'grad_norm': 0.7758704423904419, 'learning_rate': 5.92749086702001e-05, 'epoch': 0.64}


 65%|██████▍   | 743/1151 [45:03<19:23,  2.85s/it]

{'loss': 2.9592, 'grad_norm': 0.7619920372962952, 'learning_rate': 5.901796752034128e-05, 'epoch': 0.65}


 65%|██████▍   | 744/1151 [45:06<18:38,  2.75s/it]

{'loss': 2.4972, 'grad_norm': 0.8719333410263062, 'learning_rate': 5.8761351132000295e-05, 'epoch': 0.65}


 65%|██████▍   | 745/1151 [45:09<19:07,  2.83s/it]

{'loss': 3.2778, 'grad_norm': 0.8283442258834839, 'learning_rate': 5.85050615387301e-05, 'epoch': 0.65}


 65%|██████▍   | 746/1151 [45:11<19:02,  2.82s/it]

{'loss': 2.7212, 'grad_norm': 0.7611469030380249, 'learning_rate': 5.824910077149371e-05, 'epoch': 0.65}


 65%|██████▍   | 747/1151 [45:14<18:50,  2.80s/it]

{'loss': 2.8977, 'grad_norm': 0.7864307761192322, 'learning_rate': 5.799347085864851e-05, 'epoch': 0.65}


 65%|██████▍   | 748/1151 [45:17<19:23,  2.89s/it]

{'loss': 3.3297, 'grad_norm': 0.8137682676315308, 'learning_rate': 5.773817382593008e-05, 'epoch': 0.65}


 65%|██████▌   | 749/1151 [45:20<18:53,  2.82s/it]

{'loss': 2.8162, 'grad_norm': 0.8002147674560547, 'learning_rate': 5.748321169643596e-05, 'epoch': 0.65}


 65%|██████▌   | 750/1151 [45:23<18:33,  2.78s/it]

{'loss': 2.6306, 'grad_norm': 0.8143850564956665, 'learning_rate': 5.72285864906098e-05, 'epoch': 0.65}


 65%|██████▌   | 751/1151 [45:26<19:52,  2.98s/it]

{'loss': 3.6907, 'grad_norm': 0.7359307408332825, 'learning_rate': 5.697430022622542e-05, 'epoch': 0.65}


 65%|██████▌   | 752/1151 [45:29<19:06,  2.87s/it]

{'loss': 2.5661, 'grad_norm': 0.8881317973136902, 'learning_rate': 5.672035491837053e-05, 'epoch': 0.65}


 65%|██████▌   | 753/1151 [45:32<19:09,  2.89s/it]

{'loss': 2.9008, 'grad_norm': 0.8037918210029602, 'learning_rate': 5.6466752579431016e-05, 'epoch': 0.65}


 66%|██████▌   | 754/1151 [45:35<19:19,  2.92s/it]

{'loss': 3.0686, 'grad_norm': 0.7773424983024597, 'learning_rate': 5.6213495219074975e-05, 'epoch': 0.65}


 66%|██████▌   | 755/1151 [45:37<19:11,  2.91s/it]

{'loss': 3.4329, 'grad_norm': 0.800086259841919, 'learning_rate': 5.596058484423656e-05, 'epoch': 0.66}


 66%|██████▌   | 756/1151 [45:40<19:20,  2.94s/it]

{'loss': 3.5493, 'grad_norm': 0.7832162976264954, 'learning_rate': 5.570802345910044e-05, 'epoch': 0.66}


 66%|██████▌   | 757/1151 [45:43<19:09,  2.92s/it]

{'loss': 2.8789, 'grad_norm': 0.849820077419281, 'learning_rate': 5.545581306508556e-05, 'epoch': 0.66}


 66%|██████▌   | 758/1151 [45:47<19:35,  2.99s/it]

{'loss': 3.3322, 'grad_norm': 0.7206489443778992, 'learning_rate': 5.520395566082955e-05, 'epoch': 0.66}


 66%|██████▌   | 759/1151 [45:49<19:12,  2.94s/it]

{'loss': 2.6984, 'grad_norm': 0.784870445728302, 'learning_rate': 5.495245324217271e-05, 'epoch': 0.66}


 66%|██████▌   | 760/1151 [45:52<19:17,  2.96s/it]

{'loss': 3.4593, 'grad_norm': 0.772601842880249, 'learning_rate': 5.4701307802142244e-05, 'epoch': 0.66}


 66%|██████▌   | 761/1151 [45:55<18:55,  2.91s/it]

{'loss': 2.9801, 'grad_norm': 0.8015111088752747, 'learning_rate': 5.44505213309366e-05, 'epoch': 0.66}


 66%|██████▌   | 762/1151 [45:58<19:11,  2.96s/it]

{'loss': 2.9572, 'grad_norm': 0.7437425255775452, 'learning_rate': 5.420009581590947e-05, 'epoch': 0.66}


 66%|██████▋   | 763/1151 [46:01<18:24,  2.85s/it]

{'loss': 2.5906, 'grad_norm': 0.8120704889297485, 'learning_rate': 5.3950033241554146e-05, 'epoch': 0.66}


 66%|██████▋   | 764/1151 [46:03<18:04,  2.80s/it]

{'loss': 2.9319, 'grad_norm': 0.826380729675293, 'learning_rate': 5.3700335589487925e-05, 'epoch': 0.66}


 66%|██████▋   | 765/1151 [46:06<17:48,  2.77s/it]

{'loss': 2.557, 'grad_norm': 0.8420459628105164, 'learning_rate': 5.345100483843617e-05, 'epoch': 0.66}


 67%|██████▋   | 766/1151 [46:10<18:51,  2.94s/it]

{'loss': 3.0387, 'grad_norm': 0.7844605445861816, 'learning_rate': 5.320204296421675e-05, 'epoch': 0.67}


 67%|██████▋   | 767/1151 [46:13<19:18,  3.02s/it]

{'loss': 3.0675, 'grad_norm': 0.8177216649055481, 'learning_rate': 5.2953451939724454e-05, 'epoch': 0.67}


 67%|██████▋   | 768/1151 [46:16<19:13,  3.01s/it]

{'loss': 3.2121, 'grad_norm': 0.7832437753677368, 'learning_rate': 5.2705233734915196e-05, 'epoch': 0.67}


 67%|██████▋   | 769/1151 [46:19<18:50,  2.96s/it]

{'loss': 2.4926, 'grad_norm': 0.7937390804290771, 'learning_rate': 5.245739031679048e-05, 'epoch': 0.67}


 67%|██████▋   | 770/1151 [46:21<18:16,  2.88s/it]

{'loss': 2.4389, 'grad_norm': 0.8083308935165405, 'learning_rate': 5.220992364938193e-05, 'epoch': 0.67}


 67%|██████▋   | 771/1151 [46:24<17:59,  2.84s/it]

{'loss': 2.794, 'grad_norm': 0.8336740732192993, 'learning_rate': 5.19628356937355e-05, 'epoch': 0.67}


 67%|██████▋   | 772/1151 [46:27<17:28,  2.77s/it]

{'loss': 2.5897, 'grad_norm': 0.8260531425476074, 'learning_rate': 5.1716128407896035e-05, 'epoch': 0.67}


 67%|██████▋   | 773/1151 [46:30<17:56,  2.85s/it]

{'loss': 3.286, 'grad_norm': 0.801293134689331, 'learning_rate': 5.146980374689192e-05, 'epoch': 0.67}


 67%|██████▋   | 774/1151 [46:33<19:32,  3.11s/it]

{'loss': 3.8184, 'grad_norm': 0.7669059038162231, 'learning_rate': 5.122386366271923e-05, 'epoch': 0.67}


 67%|██████▋   | 775/1151 [46:37<20:12,  3.23s/it]

{'loss': 3.6736, 'grad_norm': 0.7675580978393555, 'learning_rate': 5.097831010432666e-05, 'epoch': 0.67}


 67%|██████▋   | 776/1151 [46:40<19:48,  3.17s/it]

{'loss': 3.1231, 'grad_norm': 0.8622584939002991, 'learning_rate': 5.073314501759977e-05, 'epoch': 0.67}


 68%|██████▊   | 777/1151 [46:43<19:39,  3.15s/it]

{'loss': 3.2121, 'grad_norm': 0.795464038848877, 'learning_rate': 5.048837034534566e-05, 'epoch': 0.67}


 68%|██████▊   | 778/1151 [46:46<18:53,  3.04s/it]

{'loss': 3.0452, 'grad_norm': 0.7945275902748108, 'learning_rate': 5.024398802727772e-05, 'epoch': 0.68}


 68%|██████▊   | 779/1151 [46:49<18:40,  3.01s/it]

{'loss': 3.2495, 'grad_norm': 0.7988882064819336, 'learning_rate': 5.000000000000002e-05, 'epoch': 0.68}


 68%|██████▊   | 780/1151 [46:52<18:51,  3.05s/it]

{'loss': 3.1876, 'grad_norm': 0.7647437453269958, 'learning_rate': 4.9756408196992086e-05, 'epoch': 0.68}


 68%|██████▊   | 781/1151 [46:55<18:12,  2.95s/it]

{'loss': 2.8131, 'grad_norm': 0.8351620435714722, 'learning_rate': 4.951321454859369e-05, 'epoch': 0.68}


 68%|██████▊   | 782/1151 [46:58<18:37,  3.03s/it]

{'loss': 3.4966, 'grad_norm': 0.7685355544090271, 'learning_rate': 4.9270420981989294e-05, 'epoch': 0.68}


 68%|██████▊   | 783/1151 [47:01<19:11,  3.13s/it]

{'loss': 3.8453, 'grad_norm': 0.8037889003753662, 'learning_rate': 4.902802942119293e-05, 'epoch': 0.68}


 68%|██████▊   | 784/1151 [47:04<19:09,  3.13s/it]

{'loss': 3.4185, 'grad_norm': 0.7397553324699402, 'learning_rate': 4.878604178703308e-05, 'epoch': 0.68}


 68%|██████▊   | 785/1151 [47:07<18:48,  3.08s/it]

{'loss': 2.9115, 'grad_norm': 0.7972074747085571, 'learning_rate': 4.854445999713715e-05, 'epoch': 0.68}


 68%|██████▊   | 786/1151 [47:10<18:29,  3.04s/it]

{'loss': 3.0987, 'grad_norm': 0.8050864338874817, 'learning_rate': 4.830328596591649e-05, 'epoch': 0.68}


 68%|██████▊   | 787/1151 [47:13<18:04,  2.98s/it]

{'loss': 3.075, 'grad_norm': 0.8579866290092468, 'learning_rate': 4.806252160455125e-05, 'epoch': 0.68}


 68%|██████▊   | 788/1151 [47:16<17:25,  2.88s/it]

{'loss': 2.8245, 'grad_norm': 0.8452641367912292, 'learning_rate': 4.7822168820975066e-05, 'epoch': 0.68}


 69%|██████▊   | 789/1151 [47:18<17:08,  2.84s/it]

{'loss': 2.9516, 'grad_norm': 0.846540629863739, 'learning_rate': 4.758222951986002e-05, 'epoch': 0.69}


 69%|██████▊   | 790/1151 [47:21<16:29,  2.74s/it]

{'loss': 2.1496, 'grad_norm': 0.9065106511116028, 'learning_rate': 4.7342705602601645e-05, 'epoch': 0.69}


 69%|██████▊   | 791/1151 [47:24<17:11,  2.87s/it]

{'loss': 3.3702, 'grad_norm': 0.7448009252548218, 'learning_rate': 4.710359896730379e-05, 'epoch': 0.69}


 69%|██████▉   | 792/1151 [47:27<17:47,  2.97s/it]

{'loss': 3.3785, 'grad_norm': 0.7849377989768982, 'learning_rate': 4.6864911508763356e-05, 'epoch': 0.69}


 69%|██████▉   | 793/1151 [47:30<17:01,  2.85s/it]

{'loss': 2.4802, 'grad_norm': 0.8320164084434509, 'learning_rate': 4.662664511845568e-05, 'epoch': 0.69}


 69%|██████▉   | 794/1151 [47:33<17:01,  2.86s/it]

{'loss': 3.3078, 'grad_norm': 0.8044442534446716, 'learning_rate': 4.638880168451938e-05, 'epoch': 0.69}


 69%|██████▉   | 795/1151 [47:36<16:52,  2.84s/it]

{'loss': 2.8809, 'grad_norm': 0.7615096569061279, 'learning_rate': 4.6151383091741115e-05, 'epoch': 0.69}


 69%|██████▉   | 796/1151 [47:39<17:17,  2.92s/it]

{'loss': 3.3193, 'grad_norm': 0.7737722992897034, 'learning_rate': 4.591439122154115e-05, 'epoch': 0.69}


 69%|██████▉   | 797/1151 [47:42<17:42,  3.00s/it]

{'loss': 3.2551, 'grad_norm': 0.7283592820167542, 'learning_rate': 4.567782795195816e-05, 'epoch': 0.69}


 69%|██████▉   | 798/1151 [47:44<16:58,  2.88s/it]

{'loss': 2.7805, 'grad_norm': 0.8633520007133484, 'learning_rate': 4.544169515763418e-05, 'epoch': 0.69}


 69%|██████▉   | 799/1151 [47:47<16:46,  2.86s/it]

{'loss': 2.7396, 'grad_norm': 0.7746496796607971, 'learning_rate': 4.520599470980015e-05, 'epoch': 0.69}


 70%|██████▉   | 800/1151 [47:50<17:17,  2.96s/it]

{'loss': 3.3713, 'grad_norm': 0.7692784667015076, 'learning_rate': 4.497072847626087e-05, 'epoch': 0.69}


 70%|██████▉   | 801/1151 [47:53<16:53,  2.89s/it]

{'loss': 2.8446, 'grad_norm': 0.8677606582641602, 'learning_rate': 4.4735898321380144e-05, 'epoch': 0.7}


 70%|██████▉   | 802/1151 [47:56<16:56,  2.91s/it]

{'loss': 3.1341, 'grad_norm': 0.7662348747253418, 'learning_rate': 4.4501506106066046e-05, 'epoch': 0.7}


 70%|██████▉   | 803/1151 [47:59<16:20,  2.82s/it]

{'loss': 2.9541, 'grad_norm': 0.7549664378166199, 'learning_rate': 4.426755368775637e-05, 'epoch': 0.7}


 70%|██████▉   | 804/1151 [48:02<16:23,  2.83s/it]

{'loss': 2.8577, 'grad_norm': 0.807948887348175, 'learning_rate': 4.403404292040357e-05, 'epoch': 0.7}


 70%|██████▉   | 805/1151 [48:05<16:47,  2.91s/it]

{'loss': 3.212, 'grad_norm': 0.7168009877204895, 'learning_rate': 4.38009756544603e-05, 'epoch': 0.7}


 70%|███████   | 806/1151 [48:07<16:19,  2.84s/it]

{'loss': 2.3104, 'grad_norm': 0.812061071395874, 'learning_rate': 4.3568353736864765e-05, 'epoch': 0.7}


 70%|███████   | 807/1151 [48:10<16:26,  2.87s/it]

{'loss': 3.0262, 'grad_norm': 0.7794392108917236, 'learning_rate': 4.333617901102591e-05, 'epoch': 0.7}


 70%|███████   | 808/1151 [48:13<16:33,  2.90s/it]

{'loss': 3.0486, 'grad_norm': 0.748249888420105, 'learning_rate': 4.3104453316808935e-05, 'epoch': 0.7}


 70%|███████   | 809/1151 [48:16<16:36,  2.91s/it]

{'loss': 3.0302, 'grad_norm': 0.7876554131507874, 'learning_rate': 4.287317849052075e-05, 'epoch': 0.7}


 70%|███████   | 810/1151 [48:19<16:43,  2.94s/it]

{'loss': 3.139, 'grad_norm': 0.7936844229698181, 'learning_rate': 4.264235636489542e-05, 'epoch': 0.7}


 70%|███████   | 811/1151 [48:22<16:21,  2.89s/it]

{'loss': 2.92, 'grad_norm': 0.7506943345069885, 'learning_rate': 4.241198876907936e-05, 'epoch': 0.7}


 71%|███████   | 812/1151 [48:25<16:44,  2.96s/it]

{'loss': 3.5056, 'grad_norm': 0.7424166798591614, 'learning_rate': 4.218207752861728e-05, 'epoch': 0.71}


 71%|███████   | 813/1151 [48:28<16:41,  2.96s/it]

{'loss': 3.3227, 'grad_norm': 0.7877633571624756, 'learning_rate': 4.195262446543753e-05, 'epoch': 0.71}


 71%|███████   | 814/1151 [48:31<16:42,  2.97s/it]

{'loss': 3.1581, 'grad_norm': 0.8152170777320862, 'learning_rate': 4.1723631397837416e-05, 'epoch': 0.71}


 71%|███████   | 815/1151 [48:34<16:36,  2.97s/it]

{'loss': 3.2552, 'grad_norm': 0.8553290963172913, 'learning_rate': 4.149510014046922e-05, 'epoch': 0.71}


 71%|███████   | 816/1151 [48:37<16:14,  2.91s/it]

{'loss': 3.2351, 'grad_norm': 0.7978230118751526, 'learning_rate': 4.126703250432561e-05, 'epoch': 0.71}


 71%|███████   | 817/1151 [48:40<16:31,  2.97s/it]

{'loss': 3.4113, 'grad_norm': 0.7543887495994568, 'learning_rate': 4.103943029672517e-05, 'epoch': 0.71}


 71%|███████   | 818/1151 [48:43<16:46,  3.02s/it]

{'loss': 3.4059, 'grad_norm': 0.7348291873931885, 'learning_rate': 4.081229532129827e-05, 'epoch': 0.71}


 71%|███████   | 819/1151 [48:46<17:18,  3.13s/it]

{'loss': 3.5073, 'grad_norm': 0.7250365018844604, 'learning_rate': 4.0585629377972744e-05, 'epoch': 0.71}


 71%|███████   | 820/1151 [48:50<17:43,  3.21s/it]

{'loss': 3.6637, 'grad_norm': 0.6981448531150818, 'learning_rate': 4.035943426295954e-05, 'epoch': 0.71}


 71%|███████▏  | 821/1151 [48:52<16:38,  3.03s/it]

{'loss': 2.4242, 'grad_norm': 0.825458824634552, 'learning_rate': 4.013371176873849e-05, 'epoch': 0.71}


 71%|███████▏  | 822/1151 [48:55<16:34,  3.02s/it]

{'loss': 3.2035, 'grad_norm': 0.7724847197532654, 'learning_rate': 3.9908463684044284e-05, 'epoch': 0.71}


 72%|███████▏  | 823/1151 [48:58<16:05,  2.94s/it]

{'loss': 2.7563, 'grad_norm': 0.8014175891876221, 'learning_rate': 3.968369179385204e-05, 'epoch': 0.71}


 72%|███████▏  | 824/1151 [49:01<15:33,  2.85s/it]

{'loss': 2.7756, 'grad_norm': 0.8572458028793335, 'learning_rate': 3.945939787936329e-05, 'epoch': 0.72}


 72%|███████▏  | 825/1151 [49:04<15:23,  2.83s/it]

{'loss': 3.0336, 'grad_norm': 0.7864625453948975, 'learning_rate': 3.9235583717991944e-05, 'epoch': 0.72}


 72%|███████▏  | 826/1151 [49:07<16:11,  2.99s/it]

{'loss': 3.3132, 'grad_norm': 0.7543684244155884, 'learning_rate': 3.901225108335004e-05, 'epoch': 0.72}


 72%|███████▏  | 827/1151 [49:10<16:29,  3.05s/it]

{'loss': 3.2838, 'grad_norm': 0.7283043265342712, 'learning_rate': 3.878940174523371e-05, 'epoch': 0.72}


 72%|███████▏  | 828/1151 [49:14<17:29,  3.25s/it]

{'loss': 3.7128, 'grad_norm': 0.7883419990539551, 'learning_rate': 3.856703746960939e-05, 'epoch': 0.72}


 72%|███████▏  | 829/1151 [49:17<16:51,  3.14s/it]

{'loss': 2.916, 'grad_norm': 0.8010755181312561, 'learning_rate': 3.8345160018599465e-05, 'epoch': 0.72}


 72%|███████▏  | 830/1151 [49:20<16:34,  3.10s/it]

{'loss': 3.1708, 'grad_norm': 0.845664918422699, 'learning_rate': 3.812377115046855e-05, 'epoch': 0.72}


 72%|███████▏  | 831/1151 [49:23<16:32,  3.10s/it]

{'loss': 3.2294, 'grad_norm': 0.7588586807250977, 'learning_rate': 3.790287261960953e-05, 'epoch': 0.72}


 72%|███████▏  | 832/1151 [49:26<16:31,  3.11s/it]

{'loss': 3.365, 'grad_norm': 0.7764555811882019, 'learning_rate': 3.76824661765296e-05, 'epoch': 0.72}


 72%|███████▏  | 833/1151 [49:29<16:59,  3.20s/it]

{'loss': 3.4594, 'grad_norm': 0.7556044459342957, 'learning_rate': 3.746255356783632e-05, 'epoch': 0.72}


 72%|███████▏  | 834/1151 [49:32<16:05,  3.04s/it]

{'loss': 3.1204, 'grad_norm': 0.8066299557685852, 'learning_rate': 3.724313653622404e-05, 'epoch': 0.72}


 73%|███████▎  | 835/1151 [49:35<16:21,  3.11s/it]

{'loss': 2.966, 'grad_norm': 0.7475111484527588, 'learning_rate': 3.702421682045974e-05, 'epoch': 0.72}


 73%|███████▎  | 836/1151 [49:39<16:23,  3.12s/it]

{'loss': 3.2413, 'grad_norm': 0.7608644366264343, 'learning_rate': 3.680579615536961e-05, 'epoch': 0.73}


 73%|███████▎  | 837/1151 [49:41<15:58,  3.05s/it]

{'loss': 3.3403, 'grad_norm': 0.8501598834991455, 'learning_rate': 3.658787627182495e-05, 'epoch': 0.73}


 73%|███████▎  | 838/1151 [49:44<15:20,  2.94s/it]

{'loss': 2.8402, 'grad_norm': 0.7986178398132324, 'learning_rate': 3.6370458896728674e-05, 'epoch': 0.73}


 73%|███████▎  | 839/1151 [49:47<14:48,  2.85s/it]

{'loss': 2.3495, 'grad_norm': 0.8036684393882751, 'learning_rate': 3.615354575300166e-05, 'epoch': 0.73}


 73%|███████▎  | 840/1151 [49:50<14:44,  2.85s/it]

{'loss': 2.6556, 'grad_norm': 0.7153275012969971, 'learning_rate': 3.593713855956893e-05, 'epoch': 0.73}


 73%|███████▎  | 841/1151 [49:52<14:44,  2.85s/it]

{'loss': 2.9503, 'grad_norm': 0.7445859313011169, 'learning_rate': 3.5721239031346066e-05, 'epoch': 0.73}


 73%|███████▎  | 842/1151 [49:55<14:42,  2.86s/it]

{'loss': 3.18, 'grad_norm': 0.8049424290657043, 'learning_rate': 3.550584887922582e-05, 'epoch': 0.73}


 73%|███████▎  | 843/1151 [49:58<14:38,  2.85s/it]

{'loss': 2.795, 'grad_norm': 0.8046831488609314, 'learning_rate': 3.5290969810064255e-05, 'epoch': 0.73}


 73%|███████▎  | 844/1151 [50:02<15:31,  3.03s/it]

{'loss': 3.4915, 'grad_norm': 0.7086635231971741, 'learning_rate': 3.5076603526667404e-05, 'epoch': 0.73}


 73%|███████▎  | 845/1151 [50:04<14:37,  2.87s/it]

{'loss': 2.0921, 'grad_norm': 0.7820438742637634, 'learning_rate': 3.4862751727777797e-05, 'epoch': 0.73}


 74%|███████▎  | 846/1151 [50:08<15:25,  3.04s/it]

{'loss': 3.4762, 'grad_norm': 0.714582622051239, 'learning_rate': 3.464941610806086e-05, 'epoch': 0.73}


 74%|███████▎  | 847/1151 [50:10<15:01,  2.97s/it]

{'loss': 3.2275, 'grad_norm': 0.8263099789619446, 'learning_rate': 3.4436598358091564e-05, 'epoch': 0.74}


 74%|███████▎  | 848/1151 [50:13<15:06,  2.99s/it]

{'loss': 3.3123, 'grad_norm': 0.750587522983551, 'learning_rate': 3.422430016434114e-05, 'epoch': 0.74}


 74%|███████▍  | 849/1151 [50:16<14:39,  2.91s/it]

{'loss': 2.7094, 'grad_norm': 0.8234931826591492, 'learning_rate': 3.401252320916345e-05, 'epoch': 0.74}


 74%|███████▍  | 850/1151 [50:19<14:44,  2.94s/it]

{'loss': 3.1631, 'grad_norm': 0.7248983979225159, 'learning_rate': 3.3801269170781933e-05, 'epoch': 0.74}


 74%|███████▍  | 851/1151 [50:22<15:16,  3.06s/it]

{'loss': 3.6, 'grad_norm': 0.7529737949371338, 'learning_rate': 3.3590539723276083e-05, 'epoch': 0.74}


 74%|███████▍  | 852/1151 [50:25<14:48,  2.97s/it]

{'loss': 2.6245, 'grad_norm': 0.7456470131874084, 'learning_rate': 3.338033653656839e-05, 'epoch': 0.74}


 74%|███████▍  | 853/1151 [50:28<14:34,  2.93s/it]

{'loss': 2.9456, 'grad_norm': 0.780863881111145, 'learning_rate': 3.317066127641091e-05, 'epoch': 0.74}


 74%|███████▍  | 854/1151 [50:31<14:19,  2.89s/it]

{'loss': 2.814, 'grad_norm': 0.709239661693573, 'learning_rate': 3.296151560437214e-05, 'epoch': 0.74}


 74%|███████▍  | 855/1151 [50:34<15:05,  3.06s/it]

{'loss': 3.4793, 'grad_norm': 0.7558260560035706, 'learning_rate': 3.275290117782397e-05, 'epoch': 0.74}


 74%|███████▍  | 856/1151 [50:37<14:24,  2.93s/it]

{'loss': 2.6665, 'grad_norm': 0.8746480345726013, 'learning_rate': 3.2544819649928346e-05, 'epoch': 0.74}


 74%|███████▍  | 857/1151 [50:40<14:19,  2.92s/it]

{'loss': 3.15, 'grad_norm': 0.8207349181175232, 'learning_rate': 3.2337272669624264e-05, 'epoch': 0.74}


 75%|███████▍  | 858/1151 [50:43<13:57,  2.86s/it]

{'loss': 2.6893, 'grad_norm': 0.8211855888366699, 'learning_rate': 3.2130261881614795e-05, 'epoch': 0.74}


 75%|███████▍  | 859/1151 [50:45<13:47,  2.83s/it]

{'loss': 2.7725, 'grad_norm': 0.7987608909606934, 'learning_rate': 3.1923788926353884e-05, 'epoch': 0.75}


 75%|███████▍  | 860/1151 [50:48<13:45,  2.84s/it]

{'loss': 2.9432, 'grad_norm': 0.7847573757171631, 'learning_rate': 3.171785544003342e-05, 'epoch': 0.75}


 75%|███████▍  | 861/1151 [50:51<13:34,  2.81s/it]

{'loss': 2.5898, 'grad_norm': 0.7597178220748901, 'learning_rate': 3.151246305457036e-05, 'epoch': 0.75}


 75%|███████▍  | 862/1151 [50:54<13:16,  2.76s/it]

{'loss': 3.0732, 'grad_norm': 0.8990047574043274, 'learning_rate': 3.130761339759365e-05, 'epoch': 0.75}


 75%|███████▍  | 863/1151 [50:57<14:39,  3.05s/it]

{'loss': 3.6499, 'grad_norm': 0.7087847590446472, 'learning_rate': 3.110330809243135e-05, 'epoch': 0.75}


 75%|███████▌  | 864/1151 [51:00<14:45,  3.08s/it]

{'loss': 3.3614, 'grad_norm': 0.8386366367340088, 'learning_rate': 3.089954875809794e-05, 'epoch': 0.75}


 75%|███████▌  | 865/1151 [51:04<14:43,  3.09s/it]

{'loss': 3.4261, 'grad_norm': 0.7893272042274475, 'learning_rate': 3.069633700928126e-05, 'epoch': 0.75}


 75%|███████▌  | 866/1151 [51:07<14:26,  3.04s/it]

{'loss': 3.508, 'grad_norm': 0.8569651246070862, 'learning_rate': 3.0493674456329813e-05, 'epoch': 0.75}


 75%|███████▌  | 867/1151 [51:10<14:24,  3.04s/it]

{'loss': 3.4104, 'grad_norm': 0.8147179484367371, 'learning_rate': 3.0291562705240105e-05, 'epoch': 0.75}


 75%|███████▌  | 868/1151 [51:12<14:09,  3.00s/it]

{'loss': 3.0232, 'grad_norm': 0.7773167490959167, 'learning_rate': 3.0090003357643727e-05, 'epoch': 0.75}


 75%|███████▌  | 869/1151 [51:16<14:10,  3.01s/it]

{'loss': 3.2789, 'grad_norm': 0.7555033564567566, 'learning_rate': 2.9888998010794743e-05, 'epoch': 0.75}


 76%|███████▌  | 870/1151 [51:18<13:24,  2.86s/it]

{'loss': 2.4111, 'grad_norm': 0.8909466862678528, 'learning_rate': 2.9688548257557125e-05, 'epoch': 0.76}


 76%|███████▌  | 871/1151 [51:21<13:12,  2.83s/it]

{'loss': 2.7529, 'grad_norm': 0.7922466993331909, 'learning_rate': 2.9488655686392086e-05, 'epoch': 0.76}


 76%|███████▌  | 872/1151 [51:23<12:59,  2.80s/it]

{'loss': 2.8514, 'grad_norm': 0.7811267971992493, 'learning_rate': 2.9289321881345254e-05, 'epoch': 0.76}


 76%|███████▌  | 873/1151 [51:26<12:51,  2.77s/it]

{'loss': 2.3778, 'grad_norm': 0.8311183452606201, 'learning_rate': 2.9090548422034525e-05, 'epoch': 0.76}


 76%|███████▌  | 874/1151 [51:29<13:25,  2.91s/it]

{'loss': 3.4908, 'grad_norm': 0.7844793796539307, 'learning_rate': 2.8892336883637327e-05, 'epoch': 0.76}


 76%|███████▌  | 875/1151 [51:32<13:29,  2.93s/it]

{'loss': 3.0674, 'grad_norm': 0.7311757206916809, 'learning_rate': 2.869468883687798e-05, 'epoch': 0.76}


 76%|███████▌  | 876/1151 [51:35<13:07,  2.87s/it]

{'loss': 2.6586, 'grad_norm': 0.8542475700378418, 'learning_rate': 2.8497605848015607e-05, 'epoch': 0.76}


 76%|███████▌  | 877/1151 [51:38<12:59,  2.84s/it]

{'loss': 2.9111, 'grad_norm': 0.8307856321334839, 'learning_rate': 2.8301089478831512e-05, 'epoch': 0.76}


 76%|███████▋  | 878/1151 [51:41<13:24,  2.95s/it]

{'loss': 3.3424, 'grad_norm': 0.8135689496994019, 'learning_rate': 2.8105141286616754e-05, 'epoch': 0.76}


 76%|███████▋  | 879/1151 [51:44<12:51,  2.83s/it]

{'loss': 2.3825, 'grad_norm': 0.8225374221801758, 'learning_rate': 2.790976282415989e-05, 'epoch': 0.76}


 76%|███████▋  | 880/1151 [51:47<13:36,  3.01s/it]

{'loss': 3.6134, 'grad_norm': 0.7451249361038208, 'learning_rate': 2.7714955639734764e-05, 'epoch': 0.76}


 77%|███████▋  | 881/1151 [51:50<13:06,  2.91s/it]

{'loss': 2.8095, 'grad_norm': 0.7840052843093872, 'learning_rate': 2.7520721277088024e-05, 'epoch': 0.76}


 77%|███████▋  | 882/1151 [51:53<12:52,  2.87s/it]

{'loss': 2.7254, 'grad_norm': 0.8131371736526489, 'learning_rate': 2.7327061275427e-05, 'epoch': 0.77}


 77%|███████▋  | 883/1151 [51:55<12:32,  2.81s/it]

{'loss': 2.7713, 'grad_norm': 0.8082220554351807, 'learning_rate': 2.713397716940763e-05, 'epoch': 0.77}


 77%|███████▋  | 884/1151 [51:58<12:52,  2.89s/it]

{'loss': 3.3819, 'grad_norm': 0.7533170580863953, 'learning_rate': 2.6941470489122056e-05, 'epoch': 0.77}


 77%|███████▋  | 885/1151 [52:01<12:58,  2.93s/it]

{'loss': 3.0037, 'grad_norm': 0.7382357716560364, 'learning_rate': 2.674954276008661e-05, 'epoch': 0.77}


 77%|███████▋  | 886/1151 [52:05<13:14,  3.00s/it]

{'loss': 3.3627, 'grad_norm': 0.7901514172554016, 'learning_rate': 2.6558195503229856e-05, 'epoch': 0.77}


 77%|███████▋  | 887/1151 [52:08<13:18,  3.03s/it]

{'loss': 3.4243, 'grad_norm': 0.7474024891853333, 'learning_rate': 2.6367430234880284e-05, 'epoch': 0.77}


 77%|███████▋  | 888/1151 [52:10<12:45,  2.91s/it]

{'loss': 2.6754, 'grad_norm': 0.8924990296363831, 'learning_rate': 2.617724846675448e-05, 'epoch': 0.77}


 77%|███████▋  | 889/1151 [52:13<12:39,  2.90s/it]

{'loss': 3.4331, 'grad_norm': 0.8265408873558044, 'learning_rate': 2.5987651705945117e-05, 'epoch': 0.77}


 77%|███████▋  | 890/1151 [52:16<13:14,  3.04s/it]

{'loss': 3.3442, 'grad_norm': 0.8180848360061646, 'learning_rate': 2.5798641454908944e-05, 'epoch': 0.77}


 77%|███████▋  | 891/1151 [52:20<13:18,  3.07s/it]

{'loss': 3.0614, 'grad_norm': 0.7866965532302856, 'learning_rate': 2.56102192114549e-05, 'epoch': 0.77}


 77%|███████▋  | 892/1151 [52:23<13:29,  3.13s/it]

{'loss': 3.4453, 'grad_norm': 0.8175772428512573, 'learning_rate': 2.5422386468732364e-05, 'epoch': 0.77}


 78%|███████▊  | 893/1151 [52:26<12:58,  3.02s/it]

{'loss': 2.6775, 'grad_norm': 0.7995033264160156, 'learning_rate': 2.523514471521913e-05, 'epoch': 0.78}


 78%|███████▊  | 894/1151 [52:29<13:07,  3.07s/it]

{'loss': 3.3572, 'grad_norm': 0.7463398575782776, 'learning_rate': 2.5048495434709708e-05, 'epoch': 0.78}


 78%|███████▊  | 895/1151 [52:32<12:46,  3.00s/it]

{'loss': 2.8492, 'grad_norm': 0.7889350652694702, 'learning_rate': 2.4862440106303665e-05, 'epoch': 0.78}


 78%|███████▊  | 896/1151 [52:34<12:21,  2.91s/it]

{'loss': 2.9592, 'grad_norm': 0.7862693667411804, 'learning_rate': 2.467698020439365e-05, 'epoch': 0.78}


 78%|███████▊  | 897/1151 [52:37<11:46,  2.78s/it]

{'loss': 2.5228, 'grad_norm': 0.8005091547966003, 'learning_rate': 2.449211719865404e-05, 'epoch': 0.78}


 78%|███████▊  | 898/1151 [52:40<12:19,  2.92s/it]

{'loss': 3.3854, 'grad_norm': 0.7274618744850159, 'learning_rate': 2.4307852554028943e-05, 'epoch': 0.78}


 78%|███████▊  | 899/1151 [52:43<12:28,  2.97s/it]

{'loss': 3.1788, 'grad_norm': 0.7512208819389343, 'learning_rate': 2.4124187730720903e-05, 'epoch': 0.78}


 78%|███████▊  | 900/1151 [52:47<13:21,  3.19s/it]

{'loss': 3.6595, 'grad_norm': 0.7350623607635498, 'learning_rate': 2.394112418417912e-05, 'epoch': 0.78}


 78%|███████▊  | 901/1151 [52:50<12:54,  3.10s/it]

{'loss': 2.7779, 'grad_norm': 0.722472071647644, 'learning_rate': 2.375866336508794e-05, 'epoch': 0.78}


 78%|███████▊  | 902/1151 [52:53<12:29,  3.01s/it]

{'loss': 3.1972, 'grad_norm': 0.8826914429664612, 'learning_rate': 2.357680671935554e-05, 'epoch': 0.78}


 78%|███████▊  | 903/1151 [52:56<12:22,  3.00s/it]

{'loss': 3.6992, 'grad_norm': 0.8689600825309753, 'learning_rate': 2.339555568810221e-05, 'epoch': 0.78}


 79%|███████▊  | 904/1151 [52:59<12:33,  3.05s/it]

{'loss': 3.3144, 'grad_norm': 0.90086430311203, 'learning_rate': 2.321491170764908e-05, 'epoch': 0.78}


 79%|███████▊  | 905/1151 [53:02<12:37,  3.08s/it]

{'loss': 3.3292, 'grad_norm': 0.7680996656417847, 'learning_rate': 2.3034876209506772e-05, 'epoch': 0.79}


 79%|███████▊  | 906/1151 [53:05<12:57,  3.18s/it]

{'loss': 3.13, 'grad_norm': 0.704352080821991, 'learning_rate': 2.285545062036397e-05, 'epoch': 0.79}


 79%|███████▉  | 907/1151 [53:08<12:27,  3.06s/it]

{'loss': 2.7257, 'grad_norm': 0.7680587768554688, 'learning_rate': 2.2676636362076076e-05, 'epoch': 0.79}


 79%|███████▉  | 908/1151 [53:11<12:02,  2.97s/it]

{'loss': 2.4628, 'grad_norm': 0.9104442596435547, 'learning_rate': 2.2498434851654126e-05, 'epoch': 0.79}


 79%|███████▉  | 909/1151 [53:14<11:41,  2.90s/it]

{'loss': 2.9407, 'grad_norm': 0.7696848511695862, 'learning_rate': 2.2320847501253383e-05, 'epoch': 0.79}


 79%|███████▉  | 910/1151 [53:16<11:35,  2.88s/it]

{'loss': 2.8163, 'grad_norm': 0.7812302112579346, 'learning_rate': 2.2143875718162154e-05, 'epoch': 0.79}


 79%|███████▉  | 911/1151 [53:19<11:33,  2.89s/it]

{'loss': 2.9529, 'grad_norm': 0.8023003935813904, 'learning_rate': 2.1967520904790827e-05, 'epoch': 0.79}


 79%|███████▉  | 912/1151 [53:22<11:49,  2.97s/it]

{'loss': 3.1302, 'grad_norm': 0.7363135814666748, 'learning_rate': 2.179178445866048e-05, 'epoch': 0.79}


 79%|███████▉  | 913/1151 [53:25<11:33,  2.91s/it]

{'loss': 3.0242, 'grad_norm': 0.771257758140564, 'learning_rate': 2.1616667772392074e-05, 'epoch': 0.79}


 79%|███████▉  | 914/1151 [53:28<11:45,  2.98s/it]

{'loss': 3.5111, 'grad_norm': 0.7570013999938965, 'learning_rate': 2.14421722336952e-05, 'epoch': 0.79}


 79%|███████▉  | 915/1151 [53:31<11:27,  2.91s/it]

{'loss': 3.0967, 'grad_norm': 0.822479784488678, 'learning_rate': 2.126829922535718e-05, 'epoch': 0.79}


 80%|███████▉  | 916/1151 [53:34<11:32,  2.95s/it]

{'loss': 3.2126, 'grad_norm': 0.7855017185211182, 'learning_rate': 2.1095050125232196e-05, 'epoch': 0.8}


 80%|███████▉  | 917/1151 [53:37<10:51,  2.79s/it]

{'loss': 2.4191, 'grad_norm': 0.8040766716003418, 'learning_rate': 2.092242630623016e-05, 'epoch': 0.8}


 80%|███████▉  | 918/1151 [53:39<10:38,  2.74s/it]

{'loss': 2.0622, 'grad_norm': 0.7727289795875549, 'learning_rate': 2.0750429136305992e-05, 'epoch': 0.8}


 80%|███████▉  | 919/1151 [53:42<10:36,  2.74s/it]

{'loss': 2.9884, 'grad_norm': 0.7751478552818298, 'learning_rate': 2.05790599784488e-05, 'epoch': 0.8}


 80%|███████▉  | 920/1151 [53:45<10:30,  2.73s/it]

{'loss': 2.7641, 'grad_norm': 0.7885124683380127, 'learning_rate': 2.040832019067096e-05, 'epoch': 0.8}


 80%|████████  | 921/1151 [53:48<11:03,  2.88s/it]

{'loss': 3.2871, 'grad_norm': 0.7995237112045288, 'learning_rate': 2.0238211125997397e-05, 'epoch': 0.8}


 80%|████████  | 922/1151 [53:51<10:43,  2.81s/it]

{'loss': 2.5045, 'grad_norm': 0.8406155705451965, 'learning_rate': 2.0068734132454946e-05, 'epoch': 0.8}


 80%|████████  | 923/1151 [53:53<10:36,  2.79s/it]

{'loss': 2.8121, 'grad_norm': 0.7727353572845459, 'learning_rate': 1.9899890553061562e-05, 'epoch': 0.8}


 80%|████████  | 924/1151 [53:57<11:17,  2.98s/it]

{'loss': 3.1331, 'grad_norm': 0.7172176241874695, 'learning_rate': 1.9731681725815676e-05, 'epoch': 0.8}


 80%|████████  | 924/1151 [56:13<11:17,  2.98s/it]

{'eval_loss': 3.000704526901245, 'eval_runtime': 136.4706, 'eval_samples_per_second': 8.434, 'eval_steps_per_second': 1.055, 'epoch': 0.8}


 80%|████████  | 925/1151 [56:16<2:45:16, 43.88s/it]

{'loss': 3.1747, 'grad_norm': 0.7718472480773926, 'learning_rate': 1.956410898368576e-05, 'epoch': 0.8}


 80%|████████  | 926/1151 [56:19<1:58:33, 31.62s/it]

{'loss': 3.2229, 'grad_norm': 0.7933831810951233, 'learning_rate': 1.939717365459952e-05, 'epoch': 0.8}


 81%|████████  | 927/1151 [56:22<1:25:41, 22.95s/it]

{'loss': 2.9832, 'grad_norm': 0.8235167860984802, 'learning_rate': 1.9230877061433507e-05, 'epoch': 0.8}


 81%|████████  | 928/1151 [56:25<1:03:08, 16.99s/it]

{'loss': 3.3967, 'grad_norm': 0.8090938329696655, 'learning_rate': 1.9065220522002702e-05, 'epoch': 0.81}


 81%|████████  | 929/1151 [56:28<47:23, 12.81s/it]  

{'loss': 3.4794, 'grad_norm': 0.7887473702430725, 'learning_rate': 1.8900205349049904e-05, 'epoch': 0.81}


 81%|████████  | 930/1151 [56:32<37:02, 10.06s/it]

{'loss': 3.6538, 'grad_norm': 0.7076930403709412, 'learning_rate': 1.8735832850235425e-05, 'epoch': 0.81}


 81%|████████  | 931/1151 [56:34<28:58,  7.90s/it]

{'loss': 3.1989, 'grad_norm': 0.8083945512771606, 'learning_rate': 1.857210432812674e-05, 'epoch': 0.81}


 81%|████████  | 932/1151 [56:38<23:44,  6.50s/it]

{'loss': 3.1567, 'grad_norm': 0.7132396101951599, 'learning_rate': 1.8409021080188193e-05, 'epoch': 0.81}


 81%|████████  | 933/1151 [56:41<19:43,  5.43s/it]

{'loss': 3.1478, 'grad_norm': 0.7535510659217834, 'learning_rate': 1.8246584398770493e-05, 'epoch': 0.81}


 81%|████████  | 934/1151 [56:44<17:11,  4.75s/it]

{'loss': 3.1915, 'grad_norm': 0.7157431840896606, 'learning_rate': 1.808479557110081e-05, 'epoch': 0.81}


 81%|████████  | 935/1151 [56:47<15:44,  4.37s/it]

{'loss': 3.6279, 'grad_norm': 0.7361204028129578, 'learning_rate': 1.7923655879272393e-05, 'epoch': 0.81}


 81%|████████▏ | 936/1151 [56:51<14:49,  4.14s/it]

{'loss': 3.5549, 'grad_norm': 0.7000375986099243, 'learning_rate': 1.7763166600234272e-05, 'epoch': 0.81}


 81%|████████▏ | 937/1151 [56:54<13:36,  3.81s/it]

{'loss': 3.3496, 'grad_norm': 0.7566180229187012, 'learning_rate': 1.7603329005781444e-05, 'epoch': 0.81}


 81%|████████▏ | 938/1151 [56:57<12:30,  3.52s/it]

{'loss': 2.7665, 'grad_norm': 0.7796773910522461, 'learning_rate': 1.7444144362544625e-05, 'epoch': 0.81}


 82%|████████▏ | 939/1151 [57:00<11:56,  3.38s/it]

{'loss': 3.1164, 'grad_norm': 0.8185850977897644, 'learning_rate': 1.728561393198016e-05, 'epoch': 0.82}


 82%|████████▏ | 940/1151 [57:03<11:22,  3.24s/it]

{'loss': 3.0562, 'grad_norm': 0.790386438369751, 'learning_rate': 1.712773897036012e-05, 'epoch': 0.82}


 82%|████████▏ | 941/1151 [57:05<10:47,  3.08s/it]

{'loss': 2.6467, 'grad_norm': 0.7928968667984009, 'learning_rate': 1.6970520728762375e-05, 'epoch': 0.82}


 82%|████████▏ | 942/1151 [57:08<10:13,  2.94s/it]

{'loss': 3.0623, 'grad_norm': 0.8321557641029358, 'learning_rate': 1.6813960453060552e-05, 'epoch': 0.82}


 82%|████████▏ | 943/1151 [57:11<09:56,  2.87s/it]

{'loss': 3.1125, 'grad_norm': 0.8329512476921082, 'learning_rate': 1.6658059383914248e-05, 'epoch': 0.82}


 82%|████████▏ | 944/1151 [57:13<09:42,  2.81s/it]

{'loss': 2.8791, 'grad_norm': 0.8638900518417358, 'learning_rate': 1.6502818756759266e-05, 'epoch': 0.82}


 82%|████████▏ | 945/1151 [57:16<09:56,  2.90s/it]

{'loss': 3.2599, 'grad_norm': 0.7704311013221741, 'learning_rate': 1.634823980179766e-05, 'epoch': 0.82}


 82%|████████▏ | 946/1151 [57:20<10:05,  2.96s/it]

{'loss': 3.1141, 'grad_norm': 0.7083074450492859, 'learning_rate': 1.6194323743988084e-05, 'epoch': 0.82}


 82%|████████▏ | 947/1151 [57:22<09:46,  2.88s/it]

{'loss': 2.6091, 'grad_norm': 0.855840802192688, 'learning_rate': 1.60410718030361e-05, 'epoch': 0.82}


 82%|████████▏ | 948/1151 [57:25<09:25,  2.79s/it]

{'loss': 2.6289, 'grad_norm': 0.846360981464386, 'learning_rate': 1.5888485193384528e-05, 'epoch': 0.82}


 82%|████████▏ | 949/1151 [57:28<09:57,  2.96s/it]

{'loss': 3.2034, 'grad_norm': 0.7703725695610046, 'learning_rate': 1.5736565124203638e-05, 'epoch': 0.82}


 83%|████████▎ | 950/1151 [57:31<09:42,  2.90s/it]

{'loss': 2.673, 'grad_norm': 0.8205018043518066, 'learning_rate': 1.5585312799381846e-05, 'epoch': 0.82}


 83%|████████▎ | 951/1151 [57:34<09:55,  2.98s/it]

{'loss': 3.2527, 'grad_norm': 0.7846785187721252, 'learning_rate': 1.5434729417516037e-05, 'epoch': 0.83}


 83%|████████▎ | 952/1151 [57:37<10:10,  3.07s/it]

{'loss': 3.2828, 'grad_norm': 0.7593302726745605, 'learning_rate': 1.528481617190193e-05, 'epoch': 0.83}


 83%|████████▎ | 953/1151 [57:40<10:04,  3.05s/it]

{'loss': 3.0881, 'grad_norm': 0.7442106008529663, 'learning_rate': 1.5135574250524897e-05, 'epoch': 0.83}


 83%|████████▎ | 954/1151 [57:44<10:36,  3.23s/it]

{'loss': 3.4916, 'grad_norm': 0.7327720522880554, 'learning_rate': 1.49870048360504e-05, 'epoch': 0.83}


 83%|████████▎ | 955/1151 [57:47<10:41,  3.27s/it]

{'loss': 3.6707, 'grad_norm': 0.7492682337760925, 'learning_rate': 1.483910910581452e-05, 'epoch': 0.83}


 83%|████████▎ | 956/1151 [57:51<10:41,  3.29s/it]

{'loss': 3.2536, 'grad_norm': 0.7375083565711975, 'learning_rate': 1.4691888231814843e-05, 'epoch': 0.83}


 83%|████████▎ | 957/1151 [57:54<10:15,  3.17s/it]

{'loss': 2.9753, 'grad_norm': 0.7596966028213501, 'learning_rate': 1.454534338070106e-05, 'epoch': 0.83}


 83%|████████▎ | 958/1151 [57:57<10:01,  3.12s/it]

{'loss': 3.2839, 'grad_norm': 0.7820315361022949, 'learning_rate': 1.4399475713765686e-05, 'epoch': 0.83}


 83%|████████▎ | 959/1151 [57:59<09:37,  3.01s/it]

{'loss': 2.6154, 'grad_norm': 0.833322286605835, 'learning_rate': 1.425428638693489e-05, 'epoch': 0.83}


 83%|████████▎ | 960/1151 [58:03<09:52,  3.10s/it]

{'loss': 3.3352, 'grad_norm': 0.7471645474433899, 'learning_rate': 1.4109776550759424e-05, 'epoch': 0.83}


 83%|████████▎ | 961/1151 [58:05<09:18,  2.94s/it]

{'loss': 2.3368, 'grad_norm': 0.8697324991226196, 'learning_rate': 1.3965947350405351e-05, 'epoch': 0.83}


 84%|████████▎ | 962/1151 [58:08<09:08,  2.90s/it]

{'loss': 2.9423, 'grad_norm': 0.8092944622039795, 'learning_rate': 1.3822799925645036e-05, 'epoch': 0.84}


 84%|████████▎ | 963/1151 [58:11<09:10,  2.93s/it]

{'loss': 3.024, 'grad_norm': 0.7850580215454102, 'learning_rate': 1.368033541084821e-05, 'epoch': 0.84}


 84%|████████▍ | 964/1151 [58:14<08:56,  2.87s/it]

{'loss': 2.7871, 'grad_norm': 0.7629969716072083, 'learning_rate': 1.3538554934972813e-05, 'epoch': 0.84}


 84%|████████▍ | 965/1151 [58:17<08:54,  2.87s/it]

{'loss': 2.6813, 'grad_norm': 0.761199951171875, 'learning_rate': 1.339745962155613e-05, 'epoch': 0.84}


 84%|████████▍ | 966/1151 [58:20<08:53,  2.88s/it]

{'loss': 3.3433, 'grad_norm': 0.7963199019432068, 'learning_rate': 1.3257050588705978e-05, 'epoch': 0.84}


 84%|████████▍ | 967/1151 [58:23<09:17,  3.03s/it]

{'loss': 3.4035, 'grad_norm': 0.748278796672821, 'learning_rate': 1.3117328949091634e-05, 'epoch': 0.84}


 84%|████████▍ | 968/1151 [58:26<08:49,  2.90s/it]

{'loss': 2.4822, 'grad_norm': 0.8231669068336487, 'learning_rate': 1.2978295809935181e-05, 'epoch': 0.84}


 84%|████████▍ | 969/1151 [58:28<08:28,  2.79s/it]

{'loss': 2.2731, 'grad_norm': 0.9161112904548645, 'learning_rate': 1.2839952273002754e-05, 'epoch': 0.84}


 84%|████████▍ | 970/1151 [58:31<08:30,  2.82s/it]

{'loss': 2.993, 'grad_norm': 0.8183927536010742, 'learning_rate': 1.2702299434595666e-05, 'epoch': 0.84}


 84%|████████▍ | 971/1151 [58:34<08:21,  2.79s/it]

{'loss': 2.5263, 'grad_norm': 0.8200954794883728, 'learning_rate': 1.2565338385541792e-05, 'epoch': 0.84}


 84%|████████▍ | 972/1151 [58:36<08:08,  2.73s/it]

{'loss': 2.6712, 'grad_norm': 0.8340196013450623, 'learning_rate': 1.242907021118701e-05, 'epoch': 0.84}


 85%|████████▍ | 973/1151 [58:40<08:36,  2.90s/it]

{'loss': 3.2098, 'grad_norm': 0.7370133996009827, 'learning_rate': 1.229349599138645e-05, 'epoch': 0.84}


 85%|████████▍ | 974/1151 [58:43<08:38,  2.93s/it]

{'loss': 3.3556, 'grad_norm': 0.7915487289428711, 'learning_rate': 1.2158616800496059e-05, 'epoch': 0.85}


 85%|████████▍ | 975/1151 [58:45<08:26,  2.88s/it]

{'loss': 2.6744, 'grad_norm': 0.8836338520050049, 'learning_rate': 1.2024433707364002e-05, 'epoch': 0.85}


 85%|████████▍ | 976/1151 [58:49<08:47,  3.01s/it]

{'loss': 3.192, 'grad_norm': 0.7201728820800781, 'learning_rate': 1.1890947775322237e-05, 'epoch': 0.85}


 85%|████████▍ | 977/1151 [58:51<08:30,  2.94s/it]

{'loss': 2.9951, 'grad_norm': 0.8807223439216614, 'learning_rate': 1.1758160062178093e-05, 'epoch': 0.85}


 85%|████████▍ | 978/1151 [58:54<08:32,  2.96s/it]

{'loss': 3.1317, 'grad_norm': 0.8142136931419373, 'learning_rate': 1.162607162020587e-05, 'epoch': 0.85}


 85%|████████▌ | 979/1151 [58:58<08:35,  3.00s/it]

{'loss': 3.452, 'grad_norm': 0.7948034405708313, 'learning_rate': 1.1494683496138458e-05, 'epoch': 0.85}


 85%|████████▌ | 980/1151 [59:01<08:36,  3.02s/it]

{'loss': 3.1558, 'grad_norm': 0.732746958732605, 'learning_rate': 1.1363996731159188e-05, 'epoch': 0.85}


 85%|████████▌ | 981/1151 [59:03<08:18,  2.93s/it]

{'loss': 2.7222, 'grad_norm': 0.7770662903785706, 'learning_rate': 1.1234012360893375e-05, 'epoch': 0.85}


 85%|████████▌ | 982/1151 [59:06<08:20,  2.96s/it]

{'loss': 3.4167, 'grad_norm': 0.8758771419525146, 'learning_rate': 1.110473141540026e-05, 'epoch': 0.85}


 85%|████████▌ | 983/1151 [59:09<08:17,  2.96s/it]

{'loss': 3.0591, 'grad_norm': 0.7726160287857056, 'learning_rate': 1.097615491916485e-05, 'epoch': 0.85}


 85%|████████▌ | 984/1151 [59:12<08:20,  3.00s/it]

{'loss': 3.1352, 'grad_norm': 0.7786883115768433, 'learning_rate': 1.0848283891089683e-05, 'epoch': 0.85}


 86%|████████▌ | 985/1151 [59:15<08:08,  2.94s/it]

{'loss': 3.0726, 'grad_norm': 0.8946934938430786, 'learning_rate': 1.0721119344486841e-05, 'epoch': 0.86}


 86%|████████▌ | 986/1151 [59:19<08:50,  3.21s/it]

{'loss': 3.7982, 'grad_norm': 0.7244673371315002, 'learning_rate': 1.0594662287069945e-05, 'epoch': 0.86}


 86%|████████▌ | 987/1151 [59:22<08:34,  3.14s/it]

{'loss': 2.6649, 'grad_norm': 0.7947684526443481, 'learning_rate': 1.0468913720946084e-05, 'epoch': 0.86}


 86%|████████▌ | 988/1151 [59:25<08:15,  3.04s/it]

{'loss': 2.6162, 'grad_norm': 0.851550281047821, 'learning_rate': 1.034387464260792e-05, 'epoch': 0.86}


 86%|████████▌ | 989/1151 [59:28<07:58,  2.96s/it]

{'loss': 2.7523, 'grad_norm': 0.7348887920379639, 'learning_rate': 1.0219546042925843e-05, 'epoch': 0.86}


 86%|████████▌ | 990/1151 [59:31<08:01,  2.99s/it]

{'loss': 3.3111, 'grad_norm': 0.7390084266662598, 'learning_rate': 1.0095928907139985e-05, 'epoch': 0.86}


 86%|████████▌ | 991/1151 [59:34<07:54,  2.97s/it]

{'loss': 3.2446, 'grad_norm': 0.8586525321006775, 'learning_rate': 9.973024214852567e-06, 'epoch': 0.86}


 86%|████████▌ | 992/1151 [59:37<08:05,  3.05s/it]

{'loss': 3.1783, 'grad_norm': 0.7504982352256775, 'learning_rate': 9.850832940019994e-06, 'epoch': 0.86}


 86%|████████▋ | 993/1151 [59:40<08:05,  3.07s/it]

{'loss': 3.4894, 'grad_norm': 0.8591102957725525, 'learning_rate': 9.729356050945271e-06, 'epoch': 0.86}


 86%|████████▋ | 994/1151 [59:43<08:22,  3.20s/it]

{'loss': 3.1531, 'grad_norm': 0.7210131883621216, 'learning_rate': 9.608594510270218e-06, 'epoch': 0.86}


 86%|████████▋ | 995/1151 [59:46<08:12,  3.16s/it]

{'loss': 3.1415, 'grad_norm': 0.7281109094619751, 'learning_rate': 9.488549274967872e-06, 'epoch': 0.86}


 87%|████████▋ | 996/1151 [59:49<08:01,  3.11s/it]

{'loss': 3.3666, 'grad_norm': 0.8257257342338562, 'learning_rate': 9.369221296335006e-06, 'epoch': 0.86}


 87%|████████▋ | 997/1151 [59:52<07:42,  3.00s/it]

{'loss': 2.6457, 'grad_norm': 0.7604722380638123, 'learning_rate': 9.25061151998441e-06, 'epoch': 0.87}


 87%|████████▋ | 998/1151 [59:55<07:30,  2.94s/it]

{'loss': 2.9088, 'grad_norm': 0.7573918104171753, 'learning_rate': 9.13272088583751e-06, 'epoch': 0.87}


 87%|████████▋ | 999/1151 [59:58<07:23,  2.92s/it]

{'loss': 3.1753, 'grad_norm': 0.808631181716919, 'learning_rate': 9.015550328116939e-06, 'epoch': 0.87}


 87%|████████▋ | 1000/1151 [1:00:01<07:27,  2.96s/it]

{'loss': 3.058, 'grad_norm': 0.7338466048240662, 'learning_rate': 8.899100775339075e-06, 'epoch': 0.87}


 87%|████████▋ | 1001/1151 [1:00:25<23:20,  9.34s/it]

{'loss': 3.1418, 'grad_norm': 0.7462207674980164, 'learning_rate': 8.783373150306661e-06, 'epoch': 0.87}


 87%|████████▋ | 1002/1151 [1:00:28<18:03,  7.27s/it]

{'loss': 2.4407, 'grad_norm': 0.823358952999115, 'learning_rate': 8.668368370101621e-06, 'epoch': 0.87}


 87%|████████▋ | 1003/1151 [1:00:31<14:49,  6.01s/it]

{'loss': 3.3668, 'grad_norm': 0.7865455150604248, 'learning_rate': 8.554087346077633e-06, 'epoch': 0.87}


 87%|████████▋ | 1004/1151 [1:00:34<12:37,  5.15s/it]

{'loss': 3.5199, 'grad_norm': 0.76784747838974, 'learning_rate': 8.440530983852978e-06, 'epoch': 0.87}


 87%|████████▋ | 1005/1151 [1:00:37<11:01,  4.53s/it]

{'loss': 3.6838, 'grad_norm': 0.7771735191345215, 'learning_rate': 8.327700183303434e-06, 'epoch': 0.87}


 87%|████████▋ | 1006/1151 [1:00:40<09:38,  3.99s/it]

{'loss': 3.1427, 'grad_norm': 0.7776831984519958, 'learning_rate': 8.215595838555013e-06, 'epoch': 0.87}


 87%|████████▋ | 1007/1151 [1:00:42<08:36,  3.58s/it]

{'loss': 2.6151, 'grad_norm': 0.7969372868537903, 'learning_rate': 8.10421883797694e-06, 'epoch': 0.87}


 88%|████████▊ | 1008/1151 [1:00:45<07:57,  3.34s/it]

{'loss': 2.9914, 'grad_norm': 0.7496045827865601, 'learning_rate': 7.993570064174649e-06, 'epoch': 0.88}


 88%|████████▊ | 1009/1151 [1:00:48<07:35,  3.21s/it]

{'loss': 3.1095, 'grad_norm': 0.7632767558097839, 'learning_rate': 7.883650393982756e-06, 'epoch': 0.88}


 88%|████████▊ | 1010/1151 [1:00:51<07:38,  3.25s/it]

{'loss': 3.4484, 'grad_norm': 0.7146852016448975, 'learning_rate': 7.77446069845803e-06, 'epoch': 0.88}


 88%|████████▊ | 1011/1151 [1:00:55<07:41,  3.29s/it]

{'loss': 3.7508, 'grad_norm': 0.7422003746032715, 'learning_rate': 7.666001842872638e-06, 'epoch': 0.88}


 88%|████████▊ | 1012/1151 [1:00:58<07:20,  3.17s/it]

{'loss': 2.9156, 'grad_norm': 0.8033910989761353, 'learning_rate': 7.558274686707245e-06, 'epoch': 0.88}


 88%|████████▊ | 1013/1151 [1:01:01<07:10,  3.12s/it]

{'loss': 3.5569, 'grad_norm': 0.8061712980270386, 'learning_rate': 7.4512800836440525e-06, 'epoch': 0.88}


 88%|████████▊ | 1014/1151 [1:01:03<06:57,  3.05s/it]

{'loss': 3.3569, 'grad_norm': 0.8304862380027771, 'learning_rate': 7.345018881560251e-06, 'epoch': 0.88}


 88%|████████▊ | 1015/1151 [1:01:06<06:45,  2.98s/it]

{'loss': 3.4272, 'grad_norm': 0.8121529817581177, 'learning_rate': 7.239491922521246e-06, 'epoch': 0.88}


 88%|████████▊ | 1016/1151 [1:01:09<06:37,  2.95s/it]

{'loss': 3.2911, 'grad_norm': 0.9013068079948425, 'learning_rate': 7.13470004277379e-06, 'epoch': 0.88}


 88%|████████▊ | 1017/1151 [1:01:12<06:31,  2.92s/it]

{'loss': 3.0373, 'grad_norm': 0.7599278688430786, 'learning_rate': 7.030644072739645e-06, 'epoch': 0.88}


 88%|████████▊ | 1018/1151 [1:01:15<06:18,  2.85s/it]

{'loss': 2.7575, 'grad_norm': 0.8327347040176392, 'learning_rate': 6.927324837008853e-06, 'epoch': 0.88}


 89%|████████▊ | 1019/1151 [1:01:17<06:11,  2.82s/it]

{'loss': 2.7805, 'grad_norm': 0.789781391620636, 'learning_rate': 6.824743154333157e-06, 'epoch': 0.88}


 89%|████████▊ | 1020/1151 [1:01:20<06:17,  2.88s/it]

{'loss': 3.3663, 'grad_norm': 0.8348886370658875, 'learning_rate': 6.722899837619601e-06, 'epoch': 0.89}


 89%|████████▊ | 1021/1151 [1:01:24<06:25,  2.96s/it]

{'loss': 3.5738, 'grad_norm': 0.7375679612159729, 'learning_rate': 6.621795693924082e-06, 'epoch': 0.89}


 89%|████████▉ | 1022/1151 [1:01:26<06:14,  2.91s/it]

{'loss': 2.7557, 'grad_norm': 0.8492767214775085, 'learning_rate': 6.5214315244448985e-06, 'epoch': 0.89}


 89%|████████▉ | 1023/1151 [1:01:29<06:09,  2.89s/it]

{'loss': 3.0973, 'grad_norm': 0.8316678404808044, 'learning_rate': 6.421808124516437e-06, 'epoch': 0.89}


 89%|████████▉ | 1024/1151 [1:01:32<06:11,  2.92s/it]

{'loss': 3.1855, 'grad_norm': 0.7972248792648315, 'learning_rate': 6.3229262836028924e-06, 'epoch': 0.89}


 89%|████████▉ | 1025/1151 [1:01:35<05:52,  2.80s/it]

{'loss': 2.4731, 'grad_norm': 0.8363842964172363, 'learning_rate': 6.224786785291959e-06, 'epoch': 0.89}


 89%|████████▉ | 1026/1151 [1:01:38<06:02,  2.90s/it]

{'loss': 3.3433, 'grad_norm': 0.7776674628257751, 'learning_rate': 6.127390407288658e-06, 'epoch': 0.89}


 89%|████████▉ | 1027/1151 [1:01:41<05:52,  2.84s/it]

{'loss': 2.565, 'grad_norm': 0.7599472999572754, 'learning_rate': 6.030737921409169e-06, 'epoch': 0.89}


 89%|████████▉ | 1028/1151 [1:01:43<05:49,  2.84s/it]

{'loss': 2.9391, 'grad_norm': 0.8384683132171631, 'learning_rate': 5.934830093574706e-06, 'epoch': 0.89}


 89%|████████▉ | 1029/1151 [1:01:46<05:40,  2.79s/it]

{'loss': 2.7446, 'grad_norm': 0.7906872034072876, 'learning_rate': 5.8396676838054275e-06, 'epoch': 0.89}


 89%|████████▉ | 1030/1151 [1:01:49<05:57,  2.96s/it]

{'loss': 3.3202, 'grad_norm': 0.7378215789794922, 'learning_rate': 5.745251446214494e-06, 'epoch': 0.89}


 90%|████████▉ | 1031/1151 [1:01:52<05:40,  2.84s/it]

{'loss': 2.6409, 'grad_norm': 0.845072329044342, 'learning_rate': 5.651582129001986e-06, 'epoch': 0.9}


 90%|████████▉ | 1032/1151 [1:01:55<05:41,  2.87s/it]

{'loss': 2.9178, 'grad_norm': 0.7924469113349915, 'learning_rate': 5.558660474449029e-06, 'epoch': 0.9}


 90%|████████▉ | 1033/1151 [1:01:58<05:38,  2.87s/it]

{'loss': 3.2586, 'grad_norm': 0.7844573855400085, 'learning_rate': 5.466487218911942e-06, 'epoch': 0.9}


 90%|████████▉ | 1034/1151 [1:02:01<05:43,  2.94s/it]

{'loss': 3.4526, 'grad_norm': 0.7637912034988403, 'learning_rate': 5.375063092816313e-06, 'epoch': 0.9}


 90%|████████▉ | 1035/1151 [1:02:05<06:08,  3.17s/it]

{'loss': 3.4625, 'grad_norm': 0.6666600704193115, 'learning_rate': 5.284388820651331e-06, 'epoch': 0.9}


 90%|█████████ | 1036/1151 [1:02:08<06:01,  3.15s/it]

{'loss': 3.0751, 'grad_norm': 0.7799046039581299, 'learning_rate': 5.194465120963899e-06, 'epoch': 0.9}


 90%|█████████ | 1037/1151 [1:02:11<05:58,  3.14s/it]

{'loss': 3.3738, 'grad_norm': 0.7718430757522583, 'learning_rate': 5.105292706353093e-06, 'epoch': 0.9}


 90%|█████████ | 1038/1151 [1:02:14<05:46,  3.07s/it]

{'loss': 3.1035, 'grad_norm': 0.7848131656646729, 'learning_rate': 5.0168722834644135e-06, 'epoch': 0.9}


 90%|█████████ | 1039/1151 [1:02:16<05:30,  2.95s/it]

{'loss': 2.6434, 'grad_norm': 0.7825751304626465, 'learning_rate': 4.929204552984168e-06, 'epoch': 0.9}


 90%|█████████ | 1040/1151 [1:02:20<05:57,  3.22s/it]

{'loss': 3.8394, 'grad_norm': 0.6879981160163879, 'learning_rate': 4.84229020963406e-06, 'epoch': 0.9}


 90%|█████████ | 1041/1151 [1:02:23<05:42,  3.11s/it]

{'loss': 2.8543, 'grad_norm': 0.7772479057312012, 'learning_rate': 4.756129942165511e-06, 'epoch': 0.9}


 91%|█████████ | 1042/1151 [1:02:26<05:35,  3.08s/it]

{'loss': 3.1621, 'grad_norm': 0.7581967711448669, 'learning_rate': 4.670724433354279e-06, 'epoch': 0.9}


 91%|█████████ | 1043/1151 [1:02:29<05:33,  3.09s/it]

{'loss': 3.3884, 'grad_norm': 0.7936345338821411, 'learning_rate': 4.586074359995119e-06, 'epoch': 0.91}


 91%|█████████ | 1044/1151 [1:02:32<05:22,  3.01s/it]

{'loss': 2.751, 'grad_norm': 0.7845537662506104, 'learning_rate': 4.502180392896272e-06, 'epoch': 0.91}


 91%|█████████ | 1045/1151 [1:02:35<05:26,  3.08s/it]

{'loss': 3.467, 'grad_norm': 0.692824125289917, 'learning_rate': 4.41904319687424e-06, 'epoch': 0.91}


 91%|█████████ | 1046/1151 [1:02:39<05:31,  3.15s/it]

{'loss': 3.0637, 'grad_norm': 0.7123619914054871, 'learning_rate': 4.336663430748555e-06, 'epoch': 0.91}


 91%|█████████ | 1047/1151 [1:02:41<05:11,  2.99s/it]

{'loss': 2.6472, 'grad_norm': 0.802973747253418, 'learning_rate': 4.255041747336452e-06, 'epoch': 0.91}


 91%|█████████ | 1048/1151 [1:02:44<05:01,  2.93s/it]

{'loss': 2.9489, 'grad_norm': 0.823865532875061, 'learning_rate': 4.174178793447758e-06, 'epoch': 0.91}


 91%|█████████ | 1049/1151 [1:02:47<05:01,  2.96s/it]

{'loss': 3.0872, 'grad_norm': 0.8701182007789612, 'learning_rate': 4.094075209879788e-06, 'epoch': 0.91}


 91%|█████████ | 1050/1151 [1:02:50<05:04,  3.02s/it]

{'loss': 3.2799, 'grad_norm': 0.7314190864562988, 'learning_rate': 4.014731631412227e-06, 'epoch': 0.91}


 91%|█████████▏| 1051/1151 [1:02:53<05:06,  3.06s/it]

{'loss': 3.2532, 'grad_norm': 0.7028700709342957, 'learning_rate': 3.936148686802077e-06, 'epoch': 0.91}


 91%|█████████▏| 1052/1151 [1:02:57<05:06,  3.10s/it]

{'loss': 3.5107, 'grad_norm': 0.7650861740112305, 'learning_rate': 3.858326998778761e-06, 'epoch': 0.91}


 91%|█████████▏| 1053/1151 [1:03:00<04:57,  3.04s/it]

{'loss': 2.8349, 'grad_norm': 0.7985067963600159, 'learning_rate': 3.7812671840390835e-06, 'epoch': 0.91}


 92%|█████████▏| 1054/1151 [1:03:02<04:45,  2.94s/it]

{'loss': 2.7746, 'grad_norm': 0.8247995972633362, 'learning_rate': 3.704969853242446e-06, 'epoch': 0.92}


 92%|█████████▏| 1055/1151 [1:03:05<04:45,  2.97s/it]

{'loss': 3.5072, 'grad_norm': 0.7728811502456665, 'learning_rate': 3.6294356110059157e-06, 'epoch': 0.92}


 92%|█████████▏| 1056/1151 [1:03:08<04:40,  2.95s/it]

{'loss': 2.7984, 'grad_norm': 0.7581237554550171, 'learning_rate': 3.5546650558994978e-06, 'epoch': 0.92}


 92%|█████████▏| 1057/1151 [1:03:11<04:34,  2.92s/it]

{'loss': 2.9386, 'grad_norm': 0.825867772102356, 'learning_rate': 3.48065878044137e-06, 'epoch': 0.92}


 92%|█████████▏| 1058/1151 [1:03:14<04:30,  2.90s/it]

{'loss': 3.163, 'grad_norm': 0.8298675417900085, 'learning_rate': 3.40741737109318e-06, 'epoch': 0.92}


 92%|█████████▏| 1059/1151 [1:03:17<04:28,  2.92s/it]

{'loss': 2.9085, 'grad_norm': 0.7638279795646667, 'learning_rate': 3.3349414082553875e-06, 'epoch': 0.92}


 92%|█████████▏| 1060/1151 [1:03:20<04:23,  2.90s/it]

{'loss': 3.1443, 'grad_norm': 0.8133586049079895, 'learning_rate': 3.2632314662627284e-06, 'epoch': 0.92}


 92%|█████████▏| 1061/1151 [1:03:23<04:27,  2.97s/it]

{'loss': 3.0809, 'grad_norm': 0.7624809145927429, 'learning_rate': 3.1922881133795825e-06, 'epoch': 0.92}


 92%|█████████▏| 1062/1151 [1:03:26<04:29,  3.03s/it]

{'loss': 3.1377, 'grad_norm': 0.755732536315918, 'learning_rate': 3.1221119117954887e-06, 'epoch': 0.92}


 92%|█████████▏| 1063/1151 [1:03:30<04:44,  3.24s/it]

{'loss': 3.6034, 'grad_norm': 0.6680932641029358, 'learning_rate': 3.0527034176207727e-06, 'epoch': 0.92}


 92%|█████████▏| 1064/1151 [1:03:32<04:29,  3.10s/it]

{'loss': 2.9395, 'grad_norm': 0.7688440680503845, 'learning_rate': 2.984063180882013e-06, 'epoch': 0.92}


 93%|█████████▎| 1065/1151 [1:03:35<04:16,  2.98s/it]

{'loss': 2.875, 'grad_norm': 0.8023025393486023, 'learning_rate': 2.916191745517749e-06, 'epoch': 0.92}


 93%|█████████▎| 1066/1151 [1:03:38<04:09,  2.94s/it]

{'loss': 2.9488, 'grad_norm': 0.8152556419372559, 'learning_rate': 2.8490896493742035e-06, 'epoch': 0.93}


 93%|█████████▎| 1067/1151 [1:03:41<04:03,  2.89s/it]

{'loss': 3.0195, 'grad_norm': 0.8148603439331055, 'learning_rate': 2.7827574242009437e-06, 'epoch': 0.93}


 93%|█████████▎| 1068/1151 [1:03:44<04:02,  2.93s/it]

{'loss': 3.0718, 'grad_norm': 0.8131545186042786, 'learning_rate': 2.7171955956467154e-06, 'epoch': 0.93}


 93%|█████████▎| 1069/1151 [1:03:47<04:00,  2.93s/it]

{'loss': 3.0698, 'grad_norm': 0.8445244431495667, 'learning_rate': 2.652404683255283e-06, 'epoch': 0.93}


 93%|█████████▎| 1070/1151 [1:03:49<03:51,  2.86s/it]

{'loss': 3.254, 'grad_norm': 0.830278754234314, 'learning_rate': 2.5883852004613074e-06, 'epoch': 0.93}


 93%|█████████▎| 1071/1151 [1:03:52<03:52,  2.91s/it]

{'loss': 3.0246, 'grad_norm': 0.7492499351501465, 'learning_rate': 2.525137654586185e-06, 'epoch': 0.93}


 93%|█████████▎| 1072/1151 [1:03:56<03:52,  2.94s/it]

{'loss': 3.196, 'grad_norm': 0.7878856658935547, 'learning_rate': 2.4626625468342157e-06, 'epoch': 0.93}


 93%|█████████▎| 1073/1151 [1:03:59<03:53,  3.00s/it]

{'loss': 3.4064, 'grad_norm': 0.7822268605232239, 'learning_rate': 2.4009603722884742e-06, 'epoch': 0.93}


 93%|█████████▎| 1074/1151 [1:04:01<03:42,  2.89s/it]

{'loss': 2.3516, 'grad_norm': 0.7777723073959351, 'learning_rate': 2.3400316199069238e-06, 'epoch': 0.93}


 93%|█████████▎| 1075/1151 [1:04:04<03:34,  2.83s/it]

{'loss': 3.0506, 'grad_norm': 0.8217331171035767, 'learning_rate': 2.2798767725185853e-06, 'epoch': 0.93}


 93%|█████████▎| 1076/1151 [1:04:07<03:35,  2.87s/it]

{'loss': 2.943, 'grad_norm': 0.6943807005882263, 'learning_rate': 2.2204963068196747e-06, 'epoch': 0.93}


 94%|█████████▎| 1077/1151 [1:04:10<03:32,  2.87s/it]

{'loss': 3.1014, 'grad_norm': 0.800353467464447, 'learning_rate': 2.1618906933698057e-06, 'epoch': 0.94}


 94%|█████████▎| 1078/1151 [1:04:13<03:35,  2.96s/it]

{'loss': 3.377, 'grad_norm': 0.7668176293373108, 'learning_rate': 2.104060396588314e-06, 'epoch': 0.94}


 94%|█████████▎| 1079/1151 [1:04:16<03:34,  2.98s/it]

{'loss': 3.2693, 'grad_norm': 0.756319522857666, 'learning_rate': 2.0470058747505516e-06, 'epoch': 0.94}


 94%|█████████▍| 1080/1151 [1:04:19<03:29,  2.95s/it]

{'loss': 3.0427, 'grad_norm': 0.8127665519714355, 'learning_rate': 1.9907275799842417e-06, 'epoch': 0.94}


 94%|█████████▍| 1081/1151 [1:04:22<03:20,  2.87s/it]

{'loss': 2.8103, 'grad_norm': 0.8189074397087097, 'learning_rate': 1.935225958265896e-06, 'epoch': 0.94}


 94%|█████████▍| 1082/1151 [1:04:25<03:25,  2.98s/it]

{'loss': 3.3255, 'grad_norm': 0.759559154510498, 'learning_rate': 1.8805014494173157e-06, 'epoch': 0.94}


 94%|█████████▍| 1083/1151 [1:04:28<03:18,  2.92s/it]

{'loss': 2.7986, 'grad_norm': 0.8049985766410828, 'learning_rate': 1.8265544871020723e-06, 'epoch': 0.94}


 94%|█████████▍| 1084/1151 [1:04:30<03:14,  2.90s/it]

{'loss': 2.9259, 'grad_norm': 0.8070777058601379, 'learning_rate': 1.7733854988220778e-06, 'epoch': 0.94}


 94%|█████████▍| 1085/1151 [1:04:33<03:12,  2.91s/it]

{'loss': 2.7912, 'grad_norm': 0.7830700874328613, 'learning_rate': 1.7209949059142083e-06, 'epoch': 0.94}


 94%|█████████▍| 1086/1151 [1:04:36<03:08,  2.90s/it]

{'loss': 2.9791, 'grad_norm': 0.8120966553688049, 'learning_rate': 1.6693831235469414e-06, 'epoch': 0.94}


 94%|█████████▍| 1087/1151 [1:04:40<03:14,  3.03s/it]

{'loss': 3.6375, 'grad_norm': 0.7206579446792603, 'learning_rate': 1.6185505607171026e-06, 'epoch': 0.94}


 95%|█████████▍| 1088/1151 [1:04:42<03:05,  2.95s/it]

{'loss': 2.5423, 'grad_norm': 0.7383268475532532, 'learning_rate': 1.5684976202465784e-06, 'epoch': 0.94}


 95%|█████████▍| 1089/1151 [1:04:45<03:00,  2.90s/it]

{'loss': 3.1345, 'grad_norm': 0.8411866426467896, 'learning_rate': 1.5192246987791981e-06, 'epoch': 0.95}


 95%|█████████▍| 1090/1151 [1:04:48<03:01,  2.98s/it]

{'loss': 3.5109, 'grad_norm': 0.8135318756103516, 'learning_rate': 1.4707321867774793e-06, 'epoch': 0.95}


 95%|█████████▍| 1091/1151 [1:04:52<03:06,  3.11s/it]

{'loss': 3.5595, 'grad_norm': 0.6941973567008972, 'learning_rate': 1.4230204685196203e-06, 'epoch': 0.95}


 95%|█████████▍| 1092/1151 [1:04:55<03:03,  3.11s/it]

{'loss': 3.4707, 'grad_norm': 0.7880838513374329, 'learning_rate': 1.3760899220964685e-06, 'epoch': 0.95}


 95%|█████████▍| 1093/1151 [1:04:58<03:08,  3.25s/it]

{'loss': 3.7394, 'grad_norm': 0.703346312046051, 'learning_rate': 1.3299409194084122e-06, 'epoch': 0.95}


 95%|█████████▌| 1094/1151 [1:05:01<02:59,  3.14s/it]

{'loss': 3.0353, 'grad_norm': 0.818889319896698, 'learning_rate': 1.2845738261625828e-06, 'epoch': 0.95}


 95%|█████████▌| 1095/1151 [1:05:04<02:46,  2.97s/it]

{'loss': 2.6166, 'grad_norm': 0.7703680396080017, 'learning_rate': 1.2399890018698347e-06, 'epoch': 0.95}


 95%|█████████▌| 1096/1151 [1:05:07<02:38,  2.88s/it]

{'loss': 2.745, 'grad_norm': 0.8142364025115967, 'learning_rate': 1.1961867998419585e-06, 'epoch': 0.95}


 95%|█████████▌| 1097/1151 [1:05:10<02:37,  2.91s/it]

{'loss': 3.1107, 'grad_norm': 0.7584063410758972, 'learning_rate': 1.1531675671888619e-06, 'epoch': 0.95}


 95%|█████████▌| 1098/1151 [1:05:13<02:37,  2.98s/it]

{'loss': 3.1944, 'grad_norm': 0.7687061429023743, 'learning_rate': 1.1109316448158268e-06, 'epoch': 0.95}


 95%|█████████▌| 1099/1151 [1:05:15<02:29,  2.87s/it]

{'loss': 2.7943, 'grad_norm': 0.7887791991233826, 'learning_rate': 1.0694793674208114e-06, 'epoch': 0.95}


 96%|█████████▌| 1100/1151 [1:05:18<02:30,  2.96s/it]

{'loss': 3.4001, 'grad_norm': 0.7663124799728394, 'learning_rate': 1.0288110634917636e-06, 'epoch': 0.96}


 96%|█████████▌| 1101/1151 [1:05:22<02:32,  3.05s/it]

{'loss': 3.2759, 'grad_norm': 0.8848044872283936, 'learning_rate': 9.889270553040786e-07, 'epoch': 0.96}


 96%|█████████▌| 1102/1151 [1:05:25<02:27,  3.01s/it]

{'loss': 3.0473, 'grad_norm': 0.8160297274589539, 'learning_rate': 9.49827658918001e-07, 'epoch': 0.96}


 96%|█████████▌| 1103/1151 [1:05:28<02:28,  3.09s/it]

{'loss': 3.1636, 'grad_norm': 0.7364712953567505, 'learning_rate': 9.11513184176116e-07, 'epoch': 0.96}


 96%|█████████▌| 1104/1151 [1:05:30<02:18,  2.94s/it]

{'loss': 2.5945, 'grad_norm': 0.8142598867416382, 'learning_rate': 8.739839347009171e-07, 'epoch': 0.96}


 96%|█████████▌| 1105/1151 [1:05:34<02:21,  3.08s/it]

{'loss': 3.685, 'grad_norm': 0.7313694953918457, 'learning_rate': 8.372402078924091e-07, 'epoch': 0.96}


 96%|█████████▌| 1106/1151 [1:05:38<02:26,  3.25s/it]

{'loss': 3.9122, 'grad_norm': 0.7298711538314819, 'learning_rate': 8.012822949256982e-07, 'epoch': 0.96}


 96%|█████████▌| 1107/1151 [1:05:40<02:13,  3.04s/it]

{'loss': 2.3875, 'grad_norm': 0.8399088978767395, 'learning_rate': 7.661104807487607e-07, 'epoch': 0.96}


 96%|█████████▋| 1108/1151 [1:05:43<02:08,  2.98s/it]

{'loss': 2.9273, 'grad_norm': 0.7579178810119629, 'learning_rate': 7.317250440801116e-07, 'epoch': 0.96}


 96%|█████████▋| 1109/1151 [1:05:46<02:04,  2.96s/it]

{'loss': 2.9777, 'grad_norm': 0.7393977642059326, 'learning_rate': 6.981262574066394e-07, 'epoch': 0.96}


 96%|█████████▋| 1110/1151 [1:05:49<01:58,  2.90s/it]

{'loss': 2.8669, 'grad_norm': 0.7909334897994995, 'learning_rate': 6.653143869814526e-07, 'epoch': 0.96}


 97%|█████████▋| 1111/1151 [1:05:51<01:55,  2.88s/it]

{'loss': 3.1182, 'grad_norm': 0.7850911617279053, 'learning_rate': 6.332896928217257e-07, 'epoch': 0.96}


 97%|█████████▋| 1112/1151 [1:05:54<01:54,  2.93s/it]

{'loss': 3.3468, 'grad_norm': 0.7949103713035583, 'learning_rate': 6.020524287066787e-07, 'epoch': 0.97}


 97%|█████████▋| 1113/1151 [1:05:57<01:51,  2.93s/it]

{'loss': 3.0012, 'grad_norm': 0.7277156114578247, 'learning_rate': 5.716028421755671e-07, 'epoch': 0.97}


 97%|█████████▋| 1114/1151 [1:06:00<01:48,  2.93s/it]

{'loss': 2.9618, 'grad_norm': 0.807074785232544, 'learning_rate': 5.419411745256841e-07, 'epoch': 0.97}


 97%|█████████▋| 1115/1151 [1:06:03<01:45,  2.92s/it]

{'loss': 3.0453, 'grad_norm': 0.7233234643936157, 'learning_rate': 5.130676608104845e-07, 'epoch': 0.97}


 97%|█████████▋| 1116/1151 [1:06:06<01:38,  2.82s/it]

{'loss': 2.7582, 'grad_norm': 0.8052281737327576, 'learning_rate': 4.849825298377186e-07, 'epoch': 0.97}


 97%|█████████▋| 1117/1151 [1:06:09<01:37,  2.87s/it]

{'loss': 3.0119, 'grad_norm': 0.7533091902732849, 'learning_rate': 4.576860041675679e-07, 'epoch': 0.97}


 97%|█████████▋| 1118/1151 [1:06:12<01:37,  2.96s/it]

{'loss': 3.3315, 'grad_norm': 0.7407034039497375, 'learning_rate': 4.3117830011097926e-07, 'epoch': 0.97}


 97%|█████████▋| 1119/1151 [1:06:15<01:33,  2.91s/it]

{'loss': 3.0038, 'grad_norm': 0.783211350440979, 'learning_rate': 4.054596277278666e-07, 'epoch': 0.97}


 97%|█████████▋| 1120/1151 [1:06:18<01:28,  2.87s/it]

{'loss': 2.9384, 'grad_norm': 0.8170983791351318, 'learning_rate': 3.805301908254455e-07, 'epoch': 0.97}


 97%|█████████▋| 1121/1151 [1:06:21<01:30,  3.01s/it]

{'loss': 3.0747, 'grad_norm': 0.7324486374855042, 'learning_rate': 3.56390186956701e-07, 'epoch': 0.97}


 97%|█████████▋| 1122/1151 [1:06:24<01:25,  2.95s/it]

{'loss': 2.8786, 'grad_norm': 0.7858882546424866, 'learning_rate': 3.3303980741873355e-07, 'epoch': 0.97}


 98%|█████████▊| 1123/1151 [1:06:27<01:23,  2.98s/it]

{'loss': 3.4069, 'grad_norm': 0.8481975197792053, 'learning_rate': 3.104792372512821e-07, 'epoch': 0.98}


 98%|█████████▊| 1124/1151 [1:06:30<01:20,  2.98s/it]

{'loss': 3.2425, 'grad_norm': 0.8657351732254028, 'learning_rate': 2.8870865523525915e-07, 'epoch': 0.98}


 98%|█████████▊| 1125/1151 [1:06:33<01:16,  2.94s/it]

{'loss': 2.8972, 'grad_norm': 0.7537127137184143, 'learning_rate': 2.6772823389131787e-07, 'epoch': 0.98}


 98%|█████████▊| 1126/1151 [1:06:35<01:11,  2.85s/it]

{'loss': 2.53, 'grad_norm': 0.7898945808410645, 'learning_rate': 2.4753813947849815e-07, 'epoch': 0.98}


 98%|█████████▊| 1127/1151 [1:06:38<01:09,  2.88s/it]

{'loss': 3.0441, 'grad_norm': 0.781391441822052, 'learning_rate': 2.2813853199292746e-07, 'epoch': 0.98}


 98%|█████████▊| 1128/1151 [1:06:41<01:04,  2.80s/it]

{'loss': 2.6635, 'grad_norm': 0.7895097136497498, 'learning_rate': 2.0952956516649969e-07, 'epoch': 0.98}


 98%|█████████▊| 1129/1151 [1:06:44<01:02,  2.85s/it]

{'loss': 3.3446, 'grad_norm': 0.8587287068367004, 'learning_rate': 1.917113864656983e-07, 'epoch': 0.98}


 98%|█████████▊| 1130/1151 [1:06:47<01:01,  2.93s/it]

{'loss': 3.297, 'grad_norm': 0.7295918464660645, 'learning_rate': 1.7468413709040843e-07, 'epoch': 0.98}


 98%|█████████▊| 1131/1151 [1:06:50<00:59,  2.95s/it]

{'loss': 3.1613, 'grad_norm': 0.7696442008018494, 'learning_rate': 1.58447951972851e-07, 'epoch': 0.98}


 98%|█████████▊| 1132/1151 [1:06:52<00:53,  2.82s/it]

{'loss': 2.3425, 'grad_norm': 0.8105838298797607, 'learning_rate': 1.430029597764171e-07, 'epoch': 0.98}


 98%|█████████▊| 1133/1151 [1:06:55<00:51,  2.85s/it]

{'loss': 3.0304, 'grad_norm': 0.7511553764343262, 'learning_rate': 1.2834928289472416e-07, 'epoch': 0.98}


 99%|█████████▊| 1134/1151 [1:06:58<00:49,  2.89s/it]

{'loss': 3.1061, 'grad_norm': 0.7408425807952881, 'learning_rate': 1.1448703745061684e-07, 'epoch': 0.98}


 99%|█████████▊| 1135/1151 [1:07:01<00:43,  2.74s/it]

{'loss': 2.2659, 'grad_norm': 0.8630918264389038, 'learning_rate': 1.0141633329525668e-07, 'epoch': 0.99}


 99%|█████████▊| 1136/1151 [1:07:04<00:42,  2.86s/it]

{'loss': 3.57, 'grad_norm': 0.7481727004051208, 'learning_rate': 8.913727400726712e-08, 'epoch': 0.99}


 99%|█████████▉| 1137/1151 [1:07:07<00:39,  2.83s/it]

{'loss': 2.8625, 'grad_norm': 0.772966742515564, 'learning_rate': 7.76499568918454e-08, 'epoch': 0.99}


 99%|█████████▉| 1138/1151 [1:07:10<00:39,  3.03s/it]

{'loss': 3.6498, 'grad_norm': 0.7506792545318604, 'learning_rate': 6.695447298008528e-08, 'epoch': 0.99}


 99%|█████████▉| 1139/1151 [1:07:13<00:36,  3.03s/it]

{'loss': 3.3143, 'grad_norm': 0.7360644340515137, 'learning_rate': 5.705090702819993e-08, 'epoch': 0.99}


 99%|█████████▉| 1140/1151 [1:07:17<00:35,  3.20s/it]

{'loss': 4.0577, 'grad_norm': 0.7659308314323425, 'learning_rate': 4.7939337516833546e-08, 'epoch': 0.99}


 99%|█████████▉| 1141/1151 [1:07:20<00:31,  3.13s/it]

{'loss': 3.3581, 'grad_norm': 0.7654390335083008, 'learning_rate': 3.961983665049518e-08, 'epoch': 0.99}


 99%|█████████▉| 1142/1151 [1:07:23<00:29,  3.26s/it]

{'loss': 3.6085, 'grad_norm': 0.7095639705657959, 'learning_rate': 3.2092470356948066e-08, 'epoch': 0.99}


 99%|█████████▉| 1143/1151 [1:07:26<00:24,  3.07s/it]

{'loss': 2.9539, 'grad_norm': 0.824100911617279, 'learning_rate': 2.5357298286698973e-08, 'epoch': 0.99}


 99%|█████████▉| 1144/1151 [1:07:29<00:21,  3.04s/it]

{'loss': 2.9747, 'grad_norm': 0.7302827835083008, 'learning_rate': 1.9414373812509655e-08, 'epoch': 0.99}


 99%|█████████▉| 1145/1151 [1:07:32<00:17,  2.98s/it]

{'loss': 3.3629, 'grad_norm': 0.8509369492530823, 'learning_rate': 1.426374402901942e-08, 'epoch': 0.99}


100%|█████████▉| 1146/1151 [1:07:35<00:14,  2.96s/it]

{'loss': 2.7156, 'grad_norm': 0.7977137565612793, 'learning_rate': 9.90544975230101e-09, 'epoch': 1.0}


100%|█████████▉| 1147/1151 [1:07:37<00:11,  2.85s/it]

{'loss': 2.5634, 'grad_norm': 0.8324938416481018, 'learning_rate': 6.3395255195941585e-09, 'epoch': 1.0}


100%|█████████▉| 1148/1151 [1:07:40<00:08,  2.99s/it]

{'loss': 3.4496, 'grad_norm': 0.718172013759613, 'learning_rate': 3.565999589016933e-09, 'epoch': 1.0}


100%|█████████▉| 1149/1151 [1:07:43<00:05,  2.93s/it]

{'loss': 2.7289, 'grad_norm': 0.8390475511550903, 'learning_rate': 1.5848939393436902e-09, 'epoch': 1.0}


100%|█████████▉| 1150/1151 [1:07:46<00:02,  2.99s/it]

{'loss': 2.737, 'grad_norm': 0.7340714931488037, 'learning_rate': 3.9622426980523433e-10, 'epoch': 1.0}


100%|██████████| 1151/1151 [1:07:49<00:00,  2.96s/it]

{'loss': 3.1121, 'grad_norm': 0.8045836687088013, 'learning_rate': 0.0, 'epoch': 1.0}


100%|██████████| 1151/1151 [1:08:01<00:00,  3.55s/it]

{'train_runtime': 4083.0623, 'train_samples_per_second': 2.257, 'train_steps_per_second': 0.282, 'train_loss': 3.2312729984237047, 'epoch': 1.0}





TrainOutput(global_step=1151, training_loss=3.2312729984237047, metrics={'train_runtime': 4083.0623, 'train_samples_per_second': 2.257, 'train_steps_per_second': 0.282, 'total_flos': 4989322401423360.0, 'train_loss': 3.2312729984237047, 'epoch': 0.9993488170175819})
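
The throughput figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below copies the numbers from the metrics above. The only assumption is that every optimizer step consumes the same number of samples, which the logs do not state explicitly.

```python
# Values copied from the TrainOutput metrics above.
global_step = 1151
train_runtime_s = 4083.0623
samples_per_second = 2.257

# Optimizer steps per second, recomputed from step count and wall-clock time.
steps_per_second = global_step / train_runtime_s

# Assumption: each optimizer step consumes a fixed number of samples
# (per-device batch size x gradient accumulation steps x devices).
implied_samples_per_step = samples_per_second / steps_per_second

# Approximate number of training samples processed during the run.
approx_samples_seen = samples_per_second * train_runtime_s

print(f"steps/s: {steps_per_second:.3f}")
print(f"implied samples per optimizer step: {implied_samples_per_step:.2f}")
print(f"approx. training samples seen: {approx_samples_seen:.0f}")
```

The recomputed steps-per-second value agrees with the reported `train_steps_per_second`, and the ratio suggests an effective batch of about 8 samples per optimizer step.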

In [23]:
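import wandb
# Close the Weights & Biases run so the logged metrics are flushed and summarised.
wandb.finish()
# Re-enable the key/value cache for inference (it is typically off while training with gradient checkpointing).
model.config.use_cache = True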

Run history:
eval/loss,█▄▂▁
eval/runtime,█▁▁▁
eval/samples_per_second,▁███
eval/steps_per_second,▁███
train/epoch,▁▁▁▁▁▁▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇███
train/grad_norm,▄▆█▄▄▃▃▄▄▄▆▅▅▆▃▂▁▃▃▂▂▃▃▂▂▁▁▂▅▂▂▂▄▃▃▄▃▂▄▂
train/learning_rate,█████▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁
train/loss,▇▆▆▆▅▃▄▁▄▄█▄▇▄▄▆▅▇▅▅▅▁▇▅▃▄▆▃▂▃▃▃▃▂▆▆▆▆▄▂

Run summary:
eval/loss,3.0007
eval/runtime,136.4706
eval/samples_per_second,8.434
eval/steps_per_second,1.055
total_flos,4989322401423360.0
train/epoch,0.99935
train/global_step,1151.0
train/grad_norm,0.80458
train/learning_rate,0.0
train/loss,3.1121
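
The learning rates logged over the final steps fall to exactly `0.0`, which is the shape of a cosine decay. The sketch below is a plain-Python version of linear warmup followed by half-cosine decay. The peak learning rate (`2e-4`) and warmup length (35 steps) are hypothetical values, not taken from the logs. They were chosen because they reproduce the last logged learning rates closely, and the actual training arguments may differ.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: not stated in the logs, chosen because they match them.
PEAK_LR = 2e-4
WARMUP_STEPS = 35
TOTAL_STEPS = 1151  # taken from the training progress bar

for step in (1149, 1150, 1151):
    lr = cosine_lr_with_warmup(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    print(step, lr)
```

Under these assumptions the computed values for steps 1149 and 1150 are close to the logged `1.58e-09` and `3.96e-10`, and step 1151 gives `0.0`.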


In [24]:
# Save trained model and tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('llama-3.2-fine-tuned-model\\tokenizer_config.json',
 'llama-3.2-fine-tuned-model\\special_tokens_map.json',
 'llama-3.2-fine-tuned-model\\tokenizer.json')

In [25]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

  0%|          | 0/1153 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)
100%|██████████| 1153/1153 [17:34<00:00,  1.09it/s]

Accuracy: 0.874
Accuracy for label Positive: 0.940
Accuracy for label Neutral: 0.578
Accuracy for label Negative: 0.795

Classification Report:
              precision    recall  f1-score   support

    Positive     0.9618    0.9402    0.9509       803
     Neutral     0.4527    0.5776    0.5076       116
    Negative     0.8455    0.7949    0.8194       234

    accuracy                         0.8742      1153
   macro avg     0.7533    0.7709    0.7593      1153
weighted avg     0.8870    0.8742    0.8796      1153


Confusion Matrix:
[[755  41   7]
 [ 22  67  27]
 [  8  40 186]]
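
The figures in the classification report can be recomputed directly from the confusion matrix. The sketch below does this in plain Python. It assumes that rows are true labels and columns are predicted labels, in the order Positive, Neutral, Negative. That assumption is consistent with the row sums matching the `support` column, but the `evaluate` helper itself is not shown here.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true labels, columns are predicted labels.
labels = ["Positive", "Neutral", "Negative"]
cm = [
    [755, 41, 7],
    [22, 67, 27],
    [8, 40, 186],
]

total = sum(sum(row) for row in cm)
# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, label in enumerate(labels):
    support = sum(cm[i])                    # samples whose true label is `label`
    predicted = sum(row[i] for row in cm)   # samples predicted as `label`
    recall = cm[i][i] / support
    precision = cm[i][i] / predicted
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1, support)

print(f"accuracy: {accuracy:.4f}")
for label, (p, r, f1, n) in metrics.items():
    print(f"{label:>8}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}  support={n}")
```

Computed this way, the "Accuracy for label ..." values printed above coincide with per-class recall. The Neutral class is the weak spot: it has the fewest examples (116) and is confused with both neighbouring classes.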



