# QLoRA in Practice with the PEFT Library: ChatGLM3-6B

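Quantized models are usually not trained further on downstream tasks, because the reduced precision of their weights and activations can make training unstable.

PEFT methods, however, only add a small number of extra trainable parameters, so a quantized model can still be trained through a PEFT adapter. Combining quantization with PEFT is therefore a practical strategy for fine-tuning large models on a single GPU.

`QLoRA`, for example, quantizes a model to 4 bits and then trains it with LoRA. The QLoRA paper reports fine-tuning a 65B-parameter model on a single 48GB GPU; this tutorial applies the same recipe to the 6B-parameter ChatGLM3-6B on a single 16GB GPU (an NVIDIA T4).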

THUDM on Hugging Face: https://huggingface.co/THUDM

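## About This Tutorial

This tutorial fine-tunes `ChatGLM3-6b` with QLoRA, using the quantization techniques introduced in the QLoRA paper: the `NF4 data type`, `double quantization`, and `mixed-precision computation`. It walks through the complete QLoRA fine-tuning workflow:

- Data preparation
    - Download the dataset
    - Write a tokenize function to process samples (map, shuffle, flatten)
    - Define a custom batch collator class, DataCollatorForChatGLM
- Model training
    - Load the quantized ChatGLM3-6B model
    - Prepare the quantized model for PEFT (prepare_model_for_kbit_training)
    - Configure the QLoRA adapter (TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING)
    - Configure the training hyperparameters (TrainingArguments)
    - Start training (trainer.train)
    - Save the QLoRA model (trainer.model.save_pretrained)
- [Model inference](peft_chatglm_inference.ipynb)
    - Load the ChatGLM3-6B base model
    - Load the ChatGLM3-6B QLoRA model (PEFT adapter)
    - Compare results before and after fine-tuning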

In [1]:
# Global variables and parameters
model_name_or_path = 'THUDM/chatglm3-6b'  # model ID or local path
train_data_path = 'HasturOfficial/adgen'    # training data path
eval_data_path = None                     # evaluation data path; None if there is none
seed = 8                                 # random seed
max_input_length = 512                    # maximum input length
max_output_length = 1536                  # maximum output length
lora_rank = 4                             # LoRA rank
lora_alpha = 32                           # LoRA alpha
lora_dropout = 0.05                       # LoRA dropout rate
resume_from_checkpoint = None             # checkpoint path when resuming training
prompt_text = ''                          # instruction text prepended to every sample
compute_dtype = 'fp32'                    # compute dtype (fp32, fp16, bf16)

## Data Preparation

### Download the Dataset

Load the adgen dataset from Hugging Face, then tokenize and shuffle it.

In [2]:
from datasets import load_dataset

dataset = load_dataset(train_data_path)



In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['content', 'summary'],
        num_rows: 114599
    })
    validation: Dataset({
        features: ['content', 'summary'],
        num_rows: 1070
    })
})

In [4]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [5]:
show_random_elements(dataset["train"], num_examples=3)

Unnamed: 0,content,summary
0,类型#裤*颜色#黑色*颜色#灰色*风格#运动*风格#休闲,黑色和灰色的配色非常适合商务通勤的场合，在运动休闲的时刻也非常适宜。竖线的文理立体感很强，富有机理感，小心机的将腿部曲线进行了拉伸。腰间的系带勾勒出纤细的腰围，也延伸出唯美的韵味。锥形裤的版型非常清爽。
1,类型#上衣*版型#宽松*版型#显瘦*颜色#黑白*图案#蝴蝶结*图案#撞色*衣样式#衬衫*衣领型#翻领*衣袖型#灯笼袖*衣门襟#系带,这款衬衫整体采用纯白色的设计，圆润的翻领设计流露出纯洁少女般的感觉。领部撞色的蝴蝶结系带设计，经典的黑白撞色衬托出简洁大方的美感，同时也为整体增添了可爱的风格。舒适宽松的灯笼袖造型，巧妙的遮肉显瘦也更显俏皮味道。
2,类型#上衣*风格#中国风*图案#刺绣*衣样式#卫衣*衣门襟#套头,这款无套头卫衣衣袖处采用中国风的刺绣工艺绣制有“<UNK>”“<UNK>”二字，赋予了其美好吉祥的寓意，汉字与其工艺结合起来，又带有浓浓的传统文化色彩。无套头的卫衣设计，利落清爽。没有普通卫衣的厚重臃肿感，美型自然。


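### Processing Data with the ChatGLM3-6b Tokenizer


About the `ignore_label_id` setting:

Many NLP and machine learning frameworks set `ignore_label_id` to -100 by convention. This special value marks target labels that the loss function should skip. The reasons behind the choice are as follows:

1. **The loss function skips specific values**: when training a language model, the loss (for example cross-entropy) should only be computed on the labels that matter for the model's predictions. In a sequence-to-sequence setup, the labels for the input part are usually set to an ignore value, because only the labels of the output part are relevant for training.

2. **Why -100**: the value comes from an implementation detail. PyTorch's cross-entropy loss takes an `ignore_index` parameter, whose default is -100, and skips every target equal to that index. -100 is a safe choice because it cannot collide with a real label: class labels and token IDs are non-negative integers starting at 0.

3. **Standardization and portability**: because several libraries and frameworks follow the same practice, -100 has become a fairly standard default for ignored labels, which keeps code portable and readable.

In short, setting `ignore_label_id` to -100 is a convenient way to exclude specific labels from the loss. This is especially useful in tasks that generate or rewrite sequences.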
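A minimal plain-Python sketch of how an ignore value keeps positions out of the loss. It mirrors the idea behind `ignore_index` but is not PyTorch's implementation; the function name `masked_cross_entropy` and the toy logits are assumptions made for illustration.

```python
import math

IGNORE_LABEL_ID = -100  # same sentinel value as the convention described above


def masked_cross_entropy(logits, labels, ignore_label_id=IGNORE_LABEL_ID):
    """Mean cross-entropy over positions whose label is not ignore_label_id."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_label_id:
            continue  # masked position: contributes nothing to the loss
        # Negative log-softmax of the target class, computed stably
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0


# Three positions over a 3-token vocabulary; the first position is masked.
logits = [[9.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
labels = [IGNORE_LABEL_ID, 1, 2]

loss_masked = masked_cross_entropy(logits, labels)
# Changing the logits at the masked position leaves the loss unchanged.
loss_changed = masked_cross_entropy([[0.0, 9.0, 0.0]] + logits[1:], labels)
print(loss_masked, loss_changed)
```

Both calls return the same value, because only the two unmasked positions enter the average.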

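#### Padding in ChatGLM3

- Padding in input_ids (the query) only brings each sequence up to the required length; pad tokens do not change the meaning of the original text.
- Labels (the answer) are compared against the model's output to compute the loss, so the padding in labels must use the ignore value to keep it out of the loss. The loss is computed only over the token positions of the answer, so the positions before the answer (the query and its special tokens) are also filled with the ignore value, and only the generated content contributes to the loss.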
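The two rules above can be sketched on a single toy sample. All token IDs here are hypothetical stand-ins chosen for readability, not real ChatGLM3 vocabulary IDs.

```python
IGNORE_LABEL_ID = -100

# Hypothetical token IDs, chosen only for illustration.
prefix_ids = [1, 2]      # stand-ins for the two special prefix tokens
q_ids = [11, 12, 13]     # tokenized query
a_ids = [21, 22]         # tokenized answer
eos_id = 9               # stand-in for the end-of-sequence token

input_ids = prefix_ids + q_ids + a_ids + [eos_id]
question_length = len(prefix_ids) + len(q_ids)

# Mask the prefix and the query; keep the answer and eos as training targets.
labels = [IGNORE_LABEL_ID] * question_length + input_ids[question_length:]

# Right-pad to a length of 10: pad ID for the inputs, ignore value for the labels.
pad_id, target_len = 0, 10
pad_len = target_len - len(input_ids)
padded_input_ids = input_ids + [pad_id] * pad_len
padded_labels = labels + [IGNORE_LABEL_ID] * pad_len

print(padded_input_ids)
print(padded_labels)
```

Only the positions holding `21`, `22`, and `9` carry real labels, so only they would contribute to the loss.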

In [6]:
from transformers import AutoTokenizer

# The ChatGLM3-6B revision 'b098244' sets use_reentrant=False
# The latest revision sets use_reentrant=True, which adds unnecessary GPU memory overhead
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          trust_remote_code=True,
                                          revision='b098244')



In [8]:
# tokenize_func function
def tokenize_func(example, tokenizer, ignore_label_id=-100):
    """
    Tokenize a single data sample.

    Args:
    example (dict): one training sample, a dict with 'content' and 'summary' keys.
    tokenizer (transformers.PreTrainedTokenizer): tokenizer used to tokenize the text.
    ignore_label_id (int, optional): ignore ID used to fill masked label positions, -100 by default.

    Returns:
    dict: a dict with 'input_ids' and 'labels', ready for model training.
    """

    # Build the question text
    question = prompt_text + example['content']
    if example.get('input', None) and example['input'].strip():
        question += f'\n{example["input"]}'

    # Build the answer text
    answer = example['summary']

    # Tokenize the question and the answer
    q_ids = tokenizer.encode(text=question, add_special_tokens=False)
    a_ids = tokenizer.encode(text=answer, add_special_tokens=False)

    # Truncate if the tokenized length exceeds the maximum length
    if len(q_ids) > max_input_length - 2:  # leave room for the gmask and bos tokens
        q_ids = q_ids[:max_input_length - 2]
    if len(a_ids) > max_output_length - 1:  # leave room for the eos token
        a_ids = a_ids[:max_output_length - 1]

    # Build the model input format
    input_ids = tokenizer.build_inputs_with_special_tokens(q_ids, a_ids)
    question_length = len(q_ids) + 2  # plus the gmask and bos tokens

    # Build the labels, filling the question part with ignore_label_id
    labels = [ignore_label_id] * question_length + input_ids[question_length:]

    return {'input_ids': input_ids, 'labels': labels}


In [10]:
column_names = dataset['train'].column_names
tokenized_dataset = dataset['train'].map(
    lambda example: tokenize_func(example, tokenizer),
    batched=False, 
    remove_columns=column_names
)

In [11]:
show_random_elements(tokenized_dataset, num_examples=1)

Unnamed: 0,input_ids,labels
0,"[64790, 64792, 30910, 33467, 31010, 56778, 30998, 33692, 31010, 40962, 30998, 32799, 31010, 40512, 30998, 32799, 31010, 51336, 30998, 32799, 31010, 40589, 30998, 37505, 31010, 55336, 54668, 30998, 56778, 54578, 56164, 31010, 54594, 56890, 30998, 56778, 40877, 31010, 56897, 54882, 30910, 56897, 54882, 31735, 31123, 54805, 38142, 54642, 36259, 56420, 55569, 31123, 41424, 32291, 37320, 34319, 31155, 40962, 55336, 35490, 51480, 31123, 33550, 46903, 35752, 47745, 34219, 31123, 32745, 54589, 35088, 34317, 31123, 54619, 33481, 35804, 31155, 40512, 54530, 56597, 55857, 31123, 45588, 54557, 40315, 39215, 55379, 38871, 31123, 56778, 56164, 54807, 54594, 56890, 31735, 31123, 33638, 41424, 54539, 54625, ...]","[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 30910, 56897, 54882, 31735, 31123, 54805, 38142, 54642, 36259, 56420, 55569, 31123, 41424, 32291, 37320, 34319, 31155, 40962, 55336, 35490, 51480, 31123, 33550, 46903, 35752, 47745, 34219, 31123, 32745, 54589, 35088, 34317, 31123, 54619, 33481, 35804, 31155, 40512, 54530, 56597, 55857, 31123, 45588, 54557, 40315, 39215, 55379, 38871, 31123, 56778, 56164, 54807, 54594, 56890, 31735, 31123, 33638, 41424, 54539, 54625, ...]"


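### Dataset Processing: shuffle & flatten 

Shuffling permutes the dataset's list of indices to create an indices mapping.

Once a dataset has an indices mapping, however, reads can become up to 10x slower: every access needs an extra lookup through the mapping to find the row to read, and, more importantly, rows are no longer read as contiguous chunks of data.

To restore the speed, call Dataset.flatten_indices(), which rewrites the whole dataset to disk and removes the indices mapping.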

ref: https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.flatten_indices
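The idea can be illustrated with a toy model of an indices mapping. `ToyDataset` is a hypothetical class written for this sketch; it is not the `datasets` API and ignores on-disk storage entirely.

```python
import random


class ToyDataset:
    """Toy model of a table with an optional indices mapping (not the datasets API)."""

    def __init__(self, rows, indices=None):
        self.rows = rows          # contiguous storage
        self.indices = indices    # None means rows are read in storage order

    def __len__(self):
        return len(self.indices) if self.indices is not None else len(self.rows)

    def __getitem__(self, i):
        # With a mapping, every read needs one extra lookup and jumps around storage
        if self.indices is not None:
            return self.rows[self.indices[i]]
        return self.rows[i]

    def shuffle(self, seed):
        # Shuffling only permutes the indices; the stored rows are untouched
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        if self.indices is not None:
            order = [self.indices[i] for i in order]
        return ToyDataset(self.rows, order)

    def flatten_indices(self):
        # Rewrite the rows in mapped order and drop the mapping
        return ToyDataset([self[i] for i in range(len(self))])


ds = ToyDataset(["a", "b", "c", "d", "e"])
shuffled = ds.shuffle(seed=8)
flat = shuffled.flatten_indices()
print(shuffled.indices, flat.rows)
```

After flattening, the rows are stored in the shuffled order and reads go straight to storage again.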

In [12]:
tokenized_dataset = tokenized_dataset.shuffle(seed=seed)

In [13]:
tokenized_dataset = tokenized_dataset.flatten_indices()

### Define the DataCollatorForChatGLM Class for Batching

In [14]:
import torch
from typing import List, Dict, Optional

# DataCollatorForChatGLM class
class DataCollatorForChatGLM:
    """
    Data collator for batching samples, intended for use with ChatGLM models.

    The class merges several tokenized samples into one batch and pads them where needed.

    Attributes:
    pad_token_id (int): token ID used for padding.
    max_length (int): maximum sequence length within a batch.
    ignore_label_id (int): ID used to pad the labels.
    """

    def __init__(self, pad_token_id: int, max_length: int = 2048, ignore_label_id: int = -100):
        """
        Initialize the data collator.

        Args:
        pad_token_id (int): token ID used for padding.
        max_length (int): maximum sequence length within a batch.
        ignore_label_id (int): ID used to pad the labels, -100 by default.
        """
        self.pad_token_id = pad_token_id
        self.ignore_label_id = ignore_label_id
        self.max_length = max_length

    def __call__(self, batch_data: List[Dict[str, List]]) -> Dict[str, torch.Tensor]:
        """
        Collate a batch of samples.

        Args:
        batch_data (List[Dict[str, List]]): list of sample dicts.

        Returns:
        Dict[str, torch.Tensor]: dict holding the collated batch tensors.
        """
        # Length of every sample in the batch
        len_list = [len(d['input_ids']) for d in batch_data]
        batch_max_len = max(len_list)  # length of the longest sample

        input_ids, labels = [], []
        for len_of_d, d in sorted(zip(len_list, batch_data), key=lambda x: -x[0]):
            pad_len = batch_max_len - len_of_d  # amount of padding needed
            # Pad, then make sure the sequence does not exceed the maximum length
            ids = d['input_ids'] + [self.pad_token_id] * pad_len
            label = d['labels'] + [self.ignore_label_id] * pad_len
            if batch_max_len > self.max_length:
                ids = ids[:self.max_length]
                label = label[:self.max_length]
            input_ids.append(torch.LongTensor(ids))
            labels.append(torch.LongTensor(label))

        # Stack the processed samples into a single tensor
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels)

        return {'input_ids': input_ids, 'labels': labels}


In [15]:
# Prepare the data collator
data_collator = DataCollatorForChatGLM(pad_token_id=tokenizer.pad_token_id)
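To see what the collator produces without loading a tokenizer or PyTorch, the same padding logic can be mirrored with plain lists. `collate_as_lists` and the toy token IDs below are assumptions made for this sketch; the real collator returns stacked `torch.LongTensor` objects instead.

```python
from typing import Dict, List


def collate_as_lists(batch: List[Dict[str, List[int]]], pad_token_id: int,
                     max_length: int = 2048,
                     ignore_label_id: int = -100) -> Dict[str, List[List[int]]]:
    """Plain-list mirror of the collator: pad to the longest sample, then truncate."""
    batch_max_len = max(len(d['input_ids']) for d in batch)
    input_ids, labels = [], []
    # Longest sample first, matching the sort order used by the collator
    for d in sorted(batch, key=lambda d: -len(d['input_ids'])):
        pad_len = batch_max_len - len(d['input_ids'])
        ids = d['input_ids'] + [pad_token_id] * pad_len
        label = d['labels'] + [ignore_label_id] * pad_len
        input_ids.append(ids[:max_length])
        labels.append(label[:max_length])
    return {'input_ids': input_ids, 'labels': labels}


# Two toy samples with hypothetical token IDs
toy_batch = [
    {'input_ids': [1, 2, 11, 21, 9], 'labels': [-100, -100, -100, 21, 9]},
    {'input_ids': [1, 2, 11, 12, 21, 22, 9], 'labels': [-100, -100, -100, -100, 21, 22, 9]},
]
collated = collate_as_lists(toy_batch, pad_token_id=0)
for ids, label in zip(collated['input_ids'], collated['labels']):
    print(ids, label)
```

Padding goes only up to the longest sample in the batch, not to the global `max_length`, which keeps short batches cheap.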

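## Model Training

### Load the Quantized ChatGLM3-6B Model

Load the model with the `nf4` quantization data type, enable double quantization, and compute in `bf16` mixed precision. The estimated GPU memory footprint is close to 4GB (the footprint measured after loading, printed later in this section, is 4756.38MiB).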

In [16]:
from transformers import AutoModel, BitsAndBytesConfig

_compute_dtype_map = {
    'fp32': torch.float32,
    'fp16': torch.float16,
    'bf16': torch.bfloat16
}

# QLoRA quantization config
q_config = BitsAndBytesConfig(load_in_4bit=True,
                              bnb_4bit_quant_type='nf4',
                              bnb_4bit_use_double_quant=True,
                              bnb_4bit_compute_dtype=_compute_dtype_map['bf16'])
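A rough back-of-the-envelope sketch of what NF4 and double quantization save. The block sizes and bit widths follow the description in the QLoRA paper and are assumptions here, as is the round parameter count; the result is an estimate, not a measurement.

```python
# Figures below follow the QLoRA paper's description; treat them as a rough estimate.
WEIGHT_BITS = 4          # NF4 stores each weight in 4 bits
BLOCK_SIZE = 64          # weights per quantization block
CONSTANT_BITS = 32       # one 32-bit scaling constant per block
DQ_CONSTANT_BITS = 8     # double quantization stores those constants in 8 bits
DQ_BLOCK_SIZE = 256      # constants per second-level block

# Extra bits per weight spent on quantization constants
overhead_plain = CONSTANT_BITS / BLOCK_SIZE
overhead_double = DQ_CONSTANT_BITS / BLOCK_SIZE + CONSTANT_BITS / (BLOCK_SIZE * DQ_BLOCK_SIZE)


def quantized_gib(num_params: int, overhead_bits: float) -> float:
    """Storage in GiB for num_params weights held in NF4 plus per-block constants."""
    return num_params * (WEIGHT_BITS + overhead_bits) / 8 / 1024 ** 3


n = 6_000_000_000  # hypothetical round count of quantized weights
print(f"overhead: {overhead_plain:.3f} -> {overhead_double:.3f} bits per parameter")
print(f"{quantized_gib(n, overhead_plain):.2f} GiB -> {quantized_gib(n, overhead_double):.2f} GiB")
```

The measured footprint is typically larger than this estimate, since layers that are not quantized stay in higher precision.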



### Load the Model


In [18]:
# The ChatGLM3-6B revision 'b098244' sets use_reentrant=False
# The latest revision sets use_reentrant=True, which adds unnecessary GPU memory overhead
model = AutoModel.from_pretrained(model_name_or_path,
                                  quantization_config=q_config,
                                  device_map='auto',
                                  trust_remote_code=True,
                                  revision='b098244')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading shards: 100%|██████████| 7/7 [34:23<00:00, 294.79s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:12<00:00,  1.73s/it]


In [22]:
# GPU memory currently occupied by the model (the difference is memory reserved for PyTorch)
memory_footprint_bytes = model.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB

print(f"{memory_footprint_mib:.2f}MiB")

4756.38MiB


### Prepare the Quantized Model

Preprocess the quantized model so that it supports low-precision fine-tuning.

ref: https://huggingface.co/docs/peft/main/en/developer_guides/quantization#quantize-a-model

In [23]:
from peft import TaskType, LoraConfig, get_peft_model, prepare_model_for_kbit_training

kbit_model = prepare_model_for_kbit_training(model)

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


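### Adding an Adapter to a Custom Model

When new popular transformer architectures are released, the Hugging Face community does its best to add them to PEFT quickly.

For a model that PEFT does not yet support out of the box, you can configure it as a custom model.

Specifically, when initializing the corresponding fine-tuning config class (for example `LoraConfig`), you need to explicitly specify which layers receive an adapter and set them correctly.

ref: https://huggingface.co/docs/peft/developer_guides/custom_models


#### PEFT Target Module Settings


The [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) file in the PEFT library defines, for each PEFT method, the modules it adapts on various large models.

Models with similar names usually have similar architectures, so the adapter settings for a given fine-tuning method are almost the same.

For example, suppose the new architecture is a variant of the `mistral` model and you want to apply LoRA. In TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING, the `mistral` entry contains ["q_proj", "v_proj"].

This means that for `mistral` models, the LoRA target_modules are usually ["q_proj", "v_proj"].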

In [24]:
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['chatglm']

In [25]:
target_modules

['query_key_value']

### LoRA Adapter Configuration

In [26]:
lora_config = LoraConfig(
    target_modules=target_modules,
    r=lora_rank,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias='none',
    inference_mode=False,
    task_type=TaskType.CAUSAL_LM
)

In [27]:
qlora_model = get_peft_model(kbit_model, lora_config)

In [28]:
qlora_model.print_trainable_parameters()

trainable params: 974,848 || all params: 6,244,558,848 || trainable%: 0.01561115883009451
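The trainable parameter count can be checked by hand. The sketch below assumes the ChatGLM3-6B dimensions listed in the comments (they are not stated in this tutorial and should be confirmed against `model.config`) and that LoRA adds one `r x in_features` matrix and one `out_features x r` matrix per adapted layer.

```python
# Assumed ChatGLM3-6B dimensions (read them from model.config to confirm):
hidden_size = 4096            # input width of query_key_value
num_layers = 28               # transformer blocks, one query_key_value each
kv_channels = 128             # per-head dimension
multi_query_group_num = 2     # key/value groups (multi-query attention)
lora_rank = 4                 # r, as set in the global parameters

# Fused QKV output: full-width queries plus keys and values for each group
qkv_out = hidden_size + 2 * multi_query_group_num * kv_channels

# LoRA adds A (r x in_features) and B (out_features x r) to each adapted layer
params_per_layer = lora_rank * (hidden_size + qkv_out)
trainable_params = num_layers * params_per_layer

all_params = 6_244_558_848    # "all params" reported by print_trainable_parameters()
trainable_percent = 100 * trainable_params / all_params
print(f"{trainable_params:,} trainable ({trainable_percent:.4f}%)")
```

Under these assumptions the count comes to 974,848, matching the value printed by `print_trainable_parameters()`.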


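### Training Hyperparameters

- One epoch is one complete pass over all samples in the training set.
- `num_train_epochs` is the number of complete epochs to train for.

#### Computing the Total Number of Training `steps` When Using num_train_epochs

- Total training steps: `total_steps = steps/epoch * num_train_epochs` 
- Training steps per epoch: `steps/epoch = num_train_examples / (batch_size * gradient_accumulation_steps)`, rounded down to a whole number of optimizer steps


**Worked example with the `adgen` dataset**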

```text
DatasetDict({
    train: Dataset({
        features: ['content', 'summary'],
        num_rows: 114599
    })
    validation: Dataset({
        features: ['content', 'summary'],
        num_rows: 1070
    })
})
```

Plug in the hyperparameters and configuration:

```python
num_train_epochs = 1
num_train_examples = 114599
batch_size = 16
gradient_accumulation_steps = 4

# Integer division: only whole optimizer steps are counted
steps = num_train_epochs * num_train_examples // (batch_size * gradient_accumulation_steps)
# 1 * 114599 // (16 * 4) = 1790
```

In [35]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
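    output_dir=f"models/{model_name_or_path}",          # output directory
    per_device_train_batch_size=16,                     # training batch size per device
    gradient_accumulation_steps=4,                     # gradient accumulation steps
    # per_device_eval_batch_size=8,                      # evaluation batch size per device
    learning_rate=1e-3,                                # learning rate
    num_train_epochs=1,                                # number of training epochs
    lr_scheduler_type="linear",                        # learning rate scheduler type
    warmup_ratio=0.1,                                  # warmup ratio
    logging_steps=10,                                 # log every N steps
    save_strategy="steps",                             # checkpoint saving strategy
    save_steps=100,                                    # save a checkpoint every N steps
    # evaluation_strategy="steps",                       # evaluation strategy
    # eval_steps=500,                                    # evaluate every N steps
    optim="adamw_torch",                               # optimizer type
    fp16=True,                                        # whether to use mixed-precision training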
)
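With `lr_scheduler_type="linear"` and `warmup_ratio=0.1`, the learning rate rises linearly during warmup and then decays linearly to zero. The sketch below is a plain-Python approximation of that schedule, not the scheduler code from Transformers; the helper name `linear_lr` is made up for illustration. It reproduces the learning rates that appear in the training log of this section.

```python
import math

num_train_examples = 114599
batch_size, gradient_accumulation_steps = 16, 4
learning_rate, warmup_ratio = 1e-3, 0.1

total_steps = num_train_examples // (batch_size * gradient_accumulation_steps)
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    return learning_rate * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


for step in (10, 180, 190):
    print(step, linear_lr(step))
```

Step 10 is still in warmup, while steps 180 and 190 fall just after the peak, which is why the logged learning rate starts to decrease at that point.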


In [36]:
trainer = Trainer(
        model=qlora_model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=data_collator
    )

#### Training Arguments (for Demonstration)

In [31]:
# from transformers import TrainingArguments, Trainer

# training_demo_args = TrainingArguments(
#     output_dir=f"models/demo/{model_name_or_path}",          # output directory
#     per_device_train_batch_size=16,                     # training batch size per device
#     gradient_accumulation_steps=4,                     # gradient accumulation steps
#     learning_rate=1e-3,                                # learning rate
#     max_steps=100,                                     # number of training steps
#     lr_scheduler_type="linear",                        # learning rate scheduler type
#     warmup_ratio=0.1,                                  # warmup ratio
#     logging_steps=10,                                 # log every N steps
#     save_strategy="steps",                             # checkpoint saving strategy
#     save_steps=20,                                    # save a checkpoint every N steps
#     optim="adamw_torch",                               # optimizer type
#     fp16=True,                                        # whether to use mixed-precision training
# )

In [32]:
# trainer = Trainer(
#         model=qlora_model,
#         args=training_demo_args,
#         train_dataset=tokenized_dataset,
#         data_collator=data_collator
#     )

### Start Training


In [37]:
trainer.train()

  1%|          | 10/1790 [01:39<5:00:23, 10.13s/it]

{'loss': 3.3382, 'learning_rate': 5.58659217877095e-05, 'epoch': 0.01}


  1%|          | 20/1790 [03:20<5:02:19, 10.25s/it]

{'loss': 3.3021, 'learning_rate': 0.000111731843575419, 'epoch': 0.01}


  2%|▏         | 30/1790 [05:11<5:24:24, 11.06s/it]

{'loss': 3.2689, 'learning_rate': 0.0001675977653631285, 'epoch': 0.02}


  2%|▏         | 40/1790 [06:55<4:59:36, 10.27s/it]

{'loss': 3.2543, 'learning_rate': 0.000223463687150838, 'epoch': 0.02}


  3%|▎         | 50/1790 [08:39<5:02:17, 10.42s/it]

{'loss': 3.2292, 'learning_rate': 0.00027932960893854746, 'epoch': 0.03}


  3%|▎         | 60/1790 [10:23<5:03:09, 10.51s/it]

{'loss': 3.2456, 'learning_rate': 0.000335195530726257, 'epoch': 0.03}


  4%|▍         | 70/1790 [12:07<4:57:55, 10.39s/it]

{'loss': 3.2759, 'learning_rate': 0.00039106145251396646, 'epoch': 0.04}


  4%|▍         | 80/1790 [13:54<5:06:13, 10.74s/it]

{'loss': 3.2833, 'learning_rate': 0.000446927374301676, 'epoch': 0.04}


  5%|▌         | 90/1790 [15:44<5:11:48, 11.01s/it]

{'loss': 3.2944, 'learning_rate': 0.0005027932960893855, 'epoch': 0.05}


  6%|▌         | 100/1790 [17:31<5:04:42, 10.82s/it]

{'loss': 3.2731, 'learning_rate': 0.0005586592178770949, 'epoch': 0.06}


  6%|▌         | 110/1790 [19:27<5:01:25, 10.77s/it]

{'loss': 3.3063, 'learning_rate': 0.0006145251396648044, 'epoch': 0.06}


  7%|▋         | 120/1790 [21:16<4:57:51, 10.70s/it]

{'loss': 3.2758, 'learning_rate': 0.000670391061452514, 'epoch': 0.07}


  7%|▋         | 130/1790 [23:07<5:08:12, 11.14s/it]

{'loss': 3.2567, 'learning_rate': 0.0007262569832402235, 'epoch': 0.07}


  8%|▊         | 140/1790 [24:52<4:54:37, 10.71s/it]

{'loss': 3.2391, 'learning_rate': 0.0007821229050279329, 'epoch': 0.08}


  8%|▊         | 150/1790 [26:41<5:04:28, 11.14s/it]

{'loss': 3.2346, 'learning_rate': 0.0008379888268156424, 'epoch': 0.08}


  9%|▉         | 160/1790 [28:29<4:45:00, 10.49s/it]

{'loss': 3.2601, 'learning_rate': 0.000893854748603352, 'epoch': 0.09}


  9%|▉         | 170/1790 [30:16<4:43:40, 10.51s/it]

{'loss': 3.2425, 'learning_rate': 0.0009497206703910615, 'epoch': 0.09}


 10%|█         | 180/1790 [32:06<4:52:35, 10.90s/it]

{'loss': 3.2898, 'learning_rate': 0.000999379267535692, 'epoch': 0.1}


 11%|█         | 190/1790 [33:53<4:47:50, 10.79s/it]

{'loss': 3.283, 'learning_rate': 0.0009931719428926133, 'epoch': 0.11}




{'loss': 3.2627, 'learning_rate': 0.0009869646182495344, 'epoch': 0.11}


 12%|█▏        | 210/1790 [37:29<4:48:04, 10.94s/it]

{'loss': 3.2382, 'learning_rate': 0.0009807572936064556, 'epoch': 0.12}


 12%|█▏        | 220/1790 [39:17<4:38:22, 10.64s/it]

{'loss': 3.1958, 'learning_rate': 0.0009745499689633768, 'epoch': 0.12}


 13%|█▎        | 230/1790 [41:03<4:36:08, 10.62s/it]

{'loss': 3.2194, 'learning_rate': 0.0009683426443202979, 'epoch': 0.13}


 13%|█▎        | 240/1790 [42:49<4:31:49, 10.52s/it]

{'loss': 3.1925, 'learning_rate': 0.0009621353196772191, 'epoch': 0.13}


 14%|█▍        | 250/1790 [44:39<4:36:24, 10.77s/it]

{'loss': 3.2293, 'learning_rate': 0.0009559279950341403, 'epoch': 0.14}


 15%|█▍        | 260/1790 [46:28<4:35:04, 10.79s/it]

{'loss': 3.2158, 'learning_rate': 0.0009497206703910615, 'epoch': 0.15}


 15%|█▌        | 270/1790 [48:16<4:31:20, 10.71s/it]

{'loss': 3.2372, 'learning_rate': 0.0009435133457479826, 'epoch': 0.15}


 16%|█▌        | 280/1790 [50:05<4:39:14, 11.10s/it]

{'loss': 3.195, 'learning_rate': 0.0009373060211049038, 'epoch': 0.16}


 16%|█▌        | 290/1790 [51:57<4:38:10, 11.13s/it]

{'loss': 3.2324, 'learning_rate': 0.000931098696461825, 'epoch': 0.16}




{'loss': 3.2009, 'learning_rate': 0.0009248913718187462, 'epoch': 0.17}


 17%|█▋        | 310/1790 [55:32<4:28:32, 10.89s/it]

{'loss': 3.1957, 'learning_rate': 0.0009186840471756674, 'epoch': 0.17}


 18%|█▊        | 320/1790 [57:18<4:21:52, 10.69s/it]

{'loss': 3.2185, 'learning_rate': 0.0009124767225325885, 'epoch': 0.18}


 18%|█▊        | 330/1790 [59:03<4:25:18, 10.90s/it]

{'loss': 3.2106, 'learning_rate': 0.0009062693978895097, 'epoch': 0.18}


 19%|█▉        | 340/1790 [1:00:54<4:26:52, 11.04s/it]

{'loss': 3.1951, 'learning_rate': 0.0009000620732464308, 'epoch': 0.19}


 20%|█▉        | 350/1790 [1:02:42<4:18:17, 10.76s/it]

{'loss': 3.1966, 'learning_rate': 0.000893854748603352, 'epoch': 0.2}


 20%|██        | 360/1790 [1:04:27<4:06:43, 10.35s/it]

{'loss': 3.2194, 'learning_rate': 0.0008876474239602731, 'epoch': 0.2}


 21%|██        | 370/1790 [1:06:18<4:27:49, 11.32s/it]

{'loss': 3.1834, 'learning_rate': 0.0008814400993171943, 'epoch': 0.21}


 21%|██        | 380/1790 [1:08:08<4:25:23, 11.29s/it]

{'loss': 3.1975, 'learning_rate': 0.0008752327746741154, 'epoch': 0.21}


 22%|██▏       | 390/1790 [1:09:59<4:14:02, 10.89s/it]

{'loss': 3.187, 'learning_rate': 0.0008690254500310366, 'epoch': 0.22}


 22%|██▏       | 400/1790 [1:11:47<4:14:46, 11.00s/it]

{'loss': 3.1904, 'learning_rate': 0.0008628181253879578, 'epoch': 0.22}


 23%|██▎       | 410/1790 [1:13:47<4:15:40, 11.12s/it]

{'loss': 3.1896, 'learning_rate': 0.000856610800744879, 'epoch': 0.23}


 23%|██▎       | 420/1790 [1:15:35<4:10:07, 10.95s/it]

{'loss': 3.1673, 'learning_rate': 0.0008504034761018001, 'epoch': 0.23}


 24%|██▍       | 430/1790 [1:17:20<4:00:50, 10.63s/it]

{'loss': 3.1701, 'learning_rate': 0.0008441961514587213, 'epoch': 0.24}


 25%|██▍       | 440/1790 [1:19:13<4:09:06, 11.07s/it]

{'loss': 3.1633, 'learning_rate': 0.0008379888268156424, 'epoch': 0.25}


 25%|██▌       | 450/1790 [1:21:06<4:04:43, 10.96s/it]

{'loss': 3.1377, 'learning_rate': 0.0008317815021725637, 'epoch': 0.25}


 26%|██▌       | 460/1790 [1:23:03<4:16:52, 11.59s/it]

{'loss': 3.1342, 'learning_rate': 0.0008255741775294849, 'epoch': 0.26}


 26%|██▋       | 470/1790 [1:24:57<4:12:46, 11.49s/it]

{'loss': 3.1651, 'learning_rate': 0.000819366852886406, 'epoch': 0.26}


 27%|██▋       | 480/1790 [1:26:50<4:06:46, 11.30s/it]

{'loss': 3.1712, 'learning_rate': 0.0008131595282433272, 'epoch': 0.27}


 27%|██▋       | 490/1790 [1:28:44<4:10:07, 11.54s/it]

{'loss': 3.1487, 'learning_rate': 0.0008069522036002483, 'epoch': 0.27}




{'loss': 3.1373, 'learning_rate': 0.0008007448789571696, 'epoch': 0.28}


 28%|██▊       | 510/1790 [1:32:41<4:08:57, 11.67s/it]

{'loss': 3.142, 'learning_rate': 0.0007945375543140907, 'epoch': 0.28}


 29%|██▉       | 520/1790 [1:34:36<4:04:54, 11.57s/it]

{'loss': 3.1295, 'learning_rate': 0.0007883302296710118, 'epoch': 0.29}


 30%|██▉       | 530/1790 [1:36:32<4:02:21, 11.54s/it]

{'loss': 3.1548, 'learning_rate': 0.0007821229050279329, 'epoch': 0.3}


 30%|███       | 540/1790 [1:38:24<3:46:36, 10.88s/it]

{'loss': 3.183, 'learning_rate': 0.0007759155803848541, 'epoch': 0.3}


 31%|███       | 550/1790 [1:40:18<3:55:07, 11.38s/it]

{'loss': 3.148, 'learning_rate': 0.0007697082557417753, 'epoch': 0.31}


 31%|███▏      | 560/1790 [1:42:11<3:54:53, 11.46s/it]

{'loss': 3.1408, 'learning_rate': 0.0007635009310986965, 'epoch': 0.31}


 32%|███▏      | 570/1790 [1:44:07<3:54:55, 11.55s/it]

{'loss': 3.1544, 'learning_rate': 0.0007572936064556176, 'epoch': 0.32}


 32%|███▏      | 580/1790 [1:46:02<3:55:01, 11.65s/it]

{'loss': 3.1397, 'learning_rate': 0.0007510862818125388, 'epoch': 0.32}


 33%|███▎      | 590/1790 [1:47:52<3:40:58, 11.05s/it]

{'loss': 3.1422, 'learning_rate': 0.0007448789571694599, 'epoch': 0.33}


 34%|███▎      | 600/1790 [1:49:46<3:42:20, 11.21s/it]

{'loss': 3.1322, 'learning_rate': 0.0007386716325263812, 'epoch': 0.34}


 34%|███▍      | 610/1790 [1:51:48<3:41:26, 11.26s/it]

{'loss': 3.1547, 'learning_rate': 0.0007324643078833023, 'epoch': 0.34}


 35%|███▍      | 620/1790 [1:53:41<3:37:31, 11.15s/it]

{'loss': 3.1391, 'learning_rate': 0.0007262569832402235, 'epoch': 0.35}


 35%|███▌      | 630/1790 [1:55:37<3:46:48, 11.73s/it]

{'loss': 3.1392, 'learning_rate': 0.0007200496585971447, 'epoch': 0.35}


 36%|███▌      | 640/1790 [1:57:33<3:37:43, 11.36s/it]

{'loss': 3.1579, 'learning_rate': 0.0007138423339540658, 'epoch': 0.36}


 36%|███▋      | 650/1790 [1:59:29<3:37:26, 11.44s/it]

{'loss': 3.1614, 'learning_rate': 0.0007076350093109871, 'epoch': 0.36}


 37%|███▋      | 660/1790 [2:01:29<3:44:32, 11.92s/it]

{'loss': 3.1359, 'learning_rate': 0.0007014276846679082, 'epoch': 0.37}


 37%|███▋      | 670/1790 [2:03:20<3:27:43, 11.13s/it]

{'loss': 3.1115, 'learning_rate': 0.0006952203600248294, 'epoch': 0.37}


 38%|███▊      | 680/1790 [2:05:14<3:30:21, 11.37s/it]

{'loss': 3.1523, 'learning_rate': 0.0006890130353817505, 'epoch': 0.38}


 39%|███▊      | 690/1790 [2:07:13<3:34:20, 11.69s/it]

{'loss': 3.1055, 'learning_rate': 0.0006828057107386716, 'epoch': 0.39}


 39%|███▉      | 700/1790 [2:09:14<3:35:54, 11.89s/it]

{'loss': 3.1517, 'learning_rate': 0.0006765983860955927, 'epoch': 0.39}


 40%|███▉      | 710/1790 [2:11:16<3:21:53, 11.22s/it]

{'loss': 3.1195, 'learning_rate': 0.000670391061452514, 'epoch': 0.4}


 40%|████      | 720/1790 [2:13:09<3:19:23, 11.18s/it]

{'loss': 3.0776, 'learning_rate': 0.0006641837368094351, 'epoch': 0.4}


 41%|████      | 730/1790 [2:15:02<3:18:33, 11.24s/it]

{'loss': 3.106, 'learning_rate': 0.0006579764121663563, 'epoch': 0.41}


 41%|████▏     | 740/1790 [2:16:58<3:21:56, 11.54s/it]

{'loss': 3.1119, 'learning_rate': 0.0006517690875232774, 'epoch': 0.41}


 42%|████▏     | 750/1790 [2:18:51<3:15:53, 11.30s/it]

{'loss': 3.1033, 'learning_rate': 0.0006455617628801986, 'epoch': 0.42}


 42%|████▏     | 760/1790 [2:20:50<3:32:11, 12.36s/it]

{'loss': 3.1434, 'learning_rate': 0.0006393544382371198, 'epoch': 0.42}


 43%|████▎     | 770/1790 [2:22:44<3:15:46, 11.52s/it]

{'loss': 3.149, 'learning_rate': 0.000633147113594041, 'epoch': 0.43}


 44%|████▎     | 780/1790 [2:24:38<3:14:42, 11.57s/it]

{'loss': 3.1247, 'learning_rate': 0.0006269397889509621, 'epoch': 0.44}


 44%|████▍     | 790/1790 [2:26:31<3:05:57, 11.16s/it]

{'loss': 3.127, 'learning_rate': 0.0006207324643078833, 'epoch': 0.44}




{'loss': 3.1044, 'learning_rate': 0.0006145251396648044, 'epoch': 0.45}


 45%|████▌     | 810/1790 [2:30:22<3:06:47, 11.44s/it]

{'loss': 3.1359, 'learning_rate': 0.0006083178150217257, 'epoch': 0.45}


 46%|████▌     | 820/1790 [2:32:16<3:00:08, 11.14s/it]

{'loss': 3.1004, 'learning_rate': 0.0006021104903786469, 'epoch': 0.46}


 46%|████▋     | 830/1790 [2:34:13<3:09:28, 11.84s/it]

{'loss': 3.1211, 'learning_rate': 0.000595903165735568, 'epoch': 0.46}


 47%|████▋     | 840/1790 [2:36:08<3:02:06, 11.50s/it]

{'loss': 3.1012, 'learning_rate': 0.0005896958410924892, 'epoch': 0.47}


 47%|████▋     | 850/1790 [2:38:07<3:06:50, 11.93s/it]

{'loss': 3.0933, 'learning_rate': 0.0005834885164494103, 'epoch': 0.47}


 48%|████▊     | 860/1790 [2:39:59<2:57:50, 11.47s/it]

{'loss': 3.0993, 'learning_rate': 0.0005772811918063315, 'epoch': 0.48}


 49%|████▊     | 870/1790 [2:41:56<2:59:46, 11.72s/it]

{'loss': 3.119, 'learning_rate': 0.0005710738671632526, 'epoch': 0.49}


 49%|████▉     | 880/1790 [2:43:52<2:57:03, 11.67s/it]

{'loss': 3.1204, 'learning_rate': 0.0005648665425201738, 'epoch': 0.49}


 50%|████▉     | 890/1790 [2:45:51<2:55:36, 11.71s/it]

{'loss': 3.0917, 'learning_rate': 0.0005586592178770949, 'epoch': 0.5}


 50%|█████     | 900/1790 [2:47:46<2:47:36, 11.30s/it]

{'loss': 3.0479, 'learning_rate': 0.0005524518932340161, 'epoch': 0.5}


 51%|█████     | 910/1790 [2:49:47<2:43:50, 11.17s/it]

{'loss': 3.0957, 'learning_rate': 0.0005462445685909373, 'epoch': 0.51}


 51%|█████▏    | 920/1790 [2:51:45<2:50:44, 11.78s/it]

{'loss': 3.0807, 'learning_rate': 0.0005400372439478585, 'epoch': 0.51}


 52%|█████▏    | 930/1790 [2:53:40<2:44:58, 11.51s/it]

{'loss': 3.0671, 'learning_rate': 0.0005338299193047796, 'epoch': 0.52}


 53%|█████▎    | 940/1790 [2:55:32<2:41:32, 11.40s/it]

{'loss': 3.0839, 'learning_rate': 0.0005276225946617008, 'epoch': 0.52}


 53%|█████▎    | 950/1790 [2:57:26<2:41:28, 11.53s/it]

{'loss': 3.0713, 'learning_rate': 0.0005214152700186219, 'epoch': 0.53}


 54%|█████▎    | 960/1790 [2:59:17<2:35:41, 11.25s/it]

{'loss': 3.0862, 'learning_rate': 0.0005152079453755432, 'epoch': 0.54}


 54%|█████▍    | 970/1790 [3:01:12<2:37:38, 11.53s/it]

{'loss': 3.0422, 'learning_rate': 0.0005090006207324644, 'epoch': 0.54}


 55%|█████▍    | 980/1790 [3:03:13<2:42:03, 12.00s/it]

{'loss': 3.1154, 'learning_rate': 0.0005027932960893855, 'epoch': 0.55}


 55%|█████▌    | 990/1790 [3:05:07<2:32:46, 11.46s/it]

{'loss': 3.0691, 'learning_rate': 0.0004965859714463067, 'epoch': 0.55}


 56%|█████▌    | 1000/1790 [3:07:00<2:26:01, 11.09s/it]

{'loss': 3.0489, 'learning_rate': 0.0004903786468032278, 'epoch': 0.56}


 56%|█████▋    | 1010/1790 [3:08:59<2:21:36, 10.89s/it]

{'loss': 3.1186, 'learning_rate': 0.00048417132216014896, 'epoch': 0.56}


 57%|█████▋    | 1020/1790 [3:10:49<2:29:37, 11.66s/it]

{'loss': 3.0457, 'learning_rate': 0.00047796399751707017, 'epoch': 0.57}


 58%|█████▊    | 1030/1790 [3:12:34<2:12:21, 10.45s/it]

{'loss': 3.0956, 'learning_rate': 0.0004717566728739913, 'epoch': 0.58}


 58%|█████▊    | 1040/1790 [3:14:22<2:14:32, 10.76s/it]

{'loss': 3.0671, 'learning_rate': 0.0004655493482309125, 'epoch': 0.58}


 59%|█████▊    | 1050/1790 [3:16:07<2:11:13, 10.64s/it]

{'loss': 3.0598, 'learning_rate': 0.0004593420235878337, 'epoch': 0.59}


 59%|█████▉    | 1060/1790 [3:17:54<2:12:22, 10.88s/it]

{'loss': 3.0428, 'learning_rate': 0.00045313469894475483, 'epoch': 0.59}


 60%|█████▉    | 1070/1790 [3:19:39<2:05:55, 10.49s/it]

{'loss': 3.0599, 'learning_rate': 0.000446927374301676, 'epoch': 0.6}


 60%|██████    | 1080/1790 [3:21:28<2:06:05, 10.66s/it]

{'loss': 3.0412, 'learning_rate': 0.00044072004965859714, 'epoch': 0.6}


 61%|██████    | 1090/1790 [3:23:15<2:06:17, 10.82s/it]

{'loss': 3.048, 'learning_rate': 0.0004345127250155183, 'epoch': 0.61}


 61%|██████▏   | 1100/1790 [3:24:59<1:57:13, 10.19s/it]

{'loss': 3.0352, 'learning_rate': 0.0004283054003724395, 'epoch': 0.61}


 62%|██████▏   | 1110/1790 [3:26:55<2:00:07, 10.60s/it]

{'loss': 3.0528, 'learning_rate': 0.00042209807572936064, 'epoch': 0.62}


 63%|██████▎   | 1120/1790 [3:28:44<2:01:46, 10.90s/it]

{'loss': 3.0876, 'learning_rate': 0.00041589075108628185, 'epoch': 0.63}


 63%|██████▎   | 1130/1790 [3:30:32<2:00:14, 10.93s/it]

{'loss': 3.038, 'learning_rate': 0.000409683426443203, 'epoch': 0.63}


 64%|██████▎   | 1140/1790 [3:32:17<1:54:33, 10.57s/it]

{'loss': 3.0445, 'learning_rate': 0.00040347610180012415, 'epoch': 0.64}


 64%|██████▍   | 1150/1790 [3:34:04<1:55:28, 10.83s/it]

{'loss': 3.0462, 'learning_rate': 0.00039726877715704536, 'epoch': 0.64}


 65%|██████▍   | 1160/1790 [3:35:52<1:55:28, 11.00s/it]

{'loss': 3.1204, 'learning_rate': 0.00039106145251396646, 'epoch': 0.65}


 65%|██████▌   | 1170/1790 [3:37:42<1:52:08, 10.85s/it]

{'loss': 3.0872, 'learning_rate': 0.00038485412787088766, 'epoch': 0.65}


 66%|██████▌   | 1180/1790 [3:39:27<1:45:11, 10.35s/it]

{'loss': 3.0658, 'learning_rate': 0.0003786468032278088, 'epoch': 0.66}


 66%|██████▋   | 1190/1790 [3:41:14<1:50:09, 11.02s/it]

{'loss': 3.0806, 'learning_rate': 0.00037243947858472997, 'epoch': 0.66}


 67%|██████▋   | 1200/1790 [3:43:05<1:52:12, 11.41s/it]

{'loss': 3.0785, 'learning_rate': 0.0003662321539416512, 'epoch': 0.67}


 68%|██████▊   | 1210/1790 [3:45:01<1:41:30, 10.50s/it]

{'loss': 3.0338, 'learning_rate': 0.0003600248292985723, 'epoch': 0.68}


 68%|██████▊   | 1220/1790 [3:46:49<1:41:58, 10.73s/it]

{'loss': 3.0761, 'learning_rate': 0.00035381750465549353, 'epoch': 0.68}


 69%|██████▊   | 1230/1790 [3:48:37<1:41:22, 10.86s/it]

{'loss': 3.012, 'learning_rate': 0.0003476101800124147, 'epoch': 0.69}


 69%|██████▉   | 1240/1790 [3:50:26<1:41:08, 11.03s/it]

{'loss': 3.0632, 'learning_rate': 0.0003414028553693358, 'epoch': 0.69}


 70%|██████▉   | 1250/1790 [3:52:13<1:37:38, 10.85s/it]

{'loss': 3.0688, 'learning_rate': 0.000335195530726257, 'epoch': 0.7}


 70%|███████   | 1260/1790 [3:54:01<1:34:10, 10.66s/it]

{'loss': 3.023, 'learning_rate': 0.00032898820608317814, 'epoch': 0.7}


 71%|███████   | 1270/1790 [3:55:45<1:29:56, 10.38s/it]

{'loss': 3.0818, 'learning_rate': 0.0003227808814400993, 'epoch': 0.71}


 72%|███████▏  | 1280/1790 [3:57:32<1:32:42, 10.91s/it]

{'loss': 3.0422, 'learning_rate': 0.0003165735567970205, 'epoch': 0.71}


 72%|███████▏  | 1290/1790 [3:59:20<1:28:29, 10.62s/it]

{'loss': 3.0592, 'learning_rate': 0.00031036623215394165, 'epoch': 0.72}


 73%|███████▎  | 1300/1790 [4:01:07<1:26:00, 10.53s/it]

{'loss': 3.0655, 'learning_rate': 0.00030415890751086286, 'epoch': 0.73}


 73%|███████▎  | 1310/1790 [4:03:07<1:30:02, 11.25s/it]

{'loss': 3.0441, 'learning_rate': 0.000297951582867784, 'epoch': 0.73}


 74%|███████▎  | 1320/1790 [4:04:57<1:22:37, 10.55s/it]

{'loss': 3.059, 'learning_rate': 0.00029174425822470516, 'epoch': 0.74}


 74%|███████▍  | 1330/1790 [4:06:42<1:21:00, 10.57s/it]

{'loss': 3.0372, 'learning_rate': 0.0002855369335816263, 'epoch': 0.74}


 75%|███████▍  | 1340/1790 [4:08:28<1:21:29, 10.87s/it]

{'loss': 3.0327, 'learning_rate': 0.00027932960893854746, 'epoch': 0.75}


 75%|███████▌  | 1350/1790 [4:10:16<1:19:32, 10.85s/it]

{'loss': 3.0707, 'learning_rate': 0.00027312228429546867, 'epoch': 0.75}


 76%|███████▌  | 1360/1790 [4:12:05<1:17:10, 10.77s/it]

{'loss': 3.068, 'learning_rate': 0.0002669149596523898, 'epoch': 0.76}


 77%|███████▋  | 1370/1790 [4:13:52<1:16:07, 10.87s/it]

{'loss': 3.0485, 'learning_rate': 0.00026070763500931097, 'epoch': 0.77}


 77%|███████▋  | 1380/1790 [4:15:44<1:18:16, 11.46s/it]

{'loss': 3.0277, 'learning_rate': 0.0002545003103662322, 'epoch': 0.77}


 78%|███████▊  | 1390/1790 [4:17:32<1:12:51, 10.93s/it]

{'loss': 3.0296, 'learning_rate': 0.00024829298572315333, 'epoch': 0.78}


 78%|███████▊  | 1400/1790 [4:19:17<1:08:42, 10.57s/it]

{'loss': 3.0608, 'learning_rate': 0.00024208566108007448, 'epoch': 0.78}


 79%|███████▉  | 1410/1790 [4:21:13<1:06:06, 10.44s/it]

{'loss': 3.0095, 'learning_rate': 0.00023587833643699566, 'epoch': 0.79}


 79%|███████▉  | 1420/1790 [4:22:59<1:05:25, 10.61s/it]

{'loss': 2.9923, 'learning_rate': 0.00022967101179391684, 'epoch': 0.79}


 80%|███████▉  | 1430/1790 [4:24:45<1:06:13, 11.04s/it]

{'loss': 3.0324, 'learning_rate': 0.000223463687150838, 'epoch': 0.8}


 80%|████████  | 1440/1790 [4:26:32<1:01:25, 10.53s/it]

{'loss': 3.0601, 'learning_rate': 0.00021725636250775914, 'epoch': 0.8}


 81%|████████  | 1450/1790 [4:28:17<59:29, 10.50s/it]  

{'loss': 3.0429, 'learning_rate': 0.00021104903786468032, 'epoch': 0.81}


 82%|████████▏ | 1460/1790 [4:30:01<57:28, 10.45s/it]  

{'loss': 3.034, 'learning_rate': 0.0002048417132216015, 'epoch': 0.82}


 82%|████████▏ | 1470/1790 [4:31:48<55:09, 10.34s/it]

{'loss': 3.0355, 'learning_rate': 0.00019863438857852268, 'epoch': 0.82}


 83%|████████▎ | 1480/1790 [4:33:37<56:33, 10.95s/it]

{'loss': 3.0435, 'learning_rate': 0.00019242706393544383, 'epoch': 0.83}


 83%|████████▎ | 1490/1790 [4:35:24<52:11, 10.44s/it]

{'loss': 3.0216, 'learning_rate': 0.00018621973929236498, 'epoch': 0.83}




{'loss': 3.0291, 'learning_rate': 0.00018001241464928616, 'epoch': 0.84}


 84%|████████▍ | 1510/1790 [4:39:00<48:46, 10.45s/it]

{'loss': 3.0305, 'learning_rate': 0.00017380509000620734, 'epoch': 0.84}


 85%|████████▍ | 1520/1790 [4:40:51<49:45, 11.06s/it]

{'loss': 3.0819, 'learning_rate': 0.0001675977653631285, 'epoch': 0.85}


 85%|████████▌ | 1530/1790 [4:42:35<45:23, 10.47s/it]

{'loss': 3.0293, 'learning_rate': 0.00016139044072004965, 'epoch': 0.85}


 86%|████████▌ | 1540/1790 [4:44:23<44:56, 10.79s/it]

{'loss': 3.0099, 'learning_rate': 0.00015518311607697082, 'epoch': 0.86}


 87%|████████▋ | 1550/1790 [4:46:10<42:44, 10.68s/it]

{'loss': 3.0409, 'learning_rate': 0.000148975791433892, 'epoch': 0.87}


 87%|████████▋ | 1560/1790 [4:48:00<42:07, 10.99s/it]

{'loss': 3.0405, 'learning_rate': 0.00014276846679081316, 'epoch': 0.87}


 88%|████████▊ | 1570/1790 [4:49:45<39:25, 10.75s/it]

{'loss': 3.0213, 'learning_rate': 0.00013656114214773433, 'epoch': 0.88}


 88%|████████▊ | 1580/1790 [4:51:33<37:59, 10.85s/it]

{'loss': 3.0542, 'learning_rate': 0.00013035381750465549, 'epoch': 0.88}


 89%|████████▉ | 1590/1790 [4:53:23<38:46, 11.63s/it]

{'loss': 3.0167, 'learning_rate': 0.00012414649286157667, 'epoch': 0.89}


 89%|████████▉ | 1600/1790 [4:55:11<34:33, 10.92s/it]

{'loss': 3.0193, 'learning_rate': 0.00011793916821849783, 'epoch': 0.89}


 90%|████████▉ | 1610/1790 [4:57:05<32:05, 10.70s/it]

{'loss': 3.0429, 'learning_rate': 0.000111731843575419, 'epoch': 0.9}


 91%|█████████ | 1620/1790 [4:58:50<29:39, 10.47s/it]

{'loss': 3.0433, 'learning_rate': 0.00010552451893234016, 'epoch': 0.9}


 91%|█████████ | 1630/1790 [5:00:34<27:13, 10.21s/it]

{'loss': 3.0394, 'learning_rate': 9.931719428926134e-05, 'epoch': 0.91}


 92%|█████████▏| 1640/1790 [5:02:23<27:42, 11.08s/it]

{'loss': 3.0456, 'learning_rate': 9.310986964618249e-05, 'epoch': 0.92}


 92%|█████████▏| 1650/1790 [5:04:10<24:51, 10.66s/it]

{'loss': 3.0086, 'learning_rate': 8.690254500310367e-05, 'epoch': 0.92}


 93%|█████████▎| 1660/1790 [5:05:58<23:11, 10.70s/it]

{'loss': 3.0333, 'learning_rate': 8.069522036002482e-05, 'epoch': 0.93}


 93%|█████████▎| 1670/1790 [5:07:44<21:09, 10.58s/it]

{'loss': 3.0376, 'learning_rate': 7.4487895716946e-05, 'epoch': 0.93}


 94%|█████████▍| 1680/1790 [5:09:32<19:37, 10.70s/it]

{'loss': 3.0208, 'learning_rate': 6.828057107386717e-05, 'epoch': 0.94}


 94%|█████████▍| 1690/1790 [5:11:17<17:44, 10.65s/it]

{'loss': 3.0133, 'learning_rate': 6.207324643078833e-05, 'epoch': 0.94}


 95%|█████████▍| 1700/1790 [5:13:07<16:33, 11.03s/it]

{'loss': 3.0043, 'learning_rate': 5.58659217877095e-05, 'epoch': 0.95}


 96%|█████████▌| 1710/1790 [5:15:11<14:09, 10.61s/it]

{'loss': 3.0273, 'learning_rate': 4.965859714463067e-05, 'epoch': 0.95}


 96%|█████████▌| 1720/1790 [5:16:58<12:13, 10.48s/it]

{'loss': 3.0143, 'learning_rate': 4.3451272501551835e-05, 'epoch': 0.96}


 97%|█████████▋| 1730/1790 [5:18:48<10:58, 10.98s/it]

{'loss': 3.0144, 'learning_rate': 3.7243947858473e-05, 'epoch': 0.97}


 97%|█████████▋| 1740/1790 [5:20:35<09:03, 10.87s/it]

{'loss': 3.0174, 'learning_rate': 3.1036623215394166e-05, 'epoch': 0.97}


 98%|█████████▊| 1750/1790 [5:22:22<07:06, 10.66s/it]

{'loss': 3.0083, 'learning_rate': 2.4829298572315335e-05, 'epoch': 0.98}


 98%|█████████▊| 1760/1790 [5:24:12<05:17, 10.59s/it]

{'loss': 3.0351, 'learning_rate': 1.86219739292365e-05, 'epoch': 0.98}


 99%|█████████▉| 1770/1790 [5:26:03<03:47, 11.37s/it]

{'loss': 3.0156, 'learning_rate': 1.2414649286157668e-05, 'epoch': 0.99}


 99%|█████████▉| 1780/1790 [5:27:58<01:56, 11.62s/it]

{'loss': 3.0323, 'learning_rate': 6.207324643078834e-06, 'epoch': 0.99}


100%|██████████| 1790/1790 [5:29:48<00:00, 11.06s/it]

{'loss': 2.9944, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 19788.8984, 'train_samples_per_second': 5.791, 'train_steps_per_second': 0.09, 'train_loss': 3.1139484916985367, 'epoch': 1.0}





TrainOutput(global_step=1790, training_loss=3.1139484916985367, metrics={'train_runtime': 19788.8984, 'train_samples_per_second': 5.791, 'train_steps_per_second': 0.09, 'train_loss': 3.1139484916985367, 'epoch': 1.0})
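The logged numbers above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the training code. It assumes a peak learning rate of `1e-3` and 179 warmup steps (about 10% of the run) followed by linear decay to zero. Both values are inferred from the logged learning rates, not read from the `TrainingArguments`, and `linear_schedule_lr` and `format_hms` are hypothetical helper names. It also assumes that the full 114599-row train split was used for the single epoch.

```python
import math

# Values copied from the training log above.
TOTAL_STEPS = 1790
TRAIN_RUNTIME_S = 19788.8984
SAMPLES_PER_SECOND = 5.791
TRAIN_ROWS = 114599  # size of the adgen train split shown earlier in the notebook

# Assumptions inferred from the log (not read from TrainingArguments).
PEAK_LR = 1e-3
WARMUP_STEPS = 179  # roughly 10% of TOTAL_STEPS


def linear_schedule_lr(step, peak_lr=PEAK_LR,
                       warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS, like the tqdm progress bar."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# A few (global step -> learning rate) pairs copied from the log.
LOGGED_LR = {
    1040: 0.0004655493482309125,
    1400: 0.00024208566108007448,
    1780: 6.207324643078834e-06,
    1790: 0.0,
}

for step, logged in LOGGED_LR.items():
    predicted = linear_schedule_lr(step)
    match = math.isclose(predicted, logged, rel_tol=1e-9, abs_tol=1e-12)
    print(f"step {step}: predicted={predicted:.6e} logged={logged:.6e} match={match}")

# Throughput figures implied by the final summary line.
samples_seen = SAMPLES_PER_SECOND * TRAIN_RUNTIME_S
effective_batch = round(TRAIN_ROWS / TOTAL_STEPS)
seconds_per_step = TRAIN_RUNTIME_S / TOTAL_STEPS

print("wall-clock time:", format_hms(TRAIN_RUNTIME_S))
print("samples seen (approx):", round(samples_seen))
print("effective batch size (approx):", effective_batch)
print("seconds per step:", round(seconds_per_step, 2))
```

If the assumptions hold, the predicted learning rates agree with the logged ones, and the runtime works out to roughly 64 samples per optimizer step. That batch size is the combined product of per-device batch size, gradient accumulation steps, and device count; the log alone cannot tell these factors apart.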

In [38]:
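# Save the trained QLoRA adapter. With model_name_or_path = 'THUDM/chatglm3-6b',
# the output directory resolves to models/demo/THUDM/chatglm3-6b.
trainer.model.save_pretrained(f"models/demo/{model_name_or_path}")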
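Because the model ID contains a slash, the f-string in the save call produces a nested directory, not a single folder. The sketch below only illustrates how that path resolves. `adapter_output_dir` is a hypothetical helper that the notebook does not define, and `model_name_or_path` repeats the global set in the first cell.

```python
from pathlib import PurePosixPath

# Same value as the global defined at the top of the notebook.
model_name_or_path = "THUDM/chatglm3-6b"


def adapter_output_dir(model_id, root="models/demo"):
    """Hypothetical helper: build the directory used by the save call above."""
    return PurePosixPath(root) / model_id


out_dir = adapter_output_dir(model_name_or_path)
print(out_dir)        # models/demo/THUDM/chatglm3-6b
print(out_dir.parts)  # ('models', 'demo', 'THUDM', 'chatglm3-6b')
```

When `trainer.model` is a PEFT-wrapped model, `save_pretrained` writes the adapter configuration and adapter weights to this directory, not a full copy of the base model. The base model therefore has to be loaded separately at inference time.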