# Fine-tuning Gemma-2-2b on the full dataset

## Notebook contents

The `Gemma-2-2b` model was fine-tuned on the full dataset. Training ran on an `NVIDIA A100 80GB` GPU for 7.5 hours.

**Hyperparameters:**

* `LoRA rank = 16`
* `max_length = 256`
* `n_epochs = 2`
* `batch_size = 8`
* `learning_rate = 0.0005`
* `lr_scheduler = cosine`
* `weight_decay = 0.01`

**Results:**

* `loss = 1.9164`
* `eval_loss = 1.9087`
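
For intuition, the reported losses can be converted to perplexity. This is a minimal sketch assuming the logged values are mean per-token cross-entropy in nats (the usual convention for causal language-model loss); the helper name `perplexity` is illustrative and not part of the notebook.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Values copied from the results above.
train_ppl = perplexity(1.9164)
eval_ppl = perplexity(1.9087)
print(f"train perplexity ~ {train_ppl:.2f}, eval perplexity ~ {eval_ppl:.2f}")
```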

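**Conclusions:**

* The best results were obtained at the end of the second epoch. The model is slightly underfitted: the evaluation loss was still decreasing (1.9576 after the first epoch, 1.9087 after the second).

Inference examples are at the end of the notebook.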

## MLflow logs

![alt text](<../resources/mlflow/Screenshot 2024-12-17 032618.png>)

![alt text](<../resources/mlflow/Screenshot 2024-12-17 032424.png>)

## GPU information

In [23]:
# !nvidia-smi

Mon Dec 16 21:32:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:8C:00.0 Off |                    0 |
| N/A   35C    P0              84W / 500W |  22921MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Imports

In [1]:
# %pip -q install torch transformers datasets peft accelerate protobuf sentencepiece mlflow


[notice] A new release of pip is available: 23.0.1 -> 24.3.1
[notice] To update, run: python3 -m pip install --upgrade pip


In [1]:
import os
import random
import sys
from datetime import datetime

import mlflow
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from dotenv import load_dotenv
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

## Settings

In [2]:
def seed_all(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    random.seed(seed)

# For reproducible results
SEED = 42
seed_all(SEED)
# Load variables from the .env file
load_dotenv()
# Read the access token
token = os.environ['HUGGING_FACE_ACCESS_TOKEN']
# Log in to Hugging Face
login(token)
# Device for tensor computations and for holding the model in memory.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Loading the model

In [7]:
model_name = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Half precision
    device_map="auto",
    token=token,
)

Downloading shards: 100%|██████████| 3/3 [00:00<00:00, 917.19it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.38it/s]


## LoRA fine-tuning

In [8]:
LORA_RANK = 16  # Rank of the LoRA matrices

# LoRA settings
lora_config = LoraConfig(
    r=LORA_RANK,  # Rank of the small matrices whose weights are trained
    lora_alpha=LORA_RANK,  # (alpha / r) scales the output of the small trainable matrices
    lora_dropout=0.1,  # Dropout regularization for the small trainable matrices
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
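
To make the role of the rank concrete, the sketch below counts the extra trainable parameters LoRA adds to one linear layer and shows the `lora_alpha / r` scale. The layer dimensions are hypothetical placeholders rather than Gemma's real shapes, and `lora_extra_params` is an illustrative helper, not a PEFT function.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    and leaves the original weight frozen.
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size, chosen only for illustration.
D_IN, D_OUT = 2048, 2048
RANK = 16
ALPHA = 16  # the notebook sets lora_alpha equal to the rank

full = D_IN * D_OUT
extra = lora_extra_params(D_IN, D_OUT, RANK)
scale = ALPHA / RANK  # factor applied to the low-rank update B @ A

print(f"frozen weights: {full}, LoRA weights: {extra} ({extra / full:.2%})")
print(f"update scale alpha / r = {scale}")
```

Because `lora_alpha` equals the rank here, the low-rank update is applied with a scale of 1.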

## Loading and cleaning the data

In [7]:
dataset = load_dataset("nymless/geo-reviews-dataset-2023-prepared")

df = pd.DataFrame(dataset['train'])
df["name_ru"] = df["name_ru"].fillna("")
df["text"] = df["text"].str.replace("\\n", " ", regex=False)  # Replace literal "\n" sequences with spaces
df = df.drop_duplicates(ignore_index=True)

dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['address', 'name_ru', 'rating', 'rubrics', 'text'],
    num_rows: 499799
})

## Tokenizing the data

In [8]:
MAX_LENGTH = 256


def tokenize_text(examples):
    inputs = [
        f"Аddress: {address}\nName: {name}\nRating: {rating}\nKeywords: {rubrics}\nReview: {text}{tokenizer.eos_token}"
        for address, name, rating, rubrics, text in zip(
            examples["address"],
            examples["name_ru"],
            examples["rating"],
            examples["rubrics"],
            examples["text"],
        )
    ]
    # Tokenize the input text and pad or truncate it to the chosen length
    return tokenizer(
        inputs, max_length=MAX_LENGTH, padding="max_length", truncation=True
    )


dataset = dataset.map(tokenize_text, batched=True, remove_columns=dataset.column_names)
dataset

Map:   0%|          | 0/499799 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 499799
})
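
The effect of `padding="max_length"` together with `truncation=True` can be illustrated in plain Python. This is a simplified sketch, not the tokenizer's implementation: the token ids and `pad_id` are made up, and the padding side is left as a parameter because it depends on the tokenizer's configuration.

```python
def pad_or_truncate(ids, max_length, pad_id, padding_side="right"):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(ids)[:max_length]          # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)         # how many pad tokens are still needed
    mask = [1] * len(ids)                 # 1 marks real tokens
    if padding_side == "right":
        return ids + [pad_id] * n_pad, mask + [0] * n_pad
    return [pad_id] * n_pad + ids, [0] * n_pad + mask


# Made-up token ids with a toy max_length of 6.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=6, pad_id=0)
long_ids, long_mask = pad_or_truncate(range(10, 20), max_length=6, pad_id=0)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

One consequence worth keeping in mind: if truncation removes tokens from the end, a review longer than `MAX_LENGTH` tokens loses the EOS token appended by the template.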

## Splitting the data

In [9]:
dataset = dataset.train_test_split(test_size=0.1, shuffle=True, seed=SEED)
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 449819
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 49980
    })
})
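
The split sizes in the output are consistent with rounding the test share up to a whole row. A small check, assuming that rounding rule (the helper `split_sizes` is illustrative and not part of the `datasets` API):

```python
import math


def split_sizes(n_rows: int, test_fraction: float):
    """Sizes of the train and test parts when the test share is rounded up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(499_799, 0.1)
print(n_train, n_test)
```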

## MLflow logging

In [12]:
MLFLOW_EXPERIMENT = "Gemma-2-2b-LoRA-Finetuning"

# Initialize MLflow
mlflow.set_tracking_uri("./mlruns")  # Log directory in the project root
mlflow.set_experiment(MLFLOW_EXPERIMENT)  # MLflow experiment name

2024/12/16 13:31:47 INFO mlflow.tracking.fluent: Experiment with name 'Gemma-2-2b-LoRA-Finetuning' does not exist. Creating a new experiment.


<Experiment: artifact_location='/home/jupyter/work/resources/mlruns/702168699606335704', creation_time=1734355907824, experiment_id='702168699606335704', last_update_time=1734355907824, lifecycle_stage='active', name='Gemma-2-2b-LoRA-Finetuning', tags={}>

## Training hyperparameters

In [13]:
BATCH_SIZE = 8
GRAD_ACCUM = 1
N_EPOCHS = 2
LEARNING_RATE = 5e-4
WEIGHT_DECAY = 0.01

n_steps = int(round(N_EPOCHS * len(dataset["train"]) / BATCH_SIZE / GRAD_ACCUM, 0))
print("Total training steps:", n_steps)

Total training steps: 112455
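
The progress bar during training reports 112456 steps, one more than this estimate. A likely explanation is that the last, partial batch of each epoch still counts as a full step, so the per-epoch count is rounded up before it is multiplied by the number of epochs. The sketch below assumes no gradient accumulation and that the incomplete batch is kept; `total_steps` is an illustrative helper.

```python
import math


def total_steps(n_samples: int, batch_size: int, n_epochs: int) -> int:
    """Optimizer steps when every epoch ends with a (possibly partial) batch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * n_epochs


print(total_steps(449_819, 8, 2))
```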


In [14]:
HUB_MODEL_ID = "nymless/gemma-2-2b-lora-finetuned"

training_args = TrainingArguments(
    # Enable logging to MLflow
    report_to="mlflow",
    logging_steps=500,  # Log every N steps
    # Build the MLflow run name from the current date and time
    run_name=f"{MLFLOW_EXPERIMENT}-{datetime.now().strftime('%d.%m.%Y, %H:%M:%S')}",
    # Directory where the model is saved
    output_dir="./models/gemma-2-2b-lora-finetuned",
    # Evaluation and checkpointing parameters
    overwrite_output_dir=True,  # Overwrite the directory on every training run
    save_strategy="epoch",  # Save a checkpoint at the end of every epoch
    load_best_model_at_end=True,  # Load the best checkpoint (by the metric) when training ends
    metric_for_best_model="loss",  # Metric used to pick the best model
    save_total_limit=1,  # Keep only one checkpoint
    eval_strategy="epoch",  # Evaluate at the end of every epoch
    # Hyperparameters
    learning_rate=LEARNING_RATE,  # Learning rate
    lr_scheduler_type="cosine",
    per_device_train_batch_size=BATCH_SIZE,  # Batch size for training
    per_device_eval_batch_size=BATCH_SIZE,  # Batch size for evaluation
    gradient_accumulation_steps=GRAD_ACCUM,  # Accumulate gradients for N steps before updating the weights
    num_train_epochs=N_EPOCHS,  # Number of epochs
    weight_decay=WEIGHT_DECAY,  # Weight decay (regularization) coefficient
    fp16=True,  # Use FP16 (16-bit floating point) mixed precision
    # Hugging Face Hub integration (disabled)
    push_to_hub=False,  # Push the model to the Hub with the default commit_message
    hub_model_id=HUB_MODEL_ID,  # Model name on the Hub
    hub_strategy="end",  # Upload the model to the Hub at the end of training
    hub_token=token,  # Hugging Face token
)
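
With `lr_scheduler_type="cosine"` and no warmup, the learning rate follows half a cosine period from its peak down to zero. The sketch below is a standalone re-implementation for illustration, not the library's code; it assumes zero warmup steps and a total of 112456 steps, as shown by the training progress bar.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Learning rate after `step` optimizer steps of a half-cosine decay."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 112_456
PEAK_LR = 5e-4

for step in (0, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, cosine_lr(step, TOTAL_STEPS, PEAK_LR))
```

At step 500 this gives about 4.99976e-4, in line with the first logged `learning_rate`. Later logged values can drift slightly from the formula, for example if some optimizer steps are skipped under mixed precision.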

In [15]:
# Collator that assembles examples into batches for training
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Disable masked language modeling (MLM)
)
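
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of `input_ids`, and padding positions are replaced by `-100` so that the loss ignores them. The sketch below mimics that behavior on plain lists; the token ids and the `pad_id` value are made up.

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips


def make_causal_lm_labels(input_ids, pad_id):
    """Copy input_ids into labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


# Made-up example: two real tokens followed by two pad tokens.
example = [11, 12, 0, 0]
print(make_causal_lm_labels(example, pad_id=0))
```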

In [16]:
# Callback that handles early stopping
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=2, early_stopping_threshold=0.001
)

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    callbacks=[early_stopping_callback]
)
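
The patience logic behind early stopping can be sketched as follows. This is a simplified model of the idea, not the callback's actual code: a counter grows whenever the evaluation loss fails to improve on the best value by more than the threshold, and training stops once the counter reaches the patience. `should_stop` is an illustrative helper.

```python
def should_stop(eval_losses, patience, threshold):
    """Return the 1-based evaluation index at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        if best is None or best - loss > threshold:
            best = loss          # meaningful improvement: remember it
            bad_evals = 0
        else:
            bad_evals += 1       # no meaningful improvement this time
        if bad_evals >= patience:
            return i
    return None


# Made-up loss curve that plateaus after the second evaluation.
print(should_stop([2.00, 1.95, 1.95, 1.95], patience=2, threshold=0.001))
```

Under this model, a run with one evaluation per epoch, two epochs, and a patience of 2 can never stop early: the first evaluation always sets the baseline, so at most one non-improving evaluation can occur.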

## Training the model

In [18]:
trainer.train(
    resume_from_checkpoint=False,  # Whether to resume training from a chosen (or the latest) checkpoint
)

  0%|          | 500/112456 [02:02<7:35:25,  4.10it/s]

{'loss': 2.1569, 'grad_norm': 0.518764853477478, 'learning_rate': 0.0004999756119294758, 'epoch': 0.01}


  1%|          | 1000/112456 [04:04<7:34:04,  4.09it/s]

{'loss': 2.1077, 'grad_norm': 0.6214762330055237, 'learning_rate': 0.0004999024524761271, 'epoch': 0.02}


  1%|▏         | 1500/112456 [06:06<7:31:32,  4.10it/s]

{'loss': 2.0791, 'grad_norm': 0.6481390595436096, 'learning_rate': 0.000499780535913697, 'epoch': 0.03}


  2%|▏         | 2000/112456 [08:08<7:28:45,  4.10it/s]

{'loss': 2.0769, 'grad_norm': 0.685985267162323, 'learning_rate': 0.0004996102759437422, 'epoch': 0.04}


  2%|▏         | 2500/112456 [10:10<7:26:42,  4.10it/s]

{'loss': 2.0695, 'grad_norm': 0.6692337393760681, 'learning_rate': 0.000499391023391188, 'epoch': 0.04}


  3%|▎         | 3000/112456 [12:12<7:24:51,  4.10it/s]

{'loss': 2.0634, 'grad_norm': 0.7404853701591492, 'learning_rate': 0.0004991231135117012, 'epoch': 0.05}


  3%|▎         | 3500/112456 [14:14<7:23:03,  4.10it/s]

{'loss': 2.0558, 'grad_norm': 0.7259758114814758, 'learning_rate': 0.0004988065985757223, 'epoch': 0.06}


  4%|▎         | 4000/112456 [16:16<7:21:28,  4.09it/s]

{'loss': 2.0438, 'grad_norm': 0.6921285390853882, 'learning_rate': 0.0004984415403367597, 'epoch': 0.07}


  4%|▍         | 4500/112456 [18:18<7:18:26,  4.10it/s]

{'loss': 2.0559, 'grad_norm': 0.7862556576728821, 'learning_rate': 0.0004980288854030577, 'epoch': 0.08}


  4%|▍         | 5000/112456 [20:20<7:16:36,  4.10it/s]

{'loss': 2.0636, 'grad_norm': 0.6837876439094543, 'learning_rate': 0.0004975670603848325, 'epoch': 0.09}


  5%|▍         | 5500/112456 [22:22<7:14:33,  4.10it/s]

{'loss': 2.0486, 'grad_norm': 0.7965759634971619, 'learning_rate': 0.0004970569339031824, 'epoch': 0.1}


  5%|▌         | 6000/112456 [24:24<7:11:54,  4.11it/s]

{'loss': 2.0362, 'grad_norm': 0.8392179608345032, 'learning_rate': 0.0004964986054861119, 'epoch': 0.11}


  6%|▌         | 6500/112456 [26:26<7:10:03,  4.11it/s]

{'loss': 2.0513, 'grad_norm': 0.6913261413574219, 'learning_rate': 0.0004958921840660437, 'epoch': 0.12}


  6%|▌         | 7000/112456 [28:28<7:08:32,  4.10it/s]

{'loss': 2.0507, 'grad_norm': 0.6823986172676086, 'learning_rate': 0.0004952377879585648, 'epoch': 0.12}


  7%|▋         | 7500/112456 [30:29<7:06:08,  4.10it/s]

{'loss': 2.0501, 'grad_norm': 0.6920040249824524, 'learning_rate': 0.0004945355448393423, 'epoch': 0.13}


  7%|▋         | 8000/112456 [32:31<7:03:56,  4.11it/s]

{'loss': 2.0354, 'grad_norm': 0.7328618764877319, 'learning_rate': 0.0004937855917192142, 'epoch': 0.14}


  8%|▊         | 8500/112456 [34:33<7:02:20,  4.10it/s]

{'loss': 2.0278, 'grad_norm': 0.6659445762634277, 'learning_rate': 0.0004929880749174569, 'epoch': 0.15}


  8%|▊         | 9000/112456 [36:35<6:59:58,  4.11it/s]

{'loss': 2.0461, 'grad_norm': 0.8093478679656982, 'learning_rate': 0.0004921448870882406, 'epoch': 0.16}


  8%|▊         | 9500/112456 [38:37<6:58:30,  4.10it/s]

{'loss': 2.0291, 'grad_norm': 0.7172437906265259, 'learning_rate': 0.000491252813286079, 'epoch': 0.17}


  9%|▉         | 10000/112456 [40:39<6:56:18,  4.10it/s]

{'loss': 2.0435, 'grad_norm': 0.7749018669128418, 'learning_rate': 0.0004903136699589208, 'epoch': 0.18}


  9%|▉         | 10500/112456 [42:41<6:53:51,  4.11it/s]

{'loss': 2.0468, 'grad_norm': 0.7501105666160583, 'learning_rate': 0.0004893296590633301, 'epoch': 0.19}


 10%|▉         | 11000/112456 [44:43<6:52:10,  4.10it/s]

{'loss': 2.0143, 'grad_norm': 0.8483378887176514, 'learning_rate': 0.0004882991404463456, 'epoch': 0.2}


 10%|█         | 11500/112456 [46:45<6:49:34,  4.11it/s]

{'loss': 2.0146, 'grad_norm': 0.8141250610351562, 'learning_rate': 0.0004872201100407824, 'epoch': 0.2}


 11%|█         | 12000/112456 [48:47<6:47:38,  4.11it/s]

{'loss': 2.0253, 'grad_norm': 0.6820712685585022, 'learning_rate': 0.00048609479690903156, 'epoch': 0.21}


 11%|█         | 12500/112456 [50:48<6:45:07,  4.11it/s]

{'loss': 2.0277, 'grad_norm': 0.7749297022819519, 'learning_rate': 0.0004849234206048215, 'epoch': 0.22}


 12%|█▏        | 13000/112456 [52:50<6:43:53,  4.10it/s]

{'loss': 2.0247, 'grad_norm': 0.8346525430679321, 'learning_rate': 0.0004837062096690154, 'epoch': 0.23}


 12%|█▏        | 13500/112456 [54:52<6:41:21,  4.11it/s]

{'loss': 2.0266, 'grad_norm': 0.7978957295417786, 'learning_rate': 0.00048244340158502255, 'epoch': 0.24}


 12%|█▏        | 14000/112456 [56:54<6:39:45,  4.10it/s]

{'loss': 2.0282, 'grad_norm': 0.8245784640312195, 'learning_rate': 0.0004811352427324637, 'epoch': 0.25}


 13%|█▎        | 14500/112456 [58:55<6:37:27,  4.11it/s]

{'loss': 2.0257, 'grad_norm': 0.8406990766525269, 'learning_rate': 0.0004797819883391017, 'epoch': 0.26}


 13%|█▎        | 15000/112456 [1:00:57<6:36:23,  4.10it/s]

{'loss': 2.0284, 'grad_norm': 0.7831322550773621, 'learning_rate': 0.00047838390243104523, 'epoch': 0.27}


 14%|█▍        | 15500/112456 [1:02:59<6:33:11,  4.11it/s]

{'loss': 2.0264, 'grad_norm': 0.7529857754707336, 'learning_rate': 0.00047694125778123623, 'epoch': 0.28}


 14%|█▍        | 16000/112456 [1:05:01<6:31:25,  4.11it/s]

{'loss': 2.0227, 'grad_norm': 0.8595646023750305, 'learning_rate': 0.00047545433585623013, 'epoch': 0.28}


 15%|█▍        | 16500/112456 [1:07:02<6:29:41,  4.10it/s]

{'loss': 2.027, 'grad_norm': 0.7826861143112183, 'learning_rate': 0.0004739234267612813, 'epoch': 0.29}


 15%|█▌        | 17000/112456 [1:09:04<6:27:10,  4.11it/s]

{'loss': 2.0316, 'grad_norm': 0.9378491044044495, 'learning_rate': 0.00047235202177722205, 'epoch': 0.3}


 16%|█▌        | 17500/112456 [1:11:06<6:25:23,  4.11it/s]

{'loss': 2.038, 'grad_norm': 0.7592489719390869, 'learning_rate': 0.00047073412937857556, 'epoch': 0.31}


 16%|█▌        | 18000/112456 [1:13:08<6:23:21,  4.11it/s]

{'loss': 2.0255, 'grad_norm': 0.8399600386619568, 'learning_rate': 0.0004690731707438141, 'epoch': 0.32}


 16%|█▋        | 18500/112456 [1:15:09<6:21:45,  4.10it/s]

{'loss': 2.0258, 'grad_norm': 0.7682642340660095, 'learning_rate': 0.000467369469933548, 'epoch': 0.33}


 17%|█▋        | 19000/112456 [1:17:11<6:19:04,  4.11it/s]

{'loss': 2.0362, 'grad_norm': 0.732328474521637, 'learning_rate': 0.0004656233593475816, 'epoch': 0.34}


 17%|█▋        | 19500/112456 [1:19:13<6:17:41,  4.10it/s]

{'loss': 2.0268, 'grad_norm': 0.7495322227478027, 'learning_rate': 0.00046383517966005973, 'epoch': 0.35}


 18%|█▊        | 20000/112456 [1:21:14<6:15:11,  4.11it/s]

{'loss': 2.0156, 'grad_norm': 0.9389219880104065, 'learning_rate': 0.00046200898095364153, 'epoch': 0.36}


 18%|█▊        | 20500/112456 [1:23:16<6:13:59,  4.10it/s]

{'loss': 2.0193, 'grad_norm': 0.7239837646484375, 'learning_rate': 0.00046013780021358176, 'epoch': 0.36}


 19%|█▊        | 21000/112456 [1:25:18<6:11:29,  4.10it/s]

{'loss': 2.0123, 'grad_norm': 0.7585424184799194, 'learning_rate': 0.0004582256206295907, 'epoch': 0.37}


 19%|█▉        | 21500/112456 [1:27:20<6:09:12,  4.11it/s]

{'loss': 2.019, 'grad_norm': 0.8585526347160339, 'learning_rate': 0.00045627281527663274, 'epoch': 0.38}


 20%|█▉        | 22000/112456 [1:29:22<6:06:46,  4.11it/s]

{'loss': 2.0151, 'grad_norm': 0.8364282846450806, 'learning_rate': 0.0004542797651559452, 'epoch': 0.39}


 20%|██        | 22500/112456 [1:31:23<6:05:28,  4.10it/s]

{'loss': 2.0248, 'grad_norm': 0.8515928387641907, 'learning_rate': 0.00045224685912070335, 'epoch': 0.4}


 20%|██        | 23000/112456 [1:33:25<6:02:49,  4.11it/s]

{'loss': 2.024, 'grad_norm': 0.7328143119812012, 'learning_rate': 0.00045017867764369124, 'epoch': 0.41}


 21%|██        | 23500/112456 [1:35:27<6:01:35,  4.10it/s]

{'loss': 2.0223, 'grad_norm': 0.779207706451416, 'learning_rate': 0.0004480673350670831, 'epoch': 0.42}


 21%|██▏       | 24000/112456 [1:37:29<5:58:43,  4.11it/s]

{'loss': 2.0116, 'grad_norm': 0.7375206351280212, 'learning_rate': 0.00044591734864938567, 'epoch': 0.43}


 22%|██▏       | 24500/112456 [1:39:30<5:56:45,  4.11it/s]

{'loss': 2.0233, 'grad_norm': 0.8189454674720764, 'learning_rate': 0.000443729137862762, 'epoch': 0.44}


 22%|██▏       | 25000/112456 [1:41:32<5:54:55,  4.11it/s]

{'loss': 2.0249, 'grad_norm': 0.8829951286315918, 'learning_rate': 0.0004415076190879707, 'epoch': 0.44}


 23%|██▎       | 25500/112456 [1:43:34<5:53:34,  4.10it/s]

{'loss': 2.0168, 'grad_norm': 0.7553808093070984, 'learning_rate': 0.00043924432201565723, 'epoch': 0.45}


 23%|██▎       | 26000/112456 [1:45:36<5:50:50,  4.11it/s]

{'loss': 2.0014, 'grad_norm': 0.8104917407035828, 'learning_rate': 0.00043694410251237075, 'epoch': 0.46}


 24%|██▎       | 26500/112456 [1:47:38<5:48:42,  4.11it/s]

{'loss': 2.0021, 'grad_norm': 0.8862489461898804, 'learning_rate': 0.00043460740936143504, 'epoch': 0.47}


 24%|██▍       | 27000/112456 [1:49:40<5:47:04,  4.10it/s]

{'loss': 2.0069, 'grad_norm': 0.8271862268447876, 'learning_rate': 0.000432234698462349, 'epoch': 0.48}


 24%|██▍       | 27500/112456 [1:51:41<5:44:45,  4.11it/s]

{'loss': 2.0178, 'grad_norm': 0.7289025187492371, 'learning_rate': 0.0004298264327418384, 'epoch': 0.49}


 25%|██▍       | 28000/112456 [1:53:43<5:43:11,  4.10it/s]

{'loss': 1.9985, 'grad_norm': 0.769281268119812, 'learning_rate': 0.00042738308206353726, 'epoch': 0.5}


 25%|██▌       | 28500/112456 [1:55:45<5:40:54,  4.10it/s]

{'loss': 2.002, 'grad_norm': 0.7057134509086609, 'learning_rate': 0.00042490512313631473, 'epoch': 0.51}


 26%|██▌       | 29000/112456 [1:57:47<5:39:15,  4.10it/s]

{'loss': 2.009, 'grad_norm': 0.6614751815795898, 'learning_rate': 0.0004223930394212675, 'epoch': 0.52}


 26%|██▌       | 29500/112456 [1:59:49<5:36:41,  4.11it/s]

{'loss': 2.0094, 'grad_norm': 0.765697717666626, 'learning_rate': 0.000419847321037394, 'epoch': 0.52}


 27%|██▋       | 30000/112456 [2:01:51<5:34:51,  4.10it/s]

{'loss': 1.9991, 'grad_norm': 0.7518970966339111, 'learning_rate': 0.00041726846466596997, 'epoch': 0.53}


 27%|██▋       | 30500/112456 [2:03:53<5:33:26,  4.10it/s]

{'loss': 2.0079, 'grad_norm': 0.7607879042625427, 'learning_rate': 0.00041465697345364393, 'epoch': 0.54}


 28%|██▊       | 31000/112456 [2:05:55<5:30:37,  4.11it/s]

{'loss': 1.9964, 'grad_norm': 0.7390720248222351, 'learning_rate': 0.00041201335691427084, 'epoch': 0.55}


 28%|██▊       | 31500/112456 [2:07:56<5:30:50,  4.08it/s]

{'loss': 1.9966, 'grad_norm': 0.6989824771881104, 'learning_rate': 0.0004093435124820932, 'epoch': 0.56}


 28%|██▊       | 32000/112456 [2:09:58<5:26:37,  4.11it/s]

{'loss': 1.9975, 'grad_norm': 0.7437044382095337, 'learning_rate': 0.0004066372604509906, 'epoch': 0.57}


 29%|██▉       | 32500/112456 [2:12:00<5:24:45,  4.10it/s]

{'loss': 1.9887, 'grad_norm': 0.7748665809631348, 'learning_rate': 0.0004039004477754512, 'epoch': 0.58}


 29%|██▉       | 33000/112456 [2:14:02<5:24:40,  4.08it/s]

{'loss': 2.0098, 'grad_norm': 0.8132932782173157, 'learning_rate': 0.0004011336084201194, 'epoch': 0.59}


 30%|██▉       | 33500/112456 [2:16:04<5:20:19,  4.11it/s]

{'loss': 2.0006, 'grad_norm': 0.7618914842605591, 'learning_rate': 0.0003983485255294658, 'epoch': 0.6}


 30%|███       | 34000/112456 [2:18:06<5:18:58,  4.10it/s]

{'loss': 2.0041, 'grad_norm': 0.7833134531974792, 'learning_rate': 0.00039552337270734567, 'epoch': 0.6}


 31%|███       | 34500/112456 [2:20:08<5:17:13,  4.10it/s]

{'loss': 1.9909, 'grad_norm': 0.8321043252944946, 'learning_rate': 0.00039266982761101363, 'epoch': 0.61}


 31%|███       | 35000/112456 [2:22:10<5:14:24,  4.11it/s]

{'loss': 1.9924, 'grad_norm': 0.7615925073623657, 'learning_rate': 0.0003897884469801418, 'epoch': 0.62}


 32%|███▏      | 35500/112456 [2:24:12<5:13:46,  4.09it/s]

{'loss': 1.9852, 'grad_norm': 0.7763078212738037, 'learning_rate': 0.00038687979298524276, 'epoch': 0.63}


 32%|███▏      | 36000/112456 [2:26:14<5:11:08,  4.10it/s]

{'loss': 1.9995, 'grad_norm': 0.8723962306976318, 'learning_rate': 0.00038394443311798635, 'epoch': 0.64}


 32%|███▏      | 36500/112456 [2:28:16<5:08:34,  4.10it/s]

{'loss': 2.0124, 'grad_norm': 0.8782823085784912, 'learning_rate': 0.0003809829400804803, 'epoch': 0.65}


 33%|███▎      | 37000/112456 [2:30:18<5:07:12,  4.09it/s]

{'loss': 1.9939, 'grad_norm': 0.7619538307189941, 'learning_rate': 0.000377995891673533, 'epoch': 0.66}


 33%|███▎      | 37500/112456 [2:32:20<5:04:17,  4.11it/s]

{'loss': 1.9884, 'grad_norm': 0.7361816167831421, 'learning_rate': 0.00037498991925874063, 'epoch': 0.67}


 34%|███▍      | 38000/112456 [2:34:21<5:02:45,  4.10it/s]

{'loss': 2.0046, 'grad_norm': 0.7661983370780945, 'learning_rate': 0.00037195356152569086, 'epoch': 0.68}


 34%|███▍      | 38500/112456 [2:36:23<5:00:24,  4.10it/s]

{'loss': 1.9965, 'grad_norm': 0.7611064314842224, 'learning_rate': 0.0003688934100961678, 'epoch': 0.68}


 35%|███▍      | 39000/112456 [2:38:25<4:57:53,  4.11it/s]

{'loss': 1.9919, 'grad_norm': 0.7635722160339355, 'learning_rate': 0.0003658100620196824, 'epoch': 0.69}


 35%|███▌      | 39500/112456 [2:40:27<4:56:21,  4.10it/s]

{'loss': 1.9883, 'grad_norm': 0.9454558491706848, 'learning_rate': 0.00036270411887151754, 'epoch': 0.7}


 36%|███▌      | 40000/112456 [2:42:29<4:54:04,  4.11it/s]

{'loss': 1.9833, 'grad_norm': 0.7312596440315247, 'learning_rate': 0.00035958246404001236, 'epoch': 0.71}


 36%|███▌      | 40500/112456 [2:44:31<4:52:03,  4.11it/s]

{'loss': 1.9892, 'grad_norm': 0.793361246585846, 'learning_rate': 0.00035643319513549535, 'epoch': 0.72}


 36%|███▋      | 41000/112456 [2:46:32<4:49:55,  4.11it/s]

{'loss': 1.9842, 'grad_norm': 0.8245140910148621, 'learning_rate': 0.0003532631606288256, 'epoch': 0.73}


 37%|███▋      | 41500/112456 [2:48:34<4:48:00,  4.11it/s]

{'loss': 1.9981, 'grad_norm': 0.7606401443481445, 'learning_rate': 0.0003500793790652055, 'epoch': 0.74}


 37%|███▋      | 42000/112456 [2:50:36<4:45:48,  4.11it/s]

{'loss': 1.9796, 'grad_norm': 0.8048599362373352, 'learning_rate': 0.00034686971117529067, 'epoch': 0.75}


 38%|███▊      | 42500/112456 [2:52:38<4:43:57,  4.11it/s]

{'loss': 1.986, 'grad_norm': 0.7859276533126831, 'learning_rate': 0.0003436411435625933, 'epoch': 0.76}


 38%|███▊      | 43000/112456 [2:54:40<4:41:36,  4.11it/s]

{'loss': 1.9891, 'grad_norm': 0.8322303891181946, 'learning_rate': 0.0003403943061353906, 'epoch': 0.76}


 39%|███▊      | 43500/112456 [2:56:41<4:39:34,  4.11it/s]

{'loss': 1.9831, 'grad_norm': 0.8389952182769775, 'learning_rate': 0.00033712983236648377, 'epoch': 0.77}


 39%|███▉      | 44000/112456 [2:58:43<4:38:10,  4.10it/s]

{'loss': 1.9869, 'grad_norm': 0.7591073513031006, 'learning_rate': 0.00033384835916960484, 'epoch': 0.78}


 40%|███▉      | 44500/112456 [3:00:45<4:36:21,  4.10it/s]

{'loss': 1.9807, 'grad_norm': 0.7070205807685852, 'learning_rate': 0.0003305505267751517, 'epoch': 0.79}


 40%|████      | 45000/112456 [3:02:47<4:34:32,  4.10it/s]

{'loss': 1.9827, 'grad_norm': 0.7902395129203796, 'learning_rate': 0.0003272369786052766, 'epoch': 0.8}


 40%|████      | 45500/112456 [3:04:49<4:32:07,  4.10it/s]

{'loss': 1.979, 'grad_norm': 0.780507504940033, 'learning_rate': 0.000323908361148351, 'epoch': 0.81}


 41%|████      | 46000/112456 [3:06:51<4:31:41,  4.08it/s]

{'loss': 1.9872, 'grad_norm': 0.7896344065666199, 'learning_rate': 0.0003205720238654147, 'epoch': 0.82}


 41%|████▏     | 46500/112456 [3:08:53<4:27:29,  4.11it/s]

{'loss': 2.0101, 'grad_norm': 0.7487414479255676, 'learning_rate': 0.00031722197267824705, 'epoch': 0.83}


 42%|████▏     | 47000/112456 [3:10:55<4:25:58,  4.10it/s]

{'loss': 1.9656, 'grad_norm': 0.6491565704345703, 'learning_rate': 0.000313852106199891, 'epoch': 0.84}


 42%|████▏     | 47500/112456 [3:12:57<4:24:19,  4.10it/s]

{'loss': 1.983, 'grad_norm': 0.7652001976966858, 'learning_rate': 0.00031046978188418187, 'epoch': 0.84}


 43%|████▎     | 48000/112456 [3:14:59<4:21:32,  4.11it/s]

{'loss': 1.9847, 'grad_norm': 0.7022167444229126, 'learning_rate': 0.0003070756596380314, 'epoch': 0.85}


 43%|████▎     | 48500/112456 [3:17:01<4:20:05,  4.10it/s]

{'loss': 1.9754, 'grad_norm': 0.7046764492988586, 'learning_rate': 0.00030367040167018124, 'epoch': 0.86}


 44%|████▎     | 49000/112456 [3:19:03<4:18:02,  4.10it/s]

{'loss': 1.9831, 'grad_norm': 0.7084428668022156, 'learning_rate': 0.00030025467236200306, 'epoch': 0.87}


 44%|████▍     | 49500/112456 [3:21:05<4:16:17,  4.09it/s]

{'loss': 1.9678, 'grad_norm': 0.7312954664230347, 'learning_rate': 0.00029683599854771236, 'epoch': 0.88}


 44%|████▍     | 50000/112456 [3:23:07<4:14:04,  4.10it/s]

{'loss': 1.9739, 'grad_norm': 0.8320545554161072, 'learning_rate': 0.0002934013453499463, 'epoch': 0.89}


 45%|████▍     | 50500/112456 [3:25:09<4:12:20,  4.09it/s]

{'loss': 1.9818, 'grad_norm': 0.7356836795806885, 'learning_rate': 0.00028996511859823084, 'epoch': 0.9}


 45%|████▌     | 51000/112456 [3:27:11<4:09:30,  4.11it/s]

{'loss': 1.9784, 'grad_norm': 0.8881963491439819, 'learning_rate': 0.00028651421648855264, 'epoch': 0.91}


 46%|████▌     | 51500/112456 [3:29:12<4:07:48,  4.10it/s]

{'loss': 1.9757, 'grad_norm': 1.014479398727417, 'learning_rate': 0.0002830561902885797, 'epoch': 0.92}


 46%|████▌     | 52000/112456 [3:31:14<4:05:34,  4.10it/s]

{'loss': 1.971, 'grad_norm': 0.779587984085083, 'learning_rate': 0.00027959171467500655, 'epoch': 0.92}


 47%|████▋     | 52500/112456 [3:33:16<4:03:52,  4.10it/s]

{'loss': 1.979, 'grad_norm': 0.7394303679466248, 'learning_rate': 0.00027612146558283796, 'epoch': 0.93}


 47%|████▋     | 53000/112456 [3:35:18<4:01:16,  4.11it/s]

{'loss': 1.9731, 'grad_norm': 0.8023194670677185, 'learning_rate': 0.0002726461200735108, 'epoch': 0.94}


 48%|████▊     | 53500/112456 [3:37:20<3:59:15,  4.11it/s]

{'loss': 1.9789, 'grad_norm': 0.755465567111969, 'learning_rate': 0.000269166356202796, 'epoch': 0.95}


 48%|████▊     | 54000/112456 [3:39:22<3:57:26,  4.10it/s]

{'loss': 1.9855, 'grad_norm': 0.7584167122840881, 'learning_rate': 0.00026568285288850705, 'epoch': 0.96}


 48%|████▊     | 54500/112456 [3:41:24<3:55:19,  4.10it/s]

{'loss': 1.978, 'grad_norm': 0.8977229595184326, 'learning_rate': 0.00026219628977803996, 'epoch': 0.97}


 49%|████▉     | 55000/112456 [3:43:26<3:53:06,  4.11it/s]

{'loss': 1.9595, 'grad_norm': 0.8658048510551453, 'learning_rate': 0.00025870734711577097, 'epoch': 0.98}


 49%|████▉     | 55500/112456 [3:45:28<3:51:22,  4.10it/s]

{'loss': 1.9711, 'grad_norm': 0.7593819499015808, 'learning_rate': 0.0002552167056103377, 'epoch': 0.99}


 50%|████▉     | 56000/112456 [3:47:30<3:48:52,  4.11it/s]

{'loss': 1.9661, 'grad_norm': 0.7528297305107117, 'learning_rate': 0.00025172504630182967, 'epoch': 1.0}


 50%|█████     | 56228/112456 [3:48:25<3:34:56,  4.36it/s]
The 'batch_size' argument of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'max_batch_size' argument instead.
The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


{'eval_loss': 1.9575941562652588, 'eval_runtime': 670.1474, 'eval_samples_per_second': 74.581, 'eval_steps_per_second': 9.323, 'epoch': 1.0}
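
The evaluation summary is internally consistent, which a few lines of arithmetic confirm. The sketch assumes the evaluation set of 49980 rows and the evaluation batch size of 8 used in this notebook.

```python
import math

EVAL_ROWS = 49_980         # size of the test split
EVAL_BATCH_SIZE = 8        # per_device_eval_batch_size
EVAL_RUNTIME_S = 670.1474  # eval_runtime from the log line above

eval_steps = math.ceil(EVAL_ROWS / EVAL_BATCH_SIZE)
samples_per_second = EVAL_ROWS / EVAL_RUNTIME_S
steps_per_second = eval_steps / EVAL_RUNTIME_S

print(eval_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```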


 50%|█████     | 56500/112456 [4:00:43<3:46:58,  4.11it/s]    

{'loss': 1.9704, 'grad_norm': 0.8217437863349915, 'learning_rate': 0.0002482330504289149, 'epoch': 1.0}


 51%|█████     | 57000/112456 [4:02:44<3:44:54,  4.11it/s]

{'loss': 1.9597, 'grad_norm': 0.6823843717575073, 'learning_rate': 0.000244741399295926, 'epoch': 1.01}


 51%|█████     | 57500/112456 [4:04:46<3:42:31,  4.12it/s]

{'loss': 1.9477, 'grad_norm': 0.7872092723846436, 'learning_rate': 0.00024125077413993576, 'epoch': 1.02}


 52%|█████▏    | 58000/112456 [4:06:47<3:40:48,  4.11it/s]

{'loss': 1.9446, 'grad_norm': 0.7631298899650574, 'learning_rate': 0.00023776185599784407, 'epoch': 1.03}


 52%|█████▏    | 58500/112456 [4:08:49<3:38:30,  4.12it/s]

{'loss': 1.9348, 'grad_norm': 0.8118918538093567, 'learning_rate': 0.00023428926603713725, 'epoch': 1.04}


 52%|█████▏    | 59000/112456 [4:10:51<3:36:56,  4.11it/s]

{'loss': 1.9513, 'grad_norm': 0.804080605506897, 'learning_rate': 0.00023081275338367548, 'epoch': 1.05}


 53%|█████▎    | 59500/112456 [4:12:52<3:34:52,  4.11it/s]

{'loss': 1.9403, 'grad_norm': 0.7807339429855347, 'learning_rate': 0.00022733301399939206, 'epoch': 1.06}


 53%|█████▎    | 60000/112456 [4:14:54<3:32:52,  4.11it/s]

{'loss': 1.9453, 'grad_norm': 0.7614490985870361, 'learning_rate': 0.0002238576970475339, 'epoch': 1.07}


 54%|█████▍    | 60500/112456 [4:16:56<3:30:42,  4.11it/s]

{'loss': 1.9395, 'grad_norm': 0.7275652885437012, 'learning_rate': 0.00022038748057830038, 'epoch': 1.08}


 54%|█████▍    | 61000/112456 [4:18:57<3:28:40,  4.11it/s]

{'loss': 1.9464, 'grad_norm': 0.8808759450912476, 'learning_rate': 0.00021692304164676316, 'epoch': 1.08}


 55%|█████▍    | 61500/112456 [4:20:59<3:26:20,  4.12it/s]

{'loss': 1.9486, 'grad_norm': 0.7297001481056213, 'learning_rate': 0.00021346505618077033, 'epoch': 1.09}


 55%|█████▌    | 62000/112456 [4:23:01<3:24:35,  4.11it/s]

{'loss': 1.946, 'grad_norm': 0.9659419059753418, 'learning_rate': 0.00021001419884906922, 'epoch': 1.1}


 56%|█████▌    | 62500/112456 [4:25:02<3:22:30,  4.11it/s]

{'loss': 1.9493, 'grad_norm': 0.8329532742500305, 'learning_rate': 0.00020657802080894375, 'epoch': 1.11}


 56%|█████▌    | 63000/112456 [4:27:04<3:20:30,  4.11it/s]

{'loss': 1.9529, 'grad_norm': 0.8192382454872131, 'learning_rate': 0.00020314342044211044, 'epoch': 1.12}


 56%|█████▋    | 63500/112456 [4:29:06<3:18:31,  4.11it/s]

{'loss': 1.9379, 'grad_norm': 0.8031325936317444, 'learning_rate': 0.00019971796200781125, 'epoch': 1.13}


 57%|█████▋    | 64000/112456 [4:31:08<3:16:37,  4.11it/s]

{'loss': 1.9456, 'grad_norm': 0.8184086084365845, 'learning_rate': 0.00019630231382862135, 'epoch': 1.14}


 57%|█████▋    | 64500/112456 [4:33:09<3:14:38,  4.11it/s]

{'loss': 1.9468, 'grad_norm': 0.7210070490837097, 'learning_rate': 0.00019289714231309022, 'epoch': 1.15}


 58%|█████▊    | 65000/112456 [4:35:11<3:12:39,  4.11it/s]

{'loss': 1.9407, 'grad_norm': 0.7995115518569946, 'learning_rate': 0.00018950311182572226, 'epoch': 1.16}


 58%|█████▊    | 65500/112456 [4:37:13<3:10:44,  4.10it/s]

{'loss': 1.9356, 'grad_norm': 0.7586414217948914, 'learning_rate': 0.00018612088455735655, 'epoch': 1.16}


 59%|█████▊    | 66000/112456 [4:39:15<3:08:35,  4.11it/s]

{'loss': 1.9359, 'grad_norm': 0.8282975554466248, 'learning_rate': 0.00018275112039597036, 'epoch': 1.17}


 59%|█████▉    | 66500/112456 [4:41:16<3:06:24,  4.11it/s]

{'loss': 1.9286, 'grad_norm': 0.8683239817619324, 'learning_rate': 0.0001793944767979319, 'epoch': 1.18}


 60%|█████▉    | 67000/112456 [4:43:18<3:04:10,  4.11it/s]

{'loss': 1.945, 'grad_norm': 0.804398775100708, 'learning_rate': 0.0001760582802141324, 'epoch': 1.19}


 60%|██████    | 67500/112456 [4:45:20<3:02:00,  4.12it/s]

{'loss': 1.9503, 'grad_norm': 0.8270774483680725, 'learning_rate': 0.00017272981024020123, 'epoch': 1.2}


 60%|██████    | 68000/112456 [4:47:21<3:00:15,  4.11it/s]

{'loss': 1.9362, 'grad_norm': 0.7521467208862305, 'learning_rate': 0.0001694164160329683, 'epoch': 1.21}


 61%|██████    | 68500/112456 [4:49:23<2:58:28,  4.10it/s]

{'loss': 1.9417, 'grad_norm': 0.8094593286514282, 'learning_rate': 0.00016611874405076654, 'epoch': 1.22}


 61%|██████▏   | 69000/112456 [4:51:25<2:56:13,  4.11it/s]

{'loss': 1.9371, 'grad_norm': 0.8141094446182251, 'learning_rate': 0.00016283743768445073, 'epoch': 1.23}


 62%|██████▏   | 69500/112456 [4:53:27<2:54:22,  4.11it/s]

{'loss': 1.9365, 'grad_norm': 0.8237975239753723, 'learning_rate': 0.00015957964833729392, 'epoch': 1.24}


 62%|██████▏   | 70000/112456 [4:55:28<2:52:14,  4.11it/s]

{'loss': 1.9588, 'grad_norm': 0.7734231948852539, 'learning_rate': 0.00015633295455965768, 'epoch': 1.24}


 63%|██████▎   | 70500/112456 [4:57:30<2:49:59,  4.11it/s]

{'loss': 1.9359, 'grad_norm': 0.7397585511207581, 'learning_rate': 0.00015310453565010136, 'epoch': 1.25}


 63%|██████▎   | 71000/112456 [4:59:32<2:48:11,  4.11it/s]

{'loss': 1.9398, 'grad_norm': 0.9181725978851318, 'learning_rate': 0.0001498950214878895, 'epoch': 1.26}


 64%|██████▎   | 71500/112456 [5:01:33<2:46:09,  4.11it/s]

{'loss': 1.9474, 'grad_norm': 0.8949733376502991, 'learning_rate': 0.0001467050382638838, 'epoch': 1.27}


 64%|██████▍   | 72000/112456 [5:03:35<2:44:00,  4.11it/s]

{'loss': 1.9249, 'grad_norm': 0.7370034456253052, 'learning_rate': 0.00014354152749344062, 'epoch': 1.28}


 64%|██████▍   | 72500/112456 [5:05:37<2:42:06,  4.11it/s]

{'loss': 1.9437, 'grad_norm': 0.7620988488197327, 'learning_rate': 0.00014039242719663178, 'epoch': 1.29}


 65%|██████▍   | 73000/112456 [5:07:39<2:39:44,  4.12it/s]

{'loss': 1.9316, 'grad_norm': 0.7169694900512695, 'learning_rate': 0.00013726471183754708, 'epoch': 1.3}


 65%|██████▌   | 73500/112456 [5:09:40<2:38:06,  4.11it/s]

{'loss': 1.9358, 'grad_norm': 0.897254467010498, 'learning_rate': 0.0001341651807334538, 'epoch': 1.31}


 66%|██████▌   | 74000/112456 [5:11:42<2:36:15,  4.10it/s]

{'loss': 1.9228, 'grad_norm': 0.7541276812553406, 'learning_rate': 0.00013108201584904972, 'epoch': 1.32}


 66%|██████▌   | 74500/112456 [5:13:44<2:34:08,  4.10it/s]

{'loss': 1.9175, 'grad_norm': 1.0185059309005737, 'learning_rate': 0.0001280220524061182, 'epoch': 1.32}


 67%|██████▋   | 75000/112456 [5:15:46<2:32:02,  4.11it/s]

{'loss': 1.9241, 'grad_norm': 0.726023256778717, 'learning_rate': 0.00012498588741749318, 'epoch': 1.33}


 67%|██████▋   | 75500/112456 [5:17:47<2:29:46,  4.11it/s]

{'loss': 1.9534, 'grad_norm': 0.7245635390281677, 'learning_rate': 0.00012197411325282165, 'epoch': 1.34}


 68%|██████▊   | 76000/112456 [5:19:49<2:27:34,  4.12it/s]

{'loss': 1.9451, 'grad_norm': 0.7173519730567932, 'learning_rate': 0.00011898731752298932, 'epoch': 1.35}


 68%|██████▊   | 76500/112456 [5:21:51<2:25:43,  4.11it/s]

{'loss': 1.9303, 'grad_norm': 0.8219094276428223, 'learning_rate': 0.00011602608296547563, 'epoch': 1.36}


 68%|██████▊   | 77000/112456 [5:23:52<2:23:49,  4.11it/s]

{'loss': 1.9251, 'grad_norm': 0.8343927264213562, 'learning_rate': 0.00011309098733065837, 'epoch': 1.37}


 69%|██████▉   | 77500/112456 [5:25:54<2:21:45,  4.11it/s]

{'loss': 1.9283, 'grad_norm': 0.8018049597740173, 'learning_rate': 0.00011018260326909193, 'epoch': 1.38}


 69%|██████▉   | 78000/112456 [5:27:56<2:19:32,  4.12it/s]

{'loss': 1.9292, 'grad_norm': 0.7450768351554871, 'learning_rate': 0.0001073072328309077, 'epoch': 1.39}


 70%|██████▉   | 78500/112456 [5:29:57<2:17:40,  4.11it/s]

{'loss': 1.9441, 'grad_norm': 0.7470696568489075, 'learning_rate': 0.00010445391267093993, 'epoch': 1.4}


 70%|███████   | 79000/112456 [5:31:59<2:15:31,  4.11it/s]

{'loss': 1.9257, 'grad_norm': 0.8448213338851929, 'learning_rate': 0.00010162898921691058, 'epoch': 1.4}


 71%|███████   | 79500/112456 [5:34:00<2:13:37,  4.11it/s]

{'loss': 1.9398, 'grad_norm': 0.802759051322937, 'learning_rate': 9.883301362427904e-05, 'epoch': 1.41}


 71%|███████   | 80000/112456 [5:36:02<2:11:24,  4.12it/s]

{'loss': 1.9347, 'grad_norm': 0.8962454795837402, 'learning_rate': 9.607203457098443e-05, 'epoch': 1.42}


 72%|███████▏  | 80500/112456 [5:38:04<2:09:28,  4.11it/s]

{'loss': 1.9069, 'grad_norm': 0.8321577310562134, 'learning_rate': 9.333552486861493e-05, 'epoch': 1.43}


 72%|███████▏  | 81000/112456 [5:40:05<2:07:38,  4.11it/s]

{'loss': 1.9316, 'grad_norm': 0.8811066150665283, 'learning_rate': 9.062958112039061e-05, 'epoch': 1.44}


 72%|███████▏  | 81500/112456 [5:42:07<2:05:29,  4.11it/s]

{'loss': 1.9143, 'grad_norm': 0.728243887424469, 'learning_rate': 8.795473126828713e-05, 'epoch': 1.45}


 73%|███████▎  | 82000/112456 [5:44:08<2:03:19,  4.12it/s]

{'loss': 1.9318, 'grad_norm': 0.8065840005874634, 'learning_rate': 8.531149718771919e-05, 'epoch': 1.46}


 73%|███████▎  | 82500/112456 [5:46:10<2:01:24,  4.11it/s]

{'loss': 1.9243, 'grad_norm': 0.8870853185653687, 'learning_rate': 8.27003945857201e-05, 'epoch': 1.47}


 74%|███████▍  | 83000/112456 [5:48:12<1:59:21,  4.11it/s]

{'loss': 1.9309, 'grad_norm': 0.8790044188499451, 'learning_rate': 8.012193290032519e-05, 'epoch': 1.48}


 74%|███████▍  | 83500/112456 [5:50:13<1:57:56,  4.09it/s]

{'loss': 1.922, 'grad_norm': 0.7642385959625244, 'learning_rate': 7.758167242771533e-05, 'epoch': 1.49}


 75%|███████▍  | 84000/112456 [5:52:15<1:55:20,  4.11it/s]

{'loss': 1.9194, 'grad_norm': 0.8493313193321228, 'learning_rate': 7.506992754547381e-05, 'epoch': 1.49}


 75%|███████▌  | 84500/112456 [5:54:17<1:53:20,  4.11it/s]

{'loss': 1.9186, 'grad_norm': 0.8428139090538025, 'learning_rate': 7.259231231878288e-05, 'epoch': 1.5}


 76%|███████▌  | 85000/112456 [5:56:18<1:51:15,  4.11it/s]

{'loss': 1.9216, 'grad_norm': 0.9233678579330444, 'learning_rate': 7.014931014168163e-05, 'epoch': 1.51}


 76%|███████▌  | 85500/112456 [5:58:20<1:49:18,  4.11it/s]

{'loss': 1.9175, 'grad_norm': 0.7146962285041809, 'learning_rate': 6.774139765504514e-05, 'epoch': 1.52}


 76%|███████▋  | 86000/112456 [6:00:21<1:47:11,  4.11it/s]

{'loss': 1.92, 'grad_norm': 0.8771910071372986, 'learning_rate': 6.537375356241601e-05, 'epoch': 1.53}


 77%|███████▋  | 86500/112456 [6:02:23<1:45:17,  4.11it/s]

{'loss': 1.9071, 'grad_norm': 0.8268271088600159, 'learning_rate': 6.303735040108121e-05, 'epoch': 1.54}


 77%|███████▋  | 87000/112456 [6:04:25<1:43:00,  4.12it/s]

{'loss': 1.9216, 'grad_norm': 0.8378838300704956, 'learning_rate': 6.073742450601694e-05, 'epoch': 1.55}


 78%|███████▊  | 87500/112456 [6:06:26<1:41:35,  4.09it/s]

{'loss': 1.9124, 'grad_norm': 0.804010808467865, 'learning_rate': 5.8474424603262546e-05, 'epoch': 1.56}


 78%|███████▊  | 88000/112456 [6:08:28<1:39:05,  4.11it/s]

{'loss': 1.9107, 'grad_norm': 0.8503875136375427, 'learning_rate': 5.624879221442788e-05, 'epoch': 1.57}


 79%|███████▊  | 88500/112456 [6:10:30<1:37:14,  4.11it/s]

{'loss': 1.9155, 'grad_norm': 0.8121322393417358, 'learning_rate': 5.4065299220742827e-05, 'epoch': 1.57}


 79%|███████▉  | 89000/112456 [6:12:31<1:35:01,  4.11it/s]

{'loss': 1.9387, 'grad_norm': 0.7379148006439209, 'learning_rate': 5.191562029930677e-05, 'epoch': 1.58}


 80%|███████▉  | 89500/112456 [6:14:33<1:32:54,  4.12it/s]

{'loss': 1.9242, 'grad_norm': 0.8499362468719482, 'learning_rate': 4.980458854444575e-05, 'epoch': 1.59}


 80%|████████  | 90000/112456 [6:16:34<1:30:57,  4.11it/s]

{'loss': 1.9244, 'grad_norm': 0.7062103152275085, 'learning_rate': 4.7732615828090464e-05, 'epoch': 1.6}


 80%|████████  | 90500/112456 [6:18:36<1:28:54,  4.12it/s]

{'loss': 1.9042, 'grad_norm': 0.7932279706001282, 'learning_rate': 4.570010640157457e-05, 'epoch': 1.61}


 81%|████████  | 91000/112456 [6:20:38<1:26:55,  4.11it/s]

{'loss': 1.9213, 'grad_norm': 0.8241848945617676, 'learning_rate': 4.37114020761028e-05, 'epoch': 1.62}


 81%|████████▏ | 91500/112456 [6:22:40<1:25:00,  4.11it/s]

{'loss': 1.9114, 'grad_norm': 0.7542880773544312, 'learning_rate': 4.175892022800234e-05, 'epoch': 1.63}


 82%|████████▏ | 92000/112456 [6:24:41<1:22:56,  4.11it/s]

{'loss': 1.9032, 'grad_norm': 0.9137052893638611, 'learning_rate': 3.9847067165018156e-05, 'epoch': 1.64}


 82%|████████▏ | 92500/112456 [6:26:43<1:20:57,  4.11it/s]

{'loss': 1.8961, 'grad_norm': 0.7237542867660522, 'learning_rate': 3.797991643719845e-05, 'epoch': 1.65}


 83%|████████▎ | 93000/112456 [6:28:44<1:18:54,  4.11it/s]

{'loss': 1.9129, 'grad_norm': 0.8846579194068909, 'learning_rate': 3.615034888603996e-05, 'epoch': 1.65}


 83%|████████▎ | 93500/112456 [6:30:46<1:16:50,  4.11it/s]

{'loss': 1.9045, 'grad_norm': 0.7961409091949463, 'learning_rate': 3.436250437786495e-05, 'epoch': 1.66}


 84%|████████▎ | 94000/112456 [6:32:48<1:14:49,  4.11it/s]

{'loss': 1.9035, 'grad_norm': 0.858398973941803, 'learning_rate': 3.2616731729297245e-05, 'epoch': 1.67}


 84%|████████▍ | 94500/112456 [6:34:49<1:12:42,  4.12it/s]

{'loss': 1.9196, 'grad_norm': 0.9105767011642456, 'learning_rate': 3.091337154854837e-05, 'epoch': 1.68}


 84%|████████▍ | 95000/112456 [6:36:51<1:10:46,  4.11it/s]

{'loss': 1.9135, 'grad_norm': 0.8684396147727966, 'learning_rate': 2.9252756168964223e-05, 'epoch': 1.69}


 85%|████████▍ | 95500/112456 [6:38:52<1:08:43,  4.11it/s]

{'loss': 1.918, 'grad_norm': 0.8912816047668457, 'learning_rate': 2.7635209584184696e-05, 'epoch': 1.7}


 85%|████████▌ | 96000/112456 [6:40:54<1:06:46,  4.11it/s]

{'loss': 1.9194, 'grad_norm': 0.8380773663520813, 'learning_rate': 2.606104738493126e-05, 'epoch': 1.71}


 86%|████████▌ | 96500/112456 [6:42:56<1:04:42,  4.11it/s]

{'loss': 1.9142, 'grad_norm': 0.777070164680481, 'learning_rate': 2.4530576697433776e-05, 'epoch': 1.72}


 86%|████████▋ | 97000/112456 [6:44:57<1:02:43,  4.11it/s]

{'loss': 1.9011, 'grad_norm': 0.7455586791038513, 'learning_rate': 2.3044096123508718e-05, 'epoch': 1.73}


 87%|████████▋ | 97500/112456 [6:46:59<1:00:32,  4.12it/s]

{'loss': 1.9231, 'grad_norm': 0.8902912139892578, 'learning_rate': 2.1601895682300615e-05, 'epoch': 1.73}


 87%|████████▋ | 98000/112456 [6:49:01<58:32,  4.12it/s]  

{'loss': 1.9031, 'grad_norm': 0.8445450067520142, 'learning_rate': 2.0204256753698192e-05, 'epoch': 1.74}


 88%|████████▊ | 98500/112456 [6:51:02<56:35,  4.11it/s]

{'loss': 1.9123, 'grad_norm': 0.971398115158081, 'learning_rate': 1.885411271147289e-05, 'epoch': 1.75}


 88%|████████▊ | 99000/112456 [6:53:04<54:31,  4.11it/s]

{'loss': 1.9115, 'grad_norm': 0.8847952485084534, 'learning_rate': 1.7546315664082934e-05, 'epoch': 1.76}


 88%|████████▊ | 99500/112456 [6:55:06<52:29,  4.11it/s]

{'loss': 1.9132, 'grad_norm': 0.7200333476066589, 'learning_rate': 1.6283871391470584e-05, 'epoch': 1.77}


 89%|████████▉ | 100000/112456 [6:57:07<50:25,  4.12it/s]

{'loss': 1.9128, 'grad_norm': 0.829526424407959, 'learning_rate': 1.5067026202275458e-05, 'epoch': 1.78}


 89%|████████▉ | 100500/112456 [6:59:09<48:22,  4.12it/s]

{'loss': 1.9077, 'grad_norm': 0.857330858707428, 'learning_rate': 1.3898313627763198e-05, 'epoch': 1.79}


 90%|████████▉ | 101000/112456 [7:01:10<46:24,  4.11it/s]

{'loss': 1.9056, 'grad_norm': 0.800880491733551, 'learning_rate': 1.2775481499048419e-05, 'epoch': 1.8}


 90%|█████████ | 101500/112456 [7:03:12<44:22,  4.12it/s]

{'loss': 1.9049, 'grad_norm': 0.8598587512969971, 'learning_rate': 1.16966366531554e-05, 'epoch': 1.81}


 91%|█████████ | 102000/112456 [7:05:14<42:24,  4.11it/s]

{'loss': 1.8932, 'grad_norm': 0.8678147196769714, 'learning_rate': 1.0664285881114095e-05, 'epoch': 1.81}


 91%|█████████ | 102500/112456 [7:07:15<40:22,  4.11it/s]

{'loss': 1.906, 'grad_norm': 0.870138943195343, 'learning_rate': 9.678630599271932e-06, 'epoch': 1.82}


 92%|█████████▏| 103000/112456 [7:09:17<38:18,  4.11it/s]

{'loss': 1.9122, 'grad_norm': 0.7833731174468994, 'learning_rate': 8.739863113473078e-06, 'epoch': 1.83}


 92%|█████████▏| 103500/112456 [7:11:18<36:17,  4.11it/s]

{'loss': 1.8989, 'grad_norm': 0.767979085445404, 'learning_rate': 7.849902880623177e-06, 'epoch': 1.84}


 92%|█████████▏| 104000/112456 [7:13:20<34:15,  4.11it/s]

{'loss': 1.9269, 'grad_norm': 0.7877718210220337, 'learning_rate': 7.00535661925597e-06, 'epoch': 1.85}


 93%|█████████▎| 104500/112456 [7:15:22<32:14,  4.11it/s]

{'loss': 1.9061, 'grad_norm': 0.8342316746711731, 'learning_rate': 6.209766576220744e-06, 'epoch': 1.86}


 93%|█████████▎| 105000/112456 [7:17:23<30:11,  4.11it/s]

{'loss': 1.907, 'grad_norm': 0.8493872284889221, 'learning_rate': 5.460099289068155e-06, 'epoch': 1.87}


 94%|█████████▍| 105500/112456 [7:19:25<28:08,  4.12it/s]

{'loss': 1.9118, 'grad_norm': 0.9167575836181641, 'learning_rate': 4.758142852671676e-06, 'epoch': 1.88}


 94%|█████████▍| 106000/112456 [7:21:26<26:05,  4.12it/s]

{'loss': 1.9041, 'grad_norm': 0.7482811808586121, 'learning_rate': 4.104034221936032e-06, 'epoch': 1.89}


 95%|█████████▍| 106500/112456 [7:23:28<24:07,  4.11it/s]

{'loss': 1.9041, 'grad_norm': 0.7960655093193054, 'learning_rate': 3.497901016440458e-06, 'epoch': 1.89}


 95%|█████████▌| 107000/112456 [7:25:29<22:04,  4.12it/s]

{'loss': 1.9335, 'grad_norm': 0.800640881061554, 'learning_rate': 2.939861495539903e-06, 'epoch': 1.9}


 96%|█████████▌| 107500/112456 [7:27:31<20:04,  4.11it/s]

{'loss': 1.9122, 'grad_norm': 0.7965419888496399, 'learning_rate': 2.430024535291897e-06, 'epoch': 1.91}


 96%|█████████▌| 108000/112456 [7:29:32<18:02,  4.12it/s]

{'loss': 1.9035, 'grad_norm': 0.8034891486167908, 'learning_rate': 1.9684896072143522e-06, 'epoch': 1.92}


 96%|█████████▋| 108500/112456 [7:31:34<16:02,  4.11it/s]

{'loss': 1.9088, 'grad_norm': 0.7428690195083618, 'learning_rate': 1.5553467588783066e-06, 'epoch': 1.93}


 97%|█████████▋| 109000/112456 [7:33:35<13:59,  4.12it/s]

{'loss': 1.9048, 'grad_norm': 0.9287446737289429, 'learning_rate': 1.1913575120573727e-06, 'epoch': 1.94}


 97%|█████████▋| 109500/112456 [7:35:37<11:57,  4.12it/s]

{'loss': 1.9176, 'grad_norm': 0.8142288327217102, 'learning_rate': 8.751340317454393e-07, 'epoch': 1.95}


 98%|█████████▊| 110000/112456 [7:37:38<09:56,  4.12it/s]

{'loss': 1.9022, 'grad_norm': 0.8441639542579651, 'learning_rate': 6.075159498380722e-07, 'epoch': 1.96}


 98%|█████████▊| 110500/112456 [7:39:40<07:55,  4.12it/s]

{'loss': 1.9116, 'grad_norm': 0.8607500791549683, 'learning_rate': 3.8855547984448236e-07, 'epoch': 1.97}


 99%|█████████▊| 111000/112456 [7:41:42<05:54,  4.11it/s]

{'loss': 1.9096, 'grad_norm': 0.9766870737075806, 'learning_rate': 2.1829534195172418e-07, 'epoch': 1.97}


 99%|█████████▉| 111500/112456 [7:43:43<03:52,  4.11it/s]

{'loss': 1.9184, 'grad_norm': 0.7720109820365906, 'learning_rate': 9.676875468986324e-08, 'epoch': 1.98}


100%|█████████▉| 112000/112456 [7:45:44<01:50,  4.12it/s]

{'loss': 1.9164, 'grad_norm': 0.8505042195320129, 'learning_rate': 2.4096296335734957e-08, 'epoch': 1.99}


100%|██████████| 112456/112456 [7:47:35<00:00,  4.56it/s]

{'eval_loss': 1.9087095260620117, 'eval_runtime': 669.5652, 'eval_samples_per_second': 74.645, 'eval_steps_per_second': 9.331, 'epoch': 2.0}


100%|██████████| 112456/112456 [7:58:47<00:00,  4.56it/s]

{'train_runtime': 28727.5111, 'train_samples_per_second': 31.316, 'train_steps_per_second': 3.915, 'train_loss': 1.967983855083113, 'epoch': 2.0}


100%|██████████| 112456/112456 [7:58:48<00:00,  3.91it/s]


TrainOutput(global_step=112456, training_loss=1.967983855083113, metrics={'train_runtime': 28727.5111, 'train_samples_per_second': 31.316, 'train_steps_per_second': 3.915, 'total_flos': 2.8019826572973834e+18, 'train_loss': 1.967983855083113, 'epoch': 2.0})
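
The numbers in the log can be cross-checked against the hyperparameters with a few lines of arithmetic. The sketch below is not part of the original run. It assumes that `eval_loss` is the mean per-token cross-entropy in nats (the usual convention for causal language model training) and that the learning rate follows a plain cosine decay from `learning_rate = 0.0005` to zero with no warmup. The helper `cosine_lr` and all constant names are introduced here for illustration only; the exact scheduler settings are not shown in this section, so the predicted learning rates are expected to be close to the logged ones rather than identical.

```python
import math

# Values copied from the training log and TrainOutput above.
TOTAL_STEPS = 112456
NUM_EPOCHS = 2
EVAL_LOSS = 1.9087095260620117
TRAIN_RUNTIME_S = 28727.5111
SAMPLES_PER_S = 31.316
STEPS_PER_S = 3.915
PEAK_LR = 0.0005  # learning_rate from the hyperparameter list

# (step -> logged learning_rate) pairs taken from the log lines above.
LOGGED_LR = {
    59500: 0.00022733301399939206,
    80000: 9.607203457098443e-05,
    100000: 1.5067026202275458e-05,
}


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr to zero. No warmup is modeled (assumption)."""
    progress = step / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


steps_per_epoch = TOTAL_STEPS // NUM_EPOCHS
samples_per_step = SAMPLES_PER_S / STEPS_PER_S  # should be close to batch_size
perplexity = math.exp(EVAL_LOSS)  # valid only if eval_loss is in nats per token
runtime_hours = TRAIN_RUNTIME_S / 3600

print(f"steps per epoch:  {steps_per_epoch}")
print(f"samples per step: {samples_per_step:.3f}")
print(f"eval perplexity:  {perplexity:.2f}")
print(f"train runtime:    {runtime_hours:.2f} h")

for step, logged in LOGGED_LR.items():
    predicted = cosine_lr(step, TOTAL_STEPS, PEAK_LR)
    rel_err = abs(predicted - logged) / logged
    print(f"step {step}: logged={logged:.4e} cosine={predicted:.4e} rel_err={rel_err:.2%}")
```

The samples-per-step ratio should come out close to the configured `batch_size = 8`, and the small gap between the predicted and logged learning rates indicates that the real schedule differs slightly from this idealized curve.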

## Saving the model

<https://huggingface.co/nymless/gemma-2-2b-lora-finetuned>

In [19]:
trainer.push_to_hub("Train on all data")

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]
training_args.bin: 100%|██████████| 5.50k/5.50k [00:00<00:00, 20.5kB/s]
tokenizer.json:  17%|█▋        | 5.88M/34.4M [00:00<00:02, 13.4MB/s]
adapter_model.safetensors: 100%|██████████| 12.8M/12.8M [00:00<

CommitInfo(commit_url='https://huggingface.co/nymless/gemma-2-2b-lora-finetuned/commit/45f2fa0012e30b64c9531f2f0d773c40801b6505', commit_message='Train on all data', commit_description='', oid='45f2fa0012e30b64c9531f2f0d773c40801b6505', pr_url=None, repo_url=RepoUrl('https://huggingface.co/nymless/gemma-2-2b-lora-finetuned', endpoint='https://huggingface.co', repo_type='model', repo_id='nymless/gemma-2-2b-lora-finetuned'), pr_revision=None, pr_num=None)
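
Only the LoRA adapter is pushed, not the full model, which is why `adapter_model.safetensors` is reported as just 12.8M. The sketch below shows how that size can be estimated from the LoRA rank. It rests on assumptions that are not confirmed in this section: that the adapter targets only the `q_proj` and `v_proj` attention projections, that the weights are stored in 32-bit floats, and that the base model has the published Gemma-2-2b attention dimensions. The helper `lora_params` is introduced here for illustration.

```python
# Gemma-2-2b attention dimensions (assumed from the published model config).
HIDDEN_SIZE = 2304
NUM_LAYERS = 26
HEAD_DIM = 256
NUM_Q_HEADS = 8
NUM_KV_HEADS = 4

RANK = 16            # LoRA rank from the hyperparameter list
BYTES_PER_PARAM = 4  # assumption: adapter weights saved as float32


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters of one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Assumption: adapters are attached to q_proj and v_proj in every layer.
q_params = lora_params(HIDDEN_SIZE, NUM_Q_HEADS * HEAD_DIM, RANK)
v_params = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
total_params = NUM_LAYERS * (q_params + v_params)
size_mb = total_params * BYTES_PER_PARAM / 1e6

print(f"trainable LoRA parameters: {total_params:,}")
print(f"estimated adapter size:    {size_mb:.2f} MB")
```

Under these assumptions the estimate lands close to the reported file size, which suggests the adapter covers a small subset of the attention weights. A different set of target modules or a different storage dtype would change the figure.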

## Loading the model

In [None]:
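from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Base model (source of the tokenizer) and the LoRA adapter repository on the Hub
model_name = "google/gemma-2-2b"
lora_finetuned = "nymless/gemma-2-2b-lora-finetuned"

# `token` is the Hugging Face access token, assumed to be defined earlier in the notebook
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
# Loads the base model referenced by the adapter config and attaches the LoRA weights
model = AutoPeftModelForCausalLM.from_pretrained(
    lora_finetuned,
    token=token,
)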

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## Checking text generation

In [10]:
# Take a tokenized example from the test split
vectorized_text = dataset["test"][0]["input_ids"]
# Decode the token IDs back into text
text = tokenizer.decode(vectorized_text, skip_special_tokens=True)
print(text)

Аddress: Московская область, городской округ Балашиха, деревня Пестово, 1А
Name: Грант
Rating: 1
Keywords: Магазин автозапчастей и автотоваров;Автокосметика, автохимия;Аккумуляторы и зарядные устройства
Review: У сотрудников информация по запчасти разнятся, очень не понравилось отношение сотрудников 
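
The decoded sample suggests that each training example is a plain-text record with `Address`, `Name`, `Rating`, `Keywords`, and `Review` lines, so a generation prompt is presumably the same record cut off at the `Review:` label. The project's real template lives in `scripts/generate` and is not shown here, so the following is only a sketch of that idea. `ReviewFields` and `build_prompt` are hypothetical names, and the field labels are retyped in ASCII; the exact characters and spacing used in training should be taken from the project's preprocessing code.

```python
from dataclasses import dataclass


@dataclass
class ReviewFields:
    """Hypothetical stand-in for the project's `Inputs` class."""

    address: str
    name: str
    rating: str
    rubrics: str


def build_prompt(fields: ReviewFields) -> str:
    """Render the fields in the order shown in the decoded sample.

    The prompt stops at the `Review:` label so that the model continues
    with the review text. The label spelling is an assumption.
    """
    lines = [
        f"Address: {fields.address}",
        f"Name: {fields.name}",
        f"Rating: {fields.rating}",
        f"Keywords: {fields.rubrics}",
        "Review:",
    ]
    return "\n".join(lines)


fields = ReviewFields(
    address="Московская область, городской округ Балашиха, деревня Пестово, 1А",
    name="Грант",
    rating="1",
    rubrics="Магазин автозапчастей и автотоваров;Автокосметика, автохимия",
)
print(build_prompt(fields))
```

Keeping the prompt format identical to the training records matters because the adapter was trained only on text in this layout; a different label or field order may lead to less relevant generations.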



In [5]:
import os
import sys

# Make the parent directory importable so the local packages below resolve
sys.path.append(os.path.abspath(".."))

from validation.Inputs import Inputs
from scripts.generate import generate

inputs = Inputs(
    address="Московская область, городской округ Балашиха, деревня Пестово, 1А",
    name="Грант",
    rating="1",
    rubrics="Магазин автозапчастей и автотоваров;Автокосметика, автохимия;Аккумуляторы и зарядные устройства",
)
review = generate(model, tokenizer, inputs, max_length=512)
print("Review:", review)

Review: 08.03.2023. С 19.12.2022 года заказываю автоаксессуары по каталогу. 02.03.2023. Доставка отменяется на 25.03.2023. До 23.03.2023. Доставки не было. Заказ отменяется. Сроки доставки отменяются без объяснения причин.  
