# Fine-Tuning LLMs for Instruction Generation Tasks

This project focuses on fine-tuning the large language model 'EleutherAI/pythia-410m' to enhance its ability to generate accurate and relevant responses to instruction-based prompts. By leveraging instruction-tuning techniques, we aim to:

- Reduce hallucinations and unwanted outputs
- Improve consistency and reliability in generated answers
- Enhance data privacy for company-specific use cases
- Lower operational costs by optimizing model performance

Fine-tuning also enables the model to better align with domain-specific requirements and organizational standards.

**Key Libraries Used:**
- PyTorch: For efficient deep learning model training and optimization
- Transformers: For state-of-the-art NLP model architectures and utilities
- LLama Library (Lamini): For streamlined instruction-tuning workflows

This notebook provides a step-by-step guide to the fine-tuning process, including data preparation, training, and evaluation.

2025 Copyright Ludy Hasby Aulia - ML Engineer Candidates

# Instruction Tuning with Pythia

Ludy Hasby Aulia

[[Project Page](https://huggingface.co/ludyhasby/lamini_docs_instruct)] [[Notebook](waiting)] 
<p align="center">
    <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*SwMMluhfo_YW1-9Mwpb8kg.png" width="80%"> <br>
    Instruction Tuning. Image taken from <a href="https://medium.com/@lmpo/an-overview-instruction-tuning-for-llms-440228e7edab">LM PRO</a>
</p>




This repo contains: 
- Fine Tune Model Tokenization
- Fine Tune Model Trainer 
- Lamini Docs Dataset
- Notebook Model Development 
- Inference App with HuggingFace 

**Usage and License Notices**:  The dataset is CC BY [Lamini](https://huggingface.co/datasets/lamini/lamini_docs)

- [Overview](#overview)
- [LLM Selected](#base-large-language-model)
- [Dataset Design and Preparation](#data-design-preparation)
- [Fine Tuning Strategy](#fine-tune-strategy)
- [Evaluation and Benchmarking](#evaluation-benchmarking)
- [Practical Implementation](#practical-implementation)

## Overview
Large Language Models (LLMs) have shown impressive generalization capabilities such as in-context learning and chain-of-thought reasoning. To enable LLMs to follow natural language instructions and complete real-world tasks, we explore instruction tuning of LLMs.
This project demonstrates the process of instruction-tuning a large language model (LLM), specifically EleutherAI/pythia-410m, to improve its ability to follow natural language instructions and generate high-quality, relevant responses. By leveraging the [lamini_docs](https://huggingface.co/datasets/lamini/lamini_docs) dataset, we fine-tune the base model to better align with real-world instruction-following tasks, reduce hallucinations, and enhance reliability.

## Base Large Language Model
For this project, **EleutherAI/pythia-410m** was chosen for the following reasons:
- **Accessibility & Licensing:** Pythia is fully open-source and available on Hugging Face, making it easy to use, modify, and deploy without restrictive licenses.
- **Architecture:** It is based on the transformer architecture, which is well-suited for understanding and generating coherent, context-aware text.
- **Community Support:** Pythia has strong community backing, with pre-trained weights, documentation, and integration with popular libraries like `transformers`.
- **Performance:** While smaller than some models, Pythia-410m offers a good balance between computational efficiency and output quality, making it suitable for experimentation and prototyping.
- **Instruction-Tuning Compatibility:** The model can be fine-tuned on instruction datasets (such as lamini_docs) to improve its ability to follow prompts and generate relevant, structured responses.

Other models like LLaMA, Mistral, or DeepSeek may offer higher performance or larger parameter sizes, but Pythia is a practical choice for projects focused on open-source, reproducibility, and ease of deployment.

Here is the `EleutherAI/pythia-410m` architecture:
```
GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=1024, out_features=50304, bias=False)
)
```

## Data Design Preparation
### Dataset Information
* [`lamini_docs.jsonl`](https://huggingface.co/datasets/lamini/lamini_docs) contains 1260 instruction-following examples, each pairing a question about Lamini with a preferred response.
Each record in this JSON Lines file has the following format:

    - `question`: `str`, A natural language instruction or prompt describing the task. 
    - `answer`: `str`, The preferred answer to the instruction, generated by Lamini.
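
A minimal sketch of reading records in this format, assuming one JSON object per line with `question` and `answer` keys. The `parse_jsonl` helper and the sample text are illustrative and are not part of the project's code.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative sample shaped like lamini_docs.jsonl (placeholder content).
sample = "\n".join([
    json.dumps({"question": "Why do we dream?",
                "answer": "Let's keep the discussion relevant to Lamini."}),
    json.dumps({"question": "What is Lamini?",
                "answer": "(placeholder answer)"}),
])

records = parse_jsonl(sample)
for record in records:
    # Every record is expected to carry exactly these two string fields.
    assert set(record) == {"question", "answer"}
print(len(records), "records loaded")
```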

**Data Testing Example**

`Question input (test)`: Can Lamini generate technical documentation or user manuals for software projects?

`Preferred answer from Lamini docs`: Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.

### Data Preprocessing
Data is first loaded and then processed using the base model's tokenizer. The preprocessing steps include:

- **Tokenization:** Each question and answer is converted into tokens using the tokenizer from the pretrained model.
- **Padding and Truncation:**  
  - Questions are padded or truncated to a fixed length of 1000 tokens.
  - Answers are padded or truncated to a fixed length of 100 tokens.
  This ensures all inputs and outputs have consistent shapes for efficient training.
- **Train-Test Split:**  
  After preprocessing, the dataset is split into training and testing sets to evaluate model performance.
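
The padding and truncation step can be illustrated on plain lists of token ids. This is a sketch only: it assumes right-side padding and uses pad id 0 (the notebook sets the pad token to the EOS token, whose id is reported as 0), and the helper names are hypothetical rather than taken from `utilities`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return token_ids cut or right-padded to exactly max_length items."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def build_attention_mask(token_ids, max_length):
    """1 marks a real token, 0 marks padding."""
    real = min(len(token_ids), max_length)
    return [1] * real + [0] * (max_length - real)


short = [5, 6, 7]
long = [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short, 5))       # [5, 6, 7, 0, 0]
print(build_attention_mask(short, 5))  # [1, 1, 1, 0, 0]
print(pad_or_truncate(long, 4))        # [1, 2, 3, 4]
```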

This workflow prepares the data for fine-tuning and ensures compatibility with the base model. We then build an inference pipeline that maps each input prompt to an output, using the following steps and function:

1. **Tokenize the Prompt**: using the model's tokenizer.
2. **Padding and Truncating**: Since models expect inputs of fixed length, tokenized sequences are padded (adding special tokens to reach the required length) or truncated (cutting off tokens that exceed the maximum length). This ensures uniform input size for efficient batch processing.
3. **Generate Model Response**
4. **Decode the Generated Tokens**: The output tokens produced by the model are converted back into human-readable text.
5. **Strip the Prompt**  
   The decoded output often contains the original prompt followed by the model’s response. To isolate the model’s answer, the prompt portion is removed, leaving only the generated response for evaluation or further processing.

```
def inference(prompt, model, tokenizer, max_input_token=1000, max_output_token=100):
    """
    Function to generate model response from prompt
    """
    # Generate Tokenization from prompt
    inputs = tokenizer.encode(
        prompt, 
        return_tensors="pt",
        truncation=True, 
        max_length=max_input_token
    )
    # Generate Response
    device = model.device
    generate_token = model.generate(
        inputs.to(device), 
        max_new_tokens=max_output_token
    )
    # Decode the result from tokenization
    response = tokenizer.batch_decode(generate_token, 
                                      skip_special_tokens=True)    
    # Strip the prompt
    response = response[0][len(prompt):]
    return response
```
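
Slicing by `len(prompt)` relies on the decoded text starting with the exact prompt. That no longer holds when the prompt has been truncated to `max_input_token` tokens. A hedged alternative is to strip the prompt only when it is actually present; `strip_prompt` below is a hypothetical helper, not part of the project's code.

```python
def strip_prompt(decoded, prompt):
    """Remove the leading prompt from decoded text when it is present."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):]
    # Keep the full text instead of cutting at an arbitrary offset.
    return decoded


decoded = "Can Lamini do X?\n\nA: Yes."
print(repr(strip_prompt(decoded, "Can Lamini do X?")))  # '\n\nA: Yes.'
print(repr(strip_prompt("Unrelated text", "Can Lamini do X?")))  # 'Unrelated text'
```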

### Handle Irrelevant Information
To handle questions that are outside the scope of Lamini Docs, the dataset includes examples specifically designed to teach the model to respond appropriately. For instance:

- `Question:`
  *Why do we shiver when we're cold?*

- `Answer:`
  *Let’s keep the discussion relevant to Lamini.*

- `Question:`
  *Why do we dream?*

- `Answer:`
  *Let’s keep the discussion relevant to Lamini.*

This approach helps the model avoid answering unrelated questions and maintain focus on Lamini-related topics.
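
As a rough illustration, such refusal examples can be identified by their fixed answer string. The sketch assumes every out-of-scope example uses exactly the sentence shown above (up to apostrophe style); `is_out_of_scope` and the sample records are hypothetical.

```python
REFUSAL = "Let's keep the discussion relevant to Lamini."


def normalize(text):
    """Trim whitespace and map the curly apostrophe to a straight one."""
    return text.strip().replace("\u2019", "'")


def is_out_of_scope(record):
    """True when the preferred answer is the fixed refusal sentence."""
    return normalize(record["answer"]) == REFUSAL


records = [
    {"question": "Why do we shiver when we're cold?",
     "answer": "Let\u2019s keep the discussion relevant to Lamini."},
    {"question": "Why do we dream?",
     "answer": "Let's keep the discussion relevant to Lamini."},
    {"question": "Can Lamini generate technical documentation or user manuals for software projects?",
     "answer": "Yes, Lamini can generate technical documentation and user manuals for software projects."},
]

out_of_scope = [r["question"] for r in records if is_out_of_scope(r)]
print(len(out_of_scope), "of", len(records), "examples are out-of-scope refusals")
```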

## Fine Tune Strategy
### Key Hyperparameters to Tune
- `learning_rate=1e-6`: a low learning rate, chosen to reduce the risk of overfitting
- `max_steps=100`: training is capped at 100 steps to limit the cost of computation
- `per_device_train_batch_size=1`: batch size per device during training, kept at 1 because no GPU is used
- `warmup_steps=1`: learning-rate warmup steps, for a stable start
- `per_device_eval_batch_size=1`: kept at 1 because no GPU is used
- `optim="adamw_torch"`: the AdamW optimizer, a widely used choice
- `gradient_accumulation_steps=4`: accumulates gradients over 4 batches, which helps when GPU memory is minimal
- `gradient_checkpointing=False`: gradient checkpointing is disabled
- `load_best_model_at_end=True`: the best model is loaded at the end of training
- `metric_for_best_model="eval_loss"`: the best model is selected by evaluation loss
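
A short worked example of what these settings imply for the amount of data seen. It assumes a single training process and the 1260 rows reported for the training split.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1    # assumption: a single CPU process
max_steps = 100
train_rows = 1260  # rows reported for the training split

# Examples contributing to one optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples seen over the whole run.
examples_seen = effective_batch * max_steps
# Optimizer steps needed for one full pass over the training data.
steps_per_epoch = train_rows / effective_batch
epochs_covered = examples_seen / train_rows

print(effective_batch)           # 4
print(examples_seen)             # 400
print(steps_per_epoch)           # 315.0
print(round(epochs_covered, 2))  # 0.32
```

Under these assumptions, 100 steps cover roughly a third of one epoch.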
### Training Result
The training [logs](/src/logs.txt) are available for review, and the full run can be found in the [Notebook](/src/instructionTuning.ipynb).
### Potential Challenges
- Computational resources: RAM and GPU are limited. Solution: choose a small base LLM (400M to 1B parameters).
- Repeated answers. Solution: truncate the response.
- Bahasa Indonesia context. Solution: translate the input before preprocessing, and translate the response back before sending it.

## Evaluation Benchmarking
To assess the effectiveness of instruction tuning, we compare the responses generated by the baseline (pretrained) model and the fine-tuned model on both training and testing datasets. This benchmarking process highlights improvements in the model's ability to follow instructions and generate relevant answers.

**Evaluation Steps:**
1. **Select Sample Questions:**  
   Use representative questions from both the training and testing sets.
2. **Generate Responses:**  
   Obtain answers from the baseline model and the fine-tuned model for each question.
3. **Compare Outputs:**  
   Evaluate the quality, relevance, and alignment of the generated responses against the preferred answers from the Lamini Docs dataset.
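
The source does not name a scoring metric for the comparison step. As one simple option, a word-overlap F1 score between a generated response and the preferred answer could be used; `token_f1` below is a hypothetical helper and only a rough proxy for quality.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Word-overlap F1 between a generated response and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Yes, Lamini can generate technical documentation and user manuals for software projects."
base_output = "I think you are looking for the Lamini documentation."
print(token_f1(base_output, reference))
print(token_f1(reference, reference))  # 1.0
```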

<p align="center">
    <img src="https://github.com/user-attachments/assets/54dcd733-1eb7-4f54-a2d4-c4917e9e749f" width="100%">
</p>

## Implementation for Basic Fine Tuning Pipeline
The fine-tuning process can be sketched as follows.
#### Fine-Tuning Pipeline
A simplified implementation flow:
- Load a pre-trained open-source LLM 'EleutherAI/pythia-410m'
- Load instruction dataset `lamini_docs.jsonl`
- Tokenize and preprocess the data `base_model_tokenizer`
- Training Config `TrainingArguments(...)`
- Run the training using Hugging Face’s `Trainer(...)` or similar API
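
The flow above might be sketched as follows with the stock Hugging Face APIs. This is not the notebook's implementation: the notebook uses helpers from `utilities`, whereas `build_text` and `build_trainer` here are hypothetical. The sketch assumes the dataset exposes `train` and `test` splits, and that question and answer are simply concatenated.

```python
def build_text(question, answer):
    """Join a question and its preferred answer into one training string."""
    return question + answer


def build_trainer(model_name="EleutherAI/pythia-410m",
                  dataset_name="lamini/lamini_docs",
                  output_dir="lamini_docs_100_steps",
                  max_length=2048):
    # Imported lazily so build_text can be used without these packages installed.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # 1. Load the pre-trained model and its tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # 2. Load the instruction dataset (assumed to have train/test splits).
    raw = load_dataset(dataset_name)

    # 3. Tokenize the concatenated question + answer text.
    def tokenize(batch):
        texts = [build_text(q, a) for q, a in zip(batch["question"], batch["answer"])]
        return tokenizer(texts, truncation=True, max_length=max_length)

    tokenized = raw.map(tokenize, batched=True,
                        remove_columns=raw["train"].column_names)

    # 4. Training configuration, using values from the Fine Tune Strategy section.
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=1e-6,
        max_steps=100,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=1,
        optim="adamw_torch",
    )

    # 5. The collator pads each batch and derives labels from input_ids.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    return Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=collator,
    )
```

Calling `build_trainer().train()` would download the model and dataset and start training, so it is left out of the block.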

#### Workflow to generate procedural instruction from fine tuning model

## Acknowledgement
This project benefits from [Lamini](https://huggingface.co/datasets/lamini/lamini_docs) and [EleutherAI/pythia](https://huggingface.co/EleutherAI/pythia-410m).

## Library Load

In [1]:
import pandas as pd
from utilities import *
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    TrainingArguments
)
import torch

## Static

In [2]:
SELECTED_LLM = 'EleutherAI/pythia-410m'
DATA_DIR = "lamini/lamini_docs"
OUTPUT_DIR = 'output/pythia-410m-instruction-tuning'
USE_HF = True # Load the dataset from the Hugging Face Hub

## Data Load

We need to load a dataset suited to the task (instruction generation). Here we use [lamini_docs](https://huggingface.co/datasets/lamini/lamini_docs), which provides 1260 examples to fine-tune the SELECTED_LLM.

In [3]:
# config model 
training_config = {
    "model": {
        "pretrained_name": SELECTED_LLM,
        "max_length" : 2048
    },
    "datasets": {
        "use_hf": USE_HF,
        "path": DATA_DIR
    },
    "verbose": True
}

Next, we tokenize the dataset with the tokenizer from the pretrained model, then split it into training and testing sets.

In [4]:
tokenizer = AutoTokenizer.from_pretrained(SELECTED_LLM)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)

print(train_dataset)
print(test_dataset)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-08-07 07:46:39,661 - DEBUG - utilities - Config: datasets.path: lamini/lamini_docs
datasets.use_hf: true
model.max_length: 2048
model.pretrained_name: EleutherAI/pythia-410m
verbose: true



tokenize True lamini/lamini_docs


2025-08-07 07:46:47,441 - DEBUG - fsspec.local - open file: C:/Users/Pongo/.cache/huggingface/datasets/lamini___lamini_docs/default-9b991800e664930e/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02/dataset_info.json
2025-08-07 07:46:47,515 - DEBUG - fsspec.local - open file: C:/Users/Pongo/.cache/huggingface/datasets/lamini___lamini_docs/default-9b991800e664930e/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02/dataset_info.json


Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1260
})
Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 140
})


## Base Model

Here, we load the pretrained model and look at the output it produces.

In [81]:
# load base model
base_model = AutoModelForCausalLM.from_pretrained(SELECTED_LLM)



Check the device: if a GPU is available, use it instead of the CPU.

In [82]:
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("GPU available, using GPU")
    device = torch.device("cuda")
else:
    logger.debug("Using CPU")
    device = torch.device("cpu")

2025-08-07 11:15:59,644 - DEBUG - utilities - Using CPU


In [83]:
# send pretrained model to device
base_model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
  

To run inference with the model, we need to:
- Tokenize the prompt
- Pad and truncate
- Generate the model response
- Decode the generated tokens
- Strip the prompt

In [84]:
def inference(prompt, model, tokenizer, max_input_token=1000, max_output_token=100):
    """
    Function to generate model response from prompt
    """
    # Generate Tokenization from prompt
    inputs = tokenizer.encode(
        prompt, 
        return_tensors="pt",
        truncation=True, 
        max_length=max_input_token
    )
    # Generate Response
    device = model.device
    generate_token = model.generate(
        inputs.to(device), 
        max_new_tokens=max_output_token
    )
    # Decode the result from tokenization
    response = tokenizer.batch_decode(generate_token, 
                                      skip_special_tokens=True)    
    # Strip the prompt
    response = response[0][len(prompt):]
    return response

Now we can test the base model output

In [85]:
test_dataset[0]['question']

'Can Lamini generate technical documentation or user manuals for software projects?'

In [86]:
test_text = test_dataset[0]['question']
print("Question input (test):", test_text)
print(f"Prefer answer from Lamini docs: {test_dataset[0]['answer']}")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test): Can Lamini generate technical documentation or user manuals for software projects?
Prefer answer from Lamini docs: Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.
Model's answer: 


A:

I think you are looking for the Lamini documentation.

A:

I think you are looking for the Lamini documentation.

I think you are looking for the Lamini documentation.

I think you are looking for the Lamini documentation.

I think you are looking for the Lamini documentation.

I think you are looking for the Lamini documentation.

I think you are looking for


### Model Inference Example

Let's evaluate the base model's ability to answer an instruction-based prompt before fine-tuning. We'll use a sample question from the test dataset and compare the model's output to the preferred answer from Lamini docs.

**Prompt Example:**
> Can Lamini generate technical documentation or user manuals for software projects?

**Preferred Answer (from Lamini docs):**
> Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.

**Base Model Output:**
The following cell demonstrates how the base model responds to the prompt. Notice that the output may be repetitive or not fully aligned with the preferred answer. This highlights the need for instruction-tuning to improve the model's performance on such tasks.

In [None]:
dict_eval = {
    'question': [],
    'base_model': [],
    'fine_tune': [],
    'answer': []
}

In [87]:
n = 10
for i in range(n):
    # Base-model response for a training question
    question = train_dataset[i]['question']
    base_model_resp = inference(question, base_model, tokenizer)
    dict_eval['base_model'].append(base_model_resp)

    # Also collect responses for the first 5 test questions
    if i < 5:
        question = test_dataset[i]['question']
        base_model_resp = inference(question, base_model, tokenizer)
        dict_eval['base_model'].append(base_model_resp)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



## Fine Tune LLM 

Let's define the constants first.

In [31]:
MAX_STEPS = 100
TRAINED_MODEL_NAME = f"lamini_docs_{MAX_STEPS}_steps"
OUTPUT_DIR = TRAINED_MODEL_NAME

Then, define the arguments for training:
- learning_rate: The step size used for updating model weights during training.
- num_train_epochs: Number of times the entire training dataset is passed through the model.
- max_steps: Total number of training steps (batches) to run. If set, it overrides num_train_epochs.
- per_device_train_batch_size: Number of samples per batch for each device (CPU/GPU) during training.
- output_dir: Directory where the trained model and checkpoints will be saved.
- overwrite_output_dir: If True, existing files in the output directory will be overwritten.
- disable_tqdm: If True, disables the tqdm progress bar during training.
- eval_steps: Number of steps between each evaluation on the validation set.
- save_steps: Number of steps between saving model checkpoints.
- warmup_steps: Number of steps for the learning rate warmup at the start of training.
- per_device_eval_batch_size: Number of samples per batch for each device during evaluation.
- evaluation_strategy: Defines when to run evaluation (e.g., 'steps', 'epoch').
- logging_strategy: Defines when to log training metrics (e.g., 'steps', 'epoch').
- logging_steps: Number of steps between logging training metrics.
- optim: Optimizer to use for training (e.g., 'adamw_torch', 'adafactor').
- gradient_accumulation_steps: Number of steps to accumulate gradients before updating model weights.
- gradient_checkpointing: If True, enables gradient checkpointing to save memory at the cost of slower training.
- load_best_model_at_end: If True, loads the best model (based on eval metric) at the end of training.
- save_total_limit: Maximum number of checkpoints to save. Older checkpoints are deleted.
- metric_for_best_model: Metric to use for selecting the best model (e.g., 'eval_loss').
- greater_is_better: If True, higher metric values are considered better; otherwise, lower values are better.
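
How `learning_rate`, `warmup_steps`, and `max_steps` interact can be sketched with a linear warmup and decay schedule. This assumes the run uses the Transformers default linear scheduler, since no scheduler is set explicitly, and that the logged step number equals the scheduler step. Under those assumptions the formula yields about 9.899e-07 at step 2, which agrees with the training logs.

```python
def linear_schedule_lr(step, base_lr=1e-6, warmup_steps=1, total_steps=100):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (1, 2, 3, 18, 100):
    print(step, linear_schedule_lr(step))
```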

In [32]:
training_args = TrainingArguments(
    learning_rate=1e-6, # learning rate 
    max_steps=MAX_STEPS, # number of steps (each step is a batch of data)
    per_device_train_batch_size=1, # batch size per device during training
    output_dir=OUTPUT_DIR, # output directory for the model

    # Adding more arguments
    overwrite_output_dir=False, # keep existing contents of the output directory
    disable_tqdm=False, # keep the tqdm progress bar enabled
    eval_steps=120, # evaluation steps
    save_steps=120, # save steps
    warmup_steps=1, # warmup steps
    per_device_eval_batch_size=1,
    evaluation_strategy="steps",
    logging_strategy="steps",
    logging_steps=1,
    optim="adamw_torch", # optimizer
    gradient_accumulation_steps = 4,
    gradient_checkpointing=False,

    # Parameters for early stopping
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)
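
With step-based strategies, evaluation and checkpointing are normally triggered at multiples of `eval_steps` and `save_steps`. A quick check, assuming that behaviour also holds for the `Trainer` used here, suggests that an interval of 120 is never reached in a 100-step run. In that case `load_best_model_at_end` may have no intermediate `eval_loss` to compare. The interval of 50 is only an illustration.

```python
max_steps = 100
eval_steps = 120
save_steps = 120


def trigger_steps(interval, total):
    """Steps in 1..total that are multiples of interval."""
    return [step for step in range(1, total + 1) if step % interval == 0]


print(trigger_steps(eval_steps, max_steps))  # []
print(trigger_steps(save_steps, max_steps))  # []
print(trigger_steps(50, max_steps))          # [50, 100]
```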

Before training begins, we estimate the number of floating-point operations the model performs in one training step.
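
The estimate can be reproduced by hand from the printed architecture. The sketch assumes that `floating_point_ops` follows the common approximation of 6 × parameters × tokens and excludes only the input embedding. The layer sizes come from the model printout, and the rotary embedding is assumed to hold no trainable parameters.

```python
hidden, ffn, layers, vocab = 1024, 4096, 24, 50304


def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)


layer_norm = 2 * hidden  # scale and shift vectors
per_layer = (
    2 * layer_norm                      # input and post-attention LayerNorm
    + linear_params(hidden, 3 * hidden) # query_key_value
    + linear_params(hidden, hidden)     # attention dense
    + linear_params(hidden, ffn)        # dense_h_to_4h
    + linear_params(ffn, hidden)        # dense_4h_to_h
)
# All layers, the final LayerNorm, and embed_out; embed_in is left out.
non_embedding = layers * per_layer + layer_norm + linear_params(hidden, vocab, bias=False)

tokens = 2048  # model.max_length in the training config
accumulation = 4
flops_per_step = 6 * non_embedding * tokens * accumulation

print(non_embedding)  # 353822720
print(flops_per_step / 1e9, "GFLOPs")
```

Dividing this figure by the roughly 5.9 seconds per step seen in the training logs gives about 2.9e12 operations per second, in line with the logged `flops` values.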

In [33]:
model_flops = (
  base_model.floating_point_ops(
    {
       "input_ids": torch.zeros(
           (1, training_config["model"]["max_length"])
      )
    }
  )
  * training_args.gradient_accumulation_steps
)

print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
  

Then we can set up the trainer.

In [34]:
trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=MAX_STEPS,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
)
trainer.do_grad_scaling = False

In [35]:
training_output = trainer.train()

  0%|          | 0/100 [00:00<?, ?it/s]

2025-08-07 08:05:45,049 - DEBUG - utilities - Step (1) Logs: {'loss': 2.7155, 'learning_rate': 1e-06, 'epoch': 0.0, 'iter_time': 0.0, 'flops': 0.0, 'remaining_time': 0.0}


{'loss': 2.7155, 'learning_rate': 1e-06, 'epoch': 0.0, 'iter_time': 0.0, 'flops': 0.0, 'remaining_time': 0.0}


2025-08-07 08:05:50,963 - DEBUG - utilities - Step (2) Logs: {'loss': 2.7145, 'learning_rate': 9.898989898989898e-07, 'epoch': 0.01, 'iter_time': 5.913750410079956, 'flops': 2940789368417.878, 'remaining_time': 579.5475401878357}


{'loss': 2.7145, 'learning_rate': 9.898989898989898e-07, 'epoch': 0.01, 'iter_time': 5.913750410079956, 'flops': 2940789368417.878, 'remaining_time': 579.5475401878357}


2025-08-07 08:05:56,516 - DEBUG - utilities - Step (3) Logs: {'loss': 3.0597, 'learning_rate': 9.797979797979797e-07, 'epoch': 0.01, 'iter_time': 5.732835054397583, 'flops': 3033594053975.0083, 'remaining_time': 556.0850002765656}


{'loss': 3.0597, 'learning_rate': 9.797979797979797e-07, 'epoch': 0.01, 'iter_time': 5.732835054397583, 'flops': 3033594053975.0083, 'remaining_time': 556.0850002765656}


2025-08-07 08:06:02,468 - DEBUG - utilities - Step (4) Logs: {'loss': 3.1262, 'learning_rate': 9.696969696969698e-07, 'epoch': 0.01, 'iter_time': 5.806412220001221, 'flops': 2995153233098.6216, 'remaining_time': 557.4155731201172}


{'loss': 3.1262, 'learning_rate': 9.696969696969698e-07, 'epoch': 0.01, 'iter_time': 5.806412220001221, 'flops': 2995153233098.6216, 'remaining_time': 557.4155731201172}


2025-08-07 08:06:08,333 - DEBUG - utilities - Step (5) Logs: {'loss': 3.2388, 'learning_rate': 9.595959595959596e-07, 'epoch': 0.02, 'iter_time': 5.820704936981201, 'flops': 2987798646680.6826, 'remaining_time': 552.9669690132141}


{'loss': 3.2388, 'learning_rate': 9.595959595959596e-07, 'epoch': 0.02, 'iter_time': 5.820704936981201, 'flops': 2987798646680.6826, 'remaining_time': 552.9669690132141}


2025-08-07 08:06:14,085 - DEBUG - utilities - Step (6) Logs: {'loss': 2.4197, 'learning_rate': 9.494949494949495e-07, 'epoch': 0.02, 'iter_time': 5.807143974304199, 'flops': 2994775815855.981, 'remaining_time': 545.8715335845947}


{'loss': 2.4197, 'learning_rate': 9.494949494949495e-07, 'epoch': 0.02, 'iter_time': 5.807143974304199, 'flops': 2994775815855.981, 'remaining_time': 545.8715335845947}


2025-08-07 08:06:19,801 - DEBUG - utilities - Step (7) Logs: {'loss': 2.2824, 'learning_rate': 9.393939393939395e-07, 'epoch': 0.02, 'iter_time': 5.7920394738515215, 'flops': 3002585602524.4727, 'remaining_time': 538.6596710681915}


{'loss': 2.2824, 'learning_rate': 9.393939393939395e-07, 'epoch': 0.02, 'iter_time': 5.7920394738515215, 'flops': 3002585602524.4727, 'remaining_time': 538.6596710681915}


2025-08-07 08:06:25,343 - DEBUG - utilities - Step (8) Logs: {'loss': 2.7183, 'learning_rate': 9.292929292929292e-07, 'epoch': 0.03, 'iter_time': 5.756288834980556, 'flops': 3021233790034.225, 'remaining_time': 529.5785728182112}


{'loss': 2.7183, 'learning_rate': 9.292929292929292e-07, 'epoch': 0.03, 'iter_time': 5.756288834980556, 'flops': 3021233790034.225, 'remaining_time': 529.5785728182112}


2025-08-07 08:06:30,531 - DEBUG - utilities - Step (9) Logs: {'loss': 2.2314, 'learning_rate': 9.191919191919192e-07, 'epoch': 0.03, 'iter_time': 5.685171395540237, 'flops': 3059027269975.103, 'remaining_time': 517.3505969941616}


{'loss': 2.2314, 'learning_rate': 9.191919191919192e-07, 'epoch': 0.03, 'iter_time': 5.685171395540237, 'flops': 3059027269975.103, 'remaining_time': 517.3505969941616}


2025-08-07 08:06:36,612 - DEBUG - utilities - Step (10) Logs: {'loss': 2.2184, 'learning_rate': 9.09090909090909e-07, 'epoch': 0.03, 'iter_time': 5.7292519675360785, 'flops': 3035491270410.8584, 'remaining_time': 515.6326770782471}


{'loss': 2.2184, 'learning_rate': 9.09090909090909e-07, 'epoch': 0.03, 'iter_time': 5.7292519675360785, 'flops': 3035491270410.8584, 'remaining_time': 515.6326770782471}


2025-08-07 08:06:42,637 - DEBUG - utilities - Step (11) Logs: {'loss': 2.4358, 'learning_rate': 8.98989898989899e-07, 'epoch': 0.03, 'iter_time': 5.758769845962524, 'flops': 3019932172776.952, 'remaining_time': 512.5305162906647}


{'loss': 2.4358, 'learning_rate': 8.98989898989899e-07, 'epoch': 0.03, 'iter_time': 5.758769845962524, 'flops': 3019932172776.952, 'remaining_time': 512.5305162906647}


2025-08-07 08:06:46,972 - DEBUG - utilities - Step (12) Logs: {'loss': 2.9276, 'learning_rate': 8.888888888888888e-07, 'epoch': 0.04, 'iter_time': 5.629326040094549, 'flops': 3089374147024.517, 'remaining_time': 495.3806915283203}


{'loss': 2.9276, 'learning_rate': 8.888888888888888e-07, 'epoch': 0.04, 'iter_time': 5.629326040094549, 'flops': 3089374147024.517, 'remaining_time': 495.3806915283203}


2025-08-07 08:06:51,776 - DEBUG - utilities - Step (13) Logs: {'loss': 2.4709, 'learning_rate': 8.787878787878787e-07, 'epoch': 0.04, 'iter_time': 5.560572485129039, 'flops': 3127572633923.937, 'remaining_time': 483.76980620622635}


2025-08-07 08:06:57,485 - DEBUG - utilities - Step (14) Logs: {'loss': 2.0219, 'learning_rate': 8.686868686868687e-07, 'epoch': 0.04, 'iter_time': 5.571981521753164, 'flops': 3121168701932.105, 'remaining_time': 479.19041087077215}


2025-08-07 08:07:02,548 - DEBUG - utilities - Step (15) Logs: {'loss': 2.7367, 'learning_rate': 8.585858585858586e-07, 'epoch': 0.05, 'iter_time': 5.535659517560687, 'flops': 3141648123095.4507, 'remaining_time': 470.53105899265836}


2025-08-07 08:07:08,591 - DEBUG - utilities - Step (16) Logs: {'loss': 2.3912, 'learning_rate': 8.484848484848484e-07, 'epoch': 0.05, 'iter_time': 5.569475507736206, 'flops': 3122573087767.9834, 'remaining_time': 467.8359426498413}


2025-08-07 08:07:13,177 - DEBUG - utilities - Step (17) Logs: {'loss': 2.29, 'learning_rate': 8.383838383838383e-07, 'epoch': 0.05, 'iter_time': 5.507981702685356, 'flops': 3157435022879.8257, 'remaining_time': 457.16248132288456}


2025-08-07 08:07:18,037 - DEBUG - utilities - Step (18) Logs: {'loss': 2.8927, 'learning_rate': 8.282828282828283e-07, 'epoch': 0.06, 'iter_time': 5.469868856317857, 'flops': 3179435337531.499, 'remaining_time': 448.52924621806426}


2025-08-07 08:07:23,051 - DEBUG - utilities - Step (19) Logs: {'loss': 2.406, 'learning_rate': 8.181818181818182e-07, 'epoch': 0.06, 'iter_time': 5.444524115986294, 'flops': 3194235889666.8315, 'remaining_time': 441.0064533948898}


2025-08-07 08:07:28,591 - DEBUG - utilities - Step (20) Logs: {'loss': 2.6351, 'learning_rate': 8.08080808080808e-07, 'epoch': 0.06, 'iter_time': 5.449576227288497, 'flops': 3191274625420.3975, 'remaining_time': 435.96609818307974}


2025-08-07 08:07:33,441 - DEBUG - utilities - Step (21) Logs: {'loss': 2.5658, 'learning_rate': 7.97979797979798e-07, 'epoch': 0.07, 'iter_time': 5.419606673717499, 'flops': 3208921861023.327, 'remaining_time': 428.1489272236824}


2025-08-07 08:07:38,685 - DEBUG - utilities - Step (22) Logs: {'loss': 2.0049, 'learning_rate': 7.878787878787878e-07, 'epoch': 0.07, 'iter_time': 5.411178486687796, 'flops': 3213919920812.8794, 'remaining_time': 422.0719219616481}


2025-08-07 08:07:44,340 - DEBUG - utilities - Step (23) Logs: {'loss': 2.2745, 'learning_rate': 7.777777777777778e-07, 'epoch': 0.07, 'iter_time': 5.422332655299794, 'flops': 3207308632465.0195, 'remaining_time': 417.5196144580841}


2025-08-07 08:07:50,357 - DEBUG - utilities - Step (24) Logs: {'loss': 2.1388, 'learning_rate': 7.676767676767675e-07, 'epoch': 0.08, 'iter_time': 5.4481699259384815, 'flops': 3192098368782.8486, 'remaining_time': 414.0609143713246}


2025-08-07 08:07:54,817 - DEBUG - utilities - Step (25) Logs: {'loss': 2.9122, 'learning_rate': 7.575757575757575e-07, 'epoch': 0.08, 'iter_time': 5.407009959220886, 'flops': 3216397688297.571, 'remaining_time': 405.52574694156647}


2025-08-07 08:07:59,869 - DEBUG - utilities - Step (26) Logs: {'loss': 3.0057, 'learning_rate': 7.474747474747475e-07, 'epoch': 0.08, 'iter_time': 5.392776327133179, 'flops': 3224887011526.6167, 'remaining_time': 399.06544820785524}


2025-08-07 08:08:04,741 - DEBUG - utilities - Step (27) Logs: {'loss': 2.0939, 'learning_rate': 7.373737373737373e-07, 'epoch': 0.09, 'iter_time': 5.372753207500164, 'flops': 3236905486215.648, 'remaining_time': 392.21098414751197}


2025-08-07 08:08:09,472 - DEBUG - utilities - Step (28) Logs: {'loss': 2.5771, 'learning_rate': 7.272727272727272e-07, 'epoch': 0.09, 'iter_time': 5.349000295003255, 'flops': 3251279374518.976, 'remaining_time': 385.1280212402344}


2025-08-07 08:08:15,232 - DEBUG - utilities - Step (29) Logs: {'loss': 2.1887, 'learning_rate': 7.171717171717171e-07, 'epoch': 0.09, 'iter_time': 5.363681009837559, 'flops': 3242380428952.2236, 'remaining_time': 380.8213516984667}


2025-08-07 08:08:19,908 - DEBUG - utilities - Step (30) Logs: {'loss': 1.6824, 'learning_rate': 7.07070707070707e-07, 'epoch': 0.1, 'iter_time': 5.339946319316995, 'flops': 3256791977576.3594, 'remaining_time': 373.79624235218967}


2025-08-07 08:08:25,377 - DEBUG - utilities - Step (31) Logs: {'loss': 2.3371, 'learning_rate': 6.96969696969697e-07, 'epoch': 0.1, 'iter_time': 5.344249669710795, 'flops': 3254169510830.7173, 'remaining_time': 368.7532272100448}


2025-08-07 08:08:30,942 - DEBUG - utilities - Step (32) Logs: {'loss': 1.9923, 'learning_rate': 6.868686868686868e-07, 'epoch': 0.1, 'iter_time': 5.351369765497023, 'flops': 3249839778512.2505, 'remaining_time': 363.8931440537976}


2025-08-07 08:08:35,926 - DEBUG - utilities - Step (33) Logs: {'loss': 2.3838, 'learning_rate': 6.767676767676767e-07, 'epoch': 0.1, 'iter_time': 5.339905217289925, 'flops': 3256817045577.865, 'remaining_time': 357.77364955842495}


2025-08-07 08:08:40,710 - DEBUG - utilities - Step (34) Logs: {'loss': 2.3857, 'learning_rate': 6.666666666666666e-07, 'epoch': 0.11, 'iter_time': 5.3230684670535, 'flops': 3267118287333.727, 'remaining_time': 351.322518825531}


2025-08-07 08:08:44,849 - DEBUG - utilities - Step (35) Logs: {'loss': 2.1428, 'learning_rate': 6.565656565656566e-07, 'epoch': 0.11, 'iter_time': 5.288223084281473, 'flops': 3288646121063.3633, 'remaining_time': 343.73450047829573}


2025-08-07 08:08:50,141 - DEBUG - utilities - Step (36) Logs: {'loss': 2.1705, 'learning_rate': 6.464646464646465e-07, 'epoch': 0.11, 'iter_time': 5.288333811078753, 'flops': 3288577263599.863, 'remaining_time': 338.4533639090402}


2025-08-07 08:08:55,188 - DEBUG - utilities - Step (37) Logs: {'loss': 2.5617, 'learning_rate': 6.363636363636363e-07, 'epoch': 0.12, 'iter_time': 5.281635502974193, 'flops': 3292747923185.2964, 'remaining_time': 332.7430366873741}


2025-08-07 08:09:01,145 - DEBUG - utilities - Step (38) Logs: {'loss': 2.849, 'learning_rate': 6.262626262626263e-07, 'epoch': 0.12, 'iter_time': 5.2998673722550675, 'flops': 3281420668087.4307, 'remaining_time': 328.59177707981416}


2025-08-07 08:09:06,185 - DEBUG - utilities - Step (39) Logs: {'loss': 2.0174, 'learning_rate': 6.161616161616161e-07, 'epoch': 0.12, 'iter_time': 5.293046775617097, 'flops': 3285649092230.5205, 'remaining_time': 322.8758533126429}


2025-08-07 08:09:10,627 - DEBUG - utilities - Step (40) Logs: {'loss': 2.1603, 'learning_rate': 6.060606060606061e-07, 'epoch': 0.13, 'iter_time': 5.271234622368445, 'flops': 3299244973775.407, 'remaining_time': 316.27407734210675}


2025-08-07 08:09:16,271 - DEBUG - utilities - Step (41) Logs: {'loss': 1.9872, 'learning_rate': 5.959595959595959e-07, 'epoch': 0.13, 'iter_time': 5.280552673339844, 'flops': 3293423133764.611, 'remaining_time': 311.55260772705077}


2025-08-07 08:09:20,715 - DEBUG - utilities - Step (42) Logs: {'loss': 2.1317, 'learning_rate': 5.858585858585858e-07, 'epoch': 0.13, 'iter_time': 5.2601167981217545, 'flops': 3306218283907.667, 'remaining_time': 305.0867742910618}


2025-08-07 08:09:26,278 - DEBUG - utilities - Step (43) Logs: {'loss': 3.1635, 'learning_rate': 5.757575757575758e-07, 'epoch': 0.14, 'iter_time': 5.26735474382128, 'flops': 3301675163200.3, 'remaining_time': 300.23922039781297}


2025-08-07 08:09:30,861 - DEBUG - utilities - Step (44) Logs: {'loss': 2.4332, 'learning_rate': 5.656565656565657e-07, 'epoch': 0.14, 'iter_time': 5.251447439193726, 'flops': 3311676358720.277, 'remaining_time': 294.08105659484863}


2025-08-07 08:09:35,943 - DEBUG - utilities - Step (45) Logs: {'loss': 2.2063, 'learning_rate': 5.555555555555555e-07, 'epoch': 0.14, 'iter_time': 5.247580674561587, 'flops': 3314116620968.9478, 'remaining_time': 288.6169371008873}


2025-08-07 08:09:40,737 - DEBUG - utilities - Step (46) Logs: {'loss': 2.1725, 'learning_rate': 5.454545454545454e-07, 'epoch': 0.15, 'iter_time': 5.237515142228868, 'flops': 3320485738211.933, 'remaining_time': 282.82581768035885}


2025-08-07 08:09:45,927 - DEBUG - utilities - Step (47) Logs: {'loss': 2.0262, 'learning_rate': 5.353535353535354e-07, 'epoch': 0.15, 'iter_time': 5.236466837965923, 'flops': 3321150476376.448, 'remaining_time': 277.53274241219395}


2025-08-07 08:09:51,665 - DEBUG - utilities - Step (48) Logs: {'loss': 1.8742, 'learning_rate': 5.252525252525253e-07, 'epoch': 0.15, 'iter_time': 5.247144196895843, 'flops': 3314392301955.108, 'remaining_time': 272.8514982385839}


2025-08-07 08:09:57,518 - DEBUG - utilities - Step (49) Logs: {'loss': 2.281, 'learning_rate': 5.151515151515151e-07, 'epoch': 0.16, 'iter_time': 5.259773487846057, 'flops': 3306434083830.0757, 'remaining_time': 268.2484478801489}


2025-08-07 08:10:03,621 - DEBUG - utilities - Step (50) Logs: {'loss': 2.7643, 'learning_rate': 5.05050505050505e-07, 'epoch': 0.16, 'iter_time': 5.276979641038544, 'flops': 3295653103944.3843, 'remaining_time': 263.8489820519272}


2025-08-07 08:10:09,184 - DEBUG - utilities - Step (51) Logs: {'loss': 2.2259, 'learning_rate': 4.949494949494949e-07, 'epoch': 0.16, 'iter_time': 5.282691111564636, 'flops': 3292089953048.3955, 'remaining_time': 258.85186446666717}


2025-08-07 08:10:13,787 - DEBUG - utilities - Step (52) Logs: {'loss': 1.9349, 'learning_rate': 4.848484848484849e-07, 'epoch': 0.17, 'iter_time': 5.26936214577918, 'flops': 3300417365955.8525, 'remaining_time': 252.92938299740064}


2025-08-07 08:10:19,936 - DEBUG - utilities - Step (53) Logs: {'loss': 2.1313, 'learning_rate': 4.7474747474747474e-07, 'epoch': 0.17, 'iter_time': 5.286285317861116, 'flops': 3289851623157.6787, 'remaining_time': 248.45540993947247}


2025-08-07 08:10:28,879 - DEBUG - utilities - Step (54) Logs: {'loss': 2.4095, 'learning_rate': 4.646464646464646e-07, 'epoch': 0.17, 'iter_time': 5.355270381243724, 'flops': 3247472694254.7095, 'remaining_time': 246.34243753721128}


2025-08-07 08:10:42,692 - DEBUG - utilities - Step (55) Logs: {'loss': 1.6255, 'learning_rate': 4.545454545454545e-07, 'epoch': 0.17, 'iter_time': 5.511903714250635, 'flops': 3155188340550.3916, 'remaining_time': 248.03566714127857}


2025-08-07 08:10:48,517 - DEBUG - utilities - Step (56) Logs: {'loss': 2.2029, 'learning_rate': 4.444444444444444e-07, 'epoch': 0.18, 'iter_time': 5.517593817277388, 'flops': 3151934504309.2524, 'remaining_time': 242.77412796020505}


2025-08-07 08:10:55,293 - DEBUG - utilities - Step (57) Logs: {'loss': 2.155, 'learning_rate': 4.3434343434343435e-07, 'epoch': 0.18, 'iter_time': 5.540068256003516, 'flops': 3139148026668.0967, 'remaining_time': 238.22293500815118}


2025-08-07 08:11:04,263 - DEBUG - utilities - Step (58) Logs: {'loss': 2.3282, 'learning_rate': 4.242424242424242e-07, 'epoch': 0.18, 'iter_time': 5.600246073906882, 'flops': 3105416102065.584, 'remaining_time': 235.21033510408904}


2025-08-07 08:11:12,516 - DEBUG - utilities - Step (59) Logs: {'loss': 2.5928, 'learning_rate': 4.1414141414141413e-07, 'epoch': 0.19, 'iter_time': 5.645975750068138, 'flops': 3080263731779.244, 'remaining_time': 231.48500575279365}


2025-08-07 08:11:18,571 - DEBUG - utilities - Step (60) Logs: {'loss': 1.4358, 'learning_rate': 4.04040404040404e-07, 'epoch': 0.19, 'iter_time': 5.652906939134759, 'flops': 3076486933305.487, 'remaining_time': 226.11627756539036}


2025-08-07 08:11:24,904 - DEBUG - utilities - Step (61) Logs: {'loss': 2.0063, 'learning_rate': 3.939393939393939e-07, 'epoch': 0.19, 'iter_time': 5.664239696661631, 'flops': 3070331635804.5195, 'remaining_time': 220.9053481698036}


2025-08-07 08:11:31,689 - DEBUG - utilities - Step (62) Logs: {'loss': 1.8261, 'learning_rate': 3.8383838383838377e-07, 'epoch': 0.2, 'iter_time': 5.682621318785871, 'flops': 3060400008698.0493, 'remaining_time': 215.9396101138631}


2025-08-07 08:11:39,354 - DEBUG - utilities - Step (63) Logs: {'loss': 2.0741, 'learning_rate': 3.7373737373737374e-07, 'epoch': 0.2, 'iter_time': 5.71459484869434, 'flops': 3043276871572.704, 'remaining_time': 211.44000940169056}


2025-08-07 08:11:44,753 - DEBUG - utilities - Step (64) Logs: {'loss': 2.5515, 'learning_rate': 3.636363636363636e-07, 'epoch': 0.2, 'iter_time': 5.709589038576398, 'flops': 3045945026154.844, 'remaining_time': 205.54520538875033}


2025-08-07 08:11:49,931 - DEBUG - utilities - Step (65) Logs: {'loss': 3.0795, 'learning_rate': 3.535353535353535e-07, 'epoch': 0.21, 'iter_time': 5.7012797854840755, 'flops': 3050384297525.469, 'remaining_time': 199.54479249194264}


2025-08-07 08:11:55,275 - DEBUG - utilities - Step (66) Logs: {'loss': 1.7088, 'learning_rate': 3.434343434343434e-07, 'epoch': 0.21, 'iter_time': 5.6957745001866265, 'flops': 3053332664920.3135, 'remaining_time': 193.6563330063453}


2025-08-07 08:12:00,139 - DEBUG - utilities - Step (67) Logs: {'loss': 2.3697, 'learning_rate': 3.333333333333333e-07, 'epoch': 0.21, 'iter_time': 5.683182467113841, 'flops': 3060097829706.3066, 'remaining_time': 187.54502141475677}


2025-08-07 08:12:06,223 - DEBUG - utilities - Step (68) Logs: {'loss': 2.0624, 'learning_rate': 3.2323232323232327e-07, 'epoch': 0.22, 'iter_time': 5.6891567422382865, 'flops': 3056884371689.4707, 'remaining_time': 182.05301575162517}


2025-08-07 08:12:13,097 - DEBUG - utilities - Step (69) Logs: {'loss': 2.3099, 'learning_rate': 3.1313131313131313e-07, 'epoch': 0.22, 'iter_time': 5.706588653957143, 'flops': 3047546509486.0874, 'remaining_time': 176.90424827267142}


2025-08-07 08:12:18,385 - DEBUG - utilities - Step (70) Logs: {'loss': 1.7547, 'learning_rate': 3.0303030303030305e-07, 'epoch': 0.22, 'iter_time': 5.700520228648531, 'flops': 3050790741174.696, 'remaining_time': 171.01560685945594}


2025-08-07 08:12:24,205 - DEBUG - utilities - Step (71) Logs: {'loss': 1.9871, 'learning_rate': 2.929292929292929e-07, 'epoch': 0.23, 'iter_time': 5.702229949406215, 'flops': 3049876011270.1123, 'remaining_time': 165.36466853278023}


2025-08-07 08:12:28,995 - DEBUG - utilities - Step (72) Logs: {'loss': 2.1899, 'learning_rate': 2.8282828282828283e-07, 'epoch': 0.23, 'iter_time': 5.689377448928188, 'flops': 3056765786688.3623, 'remaining_time': 159.30256856998926}


2025-08-07 08:12:35,101 - DEBUG - utilities - Step (73) Logs: {'loss': 2.1119, 'learning_rate': 2.727272727272727e-07, 'epoch': 0.23, 'iter_time': 5.695166170597076, 'flops': 3053658806871.43, 'remaining_time': 153.76948660612106}


2025-08-07 08:12:40,965 - DEBUG - utilities - Step (74) Logs: {'loss': 1.9142, 'learning_rate': 2.6262626262626266e-07, 'epoch': 0.23, 'iter_time': 5.697479280706954, 'flops': 3052419057025.8784, 'remaining_time': 148.1344612983808}


2025-08-07 08:12:47,604 - DEBUG - utilities - Step (75) Logs: {'loss': 2.2655, 'learning_rate': 2.525252525252525e-07, 'epoch': 0.24, 'iter_time': 5.710205419643505, 'flops': 3045616235383.3057, 'remaining_time': 142.75513549108763}


2025-08-07 08:12:54,647 - DEBUG - utilities - Step (76) Logs: {'loss': 2.2733, 'learning_rate': 2.4242424242424244e-07, 'epoch': 0.24, 'iter_time': 5.727965933481852, 'flops': 3036172794217.108, 'remaining_time': 137.47118240356446}


2025-08-07 08:13:03,473 - DEBUG - utilities - Step (77) Logs: {'loss': 2.2634, 'learning_rate': 2.323232323232323e-07, 'epoch': 0.24, 'iter_time': 5.768739367786207, 'flops': 3014713133090.281, 'remaining_time': 132.68100545908277}


2025-08-07 08:13:09,255 - DEBUG - utilities - Step (78) Logs: {'loss': 2.7154, 'learning_rate': 2.222222222222222e-07, 'epoch': 0.25, 'iter_time': 5.768903893309754, 'flops': 3014627155361.107, 'remaining_time': 126.9158856528146}


2025-08-07 08:13:15,776 - DEBUG - utilities - Step (79) Logs: {'loss': 2.4501, 'learning_rate': 2.121212121212121e-07, 'epoch': 0.25, 'iter_time': 5.778552727821546, 'flops': 3009593431536.6646, 'remaining_time': 121.34960728425246}


2025-08-07 08:13:22,248 - DEBUG - utilities - Step (80) Logs: {'loss': 2.2831, 'learning_rate': 2.02020202020202e-07, 'epoch': 0.25, 'iter_time': 5.787329519851299, 'flops': 3005029223545.3794, 'remaining_time': 115.74659039702597}


2025-08-07 08:13:30,580 - DEBUG - utilities - Step (81) Logs: {'loss': 1.8962, 'learning_rate': 1.9191919191919189e-07, 'epoch': 0.26, 'iter_time': 5.819130203127861, 'flops': 2988607184642.8276, 'remaining_time': 110.56347385942935}


2025-08-07 08:13:37,459 - DEBUG - utilities - Step (82) Logs: {'loss': 1.836, 'learning_rate': 1.818181818181818e-07, 'epoch': 0.26, 'iter_time': 5.832217240039213, 'flops': 2981900985108.5503, 'remaining_time': 104.97991032070584}


2025-08-07 08:13:44,793 - DEBUG - utilities - Step (83) Logs: {'loss': 1.7763, 'learning_rate': 1.717171717171717e-07, 'epoch': 0.26, 'iter_time': 5.850529080483971, 'flops': 2972567795868.705, 'remaining_time': 99.45899436822751}


2025-08-07 08:13:51,868 - DEBUG - utilities - Step (84) Logs: {'loss': 1.8726, 'learning_rate': 1.6161616161616163e-07, 'epoch': 0.27, 'iter_time': 5.865292101021272, 'flops': 2965085802020.302, 'remaining_time': 93.84467361634036}


2025-08-07 08:13:59,282 - DEBUG - utilities - Step (85) Logs: {'loss': 2.207, 'learning_rate': 1.5151515151515152e-07, 'epoch': 0.27, 'iter_time': 5.883724465256646, 'flops': 2955796865766.624, 'remaining_time': 88.25586697884968}


2025-08-07 08:14:06,669 - DEBUG - utilities - Step (86) Logs: {'loss': 2.5149, 'learning_rate': 1.4141414141414141e-07, 'epoch': 0.27, 'iter_time': 5.901405126908246, 'flops': 2946941272366.3013, 'remaining_time': 82.61967177671545}


2025-08-07 08:14:13,318 - DEBUG - utilities - Step (87) Logs: {'loss': 2.5003, 'learning_rate': 1.3131313131313133e-07, 'epoch': 0.28, 'iter_time': 5.9101012058036275, 'flops': 2942605164927.1616, 'remaining_time': 76.83131567544716}


2025-08-07 08:14:19,528 - DEBUG - utilities - Step (88) Logs: {'loss': 1.6742, 'learning_rate': 1.2121212121212122e-07, 'epoch': 0.28, 'iter_time': 5.9135538901405775, 'flops': 2940887097086.483, 'remaining_time': 70.96264668168693}


2025-08-07 08:14:25,896 - DEBUG - utilities - Step (89) Logs: {'loss': 2.7694, 'learning_rate': 1.111111111111111e-07, 'epoch': 0.28, 'iter_time': 5.9187096655368805, 'flops': 2938325296593.59, 'remaining_time': 65.10580632090569}


2025-08-07 08:14:32,703 - DEBUG - utilities - Step (90) Logs: {'loss': 1.8878, 'learning_rate': 1.01010101010101e-07, 'epoch': 0.29, 'iter_time': 5.928699964887641, 'flops': 2933374000444.9004, 'remaining_time': 59.28699964887641}


2025-08-07 08:14:39,537 - DEBUG - utilities - Step (91) Logs: {'loss': 2.8206, 'learning_rate': 9.09090909090909e-08, 'epoch': 0.29, 'iter_time': 5.938749490843879, 'flops': 2928410157770.2305, 'remaining_time': 53.44874541759491}


2025-08-07 08:14:46,410 - DEBUG - utilities - Step (92) Logs: {'loss': 2.0162, 'learning_rate': 8.080808080808082e-08, 'epoch': 0.29, 'iter_time': 5.94901598416842, 'flops': 2923356464282.7236, 'remaining_time': 47.59212787334736}


2025-08-07 08:14:54,159 - DEBUG - utilities - Step (93) Logs: {'loss': 1.9882, 'learning_rate': 7.070707070707071e-08, 'epoch': 0.3, 'iter_time': 5.968581606512484, 'flops': 2913773402120.212, 'remaining_time': 41.780071245587386}


2025-08-07 08:15:02,482 - DEBUG - utilities - Step (94) Logs: {'loss': 2.0007, 'learning_rate': 6.060606060606061e-08, 'epoch': 0.3, 'iter_time': 5.993904126587735, 'flops': 2901463547989.808, 'remaining_time': 35.963424759526404}


2025-08-07 08:15:09,697 - DEBUG - utilities - Step (95) Logs: {'loss': 1.9036, 'learning_rate': 5.05050505050505e-08, 'epoch': 0.3, 'iter_time': 6.006887519613225, 'flops': 2895192273312.2505, 'remaining_time': 30.034437598066127}


2025-08-07 08:15:15,423 - DEBUG - utilities - Step (96) Logs: {'loss': 2.5702, 'learning_rate': 4.040404040404041e-08, 'epoch': 0.3, 'iter_time': 6.0039325889788175, 'flops': 2896617188101.703, 'remaining_time': 24.01573035591527}


2025-08-07 08:15:21,313 - DEBUG - utilities - Step (97) Logs: {'loss': 2.1145, 'learning_rate': 3.0303030303030305e-08, 'epoch': 0.31, 'iter_time': 6.002748486896356, 'flops': 2897188574767.663, 'remaining_time': 18.008245460689068}


2025-08-07 08:15:28,860 - DEBUG - utilities - Step (98) Logs: {'loss': 1.8528, 'learning_rate': 2.0202020202020204e-08, 'epoch': 0.31, 'iter_time': 6.018668722860592, 'flops': 2889525098363.322, 'remaining_time': 12.037337445721183}


2025-08-07 08:15:35,526 - DEBUG - utilities - Step (99) Logs: {'loss': 3.4161, 'learning_rate': 1.0101010101010102e-08, 'epoch': 0.31, 'iter_time': 6.025273250073803, 'flops': 2886357781902.585, 'remaining_time': 6.025273250073803}


2025-08-07 08:15:41,011 - DEBUG - utilities - Step (100) Logs: {'loss': 2.4024, 'learning_rate': 0.0, 'epoch': 0.32, 'iter_time': 6.019818708150074, 'flops': 2888973103109.3438, 'remaining_time': 0.0}
2025-08-07 08:15:41,029 - DEBUG - utilities - Step (100) Logs: {'train_runtime': 602.7565, 'train_samples_per_second': 0.664, 'train_steps_per_second': 0.166, 'total_flos': 66683552747520.0, 'train_loss': 2.3067615163326263, 'epoch': 0.32, 'iter_time': 6.01999704765551, 'flops': 2888887518676.2705, 'remaining_time': 0.0}


{'train_runtime': 602.7565, 'train_samples_per_second': 0.664, 'train_steps_per_second': 0.166, 'train_loss': 2.3067615163326263, 'epoch': 0.32, 'iter_time': 6.01999704765551, 'flops': 2888887518676.2705, 'remaining_time': 0.0}
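
The learning rates in the log above drop by a constant amount each step and reach 0 at step 100, which is the pattern a linear decay schedule produces. The sketch below reproduces the logged values under two assumptions that are inferred from the numbers in the log rather than read from the training configuration: a peak learning rate of `1e-6` and a single warmup step. It also shows the `remaining_time` estimate, which appears to be `iter_time` multiplied by the number of steps left.

```python
def linear_decay_lr(step, base_lr=1e-6, warmup_steps=1, max_steps=100):
    """Learning rate reported after `step` optimizer steps under linear warmup and decay.

    base_lr and warmup_steps are assumptions inferred from the log, not values
    read from the notebook's training arguments.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, max_steps - step) / max(1, max_steps - warmup_steps)


def remaining_time(step, iter_time, max_steps=100):
    """Estimated seconds left: average iteration time times the steps remaining."""
    return iter_time * (max_steps - step)


# Print the schedule for a few of the logged steps.
for step in (88, 99, 100):
    print(step, linear_decay_lr(step))
```

At a learning rate this small the weights barely move in the last steps, which is consistent with the per-step loss still fluctuating between roughly 1.7 and 3.4 at the end of the run.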


Next, we save the model:

In [36]:
save_dir = f'{OUTPUT_DIR}/final'
trainer.save_model(save_dir)
print("Saved model to:", save_dir)

Saved model to: lamini_docs_100_steps/final


In [37]:
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)

In [38]:
finetuned_slightly_model.to(device) 

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 1024)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
  

In [39]:
test_text = test_dataset[0]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {test_dataset[0]['answer']}")
print("Model's answer: ")
print(inference(test_text, finetuned_slightly_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test): Can Lamini generate technical documentation or user manuals for software projects?
Correct answer from Lamini docs: Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.
Model's answer: 
Yes, Lamini can generate technical documentation or user manuals for software projects. Lamini can generate documentation for any programming language, including C++, Java, Python, and LaminiML. Lamini can also generate user manuals for any programming language, including C++, Java, Python, and LaminiML. Lamini can generate documentation for any programming language, including C++, Java, Python, and LaminiML. Lamini can generate documentati
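
The warning printed before the answer above is raised because `generate` received neither an `attention_mask` nor a `pad_token_id`. A hedged sketch of a generation helper that passes both is shown below. The name `generate_with_mask` and its parameters are hypothetical and separate from the notebook's `inference` helper, whose body is not shown in this section; the sketch assumes a Hugging Face tokenizer and causal language model like the ones loaded above.

```python
def generate_with_mask(text, model, tokenizer, max_new_tokens=100):
    """Generate a continuation while passing the attention mask explicitly.

    Hypothetical helper: assumes `tokenizer` and `model` follow the Hugging Face
    tokenizer and `generate` interfaces.
    """
    # The tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        # No pad token id is configured (see the warning), so reuse EOS explicitly.
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

A call such as `generate_with_mask(test_text, finetuned_slightly_model, tokenizer)` would then run without the two warning lines.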

Model Inference Example

Let's evaluate the fine-tuned model's ability to answer an instruction-based prompt. We'll use a sample question from the test dataset and compare the model's output to the preferred answer from the Lamini docs.

**Prompt Example:**
> Can Lamini generate technical documentation or user manuals for software projects?

**Preferred Answer (from Lamini docs):**
> Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.

**Fine-Tuned Model Output:**
> Yes, Lamini can generate technical documentation or user manuals for software projects. Lamini can generate documentation for any programming language, including C++, Java, Python, and LaminiML. Lamini can also generate user manuals for any programming language, including C++, Java, Python, and LaminiML. Lamini can generate documentation for any programming language, including C++, Java, Python, and LaminiML. Lamini can generate documentation for any

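This is a bit better: the answer opens with a direct, on-topic response, although it quickly starts repeating itself.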
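
One rough way to put a number on the repetition in the fine-tuned answer is the share of sentences that repeat an earlier one. This is a sketch for illustration, not something the notebook computes; the helper name `repeated_sentence_ratio` is hypothetical, and the sample text is a shortened stand-in for the model output.

```python
import re


def repeated_sentence_ratio(text):
    """Fraction of sentences that repeat an earlier sentence (0.0 means no repeats)."""
    # Split on sentence-ending punctuation followed by whitespace; drop empty pieces.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return 0.0
    return 1 - len(set(sentences)) / len(sentences)


# Shortened, illustrative sample in the style of the output above.
sample = (
    "Yes, Lamini can generate technical documentation. "
    "Lamini can generate documentation for any programming language. "
    "Lamini can generate documentation for any programming language."
)
print(repeated_sentence_ratio(sample))
```

A lower ratio after further training or with different generation settings would suggest less looping, though it says nothing about whether the answer is factually right.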

## Evaluation

In [44]:
import tqdm

In [47]:
test_dataset

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 140
})

In [50]:
answer = test_dataset[0:10]
print(answer)

{'question': ['Can Lamini generate technical documentation or user manuals for software projects?', 'How do I include my API key in the Authorization HTTP header?', "Is there a section explaining the code's approach to handling versioning and compatibility?", 'Is there a community or support forum available for Lamini users?', 'Can the Lamini library be utilized for text completion or auto-completion tasks, such as filling in missing words or predicting the next word in a sentence?', 'Are there any costs associated with using Lamini for machine learning tasks, and how does the pricing structure work?', 'How do I instantiate the LLM engine using the Lamini Python package?', 'Does Lamini provide any mechanisms for model compression or optimization to reduce memory footprint?', 'How does the performance of LLMs trained using Lamini compare to models fine-tuned with traditional approaches?', 'Is there any support or community available to help me if I have questions or need assistance whil

In [72]:
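n = 10
dict_eval = {
    'question': [],
    'base_model': [],
    'fine_tune': [],
    'answer': []
}
for i in range(n):
    # Training examples: collect base and fine-tuned responses next to the reference answer.
    question = train_dataset[i]['question']
    answer = train_dataset[i]['answer']
    dict_eval['question'].append(question+"\n TRAIN")
    dict_eval['answer'].append(answer)
    # Assumption: `base_model` is the original pretrained model loaded earlier in the notebook.
    base_model_resp = inference(question, base_model, tokenizer)
    dict_eval['base_model'].append(base_model_resp)
    fine_tune_model_resp = inference(question, finetuned_slightly_model, tokenizer)
    dict_eval['fine_tune'].append(fine_tune_model_resp)

    # Note: n is 10, so this branch never runs and only TRAIN rows are collected.
    if n < 5:
        question = test_dataset[i]['question']
        answer = test_dataset[i]['answer']
        dict_eval['question'].append(question+"\n TEST")
        dict_eval['answer'].append(answer)
        base_model_resp = inference(question, base_model, tokenizer)
        dict_eval['base_model'].append(base_model_resp)
        fine_tune_model_resp = inference(question, finetuned_slightly_model, tokenizer)
        dict_eval['fine_tune'].append(fine_tune_model_resp)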
    

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
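
The loop above only reaches its `TEST` branch when `n < 5`, so with `n = 10` every row comes from the training set. If the intent was to mix in a few held-out questions, one possible arrangement is sketched below. This is an assumption about the intent, and `build_eval_rows`, `answer_fns`, `n_train`, and `n_test` are hypothetical names; each entry in `answer_fns` would wrap a call such as `inference(question, model, tokenizer)`.

```python
def build_eval_rows(train_examples, test_examples, answer_fns, n_train=10, n_test=5):
    """Collect questions, reference answers, and one response column per model.

    answer_fns maps a column name (for example 'fine_tune') to a function that
    takes a question string and returns that model's response.
    """
    rows = {"question": [], "answer": []}
    for name in answer_fns:
        rows[name] = []
    splits = (("TRAIN", train_examples, n_train), ("TEST", test_examples, n_test))
    for label, examples, limit in splits:
        for i in range(min(limit, len(examples))):
            example = examples[i]
            # Tag each question with the split it came from, as the notebook does.
            rows["question"].append(f"{example['question']}\n {label}")
            rows["answer"].append(example["answer"])
            for name, answer_fn in answer_fns.items():
                rows[name].append(answer_fn(example["question"]))
    return rows
```

With the notebook's objects this could be called as `build_eval_rows(train_dataset, test_dataset, {'fine_tune': lambda q: inference(q, finetuned_slightly_model, tokenizer)})`. Every column then has the same length, which `pd.DataFrame.from_dict` requires.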

In [88]:
df = pd.DataFrame.from_dict(dict_eval)
style_df = df.style.set_properties(**{'text-align': 'left'})
style_df = style_df.set_properties(**{"vertical-align": "text-top"})
style_df


index,question,base_model,fine_tune,answer
0,How can I evaluate the performance and quality of the generated text from Lamini models?  TRAIN,A: I think you are looking for the Lamini model. The Lamini model is a text-to-image model. It is a text-to-image model that is trained to generate text from images. The Lamini model is a text-to-image model that is trained to generate text from images. The Lamini model is a text-to-image model that is trained to generate text from images. The L,Lamini provides a set of tools and APIs to evaluate the quality and performance of generated text from Lamini models. These tools and APIs can be used to evaluate the quality and performance of generated text from Lamini models. The tools and APIs can be used to evaluate the quality and performance of generated text from Lamini models. The tools and APIs can be used to evaluate the quality and performance of generated text from Lamini models. The tools and APIs can be used to evaluate,"There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance."
1,Can I find information about the code's approach to handling long-running tasks and background jobs?  TRAIN,"A: I'm not sure if this is the best way to do it, but I've found that the best way to do it is to use a background task. I've found that the best way to do this is to use a background task. I've found that the best way to do this is to use a background task. I've found that the best way to do this is to use a background task. I've found that the best way","Yes, you can find information about the code's approach to handling long-running tasks and background jobs. The code is written in Python, and it uses the Python language's built-in functions for handling long-running tasks and background jobs. The code is written in Python, and it uses the Python language's built-in functions for handling long-running tasks and background jobs. The code is written in Python, and it uses the Python language's built-in functions for handling long-running","Yes, the code includes methods for submitting jobs, checking job status, and retrieving job results. It also includes a method for canceling jobs. Additionally, there is a method for sampling multiple outputs from a model, which could be useful for long-running tasks."
2,How does Lamini AI handle requests for generating text that requires reasoning or decision-making based on given information?  TRAIN,"A: Lamini AI is a tool for creating and managing a large-scale, distributed, and distributed-based system. Lamini AI is a tool for creating and managing a large-scale, distributed, and distributed-based system. Lamini AI is a tool for creating and managing a large-scale, distributed, and distributed-based system. Lamini AI is a tool for creating and managing a large-scale,","Lamini AI is designed to handle requests for generating text that requires reasoning or decision-making based on given information. It can be used to generate text that requires reasoning or decision-making based on given information, such as text that requires reasoning or decision-making based on the context of the text. It can also be used to generate text that requires reasoning or decision-making based on the context of the text, such as text that requires reasoning or decision-making based on the context of the","Lamini AI offers features for generating text that requires logical reasoning or inference beyond simple text generation. It can handle user prompts that involve complex reasoning or logical inference, and can generate text that captures the nuances of different cultural or regional variations."
3,Does the `submit_job()` function expose any advanced training options such as learning rate schedules or early stopping?  TRAIN,"A: Yes, the `submit_job()` function exposes the following options: - `learning_rate`: The learning rate schedule. - `early_stopping`: Whether to stop training when the training is finished. The `learning_rate` and `early_stopping` options are available in the `train_config` and `eval_config` variables. A: Yes, the `submit_job()` function exposes","Yes, the `submit_job()` function exposes advanced training options such as learning rate schedules or early stopping. However, the `submit_job()` function does not provide any advanced training options. ## See also The following are the official documentation pages for the `submit_job()` function: - [API Reference](https://www.tensorflow.org/api_docs/python/tf/submit_job/) - [Training documentation](https://www",It is unclear which `submit_job()` function is being referred to as there is no such function defined in Lamini’s python library snippets. Please provide more information or context to answer the question accurately.
4,Does the `add_data()` function support different data augmentation techniques or preprocessing options for training data?  TRAIN,A: The add_data() function is not designed to be used with any preprocessing or augmentation. It is designed to be used with the `train()` function. A: The add_data() function is not designed to be used with any preprocessing or augmentation. It is designed to be used with the `train()` function. The add_data() function is not designed to be used with any preprocessing or augmentation. It is designed,"Yes, the `add_data()` function supports different data augmentation techniques or preprocessing options for training data. The `add_data()` function can be used to generate different types of data for training. For example, the `add_data()` function can be used to generate different types of data for training. The `add_data()` function can also be used to generate different types of data for training. The `add_data()` function can also be used to generate different types of data","No, the `add_data()` function does not support different data augmentation techniques or preprocessing options for training data. It simply adds the provided examples to the program's list of examples."
5,Can Lamini generate text for data storytelling or data visualization purposes?  TRAIN,Lamini is a data visualization and data storytelling tool. It is a tool that allows you to create a visual representation of your data. It is a tool that allows you to create a visual representation of your data. It is a tool that allows you to create a visual representation of your data. It is a tool that allows you to create a visual representation of your data. It is a tool that allows you to create a visual representation of your data. It is a tool that allows,"Yes, Lamini can generate text for data storytelling or data visualization purposes. Lamini can generate text for data storytelling or data visualization purposes by using the Lamini API. Lamini can generate text for data storytelling or data visualization purposes by using the Lamini API. Lamini can generate text for data storytelling or data visualization purposes by using the Lamini API. Lamini can generate text for data storytelling or data visualization purposes by using the Lamini API.","Yes, Lamini can generate text for data storytelling or data visualization purposes using its natural language generation capabilities."
6,Can the documentation predict the outcome of a coin toss?  TRAIN,"A: The answer is no. The coin is tossed, and the outcome is determined by the probability of heads and tails. The probability of heads is $1/2$, and the probability of tails is $1/2$. The probability of heads is $1/2$, and the probability of tails is $1/2$. The probability of heads is $1/2$, and the probability of tails is $1/2$. The probability of","Yes, the documentation can predict the outcome of a coin toss. The documentation can also predict the outcome of a coin toss if the coin is biased. For example, if the coin is biased toward heads, the documentation can predict the outcome of a coin toss if the coin is biased toward heads. The documentation can also predict the outcome of a coin toss if the coin is biased toward tails. The documentation can also predict the outcome of a coin toss if the coin is biased toward tails. The documentation can",I wish! This documentation only talks about Lamini’s LLM Engine.
7,"How does the `submit_job()` function work in Lamini? What does it mean to submit a job, and what happens behind the scenes?  TRAIN",The `submit_job()` function is a function that is called by the `submit()` function. The `submit()` function is a function that is called by the `submit_job()` function. The `submit_job()` function is a function that is called by the `submit()` function. The `submit()` function is a function that is called by the `submit_job()` function. The `submit()` function is a function that is called by the `submit_job()`,Lamini provides a simple API for submitting jobs. It is designed to be easy to use and easy to understand. It is designed to be easy to use and easy to understand. It is designed to be easy to use and easy to understand. It is designed to be easy to use and easy to understand. It is designed to be easy to use and easy to understand. It is designed to be easy to use and easy to understand. It is designed to be easy to use and easy to,"The `submit_job()` function in Lamini is used to submit a job for training a machine learning model. When you submit a job, Lamini takes the training data and uses it to train a model based on the specified parameters. Behind the scenes, Lamini uses distributed computing to train the model on multiple machines, which allows for faster training times. Once the training is complete, the resulting model is saved and can be used for inference."
8,Does Lamini support generating code  TRAIN,? I am trying to generate a simple HTML page using Lamini. I am using the following code: import lamini lamini.init() lamini.generate_page() lamini.generate_page() lamini.generate_page() lamini.generate_page() lamini.generate_page() lamini.generate_page() l,"for a specific task?Yes, Lamini supports generating code for specific tasks. For example, you can use Lamini to generate code for a specific task such as a specific language or a specific dataset. You can also use Lamini to generate code for a specific task such as a specific language or a specific dataset. You can also use Lamini to generate code for a specific task such as a specific language or a specific dataset. You can also use Lamini to generate","Yes, Lamini supports generating code through its API."
9,Can Lamini be used to create chatbots or virtual assistants?  TRAIN,I am a programmer and I am currently working on a project that will be used to create chatbots or virtual assistants. I am looking for a way to use Lamini to create chatbots or virtual assistants. A: Lamini is a library for creating chatbots. It is a library that is built on top of the chatbot framework. Chatbots are a type of chatbot that can be used to create a chatbot,"Yes, Lamini can be used to create chatbots or virtual assistants. Lamini can be used to create chatbots or virtual assistants that can be used to answer questions or provide answers to questions. Lamini can also be used to create chatbots or virtual assistants that can be used to answer questions or provide answers to questions. Lamini can also be used to create chatbots or virtual assistants that can be used to answer questions or provide answers to questions.","Yes, Lamini can be used to build conversational AI agents or chatbots. It provides tools and functionalities for generating coherent and contextually appropriate responses in conversational settings, as well as support for multi-turn conversations and context-aware recommendation systems."
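
Reading the table row by row is subjective. As a rough complement, a token-overlap F1 score between each model response and the reference answer can be computed. This is a sketch of one simple metric, not the evaluation the notebook performs; the function name `token_f1` and the lowercase whitespace tokenization are assumptions.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a prediction and a reference, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Applied to the DataFrame above, `df.apply(lambda r: token_f1(r['fine_tune'], r['answer']), axis=1)` would give one score per row for the fine-tuned model, and the same call on `base_model` would give a baseline. Overlap scores reward shared wording rather than correctness, so rows such as the coin-toss question still need a manual read.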


## Model Deployment

In [73]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from huggingface_hub import whoami
print(whoami())

In [77]:
tokenizer.push_to_hub("ludyhasby/lamini_docs_instruct")

CommitInfo(commit_url='https://huggingface.co/ludyhasby/lamini_docs_instruct/commit/12cf657ca23cc03b8b7dac39746c57cd63bb7748', commit_message='Upload tokenizer', commit_description='', oid='12cf657ca23cc03b8b7dac39746c57cd63bb7748', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ludyhasby/lamini_docs_instruct', endpoint='https://huggingface.co', repo_type='model', repo_id='ludyhasby/lamini_docs_instruct'), pr_revision=None, pr_num=None)

In [78]:
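# Note: Trainer.push_to_hub takes a commit message as its first positional argument,
# so this string becomes the commit message; the target repo comes from the training arguments.
trainer.push_to_hub("ludyhasby/lamini_docs_instruct")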

CommitInfo(commit_url='https://huggingface.co/ludyhasby/lamini_docs_100_steps/commit/1688f3d73c19ec64cb2ec507824abd1b13a50d1e', commit_message='ludyhasby/lamini_docs_instruct', commit_description='', oid='1688f3d73c19ec64cb2ec507824abd1b13a50d1e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ludyhasby/lamini_docs_100_steps', endpoint='https://huggingface.co', repo_type='model', repo_id='ludyhasby/lamini_docs_100_steps'), pr_revision=None, pr_num=None)
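
The two `CommitInfo` results above point at different repositories: the tokenizer went to `ludyhasby/lamini_docs_instruct`, while the trainer commit went to `ludyhasby/lamini_docs_100_steps`. A minimal sketch for sending both artifacts to a single repository is shown below. It relies on the `push_to_hub(repo_id)` method that Hugging Face models and tokenizers share; the helper name `push_model_and_tokenizer` and the default commit message are hypothetical.

```python
def push_model_and_tokenizer(model, tokenizer, repo_id, commit_message="Upload fine-tuned model"):
    """Push a model and its tokenizer to the same Hugging Face Hub repository."""
    # On models and tokenizers, the first argument of push_to_hub is the repository ID.
    model_commit = model.push_to_hub(repo_id, commit_message=commit_message)
    tokenizer_commit = tokenizer.push_to_hub(repo_id, commit_message=commit_message)
    return model_commit, tokenizer_commit
```

After `notebook_login()`, a call such as `push_model_and_tokenizer(finetuned_slightly_model, tokenizer, "ludyhasby/lamini_docs_instruct")` would place the weights next to the tokenizer, so that a single `from_pretrained` repository ID serves both.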