# LG Exaone Finetuning

Fine-tuning LG EXAONE 3.5 7.8B Instruct in bf16 on AWS SageMaker p4d and p4de instances

**1-Library Installation and environment setting**

In [None]:
!git config --global credential.helper store
!pip install huggingface_hub
# Replace YOUR_HUGGINGFACE_KEYS with your Hugging Face access token
!huggingface-cli login --token YOUR_HUGGINGFACE_KEYS

In [None]:
# Install AWS CLI if not already installed
!pip install awscli

In [None]:
# Configure AWS CLI with your credentials
!aws configure

In [None]:
!aws sts get-caller-identity


In [None]:
!pip install --upgrade pip 
!pip install sagemaker transformers datasets peft trl bitsandbytes

In [None]:
!pip install --upgrade sagemaker

**2-Define Training Datasets inside of your S3**

In [2]:
import os
import sagemaker
import boto3

# Specify your custom bucket name
bucket_name = "llama-training-s3"
region_name = sagemaker.Session().boto_region_name  # Detect the region
s3_prefix = "llama-training-s3"
train_s3_path = f"{s3_prefix}/train/exaone_train_set.tsv"
validation_s3_path = f"{s3_prefix}/validation/exaone_validation_set.tsv"
test_s3_path = f"{s3_prefix}/test/exaone_test_set.tsv"
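
The training job later reads these files as SageMaker input channels, so the objects have to exist under exactly these keys. Below is a minimal sketch of assembling the URIs and, optionally, uploading local copies. `build_s3_uri` and `upload_splits` are hypothetical helper names that do not appear in the original notebook, and the local file paths you pass in are your own assumption.

```python
def build_s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def upload_splits(bucket: str, local_to_key: dict) -> list:
    """Upload local files to S3 and return their URIs.

    boto3 is imported lazily so build_s3_uri works without AWS access.
    """
    import boto3  # needs configured AWS credentials

    s3 = boto3.client("s3")
    uris = []
    for local_path, key in local_to_key.items():
        s3.upload_file(local_path, bucket, key)  # Filename, Bucket, Key
        uris.append(build_s3_uri(bucket, key))
    return uris


# Same layout as the cell above
bucket_name = "llama-training-s3"
s3_prefix = "llama-training-s3"
train_key = f"{s3_prefix}/train/exaone_train_set.tsv"
train_uri = build_s3_uri(bucket_name, train_key)
print(train_uri)
```

Calling `upload_splits(bucket_name, {"exaone_train_set.tsv": train_key})` would push a local file to the key the job expects; it is left uncalled here because it needs real credentials.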

**3-Main Training File Creation**

Because the training job runs inside a Docker container, we cannot execute it cell by cell. Instead, the following cells write the required scripts to the current folder so that all of them can be run inside the AWS container, which drives the 8 A100 GPUs of the instance.

**Important Note:** The main thing to change in this cell is the model name. It must match the model name you used for data preprocessing. For LG EXAONE, the Hugging Face name is **LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct**. Also make sure that the **.tsv** file names match the ones defined in the previous cell, for example **exaone_train_set.tsv**.

**Secondary Note:** The preprocessing code is included in the training script on purpose. Because the script runs inside a Docker container, we cannot step through it to confirm that tokenization worked. The script therefore tokenizes the data inside the container and prints the first sample before and after tokenization, so the job logs show what the model actually receives. Without this, debugging would be very hard, since we cannot run cell by cell as in a normal notebook.
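
Some problems can be caught before the container even starts. The training script reads the first column of a headerless TSV, so a cheap local check is to count rows and look for empty texts. The sketch below uses only the standard library; `check_tsv` is a hypothetical helper, and the demo file is a throwaway temporary file rather than the real dataset.

```python
import csv
import os
import tempfile


def check_tsv(path: str) -> dict:
    """Count rows, empty first-column cells, and the longest text in a headerless TSV."""
    rows, empty, longest = 0, 0, 0
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.reader(f, delimiter="\t"):
            rows += 1
            text = record[0] if record else ""
            if not text.strip():
                empty += 1
            longest = max(longest, len(text))
    return {"rows": rows, "empty": empty, "longest_chars": longest}


# Demo on a temporary two-row file (placeholder content, not real training data)
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("first training example\n")
    tmp.write("second example\n")
    demo_path = tmp.name

report = check_tsv(demo_path)
os.remove(demo_path)
print(report)
```

Character counts are only a rough proxy for token counts, so this does not replace the in-container tokenization printout; it only flags empty or suspiciously long rows early.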

In [3]:
%%writefile train_deploy_huggingface.py
import os
import argparse
import subprocess
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    TrainerCallback,
)
from datasets import Dataset
import pandas as pd
import torch
import json
import deepspeed
import shutil 

def list_dir_contents(path):
    """
    Recursively lists all files and directories within the given path,
    along with their sizes.
    """
    print(f"\nContents of '{path}':")
    for root, dirs, files in os.walk(path):
        level = root.replace(path, '').count(os.sep)
        indent = ' ' * 4 * level
        print(f"{indent}{os.path.basename(root)}/")
        sub_indent = ' ' * 4 * (level + 1)
        for f in files:
            file_path = os.path.join(root, f)
            size = os.path.getsize(file_path) / 1e6  # Size in MB
            print(f"{sub_indent}{f} - {size:.2f} MB")

def main():
    parser = argparse.ArgumentParser()

    # Hyperparameters
    parser.add_argument("--batch_size", type=int, default=1, help="Per-device training batch size")
    parser.add_argument("--epochs", type=int, default=4, help="Number of training epochs")
    parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=16, help="Gradient accumulation steps")
    parser.add_argument("--max_length", type=int, default=4000, help="Maximum sequence length")

    # Environment variables set by SageMaker
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--output_data_dir", type=str, default=os.environ.get("SM_OUTPUT_DATA_DIR"))
    parser.add_argument("--train_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation_dir", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))

    # Accept --local_rank, which the DeepSpeed launcher passes to every process
    parser.add_argument("--local_rank", type=int, default=-1, help="Local rank for distributed training")

    args = parser.parse_args()

    #deepspeed.init_distributed()
    #device = torch.device(f"cuda:{args.local_rank}" if torch.cuda.is_available() else "cpu")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
        #"meta-llama/Llama-3.1-8B-Instruct",
        #"meta-llama/Llama-3.2-3B-Instruct",
        use_fast=True,
        add_eos_token=True,
        add_bos_token=True,
        padding_side="left",
        trust_remote_code=True,
        token='YOUR_HUGGINGFACE_KEYS'  # Replace with your actual Hugging Face token
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
        #"meta-llama/Llama-3.1-8B-Instruct",
        #"meta-llama/Llama-3.2-3B-Instruct",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        token='YOUR_HUGGINGFACE_KEYS',  # Replace with your actual Hugging Face token
    )
    
    
    #model.to(device)
    #model.config.rope_scaling = {"type": "linear", "factor": 2.0}
    #model.config.use_cache = False
    model.gradient_checkpointing_enable()

    
    with open('./deepspeed_config.json', 'r') as config_file:
        deepspeed_config = json.load(config_file)
    


    # Move the model to the appropriate device
    #device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    #model.to(device)
    print(f"Model device: {next(model.parameters()).device}")
    
    
    # Prepare data
    def load_and_tokenize_data(input_path):
        df = pd.read_csv(input_path, sep='\t', header=None)
        texts = df[0].tolist()
        
        # Print the first sample for verification
        print("First sample before tokenization:")
        print(texts[0])
        
        tokenized_data = tokenizer(
            texts,
            truncation=True,
            max_length=args.max_length,
            padding="max_length",
        )
        
        # Print the tokenized version of the first sample
        print("\nTokenized input_ids of the first sample:")
        print(tokenized_data["input_ids"][0])
        
        # Optionally, decode the tokenized input_ids back to text for verification
        decoded_text = tokenizer.decode(tokenized_data["input_ids"][0])
        print("\nDecoded text from tokenized input_ids:")
        print(decoded_text)
        
        tokenized_data["labels"] = tokenized_data["input_ids"].copy()
        
        dataset = Dataset.from_dict(tokenized_data)
        
        # Print an example from the dataset
        print("\nDataset example:")
        print(dataset[0])
        
        return dataset

    print("Training data path:")
    print(os.path.join(args.train_dir, 'exaone_train_set.tsv'))

    train_dataset = load_and_tokenize_data(os.path.join(args.train_dir, 'exaone_train_set.tsv'))
    eval_dataset = load_and_tokenize_data(os.path.join(args.validation_dir, 'exaone_validation_set.tsv'))

    # Data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )
    
    # Define separate directory for checkpoints
    checkpoints_dir = os.path.join(args.model_dir, "checkpoints")
    os.makedirs(checkpoints_dir, exist_ok=True)

    # Training arguments with DeepSpeed integration
    training_args = TrainingArguments(
        output_dir=checkpoints_dir,  # Checkpoints saved here,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        weight_decay=3e-7,
        bf16=True,
        logging_dir=os.path.join(args.output_data_dir, "logs"),
        logging_steps=10,
        log_level='debug',
        load_best_model_at_end=False,
        evaluation_strategy="epoch",         # Evaluate at the end of every epoch
        save_strategy="epoch",               # Save at the same cadence as evaluation
        save_total_limit=2,                  # Keep only the two most recent checkpoints
        metric_for_best_model="eval_loss",   # Only used if load_best_model_at_end is enabled
        dataloader_num_workers=4,
        deepspeed=deepspeed_config,          # Config dict loaded from deepspeed_config.json
    )

    # Custom callback to log GPU stats
    class GPUStatsCallback(TrainerCallback):
        def on_step_end(self, args, state, control, **kwargs):
            gpu = int(os.environ.get("LOCAL_RANK", -1))
            if gpu == -1:
                return  # Skip if not using GPU
            # Synchronize to ensure all computations are done
            torch.cuda.synchronize(gpu)
            allocated = torch.cuda.memory_allocated(gpu) / 1e9  # Convert to GB
            reserved = torch.cuda.memory_reserved(gpu) / 1e9  # Convert to GB

            # Optionally, get GPU utilization using nvidia-smi
            try:
                result = subprocess.check_output(
                    ['nvidia-smi', '--id={}'.format(gpu), '--query-gpu=utilization.gpu,memory.used,memory.total', '--format=csv,nounits,noheader'],
                    encoding='utf-8'
                )
                gpu_util, mem_used, mem_total = result.strip().split(',')
                gpu_util = int(gpu_util)
                mem_used = float(mem_used) / 1e3  # Convert MB to GB
                mem_total = float(mem_total) / 1e3  # Convert MB to GB
            except Exception as e:
                gpu_util = 'N/A'
                mem_used = allocated
                mem_total = reserved
                print(f"Error getting GPU utilization: {e}")

            print(f"After step {state.global_step}: GPU {gpu}, Utilization: {gpu_util}%, Memory Used: {mem_used:.2f} GB / {mem_total:.2f} GB")
            
    # Initialize Trainer with DeepSpeed and callbacks
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        #processing_class=tokenizer,
        data_collator=data_collator,
        callbacks=[GPUStatsCallback()],  # Add the custom callback here
    )

    # Start training
    trainer.train()
    
    
    # Save the model - **all processes must call this**
    print("Model trained successfully, proceeding to save...")
    trainer.save_model(args.model_dir)  # All processes call this

    # Only the main process handles tokenizer saving, model card creation, and cleanup
    if trainer.is_world_process_zero():
        print("Saving tokenizer and creating model card...")
        tokenizer.save_pretrained(args.model_dir)
        trainer.create_model_card()

        print("Saving completed. Verifying saved files...")
        # List the contents of the model directory
        list_dir_contents(args.model_dir)

        # Optionally, remove the checkpoints directory to free up space
        try:
            shutil.rmtree(checkpoints_dir)
            print(f"Removed checkpoints directory: {checkpoints_dir}")
        except Exception as e:
            print(f"Error removing checkpoints directory: {e}")
        
        list_dir_contents(args.model_dir)

        print("Done!")



if __name__ == "__main__":
    main()


Overwriting train_deploy_huggingface.py
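
The GPU callback in the script above turns one CSV line from nvidia-smi into the per-step message seen in the job logs. The parsing step can be sketched and tested on its own, without a GPU. `parse_gpu_query` is a hypothetical helper, and the sample line is an assumption reconstructed from values in this run's log, not captured nvidia-smi output.

```python
def parse_gpu_query(line: str) -> dict:
    """Parse one line of
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,nounits,noheader
    """
    util, used_mb, total_mb = (field.strip() for field in line.strip().split(","))
    return {
        "utilization_pct": int(util),
        # Same conversion as the callback: divide the reported value by 1e3
        "memory_used_gb": float(used_mb) / 1e3,
        "memory_total_gb": float(total_mb) / 1e3,
    }


sample = "70, 44400, 81920"  # assumed sample line
stats = parse_gpu_query(sample)
message = (
    f"Utilization: {stats['utilization_pct']}%, "
    f"Memory Used: {stats['memory_used_gb']:.2f} GB / {stats['memory_total_gb']:.2f} GB"
)
print(message)
```

Keeping the parser separate from the callback makes it easy to check the field order whenever the `--query-gpu` list changes.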


**4-Create the requirements file**

Because the job runs inside a Docker container, the extra libraries must be installed in a single step when the container starts. All of them are therefore listed in this requirements.txt.

In [5]:
%%writefile requirements.txt

transformers
torch
datasets
accelerate>=0.26.0
sentencepiece
bitsandbytes
peft
pyarrow
deepspeed==0.15.4

Overwriting requirements.txt


**4-Deepspeed Configuration**

Since model parallelism and data parallelism alone are not enough to train models of this size, we use DeepSpeed, specifically ZeRO Stage 3. This shards parameters, gradients, and optimizer state across all GPUs of the p4d and p4de instances.

**Important Note:** The DeepSpeed configuration is tied to the Hugging Face TrainingArguments, so make sure both definitions match (the one in step 3 and this one). On p4d, whose A100s have 40 GB each, set "device" to "cpu" in both offload sections: parameters and optimizer state must be offloaded to CPU or the job will crash. On p4de, whose A100s have 80 GB each, set it to "none"; training then takes roughly half the time (about 30 minutes instead of 1 hour for 7~8B models).
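
The offload switch described above can be sketched as a small helper that edits a config dictionary before it is written to disk. `with_offload` and `offload_for_instance` are hypothetical names, and the rule "anything that is not p4de offloads to CPU" is a conservative assumption, not something the original notebook states.

```python
import copy
import json


def with_offload(config: dict, device: str) -> dict:
    """Return a copy of a ZeRO config with both offload targets set to `device`."""
    if device not in ("cpu", "none"):
        raise ValueError("device must be 'cpu' or 'none'")
    updated = copy.deepcopy(config)
    zero = updated.setdefault("zero_optimization", {})
    for section in ("offload_optimizer", "offload_param"):
        zero.setdefault(section, {"pin_memory": True})["device"] = device
    return updated


def offload_for_instance(instance_type: str) -> str:
    """p4de keeps everything on GPU; any other type offloads to CPU (assumption)."""
    return "none" if instance_type.startswith("ml.p4de") else "cpu"


base = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "none", "pin_memory": True},
        "offload_param": {"device": "none", "pin_memory": True},
    }
}
p4d_config = with_offload(base, offload_for_instance("ml.p4d.24xlarge"))
print(json.dumps(p4d_config["zero_optimization"], indent=2))
```

Working on a copy keeps the p4de variant intact, so the same base dictionary can produce both files.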

In [6]:
%%writefile deepspeed_config.json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "steps_per_print": 100,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 50000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 50000000,
    "contiguous_gradients": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  },
  "wall_clock_breakdown": false
}




Overwriting deepspeed_config.json
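
Since the values in this file have to agree with the hyperparameters passed to the training script, a quick comparison before launching can save a failed job. The sketch below is one way to do that check; `find_mismatches` is a hypothetical helper, and it skips entries set to "auto", which the Hugging Face DeepSpeed integration fills in from TrainingArguments.

```python
def find_mismatches(hyperparameters: dict, ds_config: dict) -> list:
    """List settings whose values differ between the job hyperparameters and the DeepSpeed config."""
    pairs = [
        ("batch_size", ds_config.get("train_micro_batch_size_per_gpu")),
        ("gradient_accumulation_steps", ds_config.get("gradient_accumulation_steps")),
        ("learning_rate", ds_config.get("optimizer", {}).get("params", {}).get("lr")),
    ]
    problems = []
    for name, ds_value in pairs:
        if ds_value == "auto":  # resolved by the Trainer, nothing to compare
            continue
        hp_value = hyperparameters.get(name)
        if hp_value != ds_value:
            problems.append(f"{name}: job={hp_value!r}, deepspeed={ds_value!r}")
    return problems


ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
matching = {"batch_size": 1, "gradient_accumulation_steps": 4, "learning_rate": 1e-5}
script_defaults = {"batch_size": 1, "gradient_accumulation_steps": 16, "learning_rate": 1e-5}

print(find_mismatches(matching, ds_config))
print(find_mismatches(script_defaults, ds_config))
```

The second call uses the argparse defaults of the training script, which shows why the hyperparameters should always be passed explicitly: the script's default of 16 accumulation steps disagrees with the 4 in this config.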


**5-Deepspeed Launcher**

Unfortunately, DeepSpeed cannot be launched from inside the training script of step 3. Because of the way the AWS container starts the entry point, the script would already be running as a single process and would not see the rest of the GPUs. To work around this, a small launcher calls the deepspeed command line, which creates all the required processes (8 in this case, one per GPU) and coordinates them.

In [7]:
%%writefile ds_launcher.py
import sys
import os
import subprocess
import json
import logging
from argparse import ArgumentParser

logger = logging.getLogger(__name__)


def parse_args():
    parser = ArgumentParser(
        description=("SageMaker DeepSpeed Launch helper utility that will spawn deepspeed training scripts")
    )

    # rest from the training program
    parsed, nargs = parser.parse_known_args()

    return nargs


def main():
    # https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/launch.py
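    num_gpus = int(os.environ.get("SM_NUM_GPUS", 0))
    # SM_HOSTS is a JSON list of host names, so the fallback must be a list too
    hosts = json.loads(os.environ.get("SM_HOSTS", "[]"))
    num_nodes = len(hosts)
    current_host = os.environ.get("SM_CURRENT_HOST", "")
    rank = hosts.index(current_host) if current_host in hosts else 0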
    print(f"num_gpus = {num_gpus}, num_nodes = {num_nodes}, current_host = {current_host}, rank = {rank}")

    # os.environ['NCCL_DEBUG'] = 'INFO'

    # get number of GPU
    # if num_gpus == 0:
    #     raise ValueError("No GPUs found.")

    args = parse_args()
    command = f"deepspeed --num_gpus={num_gpus} train_deploy_huggingface.py {' '.join(args)}"
    print(f"command = {command}")
    # launch deepspeed training
    deepspeed_launch(command)


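def deepspeed_launch(command):
    # subprocess.run does not raise on a non-zero exit code, so propagate it
    # explicitly; otherwise a failed training run would be reported as a success.
    result = subprocess.run(command, shell=True)
    if result.returncode != 0:
        logger.error("DeepSpeed exited with code %s", result.returncode)
        sys.exit(result.returncode)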


if __name__ == "__main__":
    main()

Overwriting ds_launcher.py
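
The launcher above builds its command with a plain string join, which breaks if a forwarded argument contains spaces or shell characters. A slightly safer variant, sketched here as an alternative rather than as what the notebook does, quotes every part with `shlex.join`. `build_deepspeed_command` is a hypothetical helper name.

```python
import shlex


def build_deepspeed_command(num_gpus: int, script: str, extra_args: list) -> str:
    """Build a shell-safe `deepspeed` command line for a single node."""
    if num_gpus < 1:
        raise ValueError("No GPUs found.")
    parts = ["deepspeed", f"--num_gpus={num_gpus}", script, *extra_args]
    return shlex.join(parts)  # quotes any argument that needs it


command = build_deepspeed_command(
    8,
    "train_deploy_huggingface.py",
    ["--batch_size", "1", "--epochs", "4"],
)
print(command)
```

The resulting string can be passed to `subprocess.run(command, shell=True)` exactly as in the launcher.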


**6-Sagemaker Training Job Creation**

This step ties everything together. All the files created above are passed to the estimator so that they can be run inside the container.

**Important Note:** Set the instance type to the one you need (p4d or p4de), using the exact name AWS expects: **ml.p4de.24xlarge** or **ml.p4d.24xlarge**. Also make sure that all hyperparameters match the values defined in the previous steps.
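
These hyperparameters also fix the effective batch size and the length of the run. As a rough sketch of the arithmetic: the `num_samples` value below is a hypothetical dataset size chosen so that the result matches the 196 optimizer steps shown in this run's log (the real size is not stated), and the rounding is an approximation that may differ slightly between transformers versions.

```python
import math


def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device_batch * grad_accum


def optimizer_steps(num_samples: int, num_gpus: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer steps for the whole run."""
    # Each GPU sees roughly num_samples / num_gpus samples per epoch
    batches_per_epoch = math.ceil(num_samples / (num_gpus * per_device_batch))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


batch = effective_batch_size(num_gpus=8, per_device_batch=1, grad_accum=4)
steps = optimizer_steps(num_samples=1568, num_gpus=8, per_device_batch=1,
                        grad_accum=4, epochs=4)
print(batch, steps)
```

With 8 GPUs, a per-device batch of 1, and 4 accumulation steps, each optimizer step consumes 32 samples, so doubling `gradient_accumulation_steps` halves the step count without changing memory use per GPU.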

In [8]:
from sagemaker.pytorch import PyTorch

# Replace with the ARN of your SageMaker execution role (an IAM role ARN, not access keys)
role = "YOUR_AWS_KEYS"

distribution = {
    "deepspeed": {
        "enabled": True,
        "config_path": "deepspeed_config.json"
    }
}

estimator = PyTorch(
    entry_point="ds_launcher.py",
    role=role,
    source_dir=".",  # Ensure 'deepspeed_config.json' is included here
    instance_count=1,
    instance_type="ml.p4de.24xlarge",  # Instance with 8 GPUs
    framework_version="2.1.0",  # Ensure compatibility with DeepSpeed
    py_version="py310",
    dependencies=["requirements.txt"],
    hyperparameters={
        "epochs": 4,
        "batch_size": 1,
        "learning_rate": 1e-5,
        "gradient_accumulation_steps": 4,
        "max_length": 4000,  # Maximum sequence length in tokens
    },
    distribution=distribution,
    output_path=f"s3://{bucket_name}/{s3_prefix}/model",
)

# Define S3 URIs for TSV data
train_s3_uri = f"s3://{bucket_name}/{train_s3_path}"
validation_s3_uri = f"s3://{bucket_name}/{validation_s3_path}"

# Fit the model
estimator.fit({"train": train_s3_uri, "validation": validation_s3_uri})


2025-01-22 06:18:30 Starting - Starting the training job
2025-01-22 06:18:30 Pending - Training job waiting for capacity......
2025-01-22 06:19:27 Pending - Preparing the instances for training.....................
2025-01-22 06:23:37 Downloading - Downloading input data...
2025-01-22 06:23:52 Downloading - Downloading the training image............
2025-01-22 06:26:14 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-01-22 06:27:15,695 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-01-22 06:27:15,807 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-01-22 06:27:15,816 sagemaker_pytorch_container.training INFO     Block until all host DNS look

Building wheel for deepspeed (setup.py): finished with status 'done'
Created wheel for deepspeed: filename=deepspeed-0.15.4-py3-none-any.whl size=1527840 sha256=ed1cba0eed7ebcea08fc7262745fb323b71d623f5c09946623d42b28600f19bd
Stored in directory: /root/.cache/pip/wheels/74/bc/b6/836d7c3e3093e25502fa9248e0be9e943db245f2806ba1cd19
Successfully built deepspeed
Installing collected packages: sentencepiece, py-cpuinfo, nvidia-ml-py, hjson, xxhash, safetensors, regex, propcache, multidict, msgpack, frozenlist, async-timeout, aiohappyeyeballs, yarl, huggingface-hub, aiosignal, tokenizers, deepspeed, bitsandbytes, aiohttp, accelerate, transformers, peft, datasets
Attempting uninstall: accelerate
Found existing installation: accelerate 0.22.0
Uninstalling accelerate-0.22.0:
Successfully uninstalled accelerate-0.22.0
Successfully installed accelerate-1.3.0 aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2

[2025-01-22 06:27:55,630] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[2025-01-22 06:27:57,493] [INFO] [runner.py:607:main] cmd = /opt/conda/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_deploy_huggingface.py --batch_size 1 --epochs 4 --gradient_accumulation_steps 4 --learning_rate 1e-05 --max_length 4000
[2025-01-22 06:28:02,586] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-22 06:28:04,467] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=WARN
[2025-01-22 06:28:04,467] [INFO] [launch.py:139:main] 0 NCCL_SOCKET_IFNAME=eth0
[2025-01-22 06:28:04,467] [INFO] [launch.py:139:main] 0 NCCL_ASYNC_ERROR_HANDLING=1
[2025-01-2

Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]
Downloading shards:  14%|█▍        | 1/7 [01:57<11:45, 117.59s/it]
Downloading shards:  14%|█▍        | 1/7 [01:57<11:45, 117.55s/it]
Downloading shards:  14%|█▍        | 1/7 [01:57<11:45, 117.58s/it]
Downloading shards:  14%|█▍        | 1/7 [01:57<11:45, 117.58s/it]
Downloading shards:  14%|█▍        | 1/7 [01:57<11:45, 117.59s/it]
Downloading shards:  14%|█▍        | 1/7 [01:57<11:45, 117.61s/it]
Dow

Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  1.89s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  1.96s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  1.96s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  1.94s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  1.96s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  1.98s/it]
Loading checkpoint shards:  71%|███████▏  | 5/7 [00:09<00:03,  2.00s/it]
Dataset example:
{'input_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... (left-padding token id 0 repeated; output truncated)

[1/3] /opt/conda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_G

After step 1: GPU 3, Utilization: 70%, Memory Used: 44.40 GB / 81.92 GB
After step 1: GPU 6, Utilization: 42%, Memory Used: 47.64 GB / 81.92 GB
After step 1: GPU 5, Utilization: 67%, Memory Used: 45.73 GB / 81.92 GB
After step 1: GPU 0, Utilization: 54%, Memory Used: 44.02 GB / 81.92 GB
1%|          | 1/196 [00:13<42:27, 13.06s/it]
After step 1: GPU 1, Utilization: 75%, Memory Used: 43.51 GB / 81.92 GB
After step 1: GPU 4, Utilization: 54%, Memory Used: 45.63 GB / 81.92 GB
After step 1: GPU 7, Utilization: 55%, Memory Used: 45.64 GB / 81.92 GB
After step 1: GPU 2, Utilization: 34%, Memory Used: 45.78 GB / 81.92 GB
After step 2: GPU 1, Utilization: 93%, Memory Used: 45.28 GB / 81.92 GB
After step 2: GPU 3, Utilization: 83%, Memory Used: 45.20 GB / 81.92 GB
After step 2: GPU 2, Utilization: 43%, Memory Used: 45.78 GB / 81.92 GB
After step 2: GPU 6, Utilization: 51%, Memory Use

After step 13: GPU 6, Utilization: 100%, Memory Used: 48.44 GB / 81.92 GB
After step 13: GPU 2, Utilization: 15%, Memory Used: 45.78 GB / 81.92 GB
After step 13: GPU 1, Utilization: 22%, Memory Used: 45.67 GB / 81.92 GB
After step 13: GPU 0, Utilization: 0%, Memory Used: 44.21 GB / 81.92 GB
7%|▋         | 13/196 [02:41<37:46, 12.38s/it]
After step 13: GPU 3, Utilization: 0%, Memory Used: 45.81 GB / 81.92 GB
After step 13: GPU 7, Utilization: 42%, Memory Used: 46.17 GB / 81.92 GB
After step 13: GPU 5, Utilization: 0%, Memory Used: 45.93 GB / 81.92 GB
After step 13: GPU 4, Utilization: 0%, Memory Used: 45.86 GB / 81.92 GB
After step 14: GPU 2, Utilization: 100%, Memory Used: 45.78 GB / 81.92 GB
After step 14: GPU 0, Utilization: 6%, Memory Used: 44.21 GB / 81.92 GB
After step 14: GPU 3, Utilization: 98%, Memory Used: 45.81 GB / 81.92 GB
7%|▋         | 14/196 [02:53<37:33, 12.3

After step 25: GPU 0, Utilization: 0%, Memory Used: 44.21 GB / 81.92 GB
13%|█▎        | 25/196 [05:10<35:26, 12.44s/it]
After step 25: GPU 3, Utilization: 6%, Memory Used: 45.81 GB / 81.92 GB
After step 25: GPU 2, Utilization: 0%, Memory Used: 45.78 GB / 81.92 GB
After step 25: GPU 4, Utilization: 23%, Memory Used: 45.86 GB / 81.92 GB
After step 25: GPU 5, Utilization: 28%, Memory Used: 45.93 GB / 81.92 GB
After step 25: GPU 7, Utilization: 11%, Memory Used: 46.17 GB / 81.92 GB
After step 25: GPU 1, Utilization: 0%, Memory Used: 45.67 GB / 81.92 GB
After step 25: GPU 6, Utilization: 51%, Memory Used: 48.44 GB / 81.92 GB
After step 26: GPU 1, Utilization: 100%, Memory Used: 45.67 GB / 81.92 GB
After step 26: GPU 4, Utilization: 92%, Memory Used: 45.86 GB / 81.92 GB
After step 26: GPU 2, Utilization: 71%, Memory Used: 45.78 GB / 81.92 GB
After step 26: GPU 6, Utilization: 98%,

After step 37: GPU 1, Utilization: 95%, Memory Used: 45.78 GB / 81.92 GB
After step 37: GPU 5, Utilization: 95%, Memory Used: 46.04 GB / 81.92 GB
After step 37: GPU 2, Utilization: 100%, Memory Used: 46.59 GB / 81.92 GB
After step 37: GPU 6, Utilization: 70%, Memory Used: 48.44 GB / 81.92 GB
After step 37: GPU 4, Utilization: 69%, Memory Used: 45.86 GB / 81.92 GB
After step 37: GPU 7, Utilization: 86%, Memory Used: 46.17 GB / 81.92 GB
After step 37: GPU 0, Utilization: 93%, Memory Used: 44.21 GB / 81.92 GB
After step 37: GPU 3, Utilization: 98%, Memory Used: 45.92 GB / 81.92 GB
19%|█▉        | 37/196 [07:38<32:37, 12.31s/it]
After step 38: GPU 1, Utilization: 94%, Memory Used: 45.78 GB / 81.92 GB
After step 38: GPU 2, Utilization: 100%, Memory Used: 46.59 GB / 81.92 GB
After step 38: GPU 4, Utilization: 98%, Memory Used: 45.86 GB / 81.92 GB
After step 38: GPU 6, Utilization:

After step 49: GPU 1, Utilization: 95%, Memory Used: 45.78 GB / 81.92 GB
After step 49: GPU 6, Utilization: 35%, Memory Used: 48.44 GB / 81.92 GB
After step 49: GPU 0, Utilization: 81%, Memory Used: 44.21 GB / 81.92 GB
25%|██▌       | 49/196 [10:07<30:26, 12.43s/it]
After step 49: GPU 2, Utilization: 82%, Memory Used: 46.59 GB / 81.92 GB
After step 49: GPU 7, Utilization: 7%, Memory Used: 46.17 GB / 81.92 GB
After step 49: GPU 3, Utilization: 0%, Memory Used: 45.92 GB / 81.92 GB
After step 49: GPU 4, Utilization: 40%, Memory Used: 45.86 GB / 81.92 GB
After step 49: GPU 5, Utilization: 30%, Memory Used: 46.04 GB / 81.92 GB
After step 50: GPU 2, Utilization: 92%, Memory Used: 46.59 GB / 81.92 GB
After step 50: GPU 4, Utilization: 100%, Memory Used: 45.86 GB / 81.92 GB
After step 50: GPU 1, Utilization: 100%, Memory Used: 45.78 GB / 81.92 GB
After step 50: GPU 6, Utilization: 1

39%|███▉      | 15/38 [00:11<00:19,  1.20it/s]
42%|████▏     | 16/38 [00:12<00:18,  1.20it/s]
45%|████▍     | 17/38 [00:13<00:17,  1.20it/s]
47%|████▋     | 18/38 [00:14<00:16,  1.20it/s]
50%|█████     | 19/38 [00:14<00:15,  1.20it/s]
53%|█████▎    | 20/38 [00:15<00:15,  1.20it/s]
55%|█████▌    | 21/38 [00:16<00:14,  1.19it/s]
58%|█████▊    | 22/38 [00:17<00:13,  1.20it/s]
61%|██████    | 23/38 [00:18<00:12,  1.20it/s]
63%|██████▎   | 24/38 [00:19<00:11,  1.20it/s]
66%|██████▌   | 25/38 [00:19<00:10,  1.20it/s]
68%|██████▊   | 26/38 [00:20<00:09,  1.21it/s]
71%|███████   | 27/38 [00:21<00:09,  1.20it/s]
74%|███████▎  | 28/38 [00:22<00:08,  1.20it/s]
76%|███████▋  | 29/38 [00:23<00:07,  1.19it/s]
79%|███████▉  | 30/38 [00:24<00:06,  1.20it/s]
82%

[34mAfter step 51: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 51: GPU 4, Utilization: 31%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 51: GPU 1, Utilization: 54%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 51: GPU 3, Utilization: 28%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 51: GPU 6, Utilization: 69%, Memory Used: 49.30 GB / 81.92 GB
After step 51: GPU 5, Utilization: 0%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 51: GPU 0, Utilization: 8%, Memory Used: 46.05 GB / 81.92 GB[0m
[34m26%|██▌       | 51/196 [11:54<1:32:03, 38.09s/it][0m
[34mAfter step 51: GPU 7, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 52: GPU 3, Utilization: 81%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 52: GPU 4, Utilization: 79%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 52: GPU 2, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 52: GPU 5, Utilization: 7%, Memory U

[34mAfter step 63: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 63: GPU 1, Utilization: 43%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 63: GPU 4, Utilization: 88%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 63: GPU 5, Utilization: 49%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 63: GPU 6, Utilization: 18%, Memory Used: 49.30 GB / 81.92 GB[0m
[34mAfter step 63: GPU 3, Utilization: 55%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 63: GPU 0, Utilization: 55%, Memory Used: 46.05 GB / 81.92 GB[0m
[34m32%|███▏      | 63/196 [14:22<28:13, 12.73s/it][0m
[34mAfter step 63: GPU 7, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 64: GPU 4, Utilization: 75%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 64: GPU 5, Utilization: 50%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 64: GPU 1, Utilization: 53%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 64: GPU 2, Utilization: 3

[34mAfter step 75: GPU 4, Utilization: 39%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 75: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 75: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 75: GPU 5, Utilization: 48%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 75: GPU 0, Utilization: 88%, Memory Used: 46.05 GB / 81.92 GB[0m
[34m38%|███▊      | 75/196 [16:51<24:58, 12.38s/it][0m
[34mAfter step 75: GPU 3, Utilization: 99%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 75: GPU 7, Utilization: 19%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 75: GPU 6, Utilization: 58%, Memory Used: 49.30 GB / 81.92 GB[0m
[34mAfter step 76: GPU 1, Utilization: 97%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 76: GPU 3, Utilization: 74%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 76: GPU 5, Utilization: 100%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 76: GPU 4, Utilization

[34mAfter step 87: GPU 4, Utilization: 83%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 87: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 87: GPU 6, Utilization: 78%, Memory Used: 49.30 GB / 81.92 GB[0m
[34mAfter step 87: GPU 5, Utilization: 100%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 87: GPU 7, Utilization: 16%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 87: GPU 1, Utilization: 0%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 87: GPU 3, Utilization: 36%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 87: GPU 0, Utilization: 26%, Memory Used: 46.05 GB / 81.92 GB[0m
[34m44%|████▍     | 87/196 [19:19<22:28, 12.37s/it][0m
[34mAfter step 88: GPU 2, Utilization: 91%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 88: GPU 4, Utilization: 52%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 88: GPU 6, Utilization: 58%, Memory Used: 49.30 GB / 81.92 GB[0m
[34mAfter step 88: GPU 7, Utilization: 

[34mAfter step 99: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 99: GPU 2, Utilization: 75%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 99: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 99: GPU 6, Utilization: 89%, Memory Used: 49.30 GB / 81.92 GB[0m
[34mAfter step 99: GPU 5, Utilization: 59%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 99: GPU 0, Utilization: 45%, Memory Used: 46.05 GB / 81.92 GB[0m
[34mAfter step 99: GPU 3, Utilization: 53%, Memory Used: 47.56 GB / 81.92 GB[0m
[34m51%|█████     | 99/196 [21:48<19:56, 12.34s/it][0m
[34mAfter step 99: GPU 7, Utilization: 85%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 100: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 100: GPU 5, Utilization: 99%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 100: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 100: GPU 6, Utiliz

[34m21%|██        | 8/38 [00:05<00:23,  1.25it/s]#033[A[0m
[34m24%|██▎       | 9/38 [00:06<00:23,  1.24it/s]#033[A[0m
[34m26%|██▋       | 10/38 [00:07<00:22,  1.22it/s]#033[A[0m
[34m29%|██▉       | 11/38 [00:08<00:22,  1.22it/s]#033[A[0m
[34m32%|███▏      | 12/38 [00:09<00:21,  1.21it/s]#033[A[0m
[34m34%|███▍      | 13/38 [00:09<00:20,  1.20it/s]#033[A[0m
[34m37%|███▋      | 14/38 [00:10<00:19,  1.20it/s]#033[A[0m
[34m39%|███▉      | 15/38 [00:11<00:19,  1.20it/s]#033[A[0m
[34m42%|████▏     | 16/38 [00:12<00:18,  1.20it/s]#033[A[0m
[34m45%|████▍     | 17/38 [00:13<00:17,  1.20it/s]#033[A[0m
[34m47%|████▋     | 18/38 [00:14<00:16,  1.20it/s]#033[A[0m
[34m50%|█████     | 19/38 [00:14<00:15,  1.20it/s]#033[A[0m
[34m53%|█████▎    | 20/38 [00:15<00:15,  1.20it/s]#033[A[0m
[34m55%|█████▌    | 21/38 [00:16<00:14,  1.20it/s]#033[A[0m
[34m58%|█████▊    | 22/38 [00:17<00:13,  1.20it/s]#033[A[0m
[34m61%|██████    | 23/38 [00:18<00:12,  1.20it/s]#033[A[0m
[34m63%|█

[34mAfter step 101: GPU 0, Utilization: 100%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m52%|█████▏    | 101/196 [23:34<59:49, 37.79s/it][0m
[34mAfter step 101: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 101: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 101: GPU 6, Utilization: 100%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 101: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 101: GPU 5, Utilization: 100%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 101: GPU 3, Utilization: 14%, Memory Used: 47.56 GB / 81.92 GB
After step 101: GPU 7, Utilization: 50%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 102: GPU 0, Utilization: 100%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m52%|█████▏    | 102/196 [23:47<47:13, 30.14s/it][0m
[34mAfter step 102: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 102: GPU 4, Utilization: 100%, Memory Us

[34mAfter step 113: GPU 0, Utilization: 76%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m58%|█████▊    | 113/196 [26:00<17:17, 12.50s/it][0m
[34mAfter step 113: GPU 1, Utilization: 73%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 113: GPU 4, Utilization: 44%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 113: GPU 3, Utilization: 66%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 113: GPU 6, Utilization: 59%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 113: GPU 7, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 113: GPU 2, Utilization: 67%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 113: GPU 5, Utilization: 55%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 114: GPU 0, Utilization: 100%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m58%|█████▊    | 114/196 [26:12<16:55, 12.39s/it][0m
[34mAfter step 114: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 114: GPU 3, Utilization: 100%, Memo

[34mAfter step 125: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 125: GPU 6, Utilization: 100%, Memory Used: 49.41 GB / 81.92 GB
After step 125: GPU 3, Utilization: 85%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 125: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 125: GPU 0, Utilization: 93%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m64%|██████▍   | 125/196 [28:27<14:30, 12.26s/it][0m
[34mAfter step 125: GPU 7, Utilization: 61%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 125: GPU 5, Utilization: 67%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 125: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 126: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 126: GPU 0, Utilization: 84%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m64%|██████▍   | 126/196 [28:39<14:17, 12.25s/it][0m
[34mAfter step 126: GPU 3, Utilization: 76%, Memory Used: 

[34mAfter step 137: GPU 0, Utilization: 82%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m70%|██████▉   | 137/196 [30:55<12:08, 12.34s/it][0m
[34mAfter step 137: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 137: GPU 3, Utilization: 100%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 137: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 137: GPU 6, Utilization: 88%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 137: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 137: GPU 5, Utilization: 100%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 137: GPU 7, Utilization: 83%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 138: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 138: GPU 2, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 138: GPU 6, Utilization: 100%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 138: 

[34mAfter step 149: GPU 0, Utilization: 100%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m76%|███████▌  | 149/196 [33:22<09:34, 12.23s/it][0m
[34mAfter step 149: GPU 3, Utilization: 41%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 149: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 149: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 149: GPU 6, Utilization: 0%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 149: GPU 5, Utilization: 36%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 149: GPU 2, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 149: GPU 7, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 150: GPU 3, Utilization: 100%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 150: GPU 5, Utilization: 100%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 150: GPU 4, Utilization: 100%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 150: GPU 6

[34m24%|██▎       | 9/38 [00:06<00:23,  1.24it/s]#033[A[0m
[34m26%|██▋       | 10/38 [00:07<00:22,  1.23it/s]#033[A[0m
[34m29%|██▉       | 11/38 [00:08<00:22,  1.22it/s]#033[A[0m
[34m32%|███▏      | 12/38 [00:09<00:21,  1.22it/s]#033[A[0m
[34m34%|███▍      | 13/38 [00:09<00:20,  1.21it/s]#033[A[0m
[34m37%|███▋      | 14/38 [00:10<00:19,  1.21it/s]#033[A[0m
[34m39%|███▉      | 15/38 [00:11<00:19,  1.21it/s]#033[A[0m
[34m42%|████▏     | 16/38 [00:12<00:18,  1.21it/s]#033[A[0m
[34m45%|████▍     | 17/38 [00:13<00:17,  1.21it/s]#033[A[0m
[34m47%|████▋     | 18/38 [00:14<00:16,  1.20it/s]#033[A[0m
[34m50%|█████     | 19/38 [00:14<00:15,  1.20it/s]#033[A[0m
[34m53%|█████▎    | 20/38 [00:15<00:14,  1.20it/s]#033[A[0m
[34m55%|█████▌    | 21/38 [00:16<00:14,  1.21it/s]#033[A[0m
[34m58%|█████▊    | 22/38 [00:17<00:13,  1.20it/s]#033[A[0m
[34m61%|██████    | 23/38 [00:18<00:12,  1.21it/s]#033[A[0m
[34m63%|██████▎   | 24/38 [00:19<00:11,  1.21it/s]#033[A[0m
[34m66%|

[34mhuggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...[0m
[34m#011- Avoid using `tokenizers` before the fork if possible[0m
[34m#011- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)[0m

[34mAfter step 161: GPU 7, Utilization: 100%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 161: GPU 0, Utilization: 7%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m82%|████████▏ | 161/196 [37:24<07:39, 13.12s/it][0m
[34mAfter step 161: GPU 1, Utilization: 86%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 161: GPU 6, Utilization: 0%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 161: GPU 2, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 161: GPU 4, Utilization: 16%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 161: GPU 3, Utilization: 0%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 161: GPU 5, Utilization: 68%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 162: GPU 4, Utilization: 71%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 162: GPU 0, Utilization: 73%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m83%|████████▎ | 162/196 [37:36<07:17, 12.87s/it][0m
[34mAfter step 162: GPU 2, Utilization: 54%, Memory Used

[34mAfter step 173: GPU 0, Utilization: 0%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m88%|████████▊ | 173/196 [39:52<04:45, 12.40s/it][0m
[34mAfter step 173: GPU 1, Utilization: 6%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 173: GPU 3, Utilization: 0%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 173: GPU 6, Utilization: 63%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 173: GPU 4, Utilization: 0%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 173: GPU 2, Utilization: 22%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 173: GPU 5, Utilization: 12%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 173: GPU 7, Utilization: 0%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 174: GPU 0, Utilization: 77%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m89%|████████▉ | 174/196 [40:04<04:32, 12.37s/it][0m
[34mAfter step 174: GPU 1, Utilization: 100%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 174: GPU 3, Utilization: 71%, Memory Used:

[34mAfter step 185: GPU 0, Utilization: 100%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m94%|█████████▍| 185/196 [42:20<02:15, 12.36s/it][0m
[34mAfter step 185: GPU 3, Utilization: 100%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 185: GPU 5, Utilization: 97%, Memory Used: 47.43 GB / 81.92 GB[0m
[34mAfter step 185: GPU 4, Utilization: 64%, Memory Used: 47.64 GB / 81.92 GB[0m
[34mAfter step 185: GPU 7, Utilization: 75%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 185: GPU 6, Utilization: 79%, Memory Used: 49.41 GB / 81.92 GB[0m
[34mAfter step 185: GPU 2, Utilization: 91%, Memory Used: 47.58 GB / 81.92 GB[0m
[34mAfter step 185: GPU 1, Utilization: 60%, Memory Used: 47.70 GB / 81.92 GB[0m
[34mAfter step 186: GPU 3, Utilization: 100%, Memory Used: 47.56 GB / 81.92 GB[0m
[34mAfter step 186: GPU 0, Utilization: 100%, Memory Used: 46.16 GB / 81.92 GB[0m
[34m95%|█████████▍| 186/196 [42:32<02:03, 12.33s/it][0m
[34mAfter step 186: GPU 5, Utilization: 100%, Mem

[34mSaving model checkpoint to /opt/ml/model/checkpoints/checkpoint-196[0m
[34mConfiguration saved in /opt/ml/model/checkpoints/checkpoint-196/config.json[0m
[34mConfiguration saved in /opt/ml/model/checkpoints/checkpoint-196/generation_config.json[0m
[34mThe model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoints/checkpoint-196/model.safetensors.index.json.[0m

[34m***** Running Evaluation *****
  Num examples = 297
  Batch size = 1[0m
[34mhuggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...[0m
[34m#011- Avoid using `tokenizers` before the fork if possible[0m
[34m#011- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)[0m

[34mThe model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/model.safetensors.index.json.[0m
[34mtokenizer config file saved in /opt/ml/model/tokenizer_config.json[0m
[34mSpecial tokens file saved in /opt/ml/model/special_tokens_map.json[0m
[34mSaving tokenizer and creating model card...[0m
[34mtokenizer config file saved in /opt/ml/model/tokenizer_config.json[0m
[34mSpecial 