## APO: Alibaba Price Oracle
The idea is to process the 7-day price history jointly across the market and across time.

The main component is a self-attention unit, applied first along the time axis and then along the market axis.

The market consists of the detected airlines that have sufficient data.
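
To make the two-stage attention concrete, the sketch below applies parameter-free scaled dot-product self-attention along time and then along the market. It is an illustration, not the APO model: the `[time][airline][feature]` layout, the single head, the missing learned query/key/value projections, and all function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention without learned projections.

    Each output vector is a convex combination of all vectors in `seq`.
    """
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in seq]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, seq)) for j in range(dim)])
    return out


def axial_self_attention(x: List[List[Vector]]) -> List[List[Vector]]:
    """Attend along time for each airline, then along the market for each time step.

    `x` is assumed to be indexed as x[time][airline][feature].
    """
    n_time, n_market = len(x), len(x[0])
    # Pass 1: for every airline, attend over its own time series.
    per_airline = [self_attention([x[t][m] for t in range(n_time)]) for m in range(n_market)]
    x = [[per_airline[m][t] for m in range(n_market)] for t in range(n_time)]
    # Pass 2: for every time step, attend over all airlines in the market.
    return [self_attention(x[t]) for t in range(n_time)]


# Toy input: 4 time steps, 3 airlines, 2 features per entry.
prices = [[[float(t + m), 1.0] for m in range(3)] for t in range(4)]
mixed = axial_self_attention(prices)
```

The first pass mixes information across days within each airline's history; the second pass mixes across airlines at each day. The output keeps the input shape.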

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from alibaba_ai_task.train.trainer import train_apo_once
from alibaba_ai_task.tools.omni_tools import get_support_data_dir

support_dir = get_support_data_dir()

num_gpus = 2
num_cpus = 6

### Training
Training can be controlled via the 'train_apo_once' interface.
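
The training cell in this notebook passes a flat dict whose keys use dots, such as 'train_parms.optimizer.args.lr'. How 'train_apo_once' interprets these keys is not shown in this notebook. One common convention, sketched here purely as an assumption, is to expand each dotted key into a nested configuration; `expand_dotted` is a hypothetical helper, not part of the package.

```python
from typing import Any, Dict


def expand_dotted(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand {'a.b.c': 1} into {'a': {'b': {'c': 1}}} (hypothetical helper)."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


overrides = expand_dotted({
    "train_parms.batch_size": 32,
    "train_parms.optimizer.args.lr": 1e-3,
    "trainer.max_epochs": 500,
})
```

Under this reading, the flat keys act as overrides on top of a default nested configuration, which keeps the call site short.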

We recommend running training from a script, since the 'ddp' multi-GPU training strategy doesn't work in a Jupyter environment.

Simply run: python -m alibaba_ai_task.train.V01.V01
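
If the same configuration should work both in a notebook and in a script, the strategy name can be chosen at run time. The sketch below is an assumption-laden illustration: `running_in_notebook` and `pick_strategy` are hypothetical helpers, and checking for the `ipykernel` module is only a heuristic for detecting a Jupyter kernel.

```python
import sys


def running_in_notebook() -> bool:
    """Heuristic: Jupyter kernels import the `ipykernel` package."""
    return "ipykernel" in sys.modules


def pick_strategy(interactive: bool) -> str:
    """Return a multi-GPU strategy name suited to the environment."""
    # 'ddp' is rejected in interactive environments (see the error quoted in
    # the training cell), while 'ddp_spawn' is listed there as compatible.
    return "ddp_spawn" if interactive else "ddp"


strategy = pick_strategy(running_in_notebook())
```

The resulting value could then be passed as 'trainer.strategy' in the training call.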

In [3]:
train_apo_once({
        'apo.expr_id': 'V01',
        'apo.data_id': 'V01',

        'dirs.support_base_dir': support_dir,
        'dirs.work_base_dir': '/home/nghorbani/Desktop/alibaba_ai_task',

        'train_parms.batch_size': 32,
        'train_parms.num_workers': num_cpus,

        # Changing these won't have an effect and would probably cause errors, since the data was already prepared in the notebook.
        'data_parms.history_length': 7, 
        'data_parms.future_length': 7,

        'trainer.max_epochs': 500,
        # 'trainer.overfit_batches': 0.1,

        # 'trainer.fast_dev_run': True,
        'trainer.num_gpus': num_gpus,
        'trainer.strategy': 'ddp_spawn',
        'train_parms.optimizer.args.lr': 1e-3,
    },
)
# Error raised by PyTorch Lightning when 'ddp' is used in this notebook:
# `Trainer(strategy='ddp')` is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible backends:
# dp, ddp_spawn, ddp_sharded_spawn, tpu_spawn. In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.

Global seed set to 100


data_module:__init__:83 -- V01 -- V01 -- Setting up the APO data loader
data_module:__init__:43 -- V01 -- V01 -- dimensions of loaded data: {'price': torch.Size([113, 14, 14, 2])}
data_module:__init__:48 -- V01 -- V01 -- Split vald: Loaded #113 data points from dataset_dir /home/nghorbani/Desktop/alibaba_ai_task/data/V01/vald.
data_module:__init__:43 -- V01 -- V01 -- dimensions of loaded data: {'price': torch.Size([904, 14, 14, 2])}
data_module:__init__:48 -- V01 -- V01 -- Split train: Loaded #904 data points from dataset_dir /home/nghorbani/Desktop/alibaba_ai_task/data/V01/train.
data_module:__init__:43 -- V01 -- V01 -- dimensions of loaded data: {'price': torch.Size([111, 14, 14, 2])}
data_module:__init__:48 -- V01 -- V01 -- Split test: Loaded #111 data points from dataset_dir /home/nghorbani/Desktop/alibaba_ai_task/data/V01/test.


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  f"DataModule.{name} has already been called, so it will not be called again. "
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]


trainer:configure_optimizers:87 -- V01 -- V01 -- Total trainable model_params: 1.3990 M.


Global seed set to 100
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 100
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------


  | Name  | Type | Params
-------------------------------
0 | model | APO  | 1.4 M 
-------------------------------
1.4 M     Trainable params
0         Non-trainable params
1.4 M     Total params
5.596     Total estimated model params size (MB)


                                                                      

Global seed set to 100
Global seed set to 100
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"
tar: Cowardly refusing to create an empty archive
Try 'tar --help' or 'tar --usage' for more information.
2021-11-19 20:00:09.142 | INFO     | alibaba_ai_task.train.trainer:on_train_start:67 - Created a git archive backup at /home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/code/apo_2021_11_19_19_59_40.tar.gz


Epoch 0: 100%|██████████| 17/17 [00:16<00:00,  1.01it/s, loss=1.14e+06, v_num=0_0, train_loss_step=1.38e+6, val_loss=1.19e+6]

[rank: 0] Metric val_loss improved. New best score: 1185612.000
[rank: 1] Metric val_loss improved. New best score: 1185612.000
Epoch 0, global step 14: val_loss reached 1185612.00000 (best 1185612.00000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=00_val_loss=1185612.00.ckpt" as top 1


Epoch 1: 100%|██████████| 17/17 [00:01<00:00,  9.74it/s, loss=9.32e+05, v_num=0_0, train_loss_step=8.7e+5, val_loss=9.7e+5, train_loss_epoch=1.12e+6]

[rank: 0] Metric val_loss improved by 215397.812 >= min_delta = 0.0. New best score: 970214.188
[rank: 1] Metric val_loss improved by 215397.812 >= min_delta = 0.0. New best score: 970214.188
Epoch 1, global step 29: val_loss reached 970214.18750 (best 970214.18750), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=01_val_loss=970214.19.ckpt" as top 1


Epoch 2: 100%|██████████| 17/17 [00:01<00:00,  9.78it/s, loss=7.73e+05, v_num=0_0, train_loss_step=9.98e+5, val_loss=8.3e+5, train_loss_epoch=8.62e+5]

[rank: 1] Metric val_loss improved by 140248.938 >= min_delta = 0.0. New best score: 829965.250
[rank: 0] Metric val_loss improved by 140248.938 >= min_delta = 0.0. New best score: 829965.250
Epoch 2, global step 44: val_loss reached 829965.25000 (best 829965.25000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=02_val_loss=829965.25.ckpt" as top 1


Epoch 3: 100%|██████████| 17/17 [00:01<00:00,  9.57it/s, loss=7.9e+05, v_num=0_0, train_loss_step=1.36e+6, val_loss=1.3e+6, train_loss_epoch=7.39e+5]

Epoch 3, global step 59: val_loss was not in top 1


Epoch 4: 100%|██████████| 17/17 [00:01<00:00,  9.80it/s, loss=7.62e+05, v_num=0_0, train_loss_step=1e+6, val_loss=1.01e+6, train_loss_epoch=7.29e+5]

Epoch 4, global step 74: val_loss was not in top 1


Epoch 5: 100%|██████████| 17/17 [00:01<00:00,  9.75it/s, loss=7.22e+05, v_num=0_0, train_loss_step=1.15e+6, val_loss=8.99e+5, train_loss_epoch=7.18e+5]

Epoch 5, global step 89: val_loss was not in top 1


Epoch 6: 100%|██████████| 17/17 [00:01<00:00,  9.43it/s, loss=6.7e+05, v_num=0_0, train_loss_step=1.2e+5, val_loss=7.08e+5, train_loss_epoch=7.06e+5]

[rank: 0] Metric val_loss improved by 122050.625 >= min_delta = 0.0. New best score: 707914.625
[rank: 1] Metric val_loss improved by 122050.625 >= min_delta = 0.0. New best score: 707914.625
Epoch 6, global step 104: val_loss reached 707914.62500 (best 707914.62500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=06_val_loss=707914.62.ckpt" as top 1


Epoch 7: 100%|██████████| 17/17 [00:01<00:00,  9.80it/s, loss=6.34e+05, v_num=0_0, train_loss_step=5.41e+5, val_loss=6.2e+5, train_loss_epoch=6.84e+5]

[rank: 0] Metric val_loss improved by 87939.938 >= min_delta = 0.0. New best score: 619974.688
[rank: 1] Metric val_loss improved by 87939.938 >= min_delta = 0.0. New best score: 619974.688
Epoch 7, global step 119: val_loss reached 619974.68750 (best 619974.68750), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=07_val_loss=619974.69.ckpt" as top 1


Epoch 8: 100%|██████████| 17/17 [00:01<00:00,  9.78it/s, loss=6.55e+05, v_num=0_0, train_loss_step=9.6e+5, val_loss=6.02e+5, train_loss_epoch=6.62e+5]

[rank: 1] Metric val_loss improved by 17821.938 >= min_delta = 0.0. New best score: 602152.750
[rank: 0] Metric val_loss improved by 17821.938 >= min_delta = 0.0. New best score: 602152.750
Epoch 8, global step 134: val_loss reached 602152.75000 (best 602152.75000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=08_val_loss=602152.75.ckpt" as top 1


Epoch 9: 100%|██████████| 17/17 [00:01<00:00,  9.62it/s, loss=6.03e+05, v_num=0_0, train_loss_step=2.25e+5, val_loss=7.02e+5, train_loss_epoch=6.22e+5]

Epoch 9, global step 149: val_loss was not in top 1


Epoch 10: 100%|██████████| 17/17 [00:01<00:00,  9.82it/s, loss=5.63e+05, v_num=0_0, train_loss_step=7.04e+5, val_loss=5.9e+5, train_loss_epoch=6e+5]

[rank: 0] Metric val_loss improved by 12217.875 >= min_delta = 0.0. New best score: 589934.875
[rank: 1] Metric val_loss improved by 12217.875 >= min_delta = 0.0. New best score: 589934.875
Epoch 10, global step 164: val_loss reached 589934.87500 (best 589934.87500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=10_val_loss=589934.88.ckpt" as top 1


Epoch 11: 100%|██████████| 17/17 [00:01<00:00,  9.74it/s, loss=6.19e+05, v_num=0_0, train_loss_step=6.21e+5, val_loss=7.07e+5, train_loss_epoch=6.18e+5]

Epoch 11, global step 179: val_loss was not in top 1


Epoch 12: 100%|██████████| 17/17 [00:01<00:00,  9.75it/s, loss=6.62e+05, v_num=0_0, train_loss_step=8.01e+5, val_loss=6.93e+5, train_loss_epoch=6.22e+5]

Epoch 12, global step 194: val_loss was not in top 1


Epoch 13: 100%|██████████| 17/17 [00:01<00:00,  9.36it/s, loss=6.44e+05, v_num=0_0, train_loss_step=7.59e+5, val_loss=7.47e+5, train_loss_epoch=6.32e+5]

Epoch 13, global step 209: val_loss was not in top 1


Epoch 14: 100%|██████████| 17/17 [00:01<00:00,  9.74it/s, loss=6.44e+05, v_num=0_0, train_loss_step=4.76e+5, val_loss=6.39e+5, train_loss_epoch=6.38e+5]

Epoch 14, global step 224: val_loss was not in top 1


Epoch    15: reducing learning rate of group 0 to 1.0000e-04.
Epoch 15: 100%|██████████| 17/17 [00:01<00:00,  9.75it/s, loss=6.96e+05, v_num=0_0, train_loss_step=1.77e+6, val_loss=5.79e+5, train_loss_epoch=6.37e+5]

[rank: 0] Metric val_loss improved by 11094.062 >= min_delta = 0.0. New best score: 578840.812
[rank: 1] Metric val_loss improved by 11094.062 >= min_delta = 0.0. New best score: 578840.812
Epoch 15, global step 239: val_loss reached 578840.81250 (best 578840.81250), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=15_val_loss=578840.81.ckpt" as top 1


Epoch 16: 100%|██████████| 17/17 [00:01<00:00,  9.78it/s, loss=6.32e+05, v_num=0_0, train_loss_step=2.11e+5, val_loss=5.64e+5, train_loss_epoch=6.45e+5]

[rank: 0] Metric val_loss improved by 15328.312 >= min_delta = 0.0. New best score: 563512.500
[rank: 1] Metric val_loss improved by 15328.312 >= min_delta = 0.0. New best score: 563512.500
Epoch 16, global step 254: val_loss reached 563512.50000 (best 563512.50000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=16_val_loss=563512.50.ckpt" as top 1


Epoch 17: 100%|██████████| 17/17 [00:01<00:00,  9.85it/s, loss=5.75e+05, v_num=0_0, train_loss_step=4.2e+5, val_loss=5.54e+5, train_loss_epoch=5.81e+5]

[rank: 0] Metric val_loss improved by 9036.000 >= min_delta = 0.0. New best score: 554476.500
[rank: 1] Metric val_loss improved by 9036.000 >= min_delta = 0.0. New best score: 554476.500
Epoch 17, global step 269: val_loss reached 554476.50000 (best 554476.50000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=17_val_loss=554476.50.ckpt" as top 1


Epoch 18: 100%|██████████| 17/17 [00:01<00:00,  9.80it/s, loss=6.17e+05, v_num=0_0, train_loss_step=1.02e+6, val_loss=5.58e+5, train_loss_epoch=5.74e+5]

Epoch 18, global step 284: val_loss was not in top 1


Epoch 19: 100%|██████████| 17/17 [00:01<00:00,  9.34it/s, loss=5.85e+05, v_num=0_0, train_loss_step=9e+5, val_loss=5.54e+5, train_loss_epoch=5.75e+5]

[rank: 0] Metric val_loss improved by 255.938 >= min_delta = 0.0. New best score: 554220.562
[rank: 1] Metric val_loss improved by 255.938 >= min_delta = 0.0. New best score: 554220.562
Epoch 19, global step 299: val_loss reached 554220.56250 (best 554220.56250), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=19_val_loss=554220.56.ckpt" as top 1


Epoch 20: 100%|██████████| 17/17 [00:01<00:00,  9.80it/s, loss=5.81e+05, v_num=0_0, train_loss_step=4.53e+5, val_loss=5.49e+5, train_loss_epoch=5.71e+5]

[rank: 0] Metric val_loss improved by 4819.938 >= min_delta = 0.0. New best score: 549400.625
[rank: 1] Metric val_loss improved by 4819.938 >= min_delta = 0.0. New best score: 549400.625
Epoch 20, global step 314: val_loss reached 549400.62500 (best 549400.62500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=20_val_loss=549400.62.ckpt" as top 1


Epoch 21: 100%|██████████| 17/17 [00:01<00:00,  9.82it/s, loss=5.53e+05, v_num=0_0, train_loss_step=7.44e+5, val_loss=5.48e+5, train_loss_epoch=5.67e+5]

[rank: 0] Metric val_loss improved by 1856.000 >= min_delta = 0.0. New best score: 547544.625
[rank: 1] Metric val_loss improved by 1856.000 >= min_delta = 0.0. New best score: 547544.625
Epoch 21, global step 329: val_loss reached 547544.62500 (best 547544.62500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=21_val_loss=547544.62.ckpt" as top 1


Epoch 22: 100%|██████████| 17/17 [00:01<00:00,  9.83it/s, loss=5.81e+05, v_num=0_0, train_loss_step=7.53e+5, val_loss=5.51e+5, train_loss_epoch=5.67e+5]

Epoch 22, global step 344: val_loss was not in top 1


Epoch 23: 100%|██████████| 17/17 [00:01<00:00,  9.63it/s, loss=6.01e+05, v_num=0_0, train_loss_step=7.52e+5, val_loss=5.48e+5, train_loss_epoch=5.65e+5]

Epoch 23, global step 359: val_loss was not in top 1


Epoch 24: 100%|██████████| 17/17 [00:01<00:00,  9.74it/s, loss=5.78e+05, v_num=0_0, train_loss_step=9.35e+5, val_loss=5.46e+5, train_loss_epoch=5.65e+5]

[rank: 0] Metric val_loss improved by 1603.438 >= min_delta = 0.0. New best score: 545941.188
[rank: 1] Metric val_loss improved by 1603.438 >= min_delta = 0.0. New best score: 545941.188
Epoch 24, global step 374: val_loss reached 545941.18750 (best 545941.18750), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=24_val_loss=545941.19.ckpt" as top 1


Epoch 25: 100%|██████████| 17/17 [00:01<00:00,  9.91it/s, loss=5.86e+05, v_num=0_0, train_loss_step=5.12e+5, val_loss=5.48e+5, train_loss_epoch=5.65e+5]

Epoch 25, global step 389: val_loss was not in top 1


Epoch 26: 100%|██████████| 17/17 [00:01<00:00,  9.44it/s, loss=5.46e+05, v_num=0_0, train_loss_step=3.8e+5, val_loss=5.43e+5, train_loss_epoch=5.64e+5]

[rank: 0] Metric val_loss improved by 2615.000 >= min_delta = 0.0. New best score: 543326.188
[rank: 1] Metric val_loss improved by 2615.000 >= min_delta = 0.0. New best score: 543326.188
Epoch 26, global step 404: val_loss reached 543326.18750 (best 543326.18750), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=26_val_loss=543326.19.ckpt" as top 1


Epoch 27: 100%|██████████| 17/17 [00:01<00:00,  9.88it/s, loss=5.55e+05, v_num=0_0, train_loss_step=6.24e+5, val_loss=5.42e+5, train_loss_epoch=5.57e+5]

[rank: 1] Metric val_loss improved by 987.062 >= min_delta = 0.0. New best score: 542339.125
[rank: 0] Metric val_loss improved by 987.062 >= min_delta = 0.0. New best score: 542339.125
Epoch 27, global step 419: val_loss reached 542339.12500 (best 542339.12500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=27_val_loss=542339.12.ckpt" as top 1


Epoch 28: 100%|██████████| 17/17 [00:01<00:00,  9.87it/s, loss=5.74e+05, v_num=0_0, train_loss_step=1.55e+5, val_loss=5.43e+5, train_loss_epoch=5.63e+5]

Epoch 28, global step 434: val_loss was not in top 1


Epoch 29: 100%|██████████| 17/17 [00:01<00:00,  9.70it/s, loss=5.3e+05, v_num=0_0, train_loss_step=3.77e+5, val_loss=5.41e+5, train_loss_epoch=5.57e+5]

[rank: 0] Metric val_loss improved by 1652.688 >= min_delta = 0.0. New best score: 540686.438
[rank: 1] Metric val_loss improved by 1652.688 >= min_delta = 0.0. New best score: 540686.438
Epoch 29, global step 449: val_loss reached 540686.43750 (best 540686.43750), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=29_val_loss=540686.44.ckpt" as top 1


Epoch 30: 100%|██████████| 17/17 [00:01<00:00,  9.91it/s, loss=5.51e+05, v_num=0_0, train_loss_step=5.85e+5, val_loss=5.41e+5, train_loss_epoch=5.58e+5]

Epoch 30, global step 464: val_loss was not in top 1


Epoch 31: 100%|██████████| 17/17 [00:01<00:00,  9.91it/s, loss=5.72e+05, v_num=0_0, train_loss_step=4.5e+5, val_loss=5.43e+5, train_loss_epoch=5.53e+5]

Epoch 31, global step 479: val_loss was not in top 1


Epoch 32: 100%|██████████| 17/17 [00:01<00:00,  9.80it/s, loss=5.53e+05, v_num=0_0, train_loss_step=2.91e+5, val_loss=5.68e+5, train_loss_epoch=5.52e+5]

Epoch 32, global step 494: val_loss was not in top 1


Epoch 33: 100%|██████████| 17/17 [00:01<00:00,  9.34it/s, loss=5.38e+05, v_num=0_0, train_loss_step=6.19e+5, val_loss=5.46e+5, train_loss_epoch=5.66e+5]

Epoch 33, global step 509: val_loss was not in top 1


Epoch    34: reducing learning rate of group 0 to 1.0000e-05.
Epoch 34: 100%|██████████| 17/17 [00:01<00:00,  9.69it/s, loss=5.75e+05, v_num=0_0, train_loss_step=8.9e+5, val_loss=5.42e+5, train_loss_epoch=5.71e+5]

Epoch 34, global step 524: val_loss was not in top 1


Epoch 35: 100%|██████████| 17/17 [00:01<00:00,  9.73it/s, loss=5.75e+05, v_num=0_0, train_loss_step=9.56e+5, val_loss=5.4e+5, train_loss_epoch=5.6e+5]

[rank: 0] Metric val_loss improved by 399.500 >= min_delta = 0.0. New best score: 540286.938
[rank: 1] Metric val_loss improved by 399.500 >= min_delta = 0.0. New best score: 540286.938
Epoch 35, global step 539: val_loss reached 540286.93750 (best 540286.93750), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=35_val_loss=540286.94.ckpt" as top 1


Epoch 36: 100%|██████████| 17/17 [00:01<00:00,  9.74it/s, loss=5.86e+05, v_num=0_0, train_loss_step=8.45e+5, val_loss=5.39e+5, train_loss_epoch=5.57e+5]

[rank: 0] Metric val_loss improved by 1080.562 >= min_delta = 0.0. New best score: 539206.375
[rank: 1] Metric val_loss improved by 1080.562 >= min_delta = 0.0. New best score: 539206.375
Epoch 36, global step 554: val_loss reached 539206.37500 (best 539206.37500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=36_val_loss=539206.38.ckpt" as top 1


Epoch 37: 100%|██████████| 17/17 [00:01<00:00,  9.85it/s, loss=5.59e+05, v_num=0_0, train_loss_step=5.59e+5, val_loss=5.38e+5, train_loss_epoch=5.55e+5]

[rank: 0] Metric val_loss improved by 1466.750 >= min_delta = 0.0. New best score: 537739.625
[rank: 1] Metric val_loss improved by 1466.750 >= min_delta = 0.0. New best score: 537739.625
Epoch 37, global step 569: val_loss reached 537739.62500 (best 537739.62500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=37_val_loss=537739.62.ckpt" as top 1


Epoch 38: 100%|██████████| 17/17 [00:01<00:00,  9.91it/s, loss=6.22e+05, v_num=0_0, train_loss_step=1.3e+6, val_loss=5.39e+5, train_loss_epoch=5.55e+5]

Epoch 38, global step 584: val_loss was not in top 1


Epoch 39: 100%|██████████| 17/17 [00:01<00:00,  9.33it/s, loss=6e+05, v_num=0_0, train_loss_step=2.87e+5, val_loss=5.38e+5, train_loss_epoch=5.53e+5]

[rank: 0] Metric val_loss improved by 132.250 >= min_delta = 0.0. New best score: 537607.375
[rank: 1] Metric val_loss improved by 132.250 >= min_delta = 0.0. New best score: 537607.375
Epoch 39, global step 599: val_loss reached 537607.37500 (best 537607.37500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=39_val_loss=537607.38.ckpt" as top 1


Epoch 40: 100%|██████████| 17/17 [00:01<00:00,  9.88it/s, loss=5.69e+05, v_num=0_0, train_loss_step=7.24e+5, val_loss=5.37e+5, train_loss_epoch=5.52e+5]

[rank: 0] Metric val_loss improved by 225.000 >= min_delta = 0.0. New best score: 537382.375
[rank: 1] Metric val_loss improved by 225.000 >= min_delta = 0.0. New best score: 537382.375
Epoch 40, global step 614: val_loss reached 537382.37500 (best 537382.37500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=40_val_loss=537382.38.ckpt" as top 1


Epoch 41: 100%|██████████| 17/17 [00:01<00:00,  9.85it/s, loss=6e+05, v_num=0_0, train_loss_step=1.01e+6, val_loss=5.38e+5, train_loss_epoch=5.52e+5]

Epoch 41, global step 629: val_loss was not in top 1


Epoch 42:  88%|████████▊ | 15/17 [00:01<00:00,  9.38it/s, loss=5.85e+05, v_num=0_0, train_loss_step=5.24e+5, val_loss=5.38e+5, train_loss_epoch=5.53e+5]
Validating: 0it [00:00, ?it/s]
Epoch 42: 100%|██████████| 17/17 [00:01<00:00,  9.92it/s, loss=5.85e+05, v_num=0_0, train_loss_step=5.24e+5, val_loss=5.38e+5, train_loss_epoch=5.53e+5]

Epoch 42, global step 644: val_loss was not in top 1


Epoch 43:  88%|████████▊ | 15/17 [00:01<00:00,  9.17it/s, loss=5.55e+05, v_num=0_0, train_loss_step=3.26e+5, val_loss=5.38e+5, train_loss_epoch=5.53e+5]
Validating: 0it [00:00, ?it/s]
Epoch 43: 100%|██████████| 17/17 [00:01<00:00,  9.67it/s, loss=5.55e+05, v_num=0_0, train_loss_step=3.26e+5, val_loss=5.39e+5, train_loss_epoch=5.53e+5]

Epoch 43, global step 659: val_loss was not in top 1


Epoch 44:  88%|████████▊ | 15/17 [00:01<00:00,  9.33it/s, loss=5.37e+05, v_num=0_0, train_loss_step=4.73e+5, val_loss=5.39e+5, train_loss_epoch=5.51e+5]
Validating: 0it [00:00, ?it/s]
Epoch 44: 100%|██████████| 17/17 [00:01<00:00,  9.83it/s, loss=5.37e+05, v_num=0_0, train_loss_step=4.73e+5, val_loss=5.37e+5, train_loss_epoch=5.51e+5]

[rank: 0] Metric val_loss improved by 293.125 >= min_delta = 0.0. New best score: 537089.250
[rank: 1] Metric val_loss improved by 293.125 >= min_delta = 0.0. New best score: 537089.250
Epoch 44, global step 674: val_loss reached 537089.25000 (best 537089.25000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=44_val_loss=537089.25.ckpt" as top 1


Epoch 45:  88%|████████▊ | 15/17 [00:01<00:00,  9.32it/s, loss=5.33e+05, v_num=0_0, train_loss_step=5.3e+5, val_loss=5.37e+5, train_loss_epoch=5.5e+5]  
Validating: 0it [00:00, ?it/s]
Epoch 45: 100%|██████████| 17/17 [00:01<00:00,  9.82it/s, loss=5.33e+05, v_num=0_0, train_loss_step=5.3e+5, val_loss=5.37e+5, train_loss_epoch=5.5e+5]

[rank: 0] Metric val_loss improved by 121.875 >= min_delta = 0.0. New best score: 536967.375
[rank: 1] Metric val_loss improved by 121.875 >= min_delta = 0.0. New best score: 536967.375
Epoch 45, global step 689: val_loss reached 536967.37500 (best 536967.37500), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=45_val_loss=536967.38.ckpt" as top 1


Epoch 46:  88%|████████▊ | 15/17 [00:01<00:00,  9.02it/s, loss=5.27e+05, v_num=0_0, train_loss_step=3.05e+5, val_loss=5.37e+5, train_loss_epoch=5.5e+5]
Validating: 0it [00:00, ?it/s]
Epoch 46: 100%|██████████| 17/17 [00:01<00:00,  9.56it/s, loss=5.27e+05, v_num=0_0, train_loss_step=3.05e+5, val_loss=5.36e+5, train_loss_epoch=5.5e+5]

[rank: 1] Metric val_loss improved by 1420.125 >= min_delta = 0.0. New best score: 535547.250
[rank: 0] Metric val_loss improved by 1420.125 >= min_delta = 0.0. New best score: 535547.250
Epoch 46, global step 704: val_loss reached 535547.25000 (best 535547.25000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=46_val_loss=535547.25.ckpt" as top 1


Epoch 47:  88%|████████▊ | 15/17 [00:01<00:00,  9.26it/s, loss=5.73e+05, v_num=0_0, train_loss_step=1.32e+6, val_loss=5.36e+5, train_loss_epoch=5.51e+5]
Validating: 0it [00:00, ?it/s]
Epoch 47: 100%|██████████| 17/17 [00:01<00:00,  9.76it/s, loss=5.73e+05, v_num=0_0, train_loss_step=1.32e+6, val_loss=5.37e+5, train_loss_epoch=5.51e+5]

Epoch 47, global step 719: val_loss was not in top 1


Epoch 48:  88%|████████▊ | 15/17 [00:01<00:00,  9.32it/s, loss=5.65e+05, v_num=0_0, train_loss_step=2.53e+5, val_loss=5.37e+5, train_loss_epoch=5.51e+5]
Validating: 0it [00:00, ?it/s]
Epoch 48: 100%|██████████| 17/17 [00:01<00:00,  9.82it/s, loss=5.65e+05, v_num=0_0, train_loss_step=2.53e+5, val_loss=5.35e+5, train_loss_epoch=5.51e+5]

[rank: 1] Metric val_loss improved by 250.250 >= min_delta = 0.0. New best score: 535297.000
[rank: 0] Metric val_loss improved by 250.250 >= min_delta = 0.0. New best score: 535297.000
Epoch 48, global step 734: val_loss reached 535297.00000 (best 535297.00000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=48_val_loss=535297.00.ckpt" as top 1


Epoch 49:  88%|████████▊ | 15/17 [00:01<00:00,  9.25it/s, loss=4.92e+05, v_num=0_0, train_loss_step=1.96e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]
Validating: 0it [00:00, ?it/s]
Epoch 49: 100%|██████████| 17/17 [00:01<00:00,  9.56it/s, loss=4.92e+05, v_num=0_0, train_loss_step=1.96e+5, val_loss=5.36e+5, train_loss_epoch=5.47e+5]
Epoch 50:   6%|▌         | 1/17 [00:00<00:01,  8.30it/s, loss=4.99e+05, v_num=0_0, train_loss_step=5.25e+5, val_loss=5.36e+5, train_loss_epoch=5.48e+5] 

Epoch 49, global step 749: val_loss was not in top 1


Epoch 50:  88%|████████▊ | 15/17 [00:01<00:00,  9.31it/s, loss=5.14e+05, v_num=0_0, train_loss_step=5.28e+5, val_loss=5.36e+5, train_loss_epoch=5.48e+5]
Validating: 0it [00:00, ?it/s]
Epoch 50: 100%|██████████| 17/17 [00:01<00:00,  9.85it/s, loss=5.14e+05, v_num=0_0, train_loss_step=5.28e+5, val_loss=5.35e+5, train_loss_epoch=5.48e+5]

Epoch 50, global step 764: val_loss was not in top 1


Epoch 51:  88%|████████▊ | 15/17 [00:01<00:00,  9.05it/s, loss=5.38e+05, v_num=0_0, train_loss_step=5.12e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]
Validating: 0it [00:00, ?it/s]
Epoch 51: 100%|██████████| 17/17 [00:01<00:00,  9.60it/s, loss=5.38e+05, v_num=0_0, train_loss_step=5.12e+5, val_loss=5.34e+5, train_loss_epoch=5.47e+5]

[rank: 0] Metric val_loss improved by 1214.000 >= min_delta = 0.0. New best score: 534083.000
[rank: 1] Metric val_loss improved by 1214.000 >= min_delta = 0.0. New best score: 534083.000
Epoch 51, global step 779: val_loss reached 534083.00000 (best 534083.00000), saving model to "/home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=51_val_loss=534083.00.ckpt" as top 1


Epoch 52:  88%|████████▊ | 15/17 [00:01<00:00,  9.21it/s, loss=5.41e+05, v_num=0_0, train_loss_step=7.24e+5, val_loss=5.34e+5, train_loss_epoch=5.48e+5]
Validating: 0it [00:00, ?it/s]
Epoch 52: 100%|██████████| 17/17 [00:01<00:00,  9.76it/s, loss=5.41e+05, v_num=0_0, train_loss_step=7.24e+5, val_loss=5.36e+5, train_loss_epoch=5.48e+5]

Epoch 52, global step 794: val_loss was not in top 1


Epoch 53:  88%|████████▊ | 15/17 [00:01<00:00,  8.90it/s, loss=5.5e+05, v_num=0_0, train_loss_step=4.88e+5, val_loss=5.36e+5, train_loss_epoch=5.49e+5] 
Validating: 0it [00:00, ?it/s]
Epoch 53: 100%|██████████| 17/17 [00:01<00:00,  9.43it/s, loss=5.5e+05, v_num=0_0, train_loss_step=4.88e+5, val_loss=5.34e+5, train_loss_epoch=5.49e+5]

Epoch 53, global step 809: val_loss was not in top 1


Epoch 54:  88%|████████▊ | 15/17 [00:01<00:00,  9.26it/s, loss=5.47e+05, v_num=0_0, train_loss_step=6.95e+5, val_loss=5.34e+5, train_loss_epoch=5.49e+5]
Validating: 0it [00:00, ?it/s]
Epoch 54: 100%|██████████| 17/17 [00:01<00:00,  9.76it/s, loss=5.47e+05, v_num=0_0, train_loss_step=6.95e+5, val_loss=5.36e+5, train_loss_epoch=5.49e+5]

Epoch 54, global step 824: val_loss was not in top 1


Epoch 55:  88%|████████▊ | 15/17 [00:01<00:00,  9.30it/s, loss=5.7e+05, v_num=0_0, train_loss_step=6.43e+5, val_loss=5.36e+5, train_loss_epoch=5.46e+5] 
Validating: 0it [00:00, ?it/s]
Epoch 55: 100%|██████████| 17/17 [00:01<00:00,  9.80it/s, loss=5.7e+05, v_num=0_0, train_loss_step=6.43e+5, val_loss=5.35e+5, train_loss_epoch=5.46e+5]

Epoch 55, global step 839: val_loss was not in top 1


Epoch    56: reducing learning rate of group 0 to 1.0000e-06.
Epoch 56:  88%|████████▊ | 15/17 [00:01<00:00,  8.80it/s, loss=5.24e+05, v_num=0_0, train_loss_step=3.06e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]
Validating: 0it [00:00, ?it/s]
Epoch 56: 100%|██████████| 17/17 [00:01<00:00,  9.32it/s, loss=5.24e+05, v_num=0_0, train_loss_step=3.06e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]
Epoch 57:   6%|▌         | 1/17 [00:00<00:01,  8.17it/s, loss=5.24e+05, v_num=0_0, train_loss_step=5.76e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5] 

Epoch 56, global step 854: val_loss was not in top 1


Epoch 57:  88%|████████▊ | 15/17 [00:01<00:00,  9.29it/s, loss=4.94e+05, v_num=0_0, train_loss_step=2.03e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]
Validating: 0it [00:00, ?it/s]
Epoch 57: 100%|██████████| 17/17 [00:01<00:00,  9.78it/s, loss=4.94e+05, v_num=0_0, train_loss_step=2.03e+5, val_loss=5.36e+5, train_loss_epoch=5.47e+5]

Epoch 57, global step 869: val_loss was not in top 1


Epoch 58:  88%|████████▊ | 15/17 [00:01<00:00,  9.30it/s, loss=5.34e+05, v_num=0_0, train_loss_step=5.27e+5, val_loss=5.36e+5, train_loss_epoch=5.51e+5]
Validating: 0it [00:00, ?it/s]
Epoch 58: 100%|██████████| 17/17 [00:01<00:00,  9.84it/s, loss=5.34e+05, v_num=0_0, train_loss_step=5.27e+5, val_loss=5.36e+5, train_loss_epoch=5.51e+5]

Epoch 58, global step 884: val_loss was not in top 1


Epoch 59:  88%|████████▊ | 15/17 [00:01<00:00,  9.21it/s, loss=5.29e+05, v_num=0_0, train_loss_step=4.7e+5, val_loss=5.36e+5, train_loss_epoch=5.47e+5] 
Validating: 0it [00:00, ?it/s]
Epoch 59: 100%|██████████| 17/17 [00:01<00:00,  9.37it/s, loss=5.29e+05, v_num=0_0, train_loss_step=4.7e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]
Epoch    15: reducing learning rate of group 0 to 1.0000e-04.
Epoch    34: reducing learning rate of group 0 to 1.0000e-05.
Epoch    56: reducing learning rate of group 0 to 1.0000e-06.
Epoch    60: reducing learning rate of group 0 to 1.0000e-07.
Epoch 59: 100%|██████████| 17/17 [00:01<00:00,  8.93it/s, loss=5.29e+05, v_num=0_0, train_loss_step=4.7e+5, val_loss=5.35e+5, train_loss_epoch=5.47e+5]


[rank: 0] Monitored metric val_loss did not improve in the last 8 records. Best score: 534083.000. Signaling Trainer to stop.
[rank: 1] Monitored metric val_loss did not improve in the last 8 records. Best score: 534083.000. Signaling Trainer to stop.
Epoch 59, global step 899: val_loss was not in top 1
2021-11-19 20:02:12.874 | SUCCESS  | alibaba_ai_task.train.trainer:on_train_end:175 - best_model_fname: /home/nghorbani/Desktop/alibaba_ai_task/training_experiments/data_V01/V01/snapshots/V01_epoch=51_val_loss=534083.00.ckpt
2021-11-19 20:02:12.879 | SUCCESS  | alibaba_ai_task.train.trainer:on_train_end:178 - Epoch 59 - Finished training at 2021_11_19_20_02_12 after 0:02:32
  rank_zero_warn("cleaning up ddp environment...")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Global seed set to 100
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 100
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
---------------------------------------------------------------------

Testing:  50%|█████     | 1/2 [00:15<00:15, 15.11s/it]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss': 586622.25}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 2/2 [00:15<00:00,  7.71s/it]


  rank_zero_warn("cleaning up ddp environment...")
