
Command for training on BDD100K #45

Open
ASMIftekhar opened this issue Jul 19, 2022 · 7 comments
@ASMIftekhar

Hello,
Thanks a lot for your awesome work, and congratulations on getting accepted at ECCV. I am planning to retrain the model on the BDD100K dataset. In this script from the motr_bdd100k branch, I can see multiple commented-out commands. Can you confirm which one of these you actually used to train the model?

@zyayoung
Collaborator

We are using the third config, i.e., r50.bdd100k_mot.20e. Sorry for the confusion.

@ASMIftekhar
Author

Thanks a lot for the clarification.

@ASMIftekhar
Author

Hello,
Do you have an estimate of the training time on BDD100K? I am seeing about one day per epoch with 8 GPUs!

ASMIftekhar reopened this Jul 21, 2022
@zyayoung
Collaborator

The total training time was 6d18h on 8 2080 Ti GPUs.
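
For reference, a quick back-of-the-envelope check of what that total implies per epoch, assuming the 20-epoch schedule of the r50.bdd100k_mot.20e config (this calculation is not from the thread, just the arithmetic):

```python
from datetime import timedelta

total_training_time = timedelta(days=6, hours=18)  # 6d18h reported above
epochs = 20                                         # schedule length of the 20e config

print(total_training_time / epochs)  # 8:06:00 -> roughly 8 hours per epoch on 8x 2080 Ti
```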

@ASMIftekhar
Author

Thanks for the response. I wanted to make sure the slow training is due to my machine and not to a misconfigured pipeline. I am using the following command to run the model:
```bash
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --meta_arch motr \
    --dataset_file bdd100k_mot \
    --epoch 20 \
    --with_box_refine \
    --lr_drop 16 \
    --save_period 2 \
    --lr 2e-4 \
    --lr_backbone 2e-5 \
    --pretrained ${PRETRAIN} \
    --output_dir ${EXP_DIR} \
    --batch_size 1 \
    --sample_mode 'random_interval' \
    --sample_interval 4 \
    --sampler_steps 6 12 \
    --sampler_lengths 2 3 4 \
    --update_query_pos \
    --merger_dropout 0 \
    --dropout 0 \
    --random_drop 0.1 \
    --fp_ratio 0.3 \
    --track_embedding_layer 'AttentionMergerV4' \
    --extra_track_attn \
    --data_txt_path_train datasets/data_path/bdd100k.train \
    --data_txt_path_val datasets/data_path/bdd100k.val \
    --mot_path data/bdd100k/
```
While training, I am seeing the following logs:
```
Epoch: [0] [ 20/34268] eta: 1 day, 4:52:33 lr: 0.000200 grad_norm: 33.51 loss: 12.0034 (13.1927) frame_0_aux0_loss_bbox: 0.0757 (0.0819) frame_0_aux0_loss_ce: 0.6567 (0.7296) frame_0_aux0_loss_giou: 0.3462 (0.3459) frame_0_aux1_loss_bbox: 0.0722 (0.0743) frame_0_aux1_loss_ce: 0.3879 (0.5191) frame_0_aux1_loss_giou: 0.3164 (0.3131) frame_0_aux2_loss_bbox: 0.0689 (0.0715) frame_0_aux2_loss_ce: 0.3761 (0.5060) frame_0_aux2_loss_giou: 0.3073 (0.3021) frame_0_aux3_loss_bbox: 0.0704 (0.0708) frame_0_aux3_loss_ce: 0.3945 (0.5072) frame_0_aux3_loss_giou: 0.3022 (0.2992) frame_0_aux4_loss_bbox: 0.0674 (0.0691) frame_0_aux4_loss_ce: 0.4496 (0.5385) frame_0_aux4_loss_giou: 0.2945 (0.2969) frame_0_loss_bbox: 0.0673 (0.0691) frame_0_loss_ce: 0.4790 (0.5634) frame_0_loss_giou: 0.2947 (0.2960) frame_1_aux0_loss_bbox: 0.1989 (0.1953) frame_1_aux0_loss_ce: 0.6516 (0.7257) frame_1_aux0_loss_giou: 0.5459 (0.5317) frame_1_aux1_loss_bbox: 0.1954 (0.1897) frame_1_aux1_loss_ce: 0.4233 (0.5472) frame_1_aux1_loss_giou: 0.5211 (0.5116) frame_1_aux2_loss_bbox: 0.1952 (0.1894) frame_1_aux2_loss_ce: 0.4143 (0.5164) frame_1_aux2_loss_giou: 0.5061 (0.5071) frame_1_aux3_loss_bbox: 0.1958 (0.1895) frame_1_aux3_loss_ce: 0.4043 (0.5034) frame_1_aux3_loss_giou: 0.5039 (0.5071) frame_1_aux4_loss_bbox: 0.1966 (0.1894) frame_1_aux4_loss_ce: 0.4149 (0.5069) frame_1_aux4_loss_giou: 0.5008 (0.5064) frame_1_loss_bbox: 0.1986 (0.1894) frame_1_loss_ce: 0.4315 (0.5273) frame_1_loss_giou: 0.5012 (0.5059) time: 2.9063 data: 0.6959 max mem: 5583
```
Does it look ok to you?
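
As a rough sanity check on that log line (not from the thread), the displayed ETA is consistent with the per-iteration time it reports; a minimal sketch of the arithmetic, using values copied from the log above:

```python
from datetime import timedelta

iters_per_epoch = 34268   # from "Epoch: [0] [ 20/34268]"
done_iters = 20
time_per_iter = 2.9063    # seconds, from "time: 2.9063"

eta = timedelta(seconds=(iters_per_epoch - done_iters) * time_per_iter)
print(eta)  # ~1 day, 3:39 -- same ballpark as the logged "eta: 1 day, 4:52:33"
```

At roughly 3 s per iteration, one epoch takes about 27-28 hours here, i.e., around 3x the ~8 hours per epoch implied by the 6d18h total reported above.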

@ASMIftekhar
Author

Also, if you still have it, could you provide the log file from your training on BDD100K?

@lebron-2016

lebron-2016 commented Dec 17, 2023

(Quoting ASMIftekhar's training command and log output from the comment above.)

Hello, which CUDA version and which versions of torch and torchvision did you use when training the BDD100K branch? I encountered this error during training; do you know the solution?

[screenshot of the error]

This problem has been troubling me for many days, and I would really appreciate your help. Thanks!
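
Not part of the original thread, but a quick way to report the versions being asked about (these are standard PyTorch/torchvision attributes):

```python
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA (build):", torch.version.cuda)          # CUDA version PyTorch was built against
print("CUDA available:", torch.cuda.is_available()) # whether the runtime can see a GPU
```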
