
Command for training on BDD100K #45

Open
ASMIftekhar opened this issue Jul 19, 2022 · 7 comments
@ASMIftekhar

Hello,
Thanks a lot for your awesome work, and congratulations on getting accepted at ECCV. I am planning to retrain the model on the BDD100K dataset. In this script from the motr_bdd100k branch, I can see multiple commented-out commands. Can you confirm which one of these you actually used to train the model?

@zyayoung
Collaborator

We are using the third config, i.e., r50.bdd100k_mot.20e. Sorry for the confusion.

@ASMIftekhar
Author

Thanks a lot for the clarification.

@ASMIftekhar
Author

Hello,
Do you have an estimate of the training time on BDD100K? I am seeing about one day per epoch with 8 GPUs!

ASMIftekhar reopened this Jul 21, 2022
@zyayoung
Collaborator

The total training time was 6d18h on 8 2080 Ti GPUs.
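
For reference, a quick back-of-the-envelope check of what that total implies per epoch, assuming the 20-epoch schedule of the r50.bdd100k_mot.20e config (this calculation is not from the thread, just the arithmetic):

```python
from datetime import timedelta

total_training_time = timedelta(days=6, hours=18)  # 6d18h reported above
epochs = 20                                         # schedule length of the 20e config

print(total_training_time / epochs)  # 8:06:00 -> roughly 8 hours per epoch on 8x 2080 Ti
```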

@ASMIftekhar
Author

Thanks for the response. I wanted to make sure the slow training is due to my machine and not to a misconfigured pipeline. I am using the following command to run the model:
```bash
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --meta_arch motr \
    --dataset_file bdd100k_mot \
    --epoch 20 \
    --with_box_refine \
    --lr_drop 16 \
    --save_period 2 \
    --lr 2e-4 \
    --lr_backbone 2e-5 \
    --pretrained ${PRETRAIN} \
    --output_dir ${EXP_DIR} \
    --batch_size 1 \
    --sample_mode 'random_interval' \
    --sample_interval 4 \
    --sampler_steps 6 12 \
    --sampler_lengths 2 3 4 \
    --update_query_pos \
    --merger_dropout 0 \
    --dropout 0 \
    --random_drop 0.1 \
    --fp_ratio 0.3 \
    --track_embedding_layer 'AttentionMergerV4' \
    --extra_track_attn \
    --data_txt_path_train datasets/data_path/bdd100k.train \
    --data_txt_path_val datasets/data_path/bdd100k.val \
    --mot_path data/bdd100k/
```
While training, I am seeing the following logs:
```
Epoch: [0] [ 20/34268] eta: 1 day, 4:52:33 lr: 0.000200 grad_norm: 33.51 loss: 12.0034 (13.1927) frame_0_aux0_loss_bbox: 0.0757 (0.0819) frame_0_aux0_loss_ce: 0.6567 (0.7296) frame_0_aux0_loss_giou: 0.3462 (0.3459) frame_0_aux1_loss_bbox: 0.0722 (0.0743) frame_0_aux1_loss_ce: 0.3879 (0.5191) frame_0_aux1_loss_giou: 0.3164 (0.3131) frame_0_aux2_loss_bbox: 0.0689 (0.0715) frame_0_aux2_loss_ce: 0.3761 (0.5060) frame_0_aux2_loss_giou: 0.3073 (0.3021) frame_0_aux3_loss_bbox: 0.0704 (0.0708) frame_0_aux3_loss_ce: 0.3945 (0.5072) frame_0_aux3_loss_giou: 0.3022 (0.2992) frame_0_aux4_loss_bbox: 0.0674 (0.0691) frame_0_aux4_loss_ce: 0.4496 (0.5385) frame_0_aux4_loss_giou: 0.2945 (0.2969) frame_0_loss_bbox: 0.0673 (0.0691) frame_0_loss_ce: 0.4790 (0.5634) frame_0_loss_giou: 0.2947 (0.2960) frame_1_aux0_loss_bbox: 0.1989 (0.1953) frame_1_aux0_loss_ce: 0.6516 (0.7257) frame_1_aux0_loss_giou: 0.5459 (0.5317) frame_1_aux1_loss_bbox: 0.1954 (0.1897) frame_1_aux1_loss_ce: 0.4233 (0.5472) frame_1_aux1_loss_giou: 0.5211 (0.5116) frame_1_aux2_loss_bbox: 0.1952 (0.1894) frame_1_aux2_loss_ce: 0.4143 (0.5164) frame_1_aux2_loss_giou: 0.5061 (0.5071) frame_1_aux3_loss_bbox: 0.1958 (0.1895) frame_1_aux3_loss_ce: 0.4043 (0.5034) frame_1_aux3_loss_giou: 0.5039 (0.5071) frame_1_aux4_loss_bbox: 0.1966 (0.1894) frame_1_aux4_loss_ce: 0.4149 (0.5069) frame_1_aux4_loss_giou: 0.5008 (0.5064) frame_1_loss_bbox: 0.1986 (0.1894) frame_1_loss_ce: 0.4315 (0.5273) frame_1_loss_giou: 0.5012 (0.5059) time: 2.9063 data: 0.6959 max mem: 5583
```
Does it look ok to you?
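
As a rough sanity check on that log line (not from the thread), the displayed ETA is consistent with the per-iteration time it reports; a minimal sketch of the arithmetic, using values copied from the log above:

```python
from datetime import timedelta

iters_per_epoch = 34268   # from "Epoch: [0] [ 20/34268]"
done_iters = 20
time_per_iter = 2.9063    # seconds, from "time: 2.9063"

eta = timedelta(seconds=(iters_per_epoch - done_iters) * time_per_iter)
print(eta)  # ~1 day, 3:39 -- same ballpark as the logged "eta: 1 day, 4:52:33"
```

At roughly 3 s per iteration, one epoch takes about 27-28 hours here, i.e., around 3x the ~8 hours per epoch implied by the 6d18h total reported above.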

@ASMIftekhar
Author

Also, if you still have it, could you provide the log file from your training on BDD100K?

@lebron-2016

lebron-2016 commented Dec 17, 2023

(Quoting ASMIftekhar's training command and log output from the comment above.)

Hello, which CUDA version and which versions of torch and torchvision did you use when training the BDD100K branch? I encountered this error during training; do you know the solution?

[screenshot of the error]

This problem has been troubling me for many days, and I would really appreciate your help. Thanks!
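
Not part of the original thread, but a quick way to report the versions being asked about (these are standard PyTorch/torchvision attributes):

```python
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA (build):", torch.version.cuda)          # CUDA version PyTorch was built against
print("CUDA available:", torch.cuda.is_available()) # whether the runtime can see a GPU
```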
