
Why can't the model trained on two A100s achieve your performance? #47

Closed · Rzx520 opened this issue Oct 11, 2023 · 12 comments

Rzx520 commented Oct 11, 2023

A100 (80G); nothing else has changed.
{"K_AP50": 59.38904571533203, "K_P50": 21.074942637087915, "K_R50": 72.52758104006436, "U_AP50": 0.6464414000511169, "U_P50": 0.4288344914478119, "U_R50": 16.88679245283019, "epoch": 40}, "test_coco_eval_bbox": [14.671942710876465, 14.671942710876465, 78.46551513671875, 58.18337631225586, 64.30726623535156, 50.592430114746094, 29.676156997680664, 71.94124603271484, 56.22311782836914, 82.22350311279297, 27.28054428100586, 71.0342788696289, 22.341707229614258, 82.27958679199219, 71.79204559326172, 68.34331512451172, 49.77190017700195, 35.397483825683594, 71.02239227294922, 50.98625564575195, 83.90058135986328, 62.01821517944336, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6464414000511169], "epoch": 40, "n_parameters": 39742295}@orrzohar Thanks

orrzohar (Owner)

Hi @Rzx520,
Could you provide some context for your question? What did you do? What task are you talking about?

orrzohar self-assigned this Oct 11, 2023
Rzx520 (Author) commented Oct 11, 2023

I am just reproducing your model; there is no special context. @orrzohar

orrzohar (Owner)

You mean you used the weights I released?

Rzx520 (Author) commented Oct 11, 2023

No, I trained it with two A100s, and the dataset is the one from the instructions you provided.

orrzohar (Owner) commented Oct 11, 2023

In general, it is impossible to answer this question without any context, as I do not have access to 80GB A100s and cannot run the experiment myself to see the variation. Even if you took my pre-trained weights and only evaluated them on a different system, you would see some variation of ±1, as validated by others in previous issues.

To answer as best I can:

If you re-trained from scratch on a different system, even more variation is to be expected, and some hyperparameter tuning would most likely be needed. You may need to change the number of epochs you train for, the epoch at which you drop the lr, the lr itself, etc. The hyperparameters I used on my system will definitely not be optimal on all systems. Judging from your numbers, you may be over-training; however, I can only tell that from the training curves themselves.

As you are using 80GB GPUs, you can probably increase the batch size, which will make training more efficient and stable. However, when changing the batch size you will need to adjust the lr as well (a minimal scaling sketch follows below).
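For concreteness, here is a minimal sketch of the common linear-scaling heuristic; the baseline lr and batch sizes are placeholders for illustration, not values confirmed in this thread or enforced by the repository:

```python
# Minimal sketch: scale the lr with the effective (global) batch size.
# All concrete numbers here are assumptions for illustration only --
# check the repository's actual defaults before reusing them.

def scale_lr(base_lr: float, base_effective_batch: int, new_effective_batch: int) -> float:
    """Linear scaling rule: lr grows proportionally with the global batch size."""
    return base_lr * new_effective_batch / base_effective_batch

# Hypothetical baseline: 4 GPUs x batch 3 per GPU, lr = 2e-4.
# Hypothetical new setup: 2x A100-80GB with a larger per-GPU batch, e.g. 10.
new_lr = scale_lr(base_lr=2e-4,
                  base_effective_batch=4 * 3,
                  new_effective_batch=2 * 10)
print(f"suggested lr: {new_lr:.2e}")  # ~3.33e-04
```

Treat the scaled value only as a starting point; some setups do better with square-root scaling or a short lr sweep.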

NNs as a whole are affected by the training schedule, and if you use a different system, that schedule changes, which affects the final state of the model -- unless the model is very simple/insensitive, or robust to hyperparameters/randomness in training.

If you would like, please share the relevant information and I would be happy to help you optimize the hyperparameters to get similar results.

Rzx520 (Author) commented Oct 11, 2023

When I used two A100s, the parameter settings remained unchanged. I used the code and parameters you provided, including the batch size, lr, and other parameters. The only change was going from your 4 GPUs to 2 GPUs. @orrzohar

Rzx520 (Author) commented Oct 11, 2023

[Attached plots: s1, s2 — U_R50 and K_AP50 training curves]
These are the graphs of U_R50 and K_AP50 after 41 epochs of training. @orrzohar Thanks

orrzohar (Owner) commented Oct 11, 2023

You also used a different server, with different GPUs (I used A100 40GB GPUs) and a different number of GPUs, all of which will affect training. There are probably countless other differences I do not know about (OS, dependencies, etc.). Even as time has passed, some default Python dependencies may have changed.

When using a different system, there will be some variation and you will need to tune the hyperparameters. This is, in no small part, why we publish the weights of the models.

When re-training from scratch, I cannot guarantee complete replication of performance. With some hyperparameter tuning, you should be able to get within ±1 of the reported values (see #26, where results were reproduced on 3090s).

Looking at your training curves, I would probably reduce the lr_drop to ~125k iterations and proportionally reduce the overall number of epochs (a rough iteration-to-epoch conversion sketch follows below). You can overtrain the second stage of the training after the lr_drop and select the optimal model in terms of U_R/K_mAP. This should be relatively easy, as the model saves checkpoints every few epochs, so you could restart from one of those rather than all the way from scratch.
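Since an lr_drop in Deformable-DETR-style configs is usually specified in epochs, a back-of-the-envelope conversion from a target iteration count might look like the sketch below; the dataset size and batch sizes are placeholder assumptions, not values from this thread:

```python
# Back-of-the-envelope helper: convert a target iteration count for the
# lr drop into an epoch number, given the effective batch size.
# All concrete numbers below are placeholders, not values from this thread.

def lr_drop_epoch(target_iters: int, train_images: int, gpus: int, batch_per_gpu: int) -> int:
    """Epoch by which roughly target_iters optimizer steps have been taken."""
    iters_per_epoch = train_images // (gpus * batch_per_gpu)
    return max(1, round(target_iters / iters_per_epoch))

# Example with placeholder numbers: 60k training images, 2 GPUs, batch 5 per GPU.
print(lr_drop_epoch(125_000, train_images=60_000, gpus=2, batch_per_gpu=5))  # -> 21
```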

orrzohar (Owner)

Hi @Rzx520,

Did the performance improve when reducing lr_drop?

Best,
Orr

orrzohar (Owner)

Hi @Rzx520,

Did the performance improve when reducing lr_drop?
I'd like to know if I can add this to the readme.

Best,
Orr

Rzx520 (Author) commented Oct 19, 2023 via email

orrzohar (Owner)

Hi @Rzx520,
OK, great.
I am going to close this issue and update the required hyper-parameters for 2x A100 (80GB).
Orr
