
pre-training details #2

Closed
Huiimin5 opened this issue Apr 11, 2023 · 12 comments


@Huiimin5

Hi,
Could you please specify the meaning of "warmup learning rate=1e-6" in the pre-training stage? Does it mean that the learning rate starts from 1e-6 and linearly grows to 1.5e-4?

Additionally, for an image pair, the homography and color jittering augmentations should be applied to each image independently, right?

Thank you for your attention and time!

@PhilippeWeinzaepfel

Hi,

Thanks for your interest in our work.

Learning rate schedule

Sorry, the row "warmup learning rate=1e-6" in the appendix table is a mistake on our side and should not be there.
During the warm-up of 40 epochs, we linearly increase the learning rate from 0 to 1.5e-4.
We then use a cosine decay until epoch 800, but stop the training at epoch 400 as performance saturates.
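For reference, a minimal sketch of that schedule is given below (assuming an epoch-wise update and a final learning rate of 0; those two details are assumptions, not something stated above):

    import math

    # Minimal sketch of the schedule described above: linear warmup from 0 to
    # 1.5e-4 over 40 epochs, then cosine decay towards 0 at epoch 800
    # (training is stopped at epoch 400). The epoch-wise update and min_lr=0
    # are assumptions.
    def pretrain_lr(epoch, base_lr=1.5e-4, warmup_epochs=40, total_epochs=800, min_lr=0.0):
        if epoch < warmup_epochs:
            return base_lr * epoch / warmup_epochs
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))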

Augmentation

We use independent augmentations for the two images in each pair.
We actually found later that homography does not help at all, and even slightly decreases performance, so you can just ignore it.
For color jittering, we do not augment the hue but use standard values for brightness/contrast/saturation, i.e., ColorJitter(brightness=(0.6, 1.4), contrast=(0.6, 1.4), saturation=(0.6, 1.4), hue=0.0).
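For illustration, this corresponds to something like the following minimal sketch (assuming torchvision; the helper function is just for the example):

    import torchvision.transforms as T

    # Color jittering with the values above: no hue augmentation, standard
    # ranges for brightness/contrast/saturation.
    color_jitter = T.ColorJitter(brightness=(0.6, 1.4), contrast=(0.6, 1.4),
                                 saturation=(0.6, 1.4), hue=0.0)

    def augment_pair(img1, img2):
        # Calling the transform separately on each image samples independent
        # jitter parameters for the two views of the pair.
        return color_jitter(img1), color_jitter(img2)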

Best
Philippe

@Huiimin5
Author

Thank you so much for your clarification.

In Section D.1, it is claimed that the finetuning configurations are consistent with MultiMAE. But in the second table, many hyperparameters on NYUv2 are different, such as the batch size (128 vs. 16), learning rate (3e-5 vs. 1e-4), and number of epochs (1500 vs. 2000). Could you please specify where these configurations come from, or did you set them by searching for the optimal values in your setting? Besides, do you use the MultiMAE code for finetuning?

By the way, in Table 3, the performance of MAE is 79.6 while in MultiMAE, they report 85.1. Could you please specify how you get this number?

Thank you again for your time and attention!

@PhilippeWeinzaepfel

While our codebase for semantic segmentation on ADE20k and Taskonomy is based on the MultiMAE code, the one for NYUv2 has been developed independently. We have not heavily tuned it for any method; the results were obtained by simply changing the pre-training weights, so this might not be the optimal setup for MultiMAE, MAE, or even CroCo. To obtain better performance with CroCo, we could also leverage the decoder and reach over 88 Acc@1.25, see Table 7.

@PhilippeWeinzaepfel

PhilippeWeinzaepfel commented Apr 18, 2023

I have just tried the MultiMAE finetuning code for NYU depth with CroCo pretrained weights, and their finetuning setup indeed seems much better than the one we implemented.
Here is the Acc@1.25 (= delta_1) I obtain:

  • MAE: 85.1
  • CroCo: 87.8

So I would recommend directly using their finetuning code and reporting these numbers.

To be more complete, here are the val_stats output by their scripts:

  • MAE pretrained weights: {'rmse': 3196.1256103515625, 'rel': 0.12942856550216675, 'srel': 546.7084655761719, 'log10': 0.17717715352773666, 'delta_1': 0.8512685894966125, 'delta_2': 0.9644498229026794, 'delta_3': 0.9886034429073334, 'loss': 2.7408722639083862, 'depth_loss': 2.7408722639083862}
  • CroCo pretrained weights: {'rmse': 3025.2979736328125, 'rel': 0.12315808981657028, 'srel': 545.6929016113281, 'log10': 0.17656730860471725, 'delta_1': 0.8780348598957062, 'delta_2': 0.9566936492919922, 'delta_3': 0.9870030283927917, 'loss': 2.4889049530029297, 'depth_loss': 2.4889049530029297}

@Huiimin5
Author

Thank you so much for your detailed explanation.
Could you please also share the downstream depth estimation results initialized with MAE pretrained on the Habitat dataset and finetuned using the MultiMAE codebase?
Thank you again for your time!

@PhilippeWeinzaepfel

With "MAE Habitat", I got 84.0.

{'rmse': 3480.384033203125, 'rel': 0.14437608793377876, 'srel': 706.9155578613281, 'log10': 0.1991322711110115, 'delta_1': 0.8398233950138092, 'delta_2': 0.9496022164821625, 'delta_3': 0.9814011752605438, 'loss': 2.9076149463653564, 'depth_loss': 2.9076149463653564}

@Huiimin5
Author

Thank you so much for providing this result.
It seems I get a different number by finetuning with the checkpoint you provided.
Do you mind sharing the finetuning log of CroCo?
Thank you for your consideration.

@PhilippeWeinzaepfel

PhilippeWeinzaepfel commented Apr 28, 2023

You can find it here. The run was done after converting the weights to the MultiMAE format and using --num_global_tokens 0, as we do not have global tokens in the CroCo architecture.
nyucroco.stdout.txt
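For illustration only, the conversion amounts to remapping the checkpoint state dict into the layout the MultiMAE finetuning script expects; the file names, dictionary keys, and prefixes below are hypothetical, so treat this as a rough sketch rather than the actual conversion:

    import torch

    # Rough sketch (hypothetical keys and file names): load the CroCo
    # pre-trained checkpoint, keep only the encoder weights, and save them in
    # a MultiMAE-style checkpoint dictionary.
    ckpt = torch.load('CroCo.pth', map_location='cpu')
    state_dict = ckpt.get('model', ckpt)

    # Drop decoder/prediction-head weights (prefixes assumed), since only the
    # encoder is reused for finetuning.
    encoder_sd = {k: v for k, v in state_dict.items()
                  if not k.startswith(('dec', 'prediction_head'))}

    torch.save({'model': encoder_sd}, 'croco_multimae_format.pth')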

@Huiimin5
Author

Thank you so much for providing this log file.
I am curious how you got the final evaluation values in the last lines:
[screenshot of the extra evaluation output at the end of nyucroco.stdout.txt]
as the finetuning script ends at line 8790.
Could you please specify how you got these extra output lines?

@PhilippeWeinzaepfel

Yep. At the end of this script, they load the weights of the best model. While it is not present in the code, this is clearly meant to run the test with the best model, as they do in their other finetuning scripts. So I added the following lines of code to do so before running the script:

    # Evaluate the reloaded best model on the test loader and print the metrics
    test_stats = evaluate(model=model, tasks_loss_fn=tasks_loss_fn, data_loader=data_loader_test,
                          device=device, epoch=-1, in_domains=args.in_domains, mode='test', log_images=True,
                          return_all_layers=return_all_layers, standardize_depth=args.standardize_depth)
    print(test_stats)

@Huiimin5
Author

Huiimin5 commented May 1, 2023

Thank you for your clarification, but I find it difficult to understand the huge drop from the online validation-set performance to the test-set performance.
In my understanding, the test set (i.e., validation set) results have already been dumped into log.txt.
In my experiment, these dumped numbers are very close to the corresponding printouts in stdout.txt, with the former synchronized across all machines and the latter unsynchronized.
In other words, the difference between line 8794 and line 8791 in the previous screenshot should be insignificant.
Could you please specify what leads to such a huge performance drop?

I think we can rule out the difference between the best checkpoint and the last checkpoint, as they perform very similarly in my experiment.

As a gentle reminder, the argument max_val_images should be unset to use the full validation set for evaluation. However, in my experiment I did not observe a big performance difference between subset evaluation and full-validation-set evaluation, so I am not sure whether this could explain the performance drop in your case.

Thank you again for your time and patience!

@PhilippeWeinzaepfel

  • Difference between L8794 and L8791

It seems to be due only to the distributed setting.
I had deleted the checkpoints, so I reran the finetuning.
With the new run, I initially get this output:

(Eval) Epoch: [-1] [0/1] eta: 0:00:03 rmse: 2951.8828 (2951.8828) rel: 0.1080 (0.1080) srel: 455.3148 (455.3148) log10: 0.1533 (0.1533) delta_1: 0.9129 (0.9129) delta_2: 0.9736 (0.9736) delta_3: 0.9911 (0.9911) loss: 2.4124 (2.4124) depth_loss: 2.4124 (2.4124) time: 3.2422 data: 2.7493 max mem: 22481 
(Eval) Epoch: [-1] Total time: 0:00:03 (3.3992 s / it) 
* Loss 2.481 
* {'rmse': 3020.5174560546875, 'rel': 0.12364871427416801, 'srel': 550.2191009521484, 'log10': 0.1771184504032135, 'delta_1': 0.878442794084549, 'delta_2': 0.9569390416145325, 'delta_3': 0.9872540533542633, 'loss': 2.4809699058532715, 'depth_loss': 2.4809699058532715}

When enabling the `print` on all processes while testing, I got:

(Test) Epoch: [-1] [0/1] eta: 0:00:06 rmse: 3089.1521 (3089.1521) rel: 0.1393 (0.1393) srel: 645.1234 (645.1234) log10: 0.2010 (0.2010) delta_1: 0.8440 (0.8440) delta_2: 0.9403 (0.9403) delta_3: 0.9834 (0.9834) loss: 2.5496 (2.5496) depth_loss: 2.5496 (2.5496) time: 6.7801 data: 3.4942 max mem: 5493 
(Test) Epoch: [-1] [0/1] eta: 0:00:06 rmse: 2951.8828 (2951.8828) rel: 0.1080 (0.1080) srel: 455.3148 (455.3148) log10: 0.1533 (0.1533) delta_1: 0.9129 (0.9129) delta_2: 0.9736 (0.9736) delta_3: 0.9911 (0.9911) loss: 2.4124 (2.4124) depth_loss: 2.4124 (2.4124) time: 6.6569 data: 3.5448 max mem: 5489 
(Test) Epoch: [-1] Total time: 0:00:06 (6.9482 s / it) 
(Test) Epoch: [-1] Total time: 0:00:06 (6.8161 s / it) 
* Loss 2.481
* Loss 2.481 
{'rmse': 3020.5174560546875, 'rel': 0.12364871427416801, 'srel': 550.2191009521484, 'log10': 0.1771184504032135, 'delta_1': 0.878442794084549, 'delta_2': 0.9569390416145325, 'delta_3': 0.9872540533542633, 'loss': 2.4809699058532715, 'depth_loss': 2.4809699058532715} 
{'rmse': 3020.5174560546875, 'rel': 0.12364871427416801, 'srel': 550.2191009521484, 'log10': 0.1771184504032135, 'delta_1': 0.878442794084549, 'delta_2': 0.9569390416145325, 'delta_3': 0.9872540533542633, 'loss': 2.4809699058532715, 'depth_loss': 2.4809699058532715}

So there is a huge difference between the batches of the two processes; they should also be quite imbalanced in terms of number of images, due to the batch size of 96 (64*1.5) and the dataset size of 100; so doing a global average is probably not really fair.
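To make the averaging issue concrete, here is a toy illustration (the per-process means and image counts below are made up, not the actual ones):

    # Toy example with made-up numbers: averaging the per-process means
    # without weighting them by the number of images each process handled
    # differs from the true per-image average when the split is imbalanced.
    per_process = [
        {'mean_delta_1': 0.90, 'n_images': 96},  # hypothetical split
        {'mean_delta_1': 0.70, 'n_images': 4},
    ]

    unweighted = sum(p['mean_delta_1'] for p in per_process) / len(per_process)
    weighted = (sum(p['mean_delta_1'] * p['n_images'] for p in per_process)
                / sum(p['n_images'] for p in per_process))

    print(unweighted)  # 0.80
    print(weighted)    # 0.892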

I then launched the test script on a single GPU with a 2x larger batch size (which covers the 100 validation/test images) and got:

(Test) Epoch: [-1]  [0/1]  eta: 0:00:10  rmse: 3021.9480 (3021.9480)  rel: 0.1238 (0.1238)  srel: 551.1194 (551.1194)  log10: 0.1789 (0.1789)  delta_1: 0.8781 (0.8781)  delta_2: 0.9568 (0.9568)  delta_3: 0.9872 (0.9872)  loss: 2.4809 (2.4809)  depth_loss: 2.4809 (2.4809)  time: 10.3641  data: 6.1422  max mem: 9358
(Test) Epoch: [-1] Total time: 0:00:10 (10.7133 s / it)
* Loss 2.481
{'rmse': 3021.947998046875, 'rel': 0.12379693239927292, 'srel': 551.1194458007812, 'log10': 0.17894227802753448, 'delta_1': 0.8781160116195679, 'delta_2': 0.9567808508872986, 'delta_3': 0.9872171878814697, 'loss': 2.480867862701416, 'depth_loss': 2.480867862701416}
  • max_val_images
    The MultiMAE setup is anyway not directly comparable to the state of the art, due to the resizing, etc.
    All the numbers above are with max_val_images set to 100, as in their GitHub instructions.
    When validating on all images, I get higher values (on 1 GPU, with a batch_size of 1 so that the global average is correctly computed over all images):
{'rmse': 2364.8145019811227, 'rel': 0.10753235335971602, 'srel': 408.2935388460072, 'log10': 0.13091224415187896, 'delta_1': 0.9001198714153644, 'delta_2': 0.9783268600302842, 'delta_3': 0.9945694150727823, 'loss': 2.678799651130259, 'depth_loss': 2.678799651130259}
