
Questions regarding pre-training #32

Closed
itsnamgyu opened this issue Oct 23, 2023 · 5 comments

itsnamgyu commented Oct 23, 2023

Thank you for the open-source code and detailed documentation of the experiments. I have a few questions about pre-training that I can't seem to find answered in the paper or appendix. Could you please help?

  1. How long did pre-training take on the 80 V100 GPUs?
     • The learning rate schedule in the appendix suggests that pre-training has a total of 200,000 steps and fine-tuning has 100,000 steps, but I'm not sure (for one, I don't think the number of fine-tuning steps would be the same for all downstream tasks).
  2. What is the temporal resolution of the initial states of ERA5 used in training and evaluation?
  3. Are there any results on ERA5 performance without pre-training (at all)?

Thanks.

tung-nd (Collaborator) commented Oct 27, 2023

Hi, thanks for your interest in ClimaX.

  1. It took 2-3 days.
     You're right about the scheduler. We used the same number of fine-tuning epochs (50) for all tasks, but since different tasks have different numbers of data points, the number of fine-tuning steps differs between tasks. You can follow this schedule: linear warmup for 5 epochs, then cosine decay for 45 epochs (see the first sketch after this list).
  2. What do you mean by temporal resolution? We use only 1 step in the input, which means it has a shape of V x H x W.
  3. We ran some experiments in the early stage of the project, and the takeaway is that ViT overfits heavily without any pretraining. You need very heavy dropout rates and/or weight decay to mitigate this (see the second sketch below).
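
For concreteness, here is a minimal sketch of that schedule in PyTorch. The `steps_per_epoch` value is a hypothetical placeholder (it depends on the dataset and batch size), and the model/optimizer are stand-ins rather than the actual ClimaX training code:

```python
import math
import torch

# Hypothetical value: depends on dataset size and batch size.
steps_per_epoch = 1000
warmup_steps = 5 * steps_per_epoch   # linear warmup for 5 epochs
total_steps = 50 * steps_per_epoch   # 50 fine-tuning epochs in total

model = torch.nn.Linear(8, 8)        # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

def lr_lambda(step: int) -> float:
    """Linear warmup for 5 epochs, then cosine decay over the remaining 45."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# The training loop would call optimizer.step() then scheduler.step() per batch.
```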
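
And on point 3, a sketch of the kind of heavy regularization meant here; the concrete `drop_rate` and `weight_decay` values are made up for illustration and would need tuning:

```python
import timm
import torch

# Illustrative only: a ViT trained from scratch with strong regularization.
# These values are hypothetical, not taken from the paper.
vit = timm.create_model("vit_base_patch16_224", pretrained=False, drop_rate=0.3)
optimizer = torch.optim.AdamW(vit.parameters(), lr=5e-4, weight_decay=0.1)
```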

itsnamgyu (Author) commented Oct 27, 2023

Thanks for the response!

Regarding the temporal resolution, I am interested in the frequency of the initial states (sampled from the hourly ERA5 data) included in the training set. For instance, FourCastNet and GraphCast use assimilations at 00, 06, 12, and 18 UTC, resulting in four training samples per day. On the other hand, Pangu-Weather employs hourly initial states.

Could you provide details on how often ERA5 data is sampled for ClimaX's training set?

rejuvyesh (Collaborator) commented

Hi @itsnamgyu! Thanks for your question. One key difference in the inference setup for ClimaX vs. the other methods you mentioned is that we condition on the lead time $\Delta t$ at which we want to make predictions, so we can predict at any future hour in a single forward pass instead of doing autoregressive rollouts as in those methods.

Specifically, the parameter

predict_range: int = 6

allows us to control the maximum prediction range we want to fine-tune our model on. We used the available hourly ERA5 data here.
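
To make the single-forward-pass point concrete, here is a toy sketch of lead-time conditioning. The model, shapes, and the channel-concatenation trick are all hypothetical stand-ins for illustration, not the actual ClimaX architecture:

```python
import torch

class ToyLeadTimeModel(torch.nn.Module):
    """Stand-in forecaster conditioned on the lead time (not the real ClimaX)."""

    def __init__(self, num_vars: int):
        super().__init__()
        # Hypothetical conditioning: broadcast delta-t as an extra input channel.
        self.proj = torch.nn.Conv2d(num_vars + 1, num_vars, kernel_size=1)

    def forward(self, x: torch.Tensor, dt_hours: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        dt_map = dt_hours.view(b, 1, 1, 1).expand(b, 1, h, w)
        return self.proj(torch.cat([x, dt_map], dim=1))

V, H, W = 48, 32, 64                  # variables x lat x lon (single time step)
x = torch.randn(1, V, H, W)           # one initial state of shape V x H x W
model = ToyLeadTimeModel(V)

# Any horizon in one forward pass, no autoregressive rollout:
y6 = model(x, torch.tensor([6.0]))    # 6-hour forecast
y72 = model(x, torch.tensor([72.0]))  # 72-hour forecast
```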

tung-nd (Collaborator) commented Nov 6, 2023

@itsnamgyu I assume your question has been answered. Feel free to reopen it if you have more questions.

tung-nd closed this as completed Nov 6, 2023
itsnamgyu (Author) commented

Oh yes, I was asking about this information: "We used the available hourly ERA5 data here." Thanks for the help :)
