
Questions regarding pre-training #32

Closed
itsnamgyu opened this issue Oct 23, 2023 · 5 comments

itsnamgyu commented Oct 23, 2023

Thank you for the open-source code and detailed documentation of the experiments. I have a few questions about pre-training that I can't seem to find answered in the paper or appendix. Could you please help?

  1. How long did pre-training take on the 80 V100 GPUs?
     • The learning rate schedule in the appendix suggests that pre-training has a total of 200,000 steps and fine-tuning has 100,000 steps, but I'm not sure (for one, I don't think the number of fine-tuning steps would be the same for all downstream tasks).
  2. What is the temporal resolution of the initial states of ERA5 used in training and evaluation?
  3. Are there any results on ERA5 performance without pre-training (at all)?

Thanks.

tung-nd (Collaborator) commented Oct 27, 2023

Hi, thanks for your interest in ClimaX.

  1. It took 2-3 days.
     You're right about the scheduler. We used the same number of fine-tuning epochs (50) for all tasks, but since different tasks have different numbers of data points, the number of fine-tuning steps differs between tasks. You can follow this schedule: linear warmup for 5 epochs, then cosine decay for 45 epochs (see the first sketch after this list).
  2. What do you mean by temporal resolution? We use only 1 step in the input, which means it has a shape of V x H x W.
  3. We ran some experiments in the early stage of the project, and the takeaway is that ViT overfits heavily without any pretraining. You need very heavy dropout rates and/or weight decay to mitigate this (see the second sketch below).
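
For concreteness, here is a minimal sketch of that schedule in PyTorch. The `steps_per_epoch` value is a hypothetical placeholder (it depends on the dataset and batch size), and the model/optimizer are stand-ins rather than the actual ClimaX training code:

```python
import math
import torch

# Hypothetical value: depends on dataset size and batch size.
steps_per_epoch = 1000
warmup_steps = 5 * steps_per_epoch   # linear warmup for 5 epochs
total_steps = 50 * steps_per_epoch   # 50 fine-tuning epochs in total

model = torch.nn.Linear(8, 8)        # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

def lr_lambda(step: int) -> float:
    """Linear warmup for 5 epochs, then cosine decay over the remaining 45."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# The training loop would call optimizer.step() then scheduler.step() per batch.
```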
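
And on point 3, a sketch of the kind of heavy regularization meant here; the concrete `drop_rate` and `weight_decay` values are made up for illustration and would need tuning:

```python
import timm
import torch

# Illustrative only: a ViT trained from scratch with strong regularization.
# These values are hypothetical, not taken from the paper.
vit = timm.create_model("vit_base_patch16_224", pretrained=False, drop_rate=0.3)
optimizer = torch.optim.AdamW(vit.parameters(), lr=5e-4, weight_decay=0.1)
```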

itsnamgyu (Author) commented Oct 27, 2023

Thanks for the response!

Regarding the temporal resolution, I am interested in the frequency of the initial states (sampled from the hourly ERA5 data) included in the training set. For instance, FourCastNet and GraphCast use assimilations at 00, 06, 12, and 18 UTC, resulting in four training samples per day. On the other hand, Pangu-Weather employs hourly initial states.

Could you provide details on how often ERA5 data is sampled for ClimaX's training set?

rejuvyesh (Collaborator) commented

Hi @itsnamgyu! Thanks for your question. One key difference in the inference setup for ClimaX vs. the other methods you mentioned is that we condition on the lead time $\Delta t$ at which we want to make predictions, so we can predict at any future hour in a single forward pass instead of doing autoregressive rollouts as in those methods.

Specifically, the parameter

predict_range: int = 6

allows us to control the maximum prediction range we want to fine-tune our model on. We used the available hourly ERA5 data here.
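
To make the single-forward-pass point concrete, here is a toy sketch of lead-time conditioning. The model, shapes, and the channel-concatenation trick are all hypothetical stand-ins for illustration, not the actual ClimaX architecture:

```python
import torch

class ToyLeadTimeModel(torch.nn.Module):
    """Stand-in forecaster conditioned on the lead time (not the real ClimaX)."""

    def __init__(self, num_vars: int):
        super().__init__()
        # Hypothetical conditioning: broadcast delta-t as an extra input channel.
        self.proj = torch.nn.Conv2d(num_vars + 1, num_vars, kernel_size=1)

    def forward(self, x: torch.Tensor, dt_hours: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        dt_map = dt_hours.view(b, 1, 1, 1).expand(b, 1, h, w)
        return self.proj(torch.cat([x, dt_map], dim=1))

V, H, W = 48, 32, 64                  # variables x lat x lon (single time step)
x = torch.randn(1, V, H, W)           # one initial state of shape V x H x W
model = ToyLeadTimeModel(V)

# Any horizon in one forward pass, no autoregressive rollout:
y6 = model(x, torch.tensor([6.0]))    # 6-hour forecast
y72 = model(x, torch.tensor([72.0]))  # 72-hour forecast
```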

tung-nd (Collaborator) commented Nov 6, 2023

@itsnamgyu I assume your question has been answered. Feel free to reopen it if you have more questions.

tung-nd closed this as completed Nov 6, 2023
itsnamgyu (Author) commented

Oh yes, I was asking about this information: "We used the available hourly ERA5 data here." Thanks for the help :)
