"Big new design" for nowcasting_dataset #213
Detailed Description
Over the last few weeks, we've come up with a bunch of ideas for how to simplify nowcasting_dataset.
This issue exists to keep track of all of those "new design" issues, to discuss how they hang together into a single coherent design, and to discuss how to implement that design in a sequence of easy-to-digest chunks :)
The plan:
In terms of sequencing this work, I'm now thinking I'll do something along the lines of this:
First, do some stand-alone "preparatory" work. Specifically:
- Simplify the calculation of available datetimes across all DataSources #204
- Tweak the way DataSources are represented in the Configuration model #217
- Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. #223
- Change each `DataSource` subclass' constructor arguments to align with the new YAML config field names #270
- Assert there's no overlap between train, test and validation datetimes at end of `split()` function #299
Implement and write tests for some of the functions in the draft design in the GitHub comment below (for now, these functions won't actually be called in the code):
- `sample_spatial_and_temporal_locations_for_examples()` (done in PR Implement `DataSourceList.sample_spatial_and_temporal_locations_for_examples()` #278)
- `DataSource.prepare_batches()`
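For orientation, here's a minimal sketch of how those two functions might be shaped. The real signatures live in the draft-design comment below, so the argument names, return types, and the `DataSourceList`/`DataSource` split shown here are assumptions for illustration only:

```python
import pandas as pd


class DataSourceList(list):
    def sample_spatial_and_temporal_locations_for_examples(
        self, t0_datetimes: pd.DatetimeIndex, n_examples: int
    ) -> pd.DataFrame:
        """Randomly sample one (t0_datetime, x_centre, y_centre) row per example."""
        ...


class DataSource:
    def prepare_batches(
        self, spatial_and_temporal_locations_of_each_example: pd.DataFrame, dst_path: str
    ) -> None:
        """Load data for each example, assemble batches, and save them to dst_path."""
        ...
```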
Then, the big one (where the code size will hopefully shrink a lot!):
- Re-write `prepare_ml_data.py` so it looks like the sketch above. Use `click` to pass in command-line params (see the sketch after this list).
- Implement the remaining utility functions.
- Change `DataSourceList` into `Manager`; and maintain DataSources in a `dict` instead of a list? #298
- Delete the unnecessary code:
  - Remove PyTorch from the code #86
  - Detect & remove unused functions #170
  - Remove `n_timesteps_per_batch`?
- Stop using the first entry in `DataSourceList` as the one which defines the geospatial location of each example.
- Search through the code for TODOs associated with the issues in this big design change.
- Allow user to configure the frequency of the t0 datetimes in the config yaml #277
- Document the new architecture. Ideally with a diagram 🙂
- Save `history_length` and `forecast_length` to disk for each modality, and use these in dataloader #293
- Turn off temporal interpolation of numerical weather predictions (NWPs) #135
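Here's a rough sketch of what the `click`-based entry point for `prepare_ml_data.py` could look like. The option name, default path, and the `Manager` calls in the comments are assumptions, not the final design:

```python
import click


@click.command()
@click.option(
    "--config-filename",
    default="config.yaml",
    help="Path to the YAML configuration file.",
)
def main(config_filename: str) -> None:
    """Pre-prepare batches of ML training data and write them to disk."""
    # Hypothetical top-level flow:
    #   manager = Manager()
    #   manager.load_yaml_configuration(config_filename)
    #   manager.initialise_data_sources()
    #   manager.create_batches()
    click.echo(f"Preparing batches using {config_filename}...")


if __name__ == "__main__":
    main()
```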
Remove:
- All of `dataset/datamodule.py`
- `dataset.datasets.NowcastingDataset`
- `dataset.datasets.worker_init_fn()`
- All the `batch_to_dataset` and `dataset_to_batch` stuff
- All the `to_numpy` and `from_numpy` stuff (assuming we can go straight from `xr.Dataset` to `torch.Tensor`; see the snippet below)
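On that last point, the hoped-for direct path from `xr.Dataset` to `torch.Tensor` could be as simple as the following sketch (assuming the variable is already a numeric dtype; the names are illustrative):

```python
import numpy as np
import torch
import xarray as xr

# Toy stand-in for a pre-prepared batch; in practice this would be loaded from disk.
batch = xr.Dataset(
    {"data": (("example", "time", "y", "x"), np.random.rand(32, 4, 64, 64).astype(np.float32))}
)

# torch.from_numpy() wraps the underlying numpy array directly,
# so no separate to_numpy()/from_numpy() layer is needed.
tensor = torch.from_numpy(batch["data"].values)
print(tensor.shape)  # torch.Size([32, 4, 64, 64])
```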
Related issues
- Machine-readable schema & validator for `xarray.Dataset` #211
- Can we simplify the code by always keeping the data in one data type (e.g. `xr.DataArray`) per modality? #209
- Use independent processes for each "modality" #202
- Example --> Pydantic #166
- Remove time_30 and time_5 #230
- Experiment with loading entire batches at once #212
- Compute datetime features on-the-fly in nowcasting_dataloader. Remove datetime features from the on-disk batches. #208
- Simplify the calculation of available datetimes across all DataSources #204
- Implement a thin "data loading" layer to help ML training #97
- refactor DataSources to have different functions for selecting timesteps, selecting geo locations, and post-processing #48
- Use NamedTensors to name dimensions #25
- Do we still need all the representations of sequence lengths in `DataSource`? #219
- Detect & remove unused functions #170
A bit more context
I think I've made our lives far harder than they need to be by trying to support two different use-cases:
- Loading training data on-the-fly from multiple Zarr files during ML training; and
- Pre-preparing batches.
I think we can make life way easier by dropping support for use-case 1 :)
Here's the broad proposal that it'd be great to discuss:
We drop support for loading data directly from Zarr on-the-fly during ML training (which we haven't done for months, and - now that we're using large NWP images - it would be far too slow). nowcasting_dataset becomes laser-focused on pre-preparing batches (just as we use it now).
This allows us to completely rip out PyTorch from nowcasting_dataset (#86), and enables each "modality" to stay in a single data type throughout nowcasting_dataset (#209): e.g. satellite data stays in an xr.Dataset. Each modality would be processed concurrently in different processes, and output into different directories (e.g. train/satellite/ and train/nwp/) (#202).
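As a rough illustration of that per-modality concurrency (the function and directory names here are hypothetical, not the actual design):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


# Hypothetical per-modality batch-preparation functions, for illustration only.
def prepare_satellite_batches(dst_path: Path) -> None: ...
def prepare_nwp_batches(dst_path: Path) -> None: ...


def prepare_all_modalities(dst_root: Path) -> None:
    modalities = {
        "satellite": prepare_satellite_batches,
        "nwp": prepare_nwp_batches,
    }
    # Each modality runs in its own process and writes to its own directory,
    # e.g. train/satellite/ and train/nwp/.
    with ProcessPoolExecutor() as executor:
        futures = [
            executor.submit(prepare, dst_root / name)
            for name, prepare in modalities.items()
        ]
        for future in futures:
            future.result()  # Re-raise any exception from a worker process.


if __name__ == "__main__":
    prepare_all_modalities(Path("prepared_ML_training_data/train"))
```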
Inspired by and making use of @peterdudfield's Pydantic PR (#195), we'd have a formal schema for the data structures in nowcasting_dataset (#211).
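For example, a schema for one modality might look something like this minimal pydantic (v1-style) sketch; the class and field names are illustrative, not the actual #211 design:

```python
import numpy as np
from pydantic import BaseModel, Field, validator


class Satellite(BaseModel):
    """Hypothetical schema for one batch of satellite data."""

    class Config:
        arbitrary_types_allowed = True  # Allow numpy arrays as field types.

    data: np.ndarray = Field(..., description="Shape: (example, time, y, x, channel).")

    @validator("data")
    def data_must_be_finite(cls, value: np.ndarray) -> np.ndarray:
        assert np.isfinite(value).all(), "Satellite data contains NaNs or infs!"
        return value
```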
The ultimate aim is to simplify the code (I'm lazy!), whilst keeping all the useful functionality, and making the code easier to extend & maintain 🙂
Of course, we'll still use a PyTorch dataloader to load the pre-prepared batches off disk into an ML model. But that's fine, and it should work in a very similar (maybe identical?) fashion to how it works now 🙂
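For instance, the loading side could stay roughly as simple as this sketch (the file layout and variable names are assumptions):

```python
from pathlib import Path

import torch
import xarray as xr


class PreparedBatchDataset(torch.utils.data.Dataset):
    """Loads pre-prepared batches of one modality straight off disk."""

    def __init__(self, batch_dir: Path):
        self.filenames = sorted(batch_dir.glob("*.nc"))

    def __len__(self) -> int:
        return len(self.filenames)

    def __getitem__(self, index: int) -> torch.Tensor:
        batch = xr.load_dataset(self.filenames[index])
        return torch.from_numpy(batch["data"].values)


# Each file already holds a full batch, so batch_size=None disables re-batching.
dataloader = torch.utils.data.DataLoader(
    PreparedBatchDataset(Path("prepared_ML_training_data/train/satellite")),
    batch_size=None,
)
```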
I certainly can't claim to have thought this all through properly! And everything's up for discussion, of course!