"Big new design" for nowcasting_dataset #213
Detailed Description
Over the last few weeks, we've come up with a bunch of ideas for how to simplify nowcasting_dataset.
This issue exists to keep track of all of those "new design" issues, to discuss how they hang together into a single coherent design, and to discuss how to implement that design in a sequence of easy-to-digest chunks :)
The plan:
In terms of sequencing this work, I'm now thinking I'll do something along the lines of this:
First, do some stand-alone "preparatory" work. Specifically:
- Simplify the calculation of available datetimes across all DataSources #204
- Tweak the way DataSources are represented in the Configuration model #217
- Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. #223
- Change each `DataSource` subclass' constructor arguments to align with the new YAML config field names #270
- Assert there's no overlap between train, test and validation datetimes at end of `split()` function #299
Implement and write tests for some of the functions in the draft design in the GitHub comment below (for now, these functions won't actually be called in the code):
- `sample_spatial_and_temporal_locations_for_examples()` (done in PR Implement `DataSourceList.sample_spatial_and_temporal_locations_for_examples()` #278)
- `DataSource.prepare_batches()`
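For orientation, here's a minimal sketch of how those two functions might be shaped. The real signatures live in the draft-design comment below, so the argument names, return types, and the `DataSourceList`/`DataSource` split shown here are assumptions for illustration only:

```python
import pandas as pd


class DataSourceList(list):
    def sample_spatial_and_temporal_locations_for_examples(
        self, t0_datetimes: pd.DatetimeIndex, n_examples: int
    ) -> pd.DataFrame:
        """Randomly sample one (t0_datetime, x_centre, y_centre) row per example."""
        ...


class DataSource:
    def prepare_batches(
        self, spatial_and_temporal_locations_of_each_example: pd.DataFrame, dst_path: str
    ) -> None:
        """Load data for each example, assemble batches, and save them to dst_path."""
        ...
```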
Then, the big one (where the code size will hopefully shrink a lot!):
- Re-write `prepare_ml_data.py` so it looks like the sketch above. Use `click` to pass in command-line params (see the sketch after this list).
- Implement the remaining utility functions.
- Change `DataSourceList` into `Manager`; and maintain DataSources in a `dict` instead of a list? #298
- Delete the unnecessary code:
  - Remove PyTorch from the code #86
  - Detect & remove unused functions #170
  - Remove `n_timesteps_per_batch`?
- Stop using the first entry in `DataSourceList` as the one which defines the geospatial location of each example.
- Search through the code for TODOs associated with the issues in this big design change.
- Allow user to configure the frequency of the t0 datetimes in the config yaml #277
- Document the new architecture. Ideally with a diagram 🙂
- Save `history_length` and `forecast_length` to disk for each modality, and use these in dataloader #293
- Turn off temporal interpolation of numerical weather predictions (NWPs) #135
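Here's a rough sketch of what the `click`-based entry point for `prepare_ml_data.py` could look like. The option name, default path, and the `Manager` calls in the comments are assumptions, not the final design:

```python
import click


@click.command()
@click.option(
    "--config-filename",
    default="config.yaml",
    help="Path to the YAML configuration file.",
)
def main(config_filename: str) -> None:
    """Pre-prepare batches of ML training data and write them to disk."""
    # Hypothetical top-level flow:
    #   manager = Manager()
    #   manager.load_yaml_configuration(config_filename)
    #   manager.initialise_data_sources()
    #   manager.create_batches()
    click.echo(f"Preparing batches using {config_filename}...")


if __name__ == "__main__":
    main()
```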
Remove:
- All of `dataset/datamodule.py`
- `dataset.datasets.NowcastingDataset`
- `dataset.datasets.worker_init_fn()`
- All the `batch_to_dataset` and `dataset_to_batch` stuff
- All the `to_numpy` and `from_numpy` stuff (assuming we can go straight from `xr.Dataset` to `torch.Tensor`; see the snippet below)
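On that last point, the hoped-for direct path from `xr.Dataset` to `torch.Tensor` could be as simple as the following sketch (assuming the variable is already a numeric dtype; the names are illustrative):

```python
import numpy as np
import torch
import xarray as xr

# Toy stand-in for a pre-prepared batch; in practice this would be loaded from disk.
batch = xr.Dataset(
    {"data": (("example", "time", "y", "x"), np.random.rand(32, 4, 64, 64).astype(np.float32))}
)

# torch.from_numpy() wraps the underlying numpy array directly,
# so no separate to_numpy()/from_numpy() layer is needed.
tensor = torch.from_numpy(batch["data"].values)
print(tensor.shape)  # torch.Size([32, 4, 64, 64])
```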
Related issues
- Machine-readable schema & validator for `xarray.Dataset` #211
- Can we simplify the code by always keeping the data in one data type (e.g. `xr.DataArray`) per modality? #209
- Use independent processes for each "modality" #202
- Example --> Pydantic #166
- Remove time_30 and time_5 #230
- Experiment with loading entire batches at once #212
- Compute datetime features on-the-fly in nowcasting_dataloader. Remove datetime features from the on-disk batches. #208
- Simplify the calculation of available datetimes across all DataSources #204
- Implement a thin "data loading" layer to help ML training #97
- refactor DataSources to have different functions for selecting timesteps, selecting geo locations, and post-processing #48
- Use NamedTensors to name dimensions #25
- Do we still need all the representations of sequence lengths in `DataSource`? #219
- Detect & remove unused functions #170
A bit more context
I think I've made our lives far harder than they need to be by trying to support two different use-cases:
- Loading training data on-the-fly from multiple Zarr files during ML training; and
- Pre-preparing batches.
I think we can make life way easier by dropping support for use-case 1 :)
Here's the broad proposal that it'd be great to discuss:
We drop support for loading data directly from Zarr on-the-fly during ML training (which we haven't done for months, and - now that we're using large NWP images - it would be far too slow). nowcasting_dataset becomes laser-focused on pre-preparing batches (just as we use it now).
This allows us to completely rip out PyTorch from nowcasting_dataset (#86), and enables each "modality" to stay in a single data type throughout nowcasting_dataset (#209): e.g. satellite data stays in an xr.Dataset. Each modality would be processed concurrently in different processes, and output into different directories (e.g. train/satellite/ and train/nwp/) (#202).
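As a rough illustration of that per-modality concurrency (the function and directory names here are hypothetical, not the actual design):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


# Hypothetical per-modality batch-preparation functions, for illustration only.
def prepare_satellite_batches(dst_path: Path) -> None: ...
def prepare_nwp_batches(dst_path: Path) -> None: ...


def prepare_all_modalities(dst_root: Path) -> None:
    modalities = {
        "satellite": prepare_satellite_batches,
        "nwp": prepare_nwp_batches,
    }
    # Each modality runs in its own process and writes to its own directory,
    # e.g. train/satellite/ and train/nwp/.
    with ProcessPoolExecutor() as executor:
        futures = [
            executor.submit(prepare, dst_root / name)
            for name, prepare in modalities.items()
        ]
        for future in futures:
            future.result()  # Re-raise any exception from a worker process.


if __name__ == "__main__":
    prepare_all_modalities(Path("prepared_ML_training_data/train"))
```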
Inspired by and making use of @peterdudfield's Pydantic PR (#195), we'd have a formal schema for the data structures in nowcasting_dataset (#211).
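For example, a schema for one modality might look something like this minimal pydantic (v1-style) sketch; the class and field names are illustrative, not the actual #211 design:

```python
import numpy as np
from pydantic import BaseModel, Field, validator


class Satellite(BaseModel):
    """Hypothetical schema for one batch of satellite data."""

    class Config:
        arbitrary_types_allowed = True  # Allow numpy arrays as field types.

    data: np.ndarray = Field(..., description="Shape: (example, time, y, x, channel).")

    @validator("data")
    def data_must_be_finite(cls, value: np.ndarray) -> np.ndarray:
        assert np.isfinite(value).all(), "Satellite data contains NaNs or infs!"
        return value
```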
The ultimate aim is to simplify the code (I'm lazy!), whilst keeping all the useful functionality, and making the code easier to extend & maintain 🙂
Of course, we'll still use a PyTorch dataloader to load the pre-prepared batches off disk into an ML model. But that's fine, and it should work in a very similar (maybe identical?) fashion to how it works now 🙂
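For instance, the loading side could stay roughly as simple as this sketch (the file layout and variable names are assumptions):

```python
from pathlib import Path

import torch
import xarray as xr


class PreparedBatchDataset(torch.utils.data.Dataset):
    """Loads pre-prepared batches of one modality straight off disk."""

    def __init__(self, batch_dir: Path):
        self.filenames = sorted(batch_dir.glob("*.nc"))

    def __len__(self) -> int:
        return len(self.filenames)

    def __getitem__(self, index: int) -> torch.Tensor:
        batch = xr.load_dataset(self.filenames[index])
        return torch.from_numpy(batch["data"].values)


# Each file already holds a full batch, so batch_size=None disables re-batching.
dataloader = torch.utils.data.DataLoader(
    PreparedBatchDataset(Path("prepared_ML_training_data/train/satellite")),
    batch_size=None,
)
```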
I certainly can't claim to have thought this all through properly! And everything's up for discussion, of course!