This repository was archived by the owner on Sep 11, 2023. It is now read-only.

"Big new design" for nowcasting_dataset #213

@JackKelly

Detailed Description

Over the last few weeks, we've come up with a bunch of ideas for how to simplify nowcasting_dataset.

This issue exists to keep track of all of those "new design" issues, to discuss how they hang together into a single coherent design, and to discuss how to implement that design in a sequence of easy-to-digest chunks :)

The plan:

In terms of sequencing this work, I'm now thinking of something along these lines:

First, do some stand-alone "preparatory" work. Specifically:

Implement and write tests for some of the functions from the draft design in the GitHub comment below. (For now, these functions won't actually be called in the code.)

Then, the big one (where the code size will hopefully shrink a lot!):

Remove:

  • All of dataset/datamodule.py
  • dataset.datasets.NowcastingDataset
  • dataset.datasets.worker_init_fn()
  • All the batch_to_dataset and dataset_to_batch stuff
  • All the to_numpy and from_numpy stuff (assuming we can go straight from xr.Dataset to torch.Tensor; see the sketch after this list)
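
As a sanity check that the to_numpy/from_numpy layer really is removable, here's a minimal sketch of going straight from an xr.Dataset to a torch.Tensor. The variable name "data" is an assumption for illustration; the real Datasets may use different names:

```python
import numpy as np
import torch
import xarray as xr


def xr_to_tensor(dataset: xr.Dataset, variable: str = "data") -> torch.Tensor:
    """Convert one variable of an xr.Dataset straight to a torch.Tensor.

    The variable name "data" is hypothetical.
    """
    # .values gives a plain numpy array; torch.from_numpy wraps it without
    # copying (the astype() call is the only copy here).
    array = dataset[variable].values.astype(np.float32)
    return torch.from_numpy(array)
```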

Related issues

A bit more context

I think I've made our lives far harder than they need to be by trying to support two different use-cases:

  1. Loading training data on-the-fly from multiple Zarr files during ML training; and
  2. Pre-preparing batches.

I think we can make life way easier by dropping support for use-case 1 :)

Here's the broad proposal that it'd be great to discuss:

We drop support for loading data directly from Zarr on-the-fly during ML training (which we haven't done for months, and - now that we're using large NWP images - it would be far too slow). nowcasting_dataset becomes laser-focused on pre-preparing batches (just as we use it now).

This allows us to completely rip PyTorch out of nowcasting_dataset (#86), and enables each "modality" to stay in a single data type throughout nowcasting_dataset (#209). For example, satellite data stays in an xr.Dataset. Each modality would be processed concurrently in a separate process, and output into its own directory (e.g. train/satellite/ and train/nwp/) (#202).
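
To make the "one process per modality" idea concrete, here's a rough sketch (not the final design); the prepare_batches() helper and the modality names are hypothetical:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

MODALITIES = ["satellite", "nwp"]  # hypothetical list of modalities


def prepare_batches(modality: str, dst_root: Path) -> None:
    """Placeholder: load this modality's Zarr, select examples, and write
    each batch to e.g. dst_root / modality / <batch_number>.nc."""
    (dst_root / modality).mkdir(parents=True, exist_ok=True)
    # ... real batch-preparation logic would go here ...


def prepare_all(dst_root: Path) -> None:
    # One worker process per modality, so e.g. train/satellite/ and
    # train/nwp/ are produced concurrently.
    with ProcessPoolExecutor() as executor:
        futures = [
            executor.submit(prepare_batches, modality, dst_root)
            for modality in MODALITIES
        ]
        for future in futures:
            future.result()  # re-raise any exception from the workers
```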

Inspired by and making use of @peterdudfield's Pydantic PR (#195), we'd have a formal schema for the data structures in nowcasting_dataset (#211).
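
For illustration, a toy Pydantic (v1-style) model of what such a schema might look like; the field names and shape are assumptions, not the actual schema from #195:

```python
import numpy as np
from pydantic import BaseModel, Field


class SatelliteBatch(BaseModel):
    """Hypothetical schema for one batch of satellite data."""

    class Config:
        # Needed so Pydantic accepts numpy arrays as a field type.
        arbitrary_types_allowed = True

    batch_size: int = Field(..., ge=1, description="Number of examples.")
    data: np.ndarray = Field(
        ..., description="Hypothetical shape: (batch, time, y, x, channel)."
    )
```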

The ultimate aim is to simplify the code (I'm lazy!) whilst keeping all the useful functionality and making the code easier to extend & maintain 🙂

Of course, we'll still use a PyTorch DataLoader to load the pre-prepared batches off disk into an ML model. But that's fine, and it should work in a very similar (maybe identical?) fashion to how it works now 🙂
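
For example, loading the pre-prepared batches might look something like this sketch. The one-NetCDF-file-per-batch layout and the "data" variable name are assumptions for illustration:

```python
from pathlib import Path

import torch
import xarray as xr


class PreparedBatchDataset(torch.utils.data.Dataset):
    """Each item is a whole pre-prepared batch, hence batch_size=None below."""

    def __init__(self, directory: Path):
        self.filenames = sorted(directory.glob("*.nc"))

    def __len__(self) -> int:
        return len(self.filenames)

    def __getitem__(self, index: int) -> torch.Tensor:
        dataset = xr.open_dataset(self.filenames[index])
        return torch.from_numpy(dataset["data"].values)


dataloader = torch.utils.data.DataLoader(
    PreparedBatchDataset(Path("train/satellite")),
    batch_size=None,  # disable collation: each file already holds a full batch
    num_workers=4,
)
```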

I certainly can't claim to have thought this all through properly! And everything's up for discussion, of course!
