Use independent processes for each "modality" #202

This issue is split from #166 

## Detailed Description
We could have separate files for each data source, for each batch.

For example, on disk, within the `prepared_ML_data/train/` directory, we might have `train/NWP/`, `train/satellite/`, etc.  And, as before, in each of these folders, we'd have one file per batch, identified by the batch number.  And, importantly, `train/NWP/1.nc` and `train/satellite/1.nc` would still be perfectly aligned in time and space (just as they currently are).
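As a rough illustration of that layout, here is a minimal sketch of how a batch path could be derived from the modality name and batch number. The helper name `batch_path` and its arguments are hypothetical and not part of `nowcasting_dataset`; only the directory names and the `.nc` extension come from the example above.

```python
from pathlib import Path


def batch_path(root: Path, split: str, modality: str, batch_idx: int) -> Path:
    """Return the file for one batch of one modality (hypothetical helper)."""
    return root / split / modality / f"{batch_idx}.nc"


root = Path("prepared_ML_data")
nwp_file = batch_path(root, "train", "NWP", 1)
sat_file = batch_path(root, "train", "satellite", 1)

# The same batch number in different modality folders refers to the same
# examples, so the two files share a file name and differ only in the folder.
print(nwp_file.as_posix())
print(sat_file.as_posix())
```

Because the batch number is the only thing linking the files, the loading code can find every modality for a batch by swapping the folder name.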

Saving each "modality" as a different set of files opens up the possibility to further modularise and de-couple `nowcasting_dataset`

`prepare_ml_data.py` could run through each modality separately, something like:

1. Randomly sample the "positions" in time and space for each ML training example, and save to disk.  (In a little more detail: Find all the available `t0_datetimes` from across all the DataSources (see #204).  Randomly sample from these; and randomly sample from the available locations...  This should be general enough to enable #93)
2. Fire up a separate process for each modality (probably using `concurrent.futures.ProcessPoolExecutor`).  We could even have multiple processes per modality, where each process works on a different subset of the "positions" (e.g. if we want 4 processes for each modality, then split the "positions" list into quarters).
3. Each process will read from the previously-saved "positions", and save pre-prepared batches to disk for that modality.
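The three steps above could be sketched roughly as follows. This is only an illustration using the standard library: the function names (`sample_plan`, `save_plan`, `prepare_modality`, `prepare_all`), the JSON plan format, and the placeholder `.json` batch files are all assumptions, and the real code would load and slice the source data instead of just writing the positions back out.

```python
import json
import random
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def sample_plan(t0_datetimes, locations, n_examples, seed=0):
    """Step 1: randomly pick a t0 datetime and a location for each example."""
    rng = random.Random(seed)
    return [
        {"t0": rng.choice(t0_datetimes), "xy": list(rng.choice(locations))}
        for _ in range(n_examples)
    ]


def save_plan(plan, plan_path):
    """Step 1 (continued): save the sampled positions to disk."""
    Path(plan_path).write_text(json.dumps(plan))


def prepare_modality(modality, plan_path, dst_root, batch_size):
    """Step 3: read the saved plan and write one file per batch."""
    plan = json.loads(Path(plan_path).read_text())
    out_dir = Path(dst_root) / modality
    out_dir.mkdir(parents=True, exist_ok=True)
    n_batches = 0
    for start in range(0, len(plan), batch_size):
        batch = plan[start:start + batch_size]
        # Placeholder: a real DataSource would load and write actual data here.
        (out_dir / f"{n_batches}.json").write_text(json.dumps(batch))
        n_batches += 1
    return modality, n_batches


def prepare_all(modalities, plan_path, dst_root, batch_size,
                executor_cls=ProcessPoolExecutor):
    """Step 2: submit one worker per modality and wait for all of them."""
    with executor_cls(max_workers=len(modalities)) as executor:
        futures = [
            executor.submit(prepare_modality, m, str(plan_path),
                            str(dst_root), batch_size)
            for m in modalities
        ]
        return dict(f.result() for f in futures)
```

Every worker reads the same plan file, so batch `N` covers the same positions in every modality folder. When using the default `ProcessPoolExecutor`, call `prepare_all` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.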

By default, `prepare_ml_data.py` should create all modalities specified in the config yaml file.  But the user should be able to pass in a command-line argument (#171) to re-create only one modality, or a subset of modalities (e.g. if we fix a bug in the creation of batches of satellite data, and we _only_ want to re-compute the satellite data).
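One possible shape for that command-line option, sketched with `argparse`. The flag name `--modalities` and the function `select_modalities` are assumptions made for illustration; #171 may settle on a different interface, and in practice the `configured` list would come from the config yaml file.

```python
import argparse


def select_modalities(configured, argv=None):
    """Return the modalities to (re-)create: all configured ones by default."""
    parser = argparse.ArgumentParser(description="Prepare ML batches.")
    parser.add_argument(
        "--modalities",  # hypothetical flag name
        nargs="+",
        default=None,
        help="Only re-create these modalities (default: all in the config).",
    )
    args = parser.parse_args(argv)
    if args.modalities is None:
        return list(configured)
    unknown = [m for m in args.modalities if m not in configured]
    if unknown:
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"Unknown modalities: {unknown}")
    return args.modalities
```

For example, `select_modalities(["NWP", "satellite", "PV"], ["--modalities", "satellite"])` would select only the satellite data, leaving the other folders on disk untouched.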

Advantages:
* We don't have to recreate the whole pre-prepared dataset if we only want to update or add one "modality"
* This should give us fairly easy-to-debug concurrent code.  
* When our dataset gets really big, we could use multiple machines running in parallel to create the pre-prepared batches.
* It'd be very easy to use subsets of the data (e.g. we could share _just_ the pre-prepared _satellite_ batches with the MSc students)
* We can use whatever file format makes most sense for each 'modality'.  e.g. satellite images could be stored as GeoTIFFs (which would make them easy to view).
* The code to write each batch to disk could live in the superclass for `GSP` and `NWP`; and could be overridden by the `GSP` or `NWP` classes.
* This is one way to #86
* Concurrently reads different files from disk.  This should speed up execution time on `leonardo` and in the cloud.

Disadvantages:
* It makes the "ML loading code" a little more complex.  But not much more complex.
* Some 'modalities' (like PV) will have tiny files on disk (a few kBytes per batch?).  And tiny files are inefficient to load (both on the cloud and on our local hardware).  But maybe this isn't a huge problem because, when training large complex models, we probably only need to load a few batches per second (not thousands per second!)
* It's yet more "refactoring" that isn't directly improving our ML model performance :)

Subtasks, in sequence:

1. [ ] Pre-prepare the "plan" and save it to disk (before processing any data) (possibly do #204 at the same time, if it makes #202 easier).  Then, load the plan from disk and proceed as the code currently works.
2. [ ] Implement `DataSource.prepare_batch(t0_datetimes, x_centers, y_centers, dst_path)` which does everything :)  It loads a batch from the source data, selects the appropriate times and spatial positions, and writes the batch to disk (this solves #212).  `prepare_ml_data.py` will read the entire pre-prepared "plan", and fire up a process (using `ProcessPoolExecutor()`) for each modality.
3. [ ] Remove the code that combines batches from each DataSource into a single batch
4. [ ] Simplify the public interface to `DataSource`: now that we're not combining data from different modalities, the data never needs to leave the DataSource. You could imagine that each DataSource only needs to expose two or three public methods: `get_available_t0_datetimes(history_minutes, forecast_minutes)`, `sample_locations_for_datetimes(t0_datetimes)`, and `prepare_batch(t0_datetimes, center_x, center_y)`
5. [ ] Remove any unused functions (and their tests).
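To make the simplified interface from subtasks 2 and 4 more concrete, here is a minimal sketch of what it could look like as an abstract base class. The method names come from the subtasks above (with the `prepare_batch` signature from subtask 2); the type hints, the return types, and the `ToyDataSource` subclass are assumptions for illustration only, not the existing `nowcasting_dataset` code.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Sequence, Tuple


class DataSource(ABC):
    """Hypothetical minimal public interface; data never leaves the source."""

    @abstractmethod
    def get_available_t0_datetimes(
        self, history_minutes: int, forecast_minutes: int
    ) -> List[datetime]:
        """Return t0 datetimes with enough history and forecast data."""

    @abstractmethod
    def sample_locations_for_datetimes(
        self, t0_datetimes: Sequence[datetime]
    ) -> List[Tuple[float, float]]:
        """Return one (x, y) centre per t0 datetime."""

    @abstractmethod
    def prepare_batch(self, t0_datetimes, x_centers, y_centers, dst_path) -> None:
        """Load, select, and write one batch to dst_path."""


class ToyDataSource(DataSource):
    """Toy implementation that only knows a list of timestamps."""

    def __init__(self, timestamps):
        self.timestamps = sorted(timestamps)

    def get_available_t0_datetimes(self, history_minutes, forecast_minutes):
        start = self.timestamps[0] + timedelta(minutes=history_minutes)
        end = self.timestamps[-1] - timedelta(minutes=forecast_minutes)
        return [t for t in self.timestamps if start <= t <= end]

    def sample_locations_for_datetimes(self, t0_datetimes):
        # Assumption: a fixed centre; real sources would sample randomly.
        return [(0.0, 0.0) for _ in t0_datetimes]

    def prepare_batch(self, t0_datetimes, x_centers, y_centers, dst_path):
        # Placeholder output: one text line per example instead of real data.
        lines = [
            f"{t.isoformat()},{x},{y}"
            for t, x, y in zip(t0_datetimes, x_centers, y_centers)
        ]
        Path(dst_path).write_text("\n".join(lines))
```

With an interface like this, `prepare_ml_data.py` would only ever pass positions and a destination path to each `DataSource`, which is what lets each modality run in its own process.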

