This repository was archived by the owner on Sep 11, 2023. It is now read-only.
Speed up prepare_ml_data.py #341
Closed
Labels
enhancement: New feature or request
Detailed Description
After the "big new design", prepare_ml_data.py takes about 12 seconds per satellite batch. That's too slow. But, fear not, there are plenty of ways to speed it up!
After prepare_ml_data.py finishes with sun, topographic, and GSP (which all zoom past), leonardo really isn't being pushed very hard when it's only doing satellite and NWP.
Possible Implementation
- Write pre-prepared batches to leonardo's new 4 TB SSD
- Check that create batch is still using ThreadPoolExecutor to load examples in parallel
- In Manager use multiprocessing.Pool not ProcessPoolExecutor #325
- Experiment with allowing xr.open_mfdataset to use dask for NWPs and Satellite to speed up loading #456
- Use multiple processes per DataSource in Manager #311
- Experiment with calling dataset.load() _after_ joining examples into batch #475
- Use smaller dtypes for NWP and Satellite Zarrs (see Use smaller dtypes for saved data #61). Although satellite data is already 10-bit-per-channel, so reducing to 8-bit won't speed things up that much.
- If needs be, bring back the idea of creating a batch by loading, say, 16 time slices off disk, and sampling 2 geographical regions of interest from each time slice to produce 32 examples per batch (i.e. halving the amount of data that needs to be loaded from disk per batch). This definitely speeds up loading but reduces randomness. This is how the code did it before "the big new redesign" (here's the commit where it was mostly removed: f896a5e#L103 in nwp_data_source.get_batch()).
- Speed up GSPDataSource.get_locations() #305
- Speed up Manager.sample_spatial_and_temporal_locations_for_examples() by splitting shuffled_t0_datetimes across multiple processes #304 (but only bother with this if 305 is not sufficient)
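To make the "16 time slices, 2 regions each" idea concrete, here is a rough sketch of how such a batch could be planned and loaded. It is not the code from the old commit. plan_batch, load_time_slice, crop and create_batch are hypothetical names, the loader is a stand-in for a real disk read, and the timestamps and locations are made up. The slices are read through a ThreadPoolExecutor, as in the parallel-loading bullet above.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 32
REGIONS_PER_TIME_SLICE = 2  # 1 would give fully independent examples


def plan_batch(t0_datetimes, locations, rng,
               batch_size=BATCH_SIZE,
               regions_per_time_slice=REGIONS_PER_TIME_SLICE):
    """Return {t0: [location, ...]} describing batch_size examples."""
    if batch_size % regions_per_time_slice != 0:
        raise ValueError("batch_size must be a multiple of regions_per_time_slice")
    n_time_slices = batch_size // regions_per_time_slice
    # Distinct time slices, so each one is read from disk exactly once.
    chosen_t0s = rng.sample(t0_datetimes, n_time_slices)
    return {t0: rng.sample(locations, regions_per_time_slice)
            for t0 in chosen_t0s}


def load_time_slice(t0):
    """Hypothetical stand-in for the expensive read of one time slice."""
    return {"t0": t0, "data": f"slice@{t0}"}


def crop(time_slice, location):
    """Hypothetical stand-in for selecting one region of interest."""
    return (time_slice["t0"], location)


def create_batch(plan, max_workers=8):
    """Load every planned time slice in parallel, then crop the regions."""
    t0s = list(plan)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        slices = list(executor.map(load_time_slice, t0s))
    return [crop(s, loc) for s in slices for loc in plan[s["t0"]]]


rng = random.Random(0)
# Made-up 5-minutely timestamps and a 10 x 10 grid of region centres.
t0_datetimes = [f"2021-06-01T{h:02d}:{m:02d}"
                for h in range(24) for m in range(0, 60, 5)]
locations = [(x, y) for x in range(10) for y in range(10)]

plan = plan_batch(t0_datetimes, locations, rng)
batch = create_batch(plan)
print(len(plan), "disk reads for", len(batch), "examples")
```

With these settings the plan covers 32 examples using 16 reads, which is the halving described above. The cost is that pairs of examples share a timestamp, so the batch is less random than sampling each example independently.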
