Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. #223
Description
Detailed Description
At present, when using GSP-region PV data, nowcasting_dataset produces ML training examples with t0 datetimes at 0 and 30 minutes past the hour.
We should experiment with enabling nowcasting_dataset to set t0 datetimes to 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55 minutes past the hour. This would increase the number of training examples by 6x (from 2 to 12 t0 datetimes per hour), which might be quite a big deal. Also, when we run a live service, we will probably want to update our PV forecasts every 5 minutes (whenever we get new satellite data and PVOutput.org data).
Possible Implementation
Self-attention models don't need different modalities to be perfectly aligned in space and time. So, we could produce examples where t0 can be at any 5-minute increment, and there are always, say, exactly 2 historical timesteps and 4 forecast timesteps of GSP PV data. For instance, an example with t0 = 12:15, history_minutes = 60, and forecast_minutes = 120 would look something like this:
- satellite data at 5-minute steps from 11:15 to 14:15
- GSP PV data:
  - history: 11:30, 12:00
  - forecast: 12:30, 13:00, 13:30, and 14:00
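To make the example above concrete, here is a minimal pandas sketch that reproduces those timestamps. The date and the variable names are made up for illustration, and this is not how DataSource.get_example is implemented; it only shows which timesteps fall inside the window.

```python
import pandas as pd

# Illustrative values taken from the example above (the date itself is arbitrary).
t0 = pd.Timestamp("2021-06-01 12:15")
history = pd.Timedelta(minutes=60)
forecast = pd.Timedelta(minutes=120)

window_start = t0 - history   # 11:15
window_end = t0 + forecast    # 14:15

# Satellite data is 5-minutely, so every step in the window is available.
satellite_times = pd.date_range(window_start, window_end, freq="5min")

# GSP PV data is half-hourly, so only timestamps on :00 and :30 inside the
# window are available. Snap the window edges inwards to the 30-minute grid.
gsp_times = pd.date_range(
    window_start.ceil("30min"), window_end.floor("30min"), freq="30min"
)
gsp_history = gsp_times[gsp_times <= t0]
gsp_forecast = gsp_times[gsp_times > t0]

print(list(gsp_history.strftime("%H:%M")))   # ['11:30', '12:00']
print(list(gsp_forecast.strftime("%H:%M")))  # ['12:30', '13:00', '13:30', '14:00']
```

Note that the number of GSP timesteps stays at 2 historical and 4 forecast for any t0 on the 5-minute grid with these durations, because each window spans a whole number of half hours.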
I think DataSource.get_example should work as required already, although it'll need testing.
The trickier bit is changing DataModule._get_t0_datetimes() to be independent of the sample period of the various DataSources. I think the way forward here is to have each DataSource emit a list of periods for which it has contiguous data. In particular, emit a pd.DataFrame with two columns, start_dt and end_dt, where each row represents a contiguous period of data. But, before implementing our own Period class with a Period.intersection method, we should check if pd.Period can handle arbitrary periods and/or if we can re-use pd.PeriodIndex.intersection().
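As a rough sketch of the idea, assuming plain pandas and no custom Period class: the two helper functions below are hypothetical names written for illustration, not the project's actual code. The first turns a DatetimeIndex into a DataFrame of contiguous periods, and the second intersects two such DataFrames.

```python
import pandas as pd


def contiguous_periods(datetimes, max_gap):
    """Hypothetical helper: group datetimes into contiguous periods.

    A new period starts wherever the gap to the previous datetime
    exceeds max_gap. Returns a DataFrame with columns start_dt, end_dt.
    """
    series = pd.DatetimeIndex(datetimes).sort_values().to_series()
    if series.empty:
        return pd.DataFrame(columns=["start_dt", "end_dt"])
    # The first diff is NaT, which compares False, so it joins period 0.
    period_id = (series.diff() > max_gap).cumsum()
    grouped = series.groupby(period_id.values)
    return pd.DataFrame(
        {"start_dt": grouped.min(), "end_dt": grouped.max()}
    ).reset_index(drop=True)


def intersect_periods(a, b):
    """Hypothetical helper: pairwise intersection of two period DataFrames.

    This is O(len(a) * len(b)), which is fine for a sketch.
    """
    rows = []
    for a_start, a_end in zip(a["start_dt"], a["end_dt"]):
        for b_start, b_end in zip(b["start_dt"], b["end_dt"]):
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start <= end:
                rows.append((start, end))
    return pd.DataFrame(rows, columns=["start_dt", "end_dt"])


# Made-up data: 5-minutely satellite data with a gap between 10:00 and 10:30,
# and half-hourly GSP data from 09:30 to 11:30.
satellite = pd.date_range("2021-06-01 09:00", "2021-06-01 10:00", freq="5min").append(
    pd.date_range("2021-06-01 10:30", "2021-06-01 12:00", freq="5min")
)
gsp = pd.date_range("2021-06-01 09:30", "2021-06-01 11:30", freq="30min")

sat_periods = contiguous_periods(satellite, max_gap=pd.Timedelta(minutes=5))
gsp_periods = contiguous_periods(gsp, max_gap=pd.Timedelta(minutes=30))
overlap = intersect_periods(sat_periods, gsp_periods)
print(overlap)
```

With this data the intersection has two rows, 09:30 to 10:00 and 10:30 to 11:30. Because the periods are expressed as start and end datetimes rather than as lists of timestamps, the result does not depend on either source's sample period.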
This should also enable the implementation of #135
This might be important for WP1. I'll add it to the WP1 project for now, and we'll see how ML training goes with half-hour data.
Sub-tasks
- Override DataSource.get_contiguous_time_periods() in SatelliteDataSource(?) to remove nighttime. UPDATE: This is already implemented by SatelliteDataSource.datetime_index()!
Done in PR #220:
- Split the modified README into a separate PR.
- Implement a nd_time.get_contiguous_time_periods() -> pd.DataFrame
- Write test for nd_time.get_contiguous_time_periods()
- Add more test cases for intersection function, where the two periods are the same
Done in PR #256
- Implement a DataSource.get_contiguous_time_periods() -> pd.DataFrame to emit a list of valid time periods. Use nd_time.get_contiguous_time_periods().
- Write test(s) for DataSource.get_contiguous_time_periods()
PR #274
- Implement a DataSource.get_contiguous_t0_time_periods() -> pd.DataFrame which goes through each period and chops off history_duration from the beginning of the period, and chops off forecast_duration from the end of the period.
- Enable NowcastingDataModule to compute the intersection of all the lists of t0 time periods from each DataSource. Use nd_time.intersection_of_2_dataframes_of_periods().
- Compute t0 datetimes across all those periods (using a user-specified frequency, e.g. '5 minutes').
- As before, split those t0 datetimes into train, valid, test
- Remove nd_time.get_start_datetimes(), nd_time.intersection_of_datetimeindexes(), DataSource.get_t0_datetimes(), and their tests, and use grep to check they're not called from anywhere I've missed.
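The "chop off history and forecast, then sample t0 datetimes" steps above could look roughly like the following. This is a hedged sketch only: the function names, signatures, and example data are assumptions made for illustration, not the code from the PRs.

```python
import pandas as pd


def t0_periods_from(periods, history_duration, forecast_duration):
    """Hypothetical helper: shrink each period to the range of valid t0s.

    A t0 needs history_duration of data before it and forecast_duration
    after it, so trim the start and end of each period accordingly.
    """
    t0_periods = pd.DataFrame(
        {
            "start_dt": periods["start_dt"] + history_duration,
            "end_dt": periods["end_dt"] - forecast_duration,
        }
    )
    # Periods too short to hold even one example end up with start > end.
    keep = t0_periods["start_dt"] <= t0_periods["end_dt"]
    return t0_periods[keep].reset_index(drop=True)


def t0_datetimes_from(t0_periods, freq):
    """Hypothetical helper: sample t0 datetimes on a fixed grid in each period."""
    t0s = pd.DatetimeIndex([])
    for start, end in zip(t0_periods["start_dt"], t0_periods["end_dt"]):
        # Snap the start up to the grid so t0s land on e.g. :00, :05, :10.
        t0s = t0s.append(pd.date_range(start.ceil(freq), end, freq=freq))
    return t0s


# Made-up periods: one 4-hour period and one 1-hour period.
periods = pd.DataFrame(
    {
        "start_dt": [pd.Timestamp("2021-06-01 08:00"), pd.Timestamp("2021-06-01 13:00")],
        "end_dt": [pd.Timestamp("2021-06-01 12:00"), pd.Timestamp("2021-06-01 14:00")],
    }
)
t0_periods = t0_periods_from(
    periods,
    history_duration=pd.Timedelta(minutes=60),
    forecast_duration=pd.Timedelta(minutes=120),
)
t0_datetimes = t0_datetimes_from(t0_periods, freq="5min")
print(t0_periods)
print(len(t0_datetimes))
```

Here the 1-hour period is dropped because it is shorter than history plus forecast (3 hours), and the 4-hour period yields t0 datetimes every 5 minutes from 09:00 to 10:00 inclusive.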