This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. #223

@JackKelly

Detailed Description

At present, when using GSP-region PV data, nowcasting_dataset produces ML training examples with t0 datetimes at 0 and 30 minutes past the hour.

We should experiment with enabling nowcasting_dataset to set t0 datetimes to 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55 minutes past the hour. This would increase the number of training examples by 6x, from 2 to 12 t0 datetimes per hour (which might be quite a big deal!). And, when we run a live service, we probably want to be able to update our PV forecasts every 5 minutes (when we get new satellite data & PVOutput.org data).

Possible Implementation

Self-attention models don't need different modalities to be perfectly aligned in space and time. So, we could produce examples where t0 could be at any 5-minute increment, and there's always, say, exactly 2 historical timesteps of GSP PV data; and 4 forecast timesteps of GSP PV data. For example, an example with t0 = 12:15, history_minutes = 60, and forecast_minutes=120 would look something like this:

  • satellite data at 5-minute steps from 11:15 to 14:15
  • GSP PV data:
    • history: 11:30, 12:00
    • forecast: 12:30, 13:00, 13:30, and 14:00
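A minimal sketch of how the timesteps in this example could be worked out, using only the standard library. The helper `timesteps_on_grid` is hypothetical (it is not part of nowcasting_dataset), and it assumes each DataSource samples on a regular grid aligned to midnight:

```python
from datetime import datetime, timedelta


def timesteps_on_grid(start, end, step):
    """Return every datetime in [start, end] that is a multiple of `step` past midnight.

    Hypothetical helper for illustration; assumes `step` divides 24 hours evenly.
    """
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Ceiling division: number of whole steps needed to reach or pass `start`.
    n_steps = -(-(start - midnight) // step)
    t = midnight + n_steps * step
    timesteps = []
    while t <= end:
        timesteps.append(t)
        t += step
    return timesteps


t0 = datetime(2021, 1, 1, 12, 15)  # arbitrary date, chosen for illustration
history = timedelta(minutes=60)
forecast = timedelta(minutes=120)

satellite = timesteps_on_grid(t0 - history, t0 + forecast, timedelta(minutes=5))
gsp = timesteps_on_grid(t0 - history, t0 + forecast, timedelta(minutes=30))
gsp_history = [t for t in gsp if t <= t0]
gsp_forecast = [t for t in gsp if t > t0]
```

Note that this simple windowing reproduces the example above, but it does not by itself guarantee a fixed number of GSP timesteps for every t0: a t0 that falls exactly on the half-hour grid would pick up one extra history timestep, so an additional rule would be needed for that.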

I think DataSource.get_example should work as required already, although it'll need testing.

The trickier bit is changing DataModule._get_t0_datetimes() to be independent of the sample period of the various DataSources. I think the way forward here is to have each DataSource emit a list of periods for which it has contiguous data. In particular, each DataSource would emit a pd.DataFrame with two columns, start_dt and end_dt, where each row represents a contiguous period of data. But, before implementing our own Period class with a Period.intersection method, we should check if pd.Period can handle arbitrary periods and/or if we can re-use pd.PeriodIndex.intersection().
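As a rough sketch of the "list of contiguous periods" idea, something like the following could turn a DatetimeIndex into a DataFrame of start_dt / end_dt rows. The `max_gap` parameter and the grouping approach are assumptions made for illustration, not the implementation that ended up in the repository:

```python
import pandas as pd


def get_contiguous_time_periods(
    datetimes: pd.DatetimeIndex, max_gap: pd.Timedelta
) -> pd.DataFrame:
    """Group datetimes into periods in which consecutive gaps are <= max_gap.

    Returns a DataFrame with columns start_dt and end_dt, one row per period.
    """
    series = pd.Series(datetimes.sort_values())
    # A new period starts wherever the gap to the previous datetime is too big.
    # The first gap is NaT, which compares as False, so it joins period 0.
    starts_new_period = series.diff() > max_gap
    period_id = starts_new_period.cumsum()
    periods = series.groupby(period_id).agg(start_dt="min", end_dt="max")
    return periods.reset_index(drop=True)


# Two runs of 5-minutely data separated by a gap of almost two hours.
datetimes = pd.date_range("2021-01-01 10:00", "2021-01-01 10:15", freq="5min").append(
    pd.date_range("2021-01-01 12:00", "2021-01-01 12:10", freq="5min")
)
periods = get_contiguous_time_periods(datetimes, max_gap=pd.Timedelta("5min"))
```

Here `periods` has one row per run of data, so the large gap between 10:15 and 12:00 splits the index into two periods.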

This should also enable the implementation of #135

This might be important for WP1. I'll add it to the WP1 project for now, and we'll see how ML training goes with half-hour data.

Sub-tasks

  • Override DataSource.get_contiguous_time_periods() in SatelliteDataSource (?) to remove nighttime. UPDATE: This is already implemented by SatelliteDataSource.datetime_index()!

Done in PR #220:

  • Split the modified README into a separate PR.
  • Implement a nd_time.get_contiguous_time_periods() -> pd.DataFrame
  • Write test for nd_time.get_contiguous_time_periods()
  • Add more test cases for the intersection function, including the case where the two periods are the same
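For reference, an intersection of two DataFrames of periods might look roughly like the sketch below. It borrows the name nd_time.intersection_of_2_dataframes_of_periods from this issue, but it is not the code from PR #220. It assumes both inputs have start_dt and end_dt columns, treats end_dt as inclusive, and uses a simple nested loop rather than anything optimised:

```python
import pandas as pd


def intersection_of_2_dataframes_of_periods(
    a: pd.DataFrame, b: pd.DataFrame
) -> pd.DataFrame:
    """Return the overlap of every period in `a` with every period in `b`."""
    rows = []
    for a_period in a.itertuples():
        for b_period in b.itertuples():
            start_dt = max(a_period.start_dt, b_period.start_dt)
            end_dt = min(a_period.end_dt, b_period.end_dt)
            # Keep the pair only if the two periods actually overlap.
            if start_dt <= end_dt:
                rows.append({"start_dt": start_dt, "end_dt": end_dt})
    return pd.DataFrame(rows, columns=["start_dt", "end_dt"])


a = pd.DataFrame(
    {
        "start_dt": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 14:00"]),
        "end_dt": pd.to_datetime(["2021-01-01 12:00", "2021-01-01 16:00"]),
    }
)
b = pd.DataFrame(
    {
        "start_dt": pd.to_datetime(["2021-01-01 11:00"]),
        "end_dt": pd.to_datetime(["2021-01-01 15:00"]),
    }
)
overlap = intersection_of_2_dataframes_of_periods(a, b)
same = intersection_of_2_dataframes_of_periods(a, a)
```

In this example `overlap` contains two periods (11:00 to 12:00 and 14:00 to 15:00), and intersecting `a` with itself gives back the same two periods as `a`.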

Done in PR #256

  • Implement a DataSource.get_contiguous_time_periods() -> pd.DataFrame to emit a list of valid time periods. Use nd_time.get_contiguous_time_periods().
  • Write test(s) for DataSource.get_contiguous_time_periods()

PR #274

  • Implement a DataSource.get_contiguous_t0_time_periods() -> pd.DataFrame which goes through each period and chops off history_duration from the beginning of the period, and chops off forecast_duration from the end of the period.
  • Enable NowcastingDataModule to compute the intersection of all the lists of t0 time periods from each DataSource. Use nd_time.intersection_of_2_dataframes_of_periods().
  • Compute t0 datetimes across all those periods (using a user-specified frequency, e.g. '5 minutes').
  • As before, split those t0 datetimes into train, valid, test
  • Remove nd_time.get_start_datetimes(), nd_time.intersection_of_datetimeindexes(), DataSource.get_t0_datetimes(), and their tests, and use grep to check they're not called from anywhere I've missed.
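The "chop off history and forecast, then sample t0 datetimes" steps above could be sketched as follows. This is an illustration under assumptions, not the code from PR #274: the functions are written as free functions rather than DataSource methods, `t0_datetimes_from_periods` is a hypothetical name, and periods too short to hold a full example are simply dropped:

```python
import pandas as pd


def get_contiguous_t0_time_periods(
    periods: pd.DataFrame,
    history_duration: pd.Timedelta,
    forecast_duration: pd.Timedelta,
) -> pd.DataFrame:
    """Shrink each period so that any t0 inside it has full history and forecast."""
    t0_periods = pd.DataFrame(
        {
            "start_dt": periods["start_dt"] + history_duration,
            "end_dt": periods["end_dt"] - forecast_duration,
        }
    )
    # Drop periods that are too short to contain even a single t0.
    is_valid = t0_periods["start_dt"] <= t0_periods["end_dt"]
    return t0_periods[is_valid].reset_index(drop=True)


def t0_datetimes_from_periods(t0_periods: pd.DataFrame, freq: str) -> pd.DatetimeIndex:
    """Hypothetical helper: list every t0 at `freq` within the given periods."""
    t0_datetimes = []
    for period in t0_periods.itertuples():
        # Round the start up so that t0s land on multiples of `freq`.
        first_t0 = period.start_dt.ceil(freq)
        t0_datetimes.extend(pd.date_range(first_t0, period.end_dt, freq=freq))
    return pd.DatetimeIndex(t0_datetimes)


periods = pd.DataFrame(
    {
        "start_dt": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 15:00"]),
        "end_dt": pd.to_datetime(["2021-01-01 14:00", "2021-01-01 16:00"]),
    }
)
t0_periods = get_contiguous_t0_time_periods(
    periods,
    history_duration=pd.Timedelta("60min"),
    forecast_duration=pd.Timedelta("120min"),
)
t0_datetimes = t0_datetimes_from_periods(t0_periods, freq="5min")
```

With 60 minutes of history and 120 minutes of forecast, the first period shrinks to 11:00 to 12:00 and the second (only one hour long) is dropped, so `t0_datetimes` covers 11:00 to 12:00 at 5-minute steps.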

Labels: data, enhancement