Issue/166 batch pydantic #195
Conversation
Looks good!
Lots of comments, but they're mostly pretty tiny!
A few general thoughts:
- Wherever possible, please consider removing any dependencies on torch, in preparation for #86. e.g. please consider using `np.random.randn` instead of `torch.randn`. (A sketch of what I mean follows this list.)
- I think I've made `nowcasting_dataset` over-complicated by starting `nowcasting_dataset` with the intention of loading data on-the-fly during ML training, and then swapping to using `nowcasting_dataset` to create on-disk pre-prepared batches. I think we can simplify things a lot by removing support for loading on-the-fly :) (i.e. I think we can safely say that, from now on, `nowcasting_dataset` will just be used for pre-preparing batches!) Please see #209 - I think it could reduce the code size a lot :) And I feel really bad for not thinking about this earlier!
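On the torch point above, here is a minimal sketch of the kind of swap I mean. The helper name and array shape are made up for illustration, not the actual nowcasting_dataset fake-batch code:

```python
import numpy as np


def make_fake_satellite_data(
    batch_size: int = 4, seq_length: int = 19, image_size: int = 64, n_channels: int = 1
) -> np.ndarray:
    """Return random data shaped like a satellite batch.

    Produces the same kind of random test data as torch.randn(...),
    but without any torch dependency in nowcasting_dataset itself.
    """
    return np.random.randn(batch_size, seq_length, image_size, image_size, n_channels)
```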
)
elif "time_30" in self.__getattribute__(key).dims:
    self.__setattr__(
        key, self.__getattribute__(key).isel(time_30=slice(start_i, end_i))
Do any instances of DataSourceOutput have one or more fields with a five-minutely datetime index and one or more fields with a half-hourly datetime index? I guess not?
If that is a possibility, then start_i and end_i would be correct indexes into one of the datetime indexes, but wrong indexes into the other, I think?
Currently each data source has only 5-minute or 30-minute data, never both. I wonder if that is good enough for the moment. If so, I can add a comment -
or can we think of a case where a data source will have both 5-minute and 30-minute data?
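For context, a minimal sketch of the slicing pattern being discussed, simplified from the diff above. The function name is made up, and the real code sets attributes on the model rather than returning a new array, but it shows why a single positional slice only lines up with one sample frequency at a time:

```python
import xarray as xr


def select_time_slice(data_array: xr.DataArray, start_i: int, end_i: int) -> xr.DataArray:
    """Slice a field by integer position on whichever time dimension it has.

    start_i and end_i are positional indexes, so if one model ever held both
    5-minutely ("time") and half-hourly ("time_30") fields, the same slice
    would cover different real-world periods in each dimension.
    """
    if "time" in data_array.dims:  # 5-minutely data
        return data_array.isel(time=slice(start_i, end_i))
    elif "time_30" in data_array.dims:  # half-hourly data
        return data_array.isel(time_30=slice(start_i, end_i))
    return data_array  # no time dimension: nothing to slice
```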
from nowcasting_dataset.utils import coord_to_range


class Datetime(DataSourceOutput):
This looks great!
Random thought that isn't especially relevant to this PR(!): it's possible that we might want to remove these datetime features from the on-disk batches and instead compute them on-the-fly, not least because we'll want to experiment with a variety of different ways of encoding position once we really start experimenting with the Perceiver IO models. Related issue: https://github.com/openclimatefix/nowcasting_utils/issues/30
I've started a separate issue to discuss this: #208
But, even if we do compute these features on-the-fly, this pydantic model (and the validation code) could still be used for them, I guess?
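To illustrate that last point, here is a minimal sketch of how a pydantic model could validate datetime features even when they are computed on-the-fly. The class name, field names, and validator are all hypothetical - this is not the actual Datetime class in this PR, just the pattern:

```python
import numpy as np
from pydantic import BaseModel, validator


class DatetimeFeatures(BaseModel):
    """Hypothetical stand-in for a Datetime(DataSourceOutput)-style model."""

    hour_of_day_sin: np.ndarray
    hour_of_day_cos: np.ndarray

    class Config:
        arbitrary_types_allowed = True  # let pydantic hold numpy arrays

    @validator("hour_of_day_sin", "hour_of_day_cos")
    def check_encoding_range(cls, v: np.ndarray) -> np.ndarray:
        # sin/cos position encodings should always lie in [-1, 1],
        # whether they were loaded from disk or computed on-the-fly.
        assert np.all(np.abs(v) <= 1.0), "datetime encoding out of range"
        return v
```

The same validation would run whether the arrays come from a pre-prepared batch on disk or are generated during training.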
PR suggestions from JK
Co-authored-by: Jack Kelly <jack@OpenClimateFix.org>
Pull Request
Description
Fixes #166
How Has This Been Tested?
- Added specific unit tests
- Created a fake Dataset and checked that a torch tensor is returned (unittest) - see the sketch below
- Ran the prepare ML data scripts (and the validate script)
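A sketch of the kind of fake-dataset test described above. The batch key and shapes are assumptions for illustration, not the exact test added in this PR:

```python
import unittest

import numpy as np
import torch


class FakeBatchTest(unittest.TestCase):
    """Illustrative 'create fake dataset, check a torch tensor comes back' test."""

    def test_fake_batch_returns_torch_tensor(self):
        # Build a fake batch with numpy (no torch needed to create the data).
        fake_batch = {"sat_data": np.random.randn(4, 19, 64, 64, 1).astype(np.float32)}

        # Convert to torch tensors, as the ML-side dataloader would.
        torch_batch = {key: torch.from_numpy(value) for key, value in fake_batch.items()}

        self.assertIsInstance(torch_batch["sat_data"], torch.Tensor)
        self.assertEqual(tuple(torch_batch["sat_data"].shape), (4, 19, 64, 64, 1))


if __name__ == "__main__":
    unittest.main()
```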
No
Yes
Checklist: