Skip to content
This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Move all data normalisation to nowcasting_dataloader #231

Closed
Tracked by #393
peterdudfield opened this issue Oct 18, 2021 · 21 comments · Fixed by #250 or #428
Closed
Tracked by #393

Move all data normalisation to nowcasting_dataloader #231

peterdudfield opened this issue Oct 18, 2021 · 21 comments · Fixed by #250 or #428
Assignees
Labels
enhancement New feature or request

Comments

@peterdudfield
Copy link
Contributor

Detailed Description

Currently satellite normalisation can happen in two places

Context

  • Dont want to do it twice, as this may lead to funny ML problems
  • Would be nice if the metadata stored if the data had been normalised or not.
  • Nice to have it un normalised in batches too, so it can be displayed easily

Possible Implementation

Only have option to do it once.

  • It could be a submethod of one of the Satellite classes.
  • The metadata stored if the data had been normalised or not, then we know what has been done
@peterdudfield peterdudfield added the enhancement New feature or request label Oct 18, 2021
@jacobbieker
Copy link
Member

I would lean towards normalizing it when making it a Tensor, just so we have the raw data stored, and if we want to try different normalizations, such as MetNet's which is a bit different, we can easily do that. And normalization shouldn't take too long when loading on the fly

@peterdudfield peterdudfield self-assigned this Oct 18, 2021
@JackKelly
Copy link
Member

This all sounds great!

I agree, maybe nowcasting_dataset shouldn't do any normalisation (which also makes the nowcasting_dataset code easier!). Instead, all data normalisation can be done on-the-fly in nowcasting_dataloader.

@JackKelly
Copy link
Member

This also saves us from having to compute means and standard deviations across, potentially, hundreds of TB of raw data. Instaed we can just compute the statistics for the ~1 TB nowcasting_dataset pre-prepared batches (I guess just the training batches?)

@jacobbieker
Copy link
Member

Yeah, I think just the training batches, as otherwise it would still leak some data about the validation and test sets

@peterdudfield
Copy link
Contributor Author

yea good point, perhaps we need a script to run through all the training batches and calculate the mean and or std

@peterdudfield
Copy link
Contributor Author

Or maybe its ok just to have a rough number for this. Its only normalised for the purpose of the ML model right?

@JackKelly
Copy link
Member

I think a script would be good (that runs through the training batches). I think some types of ML models can cope with badly normalised data; but, to enable fair comparison between all types of models, I think we should probably try to normalise as well as we can 🙂

@JackKelly JackKelly changed the title Tidy satelite normalisation Move all data normalisation to nowcasting_dataloader Oct 19, 2021
@JackKelly
Copy link
Member

I've taken the liberty of changing the title to "Move all data normalisation to nowcasting_dataloader", because it might be good to tidy up all the normalisation :)

@jacobbieker
Copy link
Member

There is also normalization in nowcasting-utils here: https://github.com/openclimatefix/nowcasting_utils/blob/main/nowcasting_utils/models/normalization.py incase we want to have all of our normalization code in the same place

@peterdudfield
Copy link
Contributor Author

There is also normalization in nowcasting-utils here: https://github.com/openclimatefix/nowcasting_utils/blob/main/nowcasting_utils/models/normalization.py incase we want to have all of our normalization code in the same place

This method might only work for one batch? Is it better to normalise over the whole training set

@jacobbieker
Copy link
Member

Yeah, it would probably be better to normalize over the whole training set, rather than the batch.

@JackKelly
Copy link
Member

I think it's probably easier to reason about the code if the normalisation is done in one place only (e.g. in nowcasting_dataloader). So I'd maybe lean towards removing normalisation from all our other repos, and only doing normalisation in nowcasting_dataloader. What do you think?

@peterdudfield
Copy link
Contributor Author

yea agree, thats what the two above PRs are doing

@JackKelly
Copy link
Member

Cool beans, thank you, sorry I'm a bit behind the curve!

@peterdudfield
Copy link
Contributor Author

Cool beans, thank you, sorry I'm a bit behind the curve!

all good

National Grid Nowcasting: WP1 automation moved this from To do to Done Oct 20, 2021
@peterdudfield peterdudfield reopened this Oct 22, 2021
@peterdudfield
Copy link
Contributor Author

Need to move gsp and pv normalization over there too

@JackKelly
Copy link
Member

Is all the normalisation code moved from nowcasting_dataset (and data prep scripts) to nowcasting_dataloader, now? (i.e.: can we close this issue now?! 🙂 )

@peterdudfield
Copy link
Contributor Author

Gsp and PV is not

@peterdudfield
Copy link
Contributor Author

GSP is probably quite easy to do, but PV might be a longer.

@peterdudfield
Copy link
Contributor Author

Ill do GSP first

@JackKelly
Copy link
Member

SGTM, thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
No open projects
Status: Done
3 participants