MOBT-1148 Add subperiod-selector tool#2373
Conversation
…lities for a phenomenon are present.
gavinevans
left a comment
There was a problem hiding this comment.
Thanks @MoseleyS 👍
I think that this PR achieves the aim. I've made a few suggestions.
mo-jbeaver
left a comment
There was a problem hiding this comment.
Pytest for unit test ran successfully. Acceptance test failed as it couldn't find the threshold_kwargs.json file.
Added a comment to the acceptance test to hopefully fix the filepath to resolve that.
Also added a few other formatting suggestions.
| ValueError: If no data is found in the main period cube matching the percentile and threshold constraints. | ||
| ValueError: If no matching threshold coordinate is found on the subperiod cube. | ||
| ValueError: If no data is found in the subperiod cube matching the threshold constraints. | ||
| ValueError: If the subperiod cube does not have exactly one more dimension than the main period cube. |
There was a problem hiding this comment.
| ValueError: If no data is found in the main period cube matching the percentile and threshold constraints. | |
| ValueError: If no matching threshold coordinate is found on the subperiod cube. | |
| ValueError: If no data is found in the subperiod cube matching the threshold constraints. | |
| ValueError: If the subperiod cube does not have exactly one more dimension than the main period cube. | |
| ValueError: | |
| - If no data is found in the main period cube matching the percentile and threshold constraints. | |
| - If no matching threshold coordinate is found on the subperiod cube. | |
| - If no data is found in the subperiod cube matching the threshold constraints. | |
| - If the subperiod cube does not have exactly one more dimension than the main period cube. |
| Plugin to select which subperiods contain the phenomenon identified over the main period. | ||
|
|
||
| For example, if the 50th percentile of hours of light rain over a 24 hour period is 0.25 (6 hours), | ||
| then this plugin can be used to identify which 6 hours of the 24 hour period are most likely | ||
| to contain light rain. The result can be used in the weather symbol decision tree to force | ||
| the selection of a wet symbol. |
There was a problem hiding this comment.
I understand what this script does, because we've discussed this, but I'm not sure I would from the docstring alone. It's a tricky thing to summarise, but perhaps this is a bit clearer (if more verbose)?
"Identifies which parts of a longer time period are most likely to contain a particular weather phenomena.
For example, if light rain is expected for 6 hours within a 24 hour period (e.g. the 50th percentile of light rain over a 24 hour period is 0.25), this plugin selects the 6 hours most likely to contain that light rain.
This output can then be used by the weather symbol decision tree to force the selection of a wet symbol."
There was a problem hiding this comment.
I have used some of this, and tweaked it a bit further (and copied the result to the CLI doc-string). Thanks for the suggestion.
| """Identify which subperiods to select based on the selected main period diagnostic slice. | ||
|
|
||
| The value at each grid point in the main period data indicates the fraction of subperiods to select, and the values in the | ||
| subperiod data indicate the likelihood of the phenomenon occurring in each subperiod. | ||
| The subperiods with the highest likelihood are selected until the number of selected subperiods | ||
| matches the value from the main period data. |
There was a problem hiding this comment.
Similarly, a tricky one to explain! Perhaps this is a bit clearer?
"Identify the subperiods most likely to contain the phenomenon.
At each grid point, the main period data specifies how many subperiods to select, and the subperiod data gives the likelihood of the phenomenon occurring in each subperiod.
The subperiods with the highest likelihood are selected until the number of selected subperiods matches the value from the main period data.
Where multiple subperiods have equal likelihood, the selection between them is random"
| # Identify which subperiods to select for this period rank. The leading dimension is time, so argpartition | ||
| # is used to identify the indices of the subperiods with the highest likelihood. | ||
| subperiods_to_select = np.argpartition( | ||
| subperiod_data, range(number_of_subperiods), axis=0 | ||
| )[-period_rank:] |
There was a problem hiding this comment.
Should we add an explicit tie-breaking step here? np.argpartition does not guarantee random ordering for equal values, so tied subperiods may not be selected randomly (and we wouldn't want to always select the earliest time in the day, etc!).
For example, we could generate a random array tie_breaker with the same shape as subperiod_data and then use subperiods_to_select = np.lexsort((tie_breaker, superiod_data), axis=0)[-period_rank:] instead. This would still rank subperiods by likelihood first, but would use the random values to decide the ordering where likelihoods are equal.
There was a problem hiding this comment.
I don't think this is worth the effort. The probability fields are smooth and linearly varying. Therefore the proportion of ties is going to be vanishingly small which makes handling them well not worthwhile.
There was a problem hiding this comment.
On reflection, I agree with this. 🙂 It is also quicker (computationally) which matters more.
| Select which subperiods contain the phenomenon identified over the main period. | ||
|
|
||
| For example, if the 50th percentile of hours of light rain over a 24 hour period is 0.25 (6 hours), | ||
| then this plugin can be used to identify which 6 hours of the 24 hour period are most likely | ||
| to contain light rain. The result can be used in the weather symbol decision tree to force | ||
| the selection of a wet symbol. |
There was a problem hiding this comment.
As before (so if you update there, then update here too. 🙂)
mo-AliceLake
left a comment
There was a problem hiding this comment.
Minor comments around docstrings for clarity, take or leave.
Suggestion for enforced randomisation, think we should look at that one.
…ced to acceptance test.
mo-jbeaver
left a comment
There was a problem hiding this comment.
Happy with the changes made and the acceptance test now passes.
* master: Changed implementation from clipping to masking + added Unit Tests (#2366) Adding kwarg as CLI argument & updating acceptance tests. (#2377) Refactor Pollen index for daily and hourly to single plugin (#2372) EPPT-3259 fix fsi duplicate metadata (#2370) Changes to Pollen classes for refactoring cube long names and units of concentration (#2368) Eppt 3223 lifted index investigate why the values are wrong (#2365) Cast to the original dtype of the points in `expand_bounds` (#2367) Changes that might help with intermittent Stochastic Noise failure (#2346)
* master: change ApplyDecisionTree categorical cube dtype from int32 to int16 (#2371)
* master: Revert "change ApplyDecisionTree categorical cube dtype from int32 to int16 (…" (#2382)
* master: MOBT-1148 Add subperiod-selector tool (#2373) Revert "change ApplyDecisionTree categorical cube dtype from int32 to int16 (…" (#2382) change ApplyDecisionTree categorical cube dtype from int32 to int16 (#2371) Changed implementation from clipping to masking + added Unit Tests (#2366) Adding kwarg as CLI argument & updating acceptance tests. (#2377) Refactor Pollen index for daily and hourly to single plugin (#2372) EPPT-3259 fix fsi duplicate metadata (#2370) Changes to Pollen classes for refactoring cube long names and units of concentration (#2368) Eppt 3223 lifted index investigate why the values are wrong (#2365) Cast to the original dtype of the points in `expand_bounds` (#2367) Changes that might help with intermittent Stochastic Noise failure (#2346) # Conflicts: # improver/categorical/subperiod_selector.py
Addresses https://github.com/metoppv/mo-blue-team/issues/1148
Builds on #2092. Uses the output "Fraction of period that is wet" data and the probabilities of precipitation in all the subperiods to identify which subperiods should be classed as wet so that the total number of wet periods matches the fraction at each grid point.
Requires test data from
Testing: