New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port derivative-/quantile-based clipping detection from pvfleets QA #39
Conversation
Night time data is periods of zero power preceding and following the quadratic_clipped test fixture.
Night time data is periods of zero power preceding and following the quadratic test fixture.
By keeping more data we can handle the case where there is the same number of positive power data points at each minute. With a strictly greater than comparison we end up excluding all the data in this case. Including data at the first quartile shouldn't have a substantial effect for large datasets with night time data, and it makes the filter more useful for a variety of data sets.
Only part of the data set should be marked True to indicate clipping.
Separates the test for whether a value is clipped into a named predicate which improves readability and satisfies the linter by shortening the if-line where the predicate is tested.
For the quadratic data set passing freq='10T' should give the same result as not passing a freq.
Completely goofed the call to isinstance when I wrote that the first time. Thanks to Coveralls for pointing out that there was no coverage on this line.
Still needs:
|
Gives a complete description of how the clipping threshold is computed
Track and update the longest period of suspected clipping throughout the loop, not just when clipping ends. Renames variable to improve clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably going to be a few iterations on this, let me know if it's easier to talk it out.
Not necessary to make a DataFrame with a minutes column, all the same operations can be achieved using the original series. Also adds two parameters to the function to replace the magic numbers (power_quantile and frequency_quantile).
These two parameters control the calculation of the clipping threshold.
Match the order of the threshold parameters with the order of the data parameters. Also rename minimum -> power_min to be more descriptive.
Also introduces the power_min parameter which controlls how far above the median power a level must be for it to be considered clipping.
Makes the calculation of the clipping power level a bit more readable, it is possible that it also improves the performance by using pandas functions instead of loops, but I doubt the difference is noticable given that the loop was only iterating over a single day of data (at most 1440 values). Readability is substantially better, though.
freq should be in units of minutes. Was accidentally multiplying instead of dividing.
Credit PVFleets QA project as the source of the algorithm; however, the code has been entirely implemented from scratch, so removing the copyright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better with the descriptions in the docstrings.
pvanalytics/features/clipping.py
Outdated
each minute of the day, any minute with a count less than the | ||
`frequency_quantile`-quantile of all counts is excluded from | ||
the calculation of the clipping threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a reference to the day/ night detector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad you brought this up. We do a sort of day/night filtering step in _daytime_powercurve
. It might be better to pass in only daytime power, or pass a day/night mask. Passing a mask is probably the best solution to limit coupling between modules while still reducing code duplication.
@matt14muller, are there any caveats/pitfalls you can see in changing the original algorithm like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually starting to wonder if this step is even necessary. We only accept a period with low slope a clipping if it has sufficiently high power (see the power_min
parameter). That should prevent using night time or early morning periods where we shouldn't be seeing clipping as the the clipping threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow your question. Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a step where minutes of the day with a low number of positive values are eliminated. I think it is to remove the night/early morning/evening times. Do you think that is really necessary? Can we just include night-time/morning/evening when looking for clipping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just include night-time/morning/evening when looking for clipping?
We can with this algorithm, since there is a power threshold critieria. The SFA function only looks for flat spots, and has to be combined with a night/day detector to avoid labeling night as clipping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to remove the frequency_quantile
parameter (and associated code) and see if the tests still pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cwhanse @matt14muller I removed the frequency_quantile
param. Works fine for the tests, but those are just simple tests... The only problem I can think of with this is if there is enough erroneous high data in the night time that the 95% quantile is driven up. I'm still good with eliminating the parameter though since we have many other quality functions to help users eliminate that problem. If worst comes to worst, users can simply pass only daytime power values, leaving night time out of the equation all together.
Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
This makes it more clear that we are looking at the point-to-point slope, as opposed a more complex numerical approximation of the derivative.
We already pass all parameters, no need for defaults. By removing the defaults I can get an error out of the interpreter when I forget to pass a parameter. Also exposes the `power_min` parameter in clipping.threshold. I'm having a hard time describing it well though, It specifies that power must be greater than `power_min` times the median of the `power_quantile` quantile of power at each minute of the day, excluding night, and early morning/late evening.
Provides a significantly better description of how the threshold is calculated.
/ normalized_power.index.to_series().diff()) * freq | ||
|
||
clipped_times = _clipped(powercurve, power_derivative, | ||
powercurve.median() * power_min, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this
Change up the order of sentences in the last paragraph (much better - thanks, Cliff). rewording in the third paragraph. Need to find a way to describe that only the values for the longest period with low-slope high-power are used to compute the threshold. (as is, it reads kind of like we take the average of the full curve). Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
Because this algorithm has a minimum power threshold (`power_min`) we don't need to filter out night time periods (where the slope will pretty much always be near zero). This could still get messed up if the `power_quantile` of night time power is very high, but we will assume that the data you pass in here has reasonable night time values (most of the time). If it doesn't and you are concerned, then you can pass data for daytime periods only.
Reflects the removal of the frequency_quantile parameter. Makes the last step in the calculation of the threshold more clear and clarifies what is returned.
Detect clipping in power data by identifying a clipping threshold (through the derivative of the 99.5% quantile of the daily power curve).
All data greater than or equal to the clipping threshold is marked as "clipped".