Trim leading and trailing periods of missing data #35
Looks for sporadic/sparse data at the beginning and end of a series and removes those leading and trailing periods.
To stay consistent with the rest of the quality functions, `gaps.trim` should return a boolean mask with False for the entries that are being trimmed rather than a slice of the series.
Did not previously test this important edge-case
Per discussion with Matt and Cliff, this should be expanded to include a function that gives a completeness score to each day (hours per day with data) and a function that gives a boolean mask for days with completeness > some threshold. This mask can/should be used in |
Completeness is the fraction of the day that has data (a timestamp exists and its value is not NaN).
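The definition above can be sketched with pandas. This is a minimal illustration, not the code from this PR: the helper name `daily_completeness_sketch` and its `freq` argument are assumptions made for the example, and it assumes a tz-naive `DatetimeIndex` with a fixed sampling interval.

```python
import numpy as np
import pandas as pd


def daily_completeness_sketch(series, freq):
    """Fraction of each day that has data (hypothetical helper, not the PR's code)."""
    # Seconds represented by a single sample at the given frequency.
    seconds_per_sample = pd.Timedelta(freq).total_seconds()
    # Count non-NaN values per calendar day; timestamps that are absent
    # from the index are simply never counted.
    valid_per_day = series.notna().astype(int).resample("D").sum()
    return valid_per_day * seconds_per_sample / 86400.0


# Two days of 30-minute data; the first half of the second day is NaN.
index = pd.date_range("2020-01-01", periods=96, freq="30min")
data = pd.Series(1.0, index=index)
data.iloc[48:72] = np.nan
scores = daily_completeness_sketch(data, "30min")
```

Here `scores` holds one value per day: 1.0 for the first day and 0.5 for the second.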
A longer series is necessary for infer_freq. It is reasonable to include a longer series here: the function under test aggregates data by day, so it will not be called with very short time series.
It doesn't make sense to pass a frequency that is longer than the inferred frequency of the series.
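One way such a check could look, as a rough sketch only: the function name `check_freq_sketch` is hypothetical, and the comparison relies on both frequencies being fixed-length offsets (minutes, hours), which is an assumption of this example rather than something the PR states.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def check_freq_sketch(series, freq):
    """Raise if `freq` is longer than the spacing inferred from the index (hypothetical)."""
    # pd.infer_freq needs at least three timestamps and returns None
    # when the index is not regularly spaced.
    inferred = pd.infer_freq(series.index)
    if inferred is None:
        raise ValueError("could not infer a frequency from the index")
    # .nanos is only defined for fixed-length offsets such as minutes or hours.
    if to_offset(freq).nanos > to_offset(inferred).nanos:
        raise ValueError(
            f"freq {freq!r} is longer than the inferred frequency {inferred!r}"
        )
    return inferred


index = pd.date_range("2020-01-01", periods=96, freq="30min")
series = pd.Series(1.0, index=index)
check_freq_sketch(series, "15min")  # accepted: shorter than the inferred spacing
```

With a frequency longer than the data's spacing (for example `"60min"` here), a day could score above 100% complete, which is why the sketch rejects it.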
216ed20 to abacd7c
Adds function, documentation, and initial tests covering edge cases for the threshold parameter as well as tests covering basic functionality.
Adds clarification to description of gaps.daily_completeness() and gaps.complete().
Added daily completeness filters and refactored to use these functions in Tasks
- rename valid_between to start_stop_dates
- daily_completeness
- complete
Reword for clarity and consistency. Adds a 'See Also' block to refer to related functions (mostly refers to daily_completeness).
The new name `minimum_completeness` is much more descriptive.
Reduces the number of operations taking place in the return statement. Minor change, but somewhat easier to read and understand.
Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
Move the reindexing inside the completeness score function. Keeps operations that affect the index in the same function.
Slightly more verbose, but much clearer about what is happening. Also simplifies the _freq_to_seconds function since it no longer needs to handle freq=None.
New function takes a series of booleans and looks for consecutive blocks of data, `days` days long, where every value is True.
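A possible shape for that search, sketched under assumptions: the names `in_long_true_run` and `start_stop_sketch` are invented for this example, the input is assumed to be a boolean series with one entry per day, and the actual function in this PR may differ.

```python
import pandas as pd


def in_long_true_run(daily_ok, days=10):
    """Mask of entries that belong to a run of at least `days` consecutive True values."""
    # A new run starts every time the value differs from the previous one.
    run_id = (daily_ok != daily_ok.shift()).cumsum()
    # Length of the run that each entry belongs to.
    run_length = daily_ok.groupby(run_id).transform("count")
    return daily_ok & (run_length >= days)


def start_stop_sketch(daily_ok, days=10):
    """First and last day that sit inside a sufficiently long run of True values."""
    good = in_long_true_run(daily_ok, days)
    if not good.any():
        return None, None
    good_days = good[good].index
    return good_days[0], good_days[-1]


index = pd.date_range("2020-01-01", periods=10, freq="D")
daily_ok = pd.Series(
    [True, False, True, True, True, False, True, True, True, True], index=index
)
start, stop = start_stop_sketch(daily_ok, days=3)
```

In this example the isolated True on the first day is ignored, so `start` is 2020-01-03 and `stop` is 2020-01-10.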
Uses the new interface for start_stop_dates, shifting the daily_completeness calculations into trim_incomplete.
daily_completeness has been renamed to completeness_score
Refactored trim_incomplete to use the generic function.
Added a generic trim function that operates on a series of booleans.
Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
Many data-sets come with leading and trailing periods that have sporadic or completely missing data. This adds two functions for identifying and removing these periods:
- `quality.gaps.valid_between()` gives the first and last valid dates in the series.
- `quality.gaps.trim()` returns a boolean mask that is True for the valid period between the dates given by `valid_between`.

These functions both operate on days in the data set, marking all data for each day as valid/invalid based on the number of hours of data that exists on each day. Any data point with a value other than `NaN` is considered "valid" data.
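To make the mask-based behaviour concrete, here is a simplified sketch. It is not the implementation in this PR: `trim_mask_sketch`, its `freq` and `min_hours` parameters, and the rule "a day is valid if it has at least `min_hours` hours of non-NaN data" are all assumptions chosen for illustration, and a tz-naive `DatetimeIndex` is assumed.

```python
import numpy as np
import pandas as pd


def trim_mask_sketch(series, freq, min_hours=12.0):
    """True between the first and last day with enough non-NaN data (hypothetical)."""
    sample_hours = pd.Timedelta(freq).total_seconds() / 3600.0
    # Hours of non-NaN data on each calendar day.
    hours_per_day = series.notna().astype(int).resample("D").sum() * sample_hours
    good_days = hours_per_day[hours_per_day >= min_hours].index
    if len(good_days) == 0:
        return pd.Series(False, index=series.index)
    # Compare each timestamp's day against the first and last valid day.
    day = series.index.normalize()
    inside = (day >= good_days[0]) & (day <= good_days[-1])
    return pd.Series(inside, index=series.index)


# Four days of hourly data: sparse first day, two full days, empty last day.
index = pd.date_range("2020-01-01", periods=96, freq="60min")
data = pd.Series(1.0, index=index)
data.iloc[0:24] = np.nan
data.iloc[[3, 10]] = 1.0  # two stray readings on the first day
data.iloc[72:96] = np.nan
mask = trim_mask_sketch(data, "60min")
trimmed = data[mask]  # the caller applies the mask; the function itself never slices
```

Returning a mask instead of a slice leaves the original index intact, so the result can be combined with other boolean quality masks using `&` before any data is dropped.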