
wfvining
Collaborator

Many datasets have leading and trailing periods with sporadic or completely missing data. This PR adds two functions for identifying and removing these periods:

  • quality.gaps.valid_between() gives the first and last valid dates in the series.
  • quality.gaps.trim() returns a boolean mask that is True for the valid period between the dates given by valid_between.

Both functions operate on whole days in the data set: all data for a day is marked valid or invalid based on the number of hours of data that exist on that day. Any data point with a value other than NaN is considered "valid" data.

Looks for sporadic/sparse data at the beginning and end of a series and
removes those leading and trailing periods.
To keep consistent with the rest of the quality functions `gaps.trim`
should return a boolean mask with False for the entries that are being
trimmed rather than a slice of the series.
Did not previously test this important edge-case
@wfvining wfvining marked this pull request as ready for review April 28, 2020 16:13
@wfvining wfvining requested a review from cwhanse April 28, 2020 16:13
@wfvining
Collaborator Author

wfvining commented May 5, 2020

Per discussion with Matt and Cliff, this should be expanded to include a function that gives a completeness score to each day (hours per day with data) and a function that gives a boolean mask for days with completeness > some threshold. This mask can/should be used in trim().
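A rough sketch of the two pieces described above, assuming a regularly sampled pandas Series with a DatetimeIndex. The names `daily_completeness_sketch` and `complete_days_sketch`, the `freq` argument, and the 0.5 threshold are illustrative assumptions, not the functions or defaults added in this PR.

```python
import numpy as np
import pandas as pd


def daily_completeness_sketch(series, freq):
    """Fraction of each day that has a non-NaN value (hypothetical helper).

    `freq` is the expected sampling interval, e.g. "30min".
    """
    expected_per_day = pd.Timedelta("1D") / pd.Timedelta(freq)
    # count() skips NaN, so NaN values and missing timestamps both lower the score.
    return series.resample("D").count() / expected_per_day


def complete_days_sketch(series, freq, threshold):
    """Daily boolean mask: True where completeness exceeds `threshold`."""
    return daily_completeness_sketch(series, freq) > threshold


# Three days of 30-minute data (48 samples per day).
index = pd.date_range("2020-01-01", periods=144, freq="30min")
values = np.ones(144)
values[:36] = np.nan   # day 1: only 12 of 48 samples have data
values[96:] = np.nan   # day 3: no data at all
data = pd.Series(values, index=index)

scores = daily_completeness_sketch(data, "30min")
mask = complete_days_sketch(data, "30min", threshold=0.5)
print(scores.tolist())  # [0.25, 1.0, 0.0]
print(mask.tolist())    # [False, True, False]
```

The mask here is indexed by day; broadcasting it back onto the original timestamps is left out to keep the sketch short.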

@wfvining wfvining marked this pull request as draft May 5, 2020 18:43
Completeness is the fraction of the day that has data (a timestamp
exists and its value is not NaN).
wfvining added 2 commits May 7, 2020 08:31
A longer series is necessary for infer_freq. It is reasonable to include
a longer series here: since the function under test aggregates data by
day, it will not be called with very short time series.
It doesn't make sense to pass a frequency that is longer than the
inferred frequency of the series.
@wfvining wfvining force-pushed the trim-missing branch 2 times, most recently from 216ed20 to abacd7c Compare May 7, 2020 18:49
wfvining added 2 commits May 7, 2020 13:19
Adds function, documentation, and initial tests covering edge cases
for the threshold parameter as well as tests covering basic
functionality.
wfvining added 3 commits May 7, 2020 14:05
Adds clarification to description of gaps.daily_completeness() and
gaps.complete().
@wfvining
Collaborator Author

wfvining commented May 7, 2020

Added daily completeness filters and refactored to use these functions in valid_between. Renamed valid_between to start_stop_dates and rewrote documentation to describe the function more accurately.

Tasks

  • add documentation for new functions to API docs
  • fix existing docs (did not change name of valid_between there.)

wfvining added 4 commits May 12, 2020 08:06
- rename valid_between to start_stop_dates
- daily_completeness
- complete
Reword for clarity and consistency. Adds a 'See Also' block to refer
to related functions (mostly refers to daily_completeness).
The new name `minimum_completeness` is much more descriptive.
Reduces the number of operations taking place in the return
statement. Minor change, but somewhat easier to read and understand.
@wfvining wfvining marked this pull request as ready for review May 12, 2020 14:37
@wfvining wfvining requested a review from cwhanse May 12, 2020 14:37
Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
wfvining added 3 commits May 14, 2020 11:54
move the reindexing inside the completeness score function. Keeps
operations that affect the index in the same function.
Slightly more verbose, but much more clear what is happening.
Also simplifies the _freq_to_seconds function since it no longer needs
to handle freq=None.
wfvining added 5 commits May 14, 2020 15:02
New function takes a series of booleans and looks for blocks of `days`
consecutive days where every value is True.
Uses the new interface for start_stop_dates, shifting the
daily_completeness calculations into trim_incomplete.
daily_completeness has been renamed completeness_score
refactored trim_incomplete to use the generic function.
@wfvining
Collaborator Author

Added a generic trim function that operates on a series of booleans.
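As an illustration of the idea only, a generic trim over a boolean series might look like the sketch below. The names `start_stop_sketch` and `trim_sketch`, the half-open `[start, stop)` convention, and the rolling-window approach are assumptions for this example and may differ from the merged implementation.

```python
import pandas as pd


def start_stop_sketch(flags, days):
    """Bounds of the data after dropping leading and trailing bad periods.

    Returns (start, stop), with stop exclusive, or (None, None) if no block
    of `days` consecutive all-True days exists. Hypothetical helper.
    """
    # A day is good only if every value on that day is True.
    good_days = flags.astype(int).resample("D").min() == 1
    # True on the last day of each window of `days` consecutive good days.
    window_end = good_days.astype(int).rolling(days).min() == 1
    ends = window_end[window_end].index
    if len(ends) == 0:
        return None, None
    start = ends[0] - pd.Timedelta(days=days - 1)
    stop = ends[-1] + pd.Timedelta(days=1)
    return start, stop


def trim_sketch(flags, days):
    """Boolean mask that is False for the leading and trailing periods."""
    start, stop = start_stop_sketch(flags, days)
    if start is None:
        return pd.Series(False, index=flags.index)
    keep = (flags.index >= start) & (flags.index < stop)
    return pd.Series(keep, index=flags.index)


# Eight days with two samples per day; days 4 to 6 form the only good block.
index = pd.date_range("2020-01-01", periods=16, freq=pd.Timedelta(hours=12))
values = [True, False, True, True, False, False] + [True] * 6 + [False, True, True, True]
flags = pd.Series(values, index=index)

start, stop = start_stop_sketch(flags, days=2)
trimmed = trim_sketch(flags, days=2)
print(start, stop)       # 2020-01-04 00:00:00 2020-01-07 00:00:00
print(trimmed.tolist())  # six False, six True, four False
```

Only the leading and trailing periods are trimmed in this sketch; any False values between `start` and `stop` stay inside the kept range.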

@wfvining wfvining requested a review from cwhanse May 19, 2020 14:42
wfvining and others added 2 commits May 19, 2020 11:29
Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>
@wfvining wfvining merged commit d08a7b9 into master May 19, 2020
@wfvining wfvining deleted the trim-missing branch May 21, 2020 14:45
2 participants