Trim leading and trailing periods of missing data #35

wfvining · 2020-04-21T20:12:53Z

Many datasets come with leading and trailing periods that have sporadic or completely missing data. This adds two functions for identifying and removing these periods:

- `quality.gaps.valid_between()` gives the first and last valid dates in the series.
- `quality.gaps.trim()` returns a boolean mask that is True for the valid period between the dates given by `valid_between()`.

Both functions operate on whole days in the data set, marking all data for each day as valid or invalid based on the number of hours of data that exist on that day. Any data point with a value other than NaN is considered "valid" data.
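
For readers who want to see the day-based idea concretely, here is a minimal sketch in plain pandas. It is not the code in this PR: the helper names `first_last_valid_days` and `trim_mask`, the `min_hours` default, and the `hours_per_sample` argument are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def first_last_valid_days(series, min_hours=8.0, hours_per_sample=1.0):
    """Hypothetical helper: first and last days with >= min_hours of non-NaN data."""
    # Count non-NaN samples on each calendar day and convert the count to hours.
    days = series.index.normalize()
    hours = series.notna().groupby(days).sum() * hours_per_sample
    valid = hours[hours >= min_hours]
    if valid.empty:
        return None, None
    return valid.index[0], valid.index[-1]


def trim_mask(series, min_hours=8.0, hours_per_sample=1.0):
    """Hypothetical helper: True from the first valid day through the last."""
    start, stop = first_last_valid_days(series, min_hours, hours_per_sample)
    if start is None:
        # No valid days at all: every entry is trimmed.
        return pd.Series(False, index=series.index)
    days = series.index.normalize()
    return pd.Series((days >= start) & (days <= stop), index=series.index)


# Four days of hourly data: day 1 is sparse, days 2-3 are full, day 4 is empty.
index = pd.date_range("2020-01-01", periods=96, freq="h")
values = pd.Series(1.0, index=index)
values.iloc[2:24] = np.nan
values.iloc[72:] = np.nan
mask = trim_mask(values)
```

In this sketch the mask is False for all of day 1 and day 4 and True for every timestamp on days 2 and 3. A series with no valid days gives an all-False mask.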

wfvining · 2020-05-05T17:33:27Z

Per discussion with Matt and Cliff, this should be expanded to include a function that gives each day a completeness score (hours per day with data) and a function that gives a boolean mask for days with completeness above some threshold. This mask can and should be used in trim().
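
A rough sketch of what these two functions could look like, assuming a regularly sampled series with a known sample interval. The function names, the `freq` argument, the `>=` comparison, and the example threshold are assumptions for illustration, not the implementation in this PR.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset


def daily_completeness_sketch(series, freq):
    """Hypothetical helper: fraction of each day covered by non-NaN samples."""
    seconds_per_sample = pd.Timedelta(to_offset(freq)).total_seconds()
    # Days with no timestamps at all still appear here, with a count of zero.
    samples_per_day = series.notna().astype(int).resample("D").sum()
    return samples_per_day * seconds_per_sample / (24 * 60 * 60)


def complete_mask(series, freq, minimum_completeness=0.333):
    """Hypothetical helper: True for every sample on a sufficiently complete day."""
    score = daily_completeness_sketch(series, freq)
    # Carry each day's score forward onto the timestamps that fall on that day.
    per_sample = score.reindex(series.index, method="pad")
    return per_sample >= minimum_completeness


# Three days of 15-minute data (96 samples per day).
index = pd.date_range("2020-01-01", periods=3 * 96, freq="15min")
data = pd.Series(1.0, index=index)
data.iloc[96:144] = np.nan      # day 2: half of the values are NaN
data = data.drop(index[216:])   # day 3: the last 72 timestamps are missing
score = daily_completeness_sketch(data, "15min")
mask = complete_mask(data, "15min", minimum_completeness=0.5)
```

Here the daily scores come out to 1.0, 0.5, and 0.25, so with a threshold of 0.5 the mask keeps days 1 and 2 and rejects the 24 remaining samples on day 3. Note that both NaN values and missing timestamps lower the score.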

wfvining · 2020-05-07T21:53:20Z

Added daily completeness filters and refactored to use these functions in valid_between. Renamed valid_between to start_stop_dates and rewrote documentation to describe the function more accurately.

Tasks

- add documentation for new functions to API docs
- fix existing docs (the name of valid_between was not changed there)

wfvining · 2020-05-19T14:17:31Z

Added a generic trim function that operates on a series of booleans.

wfvining added 7 commits April 21, 2020 13:27

Function to detect and remove leading and trailing gaps

6e2b6e3

Looks for sporadic/sparse data at the beginning and end of a series and removes those leading and trailing periods.

Return boolean mask instead of trimming the series.

ef9554f

To stay consistent with the rest of the quality functions, `gaps.trim` should return a boolean mask with False for the entries that are being trimmed rather than a slice of the series.

Trimming a series with no valid days returns all False

0a23469

This important edge case was not previously tested.

Add gaps.valid_between and gaps.trim to API documentation

0cc6335

Add license and attribution for pvfleets_qa_analysis

f95e83f

Fix docstring indentation and spelling for gaps.trim()

62e74da

clarify valid data in documentation for 'valid_between'

013d2e2

wfvining marked this pull request as ready for review April 28, 2020 16:13

wfvining requested a review from cwhanse April 28, 2020 16:13

cwhanse reviewed May 4, 2020

pvanalytics/quality/gaps.py (outdated, resolved)

wfvining marked this pull request as draft May 5, 2020 18:43

Tests for function that calculates a daily completeness index

d2ceb54

Completeness is the fraction of the day that has data (a timestamp exists and its value is not NaN).

wfvining force-pushed the trim-missing branch from 24f5acb to d2ceb54 Compare May 5, 2020 21:38

wfvining added 2 commits May 7, 2020 08:31

Fix data types and use longer series in tests

a089410

A longer series is necessary for infer_freq. It is reasonable to use a longer series here: because the function under test aggregates data by day, it will not be called with very short time series.

Raise a value error if the frequency passed to the function is bad

2c5d618

It doesn't make sense to pass a frequency that is longer than the inferred frequency of the series.
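
As an illustration of this check, the sketch below compares a requested frequency against the one pandas infers from the index. The function name `seconds_per_sample` and its exact behaviour are assumptions, not the code from this commit. Note that `pd.infer_freq` itself needs at least three timestamps, which is why the tests use a longer series.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def seconds_per_sample(series, freq=None):
    """Hypothetical helper: sample interval in seconds, validated against the index."""
    inferred = pd.infer_freq(series.index)
    if freq is None:
        if inferred is None:
            raise ValueError("cannot infer the frequency; pass freq explicitly")
        return pd.Timedelta(to_offset(inferred)).total_seconds()
    requested = pd.Timedelta(to_offset(freq)).total_seconds()
    if inferred is not None:
        actual = pd.Timedelta(to_offset(inferred)).total_seconds()
        # A frequency coarser than the data's own spacing is rejected.
        if requested > actual:
            raise ValueError("freq is longer than the inferred frequency")
    return requested


index = pd.date_range("2020-01-01", periods=96, freq="15min")
data = pd.Series(1.0, index=index)
inferred_seconds = seconds_per_sample(data)           # uses the inferred frequency
finer_seconds = seconds_per_sample(data, "5min")      # finer than the data: accepted
try:
    seconds_per_sample(data, "h")                     # coarser than the data
    rejected = False
except ValueError:
    rejected = True
```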

wfvining force-pushed the trim-missing branch 2 times, most recently from 216ed20 to abacd7c Compare May 7, 2020 18:49

wfvining added 2 commits May 7, 2020 13:19

Tests for completeness filtering function.

324c6c4

Adds function, documentation, and initial tests covering edge cases for the threshold parameter as well as tests covering basic functionality.

Initial implementation of gaps.complete

f255b51

wfvining force-pushed the trim-missing branch from abacd7c to f255b51 Compare May 7, 2020 19:20

wfvining added 3 commits May 7, 2020 14:05

Improve documentation.

0dbc14b

Adds clarification to description of gaps.daily_completeness() and gaps.complete().

Refactor valid_between to use daily_completeness

5a4fb54

Rename valid_between to start_stop_dates.

eedd93e

wfvining force-pushed the trim-missing branch from 6d31676 to eedd93e Compare May 7, 2020 21:49

wfvining added 4 commits May 12, 2020 08:06

Add new gaps functions to API documentation.

b706080

- rename valid_between to start_stop_dates - daily_completeness - complete

Update documentation for quality.gaps functions

17dd564

Reword for clarity and consistency. Adds a 'See Also' block to refer to related functions (mostly refers to daily_completeness).

Rename threshold parameter to gaps.complete function

eb29ada

The new name `minimum_completeness` is much more descriptive.

Rework gaps.complete for improved readability.

e3c91d8

Reduces the number of operations taking place in the return statement. Minor change, but somewhat easier to read and understand.

wfvining marked this pull request as ready for review May 12, 2020 14:37

wfvining requested a review from cwhanse May 12, 2020 14:37

cwhanse reviewed May 13, 2020

Apply documentation changes suggested in code review.

f97093f

Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>

wfvining force-pushed the trim-missing branch from 334337f to f97093f Compare May 14, 2020 14:08

wfvining added 3 commits May 14, 2020 11:54

Add keep_index arg to daily_completeness & rename completeness_score

02df37e

Move the reindexing inside the completeness score function. This keeps operations that affect the index in the same function.

Improve clarity in calculation of seconds per sample.

6552a8a

Slightly more verbose, but it is much clearer what is happening. Also simplifies the _freq_to_seconds function since it no longer needs to handle freq=None.

use a more descriptive variable name in gaps.trim

17fc35f

wfvining force-pushed the trim-missing branch from 2e1a6f2 to 17fc35f Compare May 14, 2020 17:54

wfvining added 5 commits May 14, 2020 15:02

Rework old tests for the new start_stop_dates API

82d87ff

The new function takes a series of booleans and looks for consecutive blocks, `days` days long, in which every value is True.
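
One way to picture this search is with a rolling sum over a daily boolean series. The sketch below is an assumption-laden illustration: the name `start_stop_sketch`, the choice to return the first day of the first block and the last day of the last block, and the use of one value per day are not taken from the PR.

```python
import numpy as np
import pandas as pd


def start_stop_sketch(daily_mask, days=10):
    """Hypothetical helper: bounds of the span covered by all-True blocks."""
    # A rolling sum equal to `days` marks the last day of an all-True window.
    window_full = daily_mask.astype(int).rolling(days).sum() == days
    ends = np.flatnonzero(window_full.to_numpy())
    if len(ends) == 0:
        return None, None
    start = daily_mask.index[ends[0] - days + 1]
    stop = daily_mask.index[ends[-1]]
    return start, stop


index = pd.date_range("2020-01-01", periods=10, freq="D")
daily = pd.Series(
    [True, False, True, True, True, False, True, True, True, False],
    index=index,
)
start, stop = start_stop_sketch(daily, days=3)
```

With `days=3` the isolated True on the first day is ignored, and the span runs from 2020-01-03 to 2020-01-09. If no block of the requested length exists, the sketch returns `(None, None)`.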

Rewrite trim function as trim_incomplete

21cfb33

Uses the new interface for start_stop_dates, shifting the daily_completeness calculations into trim_incomplete.

Update references to daily_completeness in documentation

7ef91ea

daily_completeness has been renamed to completeness_score.

Documentation for more general 'gaps.trim' function

ef706f8

Generic function for trimming beginning and end of time series

9fc7a59

Refactored trim_incomplete to use the generic function.

wfvining requested a review from cwhanse May 19, 2020 14:42

cwhanse approved these changes May 19, 2020

pvanalytics/quality/gaps.py (outdated, resolved)

pvanalytics/quality/gaps.py (outdated, resolved)

wfvining commented May 19, 2020

docs/api.rst (resolved)

wfvining and others added 2 commits May 19, 2020 11:29

Apply suggestions from code review

941bf79

Co-authored-by: Cliff Hansen <cwhanse@sandia.gov>

Documentation improvements.

ddf6bb6

wfvining merged commit d08a7b9 into master May 19, 2020

wfvining deleted the trim-missing branch May 21, 2020 14:45
