Expanded aggregate functionality #25
Merged
TL;DR: Added time-shifting functionality to the aggregate function as an optional argument, handling weather-data shifting more robustly than create_dataframe did. This addresses issues with the time shifting previously handled in create_dataframe, which attempted to shift all load and independent-variable data forward to the end of the period to match normal weather data whose timestamps represent the end of the period. However, that shifting appears to have been done after aggregation, leading to larger-than-necessary shifts for hourly aggregation. The new shift functionality in the aggregate function shifts only the temp time series back by one increment of its interval prior to aggregation when the shift_normal_weather flag is TRUE, as sketched below. This eliminates the problem in create_dataframe where everything was shifted forward by an hour, day, or month when aggregating to those respective intervals.
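For illustration, a minimal sketch of the pre-aggregation shift, assuming a temp dataframe with a time column; the helper and column names are hypothetical, not the exact aggregate.R code:

```r
library(dplyr)

# Hypothetical helper: shift end-of-period weather timestamps back by one
# reporting interval so they represent the start of the period, before any
# aggregation happens
shift_normal_weather_back <- function(temp_data) {
  # Infer the reporting interval from the median spacing of the timestamps
  interval <- median(diff(temp_data$time))
  temp_data %>% mutate(time = time - interval)
}
```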
…function Added support for additional independent variables and their associated user-specified aggregation functions for hourly and daily aggregation. Also tried to simplify the structure of the logic. TO DO: Update the monthly aggregation block to support additional independent variables and aggregation-function specification, and investigate whether the logic within this block can be further simplified.
Extensive re-write of how monthly data is handled in the aggregate function. All for loops have been removed to keep the style and syntax consistent with other parts of this function and hopefully increase readability. Support for additional independent variable aggregation has also been added to monthly aggregation. Further testing is required to confirm that the behavior is consistent with the create_dataframe function's handling of additional independent variables when aggregating to monthly data.
Changed the implementation of monthly aggregation to use dplyr's overlap-join syntax (a combination of inner_join and join_by with inequalities), replacing the need for the fuzzyjoin package, which had accomplished the same overlap-join task.
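A sketch of the overlap-join pattern this refers to (dplyr >= 1.1.0); the column names here are illustrative, not necessarily those used in aggregate.R:

```r
library(dplyr)

periods <- tibble(
  period_start = as.POSIXct(c("2023-05-15", "2023-06-15"), tz = "UTC"),
  period_end   = as.POSIXct(c("2023-06-15", "2023-07-15"), tz = "UTC")
)

hourly_temp <- tibble(
  time = seq(as.POSIXct("2023-05-15", tz = "UTC"),
             as.POSIXct("2023-07-14 23:00:00", tz = "UTC"), by = "hour"),
  temp = rnorm(length(time), mean = 65, sd = 10)
)

# Match each hourly reading to the usage period it falls within, then
# summarise to one value per period -- no fuzzyjoin required
monthly_temp <- hourly_temp %>%
  inner_join(periods, by = join_by(time >= period_start, time < period_end)) %>%
  group_by(period_start, period_end) %>%
  summarise(temp = mean(temp), .groups = "drop")
```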
…ata aggregation Updated the aggregate function to further clarify which date ranges are used to aggregate data when monthly aggregation is selected. Monthly data aggregation now creates a dataframe with two additional columns, period_start and period_end. The current assumption is that usage data provided on a monthly interval (or longer) has timestamps representing the beginning of the usage period. For example, if a time series has (5/15/23, 200 kWh) and (6/15/23, 170 kWh), then we assume the usage period associated with 200 kWh is 5/15/23-6/15/23 (not inclusive of the final date). Because the nth usage period only has an (assumed) start date and no end date, the duration of that period is assumed to be the median interval of the entire time series. If the user is unhappy with this assumption and knows the end date for the last period, they can add an additional row to their input data with the end date and 0 usage, then remove the last row of the output dataframe.
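A hedged sketch of that assumption, showing how period_start and period_end might be derived from start-of-period monthly timestamps (the column names are illustrative):

```r
library(dplyr)

eload <- tibble(
  time  = as.POSIXct(c("2023-04-15", "2023-05-15", "2023-06-15"), tz = "UTC"),
  eload = c(180, 200, 170)
)

# The final period has no successor timestamp, so its duration is assumed
# to be the median interval of the entire series
median_interval <- median(diff(eload$time))

periods <- eload %>%
  mutate(
    period_start = time,
    # Each period ends where the next one begins (end date not inclusive)
    period_end   = coalesce(lead(time), time + median_interval)
  )
```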
Changed the displayed aggregation intervals output for monthly aggregation. The period_start and period_end columns are now both period-inclusive, meaning the period for July is displayed as 7/1 - 7/31 rather than 7/1 - 8/1.
This update provides support for 15-minute aggregation to bring the features in line with the create_dataframe function.
This update is an overhaul of the create_dataframe function. In an effort to simplify the aggregation process, remove the dependence on the xts package, and deal with time shifting more accurately, the majority of the work has been handed to the recently revamped aggregate.R function. create_dataframe now serves as the front-end function that provides error trapping and catching on user inputs (arguments), while aggregate.R is the back-end engine. Additionally, a new argument has been added: shift_normal_weather. Rather than assuming the user always wants to shift the weather file from end-of-period reporting to start-of-period reporting, this argument defaults to FALSE and can be set to TRUE to execute the shifting process. Shifting now happens prior to aggregation instead of after aggregation, addressing the issue "Create dataframe shifts timestamps" opened on December 29, 2022.
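As an illustration only, a call might look like the following; shift_normal_weather, start_date, and end_date are the arguments described in this PR, while the other argument names are assumptions about the interface:

```r
# Hypothetical usage sketch -- argument names other than shift_normal_weather,
# start_date, and end_date are assumed for illustration
df <- create_dataframe(
  eload_data               = eload,
  temp_data                = temp,
  convert_to_data_interval = "Monthly",
  start_date               = "2023-01-01 00:00",
  end_date                 = "2023-12-31 23:59",
  shift_normal_weather     = TRUE  # default is FALSE: no shift is applied
)
```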
This update fixes an issue where 15-minute, hourly, or daily data aggregated to monthly intervals would not have the correct inclusive interval end date.
Added two new optional arguments, start_date and end_date, to trim the dataframe down to a specified date range. If no start and end dates are selected, they will be automatically generated using the latest start and earliest end of all input time series.
Previously, if no end date was provided as an argument, it would be automatically generated based on the earliest ending date of all time series. This created an issue for longer interval data when aggregated to monthly intervals. For instance, if eload has the earliest ending date at 5/15/23 and is in monthly intervals, then all other data including temp and additional independent variables would be cut to end at 5/15/23. This means for the final eload usage period (5/15/23-6/15/23) there would be only 1 day of temp and additional independent variable data to aggregate. This change addresses that issue by extending the end date by the eload interval when appropriate.
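A sketch of how the default date range might be derived under this change; the function and column names are assumptions, not the exact aggregate.R code:

```r
# Hypothetical helper illustrating the trimming logic described above
derive_date_range <- function(eload, temp, start_date = NULL, end_date = NULL) {
  if (is.null(start_date)) {
    # Latest start across all input time series
    start_date <- max(min(eload$time), min(temp$time))
  }
  if (is.null(end_date)) {
    # Earliest end across all input time series ...
    end_date <- min(max(eload$time), max(temp$time))
    # ... extended by one eload interval when eload ends earliest, so the
    # final usage period still has a full window of temp data to aggregate
    if (max(eload$time) == end_date) {
      end_date <- end_date + median(diff(eload$time))
    }
  }
  list(start_date = start_date, end_date = end_date)
}
```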
Previous development of monthly aggregation led to overly complicated logic and nested if blocks, each with its own aggregation procedure. This update significantly simplifies that structure by using the overlapping-join method for all monthly data aggregation. This way, only a single set of aggregation actions has to be performed, while the if blocks handle the construction of the intervals to aggregate to. This should improve readability and make the code more maintainable. An important feature that comes from the continued development of start- and end-date argument support is that, if you use eload data that is originally in monthly or longer increments, you can now specify when the final usage period ends by using the end_date argument. This ensures that temperature and additional independent variables are correctly aggregated to the final usage period. If end_date is not provided, the function will attempt to determine when the final usage period ends based on the median interval of all preceding usage periods.
This update organizes and expands error trapping for create_dataframe.R input arguments. The main addition is error trapping for mismatched time zones: if dataframes with differing time zones are supplied to the function, it will throw an error and provide context with the relevant time zone for each dataframe.
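A minimal sketch of such a check, assuming POSIXct time columns; the function and column names are illustrative, not the exact create_dataframe.R implementation:

```r
# Hypothetical time-zone consistency check
check_timezones <- function(eload, temp) {
  tz_eload <- attr(eload$time, "tzone")
  tz_temp  <- attr(temp$time,  "tzone")
  if (!identical(tz_eload, tz_temp)) {
    stop("Time zones of input dataframes do not match: eload = '",
         tz_eload, "', temp = '", tz_temp, "'")
  }
  invisible(TRUE)
}
```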
More documentation provided in the preamble for the aggregate and create_dataframe functions, including parameter definitions and function descriptions.
…bles Huge performance enhancements to aggregate.R when aggregating large numbers of additional independent variables. Thanks to deschen on Stack Overflow for [this suggestion](https://stackoverflow.com/a/76725725/22254350), which was modified and implemented in this function under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode). Previously, additional independent variables were aggregated using dplyr's summarize_at function to apply all aggregation functions to all columns (effectively squaring the number of columns) and then subsetting to return only the important column-function combinations. Now, additional independent variables are aggregated by wrapping the aggregation procedure in a call to map2 from the purrr package, iterating column by column and applying only one function per column on each pass. This massively reduces memory use and computation time for data sets with a large number of additional independent variables.
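A sketch of the column-by-column pattern described above, adapted loosely from the linked answer; the names are illustrative and the actual aggregate.R code may differ:

```r
library(dplyr)
library(purrr)

# Aggregate each additional independent variable with its own function,
# one column at a time, instead of applying every function to every column
aggregate_additional_vars <- function(data, var_names, agg_funs, group_col) {
  map2(var_names, agg_funs, function(var, fun) {
    data %>%
      group_by(.data[[group_col]]) %>%
      summarise("{var}" := fun(.data[[var]]), .groups = "drop")
  }) %>%
    reduce(left_join, by = group_col)
}

# Example: mean occupancy and summed production per aggregation interval
# result <- aggregate_additional_vars(df, c("occupancy", "production"),
#                                     list(mean, sum), "interval_start")
```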
increment version number
Update author list
The `aggregate` and `create_dataframe` functions have been re-written to accomplish the main tasks described above. Part of the procedure-unification process involved stripping the primary processing code out of the `create_dataframe` function and instead calling the `aggregate` function to serve as the processing engine. `create_dataframe` now acts as a user-friendly front-end function that checks and error-traps inputs. `aggregate` can also be called directly by nmecr users, but it is intended more as a procedure to be used within other functions and analysis scripts where the developer is certain that arguments are formatted correctly.