Expanded aggregate functionality #25
Merged
TL;DR: Added time-shifting functionality to the aggregate function as an optional argument, handling weather-data shifting more robustly than create_dataframe did. This addresses issues with the time shifting previously handled in create_dataframe, which attempted to shift all load and independent-variable data forward to the end of the period to match normal weather data whose timestamps represent the end of the period. However, that shifting appears to have been done after aggregation, leading to larger-than-necessary shifts for hourly aggregation. The new shift functionality in the aggregate function shifts only the temp time series back by one increment of its interval prior to aggregation when the shift_normal_weather flag is TRUE, as sketched below. This eliminates the problem in create_dataframe where everything was shifted forward by an hour, day, or month when aggregating to those respective intervals.
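For illustration, a minimal sketch of the pre-aggregation shift, assuming a temp dataframe with a time column; the helper and column names are hypothetical, not the exact aggregate.R code:

```r
library(dplyr)

# Hypothetical helper: shift end-of-period weather timestamps back by one
# reporting interval so they represent the start of the period, before any
# aggregation happens
shift_normal_weather_back <- function(temp_data) {
  # Infer the reporting interval from the median spacing of the timestamps
  interval <- median(diff(temp_data$time))
  temp_data %>% mutate(time = time - interval)
}
```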
…function Added support for additional independent variables and their associated user-specified aggregation functions for hourly and daily aggregation. Also tried to simplify the structure of the logic. TO DO: Update the monthly aggregation block to support additional independent variables and aggregation-function specification, and investigate whether the logic within this block can be further simplified.
Extensive re-write of how monthly data is handled in the aggregate function. All for loops have been removed to keep the style and syntax consistent with other parts of this function and hopefully increase readability. Support for additional independent variable aggregation has also been added to monthly aggregation. Further testing is required to confirm that the behavior is consistent with the create_dataframe function's handling of additional independent variables when aggregating to monthly data.
Changed the implementation of monthly aggregation to use dplyr's overlap-join syntax (a combination of inner_join and join_by with inequalities), replacing the need for the fuzzyjoin package, which had accomplished the same overlap-join task.
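A sketch of the overlap-join pattern this refers to (dplyr >= 1.1.0); the column names here are illustrative, not necessarily those used in aggregate.R:

```r
library(dplyr)

periods <- tibble(
  period_start = as.POSIXct(c("2023-05-15", "2023-06-15"), tz = "UTC"),
  period_end   = as.POSIXct(c("2023-06-15", "2023-07-15"), tz = "UTC")
)

hourly_temp <- tibble(
  time = seq(as.POSIXct("2023-05-15", tz = "UTC"),
             as.POSIXct("2023-07-14 23:00:00", tz = "UTC"), by = "hour"),
  temp = rnorm(length(time), mean = 65, sd = 10)
)

# Match each hourly reading to the usage period it falls within, then
# summarise to one value per period -- no fuzzyjoin required
monthly_temp <- hourly_temp %>%
  inner_join(periods, by = join_by(time >= period_start, time < period_end)) %>%
  group_by(period_start, period_end) %>%
  summarise(temp = mean(temp), .groups = "drop")
```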
…ata aggregation Updated the aggregate function to further clarify which date ranges are used to aggregate data when monthly aggregation is selected. Monthly data aggregation now creates a dataframe with two additional columns, period_start and period_end. The current assumption is that usage data provided on a monthly interval (or longer) has timestamps representing the beginning of the usage period. For example, if a time series has (5/15/23, 200 kWh) and (6/15/23, 170 kWh), then we assume the usage period associated with 200 kWh is 5/15/23-6/15/23 (not inclusive of the final date). Because the nth usage period only has an (assumed) start date and no end date, the duration of that period is assumed to be the median interval of the entire time series. If the user is unhappy with this assumption and knows the end date for the last period, they can add an additional row to their input data with the end date and 0 usage, then remove the last row of the output dataframe.
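A hedged sketch of that assumption, showing how period_start and period_end might be derived from start-of-period monthly timestamps (the column names are illustrative):

```r
library(dplyr)

eload <- tibble(
  time  = as.POSIXct(c("2023-04-15", "2023-05-15", "2023-06-15"), tz = "UTC"),
  eload = c(180, 200, 170)
)

# The final period has no successor timestamp, so its duration is assumed
# to be the median interval of the entire series
median_interval <- median(diff(eload$time))

periods <- eload %>%
  mutate(
    period_start = time,
    # Each period ends where the next one begins (end date not inclusive)
    period_end   = coalesce(lead(time), time + median_interval)
  )
```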
Changed the displayed aggregation intervals output for monthly aggregation. The period_start and period_end columns are now both period-inclusive, meaning the period for July is displayed as 7/1 - 7/31 rather than 7/1 - 8/1.
This update provides support for 15-minute aggregation to bring the features in line with the create_dataframe function.
This update is an overhaul of the create_dataframe function. In an effort to simplify the aggregation process, remove the dependence on the xts package, and deal with time shifting more accurately, the majority of the work has been handed to the recently revamped aggregate.R function. create_dataframe now serves as the front-end function that provides error trapping and catching on user inputs (arguments), while aggregate.R is the back-end engine. Additionally, a new argument has been added: shift_normal_weather. Rather than assuming the user always wants to shift the weather file from end-of-period reporting to start-of-period reporting, this argument defaults to FALSE and can be set to TRUE to execute the shifting process. Shifting now happens prior to aggregation instead of after aggregation, addressing the issue "Create dataframe shifts timestamps" opened on December 29, 2022.
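As an illustration only, a call might look like the following; shift_normal_weather, start_date, and end_date are the arguments described in this PR, while the other argument names are assumptions about the interface:

```r
# Hypothetical usage sketch -- argument names other than shift_normal_weather,
# start_date, and end_date are assumed for illustration
df <- create_dataframe(
  eload_data               = eload,
  temp_data                = temp,
  convert_to_data_interval = "Monthly",
  start_date               = "2023-01-01 00:00",
  end_date                 = "2023-12-31 23:59",
  shift_normal_weather     = TRUE  # default is FALSE: no shift is applied
)
```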
This update fixes an issue where 15-minute, hourly, or daily data aggregated to monthly intervals would not have the correct inclusive interval end date.
Added two new optional arguments, start_date and end_date, to trim the dataframe down to a specified date range. If no start and end dates are selected, they will be automatically generated using the latest start and earliest end of all input time series.
Previously, if no end date was provided as an argument, it would be automatically generated based on the earliest ending date of all time series. This created an issue for longer interval data when aggregated to monthly intervals. For instance, if eload has the earliest ending date at 5/15/23 and is in monthly intervals, then all other data including temp and additional independent variables would be cut to end at 5/15/23. This means for the final eload usage period (5/15/23-6/15/23) there would be only 1 day of temp and additional independent variable data to aggregate. This change addresses that issue by extending the end date by the eload interval when appropriate.
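A sketch of how the default date range might be derived under this change; the function and column names are assumptions, not the exact aggregate.R code:

```r
# Hypothetical helper illustrating the trimming logic described above
derive_date_range <- function(eload, temp, start_date = NULL, end_date = NULL) {
  if (is.null(start_date)) {
    # Latest start across all input time series
    start_date <- max(min(eload$time), min(temp$time))
  }
  if (is.null(end_date)) {
    # Earliest end across all input time series ...
    end_date <- min(max(eload$time), max(temp$time))
    # ... extended by one eload interval when eload ends earliest, so the
    # final usage period still has a full window of temp data to aggregate
    if (max(eload$time) == end_date) {
      end_date <- end_date + median(diff(eload$time))
    }
  }
  list(start_date = start_date, end_date = end_date)
}
```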
Previous development of monthly aggregation led to overly complicated logic and nested if blocks, each with its own aggregation procedure. This update significantly simplifies that structure by using the overlapping-join method for all monthly data aggregation. This way, only a single set of aggregation actions has to be performed, while the if blocks handle the construction of the intervals to aggregate to. This should improve readability and make the code more maintainable. An important feature that comes from the continued development of start- and end-date argument support is that, if you use eload data that is originally in monthly or longer increments, you can now specify when the final usage period ends by using the end_date argument. This ensures that temperature and additional independent variables are correctly aggregated to the final usage period. If end_date is not provided, the function will attempt to determine when the final usage period ends based on the median interval of all preceding usage periods.
This update organizes and expands error trapping for create_dataframe.R input arguments. The main addition is error trapping for mismatched time zones: if dataframes with differing time zones are supplied to the function, it will throw an error and provide context with the relevant time zone for each dataframe.
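A minimal sketch of such a check, assuming POSIXct time columns; the function and column names are illustrative, not the exact create_dataframe.R implementation:

```r
# Hypothetical time-zone consistency check
check_timezones <- function(eload, temp) {
  tz_eload <- attr(eload$time, "tzone")
  tz_temp  <- attr(temp$time,  "tzone")
  if (!identical(tz_eload, tz_temp)) {
    stop("Time zones of input dataframes do not match: eload = '",
         tz_eload, "', temp = '", tz_temp, "'")
  }
  invisible(TRUE)
}
```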
More documentation provided in the preamble for the aggregate and create_dataframe functions, including parameter definitions and function descriptions.
…bles Huge performance enhancements to aggregate.R when aggregating large numbers of additional independent variables. Thanks to deschen on Stack Overflow for [this suggestion](https://stackoverflow.com/a/76725725/22254350), which was modified and implemented in this function under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode). Previously, additional independent variables were aggregated using dplyr's summarize_at function to apply all aggregation functions to all columns (effectively squaring the number of columns) and then subsetting to return only the important column-function combinations. Now, additional independent variables are aggregated by wrapping the aggregation procedure in a call to map2 from the purrr package, iterating column by column and applying only one function per column on each pass. This massively reduces memory use and computation time for data sets with a large number of additional independent variables.
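A sketch of the column-by-column pattern described above, adapted loosely from the linked answer; the names are illustrative and the actual aggregate.R code may differ:

```r
library(dplyr)
library(purrr)

# Aggregate each additional independent variable with its own function,
# one column at a time, instead of applying every function to every column
aggregate_additional_vars <- function(data, var_names, agg_funs, group_col) {
  map2(var_names, agg_funs, function(var, fun) {
    data %>%
      group_by(.data[[group_col]]) %>%
      summarise("{var}" := fun(.data[[var]]), .groups = "drop")
  }) %>%
    reduce(left_join, by = group_col)
}

# Example: mean occupancy and summed production per aggregation interval
# result <- aggregate_additional_vars(df, c("occupancy", "production"),
#                                     list(mean, sum), "interval_start")
```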
increment version number
Update author list
The `aggregate` and `create_dataframe` functions have been re-written to accomplish the main tasks described above. Part of the procedure-unification process involved stripping the primary processing code out of the `create_dataframe` function and instead calling the `aggregate` function to serve as the processing engine. `create_dataframe` now acts as a user-friendly front-end function that checks and error-traps inputs. `aggregate` can also be called directly by nmecr users, but it is intended more as a procedure to be used within other functions and analysis scripts where the developer is certain that arguments are formatted correctly.