standard format(s) for a collection of predictions (forecasts or projections) #33

elray1 · 2022-12-16T21:57:12Z

The goal of this issue is to arrive at a definition of a standard format for representing a collection of predictions when they are loaded into a working environment such as an R or python session. Discussion here will focus on aspects of the general data format that are agnostic to the programming environment (i.e., they are relevant to multiple languages, e.g. R, python, ...). There are also some questions that are specific to the R implementation in the hubUtils package -- I've filed a separate issue (#32) to discuss R-specific points.

As a first pass at a proposal, I suggest that predictions could be stored in a long format data frame that has the same columns as in the model output submission files, augmented with the team and model names as discussed in this issue. In more detail, here is a first take on the columns expected in a dataframe-like object returned by a hubUtils::load_predictions function call:

team_abbr <string>: The abbreviation of the team name that produced the predictions
model_abbr <string>: The abbreviation of the model that produced the predictions
One column for each task_id variable in all model output files that were loaded. The data type is <string> (unless the data type is otherwise specified somehow in the hub's tasks.json file? @annakrystalli do you have insights? If we try to get fancy here, we might have to worry about edge cases where the same-named task id variable is specified to have a different data type in different rounds...? But it would clearly be preferable to be able to convert date fields to dates immediately...).
output_type <string>: The output type
output_type_id <string>: The output type id
value <numeric>: The value of the model prediction

If predictions are loaded from multiple rounds that have different task id variables, any task id column that is absent from a round is expected to be filled in with NA for that round's rows.
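To make the proposal concrete, here is a minimal sketch of the long format and the NA fill in plain Python, with a list of dicts standing in for a data frame. The task id names (origin_date, horizon, location), the team and model names, and the values are made up for illustration; combine_rounds is a hypothetical helper, not part of hubUtils, and None plays the role of NA.

```python
def combine_rounds(*rounds):
    """Stack rows from several rounds, filling absent columns with None."""
    # Collect every column seen in any round, preserving first-seen order.
    all_cols = []
    for rows in rounds:
        for row in rows:
            for col in row:
                if col not in all_cols:
                    all_cols.append(col)
    # Re-emit each row with the full column set; missing entries become None.
    return [
        {col: row.get(col) for col in all_cols}
        for rows in rounds
        for row in rows
    ]


# Round 1 has two task id variables (illustrative names and values).
round_1 = [
    {"team_abbr": "teamA", "model_abbr": "baseline",
     "origin_date": "2022-12-12", "horizon": "1",
     "output_type": "quantile", "output_type_id": "0.5", "value": 12.0},
]
# Round 2 adds a third task id variable, location.
round_2 = [
    {"team_abbr": "teamA", "model_abbr": "baseline",
     "origin_date": "2022-12-19", "horizon": "1", "location": "US",
     "output_type": "quantile", "output_type_id": "0.5", "value": 13.5},
]

predictions = combine_rounds(round_1, round_2)
```

After the call, both rows carry the same columns, and the round 1 row has None in the location column.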

Here are some alternative ideas that have come up in various discussions:

The predx package made use of roughly similar tidy data frames, but used objects specific to each forecast output type to represent forecasts across multiple type id values. With this representation, there would be one row for each combination of values for the team, model, task id variables, and output type. To put it another way: we would not have multiple rows for quantile forecasts by the same team/model for the same target at different quantile probability levels, but if a team/model submitted both quantile and point forecasts, there would be one row for the quantile forecasts and a second row for the point forecasts. For example, see discussion in this vignette.
Perhaps in combination with the above, it has been suggested that we could apply a pivot_wider type action to one or more of these columns. In particular, if we used a separate column for each possible output_type, then there would be only one row for each combination of team_abbr, model_abbr, and the task id variables. Thus, each row would correspond to forecasts (possibly of multiple types) for one forecasting problem and one team.
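As a rough illustration of the two alternatives combined, the sketch below collapses long-format rows into one row per team, model and task id combination, with one column per output type. Each of those columns holds a plain dict from output type id to value, which is only a stand-in for the type-specific objects predx uses. pivot_output_types is a hypothetical helper, and the column names, the "point" output type with a None id, and the values are assumptions for illustration.

```python
from collections import defaultdict


def pivot_output_types(rows, key_cols):
    """Return one row per key, with one column per output type."""
    wide = defaultdict(dict)
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        # Each output type column maps output_type_id -> value.
        by_id = wide[key].setdefault(row["output_type"], {})
        by_id[row["output_type_id"]] = row["value"]
    return [dict(zip(key_cols, key), **cols) for key, cols in wide.items()]


key_cols = ["team_abbr", "model_abbr", "origin_date", "horizon"]
base = {"team_abbr": "teamA", "model_abbr": "baseline",
        "origin_date": "2022-12-12", "horizon": "1"}
long_rows = [
    {**base, "output_type": "quantile", "output_type_id": "0.25", "value": 10.0},
    {**base, "output_type": "quantile", "output_type_id": "0.75", "value": 20.0},
    {**base, "output_type": "point", "output_type_id": None, "value": 15.0},
]

wide_rows = pivot_output_types(long_rows, key_cols)
```

Three long rows become a single wide row with a quantile column and a point column. Going back to the long format requires unpacking those nested values, which is part of why the long format is easier to carry across languages.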

Although I see value in these alternatives, I would prefer not to use them as the foundational/core data format, for two (closely related) reasons:

A long data format is familiar and easily transportable across languages, and will be easy to work with in the common tools of any language
Implementing functionality along the lines of predx would take more work; for R, we could build on the existing functionality in that package, but we'd have to start from scratch in other languages

elray1 · 2023-06-06T02:51:53Z

It may be helpful to add some attributes to this data frame recording relevant metadata:

task_id_cols or task_ids: character vector of columns in the data frame containing task id variables
output_type_col or output_type: string naming column of the data frame containing the output type. For modern hubs, this would be "output_type", but having this information accessible via metadata would help with backward compatibility
output_type_id_col or output_type_id: string naming column of the data frame containing the output type id. For modern hubs this would be "output_type_id".
value_col or value: string naming column of the data frame containing the value. For modern and historical hubs this would be "value".

We could eliminate the last three of these by assuming the "modern" values of those column names.

My motivation for suggesting this is that it seems like this is information that is known about the data once it is loaded in. It makes sense to track it as attributes of the data that are closely associated with the data so that these pieces of metadata can be easily accessed in downstream functionality. For example, our current draft of initial hubEnsembles functionality takes all of these things as arguments that the user would need to specify (albeit with reasonable/common defaults), but in reality there should be no need for a user to track and specify these things -- they are a part of the hub metadata, and could easily be tracked without any need for extra effort by the user.

In R, we might define this as an S3 class (named something like hub_df or hub_prediction_df or hub_pred_df?), or in python as a class. In either case, a constructor would accept the data frame as well as these attributes.
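For the python side, a constructor along these lines might look like the sketch below. The class name HubPredictions and its validation rule are assumptions made for illustration, not an agreed design; the rows are a list of dicts standing in for a data frame, and the legacy column names "type" and "type_id" in the usage example are likewise illustrative.

```python
class HubPredictions:
    """Prediction rows plus metadata naming their columns (hypothetical)."""

    def __init__(self, rows, task_id_cols,
                 output_type_col="output_type",
                 output_type_id_col="output_type_id",
                 value_col="value"):
        self.rows = list(rows)
        self.task_id_cols = list(task_id_cols)
        self.output_type_col = output_type_col
        self.output_type_id_col = output_type_id_col
        self.value_col = value_col
        self._validate()

    def _validate(self):
        # Every column named in the metadata must exist in every row.
        required = [*self.task_id_cols, self.output_type_col,
                    self.output_type_id_col, self.value_col]
        for row in self.rows:
            missing = [col for col in required if col not in row]
            if missing:
                raise ValueError(f"row is missing columns: {missing}")


# An older hub with different column names records them once, at load time.
preds = HubPredictions(
    rows=[{"origin_date": "2023-06-05", "horizon": "1",
           "type": "quantile", "type_id": "0.5", "value": 7.0}],
    task_id_cols=["origin_date", "horizon"],
    output_type_col="type",
    output_type_id_col="type_id",
)
```

Downstream functions could then read preds.output_type_col and preds.task_id_cols instead of asking the user to pass them again.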

elray1 · 2023-06-06T08:28:20Z

If we adopted the proposal I made just above, we would probably have to provide methods for operations like rbind and bind_rows that validate and update the metadata of the input data frames, e.g. concatenating and taking the unique task_ids and checking that the other columns have the same names.
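A sketch of what such a combining method would need to do, using plain dicts for the objects so the example is self-contained. bind_predictions is a hypothetical name, the metadata keys mirror the attributes proposed earlier in this thread, and filling absent task id columns with None is an assumption carried over from the NA rule in the original proposal.

```python
NAME_FIELDS = ("output_type_col", "output_type_id_col", "value_col")


def bind_predictions(a, b):
    """Row-bind two prediction objects and reconcile their metadata."""
    # The non-task-id columns must be named identically in both inputs.
    for field in NAME_FIELDS:
        if a[field] != b[field]:
            raise ValueError(f"{field} differs: {a[field]!r} vs {b[field]!r}")
    # Union of task id columns, keeping first-seen order.
    task_id_cols = list(dict.fromkeys(a["task_id_cols"] + b["task_id_cols"]))
    # Task id columns absent from a row are filled with None.
    rows = [{**dict.fromkeys(task_id_cols), **row}
            for row in a["rows"] + b["rows"]]
    return {**a, "rows": rows, "task_id_cols": task_id_cols}


meta = {"output_type_col": "output_type",
        "output_type_id_col": "output_type_id",
        "value_col": "value"}
first = {**meta, "task_id_cols": ["origin_date", "horizon"], "rows": [
    {"origin_date": "2023-06-05", "horizon": "1",
     "output_type": "quantile", "output_type_id": "0.5", "value": 7.0}]}
second = {**meta, "task_id_cols": ["origin_date", "location"], "rows": [
    {"origin_date": "2023-06-12", "location": "US",
     "output_type": "quantile", "output_type_id": "0.5", "value": 9.0}]}

combined = bind_predictions(first, second)
```

The combined object lists origin_date, horizon and location as task id columns, and each input row gains a None for the task id column it lacked.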

Are there any other operations on hub_df objects that we'd have to be careful of?

annakrystalli · 2023-06-06T16:04:38Z

Hey @elray1,

Your suggestion is certainly possible, although I can foresee some edge cases. For example, someone might rename a column using dplyr::rename or mutate in order to harmonise it with an equivalent column in another forecast object, so that the two can eventually be combined with rbind. In this situation, the metadata would also need to be updated in the dplyr::rename step. Given this, providing enough methods to ensure that all attributes are appropriately updated would likely be too difficult/time consuming.
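The edge case can be reproduced in a few lines. In this sketch a plain dict stands in for the data frame with attributes, rename_column is a hypothetical stand-in for a generic tool such as dplyr::rename, and the column names are made up.

```python
def rename_column(rows, old, new):
    """Rename a column in every row, as a generic data frame tool would."""
    return [
        {(new if col == old else col): val for col, val in row.items()}
        for row in rows
    ]


preds = {
    "task_id_cols": ["target_date"],
    "rows": [{"target_date": "2023-06-10", "output_type": "quantile",
              "output_type_id": "0.5", "value": 4.2}],
}

# The generic rename only touches the rows; it knows nothing about metadata.
preds["rows"] = rename_column(preds["rows"], "target_date", "forecast_date")

# The metadata now names a task id column that no longer exists.
stale = [col for col in preds["task_id_cols"] if col not in preds["rows"][0]]
```

Here stale ends up containing target_date: every operation that can rename, drop or add columns would need its own metadata-aware method to prevent this.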

After in-person discussions, @elray1 and I arrived at the following.

Desirables:

Any columns going in need to come out of a predict function
Any added columns (i.e. not original task ids) must not break the group_by functionality

`as_model_out_df()` function

Function that will take hub model-output data extracted through a hub connection query (with potentially additional information to aid in ensembling or forecasting appended to it by users) and convert to a standardised model_out_df S3 class object, ready to be input to ensemble() or plot() methods. Alternative names considered: as_prediction_df(), as_hub_df().

df: a data frame returned from a hub_connection query.
task_id_cols = default NULL. Character vector used to define pure task id columns. Overrides any metadata contained in hub_con if supplied.
output_type_col = default NULL or "output_type"? Name of column in df that contains output_type information.
output_type_id_col = default NULL or "output_type_id"? Name of column in df that contains output_type_id information.
value_col = default NULL or "value". Name of column in df that contains value information.
hub_con = NULL
trim_to_config = FALSE

Functionality

Rename any column names with supplied arguments if necessary.
Split single team_model_abbr column into model_abbr and team_abbr if necessary (i.e. for an old-style hub structure where the team and model label is in a single directory). This is related to, and takes care of, add standardize_column_names() function #63. (Actually, we may be backpedalling on this!)
Check that any additional columns have not erroneously introduced additional groupings, by comparing the number of groups generated by a group_by (or even just unique()) call on all df columns apart from the value column to the number of groups generated by the same call on columns excluding any extra columns contained in the supplied df. These extra columns are determined as the set difference between all df column names (apart from value) and the set model_abbr, team_abbr, output_type, output_type_id + task ids. Task ids are determined either from the config_tasks attribute of the hub_con object or from the task_id_cols vector (if not NULL). task_id_cols overrides information in the hub_con object if supplied. If using task_id_cols for this check, issue a message.
trim_to_config will trim task_id columns to only those in the config_tasks attribute of the hub_con object (? or task_id_cols vector if provided?)
Many of these checks will be encompassed in a validate_hub_prediction() which can be used by other functions also. This addresses the questions in consider refactoring basic validations of forecast data #64.
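The grouping check described above can be sketched as follows, in plain Python with a list of dicts standing in for df. The helper names, the extra columns (population, scenario) and all values are hypothetical; the real check would live in the R function and might differ in detail.

```python
def extra_columns(columns, task_id_cols, value_col="value"):
    """Columns that are neither standard columns, task ids, nor the value."""
    standard = {"model_abbr", "team_abbr", "output_type", "output_type_id",
                *task_id_cols}
    return [c for c in columns if c != value_col and c not in standard]


def count_groups(rows, cols):
    """Number of distinct value combinations across the given columns."""
    return len({tuple(row[c] for c in cols) for row in rows})


def extras_add_groupings(rows, task_id_cols, value_col="value"):
    """True if the extra columns split rows into more groups than before."""
    columns = list(rows[0])
    extras = extra_columns(columns, task_id_cols, value_col)
    all_cols = [c for c in columns if c != value_col]
    core_cols = [c for c in all_cols if c not in extras]
    return count_groups(rows, all_cols) > count_groups(rows, core_cols)


base = {"team_abbr": "teamA", "model_abbr": "baseline",
        "output_type": "quantile", "output_type_id": "0.5"}
# population is constant within each location, so it adds no groups.
rows_ok = [
    {**base, "location": "US", "population": 330, "value": 1.0},
    {**base, "location": "CA", "population": 38, "value": 2.0},
]
# scenario varies within one location, so it splits an existing group.
rows_bad = [
    {**base, "location": "US", "scenario": "a", "value": 1.0},
    {**base, "location": "US", "scenario": "b", "value": 2.0},
]

ok_result = extras_add_groupings(rows_ok, task_id_cols=["location"])
bad_result = extras_add_groupings(rows_bad, task_id_cols=["location"])
```

The first call returns False and the second returns True, which is the case the function should flag.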

elray1 mentioned this issue Jun 6, 2023

consider refactoring basic validations of forecast data #64

Closed

annakrystalli mentioned this issue Jun 6, 2023

Create as_model_out_df() function and model_out_df S3 class #66

Closed

nickreich mentioned this issue Jun 6, 2023

create plot_step_ahead_forecasts() hubverse-org/hubVis#1

Closed

This was referenced Jun 7, 2023

Change mention of type to output_type and type_id to output_type_id in all examples and documentation #71

Closed

Simple ensemble hubverse-org/hubEnsembles#4

Merged

annakrystalli mentioned this issue Jun 14, 2023

object class to represent a collection of forecasts #32

Closed

nickreich assigned annakrystalli Jun 21, 2023

annakrystalli added a commit that referenced this issue Jun 26, 2023

add as_model_out_tbl and merge_* split_model_id functions. Resolves #32…

5e70863

…, #33, #63, #63, #66)

annakrystalli mentioned this issue Jun 26, 2023

Add as_model_out_tbl function #83

Merged

4 tasks

annakrystalli linked a pull request Jun 26, 2023 that will close this issue

Add as_model_out_tbl function #83

Merged

4 tasks

annakrystalli closed this as completed in #83 Jul 5, 2023