standard format(s) for a collection of predictions (forecasts or projections) #33

Closed · Tracked by #83 · Fixed by #83
elray1 opened this issue Dec 16, 2022 · 3 comments

elray1 (Contributor) commented Dec 16, 2022

The goal of this issue is to arrive at a definition of a standard format for representing a collection of predictions when they are loaded into a working environment such as an R or python session. Discussion here will focus on aspects of the general data format that are agnostic to the programming environment (i.e., relevant to multiple languages, e.g. R, python, ...). There are also some questions specific to the R implementation in the hubUtils package -- I've filed a separate issue (#32) to discuss R-specific points.

As a first pass at a proposal, I suggest that predictions could be stored in a long format data frame that has the same columns as in the model output submission files, augmented with the team and model names as discussed in this issue. In more detail, here is a first take at the expected columns of the data-frame-like object returned by a hubUtils::load_predictions function call:

  • team_abbr <string>: The abbreviation of the team name that produced the predictions
  • model_abbr <string>: The abbreviation of the model that produced the predictions
  • One column for each task_id variable in all model output files that were loaded. The data type is a <string> (unless the data type is otherwise specified somehow in the hub's tasks.json file? @annakrystalli do you have insights? If we try to get fancy here, we might have to worry about edge cases where the same-named task id variable is specified to have a different data type in different rounds...? But it would clearly be preferable to be able to immediately convert date fields to dates...).
  • output_type <string>: The output type
  • output_type_id <string>: The output type id
  • value <numeric>: The value of the model prediction

If predictions are loaded from multiple rounds that have different task id variables, any task id columns that are not present for a given round are expected to be filled in with NA.
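For concreteness, here is a minimal sketch of what this long format might look like in R; the task id column names (origin_date, horizon) and all values are hypothetical:

```r
library(tibble)

# Hypothetical example of the proposed long format; task id columns
# (origin_date, horizon) are stored as strings, per the proposal above
predictions <- tribble(
  ~team_abbr, ~model_abbr, ~origin_date, ~horizon, ~output_type, ~output_type_id, ~value,
  "hub",      "baseline",  "2022-12-12", "1",      "quantile",   "0.5",           120,
  "hub",      "baseline",  "2022-12-12", "1",      "quantile",   "0.9",           180,
  "hub",      "ensemble",  "2022-12-12", "1",      "point",      NA_character_,   125
)
```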

Here are some alternative ideas that have come up in various discussions:

  • The predx package made use of roughly similar tidy data frames, but used objects specific to each forecast output type to represent forecasts across multiple type id values. With this representation, there would be one row for each combination of values for the team, model, task id variables, and output type. For example, we would not have multiple rows for quantile forecasts by the same team/model for the same target at different quantile probability levels, but if a team/model submitted both quantile and point forecasts, there would be one row for the quantile forecasts and a second row for the point forecasts. See the discussion in this vignette.
  • Perhaps in combination with the above, it has been suggested that we could apply a pivot_wider-type action to one or more of these columns. In particular, if we used a separate column for each possible output_type, then there would be only one row for each combination of team_abbr, model_abbr, and the task id variables. Thus, each row would correspond to forecasts (possibly of multiple types) for one forecasting problem and one team. (A sketch of this follows the list.)
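For illustration, here is a hedged sketch of the pivot_wider idea using tidyr, with hypothetical columns and data; each output type's ids and values end up in list-columns:

```r
library(tidyr)
library(tibble)

preds <- tribble(
  ~model_abbr, ~target,    ~output_type, ~output_type_id, ~value,
  "baseline",  "inc hosp", "quantile",   "0.5",           10,
  "baseline",  "inc hosp", "quantile",   "0.9",           20,
  "baseline",  "inc hosp", "point",      NA_character_,   11
)

# One row per model/target; each output type's ids and values become
# list-columns (e.g. value_quantile, value_point)
wide <- pivot_wider(
  preds,
  names_from  = output_type,
  values_from = c(output_type_id, value),
  values_fn   = list
)
```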

Although I see value in these alternatives, I would not prefer to use them as the foundational/core data format, for two (closely related) reasons:

  1. A long data format is familiar and easily transportable across languages, and will be easy to work with in the common tools of any language
  2. Implementing functionality along the lines of predx would take more work; for R, we could build on the existing functionality in that package, but we'd have to start from scratch in other languages
elray1 (Contributor, Author) commented Jun 6, 2023

It may be helpful to add some attributes to this data frame recording relevant metadata:

  • task_id_cols or task_ids: character vector of columns in the data frame containing task id variables
  • output_type_col or output_type: string naming column of the data frame containing the output type. For modern hubs, this would be "output_type", but having this information accessible via metadata would help with backward compatibility
  • output_type_id_col or output_type_id: string naming column of the data frame containing the output type id. For modern hubs this would be "output_type_id".
  • value_col or value: string naming column of the data frame containing the value. For modern and historical hubs this would be "value".

We could eliminate the last three of these by assuming the "modern" values of those column names.

My motivation for suggesting this is that this information is known about the data once it is loaded. It makes sense to track it as attributes closely associated with the data, so that these pieces of metadata can be easily accessed in downstream functionality. For example, our current draft of initial hubEnsembles functionality takes all of these things as arguments that the user would need to specify (albeit with reasonable/common defaults), but in reality there should be no need for a user to track and specify these things -- they are part of the hub metadata, and could easily be tracked without any extra effort by the user.

In R, we might define this as an S3 class (named something like hub_df or hub_prediction_df or hub_pred_df?), or in python as a class. In either case, a constructor would accept the data frame as well as these attributes.
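As a rough sketch of what such a constructor could look like in R (the name new_hub_df and all details here are placeholders, not a settled API):

```r
# Hypothetical constructor for the proposed S3 class; validates that the
# named columns exist, then attaches the metadata as attributes
new_hub_df <- function(df,
                       task_id_cols,
                       output_type_col = "output_type",
                       output_type_id_col = "output_type_id",
                       value_col = "value") {
  needed <- c(task_id_cols, output_type_col, output_type_id_col, value_col)
  stopifnot(is.data.frame(df), all(needed %in% names(df)))
  structure(
    df,
    task_id_cols = task_id_cols,
    output_type_col = output_type_col,
    output_type_id_col = output_type_id_col,
    value_col = value_col,
    class = c("hub_df", class(df))
  )
}
```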

elray1 (Contributor, Author) commented Jun 6, 2023

If we did the proposal I made just above, we would probably have to provide methods for operations like rbind and bind_rows that validate and update the metadata of the input data frames, e.g. concatenating and taking the unique task_ids and checking that the other columns have the same names.
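A hedged sketch of what such an rbind method might look like, building on the hypothetical new_hub_df() constructor above:

```r
# Hypothetical method sketch: validate column compatibility, combine,
# and merge the task_id metadata of the inputs
rbind.hub_df <- function(...) {
  dfs <- list(...)
  col_sets <- lapply(dfs, names)
  same <- vapply(col_sets, identical, logical(1), y = col_sets[[1]])
  if (!all(same)) {
    stop("All hub_df inputs must have identical column names.")
  }
  # strip the class so base rbind for data frames is used
  plain <- lapply(dfs, function(d) { class(d) <- "data.frame"; d })
  out <- do.call(rbind, plain)
  task_ids <- unique(unlist(lapply(dfs, attr, "task_id_cols")))
  new_hub_df(out, task_id_cols = task_ids)
}
```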

Are there any other operations on hub_df objects that we'd have to be careful of?

annakrystalli (Member) commented Jun 6, 2023

Hey @elray1,

Your suggestion is certainly possible, although I can foresee some edge cases, e.g. someone renames a column (for example using dplyr::rename or mutate) to harmonise a column name with an equivalent column in another forecast object, in order to eventually combine the two with rbind. In this situation, the metadata would also need to be updated in the dplyr::rename step. Given this, providing enough methods to ensure any attributes are appropriately updated would likely be too difficult/time consuming.
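To illustrate the concern, a minimal hypothetical example using the new_hub_df() sketch above; it assumes dplyr preserves unknown data frame attributes when reconstructing the result (typical, but not guaranteed across verbs and versions):

```r
library(dplyr)

df <- new_hub_df(
  data.frame(origin_date = "2022-12-12", output_type = "point",
             output_type_id = NA_character_, value = 1),
  task_id_cols = "origin_date"
)
df2 <- rename(df, type = output_type)
attr(df2, "output_type_col")
#> [1] "output_type"  # metadata now points at a column that no longer exists
```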

After in-person discussions between @elray1 and myself, we converged on the following.

Desirables:

  • Any columns going in need to come back out of a predict function
  • Any added columns (i.e. columns that are not original task ids) must not break the group_by functionality

as_model_out_df() function

Function that will take hub model-output data extracted through a hub connection query (with potentially additional information appended to it by users to aid in ensembling or forecasting) and convert it to a standardised model_out_df S3 class object, ready to be input to ensemble() or plot() methods. Alternative names considered: as_prediction_df(), as_hub_df(). (A signature sketch follows the argument list.)

  • df: a data frame returned from a hub_connection query.
  • task_id_cols = default NULL. Character vector used to define pure task id columns. Overrides any metadata contained in hub_con if supplied.
  • output_type_col = default NULL or "output_type"? Name of the column in df that contains output_type information.
  • output_type_id_col = default NULL or "output_type_id"? Name of the column in df that contains output_type_id information.
  • value_col = default NULL or "value". Name of the column in df that contains value information.
  • hub_con = NULL
  • trim_to_config = FALSE
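A provisional signature sketch reflecting the arguments above (the body is elided apart from attaching metadata; nothing here is settled):

```r
# Provisional signature only; argument names/defaults as discussed above
as_model_out_df <- function(df,
                            task_id_cols = NULL,
                            output_type_col = NULL,
                            output_type_id_col = NULL,
                            value_col = NULL,
                            hub_con = NULL,
                            trim_to_config = FALSE) {
  # renaming, splitting, and grouping checks elided; see "Functionality" below
  structure(
    df,
    task_id_cols = task_id_cols,
    class = c("model_out_df", class(df))
  )
}
```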

Functionality

  • Rename columns using the supplied arguments if necessary.
  • Split a single team_model_abbr column into model_abbr and team_abbr if necessary (i.e. if an old-style hub structure where the team & model label is in a single directory). This is related to, and takes care of, "add standardize_column_names() function" #63! (Actually, we may be back-pedalling here!)
  • Check that any additional columns have not erroneously introduced additional groupings, by comparing the number of groups generated by a group_by (or even just unique()) call on all df columns apart from the value column to the number of groups generated by the same call on those columns excluding any extra columns contained in the supplied df. These extra columns are determined as the set difference between all df column names (apart from value) and the set model_abbr, team_abbr, output_type, output_type_id plus the task ids. Task ids are determined either from the config_tasks attribute of the hub_con object or from the task_id_cols vector (if not NULL); task_id_cols overrides the information in the hub_con object if supplied. If using task_id_cols for this check, issue a message. (A sketch of this check follows the list.)
  • trim_to_config will trim task_id columns to only those in the config_tasks attribute of the hub_con object (? or the task_id_cols vector if provided?)
  • Many of these checks will be encompassed in a validate_hub_prediction() function, which can be used by other functions as well. This addresses the questions in "consider refactoring basic validations of forecast data" #64.
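A rough sketch of the grouping check described above, with a hypothetical helper name; std_cols would be model_abbr, team_abbr, output_type, output_type_id plus the task ids:

```r
library(dplyr)

# Stop if columns beyond the standard set introduce additional groupings
check_extra_groupings <- function(df, std_cols) {
  non_value  <- setdiff(names(df), "value")
  extra_cols <- setdiff(non_value, std_cols)
  n_std <- nrow(distinct(df[intersect(non_value, std_cols)]))
  n_all <- nrow(distinct(df[non_value]))
  if (n_all != n_std) {
    stop("Extra columns introduce additional groupings: ",
         paste(extra_cols, collapse = ", "))
  }
  invisible(df)
}
```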
