standard format(s) for a collection of predictions (forecasts or projections) #33
Comments
It may be helpful to add some attributes to this data frame recording relevant metadata:
We could eliminate the last three of these by assuming the "modern" values of those column names. My motivation for suggesting this is that it seems like this is information that is known about the data once it is loaded in. It makes sense to track it as attributes of the data that are closely associated with the data so that these pieces of metadata can be easily accessed in downstream functionality. For example, our current draft of initial hubEnsembles functionality takes all of these things as arguments that the user would need to specify (albeit with reasonable/common defaults), but in reality there should be no need for a user to track and specify these things -- they are a part of the hub metadata, and could easily be tracked without any need for extra effort by the user. In R, we might define this as an S3 class (named something like
If we did the proposal I made just above, we would probably have to provide methods for operations like
Are there any other operations on
Hey @elray1, Your suggestion is certainly possible, although I can foresee some edge cases, e.g. someone renames a column for example using
After in person discussions, @elray1 and myself. Desirables:
The goal of this issue is to arrive at a definition for a standard format for representing a collection of predictions when they are loaded into a working environment such as an R or Python session. Discussion here will focus on aspects of the general data format that are agnostic to the programming environment (i.e., they are relevant to multiple languages, e.g. R, Python, ...). There are also some questions that are specific to the R implementation in the `hubUtils` package -- I've filed a separate issue (#32) to discuss R-specific points.

As a first pass at a proposal, I suggest that predictions could be stored in a long format data frame that has the same columns as in the model output submission files, augmented with the team and model names as discussed in this issue. In more detail, here is a take at the expected columns that will be in a dataframe-like object that is returned by a `hubUtils::load_predictions` function call:

- `team_abbr <string>`: The abbreviation of the team name that produced the predictions
- `model_abbr <string>`: The abbreviation of the model that produced the predictions
- A column for each `task_id` variable in all model output files that were loaded. The data type is a `<string>` (unless the data type is otherwise specified somehow in the hub's `tasks.json` file? @annakrystalli do you have insights? If we try to get fancy here, we might have to worry about edge cases where the same-named task id variable is specified to have a different data type in different rounds...? But it would clearly be preferable to be able to immediately convert date fields to dates...).
- `output_type <string>`: The output type
- `output_type_id <string>`: The output type id
- `value <numeric>`: The value of the model prediction

If predictions are loaded from multiple rounds that have different task id variables, it is expected that any values in task id columns that are missing for a given round will be filled in with an NA.
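To make the proposed layout concrete, here is a minimal Python sketch of stacking two rounds of long-format predictions whose task id variables differ. Everything in it is an illustrative assumption rather than part of `hubUtils`: the helper `stack_rounds`, the team and model names, and the task id names `origin_date`, `location` and `horizon` are all hypothetical, rows are plain dicts instead of a real data frame, and `None` stands in for NA.

```python
from typing import Any

# Fixed columns from the proposal; task id columns vary by hub and round.
LEADING_COLUMNS = ["team_abbr", "model_abbr"]
TRAILING_COLUMNS = ["output_type", "output_type_id", "value"]


def stack_rounds(*rounds: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Stack long-format rows; task id columns absent from a round become None."""
    fixed = set(LEADING_COLUMNS + TRAILING_COLUMNS)
    # Collect task id columns in first-seen order across all rounds.
    task_id_columns: list[str] = []
    for rows in rounds:
        for row in rows:
            for name in row:
                if name not in fixed and name not in task_id_columns:
                    task_id_columns.append(name)
    columns = LEADING_COLUMNS + task_id_columns + TRAILING_COLUMNS
    # dict.get returns None for a column that the row's round did not define.
    return [
        {name: row.get(name) for name in columns}
        for rows in rounds
        for row in rows
    ]


# Hypothetical data: round 1 has task ids origin_date and location;
# round 2 adds a horizon task id. Task id values are kept as strings.
round_1 = [
    {"team_abbr": "teamA", "model_abbr": "baseline",
     "origin_date": "2023-01-02", "location": "US",
     "output_type": "quantile", "output_type_id": "0.5", "value": 120.0},
]
round_2 = [
    {"team_abbr": "teamA", "model_abbr": "baseline",
     "origin_date": "2023-01-09", "location": "US", "horizon": "1",
     "output_type": "mean", "output_type_id": None, "value": 131.5},
]

predictions = stack_rounds(round_1, round_2)
```

Every stacked row carries the same columns in the same order, so the round 1 row gains a `horizon` entry set to `None`. A data frame library would normally do this filling when binding rows; the sketch only spells out the expected result.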
Here are some alternative ideas that have come up in various discussions:

- Apply a `pivot_wider` type action to one or more of these columns. In particular, if we used a separate column for each possible `output_type`, then there would be only one row for each combination of `team_abbr`, `model_abbr`, and the task id variables. Thus, each row would correspond to forecasts (possibly of multiple types) for one forecasting problem and one team.

Although I see value in these alternatives, I would prefer not to use them as the foundational/core data format, for two (closely related) reasons: