Introduce "mode" output_type #40
Comments
Response by @elray1 Maybe another option could be to allow submitters to include only the rows that are modes. I mentioned this to Nick earlier, but I remember us discussing something similar at some point in the past on a call where we were talking about dates. I can't remember the context clearly enough to think of what to look up, but there may be discussion somewhere in a GitHub issue on the schemas or hubDocs repo?
Response by @nickreich Those suggestions make sense to me, so maybe something like
|
Comment by @nickreich |
In general I support the introduction of "mode". I do feel, however, that the changes required to accommodate categorical variables, whether forcing value to be a character column or mapping integers to categories (as suggested in #39), might be more effort than they are worth. I just wanted to point out that it's really easy to get the mode(s) from a PMF accurately through a simple hub_connection query. See pseudo-example below:
set.seed(1)
# pseudo-hub connection to data
hub_connection <- tibble::tibble(
  output_type = "pmf",
  type_id = as.character(1:10),
  value = as.vector(rmultinom(1, 100, runif(10)) / 100)
)
hub_connection
#> # A tibble: 10 × 3
#> output_type type_id value
#> <chr> <chr> <dbl>
#> 1 pmf 1 0.03
#> 2 pmf 2 0.05
#> 3 pmf 3 0.12
#> 4 pmf 4 0.16
#> 5 pmf 5 0.05
#> 6 pmf 6 0.16
#> 7 pmf 7 0.2
#> 8 pmf 8 0.17
#> 9 pmf 9 0.06
#> 10 pmf 10 0
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
hub_connection %>%
  filter(output_type == "pmf",
         value == max(value))
#> # A tibble: 1 × 3
#> output_type type_id value
#> <chr> <chr> <dbl>
#> 1 pmf 7 0.2
Created on 2023-04-10 with reprex v2.0.2
On the other hand, getting the accurate
I'm basically on board with the idea that it's "not worth the effort" at this time. If we were more focused on non-probabilistic forecasts, then I might lobby harder for it. But so much of what we do has a probabilistic slant, and, as @annakrystalli points out, you can obtain a mode (which we would usually only want for a categorical outcome) from the natural probabilistic encoding for categorical variables, so I feel that this is less important for now.
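As a rough sketch of how the single-PMF query above might extend to several forecasts at once, the example below groups by a task column before taking the maximum. The `location` column, the category labels, and all values are made up for illustration and are not part of any hub schema discussed here.

```r
library(dplyr)

# Hypothetical PMF forecasts for two locations; the `location` column and
# every value here are assumptions made for illustration only.
pmf_forecasts <- tibble::tibble(
  location    = rep(c("A", "B"), each = 3),
  output_type = "pmf",
  type_id     = rep(c("low", "moderate", "high"), times = 2),
  value       = c(0.2, 0.5, 0.3,  # location A: single mode ("moderate")
                  0.4, 0.4, 0.2)  # location B: tie between "low" and "moderate"
)

# Keep the row(s) with the largest probability within each location.
# Ties are kept, so a location can contribute more than one row.
modes <- pmf_forecasts %>%
  filter(output_type == "pmf") %>%
  group_by(location) %>%
  filter(value == max(value)) %>%
  ungroup()

modes
```

Location B returns two rows because its PMF has two modes. Exact equality can miss ties when probabilities are computed rather than typed in; comparing with a tolerance (for example with `dplyr::near()`) is one option in that case.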
Opening this issue to move discussions on this topic to the repo.
From slack:
@nickreich :
[5 days ago]
How would people feel about adding an output_type of "mode" to the other existing types? This came up today in a conversation with @annakrystalli as it seems like a possibly natural form of a point estimate for a categorical target. E.g. a "mean" or "median" wouldn’t make sense. I will note that the mode could be extracted from the representation of a probability mass function for a categorical outcome, but that would require a probabilistic forecast. If you like the idea, please just add a ✅. If you have questions or comments or objections, please add a note here. Thanks!
One comment on this after discussing briefly with Evan is that the tabular data representation would maybe be kind of ugly, e.g. since we can only have numeric objects in the “value” column, maybe it would look something like this?
where the type-id is an array of the possible values of the categorical variable and the array in value would be indicating which value(s) are the mode? Or maybe this would need to be spread over two rows, to keep value purely numeric?
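One possible reading of the "spread over rows" variant, sketched under assumptions: each category gets its own row, and value is a 0/1 indicator of whether that category is a mode, which keeps value numeric. The "mode" output_type is only the proposal under discussion, and the category labels and probabilities below are made up.

```r
library(dplyr)

# Hypothetical PMF for one categorical target; labels and values are made up.
pmf <- tibble::tibble(
  output_type = "pmf",
  type_id     = c("low", "moderate", "high"),
  value       = c(0.2, 0.5, 0.3)
)

# One row per category: value is 1 for the mode(s) and 0 otherwise,
# so the value column stays numeric.
mode_rows <- pmf %>%
  mutate(
    output_type = "mode",
    value       = as.numeric(value == max(value))
  )

mode_rows
```

If submitters were allowed to include only the rows that are modes, the same table could be reduced with `filter(value == 1)`.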