Introduce "mode" output_type #40

Open
annakrystalli opened this issue Apr 10, 2023 · 5 comments
Labels
enhancement New feature or request on-hold

Comments

@annakrystalli
Member

Opening this issue to move discussions on this topic to the repo.

From slack:

@nickreich :

[5 days ago]
How would people feel about adding an output_type of "mode" to the other existing types? This came up today in a conversation with @annakrystalli as it seems like a possibly natural form of a point estimate for a categorical target. E.g. a "mean" or "median" wouldn’t make sense. I will note that the mode could be extracted from the representation of a probability mass function for a categorical outcome, but that would require a probabilistic forecast. If you like the idea, please just add a ✅ . If you have questions or comments or objections, please add a note here. Thanks!

One comment on this after discussing briefly with Evan is that the tabular data representation would maybe be kind of ugly, e.g. since we can only have numeric objects in the “value” column, maybe it would look something like this?

output_type  type_id                   value
-----------  ------------------------  -------
"mode"       ["cat1", "cat2", "cat3"]  [0,1,1]

where the type_id is an array of the possible values of the categorical variable and the array in value indicates which value(s) are the mode? Or maybe this would need to be spread over two rows, to keep value purely numeric?

@annakrystalli
Member Author

Response by @elray1

Maybe another option could be to allow submitters to include only the rows that are modes. I mentioned this to Nick earlier, but I remember us discussing something similar at some point in the past on a call where we were talking about dates. I can't remember the context clearly enough to think of what to look up, but there may be discussion somewhere in a GitHub issue on the schemas or hubDocs repo?

@annakrystalli
Member Author

Response by @nickreich

Those suggestions make sense to me, so maybe something like

output_type  type_id  value
-----------  -------  -----
"mode"       "cat2"   1

or, in a multimodal case:

output_type  type_id  value
-----------  -------  -----
"mode"       "cat2"   1
"mode"       "cat3"   1

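As a rough illustration of the two encodings discussed so far, the following base R sketch converts the indicator-array form into the one-row-per-mode form. The category names and the indicator vector are the hypothetical values from the examples above, and the variable names `categories`, `is_mode` and `mode_rows` are made up for this sketch; none of this is an agreed hub schema.

```r
# Hypothetical categories and indicator vector from the first proposal
categories <- c("cat1", "cat2", "cat3")
is_mode <- c(0, 1, 1)

# Keep only the categories flagged as modes, one row per mode.
# "value" stays numeric, as required of the value column.
mode_rows <- data.frame(
  output_type = "mode",
  type_id = categories[is_mode == 1],
  value = 1,
  stringsAsFactors = FALSE
)

mode_rows
```

With this layout a multimodal forecast simply has more than one "mode" row, and the category label lives in type_id rather than in value.
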
@annakrystalli
Member Author

annakrystalli commented Apr 10, 2023

Comment by @nickreich
Feels like if we really wanted to support this, we'd then have to add some special handling for these cases. Maybe we file this as a feature request for the future for now? Include mode as a data type but basically not handle it for these cases yet?

@annakrystalli
Member Author

annakrystalli commented Apr 10, 2023

In general I support the introduction of mode as a valid statistical point parameter to submit.

I do feel, however, that the changes required to accommodate categorical variables, whether forcing value to be a character column or mapping integers to categories (as suggested in #39), might be more effort than they are worth.

I just wanted to point out that it's really easy to get the mode(s) from a PMF accurately through a simple hub_connection query. See pseudo-example below:

set.seed(1)
# pseudo hub connection to data
hub_connection <- tibble::tibble(
    output_type = "pmf",
    type_id = as.character(1:10),
    value = as.vector(rmultinom(1, 100, runif(10)) / 100)
)

hub_connection
#> # A tibble: 10 × 3
#>    output_type type_id value
#>    <chr>       <chr>   <dbl>
#>  1 pmf         1        0.03
#>  2 pmf         2        0.05
#>  3 pmf         3        0.12
#>  4 pmf         4        0.16
#>  5 pmf         5        0.05
#>  6 pmf         6        0.16
#>  7 pmf         7        0.2 
#>  8 pmf         8        0.17
#>  9 pmf         9        0.06
#> 10 pmf         10       0


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
hub_connection %>%
    filter(output_type == "pmf",
           value == max(value))
#> # A tibble: 1 × 3
#>   output_type type_id value
#>   <chr>       <chr>   <dbl>
#> 1 pmf         7         0.2

Created on 2023-04-10 with reprex v2.0.2
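
The same idea extends to several forecasts at once and to tied modes. The sketch below is a minimal base R version, assuming a hypothetical `model_id` column and made-up category labels and probabilities; it is not taken from any hub package.

```r
# Hypothetical pmf output for two models; model_b has a tie between "low" and "high"
pmf <- data.frame(
  model_id = rep(c("model_a", "model_b"), each = 3),
  output_type = "pmf",
  type_id = rep(c("low", "moderate", "high"), times = 2),
  value = c(0.2, 0.5, 0.3, 0.4, 0.2, 0.4),
  stringsAsFactors = FALSE
)

# Largest probability within each model, aligned with the rows of pmf
group_max <- ave(pmf$value, pmf$model_id, FUN = max)

# Keep every row that attains its model's maximum, so ties yield several modes
modes <- pmf[pmf$value == group_max, c("model_id", "type_id", "value")]

modes
```

Comparing with `==` is safe here because each group maximum is itself one of the stored values, so tied categories are all returned rather than an arbitrary one.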

On the other hand, accurately recovering the mean, mode or median of a continuous or discrete (count) distribution from quantiles or a cdf is not necessarily straightforward and depends on, e.g., which quantiles are reported (please correct me if I'm wrong!). So it might make sense to allow reporting a mode for such distributions, while for nominal/ordinal/binary variables it is probably not worth the effort, given how easily the mode can be obtained accurately from the pmf and the cost of accommodating its encoding.
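
A small sketch of that dependence, using base R only. The log-normal distribution, the two sets of quantile levels and the helper `interp_median` are all assumptions chosen for illustration, and linear interpolation is just one of several ways to approximate a median from reported quantiles.

```r
# True distribution (an assumption for illustration): log-normal, meanlog = 0, sdlog = 1
true_median <- qlnorm(0.5)  # the median of this distribution is 1

# Two hypothetical sets of reported quantile levels, neither of which includes 0.5
coarse_levels <- c(0.1, 0.9)
fine_levels <- c(0.1, 0.25, 0.4, 0.6, 0.75, 0.9)

# Approximate the median by linearly interpolating the reported quantiles at level 0.5
interp_median <- function(levels) {
  approx(x = levels, y = qlnorm(levels), xout = 0.5)$y
}

coarse_est <- interp_median(coarse_levels)
fine_est <- interp_median(fine_levels)

c(true = true_median, coarse = coarse_est, fine = fine_est)
```

Both estimates come from exact quantiles of the same distribution, yet the coarse set lands noticeably further from the true median than the fine set, because the distribution is skewed and the interpolation spans a much wider interval.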

@annakrystalli annakrystalli added the enhancement New feature or request label Apr 10, 2023
@nickreich
Contributor

I'm basically on board with the idea that it's "not worth the effort" at this time. If we were more focused on non-probabilistic forecasts, then I might lobby harder for it. But so much of what we do has a probabilistic slant, and, as @annakrystalli points out, you can obtain a mode (which we would usually only want for a categorical outcome) from the natural probabilistic encoding for categorical variables, so I feel that this is less important for now.

Projects
Status: On hold