"uninformative" prior #3
Thank you so much for this feedback @dhicks! 🙌 I re-read the paper to understand where my misunderstanding arose, and these are some great thoughts. I won't be able to work on this for the next several weeks, but we will get this fixed and accurate before a CRAN release.
I was working on my own version of the function to help me understand the paper and how it relates to this package, and I think it could be relevant here. I wanted to push the original function further to allow different levels of information in the priors, and I think I lucked into a rough version of the first two solutions proposed above. Note: I made some slight changes to variable names to stay consistent with the paper as I worked through it.

```r
library(dplyr)
library(rlang)

bind_log_odds <- function(tbl, set, feature, n, alpha = NULL) {
  set <- enquo(set)
  feature <- enquo(feature)
  y_col <- enquo(n)
  grouping <- group_vars(tbl)
  tbl <- ungroup(tbl)

  if (is.null(alpha)) {
    # Empirical-Bayes prior: each feature's pseudo-count is its total
    # count pooled across all sets
    n_df <- count(tbl, !!set, wt = !!y_col, name = ".n")
    alpha_df <- count(tbl, !!feature, wt = !!y_col, name = ".alpha")
    df_joined <- tbl %>%
      left_join(n_df, by = as_name(set)) %>%
      left_join(alpha_df, by = as_name(feature)) %>%
      mutate(.alpha0 = sum(!!y_col),
             y_other = .alpha - !!y_col,
             n_other = .alpha0 - .n,
             l1 = (!!y_col + .alpha) / (.n + .alpha0 - !!y_col - .alpha),
             l2 = (y_other + .alpha) / (n_other + .alpha0 - y_other - .alpha),
             sigma2 = 1 / (!!y_col + .alpha) + 1 / (y_other + .alpha),
             log_odds = (log(l1) - log(l2)) / sqrt(sigma2))
    tbl$log_odds <- df_joined$log_odds
  } else if (length(alpha) == 1) {
    # Flat ("uninformative") prior: the same pseudo-count for every feature
    n_df <- count(tbl, !!set, wt = !!y_col, name = ".n")
    y_total_df <- count(tbl, !!feature, wt = !!y_col, name = "y_total")
    df_joined <- tbl %>%
      left_join(n_df, by = as_name(set)) %>%
      left_join(y_total_df, by = as_name(feature)) %>%
      mutate(.alpha = alpha,
             .alpha0 = nrow(distinct(tbl, !!feature)) * alpha,
             y_other = y_total - !!y_col,
             n_other = sum(!!y_col) - .n,
             l1 = (!!y_col + .alpha) / (.n + .alpha0 - !!y_col - .alpha),
             l2 = (y_other + .alpha) / (n_other + .alpha0 - y_other - .alpha),
             sigma2 = 1 / (!!y_col + .alpha) + 1 / (y_other + .alpha),
             log_odds = (log(l1) - log(l2)) / sqrt(sigma2))
    tbl$log_odds <- df_joined$log_odds
  } else if (nrow(alpha) >= nrow(distinct(tbl, !!feature))) {
    # User-supplied prior: a data frame with the feature column and an
    # alpha column (assumed to be the second column)
    n_df <- count(tbl, !!set, wt = !!y_col, name = ".n")
    y_total_df <- count(tbl, !!feature, wt = !!y_col, name = "y_total")
    alpha0 <- sum(alpha[[2]])
    df_joined <- tbl %>%
      left_join(n_df, by = as_name(set)) %>%
      left_join(y_total_df, by = as_name(feature)) %>%
      left_join(alpha, by = as_name(feature)) %>%
      rename(.alpha = last_col()) %>%
      mutate(.alpha0 = alpha0,
             y_other = y_total - !!y_col,
             n_other = sum(!!y_col) - .n,
             l1 = (!!y_col + .alpha) / (.n + .alpha0 - !!y_col - .alpha),
             l2 = (y_other + .alpha) / (n_other + .alpha0 - y_other - .alpha),
             sigma2 = 1 / (!!y_col + .alpha) + 1 / (y_other + .alpha),
             log_odds = (log(l1) - log(l2)) / sqrt(sigma2))
    tbl$log_odds <- df_joined$log_odds
  } else {
    stop("`alpha` must be length 1 or a data frame with one row per feature")
  }

  if (!is_empty(grouping)) {
    tbl <- group_by(tbl, !!!syms(grouping))
  }
  tbl
}
```
```r
library(tidyverse)
library(babynames)

top_names <- babynames %>%
  filter(year >= 1950,
         year < 1990) %>%
  mutate(decade = (year %/% 10) * 10,
         decade = paste0(decade, "s")) %>%
  group_by(decade) %>%
  count(name, wt = n, sort = TRUE) %>%
  ungroup()

top_names %>%
  bind_log_odds(decade, name, n) %>%
  arrange(-log_odds)
#> # A tibble: 100,527 x 4
#>    decade name         n log_odds
#>    <chr>  <chr>    <int>    <dbl>
#>  1 1980s  Ashley  357385     477.
#>  2 1980s  Jessica 471493     457.
#>  3 1950s  Linda   565481     414.
#>  4 1980s  Joshua  399131     412.
#>  5 1980s  Amanda  370873     391.
#>  6 1970s  Jason   465402     374.
#>  7 1980s  Justin  291843     363.
#>  8 1950s  Deborah 431302     348.
#>  9 1980s  Brandon 234294     331.
#> 10 1970s  Jennifer 583692    330.
#> # ... with 100,517 more rows

alpha_df <- babynames %>%
  count(name, wt = n, name = "alpha")

top_names %>%
  bind_log_odds(decade, name, n, alpha_df) %>%
  arrange(-log_odds)
#> # A tibble: 100,527 x 4
#>    decade name         n log_odds
#>    <chr>  <chr>    <int>    <dbl>
#>  1 1980s  Ashley  357385     352.
#>  2 1980s  Jessica 471493     333.
#>  3 1980s  Joshua  399131     299.
#>  4 1950s  Linda   565481     296.
#>  5 1980s  Amanda  370873     282.
#>  6 1950s  Mary    627098     275.
#>  7 1980s  Justin  291843     263.
#>  8 1950s  Deborah 431302     242.
#>  9 1970s  Jason   465402     240.
#> 10 1980s  Brandon 234294     240.
#> # ... with 100,517 more rows

top_names %>%
  bind_log_odds(decade, name, n, 10) %>%
  arrange(-log_odds)
#> # A tibble: 100,527 x 4
#>    decade name         n log_odds
#>    <chr>  <chr>    <int>    <dbl>
#>  1 1980s  Jessica 471493     760.
#>  2 1950s  Linda   565481     720.
#>  3 1970s  Jason   465402     697.
#>  4 1980s  Joshua  399131     693.
#>  5 1980s  Amanda  370873     663.
#>  6 1970s  Jennifer 583692    639.
#>  7 1950s  Deborah 431302     610.
#>  8 1980s  Justin  291843     599.
#>  9 1980s  Ashley  357385     599.
#> 10 1950s  Mary    627098     583.
#> # ... with 100,517 more rows
```

Created on 2019-08-18 by the reprex package (v0.3.0)
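For reference, here is how I map the columns the function computes onto the paper's log-odds-ratio with a Dirichlet prior (my notation: `l1`/`l2` are the two odds, `sigma2` the variance, and `log_odds` the standardized statistic). Writing $y_{iw}$ for the count of word $w$ in group $i$, $n_i$ for the group total, and $\alpha_w$, $\alpha_0$ for the per-word and total pseudo-counts:

$$
\hat\delta_w^{(i-j)} = \log\frac{y_{iw}+\alpha_w}{n_i+\alpha_0-y_{iw}-\alpha_w} \;-\; \log\frac{y_{jw}+\alpha_w}{n_j+\alpha_0-y_{jw}-\alpha_w}
$$

$$
\sigma^2\!\left(\hat\delta_w^{(i-j)}\right) \approx \frac{1}{y_{iw}+\alpha_w} + \frac{1}{y_{jw}+\alpha_w},
\qquad
\hat\zeta_w^{(i-j)} = \frac{\hat\delta_w^{(i-j)}}{\sqrt{\sigma^2\!\left(\hat\delta_w^{(i-j)}\right)}}
$$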
I also read the paper recently and have been trying to wrap my head around it. So far as I can tell, the key idea is that the mean-variance relationship for word counts isn't stable. So we first calculate the probability that a particular word occurs, which is a mean estimate, and then we divide by the standard deviation of that estimate. Now all of the estimates have comparable variance, which we really want! [1]

Figure 4 has a plot with the standardized estimates; the authors then go on to say that they don't really like them, and that we can improve things by regularizing them. Adding information to the prior will shrink the mean estimates toward a grand mean, introducing bias and decreasing variance. So we'll have two good things going for us: the overly variable means will be less variable, and we'll still be standardizing by the variance to get things on the same scale. The authors propose two regularizing priors:
Then they talk about uninformative priors:
This isn't quite right. When

So my first observation is that I'm not sure how much to trust the uninformativeness of the priors in the paper, especially since I can't find a reported value of

My second observation is that the whole point is to get a better mean estimate via regularization, so using an uninformative prior that doesn't regularize defeats the point. This seems like the perfect scenario to use empirical Bayes (i.e. use an MLE estimate for the prior parameter). I haven't picked over the code in too much detail, but it looks like you're probably doing something like this. And this is probably a good thing, because empirical Bayes estimates have nice shrinkage properties, and shrinkage is the whole point!

To summarize:
[1]: Another approach, which my advisor really likes, is to do a classic variance stabilizing transformation (section 1.2).

[2]: This might work well for MAP estimates, where it's equivalent to the LASSO, but if you use posterior means, you're using the Bayesian LASSO, which is a known bad idea.
Thanks to all for your patience! I have returned to working on this package because some CDC folks are using it in analysis relevant to current world events. The PR in #4 addresses a lot of the discussion here and I think improves the situation quite a bit. It does not add an option for an
So I agree with @alexpghayes that the current approach is basically empirical Bayes. For this reason, I think the notion of how informative the prior might be is moot. Since the prior is data driven, the training set defines how informative the prior is. For example, if the word distribution is uniform (unlikely to happen in practice), then the resulting flat prior would be uninformative. Parameterizing
The PR from @alexpghayes in #5 adds some nice functionality here that we're going to say covers the bases for now: an

Thanks to all for your discussion! 🙌
Since I read Monroe et al. recently, I was confused why you were using the uninformative Dirichlet prior approach — they note that "this is a relabeling of the smoothed log-odds-ratio" (385), which latches on to words (features) that are used only by one party (set).
If I read the code correctly, it looks like you're using the marginal distributions (i.e., distribution of features across all sets combined) to construct an informative Dirichlet prior. That's quite clever, and gets around the need to pull in data from another source. But then calling it an "uninformative" prior is incorrect. And some Bayesian purists may argue that this isn't a "prior" because it's using the same data. So I would suggest calling it an informative prior and adding a sentence or two to the documentation explaining how it's constructed.
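To illustrate what I mean by "marginal distributions", here is a minimal sketch (toy data and hypothetical column names, not the package's actual internals) of pooling per-feature counts across sets to get Dirichlet pseudo-counts:

```r
library(dplyr)

# Toy counts: two sets, three features (hypothetical data)
tbl <- tibble::tribble(
  ~set, ~feature, ~n,
  "a",  "x",      5,
  "a",  "y",      1,
  "b",  "x",      2,
  "b",  "z",      4
)

# Marginal feature counts, pooled across all sets; using these as the
# Dirichlet pseudo-counts is what makes the prior informative
prior <- count(tbl, feature, wt = n, name = "alpha")
# prior has one row per feature: x -> 7, y -> 1, z -> 4
```

A feature that is common overall (like "x" here) gets a large pseudo-count, so its within-set estimates are shrunk harder toward the pooled rate.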
This leads to some feature suggestions, in order of increasing complexity:
1. Allow the user to pass a fixed alpha value for a genuinely uninformative Dirichlet prior. (Might only be useful for comparing the effects of informative vs. uninformative priors.)
2. Allow the user to pass an additional dataframe with the alpha value for each feature.
3. Include some precalculated dataframes in the appropriate format for number 2, e.g., from Wikipedia term counts.