
"uninformative" prior #3

Closed
dhicks opened this issue Jul 20, 2019 · 6 comments

Comments

@dhicks commented Jul 20, 2019

Since I read Monroe et al. recently, I was confused why you were using the uninformative Dirichlet prior approach — they note that "this is a relabeling of the smoothed log-odds-ratio" (385), which latches on to words (features) that are used only by one party (set).

If I read the code correctly, it looks like you're using the marginal distributions (i.e., distribution of features across all sets combined) to construct an informative Dirichlet prior. That's quite clever, and gets around the need to pull in data from another source. But then calling it an "uninformative" prior is incorrect. And some Bayesian purists may argue that this isn't a "prior" because it's using the same data. So I would suggest calling it an informative prior and adding a sentence or two to the documentation explaining how it's constructed.
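For reference, here is the estimator from Monroe et al. that the code in this thread computes (my notation: y_w is the count of word w in a group, n the group total, alpha_w the prior pseudo-count for w, and alpha_0 the sum of the alphas):

```latex
% Posterior log-odds-ratio of word w between groups i and j:
\hat{\delta}_w^{(i-j)}
  = \log\frac{y_w^{(i)} + \alpha_w}{n^{(i)} + \alpha_0 - y_w^{(i)} - \alpha_w}
  - \log\frac{y_w^{(j)} + \alpha_w}{n^{(j)} + \alpha_0 - y_w^{(j)} - \alpha_w}

% Approximate variance, and the standardized statistic reported as log_odds:
\sigma^2\bigl(\hat{\delta}_w^{(i-j)}\bigr)
  \approx \frac{1}{y_w^{(i)} + \alpha_w} + \frac{1}{y_w^{(j)} + \alpha_w},
\qquad
z_w = \frac{\hat{\delta}_w^{(i-j)}}{\sqrt{\sigma^2\bigl(\hat{\delta}_w^{(i-j)}\bigr)}}
```

The choice of the alpha_w values is exactly where the informative/uninformative distinction comes in.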

This leads to some feature suggestions, in order of increasing complexity:

  1. Allow the user to pass a fixed alpha value for a genuinely uninformative Dirichlet prior. (Might only be useful for comparing the effects of informative vs. uninformative priors.)

  2. Allow the user to pass an additional dataframe with the alpha value for each feature.

  3. Include some precalculated dataframes in the appropriate format for number 2, e.g., from Wikipedia term counts.

@juliasilge (Owner)

Thank you so much for this feedback @dhicks! 🙌

I was re-reading the paper to understand this better and see where my misunderstanding arose, and these are some great thoughts. I won't be able to work on this for the next several weeks, but we will get this fixed and accurate before a CRAN release.

cc @TylerSchnoebelen

@scottfrechette

I was working on my own version of the function to help me understand the paper and how it relates to this package, and I think it could be relevant here. I wanted to push the original function further to allow different levels of information for the priors, and I think I lucked into a rough version of the first two solutions suggested above.

Note: I made some slight changes to variable names to help me stay consistent with the paper as I worked through it.

bind_log_odds <- function(tbl, set, feature, n, alpha = NULL) {
    
    set <- enquo(set)
    feature <- enquo(feature)
    y_col <- enquo(n)
    grouping <- group_vars(tbl)
    tbl <- ungroup(tbl)
    
    if(is.null(alpha)) {
        
        n_df <- count(tbl, !!set, wt = !!y_col, name = ".n")
        alpha_df <- count(tbl, !!feature, wt = !!y_col, name = ".alpha")
        df_joined <- tbl %>%
            left_join(n_df, by = rlang::as_name(set)) %>%
            left_join(alpha_df, by = rlang::as_name(feature)) %>%
            mutate(.alpha0 = sum(!!y_col),
                   y_other = .alpha - !!y_col,
                   n_other = .alpha0 - .n,
                   l1 = (!!y_col + .alpha) / (.n + .alpha0 - !!y_col - .alpha),
                   l2 = (y_other + .alpha) / (n_other + .alpha0 - y_other - .alpha),
                   sigma2 = 1/(!!y_col + .alpha) + 1/(y_other + .alpha),
                   log_odds = (log(l1) - log(l2))/sqrt(sigma2))
        tbl$log_odds <- df_joined$log_odds
        
    } else {
        
        if(length(alpha) == 1) {
            
            n_df <- count(tbl, !!set, wt = !!y_col, name = ".n")
            y_total_df <- count(tbl, !!feature, wt = !!y_col, name = "y_total")
            df_joined <- tbl %>%
                left_join(n_df, by = rlang::as_name(set)) %>%
                left_join(y_total_df, by = rlang::as_name(feature)) %>%
                mutate(.alpha = alpha,
                       .alpha0 = nrow(distinct(tbl, !!feature)) * alpha,
                       y_other = y_total - !!y_col,
                       n_other = sum(!!y_col) - .n,
                       l1 = (!!y_col + .alpha) / (.n + .alpha0 - !!y_col - .alpha),
                       l2 = (y_other + .alpha) / (n_other + .alpha0 - y_other - .alpha),
                       sigma2 = 1/(!!y_col + .alpha) + 1/(y_other + .alpha),
                       log_odds = (log(l1) - log(l2))/sqrt(sigma2))
            tbl$log_odds <- df_joined$log_odds
            
        } else if (nrow(alpha) >= nrow(distinct(tbl, {{feature}}))) {
            
            n_df <- count(tbl, !!set, wt = !!y_col, name = ".n")
            y_total_df <- count(tbl, !!feature, wt = !!y_col, name = "y_total")
            alpha0 <- sum(alpha[,2])
            df_joined <- tbl %>%
                left_join(n_df, by = rlang::as_name(set)) %>%
                left_join(y_total_df, by = rlang::as_name(feature)) %>%
                left_join(alpha, by = rlang::as_name(feature)) %>%
                rename(.alpha = ncol(.)) %>%
                mutate(.alpha0 = alpha0,
                       y_other = y_total - !!y_col,
                       n_other = sum(!!y_col) - .n,
                       l1 = (!!y_col + .alpha) / (.n + .alpha0 - !!y_col - .alpha),
                       l2 = (y_other + .alpha) / (n_other + .alpha0 - y_other - .alpha),
                       sigma2 = 1/(!!y_col + .alpha) + 1/(y_other + .alpha),
                       log_odds = (log(l1) - log(l2))/sqrt(sigma2))
            tbl$log_odds <- df_joined$log_odds
            
        } else {
            
            stop("alpha must be length 1 or a data frame with one row per feature", call. = FALSE)
            
        }
        
    }
    
    if (!is_empty(grouping)) {
        tbl <- group_by(tbl, !!!syms(grouping))
    }
    
    tbl
    
}

library(tidyverse)
library(babynames)

top_names <- babynames %>%
    filter(year >= 1950,
           year < 1990) %>%
    mutate(decade = (year %/% 10) * 10,
           decade = paste0(decade, "s")) %>%
    group_by(decade) %>%
    count(name, wt = n, sort = TRUE) %>%
    ungroup

top_names %>%  
    bind_log_odds(decade, name, n) %>%
    arrange(-log_odds)
#> # A tibble: 100,527 x 4
#>    decade name          n log_odds
#>    <chr>  <chr>     <int>    <dbl>
#>  1 1980s  Ashley   357385     477.
#>  2 1980s  Jessica  471493     457.
#>  3 1950s  Linda    565481     414.
#>  4 1980s  Joshua   399131     412.
#>  5 1980s  Amanda   370873     391.
#>  6 1970s  Jason    465402     374.
#>  7 1980s  Justin   291843     363.
#>  8 1950s  Deborah  431302     348.
#>  9 1980s  Brandon  234294     331.
#> 10 1970s  Jennifer 583692     330.
#> # ... with 100,517 more rows

alpha_df <- babynames %>%
    count(name, wt = n, name = "alpha")

top_names %>%  
    bind_log_odds(decade, name, n, alpha_df) %>%
    arrange(-log_odds)
#> # A tibble: 100,527 x 4
#>    decade name         n log_odds
#>    <chr>  <chr>    <int>    <dbl>
#>  1 1980s  Ashley  357385     352.
#>  2 1980s  Jessica 471493     333.
#>  3 1980s  Joshua  399131     299.
#>  4 1950s  Linda   565481     296.
#>  5 1980s  Amanda  370873     282.
#>  6 1950s  Mary    627098     275.
#>  7 1980s  Justin  291843     263.
#>  8 1950s  Deborah 431302     242.
#>  9 1970s  Jason   465402     240.
#> 10 1980s  Brandon 234294     240.
#> # ... with 100,517 more rows

top_names %>%
    bind_log_odds(decade, name, n, 10) %>%
    arrange(-log_odds)
#> # A tibble: 100,527 x 4
#>    decade name          n log_odds
#>    <chr>  <chr>     <int>    <dbl>
#>  1 1980s  Jessica  471493     760.
#>  2 1950s  Linda    565481     720.
#>  3 1970s  Jason    465402     697.
#>  4 1980s  Joshua   399131     693.
#>  5 1980s  Amanda   370873     663.
#>  6 1970s  Jennifer 583692     639.
#>  7 1950s  Deborah  431302     610.
#>  8 1980s  Justin   291843     599.
#>  9 1980s  Ashley   357385     599.
#> 10 1950s  Mary     627098     583.
#> # ... with 100,517 more rows

Created on 2019-08-18 by the reprex package (v0.3.0)

@alexpghayes (Contributor) commented Aug 26, 2019

I also read the paper recently and have been trying to wrap my head around it. As far as I can tell, the key idea is that the mean-variance relationship for word counts isn't stable. So we first calculate the probability that a particular word occurs, which is a mean estimate, and then we divide by the standard deviation of that estimate. Now all of the estimates have comparable variance, which is what we really want! [1]

Figure 4 has a plot with the standardized estimates; the authors then go on to say that they don't really like them, and that we can improve things by regularizing them. Adding information to the prior will shrink the mean estimates toward a grand mean, introducing bias and decreasing variance. So we'll have two good things going for us: the overly variable means will be less variable, and we'll still be dividing by their standard deviation to get things on the same scale.

The authors propose two regularizing priors:

  • A Dirichlet prior. First they use alpha = 500 (for all words) to add a pseudo-count of 500 to each word count.

  • A Laplace prior. [2]

Then they talk about uninformative priors:

That is, use of any particular Dirichlet prior defined by alpha affects the posterior exactly as if we had observed in the data an additional alpha_w – 1 instances of word w. This can be arbitrarily uninformative, for example, alpha_w = 0.01 for all w.

This isn't quite right. When alpha < 1, Dirichlet priors start to concentrate probability again, which the authors may not have been aware of. You can read more about this at https://discourse.mc-stan.org/t/vague-proper-dirichlet-prior/3318/6.
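A quick way to see this concentration, sketched in base R using the gamma representation of the Dirichlet (each component is a normalized Gamma(alpha) draw, so no extra packages are needed): with a very small alpha, most of the probability mass piles onto one or a few components, which is the opposite of flat.

```r
# Draw once from a symmetric Dirichlet(alpha, ..., alpha) over k categories,
# via normalized gamma variates: x_i ~ Gamma(alpha), p = x / sum(x)
rdirichlet_sym <- function(k, alpha) {
  x <- rgamma(k, shape = alpha)
  x / sum(x)
}

set.seed(123)
k <- 50

# "Uninformative-looking" tiny alpha: draws sit near a vertex of the simplex,
# so the largest component hogs most of the probability mass
small <- replicate(100, max(rdirichlet_sym(k, 0.01)))

# alpha = 1 (uniform over the simplex): the largest component stays modest
flat <- replicate(100, max(rdirichlet_sym(k, 1)))

mean(small)  # typically well above 0.5
mean(flat)   # typically around 0.1 for k = 50
```

So "small alpha" is sparsity-inducing rather than vague, which is the point the Stan discourse thread above makes.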

So my first observation is that I'm not sure how much to trust the uninformativeness of the priors in the paper, especially since I can't find a reported value of alpha for the uninformative results.

My second observation is that the whole point is to get a better mean estimate via regularization, so using an uninformative prior that doesn't regularize defeats the point. This seems like the perfect scenario for empirical Bayes (i.e., use an MLE estimate for the prior parameter). I haven't picked over the code in too much detail, but it looks like you're probably doing something like this. And this is probably a good thing, because empirical Bayes estimates have nice shrinkage properties, and shrinkage is the whole point!
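A minimal base-R sketch of that empirical-Bayes idea on toy data (the column names here are hypothetical, not the package's internals): set each alpha_w to the marginal count of feature w across all sets, then plug it into the log-odds formula from the paper.

```r
# Toy counts of features (words) within two sets (groups)
counts <- data.frame(
  set     = c("a", "a", "b", "b"),
  feature = c("apple", "pear", "apple", "pear"),
  n       = c(10, 2, 3, 12)
)

# Empirical-Bayes prior: alpha_w = marginal count of feature w over all sets
alpha  <- tapply(counts$n, counts$feature, sum)  # apple = 13, pear = 14
alpha0 <- sum(alpha)
n_set  <- tapply(counts$n, counts$set, sum)      # a = 12, b = 15

# Standardized log-odds of "apple" in set "a" vs set "b",
# following the Monroe et al. estimator discussed in this thread
y_a <- 10; y_b <- 3; a_w <- alpha[["apple"]]
l1 <- (y_a + a_w) / (n_set[["a"]] + alpha0 - y_a - a_w)
l2 <- (y_b + a_w) / (n_set[["b"]] + alpha0 - y_b - a_w)
z  <- (log(l1) - log(l2)) / sqrt(1 / (y_a + a_w) + 1 / (y_b + a_w))
round(z, 2)  # about 2.6: "apple" is overrepresented in set "a"
```

The prior pseudo-counts come from the same data being analyzed, which is exactly why "uninformative" is the wrong label for it.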

To summarize:

  • I don't think an uninformative prior is a good idea here. I'm in favor of an empirical Bayes approach by default.
  • It would be nice to be able to specify the alpha parameter. The documentation should include a note that alpha = 1 is uninformative, but alpha -> 0 and alpha -> infinity both give informative priors.
  • This isn't a data pre-processing technique; it's really full-on modeling. Using log odds rather than word counts basically amounts to building the rest of your model on some latent Dirichlet-multinomial model parameters. This means there are some overfitting concerns that come just from the preprocessing step.

[1]: Another approach, which my advisor really likes, is to do a classic variance-stabilizing transformation (section 1.2).
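To illustrate [1] with a small base-R sketch (my own toy example, not from the paper): raw Poisson counts have variance equal to their mean, but after the classic square-root transform the variance is roughly constant (about 1/4) regardless of the mean.

```r
set.seed(42)
means <- c(5, 50, 500)

# Variance of raw Poisson counts grows with the mean...
raw_var <- sapply(means, function(m) var(rpois(1e5, m)))

# ...but after the square-root transform it is roughly constant
# (about 0.25, slightly higher for small means)
sqrt_var <- sapply(means, function(m) var(sqrt(rpois(1e5, m))))

round(raw_var)      # close to 5, 50, 500
round(sqrt_var, 2)  # each close to 0.25
```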

[2]: This might work well for MAP estimates, where it's equivalent to the LASSO, but if you use posterior means, you're using the Bayesian LASSO, which is a known bad idea.

@juliasilge (Owner) commented May 5, 2020

Thanks to all for your patience! I have returned to working on this package because some CDC folks are using it in analysis relevant to current world events. The PR in #4 addresses a lot of the discussion here and I think improves the situation quite a bit.

It does not add an option for an alpha parameter but I can see that as a possible next step, after getting the package in better shape and on CRAN.

@topepo commented May 5, 2020

I agree with @alexpghayes that the current approach is basically empirical Bayes. For this reason, I think the notion of how informative the prior might be is moot: since the prior is data driven, the training set determines how informative the prior is. For example, if the word distribution were uniform (unlikely to happen), the resulting flat prior would be uninformative.

Parameterizing alpha might be a good idea, but how would anyone use it? For a single alpha, the prior is not very realistic. Otherwise, someone would have to create a highly multivariate alpha specification. Is that realistic?

@juliasilge (Owner)

The PR from @alexpghayes in #5 adds some nice functionality here that we're going to say covers the bases for now: an uninformative = TRUE/FALSE option for whether to use an uninformative Dirichlet prior, along with other improvements to notation and clarity in the docs.

Thanks to all for your discussion! 🙌
