logistic normal distribution - feature request #1274

Closed
bachlaw opened this issue Dec 19, 2021 · 12 comments

Comments

bachlaw commented Dec 19, 2021

Thank you for all the tremendous work here.

As the compositional-data guru John Aitchison has noted, the Dirichlet distribution tends to be useful in theory but not very often in practice, due to its unrealistic assumption that the multiple components which together form a simplex have no correlation whatsoever with one another, even after transforming away the sum-to-1 constraint.

One solution to this problem is the logistic normal distribution, whereby the model becomes a multivariate normal regression on the log-ratio (logit) scale, with the components mapped back onto the simplex so they sum to 1, and the components then gain a correlation matrix. In addition to making compositional data analysis much more meaningful, it can offer a workaround for multinomial models that want to recognize correlation among outcomes but wish to avoid the dreaded multinomial probit, as long as the modeler is willing to recast the counts as compositions of a whole, with weights reflecting the number of "trials" for each row. The logistic normal distribution seems to come up most often in connection with latent Dirichlet allocation and especially correlated topic modeling, but it has independent and more straightforward usefulness in the compositional data context. (The logistic normal's little brother, the logit-normal, can be a more flexible alternative to the beta distribution.)

As it is, the mvbind framework seems fully able to handle the multivariate normal nature of the outcomes, even with weights. Thus, it seems like it would only require a few lines of additional Stan code to add this capability to brms's existing multivariate model structure, because we are talking about adding transformed parameters, not a whole new density function (https://discourse.mc-stan.org/t/how-do-i-specify-a-logit-normal-distribution/4894). Having said that, I am struggling a bit with how that code would end up looking, much less how it would be coded most efficiently.
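
For concreteness, here is a rough sketch in plain R of the generative model I have in mind (assuming Aitchison's additive logistic transform of a (K-1)-dimensional multivariate normal; the names and numbers are purely illustrative, not a proposed implementation):

library(MASS)  # for mvrnorm; only used to simulate illustrative data

K <- 3
mu <- c(0.5, -0.2)                        # latent means for the first K-1 log-ratios
Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)  # latent correlation, which the Dirichlet cannot express
v <- MASS::mvrnorm(500, mu, Sigma)        # (K-1)-dimensional multivariate normal draws
ev <- exp(v)
y <- cbind(ev, 1) / (1 + rowSums(ev))     # map onto the simplex; each row of y sums to 1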

So, while recognizing time is limited and there are no doubt plenty of requests for various distributions / families to be supported, I think this one would be a very worthy addition.

paul-buerkner commented Dec 20, 2021

I agree with the idea. Technically, for brms, this won't be a multivariate model (via mvbind) though, but rather a special univariate model with a multi-column response, as dirichlet is.

Let's discuss the name of this family. Logistic normal (or logit normal) is usually understood as the "univariate" case, just as log normal is. How about "softmax_normal", or is there another canonical name?

bachlaw commented Dec 20, 2021

I agree with the idea. Technically, for brms, this won't be a multivariate model (via mvbind) though, but rather a special univariate model with a multi-column response, as dirichlet is.

Great point. That simplifies the concept and builds upon your prior work rather nicely. Logit-normal is the beta distribution equivalent and logistic normal would be the Dirichlet equivalent, so in theory, if one supplied only two categories on the simplex, the logistic normal should collapse to the logit normal as a special case, which is nice.
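
As a quick sanity check of that collapse (just a sketch, using the additive logistic transform with the second component's latent value fixed at zero):

v <- rnorm(5)                 # latent normal draws
p1 <- exp(v) / (1 + exp(v))   # first component of a two-part composition
all.equal(p1, plogis(v))      # TRUE: with only two categories this is just the logit-normal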

Let's discuss the name of this family. Logistic normal is usually understood as the "univariate" case, just as log normal is. How about "softmax_normal", or is there another canonical name?

Logistic normal is definitely the canonical name and I have not seen another reference for it, unfortunately. The similarity in name between logistic normal and log normal is unfortunate but logistic and log are completely different concepts here, obviously. In any event, softmax_normal makes perfect sense to me as a distinguishing title. The notes can reference its more common name for users who want to know more, and softmax_normal correctly describes what is going on.

There may be times when some portions of the simplex would benefit from a different formula (e.g., when there are factors that affect some but not other components of the simplex). That may be why the concept made more sense to me at first as a multivariate model, in light of the subset functionality of brmsformula. But I don't think an initial implementation necessarily requires that to be supported, one way or another.

For what it is worth, Aitchison wrote a nice discussion of the logistic normal and its uses, starting around page 44: http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:a_concise_guide_to_compositional_data_analysis.pdf

Jonathan

paul-buerkner commented Dec 20, 2021

I see. So you would distinguish the logit normal (https://en.wikipedia.org/wiki/Logit-normal_distribution) from the logistic normal? That sounds confusing to me if we ever add the univariate logit normal to brms as well. I will think about your arguments and ideas a bit more.

jsocolar commented Dec 20, 2021

FWIW, SUG 9.5 uses "logistic normal" following Blei & Lafferty 2007, which in turn follows Aitchison & Shen 1980. I hear "logit normal" for a variable whose logit is normally distributed (just like log normal for a variable whose logarithm is normally distributed).

softmax_normal feels like a better name to me, but it also has shortcomings. First, if the goal is to follow the convention established by "log normal", then really we would want "inverse-softmax normal".

But I think there's a further subtlety. The inverse softmax is not unique, which I think should pose an identifiability problem in the model as given in the SUG. Aitchison & Shen define their "logistic normal" by applying what they call a logistic transform to an (n-1)-dimensional normal variate and then deriving the final element by subtracting from 1.

Edited to add: Equivalently, we could keep things identified by taking the softmax over c(v, 1), where v is an (n-1)-dimensional vector of link-scale variates, thus identifying the model on the link scale by fixing the link-scale value of the final element to one. But what this means is that the inverse softmax, subject to this identifying constraint, is not multivariate normal.
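
To make the non-uniqueness concrete, a small sketch (illustrative values only; this uses Aitchison & Shen's version, which derives the final component rather than modelling it):

softmax <- function(x) exp(x) / sum(exp(x))
v <- c(0.3, -1.2, 0.8)
all.equal(softmax(v), softmax(v + 5))  # TRUE: the unconstrained inverse softmax is not unique
# Aitchison & Shen model only n-1 link-scale values and derive the last component:
u <- c(0.3, -1.2)
x <- c(exp(u), 1) / (1 + sum(exp(u)))  # identified point on the simplex; sum(x) == 1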

bachlaw commented Dec 20, 2021

It does all call into question what it means to be univariate versus multivariate. I would gratefully take the logistic-normal under any reasonable implementation. But I tend to see the logit-normal as a special case of the logistic-normal similar to how the bernoulli / binomial is a special case of the categorical / multinomial. Ideally specifying data that matches the former in the framework of the latter just gives you back the same answer as the former, albeit perhaps less efficiently.

How these relationships fit with traditional concepts of univariate versus multivariate regression is an interesting question. It would be nice, certainly, to benefit from some of the same multiple-formula flexibility you offer for categorical / multinomial models with logistic-normal models. But I don't see why that has to be offered right away either, even if it is desirable long term.

jsocolar commented Dec 20, 2021

Yeah, the terminology just has fundamental shortcomings here. A Bernoulli / binomial is a special case of a categorical / multinomial, but a logit is not a special case of a logistic; it is a special case of an inverse logistic. But what we have in the literature is just a lot of inconsistency in terms of naming distributions for their link functions or for their inverse-link functions.

paul-buerkner commented Dec 20, 2021

Let's call it logistic_normal then and document very clearly what we mean and that it's meant as an alternative to dirichlet. Thank you for your input!

@paul-buerkner

This feature is now implemented. Here is an example:

library(brms)

# simulate simplex-valued responses (here from a Dirichlet, just for illustration)
# plus a continuous covariate
N <- 500
dat <- as.data.frame(rdirichlet(N, c(3, 2, 1)))
names(dat) <- c("y1", "y2", "y3")
dat$x <- rnorm(N)
# bind the components into a single multi-column response
dat$y <- with(dat, cbind(y1, y2, y3))

# distributional regression: x predicts the muy3 and sigmay2 parameters;
# lkj(3) prior on the latent correlation matrix (class "lncor")
fit <- brm(bf(y ~ 1, muy3 ~ x, sigmay2 ~ x), data = dat,
           family = brmsfamily("logistic_normal"),
           prior = prior("lkj(3)", "lncor"),
           backend = "cmdstanr")

bachlaw commented Feb 7, 2022

Much appreciated Paul! I will try it out.

bachlaw commented Feb 9, 2022

It seems to work well with cmdstan and multiple cores / chains. I am getting an error when I try to use the threading option, but I suspect it is because of this rogue capital L in the code?

Semantic error in '/tmp/RtmpxQk1XU/model-1a702f4169b6.stan', line 98, column 73 to column 79:
   -------------------------------------------------
    96:      for (n in 1:N) {
    97:        int nn = n + start - 1;
    98:        ptarget += logistic_normal_cholesky_cor_lpdf(Y[nn] | mu[n], sigma, Llncor, 1);
                                                                                  ^
    99:      }
   100:      return ptarget;
   -------------------------------------------------

Identifier 'Llncor' not in scope.
make: *** [/tmp/RtmpxQk1XU/model-1a702f4169b6.hpp] Error 1
Error: An error occured during compilation! See the message above for more information.

@paul-buerkner

Ah, I see. I forgot to pass Llncor to the threading function. Will fix this.

@paul-buerkner

Should now be fixed.
