
Fix #1457 (projpred newdata requiring more variables than necessary) #1459

Merged: 7 commits merged into paul-buerkner:master from fweber144:newdata_issue on Feb 15, 2023

Conversation

fweber144 (Contributor)

This fixes issue #1457. See the commit messages for details. If you don't want to increase the version number, you can of course revert that commit.

I have also added a TODO comment in this line because I noticed a slight inconsistency while testing this fix: if weights (i.e., from resp_weights()) are missing from newdata (and listed in req_vars), no error is triggered, whereas such an error is triggered if numbers of trials (i.e., from resp_trials()) are missing from newdata (and listed in req_vars). From a conceptual point of view alone, I think it would be desirable to also throw an error if weights are missing. Tell me if you need a reprex for this (but I think this would be a different issue and, furthermore, it is not projpred-related).

This is achieved by no longer ignoring `extract_y` (ignoring it was already
undesirable from a conceptual point of view) and by setting `check_response`
and `req_vars` appropriately.
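
For illustration, the call pattern this enables might look roughly as follows. This is only a hypothetical sketch: the argument names `check_response` and `req_vars` are taken from the commit message above, but the exact signature of brms' newdata validation and all data and variable names here are assumptions, not the PR's actual code.

```r
# Hypothetical sketch (not the PR's actual code): when only predictions are
# needed, the response does not have to be present in newdata, so response
# checking is disabled and req_vars is restricted to what is actually needed.
newdata_checked <- validate_newdata(
  newdata = pred_grid,       # hypothetical data frame of predictor values
  object = fit,              # hypothetical brmsfit object
  check_response = FALSE,    # response not required for pure prediction
  req_vars = c("x1", "x2")   # hypothetical names of the required predictors
)
```
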
@paul-buerkner (Owner)

Thanks! Why should weights be required if the response is not? They are only relevant for log_lik(), which requires the response as well. posterior_predict(), for example, does not make use of weights.
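
To make the distinction concrete, here is a conceptual sketch in plain R (not brms internals; all names are made up): weights enter the pointwise log-likelihood, which needs the response y, while a posterior predictive draw does not involve them.

```r
# Observation weights scale the pointwise log-likelihood, which requires y ...
pointwise_log_lik <- function(y, mu, sigma, w) {
  w * dnorm(y, mean = mu, sd = sigma, log = TRUE)
}

# ... whereas a posterior predictive draw does not involve the weights
# (Gaussian example):
predictive_draw <- function(mu, sigma) {
  rnorm(length(mu), mean = mu, sd = sigma)
}
```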

@fweber144 (Contributor, Author) commented Feb 15, 2023

> Thanks! Why should weights be required if the response is not? They are only relevant for log_lik(), which requires the response as well. posterior_predict(), for example, does not make use of weights.

I was also considering tying the requirement of weights to the requirement of the response, but projpred uses the weights in proj_predict() (the equivalent of posterior_predict()) in the case of a binomial family (where these weights can be regarded as the numbers of trials; projpred does not differentiate between the two).
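
A minimal sketch of that binomial case (hypothetical values, not projpred's actual code): the weights serve as trial counts, so even pure prediction needs them.

```r
set.seed(1)
eta <- rnorm(5)                # hypothetical linear-predictor draws
w <- c(10, 3, 7, 1, 5)         # hypothetical observation weights = trial counts
p <- plogis(eta)               # success probabilities
yrep <- rbinom(length(p), size = w, prob = p)  # the weights are needed here
```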

Yesterday, this made me think about observation weights in proj_predict() in general, so I discussed it with @avehtari. He agreed that it would make sense for proj_predict() to take observation weights into account for families other than binomial as well. The idea for implementing this is that a weighted dataset would need to be de-aggregated, so that instead of an $S_{\text{prj}} \times N$ matrix (with $N$ denoting the number of weighted observations), proj_predict() returns an $S_{\text{prj}} \times \tilde{N}$ matrix, with $\tilde{N}$ denoting the number of de-aggregated observations. Things get trickier in the case of non-integer observation weights. There, a solution could be to sample de-aggregated observations with probabilities proportional to the weights; @avehtari suggested stratified sampling for this, as implemented in the posterior package, for example. But even then, a non-integer $\tilde{N}$ remains a problem that I am not yet sure how to solve (simply round $\tilde{N}$ to the next integer?).
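
For illustration, a minimal sketch of such a de-aggregation (an assumption on my part: this uses simple Bernoulli sampling of the fractional weight parts rather than the stratified scheme mentioned above, and all names are made up):

```r
deaggregate_idx <- function(w) {
  # repeat observation i floor(w[i]) times ...
  base <- rep(seq_along(w), times = floor(w))
  # ... and include it once more with probability w[i] - floor(w[i]), so the
  # expected number of de-aggregated observations equals sum(w)
  frac <- w - floor(w)
  extra <- which(runif(length(w)) < frac)
  sort(c(base, extra))
}

# Both the predictions (columns) and the response would have to be
# de-aggregated with the same index vector:
# idx <- deaggregate_idx(w)
# yrep_deagg <- yrep[, idx, drop = FALSE]
# y_deagg <- y[idx]
```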

I think incorporating observation weights into proj_predict() and posterior_predict() makes sense, especially if you think of bayesplot::ppd_dens_overlay(): the kernel density estimate across observations used there should, I think, be influenced by the observation weights.
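
As an illustration of that point (made-up data; this only uses base R's density(), not bayesplot internals):

```r
y <- c(1.2, 3.4, 2.1, 0.7, 2.8)  # hypothetical response values
w <- c(2, 0.5, 1, 3, 1)          # hypothetical observation weights
d_unweighted <- density(y)
d_weighted <- density(y, weights = w / sum(w))  # weights must sum to 1
```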

Honestly, though, I am not very keen to implement this in projpred soon, as it is a very special case and probably hard to implement sensibly in terms of the UI (for example, how would users identify which columns of the $S_{\text{prj}} \times \tilde{N}$ matrix belong to which columns of the $S_{\text{prj}} \times N$ matrix?). For now, I might throw an error or a warning in projpred when proj_predict() is used with observation weights (apart from the binomial family, where things seem to be correct). But as mentioned above, proj_predict() in the binomial case prevents us from tying the requirement of weights to the requirement of the response.
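
Such a guard could look roughly like this (a sketch with made-up names, not projpred's actual code):

```r
check_proj_predict_weights <- function(family_name, wobs) {
  # refuse observation weights outside the binomial family, where their
  # meaning in proj_predict() is currently unclear
  if (family_name != "binomial" && any(wobs != 1)) {
    stop("proj_predict() does not support observation weights for ",
         "non-binomial families yet.")
  }
  invisible(TRUE)
}
# check_proj_predict_weights("gaussian", c(1, 2, 1))  # would throw the error
```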

@fweber144 (Contributor, Author) commented Feb 15, 2023

I should add that for bayesplot::ppd_dens_overlay(), the original response values (y) would also have to be de-aggregated in the same way as the predictions (y_rep).

@fweber144 (Contributor, Author)

I guess that rstantools::posterior_predict() and corresponding rstanarm methods would also need to be changed if we really want to go in this direction.

@paul-buerkner (Owner)

Okay, so what we need for this PR is that variables listed in req_vars are actually required. Fair point. I will check and edit the PR accordingly.

@paul-buerkner (Owner)

Well, actually, I am not sure I want to change this right now, and I will have to think about whether it makes sense at all. I will merge the PR, and we can revisit the weights stuff later.

@paul-buerkner paul-buerkner merged commit c793d9f into paul-buerkner:master Feb 15, 2023
@fweber144 (Contributor, Author)

Thanks! Revisiting the weights stuff later is fine with me.

@fweber144 fweber144 deleted the newdata_issue branch February 15, 2023 13:53
fweber144 added a commit to fweber144/brms that referenced this pull request on Feb 15, 2023:

> …_vars = character()`
>
> Thus, we need to set `req_vars` to the response variable.
fweber144 added a commit to fweber144/brms that referenced this pull request on Feb 15, 2023:

> …eq_vars = character()`
>
> Thus, we need to set `req_vars` to the response variable.
@fweber144 fweber144 mentioned this pull request Feb 15, 2023
paul-buerkner added a commit that referenced this pull request Feb 15, 2023