# Missing Data and Partially Known Parameters

Bayesian inference supports a general approach to missing data in
which any missing data item is represented as a parameter that is
estimated in the posterior [@GelmanEtAl:2013]. If the missing
data are not explicitly modeled, as in the predictors for most
regression models, then the result is an improper prior on the
parameter representing the missing predictor.

Mixing arrays of observed and missing data can be difficult to
include in Stan, partly because it can be tricky to model discrete
unknowns in Stan and partly because, unlike some other statistical
languages (for example, R and BUGS), Stan requires observed and
unknown quantities to be defined in separate places in the model. Thus
it can be necessary to include code in a Stan program to splice
together observed and missing parts of a data structure. Examples are
provided later in the chapter.

## Missing Data

Stan treats variables declared in the `data` and
`transformed data` blocks as known and the variables in the
`parameters` block as unknown.

An example involving missing normal observations could be coded as follows.^[A more meaningful estimation example would involve a regression of the observed and missing observations using predictors that were known for each and specified in the `data` block.]
```
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}
```

The numbers of observed and missing data points are coded as data with
non-negative integer variables `N_obs` and `N_mis`. The
observed data are provided as an array data variable `y_obs`.
The missing data are coded as an array parameter, `y_mis`. The
ordinary parameters being estimated, the location `mu` and scale
`sigma`, are also coded as parameters. The model is vectorized
over the observed and missing data; combining them into a single
sampling statement would be less efficient here because the data
observations would be promoted to autodiff variables and have
needless derivatives calculated.

## Partially Known Parameters {#partially-known-parameters.section}
In some situations, such as when a multivariate probability function
has partially observed outcomes or parameters, it will be necessary to
create a vector mixing known (data) and unknown (parameter) values.
This can be done in Stan by creating a vector or array in the
`transformed parameters` block and assigning to it.
The following example involves a bivariate covariance matrix in which the
variances are known, but the covariance is not.
```
data {
  int<lower=0> N;
  vector[2] y[N];
  real<lower=0> var1;
  real<lower=0> var2;
}
transformed data {
  real<lower=0> max_cov = sqrt(var1 * var2);
  real<upper=0> min_cov = -max_cov;
}
parameters {
  vector[2] mu;
  real<lower=min_cov, upper=max_cov> cov;
}
transformed parameters {
  matrix[2, 2] Sigma;
  Sigma[1, 1] = var1;
  Sigma[1, 2] = cov;
  Sigma[2, 1] = cov;
  Sigma[2, 2] = var2;
}
model {
  y ~ multi_normal(mu, Sigma);
}
```

The variances are defined as data in variables `var1` and
`var2`, whereas the covariance is defined as a parameter in
variable `cov`. The $2 \times 2$ covariance matrix `Sigma`
is defined as a transformed parameter, with the variances assigned to
the two diagonal elements and the covariance to the two off-diagonal
elements.

The constraint on the covariance declaration ensures that the
resulting covariance matrix `Sigma` is positive definite. The
bound, plus or minus the square root of the product of the variances,
is defined as transformed data so that it is only calculated once.

The vectorization of the multivariate normal is critical for
efficiency here. The transformed parameter `Sigma` could instead be
defined as a local variable within the model block if it does not need
to be saved as part of the output.

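
A sketch of that local-variable alternative, assuming the same `data`,
`transformed data`, and `parameters` blocks as above:

```
model {
  // local variable: constructed each evaluation, not saved in the output
  matrix[2, 2] Sigma;
  Sigma[1, 1] = var1;
  Sigma[1, 2] = cov;
  Sigma[2, 1] = cov;
  Sigma[2, 2] = var2;
  y ~ multi_normal(mu, Sigma);
}
```

The computation is identical; the only difference is that `Sigma` is
not written to the output, which saves some I/O per draw.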
## Sliced Missing Data
If the missing data are part of some larger data structure, then it can
often be effectively reassembled using index arrays and slicing.
Here's an example for time-series data, where only some entries in the
series are observed.
```
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  int<lower=1, upper=N_obs + N_mis> ii_obs[N_obs];
  int<lower=1, upper=N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
}
transformed data {
  int<lower=0> N = N_obs + N_mis;
}
parameters {
  real y_mis[N_mis];
  real<lower=0> sigma;
}
transformed parameters {
  real y[N];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}
```

The index arrays `ii_obs` and `ii_mis` contain the indexes into the
final array `y` of the observed data (coded as a data vector `y_obs`)
and the missing data (coded as a parameter vector `y_mis`). See the
[time series chapter](#time-series.chapter) for further discussion of
time-series models, and specifically the [autoregression
section](#autoregressive.section) for an explanation of the
vectorization for `y` as well as an explanation of how to convert this
example to a full AR(1) model. To ensure `y[1]` has a proper
posterior in case it is missing, we have given it an explicit, albeit
broad, prior.

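
For reference, the conversion to a full AR(1) model amounts to adding
an intercept and slope (the names `alpha` and `beta` here are
illustrative); a sketch of the changed blocks:

```
parameters {
  real alpha;             // AR(1) intercept
  real beta;              // AR(1) slope
  real y_mis[N_mis];
  real<lower=0> sigma;
}
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  // regress each value on the previous one
  to_vector(y[2:N]) ~ normal(alpha + beta * to_vector(y[1:(N - 1)]), sigma);
}
```

The `to_vector` conversions allow the scalar-vector arithmetic in the
location term; the slicing is otherwise the same as before.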

Another potential application would be filling the
columns of a data matrix of predictors for which some predictors are
missing; matrix columns can be accessed as vectors and assigned the
same way, as in

```
x[ii_obs_2, 2] = x_obs_2;
x[ii_mis_2, 2] = x_mis_2;
```

where the relevant variables are all hard coded with index `2` because
Stan doesn't support ragged arrays. These could all be packed into a
single array with more fiddly indexing that slices out vectors from
longer vectors (see the [ragged data structures
section](#ragged-data-structs.section) for a general discussion of
coding ragged data structures in Stan).
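
A fuller sketch of this column-filling pattern for a single column,
with illustrative names (`ii_obs_2`, `x_obs_2`, and so on) that follow
the indexing convention used above:

```
data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0> N_obs_2;
  int<lower=0> N_mis_2;
  int<lower=1, upper=N> ii_obs_2[N_obs_2];
  int<lower=1, upper=N> ii_mis_2[N_mis_2];
  vector[N_obs_2] x_obs_2;
}
parameters {
  vector[N_mis_2] x_mis_2;
}
transformed parameters {
  matrix[N, K] x;
  x[ii_obs_2, 2] = x_obs_2;   // observed entries of column 2
  x[ii_mis_2, 2] = x_mis_2;   // missing entries, treated as parameters
  // columns for the other predictors would be filled the same way
}
```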
## Loading Matrix for Factor Analysis

Rick Farouni, on the Stan users group, inquired as to how to build
a Cholesky factor for a covariance matrix with a unit diagonal, as
used in Bayesian factor analysis [@aguilar-west:2000]. This
can be accomplished by declaring the below-diagonal elements as
parameters, then filling the full matrix as a transformed parameter.
```
data {
  int<lower=2> K;
}
transformed data {
  int<lower=1> K_choose_2;
  K_choose_2 = (K * (K - 1)) / 2;
}
parameters {
  vector[K_choose_2] L_lower;
}
transformed parameters {
  cholesky_factor_cov[K] L;
  for (k in 1:K)
    L[k, k] = 1;
  {
    int i = 1;
    for (m in 2:K) {
      for (n in 1:(m - 1)) {
        L[m, n] = L_lower[i];
        L[n, m] = 0;
        i += 1;
      }
    }
  }
}
```

It is most convenient to place a prior directly on `L_lower`.
An alternative would be a prior on the full Cholesky factor `L`;
because the transform from `L_lower` to `L` is just the
identity, it does not require a Jacobian adjustment (despite the
warning from the parser, which is not smart enough to do the code
analysis needed to infer that the transform is linear). It would not
be at all convenient to place a prior on the full covariance matrix
`L * L'`, because that would require a Jacobian adjustment; the exact
adjustment is detailed in the reference manual.

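
For instance, a standard normal prior on the below-diagonal elements
(the prior here is purely illustrative) goes in the model block, where
`L` can then be used directly in a Cholesky-parameterized likelihood:

```
model {
  L_lower ~ normal(0, 1);  // illustrative prior choice
  // L is then usable directly, e.g.,
  // y ~ multi_normal_cholesky(mu, L);
}
```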
## Missing Multivariate Data

It's often the case that one or more components of a multivariate
outcome are missing.^[This is not the same as missing components of a multivariate predictor in a regression problem; in that case, you will need to represent the missing data as a parameter and impute missing values in order to feed them into the regression.]

As an example, we'll consider the bivariate normal distribution, which is
easily marginalized. The coding here is brute force, representing
both an array of vector observations `y` and a boolean array
`y_observed` to indicate which values were observed (others can
have dummy values in the input).
```
vector[2] y[N];
int<lower=0, upper=1> y_observed[N, 2];
```
If both components are observed, we model them using the full
multi-normal, otherwise we model the marginal distribution of the
component that is observed.
```
for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2])
    y[n] ~ multi_normal(mu, Sigma);
  else if (y_observed[n, 1])
    y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  else if (y_observed[n, 2])
    y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}
```

It's a bit more work, but much more efficient, to vectorize these
sampling statements. In the transformed data block, build up three
arrays of indices, one for each of the three cases above:

```
transformed data {
  int ns12[observed_12(y_observed)];
  int ns1[observed_1(y_observed)];
  int ns2[observed_2(y_observed)];
}
```

You will need to write functions that pull out the count of
observations in each of the three sampling situations. This must be
done with functions because the result needs to go in a top-level
block variable size declaration. Then the rest of the transformed
data block just fills in the index arrays using three counters.
```
int n12 = 1;
int n1 = 1;
int n2 = 1;
for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2]) {
    ns12[n12] = n;
    n12 += 1;
  } else if (y_observed[n, 1]) {
    ns1[n1] = n;
    n1 += 1;
  } else if (y_observed[n, 2]) {
    ns2[n2] = n;
    n2 += 1;
  }
}
```
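
The counting functions themselves are left to the reader in the text;
a sketch of what they might look like in the functions block:

```
functions {
  // count of rows with both components observed
  int observed_12(int[,] y_observed) {
    int n = 0;
    for (i in 1:dims(y_observed)[1])
      if (y_observed[i, 1] && y_observed[i, 2])
        n += 1;
    return n;
  }
  // count of rows with only the first component observed
  int observed_1(int[,] y_observed) {
    int n = 0;
    for (i in 1:dims(y_observed)[1])
      if (y_observed[i, 1] && !y_observed[i, 2])
        n += 1;
    return n;
  }
  // count of rows with only the second component observed
  int observed_2(int[,] y_observed) {
    int n = 0;
    for (i in 1:dims(y_observed)[1])
      if (!y_observed[i, 1] && y_observed[i, 2])
        n += 1;
    return n;
  }
}
```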
Then, in the model block, everything is vectorizable
using those indexes constructed once in transformed data:
```
y[ns12] ~ multi_normal(mu, Sigma);
y[ns1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
y[ns2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
```

The result will be much more efficient than using latent variables for
the missing data, but it requires the multivariate distribution to be
marginalized analytically. It'd be more efficient still to precompute
the three index arrays outside of Stan and pass them in as data,
though that efficiency improvement will be relatively minor compared
to the gain from vectorizing the probability functions.

This approach can easily be generalized with some index fiddling to
the general multivariate case. The trick is to pull out entries in
the covariance matrix for the missing components. It can also be used
in situations such as multivariate differential equation solutions
where only one component is observed, as in a phase-space experiment
recording only time and position of a pendulum (and not recording
momentum).
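
The trick can be sketched for a single missingness pattern whose
observed component indices are collected in an index array `obs_idx`
(an illustrative name): the marginal of a multivariate normal over a
subset of components is itself multivariate normal, with the
corresponding subvector of the mean and submatrix of the covariance.

```
// marginal distribution of the observed components only:
// mu[obs_idx] is the mean subvector, Sigma[obs_idx, obs_idx]
// the covariance submatrix selected by multiple indexing
y_sub ~ multi_normal(mu[obs_idx], Sigma[obs_idx, obs_idx]);
```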