Handling of missing data in Estimators #42

Open · tsalo opened this issue Jun 25, 2020 · 5 comments
Labels: enhancement (New feature or request)

@tsalo (Member) commented Jun 25, 2020

Per @tyarkoni in #40 (comment):

> I don't recommend treating voxels with missing data for studies as if the estimates and variances are 0. My naive expectation is that PyMARE estimators will always return a NaN value if you do that (if they don't, let me know!), which would be fine since you could just mask those voxels out. BUT even if that's true, I don't think you want to rely on PyMARE to do the right thing here. If you know you have missing data for at least one study at a voxel, I suggest either setting the estimates and variances for those studies to NaN or (better) just masking them out before you hand off to PyMARE in the first place.

Using DerSimonianLaird as an example estimator, here is the behavior I've found:

  • Any NaNs in y --> tau2 = NaN
  • All zeros in y --> tau2 = 0
  • Any NaNs in v --> tau2 = NaN
  • Any zeros in v --> tau2 = NaN
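
This propagation pattern falls out of the DerSimonian-Laird formula itself. Here's a minimal NumPy sketch (not PyMARE's actual implementation) that reproduces the behavior above:

```python
import numpy as np

def dl_tau2(y, v):
    """DerSimonian-Laird estimate of between-study variance (tau^2).

    y: per-study estimates; v: per-study sampling variances.
    """
    y = np.asarray(y, dtype=float)
    v = np.asarray(v, dtype=float)
    k = len(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        w = 1.0 / v                        # inverse-variance weights; v=0 -> inf
        theta = np.sum(w * y) / np.sum(w)  # fixed-effects pooled estimate
        q = np.sum(w * (y - theta) ** 2)   # Cochran's Q
        denom = np.sum(w) - np.sum(w**2) / np.sum(w)
        # np.maximum (unlike Python's max) propagates NaN rather than hiding it
        return float(np.maximum(0.0, (q - (k - 1)) / denom))

v = [0.1, 0.2, 0.15]
print(dl_tau2([0.5, np.nan, 0.6], v))           # NaN in y   -> nan
print(dl_tau2([0.0, 0.0, 0.0], v))              # all-zero y -> 0.0
print(dl_tau2([0.5, 0.8, 0.6], [0.1, 0, 0.2]))  # zero in v  -> nan
```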

I think the behavior we want is to ignore any studies with NaNs in either y or v, and maybe raise an error when v is 0 (with the recommendation to fill missing data with NaNs instead of zeros) and raise a warning when y is 0. And of course, if all studies have NaN, then all parameters estimated by the estimator should be NaN.

It would then be the user's responsibility to ensure that missing data is represented with NaNs. On NiMARE's side, I can open an issue to do this automatically in MetaEstimator.fit(). I don't think masking is feasible for NiMARE, given that missing data may vary by both study and voxel, so we'd end up with varying numbers of studies contributing to each voxel's meta-analysis.

WDYT?

@tyarkoni (Contributor) commented Jun 26, 2020

I think the current observed behavior is actually what we want. In principle the y values could all be exactly 0; it's just extremely unlikely. Whereas the same is not true of v; you can't have a standard error of exactly 0.

This does mean that users should never pass in values of 0 as y if what they actually mean is "the data is missing". But it's hard to imagine a context where one would have a non-zero v but no value for y, and therefore decide to explicitly set y to 0. So I'm okay with this.

I agree with your suggestion to raise an exception when v=0 and a warning when y=0.

As far as ignoring studies when NaNs are passed, I have mixed feelings. I realize it would make your life easier in this particular case, because it would be nice to always pass in the same voxels and not have to worry about missing values. But I'm a bit apprehensive about making this the default behavior, as it would make it easy for users to overlook NaNs and conclude that their meta-analytic estimates are based on however many data points they passed in (including NaNs).

I think a good compromise is to add an initialization flag to all estimators that specifies how NaNs are handled, where the default is 'error', but users can also set 'drop', which would do listwise deletion, or 'nan', which would just return NaN for the result if any value in the array is NaN. Then you would presumably set missing='nan' from NiMARE.
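
No such flag exists in PyMARE yet; a hypothetical sketch of what the three modes might look like, folding in the v=0 exception and y=0 warning discussed above (all names illustrative):

```python
import warnings

import numpy as np

def handle_missing(y, v, missing="error"):
    """Hypothetical input handling for the proposed `missing` flag."""
    y, v = np.asarray(y, dtype=float), np.asarray(v, dtype=float)
    bad = np.isnan(y) | np.isnan(v)
    if missing == "error":                 # proposed default
        if bad.any():
            raise ValueError(f"{bad.sum()} studies have missing y or v.")
    elif missing == "drop":                # listwise deletion
        y, v = y[~bad], v[~bad]
    elif missing != "nan":                 # 'nan': leave as-is, result comes out NaN
        raise ValueError(f"Unrecognized value for `missing`: {missing!r}")
    if (v == 0).any():
        raise ValueError("v contains 0; represent missing data as NaN, not 0.")
    if (y == 0).any():
        warnings.warn("y contains exact zeros; is this intended?")
    return y, v
```

Under this scheme, NiMARE would construct estimators with missing='nan' and mask out all-NaN voxels afterward, as suggested above.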

@tsalo (Member, Author) commented Jun 26, 2020

> This does mean that users should never pass in values of 0 as y if what they actually mean is "the data is missing". But it's hard to imagine a context where one would have a non-zero v but no value for y, and therefore decide to explicitly set y to 0. So I'm okay with this.

I think that's reasonable.

> I think a good compromise is to add an initialization flag to all estimators that specifies how NaNs are handled, where the default is 'error', but users can also set 'drop', which would do listwise deletion, or 'nan', which would just return NaN for the result if any value in the array is NaN. Then you would presumably set missing='nan' from NiMARE.

That sounds good to me! In addition to the initialization flag, do you think it would be useful to include dof in the model summary? I'm sure we could also use dof maps in NiMARE.

@tyarkoni (Contributor) commented

> That sounds good to me! In addition to the initialization flag, do you think it would be useful to include dof in the model summary? I'm sure we could also use dof maps in NiMARE.

Yes, degrees of freedom should definitely be somewhere prominent in the summary.

I'm a bit apprehensive about DoF maps in NiMARE. Or rather, I definitely think there should be a DoF map, but I'm wary about allowing it to have non-constant values across voxels—i.e., having it show how many studies contributed to the estimate at each voxel. The concern is that users won't have an easy way to tell which studies contributed to which voxels, so different voxels won't be exactly commensurable even if they happen to have the same DoF. Also, I'm guessing most users won't ever actually look at the DoF, and hence won't realize that the estimates in their map don't all come from the same underlying sample. My fairly strong inclination would be to default to either only including voxels that have no missing data at all, or generating separate maps for every combination of studies (which admittedly would be a huge PITA, and clunky from a UX perspective too).

At the very least I think users should explicitly have to take responsibility for allowing variable studies across voxels by setting some aptly named argument, e.g., allow_inconsistent_studies, which sounds like something a bit scary (as it should).

Tagging @nicholst here in case he has thoughts on this issue (i.e., whether/how to handle/display voxels in the same map whose estimates reflect different sets of studies).

@nicholst (Contributor) commented

I concur with your concerns, but I think as soon as anyone starts doing IBMAs (image-based meta-analyses) with any heterogeneity, we're going to get into the variable n (df) issue. I don't see it as being at the top of the feature to-do list, but I'm pretty adamant that we should engineer NiMARE so that it does not rule out, or make impossible, varying n per voxel. I leave the engineering details, and where in the grand scheme the feature should be implemented, to you two.

So, I like some awkward flag that they have to set... that works for me.

I agree, once people routinely use allow_inconsistent_studies it would be good to have visualisation and diagnostics on the missing data. I think FSL gives an image of "sum of all input masks" windowed from N-2 to N, so you see all the places where you're missing just 1 or 2 studies. But, again, not sure that's top priority.
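
As a sketch of that FSL-style diagnostic (everything here is illustrative: no such utility exists in NiMARE yet, and random data stands in for real per-study masks):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-study masks: True where a study has data at a voxel.
# Shape is (n_studies, x, y, z); ~5% of voxels missing per study.
masks = rng.random((20, 10, 10, 10)) > 0.05

n_studies = masks.shape[0]
coverage = masks.sum(axis=0)  # number of studies contributing at each voxel

# Window from N-2 to N: flag voxels missing only one or two studies.
nearly_complete = (coverage >= n_studies - 2) & (coverage < n_studies)
print(f"{nearly_complete.sum()} voxels are missing just 1 or 2 studies")
print(f"{(coverage == n_studies).sum()} voxels have full coverage")
```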

@tyarkoni (Contributor) commented

That sounds reasonable to me. The current implementation already allows varying n per voxel, so that's not a problem. It's just a question of how to present that to the user, and what they have to do to enable it. I think having a fairly angry argument name should be sufficient to get people to at least consult the docstring the first time. (Though I agree with you that over time it's likely to become a stylized fact in the community that one should always enable this setting.)
