
Clarify or check Estimator input shapes #37

Open
tsalo opened this issue Jun 11, 2020 · 4 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@tsalo (Member) commented Jun 11, 2020

I was trying to run one of the Estimators using weighted_least_squares without initializing a Dataset, and got confusing errors from numpy.einsum before realizing that the inputs need to be 2D no matter what. We could coerce 1D inputs to 2D with a new Estimator._validate_inputs() method, or just update the docstrings to clarify the requirement.

Based on variable-naming convention, it's reasonable to assume that X must be 2D, but it's not obvious that y, v, etc. should be 2D as well. Indeed, a 1D X raises an obvious shape error, but there is no shape check for y, v, etc.

To replicate:

import numpy as np
import pymare

y = np.random.random(10)
v = np.random.random(10) ** 2
X = np.random.random((10, 1))
est = pymare.estimators.DerSimonianLaird()
est.fit(y=y, v=v, X=X)
print(est.results.to_df())

Results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-41d64261b276> in <module>()
      3 X = np.random.random((10, 1))
      4 est = pymare.estimators.DerSimonianLaird()
----> 5 est.fit(y=y, v=v, X=X)
      6 print(est.results.to_df())

~/Documents/tsalo/PyMARE/pymare/estimators/estimators.py in fit(self, dataset, **kwargs)
     78                     kwargs[name] = getattr(dataset, name)
     79 
---> 80         self.params_ = self._fit(**kwargs)
     81         self.dataset_ = dataset
     82 

~/Documents/tsalo/PyMARE/pymare/estimators/estimators.py in _fit(self, y, v, X)
    180 
    181         # Estimate initial betas with WLS, assuming tau^2=0
--> 182         beta_wls, inv_cov = weighted_least_squares(y, v, X, return_cov=True)
    183 
    184         # Cochrane's Q

~/Documents/tsalo/PyMARE/pymare/stats.py in weighted_least_squares(y, v, X, tau2, return_cov)
     24 
     25     # Einsum indices: k = studies, p = predictors, i = parallel iterates
---> 26     wX = np.einsum('kp,ki->ipk', X, w)
     27     cov = wX.dot(X)
     28 

<__array_function__ internals> in einsum(*args, **kwargs)

~/anaconda/envs/python3/lib/python3.6/site-packages/numpy/core/einsumfunc.py in einsum(*operands, **kwargs)
   1354     # If no optimization, run pure einsum
   1355     if optimize_arg is False:
-> 1356         return c_einsum(*operands, **kwargs)
   1357 
   1358     valid_einsum_kwargs = ['out', 'dtype', 'order', 'casting']

ValueError: einstein sum subscripts string contains too many subscripts for operand 1
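For what it's worth, the failure can be reproduced with NumPy alone, and coercing the weights to a column vector (the 2D form a Dataset would hand in) avoids it. The 'kp,ki->ipk' subscripts are copied from pymare/stats.py in the traceback above; everything else is a standalone sketch, not PyMARE code:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.random(10) ** 2       # 1D variances, shape (10,)
X = rng.random((10, 1))       # 2D design matrix, shape (10, 1)

w = 1.0 / v                   # inverse-variance weights, still 1D

# The einsum in pymare/stats.py assigns two subscripts ('ki') to the
# weights, so a 1D operand raises the ValueError from the traceback:
err = None
try:
    np.einsum('kp,ki->ipk', X, w)
except ValueError as exc:
    err = exc
print(err)

# Reshaping to a column vector (k studies x 1 parallel iterate) fixes it:
wX = np.einsum('kp,ki->ipk', X, w[:, None])
print(wX.shape)               # (1, 1, 10): (iterates, predictors, studies)
```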
tsalo added the documentation and enhancement labels on Jun 11, 2020
@tyarkoni (Contributor):
The intended usage pattern for fit() is actually to always pass in a Dataset (which will do the reshaping for you automatically). The only reason **kwargs are accepted (and I'm not entirely happy with it, TBH) is the Stouffers and Fishers estimators, which take non-standard arguments that don't make sense to put in a Dataset. Given that you ran into this already, and others probably will too, that probably argues in favor of giving CombinationTest its own fit() method. But if you're going to call fit() with arguments, then yes, they'll need to be already 2D. We could clarify that in the docs, though again, I'm not sure I want to encourage people to use fit() with non-Dataset args.

The main reason for not adding a validation/reshaping method is that it seems to me to fall into a no-man's-land between the convenience of the Dataset abstraction and the efficiency of passing in already-2D variables. That is, if you're okay having the validation done for you, there's very little additional overhead to constructing a Dataset, which is a nice reusable container (and would have to be constructed anyway if you want to make use of any of the Results classes, which require it).

@tsalo (Member, Author) commented Jun 11, 2020

Then it sounds like dropping **kwargs and giving CombinationTest its own fit is the best way to go. I'll change the title and update the summary. Thanks!

Are Datasets going to still be the preferred inputs for vectorized calls (e.g., neurostuff/NiMARE#211)?

@tyarkoni (Contributor):
Oh, but note also that there's an ensure_2d helper in stats you can call if you want to make sure your variables are in the right shape to pass to fit(). I just don't know that we want to make that mandatory, as it does add a tiny bit of overhead, which could add up if you're running in a loop (e.g., over voxels).
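A minimal NumPy-only sketch of what an ensure_2d-style helper presumably does (this is a hypothetical re-implementation for illustration; see pymare.stats.ensure_2d for the real one, whose exact behavior may differ):

```python
import numpy as np

def ensure_2d_sketch(arr):
    """Coerce a 1D array to a column vector; pass 2D input through.

    Hypothetical stand-in for pymare.stats.ensure_2d, for illustration only.
    """
    if arr is None:
        return None
    arr = np.asarray(arr)
    if arr.ndim == 1:
        arr = arr[:, None]
    return arr

print(ensure_2d_sketch(np.arange(10)).shape)     # (10, 1)
print(ensure_2d_sketch(np.ones((10, 2))).shape)  # (10, 2)
```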

@tyarkoni (Contributor) commented Jun 11, 2020

> Are Datasets going to still be the preferred inputs for vectorized calls (e.g., neurostuff/NiMARE#211)?

It's probably worth doing a bit of profiling to see how much overhead the Dataset adds. My guess is that as long as you're not naively looping, it's too little to worry about. In the case where you're doing something massively parallel, I think the optimal approach is to ensure that your variables are already in the desired (2d) form in each iteration, and then call fit() (or even _fit(), if we remove kwargs from fit()), passing in arrays.
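To make the "parallel iterates" layout concrete: each column of the 2D y/v arrays is one iterate (e.g., one voxel), so the einsum in pymare/stats.py processes all of them in a single vectorized call. A shape-only sketch below, with arbitrary example dimensions; the subsequent est.fit(y=y, v=v, X=X) call is assumed usage and not run here:

```python
import numpy as np

k, n_voxels, p = 10, 50, 2     # studies, parallel iterates, predictors

y = np.random.random((k, n_voxels))       # one column per voxel
v = np.random.random((k, n_voxels)) ** 2  # matching variances
X = np.random.random((k, p))              # design matrix, shared by iterates

# Mirrors the einsum indices quoted from pymare/stats.py:
#   k = studies, p = predictors, i = parallel iterates
w = 1.0 / v                               # weights with tau^2 = 0 assumed
wX = np.einsum('kp,ki->ipk', X, w)        # shape (n_voxels, p, k)
cov = wX.dot(X)                           # per-voxel X'WX, shape (n_voxels, p, p)
print(wX.shape, cov.shape)
```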
