
Clarify or check Estimator input shapes #37

Open
tsalo opened this issue Jun 11, 2020 · 4 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@tsalo (Member) commented Jun 11, 2020

I was trying to run one of the Estimators using weighted_least_squares without initializing a Dataset, and got confusing errors from numpy.einsum before realizing that the inputs need to be 2D no matter what. We could coerce 1D inputs to 2D with a new Estimator._validate_inputs() method, or just update the docstrings to clarify the requirement.

Based on variable-naming convention, it's reasonable to assume that X must be 2D, but it's not obvious that y, v, etc. should be 2D as well. Indeed, a 1D X raises an obvious shape error, but there is no shape check for y, v, etc.

To replicate:

import numpy as np
import pymare

y = np.random.random(10)
v = np.random.random(10) ** 2
X = np.random.random((10, 1))
est = pymare.estimators.DerSimonianLaird()
est.fit(y=y, v=v, X=X)
print(est.results.to_df())

Results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-41d64261b276> in <module>()
      3 X = np.random.random((10, 1))
      4 est = pymare.estimators.DerSimonianLaird()
----> 5 est.fit(y=y, v=v, X=X)
      6 print(est.results.to_df())

~/Documents/tsalo/PyMARE/pymare/estimators/estimators.py in fit(self, dataset, **kwargs)
     78                     kwargs[name] = getattr(dataset, name)
     79 
---> 80         self.params_ = self._fit(**kwargs)
     81         self.dataset_ = dataset
     82 

~/Documents/tsalo/PyMARE/pymare/estimators/estimators.py in _fit(self, y, v, X)
    180 
    181         # Estimate initial betas with WLS, assuming tau^2=0
--> 182         beta_wls, inv_cov = weighted_least_squares(y, v, X, return_cov=True)
    183 
    184         # Cochrane's Q

~/Documents/tsalo/PyMARE/pymare/stats.py in weighted_least_squares(y, v, X, tau2, return_cov)
     24 
     25     # Einsum indices: k = studies, p = predictors, i = parallel iterates
---> 26     wX = np.einsum('kp,ki->ipk', X, w)
     27     cov = wX.dot(X)
     28 

<__array_function__ internals> in einsum(*args, **kwargs)

~/anaconda/envs/python3/lib/python3.6/site-packages/numpy/core/einsumfunc.py in einsum(*operands, **kwargs)
   1354     # If no optimization, run pure einsum
   1355     if optimize_arg is False:
-> 1356         return c_einsum(*operands, **kwargs)
   1357 
   1358     valid_einsum_kwargs = ['out', 'dtype', 'order', 'casting']

ValueError: einstein sum subscripts string contains too many subscripts for operand 1
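For what it's worth, the failure can be reproduced with NumPy alone, and coercing the weights to a column vector (the 2D form a Dataset would hand in) avoids it. The 'kp,ki->ipk' subscripts are copied from pymare/stats.py in the traceback above; everything else is a standalone sketch, not PyMARE code:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.random(10) ** 2       # 1D variances, shape (10,)
X = rng.random((10, 1))       # 2D design matrix, shape (10, 1)

w = 1.0 / v                   # inverse-variance weights, still 1D

# The einsum in pymare/stats.py assigns two subscripts ('ki') to the
# weights, so a 1D operand raises the ValueError from the traceback:
err = None
try:
    np.einsum('kp,ki->ipk', X, w)
except ValueError as exc:
    err = exc
print(err)

# Reshaping to a column vector (k studies x 1 parallel iterate) fixes it:
wX = np.einsum('kp,ki->ipk', X, w[:, None])
print(wX.shape)               # (1, 1, 10): (iterates, predictors, studies)
```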
tsalo added the documentation and enhancement labels on Jun 11, 2020
@tyarkoni (Contributor):
The intended usage pattern for fit() is actually to always pass in a Dataset (which will do the reshaping for you automatically). The only reason **kwargs are accepted (and I'm not entirely happy with it, TBH) is the Stouffers and Fishers estimators, which take non-standard arguments that don't make sense to put in a Dataset. Given that you ran into this already, and others probably will too, that probably argues in favor of giving CombinationTest its own fit() method. But if you're going to call fit() with arguments, then yes, they'll need to be already 2D. We could clarify that in the docs, though again, I'm not sure I want to encourage people to use fit() with non-Dataset args.

The main reason for not adding a validation/reshaping method is that it seems to me to fall into a no-man's-land between the convenience of the Dataset abstraction and the efficiency of passing in already-2D variables. That is, if you're okay having the validation done for you, there's very little additional overhead to constructing a Dataset, which is a nice reusable container (and would have to be constructed anyway if you want to make use of any of the Results classes, which require it).

@tsalo (Member, Author) commented Jun 11, 2020

Then it sounds like dropping **kwargs and giving CombinationTest its own fit is the best way to go. I'll change the title and update the summary. Thanks!

Are Datasets going to still be the preferred inputs for vectorized calls (e.g., neurostuff/NiMARE#211)?

@tyarkoni (Contributor):
Oh, but note also that there's an ensure_2d helper in stats you can call if you want to make sure your variables are in the right shape to pass to fit(). I just don't know that we want to make that mandatory, as it does add a tiny bit of overhead, which could add up if you're running in a loop (e.g., over voxels).
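A minimal NumPy-only sketch of what an ensure_2d-style helper presumably does (this is a hypothetical re-implementation for illustration; see pymare.stats.ensure_2d for the real one, whose exact behavior may differ):

```python
import numpy as np

def ensure_2d_sketch(arr):
    """Coerce a 1D array to a column vector; pass 2D input through.

    Hypothetical stand-in for pymare.stats.ensure_2d, for illustration only.
    """
    if arr is None:
        return None
    arr = np.asarray(arr)
    if arr.ndim == 1:
        arr = arr[:, None]
    return arr

print(ensure_2d_sketch(np.arange(10)).shape)     # (10, 1)
print(ensure_2d_sketch(np.ones((10, 2))).shape)  # (10, 2)
```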

@tyarkoni (Contributor) commented Jun 11, 2020

> Are Datasets going to still be the preferred inputs for vectorized calls (e.g., neurostuff/NiMARE#211)?

It's probably worth doing a bit of profiling to see how much overhead the Dataset adds. My guess is that as long as you're not naively looping, it's too little to worry about. In the case where you're doing something massively parallel, I think the optimal approach is to ensure that your variables are already in the desired (2d) form in each iteration, and then call fit() (or even _fit(), if we remove kwargs from fit()), passing in arrays.
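To make the "parallel iterates" layout concrete: each column of the 2D y/v arrays is one iterate (e.g., one voxel), so the einsum in pymare/stats.py processes all of them in a single vectorized call. A shape-only sketch below, with arbitrary example dimensions; the subsequent est.fit(y=y, v=v, X=X) call is assumed usage and not run here:

```python
import numpy as np

k, n_voxels, p = 10, 50, 2     # studies, parallel iterates, predictors

y = np.random.random((k, n_voxels))       # one column per voxel
v = np.random.random((k, n_voxels)) ** 2  # matching variances
X = np.random.random((k, p))              # design matrix, shared by iterates

# Mirrors the einsum indices quoted from pymare/stats.py:
#   k = studies, p = predictors, i = parallel iterates
w = 1.0 / v                               # weights with tau^2 = 0 assumed
wX = np.einsum('kp,ki->ipk', X, w)        # shape (n_voxels, p, k)
cov = wX.dot(X)                           # per-voxel X'WX, shape (n_voxels, p, p)
print(wX.shape, cov.shape)
```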
