cca_zoo.model_selection.GridSearchCV does not work when estimator has more than one latent dimension and scorer function is not provided #150

Open
JohannesWiesner opened this issue Oct 7, 2022 · 1 comment

Comments

@JohannesWiesner
Contributor

I noticed that cca_zoo.model_selection.GridSearchCV throws a ValueError when the estimator has more than one latent dimension and the user leaves cca_zoo.model_selection.GridSearchCV(scoring=None) as is. I am not sure, but if I remember the behavior of sklearn.model_selection.GridSearchCV correctly (and I guess you would like cca_zoo.model_selection.GridSearchCV to behave the same way), the idea is that, unless a scorer is provided, GridSearchCV simply uses the .score() method of the given estimator? Judging from the docstrings of a couple of your estimators, that should be:

the average pairwise correlation between the views

Example:

from cca_zoo.models import SCCA_PMD
import numpy as np
from cca_zoo.model_selection import GridSearchCV

# create data
rng = np.random.RandomState(0)
X1 = rng.random((100,5))
X2 = rng.random((100,5))

# set latent dims (the ValueError is reported for values greater than 1)
latent_dims=1

# run cross validation
estimator = SCCA_PMD(latent_dims=latent_dims,random_state=rng,c=[1,1])
param_grid = {'c':[[0.1,0.2],[0.1,0.2]]}
grid = GridSearchCV(estimator,
                    param_grid=param_grid,
                    cv=2)
grid.fit([X1,X2])

# run score
estimator.fit([X1,X2])
print(estimator.score([X1,X2]))
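
A possible explanation for the error, sketched under assumptions: scikit-learn's cross-validation machinery expects a scorer to return a single number, while .score() on a model with several latent dimensions appears to return one value per dimension. The helper below, as_scalar_score, is a hypothetical stand-in written for illustration only (it is not scikit-learn's or cca_zoo's code). It accepts a one-element result but rejects a longer one with a ValueError, which would be consistent with the example above working for a single latent dimension and failing for more.

```python
from numbers import Number


def as_scalar_score(score):
    """Hypothetical stand-in for a 'scorer must return one number' check."""
    if isinstance(score, Number):
        return float(score)
    # A one-element sequence can be unwrapped without ambiguity.
    if len(score) == 1:
        return float(score[0])
    # Several values (one per latent dimension) cannot be ranked as-is.
    raise ValueError(f"scoring must return a number, got {score!r} instead")


# One latent dimension: a single correlation, accepted.
single = as_scalar_score([0.12])

# Two latent dimensions: two correlations, rejected.
try:
    as_scalar_score([0.12, 0.05])
    multi_error = None
except ValueError as err:
    multi_error = str(err)
```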

Of course, the whole problem is easy to solve by simply providing the scorer function yourself (adapted from your docs):

def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

But I guess it would still be nice if this happened automatically? :)
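
One way such a default could look, as a minimal sketch: when scoring is None, fall back to a scorer that averages whatever .score() returns. The names resolve_scoring, mean_score_scorer and DummyCCA are hypothetical and used only for illustration; none of them is part of cca_zoo, and the real library may handle this differently.

```python
from numbers import Number
from statistics import fmean


def mean_score_scorer(estimator, views):
    """Reduce the estimator's score to a single number by averaging."""
    scores = estimator.score(views)
    if isinstance(scores, Number):
        # score() already returned a plain number
        return float(scores)
    return fmean(scores)


def resolve_scoring(scoring=None):
    """Return the user's scorer, or the averaging default when none is given."""
    return mean_score_scorer if scoring is None else scoring


class DummyCCA:
    """Toy estimator whose score() returns one value per latent dimension."""

    def score(self, views):
        return [0.5, 0.25]


# With scoring=None the averaging default is used.
scorer = resolve_scoring(None)
default_score = scorer(DummyCCA(), views=None)
```

A user-supplied callable passed to resolve_scoring would be returned unchanged, so explicit scorers would keep working as before.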

@JohannesWiesner
Contributor Author

@jameschapman19: I think you can close this issue for now. I don't know why (perhaps it is related to the latest commits in cca_zoo or scikit-learn), but it seems to work now. Here's some working code:

import numpy as np
from cca_zoo.models import GRCCA
from cca_zoo.model_selection import GridSearchCV

# create two random matrices and pretend that each of them has two feature
# groups
rng = np.random.RandomState(0)
X1 = rng.random((100,4))
X2 = rng.random((100,4))
feature_groups = [np.array([0,0,1,1]),np.array([0,0,1,1])]
latent_dims=2
estimator = GRCCA(latent_dims=latent_dims,random_state=rng)

# define a search space (optimize left and right penalty parameters)
c1 = [0,0.5,1]
c2 = [0,0.5,1]
mu1 = [0,0.5,1]
mu2 = [0,0.5,1]
param_grid = {'c':[c1,c2],'mu':[mu1,mu2]}

# FIXME: See issue #150: Defining this scorer function should actually 
# not be necessary, because this should be the default scoring function
# for all CCA-base classes
def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator,param_grid,scoring=scorer)
grid.fit([X1,X2],estimator__feature_groups=feature_groups)
estimator_best = grid.best_estimator_
scores = grid.cv_results_

The magic lies in passing estimator__feature_groups=feature_groups to GridSearchCV's .fit() method.
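
The double-underscore prefix looks like scikit-learn's `<step>__<parameter>` routing convention, in which a keyword passed to .fit() is forwarded to the named step with the prefix removed. Here is a minimal sketch of that idea, assuming the search wraps the model under a step called "estimator"; route_fit_params is a hypothetical helper, not a function from either library.

```python
def route_fit_params(fit_params, step="estimator"):
    """Split keyword arguments into those addressed to `step` and the rest."""
    prefix = step + "__"
    routed, others = {}, {}
    for name, value in fit_params.items():
        if name.startswith(prefix):
            # Strip the "<step>__" prefix so the wrapped model sees the bare name.
            routed[name[len(prefix):]] = value
        else:
            others[name] = value
    return routed, others


feature_groups = [[0, 0, 1, 1], [0, 0, 1, 1]]
routed, others = route_fit_params({"estimator__feature_groups": feature_groups})
# `routed` maps "feature_groups" to the group labels; `others` is empty here.
```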
