Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of kmeanscrossvalidator #61

Merged
merged 14 commits into from
Jul 3, 2019

Conversation

kayhoogland
Copy link
Contributor

First version of the kmeans cross validator discussed here #5

sklego/model_selection.py Outdated Show resolved Hide resolved
sklego/model_selection.py Outdated Show resolved Hide resolved
@MBrouns
Copy link
Collaborator

MBrouns commented Mar 28, 2019

Maybe it's a bit nitpicky, but currently this implementation is fully built on KMeans whereas any clusteringmethod should work right? Maybe something for the future to make the CV take any clusterer as a parameter

@koaning
Copy link
Owner

koaning commented Mar 28, 2019

@MBrouns I agree. Related: I think KlusterFoldValidation has a nice ring to it.

@koaning
Copy link
Owner

koaning commented Mar 28, 2019

@kayhoogland whats your opinion on this? it feels weird to merge this knowing that we will add a more custom thing as well. (also ... typically clustering algorithms are very prone to overfitting on a column if there's no standardisation so i think we don't just want to accept multiple clustering algorithms but that we want to allow for pipelines)

@kayhoogland
Copy link
Contributor Author

I agree, it seems the logical thing to do. Also, the name is indeed catchy ;)

@kayhoogland
Copy link
Contributor Author

kayhoogland commented Apr 5, 2019

I did some changes to enable for more clustering methods. A difficult thing to take into account is the n_splits from _BaseKFold. Some (most) clustering methods (for example DBScan) generate splits based on the data itself.

koaning
koaning previously requested changes Apr 5, 2019
sklego/model_selection.py Outdated Show resolved Hide resolved
sklego/model_selection.py Outdated Show resolved Hide resolved
@koaning koaning closed this Jun 18, 2019
@koaning
Copy link
Owner

koaning commented Jun 18, 2019

closed it because i hadnt heard from it in a while. ill gladly re-open if there's more attention for it.

@MBrouns MBrouns reopened this Jun 19, 2019
@MBrouns
Copy link
Collaborator

MBrouns commented Jun 19, 2019

I just discussed this with @kayhoogland and we see a good way going forward. I'm putting my feedback in a revie

@kayhoogland kayhoogland force-pushed the feature_kmeanscrossvalidator branch from 2541602 to efaef5e Compare July 3, 2019 07:46
if isinstance(X, pd.DataFrame):
X = X.values

clusters = self.cluster_method.fit_predict(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For not refitting if self.cluster_method is already fitted:

from sklearn.exceptions import NotFittedError
try:
    clusters = self.cluster_method.predict(X)
except NotFittedError:
    clusters = self.cluster_method.fit_predict(X)
 

super(KlusterFoldValidation, self).__init__(n_splits=3,
shuffle=False,
random_state=random_state)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably nice to set self.n_splits = None here

sklego/model_selection.py Outdated Show resolved Hide resolved
@MBrouns MBrouns dismissed koaning’s stale review July 3, 2019 11:05

Requested changes are made

@MBrouns MBrouns merged commit c41cacd into koaning:master Jul 3, 2019
@koaning
Copy link
Owner

koaning commented Jul 3, 2019

w00t.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants