[BUG] Incompatible API between julearn and scikit-learn pipelines #141

fraimondo · 2021-06-15T10:41:02Z

Describe the bug
When using any Grouped CV scheme for hyperparameter tuning, the fitting of the final estimator fails because there is no groups parameter defined:

julearn/julearn/api.py

Line 211 in 1c57332

pipeline.fit(df_X_conf, y)

Adding the group parameter at this line fails because Julearn pipeline API is not exactly as in scikit_learn: https://github.com/scikit-learn/scikit-learn/blob/15a949460dbf19e5e196b8ef48f9712b72a3b3c3/sklearn/pipeline.py#L314

We are missing the **fit_params.

To Reproduce
Steps to reproduce the behavior:

Check the associated PR
Run:

from test_api import test_tune_hyperparam
test_tune_hyperparam()

Enjoy the stack trace

Expected behavior
A clear and concise description of what you expected to happen.

No error, hyperparameter tuning should work fine. Test should pass.

The text was updated successfully, but these errors were encountered:

samihamdan · 2021-07-06T12:29:10Z

I see the problem, but before I can solve it we need to clarify the api as the test does not provide the needed information.
In the test you set up GroupedKFold for inner cv.
You do not set groups argument in run_cross_validation and we do not have a fit_params argument for run_cross_validation.
As far as I understand, in sklearn cross_validate you would set groups = groups for the outer cv
or for inner cv fit_params = {groups:groups} in cross_validate.

Once we decide where in run_cross_validation we want to set groups for inner cv, I should be able to fix the rest as well.

The question is now:
Where should groups for inner cv go to:

search_params inside model_params as other arguments for searchers,
even though this is not for the constructor of the searcher.
new added fit_params for run_cross_valdiation
we use groups of run_cross_validation for both inner and outer cv
=> which would not work when both are GroupedKFold
Or some other option I did not yet consider.

fraimondo · 2021-07-06T15:02:52Z

You are right, forgot to add the groups parameter in the CV test. I'm updating the PR.

However, I don't think it's possible to do grouped stuff in the inner CV in sklearn. The "groups" parameter will depend on the outer CV split.

fraimondo · 2021-07-06T15:03:52Z

https://stackoverflow.com/questions/60996995/use-groupkfold-in-nested-cross-validation-using-sklearn

samihamdan · 2021-07-06T17:07:57Z

We checked and found that the function _fit_and_score which is used by cross_validate (
https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/model_selection/_validation.py#L576 ) uses _check_fit_params which subsets array params to only include the train indeces.
(https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/utils/validation.py#L1456)

Together with manually checking this shows that one can pass in groups as fit_params in cross_validate.
Therefore, we decided to always include groups if its not None into the fit_params
This allows users to use groups for both inner and outer cv, but not different groups for each.

fraimondo · 2022-07-21T08:09:58Z

Coming back to the issue. I think this is what triggered a big refactor.

@samihamdan

* MODIFY Internals initial * new api examples * MODIFY Internals initial * Make JuTransformer work * Hyper Tuning + new API working Co-authored-by: Sami Hamdan <sami.hamdan@hhu.de> * fix for deslib import * Estimators test OK + no more binary/multiclass * model selection tests ok * scoring tests OK * merge conflict * transformers + utils test ok * Fix CI * codespell * small clean up * fix do_scoring tests * Stacked classifiers * For Sami to fix * Fix rebase * Fix pipeline tests * Improve SetColumnTypes * Improve some Tests on X_types * Make Stacking work with filtering * example ready * Make stares shine as regexes * Adjust warnings and errors +testing for pipeline.py * Make models use apply_to and needed_types * fixed print * remove old returned features from julearn * rename and clean julearn.estimators to julearn.models * remove default apply_to from available transformers * Modify prepare clean up a bit * Adjust confound removal, get rid of pick_columns, modify JuModel, JuTransformer, JuBaseEstiamtor * refactor juestimators DataFrameConfoundRemover add ColumnTypes * Modify examples * Modify/Add important Docstrings * Correct: check consitency imports * Example/stratifiedkfoldcv (#181) * Adjust example stratifiedkfoldcv * Integrade ColumnTypes in pipeline * Oops left print * Refactor Structure to have a base module for ColumnTypes and JuBaseClasses * Move JuColumnTransformer and make setting default of apply to in PipelienCreator possible * Fix tests for old versions * Add preprocess inspection tool * Modify inspect * Modify inspect * get the right dependency for CI * Fix lint * Fix codespell * Fix test api, but Sami broke it :( * Trigger CI * Fix coverage * Trigger CI * Adjust dataclass for python=3.11 * Bring back return train scores * updated example to inspect preprocessing steps (#182) * Add docstrings to column_types * Fix for X_types checks in pipeline * add/update docstrings to DataFrameConfoundRemover (#184) * add/update docstrings to DataFrameConfoundRemover * Make some test green again * Adjust preprocess to remove column_types per default and improve import * Fix tests * Bringing back the prepare testing * Sk pandas examples (#185) * updated multiclass classification example * updated inspect SVM example Co-authored-by: More, Shammi <s.more@fz-juelich.de> Co-authored-by: Sami Hamdan <sami.hamdan@hhu.de> Co-authored-by: Fede Raimondo <f.raimondo@fz-juelich.de> * Adjust target transforming * Fix building doc * disable examples * fix stacked classifier example * Fix linter * update: add default apply_to to pipeline creator * LINTEEER * update: combine pandas examples running * Fix codespell * Basic regress example (#188) * fixed regression example * formatted with black * One more example * One more example * Random forest example * Stratified kfold for regression example * Make Scoring with classification work (#191) * Make Scoring with classificication work * flake * Refactor column types * Flake8 * Fix PipelineCreator and problem type issues * Linter * chore: update `pyproject.toml` (#189) * fix: use text instead of file for license to make PyPI happy * fix: correct format for authors and maintainers in pyproject.toml * chore: use lowercase for keywords in pyproject.toml * chore: make package classifier development status to 5 in pyproject.toml Co-authored-by: Fede Raimondo <federaimondo@gmail.com> * Add example transforming the target variable with zscore (#197) * Add example transforming the target variable with zscore * Correcting script with flake8 * Correct white space in title * Adding = to title * fix syntax * Removing white lines to new syntax * Removing white line for title * Remove useless plot Co-authored-by: Paas Oliveros, Lya Katarina <l.paas.oliveros@fz-juelich.de> Co-authored-by: Fede Raimondo <fraimondo@dc.uba.ar> * update confound removal example (#196) * update confound removal example * change name of file * change name of file * Fix confound removal * Spelling Co-authored-by: Fede <fraimondo@proton.me> * Sk pandas examples (#190) * added pipeline creator * updated multiclass classification example * updated inspect SVM example * updated multiclass classification example * formatted dis_plot_groupcv_inspect_svm.py * formatted plot_cm_acc_multiclass.py Co-authored-by: More, Shammi <s.more@fz-juelich.de> Co-authored-by: Fede Raimondo <federaimondo@gmail.com> * Fix hyperparameter tuning and remove "wrapping" from model and transformer name (#200) * Fix Hyperparameter tuning * Remove Wrapper naming convention Co-authored-by: Sami Hamdan <sami.hamdan@hhu.de> * Example pca featsets (#198) * [not working] example for two PCAs * Fix double PCA issue * [working] multiple PCA steps * Fix remainder in name * Add a name in PipelineCreator.add * [working] multiple PCA steps with preprocess showcase * rename * rename * Codespell * Fix doc * Fix scoring Co-authored-by: Fede <fraimondo@proton.me> * Adjust scoring to work with registered once (#203) * Adjust scoring to work with registered once * Revert back the problem type to not be overwritable * Revert back the problem type to not be overwritable * Grouped cv example (#205) * Added new file for grouped CV. * Created example of Grouped CV. * deleted the old file of Grouped CV and created the new file for that. * Target transformer (#215) * WIP: Target transfomers + pipeline * WIP * damn git ignore * The example works * A todo for @samihamdan * Type hints + docstrings in progress * Base module ready * Typing + docstrings of conftest * Tests for inspect preprocessing + more types * Fix linting * model selection with types and docstrings * Fix linter * More tests with docstrings and typing hints * linter * Models sub-package done * Some more tests + typing + docstrings for the pipeline module * tests + docstrings + types for TargetPipelineCreator * Almost done with tranformers.target, but got tired... * Test for target confound remover done * Some WIP... go sick and I don't remember where I was * Finished testing confounds * JuColumnTransformer tested * More fixes * Updated prepare and tests * Updates to the docstings * Basic API test * I think the API is tested for the moment * All test pass, no warnings * Updated examples * fix linting * Codespell * codespell doesn't like the new verb we invented * ENH: allow to set X_types using regexp * some symmetry in the log * Fix tests * Fix for leo * WIP: model comparison * add corrected t_test * Add dependencies + fix other tets * fix test for user-specified splits * fix test * chore: improve ci-docs.yml * chore: improve ci.yml * chore: improve docs-preview.yml * chore: improve docs.yml * chore: improve pypi.yml * chore: isort * chore: black * chore: flake8 fixes * Retry * fix coverage * Add deslib for testing in tox * increase coverage * test api with apply_to * Cbpm sum (#221) * Minimal version of CPM without weighting but sum as default, only old test adjusted * Flake8 * lint * Resolve set_output bug in cbpm * Fix lint * Julearn sk pandas row select (#220) * Add row_select_col * Add test row_select simple * Adjust some logic * Lint * Adjust row selection to be accessible from PipelineCreator * Adjust row selection via column type not type. * Test and fix change_column_types transformer * fix #213 and #141 * add return confound example * fix groups * fix for [ENH] Validate run_cross_validate parameters #178 * fix for #214 * Inspection (#223) * ADD inspection tools * Add get_fitted_params to PipelineInspector * Add bugfix #219 * Fix preprocess import * adjust pipe * adjust pipe * CV Inpsection WIP * [WIP] Inspection tool (needs testing) * sort the index * Fix tests and ling * Add some test and func for _cv * Add inspection and test, fix predict_log_proba remove unnecessary todo notes * Add tests inspector * Add tests * New docs (#228) * Update docs theme to furo * Add content to index * Add new chapter * modify requirements * Add copybutton extension * Add installation related info to get started * Installation was replaced by getting started * setup structure for chapter 4 * Add messy chapter 3 to push uptodate * Number files based on chapter order * Add basic index for chapter 4 and reference * Change numbering to display in real time * Add chapter 4 to toctree * Comment data.rst to make build possible * Work in progress push for Leo * initial cbpm * some formatting * add introductory paragraph for confound removal and some references * add docs on confound removal * fix typo * capitalise titles * Work in progress push Leo * Work in progress push for Leo * Clean up made changes * data section preliminary version done * available pipeline steps * clean the docs * correct layout error * structure and add references * modify indices * Update docs theme to furo * Add content to index * Add new chapter * modify requirements * Add copybutton extension * Add installation related info to get started * Installation was replaced by getting started * setup structure for chapter 4 * Add messy chapter 3 to push uptodate * Number files based on chapter order * Add basic index for chapter 4 and reference * Change numbering to display in real time * Add chapter 4 to toctree * Comment data.rst to make build possible * Work in progress push for Leo * initial cbpm * some formatting * add introductory paragraph for confound removal and some references * add docs on confound removal * fix typo * capitalise titles * Work in progress push Leo * Work in progress push for Leo * Clean up made changes * data section preliminary version done * available pipeline steps * clean the docs * correct layout error * structure and add references * modify indices * WIP: Docs fixing, rearanging API * restructure * Add description of pipeline * Update examples structure + API reference * work in progress * Some refactoring of the what really need know section * fix CI * remove comments --------- Co-authored-by: Vera Komeyer <v.komeyer@fz-juelich.de> Co-authored-by: LeSasse <l.sasse@fz-juelich.de> * refactor index * Add general CV information * Add target preprocessing and refactor * add reference * Add model evaluation illustration pictures * Add further needed scikitlearn links * Correct typo * Add cross-reference link * Add cv-splitter as potential further selcted topic * Add model evaluation content * Fix doc sectioning and order * Fix TOCTREE * One less numbered element in the TOC * Add/viz (#227) * VIZ WIP * add viz to the dependencies for docs * add viz dependencies to tox * fix linter * delete old files * add example back * delete this file * More doc on VIZ * Add stats to viz * Finish with VIZ docs * Fix linter * Fix tests * Adaptations for proper toctree * Rearrange content to solve toctree * Minor clean-ups * Add visualization information * Fix building * added initial version of stacking documentation for julearn * fix markup in cbpm * some slight improvements to stacked model documentation * Adjust tests and fix bugs for dataframe transformers * fix bug no hyperparameters * fix bug hyperparams with target * fix bug hyperparams with target * ADD tests not fitted error target transformer * ADD test returning inspection tool with differen return_estimator settings * ADD test rais errors pipeline * MODIFY insect to allow for gridsearch * initial start with model inspection * initial version of model_inspect.rst * a few small typo fixes * fix bug in which estimators were not picklable due to local function definition inside a function in base/column_types.py * whitespaces * apparently codespell does not like .sav as a file extension and thinks it should be save. Calling it .joblib instead * [CI]: Use towncrier for news (#222) * update: add towncrier configuration in pyproject.toml * docs: refactor whats_new.rst for towncrier integration * docs: improve contributing.rst and add towncrier instructions * docs: improve maintaining.rst and add towncrier instructions * update: add docs/changes/newsfragments * chore: add 151.bugfix changelog * chore: add 170.enh changelog * update: improve docs/Makefile and add towncrier integration * chore: delete Makefile * update: add towncrier to pyproject.toml * [ENH] Allow list of hyperparameter options for tunning (#47) * [WIP] Multiple hyperparam grids * fix linter * increase coverage * add more tests * More tests for merger * add news fragemtn * Add specific gallery section for doc pages that need to be run. * Docs on hyperparameter tuning * some more updates * FIX predict proba * Update dependencies * Fix CV + update first chapter of docs * Fix tests * codespell * fix test for 3.8 maybe * Fix inspector * fix inpsecto * Fix issues with doc building * Improving some titles * Update intesphinx links * Fix inspection + docs * Docs (almost) ready * Fix codespell * Update example of stratification for regression * Update dependencies * Fix the run_cross_validation docstring * Fix test for stats * MODIFY Target Transformation * MODIFY K-fold for regression * Fix for scikit-learn 1.3.0 * Fix whats'new * Maybe ready for release? * Lint * codespell * Add pearson's R scorer * ADD: folds inspector prediction returns target value + avoid regexp when no patterns in X * Set column types even if input is an array * WIP: Config module * Add tests for new config flags * Update X Warns to 5k * linter * update doc * Typo in DOC --------- Co-authored-by: Sami Hamdan <sami.hamdan@hhu.de> Co-authored-by: Leonard Sasse <l.sasse@fz-juelich.de> Co-authored-by: Shammi270787 <more.shammi@gmail.com> Co-authored-by: More, Shammi <s.more@fz-juelich.de> Co-authored-by: antogeo <antogeo@gmail.com> Co-authored-by: Synchon Mandal <synchon@protonmail.com> Co-authored-by: Lya K. Paas <lya.paas@gmail.com> Co-authored-by: Paas Oliveros, Lya Katarina <l.paas.oliveros@fz-juelich.de> Co-authored-by: Vera Komeyer <75420154+verakye@users.noreply.github.com> Co-authored-by: kaurao <kaustubh.patil@gmail.com> Co-authored-by: Kimia Nazarzadeh <60739692+KNazarzadeh@users.noreply.github.com> Co-authored-by: Vera Komeyer <v.komeyer@fz-juelich.de>

fraimondo added the bug Something isn't working label Jun 15, 2021

fraimondo assigned samihamdan Jun 15, 2021

fraimondo added a commit that referenced this issue Jul 6, 2021

Add groups parameter in test: #141

4379e28

samihamdan mentioned this issue Jul 6, 2021

[WIP] Make API compatible between julearn and scikit-learn pipelines #142

Closed

samihamdan added a commit that referenced this issue Jul 7, 2021

Add groups to inner cv with fit_params in cross_validate #141

6872892

fraimondo added this to the v0.3.0 milestone Jul 21, 2022

fraimondo added the Priority: High High Priority Issue label Jul 21, 2022

LeSasse mentioned this issue Feb 15, 2023

[BUG] Groups parameter is not handed over to inner cv #213

Closed

fraimondo added a commit that referenced this issue Apr 6, 2023

fix #213 and #141

1a750cb

fraimondo closed this as completed Apr 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Incompatible API between julearn and scikit-learn pipelines #141

[BUG] Incompatible API between julearn and scikit-learn pipelines #141

fraimondo commented Jun 15, 2021

samihamdan commented Jul 6, 2021

fraimondo commented Jul 6, 2021

fraimondo commented Jul 6, 2021

samihamdan commented Jul 6, 2021

fraimondo commented Jul 21, 2022

[BUG] Incompatible API between julearn and scikit-learn pipelines #141

[BUG] Incompatible API between julearn and scikit-learn pipelines #141

Comments

fraimondo commented Jun 15, 2021

samihamdan commented Jul 6, 2021

fraimondo commented Jul 6, 2021

fraimondo commented Jul 6, 2021

samihamdan commented Jul 6, 2021

fraimondo commented Jul 21, 2022