Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Replace numpy with pandas in data loaders #2829

Merged
merged 49 commits into from Jan 28, 2022

Conversation

achamma723
Copy link
Contributor

This PR is fixing the issue of replacing Numpy with Pandas in data loaders:
Closes #2621

@codecov
Copy link

codecov bot commented May 7, 2021

Codecov Report

Merging #2829 (65fe185) into main (d91545d) will decrease coverage by 0.01%.
The diff coverage is 90.32%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2829      +/-   ##
==========================================
- Coverage   90.45%   90.44%   -0.02%     
==========================================
  Files         122      122              
  Lines       14562    14593      +31     
  Branches     2973     2982       +9     
==========================================
+ Hits        13172    13198      +26     
  Misses        824      824              
- Partials      566      571       +5     
Impacted Files Coverage Δ
nilearn/datasets/__init__.py 100.00% <ø> (ø)
nilearn/datasets/atlas.py 92.08% <87.17%> (-0.75%) ⬇️
nilearn/datasets/func.py 76.99% <93.33%> (+0.05%) ⬆️
nilearn/_utils/docs.py 91.93% <100.00%> (+0.06%) ⬆️
nilearn/datasets/struct.py 91.54% <100.00%> (+0.26%) ⬆️
nilearn/decoding/space_net.py 88.18% <0.00%> (-0.55%) ⬇️
nilearn/reporting/_get_clusters_table.py 100.00% <0.00%> (+3.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d91545d...65fe185. Read the comment docs.

@NicolasGensollen
Copy link
Member

Thanks @achamma723 !
FYI: this change to pandas is a breaking change and has been scheduled for release 0.9 (see also #2655)

@NicolasGensollen NicolasGensollen added this to the 0.9 milestone May 7, 2021
Copy link
Member

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx !

@tsalo tsalo changed the title [FIX]: Repalce numpy with pandas [FIX] Replace numpy with pandas in data loaders Jul 5, 2021
@NicolasGensollen
Copy link
Member

Hi @achamma723
Could you rebase this PR ad we are approaching release 0.9.0 ?

@achamma723
Copy link
Contributor Author

Hello @NicolasGensollen , Is there an error with test_fetch_atlas_destrieux_2009?

@@ -237,7 +238,7 @@ def fetch_atlas_destrieux_2009(lateralized=True, data_dir=None, url=None,
files_ = _fetch_files(data_dir, files, resume=resume,
verbose=verbose)

params = dict(maps=files_[1], labels=np.recfromcsv(files_[0]))
params = dict(maps=files_[1], labels=pd.read_csv(files_[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need to be something like

    params = dict(maps=files_[1], labels=pd.read_csv(files_[0], index_col=0).to_records())

@jeromedockes
Copy link
Member

Hello @NicolasGensollen , Is there an error with test_fetch_atlas_destrieux_2009?

there isn't an error with the test; the PR changes the output of the function: it is now a pd DataFrame with 2 columns (one repeating the index) rather than a 1d np recarray

@@ -0,0 +1,6 @@
{
"cells": [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you remove the checkpoints and the notebook? thanks!

nilearn/datasets/atlas.py Show resolved Hide resolved
nilearn/datasets/atlas.py Outdated Show resolved Hide resolved
nilearn/datasets/func.py Show resolved Hide resolved
nilearn/datasets/struct.py Show resolved Hide resolved
nilearn/datasets/tests/test_func.py Show resolved Hide resolved
@jeromedockes
Copy link
Member

@NicolasGensollen I couldn't tell from the issue description: did we just want to start using pandas internally, or do we want to start returning DataFrames or Series instead of recarrays wherever they were used? In the latter case we need a deprecation cycle

@NicolasGensollen
Copy link
Member

@jeromedockes there is already a deprecation cycle running, initiated with #2663, and supposed to end with release 0.9.

@NicolasGensollen
Copy link
Member

Also linking related #2655

@jeromedockes
Copy link
Member

@jeromedockes there is already a deprecation cycle running, initiated with #2663, and supposed to end with release 0.9.

Ok thanks! in that case we need to track down those that still return recarrays, such as fetch_atlas_difumo

@NicolasGensollen
Copy link
Member

@achamma723 @jeromedockes I might be missing something, but I don't understand why we would sometimes return pandas dataframes and sometimes rec arrays... 🤔
My understanding of the goal of this PR was to get rid or rec arrays in all the fetchers, and use (and return) pandas dataframes instead, right?
I see even a few fetchers where we load through Pandas to get a dataframe, and then convert to records.

@jeromedockes
Copy link
Member

My understanding of the goal of this PR was to get rid or rec arrays in all the fetchers, and use (and return) pandas dataframes instead, right?

yes, IIUC it will always be dataframes or series

Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achamma723 Could you rebase or merge master to get the doc changes of #3083 ?

nilearn/datasets/atlas.py Outdated Show resolved Hide resolved
nilearn/datasets/atlas.py Outdated Show resolved Hide resolved
nilearn/datasets/atlas.py Show resolved Hide resolved
nilearn/datasets/atlas.py Outdated Show resolved Hide resolved
Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for merging master @achamma723
In the last meeting, we decided to extend the deprecation cycle for rec arrays as we realized the current warning might have been a little light for such a change.
The idea would be to add an optional parameter to affected fetchers (return_dataframe or something alike) defaulting to the current behavior of returning rec arrays for backward compatibility.
Let me know if you want to take care of this, otherwise I can commit to your branch.

@jeromedockes
Copy link
Member

so many fetchers may have a part that roughly looks like:

if legacy_format:
    result["confounds"] = result["confounds"].to_records()
    ...

NicolasGensollen and others added 7 commits January 26, 2022 12:13
* Add failing test

* get rid of scaling_axis attribute

* deprecate scaling_axis

* [circle full] add whatsnew entry

* Fix
)

* Including Hierarchical Kmeans in doc and example

* Adding hierarchical k_means and make it available in parcellations

* test hierarchical_kmeans and test Parcellation using it

* fix incompatible apis of fit() and transform() methods

* Improve docstring and example layout

* "Upgrade to get_data function"

* "changing nose function to pytest function as the rest of the library evolved"

* "solving sklearn.utils.testing deprecation"

* "improve flake8"

* "flake8, missing import, adding test exception"

* Apply suggestions from code review

Including B.T. suggestions

Co-authored-by: bthirion <bertrand.thirion@inria.fr>

* "adding tests for 'scaling' and warning ; fixed scaling bug"

* "test_refactoring"

* "flake8 fixes"

* "clarify example, fix PEP8"

* Apply suggestions from code review

Co-authored-by: bthirion <bertrand.thirion@inria.fr>

* "apply suggestions from N.G. review"

* Apply suggestions from code review

Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>

* "apply B.T refactoring"

* "flake8"

* "flake8"

* "codespell"

* "enforcing final number of clusters is the one asked by the user"

* "solving bugging test_case"

* "solving edge case"

* "missing float casting"

* "check that adjusted clusters are int"

* "flake8 fix"

* Update nilearn/regions/hierarchical_kmeans_clustering.py

Co-authored-by: bthirion <bertrand.thirion@inria.fr>

* Update examples/03_connectivity/plot_data_driven_parcellations.py

Co-authored-by: bthirion <bertrand.thirion@inria.fr>

* Update examples/03_connectivity/plot_data_driven_parcellations.py

Co-authored-by: bthirion <bertrand.thirion@inria.fr>

* "fix code review comments"

* 'fix bug introduced by review suggestion'

* Update nilearn/regions/hierarchical_kmeans_clustering.py

Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>

* Update nilearn/regions/hierarchical_kmeans_clustering.py

Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>

* Update nilearn/regions/hierarchical_kmeans_clustering.py

Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>

* Update nilearn/regions/hierarchical_kmeans_clustering.py

Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>

* "apply code review suggestions"

* "add whats_new"

* fixing docstring

* addressing comments of jdockes

* trying to fix pep8 violations

* simplifying hierarchical k-means avoiding useless checks

* pep8

* Addressing other comments by jdockes

Co-authored-by: BAZEILLE Thomas <thomas.bazeille@inria.fr>
Co-authored-by: bthirion <bertrand.thirion@inria.fr>
Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>
* added deploy key

* clean deploy key

* clean deploy key

* clean cricreci.yml

* clean circleci.yml

* clean circleci.yml

* clean circleci.yml

* fixed Hommel function + added test

* pep8 fixes

* pep8 fixes

* pep8 fixes

* correct use of assert

* revert unrelated change

* minor fixed on hommel value computation

* addedwhatsnew entry

Co-authored-by: Gensollen <nicolas.gensollen@gmail.com>
* Add gh_substitutions

* Add links to contributor profiles (unfinished)

* Use contributor links in AUTHORS

* Break and refactor changelog (unfinished...)

* [circle full] Fix typo and run full build in strict mode.

* Add more links

* iter

* [circle full] request full build

* Add more links again

* [circle full] Iter.

* Update AUTHORS.rst

* gh_role --> _gh_role

* [circle full] Add fundings.

* [circle full] Fix wrong ref.

* [circle full] update niconnect link

* [circle full] fix ref

* fix refs: input_data -> maskers

* [circle full] request full build

* [circle full] fix old ref to fetch_cobre

* [circle full] fix broken ref

* [circle full] Fix error in rebase
@NicolasGensollen
Copy link
Member

Thanks @achamma723 !
@jeromedockes do you want to take another look at it before we merge?

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @achamma723 and @NicolasGensollen
@NicolasGensollen if you had another look at the fetchers and all the changes in column names have been fixed I think we're good!

my only remaining question is about the whatsnew, not sure if there was an issue with the rebase due to #3049

nilearn/datasets/struct.py Outdated Show resolved Hide resolved
doc/whats_new.rst Outdated Show resolved Hide resolved
@bthirion bthirion mentioned this pull request Jan 27, 2022
@@ -1,8 +1,8 @@
numpy==1.16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this should not be here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achamma723 could you rebase to clean this? Thanks!

@bthirion
Copy link
Member

besides the requirement thing, this looks OK.

* Add tests min req with matplotlib

* remove useless variable

* [circle full] update README
@NicolasGensollen
Copy link
Member

Thanks @achamma723 but it looks like there are still unrelated changes. I'm not sure how you rebase, but you should be able to drop the unrelated commits.
If you cannot clean it, I'll merge the PR anyway since it is not a big deal, it just makes the diff more complicated than it should be...

@achamma723
Copy link
Contributor Author

Thanks @achamma723 but it looks like there are still unrelated changes. I'm not sure how you rebase, but you should be able to drop the unrelated commits. If you cannot clean it, I'll merge the PR anyway since it is not a big deal, it just makes the diff more complicated than it should be...

Hello @NicolasGensollen , I'm pulling from main, rebasing and then pushing again. Should I drop something unrelated?

@NicolasGensollen
Copy link
Member

Hello @NicolasGensollen , I'm pulling from main, rebasing and then pushing again. Should I drop something unrelated?

Hmm, it's weird that you get some commits from main in here. You can try to manually drop them when rebasing. I think the following commits shouldn't be there:

Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx!

@NicolasGensollen
Copy link
Member

Thanks everyone! Merging...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use pandas in the data loaders
9 participants