
Update ADHD dataset in examples to MAIN #1953

Merged
merged 49 commits into from Apr 15, 2019

Conversation

@KamalakerDadi (Contributor)

Follow-up to PR #1887.

PR #1887 will be closed because of a messed-up commit history.

@KamalakerDadi (Contributor, Author)

This PR is based on a new upload of non-smoothed data by @emdupre.

@codecov bot commented Mar 30, 2019

Codecov Report

Merging #1953 into master will increase coverage by <.01%.
The diff coverage is 96.7%.


@@            Coverage Diff             @@
##           master    #1953      +/-   ##
==========================================
+ Coverage   95.31%   95.32%   +<.01%     
==========================================
  Files         137      137              
  Lines       17766    17856      +90     
==========================================
+ Hits        16934    17021      +87     
- Misses        832      835       +3
Impacted Files Coverage Δ
nilearn/datasets/__init__.py 100% <ø> (ø) ⬆️
nilearn/datasets/tests/test_func.py 100% <100%> (ø) ⬆️
nilearn/plotting/displays.py 95.45% <100%> (+0.01%) ⬆️
nilearn/datasets/func.py 89.51% <94.64%> (+0.62%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4a072a3...c5d6e03. Read the comment docs.

@codecov bot commented Mar 30, 2019

Codecov Report

Merging #1953 into master will increase coverage by <.01%.
The diff coverage is 96.7%.


@@            Coverage Diff             @@
##           master    #1953      +/-   ##
==========================================
+ Coverage   95.32%   95.32%   +<.01%     
==========================================
  Files         137      137              
  Lines       17781    17856      +75     
==========================================
+ Hits        16949    17021      +72     
- Misses        832      835       +3
Impacted Files Coverage Δ
nilearn/datasets/__init__.py 100% <ø> (ø) ⬆️
nilearn/datasets/tests/test_func.py 100% <100%> (ø) ⬆️
nilearn/plotting/displays.py 95.45% <100%> (+0.01%) ⬆️
nilearn/datasets/func.py 89.51% <94.64%> (+0.62%) ⬆️
nilearn/image/image.py 98% <0%> (-0.04%) ⬇️
nilearn/image/tests/test_image.py 98.86% <0%> (-0.02%) ⬇️
nilearn/plotting/tests/test_img_plotting.py 100% <0%> (ø) ⬆️
nilearn/decoding/space_net.py 94.39% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 498d5eb...ef732d7. Read the comment docs.

@KamalakerDadi (Contributor, Author)

@emdupre Now we have more nuisance signals (confounds) for each subject than in the earlier uploaded version.

The problem is that some of the columns have "n/a" fillings, which are treated as NaNs -- columns such as framewise_displacement, dvars, and std_dvars.

As a result, some of the examples won't work. For example:
examples/03_connectivity/plot_inverse_covariance_connectome.py

ValueError: array must not contain infs or NaNs

How do we tackle this?

@bthirion (Member) left a comment

Sounds great! Minor stuff below.


Notes
-----
This functional MRI datasets is used as part of teaching how to use
@bthirion (Member): is used for teaching

child_count = participants['Child_Adult'].tolist().count('child')
adult_count = participants['Child_Adult'].tolist().count('adult')

n_child = np.round(float(n_subjects) / max_subjects * child_count).astype(int)
@bthirion (Member): Maybe explain that you want to keep the proportion of children vs. adults.
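To illustrate the reviewer's point, the snippet above preserves the child/adult ratio of the full sample when a subset is requested. A minimal sketch of that computation (the function name is illustrative, not the fetcher's API; the counts assume ds000228's 122 children and 33 adults):

```python
import numpy as np

def proportional_counts(n_subjects, child_count, adult_count):
    """Split n_subjects so the child/adult proportion of the
    full sample is (approximately) preserved."""
    max_subjects = child_count + adult_count
    n_child = int(np.round(float(n_subjects) / max_subjects * child_count))
    n_adult = n_subjects - n_child
    return n_child, n_adult

# e.g. requesting 30 of 155 subjects (122 children, 33 adults)
n_child, n_adult = proportional_counts(30, child_count=122, adult_count=33)
```

Rounding the child count and deriving the adult count by subtraction guarantees the requested total is returned exactly.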

@@ -716,7 +716,9 @@ def add_contours(self, img, threshold=1e-6, filled=False, **kwargs):
(from matplotlib) for contours and contour_fillings can be
different.
"""
self._map_show(img, type='contour', **kwargs)
if not filled:
@bthirion (Member): Seems a bit unrelated. Is this an obvious fix, or does it require some explanation?

@emdupre (Member) commented Apr 1, 2019

The problem is that some of the columns have "n/a" fillings, which are treated as NaNs. Columns such as framewise_displacement, dvars, std_dvars.
As a result, some of the examples won't work.

Sorry, this is an fMRIPrep issue! Essentially, some of the calculated confounds rely on successive volumes (framewise displacement, DVARS, and standardized DVARS are the main offenders here). For the first volume there's no way to calculate a sensible value (we need two volumes to compute, e.g., framewise displacement), so we have NaN instead.

A reasonable solution is to just fill these with zero, which I've gone ahead and done. I did this by versioning the files, so we should be able to roll back (I also have the original copy on Compute Canada). Let me know if this makes sense to you!
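The zero-filling described here can be sketched with pandas (the file contents below are a made-up stand-in for a real fMRIPrep confounds TSV, not the code used for the actual upload):

```python
import io
import pandas as pd

# Stand-in for an fMRIPrep confounds file: the first volume has "n/a"
# for measures that need two successive volumes.
tsv = (
    "framewise_displacement\tdvars\tstd_dvars\n"
    "n/a\tn/a\tn/a\n"
    "0.12\t1.30\t1.10\n"
)
confounds = pd.read_csv(io.StringIO(tsv), sep="\t")  # "n/a" is parsed as NaN
confounds = confounds.fillna(0)  # replace the first-volume NaNs with zeros
```

After `fillna(0)` the array contains no NaNs, so downstream covariance estimation no longer raises `ValueError`.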

@KamalakerDadi (Contributor, Author)

A reasonable solution is to just fill these as zero, which I've gone ahead and done. I tried to do this just by versioning the files, so we should be able to roll back (I also have the original copy on Compute Canada). Let me know if this makes sense to you !

You have replaced them with zeros and are ready to make a new upload? Another thing we could do is pick only the basic confounds for cleaning in the examples, such as motion and CompCor.
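The alternative mentioned here -- using only basic confounds -- could look like this (column names follow fMRIPrep conventions but vary by version; this is an illustration, not the merged code):

```python
import io
import pandas as pd

# Stand-in confounds table with fMRIPrep-style column names.
tsv = (
    "trans_x\trot_x\ta_comp_cor_00\tframewise_displacement\n"
    "0.01\t0.002\t0.50\t0.00\n"
    "0.02\t0.001\t0.45\t0.12\n"
)
confounds = pd.read_csv(io.StringIO(tsv), sep="\t")
# Keep only motion parameters and CompCor components.
basic = confounds[[c for c in confounds.columns
                   if c.startswith(("trans_", "rot_", "a_comp_cor_"))]]
```

Selecting columns by prefix sidesteps the NaN-bearing derivatives (framewise displacement, DVARS) entirely.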

@emdupre (Member) commented Apr 1, 2019 via email

@KamalakerDadi (Contributor, Author)

They're now uploaded -- it's just "version 2" of those files :)

Are they under the same keys or new ones?

@emdupre (Member) commented Apr 1, 2019 via email

@illdopejake (Contributor)

Hey folks, @emdupre and I were just discussing this dataset. Something has been bugging us since the start, and it was brought up again recently. It's only partially relevant to this thread, but I'm not sure where else to mention it. Anyway:

When we look at the features that were contributing to the age prediction (in the machine-learning example that got all of this started), they were mostly coming from the cerebellum and temporal pole. This made us think that the algorithm was learning something about the bounding box, or head size, or something of that ilk, to predict age. Which is maybe not that interesting, and might speak negatively of the dataset we're working with, depending on the angle.

However, the first time @GaelVaroquaux and I did this back in September 2018 (for the resting-state conference), the important features were hanging about in the frontal lobe, where one might expect them to be from a biological point of view. There were many differences between that analysis and the newer one:

a) We were using all subjects instead of a subset
b) Original resolution (2mm)
c) It was a regression problem predicting log age.

I'll need to dig a bit deeper to see which of those factors (or combo) might explain the discrepancy. But, do you think this is something we need to worry about moving forward with this dataset? If so, perhaps something to nip in the bud now before we commit to it? Maybe this won't be relevant outside of the ML example context but, just wanted to be on the safe side and mention it...

@GaelVaroquaux (Member) commented Apr 2, 2019 via email

@kchawla-pi (Collaborator)

a parameter to the download function that enables downloading all the subject

What's stopping us from building one? There are rumors circulating about an upcoming sprint...

@kchawla-pi (Collaborator) commented Apr 2, 2019

More to the point: what do we need to do to implement this?

Off the top of my head, we can upload both a small subset and the full dataset. Our examples use the smaller one, for speed; if the user specifies 'full', they download the full version for their analysis.
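A hypothetical sketch of that interface (the function and parameter names are invented for illustration; they are not the API that was merged):

```python
TOTAL_SUBJECTS = 155  # participants in ds000228

def fetch_development_fmri(n_subjects=None, all_subjects=False):
    """Hypothetical fetcher signature: a small default subset keeps
    example runtimes short, with an opt-in switch for the full dataset.
    Returns the number of subjects that would be downloaded."""
    if all_subjects:
        return TOTAL_SUBJECTS
    if n_subjects is None:
        n_subjects = 30  # small default for fast examples
    return min(n_subjects, TOTAL_SUBJECTS)
```

The design point is simply that the default path stays light while power users can request everything.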

@GaelVaroquaux (Member) commented Apr 2, 2019 via email

@illdopejake (Contributor)

However, maybe we should first check that it helps solving the problem.

I will look into it!

@illdopejake (Contributor)

However, maybe we should first check that it helps solving the problem.

It seems to be a classification vs. regression issue, actually. I went back and ran a classification on the original (full) dataset, and the important features were remarkably similar to those from the new dataset. Similarly, when doing regression on the new (smaller) dataset, the most important features were indeed in the frontal lobe. There's still some information coming from the cerebellum etc., but I'm less worried about this now.

By the way, the data looks much better since @emdupre treated it, and the prediction is actually better!

@GaelVaroquaux (Member) commented Apr 3, 2019 via email

@illdopejake (Contributor)

Out of curiosity, what are your MAE?

So, we're predicting log age. The values range from ~1.3 to 3.5. The MAE for the smaller @emdupre dataset is 0.34. Using the original (full) dataset as archived on OpenNeuro, the MAE was ... wait for it ... 0.34. 😉

I suggest that we move forward with the plan, and do provide an option for an additional bigger download. It will help users have confidence in what they do. Thanks!

Great! My pleasure!
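For readers unfamiliar with the metric being quoted, here is a toy re-creation of evaluating a log-age regression by mean absolute error (the data below are synthetic, not ds000228):

```python
import numpy as np

rng = np.random.default_rng(0)
log_age = rng.uniform(1.3, 3.5, size=200)            # roughly the range quoted
predicted = log_age + rng.normal(0.0, 0.4, size=200)  # a stand-in "prediction"
mae = float(np.abs(predicted - log_age).mean())       # mean absolute error
```

An MAE of ~0.34 on a target spanning 1.3 to 3.5 is what the comparison above reports for both versions of the dataset.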

sub-pixar058,5c8ff3b02286e80018c3e3c4,5c8ff3b14712b400183b705a
sub-pixar061,5c8ff3b12286e80016c3c30f,5c8ff3b34712b400183b7060
sub-pixar062,5c8ff3b2a743a9001660a07a,5c8ff3b54712b400193b5ba3
sub-pixar063,5c8ff3b2a743a9001660a07a,5c8ff3b54712b400193b5ba3
@emdupre (Member): These are now fixed! See the correct key 👇

Suggested change
sub-pixar063,5c8ff3b2a743a9001660a07a,5c8ff3b54712b400193b5ba3
sub-pixar063,5c9e99d006cd47001a5ab599,5c8ff3b62286e80016c3c31b

@KamalakerDadi (Contributor, Author): Will update this and see how CircleCI goes with it. Thanks a lot!

@KamalakerDadi (Contributor, Author)

This is ready for review. A big thanks to the data-organization team.

@@ -0,0 +1,37 @@
Montreal Artificial Intelligence and Neuroscience conference 2018 datasets
Member:

I think I'd like the line above to be changed to better match the new name, as well as the name of the file.

In general, I'd like us to avoid using "main" in the code and the filenames to describe this dataset; it will be confusing to people.

@GaelVaroquaux (Member)

I had a look at this PR (and sent a PR to this branch, KamalakerDadi#10). It's excellent work! The dataset is so light compared to the ADHD data, yet so much richer and more interesting.

Aside from my PR, the only other comment I have is that we should avoid the name "main", even inside the codebase. It will confuse people who work on the code in the future.

@KamalakerDadi (Contributor, Author)

I would change "main" to "brain_development", then?

@GaelVaroquaux (Member) commented Apr 11, 2019 via email

@KamalakerDadi (Contributor, Author)

I had a look at this PR (and sent a PR to this branch KamalakerDadi#10 ).

Thanks. It's merged.

@emdupre (Member) commented Apr 11, 2019

This looks great, @KamalakerDadi ! Thanks so much ✨

I was starting a review but first had one very general question: in the examples we now refer to this dataset alternately as the "rest", "brain development", "resting-state development", "resting state functional connectivity", or just "resting-state" dataset. The precise term chosen seems to vary somewhat randomly.

Can we standardize the short names throughout the examples? I think this would help clarify things for users!

@KamalakerDadi (Contributor, Author)

I was starting a review but first had one very general question: In the examples we're now referring to this alternately as the "rest", "brain development", "resting-state development", "resting state functional connectivity ", or just "resting-state" dataset. The precise term chosen seems to vary somewhat randomly.

Yeah. I agree. :)

Can we standardize the short names throughout the examples ? I think this would help clarify for users !

Any suggestions?

@emdupre (Member) left a comment

Looks great -- I had a few questions about how we link out to / refer to the data. Thanks again!

`<http://fcon_1000.projects.nitrc.org/indi/adhd200/>`_. We use utilities
:func:`fetch_adhd` implemented in nilearn for automatic fetching of these
datasets.
`<https://openneuro.org/datasets/ds000228/versions/1.0.0>`_. We use utilities
@emdupre: Should we actually give the OSF link here? That's where the preprocessed data lives. The data available on OpenNeuro is either raw BIDS or smoothed.

We will use down-sample data from the `children resting-state dataset
<https://openneuro.org/datasets/ds000228/versions/1.0.0>`_ has been
preprocessed using `FMRIPrep <https://github.com/poldracklab/fmriprep>`_.
We use nilearn
@emdupre: Again, I think we might want to link OSF here. It gives the exact fMRIPrep "methods section" as well as the Nilearn commands used in downsampling.

preprocessed using `CPAC <http://fcp-indi.github.io/>`_. We use nilearn
We will use down-sample data from the `children resting-state dataset
<https://openneuro.org/datasets/ds000228/versions/1.0.0>`_ has been
preprocessed using `FMRIPrep <https://github.com/poldracklab/fmriprep>`_.
@emdupre: Slight preference:

Suggested change
preprocessed using `FMRIPrep <https://github.com/poldracklab/fmriprep>`_.
preprocessed using `FMRIPrep <https://fmriprep.readthedocs.io>`_.

@@ -3,7 +3,8 @@
=====================================================

An example applying CanICA to resting-state data. This example applies it
to 30 subjects of the ADHD200 datasets. Then it plots a map with all the
to 30 subjects of the resting-state development datasets (children).
@emdupre: Are we selecting only children here? I thought from the fetcher that it would select both children and adults when given an n_subjects value...

# Look at blocks structure, reflecting functional networks.
# Now we display as a connectome the mean correlation matrix over all children.
plotting.plot_connectome(mean_correlation_matrix, msdl_coords,
title='mean correlation over all childrens')
@emdupre:
Suggested change
title='mean correlation over all childrens')
title='mean correlation over all children')

@@ -223,4 +227,11 @@ def plot_matrices(matrices, matrix_kind):
plt.grid(True)
plt.tight_layout()

###############################################################################
# While the comparison is not fully conclusive on this small dataset,
@emdupre: Should we suggest here that users can re-run the example with all available data and compare?

18-39). This rs-fmri data can be used to try to predict who are adults and
who are children.

The data is downsampled to 4mm resolution for convenience. The original
@emdupre: I think we should link to OSF here, too, as it describes the preprocessing and downsampling for anyone interested.

Here: https://openneuro.org/datasets/ds000228/versions/1.0.0

Track issue for more information:
https://github.com/nilearn/nilearn/issues/1864
@emdupre: This will be closed once this PR is merged, no? Is it still useful to link here?

Nature communications, 9(1), 1027.
https://www.nature.com/articles/s41467-018-03399-2

Licence: usage is unrestricted for non-commercial research purposes.
@emdupre: We can give the exact license -- the original data was PDDL. I haven't actually licensed our downsampled version. My gut would be CC-BY 4.0. Preferences, @pbellec @illdopejake?

The original data is downloaded from OpenNeuro
https://openneuro.org/datasets/ds000228/versions/1.0.0

This fetcher downloads downsampled data that are available on Open
@emdupre: Same comment about including the OSF link for the exact preprocessing and downsampling.

emdupre and others added 21 commits April 15, 2019 12:13
Co-Authored-By: KamalakerDadi <dkamalakarreddy@gmail.com>
@GaelVaroquaux (Member)

Great!! Merging.

@GaelVaroquaux GaelVaroquaux merged commit 5f283a9 into nilearn:master Apr 15, 2019
Sprint Paris 2019 automation moved this from In progress to Done Apr 15, 2019
@KamalakerDadi (Contributor, Author)

Great! Thanks @emdupre for helping on this.

6 participants