Update ADHD dataset in examples to MAIN #1953
Conversation
This PR is based on a new upload of non-smoothed data by @emdupre.
Codecov Report
@@ Coverage Diff @@
## master #1953 +/- ##
==========================================
+ Coverage 95.31% 95.32% +<.01%
==========================================
Files 137 137
Lines 17766 17856 +90
==========================================
+ Hits 16934 17021 +87
- Misses 832 835 +3
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #1953 +/- ##
==========================================
+ Coverage 95.32% 95.32% +<.01%
==========================================
Files 137 137
Lines 17781 17856 +75
==========================================
+ Hits 16949 17021 +72
- Misses 832 835 +3
Continue to review full report at Codecov.
@emdupre Now we have more nuisance signals (confounds) for each subject than in the earlier uploaded version. The problem is that some of the columns have "n/a" fillings, which are treated as NaNs: columns such as framewise_displacement, dvars, and std_dvars. As a result some of the examples won't work, for example: ValueError: array must not contain infs or NaNs. How do we tackle this ?
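For reference, that error is the standard numpy/scipy finite-value check, raised as soon as an input array contains NaN. A minimal reproduction:

```python
import numpy as np

confound_column = np.array([np.nan, 0.1, 0.2])  # "n/a" entries read as NaN
np.asarray_chkfinite(confound_column)
# ValueError: array must not contain infs or NaNs
```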
Sounds great !
minor stuff.
Notes
-----
This functional MRI datasets is used as part of teaching how to use
is used for teaching
child_count = participants['Child_Adult'].tolist().count('child')
adult_count = participants['Child_Adult'].tolist().count('adult')

n_child = np.round(float(n_subjects) / max_subjects * child_count).astype(int)
Maybe explain that you want to keep the proportion of children vs adults.
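For reference, the logic under discussion, spelled out: a subset of n_subjects should preserve the child/adult ratio of the full sample. A minimal standalone sketch (the counts are illustrative stand-ins for the participants.tsv tallies, not values taken from this PR):

```python
import numpy as np

# Illustrative counts standing in for the participants.tsv tallies.
child_count, adult_count = 122, 33
max_subjects = child_count + adult_count   # size of the full sample
n_subjects = 30                            # requested subset size

# Keep the child/adult proportion of the full sample in the subset.
n_child = np.round(float(n_subjects) / max_subjects * child_count).astype(int)
n_adult = n_subjects - n_child
print(n_child, n_adult)   # -> 24 6
```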
@@ -716,7 +716,9 @@ def add_contours(self, img, threshold=1e-6, filled=False, **kwargs):
        (from matplotlib) for contours and contour_fillings can be
        different.
        """
        self._map_show(img, type='contour', **kwargs)
        if not filled:
Seems a bit unrelated. Is this an obvious fix or does it require some explanations ?
It is related; it's a bug raised by @GaelVaroquaux:
https://3567-1235740-gh.circle-artifacts.com/0/home/circleci/project/doc/_build/html/auto_examples/03_connectivity/plot_compare_resting_state_decomposition.html#sphx-glr-auto-examples-03-connectivity-plot-compare-resting-state-decomposition-py
The contours are not filled in some positions.
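For context on why filled contours need their own code path: in matplotlib, line contours and filled contours are drawn by different primitives, so calling only the line-contour routine leaves regions unfilled. A standalone sketch of the distinction (plain matplotlib, not the nilearn _map_show internals):

```python
import numpy as np
import matplotlib.pyplot as plt

# A toy 2D "stat map" with two blobs.
xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-((xs - 1) ** 2 + ys ** 2)) + np.exp(-((xs + 1) ** 2 + ys ** 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.contour(z, levels=[0.5], colors='r')              # outlines only
ax1.set_title('contour: outlines, no fill')
ax2.contourf(z, levels=[0.5, z.max()], colors=['r'])  # filled region
ax2.contour(z, levels=[0.5], colors='k')              # plus an edge
ax2.set_title('contourf: region filled')
plt.show()
```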
Sorry, this is an fMRIPrep issue ! Essentially, some of the calculated confounds rely on successive volumes (framewise displacement, DVARS, and standardized DVARS are the main offenders here). So for the first volume, there's no way to calculate a sensible value (since we need two volumes to calculate e.g. framewise displacement) -- so we have NaN instead.

A reasonable solution is to just fill these as zero, which I've gone ahead and done. I tried to do this just by versioning the files, so we should be able to roll back (I also have the original copy on Compute Canada). Let me know if this makes sense to you !
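For readers following along, a minimal pandas sketch of the fix described here (the filename is illustrative, not a path from this PR):

```python
import pandas as pd

# fMRIPrep writes "n/a" for confounds it cannot compute on the first
# volume (framewise_displacement, dvars, std_dvars, ...); pandas can
# read those entries as NaN.
confounds = pd.read_csv('sub-pixar001_desc-confounds_regressors.tsv',
                        sep='\t', na_values='n/a')

# Fill the leading NaNs with zeros so downstream cleaning does not
# raise "ValueError: array must not contain infs or NaNs".
confounds = confounds.fillna(0)
confounds.to_csv('confounds_fixed.tsv', sep='\t', index=False)
```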
Have you replaced them with zeros, and are you ready to make a new upload ? Another thing we could do is pick just the basic confounds for cleaning in the examples, such as motion and CompCor.
They're now uploaded -- it's just "version 2" of those files :)
I have no strong feelings as to which regressors are included ! Do you have
a list in mind ?
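For illustration, one possible "basic" list. The column names below follow current fMRIPrep conventions and are an assumption here; older fMRIPrep releases used different names (e.g. X/Y/Z, aCompCor00):

```python
import pandas as pd

confounds = pd.read_csv('confounds_fixed.tsv', sep='\t')

# Six rigid-body motion parameters plus the first five aCompCor
# components: one common minimal cleaning set.
basic = (['trans_x', 'trans_y', 'trans_z', 'rot_x', 'rot_y', 'rot_z']
         + ['a_comp_cor_%02d' % i for i in range(5)])

# 2D array, usable as the `confounds` argument in nilearn's cleaning.
selected = confounds[basic].values
```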
Are they under the same keys or new ones ?
Same keys !
hey folks, @emdupre and I were just discussing this dataset. There's something that has been bugging us since the start, but it was brought up again recently. It's only partially relevant to this thread, but I'm not sure where else to mention it.

Anyway: when we look at the features that were contributing to the age prediction (in the machine learning example that got all of this started), they were mostly coming from the cerebellum and temporal pole. This made us think that the algorithm was learning something about the bounding box, or head size, or something of that ilk, to predict age. Which is maybe not that interesting, and might speak negatively to the dataset we're working with, depending on the angle.

However, the first time @GaelVaroquaux and I did this back in September 2018 (for the resting-state conference), the important features were hanging about in the frontal lobe, where one might expect them to be from a biological point of view. There were many differences between that analysis and the newer one:

a) We were using all subjects instead of a subset

I'll need to dig a bit deeper to see which of those factors (or combo) might explain the discrepancy. But, do you think this is something we need to worry about moving forward with this dataset? If so, perhaps something to nip in the bud now, before we commit to it? Maybe this won't be relevant outside of the ML example context, but I just wanted to be on the safe side and mention it...
I'll need to dig a bit deeper to see which of those factors (or combo) might explain the discrepancy. But, do you think this is something we need to worry about moving forward with this dataset?
Only partly. It is important to keep in mind that with small samples, the
features may not be well trusted.
To see if this is a problem, the first thing that I would try is to crank up the number of subjects and see how things evolve.
In an ideal world, we would have a parameter to the download function
that enables downloading all the subjects. The download and analysis
would then take longer, but the results would be more trustworthy.
What's stopping us from building one? There are rumors circulating about an upcoming sprint...
More on point, what do we need to do to implement this? Off the top of my head, we can upload a small subset and the full dataset. For our examples we use the smaller one, for speed; if the user specifies 'full', then they download the full version for their analysis.
Upload a second zip file on OSF with the remaining subjects preprocessed, and add an argument to the downloading function to optionally download from it.
However, maybe we should first check that this helps solve the problem.
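A rough sketch of what such an option could look like; the function name, argument, and URLs below are hypothetical placeholders, not the fetcher that eventually landed in nilearn:

```python
import os
import urllib.request

# Hypothetical OSF archives: a small subset for the examples, and the
# full dataset for real analyses. The keys are placeholders.
URLS = {
    'subset': 'https://osf.io/download/<subset-key>/',
    'full': 'https://osf.io/download/<full-key>/',
}

def fetch_development_data(dataset='subset', data_dir='nilearn_data'):
    """Download the small subset by default, or the full archive."""
    if dataset not in URLS:
        raise ValueError("dataset must be 'subset' or 'full'")
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, 'development_%s.zip' % dataset)
    if not os.path.exists(target):   # skip the download if cached
        urllib.request.urlretrieve(URLS[dataset], target)
    return target
```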
I will look into it!
It seems to be a classification vs. regression issue actually. I went back and ran a classification on the original (full) dataset, and the features were remarkably similar to those from the new dataset. Similarly, when doing regression on the new (smaller) dataset, the most important features were indeed in the frontal lobe. There's still some information coming from the cerebellum etc., but I'm less worried about this now.

By the way, the data looks much better since @emdupre treated it, and the prediction is actually better!
It seems to be a classification vs. regression issue actually.
OK, so it's probably because classification really aims for a
discriminant hyperplane, and hence cannot be trusted to identify features
marginally linked to the target.
By the way, the data looks much better since @emdupre treated it, and the prediction is actually better!
Great! That gives us trust in the data. Out of curiosity, what are your MAEs?
I suggest that we move forward with the plan, and do provide an option
for an additional bigger download. It will help users have confidence in
what they do.
Thanks!
So, we're predicting log age. The values range from ~1.3 to 3.5. The MAE for the smaller @emdupre dataset is 0.34. Using the original (full) dataset as it was archived on OpenNeuro, the MAE was ... wait for it... 0.34. 😉
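To make the quoted numbers concrete, a self-contained sketch of the kind of evaluation being reported: cross-validated regression of log age, scored with MAE. The data here is synthetic; the real example uses connectivity features extracted from the fMRI data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
n_subjects, n_features = 155, 300
X = rng.randn(n_subjects, n_features)        # stand-in for connectivity features
log_age = rng.uniform(1.3, 3.5, n_subjects)  # log age, matching the quoted range
X[:, 0] += log_age                           # inject a weak age signal

pred = cross_val_predict(RidgeCV(), X, log_age, cv=5)
print('MAE: %.2f' % mean_absolute_error(log_age, pred))
```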
Great! My pleasure!
nilearn/datasets/data/MAIN_osf.csv
Outdated
sub-pixar058,5c8ff3b02286e80018c3e3c4,5c8ff3b14712b400183b705a
sub-pixar061,5c8ff3b12286e80016c3c30f,5c8ff3b34712b400183b7060
sub-pixar062,5c8ff3b2a743a9001660a07a,5c8ff3b54712b400193b5ba3
sub-pixar063,5c8ff3b2a743a9001660a07a,5c8ff3b54712b400193b5ba3
These are now fixed ! See correct key 👇
sub-pixar063,5c8ff3b2a743a9001660a07a,5c8ff3b54712b400193b5ba3
sub-pixar063,5c9e99d006cd47001a5ab599,5c8ff3b62286e80016c3c31b
Will update this and see how CircleCI goes with it.
Thanks a lot!
This is ready for review. A big thanks to the data organization team.
@@ -0,0 +1,37 @@
Montreal Artificial Intelligence and Neuroscience conference 2018 datasets |
I think that I'd like the line above to be changed to better match the new name, as well as the name of the file.
In general, I'd like us to avoid using "main" in the code and the filenames to describe this dataset; it will be confusing to people.
I had a look at this PR (and sent a PR to this branch KamalakerDadi#10). It's excellent work! The dataset is so light compared to the ADHD data, but it's so much richer and more interesting. Aside from my PR, the only other comment that I have is that we should avoid the naming "main", even inside the codebase. It will confuse people who work on the code in the future.
I would change "main" to "brain_development", then ?
I would change "main" to "brain_development", then ?
"development_rsfmri", rather, as this is the name that we use in the
function.
Thanks. It's merged.
This looks great, @KamalakerDadi ! Thanks so much ✨ I was starting a review but first had one very general question: in the examples we're now referring to this alternately as the "rest", "brain development", "resting-state development", "resting state functional connectivity", or just "resting-state" dataset. The precise term chosen seems to vary somewhat randomly. Can we standardize the short names throughout the examples ? I think this would help clarify for users !
Yeah. I agree. :)
Any suggestions ?
Looks great -- I had a few questions re how we link out / refer to the data. Thanks again !!
`<http://fcon_1000.projects.nitrc.org/indi/adhd200/>`_. We use utilities
:func:`fetch_adhd` implemented in nilearn for automatic fetching of these
datasets.
`<https://openneuro.org/datasets/ds000228/versions/1.0.0>`_. We use utilities
Should we actually give the OSF link here ? That's where the preprocessed data lives. The data available on OpenNeuro is either raw BIDS or smoothed.
We will use down-sample data from the `children resting-state dataset
<https://openneuro.org/datasets/ds000228/versions/1.0.0>`_ has been
preprocessed using `FMRIPrep <https://github.com/poldracklab/fmriprep>`_.
We use nilearn
Again, I think we might want to link OSF here. It gives the exact fMRIPrep "methods section" as well as the Nilearn commands used in downsampling.
preprocessed using `CPAC <http://fcp-indi.github.io/>`_. We use nilearn
We will use down-sample data from the `children resting-state dataset
<https://openneuro.org/datasets/ds000228/versions/1.0.0>`_ has been
preprocessed using `FMRIPrep <https://github.com/poldracklab/fmriprep>`_.
Slight preference:
preprocessed using `FMRIPrep <https://github.com/poldracklab/fmriprep>`_.
preprocessed using `FMRIPrep <https://fmriprep.readthedocs.io>`_.
@@ -3,7 +3,8 @@
=====================================================

An example applying CanICA to resting-state data. This example applies it
to 30 subjects of the ADHD200 datasets. Then it plots a map with all the
to 30 subjects of the resting-state development datasets (children).
Are we selecting only children, here ? I thought from the fetcher that it would select both children and adults when given an n_subjects value...
# Look at blocks structure, reflecting functional networks.
# Now we display as a connectome the mean correlation matrix over all children.
plotting.plot_connectome(mean_correlation_matrix, msdl_coords,
                         title='mean correlation over all childrens')
title='mean correlation over all childrens')
title='mean correlation over all children')
@@ -223,4 +227,11 @@ def plot_matrices(matrices, matrix_kind):
    plt.grid(True)
    plt.tight_layout()

###############################################################################
# While the comparison is not fully conclusive on this small dataset,
Should we suggest here that users can re-run the example with all available data and compare ?
18-39). This rs-fmri data can be used to try to predict who are adults and
who are children.

The data is downsampled to 4mm resolution for convenience. The original
I think we should link to OSF here, too, as it describes the preprocessing and downsampling for anyone interested.
Here: https://openneuro.org/datasets/ds000228/versions/1.0.0

Track issue for more information:
https://github.com/nilearn/nilearn/issues/1864
This will be closed once this PR is merged, no ? Is it still useful to link, here ?
Nature communications, 9(1), 1027.
https://www.nature.com/articles/s41467-018-03399-2

Licence: usage is unrestricted for non-commercial research purposes.
We can give the exact license -- the original data was PDDL. I haven't actually licensed our downsampled version. My gut would be for CC-BY 4.0. Preferences @pbellec @illdopejake ?
The original data is downloaded from OpenNeuro
https://openneuro.org/datasets/ds000228/versions/1.0.0

This fetcher downloads downsampled data that are available on Open
Same comment on including OSF link for exact preprocessing and downsampling.
Great!! Merging.
Great! Thanks @emdupre for helping on this. |
Follow up to PR #1887
PR #1887 will be closed due to a mess-up in the commit history.