Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ #8321

Merged
merged 14 commits into from Oct 1, 2020

Conversation

hoechenberger
Copy link
Member

@hoechenberger hoechenberger commented Sep 30, 2020

What does this implement/fix?

  • ICA's max_pca_components can now be a float between 0.0 and 1.0, allowing users to select which components to keep after the PCA step by explained cumulative variance. This makes it easier to avoid accidentally feeding rank-deficient data into the ICA algorithm by e.g. setting max_pca_components=0.99.
  • ICA.max_pca_components could previously be altered when calling ICA.fit(); this really shouldn't happen. I have now introduced ICA.max_pca_components_, which contains any alterations that ICA.fit() conducts, while ICA.max_pca_components retains its initialization value, even after fitting.

Additional information

In follow-up PRs, I will

  • add a warning if users try to perform ICA on potentially rank-deficient data, which seems to be a common problem among researchers we know, leading some to believe ICA doesn't do the job for them
  • update the ICA tutorial to mention the new features

@agramfort suggested we should maybe change the default from None to 1.0 (did I get this right?), but if we do, we should update the other params (n_components, n_pca_components) as well in a separate PR.

WIP: Want to add a test case that ensures max_pca_components=None yields the same results as max_pca_components=1.0.

@hoechenberger hoechenberger changed the title WIP, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ MRG, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ Sep 30, 2020
@hoechenberger
Copy link
Member Author

The current approach doesn't properly update ICA.pca_explained_variance_. Needs a fix before we can merge.

@hoechenberger hoechenberger changed the title MRG, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ Sep 30, 2020
@hoechenberger
Copy link
Member Author

Never mind, all is well. The only thing that I discovered is that we should probably introduce ICA.pca_explained_variance_ratio_ in a followup PR at some point. This is good to be reviewed & merged.

@hoechenberger hoechenberger changed the title ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ MRG, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ Sep 30, 2020
@hoechenberger
Copy link
Member Author

x-ref #8322

doc/changes/latest.inc Outdated Show resolved Hide resolved

.. versionchanged:: 0.22
Added support for float to select components by cumulative
explained variance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't you typically want to just use max_pca_components=None and n_pca_components=float? Or would this behave differently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't even thought about this one. I need to try and see how it works on our data.

If that does work, I'm wondering why we'd need max_pca_components at all?!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we could simply by default fit PCA with number of components == number of channels, and then use n_pca_components and n_components to select only a subset of them. I just looked at the code and it seems as if max_pca_components could be dropped entirely… but I'd need to do some more testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't you typically want to just use max_pca_components=None and n_pca_components=float? Or would this behave differently?

I just tested. Would behave differently:

  • max_pca_components=None --> feed all PCs to ICA
  • n_pca_components=X --> use X components for signal reconstruction after ICA run

these settings are orthogonal

Let's flip this:

  • max_pca_components=X --> only keep X PCs around, and feed them to ICA
  • n_pca_components=None --> use all retained components for signal reconstruction after ICA run (i.e., X PCs in our case)

This still begs the question, IMHO, why the h* we need max_pca_components. What one could to is:

  • drop max_pca_components
  • select all PCs to keep via n_pca_components
  • select a subset of those PCs to feed into ICA via n_components

… but I ain't touching that with a 5-foot pole now unless I'm being told to ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just to say: I think this PR is good to merge as-is, and we should maybe discuss the ICA API in a new issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This conversation again shows that the current API is too complicated)

@hoechenberger
Copy link
Member Author

hoechenberger commented Sep 30, 2020

Travis is red bc one job timed out after successfully running all tests.

Copy link
Member

@agramfort agramfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that remains for me is to warn if max_pca_components_ is higher than the rank provided by info so you don't end up with the problem that triggered this PR.

nice work @hoechenberger !

doc/changes/latest.inc Outdated Show resolved Hide resolved
mne/preprocessing/tests/test_ica.py Outdated Show resolved Hide resolved
@hoechenberger
Copy link
Member Author

hoechenberger commented Oct 1, 2020

I've addressed all comments

@agramfort

one thing that remains for me is to warn if max_pca_components_ is higher than the rank provided by info so you don't end up with the problem that triggered this PR.

I was going to address this in a follow-up PR to keep the diff of this PR small. But can also add it here. WDYT?

@hoechenberger
Copy link
Member Author

@agramfort

one thing that remains for me is to warn if max_pca_components_ is higher than the rank provided by info so you don't end up with the problem that triggered this PR.

I was going to address this in a follow-up PR to keep the diff of this PR small. But can also add it here. WDYT?

Actually, no, I definitely want to address this in a separate PR, otherwise things are going to get too complex here. Unless you have strong objections, of course

Copy link
Member

@agramfort agramfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for MRG when CIs are green and branch is rebased to fix conflicts

@hoechenberger hoechenberger merged commit 4270859 into mne-tools:master Oct 1, 2020
3 checks passed
@hoechenberger hoechenberger deleted the ica-max_pca_components branch October 1, 2020 08:08
Copy link
Member

@larsoner larsoner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not 100% sure this will be necessary if we go the route of #8326 / fix the pinv

del data_transformed # Free memory.
self.max_pca_components_ = (np.sum(pca.explained_variance_ratio_
.cumsum()
<= max_pca_components))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you ask for amax_pca_components=0.6 and the components are 0.4, 0.3, 0.2, 0.1 this code will give you 1 component, which is only 0.4; it seems like we want it to give you 2 components (0.7 exp var).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how it has been done in ICA since forever… we always say "less than X" in the docstrings.

_PCA handles this differently – exactly like you proposed.


if (self.n_components is not None and
self.max_pca_components_ < self.n_components):
msg = (f'You asked to select only a subset of PCA components '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpick, this actually requires more left spaces than the pattern

raise ValueError(
    f''
    f'')

and it also makes it look like msg could be triaged later (in a on_whatever='raise' sort of pattern) / you have to look down more lines to know all this conditional does is raise an error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, will keep that in mind for future PRs

@larsoner
Copy link
Member

larsoner commented Oct 2, 2020

@hoechenberger can you see if your original problematic use case works on #8326 with max_pca_components=None, n_pca_components=None, n_components=0.9999? If it does then I think we can revert this, emit warnings when n_components=None and data are rank deficient, and consider changing our default to n_components=0.9999 or so. It avoids the need for this max_pca_components_ being a float.

@hoechenberger
Copy link
Member Author

@larsoner I am very sure we had issues using n_components only and not max_pca_components. So this begs the question whether we could have worked around this by passing n_components and n_pca_components.

@larsoner
Copy link
Member

larsoner commented Oct 2, 2020

Worth trying n_components=0.9999, n_pca_components=0.9999. You might also need to pass n_pca_components=0.9999 to the apply step, it can be changed during .fit (not sure if it will in the fractional variance case, probably not)

@hoechenberger
Copy link
Member Author

@larsoner Will first try to create a MWE to reproduce the issue

@hoechenberger
Copy link
Member Author

@larsoner

# %%
import os
import mne


sample_data_folder = mne.datasets.sample.data_path()
sample_data_raw_file = os.path.join(sample_data_folder, 'MEG', 'sample',
                                    'sample_audvis_raw.fif')

# Load & filter raw data.
raw = (mne.io.read_raw_fif(sample_data_raw_file, preload=True)
       .pick_types(meg=True, stim=True, eog=True))
raw.filter(l_freq=1, h_freq=None)

# Create Epochs.
events = mne.find_events(raw, stim_channel='STI 014')
event_dict = {'auditory/left': 1, 'auditory/right': 2, 'visual/left': 3,
              'visual/right': 4, 'face': 5, 'buttonpress': 32}
tmin, tmax = -0.2, 0.5
baseline = (None, 0)
reject = dict(mag=3000e-15,
              grad=3000e-13)
epochs = mne.Epochs(raw, events, tmin=tmin, tmax=tmax, event_id=event_dict,
                    reject=reject, baseline=baseline, preload=True)

# Create Evoked.
evoked = epochs.average()

# Run ICA.
n_components = 0.8
n_pca_components = 0.999
max_pca_components = None
method = 'picard'
fit_params = dict(fastica_it=5)
max_iter = 500
random_state = 42
ica = mne.preprocessing.ICA(method=method, random_state=random_state,
                            fit_params=fit_params,
                            max_iter=max_iter,
                            n_components=n_components,
                            n_pca_components=n_pca_components,
                            max_pca_components=max_pca_components)
ica.fit(epochs)

# Detect ECG artifacts.
ecg_epochs = mne.preprocessing.create_ecg_epochs(raw, tmin=-0.5, tmax=0.5,
                                                 baseline=(None, -0.2),
                                                 reject=None)
ecg_evoked = ecg_epochs.average()
ecg_inds, scores = ica.find_bads_ecg(
    ecg_epochs, method='ctps',
    threshold=0.1)
ica.exclude = ecg_inds

print(f'Detected {len(ecg_inds)} ECG-related ICs in '
      f'{len(ecg_epochs)} ECG epochs.')

ecg_evoked.plot()
ica.apply(ecg_evoked.copy()).plot()

Works. Had to set n_components to a value much smaller than 0.999, otherwise I'd end up with too many components and ICA would run for ages.

So – yeah. This seems to be doing the trick? Shall we set n_components=0.9999 by default?

@hoechenberger
Copy link
Member Author

hoechenberger commented Oct 2, 2020

This, again, begs the question whether we need max_pca_components at all. As we can control how many PCA components get to be part of the data we send to ICA, and how many components in total to use for signal reconstruction after ICA, we can effectively achieve a dimensionality reduction during the signal reconstruction step, simply by setting n_pca_components=desired_data_rank. I don't see how a dimensionality reduction before ICA via max_pca_components would be necessary, then, as we're effectively doing the same thing via n_components.

@larsoner
Copy link
Member

larsoner commented Oct 2, 2020

I think I'm +1 for deprecating max_pca_components. I don't see any reason not to always retain all of them. Then you choose how many to fit with ICA with n_components, and how many to reconstruct with with n_pca_components. Someone needs to have a chat with @agramfort about this to figure out why it's not possible. But based on the tests and doc changes in #8326, and where the instability seems to come from (from the pinv on the whitened unmixing matrix that only uses n_components components anyway) it seems like it should be okay.

@hoechenberger
Copy link
Member Author

@larsoner Yes +1 on that, this is how I see things now after fighting with this for a couple days in a row.

@hoechenberger
Copy link
Member Author

This also means that in any case, we need to fix #8327 ASAP so we don't accidentally draw incorrect conclusions.

@agramfort
Copy link
Member

agramfort commented Oct 3, 2020 via email

@hoechenberger
Copy link
Member Author

the rationale for max_pca_components was to avoid computing too many components for speed reasons and also so be able to tweak the n_pca_components parameter in the apply method without having to recompute ICA

Yep, we changed our minds elsewhere (damn, this is spread across too many issues and PRs!), now the idea is to deprecate ICA.n_pca_components and only keep this param in apply().

@larsoner
Copy link
Member

larsoner commented Oct 3, 2020

the rationale for max_pca_components was to avoid computing too many components for speed reasons

The number of ICA components is controlled by n_components, so you must mean the number of PCA components.

The PCA is computed via a dense-matrix deterministic full SVD and then restricted to the slice of highest-variance components so there is no speed gain from this argument in fit as far as I can see.

This makes me think that the only thing we want in __init__ is n_components, and we should have n_pca_components or max_pca_components in apply. Or we can keep max_pca_components in __init__ to reduce user API churn, but just say that it doesn't really matter / is redundant with n_pca_components in apply.

@hoechenberger
Copy link
Member Author

hoechenberger commented Oct 3, 2020

This makes me think that the only thing we want in __init__ is n_components, and we should have n_pca_components or max_pca_components in apply.

… and n_pca_components (would be my preferred choice) or max_pca_components in apply() should default to 0.9999 or so, right?

Otherwise I think you're right, decluttering ICA.__init__() would be a good idea. I highly doubt most users know what the heck is going on with all those current arguments.

@hoechenberger
Copy link
Member Author

hoechenberger commented Oct 3, 2020

Thinking about this again, I think it does make sense to have both n_components and n_pca_components in __init__(), so you can set which values to use during fit() and apply() by default. Both fit() and apply() can get / keep params to override these defaults, i.e. fit(n_components=...), apply(n_pca_components=...)

max_pca_components can go.

WDYT?

marsipu pushed a commit to marsipu/mne-python that referenced this pull request Oct 14, 2020
…CA.max_pca_components_ (mne-tools#8321)

* Support floats for ICA.max_pca_components

* Introduce ICA.max_pca_components_

* flake & docstring

* Fix bug in comparison

* Test max_pca_components=1.0

* Fix doc build

* docstring [skip travis]

* Fix test

* Fix I/O roundtrip for old files and add test

* Remove unreachable test case

* Better comments [skip travis]

* API -> Bug

* Reduce test matrix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants