New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ #8321
MRG, ENH: Allow ICA.max_pca_components to be a float, and introduce ICA.max_pca_components_ #8321
Conversation
The current approach doesn't properly update |
Never mind, all is well. The only thing that I discovered is that we should probably introduce |
x-ref #8322 |
|
||
.. versionchanged:: 0.22 | ||
Added support for float to select components by cumulative | ||
explained variance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't you typically want to just use max_pca_components=None
and n_pca_components=float
? Or would this behave differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't even thought about this one. I need to try and see how it works on our data.
If that does work, I'm wondering why we'd need max_pca_components
at all?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we could simply by default fit PCA with number of components == number of channels, and then use n_pca_components
and n_components
to select only a subset of them. I just looked at the code and it seems as if max_pca_components
could be dropped entirely… but I'd need to do some more testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't you typically want to just use
max_pca_components=None
andn_pca_components=float
? Or would this behave differently?
I just tested. Would behave differently:
max_pca_components=None
--> feed all PCs to ICAn_pca_components=X
--> useX
components for signal reconstruction after ICA run
these settings are orthogonal
Let's flip this:
max_pca_components=X
--> only keepX
PCs around, and feed them to ICAn_pca_components=None
--> use all retained components for signal reconstruction after ICA run (i.e.,X
PCs in our case)
This still begs the question, IMHO, why the h* we need max_pca_components
. What one could to is:
- drop
max_pca_components
- select all PCs to keep via
n_pca_components
- select a subset of those PCs to feed into ICA via
n_components
… but I ain't touching that with a 5-foot pole now unless I'm being told to ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just to say: I think this PR is good to merge as-is, and we should maybe discuss the ICA API in a new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This conversation again shows that the current API is too complicated)
Travis is red bc one job timed out after successfully running all tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thing that remains for me is to warn if max_pca_components_ is higher than the rank provided by info so you don't end up with the problem that triggered this PR.
nice work @hoechenberger !
I've addressed all comments
I was going to address this in a follow-up PR to keep the diff of this PR small. But can also add it here. WDYT? |
Actually, no, I definitely want to address this in a separate PR, otherwise things are going to get too complex here. Unless you have strong objections, of course |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for MRG when CIs are green and branch is rebased to fix conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not 100% sure this will be necessary if we go the route of #8326 / fix the pinv
del data_transformed # Free memory. | ||
self.max_pca_components_ = (np.sum(pca.explained_variance_ratio_ | ||
.cumsum() | ||
<= max_pca_components)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ask for amax_pca_components=0.6
and the components are 0.4, 0.3, 0.2, 0.1 this code will give you 1 component, which is only 0.4; it seems like we want it to give you 2 components (0.7 exp var).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how it has been done in ICA
since forever… we always say "less than X" in the docstrings.
_PCA
handles this differently – exactly like you proposed.
|
||
if (self.n_components is not None and | ||
self.max_pca_components_ < self.n_components): | ||
msg = (f'You asked to select only a subset of PCA components ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nitpick, this actually requires more left spaces than the pattern
raise ValueError(
f''
f'')
and it also makes it look like msg
could be triaged later (in a on_whatever='raise'
sort of pattern) / you have to look down more lines to know all this conditional does is raise an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, will keep that in mind for future PRs
@hoechenberger can you see if your original problematic use case works on #8326 with |
@larsoner I am very sure we had issues using |
Worth trying |
@larsoner Will first try to create a MWE to reproduce the issue |
# %%
import os
import mne
sample_data_folder = mne.datasets.sample.data_path()
sample_data_raw_file = os.path.join(sample_data_folder, 'MEG', 'sample',
'sample_audvis_raw.fif')
# Load & filter raw data.
raw = (mne.io.read_raw_fif(sample_data_raw_file, preload=True)
.pick_types(meg=True, stim=True, eog=True))
raw.filter(l_freq=1, h_freq=None)
# Create Epochs.
events = mne.find_events(raw, stim_channel='STI 014')
event_dict = {'auditory/left': 1, 'auditory/right': 2, 'visual/left': 3,
'visual/right': 4, 'face': 5, 'buttonpress': 32}
tmin, tmax = -0.2, 0.5
baseline = (None, 0)
reject = dict(mag=3000e-15,
grad=3000e-13)
epochs = mne.Epochs(raw, events, tmin=tmin, tmax=tmax, event_id=event_dict,
reject=reject, baseline=baseline, preload=True)
# Create Evoked.
evoked = epochs.average()
# Run ICA.
n_components = 0.8
n_pca_components = 0.999
max_pca_components = None
method = 'picard'
fit_params = dict(fastica_it=5)
max_iter = 500
random_state = 42
ica = mne.preprocessing.ICA(method=method, random_state=random_state,
fit_params=fit_params,
max_iter=max_iter,
n_components=n_components,
n_pca_components=n_pca_components,
max_pca_components=max_pca_components)
ica.fit(epochs)
# Detect ECG artifacts.
ecg_epochs = mne.preprocessing.create_ecg_epochs(raw, tmin=-0.5, tmax=0.5,
baseline=(None, -0.2),
reject=None)
ecg_evoked = ecg_epochs.average()
ecg_inds, scores = ica.find_bads_ecg(
ecg_epochs, method='ctps',
threshold=0.1)
ica.exclude = ecg_inds
print(f'Detected {len(ecg_inds)} ECG-related ICs in '
f'{len(ecg_epochs)} ECG epochs.')
ecg_evoked.plot()
ica.apply(ecg_evoked.copy()).plot() Works. Had to set So – yeah. This seems to be doing the trick? Shall we set |
This, again, begs the question whether we need |
I think I'm +1 for deprecating |
@larsoner Yes +1 on that, this is how I see things now after fighting with this for a couple days in a row. |
This also means that in any case, we need to fix #8327 ASAP so we don't accidentally draw incorrect conclusions. |
the rationale for max_pca_components was to avoid computing too many
components for speed reasons
and also so be able to tweak the n_pca_components parameter in the apply
method without having
to recompute ICA
… |
Yep, we changed our minds elsewhere (damn, this is spread across too many issues and PRs!), now the idea is to deprecate |
The number of ICA components is controlled by The PCA is computed via a dense-matrix deterministic full SVD and then restricted to the slice of highest-variance components so there is no speed gain from this argument in This makes me think that the only thing we want in |
… and Otherwise I think you're right, decluttering |
Thinking about this again, I think it does make sense to have both
WDYT? |
…CA.max_pca_components_ (mne-tools#8321) * Support floats for ICA.max_pca_components * Introduce ICA.max_pca_components_ * flake & docstring * Fix bug in comparison * Test max_pca_components=1.0 * Fix doc build * docstring [skip travis] * Fix test * Fix I/O roundtrip for old files and add test * Remove unreachable test case * Better comments [skip travis] * API -> Bug * Reduce test matrix
What does this implement/fix?
max_pca_components
can now be a float between 0.0 and 1.0, allowing users to select which components to keep after the PCA step by explained cumulative variance. This makes it easier to avoid accidentally feeding rank-deficient data into the ICA algorithm by e.g. settingmax_pca_components=0.99
.ICA.max_pca_components
could previously be altered when callingICA.fit()
; this really shouldn't happen. I have now introducedICA.max_pca_components_
, which contains any alterations thatICA.fit()
conducts, whileICA.max_pca_components
retains its initialization value, even after fitting.Additional information
In follow-up PRs, I will
@agramfort suggested we should maybe change the default from
None
to1.0
(did I get this right?), but if we do, we should update the other params (n_components
,n_pca_components
) as well in a separate PR.WIP: Want to add a test case that ensures
max_pca_components=None
yields the same results asmax_pca_components=1.0
.