[BUG] Discrepancy between decoding accuracies from nilearn's Decoder and pure sklearn implementation #4344
Comments
Thanks for looking at that. It would be useful to dig a bit deeper into why there is such a difference with standardize=False in the nilearn case only.
So I found that nilearn solves the multiclass problem one-vs-rest. Since in nilearn we therefore have an OvR problem, which turns out to be inherently unbalanced in this case (and probably in most other cases), I updated the sklearn implementation so that it should be similar to what nilearn is doing. I think it still isn't exactly the same (particularly the behavior of LogisticRegressionCV):

from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
from nilearn import datasets
from nilearn.image import index_img
from nilearn.decoding import Decoder
from nilearn.maskers import NiftiMasker
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
haxby_dataset = datasets.fetch_haxby()
fmri_filename = haxby_dataset.func[0]
mask_filename = haxby_dataset.mask_vt[0]
###############################################################################
# Load behavioral information
behavioral = pd.read_csv(haxby_dataset.session_target[0], delimiter=" ")
conditions = behavioral["labels"]
runs = behavioral["chunks"]
condition_mask = conditions.isin(
    ["bottle", "cat", "chair", "face", "house", "scissors", "shoe"]
)
# Apply mask to chunks, runs and func data
fmri_niimgs = index_img(fmri_filename, condition_mask)
conditions = conditions[condition_mask]
runs = runs[condition_mask]
# print len and shape
print(len(conditions))
print(fmri_niimgs.shape[-1])
print(len(runs))
####### NILEARN DECODING ####################################
Xs_nilearn = []
decoders_nilearn = []
cv = LeaveOneGroupOut()
for standardize_true_false in ["zscore_sample", False]:
    # explicitly avoid screening even though providing a small mask has the same effect
    screening_percentile = 100
    decoder_nilearn = Decoder(
        estimator="logistic_l2",
        mask=mask_filename,
        cv=cv,
        scoring="roc_auc_ovr",
        standardize=standardize_true_false,
        n_jobs=12,  # parallelizing
        screening_percentile=screening_percentile,
    )
    decoder_nilearn.fit(fmri_niimgs, conditions, groups=runs)
    print(f"Decoder results with standardize={standardize_true_false}")
    print(
        "Mean decoding accuracy:",
        np.mean(list(decoder_nilearn.cv_scores_.values())),
    )
    print("Chance level:", 1.0 / len(np.unique(conditions)))
    Xs_nilearn.append(
        (
            decoder_nilearn.masker_.fit_transform(fmri_niimgs),
            standardize_true_false,
        )
    )
    decoders_nilearn.append(decoder_nilearn)
####### NILEARN MASKING WITH/WITHOUT STANDARDIZATION AND SKLEARN DECODING ####################################
Xs_sklearn = []
decoders_sklearn = []
for standardize_true_false in ["zscore_sample", False]:
    # mask data before decoding,
    # using the standardize parameter (on/off) to match nilearn's behavior
    masker = NiftiMasker(
        mask_img=mask_filename, standardize=standardize_true_false
    )
    X = masker.fit_transform(fmri_niimgs)
    y = np.array(conditions)
    groups = np.array(runs)
    logo = LeaveOneGroupOut()
    # using one-vs-rest to closely match nilearn's behavior
    clf = OneVsRestClassifier(
        LogisticRegressionCV(
            solver="liblinear",
            penalty="l2",
            Cs=np.geomspace(2e-3, 2e4, 8) * 0.5,  # same as nilearn
            random_state=42,
        )
    )
    # Not using StandardScaler because it standardizes across features;
    # nilearn's zscore_sample standardizes across samples
    # scaler = StandardScaler()
    # if standardize_true_false:
    #     pipeline = Pipeline([("scaler", scaler), ("clf", clf)])
    # else:
    #     pipeline = Pipeline([("clf", clf)])
    decoder_sklearn = cross_validate(
        clf,
        X=X,
        y=y,
        groups=groups,
        cv=logo,
        n_jobs=12,
        scoring="roc_auc_ovr",
        return_indices=True,
        return_estimator=True,
    )
    print(
        f"Mean score in mask {mask_filename} with standardize={standardize_true_false}:",
        np.mean(decoder_sklearn["test_score"]),
    )
    X_to_append = X.copy()
    Xs_sklearn.append((X_to_append, standardize_true_false))
    decoders_sklearn.append(decoder_sklearn)
# check if the two standardization (or lack of them) gives similar results
for (X_nilearn, _), (X_sklearn, _) in zip(Xs_nilearn, Xs_sklearn):
    print(np.allclose(X_nilearn, X_sklearn))
    print(np.array_equal(X_nilearn, X_sklearn))
# returns True
# True
# check cv fold indices
for decoder_nilearn, decoder_sklearn in zip(
    decoders_nilearn, decoders_sklearn
):
    for i in range(len(decoder_nilearn.cv_)):  # all LeaveOneGroupOut splits
        print("Split", i + 1)
        print(
            "Test indices equal?",
            np.array_equal(
                decoder_nilearn.cv_[i][1],
                decoder_sklearn["indices"]["test"][i],
            ),
        )
        print(
            "Train indices equal?",
            np.array_equal(
                decoder_nilearn.cv_[i][0],
                decoder_sklearn["indices"]["train"][i],
            ),
        )
# returns True for all splits
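One of the suspected sources of divergence is the hyperparameter grid. The following minimal sketch contrasts sklearn's LogisticRegressionCV default grid (an int Cs=10 means 10 values log-spaced between 1e-4 and 1e4, per the sklearn docs) with the grid hard-coded in the snippet above; note the "nilearn-like" grid is copied from the snippet's own comment ("same as nilearn"), not taken from nilearn's source:

```python
import numpy as np

# sklearn's LogisticRegressionCV default grid: 10 values log-spaced in [1e-4, 1e4]
sklearn_default_Cs = np.logspace(-4, 4, 10)

# grid used in the snippet above to mimic nilearn (per the "same as nilearn" comment)
nilearn_like_Cs = np.geomspace(2e-3, 2e4, 8) * 0.5

print("sklearn default:", sklearn_default_Cs)
print("nilearn-like:   ", nilearn_like_Cs)
# the two grids span different ranges and densities, so the selected C
# (and thus the final model) can differ between the two implementations
```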
I think that a remaining difference concerning the inner cross-validation is that nilearn does model averaging while sklearn refits the model with the optimal parameters for the final evaluation. Indeed, if the metric used by nilearn is accuracy in an unbalanced setting, where the chance level would be (#classes - 1) / #classes, then what we show with the Decoder is wrong and the chance level is not 1 / #classes. What I still don't get is how this interacts with normalization.
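The averaging-vs-refit difference described above can be sketched on synthetic data (a hypothetical illustration, not nilearn's or sklearn's actual internals): with a fixed C, "averaging" keeps the mean of per-fold coefficients, while "refitting" trains one model on all training data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# strategy 1: fit one model per fold, then average the coefficients
coefs = []
for train_idx, _ in KFold(n_splits=5).split(X):
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X[train_idx], y[train_idx])
    coefs.append(clf.coef_)
avg_coef = np.mean(coefs, axis=0)

# strategy 2: refit a single model on all the data with the same C
refit_coef = LogisticRegression(C=1.0, max_iter=1000).fit(X, y).coef_

# the two coefficient vectors are generally close but not identical
print(np.allclose(avg_coef, refit_coef))
```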
Is there an existing issue for this?
Operating system
Operating system version
Linux Ubuntu 22.04.4
Python version
nilearn version
0.10.4.dev34+g58f12585
Expected behavior
As initially reported on neurostars here, the decoding performances should be similar, if not the same, using either nilearn or sklearn. But they aren't -- especially when one does not standardize the input time series / features.
While investigating, I observed the following so far:
- the time series / features (X) we get after masking (to later feed into a sklearn model) vs. after the automatic masking step within the nilearn Decoder are different.
- in the Decoder, hyperparameter search is done by default using the "CV" versions of sklearn models. But in the docs we specify, for example, LogisticRegression while we actually use LogisticRegressionCV.
- the default hyperparameter grids (e.g., the Cs searched by LogisticRegressionCV) are different in nilearn and sklearn.

I am currently looking into the ReNA clustering and ANOVA feature selection steps that are done in the nilearn Decoder. I have a feeling that they interfere with the decoding even if one switches them off via top-level Decoder parameters.

Current behavior & error messages
Accuracy scores:
Nilearn, with standardization: 0.6636432350718064
Sklearn, with standardization: 0.6693121693121693
Nilearn, without std.: 0.8784958427815571
Sklearn, without std.: 0.6243386243386243
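The chance-level point raised in the comments can be made concrete with a minimal sketch (synthetic labels, not the Haxby data): in a one-vs-rest binarization of a balanced 7-class problem, the "rest" class holds 6/7 of the samples, so a degenerate classifier that always predicts "rest" already scores (#classes - 1) / #classes in accuracy, not 1 / #classes.

```python
import numpy as np

n_classes = 7  # bottle, cat, chair, face, house, scissors, shoe
labels = np.repeat(np.arange(n_classes), 9)  # balanced synthetic labels

# binarize for one OvR sub-problem, e.g. "bottle" vs. rest
y_binary = (labels == 0).astype(int)

# accuracy of always predicting the majority ("rest") class
majority_accuracy = np.mean(y_binary == 0)
print(majority_accuracy)  # 6/7 ≈ 0.857, far above 1/7 ≈ 0.143
```

This is why an accuracy-based score on the inner OvR problems cannot be compared against a 1 / #classes chance level.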
Steps and code to reproduce bug