
[BUG] Discrepancy between decoding accuracies from nilearn's Decoder and pure sklearn implementation #4344

Open
man-shu opened this issue Mar 27, 2024 · 3 comments
Labels
Bug for bug reports

Comments

@man-shu
Contributor

man-shu commented Mar 27, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Operating system

  • Linux
  • Mac
  • Windows

Operating system version

Linux Ubuntu 22.04.4

Python version

  • 3.12
  • 3.11
  • 3.10
  • 3.9
  • 3.8

nilearn version

0.10.4.dev34+g58f12585

Expected behavior

As initially reported on neurostars here, the decoding performance should be similar, if not identical, whether one uses nilearn or sklearn. But it isn't -- especially when the input time series / features are not standardized.

While investigating, I observed the following so far:

  • the time series / features (X) obtained by masking manually (to later feed into a sklearn model) differ from those produced by the automatic masking step inside the nilearn Decoder.
  • with the nilearn Decoder, hyperparameter search is by default done with the "CV" versions of the sklearn models; the docs mention, for example, LogisticRegression while LogisticRegressionCV is actually used.
  • the default parameter grids (given to LogisticRegressionCV) differ between nilearn and sklearn (see the quick comparison right after this list).
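
To make the last point concrete, here is a quick side-by-side of the two grids (a minimal sketch; the nilearn values are the ones I pass explicitly in the script below, and the sklearn values are LogisticRegressionCV's documented default of Cs=10, i.e. 10 values log-spaced between 1e-4 and 1e4):

import numpy as np

nilearn_grid = np.geomspace(2e-3, 2e4, 8) * 0.5  # grid the Decoder hands to LogisticRegressionCV
sklearn_grid = np.logspace(-4, 4, 10)            # LogisticRegressionCV's own default (Cs=10)

print(nilearn_grid)
print(sklearn_grid)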

I am currently looking into the ReNA clustering and ANOVA feature selection steps performed in the nilearn Decoder. I suspect they interfere with the decoding even when they are switched off via the top-level Decoder parameters.

Current behavior & error messages

Accuracy scores:

Nilearn, with standardization: 0.6636432350718064
Sklearn, with standardization: 0.6693121693121693

Nilearn, without std.: 0.8784958427815571
Sklearn, without std.: 0.6243386243386243

Steps and code to reproduce bug

import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from nilearn import datasets
from nilearn.decoding import Decoder
from nilearn.image import index_img
from nilearn.maskers import NiftiMasker

haxby_dataset = datasets.fetch_haxby()
fmri_filename = haxby_dataset.func[0]
mask_filename = haxby_dataset.mask_vt[0]

###############################################################################

# Load behavioral information
behavioral = pd.read_csv(haxby_dataset.session_target[0], delimiter=" ")
conditions = behavioral["labels"]
runs = behavioral["chunks"]
condition_mask = conditions.isin(
    ["bottle", "cat", "chair", "face", "house", "scissors", "shoe"]
)

# Apply mask to chunks, runs and func data
fmri_niimgs = index_img(fmri_filename, condition_mask)
conditions = conditions[condition_mask]
runs = runs[condition_mask]

# print len and shape
print(len(conditions))
print(fmri_niimgs.shape[-1])
print(len(runs))

####### NILEARN DECODING ####################################
Xs_nilearn = []
params_nilearn = []
cv = LeaveOneGroupOut()
for standardize_true_false in ["zscore_sample", False]:
    # explicitly avoid screening even though providing a small mask has the same effect
    screening_percentile = 100
    decoder_nilearn = Decoder(
        estimator="logistic_l2",
        mask=mask_filename,
        cv=cv,
        scoring="accuracy",
        standardize=standardize_true_false,
        n_jobs=10,  # parallelizing
        screening_percentile=screening_percentile,
    )
    decoder_nilearn.fit(fmri_niimgs, conditions, groups=runs)
    print(f"Decoder results with standardize={standardize_true_false}")
    print(
        "Mean decoding accuracy:",
        np.mean(list(decoder_nilearn.cv_scores_.values())),
    )
    print("Chance level:", 1.0 / len(np.unique(conditions)))

    Xs_nilearn.append(
        (
            decoder_nilearn.masker_.fit_transform(fmri_niimgs),
            standardize_true_false,
        )
    )

####### NILEARN MASKING WITH/WITHOUT STANDARDIZATION AND SKLEARN DECODING ####################################

Xs_sklearn = []
for standardize_true_false in ["zscore_sample", False]:
    # mask data before decoding
    masker = NiftiMasker(
        mask_img=mask_filename, standardize=standardize_true_false
    )
    X = masker.fit_transform(fmri_niimgs)
    y = np.array(conditions)
    groups = np.array(runs)

    logo = LeaveOneGroupOut()
    clf = LogisticRegressionCV(
        solver="liblinear",
        penalty="l2",
        Cs=np.geomspace(2e-3, 2e4, 8) * 0.5,  # same as nilearn
        random_state=42,
    )

    # Not using StandardScaler because it standardizes across features;
    # nilearn's zscore_sample standardizes across samples
    # scaler = StandardScaler()
    # if standardize_true_false:
    #     pipeline = Pipeline([("scaler", scaler), ("clf", clf)])
    # else:
    #     pipeline = Pipeline([("clf", clf)])

    decoder_sklearn = cross_validate(
        clf,
        X=X,
        y=y,
        groups=groups,
        cv=logo,
        n_jobs=8,
        scoring="accuracy",
    )

    print(
        f"Mean score in mask {mask_filename} with standardize={standardize_true_false}:",
        np.mean(decoder_sklearn["test_score"]),
    )

    X_to_append = X.copy()
    Xs_sklearn.append((X_to_append, standardize_true_false))


# check that, for each standardization setting, nilearn and sklearn see the same masked data
for (X_nilearn, _), (X_sklearn, _) in zip(Xs_nilearn, Xs_sklearn):
    print(np.allclose(X_nilearn, X_sklearn))
# returns True
#         True
@man-shu man-shu added the Bug for bug reports label Mar 27, 2024
@bthirion
Member

Thx for looking at that. It would be useful to dig a bit deeper into why there is such a difference with standardize=False in the nilearn case only.

@man-shu
Contributor Author

man-shu commented Mar 28, 2024

So I found that the nilearn Decoder:

  1. converts the multiclass classification into N one-vs-rest (OvR) classification problems, where N = number of classes (= 7 in this case); see line#733 in decoder.py
  2. "solves" each OvR problem on each cross-validation split (in this case, N_splits = N_groups = 12), so the number of problems becomes N * N_groups (= 7 * 12); see line#785 in decoder.py
  3. for each cross-validation split, fits a different learner for each hyperparameter (in this case, 8 default values) and picks the best-performing one (via LogisticRegressionCV)
  4. reports the final score as the average across these N * N_groups scores
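
In other words, just to make the combinatorics explicit (my own count from the numbers above):

n_classes = 7   # bottle, cat, chair, face, house, scissors, shoe
n_splits = 12   # LeaveOneGroupOut over the 12 runs
print(n_classes * n_splits)  # 84 one-vs-rest problems, each searching over 8 candidate C values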

Therefore, since nilearn solves an OvR problem, which turns out to be inherently unbalanced in this case (and probably in most others), accuracy scores are misleading. I think either roc_auc_ovr or balanced_accuracy would be a better evaluation metric (@bthirion would know better), and indeed, the roc_auc_ovr scores are closer between nilearn and the now-modified sklearn implementation:

Nilearn, with standardization: 0.8987972761120909
Sklearn, with standardization: 0.9416274740348817

Nilearn, without std.: 0.838477366255144
Sklearn, without std.: 0.9368508720360573
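
As a back-of-the-envelope check of the class-imbalance point (my own arithmetic, not something the Decoder reports): with 7 classes, each one-vs-rest binary problem contains roughly 6/7 negative samples, so a degenerate classifier that always predicts the majority class already scores about 0.86 accuracy -- suspiciously close to the 0.878 I got from nilearn without standardization in my first comment.

n_classes = 7
majority_accuracy = (n_classes - 1) / n_classes  # always predicting "rest" in a one-vs-rest split
print(majority_accuracy)  # ~0.857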

I updated the sklearn implementation so that it should be similar to what nilearn is doing, but I think it still isn't exactly the same (in particular, I do not yet completely understand the behavior of OneVsRestClassifier):

import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from nilearn import datasets
from nilearn.decoding import Decoder
from nilearn.image import index_img
from nilearn.maskers import NiftiMasker

haxby_dataset = datasets.fetch_haxby()
fmri_filename = haxby_dataset.func[0]
mask_filename = haxby_dataset.mask_vt[0]

###############################################################################

# Load behavioral information
behavioral = pd.read_csv(haxby_dataset.session_target[0], delimiter=" ")
conditions = behavioral["labels"]
runs = behavioral["chunks"]
condition_mask = conditions.isin(
    ["bottle", "cat", "chair", "face", "house", "scissors", "shoe"]
)

# Apply mask to chunks, runs and func data
fmri_niimgs = index_img(fmri_filename, condition_mask)
conditions = conditions[condition_mask]
runs = runs[condition_mask]

# print len and shape
print(len(conditions))
print(fmri_niimgs.shape[-1])
print(len(runs))

####### NILEARN DECODING ####################################
Xs_nilearn = []
decoders_nilearn = []
cv = LeaveOneGroupOut()
for standardize_true_false in ["zscore_sample", False]:
    # explicitly avoid screening even though providing a small mask has the same effect
    screening_percentile = 100
    decoder_nilearn = Decoder(
        estimator="logistic_l2",
        mask=mask_filename,
        cv=cv,
        scoring="roc_auc_ovr",
        standardize=standardize_true_false,
        n_jobs=12,  # parallelizing
        screening_percentile=screening_percentile,
    )
    decoder_nilearn.fit(fmri_niimgs, conditions, groups=runs)
    print(f"Decoder results with standardize={standardize_true_false}")
    print(
        "Mean decoding score (roc_auc_ovr):",
        np.mean(list(decoder_nilearn.cv_scores_.values())),
    )
    print("Chance level:", 1.0 / len(np.unique(conditions)))

    Xs_nilearn.append(
        (
            decoder_nilearn.masker_.fit_transform(fmri_niimgs),
            standardize_true_false,
        )
    )
    decoders_nilearn.append(decoder_nilearn)

####### NILEARN MASKING WITH/WITHOUT STANDARDIZATION AND SKLEARN DECODING ####################################

Xs_sklearn = []
decoders_sklearn = []
for standardize_true_false in ["zscore_sample", False]:
    # mask data before decoding
    # using the standardize parameter (on/off) to match nilearn's behavior
    masker = NiftiMasker(
        mask_img=mask_filename, standardize=standardize_true_false
    )
    X = masker.fit_transform(fmri_niimgs)
    y = np.array(conditions)
    groups = np.array(runs)

    logo = LeaveOneGroupOut()

    # using one-vs-rest to closely match nilearn's behavior
    clf = OneVsRestClassifier(
        LogisticRegressionCV(
            solver="liblinear",
            penalty="l2",
            Cs=np.geomspace(2e-3, 2e4, 8) * 0.5,  # same as nilearn
            random_state=42,
        )
    )

    # Not using StandardScaler because it standardizes across features;
    # nilearn's zscore_sample standardizes across samples
    # scaler = StandardScaler()
    # if standardize_true_false:
    #     pipeline = Pipeline([("scaler", scaler), ("clf", clf)])
    # else:
    #     pipeline = Pipeline([("clf", clf)])

    decoder_sklearn = cross_validate(
        clf,
        X=X,
        y=y,
        groups=groups,
        cv=logo,
        n_jobs=12,
        scoring="roc_auc_ovr",
        return_indices=True,
        return_estimator=True,
    )

    print(
        f"Mean score in mask {mask_filename} with standardize={standardize_true_false}:",
        np.mean(decoder_sklearn["test_score"]),
    )

    X_to_append = X.copy()
    Xs_sklearn.append((X_to_append, standardize_true_false))
    decoders_sklearn.append(decoder_sklearn)


# check that, for each standardization setting, nilearn and sklearn see the same masked data
for (X_nilearn, _), (X_sklearn, _) in zip(Xs_nilearn, Xs_sklearn):
    print(np.allclose(X_nilearn, X_sklearn))
    print(np.array_equal(X_nilearn, X_sklearn))
# returns True
#         True


# check cv fold indices
for decoder_nilearn, decoder_sklearn in zip(
    decoders_nilearn, decoders_sklearn
):
    for i in range(len(decoder_nilearn.cv_)):  # check every CV split
        print("Split", i + 1)
        print(
            "Test indices equal?",
            np.array_equal(
                decoder_nilearn.cv_[i][1],
                decoder_sklearn["indices"]["test"][i],
            ),
        )
        print(
            "Train indices equal?",
            np.array_equal(
                decoder_nilearn.cv_[i][0],
                decoder_sklearn["indices"]["train"][i],
            ),
        )
# returns True for all splits

@bthirion
Member

I think that a remaining difference concerning the inner cross-validation is that Nilearn averages the models while sklearn refits a model with the optimal parameters for the final evaluation.

Indeed, if the accuracy metric used by Nilearn is computed in an unbalanced one-vs-rest setting, where the chance level would be (#classes - 1) / #classes, then what we show with the Decoder is wrong: the chance level is not 1 / #classes.

What I still don't get is why this interacts with normalization.
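
If the refit-vs-averaging difference is the culprit, one way to probe it on the sklearn side could be LogisticRegressionCV's refit parameter (an untested sketch, not what the Decoder literally does): with refit=False, the coefficients and the selected C are averaged across the inner folds instead of being refit on the full training split, which seems closer in spirit to Nilearn's model averaging.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

Cs = np.geomspace(2e-3, 2e4, 8) * 0.5  # same grid as in the scripts above

# sklearn default: pick the best C in the inner CV, then refit on the whole training split
clf_refit = LogisticRegressionCV(solver="liblinear", penalty="l2", Cs=Cs, refit=True)

# closer in spirit to Nilearn's averaging: average coefficients (and C) across the inner folds
clf_averaged = LogisticRegressionCV(solver="liblinear", penalty="l2", Cs=Cs, refit=False)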
