# Noise Ceiling Analysis


## Requirements

This notebook requires the `noiseceiling` package in addition to the project's usual dependencies.
The easiest way to install the usual dependencies is to use mamba to create a new environment from the `environment.yml` file in the root of this repository:

```bash
mamba env create -f environment.yml
mamba activate acnets
```

In [1]:
# 0. SETUP

%reload_ext autoreload
%autoreload 3

import pandas as pd
from pathlib import Path
import xarray as xr
import scipy.stats as stats
import numpy as np
from src.acnets.pipeline import ConnectivityPipeline, ConnectivityVectorizer
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC
from tqdm.auto import tqdm

from src.acnets.pipeline import Parcellation

tqdm.pandas()

## Parameters

These parameters can be set on the command line when running the notebook, or edited directly in the cell below.

In [2]:
# PARAMETERS

N_CV_SPLITS = 100                       # number of cross-validation splits
N_TEST_SUBJECTS = 8                     # test size for cross-validation (number of subjects)

MODELS_DIR = Path('models/')            # directory to save models
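
The two cross-validation parameters suggest a repeated stratified split. As a minimal sketch, assuming scikit-learn's `StratifiedShuffleSplit` is the intended splitter and using a hypothetical, balanced set of 32 labels (the real number of subjects is not stated here), each split holds out `N_TEST_SUBJECTS` subjects while preserving the group proportions:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

N_CV_SPLITS = 100      # same values as in the parameters cell
N_TEST_SUBJECTS = 8

# Hypothetical labels: 16 subjects per group (illustration only, not the real dataset).
y_demo = np.array([0] * 16 + [1] * 16)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # stand-in for subject ids

cv_demo = StratifiedShuffleSplit(n_splits=N_CV_SPLITS,
                                 test_size=N_TEST_SUBJECTS,
                                 random_state=0)

test_sizes = []          # number of held-out subjects per split
test_class_counts = []   # number of held-out subjects from group 1 per split
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    test_sizes.append(len(test_idx))
    test_class_counts.append(int(y_demo[test_idx].sum()))

print(len(test_sizes), set(test_sizes), set(test_class_counts))
```

With only 8 test subjects per split, each split's accuracy can only take values in steps of 1/8, which is worth keeping in mind when reading the per-split standard deviations.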

## Data

Here we load the data from the `data/julia2018/` dataset. These files contain the connectivity matrices for each participant, for each combination of parcellation and connectivity metric. For the remainder of this notebook, we focus only on the `dosenbach2010` parcellation atlas.

In [6]:
# DATA PREPARATION
parcellation = Parcellation(atlas_name='dosenbach2010').fit()

subjects = parcellation.dataset_.coords['subject'].values

# extract group labels (AVGP or NVGP) from subject ids (e.g. AVGP-01)
subject_labels = [s[:4] for s in subjects]  

X = subjects.reshape(-1, 1)  # subject ids, shape: (n_subjects, 1)

y_encoder = LabelEncoder()
y = y_encoder.fit_transform(subject_labels)     # labels, shape: (n_subjects,)
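
To make the label extraction concrete, here is a small sketch with hypothetical subject ids that follow the `AVGP-01` pattern mentioned above (the ids are made up for illustration). The first four characters give the group, and `LabelEncoder` maps the group names to integers in alphabetical order:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical subject ids (illustration only).
demo_subjects = np.array(['AVGP-01', 'AVGP-02', 'NVGP-01', 'NVGP-02'])

# The group label is the first four characters of the id.
demo_labels = [s[:4] for s in demo_subjects]

demo_X = demo_subjects.reshape(-1, 1)        # shape: (n_subjects, 1)

demo_encoder = LabelEncoder()
demo_y = demo_encoder.fit_transform(demo_labels)

print(list(demo_encoder.classes_))   # group names, sorted alphabetically
print(demo_y.tolist())               # integer label per subject
```

`demo_encoder.inverse_transform` recovers the group names from the integer labels when reporting results.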


## Feature extraction

The feature extraction pipeline is composed of the following steps:

1. Extract connectivity matrices from the data
2. Vectorize the connectivity matrices
3. Standardize the vectorized connectivity features
4. Remove zero-variance features
5. Select the top features (at most 10) based on the coefficients of an L1-penalized linear SVM classifier
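
Steps 3 to 5 can be tried in isolation. The sketch below uses synthetic data from `make_classification` in place of connectivity vectors (an assumption for illustration; the project-specific `ConnectivityPipeline` and `ConnectivityVectorizer` steps are omitted) and caps the selection at 10 features with a fixed integer rather than a callable:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for vectorized connectivity features (illustration only).
X_demo, y_demo = make_classification(n_samples=40, n_features=50,
                                     n_informative=5, random_state=0)

demo_pipe = Pipeline([
    ('scale', StandardScaler()),        # step 3: standardize each feature
    ('zerovar', VarianceThreshold()),   # step 4: drop constant features
    ('select', SelectFromModel(         # step 5: keep at most 10 features
        LinearSVC(penalty='l1', dual=False, max_iter=10000),
        max_features=10)),
])

X_selected = demo_pipe.fit_transform(X_demo, y_demo)
print(X_selected.shape)
```

The L1 penalty drives many coefficients to exactly zero, so fewer than 10 features may survive; `max_features` is an upper bound, not a target.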

In [28]:
# DEFINE PIPELINE

pipe  = Pipeline([
    ('connectivity', ConnectivityPipeline(kind='partial correlation')),
    ('vectorize', ConnectivityVectorizer()),
    ('scale', StandardScaler()),
    ('zerovar', VarianceThreshold()),
    ('select', SelectFromModel(LinearSVC(penalty='l1', dual=False, max_iter=10000),
                               max_features=lambda x: min(10, x.shape[1]))),
    # ('clf', SVC(kernel='linear', C=1))
])

# DEBUG: fit on all subjects with no held-out data (a classifier scored on these features would be expected to overfit, i.e., score=1)
X_features = pipe.fit_transform(X, y)

# WARNING: the `noiseceiling` package does not install out of the box.
# Download the source code from https://github.com/lukassnoek/noiseceiling,
# change the `sklearn` dependency to `scikit-learn` in `setup.py`,
# and run `pip install .` in the source code directory.

from noiseceiling import compute_nc_classification
compute_nc_classification(pd.DataFrame(X_features), pd.Series(y))

# RESULT: the noise ceiling analysis is not applicable to this dataset because X contains no repeats.

ValueError: There are no repeats in your data.
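
The error can be anticipated before calling the package. The sketch below assumes that a "repeat" means two or more rows of `X` with identical feature values, which is an interpretation of the error message rather than a documented definition. The helper `has_repeats` is hypothetical and is not part of `noiseceiling`:

```python
import pandas as pd


def has_repeats(X: pd.DataFrame) -> bool:
    """Return True if at least two rows of X have identical values in every column."""
    return bool(X.duplicated(keep=False).any())


# Toy feature tables (illustration only).
X_unique = pd.DataFrame({'f1': [0.1, 0.2, 0.3], 'f2': [1.0, 2.0, 3.0]})
X_repeated = pd.DataFrame({'f1': [0.1, 0.1, 0.3], 'f2': [1.0, 1.0, 3.0]})

print(has_repeats(X_unique))
print(has_repeats(X_repeated))
```

With one feature vector per subject and continuous connectivity values, exact duplicates are very unlikely, which is consistent with the error above.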

## Verify the pipeline

Here we verify that the pipeline works by running it on all aggregation strategies.

In [37]:
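# TEST VARIOUS AGGREGATIONS (calculate cross-validated accuracy and bootstrap CI)

# NOTE: CV is not defined in the cells shown here; it is assumed to be a
# stratified shuffle split built from the parameters above.
CV = StratifiedShuffleSplit(n_splits=N_CV_SPLITS, test_size=N_TEST_SUBJECTS)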

for timeseries_aggregation, connectivity_aggregation in [
    (None, None),                # no aggregation (regions)
    ('network', None),           # time-series aggregation region->network
    ('random_network', None),    # time-series aggregation region->random_network
    (None, 'network'),           # connectivity matrix aggregation region->network
    (None, 'random_network'),    # connectivity matrix aggregation region->random_network
    ]:

    pipe.set_params(connectivity__atlas='dosenbach2010',
                    connectivity__kind='partial correlation',
                    connectivity__timeseries_aggregation=timeseries_aggregation,
                    connectivity__connectivity_aggregation=connectivity_aggregation)

    scores = cross_val_score(pipe, X, y,
                            cv=CV,
                            scoring='accuracy',
                            n_jobs=-1)
    bootstrap_ci = stats.bootstrap((scores,), np.mean)  # data is passed as a sequence of samples

    print(f'[timeseries={timeseries_aggregation}, connectivity={connectivity_aggregation}]')
    print('Test accuracy (mean ± std): {:.2f} ± {:.2f}'.format(scores.mean(), scores.std()))
    print(bootstrap_ci.confidence_interval, '\n')

[timeseries=None, connectivity=None]
Test accuracy (mean ± std): 0.47 ± 0.16
ConfidenceInterval(low=0.44125, high=0.505) 

[timeseries=network, connectivity=None]
Test accuracy (mean ± std): 0.74 ± 0.14
ConfidenceInterval(low=0.7075, high=0.76375) 

[timeseries=random_network, connectivity=None]
Test accuracy (mean ± std): 0.49 ± 0.13
ConfidenceInterval(low=0.46, high=0.51) 

[timeseries=None, connectivity=network]
Test accuracy (mean ± std): 0.50 ± 0.19
ConfidenceInterval(low=0.46625, high=0.54) 

[timeseries=None, connectivity=random_network]
Test accuracy (mean ± std): 0.50 ± 0.15
ConfidenceInterval(low=0.46875, high=0.525) 
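
The confidence intervals above come from bootstrapping the mean of the per-split accuracies. The following sketch reproduces that step on simulated scores (random multiples of 1/8, an assumption standing in for real cross-validation output), so the numbers it prints are not comparable to the results above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated accuracies for 100 splits with 8 test subjects each (illustration only).
demo_scores = rng.integers(0, 9, size=100) / 8

# stats.bootstrap expects a sequence of samples, hence the one-element tuple.
demo_result = stats.bootstrap((demo_scores,), np.mean, confidence_level=0.95)
demo_ci = demo_result.confidence_interval

print(f'mean={demo_scores.mean():.3f}, CI=({demo_ci.low:.3f}, {demo_ci.high:.3f})')
```

Because shuffle-split iterations reuse the same subjects, the per-split scores are not independent, so an interval computed this way is likely narrower than the true uncertainty and is best read as descriptive.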



In [41]:

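def get_model_name(params):
    """Helper function to generate a unique model name from the pipeline parameters.

    Example: atlas 'dosenbach2010', kind 'partial correlation', time-series
    aggregation 'network', and no connectivity aggregation give
    'dosenbach2010_kind-partialcorrelation_tagg-network_cagg-none'.
    """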

    atlas = params['connectivity__atlas']
    kind = params['connectivity__kind'].replace(' ', '')
    tagg = params['connectivity__timeseries_aggregation'] or 'region'  # None -> 'region' (no time-series aggregation)
    cagg = params['connectivity__connectivity_aggregation'] or 'none'  # None -> 'none' (no connectivity aggregation)
    tagg = tagg.replace('random_network', 'random')  # shorten 'random_network' to 'random'
    cagg = cagg.replace('random_network', 'random')  # shorten 'random_network' to 'random'
    name = f'{atlas}_kind-{kind}_tagg-{tagg}_cagg-{cagg}'

    return name
