## Data preparation for encoding/decoding models

For this example, we will use the data from Haxby et al. (2001), which are shared via OpenNeuro:

https://openneuro.org/datasets/ds000105/versions/3.0.0

The data are formatted according to the BIDS standard: https://bids-specification.readthedocs.io/en/stable/index.html

First, import required dependencies. You can install these using `pip install -r requirements.txt` from the main repo directory.

In [32]:
%load_ext autoreload
%autoreload 2

import os
import nilearn
from nilearn import datasets, plotting
from nilearn.image import load_img, mean_img, resample_img
from nilearn.maskers import NiftiMasker
from nilearn.glm.first_level import FirstLevelModel
from nilearn.glm.second_level import SecondLevelModel
from nilearn.plotting import plot_stat_map, plot_design_matrix
import h5py
import numpy as np
import nibabel as nib
import datalad.api as dl
from bids import BIDSLayout
from nilearn.glm.first_level import make_first_level_design_matrix
import pandas as pd
import matplotlib.pyplot as plt
from templateflow import api as tflow
import templateflow
from utils import (get_difumo_mask, 
                   get_subject_common_brain_mask,
                   get_group_common_mask,
                   get_subject_runs)



### Get the data using DataLad

We will use a tool called [DataLad](https://www.datalad.org/) to obtain the data from OpenNeuro.

We will download the raw data as well as the preprocessed derivatives generated with [fMRIPrep](https://fmriprep.org/en/stable/). Note that downloading the derivatives can take quite a while, depending on the speed of your connection.

In [20]:
data_dir = "/Users/poldrack/data_unsynced/ds000105"
assert os.path.exists(data_dir), "Data directory not found: %s" % data_dir

# get the raw data
ds = dl.clone(
    path=data_dir,
    source="https://github.com/OpenNeuroDatasets/ds000105.git",
)
dl.get(dataset=data_dir, recursive=True)

get_fmriprep = False  # set to False once the fMRIPrep derivatives have been downloaded
fmriprep_dir = os.path.join(data_dir, 'derivatives', 'fmriprep')

# get the preprocessed derivatives - this takes some time!
if get_fmriprep:
    dl.clone(
        path=fmriprep_dir,
        source='https://github.com/OpenNeuroDerivatives/ds000105-fmriprep.git')
    dl.get(dataset=fmriprep_dir, recursive=True)

[INFO] Ensuring presence of Dataset(/Users/poldrack/data_unsynced/ds000105) to get /Users/poldrack/data_unsynced/ds000105 


### Query the dataset using PyBIDS

Because the dataset is organized using the BIDS standard, we can use the [PyBIDS](https://bids-standard.github.io/pybids/) tool to query the dataset and obtain useful metadata.


In [3]:
# load the dataset using pybids and get runs for each subject

def get_layouts(data_dir, fmriprep_dir):
    
    layout = BIDSLayout(data_dir)
    deriv_layout = BIDSLayout(fmriprep_dir, derivatives=True, validate=False)
    return layout, deriv_layout

layout, deriv_layout = get_layouts(data_dir, fmriprep_dir)


Example contents of 'dataset_description.json':
{"Name": "Example dataset", "BIDSVersion": "1.0.2", "GeneratedBy": [{"Name": "Example pipeline"}]}


In [5]:
deriv_layout

BIDS Layout: .../ds000105/derivatives/fmriprep | Subjects: 6 | Sessions: 0 | Runs: 71
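
PyBIDS can answer these queries because BIDS filenames encode their metadata as `key-value` pairs. The sketch below is not how PyBIDS is implemented; it only illustrates the naming convention on a single path. The helper `parse_bids_entities` and the example filename are hypothetical, chosen to resemble the files in this dataset. Note that PyBIDS exposes longer entity names (e.g. `subject`) rather than the filename keys (e.g. `sub`).

```python
def parse_bids_entities(filename):
    """Split a BIDS-style filename into its key-value entities, suffix, and extension."""
    # Drop any directory components, then split off everything after the first dot
    basename = filename.split("/")[-1]
    stem, _, extension = basename.partition(".")
    parts = stem.split("_")
    entities = {}
    # All parts except the last are "key-value" pairs; the last part is the suffix
    for part in parts[:-1]:
        key, _, value = part.partition("-")
        entities[key] = value
    entities["suffix"] = parts[-1]
    entities["extension"] = extension
    return entities


# Illustrative filename (an assumption, not read from disk)
example = "sub-1/func/sub-1_task-objectviewing_run-01_bold.nii.gz"
entities = parse_bids_entities(example)
print(entities)
```

A query such as `layout.get(subject=1, run=1, suffix='bold')` can be thought of as filtering files on these parsed fields.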

### Create a common brain mask

Each run will have slightly different voxels included in its brain mask, but we want a single mask that is valid for every run. We therefore generate a mask containing only the voxels present in all of the individual masks, i.e. the intersection across all subjects and runs.

In [4]:
group_mask = get_group_common_mask(layout)
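
As a rough illustration of the intersection step, here is a minimal NumPy sketch on toy arrays. The implementation of `get_group_common_mask` in `utils` is not shown here, so this is an assumption about the idea rather than a copy of that code; `intersect_masks` below is a hypothetical helper. For NIfTI images, nilearn provides `nilearn.masking.intersect_masks`, which may be a more convenient route.

```python
import numpy as np


def intersect_masks(masks):
    """Return a boolean mask that is True only where every input mask is True."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stacked.all(axis=0)


# Two toy 2D "brain masks" that disagree on two voxels
run1_mask = np.array([[1, 1, 0],
                      [1, 1, 1]])
run2_mask = np.array([[1, 1, 1],
                      [0, 1, 1]])

common_mask = intersect_masks([run1_mask, run2_mask])
print(common_mask.astype(int))
```

Only voxels included in both toy masks survive, so the common mask can never be larger than the smallest individual mask.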

### Confound regression

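We use the confound time series estimated by fMRIPrep (motion parameters, aCompCor components, and cosine drift regressors) to generate a denoised version of the data for each run.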



In [37]:
subjects = [int(sub) for sub in layout.get_subjects()]
for subject in subjects:
    runs = get_subject_runs(subject, data_dir)
    print(f'Subject {subject} has {len(runs)} runs')
    mask_img = get_subject_common_brain_mask(subject, data_dir)
    for run in runs:
        preproc_file = deriv_layout.get(subject=subject, run=run, 
                                        desc='preproc', space='MNI152NLin2009cAsym',
                                        suffix='bold', extension='nii.gz', 
                                        return_type='file')
        # check that the query is unambiguous before indexing into the result
        assert len(preproc_file) == 1, f"Found {len(preproc_file)} preproc files for subject {subject} run {run}"
        preproc_img = nib.load(preproc_file[0])
        t_r = deriv_layout.get_metadata(preproc_file[0])['RepetitionTime']

        confound_file = deriv_layout.get(subject=subject, run=run, 
                                        desc='confounds', 
                                        suffix='timeseries', extension='tsv', 
                                        return_type='file')
        assert len(confound_file) == 1, f"Found {len(confound_file)} confound files for subject {subject} run {run}"
        confounds = pd.read_csv(confound_file[0], sep='\t').bfill()
        # aCompCor components are estimated from high-pass filtered data,
        # so the cosine drift regressors must be included alongside them
        confound_prefixes = ('trans_', 'rot_', 'a_comp_cor_', 'cosine_')
        confound_cols = [c for c in confounds.columns if c.startswith(confound_prefixes)]
        confounds_selected = confounds[confound_cols]
        cleaned_img = nilearn.image.clean_img(preproc_img,
                                              confounds=confounds_selected,
                                              t_r=t_r, mask_img=mask_img)
        # save the denoised image next to the preprocessed one
        cleaned_img_file = preproc_file[0].replace('preproc', 'cleaned')
        assert cleaned_img_file != preproc_file[0]
        cleaned_img.to_filename(cleaned_img_file)

Subject 1 has 12 runs
Subject 2 has 12 runs
Subject 3 has 12 runs
Subject 4 has 12 runs
Subject 5 has 11 runs
Subject 6 has 12 runs
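
To make the regression step concrete, the following sketch removes confounds from simulated time series by ordinary least squares and keeps the residuals. This is only the core idea: `clean_img` may also apply steps such as detrending and standardization depending on its arguments, so the sketch is not a reimplementation of it. The helper `regress_out` and the simulated data are hypothetical.

```python
import numpy as np


def regress_out(data, confounds):
    """Remove the least-squares fit of the confounds (plus an intercept) from data.

    data: array of shape (n_timepoints, n_voxels)
    confounds: array of shape (n_timepoints, n_confounds)
    """
    design = np.column_stack([np.ones(len(confounds)), confounds])
    betas, *_ = np.linalg.lstsq(design, data, rcond=None)
    return data - design @ betas


# Simulated example: two "voxels" contaminated by three confound regressors
rng = np.random.default_rng(0)
n_timepoints = 50
confounds = rng.standard_normal((n_timepoints, 3))
signal = rng.standard_normal((n_timepoints, 2))
data = signal + confounds @ rng.standard_normal((3, 2))

cleaned = regress_out(data, confounds)
# After regression, the residuals are orthogonal to every confound column
print(np.abs(confounds.T @ cleaned).max())
```

The printed value should be zero up to floating-point error, which is what "denoised with respect to these confounds" means in this setting.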


### Select task block timepoints

Drop the first two TRs from each task block and generate a task label for each timepoint.

In [39]:
# NOTE: `run` is the value left over from the confound-regression loop above
for subject in [1]: # subjects:
    events_file = layout.get(subject=subject, run=run, datatype='func',
                             suffix='events', extension='tsv',
                             return_type='file')[0]
    events = pd.read_csv(events_file, sep='\t')


In [66]:
conditions = events.trial_type.unique().tolist()
conditions.sort()
print(conditions)
blocklen = int(np.floor(24 / t_r)) - 2  # timepoints per 24-second block after removing 2 TRs

timepoints = np.arange(0, preproc_img.shape[3]*t_r, t_r)
offset = 4 # offset from beginning of block in secs

for condition in conditions:
    match_df = events[events.trial_type == condition]
    onset = match_df.onset.tolist()[0]
    print(f'Condition {condition} starts at {onset} secs')
    onset_timepoint = np.where(timepoints >= (onset + offset) )[0][0]
    print(f'Condition {condition} starts at timepoint {onset_timepoint} ({timepoints[onset_timepoint]} secs)')




['bottle', 'cat', 'chair', 'face', 'house', 'scissors', 'scrambledpix', 'shoe']
Condition bottle starts at 12.0 secs
Condition bottle starts at timepoint 7 (17.5 secs)
Condition cat starts at 228.0 secs
Condition cat starts at timepoint 93 (232.5 secs)
Condition chair starts at 84.0 secs
Condition chair starts at timepoint 36 (90.0 secs)
Condition face starts at 156.0 secs
Condition face starts at timepoint 64 (160.0 secs)
Condition house starts at 48.0 secs
Condition house starts at timepoint 21 (52.5 secs)
Condition scissors starts at 264.0 secs
Condition scissors starts at timepoint 108 (270.0 secs)
Condition scrambledpix starts at 120.0 secs
Condition scrambledpix starts at timepoint 50 (125.0 secs)
Condition shoe starts at 192.0 secs
Condition shoe starts at timepoint 79 (197.5 secs)
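
The cell above locates the first retained timepoint of each block. One way to turn those indices into a label per timepoint is sketched below, under some assumptions: the helper `label_timepoints`, the `'rest'` label for unlabeled timepoints, and the toy run length of 40 volumes are all hypothetical. The two onsets and the TR of 2.5 secs are taken from the output above.

```python
import numpy as np


def label_timepoints(block_onsets, n_timepoints, t_r, offset=4.0, block_secs=24.0):
    """Return one label per timepoint; timepoints outside retained blocks are 'rest'."""
    times = np.arange(n_timepoints) * t_r
    # Samples kept per block, mirroring `blocklen` in the cell above
    blocklen = int(np.floor(block_secs / t_r)) - 2
    labels = np.full(n_timepoints, "rest", dtype=object)
    for condition, onset in block_onsets.items():
        # First acquisition at or after the block onset plus the offset
        start = int(np.where(times >= onset + offset)[0][0])
        labels[start:start + blocklen] = condition
    return labels


# Block onsets (in secs) for two of the conditions printed above
onsets = {"bottle": 12.0, "house": 48.0}
labels = label_timepoints(onsets, n_timepoints=40, t_r=2.5)
print(labels[5:16])
```

With these settings each block contributes `blocklen` labeled timepoints, and everything else (including the dropped TRs at the start of each block) stays `'rest'`.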


In [53]:
events[:20]

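|   | onset | duration | trial_type |
|---|-------|----------|------------|
| 0 | 12.0 | 0.5 | bottle |
| 1 | 14.0 | 0.5 | bottle |
| 2 | 16.0 | 0.5 | bottle |
| 3 | 18.0 | 0.5 | bottle |
| 4 | 20.0 | 0.5 | bottle |
| 5 | 22.0 | 0.5 | bottle |
| 6 | 24.0 | 0.5 | bottle |
| 7 | 26.0 | 0.5 | bottle |
| 8 | 28.0 | 0.5 | bottle |
| 9 | 30.0 | 0.5 | bottle |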
