# Interfacing with project data
We've written wrapper functions so that you don't need to do any complex file I/O with the project datasets, though you're welcome to write your own if you're interested. The loading functions for the three datasets are provided below.

Each function takes the path to the directory where you downloaded the datasets and the subject ID, plus any optional arguments specific to that dataset.

For the fMRI dataset, the function returns two variables: the data and the labels (with chunks).

For the EEG datasets, the functions return three variables: the data, the labels (with chunks), and the channel names.
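
Because each loader returns a chunk assignment alongside the data, the chunks can serve as cross-validation folds. The following is a minimal sketch on synthetic data, assuming only that the returned table has a `chunks` column; the helper name `leave_one_chunk_out` and all values are made up for illustration and are not part of the project code.

```python
import numpy as np
import pandas as pd


def leave_one_chunk_out(events):
    """Yield (held_out_chunk, train_idx, test_idx) for each unique chunk.

    Hypothetical helper, not part of the project code. Indices are
    positional, so they can index the rows of the data matrix directly.
    """
    chunks = events['chunks'].to_numpy()
    for chunk in np.unique(chunks):
        test_idx = np.flatnonzero(chunks == chunk)
        train_idx = np.flatnonzero(chunks != chunk)
        yield chunk, train_idx, test_idx


# Synthetic stand-in for a loader's output: 6 samples x 4 features, 3 chunks.
data = np.arange(24, dtype=float).reshape(6, 4)
events = pd.DataFrame({'labels': ['face', 'house'] * 3,
                       'chunks': [0, 0, 1, 1, 2, 2]})

folds = list(leave_one_chunk_out(events))
for chunk, train_idx, test_idx in folds:
    print(chunk, data[train_idx].shape, data[test_idx].shape)
```

Holding out whole chunks, rather than random samples, keeps samples from the same run or recording segment out of both the training and test sets at once.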

## fMRI Dataset

These data were first published in Haxby et al. (2001, Science). To use this function, you will need to install an additional module called "nibabel", which is used to load the data.

http://www.pymvpa.org/datadb/haxby2001.html

In [1]:
# data.pymvpa.org/datasets/haxby2001
def load_haxby_data(datapath, sub, mask=None):
    # input arguments:
    # datapath (string): path to the root directory
    # sub (string): subject ID (e.g. subj1, subj2, etc)
    # mask (string, optional): name of a mask file in the subject directory,
    #     without the '.nii.gz' extension
    # output:
    # maskeddata (numpy array): samples x voxels data matrix
    #     (samples x 3D volume if no mask is given)
    # fmrilabel (pandas dataframe): length samples
    import os
    import nibabel as nib
    import pandas as pd
    import numpy as np

    fmriobj = nib.load(os.path.join(datapath, sub, 'train.nii.gz'))
    # get_fdata() returns the image data as a floating-point array
    fmridata, fmriheader = fmriobj.get_fdata(), fmriobj.header
    # shift last axis to the first
    fmridata = np.rollaxis(fmridata, -1)
    # labels.txt is whitespace-delimited
    fmrilabel = pd.read_csv(os.path.join(datapath, sub, 'labels.txt'), sep=r'\s+')
    if mask is not None:
        maskobj = nib.load(os.path.join(datapath, sub, mask + '.nii.gz'))
        maskdata, maskheader = maskobj.get_fdata(), maskobj.header
        maskeddata = fmridata[:, maskdata > 0]  # timepoints axis 0, voxels axis 1
        # need to figure out how to mask features back to original geometry
        print(maskeddata.shape)
    else:
        maskeddata = fmridata

    return maskeddata, fmrilabel[fmrilabel.chunks != 11]  # not loading the testing run that we've set aside
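
The loader leaves open how to map masked features back to the original geometry. One possible approach, sketched here on synthetic arrays rather than the real images, is to reuse the same boolean mask for assignment: NumPy boolean indexing reads and writes voxels in the same order, so per-voxel values land back in their original positions. The array shapes and variable names below are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-ins (assumed shapes): 5 timepoints of a 2 x 3 x 4 volume,
# already rolled so that time is the first axis, plus a 3D mask.
rng = np.random.default_rng(0)
fmridata = rng.normal(size=(5, 2, 3, 4))
maskdata = np.zeros((2, 3, 4))
maskdata[0, 1, :] = 1  # 4 voxels inside the mask

# Forward step, as in the loader: boolean-index the three spatial axes.
inmask = maskdata > 0
maskeddata = fmridata[:, inmask]        # shape (5, 4)

# Reverse step: write one value per masked voxel back into an empty volume.
voxel_scores = maskeddata.mean(axis=0)  # e.g. a per-voxel summary or weight
volume = np.zeros(maskdata.shape)
volume[inmask] = voxel_scores           # shape (2, 3, 4)
```

To save such a volume as an image, the affine of the original image would also be needed; the sketch stops at the array.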

## Task EEG Dataset
The task EEG dataset is taken from a Kaggle competition (https://www.kaggle.com/c/grasp-and-lift-eeg-detection). There are a number of different labels in the "eegevents" output (e.g. grasping vs lifting, left vs right, etc).

In [2]:
def load_task_eeg_series(datapath, sub, series):
    # input arguments:
    # datapath (string): path to the root directory
    # sub (string): subject ID (e.g. subj1, subj2, etc)
    # series (int): series number (e.g. 1, 2, etc).
    # This will load the data and events for the single specified series

    # output:
    # eegdata (numpy array): samples x channels data matrix
    # eegevents (pandas dataframe): labels
    # channel_names (list): names of the channels
    import os
    import pandas as pd
    eegdata = pd.read_csv(os.path.join(datapath, sub + '_series' + str(series) + '_data.csv'))
    eegevents = pd.read_csv(os.path.join(datapath, sub + '_series' + str(series) + '_events.csv'))
    # to_numpy() converts the data frame to a plain array
    return eegdata.to_numpy(), eegevents, list(eegdata.columns)


def load_task_eeg_data(datapath, sub):
    # call this one in your code
    # input arguments:
    # datapath (string): path to the root directory
    # sub (string): subject ID (e.g. subj1, subj2, etc)
    # This will load series 1 through 8 and chunk them by series
    
    # output:
    # eegdata (numpy array): samples x channels data matrix
    # eegevents (pandas dataframe): labels
    # channel_names (list): names of the channels
    import pandas as pd
    import numpy as np
    eegdata = []
    eegevents = []
    for s in range(1, 9):
        ed, ee, ek = load_task_eeg_series(datapath, sub, s)
        ee['chunks'] = pd.Series(s * np.ones(ee.shape[0]))
        eegdata.append(ed)
        eegevents.append(ee)
    eegkeys = ek    
    return np.vstack(eegdata), pd.concat(eegevents), eegkeys
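
One detail worth knowing when using the stacked output: `pd.concat` keeps each series' own row index, so index labels repeat across series. A safe way to select rows of the data matrix is to build positional boolean masks from the events table. The sketch below uses synthetic stand-ins for two series; the event column name `HandStart` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two series: 3 samples x 2 channels each.
# The event column name 'HandStart' is an assumption for illustration.
series_data = [np.ones((3, 2)), 2 * np.ones((3, 2))]
series_events = [pd.DataFrame({'HandStart': [0, 1, 0]}),
                 pd.DataFrame({'HandStart': [1, 1, 0]})]

# Tag each series with its chunk number, then stack the series.
for s, ee in enumerate(series_events, start=1):
    ee['chunks'] = s
eegdata = np.vstack(series_data)
eegevents = pd.concat(series_events)

# The concatenated index is 0, 1, 2, 0, 1, 2, so index labels repeat.
# Convert columns to NumPy arrays to select rows of eegdata by position.
is_event = eegevents['HandStart'].to_numpy() == 1
in_series_2 = eegevents['chunks'].to_numpy() == 2
selected = eegdata[is_event & in_series_2]
print(selected.shape)
```

Passing `ignore_index=True` to `pd.concat` would instead give the events table a fresh 0..N-1 index that matches the row positions of the stacked data.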

## Clinical EEG Dataset

The clinical EEG dataset is taken from https://physionet.org/pn6/chbmit/; references are listed at that link.

The original recordings are 24-48 hours of continuous monitoring of patients with intractable seizures. We've clipped out the seizure events, along with the 10 min before and after each event. Each clip (before, during, and after one seizure event) is denoted by its own chunk.

In [3]:
def load_clinical_eeg_data(datapath, sub):
    # input arguments:
    # datapath (string): path to the root directory
    # sub (string): subject ID (e.g. chb01, chb02, etc)

    # output:
    # eegdata (numpy array): samples x channels data matrix
    # eegevents (pandas dataframe): labels and chunks
    # channel_names (list): names of the channels
    import os
    import pandas as pd
    alldata = pd.read_csv(os.path.join(datapath, 'train', sub + '.csv'))
    eegevents = alldata[['labels', 'chunks']]
    # keep only the channel columns; 'Unnamed: 0' is the saved row index
    alldata = alldata.drop(['Unnamed: 0', 'labels', 'chunks'], axis=1)
    names = list(alldata.keys())
    return alldata.to_numpy(), eegevents, names
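
Since each chunk covers one seizure event plus the time around it, a quick per-chunk summary of the labels can be a useful sanity check before modeling. The sketch below runs on a synthetic events table; the 0/1 label coding (0 = no seizure, 1 = seizure) is an assumption for illustration, so check the actual values in your data.

```python
import pandas as pd

# Synthetic stand-in for the eegevents output. The 0/1 label coding
# (0 = no seizure, 1 = seizure) is an assumption for illustration.
eegevents = pd.DataFrame({
    'labels': [0, 0, 1, 1, 0, 0, 1, 0],
    'chunks': [1, 1, 1, 1, 1, 2, 2, 2],
})

# Number of samples per chunk and per label.
counts = eegevents.groupby(['chunks', 'labels']).size().unstack(fill_value=0)

# Fraction of seizure samples within each chunk.
seizure_fraction = eegevents.groupby('chunks')['labels'].mean()
print(counts)
print(seizure_fraction)
```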