Okay, so we've collected some data. Woohoo. The first thing we'll want to do is take the files our acquisition software has just output and organize them in a sensible way. 

What is "a sensible way," you mught ask? Any organization where you (and anybody else who ever needs to use your data) will immediately know where everything is and be able to find all the important metadata (e.g. acquisition parameters, which may ont be available in the file itself, etc.).

Labs will often have their own internal guidelines for how an EEG dataset should be organized, or maybe people just do what works for them. I'm a big fan of the [Brain Imaging Data Structure (BIDS)](https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/03-electroencephalography.html); if a committee of experienced researchers put their heads together and decided this was a sensible way to organize data, who am I to reinvent the wheel? And by organizing our data the same way as everyone else, we are afforded the ability to use tools that assume your data follows BIDS specifications. Importantly, when we write code to analyze our dataset, we know we can apply it to any other dataset stored in BIDS format in the future, like every dataset on [OpenNeuro](https://openneuro.org/) which can save you a ton of time down the line.

The only annoying part about BIDS is that it can be a struggle to get your data into the highly specific directory structure. Luckily, the EEG ecosystem has tools like [MNE-BIDS](https://mne.tools/mne-bids/stable/index.html) in Python and [Fieldtrip Toolbox's `data2bids` function](https://www.fieldtriptoolbox.org/example/bids/) to make this part trivial.

In [1]:
from mne_bids import BIDSPath, write_raw_bids, get_anonymization_daysback
import itertools
import mne

import numpy as np
import os
import re

First we'll load our data into MNE as we would any other EEG file. I've put the data we collected from Pablo into a folder called `data`, which we'll peak inside.

In [2]:
DATA_DIR = '../data/raw' # where our data currently lives
BIDS_DIR = '../data/bids' # where we want it to live

fnames = os.listdir(DATA_DIR)
fnames

['letty_subj_6.eeg',
 'letty_subj_5.eeg',
 'letty_subj_4.eeg',
 'letty_subj_3.eeg',
 'letty_subj_2.eeg',
 'letty_subj_3_2.vmrk',
 'letty_subj_3_2.vhdr',
 'letty_subj_6.vhdr',
 'letty_subj_6.vmrk',
 'CACS-64_NO_REF.bvef',
 'letty_subj_4_2.vmrk',
 'letty_subj_4_2.vhdr',
 'letty_subj_4.vmrk',
 'letty_subj_4.vhdr',
 'letty_subj_4_2.eeg',
 'letty_subj_5.vhdr',
 'letty_subj_5.vmrk',
 'letty_subj_2.vmrk',
 'letty_subj_2.vhdr',
 'letty_subj_3_2.eeg',
 'letty_subj_3.vhdr',
 'letty_subj_3.vmrk',
 'CACS-64_REF.bvef']

We only have one subject now, so it would be easy to hardcode this. But we can save ourselves some work down the line by automating this process, so we'll pretend we have more subjects than we do. 

MNE only needs one of the file names to read the file; namely, the `.vhdr` header file.

In [3]:
fnames = [f for f in fnames if '.vhdr' in f] # filter for .vhdr files
fnames

['letty_subj_3_2.vhdr',
 'letty_subj_6.vhdr',
 'letty_subj_4_2.vhdr',
 'letty_subj_4.vhdr',
 'letty_subj_5.vhdr',
 'letty_subj_2.vhdr',
 'letty_subj_3.vhdr']

How our subject IDs and task names are represented in our filename will obviously vary from project to project, since they depend on what you type into the acquistion software. (This type of inconsistent naming conventions is why we're converting to BIDS to begin with.) So you'll need to write your own code for this next part. 

I'm using regular expressions because I normally find them handy for pulling info out of file names, but obviously there are other (easier, if you don't already know the notoriously inscrutable regular expression syntax) ways to do this. Don't mind me.

In [25]:
# Get subject list from file order
filter_subs = re.compile('letty_subj_(\w?).*') # create regex filter
subs = list(map(filter_subs.findall, fnames)) # extract subject numbers with filter
subs = list(itertools.chain(*subs)) # flatten then nested list
print(subs)

# Get a task list
tasks = ['pitch']*len(subs) # broadcast the only task name
print(tasks)

# Get a run list
filter_runs = re.compile('\w+[0-9]_([0-9]).*')
runs = list(map(filter_runs.findall, fnames))
runs = ['1' if x == [] else x for x in runs]
runs = list(itertools.chain(*runs))
print(runs)

['3', '6', '4', '4', '5', '2', '3']
['pitch', 'pitch', 'pitch', 'pitch', 'pitch', 'pitch', 'pitch']
['2', '1', '2', '1', '1', '1', '1']


We'll want to rename our channels to something more information than `'Ch1'`, etc. Brain Products' acticaps positions electrodes according to the [10-20 system](https://en.wikipedia.org/wiki/10%E2%80%9320_system_(EEG)), so if we rename our electrodes to their 10-20 location names, everyone will know where they are on the head. We'll make a mapping from the channel names in our files to the corresponding 10-20 names using the layout file provided by Brain Products for our cap. 

In [5]:
dig = mne.channels.read_custom_montage(DATA_DIR + '/CACS-64_NO_REF.bvef')
mapping = {'Ch%s'%i: dig.ch_names[i] for i in range(len(dig.ch_names))}
mapping

{'Ch0': 'GND',
 'Ch1': 'Fp1',
 'Ch2': 'Fz',
 'Ch3': 'F3',
 'Ch4': 'F7',
 'Ch5': 'FT9',
 'Ch6': 'FC5',
 'Ch7': 'FC1',
 'Ch8': 'C3',
 'Ch9': 'T7',
 'Ch10': 'TP9',
 'Ch11': 'CP5',
 'Ch12': 'CP1',
 'Ch13': 'Pz',
 'Ch14': 'P3',
 'Ch15': 'P7',
 'Ch16': 'O1',
 'Ch17': 'Oz',
 'Ch18': 'O2',
 'Ch19': 'P4',
 'Ch20': 'P8',
 'Ch21': 'TP10',
 'Ch22': 'CP6',
 'Ch23': 'CP2',
 'Ch24': 'Cz',
 'Ch25': 'C4',
 'Ch26': 'T8',
 'Ch27': 'FT10',
 'Ch28': 'FC6',
 'Ch29': 'FC2',
 'Ch30': 'F4',
 'Ch31': 'F8',
 'Ch32': 'Fp2',
 'Ch33': 'AF7',
 'Ch34': 'AF3',
 'Ch35': 'AFz',
 'Ch36': 'F1',
 'Ch37': 'F5',
 'Ch38': 'FT7',
 'Ch39': 'FC3',
 'Ch40': 'C1',
 'Ch41': 'C5',
 'Ch42': 'TP7',
 'Ch43': 'CP3',
 'Ch44': 'P1',
 'Ch45': 'P5',
 'Ch46': 'PO7',
 'Ch47': 'PO3',
 'Ch48': 'POz',
 'Ch49': 'PO4',
 'Ch50': 'PO8',
 'Ch51': 'P6',
 'Ch52': 'P2',
 'Ch53': 'CPz',
 'Ch54': 'CP4',
 'Ch55': 'TP8',
 'Ch56': 'C6',
 'Ch57': 'C2',
 'Ch58': 'FC4',
 'Ch59': 'FT8',
 'Ch60': 'F6',
 'Ch61': 'AF8',
 'Ch62': 'AF4',
 'Ch63': 'F2',
 'Ch64': 'FCz'

In [6]:
# # Retrieve channel mapping
# dig = mne.channels.read_custom_montage(DATA_DIR + '/CACS-64_NO_REF.bvef')
# no_ref_mapping = {'Ch%s'%i: dig.ch_names[i] for i in range(len(dig.ch_names))}
# no_ref_mapping

# # Mapping for subjects 3 and 4
# dig = mne.channels.read_custom_montage(DATA_DIR + '/CACS-64_REF.bvef')
# ref_mapping = {'Ch%s'%i: dig.ch_names[i] for i in range(len(dig.ch_names))}
# ref_mapping

# # channel_mapping = accidental_mapping if sid in bad_mapping_subs else default_mapping

The ground electrode doesn't appear in the file, so we will remove that from the mapping (because MNE isn't yet smart enough to deal with extraneous values). Also, I had electrode 24 set as the reference electrode during the recording, so it didn't appear in the file. (As a side note, while we don't do it here, we can actually add the reference channel back with a constant value of zero, since it will become a valid channel again after re-referencing.) 

In [7]:
del mapping['Ch0']
del mapping['Ch24']

In [8]:
# weird_sub_channels = {
#     bad_sub1: bs1_mapping_dict,
#     bad_sub2: bs2_mapping_dict
# }

# channel_map = weird_sub_channels.get(sid, default_map)
# for i, fname in enumerate(fnames):
#     print(i, fname)

Now all we need to do is read the data file using MNE, add any info that BIDS will want but isn't available in the original file (in this case, the power line frequency, which varies from country to country so constitutes important and non-obvious metadata), give the basic info we just extracted to MNE-BIDS so it can build the BIDS directory structure, and copy our data. We'll also want to rename our events to something more interpretable than integer codes.

If we want to share this dataset in the future, we'll also need to anonymize it. That means removing the date it is collected. (It would also mean removing our subject's name, but the cat's already out of the bag on that one in this case -- sorry, Pablo).

In [9]:
for i in range(len(fnames)):
    
    sub = subs[i]
    task = tasks[i]
    block = 
    fpath = os.path.join(DATA_DIR, fnames[i])
    
    # load data with MNE function for your file format
    raw = mne.io.read_raw_brainvision(fpath)

    # add some info BIDS will want
    raw.info['line_freq'] = 60 # the power line frequency in the building we collected in
    
    # rename events from random integers to interpretable names
    # (this part is specific to your experiment, obviously)
    events, event_ids = mne.events_from_annotations(raw)
    events = events[events[:,2] != event_ids['New Segment/'], :]
    event_codes = events[:,2]
    baseline_code = np.argmax(np.bincount(event_codes)) # the one with more trials
    oddball_code = np.unique(event_codes)[np.unique(event_codes) != baseline_code][0]
    event_names = {baseline_code: 'baseline', oddball_code: 'oddball'}
    annot = mne.annotations_from_events(events, sfreq = raw.info['sfreq'], event_desc = event_names)
    raw = raw.set_annotations(annot)
#     raw.load_data() # read data from memory
    raw.rename_channels(mapping)
    
    # build appropriate BIDS directory structure 
    bids_path = BIDSPath(
        run = block,
        subject = sub, 
        task = task, 
        datatype = 'eeg', 
        root = BIDS_DIR
    )
    
    # get range of dates the BIDS specfiication will accept
    daysback_min, daysback_max = get_anonymization_daysback(raw)
    
    # write data into BIDS directory, while anonymizing
    write_raw_bids(
        raw, 
        bids_path = bids_path, 
#         allow_preload = True, # whether to load full dataset into memory when copying
#         format = 'BrainVision', # format to save to
        anonymize = dict(daysback = daysback_min) # shift dates by daysback
    )

Extracting parameters from ../data/raw/letty_subj_3_2.vhdr...
Setting channel info structure...
Used Annotations descriptions: ['New Segment/', 'Stimulus/S  1', 'Stimulus/S  2', 'Stimulus/S  3', 'Stimulus/S  4', 'Stimulus/S  5']
Extracting parameters from /Users/nusbaumlab/src/pitch_tracking/data/raw/letty_subj_3_2.vhdr...
Setting channel info structure...
Writing '../data/bids/participants.tsv'...
Writing '../data/bids/participants.json'...
Used Annotations descriptions: ['baseline', 'oddball']
Writing '../data/bids/sub-3/eeg/sub-3_task-pitch_events.tsv'...
Writing '../data/bids/dataset_description.json'...
Writing '../data/bids/sub-3/eeg/sub-3_task-pitch_eeg.json'...
Writing '../data/bids/sub-3/eeg/sub-3_task-pitch_channels.tsv'...
Copying data files to sub-3_task-pitch_eeg.vhdr
Created "sub-3_task-pitch_eeg.eeg" in "/Users/nusbaumlab/src/pitch_tracking/data/bids/sub-3/eeg".
Created "sub-3_task-pitch_eeg.vhdr" in "/Users/nusbaumlab/src/pitch_tracking/data/bids/sub-3/eeg".
Created "su

FileExistsError: "../data/bids/sub-4/eeg/sub-4_task-pitch_events.tsv" already exists. Please set overwrite to True.

A couple of notes:

1. In this case, we are just copying from one Brain Vision file to another, which we can do since our data we already in that file format. Often, we'll collect from a system that outputs to a file format which isn't already BIDS compliant. In that case, you'll need to load the data into memory with `raw.load_data()` and then set the `allow_preload = True` when writing the data. Last I checked, this is also necessary when renaming channels for idiosyncratic reasons (though this may change since MNE-BIDS is under active development), which is why we've done so here even though our data is already in a Brain Vision file.
2. If you have digitized electrode positions for your specific subject, you'll want to [load those as you normally would in MNE](https://mne.tools/stable/auto_tutorials/intro/40_sensor_locations.html) and assign them to the `raw` object before writing to BIDS. This will ensure your electrode locations get recorded in a BIDS compliant manner. This is _only_ for subject-specific electrode positions; don't do this for standard templates.
3. As we saw in the timing test tutorial, MNE loads event times stored in the Brain Vision file as annotations in `raw.annotations`, which MNE-BIDS records in a BIDS-valid event file automatically. If your events are represented in a different way, you can either convert them to annoations or provide an events data structure to `write_raw_bids`.

That's pretty much it. We can check view some attributes of our resulting data directory using the `pybids` package. (Accessing the directory as a `BIDSLayout` also runs the [BIDS Validator](https://github.com/bids-standard/bids-validator) automatically, ensuring everything is up to par.

In [10]:
from bids import BIDSLayout
layout = BIDSLayout(BIDS_DIR)

In [11]:
layout.get_subjects()

['6', '1', '3', '4']

In [12]:
layout.get_tasks()

['p', 'pitches', 'pitch']

In [None]:
from mne_bids import print_dir_tree
print_dir_tree(BIDS_DIR)

You can go through the descriptor files like `README`, `dataset_description.json`, and `participants.tsv` to add other information (e.g. the paper's authors, subjects' handedness, etc.) by hand if you wish. MNE-BIDS will also happily organize data from different sessions and, runs, and tasks into one, big, happy directory. 