
Computational Testing for Automated Preprocessing (CTAP)

What is CTAP?

The main aim of the Computational Testing for Automated Preprocessing (CTAP) toolbox is to regularise and streamline EEG preprocessing: to minimise human subjectivity and error, and to facilitate easy batch processing for experts and novices alike. This breaks down into two separate but complementary aims:

  1. batch processing using EEGLAB functions and
  2. testing and comparison of automated methodologies.

The CTAP toolbox provides two main functionalities to achieve these aims:

  1. the core supports scripted specification of an EEGLAB analysis pipeline and provides tools for running the pipe, making the workflow transparent and easy to control. Automated output of ‘quality control’ logs and imagery helps to keep track of what is going on.
  2. the testing module uses synthetic data to generate ground truth controlled tests of preprocessing methods, with capability to generate new synthetic data matching the parameters of the lab’s own data. This allows experimenters to select the best methods for their purpose, or developers to flexibly test and benchmark their novel methods.

NOTE: If you use CTAP for your research, please use the following citation:

  • Cowley, B., Korpela, J., & Torniainen, J. E. (2017). Computational Testing for Automated Preprocessing: a Matlab toolbox to enable large scale electroencephalography data processing. PeerJ Computer Science, 3:e108. http://doi.org/10.7717/peerj-cs.108

Installation

Clone the GitHub repository to your machine using

git clone https://github.com/bwrc/ctap.git <dest dir>

Add the whole directory to your Matlab path.

You also need to have EEGLAB added to your Matlab path. CTAP has been tested against the latest version of EEGLAB, available via https://sccn.ucsd.edu/wiki/How_to_download_EEGLAB. CTAP developers recommend testing against this EEGLAB 'development head' by maintaining a local copy of the EEGLAB repository; this helps ensure that reported bugs can be replicated.
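
For example, the paths can be added from the Matlab prompt roughly as follows (the locations below are placeholders for your own install directories):

% Sketch of path setup; replace the placeholder locations before running.
addpath(genpath('<dest dir>'));   % CTAP repository, with all subfolders
addpath('path/to/eeglab');        % EEGLAB root directory
eeglab;                           % starting EEGLAB once adds its subfolders to the path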

Getting started

A small working example with synthetic data can be found under:

<dest dir>/ctap/templates/manuscript_example/

Run the example by executing the script runctap_manu.m. After the initial run, set the synthetic data creation flag to false in order to skip data creation on subsequent runs. The analysis pipe used in this example is defined in cfg_manu.m.

There is also an edited version of the manuscript pipe that implements branching of the analysis. It can be found under:

<dest dir>/ctap/templates/branching_example/

This example is started by running runctap_manu_branch.m, with the pipe defined in the cfg_*.m files residing in the same directory.
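
With the paths set up as in Installation, both examples can be launched from the Matlab prompt, for example as follows:

% Both example scripts are on the path once <dest dir> has been added.
runctap_manu          % manuscript example; creates synthetic data on the first run
runctap_manu_branch   % branching version of the manuscript pipe
% After the first run of runctap_manu, set the synthetic data creation flag
% to false (as described above) to skip regeneration on subsequent runs.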

Terminology

CTAP: this software package / repository

measurement configuration or MC: the measurement configuration file contains a list of all test subjects, measurements and other relevant information needed to find the data files and to analyze them properly. This file can be generated automatically from a directory of EEG data files, or custom made by the experienced user.

Configuration struct Cfg: A Matlab struct that contains all configurations that alter the behavior of CTAP, e.g. parameters to be passed to functions, path names to use for output, etc.
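
As a purely illustrative sketch of the idea (the field names below are hypothetical examples, not an authoritative listing of CTAP's actual configuration fields; see the template pipes for those):

% Hypothetical configuration sketch -- consult the cfg_*.m templates for
% the field names CTAP actually uses.
Cfg = struct();
Cfg.env.paths.analysisRoot = fullfile('path', 'to', 'output');  % where results are written
Cfg.eeg.chanlocs  = 'path/to/chanlocs.elp';                     % channel locations file
Cfg.eeg.reference = {'average'};                                % re-referencing scheme
Cfg.grfx.on = true;                                             % produce quality control imagery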

Preliminary Design Philosophy

CTAP was developed from the following considerations of how to manage EEG projects so as to maximize the payoff for the development work involved.

Components of a typical analysis system

  • raw data

  • processed (intermediate) data

  • extracted features

  • generic analysis code (preprocessing, feature extraction, statistical analysis)

  • project specific scripts & setup files

These components should be kept apart (different folders / repositories) in order to allow:

  1. easy removal of intermediate analysis steps to free up disk space

  2. easy copying/synching of a partial project for e.g. working at home

  3. code developed in project A to directly benefit project B

Example workflow for high-density EEG

  1. structured data storage during collection

  2. data import into EEG struct: signal selection, event import

  3. detection of bad channels

  4. detection of bad signal segments/epochs

  5. ICA

  6. artefact removal using ICs

  7. feature extraction

  8. feature export/storage

  9. statistical analysis

Design principles

  • modular design: analysis steps should be implemented as standalone functions (no pop-ups or GUI elements), and they should be applicable in any (reasonable) order

  • ease of scripting: a unified scripting interface for the analysis functions, e.g. input: EEG struct + varargin; output: EEG struct + the values of varargin used + other important parameters (see the sketch after this list)

  • no GUI (at least for now)
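
As a sketch of such a unified interface (the function and parameter names below are illustrative placeholders, not CTAP's actual API):

% Illustrative skeleton of an analysis step following the unified interface:
% input EEG struct + varargin, output EEG struct + the parameter values used.
function [EEG, params] = example_detect_bad_channels(EEG, varargin)
    p = inputParser;
    p.addParameter('bound', 5);     % hypothetical z-score threshold, with a default
    p.parse(varargin{:});
    params = p.Results;             % actual values used, returned for logging

    % ... the detection logic operating on the EEG struct would go here ...
end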

Main challenges

  • three forms of data: continuous, discontinuous, epoched. (Discontinuous data emerges as segments of data are discarded; epoched data is formed when chopping the data into pieces for e.g. ERP or feature extraction.) Proposed solution: state the data type requirement in the documentation of each analysis function and also check data adequacy within the code

  • some analysis steps will take hours to compute (e.g. ICA for a 128 chan dataset of 30 min duration). Proposed solution: intermediate saving of results to minimize unnecessary re-computation. How to make these saves automated, flexible and not too disk-consuming?

  • configuration depends on project, task/protocol, subject and feature to be extracted: how to create the necessary configurations without excessive manual work? Proposed solution: define a base template for the project configuration; create task/protocol- and feature-specific configuration files by updating the base template; load subject-specific updates from a main information file/database (see the sketch after this list). Some tools are needed to make comparison of configurations easy. Analysis functions should define their defaults and report the actual values used as output.

  • the components of the pipeline will be changing constantly

  • branching the analysis: how to avoid mixing datasets computed using different configurations?
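
As a minimal sketch of the template-update idea from the configuration bullet above (all field names here are hypothetical):

% Project-wide base configuration: define the defaults once.
baseCfg.eeg.reference = {'average'};
baseCfg.grfx.on = true;

% Task/protocol-specific configuration: copy the base and override only what differs.
taskCfg = baseCfg;
taskCfg.eeg.reference = {'EXG1' 'EXG2'};

% Subject-specific updates can be merged the same way, e.g. drawn from the MC file.
subjCfg = taskCfg;
subjCfg.eeg.veogChannelNames = {'VEOG1' 'VEOG2'};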

One possible pipeline design could be:

Analysis steps are collected into sets/"chunks" of one or more individual steps. Each step set has a name (id string). The scripting system is configured by declaring:

  • steps belonging to each "step set" (including order) and name of the step set

  • order of the "step sets" in the whole workflow

  • raw data location

  • any non-default varargins for the analysis functions

Each step set produces an intermediate processed EEG dataset. An example: one step set might be called "prepro" and it could contain {bad_channel_rejection, bad_segment_rejection}.
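
For instance, such a declaration could look roughly like the following (the field names and function handles are indicative; consult the cfg_*.m templates for the exact syntax CTAP uses):

% Two step sets: each has an id and an ordered list of analysis-step function
% handles, and each produces one intermediate dataset.
i = 1;
stepSet(i).id   = '1_load';
stepSet(i).funH = {@CTAP_load_data, @CTAP_load_chanlocs};

i = i + 1;
stepSet(i).id   = '2_prepro';
stepSet(i).funH = {@CTAP_detect_bad_channels, @CTAP_detect_bad_segments};

Cfg.pipe.stepSets = stepSet;   % order of the step sets defines the workflow order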

The user specifies which measurements to process (giving a list of casenames) and which step sets to run. The system then determines from the configuration where to find the source data.

Why like this:

  • saving intermediate data after every analysis step is often unnecessary and replicates the data too much (at least for high density EEG)

  • running the whole analysis in one go is impractical: it takes a very long time and does not allow efficient debugging of problematic measurements

  • setting the analysis workflow by manipulating input and output file locations separately for each step can be very frustrating (depending on the implementation)

  • the analysis workflow usually changes constantly as new things come up, so reconfiguring should be easy