# Getting Started with CML Readers

In [1]:
import json
import pandas as pd
import cmlreaders as cml

## Finding Files on RHINO

The PathFinder helper class locates files on RHINO. Its sole responsibility is to find and return the path of a file. In many cases, a file could be in more than one location. In these situations, PathFinder searches the list of possible locations in order and returns the first path where the file is found. This assumes that the list is ordered by priority, so that the preferred location comes before any fall-back location.
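
The first-match search described above can be sketched with the standard library alone. This is a minimal illustration, not PathFinder's actual implementation; the function name `find_first_existing` and the candidate paths are hypothetical.

```python
import tempfile
from pathlib import Path


def find_first_existing(rootdir, candidates):
    """Return the first candidate path (relative to rootdir) that exists.

    Candidates are assumed to be ordered from most to least preferred.
    """
    for relative_path in candidates:
        full_path = Path(rootdir) / relative_path
        if full_path.exists():
            return full_path
    raise FileNotFoundError(f"None of {list(candidates)} found under {rootdir}")


# Demonstrate with a throwaway directory standing in for the RHINO root
with tempfile.TemporaryDirectory() as root:
    fallback = Path(root) / "legacy" / "pairs.json"
    fallback.parent.mkdir(parents=True)
    fallback.write_text("{}")

    # The preferred location is missing, so the fall-back is returned
    found = find_first_existing(root, ["current/pairs.json", "legacy/pairs.json"])
    found_relative = found.relative_to(root).as_posix()
    print(found_relative)
```

Because the loop returns on the first hit, reordering the candidate list is all it takes to change which location wins.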

In [2]:
# If not working on RHINO, specify the mount point
rhino_root = "/Volumes/RHINO/"

# Instantiate the finder object
finder = cml.PathFinder(subject="R1389J", experiment="catFR5", session="1", 
                        localization="0", montage="0", rootdir=rhino_root)

### What can you request?

The PathFinder has a few built-in properties to help you understand what data types are currently supported. Different file types require that the finder be instantiated with different fields. For example, if you are planning to request localization files, there is no need to specify an experiment, session, or montage. Specifying extra fields is not a problem: any field that the requested data type does not require is simply ignored. The following properties are defined:
- requestable_files: All supported data types
- localization_files: Files related to localization
- montage_files: Files associated with a specific montage
- session_files: Files that are specific to a session. These files could be processed events, Ramulator files, etc.

For high-level information about each of these data types, see the [Data Guide](https://pennmem.github.io/cmlreaders/html/data_guide.html) section of the documentation.

In [3]:
finder.requestable_files

['r1_index',
 'ltp_index',
 'voxel_coordinates',
 'prior_stim_results',
 'electrode_coordinates',
 'jacksheet',
 'area',
 'electrode_categories',
 'good_leads',
 'leads',
 'classifier_excluded_leads',
 'localization',
 'matlab_bipolar_talstruct',
 'matlab_monopolar_talstruct',
 'pairs',
 'contacts',
 'session_summary',
 'classifier_summary',
 'math_summary',
 'target_selection_table',
 'baseline_classifier',
 'all_events',
 'task_events',
 'math_events',
 'ps4_events',
 'sources',
 'experiment_log',
 'session_log',
 'ramulator_session_folder',
 'event_log',
 'experiment_config',
 'raw_eeg',
 'odin_config',
 'used_classifier',
 'excluded_pairs',
 'all_pairs']

In [4]:
finder.localization_files

['voxel_coordinates',
 'prior_stim_results',
 'electrode_coordinates',
 'jacksheet',
 'good_leads',
 'leads',
 'area',
 'classifier_excluded_leads',
 'localization',
 'electrode_categories',
 'target_selection_file',
 'baseline_classifier']

In [5]:
finder.montage_files

['pairs', 'contacts']

In [6]:
finder.session_files

['session_summary',
 'classifier_summary',
 'math_summary',
 'used_classifier',
 'excluded_pairs',
 'all_pairs',
 'experiment_log',
 'session_log',
 'event_log',
 'experiment_config',
 'raw_eeg',
 'odin_config',
 'all_events',
 'task_events',
 'math_events',
 'ps4_events']
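
One way to use these properties is to check which category a data type falls under before deciding which fields to pass to the finder. The sketch below hard-codes abbreviated copies of the lists printed above so that it runs without RHINO access; the helper `category_of` is hypothetical and not part of cmlreaders. With a live finder, the same lookup could read `finder.localization_files`, `finder.montage_files`, and `finder.session_files` directly.

```python
# Abbreviated copies of the property values shown above (not the full lists)
file_categories = {
    "localization": ["voxel_coordinates", "electrode_coordinates", "jacksheet"],
    "montage": ["pairs", "contacts"],
    "session": ["task_events", "math_events", "raw_eeg"],
}


def category_of(data_type):
    """Return the category whose list contains data_type, or None."""
    for category, data_types in file_categories.items():
        if data_type in data_types:
            return category
    return None


for name in ["pairs", "task_events", "voxel_coordinates"]:
    print(name, "->", category_of(name))
```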

### Finding File Paths

In [7]:
# Find some example files
example_data_types = ['pairs', 'task_events', 'voxel_coordinates']
for data_type in example_data_types:
    print(finder.find(data_type=data_type))

/Volumes/RHINO/protocols/r1/subjects/R1389J/localizations/0/montages/0/neuroradiology/current_processed/pairs.json
/Volumes/RHINO/protocols/r1/subjects/R1389J/experiments/catFR5/sessions/1/behavioral/current_processed/task_events.json
/Volumes/RHINO/data10/RAM/subjects/R1389J/tal/VOX_coords_mother.txt


## Identifying Available Sessions

CMLReaders contains a utility function for loading the JSON-formatted index files located in the protocols/ directory on RHINO as a dataframe. Once loaded, standard pandas selection idioms can be used to answer questions such as:

1. What subjects completed FR1?
2. What experiments did subject R1111M complete?
3. How many sessions of PAL1 have been collected?

For many analyses, this will be the first step in determining the sample of subjects to be used.

In [8]:
from cmlreaders import get_data_index

In [9]:
r1_data = get_data_index(kind='r1', rootdir='/Volumes/RHINO/')
r1_data.head()

Unnamed: 0,Recognition,all_events,contacts,experiment,import_type,localization,math_events,montage,original_experiment,original_session,pairs,ps4_events,session,subject,subject_alias,system_version,task_events
0,,protocols/r1/subjects/R1001P/experiments/FR1/s...,protocols/r1/subjects/R1001P/localizations/0/m...,FR1,build,0,protocols/r1/subjects/R1001P/experiments/FR1/s...,0,,0,protocols/r1/subjects/R1001P/localizations/0/m...,,0,R1001P,R1001P,,protocols/r1/subjects/R1001P/experiments/FR1/s...
1,,protocols/r1/subjects/R1001P/experiments/FR1/s...,protocols/r1/subjects/R1001P/localizations/0/m...,FR1,build,0,protocols/r1/subjects/R1001P/experiments/FR1/s...,0,,1,protocols/r1/subjects/R1001P/localizations/0/m...,,1,R1001P,R1001P,,protocols/r1/subjects/R1001P/experiments/FR1/s...
2,,protocols/r1/subjects/R1001P/experiments/FR2/s...,protocols/r1/subjects/R1001P/localizations/0/m...,FR2,build,0,protocols/r1/subjects/R1001P/experiments/FR2/s...,0,,0,protocols/r1/subjects/R1001P/localizations/0/m...,,0,R1001P,R1001P,,protocols/r1/subjects/R1001P/experiments/FR2/s...
3,,protocols/r1/subjects/R1001P/experiments/FR2/s...,protocols/r1/subjects/R1001P/localizations/0/m...,FR2,build,0,protocols/r1/subjects/R1001P/experiments/FR2/s...,0,,1,protocols/r1/subjects/R1001P/localizations/0/m...,,1,R1001P,R1001P,,protocols/r1/subjects/R1001P/experiments/FR2/s...
4,,protocols/r1/subjects/R1001P/experiments/PAL1/...,protocols/r1/subjects/R1001P/localizations/0/m...,PAL1,build,0,protocols/r1/subjects/R1001P/experiments/PAL1/...,0,,0,protocols/r1/subjects/R1001P/localizations/0/m...,,0,R1001P,R1001P,,protocols/r1/subjects/R1001P/experiments/PAL1/...


In [10]:
# What subjects completed FR1?
fr1_subjects = r1_data[r1_data['experiment'] == 'FR1']['subject'].unique()
fr1_subjects

array(['R1001P', 'R1002P', 'R1003P', 'R1006P', 'R1010J', 'R1015J',
       'R1018P', 'R1020J', 'R1022J', 'R1023J', 'R1026D', 'R1027J',
       'R1030J', 'R1031M', 'R1032D', 'R1033D', 'R1034D', 'R1035M',
       'R1036M', 'R1039M', 'R1042M', 'R1044J', 'R1045E', 'R1048E',
       'R1049J', 'R1050M', 'R1051J', 'R1052E', 'R1053M', 'R1054J',
       'R1056M', 'R1057E', 'R1059J', 'R1060M', 'R1061T', 'R1062J',
       'R1063C', 'R1065J', 'R1066P', 'R1067P', 'R1068J', 'R1069M',
       'R1070T', 'R1074M', 'R1075J', 'R1076D', 'R1077T', 'R1080E',
       'R1081J', 'R1083J', 'R1084T', 'R1086M', 'R1089P', 'R1092J',
       'R1093J', 'R1094T', 'R1096E', 'R1098D', 'R1100D', 'R1101T',
       'R1102P', 'R1104D', 'R1105E', 'R1106M', 'R1108J', 'R1111M',
       'R1112M', 'R1113T', 'R1114C', 'R1115T', 'R1118N', 'R1120E',
       'R1121M', 'R1122E', 'R1123C', 'R1124J', 'R1125T', 'R1127P',
       'R1128E', 'R1129D', 'R1130M', 'R1131M', 'R1134T', 'R1135E',
       'R1136N', 'R1137E', 'R1138T', 'R1142N', 'R1145J', 'R114

In [11]:
# What experiments did R1111M complete?
r1111m_experiments = r1_data[r1_data['subject'] == 'R1111M']['experiment'].unique()
r1111m_experiments

array(['FR1', 'FR2', 'PAL1', 'PAL2', 'PS2', 'catFR1'], dtype=object)

In [12]:
# How many sessions of PAL1 have been collected?
pal_sessions = r1_data[r1_data['experiment'] == 'PAL1']
len(pal_sessions)

151
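
The same idioms extend to summaries across all experiments. As a sketch, the example below counts sessions and unique subjects per experiment using `groupby`. The small dataframe is made-up stand-in data with the same `subject`, `experiment`, and `session` columns as the index, so the counts are illustrative only.

```python
import pandas as pd

# Made-up rows mimicking the columns of the r1 index (not real data)
toy_index = pd.DataFrame({
    "subject": ["S001", "S001", "S001", "S002", "S002"],
    "experiment": ["FR1", "FR1", "PAL1", "FR1", "PAL1"],
    "session": [0, 1, 0, 0, 0],
})

# Each row of the index is one session, so group sizes are session counts
sessions_per_experiment = toy_index.groupby("experiment").size()

# Number of distinct subjects who ran each experiment
subjects_per_experiment = toy_index.groupby("experiment")["subject"].nunique()

print(sessions_per_experiment)
print(subjects_per_experiment)
```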

## Loading Data

In most cases, the end goal is to load the data into memory rather than just locating a file or determining what data has been collected. For this, CML Readers provides a single class that unifies the API for loading data. By default, the file location is determined automatically from the data type using the PathFinder class demonstrated earlier. However, a custom path can be given with the file_path keyword. This is useful if you have data stored locally in the same format as one of the data types supported by CMLReaders. See the "Loading from a Custom Location" section below for an example.

Each data type has a default representation that is returned when you call the .load() method. Most users will want to use this default representation. However, if you would like to get the data in a different format, you have two options:

1. Get the reader for the data type and load the data using one of its as_x methods
2. Load the data as the default type and convert it manually
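
For the second option, pandas already provides conversions that cover common cases. A minimal sketch, using a made-up two-row dataframe (`toy_events`, a hypothetical name) in place of the result of `.load()`:

```python
import pandas as pd

# Stand-in for a dataframe returned by .load(); the values are made up
toy_events = pd.DataFrame({
    "type": ["WORD", "REC_END"],
    "eegoffset": [100, 200],
})

# List of per-row dictionaries
events_as_dicts = toy_events.to_dict(orient="records")

# NumPy record array (index=False drops the dataframe index)
events_as_records = toy_events.to_records(index=False)

print(events_as_dicts[0])
print(events_as_records.dtype.names)
```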

In [13]:
reader = cml.CMLReader(subject="R1389J", experiment="catFR5", session="1", 
                       localization="0", montage="0", rootdir=rhino_root)

### Using the Default Representation

In [14]:
# Pandas dataframe
events_df = reader.load('task_events')
events_df.head()

Unnamed: 0,eegoffset,category,category_num,eegfile,exp_version,experiment,intrusion,is_stim,item_name,item_num,...,recog_rt,recognized,rectime,rejected,serialpos,session,stim_list,stim_params,subject,type
0,-1,X,-999,,,catFR5,-999,False,X,-999,...,-999,-999,-999,-999,-999,1,False,[],R1389J,STIM_ARTIFACT_DETECTION_START
1,5831,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
2,7790,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
3,9786,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
4,11782,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON


In [15]:
# Custom python class
session_summary = reader.load('session_summary')
type(session_summary)

ramutils.reports.summary.FRStimSessionSummary

In [16]:
# Python dictionary
electrode_categories_dict = reader.load('electrode_categories')
electrode_categories_dict

{'interictal': ['ONEMC 9',
  'ONEMC10',
  'ONEMC8',
  'SMC3',
  'SMC4',
  'STG 7 ',
  'STG8',
  'TWOSTG 4'],
 'brain_lesion': ['FOURSC', 'ONESC', 'ONNEMC', 'THREESC', 'TWOMC', 'TWOSC'],
 'bad_channel': ['NONE'],
 'soz': []}

### Using the Underlying Reader

In [17]:
# Ask CMLReader to give back the reader instead of the data
event_reader = reader.get_reader('task_events')
type(event_reader)

cmlreaders.readers.readers.EventReader

In [18]:
# Load the task events as a dictionary instead of the default representation
event_dict = event_reader.as_dict()
event_dict[:1]

[{'eegoffset': -1,
  'category': 'X',
  'category_num': -999,
  'eegfile': '',
  'exp_version': '',
  'experiment': 'catFR5',
  'intrusion': -999,
  'is_stim': False,
  'item_name': 'X',
  'item_num': -999,
  'list': -999,
  'montage': 0,
  'msoffset': -1,
  'mstime': -1,
  'phase': '',
  'protocol': 'r1',
  'recalled': False,
  'recog_resp': -999,
  'recog_rt': -999,
  'recognized': -999,
  'rectime': -999,
  'rejected': -999,
  'serialpos': -999,
  'session': 1,
  'stim_list': False,
  'stim_params': [],
  'subject': 'R1389J',
  'type': 'STIM_ARTIFACT_DETECTION_START'}]

In [19]:
# Load the task events as a recarray (not recommended)
event_recarray = event_reader.as_recarray()
event_recarray[0]

(0, -1, 'X', -999, '', '', 'catFR5', -999, False, 'X', -999, -999, 0, -1, -1, '', 'r1', False, -999, -999, -999, -999, -999, -999, 1, False, [], 'R1389J', 'STIM_ARTIFACT_DETECTION_START')

## Saving Data

CMLReaders can also be used to save data in a different format from the original files. The three supported output formats are CSV, JSON, and HDF5. However, not every reader supports every format. In particular, HDF5 output is only supported for a small number of data types.

In [20]:
# JSON to CSV
event_reader.to_csv("task_events.csv")
event_df = pd.read_csv("task_events.csv")
event_df.head()

Unnamed: 0,eegoffset,category,category_num,eegfile,exp_version,experiment,intrusion,is_stim,item_name,item_num,...,recog_rt,recognized,rectime,rejected,serialpos,session,stim_list,stim_params,subject,type
0,-1,X,-999,,,catFR5,-999,False,X,-999,...,-999,-999,-999,-999,-999,1,False,[],R1389J,STIM_ARTIFACT_DETECTION_START
1,5831,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
2,7790,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
3,9786,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
4,11782,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON


In [21]:
# CSV to JSON
electrode_coord_reader = reader.get_reader('electrode_coordinates')
electrode_coord_reader.to_json('electrode_coordinates.json')

with open("electrode_coordinates.json") as f:
    electrode_coord_dict = json.load(f)

electrode_coord_dict.keys()

dict_keys(['contact_name', 'contact_type', 'x', 'y', 'z', 'atlas', 'orient_to'])

### Loading from a Custom Location

Since locations are automatically determined based on the data type, be sure to pass the file_path keyword when loading from a custom location. One common use case is loading data from your scratch or home directory that is in a format supported by CMLReaders. Another is discovering that data resides in a location that CMLReaders does not know about. Instead of waiting for the next release of the package, you can use a custom file path to load the data, submit an issue on GitHub, and continue with your analysis.


In [22]:
# Loading from a custom location
event_reader.to_json("task_events.json")
event_df = reader.load('task_events', file_path='task_events.json')
event_df.head()

Unnamed: 0,eegoffset,category,category_num,eegfile,exp_version,experiment,intrusion,is_stim,item_name,item_num,...,recog_rt,recognized,rectime,rejected,serialpos,session,stim_list,stim_params,subject,type
0,-1,X,-999,,,catFR5,-999,False,X,-999,...,-999,-999,-999,-999,-999,1,False,[],R1389J,STIM_ARTIFACT_DETECTION_START
1,5831,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
10,23781,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,,-1,...,-999,-999,-999,-999,-999,1,False,"[{'amplitude': 500.0, 'anode_label': 'STG6', '...",R1389J,STIM_ON
100,315125,X,-999,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,X,-999,...,-999,-999,-999,-999,-999,1,False,[],R1389J,REC_END
1000,2515863,Flowers,9,R1389J_catFR5_1_28Feb18_1552.h5,,catFR5,-999,False,LILY,112,...,-999,-999,-999,-999,4,1,True,[],R1389J,WORD
