## Find event combinations

This notebook traverses a dataset and gathers the unique combinations of values in the specified columns of its event files.

The setup requires the following variables for your dataset:

| Variable            | Purpose                                                        |
|---------------------|----------------------------------------------------------------|
| `dataset_root`      | Full path to root directory of dataset.                        |
| `output_path`       | Output path for the spreadsheet template. If unset, print it.  |
| `exclude_dirs`      | List of directories to exclude when constructing file lists.   |
| `key_columns`       | List of column names in the events.tsv files to combine.       |

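The result is a tab-separated file whose columns are a `key_counts` column followed by the `key_columns` in the order given. Each row holds one unique combination of `key_columns` values, and `key_counts` gives the number of times that combination occurs. In the sample output below, the rows are listed by descending count.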
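To make the notion of "unique combinations" concrete, here is a minimal pandas sketch of the same idea. It is not how `KeyMap` is implemented; the helper name `count_combinations` and the toy events table are illustrative assumptions.

```python
import pandas as pd


def count_combinations(df, key_columns):
    """Return one row per unique combination of key_columns with its count."""
    # dropna=False keeps combinations in which a key column is missing (NaN).
    counts = (
        df.groupby(key_columns, dropna=False)
        .size()
        .reset_index(name="key_counts")
    )
    # Put the count first, followed by the key columns in the order given.
    return counts[["key_counts"] + key_columns]


# Toy events table (hypothetical values, not taken from the dataset).
events = pd.DataFrame({
    "focus_modality": ["auditory", "auditory", "visual", "auditory"],
    "event_type": ["low_tone", "low_tone", "dark_bar", "dark_bar"],
})
combos = count_combinations(events, ["focus_modality", "event_type"])
print(combos)
```

The four toy events collapse to three distinct combinations, and the counts add up to the number of input rows.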

This template can be used to remap the columns in event files to a new coding. The resulting spreadsheet is also useful for deciding whether two columns contain redundant information.
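One way to use such a template for the redundancy question is to check whether each value of one column always co-occurs with a single value of another. The sketch below assumes the template has been loaded into a pandas data frame; the helper `is_determined_by`, the column `stimulus_modality`, and the toy values are hypothetical and not part of the HED tools.

```python
import pandas as pd


def is_determined_by(template, source_col, target_col):
    """Return True if each source_col value pairs with exactly one target_col value."""
    # Count distinct target values per source value; NaN is treated as a value.
    distinct = template.groupby(source_col, dropna=False)[target_col].nunique(dropna=False)
    return bool((distinct <= 1).all())


# Toy template (hypothetical values and column names).
toy_template = pd.DataFrame({
    "key_counts": [10, 8, 5, 3],
    "event_type": ["low_tone", "high_tone", "dark_bar", "light_bar"],
    "stimulus_modality": ["sound", "sound", "image", "image"],
})

forward = is_determined_by(toy_template, "event_type", "stimulus_modality")
backward = is_determined_by(toy_template, "stimulus_modality", "event_type")
print(forward, backward)
```

In this toy case `stimulus_modality` is fully determined by `event_type`, so it adds no information, while the reverse does not hold because each modality covers two event types.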

```python
import os
from hed.tools.analysis.key_map import KeyMap
from hed.tools.util.data_util import get_new_dataframe
from hed.tools.util.io_util import get_file_list

# Variables to set for the specific dataset
dataset_root = '../../../datasets/eeg_ds002893s_hed_attention_shift'
output_path = ''  # Leave empty to print the template instead of saving it
exclude_dirs = ['trial', 'derivatives', 'code', 'sourcedata']

# Construct the key map
key_columns = ["focus_modality", "event_type", "attention_status"]
key_map = KeyMap(key_columns)

# Collect the events.tsv files and accumulate their unique key combinations
event_files = get_file_list(dataset_root, extensions=[".tsv"], name_suffix="events", exclude_dirs=exclude_dirs)
for event_file in event_files:
    print(os.path.basename(event_file))
    df = get_new_dataframe(event_file)
    key_map.update(df)

# Sort the combinations and build the template with their counts
key_map.resort()
template = key_map.make_template()
key_counts_sum = template['key_counts'].sum()
print(f"The total count of the keys is:{key_counts_sum}")

# Save the template as a tab-separated file, or print it if no path is given
if output_path:
    template.to_csv(output_path, sep='\t', index=False, header=True)
else:
    print(template)
```


sub-001_task-AuditoryVisualShift_run-01_events.tsv
sub-002_task-AuditoryVisualShift_run-01_events.tsv
The total count of the keys is:11730
    key_counts focus_modality       event_type attention_status
0         2298       auditory         low_tone         attended
1         2292         visual         dark_bar         attended
2         1540       auditory         dark_bar       unattended
3         1538         visual         low_tone       unattended
4          585       auditory     button_press              nan
5          577       auditory        high_tone         attended
6          576         visual        light_bar         attended
7          572         visual     button_press              nan
8          384       auditory        light_bar       unattended
9          383         visual        high_tone       unattended
10         288       auditory        hear_word         attended
11         287         visual        look_word         attended
12          96         visual