## Find event combinations

This notebook traverses a dataset and gathers the unique combinations of values found in the specified columns of its event files.

The setup requires the following variables for your dataset:

| Variable            | Purpose                                                        |
|---------------------|----------------------------------------------------------------|
| `data_root`         | Full path to root directory of dataset.                        |
| `output_path`       | Output path for the spreadsheet template. If empty, print it.  |
| `exclude_dirs`      | List of directories to exclude when constructing file lists.   |
| `key_columns`       | List of column names in the events.tsv files to combine.       |

The result is a tab-separated file whose columns are a `key_counts` column followed by the `key_columns` in the order given. The rows are all unique combinations of the `key_columns` values, sorted by columns left to right, and `key_counts` holds the count for each combination.
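
As a rough illustration of what "unique combinations, sorted left to right" means, the following sketch counts combinations using only the Python standard library. It is not how `KeyMap` is implemented; the function name `count_combinations` and the sample rows are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_combinations(rows, key_columns):
    """Count each unique combination of values in key_columns.

    rows is an iterable of dicts, one per event (for example, from
    csv.DictReader with delimiter='\t'). Returns a list of
    (combination, count) pairs sorted by the columns left to right.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    # Combinations are unique keys, so sorting the items sorts by combination.
    return sorted(counts.items())


# Hypothetical sample rows, for illustration only
rows = [
    {"trial_type": "famous_new", "value": "5"},
    {"trial_type": "boundary", "value": "0"},
    {"trial_type": "famous_new", "value": "5"},
]
combos = count_combinations(rows, ["trial_type", "value"])
for combination, count in combos:
    print(count, *combination, sep="\t")
```

Each output row pairs one combination with the number of events that carry it, which is the role `key_counts` plays in the template.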

The template can be used to remap the key columns in event files to a new coding. The resulting spreadsheet is also useful for deciding whether two columns contain redundant information.
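
One way to read redundancy from the combinations is to check whether the two columns map one-to-one: every value in the first column appears with exactly one value in the second, and vice versa. The sketch below assumes that definition; the helper `columns_are_redundant` is hypothetical and not part of the HED tools.

```python
from collections import defaultdict


def columns_are_redundant(pairs):
    """Return True if the two columns map one-to-one over all pairs.

    pairs is an iterable of (first_column_value, second_column_value)
    tuples, one per unique combination.
    """
    forward = defaultdict(set)   # first-column value -> second-column values seen
    backward = defaultdict(set)  # second-column value -> first-column values seen
    for first, second in pairs:
        forward[first].add(second)
        backward[second].add(first)
    return (all(len(seen) == 1 for seen in forward.values())
            and all(len(seen) == 1 for seen in backward.values()))


# Combinations taken from the first rows of the sample output
pairs = [
    ("boundary", "0"),
    ("famous_new", "5"),
    ("famous_second_early", "6"),
    ("famous_second_late", "7"),
]
one_to_one = columns_are_redundant(pairs)

# Adding a second value for an existing trial_type breaks the one-to-one mapping
mixed = columns_are_redundant(pairs + [("famous_new", "6")])
```

If the check holds for the full template, one of the two columns could in principle be derived from the other.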

In [3]:
import os
from hed.tools.analysis.key_map import KeyMap
from hed.tools.util.data_util import get_new_dataframe
from hed.tools.util.io_util import get_file_list

# Variables to set for the specific dataset
data_root = 'T:/summaryTests/ds002718-download'
output_path = ''
exclude_dirs = ['stimuli', 'derivatives', 'code', 'sourcecode']

# Construct the key map
key_columns = ["trial_type", "value"]
key_map = KeyMap(key_columns)

# Construct the unique combinations
event_files = get_file_list(data_root, extensions=[".tsv"], name_suffix="_events", exclude_dirs=exclude_dirs)
for event_file in event_files:
    print(os.path.basename(event_file))
    df = get_new_dataframe(event_file)
    key_map.update(df)

key_map.resort()
template = key_map.make_template()
key_counts_sum = template['key_counts'].sum()
print(f"The total count of the keys is:{key_counts_sum}")
if output_path:
    template.to_csv(output_path, sep='\t', index=False, header=True)
else:
    print(template)  


sub-002_task-FaceRecognition_events.tsv
sub-003_task-FaceRecognition_events.tsv
sub-004_task-FaceRecognition_events.tsv
sub-005_task-FaceRecognition_events.tsv
sub-006_task-FaceRecognition_events.tsv
sub-007_task-FaceRecognition_events.tsv
sub-008_task-FaceRecognition_events.tsv
sub-009_task-FaceRecognition_events.tsv
sub-010_task-FaceRecognition_events.tsv
sub-011_task-FaceRecognition_events.tsv
sub-012_task-FaceRecognition_events.tsv
sub-013_task-FaceRecognition_events.tsv
sub-014_task-FaceRecognition_events.tsv
sub-015_task-FaceRecognition_events.tsv
sub-016_task-FaceRecognition_events.tsv
sub-017_task-FaceRecognition_events.tsv
sub-018_task-FaceRecognition_events.tsv
sub-019_task-FaceRecognition_events.tsv
The total count of the keys is:31448
    key_counts               trial_type value
0           90                 boundary     0
1         2700               famous_new     5
2         1313      famous_second_early     6
3         1291       famous_second_late     7
4         353