## Summarize the contents of the event files in a BIDS dataset.

A first step in annotating a BIDS dataset is to find out what is in the dataset's
event files.
Event files sometimes contain a few unexpected or incorrect codes,
so it is a good idea to check their actual contents
before starting the annotation process.

This notebook traverses the BIDS dataset and gathers the unique
values in each column, along with the number of times each value appears in the dataset.

The script creates dictionaries that map a `key` to the full path
for each type of file.  The `key` is of the form `sub-xxx_run-y`, which
uniquely specifies each event file in the dataset. If a dataset contains
multiple sessions for each subject, the `key` should include additional
parts of the file name to uniquely specify each file.
Keys are specified by an `entities` tuple, which lists the BIDS entity names
to include in the key.
BIDS base file names consist of entity *name*-*value* pairs separated
by underscores and followed by an ending *_suffix*.

For a file name `sub-001_ses-3_task-target_run-01_events.tsv`,
the tuple `('sub', 'task')` gives a key of `sub-001_task-target`,
while the tuple `('sub', 'ses', 'run')` gives a key of `sub-001_ses-3_run-01`.
The use of dictionaries of file names with such keys makes it
easier to associate related files in the BIDS naming structure.
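
The key construction described above can be sketched in a few lines of plain Python. This is only an illustration of the idea: `make_key` is a hypothetical helper, not part of the HED tools, and the library's own key construction may differ in detail.

```python
def make_key(file_name, entities):
    """Build a key such as 'sub-001_run-01' from a BIDS-style file name."""
    base = file_name.split(".")[0]  # drop the extension
    pieces = base.split("_")
    # Pieces containing '-' are entity name-value pairs; the rest (e.g. 'events') is the suffix.
    pairs = dict(piece.split("-", 1) for piece in pieces if "-" in piece)
    # Keep only the requested entities, in the order given by the tuple.
    return "_".join(f"{name}-{pairs[name]}" for name in entities if name in pairs)


file_name = "sub-001_ses-3_task-target_run-01_events.tsv"
print(make_key(file_name, ("sub", "task")))        # sub-001_task-target
print(make_key(file_name, ("sub", "ses", "run")))  # sub-001_ses-3_run-01
```

Entities that are missing from a file name are silently left out of the key in this sketch, so two files can collide if the tuple does not include enough entities to distinguish them.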

Set the following variables for your dataset:

| Variable | Purpose |
| -------- | ------- |
| bids_root_path | Full path to root directory of dataset.|
| exclude_dirs | List of directories to exclude when constructing file lists. |
| entities  | Tuple of entity names used to construct unique keys representing filenames. <br>(See [Dictionaries of filenames](https://hed-examples.readthedocs.io/en/latest/HedInPython.html#dictionaries-of-filenames-anchor) for examples of how to choose the keys.)|
| skip_columns  |  List of column names in the `events.tsv` files to skip in the analysis. |
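
To make the roles of `bids_root_path` and `exclude_dirs` concrete, here is a minimal sketch of how a list of event files could be gathered with the standard library alone. The function `find_event_files` and its parameters are assumptions made for illustration; the notebook itself relies on the HED tools for this step.

```python
import os


def find_event_files(root, exclude_dirs=(), name_suffix="_events", extension=".tsv"):
    """Return sorted full paths of files under root whose names end with name_suffix + extension."""
    found = []
    for dir_path, dir_names, file_names in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dir_names[:] = [d for d in dir_names if d not in exclude_dirs]
        for file_name in file_names:
            if file_name.endswith(name_suffix + extension):
                found.append(os.path.join(dir_path, file_name))
    return sorted(found)
```

Calling `find_event_files(bids_root_path, exclude_dirs=['stimuli'])` would then return the paths of all `*_events.tsv` files outside any directory named `stimuli`.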

For large datasets, be sure to skip columns such as `onset` and `sample`.
The summary counts the number of times each unique value appears in the event files,
and these columns have a different value in nearly every row.
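
The effect of skipping columns can be seen in a small stand-alone sketch of per-column value counting. The helper `count_values` and the three-row table are made up for illustration and are not the HED tools implementation.

```python
import csv
import io
from collections import Counter, defaultdict


def count_values(tsv_text, skip_columns=()):
    """Count how often each value appears in each column that is not skipped."""
    counts = defaultdict(Counter)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for column, value in row.items():
            if column not in skip_columns:
                counts[column][value] += 1
    return counts


# A made-up three-row events table, for illustration only.
sample = "onset\tduration\tevent_type\n0.004\tn/a\tshow_cross\n0.5\tn/a\tshow_face\n1.2\tn/a\tshow_cross\n"
counts = count_values(sample, skip_columns=["onset", "duration"])
for column, counter in counts.items():
    print(column, dict(counter))  # event_type {'show_cross': 2, 'show_face': 1}
```

Without `skip_columns`, the `onset` column would contribute one entry per row, which is what makes the summary unwieldy for large datasets.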

The notebook creates `BidsTabularDictionary` and `BidsTabularSummary` objects to handle the summarization.

The example below uses a
[small version](https://github.com/hed-standard/hed-examples/tree/main/datasets/eeg_ds003654s_hed)
of the Wakeman-Henson face-processing dataset, which is available on OpenNeuro as
[ds003645](https://openneuro.org/datasets/ds003645/versions/2.0.0).
The small version is distributed as part of the hed-examples repository.

In [1]:
import os
from hed.tools import BidsTabularDictionary
from hed.util import get_file_list

# Variables to set for the specific dataset
bids_root_path = os.path.realpath('../../../datasets/eeg_ds003654s_hed')
name = 'eeg_ds003654s_hed'
exclude_dirs = ['stimuli']
entities = ('sub', 'run')
skip_columns = ["onset", "duration", "sample", "stim_file", "trial", "response_time"]


# Construct the file dictionary for the BIDS event files
event_files = get_file_list(bids_root_path, extensions=[".tsv"], name_suffix="_events", exclude_dirs=exclude_dirs)
file_dict = BidsTabularDictionary(name, event_files, entities=entities)

In [2]:
print(f"Summarizing {bids_root_path}...")
file_dict.print_files(title="\nBIDS style event files")


Summarizing D:\Research\HED\hed-examples\datasets\eeg_ds003654s_hed...

BIDS style event files (6 files)
sub_002_run_1: sub-002_task-FacePerception_run-1_events.tsv
sub_002_run_2: sub-002_task-FacePerception_run-2_events.tsv
sub_002_run_3: sub-002_task-FacePerception_run-3_events.tsv
sub_003_run_1: sub-003_task-FacePerception_run-1_events.tsv
sub_003_run_2: sub-003_task-FacePerception_run-2_events.tsv
sub_003_run_3: sub-003_task-FacePerception_run-3_events.tsv


In [3]:
print("\nBIDS style event file columns:")
for key, file, rowcount, columns in file_dict.iter_tsv_info():
    print(f"{key} [{rowcount} events]: {str(columns)}")



BIDS style event file columns:
sub_002_run_1 [552 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub_002_run_2 [516 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub_002_run_3 [533 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub_003_run_1 [589 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub_003_run_2 [589 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub_003_run_3 [583 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']


In [4]:
from hed.tools import BidsTabularSummary

# Summarize the values in the event files, ignoring the columns listed in skip_columns
bids_dicts_all, bids_dicts = BidsTabularSummary.make_combined_dicts(file_dict, skip_cols=skip_columns)
bids_dicts_all.print('\nBIDS events summary')


BIDS events summary
  Categorical columns (5):
    event_type (8 distinct values):
      double_press: 1
      left_press: 246
      right_press: 457
      setup_right_sym: 6
      show_circle: 884
      show_cross: 884
      show_face: 878
      show_face_initial: 6
    face_type (4 distinct values):
      famous_face: 294
      n/a: 2478
      scrambled_face: 298
      unfamiliar_face: 292
    rep_lag (12 distinct values):
      1: 216
      10: 49
      11: 57
      12: 40
      13: 17
      14: 15
      15: 3
      6: 1
      7: 5
      8: 10
      9: 21
      n/a: 2928
    rep_status (4 distinct values):
      delayed_repeat: 218
      first_show: 450
      immediate_repeat: 216
      n/a: 2478
    value (15 distinct values):
      0: 884
      1: 884
      13: 150
      14: 64
      15: 78
      17: 150
      18: 82
      19: 66
      256: 246
      3: 6
      4096: 457
      4352: 1
      5: 150
      6: 70
      7: 74
  Value columns (0):
