## Initial analysis of BIDS dataset events
The first step in annotating a BIDS dataset is to find out what the dataset's
event files contain.

The example used in this notebook is a reduced version of an auditory attention shift
dataset which is available at
[https://github.com/hed-standard/hed-examples/data/eeg_ds0028932](https://github.com/hed-standard/hed-examples/data/eeg_ds0028932).

To run this notebook, you will need to download this dataset and set the `bids_root_path`
variable to the local path of the dataset's root directory.

### Assess the events

The tools traverse the BIDS dataset and gather the unique values in each
column, along with the number of times each value appears in the dataset. Usually,
you will want to exclude the columns `onset`, `duration`, `sample`, and `HED`,
because counts of the unique values in these columns are not meaningful.

In [1]:
bids_root_path = "D:/eeg_ds002893s"

The next example recursively traverses the directory tree and produces
a list of the full paths of the dataset event files. Event files have the
extension `.tsv`, and their file names (before the extension) end with `_events`.
You may wish to check the returned list to verify that the expected event files
are in the dataset.

In [2]:
from hed.tools import get_file_list
event_file_list = get_file_list(bids_root_path, types=[".tsv"], suffix="_events")
print(f"Bids dataset {bids_root_path} has {len(event_file_list)} event files")

Bids dataset D:/eeg_ds002893s has 6 event files
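
To make the file-selection rule concrete, here is a minimal standard-library sketch of the same idea: walk the tree and keep the files whose names end with `_events.tsv`. The helper name `find_event_files` is hypothetical and is not part of the HED tools. It illustrates the rule described above and makes no claim about how `get_file_list` is implemented.

```python
from pathlib import Path


def find_event_files(root, suffix="_events", extension=".tsv"):
    """Return the sorted full paths of files under root whose names end with suffix + extension.

    Hypothetical helper for illustration only; it is not part of hed.tools.
    """
    root_path = Path(root)
    # rglob searches the directory tree recursively for matching names.
    matches = [p for p in root_path.rglob(f"*{suffix}{extension}") if p.is_file()]
    return sorted(str(p.resolve()) for p in matches)
```

Calling `find_event_files(bids_root_path)` on a local copy of the dataset should give a list that can be compared against the one returned by the HED tools.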


The HED tools provide functions for analyzing the contents of the event files.
To tag the dataset with HED, we need to know which categorical values
appear in it. The `ColumnDict` class provides basic facilities for
analyzing the contents of event files and for making template files.

In [3]:
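from hed.tools import ColumnDict
# Columns whose unique values are not worth counting
column_names_to_skip = ["onset", "duration", "sample", "HED"]
col_dict = ColumnDict(skip_cols=column_names_to_skip, name='AttentionShiftShort')
# Accumulate the value counts from every event file into one summary
for file in event_file_list:
    col_dict.update(file)
col_dict.print()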

Summary for column dictionary AttentionShiftShort:
  Categorical columns (4):
    response_time (1 distinct values):
      n/a: 29227
    stim_file (1 distinct values):
      n/a: 29227
    trial_type (3 distinct values):
      1: 5456
      2: 6059
      3: 17712
    value (28 distinct values):
      11: 228
      12: 228
      13: 456
      14: 456
      15: 1822
      16: 1820
      21: 252
      22: 252
      23: 505
      24: 504
      25: 2010
      26: 2012
      28: 2
      31: 720
      32: 719
      37: 960
      38: 960
      39: 480
      202: 72
      212: 3
      310: 480
      311: 3838
      312: 3832
      313: 1920
      314: 1915
      1201: 433
      2201: 502
      3201: 1846
  Value columns (0):
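
The summary above is, in essence, a per-column tally of values. The following minimal sketch shows that kind of tally using only the standard library. The function name `count_column_values` is hypothetical, and the code is an illustration of the counting idea under the assumption that event files are tab-separated with a header row. It is not the `ColumnDict` implementation.

```python
import csv
from collections import Counter, defaultdict


def count_column_values(file_paths, skip_cols=()):
    """Count how often each value appears in each column across tab-separated files.

    Hypothetical helper for illustration only; it is not part of hed.tools.
    """
    counts = defaultdict(Counter)
    for path in file_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                for column, value in row.items():
                    # DictReader stores surplus fields under the key None; ignore them.
                    if column is None or column in skip_cols:
                        continue
                    counts[column][value] += 1
    return counts
```

Passing `skip_cols=("onset", "duration", "sample", "HED")` mirrors the exclusion list used in this notebook, and iterating over the returned dictionary gives the distinct values and their counts for each remaining column.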


The previous example collates the information for all of the event files.
If you want the results for individual files, you can create a
`ColumnDict` for each file and use the `update_dict` method to
accumulate them into a summary `ColumnDict`.
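
Conceptually, combining per-file summaries amounts to adding the counts for each column. The sketch below shows that idea with `collections.Counter`. The `merge_counts` helper and the sample counts are hypothetical and exist only for illustration; they make no claim about how `update_dict` works internally.

```python
from collections import Counter


def merge_counts(total, per_file):
    """Add the per-file value counts for each column into the running totals."""
    for column, counter in per_file.items():
        # Counter.update adds counts rather than replacing them.
        total.setdefault(column, Counter()).update(counter)
    return total


# Hypothetical per-file counts for a single column (not taken from the dataset).
file_a = {"trial_type": Counter({"1": 3, "2": 1})}
file_b = {"trial_type": Counter({"1": 2, "3": 4})}

totals = {}
for per_file in (file_a, file_b):
    merge_counts(totals, per_file)
print(totals["trial_type"])
```

The per-file counters are left unchanged, so each file's summary can still be inspected after the totals have been built.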

In [4]:
from hed.tools import ColumnDict
column_names_to_skip = ["onset", "duration", "sample", "HED"]
# Summary that accumulates the counts from all files
col_dict_all = ColumnDict(skip_cols=column_names_to_skip, name='AttentionShiftWithTotals')
for file in event_file_list:
    # Per-file summary, named after the file it describes
    col_dict = ColumnDict(skip_cols=column_names_to_skip, name=file)
    col_dict.update(file)
    col_dict.print()
    # Fold this file's counts into the overall summary
    col_dict_all.update_dict(col_dict)
col_dict_all.print()

Summary for column dictionary D:/eeg_ds002893s\sub-001\eeg\sub-001_task-AuditoryVisualShift_run-01_events.tsv:
  Categorical columns (4):
    response_time (1 distinct values):
      n/a: 5856
    stim_file (1 distinct values):
      n/a: 5856
    trial_type (3 distinct values):
      1: 1155
      2: 1153
      3: 3548
    value (26 distinct values):
      11: 48
      12: 48
      13: 96
      14: 96
      15: 383
      16: 383
      21: 48
      22: 48
      23: 96
      24: 96
      25: 384
      26: 384
      31: 144
      32: 143
      37: 192
      38: 192
      39: 96
      202: 6
      310: 96
      311: 767
      312: 766
      313: 384
      314: 382
      1201: 100
      2201: 95
      3201: 383
  Value columns (0):
Summary for column dictionary D:/eeg_ds002893s\sub-002\eeg\sub-002_task-AuditoryVisualShift_run-01_events.tsv:
  Categorical columns (4):
    response_time (1 distinct values):
      n/a: 5874
    stim_file (1 distinct values):
      n/a: 5874
    trial_type (3 