
API update for tsv files #634

Open
adam2392 opened this issue Nov 17, 2020 · 13 comments

Comments

@adam2392
Member

Describe the problem

With #601, an API for "updating" BIDS datasets was introduced in mne-bids. However, this currently only works for JSON files, which are the easier case because a JSON sidecar can be represented as a dictionary and merged with update().

I have many datasets, though, where additional information is appended at the channel level, encoded in the channels.tsv and electrodes.tsv files. It would be awesome to have a similarly structured API for updating the corresponding .tsv files inside a BIDS dataset.

Describe your solution

I don't have one, heh, but I was hoping the brilliant minds here would have an idea on how to update a .tsv file robustly and efficiently.

Describe possible alternatives

A naive way of solving the above problem would be to use a nested dictionary as the update. That is, an update dictionary for a channels.tsv file would look like:

{
    'C1': {
        'status_description': 'blah',
    },
    'C2': {
        'type': 'Electroencephalography',
    },
}

Then the corresponding status_description and type entries are updated for channel rows "C1" and "C2"; everything else is left as-is.

Additional context

@hoechenberger
Member

My take would be a list of tuples:

new_vals = [('row_val_1', 'col_val_1', val_1),
            ('row_val_n', 'col_val_n', val_n)]

To select all rows or columns, we could use None:

new_vals = [(None, 'col_val_1', val_1),  # all rows
            ('row_val_n', None, val_n)]  # all cols

We need a way to specify which column shall serve as the index, so specifying
a row value for indexing actually makes sense.

For example, to select all bad channels and change their
status_description, we'd do:

new_vals = [('status', 'bad', 'status_description', 'super bad')]

but one can already see how this is going to be horrible to use.

I believe what we actually want is pandas-style indexing (i.e.,
DataFrame.loc)

@agramfort
Member

agramfort commented Nov 18, 2020 via email

@hoechenberger
Member

hoechenberger commented Nov 18, 2020

Totally agree. I was just trying to think within the existing framework of the JSON updates.

But I would very much prefer using pandas here too.

We should not expose these implementation details to the users though, so let's discuss API :) I don't want to end up with one update function for JSON and another for TSV files, that's all

@adam2392
Member Author

adam2392 commented Nov 19, 2020

let's not reinvent the wheel here. If we have a way to read the events as a dataframe, we can let users do what they want and then offer them a way to write it back with some schema checks. my 2c

Agreed, internally we should definitely use a pandas DataFrame. The API would just be for consistency's sake, since I can see a lot of use for updating a list of channels with some updated column value in channels.tsv/electrodes.tsv.

new_vals = [('row_val_1', 'col_val_1', val_1),
            ('row_val_n', 'col_val_n', val_n)]

To select all rows or columns, we could use None:

new_vals = [(None, 'col_val_1', val_1),  # all rows 
            ('row_val_n', None, val_n)]  # all cols

I like this approach.

We need a way to specify which column shall serve as the index, so specifying
a row value for indexing actually makes sense.

I think for files such as channels.tsv/electrodes.tsv we can safely assume that the index is the name column. That way, if you want to update the status and status_description of all channels:

new_vals = [(None, 'status', 'bad'), (None, 'status_description', 'super bad')]
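A sketch of a helper applying such (row, col, val) tuples, where None selects all rows; apply_updates is a hypothetical name, not an existing mne-bids function:

```python
import io
import pandas as pd

def apply_updates(df, new_vals):
    """Apply (row, col, val) tuples to df; row=None means all rows."""
    for row, col, val in new_vals:
        if row is None:
            df[col] = val          # broadcast to every row
        else:
            df.loc[row, col] = val  # update a single cell
    return df

# Toy channels.tsv, indexed by the assumed 'name' column.
tsv = ("name\tstatus\tstatus_description\n"
       "C1\tgood\tn/a\n"
       "C2\tgood\tn/a\n")
df = pd.read_csv(io.StringIO(tsv), sep='\t', index_col='name')
new_vals = [(None, 'status', 'bad'), (None, 'status_description', 'super bad')]
apply_updates(df, new_vals)
```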

@agramfort
Member

agramfort commented Nov 19, 2020 via email

@hoechenberger
Member

I'm also not yet totally convinced what exactly we're trying to achieve and how to get there :)

@adam2392
Member Author

This would be a change to electrodes.tsv/channels.tsv. Say I need to change the description of a channel, or a set of channels, to resected, or say left temporal lobe:

new_vals = [
    ('C1', 'description', 'resected'),
    ('C2', 'description', 'resected'),
    ('C40', 'description', 'left temporal lobe'),
]

# index column is assumed based on the fact that it's a 'channels.tsv' file to be 'name'
update_sidecar_tsv(bids_path, new_vals)

I acknowledge this can potentially get messy down the road, but minimally I feel it would be useful to have a .tsv reader/writer that abstracts away the pandas DataFrame handling?

@hoechenberger
Member

but minimally I feel like it would be useful to have a tsv reader/writer that might abstract away the pandas DataFrame reading?

I see the biggest advantage in the fact that a BIDSPath can essentially match more than just one TSV file (an entire study), which could be useful

@jasmainak
Member

jasmainak commented Nov 19, 2020

to have a tsv reader/writer that might abstract away the pandas DataFrame reading?

is it more than 1 line of code to read pandas? We will add a dependency with no benefit. Can't you achieve the same with BIDSPath.match() and pd.read_csv(...) ?
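The "no new API" route could look roughly like this; Path.glob stands in for BIDSPath.match() here so the sketch runs without mne-bids, and the helper name is made up:

```python
from pathlib import Path
import pandas as pd

def read_matching_tsvs(root, pattern='**/*_channels.tsv'):
    """Read every matching sidecar TSV under root into a DataFrame."""
    return {p: pd.read_csv(p, sep='\t') for p in Path(root).glob(pattern)}
```

Users would then edit the returned DataFrames however they like and write them back with to_csv(sep='\t', na_rep='n/a').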

@hoechenberger
Member

is it more than 1 line of code to read pandas? We will add a dependency with no benefit. Can't you achieve the same with BIDSPath.match() and pd.read_csv(...) ?

Yes, and it'd be more flexible too, I suppose.

But this raises the question why we offer JSON sidecar updating and not TSV sidecar updating?

@agramfort
Member

agramfort commented Nov 19, 2020 via email

@hoechenberger
Member

cause pandas does exist for JSON :)

Well.

# %%
import json_tricks
import pandas as pd

x = {
    'foo': 'bar',
    'baz': {
         'xxx': 1,
         'yyy': False
    }
}

json_tricks.dump(x, '/tmp/foo.json')

df = pd.read_json('/tmp/foo.json')
print(df)
#      foo  baz
# xxx  bar    1
# yyy  bar    0

@agramfort
Member

agramfort commented Nov 19, 2020 via email
