Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardizing Neuropoly datasets (One more step towards BIDS) #282

Open
2 of 50 tasks
NathanMolinier opened this issue Nov 15, 2023 · 39 comments
Open
2 of 50 tasks

Standardizing Neuropoly datasets (One more step towards BIDS) #282

NathanMolinier opened this issue Nov 15, 2023 · 39 comments
Assignees

Comments

@NathanMolinier
Copy link

NathanMolinier commented Nov 15, 2023

Description

This issue will be used to centralize the main discussions regarding our standardization strategies. The ultimate objective is to enable the execution of BIDS validator scripts within the derivatives folders of each dataset in our collection.

Dataset specific issues related to their standardization should be linked to this issue as well.

BIDS conventions

1. Important links

BIDS specifications
Neuropoly intranet

2. Important points regarding the raw dataset for MRI

[Brackets] are characterizing optional informations

As you may know, subjects folders in the raw datasets are structured as follows for MRI,

Raw structure
sub-<label>/
    [ses-<label>/]
        anat/
            sub-<label>[_ses-<label>][_acq-<label>][_ce-<label>][_rec-<label>][_run-<index>][_part-<mag|phase|real|imag>]_<suffix>.json
            sub-<label>[_ses-<label>][_acq-<label>][_ce-<label>][_rec-<label>][_run-<index>][_part-<mag|phase|real|imag>]_<suffix>.nii[.gz]
        dwi/
            sub-<label>[_ses-<label>][_acq-<label>][_rec-<label>][_dir-<label>][_run-<index>][_part-<mag|phase|real|imag>]_dwi.bval
            sub-<label>[_ses-<label>][_acq-<label>][_rec-<label>][_dir-<label>][_run-<index>][_part-<mag|phase|real|imag>]_dwi.bvec
            sub-<label>[_ses-<label>][_acq-<label>][_rec-<label>][_dir-<label>][_run-<index>][_part-<mag|phase|real|imag>]_dwi.json
            sub-<label>[_ses-<label>][_acq-<label>][_rec-<label>][_dir-<label>][_run-<index>][_part-<mag|phase|real|imag>]_dwi.nii[.gz]

With folders corresponding to subjects, [sessions] and MRI modalities.

Then, regarding the way filenames are constructed, we can identify 3 main types of elements:

Raw entities

Characterized by a key word (sub, ses, acq, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash -

  • sub-<label>
  • [ses-<label>]
  • [acq-<label>]
  • [ce-<label>]
  • [rec-<label>]
  • [run-<index>]
  • [part-<mag|phase|real|imag>]
  • [dir-<label>]

Entities are then separated with an underscore _

Raw suffixes

An alphanumeric string located after all the entities following a final underscore _ (i.e. the <suffix>) --> corresponding in our cases to the MRI contrast:

  • T1w
  • MP2RAGE
  • dwi
  • etc.
Raw extensions

Files extensions:

  • .nii.gz
  • .json
  • .bval
  • etc.

3. Important points regarding derivatives datasets

First, it is important to understand what are BIDS derivatives folders:

"Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing. Standardizing derivatives is motivated by use cases where formalized machine-readable access to processed data enables higher level processing."

Basically, derivative folders have to be seen as processed data obtained from the raw dataset. With a derivative folder corresponding to DIFFERENT processes. Let's consider a fundamental workflow that aligns with the receipt of a new dataset. Let's imagine that we would like to generate spinal cord (SC) segmentions and intervertebral discs labels from this raw data using sct_deepseg_sc and sct_label_vertebrae.

Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g. derivatives/labels_deepseg and derivatives/labels_vertebrae). Moreover, a dataset_description.json file will have to be added at the root of EACH derivative folder to keep track of all the operations applied to the data (processes used, manual corrections...).

dataset_description.json
{
  "BIDSVersion": "1.8.0",
  "Name": "<dataset_name>",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "sct_deepseg_sc",
      "Version": "SCT v6.1"
    },
    {
      "Name": "Manual",
      "Description": "Manually corrected by Nathan Molinier"
    }
  ]
}

To keep a record of complex processing steps applied to the data, a descriptions.tsv file can be used.

descriptions.tsv This file must be composed at least of two columns:
  • desc_id: labels corresponding to the desc entities (see Derivative entities below)
  • description: human readable descriptions of the processing steps

This file MAY be located at the root of the derivative dataset, or at the subject or session level

Then, derivative folders follow the same structure as the raw folders:

Derivatives structure
sub-<label>/
    [ses-<label>/]
        modality/
            <source_entities>[_space-<space>][_res-<label>][_den-<label>][_desc-<label>]_<suffix>.<extension>

With folders corresponding to subjects, [sessions] and MRI modalities.

Finally, to construct the filenames, we can identify the same 3 type of elements as before (entities, suffixes and extensions) plus 1 extra-consideration related to the raw data:

⚠️ Entities and suffixes are different from those used with the raw filenames and are specific to data types.

source_entities

This element corresponds to the entire source filename, with the omission of the source suffix and extension.

For MRI we need to remove the contrast from the filename (see here).

Derivative entities

Characterized by a key word (space, res, den, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash -

  • [space-<space>]: image space if different from raw space: template space (i.e. MNI305 etc), individual, study etc. (see BIDS for allowed spaces)
  • [res-<label>]: for changes in resolution
  • [den-<label>]
  • [desc-<label>]: should be used to specify the contrast (i.e. _desc-T1w and _desc-T2w)
  • [label-<label>]: to avoid confusion if multiple masks are available we can specify the masked structure (i.e. _label-WM for white matter, _label-GM for gray matter, _label-L for lesions etc.)

Entities are then separated with an underscore _

Derivative suffixes

An alphanumeric string located after all the entities following a final underscore _ :

  • mask for binary masks (0 and 1 only)
  • dseg for discrete segmentations representing multiple anatomical structures
  • probseg for probabilistic segmentations representing a single anatomical structure with values ranging from 0 to 1
  • etc.
Derivatives extensions

Files extensions:

  • .nii.gz
  • .json
  • etc.

Neuropoly strategy

To overcome this standardization challenge, multiple steps have to be considered:

  • Have a meeting with the laboratory to ensure that everyone understands the motivation behind this initiative.
  • Update our intranet with the BIDS 1.9 convention
  • Go through all our datasets to standardize our derivatives folders while making them more BIDS compatible
  • Use or create a BIDS validator script able to validate derivatives folders

Neuropoly Dataset state 2023-11-13 slides

MRI Datasets

  • basel-mp2rage
  • bavaria-quebec-spine-ms
  • beijing-tumor
  • canproco
  • data-single-subject_DO-NOT-USE
  • data-multi-subject
  • dcm-oklahoma
  • dcm-zurich-lesions
  • dcm-zurich
  • gmseg-challenge-2016
  • inspired
  • levin-stroke
  • lumbar-epfl
  • marseille-3T-mp2rage
  • marseille-rootlets
  • mni-bmpd
  • model_seg_exvivo_gm-wm_t2_unet2d-multichannel-softseg
  • msseg_challenge_2016
  • msseg_challenge_2021
  • philadelphia-pediatric
  • sci-colorado
  • sci-paris
  • sci–zurich
  • sct-testing-large
  • template
  • template_dog_virginiatech
  • twh-rootlets
  • uk-biobank
  • uk-biobank-processed
  • umass-ms-ge-excite1.5
  • umass-ms-ge-hdxt1.5
  • umass-ms-ge-pioneer3
  • umass-ms-siemens-espree1.5
  • uqueensland_mouse
  • whole-spine
  • zurich-mouse

Micro Datasets

  • data_axondeepseg_bf_source
  • data_axondeepseg_bf_training
  • data_axondeepseg_sickkids
  • data_axondeepseg_stanford
  • data_axondeepseg_tem
  • data_axondeepseg_users
  • data_axondeepseg_vcu
  • data_axondeepseg_wakehealth_source
  • data_axondeepseg_wakehealth_training

EEG Datasets

  • eeg-epilepsy
@NathanMolinier NathanMolinier self-assigned this Nov 17, 2023
@naga-karthik
Copy link
Member

Hey @NathanMolinier! I have to say that the issue description summarizes the BIDS format quite well (much much better than the original documentation, which is hard to understand). Great job!

One minor clarification: Are you sure that the the GeneratedBy in dataset_description.json is correct?
I mean, I thought that the Name refers to the person or the model that did the segmentations -- so in this case, it would be

{
      "Name": "Nathan Molinier",
      "Description": "Manually correction of deepseg_sc predictions"
    }

Also, a very important key in this is the Date, which should be added for all manual and automatic predictions. So, overall, I think the dataset_description should have the following structure:

{
  "BIDSVersion": "1.8.0",
  "Name": "<dataset_name>",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "sct_deepseg_sc",
      "Description": "SCT v6.1"
      "Date": "202X-mm-dd HH:MM:SS"
    },
    {
      "Name": "Nathan Molinier",
      "Description": "Manually correction of deepseg_sc predictions"
      "Date": "202X-mm-dd HH:MM:SS"
    }
  ]
}

@NathanMolinier
Copy link
Author

NathanMolinier commented Nov 21, 2023

Hey @naga-karthik ! Thanks for your comment ! Regarding the .json files, there are few considerations that need to be taken into account. We can identify actually 2 types of .json:

  • .json sidecars (OPTIONAL) with not a lot of specifications. We could keep using these files to store informations about specific correctors including the date/time.
  • dataset_description.json at the root of any given derivative folder (MANDATORY). This file is used to track the different processes used to create a given derivative folder. And because the manual corrections could be performed by multiple authors, we have to use the field description to specify correctors.

The date field that you added is here not relevant because it could be different for all the images in the derivative folder.

@mguaypaq
Copy link
Member

Nathan's description of the difference is good; basically we can put all the information we want in the json sidecars, but for dataset_description.json the GeneratedBy key has a more rigid schema. According to this part of the spec, the allowed fields are:

  • Name (mandatory): Name of the pipeline or process that generated the outputs, or Manual.
  • Description: Plain-text description of the pipeline or process that generated the outputs.
  • CodeURL: URL where the code used to generate the dataset may be found.
  • Version: Version of the pipeline
  • Container: A sub-object with keys Type, Tag, URI, to identify a docker container.

@valosekj
Copy link
Member

valosekj commented Nov 22, 2023

Thanks, @NathanMolinier, for this initiative! The description is clear and makes sense! 👍🏻

Basically, derivative folders have to be seen as processed data obtained from the raw dataset. With a derivative folder corresponding to DIFFERENT processes

Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g. derivatives/labels_deepseg and derivatives/labels_vertebrae).

We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, sct_deepseg_sc, sct_deepseg_gm, sct_label_vertebrae, etc. Should we then have multiple derivatives folders (one for each algorithm) as you are proposing or just a single derivatives folder for the whole pipeline (named based on the bash script, e.g., derivatives/process_data)?

Using multiple derivatives folders would be a big change causing backward incompatibility with all our current processing scripts which try to locate labels under derivatives-labels; example here.

Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using sct_deepseg_sc or sct_propseg or completely manually. How would we name the derivatives folder then?

@NathanMolinier
Copy link
Author

NathanMolinier commented Nov 22, 2023

Thanks @valosekj ! Great questions !

We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, sct_deepseg_sc, sct_deepseg_gm, sct_label_vertebrae, etc. Should we then have multiple derivatives folders (one for each algorithm) as you are proposing or just a single derivatives folder for the whole pipeline (named based on the bash script, e.g., derivatives/process_data)?

For this workflow, all the files generated using this pipeline can be kept inside the same derivative folder. In this case we will need to add the different functions computed inside the dataset_description.json file and possibly in a descriptions.tsv file:

dataset_description.json
{
 "BIDSVersion": "1.9.0",
 "Name": "<dataset_name>",
 "DatasetType": "derivative",
 "GeneratedBy": [
   {
     "Name": "sct_deepseg_sc",
     "Version": "SCT v6.1"
   },
   {
     "Name": "sct_deepseg_gm",
     "Version": "SCT v6.1"
   },
   {
     "Name": "sct_label_vertebrae",
     "Version": "SCT v6.1"
   },
   {
     "Name": "Manual",
     "Description": "Manually corrected by Nathan Molinier"
   }
 ]
}

Regarding the name of the folder, I believe there is nothing covering that. However, I think we should agree on a naming pattern for the lab datasets.

Using multiple derivatives folders would be a big change causing backward incompatibility with all our current processing scripts which try to locate labels under derivatives-labels; example here.

I understand that this point is a bit problematic. However, if you look at all the codes created in the lab, I believe we can identify more ways to import the data than the number of people currently working in the lab. Having a different way to import the data depending on the dataset is not a good thing. I truly believe that standardizing could help us finding the "good" behavior to use/import our data.

Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using sct_deepseg_sc or sct_propseg or completely manually. How would we name the derivatives folder then?

Thanks to BIDS standard, this information should be located also in the dataset_description.json file. If we cannot remember retrospectively, I believe we can specify it in a descriptions.tsv file.

@mguaypaq
Copy link
Member

The items in the GeneratedBy list in the derivative's dataset_description.json should be pipelines rather than commands, I think. Since the version of the pipeline is given, that should be enough to look at the pipeline code and see which SCT commands (and versions) were used.

@NathanMolinier NathanMolinier pinned this issue Nov 30, 2023
@NathanMolinier
Copy link
Author

The dataset curation section on our intranet was updated. Feel free to give me some feedback on it @mguaypaq @naga-karthik @valosekj @jcohenadad @sandrinebedard @plbenveniste @Nilser3

@Nilser3
Copy link

Nilser3 commented Dec 5, 2023

Hey @NathanMolinier , Thanks you!
Maybe add a flag for denoised images (see: the BIDS Preprocessed or cleaned data )
For example, in MP2RAGE datas we sometimes have denoised and non-denoised images. (Discussed here)

@jcohenadad
Copy link
Member

@Nilser3 please feel free to add it directly, or to suggest the change via a PR

@jcohenadad
Copy link
Member

@NathanMolinier I took the liberty to do this neuropoly/intranet.neuro.polymtl.ca@5666e57, and i'm planning to do the same for the rest of the document-- please let me know if you disagree

@NathanMolinier
Copy link
Author

@NathanMolinier I took the liberty to do this neuropoly/intranet.neuro.polymtl.ca@5666e57, and i'm planning to do the same for the rest of the document-- please let me know if you disagree

I was hesitant to do that while writing the documentation, but I finally kept the format with details to reduce the overall size of the document. However, I do understand that being able to access specific features using relative links can be interesting. We could also decide to use these titles in combination with the details.

@NathanMolinier
Copy link
Author

Hey @NathanMolinier , Thanks you! Maybe add a flag for denoised images (see: the BIDS Preprocessed or cleaned data ) For example, in MP2RAGE datas we sometimes have denoised and non-denoised images. (Discussed here)

Regarding this, I need to figure out how we should handle this case.

@jcohenadad
Copy link
Member

We could also decide to use these titles in combination with the details.

You mean, keep the headings (ie: hyperlink) sections, and below the section, use the "details" to make it look shorter? I'm not a big fan of that. It will involve more clicking. Putting myself in the shoes of someone who wants to create a BIDS dataset, they will need to go through all sub-sections, and they can do so easily from the clickable TOC, so why add more clicks?

@NathanMolinier
Copy link
Author

Yes it was exactly what I meant. So, it's fine with me if we decide to switch to your version with more headings.

@jcohenadad
Copy link
Member

This is a pretty big deal:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/0b1d9580da994af05ef7b40e38996062034fc385/data/dataset-curation.md?plain=1#L341

In the past we decided to keep the contrast and add the derivatives entities (eg: _label-sc_mask). We need to discuss the implications of changing that, because it will impact pretty much ALL our processing pipelines. Tagging @sandrinebedard @valosekj @naga-karthik @plbenveniste @Nilser3 @mguaypaq

@jcohenadad
Copy link
Member

jcohenadad commented Dec 5, 2023

About this:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/0b1d9580da994af05ef7b40e38996062034fc385/data/dataset-curation.md?plain=1#L363

In the past, I vaguely remember we discussed this and agreed to use mask for soft masks as well, and do the distinction in the derivative/ child folder (eg: derivatives/labels_soft/). Tagging @mguaypaq @sandrinebedard @naga-karthik

We also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail)

@jcohenadad
Copy link
Member

@NathanMolinier pls merge into single table:
https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L361-L376

@jcohenadad
Copy link
Member

I'm not sure we should introduce the distinction between seg and label:
https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L361-L362

With our use cases, I see more confusion being introduced by future members, which means more trouble for the core team to maintain the consistency of our database

@jcohenadad
Copy link
Member

I would not introduce these two:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L364-L365

Instead, I would use a single suffix _label, as we've done so far.

@NathanMolinier
Copy link
Author

Regarding the fact that raw suffixes (contrasts in our case) are removed from the original image names and replaced by the desc- entities in the derived filenames. I personally think that we should comply to this key feature if we:

  • decide to promote BIDS usage in our open-source tools like SCT
  • want our scripts to work on BIDS compatible datasets generated by random users
  • decide to participate in the development of BIDS
  • decide to trust BIDS

About this last point, I strongly believe that standards such as BIDS are dependent on trust. Put differently, if individuals opt not to adhere entirely to the fundamental principles of a standard, we will ultimately end up with numerous BIDS-like datasets that are incompatible.

Regarding our old scripts, and the fact that they may not work on these new datasets, it is important to remember that they were created based on specific versions of the datasets and thanks to Git, it is always possible to checkout to an anterior version (commit).

Also, I tried to find related conversations on the BIDS repository and I found this issue where people are talking about an entity mod- which could be used as well to store information about the MRI contrast.

Finally, the practice of excluding the image suffix during the creation of derived data is not a novel concept. This feature was originally described in the initial version of the documentation addressing derivative datasets (v.1.4.0).

@jcohenadad
Copy link
Member

#282 (comment) these are excellent points, @NathanMolinier

@sandrinebedard
Copy link
Member

About this:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/0b1d9580da994af05ef7b40e38996062034fc385/data/dataset-curation.md?plain=1#L363

In the past, I vaguely remember we discussed this and agreed to use mask for soft masks as well, and do the distinction in the derivative/ child folder (eg: derivatives/labels_soft/). Tagging @mguaypaq @sandrinebedard @naga-karthik

We also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail)

We still added the suffix softseg , here is an example: sub-amu01_T2w_softseg.nii.gz that would be located under labels_softseg

@NathanMolinier
Copy link
Author

NathanMolinier commented Dec 12, 2023

Beyond thinking between _T1w and _mod-T1w, we need to make sure that we understand the scope of the changes. Indeed, we need first to clarify the purpose of our derivatives datasets to choose the most suited option for our research.

Currently, most of our datasets are private and are only used for our different projects mainly to train and test AI models. However, for reproducibility, all the scripts used for these projects are available on github. Therefore, changing the way we standardize/use our data will impact directly these scripts:

Workaround 1:

  • we decide to not follow BIDS derivatives' main features and comply with our own standard using _T1w
  • we decide to use custom suffixes and entities.

Workaround 2: see mguaypaq's slides

  • we decide to not follow BIDS derivatives' main features and comply with our own standard using _T1w
  • we track actively the sources of our derived data using the json sidecars
  • we decide to use custom suffixes and entities.

Workaround 3:

  • we decide to follow BIDS derivatives’ main features and comply with their standard using _mod-T1w
  • we decide to use custom suffixes and entities.

If we decide to go with workaround 1, we will need to clarify within all our scripts that we are not complying to BIDS derivatives format and thus people will have to modify our scripts or transform their data to comply with our own custom format. From my point of view, this will make our scripts less likely to be used and scripts like manual correction will only be used for our internal usage.

If we decide to go with workaround 3, BIDS users would be more likely interested in our scripts. Regarding entities and suffixes, having custom ones will not be as impactful because this could be easily mitigated using special parser flags.

I’m not saying that BIDS is perfect (I personally think that suffixes should be avoided) but I think that complying to a standard could positively impact the scope of our research work. Because having our own custom standard will only generate more barriers with other researchers.

@jcohenadad
Copy link
Member

jcohenadad commented Dec 12, 2023

Thank you for your insights @NathanMolinier . These are solid arguments. For the sake of comprehensiveness, I will copy/paste the minutes from our previous meeting:

About the derivative suffix (ie. Workaround 1 vs. 2), here are some additional pros/cons:

  • Pros for workaround 1:
    • in SHELL scripts (example), the variable of file name can be used to add a suffix (eg _mask). It will be VERY confusing for non-SHELL experts to deal with an additional ‘mod-XX’ in the middle of the file name.
  • Pros for workaround 2:
    • Not considered
  • Pros for workaround 3:

About the use of probseg: We decided to use “softseg”. Related commit (that could still be updated) https://github.com/neuropoly/intranet.neuro.polymtl.ca/commit/03ca9af8753dd11795627f242350b380f4e5b890

About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary.

Some Examples:

  • sub-001_T1w_label-SC_seg # SC seg binary
  • sub-001_T1w_label-SC_softseg # SC seg soft
  • sub-001_T1w_label-vertebrae_dseg # segmentation of vertebrae non-binary
  • sub-001_T1w_label-discs_dlabel # point-wise disc levels
  • sub-001_T1w_label-compression_label # binary label of compression (there could be more than one label)
  • sub-001_T1w_label-pmj_dlabel # single point-wise label of pmj with value 50
  • sub-001_T1w_label-vertebrae_dlabel # point-wise mid-vertebral levels in the middle of the spinal cord
  • sub-001_T1w_label-lesion_seg # lesion seg, binary

@jcohenadad
Copy link
Member

jcohenadad commented Dec 12, 2023

cross ref to gslide explaining additional BIDS context

@jcohenadad
Copy link
Member

jcohenadad commented Dec 12, 2023

another meeting on 2023-12-12-- trying to find consensual decision about what to do:

About the derivative suffix (ie. Workaround 1 vs. 2 vs. 3):

  • additional scenario to consider
    • mask that is the same across multiple contrasts
      • with workaround 1: we would duplicate it (eg blabla_T1w_mask ; blabla_T2w_mask)
    • conclusion: we opt for workaround 1.

About the use of probseg: We opt for “softseg”.

  • about the term 'softmask' in this commit --> this referred not to a segmentation, but a bigger 'region' mask (eg cylinder mask) --> we won't use that.

About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary.

Some Examples:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/master/data/dataset-curation.md#examples-and-use-cases

(someone please continue that list)

todo:

  • figure out ways to cross-ref the dicts (eg; in the descriptoion JSON) and link to intranet-- in the intranet, links to eg SCT website-- we need also to create convention for rootlets and others
  • edit intranet with vocab: "lesion"
  • finish examples in intranet
  • move "Examples and use cases" on top

@NathanMolinier
Copy link
Author

NathanMolinier commented Jan 12, 2024

One last question about our convention:

BIDS says that Derivative data obtained using DIFFERENT processes/workflows should be stored using DIFFERENT derivatives folders, example:

  • derivatives/labels/
  • derivatives/sct_5.6/
  • derivatives/fmriprep_2.3/

The idea is to use the dataset_description.json to track the processes use to generate the data.

However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ?


Solution 1: We use multiple folders to track the processes used to create the data.

Solution 2: We decide to regroup all the files under a same derivative dataset called labels as we currently do.


What do we decide ? I just want to properly update the intranet.

@valosekj
Copy link
Member

However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ?

Yeah, I think so too. I.e., "Solution 2":

Solution 2: We decide to regroup all the files under a same derivative dataset called labels as we currently do.

I believe this solution is also better for bash scripts where we check if a label already exists under derivatives (example here). It would be more complicated to handle multiple different derivatives/labels/ folders.

@NathanMolinier
Copy link
Author

Solution 2: We decide to regroup all the files under a same derivative dataset called labels as we currently do.

I believe this solution is also better for bash scripts where we check if a label already exists under derivatives (example here). It would be more complicated to handle multiple different derivatives/labels/ folders.

It also means that we will have to remove the derivative folder labels_softseg from data-multi-subject.

@jcohenadad
Copy link
Member

I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets).

on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels.

@sandrinebedard
Copy link
Member

I would go with Julien's approach!

I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets).

on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels.

For the labels_softseg in the spine generic multi-subject dataset, this is a special case that these segmentation are an average of 6 contrasts (so not the output of a segmentation model but using the binary segmentation in labels + averaging ).

@naga-karthik
Copy link
Member

IIUC, soft segmentations such as sub-001_T1w_label-SC_softseg.nii.gz will be under labels_softseg but binary segmentations such as sub-001_T1w_label-SC.nii.gz will be under labels ?

These file name examples are taken from #282 (comment)

@NathanMolinier
Copy link
Author

For the labels_softseg in the spine generic multi-subject dataset, this is a special case that these segmentation are an average of 6 contrasts (so not the output of a segmentation model but using the binary segmentation in labels + averaging ).

I understand completely, that's why I believe that solution 1 should not be set aside. The contrast agnostic project will not be the last project asking for a particular data.

In fact, having special data for a specific project is not new: the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.

@jcohenadad
Copy link
Member

the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.

isn't it what the JSON sidecar is supposed to address?

About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this?

@valosekj
Copy link
Member

About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this?

A note from canproco/README.md:

Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz').
Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels.
Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170
For context, see #197

This happened because I got inspiration from the spine-generic analysis pipeline.

@NathanMolinier
Copy link
Author

the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.

isn't it what the JSON sidecar is supposed to address?

Yes also, but people could miss this information from the JSON sidecars, from a human perspective having separate folder may be more visual.

Also, I'm not saying that I'm completely against having all the files within the same derivative folder, but I do think that we should not completely avoid solution 1. This could be helpful for other projects like the contrast-agnostic SC segmentation.

@jcohenadad
Copy link
Member

Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz').
Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels.
Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170
For context, see #197

This happened because I got inspiration from the spine-generic analysis pipeline.

Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of put them back into the appropriate space.

@NathanMolinier
Copy link
Author

Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of put them back into the appropriate space.

Labels in a different space that the one used for images will also have a new entity space-XXX associated and a reference within their JSON sidecars

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants