ConstructCSV when the directory layout is not in default format #532

carlpe · 2022-12-02T21:18:22Z

The "Constructing the Data CSV" section of the documentation shows users how to construct the data CSV automatically, but it assumes the data is laid out in the following format:

./experiment_0/data_dir/
  │   │
  │   └───Patient_001 # this is used to construct the "SubjectID" header of the CSV
  │   │   │ Patient_001_brain_t1.nii.gz
  │   │   │ Patient_001_brain_t1ce.nii.gz
  │   │   │ Patient_001_brain_t2.nii.gz
  │   │   │ Patient_001_brain_flair.nii.gz
  │   │   │ Patient_001_brain_seg.nii.gz
  │   │   
  │   └───Patient_002 # this is used to construct the "SubjectID" header of the CSV
  │   │   │ ...
  │

Sometimes the files are not organized this way but, for example, like this:

  │   │
  │   └─── ct
  │   │   │ subject001_ct.nii.gz
  │   │   │ subject002_ct.nii.gz
  │   │   │ subject003_ct.nii.gz
  │   │   │ subject004_ct.nii.gz
  │   │   
  │   └─── gt
  │   │   │ subject001_gt.nii.gz
  │   │   │ subject002_gt.nii.gz
  │   │   │ subject003_gt.nii.gz
  │   │   │ subject004_gt.nii.gz
  │

It would be great if it was possible to use the constructCSV regardless of the directory format.

If this is already implemented, please add an example of how to do this to the documentation.

Thank you.

sarthakpati · 2022-12-03T02:32:04Z

> It would be great if it was possible to use the constructCSV regardless of the directory format.

Unfortunately, gandlf_constructCSV is designed to work only with a specific folder structure. If you can think of a way it could be made more generic (while ensuring that the current mechanism works as expected), we'd be happy to consider updating the implementation 😄.

carlpe · 2022-12-03T07:13:52Z

The way they did it in NiftyNet was to search through a given folder for all files named, for example, xxx_ct.nii.gz and xxx_gt.nii.gz, and save the names to a CSV.

https://github.com/NifTK/NiftyNet/blob/935bf4334cd00fa9f9d50f6a95ddcbfdde4031e0/niftynet/utilities/util_csv.py#L206

https://github.com/NifTK/NiftyNet/blob/935bf4334cd00fa9f9d50f6a95ddcbfdde4031e0/niftynet/utilities/filename_matching.py#L96
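The suffix-based search described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not NiftyNet's or GaNDLF's actual code. The function names `match_by_suffix` and `write_csv` are hypothetical, and the `Channel_0` and `Label` column names are assumptions added next to the `SubjectID` header mentioned earlier.

```python
import csv
from collections import defaultdict
from pathlib import Path


def match_by_suffix(data_dir, suffixes):
    """Group files under data_dir by the prefix left after removing a known suffix.

    `suffixes` maps a CSV column name to a filename ending, for example
    {"Channel_0": "_ct.nii.gz", "Label": "_gt.nii.gz"} (assumed column names).
    """
    subjects = defaultdict(dict)
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        for column, suffix in suffixes.items():
            if path.name.endswith(suffix):
                # "subject001_ct.nii.gz" -> "subject001"
                subject_id = path.name[: -len(suffix)]
                subjects[subject_id][column] = str(path)
    return subjects


def write_csv(subjects, suffixes, output_csv):
    """Write one row per subject that has a file for every requested column."""
    columns = list(suffixes)
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["SubjectID", *columns])
        for subject_id in sorted(subjects):
            files = subjects[subject_id]
            # Skip subjects that are missing a channel or the ground truth
            if all(column in files for column in columns):
                writer.writerow([subject_id, *(files[c] for c in columns)])
```

Because the subject identifier is whatever remains once the suffix is removed, this only works when every file for a subject shares exactly the same prefix.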

sarthakpati · 2022-12-03T14:01:19Z

I started implementing something but ran into a problem right from the get-go: How should the program know how to match subjects in a single folder?

./experiment_0/data_dir/
  │   │
  │   └───Patient_001 # this is used to construct the "SubjectID" header of the CSV
  │   │   │ Patient_001_brain_t1.nii.gz
  │   │   │ Patient_001_brain_t1ce.nii.gz
  │   │   │ Patient_001_brain_t2.nii.gz
  │   │   │ Patient_001_brain_flair.nii.gz
  │   │   │ Patient_001_brain_seg.nii.gz
  │   │   
  │   └───Patient_002 # this is used to construct the "SubjectID" header of the CSV
  │   │   │ ...
  │

In the above example, all files contain Patient_${ID} as an identifier. If this is the case, then it would be much cleaner practice to structure the input folder in a per-patient manner, which would allow ground truth and other metadata to be kept on a per-patient basis. I'm not trying to dictate how someone should structure their data; it's just that we need to try and hit the lowest common denominator, and supporting all possible data structure formats is impossible 😞.

Additionally, the above structure is somewhat related to the Brain Imaging Data Structure (BIDS) (a formalized mechanism to define data formats), but not entirely, since BIDS has definitions mostly for DICOM. Anyway, let me know what you think.

carlpe · 2022-12-03T14:57:21Z

The subjects are matched based on their number, for example 001, 002, and so on. I guess it is not a very important issue; it would just be easier in the scenario where the data is structured in this manner, which is how it typically would be in NiftyNet or in MONAI 😊

https://niftynet.readthedocs.io/en/dev/filename_matching.html#automatic-filename-matching

I think it is totally OK to keep GaNDLF as it already is, because it works once the data is in the right directory format.
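Matching subjects on their number, as described in this comment, could look something like the sketch below. It assumes the subject number is the first run of digits in the filename, which holds for names like subject001_ct.nii.gz but not for names that start with a numbered modality. `subject_number` and `pair_by_number` are hypothetical helper names, not part of GaNDLF or NiftyNet.

```python
import re
from collections import defaultdict


def subject_number(filename):
    """Return the first run of digits in filename, or None if there is none."""
    match = re.search(r"\d+", filename)
    return match.group(0) if match else None


def pair_by_number(filenames):
    """Group filenames that share the same subject number."""
    groups = defaultdict(list)
    for name in filenames:
        number = subject_number(name)
        if number is not None:  # files without any digits are ignored
            groups[number].append(name)
    return dict(groups)
```

The number is kept as a string so that zero padding is preserved, which means "001" and "1" are treated as different subjects.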

sarthakpati · 2022-12-03T15:44:11Z

What do you guys think about this, @AlexanderGetka-cbica, @Geeks-Sid?

Geeks-Sid · 2022-12-12T01:33:00Z

While this would be of great utility, construct_csv is starter code to help folks get going. There could be many more formats for folder structuring, and while it would be great to support all of them, it is currently not in our plans. But as always, pull requests are appreciated.

sarthakpati · 2022-12-12T02:53:10Z

Cool, thanks for the input! What about you, @AlexanderGetka-cbica?

AlexanderGetka-cbica · 2022-12-12T03:31:15Z

I wrote some code similar to this for the automatic multi-subject feature extraction pipeline on the IPP. But as I learned, any heuristically based method is going to fail at some point.

The difficulty is this:
The current default scheme assumes that the directory name can be interpreted as a patient identifier, and identifies channels based on strings present in the filenames.
A scheme that handles the data Carl shows requires us to interpret directories as channel names, and identify subjects based on strings present in the filenames.
If we cannot actually make guarantees about how subjectIDs and channels are named (or their order in the filenames themselves), we can't automatically detect each case, so for a general-use constructCSV script, we should just pick one and stick with it. Even if we provided an option for this, something like "topLevelDirsAreChannels" (I really cannot think of a clear, succinct name for this behavior), users would have to know what that means and interpret it, which will just cause confusion.

If we can safely assume that subjectIDs only differ by number, then we actually can autodetect this case (and provide a switch just in case users actually don't get the output they expect.) Is that a reasonable assumption?
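As a rough sketch of what such autodetection might look like under the "differ only by number" assumption: mask every run of digits and check whether the names collapse to a single template. The function names and the returned layout strings are hypothetical and not part of GaNDLF.

```python
import re
from pathlib import Path


def differ_only_by_number(names):
    """True if there are at least two names and masking digit runs makes them identical."""
    names = list(names)
    masked = {re.sub(r"\d+", "#", name) for name in names}
    return len(names) > 1 and len(masked) == 1


def guess_layout(data_dir):
    """Return 'dirs_are_channels', 'dirs_are_subjects', or 'unknown'."""
    subdirs = sorted(p for p in Path(data_dir).iterdir() if p.is_dir())
    if not subdirs:
        return "unknown"
    # Channel directories: inside each one, files differ only by subject number
    if all(
        differ_only_by_number(f.name for f in d.iterdir() if f.is_file())
        for d in subdirs
    ):
        return "dirs_are_channels"
    # Subject directories: the directory names themselves differ only by number
    if differ_only_by_number(d.name for d in subdirs):
        return "dirs_are_subjects"
    return "unknown"
```

Like any heuristic, this can guess wrong. For example, a per-subject folder that contains only t1 and t2 files would look like a channel folder, because the digits in the modality names are masked too. That is why an explicit override switch would still be needed.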

sarthakpati · 2022-12-12T15:21:59Z

> If we can safely assume that subjectIDs only differ by number, then we actually can autodetect this case (and provide a switch just in case users actually don't get the output they expect.)

I think this is a very well-put argument. I'll ask @carlpe for more clarification.

carlpe · 2022-12-12T16:55:17Z

For data sets consisting of only a single channel (_ct.nii.gz), the file names will differ only by number.

But when we have multiple channels, for example several MR weightings (_T1.nii.gz, _T2.nii.gz), the file names will differ in more than just the number.

I suppose it might be better to keep it as you already have it now, as there is good reasoning behind the format.
I found it doesn't take long to convert into your format anyway. I actually found a Windows tool that does this for me in batch for the whole dataset. It is free software named "Advanced Renamer", and it took only a couple of minutes to do the directory formatting.
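For anyone who prefers a script over a renaming tool, the same conversion could be sketched in Python as below. It assumes files are named `<subject>_<channel>.nii.gz` inside a folder named after the channel, as in the ct/gt example earlier in this thread. `restructure` is a hypothetical helper, not a GaNDLF function, and it copies rather than moves so the original data stays untouched.

```python
import shutil
from pathlib import Path


def restructure(source_dir, target_dir, extension=".nii.gz"):
    """Copy <channel>/<subject>_<channel><ext> to <subject>/<subject>_<channel><ext>."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = []
    for channel_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        # Files in "ct" are expected to end with "_ct.nii.gz", and so on
        ending = f"_{channel_dir.name}{extension}"
        for path in sorted(channel_dir.iterdir()):
            if not (path.is_file() and path.name.endswith(ending)):
                continue
            subject_id = path.name[: -len(ending)]
            if not subject_id:
                continue  # nothing left to use as a subject folder name
            destination = target_dir / subject_id / path.name
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)
            copied.append(destination)
    return copied
```

The target directory should be outside the source directory, so that the new per-subject folders are not mistaken for channel folders if the script is run a second time.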

sarthakpati · 2023-02-02T19:25:32Z

Closing this until we have a different solution.

sarthakpati added the enhancement New feature or request label Dec 3, 2022

sarthakpati closed this as completed Feb 2, 2023
