About Pretraining Data Formats #59

WYC-321 · 2022-06-06T15:29:51Z

I downloaded the pre-training dataset from TCIA, but the downloaded data is in .dcm format, which does not match the .nii.gz format listed in the json file. Was a format conversion step applied?

ahatamiz · 2022-06-06T20:17:49Z

Hi @WYC-321

We did convert the DICOM files to NIfTI. In addition, we filtered out some outlier cases based on the information provided in the metadata. Please see the json files containing the exact train/val splits here.

Thanks
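
For readers following along, here is a minimal sketch of how one DICOM series folder could be turned into a single NIfTI file. It is not the maintainers' pipeline, which has not been published in this thread, and it does not reproduce their outlier filtering. It assumes the third-party SimpleITK package is installed; the helper names `output_path` and `convert_series` are hypothetical.

```python
from pathlib import Path


def output_path(out_dir: Path, index: int) -> Path:
    """Build a file name in the style used by the json files (img_<n>.nii.gz)."""
    return Path(out_dir) / f"img_{index}.nii.gz"


def convert_series(series_dir: Path, out_path: Path) -> None:
    """Read one DICOM series folder and write it as a single NIfTI volume."""
    # Imported lazily so the helper above works without SimpleITK installed.
    import SimpleITK as sitk

    # Ask GDCM for the file names of the (first) series in this folder,
    # returned in slice order rather than in directory-listing order.
    file_names = sitk.ImageSeriesReader.GetGDCMSeriesFileNames(str(series_dir))
    if not file_names:
        raise FileNotFoundError(f"no DICOM series found in {series_dir}")

    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(file_names)
    image = reader.Execute()

    # The .nii.gz extension selects compressed NIfTI output.
    sitk.WriteImage(image, str(out_path))
```

A call would look like `convert_series(Path("3.000000-NA-18592"), output_path(Path("out"), 19))`. Which series folder actually belongs to which index is an open question in this thread.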

WYC-321 · 2022-06-07T06:59:25Z

Hi @ahatamiz,
Thank you for your answer.
After looking at the dataset I have some more detailed questions:
(1) Are the DICOM files simply converted to NIfTI without any additional processing?

I noticed that the naming rules in the json file differ from those of the database. For example, in the dataset_TCIAcolon_v2_0.json file the images are named like img_19.nii.gz, but in the TCIA CT Colonography Trial database the directory paths look like this:
CT COLONOGRAPHY\1.3.6.1.4.1.9328.50.4.0019\01-01-2000-1-CT ABD WCONT RECONSTRUCTION-18588
I'm guessing that the 0019 in 1.3.6.1.4.1.9328.50.4.0019 refers to img_19, but there are six subfolders under this directory: 1.000000-NA-18589 (1 DICOM file), 3.000000-NA-18592 (482 DICOM files), 5.000000-NA-19075 (1 DICOM file), 7.000000-NA-19078 (438 DICOM files), 9.000000-NA-19517 (1 DICOM file), and 11.000000-NA-19520 (444 DICOM files). So even with the json file, I still don't know which subfolder img_19.nii.gz refers to (all the data in the six subfolders, or the data in just one?). Other datasets have similar situations. My questions are as follows:
(2) How can I link the files in the original database to the files described in the json?
(3) Some subfolders contain multiple DICOM slices. Are they simply concatenated in order and converted to a single NIfTI file?
(4) Given the complexity of these details, would it be possible to release a script that converts the raw data to the data described in the json file?

Finally, thanks again for your excellent work and contributions to open source code.

Best wishes!
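
As an illustration of questions (2) and (3), the following is a minimal sketch for inspecting one study folder. It rests on two unconfirmed assumptions: that the digits after the last dot of the patient folder name (e.g. `0019`) are the index in `img_19.nii.gz`, as guessed above, and that subfolders holding a single DICOM file are not image volumes. The helper names `guess_image_name` and `list_series` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def guess_image_name(patient_dir_name: str) -> str:
    """Guess the json file name from a TCIA patient folder name.

    Assumption (unconfirmed): "0019" in "1.3.6.1.4.1.9328.50.4.0019"
    corresponds to "img_19.nii.gz".
    """
    suffix = patient_dir_name.rsplit(".", 1)[-1]
    return f"img_{int(suffix)}.nii.gz"


def list_series(study_dir: Path, min_slices: int = 2) -> list[tuple[str, int]]:
    """Return (subfolder name, number of .dcm files) for candidate volumes.

    Subfolders with fewer than `min_slices` DICOM files are skipped.
    """
    series = []
    for sub in sorted(p for p in study_dir.iterdir() if p.is_dir()):
        n_files = sum(1 for f in sub.iterdir() if f.suffix.lower() == ".dcm")
        if n_files >= min_slices:
            series.append((sub.name, n_files))
    return series
```

For the study described above, this would leave the three subfolders with 482, 438 and 444 files as candidates. It does not tell you which of them was used for `img_19.nii.gz`.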

ahatamiz · 2022-06-13T14:26:20Z

Hi @WYC-321,

I believe the best way to address your questions is to release the pre-processing pipeline. I have raised the issue regarding this with our team members and the code for pre-processing shall be released very soon.

CC: @wyli

Best

WYC-321 · 2022-06-28T09:07:07Z

Thanks a lot to your team.

Jamshidhsp · 2022-10-12T08:45:13Z

@WYC-321 I have the same issue with the code. Did you manage to work it out?

JiaxinZhuang · 2022-11-16T09:46:21Z

I also downloaded the datasets and tried to follow the split in the JSON file. However, for HNSCC as well as TCIAcolon, it is hard to produce the required NIfTI files from the downloaded data, because I can't find the correspondence between the two.
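
To at least see which files a split expects, a small sketch like the one below can list the names in a json file and extract their numeric indices. The layout it assumes (top-level keys such as `training` and `validation`, each a list of `{"image": "img_19.nii.gz"}` entries) is an assumption and should be checked against the actual files; the function names are hypothetical.

```python
from __future__ import annotations

import json
import re


def load_split(path: str) -> dict:
    """Load a split file such as dataset_TCIAcolon_v2_0.json."""
    with open(path) as f:
        return json.load(f)


def expected_images(split: dict) -> dict[str, list[str]]:
    """Collect the image file names listed under each list-valued key.

    Assumed layout: {"training": [{"image": "img_19.nii.gz"}, ...], ...}
    """
    return {
        key: [entry["image"] for entry in entries]
        for key, entries in split.items()
        if isinstance(entries, list)
    }


def image_index(name: str) -> int:
    """Extract the integer index from a name such as "img_19.nii.gz"."""
    match = re.fullmatch(r"img_?(\d+)\.nii\.gz", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1))
```

With the indices in hand, each one can be compared against the downloaded patient folders, although the correspondence itself remains unconfirmed in this thread.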

JakobDexl · 2023-02-20T11:28:06Z

@JiaxinZhuang @WYC-321 did you manage to figure it out? I'm also struggling with the naming relationship for the datasets (HNSCC and COLON).
