About Pretraining Data Formats #59

Open
WYC-321 opened this issue Jun 6, 2022 · 7 comments

@WYC-321

WYC-321 commented Jun 6, 2022

I downloaded the pre-training dataset from TCIA, but the downloaded files are in .dcm format, whereas the json file lists .nii.gz files. Was a format conversion performed, and if so, how?

@ahatamiz
Contributor

ahatamiz commented Jun 6, 2022

Hi @WYC-321

We did convert the DICOM files to NIfTI. In addition, we filtered out some outlier cases based on the information provided in the metadata. Please see the json files containing the exact train/val splits here.
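For readers comparing their download against the splits, a minimal sketch of listing which case indices a split file names may help. It assumes the json follows a datalist layout with top-level `training` and `validation` lists whose entries carry an `image` field; that layout, the helper name `list_case_indices`, and the miniature example below are assumptions for illustration, not taken from the repository.

```python
import json
import posixpath
import re


def list_case_indices(datalist_text, section):
    """Return the sorted integer case indices named in one section of a datalist.

    Assumes entries look like {"image": "img_19.nii.gz"} (an assumed layout).
    """
    datalist = json.loads(datalist_text)
    indices = []
    for entry in datalist[section]:
        # Drop any leading directory so only the file name is matched.
        file_name = posixpath.basename(entry["image"])
        match = re.fullmatch(r"img_?(\d+)\.nii\.gz", file_name)
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)


# Hypothetical miniature datalist, not copied from the real split files.
example = (
    '{"training": [{"image": "img_19.nii.gz"}, {"image": "img_3.nii.gz"}],'
    ' "validation": [{"image": "img_7.nii.gz"}]}'
)
print(list_case_indices(example, "training"))
```

This only tells you which indices to look for; it does not say which downloaded series each index corresponds to.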

Thanks

@WYC-321
Author

WYC-321 commented Jun 7, 2022

Hi @ahatamiz,
Thank you for your answer.
After looking at the dataset, I have some more detailed questions:
(1) Were the DICOM files simply converted to NIfTI, without any additional processing?

I noticed that the naming scheme in the json file differs from that of the database. For example, in dataset_TCIAcolon_v2_0.json the images are named like img_19.nii.gz, but in the TCIA CT Colonography Trial database the directory paths look like this:
CT COLONOGRAPHY\1.3.6.1.4.1.9328.50.4.0019\01-01-2000-1-CT ABD WCONT RECONSTRUCTION-18588
I'm guessing that the 0019 in 1.3.6.1.4.1.9328.50.4.0019 refers to img_19, but there are six subfolders under this directory: 1.000000-NA-18589 (1 DICOM file), 3.000000-NA-18592 (482 DICOM files), 5.000000-NA-19075 (1 DICOM file), 7.000000-NA-19078 (438 DICOM files), 9.000000-NA-19517 (1 DICOM file), and 11.000000-NA-19520 (444 DICOM files). So even with the json file, I still don't know which subfolder img_19.nii.gz refers to (the data in all six subfolders, or in just one?). The other datasets have the same problem. My remaining questions are:
(2) How can I link the files in the original database to the files listed in the json?
(3) Some subfolders contain multiple DICOM slices. Are they simply concatenated in order and converted to a single NIfTI file?
(4) Given how many details are involved, would it be possible to release a script that converts the raw data into the data described in the json file?
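To make questions (2) and (3) concrete, here is a rough sketch of how one might inspect a downloaded study folder. It is not the authors' pipeline. It encodes only the guess stated above (that the trailing digits of the subject folder name give the img index) and the assumption that series folders holding a single .dcm file are not full volumes; both assumptions, and all function names, are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def guess_case_index(subject_dir_name):
    """Guess the case index from the trailing digits of a subject UID folder name."""
    match = re.search(r"\.(\d+)$", subject_dir_name)
    return int(match.group(1)) if match else None


def summarize_series(study_dir):
    """Return (series folder name, number of .dcm files) pairs, largest first."""
    counts = []
    for series in Path(study_dir).iterdir():
        if series.is_dir():
            n_files = sum(1 for f in series.iterdir() if f.suffix.lower() == ".dcm")
            counts.append((series.name, n_files))
    return sorted(counts, key=lambda item: item[1], reverse=True)


def candidate_volumes(study_dir, min_slices=2):
    """Keep only series folders with at least min_slices files (assumed to be volumes)."""
    return [name for name, n_files in summarize_series(study_dir) if n_files >= min_slices]


print(guess_case_index("1.3.6.1.4.1.9328.50.4.0019"))

# Demo on a throwaway folder tree with made-up, much smaller file counts.
with tempfile.TemporaryDirectory() as tmp:
    study = Path(tmp) / "study"
    for name, n_files in [("1.000000-NA-18589", 1),
                          ("3.000000-NA-18592", 4),
                          ("7.000000-NA-19078", 3)]:
        (study / name).mkdir(parents=True)
        for i in range(n_files):
            (study / name / f"1-{i:03d}.dcm").touch()
    demo_candidates = candidate_volumes(study)
print(demo_candidates)
```

Even with the single-file folders excluded, several multi-slice series remain per subject, so this narrows the search but cannot by itself decide which series became img_19.nii.gz. Slice order within a series should come from the DICOM headers rather than from file names.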

Finally, thanks again for your excellent work and contributions to open source code.

Best wishes !

@ahatamiz
Contributor

Hi @WYC-321,

I believe the best way to address your questions is to release the pre-processing pipeline. I have raised this with our team members, and the pre-processing code will be released very soon.

CC: @wyli

Best

@WYC-321
Author

WYC-321 commented Jun 28, 2022

Thanks a lot to your team.

@Jamshidhsp

@WYC-321 I have the same issue with the code. Did you manage to work it out?

@JiaxinZhuang

I also downloaded the datasets and tried to follow the split in the JSON file. However, for HNSCC as well as TCIAcolon, it is hard to produce the required NIfTI files from the downloaded data, because I can't find how the downloaded series correspond to the file names in the JSON.

@JakobDexl

@JiaxinZhuang @WYC-321 did you manage to figure it out? I'm also struggling with the naming relationship for the datasets (HNSCC and COLON).
