New HDF5 dataset #19

Closed
jspunda opened this issue May 5, 2017 · 5 comments

Comments

@jspunda
Owner

jspunda commented May 5, 2017

The current setup of DICOM files plus two .csv files is a little messy and cumbersome to work with. To extract lesions, we first have to load all DICOM series of interest from disk, then match them against the first .csv file (prostateX-images-train.csv) to get the lesion info (ijk, spacing, etc.). After that we have to load prostateX-findings-train.csv to obtain the zone and clinSig information.

We have to put all this information together in order to extract the right lesion, with the right truth label and zone information from the right DICOM series, before we can start training. That's why I decided to restructure the data and combine the DICOM pixel data and the two .csv files into one hdf5 dataset.
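For illustration, the joining step described above could look roughly like this. It is a minimal sketch, not the code from h5_converter.py: the column names (ProxID, fid, DCMSerDescr, ijk, VoxelSpacing, zone, ClinSig) are assumptions about the two .csv files, and the rows are made-up stand-ins so the snippet runs on its own.

```python
import csv
import io

# Made-up stand-ins for the two .csv files; column names are assumptions.
IMAGES_CSV = """ProxID,fid,DCMSerDescr,ijk,VoxelSpacing
ProstateX-0000,1,ep2d_diff_tra_DYNDIST_ADC,36 72 9,"2,2,3"
ProstateX-0000,1,t2_tse_tra,167 224 9,"0.5,0.5,3"
"""
FINDINGS_CSV = """ProxID,fid,zone,ClinSig
ProstateX-0000,1,PZ,TRUE
"""


def join_lesion_info(images_file, findings_file):
    """Attach zone and ClinSig to every image row, keyed on (ProxID, fid)."""
    findings = {
        (row["ProxID"], row["fid"]): row
        for row in csv.DictReader(findings_file)
    }
    merged = []
    for row in csv.DictReader(images_file):
        finding = findings.get((row["ProxID"], row["fid"]))
        if finding is None:
            continue  # image row without a matching finding
        merged.append(
            {**row, "zone": finding["zone"], "ClinSig": finding["ClinSig"]}
        )
    return merged


rows = join_lesion_info(io.StringIO(IMAGES_CSV), io.StringIO(FINDINGS_CSV))
for row in rows:
    print(row["DCMSerDescr"], row["ijk"], row["zone"], row["ClinSig"])
```

With real files, the two `io.StringIO` objects would be replaced by the opened .csv files. Each merged row then still has to be matched to the pixel data of its DICOM series, which is the part the hdf5 set takes care of.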

The code for this can be found in the h5_converter branch. As of now there are three files: csv_fix.py, h5_converter.py and h5_query.py. csv_fix.py and h5_converter.py only have to be run once to build the hdf5 set (which I have already done). The structure of the set is described in h5_converter.py.

To retrieve something from the set we can use h5_query.py. It contains a class that lets us draw DICOM images and their lesion information almost instantly, which is much faster than our old way of reading DICOM files from disk and then loading their pixel data.

Note that there is no actual lesion pixel data in the hdf5 set, just the lesion attributes from the .csv files and the DICOM pixel data. Extracting the lesion pixel data from the DICOM pixel data should be much more straightforward with the query result from h5_query.py.
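As a rough idea of what that extraction could look like, here is a minimal sketch using NumPy. The function name `extract_patch` is hypothetical, and it assumes the pixel data is indexed [slice, row, column] and that 'ijk' is a space-separated string with i = column, j = row, k = slice. Both assumptions should be checked against h5_converter.py.

```python
import numpy as np


def extract_patch(volume, ijk, size=16):
    """Cut a size x size square around a lesion from one slice.

    Assumes `volume` is indexed [slice, row, column] and `ijk` is an
    "i j k" string with i = column, j = row, k = slice.
    """
    i, j, k = (int(v) for v in ijk.split())
    half = size // 2
    # Zero-pad rows and columns so patches near the border keep full size.
    padded = np.pad(volume[k], half, mode="constant")
    # After padding, the lesion centre sits at (j + half, i + half).
    return padded[j:j + size, i:i + size]


# Synthetic volume: 4 slices of 32 x 32 pixels.
volume = np.arange(4 * 32 * 32).reshape(4, 32, 32)
patch = extract_patch(volume, "10 20 2", size=8)
print(patch.shape)
```

Zero-padding is only one way to handle lesions close to the image border; clamping the window to the image would be an alternative if padded pixels are unwanted.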

The new HDF5 dataset can be found at https://jspunda.stackstorage.com/s/0Zy95CMqQzwVaAq
The password for the file is: ismi2017

Whether or not we are actually going to be using this new set of course depends on what everyone thinks, but in my opinion it will simplify and speed things up a lot in the future.

@jspunda
Owner Author

jspunda commented May 5, 2017

I'm not sure what you mean. Both original .csv files are opened in 'read' mode. Then the new file is written with a different name: ProstateX-Images-Train-NEW.csv

Maybe I'm missing something...

@schelv
Collaborator

schelv commented May 5, 2017

That is what the description in csv_fix.py says. Then I read the code...

@schelv
Collaborator

schelv commented May 5, 2017

Can you give an example of how the data loading works?
For example I want:
X,y with X being the 2d tumor slices, and y the label.

@jspunda
Owner Author

jspunda commented May 5, 2017

If you mean a cutout of the lesion from a particular slice, there is no functionality for that yet. You could, however, create a query object like the example code in h5_query.py, say for all the ADC series. That will give you a subset of the data containing just the full DICOM images and the lesion information.

The print result function in h5_query.py shows how to traverse this subset. At the very end it extracts one lesion attribute named 'ijk'. To get the label for that lesion, change the attribute name to 'ClinSig'. If you want to access the raw DICOM pixel data, it would look something like result[patient_id][dcm_series_name]['pixel_array'][:]
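Put together, collecting X and y from such a result could look roughly like the sketch below. It builds a tiny in-memory HDF5 file so that it runs on its own. Only the result[patient_id][dcm_series_name]['pixel_array'] part of the layout comes from the description above; the 'lesions' group, the storage of 'ijk' and 'ClinSig' as HDF5 attributes, and the helper names `build_demo_file` and `collect_slices` are assumptions made for illustration.

```python
import h5py
import numpy as np


def build_demo_file():
    """Create a tiny in-memory HDF5 file with an assumed layout."""
    f = h5py.File("demo.h5", "w", driver="core", backing_store=False)
    series = f.create_group("ProstateX-0000/ADC")
    series.create_dataset(
        "pixel_array", data=np.zeros((4, 32, 32), dtype=np.int16)
    )
    # Assumption: one sub-group per lesion, attributes copied from the .csv.
    lesion = series.create_group("lesions/1")
    lesion.attrs["ijk"] = "10 20 2"
    lesion.attrs["ClinSig"] = True
    return f


def collect_slices(result):
    """Return (X, y): the full 2D slice holding each lesion, and its label."""
    X, y = [], []
    for patient_id in result:
        for series_name in result[patient_id]:
            series = result[patient_id][series_name]
            pixels = series["pixel_array"][:]  # load the series into memory
            for lesion in series["lesions"].values():
                k = int(lesion.attrs["ijk"].split()[2])  # slice index
                X.append(pixels[k])
                y.append(bool(lesion.attrs["ClinSig"]))
    return X, y


with build_demo_file() as f:
    X, y = collect_slices(f)
print(len(X), X[0].shape, y)
```

This returns whole slices rather than lesion cutouts, since cutting out the lesion is not part of h5_query.py yet.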

@schelv
Collaborator

schelv commented May 15, 2017

works great!

@schelv schelv closed this as completed May 15, 2017