It's really handy to have all the DICOM info available in a single DataFrame, so let's create that! In this notebook, we'll just create the DICOM DataFrames. To see how to use them to analyze the competition data, see [this followup notebook](https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai).

First, we'll install the latest versions of PyTorch and fastai v2 (not officially released yet) so we can use the fastai medical imaging module.

In [1]:
!pip install torch torchvision feather-format pyarrow --upgrade   > /dev/null
!pip install git+https://github.com/fastai/fastai_dev             > /dev/null

ERROR: allennlp 0.9.0 requires flaky, which is not installed.
ERROR: allennlp 0.9.0 requires responses>=0.7, which is not installed.
  Running command git clone -q https://github.com/fastai/fastai_dev /tmp/pip-req-build-g7ab0_ar


In [2]:
from fastai2.torch_basics      import *
from fastai2.data.all          import *
from fastai2.test              import *
from fastai2.medical.imaging   import *

Let's take a look at what files we have in the dataset.

In [3]:
path = Path('../input/rsna-intracranial-hemorrhage-detection/')

Most lists in fastai v2, including the one from `Path.ls`, are returned as a [fastai.core.L](http://dev.fast.ai/core.html#L), which has lots of handy methods, such as `attrgot`, used here to grab file names.

In [4]:
path_trn = path/'stage_1_train_images'
fns_trn = path_trn.ls()
fns_trn[:5].attrgot('name')

(#5) [ID_231d901c1.dcm,ID_994bc0470.dcm,ID_127689cce.dcm,ID_25457734a.dcm,ID_81c9aa125.dcm]
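
If you haven't used `L` before, `attrgot('name')` is roughly a shortcut for pulling the same attribute off every item. Here's a plain-Python sketch of the idea; the file paths are made up for illustration and nothing here uses fastai itself.

```python
from pathlib import Path

# Made-up paths standing in for the result of `path_trn.ls()`
fns = [Path("imgs/ID_a.dcm"), Path("imgs/ID_b.dcm")]

# Rough equivalent of `fns.attrgot('name')`: grab the `name` attribute of each item
names = [getattr(f, "name") for f in fns]
print(names)
```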

In [5]:
path_tst = path/'stage_1_test_images'
fns_tst = path_tst.ls()
len(fns_trn),len(fns_tst)

(674258, 78545)

We can grab a file and take a look inside using the `dcmread` method that fastai v2 adds.

In [6]:
fn = fns_trn[0]
dcm = fn.dcmread()
dcm

(0008, 0018) SOP Instance UID                    UI: ID_231d901c1
(0008, 0060) Modality                            CS: 'CT'
(0010, 0020) Patient ID                          LO: 'ID_b81a287f'
(0020, 000d) Study Instance UID                  UI: ID_dd37ba3adb
(0020, 000e) Series Instance UID                 UI: ID_15dcd6057a
(0020, 0010) Study ID                            SH: ''
(0020, 0032) Image Position (Patient)            DS: ['-125.000', '-123.101', '104.307']
(0020, 0037) Image Orientation (Patient)         DS: ['1.000000', '0.000000', '0.000000', '0.000000', '0.984808', '-0.173648']
(0028, 0002) Samples per Pixel                   US: 1
(0028, 0004) Photometric Interpretation          CS: 'MONOCHROME2'
(0028, 0010) Rows                                US: 512
(0028, 0011) Columns                             US: 512
(0028, 0030) Pixel Spacing                       DS: ['0.488281', '0.488281']
(0028, 0100) Bits Allocated                      US: 16
(0028, 0101) Bits Stored         

# Labels

Before we pull the metadata out of the DICOM files, let's process the labels into a convenient format and save it for later. We'll use *feather* format because it's lightning fast!

In [7]:
def save_lbls():
    path_lbls = path/'stage_1_train.csv'
    lbls = pd.read_csv(path_lbls)
    lbls[["ID","htype"]] = lbls.ID.str.rsplit("_", n=1, expand=True)
    lbls.drop_duplicates(['ID','htype'], inplace=True)
    pvt = lbls.pivot(index='ID', columns='htype', values='Label')
    pvt.reset_index(inplace=True)
    pvt.to_feather('labels.fth')

In [8]:
save_lbls()
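
To see what the reshaping in `save_lbls` does, here's a small sketch on a toy DataFrame. The IDs and labels below are invented; the assumption is only that the CSV has one row per image and hemorrhage type, with the type appended to the ID after the last underscore.

```python
import pandas as pd

# Toy stand-in for stage_1_train.csv (invented values, including one duplicate row)
lbls = pd.DataFrame({
    "ID": ["ID_aaa_any", "ID_aaa_epidural",
           "ID_bbb_any", "ID_bbb_epidural", "ID_bbb_epidural"],
    "Label": [0, 0, 1, 1, 1],
})

# Split "ID_aaa_epidural" into the image ID and the hemorrhage type
lbls[["ID", "htype"]] = lbls.ID.str.rsplit("_", n=1, expand=True)

# pivot needs unique (ID, htype) pairs, so drop the repeated row first
lbls = lbls.drop_duplicates(["ID", "htype"])

# One row per image, one column per hemorrhage type
pvt = lbls.pivot(index="ID", columns="htype", values="Label").reset_index()
print(pvt)
```

Dropping duplicates before pivoting matters: `pivot` raises an error if the same ID and type pair appears twice.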

In [9]:
df_lbls = pd.read_feather('labels.fth').set_index('ID')
df_lbls.head(8)

ID,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_000039fa0,0,0,0,0,0,0
ID_00005679d,0,0,0,0,0,0
ID_00008ce3c,0,0,0,0,0,0
ID_0000950d7,0,0,0,0,0,0
ID_0000aee4b,0,0,0,0,0,0
ID_0000f1657,0,0,0,0,0,0
ID_000178e76,0,0,0,0,0,0
ID_00019828f,0,0,0,0,0,0


In [10]:
df_lbls.mean()

any                 0.144015
epidural            0.004095
intraparenchymal    0.048296
intraventricular    0.035248
subarachnoid        0.047641
subdural            0.063026
dtype: float64

There's not much RAM on these Kaggle kernel instances, so we'll clean up as we go.

In [11]:
del(df_lbls)
import gc; gc.collect();

# DICOM Meta

To turn the DICOM file metadata into a DataFrame, we can use the `from_dicoms` function that fastai v2 adds. If you pass `px_summ=True`, summary statistics of the image pixels (mean/min/max/std) are added to the DataFrame as well. This takes much longer, since the image data has to be uncompressed.

In [12]:
%time df_tst = pd.DataFrame.from_dicoms(fns_tst, px_summ=True, n_workers=2)
df_tst.to_feather('df_tst.fth')
df_tst.head()

CPU times: user 1min 4s, sys: 25.4 s, total: 1min 30s
Wall time: 5min 19s


Unnamed: 0,SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,...,MultiPixelSpacing,PixelSpacing1,img_min,img_max,img_mean,img_std,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1
0,ID_47a2de312,CT,ID_ae549afd,ID_ed2df9f4f0,ID_823c0e49f8,,-125.0,1.0,1,MONOCHROME2,...,1,0.488281,-2000,2779,14.395943,1202.477116,,,,
1,ID_0ea75f68f,CT,ID_2e2abf40,ID_76ee69d498,ID_517894241c,,-125.0,1.0,1,MONOCHROME2,...,1,0.488281,-2000,3392,66.198795,1205.325972,,,,
2,ID_508306a1f,CT,ID_ad5a9ac9,ID_dbfdc62c5a,ID_a698fcc176,,-125.0,1.0,1,MONOCHROME2,...,1,0.488281,-2000,1359,-362.891445,868.7463,,,,
3,ID_efc19ad30,CT,ID_bc57b857,ID_7be6a5b9ee,ID_8691d27be3,,-125.0,1.0,1,MONOCHROME2,...,1,0.488281,-2000,2334,-268.46743,1249.256516,,,,
4,ID_ba886fcc2,CT,ID_e2441ca0,ID_eb2b4f4bef,ID_4c05f02584,,-125.0,1.0,1,MONOCHROME2,...,1,0.488281,-2000,2742,-272.197708,985.178578,,,,
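
To make the `img_min`, `img_max`, `img_mean` and `img_std` columns concrete, here's a rough sketch of how per-image summaries like these could be computed. This is not the fastai implementation: `px_summary` is a hypothetical helper, the images are random arrays rather than real scans, and the exact std convention may differ from what `from_dicoms` uses.

```python
import numpy as np
import pandas as pd

def px_summary(px):
    "Hypothetical helper: summary statistics for one image's pixel array."
    px = np.asarray(px, dtype=np.float64)
    return {"img_min": px.min(), "img_max": px.max(),
            "img_mean": px.mean(), "img_std": px.std()}

# Random 512x512 arrays standing in for decoded DICOM pixel data
rng = np.random.default_rng(0)
fake_imgs = [rng.integers(-2000, 3000, size=(512, 512)) for _ in range(3)]

# One record per image, collected into a DataFrame
df_summ = pd.DataFrame([px_summary(im) for im in fake_imgs])
print(df_summ)
```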


In [13]:
del(df_tst)
gc.collect();

Unfortunately, Kaggle runs out of RAM if we use multiple workers on the much larger training set, so here we pass `n_workers=0`, which disables parallel processing.

In [14]:
%time df_trn = pd.DataFrame.from_dicoms(fns_trn, px_summ=True, n_workers=0)
df_trn.to_feather('df_trn.fth')

{'SOPInstanceUID': 'ID_6431af929', 'Modality': 'CT', 'PatientID': 'ID_bba2045f', 'StudyInstanceUID': 'ID_9180c688de', 'SeriesInstanceUID': 'ID_863be16ddb', 'StudyID': '', 'ImagePositionPatient': "-125.000", 'ImageOrientationPatient': "1.000000", 'SamplesPerPixel': 1, 'PhotometricInterpretation': 'MONOCHROME2', 'Rows': 512, 'Columns': 512, 'PixelSpacing': "0.488281", 'BitsAllocated': 16, 'BitsStored': 16, 'HighBit': 15, 'PixelRepresentation': 1, 'WindowCenter': "30", 'WindowWidth': "80", 'RescaleIntercept': "-1024", 'RescaleSlope': "1", 'fname': '../input/rsna-intracranial-hemorrhage-detection/stage_1_train_images/ID_6431af929.dcm', 'MultiImagePositionPatient': 1, 'ImagePositionPatient1': "-127.746", 'ImagePositionPatient2': "174.901", 'MultiImageOrientationPatient': 1, 'ImageOrientationPatient1': "0.000000", 'ImageOrientationPatient2': "0.000000", 'ImageOrientationPatient3': "0.000000", 'ImageOrientationPatient4': "0.972370", 'ImageOrientationPatient5': "-0.233445", 'MultiPixelSpacing'

There is one corrupted DICOM file in the competition data, so the command above prints out the metadata it managed to read from that file. Despite the error message shown above, the command completes successfully, and the data from the corrupted file is not included in the output DataFrame.
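
If you ever need the same "report it and keep going" behaviour in your own loops, one common pattern is to wrap the per-file work in a `try`/`except`. The sketch below is only an illustration of that pattern, not how fastai does it: `safe_records` and `parse` are hypothetical names, and strings stand in for files.

```python
import pandas as pd

def safe_records(items, parse):
    "Hypothetical helper: parse each item, reporting and skipping any that fail."
    recs = []
    for it in items:
        try:
            recs.append(parse(it))
        except Exception as e:
            # Report the bad item, then carry on with the rest
            print(f"skipping {it!r}: {e}")
    return pd.DataFrame(recs)

def parse(s):
    "Toy parser standing in for reading one file; fails on non-numeric input."
    return {"name": s, "value": int(s)}

df = safe_records(["1", "2", "bad", "4"], parse)
print(df)
```

The failing item is printed but never reaches the DataFrame, so the result has one row fewer than the input.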