# Combining data sources into a unified dataset

- Loading and processing raw data files
- Implementing a Python class to represent our data
- Converting our data into a format usable by PyTorch
- Visualizing the training and validation data

Our goal is to produce a training sample from two inputs: raw CT scan data and a list of annotations for those CTs.

Our CT data comes in two files: a .mhd file containing metadata header information, and a .raw file containing the raw bytes that make up the 3D array. Each file’s name starts with a unique identifier called the series UID (the name comes from the Digital Imaging and Communications in Medicine [DICOM] nomenclature) for the CT scan in question. For example, for series UID 1.2.3, there would be two files: 1.2.3.mhd and 1.2.3.raw.
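The two-file layout can be illustrated with a small standard-library sketch. It is not the loader used in dsets.py (in practice a library such as SimpleITK typically reads these files); the helper names `parse_mhd_header` and `read_ct_files` and the tiny 2x2x2 demo scan are assumptions made up for illustration. The header keys shown (`DimSize`, `ElementType`, `ElementDataFile`) are standard MetaImage fields, and byte order is ignored here for brevity.

```python
import array
import tempfile
from pathlib import Path


def parse_mhd_header(mhd_path):
    """Parse the 'Key = Value' lines of a MetaImage (.mhd) header into a dict."""
    header = {}
    for line in Path(mhd_path).read_text().splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()
    return header


def read_ct_files(data_dir, series_uid):
    """Return (header, flat voxel values) for one series UID."""
    data_dir = Path(data_dir)
    header = parse_mhd_header(data_dir / f"{series_uid}.mhd")
    # The header names its companion data file; this sketch assumes
    # MET_SHORT (signed 16-bit) voxels in the machine's native byte order.
    raw_path = data_dir / header["ElementDataFile"]
    voxels = array.array("h")
    voxels.frombytes(raw_path.read_bytes())
    return header, voxels


# Build a tiny hypothetical 2x2x2 scan so the sketch runs without LUNA data.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "1.2.3.mhd").write_text(
    "ObjectType = Image\n"
    "NDims = 3\n"
    "DimSize = 2 2 2\n"
    "ElementType = MET_SHORT\n"
    "ElementDataFile = 1.2.3.raw\n"
)
(demo_dir / "1.2.3.raw").write_bytes(array.array("h", range(-4, 4)).tobytes())

header, voxels = read_ct_files(demo_dir, "1.2.3")
dim_size = [int(n) for n in header["DimSize"].split()]
print(dim_size, list(voxels))
```

The flat list of voxel values would then be reshaped to the dimensions given by `DimSize` to obtain the 3D array.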

Our Ct class will consume those two files and produce the 3D array, as well as the transformation matrix to convert from the patient coordinate system (which we will discuss in more detail in section 10.6) to the index, row, column coordinates needed by the array (these coordinates are shown as (I,R,C) in the figures and are denoted with _irc variable suffixes in the code).

We will also load the annotation data provided by LUNA, which will give us a list of nodule coordinates, each with a malignancy flag, along with the series UID of the relevant CT scan. By combining the nodule coordinate with coordinate system transformation information, we get the index, row, and column of the voxel at the center of our nodule.
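The conversion from patient coordinates to array indices can be sketched as follows. This is a minimal illustration rather than the Ct class itself: the function name `xyz_to_irc`, the `IrcTuple` type, and the origin, voxel size, and direction values are assumptions chosen for the example; with real data they would come from the .mhd metadata.

```python
from collections import namedtuple

import numpy as np

IrcTuple = namedtuple("IrcTuple", ["index", "row", "col"])


def xyz_to_irc(center_xyz, origin_xyz, vx_size_xyz, direction_a):
    """Convert a patient-space (X, Y, Z) point in mm to integer (I, R, C) indices."""
    coord_a = np.array(center_xyz)
    origin_a = np.array(origin_xyz)
    vx_size_a = np.array(vx_size_xyz)
    # Shift to the array origin, undo the axis orientation, then scale mm -> voxels.
    cri_a = ((coord_a - origin_a) @ np.linalg.inv(direction_a)) / vx_size_a
    cri_a = np.round(cri_a)
    # (X, Y, Z) lines up with (column, row, index), so reverse the order.
    return IrcTuple(int(cri_a[2]), int(cri_a[1]), int(cri_a[0]))


# Hypothetical metadata: identity orientation, 0.5 mm pixels, 2 mm slice spacing.
center_irc = xyz_to_irc(
    center_xyz=(-50.0, -75.0, -200.0),
    origin_xyz=(-100.0, -100.0, -300.0),
    vx_size_xyz=(0.5, 0.5, 2.0),
    direction_a=np.eye(3),
)
print(center_irc)
```

Rounding to the nearest integer picks the voxel that contains the annotated center; the reversal at the end reflects that the array is indexed slice first, then row, then column.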

Using the (I,R,C) coordinates, we can crop a small 3D slice of our CT data to use as the input to our model. Along with this 3D sample array, we must construct the rest of our training sample tuple, which will have the sample array, nodule status flag, series UID, and the index of this sample in the CT's list of nodule candidates. This sample tuple is exactly what PyTorch expects from our Dataset subclass and represents the last section of our bridge from our original raw data to the standard structure of PyTorch tensors.
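The cropping step and the resulting sample tuple can be sketched with NumPy alone. The function name `crop_chunk`, the chunk width, and the synthetic volume are assumptions for illustration; the real Dataset would return PyTorch tensors rather than NumPy arrays, and the sketch assumes the volume is at least as large as the chunk along every axis.

```python
import numpy as np


def crop_chunk(volume_a, center_irc, width_irc):
    """Cut a width_irc-sized block out of volume_a, centered on center_irc."""
    slice_list = []
    for axis, center_val in enumerate(center_irc):
        start_ndx = int(round(center_val - width_irc[axis] / 2))
        end_ndx = start_ndx + width_irc[axis]
        # Shift the window back inside the volume if it runs off either edge.
        if start_ndx < 0:
            start_ndx, end_ndx = 0, width_irc[axis]
        if end_ndx > volume_a.shape[axis]:
            end_ndx = volume_a.shape[axis]
            start_ndx = end_ndx - width_irc[axis]
        slice_list.append(slice(start_ndx, end_ndx))
    return volume_a[tuple(slice_list)]


# Synthetic stand-in for a CT volume, indexed as (index, row, column).
volume_a = np.arange(40 * 64 * 64, dtype=np.float32).reshape(40, 64, 64)

# A center near two edges of the volume, to exercise the clamping.
chunk_a = crop_chunk(volume_a, center_irc=(2, 30, 60), width_irc=(8, 16, 16))

# Sample tuple: array (with a leading channel axis), nodule flag,
# series UID, and the candidate's position in the CT's candidate list.
sample_tup = (chunk_a[np.newaxis], True, "1.2.3", 0)
print(sample_tup[0].shape)
```

Clamping the window instead of padding keeps every chunk the same shape, which is what lets samples be stacked into batches later.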


Details of the ETL code (dsets.py) are described in the book Deep Learning with PyTorch (Eli Stevens, Luca Antiga, Thomas Viehmann).

In [1]:
%matplotlib inline
import numpy as np

In [2]:
from p2ch10.dsets import getCandidateInfoList, getCt, LunaDataset
candidateInfo_list = getCandidateInfoList(requireOnDisk_bool=False)
positiveInfo_list = [x for x in candidateInfo_list if x[0]]
diameter_list = [x[1] for x in positiveInfo_list]

Count the positive candidates (actual nodules) in the source data and show the first one:

In [3]:
print(len(positiveInfo_list))
print(positiveInfo_list[0])

1557
CandidateInfoTuple(isNodule_bool=True, diameter_mm=32.27003025, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.287966244644280690737019247886', center_xyz=(66.58022805, 82.56931698, -110.5421104))


We have a few very large candidates, starting at 32 mm, but they rapidly drop off to half that size. The bulk of the candidates are in the 4 to 10 mm range, and several hundred don’t have size information at all. This looks as expected; you might recall that we had more actual nodules than we had diameter annotations. 

In [4]:
for i in range(0, len(diameter_list), 100):
    print('{:4}  {:4.1f} mm'.format(i, diameter_list[i]))

   0  32.3 mm
 100  18.5 mm
 200  13.7 mm
 300  10.7 mm
 400   8.7 mm
 500   7.6 mm
 600   6.7 mm
 700   6.1 mm
 800   5.6 mm
 900   5.1 mm
1000   4.7 mm
1100   4.3 mm
1200   0.0 mm
1300   0.0 mm
1400   0.0 mm
1500   0.0 mm
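The output above is consistent with the list being ordered largest-first, with unmeasured nodules (0.0 mm) at the end. The following sketch shows one way such an ordering can arise and how the entries without size information could be counted. The candidates here are made up for illustration, and the tuple definition is an assumption modeled on the printed `CandidateInfoTuple` fields.

```python
from collections import namedtuple

CandidateInfoTuple = namedtuple(
    "CandidateInfoTuple", "isNodule_bool, diameter_mm, series_uid, center_xyz"
)

# Hypothetical candidates; the series UIDs and coordinates are made up.
candidate_list = [
    CandidateInfoTuple(False, 0.0, "1.2.3", (0.0, 0.0, 0.0)),
    CandidateInfoTuple(True, 4.2, "1.2.4", (1.0, 2.0, 3.0)),
    CandidateInfoTuple(True, 0.0, "1.2.5", (4.0, 5.0, 6.0)),
    CandidateInfoTuple(True, 18.5, "1.2.3", (7.0, 8.0, 9.0)),
]

# Tuples compare field by field, so a reverse sort puts nodules first,
# largest diameter first, and unmeasured (0.0 mm) nodules last among them.
candidate_list.sort(reverse=True)

positive_list = [c for c in candidate_list if c.isNodule_bool]
diameter_list = [c.diameter_mm for c in positive_list]

# A diameter of 0.0 marks a nodule with no size annotation.
missing_count = sum(1 for d in diameter_list if d == 0.0)
print(diameter_list, missing_count)
```

With an ordering like this, sampling every 100th entry, as the cell above does, gives a quick view of how diameters fall off across the list.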


Nodule information (the ten largest and ten smallest positive candidates, plus any whose series UID ends in 565):

In [5]:
for candidateInfo_tup in positiveInfo_list[:10]:
    print(candidateInfo_tup)
for candidateInfo_tup in positiveInfo_list[-10:]:
    print(candidateInfo_tup)
    
for candidateInfo_tup in positiveInfo_list:
    if candidateInfo_tup.series_uid.endswith('565'):
        print(candidateInfo_tup)

CandidateInfoTuple(isNodule_bool=True, diameter_mm=32.27003025, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.287966244644280690737019247886', center_xyz=(66.58022805, 82.56931698, -110.5421104))
CandidateInfoTuple(isNodule_bool=True, diameter_mm=30.61040636, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.112740418331256326754121315800', center_xyz=(48.23675256, 37.47721004, -98.64208784))
CandidateInfoTuple(isNodule_bool=True, diameter_mm=30.61040636, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.112740418331256326754121315800', center_xyz=(44.19, 37.79, -107.01))
CandidateInfoTuple(isNodule_bool=True, diameter_mm=30.61040636, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.112740418331256326754121315800', center_xyz=(40.69, 32.19, -97.15))
CandidateInfoTuple(isNodule_bool=True, diameter_mm=27.44242293, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.943403138251347598519939390311', center_xyz=(-45.29440163, 74.86925386, -97.52812481))
CandidateInfoTuple(isNodule_bool=True, diameter_mm=27.

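Histogram of nodule diameters (np.histogram returns the count per bin, followed by the bin edges in mm; the first bin includes the candidates with no recorded size, stored as 0.0 mm):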

In [6]:
np.histogram(diameter_list)

(array([384, 550, 281, 117,  78,  59,  42,  34,   8,   4]),
 array([ 0.        ,  3.22700302,  6.45400605,  9.68100907, 12.9080121 ,
        16.13501512, 19.36201815, 22.58902117, 25.8160242 , 29.04302722,
        32.27003025]))

In [7]:
from p2ch10.vis import findPositiveSamples, showCandidate
positiveSample_list = findPositiveSamples()

2025-09-19 01:30:50,759 INFO     pid:72685 p2ch10.dsets:182:__init__ <p2ch10.dsets.LunaDataset object at 0x7603f2024150>: 754975 training samples


0 CandidateInfoTuple(isNodule_bool=True, diameter_mm=15.13080337, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.897684031374557757145405000951', center_xyz=(-104.2535243, 142.7860645, -709.618551))
1 CandidateInfoTuple(isNodule_bool=True, diameter_mm=7.477744916, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.215640837032688688030770057224', center_xyz=(95.75543943, -23.23279864, -107.8593292))
2 CandidateInfoTuple(isNodule_bool=True, diameter_mm=0.0, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.163901773171373940247829492387', center_xyz=(-56.59, 33.44, -420.08))
3 CandidateInfoTuple(isNodule_bool=True, diameter_mm=4.206700455, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.282779922503707013097174625409', center_xyz=(82.64467593, -82.26416667, -162.6957639))
4 CandidateInfoTuple(isNodule_bool=True, diameter_mm=5.41336635, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.328944769569002417592093467626', center_xyz=(-91.7884345, -5.7590625, -138.8975))
5 CandidateInfoTuple(isNodule_bool=Tr

In [8]:
series_uid = positiveSample_list[11][2]
showCandidate(series_uid)

2025-09-19 01:30:57,360 INFO     pid:72685 p2ch10.dsets:182:__init__ <p2ch10.dsets.LunaDataset object at 0x7603dc402b90>: 853 training samples



1.3.6.1.4.1.14519.5.2.1.6279.6001.392861216720727557882279374324 475 False [475]


In [9]:
series_uid = '1.3.6.1.4.1.14519.5.2.1.6279.6001.124154461048929153767743874565'
showCandidate(series_uid)

2025-09-19 01:31:03,366 INFO     pid:72685 p2ch10.dsets:182:__init__ <p2ch10.dsets.LunaDataset object at 0x7603e633e8d0>: 1333 training samples



1.3.6.1.4.1.14519.5.2.1.6279.6001.124154461048929153767743874565 774 False [774]


In [10]:
series_uid = '1.3.6.1.4.1.14519.5.2.1.6279.6001.126264578931778258890371755354'
showCandidate(series_uid)

2025-09-19 01:31:09,453 INFO     pid:72685 p2ch10.dsets:182:__init__ <p2ch10.dsets.LunaDataset object at 0x7603eac4e8d0>: 1129 training samples



1.3.6.1.4.1.14519.5.2.1.6279.6001.126264578931778258890371755354 951 False [951]


Converting our data into a format usable by PyTorch:

In [11]:
LunaDataset()[0]

2025-09-19 01:31:14,609 INFO     pid:72685 p2ch10.dsets:182:__init__ <p2ch10.dsets.LunaDataset object at 0x7603dec8db10>: 754975 training samples


(tensor([[[[-884., -898., -902.,  ..., -160., -180., -190.],
           [-930., -914., -918.,  ..., -156., -166., -148.],
           [-886., -886., -870.,  ..., -106., -122., -116.],
           ...,
           [-846., -844., -844.,  ...,   56.,   48.,   82.],
           [-940., -932., -878.,  ...,  112.,   82.,   90.],
           [-844., -880., -870.,  ...,   56.,   22.,   -6.]],
 
          [[-826., -808., -822.,  ...,  -92., -122., -118.],
           [-888., -862., -860.,  ...,  -94., -120., -110.],
           [-888., -888., -856.,  ...,  -94., -110., -134.],
           ...,
           [-822., -844., -864.,  ...,   44.,   62.,   92.],
           [-842., -870., -838.,  ...,   88.,   52.,   74.],
           [-804., -848., -856.,  ...,   68.,   28.,   50.]],
 
          [[-836., -830., -850.,  ...,    4.,   -2.,  -10.],
           [-846., -860., -898.,  ...,  -22.,  -56.,  -50.],
           [-826., -852., -854.,  ...,  -24.,  -40.,  -56.],
           ...,
           [-568., -648., -752.

Some visualizations:

In [12]:
import numpy as np
import ipyvolume as ipv
V = np.zeros((128,128,128)) # our 3d array
# outer box
V[30:-30,30:-30,30:-30] = 0.75
V[35:-35,35:-35,35:-35] = 0.0
# inner box
V[50:-50,50:-50,50:-50] = 0.25
V[55:-55,55:-55,55:-55] = 0.0
ipv.quickvolshow(V, level=[0.25, 0.75], opacity=0.03, level_width=0.1, data_min=0, data_max=1)





In [13]:
ct = getCt(series_uid)
ipv.quickvolshow(ct.hu_a, level=[0.25, 0.5, 0.9], opacity=0.1, level_width=0.1, data_min=-1000, data_max=1000)


Note: the following visualization doesn't look very good. It's only included here for completeness.

In [14]:
import scipy.ndimage

def build2dLungMask(ct, mask_ndx, threshold_gcc=0.7):
    # threshold_gcc is a density in g/cc. If ct.hu_a holds raw Hounsfield units
    # (as the LunaDataset output above suggests), it would need converting.
    dense_mask = ct.hu_a[mask_ndx] > threshold_gcc
    # Close small gaps in the dense regions, then open to drop small specks.
    denoise_mask = scipy.ndimage.binary_closing(dense_mask, iterations=2)
    tissue_mask = scipy.ndimage.binary_opening(denoise_mask, iterations=10)
    # Filling holes in the tissue gives the whole body; the filled-in part is air.
    body_mask = scipy.ndimage.binary_fill_holes(tissue_mask)
    air_mask = scipy.ndimage.binary_fill_holes(body_mask & ~tissue_mask)

    # Grow the air region slightly so the lung mask includes its border.
    lung_mask = scipy.ndimage.binary_dilation(air_mask, iterations=2)

    return air_mask, lung_mask, dense_mask, denoise_mask, tissue_mask, body_mask


def build3dLungMask(ct):
    # np.bool was removed from NumPy; the built-in bool is the equivalent dtype.
    air_mask, lung_mask, dense_mask, denoise_mask, tissue_mask, body_mask = mask_list = \
        [np.zeros_like(ct.hu_a, dtype=bool) for _ in range(6)]

    # Build the masks one 2D slice at a time and copy them into the 3D arrays.
    for mask_ndx in range(ct.hu_a.shape[0]):
        for i, mask_ary in enumerate(build2dLungMask(ct, mask_ndx)):
            mask_list[i][mask_ndx] = mask_ary

    return air_mask, lung_mask, dense_mask, denoise_mask, tissue_mask, body_mask

In [15]:
from p2ch10.dsets import getCt
ct = getCt(series_uid)
air_mask, lung_mask, dense_mask, denoise_mask, tissue_mask, body_mask = build3dLungMask(ct)




In [16]:
bones = ct.hu_a * (ct.hu_a > 1.5)
lungs = ct.hu_a * air_mask
ipv.figure()
ipv.pylab.volshow(bones + lungs, level=[0.17, 0.17, 0.23], data_min=100, data_max=900)
ipv.show()
