# Unboxing CBIS-DDSM - A Structural Analysis

Before conducting the exploratory analyses, we'll unbox the data to ensure that the structure supports the analyses and that *we got what we paid for*. In the next section, we'll analyze data quality along the dimensions of validity, completeness, consistency, and uniqueness. The exploratory analysis that follows will then harvest insight from the data.

Our aim here is to get a general sense of the data *structure* and to make any structural changes necessary to facilitate the next stages of the analysis. First, we'll examine the case training sets, then the series metadata, and finally, we'll discover the structure of the DICOM metadata.

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))
from myst_nb import glue
import pandas as pd
import pydicom

fp_calc_train = "data/meta/0_raw/calc_case_description_train_set.csv"
fp_mass_train = "data/meta/0_raw/mass_case_description_train_set.csv"
fp_metadata = "data/meta/0_raw/metadata.csv"

## Case Data

### Calcification Training Cases

In [2]:
df = pd.read_csv(fp_calc_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1546 entries, 0 to 1545
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1546 non-null   object
 1   breast density           1546 non-null   int64 
 2   left or right breast     1546 non-null   object
 3   image view               1546 non-null   object
 4   abnormality id           1546 non-null   int64 
 5   abnormality type         1546 non-null   object
 6   calc type                1526 non-null   object
 7   calc distribution        1170 non-null   object
 8   assessment               1546 non-null   int64 
 9   pathology                1546 non-null   object
 10  subtlety                 1546 non-null   int64 
 11  image file path          1546 non-null   object
 12  cropped image file path  1546 non-null   object
 13  ROI mask file path       1546 non-null   object
dtypes: int64(4), object(10)
memory usage: 16

We have 1546 observations and 14 columns *(with spaces in the column names, which is somewhat vexing)* in the calcification training set. Let's take a look at the data.

In [3]:
df.sample(n=5, random_state=57)

Unnamed: 0,patient_id,breast density,left or right breast,image view,abnormality id,abnormality type,calc type,calc distribution,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
109,P_00112,3,RIGHT,CC,5,calcification,ROUND_AND_REGULAR-EGGSHELL,,2,BENIGN_WITHOUT_CALLBACK,3,Calc-Training_P_00112_RIGHT_CC/1.3.6.1.4.1.959...,Calc-Training_P_00112_RIGHT_CC_5/1.3.6.1.4.1.9...,Calc-Training_P_00112_RIGHT_CC_5/1.3.6.1.4.1.9...
1334,P_01740,2,RIGHT,MLO,2,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_01740_RIGHT_MLO_2/1.3.6.1.4.1....,Calc-Training_P_01740_RIGHT_MLO_2/1.3.6.1.4.1....
774,P_00991,2,LEFT,CC,1,calcification,LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,4,Calc-Training_P_00991_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_00991_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_00991_LEFT_CC_1/1.3.6.1.4.1.95...
937,P_01198,2,RIGHT,MLO,1,calcification,PLEOMORPHIC,CLUSTERED,5,MALIGNANT,5,Calc-Training_P_01198_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_01198_RIGHT_MLO_1/1.3.6.1.4.1....,Calc-Training_P_01198_RIGHT_MLO_1/1.3.6.1.4.1....
1459,P_02115,3,LEFT,CC,1,calcification,AMORPHOUS,CLUSTERED,0,BENIGN,3,Calc-Training_P_02115_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_02115_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_02115_LEFT_CC_1/1.3.6.1.4.1.95...


Ok, let's take a look at the mass training set.

### Mass Training Cases

In [4]:
df = pd.read_csv(fp_mass_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1318 entries, 0 to 1317
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1318 non-null   object
 1   breast_density           1318 non-null   int64 
 2   left or right breast     1318 non-null   object
 3   image view               1318 non-null   object
 4   abnormality id           1318 non-null   int64 
 5   abnormality type         1318 non-null   object
 6   mass shape               1314 non-null   object
 7   mass margins             1275 non-null   object
 8   assessment               1318 non-null   int64 
 9   pathology                1318 non-null   object
 10  subtlety                 1318 non-null   int64 
 11  image file path          1318 non-null   object
 12  cropped image file path  1318 non-null   object
 13  ROI mask file path       1318 non-null   object
dtypes: int64(4), object(10)
memory usage: 14

And the data...

In [5]:
df.sample(n=5, random_state=240)

Unnamed: 0,patient_id,breast_density,left or right breast,image view,abnormality id,abnormality type,mass shape,mass margins,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
669,P_00963,1,LEFT,CC,1,mass,FOCAL_ASYMMETRIC_DENSITY,ILL_DEFINED,3,BENIGN,5,Mass-Training_P_00963_LEFT_CC/1.3.6.1.4.1.9590...,Mass-Training_P_00963_LEFT_CC_1/1.3.6.1.4.1.95...,Mass-Training_P_00963_LEFT_CC_1/1.3.6.1.4.1.95...
475,P_00695,2,RIGHT,CC,1,mass,ASYMMETRIC_BREAST_TISSUE,ILL_DEFINED,5,MALIGNANT,5,Mass-Training_P_00695_RIGHT_CC/1.3.6.1.4.1.959...,Mass-Training_P_00695_RIGHT_CC_1/1.3.6.1.4.1.9...,Mass-Training_P_00695_RIGHT_CC_1/1.3.6.1.4.1.9...
326,P_00451,2,LEFT,CC,1,mass,LOBULATED,CIRCUMSCRIBED,3,BENIGN,5,Mass-Training_P_00451_LEFT_CC/1.3.6.1.4.1.9590...,Mass-Training_P_00451_LEFT_CC_1/1.3.6.1.4.1.95...,Mass-Training_P_00451_LEFT_CC_1/1.3.6.1.4.1.95...
276,P_00412,2,RIGHT,CC,1,mass,LOBULATED,CIRCUMSCRIBED,4,BENIGN,5,Mass-Training_P_00412_RIGHT_CC/1.3.6.1.4.1.959...,Mass-Training_P_00412_RIGHT_CC_1/1.3.6.1.4.1.9...,Mass-Training_P_00412_RIGHT_CC_1/1.3.6.1.4.1.9...
1123,P_01632,1,LEFT,MLO,2,mass,LYMPH_NODE,CIRCUMSCRIBED,3,BENIGN_WITHOUT_CALLBACK,2,Mass-Training_P_01632_LEFT_MLO/1.3.6.1.4.1.959...,Mass-Training_P_01632_LEFT_MLO_2/1.3.6.1.4.1.9...,Mass-Training_P_01632_LEFT_MLO_2/1.3.6.1.4.1.9...
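
Both case files carry spaces in most column names, and the two outputs above disagree on the density column (`breast density` in the calcification file versus `breast_density` in the mass file). One way to make the two files line up is to convert every column name to snake case. The following is a minimal sketch, not necessarily the approach taken later in this analysis; the helper names `to_snake_case` and `normalize_columns` are hypothetical, and the toy frames only borrow a few column names from the `df.info()` output.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", name.strip()).lower()


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every column name converted to snake case."""
    return df.rename(columns=to_snake_case)


# Toy, empty frames whose column names are copied from the df.info() output.
calc = pd.DataFrame(columns=["patient_id", "breast density", "ROI mask file path"])
mass = pd.DataFrame(columns=["patient_id", "breast_density", "ROI mask file path"])

calc_clean = normalize_columns(calc)
mass_clean = normalize_columns(mass)

# After normalization the two files agree on their shared column names.
print(list(calc_clean.columns))
print(list(mass_clean.columns))
```

Because `rename` accepts a callable, the same function can be applied to any of the case files without listing the columns by hand.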


Ok, let's inspect the series metadata.

## Series Metadata

The series metadata contains study and series information for the DICOM files that were downloaded from The Cancer Imaging Archive (TCIA).

In [6]:
df = pd.read_csv(fp_metadata)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6775 entries, 0 to 6774
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   series_uid            6775 non-null   object
 1   collection            6775 non-null   object
 2   data_description_uri  6775 non-null   object
 3   subject_id            6775 non-null   object
 4   study_uid             6775 non-null   object
 5   study_date            6775 non-null   object
 6   series_description    6775 non-null   object
 7   modality              6775 non-null   object
 8   sop_class_name        6775 non-null   object
 9   sop_class_uid         6775 non-null   object
 10  number_of_images      6775 non-null   int64 
 11  file_size             6775 non-null   object
 12  file_location         6775 non-null   object
 13  download_timestamp    6775 non-null   object
dtypes: int64(1), object(13)
memory usage: 741.1+ KB


In [7]:
df.sample(n=5, random_state=55)

Unnamed: 0,series_uid,collection,data_description_uri,subject_id,study_uid,study_date,series_description,modality,sop_class_name,sop_class_uid,number_of_images,file_size,file_location,download_timestamp
3832,1.3.6.1.4.1.9590.100.1.2.734848824118740414325...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Training_P_01423_RIGHT_MLO_1,1.3.6.1.4.1.9590.100.1.2.180257759711372363920...,07-20-2016,cropped images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,184.00 KB,./CBIS-DDSM/Mass-Training_P_01423_RIGHT_MLO_1/...,2023-05-24T05:37:09.688
3117,1.3.6.1.4.1.9590.100.1.2.392713810013832904138...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Training_P_00436_LEFT_CC,1.3.6.1.4.1.9590.100.1.2.392237990116430150037...,07-20-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,30.40 MB,./CBIS-DDSM/Mass-Training_P_00436_LEFT_CC/07-2...,2023-05-24T05:06:14.737
743,1.3.6.1.4.1.9590.100.1.2.278345840012295347342...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_00400_RIGHT_MLO,1.3.6.1.4.1.9590.100.1.2.844948427113636906383...,08-07-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,28.68 MB,./CBIS-DDSM/Calc-Training_P_00400_RIGHT_MLO/08...,2023-05-24T03:21:57.331
4265,1.3.6.1.4.1.9590.100.1.2.158167540111402988020...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Training_P_01755_RIGHT_MLO,1.3.6.1.4.1.9590.100.1.2.352110018811446066523...,07-20-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,44.86 MB,./CBIS-DDSM/Mass-Training_P_01755_RIGHT_MLO/07...,2023-05-24T05:54:59.539
5835,1.3.6.1.4.1.9590.100.1.2.194043791119034007186...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01169_LEFT_CC_2,1.3.6.1.4.1.9590.100.1.2.188656945510799710515...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.76 MB,./CBIS-DDSM/Calc-Training_P_01169_LEFT_CC_2/09...,2023-05-24T06:58:48.001


All right! Now for the DICOM image metadata.

## DICOM Image Metadata

In [8]:
fp = "data/image/0_raw/Mass-Training_P_01981_RIGHT_CC/07-20-2016-DDSM-94258/1.000000-full mammogram images-07312/1-1.dcm"
pydicom.dcmread(fp)

FileNotFoundError: [Errno 2] No such file or directory: 'data/image/0_raw/Mass-Training_P_01981_RIGHT_CC/07-20-2016-DDSM-94258/1.000000-full mammogram images-07312/1-1.dcm'
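
The read above fails because the hard-coded path does not exist in the current environment. A more defensive approach is to discover the DICOM files that are actually present and read only the header of the first one. The following is a rough sketch under assumptions: the helper names `find_dicom_files` and `read_dicom_header` are hypothetical, and the root directory `data/image/0_raw` is taken from the path in the failing cell.

```python
from pathlib import Path


def find_dicom_files(root):
    """Return a sorted list of all .dcm files under root, or [] if root is missing."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.dcm"))


def read_dicom_header(path):
    """Read DICOM metadata only, skipping the (large) pixel data."""
    import pydicom  # imported lazily so the file search works without pydicom

    return pydicom.dcmread(path, stop_before_pixels=True)


# Assumed root directory, taken from the path used in the cell above.
files = find_dicom_files("data/image/0_raw")
print(f"Found {len(files)} DICOM files")
if files:
    print(read_dicom_header(files[0]))
```

Passing `stop_before_pixels=True` keeps the read fast, which matters here because the series metadata reports full mammogram files in the tens of megabytes.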

## Summary of Structural Concerns

1. The data are currently split into train and test sets by BI-RADS category. While convenient for modeling, this is rather cumbersome for exploratory data analysis, which considers the entire dataset before making any assumptions. Bringing the calcification and mass train and test sets together with the associated full mammogram image metadata into a single dataset will therefore facilitate a top-down, comprehensive view of all the data.
2. The pathology variable, from which we derive our binary classification target, can take one of three values: 'MALIGNANT', 'BENIGN', and 'BENIGN_WITHOUT_CALLBACK'. The last indicates that some aspect of the case is worth tracking, but no follow-up is required. For our purposes, that is a distinction without a difference. Nonetheless, rather than modifying the variable and losing potentially useful information, we'll create a new target variable, 'cancer', which will be True if the pathology is 'MALIGNANT' and False otherwise.
3. The CBIS-DDSM creators define a case as a particular abnormality as seen in the craniocaudal (CC) or mediolateral oblique (MLO) image view, yet there is no formal identifier for a unique *mammogram*. A mammogram is defined here as a CC or MLO image of the left or right breast taken for a specific patient. Such a mammogram identifier would allow us to bring the case data and the *full* mammogram imaging data together into a single dataset. ROI masks and cropped images are abnormality-level constructs and can remain as such.
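
Items 2 and 3 above can be sketched in a few lines of pandas. This is an illustration under assumptions rather than the final implementation: the function name `add_target_and_mammogram_id`, the column name `mammogram_id`, and its `patient_side_view` format are hypothetical, while the input column names and sample values come from the case files shown earlier.

```python
import pandas as pd


def add_target_and_mammogram_id(cases: pd.DataFrame) -> pd.DataFrame:
    """Add a boolean 'cancer' target and a hypothetical 'mammogram_id' key."""
    out = cases.copy()
    # True only for MALIGNANT; BENIGN and BENIGN_WITHOUT_CALLBACK both map to
    # False, and the original pathology column is left untouched.
    out["cancer"] = out["pathology"].eq("MALIGNANT")
    # One mammogram = one patient, one breast side, one image view.
    # The underscore-joined format is an assumption made for illustration.
    out["mammogram_id"] = (
        out["patient_id"]
        + "_"
        + out["left or right breast"]
        + "_"
        + out["image view"]
    )
    return out


# A few rows with values copied from the calcification sample shown earlier.
cases = pd.DataFrame(
    {
        "patient_id": ["P_00112", "P_01198", "P_02115"],
        "left or right breast": ["RIGHT", "RIGHT", "LEFT"],
        "image view": ["CC", "MLO", "CC"],
        "pathology": ["BENIGN_WITHOUT_CALLBACK", "MALIGNANT", "BENIGN"],
    }
)

labeled = add_target_and_mammogram_id(cases)
print(labeled[["mammogram_id", "pathology", "cancer"]])
```

Several abnormalities on the same image share one `mammogram_id`, which is what lets abnormality-level cases join to image-level metadata.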

Overall, this argues for a single dataset containing only the information relevant to the analyses and modeling tasks. In the next section, a single task-specific dataset will be harvested from the CBIS-DDSM data.