# CBIS-DDSM - Unbox

Before conducting data quality and exploratory analyses, we'll unbox the data to ensure that, well... *we got what we paid for*. In the next section, we'll analyze data quality along the dimensions of validity, completeness, consistency, and uniqueness. The exploratory analysis is designed to harvest insight from the data. Our aim here is to get a general sense of the data *structure* and to make any structural changes necessary to facilitate the next stages of the analysis. First, we'll examine the case training sets, then the series metadata, and finally the structure of the DICOM metadata.

In [1]:
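import os

import pandas as pd
from myst_nb import glue

# When running from inside the Jupyter Book folder, move up to the project
# root so that the relative data paths below resolve.
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath("../../.."))

# Raw case descriptions and series metadata
fp_calc_train = "data/meta/0_raw/calc_case_description_train_set.csv"
fp_mass_train = "data/meta/0_raw/mass_case_description_train_set.csv"
fp_metadata = "data/meta/0_raw/metadata.csv"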

## Case Data
### Calcification Training Cases

In [2]:
df = pd.read_csv(fp_calc_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1546 entries, 0 to 1545
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1546 non-null   object
 1   breast density           1546 non-null   int64 
 2   left or right breast     1546 non-null   object
 3   image view               1546 non-null   object
 4   abnormality id           1546 non-null   int64 
 5   abnormality type         1546 non-null   object
 6   calc type                1526 non-null   object
 7   calc distribution        1170 non-null   object
 8   assessment               1546 non-null   int64 
 9   pathology                1546 non-null   object
 10  subtlety                 1546 non-null   int64 
 11  image file path          1546 non-null   object
 12  cropped image file path  1546 non-null   object
 13  ROI mask file path       1546 non-null   object
dtypes: int64(4), object(10)
memory usage: 16

We have 1546 observations and 14 columns *(with spaces in the column names, which is somewhat vexing)* in the calcification training set. Let's take a look at the data.

In [3]:
calc_train_sample = df.sample(n=5, random_state=57)
glue('calc_train_sample', calc_train_sample)

Unnamed: 0,patient_id,breast density,left or right breast,image view,abnormality id,abnormality type,calc type,calc distribution,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
109,P_00112,3,RIGHT,CC,5,calcification,ROUND_AND_REGULAR-EGGSHELL,,2,BENIGN_WITHOUT_CALLBACK,3,Calc-Training_P_00112_RIGHT_CC/1.3.6.1.4.1.959...,Calc-Training_P_00112_RIGHT_CC_5/1.3.6.1.4.1.9...,Calc-Training_P_00112_RIGHT_CC_5/1.3.6.1.4.1.9...
1334,P_01740,2,RIGHT,MLO,2,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_01740_RIGHT_MLO_2/1.3.6.1.4.1....,Calc-Training_P_01740_RIGHT_MLO_2/1.3.6.1.4.1....
774,P_00991,2,LEFT,CC,1,calcification,LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,4,Calc-Training_P_00991_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_00991_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_00991_LEFT_CC_1/1.3.6.1.4.1.95...
937,P_01198,2,RIGHT,MLO,1,calcification,PLEOMORPHIC,CLUSTERED,5,MALIGNANT,5,Calc-Training_P_01198_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_01198_RIGHT_MLO_1/1.3.6.1.4.1....,Calc-Training_P_01198_RIGHT_MLO_1/1.3.6.1.4.1....
1459,P_02115,3,LEFT,CC,1,calcification,AMORPHOUS,CLUSTERED,0,BENIGN,3,Calc-Training_P_02115_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_02115_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_02115_LEFT_CC_1/1.3.6.1.4.1.95...


```{glue:figure} calc_train_sample
---
align: center
name: calc_train_sample_fig
---
Calcification Training Set Samples
```

Ok, let's take a look at the mass training set.

### Mass Training Cases

In [4]:
df = pd.read_csv(fp_mass_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1318 entries, 0 to 1317
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1318 non-null   object
 1   breast_density           1318 non-null   int64 
 2   left or right breast     1318 non-null   object
 3   image view               1318 non-null   object
 4   abnormality id           1318 non-null   int64 
 5   abnormality type         1318 non-null   object
 6   mass shape               1314 non-null   object
 7   mass margins             1275 non-null   object
 8   assessment               1318 non-null   int64 
 9   pathology                1318 non-null   object
 10  subtlety                 1318 non-null   int64 
 11  image file path          1318 non-null   object
 12  cropped image file path  1318 non-null   object
 13  ROI mask file path       1318 non-null   object
dtypes: int64(4), object(10)
memory usage: 14

And the data...

In [5]:
mass_train_sample = df.sample(n=5, random_state=240)
glue('mass_train_sample', mass_train_sample)

Unnamed: 0,patient_id,breast_density,left or right breast,image view,abnormality id,abnormality type,mass shape,mass margins,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
669,P_00963,1,LEFT,CC,1,mass,FOCAL_ASYMMETRIC_DENSITY,ILL_DEFINED,3,BENIGN,5,Mass-Training_P_00963_LEFT_CC/1.3.6.1.4.1.9590...,Mass-Training_P_00963_LEFT_CC_1/1.3.6.1.4.1.95...,Mass-Training_P_00963_LEFT_CC_1/1.3.6.1.4.1.95...
475,P_00695,2,RIGHT,CC,1,mass,ASYMMETRIC_BREAST_TISSUE,ILL_DEFINED,5,MALIGNANT,5,Mass-Training_P_00695_RIGHT_CC/1.3.6.1.4.1.959...,Mass-Training_P_00695_RIGHT_CC_1/1.3.6.1.4.1.9...,Mass-Training_P_00695_RIGHT_CC_1/1.3.6.1.4.1.9...
326,P_00451,2,LEFT,CC,1,mass,LOBULATED,CIRCUMSCRIBED,3,BENIGN,5,Mass-Training_P_00451_LEFT_CC/1.3.6.1.4.1.9590...,Mass-Training_P_00451_LEFT_CC_1/1.3.6.1.4.1.95...,Mass-Training_P_00451_LEFT_CC_1/1.3.6.1.4.1.95...
276,P_00412,2,RIGHT,CC,1,mass,LOBULATED,CIRCUMSCRIBED,4,BENIGN,5,Mass-Training_P_00412_RIGHT_CC/1.3.6.1.4.1.959...,Mass-Training_P_00412_RIGHT_CC_1/1.3.6.1.4.1.9...,Mass-Training_P_00412_RIGHT_CC_1/1.3.6.1.4.1.9...
1123,P_01632,1,LEFT,MLO,2,mass,LYMPH_NODE,CIRCUMSCRIBED,3,BENIGN_WITHOUT_CALLBACK,2,Mass-Training_P_01632_LEFT_MLO/1.3.6.1.4.1.959...,Mass-Training_P_01632_LEFT_MLO_2/1.3.6.1.4.1.9...,Mass-Training_P_01632_LEFT_MLO_2/1.3.6.1.4.1.9...


```{glue:figure} mass_train_sample
---
align: center
name: mass_train_sample_fig
---
Mass Training Set Samples
```

Ok, a few things stand out thus far.

1. The case datasets have inconsistent variable names. For instance, the calcification set has 'breast density', whereas the mass set has 'breast_density'. Our first task is to ensure that the variables are consistently named across case files.
2. The data are currently split into train and test sets by BI-RADS category. While convenient for modeling, this is rather cumbersome for exploratory data analysis, which considers the entire dataset before making any assumptions. Our second task is to combine the calcification and mass train and test sets into a single case dataset, facilitating a top-down, comprehensive view of all the data.
3. The target variable, pathology, has three values: 'MALIGNANT', 'BENIGN', and 'BENIGN_WITHOUT_CALLBACK'. The latter indicates that some aspect of the case is worth tracking, but no follow-up is required. For our purposes, that is a distinction without a difference. Nonetheless, rather than modifying the variable and losing potentially useful information, we'll create a new target variable, 'cancer', which will be True if the pathology is 'MALIGNANT', and False otherwise.
4. The CBIS-DDSM creators define a case as a particular abnormality as seen in the craniocaudal (CC) or mediolateral oblique (MLO) image view, yet the dataset lacks a unique case identifier. Distinguishing cases therefore involves combining the patient_id, abnormality type, abnormality id, the file set, left or right breast, and image view. An alternative is to parse the file paths for study_uid and series_uid in order to map the case to an image.
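
The tasks above can be sketched roughly as follows. This is a minimal illustration on two toy rows taken from the samples shown earlier, not the project's actual pipeline. The helper names `standardize_columns` and `build_cases`, the `fileset` column, and the `case_id` format are all assumptions introduced here.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out


def build_cases(labeled_frames) -> pd.DataFrame:
    """Combine (fileset_label, frame) pairs into a single case frame.

    Adds a boolean `cancer` target and a hypothetical `case_id`.
    """
    parts = [
        standardize_columns(df).assign(fileset=label)
        for label, df in labeled_frames
    ]
    cases = pd.concat(parts, ignore_index=True)
    # True only for MALIGNANT; the original pathology column is kept intact.
    cases["cancer"] = cases["pathology"] == "MALIGNANT"
    # Assumed identifier: type, file set, patient, side, view, abnormality id.
    cases["case_id"] = cases["abnormality_type"].str.cat(
        [
            cases["fileset"],
            cases["patient_id"],
            cases["left_or_right_breast"],
            cases["image_view"],
            cases["abnormality_id"].astype(str),
        ],
        sep="_",
    )
    return cases


# Toy rows: note 'breast density' versus 'breast_density'.
calc = pd.DataFrame({
    "patient_id": ["P_00112"],
    "breast density": [3],
    "left or right breast": ["RIGHT"],
    "image view": ["CC"],
    "abnormality id": [5],
    "abnormality type": ["calcification"],
    "pathology": ["BENIGN_WITHOUT_CALLBACK"],
})
mass = pd.DataFrame({
    "patient_id": ["P_00695"],
    "breast_density": [2],
    "left or right breast": ["RIGHT"],
    "image view": ["CC"],
    "abnormality id": [1],
    "abnormality type": ["mass"],
    "pathology": ["MALIGNANT"],
})

cases = build_cases([("train", calc), ("train", mass)])
print(cases[["case_id", "breast_density", "pathology", "cancer"]])
```

Keeping `pathology` alongside the derived `cancer` flag preserves the three-way distinction in case it proves useful later.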

## Series Metadata
The series metadata contains study and series information for the DICOM files that were downloaded from the TCIA.

In [6]:
df = pd.read_csv(fp_metadata)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6775 entries, 0 to 6774
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   series_uid            6775 non-null   object
 1   collection            6775 non-null   object
 2   data_description_uri  6775 non-null   object
 3   subject_id            6775 non-null   object
 4   study_uid             6775 non-null   object
 5   study_date            6775 non-null   object
 6   series_description    6775 non-null   object
 7   modality              6775 non-null   object
 8   sop_class_name        6775 non-null   object
 9   sop_class_uid         6775 non-null   object
 10  number_of_images      6775 non-null   int64 
 11  file_size             6775 non-null   object
 12  file_location         6775 non-null   object
 13  download_timestamp    6775 non-null   object
dtypes: int64(1), object(13)
memory usage: 741.1+ KB


In [7]:
series_data = df.sample(n=5, random_state=55)
full_mammogram_image_count_per_location = df.loc[df['series_description'] == 'full mammogram images']['number_of_images'].max()
cropped_image_count_per_location = df.loc[df['series_description'] == 'cropped images']['number_of_images'].max()
roi_image_count_per_location =  df.loc[df['series_description'] == 'ROI mask images']['number_of_images'].max()

glue('full_mammogram_image_count_per_location', full_mammogram_image_count_per_location)
glue('cropped_image_count_per_location', cropped_image_count_per_location)
glue('roi_image_count_per_location', roi_image_count_per_location)
glue('series_data', series_data)

1

2

2

Unnamed: 0,series_uid,collection,data_description_uri,subject_id,study_uid,study_date,series_description,modality,sop_class_name,sop_class_uid,number_of_images,file_size,file_location,download_timestamp
3832,1.3.6.1.4.1.9590.100.1.2.734848824118740414325...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Training_P_01423_RIGHT_MLO_1,1.3.6.1.4.1.9590.100.1.2.180257759711372363920...,07-20-2016,cropped images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,184.00 KB,./CBIS-DDSM/Mass-Training_P_01423_RIGHT_MLO_1/...,2023-05-24T05:37:09.688
3117,1.3.6.1.4.1.9590.100.1.2.392713810013832904138...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Training_P_00436_LEFT_CC,1.3.6.1.4.1.9590.100.1.2.392237990116430150037...,07-20-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,30.40 MB,./CBIS-DDSM/Mass-Training_P_00436_LEFT_CC/07-2...,2023-05-24T05:06:14.737
743,1.3.6.1.4.1.9590.100.1.2.278345840012295347342...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_00400_RIGHT_MLO,1.3.6.1.4.1.9590.100.1.2.844948427113636906383...,08-07-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,28.68 MB,./CBIS-DDSM/Calc-Training_P_00400_RIGHT_MLO/08...,2023-05-24T03:21:57.331
4265,1.3.6.1.4.1.9590.100.1.2.158167540111402988020...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Training_P_01755_RIGHT_MLO,1.3.6.1.4.1.9590.100.1.2.352110018811446066523...,07-20-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,44.86 MB,./CBIS-DDSM/Mass-Training_P_01755_RIGHT_MLO/07...,2023-05-24T05:54:59.539
5835,1.3.6.1.4.1.9590.100.1.2.194043791119034007186...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01169_LEFT_CC_2,1.3.6.1.4.1.9590.100.1.2.188656945510799710515...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.76 MB,./CBIS-DDSM/Calc-Training_P_01169_LEFT_CC_2/09...,2023-05-24T06:58:48.001


```{glue:figure} series_data
---
align: center
name: series_data_fig
---
Series Metadata
```

The most important variables in this dataset are:
- series_uid
- subject_id
- study_uid
- series_description
- number_of_images
- file_location

Still, one observation can be made at this stage: each full mammogram file location has {glue:}`full_mammogram_image_count_per_location` image, but each ROI and cropped image folder may contain up to two images. Since an observation in this dataset can pertain to one or more images, the series metadata alone cannot uniquely identify a cropped or ROI image.
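
The per-folder image counts can also be summarized for every series type in one pass rather than one maximum at a time. The following is a minimal sketch on toy rows that mirror the sample above; the variable names `series` and `summary` are illustrative assumptions.

```python
import pandas as pd

# Toy rows mirroring the series metadata sample (description and image count only).
series = pd.DataFrame({
    "series_description": [
        "full mammogram images",
        "cropped images",
        "ROI mask images",
        "ROI mask images",
    ],
    "number_of_images": [1, 1, 2, 2],
})

# Smallest and largest image count per folder, and number of folders, by series type.
summary = (
    series.groupby("series_description")["number_of_images"]
    .agg(["min", "max", "count"])
)
print(summary)
```

Any series type whose `max` exceeds 1 has folders in which a single metadata row stands for several image files.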

Let's examine a specific patient and see how it is represented in both datasets. Patient P_01740 has six case records in the calcification training set. Let's check the case dataset first.

In [8]:
df_calc = pd.read_csv(fp_calc_train)
df_calc_P_01740 = df_calc.loc[df_calc['patient_id'] == 'P_01740']
df_calc_P_01740

Unnamed: 0,patient_id,breast density,left or right breast,image view,abnormality id,abnormality type,calc type,calc distribution,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
1329,P_01740,2,LEFT,CC,1,calcification,LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_01740_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_01740_LEFT_CC_1/1.3.6.1.4.1.95...
1330,P_01740,2,LEFT,MLO,1,calcification,LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_LEFT_MLO/1.3.6.1.4.1.959...,Calc-Training_P_01740_LEFT_MLO_1/1.3.6.1.4.1.9...,Calc-Training_P_01740_LEFT_MLO_1/1.3.6.1.4.1.9...
1331,P_01740,2,RIGHT,CC,1,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_RIGHT_CC/1.3.6.1.4.1.959...,Calc-Training_P_01740_RIGHT_CC_1/1.3.6.1.4.1.9...,Calc-Training_P_01740_RIGHT_CC_1/1.3.6.1.4.1.9...
1332,P_01740,2,RIGHT,CC,2,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_RIGHT_CC/1.3.6.1.4.1.959...,Calc-Training_P_01740_RIGHT_CC_2/1.3.6.1.4.1.9...,Calc-Training_P_01740_RIGHT_CC_2/1.3.6.1.4.1.9...
1333,P_01740,2,RIGHT,MLO,1,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_01740_RIGHT_MLO_1/1.3.6.1.4.1....,Calc-Training_P_01740_RIGHT_MLO_1/1.3.6.1.4.1....
1334,P_01740,2,RIGHT,MLO,2,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Training_P_01740_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_01740_RIGHT_MLO_2/1.3.6.1.4.1....,Calc-Training_P_01740_RIGHT_MLO_2/1.3.6.1.4.1....


In [9]:
df_series = pd.read_csv(fp_metadata)
df_series_P_01740 = df_series.loc[df_series['subject_id'].str.contains('P_01740')]
df_series_P_01740

Unnamed: 0,series_uid,collection,data_description_uri,subject_id,study_uid,study_date,series_description,modality,sop_class_name,sop_class_uid,number_of_images,file_size,file_location,download_timestamp
568,1.3.6.1.4.1.9590.100.1.2.417708848712227437940...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_LEFT_MLO,1.3.6.1.4.1.9590.100.1.2.934444141132745063086...,08-07-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,45.21 MB,./CBIS-DDSM/Calc-Training_P_01740_LEFT_MLO/08-...,2023-05-24T03:13:40.461
1295,1.3.6.1.4.1.9590.100.1.2.252648962912720348140...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_RIGHT_CC_2,1.3.6.1.4.1.9590.100.1.2.379747141211120868725...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.50 MB,./CBIS-DDSM/Calc-Training_P_01740_RIGHT_CC_2/0...,2023-05-24T03:48:00.924
1576,1.3.6.1.4.1.9590.100.1.2.115213958111012888340...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_RIGHT_CC_1,1.3.6.1.4.1.9590.100.1.2.125151893012444441822...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.50 MB,./CBIS-DDSM/Calc-Training_P_01740_RIGHT_CC_1/0...,2023-05-24T04:00:45.901
1918,1.3.6.1.4.1.9590.100.1.2.229498429511355728428...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_LEFT_MLO_1,1.3.6.1.4.1.9590.100.1.2.204043526411094826833...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.63 MB,./CBIS-DDSM/Calc-Training_P_01740_LEFT_MLO_1/0...,2023-05-24T04:15:10.026
2495,1.3.6.1.4.1.9590.100.1.2.413602028113120193338...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_RIGHT_MLO,1.3.6.1.4.1.9590.100.1.2.336832061149881173841...,08-07-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,44.81 MB,./CBIS-DDSM/Calc-Training_P_01740_RIGHT_MLO/08...,2023-05-24T04:40:56.234
4785,1.3.6.1.4.1.9590.100.1.2.425705758910687622722...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_RIGHT_MLO_2,1.3.6.1.4.1.9590.100.1.2.111855809012015466120...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.43 MB,./CBIS-DDSM/Calc-Training_P_01740_RIGHT_MLO_2/...,2023-05-24T06:15:17.659
5947,1.3.6.1.4.1.9590.100.1.2.228757952612837468335...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_RIGHT_CC,1.3.6.1.4.1.9590.100.1.2.271503225413812867817...,08-07-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,44.97 MB,./CBIS-DDSM/Calc-Training_P_01740_RIGHT_CC/08-...,2023-05-24T07:03:51.649
6073,1.3.6.1.4.1.9590.100.1.2.556168133105659130077...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_RIGHT_MLO_1,1.3.6.1.4.1.9590.100.1.2.304765073511251056040...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.43 MB,./CBIS-DDSM/Calc-Training_P_01740_RIGHT_MLO_1/...,2023-05-24T07:08:46.998
6224,1.3.6.1.4.1.9590.100.1.2.366865405125940293312...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_LEFT_CC,1.3.6.1.4.1.9590.100.1.2.242214869511793817517...,08-07-2016,full mammogram images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,1,43.87 MB,./CBIS-DDSM/Calc-Training_P_01740_LEFT_CC/08-0...,2023-05-24T07:15:18.558
6395,1.3.6.1.4.1.9590.100.1.2.188753084611465434733...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_01740_LEFT_CC_1,1.3.6.1.4.1.9590.100.1.2.280124183610569107510...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,21.95 MB,./CBIS-DDSM/Calc-Training_P_01740_LEFT_CC_1/09...,2023-05-24T07:23:05.273
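
Comparing the two outputs suggests that the leading folder of each case file path matches a `subject_id` in the series metadata. A join on that basis might look like the sketch below. The helper name `link_cases_to_series` is an assumption, and the file paths are placeholders in which only the leading folder comes from the data above; this is an illustration, not the project's confirmed approach.

```python
import pandas as pd


def link_cases_to_series(
    cases: pd.DataFrame, series: pd.DataFrame, path_col: str
) -> pd.DataFrame:
    """Left-join cases to series rows whose subject_id equals the path's top folder."""
    keyed = cases.assign(subject_id=cases[path_col].str.split("/").str[0])
    return keyed.merge(series, on="subject_id", how="left")


cases = pd.DataFrame({
    "patient_id": ["P_01740", "P_01740"],
    "abnormality id": [1, 2],
    # Placeholder paths: everything after the first folder is hypothetical.
    "ROI mask file path": [
        "Calc-Training_P_01740_RIGHT_CC_1/<study_uid>/<series_uid>/<file>",
        "Calc-Training_P_01740_RIGHT_CC_2/<study_uid>/<series_uid>/<file>",
    ],
})
series = pd.DataFrame({
    "subject_id": [
        "Calc-Training_P_01740_RIGHT_CC_2",
        "Calc-Training_P_01740_RIGHT_CC_1",
        "Calc-Training_P_01740_RIGHT_CC",
    ],
    "series_description": [
        "ROI mask images",
        "ROI mask images",
        "full mammogram images",
    ],
    "number_of_images": [2, 2, 1],
})

linked = link_cases_to_series(cases, series, "ROI mask file path")
print(linked[["abnormality id", "subject_id", "series_description", "number_of_images"]])
```

If a `subject_id` turns out to have more than one series, the join returns one row per match, so the result should be checked for duplicated cases before it is relied upon.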
