# CBIS-DDSM - Unbox

Before conducting data quality and exploratory analyses, we unbox the data to ensure that, well... *we got what we paid for*. Our aim here is to get a general sense of the structure of the data and to make any changes that will facilitate the next stages of the analysis.

First, we'll examine the case training and test sets, then the series metadata, and finally, we'll extract data from a few DICOM datasets.


In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))
from myst_nb import glue
import pandas as pd

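# Case description files. The test file names are assumed to follow the
# same naming pattern as the train files.
fp_calc_train = "data/meta/0_raw/calc_case_description_train_set.csv"
fp_calc_test = "data/meta/0_raw/calc_case_description_test_set.csv"
fp_mass_train = "data/meta/0_raw/mass_case_description_train_set.csv"
fp_mass_test = "data/meta/0_raw/mass_case_description_test_set.csv"
fp_metadata = "data/meta/0_raw/metadata.csv"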

## Calcification Cases

### Calcification Train Cases

We expect a total of 602 cases, of which 329 are benign and 273 are malignant.

In [2]:
df = pd.read_csv(fp_calc_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1546 entries, 0 to 1545
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1546 non-null   object
 1   breast density           1546 non-null   int64 
 2   left or right breast     1546 non-null   object
 3   image view               1546 non-null   object
 4   abnormality id           1546 non-null   int64 
 5   abnormality type         1546 non-null   object
 6   calc type                1526 non-null   object
 7   calc distribution        1170 non-null   object
 8   assessment               1546 non-null   int64 
 9   pathology                1546 non-null   object
 10  subtlety                 1546 non-null   int64 
 11  image file path          1546 non-null   object
 12  cropped image file path  1546 non-null   object
 13  ROI mask file path       1546 non-null   object
dtypes: int64(4), object(10)
memory usage: 16

We have 1546 observations and 14 columns *(with spaces in the column names, which is somewhat vexing)* in the calcification training set.
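The spaces in the column names can be dealt with early. The following is a minimal sketch of one way to do it; `normalize_columns` is a hypothetical helper, not part of the original notebook, and the toy frame only borrows three column names from the `df.info()` output above.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return out


# Toy frame with no rows; only the column names matter here.
toy = pd.DataFrame(columns=["patient_id", "breast density", "ROI mask file path"])
cleaned = normalize_columns(toy)
print(list(cleaned.columns))
# ['patient_id', 'breast_density', 'roi_mask_file_path']
```

Returning a copy leaves the raw frame untouched, so the original column names remain available for comparison.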

In [3]:
# Total record count
n_records = len(df)

# Sample
sample = df.head()

# Number of cases
n_cases =  df['patient_id'].nunique()

# Pathologies and patient counts.
pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='pathology').count().reset_index()
pathologies.columns = ['Pathology', 'Count']
pathologies.loc['Total'] = pathologies.sum()
pathologies.loc[pathologies.index[-1], 'Pathology'] = ""

# Number of patients with multiple abnormalities and diagnoses
case_pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='patient_id').count()
n_cases_multiple_pathologies = len(case_pathologies.loc[case_pathologies['pathology']==2])

glue('calc_train_sample', sample)
glue('calc_train_n_records', n_records)
glue('calc_train_pathologies', pathologies)
glue('calc_train_n_cases', n_cases)
glue('calc_train_n_cases_multiple_pathologies', n_cases_multiple_pathologies)

Unnamed: 0,patient_id,breast density,left or right breast,image view,abnormality id,abnormality type,calc type,calc distribution,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
0,P_00005,3,RIGHT,CC,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,Calc-Training_P_00005_RIGHT_CC/1.3.6.1.4.1.959...,Calc-Training_P_00005_RIGHT_CC_1/1.3.6.1.4.1.9...,Calc-Training_P_00005_RIGHT_CC_1/1.3.6.1.4.1.9...
1,P_00005,3,RIGHT,MLO,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,Calc-Training_P_00005_RIGHT_MLO/1.3.6.1.4.1.95...,Calc-Training_P_00005_RIGHT_MLO_1/1.3.6.1.4.1....,Calc-Training_P_00005_RIGHT_MLO_1/1.3.6.1.4.1....
2,P_00007,4,LEFT,CC,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,Calc-Training_P_00007_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_00007_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_00007_LEFT_CC_1/1.3.6.1.4.1.95...
3,P_00007,4,LEFT,MLO,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,Calc-Training_P_00007_LEFT_MLO/1.3.6.1.4.1.959...,Calc-Training_P_00007_LEFT_MLO_1/1.3.6.1.4.1.9...,Calc-Training_P_00007_LEFT_MLO_1/1.3.6.1.4.1.9...
4,P_00008,1,LEFT,CC,1,calcification,,REGIONAL,2,BENIGN_WITHOUT_CALLBACK,3,Calc-Training_P_00008_LEFT_CC/1.3.6.1.4.1.9590...,Calc-Training_P_00008_LEFT_CC_1/1.3.6.1.4.1.95...,Calc-Training_P_00008_LEFT_CC_1/1.3.6.1.4.1.95...


1546

Unnamed: 0,Pathology,Count
0,BENIGN,258
1,BENIGN_WITHOUT_CALLBACK,82
2,MALIGNANT,276
Total,,616


602

14

Of the {glue:}`calc_train_n_records` observations, there are {glue:}`calc_train_n_cases` unique cases in the calcification training set. Let's summarize the counts by diagnosis.

```{glue:figure} calc_train_pathologies
---
align: center
name: calc_train_pathologies_fig
---
Patient Count by Pathology
```

{numref}`calc_train_pathologies_fig` shows that {glue:}`calc_train_n_cases_multiple_pathologies` patients have two pathologies, bringing our total to 616 diagnoses. Let's take a look at the data.


```{glue:figure} calc_train_sample
---
align: center
name: calc_train_sample_fig
---
Calcification Train Data Sample
```

We have some missing values (calcification types and distributions) and an *unexpected* third classification in our binary classification problem. Let's check the calcification test data.
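To put numbers on the missing values, a small sketch along these lines could be used. `missing_summary` is a hypothetical helper and the toy rows are made up for illustration; only the column names come from the data above.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": n_missing,
            "pct_missing": (100 * n_missing / len(df)).round(1),
        }
    )
    return summary.loc[summary["n_missing"] > 0]


# Made-up rows, not taken from the dataset.
toy = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_2", "P_3", "P_4"],
        "calc type": ["AMORPHOUS", None, "PLEOMORPHIC", "VASCULAR"],
        "calc distribution": ["CLUSTERED", None, None, "LINEAR"],
    }
)
result = missing_summary(toy)
print(result)
```

Applied to the calcification training frame, the same function would report only the columns with at least one missing value.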

### Calcification Test Cases
The calcification test dataset should contain a total of 151 cases, of which 85 are benign, and 66 are malignant.

In [4]:
df = pd.read_csv(fp_calc_test)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326 entries, 0 to 325
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               326 non-null    object
 1   breast density           326 non-null    int64 
 2   left or right breast     326 non-null    object
 3   image view               326 non-null    object
 4   abnormality id           326 non-null    int64 
 5   abnormality type         326 non-null    object
 6   calc type                322 non-null    object
 7   calc distribution        263 non-null    object
 8   assessment               326 non-null    int64 
 9   pathology                326 non-null    object
 10  subtlety                 326 non-null    int64 
 11  image file path          326 non-null    object
 12  cropped image file path  326 non-null    object
 13  ROI mask file path       326 non-null    object
dtypes: int64(4), object(10)
memory usage: 35.8

In [5]:
# Total record count
n_records = len(df)

# Sample
sample = df.head()

# Number of cases
n_cases =  df['patient_id'].nunique()

# Pathologies and patient counts.
pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='pathology').count().reset_index()
pathologies.columns = ['Pathology', 'Count']
pathologies.loc['Total'] = pathologies.sum()
pathologies.loc[pathologies.index[-1], 'Pathology'] = ""

# Number of patients with multiple abnormalities and diagnoses
case_pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='patient_id').count()
n_cases_multiple_pathologies = len(case_pathologies.loc[case_pathologies['pathology']==2])

glue('calc_test_sample', sample)
glue('calc_test_n_records', n_records)
glue('calc_test_pathologies', pathologies)
glue('calc_test_n_cases', n_cases)
glue('calc_test_n_cases_multiple_pathologies', n_cases_multiple_pathologies)


Unnamed: 0,patient_id,breast density,left or right breast,image view,abnormality id,abnormality type,calc type,calc distribution,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
0,P_00038,2,LEFT,CC,1,calcification,PUNCTATE-PLEOMORPHIC,CLUSTERED,4,BENIGN,2,Calc-Test_P_00038_LEFT_CC/1.3.6.1.4.1.9590.100...,Calc-Test_P_00038_LEFT_CC_1/1.3.6.1.4.1.9590.1...,Calc-Test_P_00038_LEFT_CC_1/1.3.6.1.4.1.9590.1...
1,P_00038,2,LEFT,MLO,1,calcification,PUNCTATE-PLEOMORPHIC,CLUSTERED,4,BENIGN,2,Calc-Test_P_00038_LEFT_MLO/1.3.6.1.4.1.9590.10...,Calc-Test_P_00038_LEFT_MLO_1/1.3.6.1.4.1.9590....,Calc-Test_P_00038_LEFT_MLO_1/1.3.6.1.4.1.9590....
2,P_00038,2,RIGHT,CC,1,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Test_P_00038_RIGHT_CC/1.3.6.1.4.1.9590.10...,Calc-Test_P_00038_RIGHT_CC_1/1.3.6.1.4.1.9590....,Calc-Test_P_00038_RIGHT_CC_1/1.3.6.1.4.1.9590....
3,P_00038,2,RIGHT,CC,2,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Test_P_00038_RIGHT_CC/1.3.6.1.4.1.9590.10...,Calc-Test_P_00038_RIGHT_CC_2/1.3.6.1.4.1.9590....,Calc-Test_P_00038_RIGHT_CC_2/1.3.6.1.4.1.9590....
4,P_00038,2,RIGHT,MLO,1,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,5,Calc-Test_P_00038_RIGHT_MLO/1.3.6.1.4.1.9590.1...,Calc-Test_P_00038_RIGHT_MLO_1/1.3.6.1.4.1.9590...,Calc-Test_P_00038_RIGHT_MLO_1/1.3.6.1.4.1.9590...


326

Unnamed: 0,Pathology,Count
0,BENIGN,70
1,BENIGN_WITHOUT_CALLBACK,17
2,MALIGNANT,67
Total,,154


151

3

Of the {glue:}`calc_test_n_records` observations, there are {glue:}`calc_test_n_cases` unique cases in the calcification testing set. 

```{glue:figure} calc_test_pathologies
---
align: center
name: calc_test_pathologies_fig
---
Patient Count by Pathology
```

Similarly, {numref}`calc_test_pathologies_fig` reveals that {glue:}`calc_test_n_cases_multiple_pathologies` patients have multiple (two) abnormalities with different diagnoses. Let's inspect the data.

```{glue:figure} calc_test_sample
---
align: center
name: calc_test_sample_fig
---
Calcification Test Data Sample
```

Approximately 1% of the calcification types and 19% of the calcification distributions are missing. Let's examine the mass case data.

## Mass Cases

### Mass Train Cases

We expect a total of 691 cases, of which 355 are benign and 336 are malignant.

In [6]:
df = pd.read_csv(fp_mass_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1318 entries, 0 to 1317
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1318 non-null   object
 1   breast_density           1318 non-null   int64 
 2   left or right breast     1318 non-null   object
 3   image view               1318 non-null   object
 4   abnormality id           1318 non-null   int64 
 5   abnormality type         1318 non-null   object
 6   mass shape               1314 non-null   object
 7   mass margins             1275 non-null   object
 8   assessment               1318 non-null   int64 
 9   pathology                1318 non-null   object
 10  subtlety                 1318 non-null   int64 
 11  image file path          1318 non-null   object
 12  cropped image file path  1318 non-null   object
 13  ROI mask file path       1318 non-null   object
dtypes: int64(4), object(10)
memory usage: 14

In [7]:
# Total record count
n_records = len(df)

# Sample
sample = df.head()

# Number of cases
n_cases =  df['patient_id'].nunique()

# Pathologies and patient counts.
pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='pathology').count().reset_index()
pathologies.columns = ['Pathology', 'Count']
pathologies.loc['Total'] = pathologies.sum()
pathologies.loc[pathologies.index[-1], 'Pathology'] = ""

# Number of patients with multiple abnormalities and diagnoses
case_pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='patient_id').count()
n_cases_multiple_pathologies = len(case_pathologies.loc[case_pathologies['pathology']==2])

glue('mass_train_sample', sample)
glue('mass_train_n_records', n_records)
glue('mass_train_pathologies', pathologies)
glue('mass_train_n_cases', n_cases)
glue('mass_train_n_cases_multiple_pathologies', n_cases_multiple_pathologies)


Unnamed: 0,patient_id,breast_density,left or right breast,image view,abnormality id,abnormality type,mass shape,mass margins,assessment,pathology,subtlety,image file path,cropped image file path,ROI mask file path
0,P_00001,3,LEFT,CC,1,mass,IRREGULAR-ARCHITECTURAL_DISTORTION,SPICULATED,4,MALIGNANT,4,Mass-Training_P_00001_LEFT_CC/1.3.6.1.4.1.9590...,Mass-Training_P_00001_LEFT_CC_1/1.3.6.1.4.1.95...,Mass-Training_P_00001_LEFT_CC_1/1.3.6.1.4.1.95...
1,P_00001,3,LEFT,MLO,1,mass,IRREGULAR-ARCHITECTURAL_DISTORTION,SPICULATED,4,MALIGNANT,4,Mass-Training_P_00001_LEFT_MLO/1.3.6.1.4.1.959...,Mass-Training_P_00001_LEFT_MLO_1/1.3.6.1.4.1.9...,Mass-Training_P_00001_LEFT_MLO_1/1.3.6.1.4.1.9...
2,P_00004,3,LEFT,CC,1,mass,ARCHITECTURAL_DISTORTION,ILL_DEFINED,4,BENIGN,3,Mass-Training_P_00004_LEFT_CC/1.3.6.1.4.1.9590...,Mass-Training_P_00004_LEFT_CC_1/1.3.6.1.4.1.95...,Mass-Training_P_00004_LEFT_CC_1/1.3.6.1.4.1.95...
3,P_00004,3,LEFT,MLO,1,mass,ARCHITECTURAL_DISTORTION,ILL_DEFINED,4,BENIGN,3,Mass-Training_P_00004_LEFT_MLO/1.3.6.1.4.1.959...,Mass-Training_P_00004_LEFT_MLO_1/1.3.6.1.4.1.9...,Mass-Training_P_00004_LEFT_MLO_1/1.3.6.1.4.1.9...
4,P_00004,3,RIGHT,MLO,1,mass,OVAL,CIRCUMSCRIBED,4,BENIGN,5,Mass-Training_P_00004_RIGHT_MLO/1.3.6.1.4.1.95...,Mass-Training_P_00004_RIGHT_MLO_1/1.3.6.1.4.1....,Mass-Training_P_00004_RIGHT_MLO_1/1.3.6.1.4.1....


1318

Unnamed: 0,Pathology,Count
0,BENIGN,304
1,BENIGN_WITHOUT_CALLBACK,54
2,MALIGNANT,346
Total,,704


691

13

Of the {glue:}`mass_train_n_records` observations, there are {glue:}`mass_train_n_cases` unique cases in the mass training set. The following summarizes the pathologies.

```{glue:figure} mass_train_pathologies
---
align: center
name: mass_train_pathologies_fig
---
Patient Count by Pathology
```

{numref}`mass_train_pathologies_fig` indicates that there are {glue:}`mass_train_n_cases_multiple_pathologies` patients with two pathologies, bringing our total to 704 diagnoses.
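The summary cell is repeated almost verbatim for each case file. As a minimal sketch, the same counts could be wrapped in a helper. `summarize_cases` is a hypothetical name and the toy rows are made up. Note one deliberate difference: the cells above count patients with exactly two pathologies, whereas this version counts patients with more than one.

```python
import pandas as pd


def summarize_cases(df: pd.DataFrame) -> dict:
    """Record, patient, and diagnosis counts for one case description frame."""
    # One row per distinct (patient, pathology) pair.
    pairs = df[["patient_id", "pathology"]].drop_duplicates()
    per_pathology = pairs.groupby("pathology")["patient_id"].count()
    per_patient = pairs.groupby("patient_id")["pathology"].nunique()
    return {
        "n_records": len(df),
        "n_cases": df["patient_id"].nunique(),
        "patients_by_pathology": per_pathology,
        "n_diagnoses": int(per_pathology.sum()),
        "n_cases_multiple_pathologies": int((per_patient > 1).sum()),
    }


# Made-up rows: P_2 has both a benign and a malignant abnormality.
toy = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_1", "P_2", "P_2", "P_3"],
        "pathology": ["MALIGNANT", "MALIGNANT", "BENIGN", "MALIGNANT", "BENIGN"],
    }
)
summary = summarize_cases(toy)
print(summary["n_cases"], summary["n_diagnoses"], summary["n_cases_multiple_pathologies"])
# 3 4 1
```

The number of diagnoses equals the number of unique patients plus one for each extra pathology a patient carries, which is the arithmetic behind the totals quoted above.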

```{glue:figure} mass_train_sample
---
align: center
name: mass_train_sample_fig
---
Mass Training Data Sample
```

We are missing less than 1% of the mass shapes and about 3% of the mass margins. Let's check the test set.


### Mass Test Cases

In [11]:
df = pd.read_csv(fp_mass_test)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378 entries, 0 to 377
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               378 non-null    object
 1   breast_density           378 non-null    int64 
 2   left or right breast     378 non-null    object
 3   image view               378 non-null    object
 4   abnormality id           378 non-null    int64 
 5   abnormality type         378 non-null    object
 6   mass shape               378 non-null    object
 7   mass margins             361 non-null    object
 8   assessment               378 non-null    int64 
 9   pathology                378 non-null    object
 10  subtlety                 378 non-null    int64 
 11  image file path          378 non-null    object
 12  cropped image file path  378 non-null    object
 13  ROI mask file path       378 non-null    object
dtypes: int64(4), object(10)
memory usage: 41.5

In [9]:
# Total record count
n_records = len(df)

# Sample
sample = df.head()

# Number of cases
n_cases =  df['patient_id'].nunique()

# Pathologies and patient counts.
pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='pathology').count().reset_index()
pathologies.columns = ['Pathology', 'Count']
pathologies.loc['Total'] = pathologies.sum()
pathologies.loc[pathologies.index[-1], 'Pathology'] = ""

# Number of patients with multiple abnormalities and diagnoses
case_pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='patient_id').count()
n_cases_multiple_pathologies = len(case_pathologies.loc[case_pathologies['pathology']==2])

glue('mass_test_sample', sample)
glue('mass_test_n_records', n_records)
glue('mass_test_pathologies', pathologies)
glue('mass_test_n_cases', n_cases)
glue('mass_test_n_cases_multiple_pathologies', n_cases_multiple_pathologies)




Of the {glue:}`mass_test_n_records` observations, there are {glue:}`mass_test_n_cases` unique cases in the mass testing set. 

```{glue:figure} mass_test_pathologies
---
align: center
name: mass_test_pathologies_fig
---
Patient Count by Pathology
```

Here, {numref}`mass_test_pathologies_fig` shows that {glue:}`mass_test_n_cases_multiple_pathologies` patients have two abnormalities with different pathologies.

```{glue:figure} mass_test_sample
---
align: center
name: mass_test_sample_fig
---
Mass Test Data Sample
```

About 4% of the mass margins are missing from the test set. Next, we'll examine the series metadata.

## Series Metadata

In [10]:
df = pd.read_csv(fp_metadata)
df.info()

sample = df.head()
glue('series_sample', sample)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6775 entries, 0 to 6774
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   series_uid            6775 non-null   object
 1   collection            6775 non-null   object
 2   data_description_uri  6775 non-null   object
 3   subject_id            6775 non-null   object
 4   study_uid             6775 non-null   object
 5   study_date            6775 non-null   object
 6   series_description    6775 non-null   object
 7   modality              6775 non-null   object
 8   sop_class_name        6775 non-null   object
 9   sop_class_uid         6775 non-null   object
 10  number_of_images      6775 non-null   int64 
 11  file_size             6775 non-null   object
 12  file_location         6775 non-null   object
 13  download_timestamp    6775 non-null   object
dtypes: int64(1), object(13)
memory usage: 741.1+ KB


```{glue:figure} series_sample
---
align: center
name: series_sample_fig
---
Series Metadata Sample
```

Here, we have the series and study information for each set of images. The file locations indicate folders containing full mammogram, ROI, and cropped images. 
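A quick way to see how the series break down by image type is to group on `series_description`. This is only a sketch: the rows below are made up, and the description values are assumed to name the three image types mentioned above, so the actual strings in `metadata.csv` may differ.

```python
import pandas as pd

# Made-up rows; the series_description values are assumptions.
toy = pd.DataFrame(
    {
        "series_description": [
            "full mammogram images",
            "ROI mask images",
            "cropped images",
            "ROI mask images",
        ],
        "number_of_images": [1, 2, 1, 1],
    }
)

# Number of series and total number of images per image type.
by_type = (
    toy.groupby("series_description")["number_of_images"]
    .agg(n_series="count", n_images="sum")
    .sort_values("n_series", ascending=False)
)
print(by_type)
```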

Next, we unbox the DICOM images. While doing so, we'll extract statistics on the pixel values for the exploratory analysis.

A few things stand out thus far.

1. About 1% of the calcification types and 24% of the calcification distributions are missing.
2. We have a third *unexpected* pathology for our binary classification problem: 'BENIGN_WITHOUT_CALLBACK'. Benign without callback implies that a physician marked some aspect of the case as worth tracking, but did not require further follow-up {cite}`DDSMOverviewVolumea`. Changing these diagnoses to 'BENIGN' at this stage risks losing information that may explain other anomalies or missing values in the dataset. Rather than risk information loss, we'll add a new binary feature to indicate malignant (True) or benign (False).
3. A quick check revealed that the file paths in the case datasets are not consistent with the data downloaded from TCIA. However, they contain study and series UIDs corresponding to DICOM image metadata, which may help link cases to images.
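Items 2 and 3 can be sketched as follows. Everything here is illustrative: `cancer` is a hypothetical column name, `split_case_path` is a hypothetical helper, the UIDs and file names are made up, and the path layout `<folder>/<study uid>/<series uid>/<file name>` is an assumption that should be checked against the real paths before relying on it.

```python
import pandas as pd


def split_case_path(path: str) -> dict:
    """Split a case file path into folder, study UID, series UID, and file name.

    ASSUMPTION: paths look like '<folder>/<study uid>/<series uid>/<file name>'.
    """
    parts = path.strip().split("/")
    if len(parts) != 4:
        raise ValueError(f"Unexpected path layout: {path!r}")
    folder, study_uid, series_uid, filename = parts
    return {
        "folder": folder,
        "study_uid": study_uid,
        "series_uid": series_uid,
        "filename": filename,
    }


# Made-up rows; the UIDs and file names are placeholders, not real values.
toy = pd.DataFrame(
    {
        "pathology": ["MALIGNANT", "BENIGN", "BENIGN_WITHOUT_CALLBACK"],
        "image file path": [
            "Calc-Training_P_00005_RIGHT_CC/1.2.3/4.5.6/000000.dcm",
            "Calc-Training_P_00007_LEFT_CC/1.2.4/4.5.7/000000.dcm",
            "Calc-Training_P_00008_LEFT_CC/1.2.5/4.5.8/000000.dcm\n",
        ],
    }
)

# Item 2: binary target that keeps the original three-level pathology intact.
toy["cancer"] = toy["pathology"].eq("MALIGNANT")

# Item 3: pull the UIDs out of the path so they can be matched to the metadata.
uids = pd.DataFrame(
    [split_case_path(p) for p in toy["image file path"]], index=toy.index
)
toy = toy.join(uids[["study_uid", "series_uid"]])
print(toy[["pathology", "cancer", "series_uid"]])
```

Keeping `pathology` alongside the new flag means 'BENIGN_WITHOUT_CALLBACK' stays available for later analysis, and the extracted `series_uid` could serve as a join key against the `series_uid` column in the series metadata.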

The following summarizes the pathologies.
