# CBIS-DDSM Data Preparation
Downstream data quality and exploratory analyses require data that are relevant to the study, free of duplication and anomalies, referentially consistent, and free of transitive dependencies, in a format, type, and structure that make analyses efficient, inferences correct, and insights impactful. In this section, we condition the data and optimize its structure, organization, and format to support downstream exploratory data analysis.

This exercise will produce four datasets essential to the exploratory effort:
1. Case Description: Mass and calcification clinical data with cross-references to DICOM series.
2. Master Case: Contains all cases at series level. 
3. Series Metadata: Series data mapped to file locations. 
4. DICOM Metadata: DICOM image properties such as bit depth, width, height, and pixel value ranges.


In [1]:
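import os
# When the notebook runs from inside the Jupyter Book directory, move up two
# levels (assumed to be the project root) so the relative data paths resolve.
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath("../.."))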

from bcd.data.case.prep import CasePrep, MasterCasePrep
from bcd.data.series.prep import SeriesPrep
from bcd.data.dicom.prep import DicomPrep

## Case Description
Several data preparation tasks are performed on the mass and calcification case descriptive datasets. 

1. **Combined Case Datasets**: The mass and calcification cases are split into separate training and test sets to support modeling and consistent evaluation of diagnostic methods. Exploratory data analysis, by contrast, examines the *entire* dataset before any assumptions are made about the data. We combine the training and test sets for each BI-RADS category (calcification and mass) so that we can see both the forest and the trees.
2. **Series UID, not File Paths**: The case description datasets contain incorrect file paths for the full, cropped, and ROI mask images. However, each file path includes a Series UID, which can be mapped to the correct file location via the series metadata. Here, we replace the file paths with the series UIDs they contain.
3. **Benign is Benign**: Case descriptions have a pathology of 'BENIGN', 'MALIGNANT', or 'BENIGN WITHOUT CALLBACK'. The latter means that the region is suspicious and should be monitored, but no further investigation is required. This distinction isn't relevant to this study; hence, 'BENIGN WITHOUT CALLBACK' will be replaced by 'BENIGN'.
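
The three tasks above can be sketched on toy data as follows. This is an illustration, not the `CasePrep` implementation: the `combine_cases` helper, the `image_file_path` column name, the placeholder UIDs, and the assumption that the series UID is the second-to-last component of each path are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the raw train/test case description files.
# Paths are assumed to look like "<subject>/<study_uid>/<series_uid>/<file>.dcm".
train = pd.DataFrame({
    "patient_id": ["P_00005", "P_00007"],
    "pathology": ["MALIGNANT", "BENIGN WITHOUT CALLBACK"],
    "image_file_path": [
        "Calc-Training_P_00005_RIGHT_CC/1.2.3/4.5.6/000000.dcm",
        "Calc-Training_P_00007_LEFT_CC/1.2.7/4.5.8/000000.dcm",
    ],
})
test = pd.DataFrame({
    "patient_id": ["P_00038"],
    "pathology": ["BENIGN"],
    "image_file_path": ["Calc-Test_P_00038_LEFT_CC/1.2.9/4.5.10/000000.dcm"],
})


def combine_cases(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Combine train and test cases, swap file paths for series UIDs, merge benign labels."""
    # 1. Combine the splits, remembering which split each row came from.
    df = pd.concat(
        [train.assign(dataset="train"), test.assign(dataset="test")],
        ignore_index=True,
    )
    # 2. Keep only the series UID embedded in the (unreliable) file path.
    df["image_series_uid"] = df["image_file_path"].str.strip().str.split("/").str[-2]
    df = df.drop(columns=["image_file_path"])
    # 3. Fold the callback-free benign label into 'BENIGN'. Both spellings are
    #    handled because the raw files may use underscores (an assumption).
    df["pathology"] = df["pathology"].replace(
        {"BENIGN WITHOUT CALLBACK": "BENIGN", "BENIGN_WITHOUT_CALLBACK": "BENIGN"}
    )
    return df


cases = combine_cases(train, test)
print(cases)
```

The same extraction would apply to the ROI mask and cropped image path columns; only one column is shown to keep the sketch short.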

In [2]:
# Calc Filepaths
CALC_TRAIN_FP = "data/raw/meta/calc_case_description_train_set.csv"
CALC_TEST_FP = "data/raw/meta/calc_case_description_test_set.csv"
CALC_FP = "data/staged/calc_cases.csv"

# Mass Filepaths
MASS_TRAIN_FP = "data/raw/meta/mass_case_description_train_set.csv"
MASS_TEST_FP = "data/raw/meta/mass_case_description_test_set.csv"
MASS_FP = "data/staged/mass_cases.csv"

In [3]:
case = CasePrep()

### Calcification Cases

In [4]:
dfc = case.prep(train_fp=CALC_TRAIN_FP, test_fp=CALC_TEST_FP, cases_fp=CALC_FP, result=True)
dfc.info()
dfc.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1872 entries, 0 to 1871
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   patient_id                1872 non-null   object
 1   breast_density            1872 non-null   int64 
 2   left_or_right_breast      1872 non-null   object
 3   image_view                1872 non-null   object
 4   abnormality_id            1872 non-null   int64 
 5   abnormality_type          1872 non-null   object
 6   calc_type                 1848 non-null   object
 7   calc_distribution         1433 non-null   object
 8   assessment                1872 non-null   int64 
 9   pathology                 1872 non-null   object
 10  subtlety                  1872 non-null   int64 
 11  dataset                   1872 non-null   object
 12  image_series_uid          1872 non-null   object
 13  roi_mask_series_uid       1872 non-null   object
 14  cropped_image_series_uid

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,dataset,image_series_uid,roi_mask_series_uid,cropped_image_series_uid
0,P_00005,3,RIGHT,CC,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,train,1.3.6.1.4.1.9590.100.1.2.474143160103683865197...,1.3.6.1.4.1.9590.100.1.2.393344010211719049419...,1.3.6.1.4.1.9590.100.1.2.393344010211719049419...
1,P_00005,3,RIGHT,MLO,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,train,1.3.6.1.4.1.9590.100.1.2.250596608311207922527...,1.3.6.1.4.1.9590.100.1.2.296281207812130400303...,1.3.6.1.4.1.9590.100.1.2.296281207812130400303...
2,P_00007,4,LEFT,CC,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,train,1.3.6.1.4.1.9590.100.1.2.228699627313487111012...,1.3.6.1.4.1.9590.100.1.2.314135871111943890422...,1.3.6.1.4.1.9590.100.1.2.314135871111943890422...
3,P_00007,4,LEFT,MLO,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,train,1.3.6.1.4.1.9590.100.1.2.104743410411133110629...,1.3.6.1.4.1.9590.100.1.2.914582796124855152034...,1.3.6.1.4.1.9590.100.1.2.914582796124855152034...
4,P_00008,1,LEFT,CC,1,calcification,,REGIONAL,2,BENIGN,3,train,1.3.6.1.4.1.9590.100.1.2.406725628213826290127...,1.3.6.1.4.1.9590.100.1.2.281397494612871934937...,1.3.6.1.4.1.9590.100.1.2.281397494612871934937...


### Mass Cases

In [5]:
dfm = case.prep(train_fp=MASS_TRAIN_FP, test_fp=MASS_TEST_FP, cases_fp=MASS_FP, result=True)
dfm.info()
dfm.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1696 entries, 0 to 1695
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   patient_id                1696 non-null   object
 1   breast_density            1696 non-null   int64 
 2   left_or_right_breast      1696 non-null   object
 3   image_view                1696 non-null   object
 4   abnormality_id            1696 non-null   int64 
 5   abnormality_type          1696 non-null   object
 6   mass_shape                1692 non-null   object
 7   mass_margins              1636 non-null   object
 8   assessment                1696 non-null   int64 
 9   pathology                 1696 non-null   object
 10  subtlety                  1696 non-null   int64 
 11  dataset                   1696 non-null   object
 12  image_series_uid          1696 non-null   object
 13  roi_mask_series_uid       1696 non-null   object
 14  cropped_image_series_uid

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,mass_shape,mass_margins,assessment,pathology,subtlety,dataset,image_series_uid,roi_mask_series_uid,cropped_image_series_uid
0,P_00001,3,LEFT,CC,1,mass,IRREGULAR-ARCHITECTURAL_DISTORTION,SPICULATED,4,MALIGNANT,4,train,1.3.6.1.4.1.9590.100.1.2.342386194811267636608...,1.3.6.1.4.1.9590.100.1.2.296736403313792599626...,1.3.6.1.4.1.9590.100.1.2.296736403313792599626...
1,P_00001,3,LEFT,MLO,1,mass,IRREGULAR-ARCHITECTURAL_DISTORTION,SPICULATED,4,MALIGNANT,4,train,1.3.6.1.4.1.9590.100.1.2.359308329312397897125...,1.3.6.1.4.1.9590.100.1.2.227955274711225756835...,1.3.6.1.4.1.9590.100.1.2.227955274711225756835...
2,P_00004,3,LEFT,CC,1,mass,ARCHITECTURAL_DISTORTION,ILL_DEFINED,4,BENIGN,3,train,1.3.6.1.4.1.9590.100.1.2.891800462110225318343...,1.3.6.1.4.1.9590.100.1.2.429120414011832984817...,1.3.6.1.4.1.9590.100.1.2.429120414011832984817...
3,P_00004,3,LEFT,MLO,1,mass,ARCHITECTURAL_DISTORTION,ILL_DEFINED,4,BENIGN,3,train,1.3.6.1.4.1.9590.100.1.2.295360926313492745441...,1.3.6.1.4.1.9590.100.1.2.115134232113001553100...,1.3.6.1.4.1.9590.100.1.2.115134232113001553100...
4,P_00004,3,RIGHT,MLO,1,mass,OVAL,CIRCUMSCRIBED,4,BENIGN,5,train,1.3.6.1.4.1.9590.100.1.2.410524754913057908920...,1.3.6.1.4.1.9590.100.1.2.128573663126257012032...,1.3.6.1.4.1.9590.100.1.2.128573663126257012032...


## Master Case
This dataset will map all cases to their associated image, ROI mask, and cropped image series identifiers.
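
One way to arrive at a series-level table is to reshape the three series UID columns into a single `series_uid` column. The sketch below is an assumption about how that could be done, not the `MasterCasePrep` implementation; `to_series_level` is a hypothetical helper and the UIDs are shortened placeholders. The toy data deliberately repeats a UID across two columns to show the de-duplication step; whether the real data does so is not confirmed here.

```python
import pandas as pd

UID_COLUMNS = ["image_series_uid", "roi_mask_series_uid", "cropped_image_series_uid"]

# Toy case table using the series UID column names from the case datasets.
cases = pd.DataFrame({
    "patient_id": ["P_00001", "P_00004"],
    "abnormality_type": ["mass", "mass"],
    "image_series_uid": ["uid.full.1", "uid.full.2"],
    "roi_mask_series_uid": ["uid.roi.1", "uid.roi.2"],
    "cropped_image_series_uid": ["uid.roi.1", "uid.roi.2"],
})


def to_series_level(cases: pd.DataFrame) -> pd.DataFrame:
    """Reshape one row per case into one row per (case, series UID)."""
    id_vars = [c for c in cases.columns if c not in UID_COLUMNS]
    long = cases.melt(
        id_vars=id_vars,
        value_vars=UID_COLUMNS,
        var_name="uid_source",
        value_name="series_uid",
    )
    # A case may reference the same series from more than one column.
    long = long.drop_duplicates(subset=id_vars + ["series_uid"])
    return long.drop(columns="uid_source").reset_index(drop=True)


master = to_series_level(cases)
print(master)
```

A series description could then be attached by joining on `series_uid` against the series metadata.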

In [6]:
MASTER_FP = "data/staged/master_cases.csv"

In [7]:
case = MasterCasePrep()
dfM = case.prep(mass_fp=MASS_FP, calc_fp=CALC_FP, master_fp=MASTER_FP, result=True)
dfM.info()
dfM.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6775 entries, 0 to 6774
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   patient_id            6775 non-null   object
 1   breast_density        6775 non-null   int64 
 2   left_or_right_breast  6775 non-null   object
 3   image_view            6775 non-null   object
 4   abnormality_id        6775 non-null   int64 
 5   abnormality_type      6775 non-null   object
 6   mass_shape            3384 non-null   object
 7   mass_margins          3280 non-null   object
 8   assessment            6775 non-null   int64 
 9   pathology             6775 non-null   object
 10  subtlety              6775 non-null   int64 
 11  dataset               6775 non-null   object
 12  series_uid            6775 non-null   object
 13  calc_type             3348 non-null   object
 14  calc_distribution     2704 non-null   object
 15  series_description    6775 non-null   

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,mass_shape,mass_margins,assessment,pathology,subtlety,dataset,series_uid,calc_type,calc_distribution,series_description
0,P_00001,3,LEFT,CC,1,mass,IRREGULAR-ARCHITECTURAL_DISTORTION,SPICULATED,4,MALIGNANT,4,train,1.3.6.1.4.1.9590.100.1.2.342386194811267636608...,,,full mammogram images
1,P_00001,3,LEFT,MLO,1,mass,IRREGULAR-ARCHITECTURAL_DISTORTION,SPICULATED,4,MALIGNANT,4,train,1.3.6.1.4.1.9590.100.1.2.359308329312397897125...,,,full mammogram images
2,P_00004,3,LEFT,CC,1,mass,ARCHITECTURAL_DISTORTION,ILL_DEFINED,4,BENIGN,3,train,1.3.6.1.4.1.9590.100.1.2.891800462110225318343...,,,full mammogram images
3,P_00004,3,LEFT,MLO,1,mass,ARCHITECTURAL_DISTORTION,ILL_DEFINED,4,BENIGN,3,train,1.3.6.1.4.1.9590.100.1.2.295360926313492745441...,,,full mammogram images
4,P_00004,3,RIGHT,MLO,1,mass,OVAL,CIRCUMSCRIBED,4,BENIGN,5,train,1.3.6.1.4.1.9590.100.1.2.410524754913057908920...,,,full mammogram images


## Series Metadata
The purpose of the series metadata is to map a series identifier to the location of its files. Variables that are not relevant to that purpose are removed. To that end, the following variables will be removed from the series metadata dataset:

| #  | Variable             | Description                                                                 | Rationale                        |
|----|----------------------|-----------------------------------------------------------------------------|----------------------------------|
| 1  | collection           | Name of collection (CBIS-DDSM)                                              | Relevance                        |
| 2  | data_description_uri | The URI for the CBIS-DDSM Data Description                                  | Relevance                        |
| 3  | subject_id           | Composite of file set, BI-RADS category, patient id, breast side, and view  | Relevance, Redundancy            |
| 4  | study_uid            | Study identifier                                                            | Relevance, Transitive Dependency |
| 5  | study_date           | Date study was conducted                                                    | Relevance                        |
| 6  | modality             | Contains a single value: "MG"                                               | Relevance                        |
| 7  | sop_class_name       | DICOM SOP Class                                                             | Relevance                        |
| 8  | sop_class_uid        | DICOM SOP Class UID                                                         | Relevance                        |
| 9  | file_size            | The file to which this refers is unclear.                                   | Relevance, Transitive Dependency |
| 10 | download_timestamp   | Datetime the data were downloaded                                           | Relevance                        |

Once the spurious data are removed, the series metadata file will contain:
- Series UID: A unique identifier for the series
- Series Description: Specifies whether the series includes full mammogram images, cropped images, or ROI masks.
- Number of Images: The number of images in the series
- File Location: The directory containing the images
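
A minimal sketch of this column pruning, assuming the raw metadata has already been loaded with the snake_case column names used in the table above (the raw file's actual headers may differ). `prep_series` is a hypothetical helper, not the `SeriesPrep` implementation, and the row values are placeholders.

```python
import pandas as pd

# Variables judged irrelevant to mapping a series UID to its files.
DROP_COLUMNS = [
    "collection", "data_description_uri", "subject_id", "study_uid",
    "study_date", "modality", "sop_class_name", "sop_class_uid",
    "file_size", "download_timestamp",
]


def prep_series(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the irrelevant variables; columns that are already absent are ignored."""
    return df.drop(columns=DROP_COLUMNS, errors="ignore")


# One placeholder row with the full set of assumed columns.
raw = pd.DataFrame([{
    "series_uid": "uid.1",
    "collection": "CBIS-DDSM",
    "data_description_uri": "placeholder-uri",
    "subject_id": "placeholder-subject",
    "study_uid": "uid.study.1",
    "study_date": "placeholder-date",
    "series_description": "ROI mask images",
    "modality": "MG",
    "sop_class_name": "placeholder-class",
    "sop_class_uid": "uid.sop.1",
    "number_of_images": 2,
    "file_location": "placeholder/location",
    "file_size": "placeholder-size",
    "download_timestamp": "placeholder-timestamp",
}])

series = prep_series(raw)
print(list(series.columns))
```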

In [8]:
SERIES_INPUT_FP = "data/raw/meta/metadata.csv"
SERIES_OUTPUT_FP = "data/staged/series.csv"

In [9]:
series = SeriesPrep()
dfs = series.prep(fp_in=SERIES_INPUT_FP, fp_out=SERIES_OUTPUT_FP, result=True)
dfs.info()
dfs.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6775 entries, 0 to 6774
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   series_uid          6775 non-null   object
 1   series_description  6775 non-null   object
 2   number_of_images    6775 non-null   int64 
 3   file_location       6775 non-null   object
dtypes: int64(1), object(3)
memory usage: 211.8+ KB


Unnamed: 0,series_uid,series_description,number_of_images,file_location
0,1.3.6.1.4.1.9590.100.1.2.294445047912407030012...,ROI mask images,2,data/raw/CBIS-DDSM/Calc-Test_P_01004_LEFT_MLO_...
1,1.3.6.1.4.1.9590.100.1.2.188613955710170417803...,ROI mask images,2,data/raw/CBIS-DDSM/Calc-Test_P_00038_LEFT_MLO_...
2,1.3.6.1.4.1.9590.100.1.2.399466258212646932018...,ROI mask images,2,data/raw/CBIS-DDSM/Calc-Test_P_00041_LEFT_MLO_...
3,1.3.6.1.4.1.9590.100.1.2.419081637812053404913...,ROI mask images,2,data/raw/CBIS-DDSM/Calc-Test_P_00038_LEFT_CC_1...
4,1.3.6.1.4.1.9590.100.1.2.284547955212024760928...,ROI mask images,2,data/raw/CBIS-DDSM/Calc-Test_P_00485_LEFT_CC_3...
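
Because the series metadata exists to resolve series identifiers to file locations, one useful sanity check is that every series UID referenced by the cases appears in it. The following is a minimal sketch of such a check on toy data with placeholder UIDs; `orphaned_series` is a hypothetical helper and not part of `bcd`.

```python
import pandas as pd

# Toy tables: "uid.3" is referenced by a case but missing from the series metadata.
master_toy = pd.DataFrame({"series_uid": ["uid.1", "uid.2", "uid.3"]})
series_toy = pd.DataFrame({
    "series_uid": ["uid.1", "uid.2", "uid.4"],
    "file_location": ["loc/1", "loc/2", "loc/4"],
})


def orphaned_series(master: pd.DataFrame, series: pd.DataFrame) -> list:
    """Return series UIDs present in the case data but absent from the series metadata."""
    merged = master.merge(
        series[["series_uid"]].drop_duplicates(),
        on="series_uid",
        how="left",
        indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
    )
    return merged.loc[merged["_merge"] == "left_only", "series_uid"].tolist()


orphans = orphaned_series(master_toy, series_toy)
print(orphans)
```

An empty list would indicate that every case-level series UID can be resolved to a file location.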


## DICOM Metadata
DICOM image properties are buried within the DICOM image files. Here, we extract the image metadata properties from the DICOM image files and combine them with the master case data. This dataset will contain the following variables:

- Series UID: The unique identifier for the series.
- Series Description: 'full mammogram', 'ROI mask', or 'cropped image'
- Patient Id: A unique identifier for each patient.
- Side: Left or right breast.
- Image View: Either craniocaudal (CC) or mediolateral oblique (MLO) view.
- Height: Number of rows in the image pixel data.
- Width: Number of columns in the image pixel data.
- Bits: The number of bits stored in the image.
- Smallest Image Pixel Value
- Largest Image Pixel Value
- Pixel Value Range
- Breast Density
- Abnormality Id
- Abnormality Type: 'calcification' or 'mass'
- BI-RADS Assessment: Range from 0 to 5.
- Calcification type and Distribution (if applicable)
- Mass shape and margins  (if applicable)
- Pathology: 'MALIGNANT' or 'BENIGN'
- Subtlety: Indication of assessment difficulty, from 1 (nuanced) to 5 (obvious)
- Dataset: 'train' or 'test'
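
The image-level variables in this list correspond to standard DICOM attributes. The sketch below shows one way they might be collected from a parsed header; it is not the `DicomPrep` implementation. `image_properties` is a hypothetical helper, the choice of `BitsStored` for "Bits" and of largest minus smallest for the pixel value range are assumptions, and a `SimpleNamespace` with illustrative values stands in for a real header so the example runs without any DICOM files.

```python
from types import SimpleNamespace


def image_properties(ds) -> dict:
    """Collect image properties from an object exposing DICOM keyword attributes.

    With pydicom installed, `ds` could be the result of
    pydicom.dcmread(path, stop_before_pixels=True).
    """
    smallest = int(ds.SmallestImagePixelValue)
    largest = int(ds.LargestImagePixelValue)
    return {
        "height": int(ds.Rows),          # number of rows in the pixel data
        "width": int(ds.Columns),        # number of columns in the pixel data
        "bits": int(ds.BitsStored),      # assumed source of the 'Bits' variable
        "smallest_image_pixel_value": smallest,
        "largest_image_pixel_value": largest,
        "pixel_value_range": largest - smallest,  # assumed definition
    }


# Stand-in header; the values are illustrative and not taken from CBIS-DDSM.
header = SimpleNamespace(
    Rows=4,
    Columns=3,
    BitsStored=16,
    SmallestImagePixelValue=0,
    LargestImagePixelValue=65535,
)

props = image_properties(header)
print(props)
```

The resulting dictionary could be merged with the master case data on the series UID to obtain the remaining variables.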


In [10]:
LOCATION = "data/raw/CBIS-DDSM"
DICOM_FP = "data/staged/dicom.csv"
SKIP_LIST = ["Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/1.000000-full mammogram images-05891/1-1.dcm"]

In [11]:
dicom = DicomPrep()
dicom.prep(location=LOCATION, master_fp=MASTER_FP, dicom_fp=DICOM_FP, skip_list=SKIP_LIST)

DEBUG:bcd.data.dicom.prep:Number of filepaths: 10239
DEBUG:bcd.data.dicom.prep:Number of filtered filepaths: 10238
DEBUG:bcd.data.dicom.prep:Shape of DICOM Data: (10238, 23)
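
The debug output shows one fewer file path after filtering, which is consistent with the single entry in `SKIP_LIST`. A minimal sketch of how such a filter might work, assuming skip entries are matched as path suffixes; `filter_filepaths` is a hypothetical helper and the paths below are placeholders.

```python
def filter_filepaths(filepaths, skip_list):
    """Return the file paths that do not end with any entry in skip_list."""
    # Normalize separators so Windows-style paths compare equally.
    skip = tuple(entry.replace("\\", "/") for entry in skip_list)
    return [fp for fp in filepaths if not fp.replace("\\", "/").endswith(skip)]


# Placeholder paths; one of them matches the skip entry.
all_paths = [
    "data/raw/example/case_a/1-1.dcm",
    "data/raw/example/case_b/1-1.dcm",
    "data/raw/example/case_b/1-2.dcm",
]
skipped = ["case_b/1-1.dcm"]

kept = filter_filepaths(all_paths, skipped)
print(len(all_paths), len(kept))
```

Suffix matching keeps the skip list independent of where the raw data directory lives.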
