# CBIS-DDSM Data Preparation
Downstream data quality and exploratory analyses require data that are relevant to the study, free of duplication and anomalies, referentially sound, without transitive dependencies, and formatted, typed, and structured so that analyses are efficient, inferences are correct, and insights have impact. In this section, we condition the data and optimize its structure, organization, and format to support downstream exploratory data analysis.

This exercise will produce three datasets essential to the exploratory effort:
1. Case Description: Mass and calcification clinical data with cross-references to DICOM series.
2. Series Metadata: Series data mapped to file locations. 
3. DICOM Metadata: DICOM image properties such as bit depth, width, height, and pixel value ranges.


In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

from bcd.data.case.prep import CasePrep
from bcd.data.series.prep import SeriesPrep
from bcd.data.dicom.prep import DicomPrep

## Case Description
Several data preparation tasks are performed on the mass and calcification case descriptive datasets. 

1. **Combined Case Datasets**: The mass and calcification cases are each split into separate training and test sets to support modeling and consistent evaluation of diagnostic methods. Exploratory data analysis, by contrast, examines the *entire* dataset before any assumptions are made about the data. We combine the training and test sets for each abnormality type (mass and calcification) so that we can see both the forest and the trees.
2. **Series UID, not File Paths**: The case description datasets contain incorrect file paths for the full, cropped, and ROI mask images. However, each file path embeds a series UID, which can be mapped to the correct file location via the series metadata. Here, we replace the file paths with the series UIDs they contain.
3. **Benign is Benign**: Case descriptions have a pathology of 'BENIGN', 'MALIGNANT', or 'BENIGN WITHOUT CALLBACK'. The last of these means that the region is suspicious and should be monitored, but no further investigation is required. This distinction isn't relevant to this study; hence, 'BENIGN WITHOUT CALLBACK' will be replaced by 'BENIGN'.
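
The three tasks above can be sketched in plain Python. This is a minimal illustration, not the `CasePrep` implementation: the helper names, the `fileset` field, the `image file path` key, and the toy rows are all hypothetical, and the sketch assumes the series UID is the third component of each raw file path (subject, study UID, series UID, file name).

```python
from pathlib import PurePosixPath


def extract_series_uid(filepath: str) -> str:
    """Return the series UID embedded in a raw case file path.

    Assumes the layout <subject>/<study_uid>/<series_uid>/<file name>.
    """
    parts = PurePosixPath(filepath.strip()).parts
    return parts[2]


def normalize_pathology(label: str) -> str:
    """Collapse 'BENIGN WITHOUT CALLBACK' into 'BENIGN'."""
    # Underscores are handled defensively in case the raw label uses them.
    cleaned = label.strip().upper().replace("_", " ")
    return "BENIGN" if cleaned == "BENIGN WITHOUT CALLBACK" else cleaned


def combine_cases(train_rows, test_rows):
    """Merge train and test rows, tagging each with its originating file set."""
    combined = []
    for fileset, rows in (("train", train_rows), ("test", test_rows)):
        for row in rows:
            case = dict(row)  # copy so the input rows are left untouched
            case["fileset"] = fileset
            case["series_uid"] = extract_series_uid(case.pop("image file path"))
            case["pathology"] = normalize_pathology(case["pathology"])
            combined.append(case)
    return combined


# Toy rows with made-up UIDs, for illustration only.
train = [{
    "patient_id": "P_00001",
    "pathology": "MALIGNANT",
    "image file path": "Mass-Training_P_00001_LEFT_CC/1.2.3/1.2.3.4/000000.dcm",
}]
test = [{
    "patient_id": "P_00016",
    "pathology": "BENIGN WITHOUT CALLBACK",
    "image file path": "Mass-Test_P_00016_LEFT_CC/1.2.5/1.2.5.6/000000.dcm\n",
}]
cases = combine_cases(train, test)
```

Keeping a `fileset` tag on each row is one way to preserve the original train/test split so it can be recovered later for modeling.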

In [2]:
# Calc Filepaths
CALC_TRAIN_FP = "data/raw/meta/calc_case_description_train_set.csv"
CALC_TEST_FP = "data/raw/meta/calc_case_description_test_set.csv"
CALC_FP = "data/staged/calc_cases.csv"

# Mass Filepaths
MASS_TRAIN_FP = "data/raw/meta/mass_case_description_train_set.csv"
MASS_TEST_FP = "data/raw/meta/mass_case_description_test_set.csv"
MASS_FP = "data/staged/mass_cases.csv"

In [3]:
case = CasePrep()
case.prep(train_fp=CALC_TRAIN_FP, test_fp=CALC_TEST_FP, cases_fp=CALC_FP)
case.prep(train_fp=MASS_TRAIN_FP, test_fp=MASS_TEST_FP, cases_fp=MASS_FP)

## Series Metadata
The purpose of the series metadata is to map a series identifier to the location of its files. Variables that do not serve that purpose are removed. To that end, the following variables will be removed from the series metadata dataset:

| # |   Variable    | Description                         |       Rationale      |
|---|---------------|-------------------------------------|----------------------|
| 1 | collection    | Name of collection (CBIS-DDSM)      |       Relevance      |
| 2 | data_description_uri | The URI for the CBIS-DDSM Data Description | Relevance |
| 3 | subject_id | Composite of abnormality type, file set, patient id, breast side, and view | Relevance, Redundancy |
| 4 | study_uid  | Study identifier | Relevance, Transitive Dependency |
| 5 | study_date | Date study was conducted | Relevance |
| 6 | modality | Contains a single value: "MG"  | Relevance |
| 7 | sop_class_name | DICOM SOP Class  | Relevance |
| 8 | sop_class_uid | DICOM SOP UID  | Relevance |
| 9 | file_size | The file to which this refers is unclear.  | Relevance, Transitive Dependency |
| 10 | download_timestamp | Datetime the data were downloaded  | Relevance |

Once these variables are removed, the series metadata file will contain:
- Series UID: A unique identifier for the series
- Series Description: Specifies whether the series includes full mammogram images, cropped images, or ROI masks.
- Number of Images: The number of images in the series
- File Location: The directory containing the images
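
As a rough sketch of this pruning step (not the `SeriesPrep` implementation), the snippet below drops the ten variables listed in the table from each row of a CSV. The names of the retained columns, the `prune_series_metadata` helper, and the sample row are assumptions made for illustration.

```python
import csv
import io

# The ten variables marked for removal in the table above.
DROP = {
    "collection", "data_description_uri", "subject_id", "study_uid",
    "study_date", "modality", "sop_class_name", "sop_class_uid",
    "file_size", "download_timestamp",
}


def prune_series_metadata(rows):
    """Return rows with the irrelevant variables removed."""
    return [{k: v for k, v in row.items() if k not in DROP} for row in rows]


# A made-up, abbreviated metadata file; retained column names are assumed.
raw_csv = io.StringIO(
    "series_uid,collection,modality,series_description,"
    "number_of_images,file_size,file_location\n"
    "1.2.3.4,CBIS-DDSM,MG,full mammogram images,"
    "1,26.5 MB,./CBIS-DDSM/example-subject/example-study/example-series\n"
)
series_rows = prune_series_metadata(csv.DictReader(raw_csv))
```

Expressing the rule as an explicit drop set keeps the code aligned with the table: a variable is removed only if it appears there.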

In [4]:
SERIES_INPUT_FP = "data/raw/meta/metadata.csv"
SERIES_OUTPUT_FP = "data/staged/series.csv"

In [5]:
series = SeriesPrep()
series.prep(fp_in=SERIES_INPUT_FP, fp_out=SERIES_OUTPUT_FP)

## DICOM Metadata
DICOM image properties are embedded within the DICOM image files. Here, we extract the following image metadata properties from the DICOM image files and store them in a CSV file for downstream analysis and image data processing:

- Series UID: The unique identifier for the series.
- Patient ID: A unique identifier for each patient.
- Side: Left or right breast.
- Image View: Either craniocaudal (CC) or mediolateral oblique (MLO) view.
- Height: Number of rows in the image pixel data.
- Width: Number of columns in the image pixel data.
- Bits: The number of bits stored for each pixel sample.
- Smallest Image Pixel Value
- Largest Image Pixel Value
- Pixel Value Range
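
The extraction of these properties might look roughly like the sketch below. It is not the `DicomPrep` implementation. It assumes the properties are read from the standard DICOM keywords `SeriesInstanceUID`, `Rows`, `Columns`, `BitsStored`, `SmallestImagePixelValue`, and `LargestImagePixelValue`, that patient, side, and view are parsed from the subject directory name, and that the pixel value range is the largest value minus the smallest. The function names and output keys are hypothetical, and a `SimpleNamespace` stands in for a real DICOM dataset so the example runs without image files.

```python
from types import SimpleNamespace


def parse_subject(subject: str) -> dict:
    """Split a subject directory name such as 'Mass-Training_P_01382_LEFT_MLO'."""
    parts = subject.split("_")
    return {
        "patient_id": f"P_{parts[2]}",
        "side": parts[3],
        "image_view": parts[4],
    }


def extract_properties(ds, subject: str) -> dict:
    """Collect image properties from a dataset exposing DICOM keywords as attributes."""
    smallest = int(ds.SmallestImagePixelValue)
    largest = int(ds.LargestImagePixelValue)
    props = {"series_uid": str(ds.SeriesInstanceUID)}
    props.update(parse_subject(subject))
    props.update({
        "height": int(ds.Rows),      # number of rows in the pixel data
        "width": int(ds.Columns),    # number of columns in the pixel data
        "bits": int(ds.BitsStored),
        "smallest_image_pixel": smallest,
        "largest_image_pixel": largest,
        "image_pixel_range": largest - smallest,  # assumed definition
    })
    return props


# Stand-in dataset with made-up values.
ds = SimpleNamespace(
    SeriesInstanceUID="1.2.3.4", Rows=5000, Columns=3000, BitsStored=16,
    SmallestImagePixelValue=0, LargestImagePixelValue=65535,
)
props = extract_properties(ds, "Mass-Training_P_01382_LEFT_MLO")
```

With pydicom installed, a dataset returned by `pydicom.dcmread(path, stop_before_pixels=True)` exposes the same keywords as attributes and could be passed in place of the stand-in. Reading only the header avoids loading the pixel data.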


In [6]:
LOCATION = "data/raw/CBIS-DDSM"
DICOM_FP = "data/staged/dicom.csv"
SKIP_LIST = ["Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/1.000000-full mammogram images-05891/1-1.dcm"]

In [7]:
dicom = DicomPrep()
dicom.prep(location=LOCATION, dicom_fp=DICOM_FP, skip_list=SKIP_LIST)

DEBUG:bcd.data.dicom.prep:Number of filepaths: 10239
DEBUG:bcd.data.dicom.prep:Number of filtered filepaths: 10238
DEBUG:bcd.data.dicom.prep:Length of DICOM data: 10238
DEBUG:bcd.data.dicom.prep:Shape of DICOM Data: (10238, 11)
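
The log shows one file path removed between discovery and extraction, which matches the single entry in `SKIP_LIST`. A minimal sketch of such a filter follows. It assumes skip entries are matched as path suffixes, which may differ from how `DicomPrep` applies them. The `filter_filepaths` helper and the second path are hypothetical.

```python
def filter_filepaths(filepaths, skip_list):
    """Drop any file path that ends with one of the skip-list entries."""
    skip = tuple(skip_list)
    return [fp for fp in filepaths if not fp.endswith(skip)]


skip_list = [
    "Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/"
    "1.000000-full mammogram images-05891/1-1.dcm"
]
filepaths = [
    "data/raw/CBIS-DDSM/Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/"
    "1.000000-full mammogram images-05891/1-1.dcm",
    "data/raw/CBIS-DDSM/example-subject/example-study/example-series/1-1.dcm",
]
kept = filter_filepaths(filepaths, skip_list)
```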
