# CBIS-DDSM Data Preparation
In the prior section, we obtained the CBIS-DDSM data from [The Cancer Imaging Archive](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=22516629) (TCIA). This brief section converts the data to a form suitable for exploratory data analysis. Once complete, we'll have the following four datasets for the subsequent analysis phase:

| # | Dataset              | Description                                                        |
|---|----------------------|--------------------------------------------------------------------|
| 1 | calc_cases.csv       | Calcification cases from the training and test sets.               |
| 2 | mass_cases.csv       | Mass cases from the training and test sets.                        |
| 3 | case_series_xref.csv | Cross-reference between cases and DICOM image series.              |
| 4 | dicom.csv            | DICOM image properties and quality assessment for each image.      |

First, we'll build the case datasets (1, 2, 3), then we'll construct the DICOM image metadata and quality assessment dataset (4). Building the DICOM dataset involves two tasks:

1. **Extract DICOM Image Metadata**: Extract image metadata such as height, width, bit-depth, pixel values, and photometric interpretation from DICOM image files.
2. **Assess Image Quality**: Assign a BRISQUE quality score to each image.



In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd

from bcd.data.prep.meta.case import CasePrep
from bcd.data.prep.meta.dicom import DicomPrep

## Build Case Datasets
The data are currently split into train and test sets by BI-RADS category. While convenient for modeling, this split is cumbersome for exploratory data analysis, which considers the entire dataset before making any assumptions. Our first task is to combine the train and test sets into a single dataset for each abnormality type (calcification and mass), facilitating a top-down, comprehensive view of all the data during analysis.
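Combining the two splits can be sketched with pandas as follows. This is a minimal illustration, not the `CasePrep` implementation: the small in-memory frames, their values, and the `fileset` column name are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the train and test metadata (hypothetical values).
calc_train = pd.DataFrame(
    {"patient_id": ["P_00005", "P_00007"], "pathology": ["MALIGNANT", "BENIGN"]}
)
calc_test = pd.DataFrame(
    {"patient_id": ["P_00038"], "pathology": ["BENIGN_WITHOUT_CALLBACK"]}
)

# Record the original split (assumed column name) so it can be recovered for modeling.
calc_train["fileset"] = "train"
calc_test["fileset"] = "test"

# Stack the rows and rebuild a clean 0..n-1 index.
calc_cases = pd.concat([calc_train, calc_test], ignore_index=True)
print(calc_cases)
```

Keeping a column that records the original split means the train/test partition is not lost when the data are later used for modeling.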

The target variable, pathology, has three values: 'MALIGNANT', 'BENIGN', and 'BENIGN_WITHOUT_CALLBACK'. The latter indicates that some aspect of the case is worth tracking, but no follow-up is required. For our purposes, that is a distinction without a difference. Nonetheless, rather than modifying the variable and losing potentially useful information, we'll create a new target variable, 'cancer', which will be True if pathology is 'MALIGNANT', and False otherwise.
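The derived target described above can be sketched as a single vectorized comparison. The toy frame is an assumption for illustration; only the `pathology` values and the `cancer` column name come from the text.

```python
import pandas as pd

# One row per pathology value (toy data).
cases = pd.DataFrame(
    {"pathology": ["MALIGNANT", "BENIGN", "BENIGN_WITHOUT_CALLBACK"]}
)

# True only for malignant cases; both benign categories map to False.
# The original 'pathology' column is left untouched.
cases["cancer"] = cases["pathology"].eq("MALIGNANT")
print(cases)
```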

The CBIS-DDSM creators define a case as a particular abnormality as seen in the craniocaudal (CC) or mediolateral oblique (MLO) view; yet, the dataset lacks a unique case identifier. Consequently, five different variables are required to map metadata to the associated images. To enable direct matching between metadata and DICOM images, each case will be assigned a unique identifier, which will be cross-referenced to its full mammogram, ROI mask, or cropped image series.
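One way to build such an identifier is to concatenate the identifying variables into a single string key. The sketch below is an assumption about how that might look, not the `CasePrep` implementation: the five column names and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical names for the five identifying variables.
KEY_COLUMNS = [
    "patient_id",
    "left_or_right_breast",
    "abnormality_type",
    "image_view",
    "abnormality_id",
]

# Two views of the same abnormality (illustrative values).
cases = pd.DataFrame(
    {
        "patient_id": ["P_01382", "P_01382"],
        "left_or_right_breast": ["LEFT", "LEFT"],
        "abnormality_type": ["mass", "mass"],
        "image_view": ["CC", "MLO"],
        "abnormality_id": [1, 1],
    }
)

# Join the five values with underscores to form one key per case.
cases["case_id"] = cases[KEY_COLUMNS].astype(str).agg("_".join, axis=1)

# A usable key must be unique across cases.
assert cases["case_id"].is_unique
print(cases["case_id"].tolist())
```

A readable composite key like this keeps the identifier self-describing, at the cost of a longer string than a surrogate integer key.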

In [2]:
# Input file paths
CALC_TRAIN_FP = "data/raw/meta/calc_train.csv"
CALC_TEST_FP = "data/raw/meta/calc_test.csv"

MASS_TRAIN_FP = "data/raw/meta/mass_train.csv"
MASS_TEST_FP = "data/raw/meta/mass_test.csv"

# Output file paths
CALC_FP = "data/staged/calc_cases.csv"
MASS_FP = "data/staged/mass_cases.csv"
CASE_SERIES_XREF__FP = "data/staged/case_series_xref.csv"


In [3]:
case = CasePrep()
dfc, dfm = case.prep(
    calc_train_fp=CALC_TRAIN_FP,
    calc_test_fp=CALC_TEST_FP,
    mass_train_fp=MASS_TRAIN_FP,
    mass_test_fp=MASS_TEST_FP,
    calc_fp=CALC_FP,
    mass_fp=MASS_FP,
    case_series_fp=CASE_SERIES_XREF__FP,
    result=True,
    force=False,
)
print(f"The Calcification Dataset has been created with {dfc.shape[0]} rows and {dfc.shape[1]} columns.")
print(f"The Mass Dataset has been created with {dfm.shape[0]} rows and {dfm.shape[1]} columns.")

FileNotFoundError: [Errno 2] No such file or directory: '/home/john/projects/bcd/data/raw/meta/calc_case_description_train_set.csv'

## Build DICOM Dataset
Developing an image preprocessing approach requires visibility into image metadata, currently buried in individual DICOM files. Our first task is to extract these data and store them in a more accessible format. Note: There are 10,239 images in the dataset according to [TCIA](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=22516629#2251662935562334b1e043a3a0512554ef512cad). The DICOM image file for case_id 'P_01382_LEFT_mass_MLO_1' was corrupt and could not be read. Therefore, we will have a total of 10,238 images.
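Extracting header metadata from a single file might look like the sketch below. It is not the `DicomPrep` implementation. The `Rows`, `Columns`, `BitsStored`, and `PhotometricInterpretation` attributes are standard DICOM keywords, and `pydicom.dcmread` with `stop_before_pixels=True` reads the header without the pixel data. The output column names, the helper names, and the stand-in values are assumptions.

```python
from types import SimpleNamespace

# Output column name -> DICOM keyword. Column names are hypothetical.
FIELDS = {
    "height": "Rows",
    "width": "Columns",
    "bit_depth": "BitsStored",
    "photometric_interpretation": "PhotometricInterpretation",
}


def summarize_header(ds) -> dict:
    """Return the selected header values, using None for any that are absent."""
    return {column: getattr(ds, keyword, None) for column, keyword in FIELDS.items()}


def read_header(filepath: str) -> dict:
    """Read one DICOM header without loading pixel data (requires pydicom)."""
    import pydicom  # imported lazily so the rest of the sketch runs without it

    ds = pydicom.dcmread(filepath, stop_before_pixels=True)
    return summarize_header(ds)


# Stand-in object with header-like attributes, so the example needs no file.
# The values are illustrative, not taken from the dataset.
fake = SimpleNamespace(
    Rows=4808, Columns=3024, BitsStored=16, PhotometricInterpretation="MONOCHROME2"
)
print(summarize_header(fake))
```

Pixel-value statistics such as minimum and maximum intensity would require loading the pixel data as well, which is considerably slower than reading headers alone.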

The performance of deep learning (DL) classification, detection, and diagnostic models depends on the quality of the data used in training. We therefore need an objective, computational method that can predict a radiologist's perception of image quality. For this task, we will use the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) algorithm. A widely used No-Reference Image Quality Assessment (NR-IQA) model, BRISQUE measures image quality using locally normalized luminance coefficients and image features derived from them.
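To make the idea of locally normalized luminance concrete, the sketch below computes mean-subtracted, contrast-normalized (MSCN) coefficients with NumPy. This is only the first stage of BRISQUE, not the full score, and it is not the code used by `DicomPrep`. The 7-tap Gaussian window, `sigma = 7/6`, and the stabilizing constant `c = 1` are commonly used settings and should be treated as assumptions here.

```python
import numpy as np


def gaussian_kernel(size: int = 7, sigma: float = 7 / 6) -> np.ndarray:
    """Return 1-D Gaussian weights that sum to one."""
    x = np.arange(size) - (size - 1) / 2
    weights = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return weights / weights.sum()


def local_mean(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Gaussian-weighted local mean via separable filtering with edge padding."""
    pad = len(kernel) // 2
    padded = np.pad(image, pad, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def mscn(image: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Subtract the local mean and divide by the local standard deviation."""
    image = image.astype(np.float64)
    kernel = gaussian_kernel()
    mu = local_mean(image, kernel)
    variance = local_mean(image ** 2, kernel) - mu ** 2
    sigma = np.sqrt(np.abs(variance))  # abs guards against tiny negative values
    return (image - mu) / (sigma + c)


# Synthetic 8-bit-like image; real mammograms would be read from DICOM files.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32))
coefficients = mscn(image)
print(coefficients.shape, float(coefficients.mean()))
```

BRISQUE then fits statistical models to the distribution of these coefficients and feeds the fitted parameters to a regressor, which is why a full score requires a trained model in addition to this normalization step.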

The DicomPrep class below will extract the DICOM data and assign a BRISQUE score to each image.


In [None]:
LOCATION = "data/raw/CBIS-DDSM"
DICOM_FP = "data/staged/dicom.csv"
SKIP_LIST = ["Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/1.000000-full mammogram images-05891/1-1.dcm"]

In [None]:
dicom = DicomPrep()
dfd = dicom.prep(location=LOCATION, dicom_fp=DICOM_FP, skip_list=SKIP_LIST, result=True, force=False)
dfd = dicom.add_series_description(dicom_fp=DICOM_FP, series_fp=CASE_SERIES_XREF__FP)
print(f"The DICOM Dataset has been created with {dfd.shape[0]} rows and {dfd.shape[1]} columns.")