# CBIS-DDSM Data Preparation

In the prior section, we obtained the CBIS-DDSM data from [The Cancer Imaging Archive](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=22516629) (TCIA). This brief section converts the data to a form suitable for exploratory data analysis. Once complete, we'll have the following four datasets for the subsequent analysis phase:

| # | Dataset              | Description                                                                  |
| - | -------------------- | ---------------------------------------------------------------------------- |
| 1 | calc_cases.csv       | Calcification cases from the training and test sets.                         |
| 2 | mass_cases.csv       | Mass cases from the training and test sets.                                  |
| 3 | case_series_xref.csv | Case / Series Cross-Reference                                                |
| 4 | dicom.csv            | DICOM dataset containing properties, and quality assessments for each image. |

First, we'll build the case datasets (1, 2, 3), then we'll construct the DICOM image metadata and quality assessment dataset (4).

Our tasks are as follows:

5. **Assess Image Quality**
6. **Extract DICOM Image Metadata**: Extract image metadata such as height, width, bit-depth, pixel values, and photometric interpretation from DICOM image files.

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))

import pandas as pd

from bcd.preprocess.metadata.case import CasePrep
from bcd.preprocess.metadata.dicom import DicomPrep

## Build Case Dataset

The data are currently split into train and test sets by BI-RADS category. While convenient for modeling, this split is cumbersome for exploratory data analysis, which considers the entire dataset before making any assumptions. Our first task is to combine the calcification and mass train and test sets into a single case dataset, giving a top-down, comprehensive view of all the data for analysis.
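The combination step can be sketched as follows, assuming each raw CSV loads into a DataFrame with a shared schema. The `combine_sets` helper, the `fileset` column, and the toy rows are hypothetical and only illustrate the idea; they are not taken from `CasePrep`.

```python
import pandas as pd


def combine_sets(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Stack train and test rows, remembering where each row came from."""
    # 'fileset' is a hypothetical column name that preserves the original split.
    train = train.assign(fileset="train")
    test = test.assign(fileset="test")
    return pd.concat([train, test], ignore_index=True)


# Toy frames standing in for the raw calcification CSVs (values are made up).
calc_train = pd.DataFrame(
    {"patient_id": ["P_00005", "P_00007"], "pathology": ["MALIGNANT", "BENIGN"]}
)
calc_test = pd.DataFrame({"patient_id": ["P_00038"], "pathology": ["BENIGN"]})

calc_cases = combine_sets(calc_train, calc_test)
```

Keeping a column that records the original split means the train/test partition can still be recovered later for modeling.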

The target variable, pathology, has three values: 'MALIGNANT', 'BENIGN', and 'BENIGN_WITHOUT_CALLBACK'. The latter indicates that some aspect of the case is worth tracking, but no follow-up is required. For our purposes, that is a distinction without a difference. Nonetheless, rather than modifying the variable and losing potentially useful information, we'll create a new target variable, 'cancer', which will be True if pathology is 'MALIGNANT', and False otherwise.
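As a minimal sketch of that derivation, assuming the cases sit in a pandas DataFrame with a `pathology` column (the `cases` frame below is a toy stand-in, not the real dataset):

```python
import pandas as pd

# One toy row per pathology value.
cases = pd.DataFrame(
    {"pathology": ["MALIGNANT", "BENIGN", "BENIGN_WITHOUT_CALLBACK"]}
)

# True only for malignant cases; the original 'pathology' column is left untouched.
cases["cancer"] = cases["pathology"].eq("MALIGNANT")
```

Both benign categories map to False, while the original three-level variable remains available for later analysis.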

The CBIS-DDSM creators define a case as a particular abnormality as seen in the craniocaudal (CC) or mediolateral oblique (MLO) views; yet, the dataset lacks a unique case identifier. Consequently, five different variables are required to match metadata with their associated images. To enable direct matching between metadata and DICOM images, each case will be assigned a unique identifier, which will be cross-referenced to a full mammogram, ROI mask, or cropped image series dataset.
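One way such an identifier could be composed is sketched below. It assumes the five variables are the patient ID, breast laterality, abnormality type, image view, and abnormality ID, and that the format mirrors the identifier 'P_01382_LEFT_mass_MLO_1' that appears later in this section. The `make_case_id` helper and its key names are hypothetical, not the dataset's actual column names or the `CasePrep` logic.

```python
from typing import Mapping


def make_case_id(row: Mapping[str, object]) -> str:
    """Join five identifying fields into a single case identifier."""
    # The key names below are assumptions, not the dataset's actual column names.
    parts = (
        row["patient_id"],
        row["laterality"],
        row["abnormality_type"],
        row["image_view"],
        row["abnormality_id"],
    )
    return "_".join(str(part) for part in parts)


example = {
    "patient_id": "P_01382",
    "laterality": "LEFT",
    "abnormality_type": "mass",
    "image_view": "MLO",
    "abnormality_id": 1,
}
case_id = make_case_id(example)
```

A single string key like this lets the case metadata and the series cross-reference be joined on one column instead of five.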

In [2]:
CALC_TRAIN_FP = "data/meta/0_raw/calc_train.csv"
CALC_TEST_FP = "data/meta/0_raw/calc_test.csv"
MASS_TRAIN_FP = "data/meta/0_raw/mass_train.csv"
MASS_TEST_FP = "data/meta/0_raw/mass_test.csv"
CASE_FP = "data/meta/1_staged/cases.csv"
CASE_SERIES_XREF_FP = "data/meta/1_staged/case_series_xref.csv"

In [3]:
case = CasePrep()
df, df_xref = case.prep(
    calc_train_fp=CALC_TRAIN_FP,
    calc_test_fp=CALC_TEST_FP,
    mass_train_fp=MASS_TRAIN_FP,
    mass_test_fp=MASS_TEST_FP,
    case_fp=CASE_FP,
    case_series_fp=CASE_SERIES_XREF_FP,
    result=True,
    force=False,
)
print(f"The Case Dataset has been created with {df.shape[0]} rows and {df.shape[1]} columns.")
print(f"The Case/Series XRef Dataset has been created with {df_xref.shape[0]} rows and {df_xref.shape[1]} columns.")

The Case Dataset has been created with 3568 rows and 16 columns.
The Case/Series XRef Dataset has been created with 7240 rows and 3 columns.


## Build DICOM Dataset

Developing an image preprocessing approach requires visibility into image metadata, currently buried in individual DICOM files. Our first task here is to extract these data and store them in a more accessible format. Note: There are 10,239 images in the dataset according to [TCIA](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=22516629#2251662935562334b1e043a3a0512554ef512cad). The DICOM image file for case_id 'P_01382_LEFT_mass_MLO_1' is corrupt and could not be read, leaving a total of 10,238 images.
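The extraction idea can be sketched as below. This is not the `DicomPrep` implementation: it assumes the third-party `pydicom` package for reading headers, and the `read_header` and `describe` helpers, the output key names, and the stand-in values are all hypothetical. Pixel-value statistics are omitted because they would require loading the pixel data.

```python
from types import SimpleNamespace


def read_header(filepath: str):
    """Read a DICOM header without loading pixel data (requires pydicom)."""
    import pydicom  # third-party; imported lazily so describe() works without it

    return pydicom.dcmread(filepath, stop_before_pixels=True)


def describe(ds) -> dict:
    """Collect a few image properties from a DICOM dataset-like object."""
    return {
        "height": int(ds.Rows),
        "width": int(ds.Columns),
        "bit_depth": int(ds.BitsStored),
        "photometric_interpretation": str(ds.PhotometricInterpretation),
    }


# A stand-in object exposing the same attribute names as a pydicom Dataset.
# The values are illustrative only, not read from a CBIS-DDSM file.
fake = SimpleNamespace(
    Rows=4808, Columns=3024, BitsStored=16, PhotometricInterpretation="MONOCHROME2"
)
record = describe(fake)
```

With real files, `describe(read_header(path))` would yield one record per image, and a list of such records could be passed to `pd.DataFrame` to build a tabular dataset.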

In [4]:
LOCATION = "data/image/0_raw"
DICOM_FP = "data/meta/1_staged/dicom.csv"
SKIP_LIST = ["Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/1.000000-full mammogram images-05891/1-1.dcm"]

In [5]:
dicom = DicomPrep()
dfd = dicom.prep(
    location=LOCATION,
    dicom_fp=DICOM_FP,
    skip_list=SKIP_LIST,
    result=True,
    force=True,
)
dfd = dicom.add_series_description(dicom_fp=DICOM_FP, series_fp=CASE_SERIES_XREF_FP)
print(f"The DICOM Dataset has been created with {dfd.shape[0]} rows and {dfd.shape[1]} columns.")

  4%|██▋                                                                      | 372/10238 [00:28<08:57, 18.37it/s]

KeyboardInterrupt: 