# Data Preparation

In the prior section, we identified a few structural concerns worth addressing before any quality or exploratory analyses take place. Here, we extract the relevant task-specific information from the CBIS-DDSM case and DICOM datasets and integrate the data into a single, combined full mammogram dataset.

Our process will take three steps: 
1. Combine the calcification and mass training and test sets into a single full mammogram dataset,
2. Add DICOM image file paths to the *series* metadata,
3. Extract the *DICOM* image metadata using the file paths above, and merge it with the case data from step 1.

The full dataset will have a few upgrades that will facilitate the analysis, detection, and classification tasks:
1. A mammogram ID, consisting of abnormality type, fileset (train/test), patient_id, breast laterality, and view, will uniquely identify each full mammogram image.
2. A Boolean target variable, 'cancer', will be added. It is True for MALIGNANT cases and False for both BENIGN and BENIGN_WITHOUT_CALLBACK cases.
3. Pixel statistics, such as the minimum, maximum, mean, and standard deviation, will be added to the dataset.
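
The first two upgrades can be illustrated with a minimal sketch. The ID format is inferred from the sample `mmg_id` values shown later in this section (for example, `Mass-Test_P_01825_RIGHT_MLO`). The helper names `make_mmg_id` and `is_cancer` and the `ABBREVIATIONS` mapping are assumptions made for illustration; they are not part of the `bcd` package, and `CasePrep` may derive these columns differently.

```python
# Illustrative helpers only; these names are assumptions, not the bcd package's API.
ABBREVIATIONS = {"calcification": "Calc", "mass": "Mass"}


def make_mmg_id(abnormality_type, fileset, patient_id, laterality, image_view):
    """Build an ID of the form '<Type>-<Fileset>_<patient_id>_<laterality>_<view>'."""
    prefix = f"{ABBREVIATIONS[abnormality_type]}-{fileset.capitalize()}"
    return "_".join([prefix, patient_id, laterality, image_view])


def is_cancer(pathology):
    """True only for MALIGNANT; BENIGN and BENIGN_WITHOUT_CALLBACK map to False."""
    return pathology == "MALIGNANT"


print(make_mmg_id("mass", "test", "P_01825", "RIGHT", "MLO"))
print(is_cancer("BENIGN_WITHOUT_CALLBACK"))
```

On a DataFrame, functions like these could be applied row-wise, or replaced with vectorized string operations for speed.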

## Case Dataset Integration
The following code cells will integrate all case data into a single file. This doesn't rule out separation of mass and calcification cases as may be needed downstream, but it reduces a certain amount of redundancy and allows us to see both the forest and the trees.

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))

In [2]:
from typing import Union
from glob import glob

import dask
import numpy as np
import pandas as pd
import pydicom

from bcd.dal.file import IOService
from bcd.utils.file import getsize
from bcd.utils.profile import profiler
from bcd.data_prep.base import DataPrep
from bcd.data_prep.prep import CasePrep
from bcd.data_prep.prep import SeriesPrep
from bcd.data_prep.prep import CBISPrep

In [3]:
calc_test = "data/meta/0_raw/calc_case_description_test_set.csv"
calc_train = "data/meta/0_raw/calc_case_description_train_set.csv"
mass_test = "data/meta/0_raw/mass_case_description_test_set.csv"
mass_train = "data/meta/0_raw/mass_case_description_train_set.csv"

case_fp = "data/meta/1_interim/cases.csv"

cp = CasePrep(calc_train_fp=calc_train, calc_test_fp=calc_test, mass_train_fp=mass_train, mass_test_fp=mass_test, case_fp=case_fp, force=False)
cases = cp.prep()
cases.info()
cases.sample(n=5, random_state=55)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3568 entries, 0 to 3567
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   patient_id         3568 non-null   object
 1   breast_density     3568 non-null   int64 
 2   laterality         3568 non-null   object
 3   image_view         3568 non-null   object
 4   abnormality_id     3568 non-null   int64 
 5   abnormality_type   3568 non-null   object
 6   calc_type          3544 non-null   object
 7   calc_distribution  3129 non-null   object
 8   assessment         3568 non-null   int64 
 9   pathology          3568 non-null   object
 10  subtlety           3568 non-null   int64 
 11  fileset            3568 non-null   object
 12  mass_shape         3564 non-null   object
 13  mass_margins       3508 non-null   object
 14  mmg_id             3568 non-null   object
 15  cancer             3568 non-null   bool  
dtypes: bool(1), int64(4), object(11)
memory us

Unnamed: 0,patient_id,breast_density,laterality,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,mmg_id,cancer
3563,P_01825,2,RIGHT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,3,BENIGN_WITHOUT_CALLBACK,3,test,LOBULATED,MICROLOBULATED,Mass-Test_P_01825_RIGHT_MLO,False
17,P_00008,1,RIGHT,MLO,3,calcification,,REGIONAL,2,BENIGN_WITHOUT_CALLBACK,3,training,NOT APPLICABLE,NOT APPLICABLE,Calc-Training_P_00008_RIGHT_MLO,False
3533,P_01690,3,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,0,BENIGN,4,test,OVAL,OBSCURED,Mass-Test_P_01690_LEFT_CC,False
2677,P_01162,3,LEFT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,4,BENIGN,4,training,ROUND,CIRCUMSCRIBED,Mass-Training_P_01162_LEFT_MLO,False
3030,P_01656,2,LEFT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,5,MALIGNANT,5,training,OVAL,CIRCUMSCRIBED,Mass-Training_P_01656_LEFT_MLO,True


The dataset above has both mass and calcification training and test data, as well as a mammogram ID, 'mmg_id', and a Boolean target, 'cancer'.

## Series Metadata
Next, we add filepaths to the series metadata.

In [4]:
fpi = "data/meta/0_raw/metadata.csv"
fpo = "data/meta/3_clean/series.csv"
sp = SeriesPrep(filepath=fpi, series_filepath=fpo, force=False)
series = sp.prep()
series.info()
series.sample(n=5, random_state=55)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10239 entries, 0 to 10238
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   series_uid            10239 non-null  object
 1   collection            10239 non-null  object
 2   data_description_uri  10239 non-null  object
 3   subject_id            10239 non-null  object
 4   study_uid             10239 non-null  object
 5   study_date            10239 non-null  object
 6   series_description    10239 non-null  object
 7   modality              10239 non-null  object
 8   sop_class_name        10239 non-null  object
 9   sop_class_uid         10239 non-null  object
 10  number_of_images      10239 non-null  int64 
 11  file_size             10239 non-null  object
 12  file_location         10239 non-null  object
 13  download_timestamp    10239 non-null  object
 14  filepath              10239 non-null  object
dtypes: int64(1), object(14)
memory usage

Unnamed: 0,series_uid,collection,data_description_uri,subject_id,study_uid,study_date,series_description,modality,sop_class_name,sop_class_uid,number_of_images,file_size,file_location,download_timestamp,filepath
752,1.3.6.1.4.1.9590.100.1.2.118805243010527642836...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Test_P_01419_LEFT_CC_1,1.3.6.1.4.1.9590.100.1.2.267657522710744920625...,08-29-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,18.04 MB,./CBIS-DDSM/Calc-Test_P_01419_LEFT_CC_1/08-29-...,2023-05-24T03:09:56.273,data/image/0_raw/CBIS-DDSM/Calc-Test_P_01419_L...
8883,1.3.6.1.4.1.9590.100.1.2.160255505411776231122...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Training_P_00260_RIGHT_CC_1,1.3.6.1.4.1.9590.100.1.2.404192837610226889722...,09-06-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,22.08 MB,./CBIS-DDSM/Calc-Training_P_00260_RIGHT_CC_1/0...,2023-05-24T07:00:54.02,data/image/0_raw/CBIS-DDSM/Calc-Training_P_002...
6889,1.3.6.1.4.1.9590.100.1.2.190371998012909216226...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Mass-Test_P_01697_LEFT_MLO_1,1.3.6.1.4.1.9590.100.1.2.420210647711670485715...,10-04-2016,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,24.01 MB,./CBIS-DDSM/Mass-Test_P_01697_LEFT_MLO_1/10-04...,2023-05-24T06:07:39.287,data/image/0_raw/CBIS-DDSM/Mass-Test_P_01697_L...
1810,1.3.6.1.4.1.9590.100.1.2.109486979812564135735...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Test_P_00180_LEFT_MLO_1,1.3.6.1.4.1.9590.100.1.2.363307813117337481178...,08-29-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,14.43 MB,./CBIS-DDSM/Calc-Test_P_00180_LEFT_MLO_1/08-29...,2023-05-24T03:43:49.606,data/image/0_raw/CBIS-DDSM/Calc-Test_P_00180_L...
1182,1.3.6.1.4.1.9590.100.1.2.223222662511824177432...,CBIS-DDSM,https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY,Calc-Test_P_00390_RIGHT_MLO_1,1.3.6.1.4.1.9590.100.1.2.366591994512355436005...,08-29-2017,ROI mask images,MG,Secondary Capture Image Storage,1.2.840.10008.5.1.4.1.1.7,2,13.60 MB,./CBIS-DDSM/Calc-Test_P_00390_RIGHT_MLO_1/08-2...,2023-05-24T03:24:41.338,data/image/0_raw/CBIS-DDSM/Calc-Test_P_00390_R...


Full filepaths have been added for all 10,239 images in the CBIS-DDSM.
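
Judging from the sample output, each `filepath` appears to be the relative `file_location` re-rooted under `data/image/0_raw`. A minimal sketch of that mapping follows. The `to_filepath` helper and the `IMAGE_ROOT` constant are hypothetical names, the example location is made up, and `SeriesPrep` may build the paths differently (for instance, by globbing for the image files inside each series directory).

```python
import posixpath

# Assumed root of the raw images, inferred from the sample output above.
IMAGE_ROOT = "data/image/0_raw"


def to_filepath(file_location, root=IMAGE_ROOT):
    """Join the root with a relative file_location and drop the leading './'."""
    return posixpath.normpath(posixpath.join(root, file_location))


# Hypothetical file_location value, for illustration only.
print(to_filepath("./CBIS-DDSM/Some-Series/some-study"))
```

Using `posixpath` keeps forward slashes regardless of the operating system, which matches the paths stored in the metadata.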

## DICOM Image Metadata

Finally, we extract the DICOM image metadata described in {numref}`dicom_image_metadata` and merge it with the case data.

```{table} DICOM Image Metadata
:name: dicom_image_metadata

| # | Name                       | Description                                                                              |
|---|----------------------------|------------------------------------------------------------------------------------------|
| 1 | bit_depth                  | Number of bits used to define each pixel                                                 |
| 2 | rows                       | Number of pixel rows in the image                                                        |
| 3 | cols                       | Number of pixel columns in the image                                                     |
| 4 | aspect_ratio               | Ratio of width to height in image                                                        |
| 5 | size                       | Product of width and height in image                                                     |
| 6 | min_pixel_value            | Minimum pixel value                                                                      |
| 7 | max_pixel_value            | Maximum pixel value                                                                      |
| 8 | mean_pixel_value           | Average pixel value                                                                      |
| 9 | std_pixel_value            | Standard deviation of pixel values                                                       |

```
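
The fields in the table can be derived from an image's pixel array. The sketch below is one way to compute them with NumPy, under a few assumptions: `pixel_metadata` is a hypothetical helper and not the `CBISPrep` implementation, values are rounded to two decimals to match the sample output, and the standard deviation is the population form that `numpy.ndarray.std` returns by default.

```python
import numpy as np


def pixel_metadata(pixels, bit_depth):
    """Summarize a 2-D pixel array using the field names from the table."""
    rows, cols = pixels.shape
    return {
        "bit_depth": bit_depth,
        "rows": rows,
        "cols": cols,
        "aspect_ratio": round(cols / rows, 2),  # width divided by height
        "size": rows * cols,                    # total number of pixels
        "min_pixel_value": int(pixels.min()),
        "max_pixel_value": int(pixels.max()),
        "mean_pixel_value": round(float(pixels.mean()), 2),
        "std_pixel_value": round(float(pixels.std()), 2),
    }


# Tiny synthetic 4x2 image standing in for a real mammogram.
demo = np.arange(0, 16, 2, dtype=np.uint16).reshape(4, 2)
meta = pixel_metadata(demo, bit_depth=16)
print(meta)
```

With pydicom, the array would typically come from `pydicom.dcmread(filepath).pixel_array`, and the bit depth from a header element such as `BitsStored`; which element the pipeline actually reads is not shown in this section.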

In [5]:
cases = "data/meta/1_interim/cases.csv"
series = "data/meta/3_clean/series.csv"
cbis = "data/meta/2_staged/cbis.csv"
cp = CBISPrep(case_filepath=cases, series_filepath=series, cbis_filepath=cbis, force=False)
cbis = cp.prep()
cbis.info()
cbis.sample(n=5, random_state=55)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3568 entries, 0 to 3567
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patient_id         3568 non-null   object 
 1   breast_density     3568 non-null   int64  
 2   laterality         3568 non-null   object 
 3   image_view         3568 non-null   object 
 4   abnormality_id     3568 non-null   int64  
 5   abnormality_type   3568 non-null   object 
 6   calc_type          3544 non-null   object 
 7   calc_distribution  3129 non-null   object 
 8   assessment         3568 non-null   int64  
 9   pathology          3568 non-null   object 
 10  subtlety           3568 non-null   int64  
 11  fileset            3568 non-null   object 
 12  mass_shape         3564 non-null   object 
 13  mass_margins       3508 non-null   object 
 14  mmg_id             3568 non-null   object 
 15  cancer             3568 non-null   bool   
 16  bit_depth          3568 

Unnamed: 0,patient_id,breast_density,laterality,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,...,rows,cols,aspect_ratio,size,file_size,min_pixel_value,max_pixel_value,mean_pixel_value,std_pixel_value,filepath
3563,P_01825,2,RIGHT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,3,BENIGN_WITHOUT_CALLBACK,...,4520,2888,0.64,13053760,26108614,0,61031,12100.82,15102.12,data/image/0_raw/CBIS-DDSM/Mass-Test_P_01825_R...
17,P_00008,1,RIGHT,MLO,3,calcification,,REGIONAL,2,BENIGN_WITHOUT_CALLBACK,...,4576,3048,0.67,13947648,27896398,0,65535,16262.07,15051.22,data/image/0_raw/CBIS-DDSM/Calc-Training_P_000...
3533,P_01690,3,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,0,BENIGN,...,5386,3706,0.69,19960516,39922120,0,65535,12696.9,14839.88,data/image/0_raw/CBIS-DDSM/Mass-Test_P_01690_L...
2677,P_01162,3,LEFT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,4,BENIGN,...,5911,4096,0.69,24211456,48424010,0,65535,13687.71,15241.62,data/image/0_raw/CBIS-DDSM/Mass-Training_P_011...
3030,P_01656,2,LEFT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,5,MALIGNANT,...,4736,2656,0.56,12578816,25158730,0,65535,18227.27,16527.84,data/image/0_raw/CBIS-DDSM/Mass-Training_P_016...


We have all case information along with the DICOM image metadata in a single dataset.
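
A merge of this kind can be sketched with pandas. The example assumes the image metadata carries the same `mmg_id` key as the case data, which this section does not confirm; the frames below are toy data, and the second abnormality row is invented for illustration. Because one mammogram can contain several abnormalities, the sketch validates a many-to-one relationship and flags any case row without matching image metadata.

```python
import pandas as pd

# Toy case data: several abnormality rows may share one mammogram.
cases = pd.DataFrame(
    {
        "mmg_id": [
            "Mass-Test_P_01825_RIGHT_MLO",
            "Mass-Test_P_01825_RIGHT_MLO",  # hypothetical second abnormality
            "Calc-Training_P_00008_RIGHT_MLO",
        ],
        "abnormality_id": [1, 2, 3],
    }
)

# Toy image metadata: one row per full mammogram.
image_meta = pd.DataFrame(
    {
        "mmg_id": ["Mass-Test_P_01825_RIGHT_MLO", "Calc-Training_P_00008_RIGHT_MLO"],
        "rows": [4520, 4576],
        "cols": [2888, 3048],
    }
)

# validate raises MergeError if image_meta has duplicate keys;
# indicator adds a '_merge' column that marks unmatched case rows.
merged = cases.merge(
    image_meta, on="mmg_id", how="left", validate="many_to_one", indicator=True
)
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

A left join keeps every case row, so the merged row count should equal the case row count, as it does for the 3,568 entries above.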