# Data Transformation

Our final data preparation task before exploratory data analysis is to build a dataset for multivariate analysis. For multivariate modeling, we one-hot encode the categorical features, including the morphological ones, and normalize the numeric data to the range [0,1].
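
As a rough illustration of these two operations, the sketch below one-hot encodes a categorical column with `pandas.get_dummies` and min-max scales a numeric column to [0,1]. It is not the CBISTransformer implementation, which is not shown in this section. The toy DataFrame and the `min_max_scale` helper are hypothetical; the `MS` prefix simply mirrors the column naming seen in the transformed data.

```python
import pandas as pd

# Toy data standing in for the cleaned CBIS metadata (illustrative values only).
toy = pd.DataFrame(
    {
        "mass_shape": ["IRREGULAR", "OVAL", "ROUND"],
        "subtlety": [1.0, 3.0, 5.0],
    }
)


def min_max_scale(s: pd.Series) -> pd.Series:
    """Rescale a numeric series to [0, 1] (hypothetical helper)."""
    span = s.max() - s.min()
    # Guard against constant columns to avoid division by zero.
    return (s - s.min()) / span if span != 0 else s * 0.0


# One-hot encode the categorical feature, one indicator column per category.
encoded = pd.get_dummies(toy["mass_shape"], prefix="MS", dtype=int)

# Normalize the numeric feature to [0, 1].
toy["subtlety_scaled"] = min_max_scale(toy["subtlety"])

result = pd.concat([toy, encoded], axis=1)
print(result)
```

Min-max scaling is assumed here because the text targets the range [0,1]; the transformer may compute it differently.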

The multivariate analysis will include 13 independent variables: breast_density, laterality, image_view, abnormality_id, abnormality_type, assessment, calc_type, calc_distribution, subtlety, mass_shape, mass_margins, mean_pixel_value, and std_pixel_value. The binary dependent (target) variable will be cancer. Pathology will not be included in the multivariate analysis. BI-RADS assessment, however, is expected to have a major influence on the target; it would be notable if it did not.
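
The variable selection described above could be written down as follows. This is only a sketch: the constant names `FEATURES`, `TARGET`, and `EXCLUDED` are hypothetical, and the column names are copied from the list in the text.

```python
# Independent variables named in the text (constant names are hypothetical).
FEATURES = [
    "breast_density",
    "laterality",
    "image_view",
    "abnormality_id",
    "abnormality_type",
    "assessment",
    "calc_type",
    "calc_distribution",
    "subtlety",
    "mass_shape",
    "mass_margins",
    "mean_pixel_value",
    "std_pixel_value",
]

# Binary dependent variable.
TARGET = "cancer"

# Deliberately left out of the multivariate analysis.
EXCLUDED = ["pathology"]

# Sanity checks: no duplicates, and neither the target nor excluded columns leak in.
assert len(FEATURES) == len(set(FEATURES))
assert TARGET not in FEATURES
assert not set(EXCLUDED) & set(FEATURES)
print(len(FEATURES), "features, target:", TARGET)
```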

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))

In [2]:
import pandas as pd
import numpy as np

from bcd.data_prep.transform import CBISTransformer
pd.options.display.max_columns = 99

In [3]:
FP_CLEAN = "data/meta/3_clean/cbis.csv"
FP_COOKED = "data/meta/4_cooked/cbis.csv"

In [4]:
x4mr = CBISTransformer(source_fp=FP_CLEAN, destination_fp=FP_COOKED, force=True)
df = x4mr.transform()

OK, let's check the results.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3568 entries, 0 to 3567
Data columns (total 64 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   patient_id                   3568 non-null   object 
 1   breast_density               3568 non-null   float64
 2   laterality                   3568 non-null   object 
 3   image_view                   3568 non-null   object 
 4   abnormality_id               3568 non-null   int64  
 5   abnormality_type             3568 non-null   object 
 6   calc_type                    3568 non-null   object 
 7   calc_distribution            3568 non-null   object 
 8   assessment                   3568 non-null   int64  
 9   pathology                    3568 non-null   object 
 10  subtlety                     3568 non-null   float64
 11  fileset                      3568 non-null   object 
 12  mass_shape                   3568 non-null   object 
 13  mass_margins      

We have 64 variables, 37 of which are one-hot encoded.
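
A count like this can be double-checked by grouping column names by their one-hot prefix. The sketch below assumes the prefixes visible in the transformed data (`AT_`, `LR_`, `IV_`, `CT_`, `CD_`, `MS_`, `MM_`); the `count_by_prefix` helper is hypothetical and is demonstrated on a small subset of the column names.

```python
from collections import Counter

# Prefixes of the one-hot encoded columns, as they appear in the transformed data.
ONE_HOT_PREFIXES = ("AT_", "LR_", "IV_", "CT_", "CD_", "MS_", "MM_")


def count_by_prefix(columns):
    """Count columns per one-hot prefix (hypothetical helper)."""
    counts = Counter()
    for col in columns:
        for prefix in ONE_HOT_PREFIXES:
            if col.startswith(prefix):
                counts[prefix] += 1
                break
    return dict(counts)


# Small illustrative subset of the column names shown in this section.
example_columns = [
    "patient_id",
    "cancer",
    "AT_calcification",
    "AT_mass",
    "LR_LEFT",
    "LR_RIGHT",
    "MS_OVAL",
]

counts = count_by_prefix(example_columns)
print(counts)
print("one-hot columns:", sum(counts.values()))
```

In the notebook, calling `count_by_prefix(df.columns)` would give the per-prefix breakdown for the full transformed dataset.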

In [6]:
df.sample(n=5, random_state=22)

Unnamed: 0,patient_id,breast_density,laterality,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,mmg_id,cancer,bit_depth,rows,cols,aspect_ratio,size,file_size,min_pixel_value,max_pixel_value,mean_pixel_value,std_pixel_value,filepath,AT_calcification,AT_mass,LR_LEFT,LR_RIGHT,IV_CC,IV_MLO,CT_AMORPHOUS,CT_COARSE,CT_DYSTROPHIC,CT_EGGSHELL,CT_FINE_LINEAR_BRANCHING,CT_LARGE_RODLIKE,CT_LUCENT_CENTERED,CT_MILK_OF_CALCIUM,CT_PLEOMORPHIC,CT_PUNCTATE,CT_ROUND_AND_REGULAR,CT_SKIN,CT_VASCULAR,CD_CLUSTERED,CD_LINEAR,CD_REGIONAL,CD_DIFFUSELY_SCATTERED,CD_SEGMENTAL,MS_IRREGULAR,MS_ARCHITECTURAL_DISTORTION,MS_OVAL,MS_LYMPH_NODE,MS_LOBULATED,MS_FOCAL_ASYMMETRIC_DENSITY,MS_ROUND,MS_ASYMMETRIC_BREAST_TISSUE,MM_SPICULATED,MM_ILL_DEFINED,MM_CIRCUMSCRIBED,MM_OBSCURED,MM_MICROLOBULATED
2164,P_00420,1.0,RIGHT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,4,MALIGNANT,3.0,training,IRREGULAR,ILL_DEFINED,Mass-Training_P_00420_RIGHT_CC,1,16,5224,2896,0.55,15128704,30258504,0,65535,16087.0,16856.0,data/image/0_raw/CBIS-DDSM/Mass-Training_P_004...,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
104,P_00112,3.0,LEFT,MLO,6,calcification,ROUND_AND_REGULAR-EGGSHELL,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3.0,training,NOT APPLICABLE,NOT APPLICABLE,Calc-Training_P_00112_LEFT_MLO,0,16,4608,3008,0.65,13860864,27722826,0,65535,10951.0,15315.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_001...,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
217,P_00302,2.0,LEFT,MLO,2,calcification,PLEOMORPHIC,CLUSTERED,4,BENIGN,5.0,training,NOT APPLICABLE,NOT APPLICABLE,Calc-Training_P_00302_LEFT_MLO,0,16,5688,3880,0.68,22069440,44139978,0,60547,14239.0,16556.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_003...,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1673,P_00790,3.0,RIGHT,MLO,1,calcification,PLEOMORPHIC,CLUSTERED,4,BENIGN,4.0,test,NOT APPLICABLE,NOT APPLICABLE,Calc-Test_P_00790_RIGHT_MLO,0,16,4680,2888,0.62,13515840,27032774,0,65535,5868.0,10174.0,data/image/0_raw/CBIS-DDSM/Calc-Test_P_00790_R...,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2667,P_01152,2.0,RIGHT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,3,BENIGN,5.0,training,LOBULATED,CIRCUMSCRIBED,Mass-Training_P_01152_RIGHT_CC,0,16,5776,4064,0.7,23473664,46948424,0,65535,10109.0,14381.0,data/image/0_raw/CBIS-DDSM/Mass-Training_P_011...,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0
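
Whether the numeric columns fall in [0,1] can be checked directly from each column's minimum and maximum. The sketch below uses a hypothetical helper, `columns_outside_unit_range`, on toy data; it makes no assumption about how the transformer performs its scaling.

```python
import pandas as pd


def columns_outside_unit_range(df: pd.DataFrame) -> list:
    """Return numeric columns with any value outside [0, 1] (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return [
        col
        for col in numeric.columns
        if numeric[col].min() < 0 or numeric[col].max() > 1
    ]


# Toy frame: one scaled column, one unscaled column, one non-numeric column.
toy = pd.DataFrame(
    {
        "scaled": [0.0, 0.5, 1.0],
        "unscaled": [5868.0, 10951.0, 16087.0],
        "label": ["a", "b", "c"],
    }
)

print(columns_outside_unit_range(toy))
```

In the notebook, `columns_outside_unit_range(df)` would list any numeric columns that still need scaling. Identifier-like integer columns would show up as well and can be ignored.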


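The categorical features are now one-hot encoded. Note, however, that several numeric columns in the sample above (for example, breast_density, subtlety, mean_pixel_value, and std_pixel_value) still show their original scales rather than values in [0,1], so the normalization should be confirmed before modeling. This completes the data transformation section. On to exploratory data analysis... finally!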