# CBIS-DDSM Data Transformation
This section prepares the CBIS-DDSM case metadata for the modeling conducted as part of the multivariate exploratory data analysis.

The CBIS-DDSM has 45 calcification types, 9 calcification distributions, 20 mass shapes, and 19 mass margins. Many of these are compound categories that combine two or more unary categories. For instance, the calcification type 'ROUND_AND_REGULAR-PUNCTATE-AMORPHOUS' indicates three different types: 'ROUND_AND_REGULAR', 'PUNCTATE', and 'AMORPHOUS'. Separating these compound categories into their unary components substantially reduces the number of categories to analyze. More importantly, it aligns our data and analyses with the common morphological taxonomy. So, task one is to extract the unary morphological categories from the compound classifications.
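As a minimal sketch of this extraction step, the snippet below splits compound labels on the hyphen and produces one indicator column per unary category. It is not the `CaseTransformer` implementation, which is not shown here. The toy rows and the `calc_type_` prefix are hypothetical, and the sketch assumes that '-' is the only delimiter and that no unary category name contains a hyphen.

```python
import pandas as pd

# Hypothetical toy rows; the real values live in the calc_type column.
toy = pd.DataFrame(
    {
        "calc_type": [
            "ROUND_AND_REGULAR-PUNCTATE-AMORPHOUS",
            "PUNCTATE",
            "AMORPHOUS-PLEOMORPHIC",
        ]
    }
)

# Split each compound label on '-' and create one 0/1 column per unary category.
# Assumption: '-' separates categories and never appears inside a category name.
unary = toy["calc_type"].str.get_dummies(sep="-").add_prefix("calc_type_")

print(unary)
```

Each row now carries a 1 in every unary column that its compound label contained, so a three-part label such as the first row sets three indicators.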

Once the unary categories are extracted, all nominal variables will be dummy encoded as binary indicators taking values in {0, 1}. Then, all model variables will be standardized to zero mean and unit variance.
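The encoding and scaling steps described above can be sketched with plain pandas. This is an illustration under stated assumptions rather than the transformer's actual code: the toy values are made up, dropping the first dummy level is a choice made here for the example, and the population standard deviation (`ddof=0`) is assumed for scaling.

```python
import pandas as pd

# Hypothetical toy data using column names that appear in the case metadata.
toy = pd.DataFrame(
    {
        "image_view": ["CC", "MLO", "CC", "MLO"],
        "breast_density": [1, 2, 3, 2],
        "subtlety": [3, 4, 5, 4],
    }
)

# Dummy encode the nominal variable as 0/1 indicators.
# Assumption: the first level is dropped to avoid a redundant column.
encoded = pd.get_dummies(toy, columns=["image_view"], drop_first=True, dtype=float)

# Standardize every column to zero mean and unit variance.
# Note: a constant column would have zero variance and would need special handling.
standardized = (encoded - encoded.mean()) / encoded.std(ddof=0)

print(standardized.round(3))
```

Standardizing the indicator columns along with the numeric ones puts every model variable on the same scale, at the cost of the indicators no longer reading as plain 0/1 values.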


In [1]:
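import os
# When executed from inside the Jupyter Book directory, move up two levels
# so that the relative data paths below resolve from the project root.
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath("../.."))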

import pandas as pd
import numpy as np

from bcd.prep.case import CaseTransformer
pd.options.display.max_columns = 99

In [2]:
# Source: cleaned case metadata. Destination: transformed ("cooked") case metadata.
FP_CASES_CLEAN = "data/meta/2_clean/cases.csv"
FP_CASES_COOKED = "data/meta/3_cooked/cases.csv"

In [4]:
x4mr = CaseTransformer(source_fp=FP_CASES_CLEAN, destination_fp=FP_CASES_COOKED)
df = x4mr.transform()
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3566 entries, 0 to 3565
Data columns (total 53 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   patient_id                   3566 non-null   object
 1   breast_density               3566 non-null   int64 
 2   left_or_right_breast         3566 non-null   object
 3   image_view                   3566 non-null   object
 4   abnormality_id               3566 non-null   int64 
 5   abnormality_type             3566 non-null   object
 6   calc_type                    3566 non-null   object
 7   calc_distribution            3566 non-null   object
 8   assessment                   3566 non-null   int64 
 9   pathology                    3566 non-null   object
 10  subtlety                     3566 non-null   int64 
 11  fileset                      3566 non-null   object
 12  mass_shape                   3566 non-null   object
 13  mass_margins                 3566