# Data Cleaning
The prior data quality analysis revealed several data anomalies requiring attention. 

1. Cases with invalid values for subtlety and breast density.
2. Missing calcification type, calcification distribution, mass shape, and mass margins data. 
3. Categories that differ in spelling but have the same meaning.

The corresponding data cleaning tasks are listed in {numref}`data_cleaning_tasks`:

```{table} Data Cleaning Tasks
:name: data_cleaning_tasks
| # | Task |
|---|------|
| 1 | Replace invalid values for breast density with NA for imputation. |
| 2 | Replace invalid values for subtlety with NA for imputation. |
| 3 | Replace calcification types 'LUCENT_CENTER' with 'LUCENT_CENTERED' and 'PLEOMORPHIC-PLEOMORPHIC' with 'PLEOMORPHIC'. |
| 4 | Impute missing values using Multiple Imputation by Chained Equations (MICE). |
| 5 | Conduct random inspection of imputations. |
| 6 | Conduct final data quality analysis. |
```


In [None]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))

In [1]:

import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from studioai.preprocessing.encode import RankFrequencyEncoder

from bcd.data_prep.clean import CBISImputer
from bcd.data.dataset import CBISDataset

pd.options.display.max_rows = 999
pd.options.display.max_columns = 50

In [2]:
FP_STAGED = "data/meta/2_staged/cbis.csv"
FP_CLEAN = "data/meta/3_clean/cbis.csv"

## Load Data

In [3]:
df = pd.read_csv(FP_STAGED)
df_orig = df.copy()

## Breast Density
Replace invalid values for breast density with NA for downstream imputation.

In [4]:
# Set invalid values for breast_density to NA (np.nan; the np.NAN alias was removed in NumPy 2.0)
df['breast_density'] = df['breast_density'].replace(0, np.nan)

## Subtlety
Replace invalid values for subtlety with NA.

In [5]:
# Set invalid values for subtlety to NA
df['subtlety'] = df['subtlety'].replace(0, np.nan)
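
The two replacements above target the value 0 specifically. As a minimal sketch of a more general variant, the helper below masks anything outside an allowed range. The ranges (1 to 4 for breast density, 1 to 5 for subtlety) and the names `VALID_RANGES` and `mask_invalid` are assumptions for illustration; they are not defined in this notebook.

```python
import pandas as pd

# Assumed valid ranges (not stated in this section): breast density 1-4, subtlety 1-5.
VALID_RANGES = {"breast_density": (1, 4), "subtlety": (1, 5)}


def mask_invalid(df: pd.DataFrame, valid_ranges: dict) -> pd.DataFrame:
    """Return a copy of df in which out-of-range values are replaced with NaN."""
    out = df.copy()
    for col, (low, high) in valid_ranges.items():
        # Series.where keeps values where the condition holds and sets the rest to NaN.
        out[col] = out[col].where(out[col].between(low, high))
    return out


# Toy data, not taken from the CBIS-DDSM metadata.
toy = pd.DataFrame({"breast_density": [0, 2, 4], "subtlety": [3, 0, 5]})
masked = mask_invalid(toy, VALID_RANGES)
print(masked)
```

Masking by range would also catch invalid values other than 0, should any appear in a future version of the data.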

## Category Alignment
Align categories that have the same meaning, but differ in spelling only.

In [6]:
df.loc[df['calc_type'] == 'LUCENT_CENTER', 'calc_type'] = 'LUCENT_CENTERED'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER-DYSTROPHIC', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED-DYSTROPHIC'
df.loc[df['calc_type'] == 'PUNCTATE-LUCENT_CENTER', 'calc_type'] = 'PUNCTATE-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'VASCULAR-COARSE-LUCENT_CENTER-ROUND_AND_REGULAR-PUNCTATE', 'calc_type'] = 'VASCULAR-COARSE-LUCENT_CENTERED-ROUND_AND_REGULAR-PUNCTATE'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'LUCENT_CENTER-PUNCTATE', 'calc_type'] = 'LUCENT_CENTERED-PUNCTATE'
df.loc[df['calc_type'] == 'COARSE-ROUND_AND_REGULAR-LUCENT_CENTER', 'calc_type'] = 'COARSE-ROUND_AND_REGULAR-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER-PUNCTATE', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED-PUNCTATE'
df.loc[df['calc_type'] == 'COARSE-LUCENT_CENTER', 'calc_type'] = 'COARSE-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'PLEOMORPHIC-PLEOMORPHIC', 'calc_type'] = 'PLEOMORPHIC'
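
The explicit assignments above could possibly be condensed into a single pattern-based replacement. The sketch below is one way to do that, shown on toy data; `align_calc_type` is a hypothetical helper, not part of the `bcd` package. Note that it differs slightly from the cell above: it rewrites `LUCENT_CENTER` inside any compound category, not only the ones enumerated.

```python
import pandas as pd


def align_calc_type(calc_type: pd.Series) -> pd.Series:
    """Align spelling variants of calcification types (hypothetical helper)."""
    # The negative lookahead (?!ED) leaves values already spelled LUCENT_CENTERED untouched.
    aligned = calc_type.str.replace(r"LUCENT_CENTER(?!ED)", "LUCENT_CENTERED", regex=True)
    # Series.replace matches whole values only, so other PLEOMORPHIC compounds are unaffected.
    return aligned.replace("PLEOMORPHIC-PLEOMORPHIC", "PLEOMORPHIC")


# Toy values for illustration.
toy = pd.Series(
    [
        "LUCENT_CENTER",
        "PUNCTATE-LUCENT_CENTER",
        "LUCENT_CENTERED-PUNCTATE",
        "PLEOMORPHIC-PLEOMORPHIC",
        "NOT APPLICABLE",
        None,
    ]
)
aligned = align_calc_type(toy)
print(aligned.tolist())
```

Missing values pass through unchanged, so the replacement can run before imputation.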

## Impute Missing Values
Multiple Imputation by Chained Equations (MICE) estimates missing values through an iterative series of predictive models, each of which predicts the missing entries of one variable from the other variables in the dataset. For this, we'll use our CBISImputer, which wraps scikit-learn's IterativeImputer implementation of MICE.
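
To make the idea concrete, here is a minimal sketch of chained-equation imputation using scikit-learn's `IterativeImputer` directly on a toy numeric dataset. It does not reproduce `CBISImputer`, whose internals are not shown in this section; the toy columns `x1` and `x2` are made up for illustration. With its default settings, `IterativeImputer` returns a single imputed dataset rather than multiple imputations.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: x2 is strongly related to x1, so x1 is informative about missing x2 values.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
toy = pd.DataFrame({"x1": x1, "x2": x2})

# Remove 20 values of x2 at random.
toy_missing = toy.copy()
missing_idx = toy_missing.sample(20, random_state=0).index
toy_missing.loc[missing_idx, "x2"] = np.nan

# Each feature with missing values is regressed on the other features, round-robin.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy_missing), columns=toy.columns)

# Compare the imputed values with the values that were removed.
mae = (imputed.loc[missing_idx, "x2"] - toy.loc[missing_idx, "x2"]).abs().mean()
print(f"Mean absolute error on imputed cells: {mae:.3f}")
```

Categorical variables such as calcification type would need to be encoded numerically before a step like this, which is presumably part of what the wrapper handles.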



In [7]:

null_mask = df.isnull().any(axis=1)
df_missing = df[null_mask]
msg = f"There are {df_missing.shape[0]} rows (approximately {round(df_missing.shape[0] / df_orig.shape[0] * 100,1)}% of the rows) with missing data in the total dataset."
print(msg)

There are 527 rows (approximately 14.8% of the rows) with missing data in the total dataset.


In [8]:
imp = CBISImputer(random_state=5)
_ = imp.fit(df=df)
df_clean = imp.transform(df=df)
# Aspect ratio somehow gets corrupted during iterative imputation. IterativeImputer is
# experimental, and the issue of it modifying or imputing non-NA values has been raised.
# Until the issue is isolated and resolved, recompute aspect ratio from the image dimensions.
df_clean["aspect_ratio"] = df_clean["cols"] / df_clean["rows"]
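
Since the imputer may alter values that were never missing, a before/after check can help isolate which columns are affected. The sketch below counts, per column, cells that were observed before imputation but differ afterwards. `changed_observed_cells` is a hypothetical helper and the frames are toy data; in this notebook the inputs would be `df` and `df_clean`, assuming they share the same index and columns.

```python
import numpy as np
import pandas as pd


def changed_observed_cells(before: pd.DataFrame, after: pd.DataFrame) -> pd.Series:
    """Count, per column, cells that were non-missing in `before` but differ in `after`."""
    observed = before.notna()
    differs = pd.DataFrame(False, index=before.index, columns=before.columns)
    for col in before.columns:
        both_numeric = pd.api.types.is_numeric_dtype(before[col]) and pd.api.types.is_numeric_dtype(after[col])
        if both_numeric:
            # Compare floats with a tolerance to ignore representation noise.
            differs[col] = ~np.isclose(before[col].astype(float), after[col].astype(float), equal_nan=True)
        else:
            differs[col] = before[col].astype(str) != after[col].astype(str)
    # Only count differences in cells that held a value before imputation.
    return (differs & observed).sum()


# Toy example: one legitimately imputed cell and one corrupted observed cell.
before = pd.DataFrame(
    {
        "rows": [100, 200, 300],
        "cols": [50, 100, 150],
        "aspect_ratio": [0.5, 0.5, 0.5],
        "calc_distribution": ["CLUSTERED", None, "SEGMENTAL"],
    }
)
after = before.copy()
after.loc[1, "calc_distribution"] = "LINEAR"  # imputed value, not counted
after.loc[2, "aspect_ratio"] = 0.9  # observed value changed, counted
changes = changed_observed_cells(before, after)
print(changes)
```

Any column with a non-zero count would be a candidate for the kind of recomputation applied to `aspect_ratio` above.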

With that, let's save the data.

In [9]:
os.makedirs(os.path.dirname(FP_CLEAN), exist_ok=True)
df_clean.to_csv(FP_CLEAN, index=False)

## Random Sample Inspection

Let's take a random sample of the cases that had missing data and compare them before imputation (first table) and after imputation (second table).

In [10]:
sample_cases = df_missing['mmg_id'].sample(5, random_state=72)
df_missing.loc[df_missing['mmg_id'].isin(sample_cases)]
df_clean.loc[df_clean['mmg_id'].isin(sample_cases)]

Unnamed: 0,patient_id,breast_density,laterality,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,...,rows,cols,aspect_ratio,size,file_size,min_pixel_value,max_pixel_value,mean_pixel_value,std_pixel_value,filepath
350,P_00476,2.0,RIGHT,MLO,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,...,5768,4032,0.7,23256576,46514254,0,65535,14498.95,16846.46,data/image/0_raw/CBIS-DDSM/Calc-Training_P_004...
351,P_00476,2.0,RIGHT,MLO,2,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,...,5768,4032,0.7,23256576,46514254,0,65535,14498.95,16846.46,data/image/0_raw/CBIS-DDSM/Calc-Training_P_004...
352,P_00476,2.0,RIGHT,MLO,3,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,...,5768,4032,0.7,23256576,46514254,0,65535,14498.95,16846.46,data/image/0_raw/CBIS-DDSM/Calc-Training_P_004...
564,P_00705,2.0,LEFT,CC,1,calcification,LUCENT_CENTERED-PUNCTATE,,2,BENIGN_WITHOUT_CALLBACK,...,5720,4088,0.71,23383360,46767816,0,65535,9003.68,12763.37,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
565,P_00705,2.0,LEFT,CC,2,calcification,LUCENT_CENTERED-PUNCTATE,,2,BENIGN_WITHOUT_CALLBACK,...,5720,4088,0.71,23383360,46767816,0,65535,9003.68,12763.37,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
566,P_00705,2.0,LEFT,MLO,1,calcification,LUCENT_CENTERED-PUNCTATE,,2,BENIGN_WITHOUT_CALLBACK,...,5840,3888,0.67,22705920,45412938,0,65535,11039.29,14755.17,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
567,P_00705,2.0,LEFT,MLO,2,calcification,LUCENT_CENTERED-PUNCTATE,,2,BENIGN_WITHOUT_CALLBACK,...,5840,3888,0.67,22705920,45412938,0,65535,11039.29,14755.17,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
775,P_00991,2.0,LEFT,MLO,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,...,4520,3032,0.67,13704640,27410378,0,65535,15022.1,16641.97,data/image/0_raw/CBIS-DDSM/Calc-Training_P_009...
2852,P_01398,2.0,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,2,BENIGN_WITHOUT_CALLBACK,...,4560,3120,0.68,14227200,28455496,0,65535,9499.79,14133.45,data/image/0_raw/CBIS-DDSM/Mass-Training_P_013...


Unnamed: 0,patient_id,breast_density,laterality,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,...,rows,cols,aspect_ratio,size,file_size,min_pixel_value,max_pixel_value,mean_pixel_value,std_pixel_value,filepath
350,P_00476,2.0,RIGHT,MLO,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,...,5768,4032,0.7,23256576,46514254,0,65535,14499.0,16846.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_004...
351,P_00476,2.0,RIGHT,MLO,2,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,...,5768,4032,0.7,23256576,46514254,0,65535,14499.0,16846.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_004...
352,P_00476,2.0,RIGHT,MLO,3,calcification,LUCENT_CENTERED,LINEAR-SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,...,5768,4032,0.7,23256576,46514254,0,65535,14499.0,16846.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_004...
564,P_00705,2.0,LEFT,CC,1,calcification,LUCENT_CENTERED-PUNCTATE,REGIONAL-REGIONAL,2,BENIGN_WITHOUT_CALLBACK,...,5720,4088,0.71,23383360,46767816,0,65535,9004.0,12763.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
565,P_00705,2.0,LEFT,CC,2,calcification,LUCENT_CENTERED-PUNCTATE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,...,5720,4088,0.71,23383360,46767816,0,65535,9004.0,12763.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
566,P_00705,2.0,LEFT,MLO,1,calcification,LUCENT_CENTERED-PUNCTATE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,...,5840,3888,0.67,22705920,45412938,0,65535,11039.0,14755.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
567,P_00705,2.0,LEFT,MLO,2,calcification,LUCENT_CENTERED-PUNCTATE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,...,5840,3888,0.67,22705920,45412938,0,65535,11039.0,14755.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_007...
775,P_00991,2.0,LEFT,MLO,1,calcification,LUCENT_CENTERED,CLUSTERED,2,BENIGN_WITHOUT_CALLBACK,...,4520,3032,0.67,13704640,27410378,0,65535,15022.0,16642.0,data/image/0_raw/CBIS-DDSM/Calc-Training_P_009...
2852,P_01398,2.0,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,2,BENIGN_WITHOUT_CALLBACK,...,4560,3120,0.68,14227200,28455496,0,65535,9500.0,14133.0,data/image/0_raw/CBIS-DDSM/Mass-Training_P_013...
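
With wide tables like these, the imputed cells can be hard to spot by eye. As a sketch, pandas' `DataFrame.compare` returns only the rows and columns that differ between two identically labeled frames. The frames below are toy stand-ins; the `mmg_id` values are invented and only the `calc_distribution` entries mirror the output above.

```python
import pandas as pd

# Toy stand-ins for the sampled rows before and after imputation.
before = pd.DataFrame(
    {
        "mmg_id": ["case_a", "case_b"],  # hypothetical identifiers
        "calc_type": ["LUCENT_CENTERED", "LUCENT_CENTERED"],
        "calc_distribution": [None, "CLUSTERED"],
    },
    index=[350, 775],
)
after = before.copy()
after.loc[350, "calc_distribution"] = "SEGMENTAL"

# compare() labels the original values 'self' and the new values 'other'.
diff = before.compare(after)
print(diff)
```

In the notebook, the same call could be applied to the sampled rows of `df_missing` and `df_clean`, provided both selections carry the same index and columns.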


## Data Quality Analysis 2.0
Ok, let's have another go at the data quality analysis.

### Completeness Analysis

In [11]:
ds = CBISDataset(filepath=FP_CLEAN)
dqcc = ds.dqa.analyze_completeness()
print(dqcc.summary)



                          Completeness                          
                         Dataset | CBIS-DDSM
                         Records | 3568
                Complete Records | 3568
             Record Completeness | 1.0
                     Data Values | 96336
            Complete Data Values | 96336
         Data Value Completeness | 1.0




We're complete.
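
For readers without access to `CBISDataset`, the completeness figures above can be approximated with plain pandas. The sketch below assumes that record completeness is the share of rows with no missing values and that data value completeness is the share of non-missing cells; the exact definitions used by `analyze_completeness` are not shown here.

```python
import pandas as pd


def completeness(df: pd.DataFrame) -> dict:
    """Summarize row-level and cell-level completeness (assumed definitions)."""
    complete_records = int(df.notna().all(axis=1).sum())
    complete_values = int(df.notna().sum().sum())
    return {
        "records": len(df),
        "complete_records": complete_records,
        "record_completeness": round(complete_records / len(df), 3),
        "data_values": df.size,
        "complete_data_values": complete_values,
        "data_value_completeness": round(complete_values / df.size, 3),
    }


# Toy data with one missing cell.
toy = pd.DataFrame({"calc_type": ["PLEOMORPHIC", None, "PUNCTATE"], "subtlety": [3, 4, 5]})
summary = completeness(toy)
print(summary)
```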

### Uniqueness Analysis

In [12]:
dqcu = ds.dqa.analyze_uniqueness()
print(dqcu.summary)



                           Uniqueness                           
                         Dataset | CBIS-DDSM
                         Records | 3568
                  Unique Records | 3568
               Record Uniqueness | 1.0
                     Data Values | 96336
              Unique Data Values | 22457
           Data Value Uniqueness | 0.233




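We're unique: every record is distinct. The lower data value uniqueness is expected, since categorical values repeat across records.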
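
The reported data value uniqueness is the ratio 22457 / 96336 ≈ 0.233. How `analyze_uniqueness` counts unique data values is not shown in this section; the sketch below assumes one plausible definition, the sum of distinct values per column divided by the total number of cells, alongside the share of non-duplicated rows.

```python
import pandas as pd


def uniqueness(df: pd.DataFrame) -> dict:
    """Summarize row-level and cell-level uniqueness (assumed definitions)."""
    unique_records = int((~df.duplicated()).sum())
    # Assumption: unique data values are counted per column, then summed.
    unique_values = int(df.nunique(dropna=True).sum())
    return {
        "record_uniqueness": round(unique_records / len(df), 3),
        "data_value_uniqueness": round(unique_values / df.size, 3),
    }


# Toy data: all rows are distinct, but values repeat within each column.
toy = pd.DataFrame(
    {"laterality": ["LEFT", "RIGHT", "LEFT", "RIGHT"], "abnormality_id": [1, 1, 2, 2]}
)
result = uniqueness(toy)
print(result)
```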

### Validity Analysis

In [13]:
dqcv = ds.dqa.analyze_validity()
print(dqcv.summary)



                            Validity                            
                         Dataset | CBIS-DDSM
                         Records | 3568
                   Valid Records | 3568
                 Record Validity | 1.0
                     Data Values | 96336
               Valid Data Values | 96336
             Data Value Validity | 1.0




We're valid. That concludes this data cleaning section.