# Data Cleaning
The prior data quality analysis revealed several data anomalies requiring attention. 

1. Cases with invalid values for subtlety and breast density.
2. Missing calcification type, calcification distribution, mass shape, and mass margins data. 
3. Categories that differ in spelling but have the same meaning.

In this brief notebook, we address these issues, producing a clean dataset for analysis. Concretely, our task list is as follows:

| # | Task                                                                         |
|---|------------------------------------------------------------------------------|
| 1 | Replace invalid values for breast density with NA for imputation.            |
| 2 | Set mass and calcification morphological values to NA where they do not apply. |
| 3 | Remove cases with invalid values for subtlety.                               |
| 4 | Replace calcification type 'LUCENT_CENTER' with 'LUCENT_CENTERED'.           |
| 5 | Impute missing values using Multiple Imputation by Chained Equations (MICE). |
| 6 | Conduct random inspection of imputations.                                    |
| 7 | Conduct final data quality analysis.                                         |

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
import numpy as np

from bcd.data.prep.case import CaseImputer
pd.options.display.max_rows = 999

from bcd.data.dqa.case import MassCaseDQA, CalcCaseDQA
from bcd.data import CALC_VARIABLES, MASS_VARIABLES

In [2]:
FP_IN = "data/staged/case.csv"
FP_CLEAN_CASES = "data/clean/cases.csv"
FP_CLEAN_CALC =  "data/clean/calc_cases.csv"
FP_CLEAN_MASS =  "data/clean/mass_cases.csv"
CASES_TO_REMOVE = ['P_00710_RIGHT_mass_MLO_1', 'P_00710_RIGHT_mass_CC_1']

## Load and Save Original Data
First, let's load the data and create an interim dataset for processing.

In [3]:
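# Load data and keep an untouched copy of the original version.
df_orig = pd.read_csv(FP_IN)
# Copy, so that in-place edits to df_interim do not also change df_orig.
df_interim = df_orig.copy()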
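
The cells that follow modify `df_interim` in place, so it matters whether it shares its data with `df_orig`. Here is a minimal sketch of the difference between plain assignment and `.copy()`; the toy frame and its values are hypothetical stand-ins for the staged case data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the staged case data (hypothetical values).
df_toy = pd.DataFrame({"breast_density": [0, 2, 3]})

alias = df_toy            # plain assignment: same object, nothing is copied
snapshot = df_toy.copy()  # independent copy of the data

# Modify a column through the alias.
alias["breast_density"] = alias["breast_density"].replace(0, np.nan)

print(alias is df_toy)                          # the alias is the very same object
print(df_toy["breast_density"].isna().sum())    # the "original" changed as well
print(snapshot["breast_density"].isna().sum())  # the copy is untouched
```

Only the copy preserves the data as it was loaded.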

## Task 1: Breast Density
Replace invalid values for breast density with NA for downstream imputation.

In [4]:
# Set invalid values for breast_density to NA
df_interim['breast_density'] = df_interim['breast_density'].replace(0, np.nan)
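
The cell above targets the single invalid value found in this dataset (0). A more general variant, sketched here under the assumption that the valid breast density categories are 1 through 4, keeps only values from an allowed set and turns everything else into NaN. The constant `VALID_DENSITIES` and the toy data are illustrative, not part of the `bcd` package.

```python
import pandas as pd

# Assumption: valid breast density categories are 1 through 4.
VALID_DENSITIES = [1, 2, 3, 4]

toy = pd.DataFrame({"breast_density": [0, 1, 4, 3, 0]})

# where() keeps values where the condition holds and sets the rest to NaN.
is_valid = toy["breast_density"].isin(VALID_DENSITIES)
toy["breast_density"] = toy["breast_density"].where(is_valid)

print(toy["breast_density"].isna().sum())
```

This form would also catch invalid codes other than 0, should any appear in a later version of the data.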

## Task 2: Morphology Not Applicable
Set morphological variables to NA where they do not apply. For instance, mass shape doesn't apply to calcification cases.

In [5]:
# Set calcification and mass data to NOT APPLICABLE where appropriate.
df_interim.loc[df_interim['abnormality_type'] == 'calcification', 'mass_shape'] = 'NA'
df_interim.loc[df_interim['abnormality_type'] == 'calcification', 'mass_margins'] = 'NA'
df_interim.loc[df_interim['abnormality_type'] == 'mass', 'calc_type'] = 'NA'
df_interim.loc[df_interim['abnormality_type'] == 'mass', 'calc_distribution'] = 'NA'
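
The same rule can be written as one mapping from abnormality type to the columns that do not apply. This is a sketch on a hypothetical toy frame; the dictionary name `NOT_APPLICABLE` is ours. Note that the string `'NA'` is an ordinary value, so `isnull()` does not flag these cells as missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "abnormality_type": ["calcification", "mass"],
    "calc_type": ["COARSE", None],
    "calc_distribution": [None, None],
    "mass_shape": [None, "OVAL"],
    "mass_margins": [None, None],
})

# Columns that do not apply to each abnormality type.
NOT_APPLICABLE = {
    "calcification": ["mass_shape", "mass_margins"],
    "mass": ["calc_type", "calc_distribution"],
}

for abnormality, columns in NOT_APPLICABLE.items():
    toy.loc[toy["abnormality_type"] == abnormality, columns] = "NA"

# Only the cells that apply but have no value are still counted as missing.
print(toy.isnull().sum().sum())
```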

## Task 3: Case Removal
Remove cases with invalid values for subtlety.

In [6]:
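# Drop the cases listed in CASES_TO_REMOVE; copy so later assignments act on an independent frame.
df_interim = df_interim.loc[~df_interim['case_id'].isin(CASES_TO_REMOVE)].copy()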
msg = f"A total of {df_orig.shape[0] - df_interim.shape[0]} rows were removed from the dataset, leaving {df_interim.shape[0]} rows."
print(msg)

A total of 2 rows were removed from the dataset, leaving 3566 rows.


## Task 4: Category Alignment
Align categories that have the same meaning and differ only in spelling.

In [7]:
df_interim.loc[df_interim['calc_type'] == 'LUCENT_CENTER', 'calc_type'] = 'LUCENT_CENTERED'
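
If more spelling variants turn up, a mapping keeps the alignment in one place. The sketch below uses a hypothetical `CALC_TYPE_ALIASES` dictionary and toy values. `Series.replace` with a dictionary matches whole values only, so compound categories such as `ROUND_AND_REGULAR-LUCENT_CENTER` are left as they are, just as with the `.loc` assignment above.

```python
import pandas as pd

# Hypothetical mapping from variant spellings to the canonical category.
CALC_TYPE_ALIASES = {"LUCENT_CENTER": "LUCENT_CENTERED"}

toy = pd.DataFrame({
    "calc_type": [
        "LUCENT_CENTER",
        "COARSE",
        "LUCENT_CENTERED",
        "ROUND_AND_REGULAR-LUCENT_CENTER",
    ]
})

# Exact-match replacement: only cells equal to a key are changed.
toy["calc_type"] = toy["calc_type"].replace(CALC_TYPE_ALIASES)
print(toy["calc_type"].tolist())
```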

## Task 5: Impute Missing Values
Multiple Imputation by Chained Equations (MICE) is a robust method for estimating missing values in datasets. The procedure imputes missing data through an iterative series of predictive models, each of which estimates the missing values of one variable from the other variables in the dataset. For this, we'll use our `CaseImputer`, which wraps scikit-learn's `IterativeImputer` implementation of MICE.
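
To make the idea concrete, here is a minimal sketch of scikit-learn's `IterativeImputer` on toy numeric data. It is not the `CaseImputer` implementation: how `CaseImputer` encodes the categorical case variables is not shown in this notebook, and the data, seed, and `max_iter` value below are illustrative assumptions.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric data: the second column is strongly related to the first.
rng = np.random.default_rng(5)
x1 = rng.normal(size=50)
X = np.column_stack([
    x1,
    2 * x1 + rng.normal(scale=0.1, size=50),
    rng.normal(size=50),
])

# Knock out a few entries to simulate missing data.
X_missing = X.copy()
X_missing[[3, 10, 20], 1] = np.nan
X_missing[[7, 30], 0] = np.nan

# Each column with missing values is modeled from the other columns, in rounds.
imputer = IterativeImputer(max_iter=10, random_state=5)
X_imputed = imputer.fit_transform(X_missing)

observed = ~np.isnan(X_missing)
print(np.isnan(X_imputed).sum())                              # no missing values remain
print(np.allclose(X_imputed[observed], X_missing[observed]))  # observed values are unchanged
```

Only the missing entries are filled in; values that were observed pass through untouched.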

First, let's capture the missing values as we will inspect them after imputation.

In [8]:
# Grab rows with missing data
null_mask = df_interim.isnull().any(axis=1)
df_missing = df_interim[null_mask]
msg = f"There are {df_missing.shape[0]} rows (approximately {round(df_missing.shape[0] / df_interim.shape[0] * 100,1)}% of the dataset) with missing data in the dataset."
print(msg)

There are 525 rows (approximately 14.7% of the dataset) with missing data in the dataset.


In [9]:
imp = CaseImputer(random_state=5)
imp.fit(df=df_interim)
df_clean = imp.transform(df=df_interim)

<bcd.data.prep.case.CaseImputer at 0x7fad9df158d0>

With that, let's save the data.

In [10]:
os.makedirs(os.path.dirname(FP_CLEAN_CASES), exist_ok=True)
df_clean.to_csv(FP_CLEAN_CASES, index=False)
df_calc = df_clean[CALC_VARIABLES].loc[df_clean['abnormality_type'] == 'calcification']
df_mass = df_clean[MASS_VARIABLES].loc[df_clean['abnormality_type'] == 'mass']
df_calc.to_csv(FP_CLEAN_CALC, index=False)
df_mass.to_csv(FP_CLEAN_MASS, index=False)

## Task 6: Random Sample Inspection

Let's take a random sample of the cases that had missing data and compare them before and after imputation.

In [11]:
sample_cases = df_missing['case_id'].sample(10)
df_missing.loc[df_missing['case_id'].isin(sample_cases)]
df_clean.loc[df_clean['case_id'].isin(sample_cases)]

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
276,P_00394,2.0,LEFT,MLO,2,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,5,train,,,P_00394_LEFT_calcification_MLO_2,False
389,P_00510,2.0,RIGHT,CC,3,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,4,train,,,P_00510_RIGHT_calcification_CC_3,False
402,P_00529,3.0,LEFT,MLO,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_00529_LEFT_calcification_MLO_1,False
403,P_00529,3.0,RIGHT,CC,1,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_00529_RIGHT_calcification_CC_1,False
413,P_00538,3.0,RIGHT,CC,1,calcification,COARSE,,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_00538_RIGHT_calcification_CC_1,False
820,P_01049,3.0,LEFT,CC,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_01049_LEFT_calcification_CC_1,False
1047,P_01336,2.0,RIGHT,MLO,3,calcification,COARSE-ROUND_AND_REGULAR-LUCENT_CENTER,,2,BENIGN_WITHOUT_CALLBACK,5,train,,,P_01336_RIGHT_calcification_MLO_3,False
1417,P_01838,1.0,RIGHT,CC,1,calcification,EGGSHELL,,2,BENIGN_WITHOUT_CALLBACK,5,train,,,P_01838_RIGHT_calcification_CC_1,False
1551,P_00038,2.0,RIGHT,MLO,2,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,5,test,,,P_00038_RIGHT_calcification_MLO_2,False
3072,P_01717,1.0,LEFT,CC,1,mass,,,3,BENIGN_WITHOUT_CALLBACK,4,train,FOCAL_ASYMMETRIC_DENSITY,,P_01717_LEFT_mass_CC_1,False


Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
276,P_00394,2.0,LEFT,MLO,2,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,,,P_00394_LEFT_calcification_MLO_2,False
389,P_00510,2.0,RIGHT,CC,3,calcification,VASCULAR,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,train,,,P_00510_RIGHT_calcification_CC_3,False
402,P_00529,3.0,LEFT,MLO,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_00529_LEFT_calcification_MLO_1,False
403,P_00529,3.0,RIGHT,CC,1,calcification,ROUND_AND_REGULAR-LUCENT_CENTER,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_00529_RIGHT_calcification_CC_1,False
413,P_00538,3.0,RIGHT,CC,1,calcification,COARSE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_00538_RIGHT_calcification_CC_1,False
820,P_01049,3.0,LEFT,CC,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,train,,,P_01049_LEFT_calcification_CC_1,False
1047,P_01336,2.0,RIGHT,MLO,3,calcification,COARSE-ROUND_AND_REGULAR-LUCENT_CENTER,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,,,P_01336_RIGHT_calcification_MLO_3,False
1417,P_01838,1.0,RIGHT,CC,1,calcification,EGGSHELL,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,,,P_01838_RIGHT_calcification_CC_1,False
1551,P_00038,2.0,RIGHT,MLO,2,calcification,VASCULAR,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,test,,,P_00038_RIGHT_calcification_MLO_2,False
3070,P_01717,1.0,LEFT,CC,1,mass,,,3,BENIGN_WITHOUT_CALLBACK,4,train,FOCAL_ASYMMETRIC_DENSITY,SPICULATED,P_01717_LEFT_mass_CC_1,False
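
Comparing the two tables by eye gets tedious, and the row index can differ between them (case `P_01717_LEFT_mass_CC_1` appears at index 3072 before imputation and 3070 after). A possible helper, sketched here on hypothetical toy frames, aligns the two versions on `case_id` and lists only the cells that were filled in.

```python
import numpy as np
import pandas as pd

# Toy "before" and "after" frames (hypothetical values), deliberately in different row order.
before = pd.DataFrame({
    "case_id": ["A", "B"],
    "calc_distribution": [np.nan, "CLUSTERED"],
    "mass_margins": [np.nan, np.nan],
})
after = pd.DataFrame({
    "case_id": ["B", "A"],
    "calc_distribution": ["CLUSTERED", "SEGMENTAL"],
    "mass_margins": ["SPICULATED", "OBSCURED"],
})

# Align on case_id rather than on the row index.
b = before.set_index("case_id").sort_index()
a = after.set_index("case_id").sort_index()

# Keep only the cells that were missing before imputation.
imputed = a.where(b.isnull())

# One row per imputed cell.
report = (
    imputed.reset_index()
    .melt(id_vars="case_id", var_name="variable", value_name="imputed_value")
    .dropna(subset=["imputed_value"])
    .sort_values(["case_id", "variable"])
    .reset_index(drop=True)
)
print(report)
```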


## Task 7: Data Quality Analysis 2.0
Ok, let's have another go at the data quality analysis.

### Completeness Analysis

In [12]:
dqc = CalcCaseDQA(filepath=FP_CLEAN_CALC)
dqm = MassCaseDQA(filepath=FP_CLEAN_MASS)
dqcc = dqc.analyze_completeness()
dqmc = dqm.analyze_completeness()

In [13]:
print(dqcc.summary)
print(dqmc.summary)



                          Completeness                          
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                         Records | 1872
                Complete Records | 1872
             Record Completeness | 1.0
                     Data Values | 26208
            Complete Data Values | 26208
         Data Value Completeness | 1.0




                          Completeness                          
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                         Records | 1694
                Complete Records | 1694
             Record Completeness | 1.0
                     Data Values | 23716
            Complete Data Values | 23716
         Data Value Completeness | 1.0




Both datasets are complete: every record and every data value is present.
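
For reference, the ratios in these summaries can be reproduced with plain pandas. The function below is a sketch of one way to compute them, not the `CalcCaseDQA` / `MassCaseDQA` implementation; its name and return keys are ours. The totals above are consistent with 14 values per record (1872 × 14 = 26208 and 1694 × 14 = 23716).

```python
import numpy as np
import pandas as pd

def completeness(df: pd.DataFrame) -> dict:
    """Return record-level and value-level completeness as plain ratios."""
    n_records = len(df)
    n_values = df.size  # rows x columns
    complete_records = int(df.notnull().all(axis=1).sum())
    complete_values = int(df.notnull().sum().sum())
    return {
        "records": n_records,
        "complete_records": complete_records,
        "record_completeness": round(complete_records / n_records, 3),
        "data_values": n_values,
        "complete_data_values": complete_values,
        "data_value_completeness": round(complete_values / n_values, 3),
    }

# Toy frame with one missing value (hypothetical data).
toy = pd.DataFrame({"subtlety": [3, 4, 5], "mass_shape": ["OVAL", np.nan, "ROUND"]})
result = completeness(toy)
print(result)
```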

### Uniqueness Analysis

In [14]:
dqcu = dqc.analyze_uniqueness()
dqmu = dqm.analyze_uniqueness()
print(dqcu.summary)
print(dqmu.summary)



                           Uniqueness                           
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                         Records | 1872
                  Unique Records | 1872
               Record Uniqueness | 1.0
                     Data Values | 26208
              Unique Data Values | 2712
           Data Value Uniqueness | 0.103




                           Uniqueness                           
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                         Records | 1694
                  Unique Records | 1694
               Record Uniqueness | 1.0
                     Data Values | 23716
              Unique Data Values | 2658
           Data Value Uniqueness | 0.112




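Every record is unique. The low data value uniqueness is expected, since most variables are categorical and repeat across cases.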
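
A rough sketch of how such figures might be computed follows. Record uniqueness is the share of distinct rows. For data value uniqueness we assume, without having checked the DQA classes, that "Unique Data Values" is the sum of distinct values per column; treat that as a guess, not as the classes' definition.

```python
import pandas as pd

def uniqueness(df: pd.DataFrame) -> dict:
    """Return record-level and value-level uniqueness as plain ratios."""
    n_records = len(df)
    unique_records = len(df.drop_duplicates())
    n_values = df.size
    # Assumption: unique data values = sum of distinct values in each column.
    unique_values = int(df.nunique().sum())
    return {
        "record_uniqueness": round(unique_records / n_records, 3),
        "unique_data_values": unique_values,
        "data_value_uniqueness": round(unique_values / n_values, 3),
    }

# Toy frame with one duplicated record (hypothetical data).
toy = pd.DataFrame({
    "case_id": ["a", "b", "b"],
    "fileset": ["train", "test", "test"],
})
result = uniqueness(toy)
print(result)
```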

### Validity Analysis

In [15]:
dqcv = dqc.analyze_validity()
dqmv = dqm.analyze_validity()
print(dqcv.summary)
print(dqmv.summary)



                            Validity                            
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                         Records | 1872
                   Valid Records | 1872
                 Record Validity | 1.0
                     Data Values | 26208
               Valid Data Values | 26208
             Data Value Validity | 1.0




                            Validity                            
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                         Records | 1694
                   Valid Records | 1694
                 Record Validity | 1.0
                     Data Values | 23716
               Valid Data Values | 23716
             Data Value Validity | 1.0




All records and data values in both datasets are valid. This concludes the data cleaning exercise.