# Data Cleaning
The prior data quality analysis revealed several data anomalies requiring attention. 

1. Cases with invalid values for subtlety and breast density.
2. Missing calcification type, calcification distribution, mass shape, and mass margins data. 
3. Categories that have different spelling, but the same meaning. 


The following tasks address these anomalies:

| # | Task |
|---|------|
| 1 | Replace invalid values for breast density with NA for imputation. |
| 2 | Set mass and calcification morphological values to NOT APPLICABLE where they do not apply. |
| 3 | Remove cases with invalid values for subtlety. |
| 4 | Replace calcification type 'LUCENT_CENTER' with 'LUCENT_CENTERED', and 'PLEOMORPHIC-PLEOMORPHIC' with 'PLEOMORPHIC'. |
| 5 | Impute missing values using Multiple Imputation by Chained Equations (MICE). |
| 6 | Convert data to the appropriate data types. |
| 7 | Conduct random inspection of imputations. |
| 8 | Conduct final case data quality analysis. |

Once the case dataset has been cleaned, we will merge it with the DICOM Metadata dataset.

In [4]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
import numpy as np

from bcd.data.prep.meta.case import CaseImputer
pd.options.display.max_rows = 999

from bcd.data.quality.case import CaseDQA

In [28]:
FP_STAGED_CASES = "data/staged/cases.csv"
FP_STAGED_DICOM = "data/staged/dicom.csv"

FP_CLEAN_CASES = "data/clean/cases.csv"
FP_CLEAN_DICOM = "data/clean/dicom.csv"

FP_CASE_SERIES_XREF = "data/staged/case_series_xref.csv"
CASES_TO_REMOVE = ['P_00710_RIGHT_mass_MLO_1', 'P_00710_RIGHT_mass_CC_1']

## Load Data

In [12]:
df = pd.read_csv(FP_STAGED_CASES)
df_orig = df.copy()

## Breast Density
Replace invalid values for breast density with NA for downstream imputation.

In [13]:
# Set invalid values for breast_density to NA
df['breast_density'] = df['breast_density'].replace(0, np.nan)
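
As a small illustration on toy data (the values below are made up and not drawn from the case dataset), replacing the invalid code 0 with NaN turns an integer column into a float column. This is why the conversion to integer and category types is deferred until after imputation.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; 0 stands in for an invalid breast density code.
toy = pd.DataFrame({"breast_density": [1, 0, 3, 4, 0]})

# Replace the invalid code with NaN so an imputer can fill it in later.
toy["breast_density"] = toy["breast_density"].replace(0, np.nan)

# NaN cannot be stored in an integer column, so pandas upcasts to float64.
n_missing = int(toy["breast_density"].isna().sum())
print(toy["breast_density"].dtype, n_missing)
```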

## Morphology Not Applicable
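Set morphological variables to NOT APPLICABLE where they do not apply. For instance, mass shape doesn't apply to calcification cases. Marking these values explicitly keeps them from being treated as missing data during imputation.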

In [14]:
# Set calcification and mass morphology to NOT APPLICABLE where appropriate.
df.loc[df['abnormality_type'] == 'calcification', 'mass_shape'] = 'NOT APPLICABLE'
df.loc[df['abnormality_type'] == 'calcification', 'mass_margins'] = 'NOT APPLICABLE'
df.loc[df['abnormality_type'] == 'mass', 'calc_type'] = 'NOT APPLICABLE'
df.loc[df['abnormality_type'] == 'mass', 'calc_distribution'] = 'NOT APPLICABLE'
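
The same rule can also be expressed as a mapping from abnormality type to the columns that do not apply to it. The sketch below runs on toy data; the `not_applicable` mapping is a name introduced here for illustration and is not part of the `bcd` package.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "abnormality_type": ["calcification", "mass"],
    "calc_type": ["AMORPHOUS", None],
    "calc_distribution": ["CLUSTERED", None],
    "mass_shape": [None, "OVAL"],
    "mass_margins": [None, "SPICULATED"],
})

# Hypothetical mapping: abnormality type -> columns that do not apply to it.
not_applicable = {
    "calcification": ["mass_shape", "mass_margins"],
    "mass": ["calc_type", "calc_distribution"],
}

# One assignment per abnormality type covers all of its non-applicable columns.
for abnormality, columns in not_applicable.items():
    toy.loc[toy["abnormality_type"] == abnormality, columns] = "NOT APPLICABLE"

print(toy)
```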

## Case Removal
Remove cases with invalid values for subtlety.

In [15]:
df = df.loc[~df['case_id'].isin(CASES_TO_REMOVE)]
msg = f"A total of {df_orig.shape[0] - df.shape[0]} rows were removed from the dataset, leaving {df.shape[0]} rows."
print(msg)

A total of 2 rows were removed from the dataset, leaving 3566 rows.
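
Here the cases to remove were hard-coded from the prior data quality analysis. As a sketch of how such a list could be derived from a rule instead, the toy example below assumes that valid subtlety values run from 1 to 5. That range is an assumption made for illustration and is not stated in this notebook.

```python
import pandas as pd

# Toy data for illustration only; case IDs are placeholders.
toy = pd.DataFrame({
    "case_id": ["A", "B", "C", "D"],
    "subtlety": [3, 0, 5, 4],
})

# ASSUMPTION: valid subtlety values are the integers 1 through 5.
valid = toy["subtlety"].between(1, 5)

# Collect the offending case IDs, then keep only the valid rows.
cases_to_remove = toy.loc[~valid, "case_id"].tolist()
toy_kept = toy.loc[valid].reset_index(drop=True)

print(cases_to_remove, len(toy_kept))
```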


## Category Alignment
Align categories that have the same meaning, but differ in spelling only.

In [16]:
df.loc[df['calc_type'] == 'LUCENT_CENTER', 'calc_type'] = 'LUCENT_CENTERED'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER-DYSTROPHIC', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED-DYSTROPHIC'
df.loc[df['calc_type'] == 'PUNCTATE-LUCENT_CENTER', 'calc_type'] = 'PUNCTATE-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'VASCULAR-COARSE-LUCENT_CENTER-ROUND_AND_REGULAR-PUNCTATE', 'calc_type'] = 'VASCULAR-COARSE-LUCENT_CENTERED-ROUND_AND_REGULAR-PUNCTATE'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'LUCENT_CENTER-PUNCTATE', 'calc_type'] = 'LUCENT_CENTERED-PUNCTATE'
df.loc[df['calc_type'] == 'COARSE-ROUND_AND_REGULAR-LUCENT_CENTER', 'calc_type'] = 'COARSE-ROUND_AND_REGULAR-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER-PUNCTATE', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED-PUNCTATE'
df.loc[df['calc_type'] == 'COARSE-LUCENT_CENTER', 'calc_type'] = 'COARSE-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'PLEOMORPHIC-PLEOMORPHIC', 'calc_type'] = 'PLEOMORPHIC'
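
Because the compound labels are hyphen-separated lists of components, the replacements above can also be sketched as a component-wise normalization. The helper `align_calc_type` below is hypothetical and shown on toy data. Note that it drops any repeated component, not only 'PLEOMORPHIC-PLEOMORPHIC', so it is slightly broader than the explicit replacements.

```python
import pandas as pd


def align_calc_type(value):
    """Normalize one calc_type label component by component (hypothetical helper)."""
    if pd.isna(value):
        return value  # leave missing values untouched for imputation
    parts = []
    for part in value.split("-"):
        if part == "LUCENT_CENTER":
            part = "LUCENT_CENTERED"
        if part not in parts:  # drop repeated components, preserving order
            parts.append(part)
    return "-".join(parts)


# Toy labels for illustration only.
toy = pd.Series([
    "LUCENT_CENTER",
    "ROUND_AND_REGULAR-LUCENT_CENTER-DYSTROPHIC",
    "PLEOMORPHIC-PLEOMORPHIC",
    "NOT APPLICABLE",
    None,
])
aligned = toy.map(align_calc_type)
print(aligned.tolist())
```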

## Impute Missing Values
Multiple Imputation by Chained Equations (MICE) is a robust method of estimating missing values in datasets. The procedure imputes missing data through an iterative series of predictive models, each of which estimates the missing values of one variable using the other variables in the dataset. For this, we'll use our CaseImputer, which wraps scikit-learn's IterativeImputer implementation of MICE.
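
To make the idea concrete, here is a minimal sketch of chained-equation imputation with scikit-learn's `IterativeImputer` on toy data. It is not the CaseImputer implementation, whose internals are not shown in this notebook. In particular, encoding a nominal category as integer codes and rounding the estimates back to the nearest code is a simplifying assumption made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy data for illustration only.
toy = pd.DataFrame({
    "breast_density": [1, 2, np.nan, 3, 2, 4],
    "assessment": [2, 4, 3, 4, 2, 5],
    "calc_distribution": ["CLUSTERED", "LINEAR", "CLUSTERED", np.nan, "SEGMENTAL", "LINEAR"],
})

# Encode the categorical column as codes, keeping missing values as NaN.
cat = toy["calc_distribution"].astype("category")
codes = cat.cat.codes.astype(float)
codes = codes.where(codes >= 0)  # pandas marks missing categories with code -1

numeric = toy[["breast_density", "assessment"]].astype(float)
numeric["calc_distribution"] = codes

# Each column with missing values is modeled from the other columns, in turn.
imputer = IterativeImputer(random_state=5, max_iter=10)
imputed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)

# Estimates are continuous, so round and clip them back to observed values.
toy_imputed = toy.copy()
low, high = toy["breast_density"].min(), toy["breast_density"].max()
toy_imputed["breast_density"] = imputed["breast_density"].round().clip(low, high).astype(int)

labels = cat.cat.categories.to_numpy()
idx = imputed["calc_distribution"].round().clip(0, len(labels) - 1).astype(int)
toy_imputed["calc_distribution"] = labels[idx.to_numpy()]

print(toy_imputed)
```

Observed values pass through unchanged; only the cells that were NaN receive model-based estimates.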

First, let's capture the missing values as we will inspect them after imputation.

In [17]:
# Grab rows with missing data
null_mask = df.isnull().any(axis=1)
df_missing = df[null_mask]
msg = f"There are {df_missing.shape[0]} rows (approximately {round(df_missing.shape[0] / df_orig.shape[0] * 100,1)}% of the dataset) with missing data in the dataset."
print(msg)

There are 525 rows (approximately 14.7% of the dataset) with missing data in the dataset.


In [18]:
imp = CaseImputer(random_state=5)
imp.fit(df=df)
df_clean = imp.transform(df=df)

<bcd.data.prep.meta.case.CaseImputer at 0x7fc099f62fe0>

With the missing values imputed, let's convert the data to the appropriate types and then save it.

## Data Types

In [19]:
dtypes = {"patient_id": "object",
"breast_density": "int32",
"left_or_right_breast": "category",
"image_view": "category",
"abnormality_id": "int32",
"abnormality_type": "category",
"calc_type": "category",
"calc_distribution": "category",
"assessment": "int32",
"pathology": "category",
"subtlety": "int32",
"fileset": "object",
"mass_shape": "category",
"mass_margins": "category",
"case_id": "object",
"cancer": "bool",
}
dtypes_ordinal = {
"breast_density": "category",
"assessment": "category",
"subtlety": "category",
}
df_clean = df_clean.astype(dtypes)
df_clean = df_clean.astype(dtypes_ordinal)
df_clean.info()
df_clean.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3566 entries, 0 to 3565
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   patient_id            3566 non-null   object  
 1   breast_density        3566 non-null   category
 2   left_or_right_breast  3566 non-null   category
 3   image_view            3566 non-null   category
 4   abnormality_id        3566 non-null   int32   
 5   abnormality_type      3566 non-null   category
 6   calc_type             3566 non-null   category
 7   calc_distribution     3566 non-null   category
 8   assessment            3566 non-null   category
 9   pathology             3566 non-null   category
 10  subtlety              3566 non-null   category
 11  fileset               3566 non-null   object  
 12  mass_shape            3566 non-null   category
 13  mass_margins          3566 non-null   category
 14  case_id               3566 non-null   object  
 15  canc

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
0,P_00005,3,RIGHT,CC,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00005_RIGHT_calcification_CC_1,True
1,P_00005,3,RIGHT,MLO,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00005_RIGHT_calcification_MLO_1,True
2,P_00007,4,LEFT,CC,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00007_LEFT_calcification_CC_1,False
3,P_00007,4,LEFT,MLO,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00007_LEFT_calcification_MLO_1,False
4,P_00008,1,LEFT,CC,1,calcification,AMORPHOUS,REGIONAL,2,BENIGN_WITHOUT_CALLBACK,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00008_LEFT_calcification_CC_1,False
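
Casting to `"category"` as above yields unordered categoricals. If a downstream step needs the ordering of an ordinal variable, one option is an ordered `CategoricalDtype`. The sketch below uses toy values and derives the levels from the data, which is an assumption made for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy data for illustration only.
toy = pd.DataFrame({"subtlety": [3, 5, 4, 3]})

# Build an ordered dtype from the sorted observed levels.
levels = sorted(toy["subtlety"].unique())
subtlety_dtype = CategoricalDtype(categories=levels, ordered=True)
toy["subtlety"] = toy["subtlety"].astype(subtlety_dtype)

# Ordered categoricals support comparisons against one of their levels.
below_four = toy["subtlety"] < 4
print(toy["subtlety"].cat.ordered, below_four.tolist())
```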


In [20]:
os.makedirs(os.path.dirname(FP_CLEAN_CASES), exist_ok=True)
df_clean.to_csv(FP_CLEAN_CASES, index=False)

## Random Sample Inspection

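Let's take a random sample of the cases that had missing data and compare them before and after imputation. The cleaned dataset has a fresh index, so row labels in the second output can differ from the first by the number of removed rows; match the rows on case_id.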

In [21]:
sample_cases = df_missing['case_id'].sample(10)
df_missing.loc[df_missing['case_id'].isin(sample_cases)]
df_clean.loc[df_clean['case_id'].isin(sample_cases)]

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
622,P_00780,1.0,RIGHT,CC,4,calcification,FINE_LINEAR_BRANCHING,,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00780_RIGHT_calcification_CC_4,False
977,P_01274,2.0,RIGHT,CC,1,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_01274_RIGHT_calcification_CC_1,False
1168,P_01457,2.0,RIGHT,CC,1,calcification,COARSE,,2,BENIGN_WITHOUT_CALLBACK,3,train,NOT APPLICABLE,NOT APPLICABLE,P_01457_RIGHT_calcification_CC_1,False
1265,P_01628,3.0,LEFT,MLO,2,calcification,ROUND_AND_REGULAR-LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01628_LEFT_calcification_MLO_2,False
1355,P_01769,3.0,RIGHT,CC,3,calcification,ROUND_AND_REGULAR-LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01769_RIGHT_calcification_CC_3,False
1391,P_01823,3.0,RIGHT,CC,2,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01823_RIGHT_calcification_CC_2,False
1637,P_00562,2.0,LEFT,CC,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,4,test,NOT APPLICABLE,NOT APPLICABLE,P_00562_LEFT_calcification_CC_1,False
1688,P_00905,3.0,RIGHT,MLO,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,3,test,NOT APPLICABLE,NOT APPLICABLE,P_00905_RIGHT_calcification_MLO_1,False
2840,P_01386,1.0,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,3,BENIGN_WITHOUT_CALLBACK,5,train,ARCHITECTURAL_DISTORTION,,P_01386_LEFT_mass_CC_1,False
2852,P_01398,2.0,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,2,BENIGN_WITHOUT_CALLBACK,5,train,ASYMMETRIC_BREAST_TISSUE,,P_01398_LEFT_mass_CC_1,False


Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
622,P_00780,1,RIGHT,CC,4,calcification,FINE_LINEAR_BRANCHING,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00780_RIGHT_calcification_CC_4,False
977,P_01274,2,RIGHT,CC,1,calcification,VASCULAR,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_01274_RIGHT_calcification_CC_1,False
1168,P_01457,2,RIGHT,CC,1,calcification,COARSE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,train,NOT APPLICABLE,NOT APPLICABLE,P_01457_RIGHT_calcification_CC_1,False
1265,P_01628,3,LEFT,MLO,2,calcification,ROUND_AND_REGULAR-LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01628_LEFT_calcification_MLO_2,False
1355,P_01769,3,RIGHT,CC,3,calcification,ROUND_AND_REGULAR-LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01769_RIGHT_calcification_CC_3,False
1391,P_01823,3,RIGHT,CC,2,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01823_RIGHT_calcification_CC_2,False
1637,P_00562,2,LEFT,CC,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,test,NOT APPLICABLE,NOT APPLICABLE,P_00562_LEFT_calcification_CC_1,False
1688,P_00905,3,RIGHT,MLO,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,test,NOT APPLICABLE,NOT APPLICABLE,P_00905_RIGHT_calcification_MLO_1,False
2838,P_01386,1,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,3,BENIGN_WITHOUT_CALLBACK,5,train,ARCHITECTURAL_DISTORTION,SPICULATED,P_01386_LEFT_mass_CC_1,False
2850,P_01398,2,LEFT,CC,1,mass,NOT APPLICABLE,NOT APPLICABLE,2,BENIGN_WITHOUT_CALLBACK,5,train,ASYMMETRIC_BREAST_TISSUE,SPICULATED,P_01398_LEFT_mass_CC_1,False
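
Comparing the two outputs by eye works for a small sample. As a sketch of a more targeted check, the toy example below aligns the before and after frames on `case_id` and lists only the cells that were filled in. The frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy "before" and "after" frames, keyed on case_id so index shifts don't matter.
before = pd.DataFrame({
    "case_id": ["A", "B", "C"],
    "calc_distribution": ["CLUSTERED", np.nan, "LINEAR"],
    "mass_margins": ["NOT APPLICABLE", "NOT APPLICABLE", np.nan],
}).set_index("case_id")

after = pd.DataFrame({
    "case_id": ["A", "B", "C"],
    "calc_distribution": ["CLUSTERED", "SEGMENTAL", "LINEAR"],
    "mass_margins": ["NOT APPLICABLE", "NOT APPLICABLE", "SPICULATED"],
}).set_index("case_id")

# Keep only the cells that were missing before imputation.
was_missing = before.isna()
imputed_values = after.where(was_missing).stack().dropna()

print(imputed_values)
```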


## Data Quality Analysis 2.0
Ok, let's have another go at the data quality analysis.

### Completeness Analysis

In [24]:
dqc = CaseDQA(filepath=FP_CLEAN_CASES)
dqcc = dqc.analyze_completeness()
print(dqcc.summary)



                          Completeness                          
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3566
                Complete Records | 3566
             Record Completeness | 1.0
                     Data Values | 57056
            Complete Data Values | 57056
         Data Value Completeness | 1.0




We're complete.
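
The internals of CaseDQA are not shown here, but the summary is consistent with straightforward definitions: 3566 records times 16 columns gives the 57056 data values reported. A minimal sketch of those two completeness ratios on toy data, under that assumption:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "case_id": ["A", "B", "C", "D"],
    "subtlety": [3, np.nan, 5, 4],
    "calc_type": ["AMORPHOUS", "COARSE", np.nan, "VASCULAR"],
})

# Record completeness: share of rows with no missing values.
n_records = len(toy)
n_complete_records = int(toy.notna().all(axis=1).sum())
record_completeness = n_complete_records / n_records

# Data value completeness: share of individual cells that are not missing.
n_values = toy.size
n_complete_values = int(toy.notna().sum().sum())
value_completeness = n_complete_values / n_values

print(record_completeness, round(value_completeness, 3))
```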

### Uniqueness Analysis

In [25]:
dqcu = dqc.analyze_uniqueness()
print(dqcu.summary)



                           Uniqueness                           
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3566
                  Unique Records | 3566
               Record Uniqueness | 1.0
                     Data Values | 57056
              Unique Data Values | 5262
           Data Value Uniqueness | 0.092




We're unique.

### Validity Analysis

In [26]:
dqcv = dqc.analyze_validity()
print(dqcv.summary)



                            Validity                            
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3566
                   Valid Records | 3566
                 Record Validity | 1.0
                     Data Values | 57056
               Valid Data Values | 57056
             Data Value Validity | 1.0




We're valid. Now, let's merge the case and DICOM metadata.

## Merge Case and DICOM Data

In [30]:
df_case = df_clean[["case_id", "breast_density", "subtlety", "abnormality_type", "calc_type", "calc_distribution", "mass_shape", "mass_margins", "assessment", "fileset", "cancer"]]

dicom = pd.read_csv(FP_STAGED_DICOM)
xref = pd.read_csv(FP_CASE_SERIES_XREF)
df_case = df_case.merge(xref, on='case_id')
dicom = dicom.merge(df_case, on=['series_uid', 'series_description'])
dicom.to_csv(FP_CLEAN_DICOM, index=False)
dicom.info()
dicom.sample(n=10)
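
The merges above use the default inner join, which silently drops rows without a match. As a sketch of how one might audit a merge like this, the toy example below uses the `validate` and `indicator` options of `DataFrame.merge`. The frames, the `rows` column, and the key values are placeholders made up for illustration.

```python
import pandas as pd

# Toy case and DICOM frames; values are placeholders.
cases = pd.DataFrame({
    "series_uid": ["1.1", "1.2"],
    "series_description": ["full mammogram images"] * 2,
    "case_id": ["A", "B"],
})
dicom_toy = pd.DataFrame({
    "series_uid": ["1.1", "1.2", "1.3"],
    "series_description": ["full mammogram images"] * 3,
    "rows": [4000, 4100, 4200],
})

# A left join keeps every DICOM row; validate raises if case keys are duplicated;
# indicator adds a _merge column recording where each row came from.
merged = dicom_toy.merge(
    cases,
    on=["series_uid", "series_description"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Series that found no matching case would be dropped by an inner join.
unmatched = merged.loc[merged["_merge"] == "left_only", "series_uid"].tolist()
print(unmatched)
```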

This concludes the case and DICOM metadata cleaning exercise.