# Data Cleaning
The prior data quality analysis revealed several data anomalies requiring attention. 

1. Cases with invalid values for subtlety and breast density.
2. Missing calcification type, calcification distribution, mass shape, and mass margins data. 
3. Categories with different spellings but the same meaning.

We address these anomalies with the following tasks:

| # | Task |
|---|------|
| 1 | Replace invalid values for breast density with NA for imputation. |
| 2 | Set mass and calcification morphological values to 'NOT APPLICABLE' where they do not apply. |
| 3 | Remove cases with invalid values for subtlety. |
| 4 | Replace calcification type 'LUCENT_CENTER' with 'LUCENT_CENTERED', and 'PLEOMORPHIC-PLEOMORPHIC' with 'PLEOMORPHIC'. |
| 5 | Impute missing values using Multiple Imputation by Chained Equations (MICE). |
| 6 | Convert data to the appropriate data types. |
| 7 | Conduct random inspection of imputations. |
| 8 | Conduct final case data quality analysis. |

Once the case dataset has been cleaned, we will merge it with the DICOM Metadata dataset.

In [1]:
import os

# When running from the Jupyter Book directory, move up two levels
# so that the relative data paths below resolve.
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import numpy as np
import pandas as pd

from bcd.analyze.prep.case import CaseImputer
from bcd.analyze.quality.case import CaseDQA

pd.options.display.max_rows = 999

In [2]:
FP_STAGED_CASES = "data/meta/1_staged/cases.csv"
FP_STAGED_DICOM = "data/meta/1_staged/dicom.csv"

FP_CLEAN_CASES = "data/meta/2_clean/cases.csv"
FP_CLEAN_DICOM = "data/meta/2_clean/dicom.csv"

FP_CASE_SERIES_XREF = "data/meta/1_staged/case_series_xref.csv"
CASES_TO_REMOVE = ['P_00710_RIGHT_mass_MLO_1', 'P_00710_RIGHT_mass_CC_1']

## Load Data

In [3]:
df = pd.read_csv(FP_STAGED_CASES)
df_orig = df.copy()

## Breast Density
Replace invalid values for breast density with NA for downstream imputation.

In [4]:
# Set invalid values for breast_density to NA
df['breast_density'] = df['breast_density'].replace(0, np.nan)
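
If invalid codes other than 0 can occur, the same step can be written against an explicit set of valid values. The following is a minimal sketch on toy data, assuming the valid breast density categories are 1 through 4; the `toy` frame and the `VALID_DENSITY` name are illustrative and not part of the project.

```python
import pandas as pd

# Assumption: valid breast density codes are 1-4; anything else is invalid.
VALID_DENSITY = {1, 2, 3, 4}

# Toy data standing in for the staged cases.
toy = pd.DataFrame({"breast_density": [1, 0, 3, 4, 2, 0]})

# Keep valid codes; every other value becomes NaN for downstream imputation.
toy["breast_density"] = toy["breast_density"].where(
    toy["breast_density"].isin(VALID_DENSITY)
)

n_missing = int(toy["breast_density"].isna().sum())
print(toy["breast_density"].tolist(), n_missing)
```

Introducing NaN converts the integer column to floating point, which is why the integer cast is deferred until after imputation.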

## Morphology Not Applicable
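Set morphological variables to 'NOT APPLICABLE' where they do not apply. For instance, mass shape doesn't apply to calcification cases. Marking these values explicitly keeps them from being treated as missing during imputation.

In [5]:
# Set calcification and mass data to NOT APPLICABLE where appropriate.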
df.loc[df['abnormality_type'] == 'calcification', 'mass_shape'] = 'NOT APPLICABLE'
df.loc[df['abnormality_type'] == 'calcification', 'mass_margins'] = 'NOT APPLICABLE'
df.loc[df['abnormality_type'] == 'mass', 'calc_type'] = 'NOT APPLICABLE'
df.loc[df['abnormality_type'] == 'mass', 'calc_distribution'] = 'NOT APPLICABLE'
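
The four assignments above follow one rule: each abnormality type has a set of columns that do not apply to it. As a minimal sketch, that rule can be expressed as a mapping and applied in a loop. The `toy` frame and the `NOT_APPLICABLE_COLUMNS` name are hypothetical; the column names and labels are those used in this notebook.

```python
import pandas as pd

# Columns that do not apply to each abnormality type (illustrative mapping).
NOT_APPLICABLE_COLUMNS = {
    "calcification": ["mass_shape", "mass_margins"],
    "mass": ["calc_type", "calc_distribution"],
}

# Toy data: one calcification case and one mass case.
toy = pd.DataFrame({
    "abnormality_type": ["calcification", "mass"],
    "calc_type": ["AMORPHOUS", None],
    "calc_distribution": ["CLUSTERED", None],
    "mass_shape": [None, "OVAL"],
    "mass_margins": [None, "SPICULATED"],
})

# Label the non-applicable cells so they are not mistaken for missing data.
for abnormality, columns in NOT_APPLICABLE_COLUMNS.items():
    toy.loc[toy["abnormality_type"] == abnormality, columns] = "NOT APPLICABLE"

print(toy)
```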

## Case Removal
Remove cases with invalid values for subtlety.

In [6]:
# Copy the filtered frame so that later assignments do not act on a slice of the original.
df = df.loc[~df['case_id'].isin(CASES_TO_REMOVE)].copy()
msg = f"A total of {df_orig.shape[0] - df.shape[0]} rows were removed from the dataset, leaving {df.shape[0]} rows."
print(msg)

A total of 2 rows were removed from the dataset, leaving 3566 rows.
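
The case IDs in CASES_TO_REMOVE are hard-coded. As an alternative, the IDs could be derived from the data by filtering on the valid range. The sketch below uses toy data and assumes that subtlety is valid on a scale of 1 through 5 and that 0 is an invalid code; both are assumptions for illustration, as are the `toy` frame and `VALID_SUBTLETY` name.

```python
import pandas as pd

# Assumption: subtlety is rated 1-5; values outside this range are invalid.
VALID_SUBTLETY = range(1, 6)

# Toy data with two hypothetical invalid cases.
toy = pd.DataFrame({
    "case_id": ["case_a", "case_b", "case_c", "case_d"],
    "subtlety": [3, 0, 5, 0],
})

# Flag invalid rows, record their IDs, and drop them.
invalid_mask = ~toy["subtlety"].isin(VALID_SUBTLETY)
cases_to_remove = toy.loc[invalid_mask, "case_id"].tolist()
toy_clean = toy.loc[~invalid_mask].reset_index(drop=True)

print(cases_to_remove, toy_clean.shape[0])
```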


## Category Alignment
Align categories that have the same meaning, but differ in spelling only.

In [7]:
df.loc[df['calc_type'] == 'LUCENT_CENTER', 'calc_type'] = 'LUCENT_CENTERED'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER-DYSTROPHIC', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED-DYSTROPHIC'
df.loc[df['calc_type'] == 'PUNCTATE-LUCENT_CENTER', 'calc_type'] = 'PUNCTATE-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'VASCULAR-COARSE-LUCENT_CENTER-ROUND_AND_REGULAR-PUNCTATE', 'calc_type'] = 'VASCULAR-COARSE-LUCENT_CENTERED-ROUND_AND_REGULAR-PUNCTATE'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'LUCENT_CENTER-PUNCTATE', 'calc_type'] = 'LUCENT_CENTERED-PUNCTATE'
df.loc[df['calc_type'] == 'COARSE-ROUND_AND_REGULAR-LUCENT_CENTER', 'calc_type'] = 'COARSE-ROUND_AND_REGULAR-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'ROUND_AND_REGULAR-LUCENT_CENTER-PUNCTATE', 'calc_type'] = 'ROUND_AND_REGULAR-LUCENT_CENTERED-PUNCTATE'
df.loc[df['calc_type'] == 'COARSE-LUCENT_CENTER', 'calc_type'] = 'COARSE-LUCENT_CENTERED'
df.loc[df['calc_type'] == 'PLEOMORPHIC-PLEOMORPHIC', 'calc_type'] = 'PLEOMORPHIC'
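
The explicit assignments above cover only the compound labels that were listed by hand. A pattern-based replacement is one way to catch any label containing 'LUCENT_CENTER'. The following is a minimal sketch on a toy Series, not the notebook's own method; the pattern matches 'LUCENT_CENTER' with an optional 'ED' suffix, so labels that are already correct are left unchanged.

```python
import pandas as pd

# Toy sample of calcification types, including already-correct labels.
calc_type = pd.Series([
    "LUCENT_CENTER",
    "PUNCTATE-LUCENT_CENTER",
    "LUCENT_CENTER-PUNCTATE",
    "LUCENT_CENTERED",
    "PLEOMORPHIC-PLEOMORPHIC",
    "AMORPHOUS",
])

# Normalize LUCENT_CENTER and LUCENT_CENTERED to LUCENT_CENTERED.
aligned = calc_type.str.replace(
    r"LUCENT_CENTER(?:ED)?", "LUCENT_CENTERED", regex=True
)

# Collapse the duplicated PLEOMORPHIC label (exact match on the whole value).
aligned = aligned.replace({"PLEOMORPHIC-PLEOMORPHIC": "PLEOMORPHIC"})

print(aligned.tolist())
```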

## Impute Missing Values
Multiple Imputation by Chained Equations (MICE) is a principled method for estimating missing values in datasets. The procedure imputes missing data through an iterative series of predictive models, each of which estimates the missing values of one variable from the other variables in the dataset. For this, we'll use our CaseImputer, which wraps scikit-learn's IterativeImputer implementation of MICE.
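
To illustrate the underlying mechanism, here is a minimal sketch that calls scikit-learn's IterativeImputer directly on a small numeric array. It is not the CaseImputer implementation: how CaseImputer encodes categorical variables and configures the estimator is not shown in this notebook, and the toy array below is purely illustrative.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric data with two missing entries.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each feature with missing values is modeled from the other features,
# and the round of models is repeated for up to max_iter iterations.
imputer = IterativeImputer(max_iter=10, random_state=5)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Observed values pass through unchanged; only the missing entries are replaced with model estimates.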

First, let's capture the missing values as we will inspect them after imputation.

In [8]:
# Grab rows with missing data
null_mask = df.isnull().any(axis=1)
df_missing = df[null_mask]
# The percentage is relative to the current dataset, after case removal.
msg = f"There are {df_missing.shape[0]} rows (approximately {round(df_missing.shape[0] / df.shape[0] * 100,1)}% of the dataset) with missing data in the dataset."
print(msg)

There are 525 rows (approximately 14.7% of the dataset) with missing data in the dataset.


In [9]:
imp = CaseImputer(random_state=5)
imp.fit(df=df)
df_clean = imp.transform(df=df)

<bcd.analyze.prep.case.CaseImputer at 0x7f3abf91a230>

## Data Types
With the missing values imputed, let's convert each variable to its appropriate data type and then save the data.

In [10]:
# Target data type for each variable.
dtypes = {
    "patient_id": "object",
    "breast_density": "int32",
    "left_or_right_breast": "category",
    "image_view": "category",
    "abnormality_id": "int32",
    "abnormality_type": "category",
    "calc_type": "category",
    "calc_distribution": "category",
    "assessment": "int32",
    "pathology": "category",
    "subtlety": "int32",
    "fileset": "object",
    "mass_shape": "category",
    "mass_margins": "category",
    "case_id": "object",
    "cancer": "bool",
}
# Ordinal variables are cast to integers first, then to categories.
dtypes_ordinal = {
    "breast_density": "category",
    "assessment": "category",
    "subtlety": "category",
}
df_clean = df_clean.astype(dtypes)
df_clean = df_clean.astype(dtypes_ordinal)
df_clean.info()
df_clean.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3566 entries, 0 to 3565
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   patient_id            3566 non-null   object  
 1   breast_density        3566 non-null   category
 2   left_or_right_breast  3566 non-null   category
 3   image_view            3566 non-null   category
 4   abnormality_id        3566 non-null   int32   
 5   abnormality_type      3566 non-null   category
 6   calc_type             3566 non-null   category
 7   calc_distribution     3566 non-null   category
 8   assessment            3566 non-null   category
 9   pathology             3566 non-null   category
 10  subtlety              3566 non-null   category
 11  fileset               3566 non-null   object  
 12  mass_shape            3566 non-null   category
 13  mass_margins          3566 non-null   category
 14  case_id               3566 non-null   object  
 15  canc

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
0,P_00005,3,RIGHT,CC,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00005_RIGHT_calcification_CC_1,True
1,P_00005,3,RIGHT,MLO,1,calcification,AMORPHOUS,CLUSTERED,3,MALIGNANT,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00005_RIGHT_calcification_MLO_1,True
2,P_00007,4,LEFT,CC,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00007_LEFT_calcification_CC_1,False
3,P_00007,4,LEFT,MLO,1,calcification,PLEOMORPHIC,LINEAR,4,BENIGN,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00007_LEFT_calcification_MLO_1,False
4,P_00008,1,LEFT,CC,1,calcification,AMORPHOUS,REGIONAL,2,BENIGN_WITHOUT_CALLBACK,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00008_LEFT_calcification_CC_1,False
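
Casting with the string "category" produces unordered categoricals. If downstream analysis needs the ordering of the ordinal variables, an ordered CategoricalDtype is one option. The following is a minimal sketch on a toy Series, assuming subtlety takes the values 1 through 5; the `subtlety_dtype` name is illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumption: subtlety levels are 1-5, in increasing order.
subtlety_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

subtlety = pd.Series([3, 5, 4, 1]).astype(subtlety_dtype)

# Ordered categoricals support comparisons and min/max.
at_most_three = subtlety <= 3
highest = subtlety.max()

print(at_most_three.tolist(), highest)
```

Note that a CSV round trip does not preserve categorical dtypes, so a cast like this would need to be reapplied after loading the saved file.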


In [11]:
os.makedirs(os.path.dirname(FP_CLEAN_CASES), exist_ok=True)
df_clean.to_csv(FP_CLEAN_CASES, index=False)

## Random Sample Inspection

Let's take a random sample of the cases that had missing data and compare them before imputation (first table) and after imputation (second table).

In [12]:
sample_cases = df_missing['case_id'].sample(10)
df_missing.loc[df_missing['case_id'].isin(sample_cases)]
df_clean.loc[df_clean['case_id'].isin(sample_cases)]

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
350,P_00476,2.0,RIGHT,MLO,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00476_RIGHT_calcification_MLO_1,False
394,P_00519,3.0,LEFT,MLO,1,calcification,COARSE,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_00519_LEFT_calcification_MLO_1,False
433,P_00555,3.0,RIGHT,CC,1,calcification,COARSE-ROUND_AND_REGULAR-LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00555_RIGHT_calcification_CC_1,False
508,P_00635,3.0,RIGHT,CC,2,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_00635_RIGHT_calcification_CC_2,False
639,P_00799,4.0,RIGHT,CC,1,calcification,COARSE,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_00799_RIGHT_calcification_CC_1,False
855,P_01099,2.0,LEFT,CC,3,calcification,VASCULAR,,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_01099_LEFT_calcification_CC_3,False
1409,P_01838,1.0,LEFT,CC,5,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01838_LEFT_calcification_CC_5,False
1587,P_00214,1.0,LEFT,MLO,1,calcification,LUCENT_CENTERED,,2,BENIGN_WITHOUT_CALLBACK,4,test,NOT APPLICABLE,NOT APPLICABLE,P_00214_LEFT_calcification_MLO_1,False
1707,P_01030,3.0,RIGHT,MLO,2,calcification,ROUND_AND_REGULAR,,2,BENIGN_WITHOUT_CALLBACK,5,test,NOT APPLICABLE,NOT APPLICABLE,P_01030_RIGHT_calcification_MLO_2,False
3005,P_01635,2.0,RIGHT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,5,BENIGN_WITHOUT_CALLBACK,5,train,ARCHITECTURAL_DISTORTION,,P_01635_RIGHT_mass_MLO_1,False


Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
350,P_00476,2,RIGHT,MLO,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,3,train,NOT APPLICABLE,NOT APPLICABLE,P_00476_RIGHT_calcification_MLO_1,False
394,P_00519,3,LEFT,MLO,1,calcification,COARSE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_00519_LEFT_calcification_MLO_1,False
433,P_00555,3,RIGHT,CC,1,calcification,COARSE-ROUND_AND_REGULAR-LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_00555_RIGHT_calcification_CC_1,False
508,P_00635,3,RIGHT,CC,2,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_00635_RIGHT_calcification_CC_2,False
639,P_00799,4,RIGHT,CC,1,calcification,COARSE,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_00799_RIGHT_calcification_CC_1,False
855,P_01099,2,LEFT,CC,3,calcification,VASCULAR,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,train,NOT APPLICABLE,NOT APPLICABLE,P_01099_LEFT_calcification_CC_3,False
1409,P_01838,1,LEFT,CC,5,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,train,NOT APPLICABLE,NOT APPLICABLE,P_01838_LEFT_calcification_CC_5,False
1587,P_00214,1,LEFT,MLO,1,calcification,LUCENT_CENTERED,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,4,test,NOT APPLICABLE,NOT APPLICABLE,P_00214_LEFT_calcification_MLO_1,False
1707,P_01030,3,RIGHT,MLO,2,calcification,ROUND_AND_REGULAR,SEGMENTAL,2,BENIGN_WITHOUT_CALLBACK,5,test,NOT APPLICABLE,NOT APPLICABLE,P_01030_RIGHT_calcification_MLO_2,False
3003,P_01635,2,RIGHT,MLO,1,mass,NOT APPLICABLE,NOT APPLICABLE,5,BENIGN_WITHOUT_CALLBACK,5,train,ARCHITECTURAL_DISTORTION,SPICULATED,P_01635_RIGHT_mass_MLO_1,False
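
A visual comparison works for a small sample. To list every imputed cell programmatically, the two frames can be aligned on case_id rather than on the row index, since the index can differ between them (the same case appears at 3005 in the first table and 3003 in the second). The following is a minimal sketch on toy data; the `before` and `after` frames are hypothetical stand-ins for the data before and after imputation.

```python
import numpy as np
import pandas as pd

# Toy frames keyed by case_id: before and after imputation.
before = pd.DataFrame({
    "case_id": ["A", "B", "C"],
    "calc_distribution": [np.nan, "CLUSTERED", np.nan],
    "mass_margins": ["NOT APPLICABLE"] * 3,
}).set_index("case_id")

after = pd.DataFrame({
    "case_id": ["A", "B", "C"],
    "calc_distribution": ["SEGMENTAL", "CLUSTERED", "LINEAR"],
    "mass_margins": ["NOT APPLICABLE"] * 3,
}).set_index("case_id")

# Cells that were missing before imputation.
was_missing = before.isna()

# Keep only the imputed cells and reshape to one row per imputed value.
imputed = (
    after.where(was_missing)
    .reset_index()
    .melt(id_vars="case_id", var_name="variable", value_name="imputed_value")
    .dropna(subset=["imputed_value"])
)

print(imputed)
```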


## Data Quality Analysis 2.0
Ok, let's have another go at the data quality analysis.

### Completeness Analysis

In [13]:
dqc = CaseDQA(filepath=FP_CLEAN_CASES)
dqcc = dqc.analyze_completeness()
print(dqcc.summary)



                          Completeness                          
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3566
                Complete Records | 3566
             Record Completeness | 1.0
                     Data Values | 57056
            Complete Data Values | 57056
         Data Value Completeness | 1.0




We're complete.
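
For reference, the two completeness ratios can be approximated directly in pandas. This sketch assumes that record completeness is the share of rows with no missing values and that data value completeness is the share of non-missing cells; the exact definitions used by CaseDQA are not shown in this notebook, and the `toy` frame is illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: two of four rows contain a missing value.
toy = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": ["x", None, "z", "w"],
})

# Share of rows in which every value is present.
record_completeness = toy.notna().all(axis=1).mean()

# Share of individual cells that are present.
value_completeness = toy.notna().to_numpy().mean()

print(record_completeness, value_completeness)
```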

### Uniqueness Analysis

In [14]:
dqcu = dqc.analyze_uniqueness()
print(dqcu.summary)



                           Uniqueness                           
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3566
                  Unique Records | 3566
               Record Uniqueness | 1.0
                     Data Values | 57056
              Unique Data Values | 5262
           Data Value Uniqueness | 0.092




We're unique.

### Validity Analysis

In [15]:
dqcv = dqc.analyze_validity()
print(dqcv.summary)



                            Validity                            
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3566
                   Valid Records | 3566
                 Record Validity | 1.0
                     Data Values | 57056
               Valid Data Values | 57056
             Data Value Validity | 1.0




We're valid. Now, let's merge the case and DICOM metadata.

## Merge Case and DICOM Data

In [16]:
df_case = df_clean[["case_id", "breast_density", "subtlety", "abnormality_type", "calc_type", "calc_distribution", "mass_shape", "mass_margins", "assessment", "fileset", "cancer"]]

dicom = pd.read_csv(FP_STAGED_DICOM)
xref = pd.read_csv(FP_CASE_SERIES_XREF)
df_case = df_case.merge(xref, on='case_id')
dicom = dicom.merge(df_case, on=['series_uid', 'series_description'])
dicom.to_csv(FP_CLEAN_DICOM, index=False)
dicom.info()
dicom.sample(n=10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10697 entries, 0 to 10696
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   id                          10697 non-null  object  
 1   series_uid                  10697 non-null  object  
 2   filepath                    10697 non-null  object  
 3   patient_id                  10697 non-null  object  
 4   side                        10697 non-null  object  
 5   view                        10697 non-null  object  
 6   photometric_interpretation  10697 non-null  object  
 7   samples_per_pixel           10697 non-null  int64   
 8   height                      10697 non-null  int64   
 9   width                       10697 non-null  int64   
 10  size                        10697 non-null  int64   
 11  aspect_ratio                10697 non-null  float64 
 12  bit_depth                   10697 non-null  int64   
 13  smallest_image_p

Unnamed: 0,id,series_uid,filepath,patient_id,side,view,photometric_interpretation,samples_per_pixel,height,width,...,breast_density,subtlety,abnormality_type,calc_type,calc_distribution,mass_shape,mass_margins,assessment,fileset,cancer
6799,65d13e14-1b87-4e8b-a03c-5fa21d5bb044,1.3.6.1.4.1.9590.100.1.2.172286569100178001284...,data/image/0_raw/Calc-Training_P_00645_RIGHT_M...,P_00645,RIGHT,MLO,MONOCHROME2,1,5896,3121,...,3,2,calcification,PLEOMORPHIC,LINEAR,NOT APPLICABLE,NOT APPLICABLE,4,train,False
9714,cdd0f5ff-bf22-428b-86e4-22cc847fd8f3,1.3.6.1.4.1.9590.100.1.2.312891195211815450318...,data/image/0_raw/Calc-Training_P_01524_LEFT_CC...,P_01524,LEFT,CC,MONOCHROME2,1,5131,2761,...,3,3,calcification,PLEOMORPHIC,SEGMENTAL,NOT APPLICABLE,NOT APPLICABLE,4,train,True
2889,03752452-1951-4dba-965e-5c7f1e60f279,1.3.6.1.4.1.9590.100.1.2.321225274412631915214...,data/image/0_raw/Mass-Training_P_00543_RIGHT_M...,P_00543,RIGHT,MLO,MONOCHROME2,1,5671,2851,...,2,4,mass,NOT APPLICABLE,NOT APPLICABLE,IRREGULAR,ILL_DEFINED,4,train,True
8711,bbf5603b-faf9-44bd-83c1-9c916354ca61,1.3.6.1.4.1.9590.100.1.2.108500338511121810332...,data/image/0_raw/Calc-Test_P_01643_LEFT_CC_1/0...,P_01643,LEFT,CC,MONOCHROME2,1,4624,2800,...,2,2,calcification,AMORPHOUS,CLUSTERED,NOT APPLICABLE,NOT APPLICABLE,4,test,False
8510,9d621cc2-fc67-479d-a515-015a975707d2,1.3.6.1.4.1.9590.100.1.2.163032446612580148615...,data/image/0_raw/Mass-Training_P_00891_RIGHT_C...,P_00891,RIGHT,CC,MONOCHROME2,1,4728,2480,...,1,5,mass,NOT APPLICABLE,NOT APPLICABLE,IRREGULAR,SPICULATED,5,train,True
6723,0f761c51-205a-4585-a3e4-8822af8074bb,1.3.6.1.4.1.9590.100.1.2.312534744711605661541...,data/image/0_raw/Calc-Training_P_01455_RIGHT_C...,P_01455,RIGHT,CC,MONOCHROME2,1,745,641,...,1,5,calcification,ROUND_AND_REGULAR,CLUSTERED,NOT APPLICABLE,NOT APPLICABLE,5,train,True
1558,b2731b2d-7a54-4ec8-b105-22dbf6e458c9,1.3.6.1.4.1.9590.100.1.2.155962986212990698720...,data/image/0_raw/Calc-Training_P_01567_LEFT_CC...,P_01567,LEFT,CC,MONOCHROME2,1,4536,2912,...,3,3,calcification,DYSTROPHIC,CLUSTERED,NOT APPLICABLE,NOT APPLICABLE,3,train,False
532,2a493b6c-b98b-4ac9-bfb9-0c9d85ff3018,1.3.6.1.4.1.9590.100.1.2.116019378912146865822...,data/image/0_raw/Mass-Test_P_00173_LEFT_MLO/10...,P_00173,LEFT,MLO,MONOCHROME2,1,4464,2848,...,2,5,mass,NOT APPLICABLE,NOT APPLICABLE,LOBULATED,CIRCUMSCRIBED,3,test,False
4086,6920819f-796e-4c4f-98b6-074a56f7dd9a,1.3.6.1.4.1.9590.100.1.2.403913185412613681641...,data/image/0_raw/Mass-Training_P_00328_RIGHT_M...,P_00328,RIGHT,MLO,MONOCHROME2,1,5928,4056,...,3,4,mass,NOT APPLICABLE,NOT APPLICABLE,ARCHITECTURAL_DISTORTION,ILL_DEFINED-SPICULATED,4,train,True
259,50f51c9b-c6f7-43d3-9569-fb2c214000d3,1.3.6.1.4.1.9590.100.1.2.290778243011662918019...,data/image/0_raw/Mass-Training_P_00823_RIGHT_C...,P_00823,RIGHT,CC,MONOCHROME2,1,239,295,...,4,3,mass,NOT APPLICABLE,NOT APPLICABLE,OVAL,OBSCURED,4,train,False
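
Inner merges silently drop rows that have no match, so it can be worth checking how many records fall out at each step. The following is a minimal sketch on toy data using the `indicator` and `validate` options of pandas merge. The toy frames join on series_uid alone for brevity, and whether a one-to-one relationship holds for the real DICOM data is an assumption that would need to be verified.

```python
import pandas as pd

# Toy stand-ins for the case, cross-reference, and DICOM datasets.
cases = pd.DataFrame({"case_id": ["c1", "c2", "c3"], "cancer": [True, False, True]})
xref = pd.DataFrame({"case_id": ["c1", "c1", "c2"], "series_uid": ["s1", "s2", "s3"]})
dicom = pd.DataFrame({"series_uid": ["s1", "s2", "s3", "s4"], "height": [100, 200, 300, 400]})

# A left merge with an indicator column exposes cases without any series.
case_series = cases.merge(xref, on="case_id", how="left", indicator=True)
unmatched_cases = case_series.loc[
    case_series["_merge"] == "left_only", "case_id"
].tolist()
case_series = case_series.loc[case_series["_merge"] == "both"].drop(columns="_merge")

# validate raises MergeError if the keys are not unique on both sides.
merged = dicom.merge(case_series, on="series_uid", how="inner", validate="one_to_one")

print(unmatched_cases, merged.shape[0])
```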


This concludes the case and DICOM metadata cleaning exercise.