# CBIS-DDSM Data Quality Analysis (DQA)
In the previous section, we began the process of transforming the raw CBIS-DDSM case and DICOM data into a structure and format more suitable for analysis. The purpose of this data quality analysis is to identify potential data quality issues, errors, and anomalies that require further preprocessing before the analysis stage.

## Datasets
The scope of this analysis includes the following datasets:

| Dataset             | Description                                        |
|---------------------|----------------------------------------------------|
| Master Case Dataset | Calcification and mass case training and test sets |
| DICOM Dataset       | DICOM Image Metadata                               |

## Approach
Our approach will touch on three dimensions of data quality.

| # | Dimension | Definition | Metric | Metric Definition |
|---|----------------|-------------------------------------|----------------|-------------------------------------|
| 1 | Completeness | Are the data complete? | Completeness Ratio | Ratio of non-null data values / rows over the total number of data values / rows |
| 2 | Uniqueness | Are there duplicate records? | Uniqueness Ratio | Ratio of unique data values / rows over total number of data values / rows |
| 3 | Validity   | Are the values consistent with defined rules? | Validity Ratio | Ratio of valid cells / rows over total number of data cells / rows |

Note that accuracy and relevance, two additional dimensions of data quality, are not covered in this analysis. An accuracy evaluation requires an external source of truth against which the CBIS-DDSM data are compared. As we lack such a data source, we have no basis for evaluating the accuracy of the CBIS-DDSM data collection. With respect to relevance, duplicate and irrelevant data were removed from the data collection during the previous data preparation exercise.
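
As a rough illustration of the completeness and validity ratios defined above, the sketch below computes both at the record and data value level with pandas. The toy data, the `rules` dictionary, and the helper names are assumptions made for this example; they are not the `CaseDQA` implementation, whose internals are not shown here. Missing values are counted as invalid, an assumption that is consistent with the validity results reported later.

```python
import pandas as pd

# Toy data for illustration only (not CBIS-DDSM records).
toy = pd.DataFrame({
    "breast_density": [1, 2, 0, 4],
    "calc_type": ["PLEOMORPHIC", None, "PUNCTATE", "LUCENT_CENTER"],
})

# Hypothetical validity rules: one predicate per column.
rules = {
    "breast_density": lambda s: s.isin([1, 2, 3, 4]),
    "calc_type": lambda s: s.isin(["PLEOMORPHIC", "PUNCTATE", "LUCENT_CENTERED"]),
}


def completeness(df):
    """Return (record completeness, data value completeness)."""
    present = df.notna()
    return present.all(axis=1).mean(), present.to_numpy().mean()


def validity(df, rules):
    """Return (record validity, data value validity); nulls count as invalid."""
    valid = pd.DataFrame({col: rule(df[col]) for col, rule in rules.items()})
    return valid.all(axis=1).mean(), valid.to_numpy().mean()


record_c, value_c = completeness(toy)
record_v, value_v = validity(toy, rules)
print(f"completeness: records={record_c}, values={value_c}")
print(f"validity:     records={record_v}, values={value_v}")
```

Because a record is complete (or valid) only if every one of its values is, the record-level ratio can never exceed the value-level ratio.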

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
from IPython.display import HTML, display_html
import numpy as np

from bcd.analyze.quality.case import CaseDQA
from bcd.analyze.quality.dicom import DicomDQA

pd.set_option('display.max_colwidth', 200)

In [2]:
# Filepaths
fpc = "data/meta/1_staged/cases.csv"
fpd = "data/meta/1_staged/dicom.csv"

## Case Dataset
### Completeness Analysis

In [3]:
dq = CaseDQA(filepath=fpc)
dqc = dq.analyze_completeness()
print(dqc.summary, end=" ")




                          Completeness                          
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3568
                Complete Records | 3043
             Record Completeness | 0.853
                     Data Values | 57088
            Complete Data Values | 56561
         Data Value Completeness | 0.991

 

Record and data value completeness are 0.85 and 0.99 respectively. Let's take a closer look.

In [4]:
dqc.detail

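| Variable | N | Complete | Missing | Completeness |
|----------------------|------|------|-----|------|
| patient_id | 3568 | 3568 | 0 | 1.0 |
| breast_density | 3568 | 3568 | 0 | 1.0 |
| left_or_right_breast | 3568 | 3568 | 0 | 1.0 |
| image_view | 3568 | 3568 | 0 | 1.0 |
| abnormality_id | 3568 | 3568 | 0 | 1.0 |
| abnormality_type | 3568 | 3568 | 0 | 1.0 |
| calc_type | 3568 | 3544 | 24 | 0.99 |
| calc_distribution | 3568 | 3129 | 439 | 0.88 |
| assessment | 3568 | 3568 | 0 | 1.0 |
| pathology | 3568 | 3568 | 0 | 1.0 |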


Missingness for calc_type, mass_shape, and mass_margins is approximately 1%, less than 1%, and 2%, respectively, so we could drop those observations without a considerable loss of data. Missingness for calc_distribution, at about 12% (439 of 3,568 records), deserves a closer look. Let's examine it vis-a-vis the pathology target.

In [5]:
dq.get_incomplete_data(subset='calc_distribution')['pathology'].value_counts().to_frame()

| pathology | count |
|-------------------------|-----|
| BENIGN_WITHOUT_CALLBACK | 434 |
| BENIGN | 5 |


This is interesting. Over 98% of the records with missing calcification distribution (434 of 439) have a pathology of BENIGN_WITHOUT_CALLBACK, and the remainder are BENIGN. This suggests that the calcification distribution data are Missing at Random (MAR). That is, the probability of encountering a missing calcification distribution may be systematically related to a benign pathology. As the missingness seems to be related to *observed* data, multiple imputation by chained equations (MICE) {cite}`schaferMultipleImputationPrimer1999` is a principled method for dealing with such missing data while mitigating data loss and bias in the dataset.
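
One way to probe whether missingness depends on an observed variable is to cross-tabulate a missingness indicator against that variable. The sketch below first recomputes the share quoted above from the reported counts, then shows the cross-tabulation on a small made-up frame. The toy rows and the variable names are assumptions for illustration, not the project's code.

```python
import numpy as np
import pandas as pd

# Counts reported above for records with missing calc_distribution.
reported = pd.Series({"BENIGN_WITHOUT_CALLBACK": 434, "BENIGN": 5})
share = reported / reported.sum()
print(share.round(3))

# Toy data for illustration only (not CBIS-DDSM records).
toy = pd.DataFrame({
    "calc_distribution": ["CLUSTERED", np.nan, np.nan, "LINEAR", np.nan, "SEGMENTAL"],
    "pathology": ["MALIGNANT", "BENIGN_WITHOUT_CALLBACK", "BENIGN_WITHOUT_CALLBACK",
                  "BENIGN", "BENIGN", "MALIGNANT"],
})

# Label each row by whether calc_distribution is missing or observed.
status = (
    toy["calc_distribution"]
    .isna()
    .map({True: "missing", False: "observed"})
    .rename("calc_distribution")
)

# Share of missing vs. observed rows within each pathology class.
missing_rate = pd.crosstab(toy["pathology"], status, normalize="index")
print(missing_rate)
```

A missing rate that differs sharply across pathology classes is consistent with MAR, although it cannot rule out dependence on unobserved values.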

Let's move on to the uniqueness analysis.

### Uniqueness Analysis

In [6]:
dqu = dq.analyze_uniqueness()
print(dqu.summary)



                           Uniqueness                           
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3568
                  Unique Records | 3568
               Record Uniqueness | 1.0
                     Data Values | 57088
              Unique Data Values | 5270
           Data Value Uniqueness | 0.092




No duplicate records exist in the data.
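
For readers who want to reproduce this kind of check directly in pandas, here is a minimal sketch on made-up data. The definition of data value uniqueness used here (distinct values per column, summed over columns, divided by the number of cells) is an assumption; the summary above does not state how `CaseDQA` computes it.

```python
import pandas as pd

# Toy data for illustration only; the last row repeats the third.
toy = pd.DataFrame({
    "case_id": ["A_CC_1", "A_MLO_1", "B_CC_1", "B_CC_1"],
    "image_view": ["CC", "MLO", "CC", "CC"],
})

# Record uniqueness: share of rows that are not repeats of an earlier row.
record_uniqueness = 1 - toy.duplicated().mean()

# Every row whose key appears more than once (keep=False marks all copies).
repeated = toy[toy.duplicated(subset="case_id", keep=False)]

# ASSUMED definition of data value uniqueness (see the note above).
value_uniqueness = toy.nunique().sum() / toy.size

print(record_uniqueness, len(repeated), value_uniqueness)
```

Low data value uniqueness is expected for a dataset dominated by categorical variables, so only record uniqueness bears on duplication.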

### Validity Analysis
Here, we assess the degree to which the datasets contain valid values. 

In [7]:
dqv = dq.analyze_validity()
print(dqv.summary)



                            Validity                            
                         Dataset | Cases
                        Filename | cases.csv
                         Records | 3568
                   Valid Records | 3009
                 Record Validity | 0.843
                     Data Values | 57088
               Valid Data Values | 56360
             Data Value Validity | 0.987




Record and data value validity are 0.84 and 0.99, respectively. Note, missing data will have an impact on record validity. Still, let's take a closer look at the invalid data. 

In [8]:
dqv.detail

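| Variable | N | Valid | Invalid | Validity |
|----------------------|------|------|-----|------|
| case_id | 3568 | 3568 | 0 | 1.0 |
| patient_id | 3568 | 3568 | 0 | 1.0 |
| breast_density | 3568 | 3566 | 2 | 1.0 |
| left_or_right_breast | 3568 | 3568 | 0 | 1.0 |
| image_view | 3568 | 3568 | 0 | 1.0 |
| abnormality_id | 3568 | 3568 | 0 | 1.0 |
| abnormality_type | 3568 | 3568 | 0 | 1.0 |
| calc_type | 3568 | 3347 | 221 | 0.94 |
| calc_distribution | 3568 | 3129 | 439 | 0.88 |
| mass_shape | 3568 | 3564 | 4 | 1.0 |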


The invalidity of calc_distribution and mass_margins is explained by the presence of missing data. Approximately 6% of the observations have invalid calcification types. Let's take a look.

In [9]:
dq.get_invalid_data(subset='calc_type')['calc_type'].unique()

array([nan, 'ROUND_AND_REGULAR-LUCENT_CENTER-DYSTROPHIC',
       'PUNCTATE-LUCENT_CENTER',
       'VASCULAR-COARSE-LUCENT_CENTER-ROUND_AND_REGULAR-PUNCTATE',
       'LUCENT_CENTER', 'ROUND_AND_REGULAR-LUCENT_CENTER',
       'LUCENT_CENTER-PUNCTATE', 'PLEOMORPHIC-PLEOMORPHIC',
       'COARSE-ROUND_AND_REGULAR-LUCENT_CENTER',
       'ROUND_AND_REGULAR-LUCENT_CENTER-PUNCTATE', 'COARSE-LUCENT_CENTER'],
      dtype=object)

A common thread among these invalid values is the type 'LUCENT_CENTER', which should be 'LUCENT_CENTERED'. In addition, 'PLEOMORPHIC-PLEOMORPHIC' should be 'PLEOMORPHIC', and some NaN values are present. We'll make those changes in the next section. Now, the invalid breast density records.

In [10]:
dq.get_invalid_data(subset='breast_density')

|  | patient_id | breast_density | left_or_right_breast | image_view | abnormality_id | abnormality_type | calc_type | calc_distribution | assessment | pathology | subtlety | fileset | mass_shape | mass_margins | case_id | cancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1817 | P_01743 | 0 | RIGHT | CC | 1 | calcification | PLEOMORPHIC | DIFFUSELY_SCATTERED | 5 | MALIGNANT | 5 | test | NOT APPLICABLE | NOT APPLICABLE | Calcification-Test_P_01743_RIGHT_CC_1 | True |
| 1818 | P_01743 | 0 | RIGHT | MLO | 1 | calcification | PLEOMORPHIC | DIFFUSELY_SCATTERED | 5 | MALIGNANT | 5 | test | NOT APPLICABLE | NOT APPLICABLE | Calcification-Test_P_01743_RIGHT_MLO_1 | True |


Both observations with breast density values of zero are for the same patient. Let's see what densities we have for similar malignant calcification cases.

In [11]:
df = dq.get_complete_data()
df_similar_type = df.loc[(df['calc_type'] == 'PLEOMORPHIC') & (df['cancer'] == True)]
df_similar_dist = df.loc[(df['calc_distribution'] == 'DIFFUSELY_SCATTERED') & (df['cancer'] == True)]


The breast density counts below are for PLEOMORPHIC calcification cases.

In [12]:
df_similar_type['breast_density'].value_counts().to_frame()

| breast_density | count |
|----------------|-------|
| 3 | 158 |
| 2 | 123 |
| 4 | 121 |
| 1 | 23 |
| 0 | 2 |


Breast densities of 2, 3, and 4 are all common among malignant PLEOMORPHIC cases (123, 158, and 121 cases, respectively), so calcification type alone does not point to a single likely density. Let's see if calcification distribution is more discriminative.

In [13]:
df_similar_dist['breast_density'].value_counts().to_frame()

| breast_density | count |
|----------------|-------|
| 2 | 2 |
| 0 | 2 |
| 3 | 1 |


We have only five malignant DIFFUSELY_SCATTERED calcification cases, two of which are our invalid cases. Of the remaining three, two have a breast density of 2 and one has a breast density of 3, far too few to favor one density over another. Though these invalid observations are relatively few, they represent 40% of the malignant DIFFUSELY_SCATTERED calcification cases. We'll attempt to impute these values rather than delete them.
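
As a rough check on how evenly breast densities 2, 3, and 4 are spread among the malignant PLEOMORPHIC cases counted earlier, a chi-square goodness-of-fit test against equal frequencies can be worked by hand. The sketch below uses only the reported counts; choosing this particular test, and restricting it to densities 2 through 4, are assumptions made for illustration.

```python
import math

# Breast density counts for malignant PLEOMORPHIC cases, as reported above.
observed = {2: 123, 3: 158, 4: 121}

total = sum(observed.values())
expected = total / len(observed)  # equal frequencies under the null hypothesis

# Pearson chi-square statistic.
chi2 = sum((count - expected) ** 2 / expected for count in observed.values())

# Three categories give 2 degrees of freedom, and for 2 degrees of freedom
# the chi-square survival function has the closed form exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

The statistic comes out near 6.5 with a p-value of roughly 0.04, so the three densities are similar in frequency but not demonstrably equal. Either way, calcification type alone does not single out a density value for the two invalid records.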

#### Invalid Mass Case Analysis
The invalid mass cases have subtlety values of zero. Let's take a look at the data.

In [14]:
dq.get_invalid_data(subset='subtlety')

|  | patient_id | breast_density | left_or_right_breast | image_view | abnormality_id | abnormality_type | calc_type | calc_distribution | assessment | pathology | subtlety | fileset | mass_shape | mass_margins | case_id | cancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2364 | P_00710 | 2 | RIGHT | CC | 1 | mass | NOT APPLICABLE | NOT APPLICABLE | 0 | BENIGN | 0 | train |  |  | Mass-Train_P_00710_RIGHT_CC_1 | False |
| 2365 | P_00710 | 2 | RIGHT | MLO | 1 | mass | NOT APPLICABLE | NOT APPLICABLE | 0 | BENIGN | 0 | train |  |  | Mass-Train_P_00710_RIGHT_MLO_1 | False |


As with the invalid breast density records, both cases belong to the same patient. They also convey little information, as both mass shape and mass margins are missing. Here, deletion is the best option.

## DICOM Dataset
### Completeness Analysis

In [15]:
dqd = DicomDQA(filepath=fpd)
dqdc = dqd.analyze_completeness()
print(dqdc.summary)



                          Completeness                          
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10238
                Complete Records | 10238
             Record Completeness | 1.0
                     Data Values | 174046
            Complete Data Values | 174046
         Data Value Completeness | 1.0




According to The Cancer Imaging Archive (TCIA), the CBIS-DDSM contains 10,239 images. Our DICOM Dataset contains 10,238 observations at 100% completeness, one fewer than expected. That gap matches the single corrupt DICOM image file that was encountered and skipped during data preparation. Even so, could there be some duplication?

### Uniqueness Analysis

In [16]:
dqdu = dqd.analyze_uniqueness()
print(dqdu.summary)



                           Uniqueness                           
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10238
                  Unique Records | 10238
               Record Uniqueness | 1.0
                     Data Values | 174046
              Unique Data Values | 48750
           Data Value Uniqueness | 0.28




As indicated above, we have no duplicate records in the dataset.

### Validity Analysis

In [17]:
dqdv = dqd.analyze_validity()
print(dqdv.summary)



                            Validity                            
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10238
                   Valid Records | 10238
                 Record Validity | 1.0
                     Data Values | 174046
               Valid Data Values | 174046
             Data Value Validity | 1.0




The DICOM Dataset has 100% validity. 

## Summary
A brief data quality analysis of the CBIS-DDSM case and DICOM data examined completeness, uniqueness, and validity.

**Data Preparation Tasks**
1. The following observations have zero for breast density, values that will be considered missing:
   1. P_01743_RIGHT_calcification_CC_1
   2. P_01743_RIGHT_calcification_MLO_1
2. For calcification type, we'll perform the following replacements:
   1. 'LUCENT_CENTERED' for 'LUCENT_CENTER'
   2. 'PLEOMORPHIC' for 'PLEOMORPHIC-PLEOMORPHIC'
3. We will use Multivariate Imputation by Chained Equations (MICE) to predict missing and invalid values for the following variables:
   1. calc_distribution
   2. mass_margins
   3. mass_shape
   4. calc_type
   5. breast_density
4. The following cases will be deleted from the dataset.
   1. P_00710_RIGHT_mass_MLO_1
   2. P_00710_RIGHT_mass_CC_1
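
Tasks 1, 2, and 4 above could be sketched in pandas as follows. This is a minimal illustration on made-up rows, not the project's data preparation code. The helper `normalize_calc_type` and the `toy_case_*` identifiers are hypothetical, while the two real case identifiers are copied from the `case_id` values shown earlier. The MICE imputation in task 3 is deliberately left out. Working token by token avoids a pitfall of plain substring replacement, which would turn an already correct 'LUCENT_CENTERED' into 'LUCENT_CENTEREDED'.

```python
import pandas as pd

# Toy rows that mimic the issues listed above (values are illustrative).
toy = pd.DataFrame({
    "case_id": [
        "Calcification-Test_P_01743_RIGHT_CC_1",
        "Mass-Train_P_00710_RIGHT_CC_1",
        "toy_case_1", "toy_case_2", "toy_case_3",
    ],
    "breast_density": [0, 2, 3, 4, 1],
    "calc_type": [
        "PLEOMORPHIC", "NOT APPLICABLE", "ROUND_AND_REGULAR-LUCENT_CENTER",
        "PLEOMORPHIC-PLEOMORPHIC", "LUCENT_CENTERED",
    ],
})


def normalize_calc_type(value):
    """Rename LUCENT_CENTER tokens and drop repeated tokens (hypothetical helper)."""
    if pd.isna(value):
        return value
    tokens = []
    for token in value.split("-"):
        if token == "LUCENT_CENTER":
            token = "LUCENT_CENTERED"
        if token not in tokens:
            tokens.append(token)
    return "-".join(tokens)


# Task 1: treat a breast density of zero as missing.
toy["breast_density"] = toy["breast_density"].mask(toy["breast_density"] == 0)

# Task 2: correct the calcification type labels.
toy["calc_type"] = toy["calc_type"].map(normalize_calc_type)

# Task 4: delete the uninformative mass cases.
cases_to_drop = ["Mass-Train_P_00710_RIGHT_CC_1", "Mass-Train_P_00710_RIGHT_MLO_1"]
clean = toy.loc[~toy["case_id"].isin(cases_to_drop)].reset_index(drop=True)

print(clean)
```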