# CBIS-DDSM Data Quality Analysis (DQA)
In the previous section, we began transforming the raw CBIS-DDSM case and DICOM data into a structure and format more suitable for analysis. In this section, we identify potential data quality issues, errors, and anomalies that require further preprocessing before the analysis stage.

## Datasets
The scope of this analysis includes the following datasets:

| Dataset             | Description                                        |
|---------------------|----------------------------------------------------|
| Master Case Dataset | Calcification and mass case training and test sets |
| DICOM Dataset       | DICOM Image Metadata                               |

### Approach
Our approach will touch on four dimensions of data quality.

| # | Dimension | Definition | Metric | Metric Definition |
|---|----------------|-------------------------------------|----------------|-------------------------------------|
| 1 | Completeness | Are the data complete? | Completeness Ratio | Ratio of non-null data values / rows over the total number of data values / rows |
| 2 | Uniqueness | Are there duplicate records? | Uniqueness Ratio | Ratio of unique data values / rows over total number of data values / rows |
| 3 | Validity   | Are the values consistent with defined rules? | Validity Ratio | Ratio of valid cells / rows over total number of data cells / rows |
| 4 | Consistency   | Are the data consistent across datasets? | Consistency Ratio | Ratio of consistent data pairs over total data pairs evaluated |

Note that accuracy and relevance, two additional dimensions of data quality, are not covered in this analysis. An accuracy evaluation requires an external source of truth against which the CBIS-DDSM data can be compared. As we lack such a data source, we have no basis for evaluating the accuracy of the CBIS-DDSM data collection. With respect to relevance, duplicate and irrelevant data were removed from the data collection during the previous data preparation exercise.
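To make the metric definitions above concrete, here is a minimal sketch of how record-level and data-value-level completeness ratios could be computed with pandas. The `completeness` helper and the toy DataFrame are illustrative assumptions; they are not the `bcd.data.dqa` implementation used in the cells that follow.

```python
import pandas as pd

# Hypothetical toy data standing in for a case dataset; not CBIS-DDSM values.
toy = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_2", "P_3", "P_4"],
        "calc_type": ["PLEOMORPHIC", None, "AMORPHOUS", "PLEOMORPHIC"],
        "calc_distribution": ["CLUSTERED", None, None, "LINEAR"],
    }
)


def completeness(df: pd.DataFrame) -> dict:
    """Return record-level and data-value-level completeness ratios."""
    # A record is complete only if every one of its values is non-null.
    complete_records = int(df.notna().all(axis=1).sum())
    # A data value is complete if that single cell is non-null.
    complete_values = int(df.notna().sum().sum())
    return {
        "record_completeness": complete_records / len(df),
        "data_value_completeness": complete_values / df.size,
    }


ratios = completeness(toy)
print(ratios)
```

On this toy frame, two of four records are complete and nine of twelve values are non-null, so the two ratios are 0.5 and 0.75.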

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
from IPython.display import HTML, display_html
import numpy as np

from bcd.data.dqa.case import MassCaseDQA, CalcCaseDQA
from bcd.data.dqa.dicom import DicomDQA

pd.set_option('display.max_colwidth', 200)

In [2]:
# Filepaths
fpc = "data/staged/calc_cases.csv"
fpm = "data/staged/mass_cases.csv"
fpd = "data/staged/dicom.csv"

## Master Case Dataset


### Completeness Analysis

In [3]:
dqc = CalcCaseDQA(filepath=fpc)
dqm = MassCaseDQA(filepath=fpm)
dqcc = dqc.analyze_completeness()
dqmc = dqm.analyze_completeness()
print(dqcc.summary, end=" ")
print(dqmc.summary, end=" ")




                          Completeness                          
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                         Records | 1872
                Complete Records | 1409
             Record Completeness | 0.753
                     Data Values | 26208
            Complete Data Values | 25745
         Data Value Completeness | 0.982

 

                          Completeness                          
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                         Records | 1696
                Complete Records | 1634
             Record Completeness | 0.963
                     Data Values | 23744
            Complete Data Values | 23680
         Data Value Completeness | 0.997

 

Calcification cases have record and data value completeness of 0.753 and 0.982, respectively. Mass cases fare better, with record and data value completeness of 0.963 and 0.997, respectively. Let's examine data value completeness for the calcification cases.

In [4]:
dqcc.detail

Unnamed: 0,N,Complete,Missing,Completeness
patient_id,1872,1872,0,1.0
breast_density,1872,1872,0,1.0
left_or_right_breast,1872,1872,0,1.0
image_view,1872,1872,0,1.0
abnormality_id,1872,1872,0,1.0
abnormality_type,1872,1872,0,1.0
calc_type,1872,1848,24,0.99
calc_distribution,1872,1433,439,0.77
assessment,1872,1872,0,1.0
pathology,1872,1872,0,1.0


Calcification type and calcification distribution are missing in about 1% (24 of 1,872) and 23% (439 of 1,872) of records, respectively. The 24 records with missing calcification type can be ignored, as they represent only about 1% of the data. Calcification distribution, on the other hand, has a missingness of over 23%. Let's examine calcification distribution vis-a-vis our target, pathology.

In [5]:
dqc.get_incomplete_data(subset='calc_distribution')['pathology'].value_counts().to_frame()

Unnamed: 0_level_0,count
pathology,Unnamed: 1_level_1
BENIGN_WITHOUT_CALLBACK,434
BENIGN,5


This is interesting. Nearly 99% (434 of 439) of the records with missing calcification distribution have a pathology of BENIGN_WITHOUT_CALLBACK. This suggests that the calcification distribution data are Missing at Random (MAR): the probability that calcification distribution is missing appears to depend on an observed variable, here a benign-without-callback pathology, rather than being uniform across records. Let's take a look at the variable-level mass case completeness.

In [6]:
dqmc.detail

Unnamed: 0,N,Complete,Missing,Completeness
patient_id,1696,1696,0,1.0
breast_density,1696,1696,0,1.0
left_or_right_breast,1696,1696,0,1.0
image_view,1696,1696,0,1.0
abnormality_id,1696,1696,0,1.0
abnormality_type,1696,1696,0,1.0
assessment,1696,1696,0,1.0
pathology,1696,1696,0,1.0
subtlety,1696,1696,0,1.0
fileset,1696,1696,0,1.0


Here, mass margins and mass shape have a missingness of about 3.5% (60 of 1,696) and 0.2% (4 of 1,696), respectively. The missingness is small enough relative to the sample size to be ignored. Nonetheless, let's examine mass margins data vis-a-vis pathology.

In [7]:
dqm.get_incomplete_data(subset='mass_margins')['pathology'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,proportion
pathology,Unnamed: 1_level_1
BENIGN_WITHOUT_CALLBACK,0.68
MALIGNANT,0.18
BENIGN,0.13


As with calcification distribution, mass margin missingness appears to be related to a pathology of benign without callback, which accounts for 68% of the affected records. Let's check mass shape missingness.

In [8]:
dqm.get_incomplete_data(subset='mass_shape')['pathology'].value_counts(normalize=True).to_frame()

Unnamed: 0_level_0,proportion
pathology,Unnamed: 1_level_1
MALIGNANT,0.5
BENIGN,0.5


These numbers indicate that malignant and benign pathologies are equally represented among records with missing mass shape. With only four such records, the split tells us little.
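The comparisons above start from the incomplete rows and ask which pathologies they carry. A complementary view is the missing rate within each pathology class, which speaks more directly to whether missingness depends on pathology. The sketch below uses a hypothetical toy frame (the values are illustrative, not CBIS-DDSM data) and plain pandas rather than the `get_incomplete_data` helper.

```python
import pandas as pd

# Hypothetical toy records; the values are illustrative, not CBIS-DDSM data.
toy = pd.DataFrame(
    {
        "calc_distribution": ["CLUSTERED", None, None, "LINEAR", None, "SEGMENTAL"],
        "pathology": [
            "MALIGNANT",
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN",
            "BENIGN",
            "MALIGNANT",
        ],
    }
)

# Share of rows with a missing calc_distribution within each pathology class.
missing_rate = toy["calc_distribution"].isna().groupby(toy["pathology"]).mean()
print(missing_rate)
```

A missing rate that differs sharply between classes, as it does in this toy example, is the pattern one would expect when missingness is related to pathology.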

At the record level, calcification and mass cases have a missingness of about 25% and 4%, respectively. In both cases, the missing values belong to nominal variables. Rather than deleting the observations and risking the loss of valuable information, we will encode the missing values with the value 'MISSING'.
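As a minimal sketch of that encoding, assuming the nominal column is held as strings in a pandas DataFrame, `fillna` can replace nulls with an explicit category. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with two missing nominal values.
toy = pd.DataFrame({"calc_distribution": ["CLUSTERED", None, "LINEAR", None]})

# Replace nulls with an explicit 'MISSING' category, leaving the original frame intact.
encoded = toy.assign(calc_distribution=toy["calc_distribution"].fillna("MISSING"))
print(encoded["calc_distribution"].value_counts())
```

Treating missingness as its own category keeps every observation and lets downstream analysis see whether 'MISSING' itself carries signal.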

Let's move on to the uniqueness analysis.

### Uniqueness Analysis

In [9]:
dqcu = dqc.analyze_uniqueness()
print(dqcu.summary)
dqmu = dqm.analyze_uniqueness()
print(dqmu.summary)



                           Uniqueness                           
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                         Records | 1872
                  Unique Records | 1872
               Record Uniqueness | 1.0
                     Data Values | 26208
              Unique Data Values | 2714
           Data Value Uniqueness | 0.104




                           Uniqueness                           
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                         Records | 1696
                  Unique Records | 1696
               Record Uniqueness | 1.0
                     Data Values | 23744
              Unique Data Values | 2661
           Data Value Uniqueness | 0.112




With a record uniqueness of 1.0 for both calcification and mass cases, we find no duplicate records.
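For reference, a record-level uniqueness check of this kind can be sketched with pandas `duplicated` and `drop_duplicates`. The toy frame below is hypothetical and deliberately contains one repeated row; it does not reproduce how `analyze_uniqueness` is implemented.

```python
import pandas as pd

# Hypothetical toy records; the first and third rows are exact copies.
toy = pd.DataFrame(
    {
        "case_id": ["A_CC_1", "A_MLO_1", "A_CC_1"],
        "pathology": ["BENIGN", "BENIGN", "BENIGN"],
    }
)

# Rows that are exact copies of another row (all copies are flagged).
duplicates = toy[toy.duplicated(keep=False)]

# Record uniqueness: distinct rows over total rows.
record_uniqueness = len(toy.drop_duplicates()) / len(toy)
print(duplicates)
print(round(record_uniqueness, 3))
```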

### Validity Analysis
Here, we assess the degree to which the datasets contain valid values. 

In [10]:
dqcv = dqc.analyze_validity()
print(dqcv.summary)
dqmv = dqm.analyze_validity()
print(dqmv.summary)



                            Validity                            
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                         Records | 1872
                   Valid Records | 1407
                 Record Validity | 0.752
                     Data Values | 26208
               Valid Data Values | 25743
             Data Value Validity | 0.982




                            Validity                            
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                         Records | 1696
                   Valid Records | 1634
                 Record Validity | 0.963
                     Data Values | 23744
               Valid Data Values | 23678
             Data Value Validity | 0.997




For calcification cases, we have record and data value validity of 0.752 and 0.982, respectively. This comports with expectations, as null values are also invalid values. For mass cases, the situation is considerably better: record and data value validity are 0.963 and 0.997, respectively.
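The sketch below illustrates one way a rule-based validity check could work, and why nulls lower validity as well as completeness. The rule set is an assumption made for illustration (breast density in 1 to 4, subtlety in 1 to 5); the actual rules applied by `analyze_validity` are not shown in this section, and the toy data are hypothetical.

```python
import pandas as pd

# Assumed rule set for illustration: allowed values per column.
ALLOWED = {
    "breast_density": [1, 2, 3, 4],
    "subtlety": [1, 2, 3, 4, 5],
}

# Hypothetical toy records containing a zero density, a null, and a zero subtlety.
toy = pd.DataFrame(
    {
        "breast_density": [2, 0, 3, 4],
        "subtlety": [5, 3, None, 0],
    }
)

# A value is valid only if it appears in the allowed set; nulls never do.
valid = pd.DataFrame({col: toy[col].isin(values) for col, values in ALLOWED.items()})

record_validity = valid.all(axis=1).mean()
data_value_validity = valid.sum().sum() / valid.size
print(record_validity, data_value_validity)
```

Here only the first record passes every rule, so record validity is 0.25, while five of the eight values are valid, giving a data value validity of 0.625.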

Imputing the missing calcification distribution values will drastically improve validity for the calcification cases. Fewer than 5% of mass records are invalid; nonetheless, that amounts to 62 of 1,696 observations. Let's check the details of calcification and mass case validity.

In [11]:
df1_style = dqcv.detail.style.set_table_attributes("style='display:inline; margin-right:220px;'").set_caption("Calcification Validity")
df2_style = dqmv.detail.style.set_table_attributes("style='display:inline; margin-right:120px'").set_caption("Mass Validity")

display_html(df1_style._repr_html_() + df2_style._repr_html_(), raw=True)

Unnamed: 0,N,Valid,Invalid,Validity
case_id,1872,1872,0,1.0
patient_id,1872,1872,0,1.0
breast_density,1872,1870,2,0.998932
left_or_right_breast,1872,1872,0,1.0
image_view,1872,1872,0,1.0
abnormality_id,1872,1872,0,1.0
abnormality_type,1872,1872,0,1.0
calc_type,1872,1848,24,0.987179
calc_distribution,1872,1433,439,0.765491
assessment,1872,1872,0,1.0

Unnamed: 0,N,Valid,Invalid,Validity
case_id,1696,1696,0,1.0
patient_id,1696,1696,0,1.0
breast_density,1696,1696,0,1.0
left_or_right_breast,1696,1696,0,1.0
image_view,1696,1696,0,1.0
abnormality_id,1696,1696,0,1.0
abnormality_type,1696,1696,0,1.0
mass_shape,1696,1692,4,0.997642
mass_margins,1696,1636,60,0.964623
assessment,1696,1696,0,1.0


The numbers of invalid values for calc_type, calc_distribution, mass_shape, and mass_margins equal the numbers of missing values. However, we have four new invalid observations: two calcification cases have invalid breast density values, and two mass cases have invalid subtlety values. Let's take a look at the invalid calcification cases.

#### Invalid Calcification Case Analysis

In [20]:
dqc.get_invalid_data(subset='breast_density')

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,calc_type,calc_distribution,assessment,pathology,subtlety,fileset,case_id,cancer
1817,P_01743,0,RIGHT,CC,1,calcification,PLEOMORPHIC,DIFFUSELY_SCATTERED,5,MALIGNANT,5,test,P_01743_RIGHT_calcification_CC_1,True
1818,P_01743,0,RIGHT,MLO,1,calcification,PLEOMORPHIC,DIFFUSELY_SCATTERED,5,MALIGNANT,5,test,P_01743_RIGHT_calcification_MLO_1,True


Both observations with a breast density of zero belong to the same patient. Let's see what densities we have for similar malignant calcification cases.

In [25]:
df = dqc.get_complete_data()
df_similar_type = df.loc[(df['calc_type'] == 'PLEOMORPHIC') & (df['cancer'] == True)]
df_similar_dist = df.loc[(df['calc_distribution'] == 'DIFFUSELY_SCATTERED') & (df['cancer'] == True)]


The breast density counts below are for PLEOMORPHIC calcification cases.

In [23]:
df_similar_type['breast_density'].value_counts().to_frame()

Unnamed: 0_level_0,count
breast_density,Unnamed: 1_level_1
3,158
2,123
4,121
1,23
0,2
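To make the comparison between density levels concrete, the counts above can be converted to shares. This is a small arithmetic sketch using only the counts printed in the output; the variable names are illustrative.

```python
# Breast density counts for malignant PLEOMORPHIC cases, copied from the output above.
counts = {3: 158, 2: 123, 4: 121, 1: 23, 0: 2}

total = sum(counts.values())

# Share of cases at each density level, rounded to three decimals.
shares = {density: round(n / total, 3) for density, n in counts.items()}
print(total, shares)
```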


Breast densities of 2, 3, and 4 are all common among malignant PLEOMORPHIC cases (123, 158, and 121 cases, respectively), so calcification type alone does not point to a single likely density. Let's see if calcification distribution is more discriminative.

In [24]:
df_similar_dist['breast_density'].value_counts().to_frame()

Unnamed: 0_level_0,count
breast_density,Unnamed: 1_level_1
2,2
0,2
3,1


We have only five malignant DIFFUSELY_SCATTERED calcification cases: two have the invalid density of 0, two have a breast density of 2, and one has a value of 3. With so few cases, there is no meaningful difference in breast density to draw on. Though the invalid observations are relatively few, they represent 40% of the malignant DIFFUSELY_SCATTERED calcification cases. We'll attempt to impute these values rather than delete them.

#### Invalid Mass Case Analysis
The invalid mass cases have subtlety values of zero. Let's take a look at the data.

In [26]:
dqm.get_invalid_data(subset='subtlety')

Unnamed: 0,patient_id,breast_density,left_or_right_breast,image_view,abnormality_id,abnormality_type,assessment,pathology,subtlety,fileset,mass_shape,mass_margins,case_id,cancer
492,P_00710,2,RIGHT,CC,1,mass,0,BENIGN,0,train,,,P_00710_RIGHT_mass_CC_1,False
493,P_00710,2,RIGHT,MLO,1,mass,0,BENIGN,0,train,,,P_00710_RIGHT_mass_MLO_1,False


As with the calcification cases, both records belong to the same patient. Notably, they convey little information: both mass shape and mass margins are missing. In this case, deletion is the best option.

## DICOM Dataset
### Completeness Analysis

In [13]:
dqd = DicomDQA(filepath=fpd)
dqdc = dqd.analyze_completeness()
print(dqdc.summary)



                          Completeness                          
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10238
                Complete Records | 10238
             Record Completeness | 1.0
                     Data Values | 174046
            Complete Data Values | 174046
         Data Value Completeness | 1.0




According to The Cancer Imaging Archive (TCIA), the CBIS-DDSM contains 10,239 images. Yet our DICOM dataset contains 10,703 observations, 464 more than expected, at a completeness of 100%. In addition, one corrupt DICOM image file was encountered during data preparation and skipped, which brings the overage to 465. Could there be some duplication?

### Uniqueness Analysis

In [14]:
dqdu = dqd.analyze_uniqueness()
print(dqdu.summary)



                           Uniqueness                           
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10238
                  Unique Records | 10238
               Record Uniqueness | 1.0
                     Data Values | 174046
              Unique Data Values | 48679
           Data Value Uniqueness | 0.28




As indicated above, we have no duplicate records in the dataset.

### Validity Analysis

In [15]:
dqdv = dqd.analyze_validity()
print(dqdv.summary)



                            Validity                            
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10238
                   Valid Records | 10238
                 Record Validity | 1.0
                     Data Values | 174046
               Valid Data Values | 153570
             Data Value Validity | 0.882




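All 10,238 DICOM records are valid, for a record validity of 1.0. The reported data value validity of 0.882 corresponds to 153,570 of 174,046 values, which is exactly 15 of the 17 values per record. Since no record is flagged as invalid, the shortfall likely reflects columns that the validity check does not count rather than invalid values.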

### Consistency Analysis
Both the case datasets and the DICOM image dataset contain patient_id, side, and image_view. Next, we confirm that these variables are consistent between the case and the DICOM datasets. 

In [16]:
dqdcy = dqd.analyze_consistency()
print(dqdcy.summary)



                          Consistency                           
                         Dataset | Dicom
                        Filename | dicom.csv
                         Records | 10703
              Consistent Records | 10703
              Record Consistency | 1.0
                     Data Values | 32109
          Consistent Data Values | 32109
          Data Value Consistency | 1.0




These results show that the case information is consistent between the case and DICOM datasets.
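A cross-dataset consistency check of this kind can be sketched as a merge followed by field-by-field comparison. The example below is an assumption about the general technique, not the `analyze_consistency` implementation; the `case_id` join key, the column names, and the toy values are all hypothetical.

```python
import pandas as pd

# Fields expected to agree between the two datasets.
FIELDS = ["patient_id", "side", "image_view"]

# Hypothetical toy case and DICOM records sharing a case_id key.
cases = pd.DataFrame(
    {
        "case_id": ["c1", "c2", "c3"],
        "patient_id": ["P_1", "P_2", "P_3"],
        "side": ["LEFT", "RIGHT", "LEFT"],
        "image_view": ["CC", "MLO", "CC"],
    }
)
dicom = pd.DataFrame(
    {
        "case_id": ["c1", "c2", "c3"],
        "patient_id": ["P_1", "P_2", "P_3"],
        "side": ["LEFT", "RIGHT", "RIGHT"],  # deliberately disagrees for c3
        "image_view": ["CC", "MLO", "CC"],
    }
)

# Pair each DICOM record with its case record, then compare field by field.
merged = dicom.merge(cases, on="case_id", suffixes=("_dicom", "_case"))
agree = pd.DataFrame({f: merged[f"{f}_dicom"] == merged[f"{f}_case"] for f in FIELDS})

record_consistency = agree.all(axis=1).mean()
data_value_consistency = agree.sum().sum() / agree.size
print(round(record_consistency, 3), round(data_value_consistency, 3))
```

Comparing three fields per record is consistent with the summary above, where 32,109 data values correspond to three values for each of the 10,703 records.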

### Summary
A brief data quality analysis of the CBIS-DDSM case and DICOM data examined completeness, uniqueness, validity and consistency. The results are summarized below.

In [17]:
# Record-level ratios from the summaries above, ordered calcification, mass, DICOM.
data = [0.753, 0.963, 1, 1, 1, 1, 0.752, 0.963, 1, 1]
# Dataset labels follow the same order as the values.
datasets = ['Calc Cases', 'Mass Cases', 'DICOM Dataset', 'Calc Cases', 'Mass Cases', 'DICOM Dataset', 'Calc Cases', 'Mass Cases', 'DICOM Dataset', 'DICOM Dataset']
dimension = ["Completeness", "Completeness", "Completeness", "Uniqueness", "Uniqueness", "Uniqueness", "Validity", "Validity", "Validity", "Consistency"]
d = {'Dimension': dimension, 'Dataset': datasets, 'Value': data}
df = pd.DataFrame(d)
df.set_index(['Dimension', 'Dataset'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Value
Dimension,Dataset,Unnamed: 2_level_1
Completeness,Calc Cases,0.75
Completeness,Mass Cases,0.96
Completeness,DICOM Dataset,1.0
Uniqueness,Calc Cases,1.0
Uniqueness,Mass Cases,1.0
Uniqueness,DICOM Dataset,1.0
Validity,Calc Cases,0.75
Validity,Mass Cases,0.96
Validity,DICOM Dataset,1.0
Consistency,DICOM Dataset,1.0


**Data Preparation Tasks**
1. The following observations have zero for breast density, values that will be considered missing:
   1. P_01743_RIGHT_calcification_CC_1
   2. P_01743_RIGHT_calcification_MLO_1
2. We will use Multivariate Imputation by Chained Equations (MICE) to predict missing and invalid values for the following variables:
   1. calc_distribution
   2. mass_margins
   3. mass_shape
   4. calc_type
   5. breast_density
3. The following cases will be deleted from the dataset.
   1. P_00710_RIGHT_mass_MLO_1
   2. P_00710_RIGHT_mass_CC_1
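
Tasks 1 and 3 above can be sketched with plain pandas, as shown below under stated assumptions: the toy frame holds only a handful of rows, the row with case_id `P_99999_LEFT_mass_CC_1` is hypothetical, and calcification and mass records are mixed in one frame purely for brevity. The MICE imputation in task 2 is not shown here.

```python
import pandas as pd

# Cases slated for deletion, as listed in the tasks above.
DROP_CASES = ["P_00710_RIGHT_mass_MLO_1", "P_00710_RIGHT_mass_CC_1"]

# Toy frame for illustration; the last row is hypothetical.
toy = pd.DataFrame(
    {
        "case_id": [
            "P_01743_RIGHT_calcification_CC_1",
            "P_00710_RIGHT_mass_CC_1",
            "P_00710_RIGHT_mass_MLO_1",
            "P_99999_LEFT_mass_CC_1",
        ],
        "breast_density": [0, 2, 2, 3],
    }
)

# Task 3: remove the uninformative cases.
prepared = toy.loc[~toy["case_id"].isin(DROP_CASES)].copy()

# Task 1: treat a breast density of zero as missing so it can be imputed later.
prepared["breast_density"] = prepared["breast_density"].mask(
    prepared["breast_density"] == 0
)
print(prepared)
```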