# CBIS-DDSM Data Quality Analysis (DQA)
In the previous section, we began the process of transforming the raw CBIS-DDSM case and DICOM data into a structure and format more suitable for analysis. In this section, we assess the quality of those data to identify potential issues, errors, and anomalies that require further preprocessing before the analysis stage.

## Datasets
The scope of this analysis includes the following datasets:

| Dataset             | Description                                        |
|---------------------|----------------------------------------------------|
| Master Case Dataset | Calcification and mass case training and test sets |
| Series Dataset      | Series metadata                                    |

## Approach
Our approach will touch on four dimensions of data quality.

| # | Dimension | Definition | Metric | Metric Definition |
|---|----------------|-------------------------------------|----------------|-------------------------------------|
| 1 | Consistency | Are variable names and feature categories consistent? | Consistency Ratio | Consistent aspects / total number of aspects |
| 2 | Completeness | Are the data complete? | Completeness Ratio | Ratio of non-null data values / rows over the total number of data values / rows |
| 3 | Uniqueness | Are there duplicate records? | Uniqueness Ratio | Ratio of unique data values / rows over total number of data values / rows |
| 4 | Validity   | Are the values consistent with defined rules? | Validity Ratio | Ratio of valid cells / rows over total number of data cells / rows |

Note that accuracy and relevance, two additional dimensions of data quality, are excluded from this analysis. An accuracy evaluation requires an external source of truth against which the CBIS-DDSM data can be compared. As we lack such a data source, we have no basis for evaluating the accuracy of the CBIS-DDSM data collection. With respect to relevance, duplicate and irrelevant data were removed from the data collection during the previous data preparation exercise.
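
The ratio metrics in the table above reduce to simple counts. As a minimal sketch (not the `CaseDQA` implementation, whose internals are not shown in this section), record and data value completeness could be computed with pandas roughly as follows. The function name `completeness_ratios` and the toy DataFrame are illustrative assumptions, not CBIS-DDSM data.

```python
import numpy as np
import pandas as pd


def completeness_ratios(df: pd.DataFrame) -> dict:
    """Return record-level and value-level completeness for a DataFrame."""
    # A record counts as complete only if none of its values are null.
    record_completeness = df.notna().all(axis=1).mean()
    # Value completeness is the share of non-null cells across the whole frame.
    value_completeness = df.notna().to_numpy().mean()
    return {
        "record_completeness": float(record_completeness),
        "value_completeness": float(value_completeness),
    }


# Hypothetical toy data: 4 records, 3 variables, 2 missing cells.
toy = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_2", "P_3", "P_4"],
        "breast_density": [3, np.nan, 2, 1],
        "calc type": ["PLEOMORPHIC", "PUNCTATE", None, "AMORPHOUS"],
    }
)
ratios = completeness_ratios(toy)
print(ratios)
```

Because a single null value makes a whole record incomplete, record completeness is never higher than data value completeness.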

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
from IPython.display import HTML, display_html
import numpy as np

from bcd.analyze.dqa.case import CaseDQA
from bcd.analyze.dqa.dicom import DicomDQA
from bcd.dal.file import FileManager

pd.set_option('display.max_colwidth', 200)

## Master Case Dataset
### Completeness Analysis

In [2]:
dq = CaseDQA(file_manager=FileManager)
dqc = dq.analyze_completeness()
print(dqc.summary, end=" ")


DEBUG:bcd.analyze.dqa.case:  patient_id  breast_density left or right breast image view  abnormality id  \
0    P_00001            3.00                 LEFT         CC               1   
1    P_00001            3.00                 LEFT        MLO               1   
2    P_00004            3.00                 LEFT         CC               1   
3    P_00004            3.00                 LEFT        MLO               1   
4    P_00004            3.00                RIGHT        MLO               1   

  abnormality type                          mass shape   mass margins  \
0             mass  IRREGULAR-ARCHITECTURAL_DISTORTION     SPICULATED   
1             mass  IRREGULAR-ARCHITECTURAL_DISTORTION     SPICULATED   
2             mass            ARCHITECTURAL_DISTORTION    ILL_DEFINED   
3             mass            ARCHITECTURAL_DISTORTION    ILL_DEFINED   
4             mass                                OVAL  CIRCUMSCRIBED   

   assessment  pathology  subtlety  \
0           4  

Record and data value completeness are 0.85 and 0.99, respectively. Let's take a closer look.

In [3]:
dqc.detail

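| Variable             | N    | Complete | Missing | Completeness |
|----------------------|------|----------|---------|--------------|
| patient_id           | 3568 | 3568     | 0       | 1.0          |
| breast_density       | 3568 | 1696     | 1872    | 0.48         |
| left or right breast | 3568 | 3568     | 0       | 1.0          |
| image view           | 3568 | 3568     | 0       | 1.0          |
| abnormality id       | 3568 | 3568     | 0       | 1.0          |
| abnormality type     | 3568 | 3568     | 0       | 1.0          |
| mass shape           | 3568 | 3568     | 0       | 1.0          |
| mass margins         | 3568 | 3568     | 0       | 1.0          |
| assessment           | 3568 | 3568     | 0       | 1.0          |
| pathology            | 3568 | 3568     | 0       | 1.0          |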


Missingness for calc type, mass shape, and mass margins is 1%, < 1%, and 2%, respectively. Strictly speaking, we could drop those observations without a considerable loss of data. Calc distribution missingness, on the other hand, deserves a closer look. Let's examine it vis-a-vis the pathology target.

In [4]:
dq.get_incomplete_data(subset='calc distribution')['pathology'].value_counts().to_frame()

| pathology | count |
|-----------|-------|


This is interesting. Over 98% of the records with missing calcification distribution have a pathology of BENIGN_WITHOUT_CALLBACK. This suggests that the calcification distribution data are Missing at Random (MAR); that is, the probability of encountering a missing calcification distribution may be systematically related to a benign pathology. Because the missingness seems to depend on *observed* data, multiple imputation by chained equations (MICE) {cite}`schaferMultipleImputationPrimer1999` is a principled method for handling it while mitigating data loss and bias in the dataset.
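
One way to check whether missingness is related to an observed variable is to look at how the target is distributed among the rows where a value is missing. The sketch below uses a small hypothetical DataFrame rather than the CBIS-DDSM data, and `missingness_by_target` is an illustrative helper that is not part of the `bcd` package.

```python
import numpy as np
import pandas as pd


def missingness_by_target(df: pd.DataFrame, column: str, target: str) -> pd.Series:
    """Share of each target class among rows where `column` is missing."""
    missing_rows = df.loc[df[column].isna()]
    return missing_rows[target].value_counts(normalize=True)


# Hypothetical toy data: four of six rows lack a calc distribution.
toy = pd.DataFrame(
    {
        "calc distribution": [np.nan, np.nan, np.nan, "CLUSTERED", "LINEAR", np.nan],
        "pathology": [
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN_WITHOUT_CALLBACK",
            "MALIGNANT",
            "BENIGN",
            "MALIGNANT",
        ],
    }
)
shares = missingness_by_target(toy, "calc distribution", "pathology")
print(shares)
```

A share concentrated in one class is consistent with MAR, although it does not by itself rule out other missingness mechanisms.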

Let's move on to the uniqueness analysis.

### Uniqueness Analysis

In [5]:
dqu = dq.analyze_uniqueness()
print(dqu.summary)



                           Uniqueness                           
                         Dataset | Case
                         Records | 3568
                  Unique Records | 3568
               Record Uniqueness | 1.0
                     Data Values | 64224
              Unique Data Values | 11943
           Data Value Uniqueness | 0.186




No duplicate records exist in the data.
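
For orientation, the uniqueness figures above could be derived along the following lines. This is a sketch under an assumption the source does not confirm: that "unique data values" is the number of distinct values per column, summed over all columns. The function name and the toy data are hypothetical.

```python
import pandas as pd


def uniqueness_ratios(df: pd.DataFrame) -> dict:
    """Return record-level and value-level uniqueness for a DataFrame."""
    n_records = len(df)
    # Records that remain after exact duplicate rows are dropped.
    n_unique_records = len(df.drop_duplicates())
    n_values = df.size
    # Assumption: distinct values are counted per column, then summed.
    n_unique_values = int(df.nunique(dropna=False).sum())
    return {
        "record_uniqueness": n_unique_records / n_records,
        "value_uniqueness": n_unique_values / n_values,
    }


# Hypothetical toy data: the last two rows are exact duplicates.
toy = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_1", "P_2", "P_2"],
        "image view": ["CC", "MLO", "CC", "CC"],
        "pathology": ["BENIGN", "BENIGN", "MALIGNANT", "MALIGNANT"],
    }
)
uniqueness = uniqueness_ratios(toy)
print(uniqueness)
```

The reported data value uniqueness follows directly from the summary counts: 11943 / 64224 ≈ 0.186. A low value here is expected for categorical variables and is not a quality problem in itself.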

### Validity Analysis
Here, we assess the degree to which the datasets contain valid values. 

In [6]:
dqv = dq.analyze_validity()
print(dqv.summary)

KeyError: 'abnormality_type'

Record and data value validity are 0.84 and 0.99, respectively. Note, missing data will have an impact on record validity. Still, let's take a closer look at the invalid data. 
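
To illustrate why missing data lowers record validity, here is a minimal sketch of a rule-based validity check. The rules below are illustrative assumptions; the rules actually applied by `CaseDQA` are not shown in this section.

```python
import numpy as np
import pandas as pd

# Illustrative validation rules (assumed, not taken from CaseDQA).
RULES = {
    "breast_density": lambda s: s.isin([1, 2, 3, 4]),
    "image view": lambda s: s.isin(["CC", "MLO"]),
    "left or right breast": lambda s: s.isin(["LEFT", "RIGHT"]),
}


def validity_ratios(df: pd.DataFrame, rules: dict) -> dict:
    """Return record-level and value-level validity under the given rules."""
    # Boolean frame: True where a cell satisfies its column's rule.
    valid = pd.DataFrame({col: rule(df[col]) for col, rule in rules.items()})
    return {
        "record_validity": float(valid.all(axis=1).mean()),
        "value_validity": float(valid.to_numpy().mean()),
    }


# Hypothetical toy data: one zero density and one missing density.
toy = pd.DataFrame(
    {
        "breast_density": [3, 0, 2, np.nan],
        "image view": ["CC", "MLO", "CC", "MLO"],
        "left or right breast": ["LEFT", "RIGHT", "RIGHT", "LEFT"],
    }
)
validity = validity_ratios(toy, RULES)
print(validity)
```

In this sketch a missing value fails `isin`, so it counts as invalid, which is how missingness ends up reducing record validity.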

In [None]:
dqv.detail

The invalidity of calc distribution and mass margins is explained by the presence of missing data. Approximately 6% of the observations have invalid calcification types. Let's take a look.

In [None]:
dq.get_invalid_data(subset='calc type')['calc type'].unique()

A common thread among these invalid values is the type 'LUCENT_CENTER', which should be 'LUCENT_CENTERED'. In addition, 'PLEOMORPHIC-PLEOMORPHIC' should be 'PLEOMORPHIC', and some NaN values are present. We'll make those changes in the next section. Now, the invalid breast density records.
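
The two replacements described above could be sketched as follows. The helper `normalize_calc_type` and the example values are hypothetical; the sketch assumes 'LUCENT_CENTER' may appear on its own or as part of a hyphenated compound type.

```python
import pandas as pd


def normalize_calc_type(series: pd.Series) -> pd.Series:
    """Apply the two calcification type corrections; NaN values pass through."""
    # Replace 'LUCENT_CENTER' only where it is not already followed by 'ED',
    # so existing 'LUCENT_CENTERED' values are left untouched.
    fixed = series.str.replace(r"LUCENT_CENTER(?!ED)", "LUCENT_CENTERED", regex=True)
    # Collapse the duplicated compound label to a single label.
    return fixed.replace({"PLEOMORPHIC-PLEOMORPHIC": "PLEOMORPHIC"})


# Hypothetical example values, including one missing entry.
calc_type = pd.Series(
    [
        "ROUND_AND_REGULAR-LUCENT_CENTER",
        "LUCENT_CENTERED",
        "PLEOMORPHIC-PLEOMORPHIC",
        "PLEOMORPHIC",
        None,
        "LUCENT_CENTER-PUNCTATE",
    ]
)
cleaned = normalize_calc_type(calc_type)
print(cleaned.tolist())
```

The negative lookahead keeps the correction idempotent: running it twice gives the same result as running it once.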

In [None]:
dq.get_invalid_data(subset='breast_density')

Both observations with breast density values of zero are for the same patient. Let's see what densities we have for similar malignant calcification cases.

In [None]:
df = dq.get_complete_data()
df_similar_type = df.loc[(df['calc type'] == 'PLEOMORPHIC') & (df['cancer'] == True)]
df_similar_dist = df.loc[(df['calc distribution'] == 'DIFFUSELY_SCATTERED') & (df['cancer'] == True)]


The breast density counts below are for PLEOMORPHIC calcification cases.

In [None]:
df_similar_type['breast_density'].value_counts().to_frame()

Statistically, breast densities of 2, 3, and 4 are equally probable for malignant PLEOMORPHIC cases. Let's see if calcification distribution is more discriminative.

In [None]:
df_similar_dist['breast_density'].value_counts().to_frame()

We only have five DIFFUSELY_SCATTERED calcification cases, two of which are our invalid cases. Two cases have a breast density of 2, and one has a value of 3, so there is no statistically significant difference in breast densities for DIFFUSELY_SCATTERED calcification cases. Though these invalid observations are relatively few, they represent 40% of the DIFFUSELY_SCATTERED calcification cases. We'll attempt to impute these values rather than delete them.
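
The reasoning above can be reproduced on toy data that mirrors the counts described (five DIFFUSELY_SCATTERED cases: two with a density of zero, two with 2, one with 3). This is a sketch only; the DataFrame is hypothetical and stands in for the output of `dq.get_complete_data()`.

```python
import pandas as pd

# Hypothetical toy data mirroring the counts described in the text.
toy = pd.DataFrame(
    {
        "calc distribution": ["DIFFUSELY_SCATTERED"] * 5 + ["CLUSTERED"],
        "breast_density": [0, 0, 2, 2, 3, 4],
    }
)

# Treat the invalid zero as missing rather than as a real density category.
toy["breast_density"] = toy["breast_density"].mask(toy["breast_density"] == 0)

subset = toy.loc[toy["calc distribution"] == "DIFFUSELY_SCATTERED"]
# Share of the subset that would be lost if the invalid rows were deleted.
share_missing = subset["breast_density"].isna().mean()
# Observed densities among the remaining cases.
observed_counts = subset["breast_density"].value_counts().sort_index()

print(share_missing)
print(observed_counts)
```

Marking the zeros as missing keeps the rows available for imputation instead of discarding a large share of an already small group.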

#### Invalid Mass Case Analysis
The invalid mass cases have subtlety values of zero. Let's take a look at the data.

In [None]:
dq.get_invalid_data(subset='subtlety')

Similarly, these cases are for the same patient. Notably, they convey little information, as both mass shape and mass margins are missing. In this case, deletion would be the best option.

## Series Dataset
### Completeness Analysis

In [None]:
# NOTE: `fpd`, the file path passed to DicomDQA, is not defined in this section.
dqd = DicomDQA(filepath=fpd)
dqdc = dqd.analyze_completeness()
print(dqdc.summary)

According to The Cancer Imaging Archive (TCIA), there are 10,239 images in the CBIS-DDSM. Yet our DICOM Dataset contains 10,703 observations at a completeness of 100%, which is 464 more than expected. In addition, a corrupt DICOM image file was encountered during data preparation and was skipped, which brings the overage to 465. Could there be some duplication?

### Uniqueness Analysis

In [None]:
dqdu = dqd.analyze_uniqueness()
print(dqdu.summary)

As indicated above, we have no duplicate records in the dataset.

### Validity Analysis

In [None]:
dqdv = dqd.analyze_validity()
print(dqdv.summary)

The DICOM Dataset has 100% validity. 

## Summary
A brief data quality analysis of the CBIS-DDSM case and DICOM data examined completeness, uniqueness, validity and consistency.

**Data Preparation Tasks**
1. The following observations have breast density values of zero, which will be treated as missing:
   1. P_01743_RIGHT_calcification_CC_1
   2. P_01743_RIGHT_calcification_MLO_1
2. For calcification type, we'll perform the following replacements:
   1. 'LUCENT_CENTERED' for 'LUCENT_CENTER'
   2. 'PLEOMORPHIC' for 'PLEOMORPHIC-PLEOMORPHIC'
3. We will use Multivariate Imputation by Chained Equations (MICE) to predict missing and invalid values for the following variables:
   1. calc distribution
   2. mass margins
   3. mass shape
   4. calc type
   5. breast_density
4. The following cases will be deleted from the dataset.
   1. P_00710_RIGHT_mass_MLO_1
   2. P_00710_RIGHT_mass_CC_1
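
As an illustration of the deletion task, the sketch below drops cases by identifier. It assumes, without confirmation from the source, that a case identifier is the underscore-joined concatenation of the patient, laterality, abnormality type, image view, and abnormality id columns; the helper `drop_cases` and the toy rows are hypothetical.

```python
import pandas as pd

CASES_TO_DROP = ["P_00710_RIGHT_mass_MLO_1", "P_00710_RIGHT_mass_CC_1"]

# Assumption: these columns, joined by underscores, form the case identifier.
ID_PARTS = [
    "patient_id",
    "left or right breast",
    "abnormality type",
    "image view",
    "abnormality id",
]


def drop_cases(df: pd.DataFrame, case_ids: list) -> pd.DataFrame:
    """Return a copy of `df` without the rows whose case id is in `case_ids`."""
    case_id = df[ID_PARTS].astype(str).agg("_".join, axis=1)
    return df.loc[~case_id.isin(case_ids)].reset_index(drop=True)


# Hypothetical toy rows: two cases slated for deletion and one to keep.
toy = pd.DataFrame(
    {
        "patient_id": ["P_00710", "P_00710", "P_00001"],
        "left or right breast": ["RIGHT", "RIGHT", "LEFT"],
        "abnormality type": ["mass", "mass", "mass"],
        "image view": ["MLO", "CC", "CC"],
        "abnormality id": [1, 1, 1],
    }
)
kept = drop_cases(toy, CASES_TO_DROP)
print(kept)
```

If the dataset already carries an explicit case identifier column, filtering on that column directly would be simpler than rebuilding the identifier.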