# Data Quality Analysis (DQA)
In the previous section, we began transforming the raw CBIS-DDSM case and DICOM data into a structure and format more suitable for analysis. The purpose of this section is to identify data quality issues, errors, and anomalies that require further preprocessing before the analysis stage.

## Approach
Our approach will touch on three dimensions of data quality.

| # | Dimension | Definition | Metric | Metric Definition |
|---|----------------|-------------------------------------|----------------|-------------------------------------|
| 1 | Completeness | Are the data complete? | Completeness Ratio | Ratio of non-null data values / rows over the total number of data values / rows |
| 2 | Uniqueness | Are there duplicate records? | Uniqueness Ratio | Ratio of unique data values / rows over total number of data values / rows |
| 3 | Validity   | Are the values consistent with defined rules? | Validity Ratio | Ratio of valid cells / rows over total number of data cells / rows |

Note that accuracy and relevance, two additional dimensions of data quality, are not covered by this analysis. An accuracy evaluation requires an external source of truth against which the CBIS-DDSM data can be compared. As we lack such a data source, we have no basis for evaluating the accuracy of the CBIS-DDSM data collection. With respect to relevance, duplicate and irrelevant data were removed from the data collection during the previous data preparation exercise.
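
As a rough illustration of the completeness ratios defined in the table above, the following sketch computes them on a small toy DataFrame with pandas. The data and variable names are illustrative assumptions; this is not the implementation behind the `dqa` module used below.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the case metadata (values are illustrative only).
df = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_2", "P_3", "P_4"],
        "calc_type": ["PUNCTATE", np.nan, "AMORPHOUS", "PUNCTATE"],
        "calc_distribution": ["CLUSTERED", np.nan, np.nan, "LINEAR"],
    }
)

# Record completeness: share of rows with no missing value in any column.
record_completeness = df.notna().all(axis=1).mean()

# Data value completeness: share of non-null cells over all cells.
value_completeness = df.notna().to_numpy().mean()

# Per-column completeness, analogous to a column-level detail view.
column_completeness = df.notna().mean()

print(f"Record completeness:     {record_completeness:.3f}")
print(f"Data value completeness: {value_completeness:.3f}")
print(column_completeness.round(2))
```

Record-level and value-level ratios can differ considerably: a single missing cell makes the whole record incomplete, while it barely moves the value-level ratio.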

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))

import pandas as pd
from IPython.display import HTML, display_html
import numpy as np

from bcd.analyze.dqa.cbis import CBISDQA
from bcd.data.dataset import CBISDataset

pd.set_option('display.max_colwidth', 200)

## Preliminaries
As a first step, we'll create the CBIS-DDSM Dataset object, which exposes the DQA module through its `dqa` attribute.

In [2]:
filepath = "data/meta/2_staged/cbis.csv"
cbis = CBISDataset(filepath=filepath)

## Completeness Analysis

In [3]:
dqc = cbis.dqa.analyze_completeness()
print(dqc.summary, end=" ")



                          Completeness                          
                         Dataset | CBIS-DDSM
                         Records | 3568
                Complete Records | 3043
             Record Completeness | 0.853
                     Data Values | 96336
            Complete Data Values | 95809
         Data Value Completeness | 0.995

 

Record and data value completeness are approximately 0.85 and 0.995, respectively. Let's take a closer look.

In [4]:
dqc.detail

| Variable | N | Complete | Missing | Completeness |
|---|---|---|---|---|
| patient_id | 3568 | 3568 | 0 | 1.0 |
| breast_density | 3568 | 3568 | 0 | 1.0 |
| laterality | 3568 | 3568 | 0 | 1.0 |
| image_view | 3568 | 3568 | 0 | 1.0 |
| abnormality_id | 3568 | 3568 | 0 | 1.0 |
| abnormality_type | 3568 | 3568 | 0 | 1.0 |
| calc_type | 3568 | 3544 | 24 | 0.99 |
| calc_distribution | 3568 | 3129 | 439 | 0.88 |
| assessment | 3568 | 3568 | 0 | 1.0 |
| pathology | 3568 | 3568 | 0 | 1.0 |


Missingness for calc_type, mass_shape, and mass_margins is 1%, < 1%, and 2%, respectively. Strictly speaking, we could drop those observations without a considerable loss of data. On the other hand, calc_distribution missingness, at about 12% (439 of 3,568 records), deserves a closer look. Let's examine missingness vis-a-vis the pathology target.

In [5]:
cbis.dqa.get_incomplete_data(subset='calc_distribution')['pathology'].value_counts().to_frame()

| pathology | count |
|---|---|
| BENIGN_WITHOUT_CALLBACK | 434 |
| BENIGN | 5 |


This is interesting. Nearly 99% (434 of 439) of the records with missing calcification distribution have a pathology of BENIGN_WITHOUT_CALLBACK, and the remaining five are BENIGN. This suggests that the calcification distribution data are Missing at Random (MAR): the probability that calcification distribution is missing appears to be systematically related to a benign pathology, which is itself observed. Because the missingness seems to depend on *observed* data, multiple imputation by chained equations (MICE) {cite}`schaferMultipleImputationPrimer1999` is a principled method for handling such missing data while mitigating data loss and bias in the dataset.
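
The relationship between missingness and the target can be checked with a simple cross-tabulation. The sketch below uses a toy DataFrame whose column names mirror the case metadata; the values and the `distribution_missing` indicator name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; column names mirror the case metadata, values are illustrative.
df = pd.DataFrame(
    {
        "calc_distribution": ["CLUSTERED", np.nan, np.nan, "LINEAR", np.nan, "SEGMENTAL"],
        "pathology": [
            "MALIGNANT",
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN",
            "BENIGN",
            "MALIGNANT",
        ],
    }
)

# Flag rows where the feature is missing, then compare against the target.
is_missing = df["calc_distribution"].isna().rename("distribution_missing")
table = pd.crosstab(df["pathology"], is_missing)

# Share of rows with a missing distribution within each pathology class.
missing_rate = is_missing.groupby(df["pathology"]).mean()

print(table)
print(missing_rate)
```

A missing rate that varies strongly across pathology classes is consistent with missingness that depends on an observed variable, though a cross-tabulation alone cannot rule out dependence on unobserved values.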

Let's move on to the uniqueness analysis.

## Uniqueness Analysis

In [6]:
dqu = cbis.dqa.analyze_uniqueness()
print(dqu.summary)



                           Uniqueness                           
                         Dataset | CBIS-DDSM
                         Records | 3568
                  Unique Records | 3568
               Record Uniqueness | 1.0
                     Data Values | 96336
              Unique Data Values | 23101
           Data Value Uniqueness | 0.24




No duplicate records exist in the data.
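
For reference, here is a minimal sketch of how uniqueness ratios of this kind could be computed with pandas. The record-level ratio follows the definition given in the Approach table. The value-level ratio shown here (distinct values per column, summed over columns) is an assumed interpretation, since the definition used by `analyze_uniqueness` is not shown in this section.

```python
import pandas as pd

# Toy data (illustrative only); the last two rows are exact duplicates.
df = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_1", "P_2", "P_2"],
        "image_view": ["CC", "MLO", "CC", "CC"],
        "laterality": ["LEFT", "LEFT", "RIGHT", "RIGHT"],
    }
)

# Record uniqueness: distinct rows over total rows.
n_unique_records = len(df.drop_duplicates())
record_uniqueness = n_unique_records / len(df)

# Data value uniqueness (assumed definition): distinct values per column,
# summed over columns, divided by the total number of cells.
n_unique_values = df.nunique().sum()
value_uniqueness = n_unique_values / df.size

print(f"Record uniqueness:     {record_uniqueness:.2f}")
print(f"Data value uniqueness: {value_uniqueness:.2f}")
```

A low value-level ratio is expected for data dominated by categorical variables and does not by itself indicate a quality problem; the record-level ratio is the one that detects duplicates.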

## Validity Analysis
Here, we assess the degree to which the dataset contains valid values.

In [7]:
dqv = cbis.dqa.analyze_validity()
print(dqv.summary)



                            Validity                            
                         Dataset | CBIS-DDSM
                         Records | 3568
                   Valid Records | 3009
                 Record Validity | 0.843
                     Data Values | 96336
               Valid Data Values | 95608
             Data Value Validity | 0.992




Record and data value validity are 0.84 and 0.99, respectively. Note that missing values count as invalid, so missing data lowers record validity. Still, let's take a closer look at the invalid data.

In [8]:
dqv.detail

| Variable | N | Valid | Invalid | Validity |
|---|---|---|---|---|
| mmg_id | 3568 | 3568 | 0 | 1.0 |
| patient_id | 3568 | 3568 | 0 | 1.0 |
| breast_density | 3568 | 3566 | 2 | 1.0 |
| laterality | 3568 | 3568 | 0 | 1.0 |
| image_view | 3568 | 3568 | 0 | 1.0 |
| abnormality_id | 3568 | 3568 | 0 | 1.0 |
| abnormality_type | 3568 | 3568 | 0 | 1.0 |
| calc_type | 3568 | 3347 | 221 | 0.94 |
| calc_distribution | 3568 | 3129 | 439 | 0.88 |
| mass_shape | 3568 | 3564 | 4 | 1.0 |


The invalidity of calc_distribution and mass_margins is explained by the missing data. Approximately 6% of the observations (221 of 3,568) have invalid calcification types. Let's take a look.

In [9]:
cbis.dqa.get_invalid_data(subset='calc_type')['calc_type'].unique()

array([nan, 'ROUND_AND_REGULAR-LUCENT_CENTER-DYSTROPHIC',
       'PUNCTATE-LUCENT_CENTER',
       'VASCULAR-COARSE-LUCENT_CENTER-ROUND_AND_REGULAR-PUNCTATE',
       'LUCENT_CENTER', 'ROUND_AND_REGULAR-LUCENT_CENTER',
       'LUCENT_CENTER-PUNCTATE', 'PLEOMORPHIC-PLEOMORPHIC',
       'COARSE-ROUND_AND_REGULAR-LUCENT_CENTER',
       'ROUND_AND_REGULAR-LUCENT_CENTER-PUNCTATE', 'COARSE-LUCENT_CENTER'],
      dtype=object)

A common thread among these invalid values is the type 'LUCENT_CENTER', which should be 'LUCENT_CENTERED'. In addition, 'PLEOMORPHIC-PLEOMORPHIC' should be 'PLEOMORPHIC', and some values are NaN. We'll make those changes in the next section. Now, let's examine the invalid breast density records.
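
A validity check of this kind can be sketched as a membership test against a set of allowed values. The `VALID_CALC_TYPES` set below is a hypothetical, deliberately incomplete vocabulary used only for illustration; the actual rules applied by `analyze_validity` are not shown in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of allowed calcification types, for illustration only.
VALID_CALC_TYPES = {"PLEOMORPHIC", "PUNCTATE", "LUCENT_CENTERED"}

calc_type = pd.Series(
    ["PLEOMORPHIC", "LUCENT_CENTER", "PLEOMORPHIC-PLEOMORPHIC", np.nan, "PUNCTATE"],
    name="calc_type",
)

# A value is valid only if it is present and belongs to the allowed set;
# isin() returns False for NaN, so missing values count as invalid.
is_valid = calc_type.isin(VALID_CALC_TYPES)
validity = is_valid.mean()

# Distinct offending values, including NaN.
invalid_values = calc_type[~is_valid].unique()

print(f"Validity: {validity:.2f}")
print(invalid_values)
```

Treating NaN as invalid mirrors the observation above that missing data reduces validity.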

In [10]:
cbis.dqa.get_invalid_data(subset='breast_density')

|  | patient_id | breast_density | laterality | image_view | abnormality_id | abnormality_type | calc_type | calc_distribution | assessment | pathology | ... | rows | cols | aspect_ratio | size | file_size | min_pixel_value | max_pixel_value | mean_pixel_value | std_pixel_value | filepath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1817 | P_01743 | 0 | RIGHT | CC | 1 | calcification | PLEOMORPHIC | DIFFUSELY_SCATTERED | 5 | MALIGNANT | ... | 4576 | 1831 | 0.4 | 8378656 | 16758290 | 0 | 65535 | 10701.68 | 19437.31 | data/image/0_raw/CBIS-DDSM/Calc-Test_P_01743_RIGHT_CC/08-29-2017-DDSM-78699/1.000000-full mammogram images-62009/1-1.dcm |
| 1818 | P_01743 | 0 | RIGHT | MLO | 1 | calcification | PLEOMORPHIC | DIFFUSELY_SCATTERED | 5 | MALIGNANT | ... | 5176 | 2716 | 0.52 | 14058016 | 28117014 | 0 | 65535 | 14108.36 | 20763.14 | data/image/0_raw/CBIS-DDSM/Calc-Test_P_01743_RIGHT_MLO/08-29-2017-DDSM-15483/1.000000-full mammogram images-98404/1-1.dcm |


Both observations with a breast density of zero belong to the same patient. Let's see what densities we have for similar malignant calcification cases.

In [11]:
df = cbis.dqa.get_complete_data()
df_similar_type = df.loc[(df['calc_type'] == 'PLEOMORPHIC') & (df['cancer'] == True)]
df_similar_dist = df.loc[(df['calc_distribution'] == 'DIFFUSELY_SCATTERED') & (df['cancer'] == True)]

The breast density counts below are for malignant PLEOMORPHIC calcification cases.

In [12]:
df_similar_type['breast_density'].value_counts().to_frame()

| breast_density | count |
|---|---|
| 3 | 158 |
| 2 | 123 |
| 4 | 121 |
| 1 | 23 |
| 0 | 2 |


Breast densities of 2, 3, and 4 account for 123, 158, and 121 of the 427 malignant PLEOMORPHIC cases, respectively, so no single density clearly dominates. Let's see if calcification distribution is more discriminative.
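
To make the comparison concrete, the counts reported above can be converted to shares. This small sketch simply re-enters the counts from the output; with the dataset loaded, `value_counts(normalize=True)` on the filtered DataFrame would give the same proportions.

```python
import pandas as pd

# Counts copied from the output above (malignant PLEOMORPHIC cases).
counts = pd.Series({3: 158, 2: 123, 4: 121, 1: 23, 0: 2}, name="count")
counts.index.name = "breast_density"

# Share of cases at each breast density level.
shares = (counts / counts.sum()).round(3)
print(shares)
```

Densities 2, 3, and 4 together cover roughly 94% of these cases, with density 3 the most frequent at about 37%.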

In [13]:
df_similar_dist['breast_density'].value_counts().to_frame()

| breast_density | count |
|---|---|
| 2 | 2 |
| 0 | 2 |
| 3 | 1 |


We have only five malignant DIFFUSELY_SCATTERED calcification cases, two of which are our invalid cases. Of the remaining three, two have a breast density of 2 and one has a value of 3, which is too few observations to indicate a likely density. Though the invalid observations are few, they represent 40% of the malignant DIFFUSELY_SCATTERED calcification cases. We'll attempt to impute these values rather than delete them.

### Invalid Mass Case Analysis
The invalid mass cases have subtlety values of zero. Let's take a look at the data.

In [14]:
cbis.dqa.get_invalid_data(subset='subtlety')

|  | patient_id | breast_density | laterality | image_view | abnormality_id | abnormality_type | calc_type | calc_distribution | assessment | pathology | ... | rows | cols | aspect_ratio | size | file_size | min_pixel_value | max_pixel_value | mean_pixel_value | std_pixel_value | filepath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2364 | P_00710 | 2 | RIGHT | CC | 1 | mass | NOT APPLICABLE | NOT APPLICABLE | 0 | BENIGN | ... | 4728 | 3112 | 0.66 | 14713536 | 29428168 | 0 | 65535 | 6370.67 | 12047.63 | data/image/0_raw/CBIS-DDSM/Mass-Training_P_00710_RIGHT_CC/07-20-2016-DDSM-01521/1.000000-full mammogram images-11222/1-1.dcm |
| 2365 | P_00710 | 2 | RIGHT | MLO | 1 | mass | NOT APPLICABLE | NOT APPLICABLE | 0 | BENIGN | ... | 4688 | 3120 | 0.67 | 14626560 | 29254222 | 0 | 65535 | 10743.82 | 15182.58 | data/image/0_raw/CBIS-DDSM/Mass-Training_P_00710_RIGHT_MLO/07-20-2016-DDSM-11749/1.000000-full mammogram images-15311/1-1.dcm |


As with the invalid breast density records, both of these cases belong to the same patient. Notably, they convey little information, since the mass shape and mass margins data are missing. In this case, deletion would be the best option.

## Summary
A brief data quality analysis of the CBIS-DDSM case and DICOM data examined completeness, uniqueness, and validity. Our data cleaning tasks are as follows:

1. The following observations have zero for breast density, values that will be marked as missing and imputed.
   1. P_01743_RIGHT_calcification_CC_1
   2. P_01743_RIGHT_calcification_MLO_1
2. The following cases have zero subtlety. These values will be marked as missing and imputed.
   1. P_00710_RIGHT_mass_MLO_1
   2. P_00710_RIGHT_mass_CC_1
3. For calcification type, we'll perform the following replacements:
   1. 'LUCENT_CENTERED' for 'LUCENT_CENTER'
   2. 'PLEOMORPHIC' for 'PLEOMORPHIC-PLEOMORPHIC'
4. We will use Multivariate Imputation by Chained Equations (MICE) to predict missing values for the following variables:
   1. calc_distribution
   2. mass_margins
   3. mass_shape
   4. calc_type
   5. breast_density
   6. subtlety
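
The first three tasks can be sketched with pandas as follows. This is a minimal illustration on toy data, not the cleaning code used in the next section; the imputation step (task 4) is intentionally left out.

```python
import pandas as pd

# Toy data (illustrative only) containing the kinds of issues listed above.
df = pd.DataFrame(
    {
        "breast_density": [0, 2, 3],
        "subtlety": [4, 0, 5],
        "calc_type": [
            "PUNCTATE-LUCENT_CENTER",
            "PLEOMORPHIC-PLEOMORPHIC",
            "LUCENT_CENTER",
        ],
    }
)

# Tasks 1 and 2: zero is not a valid level, so mark it as missing (NaN).
for column in ["breast_density", "subtlety"]:
    df[column] = df[column].mask(df[column] == 0)

# Task 3: normalise calcification type labels. The negative lookahead
# leaves an already-correct LUCENT_CENTERED untouched.
df["calc_type"] = (
    df["calc_type"]
    .str.replace(r"LUCENT_CENTER(?!ED)", "LUCENT_CENTERED", regex=True)
    .replace("PLEOMORPHIC-PLEOMORPHIC", "PLEOMORPHIC")
)

print(df)
```

Marking zeros as missing before imputation keeps the invalid codes from being treated as legitimate levels by whatever imputation model is fitted afterwards.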