# CBIS-DDSM Data Quality Analysis (DQA)
This section aims to identify any data issues that may affect downstream analyses and modeling tasks. Our approach covers four dimensions of data quality.

| # | Dimension | Definition | Metric | Metric Definition |
|---|----------------|-------------------------------------|----------------|-------------------------------------|
| 1 | Completeness | Are the data complete? | Completeness Ratio | Ratio of non-null values over the total number of data cells |
| 2 | Uniqueness | Are there duplicate records? | Uniqueness Ratio | Ratio of unique records over total records |
| 3 | Validity   | Are the values consistent with defined rules? | Validity Ratio | Ratio of valid cells over total number of data cells |
| 4 | Consistency   | Are the data consistent across datasets? | Consistency Ratio | Ratio of consistent data pairs over total data pairs evaluated |
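
As a rough illustration of how the first three ratios in the table might be computed, here is a minimal pandas sketch. The toy DataFrame, its values, and the single validity rule (image_view must be 'CC' or 'MLO') are assumptions made for the example; they are not taken from the CBIS-DDSM files or from the `bcd` package.

```python
import pandas as pd

# Toy data standing in for a case dataset; values are illustrative only.
df = pd.DataFrame(
    {
        "patient_id": ["P_1", "P_2", "P_2", "P_3"],
        "image_view": ["CC", "MLO", "MLO", "XX"],
        "subtlety": [3, 5, 5, None],
    }
)

# Completeness: non-null cells over all cells.
completeness = df.notna().sum().sum() / df.size

# Uniqueness: distinct rows over all rows.
uniqueness = len(df.drop_duplicates()) / len(df)

# Validity, shown here for a single column under an assumed rule:
# image_view must be 'CC' or 'MLO'.
valid_views = df["image_view"].isin(["CC", "MLO"])
validity = valid_views.sum() / len(df)

print(round(completeness, 2), uniqueness, validity)
```

A full validity ratio would apply one rule per column and divide the count of valid cells by the total number of cells, as defined in the table.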

Datasets in scope for this analysis are as follows:

| # | File Name | Description |
|---|----------------|-------------------------------------|
| 1 | calc_cases.csv | Calcification Case Descriptive Data |
| 2 | mass_cases.csv | Mass Case Descriptive Data          |
| 3 | series.csv     | Series Metadata                     |
| 4 | dicom.csv      | DICOM Image Metadata                |

This DQA comprises intra-dataset and inter-dataset analyses. The former evaluates each of the above datasets separately for completeness, uniqueness, and validity. The latter assesses the consistency between the case description datasets and the DICOM image dataset. Concretely, we confirm that patient_id, body part (left or right breast), and image view ('CC' or 'MLO') agree between the two.
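
The inter-dataset check can be sketched as a join followed by a field-by-field comparison. In the sketch below, the `key` column used to pair case records with DICOM records is hypothetical, and the rows are toy values. The column names `left_or_right_breast`, `image_view`, `side`, and `view` follow the dataset summaries in this section.

```python
import pandas as pd

# Toy case records; 'key' is a hypothetical pairing column.
cases = pd.DataFrame(
    {
        "key": ["a", "b", "c"],
        "patient_id": ["P_00001", "P_00002", "P_00003"],
        "left_or_right_breast": ["LEFT", "RIGHT", "LEFT"],
        "image_view": ["CC", "MLO", "CC"],
    }
)

# Toy DICOM metadata; the third record disagrees on breast side.
dicom = pd.DataFrame(
    {
        "key": ["a", "b", "c"],
        "patient_id": ["P_00001", "P_00002", "P_00003"],
        "side": ["LEFT", "RIGHT", "RIGHT"],
        "view": ["CC", "MLO", "CC"],
    }
)

# Pair the records, then require agreement on all three fields.
pairs = cases.merge(dicom, on="key", suffixes=("_case", "_dicom"))
agree = (
    (pairs["patient_id_case"] == pairs["patient_id_dicom"])
    & (pairs["left_or_right_breast"] == pairs["side"])
    & (pairs["image_view"] == pairs["view"])
)

# Consistency ratio: consistent pairs over pairs evaluated.
consistency = agree.mean()
print(round(consistency, 2))
```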

Note that two additional dimensions of data quality are missing from this analysis: accuracy and relevance. Accuracy measures the degree to which the data reflect the real world. Evaluating it requires an external source of truth against which the data are compared. As we lack such a source, we have no basis for evaluating the accuracy of the CBIS-DDSM collection. Relevance measures the alignment between the content of the data and its purpose. In the prior data preparation section, we removed or transformed data not relevant to that purpose.

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
import numpy as np

from bcd.data.case import MassCaseDataset, CalcCaseDataset
from bcd.data.series import SeriesDataset
from bcd.data.dicom import DicomDataset

In [2]:
# Filepaths
fpc = "data/staged/calc_cases.csv"
fpm = "data/staged/mass_cases.csv"
fps = "data/staged/series.csv"
fpd = "data/staged/dicom.csv"

## Intra-Dataset Data Quality Analysis
Here, we evaluate each dataset for completeness, uniqueness, and validity.

### Calcification Cases
Let's summarize the calcification case dataset.

In [3]:
dsc = CalcCaseDataset(filepath=fpc)
dsc.info

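|   | Column | DataType | Valid | Null | Validity | Cardinality | Percent Unique | Size |
|---|--------|----------|-------|------|----------|-------------|----------------|------|
| 0 | patient_id | object | 1872 | 0 | 1.0 | 753 | 0.4 | 119808 |
| 1 | breast_density | category | 1872 | 0 | 1.0 | 5 | 0.0 | 2334 |
| 2 | left_or_right_breast | category | 1872 | 0 | 1.0 | 2 | 0.0 | 2103 |
| 3 | image_view | category | 1872 | 0 | 1.0 | 2 | 0.0 | 1991 |
| 4 | abnormality_id | category | 1872 | 0 | 1.0 | 7 | 0.0 | 2278 |
| 5 | abnormality_type | category | 1872 | 0 | 1.0 | 1 | 0.0 | 1942 |
| 6 | calc_type | category | 1848 | 24 | 0.99 | 47 | 0.03 | 6710 |
| 7 | calc_distribution | category | 1433 | 439 | 0.77 | 9 | 0.0 | 2804 |
| 8 | assessment | category | 1872 | 0 | 1.0 | 5 | 0.0 | 2334 |
| 9 | pathology | category | 1872 | 0 | 1.0 | 2 | 0.0 | 2109 |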


## Series Metadata
These metadata link study information to the DICOM image file locations. Our analysis focuses on six of the 16 variables in this dataset:
1. subject_id: Identifies a unique subject, abnormality type, and image within the study.
2. series_description: Specifies whether the image is a full mammogram, cropped mammogram, or an ROI mask. 
3. number_of_images: The number of images associated with the series.
4. file_location: The directory containing the DICOM images for the series.
5. casetype: Specifies the abnormality type.
6. fileset: Indicates whether the series is associated with the training or test set.

Select data quality metrics are as follows:

In [4]:
dssm.info

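|   | Column | DataType | Valid | Null | Validity | Cardinality | Percent Unique | Size |
|---|--------|----------|-------|------|----------|-------------|----------------|------|
| 0 | series_uid | object | 6775 | 0 | 1.0 | 6775 | 1.0 | 818100 |
| 1 | collection | object | 6775 | 0 | 1.0 | 1 | 0.0 | 447150 |
| 2 | data_description_uri | object | 6775 | 0 | 1.0 | 1 | 0.0 | 691050 |
| 3 | subject_id | object | 6775 | 0 | 1.0 | 6671 | 0.98 | 591463 |
| 4 | study_uid | object | 6775 | 0 | 1.0 | 6775 | 1.0 | 818042 |
| 5 | study_date | object | 6775 | 0 | 1.0 | 9 | 0.0 | 453925 |
| 6 | series_description | object | 6775 | 0 | 1.0 | 3 | 0.0 | 506311 |
| 7 | modality | object | 6775 | 0 | 1.0 | 1 | 0.0 | 399725 |
| 8 | sop_class_name | object | 6775 | 0 | 1.0 | 1 | 0.0 | 596200 |
| 9 | sop_class_uid | object | 6775 | 0 | 1.0 | 1 | 0.0 | 555550 |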


**Series Metadata Quality Assessment**
1. Accuracy: Our ability to evaluate accuracy is limited by the lack of an external source of truth for validation. That said, data preparation routines written in Python verified the accuracy of file_location and number_of_images. The casetype and fileset values were parsed from the subject_id, which contains those values.
2. Completeness: As indicated above, no null values were found, and the number of observations matches the number of series reported in the TCIA dataset summary {cite}`clarkCancerImagingArchive2013`.
3. Consistency: The file_location and number_of_images values were consistent with the DICOM Image repository.
4. Relevance: This dataset has a relevance score of 0.375, as six out of 16 variables are relevant to downstream analysis.
5. Uniqueness: The cardinality and uniqueness scores are provided above. There were no duplicate records in the dataset.
6. Validity: All nominal identifier variables were consistent with the specified format and structure. The number_of_images values are all 16-bit integers. File locations were both valid and accurate, and all values for casetype and fileset were valid.

## Master Case Metadata 
This combined dataset contains clinical data for all mass and calcification abnormalities in the CBIS-DDSM. Let's summarize the data.

In [6]:
dsmstr.info

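|   | Column | DataType | Valid | Null | Validity | Cardinality | Percent Unique | Size |
|---|--------|----------|-------|------|----------|-------------|----------------|------|
| 0 | ROI_mask_file_path | object | 3568 | 0 | 1.0 | 3568 | 1.0 | 817762 |
| 1 | abnormality_id | int64 | 3568 | 0 | 1.0 | 7 | 0.0 | 28544 |
| 2 | abnormality_type | object | 3568 | 0 | 1.0 | 2 | 0.0 | 234496 |
| 3 | assessment | int64 | 3568 | 0 | 1.0 | 6 | 0.0 | 28544 |
| 4 | breast_density | int64 | 3568 | 0 | 1.0 | 5 | 0.0 | 28544 |
| 5 | cropped_image_file_path | object | 3568 | 0 | 1.0 | 3568 | 1.0 | 817935 |
| 6 | dataset | object | 3568 | 0 | 1.0 | 2 | 0.0 | 229104 |
| 7 | image_file_path | object | 3568 | 0 | 1.0 | 3103 | 0.87 | 809040 |
| 8 | image_view | object | 3568 | 0 | 1.0 | 2 | 0.0 | 212408 |
| 9 | left_or_right_breast | object | 3568 | 0 | 1.0 | 2 | 0.0 | 219397 |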


**Master Case Metadata Quality Assessment**
1. Accuracy: The values for image_file_path, cropped_image_file_path, and ROI_mask_file_path were examined vis-a-vis the DICOM image repository. Zero percent of the file paths were accurate.
2. Completeness: This dataset has a completeness score of 100% as shown above.
3. Consistency: As stated, the three file path variables did not match the file paths in the DICOM image repository. 
4. Relevance: This dataset has a relevancy score of 0.769 as ten of the 13 variables were relevant to downstream analysis.
5. Uniqueness: No duplicate records were found in this dataset. 
6. Validity: All values were evaluated and assessed as valid, except for two observations having subtlety assessments of 0. Subtlety is a measure of how difficult a case is to diagnose and takes on discrete values in [1,5]. Let's take a look at the invalid observations.


In [10]:
dfmstr.loc[dfmstr['subtlety']==0]

|   | ROI_mask_file_path | abnormality_id | abnormality_type | assessment | breast_density | cropped_image_file_path | dataset | image_file_path | image_view | left_or_right_breast | pathology | patient_id | subtlety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2364 | Mass-Training_P_00710_RIGHT_CC_1/1.3.6.1.4.1.9... | 1 | mass | 0 | 2 | Mass-Training_P_00710_RIGHT_CC_1/1.3.6.1.4.1.9... | training | Mass-Training_P_00710_RIGHT_CC/1.3.6.1.4.1.959... | CC | RIGHT | BENIGN | P_00710 | 0 |
| 2365 | Mass-Training_P_00710_RIGHT_MLO_1/1.3.6.1.4.1.... | 1 | mass | 0 | 2 | Mass-Training_P_00710_RIGHT_MLO_1/1.3.6.1.4.1.... | training | Mass-Training_P_00710_RIGHT_MLO/1.3.6.1.4.1.95... | MLO | RIGHT | BENIGN | P_00710 | 0 |


Several observations can be made in this case:
1. Both cases are for the same patient.
2. Both cases have a BI-RADS assessment of zero, which means that a screening mammogram was read and the final assessment was "incomplete-needs further imaging studies" or "needs comparison to prior studies."
3. Both cases have been assessed as benign.

In any event, these are rare cases without any discernible pattern. As such, these data points are designated Missing Completely At Random (MCAR) and will be deleted from the dataset.
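
One way to express this deletion is as a validity filter on subtlety, which takes discrete values in [1,5]. The sketch below uses a toy DataFrame named `cases_demo` (a hypothetical name, with illustrative rows) rather than the actual master dataset.

```python
import pandas as pd

# Toy stand-in for the master case data; rows are illustrative only.
cases_demo = pd.DataFrame(
    {
        "patient_id": ["P_00710", "P_00710", "P_00001", "P_00002"],
        "image_view": ["CC", "MLO", "CC", "MLO"],
        "subtlety": [0, 0, 3, 5],
    }
)

# Subtlety is valid only for integer values from 1 through 5 (inclusive).
is_valid = cases_demo["subtlety"].between(1, 5)

# Keep the valid rows and report how many were dropped.
cleaned = cases_demo.loc[is_valid].reset_index(drop=True)
n_dropped = int((~is_valid).sum())
print(f"Dropped {n_dropped} rows; {len(cleaned)} remain.")
```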

## Calcification Case Metadata
The training and test calcification cases were combined into a single dataset for the purpose of evaluating the variables specific to this abnormality type, namely calcification type and distribution.

In [11]:
dsc.info

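|   | Column | DataType | Valid | Null | Validity | Cardinality | Percent Unique | Size |
|---|--------|----------|-------|------|----------|-------------|----------------|------|
| 0 | patient_id | object | 1872 | 0 | 1.0 | 753 | 0.4 | 119808 |
| 1 | breast_density | int64 | 1872 | 0 | 1.0 | 5 | 0.0 | 14976 |
| 2 | left_or_right_breast | object | 1872 | 0 | 1.0 | 2 | 0.0 | 115062 |
| 3 | image_view | object | 1872 | 0 | 1.0 | 2 | 0.0 | 111432 |
| 4 | abnormality_id | int64 | 1872 | 0 | 1.0 | 7 | 0.0 | 14976 |
| 5 | abnormality_type | object | 1872 | 0 | 1.0 | 1 | 0.0 | 131040 |
| 6 | calc_type | object | 1848 | 24 | 0.99 | 47 | 0.03 | 131685 |
| 7 | calc_distribution | object | 1433 | 439 | 0.77 | 9 | 0.0 | 108898 |
| 8 | assessment | int64 | 1872 | 0 | 1.0 | 5 | 0.0 | 14976 |
| 9 | pathology | object | 1872 | 0 | 1.0 | 2 | 0.0 | 119955 |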


**Calcification Case Metadata Quality Assessment**
All variables, with the exception of calcification type and distribution, have been evaluated in the master case dataset above. Hence, our examination will be restricted to calcification type and distribution.

1. Accuracy: As is the case with the rest of this collection, the lack of an external ground truth leaves us with no basis for evaluating the accuracy of calcification type and distribution.
2. Completeness: Calcification type has a completeness score of 0.99; whereas calcification distribution has a completeness of 0.77.
3. Consistency: No basis for a consistency assessment. 
4. Relevance: Both variables are relevant to the downstream exploratory analyses. 
5. Uniqueness: No duplicate records exist.
6. Validity: No basis for assessing the validity of calcification type and distribution. 
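
The completeness scores quoted above follow directly from the Valid counts in the summary: each is the number of non-null values divided by the 1872 records, rounded to two decimals. A short worked check:

```python
# Counts taken from the calcification case summary above.
total_records = 1872
non_null = {"calc_type": 1848, "calc_distribution": 1433}

# Completeness per column: non-null values over total records.
completeness = {
    column: round(count / total_records, 2)
    for column, count in non_null.items()
}

print(completeness)
# {'calc_type': 0.99, 'calc_distribution': 0.77}
```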


## Mass Case Metadata
The training and test mass cases were combined into a single dataset for the purpose of evaluating the variables specific to this abnormality type, namely mass shape and margins.

In [12]:
dsm.info

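|   | Column | DataType | Valid | Null | Validity | Cardinality | Percent Unique | Size |
|---|--------|----------|-------|------|----------|-------------|----------------|------|
| 0 | patient_id | object | 1696 | 0 | 1.0 | 892 | 0.53 | 108544 |
| 1 | breast_density | int64 | 1696 | 0 | 1.0 | 4 | 0.0 | 13568 |
| 2 | left_or_right_breast | object | 1696 | 0 | 1.0 | 2 | 0.0 | 104335 |
| 3 | image_view | object | 1696 | 0 | 1.0 | 2 | 0.0 | 100976 |
| 4 | abnormality_id | int64 | 1696 | 0 | 1.0 | 6 | 0.0 | 13568 |
| 5 | abnormality_type | object | 1696 | 0 | 1.0 | 1 | 0.0 | 103456 |
| 6 | mass_shape | object | 1692 | 4 | 1.0 | 20 | 0.01 | 113029 |
| 7 | mass_margins | object | 1636 | 60 | 0.96 | 19 | 0.01 | 114964 |
| 8 | assessment | int64 | 1696 | 0 | 1.0 | 6 | 0.0 | 13568 |
| 9 | pathology | object | 1696 | 0 | 1.0 | 2 | 0.0 | 109200 |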


**Mass Case Metadata Quality Assessment**
All variables, with the exception of mass shape and margins, have been evaluated in the master case dataset above. Hence, our examination will be restricted to these two variables.

1. Accuracy: As is the case with the rest of this collection, the lack of an external ground truth leaves us with no basis for evaluating the accuracy of mass shape and margins.
2. Completeness: Mass shape has a completeness score that rounds to 1.0 (4 of its 1696 values are null), whereas mass margins has a completeness of 0.96.
3. Consistency: No basis for a consistency assessment. 
4. Relevance: Both variables are relevant to the downstream exploratory analyses. 
5. Uniqueness: No duplicate records exist.
6. Validity: No basis for assessing the validity of mass shape and margins.


## Image Metadata
This dataset was constructed using the series metadata and data extracted from the DICOM image files. 

In [13]:
dsi.info
corrupt_image = "Mass-Training_P_01382_LEFT_MLO/07-20-2016-DDSM-93921/1.000000-full mammogram images-05891/1-1.dcm"

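|   | Column | DataType | Valid | Null | Validity | Cardinality | Percent Unique | Size |
|---|--------|----------|-------|------|----------|-------------|----------------|------|
| 0 | patient_id | object | 10305 | 0 | 1.0 | 1567 | 0.15 | 659721 |
| 1 | subject_id | object | 10305 | 0 | 1.0 | 6671 | 0.65 | 901460 |
| 2 | series_uid | object | 10305 | 0 | 1.0 | 6775 | 0.66 | 1240751 |
| 3 | description | object | 10305 | 0 | 1.0 | 4 | 0.0 | 760194 |
| 4 | view | object | 10305 | 0 | 1.0 | 3 | 0.0 | 613563 |
| 5 | side | object | 10305 | 0 | 1.0 | 3 | 0.0 | 633614 |
| 6 | casetype | object | 10305 | 0 | 1.0 | 3 | 0.0 | 628873 |
| 7 | fileset | object | 10305 | 0 | 1.0 | 3 | 0.0 | 636991 |
| 8 | filepath | object | 10305 | 0 | 1.0 | 10239 | 0.99 | 1738868 |
| 9 | width | object | 10305 | 0 | 1.0 | 1209 | 0.12 | 625230 |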
