# CBIS-DDSM Data Quality Analysis (DQA)

This section aims to identify any data issues that may affect downstream analyses and modeling efforts. Our approach will touch on four dimensions of data quality. 

| # | Dimension | Definition | Metric | Metric Definition |
|---|----------------|-------------------------------------|----------------|-------------------------------------|
| 1 | Completeness | Are the data complete? | Completeness Ratio | Ratio of non-null values / rows over the total number of data cells / rows |
| 2 | Uniqueness | Are there duplicate records? | Uniqueness Ratio | Ratio of unique values / rows over total number of cells / rows |
| 3 | Validity   | Are the values consistent with defined rules? | Validity Ratio | Ratio of valid cells / rows over total number of data cells / rows |
| 4 | Consistency   | Are the data consistent across datasets? | Consistency Ratio | Ratio of consistent data pairs over total data pairs evaluated |
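
The row-level and cell-level ratios in the table can be sketched with plain pandas. This is a minimal illustration on a hypothetical toy DataFrame, not the `bcd` implementation; the values below are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not drawn from CBIS-DDSM.
df = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3", "P4"],
        "calc_type": ["AMORPHOUS", np.nan, "PLEOMORPHIC", "ROUND"],
        "calc_distribution": ["CLUSTERED", np.nan, np.nan, "LINEAR"],
    }
)

# Row completeness: share of rows with no missing values.
rows_complete = int(df.notna().all(axis=1).sum())
row_completeness = rows_complete / len(df)

# Cell completeness: share of non-null cells over all cells.
cells_complete = int(df.notna().sum().sum())
cell_completeness = cells_complete / df.size

print(f"Row completeness:  {row_completeness:.3f}")
print(f"Cell completeness: {cell_completeness:.3f}")
```

In this toy case two of four rows are fully populated, while nine of twelve cells are non-null, so the cell-level ratio is higher than the row-level ratio.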


Datasets in scope for this analysis are as follows:

| # | File Name | Description |
|---|----------------|-------------------------------------|
| 1 | calc_cases.csv | Calcification Case Descriptive Data |
| 2 | mass_cases.csv | Mass Case Descriptive Data          |
| 3 | series.csv     | Series Metadata                     |
| 4 | dicom.csv      | DICOM Image Metadata                |

Completeness, uniqueness, and validity analyses will be conducted on each dataset separately. We'll then evaluate the degree to which the mass and calcification case data are consistent with the DICOM metadata.

Note that accuracy and relevance, two additional dimensions of data quality, are excluded from this analysis. An accuracy evaluation requires an external source of truth against which the CBIS-DDSM data can be matched. As we lack such a data source, we have no basis for evaluating the accuracy of the CBIS-DDSM data collection. With respect to relevance, duplicate and irrelevant data were removed from the data collection during the previous data preparation exercise.

In [48]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
import numpy as np

from bcd.data.case.dqa import MassCaseDQA, CalcCaseDQA
from bcd.data.series.dqa import SeriesDQA
from bcd.data.dicom.dqa import DicomDQA

In [49]:
# Filepaths
fpc = "data/staged/calc_cases.csv"
fpm = "data/staged/mass_cases.csv"
fps = "data/staged/series.csv"
fpd = "data/staged/dicom.csv"

## Calcification Cases
### Completeness Analysis

In [50]:
dqa = CalcCaseDQA(filepath=fpc)
calc_completeness = dqa.analyze_completeness()
print(calc_completeness.summary)



                          Completeness                          
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                            Rows | 1872
                   Rows Complete | 1409
                Row Completeness | 0.753
                           Cells | 28080
                  Cells Complete | 27617
               Cell Completeness | 0.984




Here, we've summarized dataset completeness at the row and cell levels. We have a completeness of 0.75 and 0.98 for rows and cells respectively. Let's drill down to the variable level.

In [51]:
calc_completeness.detail

| Variable | N | Complete | Completeness |
|----------------------|------|------|------|
| patient_id | 1872 | 1872 | 1.0 |
| breast_density | 1872 | 1872 | 1.0 |
| left_or_right_breast | 1872 | 1872 | 1.0 |
| image_view | 1872 | 1872 | 1.0 |
| abnormality_id | 1872 | 1872 | 1.0 |
| abnormality_type | 1872 | 1872 | 1.0 |
| calc_type | 1872 | 1848 | 0.99 |
| calc_distribution | 1872 | 1433 | 0.77 |
| assessment | 1872 | 1872 | 1.0 |
| pathology | 1872 | 1872 | 1.0 |


The sources of 'incompleteness' are calcification type (24 missing values, completeness 0.99) and calcification distribution (439 missing values, completeness 0.77). Together they account for all 463 missing cells.
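
A per-variable breakdown of this kind can be approximated with `DataFrame.notna()`. The sketch below runs on a hypothetical toy DataFrame; the `detail` layout mirrors the N / Complete / Completeness columns but is an assumption, not code taken from the `bcd` package.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not drawn from CBIS-DDSM.
df = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P3", "P4"],
        "calc_type": ["AMORPHOUS", "ROUND", np.nan, "ROUND"],
        "calc_distribution": ["CLUSTERED", np.nan, np.nan, "LINEAR"],
    }
)

# Count non-null values per variable and express them as a ratio of N.
detail = pd.DataFrame({"N": len(df), "Complete": df.notna().sum()})
detail["Completeness"] = (detail["Complete"] / detail["N"]).round(2)

# Keep only the variables that have at least one missing value.
incomplete = detail[detail["Completeness"] < 1.0]
print(incomplete)
```

Filtering on `Completeness < 1.0` isolates the variables that need attention without scanning the full table.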

### Uniqueness Analysis

In [52]:
calc_uniqueness = dqa.analyze_uniqueness()
print(calc_uniqueness.summary)



                           Uniqueness                           
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                            Rows | 1872
                     Unique Rows | 1872
                  Row Uniqueness | 1.0
                           Cells | 28080
                    Unique Cells | 6095
                 Cell Uniqueness | 0.217




As indicated, the dataset contains no duplicate records.
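
For reference, a duplicate check of this kind can be sketched in pandas as follows. The toy data are hypothetical, and the cell-level measure here (distinct values per column, summed, over total cells) is an assumption about how a cell uniqueness ratio might be defined; the `bcd` implementation may differ.

```python
import pandas as pd

# Hypothetical toy data; the last two rows are exact duplicates.
df = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2", "P2"],
        "image_view": ["CC", "MLO", "CC", "CC"],
        "pathology": ["BENIGN", "BENIGN", "MALIGNANT", "MALIGNANT"],
    }
)

# Row uniqueness: distinct rows over total rows.
unique_rows = len(df.drop_duplicates())
row_uniqueness = unique_rows / len(df)

# Cell uniqueness (assumed definition): distinct values per column,
# summed across columns, over the total number of cells.
unique_cells = int(df.nunique().sum())
cell_uniqueness = unique_cells / df.size

print(f"Row uniqueness:  {row_uniqueness:.3f}")
print(f"Cell uniqueness: {cell_uniqueness:.3f}")
```

Under this definition a low cell uniqueness is expected for categorical data, since each variable takes only a handful of distinct values; row uniqueness is the figure that signals duplicate records.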

### Validity Analysis

In [53]:
calc_validity = dqa.analyze_validity()
print(calc_validity.summary)



                            Validity                            
                         Dataset | Calc Cases
                        Filename | calc_cases.csv
                            Rows | 1872
                      Rows Valid | 1331
                    Row Validity | 0.711
                           Cells | 28080
                     Cells Valid | 27539
                   Cell Validity | 0.981




Approximately 71% of the rows are valid cases. Let's check the distribution of the invalid cells among the variables in the dataset.

In [54]:
calc_validity.detail

| Variable | N | Valid | Validity |
|----------------------|------|------|------|
| patient_id | 1872 | 1872 | 1.0 |
| breast_density | 1872 | 1870 | 1.0 |
| left_or_right_breast | 1872 | 1872 | 1.0 |
| image_view | 1872 | 1872 | 1.0 |
| abnormality_id | 1872 | 1872 | 1.0 |
| abnormality_type | 1872 | 1872 | 1.0 |
| calc_type | 1872 | 1848 | 0.99 |
| calc_distribution | 1872 | 1433 | 0.77 |
| assessment | 1872 | 1796 | 0.96 |
| pathology | 1872 | 1872 | 1.0 |


Missing calcification distribution data have a material impact on the validity of the calcification dataset, and 76 invalid assessment values contribute as well. Let's move on to the mass cases.
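
A rule-based validity check can be sketched as a mapping from variable to allowed values. The rules and data below are hypothetical and chosen only for illustration; they are not the rules applied by `CalcCaseDQA`. As in the results above, a missing value fails its rule and is counted as invalid.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not drawn from CBIS-DDSM.
df = pd.DataFrame(
    {
        "breast_density": [1, 2, 5, 3],
        "image_view": ["CC", "MLO", "CC", np.nan],
        "assessment": [0, 4, 2, 7],
    }
)

# Hypothetical rules: the allowed values are assumptions for illustration.
rules = {
    "breast_density": [1, 2, 3, 4],
    "image_view": ["CC", "MLO"],
    "assessment": [0, 1, 2, 3, 4, 5],
}

# A cell is valid when its value is in the allowed set. NaN never matches,
# so missing values are counted as invalid.
valid = pd.DataFrame({col: df[col].isin(allowed) for col, allowed in rules.items()})

row_validity = valid.all(axis=1).mean()    # rows with every cell valid
cell_validity = valid.to_numpy().mean()    # valid cells over all cells
per_variable = valid.mean()                # validity ratio per variable

print(f"Row validity:  {row_validity:.3f}")
print(f"Cell validity: {cell_validity:.3f}")
print(per_variable)
```

A single invalid cell invalidates its whole row, which is why row validity falls faster than cell validity.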

## Mass Cases
### Completeness Analysis

In [55]:
dqa = MassCaseDQA(filepath=fpm)
mass_completeness = dqa.analyze_completeness()
print(mass_completeness.summary)



                          Completeness                          
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                            Rows | 1696
                   Rows Complete | 1634
                Row Completeness | 0.963
                           Cells | 25440
                  Cells Complete | 25376
               Cell Completeness | 0.997




Row and cell completeness are 0.963 and 0.997, respectively.

In [56]:
mass_completeness.detail

| Variable | N | Complete | Completeness |
|----------------------|------|------|------|
| patient_id | 1696 | 1696 | 1.0 |
| breast_density | 1696 | 1696 | 1.0 |
| left_or_right_breast | 1696 | 1696 | 1.0 |
| image_view | 1696 | 1696 | 1.0 |
| abnormality_id | 1696 | 1696 | 1.0 |
| abnormality_type | 1696 | 1696 | 1.0 |
| mass_shape | 1696 | 1692 | 1.0 |
| mass_margins | 1696 | 1636 | 0.96 |
| assessment | 1696 | 1696 | 1.0 |
| pathology | 1696 | 1696 | 1.0 |


Of the 1696 observations, 60 have missing mass margin data and 4 have missing mass shape data.

### Uniqueness Analysis

In [57]:
mass_uniqueness = dqa.analyze_uniqueness()
print(mass_uniqueness.summary)



                           Uniqueness                           
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                            Rows | 1696
                     Unique Rows | 1696
                  Row Uniqueness | 1.0
                           Cells | 25440
                    Unique Cells | 5946
                 Cell Uniqueness | 0.234




The mass cases dataset contains no duplicate rows.

### Validity Analysis

In [58]:
mass_validity = dqa.analyze_validity()
print(mass_validity.summary)



                            Validity                            
                         Dataset | Mass Cases
                        Filename | mass_cases.csv
                            Rows | 1696
                      Rows Valid | 1478
                    Row Validity | 0.871
                           Cells | 25440
                     Cells Valid | 25212
                   Cell Validity | 0.991




The high cell validity vis-a-vis the row validity indicates that our validity issues center on relatively few variables in the dataset.
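
The relationship between the two ratios can be checked with simple arithmetic on the summary figures. The snippet below only restates numbers from the mass case validity summary; the variable names are illustrative.

```python
# Figures copied from the mass case validity summary.
rows, rows_valid = 1696, 1478
cells, cells_valid = 25440, 25212

columns = cells // rows                 # variables per row
rows_invalid = rows - rows_valid        # rows with at least one invalid cell
cells_invalid = cells - cells_valid     # invalid cells overall

# Average number of invalid cells within an invalid row.
invalid_cells_per_invalid_row = cells_invalid / rows_invalid

print(f"Columns per row: {columns}")
print(f"Invalid rows: {rows_invalid}, invalid cells: {cells_invalid}")
print(f"Invalid cells per invalid row: {invalid_cells_per_invalid_row:.2f}")
```

With roughly one invalid cell in each invalid row of 15 columns, a row validity near 0.87 and a cell validity above 0.99 are mutually consistent.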

In [59]:
mass_validity.detail

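| Variable | N | Valid | Validity |
|----------------------|------|------|------|
| patient_id | 1696 | 1696 | 1.0 |
| breast_density | 1696 | 1696 | 1.0 |
| left_or_right_breast | 1696 | 1696 | 1.0 |
| image_view | 1696 | 1696 | 1.0 |
| abnormality_id | 1696 | 1696 | 1.0 |
| abnormality_type | 1696 | 1696 | 1.0 |
| mass_shape | 1696 | 1692 | 1.0 |
| mass_margins | 1696 | 1636 | 0.96 |
| assessment | 1696 | 1534 | 0.9 |
| pathology | 1696 | 1696 | 1.0 |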


As suspected, the gap between row and cell validity can be attributed to missing mass margin values and invalid BI-RADS assessment values. Moving on to the series metadata.

## Series Metadata
### Completeness Analysis

In [60]:
dqa = SeriesDQA(filepath=fps)
series_completeness = dqa.analyze_completeness()
print(series_completeness.summary)



                          Completeness                          
                         Dataset | Series
                        Filename | series.csv
                            Rows | 6775
                   Rows Complete | 6775
                Row Completeness | 1.0
                           Cells | 27100
                  Cells Complete | 27100
               Cell Completeness | 1.0




### Uniqueness Analysis

In [61]:
series_uniqueness = dqa.analyze_uniqueness()
print(series_uniqueness.summary)



                           Uniqueness                           
                         Dataset | Series
                        Filename | series.csv
                            Rows | 6775
                     Unique Rows | 6775
                  Row Uniqueness | 1.0
                           Cells | 27100
                    Unique Cells | 13555
                 Cell Uniqueness | 0.5




### Validity Analysis

In [62]:
series_validity = dqa.analyze_validity()
print(series_validity.summary)



                            Validity                            
                         Dataset | Series
                        Filename | series.csv
                            Rows | 6775
                      Rows Valid | 6775
                    Row Validity | 1.0
                           Cells | 27100
                     Cells Valid | 27100
                   Cell Validity | 1.0




The series data are complete, unique, and valid. On to the DICOM Image metadata.

## DICOM Image Metadata
### Completeness Analysis

In [63]:
dqa = DicomDQA(filepath=fpd)
dicom_completeness = dqa.analyze_completeness()
print(dicom_completeness.summary)



                          Completeness                          
                         Dataset | Dicom
                        Filename | dicom.csv
                            Rows | 10238
                   Rows Complete | 10238
                Row Completeness | 1.0
                           Cells | 112618
                  Cells Complete | 112618
               Cell Completeness | 1.0




### Uniqueness Analysis

In [64]:
dicom_uniqueness = dqa.analyze_uniqueness()
print(dicom_uniqueness.summary)



                           Uniqueness                           
                         Dataset | Dicom
                        Filename | dicom.csv
                            Rows | 10238
                     Unique Rows | 10238
                  Row Uniqueness | 1.0
                           Cells | 112618
                    Unique Cells | 27385
                 Cell Uniqueness | 0.243




### Validity Analysis

In [65]:
dicom_validity = dqa.analyze_validity()
print(dicom_validity.summary)



                            Validity                            
                         Dataset | Dicom
                        Filename | dicom.csv
                            Rows | 10238
                      Rows Valid | 10238
                    Row Validity | 1.0
                           Cells | 112618
                     Cells Valid | 112618
                   Cell Validity | 1.0




## Consistency Analysis
The calcification and mass case descriptions include series uids that link each mammogram to its DICOM image files. Next, we assess the degree to which the series uids in the case files agree with the DICOM image metadata.

In [86]:
# Calcification Series UIDs
dfc = pd.read_csv(fpc)
calc_full = pd.DataFrame({'case type': 'calcification', 'series_type': 'Full Mammography', 'series_uid': dfc['image_series_uid']})
calc_roi = pd.DataFrame({'case type': 'calcification', 'series_type': 'ROI Mask', 'series_uid': dfc['roi_mask_series_uid']})
calc_cropped = pd.DataFrame({'case type': 'calcification', 'series_type': 'Cropped Image', 'series_uid': dfc['cropped_image_series_uid']})

dfm = pd.read_csv(fpm)
mass_full = pd.DataFrame({'case type': 'mass', 'series_type': 'Full Mammography', 'series_uid': dfm['image_series_uid']})
mass_roi = pd.DataFrame({'case type': 'mass', 'series_type': 'ROI Mask', 'series_uid': dfm['roi_mask_series_uid']})
mass_cropped = pd.DataFrame({'case type': 'mass', 'series_type': 'Cropped Image', 'series_uid': dfm['cropped_image_series_uid']})

case_series_uids = pd.concat([calc_full, calc_roi, calc_cropped, mass_full, mass_roi, mass_cropped], axis=0)

dicom = pd.read_csv(fpd)

case_series_uids['valid'] = case_series_uids['series_uid'].isin(dicom['series_uid'])
result = case_series_uids[['case type', 'series_type', 'valid']].groupby(by=['case type','series_type', 'valid']).value_counts().to_frame()
print(f"A total of {result['count'].sum()} series uids were evaluated.")

result
r = result.reset_index()
overall_consistency = r.loc[r['valid'], 'count'].sum() / r['count'].sum()
print(f"Overall consistency ratio is {round(overall_consistency,3)}")


A total of 10704 series uids were evaluated.


| case type | series_type | valid | count |
|---------------|------------------|-------|------|
| calcification | Cropped Image | True | 1872 |
| calcification | Full Mammography | True | 1872 |
| calcification | ROI Mask | True | 1872 |
| mass | Cropped Image | True | 1696 |
| mass | Full Mammography | False | 1 |
| mass | Full Mammography | True | 1695 |
| mass | ROI Mask | True | 1696 |


Overall consistency ratio is 1.0


Of 10,704 series uids, one was missing from the DICOM metadata, so the consistency ratio of 10,703 / 10,704 rounds to 1.0 at three decimal places. The missing uid corresponds to a corrupt DICOM image file that could not be read.
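
The `isin` check used in the cell above is a set-membership test. A minimal sketch on hypothetical uids (the `uid-*` strings are placeholders, not real series uids), followed by the ratio for the counts reported in this analysis at a finer precision:

```python
import pandas as pd

# Hypothetical series uids; real uids are long DICOM identifiers.
case_uids = pd.Series(["uid-1", "uid-2", "uid-3", "uid-4"])
dicom_uids = pd.Series(["uid-1", "uid-2", "uid-4", "uid-5"])

# A case uid is consistent when it also appears in the DICOM metadata.
matched = case_uids.isin(dicom_uids)
consistency = matched.mean()
unmatched = case_uids[~matched].tolist()

print(f"Consistency ratio: {consistency:.3f}")
print(f"Unmatched uids: {unmatched}")

# Counts reported in this analysis: 10,703 of 10,704 uids matched.
reported = 10703 / 10704
print(f"Reported ratio at 5 decimals: {reported:.5f}")
```

Listing the unmatched uids alongside the ratio is useful here because a ratio rounded to three decimals hides a single mismatch in a collection of this size.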
Let's summarize.

## Completeness Analysis

In [82]:
completeness = pd.concat([calc_completeness.summary.as_df(), mass_completeness.summary.as_df(), series_completeness.summary.as_df(), dicom_completeness.summary.as_df()], axis=0)
completeness
overall_completeness = completeness['rows_complete'].sum() / completeness['rows'].sum() 
print(f"Overall weighted average completeness for the CBIS-DDSM collection was {round(overall_completeness,3)}")

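|   | dataset | filename | rows | rows_complete | row_completeness | cells | cells_complete | cell_completeness |
|---|------------|----------------|-------|-------|------|--------|--------|------|
| 0 | Calc Cases | calc_cases.csv | 1872 | 1409 | 0.75 | 28080 | 27617 | 0.98 |
| 0 | Mass Cases | mass_cases.csv | 1696 | 1634 | 0.96 | 25440 | 25376 | 1.0 |
| 0 | Series | series.csv | 6775 | 6775 | 1.0 | 27100 | 27100 | 1.0 |
| 0 | Dicom | dicom.csv | 10238 | 10238 | 1.0 | 112618 | 112618 | 1.0 |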


Overall weighted average completeness for the CBIS-DDSM collection was 0.974
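
The "weighted average" here is the pooled ratio of complete rows to total rows, which equals the mean of the per-dataset row completeness ratios weighted by row count. A short check using the figures from the table (the array names are illustrative):

```python
import numpy as np

# Row counts from the completeness summary: calc, mass, series, dicom.
rows = np.array([1872, 1696, 6775, 10238])
rows_complete = np.array([1409, 1634, 6775, 10238])

# Pooled ratio, as computed in the notebook cell.
overall = rows_complete.sum() / rows.sum()

# Equivalent formulation: per-dataset ratios weighted by row count.
row_completeness = rows_complete / rows
weighted = np.average(row_completeness, weights=rows)

# An unweighted mean treats each dataset equally and gives a lower figure.
unweighted = row_completeness.mean()

print(f"Pooled ratio:    {overall:.3f}")
print(f"Weighted mean:   {weighted:.3f}")
print(f"Unweighted mean: {unweighted:.3f}")
```

Because the series and DICOM files are both large and fully complete, they dominate the pooled figure; the per-dataset ratios remain the better guide to where the gaps are.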


## Uniqueness Analysis

In [83]:
uniqueness = pd.concat([calc_uniqueness.summary.as_df(), mass_uniqueness.summary.as_df(), series_uniqueness.summary.as_df(), dicom_uniqueness.summary.as_df()], axis=0)
uniqueness
overall_uniqueness = uniqueness['unique_rows'].sum() / uniqueness['rows'].sum() 
print(f"Overall weighted average uniqueness for the CBIS-DDSM collection was {round(overall_uniqueness,3)}")

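|   | dataset | filename | rows | unique_rows | row_uniqueness | cells | unique_cells | cell_uniqueness |
|---|------------|----------------|-------|-------|------|--------|-------|------|
| 0 | Calc Cases | calc_cases.csv | 1872 | 1872 | 1.0 | 28080 | 6095 | 0.22 |
| 0 | Mass Cases | mass_cases.csv | 1696 | 1696 | 1.0 | 25440 | 5946 | 0.23 |
| 0 | Series | series.csv | 6775 | 6775 | 1.0 | 27100 | 13555 | 0.5 |
| 0 | Dicom | dicom.csv | 10238 | 10238 | 1.0 | 112618 | 27385 | 0.24 |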


Overall weighted average uniqueness for the CBIS-DDSM collection was 1.0


## Validity Analysis

In [84]:
validity = pd.concat([calc_validity.summary.as_df(), mass_validity.summary.as_df(), series_validity.summary.as_df(), dicom_validity.summary.as_df()], axis=0)
validity
overall_validity = validity['rows_valid'].sum() / validity['rows'].sum() 
print(f"Overall weighted average validity for the CBIS-DDSM collection was {round(overall_validity,3)}")

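|   | dataset | filename | rows | rows_valid | row_validity | cells | cells_valid | cell_validity |
|---|------------|----------------|-------|-------|------|--------|--------|------|
| 0 | Calc Cases | calc_cases.csv | 1872 | 1331 | 0.71 | 28080 | 27539 | 0.98 |
| 0 | Mass Cases | mass_cases.csv | 1696 | 1478 | 0.87 | 25440 | 25212 | 0.99 |
| 0 | Series | series.csv | 6775 | 6775 | 1.0 | 27100 | 27100 | 1.0 |
| 0 | Dicom | dicom.csv | 10238 | 10238 | 1.0 | 112618 | 112618 | 1.0 |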


Overall weighted average validity for the CBIS-DDSM collection was 0.963


The CBIS-DDSM collection is one of the most widely used mammography datasets in research. This brief data quality analysis of the CBIS-DDSM was undertaken to expose potential data quality issues in advance of the analysis effort. To that end, the DQA centered on four questions:
- Completeness: Are all the datasets and the data items complete?
- Uniqueness: Are there duplicate records?
- Validity: Are the values consistent with defined rules?
- Consistency: Are the data consistent across datasets?

Row-level completeness, uniqueness, and validity across the collection were 0.974, 1.0, and 0.963, respectively, and all but one of the 10,704 series uids in the case files matched the DICOM metadata.