# CBIS-DDSM - Unbox

Before conducting data quality or exploratory analyses, we unbox the data to ensure that, well...*'we got what we paid for'*. Our aim here is to ensure that:

1. Record counts are correct,
2. Variable names are consistent across files,
3. The data structure facilitates the next stage of data quality analysis.
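As a rough illustration of the second aim, the sketch below builds a presence table showing which column names appear in which file. The helper `compare_columns` and the toy frames are hypothetical stand-ins, not part of the original workflow; in practice the frames would be loaded with `pd.read_csv` from the case description files.

```python
import pandas as pd


def compare_columns(frames):
    """Return a boolean table: rows are column names, columns are file labels.

    `frames` maps a label (hypothetical, chosen by the caller) to a DataFrame.
    """
    all_columns = sorted(set().union(*(df.columns for df in frames.values())))
    return pd.DataFrame(
        {
            label: [col in df.columns for col in all_columns]
            for label, df in frames.items()
        },
        index=all_columns,
    )


# Toy frames standing in for two case files (illustrative data only).
toy_frames = {
    "calc_train": pd.DataFrame({"patient_id": ["P_1"], "breast density": [2]}),
    "mass_train": pd.DataFrame({"patient_id": ["P_2"], "breast_density": [3]}),
}

presence = compare_columns(toy_frames)
# Keep only the column names that are missing from at least one file.
inconsistent = presence[~presence.all(axis=1)]
print(inconsistent)
```

Any row left in `inconsistent` names a column that is absent from at least one file, which makes spelling differences easy to spot.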

In [1]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../../..")))
from myst_nb import glue
import pandas as pd

In [2]:
fp_calc_train = "data/meta/0_raw/calc_case_description_train_set.csv"
fp_calc_test = "data/meta/0_raw/calc_case_description_test_set.csv"
fp_mass_train = "data/meta/0_raw/mass_case_description_train_set.csv"
fp_mass_test = "data/meta/0_raw/mass_case_description_test_set.csv"
fp_metadata = "data/meta/0_raw/metadata.csv"

## Calcification Cases

### Calcification Train Cases

We expect a total of 602 cases, of which 329 are benign and 273 are malignant.

In [3]:
df = pd.read_csv(fp_calc_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1546 entries, 0 to 1545
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   patient_id               1546 non-null   object
 1   breast density           1546 non-null   int64 
 2   left or right breast     1546 non-null   object
 3   image view               1546 non-null   object
 4   abnormality id           1546 non-null   int64 
 5   abnormality type         1546 non-null   object
 6   calc type                1526 non-null   object
 7   calc distribution        1170 non-null   object
 8   assessment               1546 non-null   int64 
 9   pathology                1546 non-null   object
 10  subtlety                 1546 non-null   int64 
 11  image file path          1546 non-null   object
 12  cropped image file path  1546 non-null   object
 13  ROI mask file path       1546 non-null   object
dtypes: int64(4), object(10)
memory usage: 16

We have 1546 observations and 14 columns *(with spaces in the column names, which is somewhat vexing)* in the calcification training set.
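One way to deal with the spaces is to normalize the column names to snake case before further analysis. The following is a minimal sketch, assuming that lowercasing names such as `ROI mask file path` is acceptable; `to_snake_case` is a hypothetical helper, and the empty toy frame only borrows three of the column names listed above.

```python
import pandas as pd


def to_snake_case(columns):
    """Trim, lowercase, and replace runs of whitespace with underscores."""
    return columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)


# Empty toy frame reusing a few of the column names from the output above.
toy = pd.DataFrame(columns=["patient_id", "breast density", "ROI mask file path"])
toy.columns = to_snake_case(toy.columns)
print(list(toy.columns))
```

Applying the same function to every case file would also remove naming differences that are due only to spaces or capitalization.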

In [4]:
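# Total number of records in the file.
n_records = len(df)
# Number of distinct patients (cases).
n_cases = df['patient_id'].nunique()
# Number of distinct pathology values recorded for each patient.
case_pathologies = df[['patient_id', 'pathology']].drop_duplicates().groupby(by='patient_id').count()
# Patients with exactly two distinct pathology values.
n_cases_multiple_pathologies = len(case_pathologies.loc[case_pathologies['pathology'] == 2])
# Make the results available to the book text via glue.
glue('n_records', n_records)
glue('n_cases', n_cases)
glue('case_pathologies', case_pathologies)
glue('n_cases_multiple_pathologies', n_cases_multiple_pathologies)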

1546

602

| patient_id | pathology |
|------------|-----------|
| P_00005    | 1         |
| P_00007    | 1         |
| P_00008    | 1         |
| P_00010    | 1         |
| P_00011    | 1         |
| ...        | ...       |
| P_02559    | 1         |
| P_02563    | 1         |
| P_02566    | 1         |
| P_02572    | 1         |


14

```{glue:figure} case_pathologies
---
align: center
name: case_pathologies_df
---
CBIS-DDSM Case Pathologies
```

Several observations can be made:

1. It would appear that we have three values for pathology: MALIGNANT, BENIGN, and BENIGN_WITHOUT_CALLBACK. While recoding BENIGN_WITHOUT_CALLBACK as BENIGN might simplify things, we would lose information that might be useful in understanding missing values, such as those noted in item 2.
2. About 2% of the cases in the training set have missing calcification types. One case is illustrated above. Might the missing value be associated with the pathology?
3. A quick check of the file paths reveals, unfortunately, that they are not valid. If the other case files contain invalid file paths, we can safely drop them from the dataset.
4. One slightly vexing
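The question raised in item 2, whether missing calcification types are associated with pathology, could be explored by tabulating missingness per pathology value. The sketch below uses a small made-up frame rather than the real data, so its numbers say nothing about CBIS-DDSM itself; only the column names `pathology` and `calc type` are taken from the file, and the calcification type values are placeholders.

```python
import pandas as pd

# Toy data (fabricated for illustration; not drawn from the dataset).
toy = pd.DataFrame(
    {
        "pathology": [
            "MALIGNANT",
            "MALIGNANT",
            "BENIGN",
            "BENIGN_WITHOUT_CALLBACK",
            "BENIGN_WITHOUT_CALLBACK",
        ],
        "calc type": ["TYPE_A", None, "TYPE_B", None, "TYPE_A"],
    }
)

# Flag missing calcification types, then count and average them per pathology.
is_missing = toy["calc type"].isna()
missing_by_pathology = is_missing.groupby(toy["pathology"]).agg(
    n_missing="sum", share_missing="mean"
)
print(missing_by_pathology)
```

Running the same aggregation on the real training set would show whether the missing values cluster in one pathology value or are spread across all three.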

### Calcification Test Cases