# EDA Part 1: Introduction

In this section, we conduct an exploratory data analysis of the CBIS-DDSM dataset with several goals in mind:

1. Maximize insight into the data and the factors that influence screening results in the CBIS-DDSM dataset.
2. Assess the quality of the digital mammography in the CBIS-DDSM dataset.
3. Select optimal methods and parameters for preprocessing tasks such as denoising and artifact removal.

```{table}
:name: eda_vars
| Variable          | Section | Definition                                                              |
|-------------------|---------|-------------------------------------------------------------------------|
| patient_id        | Case    | Identifier for each patient                                             |
| mmg_id            | Case    | Identifier for each mammogram                                           |
| breast_density    | Case    | BI-RADS categorization of the amount of fibroglandular tissue (FGT)     |
| laterality        | Case    | Left or right breast                                                    |
| image_view        | Case    | Craniocaudal (CC) or Mediolateral Oblique (MLO) view                    |
| abnormality_id    | Case    | Sequence number of the abnormality within the mammogram                 |
| abnormality_type  | Case    | Either calcification or mass                                            |
| calc_type         | Case    | Calcification type (when applicable)                                    |
| calc_distribution | Case    | Calcification distribution (when applicable)                            |
| assessment        | Case    | BI-RADS assessment in [0,5]                                             |
| pathology         | Case    | One of 'BENIGN', 'BENIGN_WITHOUT_CALLBACK', or 'MALIGNANT'              |
| subtlety          | Case    | Reading difficulty, from 0 (highly subtle) to 5 (obvious)               |
| mass_shape        | Case    | Shape of the mass such as round, oval, lobular, or irregular            |
| mass_margins      | Case    | The feature that separates the mass from adjacent breast parenchyma.    |
| cancer            | Case    | Either True (Malignant) or False (Benign)                               |
| bit_depth         | Image   | Number of bits representing a pixel value. Either 8 or 16               |
| rows              | Image   | Number of rows in the image                                             |
| cols              | Image   | Number of columns in the image                                          |
| aspect_ratio      | Image   | Ratio of the image width to height.                                     |
| size              | Image   | Product of image width and height                                       |
| min_pixel_value   | Image   | Minimum pixel value                                                     |
| max_pixel_value   | Image   | Maximum pixel value                                                     |
| mean_pixel_value  | Image   | Average pixel value                                                     |
| std_pixel_value   | Image   | Standard deviation of pixel values                                      |
```
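
To make the image variables in the table concrete, the following is a minimal sketch of how they could be computed for a single image. It assumes the image has already been loaded as a 2-D NumPy array and that bit depth can be inferred from the array's dtype; the function name `image_stats` is hypothetical and is not taken from the project's code.

```python
import numpy as np


def image_stats(pixels: np.ndarray) -> dict:
    """Compute the image-level variables for one 2-D grayscale array."""
    if pixels.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    rows, cols = pixels.shape
    return {
        # Assumption: bit depth follows the dtype (uint8 -> 8, uint16 -> 16).
        "bit_depth": pixels.dtype.itemsize * 8,
        "rows": rows,
        "cols": cols,
        "aspect_ratio": cols / rows,  # width divided by height
        "size": rows * cols,          # width multiplied by height
        "min_pixel_value": int(pixels.min()),
        "max_pixel_value": int(pixels.max()),
        "mean_pixel_value": float(pixels.mean()),
        "std_pixel_value": float(pixels.std()),
    }


# Synthetic 16-bit array standing in for a mammogram (not real data).
demo = np.zeros((4, 2), dtype=np.uint16)
demo[0, 0] = 65535
stats = image_stats(demo)
```

Here `stats` holds one record with the same keys as the Image rows of the table; applying the function to every image would yield one such record per mammogram.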

The exploratory data analysis will include the case and image variables listed in {numref}`eda_vars` and will be structured as follows:

{numref}`eda2` explores the case metadata for insights into screening and diagnosis of calcification and mass abnormalities in the dataset. {numref}`eda3` examines the quality and characteristics of the CBIS-DDSM images vis-à-vis abnormality type and morphological features of calcifications and masses. Finally, {numref}`eda4` evaluates methods and optimal parameter settings for preprocessing tasks such as denoising and artifact removal.