# Data Cleaning
In the prior section, we identified a few anomalies to correct. This brief notebook executes the following data cleaning steps:
1. Remove cases with invalid subtlety values.
2. Mark variables as not applicable where they do not apply to the case type, e.g., mass shape and mass margins for calcification cases.
3. Perform Multivariate Imputation by Chained Equations (MICE) to predict missing values in the dataset.

Once complete, we'll conduct a random inspection of the data and run a final data quality analysis.

In [None]:
import os
if 'jbook' in os.getcwd():
    os.chdir(os.path.abspath(os.path.join("../..")))

import pandas as pd
import numpy as np

from bcd.data import CALC_VARIABLES, MASS_VARIABLES
from bcd.data.dqa.case import MassCaseDQA, CalcCaseDQA
from bcd.data.prep.case import CaseImputer

pd.options.display.max_rows = 999

In [None]:
FP_IN = "data/staged/case.csv"
FP_CLEAN_CASES = "data/clean/cases.csv"
FP_CLEAN_CALC = "data/clean/calc_cases.csv"
FP_CLEAN_MASS = "data/clean/mass_cases.csv"
CASES_TO_REMOVE = ['P_00710_RIGHT_mass_MLO_1', 'P_00710_RIGHT_mass_CC_1']

## Setup
First, we'll load the data and keep a copy of the original. Next, we'll set invalid breast density values (0) to NA so that the imputer picks them up. Then, we'll mark mass- and calcification-specific variables as not applicable ('NA') where they do not apply to the case type. Finally, we'll capture the rows with missing data for the final comparison.

In [None]:
# Load data and capture original version.
df_orig = pd.read_csv(FP_IN)
# Work on a copy so that df_orig keeps the original values.
df_interim = df_orig.copy()
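
Whether df_orig really preserves the original values depends on the working frame being an independent copy: plain assignment binds a second name to the same DataFrame, while .copy() creates a separate one. A small sketch on a toy frame (the values are illustrative, not taken from the case data):

```python
import pandas as pd

original = pd.DataFrame({"breast_density": [0, 2, 3]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

# Modify both working frames in place.
alias.loc[0, "breast_density"] = 9
independent.loc[1, "breast_density"] = 7

# The edit through the alias shows up in the original; the edit to the copy does not.
alias_changed_original = original.loc[0, "breast_density"] == 9
copy_changed_original = original.loc[1, "breast_density"] == 7
print(alias_changed_original, copy_changed_original)
```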

In [None]:
# Set invalid values for breast_density to NA
df_interim['breast_density'] = df_interim['breast_density'].replace(0, np.nan)

In [None]:
# Set mass and calcification variables to not applicable ('NA') where appropriate.
df_interim.loc[df_interim['abnormality_type'] == 'calcification', 'mass_shape'] = 'NA'
df_interim.loc[df_interim['abnormality_type'] == 'calcification', 'mass_margins'] = 'NA'
df_interim.loc[df_interim['abnormality_type'] == 'mass', 'calc_type'] = 'NA'
df_interim.loc[df_interim['abnormality_type'] == 'mass', 'calc_distribution'] = 'NA'
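
One point that may be worth checking with this labeling: pandas treats the literal string 'NA' as a missing-value marker when reading CSV files with default settings, so the label can come back as NaN after a save and reload. Whether that matters here depends on how the saved files are read downstream, which this notebook does not show. A minimal sketch of the round trip on a toy frame (case ids and the 'OVAL' value are illustrative):

```python
import io

import pandas as pd

df = pd.DataFrame({"case_id": ["c1", "c2"], "mass_shape": ["NA", "OVAL"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default parsing: 'NA' is in pandas' default list of missing-value markers.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps 'NA' as an ordinary string.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["mass_shape"].isnull().sum())
print(literal_read["mass_shape"].tolist())
```

If the label needs to survive a reload, options include reading with keep_default_na=False or using a label that is not in the default marker list.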

In [None]:
# Grab rows with missing data
null_mask = df_interim.isnull().any(axis=1)
df_missing = df_interim[null_mask]
n_missing = df_missing.shape[0]
pct_missing = round(n_missing / df_orig.shape[0] * 100, 1)
msg = f"There are {n_missing} rows (approximately {pct_missing}% of the dataset) with missing data."
print(msg)

## Data Removal

In [None]:
df_interim = df_interim.loc[~df_interim['case_id'].isin(CASES_TO_REMOVE)]
msg = f"A total of {df_orig.shape[0] - df_interim.shape[0]} rows were removed from the dataset, leaving {df_interim.shape[0]} rows."
print(msg)

## Data Imputation
Multivariate Imputation by Chained Equations (MICE) estimates missing values through an iterative series of predictive models: each variable with missing data is modeled in turn as a function of the other variables in the dataset, and the cycle is repeated over several rounds so that later estimates can build on earlier ones. For this, we'll use our CaseImputer, which wraps scikit-learn's IterativeImputer implementation of MICE.
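
To make the chained-equations idea concrete, here is a deliberately simplified sketch in plain NumPy. It is not CaseImputer and not scikit-learn's IterativeImputer: it assumes purely numeric data, uses ordinary least squares for every column, and produces a single imputation. The function name chained_imputation and the toy data are hypothetical.

```python
import numpy as np


def chained_imputation(X, n_iter=10):
    """Simplified chained-equations imputation with least-squares models."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means so every regression has complete predictors.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            # Predict column j from all other columns plus an intercept.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            observed = ~missing[:, j]
            coef, *_ = np.linalg.lstsq(A[observed], filled[observed, j], rcond=None)
            # Overwrite only the originally missing entries of column j.
            filled[~observed, j] = A[~observed] @ coef
    return filled


# Toy data: the second column is an exact linear function of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + 1.0])

data_missing = data.copy()
data_missing[:20, 1] = np.nan

imputed = chained_imputation(data_missing)
print(np.abs(imputed - data).max())
```

Because the toy relationship is exactly linear, the regression recovers the hidden values almost perfectly; real case data mixes categorical and numeric variables and will not behave this cleanly.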

In [None]:
imp = CaseImputer(random_state=5)
imp.fit(df=df_interim)
df_clean = imp.transform(df=df_interim)

## Save Data

In [None]:
os.makedirs(os.path.dirname(FP_CLEAN_CASES), exist_ok=True)
df_clean.to_csv(FP_CLEAN_CASES, index=False)
df_calc = df_clean[CALC_VARIABLES].loc[df_clean['abnormality_type'] == 'calcification']
df_mass = df_clean[MASS_VARIABLES].loc[df_clean['abnormality_type'] == 'mass']
df_calc.to_csv(FP_CLEAN_CALC, index=False)
df_mass.to_csv(FP_CLEAN_MASS, index=False)

## Random Sample Inspection

Let's take a random sample of the rows that had missing data and compare them before and after imputation.

In [None]:
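from IPython.display import display

sample_cases = df_missing['case_id'].sample(10)
# Display both frames explicitly; a cell renders only its last expression on its own.
display(df_missing.loc[df_missing['case_id'].isin(sample_cases)])
display(df_clean.loc[df_clean['case_id'].isin(sample_cases)])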
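
For a more focused before-and-after view, pandas' DataFrame.compare lists only the cells that differ between two identically labeled frames. A minimal sketch on toy frames indexed by case_id; the column values are illustrative stand-ins for imputed results, not output from CaseImputer:

```python
import numpy as np
import pandas as pd

before = pd.DataFrame(
    {
        "case_id": ["a", "b", "c"],
        "breast_density": [2.0, np.nan, 3.0],
        "subtlety": [np.nan, 4.0, 5.0],
    }
).set_index("case_id")

# Stand-in for the imputed frame: same labels, missing cells filled.
after = before.copy()
after.loc["b", "breast_density"] = 2.0
after.loc["a", "subtlety"] = 3.0

# 'self' holds the value before imputation, 'other' the value after.
diff = before.compare(after)
print(diff)
```

compare requires both frames to share the same index and columns, so the real frames would first need to be restricted to the sampled cases and aligned on case_id.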

## Data Quality Analysis 2.0
Ok, let's have another go at the data quality analysis.

### Completeness Analysis

In [None]:
dqc = CalcCaseDQA(filepath=FP_CLEAN_CALC)
dqm = MassCaseDQA(filepath=FP_CLEAN_MASS)
dqcc = dqc.analyze_completeness()
dqmc = dqm.analyze_completeness()

In [None]:
print(dqcc.summary)
print(dqmc.summary)

We're complete.
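
As a rough cross-check that does not depend on the DQA classes, completeness can be approximated as the share of non-null values per column and the share of fully populated rows. This is a sketch of the general idea on a toy frame, not a description of how CalcCaseDQA or MassCaseDQA compute their summaries:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "breast_density": [1.0, np.nan, 3.0, 2.0],
        "subtlety": [4, 5, 3, 2],
    }
)

# Share of non-null values in each column.
column_completeness = toy.notnull().mean()
# Share of rows with no missing values at all.
row_completeness = toy.notnull().all(axis=1).mean()

print(column_completeness)
print(row_completeness)
```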

### Uniqueness Analysis

In [None]:
dqcu = dqc.analyze_uniqueness()
dqmu = dqm.analyze_uniqueness()
print(dqcu.summary)
print(dqmu.summary)

We're unique.
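
A comparable hand check for uniqueness counts fully duplicated rows and tests whether the identifier column is unique. Again, this is an illustrative sketch on a toy frame, under the assumption that case_id is meant to identify each row; it is not the DQA classes' own logic:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "case_id": ["a", "b", "b"],
        "subtlety": [3, 4, 4],
    }
)

# Rows that repeat an earlier row in every column.
duplicate_rows = int(toy.duplicated().sum())
# True only if no case_id value occurs more than once.
ids_unique = toy["case_id"].is_unique

print(duplicate_rows, ids_unique)
```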

### Validity Analysis

In [None]:
dqcv = dqc.analyze_validity()
dqmv = dqm.analyze_validity()
print(dqcv.summary)
print(dqmv.summary)

We're valid. This concludes the data cleaning exercise.