# **Data Cleaning**

## Objectives

* Convert values of zero in RestingBP and Cholesterol to NaN
* Use an imputer to fill in the missing values
* Split the data into train and test sets

## Inputs

* outputs/datasets/collection/HeartDiseasePrediction.csv

## Outputs

* Cleaned train and test datasets
* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os


current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
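As a small illustration of the call used above, `os.path.dirname()` simply strips the last component of a path. The folder names below are hypothetical and only serve the example.

```python
import os

# Hypothetical notebook folder, used only for illustration
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# os.path.dirname() drops the last path component, i.e. it returns the parent
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```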

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

Load the dataset from the data collection notebook

In [None]:
import pandas as pd


df = pd.read_csv("outputs/datasets/collection/HeartDiseasePrediction.csv")
df.head(3)

---

# Data Cleaning

Our exploratory data analysis found no missing values. However, two features contain zeros where a zero value makes no sense, so we assume that no data was collected for those entries.

* RestingBP - 1 zero (0.1% of data)
* Cholesterol - 172 zeros (18.7% of data)

RestingBP has only one zero value, so we could simply drop that row. To avoid losing data, however, we instead impute it with the median, since RestingBP showed no strong correlation with the target, HeartDisease.

First, before any imputation, we need to convert the zeros in RestingBP and Cholesterol into NaN.

In [None]:
import numpy as np


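# Note: this assignment creates a reference, not a copy, so df is modified as well
zero_to_nan_df = df
for col in ["RestingBP", "Cholesterol"]:
    # A zero makes no sense for these features, so mark it as missing
    zero_to_nan_df[col] = zero_to_nan_df[col].replace(0, np.nan)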

def EvaluateMissingData(df):
    missing_data_absolute = df.isna().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                   "PercentageOfDataset": missing_data_percentage,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data

EvaluateMissingData(zero_to_nan_df)
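The zero-to-NaN conversion can be checked on a small example. The following is a minimal sketch on made-up toy data (not the heart dataset); it works on a copy and leaves the target column alone, since a zero there is a legitimate class label.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), with zeros standing in for missing readings
toy_df = pd.DataFrame({
    "RestingBP": [120, 0, 140, 130],
    "Cholesterol": [200, 0, 0, 250],
    "HeartDisease": [0, 1, 0, 1],
})

# Work on a copy so the toy frame itself keeps its zeros
toy_nan_df = toy_df.copy()
cols = ["RestingBP", "Cholesterol"]
toy_nan_df[cols] = toy_nan_df[cols].replace(0, np.nan)

# Zeros in the target column are legitimate and are left untouched
missing_counts = toy_nan_df.isna().sum()
print(missing_counts)
```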

Now, the NaN values can be replaced with imputed values.

* For RestingBP, we will use a median imputation method.
* For Cholesterol, we will use a random sample imputation method as shown in the EDA.

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer, RandomSampleImputer


pipeline = Pipeline([
    ( "median_imputation", MeanMedianImputer(imputation_method="median",
                                             variables=["RestingBP"])),
    ( "random_sample_imputation", RandomSampleImputer(random_state=1,
                                                      seed='general',
                                                      variables=["Cholesterol"]))
])

cleaned_df = pipeline.fit_transform(zero_to_nan_df)
EvaluateMissingData(cleaned_df)
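To make the two imputation ideas concrete, here is a pandas-only sketch on made-up toy data. It illustrates the general techniques and is not feature_engine's implementation: median imputation fills every gap with one value, while random sample imputation draws each replacement from the observed values of the same column.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "RestingBP": [120.0, np.nan, 140.0, 130.0],
    "Cholesterol": [200.0, np.nan, np.nan, 250.0],
})

imputed = toy.copy()

# Median imputation: every missing value gets the median of the observed values
imputed["RestingBP"] = imputed["RestingBP"].fillna(toy["RestingBP"].median())

# Random sample imputation: each missing value is drawn at random
# from the observed values of the same column
observed = toy["Cholesterol"].dropna()
missing_mask = toy["Cholesterol"].isna()
rng = np.random.default_rng(1)
imputed.loc[missing_mask, "Cholesterol"] = rng.choice(
    observed.to_numpy(), size=int(missing_mask.sum()), replace=True
)
print(imputed)
```

Because the random draws reuse observed values, this approach tends to preserve the spread of a column better than filling many gaps with a single constant, which matters more for Cholesterol (18.7% missing) than for RestingBP (one value).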

Next, we compare the data before and after cleaning to check whether the chosen imputation methods changed the distributions of the cleaned features.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


sns.set(style="whitegrid")


def DataCleaningEffect(df_original, df_cleaned, variables_applied_with_method):

  flag_count = 1

  categorical_variables = df_original.select_dtypes(exclude=['number']).columns

  for set_of_variables in [variables_applied_with_method]:
    print("\n=====================================================================================")
    print(f"* Distribution Effect Analysis After Data Cleaning Method in the following variables:")
    print(f"{set_of_variables} \n\n")

    for var in set_of_variables:
      if var in categorical_variables:

        df1 = pd.DataFrame({"Type": "Original", "Value": df_original[var]})
        df2 = pd.DataFrame({"Type": "Cleaned", "Value": df_cleaned[var]})
        dfAux = pd.concat([df1, df2], axis=0)
        fig, axes = plt.subplots(figsize=(15, 5))
        sns.countplot(hue='Type', data=dfAux, x="Value",
                      palette=["#432371", "#FAAE7B"])
        axes.set(title=f"Distribution Plot {flag_count}: {var}")
        plt.xticks(rotation=90)
        plt.legend()

      else:

        fig, axes = plt.subplots(figsize=(10, 5))
        sns.histplot(data=df_original, x=var, color="#432371",
                     label="Original", kde=True, element="step", ax=axes)
        sns.histplot(data=df_cleaned, x=var, color="#FAAE7B",
                     label="Cleaned", kde=True, element="step", ax=axes)
        axes.set(title=f"Distribution Plot {flag_count}: {var}")
        plt.legend()

      plt.show()
      flag_count += 1


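cleaned_features = ["RestingBP", "Cholesterol"]
# df holds NaN in place of the zeros (it was modified in place earlier),
# so the "Original" plots show only the observed values
DataCleaningEffect(df, cleaned_df, cleaned_features)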

From the above plots, it can be observed that the imputation did not noticeably change the distribution of either RestingBP or Cholesterol.
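A visual check like this can be complemented by a numeric one. The sketch below uses synthetic data (not the heart dataset) and a hypothetical helper, `compare_summary`, to place summary statistics before and after imputation side by side.

```python
import numpy as np
import pandas as pd


def compare_summary(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Side-by-side summary statistics; describe() ignores NaN values."""
    return pd.DataFrame({"before": before.describe(), "after": after.describe()})


# Synthetic data for illustration only
rng = np.random.default_rng(0)
before = pd.Series(rng.normal(loc=240, scale=50, size=500))
before.iloc[:90] = np.nan  # pretend these readings were missing

# Fill the gaps with random draws from the observed values
after = before.copy()
after.loc[after.isna()] = rng.choice(before.dropna().to_numpy(), size=90, replace=True)

summary = compare_summary(before, after)
print(summary.loc[["count", "mean", "50%", "std"]])
```

If the mean, median and standard deviation stay close between the two columns, that supports what the plots suggest.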

---

## Split into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split


# Split the full dataframe; the target column stays in both sets
TrainSet, TestSet = train_test_split(
                                cleaned_df,
                                test_size=0.2,
                                random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
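The effect of `test_size` and `random_state` is easiest to see on a tiny example. This is a minimal sketch on made-up toy data: 20% of the rows go to the test set, and repeating the call with the same `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 rows
toy_df = pd.DataFrame({"feature": range(10), "HeartDisease": [0, 1] * 5})

# test_size=0.2 keeps 20% of the rows for testing; random_state fixes the shuffle
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)
toy_train_again, toy_test_again = train_test_split(toy_df, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)
# Same random_state, same split
print(toy_test.index.equals(toy_test_again.index))
```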

---

# Push files to Repo

In [None]:
# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)

### Train Set

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

### Test Set

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
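After saving, it can be worth reloading a file to confirm that it round-trips cleanly. The sketch below does this with a made-up toy frame and a hypothetical file name in a temporary folder, so it does not touch the real outputs.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a cleaned split (made up for illustration)
toy_train = pd.DataFrame({"RestingBP": [120.0, 130.0], "Cholesterol": [200.0, 250.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a temporary folder
    csv_path = os.path.join(tmp_dir, "ToyTrainSetCleaned.csv")
    toy_train.to_csv(csv_path, index=False)

    # Reload the file while the temporary folder still exists
    reloaded = pd.read_csv(csv_path)

# Confirm the shape and the absence of missing values
print(reloaded.shape)
print(int(reloaded.isna().sum().sum()))
```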