# **Data Cleaning Notebook**

## Objectives

* Evaluate missing data
* Clean data


## Inputs

* outputs/datasets/collection/HousePrices.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned


---

## Change Working Directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We set the parent folder as the new working directory using os.path.dirname() and os.chdir().

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
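As a quick illustration of what `os.path.dirname()` returns, the sketch below uses a made-up path (`/workspace/project/jupyter_notebooks` is hypothetical, not this project's actual location). It only computes the parent folder and does not change the working directory.

```python
import os

# Hypothetical notebook folder, used only for illustration
example_dir = "/workspace/project/jupyter_notebooks"

# dirname() strips the last path component, which gives the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```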

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

## Load Collected Data

In [None]:
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/HousePrices.csv")
df.head(5)

## Data Exploration

When previewing the first five rows of the DataFrame above, we can already spot missing values in multiple cells across four columns. In this step, we take a closer look to identify all variables containing missing data.

A total of 9 out of 24 columns (37.5% of the variables) contain missing values.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
len(vars_with_missing_data)
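To express that count as a share of all columns, something like the following could be used. This is a minimal sketch on a toy DataFrame; the column names `a` to `d` are hypothetical and not part of the house prices data.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns; two of the four contain missing values
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": ["x", "y", None],
        "c": [1, 2, 3],
        "d": [0.5, 0.1, 0.2],
    }
)

# Same selection logic as in the cell above
cols_with_missing = toy.columns[toy.isna().sum() > 0].to_list()

# Share of columns that contain at least one missing value
share = len(cols_with_missing) / toy.shape[1] * 100
print(f"{len(cols_with_missing)} of {toy.shape[1]} columns ({share:.1f}%) have missing values")
```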

In [None]:
df[vars_with_missing_data].info()

### Profile Report
- We generate a profile report using the ydata-profiling library to examine variables with missing values in more detail.

In [None]:
# adapted from customer churn study
from ydata_profiling import ProfileReport

if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("No missing data found.")

We observed that several variables not only have missing values but also include many zero entries.

---

## Data Cleaning

### Assessing Missing Data Levels

In [None]:
def EvaluateMissingData(df):
    """
    Function to evaluate data with missing values
    """
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute / len(df) * 100, 2)
    df_missing_data = (
        pd.DataFrame(
            data={
                "RowsWithMissingData": missing_data_absolute,
                "PercentageOfDataset": missing_data_percentage,
                "DataType": df.dtypes,
            }
        )
        .sort_values(by=["PercentageOfDataset"], ascending=False)
        .query("PercentageOfDataset > 0")
    )

    return df_missing_data

Check missing data levels for collected dataset

In [None]:
EvaluateMissingData(df)

Both EnclosedPorch and WoodDeckSF are missing over 85% of their values, so they may be dropped as they likely add little predictive value.
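Drop candidates like these can also be listed programmatically. The sketch below assumes a cut-off of 80% (the `threshold` value and the toy column names are hypothetical choices for illustration, not a rule taken from this study).

```python
import numpy as np
import pandas as pd

# Toy data: one column is 90% missing, one 10% missing, one complete
toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan] * 9 + [1.0],
        "some_missing": [1.0, np.nan] + [3.0] * 8,
        "complete": list(range(10)),
    }
)

threshold = 80  # hypothetical cut-off, in percent

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Columns whose missing share exceeds the cut-off
drop_candidates = missing_pct[missing_pct > threshold].index.to_list()
print(drop_candidates)
```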

### Handling Missing Data

- Creating the DataCleaningEffect() function
- This function is based on code from Unit 9 of the Feature-engine module, with some modifications.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")


def DataCleaningEffect(df_original, df_cleaned, variables_applied_with_method):
    """
    Visualize the effect of data cleaning methods by comparing
    the distributions of original and cleaned data for selected variables.
    """

    flag_count = 1  # Track plot number

    # Identify categorical variables (non-numeric types)
    categorical_variables = df_original.select_dtypes(exclude=["number"]).columns

    # Display overview of analysis
    print("\n" + "=" * 90)
    print("* Visual Comparison of Distributions Before and After Cleaning")
    print(f"* Variables assessed: {variables_applied_with_method} \n")

    # Iterate over each selected variable
    for var in variables_applied_with_method:
        if var in categorical_variables:
            # Categorical variable: bar chart comparison

            df1 = pd.DataFrame({"Type": "Original", "Value": df_original[var]})
            df2 = pd.DataFrame({"Type": "Cleaned", "Value": df_cleaned[var]})
            df_combined = pd.concat([df1, df2], axis=0)

            fig, axes = plt.subplots(figsize=(15, 5))
            sns.countplot(
                hue="Type", data=df_combined, x="Value", palette=["#432371", "#FAAE7B"]
            )
            axes.set(title=f"Distribution Plot {flag_count}: {var}")
            plt.xticks(rotation=90)
            plt.legend()

        else:
            # Numerical variable: histogram comparison

            fig, axes = plt.subplots(figsize=(10, 5))
            sns.histplot(
                data=df_original,
                x=var,
                color="#432371",
                label="Original",
                kde=True,
                element="step",
                ax=axes,
            )
            sns.histplot(
                data=df_cleaned,
                x=var,
                color="#FAAE7B",
                label="Cleaned",
                kde=True,
                element="step",
                ax=axes,
            )
            axes.set(title=f"Distribution Plot {flag_count}: {var}")
            plt.legend()

        plt.show()
        flag_count += 1

### Data Cleaning Summary
- Dropped features
    - EnclosedPorch and WoodDeckSF were removed because over 85% of their values are missing. Although both contribute to the size of a home, the few observed values do not show enough variation to be useful for prediction.
    - GarageYrBlt was evaluated with median imputation but is dropped in the final pipeline, since its missing values correspond to houses without a garage.

- Imputations
    - Mean imputation was used for LotFrontage and BedroomAbvGr, since their values are roughly normally distributed.
    - Median imputation was used for 2ndFlrSF and MasVnrArea. These columns are slightly skewed, and the median is less sensitive to outliers than the mean.
    - Categorical imputation was used for GarageFinish and BsmtFinType1 because they are text-based categories, to which mean and median cannot be applied.

---

### Split Dataset into Train and Test 

We split the data so we can apply imputations on the Train Set and test their effect on the Test Set.

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)
print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
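To illustrate why the split comes before imputation, here is a minimal pandas sketch in which the fill value is learned from the train rows only and then reused on the test rows. The toy values are made up; feature-engine's imputers follow the same fit/transform idea, but this sketch does not use them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy split with a single numeric column
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 100.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# "Fit": learn the fill value from the train set only
train_mean = train["LotFrontage"].mean()

# "Transform": reuse that same value on both sets
train_filled = train.fillna({"LotFrontage": train_mean})
test_filled = test.fillna({"LotFrontage": train_mean})

print(train_mean, test_filled["LotFrontage"].to_list())
```

Because the test rows never influence `train_mean`, the Test Set remains a fair stand-in for unseen data.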

We first check missing values in the Train Set to ensure it still represents the full dataset.

In [None]:
df_missing_data = EvaluateMissingData(TrainSet)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data
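One way to judge whether a subset still resembles the full dataset is to compare the missing-value percentages side by side. The following is a sketch on toy data; `full` and `subset` are hypothetical stand-ins for `df` and `TrainSet`.

```python
import numpy as np
import pandas as pd

# Toy "full" dataset with 2 missing values out of 10 rows
full = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]}
)

# Toy "subset": the first 8 rows, which include both missing values
subset = full.iloc[:8]

# Missing percentage per column in each set, plus the difference
comparison = pd.DataFrame(
    {
        "full_pct": (full.isna().mean() * 100).round(2),
        "subset_pct": (subset.isna().mean() * 100).round(2),
    }
)
comparison["difference"] = (comparison["subset_pct"] - comparison["full_pct"]).round(2)
print(comparison)
```

Large differences would suggest that the split changed the missing-data profile of a variable.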

---

## Drop Variables

In [None]:
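from feature_engine.selection import DropFeatures

# Columns with too many missing values to impute reliably
drop_vars = ["EnclosedPorch", "WoodDeckSF"]

# Preview the drop on a copy. TrainSet, TestSet and df are left unchanged
# because the cleaning pipeline below drops these columns itself, so they
# must still be present when it runs.
dropper = DropFeatures(features_to_drop=drop_vars)
df_method = dropper.fit_transform(TrainSet)
df_method.head(3)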

---

## Mean Imputation

Variables: ['LotFrontage', 'BedroomAbvGr']
- These variables have distributions that are roughly normal, so we will use the mean to fill in the missing values.

In [None]:
from feature_engine.imputation import MeanMedianImputer

# Select variables where mean is appropriate
variables_mean = ["LotFrontage", "BedroomAbvGr"]

# Create and apply the mean imputer
imputer = MeanMedianImputer(imputation_method="mean", variables=variables_mean)
df_method = imputer.fit_transform(TrainSet)

# Visualize the effect of imputation
DataCleaningEffect(
    df_original=TrainSet,
    df_cleaned=df_method,
    variables_applied_with_method=variables_mean,
)
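A simple way to sanity-check the choice between mean and median is to compare the two statistics together with the skewness of a variable. The sketch below uses two made-up series rather than the house prices columns; how much skew is acceptable for mean imputation remains a judgment call.

```python
import pandas as pd

# Made-up examples: one symmetric series, one with a large outlier
symmetric = pd.Series([1, 2, 3, 4, 5], dtype=float)
right_skewed = pd.Series([1, 2, 3, 4, 50], dtype=float)

for name, s in {"symmetric": symmetric, "right_skewed": right_skewed}.items():
    # For symmetric data mean and median agree; an outlier pulls the mean away
    print(
        f"{name}: mean={s.mean():.1f}, median={s.median():.1f}, skew={s.skew():.2f}"
    )
```

When the mean and median are close and the skewness is near zero, mean imputation is a reasonable choice; otherwise the median is the more robust fill value.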

---

## Median Imputation

Variables: ['2ndFlrSF', 'GarageYrBlt', 'MasVnrArea']
- These variables may have skewed distributions or outliers, so we will use the median to fill in the missing values.

In [None]:
from feature_engine.imputation import MeanMedianImputer

# Select variables where median is more robust
variables_median = ["2ndFlrSF", "GarageYrBlt", "MasVnrArea"]

# Create and apply the median imputer
imputer = MeanMedianImputer(imputation_method="median", variables=variables_median)
df_method = imputer.fit_transform(TrainSet)

# Visualize the effect of imputation
DataCleaningEffect(
    df_original=TrainSet,
    df_cleaned=df_method,
    variables_applied_with_method=variables_median,
)

After median imputation, GarageYrBlt values concentrate around 1975. The check below shows that where GarageYrBlt is missing, GarageArea is zero, meaning the house has no garage, so an imputed construction year would be misleading. Since garage size matters more than garage age, we drop GarageYrBlt in the cleaning pipeline.

In [None]:
TrainSet[(TrainSet["GarageArea"] == 0)][["GarageYrBlt", "GarageArea"]]
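The same relationship can be checked for every row at once instead of by eye. This is a sketch on toy data with made-up values; on the real data the two conditions may not match perfectly.

```python
import numpy as np
import pandas as pd

# Toy data: the garage year is missing exactly where the garage area is zero
toy = pd.DataFrame(
    {
        "GarageYrBlt": [1990.0, np.nan, 2005.0, np.nan],
        "GarageArea": [480, 0, 520, 0],
    }
)

missing_year = toy["GarageYrBlt"].isna()
no_garage = toy["GarageArea"] == 0

# Cross-tabulate the two conditions; off-diagonal counts would signal mismatches
print(pd.crosstab(missing_year, no_garage))

# True only if both conditions agree on every row
all_match = bool((missing_year == no_garage).all())
print(all_match)
```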

---

## Categorical Imputation

Variables: ['GarageFinish', 'BsmtFinType1']

- We fill missing values in these categorical columns with 'None', since they likely indicate that the garage or basement is not present.

In [None]:
from feature_engine.imputation import CategoricalImputer

variables_categorical = ["GarageFinish", "BsmtFinType1"]
imputer = CategoricalImputer(
    imputation_method="missing",  
    fill_value="None",  
    variables=variables_categorical,  
)

# Fit on training set and apply transformation
df_method = imputer.fit_transform(TrainSet)

# Visualize the impact of imputation
DataCleaningEffect(
    df_original=TrainSet,
    df_cleaned=df_method,
    variables_applied_with_method=variables_categorical,
)
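In plain pandas terms, filling with a constant label amounts to the sketch below. The category codes in the toy column are illustrative only, and the actual imputer additionally remembers which variables it was fitted on.

```python
import numpy as np
import pandas as pd

# Toy column with illustrative category codes and two missing entries
toy = pd.DataFrame({"GarageFinish": ["Fin", np.nan, "Unf", np.nan, "RFn"]})

# Replace missing entries with the literal string "None"
filled = toy["GarageFinish"].fillna("None")

# "None" now appears as its own category in the counts
counts = filled.value_counts()
print(counts)
```

Keeping "None" as a separate category preserves the information that the feature is absent, rather than merging those rows into an existing category.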

In [None]:
TrainSet[(TrainSet["GarageArea"] == 0)][["GarageFinish", "GarageArea"]]

In [None]:
TrainSet[(TrainSet["TotalBsmtSF"] == 0)][["BsmtFinType1", "TotalBsmtSF"]]

---

## Data Cleaning Pipeline
- We combine all cleaning steps into one pipeline for efficiency.
- Steps included:
    - Mean imputation: ['LotFrontage', 'BedroomAbvGr']
    - Median imputation: ['2ndFlrSF', 'MasVnrArea']
    - Categorical imputation: ['GarageFinish', 'BsmtFinType1']
    - Drop features: ['EnclosedPorch', 'GarageYrBlt', 'WoodDeckSF']

In [None]:
from sklearn.pipeline import Pipeline

mean_vars = ["LotFrontage", "BedroomAbvGr"]
median_vars = ["2ndFlrSF", "MasVnrArea"]
cat_vars = ["GarageFinish", "BsmtFinType1"]
drop_vars = ["EnclosedPorch", "GarageYrBlt", "WoodDeckSF"]

dataCleaning_pipeline = Pipeline(
    [
        ("mean", MeanMedianImputer(imputation_method="mean", variables=mean_vars)),
        (
            "median",
            MeanMedianImputer(imputation_method="median", variables=median_vars),
        ),
        (
            "categorical",
            CategoricalImputer(
                imputation_method="missing", fill_value="None", variables=cat_vars
            ),
        ),
        ("drop", DropFeatures(features_to_drop=drop_vars)),
    ]
)

Next, we apply the pipeline to the Train Set, the Test Set, and the full dataset.

In [None]:
# Fit on the Train Set only, then reuse the learned values on the Test Set
TrainSet = dataCleaning_pipeline.fit_transform(TrainSet)
TestSet = dataCleaning_pipeline.transform(TestSet)

In [None]:
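from sklearn.base import clone

# Fit a separate copy on the full dataset so the original pipeline object is not refitted
df = clone(dataCleaning_pipeline).fit_transform(df)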

In [None]:
EvaluateMissingData(TrainSet)

In [None]:
EvaluateMissingData(TestSet)

In [None]:
EvaluateMissingData(df)

After running the missing data check, we can confirm that all missing values in the Train, Test, and original datasets have been taken care of.
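The same confirmation can be turned into an explicit check. The helper `count_missing` below is a hypothetical name introduced for this sketch and is shown on toy frames rather than on the cleaned sets.

```python
import numpy as np
import pandas as pd


def count_missing(frame: pd.DataFrame) -> int:
    """Return the total number of missing cells in a DataFrame."""
    return int(frame.isna().sum().sum())


# Toy frames: one without gaps, one with two missing cells
cleaned = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
dirty = pd.DataFrame({"a": [1, np.nan], "b": ["x", None]})

print(count_missing(cleaned), count_missing(dirty))
```

An `assert count_missing(TrainSet) == 0` placed after the cleaning step would then fail loudly if any gap slipped through.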

---

## Push files to Repo

In [None]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs("outputs/datasets/cleaned", exist_ok=True)

df.to_csv("outputs/datasets/cleaned/HousePricesCleaned.csv", index=False)
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)


Save Data Cleaning Pipeline

In [None]:
import joblib

file_path = "outputs/ml_pipeline/data_cleaning"

# Create the folder for the pipeline; exist_ok avoids an error if it already exists
os.makedirs(file_path, exist_ok=True)

In [None]:
joblib.dump(
    value=dataCleaning_pipeline, filename=f"{file_path}/dataCleaning_pipeline.pkl"
)
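A saved pipeline can later be restored with `joblib.load()` and applied to new data. The sketch below demonstrates the round trip with a toy scikit-learn pipeline written to a temporary folder; `SimpleImputer` stands in for the feature-engine imputers here, and the file name `toy_pipeline.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy pipeline with a single median-imputation step (stand-in for the real one)
toy_pipeline = Pipeline([("median", SimpleImputer(strategy="median"))])
train = pd.DataFrame({"x": [1.0, 2.0, 9.0, np.nan]})
toy_pipeline.fit(train)

# Save to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(value=toy_pipeline, filename=path)
    reloaded = joblib.load(path)

# The reloaded pipeline fills new data with the median learned during fit
new_data = pd.DataFrame({"x": [np.nan, 4.0]})
result = reloaded.transform(new_data)
print(result.ravel())
```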

---

## Summary and the Next Steps

**Summary**

* Out of 24 variables, 9 (37.5%) had missing values.
* We handled them using the following techniques:
    - Mean imputation for: LotFrontage, BedroomAbvGr
    - Median imputation for: 2ndFlrSF, MasVnrArea
    - Categorical imputation for: GarageFinish, BsmtFinType1
    - Dropped: EnclosedPorch, GarageYrBlt, WoodDeckSF

**Next Steps**:
- Create the Data Study Notebook.
- Analyze which features most influence SalePrice.
- Use visualizations such as scatter plots, box plots, and heatmaps.
