# **Data Cleaning Notebook**

## Objectives

- Evaluate missing data and outliers
- Analyze zero vs. NaN semantics for key features
- Quantify distributions and skewness to justify imputation methods
- Clean data (drop, impute, handle outliers)
- Generate cleaned Train and Test sets, save them to outputs/datasets/cleaned, and export the cleaning pipeline


## Inputs

- outputs/datasets/collection/HousePrices.csv

## Outputs

- Cleaned full dataset: outputs/datasets/cleaned/HousePricesCleaned.csv
- Cleaned train/test splits: outputs/datasets/cleaned/TrainSetCleaned.csv, outputs/datasets/cleaned/TestSetCleaned.csv
- Data cleaning pipeline: outputs/ml_pipeline/data_cleaning/dataCleaning_pipeline.pkl


---

### Change Working Directory

In [None]:
import os

current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))  # set project root
print("Current directory:", os.getcwd())

---

### Load Collected Data

In [None]:
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/HousePrices.csv")
df.head(5)

## Identify columns with missing values

In [None]:
vars_with_missing = df.columns[df.isna().sum() > 0].tolist()
print("Columns with missing:", vars_with_missing)
print(df[vars_with_missing].info())

## Evaluate missing data levels

In [None]:
def EvaluateMissingData(df):
    missing_abs = df.isnull().sum()
    missing_pct = (missing_abs / len(df) * 100).round(2)
    return (
        pd.DataFrame(
            {"MissingCount": missing_abs, "MissingPct": missing_pct, "DType": df.dtypes}
        )
        .query("MissingPct > 0")
        .sort_values(by="MissingPct", ascending=False)
    )


print(EvaluateMissingData(df))

---

## 0 vs NaN Analysis for EnclosedPorch, WoodDeckSF

In [None]:
for col in ["EnclosedPorch", "WoodDeckSF"]:
    zeros = (df[col] == 0).sum()
    nans = df[col].isna().sum()
    print(f"{col}: zeros={zeros}, NaNs={nans}, total={len(df)}")

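Interpretation: A 0 most likely means the house has no enclosed porch or deck, whereas NaN means the value was not recorded, so the zeros are valid data and should not be treated as missing. However, because more than 85% of the values in these columns are NaN, we drop both features instead of imputing them.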
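The distinction above can be sketched on a toy column. The values, the helper name `zero_nan_shares`, and the `DROP_THRESHOLD` cut-off are all assumptions made for illustration; they are not taken from the HousePrices data or from this notebook's actual decision rule.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values): 0 = "no porch", NaN = "not recorded"
toy = pd.DataFrame(
    {"EnclosedPorch": [0, 0, 112] + [np.nan] * 7}
)

DROP_THRESHOLD = 0.5  # assumed cut-off for this sketch only


def zero_nan_shares(series):
    """Return the fraction of zeros and the fraction of NaNs in a column."""
    n = len(series)
    return {
        "zero_share": float((series == 0).sum() / n),  # NaN == 0 is False
        "nan_share": float(series.isna().sum() / n),
    }


shares = zero_nan_shares(toy["EnclosedPorch"])
decision = "drop" if shares["nan_share"] > DROP_THRESHOLD else "keep and impute"
print(shares, decision)
```

Reporting the two shares separately keeps a legitimate zero from being counted as missing when the drop decision is made.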

---

## Outlier Detection (IQR Method)

In [None]:
import numpy as np

numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns.tolist()
outliers_summary = []
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower) | (df[col] > upper)].shape[0]
    outliers_summary.append((col, outliers))
outliers_df = pd.DataFrame(
    outliers_summary, columns=["Variable", "OutlierCount"]
).sort_values("OutlierCount", ascending=False)
outliers_df.head(10)

Next steps: Defer the choice between Winsorizing and removing specific extreme outliers to the Modeling stage.
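For reference, the two options differ as follows. This is a minimal sketch on made-up numbers that reuses the same 1.5 × IQR fences as the detection cell; the helper names `iqr_bounds` and `winsorize_iqr` are hypothetical and not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the lower and upper IQR fences for a numeric column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize_iqr(series, k=1.5):
    """Cap values outside the IQR fences instead of removing them."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Toy column with one extreme value (made-up numbers)
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0], name="ToyArea")

capped = winsorize_iqr(toy)               # option 1: keep all rows, cap the extreme
lower, upper = iqr_bounds(toy)
trimmed = toy[toy.between(lower, upper)]  # option 2: drop rows outside the fences

print(len(capped), capped.max(), len(trimmed))
```

Capping keeps the row count unchanged, while trimming shrinks the dataset; which one is preferable depends on the model and is left open here.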

---

## Distribution & Skewness Analysis for Imputation

In [None]:
vars_mean = ["LotFrontage", "BedroomAbvGr"]
vars_median = ["2ndFlrSF", "GarageYrBlt", "MasVnrArea"]

# Compute skewness
skew_info = pd.Series(df[vars_mean + vars_median].skew()).to_frame("Skewness")
print(skew_info)  # a bare expression mid-cell would not be displayed

# Plot histograms
import matplotlib.pyplot as plt

for col in vars_mean + vars_median:
    plt.figure(figsize=(6, 3))
    df[col].hist(bins=30)
    plt.title(f"{col} Distribution (skew={skew_info.loc[col,'Skewness']:.2f})")
    plt.show()

Justification:
- vars_mean: skewness near 0 ⇒ mean imputation is appropriate
- vars_median: moderate skew (|skewness| > 0.5) ⇒ median imputation is more robust
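The rule of thumb above can be expressed as a small helper. This is a sketch under stated assumptions: the function name `choose_imputation` and the toy columns are hypothetical, the 0.5 threshold mirrors the justification above, and the absolute value of the skewness is used so that left-skewed columns are treated the same way as right-skewed ones.

```python
import pandas as pd

SKEW_THRESHOLD = 0.5  # mirrors the rule of thumb above


def choose_imputation(series, threshold=SKEW_THRESHOLD):
    """Suggest 'median' for skewed columns and 'mean' otherwise."""
    # Series.skew() ignores NaN values by default
    return "median" if abs(series.skew()) > threshold else "mean"


# Toy columns (made-up values)
toy = pd.DataFrame(
    {
        "Symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
        "RightSkewed": [1.0, 1.0, 2.0, 2.0, 50.0],
    }
)

methods = {col: choose_imputation(toy[col]) for col in toy.columns}
print(methods)
```

The helper only suggests a method; the histograms are still worth checking, since a single skewness number can hide multimodal shapes.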

---

## Profile Report (Missing Variables Only)

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df[vars_with_missing], minimal=True)
profile.to_notebook_iframe()

---

## Data Cleaning Functions

In [None]:
def EvaluateMissingData(df):  # (repeat for reuse)
    missing_abs = df.isnull().sum()
    missing_pct = (missing_abs / len(df) * 100).round(2)
    return (
        pd.DataFrame(
            {"MissingCount": missing_abs, "MissingPct": missing_pct, "DType": df.dtypes}
        )
        .query("MissingPct > 0")
        .sort_values(by="MissingPct", ascending=False)
    )


import seaborn as sns
import matplotlib.pyplot as plt


def DataCleaningEffect(df_orig, df_clean, vars_applied):
    print("\n", "=" * 60)
    for var in vars_applied:
        plt.figure(figsize=(6, 3))
        if df_orig[var].dtype == "object":
            sns.countplot(x=var, hue=None, data=df_orig, label="orig")
            sns.countplot(x=var, hue=None, data=df_clean, label="clean", alpha=0.7)
        else:
            sns.histplot(df_orig[var], element="step", stat="density", label="orig")
            sns.histplot(df_clean[var], element="step", stat="density", label="clean")
        plt.title(f"{var} Before/After")
        plt.legend()
        plt.show()

---

## Data Cleaning Summary & Decisions
- Drop features: EnclosedPorch, WoodDeckSF (NaN >85%)
- Impute mean: LotFrontage, BedroomAbvGr
- Impute median: 2ndFlrSF, GarageYrBlt, MasVnrArea
- Impute categorical: GarageFinish, BsmtFinType1, BsmtExposure (fill with 'None')

---

### Split Dataset into Train and Test 

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)
print(f"TrainSet shape: {TrainSet.shape}\nTestSet shape: {TestSet.shape}")

### Verify TrainSet missing before cleaning

In [None]:
print(EvaluateMissingData(TrainSet))

---

## Apply Data Cleaning Pipeline (separate fit/transform)

In [None]:
from feature_engine.selection import DropFeatures
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from sklearn.pipeline import Pipeline

# Define steps
drop_vars = ["EnclosedPorch", "WoodDeckSF"]
mean_vars = vars_mean
median_vars = vars_median
cat_vars = ["GarageFinish", "BsmtFinType1", "BsmtExposure"]

pipeline = Pipeline(
    [
        ("drop", DropFeatures(features_to_drop=drop_vars)),
        ("mean", MeanMedianImputer(imputation_method="mean", variables=mean_vars)),
        (
            "median",
            MeanMedianImputer(imputation_method="median", variables=median_vars),
        ),
        (
            "cat",
            CategoricalImputer(
                imputation_method="missing", fill_value="None", variables=cat_vars
            ),
        ),
    ]
)

# Fit on Train only, then transform Train, Test, and the full dataset
pipeline.fit(TrainSet)
TrainCleaned = pipeline.transform(TrainSet)
TestCleaned = pipeline.transform(TestSet)
FullCleaned = pipeline.transform(df)

# Check missing after cleaning
print("Train missing:\n", EvaluateMissingData(TrainCleaned))
print("Test missing:\n", EvaluateMissingData(TestCleaned))
print("Full missing:\n", EvaluateMissingData(FullCleaned))

# Visualize effect for each group
DataCleaningEffect(TrainSet, TrainCleaned, mean_vars + median_vars + cat_vars)

---

## Save Cleaned Data and Pipeline

In [None]:
# Create output dirs
os.makedirs("outputs/datasets/cleaned", exist_ok=True)
# Save CSVs
pd.DataFrame(FullCleaned, columns=TrainCleaned.columns).to_csv(
    "outputs/datasets/cleaned/HousePricesCleaned.csv", index=False
)
pd.DataFrame(TrainCleaned, columns=TrainCleaned.columns).to_csv(
    "outputs/datasets/cleaned/TrainSetCleaned.csv", index=False
)
pd.DataFrame(TestCleaned, columns=TestCleaned.columns).to_csv(
    "outputs/datasets/cleaned/TestSetCleaned.csv", index=False
)
# Save pipeline
import joblib

os.makedirs("outputs/ml_pipeline/data_cleaning", exist_ok=True)
joblib.dump(pipeline, "outputs/ml_pipeline/data_cleaning/dataCleaning_pipeline.pkl")

---

## Summary and Next Steps

**Summary**
- Missing values: All handled (two features dropped, the remaining ones imputed).
- Outliers: Detected; decide handling strategy in Modeling Notebook.
- Pipeline: Fit on train, transform test/full; avoids data leakage.
- Justifications: Skewness, zero-vs-NaN, distribution plots, IQR analysis.
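To illustrate the leakage point in the summary, the sketch below imputes a toy column by hand with plain pandas. The values are made up and the approach is a simplified stand-in for what a fitted mean imputer does, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd

# Toy train/test split (made-up values)
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 200.0]})

# "Fit": learn the statistic from the training rows only
train_mean = train["LotFrontage"].mean()

# "Transform": reuse the training statistic on both sets
train_filled = train["LotFrontage"].fillna(train_mean)
test_filled = test["LotFrontage"].fillna(train_mean)

# Leaky alternative, shown only for contrast: the mean is computed
# on all rows, so test values influence the imputed numbers
leaky_mean = pd.concat([train, test])["LotFrontage"].mean()

print(train_mean, leaky_mean)
```

The gap between the two means shows how test data would otherwise influence the values used to fill the training set.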

**Next Steps**
- Move to Data Study (EDA) Notebook to analyze feature–target relationships and generate visual insights for the dashboard.
