# **Data Cleaning Notebook**

## Objectives

* Evaluate missing data
* Clean data


## Inputs

* outputs/datasets/collection/HousePrices.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned


---

## Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

Change the working directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

## Load Collected Data

In [None]:
import pandas as pd

df_raw_path = "outputs/datasets/collection/HousePrices.csv"
df = pd.read_csv(df_raw_path)
df.head(3)

## Data Exploration

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport

if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("No missing data found.")

## Correlation and PPS Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps
import warnings

warnings.filterwarnings("ignore")

# Define heatmap functions
def heatmap_corr(df, threshold, figsize=(18, 12), font_annot=8):
    # Hide the diagonal and upper triangle (the matrix is symmetric)
    mask = np.zeros_like(df, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Hide correlations whose absolute value is below the threshold
    mask[abs(df) < threshold] = True
    
    plt.figure(figsize=figsize)
    sns.heatmap(df, mask=mask, annot=True, cmap="viridis", annot_kws={"size": font_annot}, linewidths=0.5)
    plt.show()

def heatmap_pps(df, threshold, figsize=(18, 12), font_annot=8):
    mask = np.zeros_like(df, dtype=bool)
    mask[abs(df) < threshold] = True
    
    plt.figure(figsize=figsize)
    sns.heatmap(df, mask=mask, annot=True, cmap="rocket_r", annot_kws={"size": font_annot}, linewidths=0.5)
    plt.show()

def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman", numeric_only=True)
    df_corr_pearson = df.corr(method="pearson", numeric_only=True)

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.pivot(columns='x', index='y', values='ppscore')

    return df_corr_pearson, df_corr_spearman, pps_matrix

# Run analysis
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

# Display heatmaps
heatmap_corr(df_corr_pearson, threshold=0.4)
heatmap_corr(df_corr_spearman, threshold=0.4)
heatmap_pps(pps_matrix, threshold=0.2)
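
The masking logic in `heatmap_corr` can be hard to picture from the plot alone. The sketch below rebuilds the same kind of mask on a small, made-up 3x3 correlation matrix (the values and the column names `a`, `b`, `c` are assumptions for illustration, not taken from the dataset) and shows which cells would remain visible.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix, for illustration only.
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.5],
     [0.1, -0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)
threshold = 0.4

mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True            # hide diagonal and upper triangle
mask[np.abs(corr.to_numpy()) < threshold] = True   # hide weak correlations

# Cells where the mask is True become NaN; the rest would be drawn.
visible = corr.mask(mask)
print(visible)
```

Only the lower-triangle cells whose absolute value reaches the threshold survive, which is why the plotted heatmaps look sparse.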

## Data Cleaning

### Evaluate Missing Data

In [None]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute / len(df) * 100, 2)
    df_missing_data = pd.DataFrame({
        "RowsWithMissingData": missing_data_absolute,
        "PercentageOfDataset": missing_data_percentage,
        "DataType": df.dtypes   
    }).sort_values(by="PercentageOfDataset", ascending=False).query("PercentageOfDataset > 0")
    return df_missing_data

# Run on full dataset
EvaluateMissingData(df)

## Data Cleaning Strategy

In [None]:
# Drop low-quality or subjective features
drop_vars = ['BsmtExposure', 'GarageFinish', 'WoodDeckSF', 'KitchenQual', 'EnclosedPorch', 'LotFrontage', 'BsmtFinType1']

# Fill remaining missing numerical values (median)
fill_median_vars = ['GarageYrBlt', 'MasVnrArea', 'BedroomAbvGr']

# Fill remaining missing with Zero
fill_zero_vars = ['2ndFlrSF']
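
Before applying a strategy like this, it can help to confirm that every variable with missing data is assigned to one of the lists. A minimal sketch of that check follows, using a hypothetical toy frame; the column names `A` to `D` and the list names `drop_cols`, `median_cols`, and `zero_cols` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame; not the real dataset.
toy = pd.DataFrame({
    "A": [1.0, None, 3.0],
    "B": [None, "x", "y"],
    "C": [1, 2, 3],
    "D": [None, 5.0, 6.0],
})

# Hypothetical strategy lists, mirroring the structure used in this notebook.
drop_cols = ["B"]
median_cols = ["A"]
zero_cols = []

with_missing = set(toy.columns[toy.isna().any()])
planned = set(drop_cols) | set(median_cols) | set(zero_cols)

# Columns that have gaps but are not covered by any strategy.
uncovered = sorted(with_missing - planned)
print("Columns with missing data but no strategy:", uncovered)
```

In the toy frame, `D` has a gap but no assigned strategy, so it is the only column reported.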

## Split Dataset into Train/Test 

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)
print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

## Drop Variables

In [None]:
from feature_engine.selection import DropFeatures

dropper = DropFeatures(features_to_drop=drop_vars)
dropper.fit(TrainSet)
TrainSet = dropper.transform(TrainSet)
TestSet = dropper.transform(TestSet)
df = dropper.transform(df)

## Fill Zero Values

In [None]:
for col in fill_zero_vars:
    # Assign the result back; inplace=True on a column selection may not update the frame
    TrainSet[col] = TrainSet[col].fillna(0)
    TestSet[col] = TestSet[col].fillna(0)
    df[col] = df[col].fillna(0)

## Median Imputation

In [None]:
from sklearn.impute import SimpleImputer

median_imputer = SimpleImputer(strategy='median')
# Learn the medians from the training set only, then reuse them everywhere
# so that no information from the test rows leaks into the imputation.
TrainSet[fill_median_vars] = median_imputer.fit_transform(TrainSet[fill_median_vars])
TestSet[fill_median_vars] = median_imputer.transform(TestSet[fill_median_vars])
df[fill_median_vars] = median_imputer.transform(df[fill_median_vars])
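
To make the idea of train-fitted imputation concrete, here is a minimal pandas-only sketch in which the median is computed from the training rows and then reused for the test rows. The frames `train` and `test` and the column `x` are hypothetical; this is not the notebook's `SimpleImputer` pipeline, only an illustration of the same principle.

```python
import pandas as pd

# Hypothetical data, for illustration only.
train = pd.DataFrame({"x": [1.0, 2.0, None, 9.0]})
test = pd.DataFrame({"x": [None, 100.0]})

# Medians are learned from the training rows only (NaN values are skipped).
train_medians = train[["x"]].median()

# The same training medians fill the gaps in both sets.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_medians["x"])
print(test_filled)
```

The missing test value is filled with the training median of 2.0, not with a value computed from the test rows themselves.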

## Check Final Missing Values

In [None]:
print("Missing in TrainSet:\n", TrainSet.isnull().sum())
print("Missing in TestSet:\n", TestSet.isnull().sum())
print("Missing in full df:\n", df.isnull().sum())
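
Reading long printed count tables is error-prone, so a stricter option is to list only the columns that still contain gaps. The helper below is a sketch of that idea; `remaining_missing` and the frames `clean` and `dirty` are hypothetical names introduced for illustration.

```python
import pandas as pd

def remaining_missing(frame: pd.DataFrame) -> pd.Series:
    """Return per-column NaN counts, keeping only columns that still have gaps."""
    counts = frame.isnull().sum()
    return counts[counts > 0]

# Hypothetical frames, for illustration only.
clean = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
dirty = pd.DataFrame({"a": [1, None], "b": [0.5, 1.5]})

# An empty result means every value is present.
assert remaining_missing(clean).empty
print(remaining_missing(dirty))
```

Applied to the cleaned train and test sets, an empty result would confirm that no missing values remain.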

---

## Push files to Repo

In [None]:
import os

# Create outputs/datasets/cleaned; do nothing if the folder already exists
os.makedirs("outputs/datasets/cleaned", exist_ok=True)

df.to_csv("outputs/datasets/cleaned/HousePricesCleaned.csv", index=False)
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
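
After writing the files, a quick round-trip read can confirm that they were saved as expected. The sketch below does this in a temporary folder with a hypothetical two-row frame; the file name `Example.csv` and the values are assumptions for illustration, not the notebook's real outputs.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame, for illustration only.
frame = pd.DataFrame({"SalePrice": [200000, 150000], "2ndFlrSF": [0.0, 854.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "cleaned"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "Example.csv"

    # index=False keeps the row index out of the file, so no extra column appears on reload.
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == frame.shape
no_missing = not reloaded.isnull().any().any()
print(same_shape, no_missing)
```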


## Summary and Next Steps

**Summary**

- Loaded raw data and inspected missing values.
- Dropped 7 low-quality or subjective features  
  (`BsmtExposure`, `GarageFinish`, `WoodDeckSF`, `KitchenQual`, `EnclosedPorch`, `LotFrontage`, `BsmtFinType1`).
- Imputed missing values:  
  - `2ndFlrSF` → filled with `0`  
  - `GarageYrBlt`, `MasVnrArea`, `BedroomAbvGr` → filled with `median`
- Split the dataset into Train/Test sets.
- Confirmed that all missing values were handled.
- Saved cleaned datasets to `outputs/datasets/cleaned`.

---

**Next Steps**

- Create the Data Study Notebook.
- Analyze which features most influence `SalePrice`.
- Use visualizations such as scatter plots, box plots, and heatmaps.
