# Data Cleaning Notebook

## Objectives

*   Evaluate missing data
*   Clean data

## Inputs

* outputs/datasets/collection/house_prices_records.csv

## Outputs

* Generate cleaned dataset, saved under outputs/datasets/collection/cleaned

## Conclusions

* EnclosedPorch and WoodDeckSF are mostly missing, so both columns are dropped. The remaining missing values are filled with each column's median.

---


# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Collected data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/house_prices_records.csv"
df = pd.read_csv(df_raw_path)
df.head(3)



We map the categorical columns to numerical values to make later analysis easier.

In [None]:
# Define the mapping for categorical columns
cat_mappings = {
    'BsmtExposure': {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'None': 0},
    'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0},
    'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0},
    'KitchenQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}
}

# Apply mappings to the categorical columns
for column, mapping in cat_mappings.items():
    df[column] = df[column].map(mapping)
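
`Series.map` with a dictionary returns NaN for any value that has no key in the dictionary, so an unexpected or misspelled category would quietly become missing data. A minimal sketch of a check for this, run on a small made-up frame rather than the house prices data (the helper name `find_unmapped` and the toy values are assumptions for illustration):

```python
import pandas as pd


def find_unmapped(series, mapping):
    """Return the distinct non-null values in `series` that have no key in `mapping`."""
    values = series.dropna().unique()
    return sorted(v for v in values if v not in mapping)


# Toy data for illustration only; 'Excellent' is deliberately absent from the mapping
toy = pd.DataFrame({'KitchenQual': ['Gd', 'TA', 'Excellent', None]})
kitchen_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}

unmapped = find_unmapped(toy['KitchenQual'], kitchen_map)
mapped = toy['KitchenQual'].map(kitchen_map)

print(unmapped)             # categories that would be turned into NaN
print(mapped.isna().sum())  # missing values after mapping
```

Running a check like this before applying the mappings makes it easier to tell missing values that were in the source file apart from ones introduced by the mapping step.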

# Data Exploration

In data cleaning, we check the distribution and shape of each variable that has missing data.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data
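
Besides listing the affected variables, it can help to see how much of each one is missing. The following is a sketch on toy data, not the notebook's own output; the helper name `missing_summary` and the column names are hypothetical:

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, most incomplete first
    return summary[summary['missing_count'] > 0].sort_values('missing_pct', ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

The same function could be called on `df` to rank the real variables by their share of missing values.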

Explore the missing data

In [None]:
from pandas_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

# Data Cleaning

Drop the variables that consist mostly of missing data (EnclosedPorch and WoodDeckSF).

In [None]:
import pandas as pd
from feature_engine.selection import DropFeatures

features_to_drop = ['EnclosedPorch', 'WoodDeckSF']
drop_features = DropFeatures(features_to_drop=features_to_drop)

df_transformed = drop_features.fit_transform(df)
df_transformed.info()
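
The columns to drop are hard-coded above. As an alternative sketch, they could be selected by the fraction of missing values. The helper name `mostly_missing_columns` and the 0.8 threshold are assumptions for illustration, not values taken from this notebook:

```python
import numpy as np
import pandas as pd


def mostly_missing_columns(frame, threshold=0.8):
    """Return the names of columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = frame.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.to_list()


# Toy data for illustration only
toy = pd.DataFrame({
    'mostly_empty': [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],  # 5 of 6 missing
    'some_missing': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],              # 1 of 6 missing
    'complete': [1, 2, 3, 4, 5, 6],
})
to_drop = mostly_missing_columns(toy, threshold=0.8)
print(to_drop)
```

The resulting list could then be passed as `features_to_drop` to `DropFeatures`, provided it is not empty.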

Fill the remaining missing values with each column's median.

In [None]:
from sklearn.impute import SimpleImputer

# Find columns with missing data
cols_with_missing_data = df_transformed.columns[df_transformed.isnull().any()].tolist()

# Create a SimpleImputer object
imputer = SimpleImputer(strategy='median')

# Fit the imputer on the dataframe and transform it, keeping the original index
df_filled = pd.DataFrame(
    imputer.fit_transform(df_transformed),
    columns=df_transformed.columns,
    index=df_transformed.index,
)

# Copy the imputed values back into the columns that had missing data
df_transformed[cols_with_missing_data] = df_filled[cols_with_missing_data]

# Select only the columns that had missing data
df_selected = df_filled[cols_with_missing_data]

# Print the imputed columns
print(df_selected)

In [None]:
df_transformed.head(5)
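
To make the effect of median imputation concrete, here is a small sketch on toy data. It uses plain pandas (`fillna` with `DataFrame.median`) in place of `SimpleImputer`; for all-numeric columns this should give the same fill values, but it is a stand-in and not the notebook's method:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, 10.0],    # median of observed values: 3.0
    'y': [np.nan, 2.0, 4.0, np.nan],  # median of observed values: 3.0
})

# Fill each column's missing entries with that column's own median
filled = toy.fillna(toy.median())
print(filled)
```

The median is computed from the observed values only, so the outlier 10.0 in `x` has less influence on the fill value than it would with the mean.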

Check again for missing values

In [None]:
vars_with_missing_data = df_transformed.columns[df_transformed.isna().sum() > 0].to_list()
vars_with_missing_data

## Push cleaned file to repo

In [None]:
import os

# Create the output folder if it does not already exist
output_dir = "outputs/datasets/collection/cleaned"
os.makedirs(output_dir, exist_ok=True)

df_transformed.to_csv(f"{output_dir}/house_prices_records_cleaned.csv", index=False)
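
One way to confirm the export is to read the file back and compare it with the frame in memory. A sketch using a temporary folder and a toy frame, so it does not touch the real outputs (the file name `toy_cleaned.csv` is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, 'cleaned')
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, 'toy_cleaned.csv')

    # Write without the index, then read the file back
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
no_missing = not reloaded.isna().any().any()
print(same_shape, no_missing)
```

The same two checks could be applied to the cleaned house prices file after it is saved.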