# Data Cleaning


## Objectives

+ Impute missing data.
+ Clean data.

## Inputs

+ outputs/datasets/collection/LoanDefault.csv

## Outputs

+ Generate a cleaned dataset and split it into train and test sets.
+ Save both sets in outputs/datasets/cleaned.

---

## Preparing for cleaning

We must first change the working directory and then load the data.


In [None]:
import os

current_dir = os.getcwd() # get the current working directory
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir)) # change directory to parent directory
print("The directory you are in is:", os.getcwd()) # print current directory

In [28]:
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/LoanDefault.csv")
df.isna().sum()[df.isna().sum() > 0]

loan_limit                    3344
approv_in_adv                  908
loan_purpose                   134
rate_of_interest             36439
Interest_rate_spread         36639
Upfront_charges              39642
term                            41
Neg_ammortization              121
property_value               15098
income                        9150
age                            200
submission_of_application      200
LTV                          15098
dtir1                        24121
dtype: int64

This confirms that 14 features contain missing values, ranging from 41 rows (`term`) to 39,642 rows (`Upfront_charges`). Each of them is imputed in the cleaning step.

In [7]:
features_with_missing = df.columns[df.isna().any()].tolist()
features_with_missing

['loan_limit',
 'approv_in_adv',
 'loan_purpose',
 'rate_of_interest',
 'Interest_rate_spread',
 'Upfront_charges',
 'term',
 'Neg_ammortization',
 'property_value',
 'income',
 'age',
 'submission_of_application',
 'LTV',
 'dtir1']
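
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to build such a report. It runs on a small made-up frame (`toy`), not on the loan data, and the names `toy` and `missing_report` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "term": [360.0, np.nan, 180.0, 360.0],
    "income": [np.nan, np.nan, 5000.0, 7200.0],
    "loan_amount": [100000, 250000, 90000, 300000],
})

# Count missing values per column and express them as a percentage of rows
missing = toy.isna().sum()
missing_report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that have gaps, worst first
missing_report = (missing_report[missing_report["n_missing"] > 0]
                  .sort_values("pct_missing", ascending=False))
print(missing_report)
```

Replacing `toy` with `df` should give the same kind of report for the real dataset.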

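## Data cleaning summary

+ Drop `ID` and `year`.
+ Impute the features with missing data:
  + Categorical features (`loan_limit`, `approv_in_adv`, `loan_purpose`, `Neg_ammortization`, `age`, `submission_of_application`): most frequent category.
  + `Upfront_charges`, `rate_of_interest`, `property_value` and `income`: median.
  + `Interest_rate_spread` and `dtir1`: mean.
  + `term`: the maximum term found in the dataset.
  + `LTV`: recomputed as `loan_amount / property_value * 100`, using the imputed `property_value`.
+ Trim outliers in `LTV` by dropping the rows in the upper 5% quantile. `property_value` is not trimmed or capped.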

In [42]:
from feature_engine.imputation import (CategoricalImputer,
                                       MeanMedianImputer,
                                       ArbitraryNumberImputer)
from feature_engine.selection import DropFeatures
from feature_engine.outliers import OutlierTrimmer


# Drop ID and year columns
drop_features = DropFeatures(features_to_drop=['ID', 'year'])
df_cleaned = drop_features.fit_transform(df)

# Impute missing values for categorical features
impute_categorical = CategoricalImputer(imputation_method="frequent")
df_cleaned = impute_categorical.fit_transform(df_cleaned)

# Impute missing values for numerical features
median_variables = ["Upfront_charges",
                    "rate_of_interest",
                    "property_value",
                    "income"]
impute_median = MeanMedianImputer(imputation_method="median",
                                  variables=median_variables)
df_cleaned = impute_median.fit_transform(df_cleaned)

mean_variables = ["Interest_rate_spread", "dtir1"]
impute_mean = MeanMedianImputer(imputation_method="mean",
                                variables=mean_variables)
df_cleaned = impute_mean.fit_transform(df_cleaned)

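# Impute missing terms with the longest term in the dataset
number_of_terms = df["term"].max()
impute_max = ArbitraryNumberImputer(arbitrary_number=number_of_terms,
                                    variables=["term"])
df_cleaned = impute_max.fit_transform(df_cleaned)

# Recompute missing LTV (loan-to-value, in percent) from the loan amount
# and the already-imputed property value
df_cleaned.loc[df_cleaned["LTV"].isnull(), "LTV"] = (
    df_cleaned["loan_amount"] / df_cleaned["property_value"] * 100
)

# Drop rows with outlying LTV values (upper 5% quantile), as they don't make sense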
outliers = OutlierTrimmer(capping_method="quantiles",
                          fold=0.05,
                          variables=["LTV"])
df_cleaned = outliers.fit_transform(df_cleaned)
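
To make the last two steps concrete, here is a pandas-only approximation of the `LTV` recomputation and the right-tail quantile trim. It does not use the feature_engine transformers. The frame `toy` and its values are made up, and the exact boundary handling of `OutlierTrimmer` may differ from the `<=` comparison assumed here.

```python
import numpy as np
import pandas as pd

# Made-up loans; two rows have no LTV recorded
toy = pd.DataFrame({
    "loan_amount":    [80.0, 150.0, 90.0, 200.0, 300.0],
    "property_value": [100.0, 200.0, 100.0, 400.0, 10.0],
    "LTV":            [80.0, np.nan, 90.0, 50.0, np.nan],
})

# Recompute LTV (in percent) only where it is missing
ratio = toy["loan_amount"] / toy["property_value"] * 100
toy["LTV"] = toy["LTV"].fillna(ratio)

# Right-tail trim: drop rows whose LTV exceeds the 95th percentile
upper = toy["LTV"].quantile(0.95)
trimmed = toy[toy["LTV"] <= upper]

print(f"Upper limit: {upper:.1f}")
print(trimmed)
```

The last row has a tiny property value, so its recomputed `LTV` is far above the others and the trim removes it.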


In [40]:

# Check for missing values again
df_cleaned.isna().sum()[df_cleaned.isna().sum() > 0]

Series([], dtype: int64)

### Split Train and Test Set

In [50]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet = train_test_split(df_cleaned,
                                     test_size=0.2,
                                     random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

TrainSet shape: (113075, 32) 
TestSet shape: (28269, 32)
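
This notebook imputes on the full dataset before splitting. A common alternative, sketched here as an assumption and not as this notebook's method, is to learn the imputation statistics from the train set only, so that test rows do not influence them. The frame `toy`, the fixed split and the name `train_median` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in income
toy = pd.DataFrame({
    "income": [1000.0, np.nan, 3000.0, 5000.0, np.nan, 100000.0],
    "Status": [0, 1, 0, 0, 1, 0],
})

# Hypothetical fixed split: first four rows are train, last two are test
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the median from the train rows only
train_median = train["income"].median()

# Apply the same learned value to both sets
train["income"] = train["income"].fillna(train_median)
test["income"] = test["income"].fillna(train_median)

print(f"Median learned from train: {train_median}")
print(test)
```

The large income in the test rows never enters the median, which is the point of fitting on the train set alone.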


## Save data

In [49]:
import os

# Create the output folder; do nothing if it already exists
os.makedirs("outputs/datasets/cleaned", exist_ok=True)

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSet.csv", index=False)
TestSet.to_csv("outputs/datasets/cleaned/TestSet.csv", index=False)