# **Data Cleaning Notebook**

## Objectives

* Clean data
* Split cleaned dataset into Train and Test sets

## Inputs

* outputs/datasets/collection/LoanDefaultData.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned


---

In [None]:
# Ignore FutureWarnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Change working directory

We need to change the working directory from the folder where the notebook is stored to its parent folder, so that relative paths such as `outputs/datasets/...` resolve from the project root.
* First we access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Then we make the parent of the current directory the new current directory
    * os.path.dirname() returns the parent directory
    * os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")

---

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/LoanDefaultData.csv")
df.head(3)

# Data Cleaning

## Dropping Variables

In the data collection step we already established that there are no duplicate `LoanID` values in the dataset. As the variable is only an identifier for each record, we now drop it from the dataset.

In [None]:
df.drop("LoanID", axis=1, inplace=True)

## Data Types

In the exploratory analysis we saw that `NumCreditLines`, `LoanTerm` and `Default` are categorical variables rather than numerical ones. To keep their data types consistent with the other categorical variables, which are currently stored as `object`, we convert all of them to the `category` data type.

In [None]:
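# Numeric-coded categoricals plus every object (string) column, as one flat list
cat_cols = ["NumCreditLines", "LoanTerm", "Default", *df.select_dtypes(include='object').columns]
for col in cat_cols:
    df[col] = df[col].astype('category')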

df.dtypes
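
As a quick illustration of the conversion described above, here is a minimal sketch on a toy dataframe. Only `NumCreditLines` and `LoanTerm` are taken from this notebook; `Education`, `Income` and all values are hypothetical and stand in for the real columns.

```python
import pandas as pd

# Toy frame with made-up values; not the real loan dataset
toy = pd.DataFrame({
    "NumCreditLines": [1, 2, 4, 2],
    "LoanTerm": [12, 36, 60, 36],
    "Education": pd.Series(["PhD", "Master's", "PhD", "High School"], dtype="object"),
    "Income": [50000, 62000, 48000, 71000],
})

# Numeric-coded categoricals plus every object (string) column
cat_cols = ["NumCreditLines", "LoanTerm", *toy.select_dtypes(include="object").columns]
for col in cat_cols:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
# Each categorical column now exposes its distinct levels
print(toy["LoanTerm"].cat.categories.tolist())
```

Columns not listed in `cat_cols`, such as the hypothetical `Income`, keep their numeric type.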

## Missing Data

Once again we check for missing data and can confirm that no missing data is present in the dataset.

In [None]:
print("Number of missing values in each column:")
print(df.isna().sum())

print("\nTotal number of missing values in the dataframe:", df.isna().sum().sum())

* We already checked for outliers and for consistency of the categorical values during the exploratory data analysis and found no issues. Since the dataset also contains no missing values, no further data cleaning steps are required at this point.

# Split Train and Test Set

After cleaning the dataset we can split it into train and test sets for modelling.

In [None]:
from sklearn.model_selection import train_test_split
# Split the full dataframe; the target column stays inside both subsets
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
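
The split above is purely random. If `Default` turns out to be imbalanced (an assumption, not something established in this notebook), passing the target to the `stratify` parameter of `train_test_split` keeps the class ratio the same in both subsets. A minimal sketch on toy data with a hypothetical 80/20 class balance:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a made-up 80/20 class balance; not the real loan dataset
toy = pd.DataFrame({
    "Income": range(100),
    "Default": [0] * 80 + [1] * 20,
})

train, test = train_test_split(
    toy,
    test_size=0.2,
    random_state=0,
    stratify=toy["Default"],  # preserve the class ratio in both subsets
)

# Share of defaults in each subset
print(train["Default"].mean(), test["Default"].mean())
```

Whether stratification is worthwhile here depends on the actual distribution of `Default` found in the exploratory analysis.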

Re-evaluate for missing data in the TrainSet:

In [None]:
print("\nTotal number of missing values in the train set:", TrainSet.isna().sum().sum())

---

# Push files to Repo

In [None]:
import os

# Set the output folder
file_path = 'outputs/datasets/cleaned'

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(file_path, exist_ok=True)

# Save the Train and Test sets as csv files for further use
filename = "TrainSet.csv"
TrainSet.to_csv(f"{file_path}/{filename}", index=False)

filename = "TestSet.csv"
TestSet.to_csv(f"{file_path}/{filename}", index=False)
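
One caveat when reusing these files: CSV stores no type information, so the `category` dtypes set earlier are not preserved on disk. The sketch below shows the effect on a toy frame, using an in-memory buffer in place of the real files; the column values are hypothetical.

```python
import io
import pandas as pd

# Toy frame with made-up values; LoanTerm is stored as a category
toy = pd.DataFrame({
    "LoanTerm": pd.Series([12, 36, 60], dtype="category"),
    "Income": [50000, 62000, 48000],
})

# Round-trip through an in-memory CSV instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)  # LoanTerm comes back as a plain integer column

# Passing dtype= restores the categorical type at load time
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"LoanTerm": "category"})
print(restored.dtypes)
```

A later notebook that loads `TrainSet.csv` and `TestSet.csv` would therefore need to reapply the conversion, either with `dtype=` as above or with `astype('category')` after loading. According to the pandas I/O documentation, categories restored with `dtype='category'` are parsed as strings.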


---

# Conclusions and Next Steps

The dataset required only minimal data cleaning activities.

Next Steps:
* Prepare data for feature engineering and modeling: scaling numerical variables, encoding categorical features, and creating new derived features if needed.