# Loan Portfolio Data Cleaning and Transformation

This notebook cleans and transforms the loan portfolio data.
We put every column into the correct format, convert selected columns to categorical types, and handle missing values.
These steps are a prerequisite for exploratory data analysis (EDA) and for querying the data.

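The steps we'll follow:
1. Load the dataset.
2. Inspect the columns and data types.
3. Identify the required transformations and import the `DataTransform` class.
4. Apply the transformations (convert dates, clean the `term` column, convert categorical columns, handle missing values).
5. Summarise the result so the data is ready for further analysis.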

## Step 1: Load the Dataset

We'll begin by loading the loan portfolio dataset into a Pandas DataFrame to inspect the data structure.


In [None]:
import pandas as pd

# Load the dataset from the CSV file
df = pd.read_csv('loan_payments.csv')

# Display the first few rows to inspect the data
df.head()


## Step 2: Inspect the Dataset

We will inspect the columns and data types to identify which columns need to be transformed.


In [None]:
# Display basic information about the dataset, including column names and data types

df.info()
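
Alongside `df.info()`, a per-column count of missing values can make it easier to see which columns need attention. The following is a minimal sketch rather than part of the original workflow: the helper name `summarise_missing` is hypothetical, and the small DataFrame is made-up data standing in for the real loan dataset.

```python
import pandas as pd


def summarise_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy data standing in for the loan dataset (values are illustrative only)
sample = pd.DataFrame({
    "loan_amount": [1000, 2500, 4000, 1200],
    "mths_since_last_delinq": [None, 12.0, None, None],
    "term": ["36 months", "60 months", None, "36 months"],
})

missing_summary = summarise_missing(sample)
print(missing_summary)
```

On the real data, the same helper could be called as `summarise_missing(df)`.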


## Step 3: Data Cleaning and Transformation

We have identified that certain columns in the dataset are not in the correct format:
- Date columns need to be converted to a `datetime` format.
- The `term` column needs to be cleaned by removing unnecessary text and converting to a numerical format.
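- Categorical columns such as `grade`, `sub_grade`, `home_ownership`, `verification_status`, `loan_status`, `purpose`, and `application_type` should be converted to the `category` data type.
- Missing values in `mths_since_last_delinq`, `mths_since_last_record`, and `mths_since_last_major_derog` need to be handled.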

We will now import and apply a `DataTransform` class that handles these transformations.

In [None]:
# Import the DataTransform class from the d_transform.py module

from d_transform import DataTransform
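
The source of `DataTransform` is not shown in this notebook. To make the method calls in the next step easier to follow, here is a hedged sketch of what such a class might look like. Everything in it is an assumption: the class name `DataTransformSketch` is hypothetical, the method names simply mirror the calls used below, and the median fill for missing values is a placeholder strategy, not necessarily what the real class does.

```python
import pandas as pd


class DataTransformSketch:
    """Hypothetical stand-in for DataTransform; the real class may differ."""

    def __init__(self, df: pd.DataFrame):
        # Keep a reference (not a copy) so changes show up on the caller's DataFrame
        self.df = df

    def convert_to_datetime(self, columns):
        for column in columns:
            # Values that cannot be parsed become NaT instead of raising an error
            self.df[column] = pd.to_datetime(self.df[column], errors="coerce")

    def clean_term_column(self, column="term"):
        # "36 months" -> 36; rows without digits become missing
        digits = self.df[column].str.extract(r"(\d+)", expand=False)
        self.df[column] = pd.to_numeric(digits, errors="coerce").astype("Int64")

    def convert_to_categorical(self, columns):
        for column in columns:
            self.df[column] = self.df[column].astype("category")

    def handle_missing_values(self, columns):
        for column in columns:
            # Assumption: fill numeric gaps with the column median
            self.df[column] = self.df[column].fillna(self.df[column].median())


# Toy data standing in for the loan dataset (values are illustrative only)
raw = pd.DataFrame({
    "issue_date": ["2021-01-15", "2021-02-20", "not a date"],
    "term": ["36 months", "60 months", None],
    "grade": ["A", "B", "A"],
    "mths_since_last_delinq": [10.0, None, 30.0],
})

sketch = DataTransformSketch(raw)
sketch.convert_to_datetime(["issue_date"])
sketch.clean_term_column()
sketch.convert_to_categorical(["grade"])
sketch.handle_missing_values(["mths_since_last_delinq"])
print(raw.dtypes)
```

Because the sketch stores a reference to the DataFrame it receives, the changes are visible on `raw` itself. Whether a median fill is appropriate depends on the column: for a field like `mths_since_last_delinq`, a missing value may mean "never delinquent", in which case imputing a number could mislead later analysis.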


## Step 4: Apply Data Transformations

We'll now use the `DataTransform` class to:
- Convert date columns (`issue_date`, `last_payment_date`, etc.) to a `datetime` format.
- Clean the `term` column to retain only the numeric value.
- Convert relevant columns to categorical data types.
- Handle missing values in the appropriate columns.


In [None]:
# Initialize the DataTransform class with the DataFrame
transformer = DataTransform(df)

# Convert date columns to datetime
date_columns = ['issue_date', 'last_payment_date', 'next_payment_date', 'last_credit_pull_date']
transformer.convert_to_datetime(date_columns)

# Clean the 'term' column
transformer.clean_term_column()

# Convert relevant columns to categorical data type
categorical_columns = ['grade', 'sub_grade', 'home_ownership', 'verification_status', 
                       'loan_status', 'purpose', 'application_type']
transformer.convert_to_categorical(categorical_columns)

# Handle missing values in specified columns
columns_with_missing_values = ['mths_since_last_delinq', 'mths_since_last_record', 'mths_since_last_major_derog']
transformer.handle_missing_values(columns_with_missing_values)

# View the updated DataFrame to check the result (this assumes DataTransform modifies df in place)
df.head()
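
Looking at `df.head()` shows the values but not the data types, so a programmatic dtype check can complement it. The sketch below is one possible way to do that; the helper `find_dtype_mismatches` and the `CHECKS` mapping are hypothetical names, and the small DataFrame is made-up data in which `grade` is deliberately left unconverted.

```python
import pandas as pd
from pandas.api import types as ptypes

# Map a human-readable "kind" to a function that tests a Series for it
CHECKS = {
    "datetime": ptypes.is_datetime64_any_dtype,
    "numeric": ptypes.is_numeric_dtype,
    "category": lambda s: isinstance(s.dtype, pd.CategoricalDtype),
}


def find_dtype_mismatches(frame: pd.DataFrame, expected: dict) -> dict:
    """Return {column: problem} for columns that are absent or of the wrong kind."""
    mismatches = {}
    for column, kind in expected.items():
        if column not in frame.columns:
            mismatches[column] = "missing column"
        elif not CHECKS[kind](frame[column]):
            mismatches[column] = str(frame[column].dtype)
    return mismatches


# Toy data: 'grade' is deliberately left as plain strings
sample = pd.DataFrame({
    "issue_date": pd.to_datetime(["2021-01-15", "2021-02-20"]),
    "term": [36, 60],
    "grade": ["A", "B"],
})
expected_kinds = {"issue_date": "datetime", "term": "numeric", "grade": "category"}

mismatches = find_dtype_mismatches(sample, expected_kinds)
print(mismatches)  # only 'grade' should be reported
```

An empty dictionary means every listed column has the expected kind. On the real data, `expected_kinds` could be built from the `date_columns` and `categorical_columns` lists defined above.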


## Step 5: Summary of Transformations

The data has now been cleaned and transformed:
- Date columns have been converted to `datetime` format.
- The `term` column has been cleaned and converted to a numerical type.
- Categorical columns have been converted to the `category` data type.
- Missing values have been handled in `mths_since_last_delinq`, `mths_since_last_record`, and `mths_since_last_major_derog`.

The dataset is now ready for further analysis and exploration in the next steps.