# Data Exploration

## Objectives

* Perform univariate and correlation analyses to explore the dataset’s structure, identify key relationships between variables and generate insights relevant to Business Requirement 1
    * Business Requirement 1: Data Insights (Conventional Analysis)  
        Identify key customer and loan attributes that are most correlated with loan default. Provide visual and statistical insights to help business analysts understand the primary drivers of credit risk.

## Inputs

* outputs/datasets/collection/LoanDefaultData.csv

## Outputs

* Generate code that answers Business Requirement 1 and can be used within the Streamlit App


---

In [None]:
# Ignore FutureWarnings for message "is_categorical_dtype is deprecated"
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="pandas.core.dtypes.common")

# Change working directory

We need to change the working directory from its current folder, where the notebook is stored, to its parent folder
* First we access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Then we want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")
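Running the cell above a second time moves the working directory up one more level. As a hedged alternative, the sketch below searches upward for the project root instead. It assumes the `outputs` folder sits directly in the project root, which matches the input path used in this notebook. The helper name `find_project_root` and the folder name `jupyter_notebooks` are hypothetical.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Return the closest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Demonstration on a temporary folder layout (all names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "outputs").mkdir()
    (root / "jupyter_notebooks").mkdir()
    found = find_project_root(root / "jupyter_notebooks")
    print(found == root)
```

In the notebook, `os.chdir(find_project_root(os.getcwd()))` would then give the same result no matter how often the cell is executed.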

---

# Load Data

As the variable `LoanID` is a unique identifier for each record, it does not contribute to the prediction and will therefore be dropped for the following analysis.

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/LoanDefaultData.csv")
    .drop(['LoanID'], axis=1)
    )
df.head(3)
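The claim that `LoanID` is a pure identifier can be verified before the column is dropped. The following is a minimal sketch on a made-up toy frame. The helper `drop_identifier` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def drop_identifier(df: pd.DataFrame, id_col: str = "LoanID") -> pd.DataFrame:
    """Drop `id_col` after confirming that it holds one distinct value per row."""
    if not df[id_col].is_unique:
        raise ValueError(f"{id_col} contains duplicates and is not a pure identifier")
    return df.drop(columns=[id_col])


# Toy frame standing in for LoanDefaultData.csv (values are made up)
toy = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3"],
    "Age": [25, 40, 31],
    "Default": [0, 1, 0],
})
toy_clean = drop_identifier(toy)
print(toy_clean.columns.tolist())
```

If the identifier contained duplicates, the check would fail loudly and point to duplicated records that deserve a closer look.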

# Data Exploration

The exploration starts with a univariate analysis of each variable, followed by a multivariate analysis of the relationships between the variables and the target `Default`.

## Univariate Analysis

In this section we examine each variable individually, understand its distribution and check for missing values. This helps us get a clear overview of the dataset before moving on to relationships between variables.

In [None]:
from ydata_profiling import ProfileReport

# Convert object columns to categorical so that it can be displayed properly in the report
df_cat = df.copy()
for col in df_cat.select_dtypes(include='object').columns:
    df_cat[col] = df_cat[col].astype('category')

# Also transform target variable to categorical
df_cat["Default"] = df_cat["Default"].astype('category')
    
pandas_report = ProfileReport(df=df_cat, minimal=True)
pandas_report.to_notebook_iframe()  # needs: pip install --upgrade setuptools and pip install notebook ipython==8.24.0 ipykernel ipywidgets

* The profile report confirmed that the dataset contains no missing values
* Additionally, it shows that the variables `NumCreditLines` and `LoanTerm` behave like categorical rather than continuous numerical variables, as they take only a small number of distinct values. They are therefore converted to the categorical data type for the following analyses

In [None]:
for col in ["NumCreditLines", "LoanTerm"]:
    df_cat[col] = df_cat[col].astype('category')

df_cat.dtypes
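Both observations from the profile report can also be cross-checked directly in pandas. The sketch below runs on a made-up toy frame. The helper `missing_and_cardinality` and the threshold `MAX_LEVELS` are assumptions chosen for illustration, not values from the original analysis.

```python
import pandas as pd


def missing_and_cardinality(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise missing values, distinct values and dtype for every column."""
    return pd.DataFrame({
        "missing": df.isna().sum(),
        "n_unique": df.nunique(),
        "dtype": df.dtypes.astype(str),
    })


# Toy frame with made-up values, standing in for the loan dataset
toy = pd.DataFrame({
    "Income": [52000.0, 61000.5, 47250.0, 88000.0, 39500.0, 73000.0],
    "LoanTerm": [12, 24, 36, 12, 60, 48],
    "NumCreditLines": [1, 2, 3, 4, 2, 1],
})
overview = missing_and_cardinality(toy)

# Hypothetical cut-off: columns with at most this many distinct values
# are treated as candidates for a categorical dtype
MAX_LEVELS = 5
low_cardinality = overview.index[overview["n_unique"] <= MAX_LEVELS].tolist()
print(overview)
print(low_cardinality)
```

Applied to the real data, `missing_and_cardinality(df)` would be expected to show zero missing values and a small `n_unique` for `NumCreditLines` and `LoanTerm`, in line with the report.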

To make key observations easier to digest, additional univariate analyses were performed, highlighting distributions, skewness/kurtosis and class balance for both numerical and categorical features.

#### Distribution Analysis of Numerical Variables

In [None]:
import pandas as pd
import numpy as np

# Select numerical columns
num_cols = df_cat.select_dtypes(include=np.number).columns

# Create summary dataframe
summary = df_cat[num_cols].describe().T  # gives count, mean, std, min, 25%, 50%, 75%, max

# Add skewness
summary['skewness'] = df_cat[num_cols].skew()

# Optionally, add kurtosis
summary['kurtosis'] = df_cat[num_cols].kurtosis()

# Round for better readability
summary = summary.round(2)

summary

* The numerical variables are evenly distributed. Most features (such as Age, Income, LoanAmount, and CreditScore) have skewness values close to 0 and excess kurtosis values around −1.2, which is the value expected for a uniform distribution. The distributions are therefore approximately symmetric with light tails, so the dataset contains no extreme outliers or heavy skewness that would require major transformations (e.g., log or Box-Cox).
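To put these numbers into context, the sketch below computes the same two statistics for synthetic reference samples. pandas reports excess kurtosis (a normal distribution scores 0), so a uniform sample should land near −1.2. The sample size and the seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic reference samples (not taken from the loan dataset)
rng = np.random.default_rng(seed=0)
sample = pd.DataFrame({
    "uniform": rng.uniform(0, 1, size=100_000),
    "normal": rng.normal(0, 1, size=100_000),
})

# Same statistics as in the summary table above
reference = pd.DataFrame({
    "skewness": sample.skew(),
    "kurtosis": sample.kurtosis(),  # excess kurtosis (Fisher definition)
}).round(2)
print(reference)
```

Features whose skewness and kurtosis resemble the `uniform` row are spread almost evenly over their value range.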

The following boxplots confirm these observations:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = df_cat.select_dtypes(include=['float64', 'int64']).columns
n_cols = 2  # number of plots per row
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols  # calculate rows needed

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols*5, n_rows*2))
axes = axes.flatten()  # flatten 2D array for easy indexing

for i, col in enumerate(numeric_cols):
    # Use a single colour; passing the numeric column as hue would split it into one box per distinct value
    sns.boxplot(x=df_cat[col], ax=axes[i], color=sns.color_palette("Set2")[0])
    axes[i].set_title(f"{col}")

# Remove any unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
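The visual impression from the boxplots can be backed by a count of the points that fall outside the whiskers. The sketch below uses the 1.5 × IQR rule, which is the default whisker length in seaborn and matplotlib boxplots. The helper `count_iqr_outliers` is hypothetical and the toy values are made up.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, whisker: float = 1.5) -> pd.Series:
    """Count values per numeric column that lie beyond whisker * IQR from the quartiles."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < q1 - whisker * iqr) | (num > q3 + whisker * iqr)
    return outside.sum()


# Toy frame with one deliberately extreme Income value
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60, 70],
    "Income": [30, 32, 35, 37, 40, 400],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

For the loan data, `count_iqr_outliers(df_cat)` would be expected to return counts at or near zero if the boxplots indeed show no points beyond the whiskers.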

#### Distribution Analysis of Categorical Variables

In [None]:
# Select categorical columns
cat_cols = df_cat.select_dtypes(include='category').columns.tolist()

# Grid layout for plots
n_cols = 3
n_rows = (len(cat_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols*5, n_rows*3))
axes = axes.flatten()

for i, col in enumerate(cat_cols):
    sns.countplot(x=col, data=df_cat, ax=axes[i], hue=col, palette='Set2')
    axes[i].set_title(f"{col} Distribution")
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Count')

# Remove empty subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

* The categorical variables in the dataset are generally well balanced, with each feature containing only 2–5 distinct classes. This indicates that the features are not overly granular, making them suitable for analysis and modeling. 
* The exception is the target variable `Default`, which is imbalanced toward 0 (non-default): the majority of borrowers in the dataset did not default, as already seen in the "Target Variable Exploration" section of the "Data Collection" notebook.

Overall, the distribution analysis confirms that:
* The numerical as well as categorical predictors are well-balanced and suitable for further modeling without extensive preprocessing
* The target imbalance in `Default` should be addressed later during model development
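The balance statements above can be quantified with relative class frequencies. The following is a minimal sketch on a made-up toy frame. The helper `class_balance` and the column `Education` are used purely for illustration.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Report the number of classes and the largest and smallest class share per column."""
    rows = []
    for col in cols:
        shares = df[col].value_counts(normalize=True)
        rows.append({
            "feature": col,
            "n_classes": df[col].nunique(),
            "max_share": shares.max(),
            "min_share": shares.min(),
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy frame: a balanced feature and an imbalanced target (values are made up)
toy = pd.DataFrame({
    "Education": ["Bachelor", "Master"] * 5,
    "Default": [0] * 8 + [1] * 2,
}).astype("category")
balance = class_balance(toy, ["Education", "Default"])
print(balance.round(2))
```

A large gap between `max_share` and `min_share`, as in the `Default` row of the toy frame, marks the kind of imbalance that needs attention during model development.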

---