# Exploratory Data Analysis of Credit Risk Data

This notebook analyzes a credit risk dataset to identify patterns associated with loan defaults.
The focus is on data quality, feature distributions, and relationships relevant for modeling.


## Setup & Configuration

This section defines imports, paths, and plotting configuration to ensure
consistent and reproducible visualizations throughout the analysis.


The project root is assumed to be the current working directory, i.e. the notebook is run from the project root.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Project root (the notebook is located in the root)
ROOT = Path.cwd()

# Output folder for figures
FIG_DIR = ROOT / "reports" / "figures"
FIG_DIR.mkdir(parents=True, exist_ok=True)


plt.rcParams.update({
    "figure.figsize": (10, 5),
    "figure.dpi": 120,
    "savefig.dpi": 220,
    "axes.titlesize": 14,
    "axes.labelsize": 11,
    "xtick.labelsize": 10,
    "ytick.labelsize": 10,
    "axes.grid": False,
})

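def savefig(name: str) -> None:
    """Save the current figure as a PNG in FIG_DIR without closing it."""
    plt.tight_layout()
    plt.savefig(FIG_DIR / f"{name}.png", bbox_inches="tight")
    # The figure is left open so that a subsequent plt.show() still renders it.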
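
The assumption that the working directory is the project root breaks as soon as the notebook is started from a subfolder. A minimal sketch of a more forgiving lookup follows; the helper `find_project_root` and the choice of the `data` folder as a marker are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    # No marker found: fall back to the starting directory,
    # which matches the notebook's original assumption.
    return start


root = find_project_root(Path.cwd())
print(root)
```

If no ancestor contains the marker, the function returns the starting directory, so behavior is unchanged when the notebook really does run from the root.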


## 1. Data Loading


This section loads the dataset with `pd.read_csv` and performs initial sanity checks: the shape, the first rows, and a reproducible random sample.


In [2]:
DATA_PATH = ROOT / "data" / "credit_data.csv"

df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
display(df.head(3))
display(df.sample(3, random_state=42))


Shape: (32581, 12)


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
14668,24,28000,OWN,6.0,HOMEIMPROVEMENT,B,10000,10.37,0,0.36,N,2
24614,27,64000,RENT,0.0,PERSONAL,C,10000,15.27,0,0.16,Y,10
11096,26,72000,MORTGAGE,10.0,EDUCATION,D,16000,,0,0.22,N,3


In [3]:
DATA_PATH = Path.cwd() / "data" / "credit_data.csv"

print("DATA_PATH:", DATA_PATH.resolve())
print("Exists:", DATA_PATH.exists())

df = pd.read_csv(DATA_PATH)
df.head()

DATA_PATH: C:\Users\Metin\Documents\credit-risk-analysis\data\credit_data.csv
Exists: True


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4



## 2. Dataset Overview


The dataset contains a mix of numerical and categorical features related to borrower characteristics and loan attributes.


In [5]:
df.shape


(32581, 12)

In [6]:
df.columns


Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


The dataset contains 32,581 observations and 12 columns: 11 features plus the target `loan_status`. Eight columns are numerical (five `int64`, three `float64`) and four are categorical (`object`).
Only `person_emp_length` (31,686 non-null values) and `loan_int_rate` (29,465 non-null values) contain missing values; all other columns are complete.
This overview provides a foundation for assessing data quality and guiding subsequent preprocessing steps.
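
The split into numerical and categorical columns can also be derived programmatically instead of being read off `df.info()`. A minimal sketch, using a small hypothetical frame `toy` in place of `df` (the values are illustrative):

```python
import pandas as pd

# Hypothetical miniature stand-in for the credit dataset.
toy = pd.DataFrame({
    "person_age": [22, 21, 25],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_int_rate": [16.02, 11.14, None],
    "loan_status": [1, 0, 1],
})

# Numerical columns: any integer or float dtype.
num_cols = toy.select_dtypes(include="number").columns.tolist()
# Categorical columns: everything that is not numerical.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", num_cols)
print("categorical:", cat_cols)
```

Note that `loan_status` is picked up as numerical because it is stored as an integer, even though it is the binary target.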

## 3. Missing Values Analysis

This section analyzes the proportion of missing values per feature to assess
data quality and guide preprocessing decisions.


In [8]:
missing = df.isna().mean().sort_values(ascending=False) * 100
missing = missing[missing > 0]


In [9]:
fig, ax = plt.subplots()
ax.bar(missing.index, missing.values)
ax.set_title("Missing Values per Feature (%)")
ax.set_ylabel("Percentage")
ax.set_xlabel("Feature with missing values")
ax.tick_params(axis="x", rotation=45)
ax.grid(axis="y", alpha=0.3)

savefig("missing_values_percent")
plt.show()

missing.round(2)



loan_int_rate        9.56
person_emp_length    2.75
dtype: float64

Only two features contain missing values: `loan_int_rate` (~9.6%) and
`person_emp_length` (~2.7%). Both proportions are small enough that imputation is
preferable to dropping rows, which would discard information. To avoid data
leakage, imputation statistics should be estimated on the training split only.
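
One way to handle these two columns is median imputation fitted on the training rows only, combined with indicator columns that record where values were missing. The sketch below is an assumption about how preprocessing could look, not the notebook's actual pipeline; the frame `toy` and the fixed train/test split are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two incomplete columns.
toy = pd.DataFrame({
    "loan_int_rate": [16.02, 11.14, np.nan, 15.23, 14.27, np.nan, 7.9, 10.99],
    "person_emp_length": [5.0, np.nan, 1.0, 4.0, 8.0, 2.0, np.nan, 6.0],
})

# Assumed split: the first six rows are training data, the rest are held out.
train, test = toy.iloc[:6].copy(), toy.iloc[6:].copy()

# Learn the fill values from the training rows only.
medians = train.median()

for part in (train, test):
    for col in toy.columns:
        # Record which values were missing before they are filled.
        part[f"{col}_missing"] = part[col].isna()
        part[col] = part[col].fillna(medians[col])

print(medians)
print(test)
```

Because the medians come from `train` alone, nothing about the held-out rows influences the values used to fill them.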


## 4. Target Variable


In [10]:
df["loan_status"].value_counts(normalize=True)


loan_status
0    0.781836
1    0.218164
Name: proportion, dtype: float64
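
With about 78% non-defaults, a model that predicts "no default" for every loan already reaches about 78% accuracy, so accuracy alone says little. The sketch below illustrates this with hypothetical labels `y` that mimic the observed class balance.

```python
import pandas as pd

# Hypothetical labels with roughly the same imbalance as loan_status
# (0 = no default, 1 = default).
y = pd.Series([0] * 78 + [1] * 22, name="loan_status")

# Trivial predictor: always output the majority class.
majority_class = y.mode().iloc[0]
predictions = pd.Series(majority_class, index=y.index)

# Accuracy looks respectable, but no default is ever detected.
baseline_accuracy = (predictions == y).mean()
default_recall = ((predictions == 1) & (y == 1)).sum() / (y == 1).sum()

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"recall on defaults: {default_recall:.2f}")
```

Any model should therefore be compared against this baseline and judged with metrics that are sensitive to the minority class, such as recall or precision on defaults.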

## 5. Loan Interest Rate vs Loan Status


### Descriptive Statistics


In [11]:
df.drop(columns=["loan_status"]).describe()


Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length
count,32581.0,32581.0,31686.0,32581.0,29465.0,32581.0,32581.0
mean,27.7346,66074.85,4.789686,9589.371106,11.011695,0.170203,5.804211
std,6.348078,61983.12,4.14263,6322.086646,3.240459,0.106782,4.055001
min,20.0,4000.0,0.0,500.0,5.42,0.0,2.0
25%,23.0,38500.0,2.0,5000.0,7.9,0.09,3.0
50%,26.0,55000.0,4.0,8000.0,10.99,0.15,4.0
75%,30.0,79200.0,7.0,12200.0,13.47,0.23,8.0
max,144.0,6000000.0,123.0,35000.0,23.22,0.83,30.0
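
The maxima in this table include a `person_age` of 144 and a `person_emp_length` of 123, neither of which is plausible. A minimal sketch for flagging such rows follows; the frame `toy` is hypothetical and the two limits are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical rows; the extreme values mirror the maxima reported by describe().
toy = pd.DataFrame({
    "person_age": [22, 21, 144, 30],
    "person_emp_length": [123.0, 5.0, 4.0, 7.0],
})

# Assumed plausibility limits, not taken from the dataset documentation.
MAX_AGE = 100
MAX_EMP_LENGTH = 60

implausible = (toy["person_age"] > MAX_AGE) | (toy["person_emp_length"] > MAX_EMP_LENGTH)

# An employment length longer than the borrower's age is also inconsistent.
implausible |= toy["person_emp_length"] > toy["person_age"]

print(toy[implausible])
print("rows flagged:", int(implausible.sum()))
```

Whether flagged rows are dropped, capped, or set to missing is a separate preprocessing decision.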


### Loan Interest Rate by Loan Status


In [12]:
df.groupby("loan_status")["loan_int_rate"].mean()


loan_status
0    10.435999
1    13.060207
Name: loan_int_rate, dtype: float64
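
The group means differ by about 2.6 percentage points. Since a mean can be pulled by a few extreme values, it can help to look at the median and the number of non-missing observations per group as well. A sketch on hypothetical data (`toy` is not the real dataset):

```python
import pandas as pd

# Hypothetical sample; one interest rate is missing on purpose.
toy = pd.DataFrame({
    "loan_status": [0, 0, 0, 0, 1, 1, 1],
    "loan_int_rate": [7.9, 10.99, None, 11.14, 16.02, 12.87, 15.23],
})

# count ignores missing values, so it also shows how many rates back each statistic.
summary = toy.groupby("loan_status")["loan_int_rate"].agg(["count", "mean", "median"])

# Difference in mean interest rate between defaults (1) and non-defaults (0).
mean_gap = summary.loc[1, "mean"] - summary.loc[0, "mean"]

print(summary)
print(f"mean gap: {mean_gap:.2f} percentage points")
```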

In [13]:
df.boxplot(column="loan_int_rate", by="loan_status", grid=False)
plt.title("Loan Interest Rate by Loan Status")
plt.suptitle("")
plt.xlabel("Loan Status (0 = No Default, 1 = Default)")
plt.ylabel("Interest Rate (%)")
savefig("loan_int_rate_by_loan_status")
plt.show()



## 6. Correlation Matrix

This section examines correlations between numerical features to identify
strong relationships and potential multicollinearity.


In [14]:
# Select numerical features only
num_df = df.select_dtypes(include=["int64", "float64"]).copy()

# Compute correlation matrix
corr = num_df.corr()

# Plot
fig, ax = plt.subplots(figsize=(10, 8))
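# Fix the color scale to [-1, 1] with a diverging colormap so that 0 maps to a neutral color
im = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)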

ax.set_title("Correlation Matrix (Numerical Features)")

ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)

# Colorbar
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

savefig("correlation_matrix")
plt.show()
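
A heatmap is hard to read for exact values, so it can be useful to list the feature pairs ranked by absolute correlation. The sketch below does this on a small hypothetical frame `toy`; applied to the real data, `toy` would be replaced by the numerical columns of `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical columns standing in for the real data.
toy = pd.DataFrame({
    "person_age": [22, 21, 25, 23, 24, 30],
    "cb_person_cred_hist_length": [3, 2, 3, 2, 4, 8],
    "loan_amnt": [35000, 1000, 5500, 35000, 35000, 12000],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per feature pair, sorted by absolute correlation.
pairs = upper.stack().dropna().rename("corr")
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
```

Pairs at the top of this list are the first candidates to check when assessing multicollinearity.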


## 7. Key Findings


Dataset & Target:
The dataset contains 32,581 observations and 12 columns (11 features plus the target). The target variable loan_status is moderately imbalanced, with approximately 78% non-defaults and 22% defaults, which should be considered in model evaluation.

Data Quality:
Ten of the 12 columns are complete. Missing values are confined to loan_int_rate (~9.6%) and person_emp_length (~2.7%), which favors targeted imputation over dropping rows.

Interest Rate & Default Risk:
Loans that default (loan_status = 1) have a mean interest rate of about 13.1%, compared with about 10.4% for non-defaulting loans. This gap of roughly 2.6 percentage points suggests that interest rate is a useful indicator of credit risk.

Borrower Characteristics:
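Income ranges from 4,000 to 6,000,000 with a median of 55,000, and employment length has a median of 4 years, so both vary widely across borrowers and are plausible inputs for downstream modeling. The maxima of person_age (144) and person_emp_length (123) are implausible and should be reviewed before modeling.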

Correlation Structure:
Correlations among numerical features are generally moderate, indicating limited multicollinearity and suitability for standard classification models without aggressive feature elimination.