# Diploma thesis
_Credit Scoring using Machine Learning Models and SHAP explainability – Loizidis Vasileios_  
_Department of Applied Mathematics and Physical Sciences, National Technical University of Athens_  
_Supervisor: Petros Stefaneas_  
_Date: November 2025_

---

### Overview

This notebook presents the implementation of Machine Learning Models applied to a credit scoring dataset.  
The goal is to predict the probability of default (binary outcome: default / no default) based on customer financial and demographic features.

The workflow includes:
- Data loading and initial inspection  
- Exploratory Data Analysis (EDA)  
- Preprocessing (missing values, scaling)  
- Model training and evaluation (Logistic Regression, SVM, Random Forest, XGBoost, Neural Networks)  
- Performance comparison and visualization  
- SHAP explainability

---

### Tools and Libraries
- Python (Google Colab)  
- pandas, numpy, matplotlib, seaborn  
- scikit-learn, imbalanced-learn (SMOTE)  



***Data Loading and Overview***

In this section I mount Google Drive in Colab, load the dataset (cs-training.csv), and inspect its shape, column types, and missing values.





In [5]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:
import pandas as pd
df = pd.read_csv("cs-training.csv")
df.head()

In [None]:
import pandas as pd
df = pd.read_csv("cs-training.csv", encoding="latin1")
df.head()

In [None]:
# Drop the unnamed index column from the CSV, if it is present
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

In [None]:
# Per-column summary: type, non-null / missing / distinct counts, and value range
table = pd.DataFrame({
    "Data type": df.dtypes,
    "Non-null count": df.count(),
    "Missing count": df.isna().sum(),
    "Distinct count": df.nunique(),
    "Minimum value": df.min(numeric_only=True),
    "Maximum value": df.max(numeric_only=True)
})
table


In [None]:
latex_code = table.to_latex(index=True)
print(latex_code)


In [None]:
df.describe().T

**Exploratory Data Analysis (EDA)**

I examine:
- Summary statistics (mean, median, min, max)
- Percentage of missing values
- Basic variable distributions and relationships

In [None]:
missing_pct = (df.isna().sum() / len(df) * 100).sort_values(ascending=False)
missing_pct

In [None]:
df["SeriousDlqin2yrs"].value_counts(normalize=True) * 100

In [None]:
# Print each result explicitly: a notebook cell only displays its last expression
print(df["age"].describe())
print(df["age"].sort_values().head(10))
print(df["age"].sort_values(ascending=False).head(10))

In [None]:
print(df["DebtRatio"].describe())
print(df["DebtRatio"].sort_values(ascending=False).head(10))

In [None]:
print(df["RevolvingUtilizationOfUnsecuredLines"].describe())
print(df["RevolvingUtilizationOfUnsecuredLines"].sort_values(ascending=False).head(10))

In [None]:
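import numpy as np

# Cap extreme values at the 99th percentile of each column
# (the percentile is computed on the full dataset, before the train-test split)
for col in ["DebtRatio", "RevolvingUtilizationOfUnsecuredLines"]:
    upper = df[col].quantile(0.99)
    df[col] = np.where(df[col] > upper, upper, df[col])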

In [None]:
df[["DebtRatio", "RevolvingUtilizationOfUnsecuredLines"]].describe()
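
The capping step above replaces every value above a column's 99th percentile with that percentile, which limits the influence of extreme values. The following is a minimal, standard-library-only sketch of the same idea on toy data. The helper names "quantile_linear" and "cap_upper" and the toy values are hypothetical, the data are assumed to contain no missing values, and linear interpolation between order statistics is assumed (the default of the pandas "quantile" method).

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    data = sorted(values)
    pos = q * (len(data) - 1)            # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return data[lo] + frac * (data[hi] - data[lo])

def cap_upper(values, q=0.99):
    """Replace every value above the q-th quantile with that quantile."""
    upper = quantile_linear(values, q)
    return [min(v, upper) for v in values], upper

# Toy data (hypothetical): 100 ordinary values plus one extreme outlier.
toy = [float(v) for v in range(1, 101)] + [10_000.0]
capped, upper = cap_upper(toy)
print(upper, max(capped))
```

Only the outlier is changed here; values at or below the threshold pass through unmodified.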

**Train-Test Split**

I split the data into train (80%) and test (20%) sets using "stratify=y" to preserve the original class distribution.

In [None]:
from sklearn.model_selection import train_test_split


X = df.drop(columns=["SeriousDlqin2yrs"])
y = df["SeriousDlqin2yrs"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_train), len(X_test), len(y_train), len(y_test))


In [None]:
X_train.isna().sum().sort_values(ascending=False).head(10)
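
To illustrate what stratification does, here is a small standard-library sketch that splits each class separately in the same proportion, so that the default rate is the same in both parts. It is not scikit-learn's implementation. The function name "stratified_split" and the 7% toy default rate are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (hypothetical): 70 positives and 930 negatives.
labels = [1] * 70 + [0] * 930
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Without stratification, a purely random 20% sample of a rare class can drift noticeably from the overall rate, which makes test metrics harder to compare across runs.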


**Handling Missing Values**

I impute:
- "MonthlyIncome" missing values with the median of the training set
- "NumberOfDependents" missing values with the mode of the training set

These statistics are computed only on the training data to prevent information leakage.

In [None]:

median_income = X_train["MonthlyIncome"].median()
mode_dep_series = X_train["NumberOfDependents"].mode(dropna=True)
mode_dependents = int(mode_dep_series.iloc[0]) if len(mode_dep_series) > 0 else 0


X_train = X_train.copy()
X_test  = X_test.copy()

X_train["MonthlyIncome"] = X_train["MonthlyIncome"].fillna(median_income)
X_test["MonthlyIncome"]  = X_test["MonthlyIncome"].fillna(median_income)

X_train["NumberOfDependents"] = X_train["NumberOfDependents"].fillna(mode_dependents)
X_test["NumberOfDependents"]  = X_test["NumberOfDependents"].fillna(mode_dependents)


X_train.isna().sum().sum(), X_test.isna().sum().sum()
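
The leakage argument can be made concrete with a toy example: the fill value is learned from the training rows only and then reused for the test rows. This is a sketch with hypothetical helper names ("fit_median", "fill_missing") and made-up incomes, using None to mark a missing entry.

```python
from statistics import median

def fit_median(values):
    """Median of the observed (non-missing) values; None marks a missing entry."""
    return median(v for v in values if v is not None)

def fill_missing(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]

# Toy incomes (hypothetical values).
train_income = [3000, None, 5000, 7000]
test_income = [None, 100_000]

fill = fit_median(train_income)                 # learned from the training split only
train_filled = fill_missing(train_income, fill)
test_filled = fill_missing(test_income, fill)   # test rows reuse the training median

# For contrast: a median that also "sees" the test rows is a different number.
leaky_fill = fit_median(train_income + test_income)
print(fill, leaky_fill, test_filled)
```

The difference between "fill" and "leaky_fill" is the information that would otherwise flow from the test set into preprocessing.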


**Feature Scaling**

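I apply "StandardScaler" to standardize the numeric features so that, on the training set, each has mean 0 and standard deviation 1. The scaler is fitted on the training data only and then applied unchanged to the test data. Standardization typically helps scale-sensitive models such as Logistic Regression, SVM and Neural Networks converge faster and behave more stably.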

In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled_df  = pd.DataFrame(X_test_scaled,  columns=X_test.columns,  index=X_test.index)

len(X_train_scaled_df), len(X_test_scaled_df), len(y_train), len(y_test)
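
As a worked illustration of the transformation z = (x - mean) / std, the sketch below learns the mean and the population standard deviation from toy training values and reuses them for the test values. The helper names and numbers are hypothetical. The population standard deviation is used because that is what "StandardScaler" computes.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from the training values."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Toy feature values (hypothetical).
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)      # reuses the training statistics

print(fmean(train_z), pstdev(train_z))     # approximately 0 and 1
print(test_z)
```

Only the training values end up with mean 0 and standard deviation 1. The test values are expressed on the same scale but generally have a different mean and spread.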


**Training the Baseline Logistic Regression Model**

I train a simple Logistic Regression model on the standardized training data to serve as the baseline.

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled_df, y_train)


In [None]:
y_pred  = log_reg.predict(X_test_scaled_df)
y_proba = log_reg.predict_proba(X_test_scaled_df)[:, 1]
len(y_pred), len(y_test)


**Evaluation of the Baseline Model**

I compute key performance metrics:
- Accuracy, Precision, Recall, F1-score, ROC-AUC, Confusion Matrix

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt

acc  = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec  = recall_score(y_test, y_pred)
f1   = f1_score(y_test, y_pred)
roc  = roc_auc_score(y_test, y_proba)

print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC AUC:   {roc:.4f}")

cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)
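
The threshold-based metrics above can all be read off the confusion matrix, which scikit-learn lays out as [[TN, FP], [FN, TP]] for a binary problem. The sketch below recomputes them from hypothetical counts (not results from this notebook). The function name "metrics_from_confusion" is made up, and returning 0.0 for an undefined ratio is a choice made for the example.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted defaults, how many are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of real defaults, how many are found
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for an imbalanced problem, laid out as [[TN, FP], [FN, TP]].
cm_example = [[930, 5], [60, 5]]
(tn, fp), (fn, tp) = cm_example
m = metrics_from_confusion(tn, fp, fn, tp)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is high while recall is low. This is the typical pattern when a classifier mostly predicts the majority class, and it is why accuracy alone is a weak summary for imbalanced data.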


**Handling Class Imbalance with SMOTE**

I apply SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes in the training set only. I then retrain the Logistic Regression model on the resampled data.

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_scaled_df, y_train)

print("Before SMOTE:", y_train.value_counts(normalize=True))
print("After SMOTE:", y_train_res.value_counts(normalize=True))


In [None]:
log_reg_smote = LogisticRegression(max_iter=1000, random_state=42)
log_reg_smote.fit(X_train_res, y_train_res)
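
The core of SMOTE is linear interpolation: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. The following is a simplified standard-library sketch of that step, not the imbalanced-learn implementation. The function name "smote_sketch", the toy points and k=2 are assumptions chosen for the example (the library default is 5 neighbours).

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()   # position along the segment base -> neighbour
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class points in two dimensions (hypothetical).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two existing minority points, resampling must happen after the train-test split, as done above. Otherwise, interpolated copies of test observations could end up in the training data.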


**Evaluation after SMOTE**

I evaluate the model trained on balanced data against the original test set and compare results with the baseline model.

In [None]:
y_pred_smote  = log_reg_smote.predict(X_test_scaled_df)
y_proba_smote = log_reg_smote.predict_proba(X_test_scaled_df)[:, 1]

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

acc_s  = accuracy_score(y_test, y_pred_smote)
prec_s = precision_score(y_test, y_pred_smote)
rec_s  = recall_score(y_test, y_pred_smote)
f1_s   = f1_score(y_test, y_pred_smote)
roc_s  = roc_auc_score(y_test, y_proba_smote)

print(f"Accuracy:  {acc_s:.4f}")
print(f"Precision: {prec_s:.4f}")
print(f"Recall:    {rec_s:.4f}")
print(f"F1-score:  {f1_s:.4f}")
print(f"ROC AUC:   {roc_s:.4f}")
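
Unlike the other four metrics, ROC AUC is computed from the predicted probabilities rather than from thresholded labels. One equivalent reading is the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as one half. The sketch below implements that pairwise definition on toy data. The function name "roc_auc_pairwise" and the values are hypothetical, and the quadratic loop is meant for illustration only.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical).
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc_pairwise(y_true_toy, scores_toy)
print(auc_toy)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Since only the ranking of the scores matters, ROC AUC can change in a different direction from accuracy or precision, which depend on the 0.5 decision threshold used by "predict".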


**Visualization and Model Comparison**

I plot confusion matrices and summarize the key metrics of both models.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred_smote)
sns.heatmap(cm, annot=True, fmt="d", cmap="Greens")
plt.xlabel("Predicted value")
plt.ylabel("Actual value")
plt.title("Confusion Matrix - Logistic Regression (with SMOTE)")
plt.show()


In [None]:
import pandas as pd

# Metric values entered by hand, rounded from the evaluation outputs above
results = {
    "Model": ["Logistic Regression (without SMOTE)", "Logistic Regression (with SMOTE)"],
    "Accuracy": [0.93, 0.75],
    "Precision": [0.58, 0.16],
    "Recall": [0.044, 0.68],
    "F1-score": [0.08, 0.27],
    "ROC-AUC": [0.71, 0.79]
}

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))


In [None]:
y_pred_smote = log_reg_smote.predict(X_test_scaled_df)
cm_smote = confusion_matrix(y_test, y_pred_smote)

plt.figure(figsize=(5,4))
ax = sns.heatmap(cm_smote, annot=True, fmt="d", cmap="Greens")
plt.xlabel("Predicted value")
plt.ylabel("Actual value")
plt.title("Confusion Matrix - Logistic Regression (with SMOTE)")
plt.tight_layout()


plt.savefig("confusion_matrix_logistic_smote.png", dpi=300, bbox_inches="tight")
plt.show()
plt.close()
