### Client credit card default payment prediction using different supervised classification algorithms

####  About Default Payment 

Credit default occurs when a borrower fails to repay a debt obligation, leading to financial losses for lenders. Predicting defaults using machine learning helps banks and financial institutions assess risk and minimize potential losses.

#### Problem Statement

The objective is to develop a model that predicts, from a risk-management perspective, whether a client will default on next month's credit card payment. The dataset consists of several credit-history predictor (independent) variables and one target variable, default payment next month.

#### Description
The "Default of Credit Card Clients" dataset from UCI contains 30,000 records of Taiwanese credit card users and is used to predict payment defaults. It includes 23 features.

Key columns: LIMIT_BAL (credit limit), SEX, EDUCATION, MARRIAGE, AGE, PAY_0 and PAY_2-PAY_6 (repayment status), BILL_AMT1-BILL_AMT6 (bill statement amounts), and PAY_AMT1-PAY_AMT6 (previous payment amounts). The target variable is `default payment next month` (binary: 1 = default, 0 = no default).

The dataset is imbalanced, with 6,636 defaults (22.1%) and 23,364 non-defaults (77.9%), so accuracy alone can be misleading and metrics such as recall and ROC AUC should be read alongside it.
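
To make the imbalance concrete, the short sketch below uses only the class counts quoted above to compute the accuracy of a trivial classifier that always predicts "no default" on the full dataset. The variable names are illustrative.

```python
# Class counts quoted in the dataset description above.
n_default = 6636
n_no_default = 23364
n_total = n_default + n_no_default

default_rate = n_default / n_total
# A model that always predicts "no default" is right on every non-default row.
baseline_accuracy = n_no_default / n_total

print(f"Default rate: {default_rate:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any model in this notebook should be judged against that baseline rather than against 50%.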

##### Prediction requires a dataset; this notebook uses one downloaded from the UCI Machine Learning Repository

Dataset link: https://archive.ics.uci.edu/static/public/350/default+of+credit+card+clients.zip

#### Step 1: Import All Required Python Packages

In [None]:
import zipfile
import requests
# Python libraries for numerical computing, data handling, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Feature scaling
from sklearn.preprocessing import StandardScaler
# PCA
from sklearn.decomposition import PCA
# Train/test split
from sklearn.model_selection import train_test_split
# Confusion matrix, ROC AUC score and curve, and classification report
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
# Logistic regression model
from sklearn.linear_model import LogisticRegression
# Accuracy score
from sklearn.metrics import accuracy_score
# K-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
# Ensemble classifier (random forest)
from sklearn.ensemble import RandomForestClassifier


#### Step 2: Download the dataset from its URL

In [None]:
# Direct link to download the dataset from the UCI repository
url = 'https://archive.ics.uci.edu/static/public/350/default+of+credit+card+clients.zip'
dataset_zip = "default_of_credit_card_clients.zip"

# Download the zip file and stop early if the request failed
response = requests.get(url, timeout=60)
response.raise_for_status()
with open(dataset_zip, "wb") as file:
    file.write(response.content)

# Extract the archive
with zipfile.ZipFile(dataset_zip, 'r') as zip_ref:
    zip_ref.extractall("default_of_credit_card_clients")

# Path to the extracted Excel file
extract_folder = "default_of_credit_card_clients/default of credit card clients.xls"
print(extract_folder)

In [None]:
# Convert the Excel file to CSV
# Path to the XLS file
xls_file = "default_of_credit_card_clients/default of credit card clients.xls"
# Load the XLS file into a DataFrame
df = pd.read_excel(xls_file, header=1)  # Skip the first row with general info
# Output CSV file path
csv_file = "default_of_credit_card_clients/default_of_credit_card_clients.csv"
# Save the DataFrame to CSV
df.to_csv(csv_file, index=False)
# Load the CSV dataset
data = pd.read_csv(csv_file)

#### Step 3: Exploratory Data Analysis

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.tail(n=5)

In [None]:
print('Statistical view of data:\n')
data.describe()

In [None]:
print(data.info())

#### Step 4: Data Cleaning & Preprocessing

In [None]:
# Rename the repayment-status columns by month and shorten the target name
data.rename(columns={
    "PAY_0": "PAY_SEP", "PAY_2": "PAY_AUG", "PAY_3": "PAY_JUL",
    "PAY_4": "PAY_JUN", "PAY_5": "PAY_MAY", "PAY_6": "PAY_APR",
    "default payment next month": "DEFAULT"
}, inplace=True)

# Recode EDUCATION: fold codes 0, 5 and 6 into 4, then map codes to readable labels
data['EDUCATION'] = data['EDUCATION'].replace({0: 4, 5: 4, 6: 4})
data['EDUCATION'] = data['EDUCATION'].map({
    1: 'Graduate School',
    2: 'University',
    3: 'High School',
    4: 'Others'
})

# Recode MARRIAGE categories
data['MARRIAGE'] = data['MARRIAGE'].replace({0: 3})
data['MARRIAGE'] = data['MARRIAGE'].map({
    1: 'Married',
    2: 'Single',
    3: 'Others'
})
data.head(20)
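
The order of `replace` and `map` in the cell above matters: `Series.map` with a dictionary returns NaN for any value that has no key. The toy sketch below (a hypothetical seven-row column, not the real data) shows what happens with and without the preliminary `replace`.

```python
import pandas as pd

# Toy column containing every EDUCATION code handled in the recoding above.
education = pd.Series([0, 1, 2, 3, 4, 5, 6])
labels = {1: 'Graduate School', 2: 'University', 3: 'High School', 4: 'Others'}

# Mapping directly leaves codes without a label (0, 5, 6) as NaN.
direct = education.map(labels)

# Folding those codes into 4 first gives every row a label.
recoded = education.replace({0: 4, 5: 4, 6: 4}).map(labels)

print("NaN count without replace:", direct.isna().sum())
print("NaN count with replace:", recoded.isna().sum())
```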

In [None]:
print(data.columns)
# Check for duplicates
print("Number of duplicate rows:", data.duplicated().sum())

In [None]:
# Plot the target variable
# 1 means the client failed to make the payment (default)
# 0 means the client paid successfully
plt.figure(figsize=(4, 4))
sns.countplot(x='DEFAULT', data=data, palette='Set1', hue='DEFAULT')
plt.title('Distribution of Default (Target) Variable')
plt.xlabel('Default (0 = No, 1 = Yes)')
plt.ylabel('Number of Clients')
plt.savefig('./imbalanced_plot.svg')
plt.show()


In [None]:
# Drop the identifier and the demographic columns that are not used as model inputs
clean_data = data.drop(['ID', 'SEX', 'EDUCATION', 'MARRIAGE'], axis=1)
print(clean_data.isnull().sum())  # Should return 0 for all columns
clean_data.head()

In [None]:
# Plot the correlation matrix
plt.figure(figsize=(12, 8))
corr_matrix = clean_data.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="viridis", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [None]:
pair_plot = sns.pairplot(clean_data[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4',
                                     'BILL_AMT5','BILL_AMT6','DEFAULT']], palette='Set1',
                         hue='DEFAULT', diag_kind='kde', corner=True)
plt.show()

In [None]:
# pairplot creates its own figure, so the title is set on that figure
pair_plot = sns.pairplot(clean_data[['LIMIT_BAL', 'DEFAULT']],
                         hue='DEFAULT', palette='Set1', diag_kind='kde', corner=True)
pair_plot.figure.suptitle('Density plot of LIMIT_BAL by default type', y=1.02)
plt.show()

#### Step 5: Feature Selection and Feature Extraction

In [None]:

# Separate features and target
target_value = 'DEFAULT'
X = clean_data.drop(target_value, axis=1)
y = clean_data[target_value]

# Standardize the features (PCA is scale-sensitive; the target column is excluded)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


#### Step 6: Applying PCA to reduce the dimensionality of the data

In [None]:

pca = PCA(n_components=0.95)  # Retain 95% of the variance

X_pca = pca.fit_transform(X_scaled)

print(f"Original number of features: {X_scaled.shape[1]}")
# Check the number of components
print(f"Number of components: {pca.n_components_}")


In [None]:
# Explained Variance Ratio
explained_variance = pca.explained_variance_ratio_
print("\nExplained Variance Ratio by Each Principal Component:")
print(explained_variance *100)

In [None]:
# Explained variance ratio of each principal component
plt.figure(figsize=(6, 4))
sns.barplot(x=np.arange(1, len(explained_variance) + 1), y=explained_variance)
plt.title("Principal Component Analysis")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()

In [None]:
# Cumulative Explained Variance
cumulative_explained_variance = np.cumsum(explained_variance)
print("\nCumulative Explained Variance:")
print(cumulative_explained_variance)
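
Passing a fraction such as `n_components=0.95` asks PCA to keep the smallest number of components whose cumulative explained variance passes that fraction. The sketch below reproduces that selection rule by hand on hypothetical variance ratios (the numbers are made up for illustration, not taken from this dataset).

```python
import numpy as np

# Hypothetical explained-variance ratios, in decreasing order as PCA reports them.
ratios = np.array([0.50, 0.30, 0.10, 0.06, 0.04])
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative variance passes the 95% target.
target = 0.95
n_keep = int(np.searchsorted(cumulative, target, side='right') + 1)

print("Cumulative variance:", cumulative)
print("Components kept:", n_keep)
```

Here the first three components reach only about 90%, so a fourth is needed to pass the target.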


In [None]:
# Plot Cumulative Explained Variance
plt.figure(figsize=(6, 4))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance by Principal Components')
plt.grid()
plt.show()

In [None]:
# Split the PCA-transformed data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

print('X_train shape :', X_train.shape)
print('y_train shape :', y_train.shape)
print('X_test shape :', X_test.shape)
print('y_test shape :', y_test.shape)
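
The cells above fit the scaler and PCA on all rows before splitting, so the test rows influence the transformation. One common alternative, sketched here on synthetic data, is to split first and wrap the preprocessing and the model in a scikit-learn `Pipeline` so that both are fit on the training rows only. This is not the notebook's method: `make_classification` stands in for the credit data, the `*_demo` and `Xd_*` names are illustrative, and `stratify=y_demo` is an added assumption that keeps the class ratio similar in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=8,
                                     weights=[0.78, 0.22], random_state=42)

# Split first; stratify keeps the class ratio similar in train and test.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Scaler and PCA are fit on the training rows only when the pipeline is fit.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                     LogisticRegression(solver='liblinear'))
pipe.fit(Xd_train, yd_train)

print("Components kept:", pipe.named_steps['pca'].n_components_)
print("Test accuracy:", pipe.score(Xd_test, yd_test))
```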

#### Step 7: Model Building
##### 1. Logistic Regression

In [None]:
# The target is binary, so no multi-class strategy needs to be specified
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train , y_train)

In [None]:
#logistic regression prediction making
lr_predict = lr.predict(X_test)
# Probabilities for ROC curve
lr_pred_proba = lr.predict_proba(X_test)[:, 1] 


In [None]:
# Seaborn heatmap of the confusion matrix (logistic regression with PCA)
plt.figure(figsize=(3, 2))
sns.heatmap(confusion_matrix(y_test, lr_predict), annot=True, fmt="d", 
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Logistic Regression with PCA')
plt.show()

In [None]:
print("Accuracy of Logistic Regression (%):", accuracy_score(y_test, lr_predict)*100)

In [None]:
print("Classification Report of Logistic Regression: \n", classification_report(y_test, lr_predict,digits=4))
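
The figures in a classification report can be derived directly from the four cells of the confusion matrix. The sketch below does this by hand for the "default" class, using hypothetical counts (they are not results from this notebook).

```python
# Hypothetical confusion-matrix counts (not results from this notebook).
tn, fp, fn, tp = 850, 50, 60, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted defaults, how many were real
recall = tp / (tp + fn)      # of the real defaults, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts the accuracy is high while recall for defaults is low, which is the pattern to watch for on an imbalanced target.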

In [None]:
# AUC-ROC Score
auc_score = roc_auc_score(y_test, lr_pred_proba)
print(f"\nAUC-ROC Score: {auc_score:.4f}")

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, lr_pred_proba)
plt.figure(figsize=(5, 3))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Logistic Regression with PCA')
plt.legend(loc='lower right')
plt.show()
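
The ROC AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it that way by counting pairs on toy labels and scores (hypothetical values, not model output from this notebook).

```python
# Toy labels and scores (hypothetical, not model output from this notebook).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Count positive/negative pairs where the positive is ranked higher; ties count half.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))

print("Pairwise AUC:", auc)
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.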

#### 2. K-Nearest Neighbors Classifier

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)


In [None]:
# Make predictions on the test set using KNN
knn_pred = knn.predict(X_test)
knn_pred_proba = knn.predict_proba(X_test)[:, 1]  # Probabilities for ROC curve

In [None]:
# Seaborn heatmap to visualize the confusion matrix
plt.figure(figsize=(3, 2))
sns.heatmap(confusion_matrix(y_test, knn_pred), annot=True, fmt="d", 
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix KNN with PCA')
plt.show()

In [None]:
# Accuracy
print("Accuracy of KNN (%):", accuracy_score(y_test, knn_pred)*100)

In [None]:
print("Classification Report of K-Nearest Neighbors: \n", classification_report(y_test, knn_pred,digits=4))

In [None]:
# AUC-ROC Score
auc_score = roc_auc_score(y_test, knn_pred_proba)
print(f"\nAUC-ROC with PCA Score: {auc_score:.4f}")

In [None]:
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, knn_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f'KNN (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (KNN with PCA)')
plt.legend(loc='lower right')
plt.show()

#### 3. Naive Bayes Classifier

In [None]:
nb = GaussianNB()
nb.fit(X_train, y_train)


In [None]:
# Make predictions on the test set using Gaussian Naive Bayes
nb_pred = nb.predict(X_test)
nb_pred_proba = nb.predict_proba(X_test)[:, 1]  # Probabilities for ROC curve

In [None]:
# Seaborn heatmap to visualize the confusion matrix
plt.figure(figsize=(3, 2))
sns.heatmap(confusion_matrix(y_test, nb_pred), annot=True, fmt="d", 
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Naive Bayes with PCA)')
plt.show()

In [None]:
# Accuracy
print("Accuracy of Naive Bayes (%):", accuracy_score(y_test, nb_pred)*100)

In [None]:
print("Classification Report of Naive Bayes: \n", classification_report(y_test, nb_pred,digits=4))

In [None]:
# AUC-ROC Score
auc_score = roc_auc_score(y_test, nb_pred_proba)
print(f"\nAUC-ROC Score with PCA: {auc_score:.4f}")

In [None]:
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, nb_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f'Naive Bayes (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (Naive Bayes with PCA)')
plt.legend(loc='lower right')
plt.show()

#### 4. Random Forest Classifier

In [None]:
RFC = RandomForestClassifier(n_estimators=100, random_state=42)
RFC.fit(X_train, y_train)

In [None]:
rfc_pred = RFC.predict(X_test)
rfc_pred_proba = RFC.predict_proba(X_test)[:, 1]  # Probabilities for ROC curve

In [None]:
# Seaborn heatmap to visualize the confusion matrix
plt.figure(figsize=(3, 2))
sns.heatmap(confusion_matrix(y_test, rfc_pred), annot=True, fmt="d",
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Random Forest Classifier with PCA)')
plt.show()

In [None]:
#Accuracy Score
print("Accuracy score of Random Forest (with PCA):", accuracy_score(y_test, rfc_pred)*100) #Actual Prediction

In [None]:
print("Classification Report Random Forest : \n", classification_report(y_test, rfc_pred,digits=4))

In [None]:
# AUC-ROC Score
auc_score = roc_auc_score(y_test, rfc_pred_proba)
print(f"\nAUC-ROC Score with PCA: {auc_score:.4f}")

In [None]:
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, rfc_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f'Random forest (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (Random forest classifier with PCA)')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, rfc_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f'Random Forest with PCA (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (Random Forest with PCA)')
plt.legend(loc='lower right')
plt.show()


In [None]:
Compare = pd.DataFrame(
    {'Model': ["Logistic Regression", "K-Nearest Neighbors", "Gaussian Naive Bayes", "Random Forest Classifier"],
     'Accuracy': [accuracy_score(y_test, lr_predict)*100, accuracy_score(y_test, knn_pred)*100,
                  accuracy_score(y_test, nb_pred)*100, accuracy_score(y_test, rfc_pred)*100]
    })
# Compare the models, with the most accurate one listed first
Compare.sort_values(by='Accuracy', ascending=False)
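
Because the target is imbalanced, ranking models by accuracy alone can hide poor detection of defaults. The sketch below shows one possible way to build a comparison table that also reports recall, F1 and ROC AUC for the positive class. It runs on synthetic data and only two of the four model types; the `*_demo`, `Xd_*`, `models` and `summary` names are illustrative and are not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (not the credit dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.78, 0.22], random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

models = {
    "Logistic Regression": LogisticRegression(solver='liblinear'),
    "Gaussian Naive Bayes": GaussianNB(),
}

rows = []
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)
    proba = model.predict_proba(Xd_test)[:, 1]  # score for the positive class
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(yd_test, pred) * 100,
        "Recall (default)": recall_score(yd_test, pred),
        "F1 (default)": f1_score(yd_test, pred),
        "ROC AUC": roc_auc_score(yd_test, proba),
    })

# Rank by ROC AUC, which does not depend on a single decision threshold.
summary = pd.DataFrame(rows).sort_values(by="ROC AUC", ascending=False)
print(summary)
```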