# Credit Card Approval Prediction: A Business and Data Analysis Report

## Introduction

In the financial services industry, understanding and predicting customer behavior is crucial for risk management and decision-making. This project analyzes the creditworthiness of clients using historical application and credit record data. The objective is to develop predictive models that classify clients as 'good' or 'bad' credit risks. 'Good' credit risks are clients who are likely to meet their credit obligations on time, while 'bad' credit risks are clients who are likely to default or have overdue payments. This classification helps financial institutions minimize defaults while optimizing credit allocation.

## Data Exploration

Before diving into model building, it's essential to understand the data at hand. The datasets used in this project include `application_record.csv`, which contains demographic and financial details of clients, and `credit_record.csv`, which tracks their credit history over time.

In [1]:
# Import necessary libraries
import pandas as pd

# Load datasets
application_df = pd.read_csv('./data/raw/application_record.csv')
credit_df = pd.read_csv('./data/raw/credit_record.csv')

# Display first few rows of application data
print(application_df.head())

# Display first few rows of credit data
print(credit_df.head())

# Basic information about datasets
print(application_df.info())
print(credit_df.info())

# Check if 'TARGET' column is in the dataset (you may need to add or rename it)
print("Columns in application_df:", application_df.columns)
print("Columns in credit_df:", credit_df.columns)



The `application_df` dataset includes key attributes such as gender, income, employment status, and the number of dependents, which are pivotal in determining the client's ability to repay loans. The `credit_df` dataset, in turn, captures the client's payment history, which directly reflects their credit behavior. By merging these datasets, we can create a comprehensive view of each client that includes both their capacity to repay (from the application data) and their willingness to repay (from the credit data). The `OCCUPATION_TYPE` column has a significant number of missing values, which must be handled during preprocessing. These missing values could be filled based on other correlated variables or imputed using statistical techniques.
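
One simple way to handle the missing `OCCUPATION_TYPE` values is sketched below on a small made-up frame: give them an explicit placeholder category, so that "missing" stays distinguishable after encoding. The `'Unknown'` label and the sample rows are assumptions for illustration and are not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data comes from application_record.csv
sample_df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "OCCUPATION_TYPE": ["Managers", np.nan, "Drivers", np.nan],
})

# Share of missing values per column, useful before choosing a strategy
missing_share = sample_df.isna().mean()

# Replace missing occupations with an explicit placeholder category
# ('Unknown' is an illustrative label, not a value from the dataset)
filled_df = sample_df.assign(
    OCCUPATION_TYPE=sample_df["OCCUPATION_TYPE"].fillna("Unknown")
)

print(missing_share)
print(filled_df["OCCUPATION_TYPE"].value_counts())
```

Whether a placeholder category or a model-based imputation is the better choice depends on how informative the missingness itself turns out to be.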


Understanding client profiles through this data is critical for financial institutions. For instance, clients with high income but poor credit history might be considered higher risk than those with moderate income and a spotless payment history. Merging these datasets therefore allows for more nuanced decision-making, enabling the institution to balance potential revenue (from interest payments) against risk (from defaults).

## Data Preprocessing
To prepare the data for modeling, it must be cleaned and transformed. This involves handling missing values, encoding categorical variables, and creating new features that might enhance model performance. In the code below, numeric gaps are filled with column medians. `OCCUPATION_TYPE` is categorical, so its many missing entries are not touched by the median fill; they become all-zero rows in the one-hot encoding, which (with `drop_first=True`) makes them indistinguishable from the dropped first category. `STATUS` values of 2, 3, 4, and 5 indicate clients with increasingly severe overdue payments, making them higher risk for default.

In [4]:
# Import necessary libraries
import pandas as pd

# Function to load data
def load_data():
    application_df = pd.read_csv('./data/raw/application_record.csv')
    credit_df = pd.read_csv('./data/raw/credit_record.csv')
    return application_df, credit_df

# Function to preprocess data
def preprocess_data(application_df, credit_df):
    # Handle missing values
    application_df.fillna(application_df.median(numeric_only=True), inplace=True)
    
    # Convert categorical columns to numeric using one-hot encoding
    categorical_cols = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 
                        'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 
                        'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE']
    
    application_df = pd.get_dummies(application_df, columns=categorical_cols, drop_first=True)

    # Create 'TARGET' column: 1 if STATUS indicates overdue payments, else 0
    credit_df['TARGET'] = credit_df['STATUS'].apply(lambda x: 1 if x in ['2', '3', '4', '5'] else 0)
    
    # Aggregate 'TARGET' to have one row per ID
    credit_agg = credit_df.groupby('ID')['TARGET'].max().reset_index()
    
    # Merge datasets on ID column
    merged_data = pd.merge(application_df, credit_agg, on='ID', how='inner')
    
    return merged_data

# Load and preprocess data
application_df, credit_df = load_data()
merged_data = preprocess_data(application_df, credit_df)

# Save preprocessed data
merged_data.to_csv('./data/processed/merged_data.csv', index=False)

# Display first few rows of merged data
print(merged_data.head())

# Check the columns to ensure 'TARGET' is present
print("Columns in merged_data:", merged_data.columns)



Data preprocessing is a critical step to ensure that the data fed into the machine learning models is clean and well-structured. For example, categorical variables like `CODE_GENDER` and `FLAG_OWN_CAR` are converted into numerical values through one-hot encoding. Additionally, the `TARGET` variable is created to indicate whether a client has had overdue payments, and it serves as the dependent variable in our predictive models. `STATUS` values 2, 3, 4, and 5 represent overdue payments of increasing severity, so they were selected to identify clients at higher risk of default. Because the monthly flags are aggregated with `max()` per `ID`, a client is labelled 1 if any month in their history carries one of these values.
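
The labelling and aggregation steps described above can be sketched on a handful of made-up monthly records. The IDs and `STATUS` sequences below are hypothetical; only the overdue codes and the `max()` aggregation come from the preprocessing code in this report.

```python
import pandas as pd

# Hypothetical monthly credit records; STATUS is stored as strings
credit_sample = pd.DataFrame({
    "ID": [101, 101, 101, 102, 102, 103],
    "STATUS": ["0", "2", "C", "X", "0", "5"],
})

# Flag each month whose STATUS is one of the overdue codes used in this report
overdue_codes = ["2", "3", "4", "5"]
credit_sample["TARGET"] = credit_sample["STATUS"].isin(overdue_codes).astype(int)

# One row per client: TARGET is 1 if any month was flagged
target_per_client = credit_sample.groupby("ID")["TARGET"].max().reset_index()
print(target_per_client)
```

Client 101 is labelled 1 because of a single flagged month, which shows how strict the `max()` rule is: one severe delinquency marks the whole history.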

Proper data preprocessing not only improves the accuracy of predictive models but also ensures that business decisions based on these models are reliable. For instance, by carefully handling missing values and encoding variables correctly, we reduce the risk of making erroneous predictions that could lead to either rejecting a creditworthy client or approving a high-risk loan.

## Feature Engineering
Feature engineering involves creating new features that might enhance the predictive power of the model. For example, calculating the Income_Per_Family_Member provides insights into the financial burden on the client, which could be a significant predictor of their creditworthiness. Lenders often assess a client’s disposable income after accounting for dependents. By calculating Income_Per_Family_Member, we account for the financial strain that dependents place on a client's income, which provides a more accurate measure of their financial stability and repayment capacity.

In [3]:
# Import necessary libraries
import pandas as pd

# Load preprocessed data
merged_data = pd.read_csv('./data/processed/merged_data.csv')

# Function for feature engineering
def feature_engineering(merged_data):
    # Feature engineering example
    merged_data['Income_Per_Family_Member'] = merged_data['AMT_INCOME_TOTAL'] / merged_data['CNT_FAM_MEMBERS']
    
    # Encode categorical variables (already done in preprocessing)
    
    return merged_data

# Apply feature engineering
feature_engineered_data = feature_engineering(merged_data)

# Save feature-engineered data
feature_engineered_data.to_csv('./data/processed/feature_engineered_data.csv', index=False)

# Display first few rows of feature engineered data
print(feature_engineered_data.head())

# Check the columns to ensure 'TARGET' is present
print("Columns in feature_engineered_data:", feature_engineered_data.columns)


        ID  CNT_CHILDREN  AMT_INCOME_TOTAL  DAYS_BIRTH  DAYS_EMPLOYED  \
0  5008804             0          427500.0      -12005          -4542   
1  5008805             0          427500.0      -12005          -4542   
2  5008806             0          112500.0      -21474          -1134   
3  5008808             0          270000.0      -19110          -3051   
4  5008809             0          270000.0      -19110          -3051   

   FLAG_MOBIL  FLAG_WORK_PHONE  FLAG_PHONE  FLAG_EMAIL  CNT_FAM_MEMBERS  ...  \
0           1                1           0           0              2.0  ...   
1           1                1           0           0              2.0  ...   
2           1                0           0           0              2.0  ...   
3           1                0           1           1              1.0  ...   
4           1                0           1           1              1.0  ...   

   OCCUPATION_TYPE_Managers  OCCUPATION_TYPE_Medicine staff  \
0                

Feature engineering plays a crucial role in improving the performance of predictive models. By creating the Income_Per_Family_Member feature, we gain a more granular view of the client's financial stability, which is likely to be a strong predictor of their ability to manage debt. This feature helps differentiate between clients with similar total incomes but different family sizes, providing the model with more nuanced information.

From a business perspective, the Income_Per_Family_Member feature allows lenders to better assess the financial strain on clients. A lower value might indicate higher financial burden, suggesting that the client is at greater risk of default. Thus, this feature aids in identifying clients who, despite having a high total income, may still struggle to meet their financial obligations due to the number of dependents.
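
As a small worked example of this feature, the sketch below uses three hypothetical clients with the same total income. The guard against a non-positive family size is an added assumption (the original code divides directly); it only matters if such rows exist in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical clients with the same total income but different family sizes
clients = pd.DataFrame({
    "AMT_INCOME_TOTAL": [270000.0, 270000.0, 270000.0],
    "CNT_FAM_MEMBERS": [1.0, 4.0, 0.0],  # the 0.0 row is an artificial edge case
})

# Treat a non-positive family size as missing so the division cannot yield inf
family_size = clients["CNT_FAM_MEMBERS"].where(clients["CNT_FAM_MEMBERS"] > 0, np.nan)
clients["Income_Per_Family_Member"] = clients["AMT_INCOME_TOTAL"] / family_size

print(clients)
```

The first two rows show the point made above: identical income, but a four-person household has a quarter of the per-person income of a single-person household.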

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Set a style for the plots
sns.set(style="whitegrid")

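# Visualization 1: Distribution of Target Variable
# `y` is defined here because no earlier cell has created it at this point
y = feature_engineered_data['TARGET']
plt.figure(figsize=(10, 6))
sns.countplot(y=y, palette="viridis")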
plt.title("Distribution of Target Variable (Class Imbalance)")
plt.xlabel("Count")
plt.ylabel("Target")
plt.show()

The plot above shows the distribution of the target variable and highlights the class imbalance in the dataset.

The target variable classifies clients as either 'good' (likely to meet their credit obligations) or 'bad' (likely to default).

The majority of clients are labelled '0' (good), which is represented by the large bar.
A much smaller portion of clients are labelled '1' (bad), shown by the smaller bar.
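
To put a number on the imbalance, the sketch below computes class shares and the accuracy of a trivial "always predict good" baseline. The class counts are the test-set support values from the Random Forest classification report later in this document; the variable names are illustrative.

```python
import pandas as pd

# Class counts taken from the test-set classification report in this document
y_example = pd.Series([0] * 7175 + [1] * 117, name="TARGET")

# Relative frequency of each class
class_share = y_example.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model's accuracy should be read against this baseline, since a score near it does not by itself show that 'bad' clients are being detected.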

In [None]:
# Plot histograms for key numerical features
plt.figure(figsize=(14, 6))

# Two panels side by side: income on the left, age on the right
plt.subplot(1, 2, 1)
sns.histplot(merged_data['AMT_INCOME_TOTAL'], kde=True, color='blue')
plt.title('Distribution of Total Income')

plt.subplot(1, 2, 2)
sns.histplot(abs(merged_data['DAYS_BIRTH'])/365, kde=True, color='green')
plt.title('Distribution of Age (in years)')

plt.tight_layout()
plt.show()

The left-hand plot displays the distribution of clients' income levels in the dataset.

Understanding the income distribution helps in assessing the financial capacity of the clients. It shows whether the dataset contains more low-income or high-income clients, which could influence creditworthiness.

The right-hand plot shows the distribution of clients' ages, derived from the `DAYS_BIRTH` feature (converted to years).

Age can be a critical factor in credit risk assessment. Younger clients might have different spending and repayment behaviors compared to older clients.
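
The age conversion used in the plot can be checked on the `DAYS_BIRTH` values that appear in the sample output earlier in this document. This is a minimal sketch; dividing by 365 ignores leap years, as the plotting code does.

```python
import pandas as pd

# DAYS_BIRTH values copied from the sample output earlier in this document
days_birth = pd.Series([-12005, -21474, -19110], name="DAYS_BIRTH")

# The column holds negative day counts, so take the absolute value first
age_years = days_birth.abs() / 365

print(age_years.round(1))
```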

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 10))
corr_matrix = feature_engineered_data.corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


The correlation heatmap visualizes the pairwise correlation between features in the dataset. Color intensity indicates the strength of the correlation, with redder tones showing stronger positive correlations and bluer tones showing stronger negative correlations.

## Model Training: Random Forest Classifier
The Random Forest Classifier is chosen for its robustness. A Random Forest trains many decision trees, each on a bootstrap sample of the data, and combines their predictions (scikit-learn averages the trees' predicted class probabilities). This ensemble approach reduces the overfitting that a single deep tree is prone to. Note that a Random Forest does not compensate for class imbalance on its own, which matters for this dataset. The model will be trained on the feature-engineered data to predict whether a client is a 'good' or 'bad' credit risk.
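
The averaging step can be made concrete with a short sketch on synthetic data (not the credit dataset). It checks that the forest's predicted probabilities equal the mean of the probabilities predicted by its individual trees; the toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced toy data (not the credit dataset)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Average the class probabilities predicted by the individual trees
tree_probas = np.mean(
    [tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0
)

# Compare with the probabilities reported by the forest itself
print(np.allclose(tree_probas, forest.predict_proba(X_toy)))
```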

In [4]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

# Load feature engineered data
feature_engineered_data = pd.read_csv('./data/processed/feature_engineered_data.csv')

# Ensure 'TARGET' column is present
if 'TARGET' not in feature_engineered_data.columns:
    raise KeyError("'TARGET' column not found in the dataset")

# Train model function
def train_model(feature_engineered_data):
    X = feature_engineered_data.drop('TARGET', axis=1)
    y = feature_engineered_data['TARGET']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print("Random Forest Accuracy:", accuracy)
    print("Random Forest Classification Report:\n", classification_report(y_test, y_pred))
    
    # Save the model
    joblib.dump(model, './models/credit_approval_rf_model.pkl')

# Train the model
train_model(feature_engineered_data)


Random Forest Accuracy: 0.9787438288535382
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7175
           1       0.29      0.22      0.25       117

    accuracy                           0.98      7292
   macro avg       0.64      0.61      0.62      7292
weighted avg       0.98      0.98      0.98      7292



In [None]:
from sklearn.model_selection import learning_curve
import numpy as np

# Function to plot learning curve
def plot_learning_curve(estimator, X, y, cv=5, n_jobs=None, train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy'):
    plt.figure(figsize=(12, 8))
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring=scoring)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.title('Learning Curve')
    plt.xlabel('Training Examples')
    plt.ylabel('Score')
    plt.legend(loc='best')
    plt.grid()
    plt.show()

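# Plot the learning curve for the Random Forest model.
# X, y and the estimator are created here because train_model() keeps them local;
# learning_curve clones and refits the estimator on each training subset.
X = feature_engineered_data.drop('TARGET', axis=1)
y = feature_engineered_data['TARGET']
plot_learning_curve(RandomForestClassifier(random_state=42), X, y)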

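Both the training score (red line) and the cross-validation score (green line) start very high, close to 1.0 (100% accuracy). This indicates that the model scores well on both the training and the validation data from the start. Because roughly 98% of clients belong to the 'good' class, however, accuracy this high is reachable by predicting the majority class alone, so the curve says little about how well 'bad' clients are detected.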

In [None]:
# Load the trained Random Forest model
import joblib
model = joblib.load('./models/credit_approval_rf_model.pkl')

X = feature_engineered_data.drop('TARGET', axis=1)
y = feature_engineered_data['TARGET']

# Feature importance plot
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x=feature_importances[:20], y=feature_importances.index[:20])
plt.title('Top 20 Feature Importances (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()


In [None]:
# Generate predictions for every row in X (training and held-out rows together)
y_pred_rf = model.predict(X)

# Compute confusion matrix
cm = confusion_matrix(y, y_pred_rf)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title('Confusion Matrix for Random Forest Model')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


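The confusion matrix shown above illustrates how the Random Forest model classifies clients into two categories: '0' (negative class, 'good' credit risk) and '1' (positive class, 'bad' credit risk). Because the predictions are generated for all of `X`, the matrix covers training and held-out rows together, so it looks more favorable than the held-out classification report.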
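
Precision and recall for the 'bad' class follow directly from the four cells of a confusion matrix. The sketch below uses hypothetical labels and predictions to show the arithmetic; none of the numbers come from the model in this report.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_bad = tp / (tp + fp)  # of the clients flagged 'bad', how many are bad
recall_bad = tp / (tp + fn)     # of the truly 'bad' clients, how many were flagged

print(tn, fp, fn, tp)
print(precision_bad, recall_bad)
```

In a lending context the `fn` cell (bad clients predicted as good) is usually the costly one, which is why recall on class 1 deserves as much attention as accuracy.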

In [None]:
# Compute ROC curve and ROC area for Random Forest
# (scored on all rows of X, which includes the rows the model was trained on)
fpr, tpr, _ = roc_curve(y, model.predict_proba(X)[:, 1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

The AUC is very high, which suggests that the model separates 'good' clients (likely to meet their credit obligations) from 'bad' clients (at risk of defaulting) well.

This figure should be read with caution, however. The curve is computed on all rows of `X`, most of which the model was trained on, so the AUC is optimistic. The held-out classification report above, with a recall of 0.22 for class 1, is the more reliable guide to how the model would behave on new clients.
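
The gap between scores on seen and unseen rows can be illustrated with a short sketch on synthetic data. The dataset, split, and model settings below are assumptions chosen for illustration; the point is only that AUC should be computed on held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data standing in for the credit dataset
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], flip_y=0.05, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# AUC on rows the model has already seen versus rows it has not
auc_train = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC on training rows: {auc_train:.3f}")
print(f"AUC on held-out rows: {auc_test:.3f}")
```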

The Random Forest model reaches an accuracy of 97.87%, which reflects how well it predicts 'good' clients. However, the classification report shows that it struggles with 'bad' clients: precision is 0.29 and recall is 0.22 for class 1, so it catches only about one in five of the 117 'bad' clients in the test set. This is primarily due to the class imbalance in the dataset, and it is concerning for real-world applications, since most high-risk individuals would go undetected.

For financial institutions, the ability to correctly identify 'bad' clients is as important as identifying 'good' ones. Although the Random Forest model shows high overall accuracy, its lower performance in identifying 'bad' clients suggests that additional measures, such as re-sampling techniques or adjusting the decision threshold, may be necessary to improve the model's effectiveness. A model that consistently misses high-risk clients could lead to increased default rates and financial losses.
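
Adjusting the decision threshold, one of the measures mentioned above, can be sketched as follows. The data is synthetic and the threshold values are illustrative rather than tuned; in practice the threshold would be chosen on a validation set against the relative cost of false negatives and false positives.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data standing in for the credit dataset
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
proba_bad = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more clients as 'bad': recall can only rise or
# stay the same, usually at the cost of more false positives
for threshold in (0.5, 0.3, 0.1):  # illustrative values, not tuned
    y_flag = (proba_bad >= threshold).astype(int)
    print(threshold, recall_score(y_te, y_flag))
```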

## Model Training: Neural Network
A Neural Network model is also trained to explore if a more complex, non-linear model can improve the prediction of 'bad' clients.

In [5]:
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load feature engineered data
feature_engineered_data = pd.read_csv('./data/processed/feature_engineered_data.csv')

# Prepare the data
X = feature_engineered_data.drop('TARGET', axis=1).astype(np.float32)
y = feature_engineered_data['TARGET'].astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a simple neural network model
def build_neural_network(input_dim):
    model = Sequential()
    model.add(Dense(64, input_dim=input_dim, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Build and train the model
model = build_neural_network(X_train.shape[1])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
y_pred_nn = (model.predict(X_test) > 0.5).astype("int32")
print("Neural Network Accuracy:", model.evaluate(X_test, y_test)[1])
print("Neural Network Classification Report:")
print(classification_report(y_test, y_pred_nn))



Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Neural Network Accuracy: 0.9839550256729126
Neural Network Classification Report:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      7175
         1.0       0.00      0.00      0.00       117

    accuracy                           0.98      7292
   macro avg       0.49      0.50      0.50      7292
weighted avg       0.97      0.98      0.98      7292





The Neural Network reaches an accuracy of 98.40%, slightly above the Random Forest model. However, precision and recall for the 'bad' class are both 0.00: the network labels every test client as 'good', and its accuracy simply equals the share of 'good' clients in the test set (7,175 of 7,292). This is a common failure mode on imbalanced datasets, where the model favors the majority class.

While Neural Networks can capture complex relationships in data, in this scenario the network does no better than the Random Forest model at finding 'bad' clients, and in fact finds none. The challenge of accurately predicting 'bad' clients persists, highlighting the need for further model tuning or alternative approaches. In a business context, relying solely on this model could result in missed opportunities to mitigate risk. Therefore, additional techniques such as cost-sensitive learning or anomaly detection might be explored to enhance the identification of high-risk clients.
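
One common form of cost-sensitive learning is to weight each class inversely to its frequency. The sketch below computes such weights with scikit-learn, using the class counts from the test-set report in this document as stand-in labels; in practice the weights would be derived from the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the class counts from the test-set report in this document;
# in practice the weights would be computed from the training labels
y_example = np.array([0] * 7175 + [1] * 117)

# 'balanced' weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_example
)
class_weight = {0: float(weights[0]), 1: float(weights[1])}
print(class_weight)

# A dictionary like this can be passed to Keras through the class_weight
# argument of model.fit, so that errors on 'bad' clients cost more
```

With these counts, a misclassified 'bad' client weighs roughly sixty times as much as a misclassified 'good' client, which pushes the network away from predicting the majority class for everyone.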

In [6]:
# Summary of the analysis
print("Interpreting the Results:")

# Random Forest Model Interpretation
# Here you can discuss feature importances and model performance.
print("Random Forest Model: The Random Forest model achieved a high accuracy, primarily because it correctly classified most 'good' clients. However, its ability to detect 'bad' clients is limited due to the class imbalance.")

# Neural Network Model Interpretation
# Here you can discuss how the neural network performed and any overfitting signs.
print("Neural Network Model: The Neural Network also achieved high accuracy, but similarly struggled with detecting 'bad' clients. Tuning the model and handling class imbalance could improve its performance.")

# Conclusion
print("Conclusion: Both models perform well in identifying 'good' clients, but struggle with 'bad' clients due to the imbalance in the dataset. Techniques like SMOTE and further tuning of the models may improve the classification of 'bad' clients.")




The analysis reveals that while both the Random Forest and Neural Network models perform well overall, they are not particularly effective at identifying 'bad' clients. This limitation is largely due to the imbalanced nature of the dataset, where 'bad' clients represent a small fraction of the total.

For a financial institution, the ability to accurately identify 'bad' clients is crucial for risk management. The current models, while accurate overall, may not sufficiently protect against defaults. This insight underscores the need for model improvements, such as re-sampling techniques (e.g., SMOTE) or the implementation of custom loss functions that penalize false negatives more heavily. By doing so, the institution can better safeguard its financial health while still capitalizing on lending opportunities.
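
As a simpler stand-in for SMOTE, which creates synthetic minority samples and needs the separate `imbalanced-learn` package, the sketch below shows plain random oversampling with scikit-learn's `resample`. The training frame is hypothetical. Re-sampling of either kind should be applied to the training split only, so that the test set keeps the real class distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame with a rare positive class
train_df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100.0, 120.0, 90.0, 110.0, 95.0, 130.0, 80.0, 105.0],
    "TARGET": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train_df[train_df["TARGET"] == 0]
minority = train_df[train_df["TARGET"] == 1]

# Duplicate minority rows (sampling with replacement) until the classes match
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced_df = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

print(balanced_df["TARGET"].value_counts())
```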

## Conclusion
In conclusion, this project highlights the complexities of credit risk modeling, particularly when dealing with imbalanced datasets. Both the Random Forest and Neural Network models demonstrate high accuracy in predicting 'good' clients but fall short in identifying 'bad' clients. This challenge must be addressed to create a robust credit scoring model that accurately reflects the risk profiles of all clients.