# Loan Approval Prediction

## Introduction

This notebook presents a machine learning solution for predicting loan approval status. The goal is to develop a model that can accurately determine whether a loan application will be approved or not, based on various applicant and loan characteristics.

### Project Overview:
- **Objective**: Predict the probability of loan approval for each applicant.
- **Evaluation Metric**: Area Under the ROC Curve (AUC-ROC)
- **Data**: Training and test datasets containing applicant information and loan details.
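
AUC-ROC can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below checks this pairwise reading against `sklearn.metrics.roc_auc_score`. The labels and scores are made up for illustration and are not taken from the competition data; `pairwise_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

auc_pairwise = pairwise_auc(y_true, y_score)
auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_pairwise, auc_sklearn)
```

Because only the ordering of the scores matters, the metric does not depend on any particular decision threshold.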

### Key Steps:
1. Data Loading and Exploration
2. Data Cleaning and Preprocessing
3. Feature Engineering
4. Model Development using Neural Networks
5. Model Evaluation
6. Prediction on Test Data and Submission

This project applies a feed-forward neural network to a tabular financial classification problem in which each prediction draws on a mix of numerical and categorical input features.

Let's dive into the code and see how we approach this challenging prediction task!

# Initialization

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import roc_auc_score


# Data Loading and Exploration

In [2]:
# Load the training dataset
train_df = pd.read_csv('/kaggle/input/playground-series-s4e10/train.csv')

# Load the test dataset
test_df = pd.read_csv('/kaggle/input/playground-series-s4e10/test.csv')

# Display basic information about the datasets
print("Training dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)

# Display the first few rows of the training dataset
print("\nFirst few rows of the training dataset:")
print(train_df.head())

# Display column names
print("\nColumn names:")
print(train_df.columns)

# Display basic statistics of the training dataset
print("\nBasic statistics of the training dataset:")
print(train_df.describe())

# Check for missing values in the training dataset
print("\nMissing values in the training dataset:")
print(train_df.isnull().sum())

Training dataset shape: (58645, 13)
Test dataset shape: (39098, 12)

First few rows of the training dataset:
   id  person_age  person_income person_home_ownership  person_emp_length  \
0   0          37          35000                  RENT                0.0   
1   1          22          56000                   OWN                6.0   
2   2          29          28800                   OWN                8.0   
3   3          30          70000                  RENT               14.0   
4   4          22          60000                  RENT                2.0   

  loan_intent loan_grade  loan_amnt  loan_int_rate  loan_percent_income  \
0   EDUCATION          B       6000          11.49                 0.17   
1     MEDICAL          C       4000          13.35                 0.07   
2    PERSONAL          A       6000           8.90                 0.21   
3     VENTURE          B      12000          11.11                 0.17   
4     MEDICAL          A       6000           6.92   

# Data Cleaning and Preprocessing

In [3]:
def clean_dataset(df):
    # Check for missing values
    print("Missing values before handling:")
    print(df.isnull().sum())
    
    # Handle missing values
    # For numeric columns, fill with median
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    
    # For categorical columns, fill with mode
    categorical_columns = df.select_dtypes(include=['object']).columns
    df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])
    
    print("\nMissing values after handling:")
    print(df.isnull().sum())
    
    # Check for and remove duplicates
    duplicates = df.duplicated()
    print(f"\nNumber of duplicate rows: {duplicates.sum()}")
    df = df.drop_duplicates()
    
    return df

# Clean training data
print("Cleaning training data:")
train_df = clean_dataset(train_df)

print("\nCleaning test data:")
test_df = clean_dataset(test_df)

# Print shapes after cleaning
print("\nShape of training data after cleaning:", train_df.shape)
print("Shape of test data after cleaning:", test_df.shape)

Cleaning training data:
Missing values before handling:
id                            0
person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
loan_status                   0
dtype: int64

Missing values after handling:
id                            0
person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
loan_status                   0
dtype: int64

Number of duplicate rows: 0

Cleaning test data:
Miss
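
The cleaning function above fills each DataFrame with its own median and mode. As a possible alternative, the sketch below learns the fill values from the training data once and reuses them on the test data, returning filled copies instead of modifying the inputs. The helper names `fit_fill_values` and `apply_fill_values` and the toy frames are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df):
    """Return {column: fill value}: median for numeric columns, mode for the rest."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    other_cols = df.columns.difference(numeric_cols)
    fill_values = df[numeric_cols].median().to_dict()
    if len(other_cols) > 0:
        # mode() can return several rows on ties; take the first one
        fill_values.update(df[other_cols].mode().iloc[0].to_dict())
    return fill_values

def apply_fill_values(df, fill_values):
    """Return a filled copy; columns absent from df are ignored."""
    return df.fillna(fill_values)

# Toy frames, invented for illustration only
toy_train = pd.DataFrame({'income': [10.0, 20.0, np.nan, 40.0],
                          'grade': ['A', 'B', 'B', None]})
toy_test = pd.DataFrame({'income': [np.nan, 5.0],
                         'grade': [None, 'A']})

fill_values = fit_fill_values(toy_train)
toy_test_filled = apply_fill_values(toy_test, fill_values)
print(fill_values)
print(toy_test_filled)
```

Since neither dataset here reports missing values, this changes nothing for the current data; it only matters if gaps appear later.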

# Feature Engineering

In [4]:
# Separate features and target
X_train = train_df.drop(['id', 'loan_status'], axis=1)
y_train = train_df['loan_status']
X_test = test_df.drop('id', axis=1)

# Define categorical and numerical columns
categorical_features = ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']
numerical_features = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']

# Create preprocessing steps
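# sparse_threshold=0 makes the ColumnTransformer return a dense array, so the
# OneHotEncoder `sparse` argument (renamed to `sparse_output` in scikit-learn 1.2
# and removed in 1.4) is not needed
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ],
    sparse_threshold=0)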

# Fit the preprocessor on the training data and transform both training and test data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Get feature names after preprocessing
onehot_encoder = preprocessor.named_transformers_['cat']
if hasattr(onehot_encoder, 'get_feature_names_out'):
    # For newer scikit-learn versions
    cat_feature_names = onehot_encoder.get_feature_names_out(categorical_features)
else:
    # For older scikit-learn versions
    n_categories = [len(onehot_encoder.categories_[i]) - 1 for i in range(len(categorical_features))]
    cat_feature_names = [f"{feat}_{i}" for feat, n in zip(categorical_features, n_categories) for i in range(n)]

# list() works for both the NumPy array and the plain list produced above
feature_names = numerical_features + list(cat_feature_names)

# Convert to DataFrames
X_train_preprocessed = pd.DataFrame(X_train_preprocessed, columns=feature_names)
X_test_preprocessed = pd.DataFrame(X_test_preprocessed, columns=feature_names)

print("Preprocessed training data shape:", X_train_preprocessed.shape)
print("Preprocessed test data shape:", X_test_preprocessed.shape)
print("\nFirst few rows of preprocessed training data:")
print(X_train_preprocessed.head())

Preprocessed training data shape: (58645, 22)
Preprocessed test data shape: (39098, 22)

First few rows of preprocessed training data:
   person_age  person_income  person_emp_length  loan_amnt  loan_int_rate  \
0    1.566200      -0.765768          -1.187200  -0.578306       0.267616   
1   -0.920057      -0.212128           0.328047  -0.937775       0.880532   
2    0.240196      -0.929223           0.833130  -0.578306      -0.585854   
3    0.405947       0.156966           2.348377   0.500101       0.142396   
4   -0.920057      -0.106673          -0.682117  -0.578306      -1.238314   

   loan_percent_income  cb_person_cred_hist_length  \
0             0.117378                    2.031798   
1            -0.973242                   -0.946489   
2             0.553626                    1.039036   
3             0.117378                   -0.201917   
4            -0.646056                   -0.698298   

   person_home_ownership_OTHER  person_home_ownership_OWN  \
0               



# Model Development, Evaluation, and Prediction on Test Data

In [5]:
# Split the data
X_train, X_val, y_train, y_val = train_test_split(X_train_preprocessed, y_train, test_size=0.2, random_state=42)

# Define the model
model = keras.Sequential([
    # Take the input width from the data instead of hard-coding 22
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=32,
    callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)]
)

# Evaluate the model
val_predictions = model.predict(X_val)
val_auc = roc_auc_score(y_val, val_predictions)
print(f"Validation AUC-ROC score: {val_auc:.4f}")

# Make predictions on the test set
test_predictions = model.predict(X_test_preprocessed)

# Prepare submission (assuming 'id' column is available in test_df)
submission = pd.DataFrame({
    'id': test_df['id'],
    'loan_status': test_predictions.flatten()
})
submission.to_csv('submission.csv', index=False)
print("Submission file created: submission.csv")

Epoch 1/50
1467/1467 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.8822 - loss: 0.2939 - val_accuracy: 0.9372 - val_loss: 0.1911
Epoch 2/50
1467/1467 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9360 - loss: 0.1960 - val_accuracy: 0.9389 - val_loss: 0.1883
Epoch 3/50
1467/1467 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9408 - loss: 0.1862 - val_accuracy: 0.9441 - val_loss: 0.1790
Epoch 4/50
1467/1467 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9438 - loss: 0.1819 - val_accuracy: 0.9448 - val_loss: 0.1780
Epoch 5/50
1467/1467 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9444 - loss: 0.1799 - val_accuracy: 0.9448 - val_loss: 0.1768
Epoch 6/50
1467/1467 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9452 - loss: 0.1784 - val_accuracy: 0.9463 - val_loss: 0.1767
Epoch 7/50
1
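
The 1467 steps per epoch in the log follow from the 80/20 split and the batch size of 32: the number of training rows left after the split, divided by 32 and rounded up. A small sketch that reproduces the count by splitting row indices instead of the real features (the variable names are illustrative):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 58645      # training rows reported after cleaning
batch_size = 32     # batch size passed to model.fit

# Split row indices with the same test_size and random_state as the notebook
train_idx, val_idx = train_test_split(np.arange(n_rows), test_size=0.2, random_state=42)

# The last, partially filled batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(len(train_idx) / batch_size)
print(len(train_idx), len(val_idx), steps_per_epoch)
```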

# Conclusion

## Summary of Approach and Results

In this project, we developed a machine learning model to predict loan approval probabilities. Our approach involved several key steps:

1. **Data Preprocessing**: We checked the datasets for missing values and duplicate rows; the training data contained neither.
2. **Feature Engineering**: We standardized the 7 numerical features and one-hot encoded the 4 categorical features (dropping the first category of each), giving 22 model inputs.
3. **Model Development**: We implemented a neural network using TensorFlow/Keras with three hidden dense layers (64, 32, and 16 units) and a sigmoid output, trained with early stopping to limit overfitting.
4. **Model Evaluation**: We achieved a validation AUC-ROC score of 0.9353, indicating strong predictive performance.
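
To give a sense of the model's size, the number of trainable parameters in a stack of dense layers can be worked out by hand: each layer contributes one weight per input-output pair plus one bias per unit. A short sketch, where `dense_param_count` is a hypothetical helper and the layer widths are those used in the notebook:

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the input width followed by each layer's unit count.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

# 22 input features, then the Dense layers from the notebook: 64, 32, 16, 1
n_params = dense_param_count([22, 64, 32, 16, 1])
print(n_params)  # 4097
```

The breakdown is 1472 + 2080 + 528 + 17 = 4097 parameters, a small network relative to the number of training rows.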

## Key Findings

- The neural network's validation accuracy rose from 0.937 to 0.946 over the first six epochs shown in the training log.
- Early stopping (patience of 5 epochs, restoring the best weights) was used to limit overfitting.
- Scaling the numerical features and one-hot encoding the categorical ones converted the mixed-type data into the numeric inputs the network requires.
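
Early stopping with a patience setting can be illustrated with a few lines of plain Python. The function below is a simplified model of the idea, not Keras's actual implementation: it stops once the validation loss has failed to improve on its best value for `patience` consecutive epochs and reports which epoch held the best loss (the one whose weights would be restored). The function name and the loss values are invented for illustration.

```python
def early_stopping_epochs(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, invented for illustration
stop_epoch, best_epoch = early_stopping_epochs([0.50, 0.40, 0.45, 0.46, 0.47], patience=3)
print(stop_epoch, best_epoch)  # 4 1
```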