# Credit Card Approval Prediction using SVM
**Collaborator:**
- Dinh Minh NGUYEN
- Luong Phuong Truc HUYNH
- Quang Huy PHUNG

## Credit Card Approval Dataset

## Source: [Kaggle](https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data/code)

## Description:
The dataset provides information on credit card applicants, including demographic and financial attributes, alongside the approval status of their credit card applications.

## Features:
- Gender: Gender of the applicant (0 for female, 1 for male)
- Age: Age of the applicant
- Debt: Debt-to-income ratio of the applicant
- Married: Marital status of the applicant (0 for unmarried, 1 for married)
- BankCustomer: Indicates if the applicant is a bank customer (0 for no, 1 for yes)
- Industry: Industry in which the applicant works
- Ethnicity: Ethnicity of the applicant
- YearsEmployed: Number of years employed
- PriorDefault: Indicates if the applicant had defaulted on a previous credit card (0 for no, 1 for yes)
- Employed: Indicates if the applicant is currently employed (0 for no, 1 for yes)
- CreditScore: Credit score of the applicant
- DriversLicense: Indicates if the applicant has a driver's license (0 for no, 1 for yes)
- Citizen: Citizenship status of the applicant
- ZipCode: Zip code of the applicant
- Income: Income of the applicant

## Target Variable:
- Approved: Indicates whether the credit card application was approved (0 for not approved, 1 for approved)

## Number of Instances: 690

## Data Types:
- Categorical Variables: Gender, Married, BankCustomer, Industry, Ethnicity, PriorDefault, Employed, DriversLicense, Citizen, ZipCode
- Numerical Variables: Age, Debt, YearsEmployed, CreditScore, Income

## Potential Preprocessing Steps:
1. Handling missing values
2. Encoding categorical variables
3. Scaling numerical features
4. Handling imbalanced data
5. Splitting data into training and testing sets

This dataset is suitable for building predictive models to estimate the likelihood of credit card approval based on applicant attributes.


## Import

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

## Load dataset

In [2]:
data = pd.read_csv("../data/credit_card_approvals.csv")
data

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1,30.83,0.000,1,1,Industrials,White,1.25,1,1,1,0,ByBirth,202,0,1
1,0,58.67,4.460,1,1,Materials,Black,3.04,1,1,6,0,ByBirth,43,560,1
2,0,24.50,0.500,1,1,Materials,Black,1.50,1,0,0,0,ByBirth,280,824,1
3,1,27.83,1.540,1,1,Industrials,White,3.75,1,1,5,1,ByBirth,100,3,1
4,1,20.17,5.625,1,1,Industrials,White,1.71,1,0,0,0,ByOtherMeans,120,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,21.08,10.085,0,0,Education,Black,1.25,0,0,0,0,ByBirth,260,0,0
686,0,22.67,0.750,1,1,Energy,White,2.00,0,1,2,1,ByBirth,200,394,0
687,0,25.25,13.500,0,0,Healthcare,Latino,2.00,0,1,1,1,ByBirth,200,1,0
688,1,17.92,0.205,1,1,ConsumerStaples,White,0.04,0,0,0,0,ByBirth,280,750,0


In [3]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [4]:
le = LabelEncoder()
for col in data.columns:
    if data[col].dtypes == 'object':
        data[col] = le.fit_transform(data[col])
data

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1,30.83,0.000,1,1,7,4,1.25,1,1,1,0,0,202,0,1
1,0,58.67,4.460,1,1,9,1,3.04,1,1,6,0,0,43,560,1
2,0,24.50,0.500,1,1,9,1,1.50,1,0,0,0,0,280,824,1
3,1,27.83,1.540,1,1,7,4,3.75,1,1,5,1,0,100,3,1
4,1,20.17,5.625,1,1,7,4,1.71,1,0,0,0,1,120,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,21.08,10.085,0,0,3,1,1.25,0,0,0,0,0,260,0,0
686,0,22.67,0.750,1,1,4,4,2.00,0,1,2,1,0,200,394,0
687,0,25.25,13.500,0,0,6,2,2.00,0,1,1,1,0,200,1,0
688,1,17.92,0.205,1,1,2,4,0.04,0,0,0,0,0,280,750,0


In [5]:
# Remove duplicate rows
data = data.drop_duplicates()
data

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1,30.83,0.000,1,1,7,4,1.25,1,1,1,0,0,202,0,1
1,0,58.67,4.460,1,1,9,1,3.04,1,1,6,0,0,43,560,1
2,0,24.50,0.500,1,1,9,1,1.50,1,0,0,0,0,280,824,1
3,1,27.83,1.540,1,1,7,4,3.75,1,1,5,1,0,100,3,1
4,1,20.17,5.625,1,1,7,4,1.71,1,0,0,0,1,120,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,21.08,10.085,0,0,3,1,1.25,0,0,0,0,0,260,0,0
686,0,22.67,0.750,1,1,4,4,2.00,0,1,2,1,0,200,394,0
687,0,25.25,13.500,0,0,6,2,2.00,0,1,1,1,0,200,1,0
688,1,17.92,0.205,1,1,2,4,0.04,0,0,0,0,0,280,750,0


In [6]:
X = data.drop(columns=['Approved'])
y = data['Approved']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [8]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test  = scaler.fit_transform(X_test)

In [9]:
param_grid = {
    'C': np.linspace(2 ** -5, 2 ** 15, 16),
    'kernel': ['rbf'],
    'gamma': np.linspace(2 ** -15, 2 ** 3, 16)
}

# We make a classifier
clf = svm.SVC()

# We make a grid search using GridSearchCV()
grid = GridSearchCV(clf, param_grid, cv=5, verbose=2)

# We train it
grid.fit(X_train, y_train)

# We print the best hyperparameters
print("Best Hyperparameters::\n{}".format(grid.best_params_))

Fitting 5 folds for each of 256 candidates, totalling 1280 fits
[CV] END ......C=0.03125, gamma=3.0517578125e-05, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=3.0517578125e-05, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=3.0517578125e-05, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=3.0517578125e-05, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=3.0517578125e-05, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=0.53336181640625, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=0.53336181640625, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=0.53336181640625, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=0.53336181640625, kernel=rbf; total time=   0.0s
[CV] END ......C=0.03125, gamma=0.53336181640625, kernel=rbf; total time=   0.0s
[CV] END .....C=0.03125, gamma=1.066693115234375, kernel=rbf; total time=   0.0s


[CV] END .....C=0.03125, gamma=1.066693115234375, kernel=rbf; total time=   0.0s
[CV] END .....C=0.03125, gamma=1.066693115234375, kernel=rbf; total time=   0.0s
[CV] END .....C=0.03125, gamma=1.066693115234375, kernel=rbf; total time=   0.0s
[CV] END .....C=0.03125, gamma=1.066693115234375, kernel=rbf; total time=   0.0s
[CV] END ....C=0.03125, gamma=1.6000244140625002, kernel=rbf; total time=   0.0s
[CV] END ....C=0.03125, gamma=1.6000244140625002, kernel=rbf; total time=   0.0s
[CV] END ....C=0.03125, gamma=1.6000244140625002, kernel=rbf; total time=   0.0s
[CV] END ....C=0.03125, gamma=1.6000244140625002, kernel=rbf; total time=   0.0s
[CV] END ....C=0.03125, gamma=1.6000244140625002, kernel=rbf; total time=   0.1s
[CV] END .....C=0.03125, gamma=2.133355712890625, kernel=rbf; total time=   0.0s
[CV] END .....C=0.03125, gamma=2.133355712890625, kernel=rbf; total time=   0.1s
[CV] END .....C=0.03125, gamma=2.133355712890625, kernel=rbf; total time=   0.0s
[CV] END .....C=0.03125, gam

In [10]:
# We select the best model
best_model = grid.best_estimator_

# We print the best hyperparameters
print("Best Hyperparameters:\n{}".format(best_model))

Best Hyperparameters:
SVC(C=2184.5625, gamma=3.0517578125e-05)


In [11]:
# Train the SVM with the best hyperparameters
best_model.fit(X_train, y_train)

In [12]:
accuracy = best_model.score(X_test, y_test)
print("Accuracy on the test set:", accuracy)

Accuracy on the test set: 0.8439306358381503


In [13]:
y_pred = best_model.predict(X_test)
y_pred

array([0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1])

In [14]:
compare_df = pd.DataFrame({'actual_value': y_test,'predicted_value': y_pred})
compare_df

Unnamed: 0,actual_value,predicted_value
286,0,0
511,1,1
257,0,0
336,0,0
318,1,0
...,...,...
357,0,0
215,1,1
629,0,0
390,0,0


In [15]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [16]:
confusion_matrix = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(confusion_matrix)

              precision    recall  f1-score   support

           0       0.90      0.79      0.84        91
           1       0.80      0.90      0.85        82

    accuracy                           0.84       173
   macro avg       0.85      0.85      0.84       173
weighted avg       0.85      0.84      0.84       173

[[72 19]
 [ 8 74]]


In [17]:
log_acc=accuracy_score(y_test,y_pred)*100
log_acc

84.39306358381504