# Credit Card Approval Prediction

This project uses the ["Credit Approval"](https://archive.ics.uci.edu/dataset/27/credit+approval) dataset from the UCI Machine Learning Repository to develop a machine learning model that predicts whether a credit card application will be approved.

## About the Dataset

The dataset contains 690 credit card applications, each described by 15 anonymized attributes (including both categorical and continuous features). The main goal is to classify each application as approved (`+`) or denied (`-`).

- **Instances:** 690  
- **Attributes:** 15
- **Task:** Classification  
- **Missing values:** Yes

Although all attribute names and values have been anonymized for confidentiality, this dataset is a great testbed for various data preprocessing techniques and classification methods due to its diversity of data types and the presence of missing values.


In [0]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier


In [0]:
df = pd.read_csv("Data/crx.data", header=None)

In [0]:
display(df.head())

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Inspecting the Applications

A quick look at the raw data reveals that each credit card application contains a mix of numerical and categorical variables. Although the column names are anonymized, typical features in such datasets may include:

- Gender
- Age
- Debt
- Marital Status
- Bank Customer Status
- Education Level
- Ethnicity
- Years Employed
- Prior Default
- Employment Status
- Credit Score
- Driver's License
- Citizenship
- Zip Code
- Income
- Approval Status

This initial inspection highlights the diversity of data types present. Before moving on to modeling, it is important to further explore the dataset to identify any additional issues, such as missing values or inconsistent formats, that may require preprocessing.


In [0]:
df.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [0]:
df.tail(17)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


In [0]:
for col in df.columns:
    print(f"\n Column: {col}")
    print(df[col].unique())


 Column: 0
['b' 'a' '?']

 Column: 1
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' '?' '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17' '23.50'
 '47.67' '22.75' '34.42' '28.42' '67.75' '47.42' '36.25' '32.67' '48.58'
 '33.58' '18.83' 

In [0]:
#? values to Nan

df = df.replace('?', np.nan)   
df.isnull().sum()


0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [0]:
df = df.fillna(df.mean())

  df = df.fillna(df.mean())


In [0]:
df.isnull().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [0]:
df.dtypes

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13     object
14      int64
15     object
dtype: object

In [0]:
#categorical missing values will be imputed with the most frequent values
for col in df.columns:

    if df[col].dtypes == 'object':

        df = df.fillna(df[col].value_counts().index[0])

df.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

In [0]:
#preprocessing

le = LabelEncoder()

for col in df.columns:

    if df[col].dtypes == 'object':
        df[col] = le.fit_transform(df[col])

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,156,0.0,2,1,13,8,1.25,1,1,1,0,0,68,0,0
1,0,328,4.46,2,1,11,4,3.04,1,1,6,0,0,11,560,0
2,0,89,0.5,2,1,11,4,1.5,1,0,0,0,0,96,824,0
3,1,125,1.54,2,1,13,8,3.75,1,1,5,1,0,31,3,0
4,1,43,5.625,2,1,13,8,1.71,1,0,0,0,2,37,0,0


In [0]:
df_np = df.to_numpy()

X, y = df_np[:, :-1], df_np[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

#LogisticRegression

In [0]:
scaler = MinMaxScaler(feature_range=(0,1))

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [0]:
log_reg = LogisticRegression()

log_reg.fit(X_train, y_train)

In [0]:
y_pred = log_reg.predict(X_test)

print("Accuracy: ", log_reg.score(X_test, y_test))

confusion_matrix(y_test, y_pred)

Accuracy:  0.8333333333333334


array([[92, 11],
       [27, 98]])

In [0]:
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 200, 400]

param_grid = dict(tol = tol, max_iter = max_iter)

grid_search = GridSearchCV(
    estimator = log_reg,
    param_grid = param_grid,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)

grid_search_result = grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search_result.best_params_)
print("Best accuracy:", grid_search_result.best_score_)

best_model = grid_search_result.best_estimator_

y_pred = best_model.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))

Best parameters: {'max_iter': 100, 'tol': 0.01}
Best accuracy: 0.8614246068455135
Test accuracy: 0.8333333333333334

Confusion matrix:
 [[92 11]
 [27 98]]

Classification report:
               precision    recall  f1-score   support

         0.0       0.77      0.89      0.83       103
         1.0       0.90      0.78      0.84       125

    accuracy                           0.83       228
   macro avg       0.84      0.84      0.83       228
weighted avg       0.84      0.83      0.83       228



#Other models

In [0]:
models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

best_model = None
best_score = 0.0

print("K-Fold cross-validation results:")
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    mean_score = scores.mean()
    print(f"{name}: {mean_score:.4f} ± {scores.std():.4f}")
    if mean_score > best_score:
        best_score = mean_score
        best_model = model
print(f"\n ✅ Best Model: {best_model}")

best_model.fit(X_train, y_train)


y_pred = best_model.predict(X_test)


print("\n✅ Test accuracy:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

K-Fold cross-validation results:
Decision Tree: 0.8072 ± 0.0813
Random Forest: 0.8655 ± 0.0754
SVM: 0.8613 ± 0.0705
KNN: 0.8593 ± 0.0618

 ✅ Best Model: RandomForestClassifier()

✅ Test accuracy:
Accuracy: 0.868421052631579

Confusion Matrix:
 [[ 89  14]
 [ 16 109]]

Classification Report:
               precision    recall  f1-score   support

         0.0       0.85      0.86      0.86       103
         1.0       0.89      0.87      0.88       125

    accuracy                           0.87       228
   macro avg       0.87      0.87      0.87       228
weighted avg       0.87      0.87      0.87       228



In [0]:
# RandomForest > LogisticRegression 