# Credit Card Approval Prediction

This project uses the ["Credit Approval"](https://archive.ics.uci.edu/dataset/27/credit+approval) dataset from the UCI Machine Learning Repository to develop a machine learning model that predicts whether a credit card application will be approved.

## About the Dataset

The dataset contains 690 credit card applications, each described by 15 anonymized attributes (including both categorical and continuous features). The main goal is to classify each application as approved (`+`) or denied (`-`).

- **Instances:** 690  
- **Attributes:** 15
- **Task:** Classification  
- **Missing values:** Yes

Although all attribute names and values have been anonymized for confidentiality, this dataset is a great testbed for various data preprocessing techniques and classification methods due to its diversity of data types and the presence of missing values.


In [0]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier


In [0]:
df = pd.read_csv("Data/crx.data", header=None)

In [0]:
display(df.head())

## Inspecting the Applications

A quick look at the raw data reveals that each credit card application contains a mix of numerical and categorical variables. Although the column names are anonymized, typical features in such datasets may include:

- Gender
- Age
- Debt
- Marital Status
- Bank Customer Status
- Education Level
- Ethnicity
- Years Employed
- Prior Default
- Employment Status
- Credit Score
- Driver's License
- Citizenship
- Zip Code
- Income
- Approval Status

This initial inspection highlights the diversity of data types present. Before moving on to modeling, it is important to further explore the dataset to identify any additional issues, such as missing values or inconsistent formats, that may require preprocessing.
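
One inconsistency worth checking for is a numeric column that pandas loads as text because it contains the `?` placeholder. The sketch below is a minimal illustration on a hypothetical toy frame (the values and the helper name `coerce_numeric_like` are assumptions, not part of this notebook): it replaces `?` with `NaN` and converts a text column to numeric only when every remaining value parses as a number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the raw file: '?' marks a missing value.
toy = pd.DataFrame({
    0: ["a", "b", "?", "a"],              # genuinely categorical
    1: ["30.83", "?", "24.50", "27.83"],  # numeric, but read as text
    2: [0.0, 4.46, 0.5, 1.54],            # already numeric
})

def coerce_numeric_like(frame, missing_token="?"):
    """Replace the missing token with NaN and convert text columns
    whose non-missing values all parse as numbers."""
    out = frame.replace(missing_token, np.nan)
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            converted = pd.to_numeric(out[col], errors="coerce")
            # Convert only if coercion introduced no additional NaNs.
            if converted.isna().sum() == out[col].isna().sum():
                out[col] = converted
    return out

clean = coerce_numeric_like(toy)
print(clean.dtypes)
print(clean.isna().sum())
```

Columns converted this way can then be imputed with a mean rather than being treated as categories.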


In [0]:
df.describe()

In [0]:
df.info()

In [0]:
df.tail(17)

In [0]:
for col in df.columns:
    print(f"\n Column: {col}")
    print(df[col].unique())

In [0]:
# Replace the '?' placeholders with NaN so pandas treats them as missing

df = df.replace('?', np.nan)
df.isnull().sum()


In [0]:
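# Impute numeric columns with their column means.
# numeric_only=True skips object columns, which raise a TypeError in pandas 2.x.
df = df.fillna(df.mean(numeric_only=True))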

In [0]:
df.isnull().sum()

In [0]:
df.dtypes

In [0]:
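# Impute missing values in each categorical column with that column's most frequent value
for col in df.columns:

    if df[col].dtypes == 'object':

        # Fill this column only; df.fillna(value) would fill every column with one value
        df[col] = df[col].fillna(df[col].value_counts().index[0])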

df.isnull().sum()
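
To illustrate why the mode has to be computed and applied per column, here is a small sketch on a hypothetical two-column frame (the data and the names `toy` and `filled` are assumptions for illustration). Each column receives its own most frequent value.

```python
import pandas as pd

# Hypothetical toy frame with one missing entry per column.
toy = pd.DataFrame({
    "a": ["x", "x", None, "y"],  # mode is "x"
    "b": ["p", None, "q", "q"],  # mode is "q"
})

filled = toy.copy()
for col in filled.columns:
    if not pd.api.types.is_numeric_dtype(filled[col]):
        # mode() ignores missing values; take the first mode if there are ties.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```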

In [0]:
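# Preprocessing: encode each categorical (object) column as integer codes.
# LabelEncoder is refit per column; in the target, '+' becomes 0 and '-' becomes 1.
le = LabelEncoder()

for col in df.columns:
    if df[col].dtypes == 'object':
        df[col] = le.fit_transform(df[col])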

df.head()
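
Integer codes from `LabelEncoder` impose an arbitrary order on categories, which a linear model such as logistic regression may read as magnitude. One possible alternative, not the approach used in this notebook, is one-hot encoding inside a `ColumnTransformer`. The sketch below assumes a hypothetical toy frame with one numeric column `num` and one categorical column `cat`; the data is synthetic and the step names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (an assumption; not the credit approval file).
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "num": rng.normal(size=n),
    "cat": pd.Series(rng.choice(["u", "y", "l"], size=n), dtype=object),
})
toy_y = (toy["num"] > 0).astype(int).to_numpy()

# Knock out a few values to mimic missing entries.
toy.loc[::10, "num"] = np.nan
toy.loc[5::10, "cat"] = np.nan

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer([
    ("num", numeric, ["num"]),
    ("cat", categorical, ["cat"]),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])

clf.fit(toy, toy_y)
features = clf.named_steps["pre"].transform(toy)
print(features.shape)  # one scaled numeric column plus one column per category
```

Because imputation, scaling, and encoding live inside the pipeline, they are refit on the training portion of every split when the pipeline is cross-validated.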

In [0]:
df_np = df.to_numpy()

X, y = df_np[:, :-1], df_np[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

## Logistic Regression

In [0]:
scaler = MinMaxScaler(feature_range=(0,1))

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [0]:
log_reg = LogisticRegression()

log_reg.fit(X_train, y_train)

In [0]:
y_pred = log_reg.predict(X_test)

print("Accuracy: ", log_reg.score(X_test, y_test))

confusion_matrix(y_test, y_pred)
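
When reading the confusion matrix, it helps to recall how the labels were encoded: `LabelEncoder` sorts its classes, so `+` (approved) maps to 0 and `-` (denied) maps to 1. The sketch below uses hypothetical toy labels (not results from this notebook) to show how the four cells can be unpacked, with rows as true classes and columns as predicted classes.

```python
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Confirm the encoding order of the two target symbols.
encoder = LabelEncoder().fit(["+", "-"])
mapping = {str(label): int(code)
           for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(mapping)

# Hypothetical toy labels: 0 = approved, 1 = denied.
y_true_toy = [0, 1, 0, 0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
# Row-major order: (true 0, pred 0), (true 0, pred 1), (true 1, pred 0), (true 1, pred 1)
approved_ok, approved_as_denied, denied_as_approved, denied_ok = cm.ravel()

print("Approved, predicted approved:", approved_ok)
print("Approved, predicted denied:  ", approved_as_denied)
print("Denied, predicted approved:  ", denied_as_approved)
print("Denied, predicted denied:    ", denied_ok)
```

With this encoding, the class that scikit-learn reports as label 1 is denial, which matters when interpreting precision and recall in the classification report.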

In [0]:
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 200, 400]

param_grid = dict(tol = tol, max_iter = max_iter)

grid_search = GridSearchCV(
    estimator = log_reg,
    param_grid = param_grid,
    cv=10,
    scoring='accuracy',
    n_jobs=-1
)

grid_search_result = grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search_result.best_params_)
print("Best accuracy:", grid_search_result.best_score_)

best_model = grid_search_result.best_estimator_

y_pred = best_model.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))
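
The grid above varies only `tol` and `max_iter`, which control solver convergence. A variation worth considering is to tune the regularization strength `C` and to wrap the scaler in a `Pipeline`, so the scaler is refit on the training part of each cross-validation fold. The following is a sketch under assumptions: it runs on synthetic data from `make_classification`, and the `C` values are illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the credit approval file).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=grid, cv=10, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```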

## Other Models

In [0]:
models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

best_model = None
best_score = 0.0

print("K-Fold cross-validation results:")
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    mean_score = scores.mean()
    print(f"{name}: {mean_score:.4f} ± {scores.std():.4f}")
    if mean_score > best_score:
        best_score = mean_score
        best_model = model
print(f"\n ✅ Best Model: {best_model}")

best_model.fit(X_train, y_train)


y_pred = best_model.predict(X_test)


print("\n✅ Test accuracy:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

In [0]:
# Random Forest outperformed Logistic Regression
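
A single test-set accuracy can be a noisy basis for ranking two models. One way to check how stable the ordering is, sketched here on synthetic data (an assumption; the numbers it prints say nothing about the credit approval results), is to score both models on identical folds and look at the per-fold differences.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the credit approval file).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=6, random_state=42
)

# A fixed random_state makes KFold produce the same splits for both models.
folds = KFold(n_splits=10, shuffle=True, random_state=42)

rf_scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_demo, y_demo,
    cv=folds, scoring="accuracy",
)
lr_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=folds, scoring="accuracy",
)

diff = rf_scores - lr_scores  # positive values favor Random Forest
print(f"Random Forest:       {rf_scores.mean():.4f} ± {rf_scores.std():.4f}")
print(f"Logistic Regression: {lr_scores.mean():.4f} ± {lr_scores.std():.4f}")
print(f"Mean difference:     {diff.mean():.4f}")
print(f"Folds won by Random Forest: {(diff > 0).sum()} of {len(diff)}")
```

If the mean difference is small relative to the fold-to-fold spread, the ranking is less certain than a single accuracy figure suggests.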