# Credit Card Approval

Data Set Link : https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data

Github Link : https://github.com/hasnainzahid/Machine_Learning

Banks receive a large number of credit card applications every day. Reviewing them manually is tedious, time-consuming, and prone to error. The task can be automated by using machine learning to predict whether an application will be approved.

The dataset contains the following columns:
Gender, Age, Debt, Married, BankCustomer, Industry, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, Approved






# Import Necessary Libraries

# Load the Dataset

# Data Preprocessing:
A. Split the dataset into training and test sets.
B. Explore the training dataset to understand its structure and characteristics.
C. Split the features and the target label (e.g., 'Approved') from the training and test datasets.

# Feature Engineering:
A. Encode categorical features using techniques like one-hot encoding.
B. Normalize the features, especially numerical ones, to ensure they are on a similar scale.

# Model Selection and Hyperparameter Tuning:
A. Choose machine learning models suitable for the classification task (e.g., Logistic Regression, Decision Tree, Random Forest, KNN, SVM).
B. Perform hyperparameter tuning for each selected model to find the best set of hyperparameters.

# Model Training and Evaluation:
A. Train the models with the training data.
B. Evaluate the models using appropriate evaluation metrics like accuracy, precision, recall, F1-score, and confusion matrix.
C. Select the best-performing model (e.g., Random Forest) based on the evaluation results.

# Testing the Best Model:
Test the selected best model on the test dataset and calculate and display the accuracy, precision, recall, F1-score, and confusion matrix to assess its performance.





# Libraries

In [None]:
import pandas as pd
import numpy as np
import sklearn.preprocessing
import sklearn.metrics
from scipy import sparse
from sklearn.preprocessing import LabelEncoder, OneHotEncoder,StandardScaler
from sklearn.model_selection import GridSearchCV as GSCV
from sklearn.model_selection import train_test_split as TTS
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LogisticRegression as LR
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn import svm
from urllib.request import urlretrieve

# Load CSV File

In [None]:
urlretrieve("https://raw.githubusercontent.com/hasnainzahid/Machine_Learning/main/Credit%20Card%20Approval.csv", filename="credit_card_approval.csv")
credit_card_data = pd.read_csv("credit_card_approval.csv")

credit_card_data.head()

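| | Gender | Age | Debt | Married | BankCustomer | Industry | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | Approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 30.83 | 0.0 | 1 | 1 | Industrials | White | 1.25 | 1 | 1 | 1 | 0 | ByBirth | 202 | 0 | 1 |
| 1 | 0 | 58.67 | 4.46 | 1 | 1 | Materials | Black | 3.04 | 1 | 1 | 6 | 0 | ByBirth | 43 | 560 | 1 |
| 2 | 0 | 24.5 | 0.5 | 1 | 1 | Materials | Black | 1.5 | 1 | 0 | 0 | 0 | ByBirth | 280 | 824 | 1 |
| 3 | 1 | 27.83 | 1.54 | 1 | 1 | Industrials | White | 3.75 | 1 | 1 | 5 | 1 | ByBirth | 100 | 3 | 1 |
| 4 | 1 | 20.17 | 5.625 | 1 | 1 | Industrials | White | 1.71 | 1 | 0 | 0 | 0 | ByOtherMeans | 120 | 0 | 1 |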


The code above imports the necessary libraries, downloads the dataset from the given URL, and loads it into a pandas DataFrame. Calling credit_card_data.head() then displays the first five rows for a quick look at the structure of the data.

# Splitting the Dataset

In [None]:
training_data, testing_data = TTS(credit_card_data)

display("Total Size", credit_card_data.shape)
display('Training Size', training_data.shape)
display("Testing Size", testing_data.shape)

'Total Size'

(690, 16)

'Training Size'

(517, 16)

'Testing Size'

(173, 16)

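Here, we split the dataset into training and test sets using the train_test_split function from scikit-learn. With the default settings, 25% of the rows (173 of 690) are held out for testing and the remaining 517 are used for training. This gives us separate datasets for training and evaluating the models.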
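The split in the cell above uses the default settings, so the exact assignment of rows changes from run to run. As a minimal sketch of a reproducible, stratified alternative, the example below works on a small synthetic frame rather than the real data. The toy values and `random_state=42` are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card data (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(18, 70, size=200),
    "Approved": np.repeat([0, 1], [110, 90]),
})

# test_size=0.25 matches the scikit-learn default used in the notebook.
# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["Approved"]
)

print(train_df.shape, test_df.shape)
print(train_df["Approved"].mean(), test_df["Approved"].mean())
```

Fixing the seed makes reported scores repeatable, and stratifying avoids a test set whose approval rate differs noticeably from the training set by chance.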

# Data Exploration

In [None]:
training_data.isnull().sum()

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
Industry          0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
Approved          0
dtype: int64

Check for missing values: none of the columns contain any.

In [None]:
training_data["Industry"].unique()

array(['Energy', 'ConsumerStaples', 'Real Estate', 'Industrials',
       'InformationTechnology', 'Research', 'Healthcare', 'Education',
       'ConsumerDiscretionary', 'CommunicationServices', 'Utilities',
       'Materials', 'Financials', 'Transport'], dtype=object)

Check the unique values of the Industry column.

In [None]:
training_data["Approved"].value_counts()

0    287
1    230
Name: Approved, dtype: int64

Count the occurrences of each value in the Approved column.

In this part, we perform some basic data exploration. We check for missing values, unique values in a categorical column, and count occurrences of different values in the target variable 'Approved'.
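The three checks above can be gathered into one small helper. This is only a sketch: the function name `summarize` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def summarize(frame: pd.DataFrame, target: str) -> dict:
    """Collect missing-value, cardinality, and class-balance checks in one place."""
    # Treat every non-numeric column as categorical.
    categorical = [c for c in frame.columns
                   if not pd.api.types.is_numeric_dtype(frame[c])]
    return {
        "missing_total": int(frame.isnull().sum().sum()),
        "categories_per_column": {c: int(frame[c].nunique()) for c in categorical},
        "target_share": frame[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical miniature of the data, with one missing Industry value.
toy = pd.DataFrame({
    "Industry": ["Energy", "Materials", "Energy", None],
    "Income": [0, 560, 824, 3],
    "Approved": [1, 1, 0, 0],
})
summary = summarize(toy, target="Approved")
print(summary)
```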

# Data Preprocessing

## Splitting Features and Target Label

In [None]:

x_training_data = training_data.drop("Approved", axis=1)
y_training_data = training_data["Approved"]


x_testing_data = testing_data.drop("Approved", axis=1)
y_testing_data = testing_data["Approved"]


display("Training Size:", x_training_data.shape)
display("Testing Size:", x_testing_data.shape)

'Training Size:'

(517, 15)

'Testing Size:'

(173, 15)

In [None]:
x_training_data.head()

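| | Gender | Age | Debt | Married | BankCustomer | Industry | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 1 | 25.75 | 0.5 | 1 | 1 | Energy | White | 1.46 | 1 | 1 | 5 | 1 | ByBirth | 312 | 0 |
| 355 | 0 | 16.0 | 0.165 | 1 | 1 | ConsumerStaples | White | 1.0 | 0 | 1 | 2 | 1 | ByBirth | 320 | 1 |
| 418 | 1 | 38.42 | 0.705 | 1 | 1 | Energy | White | 0.375 | 0 | 1 | 2 | 0 | ByBirth | 225 | 500 |
| 61 | 1 | 31.67 | 16.165 | 1 | 1 | Real Estate | White | 3.0 | 1 | 1 | 9 | 0 | ByBirth | 250 | 730 |
| 517 | 1 | 16.08 | 0.75 | 1 | 1 | Energy | White | 1.75 | 1 | 1 | 5 | 1 | ByBirth | 352 | 690 |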


Here, we separate the features (x_training_data, x_testing_data) from the target label (y_training_data, y_testing_data) in both the training and test sets. This separation is needed to train and evaluate the machine learning models.

## Encoding

In [None]:
# x_training_data and x_testing_data are DataFrames, so the sparse check is a no-op here
x_training_data_dense = x_training_data.toarray() if sparse.issparse(x_training_data) else x_training_data
x_testing_data_dense = x_testing_data.toarray() if sparse.issparse(x_testing_data) else x_testing_data

# Cast every column (numerical ones included) to string,
# so that each distinct value is treated as a category
x_training_data_str = x_training_data_dense.astype(str)
x_testing_data_str = x_testing_data_dense.astype(str)

# One-hot encode; values that appear only in the test set are ignored
encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")

# Fit on the training data only, then apply the same encoding to the test data
x_training_data_encoded = encoder.fit_transform(x_training_data_str)
x_testing_data_encoded = encoder.transform(x_testing_data_str)

# Drop the last encoded column. Note: 'Approved' was already removed from the features above,
# so this removes one of the one-hot columns produced from Income (the last input column)
x_training_data_encoded = x_training_data_encoded[:, :-1]  # Drop the last column
x_testing_data_encoded = x_testing_data_encoded[:, :-1]  # Drop the last column

display("Training Data Size:", x_training_data_encoded.shape)
display("Testing Data Size:", x_testing_data_encoded.shape)

'Training Data Size:'

(517, 992)

'Testing Data Size:'

(173, 992)
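The encoding cell above casts all 15 feature columns to strings before encoding, which is why the result has 992 columns: every distinct numerical value becomes its own category. A more common arrangement is to one-hot encode only the non-numeric columns and pass the numerical ones through unchanged. The sketch below illustrates this with scikit-learn's `ColumnTransformer` on a hypothetical miniature of the feature table; it is an alternative for comparison, not the method used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the feature table.
train = pd.DataFrame({
    "Industry": ["Energy", "Materials", "Energy"],
    "Citizen": ["ByBirth", "ByBirth", "ByOtherMeans"],
    "Age": [30.83, 58.67, 24.5],
    "Income": [0, 560, 824],
})
test = pd.DataFrame({
    "Industry": ["Utilities"],   # category not seen during fit
    "Citizen": ["ByBirth"],
    "Age": [27.83],
    "Income": [3],
})

# Non-numeric columns are treated as categorical.
categorical = [c for c in train.columns
               if not pd.api.types.is_numeric_dtype(train[c])]

# One-hot encode only the categorical columns; numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
    sparse_threshold=0.0,  # always return a dense array
)

train_encoded = preprocess.fit_transform(train)
test_encoded = preprocess.transform(test)
print(train_encoded.shape, test_encoded.shape)
```

With this arrangement the unseen Industry value in the test row is encoded as all zeros, and no separate step is needed to recombine the encoded and numerical features.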

## Combine With Numerical Features

In [None]:
# Get the numerical features
numerical_features = [column for column in x_training_data.columns if pd.api.types.is_numeric_dtype(x_training_data[column])]

# Convert numerical features to arrays
x_training_numerical = x_training_data[numerical_features].values
x_testing_numerical = x_testing_data[numerical_features].values

# Combine the numerical and one-hot encoded features
x_training_data_final = np.hstack((x_training_data_encoded.toarray(), x_training_numerical))
x_testing_data_final = np.hstack((x_testing_data_encoded.toarray(), x_testing_numerical))

x_training_data_final = pd.DataFrame(x_training_data_final)
x_testing_data_final = pd.DataFrame(x_testing_data_final)


display("Combined Training Data Size:", x_training_data_final.shape)
display("Combined Testing Data Size:", x_testing_data_final.shape)


'Combined Training Data Size:'

(517, 1004)

'Combined Testing Data Size:'

(173, 1004)

In [None]:
display(x_training_data_final.head())

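| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | 1001 | 1002 | 1003 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.5 | 1.0 | 1.0 | 1.46 | 1.0 | 1.0 | 5.0 | 1.0 | 312.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.165 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 1.0 | 320.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.705 | 1.0 | 1.0 | 0.375 | 0.0 | 1.0 | 2.0 | 0.0 | 225.0 | 500.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 16.165 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 9.0 | 0.0 | 250.0 | 730.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.75 | 1.0 | 1.0 | 1.75 | 1.0 | 1.0 | 5.0 | 1.0 | 352.0 | 690.0 |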


## Normalization

In [None]:
scaler = sklearn.preprocessing.StandardScaler(with_mean=False)
scaler.fit(x_training_data_final)
x_data_training = scaler.transform(x_training_data_final)
x_data_testing = scaler.transform(x_testing_data_final)


display("Training Data Size", x_data_training.shape)
display("Testing Data Size", x_data_testing.shape)

'Training Data Size'

(517, 1004)

'Testing Data Size'

(173, 1004)

Normalization adjusts the values of the features in a dataset so that they are all on a similar scale. This helps to prevent any particular feature from having a disproportionate influence on the learning process. Here, StandardScaler with with_mean=False divides each feature by its standard deviation, estimated on the training data, without subtracting the mean.

The next step is hyperparameter tuning and model selection.
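To make the effect of `with_mean=False` concrete, the short sketch below compares it with the default `StandardScaler` behaviour on two toy features. The values are hypothetical and chosen only to show the difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# As in the notebook: divide by the standard deviation, keep the mean where it is.
scaled_only = StandardScaler(with_mean=False).fit_transform(X)

# Default behaviour: subtract the mean, then divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

print(scaled_only.std(axis=0), scaled_only.mean(axis=0))
print(standardized.std(axis=0), standardized.mean(axis=0))
```

Both variants give every feature unit variance; only the default one also centres the features at zero. Skipping the centring is mainly useful for sparse input, where subtracting the mean would destroy sparsity.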

## Model Selection and Hyperparameter Tuning

In [None]:
# Logistic Regression

parameters_grid = {

    "C": [0.1, 1, 10, 100],
    "max_iter": [100, 1000, 10000],
    "tol": [0.0001, 0.001, 0.01]
}

M1 = GSCV(LR(),
     parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)

M1.fit(x_data_training, y_training_data)

display("L.R Accuracy = {:.2f}".format(M1.best_score_))
display("Best Parameters of L.R = {}".format(M1.best_params_))

'L.R Accuracy = 0.86'

"Best Parameters of L.R = {'C': 0.1, 'max_iter': 100, 'tol': 0.0001}"

In this segment, we use grid search to find the best hyperparameters for a Logistic Regression model. Grid search tunes hyperparameters by specifying a set of candidate values for each parameter and evaluating every combination with cross-validation (5 folds here). The best Logistic Regression model reaches a mean cross-validated accuracy of 86 percent.
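The cost of a grid search grows with the product of the candidate list lengths. As a small illustration, scikit-learn's `ParameterGrid` can enumerate the Logistic Regression grid used in this notebook; the variable names `combinations` and `n_folds` are introduced here for the example.

```python
from sklearn.model_selection import ParameterGrid

# The Logistic Regression grid from the notebook.
parameters_grid = {
    "C": [0.1, 1, 10, 100],
    "max_iter": [100, 1000, 10000],
    "tol": [0.0001, 0.001, 0.01],
}

combinations = list(ParameterGrid(parameters_grid))
n_folds = 5  # cv=5 in the GridSearchCV call

print(len(combinations))            # 4 * 3 * 3 = 36 candidate settings
print(len(combinations) * n_folds)  # 180 model fits, before the final refit
print(combinations[0])              # one candidate, as a dict of parameter values
```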

In [None]:
# Decision Tree

parameters_grid = {

    "max_depth": range(1, 10),
    "min_samples_split": range(2, 10),
    "min_samples_leaf": range(1, 10)

}
M2 = GSCV(DT(),
      parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)

M2.fit(x_data_training, y_training_data)

display("D.T Accuracy = {:.2f}".format(M2.best_score_))
display("Best Parameters of D.T = {}".format(M2.best_params_))

'D.T Accuracy = 0.86'

"Best Parameters of D.T = {'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2}"

We follow a similar procedure for hyperparameter tuning with grid search, this time for a Decision Tree model. Hyperparameters significantly affect the model's performance, and tuning them is a crucial step in the machine learning pipeline.

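The best Decision Tree reaches a mean cross-validated accuracy of 86 percent. Its selected max_depth of 1 means that the tree makes a single split.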

In [None]:
# Random Forest

parameters_grid = {
     "max_depth": range(1, 10),
    "min_samples_split": range(2, 5),
    "n_estimators": [100]
}
M3 = GSCV(RF(),
        parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)

M3.fit(x_data_training, y_training_data)

display("R.F Accuracy = {:.2f}".format(M3.best_score_))
display("Best Parameters of R.F = {}".format(M3.best_params_))


'R.F Accuracy = 0.88'

"Best Parameters of R.F = {'max_depth': 8, 'min_samples_split': 2, 'n_estimators': 100}"

Hyperparameter tuning is performed for Random Forest. The best model reaches a mean cross-validated accuracy of 88 percent on the training data, and the best hyperparameters are displayed.

In [None]:
# KNN

parameters_grid = {
     "n_neighbors": range(1, 5),
    "leaf_size": [1, 10],

}

M4 = GSCV(KNN(),
      parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)

M4.fit(x_data_training, y_training_data)
best_model = M4.best_estimator_

display("KNN Accuracy = {:.2f}".format(best_model.score(x_data_training, y_training_data)))
display("Best Parameters of KNN = {}".format(M4.best_params_))


'KNN Accuracy = 0.79'

"Best Parameters of KNN = {'leaf_size': 1, 'n_neighbors': 4}"

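Here, we perform hyperparameter tuning for a K-Nearest Neighbors (KNN) model. KNN is a simple and widely used classification algorithm based on feature similarity.
The reported accuracy is 79 percent. Unlike the other models, this figure is the accuracy of the best estimator on the full training set (best_model.score), not the cross-validated best_score_, so it is not directly comparable with the other results.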
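The KNN cell reports `best_model.score(...)` on the training data, whereas the other cells report `best_score_`. The two numbers measure different things, as the sketch below shows on synthetic data generated with `make_classification` (an assumption for illustration, not the credit card data). A single-neighbour model is used because it makes the gap especially visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the scaled training features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1]},  # a single candidate keeps the comparison easy to follow
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

# Mean accuracy on the held-out folds of the cross-validation.
cv_accuracy = search.best_score_
# Accuracy on the same data the refit model was trained on.
train_accuracy = search.best_estimator_.score(X, y)

print(round(cv_accuracy, 2), round(train_accuracy, 2))
```

With one neighbour, every training point is its own nearest neighbour, so the training accuracy is perfect regardless of how well the model generalizes. Reporting `best_score_` for every model keeps the comparison consistent.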

In [None]:
# SVM

parameters_grid = {
   "C": [1.0, 10.0, 100.0],
    "gamma": ["scale", "auto"],
}

M5 = GSCV(sklearn.svm.SVC(),
        parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)

M5.fit(x_data_training, y_training_data)

display("SVM Accuracy = {:.2f}".format(M5.best_score_))
display("Best Parameters of SVM = {}".format(M5.best_params_))

'SVM Accuracy = 0.83'

"Best Parameters of SVM = {'C': 10.0, 'gamma': 'scale'}"

Hyperparameter tuning is performed for SVM. The best model reaches a mean cross-validated accuracy of 83 percent on the training data, and the best hyperparameters are displayed.

# Best Model Testing

In [None]:
# Random Forest

y_prediction = M3.predict(x_data_testing)
accuracy = sklearn.metrics.accuracy_score(y_testing_data, y_prediction)
cm = sklearn.metrics.confusion_matrix(y_testing_data, y_prediction)
precision,recall, f1, support = sklearn.metrics.precision_recall_fscore_support(y_testing_data, y_prediction, zero_division=1)

display("Accuracy =", accuracy)
display("Precision =", precision)
display("Recall =", recall)
display("F1-Score =", f1)
display("Confusion Matrix:\n", cm)

'Accuracy ='

0.8728323699421965

'Precision ='

array([0.8627451 , 0.88732394])

'Recall ='

array([0.91666667, 0.81818182])

'F1-Score ='

array([0.88888889, 0.85135135])

'Confusion Matrix:\n'

array([[88,  8],
       [14, 63]])

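Here, we apply the best-performing model (Random Forest) to the test set and evaluate its performance using accuracy, precision, recall, F1-score, and the confusion matrix. The model reaches about 87 percent accuracy: of the 173 test applications, it misclassifies 22 (8 false positives and 14 false negatives).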
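All of the reported test metrics follow from the confusion matrix. The sketch below recomputes them with NumPy from the matrix printed above, which can serve as a quick consistency check.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[88, 8],
               [14, 63]])

accuracy = np.trace(cm) / cm.sum()
# Correct predictions divided by all predictions of that class.
precision = np.diag(cm) / cm.sum(axis=0)
# Correct predictions divided by all true members of that class.
recall = np.diag(cm) / cm.sum(axis=1)
# Harmonic mean of precision and recall, per class.
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(precision.round(4), recall.round(4), f1.round(4))
```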

# Conclusion and Discussion
The machine learning pipeline presented here aimed to automate the credit card approval process using a dataset with various features related to applicants. The pipeline consists of data preprocessing, model selection, hyperparameter tuning, and evaluation of the best model.

## Data Preprocessing

The dataset was checked for missing values (none were found), categorical features were encoded, and the features were normalized. This step is crucial to ensure the quality and uniformity of the data for model training.

## Model Selection and Hyperparameter Tuning:

Several models were considered: LR (Logistic Regression), DT (Decision Tree), RF (Random Forest), KNN (K-Nearest Neighbors), and SVM (Support Vector Machine). The hyperparameters of each model were tuned using grid search to optimize performance.



1.   LR: achieved an accuracy of 86%
2.   DT: achieved an accuracy of 86%
3.   RF: achieved an accuracy of 88%
4.   KNN: achieved an accuracy of 79% (measured on the training set rather than by cross-validation)
5.   SVM: achieved an accuracy of 83%


## Model Evaluation:
The Random Forest model, which achieved the highest accuracy, was selected as the best model. Its performance was evaluated using accuracy, precision, recall, F1-score, and a confusion matrix on the test dataset.



1.   Accuracy: 87%
2.   Precision: 0.86 (Not Approved), 0.89 (Approved)
3.   Recall: 0.92 (Not Approved), 0.82 (Approved)
4.   F1-Score: 0.89 (Not Approved), 0.85 (Approved)
5.   Confusion Matrix: 88 true negatives, 8 false positives, 14 false negatives, and 63 true positives.


# Strengths:
Data Preprocessing: The pipeline checked for missing values, encoded categorical features, and scaled the data through normalization, supporting the quality and uniformity of the dataset for model training.

Hyperparameter Tuning: Grid search was employed to systematically explore a range of hyperparameters for each model, assisting in finding the best set of hyperparameters that optimize model performance.

Model Evaluation: The best-performing model (Random Forest) was evaluated comprehensively using accuracy, precision, recall, F1-score, and the confusion matrix, providing a detailed view of its performance.

# Limitations:
Limited Hyperparameters: The hyperparameters explored were restricted to predetermined ranges, and a more thorough search might potentially increase model performance.

Assumption of Best Model: Random Forest was selected as the best model based on cross-validated accuracy on the training data. However, the best model choice might vary with different datasets or problem domains.

Assumed Importance of Features: The importance of features in the Random Forest model was not analyzed. Understanding feature importance could provide valuable insights into the decision-making process of the model.
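As a rough sketch of how such an analysis could start, the example below reads `feature_importances_` from a fitted Random Forest. It uses synthetic data from `make_classification`, in which only the first two columns carry signal; with the objects from this notebook, the corresponding attribute would be `M3.best_estimator_.feature_importances_`. The hyperparameters and seed are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the informative features are columns 0 and 1.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
forest.fit(X, y)

# One non-negative value per feature; the values sum to 1.
importances = forest.feature_importances_
# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Because the notebook's final feature matrix consists of anonymous one-hot columns, mapping importances back to the original column names would require keeping track of the encoder's output order.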

Imbalanced Data: The training split is mildly imbalanced with respect to the target variable (287 not approved versus 230 approved). This imbalance was not addressed explicitly, which could bias the model towards the majority class.
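One simple way to quantify and counteract class imbalance is to weight classes inversely to their frequency. The sketch below applies scikit-learn's `compute_class_weight` to the training class counts reported earlier (287 and 230); it assumes that 0 denotes a rejected application and 1 an approved one.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts reported for the training split: 287 zeros and 230 ones.
y_train = np.repeat([0, 1], [287, 230])

# Share of class 1 (assumed to mean "approved") in the training split.
share_approved = y_train.mean()

# "balanced" weights: n_samples / (n_classes * count of each class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)

print(round(share_approved, 3))
print(dict(zip([0, 1], weights.round(3))))
```

Many scikit-learn classifiers, including `RandomForestClassifier` and `LogisticRegression`, accept `class_weight="balanced"` to apply these weights during training.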

# Conclusion

In summary, this notebook presents a complete machine learning pipeline for automating credit card approval decisions. Among the five models compared, Random Forest performed best on this particular dataset. Further modifications and studies may improve the model's accuracy and its adaptability to real-world credit card approval applications.

