<a href="https://colab.research.google.com/github/hemantsingla96/credit-card-approval-prediction/blob/main/Credit_Card_Approval_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'credit-card-approval:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F4621749%2F7875621%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240727%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240727T081026Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D2f31b50311668792a548714b036b6961bd2d3f46e8bc4e5f1a322036e964bd3783dac7c55e1dbc4215ca89dd735e5d948ad674c36c13a58fe32ee02452e04c0f477ef56e8933159ae12aa7fb8086ecff67c1c45dca0de01a3c5c90015cd05f780cb5b85d0573dffe1a54ee831b9d32ded6880d298de0e8cad132cf6d5be77aa01029a68660bed0846efea4b1b78f58a24e1c66e1159683277a8ec2cd513217c1c32e41c40d8c5e1089eed40056d21a06ed85aba93bab13b7545ceb06c21ef55bcd346be718007f51faa96dee66e19101fd3efc55b07011654d9ad8690c8750beebe5462cdf13186aae3f1b78d58e99bdea236996455d1f89f68a120ac00b5126'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the tarfile module is not shadowed on later iterations
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Commercial banks get a lot of requests for credit cards. However, they reject many of them for reasons such as too much debt, low income, or too many inquiries on a person's credit history. Analyzing these requests manually takes a lot of time and can lead to mistakes. Thankfully, we can use machine learning to automate this process, just like most banks do. In this guide, we'll build a credit card approval predictor using machine learning techniques.

**Section 1.** Why is this proposal important in today’s world? Why is predicting a good client valuable for a bank?

Predicting good clients is crucial for banks to minimize credit risk and make informed lending decisions. In today's world, where data science and ML are becoming integral to financial institutions, accurate credit card approval predictions can significantly impact a bank's profitability and customer satisfaction.

**2. How is it going to impact the banking sector?**

Accurate credit card approval predictions can lead to reduced default rates, lower financial losses, and improved customer experiences. This can positively impact a bank's profitability, reputation, and competitiveness in the market.

**3. If any, what is the gap in the knowledge, or how can your proposed method be helpful in the future for any bank in India?**

The proposed method can address the gap in credit assessment by incorporating advanced data analysis and machine learning techniques. Its adaptability makes it valuable for banks in India and beyond, as it can continuously evolve to consider new data sources and enhance creditworthiness assessments.

**Section 2: Initial Hypotheses**

In the Data Analysis (DA) track, we will aim to identify patterns in the data and important features that may impact a Machine Learning (ML) model. Our initial hypotheses are:

**Hypothesis 1:** Income type, annual income, and education level are crucial factors in predicting credit card approval.

**Hypothesis 2:** Car ownership, property ownership, and family size may influence credit card approval decisions.

**Hypothesis 3:** The length of employment and the presence of a mobile phone or email address could be relevant features.

**Hypothesis 4:** Gender, marital status, and housing type may also play a role in credit card approval.

**Section 3: Data Analysis Approach**

Our data analysis approach will involve: Exploratory Data Analysis (EDA) to identify important patterns, correlations, and outliers in the data. Feature engineering techniques to create relevant features that can improve model performance. Utilizing visualization tools to justify our findings and provide insights into the relationships within the data.

**Feature Engineering**

This project requires feature engineering techniques such as one-hot encoding and label encoding to transform categorical variables into numerical ones, and feature scaling to normalize the data for machine learning. The code in this notebook uses `LabelEncoder` for the encoding and `MinMaxScaler` for the scaling.

**Justification of Data Analysis Approach**

Here we use visualization tools such as box plots to detect outliers and histograms to show the distribution of the data. Null values are replaced with the median (numerical columns) or the mode (categorical columns) to make the data more suitable for machine learning.

**Section 4: Machine Learning Approach**

We will use various machine learning models, including but not limited to logistic regression, decision trees, random forests, SVM, and XGBoost, to predict credit card approval based on customer information.

**Justification for Model Selection:**

Logistic Regression: A simple yet interpretable model to establish a baseline. Decision Trees and Random Forests: To capture non-linear relationships and feature interactions. XGBoost: To improve predictive accuracy by combining multiple weak models.

**Steps to Improve Model Accuracy:**

Feature selection to identify the most relevant variables. Hyperparameter tuning for model optimization. Cross-validation to assess model performance. Evaluation metrics, such as accuracy, precision, recall, and F1-score, to justify the chosen model.
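
The hyperparameter tuning step listed above is not shown in code later in this notebook. A minimal sketch of one way to do it with scikit-learn's `GridSearchCV` follows. The synthetic data from `make_classification`, the names `X_demo` and `y_demo`, the parameter grid, and the choice of F1 as the scoring metric are all assumptions made for illustration, not values taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the credit card data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Hypothetical search space; the values are illustrative only.
param_grid = {"max_depth": [4, 8], "min_samples_leaf": [4, 16]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring="f1",  # F1 is often more informative than accuracy on imbalanced labels
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.best_estimator_` then holds a model refitted on all of the supplied data with the best parameter combination.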

**Comparison of Models:**

We compared the performance of at least four machine learning models using classification_report, accuracy_score, confusion_matrix and cross-validation to determine the most suitable model for credit card approval prediction. The XGBoost model gives the highest accuracy, 90%, so we use it for prediction.

########

Commercial banks receive a large number of credit card applications. Many are refused for reasons such as high debt balances, low income levels, or too many inquiries on an individual's credit report. Analyzing these applications manually is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and almost every commercial bank does so nowadays. In this notebook, we will create an automated credit card approval predictor using machine learning techniques, just like real banks do.

**This notebook follows these steps**

First, we load and view the dataset.

The dataset contains both numerical and categorical values. Some values are missing, and some attributes are not relevant.

We preprocess the data before fitting any machine learning model so that the models perform better.

We visualize the data, because a good plot can convey insights that would otherwise take a thousand words.

We evaluate the performance of the models.

Then we do hyperparameter tuning and optimization to improve performance.

We also compare several machine learning models to see which one gives the best result.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df1 = pd.read_csv("/kaggle/input/credit-card-approval/Credit_card.csv")
df2 = pd.read_csv("/kaggle/input/credit-card-approval/Credit_card_label.csv")

In [None]:
df1 = df1.drop_duplicates(subset=["Ind_ID"], keep="last")

In [None]:
df = pd.merge(df1,df2,on='Ind_ID',how='inner')
df.head()
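
The de-duplication and inner join above can be checked on a toy example. The sketch below uses made-up rows and hypothetical frame names (`applicants`, `labels`). The `validate="one_to_one"` argument is an optional addition, not part of the original cell, that makes pandas raise an error if either side still has duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the applicant and label tables (made-up values).
applicants = pd.DataFrame(
    {"Ind_ID": [1, 2, 2, 3], "Annual_income": [100.0, 200.0, 250.0, 300.0]}
)
labels = pd.DataFrame({"Ind_ID": [1, 2, 4], "label": [0, 1, 0]})

# Keep the last record per applicant, as the notebook does.
applicants = applicants.drop_duplicates(subset=["Ind_ID"], keep="last")

# An inner join keeps only IDs present in both tables.
merged = pd.merge(applicants, labels, on="Ind_ID", how="inner", validate="one_to_one")
print(merged)
```

Applicant 3 has no label and label 4 has no applicant, so both drop out of the inner join.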

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
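# Fill missing categorical values with the most frequent value (assigning back avoids chained in-place fillna)
df['GENDER'] = df['GENDER'].fillna(df['GENDER'].mode()[0])

In [None]:
# Fill missing numerical values with the median
df['Annual_income'] = df['Annual_income'].fillna(df['Annual_income'].median())

In [None]:
df['Birthday_count'] = df['Birthday_count'].fillna(df['Birthday_count'].median())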

In [None]:
df.isnull().sum()
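
The imputation used above (mode for a categorical column, median for a numerical one) can be illustrated on a small frame. The values below are made up, and `toy` is a hypothetical name. The median is a reasonable choice for income because it is not pulled upward by a few very large values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "GENDER": ["F", "M", "F", None],
        "Annual_income": [100.0, np.nan, 300.0, 1000.0],
    }
)

# Categorical column: fill with the most frequent value (mode).
toy["GENDER"] = toy["GENDER"].fillna(toy["GENDER"].mode()[0])

# Numerical column: fill with the median of the observed values.
toy["Annual_income"] = toy["Annual_income"].fillna(toy["Annual_income"].median())

print(toy)
```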

In [None]:
df.drop('Type_Occupation',axis = 1,inplace = True) #dropping the Type_Occupation because it has so many null values

In [None]:
df.info()

In [None]:
cat_columns = df.columns[(df.dtypes =='object').values].tolist() #Checking the categorical columns
cat_columns

In [None]:
num_columns = df.columns[(df.dtypes !='object').values].tolist() #checking the numerical columns
num_columns
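
As an alternative to comparing `dtypes` with `'object'`, pandas offers `select_dtypes`. The sketch below is only an illustration on a made-up frame (`toy`, `cat_cols` and `num_cols` are hypothetical names) and is not a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame(
    {"GENDER": ["F", "M"], "CHILDREN": [0, 2], "Annual_income": [1.5, 2.5]}
)

# Everything that is not numeric is treated as categorical here.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(cat_cols, num_cols)
```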

In [None]:
# Checking unique values from Categorical Columns

for i in df.columns[(df.dtypes =='object').values].tolist():
    print(i,'\n')
    print(df[i].value_counts())
    print('-----------------------------------------------')

In [None]:
df['CHILDREN'].value_counts()

In [None]:
# Checking min and max values of the 'Birthday_count' column
print('Min Birthday_count :', df['Birthday_count'].min(),'\nMax Birthday_count :', df['Birthday_count'].max())

In [None]:
# Converting 'Birthday_count' values from days to years
df['Birthday_count'] = round(df['Birthday_count']/-365,0)
df.rename(columns={'Birthday_count':'AGE_YEARS'}, inplace=True)

In [None]:
df.head()

In [None]:
# Checking the unique values of 'Employed_days'
df['Employed_days'].unique()

In [None]:
# As mentioned in the dataset documentation, a positive 'Employed_days' value means the person is currently unemployed, hence replacing it with 0
df['Employed_days'] = df['Employed_days'].replace(365243, 0)

In [None]:
# Converting 'Employed_days' values from days to years
df['Employed_days'] = abs(round(df['Employed_days']/-365,0))
df.rename(columns={'Employed_days':'YEARS_EMPLOYED'}, inplace=True)
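
The two day-to-year conversions above can be verified on a couple of hand-picked values. This is a minimal sketch on made-up rows: `toy` and `UNEMPLOYED_SENTINEL` are hypothetical names, and the sentinel value is the one replaced in the cell above.

```python
import pandas as pd

UNEMPLOYED_SENTINEL = 365243  # value the notebook replaces with 0

toy = pd.DataFrame(
    {
        "Birthday_count": [-18250, -10950],  # days before the application date
        "Employed_days": [-3650, UNEMPLOYED_SENTINEL],
    }
)

# Negative day counts divided by -365 give positive years.
toy["AGE_YEARS"] = (toy["Birthday_count"] / -365).round(0)

# Map the sentinel to 0 first, then convert the remaining day counts.
toy["YEARS_EMPLOYED"] = (
    toy["Employed_days"].replace(UNEMPLOYED_SENTINEL, 0).abs() / 365
).round(0)

print(toy[["AGE_YEARS", "YEARS_EMPLOYED"]])
```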

In [None]:
df['Mobile_phone'].value_counts()

In [None]:
# As all the values in column are 1, hence dropping column
df.drop('Mobile_phone', axis=1, inplace=True)

In [None]:
df['Work_Phone'].value_counts()

In [None]:
# This column only contains 0 & 1 values for work phone submitted, hence dropping column
df.drop('Work_Phone', axis=1, inplace=True)

In [None]:
df['Phone'].value_counts()

In [None]:
# This column only contains 0 & 1 values for Phone no submitted, hence dropping column
df.drop('Phone', axis=1, inplace=True)

In [None]:
df['EMAIL_ID'].value_counts()

In [None]:
# This column only contains 0 & 1 values for Email submitted, hence dropping column
df.drop('EMAIL_ID', axis=1, inplace=True)

In [None]:
df['Family_Members'].value_counts()

In [None]:
df.head()

# Visualization

In [None]:
#create plot to detect outliers
sns.boxplot(df['CHILDREN'])

In [None]:
sns.boxplot(df['Annual_income'])

In [None]:
sns.boxplot(df['AGE_YEARS'])

In [None]:
sns.boxplot(df['YEARS_EMPLOYED'])

In [None]:
sns.boxplot(df['Family_Members'])

# Removing Outliers

In [None]:
high_bound = df['CHILDREN'].quantile(0.999)
print('high_bound :', round(high_bound,2))
low_bound = df['CHILDREN'].quantile(0.001)
print('low_bound :', low_bound)

In [None]:
df = df[(df['CHILDREN']>=low_bound) & (df['CHILDREN']<=high_bound)]

In [None]:
high_bound = df['Annual_income'].quantile(0.999)
print('high_bound :', round(high_bound,0))
low_bound = df['Annual_income'].quantile(0.001)
print('low_bound :', low_bound)

In [None]:
df = df[(df['Annual_income']>=low_bound) & (df['Annual_income']<=high_bound)]

In [None]:
high_bound = df['YEARS_EMPLOYED'].quantile(0.999)
print('high_bound :', round(high_bound,0))
low_bound = df['YEARS_EMPLOYED'].quantile(0.001)
print('low_bound :', low_bound)

In [None]:
df = df[(df['YEARS_EMPLOYED']>=low_bound) & (df['YEARS_EMPLOYED']<=high_bound)]

In [None]:
high_bound = df['Family_Members'].quantile(0.999)
print('high_bound :', high_bound)
low_bound = df['Family_Members'].quantile(0.001)
print('low_bound :', low_bound)

In [None]:
df = df[(df['Family_Members']>=low_bound) & (df['Family_Members']<=high_bound)]
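
The four quantile filters above repeat the same pattern. A minimal sketch of a helper that applies it column by column is shown below. `filter_by_quantiles` is a hypothetical function, not part of the notebook, and the toy data uses a wider upper quantile than the notebook's 0.999 so that the effect is visible on 100 rows.

```python
import pandas as pd


def filter_by_quantiles(frame, columns, lower=0.001, upper=0.999):
    """Keep rows whose values lie within the [lower, upper] quantiles, one column at a time."""
    out = frame
    for col in columns:
        low_bound = out[col].quantile(lower)
        high_bound = out[col].quantile(upper)
        # between() is inclusive on both ends, matching the >= and <= comparisons above.
        out = out[out[col].between(low_bound, high_bound)]
    return out


# Made-up column: 50 zeros, 49 ones and a single extreme value.
toy = pd.DataFrame({"CHILDREN": [0] * 50 + [1] * 49 + [14]})
trimmed = filter_by_quantiles(toy, ["CHILDREN"], lower=0.0, upper=0.99)

print(len(toy), len(trimmed))
```

As in the cells above, each column's bounds are computed on the rows that survived the previous filter.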

In [None]:
df.head()

In [None]:
# Dropping the 'Ind_ID' column as it contains only unique values (not required for the ML model)
df.drop('Ind_ID', axis=1, inplace=True)

In [None]:
df.head(2)

In [None]:
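# This graph shows that the majority of applications are submitted by women
# Labels are taken from the counts themselves so they cannot get out of order
gender_counts = df['GENDER'].value_counts()
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.2f%%')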
plt.title('% of Applications submitted based on Gender')
plt.show()

In [None]:
# This graph shows that the majority of approved applications (label == 0) come from women
approved_gender_counts = df[df['label']==0]['GENDER'].value_counts()
plt.pie(approved_gender_counts, labels=approved_gender_counts.index, autopct='%1.2f%%')
plt.title('% of Applications Approved based on Gender')
plt.show()

In [None]:
# This graph shows that the majority of applicants don't own a car
car_counts = df['Car_Owner'].value_counts()
plt.pie(car_counts, labels=car_counts.index, autopct='%1.2f%%')
plt.title('% of Applications submitted based on owning a Car')
plt.show()

In [None]:
# This graph shows that the majority of applicants own real estate property / a house
property_counts = df['Propert_Owner'].value_counts()
plt.pie(property_counts, labels=property_counts.index, autopct='%1.2f%%')
plt.title('% of Applications submitted based on owning a Real estate property')
plt.show()

In [None]:
# This graph shows that the majority of applicants don't have any children
plt.figure(figsize = (8,8))
plt.pie(df['CHILDREN'].value_counts(), labels=df['CHILDREN'].value_counts().index, autopct='%1.2f%%')
plt.title('% of Applications submitted based on Children count')
plt.legend()
plt.show()

In [None]:
# This graph shows that the majority of applicants' income lies between 1 and 3 lakh
plt.hist(df['Annual_income'], bins=20)
plt.xlabel('Total Annual Income')
plt.title('Histogram')
plt.show()

In [None]:
# This graph shows that the majority of applicants are working professionals
plt.figure(figsize = (8,8))
plt.pie(df['Type_Income'].value_counts(), labels=df['Type_Income'].value_counts().index, autopct='%1.2f%%')
plt.title('% of Applications submitted based on Income Type')
plt.legend()
plt.show()

In [None]:
# This graph shows that the majority of applicants completed secondary education
plt.figure(figsize=(8,8))
plt.pie(df['EDUCATION'].value_counts(), labels=df['EDUCATION'].value_counts().index, autopct='%1.2f%%')
plt.title('% of Applications submitted based on Education')
plt.legend()
plt.show()

In [None]:
# This graph shows that the majority of applicants are 25 to 65 years old
plt.hist(df['AGE_YEARS'], bins=20)
plt.xlabel('Age')
plt.title('Histogram')
plt.show()

In [None]:
# This graph shows that the majority of applicants have been employed for 0 to 7 years
plt.hist(df['YEARS_EMPLOYED'], bins=20)
plt.xlabel('No of Years Employed')
plt.title('Histogram')
plt.show()

# Feature Selection

In [None]:
df.head()

In [None]:
cat_columns

In [None]:
#Converting all Non-Numerical Columns to Numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for col in cat_columns:
    df[col] = le.fit_transform(df[col])
df.head()
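
Refitting a single `LabelEncoder` for every column, as above, keeps only the mapping of the last column it saw. If the original category names are needed later (for example, to read a plot or decode a prediction), one option is to store one encoder per column. The sketch below uses made-up data, and `encoders` is a hypothetical dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"GENDER": ["F", "M", "F"], "Car_Owner": ["Y", "N", "N"]})

encoders = {}
for col in ["GENDER", "Car_Owner"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are numbered in sorted order
    encoders[col] = enc

# The stored encoder can map the integer codes back to the original labels.
original_gender = encoders["GENDER"].inverse_transform(toy["GENDER"])
print(toy)
print(list(original_gender))
```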

In [None]:
df.corr()
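
A full correlation matrix can be hard to read. One way to focus it, sketched below on made-up data, is to take only the column for the target and sort it by absolute value. The frame `toy` and the name `ranked` are hypothetical. Pearson correlation on label-encoded categories is only a rough indicator.

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [1, 0, 1, 0], "label": [0, 0, 1, 1]}
)

# Correlation of every feature with the target, strongest first.
ranked = (
    toy.corr()["label"]
    .drop("label")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```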

In [None]:
features =df.drop(['label'], axis=1)
target = df['label']

In [None]:
features.head()

# Machine Learning Model

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size=0.2,
                                                    random_state = 10)

# Balancing dataset

In [None]:
from sklearn.preprocessing import MinMaxScaler
MMS = MinMaxScaler()
x_train_scaled = pd.DataFrame(MMS.fit_transform(x_train), columns=x_train.columns)
x_test_scaled = pd.DataFrame(MMS.transform(x_test), columns=x_test.columns)
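
The scaler above is fitted on the training split only and then reused for the test split. The sketch below, on made-up numbers, shows what that means: test values are rescaled with the training minimum and maximum, so they can fall outside the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = MinMaxScaler().fit(train)  # learns min=10 and max=30 from the training data
scaled_test = scaler.transform(test)

# Same computation by hand: (x - train_min) / (train_max - train_min)
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaled_test, manual)
```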

In [None]:
# Adding samples to the minority class using SMOTE
from imblearn.over_sampling import SMOTE
oversample = SMOTE()

x_train_oversam, y_train_oversam = oversample.fit_resample(x_train_scaled, y_train)
# The test split is resampled as well, so the scores below are measured on a class-balanced test set that includes synthetic rows
x_test_oversam, y_test_oversam = oversample.fit_resample(x_test_scaled, y_test)

In [None]:
# Original majority and minority class
y_train.value_counts(normalize=True)*100

In [None]:
# after using SMOTE
y_train_oversam.value_counts(normalize=True)*100
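
To give an intuition for what SMOTE does, the sketch below creates synthetic minority samples by interpolating between a point and its nearest minority neighbour. This is a simplified illustration written in NumPy, not imbalanced-learn's implementation (which, among other things, chooses at random among several nearest neighbours). The function `synthesize` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three minority-class points in a 2-D feature space (made-up values).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])


def synthesize(points, n_new, rng):
    """Create n_new samples, each on the segment between a point and its nearest neighbour."""
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        j = int(np.argmin(dists))
        gap = rng.random()  # position along the segment, in [0, 1)
        new_rows.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_rows)


synthetic = synthesize(minority, n_new=5, rng=rng)
print(synthetic.shape)
```

Because every synthetic row lies between two existing minority rows, the new samples stay inside the region already occupied by that class.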

In [None]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

log_model = LogisticRegression()
log_model.fit(x_train_oversam, y_train_oversam)

print('Logistic Model Accuracy : ', log_model.score(x_test_oversam, y_test_oversam)*100, '%')

prediction = log_model.predict(x_test_oversam)
print('\nConfusion matrix :')
print(confusion_matrix(y_test_oversam, prediction))

print('\nClassification report:')
print(classification_report(y_test_oversam, prediction))

# Decision Tree classification

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_model = DecisionTreeClassifier(max_depth=12,min_samples_split=8)

decision_model.fit(x_train_oversam, y_train_oversam)

print('Decision Tree Model Accuracy : ', decision_model.score(x_test_oversam, y_test_oversam)*100, '%')

prediction = decision_model.predict(x_test_oversam)
print('\nConfusion matrix :')
print(confusion_matrix(y_test_oversam, prediction))

print('\nClassification report:')
print(classification_report(y_test_oversam, prediction))

# Random Forest classification

In [None]:
from sklearn.ensemble import RandomForestClassifier

RandomForest_model = RandomForestClassifier(n_estimators=250,
                                            max_depth=12,
                                            min_samples_leaf=16)

RandomForest_model.fit(x_train_oversam, y_train_oversam)

print('Random Forest Model Accuracy : ', RandomForest_model.score(x_test_oversam, y_test_oversam)*100, '%')

prediction = RandomForest_model.predict(x_test_oversam)
print('\nConfusion matrix :')
print(confusion_matrix(y_test_oversam, prediction))

print('\nClassification report:')
print(classification_report(y_test_oversam, prediction))


# Support Vector Machine classification

In [None]:
from sklearn.svm import SVC

svc_model = SVC()

# Fit and evaluate on the same scaled, oversampled data as the other models
svc_model.fit(x_train_oversam, y_train_oversam)

print('Support Vector Classifier Accuracy : ', svc_model.score(x_test_oversam, y_test_oversam)*100, '%')

prediction = svc_model.predict(x_test_oversam)
print('\nConfusion matrix :')
print(confusion_matrix(y_test_oversam, prediction))

print('\nClassification report:')
print(classification_report(y_test_oversam, prediction))

# XGBoost classification

In [None]:
from xgboost import XGBClassifier

XGB_model = XGBClassifier()

XGB_model.fit(x_train_oversam, y_train_oversam)

print('XGBoost Model Accuracy : ', XGB_model.score(x_test_oversam, y_test_oversam)*100, '%')

prediction = XGB_model.predict(x_test_oversam)
print('\nConfusion matrix :')
print(confusion_matrix(y_test_oversam, prediction))

print('\nClassification report:')
print(classification_report(y_test_oversam, prediction))
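
The fit, predict and report steps above are repeated for every model. A minimal sketch of collecting the same metrics in a loop is shown below. It runs on synthetic data from `make_classification` and includes only two scikit-learn models so that it stays self-contained. The names `X_demo`, `y_demo`, `models` and `scores` are assumptions for illustration, and the numbers it prints say nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared features and labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=10
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=12, min_samples_split=8, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred, zero_division=0),
    }

for name, result in scores.items():
    print(name, result)
```

Keeping the results in one dictionary makes it straightforward to compare models on more than one metric.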

# Validation

**K-Fold Cross Validation**

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
kfold = KFold(3)

In [None]:
# Decision Tree classification

results=cross_val_score(decision_model,features,target,cv=kfold)
print(results*100,'\n')

print(np.mean(results)*100)

In [None]:
# Random Forest classification

results=cross_val_score(RandomForest_model,features,target,cv=kfold)
print(results*100,'\n')

print(np.mean(results)*100)

In [None]:
# XGBoost classification

results=cross_val_score(XGB_model,features,target,cv=kfold)
print(results*100,'\n')

print(np.mean(results)*100)
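
The cross-validation above uses plain `KFold` on the unscaled features. A possible variation, sketched below on synthetic data, keeps the class ratio the same in every fold with `StratifiedKFold` and puts the scaler inside a pipeline so that it is refitted on each training fold. The data and the names `X_demo`, `y_demo`, `pipeline` and `cv_scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the prepared features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# The scaler is fitted inside each fold, so no information leaks from the held-out part.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
print(cv_scores * 100)
print(cv_scores.mean() * 100)
```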

# Conclusion

As we have seen, the XGBoost model gives the highest accuracy, 90%, so we will use the XGBoost model for prediction.