# 1. Credit card applications
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, I will build an automatic credit card approval predictor using machine learning techniques.</p>
<p></p>
<p>I'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository. The structure of this notebook is as follows:</p>
<ul>
<li>First, I will start off by loading and viewing the dataset.</li>
<li>I will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, plus that it contains a number of missing entries.</li>
<li>I will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.</li>
<li>After the data is in good shape, I will do some exploratory data analysis to build our intuitions.</li>
<li>Finally, I will build a machine learning model that can predict if an individual's application for a credit card will be accepted.</li>
</ul>
<p>First, loading and viewing the dataset. We find that since this data is confidential, the contributor of the dataset has anonymized the feature names.</p>

In [1]:
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)#, na_values=['na', '?'])

# Inspect data:
cc_apps.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [4]:
#Missing values per column
pd.DataFrame(cc_apps.isna().sum(), columns=[ '#_missing_values'])

#Later on I realized that this might be a bit missleading as the data types are not correct yet and missing values are 
#not properly identified, this will be used once the data type is changed. 

Unnamed: 0,#_missing_values
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [6]:
# Number of approved and rejected card applications in dataset:
cc_apps.iloc[:, -1].value_counts(normalize=True).round(3)#As we can see, the classes are fairly balanced

## Dataset Description 

The following is a description of all the 16 variables (revealed), this is very helpfull in the sense it can help to understand the nature of my data:


 Male          : num  1 1 0 0 0 0 1 0 0 0 ...

 Age           : chr  "58.67" "24.50" "27.83" "20.17" ...

 Debt          : num  4.46 0.5 1.54 5.62 4 ...
 
 Married       : chr  "u" "u" "u" "u" ...
 
 BankCustomer  : chr  "g" "g" "g" "g" ...
 
 EducationLevel: chr  "q" "q" "w" "w" ...
 
 Ethnicity     : chr  "h" "h" "v" "v" ...
 
 YearsEmployed : num  3.04 1.5 3.75 1.71 2.5 ...
 
 PriorDefault  : num  1 1 1 1 1 1 1 1 1 0 ...
 
 Employed      : num  1 0 1 0 0 0 0 0 0 0 ...
 
 CreditScore   : num  6 0 5 0 0 0 0 0 0 0 ...
 
 DriversLicense: chr  "f" "f" "t" "f" ...
 
 Citizen       : chr  "g" "g" "g" "s" ...
 
 ZipCode       : chr  "00043" "00280" "00100" "00120" ...
 
 Income        : num  560 824 3 0 0 ...
 
 Approved      : chr  "+" "+" "+" "+" ...

## 2. Inspecting the applications
<p>Let's try to figure out the most important features of a credit card application. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>. This gives a pretty good starting point, and I can map these features with respect to the columns in the output.   </p>
<p>As it can be seen at first glance, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before that is done, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.</p>

In [7]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info(memory_usage='deep')
print(cc_apps_info)

print("\n")

cc_apps.tail(17)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 non-null int64

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


## 3. Handling the missing values (part i)
There are some issues that will affect the performance of our machine learning model(s) if they go unchanged:
<ul>
<li>The dataset contains both numeric and non-numeric data.</li>
    
<li>The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000.</li>

<li>Finally, the dataset has missing values, which I'll be taken care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.</li>
</ul>
<p>Now, let's temporarily replace these missing value question marks with NaN.</p>

In [10]:
import numpy as np

# Inspect missing values in the dataset
print(cc_apps.tail(17))

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)#intead of replacing with a string NaN I have
#to use the nan representation from np.

print("\n")
print("\n")

# Confirm that there are no missing values left:
print(cc_apps.tail(17))

      0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  NaN  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674    a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675    a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676    a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677    b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678    a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679    a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680    b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681    b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682    b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683    b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684    b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685    b  21

## 4. Handling the missing values (part ii)
<p>I've replaced all the question marks with NaNs. This is going to help in the next missing value treatment.</p>

<p>Now, I am going to impute the missing values with a strategy called mean imputation.</p>

In [11]:
# Impute the missing values with mean imputation ONLY FOR CONINIOUS ATTRIBUTES
cc_apps.fillna(cc_apps.iloc[:,[2,7,10,14]].mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
cc_apps.isna().sum()
#Now I can see the missing values unlike the initial inspection.

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

## 5. Handling the missing values (part iii)
<p>I have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data and this why the mean imputation strategy would not work here. This needs a different treatment. </p>
<p>Now I need to impute these missing values with the most frequent values as present in the respective columns.</p>

In [12]:
# Iterate over each column of cc_apps
for col in [0, 1, 3, 4, 5, 6, 8, 9, 11, 12, 13,15]:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps[col] = cc_apps.iloc[:,col].fillna(cc_apps[col].value_counts().index[0])
#Since the counts of the most frequent values per column are returned in descending order by 
#default, I can select the first indext to get the most frequent value. 



# Count the number of NaNs in the dataset and print the counts to verify
cc_apps.isna().sum()# at this point I should not have missing values.

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

## 6. Preprocessing the data (part i)

<p>There is still some minor but essential data preprocessing needed before I proceed towards building the machine learning model.</p>
<p></p>
<li>Convert the non-numeric data into numeric.</li>
<li>Split the data into train and test sets. </li>
<li>Scale the feature values to a uniform range.</li>

<p>First, I will be converting all the non-numeric values into numeric ones. I will do this by using a technique called <b><i>label encoding.</i></b></p>

In [16]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder:
le= LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in [0, 1, 3, 4, 5, 6, 8, 9, 11, 12, 13,15]:
    # Compare if the dtype is object
    if cc_apps[col].dtypes=='object':
    # Use LabelEncoder to do the numeric transformation (codification)
        cc_apps[col]=le.fit_transform(cc_apps[col])
        

In [17]:
#Sanity check:
cc_apps.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null int32
1     690 non-null int32
2     690 non-null float64
3     690 non-null int32
4     690 non-null int32
5     690 non-null int32
6     690 non-null int32
7     690 non-null float64
8     690 non-null int32
9     690 non-null int32
10    690 non-null int64
11    690 non-null int32
12    690 non-null int32
13    690 non-null int32
14    690 non-null int64
15    690 non-null int32
dtypes: float64(2), int32(12), int64(2)
memory usage: 54.0 KB


## 7. Splitting the dataset into train and test sets
I will remove certain features like DriversLicense and Zipcode which may not be useful for our predictive model.

In [18]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([11, 13], axis=1)#irelevant features removed
cc_apps = np.array(cc_apps)

# Segregate features and labels into separate variables
X,y = cc_apps[:,0:-1] , cc_apps[:,-1]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=.33,
                                random_state=42)

## 8. Preprocessing the data (part ii)

There is only one final preprocessing step left of scaling before we can fit a machine learning model to the data.

Now, let's try to understand what these scaled values mean in the real world. Let's use CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to the range of 0-1.

In [19]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

## 9. Fitting a logistic regression model to the train set

In [20]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

## 10. Making predictions and evaluating performance

We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects. 

In [21]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier:  0.8377192982456141


array([[93, 10],
       [27, 98]], dtype=int64)

## 11. Grid searching and making the model perform better

The model was able to yield an accuracy score of almost 84% which is quite good.

For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.

Let's see if we can do better. We can perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.

I will grid search over the following two:

<p></p>
   <li>tol</li>
   <li>max_iter</li>


In [22]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol=[0.01, 0.001, 0.0001]
max_iter=[100, 150, 200]


# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict({'tol':tol, 'max_iter':max_iter})

## 12. Finding the best performing model

Now, we will begin the grid search to see which values perform best.

We will instantiate GridSearchCV() with our earlier logreg model with all the data we have. Instead of passing train and test sets separately, we will supply X (scaled version) and y. We will also instruct GridSearchCV() to perform a cross-validation of five folds.

We'll end the notebook by storing the best-achieved score and the respective best parameters.

In [23]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))


Best: 0.852174 using {'max_iter': 100, 'tol': 0.01}




## 13. Conclusion

With the help of a gridsearch I have managed to slightly increase the model's accuracy as optimal parameters were used to implement the model. We can also observe that with the help of some data transformations such as scaling, the model can have a better performance and produce a more accurate output. 