# Using a Logistic Regression Machine Learning Model to Predict the  Classification of Credit Card Approval Ratings

<p>I will be applying Machine Learning techniques to a dataset containing credit card applicant information to determine whether the application will be approved or otherwise. This will be a Classification problem, where the target variable "Approval Status" indicates an approved or denied application.</p>
<li>The column names were ommitted in the original data to preserve anonimity, so they have been manually added in.</li>
<li>The data file was found from the UCI Machine Learning Repository. This project was also done in reference to a similair project done with the R programming language;</li>
<p>http://archive.ics.uci.edu/ml/datasets/credit+approval</p>
<p>http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html</p>

In [76]:
import pandas as pd
import numpy as np

card_apps = pd.read_csv("crx.data", header=None)
card_apps.columns = ['Gender','Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel', 'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']
card_apps.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


### Information about the DataFrame at a Glance
<p>We will conduct standard EDA on our dataset to get an idea of what sort of data we are dealing with.</p>
<li>.describe() will give us summary statistics for our numeric columns. This will inform us of the range of the data</li>
<li>.info() gives us information on the data types of the columns, as well as the amount of entries and variables our dataset contains</li>
<li>We can see that we have 690 entries, which tells us we have data relating to 690 credit card applicants. We also have 16 columns, indicating 16 different variables to train our Machine Learning algorithm with</li>

In [77]:
#Displaying summary statistics, 
print(card_apps.describe())
print("\n")
print(card_apps.info())
print("\n")

card_apps.tail(17)

             Debt  YearsEmployed  CreditScore         Income
count  690.000000     690.000000    690.00000     690.000000
mean     4.758725       2.223406      2.40000    1017.385507
std      4.978163       3.346513      4.86294    5210.102598
min      0.000000       0.000000      0.00000       0.000000
25%      1.000000       0.165000      0.00000       0.000000
50%      2.750000       1.000000      0.00000       5.000000
75%      7.207500       2.625000      3.00000     395.500000
max     28.000000      28.500000     67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
Gender            690 non-null object
Age               690 non-null object
Debt              690 non-null float64
Married           690 non-null object
BankCustomer      690 non-null object
EducationLevel    690 non-null object
Ethnicity         690 non-null object
YearsEmployed     690 non-null float64
PriorDefault      690 non-null object


Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


### NaN Identification
<p>We have observed a missing value in the Gender column, signified by a ? since it is catagorical data. We will need to deal with this missing value as we want to keep as much data as we can without dropping rows unnecessarily</p>
<li>We will use .replace() on our dataframe to replace any strings containing '?' with numpy NaN values.</li>

In [78]:
#? in 673rd row. We will replace, then observe
card_apps = card_apps.replace('?', np.NaN)
card_apps.tail(17)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


### Replacing NaN Values
<p>For NaN values in numeric columns, we will use Mean Imputation to replace those NaN values with the Mean value of that column</p>
<p>Categorical and other non-numeric columns will have thier missing values replaced with the most frequent occuring value of that column</p>

In [79]:
#Mean imputation for NaN values in numeric columns
card_apps.fillna(card_apps.mean(), inplace=True)

#Columns of the data type 'object' will have thier missing values replaced with the most frequently occuring value
for col in card_apps:
    if card_apps[col].dtypes == 'object':
        card_apps = card_apps.fillna(card_apps[col].value_counts().index[0])

#We will inspect each columns null value count to ensure that these methods were successful
print(card_apps.isnull().sum())

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
ApprovalStatus    0
dtype: int64


### Data Preprocessing Stage
<p>We will use various scikit-learn modules to preprocess our data in three steps
<li>Non-numeric data needs to be converted to numeric through label encoding
<li>Split the data into train and test sets with train_test_split
<li>Scale the feature values to a uniform range.

In [80]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

#Columns are iterated over. 
#Non-numeric columns will have thier values changed to an interger of either 0,1,2... depending on its order
for col in card_apps:
    if card_apps[col].dtype=='object':
        card_apps[col] = le.fit_transform(card_apps[col])

card_apps.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,1,156,0.0,2,1,13,8,1.25,1,1,1,0,0,68,0,0
1,0,328,4.46,2,1,11,4,3.04,1,1,6,0,0,11,560,0
2,0,89,0.5,2,1,11,4,1.5,1,0,0,0,0,96,824,0
3,1,125,1.54,2,1,13,8,3.75,1,1,5,1,0,31,3,0
4,1,43,5.625,2,1,13,8,1.71,1,0,0,0,2,37,0,0


In [81]:
from sklearn.model_selection import train_test_split

card_apps = card_apps.values

# Segregate features and labels into separate variables
# Columns from beginning to second last are assigned to X, and our target variable is assigned to y 
X,y = card_apps[:,0:14] , card_apps[:,15]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [82]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

### Fitting the Logistic Regression Model to the Data
<p>We will use a Logistic Regression model to assist us in Approval predictions. Logistic regression will allows us to predict a binary outcome in regards to the feature variables we train it with.</p>
<li>There are two outcomes of our target variable; Approved: '+', Denied: '-'</li>

In [83]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression(solver='lbfgs')

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Using our fitted LogReg model for Prediction
<p>We will use the Accuracy Score to measure the model's overall performance, as well as a Confusion Matrix to evaluate the models ability to correctly label applicants as Approved or Denied, and how often it incorrectly labeled applicants</p>
<li>The top right value indicates correctly labelled Approved statuses (True Positives) 
<li>The bottom left value indicates correctly labelled Denied statuses (True Negatives)
<li>The other two values indicate the False Negatives and False Positives the model predicted

In [84]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

Accuracy of logistic regression classifier:  0.8333333333333334
[[92 11]
 [27 98]]


### Conclusion
<li>When applied to the test data, our model predicted the correct classification with 83.77% accuracy.</li>
<li>It was able to correctly label 92 applications as Denied, and 99 as approved.</li>
<li>It labelled 37 applications incorrectly, however</li>
<p>To improve our results, we could apply a grid search over possible hyperparameter values to select the values which yield the best predictions</p>