# Credit approvals


Author: Nick Van Bergen <br><br>
**To view fully rendered notebook click [here](https://nbviewer.org/github/nvbergen/DC_project_CC_apps/blob/566ba446ee359173d6cccd8b81a8f2b942c5a742/Code/Credit%20approval%20Project.ipynb#top)**

This project was inspired by a guided project from [DataCamp](https://app.datacamp.com/learn/projects/558). <br>
The data source is the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/credit+approval). <br> 





<a id = "top"></a>

---
## Contents
1. [Background and problem statement](#background)
1. [Library Imports](#imports)
1. [Data Ingestion](#ingest)
    * Data Description
    * Data set dictionary Column Names
1. [Data set Inspection](#inspect)
1. [Missing Values](#missing)
1. EDA
    * Location
    * Variability
    * Distribution
    * Correlation
1. Data Preprocessing
1. Logistic Regression Classifier
    1. Fit
    1. Score
    1. Grid search
1. [Conclusions and Recommendations](#conclusions)

<a id = "background"></a>

---
## Background and problem statement.
[[Back to contents]](#top) *-* [[Next Section: Library Imports]](#imports)<br> <br>
There was a time when credit card applications were conducted in person or over the phone. Data was collected by a human and then entered into the issuer's system for approval. The goal of a credit card issuer is to, well, _issue credit_. The issuer earns a profit on the interest charged against the money borrowed. There are several risks to the issuer. One of these is issuing credit to a risky borrower who fails in their contractual obligation to repay the lender. <br><br>
Banks believe that there are objective factors that can predict whether a borrower is apt to fail in their obligation, and that these factors should influence an issuer's decision to approve or decline a prospective borrower for a credit card. <br><br>
There are operational risks associated with collecting and inputting data, as well as with decisioning the application, if the process is left entirely to a human operator. Thankfully, we can apply statistical learning methods to past applications to speed up the approval process. It is believed that most, if not all, credit issuers today use some form of **machine learning** to mitigate their business and operational risks. <br><br>
**Problem Statement:** _Can machine learning provide us with an accurate way to predict whether a borrower will be approved for a credit card or not?_ 

<a id = "imports"></a>

---
## Imports
[[Back to contents]](#top) *-* [[Next Section: Data Ingestion]](#ingest)<br> <br>

In [2]:
#analytical and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#modeling tools
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

#model
from sklearn.linear_model import LogisticRegression


#model evaluations
from sklearn.metrics import confusion_matrix

<a id="ingest"></a>

---
## Data Ingestion
[[Back to contents]](#top) *-* [[Next Section: Data Inspection]](#inspect)<br> <br>

In [12]:
apps = pd.read_csv("../data/crx.data", header = None)

In [13]:
apps.head()

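|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |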


### Data Descriptions

The data set has been **anonymized** to protect sensitive information before being released for public analysis. <br> 
> This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. <br>

### Column Names
The website does offer a description of the columns. However, because the attribute names and values have been anonymized, the description gives little to build an intuition from. 

>Attribute Information:<br>
A0: b, a. <br>
A1: continuous. <br>
A2: continuous. <br>
A3: u, y, l, t. <br>
A4: g, p, gg. <br>
A5: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. <br>
A6: v, h, bb, j, n, z, dd, ff, o. <br>
A7: continuous. <br>
A8: t, f. <br>
A9: t, f. <br>
A10: continuous. <br>
A11: t, f. <br>
A12: g, p, s. <br>
A13: continuous. <br>
A14: continuous. <br>
A15: +,- (class attribute)<br>

[Source](https://archive.ics.uci.edu/ml/datasets/credit+approval) <br>
<br>
There are many notebooks that explain _probable_ feature names. One such blog describes the features, in order, as:
>Probable Feature Names: <br> 
A0: `Gender` <br>
A1: `Age` <br>
A2: `Debt` <br>
A3: `Married` <br>
A4: `BankCustomer` <br>
A5: `EducationLevel` <br>
A6: `Ethnicity` <br>
A7: `YearsEmployed` <br>
A8: `PriorDefault` <br>
A9: `Employed` <br>
A10: `CreditScore` <br>
A11: `DriversLicense` <br>
A12: `Citizen` <br>
A13: `ZipCode` <br>
A14: `Income` <br>
A15: `ApprovalStatus` <br> 

[Source](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html)

### Data Dictionary and column headings

Using the above information we will develop a full data dictionary below and map to our dataframe:

| Index | Feature name*    | Type                  | Value Range                                                             |                     Notes                     |
|:-----:|------------------|-----------------------|-------------------------------------------------------------------------|:---------------------------------------------:|
| 0     | _Gender_         | Categorical           | `b`, `a`                                                                | Definitions unknown                           |
| 1     | _Age_            | Numeric, Continuous   |                                                                         | Decimalized, in years                         |
| 2     | _Debt_           | Numeric, Continuous   |                                                                         | Definitions unknown                           |
| 3     | _Married_        | Categorical           | `u`, `y`, `l`, `t`                                                      | Definitions unknown                           |
| 4     | _BankCustomer_   | Categorical           | `g`, `p`, `gg`                                                          | Definitions unknown                           |
| 5     | _EducationLevel_ | Categorical           | `c`, `d`, `cc`, `i`, `j`, `k`, `m`, `r`, `q`, `w`, `x`, `e`, `aa`, `ff` | Definitions unknown                           |
| 6     | _Ethnicity_      | Categorical           | `v`, `h`, `bb`, `j`, `n`, `z`, `dd`, `ff`, `o`                          | Definitions unknown                           |
| 7     | _YearsEmployed_  | Numeric, Continuous   |                                                                         | Decimalized, in years                         |
| 8     | _PriorDefault_   | Categorical           | `t`, `f`                                                                | True = `t` = 1 <br> False = `f` = 0            |
| 9     | _Employed_       | Categorical           | `t`, `f`                                                                | True =  `t` = 1 <br> False = `f` = 0            |
| 10    | _CreditScore_    | Numeric, Continuous   |                                                                         |                                               |
| 11    | _DriversLicense_ | Categorical           | `t`,`f`                                                                 | True =  `t` = 1 <br> False = `f` = 0            |
| 12    | _Citizen_        | Categorical           | `g`, `p`, `s`                                                           | Definitions unknown                           |
| 13    | _ZipCode_        | Numeric, **discrete** |                                                                         | Should be **Categorical** and non-numeric     |
| 14    | _Income_         | Numeric, Continuous   |                                                                         |                                               |
| 15    | _ApprovalStatus_ | Categorical           | `+`,`-`                                                                 | `+` = Approved = 1 <br>  `-` = Not Approved = 0 |

In [19]:
#apply new column names to dataframe
apps.columns = ["Gender", "Age", "Debt", 
"Married", "BankCustomer", "EducationLevel", 
"Ethnicity", "YearsEmployed", "PriorDefault", 
"Employed", "CreditScore", "DriversLicense", 
"Citizen", "ZipCode", "Income", "ApprovalStatus"]
apps.head(3)

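|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |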
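
The Notes column of the data dictionary gives numeric encodings for the `t`/`f` columns and for the target. A minimal sketch of how those encodings could be applied, using a tiny hand-made frame rather than the real data (the names `sample`, `binary_map`, and `target_map` are illustrative and not part of the notebook):

```python
import pandas as pd

# Tiny hand-made sample using the same codes as the data dictionary (illustrative only).
sample = pd.DataFrame({
    "PriorDefault": ["t", "f", "t"],
    "Employed": ["t", "t", "f"],
    "DriversLicense": ["f", "f", "t"],
    "ApprovalStatus": ["+", "-", "+"],
})

# Encodings taken from the Notes column of the data dictionary.
binary_map = {"t": 1, "f": 0}
target_map = {"+": 1, "-": 0}

# Work on a copy so the original codes stay available for inspection.
encoded = sample.copy()
for col in ["PriorDefault", "Employed", "DriversLicense"]:
    encoded[col] = encoded[col].map(binary_map)
encoded["ApprovalStatus"] = encoded["ApprovalStatus"].map(target_map)

print(encoded)
```

Any value that is not a key of the mapping becomes `NaN` under `Series.map`, so unexpected codes would surface as missing values rather than pass through silently.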


<a id="inspect"></a>

---
## Data Inspection
[[Back to contents]](#top) *-* [[Next Section: Missing Values]](#missing)<br> <br>
In this section, I inspect the dataset as a whole to check that the data dictionary matches our data. 


In [20]:
apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    object 
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefault    690 non-null    object 
 9   Employed        690 non-null    object 
 10  CreditScore     690 non-null    int64  
 11  DriversLicense  690 non-null    object 
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    object 
 14  Income          690 non-null    int64  
 15  ApprovalStatus  690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [23]:
apps.describe()

|   | Debt | YearsEmployed | CreditScore | Income |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |


I am supposed to have 6 numeric variables, but the `df.describe()` method only returned four. <br><br>
The `df.describe()` method is useful because, by default, it summarizes only the _numeric_ columns. After applying this method, we see that only _Debt_, _YearsEmployed_, _CreditScore_, and _Income_ were returned. <br><br>
I am missing _Age_ and _ZipCode_, which were not returned due to the datatype being `object` as opposed to `int` or `float`.
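
One way a text column such as _Age_ could be brought back to a numeric type is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a few made-up values in the same shape as the raw column; `age_raw` and `age_numeric` are illustrative names, and the notebook itself may handle this step differently:

```python
import pandas as pd

# Illustrative values shaped like the raw Age column: numeric strings plus a "?" placeholder.
age_raw = pd.Series(["30.83", "58.67", "?", "24.50"], name="Age")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
age_numeric = pd.to_numeric(age_raw, errors="coerce")

print(age_numeric.dtype)
print(age_numeric.describe())
```

After the conversion the column is a float column, so `describe()` reports count, mean, and quantiles for it, and the placeholder is counted as missing instead of as a value.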

In [28]:
#using .difference() we drop the columns that describe() treated as numeric, then call .describe() on the remaining non-numeric columns.
apps[apps.columns.difference(list(apps.describe().columns))].describe()

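|   | Age | ApprovalStatus | BankCustomer | Citizen | DriversLicense | EducationLevel | Employed | Ethnicity | Gender | Married | PriorDefault | ZipCode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 690 | 690 | 690 | 690 | 690 | 690 | 690 | 690 | 690 | 690 | 690 | 690 |
| unique | 350 | 2 | 4 | 3 | 2 | 15 | 2 | 10 | 3 | 4 | 2 | 171 |
| top | ? | - | g | g | f | c | f | v | b | u | t | 0 |
| freq | 12 | 383 | 519 | 625 | 374 | 137 | 395 | 399 | 468 | 519 | 361 | 132 |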
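
A similar non-numeric summary can also be requested directly through the `exclude` parameter of `describe()`. This is a sketch on a small made-up frame (the `sample` frame is illustrative), offered as an alternative to the column-difference approach rather than a replacement for it:

```python
import numpy as np
import pandas as pd

# Small illustrative frame mixing numeric and non-numeric columns.
sample = pd.DataFrame({
    "Gender": ["b", "a", "a", "b"],
    "Age": ["30.83", "58.67", "?", "?"],
    "Debt": [0.0, 4.46, 0.5, 1.54],
})

# Summarize every column that is not numeric: count, unique, top, freq.
non_numeric_summary = sample.describe(exclude=[np.number])
print(non_numeric_summary)
```

Unlike `columns.difference()`, which returns the column labels sorted, this keeps the columns in their original order.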


We got lucky with the _Age_ column: its most frequent value is `?`, which appears 12 times, so the summary reveals how missing values are encoded. <br><br>
Also of note in the _ZipCode_ column is the value `00000`, which indicates a missing value, and the fact that this column has taken the datatype `object`. <br>Is this datatyping **incorrect**, though? At least for _ZipCode_, it seems appropriate not to describe this data numerically, since the number only represents a physical location and not a measurement. <br><br>
We will still need to find and deal with missing values in our dataset before we can explore the data in more detail. 
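
On the point about _ZipCode_ being a label rather than a measurement, one possible treatment is to keep the codes as text and cast them to pandas' `category` dtype. The values below are made up for illustration, and `zip_codes`, `zip_cat`, and `counts` are hypothetical names:

```python
import pandas as pd

# Illustrative ZIP codes stored as text so that leading zeros survive.
zip_codes = pd.Series(["00202", "00043", "00000", "00000", "00280"], name="ZipCode")

# Treat the codes as labels rather than measurements.
zip_cat = zip_codes.astype("category")

# Frequency counts are meaningful for labels; a mean would not be.
counts = zip_cat.value_counts()
print(counts.head())
```

Keeping the codes as strings also means a value such as `00000` stays distinguishable from a numeric zero.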

<a id="missing"></a>

---
## Missing Values
[[Back to contents]](#top) *-* [[Next Section: EDA]](#EDA)<br> <br>
We are hunting down missing values and addressing them by assigning `np.nan` to each datum that is missing.<br><br> 
One of the easiest ways to spot mismatched data is to use `df.sort_values()`.
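
As a small illustration of the sorting idea, the sketch below sorts the distinct values of a made-up column with `sort_values()`; the `gender` series is illustrative and only mimics the codes seen in the data:

```python
import pandas as pd

# Illustrative column: valid codes plus a "?" placeholder.
gender = pd.Series(["b", "a", "?", "b", "a"], name="Gender")

# Sorting the distinct values pushes unexpected symbols to one end of the list.
distinct_sorted = gender.drop_duplicates().sort_values().tolist()
print(distinct_sorted)
```

Because `?` sorts before lowercase letters, it lands at the front of the list, where it is easy to spot. In a column of digit strings it would land at the end instead.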


### First column investigation
The data dictionary shows that there should only be two values `b` or `a` in our first column _Gender_.

In [33]:
apps['Gender'].describe()

count     690
unique      3
top         b
freq      468
Name: Gender, dtype: object

Seeing that we have 3 values instead of 2, let's see what they are.

In [34]:
apps['Gender'].unique()

array(['b', 'a', '?'], dtype=object)

I will replace obvious missing values with NumPy's `NaN` value. 

In [36]:
apps['Gender'] = apps['Gender'].replace('?', np.nan)

In [39]:
apps['Gender'].isna().sum()

12

Going forward, I replace `?` with `np.nan` in every column using the same method as above. 

In [42]:
apps = apps.replace('?', np.nan)

In [43]:
apps.isna().sum()

Gender            12
Age               12
Debt               0
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
ApprovalStatus     0
dtype: int64
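
The same replacement could also be done while the file is read, through the `na_values` parameter of `pd.read_csv`. The sketch below reads two made-up rows laid out like `crx.data` from an in-memory buffer. The rows, the `raw` buffer, and the choice to force column 13 to text are assumptions for illustration:

```python
import io
import pandas as pd

# Two illustrative rows in the same layout as crx.data; the second has "?" placeholders.
raw = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,00202,0,+\n"
    "?,?,0.5,u,g,q,h,1.5,t,f,0,f,g,?,824,-\n"
)

# na_values="?" asks pandas to parse "?" as NaN while reading.
# dtype={13: str} keeps the ZIP code column as text so leading zeros are preserved.
sample = pd.read_csv(raw, header=None, na_values="?", dtype={13: str})

print(sample.isna().sum())
```

A side effect worth noting is that a column whose only non-numeric entries were `?` (such as column 1 here) is parsed as float straight away, so no separate type conversion is needed for it.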

### Next columns
I will go through each column to understand the range of its data and see whether we can identify any other obvious candidates for missing values. 

In [45]:
for col in apps.columns:
    print(apps[col].unique())

['b' 'a' nan]
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' nan '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17' '23.50'
 '47.67' '22.75' '34.42' '28.42' '67.75' '47.42' '36.25' '32.67' '48.58'
 '33.58' '18.83' '26.92' '31.25' '56.50' 
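
For the categorical columns, a more targeted check than reading every unique value is to compare the observed values with the codes listed in the data dictionary. A minimal sketch, assuming a hypothetical `allowed` mapping (only two columns shown) and a made-up `sample` frame:

```python
import pandas as pd

# Allowed codes copied from the data dictionary (two columns shown for brevity).
allowed = {
    "Gender": {"b", "a"},
    "Married": {"u", "y", "l", "t"},
}

# Illustrative frame; "?" and "x" stand in for unexpected entries.
sample = pd.DataFrame({
    "Gender": ["b", "a", "?", "b"],
    "Married": ["u", "y", "x", "u"],
})

# For each column, collect observed values that are not in the allowed set.
unexpected = {
    col: sorted(set(sample[col].dropna()) - codes)
    for col, codes in allowed.items()
}
print(unexpected)
```

An empty list for a column would mean every observed value appears in the dictionary. Anything else is a candidate for a missing-value marker or a data-entry error.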

## Exploratory Data Analysis

## Data Preprocessing

## Machine Learning Model

### Fit Model

### Score Evaluate Model 

### Grid Search

<a id="conclusions"></a>
## Conclusions and Recommendations
---
[[Back to contents]](#top)

Here we discuss the general conclusions made. 

Conclusions!