# Credit-Card Default Risk
#### by Perry Shyr
## _3-of-7. Feature Engineering_
![](../images/cards.png)

## Problem Statement

### Credit-card lenders absorb significant losses from consumer defaults.  This capstone looks for anomalies in customer demographics and borrowing history to identify credit-card default risk.  It is a binary classification problem with unbalanced classes, in which customers who default form the positive class.  For supervised modeling, lenders are probably interested not just in the True-Positive (TP) rate, but also in the False-Negative (FN) and False-Positive (FP) rates.  The best estimator should minimize both FNs and FPs while generalizing well on TPs.
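
### As a minimal sketch of the three rates named above, the helper below derives them from confusion-matrix counts.  The function name `classification_rates` and the labels are hypothetical and used for illustration only; they are not results from this project.

```python
import numpy as np

def classification_rates(y_true, y_pred):
    """Return TP, FN and FP rates for binary labels (1 = default)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)      # defaults correctly flagged
    fn = np.sum(y_true & ~y_pred)     # defaults that were missed
    fp = np.sum(~y_true & y_pred)     # non-defaults wrongly flagged
    tn = np.sum(~y_true & ~y_pred)    # non-defaults correctly passed
    return {
        "tpr": tp / (tp + fn),        # share of actual defaults caught
        "fnr": fn / (tp + fn),        # share of actual defaults missed
        "fpr": fp / (fp + tn),        # share of non-defaults flagged
    }

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
rates = classification_rates(y_true, y_pred)
print(rates)
```

### Note that the TP and FN rates always sum to one, so lowering the FN rate is the same as raising the TP rate; the FP rate is the independent trade-off.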

## Executive Summary

### Neural networks have been found to be the best estimators in the literature.

## A. Code Libraries Used

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## B. Load Data from Notebook-1

### No results from the data-exploration notebook were saved or carried over into this notebook.

In [46]:
accts = pd.read_csv('../assets/credit_data_processed.csv', index_col='ID')

In [47]:
print(accts.shape)     # Review the attributes of the data.
accts.head()

(29965, 24)


ID,credit_limit,gender,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,Oct_Default
1,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


## C. Create new features that might serve as better predictors of the positive class.

### A good candidate feature takes higher values for the positive class than for the negative class.  If the feature is also time-dependent and increases over time, that should help modeling further.

### Two features seem to show these properties.  The first is leverage, defined as how much credit has been extended to an account relative to its credit limit (billed amount divided by credit limit).  The second is balance carried, defined as the multiple of the billed amount over the amount paid toward that balance in the prior period.

In [48]:
accts['leverage_1'] = accts['BILL_AMT1']/accts['credit_limit']   # First new feature, Leverage.
accts['leverage_2'] = accts['BILL_AMT2']/accts['credit_limit']
accts['leverage_3'] = accts['BILL_AMT3']/accts['credit_limit']
accts['leverage_4'] = accts['BILL_AMT4']/accts['credit_limit']
accts['leverage_5'] = accts['BILL_AMT5']/accts['credit_limit']
accts['leverage_6'] = accts['BILL_AMT6']/accts['credit_limit']

In [49]:
accts['bill_to_pay1'] = accts['BILL_AMT1']/accts['PAY_AMT1']   # Second new feature, Balance-carried.
accts['bill_to_pay2'] = accts['BILL_AMT2']/accts['PAY_AMT2']
accts['bill_to_pay3'] = accts['BILL_AMT3']/accts['PAY_AMT3']
accts['bill_to_pay4'] = accts['BILL_AMT4']/accts['PAY_AMT4']
accts['bill_to_pay5'] = accts['BILL_AMT5']/accts['PAY_AMT5']
accts['bill_to_pay6'] = accts['BILL_AMT6']/accts['PAY_AMT6']

In [11]:
print(accts.shape)
accts.head().T

(29965, 36)


ID,1,2,3,4,5
credit_limit,20000.0,120000.0,90000.0,50000.0,50000.0
gender,2.0,2.0,2.0,2.0,1.0
EDUCATION,2.0,2.0,2.0,2.0,2.0
MARRIAGE,1.0,2.0,2.0,1.0,1.0
AGE,24.0,26.0,34.0,37.0,57.0
PAY_0,2.0,-1.0,0.0,0.0,-1.0
PAY_2,2.0,2.0,0.0,0.0,0.0
PAY_3,-1.0,0.0,0.0,0.0,-1.0
PAY_4,-1.0,0.0,0.0,0.0,0.0
PAY_5,-2.0,0.0,0.0,0.0,0.0


### The new features bring the total number of predictor candidates to 35.  However, we now have problem values (nulls, positive infinities and negative infinities).
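
### The problem values come from dividing by a zero payment amount.  The sketch below reproduces the three cases on a hypothetical toy frame (the values are made up and are not rows from the dataset): a zero bill over a zero payment gives a null, while a positive or negative bill over a zero payment gives a positive or negative infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not rows from the dataset.
toy = pd.DataFrame({
    "BILL_AMT": [500.0, 0.0, 500.0, -500.0],
    "PAY_AMT":  [250.0, 0.0,   0.0,    0.0],
})

# Same element-wise division used for the bill_to_pay features.
toy["bill_to_pay"] = toy["BILL_AMT"] / toy["PAY_AMT"]
print(toy)
```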

## D. Resolve the null and (+/-) infinite values.

In [50]:
accts.fillna(0, inplace=True)

### The nulls above are filled with zeroes.  Next, there are positive and negative infinities to resolve.

### Let's find the largest non-infinite value.

In [15]:
accts['bill_to_pay1'].sort_values(ascending=False).head(3500)

ID
1                 inf
5960              inf
5898              inf
5905              inf
13179             inf
5915              inf
27511             inf
5916              inf
13178             inf
24333             inf
5924              inf
13164             inf
5925              inf
27503             inf
5928              inf
5929              inf
5931              inf
13159             inf
27496             inf
13154             inf
13145             inf
24346             inf
13140             inf
24351             inf
13101             inf
5952              inf
5953              inf
13185             inf
5895              inf
13205             inf
             ...     
17022             inf
28980             inf
17028             inf
22688             inf
17628             inf
2852              inf
22573             inf
16773             inf
2850              inf
28662             inf
16771             inf
16769             inf
3075              inf
28707             inf
2896   

In [25]:
accts['bill_to_pay2'].sort_values(ascending=False).head(3230)

ID
5386              inf
4776              inf
21401             inf
21403             inf
21404             inf
4790              inf
21416             inf
4787              inf
4784              inf
4779              inf
21448             inf
21517             inf
21463             inf
4767              inf
21481             inf
4756              inf
21487             inf
4754              inf
21501             inf
4741              inf
4792              inf
21389             inf
4802              inf
21372             inf
21204             inf
21213             inf
21224             inf
21225             inf
21233             inf
21242             inf
             ...     
277               inf
27449             inf
27448             inf
1339              inf
29929             inf
8091     9.212050e+04
2747     8.062700e+04
319      5.095500e+04
21939    5.022100e+04
17613    5.005033e+04
17133    4.637500e+04
84       4.635333e+04
9251     4.352040e+04
25541    4.248675e+04
9685   

In [29]:
accts['bill_to_pay3'].sort_values(ascending=False).head(3420)

ID
1                 inf
16174             inf
16122             inf
16127             inf
16129             inf
16135             inf
16137             inf
16151             inf
27549             inf
27548             inf
16154             inf
16164             inf
16171             inf
16210             inf
16096             inf
16215             inf
16224             inf
16252             inf
16259             inf
27537             inf
16260             inf
16261             inf
16273             inf
16275             inf
16290             inf
16300             inf
16101             inf
16084             inf
27527             inf
15958             inf
             ...     
7960              inf
7962              inf
8105              inf
7965              inf
8076              inf
10554    1.236580e+05
14275    1.128240e+05
13054    9.549900e+04
25898    6.342050e+04
26235    5.192400e+04
8584     4.563325e+04
9921     3.875540e+04
27354    3.415500e+04
9326     3.412750e+04
11117  

In [33]:
accts['bill_to_pay4'].sort_values(ascending=False).head(3620)

ID
3815              inf
26854             inf
8463              inf
26870             inf
2408              inf
3833              inf
14080             inf
14079             inf
2406              inf
8471              inf
798               inf
26859             inf
4729              inf
799               inf
20509             inf
14073             inf
2405              inf
14040             inf
14068             inf
26850             inf
10623             inf
11661             inf
8495              inf
14058             inf
20526             inf
14049             inf
811               inf
8508              inf
3841              inf
14045             inf
             ...     
27923             inf
437               inf
17140             inf
28315             inf
5916              inf
7632              inf
16663             inf
18734             inf
5455              inf
6086              inf
18738             inf
545               inf
3578              inf
27921             inf
2720   

In [39]:
accts['bill_to_pay5'].sort_values(ascending=False).head(3570)

ID
2128              inf
14035             inf
13961             inf
13963             inf
13967             inf
3713              inf
26823             inf
20904             inf
23932             inf
761               inf
13998             inf
26817             inf
14001             inf
14018             inf
20903             inf
14026             inf
3683              inf
3370              inf
26802             inf
3681              inf
26800             inf
14052             inf
14053             inf
14061             inf
3675              inf
26793             inf
14070             inf
23935             inf
3664              inf
14073             inf
             ...     
23493             inf
10293             inf
28302             inf
23018             inf
29666             inf
23170    3.152910e+05
15574    2.635280e+05
27552    7.499900e+04
12321    7.184050e+04
3910     5.599650e+04
18043    5.021700e+04
2655     3.454900e+04
28152    3.348917e+04
20573    3.005200e+04
12823  

In [45]:
accts['bill_to_pay6'].sort_values(ascending=False).head(3480)

ID
10356             inf
5203              inf
27429             inf
27428             inf
20283             inf
20285             inf
20291             inf
5235              inf
20315             inf
20317             inf
1417              inf
20324             inf
1418              inf
5213              inf
20368             inf
5339              inf
20371             inf
20375             inf
20376             inf
20378             inf
27409             inf
27408             inf
20380             inf
20382             inf
27405             inf
20384             inf
20385             inf
20386             inf
20270             inf
20265             inf
             ...     
708               inf
7744              inf
7748              inf
7755              inf
28707             inf
15101             inf
7756              inf
15156             inf
28642             inf
28667             inf
7763              inf
715               inf
15085             inf
714               inf
7767   

### The largest non-infinite values appear to be 68,649.6; 92,120.5; 123,658.0; 202,835.0; 315,291.0 and 134,470.0.
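
### Rather than reading these values off long sorted listings, the finite extremes can be computed directly by masking out the non-finite entries.  This is a minimal sketch: the helper name `finite_extremes` and the toy frame are hypothetical, and it is an alternative to the inspection above, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

def finite_extremes(frame, columns):
    """Smallest and largest finite value per column, ignoring NaN and +/-inf."""
    subset = frame[columns]
    # Non-finite entries become NaN, which min() and max() skip by default.
    finite = subset.where(np.isfinite(subset))
    return pd.DataFrame({"finite_min": finite.min(), "finite_max": finite.max()})

# Hypothetical toy frame standing in for the bill_to_pay columns.
toy = pd.DataFrame({
    "bill_to_pay1": [np.inf, 3.5, -2.0, -np.inf],
    "bill_to_pay2": [10.0, np.inf, np.nan, 4.0],
})
extremes = finite_extremes(toy, ["bill_to_pay1", "bill_to_pay2"])
print(extremes)
```

### On the real data, the call would be `finite_extremes(accts, [f'bill_to_pay{i}' for i in range(1, 7)])`, run before any infinities are replaced.  The `finite_min` column is also useful for the negative infinities handled further down.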

### Let's set the positive infinities to 319,999.0, just above the largest finite value (315,291.0).

In [53]:
accts.replace(np.inf, 319999, inplace=True)

### Next, let's find the most negative non-infinite value.

In [58]:
accts['bill_to_pay1'].sort_values(ascending=False).tail(300)

ID
28194      -1.342105
352        -1.421822
6151       -2.283721
22117      -4.490164
27817     -15.927273
12274     -21.050279
27355     -31.333333
29492     -32.589189
27034     -33.933333
28273     -44.250000
22999     -65.000000
10146    -166.000000
7293     -185.416667
8836    -1913.500000
13754           -inf
28770           -inf
3287            -inf
3859            -inf
1280            -inf
29081           -inf
5437            -inf
6187            -inf
25894           -inf
24186           -inf
6179            -inf
24210           -inf
25711           -inf
25683           -inf
13482           -inf
25669           -inf
            ...     
21379           -inf
27177           -inf
8687            -inf
4481            -inf
22403           -inf
28128           -inf
8675            -inf
6707            -inf
5124            -inf
4498            -inf
15265           -inf
8756            -inf
26161           -inf
1019            -inf
23607           -inf
2209            -inf
28182     

In [62]:
accts['bill_to_pay2'].sort_values(ascending=False).tail(350)

ID
10863     -1.000000
9357      -1.000000
2319      -1.000000
6377      -1.000000
6707      -1.000000
13002     -1.000000
18961     -1.000000
24005     -1.000000
5012      -1.006211
17308     -1.027586
16748     -1.340763
22741     -1.365276
3053      -1.377778
4835      -1.552308
12274     -2.187778
4440      -2.307692
13872     -2.528764
9410      -5.772606
28720    -12.713568
17207    -64.166667
8523    -133.000000
11534   -256.000000
22117   -260.121212
29492   -325.678571
15676   -487.000000
7293    -532.687500
8616    -762.000000
23642          -inf
1796           -inf
21772          -inf
            ...    
19941          -inf
8687           -inf
24867          -inf
16021          -inf
12574          -inf
15270          -inf
18849          -inf
21313          -inf
15265          -inf
10045          -inf
28191          -inf
15256          -inf
19275          -inf
18534          -inf
794            -inf
12004          -inf
23078          -inf
7097           -inf
6168           -i

In [65]:
accts['bill_to_pay3'].sort_values(ascending=False).tail(350)

ID
9254       -1.004959
29699      -1.030833
24090      -1.037400
1035       -1.160000
36         -1.330808
10415      -1.400000
7329       -1.531987
15195      -1.743534
25360      -1.911392
6025       -1.993500
13002      -2.000000
16748      -2.340763
3053       -2.377778
9192       -2.437167
15762      -2.690299
27990      -3.438500
17207      -5.053333
15138      -5.108527
23073      -8.390000
27073     -78.714286
17297     -85.555556
6354     -117.142857
8348     -895.815789
29492   -3843.916667
10787           -inf
5052            -inf
22158           -inf
2959            -inf
20320           -inf
7691            -inf
            ...     
10250           -inf
16021           -inf
6064            -inf
5466            -inf
25874           -inf
5469            -inf
24060           -inf
11935           -inf
29503           -inf
18849           -inf
547             -inf
28761           -inf
10179           -inf
6168            -inf
25467           -inf
19950           -inf
13754     

In [67]:
accts['bill_to_pay4'].sort_values(ascending=False).tail(330)

ID
21257    -423.333333
27073    -486.000000
15214    -914.200000
15138   -1017.444444
2906            -inf
20067           -inf
3350            -inf
15427           -inf
24675           -inf
27401           -inf
20483           -inf
24531           -inf
9830            -inf
15406           -inf
6804            -inf
4869            -inf
5019            -inf
174             -inf
20891           -inf
18454           -inf
10045           -inf
25314           -inf
10024           -inf
15621           -inf
5012            -inf
10023           -inf
10009           -inf
8037            -inf
24667           -inf
14742           -inf
            ...     
26076           -inf
11775           -inf
16855           -inf
23535           -inf
18679           -inf
10750           -inf
1850            -inf
733             -inf
28828           -inf
14185           -inf
4733            -inf
6471            -inf
14179           -inf
25820           -inf
28807           -inf
28805           -inf
728       

In [71]:
accts['bill_to_pay5'].sort_values(ascending=False).tail(400)

ID
9410       -1.156164
27990      -1.188500
24458      -1.203077
24675      -1.203704
10250      -1.207333
7858       -1.645000
36         -1.878788
13984      -2.800587
25360      -3.911392
13002      -4.000000
24667      -4.175862
1012       -6.288538
26969     -17.846154
4693      -22.715789
27924     -63.666667
16748     -76.142857
24520     -98.714286
6699     -125.142857
25288    -237.000000
26839    -237.961538
25115    -555.000000
21257    -590.090909
27073    -632.857143
15138    -902.304348
29492   -6625.875000
15406           -inf
3375            -inf
174             -inf
3350            -inf
15460           -inf
            ...     
11575           -inf
12032           -inf
24352           -inf
24308           -inf
1405            -inf
23836           -inf
12691           -inf
12684           -inf
23880           -inf
7274            -inf
28704           -inf
12653           -inf
12546           -inf
24021           -inf
24036           -inf
28655           -inf
24081     

In [74]:
accts['bill_to_pay6'].sort_values(ascending=False).tail(400)

ID
22873      -4.168831
25360      -4.911392
19313      -4.995272
19871      -5.225532
7858       -6.151163
16051     -18.882353
4693      -47.122530
15138    -840.000000
25115    -944.000000
29492   -6758.928571
21257   -7741.000000
10659           -inf
6354            -inf
18069           -inf
18772           -inf
8453            -inf
29503           -inf
10349           -inf
6489            -inf
26745           -inf
8585            -inf
7728            -inf
26861           -inf
15357           -inf
17741           -inf
13435           -inf
26763           -inf
22171           -inf
29959           -inf
18510           -inf
            ...     
23768           -inf
21302           -inf
3321            -inf
19268           -inf
15050           -inf
27566           -inf
10075           -inf
19275           -inf
17360           -inf
19288           -inf
1849            -inf
1850            -inf
22508           -inf
4778            -inf
21233           -inf
3350            -inf
6739      

### The most negative non-infinite values appear to be -1,913.5; -762.0; -3,843.9; -1,017.4; -6,625.9 and -7,741.0.

### Let's set the negative infinities to -7,999.0, just below the most negative finite value (-7,741.0).

In [75]:
accts.replace(-np.inf, -7999, inplace=True)

### Let's check the numeric-summary function.

In [76]:
accts.describe().T

,count,mean,std,min,25%,50%,75%,max
credit_limit,29965.0,167442.005006,129760.135222,10000.0,50000.0,140000.0,240000.0,1000000.0
gender,29965.0,1.603738,0.489128,1.0,1.0,2.0,2.0,2.0
EDUCATION,29965.0,1.84275,0.744513,1.0,1.0,2.0,2.0,4.0
MARRIAGE,29965.0,1.551877,0.521997,0.0,1.0,2.0,2.0,3.0
AGE,29965.0,35.487969,9.219459,21.0,28.0,34.0,41.0,79.0
PAY_0,29965.0,-0.016753,1.123492,-2.0,-1.0,0.0,0.0,8.0
PAY_2,29965.0,-0.131854,1.196322,-2.0,-1.0,0.0,0.0,8.0
PAY_3,29965.0,-0.164392,1.195878,-2.0,-1.0,0.0,0.0,8.0
PAY_4,29965.0,-0.218922,1.168175,-2.0,-1.0,0.0,0.0,8.0
PAY_5,29965.0,-0.264509,1.13222,-2.0,-1.0,0.0,0.0,8.0


### It looks like the values are valid now.  We can save the values for later use.
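
### A programmatic check can back up the visual inspection of the summary table.  The sketch below counts the entries that are still null or infinite in each numeric column; the helper name `count_non_finite` and the toy frame are hypothetical, while the replacement values mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def count_non_finite(frame):
    """Count NaN, +inf and -inf entries per numeric column."""
    numeric = frame.select_dtypes(include="number")
    return (~np.isfinite(numeric)).sum()

# Hypothetical toy frame; in the notebook this would be called on accts.
toy = pd.DataFrame({"a": [1.0, np.nan, np.inf], "b": [0.5, -np.inf, 3.0]})
before = count_non_finite(toy)

# Same three clean-up steps as above: nulls, then +inf, then -inf.
cleaned = toy.fillna(0).replace(np.inf, 319999).replace(-np.inf, -7999)
after = count_non_finite(cleaned)

print(before)
print(after)
```

### If every count in `after` is zero, the frame is free of nulls and infinities.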

## E. Save the values to a separate file.

In [77]:
accts.to_csv('../assets/credit_new_features.csv')

## Continue to Notebook-4, Model Building.

### We have progressed through the data-science workflow in these first three notebooks.  In the code above, twelve additional columns (two new features, each computed for six periods) have been added as promising candidates for being strong predictors.  The invalid values have been resolved, so the data is ready for the next steps.  We may or may not use all of the features we have so far once feature selection is called for.

### In the next notebook, we will proceed with pre-modeling, test and optimize various classification models, score/predict, save and finally evaluate our choices for production deployment.

## New features introduce multi-collinearity without adding new information.  I could try bootstrapping or removing non-default accounts to match the 6,600 default accounts, but in reality I would never have this option due to the nature of anomalies being rare.  A weighting penalty might be the answer, or using a custom-objective function for False-negative Rate in the loss function.  Either way, I need a loss function tailored to this kind of problem.
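
### One way the weighting penalty mentioned above could look is sketched below.  It is an illustration under stated assumptions, not the loss used in this project: the function names are hypothetical, and the weights follow the common "balanced" heuristic of n_samples / (n_classes * class_count), so the rarer default class counts for more in the loss.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_log_loss(y_true, p_pred, class_weights, eps=1e-12):
    """Binary cross-entropy with each sample scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log() never sees exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    w = np.where(y_true == 1, class_weights[1], class_weights[0])
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.sum(w * losses) / np.sum(w))

# Hypothetical labels: 2 defaults among 10 accounts.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_class_weights(y)
loss = weighted_log_loss(y, [0.5] * 10, weights)
print(weights, loss)
```

### With these weights, a missed default costs four times as much as a misclassified non-default in this toy example, which pushes a model toward a lower False-Negative rate at the expense of more False Positives.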

# I should REPLACE the 'LIMIT_BAL', 'BILL_AMT' and 'PAY_AMT' features before saving the processed data!