# Credit Card Approvals

## 1. Introduction

A major challenge financial institutions face is deciding whom to give a credit card to. The decision is difficult because it is closely tied to the survival of the company itself: if an institution issues too many credit cards with high enough limits, it is unlikely to last. On top of this, we should consider all the factors that may affect the decision, such as wages, age, time in the workforce and so forth.

This notebook aims to explore the dataset, prepare the data, apply machine learning algorithms, and measure their performance.

The UCI Credit Approval dataset used here is available at http://archive.ics.uci.edu/ml/datasets/credit+approval

###### This notebook is based on [Sayak Paul's Datacamp Project](https://www.datacamp.com/projects/558).


![Credit card being swiped through a card machine](https://images.immedia.com.br//32/32416_2_EL.jpg)

### Import the data

The data is loaded from <code>datasets/cc_approvals.data</code>. The file has no header row, and missing values are marked with <code>?</code>.

In [394]:
import pandas as pd

df = pd.read_csv('datasets/cc_approvals.data', header = None, na_values = '?')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     678 non-null object
1     678 non-null float64
2     690 non-null float64
3     684 non-null object
4     684 non-null object
5     681 non-null object
6     681 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    677 non-null float64
14    690 non-null int64
15    690 non-null object
dtypes: float64(4), int64(2), object(10)
memory usage: 86.3+ KB


As we can see, the column names are not defined. This is done to keep the data anonymous. There are also some missing values and we'll have to deal with them. We could either ignore them or find good values to replace them with. The latter option is called imputing and is the choice we make here, so we don't lose information about our dataset.
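
To make the trade-off between dropping and imputing concrete, here is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from the dataset). It counts missing entries per column and shows how many rows would be lost if incomplete rows were simply dropped:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    0: ['a', 'b', np.nan, 'b'],
    1: [30.5, np.nan, 24.0, np.nan],
    2: [1.0, 2.0, 3.0, 4.0],
})

# Number of missing entries per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows that would be lost if every incomplete row were dropped instead.
rows_lost = len(toy) - len(toy.dropna())
print(rows_lost)
```

On this toy frame three of the four rows contain at least one missing value, which is the kind of information loss that imputing avoids.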

### Imputing Missing data

To deal with the missing data in the categorical columns, we're going to fill them with the most frequent value in the column. For the numerical columns, we'll replace the <code>NaN</code> values with the column mean. Another option is to estimate each missing value from the values of other columns. There are different ways to do this, such as hot deck, cold deck and regression imputation.

For a first version, though, this is a good benchmark. After that, we may want to compare the application of the Hot Deck technique and see the additional benefits it provides, if any.

In [395]:
for col in df.columns:
    
    # For string columns, replace NaN with the most frequent value in the column.
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].value_counts().index[0])
        
    # For numerical columns, replace nan with the mean.
    else:
        df[col] = df[col].fillna(df[col].mean())
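
The same mode-and-mean strategy can also be written with scikit-learn's <code>SimpleImputer</code>. The following is a minimal sketch on a made-up toy frame (the column names <code>category</code> and <code>amount</code> are illustrative and not part of the dataset); it is an alternative formulation, not the approach this notebook uses:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one text column and one numeric column (made-up values).
toy = pd.DataFrame({
    'category': ['u', 'y', np.nan, 'u'],
    'amount': [1.0, np.nan, 3.0, 5.0],
})

# Most frequent value for text columns, mean for numeric columns.
cat_imputer = SimpleImputer(strategy='most_frequent')
num_imputer = SimpleImputer(strategy='mean')

# fit_transform returns a 2-D array, so flatten it back into a column.
toy['category'] = cat_imputer.fit_transform(toy[['category']]).ravel()
toy['amount'] = num_imputer.fit_transform(toy[['amount']]).ravel()

print(toy)
```

One possible advantage of an imputer object is that it can be fitted on the training split and then reused unchanged on the test split.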

### Renaming columns

The columns are not identified for privacy purposes, but [this post by Ryan Kuhn](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) suggests that they are probably <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally <code>ApprovalStatus</code>. So let's rename the DataFrame's columns accordingly.

In [396]:
df.columns = ['Gender', 'Age', 'Debt', 'Married',
              'BankCustomer', 'EducationLevel', 'Ethnicity',
              'YearsEmployed', 'PriorDefault',
              'Employed', 'CreditScore', 'DriversLicense',
              'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']

df.head()

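|   | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | + |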


### Feature Selection

We assume here that <code>Gender</code>, <code>ZipCode</code> and <code>Ethnicity</code> should not be factors in the decision to approve a credit card, so we drop these columns. This step is called feature selection.

In [397]:
df = df.drop(['Gender', 'ZipCode', 'Ethnicity'], axis = 1)

### Preprocessing the data

To transform the text columns into numbers we'll use <code>LabelEncoder</code> from scikit-learn. We do this for faster computation and because scikit-learn estimators work only with numerical variables. For the target variable, <code>ApprovalStatus</code>, we'll do the transformation explicitly in order to keep the equivalence '+' : 1 and '-' : 0.

In [398]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

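for col in df.columns:
    if col == 'ApprovalStatus':
        # Map the target explicitly so that '+' becomes 1 and '-' becomes 0.
        df[col] = df[col].map({'+': 1, '-': 0})
    elif df[col].dtype == 'object':
        # Encode every other text column as integer codes.
        df[col] = le.fit_transform(df[col])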

df.tail(20)

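|   | Age | Debt | Married | BankCustomer | EducationLevel | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 670 | 47.17 | 5.835 | 1 | 0 | 12 | 5.5 | 0 | 0 | 0 | 0 | 0 | 150 | 0 |
| 671 | 25.83 | 12.835 | 1 | 0 | 2 | 0.5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 672 | 50.25 | 0.835 | 1 | 0 | 0 | 0.5 | 0 | 0 | 0 | 1 | 0 | 117 | 0 |
| 673 | 29.5 | 2.0 | 2 | 2 | 4 | 2.0 | 0 | 0 | 0 | 0 | 0 | 17 | 0 |
| 674 | 37.33 | 2.5 | 1 | 0 | 6 | 0.21 | 0 | 0 | 0 | 0 | 0 | 246 | 0 |
| 675 | 41.58 | 1.04 | 1 | 0 | 0 | 0.665 | 0 | 0 | 0 | 0 | 0 | 237 | 0 |
| 676 | 30.58 | 10.665 | 1 | 0 | 10 | 0.085 | 0 | 1 | 12 | 1 | 0 | 3 | 0 |
| 677 | 19.42 | 7.25 | 1 | 0 | 9 | 0.04 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 678 | 17.92 | 10.21 | 1 | 0 | 5 | 0.0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 |
| 679 | 20.08 | 1.25 | 1 | 0 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |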
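
To see what <code>LabelEncoder</code> does to a single column, here is a small sketch on made-up category values (the list <code>values</code> is illustrative only). The encoder sorts the distinct values and assigns each one an integer code:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values, for illustration only.
values = ['u', 'y', 'l', 'u', 'y']

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the distinct values in sorted order; the code is the position.
print(list(encoder.classes_))
print(list(codes))

# The mapping can be reversed while the fitted encoder is still available.
print(list(encoder.inverse_transform(codes)))
```

Note that the integer codes follow sorted order and carry no ordinal meaning. Also, when a single encoder object is refitted for each column, only the mapping of the last column it was fitted on remains stored in <code>classes_</code>.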


### Defining X and y

The last column will be defined as the target variable (y). All the others will become the features array (X).

In [399]:
df_arr = df.values

X, y = df_arr[:,:-1] , df_arr[:,-1]

### Splitting the data between train and test sets

Let's split our data into train and test sets, so we can validate our model after fitting it.

In [400]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3)
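
Without a fixed seed, each run produces a different split and therefore slightly different scores. As a hedged sketch on synthetic arrays (<code>X_toy</code> and <code>y_toy</code> are made up), <code>random_state</code> makes the split repeatable and <code>stratify</code> keeps the class proportions similar in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, 12 negatives and 8 positives.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# random_state fixes the shuffle; stratify preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

# Calling it again with the same seed yields the identical split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

Whether stratification is needed here is a judgment call; it matters most when one class is much rarer than the other.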

### Scaling the data

Since the value ranges differ widely from column to column, we may use a scaler to put all features on a common scale, because some models are sensitive to differences in feature magnitude. To do this, we'll import <code>MinMaxScaler</code>.

In [401]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range = (0,1))

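# Fit the scaler on the training data only, then apply the same
# transformation to the test data so no test information leaks in.
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)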
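
A scaler is normally fitted on the training set only and then applied unchanged to the test set. The following toy sketch (the arrays <code>train</code> and <code>test</code> are made up) shows how refitting on the test set silently changes what a scaled value means:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10
test_scaled = scaler.transform(test)        # reuses the training min and max

# Refitting on the test set uses min=5 and max=20 instead.
refit_scaled = MinMaxScaler().fit_transform(test)

print(test_scaled.ravel())
print(refit_scaled.ravel())
```

With the training statistics, the raw value 5 maps to 0.5 and 20 maps to 2.0; values outside the range [0, 1] are expected for test points beyond the training range. After refitting, the same raw value 5 maps to 0, so train and test features would no longer be on the same scale.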

### Fitting a Random Forest Classifier

In [402]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators = 100,
                             oob_score = True,
                             random_state = 42)

clf.fit(scaled_X_train, y_train)

y_pred = clf.predict(scaled_X_test)

y_pred

array([1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 0., 1., 0., 1., 1.,
       0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       1., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0.,
       0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
       1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1.,
       0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 1., 1., 1., 0., 0.,
       1., 0., 0., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1.,
       1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1.,
       1., 0., 0.])
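
The classifier is created with <code>oob_score = True</code>, which asks the forest to estimate its accuracy from the samples each tree did not see during bootstrapping. A minimal sketch of reading that estimate, using synthetic data from <code>make_classification</code> rather than the credit card data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the credit card dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100,
                                oob_score=True,
                                random_state=42)
forest.fit(X_toy, y_toy)

# Accuracy estimated on out-of-bag samples, available after fitting.
oob_accuracy = forest.oob_score_
print(oob_accuracy)
```

The out-of-bag estimate needs no separate validation set, so it can serve as a quick sanity check alongside the test accuracy.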

In [403]:
from sklearn.metrics import confusion_matrix

print("Accuracy of Random Forest Classifier: ", clf.score(scaled_X_test, y_test))

# Print the confusion matrix of the random forest model
print(confusion_matrix(y_test, y_pred))

Accuracy of Random Forest Classifier:  0.8792270531400966
[[100  16]
 [  9  82]]
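
The accuracy can be recomputed from the confusion matrix, along with precision and recall. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, with label 1 (approved) treated as the positive class:

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = np.array([[100, 16],
               [9, 82]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted approvals that are correct
recall = tp / (tp + fn)           # share of true approvals that are found

print(accuracy, precision, recall)
```

Here accuracy is 182/207, which matches the reported score, while precision is 82/98 and recall is 82/91.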


### Optimizing results: hyperparameter tuning

We can use GridSearchCV to tune the Random Forest's hyperparameters. GridSearchCV evaluates every combination of values in the parameter grid and scores each one with cross-validation, which helps guard against choosing parameters that overfit a single split.
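
To get a sense of how much work the search involves, the number of combinations can be counted with <code>ParameterGrid</code>. This sketch repeats the grid used in the next cell and assumes 3-fold cross-validation, as in that cell:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search cell.
params = {
    'n_estimators': [200],
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'oob_score': [True],
    'max_features': [2, 3, 4, 5, 8, 10],
}

# One candidate per combination: 1 * 10 * 1 * 6.
n_candidates = len(ParameterGrid(params))

# Each candidate is fitted once per fold (cv = 3).
n_fits = n_candidates * 3

print(n_candidates, n_fits)
```

With the default <code>refit=True</code>, one more fit on the whole training set follows, using the best combination found.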


In [419]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators':[200],
    'max_depth':[2,3,4,5,6,7,8,9,10,11],
    'oob_score':[True],
    'max_features':[2,3,4,5,8,10]
}

gscv = GridSearchCV(clf, params, cv = 3, n_jobs = -1)

gscv_result = gscv.fit(scaled_X_train,y_train)

gscv_report = gscv.cv_results_

y_gscv = gscv.predict(scaled_X_test)

### Evaluating performance

In [420]:
print("Accuracy of Random Forest Classifier on train data: ", gscv.score(scaled_X_train, y_train))

print("Accuracy of Random Forest Classifier on test data: ", gscv.score(scaled_X_test, y_test))

print(confusion_matrix(y_test, y_gscv))

best_score, best_params = gscv_result.best_score_, gscv_result.best_params_

print("Best: %f using %s" % (best_score, best_params))

Accuracy of Random Forest Classifier on train data:  0.9751552795031055
Accuracy of Random Forest Classifier on test data:  0.8792270531400966
[[103  13]
 [ 12  79]]
Best: 0.873706 using {'max_depth': 8, 'max_features': 2, 'n_estimators': 200, 'oob_score': True}
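
The search cell stores <code>cv_results_</code> in <code>gscv_report</code> but never inspects it. As a hedged sketch of how such a report could be read, the example below runs a much smaller search on synthetic data (<code>make_classification</code>, a four-candidate grid and 20 trees are all illustrative choices) and ranks the candidates by mean cross-validation score:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately tiny grid, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {'max_depth': [2, 4], 'max_features': [2, 4]},
    cv=3,
)
search.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
report = pd.DataFrame(search.cv_results_)
cols = ['param_max_depth', 'param_max_features',
        'mean_test_score', 'std_test_score', 'rank_test_score']
top = report[cols].sort_values('rank_test_score')

print(top)
```

Looking at the spread of <code>mean_test_score</code> and <code>std_test_score</code> across candidates can show whether the best parameters are clearly better or only marginally ahead of the rest.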
