# German Credit Analysis, Using DNN

# Summary of the project

The purpose of this project is to replicate the findings in "Combining Feature Selection and Neural Networks for Solving Classification Problems," O'Dea, P., Griffith, J., O'Riordan, C. This paper used feature selection and Neural Networks to solve a classification problem using the German Credit data set.

This approach took the form of two phases. The first phase used information theory to determine which attributes of the German Credit data set are the most important in classifying a data record as good credit or bad credit. The second phase processed the data into a form usable by the neural network model, trained the model using 97.5% of the German Credit data set, and finally tested the model using the remaining 2.5% of the data set.

# Introduction to Data Mining Methods and their Applications

There are many different data mining methods used to solve many different problems. Many of these methods are outlined in Provost & Fawcett, pages 20-23:

- Data mining is applied to "classification and association rules, and to item-set recognition and sequential pattern recognition problems" (O'Dea, p. 1).


- When deploying classification methods, the key objective is to predict which class an item belongs to based on a set of attributes (Provost & Fawcett, p. 20). Applications of classification include predicting customer churn. Examples of classification methods are neural networks, score cards, and decision trees.


- Regression, or "value estimation" (Provost & Fawcett, p. 21), is a technique that attempts to predict a numerical value based on past data. The technique often uses a linear mathematical model, commonly fitted with the least squares approach, that predicts the dependent variable from one or more independent variables. Applications of regression models include predicting the temperature on a given day based on previous days (or seasonal data).


- Similarity matching uses known data about items and attempts to identify similar items. Applications of similarity matching include finding customers that are similar to your best customer.


- Clustering is an unsupervised method (no target attribute) that attempts to group like items together. Applications of clustering include finding segments in groups of customers.


- Co-occurrence grouping is a data mining method that attempts to associate items based on the transactions they appear in. It is also called association rule discovery or market-basket analysis. An application of co-occurrence grouping is to suggest products to customers based on the items they have in their shopping cart. A famous example of co-occurrence grouping is the diapers and beer association (Whitehorn, 2006).


- Profiling is a data mining method that tries to describe the behavior of individuals. It is most useful in "fraud detection and monitoring intrusion detection" (Provost & Fawcett, p. 22).


- Link prediction attempts to determine whether or not a link between items should exist based on other associated links. A common application of link prediction is friend suggestions in Facebook.


- Data reduction is a data mining method that attempts to reduce the complexity and volume of a large data set to a more manageable size by pruning the less important information. Data reduction often results in data loss, but at the gain of performance and easier understanding of the data set involved.


- Causal modeling tries to find links between events and actions.

# Techniques and Approaches

The techniques used in this model include data segmentation, data reduction, and classification. The first phase of the project is to determine the most important attributes to use when training and testing the model, and reduce the data set to those attributes. This is done using Information Theory.

The second phase is to process the reduced data set into a form that can be easily used by the model. This is done by binning the numerical attributes into a small number of intervals.

# Cross Industry Standard Process for Data Mining (CRISP-DM)

We need a process for data mining in order to have a reasonable amount of “consistency, repeatability, and objectiveness” (Provost, 2013, p. 27) of the outcomes. The CRISP-DM method is one of the most popular data mining processes used in the industry. The CRISP-DM process is an iterative process. Each iteration helps to inform about the data.


## Business Understanding

The business understanding defines the “problem to be solved” (p. 28). The problem here is a feature selection and classification problem: using the German dataset, we need to predict whether an applicant is a “good or bad credit risk” (O’Dea, Griffith, O’Riordan; p. 6).





Some necessary libraries:

In [36]:
from __future__ import division
from math import log
import pandas as pd

## Data Understanding

The data is composed of twenty attributes.

In [37]:
german = pd.read_csv('MGMT635_GermanCreditData.csv')

In [38]:
german.head()

Unnamed: 0,status,duration,credit_history,purpose,credit_amount,savings,employment_duration,installment_rate,personal_status,debtors,...,property,age,installment_plans,housing,existing_credits,job,liable_people,telephone,foreign_worker,target
0,11,6,34,43,1169,65,75,4,93,101,...,121,67,143,152,2,173,1,192,201,1
1,12,48,32,43,5951,61,73,2,92,101,...,121,22,143,152,1,173,1,191,201,2
2,14,12,34,46,2096,61,74,2,93,101,...,121,49,143,152,1,172,2,191,201,1
3,11,42,32,42,7882,61,74,2,93,103,...,122,45,143,153,1,173,2,191,201,1
4,11,24,33,40,4870,61,73,3,93,101,...,124,53,143,153,2,173,2,191,201,2


There are two types of attributes: categorical attributes, listed under key 1 of the dictionary below, and numerical attributes, listed under key 2.

In [39]:
attribute_dict = {1:['status', 'credit_history', 'purpose', 'savings',
                    'employment_duration', 'personal_status', 'debtors','property', 'installment_plans', 'housing', 
                    'job', 'telephone', 'foreign_worker'], 
                  2:['duration', 'credit_amount', 'installment_rate', 'residence', 'age', 'existing_credits',
                     'liable_people']}

A good way to get insights into the data set is to get some basic statistics. Below is a display of some basic statistical information, which tells us things such as the mean, standard deviation, minimum, and maximum values. For categorical data, the mode would be a more useful statistic.

In [40]:
german.describe()

Unnamed: 0,status,duration,credit_history,purpose,credit_amount,savings,employment_duration,installment_rate,personal_status,debtors,...,property,age,installment_plans,housing,existing_credits,job,liable_people,telephone,foreign_worker,target
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,12.577,20.903,32.545,47.148,3271.258,62.105,73.384,2.973,92.682,101.145,...,122.358,35.546,142.675,151.929,1.407,172.904,1.155,191.404,201.037,1.3
std,1.257638,12.058814,1.08312,40.095333,2822.736876,1.580023,1.208306,1.118715,0.70808,0.477706,...,1.050209,11.375469,0.705601,0.531264,0.577654,0.653614,0.362086,0.490943,0.188856,0.458487
min,11.0,4.0,30.0,40.0,250.0,61.0,71.0,1.0,91.0,101.0,...,121.0,19.0,141.0,151.0,1.0,171.0,1.0,191.0,201.0,1.0
25%,11.0,12.0,32.0,41.0,1365.5,61.0,73.0,2.0,92.0,101.0,...,121.0,27.0,143.0,152.0,1.0,173.0,1.0,191.0,201.0,1.0
50%,12.0,18.0,32.0,42.0,2319.5,61.0,73.0,3.0,93.0,101.0,...,122.0,33.0,143.0,152.0,1.0,173.0,1.0,191.0,201.0,1.0
75%,14.0,24.0,34.0,43.0,3972.25,63.0,75.0,4.0,93.0,101.0,...,123.0,42.0,143.0,152.0,2.0,173.0,1.0,192.0,201.0,2.0
max,14.0,72.0,34.0,410.0,18424.0,65.0,75.0,4.0,94.0,103.0,...,124.0,75.0,143.0,153.0,4.0,174.0,2.0,192.0,202.0,2.0
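
For the categorical attributes, the mean and standard deviation reported by `describe()` carry little meaning because the values are category codes. A minimal sketch of computing the mode instead is shown here, using a small hypothetical frame in place of the full data set (the column names match the data set, but the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for two categorical columns of `german`.
sample = pd.DataFrame({
    "status": [11, 12, 14, 11, 14, 14],
    "housing": [152, 152, 153, 152, 151, 152],
})

# DataFrame.mode() returns one row per modal value; row 0 holds the first mode.
modes = sample.mode().iloc[0]
print(modes)
```

On the real data the same call could be restricted to the categorical columns with `german[attribute_dict[1]].mode()`.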


## Data Preparation

Using entropy and information gain, seven attributes were selected as the most useful in determining the creditworthiness of a candidate. The attributes selected were: the status of existing checking account, the credit duration in months, credit history, credit amount, savings accounts/bonds, housing (rent, own, for free), and whether or not the person is a foreign worker.

O’Dea et al. used the thermometer coding scheme to represent the attribute values. For example, for the status attribute, the value less-200DM was coded as {001}, over-200DM as {011}, and no-account as {111}. Continuous data, such as duration and credit_amount, were binned (or bucketed) in order to aggregate the data into a more useful form for the model. Table 2 of O’Dea et al. shows the binary representation used for the inputs to the neural network (25 inputs in all).
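
As a rough illustration of thermometer coding, the sketch below encodes an ordinal level as a fixed-width bit string in which each higher level switches on one more bit. The function name `thermometer_encode` and the assignment of ranks 1 to 3 to the three status values are assumptions made for this example; Table 2 of the paper remains the authoritative mapping.

```python
def thermometer_encode(rank, width):
    """Return a thermometer code: `rank` ones, right-aligned in `width` bits."""
    if not 0 <= rank <= width:
        raise ValueError("rank must be between 0 and width")
    return "0" * (width - rank) + "1" * rank

# Assumed ranks for the status values quoted above (illustrative only).
status_rank = {"less-200DM": 1, "over-200DM": 2, "no-account": 3}
status_codes = {name: thermometer_encode(rank, 3) for name, rank in status_rank.items()}
print(status_codes)
# {'less-200DM': '001', 'over-200DM': '011', 'no-account': '111'}
```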

In our project we demonstrate how feature selection is used to discover interesting attributes to use for the model, how binning can transform data from numerical to categorical, and finally how the data can be shuffled and split into training and testing sets.

### Feature Selection
Using information theory, as outlined in Provost and Fawcett (2013), we calculated the information gain for each of the attributes. As an example, we display the calculations for the attribute status.

First, based on the data set, the overall probabilities of good credit or bad credit can be calculated:

In [41]:
p_parent_good = german.where(german.target==1).dropna().shape[0]/german.shape[0]
print("Probability of good credit = {}".format(p_parent_good))
p_parent_bad = german.where(german.target==2).dropna().shape[0]/german.shape[0]
print("Probability of bad credit = {}".format(p_parent_bad))

Probability of good credit = 0.7
Probability of bad credit = 0.3


The total entropy of the data set is calculated:

In [42]:
parent_entropy = - (p_parent_good * log(p_parent_good, 2) + p_parent_bad * log(p_parent_bad, 2))
print("Parent Entropy = {}".format(parent_entropy))

Parent Entropy = 0.8812908992306927


In order to understand how informative the attribute is we need to calculate the information gained. This is done by calculating how much the attribute reduces the entropy of the segmentations created by splitting the data set along the values of the attribute.
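
In symbols, following the entropy-based definition in Provost and Fawcett (2013), with $p_i$ the proportion of class $i$ in a set $S$ and $c_1, \dots, c_k$ the children produced by splitting the parent set on the attribute:

$$
H(S) = -\sum_i p_i \log_2 p_i,
\qquad
IG(\text{parent}, \text{children}) = H(\text{parent}) - \sum_{j=1}^{k} p(c_j)\, H(c_j)
$$

For the whole data set, $H = -(0.7 \log_2 0.7 + 0.3 \log_2 0.3) \approx 0.881$, which matches the parent entropy computed above.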

In [43]:
feature_values_for_status = [11,12,13,14]

status_value_series = {}
#Split the data set along the values of the attributes.
for value in feature_values_for_status:
    status_value_series[value] = german.where(german['status']==value).dropna().target

IG_children = 0
for key, series in status_value_series.items():
    p_status_value = series.shape[0] / german.shape[0]
    p_series_good = series.where(series==1).dropna().shape[0] / series.shape[0]
    p_series_bad = series.where(series==2).dropna().shape[0] / series.shape[0]
    entropy_child = -(p_series_good * log(p_series_good, 2) + p_series_bad * log(p_series_bad, 2))
    IG_children = IG_children + (p_status_value * entropy_child)
    print("Probability of value {}: {}".format(key, p_status_value))
    print("Probability for value {}, to have good credit: {}".format(key, p_series_good))
    print("Probability for value {}, to have bad credit: {}".format(key, p_series_bad))
    print("Entropy of child with value {}: {}".format(key, entropy_child))
    print("--------------------------------------------------------")
    

Probability of value 11: 0.274
Probability for value 11, to have good credit: 0.5072992700729927
Probability for value 11, to have bad credit: 0.4927007299270073
Entropy of child with value 11: 0.9998462628494693
--------------------------------------------------------
Probability of value 12: 0.269
Probability for value 12, to have good credit: 0.6096654275092936
Probability for value 12, to have bad credit: 0.3903345724907063
Entropy of child with value 12: 0.9650151205034324
--------------------------------------------------------
Probability of value 13: 0.063
Probability for value 13, to have good credit: 0.7777777777777778
Probability for value 13, to have bad credit: 0.2222222222222222
Entropy of child with value 13: 0.7642045065086203
--------------------------------------------------------
Probability of value 14: 0.394
Probability for value 14, to have good credit: 0.883248730964467
Probability for value 14, to have bad credit: 0.116751269035533
Entropy of child with value 14

We use the sum of the products of the children entropies and probabilities and subtract it from the entropy of the parent:

In [44]:
IG = parent_entropy - IG_children
print("Information Gain for attribute status: {}".format(IG))

Information Gain for attribute status: 0.09473884155263945
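
The same calculation can be wrapped in a reusable function so that every attribute is ranked in the same way. The sketch below is one possible formulation in plain Python; the names `entropy` and `information_gain` and the toy rows are illustrative and are not taken from the original notebook. It only iterates over classes that actually occur in a split, so it never evaluates the logarithm of zero for a child that contains a single class.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    # Counter only holds classes that occur, so log2(0) is never evaluated.
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of each child split."""
    children = {}
    for value, label in zip(values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(c) / len(labels) * entropy(c) for c in children.values())
    return entropy(labels) - weighted

# Toy data: attribute "a" separates the classes perfectly, "b" not at all.
target = [1, 1, 2, 2]
attr_a = [11, 11, 12, 12]
attr_b = [11, 12, 11, 12]
print(information_gain(attr_a, target))  # 1.0
print(information_gain(attr_b, target))  # 0.0
```

With the real data, a call such as `information_gain(german['status'], german['target'])` would be expected to reproduce the value printed above for status.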


### Binning Attributes

It is useful to bin (or bucket) numerical data. You lose some information doing this, but the outcome is more useful for the neural network model.

In [45]:
german.duration = pd.cut(german.duration, bins=4, labels=False, include_lowest=True)
german.credit_amount = pd.cut(german.credit_amount, bins=4, labels=False, include_lowest=True)
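
To see what this call does, the sketch below applies the same `pd.cut` arguments to a handful of made-up durations that span the observed range of 4 to 72 months. With `bins=4`, pandas divides that range into four equal-width intervals and, because `labels=False`, returns the integer index of the interval rather than an interval label.

```python
import pandas as pd

# Made-up durations covering the minimum (4) and maximum (72) seen in the data.
durations = pd.Series([4, 12, 24, 48, 72])

# Four equal-width bins over [4, 72]; labels=False yields integer bin indices.
binned = pd.cut(durations, bins=4, labels=False, include_lowest=True)
print(binned.tolist())  # [0, 0, 1, 2, 3]
```

Equal-width bins are sensitive to outliers: a skewed attribute such as credit_amount will place most records in the lowest bin.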

##### The attributes noted as most relevant in the paper are:

In [46]:
german2 = german[['duration', 'credit_history', 'credit_amount', 'savings', 
                  'status', 'housing', 'foreign_worker', 'target']]

In [47]:
german2.head()


Unnamed: 0,duration,credit_history,credit_amount,savings,status,housing,foreign_worker,target
0,0,34,0,65,11,152,201,1
1,2,32,1,61,12,152,201,2
2,0,34,0,61,14,152,201,1
3,2,32,1,61,11,153,201,1
4,1,33,1,61,11,153,201,2


### Split up the data into a training/testing set

In [48]:
from sklearn.model_selection import train_test_split

In [49]:
x_data = german2.drop('target',axis=1)

In [50]:
y_labels = german['target']

In [51]:
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size=0.025, random_state=101)

### Min-Max Normalization

In [52]:
min_train = X_train.min(axis=0)
max_train = X_train.max(axis=0)

# Per-column range, taken from the training set only
range_train = max_train - min_train

# Scale both sets with the training-set statistics
X_train_scaled = (X_train - min_train)/range_train
X_test_scaled = (X_test - min_train)/range_train
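
The scaling above can be read as $x' = (x - \min) / (\max - \min)$, with the minimum and maximum taken from the training set only so that no information from the test set leaks into preprocessing. A minimal sketch of that idea as a helper follows; the function name `min_max_scale` and the toy frames are hypothetical, and the guard for constant columns is an addition that is not present in the notebook.

```python
import pandas as pd

def min_max_scale(train, test):
    """Scale both frames to the training set's per-column [0, 1] range."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # A constant column has range 0; divide by 1 instead to avoid NaN/inf.
    col_range = col_range.replace(0, 1)
    return (train - col_min) / col_range, (test - col_min) / col_range

train = pd.DataFrame({"duration": [0, 1, 3], "status": [11, 12, 14]})
test = pd.DataFrame({"duration": [2], "status": [13]})
train_scaled, test_scaled = min_max_scale(train, test)
print(test_scaled)
```

Test values that lie outside the training range end up outside [0, 1], which is expected with this scheme.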

In [53]:
german.columns

Index(['status', 'duration', 'credit_history', 'purpose', 'credit_amount',
       'savings', 'employment_duration', 'installment_rate', 'personal_status',
       'debtors', 'residence', 'property', 'age', 'installment_plans',
       'housing', 'existing_credits', 'job', 'liable_people', 'telephone',
       'foreign_worker', 'target'],
      dtype='object')

# Model Based on Neural Networks

## Model Selection

We chose to use a high-level API provided by TensorFlow. TensorFlow has a number of estimator APIs, which allow a user to tap into pre-built models. We chose the DNN Classifier (deep neural network). We also ran the data through TF’s Linear Classifier, but decided to go with the deep neural network because it produced better results. The DNN Classifier is also well suited to binary classification problems; in our case we needed to predict 1 or 2 (good vs. bad).






In [54]:
import tensorflow as tf

## Model Implementation

First, the user defines the various features. The main part of defining the features in TF is setting the type of data, for instance whether the data is categorical or numeric. In our case all the categorical attributes were already encoded as numbers, so we set each feature to numeric.

Second, we defined the input function. This packages the data in the form of a Pandas dataframe and inserts it into the TensorFlow model. Within this input function you can set the number of epochs, batch size etc. 

In [55]:
feature_columns = []
for key in german2.drop('target', axis=1).columns:
    feature_columns.append(tf.feature_column.numeric_column(key=key))

In [56]:
for attribute in attribute_dict.values():
    print(attribute)

['status', 'credit_history', 'purpose', 'savings', 'employment_duration', 'personal_status', 'debtors', 'property', 'installment_plans', 'housing', 'job', 'telephone', 'foreign_worker']
['duration', 'credit_amount', 'installment_rate', 'residence', 'age', 'existing_credits', 'liable_people']


In [57]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train_scaled,y=y_train,batch_size=10,num_epochs=1000,shuffle=True)

Third, we defined the actual model. In this function the user would set the number of neurons as well as the number of hidden layers. The model also comes equipped with gradient descent optimizers - the default is the Adagrad Algorithm which is what we used.

In [58]:
dnn_model = tf.estimator.DNNClassifier(hidden_units=[8,8,8],n_classes=4, feature_columns=feature_columns)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_tf_random_seed': 1, '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_save_checkpoints_steps': None, '_session_config': None, '_save_checkpoints_secs': 600, '_model_dir': 'C:\\Users\\JMORR_~1\\AppData\\Local\\Temp\\tmpnatmvjf5', '_save_summary_steps': 100, '_keep_checkpoint_every_n_hours': 10000}
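
The target in this data set takes the values 1 and 2, while, according to the estimator documentation as we understand it, `DNNClassifier` expects integer labels in the range 0 to `n_classes - 1` unless a `label_vocabulary` is supplied. That would explain why `n_classes` has to be larger than 2 in the call above. One alternative, sketched below on a made-up series, is to shift the labels to 0 and 1 so that the model can be declared with two classes; this was not done for the runs reported in this notebook.

```python
import pandas as pd

# Made-up labels in the data set's encoding: 1 = good credit, 2 = bad credit.
y = pd.Series([1, 2, 1, 1, 2], name="target")

# Shift to 0 = good, 1 = bad so the labels fit a two-class model.
y_zero_based = y - 1
print(y_zero_based.tolist())  # [0, 1, 0, 0, 1]

# Shift predictions back when comparing against the original encoding.
restored = y_zero_based + 1
```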


Lastly, we trained the model by calling its train method, which takes the input function and the number of steps the model should run for.

In [59]:
dnn_model.train(input_fn=input_func,steps=1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\JMORR_~1\AppData\Local\Temp\tmpnatmvjf5\model.ckpt.
INFO:tensorflow:step = 1, loss = 14.6734
INFO:tensorflow:global_step/sec: 145.459
INFO:tensorflow:step = 101, loss = 5.26723 (0.695 sec)
INFO:tensorflow:global_step/sec: 147.98
INFO:tensorflow:step = 201, loss = 5.22755 (0.684 sec)
INFO:tensorflow:global_step/sec: 132.923
INFO:tensorflow:step = 301, loss = 5.94099 (0.745 sec)
INFO:tensorflow:global_step/sec: 140.434
INFO:tensorflow:step = 401, loss = 5.00192 (0.719 sec)
INFO:tensorflow:global_step/sec: 116.307
INFO:tensorflow:step = 501, loss = 4.88976 (0.850 sec)
INFO:tensorflow:global_step/sec: 113.86
INFO:tensorflow:step = 601, loss = 4.68564 (0.874 sec)
INFO:tensorflow:global_step/sec: 135.305
INFO:tensorflow:step = 701, loss = 2.6906 (0.749 sec)
INFO:tensorflow:global_step/sec: 111.445
INFO:tensorflow:step = 801, loss = 3.50019 (0.889 sec)
INFO:tensorflow:global_step/sec: 140.443
IN

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x362d6bc4a8>

## Testing and Evaluations

The TensorFlow API comes equipped with an evaluation method. Here you can input your testing set and compare the results against the actual targets. In addition, we created a scoring system in which a penalty is applied for wrong predictions: 5 points if the model predicted a 1 but the actual was a 2, 1 point if it predicted a 2 but the actual was a 1, and 0 points for correct answers. In short, we looked for a model that maximized the accuracy percentage and minimized the penalty score.

To streamline the testing process we developed a script that allowed a user to input a range of parameters. The user would also input how many times each type of model should run. For our testing we had each model run 5 times and then averaged the results. The number of layers we tested ranged from 1 to 5, and the number of neurons per layer ranged from 7 to 10. The output of these tests can be found in the "Testing Results" folder.

The optimal model proved to have 8 neurons and 3 layers.
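
The search described above can be outlined as a loop over layer counts and layer widths, repeating each configuration and averaging the metrics. The sketch below is not the script used for the reported results; `train_and_score` is a hypothetical callable that trains one model for the given `hidden_units` and returns an `(accuracy, penalty)` pair, and the stub at the bottom exists only to make the example runnable.

```python
from itertools import product
from statistics import mean

def grid_search(train_and_score, layer_counts, neuron_counts, runs=5):
    """Average accuracy and penalty over `runs` trainings per configuration."""
    results = []
    for layers, neurons in product(layer_counts, neuron_counts):
        hidden_units = [neurons] * layers
        scores = [train_and_score(hidden_units) for _ in range(runs)]
        results.append({
            "hidden_units": hidden_units,
            "accuracy": mean(s[0] for s in scores),
            "penalty": mean(s[1] for s in scores),
        })
    # Highest average accuracy first; ties broken by lowest average penalty.
    return sorted(results, key=lambda r: (-r["accuracy"], r["penalty"]))

# Stub scorer with fixed made-up numbers, standing in for real training.
def fake_train_and_score(hidden_units):
    return (0.8, 20) if hidden_units == [8, 8, 8] else (0.7, 25)

ranking = grid_search(fake_train_and_score, range(1, 6), range(7, 11))
print(ranking[0]["hidden_units"])  # [8, 8, 8]
```

Averaging over several runs matters here because the test set is small, so a single run can swing the accuracy by several points.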

In [60]:
eval_input_func = tf.estimator.inputs.pandas_input_fn(x=X_test_scaled,y=y_test,batch_size=10,shuffle=False)

In [61]:
dnn_model.evaluate(eval_input_func)

INFO:tensorflow:Starting evaluation at 2019-03-03-00:00:46
INFO:tensorflow:Restoring parameters from C:\Users\JMORR_~1\AppData\Local\Temp\tmpnatmvjf5\model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2019-03-03-00:00:48
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.8, average_loss = 0.448037, global_step = 1000, loss = 3.73364


{'accuracy': 0.80000001,
 'average_loss': 0.44803688,
 'global_step': 1000,
 'loss': 3.7336407}

In [62]:
evaluation = dnn_model.evaluate(eval_input_func)

INFO:tensorflow:Starting evaluation at 2019-03-03-00:00:54
INFO:tensorflow:Restoring parameters from C:\Users\JMORR_~1\AppData\Local\Temp\tmpnatmvjf5\model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2019-03-03-00:00:56
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.8, average_loss = 0.448037, global_step = 1000, loss = 3.73364


In [63]:
tf_eval = pd.DataFrame([{"Accuracy": evaluation['accuracy'],"Loss":evaluation["loss"],"Average Loss": evaluation["average_loss"]}])

In [64]:
predictions = list(dnn_model.predict(eval_input_func))

INFO:tensorflow:Restoring parameters from C:\Users\JMORR_~1\AppData\Local\Temp\tmpnatmvjf5\model.ckpt-1000


In [65]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
    
test_targets = []
for y in y_test:
    test_targets.append(y)

### Penalty Score

In [66]:
results = pd.DataFrame({"Predicted Result": final_preds,"Actual Result":test_targets})
score = 0

# Penalty: 5 points for predicting good (1) when the actual is bad (2),
# 1 point for predicting bad (2) when the actual is good (1), 0 otherwise.
for actual, predicted in zip(results["Actual Result"], results["Predicted Result"]):
    if actual == 2 and predicted == 1:
        score += 5
    elif actual == 1 and predicted == 2:
        score += 1
score

21

### Confusion Matrix

In [67]:
y_actu = pd.Series(test_targets, name='Actual')
y_pred = pd.Series(final_preds, name='Predicted')

In [68]:
df_confusion = pd.crosstab(y_actu, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
df_confusion

Actual \ Predicted,1,2,All
1,17,1,18
2,4,3,7
All,21,4,25
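
The accuracy and the penalty score can both be recovered from the confusion matrix, which is a useful cross-check. The sketch below hard-codes the four cell counts from the table above and applies the 5-point and 1-point penalties described earlier; the variable names are illustrative.

```python
# Counts from the confusion matrix, keyed by (actual, predicted).
counts = {(1, 1): 17, (1, 2): 1, (2, 1): 4, (2, 2): 3}

# Penalty per cell: 5 for a bad applicant predicted good, 1 for the reverse.
penalties = {(2, 1): 5, (1, 2): 1}

total = sum(counts.values())
accuracy = (counts[(1, 1)] + counts[(2, 2)]) / total
penalty = sum(n * penalties.get(cell, 0) for cell, n in counts.items())
# Share of actual bad-credit applicants that the model caught.
bad_recall = counts[(2, 2)] / (counts[(2, 1)] + counts[(2, 2)])

print(accuracy, penalty, round(bad_recall, 3))  # 0.8 21 0.429
```

The result agrees with the reported accuracy of 0.8 and penalty score of 21, and it also shows that only 3 of the 7 bad-credit applicants in the test set were identified.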


In [69]:
with pd.ExcelWriter('Results.xlsx') as writer:
    results.to_excel(writer, sheet_name='Output')
    tf_eval.to_excel(writer, sheet_name='Evaluation')
    df_confusion.to_excel(writer, sheet_name='Confusion Matrix')

In [70]:
import os
file = "Results.xlsx"
os.startfile(file)  # Windows only: opens the workbook in the default application

## Deployment

This model could be deployed to loan officers who could use it to conduct preliminary screenings of applicants. The model could be embedded in a web application. The bank representative could input an applicant’s information and it would output whether the applicant was a good or bad candidate. The rep could then use this information to present appropriate options for the potential borrower. We would not suggest using this model as the end-all solution for determining who gets a loan; it should mainly be used as an additional tool.

As data is acquired from the bank's lending base, it can be used to retrain the model. A data pipeline could be constructed to support this.

## Results

The chart below shows the results of the models with 8 neurons. As you can see, the model with 3 hidden layers performed the best: it had the highest average accuracy, 83.2%, and the lowest average penalty score, 14.6.

<div><img src="accuracy.png"><img src="accuracy_graph.png"></div>

# Conclusion

We successfully replicated the findings in "Combining Feature Selection and Neural Networks for Solving Classification Problems" and slightly exceeded the performance of their model. One caveat might be that their hold-out set was slightly larger, which may have contributed to their lower classification accuracy of 74.25% compared to our 83.2%.

Within this project we demonstrated feature selection, data preprocessing, parameter tuning, testing strategies, and a deployment roadmap. These results support the use of neural networks as a tool for classification problems.

# References

Whitehorn, M. (2006). "The parable of the beer and diapers", The Register. Retrieved from: https://www.theregister.co.uk/2006/08/15/beer_diapers/

Provost, F. & Fawcett, T. (2013). "Data Science for Business". O'Reilly. Sebastopol, CA.

O'Dea, P., Griffith, J., O'Riordan, C. "Combining Feature Selection and Neural Networks for Solving Classification Problems". National University of Ireland. Galway, Galway.