[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorials/4_nb_data_preparation.ipynb) 

## Credit Borrower Classification 


## The HMEQ data set
Our data set, called the "Home Equity" or, in brief, HMEQ data set, is provided by www.creditriskanalytics.net. It comprises information about a set of borrowers, who are characterized by demographic variables and variables concerning their business relationship with the lender. A binary target variable called 'BAD' indicates whether a borrower defaulted on the loan (1) or repaid it (0). You can think of the data as a standard use case of binary classification.

You can obtain the data, together with other interesting finance data sets, directly from www.creditriskanalytics.net. The website also provides a brief description of the data set. Specifically, the data set consists of 5,960 observations and 13 variables including the target variable. The variables are defined as follows:

- BAD: the target variable, 1=default; 0=non-default 
- LOAN: amount of the loan request
- MORTDUE: amount due on an existing mortgage
- VALUE: value of current property
- REASON: DebtCon=debt consolidation; HomeImp=home improvement
- JOB: occupational categories
- YOJ: years at present job
- DEROG: number of major derogatory reports
- DELINQ: number of delinquent credit lines
- CLAGE: age of oldest credit line in months
- NINQ: number of recent credit inquiries
- CLNO: number of credit lines
- DEBTINC: debt-to-income ratio

As you can see, the features aim at describing the financial situation of a borrower. It makes sense to familiarize yourself with the above features. Make sure you understand what type of information they provide and what this information might reveal about the risk of defaulting.  

---

In [None]:
import pandas as pd 

In [None]:
# You have to update the code such that the variable file includes the correct path to the csv file on your computer
file = '../data/hmeq.csv'
df = pd.read_csv(file)

In [None]:
df

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,
1,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,
2,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,
3,1,1500,,,,,,,,,,,
4,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5955,0,88900,57264.0,90185.0,DebtCon,Other,16.0,0.0,0.0,221.808718,0.0,16.0,36.112347
5956,0,89000,54576.0,92937.0,DebtCon,Other,16.0,0.0,0.0,208.692070,0.0,15.0,35.859971
5957,0,89200,54045.0,92924.0,DebtCon,Other,15.0,0.0,0.0,212.279697,0.0,15.0,35.556590
5958,0,89800,50370.0,91861.0,DebtCon,Other,14.0,0.0,0.0,213.892709,0.0,16.0,34.340882


## Mission
The final objective is to build a classifier that can distinguish good from bad borrowers. 
To do that, the following tasks need to be accomplished:
1. Understand the content that is available in the dataset. Spend some time on exploratory data analysis (EDA).
2. Pre-process the data. Be careful: you will find missing values as well as different data types (categorical and numerical).

3. NN model

    a. Build a NN classifier to classify the borrowers based on the available borrower information. 
    
    b. Analyze the outcomes.

4. Benchmark: ML models

    a. Build a benchmark model, such as logistic regression (LR), a support vector machine (SVM), or a random forest (RF), against which to compare the NN's performance.

    b. Report the top 5 most predictive features based on RF feature importance and/or interpret the LR coefficients to understand which features were most predictive.

    c. Analyze the outcomes.

5. Use different visualization packages to visualize findings from both approaches.
6. Compare the results from all models using appropriate measures. Think about which measures are important for this task: AUC, F1, or recall (also called sensitivity)? Justify your choice. If appropriate, do some post-processing.
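
To make the comparison of measures concrete, here is a minimal sketch of how AUC, recall and F1 could be computed with scikit-learn. The labels and scores below are made up for illustration, and the 0.5 cut-off is an assumption; they do not come from a model trained on HMEQ.

```python
from sklearn.metrics import f1_score, recall_score, roc_auc_score

# Hypothetical true labels (1 = default) and predicted default probabilities
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.2, 0.9]

# Turn probabilities into class predictions with an assumed cut-off of 0.5
y_pred = [int(s >= 0.5) for s in y_score]

# AUC uses the raw scores and is independent of the cut-off
auc = roc_auc_score(y_true, y_score)

# Recall and F1 use the class predictions and change with the cut-off
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"AUC: {auc:.3f}, recall: {recall:.3f}, F1: {f1:.3f}")
```

Because recall and F1 depend on the cut-off while AUC does not, it is worth reporting the cut-off together with any threshold-based measure.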


## Data pipeline construction revisited
Here is a sample of how your data preparation process could look. The details will vary depending on the data you are working with and the goal of your model. In general, you may at least want to follow these steps:

1) Basic cleaning of null values, duplicates and outliers

Missing values, duplicates and outliers can cause errors in your Python code and distort model estimates. There are several ways to deal with NaN values and outliers: you could remove them, replace them with another value (an indicator value, the mode, mean, max, min, etc.), or use a small auxiliary model to impute them, among other options.
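
The cleaning options above can be sketched as follows. This is one possible combination (drop duplicates, add a missing-value indicator, impute with median and mode, cap outliers at 1.5 times the interquartile range), not a recommendation for HMEQ specifically. The frame `toy` and its values are invented for illustration and only borrow a few HMEQ column names.

```python
import numpy as np
import pandas as pd

# Made-up data that mimics a few HMEQ columns (illustrative values only)
toy = pd.DataFrame({
    "LOAN": [1100.0, 1300.0, 1300.0, 1500.0, 90000.0],
    "YOJ": [10.5, 7.0, 7.0, np.nan, 3.0],
    "REASON": ["HomeImp", "HomeImp", "HomeImp", None, "DebtCon"],
})

# 1) Remove exact duplicate rows
clean = toy.drop_duplicates().copy()

# 2) Keep an indicator of where values were missing before imputing
clean["YOJ_missing"] = clean["YOJ"].isna().astype(int)

# 3) Impute: median for the numeric column, mode for the categorical one
clean["YOJ"] = clean["YOJ"].fillna(clean["YOJ"].median())
clean["REASON"] = clean["REASON"].fillna(clean["REASON"].mode()[0])

# 4) Cap outliers at 1.5 * IQR beyond the quartiles
q1, q3 = clean["LOAN"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["LOAN"] = clean["LOAN"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(clean)
```

When you later split the data into training and test sets, the imputation values and outlier bounds should be computed on the training part only and then applied to the test part.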

2) Encode variables in the most appropriate way

Check your dataframe using the method .info(). Are your continuous variables encoded as integers or floats? If a numeric column shows up as type object, it probably contains non-numeric entries (for example, text placeholders for missing values) and should be checked again. It is good practice to convert categorical variables to the category data type, as this reduces memory usage and can speed up processing. Once you have confirmed that a categorical variable will enter the final model, one-hot encoding is a common way to turn it into numeric inputs.
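
A minimal sketch of this step, using a small invented frame (`toy`) with HMEQ-style column names rather than the real data: inspect the types, convert the text columns to the category data type, and one-hot encode them with `pd.get_dummies`. The choice of `drop_first=True` is an assumption; it avoids perfectly collinear dummy columns, which matters for models such as logistic regression.

```python
import pandas as pd

# Made-up data with two categorical columns (illustrative values only)
toy = pd.DataFrame({
    "LOAN": [1100, 1300, 1500, 1700],
    "REASON": ["HomeImp", "HomeImp", "DebtCon", "HomeImp"],
    "JOB": ["Other", "Office", "Other", "Mgr"],
})

# Inspect data types and non-null counts
toy.info()

# Convert text columns to the category data type
for col in ["REASON", "JOB"]:
    toy[col] = toy[col].astype("category")

# One-hot encode; drop_first removes one dummy per variable
encoded = pd.get_dummies(toy, columns=["REASON", "JOB"], drop_first=True)
print(encoded.columns.tolist())
```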


3) Ensure variables fit statistical assumptions/model requirements

The next steps are a bit more complex and depend on the model you use. Your goal is to make sure your model can process the data correctly. Note that many statistical methods assume normally distributed or comparably scaled features.
In this step, you may also want to remove features that are highly correlated with one another.
For a NN, you will generally not need to check for correlated features, as the network should be able to learn the important features and weights by itself.
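
As a rough illustration of scaling and correlation filtering, the sketch below standardizes all columns and drops one feature from each highly correlated pair. The data are randomly generated (only the column names are borrowed from HMEQ), and the 0.9 threshold is an arbitrary assumption that you would tune for your own data.

```python
import numpy as np
import pandas as pd

# Synthetic data: MORTDUE is constructed to be nearly collinear with VALUE
rng = np.random.default_rng(0)
value = rng.normal(100_000, 20_000, size=200)
toy = pd.DataFrame({
    "VALUE": value,
    "MORTDUE": 0.7 * value + rng.normal(0, 1_000, size=200),
    "YOJ": rng.uniform(0, 30, size=200),
})

# Standardize each column to zero mean and unit standard deviation
scaled = (toy - toy.mean()) / toy.std()

# Absolute pairwise correlations, keeping only the upper triangle
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold
threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = scaled.drop(columns=to_drop)

print("Dropped:", to_drop)
```

Even though a NN can cope with correlated inputs, it usually trains more reliably on scaled features, so the standardization part remains useful there.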

