# Bias-Removing Autoencoder for Reject Inference
### Group 1 - Seminar Information Systems (WiSe 2021/22)
#### Klemens Lehmann (), Jan Papmeier (604492), Lukas Voege (615033)

### Table of Contents

1. Introduction
1. Problem Setting
    1. _Credit Scoring_
    1. _Sampling Bias and Acceptance Loop_
1. Current Methods in Reject Inference
    1. _Reject Inference as Missing Data Problem_
    1. _Missing Data in Existing Reject Inference Methods_
    1. _Missing Data in this Project_
1. Reject Inference with Autoencoders
    1. _What is an Autoencoder_
    1. _How could an Autoencoder help with sampling selection bias? - Loss Function_
1. Testing and Results
1. Conclusion

## 1. Introduction

The goal of credit scoring is to predict whether an applicant will repay on time if accepted. With a large number of applicants, this prediction is crucial for deciding whom to approve and whom to decline. The output of a credit scoring model is therefore the probability that an applicant will default, and an applicant is accepted or rejected based on this probability. In general, credit scoring models are trained on accepted cases only, which may lead to biased estimates. Reject inference methods have been researched for at least 30 years to tackle this problem, as can be seen in Joanes (1993). In reject inference, the data on rejected applicants is included, or accounted for, in the model in addition to the accepted cases. Several reject inference methods have been discussed in the literature; a recent overview can be found in Kozodoi et al. (2019). We discuss a categorization of these approaches in the reject inference section. In this notebook we try to add a special type of neural network, namely an autoencoder, as a reject inference method to reduce sampling bias.

In the next section, we describe the problem setting and the acceptance loop in credit scoring. The third section covers existing reject inference methods and their treatment of missing data. In the fourth section, we present the autoencoder neural network and how we apply this concept in our work. In the fifth section, we show the results of our experiments. Lastly, we conclude with a summary and an outlook.

## 2. Problem Setting
#### 2.1 Credit Scoring
Credit scoring models evaluate the default risk (the probability of not paying back) of an applicant applying for a loan. Several methods have been developed since Durand’s first publication in 1941; for an overview see Lessmann et al. (2015). Credit scoring is a supervised learning task, which is why we need labelled data for training. Since we cannot know the true label of a rejected credit application, we are left with no other option than to train our model on applicants that were accepted in the past and have since been labelled. This creates a sampling bias, since the training data is not representative of the whole population of possible borrowers.

#### 2.2 Sampling Bias and Acceptance Loop
The sampling bias is created through the so-called acceptance loop, as shown in Figure 1. Our model, which can be based on decision trees, logistic regression, etc., is trained on old data of applicants it once accepted itself. This model is then used to evaluate new applicants and either accept or reject them. The accepted cases, once labelled, are then used to continue training our base model. This creates a loop that only processes data of accepted cases, hence the name acceptance loop. This relationship is marked in red in Figure 1.
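
The effect of a single pass through the loop can be illustrated with a small simulation. The following is a minimal sketch on synthetic data, not the setup used in this project: the single `score` feature, the logistic ground truth for the default probability, and the function name `simulate_one_pass` are all assumptions made for illustration.

```python
import math
import random


def simulate_one_pass(n_applicants=10_000, accept_rate=0.5, seed=0):
    """Simulate one pass through the acceptance loop on synthetic applicants."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n_applicants):
        # Hypothetical creditworthiness feature (higher is better).
        score = rng.gauss(0.0, 1.0)
        # Assumed ground truth: default risk falls as the score rises.
        p_default = 1.0 / (1.0 + math.exp(2.0 * score))
        defaulted = rng.random() < p_default
        applicants.append((score, defaulted))

    # The existing scorecard accepts the applicants with the highest scores.
    applicants.sort(key=lambda a: a[0], reverse=True)
    n_accepted = int(accept_rate * n_applicants)
    accepted = applicants[:n_accepted]

    # Labels are only observed for accepted applicants in practice;
    # the simulation lets us compare against the full population.
    rate_all = sum(d for _, d in applicants) / n_applicants
    rate_accepted = sum(d for _, d in accepted) / n_accepted
    return rate_all, rate_accepted


rate_all, rate_accepted = simulate_one_pass()
print(f"default rate, all applicants: {rate_all:.3f}")
print(f"default rate, accepted only:  {rate_accepted:.3f}")
```

Because the accepted applicants are exactly those with the highest scores, their observed default rate lies below the rate in the full applicant population, so a model trained on them sees an overly optimistic sample.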

## 3. Current Methods in Reject Inference
As we have seen, credit risk models by design have no target data for rejected applicants to train on. Since applicants are in general not rejected at random, this introduces a sampling bias. To address this bias, several reject inference methods have been developed. They all have in common that they, at least implicitly, have to treat the labels of rejected applicants as missing data. For reject inference this was first done explicitly by Feelders (2000), based on the missing data classification of Little and Rubin (2002, first published 1987).
#### 3.1 Reject Inference as Missing Data Problem
To formalize the treatment of missing data, it is important to look at the missing-data mechanism, that is, the relation between the missingness and the values of the variables in the model. The data is called missing completely at random (MCAR) if the missingness is unrelated to the data. If it is related only to the observed values, it is called missing at random (MAR). If it is also related to the missing or unobserved values, it is called missing not at random (MNAR) (see Little and Rubin 2002, 11ff).
For likelihood-based inference, the missing-data mechanism is called ignorable if the data is MAR and the parameters of the missing-data mechanism and the model parameters are distinct, i.e. unrelated (see Little and Rubin 2002, 119ff).
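
For the credit scoring case, the three mechanisms can be written compactly. The notation is an assumption introduced here for illustration: $x$ denotes the observed features of an applicant, $y$ the repayment label, and $z \in \{0, 1\}$ indicates whether the label is observed ($z = 1$, applicant accepted) or missing ($z = 0$, applicant rejected).

$$
\begin{aligned}
\text{MCAR:}\quad & p(z \mid x, y) = p(z) \\
\text{MAR:}\quad & p(z \mid x, y) = p(z \mid x) \\
\text{MNAR:}\quad & p(z \mid x, y) \neq p(z \mid x)
\end{aligned}
$$

Under MAR, Bayes' rule gives, for every $x$ with $p(z = 1 \mid x) > 0$,

$$
p(y \mid x, z = 1) = \frac{p(z = 1 \mid x, y)\, p(y \mid x)}{p(z = 1 \mid x)} = p(y \mid x),
$$

so the conditional distribution of the label among accepted applicants equals that of the whole population wherever accepted cases exist.
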
#### 3.2 Missing Data in Existing Reject Inference Methods
To assume MCAR in the context of credit scoring, applicants would have to be chosen at random. In that situation we would also not expect to see a sampling bias. Usually this is not the case and a selection process is in place. A few rare studies on a dataset with almost no selection in place are Banasik et al. (2003) and the follow-up works by Banasik and Crook (2004, 2005, 2007). Banasik et al. (2003) argue that the most important parameter in reject inference is the “accurate estimation of the potential good-bad ratio for the population of all applicants.” Overall, the MCAR assumption is usually not applied in the credit scoring context (see Feelders 2001 and Ehrhardt et al. 2021).
Most reject inference methods in the literature are based on a MAR ignorable assumption for the data. In the context of credit scoring this is a reasonable assumption, since the credit decision is in general based on the model and therefore only on the available data. It should be noted, however, that several mechanisms are possible that violate the assumption, such as a human overriding the model decision in the process, or an applicant deciding to accept the offer of another institution. Ehrhardt et al. (2021) use “not financed” instead of “rejected” to point out this distinction. Similarly, there might be other mechanisms that are not covered by the model (see Feelders 2001 and Ehrhardt et al. 2021).
Ehrhardt et al. (2021) categorize reject inference methods and try to formalize them, even where this has not been done formally before. Based on these categories, there are five types of methods that rely on MAR ignorable assumptions. This includes ignoring the rejected applicants, which also requires these assumptions. The other methods are fuzzy augmentation, reclassification, augmentation, and a method called twins. Not included are more recent methods based on isolation forests, Bayesian networks, support vector machines, or deep learning. Almost all of these are also based on the MAR ignorable assumption (see Ehrhardt et al. 2021).
To cover mechanisms that lie outside the model, such as manual overrides, one would need to assume MNAR data. This limits the available statistical methods to a large extent. Therefore, only a few reject inference methods are based on this assumption; in the literature these are only parcelling and a bivariate probit model with sample selection. Whether the additional information needed to cover the MNAR mechanisms can reasonably be provided is not clear (see Feelders 2001 and Ehrhardt et al. 2021).
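
To make one of the MAR-based methods named above concrete, the sketch below shows fuzzy augmentation as it is commonly described: a model fitted on the accepted cases scores each rejected applicant, who then enters the training set twice, once as bad with weight p(bad) and once as good with weight 1 - p(bad), before the model is refitted. This is a minimal illustration under assumptions, not the implementation of any cited paper: the one-feature logistic regression, the synthetic data, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, weights, n_iter=300, lr=0.5):
    """Weighted one-feature logistic regression fitted by gradient descent."""
    w, b = 0.0, 0.0
    total = sum(weights)
    for _ in range(n_iter):
        grad_w = grad_b = 0.0
        for x, y, wt in zip(xs, ys, weights):
            err = sigmoid(w * x + b) - y
            grad_w += wt * err * x
            grad_b += wt * err
        w -= lr * grad_w / total
        b -= lr * grad_b / total
    return w, b


def fuzzy_augmentation(x_acc, y_acc, x_rej):
    # Step 1: fit on accepted applicants only (label 1 = bad, 0 = good).
    w, b = fit_logistic(x_acc, y_acc, [1.0] * len(x_acc))
    # Step 2: each reject enters twice, weighted by its predicted class probabilities.
    xs, ys, ws = list(x_acc), list(y_acc), [1.0] * len(x_acc)
    for x in x_rej:
        p_bad = sigmoid(w * x + b)
        xs += [x, x]
        ys += [1, 0]
        ws += [p_bad, 1.0 - p_bad]
    # Step 3: refit on the augmented, weighted sample.
    return (w, b), fit_logistic(xs, ys, ws)


# Synthetic data (assumption): applicants with x > 0 were accepted.
rng = random.Random(0)
x_acc, y_acc, x_rej = [], [], []
for _ in range(1000):
    x = rng.gauss(0.0, 1.0)
    if x > 0.0:  # accepted: label is observed
        x_acc.append(x)
        y_acc.append(1 if rng.random() < sigmoid(-2.0 * x) else 0)
    else:  # rejected: label stays unknown
        x_rej.append(x)

accepted_only, fuzzy = fuzzy_augmentation(x_acc, y_acc, x_rej)
print("accepted only (w, b):  ", accepted_only)
print("fuzzy augmented (w, b):", fuzzy)
```

The labels assigned to the rejects come entirely from the model fitted on the accepted cases, which is why the method relies on the MAR ignorable assumption.
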
#### 3.3 Missing Data in this Project
In our project we also assume MAR ignorable data, that is, that the data comes from a single distribution and that all missing-data mechanisms are covered by the model. For our experiments these assumptions should be completely fulfilled: the full data is known and the exclusion is based solely on the model. In reality this would probably not be the case, especially since we treat the available datasets as the through-the-door population. These are most likely already datasets of accepted applicants only, on which we simulate a second rejection mechanism. In addition to other missing-data mechanisms, the true through-the-door population will have a different distribution, which cannot be covered in our experiments on the available datasets.
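
A simulated rejection on a fully labelled dataset can be sketched as follows. This is an assumed, simplified version of such a setup: the function name `simulate_rejection`, the use of `None` for masked labels, and the convention that a higher score means a more creditworthy applicant are illustrative choices, not taken from the project code.

```python
def simulate_rejection(scores, labels, accept_rate):
    """Keep labels for the top `accept_rate` share by score; mask the rest with None."""
    n_accept = int(round(accept_rate * len(scores)))
    # Indices sorted from the highest to the lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    accepted = set(order[:n_accept])
    return [labels[i] if i in accepted else None for i in range(len(labels))]


# Toy example: four applicants, label 1 = default, half of them accepted.
scores = [0.9, 0.2, 0.7, 0.4]
labels = [0, 1, 0, 1]
masked = simulate_rejection(scores, labels, accept_rate=0.5)
print(masked)
```

Since acceptance here depends only on the model score, which is itself a function of the observed features, the masked labels are MAR by construction.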


## 4. Reject Inference with Autoencoders
#### 4.1 What is an Autoencoder

#### 4.2 How could an Autoencoder help with sampling selection bias? - Loss Function

## 5. Testing and Results

## 6. Conclusion