Credit Scoring

Introduction

Credit scoring is a set of statistical techniques used by financial institutions to assess the creditworthiness of a potential customer. The recent economic crisis, which involved most developed countries worldwide, was triggered by an unexpected number of defaults in the US home mortgage market. This fact highlighted the importance of determining credit risk ex ante.

In the past, loans were granted on the basis of the personal relationship between the credit analyst and the customer, but due to growing competition in the market and the high volume of requests, financial institutions have begun adopting automatic methods to support their decisions.

This project addresses an application scoring scenario, since it uses the information gathered from the application form filled in by the customer at the beginning of the loan request.

Dataset

The dataset consists of information about the loan, such as the time and the amount, and socio-demographic data provided by the client in the application form. Credit bureau data about the customer's credit history have also been used.

Preprocessing

Preprocessing is an important step in the process of extracting knowledge from data. Data collection methods are often imprecise and data quality is rarely checked, which can lead to values outside the domain (such as negative incomes), impossible combinations (sex: male; pregnant: yes) or missing values. The representation and the quality of the data are major requirements for analyzing it correctly.

Delete discriminatory variables

Given the impact that granting (or denying) a loan can have on an individual's life, it is important not to base decisions on discriminatory variables. Since the beginning of credit scoring, strict legislation has shaped how these automatic processes must work. The Equal Credit Opportunity Act is a US law, enacted in 1974, that forbids basing decisions on credit transactions on race, color, religion, national origin, sex, marital status, or age. More recently, in the European Union the GDPR has cited credit scoring as an example where algorithms cannot make use of certain information and where credit decisions must always be mediated by a human actor (this kind of process cannot be fully automated).

The following variables have been deleted from the dataset:

  • Sex
  • Age
  • Civil status
  • Country of origin
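
A minimal sketch of this step with pandas; the DataFrame and the column names below are hypothetical stand-ins for the actual dataset:

```python
import pandas as pd

# Toy frame standing in for the application data (column names are hypothetical).
df = pd.DataFrame({
    "amount": [12000, 5000],
    "sex": ["M", "F"],
    "age": [34, 51],
    "civil_status": ["single", "married"],
    "country_of_origin": ["IT", "FR"],
})

# Remove the discriminatory attributes so the model never sees them.
df = df.drop(columns=["sex", "age", "civil_status", "country_of_origin"])
```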

Target variable manipulation

The original dataset classifies each client as either a good payer, a payer who is late with payments, or a payer in litigation. To reduce the task to a binary classification problem, the last two categories have been merged into a single level named "bad payer".
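
A possible way to collapse the three labels into a binary target; the raw label strings below are hypothetical:

```python
import pandas as pd

# Hypothetical raw labels for the three original categories.
status = pd.Series(["good", "late", "litigation", "good"])

# "late" and "litigation" both collapse into "bad payer"; "good" stays a "good payer".
target = status.map({"good": "good payer", "late": "bad payer", "litigation": "bad payer"})
```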

Dummy variables

All qualitative variables have been transformed through dummy (one-hot) encoding. Each dummy variable takes the value one if the corresponding categorical level is present and zero otherwise.
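
One way to perform this encoding with pandas (column names and values are illustrative, not the actual dataset):

```python
import pandas as pd

# Toy frame with one qualitative column.
df = pd.DataFrame({"purpose": ["car", "home", "car"], "amount": [8000, 120000, 6500]})

# One 0/1 column per level of "purpose".
df = pd.get_dummies(df, columns=["purpose"])
```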

Handling missing values

Given the small number of missing values, all the records containing NA values have been deleted from the dataset.
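
In pandas this amounts to dropping the incomplete rows, for example (toy data, not the real dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [8000, np.nan, 6500], "duration": [36, 24, np.nan]})

# Drop every record that contains at least one NA value.
df = df.dropna()
```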

Splitting into training and test set

The "70-30" division have been used following a common practice in the literature(Lee et al, 2002; Desai et al, 1996; Boritz & Kennedy, 1995; Dutta et al, 1994).

Standardization of predictors

Standardization is a common procedure in the machine learning community: among other benefits, it eases the interpretation of regression coefficients and is required for the correct use of distance-based algorithms and regularized procedures.

The formula used follows:

$z = \dfrac{x - \mu_x}{\sigma_x} $

where $\mu_x$ is the mean and $\sigma_x$ is the standard deviation of the variable $x$.
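
The same transformation can be obtained with scikit-learn's StandardScaler; in the sketch below (toy numbers, not the real data), the mean and standard deviation are learned on the training set and reused on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[12000.0, 36.0], [5000.0, 24.0], [8000.0, 48.0]])  # toy training data
X_test = np.array([[6500.0, 12.0]])                                    # toy test data

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit mu and sigma on the training set only
X_test_std = scaler.transform(X_test)        # apply the same mu and sigma to the test set
```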

Undersampling

One of the most common problems in credit scoring is an unbalanced dataset, which can lead to many misclassification errors on the rare target class. To address this problem, the over-represented class has been undersampled using a stratified sampling procedure that works as follows:

  1. Randomly extract from the training set a fraction of the over-represented category.
  2. Move this fraction from the training set to the test set.

At the end of the procedure, the training set consists of 50% "good payers" and 50% "bad payers".
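
A rough sketch of this procedure, assuming pandas DataFrames `train` and `test` with a binary `target` column (all names are hypothetical):

```python
import pandas as pd

def undersample(train: pd.DataFrame, test: pd.DataFrame, target: str = "target"):
    """Move randomly chosen majority-class rows from train to test until train is balanced."""
    counts = train[target].value_counts()
    majority, minority = counts.idxmax(), counts.idxmin()
    surplus = counts[majority] - counts[minority]

    # 1. Randomly extract the surplus of the over-represented category from the training set.
    moved = train[train[target] == majority].sample(n=surplus, random_state=42)

    # 2. Move that fraction from the training set to the test set.
    train = train.drop(moved.index)
    test = pd.concat([test, moved])
    return train, test
```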

Quality assessment

In the literature, there is no single criterion for assessing the quality of a credit scoring system, so it is better to use multiple tools to describe the predictive ability of the algorithm.

In this project, logistic regression has been used. It is based on the sigmoid function and outputs a number between 0 and 1 representing the probability that a customer belongs to one of the two categories, in this case that they are a good payer.

The criteria used are the ROC curve, together with the Area Under the Curve (AUC), and the confusion matrix, together with sensitivity and accuracy. To put the predictive ability of the algorithm in perspective, a comparison with state-of-the-art results has been performed.
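
A minimal scikit-learn sketch of this evaluation; the data below are random stand-ins, and the 0.5 decision threshold and solver settings are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score, recall_score

rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(70, 5)), rng.integers(0, 2, 70)  # stand-in training data
X_test, y_test = rng.normal(size=(30, 5)), rng.integers(0, 2, 30)    # stand-in test data

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]   # probability of being a good payer
pred = (proba >= 0.5).astype(int)

auc = roc_auc_score(y_test, proba)          # area under the ROC curve
cm = confusion_matrix(y_test, pred)         # confusion matrix
accuracy = accuracy_score(y_test, pred)
sensitivity = recall_score(y_test, pred)    # true-positive rate
```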

Explainability

As highlighted in the introduction, it is important to state which variables most influence the credit scoring procedure. AI/ML models often suffer from low explainability, and the black-box nature of the algorithms undermines the accountability needed in critical applications. One of the techniques used by scholars to shed light on the decision process of algorithms is feature importance. This method has been used in this project to evaluate which features matter most when granting a loan.
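
This README does not spell out the exact feature importance algorithm; one common choice for a fitted model, shown here purely as an assumption and continuing the evaluation sketch above, is scikit-learn's permutation importance:

```python
from sklearn.inspection import permutation_importance

# How much the test score drops, on average, when each feature is randomly shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# Feature indices ranked from most to least influential.
ranking = result.importances_mean.argsort()[::-1]
```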

Final notes

The information contained in this README is a short summary of the report that you can find in the repository. All the references and the bibliography can be found in the report and in the Jupyter Notebook.
