<img src=images/gdd-logo.png align=right width=300px style='padding:20px'>

# Hackathon: Credit Lending

Welcome to the hackathon! You'll get the opportunity to try out your chosen explainability technique(s) on a credit lending dataset.


### Outline
1. [Problem Introduction](#intro)
1. [About the data](#data)
1. [Creating the model](#model) 
1. [Assignment](#assignment)

<a id = 'intro'></a>

## Problem Introduction
You work in the credit risk department of a bank. Your stakeholders want to determine the level of risk for all internal customers, so that when a customer applies for a new product (e.g. loan, credit card, overdraft, mortgage), the bank can correctly decide whether to grant it.

The problem with the current setup is that each product has its own way of determining a customer's level of risk. This means that a customer may be refused an overdraft extension to €2.000 while being approved for a €2.000 loan at the same time.

This makes for a bad customer journey, which is why the bank wants to consolidate its view of risk so that lending decisions are consistent.

Customers who get rejected often complain about the decision, so it is important that we can send advisers to talk through their situation and suggest a suitable lending option.

**Note:** The regulators consider customers "unable to repay" when they have missed 3 or more payments (on any product) in the last 2 years.

<img src="images/credit.jpeg" style="display: block;margin-left: auto;margin-right: auto;height: 200px"/>



<a id = 'data'></a>

## About the data

We have access to data on a large number of customers.

The "unable to repay" flag is named `bad_flag` and follows the definition above. The remaining features in the dataset are described below:

|Variable Name|	Description	|Type|
|:---|:---|:---|
|rev_unsecured|	Total balance on loans (no real estate) as a percentage of the total loan taken 	|percentage|
|age	|Age of borrower in years	|integer|
|days_past_due_30	|Number of times borrower has been 30-59 days past due in the last 2 years	|integer|
|debt_ratio	|Monthly debt payments, alimony and living costs divided by monthly gross income|	float (%)|
|income |	Monthly income	|float|
|num_credit	|Number of Open loans (car loan, mortgage, credit cards etc.)	|integer|
|num_days_late	|Number of times borrower has been 90 days or more past due.	|integer|
|num_realestate	|Number of mortgage and real estate loans including home equity lines of credit	|integer|
|days_past_due_60	|Number of times borrower has been 60-89 days past due in the last 2 years.|	integer|
|num_deps|	Number of dependents in family excluding themselves (spouse, children etc.)	|integer|
|bad_flag	|Missed 3 monthly payments or more in past two years |	Y/N|


In [None]:
import pandas as pd

credit = pd.read_csv('data/credit.csv')
credit.head()
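
Before modelling, it can help to check how many values are missing per column and how imbalanced the target is. The sketch below runs on a small made-up frame (`toy`) that borrows a few column names from the table above; the values are invented, and the 0/1 coding of `bad_flag` is an assumption (the table lists it as Y/N).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `credit`; the values are invented for illustration.
toy = pd.DataFrame({
    'income': [2500.0, np.nan, 4100.0, 3300.0, np.nan, 5200.0],
    'num_deps': [0, 2, np.nan, 1, 0, 3],
    'debt_ratio': [0.35, 0.80, 0.12, 0.45, 0.60, 0.22],
    'bad_flag': [0, 0, 0, 1, 0, 0],  # assumed 0/1 coding
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Share of each class in the target
class_share = toy['bad_flag'].value_counts(normalize=True)

print(missing_counts)
print(class_share)
```

The same two calls should work on `credit` directly, provided the column names match the table.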

<a id = 'model'></a>

## Creating the model

With the data loaded, we select the features and the target and create a train and test set.

In [None]:
from sklearn.model_selection import train_test_split

features = ['num_realestate', 'debt_ratio', 'income', 'num_credit', 'num_deps']
X = credit.loc[:, features]
y = credit.loc[:, 'bad_flag']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)
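
Passing `stratify=y` keeps the share of positive labels roughly equal in the train and test sets, which matters when the positive class is rare. A minimal sketch on synthetic data, where the 10% positive rate and the 0/1 coding of the flag are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 10% positives (assumed 0/1 coding of the flag)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'income': rng.normal(3000, 800, size=1000)})
y_toy = pd.Series([1] * 100 + [0] * 900, name='bad_flag')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=111
)

# With stratification, both rates stay close to the overall 10%
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```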

Because there are some missing values in two of the columns, we'll impute those using `SimpleImputer` within a `ColumnTransformer`.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

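column_imputer = ColumnTransformer(
    [
        # Fill missing incomes with the mean income
        ('imp_inc', SimpleImputer(strategy='mean'), ['income']),
        # Fill missing dependent counts with the most common value
        ('imp_deps', SimpleImputer(strategy='most_frequent'), ['num_deps']),
    ],
    remainder='passthrough',  # leave all other columns untouched
)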
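
One detail worth knowing when you later interpret the model: a `ColumnTransformer` places the transformed columns first and the passthrough columns after them, so the column order seen by the model differs from the order in `features`. The sketch below illustrates this on a made-up four-row frame (`X_toy`, invented values) with the same column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Invented values; only the column names follow the notebook.
X_toy = pd.DataFrame({
    'num_realestate': [1, 0, 2, 1],
    'debt_ratio': [0.3, 0.5, 0.1, 0.7],
    'income': [2000.0, np.nan, 4000.0, 3000.0],
    'num_credit': [4, 7, 2, 5],
    'num_deps': [0, np.nan, 0, 2],
})

imputer = ColumnTransformer(
    [
        ('imp_inc', SimpleImputer(strategy='mean'), ['income']),
        ('imp_deps', SimpleImputer(strategy='most_frequent'), ['num_deps']),
    ],
    remainder='passthrough',
)

X_imputed = imputer.fit_transform(X_toy)

# Transformed columns come first; passthrough columns follow in their original order.
output_columns = ['income', 'num_deps', 'num_realestate', 'debt_ratio', 'num_credit']
X_imputed = pd.DataFrame(X_imputed, columns=output_columns)
print(X_imputed)
```

Keep this order in mind when matching values such as `feature_importances_` back to feature names.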

Now we can create a pipeline with both the preprocessing steps and the model we want, and see how our model performs.

In [None]:
from sklearn.ensemble import RandomForestClassifier

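# Fixed seed so the results are reproducible across runs
model = RandomForestClassifier(random_state=111)

pipeline = Pipeline(steps=[
    ('imputer', column_imputer),
    ('model', model)
])

pipeline.fit(X_train, y_train)
# For a classifier, score() returns the mean accuracy on the given data
pipeline.score(X_test, y_test)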
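
Note that `score` reports accuracy, which can look high on an imbalanced target even when the model rarely identifies the minority class. One way to put the number in context is to compare it with a majority-class baseline and a ranking metric such as ROC AUC. The sketch below does this on synthetic data; the roughly 10% positive rate and the 0/1 labels are assumptions, not properties of the credit dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (roughly 10% positives).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))
# The label depends on the first column plus noise; purely illustrative.
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=2000) > 1.4).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=111
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=111).fit(X_tr, y_tr)

baseline_accuracy = baseline.score(X_te, y_te)
forest_accuracy = forest.score(X_te, y_te)
# ROC AUC uses the predicted probability of the positive class
forest_auc = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"forest accuracy:   {forest_accuracy:.3f}")
print(f"forest ROC AUC:    {forest_auc:.3f}")
```

If the pipeline's accuracy is close to the baseline's, accuracy alone says little about how well risky customers are identified.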

<a id = 'assignment'></a>

# <mark>Assignment</mark>

### Theory Questions

1. Read the problem description. Which type of explainability method do you imagine would be most suitable for this problem: 
    - Local (explains one single prediction) or global (explains model behaviour)? 
    - Feature importance (determining which features have the biggest impact on your predictions) or feature sensitivity (determining how predictions would be affected by changes in feature values)? 

2. Are there any inherently interpretable models that spring to mind that can help you address the need for explainability for this problem? The model implemented is a Random Forest. Was that a good choice?


3. What model-agnostic techniques would be appropriate to address the need for explainability for this problem?


#### Bonus 
4. Explainability methods can be used to enhance the customers' understanding of the model's decision making process. 

    What are the pros & cons (for the business in question) of using explainability methods this way? 

5. Some explainability methods are less useful when features are highly correlated. Is that applicable to this dataset, and if so, what can you do to still interpret your model?

### Do-it-yourself
The explainability techniques covered in the workshop were: 
* Ceteris Paribus (local sensitivity)
* Prediction Break-Down (local feature importance)
* Permutation Feature Importance (global feature importance)
* Partial Dependence Plots (global sensitivity)

Implement the technique that you deem most appropriate for this problem. Consider both the problem statement and the advantages and disadvantages of each method. Refer back to the [slides](https://github.com/marysia/explainability-workshop/blob/master/presentation.pdf) if necessary. 
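
As a possible starting point for one of the listed techniques, here is a minimal sketch of Permutation Feature Importance using scikit-learn's `permutation_importance`. It runs on synthetic data with two hypothetical features, `signal` and `noise` (both invented for this example), and is not meant as the intended solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on `signal`; `noise` is irrelevant.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=1000),
    'noise': rng.normal(size=1000),
})
y_toy = (X_toy['signal'] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=111
)
forest = RandomForestClassifier(random_state=111).fit(X_tr, y_tr)

# Shuffle each column in turn and measure the drop in test-set score
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=111
)
importances = pd.Series(result.importances_mean, index=X_te.columns)
print(importances.sort_values(ascending=False))
```

With the notebook's own objects, passing `pipeline`, `X_test` and `y_test` should work the same way; the permutation then happens on the raw features, before imputation.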

#### Bonus challenges: 
* Also try out the other explainability techniques!
* Use a Decision Tree model and visualise the decision tree. Experiment with various settings for `max_depth`. How does this influence the model's performance and the degree of explainability? 
* Find the feature importances of the Random Forest model in the pipeline. Compare them to the results of Permutation Feature Importance. 
* Try out any other model you'd like.
* Find example-based explanations. 
    * Counterfactuals
    * Adversarial examples
* Try a k-Nearest Neighbors classifier and extract the example data points that explain a specific prediction.
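
For the Decision Tree challenge, a quick way to see how `max_depth` trades readability against complexity is to print the learned rules as text. The sketch below uses synthetic data from `make_classification` and hypothetical feature names (`f0` to `f3`); swap in the credit features to try it on the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature names, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    random_state=111,
)
feature_names = ['f0', 'f1', 'f2', 'f3']

shallow = DecisionTreeClassifier(max_depth=2, random_state=111).fit(X_toy, y_toy)
deep = DecisionTreeClassifier(max_depth=6, random_state=111).fit(X_toy, y_toy)

# A depth-2 tree has at most 4 leaves, so its rules fit on a few lines
rules = export_text(shallow, feature_names=feature_names)
print(rules)
print("leaves (shallow vs deep):", shallow.get_n_leaves(), deep.get_n_leaves())
```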

