<a href="https://colab.research.google.com/github/longhowlam/python_hobby_stuff/blob/master/german_credit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook presents an analysis of the German credit data set, as part of the assessment for E.E. The data set used here was downloaded from Kaggle; see the data [here](https://www.kaggle.com/btolar1/weka-german-credit?select=credit-g.csv).

The main modeling tool I am using here is **pycaret**. It is a low-code tool that is essentially a wrapper around other ML tools such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, etc. The uniform low-code approach benefits the readability and maintainability of ML code, while still leaving enough flexibility to tweak models if needed. Moreover, it provides interactive exploratory data analysis.

This notebook was created on Google Colab, which provides a free Python run-time environment that is accessible from just a browser and comes with many Python packages preinstalled. Moreover, the interactive [data table viewer](https://colab.research.google.com/notebooks/data_table.ipynb) inside Colab notebooks is very handy for browsing through data.



# German Credit Introduction

The German credit data set contains 1000 loan applicants, one per row. There are 20 loan and loan applicant characteristics and one target. The target is binary and tells us whether the loan is considered bad or good.

# Installs and packages needed

In [None]:
### install pycaret and autoviz
!pip install pycaret
!pip install autoviz

In [None]:
!pip show pycaret autoviz

In [20]:
#### python packages needed

import pandas as pd
import numpy as np

from autoviz.AutoViz_Class import AutoViz_Class
from pycaret.classification import *
import plotly.express as px

## Data Import and first quality assessment

In [None]:
german_credit = pd.read_csv('credit-g.csv')
german_credit.shape

(1000, 21)

In [23]:
### Create a numerical target (1 = good, 0 = bad), which is useful for plotting purposes.
german_credit['num_target'] =  np.where(german_credit['class'] == 'good', 1, 0)
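
One thing to keep in mind with the `np.where` encoding above: every label that is not exactly `'good'` ends up as 0, including misspelled or missing labels. The sketch below shows one possible check using an explicit mapping instead. It runs on toy labels that are made up for illustration (they are not taken from `credit-g.csv`), and the names `labels`, `encoded` and `unexpected` are hypothetical.

```python
import pandas as pd

# Toy labels (hypothetical); the last one is deliberately malformed.
labels = pd.Series(["good", "bad", "good", "Good "])

# An explicit mapping leaves unexpected labels as NaN
# instead of silently coding them as 0.
encoded = labels.map({"good": 1, "bad": 0})

# Any label that did not match the mapping shows up here.
unexpected = labels[encoded.isna()]
print(unexpected.tolist())
```

If `unexpected` is empty for the real data, the `np.where` encoding and the explicit mapping give the same result.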

In [24]:
german_credit.head(5)

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class,num_target
0,<0,6,critical/other existing credit,radio/tv,1169,no known savings,>=7,4,male single,none,4,real estate,67,none,own,2,skilled,1,yes,yes,good,1
1,0<=X<200,48,existing paid,radio/tv,5951,<100,1<=X<4,2,female div/dep/mar,none,2,real estate,22,none,own,1,skilled,1,none,yes,bad,0
2,no checking,12,critical/other existing credit,education,2096,<100,4<=X<7,2,male single,none,3,real estate,49,none,own,1,unskilled resident,2,none,yes,good,1
3,<0,42,existing paid,furniture/equipment,7882,<100,4<=X<7,2,male single,guarantor,4,life insurance,45,none,for free,1,skilled,2,none,yes,good,1
4,<0,24,delayed previously,new car,4870,<100,1<=X<4,3,male single,none,4,no known property,53,none,for free,2,skilled,2,none,yes,bad,0


## Data Exploration

We are going to use pandas profiling to glance through visuals and get a first understanding of the data. The visuals can be displayed in the notebook, but it may be more convenient to export the output to an HTML file, which makes it easier to dive into the visuals.

In [31]:
from pandas_profiling import ProfileReport
prof = ProfileReport(german_credit, interactions=None,  title="German Credit Profiling Report")
prof.to_file(output_file='german_credit_profile.html')

In [32]:
prof



## Missing data

Be careful when there are missing values. People sometimes fill in the blanks with the median or mean for numerical variables, or with the most frequently occurring level for categorical variables. I think that is a bad thing to do.

1. Find out why the data is missing, and check whether missing data can also occur when the model is applied.
2. A missing value can be a category of its own. For example, when the credit purpose is missing, we could set the category to 'unknown'.
3. In the case of credit scoring, be prudent and don't give the benefit of the doubt. Prevent loan applicants from gaming the system by reporting no income rather than a 'bad' income. Of course, if good loan policies are in place, this should not be possible.
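
Points 2 and 3 can be sketched in pandas as follows. This is only an illustration on a made-up toy frame: the `income` column and its values are assumptions (the German credit data has no such column), and `income_missing` is a hypothetical flag name.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the German credit file).
toy = pd.DataFrame({
    "purpose": ["new car", None, "education", None],
    "income": [2500.0, np.nan, 1800.0, 3200.0],
})

# Point 2: keep missingness visible as its own category level.
toy["purpose"] = toy["purpose"].fillna("unknown")

# Point 3: flag the missing numeric value instead of imputing
# a mean or median that might flatter the applicant.
toy["income_missing"] = toy["income"].isna().astype(int)

# Count the remaining missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

The flag lets a model learn that "no income reported" is informative in itself, rather than treating such applicants as average.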

## The target and input variables

Here we make use of the numerical target (1 for good, 0 for bad). With lowess smoothing (or plain averages for categorical variables), we immediately get an estimate of the fraction of good loans for given values of the inputs.

In [38]:
px.scatter(
    german_credit, 
    x = 'age',
    y = 'num_target',  
    trendline = "lowess", 
    title = 'Good / Bad ratio versus age' 
  )

In [39]:
px.scatter(
    german_credit, 
    x = 'duration',
    y = 'num_target',  
    trendline = "lowess", 
    title = 'Good / Bad ratio versus duration' 
  )

In [42]:
px.histogram(
  german_credit, 
  x = 'own_telephone',
  y = 'num_target',
  histfunc = 'avg'
)
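
The bar heights in the histogram above are per-category averages of `num_target`, i.e. the fraction of good loans in each category. The same numbers can be computed directly with a `groupby`, which also shows how many applicants each average is based on. The sketch below uses a made-up toy frame as a stand-in for `german_credit`; `toy` and `good_share` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for german_credit (hypothetical rows).
toy = pd.DataFrame({
    "own_telephone": ["yes", "none", "none", "yes", "none"],
    "num_target": [1, 0, 1, 1, 0],
})

# Mean of the 1/0 target per category = fraction of good loans;
# the count shows how many rows support each estimate.
good_share = toy.groupby("own_telephone")["num_target"].agg(["mean", "count"])
print(good_share)
```

Looking at the counts next to the averages helps to avoid over-interpreting a category with only a handful of applicants.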

## Predictive Models

## Model explainability

## AI Fairness