# Credit risk for German banks

#### EL = PD * LGD * EAD
- EL: Expected Loss
- PD: Probability of Default
- LGD: Loss Given Default
- EAD: Exposure at Default
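
As a quick illustration of how the three factors combine, here is a minimal sketch of the expected-loss product for a single exposure. The function name, its parameter names, and the input values are assumptions made for this example; they do not come from the dataset.

```python
def expected_loss(prob_default: float, loss_given_default: float, exposure: float) -> float:
    """Expected loss of one exposure as the product of the three factors."""
    # PD and LGD are fractions, so both must lie in [0, 1].
    for name, value in (
        ("prob_default", prob_default),
        ("loss_given_default", loss_given_default),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if exposure < 0:
        raise ValueError("exposure must be non-negative")
    return prob_default * loss_given_default * exposure


# Illustrative inputs only: a 5% default probability, a 45% loss given default,
# and 10,000 currency units outstanding at the time of default.
el = expected_loss(prob_default=0.05, loss_given_default=0.45, exposure=10_000)
print(f"Expected loss: {el:.2f}")
```

The probability of default is the factor this notebook sets out to model; the other two would have to come from separate estimates.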

### Exploratory Data Analysis

In [2]:
import pandas as pd

In [24]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [4]:
df = pd.read_csv('credit.csv')
print(df.shape)

(1000, 21)


In [5]:
df.columns

Index(['Creditability', 'Account Balance', 'Duration of Credit (month)',
       'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',
       'Value Savings/Stocks', 'Length of current employment',
       'Instalment per cent', 'Sex & Marital Status', 'Guarantors',
       'Duration in Current address', 'Most valuable available asset',
       'Age (years)', 'Concurrent Credits', 'Type of apartment',
       'No of Credits at this Bank', 'Occupation', 'No of dependents',
       'Telephone', 'Foreign Worker'],
      dtype='object')

In [6]:
num_records = len(df)
num_records

1000

Check the data types of columns

In [7]:
data_types = df.dtypes
data_types

Creditability                        int64
Account Balance                      int64
Duration of Credit (month)           int64
Payment Status of Previous Credit    int64
Purpose                              int64
Credit Amount                        int64
Value Savings/Stocks                 int64
Length of current employment         int64
Instalment per cent                  int64
Sex & Marital Status                 int64
Guarantors                           int64
Duration in Current address          int64
Most valuable available asset        int64
Age (years)                          int64
Concurrent Credits                   int64
Type of apartment                    int64
No of Credits at this Bank           int64
Occupation                           int64
No of dependents                     int64
Telephone                            int64
Foreign Worker                       int64
dtype: object

In [8]:
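def missing_ratio(data):
    """Percentage of missing values."""
    return data.isna().mean() * 100

def num_diff_vals(data):
    """Number of distinct non-missing values."""
    return data.nunique()

def diff_val(data):
    """Array of the distinct non-missing values."""
    return data.dropna().unique()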

In [9]:
cols_info = df.agg([missing_ratio, pd.Series.max, pd.Series.min])
cols_info

statistic,Creditability,Account Balance,Duration of Credit (month),Payment Status of Previous Credit,Purpose,Credit Amount,Value Savings/Stocks,Length of current employment,Instalment per cent,Sex & Marital Status,Guarantors,Duration in Current address,Most valuable available asset,Age (years),Concurrent Credits,Type of apartment,No of Credits at this Bank,Occupation,No of dependents,Telephone,Foreign Worker
missing_ratio,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,4.0,72.0,4.0,10.0,18424.0,5.0,5.0,4.0,4.0,3.0,4.0,4.0,75.0,3.0,3.0,4.0,4.0,2.0,2.0,2.0
min,0.0,1.0,4.0,0.0,0.0,250.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,19.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Check the target for class imbalance

In [10]:
df['Creditability'].value_counts(normalize=True)

1    0.7
0    0.3
Name: Creditability, dtype: float64

To handle the class imbalance, I will try two different approaches:
- Upsampling the minority class + logistic regression
- A tree-based statistical learning algorithm
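
The first approach can be sketched with pandas alone. The helper below, `upsample_minority`, is a hypothetical name introduced for this example, and the toy frame is synthetic: it only mimics the 70/30 split seen above. It resamples the smaller class with replacement until it matches the larger one. In practice the upsampling would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.

```python
import pandas as pd


def upsample_minority(data: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Resample every smaller class with replacement up to the majority class size."""
    counts = data[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = data[data[target] == label]
        if count < majority_size:
            # Draw the missing rows from the same class, allowing duplicates.
            extra = group.sample(
                n=majority_size - count, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    combined = pd.concat(parts).sample(frac=1, random_state=random_state)
    return combined.reset_index(drop=True)


# Synthetic stand-in with the same 70/30 class split as the target column.
toy = pd.DataFrame({
    "Creditability": [1] * 7 + [0] * 3,
    "Credit Amount": list(range(10)),
})
balanced = upsample_minority(toy, "Creditability")
print(balanced["Creditability"].value_counts())
```

After the call both classes have the same number of rows, and the original rows are all still present.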

Check for imbalanced value distributions in the categorical columns.

In [13]:
numerical_cols = ['Duration of Credit (month)', 'Credit Amount', 'Age (years)']

In [43]:
(pd.DataFrame(
    df.loc[:, ~df.columns.isin(numerical_cols)]
    .melt(var_name='column', value_name='value')
    .groupby(by=['column'])['value'].apply(pd.Series.value_counts, normalize=True))
.sort_values(by=['column', 'value']))

column,value,proportion
Account Balance,3,0.063
Account Balance,2,0.269
Account Balance,1,0.274
Account Balance,4,0.394
Concurrent Credits,2,0.047
Concurrent Credits,1,0.139
Concurrent Credits,3,0.814
Creditability,0,0.3
Creditability,1,0.7
Duration in Current address,1,0.13


The "Guarantors" and "Foreign Worker" features are heavily imbalanced, so I will drop them from the data.
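
One way to make this kind of judgement repeatable is to compute, for each categorical column, the share of rows taken by its most frequent value and flag columns above a cutoff. The sketch below assumes a hypothetical helper `dominant_share` and an arbitrary 0.9 threshold. Neither comes from the notebook, and the toy frame is synthetic.

```python
import pandas as pd


def dominant_share(data: pd.DataFrame, columns) -> pd.Series:
    """Share of rows taken by the most frequent value in each listed column."""
    shares = {col: data[col].value_counts(normalize=True).max() for col in columns}
    return pd.Series(shares).sort_values(ascending=False)


# Synthetic example: one column dominated by a single value, one evenly spread.
toy = pd.DataFrame({
    "skewed": [1] * 19 + [2],
    "even": [1, 2, 3, 4] * 5,
})
shares = dominant_share(toy, ["skewed", "even"])

THRESHOLD = 0.9  # assumed cutoff, chosen only for illustration
flagged = shares[shares > THRESHOLD].index.tolist()
print(shares)
print("Flagged:", flagged)
```

On the notebook's data the same helper could be called with the non-numerical columns of `df`; the appropriate cutoff is a modelling choice.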

### Data preprocessing

Omit features that are intuitively unrelated to creditability, together with the heavily imbalanced ones identified above

In [None]:
omitted_features = ['Telephone', 'Guarantors', 'Foreign Worker']
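
A possible next step, sketched under assumptions, is to drop the listed columns and separate the target from the remaining features. The helper name `split_features_target` is hypothetical, and the two-row frame is a synthetic placeholder that only reuses some of the column names.

```python
import pandas as pd

omitted_features = ["Telephone", "Guarantors", "Foreign Worker"]


def split_features_target(data: pd.DataFrame, target: str, omitted):
    """Return (features, target) with the target and omitted columns removed from the features."""
    features = data.drop(columns=[target, *omitted])
    return features, data[target]


# Synthetic placeholder rows; only the column names mirror the real data.
toy = pd.DataFrame({
    "Creditability": [1, 0],
    "Credit Amount": [1000, 2500],
    "Age (years)": [30, 45],
    "Telephone": [1, 2],
    "Guarantors": [1, 1],
    "Foreign Worker": [1, 1],
})
X, y = split_features_target(toy, "Creditability", omitted_features)
print(list(X.columns))
```

`DataFrame.drop` returns a new frame by default, so the original data stays available for further exploration.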

## Probability of Default

In [None]:
def pd(data):
    p