# Introduction

The German credit rating dataset, provided by Prof. Hofmann, contains categorical/symbolic and numerical attributes of the people who took credit, along with the status of each credit at the time the dataset was prepared. The original data codes the status as 1 for good credit and 2 for bad credit. For this exercise, the values have been recoded to 0 for good credit and 1 for bad credit.

The detailed description of variables can be found at the following link.

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

| Variable | Variable Type| Description | Categories | 
| :----------| :-------------- |:--------------| :--------------|
|checkin_acc| categorical | Status of existing checking account | <ul><li> A11 : ... < 0 DM </li><li> A12 : 0 <= ... < 200 DM </li><li> A13 : ... >= 200 DM / salary assignments for at least 1 year </li><li> A14 : no checking account </li></ul>|
|duration| numerical | Duration | |
|credit_history| categorical | Credit History | <ul><li>A30 : no credits taken/ all credits paid back duly </li><li>A31 : all credits at this bank paid back duly </li><li>A32 : existing credits paid back duly till now </li><li>A33 : delay in paying off in the past </li><li>A34 : critical account/ other credits existing (not at this bank) </li></ul>|
|amount| numerical | Credit amount | |
|savings_acc| categorical | Savings account/bonds | <ul><li>A61 : ... < 100 DM </li><li>A62 : 100 <= ... < 500 DM </li><li>A63 : 500 <= ... < 1000 DM </li><li>A64 : ... >= 1000 DM </li><li>A65 : unknown/ no savings account </li></ul>|
|present_emp_since| categorical | Present employment since | <ul><li>A71 : unemployed </li><li>A72 : ... < 1 year </li><li>A73 : 1 <= ... < 4 years </li><li>A74 : 4 <= ... < 7 years </li><li>A75 : ... >= 7 years </li></ul>|
|inst_rate| numerical | Installment rate in percentage of disposable income  | |
|personal_status| categorical | Personal status and sex | <ul><li>A91 : male : divorced/separated </li><li>A92 : female : divorced/separated/married </li><li>A93 : male : single </li><li>A94 : male : married/widowed </li><li>A95 : female : single </li></ul>|
|residing_since| numerical | residing since in years | |
|age| numerical | age in years | |
|inst_plans| categorical | Other installment plans | <ul><li>A141 : bank </li><li>A142 : stores </li><li>A143 : none </li></ul> |
|num_credits| numerical | Number of existing credits at this bank | |
|job| categorical | job | <ul><li>A171 : unemployed/ unskilled - non-resident </li><li>A172 : unskilled - resident </li><li>A173 : skilled employee / official </li> <li>A174 : management/ self-employed/highly qualified employee/ officer </li> </ul> |
|status| categorical | Credit status | <ul><li> 0: Good Credit </li><li> 1: Bad Credit </li></ul>|

## Loading the dataset

In [None]:
import pandas as pd
import numpy as np
import seaborn as sn
import random

In [None]:
random_state = 100
np.random.seed(100)
random.seed(100)

In [None]:
credit_df = pd.read_csv( "https://raw.githubusercontent.com/manaranjanp/MICA_Classes/main/cases/German_Credit_Data.csv" )

In [None]:
credit_df.head()

In [None]:
credit_df.columns

In [None]:
credit_df.info();

### Distribution of Good and Bad Credits

In [None]:
credit_df.status.value_counts()

In [None]:
sn.countplot( data = credit_df,
              x = 'status' );

There are about 300 bad credits (defaults) and 700 good credits (non-defaults), so the two classes are imbalanced.

## Creating Dummy Features

In [None]:
credit_df.columns

In [None]:
list( credit_df.columns )

## Selecting Features

In [None]:
X_features = list( credit_df.columns )
X_features.remove( 'status' )
X_features

In [None]:
encoded_credit_df = pd.get_dummies( credit_df[X_features], drop_first = True )

In [None]:
encoded_credit_df.columns
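
To make the effect of drop_first concrete, here is a small sketch on a made-up two-column frame. The toy values and the names toy_df and toy_encoded are illustrative assumptions, not rows or objects from the dataset. With drop_first = True, a categorical column with k levels becomes k - 1 indicator columns, and the dropped level acts as the baseline.

```python
import pandas as pd

# Toy frame: one categorical and one numerical column (illustrative values only)
toy_df = pd.DataFrame({"checkin_acc": ["A11", "A12", "A14", "A12"],
                       "duration": [6, 48, 12, 24]})

# Three levels (A11, A12, A14) become two dummy columns; A11 is the dropped baseline
toy_encoded = pd.get_dummies(toy_df, drop_first=True)
print(list(toy_encoded.columns))
```

Numerical columns such as duration pass through unchanged.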

### Set the Response Variable and Independent Variables

In [None]:
Y = credit_df.status
X = encoded_credit_df

## Splitting Datasets into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y, 
                                                    test_size = 0.3, 
                                                    random_state = 42 )

In [None]:
X_train[0:2]

# Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg_v1 = LogisticRegression(random_state=100, max_iter=1000)

In [None]:
logreg_v1.fit( X_train, y_train )

## Predict Test Data and Measure Accuracy

### Assuming default if probability is more than 0.5 

In [None]:
y_preds = logreg_v1.predict( X_test )

In [None]:
y_preds[0:10]

In [None]:
y_pred_df = pd.DataFrame( {"actual": y_test, "predicted" : y_preds } )

In [None]:
y_pred_df[0:10]

## Build a Confusion Matrix

Note: Discuss the importance of FPs and FNs
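
As a starting point for that discussion, the sketch below reads false positives and false negatives off a confusion matrix built with labels = [1, 0], treating bad credit (1) as the positive class. The labels are made up for illustration, and all toy_ names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = bad credit, 0 = good credit
toy_actual    = np.array([1, 1, 1, 0, 0, 0, 0, 0])
toy_predicted = np.array([1, 0, 0, 1, 0, 0, 0, 0])

# With labels=[1, 0], row 0 and column 0 refer to bad credit
toy_cm = confusion_matrix(toy_actual, toy_predicted, labels=[1, 0])
tp, fn = toy_cm[0]   # actual bad credits: caught, missed
fp, tn = toy_cm[1]   # actual good credits: wrongly flagged, correctly passed

recall_bad = tp / (tp + fn)       # share of bad credits that were caught
precision_bad = tp / (tp + fp)    # share of flagged credits that were really bad
print(toy_cm)
print(recall_bad, precision_bad)
```

A false negative here is a bad credit predicted as good, while a false positive is a good credit that gets flagged as bad.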

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

In [None]:
cm = confusion_matrix(y_pred_df.actual, y_pred_df.predicted, labels= [1, 0])
cm

In [None]:
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm, 
                                 display_labels=['Bad Credit', 'Good Credit'])
cm_plot.plot();

### Overall accuracy of the model

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_pred_df.actual, y_pred_df.predicted))

## Dealing with Imbalanced Data: Class Weights

Note: Discuss the loss function for the logistic regression
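
One way to frame that discussion is a minimal sketch of a class-weighted log loss, written from scratch with NumPy. The function name weighted_log_loss and the toy inputs are assumptions for illustration; the sketch also leaves out the regularization term that scikit-learn's LogisticRegression adds by default, so it is not the library's exact objective.

```python
import numpy as np

def weighted_log_loss(y_true, p_bad, class_weight):
    """Weighted average of the per-row binary cross-entropy."""
    y_true = np.asarray(y_true)
    p_bad = np.asarray(p_bad, dtype=float)
    # Per-row loss: -log(p) for bad credits, -log(1 - p) for good credits
    row_loss = -(y_true * np.log(p_bad) + (1 - y_true) * np.log(1 - p_bad))
    # Each row is weighted by the weight of its true class
    row_weight = np.where(y_true == 1, class_weight[1], class_weight[0])
    return np.sum(row_weight * row_loss) / np.sum(row_weight)

toy_y = [0, 1]
toy_p = [0.1, 0.1]   # the model is confident that both are good credits

plain_loss = weighted_log_loss(toy_y, toy_p, {0: 1.0, 1: 1.0})
weighted_loss = weighted_log_loss(toy_y, toy_p, {0: 1.0, 1: 3.0})
print(plain_loss, weighted_loss)
```

Giving the bad-credit class a larger weight makes the missed bad credit count for more, so the weighted loss is higher than the plain one for the same predictions.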

In [None]:
from sklearn.utils.class_weight import compute_class_weight

In [None]:
class_weights = compute_class_weight(class_weight= 'balanced', 
                                     classes = np.unique(y_train),
                                     y = y_train)

In [None]:
class_weights
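
The 'balanced' option follows the heuristic n_samples / (n_classes * count of each class). The sketch below reproduces it by hand on a toy target with a 700 / 300 split; toy_y and the other names are illustrative assumptions, so the numbers will differ from those computed on y_train.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with a 700 / 300 split of good and bad credits (illustrative only)
toy_y = np.array([0] * 700 + [1] * 300)

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual_weights = len(toy_y) / (2 * np.bincount(toy_y))

sk_weights = compute_class_weight(class_weight="balanced",
                                  classes=np.unique(toy_y),
                                  y=toy_y)
print(manual_weights, sk_weights)
```

The rarer class receives the larger weight, which is what pushes the model to pay more attention to bad credits.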

In [None]:
class_weights_dict = dict(zip(np.unique(y_train), class_weights))
class_weights_dict

In [None]:
balanced_logreg_v2 = LogisticRegression(random_state=100, max_iter=1000, class_weight=class_weights_dict)

In [None]:
balanced_logreg_v2.fit(X_train, y_train)

In [None]:
y_pred_v2 = balanced_logreg_v2.predict(X_test)

In [None]:
cm_v2 = confusion_matrix(y_test, y_pred_v2, labels= [1, 0])

In [None]:
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm_v2, 
                                 display_labels=['Bad Credit', 'Good Credit'])
cm_plot.plot();

In [None]:
print(classification_report(y_test, y_pred_v2))

# Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_clf = DecisionTreeClassifier(max_depth=5, 
                                  criterion='gini', 
                                  class_weight=class_weights_dict)

In [None]:
tree_clf.fit(X_train, y_train)

In [None]:
y_pred_dtrees = tree_clf.predict(X_test)

In [None]:
cm_tree = confusion_matrix(y_test, y_pred_dtrees, labels=[1, 0])

In [None]:
cm_tree

In [None]:
print(classification_report(y_test, y_pred_dtrees))

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize = (50, 12))
plot_tree(tree_clf,
          feature_names = list(X_test.columns),
          class_names = ['Good Credit', 'Bad Credit'],
          filled = True,
          max_depth=3,
          fontsize = 10);
plt.savefig('tree.png')

## Participant Exercise:

1. Increase max_depth to 7 and check whether it gives a better recall value.
2. Set criterion to entropy and check whether it gives a better recall value.

## Explain the concept of underfitting and overfitting
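
A rough way to see both effects is to compare training and test accuracy as the tree is allowed to grow deeper. The sketch below uses synthetic data from make_classification rather than the credit dataset, and every name in it (X_toy, depth_scores, and so on) is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the credit dataset)
X_toy, y_toy = make_classification(n_samples=1000, n_features=20,
                                   n_informative=5, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.3, random_state=100)

depth_scores = {}
for depth in [1, 5, None]:   # None lets the tree grow until the leaves are pure
    toy_tree = DecisionTreeClassifier(max_depth=depth, random_state=100)
    toy_tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    depth_scores[depth] = (toy_tree.score(X_tr, y_tr), toy_tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A very shallow tree tends to score modestly on both sets (underfitting), while an unrestricted tree fits the training set perfectly and typically does worse on the test set than on the training set (overfitting).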

### Feature Importance

In [None]:
tree_clf.feature_importances_

In [None]:
feature_imp_df = pd.DataFrame({'features': X_train.columns,
                               'importance': tree_clf.feature_importances_})

In [None]:
feature_imp_df = feature_imp_df.sort_values("importance", ascending = False)
feature_imp_df

In [None]:
feature_imp_df[0:10]

## Finding Optimal Cutoff Probability for Bad Credits

### Find optimal cutoff probability using cost 

In [None]:
tree_clf.predict_proba(X_test)[0:10]

In [None]:
y_pred_tree_df = pd.DataFrame( { "actual": y_test,
                                 "predicted": y_pred_dtrees } )

In [None]:
y_pred_tree_df['pred_probs'] = tree_clf.predict_proba(X_test)[:, 1]

In [None]:
y_pred_tree_df.sample(10)

In [None]:
y_pred_tree_df['predicted_new'] = y_pred_tree_df.pred_probs.map(lambda x: 1 if x > 0.5 else 0)

In [None]:
cm_1 = confusion_matrix( y_pred_tree_df.actual, y_pred_tree_df.predicted_new, labels = [1,0] )

In [None]:
cm_1

In [None]:
cm_1[1, 0] # false positives

In [None]:
cm_1[0, 1] # false negatives

In [None]:
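def get_total_cost( actual, predicted ):
    # Rows are actual classes and columns are predicted classes, both ordered [1, 0]
    cm_mat = confusion_matrix( actual, predicted, labels = [1,0] )
    # A false negative (missed bad credit) is assumed to cost 3x a false positive
    return cm_mat[0,1] * 3 + cm_mat[1,0] * 1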

In [None]:
get_total_cost( y_pred_tree_df.actual, y_pred_tree_df.predicted_new )

In [None]:
cost_df = pd.DataFrame( columns = ['prob', 'cost'])

In [None]:
cutoff_probs = np.arange(0, 1, 0.05)

In [None]:
idx = 0
for cutoff in cutoff_probs:
    cost = get_total_cost(y_pred_tree_df.actual, 
                          y_pred_tree_df.pred_probs.map(lambda x: 1 if x > cutoff  else 0))
    cost_df.loc[idx] = [cutoff, cost]
    idx += 1

In [None]:
cost_df.sort_values( 'cost', ascending = True )[0:5]
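
The same search can be followed by hand on a tiny example. The sketch below uses made-up labels and probabilities with the same assumed 3 : 1 cost ratio; toy_total_cost and the other toy_ names are hypothetical and are not part of the notebook's objects.

```python
import numpy as np

# Made-up labels and predicted probabilities of bad credit (illustrative only)
toy_actual = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
toy_probs = np.array([0.9, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.2, 0.1, 0.05])

def toy_total_cost(actual, probs, cutoff, fn_cost=3, fp_cost=1):
    predicted = (probs > cutoff).astype(int)
    fn = np.sum((actual == 1) & (predicted == 0))   # bad credits predicted as good
    fp = np.sum((actual == 0) & (predicted == 1))   # good credits predicted as bad
    return int(fn * fn_cost + fp * fp_cost)

# Total cost at each candidate cutoff, then the cutoff with the lowest cost
toy_costs = {cutoff: toy_total_cost(toy_actual, toy_probs, cutoff)
             for cutoff in [0.25, 0.5, 0.75]}
best_cutoff = min(toy_costs, key=toy_costs.get)
print(toy_costs, best_cutoff)
```

Because a missed bad credit is costed higher than a wrongly flagged good credit, the cheapest cutoff in this toy case is the lowest one, which flags more applicants as bad.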

In [None]:
y_pred_tree_df['predicted_final'] = y_pred_tree_df.pred_probs.map( lambda x: 1 if x > 0.45 else 0)

In [None]:
cm_tree_v2 = confusion_matrix(y_pred_tree_df.actual, 
                              y_pred_tree_df.predicted_final, labels= [1, 0])

In [None]:
cm_plot = ConfusionMatrixDisplay(confusion_matrix=cm_tree_v2, 
                                 display_labels=['Bad Credit', 
                                                 'Good Credit'])
cm_plot.plot();

In [None]:
print(classification_report(y_pred_tree_df.actual, y_pred_tree_df.predicted_final))