### **5. Modeling**
---

We will perform:
1. Forward Selection Procedure
2. Best Model Adjustment

#### **Perform Forward Selection Procedure**
---
Begin with the null model (no predictors), then add predictors one at a time, at each step choosing the predictor that gives the greatest additional improvement to the model.
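The greedy loop described above can be sketched independently of any estimator. This is a minimal illustration, not the notebook's implementation: `forward_selection`, `toy_score`, and the integer `GAIN` table are hypothetical names and made-up numbers, standing in for a cross-validated score.

```python
def forward_selection(candidates, score_fn):
    """Greedy forward selection; returns the best subset found for each size k."""
    selected = []
    remaining = list(candidates)
    path = [([], score_fn([]))]  # k = 0: the null model
    while remaining:
        # Try adding each remaining candidate and keep the one that scores highest
        best = max(remaining, key=lambda p: score_fn(selected + [p]))
        selected = selected + [best]
        remaining.remove(best)
        path.append((selected, score_fn(selected)))
    return path


# Toy score (assumption): each predictor adds a fixed gain, each one costs 10
GAIN = {"A": 30, "B": 20, "C": 5}


def toy_score(subset):
    return sum(GAIN[p] for p in subset) - 10 * len(subset)


path = forward_selection(["A", "B", "C"], toy_score)
best_subset, best_score = max(path, key=lambda row: row[1])
print(path)
print(best_subset, best_score)
```

As in the procedure above, the score is allowed to fall as predictors are added, so the best subset is picked from the whole path rather than from its last step.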

In [1]:
# Import library
import pandas as pd
import numpy as np

# Import library for modeling
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Load configuration
import src.utils as utils

In [2]:
CONFIG_DATA = utils.config_load()
CONFIG_DATA

{'raw_dataset_path': 'data/raw/german_credit_data.csv',
 'dataset_path': 'data/output/data.pkl',
 'predictors_set_path': 'data/output/predictors.pkl',
 'response_set_path': 'data/output/response.pkl',
 'train_path': ['data/output/X_train.pkl', 'data/output/y_train.pkl'],
 'test_path': ['data/output/X_test.pkl', 'data/output/y_test.pkl'],
 'data_train_path': 'data/output/data_train.pkl',
 'data_train_binned_path': 'data/output/data_train_binned.pkl',
 'crosstab_list_path': 'data/output/crosstab_list.pkl',
 'WOE_table_path': 'data/output/WOE_table.pkl',
 'IV_table_path': 'data/output/IV_table.pkl',
 'WOE_map_dict_path': 'data/output/WOE_map_dict.pkl',
 'X_train_woe_path': 'data/output/X_train_woe.pkl',
 'response_variable': 'Risk',
 'test_size': 0.2,
 'num_variable': ['Age', 'Credit_amount', 'Duration'],
 'cat_variable': ['Sex',
  'Job',
  'Housing',
  'Saving_accounts',
  'Checking_account',
  'Purpose'],
 'missing_columns': ['Saving_accounts', 'Checking_account'],
 'num_of_bins': 4,
 '

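Define the function `forward()`, which performs one step of forward selection: for each remaining predictor, it adds that predictor to the current set, fits a logistic regression on the training folds, and records the mean cross-validation (CV) score on the held-out folds. It returns the scores of all candidates together with the best one.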

In [3]:
# Function to perform one step of the forward selection procedure
def forward(X, y, predictors, scoring='roc_auc', cv=5):
    """Add each remaining predictor in turn and cross-validate the resulting model."""

    # Define sample size and number of all predictors
    n_samples, n_predictors = X.shape

    # Define list of all predictors
    col_list = np.arange(n_predictors)

    # Define remaining predictors for each k
    remaining_predictors = [p for p in col_list if p not in predictors]

    # Initialize list of predictors and its CV Score
    pred_list = []
    score_list = []

    # Cross validate each possible combination of remaining predictors
    for p in remaining_predictors:
        combi = predictors + [p]

        # Extract predictors combination
        X_ = X[:, combi]
        y_ = y

        # Define the estimator
        model = LogisticRegression(penalty = None,
                                   class_weight = 'balanced')

        # Cross-validate the model with the chosen scoring metric
        cv_results = cross_validate(estimator = model,
                                    X = X_,
                                    y = y_,
                                    scoring = scoring,
                                    cv = cv)

        # Calculate the average CV score
        score_ = np.mean(cv_results['test_score'])

        # Append predictors combination and its CV Score to the list
        pred_list.append(list(combi))
        score_list.append(score_)

    # Tabulate the results
    models = pd.DataFrame({"Predictors": pred_list,
                           "CV Score": score_list})

    # Choose the best model (argmax returns a position, so index with iloc)
    best_model = models.iloc[models['CV Score'].argmax()]

    return models, best_model

In [4]:
# Function to perform forward selection on all characteristics
def run_forward():
    """Function to perform forward selection on all characteristics"""

    cv = CONFIG_DATA['num_of_cv']
    scoring = CONFIG_DATA['scoring']

    X_train_woe_path = CONFIG_DATA['X_train_woe_path']
    X_train_woe = utils.pickle_load(X_train_woe_path)
    X_train = X_train_woe.to_numpy()

    y_train_path = CONFIG_DATA['train_path'][1]
    y_train = utils.pickle_load(y_train_path)
    y_train = y_train.to_numpy()

    # First, fit the null model
    # The null model has no predictors
    predictor = []

    # Use a single all-zero column as a placeholder so that only the intercept is fitted
    X_null = np.zeros((X_train.shape[0], 1))

    # Define the estimator
    model = LogisticRegression(penalty = None,
                               class_weight = 'balanced')

    # Cross validate
    cv_results = cross_validate(estimator = model,
                                X = X_null,
                                y = y_train,
                                cv = cv,
                                scoring = scoring)

    # Calculate the average CV score
    score_ = np.mean(cv_results['test_score'])

    # Create table for the best model of each k predictors
    # Append the results of null model
    forward_models = pd.DataFrame({"Predictors": [predictor],
                                   "CV Score": [score_]})

    # Next, perform forward selection for all predictors
    # Define list of predictors
    predictors = []
    n_predictors = X_train.shape[1]

    # Perform forward selection procedure for k=1,...,n_predictors
    for k in range(n_predictors):
        _, best_model = forward(X = X_train,
                                y = y_train,
                                predictors = predictors,
                                scoring = scoring,
                                cv = cv)

        # Tabulate the best model of each k predictors
        forward_models.loc[k+1] = best_model
        predictors = best_model['Predictors']

    # Find the best CV score (idxmax returns the index label, which loc expects)
    best_idx = forward_models['CV Score'].idxmax()
    best_cv_score = forward_models['CV Score'].loc[best_idx]
    best_predictors = forward_models['Predictors'].loc[best_idx]

    # Print the summary
    print('===================================================')
    print('Best index            :', best_idx)
    print('Best CV Score         :', best_cv_score)
    print('Best predictors (idx) :', best_predictors)
    print('Best predictors       :')
    print(X_train_woe.columns[best_predictors].tolist())
    print('===================================================')

    print(forward_models)
    print('===================================================')
    
    forward_models_path = CONFIG_DATA['forward_models_path']
    utils.pickle_dump(forward_models, forward_models_path)

    best_predictors_path = CONFIG_DATA['best_predictors_path']
    utils.pickle_dump(best_predictors, best_predictors_path)

    return forward_models, best_predictors

In [5]:
# Run the function
run_forward()

Best index            : 1
Best CV Score         : 0.8291666666666666
Best predictors (idx) : [4]
Best predictors       :
['Saving_accounts']
                    Predictors  CV Score
0                           []  0.000000
1                          [4]  0.829167
2                       [4, 1]  0.812500
3                    [4, 1, 5]  0.779167
4                 [4, 1, 5, 0]  0.762500
5              [4, 1, 5, 0, 8]  0.762500
6           [4, 1, 5, 0, 8, 7]  0.770833
7        [4, 1, 5, 0, 8, 7, 2]  0.766667
8     [4, 1, 5, 0, 8, 7, 2, 6]  0.741667
9  [4, 1, 5, 0, 8, 7, 2, 6, 3]  0.729167
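
A CV score of exactly 0 for the null model is what one would expect if the configured metric is recall. That is an assumption, because the `scoring` entry is not visible in the configuration listing. With `class_weight='balanced'` and an all-zero predictor, the weighted classes are equally heavy, so the fitted intercept is 0, every predicted probability is 0.5, and a rule that predicts class 1 only for a strictly positive decision value never predicts class 1. The sketch below works through that arithmetic with made-up class counts; the helper names are hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction


def balanced_class_weights(y):
    """Weights n_samples / (n_classes * class_count), kept exact with Fraction."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: Fraction(n, k * c) for label, c in counts.items()}


# Made-up training labels (assumption): 560 of class 0 and 240 of class 1
y = [0] * 560 + [1] * 240
counts = Counter(y)
weights = balanced_class_weights(y)

# Intercept-only weighted fit: log of (weighted class-1 mass / weighted class-0 mass)
intercept = math.log((weights[1] * counts[1]) / (weights[0] * counts[0]))
probability = 1.0 / (1.0 + math.exp(-intercept))

# Predict class 1 only when the decision value is strictly positive
predicted = [int(intercept > 0)] * len(y)
true_positives = sum(1 for t, p in zip(y, predicted) if t == 1 and p == 1)
recall = true_positives / counts[1]
print(intercept, probability, recall)
```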



In [6]:
X_train_path = CONFIG_DATA['X_train_woe_path']
X_train_woe = utils.pickle_load(X_train_path)
X_train = X_train_woe

In [7]:
X_train

|     | Age | Sex | Job | Housing | Saving_accounts | Checking_account | Credit_amount | Duration | Purpose |
|-----|-----|-----|-----|---------|-----------------|------------------|---------------|----------|---------|
| 485 | 0.097164 | 0.127017 | -0.200671 | 0.190279 | -0.278534 | -0.329092 | -0.070452 | 0.454913 | -0.103148 |
| 390 | -0.162248 | 0.127017 | -0.200671 | 0.190279 | -0.278534 | 1.183691 | 0.251314 | 0.018868 | -0.103148 |
| 23 | 0.097164 | 0.127017 | 0.007692 | 0.190279 | -0.136451 | -0.329092 | 0.251314 | 0.454913 | -0.103148 |
| 814 | 0.097164 | 0.127017 | 0.007692 | -0.624154 | -0.278534 | -0.902358 | 0.447748 | -0.613683 | -0.103148 |
| 107 | -0.162248 | 0.127017 | 0.007692 | 0.190279 | -0.278534 | -0.329092 | -0.524524 | 0.454913 | -0.103148 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 324 | 0.433636 | -0.277765 | 0.007692 | 0.190279 | -0.278534 | 1.183691 | -0.070452 | 0.018868 | -0.103148 |
| 428 | -0.250071 | 0.127017 | 0.007692 | 0.190279 | -0.278534 | 1.183691 | -0.070452 | 0.454913 | -0.143569 |
| 637 | -0.250071 | 0.127017 | 0.007692 | 0.190279 | -0.278534 | 1.183691 | -0.524524 | -0.613683 | 0.457651 |
| 688 | 0.433636 | 0.127017 | 0.007692 | 0.190279 | -0.136451 | 1.183691 | 0.447748 | 0.454913 | 0.457651 |


In [8]:
# Function to fit the best model on whole X_train
def best_model_fitting(best_predictors):
    """Function to fit best model on whole X_train"""

    X_train_path = CONFIG_DATA['X_train_woe_path']
    X_train_woe = utils.pickle_load(X_train_path)
    X_train = X_train_woe.to_numpy()

    y_train_path = CONFIG_DATA['train_path'][1]
    y_train = utils.pickle_load(y_train_path)
    y_train = y_train.to_numpy()

    # Fall back to the predictors saved by run_forward() when none are given
    if best_predictors is None:
        best_predictors_path = CONFIG_DATA['best_predictors_path']
        best_predictors = utils.pickle_load(best_predictors_path)
        print("Best predictors index   :", best_predictors)
    else:
        print("[Adjusted] best predictors index   :", best_predictors)

    # Define X with best predictors
    X_train_best = X_train[:, best_predictors]

    # Fit best model
    best_model = LogisticRegression(penalty = None,
                                    class_weight = 'balanced')
    best_model.fit(X_train_best, y_train)

    print(best_model)

    # Extract the best model's parameter estimates
    best_model_intercept = pd.DataFrame({'Characteristic': 'Intercept',
                                         'Estimate': best_model.intercept_})
    
    best_model_params = X_train_woe.columns[best_predictors].tolist()

    best_model_coefs = pd.DataFrame({'Characteristic': best_model_params,
                                     'Estimate': np.reshape(best_model.coef_, 
                                                            len(best_predictors))})

    best_model_summary = pd.concat((best_model_intercept, best_model_coefs),
                                   axis = 0,
                                   ignore_index = True)
    
    print('===================================================')
    print(best_model_summary)
    
    best_model_path = CONFIG_DATA['best_model_path']
    utils.pickle_dump(best_model, best_model_path)

    best_model_summary_path = CONFIG_DATA['best_model_summary_path']
    utils.pickle_dump(best_model_summary, best_model_summary_path)

    return best_model, best_model_summary

In [9]:
# Check the function
best_model_fitting(best_predictors = None)

Best predictors index   : [4]
LogisticRegression(class_weight='balanced', penalty=None)
    Characteristic      Estimate
0        Intercept  6.516535e-17
1  Saving_accounts -1.000000e+00
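
An intercept of essentially 0 and a slope of -1 are what a single WOE-coded predictor should give under balanced class weights. The sketch below assumes that WOE is defined as the log of (share of class 0 in the bin) over (share of class 1 in the bin) and that it was computed on the same training rows. In that case the weighted log-odds of class 1 in each bin equal minus the bin's WOE. The bin names and counts are made up.

```python
import math

# Hypothetical counts per bin: (class-0 count, class-1 count)
bins = {"low": (100, 80), "mid": (250, 90), "high": (210, 70)}

n0 = sum(c0 for c0, _ in bins.values())
n1 = sum(c1 for _, c1 in bins.values())
n = n0 + n1

# Balanced class weights: n_samples / (n_classes * class_count)
w0 = n / (2 * n0)
w1 = n / (2 * n1)

# WOE per bin under the assumed definition
woe = {b: math.log((c0 / n0) / (c1 / n1)) for b, (c0, c1) in bins.items()}

# Weighted log-odds of class 1 per bin, i.e. what a saturated weighted fit would return
weighted_logit = {b: math.log((w1 * c1) / (w0 * c0)) for b, (c0, c1) in bins.items()}

for b in bins:
    # The two columns differ only in sign, so logit = 0 + (-1) * WOE fits every bin
    print(b, round(woe[b], 6), round(weighted_logit[b], 6))
```

Because the line with intercept 0 and slope -1 reproduces the per-bin weighted log-odds exactly, it is also the maximum-likelihood fit of the one-predictor model. With several WOE predictors the coefficients generally move away from -1, as the adjusted model further down shows.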



#### **Best Model Adjustment**
---

Scorecards with too few characteristics are generally unable to withstand the test of time:
  - They are susceptible to minor changes in the applicant profile.
  - A good adjudicator will never look at just two characteristics from an application form to make a decision.

We will therefore add characteristics back into the final model instead of keeping only the single characteristic chosen by forward selection.
  - According to the independence test, none of the characteristics is independent of the response variable (probability of default).
  - A final scorecard generally consists of 8 to 15 characteristics.

In [10]:
# Adjust the best predictors
best_model_fitting(best_predictors = [0, 1, 2, 4, 5, 8])

[Adjusted] best predictors index   : [0, 1, 2, 4, 5, 8]
LogisticRegression(class_weight='balanced', penalty=None)
     Characteristic  Estimate
0         Intercept -0.005353
1               Age -0.770857
2               Sex -0.891186
3               Job -1.103236
4   Saving_accounts -0.663096
5  Checking_account -0.923298
6           Purpose -0.858377
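
The index list refers to column positions in `X_train_woe`. As a small aid for reading the summary, the sketch below maps the reported positions `[0, 1, 2, 4, 5, 8]` to characteristic names and lists the characteristics that were left out. The column order is taken from the `X_train` preview; the helper name `names_for` is hypothetical.

```python
# Column order as shown in the X_train preview
columns = ["Age", "Sex", "Job", "Housing", "Saving_accounts",
           "Checking_account", "Credit_amount", "Duration", "Purpose"]


def names_for(indices, columns):
    """Translate positional predictor indices into column names."""
    return [columns[i] for i in indices]


reported = [0, 1, 2, 4, 5, 8]  # indices echoed in the "[Adjusted]" line
included = names_for(reported, columns)
excluded = [name for i, name in enumerate(columns) if i not in reported]

print("Included:", included)
print("Excluded:", excluded)
```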

