# Predicting the outcome of loan applications
# 3a. Logistic regression model
For this problem, we want as few false negatives as possible. False negatives, i.e. people we accept but should have rejected, pose the greater risk, because they could lead to the loss of the capital lent as well as the expected revenue from interest. While high recall means few false negatives, we cannot tune model parameters by optimising recall alone: that would push the model towards the trivial behaviour of rejecting every applicant, which has a recall of 1 but is useless in practice. For that reason, the grid search below optimises the F1 score, which balances recall against precision.
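To illustrate why recall alone is a poor tuning target, here is a minimal sketch with made-up labels (not the loan data). It assumes that the positive class, `1`, marks applicants who should be rejected, and it scores a classifier that flags everyone as positive.

```python
import numpy as np

# Made-up ground truth: 30 applicants who should be rejected (1), 70 who should not (0).
y_true = np.array([1] * 30 + [0] * 70)

# Trivial classifier: predict the positive class for every applicant.
y_pred_all = np.ones_like(y_true)

tp = int(np.sum((y_true == 1) & (y_pred_all == 1)))  # true positives
fn = int(np.sum((y_true == 1) & (y_pred_all == 0)))  # false negatives
fp = int(np.sum((y_true == 0) & (y_pred_all == 1)))  # false positives

recall = tp / (tp + fn)        # no false negatives at all
precision = tp / (tp + fp)     # but most flagged applicants are fine
f1 = 2 * precision * recall / (precision + recall)

print(recall, precision, round(f1, 3))
```

Recall is perfect here because nothing is ever predicted negative, while precision collapses to the positive-class rate. F1 penalises that imbalance, which is why it is a safer quantity to optimise.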

In [1]:
import os
import pandas as pd
import numpy as np
import sys

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

p = os.path.abspath('../')
if p not in sys.path:
    sys.path.append(p)

from shared.data_processing import CategoricalEncoder
from shared.data_processing import FeatureSelector


## Load data

In [2]:
df = pd.read_csv('./data/loan_data_prepped.csv')

In [3]:
X = df.drop(['label', 'accepted'], axis=1)
y = df['label']

## Set the features
These are the features of interest identified in the previous notebooks.

In [4]:
NUMERICAL_FEATURES = ['duration', 'loan_amount', 'age']

FIXED_CATEGORICAL = ['foreign_worker_binary', 'checking_status_ordinal', 'savings_status_ordinal',
                     'employment_ordinal', 'installment_commitment_ordinal']

OTHER_CATEGORICAL = ['loan_history', 'purpose', 'other_parties', 'property_magnitude',
                     'other_payment_plans', 'housing', 'personal_status', 'job']

FEATURES = NUMERICAL_FEATURES + FIXED_CATEGORICAL + OTHER_CATEGORICAL

In [5]:
FEATURES

['duration',
 'loan_amount',
 'age',
 'foreign_worker_binary',
 'checking_status_ordinal',
 'savings_status_ordinal',
 'employment_ordinal',
 'installment_commitment_ordinal',
 'loan_history',
 'purpose',
 'other_parties',
 'property_magnitude',
 'other_payment_plans',
 'housing',
 'personal_status',
 'job']

## Tune model parameters
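For logistic regression, only the regularisation strength `C` needs to be tuned (in scikit-learn, a smaller `C` means stronger regularisation). Given the small size of the data set, I'm using 50 stratified folds (`StratifiedKFold` without shuffling, so the folds are not random) and scoring each candidate on F1.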
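As a rough illustration of what the regularisation strength does, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data (generated with `make_classification`, not the loan data) for three arbitrary values of `C` and records the norm of the fitted coefficients. The data set and the `C` values are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real notebook uses the prepared loan features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

norms = []
for C in [0.001, 0.1, 10.0]:
    # Default L2 penalty: smaller C shrinks the coefficients more strongly.
    model = LogisticRegression(C=C, max_iter=2000).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(model.coef_))

print(norms)
```

The coefficient norm grows as `C` increases, i.e. as the penalty is relaxed. Tuning `C` is therefore a trade-off between a heavily shrunk model and one that fits the training data more closely.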

In [6]:
encoder = CategoricalEncoder(features_to_encode=OTHER_CATEGORICAL)

In [7]:
selector = FeatureSelector(features_to_select=NUMERICAL_FEATURES + FIXED_CATEGORICAL)

In [8]:
logistic = LogisticRegression(max_iter=2000, class_weight='balanced', solver='newton-cg')
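The `class_weight='balanced'` option reweights classes inversely to their frequency, using `n_samples / (n_classes * np.bincount(y))` as documented by scikit-learn. A small sketch of that formula, assuming a hypothetical 700/300 class split purely for illustration:

```python
import numpy as np

# Hypothetical label vector: 700 negatives and 300 positives.
y_demo = np.array([0] * 700 + [1] * 300)

counts = np.bincount(y_demo)                 # samples per class
n_samples, n_classes = len(y_demo), len(counts)

# 'balanced' heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * counts)

# After weighting, both classes contribute the same total weight to the loss.
weighted_totals = weights * counts

print(weights, weighted_totals)
```

With these weights, a misclassified positive costs more than a misclassified negative, which nudges the model towards higher recall on the minority class.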

In [9]:
pipe = Pipeline(steps=[('encode', encoder),
                       ('select', selector),
                       ('scale', StandardScaler()),
                       ('logistic', logistic)])

In [10]:
param_grid_coarse = {'logistic__C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]}
param_grid_fine = {'logistic__C': np.logspace(0, 2, 20)}
param_grid_finest = {'logistic__C': np.logspace(0, 1, 20)}
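The three grids narrow the search range step by step: whole decades first, then 1 to 100, then 1 to 10. The sketch below only inspects the values that `np.logspace` generates for the two finer grids. It also checks that the best value reported further down is the fourth point of the finest grid.

```python
import numpy as np

fine = np.logspace(0, 2, 20)     # 20 log-spaced values from 1 to 100
finest = np.logspace(0, 1, 20)   # 20 log-spaced values from 1 to 10

# Consecutive values differ by a constant factor, not a constant step.
step_factor = finest[1:] / finest[:-1]

print(finest[:5])
print(step_factor[0])
```

Because the points are evenly spaced on a log scale, each candidate is about 13% larger than the previous one in the finest grid. That resolution is usually enough for a regularisation parameter.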

In [11]:
grid = GridSearchCV(pipe,
                    verbose=1,
                    cv=StratifiedKFold(n_splits=50),
                    scoring=make_scorer(f1_score),
                    param_grid=param_grid_finest)

In [12]:
grid.fit(X, y)

Fitting 50 folds for each of 20 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:   55.0s finished


GridSearchCV(cv=StratifiedKFold(n_splits=50, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('encode',
                                        CategoricalEncoder(features_to_encode=['loan_history',
                                                                               'purpose',
                                                                               'other_parties',
                                                                               'property_magnitude',
                                                                               'other_payment_plans',
                                                                               'housing',
                                                                               'personal_status',
                                                                               'job'])),
                                       ('select',
                                        FeatureSelector(fea

Retrieve the best value of `C` found by the search.

In [13]:
best_regularisation = grid.best_estimator_.get_params()['logistic__C']
best_regularisation

1.4384498882876628
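A fitted `GridSearchCV` also exposes the winning parameters directly as `best_params_`, and the per-candidate scores in `cv_results_`. These help to judge whether the optimum is sharp or flat. The sketch below uses synthetic data and a reduced pipeline (no custom encoder or selector), so the names `X_demo`, `demo_pipe` and `demo_grid` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.7, 0.3], random_state=0)

demo_pipe = Pipeline(steps=[('scale', StandardScaler()),
                            ('logistic', LogisticRegression(max_iter=2000,
                                                            class_weight='balanced'))])

demo_grid = GridSearchCV(demo_pipe,
                         param_grid={'logistic__C': np.logspace(0, 1, 5)},
                         cv=StratifiedKFold(n_splits=5),
                         scoring='f1')
demo_grid.fit(X_demo, y_demo)

# Best parameter value and its cross-validated F1 (mean and spread over folds).
best_C = demo_grid.best_params_['logistic__C']
best_idx = demo_grid.best_index_
mean_f1 = demo_grid.cv_results_['mean_test_score'][best_idx]
std_f1 = demo_grid.cv_results_['std_test_score'][best_idx]

print(best_C, mean_f1, std_f1)
```

If the mean scores of neighbouring candidates differ by much less than their fold-to-fold standard deviation, the exact choice of `C` within that range matters little.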

## Evaluate the best model over 50 random splits
Given that the data set is so small, it's important to evaluate over many random train/test splits, so that we can estimate both the average of each metric and how much it varies from split to split.

In [14]:
logistic_tuned = LogisticRegression(max_iter=2000,
                                    class_weight='balanced',
                                    solver='newton-cg',
                                    C=best_regularisation)

In [15]:
pipe_tuned = Pipeline(steps=[('encoder', encoder),
                             ('select', selector),
                             ('scaler', StandardScaler()),
                             ('logistic', logistic_tuned)])

In [16]:
sss = StratifiedShuffleSplit(n_splits=50, test_size=0.25)
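To show what these splits look like, the following sketch runs `StratifiedShuffleSplit` on a hypothetical label vector with 700 negatives and 300 positives (an assumption for illustration). It also passes `random_state`, which the cell above does not, so that the splits are reproducible.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels; the features are irrelevant for splitting, so use a dummy column.
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = np.zeros((len(y_demo), 1))

# random_state makes the 50 random splits repeatable between runs.
sss_demo = StratifiedShuffleSplit(n_splits=50, test_size=0.25, random_state=42)

test_sizes = []
test_positives = []
for train_idx, test_idx in sss_demo.split(X_demo, y_demo):
    test_sizes.append(len(test_idx))
    test_positives.append(int(y_demo[test_idx].sum()))

print(set(test_sizes), set(test_positives))
```

Every test set keeps the overall class ratio. Unlike the folds of a k-fold scheme, the test sets of different splits can overlap, so the 50 scores are not independent measurements.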

In [17]:
METRIC_FUNCTIONS = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'f1': f1_score
}

In [18]:
METRICS = {k: [] for k in METRIC_FUNCTIONS.keys()}

for train_IDX, test_IDX in sss.split(X, y):
    # split() yields positional indices, so index with .iloc rather than .loc
    pipe_tuned.fit(X.iloc[train_IDX], y.iloc[train_IDX])
    logistic_predictions = pipe_tuned.predict(X.iloc[test_IDX])
    truth = y.iloc[test_IDX]
    
    for key, metric in METRIC_FUNCTIONS.items():
        METRICS[key].append(metric(truth, logistic_predictions))

In [19]:
{k: np.mean(v) for k, v in METRICS.items()}

{'accuracy': 0.6472800000000001,
 'precision': 0.4409132833860982,
 'recall': 0.6469333333333332,
 'f1': 0.5235762495915697}

In [20]:
{k: np.std(v) for k, v in METRICS.items()}

{'accuracy': 0.02723162866961872,
 'precision': 0.02988886000912991,
 'recall': 0.05702178920767433,
 'f1': 0.03434250281964881}
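One way to read the means and standard deviations together is as a rough band of mean ± 2 standard deviations per metric. The sketch below uses the values reported above, rounded to four decimals. The band is only a heuristic: because the random splits overlap, it should not be read as a formal confidence interval.

```python
# Values copied from the outputs above, rounded to four decimals.
reported_mean = {'accuracy': 0.6473, 'precision': 0.4409, 'recall': 0.6469, 'f1': 0.5236}
reported_std = {'accuracy': 0.0272, 'precision': 0.0299, 'recall': 0.0570, 'f1': 0.0343}

# Rough spread band per metric: (mean - 2*std, mean + 2*std).
summary = {name: (round(reported_mean[name] - 2 * reported_std[name], 3),
                  round(reported_mean[name] + 2 * reported_std[name], 3))
           for name in reported_mean}

# Metric that varies most between splits.
most_variable = max(reported_std, key=reported_std.get)

for name, (low, high) in summary.items():
    print(f"{name}: {reported_mean[name]:.3f} (roughly {low} to {high})")
print(most_variable)
```

Recall is the least stable metric across splits, which is worth keeping in mind given that it is the quantity this problem cares about most.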

Compute the accuracy on the training set of the last split and compare it with the test accuracy above to check for overfitting.

In [21]:
# train_IDX still holds the training indices of the last split in the loop above
pipe_tuned.score(X.iloc[train_IDX], y.iloc[train_IDX])

0.6533333333333333
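The check above looks at a single split. A possible alternative is scikit-learn's `cross_validate` with `return_train_score=True`, which compares training and test accuracy on every split. The sketch uses synthetic data and a reduced pipeline, so `X_demo`, `demo_pipe` and `splitter` are illustrative names rather than the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.7, 0.3], random_state=0)

demo_pipe = Pipeline(steps=[('scale', StandardScaler()),
                            ('logistic', LogisticRegression(max_iter=2000,
                                                            class_weight='balanced'))])

splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)

# return_train_score=True also scores each fitted model on its own training split.
scores = cross_validate(demo_pipe, X_demo, y_demo, cv=splitter,
                        scoring='accuracy', return_train_score=True)

train_acc = scores['train_score'].mean()
test_acc = scores['test_score'].mean()
gap = train_acc - test_acc   # a large positive gap would hint at overfitting

print(train_acc, test_acc, gap)
```

A small gap between the two averages, as with the 0.653 training accuracy and the roughly 0.647 mean test accuracy reported in this notebook, suggests the model is not overfitting.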