# Predicting the outcome of loan applications
# 3a. Logistic regression model
The metric we will try to optimise is **recall**, because we care about minimising the false negatives. False negative, i.e. people we accept but should have rejected, pose a greater risk, because they could lead to loss of the capital lost as well as potential revenue from the interest.

In [1]:
import os
import pandas as pd
import numpy as np
import sys

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

p = os.path.abspath('../')
if p not in sys.path:
    sys.path.append(p)

from shared.data_processing import CategoricalEncoder
from shared.data_processing import FeatureSelector


## Load data

In [2]:
df = pd.read_csv('./data/loan_data_prepped.csv')

In [3]:
X = df.drop(['label', 'accepted'], axis=1)
y = df['label']

## Set the features
Features of interest we have identified in the previous notebooks.

In [4]:
NUMERICAL_FEATURES = ['duration', 'loan_amount', 'age']

FIXED_CATEGORICAL = ['foreign_worker_binary', 'checking_status_ordinal', 'savings_status_ordinal',
                     'employment_ordinal', 'installment_commitment_ordinal']

OTHER_CATEGORICAL = ['loan_history', 'purpose', 'other_parties', 'property_magnitude',
                     'other_payment_plans', 'housing', 'personal_status', 'job']

FEATURES = NUMERICAL_FEATURES + FIXED_CATEGORICAL + OTHER_CATEGORICAL

In [5]:
FEATURES

['duration',
 'loan_amount',
 'age',
 'foreign_worker_binary',
 'checking_status_ordinal',
 'savings_status_ordinal',
 'employment_ordinal',
 'installment_commitment_ordinal',
 'loan_history',
 'purpose',
 'other_parties',
 'property_magnitude',
 'other_payment_plans',
 'housing',
 'personal_status',
 'job']

## Tune model parameters
For Logistic Regression, only the regularisation strength needs to be tuned. Given the small size of the data set, I'm using 50 stratified random folds.

In [6]:
encoder = CategoricalEncoder(features_to_encode=OTHER_CATEGORICAL)

In [7]:
selector = FeatureSelector(features_to_select=NUMERICAL_FEATURES + FIXED_CATEGORICAL)

In [8]:
logistic = LogisticRegression(max_iter=2000, class_weight='balanced', solver='newton-cg')

In [9]:
pipe = Pipeline(steps=[('encode', encoder),
                       ('select', selector),
                       ('scale', StandardScaler()),
                       ('logistic', logistic)])

In [22]:
param_grid_coarse = {'logistic__C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]}
param_grid_fine = {'logistic__C': np.logspace(0, 100, 20)}

In [24]:
grid = GridSearchCV(pipe,
                    verbose=2,
                    cv=StratifiedKFold(n_splits=50),
                    scoring=make_scorer(f1_score),
                    param_grid=param_grid_fine)

In [25]:
grid.fit(X, y)

Fitting 50 folds for each of 20 candidates, totalling 1000 fits
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] ................................. logistic__C=10.0, total=   0.1s
[CV] logistic__C=10.0 ................................................
[CV] .

[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .................... logistic__C=1623776.739188721, total=   0.1s
[CV] logistic__C=1623776.739188721 ...................................
[CV] .

[CV] ................... logistic__C=263665089873.03555, total=   0.1s
[CV] logistic__C=263665089873.03555 ..................................
[CV] ................... logistic__C=263665089873.03555, total=   0.1s
[CV] logistic__C=263665089873.03555 ..................................
[CV] ................... logistic__C=263665089873.03555, total=   0.0s
[CV] logistic__C=263665089873.03555 ..................................
[CV] ................... logistic__C=263665089873.03555, total=   0.1s
[CV] logistic__C=263665089873.03555 ..................................
[CV] ................... logistic__C=263665089873.03555, total=   0.1s
[CV] logistic__C=263665089873.03555 ..................................
[CV] ................... logistic__C=263665089873.03555, total=   0.1s
[CV] logistic__C=263665089873.03555 ..................................
[CV] ................... logistic__C=263665089873.03555, total=   0.1s
[CV] logistic__C=263665089873.03555 ..................................
[CV] .

[CV] ................ logistic__C=4.281332398719396e+16, total=   0.1s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] ................ logistic__C=4.281332398719396e+16, total=   0.1s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] ................ logistic__C=4.281332398719396e+16, total=   0.1s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] ................ logistic__C=4.281332398719396e+16, total=   0.0s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] ................ logistic__C=4.281332398719396e+16, total=   0.1s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] ................ logistic__C=4.281332398719396e+16, total=   0.1s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] ................ logistic__C=4.281332398719396e+16, total=   0.1s
[CV] logistic__C=4.281332398719396e+16 ...............................
[CV] .

[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] ................ logistic__C=6.951927961775591e+21, total=   0.1s
[CV] logistic__C=6.951927961775591e+21 ...............................
[CV] .

[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.1s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.1s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.1s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.0s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.0s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.1s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] ............... logistic__C=1.8329807108324377e+32, total=   0.1s
[CV] logistic__C=1.8329807108324377e+32 ..............................
[CV] .

[CV] ................ logistic__C=2.976351441631313e+37, total=   0.1s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] ................ logistic__C=2.976351441631313e+37, total=   0.1s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] ................ logistic__C=2.976351441631313e+37, total=   0.1s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] ................ logistic__C=2.976351441631313e+37, total=   0.0s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] ................ logistic__C=2.976351441631313e+37, total=   0.1s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] ................ logistic__C=2.976351441631313e+37, total=   0.1s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] ................ logistic__C=2.976351441631313e+37, total=   0.1s
[CV] logistic__C=2.976351441631313e+37 ...............................
[CV] .

[CV] ................ logistic__C=4.832930238571732e+42, total=   0.1s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] ................ logistic__C=4.832930238571732e+42, total=   0.1s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] ................ logistic__C=4.832930238571732e+42, total=   0.0s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] ................ logistic__C=4.832930238571732e+42, total=   0.0s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] ................ logistic__C=4.832930238571732e+42, total=   0.1s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] ................ logistic__C=4.832930238571732e+42, total=   0.1s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] ................ logistic__C=4.832930238571732e+42, total=   0.0s
[CV] logistic__C=4.832930238571732e+42 ...............................
[CV] .

[CV] ................ logistic__C=7.847599703514558e+47, total=   0.1s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] ................ logistic__C=7.847599703514558e+47, total=   0.1s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] ................ logistic__C=7.847599703514558e+47, total=   0.0s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] ................ logistic__C=7.847599703514558e+47, total=   0.1s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] ................ logistic__C=7.847599703514558e+47, total=   0.1s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] ................ logistic__C=7.847599703514558e+47, total=   0.1s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] ................ logistic__C=7.847599703514558e+47, total=   0.1s
[CV] logistic__C=7.847599703514558e+47 ...............................
[CV] .

[CV] ............... logistic__C=1.2742749857031217e+53, total=   0.1s
[CV] logistic__C=1.2742749857031217e+53 ..............................


KeyboardInterrupt: 

Find the best value.

In [21]:
best_regularisation = grid.best_estimator_.get_params()['logistic__C']
best_regularisation

10

## Evaluate the best model over 50 random splits
Given that the data set is so small, it's important to evaluate over many random train/test splits, so that we get a better picture of the metrics. With the best value of the regularisation parameter, we get an:
- **average recall of 0.74**;
- **average precision of 0.48**.

In [None]:
logistic_tuned = LogisticRegression(max_iter=2000,
                                    class_weight='balanced',
                                    solver='lbfgs',
                                    C=1)

In [None]:
pipe_tuned = Pipeline(steps=[('encoder', encoder),
                             ('select', selector),
                             ('scaler', StandardScaler()),
                             ('logistic', logistic_tuned)])

In [None]:
pipe_pre = Pipeline(steps=[('encode', encoder),
                       ('select', selector),
                       ('scale', StandardScaler())])

In [None]:
X_pre = pipe_pre.fit_transform(X, y)

In [None]:
X_pre

In [None]:
X.shape

In [None]:
sss = StratifiedShuffleSplit(n_splits=50, test_size=0.25)

In [None]:
METRIC_FUNCTIONS = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'f1': f1_score
}

In [None]:
METRICS = {k: [] for k in METRIC_FUNCTIONS.keys()}

for train_IDX, test_IDX in sss.split(X, y):
    pipe_tuned.fit(X.loc[train_IDX], y.loc[train_IDX])
    logistic_predictions = pipe_tuned.predict(X.loc[test_IDX])
    truth = y.loc[test_IDX]
    
    for key, metric in METRIC_FUNCTIONS.items():
        METRICS[key].append(metric(truth, logistic_predictions))

In [None]:
{k: np.mean(v) for k, v in METRICS.items()}

In [None]:
{k: np.std(v) for k, v in METRICS.items()}

Check accuracy on training set to check for overfitting.

In [None]:
pipe_tuned.score(X.loc[train_IDX], y.loc[train_IDX])