# Predicting the outcome of loan applications
# 3a. Logistic regression model
The metrics we will try to optimise is **recall**, because we care about minimising the false negatives. False negative, i.e. people we accept but should have rejected, pose a greater risk, because they could lead to loss of the capital lost as well as potential revenue from the interest.

In [1]:
import os
import pandas as pd
import numpy as np
import sys

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

p = os.path.abspath('../')
if p not in sys.path:
    sys.path.append(p)

from shared.data_processing import CategoricalEncoder


## Load data

In [2]:
df = pd.read_csv('./data/loan_data_prepped.csv')

## Set the features
Features of interest we have identified in the previous notebooks.

In [3]:
OTHER_CATEGORICAL = ['loan_history', 'purpose', 'other_parties', 'property_magnitude',
                     'other_payment_plans', 'housing', 'personal_status', 'job']

In [4]:
NUMERICAL_FEATURES = ['duration', 'loan_amount', 'age']

FIXED_CATEGORICAL = ['foreign_worker_binary', 'checking_status_ordinal', 'savings_status_ordinal',
                     'employment_ordinal', 'installment_commitment_ordinal']

VARIABLE_CATEGORICAL = [f'{feature}_encoded' for feature in OTHER_CATEGORICAL]

FEATURES = NUMERICAL_FEATURES + FIXED_CATEGORICAL + VARIABLE_CATEGORICAL

In [5]:
FEATURES

['duration',
 'loan_amount',
 'age',
 'foreign_worker_binary',
 'checking_status_ordinal',
 'savings_status_ordinal',
 'employment_ordinal',
 'installment_commitment_ordinal',
 'loan_history_encoded',
 'purpose_encoded',
 'other_parties_encoded',
 'property_magnitude_encoded',
 'other_payment_plans_encoded',
 'housing_encoded',
 'personal_status_encoded',
 'job_encoded']

## Tune model parameters
For Logistic Regression, only the regularisation strength needs to be tuned.

In [6]:
encoder = CategoricalEncoder(features_to_encode=OTHER_CATEGORICAL,
                             target='label',
                             features_to_return=FEATURES)

In [7]:
logistic = LogisticRegression(max_iter=2000, class_weight='balanced', solver='lbfgs')

In [8]:
pipe = Pipeline(steps=[('encoder', encoder),
                       ('scaler', StandardScaler()),
                       ('logistic', logistic)])

In [9]:
grid = GridSearchCV(pipe,
                    verbose=10,
                    cv=StratifiedKFold(n_splits=50),
                    scoring=make_scorer(recall_score),
                    param_grid={
#                         'logistic__C': [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
                       'logistic__C': np.logspace(-7, -5, 20)
                    })

In [10]:
grid.fit(df, df['label'])

Fitting 50 folds for each of 20 candidates, totalling 1000 fits
[CV] logistic__C=1e-07 ...............................................
[CV] ...... logistic__C=1e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-07 ...............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.3s remaining:    0.0s


[CV] ...... logistic__C=1e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.5s remaining:    0.0s


[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.8s remaining:    0.0s


[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ...... logistic__C=1e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-07 ...............................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.0s remaining:    0.0s


[CV] ...... logistic__C=1e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ...... logistic__C=1e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] ..................... logistic__C=1e-07, score=1.0, total=   0.1s
[CV] logistic__C=1e-07 ...............................................
[CV] .

[CV] .... logistic__C=1.2742749857031348e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.2742749857031348e-07 ..............................
[CV] .... logistic__C=1.2742749857031348e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.2742749857031348e-07 ..............................
[CV]  logistic__C=1.2742749857031348e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.2742749857031348e-07 ..............................
[CV]  logistic__C=1.2742749857031348e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=1.2742749857031348e-07 ..............................
[CV]  logistic__C=1.2742749857031348e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=1.2742749857031348e-07 ..............................
[CV]  logistic__C=1.2742749857031348e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=1.2742749857031348e-07 ..............................
[CV] .... logistic__C=1.2742749857031348e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.27427498570313

[CV] ..... logistic__C=1.623776739188721e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ...............................
[CV] ..... logistic__C=1.623776739188721e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ...............................
[CV]  logistic__C=1.623776739188721e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ...............................
[CV]  logistic__C=1.623776739188721e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ...............................
[CV]  logistic__C=1.623776739188721e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ...............................
[CV] ..... logistic__C=1.623776739188721e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ...............................
[CV] ..... logistic__C=1.623776739188721e-07, score=1.0, total=   0.1s
[CV] logistic__C=1.623776739188721e-07 ........

[CV] ...... logistic__C=2.06913808111479e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ................................
[CV] ...... logistic__C=2.06913808111479e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ................................
[CV]  logistic__C=2.06913808111479e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ................................
[CV]  logistic__C=2.06913808111479e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ................................
[CV] ...... logistic__C=2.06913808111479e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ................................
[CV] ...... logistic__C=2.06913808111479e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ................................
[CV]  logistic__C=2.06913808111479e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=2.06913808111479e-07 ............

[CV] .... logistic__C=2.6366508987303554e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..............................
[CV] .... logistic__C=2.6366508987303554e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..............................
[CV]  logistic__C=2.6366508987303554e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..............................
[CV] .... logistic__C=2.6366508987303554e-07, score=0.5, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..............................
[CV] .... logistic__C=2.6366508987303554e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..............................
[CV] .... logistic__C=2.6366508987303554e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..............................
[CV] .... logistic__C=2.6366508987303554e-07, score=1.0, total=   0.1s
[CV] logistic__C=2.6366508987303554e-07 ..........................

[CV] ..... logistic__C=3.359818286283781e-07, score=1.0, total=   0.1s
[CV] logistic__C=3.359818286283781e-07 ...............................
[CV]  logistic__C=3.359818286283781e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=3.359818286283781e-07 ...............................
[CV]  logistic__C=3.359818286283781e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=4.281332398719396e-07 ...............................
[CV]  logistic__C=4.281332398719396e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=4.281332398719396e-07 ...............................
[CV] ..... logistic__C=4.281332398719396e-07, score=1.0, total=   0.1s
[CV] logistic__C=4.281332398719396e-07 ...............................
[CV] ..... logistic__C=4.281332398719396e-07, score=1.0, total=   0.1s
[CV] logistic__C=4.281332398719396e-07 ...............................
[CV]  logistic__C=4.281332398719396e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=4.281332398719396e-0

[CV] ..... logistic__C=5.455594781168515e-07, score=1.0, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ...............................
[CV] ..... logistic__C=5.455594781168515e-07, score=1.0, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ...............................
[CV]  logistic__C=5.455594781168515e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ...............................
[CV]  logistic__C=5.455594781168515e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ...............................
[CV] ..... logistic__C=5.455594781168515e-07, score=1.0, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ...............................
[CV] ..... logistic__C=5.455594781168515e-07, score=1.0, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ...............................
[CV] ..... logistic__C=5.455594781168515e-07, score=1.0, total=   0.1s
[CV] logistic__C=5.455594781168515e-07 ..................

[CV] ..... logistic__C=6.951927961775605e-07, score=1.0, total=   0.1s
[CV] logistic__C=6.951927961775605e-07 ...............................
[CV]  logistic__C=6.951927961775605e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=6.951927961775605e-07 ...............................
[CV] ..... logistic__C=6.951927961775605e-07, score=1.0, total=   0.1s
[CV] logistic__C=6.951927961775605e-07 ...............................
[CV] ..... logistic__C=6.951927961775605e-07, score=1.0, total=   0.1s
[CV] logistic__C=6.951927961775605e-07 ...............................
[CV]  logistic__C=6.951927961775605e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=6.951927961775605e-07 ...............................
[CV]  logistic__C=6.951927961775605e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=6.951927961775605e-07 ...............................
[CV]  logistic__C=6.951927961775605e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=6.951927961775605e-0

[CV]  logistic__C=8.858667904100833e-07, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ...............................
[CV] ..... logistic__C=8.858667904100833e-07, score=1.0, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ...............................
[CV] ..... logistic__C=8.858667904100833e-07, score=1.0, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ...............................
[CV] ..... logistic__C=8.858667904100833e-07, score=1.0, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ...............................
[CV] ..... logistic__C=8.858667904100833e-07, score=1.0, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ...............................
[CV]  logistic__C=8.858667904100833e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ...............................
[CV]  logistic__C=8.858667904100833e-07, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=8.858667904100833e-07 ........

[CV] .... logistic__C=1.1288378916846883e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ..............................
[CV] .... logistic__C=1.1288378916846883e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ..............................
[CV] .... logistic__C=1.1288378916846883e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ..............................
[CV]  logistic__C=1.1288378916846883e-06, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ..............................
[CV]  logistic__C=1.1288378916846883e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ..............................
[CV] .... logistic__C=1.1288378916846883e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ..............................
[CV] .... logistic__C=1.1288378916846883e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.1288378916846883e-06 ...............

[CV] ..... logistic__C=1.438449888287663e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ...............................
[CV] ..... logistic__C=1.438449888287663e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ...............................
[CV]  logistic__C=1.438449888287663e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ...............................
[CV] ..... logistic__C=1.438449888287663e-06, score=0.5, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ...............................
[CV] ..... logistic__C=1.438449888287663e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ...............................
[CV] ..... logistic__C=1.438449888287663e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ...............................
[CV] ..... logistic__C=1.438449888287663e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.438449888287663e-06 ............................

[CV] .... logistic__C=1.8329807108324375e-06, score=1.0, total=   0.1s
[CV] logistic__C=1.8329807108324375e-06 ..............................
[CV]  logistic__C=1.8329807108324375e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1.8329807108324375e-06 ..............................
[CV]  logistic__C=1.8329807108324375e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.335721469090121e-06 ...............................
[CV]  logistic__C=2.335721469090121e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.335721469090121e-06 ...............................
[CV] ..... logistic__C=2.335721469090121e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.335721469090121e-06 ...............................
[CV] ..... logistic__C=2.335721469090121e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.335721469090121e-06 ...............................
[CV]  logistic__C=2.335721469090121e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.335721469090121e

[CV] ..... logistic__C=2.976351441631319e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ...............................
[CV] ..... logistic__C=2.976351441631319e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ...............................
[CV] ..... logistic__C=2.976351441631319e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ...............................
[CV] ..... logistic__C=2.976351441631319e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ...............................
[CV]  logistic__C=2.976351441631319e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ...............................
[CV]  logistic__C=2.976351441631319e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ...............................
[CV] ..... logistic__C=2.976351441631319e-06, score=1.0, total=   0.1s
[CV] logistic__C=2.976351441631319e-06 ..................

[CV] .... logistic__C=3.7926901907322535e-06, score=1.0, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..............................
[CV] .... logistic__C=3.7926901907322535e-06, score=1.0, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..............................
[CV] .... logistic__C=3.7926901907322535e-06, score=1.0, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..............................
[CV] .... logistic__C=3.7926901907322535e-06, score=1.0, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..............................
[CV]  logistic__C=3.7926901907322535e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..............................
[CV] .... logistic__C=3.7926901907322535e-06, score=1.0, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..............................
[CV] .... logistic__C=3.7926901907322535e-06, score=1.0, total=   0.1s
[CV] logistic__C=3.7926901907322535e-06 ..........................

[CV]  logistic__C=4.832930238571752e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=4.832930238571752e-06 ...............................
[CV]  logistic__C=4.832930238571752e-06, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=4.832930238571752e-06 ...............................
[CV]  logistic__C=4.832930238571752e-06, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=4.832930238571752e-06 ...............................
[CV]  logistic__C=4.832930238571752e-06, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=4.832930238571752e-06 ...............................
[CV] ..... logistic__C=4.832930238571752e-06, score=1.0, total=   0.1s
[CV] logistic__C=4.832930238571752e-06 ...............................
[CV] ..... logistic__C=4.832930238571752e-06, score=1.0, total=   0.1s
[CV] logistic__C=4.832930238571752e-06 ...............................
[CV] ..... logistic__C=4.832930238571752e-06, score=1.0, total=   0.1s
[CV] logistic__C=4.832930238571752e-0

[CV] ..... logistic__C=6.158482110660255e-06, score=1.0, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ...............................
[CV]  logistic__C=6.158482110660255e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ...............................
[CV]  logistic__C=6.158482110660255e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ...............................
[CV]  logistic__C=6.158482110660255e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ...............................
[CV] ..... logistic__C=6.158482110660255e-06, score=1.0, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ...............................
[CV] ..... logistic__C=6.158482110660255e-06, score=1.0, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ...............................
[CV] ..... logistic__C=6.158482110660255e-06, score=1.0, total=   0.1s
[CV] logistic__C=6.158482110660255e-06 ........

[CV]  logistic__C=7.847599703514607e-06, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ...............................
[CV]  logistic__C=7.847599703514607e-06, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ...............................
[CV] ..... logistic__C=7.847599703514607e-06, score=1.0, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ...............................
[CV] ..... logistic__C=7.847599703514607e-06, score=1.0, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ...............................
[CV]  logistic__C=7.847599703514607e-06, score=0.6666666666666666, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ...............................
[CV] ..... logistic__C=7.847599703514607e-06, score=1.0, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ...............................
[CV] ..... logistic__C=7.847599703514607e-06, score=1.0, total=   0.1s
[CV] logistic__C=7.847599703514607e-06 ........

[CV] ...... logistic__C=1e-05, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] ..................... logistic__C=1e-05, score=0.5, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] ..................... logistic__C=1e-05, score=1.0, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] ..................... logistic__C=1e-05, score=1.0, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] ..................... logistic__C=1e-05, score=1.0, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] ..................... logistic__C=1e-05, score=1.0, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] ...... logistic__C=1e-05, score=0.8333333333333334, total=   0.1s
[CV] logistic__C=1e-05 ...............................................
[CV] .

[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:  1.8min finished


GridSearchCV(cv=StratifiedKFold(n_splits=50, random_state=None, shuffle=False),
       error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('encoder', CategoricalEncoder(features_to_encode=['loan_history', 'purpose', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'personal_status', 'job'],
          features_to_return=['duration', 'loan_amount', 'age', 'foreign_worker_binary', 'checking_status_ordinal',...enalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'logistic__C': array([1.00000e-07, 1.27427e-07, 1.62378e-07, 2.06914e-07, 2.63665e-07,
       3.35982e-07, 4.28133e-07, 5.45559e-07, 6.95193e-07, 8.85867e-07,
       1.12884e-06, 1.43845e-06, 1.83298e-06, 2.33572e-06, 2.97635e-06,
       3.79269e-06, 4.83293e-06, 6.15848e-06, 7.84760e-06, 1.00000e-05])},
       pre_dispatch='2*n_jobs', refit=True, return_train

Find the best value.

In [11]:
best_regularisation = grid.best_estimator_.get_params()['logistic__C']
best_regularisation

1.623776739188721e-07

## Evaluate the best model over 50 random splits
Given that the data set is so small, it's important to evaluate over many random train/test splits, so that we get a better picture of the metrics. With the best value of the regularisation parameter, we get an **average recall of about 0.74**, with an **average precision of 0.48**.

In [12]:
logistic_tuned = LogisticRegression(max_iter=2000,
                                    class_weight='balanced',
                                    solver='lbfgs',
                                    C=best_regularisation)

In [13]:
pipe_tuned = Pipeline(steps=[('encoder', encoder),
                             ('scaler', StandardScaler()),
                             ('logistic', logistic)])

In [14]:
sss = StratifiedShuffleSplit(n_splits=50, test_size=0.25)

In [15]:
METRIC_FUNCTIONS = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'f1': f1_score
}

In [16]:
METRICS = {k: [] for k in METRIC_FUNCTIONS.keys()}

X = df
y = df['label']

for train_IDX, test_IDX in sss.split(X, y):
    pipe_tuned.fit(X.loc[train_IDX], y.loc[train_IDX])
    logistic_predictions = pipe_tuned.predict(X.loc[test_IDX])
    truth = y.loc[test_IDX]
    
    for key, metric in METRIC_FUNCTIONS.items():
        METRICS[key].append(metric(truth, logistic_predictions))

In [17]:
{k: np.mean(v) for k, v in METRICS.items()}

{'accuracy': 0.6667199999999999,
 'f1': 0.5700134064362551,
 'precision': 0.48455631518404224,
 'recall': 0.7426666666666667}

In [18]:
{k: np.std(v) for k, v in METRICS.items()}

{'accuracy': 0.07311772425342573,
 'f1': 0.049143934424600626,
 'precision': 0.06514204615203176,
 'recall': 0.1606265510085082}

Check accuracy on training set to check for overfitting.

In [19]:
pipe_tuned.score(X.loc[train_IDX], y.loc[train_IDX])

0.7226666666666667