In [18]:
import io

import pandas as pd

In [7]:
SOURCE_FILE = './credit.csv'

In [16]:
with open(SOURCE_FILE, 'r') as file:
    data = file.read()
    
data = data.replace('"\n\n"', '\n').replace('"', '')

## Problem description
**Number of records**: 1000

**Input (8) variables**: 
* `checking_balance`: Categorical - Status of existing checking account in Deutsche Mark (DM): 
    * `unknown`: 394
    * `< 0 DM`: 274
    * `1 - 200 DM`: 269
    * `> 200 DM`: 63
* `savings_balance`: Categorical - Savings account/bonds in Deutsche Mark (DM):
    * `unknown`: 183
    * `< 100 DM`: 603
    * `101 - 500 DM`: 103
    * `501 - 1000 DM`: 63
    * `> 1000 DM`: 48
* `installment_rate`: Numerical - Installment rate in percentage of disposable income.
* `personal_status`: Categorical - Personal status and sex: 
    * `female`: 310
    * `single male`: 548
    * `married male`: 92
    * `divorced male`: 50
* `residence_history`: Numerical - Present residence since
* `installment_plan`: Categorical - Other installment plans:
    * `none`: 814
    * `bank`: 139
    * `stores`: 47
* `existing_credits`: Numerical - Number of existing credits at this bank
* `dependents`: Numerical - Number of people the debtor is required to provide for

**Target**:
* `default`: Categorical - Credit default:
    * `1`: Debtor paid back its loan
    * `2`: Debtor defaulted on its loan
    
**Problem type**: Classification
* **Imbalanced dataset**: Yes, moderately: 700 `1`s vs. 300 `2`s.

**Missing values**: None as such, but `checking_balance` and `savings_balance` include an `unknown` category.

In [119]:
input_features = ['checking_balance', 'savings_balance', 'installment_rate', 
           'personal_status', 'residence_history', 'installment_plan', 
           'existing_credits', 'dependents']

target = 'default'

usecols = input_features + [target]

df = pd.read_csv(io.StringIO(data), header=0, usecols=usecols)

df[target] = df[target].map({1:0, 2:1})

In [120]:
df.shape

(1000, 9)

In [121]:
df.head()

Unnamed: 0,checking_balance,savings_balance,installment_rate,personal_status,residence_history,installment_plan,existing_credits,default,dependents
0,< 0 DM,unknown,4,single male,4,none,2,0,1
1,1 - 200 DM,< 100 DM,2,female,2,none,1,1,1
2,unknown,< 100 DM,2,single male,3,none,1,0,2
3,< 0 DM,< 100 DM,2,single male,4,none,1,0,2
4,< 0 DM,< 100 DM,3,single male,4,none,2,1,2


In [122]:
CATEGORICAL_VARIABLES = ['checking_balance', 'savings_balance', 
           'personal_status', 'installment_plan']

In [123]:
SEED = 42
TEST_SIZE = 0.3

import sklearn.model_selection as sk_ms

X, y = df[input_features], df[target]

X_train, X_test, y_train, y_test = sk_ms.train_test_split(X, y,
                                                          test_size=TEST_SIZE,
                                                          random_state=SEED,
                                                          shuffle=True,
                                                          stratify=y)

In [64]:
from sklearn.ensemble import RandomForestClassifier

In [124]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100, criterion='gini',
                                   class_weight='balanced', random_state=SEED, 
                                    max_depth=None, min_samples_leaf=5)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), CATEGORICAL_VARIABLES)
    ])

model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', classifier)])

In [125]:
cross_validator = sk_ms.StratifiedKFold(n_splits=5, random_state=SEED, shuffle=True)

In [127]:
param_grid = {
    'classifier__max_depth': [3, 4, 5, 6, 7],
    'classifier__min_samples_leaf': [1, 3, 5, 7]
}

grid_searcher = sk_ms.GridSearchCV(estimator=model, 
                                   param_grid=param_grid, 
                                   scoring='f1',
                                   cv=cross_validator)

grid_searcher.fit(X_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('cat',
                                                                         OneHotEncoder(categories='auto',
                                                                                       drop=None,
                                                                                       dtype=<class 'numpy.float64'>,
                                                                                

In [130]:
bmodel = grid_searcher.best_estimator_
metrics.f1_score(y_test, bmodel.predict(X_test))

0.525

In [136]:
df[input_features+[target]].to_csv('data.csv', index=False)

In [129]:
grid_searcher.best_params_

{'classifier__max_depth': 7, 'classifier__min_samples_leaf': 3}

In [128]:
grid_searcher.best_score_

0.5725780143753122

In [68]:
import sklearn.metrics as metrics
import numpy as np

In [90]:
import sklearn.compose

sklearn.compose.ColumnTransformer()

In [94]:
train_precision_scores = []
test_precision_scores = []
train_recall_scores = []
test_recall_scores = []
train_f1_scores = []
test_f1_scores = []
train_pr_curve_auc = []
test_pr_curve_auc = []
for train_index, test_index in cross_validator.split(X, y):
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]
    model.fit(X_train, y_train)
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_precision_scores.append(metrics.precision_score(y_train, y_train_pred))
    test_precision_scores.append(metrics.precision_score(y_test, y_test_pred))
    train_recall_scores.append(metrics.recall_score(y_train, y_train_pred))
    test_recall_scores.append(metrics.recall_score(y_test, y_test_pred))
    train_f1_scores.append(metrics.f1_score(y_train, y_train_pred))
    test_f1_scores.append(metrics.f1_score(y_test, y_test_pred))
    
    y_train_proba = model.predict_proba(X_train)[:,1]
    y_test_proba = model.predict_proba(X_test)[:,1]
    
    train_pre, train_rec, _ = metrics.precision_recall_curve(y_train, y_train_proba) 
    test_pre, test_rec, _ = metrics.precision_recall_curve(y_test, y_test_proba) 
    
    train_pr_curve_auc.append(metrics.auc(train_rec, train_pre))
    test_pr_curve_auc.append(metrics.auc(test_rec, test_pre))
    
print('Training set metrics')
print('--------------------')
print('Precision: {:04.3f} ({:04.3f})'.format(np.mean(train_precision_scores), np.std(train_precision_scores)))
print('Recall: {:04.3f} ({:04.3f})'.format(np.mean(train_recall_scores), np.std(train_recall_scores)))
print('F1-score: {:04.3f} ({:04.3f})'.format(np.mean(train_f1_scores), np.std(train_f1_scores)))
print('PR curve AUC: {:04.3f} ({:04.3f})'.format(np.mean(train_pr_curve_auc), np.std(train_pr_curve_auc)))
print('\n')
print('Validation set metrics')
print('--------------------')
print('Precision: {:04.3f} ({:04.3f})'.format(np.mean(test_precision_scores), np.std(test_precision_scores)))
print('Recall: {:04.3f} ({:04.3f})'.format(np.mean(test_recall_scores), np.std(test_recall_scores)))
print('F1-score: {:04.3f} ({:04.3f})'.format(np.mean(test_f1_scores), np.std(test_f1_scores)))
print('PR curve AUC: {:04.3f} ({:04.3f})'.format(np.mean(test_pr_curve_auc), np.std(test_pr_curve_auc)))

Training set metrics
--------------------
Precision: 0.490 (0.011)
Recall: 0.803 (0.014)
F1-score: 0.608 (0.010)
PR curve AUC: 0.567 (0.024)


Validation set metrics
--------------------
Precision: 0.455 (0.026)
Recall: 0.757 (0.049)
F1-score: 0.568 (0.034)
PR curve AUC: 0.458 (0.047)


### 1. Script that splits the input data file into train and test set

In [139]:
import os 

import pandas as pd
from sklearn.model_selection import train_test_split

INPUT_FILE_PATH = './data.csv'
OUTPUT_DIR_PATH = '.'
SEED = 42
TEST_SIZE = 0.3

INPUT_FEATURES = ['checking_balance', 'savings_balance', 'installment_rate', 
           'personal_status', 'residence_history', 'installment_plan', 
           'existing_credits', 'dependents']
TARGET = 'default'

input_data = pd.read_csv(INPUT_FILE_PATH, 
                         header=0, 
                         usecols=[*INPUT_FEATURES, TARGET])

datasets = train_test_split(input_data, 
                            test_size=TEST_SIZE, 
                            random_state=SEED,
                            shuffle=True,
                            stratify=input_data[TARGET])

output_paths = [os.path.join(OUTPUT_DIR_PATH, filename) 
                for filename in ('train.csv', 'test.csv')]

for dataset, output_path in zip(datasets, output_paths):
    dataset.to_csv(output_path, index=False)

(700, 9)

### 2. Script that trains the model on the train set

In [None]:
import datetime as dt
import os

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

EXPERIMENT_ID = 'CDEFAULT_RF'
INPUT_FILE_PATH = './train.csv'
OUTPUT_DIR_PATH = '.'
SEED = 42

NUMERICAL_FEATURES = ['residence_history', 'installment_rate',  
           'existing_credits', 'dependents']

CATEGORICAL_FEATURES = ['checking_balance', 'savings_balance', 
           'personal_status', 'installment_plan']

TARGET = 'default'

usecols = [*NUMERICAL_FEATURES, *CATEGORICAL_FEATURES, TARGET]
input_data = pd.read_csv(INPUT_FILE_PATH, 
                         header=0, 
                         usecols=usecols)

input_data[TARGET] = input_data[TARGET].map({1:0, 2:1})

input_features = NUMERICAL_FEATURES + CATEGORICAL_FEATURES
X, y = input_data[input_features], input_data[TARGET]

classifier = RandomForestClassifier(n_estimators=100, 
                                    criterion='gini',
                                    class_weight='balanced',
                                    random_state=SEED, 
                                    # Best values reported by the grid search
                                    max_depth=7, 
                                    min_samples_leaf=3)

preprocessor = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), CATEGORICAL_FEATURES)
    ])

model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', classifier)])

model.fit(X, y)

current_timestamp = dt.datetime.strftime(dt.datetime.now(), '%Y%m%d%H%M%S')
# Add the '.joblib' extension that the report script's file name pattern expects
file_name = '_'.join([EXPERIMENT_ID, current_timestamp]) + '.joblib'
joblib.dump(model, os.path.join(OUTPUT_DIR_PATH, file_name))

### 3. Script that evaluates the trained model and writes a training report

In [157]:
import os
import re

import joblib
import pandas as pd
import sklearn.metrics as metrics

TRAIN_FILE_PATH = './train.csv'
TEST_FILE_PATH = './test.csv'
MODEL_FILE_PATH = 'model.joblib'
OUTPUT_DIR_PATH = '.'

INPUT_FEATURES = ['residence_history', 'installment_rate', 'existing_credits', 
                  'dependents', 'checking_balance', 'savings_balance', 
                  'personal_status', 'installment_plan']
TARGET = 'default'

train_data = pd.read_csv(TRAIN_FILE_PATH, 
                        header=0, 
                        usecols=[*INPUT_FEATURES, TARGET])

test_data = pd.read_csv(TEST_FILE_PATH, 
                        header=0, 
                        usecols=[*INPUT_FEATURES, TARGET])

trained_model = joblib.load(MODEL_FILE_PATH)

X_train = train_data[INPUT_FEATURES]
X_test = test_data[INPUT_FEATURES]
y_train = train_data[TARGET].map({1:0, 2:1})
y_test = test_data[TARGET].map({1:0, 2:1})

y_train_pred = trained_model.predict(X_train)
y_test_pred = trained_model.predict(X_test)
    
train_precision = metrics.precision_score(y_train, y_train_pred)
test_precision = metrics.precision_score(y_test, y_test_pred)
train_recall = metrics.recall_score(y_train, y_train_pred)
test_recall = metrics.recall_score(y_test, y_test_pred)
train_f1 = metrics.f1_score(y_train, y_train_pred)
test_f1 = metrics.f1_score(y_test, y_test_pred)

# Use the model loaded from disk, not a variable left over from another script
y_train_proba = trained_model.predict_proba(X_train)[:,1]
y_test_proba = trained_model.predict_proba(X_test)[:,1]
    
train_pre, train_rec, _ = metrics.precision_recall_curve(y_train, y_train_proba) 
test_pre, test_rec, _ = metrics.precision_recall_curve(y_test, y_test_proba) 
train_pr_curve_auc = metrics.auc(train_rec, train_pre)
test_pr_curve_auc = metrics.auc(test_rec, test_pre)

model_name = re.search(pattern=r'(\w+)\.joblib', string=MODEL_FILE_PATH).group(1)

report_lines = [
    'TRAINING REPORT',
    'Model: {}'.format(model_name),
    '\n',
    'Training set metrics',
    '----------------------',
    'Precision: {:04.3f}'.format(train_precision),
    'Recall: {:04.3f}'.format(train_recall),
    'F1-score: {:04.3f}'.format(train_f1),
    'PR curve AUC: {:04.3f}'.format(train_pr_curve_auc),
    '\n',
    'Test set metrics',
    '----------------------',
    'Precision: {:04.3f}'.format(test_precision),
    'Recall: {:04.3f}'.format(test_recall),
    'F1-score: {:04.3f}'.format(test_f1),
    'PR curve AUC: {:04.3f}'.format(test_pr_curve_auc)    
]

file_name = os.path.join(OUTPUT_DIR_PATH, 'training_report_{}.txt'.format(model_name))
with open(file_name, 'w') as file:
    file.write('\n'.join(report_lines))

'20200502172716'

## Next steps

## Technical choices and comments

### Data and modelling
The input dataset consists of 4 categorical variables, 4 numerical variables, and 1 target variable.

Two categorical variables (`checking_balance` and `savings_balance`) include missing values (labelled `unknown`). These missing values were neither imputed nor discarded; they were encoded like all the other categories. This choice was made 1) for simplicity and 2) because, in our specific use case, a missing value can carry information of its own.

A next modelling step would be to impute the above-mentioned missing values and/or to treat the `checking_balance` and `savings_balance` variables as ordinal rather than plain categorical variables.
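As a minimal sketch of the ordinal alternative, the snippet below uses the category labels listed in the problem description. The ordering itself is an assumption, in particular where `unknown` is placed, and the names `CHECKING_ORDER`, `SAVINGS_ORDER`, and `sample` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering, lowest to highest; the position of `unknown` is a judgment call.
CHECKING_ORDER = ['unknown', '< 0 DM', '1 - 200 DM', '> 200 DM']
SAVINGS_ORDER = ['unknown', '< 100 DM', '101 - 500 DM', '501 - 1000 DM', '> 1000 DM']

# One list of categories per column, in the same order as the columns below.
ordinal_encoder = OrdinalEncoder(categories=[CHECKING_ORDER, SAVINGS_ORDER])

# Hypothetical rows, for illustration only.
sample = pd.DataFrame({
    'checking_balance': ['< 0 DM', 'unknown', '> 200 DM'],
    'savings_balance': ['< 100 DM', '> 1000 DM', 'unknown'],
})

# Each label is replaced by its index in the corresponding ordered list.
encoded = ordinal_encoder.fit_transform(sample)
print(encoded)
```

Such an encoder could take the place of `OneHotEncoder` for these two columns inside the `ColumnTransformer`, while the other categorical columns stay one-hot encoded.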

Aside from one-hot encoding the categorical variables, our preprocessing simply consists of recoding the target variable to better match `sklearn`'s encoding for a binary target variable (`0`/`1` or `-1`/`1`). In particular, numerical variables were not scaled, as decision trees do not require it, nor transformed (e.g. log-transformed), as an examination of their distributions shows this was not necessary.
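One point worth checking here: by default, `ColumnTransformer` drops every column that is not listed in `transformers` (`remainder='drop'`). If the numerical variables are meant to reach the classifier unscaled, `remainder='passthrough'` keeps them. The sketch below illustrates the difference on a hypothetical two-column toy frame; it is not the notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numerical column.
toy = pd.DataFrame({
    'installment_plan': ['none', 'bank', 'stores', 'none'],
    'installment_rate': [4, 2, 2, 3],
})

# Default behaviour: columns not listed in `transformers` are dropped.
drop_rest = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(), ['installment_plan'])])

# With remainder='passthrough', unlisted columns are appended unchanged.
keep_rest = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(), ['installment_plan'])],
    remainder='passthrough')

n_dropped = drop_rest.fit_transform(toy).shape[1]
n_kept = keep_rest.fit_transform(toy).shape[1]
print(n_dropped, n_kept)
```

The first transformer outputs only the three one-hot columns, while the second also carries `installment_rate` through.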

We chose a random forest as the classifier. Hyperparameter tuning was performed using a grid search (with the F1-score as the metric) and 5-fold stratified cross-validation. Because the problem is imbalanced (70%/30% class balance), and to avoid a bias towards the majority class, samples were weighted by the inverse frequency of their class (`class_weight='balanced'`).
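To make the weighting concrete, the sketch below computes the `'balanced'` weights with `sklearn.utils.class_weight.compute_class_weight`, i.e. `n_samples / (n_classes * class_count)`. It uses the 700/300 split of the full dataset for illustration; in practice the weights are derived from whatever labels are passed to `fit`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels reproducing the 700/300 class balance of the full dataset.
y_all = np.array([0] * 700 + [1] * 300)

# 'balanced' weights: n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_all)
print(dict(zip([0, 1], weights.round(3))))
```

The minority class (defaults) receives a weight of about 1.667 against about 0.714 for the majority class, so that both classes contribute equally in total.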

### Note on chosen metrics
As our classification problem is imbalanced:
* We chose not to include accuracy in our metrics.
* We chose not to use the ROC curve and its associated area under the curve (AUC), and preferred the Precision/Recall (PR) curve and its AUC.
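A small illustration of why accuracy is misleading here, using synthetic labels with the same 70%/30% balance: a classifier that always predicts the majority class looks acceptable on accuracy while never detecting a single default.

```python
import numpy as np
from sklearn import metrics

# Synthetic labels with the dataset's class balance (1 = default).
y_true = np.array([0] * 700 + [1] * 300)

# A trivial "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

accuracy = metrics.accuracy_score(y_true, y_majority)
recall = metrics.recall_score(y_true, y_majority)
print(accuracy, recall)
```

Accuracy is 0.7 even though recall on the default class is 0.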

### Note on performance
The assessed model performance seems rather mediocre at first sight: 0.XX precision and 0.XX recall. From a business perspective, it may not be as bad:
* The model avoids granting a loan to XX% of the bad debtors.
* Among all the declined loan applications, however, XX% were actually sound projects.

However, we do not know how the business weighs a missed credit opportunity against a default.
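The two business figures above map directly onto recall and precision for the default class: recall is the share of bad debtors who are declined, and `1 - precision` is the share of declined applications that were in fact sound. The sketch below shows the arithmetic on hypothetical labels; the numbers are illustrative and are not the model's results.

```python
from sklearn import metrics

# Hypothetical labels and predictions (1 = default, 0 = repaid).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

recall = metrics.recall_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)

# Share of actual defaulters that the model would decline.
share_defaulters_declined = recall
# Share of declined applications that would actually have been repaid.
share_declined_sound = 1 - precision

print(share_defaulters_declined, share_declined_sound)
```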

Next modelling steps would include:
* Experimenting with other model specifications
* Experimenting with model selection metrics that give different weights to precision and recall
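One possible candidate for such a metric is the F-beta score, where `beta > 1` favours recall and `beta < 1` favours precision. The sketch below builds a scorer with `make_scorer`; the choice of `beta=2` and the toy labels are assumptions for illustration only.

```python
from sklearn.metrics import fbeta_score, make_scorer

# A scorer that weighs recall more heavily than precision (beta=2 is an assumption).
f2_scorer = make_scorer(fbeta_score, beta=2)

# Hypothetical labels and predictions: precision = 0.6, recall = 0.75.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

f1 = fbeta_score(y_true, y_pred, beta=1)
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f1, f2)
```

The scorer object can be passed as the `scoring` argument of `GridSearchCV` in place of `'f1'`.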

Simply one hot encoded variables
Model spec determined using GridSearch, code not included as not part of requirements
Model training understood as not including hyperparameter tuning. Hyperparameter tuning is part of model specification. Reworking model spec is not required as often as retraining. Resource consuming and may require a data scientist.

### Script 1
We chose a stratified split to ensure that the minority class proportion remains the same after the split.

The script could be improved by giving the user more flexibility through more command-line arguments.
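A minimal sketch of what such a command-line interface could look like, using `argparse` from the standard library. The option names (`--input-file`, `--output-dir`, `--test-size`, `--seed`) are hypothetical; the defaults mirror the constants currently hard-coded in the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line options of the split script."""
    parser = argparse.ArgumentParser(
        description='Split a CSV file into train and test sets.')
    parser.add_argument('--input-file', default='./data.csv')
    parser.add_argument('--output-dir', default='.')
    parser.add_argument('--test-size', type=float, default=0.3)
    parser.add_argument('--seed', type=int, default=42)
    return parser.parse_args(argv)


# Example: override only the test size; the other options keep their defaults.
args = parse_args(['--test-size', '0.2'])
print(args.input_file, args.output_dir, args.test_size, args.seed)
```

The parsed values would then replace `INPUT_FILE_PATH`, `OUTPUT_DIR_PATH`, `TEST_SIZE`, and `SEED` in the body of the script.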

### Script 2
Model training has been understood as not including hyperparameter tuning (and, as it was not required, the code related to the hyperparameter tuning of our model has not been included). With frequent and automatic retraining in mind, hyperparameter tuning is left out of training for two main reasons:
* Hyperparameter tuning is costly.
* Hyperparameter tuning is part of the model specification, which should not change with each training (it can, but less frequently). Furthermore, as model specification may require the intervention of a data scientist, we may not want to automate it.

We can make two suggestions for improving this script:
* The model specification is currently hard-coded into the script. We could leverage the fact that this specification is included in the model file. Re-training a model could therefore be done by simply using the latest model.
* The input feature order is currently hard-coded into the script and it would be better if it were attached to the model as metadata.
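Both suggestions can be sketched together. Below, the model is saved in a dictionary alongside its feature list, and retraining uses `sklearn.base.clone` to obtain an unfitted copy with the same hyperparameters. The bundle layout, the toy data, and the use of a small `DecisionTreeClassifier` as a stand-in for the pipeline are all assumptions made to keep the example short; `clone` works the same way on a `Pipeline`.

```python
import io

import joblib
import pandas as pd
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data and feature list, for illustration only.
FEATURES = ['installment_rate', 'dependents']
toy = pd.DataFrame({
    'installment_rate': [4, 2, 2, 3],
    'dependents': [1, 1, 2, 2],
    'default': [0, 1, 0, 1],
})

# Stand-in estimator; the notebook's pipeline could be bundled the same way.
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(toy[FEATURES], toy['default'])

# Store the feature order next to the model (assumed bundle layout).
bundle = {'model': model, 'input_features': FEATURES}

# An in-memory buffer stands in for the model file on disk.
buffer = io.BytesIO()
joblib.dump(bundle, buffer)
buffer.seek(0)
loaded = joblib.load(buffer)

# clone() returns an unfitted estimator with identical hyperparameters,
# so the specification travels with the saved model.
retrained = clone(loaded['model'])
retrained.fit(toy[loaded['input_features']], toy['default'])
print(retrained.get_params()['max_depth'], loaded['input_features'])
```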

### Script 3
We described our choice of metrics further above. This script generates the report as a simple text (.txt) file. Including curves, or more generally plots, would require it to produce a CSV.
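One reading of the CSV idea is to export the points of the PR curve so that they can be plotted elsewhere. The sketch below writes them to an in-memory buffer; the toy labels, the scores, and the column names are assumptions.

```python
import io

import pandas as pd
from sklearn import metrics

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = metrics.precision_recall_curve(y_true, y_proba)

# precision and recall have one more element than thresholds (the final
# point with recall 0), so the last element is dropped to align the columns.
curve = pd.DataFrame({
    'precision': precision[:-1],
    'recall': recall[:-1],
    'threshold': thresholds,
})

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
curve.to_csv(buffer, index=False)
print(buffer.getvalue())
```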

Include std dev