# Threshold Tuning and Saving to Pickle File

In this notebook we will perform threshold tuning and save our fitted model to a pickle file.

## Importing Packages

In [1]:
import pandas as pd
import numpy as np

## Reading-In Training Data

In [2]:
df_default = pd.read_csv("data_processed/01_binary_training.csv", low_memory=False)
df_default.head().T

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| funded_amnt | 30000.0 | 7850.0 | 25000.0 | 23000.0 | 12000.0 |
| addr_state | IL | IN | AZ | CO | TX |
| annual_inc | 70000.0 | 95000.0 | 115000.0 | 177000.0 | 98000.0 |
| application_type | Individual | Individual | Individual | Individual | Individual |
| dti | 22.78 | 13.97 | 23.27 | 13.91 | 22.3 |
| earliest_cr_line | Apr-1996 | Apr-1990 | Dec-2001 | Aug-2001 | Sep-2000 |
| emp_length | < 1 year | 4 years | 9 years | < 1 year | 10+ years |
| emp_title | Surgical Clinical Reviewer | Technician | VP of Operations | Property Manager | HEAVY DUTY DRIVER |
| fico_range_high | 729.0 | 679.0 | 704.0 | 704.0 | 699.0 |
| fico_range_low | 725.0 | 675.0 | 700.0 | 700.0 | 695.0 |


## Feature Selection

Moving forward we will focus on the numeric features only.

In [3]:
numeric_features = [
    "funded_amnt",
    "last_pymnt_amnt",
    "int_rate",
    "loan_amnt",
    "installment",
    "acc_open_past_24mths",
    "dti",
    "fico_range_low",
    "mort_acc",
]

## Creating the `FeatureSelector` Column Transformer

Here we create the custom feature selector, which will be the first step in all of our pipelines.

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.columns]
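
As a quick check of what the selector does, here is a minimal sketch that applies it to a small, hypothetical DataFrame. The `toy` frame and its three columns are assumptions made for illustration (the values echo the first two rows shown earlier), and the class is repeated only so the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Same selector as above, repeated so this sketch is self-contained."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: the column list is fixed at construction time.
        return self

    def transform(self, X, y=None):
        # Keep only the requested columns, in the requested order.
        return X[self.columns]


# Hypothetical toy frame with one non-numeric column that should be dropped.
toy = pd.DataFrame({
    "funded_amnt": [30000.0, 7850.0],
    "addr_state": ["IL", "IN"],
    "dti": [22.78, 13.97],
})

selected = FeatureSelector(["funded_amnt", "dti"]).fit_transform(toy)
print(list(selected.columns))
```

Because `transform` indexes with `X[self.columns]`, the input is expected to be a DataFrame; a plain NumPy array would not work with column names.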

## Preprocessing

Next we use `Pipeline` and `ColumnTransformer` to preprocess our data.

In [5]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

In [6]:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler(with_mean=True))  
])

In [7]:
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, numeric_features),
    ],
    remainder='passthrough',
)
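
To see what mean imputation followed by standard scaling does to a column, here is a small sketch on a hypothetical three-row frame (the `toy` data is an assumption, not part of the training set). The missing `dti` value is replaced by the column mean, so after scaling it lands at zero.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value in "dti".
toy = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0],
    "fico_range_low": [700.0, 680.0, 720.0],
})

# Same two steps as above: fill gaps with the column mean, then standardize.
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler(with_mean=True)),
])

preprocessor = ColumnTransformer(
    transformers=[("numerical", numerical_transformer, ["dti", "fico_range_low"])],
    remainder="passthrough",
)

out = preprocessor.fit_transform(toy)
print(out)
```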

## Model Definition and Fitting

Now, let's put our preprocessor together with a `RandomForestClassifier` estimator and fit our model.

In [8]:
model = Pipeline(steps=[
    ('feature_selector', FeatureSelector(numeric_features)),
    ('preprocessor', preprocessor),
    ('random_forest', RandomForestClassifier(n_estimators=50, max_depth=10, n_jobs=-1, random_state=0))
])

In [9]:
X = df_default.drop(columns=['charged_off'])
y = df_default['charged_off']

In [10]:
model.fit(X, y)

Scored on the training data, the model reaches about 88.4% accuracy and predicts a default for about 16.9% of loans.

In [11]:
model.score(X, y)

0.8844236680762895

In [12]:
model.predict(X).mean()

0.16917702654506547
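
The scores above are computed on the same rows the model was fitted on, so they are likely optimistic. A minimal sketch of a held-out check follows. It uses synthetic data from `make_classification` as a stand-in for the loan data, and the 85/15 class weights and the 25% test size are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the training data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, weights=[0.85, 0.15], random_state=0
)

# Hold out a stratified 25% so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # in-sample accuracy
test_acc = clf.score(X_test, y_test)     # accuracy on unseen rows
print(f"train accuracy={train_acc:.4f}, test accuracy={test_acc:.4f}")
```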

## Threshold Tuning

The default decision threshold for binary **sklearn** classifiers is 0.5. With imbalanced data sets, performance can sometimes be improved by choosing a different threshold.
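
To make the role of the threshold concrete, the sketch below compares `predict` with an explicit 0.5 cutoff on the positive-class probability, then shows how a lower cutoff flags more positives. It uses synthetic data and a `LogisticRegression` as a simple stand-in classifier; both are assumptions for illustration, not the model from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data (hypothetical stand-in).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=9, weights=[0.85, 0.15], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

pos_probs = clf.predict_proba(X_demo)[:, 1]
default_preds = clf.predict(X_demo)

# Fraction of rows where predict() matches an explicit 0.5 cutoff.
agreement = np.mean(default_preds == (pos_probs > 0.5).astype(int))

# A lower cutoff can only flag the same rows or more.
rate_at_half = np.mean(pos_probs > 0.5)
rate_at_lower = np.mean(pos_probs >= 0.3)
print(f"agreement={agreement:.3f}, rate@0.5={rate_at_half:.3f}, rate@0.3={rate_at_lower:.3f}")
```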

We will now search for the threshold that maximizes accuracy on the training data.

In [13]:
# predict probabilities
yhat = model.predict_proba(X)

In [14]:
# keep probabilities for the positive outcome only
probs = yhat[:, 1]
probs

array([0.17199049, 0.00336728, 0.00094877, ..., 0.58326199, 0.00066442,
       0.65419674])

In [15]:
# define thresholds
thresholds = np.arange(0, 1, 0.01)

In [16]:
# apply threshold to positive probabilities to create inferences
def to_inference(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')

In [17]:
from sklearn.metrics import f1_score, accuracy_score
# evaluate each threshold
scores = [accuracy_score(y, to_inference(probs, t)) for t in thresholds]

It looks like the optimal threshold for our model is 0.48, so fairly close to 0.5.

In [18]:
# get best threshold
ix = np.argmax(scores)
print('Threshold=%.3f, Accuracy=%.5f' % (thresholds[ix], scores[ix]))

Threshold=0.480, Accuracy=0.88495
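
Accuracy can be a weak guide on imbalanced data, and `f1_score` is imported above but not used. As a sketch of the same sweep scored with F1 instead, the example below runs on hypothetical synthetic probabilities (the beta-distributed scores and the 15% positive rate are assumptions), so its numbers say nothing about the loan model.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 5000

# Hypothetical labels with roughly 15% positives.
y_true = (rng.random(n) < 0.15).astype(int)

# Hypothetical scores: positives skew high, negatives skew low.
demo_probs = np.where(y_true == 1, rng.beta(4, 2, n), rng.beta(2, 5, n))


def to_inference(pos_probs, threshold):
    # Label a row positive when its probability reaches the threshold.
    return (pos_probs >= threshold).astype("int")


thresholds = np.arange(0, 1, 0.01)

# zero_division=0 scores a threshold with no predicted positives as 0.
f1_scores = [
    f1_score(y_true, to_inference(demo_probs, t), zero_division=0)
    for t in thresholds
]

best = int(np.argmax(f1_scores))
print("Threshold=%.3f, F1=%.5f" % (thresholds[best], f1_scores[best]))
```

Swapping the metric changes only the scoring line, so the rest of the search stays as it is.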


With this threshold, the share of loans predicted to default rises from about 16.9% to about 18.2%.

In [22]:
to_inference(probs, 0.48).mean()

0.18192201219408638

## Saving Fitted Model to Pickle

Finally, let's save our fitted model to a pickle file.

In [20]:
import joblib

joblib.dump(model, "pickle/lending_club_model.pkl")

['pickle/lending_club_model.pkl']
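
A saved pipeline does not store the tuned threshold, so whoever loads it needs to apply the cutoff to `predict_proba` themselves. The sketch below shows that round trip on a small stand-in pipeline fitted to synthetic data and written to a temporary directory; the `demo_model`, the data, and the file name are assumptions for illustration. Note also that pickle stores a reference to a custom class rather than its code, so loading the real model requires `FeatureSelector` to be importable in the loading process.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

THRESHOLD = 0.48  # tuned value reported above

# Small stand-in pipeline on synthetic data (hypothetical).
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=0)
demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("random_forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
demo_model.fit(X_demo, y_demo)

# Save and reload through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# Apply the tuned threshold to the positive-class probabilities.
restored_probs = restored.predict_proba(X_demo)[:, 1]
restored_preds = (restored_probs >= THRESHOLD).astype("int")
print(restored_preds.mean())
```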