This notebook walks through the steps involved in developing a machine learning model that is later deployed as an endpoint, ready to be integrated into other systems with a simple GET request

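Load the data, select the relevant variables and apply two transformations: express the interest rate as a percentage and convert the log of annual income back to its original scale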

In [3]:
import numpy as np
import pandas as pd
import warnings

#Opt in to the future pandas behaviour of not silently downcasting results
pd.set_option('future.no_silent_downcasting', True)
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.precision', 2)


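data = pd.read_csv('loan_data.csv')
#Replace literal dots in the column names with underscores
data.columns = data.columns.str.replace('.','_', regex=False)
#Copy the selection so the assignments below modify an independent frame
data = data[['credit_policy','purpose', 'int_rate', 'installment', 'log_annual_inc', 'dti', 'fico',
       'days_with_cr_line', 'revol_bal', 'revol_util']].copy()
#Express the interest rate as a percentage
data['int_rate'] = data['int_rate']*100
#Undo the log transform; the column keeps its original name but now holds annual income
data['log_annual_inc'] = np.exp(data['log_annual_inc'])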
data

Unnamed: 0,credit_policy,purpose,int_rate,installment,log_annual_inc,dti,fico,days_with_cr_line,revol_bal,revol_util
0,1,debt_consolidation,11.89,829.10,85000.0,19.48,737,5639.96,28854,52.1
1,1,credit_card,10.71,228.22,65000.0,14.29,707,2760.00,33623,76.7
2,1,debt_consolidation,13.57,366.86,32000.0,11.63,682,4710.00,3511,25.6
3,1,debt_consolidation,10.08,162.34,85000.0,8.10,712,2699.96,33667,73.2
4,1,credit_card,14.26,102.92,80800.0,14.97,667,4066.00,4740,39.5
...,...,...,...,...,...,...,...,...,...,...
9573,0,all_other,14.61,344.76,195000.0,10.39,672,10474.00,215372,82.1
9574,0,all_other,12.53,257.70,69000.0,0.21,722,4380.00,184,1.1
9575,0,debt_consolidation,10.71,97.81,40000.0,13.09,687,3450.04,10036,82.9
9576,0,home_improvement,16.00,351.58,50000.0,19.18,692,1800.00,0,3.2
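
The two transformations above can be checked on a tiny frame. This is a minimal sketch with made-up rows; the dotted column names (`int.rate`, `log.annual.inc`) are an assumption about the raw CSV headers, and the extra `annual_inc` column is hypothetical, introduced here only to make the inverse-log step explicit.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; dotted names are an assumed raw-header style.
toy = pd.DataFrame({
    "int.rate": [0.1189, 0.1071],
    "log.annual.inc": [np.log(85000.0), np.log(65000.0)],
})

# regex=False treats '.' as a literal dot rather than "any character".
toy.columns = toy.columns.str.replace(".", "_", regex=False)

# Convert the rate to a percentage and undo the natural log.
toy["int_rate"] = toy["int_rate"] * 100
toy["annual_inc"] = np.exp(toy["log_annual_inc"])  # hypothetical column name

print(toy.columns.tolist())
print(toy["annual_inc"].round(2).tolist())
```

Writing the result to a separately named column, as in this sketch, avoids a column called `log_annual_inc` that no longer holds a logarithm; the notebook itself overwrites the column in place.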


View a summary of the data

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit_policy      9578 non-null   int64  
 1   purpose            9578 non-null   object 
 2   int_rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log_annual_inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64  
 7   days_with_cr_line  9578 non-null   float64
 8   revol_bal          9578 non-null   int64  
 9   revol_util         9578 non-null   float64
dtypes: float64(6), int64(3), object(1)
memory usage: 748.4+ KB


Convert the categorical variable to the pandas category dtype, which also reduces memory usage

In [5]:
data['purpose'] = data['purpose'].astype('category')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   credit_policy      9578 non-null   int64   
 1   purpose            9578 non-null   category
 2   int_rate           9578 non-null   float64 
 3   installment        9578 non-null   float64 
 4   log_annual_inc     9578 non-null   float64 
 5   dti                9578 non-null   float64 
 6   fico               9578 non-null   int64   
 7   days_with_cr_line  9578 non-null   float64 
 8   revol_bal          9578 non-null   int64   
 9   revol_util         9578 non-null   float64 
dtypes: category(1), float64(6), int64(3)
memory usage: 683.3 KB


Do a simple exploratory plot

The classes are highly imbalanced and, surprisingly, people taking loans for
debt consolidation have a better chance of getting a loan than business
applicants or those seeking loans to finance education

In [7]:
import matplotlib.pyplot as plt
import seaborn as sns


fig, axes = plt.subplots(1,2, figsize = (10,4))

sns.countplot(data=data,y='purpose',hue='credit_policy',stat='percent',orient='x', ax=axes[0])
sns.countplot(data=data,x='credit_policy', hue='credit_policy', stat='percent', ax=axes[1])

ModuleNotFoundError: No module named 'dateutil.rrule'
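
When plotting is unavailable, the same kind of summary can be read from a table. The sketch below uses a small made-up frame, not the loan data, and computes the share of each `credit_policy` value within each purpose with `pd.crosstab`; the shares are normalised per row, which is a choice made for this example.

```python
import pandas as pd

# Illustrative rows only; the purpose labels are taken from the table above.
toy = pd.DataFrame({
    "purpose": ["debt_consolidation", "debt_consolidation", "credit_card", "credit_card",
                "all_other", "all_other", "all_other", "home_improvement"],
    "credit_policy": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Share of each credit_policy value within each purpose (each row sums to 1).
by_purpose = pd.crosstab(toy["purpose"], toy["credit_policy"], normalize="index")

# Overall class balance across all rows.
overall = toy["credit_policy"].value_counts(normalize=True)

print(by_purpose)
print(overall)
```

On the real data, replacing `toy` with `data` would give the per-purpose approval shares and the overall class balance as numbers rather than bars.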

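Split the data into a training set for fitting the model and a testing set for evaluation.
Holding out a testing set does not prevent overfitting, but it reveals it: an overfitted model
scores much better on the training set than on data it has not seen. Common causes of overfitting
include excess parameters, high dimensionality and, for some models, multicollinearity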

When a model is overfitting, it tends to describe noise and random errors in the data 
instead of the actual relationships that exist within the data

In [8]:
from sklearn.model_selection import train_test_split

X = data.drop('credit_policy', axis=1)
y = data['credit_policy']

X_train,X_test, y_train, y_test = train_test_split(
    X, y , test_size=.25
)
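
With imbalanced classes, a stratified and seeded split may be worth considering. The following is a minimal sketch on synthetic data, not the notebook's split: `stratify` keeps the class proportions the same in both sets and `random_state` makes the split repeatable. The seed value and the toy feature are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 80% positives.
rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({"fico": rng.integers(600, 850, size=n)})
y_toy = pd.Series((rng.random(n) < 0.8).astype(int), name="credit_policy")

# stratify preserves the class ratio; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# The positive rate should be nearly identical in both sets.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```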

Scale the numeric variables so that the scale of a variable does not disguise its importance,
and one-hot encode the categorical variable

These steps are encapsulated in a data processing pipeline, which will automate the process
in production

In [9]:
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector as selector

from sklearn.ensemble import RandomForestClassifier

cat_pp = Pipeline(
    [('one_hot',ce.OneHotEncoder(return_df=False, use_cat_names=True))]
)

num_pp = Pipeline(
    [('scale',StandardScaler())]
)

processing = ColumnTransformer(
    [('cat',cat_pp,selector(dtype_include='category')),
     ('num',num_pp,selector(dtype_exclude='category'))]
)

clf = Pipeline(
    [
        ('preprocessor',processing),
        ('classifier',RandomForestClassifier(n_estimators=1000, criterion='log_loss'))
    ]
)

clf.fit(X_train,y_train)
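
To see what the preprocessing stage produces, it can help to run a similar `ColumnTransformer` on a few rows. This sketch is an approximation of the pipeline above, not a copy: it swaps in scikit-learn's own `OneHotEncoder` for `category_encoders` so that it has no third-party dependency, and the rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows: one category column and two numeric columns.
toy = pd.DataFrame({
    "purpose": pd.Categorical(
        ["credit_card", "all_other", "credit_card", "home_improvement"]
    ),
    "fico": [737, 707, 682, 712],
    "dti": [19.48, 14.29, 11.63, 8.10],
})

# Route columns by dtype: category columns are one-hot encoded, the rest scaled.
processing_toy = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_include="category")),
    ("num", StandardScaler(), selector(dtype_exclude="category")),
])

features = processing_toy.fit_transform(toy)

# Three purpose levels plus two numeric columns give five output columns.
print(features.shape)
```

Because the selectors work on dtypes, the `purpose` column must still have the category dtype at prediction time for it to reach the encoder.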

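Evaluating the model
Even without much hyperparameter tuning, the model reaches about 90% accuracy on the test data.
Because the classes are imbalanced, accuracy alone can overstate how well the minority class is predicted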

In [10]:
clf.score(X_test,y_test)

0.9031315240083507
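
One way to put an accuracy figure in context is to compare it with a majority-class baseline and to report balanced accuracy as well. The sketch below does this on synthetic imbalanced data from `make_classification`; the data, the smaller forest and the seeds are assumptions for illustration and the numbers it prints say nothing about the loan model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 20/80 class split.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)

# Balanced accuracy averages recall over classes, so the baseline scores 0.5.
baseline_bal = balanced_accuracy_score(y_te, baseline.predict(X_te))
model_bal = balanced_accuracy_score(y_te, model.predict(X_te))

print(f"accuracy: baseline={baseline_acc:.3f} model={model_acc:.3f}")
print(f"balanced accuracy: baseline={baseline_bal:.3f} model={model_bal:.3f}")
```

The same comparison could be run with `clf`, `X_test` and `y_test` from the cells above to see how far the model's score sits above the baseline.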

We can test the model on a random sample to ensure it runs successfully

In [None]:
sample = X_test.sample()
sample
clf.predict(sample)[0]

Here the model (the pre-processing steps and the fitted classifier) is saved in a format that can be
accessed and used elsewhere. In this case, the model will be run on a server

In [11]:
import joblib

joblib.dump(clf, filename='../app/ml_model/model.joblib', compress=3)

['../app/ml_model/model.joblib']
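
On the serving side, the saved pipeline would be restored with `joblib.load` and used directly for prediction. The round trip can be sketched as follows, using a small stand-in pipeline and a temporary directory rather than the notebook's model and path; the logistic regression and the random data are assumptions chosen to keep the example fast.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on random data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

pipe = Pipeline(
    [("scale", StandardScaler()), ("classifier", LogisticRegression())]
).fit(X_toy, y_toy)

# Dump to a temporary file and load it back, as a server process would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path, compress=3)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(pipe.predict(X_toy), restored.predict(X_toy))
print(same)
```

The environment that loads the file generally needs the same library versions as the one that saved it, and a joblib file should only be loaded from a trusted source because loading it can execute arbitrary code.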