In [1]:
import h2o
import joblib
import pandas as pd
import numpy as np
import os
import re
import helper



# Prediction testing
I will test whether the trained model can produce a prediction for a given input. First, I will run the test in my local Jupyter notebook: load the model, preprocess the input data, and predict it. Then, I will create a FastAPI app for the model and try to replicate the prediction through the API.

## Local notebook prediction
First, I will get the model path and initialize H2O.

In [2]:
current_dir = os.getcwd()
model_filename = 'model/dl_grid_model_66'
knn_initial_filename = 'model/knn_imputer_model.pkl'
knn_cur_filename = 'model/knn_imputer_model_no_multicol.pkl'
scaler_filename = 'model/scaler_no_multicol.pkl'
#years_in_current_job_filename = 'model/years_in_current_job_mapping.pkl'
purpose_filename = 'model/purpose_mapping.pkl'

model_path = os.path.join(current_dir, model_filename)
knn_initial_path = os.path.join(current_dir, knn_initial_filename)
knn_cur_path = os.path.join(current_dir, knn_cur_filename)
scaler_path = os.path.join(current_dir, scaler_filename)
#years_in_current_job_path = os.path.join(current_dir, years_in_current_job_filename)
purpose_path = os.path.join(current_dir, purpose_filename)

In [3]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 17.0.8+9-LTS-211, mixed mode, sharing)
  Starting server from C:\Users\agust\Anaconda3\envs\h2o_loan_classification\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\agust\AppData\Local\Temp\tmpym8ubcvs
  JVM stdout: C:\Users\agust\AppData\Local\Temp\tmpym8ubcvs\h2o_agust_started_from_python.out
  JVM stderr: C:\Users\agust\AppData\Local\Temp\tmpym8ubcvs\h2o_agust_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 01 secs
H2O_cluster_timezone: Asia/Jakarta
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.2
H2O_cluster_version_age: 1 month and 22 days
H2O_cluster_name: H2O_from_python_agust_aktxgq
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.854 Gb
H2O_cluster_total_cores: 16
H2O_cluster_allowed_cores: 16


Then, I will load the model together with the saved KNN imputers, scaler, and purpose mapping used to preprocess the data.

In [4]:
model = h2o.load_model(model_path)
knn_initial_model = joblib.load(knn_initial_path)
knn_cur_model = joblib.load(knn_cur_path)
scaler = joblib.load(scaler_path)
#years_in_current_job_mapping = joblib.load(years_in_current_job_path)
purpose_mapping = joblib.load(purpose_path)

I will create a dictionary as the input to the model, just to simplify the process. In practice, a software developer would build a front-end app that collects the input in a predetermined format, and I would receive that data through an API, for example with the requests library. The data usually arrives as JSON, which I have to convert into a Python dictionary. For this test, let's assume the dictionary below was created from that JSON.
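
As a small illustration of that JSON-to-dictionary step, the sketch below parses a JSON string with the standard library. The `raw_body` string is a made-up, shortened payload used only for this example; it is not taken from a real front end.

```python
import json

# Hypothetical request body, shortened to three of the input fields
raw_body = '{"current_loan_amount": 10167, "term": "Short Term", "credit_score": 7380.0}'

# json.loads turns the JSON text into a plain Python dictionary
payload = json.loads(raw_body)

print(type(payload).__name__)  # dict
print(payload["term"])         # Short Term
```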

In [5]:
#creating test dictionary
test_dict = {'current_loan_amount': 10167,
 'term': 'Short Term',
 'credit_score': 7380.0,
 'years_in_current_job': 3,
 'home_ownership': 'Own Home',
 'annual_income': 42701.0,
 'purpose': 'Debt Consolidation',
 'monthly_debt': 761.51,
 'years_of_credit_history': 25.8,
 'months_since_last_delinquent': 5.5,
 'number_of_open_accounts': 7,
 'number_of_credit_problems': 0,
 'current_credit_balance': 11283,
 'maximum_open_credit': 16954.0,
 'bankruptcies': 0.0,
 'tax_liens': 0.0}

Next, I will create a function that processes the dictionary and turns it into a working dataframe with exactly the same columns as the one used to train the model. The steps are:
1. Convert the dictionary into a pandas dataframe.
2. Clean the credit_score column: scores above 850 are divided by 10.
3. Convert the values in the term column into 1 for long term and 0 for short term.
4. Impute the missing values using the saved KNN model from the previous notebook.
5. Simplify purpose into debt_consolidation, business_loans, personal_loans, and other.
6. Create the features from the previous notebook, then drop the unused columns.
7. Impute the credit_utilization_ratio column if necessary.
8. One-hot encode the purpose and home_ownership columns, assign the results to the correct dummy variables, and drop the original columns.
9. Standardize the numerical values except for the ratio and binary columns.
10. Return the resulting dataframe.
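
To make steps 2 and 3 concrete, here is a minimal dictionary-based sketch of the two rules applied to a single record. The function name `clean_record` is hypothetical, and this is only an illustration of the rules; the notebook's own function applies them on a pandas dataframe instead.

```python
def clean_record(record: dict) -> dict:
    """Apply the credit_score and term rules to one input record."""
    cleaned = dict(record)  # copy so the caller's dictionary is not mutated

    # Step 2: a score above 850 is divided by 10
    if cleaned["credit_score"] > 850:
        cleaned["credit_score"] = cleaned["credit_score"] / 10

    # Step 3: normalise the label to snake case, then map it to 0 or 1
    term_key = cleaned["term"].strip().lower().replace(" ", "_")
    cleaned["term"] = {"short_term": 0, "long_term": 1}[term_key]
    return cleaned


sample = {"credit_score": 7380.0, "term": "Short Term"}
print(clean_record(sample))  # {'credit_score': 738.0, 'term': 0}
```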

In [6]:
def create_dataframe(data, knn_initial_model, purpose_mapping, knn_cur_model, scaler):
    df = pd.DataFrame([data])
        
    #clean credit score
    df.loc[df['credit_score'] > 850, 'credit_score'] = df.loc[df['credit_score'] > 850, 'credit_score'] / 10
    
    #clean home ownership
    df['home_ownership'] = df['home_ownership'].replace('HaveMortgage', 'Home Mortgage')
    
    #strip non-ASCII characters from string values, then convert the categorical columns into lower case and snake case
    df = df.applymap(lambda x: x if not isinstance(x, str) or not helper.has_non_ascii(x) else x.encode('ascii', 'ignore').decode('ascii'))
    categorical_cols = ['term', 'home_ownership', 'purpose']
    for col in categorical_cols:
        df[col] = helper.clean_columns(df[col].tolist())
    
    #convert term 
    term_dict = {'short_term':0, 'long_term':1}
    df.replace({"term": term_dict}, inplace=True)
    
    #impute missing values if any
    column_names_to_impute = ['current_loan_amount', 'credit_score', 'years_in_current_job', 'annual_income', 'months_since_last_delinquent', 'maximum_open_credit', 'bankruptcies', 'tax_liens']
    column_with_missing_values = df.columns[df.isnull().any()].tolist()
    imputed = knn_initial_model.transform(df[column_names_to_impute].values)
    data_temp = pd.DataFrame(imputed, columns=column_names_to_impute, index = df.index)
    df[column_with_missing_values] = data_temp[column_with_missing_values]
    
    #simplify purpose
    #purposes that are not in the mapping fall back to 'other'
    df['purpose'] = df['purpose'].map(purpose_mapping).fillna('other')
    
    #feature engineering
    df['debt_equity_ratio'] = df['monthly_debt'] / df['annual_income']
    df['credit_utilization_ratio'] = df['current_credit_balance'] / df['maximum_open_credit']
    df['is_months_delinquent_missing'] = df['months_since_last_delinquent'].isnull().astype(int)
    df['has_stable_job'] = (df['years_in_current_job'] > 2).astype(int)
    
    #drop unneeded columns
    df.drop(['bankruptcies', 'monthly_debt', 'annual_income', 'current_credit_balance', 'years_in_current_job', 'maximum_open_credit', 'months_since_last_delinquent', 'tax_liens'], axis = 1, inplace = True)
    
    #impute credit_utilization_ratio if needed
    #replace infinities first, so that a division by a zero maximum_open_credit is also treated as missing
    df = df.replace([np.inf, -np.inf], np.nan)
    column_with_missing_values = df.columns[df.isnull().any()].tolist()
    if len(column_with_missing_values) > 0:
        column_names_to_impute = ['credit_utilization_ratio']
        imputed = knn_cur_model.transform(df[column_names_to_impute].values)
        data_temp = pd.DataFrame(imputed, columns=column_names_to_impute, index = df.index)
        df[column_names_to_impute] = data_temp
    
    #one hot encode purpose and home_ownership
    #dummy variable names for purpose and home_ownership in the expected dataframe 
    all_purpose_cols = ['purpose_debt_consolidation', 'purpose_other', 'purpose_personal_loans'] 
    all_home_cols = ['home_own_home', 'home_rent']
    
    #one hot encode
    new_dummies_purpose = pd.get_dummies(df['purpose'], prefix='purpose').replace({True: 1, False: 0})
    new_dummies_home = pd.get_dummies(df['home_ownership'], prefix='home').replace({True: 1, False: 0})
    list_dummies_purpose = list(new_dummies_purpose.columns)
    list_dummies_home = list(new_dummies_home.columns)
    
    #create similar column to expected dataframe
    for col in all_purpose_cols:
        if col not in new_dummies_purpose.columns:
            new_dummies_purpose[col] = 0
    for col in all_home_cols:
        if col not in new_dummies_home.columns:
            new_dummies_home[col] = 0
    
    #drop first dummies if necessary
    for col in list_dummies_purpose:
        if col not in all_purpose_cols:
            new_dummies_purpose.drop(col, axis=1, inplace=True)

    for col in list_dummies_home:
        if col not in all_home_cols:
            new_dummies_home.drop(col, axis=1, inplace=True)
    
    #change the values into the dummy variables
    df[new_dummies_purpose.columns] = new_dummies_purpose
    df[new_dummies_home.columns] = new_dummies_home
    
    #drop home_ownership and purpose
    df.drop(["home_ownership", "purpose"], axis=1, inplace = True)
    
    #standardize numeric values except for binaries and ratios
    cols_to_standardize = ['current_loan_amount', 'credit_score', 'years_of_credit_history', 'number_of_open_accounts', 'number_of_credit_problems']
    data_scaled = scaler.transform(df[cols_to_standardize].values)
    data_temp = pd.DataFrame(data_scaled, columns=cols_to_standardize, index = df.index)
    df[cols_to_standardize] = data_temp
    
    return df
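
The one-hot encoding step in the function above adds the missing dummy columns and drops the unexpected ones with explicit loops. As a possible alternative, the same alignment can be sketched with `DataFrame.reindex`. The helper name `align_dummies` is hypothetical and not part of the notebook; the expected column list is copied from the function.

```python
import pandas as pd

expected_purpose_cols = ['purpose_debt_consolidation', 'purpose_other', 'purpose_personal_loans']


def align_dummies(values: pd.Series, prefix: str, expected_cols: list) -> pd.DataFrame:
    """One-hot encode a column and force the result onto the expected dummy columns."""
    dummies = pd.get_dummies(values, prefix=prefix)
    # reindex keeps only the expected columns and fills the absent ones with 0
    return dummies.reindex(columns=expected_cols, fill_value=0).astype(int)


# 'business_loans' has no dummy column in the expected dataframe, so every dummy is 0
print(align_dummies(pd.Series(['business_loans']), 'purpose', expected_purpose_cols))

# 'other' sets only purpose_other to 1
print(align_dummies(pd.Series(['other']), 'purpose', expected_purpose_cols))
```

This assumes, as in the function above, that a category without its own dummy column is represented by all zeros.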

In [7]:
#use the function to preprocess the input data
df = create_dataframe(test_dict, knn_initial_model, purpose_mapping, knn_cur_model, scaler)

Next, I will create a prediction function. The function takes an H2O model, an H2O dataframe, and threshold_metrics, and returns the prediction as a dictionary.

I use threshold_metrics to give users flexibility in how the data is predicted: it can be set to precision, recall, or f1 as needed. The steps in the function are as follows:
1. Predict the H2O dataframe using the model.
2. Convert the prediction result into a pandas dataframe.
3. Get the probability of loan_given.
4. Find the threshold that maximizes the chosen metric (precision, recall, or f1).
5. Compare the probability of loan_given to the threshold. If it is higher, the loan is given; otherwise, the loan is refused.
6. Create a dictionary containing a list with the resulting prediction.
7. Return the dictionary.
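
Steps 5 to 7 do not depend on H2O, so they can be sketched in isolation. The function name `label_from_probability` and the numbers passed to it are made up for this illustration; in the notebook, the probability comes from the model's prediction and the threshold from the model itself.

```python
def label_from_probability(loan_given_prob: float, threshold: float) -> dict:
    """Turn a loan_given probability into the prediction dictionary."""
    # Strictly greater than the threshold means the loan is given
    label = 'loan_given' if loan_given_prob > threshold else 'loan_refused'
    # 'loan_given' becomes 'Loan Given' for display
    return {'prediction': [label.replace('_', ' ').title()]}


print(label_from_probability(0.9, 0.5))  # {'prediction': ['Loan Given']}
print(label_from_probability(0.2, 0.5))  # {'prediction': ['Loan Refused']}
```

A probability exactly equal to the threshold is treated as a refusal, because the comparison is strict.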

In [8]:
def predict_data(model, hf, threshold_metrics='precision'):
    #predict
    predictions = model.predict(hf)
    
    #convert to pandas dataframe
    prediction_df = predictions.as_data_frame()
    
    #applying threshold to the prediction
    loan_given_prob = prediction_df['loan_given'].tolist()[0]
    #loan_refused_prob = prediction_df['loan_refused'].tolist()[0]
    threshold = model.find_threshold_by_max_metric(threshold_metrics)
    
    loan_prediction = [('loan_given' if loan_given_prob > threshold else 'loan_refused').replace('_', ' ').title()]
    
    #create prediction dictionary for json
    prediction_dict = {}
    prediction_dict['prediction'] = loan_prediction
    return prediction_dict

In [9]:
#parse the pandas dataframe into H2O dataframe
hf = h2o.H2OFrame(df)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [10]:
#predict the data
predict_data(model, hf, threshold_metrics='precision')

deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100%


{'prediction': ['Loan Given']}

## FastAPI Prediction
I deployed the model in the 'app' folder. I selected the functions from the helper module that are useful for prediction and saved them in preprocessing.py.

To run the server, go to the terminal and type:
uvicorn main:LoanPredApp --reload

'main' is used because the file is called main.py.
'LoanPredApp' is used because the app object inside main.py is called LoanPredApp.

In [13]:
import requests

In [11]:
url = 'http://localhost:8000/predict'

To call the API, we use requests.post with the format below:

In [18]:
response = requests.post(
    url,
    json=test_dict,  # requests serializes test_dict as JSON
    headers={"Content-Type": "application/json"}  # Set the appropriate Content-Type header
)

In [19]:
response.json()

{'prediction': ['Loan Given']}

For the original input, the API returns the same prediction as the notebook, which is 'Loan Given'. I will test it with another input below:

In [20]:
#creating test dictionary
test_dict = {'current_loan_amount': 5167,
 'term': 'Long Term',
 'credit_score': 350.0,
 'years_in_current_job': 1,
 'home_ownership': 'Rent',
 'annual_income': 12701.0,
 'purpose': 'Debt Consolidation',
 'monthly_debt': 1061.51,
 'years_of_credit_history': 25.8,
 'months_since_last_delinquent': 5.5,
 'number_of_open_accounts': 7,
 'number_of_credit_problems': 0,
 'current_credit_balance': 11283,
 'maximum_open_credit': 16954.0,
 'bankruptcies': 0.0,
 'tax_liens': 0.0}

In [21]:
response = requests.post(
    url,
    json=test_dict,  # requests serializes test_dict as JSON
    headers={"Content-Type": "application/json"}  # Set the appropriate Content-Type header
)

In [22]:
response.json()

{'prediction': ['Loan Refused']}

For this input, the prediction is 'Loan Refused'.
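
Comparing the notebook and the API one input at a time works for a couple of records. For a larger set of inputs, the comparison could be automated with a small helper like the sketch below. The name `find_mismatches` and the two stub predictors are hypothetical and exist only to make the example runnable without a server.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Predictor = Callable[[Record], Dict[str, List[str]]]


def find_mismatches(records: List[Record], local_predict: Predictor, api_predict: Predictor) -> List[int]:
    """Return the positions of the records whose two predictions differ."""
    mismatches = []
    for position, record in enumerate(records):
        if local_predict(record) != api_predict(record):
            mismatches.append(position)
    return mismatches


# Stub predictors standing in for the notebook pipeline and the API call
def stub_local(record: Record) -> Dict[str, List[str]]:
    return {'prediction': ['Loan Given']}


def stub_api(record: Record) -> Dict[str, List[str]]:
    return {'prediction': ['Loan Given']}


print(find_mismatches([{'term': 'Short Term'}, {'term': 'Long Term'}], stub_local, stub_api))  # []
```

With the server running, api_predict could wrap the requests.post call shown earlier and return response.json(), while local_predict could wrap create_dataframe, h2o.H2OFrame, and predict_data.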

The next step is to dockerize this app, after which I can consider this project finished.