In [1]:
# Module: Data Science in Finance, AutoML
# Version 1.0
# Topic :  AutoML - H2O
# Example source: https://www.kaggle.com/wendykan/lending-club-loan-data
#####################################################################
# For support or questions, contact Sri Krishnamurthy at
# sri@quantuniversity.com
# Copyright 2018 QuantUniversity LLC.
#####################################################################

# AutoML with H2O

AutoML is the process of automating an end-to-end Machine Learning pipeline. The [H2O AutoML](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.

### Imports

In [2]:
import h2o
from h2o.automl import H2OAutoML
h2o.init(max_mem_size='3G')

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_152-release"; OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12); OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from /home/qsandbox7/anaconda3/envs/auto-fin/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpyu3uaodw
  JVM stdout: /tmp/tmpyu3uaodw/h2o_qsandbox7_started_from_python.out
  JVM stderr: /tmp/tmpyu3uaodw/h2o_qsandbox7_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,01 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.1
H2O cluster version age:,12 days
H2O cluster name:,H2O_from_python_qsandbox7_bcpnzu
H2O cluster total nodes:,1
H2O cluster free memory:,2.667 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [3]:
# for numerical analysis and data processing
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.spatial.distance import cdist

### Dataset

The dataset contains Lending Club loan data for a subset of borrowers, covering August 2011 to December 2011. Feature descriptions are provided alongside the data. The original data file contains redundant features that are not needed for prediction; the file used here has already been cleaned and keeps only the relevant ones. Features are of two types: numerical and categorical.

Read the input data from the CSV file.

In [4]:
df = pd.read_csv("../data/LendingClubLoan.csv", low_memory=False)
del df['issue_d']  # drop the issue date: a redundant feature that won't affect the prediction
df_description = pd.read_excel('../data/LCDataDictionary.xlsx').dropna()

In [5]:
df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,purpose,addr_state,dti,delinq_2yrs,inq_last_6mths,loan_status_Binary
0,5000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.0,Verified,credit_card,AZ,27.65,0,1,0
1,2500,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.0,Source Verified,car,GA,1.0,0,5,1
2,2400,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.0,Not Verified,small_business,IL,8.72,0,2,0
3,10000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.0,Source Verified,other,CA,20.0,0,1,0
4,3000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.0,Source Verified,other,OR,17.94,0,0,0
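
The dataset description distinguishes numerical from categorical features. As a rough, dependency-free sketch of that distinction, the snippet below classifies a few columns copied from the first rows of `df.head()` by checking whether every value parses as a number. The helper names `is_number` and `split_feature_types` are hypothetical and are not part of pandas or H2O.

```python
import csv
import io

# A few columns from the first rows of the df.head() output above.
SAMPLE = """loan_amnt,term,int_rate,grade,emp_length,home_ownership,annual_inc,dti
5000,36 months,10.65,B,10+ years,RENT,24000.0,27.65
2500,60 months,15.27,C,< 1 year,RENT,30000.0,1.0
2400,36 months,15.96,C,10+ years,RENT,12252.0,8.72
"""


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_feature_types(csv_text):
    """Group column names into numerical and categorical lists."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numerical, categorical = [], []
    for name in rows[0].keys():
        # A column counts as numerical only if every sampled value is numeric.
        if all(is_number(row[name]) for row in rows):
            numerical.append(name)
        else:
            categorical.append(name)
    return numerical, categorical


numerical, categorical = split_feature_types(SAMPLE)
print("numerical:  ", numerical)
print("categorical:", categorical)
```

On the full DataFrame, `df.select_dtypes(include='number')` gives the pandas equivalent. Either approach looks only at the values, so an integer-coded label such as `loan_status_Binary` would be reported as numerical even though it is categorical in meaning.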


In [6]:
y = 'int_rate'  # response column: the loan's interest rate

### Data preprocessing
The H2O library handles missing data well through its H2OFrame data structure. It also provides some preprocessing tools.

In [7]:
hf = h2o.H2OFrame(df)



Parse progress: |█████████████████████████████████████████████████████████| 100%


Split the H2OFrame into a training set (roughly 80% of the rows) and a test set (the remaining rows).

In [8]:
# split_frame returns one frame per partition: here, train and test
train, test = hf.split_frame(ratios=[0.8], seed=1)
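
H2O documents `split_frame` as a probabilistic split, so the resulting frames are close to, but not exactly, the requested ratio. The sketch below illustrates that idea in plain Python. It is not H2O's implementation, and `probabilistic_split` is a hypothetical helper introduced only for illustration.

```python
import random


def probabilistic_split(rows, ratio=0.8, seed=1):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test


rows = list(range(10_000))
train_rows, test_rows = probabilistic_split(rows)

# The sizes land near 8000 / 2000 but are not guaranteed to match exactly.
print(len(train_rows), len(test_rows))
```

Because each row is assigned independently, the split needs no global shuffle, which suits large distributed frames. The trade-off is that the partition sizes vary slightly around the requested ratio.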

### The following is all the code needed to find the best model:

**The quality of H2OAutoML's result generally depends on how much time it is given: a longer run lets it train and tune more models.**

In [9]:
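# stop the search after 10 minutes; the seed and project name are fixed for repeatability
aml = H2OAutoML(max_runtime_secs=600, seed=1, project_name="H2O_finance")
# models are ranked on `test` because it is passed as the leaderboard frame
aml.train(y=y, training_frame=train, leaderboard_frame=test)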

AutoML progress: |████████████████████████████████████████████████████████| 100%


#### H2O leaderboards
H2O also provides a leaderboard listing every model and hyperparameter combination it tried. For a regression problem such as this one, the leaderboard is sorted by the 'mean_residual_deviance' metric by default.

In [10]:
aml.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
StackedEnsemble_BestOfFamily_AutoML_20181108_174137,0.0677916,0.260368,0.0677916,0.205771,0.0209936
XGBoost_grid_1_AutoML_20181108_174137_model_4,0.0691001,0.262869,0.0691001,0.200239,0.0211429
GLM_grid_1_AutoML_20181108_174137_model_1,0.0709324,0.266331,0.0709324,0.212254,0.0215293
XGBoost_grid_1_AutoML_20181108_174137_model_1,0.0714904,0.267377,0.0714904,0.2117,0.0214332
XGBoost_grid_1_AutoML_20181108_174137_model_7,0.0719084,0.268157,0.0719084,0.214267,0.0215732
XGBoost_1_AutoML_20181108_174137,0.0729299,0.270055,0.0729299,0.214307,0.0216027
GBM_grid_1_AutoML_20181108_174137_model_2,0.0744171,0.272795,0.0744171,0.216698,0.0219851
XGBoost_2_AutoML_20181108_174137,0.0748907,0.273662,0.0748907,0.210917,0.0218977
XGBoost_grid_1_AutoML_20181108_174137_model_5,0.0758895,0.275481,0.0758895,0.212,0.0217845
XGBoost_grid_1_AutoML_20181108_174137_model_2,0.0759505,0.275591,0.0759505,0.216982,0.0218632
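
To make the leaderboard columns concrete, here is a minimal sketch of how MSE, RMSE, MAE and RMSLE are defined, written in plain Python. The five value pairs are the first test-set actuals and predictions printed at the end of this notebook. `regression_metrics` is a hypothetical helper, not an H2O function. The 'mean_residual_deviance' and 'mse' columns are identical in the table above, which is consistent with a squared-error (Gaussian) deviance.

```python
import math

# First five test-set actuals and predictions, copied from the notebook outputs.
y_true = [15.96, 12.69, 15.27, 7.9, 12.69]
y_pred = [15.66606088, 12.4416658, 15.18611102, 7.80178326, 12.54697415]


def regression_metrics(actual, predicted):
    """Compute common regression error metrics for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it requires values greater than -1.
    squared_log_errors = [
        (math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)
    ]
    rmsle = math.sqrt(sum(squared_log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}


metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

These figures come from only five rows, so they will not match the leaderboard values, which are computed on the whole leaderboard frame.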




**'leader' holds the best model of all those the pipeline tried, and 'model_performance()' reports the key metrics for a given model.**

In [11]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.06779160886708423
RMSE: 0.2603682178513427
MAE: 0.20577070926493538
RMSLE: 0.020993572913941036
R^2: 0.9963383687003915
Mean Residual Deviance: 0.06779160886708423
Null degrees of freedom: 1978
Residual degrees of freedom: 1972
Null deviance: 36704.07422846521
Residual deviance: 134.15959394795968
AIC: 306.0426589383233




We can predict by passing an H2OFrame to the leader. Here only the first row of the test set is scored.

In [12]:
pred = aml.leader.predict(test[0,:])
pred

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict
15.6661




In [13]:
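# Persist the leader with H2O's own serializer. Pickling the Python object would
# only store a handle to the model, which itself lives in the H2O cluster (JVM).
model_path = h2o.save_model(model=aml.leader, path='.', force=True)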

### MAPE (Mean Absolute Percentage Error)

In [14]:
def mean_absolute_percentage_error(y_true, y_pred):
    """Return the MAPE in percent; assumes y_true contains no zeros."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [15]:
y_test = test[y]

In [16]:
y_test_vals = y_test.as_data_frame().values.ravel()
y_test_pred_vals = aml.leader.predict(test).as_data_frame().values.ravel()

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [17]:
mean_absolute_percentage_error(y_test_vals, y_test_pred_vals)

1.7935132655388024
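
MAPE divides by the true value, so it is undefined whenever a target is zero. Interest rates in this dataset are positive, so the problem does not arise here. For other targets, a guarded variant could look like the sketch below. `safe_mape` is a hypothetical helper, and skipping zero-valued rows is only one of several possible policies.

```python
def safe_mape(y_true, y_pred):
    """MAPE in percent, skipping rows whose true value is zero."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every true value is zero")
    return 100 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# The first pair has a zero target and is ignored; the other two are each 10% off.
example = safe_mape([0.0, 10.0, 20.0], [1.0, 9.0, 22.0])
print(example)
```

Dropping rows changes which observations the metric covers, so it is worth reporting how many rows were skipped next to the score.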

In [18]:
y_test_pred_vals[0:5]

array([15.66606088, 12.4416658 , 15.18611102,  7.80178326, 12.54697415])

In [19]:
y_test_vals[0:5]

array([15.96, 12.69, 15.27,  7.9 , 12.69])