**Step 1: Business case**
Three datasets are provided for the project.

Claim Data (Dataset 1) – all claims found in a Loss History Report (LHR), at driver level. Only households with claims appear in this dataset.

Predictor Dataset (Dataset 2) – all information we have from this household’s insurance application (Application date of January 1, 2017)

Subsequent Loss Experience (Dataset 3) – one year of subsequent loss-experience of these applicants (All information gathered after application date)

Metadata – Names and descriptions of the variables contained in each of the datasets listed above.

Objective: Build a model that identifies whether an applicant had a future loss (future_clm_ind), using only information known on or before the application date.


---

# Comparison with PyCaret linear regression model
# Note: Data processing and EDA were completed in the PyCaret workflow.
# This work uses the final dataset to build an AutoML model with the H2O package.

In [None]:
# import numpy and pandas
import numpy as np
import pandas as pd


In [None]:
# load the final processed claim data file
df_claim=pd.read_csv('/content/_Final input datafile for claim prediction.csv')
df_claim.shape

(20000, 30)

In [None]:
# check for null values and counts
df_claim.isnull().sum()

**What is H2O?**

H2O is Java-based software for data modeling and general computing. It does many things, but its primary purpose is to serve as a distributed (many machines), parallel (many CPUs), in-memory (several hundred GB of JVM heap, set via Xmx) processing engine.

In [None]:
# Install the H2O package, which provides AutoML along with other machine learning and deep learning algorithms
!pip install h2o

**Starting H2O and Inspecting the Cluster**

There are many tools for directly interacting with user-visible objects in the H2O cluster. Every new Python session begins by initializing a connection between the Python client and the H2O cluster. Note that h2o.init() accepts a number of arguments, which are described in the h2o.init section of the H2O documentation.

In [None]:
# Import h2o and its AutoML class into the current session
import h2o
from h2o.automl import H2OAutoML

In [None]:
# Initialize H2O from the Colab Python environment: h2o.init() tries to connect to a local server and, if that fails, starts a new server and connects to it.
h2o.init()

In [None]:
# Obtain a high-level summary of the cluster status
# (h2o.cluster_info() is the older, deprecated spelling of this call)
h2o.cluster().show_status()

In [None]:
# Convert the pandas DataFrame to an H2OFrame, the primary data store for H2O
hf_claim=h2o.H2OFrame(df_claim)

Parse progress: |█████████████████████████████████████████████████████████| 100%


An H2OFrame is similar to pandas' DataFrame or R's data.frame. One critical distinction is that the
data is generally not held in the Python client's memory; instead it lives on a (possibly remote) H2O cluster,
so an H2OFrame is merely a handle to that data.

In [None]:
# Display the H2OFrame to check that it loaded correctly
hf_claim

In [None]:
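# Drop the leftover index column ('Unnamed: 0', at position 0).
# drop() returns a new frame rather than modifying hf_claim in place, so assign the result back.
hf_claim=hf_claim.drop(0,axis=1)
hf_claim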

drvr_cnt,min_age,max_age,min_mon_lic,max_mon_lic,cnt_yth,cnt_female,cnt_male,cnt_married,cnt_single,cnt_mtrcyc,cnt_majr_viol,cnt_minr_viol,cnt_lic_susp,time_w_carr,inforce_ind,fire_ind,homeowner_ind,veh_lease_cnt,veh_own_cnt,monthly_pay_ind,veh_w_comp_cnt,veh_w_ers_cnt,curnt_bi_upp,credit_score,premium,atf_claim_no,Not_atf_claim_no,future_clm_ind
2,43.79,51.37,333.48,424.39,0,2,0,1,1,0,0,0,0,3.0,1,0,0,0,0,0,1,1,100,825.95,133.6,1,1,0
1,35.64,35.64,235.74,235.74,0,0,1,0,1,0,0,0,0,1.5,1,1,1,0,0,0,1,0,25,684.65,145.8,0,0,0
1,16.0,16.0,0.0,0.0,1,1,0,0,1,0,0,2,0,5.0,1,0,1,0,0,0,1,1,100,596.32,167.0,0,0,0
1,17.88,17.88,22.59,22.59,1,1,0,0,1,0,0,0,0,0.0,1,1,0,0,0,1,1,1,100,636.76,150.3,0,0,0
1,16.0,16.0,0.0,0.0,1,0,1,0,1,0,0,0,0,2.5,1,0,0,0,0,0,1,1,50,669.57,117.0,0,0,0
2,35.39,49.77,232.67,405.2,0,0,2,2,0,0,0,0,0,3.16944,0,0,1,0,0,0,1,1,25,789.85,133.6,0,0,0
1,35.87,35.87,238.47,238.47,0,1,0,0,1,0,0,0,0,5.0,1,1,0,0,0,0,1,1,25,563.68,150.3,0,0,0
1,46.41,46.41,364.9,364.9,0,1,0,0,1,0,0,0,0,3.16944,0,1,0,0,1,0,0,1,25,614.32,94.5,0,0,0
1,35.75,35.75,237.02,237.02,0,0,1,0,1,0,0,0,0,3.0,1,0,1,0,0,1,1,1,25,580.0,167.0,0,0,0
1,51.63,51.63,427.53,427.53,0,1,0,0,1,0,0,0,0,3.16944,0,1,0,0,0,1,1,1,25,724.11,120.24,0,0,0




In [None]:
# Split the data into train (about 70%) and test (about 30%) sets
splits=hf_claim.split_frame(ratios=[0.7])
# Assign the first split to train
train=splits[0]
# Assign the second split to test
test=splits[1]
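
H2O's documentation notes that split_frame does not produce an exact split; it assigns rows probabilistically, which is likely why the test-set metrics further down report 6,025 rows rather than exactly 6,000. The sketch below illustrates that idea in plain Python. It is not H2O's implementation, and the function name `probabilistic_split` and the seed value are made up for the illustration.

```python
import random

def probabilistic_split(n_rows, train_ratio, seed):
    """Assign each row index to train or test by an independent uniform draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # Each row lands in train with probability train_ratio.
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

# 20,000 rows split roughly 70/30; the sizes are close to, but usually not exactly, 14000/6000.
train_idx, test_idx = probabilistic_split(20000, 0.7, seed=42)
print(len(train_idx), len(test_idx))
```

Because the assignment is random, the split changes from run to run unless it is seeded. split_frame accepts a seed argument for the same purpose, for example split_frame(ratios=[0.7], seed=42).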

In [None]:
# For binary classification, the response should be a factor (categorical)
y='future_clm_ind' # 0 = no claim, 1 = claim filed
train[y]=train[y].asfactor()
test[y]=test[y].asfactor()

H2OAutoML begins an AutoML task, a background task that automatically builds a number of models
with various algorithms and tracks their performance in a leaderboard. At any point
in the process you may use H2O's performance or prediction functions on the resulting
models.

In [None]:
# Run AutoML over H2O's machine learning and deep learning algorithms, limiting run time to 60 seconds
aml=H2OAutoML(max_runtime_secs=60)
# Train on the training frame only; the test frame is held out for evaluation later
aml.train(y='future_clm_ind',training_frame=train) # x is omitted because it defaults to all columns except the response

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [None]:
# Print the leaderboard (ranked by cross-validation metrics)
aml.leaderboard.head(10)

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
StackedEnsemble_AllModels_AutoML_20210507_182524,0.946386,0.101517,0.307577,0.11144,0.18177,0.0330404
StackedEnsemble_BestOfFamily_AutoML_20210507_182524,0.945977,0.101509,0.307621,0.0979581,0.181791,0.033048
XGBoost_grid__1_AutoML_20210507_182524_model_1,0.945058,0.102323,0.30248,0.0897754,0.183601,0.0337095
GBM_5_AutoML_20210507_182524,0.939386,0.144264,0.300353,0.128985,0.194338,0.0377674
GBM_2_AutoML_20210507_182524,0.934979,0.14535,0.287019,0.134134,0.194673,0.0378976
XGBoost_grid__1_AutoML_20210507_182524_model_2,0.932109,0.165548,0.302961,0.112863,0.197061,0.0388331
GBM_3_AutoML_20210507_182524,0.93128,0.139501,0.297915,0.116102,0.192244,0.0369578
GBM_1_AutoML_20210507_182524,0.928816,0.149835,0.298747,0.098107,0.196257,0.038517
XGBoost_1_AutoML_20210507_182524,0.928694,0.26006,0.300926,0.162821,0.244685,0.0598707
GBM_4_AutoML_20210507_182524,0.917499,0.149288,0.294419,0.139265,0.195448,0.0381997




In [None]:
# Evaluate the leader model's performance on the held-out test set
perf=aml.leader.model_performance(test)
perf


ModelMetricsBinomialGLM: stackedensemble
** Reported on test data. **

MSE: 0.029687999514403807
RMSE: 0.1723020589383766
LogLoss: 0.09054153584696435
Null degrees of freedom: 6024
Residual degrees of freedom: 6020
Null deviance: 2069.568466538201
Residual deviance: 1091.02550695592
AIC: 1101.02550695592
AUC: 0.955341188919352
AUCPR: 0.33380327757727724
Gini: 0.9106823778387041
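
Several of the numbers reported above are tied together by simple identities, which gives a quick sanity check on the output. The sketch below is plain Python with the values copied from the report. It assumes the number of fitted parameters equals the difference in degrees of freedom plus one for the intercept; that is an inference from the output, not something the report states.

```python
# Values copied from the test-set report above.
auc = 0.955341188919352
gini = 0.9106823778387041
logloss = 0.09054153584696435
residual_deviance = 1091.02550695592
aic = 1101.02550695592
null_dof = 6024
residual_dof = 6020

n_rows = null_dof + 1                    # the null model fits only an intercept
n_params = null_dof - residual_dof + 1   # assumption: 4 coefficients plus an intercept

# Gini is AUC rescaled so that a random ranking scores 0 and a perfect one scores 1.
gini_from_auc = 2 * auc - 1
# For a binomial model, residual deviance = 2 * N * logloss.
logloss_from_deviance = residual_deviance / (2 * n_rows)
# AIC = residual deviance + 2 * (number of fitted parameters).
aic_from_deviance = residual_deviance + 2 * n_params

print(n_rows, n_params)
print(gini_from_auc, logloss_from_deviance, aic_from_deviance)
```

All three recomputed values agree with the reported ones to well within rounding, and the implied row count matches the 6,025 test rows in the confusion matrix total.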

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.12376689263456368: 


Act/Pred,0,1,Error,Rate
0,5315.0,462.0,0.08,(462.0/5777.0)
1,21.0,227.0,0.0847,(21.0/248.0)
Total,5336.0,689.0,0.0802,(483.0/6025.0)
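
The confusion matrix can be turned into precision, recall, and F1 by hand, which makes the trade-off at this threshold easier to read. The sketch below uses only the four counts from the matrix; the variable names are chosen for the illustration.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 5315, 462   # actual 0: predicted 0, predicted 1
fn, tp = 21, 227     # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)            # share of predicted claims that were real claims
recall = tp / (tp + fn)               # share of real claims that were flagged
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of non-claims flagged by mistake

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.6f} fpr={false_positive_rate:.4f}")
```

At the max-F1 threshold the model catches roughly 92% of future claims, but only about a third of the applicants it flags go on to file one. The F1 value computed this way agrees with the max f1 of 0.484525 that H2O reports.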



Maximum Metrics: Maximum metrics at their respective thresholds


metric,threshold,value,idx
max f1,0.123767,0.484525,241.0
max f2,0.091289,0.67591,263.0
max f0point5,0.149739,0.382436,225.0
max accuracy,0.636094,0.959004,2.0
max precision,0.636094,0.666667,2.0
max recall,0.02482,1.0,318.0
max specificity,0.692115,0.999827,0.0
max absolute_mcc,0.123767,0.521469,241.0
max min_per_class_accuracy,0.11091,0.916046,250.0
max mean_per_class_accuracy,0.043511,0.94248,298.0
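
With so few positives in the test set, the max accuracy figure is less impressive than it looks. A quick check, using the class counts from the confusion matrix, compares it with the accuracy of a trivial model that always predicts "no claim". This is an illustrative calculation, not part of the H2O output.

```python
# Class counts from the test-set confusion matrix.
n_negative = 5777   # actual class 0 (no future claim)
n_positive = 248    # actual class 1 (future claim)
n_total = n_negative + n_positive

# Accuracy of always predicting the majority class.
baseline_accuracy = n_negative / n_total

# "max accuracy" as reported by H2O, converted back to a count of correct predictions.
max_accuracy = 0.959004
extra_correct = round(max_accuracy * n_total) - n_negative

print(f"baseline={baseline_accuracy:.6f} max={max_accuracy:.6f} extra_correct={extra_correct}")
```

The best achievable accuracy corresponds to a single additional correct prediction over the majority-class baseline, which is why AUC, F1, and the lift table are more informative for this problem.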



Gains/Lift Table: Avg response rate:  4.12 %, avg score:  4.09 %


group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010124,0.448607,9.160167,9.160167,0.377049,0.507211,0.377049,0.507211,0.092742,0.092742,816.016658,816.016658,0.086164
2,0.020083,0.405939,6.478495,7.830412,0.266667,0.425796,0.322314,0.46684,0.064516,0.157258,547.849462,683.041189,0.143064
3,0.030041,0.375177,7.288306,7.650708,0.3,0.389453,0.314917,0.441187,0.072581,0.229839,628.830645,665.070843,0.208374
4,0.04,0.345802,8.90793,7.96371,0.366667,0.361109,0.327801,0.42125,0.08871,0.318548,790.793011,696.370968,0.290506
5,0.050124,0.3245,7.965362,7.964043,0.327869,0.336622,0.327815,0.404157,0.080645,0.399194,696.536224,696.404347,0.364054
6,0.100083,0.173811,8.071214,8.01754,0.332226,0.25723,0.330017,0.330815,0.403226,0.802419,707.121423,701.753999,0.732487
7,0.150041,0.035065,3.874183,6.637949,0.159468,0.092115,0.27323,0.251336,0.193548,0.995968,287.418283,563.794872,0.882241
8,0.2,0.012233,0.080712,5.0,0.003322,0.019489,0.205809,0.193422,0.004032,1.0,-91.928786,400.0,0.834343
9,0.300083,0.006248,0.0,3.332412,0.0,0.008592,0.137168,0.131778,0.0,1.0,-100.0,233.24115,0.729964
10,0.4,0.00353,0.0,2.5,0.0,0.004767,0.102905,0.100052,0.0,1.0,-100.0,150.0,0.625757
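
The first row of the gains/lift table can be reproduced from a few counts, which clarifies what lift, capture rate, and gain mean. The sketch below is a hand calculation; the group size and the number of claims in the group are not printed in the table and are backed out here from the reported fractions, so treat them as derived assumptions.

```python
n_total, n_positive = 6025, 248
base_rate = n_positive / n_total   # about 4.12%, the "Avg response rate"

# First group: the top ~1% of applicants ranked by predicted score.
group_fraction = 0.010124
response_rate = 0.377049
group_size = round(group_fraction * n_total)           # derived, not reported
group_positives = round(response_rate * group_size)    # derived, not reported

# Lift: how much denser in claims this group is than the test set overall.
lift = (group_positives / group_size) / base_rate
# Capture rate: share of all claims that fall inside this group.
capture_rate = group_positives / n_positive
# Gain: lift expressed as a percentage improvement over random selection.
gain = 100 * (lift - 1)

print(group_size, group_positives)
print(f"lift={lift:.6f} capture_rate={capture_rate:.6f} gain={gain:.4f}")
```

Read this way, the top 1% of scored applicants contains about nine times as many future claims as a random 1% would, and the cumulative capture rate column shows that almost all claims sit in the top 15% of scores.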





