# Random Forest Predictive Model with H2O

## Step 1 : Import all Dependencies
### Import PySpark, create a SparkSession, and get the SparkContext

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("Transform&Load")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

### View the SparkContext

In [2]:
sc

### Create an H2O cluster inside the Spark cluster

In [3]:
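from pysparkling import H2OContext
import h2o

# Passing `spark` still works in this release but is deprecated (see the warning in the output)
hc = H2OContext.getOrCreate(spark)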

Method getOrCreate with spark argument is deprecated. Please use either just getOrCreate() or if you need to pass extra H2OConf, use getOrCreate(conf). The spark argument will be removed in release 3.32.


Connecting to H2O server at http://172.29.157.231:54325 ... successful.


H2O_cluster_uptime: 12 secs
H2O_cluster_timezone: America/Vancouver
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.30.0.4
H2O_cluster_version_age: 6 days
H2O_cluster_name: sparkling-water-jim_local-1591600245615
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 789 Mb
H2O_cluster_total_cores: 4
H2O_cluster_allowed_cores: 4



Sparkling Water Context:
 * Sparkling Water Version: 3.30.0.4-1-2.4
 * H2O name: sparkling-water-jim_local-1591600245615
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,172.29.157.231,54325)
  ------------------------

  Open H2O Flow in browser: http://172.29.157.231:54327 (CMD + click in Mac OSX)

    


## Step 2 : Get the Data
### Read the data stored in a MySQL database table into a Spark DataFrame


In [5]:
spark_df = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost/Insurance")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "Insurance_numeric")\
    .option("user", "jsully")\
    .option("password", "whatisreal1")\
    .load()
spark_df.select(spark_df.columns[:6]).show(5)

+-----------------------+---+-----------------+------+--------------------+---------------------+
|Customer Lifetime Value|Age|Effective To Date|Income|Monthly Premium Auto|Total Written Premium|
+-----------------------+---+-----------------+------+--------------------+---------------------+
|                9996.58| 44|       18-04-2019| 39433|                 105|                33435|
|                4009.22| 45|       12-04-2019|335099|                 127|                41317|
|                5805.17| 27|       18-08-2020|251960|                 247|                38908|
|                 5874.0| 35|       21-08-2020|198003|                 156|                22792|
|                7854.84| 38|       11-11-2018|410607|                  48|                33689|
+-----------------------+---+-----------------+------+--------------------+---------------------+
only showing top 5 rows
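
The read above embeds the database user and password directly in the notebook. A minimal sketch of one alternative is to take the credentials from environment variables and pass all options at once. The variable names `INSURANCE_DB_USER` and `INSURANCE_DB_PASSWORD` and the helper `jdbc_options` are hypothetical; the URL, driver, and table name are the ones used above.

```python
import os


def jdbc_options(env=os.environ):
    """Build the JDBC reader options, taking credentials from the environment."""
    return {
        "url": "jdbc:mysql://localhost/Insurance",
        "driver": "com.mysql.jdbc.Driver",
        "dbtable": "Insurance_numeric",
        # Hypothetical variable names; a KeyError is raised if either is unset.
        "user": env["INSURANCE_DB_USER"],
        "password": env["INSURANCE_DB_PASSWORD"],
    }


# Usage with the SparkSession created in Step 1:
# spark_df = spark.read.format("jdbc").options(**jdbc_options()).load()
```

Failing fast with a `KeyError` when a variable is missing is a deliberate choice here, so a misconfigured environment is noticed before Spark attempts the connection.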



### Convert the Spark DataFrame to an H2O frame and describe it

In [19]:
h2o_df = hc.asH2OFrame(spark_df)
print("The H2O frame has {} rows and {} columns".format(h2o_df.shape[0],h2o_df.shape[1]))
h2o_df.summary()

The H2O frame has 4856 rows and 33 columns


Unnamed: 0,Customer Lifetime Value,Age,Effective To Date,Income,Monthly Premium Auto,Total Written Premium,Losses,Loss Ratio,Growth Rate,Commissions,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Number of previous policies,CityIndex,Response Index,Coverage Index,Education Index,Employment_Status Index,Gender Index,Location_Code Index,Marital Status Index,Policy_Type Index,Policy_Rating Index,Renew_Offer_Type Index,Sales_Channel Index,Total Claim Amount Index,Feedback Index,Job Index,Company Index,Credit Card Provider Index,Churn
type,real,int,int,int,int,int,int,real,real,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int
mins,1000.92,25.0,2018.0,10102.0,25.0,5002.0,2.0,0.0,-9.998,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,5496.602436161445,37.535420098846785,2019.1630971993413,256656.37953047772,136.29715815485974,27560.837314662313,4992.529654036231,0.5052771828665551,0.035157742998353135,5031.046128500818,24.100288303130135,47.92668863261946,0.9845551894563435,4.462726523887973,10.429777594728206,3.8935337726523893,0.48331960461285006,0.984349258649094,2.4526359143327903,1.4810543657331168,0.4995881383855025,0.9899093904448104,1.4754942339373938,1.4717874794069226,1.4392504118616143,1.483113673805604,1.9754942339373942,2427.4999999999955,2.916803953871498,173.94975288303112,2021.185131795717,3.6396210873146586,0.5047364085667215
maxs,9998.96,50.0,2020.0,499861.0,250.0,49990.0,9999.0,1.0,9.995,9998.0,48.0,96.0,2.0,9.0,18.0,8.0,1.0,2.0,5.0,3.0,1.0,2.0,3.0,3.0,3.0,3.0,4.0,4855.0,6.0,519.0,4420.0,9.0,1.0
sigma,2588.9688145026144,7.436430321773121,0.7716430259183207,141502.9612782824,64.68559499575768,12928.541105040944,2896.6062673621154,0.2881419338264036,5.766021378651545,2880.622871183775,14.136018370278661,27.82175833959348,0.817653022581466,2.876061698947571,4.636621656235225,2.583954168975499,0.49977314894511604,0.8175231417287632,1.7085901741410319,1.1207487016475102,0.5000513210070957,0.8184919392456173,1.1147437914906977,1.115025631796455,1.1155744984934035,1.117837337602026,1.4139283470220758,1401.950783729584,1.993418046783163,149.13691947228054,1358.7756866175719,2.959713603438183,0.5000290542748945
zeros,0,0,0,0,0,0,0,2,0,1,92,53,1661,512,0,575,2509,1661,853,1252,2430,1651,1243,1246,1289,1235,993,1,728,179,11,872,2405
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,9996.58,44.0,2019.0,39433.0,105.0,33435.0,5531.0,0.216,0.453,4586.0,19.0,50.0,2.0,7.0,14.0,0.0,0.0,2.0,0.0,2.0,1.0,2.0,1.0,1.0,1.0,3.0,2.0,3676.0,6.0,246.0,3756.0,1.0,1.0
1,4009.22,45.0,2019.0,335099.0,127.0,41317.0,6774.0,0.263,2.07,3007.0,1.0,90.0,2.0,5.0,6.0,5.0,0.0,1.0,3.0,3.0,1.0,2.0,3.0,0.0,0.0,1.0,1.0,2150.0,5.0,6.0,1428.0,3.0,1.0
2,5805.17,27.0,2020.0,251960.0,247.0,38908.0,9089.0,0.936,6.012,1901.0,20.0,80.0,0.0,2.0,12.0,1.0,1.0,2.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,3623.0,1.0,51.0,4088.0,7.0,1.0


## Step 3 : Prepare the Dataset and feed it to a model
### Define the input (predictor) columns and the target column


In [23]:
# Predictor columns used to train the model: every column except the last one, Churn
training_columns = h2o_df.columns[:-1]
# Target column that the model learns to predict
response_column = 'Churn'
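
The frame summary lists `Churn` as type `int` with values between 0 and 1. H2O's random forest fits a regression model when the response column is numeric and a classifier when it is categorical. If a churn classifier is the goal, one option is to convert the response with `asfactor()` before splitting. The helper name below is hypothetical, and this is a sketch, not the approach that produced the outputs recorded in this notebook.

```python
def with_categorical_response(frame, response):
    """Return the frame with its response column converted to a factor (enum).

    `frame` is expected to be an H2OFrame; asfactor() is the H2OFrame method
    that turns a numeric column into a categorical one.
    """
    frame[response] = frame[response].asfactor()
    return frame


# Usage (this would change the reported metrics from regression metrics
# to classification metrics):
# h2o_df = with_categorical_response(h2o_df, response_column)
```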

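### Split the data into training and test sets
The test set is held out from training so that you can check how well the model generalizes to unseen rows. It does not prevent overfitting by itself, but a large gap between training error and test error is a sign that the model has overfit the training data.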

In [24]:
# Split the data into roughly 80% training rows and 20% test rows
train, test = h2o_df.split_frame(ratios=[0.8])
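
`split_frame` assigns rows at random, so the row counts are only approximately 80/20 and the split differs between runs unless a seed is supplied. A minimal sketch of a repeatable split follows; the helper name and the seed value 42 are arbitrary assumptions.

```python
def split_train_test(frame, train_ratio=0.8, seed=42):
    """Split an H2OFrame into (train, test) with a fixed seed for repeatability."""
    train, test = frame.split_frame(ratios=[train_ratio], seed=seed)
    return train, test


# Usage:
# train, test = split_train_test(h2o_df)
```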

### Instantiate the Random Forest model with the required parameters, then train it

In [26]:
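from h2o.estimators import H2ORandomForestEstimator

# Define the model: 50 trees, maximum depth 20, 10-fold cross-validation
# (h2o.init() is not called here; the H2O cluster was started by H2OContext in Step 1)
model = H2ORandomForestEstimator(ntrees=50, max_depth=20, nfolds=10)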

# Train model
model.train(x=training_columns, y=response_column, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


### Check the model's performance

In [27]:
performance = model.model_performance(test_data=test)
print(performance)


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.2601156888147461
RMSE: 0.5100153809589923
MAE: 0.5025258930220345
RMSLE: 0.3582082739015907
Mean Residual Deviance: 0.2601156888147461
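
One way to read these numbers is to compare them with a trivial baseline. For a 0/1 target with mean p, always predicting p gives an MSE of p(1 - p). The sketch below uses the `Churn` mean from the frame summary (computed over the whole frame, not only the test split, so the comparison is approximate) and the test MSE printed above.

```python
churn_mean = 0.5047364085667215   # mean of Churn from h2o_df.summary()
test_mse = 0.2601156888147461     # MSE reported on the test data

# MSE of a constant predictor that always outputs the mean of a 0/1 target
baseline_mse = churn_mean * (1 - churn_mean)

# Approximate R^2: 1 - (model error / baseline error)
approx_r2 = 1 - test_mse / baseline_mse

print(f"baseline MSE: {baseline_mse:.4f}")
print(f"approximate R^2: {approx_r2:.4f}")
```

A test MSE at or above the baseline corresponds to an R^2 at or below zero, meaning the model does no better than the constant predictor on this data.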



### Save the model to the local file system

In [29]:
h2o.save_model(model=model, path="RF_Insurance_model", force=True)

'/home/jim/Documents/sparkling-water-3.30.0.4-1-2.4/RF_Insurance_model/DRF_model_python_1591600273347_1'

### Show the model details

In [39]:
model

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1591600273347_1


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 50.0 | 50.0 | 375810.0 | 18.0 | 20.0 | 19.86 | 538.0 | 629.0 | 594.06 |




ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.2639644080003918
RMSE: 0.5137746665615109
MAE: 0.49683306728584103
RMSLE: 0.3594929987528265
Mean Residual Deviance: 0.2639644080003918

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.2540269320286576
RMSE: 0.5040108451498416
MAE: 0.4963127403375496
RMSLE: 0.35363843779933546
Mean Residual Deviance: 0.2540269320286576

Cross-Validation Metrics Summary: 


| metric | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | cv_6_valid | cv_7_valid | cv_8_valid | cv_9_valid | cv_10_valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mae | 0.4963061 | 0.0044579403 | 0.49884567 | 0.49887803 | 0.48952442 | 0.4946703 | 0.4945817 | 0.4956139 | 0.5041945 | 0.4912182 | 0.49450418 | 0.5010299 |
| mean_residual_deviance | 0.2540221 | 0.0044468003 | 0.25660747 | 0.25628483 | 0.24759479 | 0.25241446 | 0.25190705 | 0.25307295 | 0.26184928 | 0.24892348 | 0.25236067 | 0.2592059 |
| mse | 0.2540221 | 0.0044468003 | 0.25660747 | 0.25628483 | 0.24759479 | 0.25241446 | 0.25190705 | 0.25307295 | 0.26184928 | 0.24892348 | 0.25236067 | 0.2592059 |
| r2 | -0.020484393 | 0.017332442 | -0.026490971 | -0.03650415 | 0.00025801553 | -0.009878392 | -0.009735626 | -0.030263606 | -0.047829106 | 0.002726303 | -0.01030276 | -0.036823645 |
| residual_deviance | 0.2540221 | 0.0044468003 | 0.25660747 | 0.25628483 | 0.24759479 | 0.25241446 | 0.25190705 | 0.25307295 | 0.26184928 | 0.24892348 | 0.25236067 | 0.2592059 |
| rmse | 0.5039887 | 0.004406204 | 0.5065644 | 0.5062458 | 0.49758896 | 0.5024087 | 0.5019034 | 0.50306356 | 0.51171213 | 0.49892232 | 0.5023551 | 0.50912267 |
| rmsle | 0.35355663 | 0.0060714753 | 0.35622177 | 0.36238262 | 0.34413302 | 0.352373 | 0.3567748 | 0.34493846 | 0.35786304 | 0.34783491 | 0.35528642 | 0.3577583 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2020-06-08 00:44:03 | 12.717 sec | 0.0 | | | |
| 2020-06-08 00:44:03 | 12.777 sec | 1.0 | 0.672536 | 0.476249 | 0.452305 |
| 2020-06-08 00:44:03 | 12.811 sec | 2.0 | 0.670496 | 0.489213 | 0.449564 |
| 2020-06-08 00:44:03 | 12.839 sec | 3.0 | 0.654237 | 0.488699 | 0.428026 |
| 2020-06-08 00:44:03 | 12.860 sec | 4.0 | 0.636324 | 0.485949 | 0.404908 |
| 2020-06-08 00:44:03 | 12.881 sec | 5.0 | 0.62909 | 0.49519 | 0.395754 |
| 2020-06-08 00:44:03 | 12.901 sec | 6.0 | 0.611925 | 0.493218 | 0.374452 |
| 2020-06-08 00:44:03 | 12.921 sec | 7.0 | 0.601446 | 0.4956 | 0.361738 |
| 2020-06-08 00:44:03 | 12.942 sec | 8.0 | 0.592244 | 0.495808 | 0.350753 |
| 2020-06-08 00:44:03 | 12.965 sec | 9.0 | 0.583014 | 0.496234 | 0.339905 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Total Written Premium | 1472.453613 | 1.0 | 0.048682 |
| Total Claim Amount Index | 1440.99292 | 0.978634 | 0.047642 |
| Losses | 1439.602417 | 0.977689 | 0.047596 |
| Commissions | 1420.929321 | 0.965008 | 0.046978 |
| Monthly Premium Auto | 1414.790771 | 0.960839 | 0.046775 |
| Income | 1414.226562 | 0.960456 | 0.046757 |
| Customer Lifetime Value | 1369.011841 | 0.929749 | 0.045262 |
| Growth Rate | 1358.49646 | 0.922607 | 0.044914 |
| Company Index | 1357.218872 | 0.92174 | 0.044872 |
| Loss Ratio | 1351.888062 | 0.918119 | 0.044696 |



See the whole table with table.as_data_frame()




## Step 4 : Use the model to perform predictions on the Test Set
### First let us check the performance of the model using the test set

In [51]:
# Predict on the test set using the random forest (DRF) model
predictions = model.predict(test)
predictions.show()

drf prediction progress: |████████████████████████████████████████████████| 100%


predict
0.64
0.553864
0.54
0.453333
0.58
0.48
0.3832
0.52
0.511351
0.523771
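
The `predict` column holds continuous scores, not 0/1 labels, because the model was trained as a regression on a numeric `Churn` column. If hard labels are needed, one simple option is to threshold the scores. The sketch below applies this to the ten values shown above; the cutoff of 0.5 is an assumption, not something taken from this notebook.

```python
# The ten predictions displayed by predictions.show()
scores = [0.64, 0.553864, 0.54, 0.453333, 0.58,
          0.48, 0.3832, 0.52, 0.511351, 0.523771]

threshold = 0.5  # assumed cutoff; tune it for the costs of each error type

# Label a customer as churn (1) when the score reaches the threshold
labels = [1 if s >= threshold else 0 for s in scores]

print(labels)  # [1, 1, 1, 0, 1, 0, 0, 1, 1, 1]
```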


In [52]:
model.model_performance(test)


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.2601156888147461
RMSE: 0.5100153809589923
MAE: 0.5025258930220345
RMSLE: 0.3582082739015907
Mean Residual Deviance: 0.2601156888147461


