## Flood Prediction in Malawi using H2O

## What is H2O?

H2O is an open-source platform for machine learning and deep learning. It is easy to use, scales to large datasets, and is well documented.
The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with a built-in web interface, Flow, that makes it easier for non-engineers to stitch together complete analytic workflows.

H2O offers an R package that can be installed from CRAN and a Python package that can be installed from PyPI.

You can also download H2O directly from http://h2o.ai/download.

## About the Malawi Flood Prediction Challenge

The Malawi Flood Prediction Competition was hosted by Zindi from 2 December 2019 to 18 May 2020. It was organised in response to Tropical Cyclone Idai, which affected millions of people in Malawi, Mozambique and Zimbabwe.

The objective of the competition was to build a machine learning model that helps predict the location and extent of floods in southern Malawi. The challenge can be found at https://zindi.africa/competitions/2030-vision-flood-prediction-in-malawi

### 1. Load data

In [1]:
# Import relevant libraries 
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
import warnings
warnings.filterwarnings('ignore')
seed = 44  # fixed seed so that results are reproducible

In [2]:
# Connect to a local H2O cluster (a new one is started if none is running).
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
  Starting server from C:\Users\Leo\anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\Leo\AppData\Local\Temp\tmpd8axcoad
  JVM stdout: C:\Users\Leo\AppData\Local\Temp\tmpd8axcoad\h2o_Leo_started_from_python.out
  JVM stderr: C:\Users\Leo\AppData\Local\Temp\tmpd8axcoad\h2o_Leo_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,06 secs
H2O_cluster_timezone:,Africa/Harare
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.7
H2O_cluster_version_age:,1 month and 22 days
H2O_cluster_name:,H2O_from_python_Leo_d23p2q
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,1.975 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


In [3]:
# importing data
data = h2o.import_file("Malawi_Floods.csv") #imports data into an h2o cluster
submission_file = h2o.import_file("SampleSubmission.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


### 2. EDA and Preprocessing

In [4]:
print(data.shape)
data.head(3)

(16466, 40)


X,Y,target_2015,elevation,precip 2014-11-16 - 2014-11-23,precip 2014-11-23 - 2014-11-30,precip 2014-11-30 - 2014-12-07,precip 2014-12-07 - 2014-12-14,precip 2014-12-14 - 2014-12-21,precip 2014-12-21 - 2014-12-28,precip 2014-12-28 - 2015-01-04,precip 2015-01-04 - 2015-01-11,precip 2015-01-11 - 2015-01-18,precip 2015-01-18 - 2015-01-25,precip 2015-01-25 - 2015-02-01,precip 2015-02-01 - 2015-02-08,precip 2015-02-08 - 2015-02-15,precip 2015-02-15 - 2015-02-22,precip 2015-02-22 - 2015-03-01,precip 2015-03-01 - 2015-03-08,precip 2015-03-08 - 2015-03-15,precip 2019-01-20 - 2019-01-27,precip 2019-01-27 - 2019-02-03,precip 2019-02-03 - 2019-02-10,precip 2019-02-10 - 2019-02-17,precip 2019-02-17 - 2019-02-24,precip 2019-02-24 - 2019-03-03,precip 2019-03-03 - 2019-03-10,precip 2019-03-10 - 2019-03-17,precip 2019-03-17 - 2019-03-24,precip 2019-03-24 - 2019-03-31,precip 2019-03-31 - 2019-04-07,precip 2019-04-07 - 2019-04-14,precip 2019-04-14 - 2019-04-21,precip 2019-04-21 - 2019-04-28,precip 2019-04-28 - 2019-05-05,precip 2019-05-05 - 2019-05-12,precip 2019-05-12 - 2019-05-19,LC_Type1_mode,Square_ID
34.26,-15.91,0,887.764,0,0,0,14.844,14.5528,12.2378,57.4514,30.127,30.4495,1.52183,29.39,32.8783,8.1798,0.963981,16.6591,3.30447,0,12.9926,4.58286,35.0375,4.79601,28.0833,0,58.3625,18.2647,17.5375,0.896323,1.68,0,0,0,0,0,0,9,4E3C3896-14CE-11EA-BCE5-F49634744A41
34.26,-15.9,0,743.404,0,0,0,14.844,14.5528,12.2378,57.4514,30.127,30.4495,1.52183,29.39,32.8783,8.1798,0.963981,16.6591,3.30447,0,12.9926,4.58286,35.0375,4.79601,28.0833,0,58.3625,18.2647,17.5375,0.896323,1.68,0,0,0,0,0,0,9,4E3C3897-14CE-11EA-BCE5-F49634744A41
34.26,-15.89,0,565.728,0,0,0,14.844,14.5528,12.2378,57.4514,30.127,30.4495,1.52183,29.39,32.8783,8.1798,0.963981,16.6591,3.30447,0,12.9926,4.58286,35.0375,4.79601,28.0833,0,58.3625,18.2647,17.5375,0.896323,1.68,0,0,0,0,0,0,9,4E3C3898-14CE-11EA-BCE5-F49634744A41




In [5]:
data.columns

['X',
 'Y',
 'target_2015',
 'elevation',
 'precip 2014-11-16 - 2014-11-23',
 'precip 2014-11-23 - 2014-11-30',
 'precip 2014-11-30 - 2014-12-07',
 'precip 2014-12-07 - 2014-12-14',
 'precip 2014-12-14 - 2014-12-21',
 'precip 2014-12-21 - 2014-12-28',
 'precip 2014-12-28 - 2015-01-04',
 'precip 2015-01-04 - 2015-01-11',
 'precip 2015-01-11 - 2015-01-18',
 'precip 2015-01-18 - 2015-01-25',
 'precip 2015-01-25 - 2015-02-01',
 'precip 2015-02-01 - 2015-02-08',
 'precip 2015-02-08 - 2015-02-15',
 'precip 2015-02-15 - 2015-02-22',
 'precip 2015-02-22 - 2015-03-01',
 'precip 2015-03-01 - 2015-03-08',
 'precip 2015-03-08 - 2015-03-15',
 'precip 2019-01-20 - 2019-01-27',
 'precip 2019-01-27 - 2019-02-03',
 'precip 2019-02-03 - 2019-02-10',
 'precip 2019-02-10 - 2019-02-17',
 'precip 2019-02-17 - 2019-02-24',
 'precip 2019-02-24 - 2019-03-03',
 'precip 2019-03-03 - 2019-03-10',
 'precip 2019-03-10 - 2019-03-17',
 'precip 2019-03-17 - 2019-03-24',
 'precip 2019-03-24 - 2019-03-31',
 'precip 2019

In [6]:
data.columns = ["X", "Y","Target", "Elevation", "Week1_2015", "Week2_2015", "Week3_2015", "Week4_2015", "Week5_2015",
                "Week6_2015", "Week7_2015", "Week8_2015", "Week9_2015", "Week10_2015", "Week11_2015", "Week12_2015", 
                "Week13_2015", "Week14_2015", "Week15_2015", "Week16_2015", "Week17_2015","Week1_2019", "Week2_2019",
                "Week3_2019", "Week4_2019", "Week5_2019", "Week6_2019", "Week7_2019", "Week8_2019", "Week9_2019", 
                "Week10_2019","Week11_2019", "Week12_2019", "Week13_2019", "Week14_2019", "Week15_2019", "Week16_2019", 
                "Week17_2019", "LC","Square_ID"]

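Typing forty names by hand is error-prone. As a minimal sketch, the same list can be generated programmatically. It assumes the column order shown in the `data.head(3)` output above: four leading columns, 17 weekly rainfall columns for each of the two seasons, then the land-cover type and the square ID. The helper name `weekly_names` is made up for this illustration.

```python
def weekly_names(year, n_weeks=17):
    """Return ['Week1_<year>', ..., 'Week<n_weeks>_<year>']."""
    return [f"Week{i}_{year}" for i in range(1, n_weeks + 1)]


new_columns = (
    ["X", "Y", "Target", "Elevation"]
    + weekly_names(2015)
    + weekly_names(2019)
    + ["LC", "Square_ID"]
)

print(len(new_columns))  # 40, one name per column in the frame
# data.columns = new_columns  # same effect as the hand-written list
```
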
In [7]:
train = data[["X", "Y","Target", "Elevation", "Week1_2015", "Week2_2015", "Week3_2015", "Week4_2015", "Week5_2015",
                "Week6_2015", "Week7_2015", "Week8_2015", "Week9_2015", "Week10_2015", "Week11_2015", "Week12_2015", 
                "Week13_2015", "Week14_2015", "Week15_2015", "Week16_2015", "Week17_2015","LC","Square_ID"]]

In [8]:
test = data[["X", "Y", "Elevation","Week1_2019", "Week2_2019","Week3_2019", "Week4_2019", "Week5_2019", 
                 "Week6_2019", "Week7_2019", "Week8_2019", "Week9_2019","Week10_2019","Week11_2019", "Week12_2019",
                 "Week13_2019", "Week14_2019", "Week15_2019", "Week16_2019","Week17_2019", "LC","Square_ID"]]

In [9]:
train.shape, test.shape

((16466, 23), (16466, 22))

In [10]:
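# Define the feature names and the target variable
X = train.drop('Square_ID', axis=1).columns  # Square_ID is an identifier, not a predictor
y = 'Target'
X.remove(y)  # the target must not appear among the features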

In [11]:
# Number of cross-validation folds
nfolds = 5

In [12]:
# Split the H2O frame into training (75%) and validation (25%) sets
train, valid = train.split_frame(ratios=[0.75], seed=seed)
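
`split_frame` assigns rows to the splits at random, so the 75/25 ratio is approximate rather than exact. The validation metrics further down report 4080 null degrees of freedom, which suggests 4081 validation rows rather than exactly a quarter of 16466. The sketch below illustrates the idea in plain Python. It is a simplified stand-in, not H2O's actual implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(n_rows, ratio, seed):
    """Send each row index to train with probability `ratio`, else to valid."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for i in range(n_rows):
        (train_idx if rng.random() < ratio else valid_idx).append(i)
    return train_idx, valid_idx


train_idx, valid_idx = random_split(n_rows=16466, ratio=0.75, seed=44)
print(len(train_idx), len(valid_idx))  # roughly a 75/25 split, not an exact one
```
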

### 3. Model building
We will fit gradient boosting and random forest models, then stack them.

#### Gradient Boosting Model

In [13]:
#train a gradient boosting model
gbm =  H2OGradientBoostingEstimator(ntrees = 150,
                  nfolds = nfolds,
                  fold_assignment = "Modulo",    #needed for stacking
                  keep_cross_validation_predictions = True, #needed for stacking
                  seed = 44)
gbm.train(x=X, y=y, training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
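
Stacking requires every base model to use the same cross-validation folds, and `fold_assignment = "Modulo"` makes the folds deterministic. As I understand it, row `i` goes to fold `i % nfolds`. The sketch below shows that rule in plain Python; the `modulo_folds` helper is illustrative and not part of H2O.

```python
def modulo_folds(n_rows, nfolds):
    """Return the fold index of each row under a modulo assignment."""
    return [i % nfolds for i in range(n_rows)]


folds = modulo_folds(n_rows=12, nfolds=5)
print(folds)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1]
```

Because the assignment depends only on row position, the GBM and the random forest see identical folds, so their cross-validated predictions line up row by row for the metalearner.
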


In [14]:
#gbm results
gbm

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1599984842456_1


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 150.0 | 150.0 | 48623.0 | 5.0 | 5.0 | 5.0 | 7.0 | 32.0 | 21.093334 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.0082322104494102
RMSE: 0.09073152952204763
MAE: 0.03804289383721175
RMSLE: 0.06738317647775452
Mean Residual Deviance: 0.0082322104494102

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.01101928794939579
RMSE: 0.10497279623500458
MAE: 0.043604157818220086
RMSLE: 0.07807097083305162
Mean Residual Deviance: 0.01101928794939579

Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 0.043604158 | 0.0007594922 | 0.043311615 | 0.042794544 | 0.04410953 | 0.04314996 | 0.044655137 |
| mean_residual_deviance | 0.011019288 | 0.00038962322 | 0.010961187 | 0.010979381 | 0.011431033 | 0.010424523 | 0.011300316 |
| mse | 0.011019288 | 0.00038962322 | 0.010961187 | 0.010979381 | 0.011431033 | 0.010424523 | 0.011300316 |
| r2 | 0.7850262 | 0.0057546333 | 0.79294664 | 0.78731793 | 0.7828472 | 0.78471524 | 0.777304 |
| residual_deviance | 0.011019288 | 0.00038962322 | 0.010961187 | 0.010979381 | 0.011431033 | 0.010424523 | 0.011300316 |
| rmse | 0.10495955 | 0.0018646228 | 0.104695685 | 0.104782544 | 0.10691601 | 0.10210055 | 0.10630295 |
| rmsle | 0.07805957 | 0.0014916741 | 0.07760029 | 0.07736436 | 0.07925123 | 0.076199055 | 0.07988291 |
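
The `mean` column averages each metric over the five folds. This is a plausible reason why the summary RMSE (0.10495955) is slightly lower than the cross-validation RMSE reported above (0.10497...): the former is the mean of the per-fold RMSEs, while the latter is presumably the square root of the MSE over all held-out predictions. A quick check with the per-fold MSE values from the table:

```python
import math

# Per-fold MSE values copied from the cross-validation summary above.
fold_mse = [0.010961187, 0.010979381, 0.011431033, 0.010424523, 0.011300316]

mean_mse = sum(fold_mse) / len(fold_mse)
mean_of_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)
rmse_of_mean = math.sqrt(mean_mse)

print(round(mean_mse, 6))      # 0.011019
print(round(mean_of_rmse, 5))  # 0.10496
print(round(rmse_of_mean, 5))  # 0.10497
```

The mean of square roots is never larger than the square root of the mean, so a small gap in this direction is expected.
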



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2020-09-13 10:21:32 | 16.124 sec | 0.0 | 0.226447 | 0.128853 | 0.051278 |
| 2020-09-13 10:21:32 | 16.167 sec | 1.0 | 0.212463 | 0.120968 | 0.04514 |
| 2020-09-13 10:21:32 | 16.202 sec | 2.0 | 0.200592 | 0.114063 | 0.040237 |
| 2020-09-13 10:21:32 | 16.234 sec | 3.0 | 0.190167 | 0.107851 | 0.036164 |
| 2020-09-13 10:21:32 | 16.264 sec | 4.0 | 0.180769 | 0.102148 | 0.032678 |
| 2020-09-13 10:21:32 | 16.301 sec | 5.0 | 0.172654 | 0.097031 | 0.029809 |
| 2020-09-13 10:21:32 | 16.331 sec | 6.0 | 0.165636 | 0.092508 | 0.027435 |
| 2020-09-13 10:21:32 | 16.364 sec | 7.0 | 0.15843 | 0.087917 | 0.0251 |
| 2020-09-13 10:21:32 | 16.398 sec | 8.0 | 0.151975 | 0.083679 | 0.023096 |
| 2020-09-13 10:21:32 | 16.429 sec | 9.0 | 0.146445 | 0.079891 | 0.021446 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Elevation | 1779.938721 | 1.0 | 0.634016 |
| Y | 272.798767 | 0.153263 | 0.097171 |
| Week17_2015 | 217.016129 | 0.121923 | 0.077301 |
| X | 154.909836 | 0.087031 | 0.055179 |
| LC | 114.828293 | 0.064512 | 0.040902 |
| Week8_2015 | 114.257904 | 0.064192 | 0.040699 |
| Week9_2015 | 50.210125 | 0.028209 | 0.017885 |
| Week11_2015 | 15.762452 | 0.008856 | 0.005615 |
| Week15_2015 | 13.864642 | 0.007789 | 0.004939 |
| Week16_2015 | 13.024219 | 0.007317 | 0.004639 |



See the whole table with table.as_data_frame()
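
In the variable importance table, `scaled_importance` is each variable's `relative_importance` divided by the largest one, and `percentage` is its share of the total over all features. A small check on the first three rows follows. The shares printed by this sketch will not match the table's `percentage` column, because it normalises over three features only rather than the full set.

```python
# Relative importances of the top three variables, copied from the table above.
relative = {"Elevation": 1779.938721, "Y": 272.798767, "Week17_2015": 217.016129}

largest = max(relative.values())
total = sum(relative.values())

scaled = {name: value / largest for name, value in relative.items()}
share = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name}: scaled={scaled[name]:.6f}, share={share[name]:.6f}")
```
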




#### Random Forests Model

In [15]:
#train a random forests model
rf = H2ORandomForestEstimator(ntrees = 150,
                          nfolds = nfolds,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = True,
                          seed = 44)
rf.train(x=X, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [16]:
#random forests results
rf

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1599984842456_2


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 150.0 | 150.0 | 2540382.0 | 20.0 | 20.0 | 20.0 | 1088.0 | 1547.0 | 1344.44 |




ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.008882973579370474
RMSE: 0.09424952827134189
MAE: 0.036959166448788744
RMSLE: 0.06977714521306987
Mean Residual Deviance: 0.008882973579370474

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.009183493087597095
RMSE: 0.09583054360482933
MAE: 0.03776314831516794
RMSLE: 0.07092155712957854
Mean Residual Deviance: 0.009183493087597095

Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 0.03776315 | 0.0008916389 | 0.038256306 | 0.037745126 | 0.038109776 | 0.036238994 | 0.03846554 |
| mean_residual_deviance | 0.009183493 | 0.0005046626 | 0.009376449 | 0.009193788 | 0.009432166 | 0.0083182175 | 0.009596845 |
| mse | 0.009183493 | 0.0005046626 | 0.009376449 | 0.009193788 | 0.009432166 | 0.0083182175 | 0.009596845 |
| r2 | 0.82093924 | 0.006303762 | 0.8228819 | 0.82190675 | 0.82081926 | 0.8282141 | 0.8108744 |
| residual_deviance | 0.009183493 | 0.0005046626 | 0.009376449 | 0.009193788 | 0.009432166 | 0.0083182175 | 0.009596845 |
| rmse | 0.09580068 | 0.0026746204 | 0.09683206 | 0.09588424 | 0.09711934 | 0.09120426 | 0.09796349 |
| rmsle | 0.07090017 | 0.0019471655 | 0.071477905 | 0.070059046 | 0.071746625 | 0.068035886 | 0.07318138 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2020-09-13 10:24:10 | 30.982 sec | 0.0 | | | |
| 2020-09-13 10:24:10 | 31.027 sec | 1.0 | 0.140418 | 0.046353 | 0.019717 |
| 2020-09-13 10:24:10 | 31.073 sec | 2.0 | 0.136536 | 0.04517 | 0.018642 |
| 2020-09-13 10:24:10 | 31.112 sec | 3.0 | 0.129766 | 0.042954 | 0.016839 |
| 2020-09-13 10:24:10 | 31.161 sec | 4.0 | 0.125275 | 0.042093 | 0.015694 |
| 2020-09-13 10:24:10 | 31.207 sec | 5.0 | 0.122161 | 0.041557 | 0.014923 |
| 2020-09-13 10:24:10 | 31.259 sec | 6.0 | 0.117762 | 0.040957 | 0.013868 |
| 2020-09-13 10:24:10 | 31.313 sec | 7.0 | 0.115251 | 0.040961 | 0.013283 |
| 2020-09-13 10:24:10 | 31.366 sec | 8.0 | 0.112426 | 0.04035 | 0.01264 |
| 2020-09-13 10:24:10 | 31.413 sec | 9.0 | 0.110074 | 0.039798 | 0.012116 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Elevation | 32699.486328 | 1.0 | 0.530601 |
| Y | 7258.33252 | 0.221971 | 0.117778 |
| X | 6865.935547 | 0.209971 | 0.111411 |
| LC | 3609.937256 | 0.110397 | 0.058577 |
| Week17_2015 | 2528.415039 | 0.077323 | 0.041028 |
| Week7_2015 | 1579.874512 | 0.048315 | 0.025636 |
| Week13_2015 | 1219.905884 | 0.037307 | 0.019795 |
| Week9_2015 | 942.736755 | 0.02883 | 0.015297 |
| Week6_2015 | 852.137817 | 0.02606 | 0.013827 |
| Week16_2015 | 558.104736 | 0.017068 | 0.009056 |



See the whole table with table.as_data_frame()




#### Stacked Model

In [17]:
ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, rf])
ensemble.train(x=X, y=y, training_frame=train)

stackedensemble Model Build progress: |███████████████████████████████████| 100%
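
The stacked ensemble fits a metalearner on the cross-validated predictions of the base models; the `ModelMetricsRegressionGLM` header in the performance output indicates that the metalearner here is a GLM. The sketch below illustrates the idea with ordinary least squares on two columns of out-of-fold predictions. It is a simplification under stated assumptions: the numbers are made up, `fit_metalearner` is a hypothetical helper, and H2O's metalearner may add constraints or regularisation that this sketch omits.

```python
def fit_metalearner(p1, p2, y):
    """Least-squares fit of y ~ b0 + w1*p1 + w2*p2 for two base models."""
    n = len(y)
    m1, m2, my = sum(p1) / n, sum(p2) / n, sum(y) / n
    # Centre the data so the intercept can be recovered afterwards.
    a = [v - m1 for v in p1]
    b = [v - m2 for v in p2]
    c = [v - my for v in y]
    saa = sum(u * u for u in a)
    sbb = sum(v * v for v in b)
    sab = sum(u * v for u, v in zip(a, b))
    sac = sum(u * t for u, t in zip(a, c))
    sbc = sum(v * t for v, t in zip(b, c))
    det = saa * sbb - sab * sab  # zero if the two models are perfectly collinear
    w1 = (sac * sbb - sbc * sab) / det
    w2 = (sbc * saa - sac * sab) / det
    b0 = my - w1 * m1 - w2 * m2
    return b0, w1, w2


# Made-up out-of-fold predictions from two base models.
gbm_oof = [0.10, 0.40, 0.20, 0.90, 0.50]
rf_oof = [0.20, 0.30, 0.10, 0.80, 0.70]
# Synthetic target built from known weights, so the fit can be checked.
target = [0.05 + 0.3 * g + 0.7 * r for g, r in zip(gbm_oof, rf_oof)]

b0, w1, w2 = fit_metalearner(gbm_oof, rf_oof, target)
print(round(b0, 3), round(w1, 3), round(w2, 3))  # 0.05 0.3 0.7
```

Fitting on out-of-fold predictions rather than in-sample ones keeps the metalearner from rewarding a base model simply for overfitting the training rows.
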


In [18]:
#model performance
ensemble.model_performance(valid)


ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 0.008924953531439572
RMSE: 0.09447197220043399
MAE: 0.03798197789639674
RMSLE: 0.06976837464737647
R^2: 0.8390348618011885
Mean Residual Deviance: 0.008924953531439572
Null degrees of freedom: 4080
Residual degrees of freedom: 4078
Null deviance: 226.37592809959853
Residual deviance: 36.4227353618049
AIC: -7668.471560338186




### 4. Submission

In [19]:
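# Make predictions on the test data. Note that `test` carries Week*_2019 column names
# while the models were trained on Week*_2015 names, so the rainfall features may not
# be matched at prediction time; renaming them beforehand may be worth trying.
preds = ensemble.predict(test)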

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [20]:
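# Create the submission file: attach the predictions to the sample submission,
# keep the ID and prediction columns, and convert to a pandas DataFrame
submission_df = submission_file.concat(preds, axis=1)[['Square_ID', 'predict']].as_data_frame(use_pandas=True)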

In [21]:
submission_df

| | Square_ID | predict |
|---|---|---|
| 0 | 4E3C3896-14CE-11EA-BCE5-F49634744A41 | -0.000893 |
| 1 | 4E3C3897-14CE-11EA-BCE5-F49634744A41 | -0.000799 |
| 2 | 4E3C3898-14CE-11EA-BCE5-F49634744A41 | -0.000709 |
| 3 | 4E3C3899-14CE-11EA-BCE5-F49634744A41 | 0.000336 |
| 4 | 4E3C389A-14CE-11EA-BCE5-F49634744A41 | -0.000228 |
| ... | ... | ... |
| 16461 | 4E6F5DFD-14CE-11EA-BCE5-F49634744A41 | 0.072818 |
| 16462 | 4E6F5DFE-14CE-11EA-BCE5-F49634744A41 | 0.068655 |
| 16463 | 4E6F5DFF-14CE-11EA-BCE5-F49634744A41 | 0.068604 |
| 16464 | 4E6F5E00-14CE-11EA-BCE5-F49634744A41 | 0.076309 |
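
Several of the predictions above are slightly negative. If the target is a flooded fraction between 0 and 1 (an assumption about the competition's target that the notebook does not state), such values cannot be correct, and clipping them into range can only reduce the squared error. A minimal sketch in plain Python, using a hypothetical `clip_predictions` helper:

```python
def clip_predictions(values, lower=0.0, upper=1.0):
    """Clamp each prediction into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


raw = [-0.000893, -0.000799, 0.000336, 0.072818]
print(clip_predictions(raw))  # [0.0, 0.0, 0.000336, 0.072818]
```

On the pandas frame, the equivalent would be `submission_df['predict'].clip(0, 1)`.
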


In [22]:
submission_df.to_csv('submission.csv', index=False)

### 5. Conclusion
Without any parameter tuning or feature engineering, the stacked model performed well on the leaderboard (LB), with a score of 0.097. To improve the model, we can use random grid search to find better hyperparameters.
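
Random grid search samples a fixed number of hyperparameter combinations instead of trying every one. The sketch below shows the sampling step in plain Python. The grid values are illustrative assumptions, not tuned recommendations, and `sample_grid` is a hypothetical helper. In H2O the equivalent is, to my knowledge, `H2OGridSearch` (already imported above) with `search_criteria={'strategy': 'RandomDiscrete', 'max_models': 5, 'seed': 44}`.

```python
import itertools
import random


def sample_grid(grid, n_models, seed):
    """Draw n_models distinct hyperparameter combinations from a discrete grid."""
    names = sorted(grid)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    rng = random.Random(seed)
    return rng.sample(combos, min(n_models, len(combos)))


# Illustrative GBM grid: 4 * 3 * 3 = 36 combinations in total.
gbm_grid = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.7, 0.85, 1.0],
}

for params in sample_grid(gbm_grid, n_models=5, seed=44):
    print(params)
```

Each sampled dictionary would then be used to train one candidate model, and the candidate with the best cross-validated error kept.
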