# Fluvial Hazard Zone Prediction
# 5. Comparing Machine Learning Models for FHZ Prediction
<img alt="Colorado, flooding, Hickenlooper, Estes Park, Boulder, National Guard" class="" src="http://america.aljazeera.com/content/ajam/articles/2013/11/22/mountain-strong-afterthefloodcoloradoonroughroadtorecovery/jcr:content/mainpar/adaptiveimage/src.adapt.960.high.Coloflood_11_29_2013.1385146061953.jpg" width=800>

<b>Step 1</b> of this analysis introduced the problem of delineating Fluvial Hazard Zones.  
<b>Step 2</b> created a pseudo Fluvial Hazard Zone delineation from the pre- and post-flood topographic surfaces and performed data cleaning tasks.

<b>Steps 3, 4, and 6</b> built machine learning models that map the 7 input features to a binary prediction, 'Fluvial Hazard Zone' or 'Not Fluvial Hazard Zone':  

<b>Step 3</b> built a Multi-Layer Perceptron model which is a simple version of the Artificial Neural Network Architecture.  
<b>Step 4</b> built a Random Forest Classifier, which is a more robust version of the Decision Tree method.  
<b>Step 6</b> built a logistic regression model (related to linear regression) to use as a baseline for comparison.

This step, <b>Step 5</b>, compares the results of the three methods and plots the resulting predictions for exploration.

# Loading the test set: North Saint Vrain

In [1]:
# Import packages
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
# Read in the clean dataset of channel migration zone points
csv = r'FHZ_points_clean_NSV.csv'

df = pd.read_csv(csv, header=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2033259 entries, 0 to 2033258
Data columns (total 13 columns):
Unnamed: 0            int64
long_WGS84            float64
lat_WGS84             float64
topo2011              float64
ground_slope          float64
ground_curve          float64
near_crossing         float64
near_road             float64
near_stream           float64
stream_slope          float64
relative_elevation    float64
ground_delta          float64
target                float64
dtypes: float64(12), int64(1)
memory usage: 201.7 MB


## Convert data to numpy arrays for testing
### Predictor Features:

In [3]:
# Convert the predictor data to numpy array for neural network

# Drop all columns that aren't to be used as prediction features
drop_columns = ['Unnamed: 0', 'long_WGS84', 'lat_WGS84', 'topo2011', 'ground_delta','target']
df_predictors = df.drop(drop_columns, axis=1)

# Convert predictors to numpy array
predictors = df_predictors.values
print ('\n','predictor matrix shape: ', predictors.shape)
df_predictors.head()


 predictor matrix shape:  (2033259, 7)


Unnamed: 0,ground_slope,ground_curve,near_crossing,near_road,near_stream,stream_slope,relative_elevation
0,20.6943,1.60455,176.108993,58.420898,443.980011,0.795272,50.232899
1,23.3629,3.74078,177.811005,61.274898,444.319,0.795272,51.084499
2,23.3706,2.79134,179.548004,64.128799,444.576996,0.795272,51.965801
3,23.0968,2.16471,181.317993,66.982697,444.834992,0.795272,52.818802
4,22.435101,1.42687,183.119995,69.836601,445.092987,0.795272,53.6978
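
Because the trained models expect the seven features in a fixed column order, an alternative to dropping the unwanted columns is to select the features explicitly by name. The sketch below assumes the column names shown in the `df.info()` output above; the `FEATURES` list and the tiny synthetic frame are illustrative only and are not part of the original analysis.

```python
import pandas as pd

# Feature names copied from the df.info() output; the list name is hypothetical
FEATURES = ['ground_slope', 'ground_curve', 'near_crossing', 'near_road',
            'near_stream', 'stream_slope', 'relative_elevation']

# Tiny synthetic frame standing in for the real dataset (values are made up)
demo = pd.DataFrame({name: [0.0, 1.0] for name in FEATURES})
demo['target'] = [0.0, 1.0]
demo['topo2011'] = [5324.7, 5325.0]

# Selecting by name fixes the column order and raises a KeyError if a feature is missing
demo_predictors = demo[FEATURES].values
print(demo_predictors.shape)
```

Selecting by name also means that a new column added to the CSV later would not silently become an eighth predictor.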


### Target Features:

In [4]:
print ('TARGET BEFORE BINARIZATION:')
print('\n', df['target'].head())

# Convert the target feature from label encoding to categorical encoding (same as one-hot encoding.)
dummy_target = np_utils.to_categorical(df['target'])
print ('\n', 'TARGET AFTER BINARIZATION:')
dummy_target[0:5]

TARGET BEFORE BINARIZATION:

 0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: target, dtype: float64

 TARGET AFTER BINARIZATION:


array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])
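
As a cross-check of what `to_categorical` produces, the same one-hot layout can be sketched with NumPy alone, assuming the class labels are the integer-valued 0 and 1 seen in the `target` column. The sample labels below are made up for illustration.

```python
import numpy as np

# Made-up labels in the same float format as df['target']
labels = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
n_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels.astype(int)]

# argmax along axis 1 reverses the encoding back to class indices
recovered = np.argmax(one_hot, axis=1)

print(one_hot)
print(recovered)
```

Column 0 therefore marks 'Not Fluvial Hazard Zone' and column 1 marks 'Fluvial Hazard Zone', which is why `argmax` is used later to turn two-column predictions back into a single label.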

# *Artificial Neural Network (MLP model) Testing*

## Loading the trained model from disk

In [5]:
# When we want to re-evaluate the model without going through the lengthy retraining process, we can load the 
# .json and .h5 files from disk.  Start here when re-starting the kernel.
from keras.models import model_from_json

# Define the model and weights we want to load
saved_model = "MLP_model_2018-04-08 22:20:17.035485.json"     # needs to be a .json file
saved_weights = "MLP_model_weights_2018-04-08 22:20:17.035485.h5"   # needs to be a .h5 file


# load json and create model
json_file = open(saved_model, 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
# load weights into new model
model.load_weights(saved_weights)
print("Loaded model and weights from disk")

Loaded model and weights from disk


## Evaluate the MLP model on North Saint Vrain data

In [6]:
# Compile the model and test
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Use the evaluate method on the North Saint Vrain test data and print the accuracy
score = model.evaluate(predictors, dummy_target, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], score[1]*100))

acc: 72.51%


In [7]:
# Make the predictions for the entire dataset, which is the easiest way to have matching indexes for
# rejoining back to original data. 
y_hats = model.predict(predictors)

## Joining the predictions back to the original dataframe

In [8]:
# Use argmax along axis 1 to return the index of the highest probability for
# each point in the predictions (y_hats), and amax to return the probability
# of that prediction.
boolean_solution = np.argmax(y_hats, axis=1)
probability = np.amax(y_hats, axis=1)

# Assign the arrays to two new columns of df
df['FHZ_MLP_prediction'] = boolean_solution
df['FHZ_MLP_probability'] = probability

df.tail()

Unnamed: 0.1,Unnamed: 0,long_WGS84,lat_WGS84,topo2011,ground_slope,ground_curve,near_crossing,near_road,near_stream,stream_slope,relative_elevation,ground_delta,target,FHZ_MLP_prediction,FHZ_MLP_probability
2033254,2050815,-105.26776,40.22061,5324.669922,4.08205,8.04308,1320.900024,334.39801,121.816002,1.08366,5.2876,-1.50732,0.0,1,0.66814
2033255,2050816,-105.267749,40.22061,5325.0,9.71195,8.15158,1323.359985,333.713013,119.613998,1.08366,5.62354,-1.94873,1.0,1,0.685544
2033256,2050817,-105.267738,40.22061,5324.180176,14.692,-1.32243,1325.829956,333.053986,117.413002,1.08366,4.80127,-1.41113,0.0,1,0.734266
2033257,2050818,-105.267727,40.22061,5323.160156,12.5361,-6.07232,1328.300049,332.420013,115.210999,1.08366,3.78711,-0.608887,0.0,1,0.75469
2033258,2050819,-105.267717,40.22061,5322.509766,7.15242,-4.78244,1330.770019,331.812988,113.010002,1.08366,3.13623,-0.087402,0.0,1,0.75229


## Score the MLP model

In [97]:
from sklearn.metrics import confusion_matrix, classification_report

true_negatives, false_positives, false_negatives, true_positives = \
    confusion_matrix(df['target'], df['FHZ_MLP_prediction']).ravel()
    
print ('true negatives: ', true_negatives, '\n',
      'false positives: ', false_positives, '\n',
      'false negatives: ', false_negatives, '\n',
      'true positives: ', true_positives, '\n') 

print(classification_report(df['target'], df['FHZ_MLP_prediction']))

true negatives:  1316405 
 false positives:  432228 
 false negatives:  126675 
 true positives:  157951 

             precision    recall  f1-score   support

        0.0       0.91      0.75      0.82   1748633
        1.0       0.27      0.55      0.36    284626

avg / total       0.82      0.73      0.76   2033259
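
The `1.0` row of the report can be reproduced by hand from the four counts printed above, using the standard definitions of precision, recall, and F1 (TP, FP, TN, FN are the true/false positive/negative counts):

```latex
\begin{align*}
\text{precision} &= \frac{TP}{TP + FP} = \frac{157951}{157951 + 432228} \approx 0.27 \\
\text{recall} &= \frac{TP}{TP + FN} = \frac{157951}{157951 + 126675} \approx 0.55 \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \approx 0.36 \\
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{157951 + 1316405}{2033259} \approx 0.725
\end{align*}
```

The accuracy agrees with the 72.51% reported by `model.evaluate` above.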



In [98]:
# Create new dataframe to hold scores
scores = pd.DataFrame()
scores.style.format("{:%.1d}")
scores.columns.name = 'Model'

# Append the results to a new dataframe - 'scores'
scores = scores.append({
     'true negative': true_negatives,
     'false negative': false_negatives,
     'true positive' : true_positives,
    'false positive': false_positives
      }, ignore_index=True)

# Update the index name
last = scores.index[-1]
scores = scores.rename({last: 'MLP_model'})

scores

Model,false negative,false positive,true negative,true positive
MLP_model,126675.0,432228.0,1316405.0,157951.0
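
`DataFrame.append` was removed in pandas 2.0, so on a current pandas installation the cell above needs another way to add rows. One possible sketch uses `pd.concat` with a hypothetical helper, `add_score_row`, that sets the row label when the row is created, so no rename is needed afterwards. The counts are the MLP values from the table above; `scores_demo` is an illustrative name chosen so the notebook's own `scores` is left untouched.

```python
import pandas as pd

def add_score_row(scores, name, tn, fp, fn, tp):
    """Return `scores` plus one row of confusion-matrix counts labelled `name`."""
    row = pd.DataFrame(
        {'false negative': [fn], 'false positive': [fp],
         'true negative': [tn], 'true positive': [tp]},
        index=[name])
    # Avoid concatenating with an empty frame; just start from the first row
    return row if scores.empty else pd.concat([scores, row])

scores_demo = pd.DataFrame()
scores_demo = add_score_row(scores_demo, 'MLP_model',
                            tn=1316405, fp=432228, fn=126675, tp=157951)
print(scores_demo)
```

Because the index is not reset on each call, earlier row labels survive when further models are added.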


## Visualize the MLP predicted FHZ points

In [5]:
# Import modules
from bokeh.io import show, output_notebook, output_file
from bokeh.plotting import ColumnDataSource, figure, gmap
from bokeh.layouts import row, column, widgetbox
from bokeh.models import GMapOptions, LogTicker, HoverTool, LinearColorMapper
from bokeh.palettes import brewer

In [11]:
# Visualize the data using bokeh and the Google Maps API

# Create a downsampled version of the full dataframe for plotting (avoids data limit restrictions)
df_sample = df.sample(frac=0.005, replace=False)

# Create a ColumnDataSource from df: source
source = ColumnDataSource(df_sample)

# Set the mapping options, location and zoom level
map_options = GMapOptions(
    lat=np.mean(df['lat_WGS84']), 
    lng=np.mean(df['long_WGS84']),
    map_type="hybrid", zoom=14)

# Create the google maps figure: p
p = gmap(
    "YOUR_GOOGLE_MAPS_API_KEY",  # placeholder: supply your own Google Maps API key
    map_options, title="North Saint Vrain, Predicted Fluvial Hazard Zone Points", 
    tools='pan, wheel_zoom, box_select,lasso_select, reset, save',
    plot_width=900)

#Color

# Develop a color gradient for plotting, and color bar for legend
color_mapper = LinearColorMapper(
    palette=['#07fb12','#fb9001'],
    low=df_sample['FHZ_MLP_prediction'].min(),
    high=df_sample['FHZ_MLP_prediction'].max())

# Add circle glyphs to figure p
p.circle(
    x="long_WGS84", 
    y="lat_WGS84", 
    size=8, 
    source=source, 
    color=dict(field='FHZ_MLP_prediction', transform=color_mapper), 
    fill_alpha=0.1,
    legend='FHZ_MLP_prediction')

# Create a HoverTool object: hover
hover = HoverTool(tooltips=[
#     ('ground_delta', '@ground_delta{0.00}'),
    ('FHZ_MLP_prediction', '@FHZ_MLP_prediction{0}'),
    ('FHZ_MLP_probability', '@FHZ_MLP_probability{0.00}')])

# Add the HoverTool object to figure p
p.add_tools(hover)

# Label the axes
p.xaxis.axis_label = 'longitude WGS84'
p.yaxis.axis_label = 'latitude WGS84'

# display the plot
# output_notebook()
output_file('FHZ_prediction_Summary.html')
show(p)
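
Each plotting cell in this notebook calls `df.sample` without a seed, so each map shows a different random subset of points. If point-for-point comparison between maps matters, one option is to fix `random_state` (or draw the sample once and reuse it). The sketch below uses a small synthetic frame; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df with 1000 points (coordinates are made up)
demo = pd.DataFrame({'lat_WGS84': np.linspace(40.20, 40.22, 1000),
                     'long_WGS84': np.linspace(-105.30, -105.26, 1000)})

# The same seed yields the same rows on every call
sample_a = demo.sample(frac=0.005, replace=False, random_state=42)
sample_b = demo.sample(frac=0.005, replace=False, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```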

# *Random Forest Classifier Testing*


In [28]:
# Load the trained model
import pickle

with open('RandomForestModel', 'rb') as f:
    rf = pickle.load(f)

# Make the predictions for the entire dataset, which is the easiest way to have matching indexes for
# rejoining back to original data. 
y_hats2 = rf.predict(predictors)

print (y_hats2)

[[ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]
 ..., 
 [ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]]


In [29]:
# Use argmax along axis 1 to return the index of the predicted class for each
# row of the predictions (y_hats2). A zero index also refers to a zero prediction.
rf_prediction = np.argmax(y_hats2, axis=1)

# Assign the predictions to a new column of df
df['FHZ_RF_prediction'] = rf_prediction

df.tail()

Unnamed: 0.1,Unnamed: 0,long_WGS84,lat_WGS84,topo2011,ground_slope,ground_curve,near_crossing,near_road,near_stream,stream_slope,relative_elevation,ground_delta,target,FHZ_MLP_prediction,FHZ_MLP_probability,FHZ_RF_prediction
2033254,2050815,-105.26776,40.22061,5324.669922,4.08205,8.04308,1320.900024,334.39801,121.816002,1.08366,5.2876,-1.50732,0.0,1,0.66814,0
2033255,2050816,-105.267749,40.22061,5325.0,9.71195,8.15158,1323.359985,333.713013,119.613998,1.08366,5.62354,-1.94873,1.0,1,0.685544,0
2033256,2050817,-105.267738,40.22061,5324.180176,14.692,-1.32243,1325.829956,333.053986,117.413002,1.08366,4.80127,-1.41113,0.0,1,0.734266,0
2033257,2050818,-105.267727,40.22061,5323.160156,12.5361,-6.07232,1328.300049,332.420013,115.210999,1.08366,3.78711,-0.608887,0.0,1,0.75469,0
2033258,2050819,-105.267717,40.22061,5322.509766,7.15242,-4.78244,1330.770019,331.812988,113.010002,1.08366,3.13623,-0.087402,0.0,1,0.75229,0


## Score the Random Forest Model

In [99]:
true_negatives, false_positives, false_negatives, true_positives = \
    confusion_matrix(df['target'], df['FHZ_RF_prediction']).ravel()
    
print ('true negatives: ', true_negatives, '\n',
      'false positives: ', false_positives, '\n',
      'false negatives: ', false_negatives, '\n',
      'true positives: ', true_positives, '\n') 

print(classification_report(df['target'], df['FHZ_RF_prediction']))

true negatives:  1490297 
 false positives:  258336 
 false negatives:  160286 
 true positives:  124340 

             precision    recall  f1-score   support

        0.0       0.90      0.85      0.88   1748633
        1.0       0.32      0.44      0.37    284626

avg / total       0.82      0.79      0.81   2033259



In [100]:
# Append the results to a new dataframe - 'scores'
scores = scores.append({
     'true negative': true_negatives,
     'false negative': false_negatives,
     'true positive' : true_positives,
    'false positive': false_positives
      }, ignore_index=True)

# Update the index name
last = scores.index[-1]
scores = scores.rename({last: 'RandomForest_model'})

scores

Model,false negative,false positive,true negative,true positive
0,126675.0,432228.0,1316405.0,157951.0
RandomForest_model,160286.0,258336.0,1490297.0,124340.0


## Visualize the Random Forest predicted FHZ points

In [26]:
# Visualize the data using bokeh and the Google Maps API

# Create a downsampled version of the full dataframe for plotting (avoids data limit restrictions)
df_sample = df.sample(frac=0.005, replace=False)

# Create a ColumnDataSource from df: source
source = ColumnDataSource(df_sample)

# Set the mapping options, location and zoom level
map_options = GMapOptions(
    lat=np.mean(df['lat_WGS84']), 
    lng=np.mean(df['long_WGS84']),
    map_type="hybrid", zoom=14)

# Create the google maps figure: p
p = gmap(
    "YOUR_GOOGLE_MAPS_API_KEY",  # placeholder: supply your own Google Maps API key
    map_options, title="North Saint Vrain, Predicted Fluvial Hazard Zone Points", 
    tools='pan, wheel_zoom, box_select,lasso_select, reset, save',
    plot_width=900)

#Color

# Develop a color gradient for plotting, and color bar for legend
color_mapper = LinearColorMapper(
    palette=['#07fb12','#fb9001'],
    low=df_sample['FHZ_RF_prediction'].min(),
    high=df_sample['FHZ_RF_prediction'].max())

# Add circle glyphs to figure p
p.circle(
    x="long_WGS84", 
    y="lat_WGS84", 
    size=8, 
    source=source, 
    color=dict(field='FHZ_RF_prediction', transform=color_mapper), 
    fill_alpha=0.1,
    legend='FHZ_RF_prediction')

# Create a HoverTool object: hover
hover = HoverTool(tooltips=[('FHZ_RF_prediction', '@FHZ_RF_prediction{0}')])

# Add the HoverTool object to figure p
p.add_tools(hover)

# Label the axes
p.xaxis.axis_label = 'longitude WGS84'
p.yaxis.axis_label = 'latitude WGS84'

# display the plot
output_notebook()
show(p)
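
The plotting cells pass the Google Maps API key to `gmap` as a string literal. A common alternative is to read the key from an environment variable so it never appears in the notebook. The helper name `get_maps_api_key` and the variable name `GOOGLE_MAPS_API_KEY` below are assumptions made for this sketch, not part of the original workflow.

```python
import os

def get_maps_api_key(env_var='GOOGLE_MAPS_API_KEY'):
    """Return the API key stored in the named environment variable."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of rendering a blank map
        raise RuntimeError('Set the %s environment variable first' % env_var)
    return key
```

With the variable set in the shell that launches Jupyter, the figure could then be created with `gmap(get_maps_api_key(), map_options, ...)`.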

# *Logistic Regression Testing*

In [33]:
# Load the trained model
from sklearn.externals import joblib

logreg = joblib.load('logistic_regression_model.pkl') 

In [34]:
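# Standardize the features (the Logistic Regression model was trained on standardized features).
# Note: the scaler here is fit on the test data itself, not reused from the training phase.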
from sklearn.preprocessing import StandardScaler

std_scale = StandardScaler().fit(df_predictors)
predictors_std = std_scale.transform(df_predictors)
type(predictors_std)

# Make the predictions for the entire dataset, which is the easiest way to have matching indexes for
# rejoining back to original data. 
y_hats3 = logreg.predict(predictors_std)

print (y_hats3)

[ 0.  0.  0. ...,  1.  0.  0.]
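
Standardization rescales each feature to (x - mean) / std. If the scaler from the training phase had been saved, its mean and standard deviation could be applied to the test data so that both datasets share one scale. A minimal NumPy sketch of that idea follows; `train` and `test` are made-up arrays standing in for the real feature matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(500, 7))   # hypothetical training features
test = rng.normal(loc=12.0, scale=2.0, size=(200, 7))    # hypothetical test features

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

train_std = (train - mean) / std
test_std = (test - mean) / std   # test data scaled with the training statistics

print(np.allclose(train_std.mean(axis=0), 0.0))
print(test_std.mean(axis=0).round(2))
```

Scaled this way, the test features keep any real shift relative to the training basin instead of being re-centered to zero.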


## Joining the predictions back to the original dataframe

In [35]:
df['FHZ_logReg_prediction']=y_hats3
df.tail()

Unnamed: 0.1,Unnamed: 0,long_WGS84,lat_WGS84,topo2011,ground_slope,ground_curve,near_crossing,near_road,near_stream,stream_slope,relative_elevation,ground_delta,target,FHZ_MLP_prediction,FHZ_MLP_probability,FHZ_RF_prediction,FHZ_logReg_prediction
2033254,2050815,-105.26776,40.22061,5324.669922,4.08205,8.04308,1320.900024,334.39801,121.816002,1.08366,5.2876,-1.50732,0.0,1,0.66814,0,0.0
2033255,2050816,-105.267749,40.22061,5325.0,9.71195,8.15158,1323.359985,333.713013,119.613998,1.08366,5.62354,-1.94873,1.0,1,0.685544,0,0.0
2033256,2050817,-105.267738,40.22061,5324.180176,14.692,-1.32243,1325.829956,333.053986,117.413002,1.08366,4.80127,-1.41113,0.0,1,0.734266,0,1.0
2033257,2050818,-105.267727,40.22061,5323.160156,12.5361,-6.07232,1328.300049,332.420013,115.210999,1.08366,3.78711,-0.608887,0.0,1,0.75469,0,0.0
2033258,2050819,-105.267717,40.22061,5322.509766,7.15242,-4.78244,1330.770019,331.812988,113.010002,1.08366,3.13623,-0.087402,0.0,1,0.75229,0,0.0


## Score the Logistic Regression Model

In [101]:
true_negatives, false_positives, false_negatives, true_positives = \
    confusion_matrix(df['target'], df['FHZ_logReg_prediction']).ravel()
    
print ('true negatives: ', true_negatives, '\n',
      'false positives: ', false_positives, '\n',
      'false negatives: ', false_negatives, '\n',
      'true positives: ', true_positives, '\n') 

print(classification_report(df['target'], df['FHZ_logReg_prediction']))

true negatives:  1605544 
 false positives:  143089 
 false negatives:  211171 
 true positives:  73455 

             precision    recall  f1-score   support

        0.0       0.88      0.92      0.90   1748633
        1.0       0.34      0.26      0.29    284626

avg / total       0.81      0.83      0.82   2033259



In [102]:
# Append the results to a new dataframe - 'scores'
scores = scores.append({
     'true negative': true_negatives,
     'false negative': false_negatives,
     'true positive' : true_positives,
    'false positive': false_positives
      }, ignore_index=True)

# Update the index name
last = scores.index[-1]
scores = scores.rename({last: 'LogReg_model'})
scores

Model,false negative,false positive,true negative,true positive
0,126675.0,432228.0,1316405.0,157951.0
1,160286.0,258336.0,1490297.0,124340.0
LogReg_model,211171.0,143089.0,1605544.0,73455.0


## Visualize the Logistic Regression predicted FHZ points

In [13]:
# Visualize the data using bokeh and the Google Maps API

# Create a downsampled version of the full dataframe for plotting (avoids data limit restrictions)
df_sample = df.sample(frac=0.005, replace=False)

# Create a ColumnDataSource from df: source
source = ColumnDataSource(df_sample)

# Set the mapping options, location and zoom level
map_options = GMapOptions(
    lat=np.mean(df['lat_WGS84']), 
    lng=np.mean(df['long_WGS84']),
    map_type="hybrid", zoom=14)

# Create the google maps figure: p
p = gmap(
    "YOUR_GOOGLE_MAPS_API_KEY",  # placeholder: supply your own Google Maps API key
    map_options, title="North Saint Vrain, Predicted Fluvial Hazard Zone Points", 
    tools='pan, wheel_zoom, box_select,lasso_select, reset, save',
    plot_width=900)

#Color

# Develop a color gradient for plotting, and color bar for legend
color_mapper = LinearColorMapper(
    palette=['#07fb12','#fb9001'],
    low=df_sample['FHZ_logReg_prediction'].min(),
    high=df_sample['FHZ_logReg_prediction'].max())

# Add circle glyphs to figure p
p.circle(
    x="long_WGS84", 
    y="lat_WGS84", 
    size=8, 
    source=source, 
    color=dict(field='FHZ_logReg_prediction', transform=color_mapper), 
    fill_alpha=0.1,
    legend='FHZ_logReg_prediction')

# Create a HoverTool object: hover
hover = HoverTool(tooltips=[('FHZ_logReg_prediction', '@FHZ_logReg_prediction{0}')])

# Add the HoverTool object to figure p
p.add_tools(hover)

# Label the axes
p.xaxis.axis_label = 'longitude WGS84'
p.yaxis.axis_label = 'latitude WGS84'

# display the plot
output_notebook()
show(p)

# Summary

In [103]:
scores = scores.rename({0: 'MLP model', 1: 'Random Forest model', 2: 'LogReg_model'})
scores

Model,false negative,false positive,true negative,true positive
MLP model,126675.0,432228.0,1316405.0,157951.0
Random Forest model,160286.0,258336.0,1490297.0,124340.0
LogReg_model,211171.0,143089.0,1605544.0,73455.0
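
To compare the three models on one scale, the counts in the table above can be converted to accuracy and to precision, recall, and F1 for the 'Fluvial Hazard Zone' class. The sketch below re-enters the counts by hand so that it runs on its own; the names `counts`, `metrics`, and `baseline_accuracy` are illustrative.

```python
import pandas as pd

# Counts copied from the summary table above
counts = pd.DataFrame(
    {'false negative': [126675, 160286, 211171],
     'false positive': [432228, 258336, 143089],
     'true negative': [1316405, 1490297, 1605544],
     'true positive': [157951, 124340, 73455]},
    index=['MLP model', 'Random Forest model', 'LogReg_model'])

tp, fp = counts['true positive'], counts['false positive']
tn, fn = counts['true negative'], counts['false negative']
total = tp + tn + fp + fn

metrics = pd.DataFrame({
    'accuracy': (tp + tn) / total,
    'precision': tp / (tp + fp),
    'recall': tp / (tp + fn),
})
metrics['f1'] = (2 * metrics['precision'] * metrics['recall']
                 / (metrics['precision'] + metrics['recall']))

# Accuracy of always predicting 'Not Fluvial Hazard Zone' (share of negatives)
baseline_accuracy = (tn + fp).iloc[0] / total.iloc[0]

print(metrics.round(2))
print(round(baseline_accuracy, 2))
```

On these counts the ranking depends on the metric: logistic regression has the highest accuracy and precision, the MLP has the highest recall, and the Random Forest has a marginally higher F1 than the MLP. None of the three reaches the roughly 0.86 accuracy of always predicting 'Not Fluvial Hazard Zone', so accuracy alone is a weak criterion on this imbalanced test set.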


# References

###  Fluvial Hazard Mapping References:  
http://coloradohazardmapping.com/hazardMapping/fluvialMapping : CWCB Fluvial Hazard Mapping Delineation Guide  
http://geoinfo.msl.mt.gov/data/montana_channel_migration_zones.aspx : Montana's FHZ studies  
https://cdn.shopify.com/s/files/1/0387/9521/files/boulder_flooding_damage.jpg?1118 : Photo Credit  

### Technical References:  
https://machinelearningmastery.com/save-load-keras-deep-learning-models/ : saving and loading Keras models  
https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/ : predictions with trained models   
http://scikit-learn.org/stable/modules/model_persistence.html : saving and loading scikit learn models  
http://kitchingroup.cheme.cmu.edu/blog/2016/02/07/Interactive-Bokeh-plots-in-HTML/ : embedding Bokeh plots in html  
http://bokeh.pydata.org/en/latest/docs/user_guide/embed.html#static-data : embedding Bokeh plots in html (official)  
https://stackoverflow.com/questions/42142756/how-can-i-change-a-specific-row-label-in-a-pandas-dataframe : changing dataframe index name  
