# 1. Feature Preparation

- Define additional features `SedUn` and `SuscFrac`
- Carry out train-test split
- Fit a `StandardScaler` to the training data and save it

In [46]:
import pandas as pd
pd.set_option("max_colwidth", 50)
import geopandas as gpd

In [47]:
import sklearn
import numpy as np

In [48]:
pqfile="../data_preparation/staley16_observations_catchment_fuelpars_rocktype_randn_v3.parquet"
modelDataI = gpd.read_parquet(pqfile)

In [45]:
modelDataI.columns

Index(['fire_name', 'year', 'fire_id', 'fire_segid', 'database', 'state',
       'response', 'stormdate', 'gaugedist_m', 'stormstart', 'stormend',
       'stormdur_h', 'stormaccum_mm', 'stormavgi_mmh', 'peak_i15_mmh',
       'peak_i30_mmh', 'peak_i60_mmh', 'contributingarea_km2', 'prophm23',
       'dnbr1000', 'kf', 'acc015_mm', 'acc030_mm', 'acc060_mm', 'geom', 'lon',
       'lat', 'SiteID', 'NB', 'GR', 'GS', 'SH', 'TU', 'TL', 'dom',
       'Fine fuel load', 'SAV', 'Packing ratio', 'Extinction moisture content',
       'Igneous', 'Metamorphic', 'Sedimentary', 'Unconsolidated', 'domrt',
       'logarea', 'SedUn', 'SuscFrac'],
      dtype='object')

Sedimentary and unconsolidated rocks are combined into a single class because they show similar debris-flow occurrence.

Define additional features:
- `SedUn`: fraction of watershed covered by sedimentary and unconsolidated rocks.
- `SuscFrac`: fraction of watershed covered by susceptible vegetation types (`GS`, `SH`, `TL` and `TU`; that is, everything except grassland, `GR`, and `NB`).

In [25]:
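# log10 of the contributing area (km^2)
modelDataI["logarea"]=np.log10(modelDataI["contributingarea_km2"])
# Combined fraction of sedimentary and unconsolidated rock
modelDataI["SedUn"]=modelDataI["Sedimentary"] + modelDataI["Unconsolidated"]
# Combined fraction of the susceptible vegetation types
modelDataI["SuscFrac"]=modelDataI["GS"] + modelDataI["SH"] + modelDataI["TL"] + modelDataI["TU"]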
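
As a quick sanity check of the derived features, the same arithmetic can be run on a tiny hand-made table. This is only a sketch: the two catchments and all of their values below are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical two-catchment table; every value here is made up.
toy = pd.DataFrame({
    "contributingarea_km2": [0.1, 10.0],
    "Sedimentary": [0.25, 0.0],
    "Unconsolidated": [0.25, 0.5],
    "GR": [0.25, 0.0],
    "GS": [0.25, 0.25],
    "SH": [0.25, 0.25],
    "TL": [0.0, 0.25],
    "TU": [0.25, 0.25],
})

# Same definitions as in the notebook cell above.
toy["logarea"] = np.log10(toy["contributingarea_km2"])
toy["SedUn"] = toy["Sedimentary"] + toy["Unconsolidated"]
toy["SuscFrac"] = toy[["GS", "SH", "TL", "TU"]].sum(axis=1)

print(toy[["logarea", "SedUn", "SuscFrac"]])
```

In the first toy catchment a quarter of the area is grassland, so `SuscFrac` comes out as 0.75; in the second there is no grassland and it is 1.0.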

In [26]:
#Columns to use for training

usecols=["stormdur_h",
         "stormaccum_mm",
         "peak_i15_mmh",
         "logarea",
         "contributingarea_km2",
         "prophm23",
         "dnbr1000",
         "kf",
         "SedUn",
         "SuscFrac",
         "response",
         "SiteID"]

In [27]:
cdata=modelDataI[usecols].copy()

In [28]:
len(cdata)

1550

In [29]:
cdata.dropna(inplace=True)
print(len(cdata))

1243
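
Before dropping incomplete rows it can be useful to see which columns contain the missing values. The following is a minimal sketch on a made-up table (the values are hypothetical); the same two calls could be applied to `cdata`.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few missing entries.
toy = pd.DataFrame({
    "kf": [0.2, np.nan, 0.3, 0.25],
    "dnbr1000": [0.4, 0.5, np.nan, 0.1],
    "response": [0, 1, 0, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only rows without any missing value (same effect as dropna(inplace=True)).
complete = toy.dropna()

print(missing_per_column)
print(len(toy), "->", len(complete))
```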


In [30]:
cdata.describe()

statistic,stormdur_h,stormaccum_mm,peak_i15_mmh,logarea,contributingarea_km2,prophm23,dnbr1000,kf,SedUn,SuscFrac,response,SiteID
count,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0,1243.0
mean,20.680016,40.589921,22.227411,-0.395631,1.073835,0.474339,0.336466,0.233133,0.220859,0.705959,0.254224,387.855994
std,17.962493,40.257607,19.61051,0.665499,1.53267,0.276969,0.189288,0.474098,0.395089,0.228594,0.435599,198.424006
min,0.0,1.33161,1.474149,-1.696804,0.0201,0.0,0.007158,0.0,0.0,0.0,0.0,0.0
25%,3.580118,10.47627,10.093766,-0.92947,0.117633,0.224443,0.189757,0.150455,0.0,0.540533,0.0,211.5
50%,16.886956,26.157526,16.498849,-0.328886,0.468936,0.51899,0.308938,0.237299,0.0,0.74562,0.0,402.0
75%,30.296227,58.09701,26.34165,0.13842,1.375373,0.690888,0.456611,0.244665,0.146225,0.907067,1.0,561.0
max,67.570797,238.504743,122.774776,0.896973,7.888105,0.989526,0.997439,11.360418,1.0,1.0,1.0,715.0


In [40]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Data rows are split by SiteID, such that observations made at the same site (during different storms) are never assigned to both the test and the training set.
The call to `unique()` is therefore essential: the split is applied to the list of distinct sites rather than to individual rows, otherwise a split of train and test data by site is not guaranteed.

In [32]:
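# Split the unique site IDs (not the rows), so each site ends up in exactly one set
trainsites, testsites = train_test_split(cdata["SiteID"].unique(), test_size=0.20, shuffle=True, random_state=1)

# Boolean row masks: True where the row's site belongs to the respective set
trainmask=cdata["SiteID"].isin(trainsites)
testmask=cdata["SiteID"].isin(testsites)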
trainX=cdata[trainmask].drop(columns=["response", "SiteID"])
trainY=cdata[trainmask]["response"]

testX=cdata[testmask].drop(columns=["response", "SiteID"])
testY=cdata[testmask]["response"]
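
A split by site can also be expressed with scikit-learn's `GroupShuffleSplit`, which takes the group label of each row directly. The sketch below uses a made-up table (10 hypothetical sites with 3 observations each). It is an alternative formulation, not the method used in this notebook, and it is not expected to reproduce the exact site assignment obtained above even with the same `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 sites, 3 storm observations per site.
toy = pd.DataFrame({
    "SiteID": np.repeat(np.arange(10), 3),
    "peak_i15_mmh": np.linspace(5.0, 50.0, 30),
    "response": np.tile([0, 1, 0], 10),
})

# One shuffled split in which roughly 20% of the sites go to the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=1)
train_idx, test_idx = next(gss.split(toy, groups=toy["SiteID"]))

train_sites = set(toy.iloc[train_idx]["SiteID"])
test_sites = set(toy.iloc[test_idx]["SiteID"])

# No site appears on both sides of the split.
print(train_sites & test_sites)
```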

Double-check that the train and test sites do not overlap in terms of SiteID (the intersection should be empty):

In [33]:
set(trainsites) & set(testsites)

set()

Save the split to disk, so that all ML models in subsequent notebooks use the same train-test split.

In [39]:
import pickle
# The context manager makes sure the file is flushed and closed after writing
with open("staley16+addtl_feats_split.pkl", "wb") as fh:
    pickle.dump([trainX, trainY, testX, testY], fh)
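
A later notebook would read the split back with `pickle.load`, unpacking the list in the order in which it was saved. The round trip below is a sketch only: it uses tiny stand-in frames and a temporary file (`split_demo.pkl`, a hypothetical name) instead of the real data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-ins for trainX, trainY, testX, testY.
trainX = pd.DataFrame({"kf": [0.1, 0.2], "SedUn": [0.0, 1.0]})
trainY = pd.Series([0, 1], name="response")
testX = pd.DataFrame({"kf": [0.3], "SedUn": [0.5]})
testY = pd.Series([1], name="response")

# Write to a temporary location so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "split_demo.pkl")
with open(path, "wb") as fh:
    pickle.dump([trainX, trainY, testX, testY], fh)

# Load in the same order as the list that was saved.
with open(path, "rb") as fh:
    trainX2, trainY2, testX2, testY2 = pickle.load(fh)

print(trainX2.equals(trainX), testY2.equals(testY))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.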

Fit the feature scaler on the training data only, and save it to disk. This makes sure that all models use the same feature scaling.

In [42]:
ssc=StandardScaler()
# Only the fitted scaler is needed here; the transformed array is not used in this notebook
ssc.fit(trainX)
with open("feature_scaler.pkl", "wb") as fh:
    pickle.dump(ssc, fh)
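
The saved scaler is meant to be applied with `transform` to both the training and the test features, so that the test set is scaled with the training-set mean and standard deviation. The following sketch illustrates this on randomly generated stand-in data; the column names `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for trainX and testX.
rng = np.random.default_rng(0)
trainX = pd.DataFrame(rng.normal(5.0, 2.0, size=(100, 2)), columns=["a", "b"])
testX = pd.DataFrame(rng.normal(5.0, 2.0, size=(20, 2)), columns=["a", "b"])

# Fit on the training features only.
ssc = StandardScaler().fit(trainX)

# Apply the same scaling to both sets.
trainXs = ssc.transform(trainX)
testXs = ssc.transform(testX)

# Training columns have zero mean and unit variance after scaling;
# the test columns are only approximately standardized.
print(trainXs.mean(axis=0), trainXs.std(axis=0))
print(testXs.mean(axis=0), testXs.std(axis=0))
```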